One of the several winners of this year’s Knight Foundation media innovation contest that offers great potential for scholars and students is a service and set of open-source tools called DocumentCloud. Currently in beta, it focuses on three primary feature areas designed to help journalists or anyone “reporting on” primary sources: search and analysis, highlighting and annotation, and document sharing.
In most of the examples I have seen of DocumentCloud being used, the documents in question are scanned texts: a politician’s scandalous memo, an oil company’s outline for destroying the planet, a scathing letter from an actress to her studio, and so on. Once you have an account you can upload your stash and take advantage of their analytic tools on documents that were either OCRed before uploading or are OCRed afterwards within DocumentCloud using their installation of the open source tool Tesseract. You can, of course, search across the full content and metadata of documents in your collection, but the service also attempts to identify people, organizations, and places. It also plots any date references in the documents on a timeline which anchors any search (though the usefulness of this, I found, depends very much on what kind of documents contain these dates). Finally, like a number of similar document hosting sites, DocumentCloud is establishing itself as the home of a searchable catalog of collections that have been made public by its member projects, and it offers an API for embedding documents into a website.
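To make the timeline idea a little more concrete, here is a minimal Python sketch of how date references might be pulled out of OCRed text and sorted chronologically. It is only an illustration under my own assumptions, not DocumentCloud's actual method: the function names `extract_dates` and `build_timeline` are hypothetical, and the pattern only recognizes dates written as "Month D, YYYY".

```python
import re
from datetime import date

# Hypothetical illustration only; this is not DocumentCloud's code.
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

# Assumption: dates appear in the form "March 3, 1947".
DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b",
    re.IGNORECASE,
)


def extract_dates(text):
    """Return every 'Month D, YYYY' reference in text, in order of appearance."""
    found = []
    for month, day, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), MONTHS[month.lower()], int(day)))
        except ValueError:
            # Skip impossible dates such as "February 30, 1950",
            # which OCR errors can easily produce.
            continue
    return found


def build_timeline(documents):
    """Take a {title: text} mapping and return (date, title) pairs, oldest first."""
    entries = []
    for title, text in documents.items():
        for found_date in extract_dates(text):
            entries.append((found_date, title))
    return sorted(entries)


if __name__ == "__main__":
    sample = {
        "memo": "Sent on May 1, 1950.",
        "letter": "Received January 15, 1949.",
    }
    for found_date, title in build_timeline(sample):
        print(found_date.isoformat(), title)
```

Even this toy version shows why the usefulness of a timeline depends on the documents: a letter that quotes many older dates will scatter across the timeline, while a dated memo will sit neatly at a single point.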
The most impressive aspect of their service, however, is their clean document browser and annotation features. See, for example, some of the ways this works in their “featured reporting” examples from various media sites. Documents can display top-level descriptions, comments, a tab of notes, and a tab showing the OCRed text. Most visually appealing and useful, however, are the wonderful highlighted boxes over areas of text on which the journalist wishes to focus. Commentary can be added directly below the text, and it becomes visible when those highlighted sections are in focus. There is also a link to download a PDF version of the document for offline reading. This document browser interface is an implementation of the NY Times Document Viewer, which has been out for over a year as open source and can be deployed without going through DocumentCloud. It is being used independently by websites such as OpenGovernment, but it requires significant preparatory work on the hosted documents, work which DocumentCloud otherwise does automatically through another of its open source scripts, DocSplit. The whole workflow of uploading, searching, browsing, hosting, and sharing is provided by the service for those who don’t have the technical infrastructure to reassemble the various components on their own servers.
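For readers curious about what a highlighted annotation amounts to as data, here is a small Python sketch. It is my own guess at a plausible shape, not the Document Viewer's real format: the `Annotation` class, its field names, and the `annotation_at` helper are all hypothetical, and I assume each highlight is a rectangle in pixel coordinates on a numbered page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """A highlighted box on one page plus the commentary attached to it.

    Hypothetical structure for illustration; not the viewer's actual schema.
    """
    page: int      # 1-based page number
    left: int      # box edges in pixels, measured from the page's top-left corner
    top: int
    right: int
    bottom: int
    title: str
    note: str

    def contains(self, x: int, y: int) -> bool:
        """Return True if the point (x, y) falls inside the highlighted box."""
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def annotation_at(annotations: List[Annotation], page: int, x: int, y: int) -> Optional[Annotation]:
    """Return the first annotation on the given page whose box contains (x, y)."""
    for annotation in annotations:
        if annotation.page == page and annotation.contains(x, y):
            return annotation
    return None


if __name__ == "__main__":
    notes = [
        Annotation(page=2, left=100, top=200, right=400, bottom=260,
                   title="Key passage", note="The memo first mentions the plan here."),
    ]
    hit = annotation_at(notes, page=2, x=150, y=220)
    print(hit.title if hit else "No annotation at that point")
```

The point of the sketch is simply that the commentary travels with a location in the document, which is what lets the viewer reveal a note when its highlighted section comes into focus.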
Let me suggest why DocumentCloud offers a useful combination of features from the perspective of a historian. Along with other contributors at Frog in a Well, I often use my postings to comment upon passages in historical documents, usually PDFs I have saved from a microfilm machine, downloaded from a database, or obtained directly in the archive. In the case of US archival documents, these are not protected by copyright and I may offer the full documents for download. In most of these cases, the document viewer would have been an excellent addition to the posts, and the same would be true for primary documents referred to in online scholarly journals, digital dissertations, or web-based monographs. DocumentCloud, or an independent installation of the document viewer, may avoid some of the restrictions and risks of going with some of the commercial online document viewers such as Scribd.
The target audience for DocumentCloud is very clearly journalists for the time being, but I hope they will consider the potential of the service for a whole range of fields in the academic world. I asked their lead developer Jeremy Ashkenas whether they had thought about opening up the service to the world of scholars. They are willing to consider projects on a case-by-case basis. While their mission statement and grant support were clearly directed at the world of journalism and media, he said that they had definitely been thinking and talking about the broader potential of a service which addresses very general problems of working with documents online.
Have you had good experiences with other document viewer services online? If not, have you approached the problem of sharing and browsing online primary source documents in other ways?