The Future of Primary Texts Online is Almost Here

[Today brings the exciting news from the ECCO-TCP project that 2000-odd 18th-century texts, formerly in databases, are now publicly accessible. Explaining why we should care is Alan Bilansky. Alan has a PhD in Rhetoric and Democracy from Penn State. He works as a technology consultant for faculty at the University of Illinois at Urbana-Champaign, where he's also a student in the Graduate School of Library and Information Science.--@jbj]

A humanities scholar who could travel back in time to 1990 with a laptop (and a very long Ethernet cable) in tow would likely have an exalted status remarkably like Twain’s yankee in King Arthur’s court.  It’s no exaggeration to say that life in the humanities has been radically transformed over the last decade or so as a result of the release of databases of primary texts, including Early English Books Online (EEBO), Eighteenth-Century Collections Online (ECCO), and Early American Imprints (Evans).

When these collections made the leap from microfilm to digital, scholars who previously spent hours at microfilm readers found they had quicker and easier access.  Even more importantly, they could search in new ways, especially in ECCO and Evans, where limited full-text searching was available.

That’s not to say the databases are perfect.  One problem has been access:  scholars at institutions whose libraries can’t afford the subscription fees and unaffiliated scholars have had to choose between finding ways to access the databases through other institutions or sticking with their old-fashioned access.  Moreover, anyone who’s tried to use full-text searching in ECCO or Evans knows that results will be uneven.

An even braver, even newer world is on the way.  The Text Creation Partnership, a consortium of libraries, has partnered with publishers to improve the usability of these databases and broaden access.  And today, 2,231 texts published in the 18th century became freely available to all (from ECCO-TCP). (You can access them today at 18thconnect.org!) In the age of Google, this might sound tiny, but I’m here to tell you this is a big deal, both in itself and in what is to follow.

The TCP began with the Universities of Michigan and Oxford.  It has allied with ProQuest, Gale, and Readex, the publishers of EEBO, ECCO, and Evans, respectively.  Still headquartered in Michigan and Oxford, the partnership now includes more than 150 libraries, whose contributions help to fund the production of texts.  This cooperative model for text creation will ultimately lead to better texts and better searching, and will move these resources into the public domain.  Here are three interesting aspects of the TCP project.

Accurately Keyed Text

There are four ways to move text from paper to the digital realm: create images, type in the text, use optical character recognition (OCR), or some combination of the three.  Many digitizing efforts have opted for OCR, mostly because it’s cheaper and faster.  That efficiency has a cost in the form of errors: sometimes annoying, sometimes amusing, but, worst of all, confusing to the researcher.

Here’s an example from a copy of Moby-Dick in Google Books:

And here is the transcript:

[Image: Google Books transcription of the Moby-Dick page]

There’s no need to read the transcript, of course, as you can read the images (unless your vision is impaired).  The trouble really comes when we start searching or handling the text in any way more complicated than simple reading.

Searching for “Chapter 3” in this text only brings up the heading for Chapter 90.  More generally, I would not try hard to search for structural data.  On the other hand, searching for running text will probably give better results.  Since it’s unlikely that an OCR error will produce a real word and not gibberish, you can expect to find only pages that really do contain the words you’re looking for–but not all of them.
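A toy sketch can make this concrete. The page texts below are invented for illustration (they are not the actual Google Books transcript), and `search` is a hypothetical helper that does a plain case-insensitive substring match, which is only an assumption about how a simple full-text search behaves:

```python
# Hypothetical page texts: "clean_pages" stands in for hand-keyed text,
# "ocr_pages" for the same pages with invented recognition errors.
clean_pages = {
    1: "CHAPTER 3. The Spouter-Inn.",
    2: "Entering that gable-ended Spouter-Inn, you found yourself in a wide entry.",
    3: "The whale was seen from the mast-head.",
}
ocr_pages = {
    1: "CHAPTEE 8. The Spouter-Inn.",   # heading garbled
    2: "Entering that gable-ended Spouter-Inn, you found yourself in a wide entry.",
    3: "The wliale was seen from the mast-head.",  # "wh" misread as "wli"
}


def search(pages, query):
    """Return the sorted page numbers whose text contains the query (case-insensitive)."""
    q = query.lower()
    return sorted(n for n, text in pages.items() if q in text.lower())


for query in ("chapter 3", "whale", "spouter-inn"):
    print(query, "| keyed:", search(clean_pages, query), "| OCR:", search(ocr_pages, query))
```

In this sketch every page the OCR search returns really does contain the query, but pages where an error fell inside the query are silently missed, which is the pattern described above.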

Comparing ECCO-TCP to Google Books is unfair to both, but instructive.  Given the scale of Google’s effort, OCR is the only option, and the heroic scale also makes Google the most famous user of OCR text.  The example of the chapter title is significant.  OCR errors are at the heart of the problems with Google Book Search’s metadata, as Geoff Nunberg demonstrates when he finds 527 hits for “internet” before 1950.  If basic information such as when a book was published can’t be trusted, the collection’s usefulness for literary historians is in question.  And ECCO has a further advantage over Google Books.  The images in ECCO were all unambiguously paired with trustworthy bibliographic data (the English Short Title Catalogue) long before they were digitized or the text was typed in.

Here’s an example of one of the TCP’s texts, this one by Mary Wollstonecraft, published in 1792, from ECCO:

I chose this sample because there are some long s’s, notoriously difficult for OCR.  Here’s how the TCP-keyed text looks on ECCO-TCP:

In addition to the text being accurately keyed, it’s also clear that the structural elements of the page are preserved and searchable.

Ownership of the Texts

Partner institutions will own these collections of data, bibliographically and textually accurate, and they’re a huge opportunity for text-mining.  One example of a scholar already at work is Ted Underwood at the University of Illinois at Urbana-Champaign.  He started with a collection of 2,189 texts from ECCO, which he could access through his university’s membership in the TCP.  He used those texts to build a topic model of eighteenth-century diction: in other words, to identify clusters of related terms that allowed him to follow cultural debates by tracing the frequency of those clusters.  Access to the TCP texts meant, first of all, that the texts themselves could be trusted, which he found difficult with any other source (optical character recognition creaks and groans when applied to books published before 1800).  It also meant that the metadata for the texts could be extracted automatically, permitting them to be grouped by author or characterized by genre.
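To give a rough sense of what "tracing the frequency of those clusters" can mean in practice, here is a minimal sketch. It is not Underwood's method and it is not a topic model: the cluster is hand-picked rather than inferred, and the two-text corpus, the `cluster` set, and the `cluster_frequency` helper are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-corpus of (publication year, text) pairs.
# Real input would be the keyed TCP texts with their catalogue dates.
corpus = [
    (1750, "Reason and virtue guide the mind."),
    (1792, "The rights of woman rest on reason, and reason alone."),
]

# Hypothetical cluster of related terms; a topic model would infer
# such clusters from the corpus instead of taking them as given.
cluster = {"reason", "virtue", "rights"}


def cluster_frequency(corpus, cluster):
    """Return {year: share of word tokens that belong to the cluster}."""
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token in cluster)
    # Skip years with no tokens to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


for year, share in cluster_frequency(corpus, cluster).items():
    print(year, round(share, 3))
```

Counts like these are only as good as the underlying text and dates: a misread word is a missed hit, and a wrong publication year puts the hits in the wrong place, which is why trustworthy transcriptions and metadata matter for this kind of work.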

Eventual Free Access for All (Some Now, More Soon)

In the past, people have begged, borrowed, and occasionally stolen access to these premium databases.  As part of the agreement, all these texts will be freely available, and this will tear down two kinds of walls.  First, all these texts will come out from behind paywalls, accessible by everyone.  Second, they will no longer be walled into separate collections, but will form one unified corpus of published texts written in English.

The images of pages will remain the property of the publishers of the subscription databases.  The texts produced are to be freely available.  The publishers have the exclusive right to sell the texts along with their images for five years after production of texts for that database ends.

The production of ECCO-TCP texts ended last year.  Gale, which according to the agreement could have continued to sell access until 2016, is waiving its exclusive rights.  Those texts are freely available to scholars, teachers, and students today.

18thConnect, an online scholarly community, now offers full-text searching of the 2,229 texts, downloading of plain text, and some online functionality for text-mining.  Soon, the TCP will also provide online access, and partner libraries can make the texts available in whatever ways they choose as well.  It’s also worth noting that 18thConnect is heading up an effort to correct the OCR transcripts of the works in ECCO that the TCP effort didn’t get to.

Evans-TCP completed its production of 4,976 texts last year, and these texts will enter the public domain in 2016.  The biggest of these projects, EEBO-TCP, is still under way.  In April, the TCP released 4,180 EEBO texts, with another 3,424 texts to be released soon.  This is a significant milestone, representing a real beginning of EEBO’s second phase, which still has another 36,000-odd texts to come.

Libraries can still join EEBO-TCP.  Partnership still offers the privileges of ownership, and the more partners there are, the more funding there will be and the sooner the project will be complete.  If your institution subscribes to EEBO, you’ll find the text versions added to the images already there, and you can search the full text of these in addition to the full records of everything in the collection.  If your institution is a TCP partner, then you can go to the EEBO-TCP website and access all the keyed texts (and if your library subscribes to EEBO as well, the corresponding images are pulled in).  If you’re not lucky enough to fit in any of the above groups, you can look at the EEBO-TCP “demo” site, see a small sample of the texts, and browse the records of the whole collection.  The same goes for Evans-TCP.

What new scholarly digitization projects are you most excited about, or looking forward to seeing? Let us know in comments!

Photo by Flickr user CCAC North Library / Creative Commons licensed

Edited at 16:38 on 4/25 to correct the scope of EEBO-TCP’s current production plans.
