July 29, 2014

Counting on Google Books


Michael Morgenstern for The Chronicle Review


Humanities scholars may someday count as a watershed the paper that appeared on Wednesday in Science, titled "Quantitative Analysis of Culture Using Millions of Digitized Books." But they'll have certain things to get past before they can appreciate that.

The paper describes some examples of quantitative analysis performed on what is by far the largest corpus ever assembled for humanities and social-science research. Culled from Google Books, it contains more than five million books published between 1800 and 2000 (at a rough estimate, 4 percent of all books ever published), of which two-thirds are in English and the others distributed among Chinese, French, German, Hebrew, Russian, and Spanish. The English corpus alone contains some 360 billion words, a size that permits analyses on a scale that isn't possible with collections like the Corpus of Historical American English, at Brigham Young University, which tops out at a mere 410 million words.

Not everyone will find these statistics bracing. A lot of scholars have reservations about studying literature en bloc, mindful of Seneca's admonition that distrahit animum librorum multitudo, or loosely, "Too many books spoil the prof." And they're apprehensive about the prospect of turning literary scholarship into an engineering problem.

The framing of the Science paper will aggravate those qualms. The authors of the paper claim that the quantitative data gathered from the corpus are the bones that can be assembled into "the skeleton of a new science." They call the new field "culturomics," defining it as "the application of high-throughput data collection and analysis to the study of human culture," which "extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities."

That's culturomics with a long o, with the implication that the object of study is the "culturome," presumably the mass of structured information that characterizes a culture. The point of comparison might be biological models of evolution or simply the idea that culture, like the genome, can be "cracked" via massive distributed (that is, "high-throughput") processing.

The inspiration for the Science paper came from two young Harvard researchers, Jean-Baptiste Michel and Erez Lieberman-Aiden, with backgrounds in genomics and mathematics. And almost all of the more than 12 authors of the paper (11 individuals plus "the Google Books team") are mathematicians, scientists, or engineers—some at Google, the rest mostly at Harvard or the Massachusetts Institute of Technology. The very fact that the paper was submitted to Science suggests that the authors are more interested in winning the ear of their scientific colleagues than in reaching the scholars who will be the primary beneficiaries of this new approach. Having glimpsed a new domain from a peak in Darien, the authors' first thought was to call home.

It's hard to imagine anything likelier to raise the hackles of humanists or cultural historians, who aren't disposed to think of their fields on the model, say, of pre-Mendelian biology. But there's nothing in the research that compels this understanding of "culturomics." Indeed, a close reading of the paper clarifies the limits of quantitative corpus investigations as well as the power. Where we're going we'll still need readers.

Humanists and social scientists have been doing quantitative corpus research for a long time, in fields like linguistics, political science, and intellectual history. But the Google project does initiate a new phase. Of course there is the jump in scale, not just in the size of the corpus but also in the staggering processing power that the researchers can throw at it. And what it takes Google's server farms to do right now, anyone with a home computer will be able to do tomorrow. Like most everything else, a terabyte isn't what it used to be. You can already fit everything that's ever been written in the glove compartment of your Hyundai; within a few years it will fit in your eyeglass frames.

Scholars can't download the entire corpus right now, but the impediments are legal and commercial rather than technological. (Google could make available a corpus of all the public-domain works published through 1922 without raising any copyright issues, but it has decided not to do that.) In the meantime, scholars have access to the corpus via the Web sites Ngrams.GoogleLabs.com and culturomics.org. At this point, they are confined to examining the "trajectories" of individual words or strings up to five words long ("we don't need no badges"), in the form of a graph that shows the relative frequency of a word over some period between 1800 and 2000, or that compares the frequency of use of several words. (Scholars can also download a visualization tool and the full set of trajectories, but not the texts they're drawn from.)
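To make the notion of a "trajectory" concrete: the relative frequency of a phrase in a given year is simply its count divided by the total number of words printed that year. What follows is a minimal sketch in Python, not the researchers' own code; the function name and all of the counts are invented for illustration and are not drawn from the Google corpus.

```python
from typing import Dict


def relative_frequency(phrase_counts: Dict[int, int],
                       total_counts: Dict[int, int]) -> Dict[int, float]:
    """Return phrase count / total word count for each year with any text."""
    trajectory = {}
    for year in sorted(phrase_counts):
        total = total_counts.get(year, 0)
        if total > 0:  # skip years for which the corpus has no words at all
            trajectory[year] = phrase_counts[year] / total
    return trajectory


# Invented numbers, for illustration only.
phrase_counts = {1900: 20, 1950: 150, 2000: 90}
total_counts = {1900: 1_000_000, 1950: 3_000_000, 2000: 9_000_000}

trajectory = relative_frequency(phrase_counts, total_counts)
for year, freq in trajectory.items():
    print(year, f"{freq:.2e}")
```

In this toy data the raw count is higher in 2000 than in 1900, yet the relative frequency is lower, because the number of words printed grew faster. That is the reason for plotting relative rather than absolute frequency.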

That leaves out a lot, compared with what you can do with other corpora. As of now, for example, you can't ask for a list of the words that follow the adjective "traditional" for each decade from 1900 to 2000 in order of descending frequency, or restrict a search for "bronzino" to paragraphs that contain "fish" and don't contain "painting." Some of those capabilities will probably be available soon, though users won't be able to replicate many of the computationally heavy-duty exercises that the researchers report in the paper, and linguists won't really be happy until they can download the whole corpus and have their way with it.
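For readers who wonder what the first of those queries involves, here is a rough sketch of how it might run against a corpus whose full texts are in hand. Everything in it is an assumption made for illustration: the function name, the crude tokenizer, and the three invented "documents." It also matches the bare word "traditional" rather than the adjective as such, since telling the two apart would take a corpus tagged for part of speech.

```python
import re
from collections import Counter, defaultdict


def following_words_by_decade(docs, target):
    """Map each decade to a Counter of the words that immediately follow `target`."""
    by_decade = defaultdict(Counter)
    for year, text in docs:
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        decade = (year // 10) * 10
        for current, nxt in zip(tokens, tokens[1:]):
            if current == target:
                by_decade[decade][nxt] += 1
    return by_decade


# Invented (year, text) pairs standing in for dated books.
docs = [
    (1903, "Traditional values and traditional methods prevailed."),
    (1907, "They kept to traditional methods."),
    (1985, "Traditional media faced traditional values and traditional media critics."),
]

result = following_words_by_decade(docs, "traditional")
for decade in sorted(result):
    # most_common() lists the following words in descending order of frequency.
    print(decade, result[decade].most_common())
```

The point of the sketch is that the query needs the running text, or at least every two-word sequence with its date, which is why it can't be answered from a graphing interface alone.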

And while the Harvard researchers have purged the research corpus of a large proportion of the metadata errors that have plagued Google Books, there are still a fair number of misdated works, and there's no way to restrict a query by genre or topic. You can ask the system to plot the trajectory of "dear reader" in books published in Britain during the 19th century, but you can't limit the search to novels.

But in the end, the most important consequence of the Science paper, and of allowing public access to the data, is that it puts "culturomics" into conversational play. Whatever misgivings scholars may have about the larger enterprise, the data will be a lot of fun to play around with. And for some—especially students, I imagine—it will be a kind of gateway drug that leads to more-serious involvement in quantitative research.

Short of reassigning humanities and social-science departments to the engineering school, how might all of this change disciplines? The exercises in the Science paper are meant to suggest the range of possibilities. A couple of these fit neatly into continuing scholarship. In one exercise, the researchers computed the rates at which irregular English verbs became regular over the past two centuries. The patterns that emerged will be grist for theories of language evolution. But quantitative methods are already widely accepted in the field, and in any case, morphological change is a "cultural" phenomenon only by courtesy of the dean of humanities. Another ingenious study uses quantitative methods to detect the suppression of the names of artists and intellectuals in books published in Nazi Germany, the Stalinist Soviet Union, and contemporary China. Those results could be published tomorrow in a history journal, but precisely because they're consistent with other kinds of data that historians are already using, they won't shift any disciplinary paradigms.

The more interesting exercises are also, in a way, the most problematic. In one exercise, the authors investigate the evolution of fame, as measured by the relative frequency of mentions of people's names. They began with the 740,000 people with entries in Wikipedia and sorted them by birth date, picking the 50 most frequently mentioned names from each birth year (so that the 1882 cohort contained Felix Frankfurter and Virginia Woolf, and so on). Next they plotted the median frequency of mention for each cohort over time and looked for historical tendencies. It turns out that people become famous more quickly and reach a greater maximum fame now than they did 100 years ago, but that their fame dies out more rapidly. You can take that result as a quantitative demonstration of the rise of what Leo Braudy called "disposable fame" in his book The Frenzy of Renown, which the authors cite. And the technique could be a powerful source of data for the burgeoning field of celebrity studies, as it's designated in the title of a new journal from Routledge.
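The cohort procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, ranking people by their summed frequency is my guess at what "most frequently mentioned" means, and the four "people" and their figures (think of them as mentions per million words) are invented.

```python
from statistics import median


def cohort_median_trajectory(freq_by_person, top_n=50):
    """
    freq_by_person maps a name to {year: relative frequency of mentions}.
    Keep the top_n people by total frequency (an assumed ranking rule),
    then take the median frequency across that cohort for each year.
    """
    ranked = sorted(freq_by_person,
                    key=lambda name: sum(freq_by_person[name].values()),
                    reverse=True)
    cohort = ranked[:top_n]
    years = sorted({year for name in cohort for year in freq_by_person[name]})
    return {year: median(freq_by_person[name].get(year, 0.0) for name in cohort)
            for year in years}


# An invented birth-year cohort of four people; only the top three are kept.
cohort_1882 = {
    "Person A": {1920: 1.0, 1950: 4.0, 1980: 2.0},
    "Person B": {1920: 0.5, 1950: 3.0, 1980: 1.0},
    "Person C": {1920: 0.0, 1950: 2.0, 1980: 3.0},
    "Person D": {1920: 0.1, 1950: 0.1, 1980: 0.1},
}

trajectory = cohort_median_trajectory(cohort_1882, top_n=3)
print(trajectory)  # {1920: 0.5, 1950: 3.0, 1980: 2.0}
```

Comparing such median curves for cohorts born in different years is what lets one ask whether fame now arrives sooner, peaks higher, and fades faster. Note that the median, unlike the mean, isn't dragged upward by a single towering name.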

But the method isn't up to distinguishing among the varieties of fame and eminence that Braudy and others have carved out. And there are obvious limits to equating fame with mere frequency of mention. At one point, for example, the authors observe that "'Galileo', 'Darwin', and 'Einstein' may be well-known scientists, but 'Freud' is more deeply ingrained in our collective subconscious." But it defies belief that Freud is vastly better known than Darwin among the authors of books in a corpus that was drawn from the collections of research libraries. We simply mention Freud more often. Maybe that's because we refer to Darwin only when we're talking about evolution, while we're apt to bring up Freud when we're talking about ourselves. Or maybe there's some other explanation. But the data don't wear their cultural significance on their sleeves; they need cultural historians to speak for them.

I have a friend, a gifted amateur musician and computer scientist, who was involved in electronic music in its early days. Inevitably, within a few years, the field was taken over by composers. That happened partly because new interfaces made the technology more accessible, but also because a command of the subject matter always trumps mere technical expertise. As my friend put it, "It's a lot easier to turn an artist into a geek than to turn a geek into an artist."

In the same way, we'll know that the program of quantitative corpus research is successful when the engineers have stepped back as the techniques are absorbed into the academy, sometimes as a method, sometimes just as a background of operating assumptions. That was the fate of 19th-century philology—the study of "La Vie des Mots" (The Life of Words) in the title of a book of the period by Arsène Darmesteter. Quantitative corpus studies are destined to play the same role, though they imply a different understanding of what the life of words is all about. We really don't even need a name like "culturomics," or any new name at all: this is just e-philology. (Or "the newer philology," since "the new philology" is taken.)

One salutary effect of looking at word trajectories is that they dispel some of the unreflective philological assumptions that color the way humanists and social scientists tend to think about words. Take the obsession with origins, in particular the genealogical model of vocabulary change that's implicit in the structure of major dictionaries. Scholars speak of new words or word senses "entering the language" at a specific date, with the implication that they bring new concepts along with them. But decades or even centuries can pass before a "new" word gains a purchase in the language. "Propaganda" had something like its modern sense by Carlyle's time, in the 19th century, but it was a recondite item; only with World War I did it enter "into the vocabulary of peasants and ditch diggers," as one contemporary put it. Between 1914 and 1950, its frequency in the print news media increased tenfold, only to fall back significantly by 2000. It isn't that people have lost interest in the thing the word denotes, as you might conclude from the falling frequency of "slide rule" or "Dinah Shore." But we think of political discourse differently now (the decline of "propaganda" coincides with the rise of "Orwellian," as it happens).

Then, too, comparing word trajectories enables you to pin down the emergence of new vocabularies that are the harbingers of cultural regime change—the signs, as Quentin Skinner put it, that "society has entered into the self-conscious possession of a new concept." The Oxford English Dictionary documents the first appearance of "lifestyle" in 1915, but it wasn't until the late 1960s that the word became commonplace (in 1967 it appeared in the Chicago Tribune just 29 times; by 1972 the figure was 1,571). That coincided with a sharp increase in the use of "demographic," which first appeared in 1882 but became 50 times as frequent from the 1950s to the 1970s, spinning off the noun "demographics" in the process—all part of an emerging vocabulary (with the appearance of terms like "upscale" and "trendy," and of new senses for "blue collar" and "preppie") that reflected the consumerization of class. In the age before corpora, there was no way to get a handle on this phenomenon. (It's a fair bet that Raymond Williams's influential 1976 book, Keywords: A Vocabulary of Culture and Society, would have looked very different if he had had access to the Google Books corpus and not just to the OED.)

The most obvious—though not the only—application of these techniques is in analyzing broad swaths of cultural and literary production, what Franco Moretti, of Stanford, calls "distant reading," which examines hundreds or even thousands of texts at a swoop. But there's nothing in the Science paper that threatens the importance of close reading, New Historicist anecdotalism, or any of the other more ruminative forms of scholarship. On the contrary, there needn't even be a sharp division between the two approaches. These new results are very often just intriguing quantitative nuggets that call out for narrative explication. Scientists like to say that "data" is not the plural of "anecdote," but sometimes "anecdotes" can be the plural of "data." And, like other anecdotes, they don't compel any single interpretation, and sometimes even bring us back to the texts they were abstracted from.

Consider an interesting study of the titles of 19th-century books by the historians Dan Cohen and Fred Gibbs, of George Mason University, who also worked with the Google Books corpus. What does it signify that the words "hope" and "happiness" became less frequent in book titles in the second half of that century? To Cohen and Gibbs, it suggests that there was an undercurrent of depression during that period. But a reader of Schopenhauer might conclude that all those earlier mentions of happiness were the unmistakable signs of misery and abjection. To prove the case one way or the other, one might be driven to, well, read some of the books.

Some people worry that the effect of these quantitative studies will be to trivialize scholarship. In a news article that appeared in The Chronicle last spring about Moretti's research, Katie Trumpener, a professor of comparative literature at Yale, voiced her concerns about the quantitative turn in literary studies. It's all well and good when it's done by an original thinker like Moretti, she said, but what happens when it's taken up by his "dullard" descendants? "If the whole field did that, that would be a disaster," with everyone producing insignificant numbers and "jumped-up claims about what they mean."

It's unlikely that "the whole field" of literary studies, or any other field, will take up these methods, though the data will probably figure in the literature the way observations about origins and etymology do now. I think Trumpener is quite right to predict that second-rate scholars will use the Google Books corpus to churn out gigabytes of uninformative graphs and insignificant conclusions. But it isn't as if those scholars would be doing more valuable work if they were approaching literature from some other point of view.

This should reassure humanists about the immutably nonscientific status of their fields. Theories of what makes science science come and go, but one constant is that it proceeds by the aggregation of increments great and small, so that even the dullards have something to contribute. As William Whewell, who coined the word "scientist," put it, "Nothing which was done was useless or unessential." Humanists produce reams of work that is precisely that: useless because it's merely adequate. And the humanities resist the standardizations of method that make possible the structured collaborations of science, with the inevitable loss of individual voice. Whatever precedents yesterday's article in Science may establish for the humanities, the 12-author paper won't be one of them.

Geoffrey Nunberg, a linguist, is an adjunct full professor in the School of Information at the University of California at Berkeley.

Comments

1. jweinheimer - December 17, 2010 at 07:12 am

I just discovered something in this tool that you may not want to publish, but it does shed some light on matters and is funny besides. I searched for a not very polite word over the centuries: http://ngrams.googlelabs.com/graph?content=fuck&year_start=1500&year_end=2008&corpus=0&smoothing=3 and was amazed that this word appeared in the book "The Act of Tonnage and Poundage, and Rates of Merchandize" from 1702.

When I opened it, I found http://books.google.com/books?id=Zjk7AAAAcAAJ&pg=PA201&dq=%22fuck%22&hl=en&ei=ilALTbPpIo72sgb8h63jDA&sa=X&oi=book_result&ct=result&resnum=3&ved=0CC8Q6AEwAjgK#v=onepage&q=%22fuck%22&f=false in the sentence:
"Every Merchant making an Entry of Goods, either Inwards or Outwards shall be dispatched in such Order as he cometh;..." and it misread the old spelling of "such".

2. claytonburns - December 17, 2010 at 02:02 pm

Geoffrey Nunberg: Thanks for this. English at 360 billion words would allow sensitive searches for rare patterns. For example, ask people to compose a sentence with "can blurred" (you can't put anything between the words, except that you can make "can" negative, "can't"; you can't start the sentence with "can"; "can" is a modal). Create a sentence and tell me what the sentence type is. Curiously, there is always a struggle.

"Why can't blurred images be used in court?" "When can blurred images be used in court?" Could you tell me how many sentences following this pattern there are in the 360 billion words of English? What is the trajectory?

A good advance would be tagged corpora. If you are perched in your little Quebec of a pulpit, then Quebec is a metaphoric global. How many would you expect to find in 100 billion words?

The issue of irregular verbs is compelling. Especially since they are the base of vowel gradation in poetry ("After Apple-Picking" by Robert Frost for "e" gradation). But since this extremely powerful set of sound symbolic patterns in poetry (and sometimes in fiction, as with "a" gradation in "The Road" by Cormac McCarthy--"...and returned again as trackless and as unremarked as the path of any nameless sisterworld in the ancient dark beyond") has never been decisively focused, the issue is not the data, but the ability to frame the interpretation. Or the ability to shift perceptual and cognitive frames so as to see the data behind the data.

A limitation of another sea of data is that we should have had long ago the Internet merged analytical indexes for all non-fiction books. This would be a simple legal matter: If you were going to publish a non-fiction book, you would have to submit a high quality comprehensive analytical index to the Library of Congress before the book appeared in the bookstores and libraries. The indexes would be merged and linked back to the books. Now, if I want to find "zitterbewegung" in "The Road to Reality" index, I will see that it is not there. Indexing is informal. We could have made "refined data" gains with the Internet indexes, but it is as with tagged corpora: we just passed up the opportunity.

3. corpusprof - December 17, 2010 at 05:32 pm

http://corpus.byu.edu/coha
Corpus of Historical American English

-- 400 million words, 1810s-2000s.
-- Along with accurate frequency of words and phrases by decade and year, also allows for many types of searches that Google Books / culturomics.org can't:
* changes in meaning (via collocates; "nearby words")
* changes in word forms (via wildcard searches)
* grammatical changes (because corpus is "tagged" for part of speech)
* show all words that are more common in one set of decades than in another
* integrate synonyms and customized word lists into queries
* etc etc etc
-- Funded by the US National Endowment for the Humanities (NEH), 2009-2011.

Take a look at the "Compare to Google Books / Archives" link off the first page.

4. gnunberg - December 17, 2010 at 10:56 pm

Geoff Nunberg: I should say that I had "the much better structured..." before the COHA, but the phrase was lost in the editing process. As I noted, you can do a lot of things with this and other corpora that you can't do with the Google corpus right now.

5. syy89 - December 20, 2010 at 06:19 pm

Interesting!

6. dlfrye - December 20, 2010 at 08:20 pm

Very interesting! I have been playing with ngrams since I read about it over the weekend, and I very quickly discovered a couple of things worth mentioning.
1) The dates are not 1800-2000, but rather 1500-2008. However, as hinted at in the first comment above, data before 1800 is pretty sketchy; the number of pre-1800 works in the Google corpus is small, and the number of errors is large (look up "tofu" for one of many quick examples), yielding lots of noise.
2) The search is case-sensitive (unlike a regular Google search), so that a search on "Dear Reader,Dear reader,dear Reader,dear reader" yields four distinct curves. There is at present no way to do a non-case-sensitive search.
3) A minor point, but: according to the way "ngrams" counts words, the phrase "we don't need no badges" counts as 6 words, not five -- because it counts "don" and "t" as separate words. FWIW! And anyway, the phrase should be "we don't need no stinking badges," doncha know.

7. dlfrye - December 20, 2010 at 08:37 pm

Following up on the suggestion of "corpusprof" above to compare the Corpus of Historical American English (which I've used before) and Google Books, I checked the phrase "global warming" in both. I found some 61 citations of the phrase in Google Books dated before 1930, while COHA had none at all. Aha! The only problem is that *every single one* of those early Google Books sightings of "global warming" is a false positive: they all come from post-2000 publications (scientific journals, mainly) that were misdated to 1929, 1906, 1869, etc. (mainly because the default date of many journals is the first date of publication for the entire series, not the particular issue). Oh well!

On the plus side for Google Books: they draw prettier graphs. Maybe that's something COHA could look into.

8. natalie_binder - December 21, 2010 at 06:49 pm

Over the past week, I have been blogging about similar problems found in the data and execution of Ngrams. The problems are twofold: the first serious issue is poor OCR and metadata (see "Google's word engine isn't ready for prime time" http://thebinderblog.com/2010/12/17/googles-word-engine-isnt-ready-for-prime-time/). The second is what I call "thin description"--the data from Ngrams is so flat that I'm not sure it can accurately answer questions ("The problem with Google's thin description" http://thebinderblog.com/2010/12/18/google-ngrams-thin-description/). The relationship between word-frequency and culture is neither articulated nor proven in the Science paper.

That said, I do see a lot of potential in ngrams--or at least the *idea* of ngrams. Over the next few years it will be really important for academics to take a leadership role in projects like these. I also see opportunities for the public to help out, perhaps through some form of social gaming ("Fixing Google's word engine" http://thebinderblog.com/2010/12/21/how-to-fix-googles-word-engine/).

9. metaglossia - December 24, 2010 at 01:21 pm

Ngrams definitely bring out another very useful dimension of corpus analysis. The results are impressive.
We do hope Google will offer the possibility of exploiting corpora from each of the over 6000 languages of the world rather than just the most widespread ones.

10. ericleasemorgan - December 29, 2010 at 08:48 am


To this I have two comments. First, yes, I agree. The "whole field" will not take up the digital humanities computing techniques, but that does not mean they are not useful. There are many different and varied ways to unearth and create new knowledge and understanding. The use of technology is one of them. The epistemological methods of scientists and the epistemological methods of humanists complement rather than conflict with one another. People who think otherwise have a C.P. Snow Two Cultures problem. Second, given the reassurance stated in the article, such a problem truly exists in academia. What is really needed is a greater amount of holistic thinking. --Eric Lease Morgan, Librarian

11. tfreeman1951 - January 07, 2011 at 10:33 am

According to Wikipedia, "We don't need no stinking badges!" is a frequent misquote. The actual lines from the movie are: "Badges? We ain't got no badges. We don't need no badges! I don't have to show you any stinkin' badges!"

12. unclefishbits - January 12, 2011 at 02:13 pm

There seem to be a lot of mistakes. Search "felching". Every instance of that horrific word comes up when the scanned word should have been "fetching".

How many of these errors are embedded throughout the project? I am sure Kant and Voltaire would have hated to know people of the 21st century thought they were into that. =)
