• April 19, 2014

Scholars Elicit a 'Cultural Genome' From 5.2 Million Google-Digitized Books

The English language is going through a time of huge growth. Humanity is forgetting its history more rapidly each year. And celebrities are losing their fame faster than in the past.

Those are some of the findings in a paper published on Thursday in the journal Science by a Harvard-led team of researchers. The scholars quantified cultural trends by investigating the frequency with which words appeared over time in a database of about 5.2 million books, roughly 4 percent of all volumes ever published, according to Harvard's announcement.

The research team, headed by Jean-Baptiste Michel and Erez Lieberman Aiden, culled that digital "fossil record" from more than 15 million books digitized by Google and its university partners. Google is giving the public a glimpse of the researchers' data through an online interface that lets users key in words or phrases and plot how their usage has evolved. The paper's authors bill this as "the largest data release in the history of the humanities."

Scholars have explored quantitative approaches to the humanities for years. What's novel here is the volume of material. According to a Google spokeswoman, the data set of 5.2 million books includes both in- and out-of-copyright titles in several languages from 1500 to 2008. Its more than 500 billion words amount to a sequence of letters 1,000 times as long as the human genome. This "cultural genome" would stretch to the moon and back 10 times over if arranged in a straight line.

Chronicle of Higher Education

A Harvard-led team used books digitized by Google to analyze the occurrence of words since 1500. A graph shows the appearance of four words from 1965 to 2008: "fry" (in red), "bake" (blue), "grill" (green), and "roast" (yellow). Photograph by Google Books

"It radically transforms what you can look at," says Mr. Aiden, a junior fellow in Harvard's Society of Fellows and principal investigator of the Laboratory-at-Large, part of Harvard's School of Engineering and Applied Sciences. Mr. Aiden and Mr. Michel, a postdoctoral researcher in Harvard's psychology department and its Program for Evolutionary Dynamics, call their approach "culturomics."

The method's cross-disciplinary potential is demonstrated in the Science paper's findings:

  • The English lexicon grew by 70 percent from 1950 to 2000, with roughly 8,500 new words entering the language each year. Dictionaries don't reflect a lot of those words. "We estimated that 52 percent of the English lexicon—the majority of the words used in English books—consists of lexical 'dark matter' undocumented in standard references," the authors write.
  • Researchers tracked references to individual years to demonstrate how humanity is forgetting its past more quickly. Take "1880": It took 32 years, until 1912, for references to that year to fall by half. But references to "1973" fell by half within 10 years.
  • Compared with their 19th-century counterparts, modern celebrities are younger and more well known—but their time in the limelight is shorter. Celebrities born in 1800 initially achieved fame at an average age of 43, compared with 29 for celebrities born in 1950.
  • Mining the data set can yield insights into the effects of censorship and propaganda. The authors give the example of the Jewish artist Marc Chagall. His name comes up only once in the German corpus during the Nazi era, even as he became increasingly prominent in English-language books.
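The "forgetting" measure in the second finding can be illustrated with a short calculation. This is a minimal sketch, not the authors' method: it assumes the half-life is counted from the year in which references peak, and both the function name `half_life` and the frequency values are hypothetical, invented for illustration only.

```python
def half_life(freq_by_year):
    """Years from the peak until frequency first falls to half the peak or less.

    freq_by_year maps a publication year to how often a term (for example
    the string "1973") appears that year. Returns None if the frequency
    never drops to half its peak within the data.
    """
    peak_year = max(freq_by_year, key=freq_by_year.get)
    peak = freq_by_year[peak_year]
    for year in sorted(freq_by_year):
        if year > peak_year and freq_by_year[year] <= peak / 2:
            return year - peak_year
    return None


# Toy, invented frequencies (NOT taken from the paper's data).
toy_mentions_of_1973 = {1973: 10.0, 1975: 8.0, 1980: 6.0, 1983: 5.0, 1990: 3.0}
print(half_life(toy_mentions_of_1973))
```

With these made-up numbers, the references peak in 1973 and first reach half that level in 1983, so the sketch reports a half-life of 10 years.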

The paper and the public data-mining tool come as Google's broader book-digitization effort remains in legal limbo. Authors and publishers have besieged that project, calling it copyright infringement, but a legal settlement has yet to be approved.

Asked how Google was protecting the copyright of the books in its new tool, a spokeswoman, Jeannie Hornung, said the publicly available data sets "cannot be reassembled into books."

Instead, the data sets "contain phrases of up to five words with counts of how often they occurred in each year," according to a Google blog post. They include Chinese, English, French, German, Russian, and Spanish books.
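To make that description concrete, here is a minimal sketch of how per-year counts of phrases up to five words long could be tallied. It is an illustration under stated assumptions, not Google's pipeline: the function name `ngram_counts_by_year`, the lowercase whitespace tokenization, and the three toy "books" are all hypothetical.

```python
from collections import Counter


def ngram_counts_by_year(books, max_n=5):
    """Count every phrase of 1..max_n words, keyed by (year, phrase).

    books is an iterable of (year, text) pairs. Tokenization here is a
    naive lowercase split on whitespace, chosen only for illustration.
    """
    counts = Counter()
    for year, text in books:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[(year, tuple(tokens[i:i + n]))] += 1
    return counts


# Invented sample texts, not drawn from the real corpus.
toy_books = [
    (1900, "the cat sat on the mat"),
    (1900, "the cat ran"),
    (1950, "the dog sat"),
]
counts = ngram_counts_by_year(toy_books)
print(counts[(1900, ("the", "cat"))])
print(counts[(1950, ("the", "cat"))])
```

Because only short phrases and their yearly totals are stored, the word order of any passage longer than five words is lost, which is consistent with the spokeswoman's point that the data sets cannot be reassembled into books.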

Some scholars, meanwhile, have criticized the value of reading huge quantities of books with computers. In a Chronicle article this year, they warned that cranking words from deeply specific texts like grist through a mill is a recipe for lousy research. Still others have attacked the quality of Google's data.

Mr. Aiden acknowledged that "people should be really skeptical about this," but he urged scholars to give the tool a try for themselves.


1. flowney - December 16, 2010 at 04:44 pm

I applaud this effort as a complement to the more traditional methods of analysis. The more, the better. At the same time, I am concerned that these data sets will be skewed by the short-term goals of authors and their representatives. The trend toward encrypting digital works is most disturbing since that might forever exclude modern works from being a fully functioning part of our intellectual history. It should be a condition of copyright that an unencrypted copy be placed in escrow pending the expiration of copyright which, itself, should be limited to the author's lifetime or 80 years, whichever is longer.

2. d_fevens - December 16, 2010 at 08:27 pm

*Asked how Google was protecting the copyright of the books in its new tool, a spokeswoman, Jeannie Hornung, said the publicly available data sets "cannot be reassembled into books."*

Whether or not the in-copyright book is publicly available is completely beside the point. When the University of Wisconsin, in partnership with Google, digitized my work (a book with a title and numbered pages of content) without my permission, they created digital copies (books with a title and numbered pages of content) and then commercialized them by making them available for searches on the internet and distributing copies to member libraries (e.g. HathiTrust). They digitized the whole book and they identify their search results as coming from a particular page of a particular book. Their search results would have little value if they could not say they searched the whole book. The University of Wisconsin-Google partnership digitized my work in 2008; I am still waiting for an apology for this infringement of my copyright. Some example for students, eh?
Douglas Fevens,
Halifax, Nova Scotia--
The University of Wisconsin, Google, & Me

3. arrive2__net - December 16, 2010 at 11:49 pm

'the data sets "contain phrases of up to five words with counts of how often they occurred in each year,"'

If this approach to restructuring the data turns out to be invalid, the research based on it could suddenly become invalid. The approach would apparently limit research to individual words, or to word groupings only where the words occur within the five-word range. There seems to be a risk that some concepts may seem to rise and fall over time where what actually happened was a change in preferred synonym or level of abstraction.

Bernard Schuster

4. steve1255 - December 17, 2010 at 07:41 am

In reference to forgetting history -- "Take "1880": It took 32 years, until 1912, for references to that year to fall by half. But references to "1973" fell by half within 10 years."

Perhaps part of the decrease is due to a change in writing style? We could have stopped including the numerical year over and over again when writing about historical events.

5. richardtaborgreene - December 17, 2010 at 09:03 am

I am opposed, on principle, to any new methods whatsoever. We should only innovate with proven methods that work flawlessly. That way progress will be progress and not regress. Furthermore, we should all live in caves and spit more---that will give us time to consider more carefully the ultimate value of stoves and vaccines, stem cells and the Republican Party.

6. tappat - December 17, 2010 at 10:00 am

I agree entirely with the commentary by richardtaborgreene, #5. It is so good to read plain, clear, simple, virile statements like these. It warms the cockles.

7. vandoesborgh - December 17, 2010 at 11:28 am

I think that it is wonderful that this is built upon the intensely flawed Google Books project. I did a search of "ipod" and was surprised to find it used in at least two books from 1800:

"...upon a ipod bottom, estar bien fun- dado. To fix one's bottom us in one, ..." (Baretti, A dictionary Spanish and English, and English and Spanish, 1800)

"...ÍA W Y£ R. A9 from the ti*ipod of Apr.Ho, Hear from my desk the wordi that follow :' " Some, by philosopher ..." (Johnson, The Works of the Poets of Great Britain and Ireland, 1800)

These are just the earliest entries. According to Google Books, "ipod" was used throughout the 19th and 20th centuries.

(For those not understanding--the actual images reveal the true words, Google didn't proofread the OCR scans which their searches are based upon. How could they? They scanned 15 million books.)

8. lexalexander - December 17, 2010 at 03:40 pm

William Gibson, the novelist generally credited with having invented the word "cyberspace" just a couple of decades ago in his book "Neuromancer," ran the word through this system and noted an interesting uptick in the use of the word in books beginning late in the 19th century, peaking around 1900, and then returning to zero until the late 20th century.

I realize the likeliest explanation is a glitch in either technology or methodology, but the romantic in me prefers to think there's a story there.

9. anonscribe - December 17, 2010 at 09:17 pm

Maybe it's because I'm a young scholar, but this seems really exciting. Yes, there are glitches and problems with Google's categorization procedures. But, the immediate potential to be able to analyze broad trends based on such a huge sample is exciting. It will take time, more digitization, better analysis, etc. to make it a reliable tool. But, Steven Pinker in the article published today in Science is right: This will be universal.

I think folks in the humanities (me) ought to embrace this. I think if enough good analyses are done, it could legitimate the professions again. We live in a data-driven culture, and literary scholars being able to build arguments partly based on reliable data will be a huge advantage. If you read the article published in Science, you'll see just a few of the potential arguments that can be constructed from this data: censorship, suppression of authors, adoption of technology, the length of cultural memory, etc.

This won't solve disputes within the humanities, but it will transform them. It will likely drive the discipline closer to the social sciences. There will always be a need for people who do "soft" criticism, theorizing new paradigms, analyzing texts philosophically or rhetorically. But, this will allow lit scholars to analyze things statistically also, perhaps reinforcing current beliefs about an issue, perhaps challenging them. It will also provide a new kind of evidence for those doing soft criticism to build upon.

I think I just heard the heartbeat of the humanities get a little stronger within the academy.

10. arc99999 - December 18, 2010 at 12:42 pm

From the above comments, it seems we have identified some room for improvement... First, perhaps Google should allow wiki-style editing abilities for those who wish to compare text images with the text on record (we all know that text-recognition software is not as accurate as the human eye at deciphering the multitude of fonts that have been used over the years). Second, as with all research data, the earliest results should be subject to scrutiny for the purpose of improving the quality of research (internal validity, external validity, and controlling for confounding variables). Third, the entire purpose of gathering this information and having it "searchable" in such detail is to evoke exactly this kind of discussion! I love reading the input given and would like to encourage commenters to provide productive commentary and constructive criticism... cynicism and snide remarks don't go over as well on a discussion board as they do at the office cocktail party.

11. corpusprof - December 20, 2010 at 11:15 am

Looking at the pretty charts in Culturomics and the new Google Books interface is nice. But of course there is much more to looking at cultural / language change than just using simple frequency charts of exact words and phrases.

The NEH-funded, 400-million-word Corpus of Historical American English (freely available at http://corpus.byu.edu/coha) allows for a much wider range of searches. Besides frequency lists like Google Books (with essentially the same results), a simple 2-3 second search can find changes in word meaning and usage (e.g. gay, care, web; or what we're saying about any topic over time), grammatical changes, and it can find *all words* that are more frequent in one period than another (rather than one by one, as with Google Books), as well as much more.

More information at:
