Scholars Elicit a 'Cultural Genome' From 5.2 Million Google-Digitized Books

December 16, 2010

The English language is going through a time of huge growth. Humanity is forgetting its history more rapidly each year. And celebrities are losing their fame faster than in the past.

Those are some of the findings in a paper published on Thursday in the journal Science by a Harvard-led team of researchers. The scholars quantified cultural trends by investigating the frequency with which words appeared over time in a database of about 5.2 million books, roughly 4 percent of all volumes ever published, according to Harvard's announcement.

The research team, headed by Jean-Baptiste Michel and Erez Lieberman Aiden, culled that digital "fossil record" from more than 15 million books digitized by Google and its university partners. Google is giving the public a glimpse of the researchers' data through an online interface that lets users key in words or phrases and plot how their usage has evolved. The paper's authors bill this as "the largest data release in the history of the humanities."

Scholars have explored quantitative approaches to the humanities for years. What's novel here is the volume of material. According to a Google spokeswoman, the data set of 5.2 million books includes both in- and out-of-copyright titles in several languages from 1500 to 2008. Its more than 500 billion words amount to a sequence of letters 1,000 times as long as the human genome. This "cultural genome" would stretch to the moon and back 10 times over if arranged in a straight line.

Chronicle of Higher Education

A Harvard-led team used books digitized by Google to analyze the occurrence of words since 1500. A graph shows the appearance of four words from 1965 to 2008: "fry" (in red), "bake" (blue), "grill" (green), and "roast" (yellow). Photograph by Google Books

"It radically transforms what you can look at," says Mr. Aiden, a junior fellow in Harvard's Society of Fellows and principal investigator of the Laboratory-at-Large, part of Harvard's School of Engineering and Applied Sciences. Mr. Aiden and Mr. Michel, a postdoctoral researcher in Harvard's psychology department and its Program for Evolutionary Dynamics, call their approach "culturomics."

The method's cross-disciplinary potential is demonstrated in the Science paper's findings:

  • The English lexicon grew by 70 percent from 1950 to 2000, with roughly 8,500 new words entering the language each year. Dictionaries don't reflect a lot of those words. "We estimated that 52 percent of the English lexicon—the majority of the words used in English books—consists of lexical 'dark matter' undocumented in standard references," the authors write.
  • Researchers tracked references to individual years to demonstrate how humanity is forgetting its past more quickly. Take "1880": It took 32 years, until 1912, for references to that year to fall by half. But references to "1973" fell by half within 10 years.
  • Compared with their 19th-century counterparts, modern celebrities are younger and better known, but their time in the limelight is shorter. Celebrities born in 1800 initially achieved fame at an average age of 43, compared with 29 for celebrities born in 1950.
  • Mining the data set can yield insights into the effects of censorship and propaganda. The authors give the example of the Jewish artist Marc Chagall. His name comes up only once in the German corpus during the Nazi era, even as he became increasingly prominent in English-language books.
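
The "forgetting" measurement in the second finding can be read as a kind of half-life: the number of years it takes for mentions of a given year to drop to half of their peak. The following is a minimal sketch of that idea, not the authors' actual method. The function name and the toy numbers are invented for illustration and are not taken from the paper.

```python
def years_to_half(counts):
    """Return the years from the peak until mentions first fall to half the peak.

    counts: dict mapping publication year -> how often a term was mentioned
            (hypothetical input format).
    Returns None if mentions never fall that far.
    """
    peak_year = max(counts, key=counts.get)
    half = counts[peak_year] / 2
    # Walk forward in time from the peak and stop at the first halving.
    for year in sorted(y for y in counts if y > peak_year):
        if counts[year] <= half:
            return year - peak_year
    return None

# Invented toy series shaped like the "1880" example; not real data.
toy = {1880: 100, 1890: 80, 1900: 65, 1912: 50, 1920: 40}
print(years_to_half(toy))
```

With real data, a shorter result for a later year such as "1973" would correspond to the faster forgetting the researchers describe.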

The paper and the public data-mining tool come as Google's broader book-digitization effort remains in legal limbo. Authors and publishers have challenged that project in court, calling it copyright infringement, and a proposed legal settlement has yet to be approved.

Asked how Google was protecting the copyright of the books in its new tool, a spokeswoman, Jeannie Hornung, said the publicly available data sets "cannot be reassembled into books."

Instead, the data sets "contain phrases of up to five words with counts of how often they occurred in each year," according to a Google blog post. They include Chinese, English, French, German, Russian, and Spanish books.
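
To make that format concrete, here is a minimal sketch of how a table of phrases and yearly counts could be built from plain text. The function name, the toy input, and the simple whitespace tokenization are assumptions for illustration. Google's actual processing pipeline is not described in this article.

```python
from collections import Counter

def ngram_counts_by_year(books, max_n=5):
    """Count every phrase of 1 to max_n words, keyed by (phrase, year).

    books: iterable of (year, text) pairs (hypothetical input format).
    """
    counts = Counter()
    for year, text in books:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of n words across the text.
            for i in range(len(words) - n + 1):
                counts[(" ".join(words[i:i + n]), year)] += 1
    return counts

# Invented toy "books"; not real data.
toy_books = [(1965, "fry the fish"), (1965, "bake the bread"), (2008, "grill the fish")]
table = ngram_counts_by_year(toy_books)
print(table[("the", 1965)])
print(table[("the fish", 2008)])
```

Because only short phrases and their counts are stored, the original word order of a full book is not kept in a table like this.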

Some scholars, meanwhile, have questioned the value of reading huge quantities of books with computers. In a Chronicle article this year, they warned that cranking words from deeply specific texts through a mill like grist is a recipe for lousy research. Still others have attacked the quality of Google's data.

Mr. Aiden acknowledged that "people should be really skeptical about this," but he urged scholars to give the tool a try for themselves.