• August 29, 2015

Google's Book Search: A Disaster for Scholars

Whether the Google books settlement passes muster with the U.S. District Court and the Justice Department, Google's book search is clearly on track to becoming the world's largest digital library. No less important, it is also almost certain to be the last one. Google's five-year head start and its relationships with libraries and publishers give it an effective monopoly: No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project. Of course, 50 or 100 years from now control of the collection may pass from Google to somebody else—Elsevier, Unesco, Wal-Mart. But it's safe to assume that the digitized books that scholars will be working with then will be the very same ones that are sitting on Google's servers today, augmented by the millions of titles published in the interim.

That realization lends a particular urgency to the concerns that people have voiced about the settlement: about pricing, access, and privacy, among other things. But for scholars, it raises another, equally basic question: What assurances do we have that Google will do this right?

Doing it right depends on what exactly "it" is. Google has been something of a shape-shifter in describing the project. The company likes to refer to Google's book search as a "library," but it generally talks about books as just another kind of information resource to be incorporated into Greater Google. As Sergey Brin, co-founder of Google, puts it: "We just feel this is part of our core mission. There is fantastic information in books. Often when I do a search, what is in a book is miles ahead of what I find on a Web site."

Seen in that light, the quality of Google's book search will be measured by how well it supports the familiar activity that we have come to think of as "googling," in tribute to the company's specialty: entering in a string of keywords in an effort to locate specific information, like the dates of the Franco-Prussian War. For those purposes, we don't really care about metadata—the whos, whats, wheres, and whens provided by a library catalog. It's enough just to find a chunk of a book that answers our needs and barrel into it sideways.

But we're sometimes interested in finding a book for reasons that have nothing to do with the information it contains, and for those purposes googling is not a very efficient way to search. If you're looking for a particular edition of Leaves of Grass and simply punch in, "I contain multitudes," that's what you'll get. For those purposes, you want to be able to come in via the book's metadata, the same way you do if you're trying to assemble all the French editions of Rousseau's Social Contract published before 1800 or books of Victorian sermons that talk about profanity.

Or you may be interested in books simply as records of the language as it was used in various periods or genres. Not surprisingly, that's what gets linguists and assorted wordinistas adrenalized at the thought of all the big historical corpora that are coming online. But it also raises alluring possibilities for social, political, and intellectual historians and for all the strains of literary philology, old and new. With the vast collection of published books at hand, you can track the way happiness replaced felicity in the 17th century, quantify the rise and fall of propaganda or industrial democracy over the course of the 20th century, or pluck out all the Victorian novels that contain the phrase "gentle reader."

But to pose those questions, you need reliable metadata about dates and categories, which is why it's so disappointing that the book search's metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess.

Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux's La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams's Culture and Society 1780-1950, and Robert Shelton's biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf's letters is dated 1900, when she would have been 8 years old. Tom Wolfe's Bonfire of the Vanities is dated 1888, and an edition of Henry James's What Maisie Knew is dated 1848.

Of course, there are bound to be occasional howlers in a corpus as extensive as Google's book search, but these errors are endemic. A search on "Internet" in books published before 1950 produces 527 results; "Medicare" for the same period gets almost 1,600. Or you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. "Charles Dickens" turns up 182 results for publications before 1812, the vast majority of them referring to the writer. The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)

How frequent are such errors? A search on books published before 1920 mentioning "candy bar" turns up 66 hits, of which 46—70 percent—are misdated. I don't think that's representative of the overall proportion of metadata errors, though they are much more common in older works than for the recent titles Google received directly from publishers. But even if the proportion of misdatings is only 5 percent, the corpus is riddled with hundreds of thousands of erroneous publication dates.

Google acknowledges the incorrect dates but says they came from the providers. It's true that Google has received some groups of books that are systematically misdated, like a collection of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google's own doing. A lot of them arise from uneven efforts to automatically extract a publication date from a scanned text. A 1901 history of bookplates from the Harvard University Library is correctly dated in the library's catalog. Google's incorrect date of 1574 for the volume is drawn from an Elizabethan armorial bookplate displayed on the frontispiece. An 1890 guidebook called London of To-Day is correctly dated in the Harvard catalog, but Google assigns it a date of 1774, which is taken from a front-matter advertisement for a shirt-and-hosiery manufacturer that boasts it was established in that year.

Then there are the classification errors, which taken together can make for a kind of absurdist poetry. H.L. Mencken's The American Language is classified as Family & Relationships. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as Antiques and Collectibles. (A 1930 English edition of Flaubert's novel is classified under Physicians, which I suppose makes a bit more sense.) An edition of Moby Dick is labeled Computers; The Cat Lover's Book of Fascinating Facts falls under Technology & Engineering. And a catalog of copyright entries from the Library of Congress is listed under Drama (for a moment I wondered if maybe that one was just Google's little joke).

You can see how pervasive those misclassifications are when you look at all the labels assigned to a single famous work. Of the first 10 results for Tristram Shandy, four are classified as Fiction, four as Family & Relationships, one as Biography & Autobiography, and one is not classified. Other editions of the novel are classified as Literary Collections, History, and Music. The first 10 hits for Leaves of Grass are variously classified as Poetry, Juvenile Nonfiction, Fiction, Literary Criticism, Biography & Autobiography, and, mystifyingly, Counterfeits and Counterfeiting. And various editions of Jane Eyre are classified as History, Governesses, Love Stories, Architecture, and Antiques & Collectibles (as in, "Reader, I marketed him.").

Here, too, Google has blamed the errors on the libraries and publishers who provided the books. But the libraries can't be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves, not from any of the classification systems used by libraries. And BISAC classifications weren't in wide use before the last decade or two, so only Google can be responsible for their misapplications on numerous books published earlier than that: the 1919 edition of Robinson Crusoe assigned to Crafts & Hobbies or the 1907 edition of Sir Thomas Browne's Hydriotaphia: Urne-Buriall, which has been assigned to Gardening.

Google's fine algorithmic hand is also evident in a lot of classifications of recent works. The 2003 edition of Susan Bordo's Unbearable Weight: Feminism, Western Culture, and the Body (misdated 1899) is assigned to Health & Fitness—not a labeling you could imagine coming from its publisher, the University of California Press, but one a classifier might come up with on the basis of the title, like the Religion tag that Google assigns to a 2001 biography of Mae West that's subtitled An Icon in Black and White or the Health & Fitness label on a 1962 number of the medievalist journal Speculum.

But even when it gets the BISAC categories roughly right, the more important question is why Google would want to use those headings in the first place. People from Google have told me they weren't included at the publishers' request, and it may be that someone thought they'd be helpful for ad placement. (The ad placement on Google's book search right now is often comical, as when a search for Leaves of Grass brings up ads for plant and sod retailers—though that's strictly Google's problem, and one, you'd imagine, that they're already on top of.) But it's a disastrous choice for the book search. The BISAC scheme is well-suited for a chain bookstore or a small public library, where consumers or patrons browse for books on the shelves. But it's of little use when you're flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods. For example the BISAC Juvenile Nonfiction subject heading has almost 300 subheadings, like New Baby, Skateboarding, and Deer, Moose, and Caribou. By contrast the Poetry subject heading has just 20 subheadings. That means that Bambi and Bullwinkle get a full shelf to themselves, while Leopardi, Schiller, and Verlaine have to scrunch together in the single subheading reserved for Poetry/Continental European. In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore.

Such examples don't exhaust Google's metadata errors by any means. In addition to the occasionally quizzical renamings of works (Moby Dick: or the White Wall), there are a number of mismatches of titles and texts. Click on the link for the 1818 Théorie de l'Univers, a work on cosmology by the Napoleonic mathematician and general Jacques Alexander François Allix, and it takes you to Barbara Taylor Bradford's 1983 novel Voice of the Heart, while the link on a misdated number of Dickens's Household Words takes you to a 1742 Histoire de l'Académie Royale des Sciences. Numerous entries mix up the names of authors, editors, and writers of introductions, so that the "about this book" page for an edition of one French novel shows the striking attribution, "Madame Bovary By Henry James." More mysterious is the entry for a book called The Mosaic Navigator: The Essential Guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. The only connection I can come up with is that Jones was the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word "mosaic," though the details of the process leave me baffled.

For the present, then, scholars will have to put on hold their visions of tracking the 19th-century fortunes of liberalism or quantifying the shift of "United States" from a plural to singular noun phrase over the first century of the republic: The metadata simply aren't up to it. It's true that Google is aware of a lot of these problems and they've pledged to fix them. (Indeed, since I presented some of these errors at a conference last week, Google has already rushed to correct many of them.) But it isn't clear whether they plan to go about this in the same way they're addressing the scanning errors that riddle the texts, correcting them as (and if) they're reported. That isn't adequate here: There are simply too many errors. And while Google's machine classification system will certainly improve, extracting metadata mechanically isn't sufficient for scholarly purposes. After first seeming indifferent, Google decided it did want to acquire the library records for scanned books along with the scans themselves, but as of now the company hasn't licensed them for display or use—hence, presumably, those stabs at automatically recovering publication dates from the scanned texts.

Some of the slack may be picked up by other organizations such as the Internet Archive or HathiTrust, a consortium of participating libraries that is planning to make available several million of the public-domain books from their collections that Google scanned, along with their bibliographic records. But for now those sources can only provide access to books in the public domain, about 15 percent of the scanned collections; only Google will have the right to display the orphan works published since 1923.

In any case, none of that should relieve Google of the responsibility of making its collections an adequate resource for scholarly research. That means, at a minimum, licensing the catalogs of the Library of Congress and OCLC Online Computer Library Center and incorporating them into the search engine so that users can get accurate results when they search on various combinations of dates, keywords, subject headings, and the like. ("Adequate" means a lot more than that, as well, from improving the quality of scanning to improving Google's very flaky hit-count algorithms and rationalizing the resulting rankings, which now make no sense at all and often lead with inferior or shoddy editions of classic works.) Whether or not a guarantee of quality is a contractual obligation, it's implicit in the project itself. Google has, justifiably, described its book-scanning program as a public good. But as Pamela Samuelson, a director of the Center for Law & Technology at the University of California at Berkeley, has said, every great public good implies a great public trust.

I'm actually more optimistic than some of my colleagues who have criticized the settlement. Not that I'm counting on selfless public-spiritedness to motivate Google to invest the time and resources in getting this right. But I have the sense that a lot of the initial problems are due to Google's slightly clueless fumbling as it tried to master a domain that turned out to be a lot more complex than the company first realized. It's clear that Google designed the system without giving much thought to the need for reliable metadata. In fact, Google's great achievement as a Web search engine was to demonstrate how easy it could be to locate useful information without attending to metadata or resorting to Yahoo-like schemes of classification. But books aren't simply vehicles for communicating information, and managing a vast library collection requires different skills, approaches, and data than those that enabled Google to dominate Web searching.

That makes for a steep learning curve, all the more so because of Google's haste to complete the project so that potential competitors would be confronted with a fait accompli. But whether or not the needs of scholars are a priority, the company doesn't want Google's book search to become a running scholarly joke. And it may be responsive to pressure from its university library partners—who weren't particularly attentive to questions of quality when they signed on with Google—particularly if they are urged (or if necessary, prodded) to make noise about shoddy metadata by the scholars whose interests they represent. If recent history teaches us anything, it's that Google is a very quick study.

Geoffrey Nunberg, a linguist, is an adjunct full professor at the School of Information at the University of California at Berkeley. Images of some of the errors discussed in this article can be found here.


1. 11159995 - September 01, 2009 at 09:41 am

Professor Nunberg has done everyone in academe a great service by documenting how badly Google has bungled the handling of metadata. As every publisher that is preparing its lists of books to claim in the Settlement already knows, "the book search's metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess," as the professor so aptly puts it. He also rightly complains about Google's ineptness in applying BISAC codes to the books in its system. But this is not all Google's fault. As the professor observes, this coding system was devised specifically for the benefit of the large chain bookstores; it ill serves academe, as the categories do not correspond well to standard ways of differentiating fields and subfields in scholarship. E.g., one wouldn't know by looking at the BISAC codes that the American Political Science Association has long divided political science into four main categories: American politics, comparative politics, international relations, and political theory. One has to struggle mightily, as an academic press, to make these codes work meaningfully for scholarly books. Thus Google's reliance on this trade-driven system only compounds the problems it creates for the academic community. --- Sandy Thatcher, Penn State University Press

2. martinllevine - September 01, 2009 at 11:03 am

A very helpful list of what Google should do to improve the usefulness of Google Books. Here's another one: a recurring problem with date of publication is that all volumes of a journal are assigned the date of volume 1.

3. bekka_alice - September 01, 2009 at 12:43 pm

Bless you for "making noise" about this. I can imagine that if I were to try to find some appropriate contact at Google to whom I might send a letter of dismay, I'd likely get to the wrong department if to anyone at all. My missive would be lost or tossed as a stray crank. I appreciate that you have a platform and are using it for the good of us all - including the ultimate good of Google so it doesn't spend a decade creating a resource spurned as substandard and useless for research. The number and degree of errors is sufficient to induce a mild despair in a reader who really would like to make good use of what could be a fantastic resource. I'd be willing to volunteer time to read and ensure publication dates were listed correctly, and I'm sure there are others who would be willing to do so to save the project from a bizarre choice to use scanning systems to do a job requiring thought. A volunteer team would probably also do wonders with the magnificently horrible classifications assigned so far. For some items, such as ranking, there isn't as easy of a solution. But I do hope that Google stops presenting surreal and transparent falsehoods about the sources of the bad data and turns their attention instead toward fixing the problem. I've trusted them as a company for some time; I'd be disappointed to lose faith in them over a project where they place CYA above doing the best job to create a potential wonder of the world.

4. argosyatlanta - September 01, 2009 at 02:49 pm

Another disturbing feature: books stripped of their own internal metadata. I tried to bring to the attention of Hal Varian, Google's chief economist, the case of his own opus on internet economics. I figured that, having run the UC library system, he would be understanding. The problem? Google left out Hal's author bio.

5. dlsadmin - September 01, 2009 at 03:41 pm

This is a wonderful and regrettably amusing treatment of the metadata problems in Google Book Search that everyone, particularly Google, interested in digital libraries should read. There are, however, a few significant errors and vague innuendos such as the tiresome and fear-mongering 'de facto monopoly' argument that has been trundled out in response to commercial digitization efforts for the last fifteen years. The error I need to respond to, however, is the characterization of HathiTrust.

Nunberg states that HathiTrust may "only provide access to books in the public domain," and this is simply not true. We may provide access to books within the parameters established by the law. Most notably, this allows us to open access to works where the individual or organization gives us permission. I won't argue that this has happened on a very large scale, but then again we have yet to undertake the work with our communities--communities of scholars--to make that happen. I came to work today to find nearly a dozen signed permissions agreements requesting we open access to works whose rights have reverted to the authors, and this is indeed what we'll do.

It would also be wrong to think that this sort of open reading access is the only meaningful use HathiTrust institutions can make of these works. One of the most significant uses is their preservation. The widespread use of acidic paper for most of the 19th and 20th centuries means that nearly all of the works being digitized are deteriorating. Preserving these works is a key library function sanctioned by the law and doing so in a digital form allows the HathiTrust libraries to share the burden of preservation much more effectively. There are other uses established by the law, including access by our users with print disabilities and supporting computational research. Nunberg's grudging "only provide access to books in the public domain" fails to acknowledge these important activities by HathiTrust partners.

It is worth pointing out a couple of subtler quibbles with Nunberg's characterization of HathiTrust and the problem of orphan works. First, it needs to be said that many works assumed to be in-copyright orphans are actually in the public domain, and it's the arduous work of establishing rights that keeps some of these waters muddied. By coming together as they have, HathiTrust institutions can attack this particular problem with shared resources. With generous support from the Institute of Museum and Library Services, we are in the process of creating a Copyright Review Management System and, even in the planning and development stages, our work serves to "free" several thousand titles each month. Second, although HathiTrust is indeed “a consortium of participating libraries” (and I believe Nunberg implies here "*Google* participating libraries"), HathiTrust's intention is to bring together *research libraries*, whether Google partners or not. We are in active discussions with several research libraries that are not Google partners, discussions that will expand our collective collections and bring even more library resources to bear on these questions of preservation and access.

I should add one final note about the search capabilities HathiTrust plans to offer, which Nunberg questions in a separate article (http://languagelog.ldc.upenn.edu/nll/?p=1701#). Our plans for reliable and comprehensive bibliographic and full text search across both in-copyright and public domain works are ambitious and well-documented on the HathiTrust website. For example, our full text search initiatives are covered in detail at http://www.hathitrust.org/large_scale_search, and we recently announced plans to launch our comprehensive search service in October, 2009.

-- John Wilkin, Executive Director, HathiTrust

6. charlesmann - September 02, 2009 at 09:31 am

May I add an additional problem with Google's Book Search, one that has caused me many hours of frustration? In my experience, it rarely distinguishes the separate volumes or editions of multivolume books or series.

Two examples: Richard Hakluyt's "Principal Navigations" and Blair and Robertson's "Philippine Islands, 1493-1898". The former is a multivolume compilation of early European travelers' reports that is an essential reference for anyone interested in colonial history--so essential, in fact, that many researchers would welcome the chance to download a searchable version at home. A search today for "principal navigations hakluyt inauthor:hakluyt" on Google Books turns up 2,171 entries, of which 1,349 are "full view". The first four entries are: 1) Vol. 14 of the Goldsmid edition (correctly identified in the metadata but not in the search listing); 2) Vol. 4 of the 1926 reprint of the 1907 Dutton edition (not correctly identified in either place); 3) Vol. 2 of a multivolume selection edited by Payne that began appearing in 1893 (incorrectly identified in both places); 4) Vol. 1 of the Goldsmid. Alas, anyone who wants to find a particular volume or simply a complete set has to keep clicking randomly on entries until, scores or even hundreds of books later, they happen to find the desired text(s).

The opposite occurs with the Blair and Robertson, a 55-volume compilation of translated texts about Spain's venture in the Philippines, and an essential but hard-to-find source for anyone interested in colonial Asian history. There the same search for "philippine islands inauthor:blair robertson" turns up just 5 volumes. By spending several days poking around the nooks and crannies of Google Books, I was able to discover that Google Books actually has multiple copies of each volume in the series. Sometimes I could happen upon a volume only by searching for text strings within it; sometimes I could find it only by searching for "Philippine Islands" and clicking through page after page after page of listings in the hopes of stumbling across it.

This is a pity, because book sets like these are often expensive and hard to find -- only 500 copies of the Blair and Robertson were printed. By providing worldwide access to them, Google is performing a great service. I am grateful to the company for doing it. But Prof. Nunberg is entirely correct to observe that in this instance they are falling far short of their corporate mission: "to organize the world's information and make it universally accessible and useful."

7. lukelea - September 02, 2009 at 10:47 am

Another suggestion for Google: They ought to arrange results by the Dewey Decimal System and other contemporary orderings used by libraries. That way you could browse other, nearby books the same way you do when you are free in the stacks. Just a thought.

8. ramesh1 - September 02, 2009 at 11:24 am

You are right that Google made haste, but let's hope Google will make improvements in the future. I think this is a great revolution for future generations, who will be able to get all knowledge in one place.

9. unusedusername - September 02, 2009 at 01:45 pm

For everyone whining about Google, I have one piece of advice: start your own library. Google's library didn't even exist 10 years ago. It is hardly a "monopoly" with an impossible barrier to entry. If you don't like it, don't use it.

10. larryc - September 02, 2009 at 06:43 pm

An engaging and somewhat wrong-headed article. I don't really care how Google uses categories; it does not change my work at all. And the metadata problems are fixable (and I think Nunberg is exaggerating them anyway).

And yet if the Google Books project is to improve it is important that we point out its shortcomings.

(I blogged a longer reaction to the article here: http://northwesthistory.blogspot.com/2009/09/googles-book-search-disaster-for.html)

11. mightythylacine - September 02, 2009 at 07:02 pm

It seems a little silly to complain about a completely free tool which you are not required to use.

At the end of the day using it costs you nothing and can only be beneficial. If you disagree you can always build your own free public library from the ground up.

12. gsheldon - September 02, 2009 at 08:59 pm

I continue to watch with interest the very thoughtful and insightful comments made by many observers of the Google Books program and the proposed settlement, and continue to be confused by those who characterize the program with scary phrases like "disaster for scholars." The fact of the matter is that scholars are no worse off than they were before Google's mass digitization program -- they can still use the well-established network of local and national bibliographic systems and services (campus and regional library catalogs, OCLC WorldCat, etc.) to locate the works they need, and can visit the holding libraries or make ILL requests to obtain the works. Some scholars will in fact be better off through the services that Google Books provides, but no one will be worse off. Of course, GB can be improved, and we can hope that it will be, but how is this a "disaster for scholars"?

Gary Lawrence
Director of Systemwide Library Planning (retired), University of California

13. richardtaborgreene - September 02, 2009 at 11:59 pm

People not politically included in the fashioning of a system they use tend to whine and bitch a lot. This probably has a brain basis in some neuron or other. Google can void such enemy-building dynamics by simply using technology to assemble a swarm-intelligence or crowd-power editing/fixing/commenting/indexing body that allows bitchers and whiners something more constructive to do with their finger dexterities.

14. tech2doc - September 03, 2009 at 03:01 am

I agree with richardtaborgreene, the system has limitations and faults, pretty much like every human devised system since the dawn of time. Allowing more experts to come in and correct mistakes would be useful...and the 12 people who care about left-handed Russian authors before 1750 who had mustaches can now correct the database so that future generations will not head down the dark path from this error...

15. elizstone - September 03, 2009 at 06:47 am

And here I thought it was just me--listed by Google.books as the second author on my own book. Not anyone whose name I know, by the way. Given the mishaps, I guess I should be glad I can be found at all!

16. orwant - September 03, 2009 at 07:38 am

Geoff also made these points on his blog at http://languagelog.ldc.upenn.edu/nll/?p=1701, where I responded to them. (I manage the Google Books metadata team.)

17. nightspore - September 03, 2009 at 10:33 am

Charles Mann is absolutely right about the difficulties of navigating multi-volume sets. "About this book" almost always gets them wrong, and you have to look at wrongly labeled volume after volume to put together a jury-rigged version of, say, Clarissa or any Trollope novel.

18. iagoarchangel - September 03, 2009 at 10:58 am

I'm with bekka_alice: I hope Google sees the opportunity to do something great, instead of just enormous, by heeding points logged by Mr Nunberg. A mashup of Google Books with OCLC metadata (like the delightful WorldCat Identities), or a good mechanism to crowd-source metadata, could be a dream come true. Maybe the pot of gold at the end of this rainbow is subscription-based premium service ("Google Books Gold--now with high-quality metadata").

I'm also with Gary Lawrence: Google Books is not a "disaster" even though its usefulness for many types of scholarship seems limited. I have to wonder whether Mr Nunberg's editor created a sensational title with this word that does not occur in the article itself. In any case, it's a good title for igniting all this enthusiastic discussion, and some optimism.

Jimmy Thomas
The Library Corporation

19. paievoli - September 03, 2009 at 11:03 am

We are not even mentioning peer-reviewed issues here. What if a book is written before a major experiment is conducted and the material in the earlier book is wrong? Who says a student stops and finds the newest information? Someone has to vet this information and a scanner cannot do it. This is going to be the beginning of lunatics running the asylum.
I believe completely in digital content but someone has to review this material for quality control. And Google, I believe, is as always just looking for more profits. "Do no evil" - to whom?

20. orwant - September 03, 2009 at 02:13 pm

Jimmy, thanks for your comment. We do use OCLC WorldCat data in Google Books. However, we wouldn't develop a subscription-based premium service for metadata -- we want to provide the highest quality metadata we can, for free.

21. iagoarchangel - September 03, 2009 at 03:47 pm

Jon Orwant (Google Books Metadata Team Leader),
Wow! I'm thoroughly impressed with the response you posted under the illustrated blog edition of Geoff's paper. I overlooked your brief comment above, and missed that opportunity to honor all the effort your team has already put into addressing his points here and there.

Readers who got this far in the comments,
Do follow that link and enjoy the rest of the story!

22. pyegar - September 03, 2009 at 04:59 pm

For greater context- another huge, ambitious metadata project that lacked perfection, yet added value:


That used cameras, printing presses, and good old sweat labor. Yet some rate of error was tolerated. So, also, for Murray's OED.

23. d_fevens - September 03, 2009 at 08:52 pm

I am not a scholar; in fact I describe myself as a “pretend writer and researcher”. One of my works, “Fevens, a family history”, was scanned by the partnership of the University of Wisconsin-Madison/Google Inc. in 2008. Even though my copyright was registered with the Canadian Intellectual Property Office, neither the university nor Google sought my permission. I found out by accident on May 13th of this year that they had digitized it. At my insistence it has been removed from the online search engines; I am however still waiting for written confirmation that their digital volume(s) in their digital libraries of “Fevens” has/have been destroyed and also an apology from the University of Wisconsin for this infringement of my copyright. If I had not discovered my book online, and the Google Book Settlement becomes law, Google would have owned the digital copyrights to my book after April 5, 2011. As for Google using “fair use” as an argument for their, in my opinion, illegal digitization of copyrighted works, I would point out that the 2008 report of the Section 108 Study Group ("a select committee of copyright experts charged with updating for the digital world the Copyright Act's balance between the rights of creators and copyright owners and the needs of libraries and archives," as the group is described on their web site) states:
“Machines read and render digital content by copying it. As a result, copies are routinely made in connection with any use of a digital file. While these copies may be temporary or incidental to the use, they are considered "reproductions" under the copyright law for which authorization is required absent an applicable exception.”
(Introduction, Page 6, Second "bulleted" item)
I do not believe the partnership that exists between the University and Google is an "applicable exception" because they are a de facto commercial enterprise.

For scholars who are interested in accuracy: when I first went to my book at Google Books they had added my name to the cover, thereby redesigning it.
Douglas Fevens
Halifax, Nova Scotia
The University of Wisconsin, Google & Me

24. virtualgab - September 04, 2009 at 02:29 am

Why is Google imposing these absurd categories on the world's literature? Maybe they should read Clay Shirky's 2005 piece, "Ontology is overrated", in which he elegantly decimates the notion of library classificatory systems:

25. simonfairbairn - September 04, 2009 at 08:43 am

"No less important, it is also almost certain to be the last one...No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project."

Stuff like this makes me crazy. Self-important statements of 'fact' about the future when human beings are notoriously and ridiculously inaccurate at prediction.

You don't know if scanning 'will always' be expensive and labor-intensive. You don't know that 'no' competitor is going to be able to do the same thing, but bigger and better. You don't know if it's always going to be Google's servers hosting these books.

The concerns you have may be valid, but don't try to over-inflate their importance by basing them on a dystopian premise when you just don't know (unless you managed to get that time machine working, in which case I take all of this back).

"It's almost certain...But it's safe to assume..." The only thing that's 'almost certain' and 'safe to assume' about technology is that it's going to change and, probably, into something that luddites with their verbal frame-breaking weren't expecting at all.

26. srminton - September 04, 2009 at 09:53 am

The overall quality and accuracy of digitized books is currently acceptable. I've recently started reading a lot of e-books on a variety of platforms in order to research the model, and the number of errors within the text is astounding and unsettling. Often, entire paragraphs of literary works are misplaced, misquoted, missed out completely or repeated at random. The number of 'typos' is also vastly higher than in printed copies. I wonder if, in our rush to digitize literature which has been gradually printed over hundreds of years, we are simply making a complete mess which will never be undone.

27. srminton - September 04, 2009 at 09:53 am

The overall quality and accuracy of digitized books is currently unacceptable. I've recently started reading a lot of e-books on a variety of platforms in order to research the model, and the number of errors within the text is astounding and unsettling. Often, entire paragraphs of literary works are misplaced, misquoted, missed out completely or repeated at random. The number of 'typos' is also vastly higher than in printed copies. I wonder if, in our rush to digitize literature which has been gradually printed over hundreds of years, we are simply making a complete mess which will never be undone.

28. srminton - September 04, 2009 at 09:55 am

Acceptable/unacceptable - it's all about the editing.

29. leclair - September 04, 2009 at 11:48 am

Counter-intuitive solution at the intersection of Google and Wiki.

30. jimcbender1652 - September 04, 2009 at 02:58 pm

My concern is less with the search results than how titles are chosen and how the copyright law is applied. For example, why should a 200 year old book not be available through Google Books? The answer is that someone has made a reprint five years ago in Europe, and Google therefore withholds John Charnock's History of Marine Architecture to prop up sales of the book. At one point, one volume was available, but that was withdrawn. As an independent scholar and writer, I want books of historical interest to be available, especially when they are over 70 years old or more. Paid services are priced for institutions, so they are not an option, generally.

31. barefootliam - September 07, 2009 at 04:31 pm

The biggest reason I have to complain about Google books is the accuracy. I know from my own experience that if you scan images at, say, 400dpi grayscale, you get massively better OCR results than Google is getting. So I am guessing their scan resolution is too low. In any case I am seeing that line drawings are often not visible at all on the pages. It's good enough for advertising revenue, Google's goal, but it's not good enough for people who need to use the books.

We have to do better.

I've been experimenting with taking multiple copies of the same book scanned and OCR'd independently by Google (!), e.g. a 4-million-word dictionary of biography from 1811 (Chalmers'), and have made a majority text version at http://www.words.fromoldbooks.org/ (although the work is still in progress). Maybe that way we can get better quality texts, but it seems to me the books will all have to be scanned again in any case for diagrams, pictures, footnotes, and other missing things. You often can't tell that a footnote or diagram wasn't scanned unless you know to look for it!
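
For readers curious what a "majority text" could look like in practice, here is a minimal Python sketch of word-level voting across several OCR transcriptions of the same passage. It is an illustration only, not the method used for the Chalmers text: the function name `majority_text` is hypothetical, and it assumes every transcription can be aligned against the first one, so words missing from that first copy are not recovered.

```python
from collections import Counter
from difflib import SequenceMatcher


def majority_text(witnesses):
    """Return a word-level majority reading of several OCR transcriptions.

    Simplifying assumption: each witness is aligned against the first one,
    so words absent from the first witness cannot be recovered.
    """
    base = witnesses[0].split()
    # One ballot box per word position in the base witness.
    votes = [Counter({word: 1}) for word in base]

    for other in witnesses[1:]:
        words = other.split()
        matcher = SequenceMatcher(a=base, b=words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Count matching runs and same-length substitutions; skip
            # insertions, deletions, and uneven replacements.
            if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for offset in range(i2 - i1):
                    votes[i1 + offset][words[j1 + offset]] += 1

    # On a tie, the reading from the earliest witness wins.
    return " ".join(ballot.most_common(1)[0][0] for ballot in votes)
```

With three scans of the same line, two of which read "brown" and one "brovvn", the vote settles on "brown". A real pipeline would also need to handle words that one scan dropped or split, which this sketch deliberately ignores.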

32. billyd - September 07, 2009 at 04:49 pm

In researching the mysteries behind an epistolary narrative by William Monahan I needed to find out more about a French aristocrat named Claude de Bourdeille, Count of Montresor. I was delighted to come across several copies of his memoirs in the Google Books library, but unfortunately the pages were often illegible and the series and volume information of the collections that the memoirs were published in was almost always wrong or missing. The only information the Google algorithms seem to get right is the date on which the digitization of the books took place.

33. jakevanderpuy - September 07, 2009 at 11:00 pm

@ Liam

Google patented a new infrared scanning process which allows them to more easily scan books. Quantity over quality is what they've been going for. Which makes sense when considering the sheer monumental size of the project facing them.

Can always go back and organize this 'problem' later right?

34. mgrochowalski - September 07, 2009 at 11:35 pm

This is the first I've heard of HathiTrust. I've looked for a bit, but there does not seem to be a way to download the full text of a book.
However, you can apparently download public domain and some Creative Commons books from Google.
I think the main problem is that the full texts of these still copyrighted works are legally locked to Google's servers.

35. barefootliam - September 08, 2009 at 08:37 am

@jakevanderpuy - commercial OCR software already corrects for the curvature of the page, and does a massively better job than Google's OCR software.

Yes, it's possible to scan millions of books again, although no doubt a few will have been lost, single copies no longer in good enough condition, acid paper that crumbled, fires, floods, and so forth. Yes, we all agree they are going for quantity over quality, that was the point of the article I think :-)

36. chinabuzz - September 08, 2009 at 02:27 pm

Prof. Nunberg misses the point of Google's Book Search entirely: to offer the greatest number of books to the greatest number of users. He has the luxury of accessing the University of California at Berkeley's libraries, along with public libraries in the Bay Area, and most likely, other university libraries within the UC system.

However, as a historian that lives outside of the U.S. in a non-English-speaking country, I am entirely cut off from such resources, except via the Internet. Google Book Search saved me hundreds, if not thousands of dollars in the purchase of new and used books plus shipping, and tens if not hundreds of hours of work by myself or a research assistant in trying to collect the same information.

Google Book Search is not perfect, nor would it or any other service be. But Google is spending its own money to do a job that the university libraries themselves have not: make their catalogs available beyond their campus boundaries. Even an imperfect system will offer users around the world far more information than they would have had without it.

37. trishjw - September 09, 2009 at 12:13 am

You bring up numerous points that Google should take to heart, not just correcting misdates. A subject can be very difficult to find when there are 4000 or 10,000 items and the most recent are the only ones they show, not those from 20 years ago or longer. I find a lot of repetition of the same item, even under the same URL. These are not helpful for students, and harried workers are much more likely to take one or two of the most recent items and miss those that are much better but a little older or further into the 10,000 items. Another aspect that Google and all Internet searches eliminate is what a couple of writers mentioned--one in the New Yorker and one in the New York Times, just over the past 12 months. That's the "serendipity" that was so available with libraries--especially under the Dewey Decimal System, even more than the Library of Congress. There have been many times that I searched for a specific author or reference book and found 3-5 others that were just as good or better than the one on the list. That's difficult to do now if no one knows the name of the author or the title of the book or magazine that is an additional reference. This serendipity can also show one subjects that he/she never knew about and delves into just out of curiosity. The items on Google lead much more easily to copy work than to original writing or research. I hope someone finds a way to mix and match, as well as making available things that came from authors weeks ago, months ago or years ago. Newest is not necessarily the best but no one learns that from the Internet--Google or otherwise. I hope someone can be creative and find it out.

38. 11134078 - September 11, 2009 at 10:46 am

Searching OCLC in a really serious way is damned near impossible. Anybody here ready to join me in defending good old LC subject headings (and the sooner we get away from the current substitute, the better)?

39. hjharris - September 20, 2009 at 04:42 pm

I don't get this "disaster" argument. I've been using GB for almost three years, and It Has Transformed My Life. Like several other contributors to this discussion, I'm not based in the US and my university library doesn't come anywhere near the size and quality that _some_ US scholars at favored research universities take for granted. So for me GB is almost pure gain.

I have never bothered much about the quality of the metadata, because, like (I imagine) most users, I rely on keyword searching and then just _looking_ at an item in order to find whether, or for what purposes, something thrown up by a search is going to be useful. This is not too dissimilar to that excellent old research method, wandering along the shelves of a library where Dewey or the LC system tells you you're most likely to strike gold, pulling a book out if it looks interesting, having a quick glance inside, and then either adding it to your pile or moving on.

OK, the information GB posts on the "About" page of a book isn't always good enough to allow you to create an accurate citation to it, but there's usually a simple way around this problem: you look at the book's title page instead.

OK, the scanning could be better, but it's mostly of usable quality (even the Plain Text version, which is so handy for copying & pasting chunks of a work straight into one's own research notes), and one can always report a bad page. In my experience, there are human beings at the end of the wire at Google, who respond to what one has to say -- sometimes of course with a formulaic answer, but where appropriate with _action_.

I could go on. I give thanks to Google every day, and offer up silent prayers whenever I come across something really fascinating that I _know_ I could not have found if I'd relied on the old combination of good catalogue + research library + shelf-skimming.

In fact, I'm _so_ grateful that when I wondered what to do with the copyright of my first (1982) book, which had reverted to me when the US university press selling it had got rid of a couple of thousand hardback copies, and neither it nor anybody else was interested in a paperback issue, there was only one answer: let Google Books make it all available, free, to any user. I might not make any money from it, but I wasn't making any anyway, and at least this way I might pick up a few more readers.

I'm a great fan of the Internet Archive, Open Library, etc., too (notably because they give non-US users access to material published between the late 1860s and the 1923 copyright cut-off, which is public domain in the US and freely available, but which for some complicated legal reason GB won't make available except as snippets to non-US readers), but they don't compare with GB, either in scale and scope or the ability to do free-text searches of all items' content. Until they or some other non-profit digital library does do this, Google Books will remain my everyday library of choice.

40. wkeane - September 23, 2009 at 12:12 pm

Here's another Google glitch: at least in the cases I've checked, fold-out maps are scanned still folded, and thus utterly worthless.

41. dyserenity - October 11, 2009 at 02:04 pm

I had a similar thing happen to me a couple weeks ago. I wrote about it on my blog (dyserenity.livejournal.com) and here's that post copy/pasted:

This article brings up some interesting points on the digitization of print media. Google is scanning books, Amazon and other online retailers are selling digital books, but what happens with digitization? How should content be coordinated to prevent the loss of other types of data about the book, known as metadata?

Data is difficult, to say the least, to transfer from print to digital text. Either it's all entered manually--which would take absurd amounts of time and manpower--or an algorithm transfers the text to data. The article above talks about Google's difficulty with books' metadata. The years on many books are wrong, along with their categorization. One example, The Mosaic Navigator: The Essential Guide to the Internet Interface, is dated 1939 and Google says its authors are Sigmund Freud and Katherine Jones. Mosaic is an internet browser from the 90's, and this book was published way after Freud died.

At the beginning of the semester, I ran into a mismatched metadata problem on Amazon. I have a Kindle, so I thought it would be awesome to get my textbooks on it. I needed Theories of Personality by Richard Ryckman, Ninth Edition. Another textbook of similar title and topics exists: Theories of Personality by Duane P. Schultz and Sydney Ellen Schultz, Ninth Edition. From the Ryckman hardcover page, the "Buy the Kindle version of this book" link led to a digital version of the Schultz book.

Of course, I realized these were different books with different content after I made my purchase. I contacted Amazon and they refunded my money and fixed the link between the unrelated books. But Google doesn't have customer motivation like this; customers can speak with their money to Amazon. Google has the jump on any other company who happens to be digitizing books. They've been doing it for more than five years. No other company is going to be able to break this monopoly.

But without a customer push, how should Google resolve these problems?

42. juanz - October 19, 2009 at 10:15 pm

For another example of a massive digitization disaster, see The Universal Digital Library, at http://www.ulib.org/, hosted by Carnegie Mellon University. They thought that there was no need for cataloging. Go there and see the consequences.

44. 11134078 - December 21, 2009 at 04:52 pm

You either use LC subject headings or you don't. In the first case, you have decent bibliographic control. In the second case, you don't.

45. spinoza - December 21, 2009 at 09:54 pm

Google Books is far and away the most significant revolution in access to historical information--and knowledge--that has occurred since, well, the Internet itself (I was going to say Gutenberg, but...). Google Books is the democratizing of scholarly communication on a level that has never occurred before. It's important to recognize that those screaming "disaster" are doing so from the Ivory Towers of American academe, in itself a highly restrictive social realm if there ever was one. I find it ironic that one of the most vocal critics of Google Books is the director of the Harvard University Library, the most restrictive and elitist academic library in the world (though it should be emphasized that his predecessor signed Harvard up as a valuable contributor to Google Books). Access to this library is reserved only for students, faculty, and staff of Harvard; an outsider can't even enter the building, let alone gain access to its collections. This means that its some 15 million volumes can only be used by the few thousand people associated with Harvard.

Over the past 15 years the NSF has distributed countless millions of dollars in grants to academic institutions for digital library research and development, and there is embarrassingly little to show for it in terms of tangible products for scholars. Several attempts to create digital library collections by academic institutions and non-profit organizations have thus far led to only modest results. Google Books, on the other hand, has completely transformed scholarship for individuals like myself (and the international scholars commenting here), it has made extremely rare volumes accessible in a way we could only have dreamed of a few years ago. The metadata in Google Books is far from perfect, but I have been able to find what I need through structured searching.

Professor Nunberg should first take a close look at the metadata in OCLC's Worldcat before criticizing Google's efforts. For example, a record that I just accessed this evening was created by participating member Yale University: the German book Text, Geschichte, Anthropologie, which focuses on the history of German and French anthropology, has the subject: Spanish literature--History and criticism!!! The Worldcat catalog is filled with this kind of silliness because of the endemic mismanagement of technical services units in academic libraries, something that has been going on for at least 20 years and has gotten pronouncedly worse over the past five years. Many factors--poor staffing, badly managed conversion projects, inadequate technology, poor management, and so on--have led the Worldcat catalog of books to be filled with the same kinds of egregious errors that Nunberg points out with Google Books. If I had a choice, I would without hesitation prefer the Google Book database for searching over Worldcat.

46. emeiselm - January 01, 2010 at 01:15 pm

One further point - we shouldn't assume the catalog will remain in a static condition. It's likely they will have some way to flag mistakes, crowdsource corrections and improve metadata to a point unimagined by current libraries.

