

Old Dominion U. Researchers Ask How Much of the Web Is Archived

July 6, 2011, 3:21 pm

Researchers at Old Dominion U. in Virginia are trying to figure out what percentage of the Web is archived by sampling from four different sources.

Researchers at Old Dominion University in Virginia are trying to figure out how much of the public Web is archived and who is storing it, as part of a larger effort to preserve the digital record.

Michael L. Nelson, a computer-science professor, has been working with professors and students since September to determine how much of the Web’s history has been preserved in Internet databases around the world.

Mr. Nelson’s team estimated what percentage of Web pages are archived by sampling 4,000 URI’s, or uniform-resource identifiers. A URI is a label for a specific Web-page address or name. The researchers used Memento, a browser plug-in they developed in 2009, to find old versions of the pages across various Internet archives.

The URI’s were compiled from four sources: the search-engine caches of Google, Bing, and Yahoo!; a Web directory called the Open Directory Project; a link-sharing service called Delicious; and a Web-address-shortening service called Bitly.

The report showed that 35 percent to 90 percent of Web pages have at least one archived copy and that the chance of a page being archived depended on the source. For instance, URI’s gathered from Delicious were much more likely to be archived than Bitly URI’s, but the reason for that is not entirely clear. Mr. Nelson plans to continue the project, as he felt that no “final answer” had yet been reached.
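To make the per-source comparison concrete, here is a minimal sketch of how such a figure could be computed once each sampled URI has been looked up. It is not the researchers’ code: the function name, the tuple layout, and the sample data are all hypothetical, and the lookup step itself (asking the archives how many copies they hold) is assumed to have already happened.

```python
from collections import defaultdict


def archived_share_by_source(samples):
    """Return, for each source, the share of sampled URIs with >= 1 archived copy.

    `samples` is an iterable of (source, uri, archived_copies) tuples.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)    # URIs sampled per source
    archived = defaultdict(int)  # URIs with at least one archived copy
    for source, _uri, copies in samples:
        totals[source] += 1
        if copies >= 1:
            archived[source] += 1
    return {source: archived[source] / totals[source] for source in totals}


if __name__ == "__main__":
    # Hypothetical lookups, not data from the study.
    samples = [
        ("delicious", "http://example.com/a", 3),
        ("delicious", "http://example.com/b", 1),
        ("bitly", "http://example.org/c", 0),
        ("bitly", "http://example.org/d", 2),
    ]
    for source, share in archived_share_by_source(samples).items():
        print(f"{source}: {share:.0%} of sampled URIs have an archived copy")
```

Keeping a separate tally for each source matters here because, as the report found, the answer changes a great deal depending on where the URI’s came from.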

Alexis Rossi, the Web-collections manager at the Internet Archive, found the university’s efforts interesting, but she wondered whether it was even possible to accurately assess archival rates in a continually changing landscape.

“It’s such a moving target—the Web is expanding all the time,” Ms. Rossi said. Internet Archive was one of several archives used in the study and has been preserving the Web since 1996.

“People are coming to the realization that if nobody saves the Internet, their work will just be gone,” Ms. Rossi said. She also said the project may shed light on the efficacy of Web archiving as libraries and Internet users begin to think more about preserving the Web.

For Mr. Nelson, the study is another step toward creating a browsing experience that links the past to the present, one where users can replay events as they unfolded, such as media coverage of Hurricane Katrina in 2005 or the Virginia Tech shootings in 2007.

“You relive the experience in a way that a summary page can’t even begin to capture,” Mr. Nelson said, imagining a day when such historical searches become common.

Scott G. Ainsworth, the project’s lead student researcher, compared saving old Web pages to the historical preservation of old Sears catalogs. “You never know what’s going to be important in 100 or 150 years,” he said.


  • Matt Hauger (http://www.matthauger.com)

    The reason for the bit.ly / Delicious discrepancy seems clear enough. Delicious-stored URIs are, by definition, stuff people want to save and return to later. That’s what the service is for. It seems likely that sites bookmarked in Delicious would be ideal sites to archive, as well. bit.ly URIs, meanwhile, are disposable. They’re throwaway links, tweeted once and soon forgotten. Not exactly archive-worthy material.

    To step back for a moment, though, are we sure we really want to record the whole web? What are the drawbacks to that sort of persistent cultural memory? Jeffrey Rosen pondered these questions in a New York Times article last year: ‘The Web Means the End of Forgetting.’ As he notes, it’s hard to overcome childhood mistakes when the digital evidence never dies.

  • bryanalexander

    It’s good to see the Internet Archive mentioned here. They’ve done tremendous, unsurpassed work in archiving the Web.