The Chronicle of Higher Education: Articles

INFORMATION TECHNOLOGY


March 6, 1998

In Attempting to Archive the Entire Internet, a Scientist Develops a New Way to Search It

Non-profit group uses 'data mining' in effort to preserve World-Wide Web content

By JEFFREY SELINGO

SAN FRANCISCO

For 110 years, Building 116 served unobtrusively as a general store and as quarters for non-commissioned officers stationed at the Presidio, the U.S. Army base south of the Golden Gate. Now that the base has been decommissioned, Building 116 stands out.

Its red-tile roof and cream-colored shiplap siding have been restored, for one thing. And it's one of the few buildings still in use on the 1,480-acre property, which became a national recreation area in 1995. What makes Building 116 unique, though, is what's inside -- a virtual preservation project that aims to create a complete archive of the Internet.

In a back room, a buzz comes from giant computers that are "data mining" the World-Wide Web and Usenet discussion groups, finding pages on the computer network and recording them on a nearby digital-tape machine. The computers take complete snapshots of the Web every two months, allowing users to find pages long after their owners have taken them down and let their hyperlinks lapse. So far, the archive has compiled eight terabytes of data -- the equivalent of 800,000 books -- and has recorded at least three snapshots of more than 500,000 Web sites.

Inside those virtual books is the patchwork history of ordinary people: pages of college students long graduated; Web sites of political campaigns since forgotten; early, awkward versions of sites that are now well-known; infamous sites that held our attention for weeks, such as the Heaven's Gate cult's page.

Just as the Internet has allowed all kinds of ordinary people to become their own publishers, it has allowed a computer scientist named Brewster Kahle to create the non-profit Internet Archive.

Mr. Kahle decided to save the Internet's contents for posterity after selling his previous venture, the Wide Area Information Server, to America Online for $15-million. The system, which he invented, makes it easier to search electronic data bases.

Why an archive? "We need to preserve this heritage," says Mr. Kahle, an affable and enthusiastic 37-year-old who is a graduate of the Massachusetts Institute of Technology. "Or one day, digital anthropologists will wonder if we ever learned anything from the history of other inventions. Remember, nobody recorded television in the early days."

Once he started collecting all that information, Mr. Kahle says, he realized how difficult finding things on line was becoming, with the number of Web sites doubling every six months even as other material falls into neglect. So he set about creating a Web search engine using the technology he developed to manage the massive amounts of data he was collecting for his quirky history project.

The result is Alexa, a search engine operated by Alexa Internet, the for-profit company that is part of the Internet Archive. "This will change the way that researchers use the Internet," Mr. Kahle says.

Alexa is software that can be retrieved free from the company's Web site (http://www.alexa.com) and added to a Web browser. Unlike other search engines, such as Yahoo! and Excite, it doesn't rely on word searches. Instead, it watches where its users go on the Internet, and then records that information in a central data base. Based on that information, Alexa can tell a user the most popular paths that other Alexa users have taken from the site the user is visiting at a given time.

It can also suggest other sites offering related material. The top 10 sites pop up in a thin, gray bar near the browser and change as the user moves from page to page.

For example, from the "Perseus Project" (http://www.perseus.tufts.edu), a site with an extensive collection of ancient Greek texts in translation, Alexa points the user to sites about classicists and Mediterranean archaeologists at the University of Michigan, sites about publishers and journals available electronically, sites about Hellenistic linguistics, and to "Project Gutenberg," an Internet producer of free electronic texts.
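The path-recording idea described above can be illustrated with a minimal sketch. It is not Alexa's actual implementation, which the article does not describe in detail. It assumes that each user's visit is available as an ordered list of site names, and the function names `build_transitions` and `top_next_sites` are hypothetical.

```python
from collections import Counter, defaultdict


def build_transitions(sessions):
    """Count how often users moved directly from one site to another.

    `sessions` is a list of browsing paths, each an ordered list of
    site names (an assumed input format, for illustration only).
    """
    transitions = defaultdict(Counter)
    for path in sessions:
        # Pair each site with the one visited immediately after it.
        for src, dst in zip(path, path[1:]):
            if src != dst:
                transitions[src][dst] += 1
    return transitions


def top_next_sites(transitions, site, limit=10):
    """Return up to `limit` sites most often visited right after `site`."""
    counts = transitions.get(site, Counter())
    return [dst for dst, _ in counts.most_common(limit)]
```

A sketch like this also shows why the recommendations can shift over time: the ranking depends only on the recorded paths, so new browsing patterns change the counts and therefore the list.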

Since October, more than 200,000 people have downloaded Alexa. The service, which Mr. Kahle hopes will soon be fully supported by advertising revenue, is not yet turning a profit. But by the end of the year, he expects it to have a million users. Eventually, he hopes to use the profits from Alexa to finance the gathering of data for the archive.

The advantage of Alexa as a search engine is that it "attempts to be an objective source" for people seeking information. Where conventional links are chosen by a page's creator according to what the creator knows and prefers, Alexa also brings other Web users' knowledge and preferences to bear. The sites recommended in a given search sometimes change, depending on the surfing patterns of Alexa users.

The system has its oddities. If users frequently traveled from the "Perseus Project" to, say, The New York Times, the newspaper could be added to the top-10 list of an Alexa user looking at the Perseus site, even though the only thing the two sites have in common is their users. In fact, such a situation has already occurred. From the Perseus site, Alexa suggests -- based on other users' habits -- visits to the sites of Franklin and Marshall College and Bates College. Alexa officials say students at the two colleges probably use the Perseus site in their classes.

"It's sometimes random and not always perfect," Mr. Kahle says. "But if researchers use a traditional search engine, they may miss some of the best sites." A search engine such as Excite, using the keywords "Greek texts" to find sites related to the "Perseus Project," turned up 268,057 matches. "With Alexa, you're bound to hit at least some of the top sites," Mr. Kahle adds.

Still, one needs a traditional search engine or a specific Web address to get started, Mr. Kahle acknowledges. And Alexa, unlike the Alta Vista search engine and others, can suggest links only to entire Web sites, not to specific pages within them.

"I don't think of it in the same way as a search engine -- it's a supplement," says Bruce Livett, a reader and deputy head of the biochemistry and molecular-biology department at the University of Melbourne, in Australia. "Alexa gives you relevant sites in the general sense, sites that you sometimes miss because other search engines depend on specific keywords you enter."

Dr. Livett, who has been using Alexa since October, surfs the Web to keep up with the research work of colleagues around the world. "It's competitive work, and I need to know what they're doing." Alexa, he says, has alerted him to research sites that did not turn up in searches using Excite and Anzwers, a search engine designed for Web users in Australia and New Zealand.

Part of Alexa's appeal, he says, is access to the Internet Archive. When Alexa users get a dreaded "404 -- file not found" error, they can click on a button on Alexa's tool bar and pull up the missing page from the archive. Using the archive, Dr. Livett found an audio interview he needed that had been removed from a Web site.
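The fallback from a missing live page to an archived copy might look roughly like the following sketch. It is only an illustration: the in-memory dictionaries stand in for the live Web and the tape archive, and the function name `fetch_with_fallback` and the data layout are assumptions, not details from the archive itself.

```python
def fetch_with_fallback(live_pages, archive, url):
    """Return (source, content) for `url`, preferring the live page.

    `live_pages` maps URL -> content for pages still on line.
    `archive` maps URL -> list of (capture_date, content) tuples,
    with dates as ISO strings such as "1997-11-02" (assumed layout).
    """
    page = live_pages.get(url)
    if page is not None:
        return "live", page

    captures = archive.get(url, [])
    if captures:
        # ISO date strings sort chronologically, so max() picks the
        # most recent capture.
        date, content = max(captures, key=lambda capture: capture[0])
        return "archived " + date, content

    return "missing", None
```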

Alexa is "an immediate use for the archive," says Mr. Kahle, adding that the archive is the component that he expects will eventually separate his search engine from the rest of the pack. Alexa also offers a direct link to the Encyclopaedia Britannica Web site, allowing users to retrieve reference information without leaving the Web page they are viewing at the time.

As more people begin to use Alexa and the archive, however, tricky questions about copyright and privacy have begun to crop up. An e-mail discussion list for Web publishers recently included a heated debate about copyright issues surrounding old newspaper articles that are part of the archive.

The data-mining computers skip Web pages that require passwords, as well as Web sites protected by the Standard for Robot Exclusion, which blocks search engines from copying pages or directories. Still, some Web publishers said in the e-mail discussion that Alexa officials should be asking on-line newspapers and journals if they want to be part of the archive, instead of forcing them to block Alexa from copying pages.
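For readers unfamiliar with the Standard for Robot Exclusion, the sketch below shows how a well-behaved crawler can check a site's robots.txt rules before copying a page. It uses Python's standard `urllib.robotparser` module. The crawler name `example-archiver`, the host `news.example`, and the sample rules are hypothetical; the article does not say what name the archive's crawler announces.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to keep one crawler
# out of its back issues and every crawler out of a private area.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /archive/

User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt, user_agent, url):
    """Return True if the rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_archive(ROBOTS_TXT, "example-archiver",
                  "http://news.example/archive/1997.html"))
print(may_archive(ROBOTS_TXT, "example-archiver",
                  "http://news.example/index.html"))
```

The example also makes the publishers' complaint concrete: under this convention a site is copied unless it publishes a rule saying otherwise.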

How the archive will be used in the long term is not clear. Mr. Kahle often mentions the early days of television, when programs were broadcast live and recording technology was primitive. "When it comes to a point where users have a camcorder recording the Net, then the archive won't be worth it," he says.

In Building 116, the archive is stored in a digital-tape library that looks like a vending machine. The tapes currently have the capacity to hold 20 terabytes of data in all, about as much information as is in the Library of Congress. So much content is being added to the Internet that the archive grows by about a terabyte of data each month. The data-mining computers are able to adjust their site visits to concentrate on those that change most frequently. They will come upon a site, however, only if Alexa users have visited it, if anyone else on the Web has linked to it, or if it is listed with a directory service.
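The figures in the article allow a rough estimate of how long the current tapes will last. The sketch below assumes that growth stays constant at about one terabyte a month, which is a simplification the article does not claim; the function name `months_until_full` is hypothetical.

```python
def months_until_full(capacity_tb, stored_tb, growth_tb_per_month):
    """Estimate months until storage fills, assuming linear growth."""
    if growth_tb_per_month <= 0:
        raise ValueError("growth must be positive")
    remaining_tb = max(capacity_tb - stored_tb, 0)
    return remaining_tb / growth_tb_per_month


# Article figures: 20 TB capacity, 8 TB stored, about 1 TB added a month.
print(months_until_full(20, 8, 1))  # 12.0
```

If the Web keeps growing faster than that, the tapes would fill sooner than this estimate suggests.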

While the archive has been able to keep up with textual information on the Internet -- it is complete from October 1996 to the present -- the effort to collect images is running a few months behind.

Through Alexa, the archive receives about 14 requests for old pages every second. Not bad, its founder says, when one considers that only about 9,000 people visit the San Francisco Public Library on an average day.

Alexa also helps researchers by listing facts about the site they're visiting: the address of the individual, company, or other organization that owns the server on which the site is located; how many people have visited the site; how frequently the site is updated; how fast its computers are; and how many pages the site contains. Alexa also allows users to vote for their favorite sites and keeps a running total on each site.

Mr. Kahle says Alexa does not keep individual statistics on its users. Although the search engine tracks the paths of users as they jump from site to site, it does not record users' names. "We don't care who you are," he says. "We just care what path you take."

Mr. Kahle dreams that Alexa could become as popular -- and as profitable -- as search engines like Yahoo! and Excite. And the Internet Archive, he says with enthusiasm, could become part of a large research library, although he's not sure how. "I don't think about the details," he says. "That's why we're doing something now that others thought was impossible, or even crazy."


Copyright (c) 1998 by The Chronicle of Higher Education
http://chronicle.com
Date: 03/06/98
Section: Information Technology
Page: A27