Facing Flood of E-Mail, Archives Seeks Help From Supercomputer Researchers
By FLORENCE OLSEN
For a year now, the National Archives and Records Administration has been preparing for President Clinton's moving day. When the President leaves the White House, tens of millions of e-mail messages from his administration will move into the archives for posterity.
Archives officials say they're still not ready. Up until now, the archives has been processing only about 50,000 electronic records a year, most of which have been data-base files. "But there's no way we can expand our current technology to deal with tens of millions of records," says Kenneth Thibodeau, the archives' director of electronic-records programs.
What to do? "It looked to us like a research problem," says Mr. Thibodeau. And researchers at the University of California's San Diego Supercomputer Center agreed.
Now, working with grants from the archives, researchers at the supercomputer center are helping archives officials prepare to handle the millions of electronic records that will be arriving in literally thousands of formats.
TOO LITTLE INFORMATION?
The General Accounting Office wants the National Archives and Records Administration to proceed immediately with a government-wide survey of agencies' current electronic record-keeping practices, rather than postpone the survey until the archives finishes an internal review of its records policies -- a project that archives officials say could take 18 to 24 months to complete.
Michael W. Jarvis, the principal author of a July G.A.O. report, says the nation's archivist has scant information about how many and what kinds of electronic records federal agencies are keeping. In a sample survey it conducted recently, the archives found that most federal employees need more policy direction and training to preserve electronic records that are of permanent value to historians and policy researchers.
Electronic record-keeping systems can solve the technical problems of maintaining documents of historical value, but the archives still has to rule on which types of records agencies need to capture and save, says Michael Tankersley, a senior staff attorney for Public Citizen, the public-interest group founded by Ralph Nader. This month Public Citizen lost a case in court against the archives, which it had sued for failing to protect electronic records that may have historical value. (See a story from The Chronicle, August 11.)
The archives' cautious approach to electronic records is "foot-dragging," says Mr. Tankersley, who contends that the agency has not been aggressive in confronting electronic-records issues "because institutionally it is more comfortable dealing with paper."
Dennis A. Trinkle, an assistant professor of history at DePauw University who is president of the American Association for History and Computing, agrees with Mr. Tankersley that the archives' current electronic-records policy doesn't do enough to satisfy the needs and desires of historians. "We're going to lose a lot of important materials, and we're going to be unable to reconstruct the past as fully as we would like to be able to," he says.
-- Florence Olsen
Using parallel-processing supercomputers and the Extensible Markup Language, a new formatting language that many experts see becoming the next dominant language of the World-Wide Web, the university researchers say they have solved the archives' most immediate problem -- dealing with the huge volume of e-mail messages from the White House.
In less than two days' time, they converted one million e-mail messages into a standard, transportable XML format, says Chaitanya Baru, a University of California at San Diego research scientist who is the leader of the supercomputer center's Data Intensive Computing Environments Group. Mr. Baru says the same or similar software and hardware could easily be used to convert the mass of e-mail records the archives is expecting from the White House.
Mr. Baru and other researchers also say that XML appears to solve the biggest problem associated with archiving electronic records, which is being able to read an electronic document long after the technology that created it becomes obsolete.
For the archives project, Mr. Baru and his colleagues devised their own "document-type definition" for e-mail -- essentially a standard way of describing the various parts of an e-mail document. That, or some other standard vocabulary for describing an e-mail document, would be needed for building a permanent e-mail archive accessible to the public.
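To make the idea concrete, here is a minimal Python sketch of turning one plain-text e-mail message into an XML document whose elements name the parts of the message. The element names (`email`, `headers`, `body`) and the function `email_to_xml` are illustrative assumptions; the article does not describe the document-type definition the San Diego researchers actually wrote, and this is not their software.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: these header names are chosen for illustration only.
HEADER_FIELDS = ("From", "To", "Date", "Subject")

def email_to_xml(raw: str) -> str:
    """Convert one plain-text e-mail message into a small XML document."""
    msg = message_from_string(raw)
    root = ET.Element("email")
    headers = ET.SubElement(root, "headers")
    for name in HEADER_FIELDS:
        value = msg.get(name)
        if value is not None:
            # Each header becomes its own element, e.g. <subject>...</subject>
            ET.SubElement(headers, name.lower()).text = value
    body = ET.SubElement(root, "body")
    payload = msg.get_payload()
    # Multipart messages return a list; this sketch handles plain text only.
    body.text = payload if isinstance(payload, str) else ""
    return ET.tostring(root, encoding="unicode")
```

Because every message ends up with the same named parts, any later program that understands the vocabulary can read the archive without the e-mail software that produced the originals.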
In the future, government agencies might convert their e-mail messages and word-processing documents into an XML format before those records ever arrive at the archives, Mr. Baru says. Many technology companies, including the Microsoft Corporation, believe that XML will play a prominent role in the future of the Web and are including XML in the current or upcoming releases of their Web browsers, word processors, and e-mail programs.
For example, Mr. Baru says, Microsoft could even build into future versions of its word-processing program a feature that would let officials spot-check a document to ensure that it conformed to a government-approved format for archival purposes.
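A spot-check of that kind might look something like the following Python sketch, which reports which required parts are missing from an XML document. The list of required elements and the function `missing_elements` are hypothetical; no government-approved format is described in the article, and a real checker would validate against a full document-type definition or schema rather than a short list of paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical requirements, invented for illustration.
REQUIRED_PATHS = ("headers/from", "headers/date", "body")

def missing_elements(xml_text: str) -> list[str]:
    """Return the required parts that are absent; an empty list means the check passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # A document that is not well-formed XML fails before any other test.
        return ["<well-formed XML>"]
    return [path for path in REQUIRED_PATHS if root.find(path) is None]
```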
Most of the digital-records collections that the archives is preparing to receive will be far more complex than the White House collection, Mr. Thibodeau says. Within the next couple of years, the archives must be ready to accept -- in electronic format -- a large volume of State Department diplomatic messages, the most-used document series in the archives, starting with about one million such messages for each year from 1972 to 1975. And military-personnel case files that the archives will receive as scanned documents from the Defense Department contain at least 3,000 different types of documents.
Faced with such challenges, the archives asked the San Diego research group to try its archiving methods in a second test, which included geographic-information-system data, office-automation files from Congress, a collection of scanned images of museum art works, automated patent-application case files, and data bases spanning 30 years of government activity -- "as large a variety as we could put together for test purposes," Mr. Thibodeau says.
Once again the San Diego research scientists were able, in essence, to "wrap" the different electronic records inside XML documents that the computer could process and that a person could read as if they were the original records.
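One common way to "wrap" a record of any type is to store its bytes, encoded as text, inside an XML element alongside a description of what the record is. The Python sketch below shows that general technique under stated assumptions: the element and attribute names, the use of base64 encoding, and the functions `wrap_record` and `unwrap_record` are all illustrative, and the article does not say how the San Diego group's wrapper was built.

```python
import base64
import xml.etree.ElementTree as ET

def wrap_record(data: bytes, original_format: str, title: str) -> str:
    """Wrap arbitrary record bytes in an XML document that describes them."""
    root = ET.Element("record", {"format": original_format})
    ET.SubElement(root, "title").text = title
    content = ET.SubElement(root, "content", {"encoding": "base64"})
    # base64 turns any bytes into plain ASCII text that is safe inside XML.
    content.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")

def unwrap_record(xml_text: str) -> bytes:
    """Recover the original bytes from a wrapped record."""
    root = ET.fromstring(xml_text)
    return base64.b64decode(root.find("content").text or "")
```

The wrapper keeps the description readable by people and programs, while the original record can be recovered byte for byte with `unwrap_record`.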
Mr. Thibodeau says he feels more optimistic today about solving the technical problems of archiving than he has felt in 25 years as an archivist. "I'm convinced it's cheaper to save electronic information" than it is to save paper, he says.
Most of the archives' electronic records are stored on magnetic-tape cartridges whose storage capacity doubles with each new generation of tape. "We have a quarter-billion-dollar building that opened in 1993," Mr. Thibodeau says. Used primarily for storing paper and microfilm records, it will probably fill up in the next 10 years. "But we're not worried about running out of space for electronic records," he says.