Dumped On by Data: Scientists Say a Deluge Is Drowning Research

February 10, 2011

Scientists are wasting much of the data they are creating. Worldwide computing capacity grew at 58 percent every year from 1986 to 2007, and people sent almost two quadrillion megabytes of data to one another, according to a study published on Thursday in Science. But scientists are losing a lot of the data, say researchers in a wide range of disciplines.

In 10 new articles, also published in Science, researchers in fields as diverse as paleontology and neuroscience say the lack of data libraries, insufficient support from federal research agencies, and the lack of academic credit for sharing data sets have created a situation in which money is wasted and information that could reveal better cancer treatments or the causes of climate change goes by the wayside.

"Everyone bears a certain amount of responsibility and blame for this situation," said Timothy B. Rowe, a professor of geological sciences at the University of Texas at Austin, who wrote one of the articles.

A big problem is the many forms of data and the difficulty of comparing them. In neuroscience, for instance, researchers collect data on scales of time that range from nanoseconds, if they are looking at rates of neuron firing, to years, if they are looking at developmental changes. There are also differences between the kinds of data that come from optical microscopes and those that come from electron microscopes, and between data on a cellular scale and data from a whole organism.

"I have struggled to cope with this diversity of data," said David C. Van Essen, chair of the department of anatomy and neurobiology at the Washington University School of Medicine, in St. Louis. Mr. Van Essen co-authored the Science article on the challenges data present to brain scientists. "For atmospheric scientists, they have one earth. We have billions of individual brains. How do we represent that? It's precisely this diversity that we want to explore."

He added that he was limited by how data are published. "When I see a figure in a paper, it's just the tip of the iceberg to me. I want to see it in a different form in order to do a different kind of analysis." But the data are not available in a public, searchable format.

Ecologists also struggle with data diversity. "Some measurements, like temperature, can be taken in many places and in many ways," said O.J. Reichman, a researcher at the National Center for Ecological Analysis and Synthesis, at the University of California at Santa Barbara. "It can be done with a thermometer, and also by how fast an organ grows in a crayfish" because growth is temperature-sensitive, said Mr. Reichman, a co-author of another of the Science articles.

A Big Success Story

The situation criticized in the Science articles contrasts with the big success story in scientific data libraries, GenBank, the gene-sequence repository, said Mr. Reichman and several other scientists. GenBank created a common format for data storage and made the data easy for researchers to access. But Mr. Reichman added that GenBank did not have to deal with the diversity issue.

"GenBank basically had four molecules in different arrangements," he said. "We have way more than four things in ecology," he continued, echoing Mr. Van Essen's lament.

But even gene scientists today say they are struggling with the many permutations of those four molecules. In another Science article, Scott D. Kahn, chief information officer at Illumina, a leading maker of DNA-analysis equipment, notes that output from a single gene-sequencing machine has grown from 10 megabytes to 10 gigabytes per day, and 10 to 20 major labs now use 10 of those machines each. One solution being contemplated, he writes, is to store just one copy of a standard "reference genome" plus mutations that differ from the standard. That amounts to only 0.1 percent of the available data, possibly making it easier for researchers to store the information and analyze it.
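
The reference-genome idea can be illustrated with a toy sketch. It is not Mr. Kahn's method or any real genomics format: the function names and the short sequences are invented for illustration, and it assumes equal-length sequences that differ only by single-letter substitutions (real genomes also have insertions and deletions).

```python
# Toy illustration of "reference plus differences" storage.
# Assumption: sequences have equal length and differ only by substitutions.
# The names encode_variants/decode_variants are hypothetical.

def encode_variants(reference: str, sample: str) -> list[tuple[int, str]]:
    """Keep only the positions where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles equal-length sequences only")
    return [
        (position, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


def decode_variants(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the full sample from the shared reference and its differences."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"   # stored once, shared by every sample
sample = "ACGTTCGTAA"      # one individual's sequence
variants = encode_variants(reference, sample)
restored = decode_variants(reference, variants)
```

Here only two (position, letter) pairs are kept for the sample instead of all ten letters, and the full sequence can still be rebuilt from the shared reference.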

To cope with data diversity, Mr. Reichman said scientists should develop a common language for tagging their data. "If you record data from a particular location, the tags about that location (latitude and longitude, for instance) need to be consistent from researcher to researcher," he said. Ecology has grown into a relatively idiosyncratic science, and all researchers have their own methods, so a common language will require a culture shift. "It's become more urgent to do this because of the pressing environmental questions, like the effects of climate change, that we are being called on to answer," he said. "And the ability to access more than one set of measurements or interactions will make the science better."
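
What consistent location tags might look like in practice can be sketched briefly. This is an assumption-laden illustration, not a standard that ecologists use: the alias lists, the LocationTag type, and normalize_location are all hypothetical, and the coordinates are made up.

```python
from dataclasses import dataclass

# Hypothetical spellings different researchers might use for the same tag.
LATITUDE_KEYS = ("latitude", "lat")
LONGITUDE_KEYS = ("longitude", "long", "lon", "lng")


@dataclass(frozen=True)
class LocationTag:
    """One agreed-upon form: decimal degrees, fixed field names."""
    latitude: float
    longitude: float


def _find(record: dict, keys: tuple[str, ...]) -> float:
    """Look up the first matching alias, ignoring letter case."""
    lowered = {name.lower(): value for name, value in record.items()}
    for key in keys:
        if key in lowered:
            return float(lowered[key])
    raise KeyError(f"none of {keys} found in record")


def normalize_location(record: dict) -> LocationTag:
    """Map a researcher's own field names onto the shared tag."""
    latitude = _find(record, LATITUDE_KEYS)
    longitude = _find(record, LONGITUDE_KEYS)
    if not -90 <= latitude <= 90:
        raise ValueError("latitude out of range")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude out of range")
    return LocationTag(latitude, longitude)


# Two researchers, two labeling habits, one resulting tag.
first = normalize_location({"Lat": "34.41", "Long": "-119.85"})
second = normalize_location({"latitude": 34.41, "longitude": -119.85})
```

Once both records are mapped to the same tag, measurements from different researchers can be matched by location without guessing at each person's labels.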

Another factor that makes developing shared-data libraries urgent is that many scientists now store their own data. "And when they retire or die, their data goes with them," said Mr. Rowe. In his field, which uses three-dimensional-imaging machines like CT scanners to analyze fossils, the first people to do such work have already left, so there has already been a tremendous loss of data.

There is a financial cost to this, he added. "It costs money to do a CT scan, and the National Science Foundation pays for that with a grant. But if that scan isn't curated, and disappears when the scientist retires or forgets about it, then the next scientist asks the NSF for money to do it again. That's just a waste," he said.

In all of the papers, scientists cited examples of small libraries of shared data that could be scaled up. Mr. Rowe helped to develop a project called DigiMorph, which contains three-dimensional scans of about 1,000 biological specimens and fossils. Those data sets have been viewed by about 80,000 visitors, he said, and have been used in 100 scientific papers. Sharing the data, he said, brings the cost to researchers, and their grant-giving agencies, way down. Another project, the Neuroscience Information Framework, contains many more data sets and has been used by even more scientists.

Mr. Rowe thinks agencies like the NSF and the National Institutes of Health should get behind efforts like this to a much greater extent than they have done. "Right now they are financing data generation, but not the release of that data, or the ability of other scientists to analyze it. I think, with all respect, that they are really missing the boat."