April 20, 2014

Dumped On by Data: Scientists Say a Deluge Is Drowning Research

Scientists are wasting much of the data they are creating. Worldwide computing capacity grew at 58 percent every year from 1986 to 2007, and people sent almost two quadrillion megabytes of data to one another, according to a study published on Thursday in Science. But scientists are losing a lot of the data, say researchers in a wide range of disciplines.

In 10 new articles, also published in Science, researchers in fields as diverse as paleontology and neuroscience say the lack of data libraries, insufficient support from federal research agencies, and the lack of academic credit for sharing data sets have created a situation in which money is wasted and information that could reveal better cancer treatments or the causes of climate change goes by the wayside.

"Everyone bears a certain amount of responsibility and blame for this situation," said Timothy B. Rowe, a professor of geological sciences at the University of Texas at Austin, who wrote one of the articles.

A big problem is the many forms of data and the difficulty of comparing them. In neuroscience, for instance, researchers collect data on scales of time that range from nanoseconds, if they are looking at rates of neuron firing, to years, if they are looking at developmental changes. There are also differences between the kinds of data that come from optical microscopes and those that come from electron microscopes, and between data on a cellular scale and data from a whole organism.

"I have struggled to cope with this diversity of data," said David C. Van Essen, chair of the department of anatomy and neurobiology at the Washington University School of Medicine, in St. Louis. Mr. Van Essen co-authored the Science article on the challenges data present to brain scientists. "For atmospheric scientists, they have one earth. We have billions of individual brains. How do we represent that? It's precisely this diversity that we want to explore."

He added that he was limited by how data are published. "When I see a figure in a paper, it's just the tip of the iceberg to me. I want to see it in a different form in order to do a different kind of analysis." But the data are not available in a public, searchable format.

Ecologists also struggle with data diversity. "Some measurements, like temperature, can be taken in many places and in many ways," said O.J. Reichman, a researcher at the National Center for Ecological Analysis and Synthesis, at the University of California at Santa Barbara. "It can be done with a thermometer, and also by how fast an organ grows in a crayfish" because growth is temperature-sensitive, said Mr. Reichman, a co-author of another of the Science articles.

A Big Success Story

The situation criticized in the Science articles contrasts with the big success story in scientific data libraries, GenBank, the gene-sequence repository, said Mr. Reichman and several other scientists. GenBank created a common format for data storage and made it easy for researchers to access it. But Mr. Reichman added that GenBank did not have to deal with the diversity issue.

"GenBank basically had four molecules in different arrangements," he said. "We have way more than four things in ecology," he continued, echoing Mr. Van Essen's lament.

But even gene scientists today say they are struggling with the many permutations of those four molecules. In another Science article, Scott D. Kahn, chief information officer at Illumina, a leading maker of DNA-analysis equipment, notes that output from a single gene-sequencing machine has grown from 10 megabytes to 10 gigabytes per day, and 10 to 20 major labs now use 10 of those machines each. One solution being contemplated, he writes, is to store just one copy of a standard "reference genome" plus mutations that differ from the standard. That amounts to only 0.1 percent of the available data, possibly making it easier for researchers to store the information and analyze it.
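The "reference genome plus differences" idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: sequences are the same length, only single-base substitutions are recorded, and the names `Variant`, `encode`, and `decode` are hypothetical rather than taken from Mr. Kahn's article or any sequencing tool.

```python
from typing import List, Tuple

# Hypothetical record type: (position, alternate base). Substitutions only;
# insertions and deletions are outside the scope of this sketch.
Variant = Tuple[int, str]


def encode(reference: str, sample: str) -> List[Variant]:
    """Record only the positions where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def decode(reference: str, variants: List[Variant]) -> str:
    """Rebuild the sample by applying the stored differences to the reference."""
    bases = list(reference)
    for position, alternate in variants:
        bases[position] = alternate
    return "".join(bases)


# Made-up toy sequences, not real genomic data.
reference = "ACGTACGTACGTACGTACGT"
sample = "ACGTACCTACGTACGTACGA"

variants = encode(reference, sample)   # the only part stored per sample
restored = decode(reference, variants)  # full sequence recovered on demand
```

Because each sample is kept as a short list of differences, the storage needed per individual shrinks to the fraction of positions that actually vary, which is the saving the proposal is after.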

To cope with data diversity, Mr. Reichman said scientists should develop a common language for tagging their data. "If you record data from a particular location, the tags about that location—latitude and longitude, for instance—need to be consistent from researcher to researcher," he said. Ecology has grown into a relatively idiosyncratic science, and all researchers have their own methods, so a common language will require a culture shift. "It's become more urgent to do this because of the pressing environmental questions, like the effects of climate change, that we are being called on to answer," he said. "And the ability to access more than one set of measurement or interactions will make the science better."
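One way to picture consistent location tags is a small normalization step that maps each researcher's own field names onto a shared vocabulary. The sketch below is hypothetical throughout: the alias table, the function name `normalize_location`, and the sample records are invented for illustration and do not come from any existing ecological metadata standard.

```python
from typing import Any, Dict

# Hypothetical aliases that different researchers might use for the same tags.
ALIASES = {
    "lat": "latitude",
    "lat_dd": "latitude",
    "latitude": "latitude",
    "lon": "longitude",
    "long": "longitude",
    "lng": "longitude",
    "longitude": "longitude",
}


def normalize_location(record: Dict[str, Any]) -> Dict[str, float]:
    """Return only the location tags, renamed to the shared vocabulary.

    Values are assumed to be decimal degrees; other fields are ignored here.
    """
    out: Dict[str, float] = {}
    for key, value in record.items():
        name = ALIASES.get(key.strip().lower())
        if name is not None:
            out[name] = float(value)
    if set(out) != {"latitude", "longitude"}:
        raise ValueError("record needs both a latitude and a longitude tag")
    if not -90.0 <= out["latitude"] <= 90.0:
        raise ValueError("latitude out of range")
    if not -180.0 <= out["longitude"] <= 180.0:
        raise ValueError("longitude out of range")
    return out


# Two made-up records describing the same place with different tag names.
site_a = normalize_location({"Lat": "34.41", "Long": "-119.85", "temp_c": 17.2})
site_b = normalize_location({"LATITUDE": 34.41, "lng": -119.85})
```

Once both records use the same tag names and units, measurements from different researchers can be matched by location without hand-editing either data set.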

Another factor that makes developing shared-data libraries urgent is that many scientists now store their own data. "And when they retire or die, their data goes with them," said Mr. Rowe. In his field, which uses three-dimensional-imaging machines like CT scanners to analyze fossils, the first researchers to do such work have already left, so there has already been a tremendous loss of data.

There is a financial cost to this, he added. "It costs money to do a CT scan, and the National Science Foundation pays for that with a grant. But if that scan isn't curated, and disappears when the scientist retires or forgets about it, then the next scientist asks the NSF for money to do it again. That's just a waste," he said.

In all of the papers, scientists cited examples of small libraries of shared data that could be scaled up. Mr. Rowe helped to develop a project called DigiMorph, which contains three-dimensional scans of about 1,000 biological specimens and fossils. Those data sets have been viewed by about 80,000 visitors, he said, and have been used in 100 scientific papers. Sharing the data, he said, brings the cost to researchers, and their grant-giving agencies, way down. Another project, the Neuroscience Information Framework, contains many more data sets and has been used by even more scientists.

Mr. Rowe thinks agencies like the NSF and the National Institutes of Health should get behind efforts like this to a much greater extent than they have done. "Right now they are financing data generation, but not the release of that data, or the ability of other scientists to analyze it. I think, with all respect, that they are really missing the boat."


1. princeton67 - February 10, 2011 at 08:54 pm

Just a matter if time. There were no wires when Franklin, Volta, and Faraday worked with/on electrcity; there were no paved roads when cars were first being used; there were no exchanges when telephones were introduced; there was no internet protocol when computers started connecting. etc., etc., etc.
Produce: the propogation will come.

2. princeton67 - February 10, 2011 at 08:57 pm

And, as evidenced above, there were typos before the CHE finally engineered proofreading, either automated or retroactive, into its Comments site.

3. cdwickstrom - February 11, 2011 at 10:02 am

Moore's law continues to prevail, and we continue to try to stuff that memory full. T.S. Eliot was right. "Where is the wisdom that is buried in the knowledge? Where is the knowledge that is buried in the information?" (From the canto of "The Rock.") The following corollary would also seem appropriate: "Where is the information that is buried in the data?"

4. soonertulsa - February 11, 2011 at 10:55 am

The National Library of Medicine has already created standardized "tags" for biomedical research. They're called MeSH (MEdical Subject Headings) terms. Anyone can search the MeSH database at http://www.ncbi.nlm.nih.gov/mesh. MeSH is the list of terms or "tags" used to search the Medline database at Pubmed.gov.

5. davi2665 - February 11, 2011 at 11:09 am

It is not surprising that many scientists are overwhelmed with data. Too many researchers grind out endless mountains of numbers that can be generated by the extraordinary technologies of our era. And the graduate training that produces our scientists encourages this whiz-bang number crunching and technology. What is NOT fostered and encouraged is education that produces a breadth of knowledge that would allow an individual to integrate nano-scale measurements with developmental measurements extending over many years. The capacity to see the big picture, and to integrate information from diverse sources, is a lost art, and is not conducive to the academic compulsion to publish and grind out grants to permit academic survival. It is not the mountain of generated data that should be decried; it is the lack of integrative capabilities and insights of many of the scientists who are buried in minutiae.

It reminds me of the physicians who immediately do a grand casino work up on a patient, with varied and expensive technologies, and then gather these mountains of information and scratch their heads trying to figure out the diagnosis based on what the individual data components show. A good history and physical examination would likely eliminate the need for a vast amount of the technology and data generation we now routinely produce. But that wouldn't bring in as much revenue, would it? Integrative thinking is seldom as remunerative as a technology tour de force, and does not look as good in a tenure file as piles of generated data.

6. dank48 - February 11, 2011 at 02:46 pm

Data, data everywhere,
Nor any thought to think.

7. dboyles - February 11, 2011 at 03:12 pm

The idea that since data isn't captured it is somehow being "wasted" seems mistaken to some extent. Research distills conclusions out of the wastage of data it leaves in its wake. Science is a highly reductionist endeavor and it is bound to create mounds of waste alongside and because of its pursuit, like many other human endeavors. Data 'recycling' is a rather curious idea. If indeed there is unexamined data that "could reveal better cancer treatments or the causes of climate change" that is one thing. But assuming that data is inherently valuable at some future date, when it isn't now, is an expensive proposition: we can count on the cost of data archiving, transmission, and reinterpretation to skyrocket beyond affordability. Additionally, the more unique the context under which complex data was generated, the more difficult it may be to apply it to any context other than its original one. As far as Moore's law, we have yet to see what will limit it. Sheer sustainability of the costs of data maintenance and prevention of its corruption and loss of integrity may be one limiting factor.

8. s_pelech_kinexus - February 11, 2011 at 09:28 pm

While there is a dearth of proteomics data with respect to protein expression, regulation and function, there remains a wealth of largely ignored genomics and proteomics data available in many open-access repositories on-line. While standardization is a larger issue in some of the other scientific disciplines, a great deal of progress on this front has already been made for those engaged in biomolecular research. The real problem is the lack of expertise available within the scientific community to interpret the data from system-wide analyses with our powerful new tools. In the language of life, each of the ~23,000 genes/proteins encoded by the human genome is like a noun with its own special meaning. The possible interactions and functions of these proteins are akin to verbs. If we want to make real progress in understanding how cells work, a lot more researchers will need to markedly expand their biomolecular vocabularies. Otherwise, we will have a lot of data, but little knowledge.
