• April 20, 2014

Researchers Urged to Think Harder About Compiling and Sharing Data

Data overload is creeping up on everyone, and research scientists are no exception. So it's time, according to a report out today from the federally chartered National Academies, to think about what to do with all that data.

The report, "Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age," calls on researchers, their universities, and publishers of academic journals to consider new policies for compiling, tracking, storing, and sharing data. Otherwise, the report says, the flood of data coming out of scientific research could be lost, misinterpreted, or misused.

As an example, the report's authors—more than three dozen experts, mostly at research universities—suggest that scientists try harder to identify which data are relevant to their findings and then include those data in their published work.

While that kind of change could help scientists guard against mistakes and even accusations of fraud, it's not clear how many researchers are ready to take that step. The National Academies report comes just two days after Elsevier, the world's largest publisher of medical and scientific literature, said it was developing a new Web-based format for research articles at its Cell Press division that would allow study data to be incorporated into the presentation. And Emilie Marcus, the editor in chief at Cell Press, said it wasn't immediately clear how eager researchers would be to take advantage of such a format.

The National Academies report today says researchers should strongly consider cooperating.

"Legitimate reasons may exist for keeping some data private or delaying their release," the report says, "but the default assumption should be that research data, methods (including the techniques, procedures, and tools that have been used to collect, generate, or analyze data, such as models, computer code, and input data), and other information integral to a publicly reported result will be publicly accessible when results are reported."


1. rbuteragt - July 23, 2009 at 03:00 pm

I know few scientists who are against data sharing. The real obstacle (for my lab, at least) is the immense amount of time it would take to format and annotate the data so it could be understood by someone else. There is raw data, then extraction of features from that raw data, then analysis of those extracted features. Many labs have in-house processes for dealing with their data and analysis, but such methods are not standardized. Even though various markup languages are being developed, there is still a lack of tools supporting such languages. As any software developer knows, there is a big difference in effort between "code that does the job" and "code other people can use." I think this applies to experimental data as well. Ironically, I find it much easier (in terms of effort) to share complex computational models than experimental data. This is not a complaint. Data sharing is a good idea, but we need to discuss the real obstacles to it.
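The gap the commenter describes—between an in-house pipeline and data someone else can understand—can be made concrete with a small sketch. The example below is hypothetical (the field names, units, and file layout are illustrative, not from any real lab): shareable data is not just the numbers, but a metadata "sidecar" that records what each column means and how it was produced.

```python
import csv
import json

def export_for_sharing(raw_rows, out_prefix):
    """Write measurements plus a JSON sidecar describing each column.

    An in-house pipeline might keep raw_rows as bare tuples; a shareable
    export has to carry the context a stranger would need to reuse them.
    """
    metadata = {
        "columns": {
            "trial": "sequential trial number, 1-based",
            "response_ms": "response latency in milliseconds",
        },
        # Provenance notes a reader would otherwise have to ask for.
        "processing": "raw data -> feature extraction -> analysis",
        "instrument": "hypothetical in-house rig, settings file v2",
    }
    with open(out_prefix + ".csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trial", "response_ms"])
        writer.writerows(raw_rows)
    with open(out_prefix + ".json", "w") as f:
        json.dump(metadata, f, indent=2)

export_for_sharing([(1, 412.5), (2, 388.0)], "session01")
```

Even this toy version shows where the time goes: the CSV takes seconds, while deciding what belongs in the metadata—and keeping it accurate as the pipeline changes—is the open-ended part.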

2. davi2665 - August 20, 2009 at 09:48 am

Data sharing from research has many roadblocks, some of them difficult to overcome.

First, the IT infrastructure of universities is highly variable, and the ability to actually merge data sets and analyze them is questionable, if not impossible, with today's technological hodgepodge. It is difficult enough to get a hospital's departments, including individual physicians, onto a common IT platform for patient data, and doing so costs hundreds of millions of dollars in just one system. Fat chance that an IT framework will be built for merging research data.

Second, universities want to keep their researchers' data proprietary, in the hope that some intellectual property may come from it. Paranoia will always win out.

Third, even if a university is inclined to share data, the contractual process of obtaining useful data-sharing agreements is mind-boggling and time-consuming, with no one winning except the lawyers.

Fourth, sharing patient data that may link to personal health information (PHI) requires a blizzard of paperwork related to HIPAA and IRB approvals, including informed consent obtained BEFORE the research is submitted and approved. Putting together such agreements after the fact is challenging.

Fifth, so what if the data sets can be merged and put into the same IT framework? There are data warehouses that contain many billions of data points; they are next to useless unless one has a clear-cut plan of attack to mine and analyze the data, a very difficult task for even the most seasoned investigators.

And sixth, the amount of time it takes to integrate and share data—and the personnel, including IT specialists who must be hired—is daunting, and will add cost to research projects that are already stretched to the limit by minimal funding.

Even the simplest form of data sharing, that of providing raw data from NIH-funded projects, would involve huge time and effort, and would end up producing endless data arrays that resemble research data books; in other words, next to useless. Good luck in trying to implement what, on the surface, appears to be a great idea, but from a practical point of view is currently virtually unachievable.

3. sayeedmd - August 27, 2009 at 11:46 am

Both of these comments include valid and important points. Our research group at the Sheridan Libraries of Johns Hopkins University is trying to address some of them through a prototype data-curation project that will demonstrate a workflow and system for linking and preserving connected datasets and articles. The Institute of Museum and Library Services and Microsoft Research have provided funding for this work, so we are using both open-source and Microsoft-specific tools. Further information is available at: https://wiki.library.jhu.edu/display/DATAPUB/Home
