• September 30, 2014

Too Many Researchers Are Reluctant to Share Their Data

A new model of data sharing and openness is emerging in the scientific community that replaces traditional ways of thinking about research findings as the private property of the primary investigator. Large granting agencies, including the National Science Foundation and the National Institutes of Health, have embraced the new model of more-open access to research data. Later this year, the NSF will start requiring scientists seeking research grants to include a data-management plan in their applications, describing how and when their data will be shared.

The issue has also captured the attention of a U.S. House of Representatives subcommittee, which held hearings last week on an NIH data-sharing policy requiring that federally financed research data be freely available within 12 months of publication.

But the change has been slower to take hold among scientists themselves, and that resistance is bogging down the pace of scientific progress. Policy changes come on the heels of a 2009 National Academies report about data stewardship, which suggested that, with the help of technology, the scientific professions are moving, but only slowly and reluctantly, toward a paradigm shift.

It has become increasingly apparent that scientific data should be considered a product in much the same way journal articles or conference proceedings are, and they should therefore be shared as widely as articles and proceedings, while being credited to their producers. NIH embraced this perspective as early as 2003 by requiring data-sharing plans in certain types of grant applications. The momentum has clearly shifted toward more transparency, at least among those who finance science. But among those who do science, it remains much less clear how long the transition will take.

For the past five years, I have been on the front lines of this shift, and I have seen little consensus among scientists. As an archive director for the Inter-University Consortium for Political and Social Research at the University of Michigan, which is among the oldest and largest data archives in the world, I have had the job of persuading researchers financed by federal sources to share their research data with the broad academic community. I have attended countless scientific meetings, presented at workshops, shared coffee and meals, and been cornered in poster sessions by disgruntled scientists and worried graduate students. The topic is always access to secondary research data: the reuse of primary data for secondary analysis. Researchers' concerns include discomfort over possible misuse of their data and fear of losing credit for their work.

Data sharing is a bit like going to the dentist. We can all agree that it is a good thing to do and intrinsic to good scientific practice. In reality, however, researchers tend to view data sharing with a mix of fear, contempt, and dread.

Of the reasons and excuses offered for not sharing research data, very few have substantial legitimacy from a scientific or even institutional perspective. Arguments related to protecting human subjects are valid. In the social and behavioral sciences, when researchers collect data, we promise to protect our subjects' identities. That is an important promise that, if broken, can substantially damage our research by making it more difficult to get cooperative subjects for future surveys, and by eroding our trustworthiness.

But the remainder of excuses for not sharing data are rooted in the nature of academic rewards, the financing of science, and misunderstandings about data. The misdirection occurs when we believe that we can protect our subjects by never sharing data with anyone beyond our work group. That would be true only if data still resided on pieces of paper.

The truth is, informal data sharing occurs every time a researcher puts a data file on a thumb drive and hands it to a graduate student, who then puts it on a shared network drive at a university, and then puts it on his or her laptop to take home and hook up to an unsecured wireless network. Formal data-sharing plans force us to think through data-protection and disclosure-control practices. Informal data sharing actually puts subjects at greater risk because we trust our colleagues, while never questioning the networks, computers, and places they use to store and analyze our data.

Moreover, many scientists extend the human-subjects argument from individuals to populations. They argue that when the data concern a vulnerable population, they could be misrepresented if widely released. The scientists are less concerned with protecting the identities of individuals than with controlling how the data are used to portray a particular population.

Again, that is a laudable goal, but misguided. Researchers often persuade a person to participate in a study to improve the condition of his or her community. Participants clearly believe in the value of the science brought to bear on their issues. The original research team, however, may not have all the answers. The value of science lies in the ability to exchange and test alternative solutions. By trying to protect a community from harm, the team may actually be hurting it by shutting out alternatives.

I've also heard and thought about many other arguments against data sharing, none of which ultimately hold water. For example:

  • "I worked hard for this, and I want to exploit it as much as I can." It is true that academe is designed to reward publications and, thus, when we share data, we run the risk of being "scooped." That suggests, however, that the individuals who collected the data have no competitive advantage at all. Secondary-data analysis, by those who are not the primary researcher, is actually quite hard even when the data are well prepared and documented, because data collection has become so complicated it is difficult to navigate. Further, data producers gain the advantage of having completed their work first, forcing others to cite them in future publications.
  • "People won't use the data properly." Can we dictate how other analysts use our written work? The scientific discourse is one of error and correction. The literature is filled with such exchanges. If we prevent people from entering the conversation because we are afraid they might say something stupid, we violate the basic principle of science that statements are considered valid when well supported by evidence or until proved wrong. Data are the raw materials of those conversations.
  • "It's too expensive to clean it up." Collecting data is a bit like cooking a good meal. If you clean as you go, when you are full and sleepy you will have much less to do. Documenting and cleaning data are good scientific practice. It should be very little work to make data ready for someone else.
  • "I won't share it because it's mine." That is the least credible and most objectionable reason for not sharing data. Call it the kindergarten gambit. In fact, data collection supported by the federal government belongs to the institution to which the grant was given. Contracts have different types of ownership embedded in their agreements. In the case of grants, a researcher's legal claims on the original data are minimal unless he or she has negotiated an alternative agreement with his or her institution. More important than the legal claim is the moral one. If we continue to ask American taxpayers to finance scientific research, we ought to be willing to share some of its products: data.

The most effective argument in favor of data sharing is simple: It is good science. The scientific community need only look to the field of astronomy for a well-documented example. As The Chronicle has reported, Alexander S. Szalay, a professor of physics and astronomy at the Johns Hopkins University, has helped change his field to one in which sharing is routine, building an archive that brings together millions of digital images of the universe. Genomics also serves as an example, with several large-scale projects engaged in broad collaboration on gene sequencing, like the Human Genome Project and the "genomewide association studies" it has made possible, which scan markers across many people to identify variations associated with particular diseases.

At our consortium for political and social research, which has nearly 700 members and is housed at the Institute for Social Research at the University of Michigan, data sharing has been our mission since our founding nearly 50 years ago. Demand for the data we disseminate is only growing. In 2010 we have seen record-high downloads from our Web sites. Our experience demonstrates that making data available for secondary analysis by the wider research community is an essential component of social-science inquiry.

Our hope is that the most recent effort by the NSF, along with practices already in place at NIH, will push more researchers to realize that being overly protective of one's data is counterproductive. When the shift finally occurs, perhaps I will spend less time on the road trying to persuade people of the value of data sharing, and more time facilitating its use.

Felicia LeClere is an associate research scientist and director of Data Sharing for Demographic Research and the National Addiction & HIV Data Archive Program at the Inter-University Consortium for Political and Social Research at the University of Michigan. She is also an associate research scientist at the university's Population Studies Center, Institute for Social Research and will be a principal scientist at the National Opinion Research Center at the University of Chicago in September.

Comments

1. ksledge - August 04, 2010 at 08:29 am

I truly hope that data sharing becomes more the norm in the future. In my field (cognitive neuroscience), some kinds of data collection (fMRI) are very expensive, but the data themselves are often very rich and could sometimes support new ideas for several different manuscripts. Moreover, researchers are constantly determining new ways to analyze fMRI data. Currently, because few people share data, researchers are paying tens of thousands of dollars per study to essentially re-run a study that has already been performed, so that they can do a new analysis of the resulting data. If we shared data more, we'd save taxpayers millions of dollars. Data sharing would also allow faculty and students at liberal arts colleges (where they will not get the big grants or equipment to perform these studies) to perform some of this same kind of research using data sets from previous studies.

2. refranck - August 04, 2010 at 01:35 pm

The incentives seem obvious. Faculty members are commonly promoted and tenured, based primarily on their publishing records. Keeping a good source of data private is one way of ensuring a continuing stream of publications -- by reducing the risk of being "scooped" by others (among other things).
Also, those who take the time to "clean up" their data for public release take time away from doing what's rewarded (getting published), in order to do what's not rewarded. (Tenure cases that turn on volume of publications are commonplace. If anyone's heard of a tenure case that turned on a faculty member's conscientious sharing of data, it would certainly be interesting to hear about it.)
All things considered, it's reasonable to hypothesize that not sharing data is a predictable result of the prevalent academic culture -- even though all agree that sharing data is virtuous.
Equivalently, we should consider the possibility that our culture rewards A (publishing) while also hoping for B (sharing data), but not rewarding B.

3. stevanharnad - August 04, 2010 at 11:27 pm

FIRST THINGS FIRST: OA THEN OD

Open Access (OA) to refereed journal articles, unlike Open Data (OD), has no conflicts with the author's interests. So OA needs to come first. Once OA is universal, the benefits of OD will be more apparent, and there will be much more OD too. OA can be mandated (by institutions and funders). OD cannot (because of the conflicts of interests -- some of them quite legitimate).

See: http://bit.ly/OAversusOD

Stevan Harnad
American Scientist Open Access Forum

4. donmac23 - August 05, 2010 at 06:47 pm

I read somewhere that over 80% of the use of research data occurs after the initial study. Sorry, but I can't find the link now! Perhaps if it's so hard to access, this does not apply to a lot of US research, but clearly there's a lot of value in making this more possible. In Australia, there is a very big project underway called ANDS (http://www.ands.org.au/about-ands.html) to make research data more accessible online.

Don McIntosh
http://www.spacetimeresearch.com

5. tsb2010 - August 06, 2010 at 09:21 am

Interesting perspective if you're "fat and happy" (aka tenured), and a classic "do as I say" story. Of course, in an ideal world we would and should share our data.
I don't think that the "I worked hard for this, and I want to exploit it as much as I can" argument can be dismissed so quickly. In some fields the acquisition of data is hard and takes many years. If I spent countless hours doing mind-numbing experiments in order to get to my data, I sure would like to profit from that - as opposed to having others descend upon it like vultures.

6. tsb2010 - August 06, 2010 at 09:22 am

The solution to this is not more articles and commentaries. We first need to change the way the "system" works, and properly reward those who worked for the data.

7. fleclere - August 10, 2010 at 10:37 pm

So as the author of this piece, I would like to say that data sharing actually benefits junior people most because they can capitalize on secondary data from others. It also helps folks at smaller schools with fewer resources gain access to data. The 'fat' and 'happy,' as the poster tsb2010 dubs them, actually have the smallest incentive to share because they have the resources to collect data, clean it, and retain it. I actually believe, and not naively, that data sharing flattens the academic hierarchy pretty substantially. In disciplines that have a strong secondary data culture, it is quite possible to have a very productive academic career with more limited resources. As for my motives, I am not a tenured professor but rather contract research faculty whose time has been spent among the tenured, untenured, and every variation in between. It remains true that the likelihood that anyone would scoop the person who collected the data is very remote unless the project is exceedingly simple.

8. drmhp - August 11, 2010 at 09:10 am

It seems as if the utility of more open data sharing may vary across fields of science. For example, coming from a background in psychological research, the process of research design and the development of unique research methodologies is nearly as important to furthering the field as the data that result from the research process. I would be concerned that graduate students, for example, steered toward the re-analysis of other datasets would be missing out on a key component of their development as researchers. We essentially already have a similar practice going on with meta-analysis, which has increased dramatically in representation across journals in the field.

9. timewaster123 - August 25, 2010 at 09:33 am

I wonder about the human subjects research though. Because if you don't put secondary data use on the initial consent form, do you actually have the right to share the secondary data with others?
