The Chronicle of Higher Education
The Chronicle of Higher Education: Colloquy

Data Deluge

Thursday, June 22, at 2 p.m., U.S. Eastern time

The topic

A scientist whose research generates vast amounts of digital data may not recognize what the strings of numbers mean a few months down the road, and another researcher coming fresh to the material will have even more trouble understanding it. If the machine it is stored on becomes obsolete, the data -- which might have led to new discoveries -- will be lost forever. In contrast, making the data available to other scientists for their own research might save years of research and millions of dollars.

So should digital data be stored on central institutional servers -- or even in national or international archives? How should archivists organize the data? Should researchers be allowed to keep all or part of their stored data secret? What kinds of new science -- and publishing -- might emerge as digital data continue to proliferate?

  » Lost in a Sea of Science Data (6/23/2006)

The guest

D. Scott Brandt, associate dean of libraries at Purdue University, is helping to build a repository of scientific data that will be "distributed," or stored on the hard drives of faculty members, on departmental servers, or as part of a large-scale computing project run by Purdue and a handful of other institutions.


A transcript of the chat follows.

Scott Carlson (Moderator):
    Hello, and welcome to our discussion about the "data deluge" in science and the ways that librarians and archivists might deal with it. Thanks to digital technology, scientists are generating vast amounts of valuable data that, months later, may be irretrievable or indecipherable. Librarians are being called in to archive that information, but financial, technical, and even cultural barriers stand in the way. Who should pay for archiving digital data? Should it be stored close by, where it can remain private, or in large, central repositories?

We're here today with D. Scott Brandt, an associate dean of libraries at Purdue University. Mr. Brandt is coordinating a distributed institutional repository that will archive the data of Purdue scientists. Please send us questions or comments about the topic, and Mr. Brandt will respond.

Let's get the discussion started....


Question from Steve, small community college:
    It's easy to see the huge potential impact of some kind of universal data archive. What seems unlikely, however, is for those in the science community to ever shelve their competition with one another, despite whatever breakthroughs might arise from doing so. Is such cooperation between competitors possible, or would such a database be more frequented by the less competitive researchers of "Small Science"?

D. Scott Brandt:
    Two things: We can’t really respond fully to the competition aspect—as noted in the article, there was some disagreement between the researcher and the librarian about whether even the metadata should be shared… although we believe there is a range of detail in metadata that can be shared, as well as degrees of access that will need to be accommodated. But the role that librarians should play in building these systems and services may not always result in data that gets shared outside of the university. Thanks!


Question from Frank, midsize research u:
    Who pays for a project like this -- government? the university? -- and what are the costs?

D. Scott Brandt:
    This is a very good question. Developing not only a business model, but a sustainability model, is something that we at Purdue, and others, such as the NSF, are interested in. As part of our mission, the Libraries represent an institutional commitment to curate research data as a part of the intellectual record of the university (unlike research projects that are funded on soft money for a finite period of time, librarians take a hundred year view…). The initial proof-of-concept project for a distributed repository at Purdue was subsidized by the Libraries in response to specific problems researchers have here. We are working in collaboration with researchers to consider their metadata and curation needs and include librarians in their proposals. But ultimately, we think the university should be interested in maintaining the output of research done at its institution, perhaps similar to how it has supported that research through the library’s mission. The government has already made clear that data and information generated from research it sponsors should be made available, and the implication is that it is up to the researcher to do so. As for specific costs, it’s too early to tell what they would be in full production mode. This is an area that libraries are looking at to build sustainable systems and services. Thanks!


Question from Frank, midsize research u:
    Could more money be saved if we didn't save this data and we merely did experiments over again?

D. Scott Brandt:
    This is something that should go into the collection development aspect of curating and archiving data. One of the issues related to metadata is that we not only create “descriptive” metadata, but also metadata that addresses preservation and collection management issues like this. We can’t keep everything, but this activity would take place in consultation with the data creators and users (like collection assessment and weeding). Thanks!


Comment from Pam Baxter, Cornell University:
    Regarding "who pays," some granting agencies, and in particular NIH, make it clear that costs associated with metadata production, archiving, and such, can be written into the grant. Again, without an enforcement mechanism, it's unlikely compliance will be any higher than it is now.


Question from Pamela Alexander, University of Pennsylvania:
    While an obviously important part of the problem is technical, another aspect is the need for establishing policy for sharing data. As was described in this article, many researchers are reluctant to lose a perceived competitive edge by making their data available to others. However, if this goal is seen as desirable, it will be necessary for federal granting agencies to develop incentives and even requirements for researchers to archive data from funded studies and to provide detailed metadata to make these data accessible to others. As some research agencies (such as the National Institute of Justice) have done, providing funding for secondary data analysis is also another piece of the solution.

D. Scott Brandt:
    Good comment -- this aspect is important to researchers. One thing is that we think it is critical to have institutional commitment, and we have written this into our strategic plan. Also, at the federal and funding agency level, there is movement to make research results, and in some cases data, available as a stipulation of accepting the grant. For instance, the NIH says it “strongly encourages” public accessibility and this is widely interpreted as a precursor to further funding. Also, the Cornyn-Lieberman legislation passed recently was intended to ensure research is available to the public (this covers 11 agencies).


Question from Lila Guterman, The Chronicle of Higher Education:
    I'm curious about how scientists and librarians are trained to deal with all these data sets. Do library students have to take statistics? Do science graduate students take computer science?

D. Scott Brandt:
    The researchers we’ve talked to recognize that they don’t have the skills (or the time) for dealing with the organization of their data—formatting it, enhancing it with metadata, curating and/or archiving it. At Purdue they look to librarians to help them. The librarian situation is interesting. For instance, we are posting a position called data research scientist, which is a library based researcher applying technical skills to metadata problems. But librarians have an important role in curating and archiving data, applying library science principles to e-science to resolve problems related to metadata, ontologies, data management, etc. Thanks!


Question from Pam Baxter, Data Archivist, Cornell Institute for Social and Economic Research:
    I'm not sure this is so much a question as a comment from the realm of the social sciences. It seems that one need look no further for a shared repository model than ICPSR, the Inter-university Consortium for Political and Social Research. The differences between scientific and social science data products strike me as more a matter of scale than degree. ICPSR has also initiated a more distributed model of archiving in addition to the "mother ship" archive at the University of Michigan. Seems to me a more intractable problem, regardless of discipline, is convincing researchers that letting go of their data should not be regarded as a threat but as an opportunity--and those sorts of problems are social rather than technical or legal.

D. Scott Brandt:
    Yes, in addition to ICPSR we’ve seen a few shared and/or regional repositories, some run by not-for-profits, some by government-related agencies. And you’re right; we can learn from them how to deal with scale. In some ways this comment relates to the comments Jim Caruthers made regarding “big science” -- people coming together under a “project” where, even if their data is homogenous, a process has been developed that participants can follow. Purdue’s distributed repository arose out of working with local, individual researchers and groups to help them address problems… the so-called “little science” which actually accounts for a large group of researchers. But you’re right about the issue of “letting go of their data,” although we have seen some differences in disciplines (for instance, some earth scientists and some biologists are very happy to “give up” data, where others, like engineers and some humanists, may not be). This is where we see pursuing relationships with university administrators regarding policy (will it be a mandate to deposit all data?) and IT to build systems with varying levels of access (local, campus-wide, inter-institutional, etc.), as well as with the researchers themselves. Thanks!


Question from Jeff Young, The Chronicle:
    How important are the cultural factors involved in moving to shared data libraries? Do you think there will be scientists who will be happy to only cull others' data rather than generate their own?

D. Scott Brandt:
    Yes, those will likely be factors—there is supposedly a story about a scientist who put data “out there,” and someone else used it to come up with different results and got credit for it… In many ways, that’s the point of the Human Genome project—map the genomes, put the data out there, and let people make sense of it—and it might be influencing or setting up a new or different way to collaborate for research. Having sat in on a computational biology conference last summer, I can tell you that there are people who do this, although it might be unique to that discipline. From what I gather in talking to researchers in biochemistry, certain disciplines of engineering and statistics, this is likely to be the way of the future, given the phenomenal increase in data generation due to the use of sensors to collect data from just about anywhere imaginable.


Question from Raymond Yee, UC Berkeley:
    Can you tell us more about the organizational structure you have in place at Purdue? How many people? What roles? Involvement by librarians? scientists? computer scientists?

D. Scott Brandt:
    Good question.... Where should I start… The interdisciplinary research focus at Purdue was the impetus behind creating Discovery Park, a group of interdisciplinary centers. When James Mullins came to Purdue in July 2004 he started asking the 9 school and 73 department heads, as well as various center directors, how the libraries could support their work. This resulted in the creation of an Associate Dean of Research position in the Libraries to interact with like positions across campus and take part in institutional research planning. Librarians are faculty, as well as subject experts and liaisons to various disciplines who interact with the associated faculty in those areas. I function as the coordinator of these activities, and part of my challenge is integrating the more traditional roles of librarians into new initiatives in sponsored research. Our goal is to help researchers at Purdue solve problems, but then share results with other libraries who face similar problems… We work closely with our IT division (ITaP) as well as the Cyber Center, which deals with building up this interdisciplinary cyberinfrastructure.


Comment from Jim Jacobs, UCSD:
    With regards to the question about how scientists and librarians are trained, there is an active movement in the social science data community to view the creation, archiving, use, and re-use of data as a life-cycle, with different people playing different roles throughout the life-cycle. This means having librarians working with researchers at the beginning of their research even before data are collected to help them think about preservation and re-use and help them design and create standards-compliant meta-data. The goal is to protect researchers from having to be librarians and librarians from having to be researchers. We're already beginning to see this model work well and result in better documented data ready for deposit in a library or archive and ready for re-use. Researchers see the benefit of this.


Comment from Sandy, intern:
    It seems to me scientists usually work in such deep silos. Some of them are so specialized that some of their data is only relevant to what they have to study (i.e., the parameters, noise level, controls of the particular experiment, I don't know if anyone attempted to collect microarray data). Storing data and curating data is a wonderful idea, but the data that stands by itself is not meaningful. You almost have to go with the GenBank model (with enough staff to support the curation, alignment of the sequence, tools to analyze the datasets).

The question seems to be how to find commonality between the discrete data sets and generate information/knowledge from these datasets. Libraries could be a neutral playground for this interaction. The question seems to be the level of expertise, resource allocation, and institutional commitment. It is really a long-term investment. I wonder how an academic institution can support that.


Question from Raymond Yee, UC Berkeley:
    Are you also looking at archiving digital content from the humanities and social sciences?

D. Scott Brandt:
    Yes. We are currently looking at archiving research from a visual arts and design project, and have a great relationship with the new dean of Liberal Arts here, and we always keep our eyes open for more opportunities.


Question from Brian Simboli:
    Can you address how you intend to handle the old problem of migrating data as old platforms become obsolete?

D. Scott Brandt:
    There is no certain way to future-proof data, and most methods available now are largely untested. We are working to capture as much system metadata as possible to allow batch conversion when such tools become available. We try to remain as platform- and application-agnostic as possible, but our ability to do so varies widely from discipline to discipline and experiment to experiment.


Scott Carlson (Moderator):
    Scott is getting to your questions as fast as he can. If you'd like, send in comments about the topic and I'll post them for all to see....


Scott Carlson (Moderator):
    I'm posting a comment from Michael Pickett. People out there should feel free to respond....


Comment from Michael Pickett:
    In terms of ongoing funding, I wonder if the funding agencies would be more inclined to fund repositories and archiving in individual grants or by covering it in the overhead rate.


Question from Sarah Everts, Chemical & Engineering News:
    I'm interested in the people who might consider a career trying to solve this problem. What skills are required to handle this data? If someone gets training in digital science data storage (i.e., to handle the tech side of things), what kind of workforce demand is there now and 5-10 years in the future?

D. Scott Brandt:
    Here’s how we describe the data research scientist (DRS) position which works in this area: carry out sponsored research projects related to data, datasets and data mining applications, including data description and enhancement; collaborate with the Libraries’ and university data producers and repository contributors to develop cost effective and efficient strategies and reliable data streams for managing data and importing it into the institutional repository; organize access to data and related resources using traditional and emerging metadata schema; track developments in data management practices, as well as recommend and design appropriate applications to facilitate and enhance access to data sets and other collections. My guess is that it will take 5-10 years to develop the tools and techniques to make this doable for the future.


Question from Jim Jacobs, UCSD:
    Do you think that the model for data preservation and access that the social science data community has successfully used over the last 30 years fits in with how we might successfully preserve and maintain usable access to scientific data?

D. Scott Brandt:
     The sheer volume of data created by simulations and the computational sciences will likely break traditional models. Future solutions need to take things like grid computing and alternative uses for data such as visualization into account.


Question from Jim Jacobs, UCSD:
    Would you address Clifford Lynch's comment from the "Lost in a Sea of Science Data" article? "Some of this data is going to outlive their projects, and it's going to have to go to the custody of central administrative entities, like the library," he says. "In the long run, it's hard for me to see how libraries or disciplinary data archives operating at a national or international scale aren't going to have to take responsibility for archiving the data."

D. Scott Brandt:
    Ultimately, for the long term, we agree on this. But our distributed approach is our first step in responding directly to the local and immediate needs of researchers at Purdue.


Question from Robert Hanisch, Space Telescope Science Institute, Baltimore, and National Virtual Observatory:
    More of a comment, I think... In astronomy we are starting a pilot project for digital data preservation, much along the lines being discussed here. Astronomy has a strong tradition of data archiving and common data formats, and the Virtual Observatory framework provides for data discovery, access, and integration from distributed resources. We want to use this infrastructure to capture and preserve the digital data represented in the peer-reviewed literature (which is often more highly processed than the data in the observatory archives). We believe that this requires a collaboration among the publishers, editors, librarians, and technologists (and all are involved in our pilot project). Many astronomers are eager to make their data available, but journals have been hesitant to take on this responsibility. The Virtual Observatory framework allows us to utilize distributed repositories and to share the cost, though we clearly need the experience of implementing an end-to-end prototype in order to understand the impact on the business model.

We are hoping that the experience in astronomy can inform other disciplines. So here is a question: to what extent do you think solutions for data preservation in one domain might apply to others?

D. Scott Brandt:
    Absolutely, although the timelines for how long an institution holds onto the data and their capacity for doing so will vary greatly. The collaborations are critical to develop solutions that work.


Question from Charon, Research Institution:
    Do you believe that librarians have the necessary skills to build data repositories for such disparate data? Archiving the data as a single "blob" will not provide the value that archiving via a relational database would.

D. Scott Brandt:
    While we work with data, we are really more concerned with the metadata and the collection management process, which covers not only preservation, but access and use. And we believe that for data to interoperate, metadata has to interoperate.


Question from Michael, a Cal State Univ. Campus:
    Beyond archiving raw data, what is the role of libraries in archiving and hosting web portals to access the resultant scholarship? We are struggling with whether our library should assume this role or if the colleges with which the faculty are affiliated should take on the responsibility for developing and maintaining websites for journals and other online resources. Your thoughts?

D. Scott Brandt:
    At Purdue, we’ve developed a distributed institutional repository framework that provides an application layer, and we work to support protocols and interfaces for not only our own applications (e.g., portal), but those developed by others outside of the Libraries. We’ve tried to provide the tools they need and support for them.


Question from Brian Simboli:
    As a librarian, I know that we're already quite busy with many other things. Specifically, what will the staff costs be? Do you see dedicating someone to maintain this at least part time?

D. Scott Brandt:
    The role of the librarian is changing… In the face of large scale digitization and rapidly advancing technology, these are both exciting and perhaps threatening times for many librarians. We see curating datasets as a function of building collections, not unlike acquiring other material… We predict that in ten years this will be a comfortable and familiar facet of librarianship—and could be part of any librarian’s collection development responsibilities.


Scott Carlson (Moderator):
    That's it for today. Thank you for joining us in this discussion. I also want to thank Scott Brandt for showing up and answering questions about this important topic.