Citation by Citation, New Maps Chart Hot Research and Scholarship's Hidden Terrain

Stephen Brashear for The Chronicle

From left to right, Carl Bergstrom, Martin Rosvall, Daril Vilhena, and Jevin West design an algorithm to map scholarly research.
September 11, 2011

Imagine a Google Maps of scholarship, a set of tools sophisticated enough to help researchers locate hot research, spot hidden connections to other fields, and even identify new disciplines as they emerge in the sprawling terrain of scholarly communication. Creating new ways to identify and analyze patterns in millions of journal citations, a team led by two biologists, Carl T. Bergstrom and Jevin D. West, and a physicist, Martin Rosvall, has set out to build just such a guidance system.

Trapped in disciplinary valleys, surrounded by dense forests of information, researchers have a hard time seeing a lot of scholarship that might be relevant to their work, especially if it's not published in the places they already know to look. The work of Mr. Bergstrom and his colleagues is a response, they say, to the problem of how to work with an overwhelming and ever-growing amount of information.

"There's just too much," says Mr. West, a postdoctoral researcher in the lab of Mr. Bergstrom, who is a professor of biology at the University of Washington. Researchers "need tools for searching and for navigating the scholarly landscape."

He and his colleagues have been developing those tools: a set of mapping-and-recommendation services that will be freely available and can run on a desktop or laptop computer, so that anyone in any field can use them. The work builds off the thinking behind the Eigenfactor score, a method of assessing journals' relative influence that Mr. Bergstrom and Mr. West unveiled in 2007. The Eigenfactor algorithm takes into account the source of citations. A citation in a high-profile journal like Nature, for instance, counts for more than a citation from a journal only a handful of people ever see or cite. That's a more nuanced way to evaluate a journal's standing than the widely used impact factor, which tracks how many citations a journal gets but does not weight the sources.
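The difference between counting citations and weighting them by source can be illustrated with a small sketch. This is not the Eigenfactor computation itself, whose details the article does not give; it is a generic power-iteration ranking in the same spirit, run on invented data. The journal names, the citation counts, the `damping` value, and both function names are assumptions made for illustration.

```python
# Hypothetical citation data: cites[a][b] = number of times journal a cites journal b.
cites = {
    "Big": {"X": 1},
    "Small": {"Y": 1},
    "P1": {"Big": 1},
    "P2": {"Big": 1},
    "X": {"Big": 1},
    "Y": {"Big": 1},
}


def all_journals(cites):
    """Every journal that appears as a citing or a cited title."""
    cited = {dst for targets in cites.values() for dst in targets}
    return sorted(set(cites) | cited)


def raw_citation_counts(cites):
    """Unweighted tally: each incoming citation counts the same."""
    counts = {j: 0 for j in all_journals(cites)}
    for targets in cites.values():
        for dst, n in targets.items():
            counts[dst] += n
    return counts


def weighted_influence(cites, damping=0.85, iterations=100):
    """Source-weighted scores: a journal passes its own score on to the
    journals it cites, in proportion to how often it cites each one."""
    journals = all_journals(cites)
    n = len(journals)
    score = {j: 1.0 / n for j in journals}
    for _ in range(iterations):
        # Every journal starts each round with a small baseline share.
        new = {j: (1.0 - damping) / n for j in journals}
        for src in journals:
            targets = cites.get(src, {})
            total = sum(targets.values())
            if total == 0:
                # A journal that cites nothing spreads its score evenly.
                for j in journals:
                    new[j] += damping * score[src] / n
            else:
                for dst, k in targets.items():
                    new[dst] += damping * score[src] * k / total
        score = new
    return score


raw = raw_citation_counts(cites)
scores = weighted_influence(cites)
print(raw["X"], raw["Y"])        # both journals receive one citation
print(scores["X"] > scores["Y"]) # but the weighted scores differ
```

In this toy network, journals X and Y each receive exactly one citation, so a simple count cannot tell them apart. X's citation comes from the heavily cited "Big", and Y's from the uncited "Small", so the weighted score ranks X higher.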

A passion for information theory and network analysis, in their own fields and beyond, brought the two biologists together with Mr. Rosvall, who is an assistant professor of physics at Umeå University, in Sweden. It's a collaborative effort. According to Mr. West, Mr. Rosvall and Mr. Bergstrom did much of the theoretical work that produced the map equation, the mathematical approach that underlies much of the team's current research. That work led to the development of InfoMap, an algorithm based on the map equation that the team uses to build visualizations and maps of science. Mr. West, along with Mr. Bergstrom and Mr. Rosvall, has spent much of the past year trying out the map equation on different networks, including the scholarly literature.

The researchers concluded that, if they wrote the right algorithms and used large enough data sets, the citations used to determine Eigenfactor scores could be made to reveal larger patterns in the scholarly literature as well: tracing the flow of ideas among disciplines, or identifying fields as they take shape. For instance, using citation data from about 7,000 journals, the team pinpointed a period in 2004-5 when a distinct neuroscience literature emerged, suggesting that the field "has transformed from an interdisciplinary specialty to a mature and stand-alone discipline," as Mr. Bergstrom and Mr. Rosvall explain in a paper published in the journal PLoS One.

Their work is among the most cutting-edge and aesthetically appealing of many attempts to find new ways to get a handle on the universe of scientific literature. It's not enough to do keyword searches in a database anymore. "You need a way to get your bearings," Mr. Bergstrom says. "There should be a faster, easier way to help people get a big picture of a field so they can dive in."

The results are meant to be accessible to users in any field, telling all kinds of stories about how researchers share knowledge.

"Good maps simplify and highlight," Mr. Bergstrom explains. When working with large data sets, he says, one critical question to ask is, "What are the important structures here?"

Some structures revealed by citations are natural groups or clusters of journals that have intellectual traffic with journal clusters in other fields, traffic that's measured by back-and-forth citation flow.
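As a rough illustration of what "back-and-forth citation flow" between clusters could mean in practice, the sketch below adds up journal-to-journal citations into totals between fields. It assumes the grouping of journals into fields is already known, which is the hard part the team's algorithms are meant to solve. The journal labels, field assignments, counts, and the `field_flows` function are all hypothetical.

```python
from collections import Counter

# Hypothetical assignment of journals to fields.
field_of = {
    "J1": "biology",
    "J2": "biology",
    "J3": "physics",
    "J4": "economics",
}

# Hypothetical citation records: (citing journal, cited journal, count).
citations = [
    ("J1", "J2", 10),  # stays inside biology, so it is not cross-field traffic
    ("J1", "J3", 3),
    ("J3", "J2", 2),
    ("J4", "J3", 1),
    ("J3", "J4", 4),
]


def field_flows(citations, field_of):
    """Total citations exchanged between each pair of fields, in both directions."""
    flows = Counter()
    for citing, cited, n in citations:
        a, b = field_of[citing], field_of[cited]
        if a != b:
            # frozenset makes the pair order-independent: A->B and B->A add up.
            flows[frozenset((a, b))] += n
    return flows


flows = field_flows(citations, field_of)
for pair, total in flows.items():
    print(" <-> ".join(sorted(pair)), total)
```

A pair of fields with a larger total exchanges more citations, and a pair that never appears in the result has no recorded traffic at all.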

When they describe what they're doing, Mr. Bergstrom and his colleagues speak like explorers, invoking geographical and urban imagery to describe the landscape their algorithms reveal. Mr. Rosvall compares moving through the scholarly landscape to trying to get from one Rockies mountaintop to the next; the team's challenge is to identify peaks and valleys and help researchers move past the barriers that separate them. Mr. Bergstrom likens the network of citations to a city that is "growing organically as you're trying to navigate through it." Capture it in the right sort of map, he says, and "if that map is there, the story of how fields are changing is all there in this big lattice of citations."

Pretty Pictures

There's a strong artistic component to the Eigenfactor work. Its creators have made great strides in developing visualizations of the data, working up elegant webs that capture clusters of journals and the connections among them.

Show people one of those clusters and they are likely to comment on how beautiful it is. "One of the things about their stuff is I find it aesthetically lovely," says Michael J. Kurtz, a staff astronomer at the Harvard-Smithsonian Center for Astrophysics and the founder of and project scientist on the Astrophysics Data System, a digital-library portal whose databases contain more than nine million bibliographic records.

He calls one of the Eigenfactor visualizations of the sciences "iconic" and praises it as uncluttered. "It really shows the basic flows of information among the sciences," Mr. Kurtz says, calling the group "by far the most clever people working in the field today."

The maps' aesthetic appeal also makes them more attractive to potential users. A researcher doesn't have to understand algorithms to be able to work with the visualizations that Mr. Bergstrom et al. create.

To feed their algorithms and build new maps of science, the researchers have developed partnerships with several groups that are custodians of large data sets. They include Thomson Reuters, Microsoft Academic Search, the Social Science Research Network, and JSTOR, an online system of journal access. "Three or four years ago, when we got started, access to data was an enormous problem," Mr. Bergstrom says. "Now you have all these cool projects that are out there making citation abstracts." Access "is not the major bottleneck" anymore.

Chronicle of Higher Education

Look who's talking: This network of disciplines shows how strongly different areas in the large JSTOR collection of scholarly journals are connected. Thicker lines represent more back-and-forth journal citations; thinner lines indicate less communication.
Jevin D. West

Recently JSTOR agreed to provide the three researchers with access to its entire corpus, including not just citations but also the metadata, or descriptive labels associated with those citations, as well as the full text of articles, says Kevin Guthrie, president of Ithaka, a nonprofit group that includes JSTOR and promotes the innovative use of technology in higher education. The researchers agreed not to share any of the material with third parties but otherwise "could use the data in any way consistent with their research," he wrote in an e-mail to The Chronicle.

The Eigenfactor team's work supports JSTOR's mission "to help the scholarly community use technologies in transformative ways," Mr. Guthrie says. But it also holds promise for improving the services JSTOR offers to its users. "We are eager to test whether the Eigenfactor algorithms will group and cluster articles in ways that could be helpful to users," he explains. "Jevin and Carl's work over the last year shows promise that citation networks could form the basis for a powerful recommendation engine that could help students and researchers find relevant articles that they would not have discovered in other ways."

Mr. Bergstrom and his colleagues expect to roll out an "Eigenfactor recommends" tool this fall. The idea is to help researchers determine where their work sits in the scholarly landscape, and what's nearby that they ought to know about. "Most recommendations up to this point use textual information"—keywords, for instance, Mr. West says. "We can do a lot of neat things in terms of helping people find the papers they'd like to find," Mr. Bergstrom says. "What is this idea? Where did it come from? What are the classic reviews? What are the hot breaking things? These mapping techniques allow us to figure that out."

Ithaka's Mr. Guthrie agrees. "Perhaps the connections among articles will help authors find other authors writing on similar topics but in other fields," he wrote in his e-mail. "In general, we find the research so promising and the data so rich that we expect there to be many interesting outcomes that we cannot fully envision right now."

Working with JSTOR also represents the team's first big foray into the humanities, and a chance to work with full-text data as well as citations. For this article, they created a visualization of how the different fields represented in JSTOR's corpus connect to one another. And they are eager to collaborate with scholars in other fields to analyze and put in context the scholarly patterns that their algorithms reveal. Historians of science, for instance, could have a field day with data and visualizations that help pinpoint the spread of certain ideas or the rise of a field like neuroscience.

Michael C. Jensen, founder and chairman of the Social Science Research Network and an emeritus professor of business administration at Harvard University, sees enormous possibilities in the Eigenfactor work. SSRN, he estimates, has some 300,000 PDF's of articles on its Web site. The references in those papers "provide a quite clean data set for Carl and Jevin to apply their work to," he says.

"The idea is not only to rank papers and authors and institutions and find out where the most important work is being done, but also to use those interrelationships to make searches into the literature more powerful and effective," Mr. Jensen says. He wants to see "if they can use the Eigenfactor algorithm and the SSRN data to forecast which papers are the hot up-and-comers. We don't know exactly how that's going to come out yet. That's the next step, and I'm very excited about it."

With the scientific literature growing rapidly, he says, it's more important to have systems that allow people "to find exactly those things that are relevant to their research topics, problems they're having. That will make a huge difference in the world."

Grass-Roots Bibliometrics

The Eigenfactor team forms one important node in a growing network of researchers who want to make bibliometrics, or ways of measuring the impact of information, into more than a tool for assessing the influence of individual scholars and journals. It's an interest that brings scholars, data aggregators, and policy makers together.

One essential data source for Mr. Bergstrom's team has been Thomson Reuters. It lists Eigenfactor scores alongside the impact factors it derives from its citation databases and publishes in its annual Journal Citation Reports. The Eigenfactor group posts the results of its calculations, using Thomson Reuters data, six months after the company publishes its yearly impact factors.

Marie McVeigh, director of production and bibliographic policy for Journal Citation Reports, describes the two approaches as complementary. It is important to note how widely an article is cited, she says, but beyond that, "I think it does matter who's recommending your journal. It does matter where those citations are coming from."

That connections matter is a fundamental point for Mr. Bergstrom, Mr. West, and Mr. Rosvall. The focus on individual papers has obscured the larger patterns in the citation data, they say. According to Mr. West, one question driving them was, "How can we extract all this extra information that seems to exist in the citation network? The network itself had been ignored for the last hundred years. It hadn't been included when scholars or policy makers were trying to evaluate what was going on in science, say trying to measure the influence of a scholar or an institution."

One of those policy makers, at the National Science Foundation, would like to change that. Julia I. Lane, program director of the foundation's Science of Science and Innovation Policy Program, says it awarded money to Mr. Bergstrom and to another researcher, Johan Bollen, an associate professor of informatics at Indiana University's School of Informatics and Computing, to do complementary investigations into how to track scientific innovation using data on citations and usage: how often articles are downloaded, as well as how often they're cited.

"What we have is proposal- and award-administration factories," Ms. Lane says. "We don't really have a method of linking inputs and outputs." Mr. Bollen and Mr. Bergstrom are "applying new techniques to try and figure out, Are these ideas taking root? It's the community developing metrics rather than a bunch of federal bureaucrats" doing that.

She welcomes researchers' desire to get beyond publication counts. "The old style was just to describe clusters of documents," which isn't very useful in helping people like her formulate science policy, she says. "One of the most interesting things is moving beyond the documents and looking at the human behavior that's underlying it."

New Territory for Citations

Mr. Bergstrom, Mr. Rosvall, and Mr. West are infectiously enthusiastic about the possibilities of their work. How—or how much—researchers will use the new tools remains to be seen. The Eigenfactor team may dream of a Google Maps of scholarship, but certain groups of scholars may prefer to draw up their own maps.

Harvard's Mr. Kurtz and his colleagues at the Astrophysics Data System have collaborated successfully with the Eigenfactor group. In 2007 they ran the mapping algorithm on a subset of journal articles in astrophysics. "In order to identify clusters with similar subject-matter content, it's probably the best algorithm," he says. But his group hasn't yet seen how to use it effectively at the individual-article level, and that's what they'd really like to do.

"This is a starting point for searching. It's not the end point," Mr. Kurtz says. "It's an ingredient in building a search, and that's how we plan to use it." The Astrophysics Data System has been doing a major overhaul of its Web site, with a new user interface. It has its own approach to what Mr. Kurtz calls "custom search engineering for scholarly pursuits."

Santo Fortunato, a researcher at the Institute for Scientific Interchange, an international project based in Italy, recently compared methods of sorting scholarly journals and articles into related clusters. "The method by Rosvall and Bergstrom turned out to be the best among those used for the comparison," Mr. Fortunato told The Chronicle in an e-mail.

But he cautioned that it hasn't really been tested yet in the real world. The benchmarks he used in his study were "not proxies of any specific real system," he wrote. Without knowing what "the natural communities" being mapped are, he says, it's hard to gauge how accurately the Eigenfactor InfoMap approach identifies them. "The power of InfoMap for the analysis of citation networks is then still unknown," Mr. Fortunato concluded.

The possibilities will begin to be tested more widely this fall, when some of the mapping tools are made publicly available. For Mr. Bergstrom, the Eigenfactor team's biggest stumbling block has to do with real-world applications—how to figure out what researchers really want and will be able to use.

"What I find most challenging right now is how to build the proper user interface," he says. "What would people want? That's even harder than the algorithmic problems we're facing at this stage."

By tracking how often journals in neurology, psychology, and molecular and cell biology cited one another, researchers identified a period in 2004-5 when their ideas merged into the field of neuroscience.
Image by Martin Rosvall and Carl T. Bergstrom, PLoS One