• September 2, 2014

Crowd Science Reaches New Heights


Joey Pulone for The Chronicle

Alexander Szalay's career in astronomy took an unexpected turn when the Johns Hopkins U., where he is a professor, joined the Sloan Digital Sky Survey and he volunteered to help with data storage.

Alexander S. Szalay is a well-regarded astronomer, but he hasn't peered through a telescope in nearly a decade. Instead, the professor of physics and astronomy at the Johns Hopkins University learned how to write software code, build computer servers, and stitch millions of digital telescope images into a sweeping panorama of the universe.

Along the way, thanks to a friendship with a prominent computer scientist, he helped reinvent the way astronomy is studied, guiding it from a largely solo pursuit to a discipline in which sharing is the norm.

One of the most difficult tasks has been changing attitudes to encourage large-scale collaborations. Not every astronomer has been happy to give up those solo telescope sessions. "To be alone with the universe is a very dramatic thing to do," admits Mr. Szalay, who spent years selling the idea of pooling telescope images online to his colleagues.

Today, data sharing in astronomy isn't just among professors. Amateurs are invited into the data sets through friendly Web interfaces. A schoolteacher in Holland recently made a major discovery, an unusual gas cloud that might help explain the life cycle of quasars (bright centers of distant galaxies), after spending part of her summer vacation gazing at the objects on her computer screen.

Crowd Science, as it might be called, is taking hold in several other disciplines, such as biology, and is rising rapidly in oceanography and a range of environmental sciences. "Crowdsourcing is a natural solution to many of the problems that scientists are dealing with that involve massive amounts of data," says Haym Hirsh, director of the Division of Information and Intelligent Systems at the National Science Foundation. Findings have just grown too voluminous and complex for traditional methods, which consisted of storing numbers in spreadsheets to be read by one person, says Edward Lazowska, a computer scientist and director of the University of Washington eScience Institute. So vast data-storage warehouses, accessible to many researchers, are going up in several scholarly fields to try to keep track of the wealth of information.

Persuading scientists to fully embrace the age of big data, though, will require a change in academic reward structures to give new currency to papers with more authors than ever and to scientists who spend their careers crunching other people's numbers.

"The culture shift is the sharing of data," says Mr. Lazowska. "And the astronomers have led the way."

Astronomy Rebooted

Mr. Szalay's unusual career began with a stint as a rock star. While in graduate school in Hungary, he played lead guitar in the band Panta Rhei, which released two albums and several singles in the 1970s. "I wouldn't call ourselves 'stars,' but we were pretty well received," he says, modestly. "We toured Germany, we went to Poland and Czechoslovakia."

Their sound was decidedly nerd rock—lots of plinky synthesizers and broken rhythms. The synthesizers were home-built. "In Communist Hungary you couldn't buy anything—you had to build things on your own," Mr. Szalay says. The willingness to tinker would become a hallmark of his career.

Mr. Szalay left the band to focus on his academic career after landing a postdoctorate position at the University of California at Berkeley, making solitary visits to telescopes as many astronomers did.

He wound up at Johns Hopkins, where he has been for most of the last 23 years.

Then in 1992 came the project that would change his career. Johns Hopkins joined the Sloan Digital Sky Survey project, a computerized snapshot of the heavens.

Mr. Szalay signed up to lead the design and building of the archive, even though he knew nothing about the technology of data storage. His research interests drove his decision to jump in: He was hoping to better understand the Big Bang by looking at the distribution of galaxies in the universe.

"I needed a lot of data that was well organized so I could easily apply statistical tools to it," he said. It was such an enormous task, though, that he promised his departmental colleagues he would devote all his time to the sky survey and put aside any of his own trips to observatories. "I thought, OK, this is going to be six to eight years, I can deal with it," he said. "It turned out to be 18 years."

A Geeky Guide

A couple of years after Mr. Szalay joined the project, a colleague introduced him to Jim Gray, who was a kind of rock star himself in the computer-science world. Wired magazine once wrote that the programmer's work had made possible ATMs, electronic tickets, and other wonders of modern life.

When Mr. Szalay met him, Mr. Gray was a technical fellow at Microsoft Research and was looking for enormous sets of numbers to place in the databases he was designing.

The two men formed an instant friendship, and decided they had a lot to learn from each other.

"So I taught him astronomy, and he really turned into a very good astronomer—he became a card-carrying member of the community," says Mr. Szalay.

And Mr. Gray taught the astronomer computer science. Mr. Szalay has now published so many papers about his work on databases that he has a joint appointment in the computer-science department at Hopkins.

As the sky survey matured, though, many traditional stargazers remained skeptical.

"The astronomical community did not believe we would ever really make the data public," says Mr. Szalay. The typical practice in the mid-1990s was to guard data because it was so difficult to get telescope time, and scholars did not want to get scooped on an analysis of something they gathered.

One incident demonstrates the mood at the time. A young astronomer saw a data set in a published journal and wanted to reanalyze it, so he asked his colleague for the numbers. The scholar who published the paper refused, so the junior scholar took the published scatterplot, guessed the numbers, and published his own analysis. The original scholar was so upset that he called for the second journal to retract the young scholar's paper.

Mr. Szalay said that astronomers changed their minds once the first big data sets hit the Web, starting with some images from NASA, followed by the official release of the first Sloan survey results in 2000.

"Once they saw the first data release, and they also saw that it was easy to use, I think they started turning around," he said.

And Mr. Szalay and Mr. Gray spoke at many astronomy conferences, presenting a list of 20 questions that could be answered only with large, shared data sets, to try to win support for the approach. They felt they were onto something that would have an impact far beyond astronomy.

"We realized that this is the new way of doing science," said Mr. Szalay. "Computers are becoming a new kind of instrument."

Lost at Sea

In 2007 tragedy ended their long partnership. Mr. Gray set out from San Francisco on a solo trip on his 40-foot sailboat and did not return.

His friends in computer science and astronomy quickly mobilized what has become a legendary search effort, taking their ideas about crowdsourcing to a new level in the process.

The scientists, along with tech-industry leaders whom Mr. Gray had mentored in the past, offered to help the Coast Guard search the open sea using any technology they could think of. Google executives and others helped provide fresh satellite images of the area. And an official at Amazon used the company's servers to send those satellite images to volunteers—more than 12,000 of them stepped forward—who scanned them for any sign of the lost researcher.

Mr. Szalay and his son, Tamas, wrote software that would make the satellite images clearer and led a parallel analysis with researchers who volunteered via e-mail.

But Jim Gray was never found.

Some of the techniques that the astronomer learned from the search effort, though, have now been incorporated into a Web site that invites anyone to help categorize images from the Sloan Digital Sky Survey.

It's called Galaxy Zoo, and it's led by Chris Lintott, an astronomer at the University of Oxford.

Just click "classify galaxies" on the Galaxy Zoo Web site, and a picture from a telescope appears, along with questions including "Is the galaxy smooth and rounded?" and "Does the galaxy have a mostly clumpy appearance?" Visitors must register and complete a short tutorial before their results are counted. Each image is shown to at least 10 different people to try to cut down on erroneous classifications. If 80 percent of the crowd agrees on a classification of an image, it sticks. Otherwise, the image might go through the whole process again.
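The voting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Galaxy Zoo's actual code: the function name `consensus`, the constants, and the label strings are hypothetical, and the sketch treats "10 different people" and "80 percent" as simple thresholds on a list of votes.

```python
from collections import Counter

# Thresholds taken from the article's description; the names are hypothetical.
MIN_VOTES = 10     # each image is shown to at least 10 different people
AGREEMENT = 0.80   # a classification sticks if 80 percent of the crowd agrees


def consensus(votes, min_votes=MIN_VOTES, agreement=AGREEMENT):
    """Return the winning label, or None if the image needs another round."""
    if len(votes) < min_votes:
        return None  # not enough volunteers have classified this image yet
    # Find the most common label and how many volunteers chose it.
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= agreement:
        return label  # the crowd agrees strongly enough
    return None  # no clear majority; the image might go through the process again


if __name__ == "__main__":
    print(consensus(["smooth"] * 9 + ["clumpy"]))      # strong agreement
    print(consensus(["smooth"] * 7 + ["clumpy"] * 3))  # falls short of 80 percent
```

In this sketch an image with nine "smooth" votes out of ten is settled, while a 7-to-3 split is sent back for more classifications.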

"It's not some fun game online while the scientists do the real work," says Mr. Lintott. "I hope visitors are learning that science is not just something done by people in lab coats in some underground bunkers. Science is something people can get involved in."

The number of volunteers surprised the organizers. "The server caught fire a couple of hours after we opened it" in July 2007, Mr. Lintott said, burning out from overuse. More than 270,000 people have signed up to classify galaxies so far.

One of them is Hanny van Arkel, a schoolteacher in Holland, who found out about the site after her favorite musician, Brian May, guitarist for the rock group Queen, wrote about it on his blog.

After clicking around on Galaxy Zoo for a while one summer, she landed on an image with what she describes as a "very bright blue spot" on it. "I read the tutorial and there was nothing about a blue spot," she says, so she posted a note to the site's forums. "I was just really wondering, What is this?"

Her curiosity paid off.

Scientists now believe the spot is a highly unusual gas cloud that could help explain the life cycle of quasars. The Hubble telescope was recently pointed at the object, now nicknamed "Hanny's Voorwerp," after the Dutch word for object.

Astronomers have published papers about the discovery, listing Ms. van Arkel as a co-author. "Don't ask me to explain them to you, but I am a co-author of them," she says with a laugh.

Now other disciplines have approached Galaxy Zoo to find out how they can use the approach.

Gene Wikis

Astronomy is just one of many disciplines being reshaped by a data explosion. Bioscientists have found that decoding entire genomes also meant cultural shifts for their profession. Again, persuading professors to take the time to share proved to be a challenge.

A case in point is a project to create a genetic road map using the same wiki platform that supports Wikipedia.

It started under the name of GenMAPP, or Gene Map Annotator and Pathway Profiler. Participation rates were low at first because researchers had little incentive to format their findings and add them to the project. Tenure decisions are based on the number of articles published, not the amount of helpful material placed online. "The academic system is not set up to reward the sharing of the most usable aspects of the data," said Alexander Pico, bioinformatics group leader and software engineer at the Gladstone Institute of Cardiovascular Disease.

In 2007, Mr. Pico, a developer for GenMAPP, and his colleagues added an easy-to-edit wiki to the project (making it less time-consuming to participate) and allowed researchers to mark their gene pathways as private until they had published their findings in academic journals (alleviating concerns that they would be pre-empting their published research). Since then, participation has grown quickly, in part because more researchers, and even some pharmaceutical companies, are realizing that genetic information is truly useful only when aggregated.

"There's a sort of a call to action in the biology community right now toward sharing data in usable formats and usable ways," says Mr. Pico. But he admits some in the field are still skeptical that sharing will become the norm.

Another gush of data is happening deep in the Pacific Ocean, as a series of thousands of sensors strung along an underwater fiber-optic cable, along with new self-guided mobile sensors that can beam back data, promises to make oceanography the next field to embrace the data revolution and a crowd approach.

Mr. Lazowska, the computer scientist at the University of Washington who focuses on data-driven science, says that at the moment oceanography is "expeditional," meaning that data are hard to come by because only a few organizations can afford the equipment to probe the depths. But new technologies, like those mobile sensors, promise to pipe in more data than scientists can manage without a shared database, like what the Sloan project did for astronomy.

"In oceanography the individual investigator tends to be king or queen—it's individual papers that really determine how one proceeds in the field," said John Orcutt, a professor of geophysics at the Scripps Institution of Oceanography at the University of California at San Diego. "Generally there haven't been big data undertakings in the past, but there are many pressures now that are forcing that change, and I believe we're moving toward a different sort of world."

Major issues remain unresolved. As data continue to grow at an ever-more-rapid pace, more efficient ways to store and process the information will be needed. Computer algorithms will play an increasing role, too, so that robot scientists can do some classifying, perhaps checked by human volunteers.

Mr. Szalay spends much of his time trying to build faster servers to handle all that telescope data.

He's involved with a new project, the National Virtual Observatory, which will link many large telescope data sets that have emerged in recent years.

And he is focused on training the next generation of astronomers to become card-carrying computer scientists—to learn as much about mapping data as mapping the heavens. They will need such training, he argues, to master a new paradigm of science and answer the universe's biggest questions.

Comments

1. arrive2__net - May 30, 2010 at 03:00 am

This article seems to describe what may become the trend of the future in scientific pursuits that involve huge data sets. Once it was difficult to get access to data; now the problem is more along the lines of being able to process the immense supply of it. The field of statistics is itself a still growing technology, but finding ways to integrate complex computer and/or human processing, as suggested in the article, is a whole new ballgame that promises to open up whole new complex ways of understanding nature and solving problems. Good article.

Bernard Schuster
Arrive2.net

2. lexalexander - June 01, 2010 at 11:43 am

I'm a little surprised that an article on Crowd Science in general and astronomy in particular wouldn't mention the SETI@home project, based at http://setiathome.ssl.berkeley.edu. I've been running SETI on my computers for more than a decade, making me one of tens of thousands of people helping analyze masses of data collected by radio telescopes.

Two bonuses: 1) the program screensaver looks really cool; 2) the freeware that Berkeley gives participants to analyze data is called BOINC, as in "Scientific Progress Goes 'Boink'".

3. recurver - June 01, 2010 at 03:51 pm

Yeah, SETI is an interestingly conspicuous omission.

4. captaink - June 01, 2010 at 06:02 pm

SETI@home is a great tool, but it is not the same as Citizen Science (called "Crowd Science" in this article), which involves active engagement and participation in science through the application of human computation (including cognition, pattern recognition, and anomaly detection). Citizen Science is much more than the passive use of a screensaver program. Check out http://zooniverse.org/

5. landrumkelly - June 02, 2010 at 05:20 am

If only Newton, Planck, and Einstein had had any idea of the possibilities inhering in not going it alone.

6. lexalexander - June 02, 2010 at 08:18 am

captaink: True, but SETI and other distributed-computing projects created the foundation for this approach -- and not only in science. In journalism, for example, Josh Marshall's Talking Points Memo blog relied on its readers to pin their respective congresscritters down in 2005 with respect to their positions on privatizing Social Security. (Although not science, this might be a closer parallel to Citizen Science than SETI is, now that I think about it.)

7. 11182967 - June 02, 2010 at 09:12 am

If I recall the story correctly, Copernicus had problems dislodging data from Tycho Brahe's nephew and heir. So walls between analysts/theorizers and experimenters/observers are nothing new in astronomy.

8. dnewton137 - June 02, 2010 at 08:36 pm

I find it amusing, and a bit dismaying, that some have just now discovered a new way of doing science, have invented a novel name, "Crowd Science," and have attributed it to astronomy and the emergence of large data bases. They're a century or two behind the times!

There are a couple of truisms that have been commonly appreciated by physicists for at least the last half century or so (in my own experience). They are, "Physics is a social science," and "Physics is a contact sport." They well describe the fact that physics is commonly done by exposing new ideas or experimental results to the vigorous criticism of many colleagues, and that these enter the canon if, and only if, they survive this competitive process. The myth of a genius who emerges from his/her attic with a "lone discovery" fully ready for adoption into the canon of the field is just that, a myth.

It is true that the development of fields like experimental elementary particle physics, which involve large-scale collaborative work by many physicists, and which incidentally yield very large data bases, has yielded published reports with dozens, even hundreds, of authors. But that has been true for several decades.

I don't know how things are progressing in other disciplines, but I suspect that something similar exists in them, and that astronomy is not leading the way to a new form of science, "Crowd Science."

9. cdwickstrom - June 04, 2010 at 09:39 am

Would that we taught the new generations of scientists and scholars across all disciplines the power of sharing earlier in their academic careers. But, we adhere to the 19th century notions of the solitary scientist in the lonely laboratory in our education of doctoral students, leaving the discovery to post-doctoral experience. While collaboration in research can be messy at times, it frequently generates exciting results in far less time than solitary effort. I need only point to the rarity of single author publication in peer reviewed journals to make my point.

10. piskie54 - June 08, 2010 at 11:47 pm

Some fascinating implications here - not least being copyright! It reminds me of the many "primitive" (traditional) societies where the concept of individual ownership is quite alien and all property and ideas belong to the community. The breaking down of barriers where, e.g., physicists only talk to other physicists but not to biologists can only lead to an explosion of new ideas and new interpretations of 'problems'. Are we at last seeing the end of the "greed is good" philosophy, leading to a hope of a new community consciousness and caring for each other and the earth we live on? (It is the dawning of the Age of Aquarius...)

11. richardmitnick - June 13, 2010 at 01:50 pm

While the scientific activity described in the article is different than that of SETI@home, still it is important for people to know about all of the wonderful scientific projects at august institutions and universities around the world that are running on the BOINC software which came out of SETI. There are projects in Mathematics, Computing, Earth Sciences, Astronomy, Physics, Chemistry, Biology and Medicine.

A special set of projects is managed by World Community Grid (WCG). These projects deal with Cancer, AIDS, Dengue Fever, Muscular Dystrophy.

One WCG Cancer project, at the Cancer Institute of New Jersey, in New Brunswick, NJ, produced a data base which reduced tissue typing on 1 PC from 137 years to 1 day.

We "crunchers" have saved lab scientists thousands of hours of lab time. We have combined to produce what is in some cases the largest supercomputer in the world.
