
What Wikipedia Deletes, and Why

Wikipedia, the online encyclopedia, famously allows anyone to write or revise its entries, and the history of each item is open for anyone to review. The exception is material that leaders of the effort consider too “dangerous” to leave online.

The fine print of its stated practices notes that in some cases, material is completely spiked from the record. Or, as the policy reads: “a revision with libelous content, criminal threats or copyright infringements may be removed afterwards.”

These total redactions are what a University of Pennsylvania research team has been mining for the past year in the hopes of shedding some light on what Wikipedia deletes forever and why. In 2010 redactions accounted for more than 56,000 of the 47.1 million revisions, according to the research team.
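For a sense of scale, those two figures can be turned into a share of all revisions. The short Python sketch below is illustrative only: the variable and function names are assumptions made for this example, not anything from the researchers' work, and because the redaction count is reported as “more than” 56,000, the result is a lower bound.

```python
# Figures quoted above for 2010; the redaction count is a floor ("more than 56,000").
redacted_revisions = 56_000
total_revisions = 47_100_000


def redaction_share(redacted: int, total: int) -> float:
    """Return redacted revisions as a percentage of all revisions."""
    return 100 * redacted / total


share = redaction_share(redacted_revisions, total_revisions)
one_in = round(total_revisions / redacted_revisions)

# Report the share as a percentage and as a "1 in N" ratio.
print(f"{share:.2f}% of revisions, or about 1 in {one_in}")
```

By this rough count, redactions made up a little over a tenth of a percent of the year's revisions.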

The researchers, Andrew G. West and Insup Lee, wondered what content on the enormously popular Web site could be so troubling that Wikipedia administrators would decide to remove it forever. “Wikipedia is that paramount example of open-source transparency,” Mr. Lee said. “So when you see them behaving in a nontransparent manner, you want to see what motivates them to do this.”

Copyright infringement was the most common reason Wikipedia stated for deleting material, Mr. West and Mr. Lee found.

The Wikimedia Foundation has been sued over copyright and privacy issues in the past. While only 0.007 percent of page views of the English Wikipedia site in 2010 displayed content that was later redacted, that’s enough to land the organization and its operators in hot water. That’s why leaders of the encyclopedia refer to the material they redact as “dangerous content.”

“We’ve identified that on the surface these copyright cases are the worst,” said Mr. Lee.

“The research goal for us is, how can we provide some automated way to detect the problems so they can be removed immediately?” Mr. West added. “It’s very difficult to stop people from adding something, but we can find a way to get rid of it quickly.”

The difficulty in identifying instances of plagiarism, the pair said, is evident in the numbers. Most “dangerous content,” such as libel or invasions of privacy, is taken down within two minutes, on average. But copyright-related issues stayed up for an average of 21 days, they found.

Wikipedia’s leaders have recently increased the number of people with the ability to permanently delete text, including entries in the history pages. In May 2010, approximately 40 people held these rights; now more than 1,800 people do, Mr. West and Mr. Lee said.

The larger work force has helped to reduce the amount of dangerous content found on the site, the researchers said. But humans alone won’t solve the problem in its entirety. Sometimes they even introduce problems of their own, removing beneficial revisions while trying to delete dangerous content, which the research team refers to as “collateral damage.” That raises the question of who gets to decide whether something is dangerous content.

“For all the problems on Wikipedia,” Mr. West said, “I feel strongly that the solutions have to be automatic in nature because these attackers increasingly have these machines doing their bidding for them.”

The biggest hurdle the Wikipedia operators need to overcome, in the minds of the research team, is trust. If the encyclopedia hopes to see continued success, that will be the main obstacle, they said.

More on the authors’ Wikipedia redaction research can be viewed in their full paper, “What Wikipedia Deletes: Characterizing Dangerous Collaborative Content.”
