When people wanted to study language, they used to have two options. They could use a computer to analyze a large body of formal writing, like newspapers. Or they could go out and interview a bunch of people.
Now Twitter offers something fresh for researchers interested in the evolution of language: massive amounts of informal written communication.
Scientists at Carnegie Mellon University are demonstrating the potential of that data in a study that has found that regional dialects are thriving on Twitter. In fact, local slang seems to be evolving within the social-media site.
Some of the findings confirm what you’d expect: People in Northern California tend to say “hella,” for example, to mean “very.” But there are oddities, too, like the spelling of “something.” If you live in New York City, you’re likely to write “suttin,” while people in many other cities tend to write “sumthin.” Northern Californians say “koo” for “cool” in their tweets, while Southern Californians favor “coo.” And people in cities seem more likely than people in rural areas to abbreviate “you” as “yu.” New Yorkers have a coinage all their own: “uu.”
Profanity inspires a whole subgenre of regional Twitterisms. Take the many ways of expressing amusement. “LMAO” means “laughing my a** off”—that’s national, of course. But people around Washington, D.C., where this blog is based, seem to favor “LLS,” for “laughing like sh*t.”
And folks in Philadelphia, Pittsburgh, and Cleveland have a penchant for yet another abbreviation: “CTFU,” meaning “cracking the f**k up.”
“It’s not really used anywhere else,” explains Jacob Eisenstein, a postdoctoral fellow in Carnegie Mellon’s Machine Learning Department.
“There are big regional differences in social media,” he says. “And some of these correspond to things that we know about from spoken language. But other things seem to be completely organic to social media.”
People who use Twitter on their smartphones can geotag their tweets with GPS coordinates. To conduct their study, Mr. Eisenstein and his co-authors culled a week’s worth of Twitter messages published last March. They narrowed those down to geotagged messages from users who posted at least 20 tweets. The result was a database of 330,000 tweets and 9,500 users.
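For readers curious what that narrowing step might look like in practice, here is a minimal sketch in Python. It is not the researchers' code. The record layout (dictionaries with hypothetical "user", "lat", and "lon" keys) and the function name are assumptions made for illustration, as is the choice to count all of a user's messages in the sample, geotagged or not, toward the 20-tweet threshold.

```python
from collections import Counter

def filter_tweets(tweets, min_tweets=20):
    """Keep geotagged tweets written by users with at least `min_tweets` messages.

    Each tweet is assumed to be a dict with a "user" key and optional
    "lat" and "lon" keys (hypothetical field names, not Twitter's API).
    """
    # Assumption: the threshold counts every message a user posted in the sample.
    counts = Counter(t["user"] for t in tweets)
    return [
        t for t in tweets
        if counts[t["user"]] >= min_tweets
        and t.get("lat") is not None
        and t.get("lon") is not None
    ]
```

Calling `filter_tweets(sample)` on a list of such records returns only the geotagged messages from sufficiently active users, which is the kind of database the study describes.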
The researchers wrote a computer program to find patterns in all that text. The technique they developed could predict the whereabouts of a Twitter user in the United States with a median error of roughly 300 miles.
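To make the “median error” figure concrete, the sketch below shows one common way such a number could be computed: measure the great-circle distance between each predicted location and the user's actual location, then take the median. This is an illustration, not the study's method. The haversine formula, the Earth-radius constant, and the function names are assumptions chosen for the example.

```python
import math
from statistics import median

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def median_error_miles(predicted, actual):
    """Median distance between paired (lat, lon) predictions and true locations."""
    return median(
        haversine_miles(p[0], p[1], a[0], a[1])
        for p, a in zip(predicted, actual)
    )
```

A median of roughly 300 miles would mean that half of the users were placed within about that distance of where they actually were, and half were placed farther away.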
Asked for a reaction to the study, Geoffrey Nunberg, an expert on linguistic technologies, sent an e-mail to Wired Campus saying that the research “seems technically and methodologically impressive.”
“But the findings are less than epochal,” says Mr. Nunberg, an adjunct professor at the School of Information at the University of California at Berkeley. “‘Slang may depend on geography more than standard English does’—well, we’ve sort of known that for a while. But I do think that the availability of these huge corpora of tweets and text messages could produce some interesting results in the future.”
Co-authors of the Carnegie Mellon report are Eric P. Xing, an associate professor of machine learning; Noah A. Smith, an assistant professor in the Language Technologies Institute; and Brendan O’Connor, a machine-learning graduate student.
Wired Campus would love to learn about other ways that technology is changing the study of language. If you know of any interesting new research, drop us a note in the comments below.