A Critic Sees Deep Problems in the Doctoral Rankings

Stephen Stigler (photo: Kris Snibbe, Harvard News Office)
September 30, 2010

One scholar who is not impressed by the National Research Council's doctoral study is Stephen M. Stigler, a professor of statistics at the University of Chicago.

Mr. Stigler was invited by the NRC to review early drafts of the methodology guide, and he began to circulate criticisms privately among colleagues in the summer of 2009. This week he posted a public critique of the NRC study on his university's Web site. That statement's bottom line: "Little credence should be given" to the NRC's ranges of rankings.

"Their measures don't distinguish among programs—or at least they don't distinguish in the way that people expect for a study like this," says Mr. Stigler, who is the author of Statistics on the Table: The History of Statistical Concepts and Methods (Harvard University Press, 1999). "Are we No. 10 or No. 15 or No. 20? There's not very much real information about quality in the simple measures they've got."

One of Mr. Stigler's chief concerns is the way the NRC gathered data for its R-rankings. (For a detailed explanation of the NRC's S-rankings and R-rankings, see a list of Frequently Asked Questions.)

To construct its R-rankings, the NRC surveyed faculty members for their opinions (on a scale of 1 to 6) of a sample of programs in each field. But those samples were generally not very large: in many cases, fewer than half of the programs in a field were included. In psychology, for example, only 50 of the 237 doctoral programs were surveyed.

The NRC project's directors say that those small samples are not a problem, because the reputational scores were not converted directly into program assessments. Instead, the scores were used to develop a profile of the kinds of traits that faculty members value in doctoral programs in their field. That profile was then used to assess all programs in the field.

So if the reputational scores implied, for example, that faculty members in sociology admired large, ethnically diverse programs with high GRE scores, then all sociology programs that had those traits tended to do well in the NRC's R-rankings.
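As a rough illustration of that two-step logic, consider the sketch below: trait weights are estimated from a small surveyed subset, then applied to every program in the field. The data, the trait names, and the ordinary least-squares fit are hypothetical stand-ins; the NRC's actual model involved a more elaborate estimation procedure.

```python
# Illustrative only: invented data, with ordinary least squares standing in
# for the NRC's more elaborate estimation procedure.
import numpy as np

rng = np.random.default_rng(0)
n_programs, n_sampled = 237, 50            # e.g., psychology
traits = rng.normal(size=(n_programs, 3))  # hypothetical: size, diversity, GRE

# Reputational scores are observed only for the sampled programs
true_weights = np.array([0.5, 0.3, 0.8])
reputation = traits @ true_weights + rng.normal(scale=0.5, size=n_programs)
sample = rng.choice(n_programs, size=n_sampled, replace=False)

# Step 1: infer the field's implicit trait weights from the surveyed subset
weights, *_ = np.linalg.lstsq(traits[sample], reputation[sample], rcond=None)

# Step 2: apply those weights to every program, surveyed or not, and rank them
scores = traits @ weights
r_ranking = (-scores).argsort().argsort() + 1   # 1 = highest-scoring program
print(weights.round(2), r_ranking[:10])
```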

That system is fine in theory, Mr. Stigler says. But he strongly believes that all programs should have been included in the underlying reputational surveys.

For one thing, Mr. Stigler says, the relationships between programs' reputations and the various program traits are probably not simple and linear. Take GRE scores, for example. If students' average GRE scores fall below a certain level, that might be associated with steep drops in programs' reputations. Or, at the other end of the quality scale, reputations might spike sharply in cases where faculty members' publication-citation rates are above some threshold. In other words, if the relationship between reputation and citations were plotted on a graph, the most accurate representation would be a curve, not a straight line. (The curve would occur at the tipping point where high citation levels make reputations go sky-high.)

But it is impossible to have an accurate picture of those nonlinear relationships, Mr. Stigler says, if only a small minority of programs were included in the reputational survey.
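A small simulation can make the concern concrete. In the hedged sketch below, reputation is assumed to jump once programs cross a citation threshold; a straight line fitted to a small sampled subset then systematically mis-scores the high-citation programs. The numbers, the threshold, and the linear fit are all invented for illustration, not taken from the NRC study.

```python
# Illustrative only: a made-up threshold effect, not actual NRC data.
import numpy as np

rng = np.random.default_rng(1)
citations = rng.uniform(0, 10, size=200)          # citations per faculty member
reputation = (2.0 + 0.2 * citations               # gentle linear trend...
              + 2.5 * (citations > 8)             # ...plus a reputational spike
              + rng.normal(scale=0.3, size=200))

sample = rng.choice(200, size=40, replace=False)  # small reputational survey

# A straight line fitted to the sampled programs cannot capture the spike
slope, intercept = np.polyfit(citations[sample], reputation[sample], 1)
predicted = intercept + slope * citations

high = citations > 8
print(f"mean error, high-citation programs: {(reputation[high] - predicted[high]).mean():.2f}")
print(f"mean error, remaining programs:     {(reputation[~high] - predicted[~high]).mean():.2f}")
```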

And that means in turn, according to Mr. Stigler, that the accuracy of each program's R-ranking depends on whether it was actually included in the reputational survey. If a program was not in the survey, then its R-ranking is based on weights that have "potentially much greater errors than those for other programs," he wrote in a privately circulated critique last year.

The NRC should disclose which programs were and were not included in those reputational surveys, Mr. Stigler says. But there are no plans to do so, the project's committee chairman, Jeremiah P. Ostriker, said in a recent interview.

A 'Paradox'

Mr. Stigler is not much happier with the S-rankings. Those rankings are based on surveys where faculty members were asked directly about which traits are most important to the quality of doctoral programs in their fields.

Many doctoral programs have S-ranking ranges that are very wide. For example, a program's S-ranking range might be 3-18, meaning that the NRC is 90 percent confident that its "true" S-rank is between 3 and 18. That breadth stems partly from variations in how faculty members weighted the traits on the surveys.

But in an analysis done this week, Mr. Stigler noticed that in most fields there actually isn't much variation in how faculty members weighted the various traits: for most traits in most fields, the range of faculty weights is very small.

That presents a paradox, Mr. Stigler says. Tiny differences in faculty weights lead to huge swings in programs' S-rankings. How can that be?

The answer, Mr. Stigler argues, is that the variables in the NRC's study actually aren't very good at making distinctions among programs, especially among programs that are clustered in the middle of the quality spectrum.
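The sketch below, built entirely on invented numbers, illustrates the effect Mr. Stigler describes: when most programs' trait profiles are nearly indistinguishable, even very small perturbations of the weights reshuffle the middle of the ranking, while clearly separated programs at the top and bottom barely move.

```python
# Illustrative only: invented trait data, not the NRC's computation.
import numpy as np

rng = np.random.default_rng(2)
n_programs, n_traits = 100, 5
traits = rng.normal(scale=0.2, size=(n_programs, n_traits))  # tightly clustered
traits[:5] += 2.0    # a few clear leaders
traits[-5:] -= 2.0   # a few clear laggards

def ranks(weights):
    scores = traits @ weights
    return (-scores).argsort().argsort() + 1    # 1 = best

base = np.full(n_traits, 1.0)
# Re-rank under 500 slightly perturbed weightings (about a 5% jitter)
all_ranks = np.array([ranks(base + rng.normal(scale=0.05, size=n_traits))
                      for _ in range(500)])
spread = all_ranks.max(axis=0) - all_ranks.min(axis=0)

print("median rank spread, top 5:   ", np.median(spread[:5]))
print("median rank spread, middle:  ", np.median(spread[5:-5]))
print("median rank spread, bottom 5:", np.median(spread[-5:]))
```

Under these invented assumptions, the rank ranges for the tightly clustered middle programs come out far wider than those at the extremes, which is consistent with the pattern of wide S-ranking ranges that puzzled Mr. Stigler.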

"Their measures, for most of the body of programs, are unable to distinguish between programs," Mr. Stigler says. "They can roughly distinguish, I suppose, between what is in the top half and the lower half of the nation, which is not a major feat."

Mr. Stigler says that it was a mistake for the NRC to so thoroughly abandon the reputational measures it used in its previous doctoral studies, in 1982 and 1995. Reputational surveys are widely criticized, he says, but they do provide a check on certain kinds of quantitative measures. When the new NRC study counts faculty publication rates, it does not offer any information about whether scholars in the field believe those publications are any good. (That's especially true in humanities fields, where the NRC report does not include citation counts.)

"Everybody involved in this was trying hard, and with good intentions and high integrity," Mr. Stigler says. "But once they decided to rule out reputation, they cut off what I consider to be the most useful measure from all past surveys."

In an e-mail message to The Chronicle this week, Mr. Ostriker declined to reply to Mr. Stigler's specific statistical criticisms. But he pointed out that the National Academies explicitly instructed his committee not to use reputational measures.

"Many other groups have collected reputationally based ratings and rankings in the past and continue to do so," Mr. Ostriker said. "I can see the virtue in such efforts, but it was not our task to do this."