David Glenn’s Chronicle article on using course sequence grades to estimate teacher quality in higher education illustrates a crucial flaw in the way education researchers often think about the role of evidence in education practice.
The article cites a recent study of Calculus grades at the Air Force Academy. All students there are required to take Calculus I and II. They’re randomly assigned to instructors who use the same syllabus. Students all take the same final, which is collectively graded by a pool of instructors. These unusual circumstances control for many external factors that might otherwise complicate an analysis of teacher quality.
The researchers found that students taught by permanent faculty got worse grades in Calculus I than students taught by short-term faculty. But the pattern reversed when those students went on to Calculus II—those taught by full-time faculty earned better grades in the more advanced course, suggesting that short-term faculty might have been “teaching to the test” at the expense of deeper conceptual understanding. Students taught by full-time faculty were also more likely to enroll in upper-level math in their junior and senior years. In addition, the study found that student course evaluations were positively correlated with grades in Calculus I but negatively correlated with grades in Calculus II.
All of which suggests that analyzing course-sequence grades is a fruitful way to evaluate the quality of teaching in higher education. There are a lot of lower-division undergraduates out there taking a relatively small number of core courses in a predictable order. Yet a number of university-based experts quoted in the article voiced deep skepticism about the idea, essentially arguing that without pristine Air Force Academy-like conditions, one couldn’t adequately control for external factors and produce a reliable estimate of teacher effects.
Here’s the problem: There’s a huge difference between the minimum standards of accuracy necessary for information to be valid as scholarship and the minimum accuracy necessary for it to be useful for making decisions about running a college.
The former is much greater than the latter, and rightly so: There should be a high bar for findings to enter the canon of human knowledge. But if you’re trying to evaluate teacher effectiveness for the purposes of deciding who is most likely to help students learn, the information only needs to be accurate enough that the decisions you make are likely to be better than those you would have made without it, and that’s all. Suppose, for example, you had to choose between hiring Teacher A and Teacher B, and you had evidence that Teacher A was much more effective, evidence that met the P < .10 standard but not P < .05. That evidence might not be good enough to get into a peer-reviewed journal, but you’d be an idiot to ignore it in choosing whom to hire. That’s because while evidence of teacher effects can theoretically wait forever until it’s good enough to enter the scholarly record, someone needs to be hired for teaching today.
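To make the hiring example concrete, here is a minimal sketch of a simulation. Everything in it is an assumption made for illustration: the function name `simulate_hiring`, the size of the true gap between the two teachers, and the amount of noise in the evidence are invented numbers, not figures from the Air Force study. It simply compares how often you hire the better teacher when you follow noisy evidence with how often you do so by flipping a coin.

```python
import random


def simulate_hiring(trials=20000, true_gap=0.3, noise_sd=1.0, seed=0):
    """Return (share_right_with_evidence, share_right_by_coin_flip).

    Hypothetical setup: Teacher A is truly better than Teacher B by
    `true_gap`, but we only observe each teacher's effectiveness with
    random error of standard deviation `noise_sd`. All values are made up.
    """
    rng = random.Random(seed)
    right_with_evidence = 0
    right_by_coin_flip = 0
    for _ in range(trials):
        # Noisy estimates of each teacher's effectiveness.
        estimate_a = true_gap + rng.gauss(0, noise_sd)
        estimate_b = 0.0 + rng.gauss(0, noise_sd)
        # Decision rule 1: hire whoever looks better on the noisy evidence.
        if estimate_a > estimate_b:
            right_with_evidence += 1
        # Decision rule 2: ignore the evidence and flip a coin.
        if rng.random() < 0.5:
            right_by_coin_flip += 1
    return right_with_evidence / trials, right_by_coin_flip / trials


if __name__ == "__main__":
    with_evidence, coin_flip = simulate_hiring()
    print(f"Hired the better teacher using noisy evidence: {with_evidence:.1%}")
    print(f"Hired the better teacher by coin flip:         {coin_flip:.1%}")
```

With these made-up settings the evidence is far too noisy to settle any single comparison, yet following it still picks the better teacher more often than chance does. That is the whole standard a hiring decision has to meet.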
Yet college hiring and promotion standards are weirdly dichotomous when it comes to accuracy and evidence. In some respects they’re overly biased toward accuracy at the expense of relevance, as with the use of student evaluations, a presumably accurate measurement of student opinions that, per the Air Force study and others, may very well signal the opposite of teacher quality. They also use scholarly publishing and citation records, which have nothing to do with teaching but are easy to count. These are then combined with factors like “collegiality” that are so wildly subjective and non-empirical that they can’t even be talked about in the same way. Meanwhile, course-sequence grade data that’s literally just sitting there for the taking is ignored.
In other words, you’re better off using reasonably accurate information about the right thing than extremely accurate information about the wrong thing. And if you step back for a minute and think about how all the day-to-day decisions driving well-functioning organizations are made, they all flow from this common-sense approach. But because universities correctly apply a very stringent standard of accuracy to their scholarship, they’re ignoring information useful for their teaching and operating sub-optimally as a result.



2 Responses to Measuring College-Teacher Quality
nordicexpat - January 13, 2011 at 3:50 pm
If I remember correctly (the study did come out a while ago and the article is blocked), the problem with that study is that it used performance in one class as a proxy for how well material was learned in another. That “might” be common sense, but the problem with the assumption is that lots of other factors could be responsible for the different outcomes in Calculus II (effort, for one: students who did badly in Calculus I may have simply tried harder in Calculus II; students who did well in Calculus I may have coasted in Calculus II). Did the study control for student motivation and other factors? Did they test directly whether the students who received a lower grade in Calculus I retained more information than those who did well by giving a diagnostic exam to all students at the beginning of Calculus II? Do they know for sure which teachers “taught to the test” and which did not in Calculus I? Couldn’t you also say that the fault lies in the test used in Calculus I, since it obviously didn’t measure what it was supposed to measure, if the study is correct? And if we can’t trust outcomes measured by a test in Calculus I, why should we trust outcomes measured by a test in Calculus II?
(These questions are meant sincerely, by the by. I’m not in principle opposed to the idea.)
v8573254 - January 13, 2011 at 5:31 pm
I assume the sample remained the same for both courses.