September 1, 2014

The Promise and Peril of Outcomes Assessment

As a result of continuing pressure from regional accrediting associations, state legislatures, and coordinating boards for greater "accountability," increasing numbers of colleges are using some form of outcomes assessment. In many cases those assessments involve objective or standardized testing.

Much of the impetus for the increasing use of standardized testing in higher education and for the development of new instruments for measuring learning has come from widespread recognition of the limitations of traditional course grades, specifically their questionable capacity to reflect change, growth, or improvement, and the noncomparability of grades from institution to institution and from instructor to instructor. Has the A student actually learned more from the course than the C student, or did the A student already know most of the course material when she enrolled? If Professor Jones's students receive mostly A's and Professor Smith's students mostly B's in the same course, does this mean that Professor Jones's students learned more, that Professor Jones is a better teacher, that Professor Jones is an easier grader, that Professor Smith's students are poorer learners, or what?

Unlike course grades, most standardized and other objective tests can be used as yardsticks for comparing students, and most can be used repeatedly to measure change or growth. Some critics of outcomes assessments mistakenly assume that these tests are supposed to be used in the same manner as course grades—as one-time, end-of-course assessments of the individual student's performance. In fact, such assessments are seldom used in that way. Most commonly they are used with groups of students, either to compare two or more groups or to determine how much a given group has improved.

The fact that test scores can be meaningfully compared means that repeated administrations of the same measure (say, at the beginning and at the end of an educational program) can be used to judge how much students have learned or improved. Assuming that the test adequately measures what the program is designed to teach, the information from such assessments represents a tremendous improvement over the information contained in the traditional grade-point average. However, judging from the way most institutions use assessment tools, assessment "experts" and test manufacturers are not doing a very good job of coaching colleges in how to get useful and valid results.

Even many users of the assessments believe, incorrectly, that these tests can be given one time to render valid judgments about the effects of an educational program, or to compare the relative effectiveness or quality of different elementary and secondary schools or colleges. For decades public schools, school districts, and even entire states have been compared by averaging test scores that have been obtained through one-shot assessments of students. "Good" programs, schools, or states are then assumed to be those whose students earn the highest mean scores, and "poor" or "underperforming" programs, schools, or states are presumed to be those whose students get the lowest mean scores.

Such comparisons are potentially both invalid and unfair, since they ignore the entering level of students' performance when they enroll in the program or school. For all we know, some "poor" or "mediocre" schools may be doing an outstanding job educationally, given the low level of their students' performance when they enter, and some of the "best" may be doing a mediocre job, in light of the high level of their students' performance when they enroll.
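The point can be made concrete with a small sketch. The school names and scores below are invented purely for illustration (they are assumptions, not real data), and the helper functions `exit_mean` and `mean_gain` are hypothetical names. The sketch shows how a ranking based on one-shot exit scores can reverse once each student's entering score is taken into account.

```python
from statistics import mean

# Hypothetical scores, invented purely for illustration (not real data).
# Each tuple is (score at entry, score at exit) for one student.
schools = {
    "School A": [(80, 82), (85, 87), (90, 92)],
    "School B": [(50, 62), (55, 68), (60, 71)],
}

def exit_mean(pairs):
    """One-shot view: average of exit scores only."""
    return mean(post for _, post in pairs)

def mean_gain(pairs):
    """Before-and-after view: average of (exit - entry) per student."""
    return mean(post - pre for pre, post in pairs)

for name, pairs in schools.items():
    print(name, "exit mean:", exit_mean(pairs), "mean gain:", mean_gain(pairs))
```

In this made-up case School A looks "better" on exit scores alone, while School B's students improved far more. A real comparison would of course need many more students and a test that measures what the program is designed to teach.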

Colleges face the same challenges when assessment "experts" encourage them to judge the effectiveness of undergraduate programs by using cross-sectional outcomes assessments, that is, one-time tests given without a baseline. This use of the word "outcomes" is a most unfortunate choice, since the term implies some sort of causal relationship: If a student's score is indeed the "outcome" of the educational program being assessed, then the program is supposed to be able to take credit (or blame) for that student's score.

But how much the students' outcome scores represent growth or improvement over where they began is anyone's guess. Indeed, there is scattered evidence suggesting that, when it comes to mathematical competency, American college students show a net decline from the beginning to the end of college.

But even if more colleges use before-and-after assessments to measure change over time, the data are of limited usefulness unless the college has some way of knowing why some students learned more than others. How come Jessica showed a remarkable degree of improvement, while Jason showed little change? To answer such questions, the people conducting the assessments need to gather additional data on each student's particular educational experiences (courses taken, study habits, co-curricular activities, and so on). By associating these different experiences with changes in the students' scores, institutions are in a much better position to strengthen their educational programs.
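As a rough sketch of what associating experiences with score changes might look like, the example below groups students' gains by one recorded experience. Everything in it is an assumption made for illustration: the records, the `study_hours` field, and the helper `mean_gain_by` are hypothetical, and the numbers are invented rather than drawn from any study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, invented for illustration only (not real data).
records = [
    {"student": "Jessica", "pre": 50, "post": 70, "study_hours": "high"},
    {"student": "Jason", "pre": 52, "post": 54, "study_hours": "low"},
    {"student": "Maria", "pre": 60, "post": 76, "study_hours": "high"},
    {"student": "Devon", "pre": 58, "post": 62, "study_hours": "low"},
]

def mean_gain_by(records, field):
    """Average (post - pre) gain for each level of the given experience field."""
    groups = defaultdict(list)
    for r in records:
        groups[r[field]].append(r["post"] - r["pre"])
    return {level: mean(gains) for level, gains in groups.items()}

gains = mean_gain_by(records, "study_hours")
print(gains)
```

A grouping like this only shows an association. With real data, an institution would want far more students and would need to account for other differences between the groups before drawing conclusions about its programs.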

The much-cited study reported in the book Academically Adrift, by Richard Arum and Josipa Roksa, is one of those rare national assessment efforts that actually employs before-and-after testing of the same students over time, as well as information on the students' academic experiences. Although the sweeping conclusions concerning how many students failed to improve their scores over time have been called into question, this longitudinal study made it possible to examine several factors in the undergraduate experience that can affect student learning.

A serious limitation of this particular study, however, is its exclusive reliance on a test of questionable reliability that assesses a relatively narrow set of cognitive skills.

The question of which outcomes get assessed and how they are assessed will always be an issue in any assessment project. Given a traditional undergraduate education's multiple goals, it is absurd to expect that one or two measures are going to cover the territory. Some politicians assume that we should concentrate on basic skills—writing, reading, computing—while many educators value subtler skills, such as creativity and critical and abstract thinking.

But if you read a few college catalogs and mission statements, it becomes clear that many institutions are also committed to cultivating "affective" qualities such as leadership, citizenship, and moral development. In other words, if we are going to conduct outcomes assessments that truly reflect the major goals of a liberal education, then we have to include multiple measures of such diverse qualities.

In short, higher education's current fascination with outcomes assessment is not likely to contribute much to our efforts to strengthen undergraduate education unless institutions insist on using assessment tests on a longitudinal, before-and-after basis, strive to collect data on each student's educational experiences, and are willing to employ a broad array of outcome measures that more fully reflect the diverse goals of a liberal education.

Alexander W. Astin is a professor emeritus of higher education at the University of California at Los Angeles and co-author of Assessment for Excellence: The Philosophy and Practice of Assessment and Evaluation in Higher Education, second edition (Rowman & Littlefield, 2012).
