Chicago
Beginning in 2011, the 331 universities that participate in the Voluntary System of Accountability will be expected to publicly report their students' performance on one of three national tests of college-level learning.
But at least one of those three tests—the Collegiate Learning Assessment, or CLA—isn't quite ready to be used as a tool of public accountability, a scholar suggested here on Tuesday during the annual meeting of the Association for Institutional Research.
Braden J. Hosch is director of institutional research and assessment at Central Connecticut State University, one of four institutions in that state that participate in the accountability system. He scrutinized the performance of students on the CLA at his institution over a three-year period and discovered something that made him queasy: Students' performance on the test was strongly correlated with how long they spent taking it.
To understand why that pattern might make an assessment specialist uncomfortable, here is a brief primer on the test. In the CLA, students write essays or memoranda in response to material that they haven't seen previously. The goal is to measure students' skills in critical thinking, problem solving, analytical reasoning, and writing.
The CLA is one of three tests that have been endorsed by the Voluntary System of Accountability, a three-year-old effort by public, four-year universities to supply basic, comparable information on the undergraduate student experience online. Besides the CLA, which is sponsored by the Council for Aid to Education, other tests that participants in the voluntary system may use are the Collegiate Assessment of Academic Proficiency, from ACT Inc., and the Measure of Academic Proficiency and Progress, offered by the Educational Testing Service. The paper that Mr. Hosch presented here on Tuesday concerns only the CLA, and not the other two tests.
Colleges that participate in the CLA typically administer the test to approximately 100 first-year students and to approximately 100 seniors each year. If the seniors' scores are higher than those of the first-year students, that is taken as evidence that students at that college gain fundamental skills while they are there.
The test has sometimes been criticized for relying on a cross-sectional system rather than a longitudinal model, in which the same students would be tested in their first and fourth years of college. The test's creators say that the cross-sectional model is valid and that a longitudinal model would be severely cumbersome for most colleges, because many students transfer or take longer than four years to graduate. Several papers that defend the CLA's framework can be found at the project's Web site.
Mixed Motives
But beyond that basic question of design, there have long been concerns about just how motivated students are to perform well on the CLA. Why sit there and carefully craft an essay, after all, if there is no particular reward or punishment for your performance?
At Mr. Hosch's university, freshmen are often recruited to take the test in conjunction with a "first-year experience" course that all students take. "But across the sections of that course, there's a lot of variation in how instructors approach it," Mr. Hosch said. "Some instructors really integrate the CLA into the course, and ask students to write reflective essays after they take the test. Others just say casually, Hey, here's a test you can take to get five points of extra credit."
Seniors at Central Connecticut State, meanwhile, are recruited to take the test through entirely different mechanisms. Mr. Hosch and his colleagues originally tried sending e-mail messages that appealed to students' sense of institutional loyalty: Help us improve our curriculum and instruction, they said. But that approach yielded a grand total of zero students after six weeks. So the university instead turned to low-level bribery. Seniors who volunteer to take the test now have their $40 cap-and-gown fees waived.
With those very different motivations, will students actually take the test seriously as they sit there drafting their essays? Or, as Mr. Hosch put it, "If you're a senior distracted by the end of the year, and your cellphone rings 20 minutes into the test, do you just pack up and walk away?"
Most cohorts of students at Central Connecticut State have apparently done well on the CLA. And one cohort—seniors who took the test in the spring of 2009—did remarkably well, with a mean score at the 98th percentile of all CLA test-takers nationwide. (That percentile figure is an "adjusted" score, taking into account the average SAT scores of Central Connecticut State students.)
Why did that cohort do so well? One answer appears to be that they spent an average of 63 minutes taking the test, up from 45 minutes for the previous year's crop of seniors.
And why was that? Did the 2009 cohort happen to be a more motivated, conscientious bunch? Were the test items more engaging? Did the test proctor say something different that year at the beginning of the test?
The Time Factor
No one knows. But the pattern was consistent across all of the cohorts that Mr. Hosch studied: The longer the students spent on the task, the higher their average scores.
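For readers who want to see what such a relationship looks like numerically, the following is a minimal sketch in Python of a Pearson correlation between time on task and score. The numbers are hypothetical, invented purely for illustration; they are not Mr. Hosch's data, and the article does not say which statistic he used.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = mean(xs), mean(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data (assumption): minutes spent and scores for six students.
minutes_on_test = [20, 35, 45, 50, 63, 75]
cla_scores = [980, 1040, 1100, 1090, 1180, 1210]

r = pearson_r(minutes_on_test, cla_scores)
print(f"r = {r:.2f}")
```

A value of r near 1 would indicate that longer test-taking times go with higher scores; it would not, by itself, show that the extra time caused the higher scores.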
And that is what worries Mr. Hosch. The CLA is a worthy effort, he said, but it should not be used for high-stakes accountability programs until colleges get a better handle on making sure that students who take the test are representative of the entire student body and that they devote roughly equal amounts of effort to the test.
"I'm not suggesting that we give up on the CLA," Mr. Hosch said. "I'm not suggesting that we give up on measuring student learning. But I do think we should acknowledge that test scores are related to time spent on the test, and I think we should research that further."
The simplest solution, Mr. Hosch said, would be to motivate students by making the CLA a truly high-stakes test—something that really mattered for their grade-point average or their graduation. But Mr. Hosch said that approach would be a serious mistake. "A high-stakes assessment is not the way we want to go," he said.
Among other things, Mr. Hosch suggested that small groups of similar colleges should create consortia for measuring student learning. For example, five liberal-arts colleges might create a common pool of faculty members that would evaluate senior theses from all five colleges. "That wouldn't be a national measure," Mr. Hosch said, "but it would be much more authentic."
'Raising the Stakes'
In an e-mail message to The Chronicle on Tuesday, Richard Shavelson, a professor of education at Stanford University and one of the CLA's creators, conceded that students' motivation is related to their performance on the test.
But he added that at the institutional level, those variations in motivation tend to wash out, so that it is still valid to use the test to assess a college's general level of learning. (Jeffrey T. Steedle, a graduate student at Stanford, presented evidence to that effect last month at the American Educational Research Association's conference.)
"Braden is correct to point out that motivation is critical and a big concern in low-stakes testing and can affect individual students' test scores," Mr. Shavelson said. "The challenge confronting higher education is for institutions to address the recruitment and motivation issues if they are to get useful data. From my perspective, we need to integrate assessment into teaching and learning as part of students' programs of study, thereby raising the stakes a bit while enhancing motivation of both students and faculty. (Incidentally, we find that some faculty do not support assessment programs and convey their feelings to students as well.)"
Richard B. Arum, a professor of sociology at New York University who has studied the CLA, said in an e-mail that he was not surprised by Mr. Hosch's findings. And he said that he shared Mr. Hosch's concerns about using the CLA in public accountability regimes.
"I do agree with his central point that it would not be prudent to move to an accountability system based on cross-sectional assessments of freshmen and seniors at an institution," said Mr. Arum, who is an author, with Josipa Roksa, of Academically Adrift: Limited Learning on College Campuses, forthcoming from the University of Chicago Press.
Mr. Hosch's paper and related materials are available at his Web site.

Comments
1. fullprof99 - June 02, 2010 at 05:31 am
You could make the same argument regarding most college essay exams: the students who take the longest usually do best. But then, they're usually the best (most serious and attentive) students.
2. jeff1 - June 02, 2010 at 07:51 am
This is an interesting short article. I am not surprised in the least that CLA supporters would maintain that it remains a good test. The cross-sectional approach is not as good as a cohort model tracking students across time (with all due respect to all the articles on the CLA site). That said, if an institution uses the same methodology over time to test first-year and senior students, then it should yield some useful results upon which improvements could be based. Will it yield public accountability measures? I think not.
3. educ_program_analyst - June 02, 2010 at 09:30 am
My university has used a similar kind of test -- the College BASE -- for over 20 years to assess general education. My research has found that time spent taking the test accounts for as much of the variance in score as an overall measure of prior academic knowledge (ACT scores, specifically). We are participating in the VSA and will be using ACT's CAAP test. Last summer I shared my research with ACT and asked them to replicate the model. The CAAP actually asks testers how hard they felt they worked on the test, which should function well as a proxy for motivation. [I've put my request to ACT online at http://www.uwgb.edu/oira/reports/ACTCAAP.docx] CAAP has not responded to my request. This is a tremendously important problem. Without controlling specifically for motivation or effort, schools' scores will reflect the testing environment more than the learning environment. The results schools will report on the VSA documents will indicate which ones put together the best testing protocol, as opposed to which schools delivered the best curriculum. I don't think this is what students, parents and the public really want and need to know. Anybody interested in this topic should start by reading Daniel Koretz's "Measuring Up: What Educational Testing Really Tells Us" (2008, Harvard Univ. Press). It's really quite sobering when considered as a critique of the VSA movement.
4. princeton67 - June 02, 2010 at 09:41 am
"...did remarkably well, with a mean score at the 98th percentile of all CLA test-takers nationwide. (That percentile figure is an "adjusted" score, taking into account the average SAT scores of Central Connecticut State students.)"
Of what exactly does this "adjustment" consist? For example, does a student who got a 1000 SAT and an 80 (raw score, out of 100) on the CLA have a higher "adjusted" percentile score than a student who achieved a 1400 SAT and also an 80 CLA??? I would think "yes", given that the CLA apparently weighs academic improvement more than achievement.
5. intered - June 02, 2010 at 11:05 am
Mr. Glenn's observations have scientific merit, especially in going to the issue of motivation. In measurement science, the theoretical presupposition of most such tests is that examinees are performing as well as they can. While this presupposition was generally valid in the 1950's, it has steadily eroded since then with changes in the culture. Today, and especially with graduating seniors who would not have taken the test were it not for the $40 consideration, one can -- and should -- assume that the primary motivation is to earn the $40 as efficiently as possible (i.e., least effort for return).
Second, cross-sectional research suffers serious logical and empirical limitations as to the nature and scope of inferences that can be drawn by a rational person. Logically, causality cannot be inferred. Hinting about causality, as is done by this test maker, doesn't get around the problem. Empirically, the assumption is that changes observed via a simple pre/post model spanning four years in the lives of 22 year olds can be attributed to interventions (i.e., life) that consumed an average of 2-4 hours of their day and should not be attributed to interventions that took place during the other 20-22 hours. There are ethical issues here.
Third, and most damning, is the fact that most attempts to conceptualize, operationalize, and measure 'critical thinking' are either philosophically or scientifically corrupt. (See: http://www.intered.com/storage/jiqm/v6n3_4_ct.pdf) What evidence do we have that the construct is philosophically and psychologically sound? What evidence do we have that the instrument is valid? (I analyzed raw data from more than 2,500 records of two of the most popular tests of critical thinking. For one test, the purported scales were scientifically indiscernible. For the other test, the alphas of the so-called "scales" averaged less than 0.30. This is scientific misconduct.)
The real question here is how presumably well-educated college administrators can possess so little understanding of measurement sciences that they would spend money on such unsound and potentially harmful practices. -- Robert W Tucker, President, InterEd, Inc.
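The "alphas" cited in this comment are presumably Cronbach's alpha, a standard internal-consistency statistic for a set of items meant to form one scale. As a rough illustration only, here is a minimal Python sketch of the usual formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The scores below are hypothetical and are not the commenter's data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for `rows`, one list of k item scores per examinee."""
    if len(rows) < 2:
        raise ValueError("need at least two examinees")
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("every examinee needs the same k >= 2 item scores")
    # Variance of each item (column) across examinees.
    item_var_sum = sum(variance(col) for col in zip(*rows))
    # Variance of each examinee's total score.
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical scores (assumption): five examinees, four items on one scale.
scores = [
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [1, 2, 1, 3],
    [5, 4, 5, 5],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```

By common rules of thumb, values around 0.70 or higher are treated as acceptable for a scale, which is why an average below 0.30 would be read as very weak.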
6. phildept - June 02, 2010 at 11:23 am
The CLA has serious problems. Some have to do with its lack of relation to persistence toward graduation. We don't use a cohort because half the cohort disappears--but we don't count those disappeared students against the institution. We measure how much better students do with complex thinking tasks four years later--but we don't control well at all for how much better they would do four years later elsewhere, e.g., working at Starbucks or in the Army, even though one would expect them to continue to develop critical thinking skills while they run the obstacle course into today's world between ages 18 and 22. That their materials use the term "higher level thinking skills" suggests the CLA is still being guided by Bloom's Taxonomy with its diminished role for memory work and its overemphasis on cutting thinking up into modules. The CLA abjures ranking institutions against each other (for example, by adjusting scores of freshmen to eliminate pre-existing differences), eviscerating one aspect of accountability. Granted, the CLA sets performance tasks, critiquing and making arguments, which begin with prompts which are lightyears ahead of the artificial exercises typically found in critical thinking courses, in that they come with context and some realistic details (and this may already be having a salutary effect on that subdiscipline). Still, the prompts leave out one crucial thing which would make the students' work more realistic: the prompts do not allow the student to do more research to clarify the issue (since it is yanked out of any real context even though it comes with indicators of what such a context would be) or to really address other relevant arguments. Further, that kind of development, of facility in handling controversies, is just what dropouts may get just as much--or even more--than college students.
These problems betray a larger problem with accountability efforts in general. The attempt to simplify higher ed into something measurable with a set of rubrics requires we have a simple answer to what education is for. That, though, diminishes education, with the result that accountability efforts, as they have since Veblen commented on these matters in 1918, at least so far, diminish the quality of education.
7. wilkenslibrary - June 02, 2010 at 12:03 pm
While some of us think and write more quickly than others, why are we surprised that time on task produces better writing? If I took an hour writing this instead of two minutes, I could undoubtedly improve it. If I put it down and came back to it tomorrow, I would undoubtedly have had additional ideas to communicate and would probably have found more felicitous phrasing. I am unfamiliar with this test, but why would we expect that our students' writing on any test would not improve with more thinking and composing time?
8. lowenstm - June 02, 2010 at 05:32 pm
"From my perspective, we need to integrate assessment into teaching and learning as part of students' programs of study, thereby raising the stakes a bit while enhancing motivation of both students and faculty. (Incidentally, we find that some faculty do not support assessment programs and convey their feelings to students as well.)"
I like that statement that's quoted from Prof. Shavelson. If we could find ways to accomplish this kind of integration everyone would benefit. Assessment in which students are involved with intentionality should enhance learning.
9. 11262324 - June 02, 2010 at 06:33 pm
The testing lobby will always find a way to keep their "tests" in our face. VSA was a carryover from the Bush administration and Margaret Spellings, and that alone is reason not to participate. I agree completely with #8, but the testing lobby would never condone that.
10. 11122741 - June 03, 2010 at 04:52 pm
To Robert Tucker: Robert, Robert, Robert, you need to refresh your E.L. Thorndike blind empirical linear model. What if critical thinking, like creativity and several other variables, is an intermittent (and somewhat chaotic) phenomenon and not the steady-stream continuous phenomenon you implicitly define and believe it to be? Ask any writer, poet, painter, sculptor or even scholar ... each attempt they make is not successful or a masterpiece ... and the same is true of thinking of any kind.
So given this point, why are you surprised that the Cronbach alphas are .3? And given the wide varieties of critical thinking and critical thinking skills, why assume that alpha is even an appropriate criterion for evaluating these tests (which it is not)? For heterogeneous and heteroscedastic tests of intermittent phenomena, the old test-retest reliability coefficient is the first and foremost criterion, and then the old convergent-discriminant validity design. Would you expect an omnibus test of Guilford's intelligence cube to have a high alpha or discernible factor structure? Neither is a fixed requirement of any instrument, and both are only expectations of certain models, views and beliefs about the nature of the phenomenon measured. And I agree a little understanding of measurement is a good thing.
On the CLA, I have shifted back to giving in-class essay exams on 3 occasions, each being 90 minutes, in all undergraduate courses I teach. You can imagine how popular I am and what some of my student evaluations are like. For the past 3 years, I have kept track of the time each student worked on the exam before turning it in, and there is a strong correlation between time and performance, particularly so on the first examination of this kind during the semester, with many freshmen being done in less than 30 minutes and most seniors done in 60 minutes, as the experience is so novel for them. I have yet to have that student who aces the exam in 30 minutes but I am an optimist. The correlation shrinks with each exam, particularly as students find out that 30 minutes isn't going to cut it for a decent grade, nor is unreadable and sloppy work, and that life without a word processor is frightening (only later do I let them use their laptops to answer questions in some courses). The difference in the essays students write in these courses between the first one (which many students tell me is the first in-class essay exam they have ever taken) and the last is truly amazing, as is the increased confidence that students gain in their ability to stand and deliver. I also have them write weekly to develop their skills. Another problem with the CLA, then, I am guessing, is that many "short responders" most likely have little prior experience with writing such essays, and this may also be true of the senior responders unless they are taking courses like mine, which most probably is not the case. So there is probably also a "practice effect/test-wiseness" factor with the CLA that hasn't been investigated (feel free, I have enough studies to do; yes, Rich, that is another dissertation for one of your students).
FYI, after 3 years I now have to put enrollment caps on my undergrad courses, and the thank-you emails I get from students in graduate school and on the job are very satisfying. Of course, maybe the whole thing with the CLA is that it is supposed to be a type and form of testing that all students are unfamiliar with, to neutralize the practice/wiseness effect, but I do not recall reading that in the write-up.
11. intered - June 05, 2010 at 01:50 pm
@11122741 (since you do not identify yourself),
Most of your first paragraph responds to a position that I do not hold, perhaps your soapbox? I would suggest that you not waste your time responding to what you erroneously believe to be my philosophical position and instead deal with the issues. Additionally, recognize that I was attempting to keep this discussion open to the majority of the readers who are not measurement scientists. In retrospect, it was a mistake to have referred to one of my old mentor's statistics as an example. This entire discussion can take place in non-technical language.
Second, the logical heart of your response to my post sounds like gobbledygook. Do I have it correct that you are asserting that the findings produced by the application of precision tools to determine whether or not a test is valid may or may not apply because the phenomena upon which the test was designed to report are heterogeneous, heteroscedastic, and occur intermittently? No problem with the phenomena. You have described much of life. I do, however, have a serious problem with your position that these facts are largely irrelevant to the validity of the test. Key point coming up: Can you provide us with an independent, scientifically valid, method for deciding when to pay attention to such a test's results and when to ignore them because they are . . . (insert technical terms)? If you cannot, you have described a meta-scientific belief system.
@11122741, I have analyzed the validity of many hundreds of non-published and published tests over the last 35 years and I can tell you that tests of critical thinking stand apart in failing to live up to their published claims or, more recently, have couched their claims in lawyer-generated vagueness. This includes test-retest reliability, etc. as well as much more advanced analytics.
In non-technical terms, here are a few facts for others who might be trying to follow this tortured logic.
When a test's publisher sells you an instrument supported by technical documentation that defines specific scales (in ordinary language and in operational terms) and claims that those scales are valid, obligations are created by that transaction, among them: (a) that the application of standard statistical tools for determining the presence and validity of the scales will, if properly applied, confirm this claim, (b) when the correct application of such tools to large datasets secured from the publisher's defined target population fails to find any scales (either no scales at all or entirely different scales), the publisher has lied, engaged in a fraudulent act, or similar, (c) if a scale as defined by the publisher is detected but only at a very weak level, such scales, qua scales for the end-user, are useless and, again, the publisher has misrepresented the facts by calling our attention to such scales as if they were materially meaningful.
My purpose here is not to demean the efforts of those who struggle to define and understand the many things there are to mean by 'critical thinking' (@11122741 would have seen that had he read my 1996 article). Michael Scriven and I spent more than a few years trying to apply his multiple-ranking item tool (a very robust and creative tool, I recommend it to others when the alternative is multiple-choice items) to the assessment of critical thinking in health care professionals. The result might be deemed partially successful, at best. I eventually abandoned the test after about 10,000 administrations because the scoring was so complex we could not place it in non-technical hands.
Again, the CLA may or may not be useful when used as a personal data gathering device in the hands of a motivated and knowledgeable instructor. Beyond that, it appears to be overrepresented.