Why we should be skeptical about standardized test scores

Tough talk on teacher accountability is all the rage this summer. Trouble is, we don’t know how to handle the perverse incentives that arise the moment we place undue weight on easily manipulated exams. But that hasn’t stopped a slew of education leaders from weighing in on the need to hold teachers’ feet to the fire.

In the past few weeks, D.C. Schools Chancellor Michelle Rhee made headlines for firing 241 teachers, Secretary of Education Arne Duncan gave a major speech on education reform, and Race to the Top finalists were announced for round two, many of which agreed to overhaul their teacher evaluation and tenure systems.

Even President Barack Obama took up the theme of education, weighing in on his administration’s reform agenda for three-quarters of an hour at the National Urban League Centennial Conference – although the president, who relied on teacher-union support in his election, trod carefully.

“I am 110 percent behind our teachers,” Obama said. “But all I’m asking in return – as a President, as a parent, and as a citizen – is some measure of accountability. So even as we applaud teachers for their hard work, we’ve got to make sure we’re seeing results in the classroom.”

The president dismissed educators’ fears that their evaluations would be based on standardized test scores alone.

“Everybody thinks that’s unfair. It is unfair,” Obama said. “But that’s not what Race to the Top is about. What Race to the Top says is, there’s nothing wrong with testing – we just need better tests …”

His remarks reflect a newfound perception that recent progress in New York schools has been mostly a mirage, and that the public trusted in tests that were flawed.

The president is right. Yes, we “just” need better tests. But creating better tests is very hard and very expensive. And in a system as vast and complex as ours, it’ll be tempting to continue using tests that can be graded quickly and that don’t look very different from the ones we now use. But without a radically different approach to standardized testing in this country, we are unlikely to get different results.

Some people seem to believe, however, that we’ve got everything figured out already – that we can precisely measure each teacher’s performance, and that our standardized tests are not just good but infallible.

In this brave new age of accountability, student scores on standardized tests are being used by some districts to decide, in whole or in part, the following: which teachers are first laid off; which teachers are fired; which teachers are rated effective or ineffective; which teachers receive bonuses, and how big those bonuses are; which principals receive bonuses, and how big those bonuses are; which students are required to repeat a year; and which students graduate from high school.

These scores also have been at the center of debates on mayoral control of schools, especially in New York City and Washington, D.C. These cities’ mayors, Michael Bloomberg and Adrian Fenty, respectively, have asked voters to elect and reelect them based on how they run the schools in their cities and how their students perform.

The educational decisions now made in part on standardized test scores are neither few nor inconsequential. This is hardly about who gets a sticker for a job well done, or who gets a slap on the wrist for a student’s substandard performance.

It is worth remembering, then, Campbell’s Law: “the more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

In other words, when important decisions are based on a handful of numbers – like standardized test scores – the numbers soon become unreliable. The incentives to distort the numbers prove irresistible to just about everyone, from mayors seeking reelection and principals hoping for bonuses to teachers wanting to keep their jobs and students longing to graduate.

That policies don’t always play out as planned is a truism we must accept. And though we’ll often fail to foresee unintended consequences, we shouldn’t stop trying to predict – and correct for – them.

A concrete example of unintended and unforeseen consequences will help illustrate this point.

Bus drivers in Santiago, Chile, are paid in one of two ways: either a fixed salary or a variable sum determined by the number of passengers picked up. The original idea behind paying per passenger was to keep buses from clumping together: drivers would have a reason to space themselves out and allow new customers to accumulate at bus-stops.

Sounds great in theory.

Here’s what happened in practice: bus drivers started racing to pass buses ahead of them in an effort to scoop up waiting passengers. Drivers also started leaving bus-stops before boarding passengers had found a seat or a hand-hold. In short, the average wait-time dropped for those served by drivers paid per passenger, but the rate of accidents skyrocketed and passenger comfort plummeted.

The lesson here is that fixating on a single metric – in this case, the number of passengers picked up per driver – distorted drivers’ incentives. Safety became an afterthought. A system that sought to increase customer satisfaction by reducing wait-time ended up having the opposite effect. The tradeoff for shorter wait-times turned out to be more accidents and fewer satisfied customers.

In their myopic focus on wait-times, then, policymakers in Santiago failed to foresee that their proposed solution would generate negative externalities. And not just everyday negative externalities, like pollution or second-hand smoke, but ones with an immediate and often significant impact: injury or death in a traffic accident. Upon reflection, it should be obvious that short wait-times aren’t the only thing that matters to bus-riders. They also want to arrive at their destinations in one piece, without having to visit the hospital or morgue. But policymakers appear not to have considered this.

Now, let’s look at recent student test scores in New York City. The public has heard for years from Mayor Bloomberg and Schools Chancellor Joel Klein that the city’s schools are improving. Bloomberg and Klein have regularly cited better student test scores as evidence of improvement – that is, higher percentages of students demonstrating “proficiency” on state exams.

But last week it was revealed that these test scores actually show something quite different: not better performances by students, but lower standards and easier-to-pass tests. The same press that dutifully reported student improvement changed its tune.

The New York Daily News titled its piece, “Big, Fat F in Schools,” while The Wall Street Journal’s headline read “ ‘Hard Truth’ on Education.”

But what was most surprising about the coverage was that the news surprised anyone. “You mean students haven’t really gotten a lot smarter in the last two years?” some wondered.

No, they haven’t.

But they haven’t gotten a lot dumber either. Their performance is, in fact, largely unchanged.

What changed is simply the state’s definition of “proficient.”

The gains were merely an illusion, a sleight of hand on the part of policymakers and politicians. Mayor Bloomberg said his interpretation was that “the test is harder and more comprehensive,” but that isn’t true. The test isn’t harder or more comprehensive; the minimum passing score was simply raised.

The real story isn’t that years of gains were erased, as The Wall Street Journal said. It’s that there was no academic progress in the first place – just a lower bar for determining who was declared proficient.
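A rough sketch makes the arithmetic plain. Assume, purely for illustration, that student scores follow a bell curve that does not change from one year to the next; the mean, spread and both cut scores below are invented, not New York’s actual figures. Moving only the bar moves the “proficiency” rate a great deal.

```python
from statistics import NormalDist

# Hypothetical score distribution: these numbers are invented for
# illustration and are not New York's actual scale scores.
SCORES = NormalDist(mu=650, sigma=35)


def proficiency_rate(cut_score: float) -> float:
    """Share of students scoring at or above the cut score."""
    return 1.0 - SCORES.cdf(cut_score)


# Same students, same test, two different definitions of "proficient".
lenient_cut = 630  # assumed lower bar
strict_cut = 670   # assumed higher bar

lenient_rate = proficiency_rate(lenient_cut)
strict_rate = proficiency_rate(strict_cut)

print(f"Cut score {lenient_cut}: {lenient_rate:.0%} proficient")
print(f"Cut score {strict_cut}: {strict_rate:.0%} proficient")
```

Nothing about the students differs between the two lines of output. Only the definition of “proficient” does.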

The skeptics among us – those who have questioned such results for months, if not years – felt vindicated at last. But it’s a shame that vindication was so long in coming, and it’s a scandal that more people are not incensed now. I don’t quite understand where the outrage is.

What can we learn from the New York City example? I can think of at least four lessons.

1. We shouldn’t get excited or depressed about short-term changes in test scores. Often they don’t mean much. Long-term trends are more reliable – and therefore more meaningful. Scores on the National Assessment of Educational Progress (NAEP) going back one, two and three decades are trustworthy. An individual state’s scores from last year probably aren’t.

2. Politicians are prone to slicing and dicing scores to their advantage. This shouldn’t surprise us, but neither should it silence us. Year-to-year changes in scores are unimpressive? Look at the decade-long trend. Long-term trends show no growth? Look at the change over the past two years. This is the game in which Michelle Rhee engaged last month when the percentage of elementary students in Washington, D.C., deemed proficient in reading and math unexpectedly dropped this year. Rhee touted instead the gains since 2007-08.

3. When numbers look too good to be true, they’re too good to be true. This is no less true of schooling than of baseball and cycling. Seventy-three home runs in a single season? Hmm. An epic comeback in Stage 17 of the 2006 Tour de France? Hmm. Those results strained credulity because they weren’t clean – and people suspected so from the start but had to wait years for confirmation. We’ve seen similar things in schools.

In New York City, 97 percent of elementary and middle schools earned As or Bs on the district report card last year, compared to 79 percent in 2008 and just 61 percent in 2007. Are most schools getting dramatically better in just one or two years? Probably not. As President Obama said last week, “change is hard. …We won’t see results overnight.” We should always be wary of overnight results.

Randi Weingarten, president of the American Federation of Teachers, said in response to President Obama’s speech, “there are no silver-bullet solutions for our schools.” There’s only hard work, day after day and year after year, with the possibility of gradual – real and substantive – improvement. Instant, immense improvement is as elusive as Halley’s Comet. It is therefore also suspect.

The most likely explanations for a school whose students dramatically improve from one year to the next are that the test has changed or that the school is serving a different population of students. And the most likely explanation for a school whose students do significantly worse one year to the next is a change in how performance is being measured, not a change in the students’ actual performance. This is the story of Public School 85 in the Bronx, where math proficiency among third-graders plunged from 81 percent two years ago to 18 percent last year. The good news is that last year’s students probably weren’t any worse than their apparently highflying predecessors; what changed was the definition of “proficient,” not the students’ performance.

4. We remain very far from an accountability system impervious to perverse incentives. Therefore, we must be very careful in how we use student test scores in any decisions, especially those about personnel. A new Mathematica study released by the U.S. Department of Education says that “in a typical performance measurement system, more than 1 in 4 teachers who are truly average in performance will be erroneously identified” as below average, with a similar percentage of below-average teachers not showing up as underperformers.

This should scare not just classroom teachers but anyone who believes our current data systems are infallible. They are not. Importantly, the study also notes that more than 90 percent of the variation in student learning is due to factors beyond a teacher’s control. We ignore this fact at our peril. It does not mean that teachers don’t matter, or that teachers cannot or should not be held accountable. But it does mean that we must proceed cautiously and ask tough questions of those who believe we’ve finally found the holy grail of measuring teacher performance.
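To see why noisy measurement produces this kind of error, consider a deliberately simplified sketch. It is not the Mathematica study’s method. It assumes only that a teacher’s estimated effect equals the true effect plus normally distributed noise, and that anyone whose estimate falls far enough below average gets flagged. The function name, the cutoff and the noise levels are all hypothetical.

```python
from statistics import NormalDist


def false_flag_rate(noise_sd: float, cutoff: float) -> float:
    """Chance that a truly average teacher (true effect 0) is flagged.

    Assumes estimate = true effect + normal noise, and that a teacher is
    flagged as below average when the estimate falls below -cutoff.
    """
    return NormalDist(mu=0.0, sigma=noise_sd).cdf(-cutoff)


# Hypothetical units: how far below average an estimate must fall
# before a teacher is labeled an underperformer.
cutoff = 0.5

# The teacher is exactly average in every case; only the noise changes.
for noise_sd in (0.25, 0.5, 1.0):
    rate = false_flag_rate(noise_sd, cutoff)
    print(f"noise sd {noise_sd}: {rate:.1%} of average teachers flagged")
```

Under these assumptions, the share of truly average teachers who are wrongly flagged climbs as the measurement gets noisier, even though the teachers themselves are identical.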

A version of this story appeared on The Washington Post’s “Answer Sheet” on August 7, 2010.
