Were some D.C. teacher dismissals based on flawed calculations?

We live in an age of accountability and transparency – and yet some school districts seem not to be playing by the rules. I recently wrote about the lack of accountability in the way districts report how they calculate teacher “value-added” measures that are used for medium-stakes and high-stakes personnel decisions (such as granting teachers tenure or firing them).

Districts such as Washington, D.C. and New York City have failed to disclose the technical materials that describe the strengths and weaknesses of their chosen value-added technology. There are hundreds of decisions regarding how to calculate value-added scores in a given school district, some of which may be routine, and others of which might be controversial. Moreover, these decisions have consequences for who gets what score – and this is a serious matter if the scores have medium and high stakes for teachers.

Below, I illustrate how value-added scores may have been misused in the termination of 26 teachers in the D.C. Public Schools last week and the classification of hundreds of other general education teachers in grades four through eight as “minimally effective.” I cannot be sure that this is what happened, as no technical report on the IMPACT system ratings is available. The only documentation currently on the DCPS website is the DCPS IMPACT Guidebook for General Ed Teachers with Individual Value Added (IVA), which was published in October 2009, concurrent with the announcement of the system. Perhaps Mathematica Policy Research, the contractor for the IMPACT system, figured out an alternative approach; based on the information available, there’s no way to tell.

There’s no polite way to say this: the procedures described in the DCPS IMPACT Guidebook for producing a value-added score are idiotic. These procedures warrant this harsh characterization because they make a preposterous assumption based on a misunderstanding of the properties of the DC Comprehensive Assessment System (DC CAS).

To comply with the No Child Left Behind Act, the DC CAS measures student performance in reading and math in grades three through eight and grade 10, as well as student performance in science, biology, and composition in select grades. The test, developed by testing contractor CTB/McGraw-Hill, is designed to track student mastery of the D.C. content standards.

Tests generally take the form of multiple-choice and constructed-response items; the pattern of correct responses generates a raw score, indicating the number of correct items (with partial credit for constructed-response items), sometimes with a correction or “penalty” for guessing. These raw scores are then converted into scaled scores, which place results on a common yardstick across multiple versions of a test. For example, there were 54 questions on the math portion of the SAT in 2008, and raw scores ranging from -5 to 54 were converted into scaled scores in the familiar 200-800 range. (It’s possible to earn a negative raw score on the SAT because of the so-called “guessing penalty,” but the lowest possible scaled score is 200.) In D.C., then, if this year’s third-grade reading test is slightly easier than last year’s test, a given raw score would convert into a lower scaled score than the same raw score would have in the preceding year, so that a given scaled score reflects the same level of performance in both years.

Some tests, such as the National Assessment of Educational Progress, are designed to be vertically equated. This means that the same scale is used to locate the performance of students in different grades. In a vertically equated testing system, a score of 450 represents the same level of performance regardless of whether it was earned by a fourth-grader or a fifth-grader. Tests need not be vertically equated to meet the requirements of No Child Left Behind, which merely demands that test-takers in each grade be classified as “proficient” or “not proficient” in relation to grade-level content standards. Typically this is done in two steps: first by converting raw scores into scaled scores, and then by determining the scaled score threshold that separates students judged proficient at that grade from those who are below that standard.

The scaling approach taken by the DC CAS is, to my mind, pretty unconventional, because the scaled scores do not overlap across grades. In grade four, the minimum possible scaled score is 400, and the maximum possible scaled score is 499. In grade five, however, the minimum possible scaled score is 500, and the maximum possible scaled score is 599. (The same approach is used in grades six through eight.) This means that a fourth-grade student who got every question on the fourth-grade math assessment correct would receive a lower scaled score than a fifth-grade student who got every question wrong on the fifth-grade assessment. That sounds ridiculous, but it’s not problematic if the scale for fourth-grade performance is acknowledged to be different from the scale for fifth-grade performance. The design of the DC CAS allows for comparing performance in fourth grade in one year with fourth-grade performance in the next year; but it doesn’t permit measuring how much students have gained from one grade to the next. Measuring growth from one grade to the next requires a test that is vertically equated.
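The non-overlapping ranges can be made concrete with a short sketch. It encodes the grade four and grade five ranges stated above and assumes, based on the remark that the same approach is used in grades six through eight, that each later grade occupies its own 100-point band starting at the grade number times 100. The names `SCALE_RANGES` and `ranges_overlap` are illustrative only and do not come from any DCPS or CTB/McGraw-Hill document.

```python
# Scaled-score ranges by grade as (minimum, maximum).
# Grades 4 and 5 are stated in the text; grades 6-8 are ASSUMED to follow
# the same pattern (grade * 100 through grade * 100 + 99).
SCALE_RANGES = {grade: (grade * 100, grade * 100 + 99) for grade in range(4, 9)}


def ranges_overlap(grade_a, grade_b):
    """Return True if the two grades share any possible scaled score."""
    lo_a, hi_a = SCALE_RANGES[grade_a]
    lo_b, hi_b = SCALE_RANGES[grade_b]
    return lo_a <= hi_b and lo_b <= hi_a


best_fourth = SCALE_RANGES[4][1]   # every fourth-grade question answered correctly
worst_fifth = SCALE_RANGES[5][0]   # every fifth-grade question answered incorrectly

print(best_fourth, worst_fifth, ranges_overlap(4, 5))
```

Under these ranges the best possible fourth-grade score sits below the worst possible fifth-grade score, which is harmless only as long as nobody treats the two bands as one continuous scale.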

Which brings us to the value-added calculations for the DCPS IMPACT. As is common in value-added measures, a teacher’s “value added” is determined by the difference between the growth expected for a student (or class of students) with a given set of characteristics with a typical teacher, based on a statistical prediction equation, and the growth actually observed for that student (or class of students) with that particular teacher. How is that growth measured? According to the DCPS IMPACT Guidebook, the actual growth is a student’s scaled score at the end of a given year minus his or her scaled score at the end of the prior year. If a fifth-grader received a scaled score of 535 in math and a score of 448 on the fourth-grade test the previous year, his actual gain would be calculated as 87 points.

Subtracting one score from another only makes sense if the two scores are on the same scale. We wouldn’t, for example, subtract 448 apples from 535 oranges and expect an interpretable result. But that’s exactly what the DC value-added approach is doing: subtracting values from scales that aren’t comparable.
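To see what that subtraction actually contains, here is a minimal sketch using the fifth-grader from the example above. The function `actual_gain` mirrors the Guidebook's description (current scaled score minus prior scaled score). The split into a "band offset" and a "position change" is a hypothetical illustration, not a DCPS procedure, and it assumes each grade's band begins at the grade number times 100.

```python
GRADE_BAND_WIDTH = 100  # ASSUMED: each grade's scale starts at grade * 100


def actual_gain(current_score, prior_score):
    """Gain as the Guidebook describes it: current minus prior scaled score."""
    return current_score - prior_score


def position_in_band(score):
    """Points above the minimum possible score for that grade's band."""
    return score % GRADE_BAND_WIDTH


current, prior = 535, 448  # fifth-grade and fourth-grade math scores from the text

gain = actual_gain(current, prior)
# Part of the "gain" is just the fixed distance between the two grade bands.
band_offset = (current // GRADE_BAND_WIDTH - prior // GRADE_BAND_WIDTH) * GRADE_BAND_WIDTH
# The remainder compares positions on two different scales.
position_change = position_in_band(current) - position_in_band(prior)

print(gain, band_offset, position_change)
```

The 87-point figure is the 100-point gap between the two bands plus a difference of -13 in position within each band. The first piece is an artifact of how the scales are labeled, and the second compares locations on two separate scales, so neither can be read as an amount of learning.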

By assuming that the difference between a student’s score in one year and his or her score in the following year is a definite (and precise) quantity, the DCPS value-added scheme assumes that the scaled scores are measured on an interval-level scale, in which the difference between a score of 498 and a score of 499 represents the same difference in performance as the difference between 499 and 500. But this simply cannot be. The difference between 498 and 499 is a tiny difference among very high achievers in the fourth grade. But the difference between 499 and 500 is the difference between the highest performing fourth-grader and the lowest performing fifth-grader; and there are many fourth-graders who outperform low-scoring fifth-graders.

And heaven help the poor teacher who is teaching a class filled with students who’ve been retained in grade. A fifth-grade student who got every question wrong on the reading test at the end of fourth grade and every question wrong at the end of fifth grade would show an actual gain of 500–400=100 points. A fifth-grader repeating fifth grade who had a scaled score of 510 the first time through, and a scaled score of 530 during his or her second year in fifth grade, would show an actual gain of just 20 points. DC’s value-added methods may, of course, simply exclude students who are retained in grade from the calculations, but that sends an unpleasant message about whose scores count when teachers are evaluated.
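The two cases in the preceding paragraph can be put side by side in a few lines. This is only a sketch of the subtraction described in the Guidebook; the variable names are hypothetical, and it makes no claim about how DCPS or its contractor actually handled retained students.

```python
def actual_gain(current_score, prior_score):
    """Gain as the Guidebook describes it: current minus prior scaled score."""
    return current_score - prior_score


# Promoted student: every question wrong in grade 4 (400) and again in grade 5 (500).
promoted_all_wrong = actual_gain(500, 400)

# Retained student: 510 the first time through grade 5, 530 the second time.
retained_improved = actual_gain(530, 510)

print(promoted_all_wrong, retained_improved)
```

The student who answered nothing correctly in either year is credited with five times the gain of the student who demonstrably improved on the same test.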

Did DCPS completely botch the calculation of value-added scores for teachers, and then use these erroneous scores to justify firing 26 teachers and lay the groundwork for firing hundreds more next year? According to the only published account of how these scores were calculated, the answer, shockingly, is yes.

This article also appeared on July 28, 2010, on Valerie Strauss’ “Answer Sheet” blog in The Washington Post.