Rigor mortis

The word rigor comes up a lot in teacher-evaluation systems. It’s akin to motherhood, apple pie and the American flag. What policymaker is going to take a stand against rigor? But the term is getting distorted almost beyond recognition.

In science, a rigorous study is one in which the scientific claims are supported by the evidence. Scientific rigor is primarily determined by the study’s design and data-analysis methods. It has nothing to do with the substance of the scientific claims. A study that concludes that an educational program or intervention is ineffective, for example, is not inherently more rigorous than one that concludes that a program works.

In the current discourse on teacher-evaluation systems, however, an evaluation system is deemed rigorous based on how much of the evaluation rests on direct measures of student-learning outcomes, on the distribution of teachers across the rating categories, or on both. If an evaluation system relies heavily on NCLB-style state standardized tests in reading and mathematics (say, 40 percent of the overall evaluation or more), its proponents are likely to describe it as rigorous. Similarly, an evaluation system with four performance categories (e.g., ineffective, developing, effective and highly effective) may be labeled rigorous if it classifies very few teachers as highly effective and many teachers as ineffective.

In these instances, the word rigor obscures the subjectivity involved in the final composite rating assigned to teachers. The fraction of the overall evaluation based on student-learning outcomes is wholly a matter of judgment; and if you believe, as I do, that a teacher’s responsibility for advancing student learning extends well beyond the content that appears on standardized tests, you could conceivably argue that increasing the weight given to standardized tests in teacher evaluations makes these evaluations less rigorous. This is, however, a hard sell in the absence of other concrete measures of student-learning outcomes that could supplement the standardized-test results.

Even more importantly, describing a teacher-evaluation system as rigorous hides the fact that the criteria for assigning teachers to performance categories—either for subcomponents or for the overall composite evaluation—are arbitrary. There’s no scientific basis for saying, as New York has, that of the 20 points out of 100 allocated for student “growth” on New York’s state tests, a teacher needs to receive 18 to be rated “highly effective,” or that a teacher receiving 3 to 8 points will be classified as “developing.” In fact, the cut-off separating “developing” from “effective” changed last week as a result of an agreement reached between the New York State Education Department and the state teachers’ union—not because of science, mind you, but because of politics.

And it’s politics, and politics alone, that accounts for the fact that the rules for the overall composite evaluation say that any teacher who scores 0 to 64 points will be classified as ineffective, and that the two subcomponents for student “growth” and local assessments, each of which counts for 20 points, classify teachers who score 0 to 2 points on each component as ineffective. This means, as New York principal Carol Burris and others have pointed out, that if a teacher is classified as ineffective on both of these subcomponents, that teacher is automatically rated ineffective overall, even if that teacher is rated highly effective on the 60 points allocated for measures of a teacher’s professional practices. It certainly seems odd that two components accounting for 40 percent of a teacher’s overall rating can trump the remaining 60 percent. But this isn’t science; it’s politics.
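The arithmetic behind that automatic rating can be checked in a few lines. This is a minimal sketch that uses only the point bands quoted above (0 to 2 points is ineffective on each 20-point subcomponent, 0 to 64 points is ineffective overall); the constant and function names are my own, and the sketch does not model any other scoring bands or rules in the New York system.

```python
# Point bands as described in the text; names are illustrative only.
PRACTICE_MAX = 60                  # points for professional practices
SUBCOMPONENT_INEFFECTIVE_MAX = 2   # 0 to 2 of 20 points is "ineffective"
COMPOSITE_INEFFECTIVE_MAX = 64     # 0 to 64 of 100 points is "ineffective"


def best_composite_when_both_ineffective() -> int:
    """Highest possible composite for a teacher rated ineffective on both
    the growth and local-assessment subcomponents, assuming a perfect
    score on professional practices."""
    return 2 * SUBCOMPONENT_INEFFECTIVE_MAX + PRACTICE_MAX


best = best_composite_when_both_ineffective()
is_ineffective = best <= COMPOSITE_INEFFECTIVE_MAX
print(best, is_ineffective)  # prints: 64 True
```

Even with all 60 practice points, such a teacher tops out at 64 points, which is still inside the ineffective band.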

Other states face the same challenge in assigning teachers’ value-added scores or student growth percentile scores to performance categories, and most of them have punted, issuing regulations that defer these difficult decisions until later. Illinois says that it’s “working diligently” on this. Georgia claims that its model will be identified soon. Michigan is counting on a rating system to be developed by the Governor’s Council on Educator Effectiveness. After a year of debate, Delaware concluded that it couldn’t figure out how to use students’ scores on the state assessment system in teachers’ summative ratings for the 2011-12 school year, and deferred implementation to a later date.

It violates a basic principle of fairness for teachers to be held accountable for performance criteria that aren’t clearly specified in advance and that may be unattainable. These states, and many others, have their work cut out for them.

Nowhere is this more evident than with the mapping of teachers’ value-added or student growth percentile scores onto the ratings composing a teacher’s summative evaluation. The value-added or student growth percentile scores are measured with errors that can be substantial, especially when they are based on a single year’s worth of student achievement data. But the scoring bands for ratings categories such as “developing” or “effective” have strict cut-offs. What to do?

One way of reclaiming the concept of rigor in teacher-evaluation systems is to assign ratings that take into account the uncertainty or errors in the measures. This is consistent with a scientific conception of rigor: the assignment of teachers to rating categories should be consistent with the quality of the evidence for doing so. A teacher shouldn’t be assigned a rating of “ineffective” based on a value-added score, for example, if there’s a substantial probability that the teacher’s true rating is “developing.”

So here’s a challenge, and a proposal. The challenge is to state education policymakers across the country who have hitched their teacher-evaluation systems to measures that seek to isolate teachers’ contributions to their students’ learning: Develop clear and consistent guidelines for assigning teachers to rating categories that take into account the inherent uncertainty and errors in the value-added measures and their variants.

And here’s the proposal: A teacher should be assigned to the lower of two adjacent rating categories only if there is at least 90 percent confidence that the teacher is not in the higher category. Operationally, this involves a statistical test based on a cut score, a teacher’s score and the error associated with that score.

Suppose, for example, that the cut-off separating “ineffective” and “developing” is a teacher being in the 10th percentile across the state on a value-added or student growth percentile measure. Teacher A’s percentile rating is the eighth percentile, but the standard error for her rating is two percentile points. Given the uncertainty in the rating, there is a 16 percent probability that Teacher A’s true percentile rating is greater than the 10th percentile, and an 84 percent probability that her true percentile is lower than the 10th percentile. Because 84 percent falls short of the 90 percent threshold, Teacher A should, under my proposal, be classified as developing, not ineffective.

Conversely, Teacher B’s percentile rating is in the fourth percentile, and the standard error for her rating is three percentile points. Given the uncertainty in the rating, there is only a 2 percent probability that Teacher B’s true percentile value is above the 10th percentile, and a 98 percent probability that her true percentile rating is lower than the 10th percentile. Teacher B would therefore be classified as ineffective.
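The two examples can be reproduced with a short calculation. This is a sketch of one way to carry out the test, not a definitive procedure: it assumes the measurement error is normally distributed around the observed score (the assumption under which the 16 percent and 2 percent figures above come out), and the function names `prob_true_score_above` and `assign_category` are hypothetical.

```python
import math


def prob_true_score_above(cut: float, score: float, std_error: float) -> float:
    """Probability that the true score exceeds the cut score, assuming
    normally distributed error around the observed score."""
    z = (cut - score) / std_error
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def assign_category(cut: float, score: float, std_error: float,
                    lower: str, higher: str, confidence: float = 0.90) -> str:
    """Assign the lower category only if we are at least `confidence`
    sure that the true score falls below the cut score."""
    if score >= cut:
        return higher
    prob_below = 1.0 - prob_true_score_above(cut, score, std_error)
    return lower if prob_below >= confidence else higher


# Teacher A: 8th percentile, standard error of 2 points, cut score of 10.
print(round(prob_true_score_above(10, 8, 2), 2))              # 0.16
print(assign_category(10, 8, 2, "ineffective", "developing"))  # developing

# Teacher B: 4th percentile, standard error of 3 points, cut score of 10.
print(round(prob_true_score_above(10, 4, 3), 2))              # 0.02
print(assign_category(10, 4, 3, "ineffective", "developing"))  # ineffective
```

The `confidence` parameter makes the arbitrariness of the 90 percent threshold explicit: a state could set it higher or lower, but it would have to say so in advance.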

Other approaches are certainly viable; the 90 percent confidence threshold is arbitrary, but it seems sensible to me. In most educational, social and medical research, a common standard is to trust an observed effect only if that effect could be observed by chance less than 5 percent of the time, relative to the hypothesis that there’s no true effect in the population. The 90 percent standard I’m proposing is slightly more lenient. And of course this approach doesn’t address the arbitrariness in the New York scheme described above.

If policymakers aren’t willing to take measurement error into account in a defensible way in teacher-evaluation systems, don’t talk to me about rigor—rigor is dead.