The trouble with New York's decision not to release test items

First Rule of Fight Club: Do Not Talk about Fight Club

Second Rule of Fight Club: DO NOT TALK about Fight Club

Has the New York State Education Department watched too many Brad Pitt movies? Okay, that’s a rhetorical question, but one that might be posed to other state education agencies also engaged in the business of high-stakes testing. This week, students in grades 3 through 8 across the state of New York are taking mathematics exams aligned with the Common Core State Standards. Following on the heels of last week’s English Language Arts exams, the math exams also promise to be unusually challenging, reflecting the complex skills and knowledge inscribed in the Common Core standards.

Regardless of broad pronouncements from policymakers and the media about the inherent superiority of the Common Core standards and the assessments designed to measure mastery of them, the truth is that no one really knows whether the standards will lead to higher student achievement, or whether the assessments will be good measures of students’ readiness for college and careers. In New York, although this year’s assessments are the first to be aligned with the Common Core standards, they have a short shelf-life: the state plans to administer the Partnership for Assessment of Readiness for College and Careers (PARCC) assessments in the spring of 2015, if those assessments are ready for prime time by then.

In the meantime, discussions about the content and quality of the assessments are hamstrung by New York’s decision not to release test items to the public. For educators, the issue is quite serious: disclosure of secure test items by a teacher or school leader is considered a moral offense that can lead to disciplinary action, including loss of certification.

The strongest arguments in favor of keeping test questions and answers private are technical. It is desirable that different forms of a test, including those administered in different years, be scaled in such a way that a given score represents the same level of performance, regardless of the test form or year. Anchor items are used to link different forms of a test and equate them. Modern test theory uses the difficulty of test items, and their ability to differentiate higher and lower performers, as tools to estimate a test-taker’s performance. It’s important for anchor items to have a stable level of difficulty over time; if they become easier or harder over time, their ability to serve as a common anchor across test forms is compromised, as is our confidence that a given test score denotes the same level of performance over time. A change in the difficulty of a test item over time is referred to as item parameter drift.

Item parameter drift can occur due to changes in curriculum, teaching to a test, or practice. But the biggest risk is from the widespread release of test items, whether unintentionally, as in a security breach, or intentionally. If a wide swath of the test-taking population knows test questions and the right answers, the questions will look easier in the data, even if the test-takers are not more capable. It’s for this reason that questions and answers in educational tests frequently aren’t released to the public: disclosing test questions would limit their ability to be reused and to serve as anchor items.
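For readers who want to see the mechanism in numbers, here is a minimal sketch of how exposure alone can make an item look easier. It is an illustration under assumptions of my own, not a description of how New York or Pearson scale their tests: it uses the one-parameter (Rasch) model, a single group of test-takers at one ability level, and a hypothetical `p_known` giving the chance that someone who has already seen the item answers it correctly. The function names are made up for this example.

```python
import math


def rasch_p(theta: float, b: float) -> float:
    """Rasch model: chance that a test-taker of ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def apparent_difficulty(theta: float, b: float, exposure: float,
                        p_known: float = 0.95) -> float:
    """Difficulty the item appears to have when a fraction `exposure`
    of test-takers has seen it in advance.

    Assumption (hypothetical): exposed test-takers answer correctly with
    probability p_known regardless of ability; everyone else follows the
    Rasch model with the item's true difficulty b.
    """
    p_observed = exposure * p_known + (1.0 - exposure) * rasch_p(theta, b)
    # Invert the Rasch model: the difficulty that would produce p_observed
    # for test-takers whose ability really is theta.
    return theta - logit(p_observed)


if __name__ == "__main__":
    true_b = 1.0  # a fairly hard item for test-takers at theta = 0
    for exposure in (0.0, 0.1, 0.3, 0.5):
        b_hat = apparent_difficulty(theta=0.0, b=true_b, exposure=exposure)
        print(f"exposure={exposure:.1f}  apparent difficulty={b_hat:+.2f}")
```

With no exposure the apparent difficulty equals the true difficulty; as exposure grows it falls, even though ability is held fixed throughout. That gap is the drift, and it is why an item that has leaked can no longer anchor one test form to another.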

The National Assessment of Educational Progress (NAEP) is a case in point. The No Child Left Behind Act (NCLB) provides that the public shall have access to all assessment instruments used in NAEP, but that the Commissioner of the National Center for Education Statistics, which houses NAEP, may decline to make available test items that are intended for reuse for up to 10 years after their initial use.

Of course, one of the other features of the lovely NCLB law is that it prohibits the federal government from using NAEP to rank or punish individual students, teachers, schools or local education agencies. For this reason, NAEP is a low-stakes test—despite the ways in which pundits jump to draw broad policy inferences from comparisons of NAEP performance over time or across jurisdictions.

But one could argue that disclosure of test questions and answers may be justified when the test is used for high-stakes decisions such as student promotion, or the evaluation of teachers and/or schools. For most such high-stakes decisions, there are winners and losers, and when these decisions are made by agents of the government, the losers have a legitimate interest in whether the decisions were fair. One need look back no further than last week, when New York City announced that, due to a series of errors made by NCS Pearson, several thousand children were incorrectly classified as ineligible for gifted and talented programs.

Or, if you wish, reach back to last year, when the New York State Education Department discarded a series of items in the Grade 8 English Language Arts exam based on a passage involving a talking pineapple. Not too many people rose to defend the test items associated with this fable involving a hare and a pineapple, but Pearson, the firm contracted to develop and administer the exam, did. The choice of both the passage and the items, the company claimed, “was a sound decision in that ‘The Hare and the Pineapple’ and associated items had been field tested in New York State, yielded appropriate statistics for inclusion, and it was aligned to the appropriate NYS Standard.” Vetted by some teachers, too, I reckon. But with all of that, the passage and items were ludicrous.

One item following the passage asked which of the animals in the passage was the wisest: the moose, crow, hare or owl. Pearson claimed that it was unambiguous that the wisest animal was the owl, based on clues in the text. One such clue was that the owl declared that “Pineapples don’t have sleeves,” which, Pearson reported, was a factually accurate statement. So too, to the best of my knowledge, is the statement that owls don’t talk.

High-stakes tests administered by governmental agencies call for a heightened sense of procedural fairness, including the ability to interrogate the tests, how they were constructed, and what counts as a correct response. The point is not so much that bad test items get discarded (although that may be appropriate from time to time) as that the procedures are subject to scrutiny by those they affect. New York does not have a great recent track record on this. The technical reports on the construction of last year’s state English Language Arts and math tests have not been made public yet, even though we’re in the midst of this year’s testing. And the technical manual for New York’s statewide teacher rankings, a modified version of value-added modeling, was released months ago, before the manual for the tests on which those rankings were based. It’s hard to know how much to trust the growth percentiles or value-added models without more information on the tests themselves.

Moreover, it may be especially important to have open and public discussions about tests that are aligned with the Common Core standards, which are new to educators and the public. The point of these tests, especially in their earliest administrations, is really not “ripping the Band-Aid off,” as New York City Schools Chancellor Dennis Walcott has declared—nor is it to document just how few students will meet the new standards, as a vehicle for supporting one policy reform or another. Rather, it’s to engage educators, policymakers and the public in a conversation about what we want our students to know, and how we can move them toward the desired levels of knowledge and skill.

And one good way to frame that conversation is to ground it in the discussion of particular assessment questions. Might teachers disagree with one another about what the best answer to an assessment question is? If they do, shouldn’t they be talking about it? Will students have an opportunity to discuss why a response is incorrect, what a better response might be, and why? Or will they simply receive a scale score telling them, and their parents, that they are well below grade-level?

Much has been made of the notion that assessments aligned with the Common Core standards are to be “authentic,” with real-world content that parallels what students might experience in adult daily life. (Ideally, something more sophisticated than “If Johnny has $5.63 and is wearing a pair of Nike Free Run+ 3 shoes, how long will it take him to run to the 7-Eleven to buy a delicious Coca-Cola product?”) If the content is indeed authentic, and reflective of what we expect students to know and be able to do as productive adults, we should be discussing that content, not hiding it under a rock.

There is a middle ground between total nondisclosure of test items and answers, and complete disclosure. It’s possible to retain the security of anchor items while releasing items that won’t be used again. But it’s easier to do this when there’s a more extensive bank of assessment items with known properties, and such an item bank for the Common Core does not yet exist. It may not be the most popular conclusion, but perhaps we should be investing more in the development of good assessment items.

First Rule of High-Stakes Assessments: Talk about High-Stakes Assessments

Second Rule of High-Stakes Assessments: TALK about High-Stakes Assessments