The 7 steps to making a statewide test

Who designs the state tests that spread a blanket of silence across the hallways and classrooms of Massachusetts schools each spring as students scratch away at paper with No. 2 pencils?

This story also appeared in The Boston Globe.

A) A phalanx of drill sergeants who would ban recess if they could
B) A coven of witches who feed on the sweat and tears of small children
C) Dozens of PhDs and former teachers who spend their days in cubicles in Dover, New Hampshire

The answer is C. The people behind Massachusetts’s test for grades three through eight – used to rate schools, partially evaluate teachers and decide if students graduate or not – are actually an enthusiastic bunch who want to help make schools better. They’re employees of the nonprofit Measured Progress and they care deeply about how the tests can help a state or a school district measure long-term trends and improve education. And they take their jobs very, very seriously.

“We spend a lot more time looking at that item than a student,” said Raymond Reese, a content development specialist in math at Measured Progress. “We always keep in mind real kids are going to be sitting down and taking this test and we want them to be able to succeed.” The nonprofit test development group has worked with the Massachusetts Department of Elementary and Secondary Education to make and score MCAS for all but four of the 19 years it has been given. The company also works on large-scale assessments for 16 other states and is one of two companies in the running to design the MCAS 2.0.

Here’s how they do their work:

1. The state decides what the test should look like. The list of standards – the skills and knowledge the state has decided students should have at each grade level – is the test maker’s bible. Examples from the Massachusetts standards, which incorporate the Common Core, include: “Determine a theme of a story, drama, or poem from details in the text; summarize the text” or “Use the four operations with whole numbers to solve problems.”

The state and Measured Progress agree on the test design, which includes which standards to test and how many questions to devote to each one. These conversations also include how long the tests should be and the ratio of multiple-choice questions to open-ended responses.

2. Measured Progress begins writing questions. Now, it’s time to start writing questions, or “items” as they’re called in the industry. Each year, MCAS tests include some recycled items, but a large number will have been field-tested in previous years and used for scoring for the first time. A good item is clear and concise and will address the standard it’s supposed to be assessing.

Multiple-choice questions can have only one right answer. Reading passages need to be drawn from actual fiction or nonfiction. In math, word problems need to be realistic. “You can’t have someone going on a hike where he’s hiking 20 miles per hour,” Reese said.

The questions go through a thorough internal review. “We want these items to be pretty good before we put them in front of a client,” said Stuart Kahl, Measured Progress’s founding principal.

3. Educators weigh in. Once the items have gone through this initial review, it’s time for the Assessment Development Committee meetings. The state assembles six to 12 educators from across Massachusetts for each grade (and sometimes subject area) that’s being tested to go through items with them, one by one. They look at the wording, how the question looks on the page, and the rubric that defines how the open-ended questions will be scored.


If teachers don’t like an item, they can either suggest revisions or vote to reject it (fewer than 10 percent are discarded at this stage).

Meanwhile, separate committees of state educators vet the items for bias. “They’re really looking for things that might affect performance on items other than the student’s ability,” said Kahl. A writing prompt about a day spent at the beach, for instance, might be easy for a kid on Cape Cod and tough for a student in Springfield who’s never taken a trip to the coast, depending on how it is asked. The trick is to write questions that make no assumptions about students’ prior experience with things outside of the classroom.

4. Field tests check for flaws. When everyone has signed off on the questions, it’s time to see how well they do in the real world with field-testing. For an existing exam, like MCAS, potential new questions are sneaked onto a real test. These questions are not factored into students’ final scores, but the trial run is used to tell if there’s anything statistically deviant about them.

Students can’t tell the difference – and that’s important, Kahl said. “The motivation is the same,” he said. “They don’t know what’s a field test… It just gets the best data on those items.”
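For readers who think in code, the scoring rule behind field-testing can be sketched in a few lines of Python. This is only an illustration of the idea described above, not Measured Progress’s actual system; the `Item` class, the `field_test` flag and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    field_test: bool  # hypothetical flag: True for trial questions that do not count


def operational_score(items, correct_ids):
    """Count correct answers on scored items only; field-test items are skipped."""
    return sum(1 for it in items if not it.field_test and it.item_id in correct_ids)


def field_test_responses(items, correct_ids):
    """Collect right/wrong results on trial items for later statistical review."""
    return {it.item_id: it.item_id in correct_ids for it in items if it.field_test}


# Made-up example: q3 is a trial question mixed in with two scored ones.
items = [Item("q1", False), Item("q2", False), Item("q3", True)]
correct = {"q1", "q3"}  # the student answered q1 and q3 correctly

print(operational_score(items, correct))
print(field_test_responses(items, correct))
```

The student’s answer to the trial question is recorded for analysis but never touches the score, which is why the test can look identical from the student’s seat.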

5. Psychometricians scan for bias. The results are then handed over to the psychometricians, whose job is to answer the question, “Are we measuring things we did not intend to measure?” according to Louis Roussos, senior psychometrician and research scientist at Measured Progress.

For example, if black students performed poorly on one question but equivalently to white students on the rest of the test, that’s a red flag that there could be bias in the item. Psychometricians also look for discrimination, which in test development is actually a good thing. As Kahl explained it, “An item has to separate the A student from the B student, or the C student from the D student.”

With data in hand, Measured Progress goes back to the educator committees to determine which items move forward and which are sent back for possible revision and field-testing the following year.
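The two checks in step 5 can be roughed out in Python. What follows is a deliberately simplified sketch with made-up numbers, not the statistics Measured Progress uses; real analyses rely on more careful methods, such as the Mantel-Haenszel procedure, that compare students of similar overall ability. The function names and the group labels "x" and "y" are hypothetical.

```python
from statistics import mean


def upper_lower_discrimination(item_scores, total_scores):
    """Share correct in the top half minus share correct in the bottom half,
    with students ranked by total test score."""
    ranked = sorted(zip(total_scores, item_scores), key=lambda pair: pair[0])
    half = len(ranked) // 2
    lower = [item for _, item in ranked[:half]]
    upper = [item for _, item in ranked[-half:]]
    return mean(upper) - mean(lower)


def item_gap_vs_rest(item_scores, rest_fraction, groups, a, b):
    """Gap between groups a and b on one item, minus their gap on the rest of the test.
    A value far from zero is a crude red flag, not proof of bias."""
    def avg(values, group):
        return mean(v for v, g in zip(values, groups) if g == group)

    item_gap = avg(item_scores, a) - avg(item_scores, b)
    rest_gap = avg(rest_fraction, a) - avg(rest_fraction, b)
    return item_gap - rest_gap


# Made-up data for eight students (1 = answered the item correctly).
totals = [9, 8, 7, 6, 4, 3, 2, 1]
item = [1, 1, 1, 0, 1, 0, 0, 0]
print(upper_lower_discrimination(item, totals))

groups = ["x", "x", "x", "x", "y", "y", "y", "y"]
rest = [0.8, 0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.5]  # fraction correct on all other items
flagged_item = [1, 1, 1, 0, 0, 0, 1, 0]
print(item_gap_vs_rest(flagged_item, rest, groups, "x", "y"))
```

In the second example the two groups do equally well on the rest of the test but very differently on one item, which is the pattern the psychometricians treat as a warning sign.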

6. Proofers review the final product. The next step is to turn a long list of individual items into a physical test – the sheets of paper and booklets that students will see. Seemingly simple, this part of the process requires particular attention to detail. “A lot of checking goes on to make sure those forms adhere to specifications,” Kahl said. “And then more review and proofing because you want those final forms to be flawless.”

The move to online testing complicates the process further. Items may look different online than on paper. Test questions might even look different in different web browsers. Everything needs to be checked, and then checked again.

7. Students take tests, scores are tallied. The tests are shipped out to schools with specific instructions on how to send them back, so they can be tracked in Measured Progress’s vast warehouse. It has a painstakingly detailed system in place to make sure every sheet of every test is accounted for and tied back to an individual student when the boxes are unpacked and scanned.

Multiple-choice items are scored by scanners that can process a million sheets a day. But open-ended math questions and written work are sent out to one of Measured Progress’s 881 workstations for graders at its offices in New Hampshire, Colorado, or New York. During peak scoring season – April through June – computers are manned in shifts from 8 a.m. to 10 p.m. These scorers can go through hours of training for each item.

Finally, the scores are sent back to the state education department, and on to the school and the students. By the time this happens, the cycle has already started again and questions for next year’s test are already being developed.

This story was produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education.