This personal report about setting the cut scores for New York’s Common Core 11th grade ELA test was written by Dr. Maria Baldassarre Hopkins, Assistant Professor in the School of Education at Nazareth College. The cut score is the passing mark.
Professor Hopkins writes:
My name is Maria, and I am not a psychometrician.
There. I said it.
Apparently it took me a while to get it through my thick skull. I was reminded no fewer than three times at the cut score setting for the new Common Core-aligned ELA Regents Exam that I am, indeed, not a psychometrician.
“Mary, are you a psychometrician?” I was asked when I made one of my frequent requests for more information.
My name ain’t Mary. And, no, I am not a psychometrician.
Last year I wrote critically of the cut score setting process for the 3-8 Common Core assessments. I was astonished when I was invited back for the 11th grade iteration after expressing blatant disapproval of NYSED/Pearson’s gamemaker role in the Hunger Games of academic achievement. You might wonder why I chose to go back. In addition to the camaraderie of some of New York’s finest educators and the Desmond’s delicious bread pudding, I prefer being at the table in the event that I might bring some modicum of sanity to an otherwise batty process.
Once again, I was required to sign a non-disclosure agreement that bars me from disclosing any secure test materials, including “all oral and written information … relating to the development, review, and/or scoring of a New York State assessment.” On the other hand, Commissioner King emphasized the importance of participants going out and talking about the cut score setting process, as well as encouraging our colleagues to participate in the future. While it may be my close reading skills at fault, I’m not entirely clear on where “secure test materials” end and “talking about the process” begins. I haven’t been dragged into court yet, so I think we’re good. Still, I will err on the side of caution here by not divulging any actual conversations or actual data to which I was privy. Read closely, friends.
Oh, I almost forgot–you should totally get on one of these panels if you have the chance.
Concern #1: Students are not PLDs
An important early step in the cut score setting process happened in February when educators from across the state were brought together to craft Performance Level Descriptors (PLDs) that would be instrumental in determining cut scores. PLDs are statements that say what a student at each level of proficiency should be capable of doing under each standard.
For example, imagine anchor standard 11 said the following: “Analyze the body language of a person trying to persuade you to resign from a task after you have asked too many questions.” PLDs would be statements that say what a student at each level (2-5) is capable of. A level 3 PLD might say: “Analyzes body language adequately and correctly”; a level 4 might say: “Thoroughly analyzes body language in a way that is both correct and lightly nuanced”; a level 2 might say: “Inconsistently analyzes body language and with some inaccuracy.” Do you get the picture? Essentially, each standard is broken up into five proficiency levels, with PLDs written for Levels 2 through 5.
PLDs, along with Ordered Item Booklets (OIBs), are the tools of the trade for cut score setters. An OIB is basically the test booklet from the June 3rd administration, but instead of the questions appearing in the order they did on the actual exam, they are ordered from least to most difficult. The only factor accounted for in the ordering is the number of students who answered each question correctly. A lot of students got it right? Easy question. Not many students got it right? Hard question. Text complexity of the passage, plausibility of the multiple choice options, level of questioning—you know, the stuff that makes questions hard—are of little consequence.
For the purpose of cut score setting, PLDs become groups of “students.” As we move through the OIB attempting to place a bookmark on the last question a “Level 3” student should be able to answer correctly, we ask ourselves: “Based on the PLD description, should a student at this level be able to answer this question?” Yes? Move on in the book. No? Place your bookmark on the last “yes.”
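For readers who like to see the mechanics laid bare, the two steps above — ordering items purely by percent correct, then walking through the booklet until the first “no” — can be sketched in a few lines of code. Everything here is hypothetical (the item data, the 60% judgment rule); it is only meant to show how little information the procedure actually uses.

```python
# A minimal sketch of the Ordered Item Booklet (OIB) ordering and the
# bookmark placement described above. All data and thresholds are
# hypothetical illustrations, not the state's actual figures.

def build_oib(items):
    """Order items from easiest to hardest using only the proportion of
    students who answered each one correctly. Nothing else -- not text
    complexity, not distractor plausibility -- enters the ordering."""
    return sorted(items, key=lambda item: item["pct_correct"], reverse=True)

def place_bookmark(oib, judge):
    """Walk the OIB; judge(item) is a panelist's yes/no answer to
    'should a student at this level get this item right?'.
    The bookmark lands on the last 'yes' before the first 'no'."""
    bookmark = 0
    for position, item in enumerate(oib, start=1):
        if judge(item):
            bookmark = position
        else:
            break
    return bookmark

# Hypothetical items: only pct_correct matters.
items = [
    {"id": "Q1", "pct_correct": 0.41},
    {"id": "Q2", "pct_correct": 0.88},
    {"id": "Q3", "pct_correct": 0.63},
]
oib = build_oib(items)

# One panelist's (made-up) reading of the Level 3 PLD: anything that at
# least 60% of test takers answered correctly is a "yes".
mark = place_bookmark(oib, lambda item: item["pct_correct"] >= 0.60)
```

Note that two panelists with different readings of the same PLD would supply different `judge` functions and land their bookmarks in different places, which is exactly the subjectivity problem described below.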
The problem is that PLDs are not actually students. PLDs are arbitrary, almost meaningless statements that are made up very quickly by people who, for all intents and purposes, have little inkling of what will be done with them after students take the exam. So we end up having hypothetical conversations like this one that inform where we place our bookmarks and, therefore, what the cut score becomes:
Jane at Table 1: Man, this question is super hard because–Broca’s Brain?! Come on, how many 11th graders would actually understand the message here? I am going to say a Level 3 probably won’t get this right.
Dick at Table 2: No, this text is grade level appropriate. I just asked that state ed person in the corner and she said so. Our PLD says right here that a Level 3 student understands grade level texts. So, no, it should not be too hard. A Level 3 student should definitely get this question right.
Let me say this one more time, this time in response to imaginary Dick at Table 2: PLDs are not students. They are broad categories that can be interpreted differently by every single person that reads them. Even if, as a student, I fall squarely into the Level 3 category for my ability to understand a grade level text, that does not necessarily mean that I am able to distinguish between the very subtle nuances presented to me in the multiple choice options. It does not mean that there is a multiple choice option that approximates the (correct) answer I came up with on my own when I read the question. It does not mean that I have had the lived/linguistic experiences necessary in order to comprehend the nuances of the figurative language, even if I have a good sense of what the text, taken as a whole, is saying. For Dick, none of that matters. Because PLD. (View the test in its entirety here and assess the difficulty level for yourself).
PLDs do a good job making general statements about what a kid can kinda do in a vague sort of way. What they do not do is mitigate the subjectivity of individual bookmarkers. They are also terrible at representing the complexity of actual students and at attending to the myriad, layered complexities involved in answering each and every question on the assessment.
But take this with a grain of salt. I’m no psychometrician.
Concern #2: Setting Cut Scores on a Test that is Not Fully Operationalized
As it turns out, psychometricians aren’t big on anecdotal evidence. But here’s what I know, anecdotally speaking. Not all 11th graders in NYS took the new Regents exam. Districts were given the choice of whether to administer the test or not. Some districts chose to opt out altogether, while others administered both the new and the old tests. My concern was about how representative the sample was upon which we were basing our cut score decisions. Based on the demographics of students who actually took this new test, would it be possible to draw a sample that was representative of all 11th graders in NYS? Were various demographic groups, including (but not limited to) Latino and Black students, students with disabilities, and English learners accurately represented in the test data that would inform the cut score setting process?
I had a difficult time imagining how that was possible. Perhaps it is because I am not a psychometrician, or maybe it was just pragmatics. Would school districts be willing to shoulder the expense of test proctors, graders, and substitute teachers, along with the loss of precious instructional time, for a test that they knew full well their students were not prepared for? My sense was that it would be mostly higher achieving students and wealthier districts choosing to give this test. If that is true—and I have been assured by NYSED staff that it is not—then the sample is skewed toward students who are expected, statistically speaking, to perform pretty well. All I could think during the cut score setting was that if our cut score was based on data skewed toward higher achieving students, everyone else would be at a grave disadvantage for years to come. They would be expected to perform to a bar set by predominantly successful students. Unfortunately, though I asked, I was not permitted to see any data that reflected the demographics of students tested. I was assured, however, that the details of the sample would be provided in the cut score report.
On June 23rd, SED released their cut score report. In it, they break the sample down into several demographic categories and show that the percentage of students in each category in the sample is similar to that in the population. But, despite everything one learns in Statistics 101, they never give the number of test takers in the sample. The sample could be 10,000 students or it could be 100. On their own, these percentages tell us nothing about whether the test results of the sample can be generalized to New York’s population of 11th graders.
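A back-of-the-envelope calculation shows why the omitted number matters. Using the standard binomial approximation for the margin of error on a proportion (the numbers below are hypothetical, not from the report), the same reported percentage is wildly less trustworthy at n = 100 than at n = 10,000:

```python
# Why percentages without n are uninformative: the uncertainty on any
# estimate from the sample depends on the sample size, which the
# report does not give. Numbers here are hypothetical.
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p
    estimated from a simple random sample of n students."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.20  # say 20% of the sample falls in some demographic category
for n in (100, 10_000):
    print(f"n={n}: 20% plus-or-minus {margin_of_error(p, n):.1%}")
```

With n = 100 that 20% figure carries a margin of error near eight percentage points; with n = 10,000 it shrinks below one. Matching percentages, by themselves, cannot distinguish the two cases.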
While there is no way to tell from the data SED eventually provided, it is possible that the sample is not skewed. After an hour or more of asking for data about the sample, speaking with several SED folks who each gave me different answers about the sample and reasons that I would not be permitted to see any data (ranging from “it’s secure” to “we don’t have it” to questioning the legitimacy of my request due to my non-you-know-what status), everyone eventually got on the same page. By the end of our last day, the group was on message: the sample is representative.
But even if this is true, it doesn’t actually improve the situation. Students across the board were underprepared for the exam, having had only one year of Common Core-aligned instruction. Because this is a test they were not actually prepared to take, difficulty levels were inflated (remember: they are based only on the number of students who answered each item correctly), causing the cut score to be set relatively low. As years progress and students gain more experience with the Common Core, they will inevitably perform better. All of this cut score nonsense will be long since forgotten, and we will all sing the praises of Commissioner King for increasing graduation rates through his tireless pursuit of high standards. Of course, this type of score manipulation is not new. In 2013, 11th graders’ chances of success on the Regents were diminished by 20% thanks to score conversion charts alone. Now that I think about it, that event set the stage really nicely for the necessity of speedy reform.
Regardless of the sample, this was a test students were not actually prepared to take. Cut scores should have never been set for the next who-knows-how-many-years based on a pilot run. Period.
Even a psychometrician should know that.