This personal report about setting the cut scores for New York’s Common Core 11th grade ELA test was written by Dr. Maria Baldassarre Hopkins, Assistant Professor in the School of Education at Nazareth College. The cut score is the passing mark.
Professor Hopkins writes:
My name is Maria, and I am not a psychometrician.
There. I said it.
Apparently it took me a while to get it through my thick skull. I was reminded no fewer than three times at the cut score setting for the new Common Core aligned ELA Regents Exam that I am, indeed, not a psychometrician.
“Mary, are you a psychometrician?” I was asked when I made one of my frequent requests for more information.
My name ain’t Mary. And, no, I am not a psychometrician.
Last year I wrote critically of the cut score setting process for the 3-8 Common Core assessments. I was astonished when I was invited back for the 11th grade iteration after expressing blatant disapproval of NYSED/Pearson’s gamemaker role in the Hunger Games of academic achievement. You might wonder why I chose to go back. In addition to the camaraderie of some of New York’s finest educators and the Desmond’s delicious bread pudding, I prefer being at the table in the event that I might bring some modicum of sanity to an otherwise batty process.
Once again, I was required to sign a non-disclosure agreement which prohibits me from disclosing any secure test materials, including “all oral and written information … relating to the development, review, and/or scoring of a New York State assessment.” On the other hand, Commissioner King emphasized the importance of participants going out and talking about the cut score setting process, as well as encouraging our colleagues to participate in the future. While it may be my close reading skills at fault, I’m not entirely clear on where “secure test materials” end and “talking about the process” begins. I haven’t been dragged into court yet, so I think we’re good. Still, I will err on the side of caution here by not divulging any actual conversations or actual data to which I was privy. Read closely, friends.
Oh, I almost forgot–you should totally get on one of these panels if you have the chance.
Concern #1: Students are not PLDs
An important early step in the cut score setting process happened in February when educators from across the state were brought together to craft Performance Level Descriptors (PLDs) that would be instrumental in determining cut scores. PLDs are statements that say what a student at each level of proficiency should be capable of doing under each standard.
For example, imagine anchor standard 11 said the following: “Analyze the body language of a person trying to persuade you to resign from a task after you have asked too many questions.” PLDs would be statements that say what a student at each level (2-5) is capable of. A level 3 PLD might say: “Analyzes body language adequately and correctly;” a level 4 might say: “Thoroughly analyzes body language in a way that is both correct and lightly nuanced;” a level 2 might say: “Inconsistently analyzes body language and with some inaccuracy.” Do you get the picture? Essentially, each standard is broken up into 5 proficiency levels.
PLDs, along with Ordered Item Booklets (OIBs), are the tools of the trade for cut score setters. An OIB is basically the test booklet from the June 3rd administration, but instead of questions ordered as they appeared on the actual exam, they are ordered from least to most difficult. The only factor accounted for in the ordering is the number of students who answered each question correctly. A lot of students got it right? Easy question. Not many students got it right? Hard question. Text complexity of the passage, plausibility of multiple choice options, level of questioning—you know, the stuff that makes questions hard—are of little consequence.
For the purpose of cut score setting, PLDs become groups of “students.” As we move through the OIB attempting to place a bookmark on the last question a “Level 3” student should be able to answer correctly, we ask ourselves: “Based on the PLD description, should a student at this level be able to answer this question?” Yes? Move on in the book. No? Place your bookmark on the last “yes.”
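To make the mechanics concrete, here is a minimal sketch of how an OIB is assembled and how a bookmark becomes a raw cut score. The item p-values and the panelist’s yes/no judgments below are invented for illustration; nothing here comes from the actual exam.

```python
# Hypothetical p-values: the proportion of test takers who answered each
# item correctly -- the only factor the ordering actually accounts for.
items = {
    "Q1": 0.91, "Q2": 0.78, "Q3": 0.85, "Q4": 0.42,
    "Q5": 0.63, "Q6": 0.30, "Q7": 0.55,
}

# Build the Ordered Item Booklet: easiest item (highest p-value) first.
oib = sorted(items, key=lambda q: items[q], reverse=True)
print("OIB order:", oib)

# A panelist walks the booklet asking, for each item, "should a Level 3
# student get this right?" and stops at the first "no". These judgments
# are hypothetical.
judgments = [True, True, True, True, False, False, False]

bookmark = 0
for should_answer in judgments:
    if not should_answer:
        break
    bookmark += 1

# The bookmark lands on the last "yes": a Level 3 student is expected to
# answer the first `bookmark` items of the OIB correctly.
print("Raw cut score for Level 3:", bookmark)
```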
The problem is that PLDs are not actually students. PLDs are arbitrary, almost meaningless statements that are made up very quickly by people who, for all intents and purposes, have little idea what will be done with them after students take the exam. So we end up having hypothetical conversations like this one that inform where we place our bookmarks and, therefore, what the cut score becomes:
Jane at Table 1: Man, this question is super hard because–Broca’s Brain?! Come on, how many 11th graders would actually understand the message here? I am going to say a Level 3 probably won’t get this right.
Dick at Table 2: No, this text is grade level appropriate. I just asked that state ed person in the corner and she said so. Our PLD says right here that a Level 3 student understands grade level texts. So, no, it should not be too hard. A Level 3 student should definitely get this question right.
Let me say this one more time, this time in response to imaginary Dick at Table 2: PLDs are not students. They are broad categories that can be interpreted differently by every single person who reads them. Even if, as a student, I fall squarely into the Level 3 category for my ability to understand a grade level text, that does not necessarily mean that I am able to distinguish between the very subtle nuances presented to me in the multiple choice options. It does not mean that there is a multiple choice option that approximates the (correct) answer I came up with on my own when I read the question. It does not mean that I have had the lived/linguistic experiences necessary in order to comprehend the nuances of the figurative language, even if I have a good sense of what the text, taken as a whole, is saying. For Dick, none of that matters. Because PLD. (View the test in its entirety here and assess the difficulty level for yourself.)
PLDs do a good job making general statements about what a kid can kinda do in a vague sort of way. What they do not do is assuage the subjectivity of individual bookmarkers. They are also terrible at representing the complexity of actual students and attending to the myriad and layered complexities involved in answering each and every question on the assessment.
But take this with a grain of salt. I’m no psychometrician.
Concern #2: Setting Cut Scores on a Test that is Not Fully Operationalized
As it turns out, psychometricians aren’t big on anecdotal evidence. But here’s what I know, anecdotally speaking. Not all 11th graders in NYS took the new Regents exam. Districts were given the choice of whether they would administer the test or not. Some districts chose to opt out altogether while others administered both the new and the old tests. My concern was about the representativeness of the sample upon which we were basing our cut score decisions. Based on the demographics of students who actually took this new test, would it be possible to draw a sample that was representative of all 11th graders in NYS? Were various demographic groups, including (but not limited to) Latino and Black students, students with disabilities, and English learners accurately represented in the test data that would be informing the cut score setting process?
I had a difficult time imagining how that was possible. Perhaps it is because I am not a psychometrician, or maybe it was just pragmatics. Would school districts be willing to bear the expense of test proctors, graders, and substitute teachers, along with the loss of precious instructional time, on a test that they knew full well their students were not prepared for? My sense was that it would be mostly higher achieving students and wealthier districts choosing to give this test. If that is true—and I have been assured by NYSED staff that it is not—then the sample is skewed toward students who are expected, statistically speaking, to perform pretty well. All I could think during the cut score setting was that if our cut score was based on data skewed toward higher achieving students, everyone else would be at a grave disadvantage for years to come. They would be expected to perform to a bar set by predominantly successful students. Unfortunately, though I asked, I was not permitted to see any data that reflected the demographics of students tested. I was assured, however, that the details of the sample would be provided in the cut score report.
On June 23rd, SED released their cut score report. In it, they break the sample down into several demographic categories and illustrate that the percentage of students in each category in the sample is similar to that in the population. Yet, in defiance of anything one learns in Statistics 101, they never give the number of test takers in the sample. The sample could be 10,000 students or it could be 100. These percentages tell us nothing about whether the test results of the sample can be generalized to New York’s population of 11th graders.
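To see why the missing n matters, here is a quick sketch of the margin of error on a sample proportion; the 18% demographic share is hypothetical, since SED never published the numbers one would need.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """95% margin of error for a sample proportion p based on n test takers."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.18  # hypothetical share of some demographic group in the sample
for n in (100, 10_000):
    print(f"n = {n:>6}: {p:.0%} +/- {margin_of_error(p, n):.1%}")

# n =    100: 18% +/- 7.5%  -> "matching" the population is nearly meaningless
# n =  10000: 18% +/- 0.8%  -> now the comparison actually says something
```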
While there is no way to tell from the data SED eventually provided, it is possible that the sample is not skewed. After an hour or more of asking for data about the sample, speaking with several SED folks who each gave me different answers about the sample and reasons that I would not be permitted to see any data (ranging from “it’s secure” to “we don’t have it” to questioning the legitimacy of my request due to my non-you-know-what status), everyone eventually got on the same page. By the end of our last day, the group was on message: the sample is representative.
But, even if this is true, it doesn’t actually improve the situation. Students across the board were underprepared for the exam, having had only one year of Common Core-aligned instruction. Because this is a test they were not actually prepared to take, difficulty levels were inflated (remember: they are based only on the number of students who answered each item correctly), causing the cut score to be set relatively low. As years progress and as students have more experience with the Common Core, they will inevitably perform better. All of this cut score nonsense will be long since forgotten, and we will all sing the praises of Commissioner King for increasing graduation rates through his tireless pursuit of high standards. Of course, this type of score manipulation is not new. In 2013, 11th graders’ chances of success on the Regents were diminished by 20% thanks to score conversion charts alone. Now that I think about it, that event set the stage really nicely for the necessity of speedy reform.
Regardless of the sample, this was a test students were not actually prepared to take. Cut scores should have never been set for the next who-knows-how-many-years based on a pilot run. Period.
Even a psychometrician should know that.
Welcome to the rabbit hole…
The rapidity of this process has been completely mind-boggling. The people responsible have been acting as if this is a product rollout, trying to hit the market before a competitor can get its own product out, and damn the bug checking.
But of course. Gates/Microsoft makes a habit of rolling out products with bugs in them…always and forever. Amen.
And then sell the fixes.
This piece…. Ah, the clearly ground lens of anecdote! So much at stake, swinging from a limb of that psychometric tree. Orwellian!
In these matters, the devil lies in the details, and proponents often know nothing or next to nothing of the details.
As I remember from my doctoral course in assessment, the setting of cut scores is normally done by a modified Angoff process, where “experts” each determine what would be appropriate, and from the range of responses a level is set someplace within that range. Except that even with a range, and presuming the “experts” are applying appropriate expectations, those responsible for administering the test may well choose to set the cut score inappropriately for political reasons. Remember that cut scores do not have to remain constant year to year, and for political reasons educational administrators at the state level often want the data to demonstrate how their approach is improving results: the second and subsequent administrations of new testing regimes show improved passing results, when in fact this may merely be an artifact of manipulating cut scores.
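For anyone who skipped that doctoral course, a rough sketch of the modified Angoff arithmetic, using made-up ratings on a five-item test, shows where that range of expert responses comes from:

```python
# Each expert estimates the probability that a minimally competent
# ("borderline") student answers each item correctly. An expert's implied
# raw cut score is the sum of their ratings. All ratings are hypothetical.
expert_ratings = {
    "Expert A": [0.90, 0.75, 0.60, 0.40, 0.30],
    "Expert B": [0.85, 0.80, 0.55, 0.50, 0.20],
    "Expert C": [0.95, 0.70, 0.65, 0.35, 0.40],
}

cut_scores = {name: sum(r) for name, r in expert_ratings.items()}
lo, hi = min(cut_scores.values()), max(cut_scores.values())

print(cut_scores)  # each expert's implied raw cut score
print(f"defensible range: {lo:.2f} to {hi:.2f} raw points out of 5")
# The point above: nothing stops administrators from placing the final cut
# anywhere in (or, politically, outside) this range -- and moving it next year.
```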
I live in Virginia, which saw this happen. In high school American History, for the first round of the Standards of Learning (established under Republican Governor George Allen and continued under his immediate successor, Democrat Mark Warner), the original cut score was set above ANY of the expert determinations – naturally, that led to a very high failure rate. As I recall, a large percentage of the students in wealthy Fairfax County “failed.”
I taught one year in a middle school in Arlington, Virginia. The year before I taught there, the pass rate on middle school US History was 58%. The year I was there our pass rate was 81%. Sounds like great improvement, right? Especially given that the other two teachers of the subject were first-year teachers and I, while experienced (6 years), was new to the curriculum and the grade level.
The conversion from raw score to scaled score had been changed to have a higher percentage of students pass. If we restated the scores from the previous (58%) year using the new (81%) year’s conversion, the improvement was far less dramatic: the previous year would have been around 72%. My personal pass rate was 89%.
So did we show improvement? Perhaps. But there were no controls for variance between the student population of the first year and that of the second. We were testing different cohorts, and it is quite conceivable that most of the variance in scores could be attributed to demographic differences in the populations. Similarly, there were no controls that could determine whether the reason 89% of my kids “passed” while the two new teachers both had pass rates in the 70s was simply that I had more able students or students with a greater base of prior knowledge. Or, and this is important, that my students had had better teachers for the first half of US History (which is broken into two years of instruction with the testing at the end of the 2nd year).
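A toy illustration of the restatement described above, with an invented cohort of raw scores and invented chart cut-offs rather than Virginia’s actual tables:

```python
import random

random.seed(1)
# Hypothetical cohort: 500 raw scores on a 50-point test.
raw_scores = [random.gauss(32, 8) for _ in range(500)]

# For pass/fail purposes a conversion chart collapses to one number: the
# minimum raw score that maps to the passing scaled score. Both hypothetical.
charts = {"old chart": 34, "new chart": 28}

for label, min_raw_to_pass in charts.items():
    rate = sum(r >= min_raw_to_pass for r in raw_scores) / len(raw_scores)
    print(f"{label}: {rate:.0%} pass")

# Same students, same answers -- only the chart changed, and the "pass rate"
# jumps by double digits.
```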
Two more points. I do not see in the underlying piece any discussion of the validity of any individual question. I also do not see whether some questions were discarded as outliers because students whose overall score was low performed at a higher level on such questions than students whose overall score was higher. Both of these issues are important in assessing the reliability and validity of the testing instrument.
Just a few thoughts to throw into the conversation.
If you plot the raw-to-scaled-score conversions for the New York ELA and Math tests for the entire period following NCLB, you will find that instead of being linear transformations, the resulting graphs jump around like lines on a Jackson Pollock painting or like gerbils on methamphetamine. The cut scores were chosen for completely political reasons, clearly.
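Anyone can run the check themselves: a linear transformation has constant differences, so erratic differences are the tell. The chart below is invented for illustration; NYSED publishes the real ones with each administration.

```python
# Hypothetical raw -> scaled conversion chart, sampled every 10 raw points.
conversion = {0: 0, 10: 24, 20: 48, 30: 61, 40: 79, 50: 100}

raws = sorted(conversion)
gains = [conversion[b] - conversion[a] for a, b in zip(raws, raws[1:])]
print("scaled-score gain per 10 raw points:", gains)
# A linear transformation would print identical gains; a run like
# [24, 24, 13, 18, 21] is the Pollock painting.
```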
One can train students to do well on these ELA tests. That’s definitely doable.
But that training will have almost no positive consequences for their ability to read, write, and think generally.
In other words, skill in taking these tests is something entirely different from skill in matters like reading a novel, poem or play with understanding; preparing a research paper; writing a speech, essay, or short story; speaking and writing standard English, etc.
In other words, the tests are entirely invalid.
That is, they do not measure what they purport to measure or what people think that they measure.
And it’s a shocking indictment of our system that this is not clear to those involved in creating and administering these tests.
One should always ask: “What are you really measuring here?” and “Is that what you intend to measure?” and “Do these measurements correlate with independent measurements of the same knowledge or abilities that you can trust?”
“What are you really measuring here?” and “Is that what you intend to measure?” and “Do these measurements correlate with independent measurements of the same knowledge or abilities that you can trust?”
Or, even more importantly, “Is what you are attempting to ‘measure’ measurable?” (and I think you all know my answer to that question)
If not then the whole exercise is an example of what I call MMOO!!! Mental Masturbation and Obligatory Onanism.
All that said, as long as kids have to take the tests, we must teach them how to do that.
Exactly. Our school district was rated rather highly, but this was just due to test scores. It received a lot of good press; however, I wish you could tell me as a parent that it was rated highly due to low class size, varied AP classes, teacher continuing ed, project-based learning, an emphasis on group work and social interaction, or an otherwise diverse learning environment.
The Common Core may be appropriate “obedience training” for dogs and zoo animals, but not children.
Achieve, Inc., the entity which seems to have birthed these assessments, needs scrutiny. Go to their website, check out the Board of Directors, big-money contributors, and staff, and you will not see the names of any experts in teaching, learning, psychometrics, or any other scholars in educational research. Try as I might, I can find no trace of psychometric expertise among this group of Common Core Standards “creators”: David Coleman; William McCallum; Phil Daro; Jason Zimba; Susan Pimentel. They are, nevertheless, firmly ensconced in the world of nonprofit, well-funded organizations, so there’s that.
By the way, many thanks to Dr. Maria Baldassarre Hopkins for her description of the debacle that is Common Core implementation. You are right on target.
“Psycho” metrics. Of course Maria couldn’t figure the cut scores out; she’s a teacher…we need everything scripted and laid out for us; then we are supposed to regurgitate the material back to the students like a momma bird feeds a nestling. Anyone/everyone who doesn’t have personal contact with students then gets to rate, judge and critique both teacher and student. If we do well, we’re patted on the head like an obedient dog; if not, we’re subjected to intensified scrutiny and more pointless training…
“. . . and more pointless training…”
And I first read that as “and more potty training…”
Down the rabbit hole, indeed. If you want to join in the madness, Smarter Balanced is looking for lots of people to help set “achievement scores.” Since this is an online experience, it will lack the bread pudding. Perhaps you could sign up with a group of friends and bring your own refreshments.
http://www.smarterbalanced.org
Registration now open for online panel of Achievement Level Setting
We are inviting up to a quarter of a million K-12 educators, higher education faculty, parents, and other interested parties to participate virtually in recommending achievement level scores. The online panel runs from October 6-17 and registration is now open. We encourage you to register because your voice matters!
Using straight “cut scores” to make pass/fail decisions (or even graduation decisions) — below: fail, above:pass — is NEVER statistically valid because there is ALWAYS UNCERTAINTY.
One cannot be certain that any given version of a test is a “true” measure of the standards, and one cannot be certain that any individual’s performance represents their “true” score.
If one is failing students when they simply “drop below” a cut score with no regard for HOW FAR below the cut score relative to the magnitude of the uncertainty, one is simply up to no good.
Anyone who does that is not a psychometrician, but simply a psycho.
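For the record, the classical test theory arithmetic behind that point is short. A sketch with hypothetical numbers for the score spread, the test’s reliability, and the cut:

```python
import math

sd, reliability = 10.0, 0.90  # hypothetical score SD and test reliability
sem = sd * math.sqrt(1 - reliability)  # classical SEM = SD * sqrt(1 - r)
cut = 65.0

for observed in (58.0, 62.0, 64.0, 66.0):
    if observed >= cut:
        verdict = "pass"
    elif cut - observed <= 1.96 * sem:  # within a 95% band of the cut
        verdict = "fail -- but statistically indistinguishable from passing"
    else:
        verdict = "fail"
    print(f"score {observed:>4}: {verdict}")

# With SD = 10 and reliability 0.90, the SEM is about 3.2 points, so scores
# several points below the cut cannot honestly be called failures.
```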
Right on. Just Google “psychometricians, US,” and find out what rare birds indeed these scholars are. No wonder they basically went ahead without them. Too much work and money involved. What a scam.
Isn’t the real point that if the questions were framed accurately, there would be only one answer? Simplistically, 2 + 2 will always and forever = 4. If an essay were about ducks and the mama duck taking its babies to swim, and along the way they met distractions but finally got to the Atlantic Ocean, the multiple choice answers would not be about turtles, fish, clouds and leaves. Common sense.
Do the testmakers intentionally make the questions too verbose and the possible answers difficult to choose from correctly?
Up is down. Down is up. What is their point? More remediation? More kids left back? More teachers fired? More money for Pearson, Gates, etc.?
I remember when the “science” of Psychometry was considered to be akin to Astrology, Phrenology, and Tarot card reading. I’ve never been exactly certain what happened to suddenly elevate the very uncertain, new, and unproven “science” of psychometry to the level of high priestly mysticism that it enjoys in education. I know that NCLB certainly aided in the virus-like spread, because I received dozens of letters inviting me to become a psychometrician; they were/are urgently needed and hard to find.
This article is enlightening and reminiscent of my own adventures in holistic scoring of the old high-stakes tests in NYC many years ago. Subjective and wildly inaccurate would be good descriptors.
And just like teacherken and Bob Shepard talk about above, Florida cut scores have always been a political tool, rising and falling depending on the latest argument Jeb Bush and his successors want to use in their political campaigns.
This is an election year, so the scores were magically “higher” on this year’s “final” FCAT tests. The governor crowed about how this proved that Jeb Bush’s reforms were working!
Last year there was a push for a parent trigger law and increased funding for charters, so the scores were magically “lower,” resulting in thousands of newly graded “D” and “F” schools. The governor crowed about how the CCSS and the accompanying tests were needed because they would raise the bar and fix this shameful showing.
That is, until the Tea Party freaked out and the state DOE pretended to get rid of the CCSS by renaming them the New Florida Standards and adding the USDOE-approved 15% extra standards, withdrawing from PARCC and shifting to AIR. Nobody was fooled except the true believers.
This politically caused drop in school grades coincidentally happened at the same time as the DOE’s creation of the DA (the Differentiated Accountability department, a descendant of the Spanish Inquisition), which began entering and taking over all “D” and “F” schools (read: Title I), preparing them for eventual closure and conversion to charters by conducting constant brief walk-throughs, mandating 11-page lesson plan formats for every subject, every day, and generally being sour, rude, condescending, and hateful. When pressed, they acknowledged that they were under state mandate to report nothing but negative things, since saying positive things might convince schools that they weren’t as bad as their grade implied.
Thank you, Chris.
We should be very skeptical about granting too much credibility to self-styled “scientists” who would have us think that the workings of the mind in complex and ever-shifting social, economic and political settings can be validly and reliably “metered.”
The term itself embodies mechanistic thinking that is unscientific.
Yes, statistics and the scientific method have a role to play in test creation and scaling, but that does not make it a “science,” just as, B-School assumptions notwithstanding, management is not a “science.”
How long before the current crop of Gates and Pearson-funded psycho-metricians are placed in the company of 19th and early 20th century phrenologists and eugenicists, who gilded their unacknowledged racial, cultural and political biases with the gleaming veneer of science?
The sooner, the better.
Chris, Michael, & KTA,
XO,
Ang
Chris in Florida & Michael Fiorillo:
A blast from the past—
[start quote]
A person who uses statistics does not thereby automatically become a scientist, any more than a person who uses a stethoscope automatically becomes a doctor. Nor is an activity necessarily scientific just because statistics are used in it.
The most important thing to understand about reliance on statistics in a field such as testing is that such reliance warps perspective. The person who holds that subjective judgment and opinion are suspect and decides that only statistics can provide the objectivity and relative certainty that he seeks, begins by unconsciously ignoring, and ends by consciously deriding, whatever can not be given a numerical measure or label. His sense of values becomes distorted. He comes to believe that whatever is non-numerical is inconsequential. He can not serve two masters. If he worships statistics he will simplify, fractionalize, distort, and cheapen in order to force things into a numerical mold.
The multiple-choice tester who meets criticisms by merely citing test statistics shows either his contempt for the intelligence of his readers or else his personal lack of concern for the non-numerical aspects of testing, importantly among them the deleterious effects his test procedures have on education.
[end quote]
[Banesh Hoffman, THE TYRANNY OF TESTING, from the 2003 reissue of the 1964 edition of the 1962 original, pp. 143-144]
Thank y’all for your comments.
😎
I agree with everything you say. Thanks for posting.
Reblogged this on V-Hypnagogic-Logic and commented:
So well-put!
Common Core standards and their Machiavellian testing agenda to destroy public education are “the Hunger Games of academic achievement!”
Now that, the average American will understand.
Remember: “the Hunger Games of academic achievement”
Use it often when describing the Common Core standards and let word of mouth spread it far and wide.
Reblogged this on Crazy Normal – the Classroom Exposé and commented:
Common Core standards are the Hunger Games of academic achievement
So much psychometric mental masturbation for what????