In this post, teacher Maria Baldassarre-Hopkins describes the process in which she and other educators participated, setting cut scores for the new Common Core tests in New York.
She signed a confidentiality agreement, so she is discreet on many questions and issues.
At the end of the day, Commissioner King could say that educators informed the process but in reality they made recommendations to him, which he was free to accept, modify, or ignore.
As many teachers have pointed out in blogs and comments, no responsible teacher would create a test with the expectation that 70% of students are sure to fail. Yet it would not be hard to do. You might, for example, give fifth graders a test designed for eighth graders. Repeat that in every grade and the failure rate will be high. Or you might test students on material they never studied. Some will get it, because of their background knowledge, but most will fail.
Why would you want most students to fail?
Commissioner King repeatedly warned superintendents, principals, and everyone else to expect proficiency rates to drop by 30 to 37 percentage points, and they did.
This is a manufactured crisis. We know who should be held accountable.
It is Commissioner John King and Regents Chancellor Merryl Tisch. They wanted a high failure rate. They got what they wanted.
***********
A response to the post above, by Fred Smith. Fred worked for many years as an assessment expert at the New York City Board of Education. He has become an invaluable resource for those fighting the misuse and abuse of high-stakes testing.
Fred writes:
Folks,
Kudos to Maria Baldassarre-Hopkins. This is an extremely important piece–an outline and an articulate account of how the 2013 cut scores were set. We’re finally getting a glimpse inside the testing program’s “black box”–how cut points are/were established.
Three points grab me and support contentions I share with other observers:
First, the cut scores are after-the-fact. “Cut scores were to be decided upon after (emphasis hers) NYS students in grades 3-8 took the tests.” I believe the standards were set in late June/early July.
Second, the review committee’s work is advisory–Despite the committee’s elaborate review process, the end results are recommendations to the commissioner.
Third – “(During the review) We were given more data in the form of p-values for each question in the OIB – the percentage of students who answered it correctly on the actual assessment.” This and the timing of the review strongly suggest that item-level data (item statistics) from the April 2013 operational tests were used to inform the determination of cut scores. That is, data generated by the test population were used–changing the concept of a standards-based test (as in testing aligned with the common core learning standards) to one that depends on the performance of students who took the test.
This makes the Level 2, 3 and 4 thresholds dependent on how well kids did on the exams–bringing the test score distribution into play and rendering judgments about cut scores and student achievement relative to the composition of the students who took a particular set of items at a particular time–a normative framework instead of a standards-based one. These factors will vary from year to year, and since 2013 was a baseline year with little it could be anchored to, it is even murkier to see how SED can justify what was done.
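Fred's point can be illustrated with a toy sketch (my own illustration, not SED's actual procedure, using made-up response data): item "p-values" are just the proportion of test takers answering each item correctly, and a norm-referenced cut placed at a percentile of the score distribution guarantees a pass rate regardless of what the items measure.

```python
import numpy as np

# Hypothetical illustration only: 1000 students, 40 items, each item
# answered correctly about 55% of the time. Not real test data.
rng = np.random.default_rng(0)
responses = rng.random((1000, 40)) < 0.55

p_values = responses.mean(axis=0)   # per-item difficulty: proportion correct
raw_scores = responses.sum(axis=1)  # each student's raw score

# A norm-referenced cut: place "proficient" at the 70th percentile of
# this year's score distribution, so roughly 30% pass by construction.
cut = np.percentile(raw_scores, 70)
pass_rate = (raw_scores >= cut).mean()
print(p_values[:3], cut, pass_rate)
```

Because the cut is defined by the cohort's own score distribution, the pass rate is fixed in advance no matter how hard or easy the items are, which is exactly the normative framework Fred describes.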
Let’s not forget either that the items on the April 2013 exams were largely generated via the indefensible June 2012 stand-alone field testing, a procedure that could not have yielded reliable or valid information to construct the core-aligned statewide tests–and, as a further consequence, would call the item stats the review committee worked with into question.
SED’s slide show presentation to the Regents in late July about the cut scores, this week’s news management spin campaign and its web site power point barrage on the release of the scores do not address important remaining questions about the quality of the 2013 exams and the cut scores.
There is information the SED obviously has in its possession (and desperately wants to keep hidden), as strikingly noted by Ms. Hopkins. We must demand and obtain:
1. P-values (difficulty levels) for all field test items that were selected for inclusion on the operational April tests – both the field test p-values and the corresponding operational test p-values.
2. Complete item analysis data, showing the percentage of students who chose the correct answer, the percentage who chose each distractor (each incorrect mislead), and the percentage of omissions (no response to the item).
3. The same information demanded in #1 and #2, broken down by ethnicity and, separately, by need/resource capacity.
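For readers unfamiliar with the jargon, the item analysis data Fred is demanding looks like this in miniature (invented answers for one hypothetical multiple-choice item): the percentage correct, the percentage choosing each distractor, and the percentage of omissions.

```python
from collections import Counter

# Hypothetical responses to one multiple-choice item; None = omitted.
answers = ["B", "B", "A", "C", "B", None, "D", "B", "A", None]
key = "B"  # the correct answer

n = len(answers)
counts = Counter(a if a is not None else "omit" for a in answers)

p_value = counts[key] / n  # proportion answering correctly
breakdown = {opt: counts.get(opt, 0) / n for opt in ["A", "B", "C", "D", "omit"]}
print(p_value)    # 0.4
print(breakdown)  # {'A': 0.2, 'B': 0.4, 'C': 0.1, 'D': 0.1, 'omit': 0.2}
```

None of this exposes item content, which is why, as Fred argues, there is no security rationale for withholding it.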
Even if SED refuses to produce all of the 2013 operational items that it owns for our scrutiny, there is no justification for refusal to provide the statistical data we are demanding–because none of the data involve exposure of the items and their content. SED and Pearson have no legitimate excuses for keeping us in the dark based on the immediate availability and nature of the information we are seeking.
The only way forward for all of us who want to have public schools that work is to cry out for sunshine, transparency and truth-in-testing. Short of that we can have no faith in anything coming out of Albany about its latest vision of reform. The messengers of bad news are on the run. Blow the trumpets. Get your representatives on board. Don’t let them slip and slide. This is a pivotal year.
Fred
Translated from test speak (and this is my best approximation – I’m not an expert):
Only about 30% of students were known to be proficient on the NAEP. Therefore, whatever this test actually measured, its results were supposed to mirror that.
So the cut scores were set by King with the goal of mirroring the NAEP results, regardless of what the actual test showed – the reasoning being that if the scores didn’t mirror NAEP, the test wasn’t designed properly.
So regardless of what is actually on the NAEP or on this test, their only similarity is the pass/fail numbers, and those were chosen with NAEP as the primary criterion. Students could not legitimately earn a passing score, because it was predetermined that most would fail, and that failure rate was treated as the validator of the test.
Final translation:
They predicted failure because they controlled the cut scores. They made the results mirror the NAEP’s failure rate on purpose. This test had no objectivity and gave students no chance to outperform it.
What does that say about the validity of using tests to evaluate teachers if they will forever be able to control the bell curve? Particularly if the socio-economic demographic turns out to be the primary predictor of whether a student did well on this test?
Please correct me but it was a rather dense read – that was my best take-away point though I might be seeing what I want to see in it.
So the kids were mere props for this predetermined ruse.
Plain and simple: this was legally sanctioned child abuse.
A mock trial should be held on the capital steps in Albany.
“A mock trial. . . ”
NO, a REAL trial for the child abuse that is the whole sham process should be held in the appropriate legal jurisdiction.
Typical. Having done scoring for state math tests from Connecticut (middle school) and NJ (elementary grades), I can say that anything shady you’ve heard about how these things are handled is likely an understatement. But confidentiality agreements make it pretty dicey to start blowing whistles without the backing of a legal fund and really good lawyers. I also scored pilot questions for the social studies portion of the high school test in Michigan in the late ’90s. Again, some pretty questionable issues arose, but nothing that made me crazy like I saw with the mathematics from Connecticut. The interplay between the state and the scoring company, with of course $$ trumping everything else, nearly guarantees wrongdoing, even though no one really MEANS to “do wrong.” The system simply makes it almost inevitable.
Alan Tucker from the mathematics department at SUNY@Stony Brook has two papers (at last look) on his web page about his work investigating problems with the 2003 Math Level A NY State Regents exam. They both get into technical issues about the math that informs cut scores and call into serious question whether any such tests can ever escape certain fundamental flaws. Look for them here: http://www.ams.sunysb.edu/~tucker/
“. . . with of course $$ trumping everything else nearly guarantees wrongdoing, even though no one really MEANS to “do wrong.” The system simply makes it almost inevitable.”
Perhaps an example of the banality of evil.
“What’s Going On”? http://www.youtube.com/watch?v=GDb4Ss9OJ64
But calling for sunshine, transparency and truth-in-testing is still not calling out for children. They are never mentioned. The flesh and blood of them is never center stage. Pearson and its peers are experimenting on the lives of real people they have never met and care nothing about. Their for-fat-profit instruments rain fear, shame, inadequacy, anger and sorrow down on human bodies. Who lets this go on? There is no confidentiality agreement, there is only complicity.
I’m confused. If the common core was designed so that a single set of rigorous curriculum standards in core subjects aimed at ensuring college and workplace readiness for all students would finally be in place across this nation, then how can cut scores for states using either the PARCC or Smarter Balanced assessments be allowed to be set on a state-by-state basis? Sounds precisely like the absurdity of NCLB, wherein every state set its own measure of what constituted “proficient,” and we all know how disgustingly that was gamed.
Assessments tied to standards, like the PARCC and Smarter Balanced tests purport to be, are supposed to be criterion-referenced — i.e., either the student gets the questions tied to a particular standard correct, or she doesn’t. So what we’re going to end up with is that a NY or CT or MA student’s score may not be “passing,” while the same score from a student in states known to have far lower student performance will be judged to be solidly passing! Yes? And as states move to require these scores be added to high school transcripts, this utter nonsense is going to inform colleges and employers precisely how? What a huge waste of fiscal and intellectual resources!
Don’t despair of being confused. That’s part of the plan. Read and understand Wilson’s study I reference below and it will become quite obvious that it’s all a chimera, a falsehood, a lie, an explosive brain fart.
And yes, it is “a huge waste of fiscal and intellectual resources!”
http://excelined.org/news/governor-jeb-bush-delivers-education-reform-address-to-the-american-legislative-exchange-council/
ALEC speech answers a lot of this…
I got sick to my stomach and had to stop reading after the first few comments.
Nope couldn’t make it through that nonsense. At least I didn’t have to see and hear that lying, hubristic SOB.
I got this far: “. . . must be backed by sufficient academic rigor. . . ” and had to stop.
RRRRRRRRRRRRRRRRRR!!!!!!!!!!
“Rigor” AAAAARRRRRRGGGGGHHHHH!!!!!
I am the not so proud owner (just until Weds, when it will be returned unused) of a t-shirt that we’re supposed to wear on the first day of class. It was given to us by the administration and has three “R” words on it: “reading” in a thin type script and “RIGOR and RELATIONSHIPS” in a very bold thick script. During open house I showed it to some current and former students, and their responses were quite telling as to what message students might take from what is being touted. And it’s certainly not the intended message. But then the designers of the shirt are just spouting the latest edujargon.
From MW online:
RIGOR: Noun
1 a (1) : harsh inflexibility in opinion, temper, or judgment : severity (2) : the quality of being unyielding or inflexible : strictness (3) : severity of life : austerity
b : an act or instance of strictness, severity, or cruelty
2: a tremor caused by a chill
3: a condition that makes life difficult, challenging, or uncomfortable; especially : extremity of cold
4: strict precision : exactness
5a obsolete : rigidity, stiffness
b : rigidness or torpor of organs or tissue that prevents response to stimuli
c : rigor mortis
Synonyms
rigour – severity – austerity – stringency – strictness
Antonym
flexibility
Yep, just what we want our students to be!!!
Now perhaps they are trying to convey meaning #4 but they sure aren’t being “strictly precise” in choosing that word.
I am a first grade teacher, lowercase – not to be confused with uppercase First Grade Teacher. That said, the article made me sick too!
Duane – red rigor tshirts – I read your comment several days earlier about the red rigor tshirts, I couldn’t stop laughing. Absurd. Crazy. Someone spent time conceiving, designing, and I’m sure proudly distributing these little gems. Hope they get a clue eventually.
I used to go along to get along. No more. Glad you retired the shirt unworn.
John King should resign.
New York parents, teachers and students should run these people out of office. Are there any lawyers out there with any decency? This is beyond crazy.
At the very least, every single student in the state of NY should be OPTED OUT of the test next spring.
From F. Smith: “SED and Pearson have no legitimate excuses for keeping us in the dark based on the immediate availability and nature of the information we are seeking.”
Sure there are, depending upon one’s definition of “legitimate”.
And “The only way forward for all of us who want to have public schools that work is to cry out for sunshine, transparency and truth-in-testing.”
While I agree with the general gist of the statement the fact is that there is NO “truth in testing”. Noel Wilson has elucidated all the errors, lies, and obfuscations involved in the whole educational standards and standardized testing regimes in his “Educational Standards and the Problem of Error” found at: http://epaa.asu.edu/ojs/article/view/577/700 .
Plainly seeing the windmills for what they are, I implore all to read and understand what Wilson has so brilliantly exposed as the frauds they are:
Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine. (updated 6/24/13 per Wilson email)
1. A quality cannot be quantified. Quantity is a sub-category of quality. It is illogical to judge/assess a whole category by only a part (sub-category) of the whole. The assessment is, by definition, lacking in the sense that “assessments are always of multidimensional qualities. To quantify them as one dimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. In attempting to quantify educational standards and standardized testing we are lacking much information about said interactions.
2. A major epistemological mistake is that we attach, with great importance, the “score” of the student, not only onto the student but also, by extension, the teacher, school and district. Any description of a testing event is only a description of an interaction, that of the student and the testing device at a given time and place. The only correct logical thing that we can attempt to do is to describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student as it cannot be a description of the student but the interaction. And this error is probably one of the most egregious “errors” that occur with standardized testing (and even the “grading” of students by a teacher).
3. Wilson identifies four “frames of reference” each with distinct assumptions (epistemological basis) about the assessment process from which the “assessor” views the interactions of the teaching and learning process: the Judge (think college professor who “knows” the students capabilities and grades them accordingly), the General Frame-think standardized testing that claims to have a “scientific” basis, the Specific Frame-think of learning by objective like computer based learning, getting a correct answer before moving on to the next screen, and the Responsive Frame-think of an apprenticeship in a trade or a medical residency program where the learner interacts with the “teacher” with constant feedback. Each category has its own sources of error and more error in the process is caused when the assessor confuses and conflates the categories.
4. Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”
In other words, all the errors involved in the process render any conclusions invalid.
5. The test makers/psychometricians, through all sorts of mathematical machinations attempt to “prove” that these tests (based on standards) are valid-errorless or supposedly at least with minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations and the test and results are. He is an advocate for the test taker not the test maker. In doing so he identifies thirteen sources of “error”, any one of which renders the test making/giving/disseminating of results invalid. As a basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.
6. Having shown the invalidity, and therefore the unreliability, of the whole process, Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory”. In other words, start with an invalidity, end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel who finds the occasional acorn, a result may be “true”) or, to put it in more mundane terms, shit in, crap out.
7. And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”
In other words it measures “’something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students, as the social rewards for some are not available to others who “don’t make the grade (sic).” Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?
My answer is NO!!!!!
One final note with Wilson channeling Foucault and his concept of subjectivization:
“So the mark [grade/test score] becomes part of the story about yourself and with sufficient repetitions becomes true: true because those who know, those in authority, say it is true; true because the society in which you live legitimates this authority; true because your cultural habitus makes it difficult for you to perceive, conceive and integrate those aspects of your experience that contradict the story; true because in acting out your story, which now includes the mark and its meaning, the social truth that created it is confirmed; true because if your mark is high you are consistently rewarded, so that your voice becomes a voice of authority in the power-knowledge discourses that reproduce the structure that helped to produce you; true because if your mark is low your voice becomes muted and confirms your lower position in the social hierarchy; true finally because that success or failure confirms that mark that implicitly predicted the now self evident consequences. And so the circle is complete.”
In other words students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of the students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, the “I’m an “A” student” is almost as harmful as “I’m an ‘F’ student” in hindering students becoming independent, critical and free thinkers. And having independent, critical and free thinkers is a threat to the current socio-economic structure of society.
“At the end of the day, Commissioner King could say that educators informed the process but in reality they made recommendations to him, which he was free to accept, modify, or ignore.”
This seems to be the modus operandi in many school districts: have a few teachers sit on a committee, spend countless hours sifting through information and debating the merits of several courses of action, and then making a choice only to have it vetoed by the superintendent because it was not his/ her choice. I stopped volunteering for these committees once I realized that all decisions were made prior to the committee ever meeting and teachers were only needed to lend some legitimacy to the process. The part that really makes me see red is when a program becomes unpopular or fails and the superintendent responds with, “Well, this is what you people chose.”
Data Points
Our children’s test scores are spread across the bare table.
We are asked to analyze
black data points connected by decisive red and yellow lines.
The expensive, colorful graphs
Conclude that 85% of our children have failed.
We will be held accountable.
False, punishing calculations.
Smother the true, non-linear story
of Samantha, the new ELA teacher, who tells us
that Jose, who came to the US last year from Mexico,
spoke for the first time in front of the whole class today!
Nothing new here with this info…..
Beginning in the new millennium, when Giuliani was mayor, we went to cut scores with norm-referenced testing, i.e. Levels 1-4, which made it easy to expand the $ummer $chool boondoggle. Remember $ocial promotion? What is interesting is that Regent Merryl Tisch is the educational adviser to UFT mayoral candidate Bill Thompson, who fell on his sword for Bloomberg and the UFT (no endorsement then) in Bloomberg’s last term.
Before 2000 we had McGraw Hill doing criterion-referenced testing, and educators were able to compare student achievement in real time by late spring. As a doctoral student in literacy studies at Hofstra, I did a correlational study comparing the city and state scores of each student and found NO correlation. I scored top of my class in educational statistics.
I think it’s vital that we deconstruct the words “college and career ready.”
“College” appears to mean “earn a Bachelor’s degree before age 25.” See this NY Times article, which summarizes results from a 2012 US Census study: http://www.nytimes.com/2012/02/24/education/census-finds-bachelors-degrees-at-record-level.html?_r=0
The pass rates on the NAEP and these new NY exams are also about 30%.
“And” means that whichever percentage is lower, college-ready or career-ready, is the controlling value.
“Career ready” is not yet clearly defined. I wonder how different policy-makers interpret these two words. It cannot describe fewer than 30% of our students, but there is no other data to indicate what percentage of students are considered career ready.
These pass rates are valid if you accept my definition of the terms.
***
My fear is that nearly 100% of students are still considered “college OR career ready” by the edu-reformers. There’s a low barrier of entry to unskilled and low-paid entry-level jobs.
These low scores are both accurate and dangerous. There are people who want to distract our nation from its sins of inequality and lack of equitable access to educational resources and make a profit by shifting funds from the public to the profit-making (let’s stop calling it “private,” that’s too generous a term) sector!
Just watch the op/ed pages around NYS this weekend and over the next month to see how privatization (read: “profit-making”) will save our students from the evil, lazy public sector. It’s brilliant marketing and leaves me so frustrated.