James Harvey here explores “the problem with proficiency.”
Common Core tests arbitrarily decided that the NAEP proficiency level should be the “passing” mark for all. Test results are routinely reported as if those who did not meet this standard were “failing.”
I have routinely argued on this blog that NAEP proficiency is equivalent to earning an A, and that it was nuts to expect all students to earn an A. Only in one state (Massachusetts) have as many as 50% reached the standard.
Harvey demonstrates the reality.
He writes:
“In 1996, the International Association for the Evaluation of Educational Achievement (IEA) released one of the earliest examinations of how well 4th grade students all over the world could read. IEA is a highly credible international institution that monitors comparative school performance; it also administers the Trends in International Mathematics and Science Study (TIMSS), a global assessment of 4th and 8th grade mathematics and science achievement. Its 1996 assessment (The IEA Reading Literacy Study, a predecessor to the Progress in International Reading Literacy Study, or PIRLS) demonstrated that out of 27 participating nations, U.S. 4th graders ranked number two in reading (National Center for Education Statistics, 1996). Only Finland ranked higher. To the extent these rankings mean very much, this second-place finish for the United States was an impressive accomplishment.
“But around the same time, the National Assessment Governing Board of the National Assessment of Educational Progress (NAEP) reported that just one-third of American 4th graders were “proficient” in reading. To this day, the board of NAEP continues to release similarly bleak findings about American 4th graders’ reading performance (National Center for Education Statistics, 2011). And IEA continues to release global findings indicating that the performance of U.S. 4th graders in reading remains world class (Mullis et al., 2012).
“How could both these findings be accurate? Was it true, as NAEP results indicated, that U.S. 4th graders couldn’t walk and chew gum at the same time? Or was IEA’s conclusion—that the performance of American 4th graders in an international context was first class—more valid? A broader question arises here, one that has intrigued researchers for years: How would other nations perform if their students were held to the NAEP achievement-level benchmark for “proficient”? How might they perform on Common Core-aligned assessments with benchmarks that reflect those of NAEP?
“How Would Other Nations Score on NAEP?
“In 2015, statistician Emre Gönülates and I set out to explore these questions on behalf of the National Superintendents Roundtable (of which I am executive director) and the Horace Mann League (on whose board I serve). The results of our examination, recently released in a report titled How High the Bar? (Harvey & Gönülates, 2017), are eye-opening. In short, the vast majority of students in the vast majority of nations would not clear the NAEP bar for proficiency in reading, mathematics, or science. And the same is true of the “career and college-readiness” benchmarks in mathematics and English language arts that are used by the major Common Core-aligned assessments.
“This finding matters because in recent years, communities all over the United States have seen bleak headlines about the performance of their students and schools. Many of these headlines rely on reports about student achievement from NAEP or the Common Core assessments. One particular concern is that only a minority of students in the United States meet the NAEP Proficient benchmark. Frequently, arguments in favor of maintaining this particular benchmark as the desired goal for American students and education institutions are couched in terms of establishing demanding standards so the United States becomes more competitive internationally.
“But the reality is that communities around the world would face identical bleak headlines if their students sat down to take the NAEP assessments. So, when U.S. citizens read that “only one-third” or “less than half” of the students in their local schools are proficient in mathematics, science, or reading (or other subjects), they can rest assured that the same judgments could be applied to national education systems throughout the world if students in those nations participated in NAEP or Common Core-related assessments. (This is true despite the widespread perception that average student performance in some other nations exceeds average student performance in the United States. The metric applied in our study is not a rank ordering of mean scores by nation but the percentage of students in each nation likely to exceed the NAEP Proficient benchmark.)
“Our findings may not even be surprising when we consider questions that have arisen from previous research on NAEP.”
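The metric distinction Harvey flags in the parenthetical above (a rank ordering of mean scores versus the percentage of students exceeding a fixed benchmark) can be sketched with a small calculation. This is a hypothetical illustration with made-up parameters, not actual NAEP data: the cut score of 280 and the two score distributions are assumptions chosen only to show that two nations with identical mean scores can clear a high benchmark at very different rates.

```python
from math import erf, sqrt

def pct_above(mean: float, sd: float, cut: float) -> float:
    """Share of a normally distributed score population above a cut score."""
    z = (cut - mean) / sd
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))  # 1 minus the normal CDF at z

# Hypothetical numbers: two nations with identical mean scores but
# different spreads, judged against a high "Proficient"-style cut.
cut = 280  # illustrative benchmark, not an actual NAEP cut score
a = pct_above(mean=250, sd=40, cut=cut)  # wide distribution
b = pct_above(mean=250, sd=25, cut=cut)  # narrow distribution

print(f"Nation A above cut: {a:.1%}")
print(f"Nation B above cut: {b:.1%}")
```

Under these assumed parameters, Nation A puts roughly twice as many students over the bar as Nation B despite the two having the same mean, which is why a country can rank near the top on average scores while still reporting that most of its students fall short of a high fixed benchmark.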
Harvey goes on to explain why it is absurd to use NAEP proficiency as a passing mark.

“I have routinely argued on this blog that NAEP proficiency is equivalent to earning an A, and that it was nuts to expect all students to earn an A.”
And it is just as nuts to expect that a simple letter (A, B, etc.) can be an adequate descriptor of what occurs in a classroom, in the child’s mind, and in the interaction between the teacher and the class.
Fallacious thinking paved over with more fallacious thinking can only result in fallacious thinking squared. Ay ay ay!
NAEP and all the other standardized tests used around the world contain all of the onto-epistemological errors, falsehoods, and psychometric fudgings that Noel Wilson identified in his 1997 dissertation, flaws that render the results of such tests COMPLETELY INVALID. To begin to understand, I urge all of you to read and comprehend “Educational Standards and the Problem of Error,” found at:
http://epaa.asu.edu/ojs/article/view/577/700
Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine.
A description of a quality can only be partially quantified. Quantity is almost always a very small aspect of quality. It is illogical to judge/assess a whole category by only a part of the whole. The assessment is, by definition, lacking, in the sense that “assessments are always of multidimensional qualities. To quantify them as unidimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. When educational standards and standardized testing attempt to quantify those interactions, the descriptive information they produce is inadequate, insufficient, and inferior to the point of invalidity and unacceptability.
A major epistemological mistake is that we attach the “score” with great importance not only to the student but also, by extension, to the teacher, school, and district. Any description of a testing event is only a description of an interaction: that of the student and the testing device at a given time and place. The only logically correct thing we can attempt to do is describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student, because it is a description not of the student but of the interaction. This is probably one of the most egregious “errors” that occur with standardized testing (and even with the “grading” of students by a teacher).
Wilson identifies four “frames of reference,” each with distinct assumptions (an epistemological basis) about the assessment process, from which the “assessor” views the interactions of the teaching and learning process: the Judge frame (think of a college professor who “knows” the students’ capabilities and grades them accordingly); the General frame (think of standardized testing that claims to have a “scientific” basis); the Specific frame (think of learning by objective, as in computer-based learning, where a correct answer is required before moving on to the next screen); and the Responsive frame (think of an apprenticeship in a trade or a medical residency program, where the learner interacts with the “teacher” with constant feedback). Each frame has its own sources of error, and more error is introduced when the assessor confuses and conflates the frames.
Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”
In other words all the logical errors involved in the process render any conclusions invalid.
The test makers/psychometricians, through all sorts of mathematical machinations, attempt to “prove” that these tests (based on standards) are valid, i.e., errorless, or at least carry minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations, the tests, and the results are. He is an advocate for the test taker, not the test maker. In doing so he identifies thirteen sources of “error,” any one of which renders the making, giving, and disseminating of test results invalid. And a basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.
Having shown the invalidity, and therefore the unreliability, of the whole process, Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory.” In other words, start with an invalidity and end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel that finds the occasional acorn, a result may be “true”), or, to put it in more mundane terms, crap in, crap out.
And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”
In other words, it attempts to measure “‘something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students, as the social rewards available to some are denied to others who “don’t make the grade (sic).” Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?
My answer is NO!!!!!
One final note with Wilson channeling Foucault and his concept of subjectivization:
“So the mark [grade/test score] becomes part of the story about yourself and with sufficient repetitions becomes true: true because those who know, those in authority, say it is true; true because the society in which you live legitimates this authority; true because your cultural habitus makes it difficult for you to perceive, conceive and integrate those aspects of your experience that contradict the story; true because in acting out your story, which now includes the mark and its meaning, the social truth that created it is confirmed; true because if your mark is high you are consistently rewarded, so that your voice becomes a voice of authority in the power-knowledge discourses that reproduce the structure that helped to produce you; true because if your mark is low your voice becomes muted and confirms your lower position in the social hierarchy; true finally because that success or failure confirms that mark that implicitly predicted the now self-evident consequences. And so the circle is complete.”
In other words, students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, “I’m an ‘A’ student” is almost as harmful as “I’m an ‘F’ student” in hindering students from becoming independent, critical, and free thinkers. And having independent, critical, and free thinkers is a threat to the current socio-economic structure of society.
Would be interested in all the comments on this; our city has just instituted “Grade Level Reading” by grade 3. Springfield MA is doing this, and they are rated on the scales as being very low in capacity to deliver the results (loving cities/communities). In reading through online today I see the quote “true reading proficiency sets a rigorous bar,” and in Ohio 64% read on grade level (at these early stages). Will dig out the references for these statements as we prepare for our “initiative” on grade level reading and discuss the issues more. I have tried to caution my state rep(s) on their jumping on the bandwagon with little thoughtful discussion taking place. Are any other cities signed up for this “reading at grade level”?
Notice the truncated view of education perpetuated by the league tables set up by test scores. Test scores in Math, ELA, Science matter to the bean counters.
The myth that there is a standard to be met for college AND career readiness (or is it college OR career readiness?) gets perpetuated, and the Common Core is treated as if it were uncontested wisdom. The OECD test scores are more of the same misrepresentation of learning, with league tables among nations, nicely fitting the ethos of Trump: got to be number one or you are only mediocre. Larry Cuban reminds us that the peak of the bell curve means average, and average is easily construed as mediocre. Test scores are the weapons of math destruction of much that we should be paying attention to (tip of the hat to Cathy O’Neil’s book Weapons of Math Destruction, more relevant with each day).
Thank goodness the NAEP tests in the visual arts are only given every decade, and only in grade 8 (when many students are not enrolled in art classes). And thank goodness the most informative part of the NAEP tests is the background questions.
As for the Common Core, that hot mess of standards and its supporting tests are hard to kill. Every Gates-funded project has strings attached: include or promote his $330-million-plus investment in that sham.
It’s only natural to take the next logical step and ask how space aliens would fare on NAEP assessments.
This could have profound implications should they ever visit earth.
They might not take it too well if we told them they were failures.
“The Common Galaxy Test”
We need a common test
For country and for glory
That lets us match our best
With those of A. Centauri
That is entirely unfair and unacceptable, Poet. Other planets have much smaller, much more homogeneous populations. The only thing Earth has going for it is billionaires. Thank Neptune for Bill Gates! He grants us the ability to give him data and let him tell us how to live. All hail.
Mars failed the PISA exam; Pluto excelled.
“Solarsystem PISA”
Uranus passed the PISA test
But no one’s home on Mars
And Pluto did the very best
(To earn the Tesla cars)
And Mercury was out to lunch
But Venus made the grade
And Jupiter improved a bunch
But Saturn was just “slayed”
And Neptune played a different tune
With focus on the art
But Earth was simply laid to ruin —
On PISA came apart
“The problems cannot be addressed if they are defined by trying to meet inappropriate benchmarks that were developed as much to prove a political point as to enlighten public understanding about the nature of our educational challenges.”
The standardized testing in our schools has been a political football with moveable goalposts of “proficiency” as defined by uninformed politicians. Testing has been misused to pigeonhole students, deny opportunity, rate teachers and schools, fire teachers and close schools. This misuse of testing has been a tool of educational “reformers” that want to destroy public education in favor of moving public funds into privatized education.
The above-grade-level expectations on the NAEP are a perfect example of the misuse of test scores. Comparing the US to nations like Singapore or Japan is a flawed comparison. These are relatively small, homogeneous populations with considerably lower levels of poverty than the large, diverse US, where over 50% of students in public schools live in poverty. As Pasi Sahlberg has said, “The US does not have an education problem; it has a poverty problem.” Our poverty problem is the product of poor decisions made by our policymakers, not educators. The US has many students who perform at high academic levels, but they represent a much smaller percentage of the total than our poor students. Our country will never get the highest scores on standardized tests, but it does not matter. We have plenty of achievers in our country who continue to invent and produce, and this is far more important than scores on standardized tests.
After Duane told me to read N. Wilson’s article, I borrowed his “psychometric fudge.”
Subject: psychometric “Fudge”
*Developing tests for assessing students’ abilities has been a part of the American educational scene for over a century.
*There are proven and established psychometric techniques to establish high degrees of reliability and validity for tests/products, as reported in technical manuals (see, for example, the Iowa Tests, California Achievement Test, Stanford, Metropolitan, etc.), the tests we took when we were in school (see Jay P. Greene’s article).
*PARCC/MCAS and Smarter Balanced should have provided similar evidence that they are proven tools for assessing students on credible “higher” standards (see Jay P. Greene’s article).
*It is impossible to tell how accurate and precise these experimental tests are because evidence of their accuracy is not readily available. Publishers of these tests do make mistakes that are uncovered on occasion and revealed (less frequently) to the public paying the bill.
*A greater concern is the “cut scores” that determine proficiency levels; DESE establishes the cut scores for MA schools.
*Setting cut scores is a subjective decision. (In the “reading wars” there are a lot more “subjective” choices: is “turquoise” a 2nd grade word? Is “quarantine” a 3rd grade word?)
*Comparisons with other state testing programs would have provided relevant benchmarks for determining the cost-effectiveness of the MA testing program. This would have brought true transparency to MA’s emerging experimental tests, which have assumed great weight and expense over these past several years.
*These faulty tests are now being used for high stakes consequences. The roll-out from design (logic plan) to implementation has been a disaster known as “test and punish”
*What do these experimental state test results tell us? Not a lot. MCAS is a very blunt instrument; even if there is any reality to the data at the “quadrant” level, it is still a costly experiment producing very little serviceable data for the money (lawyers like to use “actionable” where I use “serviceable”). I suggest we not even use the results for Haverhill.
*Testing companies (Pearson and all of them) fail to respond to objections from parents who claim the tests are not adequate.
*What about the State legislature that has to approve these expenditures from DESE for state testing?
*We need to be knowledgeable about psychometrics and be aggressively proactive in the public and political domain to CLARIFY all of these questions. The damage done to the integrity of student testing is a direct result of this controversy and presents a serious challenge.
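The point above about cut scores being a subjective decision can be illustrated numerically. A minimal sketch, assuming a hypothetical normally distributed score scale with mean 500 and standard deviation 100 (made-up parameters, not actual MCAS or DESE values): the same student population yields very different “percent proficient” headlines depending solely on where the cut is placed.

```python
from math import erf, sqrt

def pct_proficient(mean: float, sd: float, cut: float) -> float:
    """Share of a normal score distribution at or above a cut score."""
    z = (cut - mean) / sd
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))  # 1 minus the normal CDF at z

# Hypothetical scale: mean 500, SD 100 (illustrative, not a real test scale).
# Same students, four different headlines:
for cut in (450, 500, 550, 600):
    print(f"cut score {cut}: {pct_proficient(500, 100, cut):.0%} 'proficient'")
```

Under these assumptions, moving the cut from 450 to 600 drops the share deemed “proficient” from about 69% to about 16% with no change whatsoever in student performance; the choice of where to draw the line is what generates the bleak (or rosy) headline.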
Jean,
Be sure to read Daniel Koretz’s “The Testing Charade” for back up.
thanks, I most certainly will
Jean,
You have touched on a subject that I’ve been exploring a little: the questions of validity and reliability as used by the test makers themselves. Do they really subject any and all questions and tests to the analysis that is demanded of them by the guiding organization’s Standards? My guess is no, they don’t. If not, then the test makers have not followed their professional ethical mandate, and the tests should be tossed out. And even if they have done so, that validity analysis is not enough to overcome Wilson’s total destruction of the standards and testing process.
So it comes down to this for me: why even waste the time, energy, and money on such an invalid process?
“Reliable Invalidity”
Reliably invalid
Is what the testing is
Like lettuce in a salad
It’s always in the mix
Using the NAEP as a target that all children must reach to be considered successful in school is like saying that we must remove all cars in America that don’t match the most reliable car in the U.S.
Guess what the most reliable car in America is?
https://www.inc.com/chris-matyszczyk/the-most-reliable-car-in-america-is-one-you-might-never-have-heard-of.html
Just like all people are not created to be exactly the same, all cars are not created to be exactly the same.
The people behind this thinking have assembly line brains.
Yet my state, Illinois, decided that those cut scores weren’t high enough, so they raised them!?
“Common Core tests arbitrarily decided that the NAEP proficiency level should be the “passing” mark for all.”
These people believe that demand-supply duality works everywhere: they demand A from all students, and the supply of abundant A’s will come.
They don’t actually believe that.
They just know that most students will fail.
But they don’t care because it ain’t their problem.
They don’t ever have to deal with the monstrosity that they created.
The actual work of carrying out the impossible mandate falls on the shoulders of someone else (in this case teachers).
That’s what you do when you are “The Decider”: you create impossible requirements (insurmountable problems) for the pee-ons and then move on to a high paying gig working for a company that benefits directly from the policies you put in place.
That’s precisely what David Coleman did and now he is raking in $700k per year from College Board effectively working as an “enforcer” of the unreasonable mandates that he put in place.
The point of having a fine sieve is to separate the chaff from the seeds.
In my view, standards should be more akin to minimum electrical building codes: while they don’t create a great electrician, they help assure that, at a minimum, 100% of buildings will be safe from electrical fires. So for education, we might create a foundational statement like, “All children should be able to perform basic reading by the third grade so as to pursue continued learning in future grades.” Further, to encourage self-learning, we would encourage engagement in reading, a “love of reading,” for all students. Learning should be joyful and exciting. But that would be the minimum standard for 100% of students. That’s a standard.
Current bogus content standards and cut scores undermine true standards.
And if you want to create a cutoff, akin to the bar exam, fine, but not one that asks all students to score above a point set at the 50th percentile of previous learners.