What if the International Tests Are Wrong?

The corporate reform assault on American public education rests in large part on the international test called PISA (Programme in International Student Assessment), where US students rank well behind other nations and have only middling performance. Of course, the critics who brandish these mediocre scores never admit that they are heavily influenced by the unusually high proportion of students living in poverty, and that American students in low-poverty schools score as well or better than the highest performing nations. To do so would be an admission that poverty matters, and they reject that idea.

But what if the PISA tests are fundamentally flawed? So argues testing experts in this article in the (UK) TES.

It turns out on examination that the results vary widely from one administration of the test to another. Students in different countries do not answer the same questions. And there are serious technical issues that experts debate.

The article asks:

“But what if there are “serious problems” with the Pisa data? What if the statistical techniques used to compile it are “utterly wrong” and based on a “profound conceptual error”? Suppose the whole idea of being able to accurately rank such diverse education systems is “meaningless”, “madness”?

“What if you learned that Pisa’s comparisons are not based on a common test, but on different students answering different questions? And what if switching these questions around leads to huge variations in the all- important Pisa rankings, with the UK finishing anywhere between 14th and 30th and Denmark between fifth and 37th? What if these rankings – that so many reputations and billions of pounds depend on, that have so much impact on students and teachers around the world – are in fact “useless”?

“This is the worrying reality of Pisa, according to several academics who are independently reaching some damning conclusions about the world’s favourite education league tables. As far as they are concerned, the emperor has no clothes.”

The article cites the concerns of many testing experts:

“Professor Svend Kreiner of the University of Copenhagen, Denmark, has looked at the reading results for 2006 in detail and notes that another 40 per cent of participating students were tested on just 14 of the 28 reading questions used in the assessment. So only approximately 10 per cent of the students who took part in Pisa were tested on all 28 reading questions.

“This in itself is ridiculous,” Kreiner tells TES. “Most people don’t know that half of the students taking part in Pisa (2006) do not respond to any reading item at all. Despite that, Pisa assigns reading scores to these children.”

“People may also be unaware that the variation in questions isn’t merely between students within the same country. There is also between-country variation.

“For example, eight of the 28 reading questions used in Pisa 2006 were deleted from the final analysis in some countries. The OECD says that this was because they were considered to be “dodgy” and “had poor psychometric properties in a particular country”. However, in other countries the data from these questions did contribute to their Pisa scores.

“In short, the test questions used vary between students and between countries participating in exactly the same Pisa assessment.”

Professor Kreiner says the methodology renders the results “meaningless.”

“The Rasch model is at the heart of some of the strongest criticisms being made of Pisa. It is also the black box within Pisa’s black box: exactly how the model works is something that few people fully understand.

“But Kreiner does. He was a student of Georg Rasch, the Danish statistician who gave his name to the model, and has personally worked with it for 40 years. “I know that model well,” Kreiner tells TES. “I know exactly what goes on there.” And that is why he is worried about Pisa.

“He says that for the Rasch model to work for Pisa, all the questions used in the study would have to function in exactly the same way – be equally difficult – in all participating countries. According to Kreiner, if the questions have “different degrees of difficulty in different countries” – if, in technical terms, there is differential item functioning (DIF) – Rasch should not be used.

“That was the first thing that I looked for, and I found extremely strong evidence of DIF,” he says. “That means that (Pisa) comparisons between countries are meaningless.”

Please, someone, anyone, send this article to Secretary Arne Duncan; to President Obama; to Bill Gates; and to all the other “reformers” who want to destroy public education based on flawed and meaningless international test scores.

deb says:

August 20, 2013 at 11:23 am

Is this surprising at all? How about the realization that the tests given in each state to determine AYP and proficiency are different every year and that impacts the exactness of difficulty and the ability to actually tell that a student has made a year’s growth. In looking at the sample questions on the released questions for practice tests in Ohio, the differences at the same grade level each year are vast. The differences between 3rd and 4th as well as 4th and 5th are huge. There is no scientific proof that the tests mean what their proponents claim that they mean.

LikeLike

teachingeconomist says:

August 20, 2013 at 11:51 am

It is typically argued here that poverty has an important impact on student learning. It is also argued here that schools with a large percentage of high poverty students do relatively worse, on average, than schools with a low percentage of high poverty students. If standardized exam results are truly random, would we see the correlation between poverty and the exam results?

LikeLike

- deb says:
  
  August 20, 2013 at 12:03 pm
  
  I would guess that it would. My view is that since the tests measure processes more than retention of information, a child’s poverty will impact their learning and that would impact whatever version of the test they were taking. Lacking sleep, being hungry, having screaming caregivers, dealing with divorce and abuse will impact the testing day no matter the questions. That in and of itself invalidates the tests’ accuracy. JMO.
  
  LikeLike
- Teachingeconomist says:
  
  August 20, 2013 at 12:24 pm
  
  Do you think that the correlation between test scores and poverty misleads us into believing that there is a relationship between poverty and retention when in fact what we are seeing is that poor students are more likely to be having a bad day on test day than than m
  
  LikeLike
- Teachingeconomist says:
  
  August 20, 2013 at 12:25 pm
  
  (Cont) more wealthy students?
  
  LikeLike
- deb says:
  
  August 20, 2013 at 12:45 pm
  
  Wealth is no guarantee of anything but opportunity. Parental support and prep helps those kids. Remember that generally the discussion is about passing/proficiency not about advanced scores. Failures are often indicative (at least in my experience) of students who don’t even try to answer, partial answers, not having time to get finished, falling asleep, or being disinterested. I have no personal contact with urban poor or suburban wealthy kids.
  
  LikeLike

Duane Swacker says:

August 20, 2013 at 11:32 am

“. . . What if the statistical techniques used to compile it are “utterly wrong” and based on a “profound conceptual error”? Suppose the whole idea of being able to accurately rank such diverse education systems is “meaningless”, “madness”?”

That “profound conceptual error” is the epistemological error that a quality of the teaching and learning process can be quantified and that that quantification can be assigned/attached to the student, teacher or school. Noel Wilson proves that these results are “vain and illusory”, in other words “meaningless madness”, in “Educational Standards and the Problem of Error” found at: http://epaa.asu.edu/ojs/article/view/577/700

Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine. (updated 6/24/13 per Wilson email)

1. A quality cannot be quantified. Quantity is a sub-category of quality. It is illogical to judge/assess a whole category by only a part (sub-category) of the whole. The assessment is, by definition, lacking in the sense that “assessments are always of multidimensional qualities. To quantify them as one dimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. In attempting to quantify educational standards and standardized testing we are lacking much information about said interactions.

2. A major epistemological mistake is that we attach, with great importance, the “score” of the student, not only onto the student but also, by extension, the teacher, school and district. Any description of a testing event is only a description of an interaction, that of the student and the testing device at a given time and place. The only correct logical thing that we can attempt to do is to describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student as it cannot be a description of the student but the interaction. And this error is probably one of the most egregious “errors” that occur with standardized testing (and even the “grading” of students by a teacher).

3. Wilson identifies four “frames of reference” each with distinct assumptions (epistemological basis) about the assessment process from which the “assessor” views the interactions of the teaching and learning process: the Judge (think college professor who “knows” the students capabilities and grades them accordingly), the General Frame-think standardized testing that claims to have a “scientific” basis, the Specific Frame-think of learning by objective like computer based learning, getting a correct answer before moving on to the next screen, and the Responsive Frame-think of an apprenticeship in a trade or a medical residency program where the learner interacts with the “teacher” with constant feedback. Each category has its own sources of error and more error in the process is caused when the assessor confuses and conflates the categories.

4. Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”

In other word all the errors involved in the process render any conclusions invalid.

5. The test makers/psychometricians, through all sorts of mathematical machinations attempt to “prove” that these tests (based on standards) are valid-errorless or supposedly at least with minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations and the test and results are. He is an advocate for the test taker not the test maker. In doing so he identifies thirteen sources of “error”, any one of which renders the test making/giving/disseminating of results invalid. As a basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.

6. Having shown the invalidity, and therefore the unreliability, of the whole process Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory”. In other words start with an invalidity, end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel who finds the occasional acorn, a result may be “true”) or to put in more mundane terms crap in-crap out.

7. And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”

In other words it measures “’something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students as the social rewards for some are not available to others who “don’t make the grade (sic)” Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?

My answer is NO!!!!!

One final note with Wilson channeling Foucault and his concept of subjectivization:

“So the mark [grade/test score] becomes part of the story about yourself and with sufficient repetitions becomes true: true because those who know, those in authority, say it is true; true because the society in which you live legitimates this authority; true because your cultural habitus makes it difficult for you to perceive, conceive and integrate those aspects of your experience that contradict the story; true because in acting out your story, which now includes the mark and its meaning, the social truth that created it is confirmed; true because if your mark is high you are consistently rewarded, so that your voice becomes a voice of authority in the power-knowledge discourses that reproduce the structure that helped to produce you; true because if your mark is low your voice becomes muted and confirms your lower position in the social hierarchy; true finally because that success or failure confirms that mark that implicitly predicted the now self evident consequences. And so the circle is complete.”

In other words students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of the students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, the “I’m an “A” student” is almost as harmful as “I’m an ‘F’ student” in hindering students becoming independent, critical and free thinkers. And having independent, critical and free thinkers is a threat to the current socio-economic structure of society

greatbooksdude says:

August 20, 2013 at 11:46 am

All of this is true, but the reality is that American students are consistently learning less and less. I work with them–I know first-hand. The problem is that trying to “reform” it rather than OVERHAUL it is bound to simply perpetuate and exacerbate the issue.

FLERP! says:

August 20, 2013 at 2:08 pm

I hope nobody else who works with American students, and knows first-hand, contradicts you. I won’t know what to believe!

LikeLike

Nic Spaull says:

August 20, 2013 at 12:10 pm

If you want to read both sides of this argument, here is Andreas Schleicher’s response to some of these criticisms: http://www.tes.co.uk/article.aspx?storycode=6345213&utm_content=bufferb3e92&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer

August 20, 2013 at 12:29 pm

I don’t know. I would assume it has an impact. But day after day, coupled with high absentee rates, transiency, and even homelessness make these students less likely to succeed. Heck, my dad was poor. He graduated early and was still salutatorian. But they had food. Worked hard. Slept well. Walked to school. Poor people aren’t naturally dumb.

retiredbutmissthekids says:

August 20, 2013 at 3:51 pm

Oh, Deb, your last sentence rings so true! However, this constant of “teaching to the test” as well as the Common Core is going to make every last child whose parents cannot afford private schools dumb (with this dumbed down, authoritarian curriculum). And that is the objective of the 1%–worker bees for Walmart.

LikeLike

- Deb says:
  
  August 20, 2013 at 4:15 pm
  
  I don’t think that the CC is necessarily dumbed down from what I saw of it. It isn’t a curriculum. It is a framework on which to build a good curriculum. It flaws aren’t in dumbing down but in sending the objectives to lower grade levels making it difficult for students to succeed. It also has changed the vocabulary and conversation for elementary teachers. I believe their intent is to change elementary schools from using elementary/developmental ways to explain things to kids, but that requires using technical terminology that is often foreign to the teachers’ educational experiences. Some of the terms on the rubrics for 4th grade ELA are absolutely not typical to educational experiences. In some ways, the new CC seems to expect all elementary teachers to be “renaissance people” who are experts in all areas. They are taking the focus away from pedagogy, fun, developmental learning to technical terminology at younger and younger ages. So, I am not sure what will happen to those kids along the way. All I can say is: if it is a requirement for public school kids to know and do certain things, then it should be a requirement for charter and private schools to do the same.
  
  LikeLike
- Harlan Underhill says:
  
  August 20, 2013 at 8:26 pm
  
  Now this is the most clear and sensible precis of what the standards actually read like that I have seen. It does demand a possible step up for some elementary teachers unfamiliar with the framework of logic and rhetoric which is standard among college teachers of composition and literature. In my high-school English department, almost all of us were refugees from college and university teaching, and thus always taught from the framework of logic and rhetoric, but the one member for a couple of years whose background was in teaching reading below 7th grade felt perpetually inferior and nervous with the literature and composition we taught from 7th grade on up through 12th. I always thought EVERYONE knew even the elementary linguistics, rhetoric, and logic with which I was familiar, but it turned out not to be the case and was a very painful experience for her, and eventually she quit, apparently feeling that she was in way, way over her head. I had a long term sub once whose Master’s Degree was in English Language Arts, but he just could not understand Shakespeare well enough to teach it, even read it out loud. He was much more comfortable as a third grade teacher. Nothing else “wrong” with him. Good persona, good class management skills, but he just didn’t “know” enough to cope with Shakespeare, and it was painfully obvious to the kids in my 9th and 10th grade classes.
  
  LikeLike
- deb says:
  
  August 20, 2013 at 8:41 pm
  
  But, see, it isn’t about smarts or superiority. It is about different coursework. Unfortunately, some people think their particular interest or strength is superior to other fields, but it is only different. No one can know everything. No one needs to. If someone thinks that they do, they become a boorish personality.
  
  Until this grand scheme came along, developmental learning/teaching and pedagogical skills are considered if basic importance to set children up for secondary learning. High school teachers are often just as lost with developmental teaching or remediation. They are often subject centered rather than student centered. There are students at all levels who need the different styles of teaching in order to learn at an optimal level. No one should be driven from teaching by being overwhelmed with new jargon. People only have so much time to devote to learning.
  
  LikeLike
- Harlan Underhill says:
  
  August 21, 2013 at 1:06 am
  
  I don’t know how to test experimentally the proposition which I think is true, that “All skills are knowledge.” Thus I don’t think education is a matter of different content for different kids. I think there is a true hierarchy in intellectual skills. Not everyone is interested in climbing that ladder higher and higher, nor should they be, but that there is such a ladder in existence seems to me incontrovertible. The men and women at Princeton who created the first American computer (von Neumann, Arthur Burks and others) weren’t just “different” I would argue, but were “better” at thinking than most of us.
  
  LikeLike

James Harvey says:

August 20, 2013 at 2:28 pm

It’s about time this conversation was launched. PISA results have been accepted uncritically by the press and public leaders worldwide, despite severe reservations about what lies behind these numbers on the part of serious scholars like Martin Carnoy of Stanford. In Europe last year, I found not a single national education official who had anything good to say about PISA, but the politicians fell all over themselves endorsing its results.

Kreiner from the University of Copenhagen is quite correct about the Rasch model and differential item functioning. Language conventions change, even in the same language, never mind the dozens of different languages tested in PISA. If I ask you to get a “lift” in the U.S., I”m asking you to get a ride in a car (or maybe a heel insert), but in Great Britain, I’m asking you to take an elevator.

Differences in school populations and sampling issues remain unexplored in the public discussion, despite the fact that it is clear the United States has the highest proportion of low-income students in the developed world and that serious questions have been raised about the 2010 PISA results, which apparently significantly over-sampled low-income urban students in the United States while simultaneously severely undersampling low-income, migrant students in Shanghai, which operates a sort of modern-day Jim Crow system to exclude migrants.

To all of these criticisms, PISA, a profit center for the Organization for Economic Cooperation and Development, based in France, responds with smooth evasions and bland denials. The development of this international scholarly criticism is long overdue.

TC says:

August 20, 2013 at 6:26 pm

I don’t know about all the statistical methodology and technicalities…but I do understand that saying that the PISA proves that US students are learning less or doing worse than other countries is overall FALSE…IF…you breakdown how US students do according to the level of poverty found at their schools. Students that attend schools with less than 10% poverty score #1 in the world….10 – 25% they score #4…25-50% they score #10….beyond that (and the US is the only developed nation to have schools with populations of such high poverty) is where the US is FAILING and focus and change needs to occur….and the answer is NOT more testing and school closings!

Lisa Smith says:

August 20, 2013 at 11:01 pm

Don’t worry about sending it to Obama or Duncan; since they are “monitoring” you, they already have it.

William Knaak says:

August 21, 2013 at 12:33 am

Bill Knaak: Teach The Best and Stomp The Rest. For this writer, there are no surprises here. These international comparison research studies never had any juried research validity, and to my knowledge were never reported as such in any professional journal in the United States. Participation dumped validity in myriads of ways. Russia, a nation of hundreds of languages, tested only Russian-speaking schools, Italy lopped of whole provinces. Norwegian and Swedish students who had the highest scores in physics had taken physics for three years. The American academics who took the test had only one year of physics and those in the general track had a year of general science. Twenty-three percent of the items in the advanced math test assumed the student had passed a calculus course. Americans who had completed a calculus course averaged at the 50th percentile. Those who had not completed calculus were at the 16th percentile.
Many other nations do not have the American mainstream culture of self-criticism. In some nations, criticism can get you jailed, tortured or executed. When these “studies” were published, the American media had already been turned against public education. They glommed on to this “news” as research evidence of American school failure, in the expression of the late George Bracey, “like a pack of carnivores all trying to gnaw on the same bone.” The failure issue was and is political with both parties and the corporate think-tanks using the using the “data” to present the “dark side.”

paceni says:

August 30, 2013 at 5:14 pm

Diane
The following is a paper by Dr Hugh Morrison of Queen’s University Belfast, Northern Ireland which started the TES on their recent publications about the flaws in the OECD PISA use of the Rasch model. Please take the time to read it and perhaps pass it on to any of your contacts. Perhaps someone will be able to refute Dr Morrison’s claims based on a mathematical answer. So far, Andreas Schleicher and the entire PISA Consortium have been unable to do so. The implications for PISA activity in US schools are frightening since the model they use id “fundamentally flawed”.

A fundamental conundrum in psychology’s standard model of measurement and its consequences for PISA global rankings.

Dr. Hugh Morrison
The Queen’s University of Belfast
( h.morrison@qub.ac.uk )

Introduction

This paper is concerned with current approaches to measurement in psychology and their use by organisations like the Organisation for Economic Co-operation and Development (OECD) to hold the education systems of nation states to “global” standards. The OECD’s league table – the Programme for International Student Assessment (PISA) – has the potential to throw a country’s education system into crisis. For example, Ertl (2006) documents the effects of so-called “PISA-shock” in Germany, and Takayama (2008) describes a similar reaction in Japan. Given that a country’s PISA ranking can play a role in decisions concerning foreign direct investment, it is important to confirm that the measurement model which produces the ranks is sound. Moreover, the OECD has already spread its remit beyond the PISA league table to include teacher evaluation through its Teaching and Learning International Survey (TALIS). The OECD is currently developing PISA-like tests to facilitate global comparisons of the education on offer in universities through its Assessment of Higher Education Learning Outcomes (AHELO) programme: “Governments and individuals have never invested more in higher education. No reliable international data exists on the outcomes of learning: the few studies that exist are nationally focused” (Rinne & Ozga, 2013, p. 99). Given the sheer global reach of the OECD project, it is important to investigate the coherence of the measurement model which underpins its data.

At the heart of 21st century approaches to measurement in psychology is the Generalised Linear Item Response Theory (GLIRT) approach (Borsboom, Mellenbergh and Van Heerden, 2003, p. 204) and the OECD uses Item Response Theory (IRT) to generate its PISA ranks. A particular attraction of IRT for the OECD is its claim that estimates of examinee ability are item-independent. This is vital to PISA’s notion of “plausible values” because each examinee only takes a subset of items from the whole “item battery.” Without the Rasch model’s claim to item-independent ability measures, PISA’s assertion that student performance can be reported on common scales, even when these students have taken different subsets of items, would be invalid.

This paper will focus on the particular IRT model used by OECD, the so-called Rasch model, but the arguments generalise to all IRT models. Proponents of the model portray Rasch as closing the gap between psychological measurement and measurement in the physical sciences. Elliot, Murray and Pearson (1978, pp. 25-26) claim that “Rasch ability scores have many similar characteristics to physical measurement” and Wright (1997, p. 44) argues that the arrival of the Rasch model means that “there is no methodical reason why social science cannot become as stable, as reproducible, and hence as useful as physics.” This paper highlights the incoherence of the model.

The Rasch model and its paradox

The Rasch model is defined as follows:

P(X_is=1 ┤| θ_(s,) β_i)= e^((θ_s-β_i))/(1+ e^((θ_s-β_i)) )

X_is is the response (X) made by subject s to item i;

θ_(s )is the trait level of subject s;

β_i is the difficulty of item i; and

X_is=1 indicates a correct response to the item.

On the face of it, the model uses a mathematical function to allow the psychometrician to compute the probability that a randomly selected individual of ability θ will provide the correct response to an item of difficulty β. A particular ability and difficulty value will be chosen for illustration, but the analysis which follows has universal application. When the values θ = 1 and β = 2, for example, are substituted in the Rasch model, a scientific calculator will quickly confirm that the probability that an individual of ability θ = 1 will respond correctly to an item of difficulty β = 2 is given as 0.27 approximately. It follows that if a large sample of individuals, all with this same ability, respond to this item, 27% will give the correct response.

In the Rasch model “the abilities specified in the model are the only factors influencing examinees’ responses to test items” (Hambleton, Swaminathan & Rogers, 1991, p. 10). This results in a paradox. If a large sample of individuals of exactly the same ability respond to the same item, designed to measure that ability, why would 27% get it right and 73% get it wrong? If the item measures ability and the individuals are all of equal ability, then surely the model must indicate that they all get it right, or they all get it wrong?

Does the Rasch model really represent an advance on classical test theory?

The Rasch model is portrayed as a radical advance on what went before – classical test theory (CTT). In classical test theory, “[p]erhaps the most important shortcoming is that examinee characteristics and test characteristics cannot be separated: each can be interpreted only in the context of the other. The examinee characteristic we are interested in is the ‘ability’ measured by the test” (Hambleton, Swaminathan & Rogers, 1991, p. 2).

An examinee’s ability is defined only in terms of a particular test. When the test is “hard,” the examinee will appear to have low ability; when the test is “easy,” the examinee will appear to have higher ability. What do we mean by “hard” and “easy” tests? The difficulty of a test item is defined as ‘the proportion of examinees in a group of interest who answer the item correctly.’ Whether an item is hard or easy depends on the ability of the examinees being measured, and the ability of the examinees depends on whether the items are hard or easy! (Hambleton, Swaminathan & Rogers, 1991, pp. 2-3)

Measures of ability in the Rasch model, on the other hand, are claimed to be completely independent of the items used to measure such abilities. This is vital to the computation of plausible values because no student answers more than a fraction of the totality of PISA items.

A puzzle emerges immediately: if the Rasch model treats as separable what classical test theory treats as profoundly entangled – with Rasch regarded as a significant advance on classical test theory – why does the empirical data not reflect two radically different measurement frameworks? Based on large scale comparisons of item and person statistics, Fan (1998) notes: “These very high correlations indicate that CTT- and IRT-based person ability estimates are very comparable with each other. In other words, regardless of which measurement framework we rely on, the same or very similar conclusions will be drawn regarding the ability levels of individual examinees” (p. 8), and concludes: “the results here would suggest that the Rasch model might not offer any empirical advantage over the much simpler CTT framework” (p. 9). Fan (1998) confirms Thorndike’s (1962, p. 12) pessimism concerning the likely impact of IRT: “For the large bulk of testing, both with locally developed and standardized tests, I doubt that there will be a great deal of change. The items that we select for a test will not be much different, and the resulting tests will have much the same properties.”

In what follows, the case is made that in the Rasch model, just as in Classical Test Theory, ability cannot be separated from the item used to measure it. Rasch’s model is shown to be incoherent and this has clear consequences for the entire OECD project. Moreover, the arguments presented here undermine psychology’s “standard measurement model” (Borsboom, Mellenbergh & van Heerden, 2003) with implications for all IRT models and Structural Equation Modelling.

The Rasch model: early indications of incoherence

The first hints of Rasch’s confusion appear in the early pages of his 1960 treatise which sets out the Rasch model, Probabilistic Models for Some Intelligence and Attainment Tests. Rasch’s lifelong obsession – captured in his closely associated notions of “models of measurement” and “specific objectivity” – with measurement models capable of application to the social and natural sciences can be recognized in his portrayal of the Rasch model. In constructing his model Rasch (1960, p. 10) rejects deterministic Newtonian measurement for the indeterminism of quantum mechanics:

For the construction of the models referred to I shall take recourse to some points of view … of a more general character. Into the system of classical physics enter a number of fundamental laws, e.g. the Newtonian laws. … A characteristic property of these laws is that they are deterministic. … None the less it should not be overlooked that the laws do not give an accurate picture of nature. … In modern physics … the deterministic view has been abandoned. No deterministic description for e.g. radioactive emission seems within reach, but for the description of such irregularities the theory of probability has proved an extremely valuable tool.

Rasch (1960, p. 11) likens the unmeasured individual to a radioactive nuclide about to decay. Quantum mechanics teaches that, unlike Newtonian mechanics, if one had complete information about the nuclide, one still couldn’t predict the moment of decay with accuracy. Indeterminism is a constitutive feature of quantum mechanics: one cannot know, even if one had complete knowledge of the universe, what will happen next to a quantum system. Irreducible uncertainty applies. For Rasch (1960, p. 11): “Where it is a question of human beings and their actions, it appears quite hopeless to construct models which will be useful for purposes of prediction in separate cases. On the contrary, what a human being actually does seems quite haphazard, none less than radioactive emission.” Rasch (1960, p. 11) makes clear his rejection of deterministic Newtonian models: “This way of speaking points to the possibility of mapping upon models of a kind different from those used in classical physics, more like the models in modern physics – models that are indeterministic.”

Quantum indeterminism has implications for Rasch’s “models of measurement.” In quantum mechanics, measurement doesn’t simply produce information about some pre-existing state. Rather, measurement transforms the indeterminate to the determinate. Measurement causes what is indeterminate to take on a determinate value. In the classical model which Rasch rejects, measurement is simply a process of checking up on what pre-existed the act of measurement, while quantum measurement causes the previously indeterminate to take on a definite value. However, latent variable theorists in general, and Rasch in particular, treat “ability” as an intrinsic attribute of the person, and they view measurement as an act of checking up on that attribute.

The early pages of Rasch’s (1960) text raise doubts about his understanding of the central mathematical conceit of his model: probability. One gets the clear impression that Rasch associates probability with indeterminism. But completely determinate situations can involve probability. The outcome of the toss of a coin is completely determined from the moment the coin leaves the thrower’s hand. If one had knowledge of the initial speed of projection, the angle of inclination of the initial motion to the horizontal, the initial angular momentum, the local acceleration of gravity, and so on, one could use Newtonian mechanics to predict the outcome. Probability is invoked because of the coin-thrower’s ignorance of these parameters. Such probabilities are referred to as subjective probabilities.

In modern physics, uncertainty is constitutive and not a consequence of the limitations of human beings or their measuring instruments. Quantum physicists deal in objective probability. Finally, the notion of separability or “specific objectivity” as Rasch labelled it, is absolutely central to his thinking: “Rasch’s demand for specific objective measurement means that the measure of a person’s ability must be independent of which items were used” (Rost, 2001, p. 28). However, quantum mechanics is founded on non-separabilty; one cannot break the conceptual link between what is measured and the measuring instrument. The mathematics of the early pages of Rasch (1960) do not auger well for the mathematical coherence of his model, but it is important to set out the case against the model with greater rigour.

Bohr and Wittgenstein: indeterminism in psychological measurement

A possible source of Rasch’s efforts to find “models of measurement” which would apply equally to both psychometric measurement and measurement in physics was the writings of Rasch’s famous countryman, Niels Bohr. (Indeed, Rasch attended lecture courses in mathematics given by the great physicist’s brother.) Bohr argued for all of his professional life that there existed a structural similarity between psychological predicates and the attributes of interest to quantum physicists. Although he never published the details, he believed he had identified an “epistemological argument common to both fields” (Bohr, 1958, p. 27). For Bohr, no psychologist has direct access to mind just as no physicist has direct access to the atom. Both disciplines use descriptive language which was developed to make sense of the world of direct experience, to describe what cannot be available to direct experience. Bohr summarized this common challenge in the question, “How does one use concepts acquired through direct experience of the world to describe features of reality beyond direct experience?”

Given the central preoccupation of this paper, Bohr’s words are particularly striking: “I want to emphasize that what we have learned in physics arose from a situation where we could not neglect the interaction between the measuring instrument and the object. In psychology, we meet the quite similar situation” (Favrholdt, 1999, p. 203). Also, prominent psychologists echo Bohr’s thinking: “The study of the human mind is so difficult, so caught in the dilemma of being both the object and the agent of its own study, that it cannot limit its inquiries to ways of thinking that grew out of yesterday’s physics” (Bruner, 1990, p. xiii). Given that Bohr never developed his ideas for the epistemological argument common to both fields, what follows also addresses en passant a lacuna in Bohr scholarship.

If all this sounds fanciful (after all, what possible parallels can be drawn between Rasch’s radionuclide on the point of decaying and an individual on the point of answering a question?) it is instructive to return to Rasch’s (1960, p. 11) claim that “what a human being does seems quite haphazard, none less than radioactive emission.” In fact there are striking parallels between the experimenter’s futile attempts to predict the moment of decay and the psychometrician’s attempts to predict the child’s response to a (hitherto unseen) addition problem such as “68 + 57 = ?”

If one restricts oneself to all of the facts about the nuclide, the outcome is completely indeterminate. Similarly, Wittgenstein’s celebrated rule-following argument (central to his philosophies of mind, mathematics and language), set out in his Philosophical Investigations, makes clear that if one restricts oneself to the totality of facts (inner and outer) about the child, these facts are in accord with the right answer (68 + 57 = 125) and an infinity of wrong answers. Mathematics will be used for illustration but the reasoning applies to all rule-following. The reader interested in an accessible exposition of this claim is directed to the second chapter of Kripke’s (1982) Wittgenstein on Rules and Private Language. (The reader should come to appreciate the power of the rule-following reasoning without being troubled by Kripke’s questionable take on the so-called skeptical argument.) The author will now attempt the barest outlines of Wittgenstein’s writing on rule-following .

By their nature, human beings are destined to complete only a finite number of arithmetical problems over a lifetime. The child who is about to answer the question “68 + 57 = ?” for the first time has, of necessity, a finite computational history in respect of addition. Through mathematical reasoning which dates back to Leibniz, this finite number of completed addition problems can be brought under an infinite number of different rules, only one of which is the rule for addition. In short, any answer the child gives to the problem can be demonstrated to be in accord with a rule which generates that answer and all of the answers the child gave to all of the problems he or she has tackled to date. If one had access to the totality of facts about the child’s achievements in arithmetic, one couldn’t use these facts to predict the answer the child will give to the novel problem “68 + 57 = ?” because one can always derive a rule which generates the child’s entire past problem-solving history and any particular answer to “68 + 57 = ?”

Now what of facts concerned with the contents of the child’s mind? Surely an all-seeing God could peer into the child’s mind and determine which rule was guiding the child’s problem-solving? By substituting the numbers 68 and 57 into the rule, God could predict with certainty the child’s response. Alas, having access to inner facts (about the mind or brain) won’t help because having a rule in mind is neither sufficient nor necessary for responding correctly to mathematical problems. Is having a rule in mind sufficient? Clearly not since all pupils taking GCSE mathematics, for example, have access to the quadratic formula and yet only a fraction of these pupils will provide the correct answer to the examination question requiring the application of that formula. Is having the rule in mind necessary? Once again, clearly not because one can be entirely ignorant of the quadratic formula and yet produce the correct answers to algebraic problems involving quadratics using alternative procedures like “completing the square,” graphical methods, the Newton-Raphson procedure, and so on.

It is important to be clear what is being said here. If one could identify an addition problem beyond the set of problems Einstein had completed during his lifetime, is the claim that one couldn’t predict with certainty Einstein’s response to that problem? Obviously not. But the correct answer and an infinity of incorrect answers are in keeping with all the facts (inner and outer) about Einstein. When one is restricted to these facts, Einstein’s ability to respond correctly is indeterminate. In summary, before the child answers the question “68 + 57 = ?” his or her ability with respect to this question is indeterminate. The moment he or she answers, the child’s ability is determinate with respect to the question (125 is pronounced correct, and all other answers are deemed incorrect). One might portray this as follows: before responding the child is right and wrong and, at the moment of response, he or she is right or wrong.

The problem with the Rasch model

Ability only becomes determinate in context of a measurement; it’s indeterminate before the act of measurement. The conclusion is inescapable – ability is a relational property rather than something intrinsic to the individual, as psychology’s standard measurement model would have it. A definite ability cannot be ascribed to an individual prior to measurement. Ability is a joint property of the individual and the measurement instrument; take away the instrument and ability becomes indeterminate. It is difficult to escape the conclusion that ability (and intelligence, and self-concept, and so on) is a property of the interaction between individual and measuring instrument rather than an intrinsic property of the individual. If psychological constructs were viewed as joint properties of individuals and measuring instruments, then intractable questions such as “what is intelligence?”, “what is memory?” need no longer trouble the discipline.

What can be concluded in respect of Rasch? It is clear that the Rasch model is no more capable of separating ability from the item used to measure it than was its predecessor, classical test theory. Pick up any textbook on IRT and one finds the same assumption stated again and again in model development: individuals carry a determinate ability with them from moment to moment and measurement involves checking up on that ability. The ideas of Bohr and Wittgenstein can be used to reject this; for them, measurement effects a “jump” from the indeterminate to the determinate, transforming a potentiality to an actuality.

In simple terms it can be argued that ability has two facets; it is indeterminate before measurement and determinate immediately afterwards. The single description of the standard measurement model is replaced by two mutually exclusive descriptions. Ability is indeterminate before measurement and only determinate with respect to a measurement context. Neither of these descriptions can be dispensed with. The indeterminate and the determinate are mutually exclusive facets of one and the same ability.

Returning to the child who has been taught to add but hasn’t yet encountered the question “68 + 57 = ?” what can be said of his or her ability with respect to this question? When one ponders ability as a thing-in-itself, it’s tempting to think of it as something inner, something that resides in the child prior to being expressed when the child answers. If ability is to be found anywhere, surely it’s to the unmeasured mind one should look? Isn’t it tempting to think of it as something the child “carries” in his or her mind? When the focus is on ability as a thing-in-itself, it seems the child’s eventual answer to the question is somehow inferior; it’s the mere application of the child’s ability rather than the ability itself.

The concept of causality in classical physics is replaced by the notion of “complementarity” in quantum mechanics. Complementarity treats pre-measurement indeterminism and the determinate outcome of measurement as non-separable. Whitaker (1996, p. 184) portrays complementarity as “mutual exclusion but joint completion.” One cannot meaningfully separate the pre-measurement facet of ability from its measurement-determined counterpart. The analogue of Bohr’s complementarity is what Wittgensteinians refer to as first-person/third-person asymmetry. The first-person facet of ability (characterised by indeterminism) and the third-person measurement perspective cannot be meaningfully separated. Suter (1989, pp. 152-153) distinguished the first-person/third-person symmetry of Newtonian attributes from the first-person/third-person asymmetry of psychological predicates: “This asymmetry in the use of psychological and mental predicates – between the first-person present-tense and second- and third-person present-tense – we may take as one of the special features of the mental.” Nagel (1986, p. 22) notes: “the conditions of first-person and third-person ascription of an experience are inextricably bound together in a single public concept.”

This non-separability of first-person and third-person perspectives obviates the need to conclude, with Rasch, that the individual’s response need be “haphazard.” The first-person indeterminism detailed earlier seems to indicate that individuals offer responses entirely at random. After all, the totality of facts is in keeping with an infinity of answers, only one of which is correct. But one need only infer “random variation located within the person” (Borsboom, 2005, p. 55) if one mistakenly treats the first-person facet as separable from the third-person. (The author’s earlier practice of stressing the restriction to the totality of facts about the individual was intended to highlight this taken-for-granted separability.) Lord’s (1980) admonition that item response theorists eschew the “stochastic subject” interpretation for the “repeated sampling” interpretation led IRT practitioners astray by purging entirely the first-person facet from an indivisible whole. One only arrives at conclusions that are “absurd in practice” (p. 227) if one follows Lord (1980) and divorces ability from the item which measures it. Like Rasch, Lord failed to grasp that the within-subject and the between-subject aspects of psychological measurement are profoundly entangled.

Holland, Lord and the ensemble interpretation as the route out of paradox

Holland (1990) repeats Lord’s error by eschewing the stochastic subject interpretation for the random sampling interpretation, despite acknowledging “that most users think intuitively about IRT models in terms of stochastic subjects” (p. 584). The stochastic subject rationale traces the probabilities of the Rasch model to randomness in the individual subject:

Even if we know a person to be very capable, we cannot be sure that he will solve a certain difficult problem, not even a much easier one. There is always a possibility that he fails – he may be tired or his attention is led astray, or some other excuse may be given. And a person of slight ability may hit upon the correct solution to a difficult problem. Furthermore, if the problem is neither “too easy” nor “too difficult” for a certain person, the outcome is quite unpredictable. (Rasch, 1960, p. 73)

Rasch is proposing what quantum physicists call a “local hidden variables” measurement model. While Wittgenstein argues that ability is indefinite before the act of measurement (an act which effects a” jump” from indefinite to definite), psychometricians in general and Rasch in particular, treat ability as definite before measurement. The local hidden variables of the Rasch model are variables such as examinee fatigue, degree of distraction, and any other influence militating against his or her capacity to provide a correct answer. Rasch is suggesting that if one had complete information concerning the examinee’s ability, his or her level of fatigue, propensity for distraction, and so on, one could predict, in principle, the examinee’s response with a high degree of confidence. It is the absence of variables capable of capturing fatigue, attention, and so on, from the Rasch algorithm, that makes its probabilistic nature inevitable. In this local hidden variable model, probability is being invoked because of the measurer’s ignorance of the effects of fatigue, attention loss, and so on.

But Bell (1964) proved beyond doubt that local hidden variables models are impossible in quantum measurement. One can avoid the difficulties thrown up by Bell’s celebrated inequalities by treating unmeasured predicates as indefinite (Fuchs, 2011). This would have profound implications for how one conceives of latent variables in the Rasch model. If local hidden variables are ruled out, latent variables could not be assigned investigation-independent values. Ability only takes on a definite value in a measurement context. IRT can no more separate these two entities (ability and the item used to measure it) than could classical test theory. The “random sampling” approach that Holland (1990) recommends is a so-called “ensemble” interpretation. The definitive text on ensembles – Home and Whitaker (1992) – finds ensembles illegitimate because they mistakenly replace “superpositions” by “mixtures” (Whitaker, 2012, p. 279).

One gets the distinct impression from the IRT literature that the random sampling method is being urged on the field because of embarrassments that lurk in the stochastic subject model. For example Lord (1980, p. 228) refers to the later as “unsuitable”:

The trouble comes from an unsuitable interpretation of the practical meaning of the item response function … If we try to interpret Pi(A) as the probability that a particular examinee A will answer a particular item i correctly, we are likely to reach absurd conclusions. (Lord, 1980, p. 228)

Lord (1980) and Holland (1990) both attempt to avoid embarrassment by taking the simple step of ignoring the stochastic subject for the comfort of an ensemble interpretation. Home and Whitaker (1992) close their text with the words: “[W]e see the ensemble interpretation as the “comfortable” option, creating the illusion that all difficulties may be removed by taking one simple step” (p. 311).

What of the paradox identified earlier?

It is now possible to address the paradox presented earlier. Here is a restatement: If a large sample of individuals of exactly the same ability respond to the same item, designed to measure that ability, why would 27% get it right and 73% get it wrong? Suppose a large number of individuals answer a question (labelled Q1), and, of those who give the correct answer, 100 individuals, say, are posed a second question (labelled Q2). When these 100 individuals respond to Q2, 27% give the correct answer and 73% respond with the wrong answer. What can be said about the ability of each individual immediately after answering Q1 but before answering Q2? Given the natural tendency to think of ability as an attribute of mind, it seems reasonable to focus on the individual’s ability “between questions” as it were.

Poised between questions, each individual’s ability with respect to Q1 is determinate; they have answered Q1 correctly moments before. What of their ability with respect to Q2, the question they have yet to encounter? According to the reasoning presented above, all the facts are in keeping with both a correct and an incorrect answer. The individual’s ability relative to Q2 is indeterminate. Quantum mechanics portrays such states as “superpositions” – the individuals all have the same indefinite ability characterised as: “correct with probability 27% and incorrect with probability 73%.” It is easy to see why 100 individuals each with an ability characterised in this way could be portrayed as subsequently producing 27 correct responses and 73 incorrect responses to Q2.

In this approach the paradox dissolves. All 100 individuals have definite abilities (as measured by Q1), but only 27% go on to answer Q2 correctly. But note the crucial step in the logic required to dissolve the paradox: each individual’s ability is simultaneously determinate with respect to Q1 and indeterminate with respect to Q2. A change in question (from Q1 to Q2) effects a radical change from indeterminate to determinate. It is therefore only meaningful to talk about a definite ability in relation to a measurement context. Ability is a joint property of the individual and the item; pace Rasch they cannot be construed as separable! It follows therefore that the examiner (the person who selects the item) participates in the ability manifest in a response to that item. Pace Rasch measurement in education and psychology is a more dynamic affair than measurement in classical physics. The former is dynamic while the latter is merely a matter of checking up on what’s already there. Because that which is measured is inseparable from the question posed, the measurer participates in what he or she “sees.” Newtonian detachment is as unattainable in psychology and education as it is in quantum theory.

Conclusion

Returning to the real life consequences of this refutation of latent variable modelling in general and Rasch modelling in particular, one cannot escape the conclusion that the OECD’s claims in respect of its PISA project have scant validity given the central dependence of these claims on the clear separability of ability from the items designed to measure that ability.

References

Bell, J.S. (1964). On the Einstein-Podolsky-Rosin paradox. Physics, 1, 195-200.
Bohr, N. (1929/1987). The philosophical writings of Niels Bohr: Volume 1 – Atomic theory and the description of nature. Woodbridge: Ox Bow Press.
Bohr, N. (1958/1987). The philosophical writings of Niels Bohr: Volume 2 – Essays 1933 – 1957 on atomic physics and human knowledge. Woodbridge: Ox Bow Press.
Borsboom, D. (2005). Measuring the mind: conceptual issues in contemporary psychometrics. Cambridge: Cambridge University Press.
Borsboom, D., Mellenbergh, G.J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110 (2), 203-219.
Bruner, J.S. (1990). Acts of meaning. Cambridge, MA: Harvard University Press.
Davies, E.B. (2003). Science in the looking glass. Oxford: Oxford University Press.
Davies, E.B. (2010). Why beliefs matter. Oxford: Oxford University Press.
Elliot, C.D., Murray, D., & Pearson, L.S. (1978). The British ability scales. Windsor: National Foundation for Educational Research.
Ertl, H. (2006). Educational standards and the changing discourse on education: the reception and consequences of the PISA study in Germany. Oxford Review of Education, 32(5), 619-634.
Fan, X. (1998). Item response theory and classical test theory: an empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58(3), 357-381.
Favrholdt, D. (Ed.). (1999). Niels Bohr collected works (Volume 10). Amsterdam: Elsevier Science B.V.
Fuchs, C.A. (2011). Coming of age with quantum information: Notes on a Paulian idea. Cambridge: Cambridge University Press.
Hacker, P.M.S. (1993). Wittgenstein, mind and meaning – Part 1 Essays. Oxford: Blackwell.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamental of item response theory. Newbury Park, CA: Sage Publications.
Hark ter, M.R.M. (1990). Beyond the inner and the outer. Dordrecht: Kluwer Academic Publishers.
Holland, P.W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55(4), 577-601.
Home, D., & Whitaker, M.A.B. (1992). Ensemble interpretation of quantum mechanics. A modern perspective. Physics Reports (Review section of Physics Letters), 210 (4), 223-317.
Jöreskog, K.G., & Sörbom, D. (1993). LISREL 8 user’s reference guide. Chicago: Scientific Software International.
Kalckar, J. (Ed.). (1985). Niels Bohr collected works (Volume 6). Amsterdam: Elsevier Science B.V.
Kripke, S.A. (1982). Wittgenstein on rules and private language. Oxford: Blackwell.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355-383.
Nagel, T. (1986). The view from nowhere. New York: Oxford University Press.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Paedagogiske Institut.
Rinne, R., & Ozga, J. (2013). The OECD and the global re-regulation of teacher’s work: Knowledge-based regulation tools and teachers in Finland. In T. Seddon & J.S. Levin Eds.), World yearbook of education (pp. 97-116). London: Routledge.
Rost, J. (2001). The growing family of Rasch models. In A. Boomsma, M.A.J. van Duijn, & T.A.B. Snijders (Eds.), Essays on item response theory (pp. 25-42). New York: Springer.
Sobel, M.E. (1994). Causal inference in latent variable models. In A. von Eye & C.C. Clogg (Eds.), Latent variable analysis (pp. 3-35). Thousand Oakes: Sage.
Suter, R. (1989). Interpreting Wittgenstein: A cloud of philosophy, a drop of grammar. Philadelphia: Temple University Press.
Takayama, K. (2008). The politics of international league tables: PISA in Japan’s achievement crisis debate. Comparative Education, 44(4), 387-407.
Thorndike, R.L. (1982). Educational measurement: Theory and practice. In D. Spearritt (Ed.), The improvement of measurement in education and psychology: Contributions of latent trait theory (pp. 3-13). Melbourne: Australian Council for Educational Research.
Whitaker, A. (1996). Einstein, Bohr and the quantum dilemma. Cambridge: Cambridge University Press.
Whitaker, A. (2012). The new quantum age. Oxford: Oxford University Press.
Wittgenstein, L. (1953). Philosophical Investigations. G.E.M. Anscombe, & R. Rhees (Eds.), G.E.M. Anscombe (Tr.). Oxford: Blackwell.
Wittgenstein, L. (1980a). Remarks on the philosophy of psychology Volume 1 (Edited by G.E.M. Anscombe & G.H. von Wright; translated by G.E.M. Anscombe). Oxford: Basil Blackwell.
Wittgenstein, L. (1980b). Remarks on the philosophy of psychology Volume 2 (Edited by G.H. von Wright & H. Nyman; translated by C.G. Luckhardt & M.A.E. Aue). Oxford: Basil Blackwell.
Wright, B.D. (1997). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-52
Wright, C. (2001). Rails to infinity. Cambridge, MA: Harvard University Press.

What if the International Tests Are Wrong?

21 Comments Post your own or leave a trackback: Trackback URL

Leave a comment Cancel reply

Search All Posts

Previous posts

Recent posts

Top posts

Follow blog via email

Follow blog via RSS reader

Blog Stats

What if the International Tests Are Wrong?

Diane Ravitch's Blog

21 Comments Post your own or leave a trackback: Trackback URL

Leave a comment Cancel reply

Search All Posts

Previous posts

Recent posts

Blog Topics

Top posts

Follow blog via email

Follow blog via RSS reader

Blog Stats