Edward H. Haertel is one of the nation’s premier psychometricians. He is Jacks Family Professor of Education Emeritus at Stanford University. I had the pleasure of serving with him on the National Assessment Governing Board, after I joined the board in 1997. He is wise, thoughtful, and deliberate. He understands the appropriate use and misuse of standardized testing.
He was invited by the Educational Testing Service to deliver the 14th William H. Angoff Memorial Lecture, which was presented at ETS on March 21, 2013, and at the National Press Club on March 22, 2013.
This lecture should be read by every educator and policymaker in the United States. Haertel explains the research on value-added models (VAM), which attempt to measure teacher quality by the rise or fall of student test scores, and shows why VAM should not be used to grade and rank teachers.
Haertel begins by pointing out that social scientists generally agree that “teacher differences account for about 10% of the variance in student test score gains in a single year.” Out-of-school factors account for about 60% of the variance; many other influences are unexplained variables.
Small though 10% may be, it is the only part of the influence that policymakers think they can directly affect, so many states have enacted policies to give bonuses or to administer sanctions based on student test scores. In Colorado, for example, policymakers have decided that the rise or fall of test scores counts for 50% of the teacher’s evaluation, which will determine tenure, pay, and retention or firing.
Haertel proceeds to demolish various myths associated with VAM, for example, the myth that the achievement gap would close completely if every child had a “top quintile” teacher or if every low-performing student had a top quintile teacher. He notes that “there is no way to assign all of the top-performing teachers to work with minority students or to replace the current teaching force with all top performers. The thought experiment cannot be translated into an actual policy.”
He notes other confounding variables: students are not randomly assigned to classrooms. Some teachers get classes that are easier or harder to teach. Changing the test will change the ratings of the teachers. The advocates of VAM routinely ignore the importance of peer effects, the peer culture of a school in which students “reinforce or discourage one another’s academic efforts.”
He adds: “In the real world of schooling, students are sorted by background and achievement through patterns of residential segregation, and they may also be grouped or tracked within schools. Ignoring this fact is likely to result in penalizing teachers of low-performing students and favoring teachers of high-performing students, just because the teachers of low-performing students cannot go as fast…Simply put, the net result of these peer effects is that VAM will not simply reward or penalize teachers according to how well or poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in.”
After a careful review of the current state of research, Haertel reaches this conclusion:
“Teacher VAM scores should emphatically not be included as a substantial factor with a fixed weight in consequential teacher personnel decisions. The information they provide is simply not good enough to use in that way. It is not just that the information is noisy. Much more serious is the fact that the scores may be systematically biased for some teachers and against others, and major potential sources of bias stem from the way our school system is organized. No statistical manipulation can assure fair comparisons of teachers working in very different schools, with very different students, under very different conditions. One cannot do a good enough job of isolating the signal of teacher effects from the massive influences of students’ individual aptitudes, prior educational histories, out-of-school experiences, peer influences, and differential summer learning loss, nor can one adequately adjust away the varying academic climates of different schools. Even if acceptably small bias from all these factors could be assured, the resulting scores would still be highly unreliable and overly sensitive to the particular achievement test employed. Some of these concerns may be addressed, by using teacher scores averaged across several years of data, for example. But the interpretive argument is a chain of reasoning, and every proposition in the chain must be supported. Fixing one problem or another is not enough to make the case.”
Please read this important paper. It is the most important analysis I have read of why value-added models do not work. Since Race to the Top has promoted the use of VAM, Haertel’s analysis demonstrates why Race to the Top is demoralizing teachers across the nation, why it is destabilizing schools, and why it will ultimately not only fail to achieve its goals but will do enormous damage to teachers, students, the teaching profession, and American education.
Please send this paper to your Governor, your mayor, your state commissioner of education, your local superintendent, the members of your local board of education, and anyone else who influences education policy.

This paper suggests that we should think about how we are allocating our spending on public education. If out-of-school factors are much more important for learning, perhaps there is a good argument to spend more of our education budget outside of the school.
On what? I’m all for after school programs and summer “school” if it can be done as a positive experience, not more test-prep misery.
We would need more details about the things that really matter. Perhaps it would be transferring resources from encouraging teachers to get graduate degrees to more comprehensive healthcare for students or following Oklahoma’s lead in providing universal public preschool.
Well, that’s happening in FL (and I think NC too). In FL, they didn’t mention where the money saved from not paying teachers for grad degrees will go, so cynic that I am I assume it’s lowering taxes for the rich, not helping kids at all.
So we fill their bellies, provide decent healthcare, stabilize housing, provide counseling, create jobs for their parents…and they will do better in school, but wait, we just used the school budget to do all those things, so I guess they will have to learn on their own. That’s the way we did it for thousands of years, right? Tell me again why it is the responsibility of the education system to solve all society’s ills. We don’t address those issues with education dollars now; I see little reason why education dollars should pay for health and human services in the future.
Spending the entire budget would be poor budgeting policy. I would suggest shifting a portion of the budget from things that matter less to things that matter more.
Again you ignore the main point. Why is it the responsibility of the education budget to fund health and human services? OVER TIME, we would hope the need for special education services would decline if we are meeting the out of school needs of children. As the need for special services decreases, funding will decrease.
I think the evidence about the importance of factors outside of school for students inside the school suggests that we should rethink what we call education spending.
If you want to reallocate spending within the traditional education budget, perhaps fewer resources should be devoted to encouraging teachers to get advanced degrees and instead be spent to increase preschool availability or lower class size.
Actually this is not a bad point except that it assumes there is “extra” in the system to reallocate. The fact is there isn’t enough money in the system to fund the “basics” of education properly let alone reallocate to outside programs.
“He understands the appropriate use and misuse of standardized testing.”
Diane (and/or anyone else who dares tread in that minefield), what are the “appropriate” uses of standardized testing?
If he understood what Wilson has shown about the epistemological and ontological logical errors and how they invalidate the whole process and cause harm to many of the most vulnerable in our society, the children/students then he for sure would understand the “misuse of standardized testing”.
“It is the most important analysis I have read of why value-added models do not work.”
THE MOST IMPORTANT ANALYSIS of why these models cannot work is Noel Wilson’s “Educational Standards and the Problem of Error” found at:
http://epaa.asu.edu/ojs/article/view/577/700
Since VAM is based on standardized test scores we need to ALWAYS keep in mind these words of wisdom from Wilson:
“It requires an enormous suspension of rational thinking to believe that the best way to describe the complexity of any human achievement, any person’s skill in a complex field of human endeavour, is with a number that is determined by the number of test items they got correct. Yet so conditioned are we that it takes a few moments of strict logical reflection to appreciate the absurdity of this.”
To understand why one should read Wilson’s work here is a brief summary. Be aware that there is far more in the original than what I have summarized here. If you can read and understand what Wilson is saying and still disagree I would like for you to contact me so that we can discuss your take on it.
Thanks,
Duane
Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine. (updated 6/24/13 per Wilson email)
1. A quality cannot be quantified. Quantity is a sub-category of quality. It is illogical to judge/assess a whole category by only a part (sub-category) of the whole. The assessment is, by definition, lacking in the sense that “assessments are always of multidimensional qualities. To quantify them as one dimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. In attempting to quantify educational standards and standardized testing we are lacking much information about said interactions.
2. A major epistemological mistake is that we attach, with great importance, the “score” of the student, not only onto the student but also, by extension, the teacher, school and district. Any description of a testing event is only a description of an interaction, that of the student and the testing device at a given time and place. The only correct logical thing that we can attempt to do is to describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student as it cannot be a description of the student but the interaction. And this error is probably one of the most egregious “errors” that occur with standardized testing (and even the “grading” of students by a teacher).
3. Wilson identifies four “frames of reference,” each with distinct assumptions (epistemological basis) about the assessment process from which the “assessor” views the interactions of the teaching and learning process: the Judge (think of a college professor who “knows” the students’ capabilities and grades them accordingly); the General Frame (think of standardized testing that claims to have a “scientific” basis); the Specific Frame (think of learning by objective, like computer-based learning, getting a correct answer before moving on to the next screen); and the Responsive Frame (think of an apprenticeship in a trade or a medical residency program where the learner interacts with the “teacher” with constant feedback). Each category has its own sources of error, and more error in the process is caused when the assessor confuses and conflates the categories.
4. Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”
In other words, all the logical errors involved in the process render any conclusions invalid.
5. The test makers/psychometricians, through all sorts of mathematical machinations, attempt to “prove” that these tests (based on standards) are valid, that is, errorless or supposedly at least with minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations and the test and results are. He is an advocate for the test taker, not the test maker. In doing so he identifies thirteen sources of “error,” any one of which renders the test making/giving/disseminating of results invalid. A basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.
6. Having shown the invalidity, and therefore the unreliability, of the whole process Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory”. In other words start with an invalidity, end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel who finds the occasional acorn, a result may be “true”) or to put in more mundane terms crap in-crap out.
7. And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”
In other words it measures “‘something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students, as the social rewards for some are not available to others who “don’t make the grade (sic).” Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?
My answer is NO!!!!!
One final note with Wilson channeling Foucault and his concept of subjectivization:
“So the mark [grade/test score] becomes part of the story about yourself and with sufficient repetitions becomes true: true because those who know, those in authority, say it is true; true because the society in which you live legitimates this authority; true because your cultural habitus makes it difficult for you to perceive, conceive and integrate those aspects of your experience that contradict the story; true because in acting out your story, which now includes the mark and its meaning, the social truth that created it is confirmed; true because if your mark is high you are consistently rewarded, so that your voice becomes a voice of authority in the power-knowledge discourses that reproduce the structure that helped to produce you; true because if your mark is low your voice becomes muted and confirms your lower position in the social hierarchy; true finally because that success or failure confirms that mark that implicitly predicted the now self-evident consequences. And so the circle is complete.”
In other words students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of the students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, the “I’m an ‘A’ student” is almost as harmful as “I’m an ‘F’ student” in hindering students from becoming independent, critical and free thinkers. And having independent, critical and free thinkers is a threat to the current socio-economic structure of society.
quote: “If he understood what Wilson has shown about the epistemological and ontological logical errors and how they invalidate the whole process and cause harm to many of the most vulnerable in our society, the children/students then he for sure would understand the “misuse of standardized testing”.
Duane, are you referring to his “psychometric fudge?” I loved that metaphor and I am still analyzing his work to fully understand it. thanks for the reminder
The psychometric fudges of which he writes are definitely one part of the problem. I am also currently rereading the whole work and have just finished the PF chapter. Chapters 10-15 are, for me, the most difficult to totally comprehend because I am not as familiar with the whole field of standardized testing/psychometrics. But each time I read it it gets “clearer” in my mind.
When I was learning to do upholstery we had a similar term. One had to know all the little “fudging” techniques to be able to get as perfect a job as possible when dealing with imperfect fabrics, paddings, wood, etc. . . . The difference being that this fudging was for the good and not meant to hide anything as Wilson’s psychometric fudges are.
We don’t need a psychometrician to convince people that VAM is useless as a teacher evaluation tool. Just read this FAQ sheet about VAM from Dade County to see that it’s a complete crap shoot and you have no control over 50% of your evaluation. Some teachers may have better odds in Vegas, since in many counties they are being evaluated on test scores of students they don’t teach or subjects they don’t teach. Any teacher in danger of being VAMmed must immediately click and read the entirety of this link (gosh, that sounds like SPAM, but we’re dealing with VAM, something equally lacking in substance and also an annoying waste of time). http://kafkateach.wordpress.com/2013/12/02/vam-speaks/
That’s good stuff kafkateach!
Problems with VAM any teacher could have told you about but nobody listens to us so here it is from a more “reliable” source.
http://www.hcea-esp.org/VAM%20facts.htm
Problems with the FL VAM
· Data for FCAT teachers – most VAM experts assert that, statistically-speaking, at least three years of data should be used to reduce errors so that teachers are not misidentified
o Florida school data gathered before 2010-11 (historic data) was gathered for other purposes and is not verifiable
o Districts verified that the teacher-student linkage is accurate in the 2010-11 data; i.e., the list of students in each class/course is associated with the correct teacher
· Data for non-FCAT teachers – There is minimal or NO subject area data available for approximately 60% of Florida’s teachers because they do not teach FCAT subjects in grades 4 thru 10.
o New teachers, even those who teach FCAT subjects, have no existing data associated with them.
o PreK-2 teachers, 11-12th grade teachers, PE, music, art, and technology teachers have no FCAT scores associated with their teaching position. Most districts have no tests and data are not yet available to follow student learning growth in any of these courses. For these teachers, school districts must either:
§ Develop data streams for each teacher using student learning growth or performance data on their own district developed tests;
§ OR, (if these district level tests do not exist or there is insufficient student data) create data streams from existing FCAT data tangentially related to a teacher’s position
For both FCAT and non-FCAT teachers, districts really only have one year of verifiable data. We can expect that, due to limited data, there will be random errors and teachers will be misclassified.
· Student attendance
o Is one of the variables outside a teacher’s control and is included in the formula to mitigate the impact on the teacher effect score
o Data is reported to the DOE using daily attendance not course attendance
o Students do miss particular courses consistently, and their attendance should be a mitigating factor in calculating their teachers’ VAM scores
The DOE does not have the student attendance data by course; consequently, teachers with students who consistently miss their course(s) will be disadvantaged in the teacher effect VAM calculation.
· The VAM formula calculation at school level
o Teachers’ VAM scores in high performing schools are likely to regress to the mean; it will be very difficult to identify high performing and/or low performing teachers at high performing schools
o Teachers in low performing schools are more likely to demonstrate the extremes in the VAM range and can be classified as Highly Effective and Unsatisfactory even if their actual teaching performance is equal to that of a mean-score teacher at a high performing school
o Districts are required to set ranges across the district even though the calculation is computed within the school, forcing an apples-to-oranges comparison
The VAM formula comparisons from school-to-school create an uneven field and an increased likelihood that classifications will be inconsistent with a teacher’s performance. The teacher-to-teacher comparison, especially for high stakes employment decisions, pits teacher against teacher and drives down any interest in collaboration.
· The four required classifications – Districts must determine cut scores between each category
o Without a middle category, teachers with similar teaching performance and scores that vary by one or two points can be assigned to completely different categories; e.g. Effective and Needs Improvement
Random error must be considered and gauged for all teachers with proximal scores near cut offs so that teachers are not misclassified or negatively affected due to statistical inaccuracies
kafkateach: you note near the end that VAM “pits teacher against teacher and drives down any interest in collaboration.”
Let me reinforce that point with a bit of firsthand experience. When I was a bilingual TA I worked with an outstanding teacher in an elementary school. She volunteered to take on all the hard cases in the school. She literally was the “rising tide that lifts all boats.” Administrators, teachers, aides and all other staff members appreciated what she did for the whole school. But trust me when I say she had a disproportionate share of those who today would be labeled “score suppressors.”
Under VAManiacal guidelines she would be penalized for doing an outstanding job of collaborating to make the school a much better learning environment.
Or to use the business lingo so much in vogue these days, she would be “disincentivized” to take on such students, to wit, she would have faced loss of her livelihood for doing such an outstanding job.
What sense does that make?
Oops, I forgot… When in pursuit of $tudent $ucce$$, VAMania makes a lot of ₵ent₵.
😎
address for Governor of MA
constituent.services@state.ma.us
Diane, I take you literally. I have been emailing the governor and I call the office to tell them the email is coming. So far I have been able to talk to two individuals and one person said to send it to her; that doesn’t mean any of it is read but at least they sit up and take notice. I agree that ETS has been helpful to Massachusetts (it started out in the 80s when we were first preparing tests called MEAP before MCAS). There are some in the state that would like to see ETS back in the lead instead of Pearson. They did essential field work in standard testing; I am pleased they held this Angoff presentation/lecture and they chose Haertel to give it (with his creativity and brilliance).
Error here: I meant STANDARD SETTING, not testing. They taught us the Nedelsky method and the Angoff method, and they went right into schools and worked with teachers and gave examples with the tests we were already using, like SRA, IOWA, Stanford achievement, etc. We did not need expensive, experimental computerized tests to get the concept, and it was understandable and did not come across as “elitist” but very pragmatic and realistic (we worked with 20 school districts in Merrimack Valley in MA, just outside Greater Boston).
The tragedy of this post is how powerless educators are in determining the goals and methods of their profession. I have read a number of journal articles by distinguished tests and measurement professionals who provide long lists of statistical problems with VAM that should, in any other profession, invalidate that measure. Instead we have the Secretary of Education supporting a method that within the profession is considered unreliable. How long would the Surgeon General remain in his job if he came out tomorrow and stated that, in his opinion, smoking is healthy for you?
Alan, I also have read many articles that address the “flaws” in VAM, including the “authors” of VAM stating that it should not be used for “high-stakes” decisions! The only logical decision would be to not use VAM but our policy makers refuse to succumb to this thought!
Could it possibly be that our policy makers have an influence outside of Education that is directing their decisions? ^o^
An endless frustration of mine as well. The Bill Gates types say they want evidence-based educational policy. I think at one point Gates even said “we know what works already”. However, they appear to be roundly ignoring the very educational science they purport to have read. So, yes, a “good teacher” does have a measurable impact on student outcomes. But bad policies can wash away this effect if you don’t understand the underlying mechanism for why a “good teacher” has this impact. Not to mention the fact that “a good teacher” explains a rather small portion of test scores and it is unclear how “good teachers” interact with the factors that affect test scores more.
Oh, and test scores do not necessarily correlate to the completely non-linear process of becoming an educated human being anyway.
School reform leaders don’t want to hear any of this. They’d rather lead by intuition, even though it happens to be faulty.
“So, yes, a “good teacher” does have a measurable impact on student outcomes.”
NO! They don’t have a “measurable” impact on student outcomes. They may have an impact but it is not “measurable”. The teaching and learning process belongs to the realm of aesthetics, which is immune to “measurement”. Measurement is a sub-category of quality and therefore cannot be substituted as a proxy for quality. The whole standards and standardized testing regime thrives on that logical sleight of hand.
We need to quit using the edudeformers’ discourse/vocabulary; it only reinforces the false meme.
Alan C. Jones: Great point! There is no evidence to support the use of VAM and overwhelming evidence that says it is damaging and invalid. Yet the U.S. Department of Education continues to shove it on districts and states through Race to the Top and waivers. At some point, there should be a Congressional investigation of malfeasance by the DOE. Maybe now. If there were any members of Congress paying attention.
Somehow we have to account for the experience the children are having during the day. End results mean nothing if children are not being properly and adequately serviced in terms of creative projects, well-rounded curriculum, healthful living (including not having unrealistic expectations about sitting still or understanding one to one correspondence for a bubble test before age 9).
School should be time well spent. This factor needs to enter the conversation.
I think you mean “served” and not “serviced”, eh!!
yes; I knew that didn’t sound right when I typed it but was blanking on the way to say it. woops.
albeit, we are running schools more like car factories, and cars get serviced
“Simply put, the net result of these peer effects is that VAM will not simply reward or penalize teachers according to how well or poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in.”
Here in NY there are also some teachers being rewarded or penalized using students that they DO NOT personally TEACH. There are also teachers being rewarded or punished using test scores that cover material that they DO NOT personally TEACH. Some of us are actually being measured based on the effectiveness of teachers. Will someone please help me get out of this rabbit hole? My GPS is still trying to “acquire sanity.”
VAM = SHAM
“. . . effectiveness of OTHER teachers.”
VAM = SHAM, kinda like SLAM, BAM, THANK YOU MAAM!
My read of the article is that Haertel executes a pretty surgical dismemberment of VAM as a significant input into any personnel decisions involving individual teachers. As Diane indicates, it deserves to be read in detail and widely disseminated. At the same time, he recognizes that a significant variance in performance of teachers exists. The two examples he cites from Hill, Kapitula and Umland (2010) are disturbing in that their in-depth study of teaching methods appears to have only involved 24 teachers. While Haertel makes the case for teacher performance evaluation, he offers no clearly superior approach.
How about simply emphasizing the facts on the ground? VAM means adding two tests per subject per year (to establish SGO’s at year’s beginning, then measure at year’s end). So: in a school culture which has already added multiple standardized tests per year as a result of NCLB & Common Core, you are adding 8 to 10 tests per year which have no other rationale than measuring teacher performance. Round out the discussion with a close look at SGO’s: teachers are told to devise goals at year’s beginning against which they will be measured at year’s end– thus encouraging teachers to set objectives low enough to ensure good results.
“teachers are told to devise goals at year’s beginning against which they will be measured at year’s end– thus encouraging teachers to set objectives low enough to ensure good results.”
I was wondering who set these goals for the vam…Who does.. r u saying the teachers? I wonder..
In my district teacher’s set their own goals, based on previous year data
Teachers not ‘s
SF&F:
Do you have an alternative approach?
Our students are doing surveys right now regarding their teachers. It counts for a very small % of the overall evaluation, but it’s still measured.
Is there any literature on the invalidity of such surveys?
Myles:
The issue is not with a well designed survey per se but with its interpretation. Can you find a link to the survey or post some of its questions?
I designed and analyzed organization/job satisfaction surveys for 35 years. There are ways to make these surveys much more useful and valid.
Bernie, thanks for responding.
As teachers we are not allowed to see the questions. Although I know one question regards whether it’s too noisy to learn in the classroom.
My school just finished the student survey….
Some example items….primary level
Some students do not behave in my class
I try my best to learn what my teacher teaches
We stay busy in this class and do not waste time
Some days I am sleepy in this class
It is okay to give up in this class if something is too hard
I could go on, being that there are over 35 items…..all questions have no, maybe, or yes choices. Last year I scored very low, apparently many of my students chose maybe and that counted as a negative response.
GPT:
The items are helpful in understanding the survey. Which grade answered these questions?
Some questions are double-barreled, which is wrong, e.g., “We stay busy in this class and do not waste time.” Obviously one can be busy but waste time.
Others are just bad items: “Some days I am sleepy in this class.”
Also the binary scale with a neutral response is very strange.
The length of the survey is problematic and I doubt that it is necessary. Presumably the items should be worded so as to be actionable by the teachers and then the teachers should receive their results at the item level – even here 35 strikes me as too many.
First grade; K uses the same format as well
GPT:
Good grief!! Too stupid for words.
Good luck.