Edward H. Haertel is one of the nation’s premier psychometricians. He is the Jacks Family Professor of Education, Emeritus, at Stanford University. I had the pleasure of serving with him on the National Assessment Governing Board after I joined the board in 1997. He is wise, thoughtful, and deliberate. He understands both the appropriate uses and the misuses of standardized testing.

He was invited by the Educational Testing Service to deliver the 14th William H. Angoff Memorial Lecture, which he presented at ETS on March 21, 2013, and at the National Press Club on March 22, 2013.

This lecture should be read by every educator and policymaker in the United States. Haertel explains the research on value-added models (VAM), which attempt to measure teacher quality by the rise or fall of student test scores, and shows why VAM should not be used to grade and rank teachers.

Haertel begins by pointing out that social scientists generally agree that “teacher differences account for about 10% of the variance in student test score gains in a single year.” Out-of-school factors account for about 60% of the variance; the remainder reflects other influences and unexplained variation.
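To make that decomposition concrete, here is a minimal simulation of my own, not Haertel’s model, assuming an illustrative 10/60/30 variance split and classes of 25 students. It shows that even in a best-case world of randomly assigned students, a single year of class-average gains is a noisy estimate of a teacher’s true effect:

```python
# A toy decomposition, not Haertel's model: assume teacher effects carry 10%
# of the variance in one-year gains, out-of-school factors 60%, and
# unexplained influences the remaining 30%, with classes of 25 students.
import numpy as np

rng = np.random.default_rng(0)
n_teachers, class_size = 1000, 25

def one_year_vam(teacher_effect):
    """Naive value-added score: the class-average gain for each teacher."""
    out_of_school = rng.normal(0, np.sqrt(0.60), (len(teacher_effect), class_size))
    unexplained = rng.normal(0, np.sqrt(0.30), (len(teacher_effect), class_size))
    gains = teacher_effect[:, None] + out_of_school + unexplained
    return gains.mean(axis=1)

teacher_effect = rng.normal(0, np.sqrt(0.10), n_teachers)  # the 10% share
vam_year1 = one_year_vam(teacher_effect)
vam_year2 = one_year_vam(teacher_effect)  # same teachers, new students

print("true effect vs. one-year score r:", round(np.corrcoef(teacher_effect, vam_year1)[0, 1], 2))
print("year-to-year stability r:        ", round(np.corrcoef(vam_year1, vam_year2)[0, 1], 2))
```

Even under these generous assumptions the scores wobble noticeably from year to year; and in real schools, as Haertel shows, the assignment of students is anything but random.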

Small though 10% may be, it is the only part of the variance that policymakers think they can directly affect, so many states have enacted policies to award bonuses or impose sanctions based on student test scores. In Colorado, for example, policymakers have decided that the rise or fall of test scores counts for 50% of a teacher’s evaluation, which will determine tenure, pay, and retention or firing.

Haertel proceeds to demolish various myths associated with VAM, for example, the myth that the achievement gap would close completely if every child had a “top quintile” teacher or if every low-performing student had a top quintile teacher. He notes that “there is no way to assign all of the top-performing teachers to work with minority students or to replace the current teaching force with all top performers. The thought experiment cannot be translated into an actual policy.”

He notes other confounding variables: students are not randomly assigned to classrooms. Some teachers get classes that are easier or harder to teach. Changing the test will change the ratings of the teachers. And the advocates of VAM routinely ignore the importance of peer effects, the peer culture of a school in which students “reinforce or discourage one another’s academic efforts.”

He adds: “In the real world of schooling, students are sorted by background and achievement through patterns of residential segregation, and they may also be grouped or tracked within schools. Ignoring this fact is likely to result in penalizing teachers of low-performing students and favoring teachers of high-performing students, just because the teachers of low-performing students cannot go as fast…. Simply put, the net result of these peer effects is that VAM will not simply reward or penalize teachers according to how well or poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in.”
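The bias Haertel describes is easy to reproduce in a toy model. In the sketch below (my own illustration, with made-up numbers), every teacher has exactly the same true effect, but classes differ systematically in out-of-school context, mimicking residential segregation and tracking. A naive value-added calculation still sorts these identical teachers into winners and losers:

```python
# A toy illustration with made-up numbers: every teacher has the SAME true
# effect (zero), but classes differ systematically in out-of-school context.
# A naive VAM still produces a spread of scores and crowns "top" teachers.
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 200, 25

school_context = rng.normal(0, 0.5, n_teachers)    # stable advantage or disadvantage per class
true_effect = np.zeros(n_teachers)                 # all teachers equally effective
student_noise = rng.normal(0, 1.0, (n_teachers, class_size))

gains = true_effect[:, None] + school_context[:, None] + student_noise
vam = gains.mean(axis=1)

top = vam >= np.quantile(vam, 0.8)  # the "top quintile" by VAM score
print("spread of VAM scores among identical teachers:", round(vam.std(), 2))
print("average context advantage of 'top' teachers:  ", round(school_context[top].mean(), 2))
```

The “top-quintile” teachers in this simulation are simply the ones teaching in the most advantaged contexts, which is exactly the systematic unfairness Haertel warns about.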

After a careful review of the current state of research, Haertel reaches this conclusion:

“Teacher VAM scores should emphatically not be included as a substantial factor with a fixed weight in consequential teacher personnel decisions. The information they provide is simply not good enough to use in that way. It is not just that the information is noisy. Much more serious is the fact that the scores may be systematically biased for some teachers and against others, and major potential sources of bias stem from the way our school system is organized. No statistical manipulation can assure fair comparisons of teachers working in very different schools, with very different students, under very different conditions. One cannot do a good enough job of isolating the signal of teacher effects from the massive influences of students’ individual aptitudes, prior educational histories, out-of-school experiences, peer influences, and differential summer learning loss, nor can one adequately adjust away the varying academic climates of different schools. Even if acceptably small bias from all these factors could be assured, the resulting scores would still be highly unreliable and overly sensitive to the particular achievement test employed. Some of these concerns may be addressed, by using teacher scores averaged across several years of data, for example. But the interpretive argument is a chain of reasoning, and every proposition in the chain must be supported. Fixing one problem or another is not enough to make the case.”
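His point about partial fixes can also be illustrated in code. In the sketch below (again my own, with assumed parameters), averaging scores over three years shrinks the random noise, but a stable school-context bias passes through the average untouched:

```python
# A toy check on one proposed fix, with assumed parameters: averaging three
# years of scores reduces random noise, but a stable school-context bias
# survives the averaging intact.
import numpy as np

rng = np.random.default_rng(2)
n_teachers, class_size, n_years = 1000, 25, 3

true_effect = rng.normal(0, np.sqrt(0.10), n_teachers)
school_context = rng.normal(0, 0.5, n_teachers)  # bias that persists across years

yearly_scores = [
    true_effect + school_context
    + rng.normal(0, 1.0, (n_teachers, class_size)).mean(axis=1)  # fresh student noise each year
    for _ in range(n_years)
]
averaged = np.mean(yearly_scores, axis=0)

print("1-year score vs. true effect r:  ", round(np.corrcoef(true_effect, yearly_scores[0])[0, 1], 2))
print("3-year average vs. true effect r:", round(np.corrcoef(true_effect, averaged)[0, 1], 2))
```

The averaged scores are barely more accurate than a single year’s, because the bias, unlike the noise, does not average away; that is Haertel’s point about the chain of reasoning, where fixing one problem is not enough to make the case.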

Please read this important paper. It is the most important analysis I have read of why value-added models do not work. Since Race to the Top has promoted the use of VAM, Haertel’s analysis demonstrates why Race to the Top is demoralizing teachers across the nation, why it is destabilizing schools, and why it will ultimately not only fail to achieve its goals but do enormous damage to teachers, students, the teaching profession, and American education.

Please send this paper to your governor, your mayor, your state commissioner of education, your local superintendent, the members of your local board of education, and anyone else who influences education policy.