One of the affidavits in the Lederman v. King trial was filed by psychologist Brad Lindell.
His full affidavit is included in this post, which contains all the affidavits.
He sent the following note to me to explain his view of VAM in layman’s terms:
I am Dr. Brad Lindell, one of the affiants in the Sheri Lederman case, and I was present at the oral arguments on Wednesday. It was truly something to observe. You got the feeling that good was going to come from the great work of Sheri and Bruce Lederman and from the experts' opinions toward changing this broken VAM system. You got the sense that the judge was listening to the science about VAM and not just to the political rhetoric.
Just want to fill you in on something that was presented in my affidavit, modified here to give a clear and understandable example of the effects of poor reliability, using the full-scale WISC intelligence test. If the same test-retest reliability found for teachers' yearly VAM scores (.40) were applied to the WISC full-scale score to determine the 90% confidence interval, the range would be ridiculously large.
Examples
If a student scored a full-scale IQ of 100 (average), then the 90% confidence interval would be 81 to 119. This indicates that there would be a wide range within which the scores from repeated administrations of the WISC would be expected to fall for this student. One could not have confidence in the validity of an intelligence test with such low reliability. Without adequate reliability, there cannot be validity. The same holds true for VAM scores, whose reliabilities have been found to be notoriously low.
The reliability of the WISC is generally in the .80 to .90 range. The 90% confidence intervals are generally in the +/- 6 range. So this same person with a full-scale IQ of 100 would have a 90% confidence range of 94 to 106. A much smaller range.
This is why reliability is so important: year-to-year VAM score reliabilities have repeatedly been shown to be low, on the order of .2 to .4. This is also why teachers' year-to-year VAM scores vary so considerably, as in the case of Sheri Lederman. Without reliability there cannot be adequate validity.
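For readers who want to check the arithmetic behind those intervals, here is a minimal sketch using the classical test-theory formula for the standard error of measurement, SEM = SD * sqrt(1 - reliability), with the IQ scale's standard deviation of 15. The intervals printed in the WISC manuals are built around estimated true scores and so come out somewhat narrower than this simple formula, but the way low reliability blows up the interval is the same.

```python
import math

def ci_90(observed, sd, reliability):
    """90% confidence interval around an observed score, using the classical
    standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1.0 - reliability)
    half_width = 1.645 * sem  # z-value for a two-sided 90% interval
    return observed - half_width, observed + half_width

# IQ scale: mean 100, standard deviation 15
for r in (0.40, 0.90):
    low, high = ci_90(100, 15, r)
    print(f"reliability {r:.2f}: 90% CI roughly {low:.0f} to {high:.0f}")

# reliability 0.40 -> roughly 81 to 119, the range cited in the affidavit
# reliability 0.90 -> roughly 92 to 108; the published WISC intervals are
# tighter still, but the comparison makes the point
```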

And, as we learned in basic measurement courses, reliability is a necessary but not sufficient condition of validity. A scale may be very accurate (reliable) in assessing a certain attribute, but not be a meaningful (valid) measure of another.
To take an obviously exaggerated example: a good bathroom scale provides very precise information about weight. But it would be unwise to take the readings obtained from a classroom of children and throw them into a model that claims to be an indicator of their teacher's performance.
In his discussion with Campbell Brown, New Jersey Governor Christie let this one go regarding how easy it is to decide which teachers should be fired, and separated from their students.
Why, it takes all of ten minutes!
(21:13 – 21:54)
CHRISTIE (to the parents):
“Let me ask you a question, ’cause there’s a lot of people out here who care about education. When you go to ‘Back To School Night’, is there ever a doubt in your mind within ten minutes of getting in that classroom, whether that’s a good teacher or a bad teacher? Ever?
“You’re either in there going, ‘It’s gonna be a good year,’
” … or you’re… ‘Oh God. This is going to be a problem.’
“You don’t need a PhD in education to understand this (i.e. decide which teachers should be fired). If we (parents) can figure it out in ten minutes, then why can’t we have a tenure system that holds teachers to account, and that has parents understanding that they (parents) can have an impact on that, too?”
——————–
Could you imagine a teacher saying the same thing… that a teacher can tell within ten minutes whether a parent is unfit, and thus should have their child taken away by Child Services?
TEACHER (to the teachers):
“Let me ask you a question, ’cause there’s a lot of people out here who care about education. When you go to “Back To School Night”, is there ever a doubt in your mind within ten minutes of meeting a parent whether that’s a good parent or a bad parent? Ever?
“You’re either in there going, ‘It’s gonna be a good year,’
” … or you’re… ‘Oh God. This is going to be a problem.’
“You don’t need a PhD in education to understand this (i.e. decide which parents should have their children taken away). If we (teachers) can figure it out in ten minutes, then why can’t we have a child and family services system that holds parents to account, and that has teachers understanding that they (teachers) can have an impact on that, too?”
——————–
Or teachers sizing up students on the first day of school, deciding who needs services, adjustments in services, interventions, or alternate programming for gifted students, all within the first ten minutes of meeting each one.
Christie shouldn’t speak in public. He sounds like a garden variety neighborhood blowhard. Really embarrassing he’s a gov.
Exactly… CONSTRUCT validity is critically important. The problem with VAM, and previously with using the same high-stakes exams to rate schools and students, is that while you can reliably do the math, it doesn't really mean you're measuring what you say you are measuring. And EVEN if you are measuring it in one school year, that doesn't mean you can compare PERCENTILE RANKS from year to year as a way to measure growth. Percentile ranks don't work that way. That would be a very dumb application of math.
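To make the percentile-rank point concrete, here is a minimal sketch that assumes, purely for illustration, a normal score distribution with mean 100 and SD 15: the same raw-score gain produces very different percentile-rank changes depending on where a student starts, which is why subtracting percentile ranks from one year to the next is a poor way to measure growth.

```python
import math

def percentile(score, mean=100.0, sd=15.0):
    """Percentile rank of a score under an (illustrative) normal distribution."""
    z = (score - mean) / sd
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The same 5-point raw gain, starting from two different places:
for start in (100, 130):
    before, after = percentile(start), percentile(start + 5)
    print(f"{start} -> {start + 5}: percentile rank {before:.0f} -> {after:.0f} "
          f"(a change of {after - before:.0f} points)")

# 100 -> 105 moves a student from about the 50th to the 63rd percentile,
# while 130 -> 135 moves a student only from about the 98th to the 99th.
```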
“Without adequate reliability, there cannot be validity”
And vice versa! Without adequate validity, there cannot be reliability.
See Noel Wilson’s review of the testing bible, “A Little Less than Valid: An Essay Review,” covering the testing standards put out by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2002, Washington, DC: American Educational Research Association). Found at:
Click to access v10n5.pdf
The more important of the two is validity. Reliability is nothing if the test is not valid, that is, if it does not assess what it purports to assess and rests on serious epistemological and ontological errors and falsehoods. Since that is the case with all standardized testing, well, let's just say using the results for anything is completely invalid and a total waste of time, except as an exercise in mental mathturbation (thanks, SDP).
Whereas standardized tests measure “something”, VAM measures nothing. It is little more than a really complicated random number generator.
Also, the WISC is not a curriculum-specific test. It is not designed for all-at-once administration to a classroom. Each new edition of the WISC has elaborate documentation for the test, so the buyer can beware. Also, WISC test scores are not recycled to make high-stakes judgments about the person who administered the test, usually a professional in psychology or an allied field.
Point: reliability issues are compounded by the fact that classroom tests are constructed from a large “bank” of items. The tests, say for grade 4 ELA, are NOT identical year to year over multiple years. The big fear in the school context is that cheating will occur or that there will be excessive teaching to the test. The infamous FCAT tests in Florida included items just for the convenience of field testing them on the cheap. The answers to those items were “not counted,” but they produced data that would be useful in designing a future test, arguably at the expense of the students.
I can appreciate the demonstration of what reliability means in the case of WISC but classroom tests of academic content and skill are a different animal, or fruit, or veggie.
None of the tests being used to construct VAM ratings are designed to be “instructionally sensitive” nor can they be made so with enough reliability to judge teachers.
My thought experiment on how to make an instructionally sensitive test is this.
Begin with the assumption that the student is a blank slate, has learned nothing in or out of school, enters a room with one desk and no other students. Then comes the teacher who is the sole source of everything that the student will learn in “a given interval of instruction.” The test is administered to the student and it becomes a measure of the teacher’s effectiveness in “writing on the student’s blank slate.” If there is doubt about the ability of the student to learn, then get an independent estimate of that ability using a test such as WISC and then tweak the judgment about the teacher to take that into account.
On some “standardized” tests, the questions can vary from one test to another in the same year.
In Utah, the tests supposedly get harder if the student answers the previous questions correctly. Because of that, NO student would take exactly the same test. Even on the writing portion, there are several different writing prompts for each grade, randomly given to each student. I keep wondering how Utah can supposedly “compare” schools to each other with that much volatility in the test. Yet, schools are graded, and starting this year, teachers are evaluated by the scores of these tests. I know that there’s no validity to these tests anyway (thanks Duane), but it makes even less sense doing the tests this way.
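For readers unfamiliar with how a computer-adaptive test of this kind behaves, here is a toy sketch; the item bank, difficulty levels, and up/down selection rule are invented for illustration and are not Utah's actual algorithm. It shows how two students sitting in the same room can end up answering largely different sets of questions.

```python
import random

# Hypothetical item bank: (item_id, difficulty), with difficulty 1 (easy) to 5 (hard).
ITEM_BANK = [(i, d) for d in range(1, 6) for i in range(d * 100, d * 100 + 10)]

def adaptive_test(answers_correctly, n_items=5, seed=None):
    """Toy adaptive test: step the difficulty up after a correct answer and
    down after a wrong one. Returns the list of items actually administered."""
    rng = random.Random(seed)
    difficulty = 3  # start in the middle of the difficulty range
    administered = []
    for _ in range(n_items):
        candidates = [item for item in ITEM_BANK
                      if item[1] == difficulty and item not in administered]
        item = rng.choice(candidates)
        administered.append(item)
        if answers_correctly(item):
            difficulty = min(5, difficulty + 1)
        else:
            difficulty = max(1, difficulty - 1)
    return administered

# Two students of different ability end up seeing different questions:
strong = adaptive_test(lambda item: True, seed=1)          # gets everything right
weak = adaptive_test(lambda item: item[1] <= 2, seed=1)    # only handles easy items
print("strong student saw items:", strong)
print("weak student saw items:  ", weak)
```

Real adaptive systems place those different item sets on a common scale, typically through item response theory, but the concern in the comment stands: no two students sit the same form, so school-to-school and year-to-year comparisons lean heavily on the scaling model rather than on identical tests.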
This is a good discussion; I am pleased that the data came out and were used before the judge to describe what VAM does to teachers. For several years (perhaps a decade or more) we have had Kevin McGrew describing what happens to the CHILDREN when these scores are interpreted. For the most recent write-up, see his description of Forrest Gump here: http://www.slideshare.net/iapsych/forrest-gump-and-iq-expectations It is heavy on statistics and numerals, but use the graphic presentation to get to the heart of the matter. When Kevin was writing at IAP he also had a good video of a race between a greyhound, a turtle, and other “animals” with different attributes that illustrates important concepts (for this “race” he used the RPI, which is another way of considering how differently students perform). In Oliver Sacks's recent book he talks about variability in light of contingency theory, which to me makes sense when we talk about people and the vicissitudes of environment and experience.
“None of the tests being used to construct VAM ratings are designed to be ‘instructionally sensitive,’ nor can they be made so with enough reliability to judge teachers.” The most recent version of the Stanford-Binet makes a claim to “change sensitive” scores, but I think we are still one generation (or two decades) away from having anything that is useful in this regard. It is all theory, and it belongs in the lab, not in public schools. This is exactly what I say when they start “measuring” things like “grit”… I blame that one on the Fordham Institute and Education Next (Checker Finn, Michael Petrilli and Martin West)… please tell them they are off base. They have the “ear” of David Driscoll at NAEP. This week my friend called those groups “entangling alliances of the robber barons.”
Why won’t ed reformers compromise on VAM? I know the Obama Administration swallowed it whole and mandated it as a condition of funding, but they’re on their way out.
If it isn’t working it seems like some adult could eat some crow and admit they made a poor decision.
It’s like they are doubling down on a losing hand.
That doesn't work in Vegas, and it won't work in education either.
Yeah, but it can mitigate the addiction for a few moments.
And thanks to Duane for pointing this out: "See Noel Wilson's review of the testing bible, 'A Little Less than Valid: An Essay Review,' covering the testing standards put out by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2002). Washington, DC: American Educational Research Association." I like Wilson's description of "psychometric fudge," because that is what it is, or sludge (and I am using a euphemism here).
What amazes me, Jean, is that many of those I have spoken with about Wilson, who understand what the problems are, still “go along to get along,” refusing to rock the boat and accepting these malpractices as acceptable. It is absolutely mind-boggling to me; I'm sitting here shaking my head thinking about it. What is it going to take to reach a “consensus” that these malpractices harm many children and that we should put them on the shelf along with concepts like phrenology, eugenics, the four humors/temperaments, etc.?
AY AY AY AY AY I scream in utter frustration.
I still think public schools could use their market power. A lot of these ed reformers go on to consultancies, pitching products to public schools.
Stop buying from people who make a career out of eradicating your schools. You have a choice, and there are plenty of providers. If they're all ed reformers, then we could use new ideas anyway after 15 years, and other providers will enter. Bust it up.
I don't know why we're spending public money on people who are openly averse to public schools.
I don't know why we are using tax dollars to attack public education. Complicit governors are all about sleight-of-hand moves, appointments, and nonsense scores and formulae with movable cut scores, shifting public funds to private corporations while circumventing any due process rights of workers in the state. He or she doesn't even need to put the shift to a public vote, apparently. Annihilation with the stroke of a pen!
Remember Campbell’s Law. https://dianeravitch.net/2012/05/25/what-is-campbells-law/