One of the affidavits in the Lederman v. King trial was filed by psychologist Brad Lindell.
His full affidavit is included in this post, which contains all the affidavits.
He sent the following note to me to explain his view of VAM in layman’s terms:
I am Dr. Brad Lindell, one of the affiants in the Sheri Lederman case, and I was present at the oral arguments on Wednesday. It was truly something to observe. You got the feeling that good was going to come from the great work of Sheri and Bruce Lederman and from the experts' opinions toward changing this broken VAM system. You got the sense that the judge was listening to the science about VAM and not just to the political rhetoric.
Just want to fill you in on something that was presented in my affidavit, modified here to give a clear and understandable example of the effects of poor reliability, using the full-scale WISC intelligence test. If the same test-retest reliability found for teachers' yearly VAM scores (.40) were applied to the WISC full-scale score to determine the 90% confidence interval, the range would be ridiculously large.
Examples
If a student scored a full-scale IQ of 100 (average), then the 90% confidence interval would be 81 to 119. This indicates that there would be a wide range within which the scores from repeated administrations of the WISC would be expected to fall for this student. One could not have confidence in the validity of an intelligence test with such low reliability. Without adequate reliability, there cannot be validity. The same holds true for VAM scores, whose reliabilities have been found to be notoriously low.
The reliability of the WISC is generally in the .80 to .90 range. The 90% confidence intervals are generally in the +/- 6 range. So this same person with a full-scale IQ of 100 would have a 90% confidence range of 94 to 106. A much smaller range.
This is why reliability is so important: year-to-year VAM score reliabilities have repeatedly been shown to be low, on the order of .2 to .4. This is also why teachers' year-to-year VAM scores vary so considerably, as in the case of Sheri Lederman. Without reliability there cannot be adequate validity.
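For readers who want to check the arithmetic behind those intervals, here is a minimal sketch using the classical test-theory formula for the standard error of measurement, SEM = SD * sqrt(1 - reliability), with the IQ scale's standard deviation of 15. The intervals printed in the WISC manuals are built around estimated true scores and so come out somewhat narrower than this simple formula, but the way low reliability blows up the interval is the same.

```python
import math

def ci_90(observed, sd, reliability):
    """90% confidence interval around an observed score, using the classical
    standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1.0 - reliability)
    half_width = 1.645 * sem  # z-value for a two-sided 90% interval
    return observed - half_width, observed + half_width

# IQ scale: mean 100, standard deviation 15
for r in (0.40, 0.90):
    low, high = ci_90(100, 15, r)
    print(f"reliability {r:.2f}: 90% CI roughly {low:.0f} to {high:.0f}")

# reliability 0.40 -> roughly 81 to 119, the range cited in the affidavit
# reliability 0.90 -> roughly 92 to 108; the published WISC intervals are
# tighter still, but the comparison makes the point
```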

And, as we learned in basic measurement courses, reliability is a necessary but not sufficient condition of validity. A scale may be very accurate (reliable) in assessing a certain attribute, but not be a meaningful (valid) measure of another.
To take an obviously exaggerated example: a good bathroom scale provides very precise information about weight. But it would be unwise to take the readings obtained from a classroom of children and throw them into a model that claims to be an indicator of their teacher's performance.
In his discussion with Campbell Brown, New Jersey Governor Christie let this one go regarding how easy it is to decide which teachers should be fired, and separated from their students.
Why, it takes all of ten minutes!
(21:13 – 21:54)
CHRISTIE (to the parents):
“Let me ask you a question, ’cause there’s a lot of people out here who care about education. When you go to ‘Back To School Night’, is there ever a doubt in your mind within ten minutes of getting in that classroom, whether that’s a good teacher or a bad teacher? Ever?
“You’re either in there going, ‘It’s gonna be a good year,’
” … or you’re… ‘Oh God. This is going to be a problem.’
“You don’t need a PhD in education to understand this (i.e. decide which teachers should be fired). If we (parents) can figure it out in ten minutes, then why can’t we have a tenure system that holds teachers to account, and that has parents understanding that they (parents) can have an impact on that, too?”
——————–
Could you imagine a teacher saying the same thing… that a teacher can tell within ten minutes whether a parent is unfit, and thus should have their child taken away by Child Services?
TEACHER (to the teachers):
“Let me ask you a question, ’cause there’s a lot of people out here who care about education. When you go to “Back To School Night”, is there ever a doubt in your mind within ten minutes of meeting a parent whether that’s a good parent or a bad parent? Ever?
“You’re either in there going, ‘It’s gonna be a good year,’
” … or you’re… ‘Oh God. This is going to be a problem.’
“You don’t need a PhD in education to understand this (i.e. decide which parents should have their children taken away). If we (teachers) can figure it out in ten minutes, then why can’t we have a child and family services system that holds parents to account, and that has teachers understanding that they (teachers) can have an impact on that, too?”
——————–
Or teachers sizing up students on the first day of school, deciding who needs services, adjustments in services, interventions, or alternate programming for gifted students, all within the first ten minutes of meeting each one.
Christie shouldn’t speak in public. He sounds like a garden variety neighborhood blowhard. Really embarrassing he’s a gov.
Exactly… CONSTRUCT validity is critically important. The problem with VAM, and previously with using the same high-stakes exams to rate schools and students, is that while you can reliably do the math, it doesn't really mean you're measuring what you say you are measuring. And EVEN if you are measuring it in one school year, that doesn't mean you can compare PERCENTILE RANKS from year to year as a way to measure growth. Percentile ranks don't work that way. That would be a very dumb application of math.
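To make the percentile-rank point concrete, here is a minimal sketch that assumes, purely for illustration, a normal score distribution with mean 100 and SD 15: the same raw-score gain produces very different percentile-rank changes depending on where a student starts, which is why subtracting percentile ranks from one year to the next is a poor way to measure growth.

```python
import math

def percentile(score, mean=100.0, sd=15.0):
    """Percentile rank of a score under an (illustrative) normal distribution."""
    z = (score - mean) / sd
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The same 5-point raw gain, starting from two different places:
for start in (100, 130):
    before, after = percentile(start), percentile(start + 5)
    print(f"{start} -> {start + 5}: percentile rank {before:.0f} -> {after:.0f} "
          f"(a change of {after - before:.0f} points)")

# 100 -> 105 moves a student from about the 50th to the 63rd percentile,
# while 130 -> 135 moves a student only from about the 98th to the 99th.
```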
“Without adequate reliability, there cannot be validity”
And vice versa! Without adequate validity, there cannot be reliability.
See Noel Wilson’s review of the testing bible, “A Little Less than Valid: An Essay Review,” covering the testing standards put out by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2002, Washington, DC: American Educational Research Association). Found at:
Click to access v10n5.pdf
The more important of the two is validity. Reliability is nothing if the test is not valid, that is, if it does not assess what it purports to assess and rests on serious epistemological and ontological errors and falsehoods. Since that is the case with all standardized testing, well, let's just say using the results for anything is completely invalid and a total waste of time, except as an exercise in mental mathturbation (thanks, SDP).
Whereas standardized tests measure “something”, VAM measures nothing. It is little more than a really complicated random number generator.
Also, the WISC is not a curriculum-specific test. It is not designed for all-at-once administration to a classroom. Each new edition of the WISC has elaborate documentation for the test, so the buyer can beware. Also, WISC test scores are not recycled to make high-stakes judgments about the person who administered the test, usually a professional in psychology or an allied field.
Point: reliability issues are compounded by the fact that classroom tests are constructed from a large “bank” of items. The tests, say for grade 4 ELA, are NOT identical year to year over multiple years. The big fear in the school context is that cheating will occur or that there will be excessive teaching to the test. The infamous FCAT tests in Florida included items just for the convenience of field testing them on the cheap. The answers to those items were “not counted,” but they produced data that would be useful in designing a future test, arguably at the expense of the students.
I can appreciate the demonstration of what reliability means in the case of WISC but classroom tests of academic content and skill are a different animal, or fruit, or veggie.
None of the tests being used to construct VAM ratings are designed to be “instructionally sensitive” nor can they be made so with enough reliability to judge teachers.
My thought experiment on how to make an instructionally sensitive test is this.
Begin with the assumption that the student is a blank slate, has learned nothing in or out of school, enters a room with one desk and no other students. Then comes the teacher who is the sole source of everything that the student will learn in “a given interval of instruction.” The test is administered to the student and it becomes a measure of the teacher’s effectiveness in “writing on the student’s blank slate.” If there is doubt about the ability of the student to learn, then get an independent estimate of that ability using a test such as WISC and then tweak the judgment about the teacher to take that into account.
On some “standardized” tests, the questions can vary from one test to another in the same year.
In Utah, the tests supposedly get harder if the student answers the previous questions correctly. Because of that, NO student would take exactly the same test. Even on the writing portion, there are several different writing prompts for each grade, randomly given to each student. I keep wondering how Utah can supposedly “compare” schools to each other with that much volatility in the test. Yet, schools are graded, and starting this year, teachers are evaluated by the scores of these tests. I know that there’s no validity to these tests anyway (thanks Duane), but it makes even less sense doing the tests this way.
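For readers unfamiliar with how a computer-adaptive test of this kind behaves, here is a toy sketch; the item bank, difficulty levels, and up/down selection rule are invented for illustration and are not Utah's actual algorithm. It shows how two students sitting in the same room can end up answering largely different sets of questions.

```python
import random

# Hypothetical item bank: (item_id, difficulty), with difficulty 1 (easy) to 5 (hard).
ITEM_BANK = [(i, d) for d in range(1, 6) for i in range(d * 100, d * 100 + 10)]

def adaptive_test(answers_correctly, n_items=5, seed=None):
    """Toy adaptive test: step the difficulty up after a correct answer and
    down after a wrong one. Returns the list of items actually administered."""
    rng = random.Random(seed)
    difficulty = 3  # start in the middle of the difficulty range
    administered = []
    for _ in range(n_items):
        candidates = [item for item in ITEM_BANK
                      if item[1] == difficulty and item not in administered]
        item = rng.choice(candidates)
        administered.append(item)
        if answers_correctly(item):
            difficulty = min(5, difficulty + 1)
        else:
            difficulty = max(1, difficulty - 1)
    return administered

# Two students of different ability end up seeing different questions:
strong = adaptive_test(lambda item: True, seed=1)          # gets everything right
weak = adaptive_test(lambda item: item[1] <= 2, seed=1)    # only handles easy items
print("strong student saw items:", strong)
print("weak student saw items:  ", weak)
```

Real adaptive systems place those different item sets on a common scale, typically through item response theory, but the concern in the comment stands: no two students sit the same form, so school-to-school and year-to-year comparisons lean heavily on the scaling model rather than on identical tests.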
This is a good discussion; I am pleased that the data came out and were used before the judge to describe what VAM does to teachers. For several years (perhaps a decade or more) we have had Kevin McGrew describing what happens to the CHILDREN when these scores are interpreted. For the most recent write-up, see his description of Forrest Gump here: http://www.slideshare.net/iapsych/forrest-gump-and-iq-expectations It is heavy on statistics and numerals, but use the graphic presentation to get to the heart of the matter. When Kevin was writing at IAP he also had a good video of a race between a greyhound, a turtle, and other “animals” with different attributes that illustrates important concepts (for this “race” he used the RPI, which is another way of considering how differently students perform). In Oliver Sacks's recent book he talks about variability in light of contingency theory, which to me makes sense when we talk about people and the vicissitudes of environment and experience.
“None of the tests being used to construct VAM ratings are designed to be ‘instructionally sensitive,’ nor can they be made so with enough reliability to judge teachers.” The most recent version of the Stanford-Binet makes a claim to “change sensitive” scores, but I think we are still one generation (or two decades) away from having anything that is useful in this regard. It is all theory, and it belongs in the lab, not in public schools. This is exactly what I say when they start “measuring” things like “grit”… I blame that one on the Fordham Institute and Education Next (Checker Finn, Michael Petrilli and Martin West)… please tell them they are off base. They have the “ear” of David Driscoll at NAEP. This week my friend called those groups “entangling alliances of the robber barons.”
Why won’t ed reformers compromise on VAM? I know the Obama Administration swallowed it whole and mandated it as a condition of funding, but they’re on their way out.
If it isn’t working it seems like some adult could eat some crow and admit they made a poor decision.
It’s like they are doubling down on a losing hand.
That doesn't work in Vegas, and it won't work in education either.
Yeah, but it can mitigate the addiction for a few moments.
And thanks to Duane for pointing this out: "See Noel Wilson's review of the testing bible, 'A Little Less than Valid: An Essay Review,' covering the testing standards put out by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2002). Washington, DC: American Educational Research Association." I like Wilson's description of "psychometric fudge," because that is what it is, or sludge (and I am using a euphemism here).
What amazes me, Jean, is that many of those I have spoken with about Wilson, who understand what the problems are, still “go along to get along,” refusing to rock the boat and accepting these malpractices as acceptable. It is absolutely mind-boggling to me; I'm sitting here shaking my head thinking about it. What is it going to take to reach a “consensus” that these malpractices harm many children and that we should put them on the shelf along with concepts like phrenology, eugenics, the four humors/temperaments, etc.?
AY AY AY AY AY I scream in utter frustration.
I still think public schools could use their market power. A lot of these ed reformers go on to consultancies, pitching products to public schools.
Stop buying from people who make a career out of eradicating your schools. You have a choice, and there are plenty of providers. If they're all ed reformers, then we could use new ideas anyway after 15 years, and other providers will enter. Bust it up.
I don't know why we're spending public money on people who are openly averse to public schools.
I don't know why we are using tax dollars to attack public education. Complicit governors are all about sleight-of-hand moves, appointments, and nonsense scores and formulae with movable cut scores, shifting public funds to private corporations while circumventing any due process rights of workers in the state. He or she doesn't even need to put the shift to a public vote, apparently. Annihilation with the stroke of a pen!
Remember Campbell’s Law. https://dianeravitch.net/2012/05/25/what-is-campbells-law/