Audrey Beardsley: Can VAM Be Trusted?

Audrey Amrein Beardsley, a national authority on teacher evaluation, reviews here the latest scholarly research on “value-added measurement,” or VAM. This is the practice of evaluating teachers by changes in their students’ test scores. It was made into a national issue by Race to the Top, which required states to make VAM a significant part of teacher evaluation. In granting waivers to states from the Draconian sanctions of NCLB, Arne Duncan required states to adopt VAM.

The research keeps building against the usefulness of VAM. The latest study concludes that VAM is highly unreliable. Children are not randomly assigned, and teachers face widely varying challenges.

Beardsley writes:

“In a recent paper published in the peer-reviewed journal Education Finance and Policy, coauthors Cassandra Guarino (Indiana University – Bloomington), Mark Reckase (Michigan State University), and Jeffrey Wooldridge (Michigan State University) ask and then answer the following question: “Can Value-Added Measures of Teacher Performance Be Trusted?”…

“From the abstract, authors “investigate whether commonly used value-added estimation strategies produce accurate estimates of teacher effects under a variety of scenarios. [They] estimate teacher effects [using] simulated student achievement data sets that mimic plausible types of student grouping and teacher assignment scenarios. [They] find that no one method accurately captures true teacher effects in all scenarios, and the potential for misclassifying teachers as high- or low-performing can be substantial.”

She adds:

“They found…

“No one [value-added] estimator performs well under all plausible circumstances, but some are more robust than others…[some] fare better than expected…[and] some of the most popular methods are neither the most robust nor ideal.” In other words, calculating value-added regardless of the sophistication of the statistical specifications and controls used is messy, and this messiness can seriously throw off the validity of the inferences to be drawn about teachers, even given the fanciest models and methodological approaches we currently have going (i.e., those models and model specifications being advanced via policy).

“[S]ubstantial proportions of teachers can be misclassified as ‘below average’ or ‘above average’ as well as in the bottom and top quintiles of the teacher quality distribution, even in [these] best-case scenarios.” This means that the misclassification errors we are seeing with real-world data, we are also seeing with simulated data. This leads us to more concern about whether VAMs will ever be able to get it right, or in this case, counter the effects of the nonrandom assignment of students to classrooms and teachers to the same.

“Researchers found that “even in the best scenarios and under the simplistic and idealized conditions imposed by [their] data-generating process, the potential for misclassifying above-average teachers as below average or for misidentifying the “worst” or “best” teachers remains nontrivial, particularly if teacher effects are relatively small. Applying the [most] commonly used [value-added approaches] results in misclassification rates that range from at least 7 percent to more than 60 percent, depending upon the estimator and scenario.” So even with a pretty perfect dataset, or a dataset much cleaner than those that come from actual children and their test scores in real schools, misclassification errors can impact teachers upwards of 60% of the time….

“In sum, researchers conclude that while certain VAMs hold more promise than others, they may not be capable of overcoming the many obstacles presented by the non-random assignment of students to teachers (and teachers to classrooms).

“In their own words, “it is clear that every estimator has an Achilles heel (or more than one area of potential weakness)” that can distort teacher-level output in highly consequential ways. Hence, “[t]he degree of error in [VAM] estimates…may make them less trustworthy for the specific purpose of evaluating individual teachers” than we might think.”

Duane Swacker says:

February 11, 2015 at 1:57 pm

“In other words, calculating value-added regardless of the sophistication of the statistical specifications and controls used is messy, and this messiness can seriously throw off the validity of the inferences to be drawn about teachers, even given the fanciest models and methodological approaches we currently have going.”

That “messiness” consists of the various errors identified by Wilson that render such educational malpractices COMPLETELY INVALID.

Those “many obstacles” again are some of the many errors identified by Wilson that render these educational malpractices COMPLETELY INVALID.

““In their own words, “it is clear that every estimator has an Achilles heel (or more than one area of potential weakness)” that can distort teacher-level output in highly consequential ways. Hence, “[t]he degree of error in [VAM] estimates…may make them less trustworthy for the specific purpose of evaluating individual teachers” than we might think.””

That “Achilles heel”, more than being an “area of potential weakness”, has been shown by Wilson to COMPLETELY INVALIDATE the processes, making them not “less trustworthy” but COMPLETELY UNTRUSTWORTHY; they should therefore be discarded before more harm comes not only to teachers but to the students themselves.

Duane Swacker says:

February 11, 2015 at 2:01 pm

To understand why those practices are COMPLETELY UNTRUSTWORTHY read and understand Noel Wilson’s never refuted nor rebutted 1997 dissertation “Educational Standards and the Problem of Error” found at: http://epaa.asu.edu/ojs/article/view/577/700

Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine.

1. A description of a quality can only be partially quantified. Quantity is almost always a very small aspect of quality. It is illogical to judge/assess a whole category only by a part of the whole. The assessment is, by definition, lacking in the sense that “assessments are always of multidimensional qualities. To quantify them as unidimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. In attempting to quantify educational standards and standardized testing the descriptive information about said interactions is inadequate, insufficient and inferior to the point of invalidity and unacceptability.

2. A major epistemological mistake is that we attach, with great importance, the “score” of the student, not only onto the student but also, by extension, the teacher, school and district. Any description of a testing event is only a description of an interaction, that of the student and the testing device at a given time and place. The only correct logical thing that we can attempt to do is to describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student as it cannot be a description of the student but the interaction. And this error is probably one of the most egregious “errors” that occur with standardized testing (and even the “grading” of students by a teacher).

3. Wilson identifies four “frames of reference”, each with distinct assumptions (epistemological basis) about the assessment process from which the “assessor” views the interactions of the teaching and learning process: the Judge (think of a college professor who “knows” the students’ capabilities and grades them accordingly), the General Frame (think of standardized testing that claims to have a “scientific” basis), the Specific Frame (think of learning by objective, like computer-based learning, getting a correct answer before moving on to the next screen), and the Responsive Frame (think of an apprenticeship in a trade or a medical residency program where the learner interacts with the “teacher” with constant feedback). Each category has its own sources of error, and more error in the process is caused when the assessor confuses and conflates the categories.

4. Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”
In other words, all the logical errors involved in the process render any conclusions invalid.

5. The test makers/psychometricians, through all sorts of mathematical machinations, attempt to “prove” that these tests (based on standards) are valid, that is, errorless or at least supposedly with minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations and the test and results are. He is an advocate for the test taker, not the test maker. In doing so he identifies thirteen sources of “error”, any one of which renders the test making/giving/disseminating of results invalid. And a basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.

6. Having shown the invalidity, and therefore the unreliability, of the whole process, Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory”. In other words, start with an invalidity, end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel who finds the occasional acorn, a result may be “true”) or, to put it in more mundane terms, crap in, crap out.

7. And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”

In other words, it attempts to measure “‘something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students as the social rewards for some are not available to others who “don’t make the grade (sic).” Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?

My answer is NO!!!!!

One final note with Wilson channeling Foucault and his concept of subjectivization:

“So the mark [grade/test score] becomes part of the story about yourself and with sufficient repetitions becomes true: true because those who know, those in authority, say it is true; true because the society in which you live legitimates this authority; true because your cultural habitus makes it difficult for you to perceive, conceive and integrate those aspects of your experience that contradict the story; true because in acting out your story, which now includes the mark and its meaning, the social truth that created it is confirmed; true because if your mark is high you are consistently rewarded, so that your voice becomes a voice of authority in the power-knowledge discourses that reproduce the structure that helped to produce you; true because if your mark is low your voice becomes muted and confirms your lower position in the social hierarchy; true finally because that success or failure confirms that mark that implicitly predicted the now self evident consequences. And so the circle is complete.”

In other words, students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of the students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, the “I’m an ‘A’ student” is almost as harmful as “I’m an ‘F’ student” in hindering students from becoming independent, critical and free thinkers. And having independent, critical and free thinkers is a threat to the current socio-economic structure of society.

By Duane E. Swacker

M says:

February 11, 2015 at 2:27 pm

The people who control VAM have shown a willingness over decades to manipulate cut scores to get whatever they want. Why is VAM so different in how it can tweak the formula to get certain results? Further, if they intend to change it until it’s working as intended such as firing 5-10% of bottom teachers every year, then it is already far from objective by assuming a bell curve where it might not exist (what if most ineffective teachers leave in their first few years or are counseled out?)

retired teacher says:

February 11, 2015 at 3:01 pm

VAM is a vehicle for data manipulation.

Linda Johnson says:

February 11, 2015 at 2:44 pm

If the United States were truly interested in teacher effectiveness and student progress, our leaders would invest in other professionals who would visit teachers’ classes and become familiar with the progress of those students. Obviously there is no two-dollar group test that can do this job. And these tests aren’t even secure! Are we really that stupid or is VAM just a cover for another agenda?

I think most of us know the answer to that one.

Duane Swacker says:

February 11, 2015 at 3:07 pm

Linda,

“And these tests aren’t even secure!”

I’m not sure what you are trying to say with that sentence. Would you please explain.

Thanks,
Duane

- Linda Johnson says:
  
  February 11, 2015 at 3:49 pm
  
  These tests are often around the school for several days, are given by classroom teachers and then sent to the principal and then to district office. There is much test invalidation going on. In addition to that, many schools teach to the test (as in teaching same or similar items). Of course this invalidates the test.
  
Rick Lapworth says:

February 11, 2015 at 4:31 pm

VAM: 1) very arbitrary measure 2) vehement anti-model 3) varying abstruse maladies
4) voluminous abstract madness 5) voluptuously a-mathematic 6)????

SomeDAM Poet says:

February 11, 2015 at 6:28 pm

“VAMmit All”

VAMmit all, we’ve had enough
From inside Gates of Hell
Enough of “sciencey” sounding stuff
And Common Core as well

Enough of Chetty-picking
To make Vergara cases
Enough of statistricking
The public, with no basis

Enough of VAMmy charts
That mimic random scatter
Essentially hurling darts
To tell us “what’s the matter”

Enough of rating teachers
For things they cannot sway
The outside world has features
Which make the teachers pay

Enough of teacher-shaming
And firing based on bunk
We really should be blaming
The ones who sell this junk

Enough of VAMmityVile horrors
Like teacher suicide
From teacher-bashing chorus
We simply shan’t abide

VAMmit all, the VAMs are DAMs
Devalue’s what they add
The DAM reforms are battling rams
And plain and simple bad

Duane Swacker says:

February 11, 2015 at 9:01 pm

SDP,

Just came across something I thought you might be interested in. It’s a CD/Album called Tone Poems by David Grisman and Tony Rice. Check it out if you like acoustic music.

Duane

KrazyTA says:

February 11, 2015 at 10:35 pm

SomeDAM Poet and Duane Swacker: it’s no wonder that VAManiacs and their clients/patrons/enablers don’t want to get into genuine public discussions and debates about their capricious numerical chimeras.

Imagine, if you will, the accountabully underlings of the self-proclaimed “education reform” movement getting up on a public stage and being unable to refute the statement that VAM “misclassification errors can impact teachers upwards of 60% of the time” but making this incongruous pitch:

“Buy this eduproduct if you want to know how your child’s teachers and your child and your local public school measure up!”

Then—horror of horrors!—they would have to get into the reasoning and assumptions and goals built into their mathematical models and reveal just how subjective and squishy and rigged their “objective measurers/modelers” are.

¿? You’re right, Señor Swacker, even in our wildest fantasies it is hard to conceive of the thought leaders of the “killjoy rheephorm movement” engaging in genuine and open give-and-take. That would take $tudent $ucce$$ from the black into the red.

Yes, I forgot my Spanish: “no pidas peras al olmo” [don’t ask for pears from an elm tree/don’t ask for the impossible].

I guess, like that famed Mexican superhero of yesteryear, El Chapulín Colorado, I wasn’t thinking— “¡Se me chispoteó!” [Sorry, that just slipped out!].

I’ll try to be more careful next time.

Thank you both for your contributions to this thread.

😎

Christine Langhoff says:

February 11, 2015 at 7:02 pm

From Diane’s post:
“slur-added measurement” or VAM”

I don’t doubt that it’s a typo. But it sure seems appropriate!

dianeravitch says:

February 11, 2015 at 8:43 pm

Christine, I changed the typo
Freudian slip

- calanghoff says:
  
  February 11, 2015 at 10:19 pm
  
  😉
  
Jon Lubar says:

February 11, 2015 at 7:32 pm

In addition to all the scholarly research, we must continue to draw attention to the numerous real-world failures of VAM. Perhaps the most glaring example is NYC under Bloomberg. After 3 years of testing, after hundreds of millions of dollars and hours of instructional time were devoted to the experiment, no usable data was produced. It was all completely random. This was when Gates said that VAM results should not be used to shame teachers. What he was really trying to do was divert attention from VAM itself being shamed. If I remember correctly, it was Gary Rubinstein who crunched the numbers.

dianeravitch says:

February 11, 2015 at 8:41 pm

Yes, Gary Rubinstein showed that NYC teacher ratings were nonsense.

SomeDAM Poet says:

February 11, 2015 at 10:50 pm

This graph (from this post) convinced me that VAM is basically garbage.

I wrote a ditty after seeing that graph that sums up the absurdity that is VAM

“VAM Mechanics”

“Simultaneously good and bad!”
What the VAM has said
Teachers are like Schroedinger’s Cat
Both alive and dead

That such utter nonsense is being used to deprive people of their careers and livelihoods is not only completely unscientific but completely unethical given the red flags that ASA and others have put up about VAMs.

The people doing this to teachers cannot claim ignorance.

Diane Ravitch's Blog