Just How Meaningful Are Those Value-Added Ratings?

The studies of value-added measurement keep on coming, and the findings usually show what an utterly absurd idea it is to think that teacher quality can be judged by student test scores. In a just world, Arne Duncan would be held accountable for the stupid and harmful theories he has imposed on the nation’s public schools. The U.S. Department of Education has become a malignant force in American education. I cannot think of any other time in our nation’s history when public schools and teachers were literally endangered by the mandates coming from Washington, D.C., where the leadership is wholly ignorant of federalism.

This story in Education Week summarizes the latest batch of studies of VAM. Some researchers, having made this their area of specialization, continue to prod in hopes of good news.

But look at this:

“In a study that appears in the current issue of the American Educational Research Journal, Noelle A. Paufler and Audrey Amrein-Beardsley, a doctoral candidate and an associate professor at Arizona State University, respectively, conclude that elementary school students are not randomly distributed into classrooms. That finding is significant because random distribution of students is a technical assumption that underlies some value-added models.

“Even when value-added models do account for nonrandom classroom assignment, they typically fail to consider behavior, personality, and other factors that profoundly influenced the classroom-assignment decisions of the 378 Arizona principals surveyed. That, too, can bias value-added results.

“Perhaps most provocative of all are the preliminary results of a study that uses value-added modeling to assess teacher effects on a trait they could not plausibly change, namely, their students’ heights. The results of that study, led by Marianne P. Bitler, an economics professor at the University of California, Irvine, have been presented at multiple academic conferences this year.
The authors found that teachers’ one-year “effects” on student height were nearly as large as their effects upon reading and math. The researchers did not find any correlation between the “value” that teachers “added” to height and the value they added to reading and math. In addition, unlike the reading and math results, which demonstrated some consistency from one year to the next, the height outcomes were not stable over time. The authors suggested that the different properties of the two models offered “some comfort.” Nevertheless, they advised caution.”

So, let’s get this right: teachers’ effects on students’ height were nearly as large as their effect on reading and math.
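To see why a placebo outcome like height is a useful check, here is a minimal Python sketch. It is not the method used in the Bitler study; it is a simplified illustration with made-up numbers (200 teachers, 25 students per class, height gains with a mean of 6 cm and a standard deviation of 2 cm). Every student's gain is drawn from the same distribution, so the true teacher effect is zero by construction, yet a naive class-average estimate still assigns each teacher a nonzero "effect."

```python
import random
import statistics


def simulate_placebo_effects(n_teachers=200, class_size=25, seed=0):
    """Estimate per-teacher 'effects' on an outcome teachers cannot change.

    All students' one-year height gains come from one shared distribution,
    so the true teacher effect is exactly zero by construction.
    """
    rng = random.Random(seed)
    class_means = []
    for _ in range(n_teachers):
        # Hypothetical height gains in cm (mean 6, SD 2), same for every class.
        gains = [rng.gauss(6.0, 2.0) for _ in range(class_size)]
        class_means.append(statistics.fmean(gains))
    grand_mean = statistics.fmean(class_means)
    # Naive "value-added" estimate: class mean minus the overall mean.
    return [m - grand_mean for m in class_means]


effects = simulate_placebo_effects()
spread = statistics.pstdev(effects)
print(f"SD of estimated teacher 'effects' on height: {spread:.2f} cm")
print(f"Largest apparent 'effect': {max(effects):.2f} cm")
```

With classes this small, the spread of the estimates comes entirely from sampling noise in the class averages. Real value-added models are more elaborate than this, but the sketch shows how apparent teacher "effects" can arise where none exist.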

Perhaps Arne can just arrange to have all teachers fired (except for TFA), close every school (except “no-excuses” charter schools), and turn around the whole country.

Steve K says:

May 20, 2014 at 7:15 am

It isn’t surprising that random distribution of students doesn’t generate essentially “similar” classrooms. While I’m sure that teachers have experienced the following situation in a classroom, I’ll make a comparison with my coaching career.

I coached 29 different teams at the high school level. Baseball and basketball. Sports with cuts so there was some selection in place (unlike a classroom). We would make cuts after a week. I coached freshmen mostly so we had one week to get to know and evaluate players. Personalities weren’t always openly on display.

I had teams with a lot of talent but no personality chemistry. They often underperformed. Then I had teams with mediocre talent but a real sense of togetherness. They exceeded expectations. The whole can be greater than the sum of its parts.

Classrooms can be like that. I had a class with no “superstars” in an academic sense but it was filled with kids who got along well. Great atmosphere and good results. Also, I had a class filled with kids who were over 24 on their ACT but it was stocked with three major cliques who did not really associate. You can guess how that went.

The make-up of the group can make a difference.

teachingeconomist says:

May 20, 2014 at 7:32 am

How do we decide when Gates Foundation funded research should be ignored because it is funded by the Gates Foundation and when it should be praised as high quality research?

MathVale says:

May 20, 2014 at 7:36 am

By looking at the actual studies.

- teachingeconomist says:
  
  May 20, 2014 at 7:51 am
  
  I agree, and I am glad to see others thinking that research should not be condemned based on the funding sources or the conclusions reached but on the quality of the research being done.
  
  This paper seems to be behind a paywall and I could not easily find the working paper version of it, but perhaps I just need to dig a little deeper.
  
  By the way, in poor countries around the world, “stunting” is used as a measure of chronic malnutrition (“wasting”, being underweight, is a measure of short term malnutrition). I would be surprised if cognitive ability and stunting were not correlated in those countries.
  
- MathVale says:
  
  May 20, 2014 at 8:30 am
  
  Gates Foundation started with an admirable goal – what makes a great teacher. Granted, it is harder to show that than what doesn’t make a great teacher. But, Gates lost their focus and wandered into metrics and testing. Reform became punitive. Government jumped too early to conclusions and misapplied the VAM model, enforcing the concept by force of law – not reason. Peer review takes time and as VAM is questioned, those politicians and businesses with a stake have to defend an indefensible position. So they resort to demonizing teachers, ridiculing white suburban moms, and silencing principals. Another possible study might test the hypothesis that the more VAM is debunked, the more The Reformers will blame educators, parents, and students.
  
  The issue is somewhat paradoxical – to answer the question “what makes a great teacher?”, we find a great teacher and ask them. This approach suggests mentoring and the age old apprenticeship system is better than flawed metrics.
  
  Maybe a question for you to ask is if the Gates Foundation will fully fund research that shows results they dislike? Now, that’s science.
  
- Teachingeconomist says:
  
  May 20, 2014 at 8:39 am
  
  I have no way to know what the Gates Foundation likes or dislikes. The Gates Foundation did fund the creation of the data set used in this paper and also directly funded the research that resulted in the paper.
  
- MathVale says:
  
  May 20, 2014 at 9:27 am
  
  The Gates Foundation is an enigma. If you look at Gates’s talks on TED or Business Insider (Google them yourself to avoid link wars), he talks about mentoring and finding what makes a great teacher and avoids details, even when baited by an AEI questioner. He talks a good talk. Yet his Foundation is funding cold metrics and standards used to fire good teachers, not improve schools. The MET is being suggested here in Ohio as a way to use student surveys for 20% of teacher ratings. In his talks, he laments failing schools and teachers by comparing NAEP and PISA international rankings.
  
  Don’t be fooled. In my ancient days as a programmer, I attended a developers conference where Gates gave a talk and was greeted like a god. In one slide, an item listed the goal of “eliminating the high priesthood of programming”. Clearly an oops, and a slide meant for executives looking to cut costs in IT, not programmers. None of my fellow programming nerds seemed to notice.
  
- teachingeconomist says:
  
  May 20, 2014 at 12:18 pm
  
  The Gates Foundation also funded the study cited in the post. What conclusions, if any, should we draw from that?
  
  As for society’s attempts to economize on the use of rare and expensive programming talent, that seems to me to be an admirable goal. We should not use as many resources as possible to produce the necessities and conveniences of life, but as few as possible.
  
MathVale says:

May 20, 2014 at 7:34 am

The height study is great. And I suppose if students go down on test scores, the teacher is shrinking the kids.

Titleonetexasteacher says:

May 20, 2014 at 8:01 am

It gives new meaning to the phrase “my students really grew this year.”

MathVale says:

May 20, 2014 at 8:37 am

I love that height study because it is a great “reductio ad absurdum” – assume VAM is true, then show how that leads to absurdity. The ancient Greeks were pure genius. I think we are doing that proof every day in the classroom.

wgersen says:

May 20, 2014 at 7:36 am

Wait! Wasn’t Arne a basketball player at Harvard? He’ll probably look at this correlation between height gained and teachers as a good thing! We’ll not only be able to do better on international tests, but we’ll also have taller basketball players as well! This isn’t a bug of VAM… it’s a feature!

Chiara Duggan says:

May 20, 2014 at 8:25 am

He gave a commencement speech where he said his time playing basketball prepared him for his current job.
It’s fine, everyone who doesn’t have direct experience in a job when they’re applying for one does it, it’s Job Interview 101, but if you’ve ever interviewed someone for a job you know that so it was amusing to hear coming from him.
He has to explain how he went from working for a non-profit to running a huge city public school system.
One just takes whatever they did and announces that the skills transfer, thus they are a perfect fit.
“Teamwork” is a popular choice, but so is “I worked alone, so I don’t need supervision and I’m a self-starter”. It’s endlessly flexible 🙂

- Michael Fiorillo says:
  
  May 21, 2014 at 7:12 am
  
  Duncan also never fails to mention that he tutored at the after-school program his Mommy ran.
  
jecgenovese says:

May 20, 2014 at 8:23 am

Reblogged this on peakmemory and commented:
More evidence against value-added measures of teaching

Chiara Duggan says:

May 20, 2014 at 9:04 am

“I just don’t buy the notion that by somehow saying we’re going to have a system that evaluates performance – and by the way 35 percent of what we’re going to look at is whether kids learned more material – that doesn’t strike me as soul crushing or unreasonable or unrealistic,” Huffman says. “It strikes me as a very reasonable way of looking at how a teacher has done in their day-to-day duties in the classroom.”

Unless it doesn’t work, right? In that case he’d have to re-consider, being “data-driven” and all that would be pretty important, to consider some data to temper his fervent beliefs.

Or just continue to blame teachers unions, I guess, as he does in this interview.

http://nashvillepublicradio.org/blog/2014/05/19/ed-talk-kevin-huffman-says-adults-means-teachers-work-harder/?utm_content=bufferbb63f&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Alan C. Jones says:

May 20, 2014 at 9:05 am

As with all statistical modeling, it will always come down to who decides what to regress and what that “who” decides to analyze. Such abstract modeling must, by its very nature, leave out those qualities in a child that at the end of the day count most in life.

May 20, 2014 at 9:13 am

I was curious about why ed reformers use “CEO” rather than “superintendent” and someone told me they do that because “superintendent” has specific qualifications and requirements in many places. I thought it was just to show fealty to a private sector model, so they wouldn’t be accused of backing public schools or anything controversial and unfashionable like that.

If that’s true, my hat is off to them. That’s truly innovative. Change the job title and the qualifications so they can leapfrog past everyone who has been working at it for 20 years and run a school system with no experience.

This kind of imaginative self-promotion is what made this country great! 🙂

2old2teach says:

May 20, 2014 at 1:11 pm

A CEO doesn’t necessarily have to know much about the business. He/she just has to manage the people that do in a manner that maximizes profits without jeopardizing the business. Of course, they could be brought in to squeeze all value out of an enterprise before declaring bankruptcy.

Duane Swacker says:

May 20, 2014 at 9:25 am

Since all VAM and SGPs are fundamentally based on standardized test scores, and standardized test scores have been proven to be COMPLETELY INVALID*, any conclusions drawn from said scores are, by definition, COMPLETELY INVALID. Start with crap, i.e., standardized test scores, end with crap, i.e., VAM and SGPs. Nothing fancy about that piece of simple logic. Except it seems that most people can’t apply that little bit of common sense.

*“Educational Standards and the Problem of Error” found at:
http://epaa.asu.edu/ojs/article/view/577/700

Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine. (updated 6/24/13 per Wilson email)

1. A quality cannot be quantified. Quantity is a sub-category of quality. It is illogical to judge/assess a whole category by only a part (sub-category) of the whole. The assessment is, by definition, lacking in the sense that “assessments are always of multidimensional qualities. To quantify them as one dimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. In attempting to quantify educational standards and standardized testing we are lacking much information about said interactions.

2. A major epistemological mistake is that we attach, with great importance, the “score” of the student, not only onto the student but also, by extension, the teacher, school and district. Any description of a testing event is only a description of an interaction, that of the student and the testing device at a given time and place. The only correct logical thing that we can attempt to do is to describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student as it cannot be a description of the student but the interaction. And this error is probably one of the most egregious “errors” that occur with standardized testing (and even the “grading” of students by a teacher).

3. Wilson identifies four “frames of reference” each with distinct assumptions (epistemological basis) about the assessment process from which the “assessor” views the interactions of the teaching and learning process: the Judge (think college professor who “knows” the students capabilities and grades them accordingly), the General Frame-think standardized testing that claims to have a “scientific” basis, the Specific Frame-think of learning by objective like computer based learning, getting a correct answer before moving on to the next screen, and the Responsive Frame-think of an apprenticeship in a trade or a medical residency program where the learner interacts with the “teacher” with constant feedback. Each category has its own sources of error and more error in the process is caused when the assessor confuses and conflates the categories.

4. Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”

In other words, all the logical errors involved in the process render any conclusions invalid.

5. The test makers/psychometricians, through all sorts of mathematical machinations, attempt to “prove” that these tests (based on standards) are valid-errorless or supposedly at least with minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations and the test and results are. He is an advocate for the test taker not the test maker. In doing so he identifies thirteen sources of “error”, any one of which renders the test making/giving/disseminating of results invalid. A basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.

6. Having shown the invalidity, and therefore the unreliability, of the whole process Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory”. In other words, start with an invalidity, end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel who finds the occasional acorn, a result may be “true”) or, to put it in more mundane terms, crap in, crap out.

7. And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”

In other words it measures “’something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students as the social rewards for some are not available to others who “don’t make the grade (sic).” Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?

My answer is NO!!!!!

One final note with Wilson channeling Foucault and his concept of subjectivization:

“So the mark [grade/test score] becomes part of the story about yourself and with sufficient repetitions becomes true: true because those who know, those in authority, say it is true; true because the society in which you live legitimates this authority; true because your cultural habitus makes it difficult for you to perceive, conceive and integrate those aspects of your experience that contradict the story; true because in acting out your story, which now includes the mark and its meaning, the social truth that created it is confirmed; true because if your mark is high you are consistently rewarded, so that your voice becomes a voice of authority in the power-knowledge discourses that reproduce the structure that helped to produce you; true because if your mark is low your voice becomes muted and confirms your lower position in the social hierarchy; true finally because that success or failure confirms that mark that implicitly predicted the now self-evident consequences. And so the circle is complete.”

In other words students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of the students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, the “I’m an “A” student” is almost as harmful as “I’m an ‘F’ student” in hindering students becoming independent, critical and free thinkers. And having independent, critical and free thinkers is a threat to the current socio-economic structure of society.

May 20, 2014 at 9:31 am

This looks like an interesting way to reorganize schools to make them more equitable as far as lower income students and higher income students, and she didn’t have to privatize half the schools, create a whole new system, or insist that the magic of markets was going to get her to a more equitable system:

“Fairfax County Public Schools Superintendent Karen Garza announced plans May 16 to reorganize the system’s administrative structure. The new system takes effect July 1.
Instead of eight clusters, the new system would group schools into five administrative regions, with high-achieving schools that serve affluent populations grouped with those with more diverse student bodies and larger proportions of lower-income students. Each region would have a regional assistant superintendent and an “executive principal.”

That looks like a doable and “first do no harm” approach or first step. It probably won’t make her a “rock star” however 🙂

http://annandaleva.blogspot.com/2014/05/fcps-superintendent-garza-reorganizes.html

Laura H. Chapman says:

May 20, 2014 at 10:36 am

Hurray for this example. Lewis Carroll would be proud.

“The authors found that teachers’ one-year “effects” on student height were nearly as large as their effects upon reading and math. ”

This reminds me of the example of correlation wherein the production of Panama hats had a high correlation with (say) the grades of students on a history test.
I think my faint memory of this example lingers from a clever and classic book titled How to Lie With Statistics.

In any case, with the increasing availability of software for “analytics” using big data, I hope that a gallery full of these absurdities goes viral and occasions some serious embarrassment, especially among economists, statisticians, and policy makers who have pushed for the use of VAM as the method of choice for stack ranking teachers and firing the lowest “performers.”

This example also heightens my anger at the American Statistical Association’s long delay in saying don’t use VAM to judge individual teachers.

Stiles says:

May 20, 2014 at 12:56 pm

VAM should not be used for high stakes decisions like merit pay, intensive supervision, and termination. The junk policy in effect today is awful.

That having been said, the height study may not be as absurd as it appears on the surface. There is evidence that a child’s nutrition contributes both to height and cognitive development.

But this also points out one of the limitations of VAM. In concept, VAM controls for the factors beyond the educator’s control. In reality, the measurable variables are too limited to control for these external factors. Do VAM models incorporate data on a child’s net nutrition? No, because that data is not available.

The ASA was slow in issuing their statement, but I think that is in part because the quantitative models used in VAM are accepted statistical methods. The problem is that policy makers are using VAM in ways that far exceed its explanatory power and ignore its measurement uncertainty. Junk policy again.
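The omitted-variable point above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real VAM: the unmeasured factor (labeled `nutrition` here), the 0.5 coefficient, and the class sizes are all hypothetical. Students are sorted into classrooms on the unmeasured factor, test-score gains depend only on that factor plus noise, and there is no teacher term at all.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def simulate_sorted_classrooms(n_teachers=50, class_size=25, seed=1):
    rng = random.Random(seed)
    n_students = n_teachers * class_size
    # Hypothetical unmeasured factor (e.g. nutrition), in standardized units.
    nutrition = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    # Nonrandom assignment: sorting puts similar students in the same class.
    nutrition.sort()
    estimates, class_nutrition = [], []
    for t in range(n_teachers):
        group = nutrition[t * class_size:(t + 1) * class_size]
        # Score gain depends only on nutrition plus noise; no teacher effect.
        gains = [0.5 * x + rng.gauss(0.0, 1.0) for x in group]
        estimates.append(statistics.fmean(gains))
        class_nutrition.append(statistics.fmean(group))
    grand = statistics.fmean(estimates)
    # Naive "value-added": class mean gain minus the overall mean gain.
    return [e - grand for e in estimates], class_nutrition


estimates, class_nutrition = simulate_sorted_classrooms()
r = pearson(estimates, class_nutrition)
print(f"Correlation of teacher 'effect' with class nutrition: {r:.2f}")
```

Because the model cannot see the unmeasured factor, the teacher "effects" it reports track classroom composition rather than anything the teacher did. Extreme sorting exaggerates the problem for clarity; milder sorting would produce a weaker version of the same bias.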

leonardisenberg says:

May 20, 2014 at 10:43 am

Why do teachers and academicians continue to believe or give any credibility to the notion that anybody setting public education policy in this country has any good faith belief that value-added or linking teacher effectiveness assessments to student test scores has any validity? Those running this latest scam clearly don’t. In the credit default swaps and sub prime scams, those perpetrating the scams actually paid scientists to develop bogus complex meaningless algorithms to determine if the financial instruments in question had any value- they all knew they didn’t. Now hedge funds running semi-privatized state subsidized charters get to pull a similar scam in public education.

When you put a student into my 12th grade Government class with a 3rd grade reading ability, after they have started school already years behind and then you continue to socially promote them without mastery of years of grade-level standards, the only purpose for generating non sequitur value-added assessment metrics is to get rid of professional, fairly compensated teachers in favor of underpaid novice teachers. This allows those running the game to shift 40% of the total $1.2 trillion a year spent on public education in this country to “administration costs.” Bernie Madoff would be proud. The only question that remains for me is why do you dignify this by addressing it as if it had any validity, when those proposing it clearly don’t believe so?

Andy Goldstein says:

May 20, 2014 at 6:37 pm

I’d like to share a talk I gave to the School Board of Palm Beach County, FL. “VAM: The Scarlet Letter”:

Diane Ravitch's Blog
