Peter Greene here evaluates a report by two analysts at Bellwether Education, a DC think tank, about how teachers should be evaluated. His post is a model of how to tear apart and utterly demolish the musings of people far removed from the classroom about how things ought to work.

He begins by situating its sponsor:

“I am fascinated by the concept of think tank papers, because they are so fancy in presentation, but so fanceless in content. I mean, heck – all I need to do is give myself a slick name and put any one of these blog posts into a fancy pdf format with some professional looking graphic swoops, and I would be releasing a paper every day.

“Bellwether Education, a thinky tank with connections to the standards-loving side of the conservative reformster world, has just released a paper on the state of teacher evaluation in the US: “Teacher Evaluation in an Era of Rapid Change: From ‘Unsatisfactory’ to ‘Needs Improvement.’” (Ha! I see what you did there.) Will you be surprised to discover that the research was funded by the Bill and Melinda Gates Foundation?”

He reviews what they describe as current trends and pulls each one apart.

Here is an example of a current trend and Greene’s response:

“3) Districts still don’t factor student growth into teacher evals

“Here we find the technocrat blind faith in data rearing its eyeless head again.

The authors say: “While raw student achievement metrics are biased—in favor of students from privileged backgrounds with more educational resources—student growth measures adjust for these incoming characteristics by focusing only on knowledge acquired over the course of a school year.”

“This is a nice, and inaccurate, way to describe VAM, a statistical tool that has now been discredited more times than Donald Trump’s political acumen. But some folks still insist that if we take very narrow standardized test results and run them through an incoherent number-crunching, the numbers we end up with represent useful objective data. They don’t. We start with standardized tests, which are not objective, and run them through various inaccurate variable-adjusting programs (which are not objective), and come up with a number that is crap. The authors note that there are three types of pushback to using said crap.

“Refuse. California has been requiring some version of this for decades, and many districts, including some of the biggest, simply refuse to do it.

“Delay. A time-honored technique in education, known as Wait This New Foolishness Out Until It Is Replaced By The Next Silly Thing. It persists because it works so often.

“Obscure. Many districts are using loopholes and slack to find ways to substitute administrative judgment for the Rule of Data. They present Delaware as an example of how futzing around has polluted the process and buttress that with a chart that shows statewide math score growth dropping while teacher eval scores remain the same.

“Uniformly high ratings on classroom observations, regardless of how much students learn, suggest a continued disconnect between how much students grow and the effectiveness of their teachers.

“Maybe. Or maybe it shows that the data about student growth is not valid.

“They also present Florida as an example of similar futzing. This time they note that neighboring districts have different distributions of ratings. This somehow leads them to conclude that administrators aren’t properly incorporating student data into evaluations.

“In neither state’s case do they address the correct way to use math scores to evaluate history and music teachers.”

After carefully pulling apart the report, Greene reviews the authors' recommendations and then sets out the conclusions, theirs and his:

“It’s not a fancy-pants thinky tank paper until you tell people what you think they should do. So Adelman and Chuong have some ideas for policymakers.

“Track data on various parts of new systems. Because the only thing better than bad data is really large collections of bad data. And nothing says Big Brother like a large centralized data bank.

“Investigate with local districts the source of evaluation disparities. Find out if there are real functional differences, or the data just reflect philosophical differences. Then wipe those differences out. “Introducing smart timelines for action, multiple evaluation measures including student growth, requirements for data quality, and a policy to use confidence intervals in the case of student growth measures could all protect districts and educators that set ambitious goals.”

“Don’t quit before the medicine has a chance to work. Adelman and Chuong are, for instance, cheesed that the USED postponed the use of evaluation data on teachers until 2018, because those evaluations were going to totally work, eventually, somehow.

“Don’t be afraid to do lots of reformy things at once. It’ll be swell.

“Their conclusion

“Stay the course. Hang tough. Use data to make teacher decisions. Reform fatigue is setting in, but don’t be wimps.

“My conclusion

“I have never doubted for a moment that the teacher evaluation system can be improved. But this nifty paper sidesteps two huge issues.

“First, no evaluation system will ever be administrator-proof. Attempting to provide more oversight will actually reduce effectiveness, because more oversight = more paperwork, and more paperwork means that the task shifts from “do the job well” to “fill out the paperwork the right way” which is easy to fake.

“Second, the evaluation system only works if the evaluation system actually measures what it purports to measure. The current “new” systems in place across the country do not do that. Linkage to student data is spectacularly weak. We start with tests that claim to measure the full breadth and quality of students’ education; they do not. Then we attempt to create a link between those test results and teacher effectiveness, and that simply hasn’t happened yet. VAM attempted to hide that problem behind a heavy fog bank, but the smoke is clearing and it is clear that VAM is hugely invalid.

“So, having an argument about how to best make use of teacher evaluation data based on student achievement is like trying to decide which Chicago restaurant to eat supper at when you are still stranded in Tallahassee in a car with no wheels. This is not the cart before the horse. This is the cart before the horse has even been born.”