Walter Stroup is chair of the department of STEM education and teacher development and an associate professor at the University of Massachusetts Dartmouth. In 2014, as a professor at the University of Texas, he publicly testified that the state was wasting hundreds of millions of dollars on standardized testing because the only thing the tests measured was skill at passing standardized tests. This was hugely embarrassing to Pearson, which had a $500 million contract with the state of Texas. Recently, Professor Stroup sent a letter to the Houston Chronicle supporting its editorial calling for a pause in standardized testing for 2020-21.

I asked if I could post his response here.

He wrote:

[Response to July 22, 2020 “Editorial: What Gov. Abbott should do about STAAR testing this year for Texas schools.”]

As researchers and longtime education advocates, we support the conclusions of the July 22, 2020 “Editorial: What Gov. Abbott should do about STAAR testing this year for Texas schools.” Before our school system can run as normal, it will need to learn to walk again. And we shouldn’t leave obstacles in its path that might make it stumble.

We agree that state-mandated standardized exams should be the “last thing” students and teachers need to worry about. But that’s not enough. To support our schools and teachers, the next question has to be: if not STAAR, then what?

There is indeed a substantial body of research showing that current tests are “invalid indicators of student progress and ineffective in closing the so-called educational achievement gap.” We also agree with Texas Education Commissioner Mike Morath that we need shared measures of student progress if we are all to be held accountable for the educational outcomes in our schools.

To start our thinking about what might come next, we should ask whether STAAR tests are useful to teachers, the first responders of our school system. For that matter, are the products of Northwest Evaluation Association (NWEA), one of the largest vendors of non-high-stakes tests in Texas, useful to teachers?

We believe the answer is a resounding “no.”

Although well intended, these tests measure the wrong kind of growth. Not only does this make them the wrong kind of tool to evaluate student achievement and institutional quality, it also means the tests themselves have become an instrument for preserving inequities in students’ educational outcomes.

When it comes to test development and scoring, two kinds of growth can be assessed.

“Growth” can be evaluated relative to achievement – how much students have learned. Or “growth” can be evaluated on a scale similar to measurements of height. Just as children get taller with age, they also get generally better at certain kinds of problem-solving tasks.

It makes a world of difference which kind we use if we want to help schools recover.

The first kind of growth – in achievement – is the only kind for which schools can, and should, be held accountable. We send children to school because that’s where we learned to read, write and do mathematics, and we want the same for them. Tests, to be useful in improving student outcomes, must be highly sensitive to differences in what schools do – sensitive to good teaching.

Unfortunately, current test development methodologies give us tests that behave, in almost every significant sense, like measures of biological growth, not measures of achievement.

If we buy a thermometer to measure temperature, put it in a pot of hot water, and the numbers barely change, that’s a problem. If we buy a whole box of thermometers that all behave the same way, the problem is even bigger. Our current box of tests has been shown to have very little sensitivity to the educational equivalent of temperature change: differences in the quality of instruction.

As for what kind of growth current tests assess, the evidence is equally clear. The grade-related growth curves the test vendor NWEA shares on its website are remarkably similar to the curves pediatricians use to chart children’s height.

Age-related or grade-related mental growth metrics can’t be used to improve educational outcomes – they simply aren’t meant to help us become mentally “taller.” Compounding the problem, they have a long history of lending support to oppressive ideologies and practices. In effect, tests fully intended to help address structural inequalities in our educational system end up having the opposite effect: keeping groups of students in the same relative position year after year, and across subject areas.

What are the alternatives?

Here are just some of the possibilities. Pattern-based items (PBIs) provide up to eight times more achievement-specific information per question than current items and have been deployed at scale across Texas. Performance-based assessments are being used in New Hampshire. “Badges” are being used in a number of industries as part of digital credentialing programs. Portfolio-based assessment has a long history of use in a wide array of educational settings.

The last time our legislators gathered in Austin, they passed a bill, HB-3906, directing the Texas Education Agency to “establish a pilot program” in which participating school districts would “administer to students integrated formative assessment instruments for subjects or courses for a grade level subject to assessment.” Now is the time to pilot alternative assessments that will help schools and teachers do what they do best – educate our children.

Walter Stroup lives in Austin, Texas, and is chair of the department of STEM education and teacher development and an associate professor at the University of Massachusetts Dartmouth.

Anthony Petrosino is associate dean for research and outreach in Southern Methodist University’s Simmons School.

Link to the editorial we were responding to:

Related links (links are also in the text above):

What was published in the Houston Chronicle

An op-ed in The Dallas Morning News discussing research on current tests

NWEA’s growth curve

CDC growth curves used by pediatricians