Bob Shepherd is a polymath who worked in the education industry for decades and was recently a teacher in Florida. He spent many years developing standardized tests. He has written often about their poor quality, their lack of reliability and validity.

In this post, he explains why he has reached these conclusions:

The dirty secret of the standardized testing industry is the breathtakingly low quality of the tests themselves. I worked in the educational publishing industry at very high levels for more than twenty years. I have produced materials for all the major textbook publishers and most of the standardized test publishers, and I know from experience that quality control processes in the standardized testing industry have dropped to such low levels that the tests, these days, are typically extraordinarily sloppy and neither reliable nor valid. They typically have not been subjected to anything like the validation and standardization procedures used, in the past, with intelligence tests, the Iowa Test of Basic Skills, and so on. The mathematics tests are marginally better than are the tests in ELA, US History, and Science, but they are not great. The tests in English Language Arts are truly appalling. A few comments about those:

The new state and national standardized tests in ELA are invalid.

First, much of attainment in ELA consists of world knowledge–knowledge of what–the stuff of declarative memories of subject matter. What are fables and parables, in what ways are they similar, and in what ways do they differ? What are the similarities and differences between science fiction and fantasy? What are the parts of a metaphor? How does a metaphor work? What is metonymy? What are the parts of a metonymy? How does it differ from synecdoche? What is American Gothic? What are its standard motifs? How is it related to European Romanticism and Gothic literature? How does it differ? Who are its practitioners? Who were Henry David Thoreau and Mary Shelley and what major work did each write and why is that work significant? What is a couplet? terza rima? a sonnet? What is dactylic hexameter? What is deconstruction? What is reader response? the New Criticism? What does it mean to begin in medias res? What is a dialectical organizational scheme? a reductio ad absurdum? an archetype? a Bildungsroman? a correlative conjunction? a kenning? What’s the difference between Naturalism and Realism? Who the heck was Samuel Johnson, and why did he suggest kicking that rock? Why shouldn’t maidens go to Carterhaugh? And so on. The so-called “standards” being tested cover ALMOST NO declarative knowledge and so miss much of what constitutes attainment in this subject. Imagine a test of biology that left out almost all world knowledge and covered only biology “skills” like–I don’t know–slide-staining ability–and you’ll get what I mean here. This has been a MAJOR problem with all of these summative standardized tests in ELA since their inception. They are almost entirely content free. They don’t assess what students ought to know. Instead, they test, supposedly, a lot of abstract “skills”–the stuff on the Gates/Coleman Common [sic] Core [sic] bullet list, but they don’t even do that.

Second, much of attainment in ELA involves mastery of procedural knowledge–knowledge of what to do. E.g.: How do you format a Works Cited page? How do you plan the plot of a standard short story? What step-by-step procedure could you follow to do that? How do you create melody in your speaking voice? How do you revise to create sentence variety or to emphasize a particular point? What specific procedures can you carry out to accomplish these things? But the authors of these “standards” didn’t think that concretely, in terms of specific, concrete, step-by-step procedural knowledge. Instead, in imitation of the lowest-common-denominator-group-think state “standards” that preceded theirs, they chose to deal in vague, poorly conceived abstractions. The “standards” being tested define skills so vaguely and so generally that they cannot, as written, be sufficiently operationalized to be VALIDLY tested. They literally CANNOT be, as in, this is an impossibility on the level of building a perpetual motion machine or squaring the circle. Given, for example, the extraordinarily wide variety of types of narratives (jokes, news stories, oral histories, tall tales, etc.) and the enormous number of skills that it requires to produce narratives of various kinds (writing believable dialogue, developing a conflict, characterization via action, characterization via foils, showing not telling, establishing a point of view, using speaker’s tags properly, etc.), there can be no single question or prompt that tests for narrative writing ability IN GENERAL. This is a broad problem with the standardized ELA tests. Typically, they ask one or two multiple-choice questions per “standard.” But what one or two multiple-choice questions could you ask to find out if a student is able, IN GENERAL, to “make inferences from text” (the first of the many literature “standards” at each grade level in the Gates/Coleman bullet list)? Obviously, you can’t. There are three very different kinds of inference–induction, deduction, and abduction–and whole sciences devoted to problems in each, and texts vary so considerably, and types of inferences from texts do as well, that no such testing of GENERAL “inferring from texts” ability is even remotely possible. A moment’s clear, careful thought should make this OBVIOUS. So it is with most of the “standards” on the Gates/Coleman bullet list. And, of course, all this invalidity of testing for each “standard” can’t add up to overall validity, so the tests do not even validly test for what they purport to test for.

Third, nothing that students do on these exams even remotely resembles what real readers and writers do with real texts in the real world. Ipso facto, the tests cannot be valid tests of actual reading and writing. People read for one of two reasons—to find out what an author thinks or knows about a subject or to have an interesting, engaging, significant vicarious experience. The tests, and the curricula based on them, don’t help students to do either. Imagine, for example, that you wish to respond to this post, but instead of agreeing or disagreeing with what I’ve said and explaining why, you are limited to explaining how my use of figurative language (the tests are a miasma) affected the tone and mood of my post. See what I mean? But that’s precisely the kind of thing that the writing prompts on the Common [sic] Core [sic] ELA tests do and the kind of thing that one finds, now, in ELA courseware. This whole testing enterprise has trivialized responding to texts and therefore education in the English language arts generally. The modeling of curricula on the all-important tests has replaced normal interaction with texts with such freakish, contorted, scholastic fiddle faddle. English teachers should long ago have called BS on this.

He wrote to explain why all standardized tests are not equally invalid:

Standardized tests are not all the same, so talk about “standardized tests” in general tends to commit what linguistic philosophers call a “category error”—a type of logical fallacy. George Lakoff wrote a book about categorization called Women, Fire, and Dangerous Things. He took the title from the classification system for nouns of the indigenous Australian language Dyirbal. One of the noun categories in this language includes words referring to women, things with which one does violence (such as spears), phenomena that can kill (fire), and dangerous animals (such as snakes and scorpions). What makes this category bizarre to our ears is that the things in the category don’t actually share significant, defining characteristics. Women and things associated with them are not all dangerous. Speaking of all things balan (this category in the Dyirbal language) therefore doesn’t make sense. The same is true of the phrase “standardized test.” It lumps together objects that are DIFFERENT FROM one another in profoundly important ways. Imagine a category, “ziblac,” that includes greyhound buses, a mole on Socrates’s forehead, shoelaces, Pegasus, and the square roots of negative numbers. What could you say that was intelligible about things in the category “ziblac”? Well, nothing. Talking about ziblacs would inevitably involve committing category errors—assuming that things are similar because they share a category name when, in fact, they aren’t. If you say, “You can ride ziblacs” or “Ziblacs are imaginary” or “Ziblacs don’t exist,” you will often be spouting nonsense. Yes, some ziblacs belong to the class of things you can ride (greyhound buses, Pegasus), but some do not (shoelaces, imaginary numbers), and you can’t actually ride Pegasus because Pegasus exists only in stories. Some are imaginary (Pegasus, imaginary numbers), but they are imaginary in very different senses of the term. And some don’t exist (Pegasus, the mole on Socrates’s forehead), but don’t exist in very different ways (the former because it’s fictional, the latter because Socrates died a long time ago). When we talk of “standardized tests,” we are using such an ill-defined category, and a lot of nonsense follows from that fact.

Please note that there are many VERY DIFFERENT definitions of what “standardized test” means. The usual technical definition from decades ago was “a test that had been standardized, or normalized.” This means that the raw scores on the test had been converted to express them in terms of “standard scores”–their number of standard deviations from the mean. You do this by starting with the raw score on a test, subtracting the population mean from it, and then dividing the difference by the population standard deviation. The result is a Z-score (or a T-score if the mean is taken to be 50 and the standard deviation is taken to be 10). People do this kind of “standardizing,” or “normalization,” in order to compare scores across students and subpopulations. Let’s call this “Standardized Test Definition 1.” Many measures converted in such a way yield a so-called “bell curve” because they deal with characteristics that are normally distributed. An IQ test is supposed to be a test of this type. The Stanford 10 is such a Standardized Test, Definition 1.
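To make the arithmetic of Definition 1 concrete, here is a minimal sketch in Python. The raw scores are invented purely for illustration, and the function names `z_score` and `t_score` are hypothetical labels, not anything taken from a testing company's software. The T-score line assumes the common convention of rescaling a Z-score to a mean of 50 and a standard deviation of 10.

```python
from statistics import mean, pstdev

def z_score(raw, pop_mean, pop_sd):
    """Number of standard deviations a raw score lies from the mean."""
    return (raw - pop_mean) / pop_sd

def t_score(z):
    """Rescale a Z-score so that the mean is 50 and the SD is 10."""
    return 50 + 10 * z

# Hypothetical raw scores for a small population of test takers.
raw_scores = [60, 70, 80, 90, 100]

mu = mean(raw_scores)       # population mean
sigma = pstdev(raw_scores)  # population standard deviation

for raw in raw_scores:
    z = z_score(raw, mu, sigma)
    print(f"raw={raw:3d}  z={z:+.2f}  T={t_score(z):.1f}")
```

A student who scores exactly at the mean gets a Z-score of 0 and a T-score of 50, whatever the raw scale of the test happens to be, which is what makes comparison across students and subpopulations possible.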

Another, much broader definition is “any test that is given in a consistent form, following consistent procedures.” Let’s call this “Standardized Test Definition 2.” To understand how dramatically this definition of “standardized test” differs from the first one, consider the following distinction: A norm-referenced test is one in which student performance is ranked based on comparison with the scores of his or her peers, using normalized, or standardized, scores. One of the reasons for standardized scores as per Definition 1, above, is to do such comparisons to norms. A criterion-referenced test is one in which student performance is evaluated against some absolute criterion—knowledge or mastery of some set of facts or skills. Which kind of scoring one does depends on what one is interested in—how the student compares with other students (norm-referenced) or whether the student has achieved some absolute “standard”—has or has not demonstrated knowledge of some set of facts or some skill (criterion-referenced). So, Standardized Test Type 2 is a much broader category, and includes both norm-referenced tests and criterion-referenced tests. In fact, any test can be looked at in the norm-referenced or criterion-referenced way, but which one does makes a big difference. In the case of criterion-referenced tests, one is interested in whether little Johnny knows that 2 + 2 = 4. In the case of norm-referenced tests, one is interested in whether little Johnny is more or less likely than students in general to know that 2 + 2 = 4. The score for a criterion-referenced test is supposed to measure absolute attainment. The score for a norm-referenced test is supposed to measure relative attainment. When states first started giving mandated state tests, a big argument given for these was that states needed to know whether students were achieving absolute standards, not just how they compared to other students. So, these state tests were supposed to be criterion-referenced tests, in which the reported score was a measure of absolute attainment rather than relative attainment, which brings us to a third definition.
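The point that any one test can be read in either the norm-referenced or the criterion-referenced way can be sketched in a few lines of Python. Everything here is hypothetical: the peer scores, the cut score of 24, and the function names are made up for illustration, and the percentile rank uses one simple convention (the percentage of peers scoring strictly below the student) out of several in use.

```python
def meets_criterion(score, cut_score):
    """Criterion-referenced reading: did the student reach the fixed bar?"""
    return score >= cut_score

def percentile_rank(score, peer_scores):
    """Norm-referenced reading: percent of peers scoring strictly below."""
    below = sum(1 for s in peer_scores if s < score)
    return 100.0 * below / len(peer_scores)

# Hypothetical raw scores (out of 30) for ten test takers.
peers = [12, 15, 15, 18, 20, 22, 25, 27, 28, 30]
johnny = 22
cut_score = 24  # hypothetical "proficient" threshold

print("Meets criterion:", meets_criterion(johnny, cut_score))
print("Percentile rank:", percentile_rank(johnny, peers))
```

With these invented numbers, the same raw score of 22 falls short of the absolute bar while sitting at the middle of the peer group, so the two readings answer different questions about the same performance.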

Yet another definition of “Standardized Test” is “any test that [supposedly] measures attainment of some standard.” Let’s call this “Standardized Test Definition 3.” This brings us to a MAJOR source of category error in discussions of standardized testing. The “standards” that Standardized Tests, Definition 3 supposedly measure vary enormously because some types of items on standards lists, like the CC$$, are easily assessed both reliably (yielding the same results over repeated administrations or across variant forms) and validly (actually measuring what they purport to measure), and some are not. In general, Math standards, for example, contain a lot more reliably and validly assessable items (the student knows his or her times table for positive integers through 12 x 12) than do ELA standards, which tend to be much more vague and broad (e.g., the student will be able to draw inferences from texts). As a result, the problems with the “standardized” state Math tests tend to be quite different from the problems with the state ELA tests, and when people speak of “standardized tests” in general, they are talking about very different things. Deformers simply assume that if people have paid a dedicated testing company to produce a test, that test will reliably and validly test its state standards. This is demonstrably NOT TRUE of the state tests in ELA for a lot of reasons, many of which I have discussed here: Basically, the state ELA tests are a scam.

Understanding why the state ELA tests are a scam requires detailed knowledge of the tests themselves, which proponents of the tests either don’t have or have but aren’t going to talk about because such proponents are owned by or work for the testing industry. Education deformers and journalists and politicians tend, in my experience, to be EXTRAORDINARILY NAÏVE about this. Their assumption that the ELA tests validly measure what they purport to measure is disastrously wrong.

Which leads me to a final point: Critiques of the state standardized tests are often dismissed by Ed Deformers as crackpot, fringe stuff, and that’s easy for them to do, alas, because some of the critiques are. For example, I’ve read on this blog comments from some folks to the effect that intellectual capabilities and accomplishments can’t be “measured.” The argument seems to be based on the clear differences between “measurement” as applied to physical quantities like temperature and height and “measurement” as applied to intellectual capabilities and accomplishments. The crackpot idea is that the former is possible, and the latter is not. However, it is OBVIOUSLY possible to measure some intellectual capabilities and accomplishments very precisely. I can find out, for example, very precisely how many Kanji (Japanese logograms) you know, if any, or whether you can name the most famous works by Henry David Thoreau and Mary Shelley and George Eliot and T.S. Eliot. If you choose to disdain the use of the term “measurement” to refer to assessment of such knowledge, that’s simply an argument about semantics, and making such arguments gives opponents of state standardized testing a bad name—such folks get lumped together, by Ed Deformers, with folks who make such fringe arguments.