After my last book was published, I did some radio interviews and got some interesting feedback.
One of the most informative responses came from a distinguished professor emeritus at the University of Michigan, Harry Frank, who has written textbooks about measurement and evaluation.
His observations about testing and evaluation were brilliant. What he wrote helped me understand why NCLB had failed. As I re-read this letter, I understood better why Race to the Top will fail. For one thing, it assumes that the same tests may be used both to evaluate the teacher and to counsel the teacher. What this does, he says, is to promote cheating and teaching to the test.
And Professor Frank explains why student evaluations distort the educational process.
Professor Frank gave me his permission to reprint his letter.
I am by training a social psychologist, with a subspecialty and one-time consulting practice in testing and measurement. When the Flint campus sought its first accreditation independent of the main (Ann Arbor) campus, the provost established an ad-hoc committee to develop assessment procedures. I spent nine years on the committee, my last couple as its chair. The procedures we developed became something of a model for the North Central Association of Schools and Colleges. They have worked extremely well precisely because they conformed to some very fundamental principles of validation, which No Child Left Behind blatantly (if not intentionally) ignored.
The first principle is that no assessment can be used at the same time for both counseling and for administrative decisions (retention, increment, tenure, promotion). As you emphasized (and as every organizational psychologist with an ounce of brains wailed when No Child was first described), all this does is promote cheating and teaching to the exam (much as does the Staatsexamen in Germany). This principle is so basic that it’s often covered in the very first chapter of introductory texts on workplace performance evaluation. Accordingly, in the very first meeting of the committee, we established an absolute firewall. Department chairs, deans, and executive committees would never be permitted to see individual raw data; they would see only departmental pooled data. This action did not immediately eliminate faculty resistance, but it went further in that regard than even you might imagine. The same should apply to K-12 teachers’ unions.
Like you, I don’t think the problem is testing–any more than the problem with a badly built house is with the hammers and saws. The problem in both cases is how potentially useful tools should be used. Many of the current difficulties would be reduced or eliminated if it were clear that:
(1) K-12 education is a developmental process, so assessment in schools is a developmental measure not a terminal measure. The concern should be with change not simply “score.”
(2) Assessment should be a counseling resource, not a source of extrinsic motivation, i.e., rewards and punishments for teachers, administrators, and school districts.
(3) Student evaluations are worse than useless; they are egregiously misleading. A 10-year study by the American Psychological Association indicated that student evaluations are correlated with only two factors: i. Students’ expected course grades compared with their expected grades in other courses. ii. Workload (negative correlation). For untenured faculty, course evaluations–if used for administrative decisions–therefore have the effect of motivating both grade inflation and the dumbing down of course content.
(4) Instruments and procedures must be national in scope and standardized in their administration and reportage (cf. your interview comments concerning the superior validity of the national examination vs. state examinations).
(5) Data should be clustered rather than pooled. That is, performance of mainstream students, students whose first language is not English, and developmentally disabled students should be examined separately. It is clearly inappropriate to compare overall scores for students in, say, Birmingham, Michigan, where an overwhelming majority are native speakers of English, with students in Taos, New Mexico, where English as a first language falls behind both Spanish and Tiwa.
(6) Teachers should never have access in advance to test questions or even precise content. They should be given global guidelines–general areas in which student competence is expected.
(7) Ideally, the procedures should make no attempt to be exhaustive. They should represent a random sampling of content, and the sample should change annually so that past tests cannot be used to prep students but can and should be used to familiarize students with the form of the questions, the level of detail expected, and so on.

This is a great explanation of the basic principles of assessment; thanks for reprinting it.
I take away from it a renewed conviction that standardized tests themselves are not inherently bad, but potentially very useful in making instructional decisions. The problem, as Dr. Frank emphasizes, is in the way(s) we use them.
If we can’t force policymakers to take courses in basic assessment principles, it would be great if schools of education, as part of becoming more rigorous, could better educate future teachers and administrators in assessment. It wouldn’t solve the problems of the way we are forced to use assessments, but it would enable us to engage more knowledgeably in the conversation. My local college of ed offers little to no instruction in such things.
“I take away from it a renewed conviction that standardized tests themselves are not inherently bad, but potentially very useful in making instructional decisions.”
No, they “are not inherently bad”; they are inherently evil. The harm that comes from labeling, sorting, and separating students into various categories, whether grades (A, B, C, etc.) or below basic, basic, proficient, or advanced (as used in Missouri’s End of Course exams), is great and more insidious than most realize.
No, they are not “potentially very useful”, they are completely invalid. Any conclusions drawn from invalid instruments are, in Wilson’s words “vain and illusory”.
Wilson has proven the invalidity of educational standards, standardized testing and grading as means of assessing students. In fifteen years there has been no rebuttal whatsoever anywhere to his 1997 dissertation “Educational Standards and the Problem of Error” which can be found at: http://epaa.asu.edu/ojs/article/view/577 . I challenge you to read and understand this work. Please show me where he is wrong/incorrect in his argument.
For a further (and much shorter) takedown of standardized testing, see Wilson’s “A Little Less than Valid: An Essay Review,” to be found at: http://epaa.asu.edu/ojs/article/view/577 .
Well, with Dr. Frank firmly in the psychometric camp, of course he believes that these invalid instruments can be useful, albeit in a very limited manner. “Like you, I don’t think the problem is testing–any more than the problem with a badly built house is with the hammers and saws.” Maybe the badly built house (standardized testing) was badly built due to using badly assembled (illogical and irrational) beliefs, starting with the belief that student learning can be measured (“. . . so assessment in schools is a developmental measure not a terminal measure”) and that standardization is indeed needed (“Instruments and procedures must be national in scope and standardized in their administration and reportage”). Any conclusions drawn from invalid instruments are, in Wilson’s words, “vain and illusory.”
I challenge Dr. Frank to read and rebut what N. Wilson has shown to be the total invalidity of the processes involved in educational standards and standardized testing “Educational Standards and the Problem of Error” which can be found at: http://epaa.asu.edu/ojs/article/view/577 . Diane, since you are in contact with Dr. Frank, I kindly ask that you forward this to him as I would be interested in hearing a rebuttal to Wilson’s argument, especially since I haven’t found one yet.
Duane, I went to the link you posted and am trying to understand Wilson’s argument. I cannot get past this section of Part 1, Chapter 1:
“Analysis of such discourses may not be used to determine the truth. Yet such analyses may be very sensitive to the uncovering of untruths, by determining the extent to which they embody “incoherencies, distortions, structured omissions and negations which in turn expose the inability of the language of ideology to produce coherent meaning” (Codd, 1988, p245). How would such untruths be established?
First, by uncovering self contradictions, within the overt discourse, or between the unstated assumptions of the discourse and the facts that the discourse establishes.
Second, by exposing false claims, claims that may be shown with empirical evidence constructed within its own frame of reference to be untrue.
Third, by detailing some of the psychometric fudges on which many assessment claims depend to maintain their established meaning.
Fourth, by indicating how repositioning the discourse may dramatically change its truth value.
Fifth, by establishing four discrete epistemological frames of reference for assessment discourse as currently constructed, and indicating the confusion when one frame is viewed from the perspectives of the others.
Sixth, by noticing frame shifts within a particular discourse, with the resulting confusion of meaning.
Seventh, by exposing the ontological slides and epistemological camouflages necessary to sustain many truth claims.”
Can you explain what he means by this?
Thank you.
Ed,
I think you should read the column I wrote a year ago about the National Research Council’s “Incentives and Test-Based Accountability,” and then read the report itself. Seventeen distinguished social scientists reviewed the evidence and concluded that tying punishments and rewards to test scores doesn’t improve education, barely improves test scores, and leads to very negative consequences. Not only teachers cheat, but principals cheat, districts cheat and states cheat. Scores get artificially inflated, kids get excluded from the testing pool. It is amazing and sad what people do to avoid being fired. If we focus relentlessly on test scores and then discover that these same kids need remediation in college, what exactly is the point? Diane
For the source you provided (and for key input they relied upon), is this a fair definition for a high-stakes test:
A test was considered high stakes if its results had perceived or real consequences for students, staff members, or schools?
Sorry for the delayed response but when the tomatoes start coming in we don’t have a lot of time to turn them around.
Thank you for actually going and reading what Wilson has to say. I’ve found very few to actually do so.
But at the same time, I understand the frustration with the wording and with what Wilson is attempting to explain. I have read this study well over a dozen times and still get more from it every time. So hopefully I can help a little. My suggestion is to read the entire work, struggle through it (it’s not easy) and then reread it a number of times. What Wilson has to say has never been rebutted. As a matter of fact, as he points out, it is almost blasphemous and traitorous to society, especially in the world of education as we know it, to even bring up what he does.
The first concern that you raise means that when someone proposes a course of action and gives rationales to justify said action, if their rationales contradict themselves, then the rationales are not valid.
Second, that by “exposing” said claims one establishes the fact that said claims are false and therefore should be at least seen as suspect and more properly seen as false and invalid.
Third, what Wilson is attempting to do and does is establish the fact that psychometrics contain inner contradictions that negate their validity.
Fourth, by “repositioning” he means taking all the fudges by the psychometricians and viewing them through the lens of the person who is taking the “exam,” exposing the invalidities that are inherent in the process, which are usually, if not always, denied or muted by the test makers.
Fifth, there are different beginning points of reference from which the “practitioner” starts. There are four very different points of view on how to “assess” a student’s performance, and to confuse and conflate these points of view constitutes an error of logical thinking.
Sixth, to piggyback on the prior comment, when one confuses and conflates different modes of “assessment,” one contributes to the inherent errors.
And seventh, one has to understand the difference between epistemology and ontology. From Merriam-Webster: “Ontology: 1: a branch of metaphysics concerned with the nature and relations of being. 2: a particular theory about the nature of being or the kinds of things that have existence.” “Epistemology: the study or a theory of the nature and grounds of knowledge especially with reference to its limits and validity.”
An “ontological slide” is one that, in relation to standardized testing, denies the fact that teaching and learning is a “quality” and not a quantity. The epistemological camouflage is the claim that standardized testing is reliable and valid from the point of view of the test maker, without considering the invalidities involved in the process.
Wilson shows that educational standards and standardized testing, and “grades” by extension, are false from both an ontological and an epistemological point of view.
I hope that helps.
Again, Ed, thanks for the questions and commentary. I cannot do justice to what Noel has written, although I try, and your questions help me further understand and explain what Wilson has to say. Please read the entire work and I think it will “come together” a lot more for you.
Duane
I must confess that I failed Duane Swacker’s challenge. I found Wilson’s manifesto to be a largely incomprehensible mix of epistemology, social criticism, and pedagogical philosophy. Within its own universe of discourse, it may or may not be a worthwhile contribution. I have no basis for judgment on this matter, so I’ll confine my observations to the particulars of Mr. Swacker’s post.
I find it interesting that the writer asserts standardized tests to be “inherently evil” and then supports this curious assertion with examples that illustrate the very point to which he takes exception: that the problem is misuse of tests, including the assignment of labels that can haunt students throughout their academic careers and beyond. Tests do not label students; school officials label students.
This being said, the writer’s focus on categorization and labeling of children fails to appreciate that, if anything, assessment involves categorization and labeling of teachers and of programs. In many situations, students can remain anonymous–assigned code numbers seen only by researchers. Protection of student information is yet another reason to maintain a firewall between those who see raw data and those who make administrative decisions. The writer’s concern and sympathies would be more appropriately directed toward teachers (and schools) who can, if decision makers are privy to raw data, be mislabeled as ineffective for reasons that have little to do with teacher performance.
Mr. Swacker further asserts that educational tests are completely invalid. If one begins with the assumption that learning cannot be measured, which seems to be at least one thrust of Wilson’s polemic, then of course he is absolutely correct. (I want to emphasize “seems,” as I make no claim to having deciphered Wilson’s writing.) And, the same would be true of any examination, test, or quiz administered in any class by any teacher.
Finally, I have difficulty understanding Mr. Swacker’s objection to standardization of test administration and reportage of test results. To me, at least, it seems rather fundamental that one cannot assess the effectiveness of instructors and programs if different instructors, schools, districts, or states use different yardsticks. Is a teacher whose students all receive “A” grades a better teacher than one who assigns “A” grades to only ten percent of his or her students? Perhaps the first teacher in this example writes easier tests or teaches easier content. Even if the same test is used, do we get meaningful comparisons of teachers or programs if one teacher allows 60 minutes for completion of the test and another teacher allows 90 minutes?
NCLB and Race to the Top may fail for a number of reasons, but not for the reasons stated here.
Professor Frank says:
“The first principle is that no assessment can be used at the same time for both counseling and for administrative decisions (retention, increment, tenure, promotion). As you emphasized (and as every organizational psychologist with an ounce of brains wailed when No Child was first described), all this does is promote cheating…”
Of course many assessments (within many fields) are both high-stakes and sources of counseling. If you have ever received a performance review by a thoughtful and effective leader you may have simultaneously been given good insights about how to improve your performance while learning that you received a 1% pay raise and the promotion you hoped for will have to wait another year.
Dr. Frank is correct when he says high-stakes tests promote cheating. The higher the stakes, the more compelled people feel to cheat…on just about anything in life. It is twisted logic, though, to say that because some people will cheat, we should not do that thing.
That logic leads to the following faulty conclusions:
Business leaders fudge earnings reports to stay in power. We should do away with corporate earnings reports?
Cyclists and baseball players take performance-enhancing drugs. We should do away with sports competitions?
Young adults hire other young adults to take the SAT on their behalf. We should do away with the SAT? (Maybe yes, but not for this reason).
A young child is worried that, if she doesn’t get a good grade on her next spelling test, her cell phone will go in timeout for 1 week. (Talk to young people – this scenario is VERY high stakes). She writes the answers on her hand before the test. We should do away with all tests in school?
An important point to remember in all of these examples is that we don’t have 100% of people cheating. We don’t have 50% or 25% of people cheating. Maybe 10% in the case of baseball and cycling at the peak of cheating. Less in the other examples I gave.
The solution then is to take measures that reduce the opportunity for cheating, educate on the importance of integrity and fairness, and increase the consequences for those who cheat.
Unless you believe that laws create criminals in which case we should do away with laws.
“If you have ever received a performance review by a thoughtful and effective leader”
Boy, if there is one thing that gets my goat, it is the concept of “leadership,” especially in relation to education. What a bunch of hierarchical nonsense.
Have I learned a lot over the years from those who happen to have been above me in an organizational chart? Sometimes, not often, but probably more important is that I have learned way more from those who were “below” me. I didn’t start teaching until I was 38.
I’ve learned a hell of a lot more from my students (I survey them at the end of the year), “below” me than I’ve ever learned from an administrator “above” me.
Sorry, “educational leadership” is pure 100% grade AAA bovine excrement.
The answers begged by Ed Turley’s precious little rhetorical analogies are unassailable. Rather like Microsoft technical support–inarguably correct and mostly irrelevant. The cheating in his examples is analogous to cheating by students taking assessment examinations. The cheating encouraged by linking teacher compensation to student performance is cheating on the part of teachers, administrators, and boards of education, not students.
Hi Harry. My analogies hold up very well whether you are pointing to students or the people who are supposed to be leading them. I’ll add some more.
Perhaps you saw the cheating that went on by The New Orleans Saints this past season. The coach offered cash bounties to players who injured opposing players. That’s the coach cheating – part of the team’s administration.
So, if winning in pro football is high-stakes, and high-stakes are bad, let’s disband pro football because pro football causes cheating. Ummm, no.
Perhaps pro football is an unacceptable analogy. After all, we’re talking about children here.
Here is a good example of high-stakes cheating in Little League baseball:
http://www.kidzworld.com/article/1277-the-real-little-league-cheaters
In this case, the boy’s father was the main culprit behind the cheating. Little League baseball is high-stakes when teams get to the Little League World Series. I guess we should put an end to the Little League World Series because it makes the adults cheat.
Actually we’d have to get rid of all sports for children because they cause the adults to cheat.
Have you ever seen how some adults will cheat and lie in child custody cases? You can’t get more high-stakes than that! But we shouldn’t have child custody hearings, we should just…do what???
Oh, and parents doing homework for their kids. After all, grades in school are high-stakes. They can make or break a kid getting into college (or the college parents want for their child). OK, let’s do away with homework. And let’s do away with grades too. They cause cheating!!!
Let me be very clear. Linking teacher compensation to student performance WILL CAUSE MORE CHEATING by parents, teachers, administrators, etc. compared to when nothing is at stake.
Where we part ways is what should be done about that. Apparently, you are on the side of stopping those high-stakes things that cause people to cheat. In education, you are saying that we can’t depend on the large majority of teachers (and other administrators) to resist cheating. I say that the vast, vast, vast majority of teachers are professionals, should be treated like professionals, and should be expected not to cheat. When the very, very small number of teachers and/or administrators do decide to cheat, I believe they should be barred from the profession for life.
Using the fact that some people will cheat when the stakes are high is a very poor argument for making all things no-stakes.