After my last book was published, I did some radio interviews and got some interesting feedback.

One of the most informative responses came from a distinguished professor emeritus at the University of Michigan, Harry Frank, who has written textbooks about measurement and evaluation.

His observations about testing and evaluation were brilliant. What he wrote helped me understand why NCLB had failed, and as I re-read his letter, I understood better why Race to the Top will fail. For one thing, it assumes that the same tests may be used both to evaluate the teacher and to counsel the teacher. What this does, he says, is promote cheating and teaching to the test.

And Professor Frank explains why student evaluations distort the educational process.

Professor Frank gave me his permission to reprint his letter.

I am by training a social psychologist, with a subspecialty and one-time consulting practice in testing and measurement.  When the Flint campus sought its first accreditation independent of the main (Ann Arbor) campus, the provost established an ad-hoc committee to develop assessment procedures.  I spent nine years on the committee, my last couple as its chair.  The procedures we developed became something of a model for the North Central Association of Schools and Colleges.  They have worked extremely well precisely because they conformed to some very fundamental principles of validation, which No Child Left Behind blatantly (if not intentionally) ignored.

The first principle is that no assessment can be used at the same time both for counseling and for administrative decisions (retention, increment, tenure, promotion).  As you emphasized (and as every organizational psychologist with an ounce of brains wailed when No Child was first described), all this does is promote cheating and teaching to the exam (much as does the Staatsexamen in Germany).  This principle is so basic that it’s often covered in the very first chapter of introductory texts on workplace performance evaluation.

Accordingly, in the very first meeting of the committee, we established an absolute firewall.  Department chairs, deans, and executive committees would never be permitted to see individual raw data; they would see only departmental pooled data.  This action did not immediately eliminate faculty resistance, but it went further in that regard than even you might imagine.  The same should apply to K-12 teachers’ unions.

Like you, I don’t think the problem is testing–any more than the problem with a badly built house is the hammers and saws.  The problem in both cases is how potentially useful tools are used.  Many of the current difficulties would be reduced or eliminated if it were clear that:

(1)  K-12 education is a developmental process, so assessment in schools is a developmental measure, not a terminal measure.  The concern should be with change, not simply with “score.”

(2)  Assessment should be a counseling resource, not a source of extrinsic motivation, i.e., rewards and punishments for teachers, administrators, and school districts.

(3)  Student evaluations are worse than useless; they are egregiously misleading.  A 10-year study by the American Psychological Association indicated that student evaluations are correlated with only two factors:

   i.  students’ expected course grades compared with their expected grades in other courses; and

   ii. workload (a negative correlation).

For untenured faculty, course evaluations–if used for administrative decisions–therefore motivate both grade inflation and the dumbing down of course content.

(4)  Instruments and procedures must be national in scope and standardized in their administration and reporting (cf. your interview comments concerning the superior validity of the national examination vs. state examinations).

(5)  Data should be clustered rather than pooled.  That is, performance of mainstream students, students whose first language is not English, and developmentally disabled students should be examined separately.  It is clearly inappropriate to compare overall scores for students in, say, Birmingham, Michigan, where an overwhelming majority are native speakers of English, with students in Taos, New Mexico, where English as a first language falls behind both Spanish and Tiwa.

(6)  Teachers should never have access in advance to test questions or even their precise content.  They should be given global guidelines–general areas in which student competence is expected.

(7)  Ideally, the procedures should make no attempt to be exhaustive.  They should represent a random sampling of content, and the sample should change annually so that past tests cannot be used to prep students but can and should be used to familiarize students with the form of the questions, the level of detail expected, and so on.