Les Perelman, a former professor of writing at MIT and the inventor of the BABEL Generator, has repeatedly exposed the quackery of computer-scored essays. If you want to learn how to generate an essay that will win a high score but make no sense, google the “BABEL Generator,” which was developed by Perelman and his students at MIT to fool the robo-graders. He explains here, in an original piece published nowhere else, why the American public needs an FDA for assessments, to judge their quality.
He writes:
An FDA for Educational Assessment, particularly for Computer Assessments
As a new and much saner administration takes over the US Department of Education, led by Secretary of Education Miguel Cardona, it is a good time, especially regarding assessment, to ask Juvenal’s famous question: “Who watches the watchmen?”
Several years ago, I realized that computer applications designed to assess student writing did not understand the essays they evaluated but simply counted proxies such as the length of an essay, the number of sentences in each paragraph, and the frequency of infrequently used words. In 2014, I and three undergraduate researchers, two from MIT and one from Harvard, developed the Basic Automatic B.S. Essay Language Generator, or BABEL Generator, which could in seconds generate 500-1,000 words of complete gibberish that received top scores from robo-grading applications such as e-rater, developed by the Educational Testing Service (ETS). I was able to develop the BABEL Generator because I was already retired and, aside from some consulting assignments, had free time for research unencumbered by teaching or service obligations. Even more important, I had access to those three undergraduate researchers, who provided substantial technical expertise. Much of their potential expertise, however, proved unnecessary, since after only a few weeks of development our first iteration of the BABEL Generator was able to produce gibberish such as
Society will always authenticate curriculum; some for assassinations and others to a concession. The insinuation at pupil lies in the area of theory of knowledge and the field of semantics. Despite the fact that utterances will tantalize many of the reports, student is both inquisitive and tranquil. Portent, usually with admiration, will be consistent but not perilous to student. Because of embarking, the domain that solicits thermostats of educatee can be more considerately countenanced. Additionally, programme by a denouncement has not, and in all likelihood never will be haphazard in the extent to which we incense amicably interpretable expositions. In my philosophy class, some of the dicta on our personal oration for the advance we augment allure fetish by adherents.
that received high scores from the five robo-graders we were able to access.
The Australian teachers’ unions enlisted me and the BABEL Generator to help the successful opposition to having the national K-12 writing tests scored by computer. ETS responded to Australia’s rejection by having three of its researchers publish a study, “Developing an e-rater Advisory to Detect Babel-generated Essays,” which described generating over 500,000 BABEL essays based on prompts from what are clearly the two essay tasks of the Graduate Record Examination (GRE), the essay portion of the PRAXIS teacher certification test, and the two essay sections of the Test of English as a Foreign Language (TOEFL), and comparing the BABEL essays to 384,656 actual essays from those tests. The result of this effort was an “advisory” from e-rater that would flag BABEL-generated gibberish.
Unfortunately, this advisory was a solution in search of a problem. The purpose of the BABEL Generator was to demonstrate, through an extreme example, that robo-graders such as e-rater could be fooled into giving high scores to undeserving essays simply because those essays contained the various proxies that constitute e-rater’s score. Candidates could not actually use the BABEL Generator while taking one of these tests, but they could use the same strategies that informed it, such as including long and rarely used words regardless of their meaning and inserting long, vacuous sentences into every paragraph.
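The proxy-counting scoring Perelman describes can be sketched in a few lines. Everything below is a hypothetical illustration: the features and weights are stand-ins for the kinds of proxies he names (essay length, sentences per paragraph, infrequently used words), not e-rater’s actual model.

```python
# Hypothetical sketch of proxy-based essay scoring as Perelman describes it.
# Features and weights are illustrative; this is NOT e-rater's actual model.

def proxy_score(essay: str) -> float:
    """Score an essay purely on surface proxies, ignoring meaning entirely."""
    words = essay.split()
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]

    length_score = min(len(words) / 500, 1.0)            # longer looks better, up to a cap
    rare_words = sum(1 for w in words if len(w) >= 10)   # crude stand-in for "rare" vocabulary
    rarity_score = min(rare_words / 25, 1.0)
    density_score = min(len(sentences) / max(len(paragraphs), 1) / 6, 1.0)

    # Arbitrary weighted sum mapped onto a 0-6 scale, like holistic essay rubrics.
    return round(6 * (0.4 * length_score + 0.35 * rarity_score + 0.25 * density_score), 1)

# Long, polysyllabic gibberish maxes out every proxy.
gibberish = ("Because of embarking, the domain that solicits thermostats "
             "of educatee can be more considerately countenanced. ") * 40
print(proxy_score(gibberish))        # → 6.0 on the illustrative 0-6 scale
print(proxy_score("The cat sat."))   # short, plain prose scores near the bottom
```

A scorer like this rewards exactly the strategies Perelman lists: padding length and sprinkling in long words raises the score regardless of whether the essay means anything.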
Moreover, the BABEL Generator is so primitive that there are much easier ways of detecting its essays. We did not expect our first attempt to fool all the robo-graders we could access to succeed, but because it did, we stopped; we had proved our point. One of the student researchers was taking physics at Harvard and hard-coded into BABEL’s responses some of the terminology of sub-atomic particles, such as neutrino, orbital, plasma, and neuron. E-rater and the other robo-graders did not seem to notice. A simple program scanning for these terms could have saved the trouble of generating a half-million essays.
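The “simple program scanning for these terms” that Perelman mentions could be as short as the sketch below. The term list comes from the examples in the text; the threshold and word-cleaning logic are assumptions added for illustration.

```python
# Minimal sketch of the cheap detector Perelman suggests: flag essays that
# drop BABEL's hard-coded particle-physics vocabulary into unrelated prompts.
# Term list is from the text; the threshold is an assumption.

BABEL_TELLTALES = {"neutrino", "orbital", "plasma", "neuron"}

def looks_babel_generated(essay: str, threshold: int = 2) -> bool:
    """Return True if the essay uses several of BABEL's stock physics terms."""
    words = {w.strip(".,;:!?").lower() for w in essay.split()}
    return len(BABEL_TELLTALES & words) >= threshold

print(looks_babel_generated(
    "The neutrino of curriculum will tantalize the plasma of semantics."))  # True
print(looks_babel_generated(
    "My summer vacation was restful and fun."))                             # False
```

A scan like this costs a few lines, which is Perelman’s point: ETS’s half-million-essay study was far more machinery than the problem required.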
ETS is not satisfied with just automating the grading of the writing portions of its various tests. ETS researchers have developed SpeechRater, a robo-grading application that scores the speaking sections of the TOEFL test. There is a whole volume of scholarly research articles on SpeechRater published by the well-respected Routledge imprint of the Taylor and Francis Group. However, the short biographies of the volume’s nineteen contributors list seventeen as current employees of ETS, one as a former employee, and only one with no explicit affiliation.
Testing organizations appear no longer to have a wide range of perspectives, or any perspective that runs counter to their very narrow psychometric outlook. This danger has long been noted. Carl C. Brigham, the eugenicist who later renounced the racial characterization of intelligence, and the creator of the SAT who then became critical of that test, wrote shortly before his death that research in a testing organization should be governed and implemented not by educational psychologists but by specialists in academic disciplines, since it is easier to teach them testing than to “teach testers culture.”
The obvious home for such a research organization is the US Department of Education. Just as the FDA vets the efficacy of drugs and medical devices, there should be an agency that verifies not only that assessments measure what they claim to measure but also that the instrument is not biased toward or against specific ethnic or socio-economic groups. An old analogy question on the SAT (which no longer has analogy items) read: “Runner is to marathon as: a) envoy is to embassy; b) martyr is to massacre; c) oarsman is to regatta; d) referee is to tournament; e) horse is to stable.” The correct answer is c: oarsman is to regatta. Unfortunately, there are very few regattas in the Great Plains or inner cities.
With all this online stuff, it’s easy to SCAM.
Teachers KNOW, not computer programs written by ??????
good point: ??????
Babel, wince, and re-bleat…
Words are flowing out
Like endless rain into a paper cup
They slither while they pass
They slip away across the universe… J. Lennon
Einstein: “If you can’t explain it simply, you don’t understand it well enough.”
“Being a genius or a leader doesn’t lie in using fancy jargon, big words, or explaining things in complex terms. Providing the best information in the world doesn’t mean a thing if it isn’t absorbable.”
“If you can’t explain it simply, you don’t understand it well enough.”
Said the man whose theory of gravity uses differential geometry and tensor calculus, which the vast majority never encounter in their education.
I’m guessing that it is as important to gear an explanation to the intended audience. I’m guessing Einstein wasn’t speaking to the vast majority of people when he explained his theory with differential geometry and tensor calculus, whatever that is.
Either you are too cynical or I am too contrary. Maybe both.
“Who watches the watchmen?” It is an interesting idea for standardized testing. The state tests that so many students take have never been subjected to serious scrutiny. These tests have never undergone norm referencing, which asks whether the test performs the same way when given to various groups of students. The pass-fail score is subject to the whims of politicians. If states want to fail lots of students, they set the proficiency cut higher. If they want lots of students to pass, they lower the cut scores. Proficiency is a totally subjective concept that is easily manipulated by the powers that be.
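The cut-score manipulation this commenter describes is easy to illustrate: the same score distribution yields very different “proficiency” rates depending on where the cut is set. The scores below are invented purely for illustration.

```python
# Hypothetical illustration: identical student performance, three different
# "proficiency" rates, depending only on where the cut score is placed.
scores = [12, 25, 33, 41, 48, 52, 57, 61, 66, 74, 79, 88]  # invented test scores

def pass_rate(scores, cut):
    """Fraction of students at or above the cut score."""
    return sum(s >= cut for s in scores) / len(scores)

for cut in (40, 55, 70):
    print(f"cut={cut}: {pass_rate(scores, cut):.0%} proficient")
# → cut=40: 75% proficient
# → cut=55: 50% proficient
# → cut=70: 25% proficient
```

Nothing about the students changed between the three lines; only the politically chosen cut did.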
I also read the Brigham excerpt, which was very prescient considering it was written in 1937. Even though he was a eugenicist, he understood that psychometrics is not an exact science and that it has the potential to be corrupted by the marketplace. This is exactly what has been happening with online testing. Software companies distribute testing software with lofty claims, but the software has never been analyzed or validated in any way. BABEL is the perfect example of exposing an imperfect program that has been used to falsely evaluate writing. Brigham stated, “Education can not to-day afford to let the sidewalk vendor dictate its objectives, but it can properly help him adapt his apparatus to its own purposes.”
By the way, I can remember when NYS implemented its Basic Competency Tests in the mid-1970s. I was still teaching high school, and like the regatta analogy, there was a non-fiction reading passage about insulation in homes. One of the Haitian newcomers asked me, “What is this insulation?” While I couldn’t say much about it during a test, my thought was that Haitians have no reason to know much about insulation.
It was just a few years ago that a task on the Massachusetts state tests asked third graders to write about what a wonderful time they had on a snow day. The problem was that there hadn’t been any snow days in the time these 8-year-olds were attending school, because of a couple of mild winters. Of course, it would never have occurred to the test writers, though it did to teachers, that a snow day in many households is considered nigh on to calamity, because parents need to work but the kids have no place to go.
Teachers were successful in advocating for the question not to count in the kids’ score.
If nothing else, I like the cut of his gibberish, to quote Ned Flanders.
Gibberish is in the eye of the beholder. See my comment below.
“Because of embarking, the domain that solicits thermostats of educatee can be more considerately countenanced. ”
Our furnace stopped working late last night and I have narrowed the problem down to a malfunctioning thermostat circuit and I must therefore agree with that statement.
I stand corrected and, accordingly, must agree with Professor Chomsky that “Colorless green ideas sleep furiously.”
A thermostat in the hand is worth two in the wall.
And a smartphone in the hand is worth a thousand thermostats and light switches in the wall.
“Agencies to infinity — and beyond”
We need another agency
To watch the ones we have
That’s filled with lots of agents, see
To monitor the bad
“Watchers of watcher watchers”
We need some watchers
To watch the watchers
And watch the watchers
Who watch the watchers
But who watches the watchers who watch the watchers?
Government agencies that are supposed to protect the public do not work if they are packed with people who want to dismantle those agencies, as we saw during the Trump years. And so far, the Biden administration is packing the DOE with corrupt and greedy fake reformers.
No, we don’t need no stinkin FDA for educational assessments. It’s like saying that we need an FDA to disprove that injecting bleach into your body cures Covid. The very foundation of the thought is specious, false, deadly.
The onto-epistemological foundations of the standards and testing malpractice regime have already been shown to be specious, false, and very harmful to student learning and to the teaching and learning process itself. Noel Wilson, in his never refuted nor rebutted 1997 dissertation, has already shown us that the absurdities, falsehoods, and errors involved in the standards and testing malpractice regime render any usage of the results of the process “vain and illusory,” in other words completely invalid.
We don’t need no stinkin FDA for educational assessments, indeed!
Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine. (updated 6/24/13 per Wilson email)
A description of a quality can only be partially quantified. Quantity is almost always a very small aspect of quality. It is illogical to judge/assess a whole category only by a part of the whole. The assessment is, by definition, lacking in the sense that “assessments are always of multidimensional qualities. To quantify them as unidimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. In attempting to quantify educational standards and standardized testing, the descriptive information about said interactions is inadequate, insufficient, and inferior to the point of invalidity and unacceptability.
A major epistemological mistake is that we attach, with great importance, the “score” of the student, not only onto the student but also, by extension, the teacher, school and district. Any description of a testing event is only a description of an interaction, that of the student and the testing device at a given time and place. The only correct logical thing that we can attempt to do is to describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student as it cannot be a description of the student but the interaction. And this error is probably one of the most egregious “errors” that occur with standardized testing (and even the “grading” of students by a teacher).
Wilson identifies four “frames of reference,” each with distinct assumptions (epistemological bases) about the assessment process from which the “assessor” views the interactions of the teaching and learning process: the Judge Frame (think of a college professor who “knows” the students’ capabilities and grades them accordingly); the General Frame (think of standardized testing that claims to have a “scientific” basis); the Specific Frame (think of learning by objective, as in computer-based learning, getting a correct answer before moving on to the next screen); and the Responsive Frame (think of an apprenticeship in a trade or a medical residency program, where the learner interacts with the “teacher” with constant feedback). Each category has its own sources of error, and more error is introduced into the process when the assessor confuses and conflates the categories.
Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”
In other words all the logical errors involved in the process render any conclusions invalid.
The test makers/psychometricians, through all sorts of mathematical machinations, attempt to “prove” that these tests (based on standards) are valid, i.e., errorless, or supposedly at least with minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations, the tests, and the results are. He is an advocate for the test taker, not the test maker. In doing so he identifies thirteen sources of “error,” any one of which renders the test making/giving/disseminating of results invalid. And a basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.
Having shown the invalidity, and therefore the unreliability, of the whole process, Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory.” In other words, start with an invalidity, end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel who finds the occasional acorn, a result may be “true”), or, to put it in more mundane terms, crap in, crap out.
And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”
In other words, it attempts to measure “’something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students, as the social rewards for some are not available to others who “don’t make the grade (sic).” Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?
My answer is NO!!!!!
One final note with Wilson channeling Foucault and his concept of subjectivization:
“So the mark [grade/test score] becomes part of the story about yourself and with sufficient repetitions becomes true: true because those who know, those in authority, say it is true; true because the society in which you live legitimates this authority; true because your cultural habitus makes it difficult for you to perceive, conceive and integrate those aspects of your experience that contradict the story; true because in acting out your story, which now includes the mark and its meaning, the social truth that created it is confirmed; true because if your mark is high you are consistently rewarded, so that your voice becomes a voice of authority in the power-knowledge discourses that reproduce the structure that helped to produce you; true because if your mark is low your voice becomes muted and confirms your lower position in the social hierarchy; true finally because that success or failure confirms that mark that implicitly predicted the now self-evident consequences. And so the circle is complete.”
In other words students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of the students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, the “I’m an “A” student” is almost as harmful as “I’m an ‘F’ student” in hindering students becoming independent, critical and free thinkers. And having independent, critical and free thinkers is a threat to the current socio-economic structure of society.
Stuck in modi again.
“Just as the FDA vets the efficacy of drugs and medical devices, there should be an agency that verifies not only that assessments are measuring what they claim to be measuring but also the instrument is not biased towards or against specific ethnic or socio-economic groups.”
No no no no no no no ad infinitum!
The assessments aren’t measuring anything. That’s just one of the many onto-epistemological errors that Wilson identifies. A standardized academic test is not a measuring device.
The most misleading concept/term in education is “measuring student achievement” or “measuring student learning”. The concept has been misleading educators into deluding themselves that the teaching and learning process can be analyzed/assessed using “scientific” methods which are actually pseudo-scientific at best and at worst a complete bastardization of rationo-logical thinking and language usage.
There never has been and never will be any “measuring” of the teaching and learning process and what each individual student learns in their schooling. There is and always has been assessing, evaluating, judging of what students learn but never a true “measuring” of it.
But, but, but, you’re trying to tell me that the supposedly august and venerable APA, AERA and/or the NCME have been wrong for more than the last 50 years, disseminating falsehoods and chimeras??
Who are you to question the authorities in testing???
Yes, they have been wrong, and I (and many others: Wilson, Hoffman, etc.) question those authorities and challenge them (or any of you other advocates of the malpractices that are standards and testing) to answer the following onto-epistemological analysis:
The TESTS MEASURE NOTHING, quite literally when you realize what is actually happening with them. Richard Phelps, a staunch standardized test proponent (he has written at least two books defending the standardized testing malpractices) in the introduction to “Correcting Fallacies About Educational and Psychological Testing” unwittingly lets the cat out of the bag with this statement:
“Physical tests, such as those conducted by engineers, can be standardized, of course [why of course of course], but in this volume, we focus on the measurement of latent (i.e., nonobservable) mental, and not physical, traits.” [my addition]
Notice how he is trying to assert by proximity that educational standardized testing and the testing done by engineers are basically the same, in other words a “truly scientific endeavor.” Asserting sameness by proximity is not a good rhetorical/debating technique.
Since there is no agreement on a standard unit of learning, there is no exemplar of that standard unit and there is no measuring device calibrated against said non-existent standard unit, how is it possible to “measure the nonobservable”?
THE TESTS MEASURE NOTHING for how is it possible to “measure” the nonobservable with a non-existing measuring device that is not calibrated against a non-existing standard unit of learning?????
PURE LOGICAL INSANITY!
The basic fallacy here is the confusing and conflating of metrological measuring (metrology is the scientific study of measurement) with measuring that connotes assessing, evaluating, and judging. The two meanings are not the same, and confusing and conflating them is a very easy way to make it appear that standards and standardized testing are “scientific endeavors,” objective and not subjective like assessing, evaluating, and judging.
Those supposedly objective results are used to justify discrimination against many students for their life circumstances and inherent intellectual traits.
C’mon test supporters, have at the analysis, poke holes in it, tell me where I’m wrong!
I’m expecting that I’ll still be hearing the crickets and cicadas of tinnitus instead of reading any rebuttal or refutation. Because there is no rebuttal/refutation!
It really surprises me that Perelman believes this.
He, of all people, should understand how subjective all the standardized testing is, since his BABEL Generator is based on that premise.
Yikes, no thanks, Mr Perelman. Had the fed govt refrained from promulgating fed-reviewed standards and aligned assessments– stayed in its own lane– the conversation would be moot. And what’s the history on how that happened? Hint: educators not involved, education research ignored.
“The FDA vets the efficacy of drugs and medical devices” is not a parallel to educational measurement, even if you frame it as “Just as…, there should be.” Medicine is a scientific field; drugs and medical devices have physical effects which can be monitored. Educational assessment is a pseudoscience imposed on a field where measurement is an art whose implementation loses validity in direct proportion to distance from the classroom.
Well said.
Agreed.
An FDA for testing would just put a “scientific” stamp of approval on pseudoscience.