Les Perelman, former professor of writing at MIT and inventor of the BABEL Generator, has repeatedly exposed the quackery of computer scoring of essays. If you want to learn how to generate an essay that will win a high score but make no sense, google the “BABEL Generator,” which Perelman and his students developed to fool the robo-graders. He explains here, in an original piece published nowhere else, why the American public needs an FDA for assessments to judge their quality.

He writes:

An FDA for Educational Assessment, Particularly for Computer Assessments

As a new and much saner administration, led by Secretary of Education Miguel Cardona, takes over the US Department of Education, it is a good time, especially regarding assessment, to ask Juvenal’s famous question: “Who watches the watchmen?”

Several years ago, I realized that computer applications designed to assess student writing did not understand the essays they evaluated but simply counted proxies such as the length of an essay, the number of sentences in each paragraph, and the frequency of infrequently used words. In 2014, three undergraduate researchers, two from MIT and one from Harvard, and I developed the Basic Automatic B.S. Essay Language Generator, or BABEL Generator, which could in seconds generate 500–1,000 words of complete gibberish that received top scores from robo-grading applications such as e-rater, developed by the Educational Testing Service (ETS). I was able to develop the BABEL Generator because I was already retired and, aside from some consulting assignments, had free time for research unencumbered by teaching or service obligations. Even more important, I had access to those three undergraduate researchers, who provided substantial technical expertise. Much of their potential expertise, however, proved unnecessary: after only a few weeks of development, our first iteration of the BABEL Generator was able to produce gibberish such as

Society will always authenticate curriculum; some for assassinations and others to a concession. The insinuation at pupil lies in the area of theory of knowledge and the field of semantics. Despite the fact that utterances will tantalize many of the reports, student is both inquisitive and tranquil. Portent, usually with admiration, will be consistent but not perilous to student. Because of embarking, the domain that solicits thermostats of educatee can be more considerately countenanced. Additionally, programme by a denouncement has not, and in all likelihood never will be haphazard in the extent to which we incense amicably interpretable expositions. In my philosophy class, some of the dicta on our personal oration for the advance we augment allure fetish by adherents.

That passage received high scores from the five robo-graders we were able to access.

The BABEL Generator and I were enlisted by Australian teachers’ unions to support the successful opposition to having the national K-12 writing tests scored by computer. The Educational Testing Service’s response to Australia’s rejection was to have three of its researchers publish a study, “Developing an e-rater Advisory to Detect Babel-generated Essays,” describing how they generated over 500,000 BABEL essays based on prompts from what are clearly the two essay sections of the Graduate Record Examination (GRE), the essay portion of the PRAXIS teacher certification test, and the two essay sections of the Test of English as a Foreign Language (TOEFL), and compared the BABEL essays to 384,656 actual essays from those tests. The result of this effort was an “advisory” from e-rater that would flag BABEL-generated gibberish.

Unfortunately, this advisory was a solution in search of a problem. The purpose of the BABEL Generator was to demonstrate, through an extreme example, that robo-graders such as e-rater could be fooled into giving high scores to undeserving essays simply by including the various proxies that constitute e-rater’s score. Candidates could not actually use the BABEL Generator while taking one of these tests, but they could use the same strategies that inform it, such as including long and rarely used words regardless of their meaning and inserting long, vacuous sentences into every paragraph.
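Those proxy-gaming strategies can be made concrete with a toy scorer. The sketch below is purely illustrative: the features, weights, and function name are invented here and are not e-rater’s actual model. It shows only how a score built on length, vocabulary, and sentence-length proxies can be inflated by meaningless text:

```python
# Toy illustration of proxy-based essay scoring. The score rewards
# length, rare vocabulary, and long sentences, with no model of meaning.
# All features and weights are invented for illustration only.

COMMON_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "that"}

def proxy_score(essay: str) -> float:
    words = essay.lower().split()
    if not words:
        return 0.0
    n_words = len(words)
    # "Rare" words: anything outside a small common-word list.
    rare_ratio = sum(w not in COMMON_WORDS for w in words) / n_words
    # Average sentence length, using periods as a crude sentence boundary.
    n_sentences = max(essay.count("."), 1)
    avg_sentence_len = n_words / n_sentences
    # Longer essays with rarer words and longer sentences score higher,
    # regardless of whether they make any sense.
    return (0.4 * min(n_words / 500, 1.0)
            + 0.4 * rare_ratio
            + 0.2 * min(avg_sentence_len / 25, 1.0))

# A long run of BABEL-style gibberish outscores a short, sensible essay.
gibberish = ("Society will always authenticate curriculum; some for "
             "assassinations and others to a concession. " * 20)
plain = "The test is unfair. It should be fixed."
```

Padding with repeated, rare-word-laden sentences raises every proxy at once, which is exactly the gaming strategy a test taker could imitate by hand.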

Moreover, the BABEL Generator is so primitive that there are much easier ways of detecting BABEL essays. We did not expect our first attempt to fool all the robo-graders we could access to succeed, but because it did, we stopped; we had proved our point. One of the student researchers was taking physics at Harvard and hard-coded into BABEL’s responses some of the terminology of sub-atomic particles, such as neutrino, orbital, plasma, and neuron. E-rater and the other robo-graders did not seem to notice. A simple program scanning for these terms could have saved the trouble of generating a half-million essays.
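The “simple program” described above amounts to a keyword scan. Here is a minimal sketch; the telltale term list comes from the article, while the function name and implementation are assumed for illustration:

```python
# Minimal BABEL detector: flag any essay containing the sub-atomic
# vocabulary that the generator hard-codes into its output.
# A far cheaper check than generating half a million training essays.

BABEL_TELLTALES = {"neutrino", "orbital", "plasma", "neuron"}

def looks_babel_generated(essay: str) -> bool:
    # Strip common punctuation and lowercase so "neutrino," still matches.
    words = {w.strip(".,;:!?").lower() for w in essay.split()}
    return not BABEL_TELLTALES.isdisjoint(words)
```

A scan like this is brittle (the generator’s vocabulary could simply be changed), but it underscores the point: the detection problem ETS solved with half a million synthetic essays was, for this particular generator, trivial.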

ETS is not content with merely automating the grading of the writing portions of its various tests. ETS researchers have developed SpeechRater, a robo-grading application to score the speaking sections of the TOEFL. There is an entire volume of scholarly research articles on SpeechRater published by the well-respected Routledge imprint of the Taylor and Francis Group. However, the short biographies of the nineteen contributors to that volume list seventeen as current employees of ETS, one as a former employee, and only one with no explicit affiliation.

Testing organizations appear no longer to have a wide range of perspectives, or any perspective that runs counter to their very narrow psychometric outlook. This danger has long been noted. Carl C. Brigham, the creator of the SAT and a eugenicist who later renounced the racial characterization of intelligence and became critical of his own test, wrote shortly before his death that research in a testing organization should be governed and implemented not by educational psychologists but by specialists in academic disciplines, since it is easier to teach them testing than to try to “teach testers culture.”

The obvious home for such a research organization is the US Department of Education. Just as the FDA vets the efficacy of drugs and medical devices, there should be an agency that verifies not only that assessments measure what they claim to measure but also that the instruments are not biased toward or against specific ethnic or socio-economic groups. There was an old analogy question on the SAT (which no longer has analogy items) that read: “Runner is to marathon as: a) envoy is to embassy; b) martyr is to massacre; c) oarsman is to regatta; d) referee is to tournament; e) horse is to stable.” The correct answer is c: oarsman is to regatta. Unfortunately, there are very few regattas in the Great Plains or the inner cities.