Les Perelman, former director of undergraduate writing at MIT, has been a persistent critic of machine-scored writing on tests. He has repeatedly demonstrated that students can outwit the machines and game the system. He created a program called BABEL, the Basic Automatic B.S. Essay Language Generator, and he argues that the computer cannot distinguish between gibberish and lucid writing.
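Perelman's point is easy to illustrate. Below is a toy, BABEL-style generator sketched in Python; the word lists and the one-pattern grammar are invented here for illustration (the real BABEL program is far more elaborate), but the output has the same property: grammatical, vocabulary-heavy, and meaningless.

```python
import random

# Toy BABEL-style gibberish generator. All word lists are invented for
# illustration; the real BABEL program is far more sophisticated. The
# point: the output is grammatical and vocabulary-heavy but meaningless.
NOUNS = ["assessment", "paradigm", "epistemology", "curriculum", "pedagogy"]
ADJS = ["ubiquitous", "salient", "quintessential", "inexorable", "didactic"]
VERBS = ["promulgates", "elucidates", "obfuscates", "engenders", "vitiates"]

def pompous_sentence(rng: random.Random) -> str:
    """One grammatical but contentless sentence."""
    return (f"The {rng.choice(ADJS)} {rng.choice(NOUNS)} "
            f"{rng.choice(VERBS)} the {rng.choice(ADJS)} {rng.choice(NOUNS)}.")

def babel_essay(n_sentences: int = 5, seed: int = 0) -> str:
    """A paragraph of fluent-looking nonsense."""
    rng = random.Random(seed)
    return " ".join(pompous_sentence(rng) for _ in range(n_sentences))

print(babel_essay())
```

A scorer that rewards surface features such as sentence length, rare vocabulary, and clean syntax would rate such a paragraph well; a human reader would not.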
He wrote the following as a personal email to me, and I post it with his permission.
Measurement Inc., which uses Ellis Page’s PEG (Project Essay Grade) software to grade papers, all but concedes that students in classrooms where the software has been used have been gaming the program with the BABEL generator or something like it. Neither vendor mentions that the same software is also being used to grade high-stakes state tests and, in the case of Pearson, is being considered by PARCC to grade Common Core essays.
http://www.pegwriting.com/qa#good-faith
What is meant by a “good faith” essay?
It is important to note that although PEG software is extremely reliable in terms of producing scores that are comparable to those awarded by human judges, it can be fooled. Computers, like humans, are not perfect.
PEG presumes “good faith” essays authored by “motivated” writers. A “good faith” essay is one that reflects the writer’s best efforts to respond to the assignment and the prompt without trickery or deceit. A “motivated” writer is one who genuinely wants to do well and for whom the assignment has some consequence (a grade, a factor in admissions or hiring, etc.).
Efforts to “spoof” the system by typing in gibberish, repetitive phrases, or off-topic, illogical prose will produce illogical and essentially meaningless results.
Also, both PEG Writing and Pearson’s WriteToLearn concede in buried FAQs that their probabilistic grammar checkers don’t work very well.
PEG Writing by Measurement Inc.
http://www.pegwriting.com/qa#grammar
PEG’s grammar checker can detect and provide feedback for a wide variety of syntactic, semantic and punctuation errors. These errors include, but are not limited to, run-on sentences, sentence fragments and comma splices; homophone errors and other errors of word choice; and missing or misused commas, apostrophes, quotation marks and end punctuation. In addition, the grammar checker can locate and offer feedback on style choices inappropriate for formal writing.
Unlike commercial grammar checkers, however, PEG only reports those errors for which there is a high degree of confidence that the “error” is indeed an error. Commercial grammar checkers generally implement a lower threshold and, as a result, may report more errors. The downside is that they also report a higher number of “false positives” (errors that aren’t errors). Because PEG factors these error conditions into scoring decisions, we are careful not to let “false positives” prejudice an otherwise well-constructed essay.
Pearson WriteToLearn
http://doe.sd.gov/oats/documents/WToLrnFAQ.pdf
The technology that supports grammar-check features in programs such as Microsoft Word often returns false positives. Since WriteToLearn is an educational product, its creators decided to err on the side of caution so as not to present students with false positives. Consequently, there are times when the grammar check will not catch all of a student’s errors.
MS Word used to produce a significant number of false positives, but in current versions Microsoft appears to have raised the probabilistic threshold so that it now underreports errors.
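The threshold tradeoff both vendors describe is just precision versus recall. A minimal sketch follows; the flagged spans and confidence values are hypothetical, not any vendor's actual code or data.

```python
from dataclasses import dataclass

@dataclass
class Flag:
    span: str          # the text the checker flagged
    confidence: float  # checker's estimated probability this is a real error

def report(flags: list, threshold: float) -> list:
    """Surface only the flags the checker is sufficiently sure about."""
    return [f.span for f in flags if f.confidence >= threshold]

# Hypothetical output from a probabilistic checker.
flags = [
    Flag("their/there confusion", 0.95),
    Flag("comma splice", 0.80),
    Flag("possible fragment", 0.55),  # plausibly a false positive
    Flag("passive voice", 0.30),      # a style nit, very uncertain
]

print(report(flags, threshold=0.5))  # aggressive: more catches, more noise
print(report(flags, threshold=0.9))  # conservative: fewer false positives,
                                     # but real errors slip through
```

Raise the threshold and the false positives disappear, but so do genuine errors — exactly the underreporting described above.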

Reminds me of ELIZA, the spoof psychological-interview program from the very early days of computing.
Also, judging from what I have read about human essay grading, there’s not much to choose between machine and human.
(Making the Grades, by Todd Farley)
Cross posted at
http://www.opednews.com/Quicklink/Would-You-Prefer-to-Have-Y-in-Best_Web_OpEds-Education-Testing_Standardized-Testing-150628-207.html#comment552209
with this comment, referring to another post here about the valueless testing being promoted by CC.
Submitted on Sunday, Jun 28, 2015 at 8:35:47 AM:
Tests, according to the Pew National Standards research, where my practice was a cohort, had ONE “Principle of Learning” that covered testing; it was clear that tests were essential for the TEACHER’S use, in order to plan for the needs of the students.
Genuine & Authentic (the terms Harvard used to define each principle) EVALUATION demonstrated the application of learning, not merely the memorization of facts.
In another post at the Ravitch blog, the flaw in the current testing mania is revealed when Bob Shepherd notes that Common Core testing assumes there is only one correct answer when interpreting literature. This, he says, is a complete rejection of reader-response theory, which had been prevalent for many years. Shepherd has many years of experience writing curriculum, assessments, and textbooks.
The reader-response theory in question goes something like this: “a text means whatever the reader constructs when reading it.”
“This grotesque misunderstanding of what “a reader’s construction of a text” can reasonably mean had become the de facto orthodoxy in ELA lit texts at the middle-school and secondary-school levels.
“I call this a grotesque misunderstanding because a text is an act of communication and as such depends, usually, upon shared usages and upon the belief on the part of the reader and the writer that communication across an ontological gap of a communicable meaning is possible.
“To deny that–to say that any text can mean anything–is to undercut the very notion of communication, of transmission across that gap from one subjectivity to another.
“Part of teaching people how to read literature is to teach them about conventional usages and what those can reliably be taken to mean.”
So, we tell our students to write with a specific audience in mind. For example, I finally determined that for my physics students, the optimum audience for their lab conclusions is their math teacher because she does not need to see all the algebraic details, but she does not necessarily know the physics principles. This has greatly improved their argumentative writing.
So what do I tell my students who have to take a standardized test? Your audience is a computer that cares nothing about your ideas or your personal connections. It will not be made to pause and contemplate your innovative approach or interesting analogy. It will not be touched by a deep emotional connection. Just make sure you do not have a comma splice, and make appropriate word choices.
Benjamin Winterhalter wrote a great essay in Salon on the philosophical problems with computer grading.
http://www.salon.com/2013/09/30/computer_grading_will_destroy_our_schools/
Thanks, Nicholas. Such links and commentary are why this site is the most valuable on the net for anyone who wants the truth about WHAT IT TAKES TO LEARN… and what is afoot as the Empire takes down public education.
And FYI, the term “Empire” is used interchangeably with the noun “oligarchs” at the progressive site where I write. It began to appear recently at that site, where Chris Hedges, Robert Reich, and I (LOL) write the truth about the 21st century.
Below is a sample of the kind of article, and the incredible commentary that follows.
You will find the term ’empire’ in several commentaries.
And BTW, I put up several commentaries, including one where I posted the photo of Bernie Sanders that I took of him at an alumni ceremony at James Madison HS.
http://www.opednews.com/articles/Amazing-Bernie-Sanders-Int-by-James-Quandy-Bernie-Sanders_Bernie-Sanders-2016-Presidential-Candidate_Interviews_Rachel-Maddow-150622-815.html#comment552066
It is my wish that educators begin to weigh in about education there, with questions and accounts of how education affects their kids. Here we all know what is occurring, but the assault on public education is hidden in the static put forth by Duncan and his clones, and lies under the weight of entertainment news featuring murder and mayhem, and celebrity news. My author’s page is below… be sure to go to the quick link series and commentary.
http://www.opednews.com/author/author40790.html
Why not have teachers grade it? Remember those handwriting machines at the carnival that would analyze your signature? At this point, the reformers might as well toss the essays into a shredder and assign failing grades at random between F and D- since that is all they’re after.
Utah, which has had computer grading of essays for years, has gone back to teacher grading for the new CC-based tests (every student in grades 3–11 writes two essays). The biggest problem is that the ELA teachers were taken out of class to grade these essays, for five or six days per teacher. So not only did the testing itself take days away from instruction, but students then lost several more days of meaningful instruction while teachers graded the essays. A lose-lose.
There is another sign of this rush to create computers that can do the same job as humans—even tasks as complicated as grading an essay by using a rubric. Once a computer is equal to or superior to humans in completing a task like grading an essay, what will the next step be?
Both Elon Musk and Stephen Hawking have warned about the dangers of artificial intelligence. It seems that even Bill Gates has joined Musk and Hawking on this issue—at least according to the Metro in the UK.
Stephen Hawking, Elon Musk, and many other prominent figures have signed an open letter pushing for responsible AI oversight in order to mitigate risks and ensure the “societal benefit” of the technology.
http://io9.com/prominent-scientists-sign-letter-of-warning-about-ai-ri-1679487924
Stephen Hawking warns artificial intelligence could end mankind
http://www.bbc.com/news/technology-30290540
In addition, films offer examples of what happens when a competitive species appears on the scene; a recent instance is “Ex Machina.” And make no mistake about it: we are moving in the direction of building AIs. If they are equal or superior to humans, they will be another species in competition with us. And if you study what we know of the other intelligent species that have walked the earth, our species seems to have eliminated them all, except perhaps those that live in the ocean rather than on land; without laws protecting even ocean mammals from extinction, they would probably all be gone by now.
How can these machines benefit us when they are eliminating so many jobs? We can’t all be rocket scientists, CEOs, and billionaires.
The question is fundamentally flawed because the essays in question are all part of standardized tests.
Get rid of the tests and there are no more essays to grade.
Problem solved.
And “probabilistic grammar checkers”? That sounds right. MS Word seems to be a “random grammar checker,” redlining totally random things.
I always thought it was hilarious that anyone would trust a computer program (written by a computer programmer, of course) to judge their grammar.
Anyone who has ever worked with computer programmers knows what I mean. Many of these people are the ones who failed English in high school. And in many cases, their programs are just as incoherent as their writing. We call it spaghetti code.
Oh, now you are judging computer programmers. You are stating that many computer programmers failed English.
I have a few questions for you.
Did the computer programmers get guidance from language specialists and/or grammarians?
How did you decide that, in many cases, their programs are just as incoherent as their writing?
What would your life be without computers helping you in everything?
Who will you judge next?
So, are you okay with your essays or your children’s essays being graded by a computer, Raj?
I believe SDP has stated that s/he is or used to be a programmer. But nice concern trolling.
Well
I’ve been programming computers since the 1970s and spent a good part of my career working as a software engineer, so I have worked with lots of programmers over the years.
And if you believe that MS Word is good at grammar checking, then you know about as much about grammar as many of the programmers I have worked with over the years.
Then again, one can tell that just from reading your posts here.
So, to answer your last question: you.
Raj,
“Who will you judge next?”
Can’t the same be said of you in this exact comment?
Point of Information: Many dime-a-dozen computer programmers create application software that has nothing to do with the study of English writing. They make their money on data-mining and/or management systems.
Wanna find one working at the Center for Applied Linguistics? Many of those are qualified for the job out there.
“Get rid of the tests and there are no more essays to grade.
Problem solved.”
Exactamente.
“Many of these people are the ones who failed English in high school.”
Unfortunately, the “failure” meme is so embedded in the cultural habitus of current thought that the vast majority of folks don’t think twice about using that concept. I guess when you’re speaking to/about adults I don’t have as much of a problem with it as I do when it is employed in the assessment of student work. At the student level it is a very dangerous meme that causes untold harm to uncounted students who internalize the “failure” concept into their very being. And they do!! Teach for any length of time and you will hear students say, “I’m going to fail that test/course. I’m a failure.”
How the hell and why people would believe that labeling a good-faith effort by a young person, the student, as “failing” is beyond my comprehension. The “F” word that ends in “ail” is far more damaging to far more students than that accursed/taboo “F” word ending in “uck”. But boy do teachers and administrators get their panties in a bind (including most males) when they hear “fuck”, all the while proclaiming that the student is “failing” to follow the rules. Mind “fucking”, oops I mean “mind failing”, insanities!!
Duane, you make me laugh, but you said something that actually is an important point; at least it was in my former practice, where I never gave a test (well, a few small quizzes to ensure the kids learned some rote spelling or facts).
In September, the students and I made a chart that listed what we felt was best work: the kind of writing that, when June came around, could be evaluated as E (Excellent, the nineties), VG (Very Good, the eighties), and so on. I had to assign a numerical grade for their records, so this was important.
The question of whether MY BEST work was the same as THE best work was at the center of this discussion and of the chart we produced. When Harvard looked at the way I met the 1st principle of learning (Clear Expectations) in my practice, they saw this as an indicator.
You see, many of my 12-year-old kids believed that if they did their very best to write and produced a page of work, they deserved a pat on the back AND a high evaluation.
I love to give pats on the back, especially when a child strives all year to improve from what they believe in September is THEIR best work. Rewards FOR ACHIEVEMENT is the 2nd principle of learning, and Harvard said that I offered many rewards for meeting the criteria for EXCELLENT writing… not the least of which was the posting of great writing in the hallways. No pats on the back here; only the writing that met the criteria for excellence made it to the hallway, and believe me, many kids tried for a long time to get their work there. But when they did see their work on that wall, it meant that they had done it… met the criteria for excellence.
Mind you, I offered many incentives and “atta boys” during the year for kids who WERE doing their best, but there WAS NO DOUBT AS TO WHAT was considered EXCELLENT WRITING IN GRADE 7. Accurate punctuation, spelling, and correct grammar WERE the NYS objectives for seventh grade, our core curricula so to speak, although I got to choose my materials and activities as the PRACTITIONER. Using language to express complex ideas was something they were required to do, as they were at the edge of high school and in two years needed to move out of elementary-school styles. “Good job” was not being used back then to raise esteem for any child who bothered to slap a few sentences onto a paper.
It did not take long, as we read literature and poetry together, for the kids to begin to grasp that they could express their own thoughts in ways that others could easily read and actually enjoy. I entered their stories in citywide and national contests, and many of them won. This spurred greater effort.
But no parent ever said to me that I was not considering how hard Johnny had worked on a piece of writing. Maybe Johnny got a coupon to have lunch with Mrs. Schwartz in my room (a much sought-after reward in a fabulous room) for putting in a great effort, even if the paragraphs were nonexistent and I had to figure out the spelling. Writing is not about spelling or grammar, but about getting ideas down, capturing THINKING, in clear but interesting ways. We celebrated ideas!!!
But there have to be consistent criteria for excellence, and IF they are authentic, and IF the students and the parents know from day one what “LEARNING LOOKS LIKE” (that’s the standards lingo that all the cohorts used)… and in fact, in eight years, only a handful of kids got the dreaded “F,” and that was because “learning” was not why they came to school.
And that is my 2 cents!
And the comment above demonstrates that writing is not about spelling… because automatic spell check does bizarre things with words, and yet the meaning is clear from the context… and “making meaning,” the expression of important ideas, is the objective of writing.
Top-editing of first drafts (which the comment above most certainly is) must be done for final drafts, because it is one of the criteria.
It seems to me that “I” am making more “mistakes” these days, because when I read over something I say, “Wait, I know I typed what I wanted to type, but it’s been changed.”
Does anyone else have that problem?
It’s a function of spell check. At the news site where I share content, write articles, and comment, the same thing happens, BUT… we can click an edit button and FIX IT! NO SUCH LUXURY EXISTS HERE AT WORDPRESS, so we count on the readers here to figure out the meaning from contextual clues… like on the tests of your kiddies.
SomeDAM Poet: but you can’t fault the commenter in question for neglecting, according to the sort of ‘closet reading’ mandated by CCSS, his John Steinbeck:
“Man is the only kind of varmint sets his own trap, baits it, then steps in it.”
It’s just that, er, Steinbeck was offering up what I would construe as an admonition against, not encouragement for.
But I fear—alas alack and alay!—that lack of fresh batteries left said responder in the dark.
Rescue mission, anyone?
😎
I agree completely with those who reject the premise of this question, because trying to “objectively” evaluate writing done as part of a standardized test is absurd on its face. Only people not involved with education (legislators, hedge funders, Broadies, Duncan, etc.) would actually believe in it. In 31 years as an English teacher, rarely have I seen outstanding writing in an “on-demand” situation: competent, yes, but rarely outstanding. The motivation, not to mention the time for thoughtful revision, is just not there. (AP classes might be the exception, because of the repeated opportunities for practice and feedback over a year-long course of study.) How about reducing my class sizes so I can actually teach effective writing, instead of declaring students, teachers, and schools failures based on what Peter Greene so accurately calls the Big Standardized (BS) test?
For me, as I have often stated:
NOW students are being “taught” that truth resides not in scholarly, in-depth, unbiased research but in machines.
To answer the question: Neither!
Standardized tests (and the accompanying educational standards) contain so many epistemological and ontological errors and falsehoods that the whole process is “COMPLETELY INVALID”. It doesn’t make any difference whether that test is machine scored or teacher scored, any results and usage of those results are COMPLETELY INVALID as proven by Noel Wilson in his never refuted nor rebutted 1997 dissertation “Educational Standards and the Problem of Error” found at: http://epaa.asu.edu/ojs/article/view/577/700
Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine.
1. A description of a quality can only be partially quantified. Quantity is almost always a very small aspect of quality. It is illogical to judge/assess a whole category only by a part of the whole. The assessment is, by definition, lacking in the sense that “assessments are always of multidimensional qualities. To quantify them as unidimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. In attempting to quantify educational standards and standardized testing the descriptive information about said interactions is inadequate, insufficient and inferior to the point of invalidity and unacceptability.
2. A major epistemological mistake is that we attach, with great importance, the “score” of the student, not only onto the student but also, by extension, the teacher, school and district. Any description of a testing event is only a description of an interaction, that of the student and the testing device at a given time and place. The only correct logical thing that we can attempt to do is to describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student as it cannot be a description of the student but the interaction. And this error is probably one of the most egregious “errors” that occur with standardized testing (and even the “grading” of students by a teacher).
3. Wilson identifies four “frames of reference” each with distinct assumptions (epistemological basis) about the assessment process from which the “assessor” views the interactions of the teaching and learning process: the Judge (think college professor who “knows” the students capabilities and grades them accordingly), the General Frame-think standardized testing that claims to have a “scientific” basis, the Specific Frame-think of learning by objective like computer based learning, getting a correct answer before moving on to the next screen, and the Responsive Frame-think of an apprenticeship in a trade or a medical residency program where the learner interacts with the “teacher” with constant feedback. Each category has its own sources of error and more error in the process is caused when the assessor confuses and conflates the categories.
4. Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”
In other words, all the logical errors involved in the process render any conclusions invalid.
5. The test makers/psychometricians, through all sorts of mathematical machinations, attempt to “prove” that these tests (based on standards) are valid, i.e., errorless, or supposedly at least with minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations, the test, and the results are. He is an advocate for the test taker, not the test maker. In doing so he identifies thirteen sources of “error,” any one of which renders the test making/giving/disseminating of results invalid. And a basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.
6. Having shown the invalidity, and therefore the unreliability, of the whole process, Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory.” In other words, start with an invalidity, end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel who finds the occasional acorn, a result may be “true”), or, to put it in more mundane terms, crap in, crap out.
7. And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”
In other words, it attempts to measure “‘something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students, as the social rewards for some are not available to others who “don’t make the grade (sic).” Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?
My answer is NO!!!!!
One final note with Wilson channeling Foucault and his concept of subjectivization:
“So the mark [grade/test score] becomes part of the story about yourself and with sufficient repetitions becomes true: true because those who know, those in authority, say it is true; true because the society in which you live legitimates this authority; true because your cultural habitus makes it difficult for you to perceive, conceive and integrate those aspects of your experience that contradict the story; true because in acting out your story, which now includes the mark and its meaning, the social truth that created it is confirmed; true because if your mark is high you are consistently rewarded, so that your voice becomes a voice of authority in the power-knowledge discourses that reproduce the structure that helped to produce you; true because if your mark is low your voice becomes muted and confirms your lower position in the social hierarchy; true finally because that success or failure confirms that mark that implicitly predicted the now self-evident consequences. And so the circle is complete.”
In other words students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of the students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, the “I’m an “A” student” is almost as harmful as “I’m an ‘F’ student” in hindering students becoming independent, critical and free thinkers. And having independent, critical and free thinkers is a threat to the current socio-economic structure of society.
And your answer is why I get the feed. I love the amusing answers and the honest evaluations of an insane idea, but YOU, sir, offer an analysis that stands alone. I await Robert, Lloyd, Bob, Deb, Laura, Chiara, Susan, and all the wonderful voices that speak here as teachers.
Gotta go, my granddaughter is recording me as I talk about my family’s history.
Thanks a bunch for the very kind words Susan!!
ABSOLUTELY BEAUTIFUL. But it is much too intelligent for the partial wits who are in the “reform” business to ever comprehend.
Would I like my essay graded by a computer? Probably. Definitely, if I was looking to just get a good grade and move on without putting much effort into it. Computers are easy to fool, especially when they’re given a task because the people selling the programs realize they can make money off it, not because computers are suited to the task.
Would I want my children’s essays graded by a computer? Do I think computer grading would be an improvement in educational quality (or even another way of achieving comparable quality)? Of course not.
A friend of mine said she watched a student copy the reading passage that accompanied a writing prompt on the end of year test. He copied every word of it because he is an English Language Learner and didn’t really understand the task completely. He passed the writing portion as proficient. It was graded by humans this year.
Now that students have to “explain their working,” I suspect that when computer grading hits this, typing in a list of relevant words, in no particular order, will satisfy almost all computer grading systems. And of course they won’t be checking the grammar.
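The suspicion is plausible because the simplest content-scoring features are order-blind. Here is a toy bag-of-words scorer; the keyword list and sample answers are invented for illustration (real engines use richer features, though many still ignore word order):

```python
# Toy bag-of-words scorer: the fraction of prompt keywords that appear
# anywhere in the response. Keywords and answers are invented examples.
PROMPT_KEYWORDS = {"force", "mass", "acceleration", "newton", "net"}

def keyword_score(response: str) -> float:
    # Lowercase, strip basic punctuation, and ignore word order entirely.
    words = set(response.lower().replace(",", " ").replace(".", " ").split())
    return len(words & PROMPT_KEYWORDS) / len(PROMPT_KEYWORDS)

real_answer = "The net force equals mass times acceleration, per Newton."
word_salad = "newton acceleration net force mass"

print(keyword_score(real_answer))  # 1.0
print(keyword_score(word_salad))   # 1.0 -- identical score, zero understanding
```

A coherent explanation and an unordered word list earn exactly the same score, which is the gaming strategy described above.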
There is a good paper examining the computer grading of papers. Search for “Can Computers Grade Writing? Should They?” by Douglas Hesse, Executive Director of Writing at the University of Denver, to find the PDF.
The real problem, however, is the assumption that this testing will provide good data at all. Topics are by necessity broad and boring, and kids are required to produce on-demand writing after a long period of mind-numbing testing. Many of my best writers are draft writers who work at writing in their own space and time.
I could not agree more on the draft writing.
The reality is that no one (not even the best writer) “gets it right” the very first time –especially not within a very short time.
Timed writing tests really favor the quick, thoughtless writers over the good, thoughtful ones (which is ironic, given what the tests are supposed to show).
That is especially true when the essay is graded by a computer program that cannot distinguish between thoughtful writing and meaningless gibberish.
Being able to do something quickly is rarely the same as being able to do it well, except in sports, and even there the preparation leading up to a race or game takes a long time.