In the brave new world of Common Core, all tests will be delivered online and graded by computers. This is supposed to be faster and cheaper than paying teachers, or even low-skill hourly workers, to read student essays.
But counting on machines to grade student work is a truly bad idea. We know that computers can’t recognize wit, humor, or irony. We know that many potentially great writers with unconventional writing styles would be declared failures (E.E. Cummings comes immediately to mind).
But it is worse than that. Computers can’t tell the difference between reasonable prose and bloated nonsense. Les Perelman, former director of undergraduate writing at MIT, created a program called BABEL with the help of a team of students.
He was interviewed by Steve Kolowich of The Chronicle of Higher Education, who wrote:
“Les Perelman, a former director of undergraduate writing at the Massachusetts Institute of Technology, sits in his wife’s office and reads aloud from his latest essay.
“Privateness has not been and undoubtedly never will be lauded, precarious, and decent,” he reads. “Humankind will always subjugate privateness.”
Not exactly E.B. White. Then again, Mr. Perelman wrote the essay in less than one second, using the Basic Automatic B.S. Essay Language Generator, or Babel, a new piece of weaponry in his continuing war on automated essay-grading software.
“The Babel generator, which Mr. Perelman built with a team of students from MIT and Harvard University, can generate essays from scratch using as many as three keywords.
“For this essay, Mr. Perelman has entered only one keyword: “privacy.” With the click of a button, the program produced a string of bloated sentences that, though grammatically correct and structurally sound, have no coherent meaning. Not to humans, anyway. But Mr. Perelman is not trying to impress humans. He is trying to fool machines.
“Software vs. Software
“Critics of automated essay scoring are a small but lively band, and Mr. Perelman is perhaps the most theatrical. He has claimed to be able to guess, from across a room, the scores awarded to SAT essays, judging solely on the basis of length. (It’s a skill he happily demonstrated to a New York Times reporter in 2005.) In presentations, he likes to show how the Gettysburg Address would have scored poorly on the SAT writing test. (That test is graded by human readers, but Mr. Perelman says the rubric is so rigid, and time so short, that they may as well be robots.)
“In 2012 he published an essay that employed an obscenity (used as a technical term) 46 times, including in the title.
“Mr. Perelman’s fundamental problem with essay-grading automatons, he explains, is that they “are not measuring any of the real constructs that have to do with writing.” They cannot read meaning, and they cannot check facts. More to the point, they cannot tell gibberish from lucid writing.”
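For readers curious how a generator like Babel might work under the hood: the sketch below is not Perelman’s code, just a toy Python illustration of the template-filling idea, with an invented vocabulary and invented templates. It produces sentences that parse cleanly but mean nothing, which is exactly the point.

```python
import random

# Toy word lists; a real generator draws on far larger, fancier vocabularies.
NOUNS = ["privateness", "humankind", "assimilation", "adherence"]
ADJS = ["lauded", "precarious", "decent", "unguarded"]
VERBS = ["subjugate", "promulgate", "insinuate", "aggrandize"]

TEMPLATES = [
    "{noun1} has not been and undoubtedly never will be {adj1}, {adj2}, and {adj3}.",
    "{noun1} will always {verb} {noun2}.",
    "the {adj1} {noun1} will {verb} the {adj2} {noun2}.",
]

def babble(keyword, sentences=3, seed=None):
    """Fill grammatical templates with inflated vocabulary around a keyword."""
    rng = random.Random(seed)
    nouns = [keyword] + NOUNS
    out = []
    for _ in range(sentences):
        s = rng.choice(TEMPLATES).format(
            noun1=rng.choice(nouns), noun2=rng.choice(nouns),
            adj1=rng.choice(ADJS), adj2=rng.choice(ADJS), adj3=rng.choice(ADJS),
            verb=rng.choice(VERBS))
        out.append(s[0].upper() + s[1:])  # capitalize each "sentence"
    return " ".join(out)

print(babble("privacy", seed=1))
```

Every sentence is grammatically well-formed, so a checker that looks only at structure has nothing to flag.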
The rest of the article reviews projects in which professors claim to have perfected machines that are as reliable at judging student essays as human graders.
I’m with Perelman. If I write something, I have a reader or an audience in mind. I am writing for you, not for a machine. I want you to understand what I am thinking. The best writing, I believe, is created by people writing to and for other people, not by writers aiming to meet the technical specifications to satisfy a computer program.
Agreed! I’ve had students make up percentages as support for their arguments, and when called on it, they tell me they were told to do this by previous teachers. I don’t know how a computer program would handle this, but my guess is it would give either high scores or the opposite. Some students really “luck out” and get a topic they’ve researched, so they have correct information; I doubt a computer program will be able to tell the difference! It will be some kind of system like what is described here, or it’ll be some kind of rubric, and if the material “looks” good, it will be given a high score, even if it’s gibberish!
It’s pretty simple to me. If we’re going to make them write essays for the tests, the least we could do is afford them the respect of actually paying someone to read what they wrote.
Their work and time and effort have value. If it isn’t worth reading, then how can we tell them it’s worth writing?
“they [computers] cannot tell gibberish from lucid writing.”
I take dumbrage with that remark. There is some truly outstanding stuff that was written in Gibberish.
“Jabberwocky”, for example. Although I wouldn’t call it “lucid”. “Hillucid” (or “hillacid”) would probably be more apt.
I think the real problem with using computers (or more precisely, computer programs) to grade writing is that they normally are not context-sensitive (one can make them that way, but it is actually exceedingly difficult).
Gibberish can actually be quite “good” if that is what is intended, but a computer program normally can’t identify when that is true and when it is not.
Finally, there is gibberish — random and valueless — and there is “Gibberish” like Jabberwocky, which actually inspires quite vivid — indeed frightening — images, though many of the words are made up. I have yet to see anything written by a computer that rises to that level. That’s not to say it is impossible, but I think it would take a Lewis Carroll to write the computer program and even then it would be quite difficult.
I think this actually points out the basic problem with standardization. It ignores context with which human imagination and creativity are inextricably linked.
You are absolutely correct. There is no algorithm that can quantify creativity, imagination and divergent thinking. I am sure that if one of these programs would grade George Orwell’s “Animal Farm,” the score would be in the deficient range for using such simplistic language. That would be too bad because most of the reformers who want such a system do believe that some animals (people) are more equal than others.
To appease The Reformers, we keep computer graded assessments, but develop automated, computer based Artificial Student Interactive Non INstructed Engines (ASININEs) to take the tests. The Reformers can go off somewhere and play with their Turing machines while the rest of us live teachers, human students, and parents as biologically based life forms embark on true learning.
TAGO!
I hate to bring this up, but I’m not buying that states “had” to comply with the various Obama Administration mandates to get RttT money.
They didn’t get a whole lot of money, if you break it down. It looks like a lot, but that’s only if you don’t know what it actually costs to run public schools.
A lot of the time, the RttT grants didn’t come near to covering the costs of the strings attached. If states actually WERE “coerced” by RttT money, they’re not very good with addition and subtraction, or cost estimates.
I suspect they all went along because most state lawmakers are on board with whatever their fellow ed reformers at the federal level serve up. I bet it has more to do with herd mentality and the “crisis!” marketing than with “coercion” to get federal RttT funds.
http://www.edweek.org/ew/articles/2014/01/22/18rtt-districts.h33.html
May be something to that. Ohio took RttT funding, but turned down high speed rail.
It just isn’t a lot of money. A lot of the schools realized the costs of the reforms exceeded the grants and bailed (some are featured in that article I linked).
“100,000 dollars!” just isn’t that meaningful or “coercive” as a percentage of their actual budgets.
This was a joint federal/state lemming-like endeavor. No “coercion” that I can see. They voluntarily plunged off that cliff 🙂
Yes. Good point. Costs far exceed benefits of RttT money.
Our union was willing to compensate the district the dollar amount in return for the district forgoing participation. Political pressure from the BOE prevailed.
“The Battle for New York Schools: Eva Moskowitz vs. Mayor Bill de Blasio”
That’s a piece that will be in the NYTimes. I love how they completely accept ed reform framing.
Important civic/governance/ small “d” democracy distinction! Only one of these two people was actually ELECTED to run NYC public schools 🙂
The other one? She’s self-appointed.
Apparently there is no longer any difference, at all. Why don’t they just outsource the whole thing and call it a day? We could cut out the whole “government” layer and just hire a reputable accounting firm to issue payments to contractors. It would save everyone a lot of time and energy on election day.
While we are at it why not replace all humans with computers. Actually it would seem that that is the idea behind the “reformers”. People become “its”, not humans. People, students, exist for the corporations so why not just forget all the interim. Just get rid of humans and “humanity” and let the computers and robots take over. “Sorry Hal”.
MetaMetrics will not be pleased. This company has the rights to the Lexile scale for text complexity and a companion algorithm for math called a Quartile scale. The Lexile scale was marketed within the CCSS, and nobody has better examples of how dumb it is than Susan Ohanian, unless you are grappling with it in the roll-out of the CCSS. Check out Susan’s website. She was honored with the Mark Twain Prize from the National Council of Teachers of English.
Lexile: the scale that told me that the first section of Beowulf, the history of the Shieldings, was appropriate text for ninth and tenth graders. When the poem was entered in Anglo-Saxon.
And “Quartile”? Really? Does it divide things into four levels or categories? I really hope it’s actually called “Quantile”, not that that is the worst of its probable problems.
When they told my math department that long-answer problems would also be computer-graded on SBAC, most of us quickly realized that it would probably only take a few years before the scoring algorithms were reverse-engineered, at least for broad things. E.g, if you see prompt A in the question, use phrase or method B in your response and you get more points.
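To make that suspicion concrete, here is a hypothetical sketch of the kind of phrase-matching rubric the comment describes. The rubric, phrases, and function name are all invented for illustration; nothing here is SBAC’s actual scoring logic.

```python
# Hypothetical phrase-matching scorer: award points whenever an
# "expected" phrase for the prompt appears in the response.
RUBRIC = {
    "slope": ["rate of change", "rise over run", "y = mx + b"],
    "area": ["base times height", "square units", "length times width"],
}

def naive_score(prompt_key, response, points_per_phrase=1):
    """Count rubric phrases present in the response, ignoring meaning."""
    text = response.lower()
    return sum(points_per_phrase
               for phrase in RUBRIC.get(prompt_key, [])
               if phrase in text)

# A thoughtful answer in the student's own words scores low...
print(naive_score("slope", "The line climbs two units for every unit right."))  # 0
# ...while a memorized phrase dump scores high, with no reasoning at all.
print(naive_score("slope", "Rate of change, rise over run, y = mx + b."))       # 3
```

Once students (or their test-prep coaches) learn which phrases trigger points, the scorer is effectively reverse-engineered.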
As a test of this new program, run some sample writings from Hemingway or Shakespeare or Melville; it would be interesting to see how they would fare. I have a feeling that all would be headed to the resource room.
Do other countries test Writing, or do they test content-areas THROUGH writing? I think American educators took a big wrong turn when they divorced Writing from content and fetishized it. My seventh graders, many of them “bad” writers, can write pages of coherent essay on the Byzantine Empire or Chinese agriculture at the end of a robust history unit. I can write reams about the Sierra Nevada, a subject I know lots about, but my essay on nuclear physics would be gibberish. When you teach kids Writing, you teach a few conventions (e.g. use topic sentences and transition words) and, because content delivery is haphazard across schools, familiarity with this handful of conventions is all that gets tested. Thus execrable writing can get a passing score so long as these elementary principles are observed. And execrable writing is what our 18 year olds produce on pretty much any topic outside their realm of experience, despite years of writing workshop and writing tests, because we’ve abdicated our responsibility as teachers to acquaint them with the world outside their immediate experience. We need a big rethink of how we approach writing. If we return to caring about content and not just form, we’ll make it even harder for techno-utopianists to make human graders obsolete.
If the computer scoring of essays come to pass, and if I was a parent of a learning disabled student, I would demand as an accommodation “a human scorer.” There is nothing in the law that would disallow such an accommodation. And if such an accommodation was denied, it would make a wonderful lawsuit.
Todd Farley (author of the 2009 “Making the Grades: My Misadventures in the Standardized Testing Industry”) called it in his June 8, 2012 Huffington Post piece, “Lies, Damn Lies & Statistics, or What’s Really Up With Automated Essay Scoring”:
“Maybe a technology that purports to be able to assess a piece of writing without having so much as the teensiest inkling as to what has been said is good enough for your country, your city, your school, or your child. I’ll tell you what, though: Ain’t good enough for mine.”
I recommend reading the post (or re-reading it). Not good enough for any child.
I love how Perelman is using a computer to generate structurally sound but meaningless essays to beat these automated grading companies at their own game.
I used to develop software and a guy in our software testing department prided himself in the fact that he had once made a newly developed program crash with a single key stroke.
Perelman is just like that testing guy. Very clever.
He understands that checking grammar and sentence structure is the easy part and that developing a computer program (and the database to support it) that can distinguish between sense and nonsense is very difficult (and not likely to appear in an essay-grader any time soon).
People like Perelman are a major thorn in the side of these automated essay-grading companies because he can destroy months (if not years) of effort — to say nothing of millions of dollars in development costs and potential sales — with just a couple keystrokes.
And there is little they can do about it, other than add some hacky check into their grader that tries to determine if someone is gaming the software (which is bound to produce a lot of false positives if the automated grader is meaning-insensitive).
“Perelman makes a strong case against using robo-graders for assigning grades and test scores. But there’s another use for robo-graders — a role for them to play in which, evidence suggests, they may not only be as good as humans, but better. In this role, the computer functions not as a grader but as a proofreader and basic writing tutor, providing feedback on drafts, which students then use to revise their papers before handing them in to a human.”
Via Annie Murphy Paul.
http://hechingerreport.org/content/robo-readers-arent-good-human-readers-theyre-better_17021/
Should computers grade student essays?
Diane Ravitch says, “…all tests will be delivered online and graded by computers. This is supposed to be faster and cheaper than paying teachers, or even low-skill hourly workers, to read student essays.” She points out various deficiencies and gives a thumbs down to computers replacing teachers as graders.
The essay doesn’t mention that even law schools have used computers to grade the essays on entrance examinations. As I recall, the computers had a reliability rating of 85%.
The computer programs and grammar checkers can do several tasks efficiently:
They can count the number of words in each sentence and in the total essay.
They can track sentence patterns: It would be a bad sign if all sentences had only eight words. They would note the frequency of introductory phrases or clauses (“When we consider justice, we must…”), of parenthetical elements (“Justice, which Jefferson defined as xxx, is…”), and so on. These patterns may reveal how sophisticated the student is in his or her writing.
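As a rough illustration, the surface features listed above can be computed in a few lines of Python. This is a toy sketch, not any vendor’s actual algorithm, and it deliberately shows how meaning-blind such counting is:

```python
import re
import statistics

def surface_features(essay):
    """Crude, meaning-blind measures: word count and sentence-length stats."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return {
        "total_words": sum(lengths),
        "sentences": len(sentences),
        "mean_sentence_length": statistics.mean(lengths) if lengths else 0,
        # Zero variance means every sentence is the same length -- a "bad sign".
        "length_variance": statistics.pvariance(lengths) if lengths else 0,
    }

print(surface_features("Privacy matters. Humankind will always subjugate privateness."))
```

Note that the function would return identical numbers for a lucid essay and for machine-generated gibberish with the same word and sentence counts.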
At the same time, computers aren’t thinking programs; they are flagging programs. You can construct a nonsense sentence, and the computer may not find a problem with “The deviated Apple intersects the transmission with an unguarded submarine.” I just sent that sentence through the spelling and grammar-checking program in Microsoft Word. It found nothing wrong with it.
Out of curiosity a decade or more ago, I sent Abraham Lincoln’s “Gettysburg Address” through a grammar-check program (one that did a separate print-out). Dear God, what a mess Honest Abe had. His sentences were too long and repeatedly too negative. The advice was for the writer to make it positive.
Again, these are flagging programs and NOT thinking programs. PCs are made by fools like me, but only God can make a thinker.
“It would be a bad sign if all sentences had only eight words.”
Why?
What if the author intended it that way? What if I wrote an essay and dedicated it to my daughter, deciding that eight was the best way to honor her? (She was born on 8/8/88, weighed 8 lbs. 8 oz., and if you add up the numbers of the time she was born, it equals 8. And now if you add the two numbers of her height (6’2″), it equals 8. Sorry, ain’t too proud of her, eh!!!)
Here’s yet another reason why young children should not be subjected to computerized assessments: “There are developmental considerations as well. While preschoolers and kindergartners may be extremely facile with physically swiping a tablet to move through screens, or even tracing a letterform on the screen with their index fingers, they do not yet possess the motor coordination necessary for efficient keyboarding. Indeed, the physical skill of sequentially moving the fingers in a coordinated pattern may not be refined until the age of 10. Learning keyboarding usually occurs around age 8 or 9, which makes sense from a developmental point of view.” http://now.tufts.edu/articles/why-it-important-learn-handwriting