By: Richard Flynn |Audience: All
The use of e-raters, also known as e-readers, to mark students' writing in computer-based tests (CBT) is set to increase in the next few years. GMAT is using them, and TOEFL and GRE are to follow. Critics claim that running algorithms through a text is not as good as having a human marker, though there is evidence suggesting that an e-rater produces results broadly similar to those of human markers, and these results are likely to improve as the algorithms and measures improve over time. One area where the critics would appear to have a point is the seeming inability of an e-rater to respond to creative language use, though I think that there is probably not much genuinely creative language use in answers to the fairly standard and often dull questions set in so many English language tests.
E-raters have obvious advantages in terms of speed and cost, which may fuel the Luddite conspiracy view. If they can do the job, however, I see no reason why they shouldn't. Just as multiple-choice tests and others are marked by machine, it seems logical to me to let computers mark the writing too if they can handle it.
In a 2004 TOEFL report about e-raters (1), Chodorow and Burstein say that "E-rater features are based on four general types of analysis: syntactic, discourse, topical,
The first stage, syntactic, involves parsing the text and identifying word types, which is a reasonably well-established technique, so an e-rater should be able to do this. This information is used to identify certain features that establish the essay's syntactic variety. The second stage builds on the first and tries to identify discourse markers to establish the structure of the argument. The third stage weights words to assess topical relevance, in much the same way as search engines and other tools do, using the discourse items identified in the second stage. The last stage looks at the general vocabulary in terms of word length, variety, etc. There are also features to identify errors.
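To make the later stages a little more concrete, here is a minimal Python sketch of the kind of surface measures they describe: counting discourse markers, checking vocabulary overlap with the question, and measuring word length and variety. It is not the e-rater's actual method. The marker list, the function names and the simple overlap measure are all assumptions made up for illustration, and the syntactic stage is left out because it needs a proper parser.

```python
import re

# Hypothetical, deliberately tiny marker list; a real system would use far more.
DISCOURSE_MARKERS = ("however", "therefore", "moreover",
                     "in conclusion", "for example", "firstly")


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def essay_features(essay, prompt):
    """Return a few illustrative surface features for an essay.

    These are stand-ins for the discourse, topical and vocabulary
    stages described above, not the features used by any real e-rater.
    """
    words = tokenize(essay)
    if not words:
        return {"marker_count": 0, "topical_overlap": 0.0,
                "avg_word_length": 0.0, "type_token_ratio": 0.0}

    # Discourse: count occurrences of each marker as a whole word or phrase.
    normalized = " ".join(words)
    marker_count = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", normalized))
        for marker in DISCOURSE_MARKERS
    )

    # Topical: share of the prompt's longer words that reappear in the essay
    # (assumed proxy; the article describes word weighting, not plain overlap).
    prompt_vocab = {w for w in tokenize(prompt) if len(w) > 3}
    essay_vocab = set(words)
    overlap = (len(prompt_vocab & essay_vocab) / len(prompt_vocab)
               if prompt_vocab else 0.0)

    # Vocabulary: average word length and variety (type-token ratio).
    avg_word_length = sum(len(w) for w in words) / len(words)
    type_token_ratio = len(essay_vocab) / len(words)

    return {"marker_count": marker_count,
            "topical_overlap": overlap,
            "avg_word_length": avg_word_length,
            "type_token_ratio": type_token_ratio}


if __name__ == "__main__":
    sample = ("However, computers are fast. Therefore computers can mark "
              "essays. For example, tests.")
    print(essay_features(sample, "Should computers mark essays?"))
```

Measures like these are crude on their own, which is presumably why a real system combines many of them and checks the result against human marks.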
In many exams, writing and essays are impression-marked, so the fact that an e-rater doesn't identify every single good point and every error doesn't trouble me. I think that the methodology described in the paper will lead to a reasonable conclusion in the majority of cases, even if it fails to spot great style and imaginative writing.
In terms of teaching to an algorithm, there doesn't seem to be much that isn't already being taught in writing classes the world over: the use of discourse markers and other organizational features, relevant vocabulary and avoidance of repetition. It all seems pretty standard to me. I don't honestly see that much to fear from a well-tested e-rater going through CBT writing to assess it.
(1) Martin Chodorow & Jill Burstein, "Beyond Essay Length: Evaluating e-rater®'s Performance on TOEFL® Essays", TOEFL® Research Reports, Report 73, February 2004.