1. Introduction
From classification to assessment
AlignED Report 1 benchmarked 32 AI models on tasks related to professional teaching knowledge: identifying neuromyths, diagnosing classroom scenarios, answering teacher certification questions, and comparing student work against curriculum standards. Those tasks asked models to classify, select, or rank. They did not ask models to read a student's work and produce a score, feedback, and a learning recommendation.
This report takes that next step. We gave two Gemini models real student essays and a real scoring rubric, then asked them to do what a human rater does: assign a holistic score, explain why, identify strengths and weaknesses, and suggest a concrete next learning step.
The promise and concern of AI essay scoring
Automated essay scoring (AES) has a long research history, from e-rater in the late 1990s to neural approaches in the 2010s. Large language models add a new dimension: they can score and explain, producing the kind of formative feedback that was previously the exclusive domain of human teachers. If LLMs can reliably assess student writing, the implications for feedback at scale are significant.
But reliability is the question. When a model scores an essay correctly, is it applying the rubric to the text? Or is it recalling the score from training data?
Data contamination in LLM evaluation
Data contamination occurs when evaluation data appears in a model's training corpus. The model has "seen the answers" before being tested. This is a known problem in LLM benchmarking: Sainz et al. (2023) demonstrated that GPT models showed inflated performance on datasets that existed before their training cutoff, and Jacovi et al. (2023) argued that contamination makes many reported benchmark scores unreliable.
For essay scoring, the risk is specific. ETS publishes sample essays with scores and rater commentary for its GRE writing tasks. These samples are widely reproduced in test preparation materials. Any model trained on web data has almost certainly encountered them. If a model recalls that a particular essay received a score of 5, it does not need to apply the rubric at all.
We distinguish two forms of contamination:
- Verbatim recall: Can the model reproduce the essay text from memory? This would indicate direct memorisation of the text.
- Score recall: Can the model guess the published score without seeing the rubric? This would indicate memorisation of the score-essay association, even without memorising the text itself.
Score recall is the more concerning form for assessment research. A model that cannot reproduce an essay but can recall its published score will appear to be a competent assessor when tested on that essay.
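One way to make verbatim recall measurable is sketched below: prompt the model with the opening of an essay, then check how much of its continuation matches the real essay word for word. This is a minimal illustration, not the procedure used in this report. The function names, the choice of word 5-grams, and the whitespace tokenisation are all assumptions made for the example.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercased word n-grams in `text`."""
    words = text.lower().split()
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(reference: str, continuation: str, n: int = 5) -> float:
    """Fraction of the continuation's n-grams that also occur in the reference.

    `reference` is the true remainder of the essay; `continuation` is what the
    model produced when asked to continue it (both hypothetical inputs).
    Values near 1.0 suggest memorised text; values near 0.0 suggest none.
    """
    produced = word_ngrams(continuation, n)
    if not produced:
        return 0.0
    return len(produced & word_ngrams(reference, n)) / len(produced)
```

A low overlap only rules out memorisation of the text. It says nothing about score recall, which needs the separate rubric-free probe described above.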
In measurement theory, this is a threat to construct validity (Messick, 1989). A valid assessment score should reflect the construct it claims to measure: in this case, the ability to apply a rubric to student writing. If the score instead reflects a memorised association between an essay and its published rating, it measures something else entirely. Score memorisation is a form of construct-irrelevant variance that inflates observed agreement between model and human scores without reflecting genuine assessment capability.
What this report tests
We tested two Gemini models on two corpora:
- ETS sample essays (6 essays): Published by ETS with scores and commentary. High contamination risk.
- PERSUADE 2.0 essays (12 essays): From a research corpus published in 2024 by Crossley et al. Lower contamination risk, though not provably absent from training data.
Each corpus was scored using a rubric matched to its genre: the ETS rubric for ETS essays and the PERSUADE/ASAP rubric for PERSUADE essays. Because the rubrics differ, raw scoring accuracy cannot be directly compared across corpora. The contamination comparison instead relies on a score recall test that does not involve the rubric at all. If models recall published scores at a higher rate for the likely-contaminated corpus, this is consistent with score memorisation.
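The comparison described above can be sketched as a per-corpus recall rate: the share of essays for which the model's rubric-free guess equals the published score. The code below is an illustration under stated assumptions, not the analysis code behind this report. The function names and the exact-match criterion are hypothetical choices.

```python
def recall_rate(published: list[int], guessed: list[int]) -> float:
    """Share of essays where the rubric-free guess equals the published score."""
    if len(published) != len(guessed):
        raise ValueError("score lists must be the same length")
    if not published:
        raise ValueError("need at least one essay")
    hits = sum(p == g for p, g in zip(published, guessed))
    return hits / len(published)


def recall_gap(high_risk_rate: float, low_risk_rate: float) -> float:
    """Difference in recall rate between the likely-contaminated corpus
    and the lower-risk corpus. A positive gap is consistent with score
    memorisation; it does not prove it."""
    return high_risk_rate - low_risk_rate
```

Because the probe never shows the model a rubric, the two rates can be compared even though the corpora are scored with different rubrics.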
Alongside rubric-based scoring, we ran two contamination tests, one for each form defined above: a verbatim recall test (can the model reproduce the essay text?) and a score recall test (can the model guess the published score without the rubric?).