Can LLMs Assess Complex Student Competencies?

Tim Gallagher

March 2026

Abstract

Two Gemini models (3.1 Pro and 3 Flash) scored 18 student essays against rubrics matched to each corpus: the ETS GRE rubric for 6 ETS sample essays, and the PERSUADE/ASAP rubric for 12 PERSUADE 2.0 essays. On the ETS essays, Pro matched the expert raters' scores exactly on 2 of 6 essays (33%) and Flash on 4 of 6 (67%). On the PERSUADE essays, Pro matched the human scores exactly on 3 of 12 essays (25%) and Flash on 5 of 12 (42%). A contamination test points to score memorisation: when asked to recall scores without seeing the rubric, both models matched ETS scores at 67% (4 of 6) but PERSUADE scores at only 33% (Pro) and 42% (Flash). The models did not memorise the essay text itself, but they did recall the published scores. This suggests that assessment accuracy on well-known corpora may be inflated by score memorisation, and that contamination testing should be standard practice in AI assessment research.
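The exact-agreement percentages in the abstract can be reproduced from the reported counts. The sketch below is illustrative only: the dictionary layout and the function name `exact_agreement_pct` are assumptions made for this example, not part of the study's code, and it assumes percentages are rounded to the nearest whole number.

```python
# Exact-match counts as reported in the abstract: (matches, essays scored).
# The data layout and names here are hypothetical, chosen for illustration.
agreement_counts = {
    ("ETS", "Pro"): (2, 6),
    ("ETS", "Flash"): (4, 6),
    ("PERSUADE", "Pro"): (3, 12),
    ("PERSUADE", "Flash"): (5, 12),
}


def exact_agreement_pct(matches: int, total: int) -> int:
    """Share of essays where the model score equals the human score,
    as a percentage rounded to the nearest whole number."""
    return round(100 * matches / total)


rates = {key: exact_agreement_pct(*counts) for key, counts in agreement_counts.items()}

for (corpus, model), pct in rates.items():
    print(f"{corpus:9s} {model:6s} {pct}%")
```

With samples of 6 and 12 essays, a single essay moves the rate by roughly 17 and 8 percentage points respectively, which is worth keeping in mind when comparing the two models.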

Key findings at a glance

Models tested: 2
Essays scored: 18
ETS score recall (both models): 67%
Contamination gap (Flash / Pro): 25pp / 34pp
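The contamination gap figures are consistent with a simple difference, in percentage points, between each model's score recall on ETS and on PERSUADE. The sketch below assumes that definition, which this section does not state explicitly, and works from the rounded percentages reported in the abstract; a gap computed from raw counts could differ by a point. The names `recall_pct` and `contamination_gap_pp` are hypothetical.

```python
# Score-recall rates (percent) as reported in the abstract.
# Assumption: the gap is ETS recall minus PERSUADE recall, per model.
recall_pct = {
    "Pro": {"ETS": 67, "PERSUADE": 33},
    "Flash": {"ETS": 67, "PERSUADE": 42},
}


def contamination_gap_pp(recall: dict) -> int:
    """Difference in recall between the two corpora, in percentage points."""
    return recall["ETS"] - recall["PERSUADE"]


gaps = {model: contamination_gap_pp(r) for model, r in recall_pct.items()}

for model, gap in gaps.items():
    print(f"{model}: {gap}pp")
```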

How to read this report

This report follows a standard research paper structure. Start with the Introduction for context, or skip to Results for the data. The Discussion section addresses what we can and cannot claim from a small study.

This report is a frozen snapshot of the March 2026 data. For the full list of AlignED reports, visit AlignED Reports.