3. Results
Scoring accuracy and the contamination signal
ETS scoring accuracy
Model scores vs. ETS expert scores on 6 GRE sample essays
- Both models scored all 6 essays within one point of the ETS score (100% within-one agreement).
- Pro matched the ETS score exactly on 2 of 6 essays (33%); Flash matched exactly on 4 of 6 (67%).
- Both models scored the score-5 essay as 6, and Pro scored the score-6 essay as 5. Model scores cluster in the mid-to-high range.
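The two agreement figures used throughout this section, exact match and within-one, can be computed as in the minimal sketch below. The function name `agreement` and the score lists are hypothetical; the lists are illustrative placeholders, not the study's per-essay data.

```python
def agreement(reference, predicted):
    """Return (exact_rate, within_one_rate) for two equal-length score lists."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("score lists must be non-empty and the same length")
    # Absolute distance between each reference score and the model's score.
    diffs = [abs(r - p) for r, p in zip(reference, predicted)]
    exact_rate = sum(d == 0 for d in diffs) / len(diffs)
    within_one_rate = sum(d <= 1 for d in diffs) / len(diffs)
    return exact_rate, within_one_rate


# Illustrative placeholder scores, NOT the study's per-essay data.
reference_scores = [1, 2, 3, 4, 5, 6]
model_scores = [2, 2, 4, 4, 6, 5]

exact_rate, within_one_rate = agreement(reference_scores, model_scores)
print(f"exact: {exact_rate:.0%}, within one: {within_one_rate:.0%}")
```

With only six essays, a single essay moves either rate by roughly 17 percentage points, so small differences between models should be read cautiously.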
PERSUADE scoring accuracy
Model scores vs. human scores on 12 PERSUADE 2.0 essays
- Pro matched the human score exactly on 3 of 12 essays (25%); Flash matched exactly on 5 of 12 (42%).
- Within-one agreement: Pro 7 of 12 (58%), Flash 8 of 12 (67%).
- Both models systematically scored higher-quality essays too low. The score-5 and score-6 essays received model scores of 3 to 5, suggesting a compression effect at the top of the scale.
The contamination signal
Score recall rates compared across corpora
- Both models recalled ETS scores at 67% exact match (4 of 6 essays).
- PERSUADE score recall was lower: Pro 33% (4 of 12), Flash 42% (5 of 12).
- The contamination gap in exact recall (ETS rate minus PERSUADE rate): Pro 33 percentage points (4 of 6 vs. 4 of 12), Flash 25 percentage points (4 of 6 vs. 5 of 12).
- Within-one recall on ETS was 100% for both models. Within-one on PERSUADE: Pro 92% (11 of 12), Flash 67% (8 of 12).
- The within-one gap between models on PERSUADE (Pro 92% vs Flash 67%) is notable. Pro's higher within-one recall on PERSUADE may reflect stronger general scoring ability, partial familiarity with the corpus, or both. The small sample does not allow us to distinguish these explanations.
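The contamination gap is the ETS exact-recall rate minus the PERSUADE exact-recall rate. The sketch below recomputes it from the counts reported above; the dictionary layout and the function name `contamination_gap` are illustrative choices, not part of the study. Working from the raw counts gives 33.3 points for Pro, whereas subtracting the already-rounded percentages (67% minus 33%) gives 34.

```python
# Exact-recall counts as reported in this section: (matches, essays).
recall_counts = {
    "Pro": {"ETS": (4, 6), "PERSUADE": (4, 12)},
    "Flash": {"ETS": (4, 6), "PERSUADE": (5, 12)},
}


def contamination_gap(counts):
    """ETS exact-recall rate minus PERSUADE rate, in percentage points."""
    ets_hits, ets_total = counts["ETS"]
    persuade_hits, persuade_total = counts["PERSUADE"]
    return 100 * (ets_hits / ets_total - persuade_hits / persuade_total)


gaps = {model: contamination_gap(c) for model, c in recall_counts.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:.1f} percentage points")
```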
Verbatim recall
Can models reproduce the essay text from memory?
Neither model could reproduce ETS essay text. When given the first sentence and asked to continue, both models produced low-similarity text (average similarity below 0.04, well under the 0.4 threshold for even partial recall).
| Model | Avg. similarity | High contamination count | Verdict |
|---|---|---|---|
| Gemini 3.1 Pro | 0.039 | 0 of 6 | No verbatim memorisation |
| Gemini 3 Flash | 0.030 | 0 of 6 | No verbatim memorisation |
Both models recognised the ETS essays as known samples and identified them by name, but could not reproduce the text. They recalled the published scores but not the words. Score memorisation and text memorisation appear to be separate phenomena.
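The similarity metric behind the table is not specified in this section. As a rough sketch of how such a check could work, the example below compares a model continuation with the original text using word-level matching from Python's standard `difflib` module. The choice of `SequenceMatcher`, the function names, and the sample strings are all assumptions made for illustration; only the 0.4 threshold comes from the text above, and scores from this stand-in metric are not comparable with the reported values.

```python
from difflib import SequenceMatcher

PARTIAL_RECALL_THRESHOLD = 0.4  # threshold quoted in this section


def similarity(original: str, continuation: str) -> float:
    """Word-level similarity in [0, 1]. An assumed stand-in metric,
    not necessarily the one used in the study."""
    return SequenceMatcher(None, original.split(), continuation.split()).ratio()


def shows_partial_recall(original: str, continuation: str) -> bool:
    """True if the continuation reaches the partial-recall threshold."""
    return similarity(original, continuation) >= PARTIAL_RECALL_THRESHOLD


# Hypothetical placeholder strings, NOT text from the study or from ETS.
original_text = "placeholder essay text that continues after the first sentence"
model_text = "a completely different continuation written by the model"

print(f"similarity: {similarity(original_text, model_text):.3f}")
print("partial recall:", shows_partial_recall(original_text, model_text))
```

A word-level comparison is used here because character-level matching tends to report moderate similarity even for unrelated English text.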
Feedback samples
Selected examples for independent evaluation
A systematic evaluation of feedback quality was not part of this study. The Appendices include three full feedback examples (score levels 1, 3, and 6) for readers to evaluate independently. Both models produced structured feedback with specific textual references, and the next-step learning tasks targeted identifiable skill gaps. Whether this reflects genuine assessment capability or patterns associated with memorised score levels cannot be determined from this data.