Results — AlignED Report 2

ETS scoring accuracy

Model scores vs. ETS expert scores on 6 GRE sample essays

ETS essay scores: Human vs. model

6-point holistic scale. Each essay has one ETS score and two model scores.

Observations:

Both models scored all 6 essays within one point of the ETS score (100% within-one agreement).
Pro matched the ETS score exactly on 2 of 6 essays (33%); Flash matched exactly on 4 of 6 (67%).
Both models scored the score-5 essay as 6, and Pro scored the score-6 essay as 5. Model scores cluster in the mid-to-high range.
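
The two agreement measures used in these observations, exact match and within-one, can be computed as in the minimal Python sketch below. The function name `agreement` is hypothetical, and the example scores are illustrative placeholders, not the study's per-essay data.

```python
def agreement(reference, predicted):
    """Return (exact_rate, within_one_rate) for two equal-length score lists."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("score lists must be non-empty and of equal length")
    # Absolute difference between the reference score and the model score
    diffs = [abs(r - p) for r, p in zip(reference, predicted)]
    exact = sum(d == 0 for d in diffs) / len(diffs)
    within_one = sum(d <= 1 for d in diffs) / len(diffs)
    return exact, within_one


# Illustrative scores only (assumption): NOT the per-essay data from this report.
reference_scores = [2, 3, 4, 5]
model_scores = [2, 4, 4, 3]

exact, within_one = agreement(reference_scores, model_scores)
print(f"exact: {exact:.0%}, within-one: {within_one:.0%}")
```

Within-one agreement always includes exact matches, so it can never be lower than the exact-match rate.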

PERSUADE scoring accuracy

Model scores vs. human scores on 12 PERSUADE 2.0 essays

PERSUADE essay scores: Human vs. model

12 essays, two per score level. Same rubric applied by both models.

Observations:

Pro matched the human score exactly on 3 of 12 essays (25%); Flash matched exactly on 5 of 12 (42%).
Within-one agreement: Pro 7 of 12 (58%), Flash 8 of 12 (67%).
Both models systematically underscored higher-quality essays. The score-5 and score-6 essays received model scores of 3-5, suggesting a compression effect at the top of the scale.
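
One way to check for compression at the top of the scale is to compute the mean signed error (model score minus human score) at each human score level. The sketch below is an assumption about how such a check could be done, not the analysis used in this report; the function name and the example scores are hypothetical.

```python
from collections import defaultdict


def mean_signed_error_by_level(human, model):
    """Map each human score level to the mean of (model - human) at that level."""
    if len(human) != len(model):
        raise ValueError("score lists must be of equal length")
    errors = defaultdict(list)
    for h, m in zip(human, model):
        errors[h].append(m - h)
    # Negative values mean the model scored below the human rater at that level
    return {level: sum(v) / len(v) for level, v in sorted(errors.items())}


# Illustrative scores only (assumption): NOT the per-essay data from this report.
human_scores = [1, 1, 3, 3, 6, 6]
model_scores = [2, 1, 3, 3, 4, 5]

print(mean_signed_error_by_level(human_scores, model_scores))
```

A pattern of increasingly negative values at the highest human score levels would be consistent with the compression effect described above.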

The contamination signal

Score recall rates compared across corpora

Score recall: ETS vs. PERSUADE

Can models guess the published score without seeing the rubric? Higher recall on ETS than on PERSUADE would suggest score memorisation.

Observations:

Both models recalled ETS scores at 67% exact match (4 of 6 essays).
PERSUADE score recall was lower: Pro 33% (4 of 12), Flash 42% (5 of 12).
The contamination gap in exact recall (ETS minus PERSUADE): Pro 33 percentage points, Flash 25 percentage points.
Within-one recall on ETS was 100% for both models. Within-one on PERSUADE: Pro 92% (11 of 12), Flash 67% (8 of 12).
On PERSUADE, Pro's within-one recall exceeds Flash's by 25 percentage points (92% vs 67%). This may reflect stronger general scoring ability, partial familiarity with the corpus, or both. The small sample does not allow us to distinguish these explanations.
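
The exact-recall gap can be recomputed directly from the match counts reported above. The sketch below uses only those counts; the dictionary layout and the function name `recall_gap_points` are hypothetical. Working from raw counts rather than rounded percentages avoids compounding rounding error, which can shift a gap by about a point.

```python
# Exact-recall counts as reported in the observations: (matches, essays)
recall_counts = {
    "Pro": {"ETS": (4, 6), "PERSUADE": (4, 12)},
    "Flash": {"ETS": (4, 6), "PERSUADE": (5, 12)},
}


def recall_gap_points(counts):
    """ETS exact-recall rate minus PERSUADE exact-recall rate, in percentage points."""
    ets_hits, ets_n = counts["ETS"]
    persuade_hits, persuade_n = counts["PERSUADE"]
    return 100 * (ets_hits / ets_n - persuade_hits / persuade_n)


for model, counts in recall_counts.items():
    print(f"{model}: {recall_gap_points(counts):.1f} percentage points")
```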

Verbatim recall

Can models reproduce the essay text from memory?

Neither model could reproduce ETS essay text. When given the first sentence and asked to continue, both models produced low-similarity text (average similarity < 0.04, well below the 0.4 threshold for even partial recall). Both models recognised the essays as known ETS samples and stated this explicitly, but could not reproduce the actual text.

Model	Avg. similarity	High contamination count	Verdict
Gemini 3.1 Pro	0.039	0 of 6	No verbatim memorisation
Gemini 3 Flash	0.030	0 of 6	No verbatim memorisation

Both models identified the ETS essays by name as known samples, yet they recalled only the published scores, not the words. Score memorisation and text memorisation therefore appear to be separate phenomena.
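
The report does not state which similarity metric was used for the continuation test. As a rough illustration only, the sketch below assumes a word-level ratio from Python's standard `difflib.SequenceMatcher` as a stand-in metric and applies the 0.4 threshold quoted above. The function names and the placeholder texts are hypothetical.

```python
from difflib import SequenceMatcher

PARTIAL_RECALL_THRESHOLD = 0.4  # threshold for partial recall quoted in the text


def continuation_similarity(original: str, generated: str) -> float:
    """Word-level similarity in [0, 1] between the true and the generated continuation.

    Assumption: SequenceMatcher.ratio() is a stand-in, not the report's metric.
    """
    original_words = original.lower().split()
    generated_words = generated.lower().split()
    # autojunk=False disables the popularity heuristic applied to long sequences
    matcher = SequenceMatcher(None, original_words, generated_words, autojunk=False)
    return matcher.ratio()


def suggests_recall(original: str, generated: str,
                    threshold: float = PARTIAL_RECALL_THRESHOLD) -> bool:
    """True if the similarity reaches the partial-recall threshold."""
    return continuation_similarity(original, generated) >= threshold


# Placeholder texts (assumption): not taken from any ETS essay.
true_text = "alpha beta gamma delta"
model_text = "epsilon zeta eta theta"
print(continuation_similarity(true_text, model_text), suggests_recall(true_text, model_text))
```

Under this stand-in metric, a model that recognises an essay but paraphrases from scratch would still score near zero, which matches the separation between recognition and verbatim recall described above.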

Feedback samples

Selected examples for independent evaluation

A systematic evaluation of feedback quality was not part of this study. The Appendices include three full feedback examples (score levels 1, 3, and 6) for readers to evaluate independently. Both models produced structured feedback with specific textual references, and the next-step learning tasks targeted identifiable skill gaps. Whether this reflects genuine assessment capability or patterns associated with memorised score levels cannot be determined from this data.

Feedback and contamination. Even if scoring accuracy is inflated by score memorisation, the feedback itself may still be valuable. Feedback quality is a separate question from scoring accuracy. This study did not test whether feedback is influenced by contamination.
