Discussion — AlignED Report 2

Three takeaways

1. Score recall works as a contamination signal

When asked to guess published scores without the rubric, both models recalled ETS scores at a higher rate than PERSUADE scores. The gap was 34 percentage points for Pro and 25 percentage points for Flash. This pattern is consistent with score memorisation from training data: the ETS essays and their scores are widely published and almost certainly present in the models' training corpora.

Score recall is a practical contamination test. It requires no access to training data, no access to model weights, and produces an interpretable signal. Any researcher can run it.

2. The gap suggests inflated accuracy on known corpora

Because the two corpora use different rubrics, raw scoring accuracy cannot be compared directly between them. The score recall test avoids this problem: it removes the rubric entirely and asks whether the model can guess the published score from the essay alone.

The fact that models recalled ETS scores at a higher rate than PERSUADE scores (without any rubric) suggests that at least some of the scoring accuracy on ETS essays may come from memorisation rather than rubric application. We cannot quantify how much, but the signal is present.

3. Assessment capability may exist but is hard to separate from memorisation

Both models produced detailed, rubric-aligned feedback with specific textual references. They identified concrete strengths and weaknesses in student writing. The next-step learning tasks targeted identifiable skill gaps. This is consistent with assessment capability beyond simple score retrieval.

But on scoring specifically, memorisation and competent assessment can produce identical outputs. A model that recalls a score and a model that derives the same score both return the same number, so the output alone cannot tell them apart. Separating the two requires testing on corpora the model has not seen, which is what the PERSUADE comparison begins to do.

Limitations

Two models. Only Gemini 3.1 Pro and Gemini 3 Flash were tested. Contamination patterns may differ across model families, training data compositions, and model sizes.
Small samples. Six ETS essays and twelve PERSUADE essays. The score recall gap points in the same direction for both models, but the samples are too small to support meaningful statistical significance testing.
PERSUADE is not provably uncontaminated. The PERSUADE 2.0 corpus is publicly available. We assume lower contamination risk because it is less widely reproduced than ETS samples, but we cannot confirm it is absent from training data.
Single genre. All essays are argumentative/persuasive writing. The contamination effect may differ for narrative, expository, or other genres.
No rubric variation. Each corpus was scored with its own rubric. Testing with alternative or mismatched rubrics would help isolate whether models apply the given rubric or default to memorised scoring patterns.
Single run. Temperature was set to 0 for reproducibility, so each essay was scored once per model. We did not test robustness across multiple runs or temperature settings.

A proposed contamination testing framework

Based on this proof-of-concept, we propose four steps for contamination testing in AI assessment research:

Step 1: Run a score recall test

Give the model an essay and ask it to guess the published score without the rubric. Compare recall rates across corpora with different contamination risks.
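
A minimal sketch of the comparison in Step 1 follows. It assumes that "recall" means an exact match between the model's guess and the published score, which is our simplification for illustration; the function names and the records are hypothetical placeholders, not data from this report.

```python
from collections import defaultdict


def recall_rates(records):
    """Return {corpus: share of essays whose guessed score equals the published score}.

    Each record is a (corpus, published_score, guessed_score) tuple.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for corpus, published, guessed in records:
        totals[corpus] += 1
        if guessed == published:  # assumption: exact match counts as recall
            hits[corpus] += 1
    return {corpus: hits[corpus] / totals[corpus] for corpus in totals}


def recall_gap(rates, higher_risk, lower_risk):
    """Difference in recall between two corpora, in percentage points."""
    return 100.0 * (rates[higher_risk] - rates[lower_risk])


# Hypothetical placeholder records, NOT results from this report.
records = [
    ("ETS", 4, 4),
    ("ETS", 5, 5),
    ("ETS", 3, 4),
    ("ETS", 6, 6),
    ("PERSUADE", 3, 3),
    ("PERSUADE", 4, 2),
    ("PERSUADE", 5, 4),
    ("PERSUADE", 2, 3),
]
rates = recall_rates(records)
gap = recall_gap(rates, "ETS", "PERSUADE")
print(rates, gap)
```

Reporting the gap in percentage points, rather than as a ratio, keeps it readable when one corpus has a recall rate near zero.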

Step 2: Run a verbatim recall test

Give the model the first sentence of an essay and ask it to continue the text. Measure the sequence similarity between the model's continuation and the original (e.g., using Python's difflib.SequenceMatcher). This distinguishes text memorisation from score memorisation.
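
One way to score a continuation is sketched below using `difflib.SequenceMatcher` from the Python standard library. The helper names, the whitespace and case normalisation, and the naive first-sentence split are assumptions made for illustration; the report does not specify these details.

```python
from difflib import SequenceMatcher


def split_first_sentence(essay):
    """Naively split an essay at the first '. ' into (first sentence, remainder)."""
    head, sep, tail = essay.partition(". ")
    return head + sep.strip(), tail


def verbatim_similarity(original_continuation, model_continuation):
    """Character-level similarity in [0, 1] after normalising case and whitespace."""
    a = " ".join(original_continuation.split()).lower()
    b = " ".join(model_continuation.split()).lower()
    # autojunk=False turns off a speed heuristic that can distort ratios on long texts.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical strings for illustration only.
prompt, remainder = split_first_sentence("Schools should start later. Students need sleep.")
score = verbatim_similarity(remainder, "Students  need SLEEP.")
print(prompt, score)
```

A ratio close to 1 indicates a near-verbatim continuation. What threshold counts as memorisation is a judgement call that should be stated alongside the results.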

Step 3: Include a lower-risk corpus

Test on at least one corpus that is less likely to be in training data. Compare scoring accuracy and recall rates across corpora. If accuracy drops on the lower-risk corpus, contamination may explain the difference, although differences in rubric and essay set should be ruled out before drawing that conclusion.
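
The per-corpus comparison can be tabulated as sketched below, with scoring accuracy (rubric given) next to score recall (no rubric). The `EssayResult` fields, the exact-match definition of both metrics, and the corpus labels are hypothetical choices for illustration, not the procedure used in this report.

```python
from dataclasses import dataclass


@dataclass
class EssayResult:
    corpus: str
    published: int               # published human score
    scored_with_rubric: int      # model score when given the rubric
    guessed_without_rubric: int  # model guess in the score recall test


def summarise(results):
    """Return {corpus: {"n", "accuracy", "recall"}} using exact-match agreement."""
    counts = {}
    for r in results:
        row = counts.setdefault(r.corpus, {"n": 0, "accurate": 0, "recalled": 0})
        row["n"] += 1
        row["accurate"] += int(r.scored_with_rubric == r.published)
        row["recalled"] += int(r.guessed_without_rubric == r.published)
    return {
        corpus: {
            "n": row["n"],
            "accuracy": row["accurate"] / row["n"],
            "recall": row["recalled"] / row["n"],
        }
        for corpus, row in counts.items()
    }


# Hypothetical placeholder results, NOT data from this report.
results = [
    EssayResult("higher-risk", 4, 4, 4),
    EssayResult("higher-risk", 3, 3, 2),
    EssayResult("lower-risk", 5, 4, 3),
    EssayResult("lower-risk", 2, 2, 1),
]
summary = summarise(results)
for corpus, row in summary.items():
    print(corpus, row)
```

Keeping `n` in each row makes the sample size visible wherever the accuracy and recall figures are reported.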

Step 4: Report contamination results alongside accuracy

Any paper reporting AI scoring accuracy should also report score recall rates. This allows readers to judge whether accuracy is genuine or inflated.

What we claim and what we do not

We claim	We do not claim
Both models recalled ETS scores at a higher rate than PERSUADE scores. This pattern is consistent with data contamination.	We do not claim to have proven data contamination. Proof would require access to training data, which is not publicly available.
Score recall is a practical, accessible contamination test that any researcher can run.	We do not claim that score recall is a complete or sufficient test. Other contamination signals may exist.
Assessment accuracy on well-known corpora may be inflated by score memorisation.	We do not claim that LLMs cannot assess essays. Genuine capability and memorisation may coexist.
Contamination testing should be standard practice in AI assessment research.	We do not claim that existing research findings are invalid. We argue they are incomplete without contamination data.

Future directions

More models. Test Claude, GPT-4, Llama, Mistral, and other model families. Contamination patterns likely vary with training data composition.
More essays. Scale up to 50-100 essays per corpus for statistical power. Include essays from multiple prompts and genres.
Rubric variation. Test whether models produce different scores when given alternative rubrics for the same essay. If scores remain stable regardless of rubric, this would suggest the model is defaulting to memorised scoring patterns rather than applying the rubric it was given.
Feedback evaluation. Systematically assess feedback quality using teacher judgement or student outcomes. If feedback is genuinely useful regardless of scoring contamination, this has practical implications.
Temporal analysis. Compare score recall for essays published at different dates. If recall reflects memorisation, essays published before a model's training cutoff should show higher recall than those published after it.
