Methods — AlignED Report 2

Overview

Dimension	Detail
Models	Gemini 3.1 Pro, Gemini 3 Flash
Essay corpora	ETS GRE samples (6 essays), PERSUADE 2.0 (12 essays)
Rubrics	ETS GRE 6-point holistic (for ETS essays), PERSUADE/ASAP 6-point holistic (for PERSUADE essays)
Assessment dimensions	Scoring (holistic score + justification), Feedback (strengths + areas for improvement), Next step (targeted learning task)
Contamination tests	Verbatim recall (text reproduction), Score recall (score guessing without rubric)
Temperature	0 (to minimise sampling variability; hosted models are not guaranteed to be fully deterministic)
Output format	Structured JSON

Models

We tested two Gemini models from Google:

Model	Model ID	Tier
Gemini 3.1 Pro	gemini-3.1-pro-preview	Frontier
Gemini 3 Flash	gemini-3-flash-preview	Lightweight

Only two models were tested. We chose Gemini models because the google-genai SDK supports structured JSON output, which reduced the need for output parsing. Extending this approach to other model families (Claude, GPT, open-weight models) is a natural next step.

Essay corpora

ETS GRE sample essays (high contamination risk)

Six essays from the ETS "Analyse an Argument" task, one at each score level (1 through 6). These essays are published on the ETS website with scores and detailed rater commentary. They appear widely in GRE preparation materials and are almost certainly present in the training data of any model trained on web-scale corpora.

All six essays respond to the same argument prompt about Mason City's riverside recreational facilities. ETS provides one sample response at each score level with official rater commentary.

PERSUADE 2.0 essays (lower contamination risk)

Twelve essays from the PERSUADE 2.0 corpus (Crossley et al., 2024), two at each score level (1 through 6). PERSUADE 2.0 is a research dataset of student persuasive writing, published for automated essay scoring research. While the corpus is publicly available, its essays are less likely to appear in general web training data than the widely reproduced ETS samples.

The PERSUADE essays respond to a different prompt ("Seeking multiple opinions") and were scored by trained human raters on the PERSUADE/ASAP 6-point holistic scale.

Rubrics

Each corpus was scored using its own rubric. The ETS essays were scored against the ETS GRE "Analyse an Argument" rubric, which emphasises identification and examination of assumptions, logical organisation, and language control. The PERSUADE essays were scored against the PERSUADE/ASAP holistic rubric, which emphasises development of a point of view, critical thinking, use of evidence, and persuasive writing conventions. Both rubrics use a 6-point scale, but the criteria differ to match the genre and task. Full rubric text for both scales is in the Appendices.

Corpus comparability. The two corpora differ in prompt, genre, student population, and rubric. Scoring accuracy cannot be directly compared across corpora because the rubrics assess different writing competencies. The contamination comparison relies on the score recall test (which does not involve the rubric at all) rather than on raw accuracy differences.

Assessment prompts

Each essay was assessed using three separate prompts, run sequentially. Separating the prompts is intended to prevent the assigned score from anchoring the feedback. The prompt structure was identical for both corpora, but the rubric and task framing were adapted to match each corpus. Full prompt text is in the Appendices.

Prompt 1: Scoring

The model receives the essay, the writing prompt, and the full 6-point rubric for the relevant corpus. It returns a holistic score (1-6) and a justification referencing specific rubric criteria.
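As a minimal sketch of how a structured scoring response could be checked before analysis, the snippet below parses one JSON object and validates it. The field names `score` and `justification`, and the helper `parse_scoring_response`, are assumptions made for illustration; the actual schema is given with the prompts in the Appendices.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringResult:
    score: int          # holistic score on the 6-point scale
    justification: str  # rationale referencing rubric criteria


def parse_scoring_response(raw: str) -> ScoringResult:
    """Parse and validate one scoring response (field names are assumed)."""
    data = json.loads(raw)
    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 6:
        raise ValueError(f"score must be an integer from 1 to 6, got {score!r}")
    justification = data["justification"]
    if not isinstance(justification, str) or not justification.strip():
        raise ValueError("justification must be a non-empty string")
    return ScoringResult(score=score, justification=justification)


# Illustrative input, not a real model response.
example = parse_scoring_response(
    '{"score": 4, "justification": "Identifies two key assumptions."}'
)
print(example.score)
```

Rejecting out-of-range or non-integer scores at parse time keeps malformed responses out of the agreement statistics rather than silently coercing them.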

ETS version: Framed as a GRE "Analyse an Argument" task. Includes the ETS rubric and the argument prompt.

PERSUADE version: Framed as a persuasive writing task. Includes the PERSUADE/ASAP rubric and the writing prompt.

Prompt 2: Feedback

The model receives the essay and prompt (but not the score it previously assigned). It returns three specific strengths and three areas for improvement, each with concrete textual references.

ETS version: Asks for feedback on analytical reasoning.

PERSUADE version: Asks for feedback on persuasive writing, calibrated to a middle/high school student audience.

Prompt 3: Next step

The model receives the essay and prompt. It returns a single, concrete learning task targeting the student's most pressing skill gap.

ETS version: Targets analytical writing weaknesses.

PERSUADE version: Targets persuasive writing weaknesses, appropriately challenging for the student's level.

Contamination tests

Verbatim recall test (ETS only)

Each model was given the first sentence of each ETS essay and asked to write, from memory, what it thought the rest of the essay said. The model's continuation was then compared to the actual remainder of the essay (everything after the first sentence) using SequenceMatcher from Python's difflib module. Its ratio() method returns a similarity between 0 and 1, computed as 2M/T, where M is the number of characters in matching blocks and T is the total number of characters in both texts. A score of 1.0 means the model reproduced the remainder of the essay character-for-character; a score near 0 means the texts share almost nothing in common.
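The comparison can be sketched as follows. This is an illustration under stated assumptions, not the exact script used: the function name `continuation_similarity` is hypothetical, and `autojunk=False` is our choice for the sketch (the report does not state which setting was used).

```python
from difflib import SequenceMatcher


def continuation_similarity(model_continuation: str, actual_remainder: str) -> float:
    """Return a character-level similarity ratio in [0, 1]."""
    # autojunk=False switches off difflib's heuristic that ignores very
    # frequent characters in long inputs (an assumption for this sketch).
    matcher = SequenceMatcher(None, model_continuation, actual_remainder, autojunk=False)
    # ratio() is 2*M/T: M matched characters, T total characters in both texts.
    return matcher.ratio()


# "bcd" is shared, so M = 3 and T = 8, giving 2 * 3 / 8.
print(continuation_similarity("abcd", "bcde"))  # 0.75
```

Because the ratio is computed over characters, two texts on the same topic will usually share some short fragments, so small non-zero values are expected even without memorisation.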

We classified similarity scores into three bands: 0.7 or above (high, likely memorised), 0.4 to below 0.7 (moderate, partial recall), and below 0.4 (low, not memorised). These are researcher-defined thresholds, not established conventions. There is no standard rule of thumb for SequenceMatcher ratios in contamination testing comparable to, for example, Cohen's (1988) benchmarks for effect sizes. In practice, the threshold choice matters little here: both models averaged well below 0.04, an order of magnitude below the lowest cutoff of 0.4.
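The banding rule can be written as a small function. The name `recall_band` and the band labels are illustrative, and the sketch assumes that a score exactly on a boundary falls into the higher band.

```python
def recall_band(similarity: float) -> str:
    """Map a similarity ratio to the researcher-defined recall band."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError(f"similarity must be in [0, 1], got {similarity!r}")
    if similarity >= 0.7:
        return "high"      # likely memorised
    if similarity >= 0.4:
        return "moderate"  # partial recall
    return "low"           # not memorised


print(recall_band(0.04))  # low
```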

Score recall test (both corpora)

Each model was given an essay and asked to guess the published score without seeing the rubric. This tests whether the model has memorised the score-essay association from training data.

The score recall prompts were tailored to each corpus to avoid introducing framing confounds. The ETS prompt stated that the essay "was scored by ETS expert raters" and asked for the ETS-assigned score. The PERSUADE prompt stated that the essay came from "the PERSUADE 2.0 corpus" and "was scored by trained human raters on a 1-6 scale." Neither prompt included the rubric.

If a model recalls ETS scores at a higher rate than PERSUADE scores, this is consistent with contamination: the ETS essays (with their published scores) are more likely to be in training data.

Scoring and analysis

Assessment accuracy was measured using:

Exact match: Model score equals human/ETS score.
Within-one: Model score is within 1 point of human/ETS score. This measure is standard in essay scoring research, where adjacent scores are considered acceptable agreement.
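A minimal sketch of both measures, assuming scores are held in two parallel lists. The function name `agreement_rates` is hypothetical and the scores below are invented for illustration; they are not results from this study.

```python
def agreement_rates(model_scores: list[int], human_scores: list[int]) -> tuple[float, float]:
    """Return (exact-match rate, within-one rate) as proportions in [0, 1]."""
    if not model_scores or len(model_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(model_scores, human_scores))
    exact = sum(m == h for m, h in pairs) / len(pairs)
    within_one = sum(abs(m - h) <= 1 for m, h in pairs) / len(pairs)
    return exact, within_one


# Invented scores for six essays, one per score level.
exact, within_one = agreement_rates([1, 3, 3, 5, 5, 6], [1, 2, 3, 4, 5, 6])
print(f"exact={exact:.2f}, within_one={within_one:.2f}")
```

In this invented example four of six scores match exactly and all six fall within one point, which shows how the within-one rate can be high even when exact agreement is modest.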

Contamination was measured by comparing score recall rates across corpora. The contamination gap is the difference in exact score recall between ETS and PERSUADE essays, measured in percentage points.
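The contamination gap can be sketched as below. The function names are hypothetical and the (guessed, published) pairs are invented to show the arithmetic; they are not the study's data.

```python
def exact_recall_rate(pairs: list[tuple[int, int]]) -> float:
    """Proportion of (guessed, published) pairs where the guess is exact."""
    if not pairs:
        raise ValueError("at least one essay is required")
    return sum(guess == published for guess, published in pairs) / len(pairs)


def contamination_gap(ets_pairs: list[tuple[int, int]],
                      persuade_pairs: list[tuple[int, int]]) -> float:
    """ETS exact recall minus PERSUADE exact recall, in percentage points."""
    return (exact_recall_rate(ets_pairs) - exact_recall_rate(persuade_pairs)) * 100


# Invented guesses: 3 of 6 ETS scores and 3 of 12 PERSUADE scores are exact.
ets = [(1, 1), (2, 2), (3, 3), (5, 4), (4, 5), (5, 6)]
persuade = [(1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 3),
            (3, 4), (3, 4), (4, 5), (4, 5), (5, 6), (5, 6)]
print(contamination_gap(ets, persuade))  # 25.0
```

A positive gap means ETS scores were recalled more often than PERSUADE scores. With only 6 and 12 essays, each ETS essay shifts the ETS rate by about 17 percentage points, so small gaps should be read cautiously.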
