Appendices

A. Scoring Rubrics

ETS GRE "Analyse an Argument" Rubric

Used for scoring the 6 ETS sample essays.

Score	Label	Description
6	Outstanding	Clearly identifies important features of the argument and analyses them insightfully. Develops ideas cogently, organises them logically, and connects them with clear transitions. Demonstrates superior facility with the conventions of standard written English but may have minor errors.
5	Strong	Clearly identifies important features of the argument and analyses them in a generally thoughtful way. Develops ideas clearly, organises them logically, and connects them with appropriate transitions. Demonstrates facility with the conventions of standard written English but may have minor errors.
4	Adequate	Identifies and analyses important features of the argument. Develops and organises ideas satisfactorily but may not connect them with transitions. Demonstrates sufficient control of language to express ideas with reasonable clarity. Generally follows the conventions of standard written English but may have some errors.
3	Limited	Does not identify or analyse most of the important features of the argument, though some analysis of the argument is present. Mainly analyses tangential or irrelevant matters, or reasons poorly. Limited in the logical development and organisation of ideas. Offers support of little relevance and value. Has problems in the use of language and in sentence structure. Contains occasional major errors or frequent minor errors in grammar, usage, and mechanics.
2	Seriously Flawed	Does not present a critique based on logical analysis, but may instead present the writer's own views on the subject. Does not develop ideas, or is poorly organised. Has serious and frequent problems in the use of language and sentence structure. Contains numerous errors in grammar, usage, or mechanics that interfere with meaning.
1	Fundamentally Deficient	Provides little evidence of the ability to understand and analyse the argument. Provides little evidence of the ability to develop an organised response. Has severe and pervasive errors. Largely incoherent.

PERSUADE/ASAP 2.0 Holistic Rubric

Used for scoring the 12 PERSUADE 2.0 essays.

Score	Label	Description
6	Clear and Consistent Mastery	Effectively and insightfully develops a point of view on the issue and demonstrates outstanding critical thinking. Uses clearly appropriate examples, reasons, and other evidence to support its position. Is well organised and clearly focused, demonstrating clear coherence and smooth progression of ideas. Exhibits skillful use of language, using a varied, accurate, and apt vocabulary and demonstrates meaningful variety in sentence structure. Is free of most errors in grammar, usage, and mechanics.
5	Reasonably Consistent Mastery	Effectively develops a point of view on the issue and demonstrates strong critical thinking. Generally uses appropriate examples, reasons, and other evidence to support its position. Is well organised and focused, demonstrating coherence and progression of ideas. Exhibits facility in the use of language, using appropriate vocabulary and demonstrates variety in sentence structure. Is generally free of most errors in grammar, usage, and mechanics.
4	Adequate Mastery	Develops a point of view on the issue and demonstrates competent critical thinking. Uses adequate examples, reasons, and other evidence to support its position. Is generally organised and focused, demonstrating some coherence and progression of ideas. May demonstrate inconsistent facility in the use of language, using generally appropriate vocabulary and some variety in sentence structure. May have some errors in grammar, usage, and mechanics.
3	Developing Mastery	Develops a point of view on the issue, demonstrating some critical thinking, but may do so inconsistently or use inadequate examples, reasons, or other evidence to support its position. Is limited in its organisation or focus, or may demonstrate some lapses in coherence or progression of ideas. May demonstrate facility in the use of language, but sometimes uses weak vocabulary or inappropriate word choice and/or lacks variety or demonstrates problems in sentence structure. May contain an accumulation of errors in grammar, usage, and mechanics.
2	Little Mastery	Develops a point of view on the issue that is vague or seriously limited, and demonstrates weak critical thinking. Provides inappropriate or insufficient examples, reasons, or other evidence to support its position. Is poorly organised and/or focused, or demonstrates serious problems with coherence or progression of ideas. Displays very little facility in the use of language, using very limited vocabulary or incorrect word choice and/or demonstrates frequent problems in sentence structure. Contains errors in grammar, usage, and mechanics so serious that meaning is somewhat obscured.
1	Very Little or No Mastery	Develops no viable point of view on the issue, or provides little or no evidence to support its position. Is disorganised or unfocused, resulting in a disjointed or incoherent essay. Displays fundamental errors in vocabulary and/or demonstrates severe flaws in sentence structure. Contains pervasive errors in grammar, usage, or mechanics that persistently interfere with meaning.

B. Full ETS scoring results

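All 6 ETS essays with expert and model scores. "Pro" is Gemini 3.1 Pro and "Flash" is Gemini 3 Flash; an "Exact" column reads Yes when the model score equals the ETS score.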

Essay	ETS Score	Pro Score	Flash Score	Pro Exact	Flash Exact
arg_score6	6	5	5	No	No
arg_score5	5	6	6	No	No
arg_score4	4	3	4	No	Yes
arg_score3	3	2	3	No	Yes
arg_score2	2	2	2	Yes	Yes
arg_score1	1	1	1	Yes	Yes

Pro: 2 of 6 exact (33%), 6 of 6 within-one (100%). Flash: 4 of 6 exact (67%), 6 of 6 within-one (100%).
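
The exact and within-one figures above can be recomputed directly from the table. The following is a minimal Python sketch; the function name `agreement` and the list layout are illustrative assumptions, not taken from the study's evaluation scripts.

```python
from typing import Sequence


def agreement(reference: Sequence[int], predicted: Sequence[int]) -> tuple[int, int]:
    """Return (exact, within_one) match counts for paired integer scores."""
    if len(reference) != len(predicted):
        raise ValueError("score lists must have the same length")
    exact = sum(r == p for r, p in zip(reference, predicted))
    within_one = sum(abs(r - p) <= 1 for r, p in zip(reference, predicted))
    return exact, within_one


# Scores copied from the ETS table above, in row order arg_score6 .. arg_score1.
ets = [6, 5, 4, 3, 2, 1]
pro = [5, 6, 3, 2, 2, 1]
flash = [5, 6, 4, 3, 2, 1]

for name, scores in (("Pro", pro), ("Flash", flash)):
    exact, within_one = agreement(ets, scores)
    n = len(ets)
    print(f"{name}: {exact} of {n} exact ({exact / n:.0%}), "
          f"{within_one} of {n} within-one ({within_one / n:.0%})")
```

"Within-one" here means the absolute difference between the model score and the reference score is at most 1, so every exact match also counts as within-one.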

C. Full PERSUADE scoring results

All 12 PERSUADE 2.0 essays with human and model scores.

Essay	Human Score	Pro Score	Flash Score	Pro Exact	Flash Exact
persuade_score1_a	1	2	1	No	Yes
persuade_score1_b	1	1	1	Yes	Yes
persuade_score2_a	2	3	3	No	No
persuade_score2_b	2	3	3	No	No
persuade_score3_a	3	2	2	No	No
persuade_score3_b	3	3	3	Yes	Yes
persuade_score4_a	4	3	3	No	No
persuade_score4_b	4	4	4	Yes	Yes
persuade_score5_a	5	3	3	No	No
persuade_score5_b	5	4	4	No	No
persuade_score6_a	6	3	4	No	No
persuade_score6_b	6	5	5	No	No

Pro: 3 of 12 exact (25%), 10 of 12 within-one (83%). Flash: 4 of 12 exact (33%), 10 of 12 within-one (83%).

D. Contamination test details

Score recall: ETS essays

Essay	ETS Score	Pro Guess	Pro Exact	Flash Guess	Flash Exact
arg_score6	6	5	No	6	Yes
arg_score5	5	5	Yes	6	No
arg_score4	4	3	No	3	No
arg_score3	3	3	Yes	3	Yes
arg_score2	2	2	Yes	2	Yes
arg_score1	1	1	Yes	1	Yes

Both models: 4 of 6 exact (67%), 6 of 6 within-one (100%).

Score recall: PERSUADE essays

Essay	Human Score	Pro Guess	Pro Exact	Flash Guess	Flash Exact
persuade_score1_a	1	1	Yes	1	Yes
persuade_score1_b	1	1	Yes	1	Yes
persuade_score2_a	2	3	No	2	Yes
persuade_score2_b	2	3	No	2	Yes
persuade_score3_a	3	2	No	2	No
persuade_score3_b	3	3	Yes	2	No
persuade_score4_a	4	3	No	3	No
persuade_score4_b	4	4	Yes	4	Yes
persuade_score5_a	5	4	No	3	No
persuade_score5_b	5	4	No	3	No
persuade_score6_a	6	4	No	3	No
persuade_score6_b	6	5	No	4	No

Pro: 4 of 12 exact (33%), 11 of 12 within-one (92%). Flash: 5 of 12 exact (42%), 8 of 12 within-one (67%).

Verbatim recall: ETS essays

Essay	Pro Similarity	Flash Similarity	Verdict
arg_score6	0.009	0.016	Not memorised
arg_score5	0.012	0.035	Not memorised
arg_score4	0.015	0.018	Not memorised
arg_score3	0.006	0.007	Not memorised
arg_score2	0.039	0.033	Not memorised
arg_score1	0.154	0.071	Not memorised

All similarity scores are well below 0.5; the highest is 0.154 (Pro, arg_score1). No verbatim memorisation was detected for either model. Both models recognised the essays as ETS samples and stated this in their responses, but could not reproduce the text.
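
This appendix does not state which similarity measure produced the scores above. As an illustration only, the sketch below uses Python's standard-library `difflib.SequenceMatcher` on word sequences as a stand-in, and treats the 0.5 figure mentioned above as a cut-off; both choices are assumptions, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

# Assumption: 0.5 is used as the memorisation cut-off, following the text above.
MEMORISATION_THRESHOLD = 0.5


def similarity(original: str, continuation: str) -> float:
    """Similarity in [0, 1] between two texts, compared word by word."""
    a = original.lower().split()
    b = continuation.lower().split()
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def verdict(score: float) -> str:
    """Map a similarity score to the verdict labels used in the table."""
    return "Memorised" if score >= MEMORISATION_THRESHOLD else "Not memorised"


# The held-out remainder of an essay is compared with the model's continuation.
held_out = "the survey may not represent all residents of the city"
generated = "the author fails to show that cleaning will increase use"
print(verdict(similarity(held_out, generated)))
```

A different measure (for example n-gram overlap) would give different absolute values, so scores from this sketch are not comparable with the table.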

E. Sample feedback and next-step outputs

Three examples from ETS essays at score levels 1, 3, and 6. Outputs from both models are shown for each essay; for the score 1 essay, only the justification and next step are shown.

Score 6 essay (arg_score6) — Gemini 3.1 Pro feedback

Score assigned: 5

Justification: The essay clearly identifies and thoughtfully examines key assumptions in the argument, such as the validity of the survey, the causal link between the river's current state and its lack of use, and the feasibility of the proposed clean-up. Ideas are developed logically with appropriate transitions and supported by clever examples, such as the possibility of natural sulfur deposits causing the smell. The writer demonstrates strong control of language and conventions, despite a few minor errors, aligning perfectly with the criteria for a strong score of 5.

Strengths

Deconstruction of the survey's validity: effectively challenges the assumption that the survey accurately represents the broader population.
Interrogating causal relationships: questions the assumed link between water quality and lack of use.
Creative evaluation of the proposed solution: the distinction between anthropogenic pollution and natural mineral deposits is highly insightful.

Areas for improvement

Address the temporal assumption regarding the budget timeline.
Examine the necessity of riverside facilities specifically.
Explicitly connect unwarranted assumptions to their implications.

Next step

Beyond the Survey: Uncovering Structural Assumptions. Re-read the argument and identify two major assumptions unrelated to survey validity or complaint volume. Write one new body paragraph (150-200 words) analysing one of these structural assumptions.

Score 6 essay (arg_score6) — Gemini 3 Flash feedback

Score assigned: 5

Justification: The essay provides a strong critique identifying key assumptions regarding survey validity, the causal link between water quality and usage, and the potential effectiveness of the cleanup. The analysis is logically organised and features insightful points, such as the distinction between man-made pollution and natural odours. The writing is clear and fluent with good sentence variety, though it lacks the exhaustive development and rhetorical sophistication required for a score of 6.

Strengths

Scrutinised the survey evidence by questioning its methodology and sample representativeness.
Nuanced analysis of smell and water quality, distinguishing remediable from non-remediable causes.
Identified potential vocal minority bias regarding the complaints.

Areas for improvement

Examine the logical leap between a clean river and the need for facilities.
Consider the timing assumption: recent announcement vs. this year's budget.
Explore alternative reasons for lack of river usage beyond water quality.

Next step

The Counter-Explanation Table. Identify three physical or logistical reasons unrelated to water quality why the river might be unused. For each, write a three-sentence analysis explaining why the clean-up would fail to increase usage.

Score 3 essay (arg_score3) — Gemini 3.1 Pro feedback

Score assigned: 2

Justification: The essay largely fails to present a critique based on logical analysis, instead offering the writer's own speculative views. It demonstrates a misunderstanding of the argument by confusing the state's plan with the city's budget. The response is poorly organised into a single, disjointed paragraph and lacks meaningful development.

Strengths

Identified that desire for a clean river does not equal desire to swim in it.
Questioned survey validity at the start of the essay.
Observed the timeline issue: cleanup "might take years."

Areas for improvement

Explicitly connect assumptions to the prompt's conclusion.
Avoid drifting into policy arguments using outside information.
Organise into distinct paragraphs, each focused on one assumption.

Next step

Mapping Logical Gaps. Identify three unstated assumptions using the template: "The author assumes that [X], but it is possible that [Y]." Focus strictly on logical gaps, not policy opinions.

Score 3 essay (arg_score3) — Gemini 3 Flash feedback

Score assigned: 3

Justification: The essay identifies some relevant assumptions, such as the potential unreliability of the survey and that cleaning may not immediately increase usage. However, the analysis is limited and wanders into tangential territory. The development is thin, and the writing contains several mechanical errors.

Strengths

Correctly identified the survey representativeness issue.
Raised a valid concern about opportunity costs in the budget.
Distinguished between wanting a clean river and wanting to use it for recreation.

Areas for improvement

Develop each assumption into a full paragraph rather than listing them.
Stay focused on logical flaws rather than general policy concerns.
Proofread for spelling and grammar errors that impede clarity.

Next step

The Assumption Spotlight. Choose one assumption from your essay. Write a full paragraph (5-7 sentences) that names the assumption, explains why it is necessary for the argument, and describes what happens if it proves false.

Score 1 essay (arg_score1) — Gemini 3.1 Pro feedback

Score assigned: 1

Justification: The essay provides little evidence of the ability to critically analyse the argument. While the writer briefly mentions an assumption (that residents enjoy recreation), the rest presents unsupported personal assertions rather than logical analysis. The organisation is incoherent and the writing contains severe errors in grammar, usage, and mechanics.

Next step

Finding the Hidden Assumptions. Read the argument sentence by sentence. For each sentence, ask: "What does the author expect me to believe without proof?" Write down three such hidden beliefs. For each, write one sentence explaining why it might not be true.

Score 1 essay (arg_score1) — Gemini 3 Flash feedback

Score assigned: 1

Justification: The essay fails to develop a viable point of view on the argument. The attempt at analysis is superficial, consisting of a single restated assumption followed by assertions that lack any supporting evidence. The response is poorly organised, consists of a single undeveloped paragraph, and contains severe and pervasive grammatical errors.

Next step

The 'Because' Chain. Read the argument's conclusion and write: "The author assumes [X] because [Y]." Then ask: "But what if [Y] is wrong?" Write your answer in a complete sentence. Repeat for two more assumptions.

F. Prompt templates

Prompt templates used in this study. The scoring, feedback, and next-step prompts were used for assessment; the verbatim recall and score recall prompts were used for contamination testing. The ETS versions shown below are representative of the structure. PERSUADE versions used the same structure but with the PERSUADE/ASAP rubric, persuasive writing framing, and the PERSUADE writing prompt. The two score recall prompts are shown separately because their framing differs.

Scoring prompt

You are an expert writing assessor. Your task is to score a student
essay written for the GRE "Analyse an Argument" task.

## The Argument
[argument text]

## Task Instruction
[task instruction text]

## Scoring Rubric (ETS 6-Point Scale)
[full rubric text]

## Student Essay
[essay text]

## Your Task
Score this essay on a scale of 1 to 6 using the rubric above.
Consider:
- How well the student identifies and examines assumptions
- The quality and depth of analysis
- Organisation and logical flow
- Language use, sentence variety, and control of conventions

Respond with ONLY valid JSON:
{
  "score": <integer 1-6>,
  "justification": "<2-4 sentences>"
}
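
Because the scoring prompt above asks for JSON only, a reply can be checked before its score is recorded. The following is a minimal Python sketch of such a check; `parse_score_response` is a hypothetical helper, not part of the study's scripts, and it assumes the reply contains nothing but the JSON object.

```python
import json


def parse_score_response(raw: str) -> tuple[int, str]:
    """Validate a scoring reply and return (score, justification)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 6:
        raise ValueError(f"score must be an integer from 1 to 6, got {score!r}")
    justification = data.get("justification")
    if not isinstance(justification, str) or not justification.strip():
        raise ValueError("justification must be a non-empty string")
    return score, justification.strip()


reply = '{"score": 5, "justification": "Clear analysis with minor errors."}'
print(parse_score_response(reply))
```

Malformed JSON raises `json.JSONDecodeError`, which is a subclass of `ValueError`, so a caller can catch one exception type for both malformed and out-of-range replies.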

Feedback prompt

You are an expert writing teacher providing formative feedback
on a student essay.

[argument and essay text provided]

Provide process-oriented feedback. Focus on the student's
analytical reasoning process, not just the product.

Do NOT assign a score. Focus on:
1. Strengths — specific analytical moves that were effective
2. Areas for improvement — assumptions or logical gaps missed

Respond with ONLY valid JSON:
{
  "strengths": ["<strength 1>", "<strength 2>", "<strength 3>"],
  "areas_for_improvement": ["<area 1>", "<area 2>", "<area 3>"]
}

Next step prompt

You are an expert writing teacher designing a targeted practice
activity for a student.

[argument and essay text provided]

Design ONE specific, concrete practice task the student could
complete in 15-30 minutes. The task should:
- Target the student's most significant weakness
- Be actionable and specific
- Include a clear deliverable
- Be appropriately challenging

Respond with ONLY valid JSON:
{
  "task_title": "<short title>",
  "task_description": "<2-3 sentences>",
  "target_skill": "<specific skill targeted>"
}

Verbatim recall prompt (contamination test)

The following is the opening sentence of a student essay written
for the GRE "Analyse an Argument" task about [topic].

"[first sentence]"

Continue this essay from where it left off. Write the rest of the
essay as you believe it continues.
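
Filling the verbatim recall template above requires splitting an essay into its opening sentence and a held-out remainder. The sketch below shows one way to do that in Python. The template wording is copied from the prompt above; the sentence-splitting rule and the names `first_sentence` and `build_recall_prompt` are assumptions, since the appendix does not say how the opening sentence was extracted.

```python
import re

TEMPLATE = (
    "The following is the opening sentence of a student essay written\n"
    'for the GRE "Analyse an Argument" task about {topic}.\n'
    "\n"
    '"{first_sentence}"\n'
    "\n"
    "Continue this essay from where it left off. Write the rest of the\n"
    "essay as you believe it continues."
)


def first_sentence(essay: str) -> str:
    """Return text up to the first '.', '!' or '?' followed by whitespace."""
    text = essay.strip()
    match = re.search(r"(.+?[.!?])(?:\s|$)", text, flags=re.DOTALL)
    return match.group(1) if match else text


def build_recall_prompt(topic: str, essay: str) -> tuple[str, str]:
    """Return (prompt, held_out_remainder) for one essay."""
    opening = first_sentence(essay)
    remainder = essay.strip()[len(opening):].strip()
    return TEMPLATE.format(topic=topic, first_sentence=opening), remainder


prompt, held_out = build_recall_prompt(
    "a river clean-up", "The argument is flawed. It relies on a survey."
)
print(prompt)
```

The naive split treats any full stop followed by whitespace as a sentence end, so abbreviations such as "e.g." would cut the opening short; a real pipeline may need a proper sentence tokeniser.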

Score recall prompt — ETS version (contamination test)

The following is a student essay written for the GRE "Analyse an
Argument" task. This essay was scored by ETS expert raters on a
scale of 1 to 6.

[essay text]

Without any rubric or scoring guide, what score do you think ETS
assigned to this essay? Consider the overall quality of analysis,
organisation, and writing.

Respond with ONLY valid JSON:
{
  "ets_score": <integer 1-6>
}

Score recall prompt — PERSUADE version (contamination test)

The following student essay was written for a persuasive writing
task used in the PERSUADE 2.0 corpus, a large dataset of scored
student argumentative essays.

Writing prompt: [prompt text]

Essay:
[essay text]

This essay was scored by trained human raters on a 1-6 scale.
What score do you think was assigned?

Respond with ONLY a JSON object:
{
  "guessed_score": <integer 1-6>
}

G. Data access and citation

Data files

All data for this report is available in the GitHub repository:

data/ets_scoring.json — ETS essay scores (expert and model)
data/persuade_scoring.json — PERSUADE essay scores (human and model)
data/contamination.json — Score recall and verbatim recall comparison
data/feedback_samples.json — Full feedback outputs for selected essays

Source data and evaluation scripts are in the AlignED-Complex-Competencies-Assessment repository.

Citation

AlignED Report 2: Can LLMs Assess Complex Student Competencies?
A data contamination problem. March 2026.
https://trgallagher-research.github.io/AlignED-research-report-2/

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Crossley, S. A., Baffour, P., Tian, Y., Franklin, A., Benner, M., & Boser, U. (2024). A large-scale corpus for assessing written argumentation: PERSUADE 2.0. Assessing Writing, 61.
Jacovi, A., Caciularu, A., Goldman, O., & Goldberg, Y. (2023). Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 5075–5084.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education/Macmillan.
Sainz, O., Campos, J. A., Garcia-Ferrero, I., Etxaniz, J., de Lacalle, O. L., & Agirre, E. (2023). NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. Findings of EMNLP 2023.

Contact

Questions or feedback: github.com/trgallagher-research
