Appendices
Full data, prompts, and supplementary materials
A. Scoring Rubrics
ETS GRE "Analyse an Argument" Rubric
Used for scoring the 6 ETS sample essays.
| Score | Label | Description |
|---|---|---|
| 6 | Outstanding | Clearly identifies important features of the argument and analyses them insightfully. Develops ideas cogently, organises them logically, and connects them with clear transitions. Demonstrates superior facility with the conventions of standard written English but may have minor errors. |
| 5 | Strong | Clearly identifies important features of the argument and analyses them in a generally thoughtful way. Develops ideas clearly, organises them logically, and connects them with appropriate transitions. Demonstrates facility with the conventions of standard written English but may have minor errors. |
| 4 | Adequate | Identifies and analyses important features of the argument. Develops and organises ideas satisfactorily but may not connect them with transitions. Demonstrates sufficient control of language to express ideas with reasonable clarity. Generally follows the conventions of standard written English but may have some errors. |
| 3 | Limited | Does not identify or analyse most of the important features of the argument, though some analysis of the argument is present. Mainly analyses tangential or irrelevant matters, or reasons poorly. Limited in the logical development and organisation of ideas. Offers support of little relevance and value. Has problems in the use of language and in sentence structure. Contains occasional major errors or frequent minor errors in grammar, usage, and mechanics. |
| 2 | Seriously Flawed | Does not present a critique based on logical analysis, but may instead present the writer's own views on the subject. Does not develop ideas, or is poorly organised. Has serious and frequent problems in the use of language and sentence structure. Contains numerous errors in grammar, usage, or mechanics that interfere with meaning. |
| 1 | Fundamentally Deficient | Provides little evidence of the ability to understand and analyse the argument. Provides little evidence of the ability to develop an organised response. Has severe and pervasive errors. Largely incoherent. |
PERSUADE/ASAP 2.0 Holistic Rubric
Used for scoring the 12 PERSUADE 2.0 essays.
| Score | Label | Description |
|---|---|---|
| 6 | Clear and Consistent Mastery | Effectively and insightfully develops a point of view on the issue and demonstrates outstanding critical thinking. Uses clearly appropriate examples, reasons, and other evidence to support its position. Is well organised and clearly focused, demonstrating clear coherence and smooth progression of ideas. Exhibits skillful use of language, using a varied, accurate, and apt vocabulary and demonstrates meaningful variety in sentence structure. Is free of most errors in grammar, usage, and mechanics. |
| 5 | Reasonably Consistent Mastery | Effectively develops a point of view on the issue and demonstrates strong critical thinking. Generally uses appropriate examples, reasons, and other evidence to support its position. Is well organised and focused, demonstrating coherence and progression of ideas. Exhibits facility in the use of language, using appropriate vocabulary and demonstrates variety in sentence structure. Is generally free of most errors in grammar, usage, and mechanics. |
| 4 | Adequate Mastery | Develops a point of view on the issue and demonstrates competent critical thinking. Uses adequate examples, reasons, and other evidence to support its position. Is generally organised and focused, demonstrating some coherence and progression of ideas. May demonstrate inconsistent facility in the use of language, using generally appropriate vocabulary and some variety in sentence structure. May have some errors in grammar, usage, and mechanics. |
| 3 | Developing Mastery | Develops a point of view on the issue, demonstrating some critical thinking, but may do so inconsistently or use inadequate examples, reasons, or other evidence to support its position. Is limited in its organisation or focus, or may demonstrate some lapses in coherence or progression of ideas. May demonstrate facility in the use of language, but sometimes uses weak vocabulary or inappropriate word choice and/or lacks variety or demonstrates problems in sentence structure. May contain an accumulation of errors in grammar, usage, and mechanics. |
| 2 | Little Mastery | Develops a point of view on the issue that is vague or seriously limited, and demonstrates weak critical thinking. Provides inappropriate or insufficient examples, reasons, or other evidence to support its position. Is poorly organised and/or focused, or demonstrates serious problems with coherence or progression of ideas. Displays very little facility in the use of language, using very limited vocabulary or incorrect word choice and/or demonstrates frequent problems in sentence structure. Contains errors in grammar, usage, and mechanics so serious that meaning is somewhat obscured. |
| 1 | Very Little or No Mastery | Develops no viable point of view on the issue, or provides little or no evidence to support its position. Is disorganised or unfocused, resulting in a disjointed or incoherent essay. Displays fundamental errors in vocabulary and/or demonstrates severe flaws in sentence structure. Contains pervasive errors in grammar, usage, or mechanics that persistently interfere with meaning. |
B. Full ETS scoring results
All 6 ETS essays with expert and model scores.
| Essay | ETS Score | Pro Score | Flash Score | Pro Exact | Flash Exact |
|---|---|---|---|---|---|
| arg_score6 | 6 | 5 | 5 | No | No |
| arg_score5 | 5 | 6 | 6 | No | No |
| arg_score4 | 4 | 3 | 4 | No | Yes |
| arg_score3 | 3 | 2 | 3 | No | Yes |
| arg_score2 | 2 | 2 | 2 | Yes | Yes |
| arg_score1 | 1 | 1 | 1 | Yes | Yes |
Pro: 2 of 6 exact (33%), 6 of 6 within-one (100%). Flash: 4 of 6 exact (67%), 6 of 6 within-one (100%).
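The exact and within-one figures can be recomputed from the table. The sketch below assumes that "within-one" means an absolute difference of at most one point, so exact matches count as within-one. The function name `agreement` is illustrative and is not taken from the study's scripts.

```python
def agreement(reference, predicted):
    """Return (exact, within_one, n) counts for two paired lists of integer scores."""
    if len(reference) != len(predicted):
        raise ValueError("score lists must have the same length")
    pairs = list(zip(reference, predicted))
    exact = sum(r == p for r, p in pairs)
    # Assumption: within-one includes exact matches (|difference| <= 1).
    within_one = sum(abs(r - p) <= 1 for r, p in pairs)
    return exact, within_one, len(pairs)


# Scores transcribed from the ETS table above, arg_score6 down to arg_score1.
ets = [6, 5, 4, 3, 2, 1]
pro = [5, 6, 3, 2, 2, 1]
flash = [5, 6, 4, 3, 2, 1]

for name, scores in (("Pro", pro), ("Flash", flash)):
    exact, within_one, n = agreement(ets, scores)
    print(f"{name}: {exact} of {n} exact ({exact / n:.0%}), "
          f"{within_one} of {n} within-one ({within_one / n:.0%})")
```

Run on the ETS scores, this prints 2 of 6 exact for Pro and 4 of 6 exact for Flash, with 6 of 6 within-one for both.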
C. Full PERSUADE scoring results
All 12 PERSUADE 2.0 essays with human and model scores.
| Essay | Human Score | Pro Score | Flash Score | Pro Exact | Flash Exact |
|---|---|---|---|---|---|
| persuade_score1_a | 1 | 2 | 1 | No | Yes |
| persuade_score1_b | 1 | 1 | 1 | Yes | Yes |
| persuade_score2_a | 2 | 3 | 3 | No | No |
| persuade_score2_b | 2 | 3 | 3 | No | No |
| persuade_score3_a | 3 | 2 | 2 | No | No |
| persuade_score3_b | 3 | 3 | 3 | Yes | Yes |
| persuade_score4_a | 4 | 3 | 3 | No | No |
| persuade_score4_b | 4 | 4 | 4 | Yes | Yes |
| persuade_score5_a | 5 | 3 | 3 | No | No |
| persuade_score5_b | 5 | 4 | 4 | No | No |
| persuade_score6_a | 6 | 3 | 4 | No | No |
| persuade_score6_b | 6 | 5 | 5 | No | No |
Pro: 3 of 12 exact (25%), 10 of 12 within-one (83%). Flash: 4 of 12 exact (33%), 10 of 12 within-one (83%).
D. Contamination test details
Score recall: ETS essays
| Essay | ETS Score | Pro Guess | Pro Exact | Flash Guess | Flash Exact |
|---|---|---|---|---|---|
| arg_score6 | 6 | 5 | No | 6 | Yes |
| arg_score5 | 5 | 5 | Yes | 6 | No |
| arg_score4 | 4 | 3 | No | 3 | No |
| arg_score3 | 3 | 3 | Yes | 3 | Yes |
| arg_score2 | 2 | 2 | Yes | 2 | Yes |
| arg_score1 | 1 | 1 | Yes | 1 | Yes |
Both models: 4 of 6 exact (67%), 6 of 6 within-one (100%).
Score recall: PERSUADE essays
| Essay | Human Score | Pro Guess | Pro Exact | Flash Guess | Flash Exact |
|---|---|---|---|---|---|
| persuade_score1_a | 1 | 1 | Yes | 1 | Yes |
| persuade_score1_b | 1 | 1 | Yes | 1 | Yes |
| persuade_score2_a | 2 | 3 | No | 2 | Yes |
| persuade_score2_b | 2 | 3 | No | 2 | Yes |
| persuade_score3_a | 3 | 2 | No | 2 | No |
| persuade_score3_b | 3 | 3 | Yes | 2 | No |
| persuade_score4_a | 4 | 3 | No | 3 | No |
| persuade_score4_b | 4 | 4 | Yes | 4 | Yes |
| persuade_score5_a | 5 | 4 | No | 3 | No |
| persuade_score5_b | 5 | 4 | No | 3 | No |
| persuade_score6_a | 6 | 4 | No | 3 | No |
| persuade_score6_b | 6 | 5 | No | 4 | No |
Pro: 4 of 12 exact (33%), 11 of 12 within-one (92%). Flash: 5 of 12 exact (42%), 8 of 12 within-one (67%).
Verbatim recall: ETS essays
| Essay | Pro Similarity | Flash Similarity | Verdict |
|---|---|---|---|
| arg_score6 | 0.009 | 0.016 | Not memorised |
| arg_score5 | 0.012 | 0.035 | Not memorised |
| arg_score4 | 0.015 | 0.018 | Not memorised |
| arg_score3 | 0.006 | 0.007 | Not memorised |
| arg_score2 | 0.039 | 0.033 | Not memorised |
| arg_score1 | 0.154 | 0.071 | Not memorised |
All similarity scores are well below 0.5; the highest is 0.154 (arg_score1, Pro). No verbatim memorisation was detected for either model. Both models recognised the essays as ETS samples and stated this in their responses, but could not reproduce the text.
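This appendix does not state how the similarity scores were computed. As a hedged illustration only, the sketch below uses Python's `difflib.SequenceMatcher` ratio as a stand-in metric and treats 0.5 as the cut-off, which is an assumption based on the wording of the summary above. The two example strings are invented for the demonstration and are not taken from the essays.

```python
from difflib import SequenceMatcher

# Assumption: 0.5 is used as the memorisation cut-off.
MEMORISATION_THRESHOLD = 0.5


def similarity(original: str, continuation: str) -> float:
    """Character-level similarity in [0, 1]. A stand-in, not necessarily the study's metric."""
    return SequenceMatcher(None, original, continuation).ratio()


def verdict(score: float) -> str:
    """Map a similarity score to the labels used in the Verdict column."""
    return "Memorised" if score >= MEMORISATION_THRESHOLD else "Not memorised"


# Hypothetical texts for illustration only.
original_rest = "The survey may not represent the views of all residents."
model_continuation = "Firstly, the author assumes that a cleaner river will raise usage."

score = similarity(original_rest, model_continuation)
print(f"{score:.3f} -> {verdict(score)}")
```

A character-level ratio is sensitive to surface wording, which suits a verbatim-recall check; a token-level or n-gram overlap measure would be an equally reasonable choice.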
E. Sample feedback and next-step outputs
Three examples from ETS essays at score levels 6, 3, and 1. Both models' outputs are shown for each essay.
Score 6 essay (arg_score6) — Gemini 3.1 Pro feedback
Score assigned: 5
Justification: The essay clearly identifies and thoughtfully examines key assumptions in the argument, such as the validity of the survey, the causal link between the river's current state and its lack of use, and the feasibility of the proposed clean-up. Ideas are developed logically with appropriate transitions and supported by clever examples, such as the possibility of natural sulfur deposits causing the smell. The writer demonstrates strong control of language and conventions, despite a few minor errors, aligning perfectly with the criteria for a strong score of 5.
Strengths
- Deconstruction of the survey's validity: effectively challenges the assumption that the survey accurately represents the broader population.
- Interrogating causal relationships: questions the assumed link between water quality and lack of use.
- Creative evaluation of the proposed solution: the distinction between anthropogenic pollution and natural mineral deposits is highly insightful.
Areas for improvement
- Address the temporal assumption regarding the budget timeline.
- Examine the necessity of riverside facilities specifically.
- Explicitly connect unwarranted assumptions to their implications.
Next step
Beyond the Survey: Uncovering Structural Assumptions. Re-read the argument and identify two major assumptions unrelated to survey validity or complaint volume. Write one new body paragraph (150-200 words) analysing one of these structural assumptions.
Score 6 essay (arg_score6) — Gemini 3 Flash feedback
Score assigned: 5
Justification: The essay provides a strong critique identifying key assumptions regarding survey validity, the causal link between water quality and usage, and the potential effectiveness of the cleanup. The analysis is logically organised and features insightful points, such as the distinction between man-made pollution and natural odours. The writing is clear and fluent with good sentence variety, though it lacks the exhaustive development and rhetorical sophistication required for a score of 6.
Strengths
- Scrutinised the survey evidence by questioning its methodology and sample representativeness.
- Nuanced analysis of smell and water quality, distinguishing remediable from non-remediable causes.
- Identified potential vocal minority bias regarding the complaints.
Areas for improvement
- Examine the logical leap between a clean river and the need for facilities.
- Consider the timing assumption: recent announcement vs. this year's budget.
- Explore alternative reasons for lack of river usage beyond water quality.
Next step
The Counter-Explanation Table. Identify three physical or logistical reasons unrelated to water quality why the river might be unused. For each, write a three-sentence analysis explaining why the clean-up would fail to increase usage.
Score 3 essay (arg_score3) — Gemini 3.1 Pro feedback
Score assigned: 2
Justification: The essay largely fails to present a critique based on logical analysis, instead offering the writer's own speculative views. It demonstrates a misunderstanding of the argument by confusing the state's plan with the city's budget. The response is poorly organised into a single, disjointed paragraph and lacks meaningful development.
Strengths
- Identified that desire for a clean river does not equal desire to swim in it.
- Questioned survey validity at the start of the essay.
- Observed the timeline issue: cleanup "might take years."
Areas for improvement
- Explicitly connect assumptions to the prompt's conclusion.
- Avoid drifting into policy arguments using outside information.
- Organise into distinct paragraphs, each focused on one assumption.
Next step
Mapping Logical Gaps. Identify three unstated assumptions using the template: "The author assumes that [X], but it is possible that [Y]." Focus strictly on logical gaps, not policy opinions.
Score 3 essay (arg_score3) — Gemini 3 Flash feedback
Score assigned: 3
Justification: The essay identifies some relevant assumptions, such as the potential unreliability of the survey and that cleaning may not immediately increase usage. However, the analysis is limited and wanders into tangential territory. The development is thin, and the writing contains several mechanical errors.
Strengths
- Correctly identified the survey representativeness issue.
- Raised a valid concern about opportunity costs in the budget.
- Distinguished between wanting a clean river and wanting to use it for recreation.
Areas for improvement
- Develop each assumption into a full paragraph rather than listing them.
- Stay focused on logical flaws rather than general policy concerns.
- Proofread for spelling and grammar errors that impede clarity.
Next step
The Assumption Spotlight. Choose one assumption from your essay. Write a full paragraph (5-7 sentences) that names the assumption, explains why it is necessary for the argument, and describes what happens if it proves false.
Score 1 essay (arg_score1) — Gemini 3.1 Pro feedback
Score assigned: 1
Justification: The essay provides little evidence of the ability to critically analyse the argument. While the writer briefly mentions an assumption (that residents enjoy recreation), the rest presents unsupported personal assertions rather than logical analysis. The organisation is incoherent and the writing contains severe errors in grammar, usage, and mechanics.
Next step
Finding the Hidden Assumptions. Read the argument sentence by sentence. For each sentence, ask: "What does the author expect me to believe without proof?" Write down three such hidden beliefs. For each, write one sentence explaining why it might not be true.
Score 1 essay (arg_score1) — Gemini 3 Flash feedback
Score assigned: 1
Justification: The essay fails to develop a viable point of view on the argument. The attempt at analysis is superficial, consisting of a single restated assumption followed by assertions that lack any supporting evidence. The response is poorly organised, consists of a single undeveloped paragraph, and contains severe and pervasive grammatical errors.
Next step
The 'Because' Chain. Read the argument's conclusion and write: "The author assumes [X] because [Y]." Then ask: "But what if [Y] is wrong?" Write your answer in a complete sentence. Repeat for two more assumptions.
F. Prompt templates
Prompt templates used in this study. The scoring, feedback, and next-step prompts were used for assessment; the verbatim recall and score recall prompts were used for contamination testing. The ETS versions shown below are representative of the structure. PERSUADE versions used the same structure but with the PERSUADE/ASAP rubric, persuasive writing framing, and the PERSUADE writing prompt. The two score recall prompts are shown separately because their framing differs.
Scoring prompt
You are an expert writing assessor. Your task is to score a student
essay written for the GRE "Analyse an Argument" task.
## The Argument
[argument text]
## Task Instruction
[task instruction text]
## Scoring Rubric (ETS 6-Point Scale)
[full rubric text]
## Student Essay
[essay text]
## Your Task
Score this essay on a scale of 1 to 6 using the rubric above.
Consider:
- How well the student identifies and examines assumptions
- The quality and depth of analysis
- Organisation and logical flow
- Language use, sentence variety, and control of conventions
Respond with ONLY valid JSON:
{
"score": <integer 1-6>,
"justification": "<2-4 sentences>"
}
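Because the scoring prompt asks for JSON only, a reply can be checked mechanically before it is recorded. The following is a minimal sketch of such a check, assuming the reply arrives as a plain string; `parse_score_response` is a hypothetical helper, not part of the study's published scripts.

```python
import json


def parse_score_response(raw: str) -> dict:
    """Parse and validate a scoring reply. Raises ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 6:
        raise ValueError("score must be an integer from 1 to 6")

    justification = data.get("justification")
    if not isinstance(justification, str) or not justification.strip():
        raise ValueError("justification must be a non-empty string")

    return {"score": score, "justification": justification.strip()}


reply = '{"score": 5, "justification": "Identifies key assumptions clearly."}'
print(parse_score_response(reply)["score"])
```

Rejecting out-of-range or non-integer scores at this stage keeps malformed replies out of the agreement calculations.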
Feedback prompt
You are an expert writing teacher providing formative feedback
on a student essay.
[argument and essay text provided]
Provide process-oriented feedback. Focus on the student's
analytical reasoning process, not just the product.
Do NOT assign a score. Focus on:
1. Strengths — specific analytical moves that were effective
2. Areas for improvement — assumptions or logical gaps missed
Respond with ONLY valid JSON:
{
"strengths": ["<strength 1>", "<strength 2>", "<strength 3>"],
"areas_for_improvement": ["<area 1>", "<area 2>", "<area 3>"]
}
Next step prompt
You are an expert writing teacher designing a targeted practice
activity for a student.
[argument and essay text provided]
Design ONE specific, concrete practice task the student could
complete in 15-30 minutes. The task should:
- Target the student's most significant weakness
- Be actionable and specific
- Include a clear deliverable
- Be appropriately challenging
Respond with ONLY valid JSON:
{
"task_title": "<short title>",
"task_description": "<2-3 sentences>",
"target_skill": "<specific skill targeted>"
}
Verbatim recall prompt (contamination test)
The following is the opening sentence of a student essay written for the GRE "Analyse an Argument" task about [topic]. "[first sentence]" Continue this essay from where it left off. Write the rest of the essay as you believe it continues.
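The template above can be filled by splitting each essay into its opening sentence and the held-out remainder. The sketch below shows one way to do that; the sentence-splitting rule (first `.`, `!` or `?` followed by whitespace) and the helper names are assumptions for illustration, not the study's procedure.

```python
import re

TEMPLATE = (
    'The following is the opening sentence of a student essay written for the '
    'GRE "Analyse an Argument" task about {topic}. "{first_sentence}" '
    'Continue this essay from where it left off. Write the rest of the essay '
    'as you believe it continues.'
)


def split_opening(essay: str) -> tuple[str, str]:
    """Split an essay into (first sentence, remainder) using a naive punctuation rule."""
    text = essay.strip()
    match = re.search(r"[.!?](?=\s)", text)
    if match is None:
        # No sentence boundary found: treat the whole text as the opening.
        return text, ""
    return text[: match.end()], text[match.end():].strip()


def build_recall_prompt(essay: str, topic: str) -> tuple[str, str]:
    """Return (prompt, held-out remainder to compare with the model's continuation)."""
    first, rest = split_opening(essay)
    return TEMPLATE.format(topic=topic, first_sentence=first), rest


# Hypothetical essay text and topic for illustration only.
prompt, held_out = build_recall_prompt("One idea. Two ideas. Three.", "a river clean-up")
print(prompt)
```

The held-out remainder is what a model's continuation would be compared against in the verbatim recall test.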
Score recall prompt — ETS version (contamination test)
The following is a student essay written for the GRE "Analyse an
Argument" task. This essay was scored by ETS expert raters on a
scale of 1 to 6.
[essay text]
Without any rubric or scoring guide, what score do you think ETS
assigned to this essay? Consider the overall quality of analysis,
organisation, and writing.
Respond with ONLY valid JSON:
{
"ets_score": <integer 1-6>
}
Score recall prompt — PERSUADE version (contamination test)
The following student essay was written for a persuasive writing
task used in the PERSUADE 2.0 corpus, a large dataset of scored
student argumentative essays.
Writing prompt: [prompt text]
Essay:
[essay text]
This essay was scored by trained human raters on a 1-6 scale.
What score do you think was assigned?
Respond with ONLY a JSON object:
{
"guessed_score": <integer 1-6>
}
G. Data access and citation
Data files
All data for this report is available in the GitHub repository:
- data/ets_scoring.json: ETS essay scores (expert and model)
- data/persuade_scoring.json: PERSUADE essay scores (human and model)
- data/contamination.json: Score recall and verbatim recall comparison
- data/feedback_samples.json: Full feedback outputs for selected essays
Source data and evaluation scripts are in the AlignED-Complex-Competencies-Assessment repository.
Citation
A data contamination problem. March 2026.
https://trgallagher-research.github.io/AlignED-research-report-2/
References
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
- Crossley, S. A., Baffour, P., Tian, Y., Franklin, A., Benner, M., & Boser, U. (2024). A large-scale corpus for assessing written argumentation: PERSUADE 2.0. Assessing Writing, 61.
- Jacovi, A., Caciularu, A., Goldman, O., & Goldberg, Y. (2023). Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 5075–5084.
- Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education/Macmillan.
- Sainz, O., Campos, J. A., Garcia-Ferrero, I., Etxaniz, J., de Lacalle, O. L., & Agirre, E. (2023). NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. Findings of EMNLP 2023.
Contact
Questions or feedback: github.com/trgallagher-research