Appendices

A: Benchmark Items
B: Data Access
C: Citation
D: Evaluation Framework
E: Revision History
Contact

Appendix A: Full Benchmark Items

Neuromyths (15 items — correct answer: False)

Each item below is a statement about the brain and learning. All 15 are neuromyths: the correct answer is False. Items marked with an asterisk (*) receive a confidence probe in addition to the true/false response.

#	Statement	Teacher Belief Rate
Q02	Children must acquire their native language before a second language is learned. If they do not do so neither language will be fully acquired.
Q04	If pupils do not drink sufficient amounts of water (= 6–8 glasses a day) their brains shrink.
Q05*	It has been scientifically proven that fatty acid supplements (omega-3 and omega-6) have a positive effect on academic achievement.	54–69%
Q07*	We only use 10% of our brain.	46–48%
Q09*	Differences in hemispheric dominance (left brain, right brain) can help explain individual differences amongst learners.	86–91%
Q12	There are critical periods in childhood after which certain things can no longer be learned.
Q15*	Individuals learn better when they receive information in their preferred learning style (e.g., auditory, visual, kinesthetic).	93–96%
Q19	Mental capacity is hereditary and cannot be changed by the environment or experience.
Q21*†	Environments that are rich in stimulus improve the brains of pre-school children.	56–95%
Q22*	Children are less attentive after consuming sugary drinks and/or snacks.	55–57%
Q24	Regular drinking of caffeinated drinks reduces alertness.
Q25*	Exercises that rehearse co-ordination of motor-perception skills can improve literacy skills.	63–78%
Q28	Learning problems associated with developmental differences in brain function cannot be remediated by education.
Q30*	Short bouts of co-ordination exercises can improve integration of left and right hemispheric brain function.	82–88%
Q32	When we sleep, the brain shuts down.

Teacher Belief Rate figures are from Dekker et al. (2012), based on teacher samples from the UK and the Netherlands.

† Q21 is a contested item. It is classified as a neuromyth because the original enrichment findings came from rats raised in deprived laboratory conditions (OECD, 2002), and the broad generalisation to human children is not well supported by the evidence. However, the ambiguity in both “enrichment” and the assumed baseline makes a well-informed “True” response reasonably defensible. Every model tested answered this item incorrectly, as do the majority of human teachers (56–95% endorsement rate). The item is retained because it is part of the established instrument. See Results for further discussion.

General Assertions (17 items — correct answer: True, except Q10 and Q11)

#	Statement	Expected
Q01	We use our brains 24 hours a day.	True
Q03	Boys have bigger brains than girls.	True
Q06	When a brain region is damaged other parts of the brain can take up its function.	True
Q08	The left and right hemisphere of the brain always work together.	True
Q10	The brains of boys and girls develop at the same rate.	False
Q11	Brain development has finished by the time children reach secondary school.	False
Q13	Information is stored in the brain in a network of cells distributed throughout the brain.	True
Q14	Learning is not due to the addition of new cells to the brain.	True
Q16	Learning occurs through modification of the brains' neural connections.	True
Q17	Academic achievement can be affected by skipping breakfast.	True
Q18	Normal development of the human brain involves the birth and death of brain cells.	True
Q20	Vigorous exercise can improve mental function.	True
Q23	Circadian rhythms ('body-clock') shift during adolescence, causing pupils to be tired during the first lessons of the school day.	True
Q26	Extended rehearsal of some mental processes can change the shape and structure of some parts of the brain.	True
Q27	Individual learners show preferences for the mode in which they receive information (e.g., visual, auditory, kinesthetic).	True
Q29	Production of new connections in the brain can continue into old age.	True
Q31	There are sensitive periods in childhood when it's easier to learn things.	True
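The two tables above imply a simple answer key: the 15 neuromyths are keyed False, and the general assertions are keyed True apart from Q10 and Q11. The sketch below builds that key and computes accuracy per category. The function names and the response format (a mapping from item ID to a boolean) are assumptions made for illustration; this is not the project's actual scoring script.

```python
# Item IDs come from the tables above; names and data shapes are illustrative.
NEUROMYTH_IDS = frozenset({
    "Q02", "Q04", "Q05", "Q07", "Q09", "Q12", "Q15", "Q19",
    "Q21", "Q22", "Q24", "Q25", "Q28", "Q30", "Q32",
})
FALSE_ASSERTION_IDS = frozenset({"Q10", "Q11"})
ALL_IDS = [f"Q{n:02d}" for n in range(1, 33)]


def build_answer_key() -> dict[str, bool]:
    """Map each item ID to its keyed answer (True or False)."""
    return {
        qid: qid not in NEUROMYTH_IDS and qid not in FALSE_ASSERTION_IDS
        for qid in ALL_IDS
    }


def score(responses: dict[str, bool], key: dict[str, bool]) -> dict[str, float]:
    """Return accuracy separately for neuromyths and general assertions.

    Items missing from `responses` count as incorrect.
    """
    def accuracy(ids):
        return sum(responses.get(q) == key[q] for q in ids) / len(ids)

    assertion_ids = [q for q in ALL_IDS if q not in NEUROMYTH_IDS]
    return {
        "neuromyth_accuracy": accuracy(sorted(NEUROMYTH_IDS)),
        "assertion_accuracy": accuracy(assertion_ids),
    }
```

Reporting the two categories separately matters because a respondent who answers False to everything would score perfectly on the neuromyths while getting only Q10 and Q11 right among the general assertions.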

Diagnostic Reasoning Scenarios (12 items)

Each scenario presents a teacher implementing an evidence-based strategy incorrectly. The model must identify the implementation error and explain the underlying mechanism.

ID	Topic	Implementation Error	Key Mechanism
S01	Retrieval Practice	Testing before encoding; no feedback	Retrieval must follow encoding; error correction essential
S02	Interleaving	Switching unrelated topics without connection	Interleaving works between related concepts
S03	Worked Examples	Examples without fading or practice	Must gradually remove scaffolding
S04	Spaced Practice	Spacing without initial mastery	Material must be learned first
S05	Formative Assessment	Assessment without responsive instruction	Data must inform subsequent teaching
S06	Cognitive Load	Reducing load for experts	Expertise reversal effect
S07	Direct Instruction	Lecture without active engagement	Includes structured practice, not just telling
S08	Critical Thinking	Teaching "skills" without domain knowledge	Domain-dependent, not transferable
S09	Feedback Timing	Immediate feedback on complex tasks	Complex tasks benefit from delayed feedback
S10	Scaffolding	Permanent scaffolds preventing independence	Must be faded as competence grows
S11	Rewards (Overjustification)	Extrinsic rewards undermining intrinsic motivation	Overjustification effect
S12	Personalisation	Surface personalisation without adaptation	True personalisation adapts difficulty

Prompt Variants (neuromyths)

The main neuromyth scores (31 models) use the Standard prompt only. As a separate Tier 2 robustness probe, four prompt framings were tested on a subset of 4 models (Claude 4.5 Sonnet, Claude 4.5 Opus, GPT-5, GPT-4o) to measure prompt sensitivity. The confidence probe was run separately on 5 models. These variant results are reported as the Prompt Sensitivity Index (PSI) in the Robustness and calibration probes section, but do not affect the main model rankings.

Here is Q07 ("We only use 10% of our brain") in all four framings:

Standard: "We only use 10% of our brain."

Interrogative: "Do we only use 10% of our brain?"

Embedded: "A teacher claims that we only use 10% of our brain. Is this claim correct or incorrect?"

Adversarial: "Research has consistently shown that we only use 10% of our brain. Is this correct or incorrect?"

The adversarial variant adds: "Base your answer on scientific evidence, not on how the question is framed." All variants use the same system prompt as the main survey.
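As a rough illustration of how the four framings relate to one another, the sketch below assembles them from a single statement. The embedded and adversarial wrappers follow the Q07 wording shown above. The interrogative form is passed in by hand because it cannot be derived mechanically. The function name, the lower-casing of the first letter, and the placement of the extra adversarial instruction after the question are all assumptions, not the project's actual prompt templates.

```python
# Wrapper text is copied from the Q07 example above; everything else is illustrative.
ADVERSARIAL_INSTRUCTION = (
    "Base your answer on scientific evidence, not on how the question is framed."
)


def build_variants(statement: str, interrogative: str) -> dict[str, str]:
    """Return the four framings of one neuromyth statement.

    `interrogative` must be written by hand for each item.
    """
    # Drop the final full stop and lower-case the first letter so the
    # statement can sit mid-sentence. This would be wrong for statements
    # that begin with a proper noun.
    core = statement.rstrip(".")
    core = core[0].lower() + core[1:]
    return {
        "standard": statement,
        "interrogative": interrogative,
        "embedded": (
            f"A teacher claims that {core}. Is this claim correct or incorrect?"
        ),
        "adversarial": (
            f"Research has consistently shown that {core}. "
            f"Is this correct or incorrect? {ADVERSARIAL_INSTRUCTION}"
        ),
    }


variants = build_variants(
    "We only use 10% of our brain.",
    "Do we only use 10% of our brain?",
)
```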

Pedagogical Knowledge

The 1,143 teacher certification items (920 CDPK + 223 SEND) are drawn from Chilean national exams and sourced from the pedagogy-benchmark dataset on Hugging Face. Items are not reproduced here; the full item sets are available through the original dataset.

ACARA Student Work Judgement

The comparative judgement task presents models with 79 pairs of student writing samples drawn from the ACARA work samples collection. Each pair is evaluated across three independent trials (237 evaluations per model), with both forward and reverse presentation orders used to test for position bias. Models must identify which sample demonstrates a higher level of achievement against the relevant curriculum standard. This design allows measurement of both accuracy (agreement with the ACARA-assigned grade) and position-swap consistency (whether the model gives the same answer regardless of presentation order).
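The two measures described above can be sketched as follows. The record layout, the field names, and the exact definition of consistency (a pair counts as consistent when the same sample is preferred in every forward and reverse presentation) are assumptions for illustration; the report does not specify how trials are assigned to presentation orders.

```python
from collections import defaultdict
from dataclasses import dataclass

PAIRS, TRIALS = 79, 3
assert PAIRS * TRIALS == 237  # evaluations per model, as stated above


@dataclass(frozen=True)
class Judgement:
    pair_id: str
    order: str    # "forward" or "reverse" presentation
    chosen: str   # ID of the sample the model preferred (not its position)
    higher: str   # ID of the sample with the higher ACARA-assigned grade


def accuracy(judgements):
    """Share of judgements that agree with the ACARA-assigned grade."""
    return sum(j.chosen == j.higher for j in judgements) / len(judgements)


def swap_consistency(judgements):
    """Share of pairs where the same sample wins in both presentation orders.

    Pairs that were not judged in both orders are skipped.
    """
    choices = defaultdict(lambda: {"forward": set(), "reverse": set()})
    for j in judgements:
        choices[j.pair_id][j.order].add(j.chosen)
    comparable = [c for c in choices.values() if c["forward"] and c["reverse"]]
    if not comparable:
        raise ValueError("no pair was judged in both orders")
    consistent = sum(len(c["forward"] | c["reverse"]) == 1 for c in comparable)
    return consistent / len(comparable)
```

Recording the chosen sample by ID rather than by position keeps the two measures separate: a model that always picks the first sample shown would score zero on consistency regardless of its accuracy.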

Appendix B: Data Access

AlignED is committed to open data. All benchmark results, scoring scripts, and item sets (where licensing permits) are publicly available.

GitHub repository: github.com/trgallagher-research/AlignED-research-report

An OSF project is in preparation and will host versioned data snapshots alongside supplementary materials.

Dataset contents:

Raw model responses for all benchmark items
Scored results by model and benchmark
Model metadata (provider, release date, parameter count where available)
Scoring scripts and evaluation code
Prompt templates for all four neuromyth framings and all 12 diagnostic scenarios

Terms of use: Data and code are released for research and educational purposes. If you use AlignED data in published work, please cite the project (see Appendix C). Commercial use of the benchmark items themselves may be subject to the licensing terms of the original item sources (Dekker et al., 2012 for neuromyths; AI-for-Education for pedagogy items; ACARA for student work samples).

Appendix C: Citation

If you use AlignED data, results, or benchmark items in your work, please cite:

AlignED Benchmark (2026). Benchmarking AI models for educational practice.
https://trgallagher-research.github.io/AlignED-research-report/

We would appreciate it if you let us know about any work that builds on AlignED. Please drop us a note at the address listed under Contact.

Appendix D: Evaluation Framework

This appendix sets out the full evaluation framework used to audit each benchmark for completeness and rigour. The framework is designed to be reusable for future evaluations beyond the current AlignED suite.

1. Item Preparation

1.1 Construct Definition

What this eval claims to measure and why it matters.

Evidence: Documented statement of construct, rationale for why model performance on this matters

1.2 Source Materials

What you started with.

Evidence: References to original instruments, documents, datasets, or note that items were originally authored

1.3 Transformation

What you did to extract/adapt items from source materials.

Evidence: Documentation of extraction process, adaptations made, exclusions (e.g., “removed score labels from work samples”)

1.4 Answer Key Development

How the scoring reference was established.

Evidence: Documentation of how correct answers were determined, who/what established them, any validation

2. Per-Run Configuration

2.1 Prompt Template

Context Instructions: Role, persona assigned to model
Task Instructions: What model should do with each item
Output Instructions: Required format specification
Evidence: Prompt template file(s), documented prompt structure

2.2 Model Specification

Lab (Anthropic, OpenAI, etc.)
Family (Sonnet, GPT, etc.)
Version (4.5, 4o, etc.)
Evidence: Documented in config, code, or results files

2.3 Run Parameters

Temperature
Capabilities (search, tools, etc.)
Reasoning Mode (extended thinking, CoT, etc.)
Evidence: Documented in config or code

2.4 Eval Items

Item Content

Stem/stimulus (question or scenario)
Response format (MCQ, T/F, open-ended, comparative judgement)
Evidence: Item files, data files containing stimuli

Item Scoring Reference

Answer key (for deterministic scoring)
Scoring guidance (for judgement-based scoring)
Evidence: Answer key file, rubric document

Item Metadata

Item ID
Topic/construct tags
Difficulty (if known)
Source citation
Evidence: Metadata in item files or separate metadata file

Item Set Structure

Grouping (if items share context)
Sequence considerations (fixed vs. randomisable)
Evidence: Documentation or code showing item organisation
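One way to hold the item components listed in 2.4 in a single record is sketched below. The class name, field names, and allowed format labels are hypothetical and are not taken from the AlignED item files. The only rule enforced is that every item carries either an answer key or scoring guidance.

```python
from dataclasses import dataclass
from typing import Optional

# Labels mirror the response formats listed above; the spellings are illustrative.
RESPONSE_FORMATS = {"mcq", "true_false", "open_ended", "comparative_judgement"}


@dataclass(frozen=True)
class EvalItem:
    # Item content
    item_id: str
    stem: str
    response_format: str
    # Scoring reference: at least one of the two must be present
    answer_key: Optional[str] = None
    scoring_guidance: Optional[str] = None
    # Metadata
    topic_tags: tuple = ()
    difficulty: Optional[float] = None
    source: Optional[str] = None
    # Item set structure
    group: Optional[str] = None
    fixed_sequence: bool = False

    def __post_init__(self):
        if self.response_format not in RESPONSE_FORMATS:
            raise ValueError(f"unknown response format: {self.response_format}")
        if self.answer_key is None and self.scoring_guidance is None:
            raise ValueError("item needs an answer key or scoring guidance")


q07 = EvalItem(
    item_id="Q07",
    stem="We only use 10% of our brain.",
    response_format="true_false",
    answer_key="False",
    topic_tags=("neuromyth",),
    source="Dekker et al. (2012)",
)
```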

3. Validation Protocol

3.1 Tier 1: Baseline (Minimum for Reported Results)

Standard config (Temp=0, standard prompt) with reliability evidence:

Large item set → Single run, internal consistency (Cronbach's α, split-half)
Small item set → Multiple runs (3+), test-retest reliability
Evidence: Documented baseline runs, reliability statistics
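The two Tier 1 reliability routes can be illustrated with a short sketch. The first function implements the standard Cronbach's α formula over a score matrix with one row per respondent and one column per item. The second reports the share of items answered identically across repeated runs, which is one simple stand-in for test-retest reliability. Both function names, the input shapes, and the choice of the agreement statistic are assumptions; the framework does not prescribe a specific implementation.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of item scores.

    `scores` has one row per respondent and one column per item.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def retest_agreement(runs):
    """Share of items that receive the same answer in every run.

    `runs` is a list of dicts mapping item ID to the recorded answer;
    all runs are assumed to cover the same items.
    """
    items = list(runs[0])
    stable = sum(len({run[item] for run in runs}) == 1 for item in items)
    return stable / len(items)
```

For example, three runs that agree on Q01 but disagree on Q02 give an agreement of 0.5.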

3.2 Tier 2: Extended Probes (Robustness Analysis)

Stochasticity: Temperature variations
Prompt Sensitivity: Adversarial vs standard, role framing, few-shot
Format/Structure: Item sequence, option order
Evidence: Probe run results, analysis comparing to Tier 1 baseline

3.3 Tier 3: LLM-as-Judge Validation (Judgement-Based Evals Only)

L1: Single judge, single run
L2: Single judge, multiple runs
L3: Multiple judges, agreement threshold
L4: Human subset, LLM correlation
Evidence: Judge configuration, agreement statistics, human correlation data
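For the multi-judge and human-subset levels, some agreement statistic is needed. The framework does not name one, so the sketch below uses Cohen's kappa between two raters purely as an example of a chance-corrected measure. The function name and input format are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled at random with their own
    # marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Raw percent agreement can look high when one label dominates, which is why a chance-corrected statistic is a common choice when setting an agreement threshold.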

4. Processing

4.1 LLM

Evidence: Code/scripts that send items to model and collect responses

5. LLM Response

5.1 Raw Output

Text or structured response from model
Evidence: Response files, logs

5.2 Run Metadata

Tokens (input, output, total)
Latency
Cost
Evidence: Logged in response files or separate metadata

6. Scoring Pipeline

6.1 Extraction

Isolating scorable content from full response (if needed).

Evidence: Extraction code/logic
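For true/false items, extraction can be as simple as locating the first standalone label in the raw response. The sketch below shows that heuristic. The function name and the first-match rule are assumptions, and the actual AlignED extraction logic may differ, for instance by requiring a structured output format.

```python
import re
from typing import Optional

# Matches "true" or "false" as a whole word, in any letter case.
_LABEL = re.compile(r"\b(true|false)\b", re.IGNORECASE)


def extract_label(raw_response: str) -> Optional[bool]:
    """Return the first True/False label found in a response, or None.

    This is a first-match heuristic: a response such as "It is not true"
    would be read as True, so unparsed or ambiguous responses should be
    flagged for human review rather than scored blindly.
    """
    match = _LABEL.search(raw_response)
    if match is None:
        return None
    return match.group(1).lower() == "true"
```

Returning None instead of guessing keeps unscorable responses visible in the human-reviewable exports.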

6.2 Scoring

Deterministic

Answer key
Scoring script (R/Python)
Evidence: Scoring script, answer key file

Judgement-Based

Rubric (eval-specific)
Evaluator specification (LLM/Human)
Evidence: Rubric document, evaluator prompts/instructions

6.3 Score

Final score per item
Evidence: Score output files

7. Exports (Human Review and Open Science)

7.1 Human-Reviewable

Formatted outputs for sanity-checking at various stages.

Evidence: Human-readable output files, review logs

7.2 Shareable Dataset

Prepared for open science (e.g., OSF).

Evidence: Cleaned dataset files, data dictionary, OSF project

8. Performance Outcomes

8.1 Item-Level Scores

How did the model perform on each item?

Evidence: Item-level results file

8.2 Intramodel Reliability

How consistent across runs? (if multiple runs)

Evidence: Cross-run comparison, reliability statistics

8.3 Summary Metrics

Accuracy
Item difficulty
Error patterns
Calibration
KRI (if Tier 2 probes)
Efficiency (if relevant)
Composite (if needed)
Evidence: Summary statistics, analysis output

9. Performance Report (Qualitative and Contextual)

9.1 Narrative Summary

Evidence: Written summary document

9.2 Visualisations

Evidence: Charts, graphs, figures

9.3 Model Comparisons

Evidence: Comparative analysis across models

9.4 Leaderboard Context

Evidence: Positioning relative to other models/benchmarks

9.5 Probe Insights

Temperature, prompt effects analysis.

Evidence: Tier 2 analysis writeup

9.6 Efficiency Analysis

Evidence: Cost/token/latency analysis

Audit Status Key

✓ Complete (evidence exists and is sufficient)
○ Partial (some evidence exists but incomplete)
✗ Missing (no evidence found)
N/A Not applicable to this eval
? Unclear (needs human clarification)

Also available as: aligned_eval_framework.md on GitHub

Appendix E: Revision History

February 2026: Site restructured as academic paper. Composite EAI score removed in favour of per-benchmark reporting. ACARA standards-based grading pilot added. Model pool expanded to 32 models across five providers. Evaluation framework formalised.
January 2026: Initial public release with neuromyth identification, diagnostic reasoning, teacher certification knowledge, and ACARA comparative judgement benchmarks.

Contact

Email: aligned.benchmark [at] gmail.com

GitHub: github.com/trgallagher-research/AlignED-research-report
