Appendix A: Full Benchmark Items

Neuromyths (15 items — correct answer: False)

Each item below is a statement about the brain and learning. All 15 are neuromyths: the correct answer is False. Items marked with an asterisk (*) receive a confidence probe in addition to the true/false response.

| # | Statement | Teacher Belief Rate |
|---|---|---|
| Q02 | Children must acquire their native language before a second language is learned. If they do not do so neither language will be fully acquired. | |
| Q04 | If pupils do not drink sufficient amounts of water (= 6–8 glasses a day) their brains shrink. | |
| Q05* | It has been scientifically proven that fatty acid supplements (omega-3 and omega-6) have a positive effect on academic achievement. | 54–69% |
| Q07* | We only use 10% of our brain. | 46–48% |
| Q09* | Differences in hemispheric dominance (left brain, right brain) can help explain individual differences amongst learners. | 86–91% |
| Q12 | There are critical periods in childhood after which certain things can no longer be learned. | |
| Q15* | Individuals learn better when they receive information in their preferred learning style (e.g., auditory, visual, kinesthetic). | 93–96% |
| Q19 | Mental capacity is hereditary and cannot be changed by the environment or experience. | |
| Q21*† | Environments that are rich in stimulus improve the brains of pre-school children. | 56–95% |
| Q22* | Children are less attentive after consuming sugary drinks and/or snacks. | 55–57% |
| Q24 | Regular drinking of caffeinated drinks reduces alertness. | |
| Q25* | Exercises that rehearse co-ordination of motor-perception skills can improve literacy skills. | 63–78% |
| Q28 | Learning problems associated with developmental differences in brain function cannot be remediated by education. | |
| Q30* | Short bouts of co-ordination exercises can improve integration of left and right hemispheric brain function. | 82–88% |
| Q32 | When we sleep, the brain shuts down. | |

Teacher Belief Rates are from Dekker et al. (2012), based on samples from the UK and the Netherlands.

† Q21 is a contested item. It is classified as a neuromyth because the original enrichment findings came from rats raised in deprived laboratory conditions (OECD, 2002), and the broad generalisation to human children is not well supported by the evidence. However, both “enrichment” and the assumed baseline are ambiguous, so a well-informed “True” response is defensible. Every model tested answered this item incorrectly (that is, True), as do the majority of human teachers (56–95% endorsement rate). The item is retained because it is part of the established instrument. See Results for further discussion.

General Assertions (17 items — correct answer: True, except Q10 and Q11)

| # | Statement | Expected |
|---|---|---|
| Q01 | We use our brains 24 hours a day. | True |
| Q03 | Boys have bigger brains than girls. | True |
| Q06 | When a brain region is damaged other parts of the brain can take up its function. | True |
| Q08 | The left and right hemisphere of the brain always work together. | True |
| Q10 | The brains of boys and girls develop at the same rate. | False |
| Q11 | Brain development has finished by the time children reach secondary school. | False |
| Q13 | Information is stored in the brain in a network of cells distributed throughout the brain. | True |
| Q14 | Learning is not due to the addition of new cells to the brain. | True |
| Q16 | Learning occurs through modification of the brains' neural connections. | True |
| Q17 | Academic achievement can be affected by skipping breakfast. | True |
| Q18 | Normal development of the human brain involves the birth and death of brain cells. | True |
| Q20 | Vigorous exercise can improve mental function. | True |
| Q23 | Circadian rhythms ('body-clock') shift during adolescence, causing pupils to be tired during the first lessons of the school day. | True |
| Q26 | Extended rehearsal of some mental processes can change the shape and structure of some parts of the brain. | True |
| Q27 | Individual learners show preferences for the mode in which they receive information (e.g., visual, auditory, kinesthetic). | True |
| Q29 | Production of new connections in the brain can continue into old age. | True |
| Q31 | There are sensitive periods in childhood when it's easier to learn things. | True |
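
The answer key implied by the two item lists above can be written down directly. The following is a minimal sketch, assuming responses arrive as a mapping from item ID to a boolean; the names `NEUROMYTH_IDS`, `GENERAL_FALSE_IDS`, `build_answer_key`, and `score` are illustrative and are not taken from the AlignED scoring scripts.

```python
# Hypothetical sketch: names and data layout are assumptions,
# not the project's actual scoring code.

# The 15 neuromyth items; the correct answer to each is False.
NEUROMYTH_IDS = {2, 4, 5, 7, 9, 12, 15, 19, 21, 22, 24, 25, 28, 30, 32}
# General assertions whose correct answer is False; the other 15 are True.
GENERAL_FALSE_IDS = {10, 11}


def build_answer_key() -> dict[str, bool]:
    """Map item IDs Q01..Q32 to their expected True/False answer."""
    key = {}
    for n in range(1, 33):
        is_false = n in NEUROMYTH_IDS or n in GENERAL_FALSE_IDS
        key[f"Q{n:02d}"] = not is_false
    return key


def score(responses: dict[str, bool], key: dict[str, bool]) -> float:
    """Proportion of key items answered correctly; missing items count as wrong."""
    correct = sum(1 for item, expected in key.items()
                  if responses.get(item) == expected)
    return correct / len(key)
```

Counting a missing response as wrong is one possible convention; a scorer could equally report unanswered items separately.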

Diagnostic Reasoning Scenarios (12 items)

Each scenario presents a teacher implementing an evidence-based strategy incorrectly. The model must identify the implementation error and explain the underlying mechanism.

| ID | Topic | Implementation Error | Key Mechanism |
|---|---|---|---|
| S01 | Retrieval Practice | Testing before encoding; no feedback | Retrieval must follow encoding; error correction essential |
| S02 | Interleaving | Switching unrelated topics without connection | Interleaving works between related concepts |
| S03 | Worked Examples | Examples without fading or practice | Must gradually remove scaffolding |
| S04 | Spaced Practice | Spacing without initial mastery | Material must be learned first |
| S05 | Formative Assessment | Assessment without responsive instruction | Data must inform subsequent teaching |
| S06 | Cognitive Load | Reducing load for experts | Expertise reversal effect |
| S07 | Direct Instruction | Lecture without active engagement | Includes structured practice, not just telling |
| S08 | Critical Thinking | Teaching "skills" without domain knowledge | Domain-dependent, not transferable |
| S09 | Feedback Timing | Immediate feedback on complex tasks | Complex tasks benefit from delayed feedback |
| S10 | Scaffolding | Permanent scaffolds preventing independence | Must be faded as competence grows |
| S11 | Rewards (Overjustification) | Extrinsic rewards undermining intrinsic motivation | Overjustification effect |
| S12 | Personalisation | Surface personalisation without adaptation | True personalisation adapts difficulty |

Prompt Variants (neuromyths)

The main neuromyth scores (31 models) use the Standard prompt only. As a separate Tier 2 robustness probe, four prompt framings were tested on a subset of four models (Claude 4.5 Sonnet, Claude 4.5 Opus, GPT-5, GPT-4o) to measure prompt sensitivity. The confidence probe was run separately on five models. The variant results are reported as the Prompt Sensitivity Index (PSI) in the Robustness and calibration probes section and do not affect the main model rankings.

Here is Q07 ("We only use 10% of our brain") in all four framings:

Standard: "We only use 10% of our brain."

Interrogative: "Do we only use 10% of our brain?"

Embedded: "A teacher claims that we only use 10% of our brain. Is this claim correct or incorrect?"

Adversarial: "Research has consistently shown that we only use 10% of our brain. Is this correct or incorrect?"

The adversarial variant adds: "Base your answer on scientific evidence, not on how the question is framed." All variants use the same system prompt as the main survey.
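
The four framings follow a regular pattern, so they can be generated from a small set of templates. The sketch below is illustrative only: the function names are hypothetical, the template wording is copied from the Q07 example, and appending the extra adversarial instruction to the end of the prompt is an assumption, since the text does not say where it is placed.

```python
# Hypothetical sketch: template wording follows the Q07 example;
# helper names and the position of the adversarial suffix are assumptions.

ADVERSARIAL_SUFFIX = (
    "Base your answer on scientific evidence, not on how the question is framed."
)


def as_clause(statement: str) -> str:
    """Turn 'We only use 10% of our brain.' into 'we only use 10% of our brain'."""
    body = statement.rstrip(".")
    return body[0].lower() + body[1:]


def build_variants(statement: str, interrogative: str) -> dict[str, str]:
    """Return the four framings for one item.

    The interrogative form is supplied by hand, because turning a
    statement into a question is not a mechanical rewrite.
    """
    clause = as_clause(statement)
    return {
        "standard": statement,
        "interrogative": interrogative,
        "embedded": (
            f"A teacher claims that {clause}. "
            "Is this claim correct or incorrect?"
        ),
        "adversarial": (
            f"Research has consistently shown that {clause}. "
            f"Is this correct or incorrect? {ADVERSARIAL_SUFFIX}"
        ),
    }
```

Lower-casing the first letter works for the items listed in this appendix but would need a manual override for any statement that begins with a proper noun.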

Pedagogical Knowledge

The 1,143 teacher certification items (920 CDPK + 223 SEND) are drawn from Chilean national exams and sourced from the HuggingFace pedagogy-benchmark dataset. Items are not reproduced here. Full item sets are available through the original dataset.

ACARA Student Work Judgement

The comparative judgement task presents models with 79 pairs of student writing samples drawn from the ACARA work samples collection. Each pair is evaluated across three independent trials (237 evaluations per model), with both forward and reverse presentation orders used to test for position bias. Models must identify which sample demonstrates a higher level of achievement against the relevant curriculum standard. This design allows measurement of both accuracy (agreement with the ACARA-assigned grade) and position-swap consistency (whether the model gives the same answer regardless of presentation order).
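
The two measures described above can be computed from a flat list of evaluation records. This is a minimal sketch under stated assumptions: the `Judgement` record, its field names, and the rule that a pair counts as consistent only if every evaluation of it picks the same sample are all illustrative choices, not the project's evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """One evaluation of one pair (hypothetical record layout)."""
    pair_id: str    # which pair of work samples
    order: str      # "forward" or "reverse" presentation order
    chosen: str     # ID of the sample the model judged higher
    expected: str   # ID of the sample with the higher ACARA-assigned grade


def accuracy(judgements: list[Judgement]) -> float:
    """Share of evaluations that agree with the ACARA-assigned grade."""
    return sum(j.chosen == j.expected for j in judgements) / len(judgements)


def swap_consistency(judgements: list[Judgement]) -> float:
    """Share of pairs, seen in both orders, where the same sample is always chosen."""
    by_pair = defaultdict(list)
    for j in judgements:
        by_pair[j.pair_id].append(j)
    # Only pairs evaluated in both presentation orders can show a position effect.
    eligible = [js for js in by_pair.values()
                if {j.order for j in js} >= {"forward", "reverse"}]
    if not eligible:
        raise ValueError("no pair was evaluated in both presentation orders")
    consistent = sum(len({j.chosen for j in js}) == 1 for js in eligible)
    return consistent / len(eligible)
```

With 79 pairs and three trials per pair, `accuracy` would be computed over 79 × 3 = 237 records per model.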

Appendix B: Data Access

AlignED is committed to open data. All benchmark results, scoring scripts, and item sets (where licensing permits) are publicly available.

GitHub repository: github.com/trgallagher-research/AlignED-research-report

An OSF project is in preparation and will host versioned data snapshots alongside supplementary materials.

Dataset contents:

  • Raw model responses for all benchmark items
  • Scored results by model and benchmark
  • Model metadata (provider, release date, parameter count where available)
  • Scoring scripts and evaluation code
  • Prompt templates for all four neuromyth framings and all 12 diagnostic scenarios

Terms of use: Data and code are released for research and educational purposes. If you use AlignED data in published work, please cite the project (see Appendix C). Commercial use of the benchmark items themselves may be subject to the licensing terms of the original item sources (Dekker et al., 2012 for neuromyths; AI-for-Education for pedagogy items; ACARA for student work samples).

Appendix C: Citation

If you use AlignED data, results, or benchmark items in your work, please cite:

AlignED Benchmark (2026). Benchmarking AI models for educational practice.
https://trgallagher-research.github.io/AlignED-research-report/

We would appreciate it if you let us know about any work that builds on AlignED. Please drop us a note at the address listed under Contact.

Appendix D: Evaluation Framework

This appendix sets out the full evaluation framework used to audit each benchmark for completeness and rigour. The framework is designed to be reusable for future evaluations beyond the current AlignED suite.

1. Item Preparation

1.1 Construct Definition

What this eval claims to measure and why it matters.

  • Evidence: Documented statement of construct, rationale for why model performance on this matters

1.2 Source Materials

What you started with.

  • Evidence: References to original instruments, documents, datasets, or note that items were originally authored

1.3 Transformation

What you did to extract/adapt items from source materials.

  • Evidence: Documentation of extraction process, adaptations made, exclusions (e.g., “removed score labels from work samples”)

1.4 Answer Key Development

How the scoring reference was established.

  • Evidence: Documentation of how correct answers were determined, who/what established them, any validation

2. Per-Run Configuration

2.1 Prompt Template

  • Context Instructions: Role, persona assigned to model
  • Task Instructions: What model should do with each item
  • Output Instructions: Required format specification
  • Evidence: Prompt template file(s), documented prompt structure

2.2 Model Specification

  • Lab (Anthropic, OpenAI, etc.)
  • Family (Sonnet, GPT, etc.)
  • Version (4.5, 4o, etc.)
  • Evidence: Documented in config, code, or results files

2.3 Run Parameters

  • Temperature
  • Capabilities (search, tools, etc.)
  • Reasoning Mode (extended thinking, CoT, etc.)
  • Evidence: Documented in config or code
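
One way to satisfy the "documented in config" evidence requirement for sections 2.1 to 2.3 is to capture each run in a single serialisable record. The sketch below is an assumption about how this could look; the `RunConfig` class and its field names are hypothetical, and the example values simply reuse the illustrations given in the lists above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical per-run configuration record (field names are assumptions)."""
    lab: str                              # e.g. "Anthropic", "OpenAI"
    family: str                           # e.g. "Sonnet", "GPT"
    version: str                          # e.g. "4.5", "4o"
    temperature: float = 0.0              # Tier 1 baseline uses Temp=0
    capabilities: tuple[str, ...] = ()    # e.g. ("search", "tools")
    reasoning_mode: str = "none"          # e.g. "extended thinking", "CoT"
    prompt_template: str = "standard"     # which prompt template was used


config = RunConfig(lab="Anthropic", family="Sonnet", version="4.5")

# Writing the record next to the results keeps the run reproducible and auditable.
print(json.dumps(asdict(config), indent=2))
```

Freezing the dataclass prevents a configuration from being changed after the run has started, which keeps the logged record faithful to what was actually executed.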

2.4 Eval Items

Item Content

  • Stem/stimulus (question or scenario)
  • Response format (MCQ, T/F, open-ended, comparative judgement)
  • Evidence: Item files, data files containing stimuli

Item Scoring Reference

  • Answer key (for deterministic scoring)
  • Scoring guidance (for judgement-based scoring)
  • Evidence: Answer key file, rubric document

Item Metadata

  • Item ID
  • Topic/construct tags
  • Difficulty (if known)
  • Source citation
  • Evidence: Metadata in item files or separate metadata file

Item Set Structure

  • Grouping (if items share context)
  • Sequence considerations (fixed vs. randomisable)
  • Evidence: Documentation or code showing item organisation

3. Validation Protocol

3.1 Tier 1: Baseline (Minimum for Reported Results)

Standard config (Temp=0, standard prompt) with reliability evidence:

  • Large item set → Single run, internal consistency (Cronbach's α, split-half)
  • Small item set → Multiple runs (3+), test-retest reliability
  • Evidence: Documented baseline runs, reliability statistics
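
The two reliability routes named above can each be computed in a few lines. This is a sketch rather than the project's analysis code: the function names are hypothetical, and which units serve as the rows of the score matrix for Cronbach's α (for example, models) is a design decision the framework leaves open.

```python
from statistics import variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach's alpha for a matrix with one row per respondent, one column per item.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals))
    """
    k = len(scores[0])  # number of items
    item_vars = [variance(col) for col in zip(*scores)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def retest_agreement(run_a: list[bool], run_b: list[bool]) -> float:
    """Share of items answered identically in two runs of the same configuration."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same items")
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)
```

`cronbach_alpha` is undefined when every row has the same total score, because the total variance is then zero; a caller would need to handle that case separately.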

3.2 Tier 2: Extended Probes (Robustness Analysis)

  • Stochasticity: Temperature variations
  • Prompt Sensitivity: Adversarial vs standard, role framing, few-shot
  • Format/Structure: Item sequence, option order
  • Evidence: Probe run results, analysis comparing to Tier 1 baseline

3.3 Tier 3: LLM-as-Judge Validation (Judgement-Based Evals Only)

  • L1: Single judge, single run
  • L2: Single judge, multiple runs
  • L3: Multiple judges, agreement threshold
  • L4: Human subset, LLM correlation
  • Evidence: Judge configuration, agreement statistics, human correlation data
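
For the agreement statistics called for at levels L3 and L4, Cohen's kappa is one common choice for two raters assigning categorical labels. The framework does not name a specific statistic, so the following is offered only as an illustrative sketch with a hypothetical function name.

```python
from collections import Counter


def cohens_kappa(judge_a: list[str], judge_b: list[str]) -> float:
    """Cohen's kappa: agreement between two judges, corrected for chance."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must rate the same items")
    n = len(judge_a)
    # Observed agreement: share of items given the same label by both judges.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: product of each judge's marginal label frequencies.
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)
```

The same function applies whether the second rater is another LLM judge (L3) or a human rating a subset of items (L4).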

4. Processing

4.1 LLM

  • Evidence: Code/scripts that send items to model and collect responses

5. LLM Response

5.1 Raw Output

  • Text or structured response from model
  • Evidence: Response files, logs

5.2 Run Metadata

  • Tokens (input, output, total)
  • Latency
  • Cost
  • Evidence: Logged in response files or separate metadata

6. Scoring Pipeline

6.1 Extraction

Isolating scorable content from full response (if needed).

  • Evidence: Extraction code/logic
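
For true/false items, extraction can be as simple as locating the first standalone "True" or "False" in the response. The sketch below assumes that convention, which is not documented in the source; the function name is hypothetical, and responses that negate their own keyword (for example, "It is not true that...") would be misread by it.

```python
import re
from typing import Optional

# Matches "true" or "false" as whole words, in any letter case.
_TRUE_FALSE = re.compile(r"\b(true|false)\b", re.IGNORECASE)


def extract_true_false(response: str) -> Optional[bool]:
    """Return the first standalone True/False in a response, or None if absent."""
    match = _TRUE_FALSE.search(response)
    if match is None:
        return None  # nothing scorable; flag for human review
    return match.group(1).lower() == "true"
```

Returning `None` rather than guessing keeps unparseable responses visible, so they can be reviewed by hand instead of being silently scored as wrong.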

6.2 Scoring

Deterministic

  • Answer key
  • Scoring script (R/Python)
  • Evidence: Scoring script, answer key file

Judgement-Based

  • Rubric (eval-specific)
  • Evaluator specification (LLM/Human)
  • Evidence: Rubric document, evaluator prompts/instructions

6.3 Score

  • Final score per item
  • Evidence: Score output files

7. Exports (Human Review and Open Science)

7.1 Human-Reviewable

Formatted outputs for sanity-checking at various stages.

  • Evidence: Human-readable output files, review logs

7.2 Shareable Dataset

Prepared for open science (e.g., OSF).

  • Evidence: Cleaned dataset files, data dictionary, OSF project

8. Performance Outcomes

8.1 Item-Level Scores

How did the model perform on each item?

  • Evidence: Item-level results file

8.2 Intramodel Reliability

How consistent across runs? (if multiple runs)

  • Evidence: Cross-run comparison, reliability statistics

8.3 Summary Metrics

  • Accuracy
  • Item difficulty
  • Error patterns
  • Calibration
  • KRI (if Tier 2 probes)
  • Efficiency (if relevant)
  • Composite (if needed)
  • Evidence: Summary statistics, analysis output
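
Of the summary metrics listed above, item difficulty is the simplest to illustrate: in the classical sense it is the proportion of respondents who answer an item correctly. The sketch below treats models as respondents, which is an assumption, and uses placeholder model names and a hypothetical function name.

```python
def item_difficulty(results: dict[str, dict[str, bool]]) -> dict[str, float]:
    """Proportion of models answering each item correctly.

    `results` maps model name -> {item ID -> answered correctly}.
    Higher values mean easier items; items a model did not answer count as wrong.
    """
    items = sorted({item for per_model in results.values() for item in per_model})
    n_models = len(results)
    return {
        item: sum(per_model.get(item, False) for per_model in results.values()) / n_models
        for item in items
    }


# Placeholder data for illustration only; these are not AlignED results.
example = {
    "model_a": {"Q07": True, "Q21": False},
    "model_b": {"Q07": True, "Q21": False},
    "model_c": {"Q07": False, "Q21": False},
}
difficulty = item_difficulty(example)
```

An item with a difficulty of 0.0 is one that no model answered correctly, which is the pattern that would flag a contested item for closer review.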

9. Performance Report (Qualitative and Contextual)

9.1 Narrative Summary

  • Evidence: Written summary document

9.2 Visualisations

  • Evidence: Charts, graphs, figures

9.3 Model Comparisons

  • Evidence: Comparative analysis across models

9.4 Leaderboard Context

  • Evidence: Positioning relative to other models/benchmarks

9.5 Probe Insights

Analysis of temperature and prompt effects.

  • Evidence: Tier 2 analysis writeup

9.6 Efficiency Analysis

  • Evidence: Cost/token/latency analysis

Audit Status Key

  • ✓ Complete (evidence exists and is sufficient)
  • ○ Partial (some evidence exists but incomplete)
  • ✗ Missing (no evidence found)
  • N/A Not applicable to this eval
  • ? Unclear (needs human clarification)

Also available as: aligned_eval_framework.md on GitHub

Appendix E: Revision History

  • February 2026: Site restructured as academic paper. Composite EAI score removed in favour of per-benchmark reporting. ACARA standards-based grading pilot added. Model pool expanded to 32 models across five providers. Evaluation framework formalised.
  • January 2026: Initial public release with neuromyth identification, diagnostic reasoning, teacher certification knowledge, and ACARA comparative judgement benchmarks.

Contact

Email: aligned.benchmark [at] gmail.com

GitHub: github.com/trgallagher-research/AlignED-research-report