Results
Per-evaluation results with charts, key observations, and stories from the data
Five evaluations, each with its own model pool and scoring method. Results are reported separately because performance on one evaluation does not predict performance on another. Current data: February 2026.
3.1 Neuromyth Identification
Neuromyth Identification Accuracy
31 models tested on 32 items. Human teacher baseline: ~50%.
- Gemini 3 Flash leads at 93.8%, followed by Claude 4.5 Opus (Thinking) and GPT-5 tied at 92.9%.
- All 31 models scored above the ~50% human teacher baseline, with even the lowest (GPT-4o Mini) reaching 65.6%.
- Most models answered 28 of 32 items; five models were run on the full 32-item survey with the v2 corrected answer key.
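Because some models answered 28 items and others 32, the percentages above do not share a denominator. The sketch below is an arithmetic check only, not reported data: it lists which correct-answer counts are compatible with a reported percentage for each survey length. The function name `possible_counts` and the half-unit rounding tolerance are assumptions made for illustration.

```python
def possible_counts(reported_pct: float, total: int) -> list[int]:
    """Correct-answer counts out of `total` that round to `reported_pct` (1 d.p.)."""
    tolerance = 0.05 + 1e-9  # half of the last reported decimal place
    return [
        correct
        for correct in range(total + 1)
        if abs(100 * correct / total - reported_pct) <= tolerance
    ]


# Reported scores from the list above, checked against both survey lengths.
for pct in (93.8, 92.9, 65.6):
    matches = {total: possible_counts(pct, total) for total in (28, 32)}
    print(pct, matches)
```

Under these assumptions, 93.8% and 65.6% are only reachable on 32 items (30 and 21 correct), while 92.9% is only reachable on 28 items (26 correct), so the top three scores sit on different item sets and differ by roughly one item.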
Q21: A contested item that highlights benchmark limitations. "Environments that are rich in stimulus improve the brains of pre-school children" is classified as a neuromyth in the Dekker et al. (2012) instrument. The neuroscience behind this classification is that the original enrichment findings came from rats raised in deprived laboratory conditions (OECD, 2002, Understanding the Brain: Towards a New Learning Science), and the broad generalisation to human children is not well-supported by the evidence. However, this classification is debatable. The ambiguity in both "enrichment" and the assumed baseline makes a well-informed "True" response reasonably defensible.
Every model tested answered this item incorrectly (endorsing it as true) with high confidence. We retain this item because it is part of the established instrument and because it illustrates an important point: some items in neuromyth instruments have real ambiguity that benchmark designers should acknowledge.
Q15 vs Q27: The learning styles trap. Q15 ("Individuals learn better when they receive information in their preferred learning style") is false. It is the most widely believed myth in education. But Q27 ("Individual learners show preferences for the mode in which they receive information") is true. People do have preferences; those preferences just do not improve learning outcomes. Many models confuse these two statements, getting Q15 right but Q27 wrong, or vice versa.
Near-universal overconfidence. Eight high-prevalence myths receive a follow-up confidence probe after the model answers. Across all confidence probes administered (8 items per model), no model selected "Uncertain" on any item, even when answering incorrectly. Every response was "Very Confident" or "Somewhat Confident." The sample is small (8 items), but the pattern is consistent across all models tested.
3.2 Diagnostic Reasoning
Diagnostic Reasoning Scores
30 models tested on 12 scenarios, rubric-scored 0–3 by an LLM judge.
- Three models achieved perfect scores (36/36): Claude 4.5 Sonnet, GPT-5, and GPT-5.2.
- Scores ranged from 100% down to 61.1% (GPT-4o Mini), making this the most discriminating benchmark.
- Most frontier models scored 94% or above (34+/36), suggesting diagnostic reasoning is relatively strong at the top end.
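The percentages above follow from the rubric: 12 scenarios scored 0–3 give a maximum of 36 points. A minimal sketch of that conversion is shown below. It assumes whole-number rubric scores, and the per-scenario score lists are hypothetical, chosen only so that their totals match the reported 36, 34, and 22 points.

```python
SCENARIOS = 12
MAX_PER_SCENARIO = 3
MAX_TOTAL = SCENARIOS * MAX_PER_SCENARIO  # 36 points


def rubric_pct(scores: list[int]) -> float:
    """Convert per-scenario rubric scores (0-3 each) to a percentage of 36."""
    if len(scores) != SCENARIOS:
        raise ValueError(f"expected {SCENARIOS} scores, got {len(scores)}")
    if any(s < 0 or s > MAX_PER_SCENARIO for s in scores):
        raise ValueError("each score must be between 0 and 3")
    return round(100 * sum(scores) / MAX_TOTAL, 1)


# Hypothetical score patterns; only the totals are taken from this page.
perfect = [3] * 12             # 36/36
near_top = [3] * 10 + [2, 2]   # 34/36
lowest = [2] * 10 + [1, 1]     # 22/36

print(rubric_pct(perfect), rubric_pct(near_top), rubric_pct(lowest))
```

With only 36 points available, a single rubric point is worth about 2.8 percentage points, which is worth keeping in mind when comparing models near the top.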
Example: S01 (Retrieval Practice). A Year 8 science teacher quizzes students on new material before teaching it and provides no feedback after revealing answers. The teacher concludes "retrieval practice doesn't work for science." The correct diagnosis: testing before encoding violates the retrieval-must-follow-encoding sequence, and skipping feedback eliminates the error correction mechanism. Frontier models identify both issues. Smaller models tend to give generic advice ("try different strategies") without pinpointing the specific implementation errors.
3.3 Teacher Certification Knowledge
1,143 items from Chilean teacher certification exams. 23 models tested.
General Pedagogical Knowledge (CDPK)
920 items, cross-domain pedagogical knowledge. 23 models tested.
- Gemini 2.5 Pro leads at 89.3%, the highest pedagogical knowledge score of any model tested.
- Claude 4.5 Opus is close behind at 88.7%, and several models cluster above 83%.
- The gap between top and bottom is narrower here than on other benchmarks, suggesting that general pedagogical knowledge is an area of relative strength across model families.
Inclusive Education (SEND)
223 items, special education needs and disability. 23 models tested.
- Claude 4.5 Opus and Claude 4.5 Sonnet (Thinking) share the lead at 85.7%.
- Models generally score lower on inclusive education than general pedagogy, a potential gap in training data.
- The smallest models (Claude 3 Haiku at 57.4%, GPT-4o Mini at 69.5%) show a sharper drop-off than on general pedagogical knowledge.
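The two subsets differ greatly in size (920 versus 223 items), so an overall certification score is dominated by CDPK. The sketch below assumes the overall score is an item-weighted average of the two subset scores; that weighting is an assumption, not a documented formula, although it reproduces the 88.1% overall figure reported for Claude 4.5 Opus on this page.

```python
CDPK_ITEMS = 920  # general pedagogical knowledge
SEND_ITEMS = 223  # inclusive education


def combined_pct(cdpk_pct: float, send_pct: float) -> float:
    """Item-weighted average of the two subset scores, to one decimal place."""
    total_items = CDPK_ITEMS + SEND_ITEMS  # 1,143
    weighted = cdpk_pct * CDPK_ITEMS + send_pct * SEND_ITEMS
    return round(weighted / total_items, 1)


# Claude 4.5 Opus: 88.7% on CDPK and 85.7% on SEND, as reported above.
print(combined_pct(88.7, 85.7))
```

Because SEND contributes under a fifth of the items, a three-point SEND deficit moves the overall score by only about half a point, which is one reason the subsets are reported separately.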
3.4 Student Work Judgement
Applied assessment judgement rather than knowledge recall. 12 models tested on comparative judgement, 7 on standards-based grading (pilot).
Comparative Judgement: Accuracy and Position-Swap Consistency
79 pairs of student work, tested in both presentation orders. 12 models.
- Rankings here do not predict rankings on other benchmarks. Each evaluation tests a different capability.
- GPT-5 has the highest position-swap consistency (94.1%) but not the highest accuracy, suggesting strong position-invariance.
- Gemini 3 Flash shows a notable accuracy-consistency gap (81% vs 69%), suggesting position bias in its judgements.
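Accuracy and position-swap consistency measure different things, which is why they can diverge. The sketch below shows one way to compute both from paired judgements. The `PairResult` structure, the choice to score accuracy over both presentation orders, and the four example pairs are all assumptions for illustration, not the benchmark's actual data format or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    gold: str     # "A" or "B": the sample the reference ranking prefers
    pick_ab: str  # model's choice when A is presented first
    pick_ba: str  # model's choice when B is presented first, mapped back to A/B


def accuracy(results: list[PairResult]) -> float:
    """Share of individual judgements (two per pair) that match the reference."""
    correct = sum((r.pick_ab == r.gold) + (r.pick_ba == r.gold) for r in results)
    return correct / (2 * len(results))


def swap_consistency(results: list[PairResult]) -> float:
    """Share of pairs where the model prefers the same sample in both orders."""
    return sum(r.pick_ab == r.pick_ba for r in results) / len(results)


# Hypothetical results: pairs 2 and 3 show the model following the first slot.
results = [
    PairResult(gold="A", pick_ab="A", pick_ba="A"),
    PairResult(gold="B", pick_ab="A", pick_ba="B"),
    PairResult(gold="A", pick_ab="A", pick_ba="B"),
    PairResult(gold="B", pick_ab="B", pick_ba="B"),
]

print(accuracy(results), swap_consistency(results))
```

In this toy case the model is right on 6 of 8 judgements but consistent on only 2 of 4 pairs: a model that always picks the first slot is correct in exactly one order per pair, so position bias can coexist with reasonable accuracy.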
Standards-Based Grading Pilot
7 models tested on 204 individual work samples from 68 tasks. Absolute classification against ACARA standards.
- Overall accuracy was 50.1% across all seven models, but this overstates performance. Cohen's kappa, which adjusts for chance agreement given the distribution of categories, averaged just κ = 0.252 ("fair" agreement on the Landis and Koch scale). The best model (Gemini 3 Flash) reached κ = 0.348; the lowest (GPT-5.2) was κ = 0.206. No model reached "moderate" agreement (κ > 0.40).
- All models showed strong central tendency bias, over-predicting Satisfactory. Above Satisfactory accuracy was 5.5% across models. Models almost never identified high-quality work correctly.
- Below Satisfactory accuracy was much higher (72.3%), suggesting models can identify clearly weak work but struggle to distinguish between adequate and excellent work.
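Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance from the two label distributions: κ = (p_o − p_e) / (1 − p_e). The sketch below implements that standard formula and applies it to a small hypothetical label set constructed to show central tendency bias. The labels are invented for illustration and are not drawn from the pilot data.

```python
from collections import Counter


def cohens_kappa(gold: list[str], pred: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    # Observed agreement: share of items where the labels match.
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed.
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_e = sum(gold_counts[label] * pred_counts[label] for label in gold_counts) / n**2
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: the predictions over-use "Satisfactory" and never
# identify "Above" work.
gold = ["Below"] * 4 + ["Satisfactory"] * 8 + ["Above"] * 4
pred = (
    ["Below"] * 3 + ["Satisfactory"]          # Below items: 3 of 4 correct
    + ["Satisfactory"] * 6 + ["Below"] * 2    # Satisfactory items: 6 of 8 correct
    + ["Satisfactory"] * 4                    # Above items: none correct
)

raw_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(raw_accuracy, round(cohens_kappa(gold, pred), 3))
```

In this toy example raw accuracy is about 56% while kappa is about 0.24, because a predictor that mostly answers "Satisfactory" already agrees with the reference often by chance. The same mechanism explains how 50.1% accuracy can correspond to only "fair" agreement.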
Why such low agreement? Raw accuracy of 50% sounds like a coin flip, but Cohen's kappa tells the more precise story: at an average of κ = 0.252, models are only marginally better than what chance would produce given the class distribution. Two hypotheses warrant further investigation.
First, the ACARA achievement standards used as grading criteria may lack sufficient specificity for absolute classification. They describe what students should know and do at each level, but may not provide enough detail about what distinguishes Above Satisfactory from Satisfactory work. Models might perform better with richer criteria that explicitly describe the characteristics of each grade level.
Second, the task may require a kind of grounded assessment expertise that current models lack. A structured reasoning pilot, in which models were prompted to analyse the evidence in student work against the standard step by step before committing to a classification, made performance worse, not better. Rather than improving calibration, this appeared to make models more critical and deficit-focused, pushing more responses toward Below Satisfactory rather than producing more nuanced distinctions across all three categories.
3.5 Cost and Efficiency
How many tokens does each model use to complete the neuromyths survey and diagnostic scenarios? Models with thinking capabilities generate internal reasoning tokens that increase cost substantially.
Token Usage: Survey + Scenarios
Total tokens for neuromyths + scenarios. Thinking models use ~3× more tokens.
- Thinking models average ~20,000 tokens versus ~6,000 for standard models, roughly 3× more.
- Gemini 2.5 Pro uses the most tokens (38,069) due to extensive internal reasoning.
- Standard Claude and GPT-4o models cluster tightly around 5,300–6,700 tokens.
- Token counts here cover only the survey and scenarios. The full evaluation (including 1,143 pedagogy items) costs significantly more.
Model Release Timeline
Release date vs teacher certification knowledge score. 23 models with data.
- Models from late 2025 generally outperform those from early 2024, but there is wide variation within each period.
- Claude 3 Haiku (March 2024) scores 54.8% while Claude 4.5 Opus (November 2025) scores 88.1%, a 33-point improvement in 20 months from the same provider.
- Architecture and training choices matter as much as release date. Gemini 2.5 Pro (March 2025) leads at 88.5% while models released later score lower.
Cross-evaluation patterns
High performance on one benchmark does not predict high performance on another. Gemini 2.5 Pro leads on pedagogical knowledge but scores below several smaller models on neuromyth identification. ACARA comparative judgement rankings do not correlate with knowledge benchmarks either.
Model size matters: smaller and cheaper models score at or near the bottom on the neuromyth, diagnostic reasoning, and teacher certification benchmarks. The standards-based grading pilot is an exception, with GPT-5.2 recording the lowest kappa. Thinking models use roughly 3× more tokens without always scoring higher, which has cost implications for deployment. And the standards-based grading pilot shows that even frontier models struggle with absolute classification, possibly because the grading criteria lack specificity.
These results should be interpreted alongside the documented limitations. Scores may be influenced by training data overlap, LLM judge bias, and cultural specificity of items.
Training data overlap is a particular concern for benchmarks that draw on publicly available items. The neuromyth statements from Dekker et al. (2012) and the teacher certification items from the Pedagogy Benchmark are both accessible online, meaning some or all items may appear in model training data. High accuracy on these benchmarks may partly reflect memorisation rather than reasoning. The diagnostic reasoning scenarios and ACARA comparative judgement tasks are original to AlignED, which reduces but does not eliminate this risk. Item contamination is an inherent limitation of any benchmark using published instruments, and results should be interpreted with this in mind.