Discussion
Five takeaways, limitations, and future directions
Five takeaways from the data
Five evaluations across 32 models produce a large volume of results. Rather than summarise every finding, we draw out five observations that cut across the individual benchmarks and carry implications for how AI is adopted in education.
4.1 Frontier models are improving but not yet reliably aligned
Top models score above 90% on classification tasks such as neuromyth identification and pedagogical knowledge items. Applied judgement is a different story. On standards-based grading, the average Cohen's kappa across all seven models tested was just 0.252. On the conventional Landis and Koch scale that is only "fair" agreement: better than chance, but well short of what reliable grading of student work against curriculum standards would require.
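For readers unfamiliar with the statistic, the following is a minimal sketch of how Cohen's kappa corrects raw agreement for chance. It assumes the unweighted form of kappa; the text does not say whether AlignED uses a weighted variant, which is common for ordinal grades. The function name and the label lists are hypothetical and are not AlignED data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up grades, for illustration only (not AlignED data).
human = ["A", "B", "C", "B", "A", "C", "B", "A"]
model = ["A", "B", "B", "B", "C", "C", "A", "B"]
kappa = cohens_kappa(human, model)
print(f"kappa = {kappa:.3f}")
```

On these made-up labels the two raters agree on 4 of 8 items, chance agreement is about 0.34, and kappa comes out near 0.24. A kappa of 0 means agreement no better than chance and 1 means perfect agreement.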
Performance is uneven across task types within the same model. A model can identify neuromyths with 93% accuracy and still fail to distinguish excellent student work from adequate student work. Classification and judgement are different capabilities, and current models are far stronger at the first than the second.
A caveat: the ACARA achievement standards used in our grading benchmark are not particularly descriptive. They may lack the specificity needed to distinguish adjacent performance levels reliably. A benchmark built on more detailed rubrics with explicit grade-level descriptors would better evaluate whether models struggle with the task of absolute grading itself or simply with underspecified criteria. This is a priority for future evaluation development.
4.2 Dramatic generational improvement
The gap between older and newer models is large and growing. Claude 3 Haiku, released in March 2024, scores 54.8% on teacher certification knowledge. Claude 4.5 Opus, released in November 2025, scores 88.1%. That is a 33-point improvement in 20 months from the same provider. Similar generational gaps appear across all benchmarks and all providers.
This trend is compounded by cost. Newer models are not only more capable but cheaper to run. There is little reason to use an older model when its successor outperforms it at a lower price. The practical consequence is straightforward: impressions formed from using AI six months ago may already be outdated. Evaluations that do not track model versions and release dates risk drawing conclusions from capabilities that no longer represent the current state of the technology.
4.3 No single best model
Rankings shift across evaluations. Gemini 2.5 Pro leads on pedagogical knowledge but scores below several smaller models on neuromyth identification. Model rankings on ACARA comparative judgement do not correlate with rankings on the knowledge benchmarks. No model ranks first on all five evaluations. Any claim that "Model X is the best for education" is incomplete without specifying which educational task. This finding is consistent across every analysis we have run. It is the single most important result for anyone selecting a model for educational use.
These rankings are also a snapshot. Gemini 2.5 Pro is not Google's current frontier model; Gemini 3 has since been released, and evaluation on newer models is ongoing. Rankings will shift as new models are tested.
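One way to check whether rankings on two benchmarks track each other is a rank correlation. The sketch below computes Spearman's rho using only the standard library. It is an assumption that this resembles the analysis behind the statement above, since the text does not name the statistic used. The scores are placeholders, not AlignED results.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores from 1 (highest) downward; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(scores_x, scores_y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(scores_x) != len(scores_y) or len(scores_x) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    rx, ry = average_ranks(scores_x), average_ranks(scores_y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when all scores are tied")
    return cov / (var_x * var_y) ** 0.5


# Placeholder scores for five hypothetical models (not AlignED results).
knowledge_accuracy = [0.91, 0.88, 0.84, 0.79, 0.70]
judgement_kappa = [0.22, 0.31, 0.18, 0.27, 0.25]
rho = spearman_rho(knowledge_accuracy, judgement_kappa)
print(f"Spearman rho = {rho:.2f}")
```

With these placeholder scores rho is -0.1: the ordering on one benchmark says almost nothing about the ordering on the other. A value near 1 would mean the two benchmarks rank the models in nearly the same order.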
4.4 Assessment task type matters enormously
The gap between classification and applied judgement is the most striking pattern in the data. Classification provides clear decision boundaries. A neuromyth is either supported by evidence or it is not. Standards-based grading requires interpreting criteria and making fine-grained distinctions between adjacent performance levels. When agreement is adjusted for chance, no model reaches even "moderate" agreement on the grading task.
Two factors likely contribute. First, the ACARA achievement standards may lack sufficient specificity for reliable absolute classification by either humans or models. Second, current models may lack the grounding in student development and assessment practice needed to make these fine-grained judgements, regardless of rubric quality. When we asked models to reason step-by-step before classifying, performance got worse, not better. This suggests the bottleneck is not in the reasoning process, though whether it lies primarily in the criteria or in model capability remains an open question.
4.5 We urgently need more educational evaluations
Five evaluations covering neuromyth identification, diagnostic reasoning, teacher certification knowledge, and student work judgement are a start. They are not enough. These benchmarks do not cover lesson planning quality, pastoral and wellbeing support, feedback generation, curriculum adaptation, or dozens of other tasks teachers perform daily. The field needs more benchmarks, from more research groups, covering more tasks, in more cultural and curricular contexts. AlignED contributes five. The field needs fifty.
Limitations
Every study has limitations. We state ours directly.
Training data contamination. Some benchmark items, particularly the 32 items (15 neuromyths and 17 general assertions) from Dekker et al. (2012), have been published for over a decade and may appear in model training data. High scores on these items may partly reflect memorisation rather than reasoning about the underlying science. We cannot rule this out for any model, and it affects interpretation of the neuromyth results most directly.
LLM-as-judge scoring. Diagnostic reasoning responses are scored by Claude 4.5 Sonnet acting as an automated judge. This introduces potential judge model bias. We validated a sample of judgements manually and found high agreement, but systematic bias remains possible. The judge model may favour response styles similar to its own outputs.
Cultural specificity. The 1,143 pedagogical knowledge items come from Chilean teacher certification exams. Pedagogical principles are broadly universal, but some items may reflect Chilean educational policy, curricular structure, or cultural context. Performance differences across models could partly reflect differential exposure to Chilean educational content in training data rather than differences in pedagogical reasoning ability.
Varying sample sizes. Benchmarks range from 32 items (neuromyths) to 1,143 items (teacher certification knowledge). Statistical power differs accordingly. On a 32-item test, three percentage points is roughly a single item, so a three-point difference there carries far less evidential weight than a three-point difference on a 1,143-item test. We report confidence intervals where appropriate, but readers should weight findings by sample size.
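To make the sample-size point concrete, here is a minimal sketch comparing confidence intervals for the same accuracy at the two benchmark sizes named above. It assumes a 95% Wilson score interval. The text does not state which interval method AlignED reports, and the counts are illustrative, not actual results.

```python
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for the proportion correct/total."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# The same illustrative accuracy (about 75%) at both benchmark sizes.
small_low, small_high = wilson_interval(24, 32)
large_low, large_high = wilson_interval(857, 1143)
small_width = small_high - small_low
large_width = large_high - large_low
print(f"32 items:   {small_low:.3f} to {small_high:.3f}")
print(f"1143 items: {large_low:.3f} to {large_high:.3f}")
```

With these illustrative counts the interval on 32 items is roughly 29 percentage points wide, against about 5 points on 1,143 items. A three-point gap between two models sits well inside the noise on the smaller benchmark.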
Incomplete validation. We defined three validation tiers for each benchmark: baseline reliability, robustness probes, and judge validation (see Methods). Not all tiers are complete for all benchmarks. The results presented here should be interpreted as provisional pending full validation.
What we claim and what we do not
AlignED measures how well AI models perform on specific benchmark tasks relevant to educational practice. It does not measure whether a model is a good tutor, can teach effectively, or is safe to deploy in a classroom.
A high score means the model answered benchmark items correctly. It does not mean the model can apply that knowledge in a real classroom with real students, real time pressure, and real consequences. Knowing that a learning-styles approach lacks evidence is not the same as redirecting a colleague who uses it. Scoring well on a teacher certification exam is not the same as teaching well.
These benchmarks are a starting point. A model that cannot identify common neuromyths or diagnose basic implementation failures may be a poor fit for educational applications. But a model that performs well on these benchmarks still needs to be evaluated for safety, bias, and pedagogical effectiveness before deployment.
Future directions
These are research directions, not commitments to specific timelines.
- Tracking model capabilities over time. As models improve and costs decrease, repeated evaluation of the same benchmarks will reveal whether gains are broad or concentrated in specific task types.
- New benchmark modules. Lesson planning quality, unit design, and wellbeing and pastoral support are high-priority additions. Each requires its own item development, scoring rubric, and validation process.
- Human validation studies. Comparing model performance to teacher panels on the same tasks will establish whether benchmark scores are meaningful indicators of practical capability.
- Cross-cultural expansion. Pedagogical knowledge items should extend beyond the Chilean context to include teacher certification content from other countries and educational traditions.
- Completing all three validation tiers. Each benchmark needs baseline reliability, robustness probes, and judge validation. Until all three are complete, results remain provisional.
Closing
AI models are already being used for educational tasks. Teachers are adopting them. Students are using them. Policymakers are making decisions about them. The question is not whether AI will play a role in education. It already does. The question is whether we will measure what these tools can and cannot do before we build systems around them.
AlignED is one contribution to that measurement. Five evaluations, 32 models, a set of findings that are already being overtaken by the next generation of releases. The work is to keep measuring, keep publishing, and keep being honest about what the data show and what they do not.
It is the author's view that the evidence will eventually show there are few, if any, cognitive tasks in education that AI cannot perform competently. The purpose of AlignED is to test that proposition rigorously rather than assert it.
How AI is used in education is not something that will simply happen to us. It is a choice we get to make.