AlignED: Benchmarking AI Models for Educational Practice

Tim Gallagher

February 2026

Abstract

Teachers are rapidly adopting generative AI. Data from the OECD's Teaching and Learning International Survey (TALIS), collected in 2024, shows that 41% of lower secondary teachers across participating countries have used AI for professional tasks. Yet few evaluations test models on the knowledge and judgements that matter in educational settings: understanding how learning works, reasoning about why teaching approaches succeed or fail, demonstrating the pedagogical knowledge expected of qualified teachers, and comparing student work against curriculum standards.

AlignED addresses this gap with five evaluations spanning neuromyth identification, diagnostic reasoning, teacher certification knowledge (general pedagogy and inclusive education), and student work judgement. In total, 32 models from five providers have been tested across these evaluations. Four findings stand out: no single model ranks first across all evaluations; performance on one benchmark does not predict performance on another; assessment task type matters, with models performing well on comparative judgement tasks but achieving only "fair" chance-adjusted agreement (Cohen's κ = 0.25) on standards-based grading; and no model expressed uncertainty on any confidence probe, even when answering incorrectly.

This report is a frozen snapshot of the February 2026 benchmark data. For the full list of AlignED reports, visit AlignED Reports.

Key findings at a glance

Models tested: 32

Evaluations: 5

Grading agreement (Cohen's κ, where 1 = perfect agreement): κ = 0.25

Models that expressed uncertainty (8-item probe): 0
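
For readers unfamiliar with chance-adjusted agreement, the sketch below shows how Cohen's κ is computed from two sets of labels, using the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The function name `cohens_kappa`, the two-level "meets"/"below" scale, and the 16 toy labels are all illustrative assumptions. They are not AlignED data and not necessarily how the benchmark computes its score. The toy labels were chosen so that the result lands at 0.25.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (p_o - p_e) / (1 - p_e)


# Toy data only (hypothetical, NOT AlignED results): 16 pieces of student work
# graded on an assumed two-level scale by a human and by a model.
human = ["meets"] * 8 + ["below"] * 8
model = ["meets"] * 5 + ["below"] * 3 + ["meets"] * 3 + ["below"] * 5

raw_agreement = sum(h == m for h, m in zip(human, model)) / len(human)
kappa = cohens_kappa(human, model)
print(f"raw agreement = {raw_agreement:.3f}, kappa = {kappa:.2f}")
# prints: raw agreement = 0.625, kappa = 0.25
```

In this toy case the two graders agree on 62.5% of items, yet half of that agreement would be expected by chance alone, so κ is only 0.25. This is why a κ in the "fair" range can coexist with a raw agreement rate that looks respectable.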

Reading guide

This site is structured like an academic paper. The Introduction frames the problem and reviews related work. The Methods section describes all five evaluations and their scoring. The Results section presents findings with charts and key observations. The Discussion draws out five takeaways, states limitations, and outlines future work. The Appendices contain full item lists, data access, and citation information.

1. Introduction

2. Methods

3. Results

4. Discussion

Appendices