The opportunity and the uncertainty

AI adoption in education is accelerating. Teachers across the world are integrating generative AI into lesson planning, assessment design, and personalised student support. Some of this integration is institutional; much of it is informal, driven by individual teachers experimenting with tools on their own time.

This adoption brings significant opportunities for practice. AI can draft unit and lesson plans in minutes, generate ideas for classroom activities and assessment at scale, and provide rapid feedback on student writing. For teachers facing workload pressures, competing demands, and the risk of burnout, these capabilities are not trivial.

The uncertainty, however, is equally significant. The speed of adoption means that consequential decisions about teaching and learning are being made faster than evidence can accumulate. Schools are adopting AI-powered tools before those tools have been systematically tested for whether their outputs align with what we know about how students learn. Without evidence about where models succeed and where they fail, the unevenness that Dell'Acqua et al. (2023) call the “jagged technological frontier” of AI capability, there is a risk that AI-generated outputs are being adopted uncritically.

This is not a call for caution over action. It is a call for evidence-informed action. Criticism and advocacy alike need an evidence base, and that base needs to be built.

A different kind of technology?

Generative AI is not simply a better version of previous educational technologies. Interactive whiteboards changed how teachers displayed information. Learning management systems changed how materials were distributed and records kept. Search engines changed how information was accessed. These tools reshaped the logistics of teaching, but the core intellectual work remained largely with the teacher.

Generative AI is different because it produces the intellectual outputs themselves. It generates essays, lesson plans, feedback comments, and assessment rubrics that were previously the exclusive domain of professional judgement. Selwyn (2022) argued that AI in education needs to be understood in terms of its absolute limitations, particularly the gap between what data-driven systems can compute and the contextual judgement that teaching demands. Generative AI extends this concern further: it now produces the artefacts that professional judgement was previously required to create.

Consider the example of feedback on student writing. When a teacher writes that feedback, the process draws on pedagogical knowledge, knowledge of the individual student, and professional values about what matters in learning. We do not fully understand this process; it involves experience, intuition, habit, and bias in ways that are difficult to make explicit. We understand the internal processes of large language models even less. The question, then, is not whether AI and human processes are different in some fundamental sense. The question is whether the outputs serve learning. This is an empirical question, not a philosophical one. It requires evidence about what AI actually produces when given real teaching tasks.

Whether this shift makes AI a fundamentally different kind of technology is itself contested. Narayanan and Kapoor (2025) argue that AI is best understood as “normal technology,” transformative in the way electricity and the internet were, but subject to the same slow patterns of adoption, diffusion, and institutional adaptation. On this view, benchmarks of AI capability are poor predictors of real-world impact, and the educational implications will unfold over decades, not months. At the other extreme, the AI 2027 scenario forecast (Kokotajlo et al., 2025) projects recursive self-improvement and superintelligence within two to three years, a timeline that would compress institutional adaptation to near zero.

For education, neither extreme fully applies. AI adoption among teachers is already faster than the “normal technology” view would predict; the OECD's TALIS 2024 survey (OECD, 2025) supports this. But the consequential integration of AI into assessment, curriculum design, and feedback is still nascent, and the institutional adaptation that Narayanan and Kapoor emphasise has barely begun. AlignED does not require a position on which of these futures is correct. It requires only the premise that AI is already being used in educational practice, and that we currently lack the evidence to know where it performs well enough to trust.

The socio-technical lens

We adopt a socio-technical perspective on AI in education. This means treating the relationship between AI and educational practice as mutually constitutive rather than deterministic (MacKenzie & Wajcman, 1999). AI does not simply act on educational systems from outside. Educational practices, values, and institutions also shape how AI is developed, deployed, adopted, and resisted (Sriprakash et al., 2024).

This perspective matters because it implies agency. The trajectory of AI in education is not predetermined. It will be shaped by how teachers, school leaders, researchers, and policymakers respond. But informed responses require evidence, and at present the evidence base is thin. Without rigorous evaluation of what AI models can and cannot do in educational contexts, the conversation is shaped by vendor claims, social media hype cycles, and advocacy that outpaces the evidence. Inaction carries risks too. If models prove particularly capable at certain tasks that matter to teachers and students, failing to identify and build on those capabilities is itself a missed opportunity.

This lens motivates benchmarking as a form of evidence-informed engagement. The goal is neither uncritical adoption nor blanket rejection. It is to generate the kind of specific, task-level evidence that teachers, school leaders, and policymakers need to make informed decisions about where AI adds value and where human judgement remains essential.

From AI safety to pedagogical alignment

AI safety is a broad field concerned with the full range of risks associated with AI systems, from misuse to technical failure to unintended consequences. Within this field, alignment refers to a more specific concern: whether an AI system's behaviour matches the intentions, values, and goals of its users and the broader communities it affects (Gabriel, 2020). Misalignment occurs when a system pursues objectives that diverge from what humans actually want, even if the system is technically competent.

Pedagogical alignment is a specific form of AI alignment applied to educational contexts. A pedagogically aligned AI produces outputs that are consistent with established evidence about how students learn, how teaching works, and what constitutes sound professional practice. A pedagogically misaligned AI produces outputs that contradict this evidence, regardless of how fluent, confident, or superficially helpful those outputs appear.

If an AI suggests a strategy that contradicts how students actually learn, it is misaligned by definition. Neither the fluency of the suggestion nor the politeness of the tone changes this. What matters is whether the underlying reasoning reflects what the evidence says about learning.

Background and related work

Recent work has begun to evaluate large language models on educational tasks, though the field remains fragmented and most efforts focus on narrow aspects of teaching practice.

The Pedagogy Benchmark tests models on 920 teacher certification items drawn from Chilean national exams, spanning general pedagogical knowledge across multiple subjects and education levels, including subdomains such as teaching strategies, assessment methods, and student understanding. Frontier models reach 82 to 89 per cent accuracy on these items, suggesting strong performance on the declarative knowledge tested in certification contexts (AI-for-Education, 2025). However, certification items test recognition and recall rather than the applied reasoning teachers use daily.

MathTutorBench takes a different approach, evaluating tutoring dialogue quality across multiple dimensions including Socratic questioning and error correction. A key finding is that subject expertise does not translate directly to pedagogical skill. Models that score well on mathematics problems do not necessarily produce effective tutoring interactions (Macina et al., 2025). This disconnect between knowing content and knowing how to teach it is a recurring theme in educational AI evaluation.

The OpenLearnLM Benchmark proposes a more comprehensive framework spanning knowledge, skills, and attitudes (the KSA framework), with the goal of evaluating large language models across these three dimensions in educational contexts (Lee et al., 2026). The framework is ambitious, though implementation remains in early stages.

Most directly relevant to AlignED, Richter et al. (2025) examined whether LLMs can identify neuromyths, the persistent misconceptions about learning (such as “learning styles” or “we only use 10% of our brains”) that circulate widely among teachers. Their findings are instructive. When presented with neuromyths in isolation, LLMs outperform teachers, achieving roughly 80 per cent accuracy at identifying false claims. But when the same misconceptions are embedded in practical teaching questions, models show sycophantic behaviour, agreeing with the misconception when it is framed as the teacher's preferred approach. This gap between abstract knowledge and applied reasoning is precisely the kind of failure that matters in educational contexts. AlignED builds on this work by testing neuromyth identification across 31 models, using multiple prompting techniques including adversarial and authority framings, and examining temperature sensitivity and test-retest reliability. Where Richter et al. tested a small number of models on a single prompting condition, AlignED aims to map the landscape more systematically.
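The gap between isolated and embedded performance described above can be made concrete with a minimal sketch. It assumes each model response has already been scored as correctly rejecting the neuromyth or not. The record layout, the function names, and the toy data are hypothetical illustrations, not the scoring procedure used by Richter et al. or by AlignED.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    item_id: str         # hypothetical neuromyth identifier
    framing: str         # "isolated" or "embedded" (myth inside a teaching question)
    rejected_myth: bool  # True if the model correctly rejected the false claim


def accuracy(responses: list[Response], framing: str) -> float:
    """Share of responses under one framing that correctly reject the myth."""
    subset = [r for r in responses if r.framing == framing]
    if not subset:
        raise ValueError(f"no responses with framing {framing!r}")
    return sum(r.rejected_myth for r in subset) / len(subset)


def framing_gap(responses: list[Response]) -> float:
    """Isolated accuracy minus embedded accuracy; larger means more sycophancy."""
    return accuracy(responses, "isolated") - accuracy(responses, "embedded")


# Invented toy data for illustration only; these are not real results.
toy = [
    Response("learning_styles", "isolated", True),
    Response("learning_styles", "embedded", False),
    Response("ten_percent_brain", "isolated", True),
    Response("ten_percent_brain", "embedded", True),
]

print(f"isolated accuracy: {accuracy(toy, 'isolated'):.2f}")
print(f"embedded accuracy: {accuracy(toy, 'embedded'):.2f}")
print(f"framing gap:       {framing_gap(toy):.2f}")
```

A model with a gap near zero answers consistently regardless of framing. A large positive gap indicates that the model knows the claim is false in the abstract but defers to it when it is presented as the teacher's preference.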

Despite this growing body of work, existing benchmarks tend to operate in isolation, each testing a narrow aspect of educational knowledge or practice with its own methodology, model pool, and reporting format. This makes it difficult to build a cumulative picture of how models perform across the range of tasks that matter in education. AlignED draws on and extends these efforts by bringing multiple evaluations together within a single platform, applying a consistent validation framework across all of them, and making the results openly available and continuously updated. It does not replace existing work. It aims to provide a structured, comparable, and growing evidence base that did not previously exist.

Where the stakes are highest

Teachers are using AI now, and at scale. The OECD TALIS survey, collected in 2024 and published in 2025, provides the most comprehensive picture to date. Among lower secondary teachers across OECD countries, 41 per cent report having used AI in their professional practice (OECD, 2025). Of those, 68 per cent use it to learn about and summarise a topic, 64 per cent use it to generate lesson plans or activities, and 26 per cent use it for assessment and marking of student work. Given continued improvement in model capabilities and reduction in costs since the survey was fielded, actual adoption rates are likely higher today.
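Because the use-case percentages are shares of AI-using teachers rather than of all teachers, it can help to convert them. The sketch below does this arithmetic using only the figures quoted above. It assumes the reported percentages can be treated as exact point estimates and multiplied directly; the variable and function names are illustrative.

```python
# Figures quoted in the text (OECD, 2025), expressed as fractions.
AI_USERS = 0.41  # lower secondary teachers who report having used AI

# Shares among AI-using teachers, by use case.
USE_CASES = {
    "learn about and summarise a topic": 0.68,
    "generate lesson plans or activities": 0.64,
    "assessment and marking": 0.26,
}


def share_of_all_teachers(share_among_users: float, users: float = AI_USERS) -> float:
    """Convert a share of AI users into a share of all surveyed teachers."""
    return share_among_users * users


for use_case, share in USE_CASES.items():
    print(f"{use_case}: {share_of_all_teachers(share):.1%} of all teachers")
```

On these assumptions, roughly 28 per cent of all surveyed teachers use AI to summarise topics, about 26 per cent to generate lesson plans or activities, and about 11 per cent for assessment and marking.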

The distribution of use cases matters. Lesson preparation and assessment are core pedagogical activities. When AI contributes to these tasks, the quality of its outputs has direct consequences for student learning. And it is in these tasks that the gap between surface quality and pedagogical alignment is most difficult to detect.

AI systems can now generate detailed feedback on student essays, produce lesson plans for mixed-ability classes, and draft assessment rubrics aligned to curriculum standards. On some of these tasks, AI performance is approaching human-level quality in terms of surface features such as coherence, coverage, and specificity. But an AI system could produce feedback that is consistent, fast, and motivating, and still be fundamentally at odds with how students actually make progress. It could recommend a revision strategy that sounds reasonable but ignores what we know about cognitive load. It could grade student work reliably but against criteria that miss the qualities that matter most for learning.

The risk is not that AI will produce obviously bad outputs. Obvious failures are easy to catch. The risk is that AI will produce outputs that appear sound but subtly misdirect student learning in ways that accumulate over weeks and months. That 26 per cent of AI-using teachers, roughly 11 per cent of all lower secondary teachers surveyed, already use AI for assessment and marking makes this more than a hypothetical concern.

The evaluation gap

There is no shortage of AI evaluations. Benchmarks like MMLU, Humanity's Last Exam, and hundreds of others test models on academic knowledge, reasoning, and specialised professional tasks. But few test models on the knowledge and judgements that matter in educational settings. Without task-specific benchmarks, teachers have no evidence base for deciding which AI outputs to trust, school leaders have no framework for evaluating AI tools beyond vendor claims, policymakers have no data to inform guidance on AI use in classrooms, and AI developers have no signal about where their models fail on the tasks that matter most to education.

AlignED exists to begin closing this gap. It is a benchmark suite designed to evaluate AI models on the knowledge and applied reasoning that matter in educational practice. The current suite comprises five evaluations. These were selected as starting points because they address consequential aspects of practice and, in several cases, build directly on existing research. The suite is designed to grow. As the validation framework matures, additional evaluations will be developed to cover a broader range of educational tasks.

The next section describes how AlignED operationalises these concerns into five concrete evaluations.

References

AI-for-Education. (2025). Pedagogy Benchmark. https://github.com/AI-for-Education/pedagogy-benchmark

Dell'Acqua, F., McFowland, E., Mollick, E. R., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S., Krayer, L., Candelon, F., & Lakhani, K. R. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School Technology & Operations Management Unit Working Paper No. 24-013.

Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3), 411–437.

Kokotajlo, D., Lifland, E., Larsen, T., & Dean, R. (2025). AI 2027. AI Futures Project. https://ai-2027.com/

Lee, U., Lee, S., Choi, H., Lee, J., Park, H., Jeon, Y., Cho, S., Kang, M., Koh, J., Bae, J., Nam, M., Eun, J., Jung, Y., & Jeong, Y. (2026). OpenLearnLM Benchmark: A unified framework for evaluating knowledge, skill, and attitude in educational large language models. arXiv:2601.13882.

Macina, J., Daheim, N., Hakimi, I., Kapur, M., Gurevych, I., & Sachan, M. (2025). MathTutorBench: A benchmark for measuring open-ended pedagogical capabilities of LLM tutors. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), 204–221.

MacKenzie, D., & Wajcman, J. (Eds.). (1999). The Social Shaping of Technology (2nd ed.). Open University Press.

Narayanan, A., & Kapoor, S. (2025). AI as normal technology. Knight First Amendment Institute, Columbia University. https://knightcolumbia.org/content/ai-as-normal-technology

OECD. (2025). Results from TALIS 2024: The State of Teaching. TALIS, OECD Publishing.

Richter, E., Spitzer, M. W. H., Morgan, A., Frede, L., Weidlich, J., & Moeller, K. (2025). Large language models outperform humans in identifying neuromyths but show sycophantic behavior in applied contexts. Trends in Neuroscience and Education, 39, 100255.

Selwyn, N. (2022). The future of AI and education: Some cautionary notes. European Journal of Education, 57(4), 620–631.

Sriprakash, A., Williamson, B., Facer, K., Pykett, J., & Valladares Celis, C. (2024). Sociodigital futures of education: Reparations, sovereignty, care, and democratisation. Oxford Review of Education, 51(4), 561–578.