How evaluations work

Evaluations are administered programmatically through provider APIs, not through chat interfaces. Each model receives the same prompt for each item. Responses are collected, parsed, and scored automatically. This removes variability from manual interaction and allows testing at scale.

All evaluation parameters are documented and fixed: pinned model versions, fixed seeds where applicable, and standardised prompt templates. Every model sees exactly the same input for every item, every time.

What AlignED evaluates

AlignED is designed as an expandable benchmark suite. The current version includes five evaluations covering four areas of knowledge and reasoning that matter in educational practice: neuromyth identification, diagnostic reasoning about strategy implementation, teacher certification knowledge (reported as two separate evaluations: general pedagogy and inclusive education), and student work judgement against curriculum standards. These five evaluations are organised into four methodological sections below, since CDPK and SEND share the same administration and scoring approach. Additional evaluation modules are planned as the project develops.

The sections below describe each current evaluation in detail.

Overview and validation protocol

This is the validation framework AlignED is working towards. Not all tiers are complete for all benchmarks. Current status is noted below. The full evaluation framework is available in the Appendices.

graph LR A[Item Preparation] --> B[Model Administration] B --> C[Response Collection] C --> D[Scoring] D --> E[Tier 1: Baseline Reliability] E --> F[Tier 2: Robustness Probes] F --> G[Tier 3: Judge Validation] G --> H[Results Reported] style A fill:#EBF4FF,stroke:#3B6B9A style B fill:#EBF4FF,stroke:#3B6B9A style C fill:#EBF4FF,stroke:#3B6B9A style D fill:#EBF4FF,stroke:#3B6B9A style E fill:#FFF3E0,stroke:#D97706 style F fill:#FFF3E0,stroke:#D97706 style G fill:#FFF3E0,stroke:#D97706 style H fill:#E8F5E9,stroke:#2F855A

Tier 1: Baseline Reliability

Multiple runs at T=0 to establish stable scores. Status: Complete for neuromyths and scenarios. Partial for pedagogy and ACARA (single-run baselines reported).

Tier 2: Robustness Probes

Temperature variation (T=0, 0.5, 1.0), prompt sensitivity (four framings), and confidence calibration probes. Status: Complete for neuromyths only. Not yet run for other benchmarks.

Tier 3: Judge Validation

For LLM-judged evaluations, a sample of scores is manually verified against the rubric. Status: A sample of diagnostic reasoning scores has been manually verified. Systematic multi-judge validation is planned.

Evaluation framework

AlignED follows a structured evaluation framework covering nine components. Each benchmark is audited against this framework.

Component Status
Item preparation Complete for all benchmarks
Per-run configuration Complete for all benchmarks
Validation (Tier 1) Partial (see above)
Validation (Tier 2) Neuromyths only
Validation (Tier 3) Sample-based for diagnostic reasoning
Scoring pipeline Complete for all benchmarks
Prompt templates Published on this page (see each section below)
Shareable dataset In preparation (OSF)
Human correlation studies Not yet conducted. A priority for future research, subject to funding.

All evaluation code, prompt templates, and scoring rubrics are available on GitHub.

2.1 Neuromyth Identification

The Neuromyths Survey tests whether AI systems can correctly classify true and false claims about the brain and learning. This benchmark is adapted from the 2012 Dekker et al. study, which documented the prevalence of neuromyths among teachers in the UK and the Netherlands. Richter et al. (2025) subsequently tested a small number of LLMs on these items and found roughly 80% accuracy in isolation but sycophantic behaviour in applied contexts. AlignED extends this work by testing across a broader suite of models (31), using multiple prompting techniques (including adversarial and authority framings), and examining temperature sensitivity and test-retest reliability.

32
Total Items
15
Neuromyths
17
General Assertions
~50%
Human Teacher Baseline

Source research

Dekker, S., Lee, N. C., Howard-Jones, P., & Jolles, J. (2012). Neuromyths in education: Prevalence and predictors of misconceptions among teachers. Frontiers in Psychology, 3, 429.

This study surveyed 242 teachers across the UK and Netherlands, finding that 49% of neuromyths were endorsed by participants on average. Certain myths showed particularly high prevalence: "learning styles" (93% belief rate), "we only use 10% of our brain" (48%), and "enriched environments improve brain function" (95%).

Benchmark composition

The survey contains two item types.

Neuromyths (15 items). False statements about the brain and learning that are widely believed. The correct response is "False" for all items. Examples include "Individuals learn better when they receive information in their preferred learning style" and "We only use 10% of our brain."

General assertions (17 items). Statements about neuroscience and learning, included to prevent response bias. 15 are true (e.g. "Learning occurs through modification of the brain's neural connections") and 2 are false (Q10: "The brains of boys and girls develop at the same rate"; Q11: "Brain development has finished by the time children reach secondary school").

Administration

Each item is presented as a true/false question with standardised framing. The model receives a system prompt followed by each of the 32 statements as individual API calls.

Scoring

Responses are scored as correct (1) or incorrect (0) against the validated answer key from Dekker et al. (2012). The Neuromyths Survey score is the percentage of correct responses across all 32 items.

Item Type Correct Answer Count
Neuromyths FALSE (Incorrect) 15
General assertions (true) TRUE (Correct) 15
General assertions (false) FALSE (Incorrect) 2

Reliability

Each model is evaluated with 5 iterations at T=0 to establish baseline reliability. The reported score is the mean across iterations.

Prompt template

This is the system prompt used for all neuromyth identification evaluations. Each of the 32 items is presented as a separate API call with this system prompt followed by the statement text.

You are completing a survey about brain function and learning. For each statement, respond with only one word: "Correct" if the statement is true, or "Incorrect" if the statement is false. Do not provide explanations.

No additional instructions are provided to the model. The item statement is sent as the user message.

2.2 Diagnostic Reasoning

Evidence-based teaching strategies sometimes fail in practice, not because the research is wrong, but because implementation conditions matter. The Diagnostic Reasoning benchmark tests whether AI models can diagnose why a proven strategy is not working in a specific context.

12
Scenarios
0-3
Points Per Item
36
Maximum Score

Rationale

Teachers and education policymakers often encounter situations where recommended practices do not produce expected results. The ability to diagnose implementation failures requires understanding the mechanism behind a strategy (not just that it "works"), recognising specific conditions that enable or inhibit effectiveness, and distinguishing between strategy failure and implementation failure.

This benchmark tests whether AI can demonstrate this diagnostic reasoning, rather than simply recommending "evidence-based" approaches.

Scenario structure

Each scenario presents a classroom context with specific constraints, a teacher implementing an evidence-based strategy, and observed outcomes that fall short of expectations. The model must explain why the strategy is not working in this specific context.

The 12 scenarios cover retrieval practice, interleaving, worked examples, spaced practice, formative assessment, cognitive load, direct instruction, critical thinking, feedback timing, scaffolding, rewards (overjustification), and personalisation.

Scoring rubric

Responses are scored on a 0-3 scale based on diagnostic accuracy.

Score Criteria Example
0 Misdiagnosis: Concludes the strategy is fundamentally flawed or suggests abandoning it "Retrieval practice may not be suitable for science content"
1 Generic: Correct direction but no specific mechanism identified "The implementation could be improved"
2 Partial: Identifies the core issue but misses important nuances "Testing before teaching violates retrieval practice principles"
3 Full: Accurate diagnosis with mechanism and terminology "Two critical issues: (1) testing before encoding violates the retrieval-must-follow-encoding sequence, (2) moving on without feedback eliminates the error correction mechanism that makes retrieval practice effective"

Scoring process

Responses are evaluated by an LLM judge (Claude 4.5 Sonnet) using the structured rubric above. An earlier scoring run using Claude 4.5 Haiku was found to be too lenient, accepting generic responses without requiring the technical specificity demanded by the rubric. Scores were re-graded with Claude 4.5 Sonnet, which produced substantially different results for some models. Of the 360 total scores, 356 use the Sonnet judge; 4 use Haiku as fallback where Sonnet responses were unparseable.

There are several ways the judge validation could be strengthened: using more capable models as they become available, using multiple judges per response, drawing judges from different model families to reduce shared biases, and having human experts score a subset of responses to establish human-LLM correlation. Each of these steps would strengthen the validity argument but also increases cost substantially, particularly human scoring. A single scoring run across 30 models and 12 scenarios is relatively inexpensive; multi-judge validation with three judges from different providers multiplies that cost, and human expert panels more so. Finding the right balance between validation rigour and practical sustainability is an open question, particularly for a benchmark suite designed to be updated continuously.

Prompt template

This is the system prompt used for all diagnostic reasoning evaluations. Each scenario is sent as the user message following this system prompt.

A teacher will describe a classroom situation where they tried a teaching strategy and it did not produce the results they expected.

Your task is to analyse what happened. Consider whether the issue lies with the implementation, the context, the strategy itself, or some combination.

Keep your response to 3-5 sentences.

No role assignment, no framing instructions, and no hints about the expected direction of diagnosis are provided. The scenario text is sent as the user message. This unbiased prompt design is deliberate: the benchmark tests whether the model can independently identify implementation issues without being told to look for them.

2.3 Teacher Certification Knowledge

The Teacher Certification Knowledge benchmark evaluates foundational teaching knowledge using items from validated teacher certification assessments. Items are drawn from Chilean national teacher evaluations and cover two components: Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND).

1,143
Total Items
920
CDPK Items
223
SEND Items

CDPK: Cross-Domain Pedagogical Knowledge

CDPK items test general pedagogical knowledge that applies across subject areas: curriculum design and planning, assessment and evaluation strategies, classroom management principles, learning theory and development, instructional strategies, and professional responsibilities. These items are drawn from the "Disciplinary and Curricular Knowledge" component of Chile's national teacher evaluation system and translated into English. While pedagogical principles are broadly universal, some items may reflect Chilean educational policy or curricular context (see Limitations).

SEND: Special Education Needs and Disability

SEND items focus on inclusive education and supporting diverse learners: identification of learning difficulties, differentiation strategies, accommodations and modifications, inclusive classroom practices, legal and ethical frameworks, and collaboration with specialists and families. This component is relevant for AI systems that provide educational advice affecting diverse student populations.

Item format

All items are multiple-choice with four options (A, B, C, D). Each item presents a pedagogical scenario or knowledge prompt followed by four answer options. The model must select the best answer.

Scoring

Items are scored as correct (1) or incorrect (0). CDPK and SEND scores are calculated as the percentage correct across their respective item pools. Items are sourced from the pedagogy-benchmark dataset on HuggingFace.

Prompt template

Each item is presented as a single user message with no system prompt. The model is asked to respond with the letter of the correct answer only.

Teacher certification exam - respond with letter only.

[Question stem]

A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]

Your answer (A/B/C/D):

No additional instructions are provided.

2.4 Student Work Judgement (ACARA)

The ACARA benchmark tests whether AI systems can accurately compare pairs of student work samples against Australian Curriculum achievement standards. Unlike the other AlignED benchmarks, which test knowledge recall and reasoning, this benchmark tests applied assessment judgement: determining which of two student work samples better meets a given curriculum standard.

79
Verified Pairs
237
Evaluations per Model
12
Models Evaluated
3
Trials per Pair

Method

Each of the 79 pairs is evaluated across 3 independent trials, yielding 237 pair-level evaluations per model (79 × 3). Within each trial, pairs are presented in both forward (Sample A vs Sample B) and reverse (Sample B vs Sample A) orientations. This position-swap design tests whether the model's judgement is influenced by presentation order and provides a built-in measure of position-swap consistency.

The benchmark also includes a standards-based grading (SG) pilot, where each work sample is presented individually for absolute classification against the achievement standard (Above Satisfactory, Satisfactory, or Below Satisfactory). The SG pilot covers 204 individual work samples drawn from 68 unique assessment tasks across three subjects (English, Mathematics, and Science). The SG pilot has been run on 7 models.

Scoring

Two metrics are reported for the comparative judgement task.

Metric Description Range
Accuracy Percentage of pairs where the model chose the correct (higher-achieving) work sample 0–100%
Position-swap consistency Percentage of pairs where the model gave the same answer regardless of presentation order (forward vs reverse) 0–100%

High accuracy with low position-swap consistency suggests the model is getting answers right by chance in some orientations. High consistency with low accuracy would indicate a systematic but incorrect judgement strategy. The ideal is high scores on both metrics.

Excluded models

Three models were excluded from the ACARA results due to invalid response formats (0% accuracy from inability to produce valid judgements): DeepSeek R1, Gemini 3 Pro, and GPT-5 Mini.

Comparative judgement prompt template

Each pair is presented as a single user message with no system prompt. Template variables are filled from the ACARA work sample database.

You are an expert educational assessor evaluating student work against the Australian Curriculum achievement standard for {year_level} English.

### Achievement Standard
{achievement_standard}

### Task
{task_description}

Compare the following two student work samples and determine which one better demonstrates the achievement standard.

### Sample A
{sample_a}

### Sample B
{sample_b}

Which sample better demonstrates the achievement standard?
Respond with ONLY one of these two options:
BETTER_SAMPLE: A
or
BETTER_SAMPLE: B
Standards-based grading prompt template

Each work sample is presented individually for absolute classification against the achievement standard.

You are an experienced teacher assessing student work against the Australian
Curriculum achievement standard.

## Subject and Year Level
{subject} — {year_level}

## Achievement Standard
{achievement_standard}

## Task
{task_description}

## Student Work
{student_work_content}

## Your Task

The achievement standard above describes what a student working AT the
expected level should demonstrate. Based on the evidence in this student's
work, classify it as:

- ABOVE_SATISFACTORY: Consistently goes beyond what the standard describes.
- SATISFACTORY: Demonstrates what the standard describes.
- BELOW_SATISFACTORY: Does not yet demonstrate what the standard describes.

Respond in exactly this format and nothing else:

CLASSIFICATION: [ABOVE_SATISFACTORY / SATISFACTORY / BELOW_SATISFACTORY]
REASONING: [one sentence]

No additional instructions are provided to the model for either prompt.

2.5 Robustness and calibration probes (neuromyths)

Accuracy on a benchmark tells only part of the story. A model scoring 90% under controlled conditions may behave very differently in real-world deployment. AlignED evaluates multiple dimensions to reveal whether correct responses reflect stable knowledge or surface-level pattern matching. These probes are currently complete for neuromyth identification only.

Temperature robustness

Temperature controls the randomness of model outputs. Higher temperatures increase variability. We test at three settings: T=0 (deterministic, baseline performance), T=0.5 (moderate variation, practical deployment settings), and T=1.0 (high variation, stress testing knowledge stability).

The Knowledge Robustness Index (KRI) measures how stable performance remains as temperature increases:

KRI = min(Accuracy_T0.5, Accuracy_T1.0) / Accuracy_T0

A KRI of 1.0 indicates perfect robustness (no degradation). Lower values indicate that correct answers at T=0 may be fragile. KRI is an exploratory metric developed for this project. It attempts to capture temperature robustness as a single reportable number and should be treated as a pilot measure.

Key finding: Most evaluated models show exceptional temperature robustness on this task, with average accuracy variation of only 0.6% across temperature settings. Neuromyth classification accuracy appears relatively insensitive to temperature for most models.

Prompt sensitivity

Does rephrasing a question change the answer? As a robustness probe on a subset of 4 models (Claude 4.5 Sonnet, Claude 4.5 Opus, GPT-5, GPT-4o), we tested each item with four prompt framings: standard (neutral, direct question), interrogative (question form emphasising inquiry), embedded (presented as common belief), and adversarial (presented with apparent authority). The main neuromyth scores reported in Results use the standard framing only.

The Prompt Sensitivity Index (PSI) measures the proportion of items where different framings produce different answers:

PSI = (Items with inconsistent responses) / (Total items)

A PSI of 0% indicates perfect consistency. Higher values suggest the model is influenced by framing rather than content. PSI is an exploratory metric developed for this project. It attempts to capture prompt sensitivity as a single reportable number and should be treated as a pilot measure.

Key finding: Average PSI across models is 7.1%, indicating relatively low prompt sensitivity. However, adversarial framings (presenting myths as "research-backed") do increase error rates for some models.

Confidence calibration

When models are wrong, do they know it? We assess confidence calibration using a subset of 8 high-prevalence neuromyth items, asking models to rate their confidence as "Very confident," "Somewhat confident," or "Uncertain."

Key finding: Across all confidence probes administered (8 items per model), no model selected "Uncertain" on any item. Even when answering incorrectly, models expressed "Very confident" or "Somewhat confident." The sample is small, but the pattern is consistent across all models tested. This is particularly problematic for educational applications where appropriate epistemic humility is valuable.

Token efficiency

How much reasoning does a model require to reach correct answers? We measure output tokens (length of model response) and performance-to-tokens ratio (accuracy relative to reasoning length). For practical deployment, efficiency has implications for cost, latency, and user experience. Models that achieve high accuracy with efficient responses may be preferable for real-time educational applications.

Confidence probe prompt

After the model answers each item, a follow-up message is sent to assess confidence calibration:

How confident are you in this answer? Respond with only one of: "Very confident" / "Somewhat confident" / "Uncertain"

This probe is applied to a subset of 8 high-prevalence neuromyth items. No additional context is provided.

2.6 Scoring and Reporting

Each benchmark uses a different scoring approach, reflecting what it tests.

Benchmark Scoring Method Scale
Neuromyth Identification Binary correct/incorrect against a validated answer key (Dekker et al., 2012) 0–100%
Diagnostic Reasoning 0–3 rubric per scenario, scored by an LLM judge (Claude 4.5 Sonnet) 0–36 (raw) or 0–100%
General Pedagogical Knowledge (CDPK) Multiple-choice items from standardised teacher certification exams 0–100%
Inclusive Education (SEND) Multiple-choice items from standardised teacher certification exams 0–100%
ACARA Comparative Judgement Pairwise comparison accuracy plus position-swap consistency across presentation orders 0–100% (each metric)
ACARA Standards-Based Grading Three-category classification (Above Satisfactory / Satisfactory / Below Satisfactory). Raw accuracy and Cohen's kappa (chance-adjusted agreement) reported. 0–100% accuracy; κ = −1 to 1

Normalising these to a common scale would obscure important differences in what each score means. A 90% on neuromyth identification (binary classification of 32 items) is not comparable to a 90% on diagnostic reasoning (rubric-scored open responses judged by an LLM).

Model pools

Not all models appear in every benchmark. The model count varies because benchmarks were added at different times and some models cannot produce valid responses for certain task formats:

  • Neuromyth Identification: 31 models
  • Diagnostic Reasoning: 30 models
  • General Pedagogical Knowledge (CDPK): 23 models
  • Inclusive Education (SEND): 23 models
  • ACARA Comparative Judgement: 12 models
  • ACARA Standards-Based Grading (pilot): 7 models

Why results are reported separately

Performance on one benchmark does not predict performance on another. A model that scores well on pedagogical knowledge does not necessarily score well on neuromyth identification or student work judgement. For example:

  • Gemini 2.5 Pro leads on pedagogical knowledge (88.5%) but scores 75.9% on neuromyth identification.
  • ACARA comparative judgement rankings do not predict performance on any knowledge benchmark.
  • The standards-based grading pilot produced near-chance accuracy from models that score above 80% on other tasks.

Each evaluation is reported on its own terms. This lets users compare models on whatever capability matters most for their use case.

Models evaluated

32 models from five providers have been tested across one or more benchmarks. Each benchmark has its own model pool (7 to 31 models).

Anthropic

Claude 3 through Claude 4.5 family, including extended thinking variants

OpenAI

GPT-4 Turbo, GPT-4o, GPT-5 family, o3 and o4-mini

Google

Gemini 2.0 Flash, 2.5 Flash, 2.5 Pro, 3 Flash

Meta

Llama 3.1 8B/70B/405B, 3.3 70B

DeepSeek

DeepSeek V3, V3.1, R1

Reproducibility

All evaluation parameters are documented and fixed:

  • Pinned model versions (specific API model strings)
  • Fixed random seeds where applicable
  • Standardised prompt templates (published above for each benchmark)
  • Documented scoring rubrics with examples

A note on prompt design

Different benchmarks use different prompt structures. Neuromyth identification uses a system prompt that frames the task as a survey. Diagnostic reasoning uses a system prompt that describes the task without hinting at the expected direction. Teacher certification items use no system prompt, presenting each question directly. ACARA prompts assign an assessor role. These differences reflect the design of each benchmark rather than a unified prompting strategy. Readers should be aware that prompt structure can influence model behaviour, and performance differences across benchmarks may partly reflect prompting choices as well as underlying capability.

Raw data and scoring details are available through Data Access.