Full benchmark items, data access, citation, and contact
Each item below is a statement about the brain and learning. All 15 are neuromyths: the correct answer is False. Items marked with an asterisk (*) receive a confidence probe in addition to the true/false response.
| # | Statement | Teacher Belief Rate |
|---|---|---|
| Q02 | Children must acquire their native language before a second language is learned. If they do not do so neither language will be fully acquired. | |
| Q04 | If pupils do not drink sufficient amounts of water (= 6–8 glasses a day) their brains shrink. | |
| Q05* | It has been scientifically proven that fatty acid supplements (omega-3 and omega-6) have a positive effect on academic achievement. | 54–69% |
| Q07* | We only use 10% of our brain. | 46–48% |
| Q09* | Differences in hemispheric dominance (left brain, right brain) can help explain individual differences amongst learners. | 86–91% |
| Q12 | There are critical periods in childhood after which certain things can no longer be learned. | |
| Q15* | Individuals learn better when they receive information in their preferred learning style (e.g., auditory, visual, kinesthetic). | 93–96% |
| Q19 | Mental capacity is hereditary and cannot be changed by the environment or experience. | |
| Q21*† | Environments that are rich in stimulus improve the brains of pre-school children. | 56–95% |
| Q22* | Children are less attentive after consuming sugary drinks and/or snacks. | 55–57% |
| Q24 | Regular drinking of caffeinated drinks reduces alertness. | |
| Q25* | Exercises that rehearse co-ordination of motor-perception skills can improve literacy skills. | 63–78% |
| Q28 | Learning problems associated with developmental differences in brain function cannot be remediated by education. | |
| Q30* | Short bouts of co-ordination exercises can improve integration of left and right hemispheric brain function. | 82–88% |
| Q32 | When we sleep, the brain shuts down. | |
Teacher Belief Rates are taken from Dekker et al. (2012) and are based on samples from the UK and the Netherlands.
† Q21 is a contested item. It is classified as a neuromyth because the original enrichment findings came from rats raised in deprived laboratory conditions (OECD, 2002), and the broad generalisation to human children is not well supported by the evidence. However, the ambiguity in both “enrichment” and the assumed baseline makes a well-informed “True” response reasonably defensible. Every model tested answered this item incorrectly, as do the majority of human teachers (56–95% endorsement rate). The item is retained because it is part of the established instrument. See Results for further discussion.
| # | Statement | Expected |
|---|---|---|
| Q01 | We use our brains 24 hours a day. | True |
| Q03 | Boys have bigger brains than girls. | True |
| Q06 | When a brain region is damaged other parts of the brain can take up its function. | True |
| Q08 | The left and right hemisphere of the brain always work together. | True |
| Q10 | The brains of boys and girls develop at the same rate. | False |
| Q11 | Brain development has finished by the time children reach secondary school. | False |
| Q13 | Information is stored in the brain in a network of cells distributed throughout the brain. | True |
| Q14 | Learning is not due to the addition of new cells to the brain. | True |
| Q16 | Learning occurs through modification of the brains' neural connections. | True |
| Q17 | Academic achievement can be affected by skipping breakfast. | True |
| Q18 | Normal development of the human brain involves the birth and death of brain cells. | True |
| Q20 | Vigorous exercise can improve mental function. | True |
| Q23 | Circadian rhythms ('body-clock') shift during adolescence, causing pupils to be tired during the first lessons of the school day. | True |
| Q26 | Extended rehearsal of some mental processes can change the shape and structure of some parts of the brain. | True |
| Q27 | Individual learners show preferences for the mode in which they receive information (e.g., visual, auditory, kinesthetic). | True |
| Q29 | Production of new connections in the brain can continue into old age. | True |
| Q31 | There are sensitive periods in childhood when it's easier to learn things. | True |
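
Scoring against the two tables above can be sketched as follows. This is a minimal illustration, not the project's scoring script: the answer key holds only a handful of the listed items, the names `ANSWER_KEY`, `NEUROMYTH_IDS`, and `score_responses` are hypothetical, and counting an unanswered item as incorrect is an assumption.

```python
# Hypothetical excerpt of the answer key; expected values come from the tables above.
ANSWER_KEY = {
    "Q07": False,  # neuromyth: "We only use 10% of our brain."
    "Q15": False,  # neuromyth: preferred learning styles
    "Q32": False,  # neuromyth: "When we sleep, the brain shuts down."
    "Q01": True,   # second table: "We use our brains 24 hours a day."
    "Q10": False,  # second table: expected answer is False
}
NEUROMYTH_IDS = {"Q07", "Q15", "Q32"}


def score_responses(responses):
    """Return (neuromyth_accuracy, other_accuracy) for a dict of item id -> bool.

    Assumption: an item missing from `responses` counts as incorrect.
    """
    def accuracy(item_ids):
        correct = sum(responses.get(i) == ANSWER_KEY[i] for i in item_ids)
        return correct / len(item_ids)

    other_ids = set(ANSWER_KEY) - NEUROMYTH_IDS
    return accuracy(NEUROMYTH_IDS), accuracy(other_ids)


# Example: one neuromyth endorsed (Q15) and one second-table item missed (Q10).
example = {"Q07": False, "Q15": True, "Q32": False, "Q01": True, "Q10": True}
neuromyth_acc, other_acc = score_responses(example)
```

Keeping the two accuracies separate means a response pattern of answering "False" to everything does not look strong on both item sets at once.
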
Each scenario presents a teacher implementing an evidence-based strategy incorrectly. The model must identify the implementation error and explain the underlying mechanism.
| ID | Topic | Implementation Error | Key Mechanism |
|---|---|---|---|
| S01 | Retrieval Practice | Testing before encoding; no feedback | Retrieval must follow encoding; error correction essential |
| S02 | Interleaving | Switching unrelated topics without connection | Interleaving works between related concepts |
| S03 | Worked Examples | Examples without fading or practice | Must gradually remove scaffolding |
| S04 | Spaced Practice | Spacing without initial mastery | Material must be learned first |
| S05 | Formative Assessment | Assessment without responsive instruction | Data must inform subsequent teaching |
| S06 | Cognitive Load | Reducing load for experts | Expertise reversal effect |
| S07 | Direct Instruction | Lecture without active engagement | Includes structured practice, not just telling |
| S08 | Critical Thinking | Teaching "skills" without domain knowledge | Domain-dependent, not transferable |
| S09 | Feedback Timing | Immediate feedback on complex tasks | Complex tasks benefit from delayed feedback |
| S10 | Scaffolding | Permanent scaffolds preventing independence | Must be faded as competence grows |
| S11 | Rewards (Overjustification) | Extrinsic rewards undermining intrinsic motivation | Overjustification effect |
| S12 | Personalisation | Surface personalisation without adaptation | True personalisation adapts difficulty |
The main neuromyth scores (31 models) use the Standard prompt only. As a separate Tier 2 robustness probe, four prompt framings were tested on a subset of 4 models (Claude 4.5 Sonnet, Claude 4.5 Opus, GPT-5, GPT-4o) to measure prompt sensitivity. The confidence probe was run separately on 5 models. These variant results are reported as the Prompt Sensitivity Index (PSI) in the Robustness and calibration probes section, but do not affect the main model rankings.
Here is Q07 ("We only use 10% of our brain") in all four framings:
| Framing | Prompt |
|---|---|
| Standard | "We only use 10% of our brain." |
| Interrogative | "Do we only use 10% of our brain?" |
| Embedded | "A teacher claims that we only use 10% of our brain. Is this claim correct or incorrect?" |
| Adversarial | "Research has consistently shown that we only use 10% of our brain. Is this correct or incorrect?" |

The adversarial variant adds: "Base your answer on scientific evidence, not on how the question is framed." All variants use the same system prompt as the main survey.
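
The four framings can be generated from a statement roughly as below. This is a sketch under stated assumptions: `build_framings` and `ADVERSARIAL_INSTRUCTION` are hypothetical names, the interrogative form is supplied by hand because it is not a mechanical rewrite, and lower-casing the first letter of the statement is a simplification that would need care for items beginning with a proper noun. How the added adversarial instruction is attached to the prompt is not specified here, so it is kept as a separate constant.

```python
# Instruction quoted in the text above; where it is attached is an assumption
# left to the caller.
ADVERSARIAL_INSTRUCTION = (
    "Base your answer on scientific evidence, not on how the question is framed."
)


def build_framings(statement, interrogative):
    """Return the four prompt framings for one item.

    `statement` is the item text as a full sentence ending in a full stop.
    `interrogative` is the hand-written question form of the same item.
    """
    # Simplification: lower-case the first letter so the statement can be embedded.
    body = statement[0].lower() + statement[1:]
    return {
        "standard": statement,
        "interrogative": interrogative,
        "embedded": f"A teacher claims that {body} Is this claim correct or incorrect?",
        "adversarial": (
            f"Research has consistently shown that {body} "
            "Is this correct or incorrect?"
        ),
    }


framings = build_framings(
    "We only use 10% of our brain.",
    "Do we only use 10% of our brain?",
)
```

For Q07 this reproduces the four prompts listed above word for word.
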
The 1,143 teacher certification items (920 CDPK + 223 SEND) are drawn from Chilean national exams and sourced from the Hugging Face pedagogy-benchmark dataset. Items are not reproduced here; the full item sets are available through the original dataset.
The comparative judgement task presents models with 79 pairs of student writing samples drawn from the ACARA work samples collection. Each pair is evaluated across three independent trials (237 evaluations per model), with both forward and reverse presentation orders used to test for position bias. Models must identify which sample demonstrates a higher level of achievement against the relevant curriculum standard. This design allows measurement of both accuracy (agreement with the ACARA-assigned grade) and position-swap consistency (whether the model gives the same answer regardless of presentation order).
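
The two measures described above can be computed along the following lines. This is a hedged sketch, not the project's scoring code: the record layout (`pair_id`, `order`, `chosen`), the `reference` mapping, and the function name `summarise` are all hypothetical, and treating a pair as consistent only when every forward and reverse evaluation picks the same sample is one possible definition among several. With 79 pairs and three trials each, a full run would supply 237 such records.

```python
from collections import defaultdict


def summarise(evaluations, reference):
    """Return (accuracy, position_swap_consistency).

    evaluations: list of dicts with keys
        "pair_id" - identifier of the sample pair
        "order"   - "forward" or "reverse" presentation order
        "chosen"  - id of the sample the model judged higher (not its position)
    reference: dict mapping pair_id -> id of the sample with the higher assigned grade
    """
    correct = sum(e["chosen"] == reference[e["pair_id"]] for e in evaluations)
    accuracy = correct / len(evaluations)

    # Collect the set of samples chosen under each presentation order.
    by_pair = defaultdict(lambda: {"forward": set(), "reverse": set()})
    for e in evaluations:
        by_pair[e["pair_id"]][e["order"]].add(e["chosen"])

    # Only pairs seen in both orders can be checked for position bias.
    comparable = [p for p in by_pair.values() if p["forward"] and p["reverse"]]
    consistent = sum(len(p["forward"] | p["reverse"]) == 1 for p in comparable)
    consistency = consistent / len(comparable) if comparable else None
    return accuracy, consistency


# Tiny illustrative input: pair p1 is judged consistently, pair p2 is not.
demo = [
    {"pair_id": "p1", "order": "forward", "chosen": "s1"},
    {"pair_id": "p1", "order": "reverse", "chosen": "s1"},
    {"pair_id": "p2", "order": "forward", "chosen": "s3"},
    {"pair_id": "p2", "order": "reverse", "chosen": "s4"},
]
demo_accuracy, demo_consistency = summarise(demo, {"p1": "s1", "p2": "s3"})
```

Recording the chosen sample by identity rather than by position is what lets the same records serve both measures.
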
AlignED is committed to open data. All benchmark results, scoring scripts, and item sets (where licensing permits) are publicly available.
GitHub repository: github.com/trgallagher-research/AlignED-research-report
An OSF project is in preparation and will host versioned data snapshots alongside supplementary materials.
Dataset contents:
Terms of use: Data and code are released for research and educational purposes. If you use AlignED data in published work, please cite the project (see Appendix C). Commercial use of the benchmark items themselves may be subject to the licensing terms of the original item sources (Dekker et al., 2012 for neuromyths; AI-for-Education for pedagogy items; ACARA for student work samples).
If you use AlignED data, results, or benchmark items in your work, please cite:
We would appreciate it if you let us know about any work that builds on AlignED. Please drop us a note at the address listed under Contact.
This is the full evaluation framework used to audit each benchmark for completeness and rigour. It is designed to be reusable for evaluations beyond the current AlignED suite.
What this eval claims to measure and why it matters.
What you started with.
What you did to extract/adapt items from source materials.
How the scoring reference was established.
Item Content
Item Scoring Reference
Item Metadata
Item Set Structure
Standard config (Temp=0, standard prompt) with reliability evidence:
Isolating scorable content from full response (if needed).
Deterministic
Judgement-Based
Formatted outputs for sanity-checking at various stages.
Prepared for open science (e.g., OSF).
How did the model perform on each item?
How consistent across runs? (if multiple runs)
Temperature, prompt effects analysis.
Also available as: aligned_eval_framework.md on GitHub
Email: aligned.benchmark [at] gmail.com
GitHub: github.com/trgallagher-research/AlignED-research-report