The knowledge-application gap

The central finding of this pilot is a gap between what models know and what they do. All six models retrieved CLT-related knowledge in their Condition B reasoning traces. The closed-source models (Claude, both Geminis) explicitly named fading or the worked example effect but did not implement it. The open-weight models (DeepSeek R1, Qwen3-235B) engaged with CLT load types without reaching the specific principle of fading. None of the six applied fading in their output.

This is not an absence of knowledge. It is a failure to select and correctly apply the right principle from a set of known principles. When the prompt said "apply cognitive load theory," all models activated CLT-related knowledge, but defaulted to:

  • Clear formatting (extraneous load reduction)
  • Consistent structure (schema building)
  • Complete solutions (worked example effect)
  • Difficulty progression (element interactivity)

Fading did not make the cut. One plausible explanation, which this pilot cannot test, is that training data associate CLT more often with presentation clarity and complete worked examples than with scaffolding removal.

What "apply CLT" activates

The Condition B results reveal what these models treat as the default interpretation of "apply cognitive load theory." All six prioritised reducing extraneous load (formatting, annotations, consistency) and providing complete worked examples. These are valid CLT applications. But they are incomplete. A CLT-informed worked example sequence should include fading. The models treated CLT as a presentation framework rather than a design framework that governs the structure of the sequence itself.

For teachers, this means that prompting "use cognitive load theory" is likely to produce better-formatted examples, not pedagogically restructured sequences. The prompt activates surface features of CLT, not its deeper design implications.

The difficulty-fading confusion

All six models, when given no pedagogical instruction, equated "progression" with "harder equations." Gemini 3.1 Pro in Condition B went further, explicitly equating "fading" with "increasing complexity." This is the most common near-miss in the dataset.

The confusion is understandable. In everyday teaching language, "scaffold" can mean either "support a student through a difficult task" or "provide a structure that is gradually removed." The first sense is about difficulty; the second is about transfer of responsibility. Fading is about the second sense. The models default to the first.
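The distinction can be made concrete with a small sketch. The Python below is illustrative only and is not the pilot's scoring code: the class and function names, the three-step solution format, and the four example equations are all assumptions introduced here. It builds two sequences from the same equations. In one, every example is fully worked (progression by difficulty alone). In the other, later examples show fewer solution steps, removing the last step first, a variant sometimes called backward fading.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class WorkedExample:
    equation: str
    steps: list[str]  # full solution, in order
    shown: int        # how many steps the learner sees; the rest are blanks

    def render(self) -> str:
        lines = [f"Solve: {self.equation}"]
        for i, step in enumerate(self.steps):
            text = step if i < self.shown else "______ (your turn)"
            lines.append(f"  {i + 1}. {text}")
        return "\n".join(lines)


def solve_two_step(a: int, b: int, c: int) -> list[str]:
    """Steps for a*x + b = c. Assumes a != 0, b > 0, and (c - b) divisible by a."""
    rhs = c - b
    x = rhs // a
    return [
        f"Subtract {b} from both sides: {a}x = {rhs}",
        f"Divide both sides by {a}: x = {x}",
        f"Check: {a}*{x} + {b} = {c}",
    ]


def complete_sequence(problems: list[tuple[int, int, int]]) -> list[WorkedExample]:
    """Every example fully worked: difficulty may vary, scaffolding does not."""
    out = []
    for a, b, c in problems:
        steps = solve_two_step(a, b, c)
        out.append(WorkedExample(f"{a}x + {b} = {c}", steps, shown=len(steps)))
    return out


def faded_sequence(problems: list[tuple[int, int, int]]) -> list[WorkedExample]:
    """Each later example hides one more step, starting from the last step."""
    out = []
    for i, (a, b, c) in enumerate(problems):
        steps = solve_two_step(a, b, c)
        out.append(WorkedExample(f"{a}x + {b} = {c}", steps, shown=max(len(steps) - i, 0)))
    return out


def is_faded(sequence: list[WorkedExample]) -> bool:
    """True if shown steps never increase and the last example shows fewer than the first."""
    counts = [ex.shown for ex in sequence]
    non_increasing = all(x >= y for x, y in zip(counts, counts[1:]))
    return non_increasing and counts[-1] < counts[0]


# Hypothetical equations, chosen only so the arithmetic is clean.
PROBLEMS = [(2, 3, 11), (3, 4, 19), (5, 2, 27), (4, 7, 31)]

if __name__ == "__main__":
    for example in faded_sequence(PROBLEMS):
        print(example.render(), end="\n\n")
    print("complete sequence faded?", is_faded(complete_sequence(PROBLEMS)))
    print("faded sequence faded?", is_faded(faded_sequence(PROBLEMS)))
```

The two sequences use identical equations, so the only thing that differs is how much of the solution the learner must supply. That is the second sense of "scaffold" described above, and it is independent of how hard the equations are.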

Implications for teachers

Three practical implications emerge from this pilot, stated cautiously given the sample size:

  1. Do not assume AI applies best practice without specific instruction. If a teacher asks for worked examples without specifying pedagogical principles, the output is likely to be four complete examples that vary in difficulty but not in scaffolding structure.
  2. "Apply CLT" is not enough. The prompt activates CLT vocabulary but does not reliably activate the specific principle (fading) that governs how worked example sequences should be structured.
  3. Specific prompts produced correct outputs. When the prompt named and defined fading, all six models implemented it correctly. In this pilot, the more specific the instruction, the better the pedagogical quality of the output.

The teacher's role, then, is not eliminated by the model. It is shifted. The teacher needs to know what good instructional design looks like in order to prompt for it specifically, or to recognise when it is missing from the output.

Limitations

These limitations constrain what can be claimed from this pilot. They are not caveats to be skimmed. They define the boundaries of the findings.

  • Six models, one task. The knowledge-application gap was observed across all six models, which strengthens the finding relative to the original two-model pilot. However, all six were tested on the same task (two-step linear equations). The pattern may not generalise to other educational tasks.
  • One run per condition. Claude's thinking mode blocks temperature control; Gemini models were run at default temperature. Without multiple runs, we cannot distinguish a stable model behaviour from a single-sample artefact.
  • One CLT principle. Fading is one of many CLT principles. The knowledge-application gap observed here may or may not generalise to other principles (e.g., the expertise reversal effect, the redundancy principle).
  • No human baseline. We do not know what percentage of human teachers would apply fading when asked to create worked examples. Without this comparison, we cannot say whether the models are better or worse than typical human practice.
  • Pilot, not confirmatory. This study tested a methodology. The findings are descriptive and exploratory. No inferential statistics are appropriate for n = 18 (six models, three conditions, one run each).

What a larger study should test

If the methodology proves useful, a larger study could address the pilot's limitations:

  • More runs per model: 3–5 runs per condition per model to assess consistency, now that six models have shown the same pattern.
  • More CLT principles: Test fading alongside the expertise reversal effect, the redundancy principle, and the split-attention effect. Does the gap appear for all principles or only some?
  • Human baseline: Administer the same prompts to qualified teachers and instructional designers. How often do humans apply fading without specific instruction?
  • Cross-model scoring with reliability metrics: Formal inter-rater reliability computation across multiple independent LLM judges.
  • Multi-turn prompting: Test whether models apply fading when given follow-up prompts (e.g., "Can you restructure these to gradually remove scaffolding?").
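
For the inter-rater reliability item above, one standard statistic for two judges assigning categorical labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal illustration, not the pilot's analysis: the label names and the two label lists are made up for demonstration and are not results from this study. With more than two judges, a multi-rater statistic such as Fleiss' kappa would be the usual choice.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n

    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Counter returns 0 for labels that one rater never used.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for six outputs from two hypothetical LLM judges.
# These are NOT data from the pilot.
JUDGE_A = ["faded", "not_faded", "not_faded", "faded", "not_faded", "not_faded"]
JUDGE_B = ["faded", "not_faded", "faded", "faded", "not_faded", "not_faded"]

if __name__ == "__main__":
    print(f"kappa = {cohens_kappa(JUDGE_A, JUDGE_B):.3f}")
```

In this made-up example the judges agree on five of six items, but because chance agreement is 0.5 the kappa is about 0.67 rather than 0.83. With only a handful of items per condition the estimate is unstable, which is one more reason to increase the number of runs before reporting reliability.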