4. Discussion
What the pilot found, what it means for teachers, and what comes next
The knowledge-application gap
The central finding of this pilot is a gap between what models know and what they do. Both Claude Opus 4.6 and Gemini 3.1 Pro demonstrated knowledge of fading in their Condition B reasoning traces. Claude listed it as a known concept and deferred it. Gemini named it and redefined it as difficulty progression. Neither model applied it in its output.
This is not an absence of knowledge. It is a failure to select and correctly apply the right principle from a set of known principles. When the prompt said "apply cognitive load theory," both models activated CLT-related knowledge, but defaulted to:
- Clear formatting (extraneous load reduction)
- Consistent structure (schema building)
- Complete solutions (worked example effect)
- Difficulty progression (element interactivity)
Fading did not make the cut. One possible explanation, which this pilot cannot test, is that in training data CLT is most frequently associated with presentation clarity and complete worked examples, not with scaffolding removal.
What "apply CLT" activates
The Condition B results reveal what these models treat as the default interpretation of "apply cognitive load theory." Both prioritised reducing extraneous load (formatting, annotations, consistency) and providing complete worked examples. These are valid CLT applications. But they are incomplete. A CLT-informed worked example sequence should include fading. The models treated CLT as a presentation framework rather than a design framework that governs the structure of the sequence itself.
For teachers, this means that prompting "use cognitive load theory" is likely to produce better-formatted examples, not pedagogically restructured sequences. The prompt activates surface features of CLT, not its deeper design implications.
The difficulty-fading confusion
Both models, when given no pedagogical instruction, equated "progression" with "harder equations." Gemini in Condition B went further, explicitly equating "fading" with "increasing complexity." This is the most common near-miss in the dataset.
The confusion is understandable. In everyday teaching language, "scaffold" can mean either "support a student through a difficult task" or "provide a structure that is gradually removed." The first sense is about difficulty; the second is about transfer of responsibility. Fading is about the second sense. The models default to the first.
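The distinction can be made concrete with a small sketch. The Python below is illustrative only and is not the pilot's materials or scoring procedure: the function name `fade_backward`, the `[your turn]` placeholder, and the sample equation are all assumptions introduced here. It holds difficulty constant and varies only how much of the solution is handed over to the learner, which is the second sense of "scaffold" described above. It uses backward fading (removing steps from the end first), one of several possible fading orders.

```python
from typing import List

PLACEHOLDER = "[your turn]"  # hypothetical marker for a step the learner completes


def fade_backward(steps: List[str], n_examples: int) -> List[List[str]]:
    """Build a sequence of examples in which trailing steps are progressively blanked.

    Example 0 is fully worked. Each later example hands one more trailing
    step to the learner, until every step is blank (independent practice).
    """
    sequence = []
    for i in range(n_examples):
        n_blank = min(i, len(steps))           # steps handed to the learner
        shown = steps[: len(steps) - n_blank]  # steps still worked for them
        sequence.append(shown + [PLACEHOLDER] * n_blank)
    return sequence


# Hypothetical solution steps for "Solve 2x + 3 = 11".
steps = [
    "Subtract 3 from both sides: 2x = 8",
    "Divide both sides by 2: x = 4",
    "Check: 2(4) + 3 = 11",
]

sequence = fade_backward(steps, n_examples=4)
for number, example in enumerate(sequence, start=1):
    print(f"Example {number}")
    for step in example:
        print(f"  {step}")
```

In a real sequence each example would use a different but structurally similar equation; the same steps are reused here only to isolate the scaffolding dimension. A sequence that merely swapped in harder equations while showing every step would leave the output of this function unchanged from Example 1 throughout, which is the pattern the models produced.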
Implications for teachers
Three practical implications emerge from this pilot, stated cautiously given the sample size:
- Do not assume AI applies best practice without specific instruction. If a teacher asks for worked examples without specifying pedagogical principles, the output is likely to be four complete examples that vary in difficulty but not in scaffolding structure.
- "Apply CLT" is not enough. The prompt activates CLT vocabulary but does not reliably activate the specific principle (fading) that governs how worked example sequences should be structured.
- Specific prompts produced correct outputs. When the prompt named and defined fading, both models implemented it correctly. In this pilot, the more specific the instruction, the better the pedagogical quality of the output.
The teacher's role, then, is not eliminated by the model. It is shifted. The teacher needs to know what good instructional design looks like in order to prompt for it specifically, or to recognise when it is missing from the output.
Limitations
These limitations constrain what can be claimed from this pilot. They are not caveats to be skimmed. They define the boundaries of the findings.
- Two models do not make a general finding. Other models may behave differently. The pattern described here applies to Claude Opus 4.6 and Gemini 3.1 Pro as tested in March 2026.
- One run per condition. Claude's thinking mode blocks temperature control; Gemini was run at the default temperature (1.0). Without multiple runs, we cannot distinguish a stable model behaviour from a single-sample artefact.
- One CLT principle. Fading is one of many CLT principles. The knowledge-application gap observed here may or may not generalise to other principles (e.g., the expertise reversal effect, the redundancy principle).
- No human baseline. We do not know what percentage of human teachers would apply fading when asked to create worked examples. Without this comparison, we cannot say whether the models are better or worse than typical human practice.
- Pilot, not confirmatory. This study tested a methodology. The findings are descriptive and exploratory. No inferential statistics are appropriate for n = 6.
What a larger study should test
If the methodology proves useful, a larger study could address the pilot's limitations:
- More models: 5–10 frontier and mid-tier models to test whether the knowledge-application gap is general or model-specific.
- More CLT principles: Test fading alongside the expertise reversal effect, the redundancy principle, and the split-attention effect. Does the gap appear for all principles or only some?
- Multiple runs: 3–5 runs per condition per model to assess consistency.
- Human baseline: Administer the same prompts to qualified teachers and instructional designers. How often do humans apply fading without specific instruction?
- Cross-model scoring with reliability metrics: Have multiple independent LLM judges score every output and compute formal inter-rater reliability across them.
- Multi-turn prompting: Test whether models apply fading when given follow-up prompts (e.g., "Can you restructure these to gradually remove scaffolding?").
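As a rough illustration of the reliability computation proposed in the list above, the sketch below computes Cohen's kappa for two judges who each assign one categorical label per output. It is a minimal example under stated assumptions, not the pilot's scoring pipeline: the label names and the judge data are invented for illustration and are not results from this study. Cohen's kappa handles exactly two raters; with more than two judges, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the more natural choice.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(a)
    # Proportion of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for six outputs; NOT data from the pilot.
judge_1 = ["fading", "no_fading", "no_fading", "fading", "no_fading", "no_fading"]
judge_2 = ["fading", "no_fading", "fading", "fading", "no_fading", "no_fading"]

kappa = cohens_kappa(judge_1, judge_2)
print(f"kappa = {kappa:.3f}")  # prints "kappa = 0.667" for these invented labels
```

With only a handful of scored outputs, a kappa value is very unstable, which is one more reason the larger design needs multiple runs per condition before reliability figures are reported.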