3. Results
Output scores, trace scores, and the knowledge-application gap
Single response per condition. Scores: 0 = absent/uniform, 1 = partial, 2 = full implementation or explicit reference. Values shown as output / trace.
- In Condition A, all six models score 0 on the trace dimension. Five of six score 0 on the output. No model spontaneously applies or reasons about fading.
- In Condition B, the connecting lines show the knowledge-application gap. Closed-source models (Claude, both Geminis) show trace scores of 2 with output scores of 0 or 1. Open-weight models (DeepSeek, Qwen3) show trace scores of 1 with output scores of 0. GPT-OSS is the only model where output and trace scores match (both 1).
- In Condition C, all six models implement fading correctly (output = 2). Four of six trace scores reach 2. Claude's trace scores 1 (procedural reasoning), and GPT-OSS fades correctly but returns no reasoning trace (trace = 0), likely because OpenRouter did not surface it for this model.
Concordance table
Exact scores for each model-condition pair
| Model | Condition | Output (0–2) | Trace (0–2) | Pattern |
|---|---|---|---|---|
| Claude Opus 4.6 | A: Baseline | 0 | 0 | Neither reasoned nor applied |
| Claude Opus 4.6 | B: General CLT | 0 | 2 | Reasoned but did not apply |
| Claude Opus 4.6 | C: Specific fading | 2 | 1 | Applied with procedural reasoning |
| Gemini 3.1 Pro | A: Baseline | 0 | 0 | Neither reasoned nor applied |
| Gemini 3.1 Pro | B: General CLT | 1 | 2 | Reasoned but misapplied |
| Gemini 3.1 Pro | C: Specific fading | 2 | 2 | Reasoned and applied correctly |
| Gemini 3 Flash | A: Baseline | 1 | 0 | Some variation, no reasoning |
| Gemini 3 Flash | B: General CLT | 1 | 2 | Reasoned but did not apply |
| Gemini 3 Flash | C: Specific fading | 2 | 2 | Reasoned and applied correctly |
| DeepSeek R1 | A: Baseline | 0 | 0 | Neither reasoned nor applied |
| DeepSeek R1 | B: General CLT | 0 | 1 | Implicit reasoning, no application |
| DeepSeek R1 | C: Specific fading | 2 | 2 | Reasoned and applied correctly |
| Qwen3-235B | A: Baseline | 0 | 0 | Neither reasoned nor applied |
| Qwen3-235B | B: General CLT | 0 | 1 | Implicit reasoning, no application |
| Qwen3-235B | C: Specific fading | 2 | 2 | Reasoned and applied correctly |
| GPT-OSS-120B | A: Baseline | 0 | 0 | Neither reasoned nor applied |
| GPT-OSS-120B | B: General CLT | 1 | 1 | Partial on both dimensions |
| GPT-OSS-120B | C: Specific fading | 2 | 0* | Applied correctly; no trace returned |
* GPT-OSS-120B returned no reasoning trace for Condition C. This appears to be an API limitation (OpenRouter did not surface the trace for this call) rather than an absence of reasoning. The trace score is recorded as 0 but should be interpreted with this caveat.
The Condition B rows are the central finding. Five of six models show a gap between what they reason about and what they produce; GPT-OSS-120B scores 1 on both dimensions. The closed-source models retrieve formal CLT terminology (trace = 2) but still do not implement fading correctly (output = 0 or 1). DeepSeek R1 and Qwen3-235B show weaker CLT retrieval (trace = 1) and likewise produce no fading (output = 0).
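The gap can be recomputed directly from the concordance table. The sketch below is a minimal illustration, not the study's analysis code: it transcribes the table scores and takes the gap to be trace minus output, which is an assumed operationalisation. The names `SCORES`, `gap`, and `gaps_for` are hypothetical.

```python
# Scores transcribed from the concordance table as (output, trace) per condition.
SCORES = {
    "Claude Opus 4.6": {"A": (0, 0), "B": (0, 2), "C": (2, 1)},
    "Gemini 3.1 Pro":  {"A": (0, 0), "B": (1, 2), "C": (2, 2)},
    "Gemini 3 Flash":  {"A": (1, 0), "B": (1, 2), "C": (2, 2)},
    "DeepSeek R1":     {"A": (0, 0), "B": (0, 1), "C": (2, 2)},
    "Qwen3-235B":      {"A": (0, 0), "B": (0, 1), "C": (2, 2)},
    # Condition C trace was not returned by the API; recorded as 0 (see footnote).
    "GPT-OSS-120B":    {"A": (0, 0), "B": (1, 1), "C": (2, 0)},
}


def gap(model: str, condition: str) -> int:
    """Assumed gap measure: trace score minus output score."""
    output, trace = SCORES[model][condition]
    return trace - output


def gaps_for(condition: str) -> dict:
    """Gap for every model in one condition."""
    return {model: gap(model, condition) for model in SCORES}


if __name__ == "__main__":
    for model, g in gaps_for("B").items():
        print(f"{model}: trace - output = {g}")
```

A positive value means the trace scored higher than the output. A negative value in Condition C for GPT-OSS-120B reflects the missing trace, not better output than reasoning.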
The Condition B finding
The knowledge-application gap across six models
The Condition B results divide largely by model type, with one partial exception:
Closed-source models (Claude, Gemini 3.1 Pro, Gemini 3 Flash) all scored trace = 2, explicitly naming CLT constructs including the worked example effect. Claude listed fading as a "suggested next step" rather than implementing it. Gemini 3.1 Pro named fading but redefined it as difficulty progression. Gemini 3 Flash planned to "fade the explicit steps" in its trace but did not follow through in the output.
Open-weight models (DeepSeek R1, Qwen3-235B) both scored trace = 1, engaging with CLT concepts (intrinsic, extraneous, and germane load) without naming fading specifically. Neither implemented fading in the output.
GPT-OSS-120B is a partial exception: both output and trace scored 1, with the trace mentioning "low cognitive load" and the output showing slight structural variation. Its traces were minimal throughout (111 characters for A, 130 for B, 0 for C), so its trace scores should be interpreted cautiously.
Key trace excerpts
Selected passages from the reasoning traces, presented as generated artefacts
Condition A traces: content coverage only
"I need to create 4 worked examples for Year 8 students on solving two-step linear equations. I'll make them progressively more challenging and cover different variations, including positive and negative numbers, and different operations."
Claude Opus 4.6, Condition A
"We are going to create four worked examples for solving two-step linear equations for Year 8 students. Each example should be clear, with steps explained. We'll start with simple equations and gradually increase in complexity."
DeepSeek R1, Condition A
All six Condition A traces focus on content variety and difficulty sequencing. None mention how much the student should do, scaffolding, or any pedagogical framework. This pattern is unanimous.
Condition B traces: CLT knowledge retrieved but fading not applied
"Cognitive load theory principles relevant here: 1. Intrinsic load management... 2. Extraneous load reduction... 3. Germane load optimization... 4. Worked example effect — provide complete, step-by-step solutions... 5. Element interactivity..."
Claude Opus 4.6, Condition B (lists fading as a "suggested next step" in the output, not as something to implement now)
"I'll lean heavily on the Worked Example Effect. This means providing complete solutions with annotations... I'll also incorporate 'fading' by starting with simpler equations and gradually increasing complexity, but maintaining a similar format."
Gemini 3.1 Pro, Condition B (equates "fading" with difficulty progression)
"The worked example effect is key; I'll give them the complete solution to reduce the mental effort they have to expend just figuring out the process... I can start to fade the explicit steps at this point to encourage retrieval."
Gemini 3 Flash, Condition B (plans fading in the trace but does not execute it in the output)
"It's about managing the amount of information that learners have to process, right? There's intrinsic load (the inherent difficulty of the material), extraneous load (how the information is presented, which can be optimized), and germane load (processing that helps build schemas)."
Qwen3-235B, Condition B (engages with CLT load types but does not mention fading)
The closed-source models retrieve formal CLT terminology including explicit references to fading or the worked example effect, but misapply or defer it. The open-weight models engage with CLT at the level of load management (intrinsic, extraneous, germane) without reaching the specific principle of fading.
Condition C traces: correct implementation when instructed
"The worked example fading effect involves: 1. First example: Complete worked example (all steps shown) 2. Second example: Most steps shown, one step removed 3. Third example: More steps removed, students complete more 4. Fourth example: Most/all steps removed, students solve independently"
Claude Opus 4.6, Condition C (API-summarised thinking; see methods note)
"We're using the worked example fading effect here, also known as backward fading, and this is based on cognitive load theory... The fading should be: Example 1, 100% teacher. Example 2, 75% teacher, 25% student. Example 3, 25% teacher, 75% student. And Example 4: 0% teacher, 100% student."
Gemini 3.1 Pro, Condition C
"According to cognitive load theory, the fading effect involves starting with a full worked example, then in subsequent examples, gradually omitting steps that the learner should perform. This helps transition from pure observation to active problem-solving."
Qwen3-235B, Condition C
When told what fading is, all models that return traces plan and implement it correctly. Gemini 3.1 Pro's Condition C trace remains the most detailed pedagogical reasoning in the dataset: it names the specific variant (backward fading), plans percentage allocations, and produces a correctly faded sequence.
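To make the percentage plan concrete, here is a minimal sketch of backward fading, in which the final solution steps are the first to be handed to the student. It is an illustration only, not any model's output: the equation, the step wording, and the names `backward_fade`, `STEPS`, and `BLANK` are hypothetical, and it reuses one equation across all four examples for brevity, whereas a real sequence would use a different equation each time.

```python
from typing import List

BLANK = "____ (student completes)"

# Hypothetical worked solution for 3x + 5 = 20 (not taken from the study outputs).
STEPS = [
    "Subtract 5 from both sides: 3x + 5 - 5 = 20 - 5",
    "Simplify: 3x = 15",
    "Divide both sides by 3: 3x / 3 = 15 / 3",
    "Simplify: x = 5",
]


def backward_fade(steps: List[str], teacher_share: float) -> List[str]:
    """Keep the leading steps the teacher shows; blank the trailing ones."""
    if not 0.0 <= teacher_share <= 1.0:
        raise ValueError("teacher_share must be between 0 and 1")
    shown = round(teacher_share * len(steps))
    return steps[:shown] + [BLANK] * (len(steps) - shown)


if __name__ == "__main__":
    # Teacher shares quoted in the Gemini 3.1 Pro Condition C trace.
    for number, share in enumerate([1.0, 0.75, 0.25, 0.0], start=1):
        print(f"Example {number} (teacher {share:.0%})")
        for line in backward_fade(STEPS, share):
            print("  " + line)
```

With four steps per solution, the quoted shares map to 4, 3, 1, and 0 teacher-shown steps respectively.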