Output and trace scores by model and condition
Each row shows two scores: whether the output implemented fading (blue) and whether the reasoning trace referenced it (terracotta). The connecting line shows the gap. Highlighted rows are Condition B, where the gap is largest.
Legend: output score (blue), trace score (terracotta), gap (connecting line), Condition B rows highlighted. Score axis: 0 to 2.

Claude Opus 4.6: A (Baseline) 0 / 0; B (General CLT) 0 / 2; C (Specific fading) 2 / 1
Gemini 3.1 Pro: A (Baseline) 0 / 0; B (General CLT) 1 / 2; C (Specific fading) 2 / 2
Gemini 3 Flash: A (Baseline) 1 / 0; B (General CLT) 1 / 2; C (Specific fading) 2 / 2
DeepSeek R1: A (Baseline) 0 / 0; B (General CLT) 0 / 1; C (Specific fading) 2 / 2
Qwen3-235B: A (Baseline) 0 / 0; B (General CLT) 0 / 1; C (Specific fading) 2 / 2
GPT-OSS-120B: A (Baseline) 0 / 0; B (General CLT) 1 / 1; C (Specific fading) 2 / 0

Single response per condition. Scores: 0 = absent/uniform, 1 = partial, 2 = full implementation or explicit reference. Values shown as output / trace.

Observations:
  • In Condition A, all six models score 0 on the trace dimension, and five of six score 0 on the output (Gemini 3 Flash scores 1 for some variation without reasoning). No model spontaneously reasons about fading, and none fully applies it.
  • In Condition B, the connecting lines show the knowledge-application gap. Closed-source models (Claude Opus 4.6, Gemini 3.1 Pro, Gemini 3 Flash) show trace scores of 2 with output scores of 0 or 1. Open-weight models (DeepSeek R1, Qwen3-235B) show trace scores of 1 with output scores of 0. GPT-OSS-120B is the only model where output and trace scores match (both 1).
  • In Condition C, all six models implement fading correctly (output = 2). Four of six trace scores reach 2. The exceptions are Claude Opus 4.6 (trace = 1, procedural reasoning) and GPT-OSS-120B, which fades correctly but returns no reasoning trace (trace = 0), likely because OpenRouter did not surface it for this model.
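
The counts in these observations can be recomputed directly from the reported scores. The following is a minimal sketch, not part of the original analysis: the `SCORES` dictionary and the `count_models` helper are hypothetical names introduced here, and the values are transcribed from the output / trace pairs reported in this section.

```python
# (output, trace) scores per model and condition, transcribed from this section.
# SCORES and count_models are illustrative names, not part of the original study.
SCORES = {
    "Claude Opus 4.6": {"A": (0, 0), "B": (0, 2), "C": (2, 1)},
    "Gemini 3.1 Pro":  {"A": (0, 0), "B": (1, 2), "C": (2, 2)},
    "Gemini 3 Flash":  {"A": (1, 0), "B": (1, 2), "C": (2, 2)},
    "DeepSeek R1":     {"A": (0, 0), "B": (0, 1), "C": (2, 2)},
    "Qwen3-235B":      {"A": (0, 0), "B": (0, 1), "C": (2, 2)},
    "GPT-OSS-120B":    {"A": (0, 0), "B": (1, 1), "C": (2, 0)},
}


def count_models(condition, dimension, score):
    """Count models whose score on `dimension` ('output' or 'trace') equals `score`."""
    index = {"output": 0, "trace": 1}[dimension]
    return sum(
        1
        for by_condition in SCORES.values()
        if by_condition[condition][index] == score
    )


if __name__ == "__main__":
    # Print a tally of 0/1/2 scores for each condition and dimension.
    for condition in "ABC":
        for dimension in ("output", "trace"):
            tally = {s: count_models(condition, dimension, s) for s in (0, 1, 2)}
            print(condition, dimension, tally)
```

Keeping the scores in one structure makes it easy to check each summary statement against the underlying values.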

Concordance table

Exact scores for each model-condition pair

Model           | Condition          | Output (0–2) | Trace (0–2) | Pattern
----------------|--------------------|--------------|-------------|------------------------------------
Claude Opus 4.6 | A: Baseline        | 0            | 0           | Neither reasoned nor applied
Claude Opus 4.6 | B: General CLT     | 0            | 2           | Reasoned but did not apply
Claude Opus 4.6 | C: Specific fading | 2            | 1           | Applied with procedural reasoning
Gemini 3.1 Pro  | A: Baseline        | 0            | 0           | Neither reasoned nor applied
Gemini 3.1 Pro  | B: General CLT     | 1            | 2           | Reasoned but misapplied
Gemini 3.1 Pro  | C: Specific fading | 2            | 2           | Reasoned and applied correctly
Gemini 3 Flash  | A: Baseline        | 1            | 0           | Some variation, no reasoning
Gemini 3 Flash  | B: General CLT     | 1            | 2           | Reasoned but did not apply
Gemini 3 Flash  | C: Specific fading | 2            | 2           | Reasoned and applied correctly
DeepSeek R1     | A: Baseline        | 0            | 0           | Neither reasoned nor applied
DeepSeek R1     | B: General CLT     | 0            | 1           | Implicit reasoning, no application
DeepSeek R1     | C: Specific fading | 2            | 2           | Reasoned and applied correctly
Qwen3-235B      | A: Baseline        | 0            | 0           | Neither reasoned nor applied
Qwen3-235B      | B: General CLT     | 0            | 1           | Implicit reasoning, no application
Qwen3-235B      | C: Specific fading | 2            | 2           | Reasoned and applied correctly
GPT-OSS-120B    | A: Baseline        | 0            | 0           | Neither reasoned nor applied
GPT-OSS-120B    | B: General CLT     | 1            | 1           | Partial on both dimensions
GPT-OSS-120B    | C: Specific fading | 2            | 0*          | Applied correctly; no trace returned

* GPT-OSS-120B returned no reasoning trace for Condition C. This appears to be an API limitation (OpenRouter did not surface the trace for this call) rather than an absence of reasoning. The trace score is recorded as 0 but should be interpreted with this caveat.

The highlighted rows are the central finding. In Condition B, five of six models show a gap between what they reason about and what they produce; GPT-OSS-120B is the exception, scoring 1 on both dimensions. The closed-source models retrieve formal CLT terminology (trace = 2) but do not fully implement fading (output = 0 or 1). The open-weight models DeepSeek R1 and Qwen3-235B show weaker CLT retrieval (trace = 1) and produce no fading (output = 0).
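
One way to put a number on the gap is to subtract the output score from the trace score for each Condition B row. This is a sketch under an assumption: the source shows the gap only as a connecting line, so the definition `trace - output` and the names `CONDITION_B` and `gaps` are introduced here for illustration.

```python
# Condition B scores as (output, trace), transcribed from the concordance table.
# The gap definition (trace minus output) is an assumption made for illustration.
CONDITION_B = {
    "Claude Opus 4.6": (0, 2),
    "Gemini 3.1 Pro": (1, 2),
    "Gemini 3 Flash": (1, 2),
    "DeepSeek R1": (0, 1),
    "Qwen3-235B": (0, 1),
    "GPT-OSS-120B": (1, 1),
}

# A positive gap means the trace referenced more than the output implemented.
gaps = {model: trace - output for model, (output, trace) in CONDITION_B.items()}

models_with_gap = [model for model, gap in gaps.items() if gap > 0]

if __name__ == "__main__":
    for model, gap in gaps.items():
        print(f"{model}: gap = {gap}")
    print(f"{len(models_with_gap)} of {len(gaps)} models show a positive gap")
```

With a single response per condition, these differences describe individual responses and are not stable estimates of model behaviour.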

The Condition B finding

The knowledge-application gap across six models

The headline finding: When told to "apply cognitive load theory," all six models retrieved CLT-related knowledge in their reasoning traces, but none fully implemented fading in its output (no output score reached 2). The models have access to fading as a concept. None selects it for application when the prompt does not name it specifically.

The Condition B results divide by model type, with one partial exception:

Closed-source models (Claude, Gemini 3.1 Pro, Gemini 3 Flash) all scored trace = 2, explicitly naming CLT constructs including the worked example effect. Claude listed fading as a "suggested next step" rather than implementing it. Gemini 3.1 Pro named fading but redefined it as difficulty progression. Gemini 3 Flash planned to "fade the explicit steps" in its trace but did not follow through in the output.

Open-weight models (DeepSeek R1, Qwen3-235B) both scored trace = 1, engaging with CLT concepts (intrinsic, extraneous, and germane load) without naming fading specifically. Neither implemented fading in the output.

GPT-OSS-120B is a partial exception: both output and trace scored 1, with the trace mentioning "low cognitive load" and the output showing slight structural variation. Its traces were minimal throughout (111 characters for A, 130 for B, 0 for C), so its trace scores should be interpreted cautiously.

Key trace excerpts

Selected passages from the reasoning traces, presented as generated artefacts

Condition A traces: content coverage only

"I need to create 4 worked examples for Year 8 students on solving two-step linear equations. I'll make them progressively more challenging and cover different variations, including positive and negative numbers, and different operations."

Claude Opus 4.6, Condition A

"We are going to create four worked examples for solving two-step linear equations for Year 8 students. Each example should be clear, with steps explained. We'll start with simple equations and gradually increase in complexity."

DeepSeek R1, Condition A

All six Condition A traces focus on content variety and difficulty sequencing. None mentions scaffolding, how much of the work the student should do, or any pedagogical framework.

Condition B traces: CLT knowledge retrieved but fading not applied

"Cognitive load theory principles relevant here: 1. Intrinsic load management... 2. Extraneous load reduction... 3. Germane load optimization... 4. Worked example effect — provide complete, step-by-step solutions... 5. Element interactivity..."

Claude Opus 4.6, Condition B (lists fading as a "suggested next step" in the output, not as something to implement now)

"I'll lean heavily on the Worked Example Effect. This means providing complete solutions with annotations... I'll also incorporate 'fading' by starting with simpler equations and gradually increasing complexity, but maintaining a similar format."

Gemini 3.1 Pro, Condition B (equates "fading" with difficulty progression)

"The worked example effect is key; I'll give them the complete solution to reduce the mental effort they have to expend just figuring out the process... I can start to fade the explicit steps at this point to encourage retrieval."

Gemini 3 Flash, Condition B (plans fading in the trace but does not execute it in the output)

"It's about managing the amount of information that learners have to process, right? There's intrinsic load (the inherent difficulty of the material), extraneous load (how the information is presented, which can be optimized), and germane load (processing that helps build schemas)."

Qwen3-235B, Condition B (engages with CLT load types but does not mention fading)

The closed-source models retrieve formal CLT terminology, including explicit references to fading or the worked example effect, but they misapply or defer fading itself. The open-weight models DeepSeek R1 and Qwen3-235B engage with CLT at the level of load management (intrinsic, extraneous, germane) without reaching the specific principle of fading.

Condition C traces: correct implementation when instructed

"The worked example fading effect involves: 1. First example: Complete worked example (all steps shown) 2. Second example: Most steps shown, one step removed 3. Third example: More steps removed, students complete more 4. Fourth example: Most/all steps removed, students solve independently"

Claude Opus 4.6, Condition C (API-summarised thinking; see methods note)

"We're using the worked example fading effect here, also known as backward fading, and this is based on cognitive load theory... The fading should be: Example 1, 100% teacher. Example 2, 75% teacher, 25% student. Example 3, 25% teacher, 75% student. And Example 4: 0% teacher, 100% student."

Gemini 3.1 Pro, Condition C

"According to cognitive load theory, the fading effect involves starting with a full worked example, then in subsequent examples, gradually omitting steps that the learner should perform. This helps transition from pure observation to active problem-solving."

Qwen3-235B, Condition C

When told what fading is, all models that return traces plan and implement it correctly. Gemini 3.1 Pro's Condition C trace is the most detailed pedagogical reasoning in the dataset: it names the specific variant (backward fading), plans percentage allocations, and produces a correctly faded sequence.
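
The percentage allocation in the Gemini 3.1 Pro trace (100%, 75%, 25%, 0% teacher) can be illustrated with a short sketch of backward fading, in which the final steps of a solution are the first to be left for the student. This is an illustration only, not any model's actual output: the `fade` function, the blank marker, and the example equation 3x + 4 = 19 are all hypothetical.

```python
# Hypothetical worked solution for 3x + 4 = 19, broken into four steps.
STEPS = [
    "Subtract 4 from both sides: 3x + 4 - 4 = 19 - 4",
    "Simplify: 3x = 15",
    "Divide both sides by 3: 3x / 3 = 15 / 3",
    "Simplify: x = 5",
]

# Teacher share of the steps per example, as quoted in the Gemini 3.1 Pro trace.
TEACHER_FRACTIONS = [1.0, 0.75, 0.25, 0.0]


def fade(steps, teacher_fraction, blank="______"):
    """Show the first share of the steps and blank the rest (backward fading)."""
    shown = round(teacher_fraction * len(steps))
    return steps[:shown] + [blank] * (len(steps) - shown)


if __name__ == "__main__":
    for number, fraction in enumerate(TEACHER_FRACTIONS, start=1):
        print(f"Example {number} ({fraction:.0%} teacher)")
        for line in fade(STEPS, fraction):
            print("  " + line)
```

In a real sequence each example would use a different equation; the same steps are reused here only to make the shrinking teacher share easy to see.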