Output and trace scores by model and condition
Each row shows two scores: whether the output implemented fading (blue) and whether the reasoning trace referenced it (terracotta). The connecting line shows the gap. Highlighted rows are Condition B, where the gap is largest.
[Figure: dot plot on a 0–2 scale. Each row plots the output score and the trace score, joined by a line showing the gap. Condition B rows are highlighted.]
Claude Opus 4.6: A: Baseline 0 / 0; B: General CLT 0 / 2; C: Specific fading 2 / 1
Gemini 3.1 Pro: A: Baseline 0 / 0; B: General CLT 1 / 2; C: Specific fading 2 / 2

Pilot data: single response per condition. Scores: 0 = absent/uniform, 1 = partial, 2 = full implementation or explicit reference. Values shown as output / trace.

Observations:
  • In Condition A, both dots overlap at zero for both models. Neither the output nor the trace referenced fading.
  • In Condition B, the long connecting lines show the knowledge-application gap. Both models' traces scored 2 (explicit CLT references including fading) while their outputs scored 0 or 1 (no actual fading implemented).
  • In Condition C, the dots sit close together (Claude, 2 / 1) or overlap at the top of the scale (Gemini, 2 / 2). Both models implemented fading correctly when told what it is.

Concordance table

Exact scores for each model-condition pair

Model            Condition           Output (0–2)  Trace (0–2)  Pattern
Claude Opus 4.6  A: Baseline         0             0            Neither reasoned nor applied
Claude Opus 4.6  B: General CLT      0             2            Reasoned but did not apply
Claude Opus 4.6  C: Specific fading  2             1            Applied with procedural reasoning
Gemini 3.1 Pro   A: Baseline         0             0            Neither reasoned nor applied
Gemini 3.1 Pro   B: General CLT      1             2            Reasoned but misapplied
Gemini 3.1 Pro   C: Specific fading  2             2            Reasoned and applied correctly

The Condition B rows (highlighted) are the central finding. In Condition B, both models' reasoning traces explicitly referenced fading or related CLT constructs (trace score = 2), but neither model's output implemented fading correctly (output score = 0 for Claude, 1 for Gemini).
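The gap can be made explicit with a few lines of arithmetic. The sketch below copies the six score pairs from the concordance table and assumes a signed gap defined as trace minus output; that definition, and the names `SCORES`, `gap`, and `mean_gap_by_condition`, are illustrative choices and not part of the original analysis. With a single response per condition, the means are descriptive only.

```python
# Scores copied from the concordance table: (model, condition) -> (output, trace).
SCORES = {
    ("Claude Opus 4.6", "A"): (0, 0),
    ("Claude Opus 4.6", "B"): (0, 2),
    ("Claude Opus 4.6", "C"): (2, 1),
    ("Gemini 3.1 Pro", "A"): (0, 0),
    ("Gemini 3.1 Pro", "B"): (1, 2),
    ("Gemini 3.1 Pro", "C"): (2, 2),
}


def gap(output, trace):
    """Signed gap (an assumed definition): positive when the trace
    referenced fading more fully than the output implemented it."""
    return trace - output


def mean_gap_by_condition(scores):
    """Average the signed gap over models within each condition."""
    gaps = {}
    for (_model, condition), (output, trace) in scores.items():
        gaps.setdefault(condition, []).append(gap(output, trace))
    return {condition: sum(g) / len(g) for condition, g in gaps.items()}


if __name__ == "__main__":
    for condition, value in sorted(mean_gap_by_condition(SCORES).items()):
        print(f"Condition {condition}: mean gap = {value:+.1f}")
```

Under this definition Condition B has the largest positive gap, and Condition C is slightly negative because Claude's output score exceeds its trace score.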

The Condition B finding

The knowledge-application gap in detail

The headline finding: When told to "apply cognitive load theory," both models retrieved fading-related knowledge in their reasoning traces but did not implement fading in their outputs. The models know about fading. They chose not to apply it, or misapplied it, when the prompt did not name the specific principle.

Claude in Condition B listed five CLT principles by name in its trace, including the worked example effect. It interpreted that principle as providing complete solutions. The output listed "completion problems (fading worked examples)" as a suggested future next step for the teacher. Claude knew fading existed and deliberately deferred it.

Gemini in Condition B explicitly named "fading" in its trace but immediately redefined it as difficulty progression: "starting with simpler equations and gradually increasing complexity." The output increased equation difficulty across examples but did not remove any solution steps. Gemini had the right word but mapped it to the wrong mechanism.
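The distinction between fading and difficulty progression can be stated as two separate checks on a sequence of worked examples. The following is a minimal sketch, not the scoring procedure used in this pilot: the `WorkedExample` fields, the ordinal `difficulty` rating, and both example sequences are hypothetical and exist only to show that the two properties are independent.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    difficulty: int   # hypothetical ordinal difficulty rating
    steps_total: int  # number of steps in the full solution
    steps_shown: int  # number of steps worked out for the student


def is_faded(sequence):
    """True if the fraction of worked steps never rises and ends lower
    than it started, i.e. solution steps are handed over to the student."""
    shown = [e.steps_shown / e.steps_total for e in sequence]
    non_increasing = all(a >= b for a, b in zip(shown, shown[1:]))
    return non_increasing and shown[-1] < shown[0]


def is_difficulty_progression(sequence):
    """True if difficulty never drops and ends higher than it started."""
    levels = [e.difficulty for e in sequence]
    non_decreasing = all(a <= b for a, b in zip(levels, levels[1:]))
    return non_decreasing and levels[-1] > levels[0]


# Hypothetical sequence: equations get harder, every step is still shown.
harder_only = [WorkedExample(d, 4, 4) for d in (1, 2, 3, 4)]

# Hypothetical sequence: same difficulty, progressively fewer steps shown.
faded_only = [WorkedExample(2, 4, s) for s in (4, 3, 1, 0)]

if __name__ == "__main__":
    print(is_difficulty_progression(harder_only), is_faded(harder_only))
    print(is_difficulty_progression(faded_only), is_faded(faded_only))
```

A sequence like `harder_only` matches what the Condition B trace described as "fading", yet it fails the `is_faded` check because no solution steps are removed.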

Key trace excerpts

Selected passages from the reasoning traces, presented as generated artefacts

Condition A traces: content coverage only

"I need to create 4 worked examples for Year 8 students on solving two-step linear equations. I'll make them progressively more challenging and cover different variations, including positive and negative numbers, and different operations."

Claude Opus 4.6, Condition A

"To cover a good range of examples, I'll vary the equations. Example one will be a straightforward addition and multiplication problem... finally, to challenge them slightly and reinforce understanding that equations work both ways, Example four will involve... the variable on the right side."

Gemini 3.1 Pro, Condition A

Both traces focus on content variety and difficulty sequencing. Neither mentions how much the student should do, scaffolding, or any pedagogical framework.

Condition B traces: formal CLT knowledge retrieved but fading mishandled

"Cognitive load theory principles relevant here: 1. Intrinsic load management... 2. Extraneous load reduction... 3. Germane load optimization... 4. Worked example effect — provide complete, step-by-step solutions... 5. Element interactivity..."

Claude Opus 4.6, Condition B (fading is listed as a "suggested next step" in the output, not as something to implement now)

"I'll lean heavily on the Worked Example Effect. This means providing complete solutions with annotations... I'll also incorporate 'fading' by starting with simpler equations and gradually increasing complexity, but maintaining a similar format."

Gemini 3.1 Pro, Condition B (equates "fading" with difficulty progression)

This is the most revealing pair of traces in the dataset. Claude lists fading as a known concept but classifies it as a later step. Gemini names fading but redefines it as making equations harder. Both models have the vocabulary. Neither selects the correct application.

Condition C traces: correct implementation when instructed

"The worked example fading effect involves: 1. First example: Complete worked example (all steps shown) 2. Second example: Most steps shown, one step removed 3. Third example: More steps removed, students complete more 4. Fourth example: Most/all steps removed, students solve independently"

Claude Opus 4.6, Condition C (API-summarised thinking; see methods note)

"We're using the worked example fading effect here, also known as backward fading, and this is based on cognitive load theory... The fading should be: Example 1, 100% teacher. Example 2, 75% teacher, 25% student. Example 3, 25% teacher, 75% student. And Example 4: 0% teacher, 100% student."

Gemini 3.1 Pro, Condition C (the richest pedagogical reasoning in any trace)

When told what fading is, both models plan and implement it correctly. Gemini's Condition C trace is the most detailed pedagogical reasoning in the entire dataset: it names the specific variant (backward fading), plans percentage allocations, and produces a correctly faded sequence.
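The percentage plan in Gemini's trace can be turned into a concrete faded sequence. The sketch below assumes backward fading means blanking steps from the end of the solution first; the teacher shares come from the trace quoted above, while the equation, the step wording, and the function names are hypothetical and not taken from either model's output.

```python
# Teacher share per example, taken from the Gemini Condition C trace.
TEACHER_SHARE = [1.0, 0.75, 0.25, 0.0]

# Hypothetical four-step solution of 2x + 3 = 11 (not from the study materials).
STEPS = [
    "Subtract 3 from both sides",
    "2x = 8",
    "Divide both sides by 2",
    "x = 4",
]

BLANK = "____"


def backward_fade(steps, teacher_share):
    """Keep the first share of the steps and blank the rest, so the
    student completes the final steps of the solution first."""
    shown = round(len(steps) * teacher_share)
    return steps[:shown] + [BLANK] * (len(steps) - shown)


def fading_plan(steps, shares):
    """Apply each teacher share in turn to produce the faded sequence."""
    return [backward_fade(steps, share) for share in shares]


if __name__ == "__main__":
    for number, example in enumerate(fading_plan(STEPS, TEACHER_SHARE), start=1):
        print(f"Example {number}: {example}")
```

For brevity the sketch reuses one equation for all four examples; a real sequence would pair each level of fading with a different equation.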