Models

Two frontier models with accessible reasoning traces were tested:

Model | Provider | Reasoning trace | Temperature
--- | --- | --- | ---
Claude Opus 4.6 | Anthropic | Extended thinking (8K budget; API returns summarised thinking) | Not settable (blocked by extended thinking)
Gemini 3.1 Pro | Google | Thinking traces (8K budget; API returns full thinking) | 1.0 (default; not modified)

Missing model: OpenAI GPT-5.2 Pro was planned as a third model. Its reasoning trace access was unavailable at the time of data collection. The study proceeded with two models.

Three prompting conditions

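Each model received the same task under three conditions. The task: create a set of four worked examples teaching Year 8 students to solve two-step linear equations. The conditions differ only in their closing instructions: Condition B appends a general CLT instruction to the baseline, and Condition C replaces the baseline's request for a full worked solution with a specific fading instruction.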

Condition A: Baseline

You are an experienced mathematics teacher. Create a set of 4 worked examples teaching Year 8 students how to solve two-step linear equations (e.g., 3x + 5 = 20). Each worked example should present a problem and its full worked solution.

Condition B: General CLT

You are an experienced mathematics teacher. Create a set of 4 worked examples teaching Year 8 students how to solve two-step linear equations (e.g., 3x + 5 = 20). Each worked example should present a problem and its full worked solution. Apply cognitive load theory principles in your design.

Condition C: Specific fading

You are an experienced mathematics teacher. Create a set of 4 worked examples teaching Year 8 students how to solve two-step linear equations (e.g., 3x + 5 = 20). Apply the worked example fading effect from cognitive load theory: systematically remove solution steps across the sequence so that learners gradually take over more of the problem-solving process.

Design note: Conditions A and B use "set" rather than "sequence" to avoid priming CLT associations ("worked example sequence" is a CLT term of art). Condition C uses "sequence" deliberately, since its instruction already names CLT concepts. Conditions A and B ask for a "full worked solution" to each problem because we considered this a reasonable approximation of how a teacher might phrase the request. This creates a competing instruction: the prompt asks for complete solutions, but CLT best practice says to fade. Condition C removes that competing instruction, which is why the comparison across conditions is informative.
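The relationship between the three prompts can be made explicit by assembling them from shared parts. The sketch below is illustrative only: the constant names (BASE, FULL_SOLUTION, GENERAL_CLT, SPECIFIC_FADING, PROMPTS) are hypothetical and not taken from the study's code; the prompt text itself is copied from the conditions above.

```python
# Illustrative sketch: constant names are hypothetical; prompt text is verbatim.
BASE = (
    "You are an experienced mathematics teacher. Create a set of 4 worked "
    "examples teaching Year 8 students how to solve two-step linear "
    "equations (e.g., 3x + 5 = 20)."
)
FULL_SOLUTION = (
    "Each worked example should present a problem and its full worked solution."
)
GENERAL_CLT = "Apply cognitive load theory principles in your design."
SPECIFIC_FADING = (
    "Apply the worked example fading effect from cognitive load theory: "
    "systematically remove solution steps across the sequence so that "
    "learners gradually take over more of the problem-solving process."
)

PROMPTS = {
    # A: baseline, asks for full worked solutions.
    "A": " ".join([BASE, FULL_SOLUTION]),
    # B: baseline plus a general CLT instruction (competing instruction kept).
    "B": " ".join([BASE, FULL_SOLUTION, GENERAL_CLT]),
    # C: full-solution request dropped, fading requested directly.
    "C": " ".join([BASE, SPECIFIC_FADING]),
}
```

Building the prompts this way makes it easy to check that B is A plus one sentence, and that only C omits the request for full solutions.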

Scoring rubric

Each response was scored on two dimensions:

Dimension 1: Output structure (0–2)

Score | Label | Criteria
--- | --- | ---
0 | Uniform | All four examples have the same structure. Every solution step is fully completed by the model. Examples may vary in difficulty but not in how much work the student does.
1 | Some variation | Some structural variation (shorter explanations, different formatting). All solution steps are still completed by the model. No steps left for the student.
2 | Progressive scaffolding removal | Solution steps are progressively removed. At least one example is partially completed (steps left blank or for the student). The final example requires the student to complete most or all steps independently.

Key distinctions: Harder equations with all steps shown = 0 or 1, not scaffolding removal. Less narration but all steps present = 1, not scaffolding removal. Steps left for the student to complete = 2.
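The key distinctions above can be read as a small decision rule. The following is a minimal sketch under stated assumptions, not the scoring procedure used in the pilot: it assumes a rater has already hand-coded, for each of the four examples, how many solution steps were left for the student, and whether the response shows any structural variation. The function name and its inputs are hypothetical, and the sketch reduces level 2 to "any steps left for the student", leaving the judgement of whether removal is progressive to the rater.

```python
def score_output_structure(steps_left_for_student, structural_variation):
    """Return a Dimension 1 score (0, 1 or 2) from hand-coded features.

    steps_left_for_student: one integer per example, in presentation order,
        counting solution steps the student must complete.
    structural_variation: True if examples differ in narration or formatting
        while the model still completes every step.
    """
    # Any step handed to the student counts as scaffolding removal.
    if any(n > 0 for n in steps_left_for_student):
        return 2
    # All steps shown, but the presentation varies across examples.
    if structural_variation:
        return 1
    # All steps shown and a uniform structure, regardless of difficulty.
    return 0
```

For instance, four fully worked examples with increasingly hard equations would be coded as `score_output_structure([0, 0, 0, 0], False)`, which returns 0 because difficulty alone does not count as scaffolding removal.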

Dimension 2: Reasoning trace content (0–2)

Score | Label | Criteria
--- | --- | ---
0 | Absent | Trace does not reference fading, scaffolding removal, or gradually shifting work to the learner.
1 | Implicit logic | Trace references the underlying logic without formal CLT terminology (e.g., "gradually have them do more of the work").
2 | Explicit reference | Trace explicitly names fading, the worked example effect, the completion problem effect, cognitive load theory, or Sweller.
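As a rough illustration of how the explicit/implicit split might be screened, the sketch below flags the terms listed under score 2 with regular expressions and leaves the implicit-logic judgement to a rater. This is an assumption-laden aid, not the pilot's scoring method: the patterns, the function names, and the idea of keyword screening are all hypothetical, and a keyword match cannot by itself tell a substantive reference from a passing mention.

```python
import re

# Hypothetical patterns for the terms named in the score-2 criterion.
EXPLICIT_PATTERNS = {
    "fading": r"\bfad(?:e|es|ed|ing)\b",
    "worked example effect": r"\bworked[- ]example effect\b",
    "completion problem effect": r"\bcompletion[- ]problem effect\b",
    "cognitive load theory": r"\bcognitive load theory\b",
    "Sweller": r"\bsweller\b",
}


def explicit_terms(trace):
    """List the explicit CLT terms found in a trace (case-insensitive)."""
    return [
        label
        for label, pattern in EXPLICIT_PATTERNS.items()
        if re.search(pattern, trace, re.IGNORECASE)
    ]


def score_trace(trace, implicit_logic):
    """Return a Dimension 2 score (0, 1 or 2).

    implicit_logic is a rater's judgement that the trace describes shifting
    work to the learner without formal terminology; it is not detected
    automatically.
    """
    if explicit_terms(trace):
        return 2
    return 1 if implicit_logic else 0
```

A screen like this could only pre-flag traces for review; the score-1 category in particular depends on reading the trace.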

Scoring process

The original plan called for cross-model LLM scoring (each model's outputs scored by a different model). In practice, the pilot used a combination of LLM review agents and human verification:

  • Claude outputs were scored by Opus 4.6 and Sonnet 4.6 review agents
  • Gemini outputs were scored by the same review agents
  • All scores were verified against the rubric by the author

For a pilot of this size (n = 6), formal inter-rater reliability was not computed. A larger follow-up study would include multiple independent scorers and reliability metrics.
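The pilot size follows from the design: two models, three conditions, one run each. A minimal sketch of that grid, assuming n counts model-by-condition responses (the variable names are hypothetical):

```python
from itertools import product

MODELS = ["Claude Opus 4.6", "Gemini 3.1 Pro"]
CONDITIONS = ["A", "B", "C"]
RUNS_PER_CELL = 1  # one run per model per condition

# One entry per scored response: (model, condition, run index).
responses = [
    (model, condition, run)
    for model, condition in product(MODELS, CONDITIONS)
    for run in range(1, RUNS_PER_CELL + 1)
]

n = len(responses)  # 2 models x 3 conditions x 1 run
```

Raising RUNS_PER_CELL is the simplest way a follow-up could estimate run-to-run variability within each cell.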

Reasoning trace capture

Both models were run with their reasoning/thinking features enabled, so each generated a chain-of-thought trace before producing its final output. These traces are generated text, not a direct window into model computation.

An important asymmetry: Anthropic's API returns summarised thinking for Claude 4 models, not the full chain of thought. The summaries preserve key ideas but are shorter than what the model actually generated internally. Gemini's API returns full thinking traces. This means the Claude traces in this study are API-provided summaries, while the Gemini traces are complete. Both are still analytically useful for the scoring rubric, but the Claude traces may omit reasoning that occurred internally.

Interpretability disclaimer: Chain-of-thought traces are generated text, not transparent access to model internals. Claude's traces are additionally summarised by the API before being returned. A trace that references CLT may reflect patterns in training data rather than something analogous to human reasoning about pedagogy. We analyse traces as artefacts: evidence of what the model articulated during processing, not claims about internal understanding.

Limitations

These limitations are foregrounded here, not buried in the Discussion.

  • Two models, one task, one principle. Findings describe these two models on this task at this point in time. They do not generalise to LLMs broadly, to other CLT principles, or to other educational tasks.
  • One run per condition. Each model was run once per condition, so run-to-run variability is unmeasured. Claude's extended thinking mode blocks the temperature parameter entirely. Gemini's thinking mode allows temperature but Google recommends keeping it at the default (1.0). Neither model was run with a fixed low temperature, so outputs may vary across runs.
  • No human baseline. We can score model outputs against the rubric, but we do not know how human teachers or instructional designers would perform on the same prompts. Without this comparison, we cannot say whether the models are better or worse than typical human practice.
  • Conditions A and B ask for a "full worked solution." Both prompts include this phrasing because we considered it a reasonable approximation of how a teacher might word the request. A model that complies literally may know about fading but prioritise following the prompt instruction. This is why the comparison across conditions matters: Condition C removes that competing instruction and asks for fading directly.
  • Single-turn only. A teacher might follow up with "can you fade the scaffolding?" This study captures only the first-pass response.
  • LLM judges share training biases. The scoring agents may share systematic biases with the models being scored. Human verification mitigates but does not eliminate this.
  • Claude traces are summarised. Anthropic's API returns summarised thinking for Claude 4 models, not the full chain of thought. Gemini's API returns full thinking output. This asymmetry means Claude may have reasoned about fading in ways not visible in the returned trace. The scoring rubric was applied to what the API returned.