Worked examples and LLMs

Worked examples are a staple of mathematics instruction: a problem is presented, and a step-by-step solution is shown for students to study. Data from the OECD's 2024 Teaching and Learning International Survey (TALIS) show that 41% of lower secondary teachers across participating countries have used AI for professional tasks. Generating instructional materials is a plausible use within that figure, though we have no specific data on how often teachers prompt LLMs for worked examples.

Whether or not this is already common practice, it is a reasonable thing for a teacher to do. The question is whether the outputs models produce reflect what the evidence base says about how worked examples should be designed.

The fading effect

Cognitive load theory (CLT) is one of the most influential frameworks in instructional design. Among its best-established findings is the worked example fading effect: instructional sequences should begin with fully worked examples and progressively remove solution steps so that learners gradually take over the problem-solving process. By the end of the sequence, the student is solving problems independently.

This is distinct from simply making problems harder. Fading means removing scaffolding. Four equations that get progressively more difficult, each with every step shown, would not count as fading. A faded sequence looks different: the first example shows all steps, the second omits the final step for the student to complete, the third omits more steps, and the fourth is a practice problem with no steps shown.
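The four-item sequence described above can be sketched in a few lines of Python. This is a minimal illustration, not material from the study: the equations, the step wording, and the names FadedExample and backward_fade are all hypothetical, and the schedule (hide one more step from the end each time, then show nothing on the last item) is just one simple way to realise the description.

```python
from dataclasses import dataclass


@dataclass
class FadedExample:
    problem: str
    shown_steps: list[str]  # steps printed for the student to study
    omitted_steps: int      # steps left for the student to complete


def backward_fade(problems: list[tuple[str, list[str]]]) -> list[FadedExample]:
    """Show every step of the first problem, hide one more step from the
    end of each later problem, and show no steps at all for the last one."""
    sequence = []
    last = len(problems) - 1
    for i, (problem, steps) in enumerate(problems):
        # Assumed schedule: item i keeps len(steps) - i steps; the final
        # item is a pure practice problem with nothing shown.
        keep = 0 if i == last else max(len(steps) - i, 0)
        sequence.append(FadedExample(problem, steps[:keep], len(steps) - keep))
    return sequence


# Hypothetical two-step equations, each with three solution steps.
PROBLEMS = [
    ("2x + 3 = 11", ["Subtract 3: 2x = 8", "Divide by 2: x = 4", "Check: 2(4) + 3 = 11"]),
    ("3x - 5 = 10", ["Add 5: 3x = 15", "Divide by 3: x = 5", "Check: 3(5) - 5 = 10"]),
    ("4x + 7 = 27", ["Subtract 7: 4x = 20", "Divide by 4: x = 5", "Check: 4(5) + 7 = 27"]),
    ("5x - 2 = 18", ["Add 2: 5x = 20", "Divide by 5: x = 4", "Check: 5(4) - 2 = 18"]),
]

faded = backward_fade(PROBLEMS)
for example in faded:
    print(example.problem, "| shown:", len(example.shown_steps),
          "| left to student:", example.omitted_steps)
```

Note that the four problems are of similar difficulty. What changes across the sequence is how much of the solution is shown, not how hard the problem is.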

The distinction matters because difficulty progression and scaffolding removal are different instructional mechanisms. Harder problems with full solutions leave the division of work unchanged: the example still does all of the solving, and the learner only studies it. Fading deliberately shifts cognitive work from the example to the learner, which is the mechanism through which worked examples transition into independent practice.
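One way to make the distinction concrete is to describe each example by two numbers, the steps its solution needs and the steps actually shown, and check whether scaffolding is removed across the sequence. The sketch below is an assumption-laden illustration rather than the coding scheme used in this study: the function name is_faded and its three criteria are hypothetical.

```python
def is_faded(sequence: list[tuple[int, int]]) -> bool:
    """sequence holds (total_steps, shown_steps) for each example, in order.

    Assumed criteria for counting a sequence as faded:
      1. the first example shows every step,
      2. the number of omitted steps never decreases,
      3. the last example omits more steps than the first.
    """
    if len(sequence) < 2:
        return False
    omitted = [total - shown for total, shown in sequence]
    starts_fully_worked = omitted[0] == 0
    never_restores_steps = all(a <= b for a, b in zip(omitted, omitted[1:]))
    removes_scaffolding = omitted[-1] > omitted[0]
    return starts_fully_worked and never_restores_steps and removes_scaffolding


# Harder problems (more steps each time), every step shown: not fading.
harder_but_full = [(3, 3), (4, 4), (5, 5), (6, 6)]

# Same-length problems with progressively fewer steps shown: fading.
faded_sequence = [(3, 3), (3, 2), (3, 1), (3, 0)]

print(is_faded(harder_but_full))
print(is_faded(faded_sequence))
```

Under these criteria, difficulty (the first number in each pair) plays no role in the check. Only the gap between the steps needed and the steps shown does.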

The research question

When a teacher asks an LLM to create worked examples for a common educational task, does the model apply fading? Three sub-questions follow:

  1. Baseline: When the prompt makes no mention of pedagogy, does the model apply fading on its own?
  2. General CLT: When told to "apply cognitive load theory principles," does the model select fading from its repertoire of CLT knowledge?
  3. Specific fading: When the prompt names and defines the fading effect, does the model implement it correctly?

These three conditions form a prompting gradient, from no pedagogical guidance to an explicit definition. The comparison between the general CLT condition (2) and the specific fading condition (3) is particularly informative, because it tests whether models that know about fading will actually apply it without being told the specific principle by name.
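The gradient can be pictured as one base task with an increasingly specific instruction appended. The sketch below is illustrative only. Apart from the quoted phrase "apply cognitive load theory principles", the wording is hypothetical and is not the prompt text used in the study. The names BASE_TASK, CONDITION_INSTRUCTIONS, and build_prompt are likewise assumptions.

```python
# Hypothetical task wording, for illustration only.
BASE_TASK = "Create four worked examples for solving two-step linear equations."

# One instruction per condition, ordered from least to most specific.
CONDITION_INSTRUCTIONS = {
    "baseline": "",  # no mention of pedagogy
    "general_clt": "Apply cognitive load theory principles.",
    "specific_fading": (
        "Apply the worked example fading effect: begin with a fully worked "
        "example, then progressively remove solution steps so the learner "
        "completes more of each problem, ending with a practice problem "
        "that shows no steps."
    ),
}


def build_prompt(condition: str) -> str:
    """Append the condition's instruction (if any) to the shared base task."""
    instruction = CONDITION_INSTRUCTIONS[condition]
    return f"{BASE_TASK} {instruction}".strip()


for name in CONDITION_INSTRUCTIONS:
    print(f"[{name}] {build_prompt(name)}")
```

Holding the base task fixed means any difference between outputs can be attributed to the added instruction rather than to a change in the task itself.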

Why this matters for teachers

If a model produces four fully worked examples where the evidence calls for a faded sequence, a teacher who trusts the output receives material that misses a well-established design principle. The teacher may not know that fading should be applied. Or they may know but assume the model has handled it, particularly if they prompted the model to "apply cognitive load theory."

This is not about whether LLMs "understand" pedagogy. It is about what outputs teachers receive and whether those outputs reflect the evidence base.

Scope of this study

This is a pilot study. Six models were tested across three prompting conditions, producing eighteen outputs and eighteen reasoning traces. Three of the models are closed-source (Claude Opus 4.6, Gemini 3.1 Pro, Gemini 3 Flash) and three are open-weight (DeepSeek R1, Qwen3-235B, GPT-OSS-120B). All findings are descriptive. The study tests a methodology: can examining both outputs and reasoning traces reveal how models handle pedagogical theory? The findings will inform whether a larger multi-principle study is warranted.
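The design is a full crossing of models and conditions, which is where the figure of eighteen comes from. A minimal sketch of that grid follows. The model names are taken from the paragraph above, while the condition labels and the record layout are assumptions made for illustration.

```python
# Model names as listed above, grouped by access type.
MODELS = {
    "closed-source": ["Claude Opus 4.6", "Gemini 3.1 Pro", "Gemini 3 Flash"],
    "open-weight": ["DeepSeek R1", "Qwen3-235B", "GPT-OSS-120B"],
}

# Assumed short labels for the three prompting conditions.
CONDITION_NAMES = ["baseline", "general CLT", "specific fading"]

# One run per (model, condition) pair; each run yields one output
# and one reasoning trace.
runs = [
    {"model": model, "access": access, "condition": condition}
    for access, models in MODELS.items()
    for model in models
    for condition in CONDITION_NAMES
]

print(len(runs), "runs")  # 6 models x 3 conditions
```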