Preserving Learning in Generative AI Tutoring Systems: Pedagogical Safety, Cognitive Effort, and Adaptive Scaffolding
Abstract
Generative AI tutors can now explain concepts, respond conversationally, generate practice, diagnose partial answers, and personalize support across domains. These capabilities make them plausible successors to older intelligent tutoring systems, but they also introduce a new instructional hazard: a system that can solve a learner’s task can easily replace the learner’s cognitive work. The central design problem is therefore not whether large language models can explain content. It is whether tutoring systems can regulate help so that learners still retrieve, reason, explain, struggle productively, repair misconceptions, and transfer knowledge without the model.
Recent evidence makes the distinction urgent. Carefully structured AI tutoring systems and human-AI tutor co-pilots show promising learning gains in randomized trials spanning physics classrooms, K–12 mathematics, UK secondary-school tutoring, and low-resource educational contexts. Yet unrestricted access to general-purpose models can improve immediate practice performance while reducing later unassisted performance. The OECD’s 2026 Digital Education Outlook draws the same distinction at policy level: generative AI can support learning when guided by teaching principles, but outsourcing cognitive work can create apparent performance gains without durable learning. (OECD)
This paper argues for a learning-preservation approach to generative AI tutoring. Pedagogical safety should be defined as protection against educational harms: answer over-disclosure, premature solution-giving, misconception reinforcement, weak scaffolding, over-scaffolding, cognitive offloading, false mastery, poor confidence calibration, low learner agency, lack of retrieval practice, lack of transfer checks, and multi-turn pedagogical drift. The proposed architecture combines a domain model, learner model, pedagogical policy, grounded generation layer, verification tools, unassisted checkpoints, and teacher-facing analytics. The proposed evaluation hierarchy prioritizes unassisted near-transfer, delayed retention, far transfer, self-regulation, and equity over chatbot fluency or immediate task completion.
Keywords
Generative AI tutoring; adaptive learning; pedagogical safety; cognitive effort; cognitive offloading; scaffolding; intelligent tutoring systems; self-explanation; retrieval practice; learner modeling; durable learning.
1. Introduction: the performance-learning problem
Generative AI tutoring has created a paradox. The same capability that makes AI tutors attractive—the ability to generate fluent explanations, examples, and solutions on demand—can undermine the learning processes tutoring is meant to support. A learner may complete more homework, produce better code, write a more polished essay, or solve more mathematics problems with AI assistance, while becoming less able to perform independently. In education, performance during assisted practice is not the same as learning.
This paper calls that distinction learning preservation. A tutoring system preserves learning when it helps the learner without replacing the mental operations that produce durable competence. Those operations include retrieval, prediction, planning, error detection, self-explanation, comparison, abstraction, transfer, and metacognitive monitoring. A tutor that removes all difficulty may feel helpful, but it can also remove the effort that makes learning stick.
The OECD’s 2026 Digital Education Outlook frames the same issue directly: generative AI can be a learning partner when guided by clear teaching principles, but unguided use can become a shortcut that improves task completion without producing real learning gains. The report emphasizes that performance on a task does not automatically translate into learning, especially when AI offloads the cognitive activity students need to practice. (OECD)
This problem is not simply a matter of model accuracy. It is a problem of instructional control. A correct answer can be pedagogically wrong if it arrives too early. A detailed explanation can be harmful if it substitutes for self-explanation. A helpful hint can create dependency if it never fades. A tutor can be polite, fluent, and factually accurate while still failing to support durable learning.
The strongest recent tutoring research therefore shifts the question from “Can models answer?” to “Can systems teach?” SafeTutors makes that shift explicit by defining tutoring safety around educational harms such as answer over-disclosure, misconception reinforcement, and failure to scaffold. It also shows why single-turn evaluation is inadequate: multi-turn tutoring can reveal much worse pedagogical failure than isolated prompt-response tests. (arXiv)
The thesis of this paper is:
The central challenge for generative AI tutoring is not whether models can explain content. It is whether tutoring systems can regulate help so that learners still retrieve, reason, explain, struggle productively, and transfer knowledge without the model. The next generation of AI tutors should be evaluated not by fluency or immediate task completion, but by durable learning.
2. From intelligent tutoring systems to generative tutoring
Generative AI tutors are not the beginning of adaptive instruction. They belong to a longer lineage of tutoring, mastery learning, feedback systems, learning analytics, and intelligent tutoring systems.
Bloom’s “2 sigma” argument made one-to-one tutoring the aspirational benchmark for instructional effectiveness, even though later work showed that actual tutoring effects vary by implementation, dosage, structure, and context. (Sage Journals) VanLehn’s review later compared human tutoring, intelligent tutoring systems, and other tutoring systems, arguing that well-designed computer tutoring could approach human tutoring on some measures. (Tandfonline) Kulik and Fletcher’s meta-analysis of intelligent tutoring systems found substantial effects across controlled evaluations, while also reinforcing a recurring warning: effects are often larger on tests closely aligned with the tutoring system than on broader measures. (Sage Journals)
Classic intelligent tutoring systems had two strengths that generic chatbots lack. First, they often represented domain knowledge explicitly: skills, steps, misconceptions, constraints, and prerequisite relations. Second, they controlled the pedagogical move space: when to give feedback, when to hint, when to move to the next problem, and when to require more practice. ASSISTments, for example, combined immediate feedback and hints for students with teacher-facing reports, producing a classroom feedback loop rather than merely a conversational interface. (Sage Journals)
Generative AI changes the interface and authoring economics. Large language models can parse open-ended learner language, generate explanations, adapt wording, draft hints, create examples, and respond conversationally across many domains. They can support writing, language learning, programming, conceptual science, mathematics explanation, professional training, and simulation debriefing in ways older systems found difficult.
But generative AI also weakens a key boundary. Older systems were often limited to particular actions: select a hint, mark a step, choose a problem. A general-purpose model can simply solve the task. Unless constrained by instructional design, the default behavior of an LLM is to satisfy the user’s request, and many student requests are not learning-preserving. “Give me the answer” is useful for task completion but often harmful for learning.
A recent K–12 ITS meta-analysis also sharpens the historical comparison. Leite and colleagues found a smaller overall effect than older ITS syntheses and identified moderators such as worked-out examples, duration, condition, outcome type, and immediate measurement. This reinforces the need to evaluate not only whether a tutoring technology works, but under what pedagogical design, outcome measure, and time horizon it works. (arXiv)
The lesson is that generative AI tutors should inherit the discipline of intelligent tutoring systems rather than replace it. The LLM should be a flexible dialogue and generation layer inside a controlled instructional loop, not the entire tutor.
3. The emerging evidence base
3.1 Structured AI tutors can improve learning
Recent randomized and field studies show that generative AI can support learning when embedded in structured tutoring systems.
A 2025 Scientific Reports study in undergraduate physics compared a carefully designed GPT-4-based tutor with an active-learning classroom condition. The AI tutor used research-based instructional design, expert-crafted prompts, structured activities, and sequential scaffolding rather than open-ended chatbot access. Students in the AI condition learned more in less time, with learning gains reported as more than double those in the active-learning condition, and median time on task was 49 minutes rather than a 60-minute class period. The authors emphasize that the tutor’s success depended on careful design, not merely a system prompt. (Nature)
The LearnLM/Eedi classroom RCT extends this evidence to UK secondary-school mathematics. In that study, 165 students across five schools used LearnLM, a pedagogically fine-tuned Gemini-family model, inside the Eedi platform with expert tutors supervising the model’s drafted messages. Supervising tutors approved 76.4% of LearnLM drafts with zero or minimal edits, and students guided by LearnLM plus human supervision were 5.5 percentage points more likely to solve novel subsequent problems than students tutored by human tutors alone. (arXiv)
The Tutor CoPilot study points toward another promising pattern: using AI to improve human tutoring rather than replacing tutors. In a randomized trial with 900 tutors and 1,800 K–12 students, AI suggestions helped tutors use more high-quality instructional strategies such as guiding questions and student explanation prompts, with larger benefits for students of lower-rated or less-experienced tutors. (arXiv)
The evidence from Nigeria broadens the deployment context. A World Bank randomized trial evaluated a six-week GPT-4/Copilot-supported after-school English program for senior secondary students, showing that structured AI-supported learning can be meaningful outside elite university settings. The intervention bundled AI access, teacher support, after-school time, and digital-skills exposure, so it should be interpreted as evidence for structured AI-supported programs rather than for unguided chatbot use. (Open Knowledge Repository)
Across these studies, the common pattern is not “AI access improves learning.” It is more specific: structured AI tutoring, embedded in a pedagogical design, can improve learning under some conditions.
3.2 Unguided AI can improve practice while harming learning
The strongest warning evidence comes from mathematics. Bastani and colleagues’ high-school field experiment found that unrestricted GPT-style access improved practice performance but reduced later unassisted performance, while a guarded tutor using teacher-designed hints and answer restrictions mitigated the harm. The supplied prior evidence map reports the headline pattern: GPT Base improved practice performance but reduced later unassisted exam performance, while GPT Tutor improved practice and largely avoided the later performance loss. (SSRN)
This result should be treated as a design theorem:
Assisted task performance is not evidence of learning unless learners can later perform without the assistance.
It also explains why “helpfulness” is a dangerous optimization target. A system that maximizes immediate correctness may minimize retrieval, planning, and self-explanation. It may turn practice into production support.
The same principle appears in programming education. A three-year classroom study of AI-supported introductory Python courses tracked student familiarity, dialogue logs, and course records across successive AI-enabled cohorts. The authors frame the challenge not as whether students will use AI, but how educators can preserve agency and productive learning as AI use becomes normal. (arXiv)
3.3 Cognitive offloading is not one thing
Cognitive offloading can help or harm. Offloading tedious or extraneous work may free attention for deeper reasoning. Offloading the reasoning itself can hollow out learning. The University of Technology Sydney report by Lodge and Loble makes this distinction central: students with stronger domain knowledge and metacognitive skills may use AI to offload lower-order tasks productively, while students without those foundations are more susceptible to detrimental offloading. The report also treats this as an equity issue, because learners who most need cognitive practice may be most likely to lose it through unstructured AI use. (University of Technology Sydney)
This distinction is crucial for tutor design. The question is not whether a tutor should reduce effort. It is which effort it should reduce. A good tutor reduces extraneous load, not germane learning effort. It may simplify wording, chunk information, or clarify notation, but it should preserve the learner’s responsibility to retrieve, choose, justify, and check.
3.4 The benchmark frontier is shifting from answer quality to pedagogy
A new generation of benchmarks evaluates tutoring behavior rather than generic response quality.
SafeTutors evaluates AI tutors across safety and pedagogy together. It defines tutoring-specific harms, reports broad pedagogical failure across models, and finds that multi-turn interaction can expose much higher failure rates than single-turn evaluation. (arXiv)
MathDial showed early on that strong problem-solving ability does not imply strong tutoring ability: GPT-3 could solve many math problems but often gave factually wrong feedback or revealed solutions too early in tutoring dialogues. (arXiv) MathTutorBench and MRBench similarly target open-ended pedagogical capabilities such as mistake identification, mistake localization, guidance, actionability, and answer revealing. (arXiv) TutorBench reports that frontier models remain weak on core tutoring skills such as adaptive explanation, actionable feedback, and active-learning hint generation. (Scale)
KMP-Bench extends this direction for K–8 mathematics. It includes dialogue-level and skill-level modules, uses a 4.6K dialogue evaluation set, and assesses tutoring against core pedagogical principles including challenging, explanation, modeling, practice, questioning, and feedback. Its results reinforce the solver-tutor gap: leading models may do well on verifiable solution tasks while struggling with nuanced pedagogical application. (arXiv)
ConvoLearn addresses the data side of the problem. The April 2026 version reports 2,134 semi-synthetic tutor-student dialogues grounded in learning-science dimensions such as cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. It also shows that fine-tuning on such data can steer an open model toward more dialogic tutoring behavior. (arXiv)
The direction of the field is clear: AI tutor evaluation is moving beyond “Was the answer correct?” toward “Was the instructional move appropriate for this learner at this moment?”
4. What must be preserved for learning?
4.1 Retrieval
Learners need opportunities to recall information without immediately seeing it. Retrieval practice strengthens retention more effectively than passive restudy in many contexts, and tutoring systems should therefore schedule moments when the learner must answer from memory before receiving help. (Sage Journals)
AI tutoring can easily erase retrieval by giving the needed fact, formula, code pattern, or explanation before the learner attempts recall. A learning-preserving tutor should ask: “What do you remember?”, “Which principle applies?”, “Try the next step before I help,” or “Explain the rule in your own words.”
4.2 Reasoning and planning
Many learning tasks require selecting a strategy, not merely executing a step. If a tutor identifies the strategy too early, the learner may complete the problem without practicing problem representation. For mathematics, this may mean choosing an equation. For programming, it may mean choosing a loop, condition, data structure, or test case. For writing, it may mean deciding the claim and evidence structure.
The tutor should distinguish between helping with execution and replacing planning. If the student is stuck, the first move should often be an orienting question: “What is the goal?”, “What are the known quantities?”, “What would count as evidence?”, “What is the simplest case?”
4.3 Self-explanation
Self-explanation is one of the most important learning-preserving behaviors. It forces learners to connect steps, principles, and representations. Research on self-explanation shows benefits for problem solving and understanding, especially when learners are prompted to explain why a step is valid or how a principle applies. (ScienceDirect)
A generative tutor should therefore treat self-explanation as evidence. The learner’s explanation reveals misconceptions, shallow understanding, missing links, and overconfidence. A tutor should request explanations, evaluate them, and adapt support based on their quality.
4.4 Productive struggle
Productive struggle does not mean leaving students unsupported. It means preserving manageable difficulty long enough for learners to generate, test, and revise ideas. Productive failure research shows that initial struggle can prepare learners for later instruction when the struggle is well designed and followed by consolidation. (Tandfonline)
Generative AI tutors should regulate struggle rather than eliminate it. The right response to confusion is not always an explanation. It may be a smaller subproblem, a contrasting case, a partial hint, or a request to articulate what seems confusing.
4.5 Cognitive load balance
Cognitive load theory distinguishes between the inherent difficulty of the material, avoidable difficulty introduced by poor design, and productive effort invested in schema construction. (Wiley Online Library) AI tutors should reduce avoidable difficulty: unclear language, irrelevant detail, poor formatting, missing examples, or confusing notation. They should not remove the productive effort required to form durable understanding.
This is why shorter tutor responses are often better than long ones. Many LLM explanations are overlong, over-complete, and over-confident. A tutor response should be just enough to move the learner’s thinking forward.
4.6 Transfer
Learning is not demonstrated by solving the exact problem with help. It is demonstrated when the learner can solve a similar problem without help, retain the skill later, and apply the idea in a new context. Transfer checks must therefore be built into tutoring systems, not treated as optional end-of-course assessments.
5. Pedagogical safety as learning preservation
Pedagogical safety means preventing avoidable educational harms. It is not the same as content moderation, nor is it limited to preventing offensive or dangerous outputs. In tutoring, a safe response is one that preserves the learner’s opportunity to learn.
SafeTutors is important because it evaluates tutoring safety as a pedagogical construct. It identifies harms such as answer over-disclosure, misconception reinforcement, and failure to scaffold, and it reports that failures become more visible in multi-turn interaction. (arXiv)
A learning-preservation taxonomy should include at least the following failure modes.
| Failure mode | Description | Learning risk | Better tutor behavior |
|---|---|---|---|
| Answer over-disclosure | Gives the final answer before the learner has made a meaningful attempt. | Removes retrieval, planning, and reasoning. | Require an attempt or prediction first. |
| Premature worked solution | Shows a full solution when a hint would suffice. | Turns practice into passive reading. | Use hint ladders before full worked examples. |
| Weak scaffolding | Gives vague encouragement or generic explanation without diagnosing the learner’s state. | Learner remains stuck or misunderstands why. | Diagnose the local error and give a targeted next step. |
| Misconception reinforcement | Accepts or builds on incorrect reasoning. | Strengthens wrong mental models. | Name the misconception and contrast it with the correct concept. |
| Over-scaffolding | Breaks every task into steps indefinitely. | Learner never learns to plan independently. | Fade prompts and schedule unassisted checks. |
| Cognitive offloading | Allows AI to do the learner’s core thinking. | Produces high-quality output without internal competence. | Distinguish productive support from substitution. |
| False mastery | Learner appears successful during assisted practice but fails without help. | Inflated confidence and weak retention. | Use unassisted near-transfer and delayed checks. |
| Low learner agency | Tutor controls every move. | Learner becomes passive and dependent. | Ask learner to choose, justify, and monitor strategies. |
| Poor confidence calibration | Learner is confident for the wrong reasons or distrusts correct reasoning. | Bad self-monitoring and poor help-seeking. | Ask for confidence ratings and compare with performance. |
| Lack of retrieval practice | Tutor always supplies facts, steps, or formulas. | Weak retention. | Prompt recall before explanation. |
| Lack of transfer checks | Tutor measures only the current item. | No evidence of generalization. | Include near-transfer and far-transfer tasks. |
| Multi-turn drift | Tutor starts with scaffolding but gradually becomes solution-giving. | Dialogue deteriorates over time. | Track hint level, answer boundaries, and prior attempts. |
| Tutor sycophancy | Tutor validates incorrect student reasoning to maintain rapport. | Misconceptions persist. | Separate warmth from correctness; correct errors clearly. |
| One-size Socratic questioning | Tutor keeps asking questions even when the learner needs a worked example. | Frustration, overload, inefficient learning. | Switch modes based on learner state. |
Sycophancy deserves explicit attention. Hierarchical Pedagogical Oversight identifies the tendency of tutors to validate incorrect reasoning or give overly direct answers as a structural problem in current tutor agents. Its multi-agent oversight approach improved evaluation performance on MRBench, suggesting that tutor outputs may need explicit pedagogical review rather than relying on a single model’s conversational instincts. (arXiv)
Solution leakage under motivated answer-seeking is also a learning-preservation problem. SHAPE shows that educational models can be induced to provide direct answers when the learner lacks mastery, and proposes explicit gating based on inferred prerequisites and mastery gaps. (arXiv) Work on answer leakage robustness similarly studies cases where students actively try to obtain final answers rather than scaffolding, arguing that tutor helpfulness must be bounded by pedagogical purpose. (arXiv)
This is not a call for rigid answer refusal. Sometimes direct instruction and worked examples are appropriate. The key is mode awareness. A student in worked-example mode should see the solution and then compare, explain, or complete a faded version. A student in practice mode should usually attempt, receive graduated hints, and then complete an unassisted check.
6. Dialogic tutoring and constructive learning
Good tutoring is not just explanation. It is dialogue that elicits, listens, probes, and adapts.
Dialogic tutoring treats the learner’s utterances as evidence. The tutor asks questions not to simulate Socratic style but to reveal the learner’s model. It then responds contingently: sometimes challenging, sometimes confirming, sometimes simplifying, sometimes asking for justification, sometimes offering a worked example.
ConvoLearn’s contribution is to operationalize this dialogic view. It builds tutor-student dialogues around learning-science dimensions including cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. These categories are useful because they move beyond generic “helpfulness” toward observable pedagogical behavior. (arXiv)
A dialogic AI tutor should therefore do five things consistently.
First, it should elicit thinking. Before explaining, it should ask what the learner already knows, what step they tried, or why they chose a strategy.
Second, it should diagnose from evidence. It should distinguish an arithmetic slip from a conceptual error, a missing prerequisite from a careless reading, and low confidence from low knowledge.
Third, it should respond at the right grain size. A local procedural error needs a local hint. A misconception needs contrast. A novice facing a new schema may need a worked example.
Fourth, it should hold the learner accountable. It should ask for justification, not merely accept answers. It should avoid praising unsupported reasoning as if it were understanding.
Fifth, it should promote metacognition. It should ask learners to predict difficulty, rate confidence, check answers, compare strategies, and reflect on what changed.
These behaviors are measurable. They can be coded in dialogue logs, evaluated by rubrics, and linked to learning outcomes. That is the path from chatbot quality to tutoring quality.
7. Adaptive scaffolding and learner modeling
Personalization should not be framed as “learning styles.” A learning-preserving tutor should adapt to educationally meaningful learner states.
7.1 What the tutor should track
A useful learner model should track at least six constructs.
| Learner-state construct | Evidence | Pedagogical use |
|---|---|---|
| Prior knowledge | Pretest, first attempts, explanation quality | Decide whether to use worked examples, hints, or transfer tasks. |
| Current misconception | Error patterns, explanations, wrong rules | Choose misconception contrast or targeted feedback. |
| Confidence calibration | Confidence ratings versus correctness | Trigger reflection, verification, or calibration prompts. |
| Help-seeking pattern | Timing and type of help requests | Detect productive versus substitutive AI use. |
| Self-explanation quality | Completeness, causal links, principle use | Decide whether to advance, prompt, or reteach. |
| Readiness for faded support | Performance with decreasing hints | Move toward unassisted checks. |
The help-seeking construct is especially important for generative AI. Two learners may both solve the current problem, but one may have reasoned independently while the other extracted a solution. Without modeling help-seeking, the system cannot distinguish learning from substitution.
7.2 The pedagogical policy
The learner model should feed an explicit pedagogical policy. That policy decides the next instructional move.
Examples:

- If the learner has not attempted the task, ask for a first step, prediction, or explanation.
- If the learner requests a final answer, ask what they have tried and offer a low-level hint.
- If the learner is correct but cannot explain, ask for justification before advancing.
- If the learner is incorrect and highly confident, trigger misconception contrast.
- If the learner is incorrect and low confidence, reduce extraneous load and provide a targeted hint.
- If repeated hints fail, switch to a worked example, then return to a near-transfer problem.
- If the learner succeeds with hints, fade support and schedule an unassisted check.
- If the learner shows high assisted performance and low unassisted performance, flag false mastery.
This policy should be explicit, versioned, inspectable, and testable. A hidden prompt is not enough.
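To make "explicit and testable" concrete, here is a minimal sketch in Python of such a policy as an inspectable rule table. The state fields, thresholds, and move names are illustrative assumptions, not part of any cited system:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Move(Enum):
    ASK_FIRST_STEP = auto()
    LOW_LEVEL_HINT = auto()
    REQUEST_JUSTIFICATION = auto()
    MISCONCEPTION_CONTRAST = auto()
    TARGETED_HINT = auto()
    WORKED_EXAMPLE = auto()
    FADE_AND_CHECK = auto()
    FLAG_FALSE_MASTERY = auto()

@dataclass
class LearnerState:
    attempted: bool
    requested_answer: bool
    correct: bool | None        # None = no gradable attempt yet
    can_explain: bool
    confidence: float           # 0.0-1.0 pre-feedback self-rating
    failed_hints: int
    assisted_score: float
    unassisted_score: float

def next_move(s: LearnerState) -> Move:
    """Each rule mirrors one line of the prose policy above."""
    if s.assisted_score > 0.8 and s.unassisted_score < 0.5:
        return Move.FLAG_FALSE_MASTERY
    if not s.attempted:
        return Move.ASK_FIRST_STEP
    if s.requested_answer:
        return Move.LOW_LEVEL_HINT
    if s.correct and not s.can_explain:
        return Move.REQUEST_JUSTIFICATION
    if s.correct is False and s.confidence >= 0.7:
        return Move.MISCONCEPTION_CONTRAST
    if s.correct is False and s.failed_hints >= 3:
        return Move.WORKED_EXAMPLE
    if s.correct is False:
        return Move.TARGETED_HINT
    return Move.FADE_AND_CHECK      # correct and explained: fade and check
```

Because the policy is ordinary code, it can be versioned, diffed, unit-tested against dialogue scenarios, and audited, none of which is possible for a policy that exists only inside a prompt.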
7.3 Fading support
Scaffolding is only scaffolding if it can be removed. A tutor that always decomposes tasks for the learner may create a new dependency. Faded scaffolding should therefore be built into the system:
- Model the process.
- Ask the learner to complete a partial step.
- Ask the learner to choose the next step.
- Ask the learner to solve a similar problem with fewer hints.
- Ask the learner to explain the strategy from memory.
- Ask the learner to transfer the strategy to a new context.
Worked examples are not opposed to learning preservation. They are powerful when used at the right time, especially for novices. The risk is not the worked example itself; it is using worked examples as a substitute for later retrieval, explanation, and transfer. Research on worked examples and cognitive load supports their value, especially when they reduce unnecessary search for novices. (Education NSW)
8. A reference architecture for learning-preserving AI tutors
The appropriate unit of design is not the chatbot. It is the instructional loop.
Learner input → evidence capture → learner model → pedagogical policy → grounded generation → response → learner attempt → unassisted check → dashboard and analytics
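As an illustration of the loop as code rather than prose, the following Python skeleton wires the stages together. Every component name and stub body here is hypothetical; a production system would replace each stand-in with the modules described in the subsections below:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Minimal state for one tutoring loop; real systems persist far more."""
    learner_model: dict = field(default_factory=lambda: {"hint_level": 0})
    log: list = field(default_factory=list)

def capture_evidence(learner_input: str) -> dict:
    # Placeholder: real capture parses attempts, steps, timing, help requests.
    return {"attempted": bool(learner_input.strip()), "text": learner_input}

def choose_move(model: dict, evidence: dict) -> str:
    # Stand-in for the pedagogical policy of Section 7.2.
    return "ask_first_step" if not evidence["attempted"] else "targeted_hint"

def render(move: str) -> str:
    # Stand-in for grounded generation: template lookup before any LLM call.
    templates = {
        "ask_first_step": "What would your first step be?",
        "targeted_hint": "Check the operation you applied to each side.",
    }
    return templates[move]

def tutoring_turn(learner_input: str, session: Session) -> str:
    """Input -> evidence -> model update -> policy -> grounded response -> log."""
    evidence = capture_evidence(learner_input)
    if not evidence["attempted"]:
        session.learner_model["hint_level"] += 1
    move = choose_move(session.learner_model, evidence)
    response = render(move)
    session.log.append((move, evidence))   # feeds dashboards and audits
    return response
```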
8.1 Domain model
The domain model defines what there is to learn. It should include:
| Element | Example |
|---|---|
| Learning objectives | Solve linear equations with variables on both sides. |
| Knowledge components | Combine like terms; preserve equality; isolate variable. |
| Misconceptions | Applies operation to one side only; distributes sign incorrectly. |
| Prerequisites | Integer operations; inverse operations; equation meaning. |
| Representations | Symbolic equation, word problem, graph, table. |
| Transfer targets | Novel equation forms, real-world constraints, multi-step problems. |
The LLM can help author this model, but the system should not rely only on latent model knowledge. For serious learning, the curriculum structure must be inspectable.
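One way to make the curriculum structure inspectable is to hold it in explicit data rather than in model weights. A minimal sketch, with illustrative names drawn from the table above:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeComponent:
    name: str
    prerequisites: list[str] = field(default_factory=list)
    misconceptions: list[str] = field(default_factory=list)

@dataclass
class DomainModel:
    objective: str
    components: list[KnowledgeComponent]
    representations: list[str]
    transfer_targets: list[str]

linear_equations = DomainModel(
    objective="Solve linear equations with variables on both sides",
    components=[
        KnowledgeComponent("combine-like-terms",
                           prerequisites=["integer-operations"]),
        KnowledgeComponent("preserve-equality",
                           misconceptions=["applies-operation-to-one-side-only"]),
        KnowledgeComponent("isolate-variable",
                           prerequisites=["inverse-operations"]),
    ],
    representations=["symbolic", "word-problem", "graph", "table"],
    transfer_targets=["novel-equation-forms", "multi-step-problems"],
)
```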
8.2 Evidence capture
The system should capture more than final correctness. It should capture:
- the learner’s first attempt;
- intermediate steps;
- natural-language explanations;
- hint requests;
- time before asking for help;
- confidence ratings;
- revisions after feedback;
- whether the final response was produced with or without help;
- performance on later unassisted checks.
This evidence allows the system to distinguish productive struggle from answer extraction.
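One possible evidence record, sketched in Python with illustrative field names and a hedged heuristic for flagging answer extraction; the ten-second threshold is an assumption for illustration, not an empirical constant:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AttemptEvidence:
    """One practice event; fields mirror the capture list above."""
    item_id: str
    first_attempt: str
    steps: list[str]
    explanation: str | None
    hints_requested: int
    seconds_before_help: float | None   # None = never asked for help
    confidence: float                   # 0.0-1.0 pre-feedback self-rating
    revised_after_feedback: bool
    assisted: bool                      # was any help used on the final response?
    timestamp: datetime

def looks_like_answer_extraction(e: AttemptEvidence) -> bool:
    """Heuristic: help requested almost immediately, no attempt, no explanation."""
    return (e.hints_requested > 0
            and (e.seconds_before_help or 0) < 10
            and not e.first_attempt.strip()
            and e.explanation is None)
```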
8.3 Learner model
The learner model estimates knowledge, misconception state, self-regulation, confidence calibration, and help-seeking behavior. It should not simply assign a mastery score after each correct answer. Correctness with heavy help is not mastery.
8.4 Pedagogical policy
The pedagogical policy chooses the next move. This is the core of learning preservation. It should specify when to ask, hint, explain, model, correct, withhold, fade, check, or escalate.
8.5 Grounded generation
The generation layer should be grounded in approved educational materials: curriculum explanations, teacher-authored hints, rubrics, worked examples, misconception libraries, and assessment criteria. Retrieval-augmented generation is helpful, but the retrieved content should be pedagogically typed. A definition, hint, worked example, misconception warning, and assessment rubric should not be treated as interchangeable passages.
RAG-based assessment work in higher education shows why grounding matters. One 2026 system used structured retrieval over rubric criteria, exemplar essays, and instructor feedback to generate scores and formative comments for 701 essays, reporting high agreement with human evaluators and rubric-aligned feedback. (arXiv)
8.6 Verification tools
In mathematics, programming, formal reasoning, and science, the LLM should not be the sole verifier. It should call symbolic solvers, code execution environments, proof checkers, simulators, calculators, or teacher-authored answer keys when appropriate. This reduces hallucinated feedback and protects learners from fluent but wrong instruction.
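For example, a tutor for linear equations can delegate answer checking to a symbolic solver such as SymPy rather than trusting model-generated feedback. The sketch below is illustrative and assumes single-variable equations in x:

```python
from sympy import Eq, solve, symbols, sympify

def verify_linear_answer(equation: str, student_answer: str) -> bool:
    """Check a claimed solution symbolically instead of asking the LLM.

    equation: e.g. "2*x + 3 = 11"; student_answer: e.g. "4".
    """
    x = symbols("x")
    lhs, rhs = (sympify(side) for side in equation.split("="))
    solutions = solve(Eq(lhs, rhs), x)
    return sympify(student_answer) in solutions

assert verify_linear_answer("2*x + 3 = 11", "4")
assert not verify_linear_answer("2*x + 3 = 11", "7")
```

The policy can then condition its feedback on a verified correctness signal, so a misleading but fluent model judgment never reaches the learner unchecked.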
8.7 Unassisted checks
Every tutoring loop should include unassisted evidence. After assisted practice, the learner should solve a similar item without hints, explain a rule from memory, or transfer the idea to a new representation. Without unassisted evidence, the system cannot know whether it taught or merely helped.
8.8 Teacher dashboard
Teacher-facing analytics should report learning-relevant signals, not just chat volume. Useful dashboard indicators include:
- common misconceptions;
- students with high assisted but low unassisted performance;
- rapid hint escalation patterns;
- skipped self-explanation prompts;
- confidence-performance mismatch;
- skills needing reteaching;
- learners ready for faded support;
- learners needing human intervention.
The dashboard should help teachers decide what to reteach, which students need support, and where the tutor may be creating dependency.
9. Design-pattern catalogue
Pattern 1: Attempt gates
Before giving substantive help, require the learner to do something: identify known quantities, make a prediction, choose a principle, write a first line of code, explain what they tried, or state where they are confused.
Attempt gates protect retrieval and planning. They also give the tutor evidence for diagnosis.
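A minimal attempt gate might look like the following sketch; the keyword heuristic and gating messages are placeholders for a real request classifier connected to the learner model:

```python
def attempt_gate(request: str, attempt_log: list[str]) -> str | None:
    """Return a gating prompt if substantive help should be withheld, else None."""
    asks_for_answer = any(
        k in request.lower() for k in ("answer", "solution", "solve it")
    )
    if not attempt_log:
        # No recorded attempt yet: require evidence before any substantive help.
        return "Before I help: what do you already know, and what is your first step?"
    if asks_for_answer:
        return "Show me your latest step first, and I will help you check it."
    return None  # gate open: diagnosis and hints may proceed
```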
Pattern 2: Hint ladders
A hint ladder moves from general to specific:
| Level | Tutor move |
|---|---|
| 1 | Orienting question |
| 2 | Relevant principle |
| 3 | Local error cue |
| 4 | Partial step |
| 5 | Worked example |
| 6 | Final answer with explanation, only when instructionally justified |
The tutor should log hint level reached. High hint levels should trigger later unassisted checks.
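The ladder can be represented as ordered data so that the level reached is logged automatically. The rung texts below are illustrative, keyed to the linear-equation example used earlier:

```python
HINT_LADDER = [
    "What is the question asking you to find?",             # 1: orienting question
    "Which principle connects what you know to the goal?",  # 2: relevant principle
    "Look again at what you did to the left side.",         # 3: local error cue
    "Subtract 3 from both sides; what remains?",            # 4: partial step
    "<worked example>",                                     # 5: full worked example
    "<final answer with explanation>",                      # 6: only when justified
]

def next_hint(level: int) -> tuple[str, int]:
    """Serve the next rung and the updated level. Reaching a high level should
    schedule a later unassisted check (Pattern 8), not just end the exchange."""
    level = min(level, len(HINT_LADDER) - 1)
    return HINT_LADDER[level], level + 1
```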
Pattern 3: Self-explanation prompts
The tutor asks the learner to explain why a step works, how two examples differ, what rule applies, or what changed after feedback. The tutor evaluates explanation quality, not merely answer correctness.
Pattern 4: Misconception contrast
When the learner applies a wrong rule, the tutor contrasts the misconception with the correct concept:
“You subtracted 3 from the left side only. Equations require preserving equality, so whatever operation you apply must apply to both sides. Try rewriting the step with that constraint.”
This is more effective than simply saying “incorrect.”
Pattern 5: Worked-example mode
Worked examples should be explicit modes, not accidental answer leaks. In worked-example mode, the tutor shows a solution but requires active processing: compare steps, fill in missing reasoning, explain why a move is valid, or solve a faded example afterward.
Pattern 6: Faded scaffolding
The tutor gradually removes support. It may start with a full worked example, move to partial examples, then to hints, then to unassisted problems.
Pattern 7: Retrieval checks
After explanation, the tutor asks the learner to recall the rule, solve a similar item, or explain from memory. Retrieval checks should occur before the learner sees the answer again.
Pattern 8: Unassisted near-transfer checkpoints
The tutor periodically asks the learner to solve a similar but not identical task without AI help. These checkpoints distinguish assisted performance from learning.
Pattern 9: Confidence calibration
The tutor asks the learner to rate confidence before seeing feedback, then compares confidence with correctness and explanation quality. Overconfidence plus wrong reasoning triggers misconception repair. Low confidence plus correct reasoning triggers consolidation.
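A sketch of the calibration logic, assuming a 0.0-1.0 confidence scale; the thresholds are illustrative defaults, not validated cutoffs:

```python
def calibration_move(confidence: float, correct: bool) -> str:
    """Map a pre-feedback confidence rating and outcome to a tutor move."""
    if confidence >= 0.7 and not correct:
        return "misconception_repair"   # confident and wrong: contrast the error
    if confidence < 0.4 and correct:
        return "consolidate"            # right but unsure: reinforce the reasoning
    if confidence >= 0.7 and correct:
        return "fade_support"           # calibrated success: reduce help
    return "targeted_hint"              # unsure and wrong: lower load, hint locally
```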
Pattern 10: Answer-boundary handling
When the learner tries to obtain a final answer during practice, the tutor responds pedagogically:
“I won’t give the final answer yet because this is practice. Tell me your first step, and I’ll help you check it.”
This preserves learning while still offering support. SHAPE’s explicit gating approach formalizes this distinction by routing between instructing and problem-solving based on inferred mastery. (arXiv)
Pattern 11: Teacher dashboard feedback
The tutor summarizes learning evidence for instructors: common misconceptions, false mastery risks, over-help patterns, and suggested small-group interventions.
Pattern 12: Human escalation
The system should escalate when the learner shows persistent misconception, repeated failure after scaffolding, distress, ambiguous domain correctness, or high assisted performance with low unassisted mastery. Human escalation is not a failure of AI. It is part of responsible tutoring.
10. Evaluation: learning outcomes, not chatbot quality
Evaluation is now the bottleneck. A tutor can be coherent, accurate, friendly, and preferred by students while failing to improve durable learning. Conversely, a tutor can feel more demanding and still produce better retention and transfer.
The evaluation hierarchy should be:
| Level | Measure | Core question | Why it matters |
|---|---|---|---|
| 1 | Response quality | Is the tutor coherent, accurate, safe, and actionable? | Necessary but not sufficient. |
| 2 | Assisted task performance | Can the learner complete the current task with help? | Measures support, not learning. |
| 3 | Unassisted near-transfer | Can the learner solve a similar task without help? | First strong evidence of learning. |
| 4 | Delayed retention | Can the learner still perform later? | Distinguishes durable learning from short-term support. |
| 5 | Far transfer | Can the learner apply the idea in a new context? | Measures abstraction and flexible understanding. |
| 6 | Self-regulation and equity | Is the learner better at planning, monitoring, explaining, and seeking help? Who benefits or is harmed? | Measures long-term learner capacity and fairness. |
The warning is simple: Level 2 is not enough.
10.1 Benchmarks are necessary but insufficient
Benchmarks are useful for measuring pedagogical behavior at scale. The BEA 2025 shared task evaluated tutor responses for pedagogical ability across tracks such as mistake identification, guidance, and actionability, with many systems still far from expert performance. (arXiv) MRBench, MathTutorBench, TutorBench, SafeTutors, KMP-Bench, SHAPE, and related benchmarks all contribute useful measurement layers. (ACL Anthology)
But benchmarks cannot replace learning studies. A model can score well on single-turn hint quality and still fail over time by over-scaffolding, mis-sequencing practice, leaking answers under pressure, or failing to schedule retrieval. SafeTutors’ multi-turn findings are especially important here: pedagogical failure can worsen across dialogue, which means static evaluation misses a core property of tutoring. (arXiv)
10.2 RCTs need better instrumentation
Future studies should not merely compare “AI access” with “no AI access.” They should compare pedagogical policies and log learning-preserving behaviors:
| Instrumentation target | Example measure |
|---|---|
| Attempt behavior | Time before first attempt; attempt completeness. |
| Help-seeking | Hint requests, answer requests, rapid escalation. |
| Tutor behavior | Hint level, answer reveal, explanation length. |
| Self-explanation | Explanation quality and revision. |
| Cognitive offloading | Copying, solution extraction, lack of independent reasoning. |
| False mastery | Assisted success paired with unassisted failure. |
| Retention | Delayed post-test performance. |
| Transfer | Novel context or representation. |
| Equity | Differential effects by prior knowledge, language, confidence, access, and self-regulation. |
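As one example of how such logs could be reduced to actionable flags, the following sketch computes a false-mastery indicator from per-learner assisted and unassisted accuracy; the 0.3 gap threshold is an illustrative assumption:

```python
def false_mastery_flags(records: list[dict], gap: float = 0.3) -> list[str]:
    """Flag learners whose assisted success outruns their unassisted success.

    records: per-learner dicts with 'learner', 'assisted', and 'unassisted'
    accuracy in [0, 1].
    """
    return [r["learner"] for r in records
            if r["assisted"] - r["unassisted"] > gap]

flags = false_mastery_flags([
    {"learner": "s01", "assisted": 0.95, "unassisted": 0.40},  # flagged
    {"learner": "s02", "assisted": 0.80, "unassisted": 0.75},  # not flagged
])
assert flags == ["s01"]
```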
The core causal question is not “Does AI help?” It is:
Which pedagogical policies help which learners preserve which cognitive processes under which conditions?
10.3 Evaluate against active baselines
Weak evaluations compare AI tutoring against nothing. Strong evaluations compare it against active learning, human tutoring, worked examples, existing adaptive practice, teacher-led review, or human-AI co-pilots. The Harvard physics RCT is notable because it used an authentic active-learning comparator rather than a passive baseline. (Nature)
11. Deployment principles
11.1 Build backward from independent performance
Start with what learners must be able to do without AI. Then design AI-supported practice to prepare them for that independent performance. If the final goal is unaided reasoning, practice must include unaided reasoning.
11.2 Separate practice mode from assessment mode
Learners and teachers should know when AI help is allowed, what kind of help is allowed, and what counts as independent mastery. AI-supported homework should not be treated as equivalent to unaided assessment.
11.3 Make the help policy explicit
The policy should define when the tutor may ask a question, give a hint, show a worked example, reveal an answer, correct directly, request self-explanation, or escalate to a human. If the policy exists only as a prompt, it is too fragile.
11.4 Treat prompts as instructional code
Tutor prompts, rubrics, hints, examples, and policies should be versioned, reviewed, tested, and connected to learning outcome data. The success of structured AI tutoring depends on instructional engineering, not just model choice.
11.5 Instrument for substitution
The system should detect patterns that suggest cognitive offloading: requests for final answers before attempts, copying tutor output, skipping explanation prompts, high assisted correctness with low unassisted correctness, and rapid hint escalation. These signals should be treated as learning-risk indicators, not automatically as misconduct.
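These signals can be computed directly from interaction logs. The sketch below assumes illustrative field names and thresholds; the output is a set of learning-risk indicators for teachers, not a misconduct detector:

```python
def substitution_risks(session: dict) -> list[str]:
    """Collect learning-risk indicators from one session's interaction log."""
    risks = []
    if session["answer_requests_before_attempt"] > 0:
        risks.append("asked for answers before attempting")
    if session["copied_tutor_output_ratio"] > 0.5:
        risks.append("copied most tutor output verbatim")
    if session["skipped_explanations"] > 2:
        risks.append("skipped self-explanation prompts")
    if session["assisted_accuracy"] - session["unassisted_accuracy"] > 0.3:
        risks.append("high assisted but low unassisted accuracy")
    if session["mean_seconds_to_hint"] < 15:
        risks.append("rapid hint escalation")
    return risks
```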
11.6 Preserve teacher agency
Teachers should be able to inspect curriculum grounding, see what hints the tutor gives, adjust pedagogical settings, review learner evidence, and override the system. Tutor autonomy should increase only where evidence supports it.
11.7 Govern model changes
A deployed tutor can change when the underlying model changes. Learning systems therefore need model-version logging, regression tests, benchmark gates, and revalidation protocols before updates affect learners.
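A regression gate can be expressed as an explicit check over pedagogical benchmark metrics before an update ships. The metric names and tolerance below are illustrative assumptions:

```python
def passes_regression_gate(old_metrics: dict, new_metrics: dict,
                           tolerance: float = 0.02) -> bool:
    """Block a model update if any pedagogical metric regresses beyond tolerance."""
    higher_is_better = {"scaffolding_quality", "misconception_detection"}
    for name, old in old_metrics.items():
        new = new_metrics[name]
        if name in higher_is_better:
            regressed = old - new > tolerance   # e.g. weaker scaffolding
        else:
            regressed = new - old > tolerance   # e.g. higher answer-leak rate
        if regressed:
            return False
    return True
```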
12. Research agenda
12.1 Long-term retention and transfer
Most generative tutor studies still measure short-term outcomes. The field needs semester-scale and year-scale studies with delayed post-tests, far-transfer tasks, and usage logs that distinguish productive AI use from substitution.
12.2 Learner modeling for cognitive offloading
The field needs better models of help-seeking and offloading. A learner who asks for an explanation after attempting a problem is different from a learner who immediately asks for a final answer. The model should track not only what the learner knows, but how the learner uses help.
12.3 Subject-specific pedagogical harms
SafeTutors reports that tutoring harms can be subject-specific. (arXiv) Mathematics, writing, programming, language learning, science labs, clinical reasoning, and workplace simulations have different answer boundaries, misconception structures, and transfer goals. Evaluation should not assume one universal tutor policy.
12.4 Dialogic training data
ConvoLearn shows one route: build datasets around learning-science dimensions and train models toward constructive dialogue. (arXiv) More work is needed across age groups, cultures, languages, accessibility needs, and domains.
12.5 Pedagogical reasoning models
PedagogicalRL-Thinking argues that educational alignment should shape not only the visible response but also the model’s pedagogical reasoning process. It introduces pedagogical reasoning prompting and a thinking reward to improve instructional decision-making. (arXiv) This is a promising direction, but it must be validated against real student learning, not only proxy ratings.
12.6 Human-AI tutoring configurations
The field should compare multiple operating modes:
| Mode | Description | Best use case |
|---|---|---|
| Autonomous AI tutor | AI directly tutors the learner. | Low-stakes practice with strong guardrails and checks. |
| Human-supervised AI tutor | AI drafts, human approves or edits. | Higher-stakes tutoring, younger learners, uncertain domains. |
| Tutor co-pilot | AI suggests moves to a human tutor. | Scaling tutor quality and supporting novice tutors. |
| Teacher dashboard assistant | AI summarizes evidence and recommends interventions. | Classroom orchestration. |
| Simulation debriefer | AI analyzes performance after role-play or practice. | Workplace learning and professional education. |
The LearnLM/Eedi and Tutor CoPilot studies represent different points in this design space: AI-drafted messages with human gating versus AI suggestions that help human tutors compose better moves. (arXiv)
12.7 Equity and access
Cognitive offloading may widen gaps if learners with strong metacognition use AI productively while learners with weaker foundations use it substitutively. (University of Technology Sydney) Equity evaluation should therefore measure not only average learning gains but differential effects by prior knowledge, self-regulation, language, disability, access, and teacher support.
13. Conclusion
Generative AI tutors will not be judged by whether they sound like patient teachers. They will be judged by whether learners can later perform without them.
The strongest current evidence supports a balanced position. Generative AI can produce meaningful learning gains when embedded in structured, grounded, scaffolded systems. It can support human tutors, personalize explanations, accelerate feedback, and make practice more responsive. But unrestricted AI access can also create false mastery, cognitive offloading, answer substitution, and long-term performance losses.
The next generation of AI tutoring should therefore be designed around learning preservation. The central instructional question is not “How can the tutor be more helpful?” It is “What help preserves the learner’s necessary cognitive work?”
That question leads to a concrete design agenda: attempt gates, hint ladders, self-explanation prompts, misconception contrast, faded scaffolding, retrieval checks, unassisted checkpoints, confidence calibration, grounded generation, teacher dashboards, and explicit pedagogical policies. It also leads to a concrete evaluation agenda: response quality and assisted performance are only the first levels. The decisive outcomes are unassisted near-transfer, delayed retention, far transfer, self-regulation, and equity.
The frontier is not a chatbot that answers every question. It is a tutor that knows when not to answer.
Appendix A: Failure-mode taxonomy for learning-preserving AI tutors
| Category | Failure mode | Observable signal | Mitigation |
|---|---|---|---|
| Help regulation | Answer over-disclosure | Final answer before attempt | Attempt gate |
| Help regulation | Premature worked solution | Full solution when learner needs small cue | Hint ladder |
| Help regulation | Over-scaffolding | Learner succeeds only with step-by-step prompts | Faded scaffolding |
| Diagnosis | Misconception reinforcement | Tutor validates wrong reasoning | Misconception contrast |
| Diagnosis | Shallow feedback | Generic praise or vague correction | Error-specific feedback |
| Dialogue | Multi-turn drift | Scaffolding degrades into answer-giving | Dialogue-state tracking |
| Dialogue | Tutor sycophancy | Tutor agrees with incorrect learner | Correctness-over-rapport policy |
| Cognitive effort | Cognitive offloading | Learner asks for output before thinking | Attempt and explanation gates |
| Cognitive effort | Lack of retrieval | Tutor supplies facts immediately | Recall-before-explain prompts |
| Cognitive effort | Lack of transfer | Only identical assisted practice | Near- and far-transfer checks |
| Mastery | False mastery | High assisted, low unassisted performance | Unassisted checkpoints |
| Metacognition | Poor confidence calibration | High confidence with wrong reasoning | Confidence-performance comparison |
| Agency | Low learner agency | Tutor chooses every move | Learner choice and planning prompts |
| Equity | Inequitable adaptation | Lower-performing learners get narrower tasks | Monitor transfer opportunities by subgroup |
| Governance | Uninspected model drift | Tutor behavior changes after model update | Regression tests and model-version logging |
Appendix B: Evaluation framework
B1. Minimum viable evaluation
A minimally credible AI tutor evaluation should include:
- A baseline stronger than “no support.”
- Measures of assisted performance.
- Measures of unassisted near-transfer.
- Delayed retention when feasible.
- Dialogue logs coded for help type and hint level.
- Subgroup analysis.
- Evidence of tutor accuracy and grounding.
- A record of answer reveals and worked-example use.
B2. Strong evaluation
A strong evaluation should add:
- Random assignment where possible.
- Active comparison conditions such as human tutoring, existing adaptive practice, or teacher-led review.
- Far-transfer tasks.
- Self-explanation quality measures.
- Confidence calibration.
- Help-seeking behavior analysis.
- Teacher dashboard use.
- Longitudinal follow-up.
- Model-version tracking.
- Cost and implementation analysis.
B3. Core rubric
| Dimension | Weak tutor | Strong tutor |
|---|---|---|
| Accuracy | Often fluent but unverified | Grounded and verified where needed |
| Help timing | Gives answers on request | Requires attempts and regulates help |
| Feedback | Generic or verbose | Specific, local, actionable |
| Scaffolding | None or excessive | Graduated and faded |
| Misconceptions | Misses or reinforces them | Identifies and contrasts them |
| Dialogue | Explains at learner | Elicits and responds to learner thinking |
| Retrieval | Supplies information immediately | Prompts recall first |
| Transfer | Optimizes current task | Schedules unassisted transfer |
| Metacognition | Ignores confidence and strategy | Builds planning, monitoring, checking |
| Evaluation | Measures satisfaction and completion | Measures retention, transfer, self-regulation |
Appendix C: Annotated bibliography
C1. Historical tutoring and intelligent tutoring systems
Bloom (1984), “The 2 Sigma Problem.” Establishes one-to-one tutoring and mastery learning as aspirational benchmarks for instructional effectiveness. (Sage Journals)
VanLehn (2011), “The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems.” Compares human and computer tutoring and helps situate AI tutors within the broader tutoring literature. (Tandfonline)
Kulik and Fletcher (2016), “Effectiveness of Intelligent Tutoring Systems.” Meta-analysis of controlled ITS evaluations; useful for historical comparison and for the warning that effects depend on measurement and alignment. (Sage Journals)
Roschelle et al. (2016), ASSISTments field trial. Shows the value of immediate feedback plus teacher analytics in a classroom homework system. (Sage Journals)
Nickow, Oreopoulos, and Quan (2024), tutoring meta-analysis. Summarizes experimental evidence on tutoring as one of the stronger education interventions, while emphasizing implementation variation. (Sage Journals)
Leite et al. (2025), K–12 ITS meta-analysis. Updates the ITS evidence base and identifies moderators such as worked examples, duration, outcome type, and immediate measurement. (arXiv)
C2. Generative AI tutoring trials and field evidence
Kestin et al. (2025), AI tutoring in undergraduate physics. Shows that a carefully scaffolded GPT-4 tutor can outperform an active-learning classroom comparator in short-term learning outcomes. (Nature)
Bastani et al. (2025), generative AI without guardrails in high-school mathematics. Central warning study: unrestricted AI can improve practice performance while reducing later unassisted performance; guarded tutoring mitigates the harm. (SSRN)
De Simone et al. (2025), World Bank Nigeria trial. Evaluates a structured GPT-4/Copilot-supported after-school English program in a lower-resource context. (Open Knowledge Repository)
Wang et al. (2025), Tutor CoPilot. Demonstrates human-AI tutoring support, with AI helping tutors use more effective instructional moves. (arXiv)
LearnLM Team, Google DeepMind and Eedi (2025/2026). Classroom RCT showing that a pedagogically fine-tuned model, supervised by expert tutors, can support novel mathematics problem solving. (arXiv)
Three Years with Classroom AI in Introductory Programming (2026). Longitudinal evidence on how student-AI interaction practices evolve across AI-supported programming cohorts. (arXiv)
C3. Pedagogical safety, benchmarks, and tutor behavior
SafeTutors (2026). Defines tutoring safety around educational harms and shows that multi-turn interactions can reveal much worse pedagogical failure than single-turn tests. (arXiv)
MathDial (2023). Dialogue tutoring dataset showing that problem-solving ability does not automatically produce good tutoring behavior. (arXiv)
MRBench / Maurya et al. (2025). Provides a taxonomy for evaluating AI tutor responses across pedagogical dimensions such as mistake identification, guidance, and actionability. (ACL Anthology)
MathTutorBench (2025). Evaluates open-ended pedagogical capabilities of LLM tutors and reinforces the solver-tutor gap. (arXiv)
TutorBench (2025). Assesses adaptive explanation, actionable feedback, and active-learning hint generation; reports that frontier models still struggle with core tutoring skills. (Scale)
BEA 2025 Shared Task. Large shared task on pedagogical ability assessment, showing substantial remaining room for improvement in tutor-response evaluation. (arXiv)
KMP-Bench (2026). K–8 mathematical pedagogical benchmark with dialogue and skill modules; useful for evaluating multi-turn tutoring and granular pedagogical skills. (arXiv)
SHAPE (2026). Evaluates tutoring behavior under answer-inducing student prompts and proposes graph-augmented gating based on prerequisite and mastery inference. (arXiv)
EduGuardBench (2026). Evaluates professional fidelity and teaching-specific harms in LLMs acting as teachers, including the ability to convert inappropriate requests into teachable moments. (arXiv)
Answer Leakage Robustness (2026). Studies cases where learners actively try to obtain final answers, making answer-boundary maintenance a first-class tutoring evaluation problem. (arXiv)
C4. Dialogic tutoring and model alignment
ConvoLearn (2026). Provides a learning-science grounded dataset for dialogic AI tutors across dimensions such as cognitive engagement, formative assessment, accountability, metacognition, and power dynamics. (arXiv)
Scarlatos et al. (2025), training LLM tutors for learning outcomes. Uses student modeling and pedagogical rubrics to train tutors toward improved student correctness while preserving pedagogical quality. (arXiv)
PedagogicalRL-Thinking (2026). Extends pedagogical alignment to reasoning models through pedagogical reasoning prompting and a thinking reward. (arXiv)
Hierarchical Pedagogical Oversight (2025/2026). Uses structured oversight to detect pedagogical failures such as sycophancy and overly direct answer-giving. (arXiv)
C5. Cognitive effort, offloading, and learning science
OECD Digital Education Outlook 2026. Policy-level synthesis distinguishing generative AI as learning partner from generative AI as shortcut, with emphasis on teaching principles and cognitive effort. (OECD)
Lodge and Loble (2026), cognitive offloading and education. Argues that AI offloading can be beneficial or detrimental depending on domain knowledge, metacognition, and instructional guidance. (University of Technology Sydney)
Chi and colleagues, self-explanation research. Establishes self-explanation as a durable learning mechanism relevant to tutor prompts and dialogue evaluation. (ScienceDirect)
Sweller and cognitive load theory. Provides the theoretical basis for reducing extraneous load while preserving productive learning effort. (Wiley Online Library)
Roediger and Karpicke, retrieval practice. Supports the need for recall and unassisted checks in AI tutoring systems. (Sage Journals)
Kapur, productive failure. Supports the idea that well-designed struggle can prepare learners for later instruction and deeper understanding. (Tandfonline)
C6. Grounding, assessment, and deployment
Pardos and Bhandari (2024), ChatGPT-generated help. Evaluates AI-generated hints and help content, relevant to authoring and feedback design. (PLOS)
LLM-powered assessment RAG for higher education (2026). Demonstrates grounded generation over rubrics, exemplars, and instructor feedback at realistic essay-assessment scale. (arXiv)
Supplied base paper on generative AI tutors and adaptive learning systems. Provides the initial evidence synthesis and architecture that this paper repositions around learning preservation.
Supplied 2026 source update map. Identifies newer work on LearnLM/Eedi, cognitive offloading, benchmark expansion, pedagogical reasoning models, answer-boundary robustness, and RAG deployment.