Preserving Learning in Generative AI Tutoring Systems: Pedagogical Safety, Cognitive Effort, and Adaptive Scaffolding
Abstract
Generative AI tutors can now explain concepts, respond conversationally, generate practice, diagnose partial answers, and personalize support across domains. These capabilities make them plausible successors to older intelligent tutoring systems, but they also introduce a new instructional hazard: a system that can solve a learner’s task can easily replace the learner’s cognitive work. The central design problem is therefore not whether large language models can explain content. It is whether tutoring systems can regulate help so that learners still retrieve, reason, explain, struggle productively, repair misconceptions, and transfer knowledge without the model.
Recent evidence makes the distinction urgent. Carefully structured AI tutoring systems and human-AI tutor co-pilots show promising learning gains in randomized trials spanning physics classrooms, K–12 mathematics, UK secondary-school tutoring, and low-resource educational contexts. Yet unrestricted access to general-purpose models can improve immediate practice performance while reducing later unassisted performance. The OECD’s 2026 Digital Education Outlook draws the same distinction at policy level: generative AI can support learning when guided by teaching principles, but outsourcing cognitive work can create apparent performance gains without durable learning. (OECD)
This paper argues for a learning-preservation approach to generative AI tutoring. Pedagogical safety should be defined as protection against educational harms: answer over-disclosure, premature solution-giving, misconception reinforcement, weak scaffolding, over-scaffolding, cognitive offloading, false mastery, poor confidence calibration, low learner agency, lack of retrieval practice, lack of transfer checks, and multi-turn pedagogical drift. The proposed architecture combines a domain model, learner model, pedagogical policy, grounded generation layer, verification tools, unassisted checkpoints, and teacher-facing analytics. The proposed evaluation hierarchy prioritizes unassisted near-transfer, delayed retention, far transfer, self-regulation, and equity over chatbot fluency or immediate task completion.
Keywords
Generative AI tutoring; adaptive learning; pedagogical safety; cognitive effort; cognitive offloading; scaffolding; intelligent tutoring systems; self-explanation; retrieval practice; learner modeling; durable learning.
1. Introduction: the performance-learning problem
Generative AI tutoring has created a paradox. The same capability that makes AI tutors attractive—the ability to generate fluent explanations, examples, and solutions on demand—can undermine the learning processes tutoring is meant to support. A learner may complete more homework, produce better code, write a more polished essay, or solve more mathematics problems with AI assistance, while becoming less able to perform independently. In education, performance during assisted practice is not the same as learning.
This paper calls that distinction learning preservation. A tutoring system preserves learning when it helps the learner without replacing the mental operations that produce durable competence. Those operations include retrieval, prediction, planning, error detection, self-explanation, comparison, abstraction, transfer, and metacognitive monitoring. A tutor that removes all difficulty may feel helpful, but it can also remove the effort that makes learning stick.
The OECD’s 2026 Digital Education Outlook frames the same issue directly: generative AI can be a learning partner when guided by clear teaching principles, but unguided use can become a shortcut that improves task completion without producing real learning gains. The report emphasizes that performance on a task does not automatically translate into learning, especially when AI offloads the cognitive activity students need to practice. (OECD)
This problem is not simply a matter of model accuracy. It is a problem of instructional control. A correct answer can be pedagogically wrong if it arrives too early. A detailed explanation can be harmful if it substitutes for self-explanation. A helpful hint can create dependency if it never fades. A tutor can be polite, fluent, and factually accurate while still failing to support durable learning.
The strongest recent tutoring research therefore shifts the question from “Can models answer?” to “Can systems teach?” SafeTutors makes that shift explicit by defining tutoring safety around educational harms such as answer over-disclosure, misconception reinforcement, and failure to scaffold. It also shows why single-turn evaluation is inadequate: multi-turn tutoring can reveal much worse pedagogical failure than isolated prompt-response tests. (arXiv)
The thesis of this paper is:
The central challenge for generative AI tutoring is not whether models can explain content. It is whether tutoring systems can regulate help so that learners still retrieve, reason, explain, struggle productively, and transfer knowledge without the model. The next generation of AI tutors should be evaluated not by fluency or immediate task completion, but by durable learning.
2. From intelligent tutoring systems to generative tutoring
Generative AI tutors are not the beginning of adaptive instruction. They belong to a longer lineage of tutoring, mastery learning, feedback systems, learning analytics, and intelligent tutoring systems.
Bloom’s “2 sigma” argument made one-to-one tutoring the aspirational benchmark for instructional effectiveness, even though later work showed that actual tutoring effects vary by implementation, dosage, structure, and context. (Sage Journals) VanLehn’s review later compared human tutoring, intelligent tutoring systems, and other tutoring systems, arguing that well-designed computer tutoring could approach human tutoring on some measures. (Tandfonline) Kulik and Fletcher’s meta-analysis of intelligent tutoring systems found substantial effects across controlled evaluations, while also reinforcing a recurring warning: effects are often larger on tests closely aligned with the tutoring system than on broader measures. (Sage Journals)
Classic intelligent tutoring systems had two strengths that generic chatbots lack. First, they often represented domain knowledge explicitly: skills, steps, misconceptions, constraints, and prerequisite relations. Second, they controlled the pedagogical move space: when to give feedback, when to hint, when to move to the next problem, and when to require more practice. ASSISTments, for example, combined immediate feedback and hints for students with teacher-facing reports, producing a classroom feedback loop rather than merely a conversational interface. (Sage Journals)
Generative AI changes the interface and authoring economics. Large language models can parse open-ended learner language, generate explanations, adapt wording, draft hints, create examples, and respond conversationally across many domains. They can support writing, language learning, programming, conceptual science, mathematics explanation, professional training, and simulation debriefing in ways older systems found difficult.
But generative AI also weakens a key boundary. Older systems were often limited to particular actions: select a hint, mark a step, choose a problem. A general-purpose model can simply solve the task. Unless constrained by instructional design, the default behavior of an LLM is to satisfy the user’s request, and many student requests are not learning-preserving. “Give me the answer” is useful for task completion but often harmful for learning.
A recent K–12 ITS meta-analysis also sharpens the historical comparison. Leite and colleagues found a smaller overall effect than older ITS syntheses and identified moderators such as worked-out examples, duration, condition, outcome type, and immediate measurement. This reinforces the need to evaluate not only whether a tutoring technology works, but under what pedagogical design, outcome measure, and time horizon it works. (arXiv)
The lesson is that generative AI tutors should inherit the discipline of intelligent tutoring systems rather than replace it. The LLM should be a flexible dialogue and generation layer inside a controlled instructional loop, not the entire tutor.
3. The emerging evidence base
3.1 Structured AI tutors can improve learning
Recent randomized and field studies show that generative AI can support learning when embedded in structured tutoring systems.
A 2025 Scientific Reports study in undergraduate physics compared a carefully designed GPT-4-based tutor with an active-learning classroom condition. The AI tutor used research-based instructional design, expert-crafted prompts, structured activities, and sequential scaffolding rather than open-ended chatbot access. Students in the AI condition learned more in less time, with learning gains reported as more than double those in the active-learning condition, and median time on task was 49 minutes rather than a 60-minute class period. The authors emphasize that the tutor’s success depended on careful design, not merely a system prompt. (Nature)
The LearnLM/Eedi classroom RCT extends this evidence to UK secondary-school mathematics. In that study, 165 students across five schools used LearnLM, a pedagogically fine-tuned Gemini-family model, inside the Eedi platform with expert tutors supervising the model’s drafted messages. Supervising tutors approved 76.4% of LearnLM drafts with zero or minimal edits, and students guided by LearnLM plus human supervision were 5.5 percentage points more likely to solve novel subsequent problems than students tutored by human tutors alone. (arXiv)
The Tutor CoPilot study points toward another promising pattern: using AI to improve human tutoring rather than replacing tutors. In a randomized trial with 900 tutors and 1,800 K–12 students, AI suggestions helped tutors use more high-quality instructional strategies such as guiding questions and student explanation prompts, with larger benefits for students of lower-rated or less-experienced tutors. (arXiv)
The evidence from Nigeria broadens the deployment context. A World Bank randomized trial evaluated a six-week GPT-4/Copilot-supported after-school English program for senior secondary students, showing that structured AI-supported learning can be meaningful outside elite university settings. The intervention bundled AI access, teacher support, after-school time, and digital-skills exposure, so it should be interpreted as evidence for structured AI-supported programs rather than for unguided chatbot use. (Open Knowledge Repository)
Across these studies, the common pattern is not “AI access improves learning.” It is more specific: structured AI tutoring, embedded in a pedagogical design, can improve learning under some conditions.
3.2 Unguided AI can improve practice while harming learning
The strongest warning evidence comes from mathematics. Bastani and colleagues’ high-school field experiment found that unrestricted GPT-style access improved practice performance but reduced later unassisted performance, while a guarded tutor using teacher-designed hints and answer restrictions mitigated the harm. The supplied prior evidence map reports the headline pattern: GPT Base improved practice performance but reduced later unassisted exam performance, while GPT Tutor improved practice and largely avoided the later performance loss. (SSRN)
This result should be treated as a design theorem:
Assisted task performance is not evidence of learning unless learners can later perform without the assistance.
It also explains why “helpfulness” is a dangerous optimization target. A system that maximizes immediate correctness may minimize retrieval, planning, and self-explanation. It may turn practice into production support.
The same principle appears in programming education. A three-year classroom study of AI-supported introductory Python courses tracked student familiarity, dialogue logs, and course records across successive AI-enabled cohorts. The authors frame the challenge not as whether students will use AI, but how educators can preserve agency and productive learning as AI use becomes normal. (arXiv)
3.3 Cognitive offloading is not one thing
Cognitive offloading can help or harm. Offloading tedious or extraneous work may free attention for deeper reasoning. Offloading the reasoning itself can hollow out learning. The University of Technology Sydney report by Lodge and Loble makes this distinction central: students with stronger domain knowledge and metacognitive skills may use AI to offload lower-order tasks productively, while students without those foundations are more susceptible to detrimental offloading. The report also treats this as an equity issue, because learners who most need cognitive practice may be most likely to lose it through unstructured AI use. (University of Technology Sydney)
This distinction is crucial for tutor design. The question is not whether a tutor should reduce effort. It is which effort it should reduce. A good tutor reduces extraneous load, not germane learning effort. It may simplify wording, chunk information, or clarify notation, but it should preserve the learner’s responsibility to retrieve, choose, justify, and check.
3.4 The benchmark frontier is shifting from answer quality to pedagogy
A new generation of benchmarks evaluates tutoring behavior rather than generic response quality.
SafeTutors evaluates AI tutors across safety and pedagogy together. It defines tutoring-specific harms, reports broad pedagogical failure across models, and finds that multi-turn interaction can expose much higher failure rates than single-turn evaluation. (arXiv)
MathDial showed early on that strong problem-solving ability does not imply strong tutoring ability: GPT-3 could solve many math problems but often gave factually wrong feedback or revealed solutions too early in tutoring dialogues. (arXiv) MathTutorBench and MRBench similarly target open-ended pedagogical capabilities such as mistake identification, mistake localization, guidance, actionability, and answer revealing. (arXiv) TutorBench reports that frontier models remain weak on core tutoring skills such as adaptive explanation, actionable feedback, and active-learning hint generation. (Scale)
KMP-Bench extends this direction for K–8 mathematics. It includes dialogue-level and skill-level modules, uses a 4.6K dialogue evaluation set, and assesses tutoring against core pedagogical principles including challenging, explanation, modeling, practice, questioning, and feedback. Its results reinforce the solver-tutor gap: leading models may do well on verifiable solution tasks while struggling with nuanced pedagogical application. (arXiv)
ConvoLearn addresses the data side of the problem. The April 2026 version reports 2,134 semi-synthetic tutor-student dialogues grounded in learning-science dimensions such as cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. It also shows that fine-tuning on such data can steer an open model toward more dialogic tutoring behavior. (arXiv)
The direction of the field is clear: AI tutor evaluation is moving beyond “Was the answer correct?” toward “Was the instructional move appropriate for this learner at this moment?”
4. What must be preserved for learning?
4.1 Retrieval
Learners need opportunities to recall information without immediately seeing it. Retrieval practice strengthens retention more effectively than passive restudy in many contexts, and tutoring systems should therefore schedule moments when the learner must answer from memory before receiving help. (Sage Journals)
AI tutoring can easily erase retrieval by giving the needed fact, formula, code pattern, or explanation before the learner attempts recall. A learning-preserving tutor should ask: “What do you remember?”, “Which principle applies?”, “Try the next step before I help,” or “Explain the rule in your own words.”
4.2 Reasoning and planning
Many learning tasks require selecting a strategy, not merely executing a step. If a tutor identifies the strategy too early, the learner may complete the problem without practicing problem representation. For mathematics, this may mean choosing an equation. For programming, it may mean choosing a loop, condition, data structure, or test case. For writing, it may mean deciding the claim and evidence structure.
The tutor should distinguish between helping with execution and replacing planning. If the student is stuck, the first move should often be an orienting question: “What is the goal?”, “What are the known quantities?”, “What would count as evidence?”, “What is the simplest case?”
4.3 Self-explanation
Self-explanation is one of the most important learning-preserving behaviors. It forces learners to connect steps, principles, and representations. Research on self-explanation shows benefits for problem solving and understanding, especially when learners are prompted to explain why a step is valid or how a principle applies. (ScienceDirect)
A generative tutor should therefore treat self-explanation as evidence. The learner’s explanation reveals misconceptions, shallow understanding, missing links, and overconfidence. A tutor should request explanations, evaluate them, and adapt support based on their quality.
4.4 Productive struggle
Productive struggle does not mean leaving students unsupported. It means preserving manageable difficulty long enough for learners to generate, test, and revise ideas. Productive failure research shows that initial struggle can prepare learners for later instruction when the struggle is well designed and followed by consolidation. (Tandfonline)
Generative AI tutors should regulate struggle rather than eliminate it. The right response to confusion is not always an explanation. It may be a smaller subproblem, a contrasting case, a partial hint, or a request to articulate what seems confusing.
4.5 Cognitive load balance
Cognitive load theory distinguishes between the inherent difficulty of the material, avoidable difficulty introduced by poor design, and productive effort invested in schema construction. (Wiley Online Library) AI tutors should reduce avoidable difficulty: unclear language, irrelevant detail, poor formatting, missing examples, or confusing notation. They should not remove the productive effort required to form durable understanding.
This is why shorter tutor responses are often better than long ones. Many LLM explanations are overlong, over-complete, and over-confident. A tutor response should be just enough to move the learner’s thinking forward.
4.6 Transfer
Learning is not demonstrated by solving the exact problem with help. It is demonstrated when the learner can solve a similar problem without help, retain the skill later, and apply the idea in a new context. Transfer checks must therefore be built into tutoring systems, not treated as optional end-of-course assessments.
5. Pedagogical safety as learning preservation
Pedagogical safety means preventing avoidable educational harms. It is not the same as content moderation, nor is it limited to preventing offensive or dangerous outputs. In tutoring, a safe response is one that preserves the learner’s opportunity to learn.
SafeTutors is important because it evaluates tutoring safety as a pedagogical construct. It identifies harms such as answer over-disclosure, misconception reinforcement, and failure to scaffold, and it reports that failures become more visible in multi-turn interaction. (arXiv)
A learning-preservation taxonomy should include at least the following failure modes.
| Failure mode | Description | Learning risk | Better tutor behavior |
|---|---|---|---|
| Answer over-disclosure | Gives the final answer before the learner has made a meaningful attempt. | Removes retrieval, planning, and reasoning. | Require an attempt or prediction first. |
| Premature worked solution | Shows a full solution when a hint would suffice. | Turns practice into passive reading. | Use hint ladders before full worked examples. |
| Weak scaffolding | Gives vague encouragement or generic explanation without diagnosing the learner’s state. | Learner remains stuck or misunderstands why. | Diagnose the local error and give a targeted next step. |
| Misconception reinforcement | Accepts or builds on incorrect reasoning. | Strengthens wrong mental models. | Name the misconception and contrast it with the correct concept. |
| Over-scaffolding | Breaks every task into steps indefinitely. | Learner never learns to plan independently. | Fade prompts and schedule unassisted checks. |
| Cognitive offloading | Allows AI to do the learner’s core thinking. | Produces high-quality output without internal competence. | Distinguish productive support from substitution. |
| False mastery | Learner appears successful during assisted practice but fails without help. | Inflated confidence and weak retention. | Use unassisted near-transfer and delayed checks. |
| Low learner agency | Tutor controls every move. | Learner becomes passive and dependent. | Ask learner to choose, justify, and monitor strategies. |
| Poor confidence calibration | Learner is confident for the wrong reasons or distrusts correct reasoning. | Bad self-monitoring and poor help-seeking. | Ask for confidence ratings and compare with performance. |
| Lack of retrieval practice | Tutor always supplies facts, steps, or formulas. | Weak retention. | Prompt recall before explanation. |
| Lack of transfer checks | Tutor measures only the current item. | No evidence of generalization. | Include near-transfer and far-transfer tasks. |
| Multi-turn drift | Tutor starts with scaffolding but gradually becomes solution-giving. | Dialogue deteriorates over time. | Track hint level, answer boundaries, and prior attempts. |
| Tutor sycophancy | Tutor validates incorrect student reasoning to maintain rapport. | Misconceptions persist. | Separate warmth from correctness; correct errors clearly. |
| One-size Socratic questioning | Tutor keeps asking questions even when the learner needs a worked example. | Frustration, overload, inefficient learning. | Switch modes based on learner state. |
Sycophancy deserves explicit attention. Hierarchical Pedagogical Oversight identifies the tendency of tutors to validate incorrect reasoning or give overly direct answers as a structural problem in current tutor agents. Its multi-agent oversight approach improved evaluation performance on MRBench, suggesting that tutor outputs may need explicit pedagogical review rather than relying on a single model’s conversational instincts. (arXiv)
Solution leakage under motivated answer-seeking is also a learning-preservation problem. SHAPE shows that educational models can be induced to provide direct answers when the learner lacks mastery, and proposes explicit gating based on inferred prerequisites and mastery gaps. (arXiv) Work on answer leakage robustness similarly studies cases where students actively try to obtain final answers rather than scaffolding, arguing that tutor helpfulness must be bounded by pedagogical purpose. (arXiv)
This is not a call for rigid answer refusal. Sometimes direct instruction and worked examples are appropriate. The key is mode awareness. A student in worked-example mode should see the solution and then compare, explain, or complete a faded version. A student in practice mode should usually attempt, receive graduated hints, and then complete an unassisted check.
6. Dialogic tutoring and constructive learning
Good tutoring is not just explanation. It is dialogue that elicits, listens, probes, and adapts.
Dialogic tutoring treats the learner’s utterances as evidence. The tutor asks questions not to simulate Socratic style but to reveal the learner’s model. It then responds contingently: sometimes challenging, sometimes confirming, sometimes simplifying, sometimes asking for justification, sometimes offering a worked example.
ConvoLearn’s contribution is to operationalize this dialogic view. It builds tutor-student dialogues around learning-science dimensions including cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. These categories are useful because they move beyond generic “helpfulness” toward observable pedagogical behavior. (arXiv)
A dialogic AI tutor should therefore do five things consistently.
First, it should elicit thinking. Before explaining, it should ask what the learner already knows, what step they tried, or why they chose a strategy.
Second, it should diagnose from evidence. It should distinguish an arithmetic slip from a conceptual error, a missing prerequisite from a careless reading, and low confidence from low knowledge.
Third, it should respond at the right grain size. A local procedural error needs a local hint. A misconception needs contrast. A novice facing a new schema may need a worked example.
Fourth, it should hold the learner accountable. It should ask for justification, not merely accept answers. It should avoid praising unsupported reasoning as if it were understanding.
Fifth, it should promote metacognition. It should ask learners to predict difficulty, rate confidence, check answers, compare strategies, and reflect on what changed.
These behaviors are measurable. They can be coded in dialogue logs, evaluated by rubrics, and linked to learning outcomes. That is the path from chatbot quality to tutoring quality.
7. Adaptive scaffolding and learner modeling
Personalization should not be framed as “learning styles.” A learning-preserving tutor should adapt to educationally meaningful learner states.
7.1 What the tutor should track
A useful learner model should track at least six constructs.
| Learner-state construct | Evidence | Pedagogical use |
|---|---|---|
| Prior knowledge | Pretest, first attempts, explanation quality | Decide whether to use worked examples, hints, or transfer tasks. |
| Current misconception | Error patterns, explanations, wrong rules | Choose misconception contrast or targeted feedback. |
| Confidence calibration | Confidence ratings versus correctness | Trigger reflection, verification, or calibration prompts. |
| Help-seeking pattern | Timing and type of help requests | Detect productive versus substitutive AI use. |
| Self-explanation quality | Completeness, causal links, principle use | Decide whether to advance, prompt, or reteach. |
| Readiness for faded support | Performance with decreasing hints | Move toward unassisted checks. |
The help-seeking construct is especially important for generative AI. Two learners may both solve the current problem, but one may have reasoned independently while the other extracted a solution. Without modeling help-seeking, the system cannot distinguish learning from substitution.
7.2 The pedagogical policy
The learner model should feed an explicit pedagogical policy. That policy decides the next instructional move.
Examples:

- If the learner has not attempted the task, ask for a first step, prediction, or explanation.
- If the learner requests a final answer, ask what they have tried and offer a low-level hint.
- If the learner is correct but cannot explain, ask for justification before advancing.
- If the learner is incorrect and highly confident, trigger misconception contrast.
- If the learner is incorrect and low confidence, reduce extraneous load and provide a targeted hint.
- If repeated hints fail, switch to a worked example, then return to a near-transfer problem.
- If the learner succeeds with hints, fade support and schedule an unassisted check.
- If the learner shows high assisted performance and low unassisted performance, flag false mastery.
This policy should be explicit, versioned, inspectable, and testable. A hidden prompt is not enough.
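To make "explicit and testable" concrete, here is a minimal sketch in Python of such a policy as an inspectable rule table. The state fields, thresholds, and move names are illustrative assumptions, not part of any cited system:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Move(Enum):
    ASK_FIRST_STEP = auto()
    LOW_LEVEL_HINT = auto()
    REQUEST_JUSTIFICATION = auto()
    MISCONCEPTION_CONTRAST = auto()
    TARGETED_HINT = auto()
    WORKED_EXAMPLE = auto()
    FADE_AND_CHECK = auto()
    FLAG_FALSE_MASTERY = auto()

@dataclass
class LearnerState:
    attempted: bool
    requested_answer: bool
    correct: bool | None        # None = no gradable attempt yet
    can_explain: bool
    confidence: float           # 0.0-1.0 pre-feedback self-rating
    failed_hints: int
    assisted_score: float
    unassisted_score: float

def next_move(s: LearnerState) -> Move:
    """Each rule mirrors one line of the prose policy above."""
    if s.assisted_score > 0.8 and s.unassisted_score < 0.5:
        return Move.FLAG_FALSE_MASTERY
    if not s.attempted:
        return Move.ASK_FIRST_STEP
    if s.requested_answer:
        return Move.LOW_LEVEL_HINT
    if s.correct and not s.can_explain:
        return Move.REQUEST_JUSTIFICATION
    if s.correct is False and s.confidence >= 0.7:
        return Move.MISCONCEPTION_CONTRAST
    if s.correct is False and s.failed_hints >= 3:
        return Move.WORKED_EXAMPLE
    if s.correct is False:
        return Move.TARGETED_HINT
    return Move.FADE_AND_CHECK      # correct and explained: fade and check
```

Because the policy is ordinary code, it can be versioned, diffed, unit-tested against dialogue scenarios, and audited, none of which is possible for a policy that exists only inside a prompt.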
7.3 Fading support
Scaffolding is only scaffolding if it can be removed. A tutor that always decomposes tasks for the learner may create a new dependency. Faded scaffolding should therefore be built into the system:
- Model the process.
- Ask the learner to complete a partial step.
- Ask the learner to choose the next step.
- Ask the learner to solve a similar problem with fewer hints.
- Ask the learner to explain the strategy from memory.
- Ask the learner to transfer the strategy to a new context.
Worked examples are not opposed to learning preservation. They are powerful when used at the right time, especially for novices. The risk is not the worked example itself; it is using worked examples as a substitute for later retrieval, explanation, and transfer. Research on worked examples and cognitive load supports their value, especially when they reduce unnecessary search for novices. (Education NSW)
8. A reference architecture for learning-preserving AI tutors
The appropriate unit of design is not the chatbot. It is the instructional loop.
Learner input → evidence capture → learner model → pedagogical policy → grounded generation → response → learner attempt → unassisted check → dashboard and analytics
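As an illustration of the loop as code rather than prose, the following Python skeleton wires the stages together. Every component name and stub body here is hypothetical; a production system would replace each stand-in with the modules described in the subsections below:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Minimal state for one tutoring loop; real systems persist far more."""
    learner_model: dict = field(default_factory=lambda: {"hint_level": 0})
    log: list = field(default_factory=list)

def capture_evidence(learner_input: str) -> dict:
    # Placeholder: real capture parses attempts, steps, timing, help requests.
    return {"attempted": bool(learner_input.strip()), "text": learner_input}

def choose_move(model: dict, evidence: dict) -> str:
    # Stand-in for the pedagogical policy of Section 7.2.
    return "ask_first_step" if not evidence["attempted"] else "targeted_hint"

def render(move: str) -> str:
    # Stand-in for grounded generation: template lookup before any LLM call.
    templates = {
        "ask_first_step": "What would your first step be?",
        "targeted_hint": "Check the operation you applied to each side.",
    }
    return templates[move]

def tutoring_turn(learner_input: str, session: Session) -> str:
    """Input -> evidence -> model update -> policy -> grounded response -> log."""
    evidence = capture_evidence(learner_input)
    if not evidence["attempted"]:
        session.learner_model["hint_level"] += 1
    move = choose_move(session.learner_model, evidence)
    response = render(move)
    session.log.append((move, evidence))   # feeds dashboards and audits
    return response
```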
8.1 Domain model
The domain model defines what there is to learn. It should include:
| Element | Example |
|---|---|
| Learning objectives | Solve linear equations with variables on both sides. |
| Knowledge components | Combine like terms; preserve equality; isolate variable. |
| Misconceptions | Applies operation to one side only; distributes sign incorrectly. |
| Prerequisites | Integer operations; inverse operations; equation meaning. |
| Representations | Symbolic equation, word problem, graph, table. |
| Transfer targets | Novel equation forms, real-world constraints, multi-step problems. |
The LLM can help author this model, but the system should not rely only on latent model knowledge. For serious learning, the curriculum structure must be inspectable.
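One way to make the curriculum structure inspectable is to hold it in explicit data rather than in model weights. A minimal sketch, with illustrative names drawn from the table above:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeComponent:
    name: str
    prerequisites: list[str] = field(default_factory=list)
    misconceptions: list[str] = field(default_factory=list)

@dataclass
class DomainModel:
    objective: str
    components: list[KnowledgeComponent]
    representations: list[str]
    transfer_targets: list[str]

linear_equations = DomainModel(
    objective="Solve linear equations with variables on both sides",
    components=[
        KnowledgeComponent("combine-like-terms",
                           prerequisites=["integer-operations"]),
        KnowledgeComponent("preserve-equality",
                           misconceptions=["applies-operation-to-one-side-only"]),
        KnowledgeComponent("isolate-variable",
                           prerequisites=["inverse-operations"]),
    ],
    representations=["symbolic", "word-problem", "graph", "table"],
    transfer_targets=["novel-equation-forms", "multi-step-problems"],
)
```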
8.2 Evidence capture
The system should capture more than final correctness. It should capture:
- the learner’s first attempt;
- intermediate steps;
- natural-language explanations;
- hint requests;
- time before asking for help;
- confidence ratings;
- revisions after feedback;
- whether the final response was produced with or without help;
- performance on later unassisted checks.
This evidence allows the system to distinguish productive struggle from answer extraction.
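One possible evidence record, sketched in Python with illustrative field names and a hedged heuristic for flagging answer extraction; the ten-second threshold is an assumption for illustration, not an empirical constant:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AttemptEvidence:
    """One practice event; fields mirror the capture list above."""
    item_id: str
    first_attempt: str
    steps: list[str]
    explanation: str | None
    hints_requested: int
    seconds_before_help: float | None   # None = never asked for help
    confidence: float                   # 0.0-1.0 pre-feedback self-rating
    revised_after_feedback: bool
    assisted: bool                      # was any help used on the final response?
    timestamp: datetime

def looks_like_answer_extraction(e: AttemptEvidence) -> bool:
    """Heuristic: help requested almost immediately, no attempt, no explanation."""
    return (e.hints_requested > 0
            and (e.seconds_before_help or 0) < 10
            and not e.first_attempt.strip()
            and e.explanation is None)
```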
8.3 Learner model
The learner model estimates knowledge, misconception state, self-regulation, confidence calibration, and help-seeking behavior. It should not simply assign a mastery score after each correct answer. Correctness with heavy help is not mastery.
8.4 Pedagogical policy
The pedagogical policy chooses the next move. This is the core of learning preservation. It should specify when to ask, hint, explain, model, correct, withhold, fade, check, or escalate.
8.5 Grounded generation
The generation layer should be grounded in approved educational materials: curriculum explanations, teacher-authored hints, rubrics, worked examples, misconception libraries, and assessment criteria. Retrieval-augmented generation is helpful, but the retrieved content should be pedagogically typed. A definition, hint, worked example, misconception warning, and assessment rubric should not be treated as interchangeable passages.
RAG-based assessment work in higher education shows why grounding matters. One 2026 system used structured retrieval over rubric criteria, exemplar essays, and instructor feedback to generate scores and formative comments for 701 essays, reporting high agreement with human evaluators and rubric-aligned feedback. (arXiv)
8.6 Verification tools
In mathematics, programming, formal reasoning, and science, the LLM should not be the sole verifier. It should call symbolic solvers, code execution environments, proof checkers, simulators, calculators, or teacher-authored answer keys when appropriate. This reduces hallucinated feedback and protects learners from fluent but wrong instruction.
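For example, a tutor for linear equations can delegate answer checking to a symbolic solver such as SymPy rather than trusting model-generated feedback. The sketch below is illustrative and assumes single-variable equations in x:

```python
from sympy import Eq, solve, symbols, sympify

def verify_linear_answer(equation: str, student_answer: str) -> bool:
    """Check a claimed solution symbolically instead of asking the LLM.

    equation: e.g. "2*x + 3 = 11"; student_answer: e.g. "4".
    """
    x = symbols("x")
    lhs, rhs = (sympify(side) for side in equation.split("="))
    solutions = solve(Eq(lhs, rhs), x)
    return sympify(student_answer) in solutions

assert verify_linear_answer("2*x + 3 = 11", "4")
assert not verify_linear_answer("2*x + 3 = 11", "7")
```

The policy can then condition its feedback on a verified correctness signal, so a misleading but fluent model judgment never reaches the learner unchecked.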
8.7 Unassisted checks
Every tutoring loop should include unassisted evidence. After assisted practice, the learner should solve a similar item without hints, explain a rule from memory, or transfer the idea to a new representation. Without unassisted evidence, the system cannot know whether it taught or merely helped.
8.8 Teacher dashboard
Teacher-facing analytics should report learning-relevant signals, not just chat volume. Useful dashboard indicators include:
- common misconceptions;
- students with high assisted but low unassisted performance;
- rapid hint escalation patterns;
- skipped self-explanation prompts;
- confidence-performance mismatch;
- skills needing reteaching;
- learners ready for faded support;
- learners needing human intervention.
The dashboard should help teachers decide what to reteach, which students need support, and where the tutor may be creating dependency.
9. Design-pattern catalogue
Pattern 1: Attempt gates
Before giving substantive help, require the learner to do something: identify known quantities, make a prediction, choose a principle, write a first line of code, explain what they tried, or state where they are confused.
Attempt gates protect retrieval and planning. They also give the tutor evidence for diagnosis.
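A minimal attempt gate might look like the following sketch; the keyword heuristic and gating messages are placeholders for a real request classifier connected to the learner model:

```python
def attempt_gate(request: str, attempt_log: list[str]) -> str | None:
    """Return a gating prompt if substantive help should be withheld, else None."""
    asks_for_answer = any(
        k in request.lower() for k in ("answer", "solution", "solve it")
    )
    if not attempt_log:
        # No recorded attempt yet: require evidence before any substantive help.
        return "Before I help: what do you already know, and what is your first step?"
    if asks_for_answer:
        return "Show me your latest step first, and I will help you check it."
    return None  # gate open: diagnosis and hints may proceed
```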
Pattern 2: Hint ladders
A hint ladder moves from general to specific:
| Level | Tutor move |
|---|---|
| 1 | Orienting question |
| 2 | Relevant principle |
| 3 | Local error cue |
| 4 | Partial step |
| 5 | Worked example |
| 6 | Final answer with explanation, only when instructionally justified |
The tutor should log hint level reached. High hint levels should trigger later unassisted checks.
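The ladder can be represented as ordered data so that the level reached is logged automatically. The rung texts below are illustrative, keyed to the linear-equation example used earlier:

```python
HINT_LADDER = [
    "What is the question asking you to find?",             # 1: orienting question
    "Which principle connects what you know to the goal?",  # 2: relevant principle
    "Look again at what you did to the left side.",         # 3: local error cue
    "Subtract 3 from both sides; what remains?",            # 4: partial step
    "<worked example>",                                     # 5: full worked example
    "<final answer with explanation>",                      # 6: only when justified
]

def next_hint(level: int) -> tuple[str, int]:
    """Serve the next rung and the updated level. Reaching a high level should
    schedule a later unassisted check (Pattern 8), not just end the exchange."""
    level = min(level, len(HINT_LADDER) - 1)
    return HINT_LADDER[level], level + 1
```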
Pattern 3: Self-explanation prompts
The tutor asks the learner to explain why a step works, how two examples differ, what rule applies, or what changed after feedback. The tutor evaluates explanation quality, not merely answer correctness.
Pattern 4: Misconception contrast
When the learner applies a wrong rule, the tutor contrasts the misconception with the correct concept:
“You subtracted 3 from the left side only. Equations require preserving equality, so whatever operation you apply must apply to both sides. Try rewriting the step with that constraint.”
This is more effective than simply saying “incorrect.”
Pattern 5: Worked-example mode
Worked examples should be explicit modes, not accidental answer leaks. In worked-example mode, the tutor shows a solution but requires active processing: compare steps, fill in missing reasoning, explain why a move is valid, or solve a faded example afterward.
Pattern 6: Faded scaffolding
The tutor gradually removes support. It may start with a full worked example, move to partial examples, then to hints, then to unassisted problems.
Pattern 7: Retrieval checks
After explanation, the tutor asks the learner to recall the rule, solve a similar item, or explain from memory. Retrieval checks should occur before the learner sees the answer again.
Pattern 8: Unassisted near-transfer checkpoints
The tutor periodically asks the learner to solve a similar but not identical task without AI help. These checkpoints distinguish assisted performance from learning.
Pattern 9: Confidence calibration
The tutor asks the learner to rate confidence before seeing feedback, then compares confidence with correctness and explanation quality. Overconfidence plus wrong reasoning triggers misconception repair. Low confidence plus correct reasoning triggers consolidation.
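A sketch of the calibration logic, assuming a 0.0-1.0 confidence scale; the thresholds are illustrative defaults, not validated cutoffs:

```python
def calibration_move(confidence: float, correct: bool) -> str:
    """Map a pre-feedback confidence rating and outcome to a tutor move."""
    if confidence >= 0.7 and not correct:
        return "misconception_repair"   # confident and wrong: contrast the error
    if confidence < 0.4 and correct:
        return "consolidate"            # right but unsure: reinforce the reasoning
    if confidence >= 0.7 and correct:
        return "fade_support"           # calibrated success: reduce help
    return "targeted_hint"              # unsure and wrong: lower load, hint locally
```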
Pattern 10: Answer-boundary handling
When the learner tries to obtain a final answer during practice, the tutor responds pedagogically:
“I won’t give the final answer yet because this is practice. Tell me your first step, and I’ll help you check it.”
This preserves learning while still offering support. SHAPE’s explicit gating approach formalizes this distinction by routing between instructing and problem-solving based on inferred mastery. (arXiv)
Pattern 11: Teacher dashboard feedback
The tutor summarizes learning evidence for instructors: common misconceptions, false mastery risks, over-help patterns, and suggested small-group interventions.
Pattern 12: Human escalation
The system should escalate when the learner shows persistent misconception, repeated failure after scaffolding, distress, ambiguous domain correctness, or high assisted performance with low unassisted mastery. Human escalation is not a failure of AI. It is part of responsible tutoring.
10. Evaluation: learning outcomes, not chatbot quality
Evaluation is now the bottleneck. A tutor can be coherent, accurate, friendly, and preferred by students while failing to improve durable learning. Conversely, a tutor can feel more demanding and still produce better retention and transfer.
The evaluation hierarchy should be:
| Level | Measure | Core question | Why it matters |
|---|---|---|---|
| 1 | Response quality | Is the tutor coherent, accurate, safe, and actionable? | Necessary but not sufficient. |
| 2 | Assisted task performance | Can the learner complete the current task with help? | Measures support, not learning. |
| 3 | Unassisted near-transfer | Can the learner solve a similar task without help? | First strong evidence of learning. |
| 4 | Delayed retention | Can the learner still perform later? | Distinguishes durable learning from short-term support. |
| 5 | Far transfer | Can the learner apply the idea in a new context? | Measures abstraction and flexible understanding. |
| 6 | Self-regulation and equity | Is the learner better at planning, monitoring, explaining, and seeking help? Who benefits or is harmed? | Measures long-term learner capacity and fairness. |
The warning is simple: Level 2 is not enough.
10.1 Benchmarks are necessary but insufficient
Benchmarks are useful for measuring pedagogical behavior at scale. The BEA 2025 shared task evaluated tutor responses for pedagogical ability across tracks such as mistake identification, guidance, and actionability, with many systems still far from expert performance. (arXiv) MRBench, MathTutorBench, TutorBench, SafeTutors, KMP-Bench, SHAPE, and related benchmarks all contribute useful measurement layers. (ACL Anthology)
But benchmarks cannot replace learning studies. A model can score well on single-turn hint quality and still fail over time by over-scaffolding, mis-sequencing practice, leaking answers under pressure, or failing to schedule retrieval. SafeTutors’ multi-turn findings are especially important here: pedagogical failure can worsen across dialogue, which means static evaluation misses a core property of tutoring. (arXiv)
10.2 RCTs need better instrumentation
Future studies should not merely compare “AI access” with “no AI access.” They should compare pedagogical policies and log learning-preserving behaviors:
| Instrumentation target | Example measure |
|---|---|
| Attempt behavior | Time before first attempt; attempt completeness. |
| Help-seeking | Hint requests, answer requests, rapid escalation. |
| Tutor behavior | Hint level, answer reveal, explanation length. |
| Self-explanation | Explanation quality and revision. |
| Cognitive offloading | Copying, solution extraction, lack of independent reasoning. |
| False mastery | Assisted success paired with unassisted failure. |
| Retention | Delayed post-test performance. |
| Transfer | Novel context or representation. |
| Equity | Differential effects by prior knowledge, language, confidence, access, and self-regulation. |
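As one example of how such logs could be reduced to actionable flags, the following sketch computes a false-mastery indicator from per-learner assisted and unassisted accuracy; the 0.3 gap threshold is an illustrative assumption:

```python
def false_mastery_flags(records: list[dict], gap: float = 0.3) -> list[str]:
    """Flag learners whose assisted success outruns their unassisted success.

    records: per-learner dicts with 'learner', 'assisted', and 'unassisted'
    accuracy in [0, 1].
    """
    return [r["learner"] for r in records
            if r["assisted"] - r["unassisted"] > gap]

flags = false_mastery_flags([
    {"learner": "s01", "assisted": 0.95, "unassisted": 0.40},  # flagged
    {"learner": "s02", "assisted": 0.80, "unassisted": 0.75},  # not flagged
])
assert flags == ["s01"]
```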
The core causal question is not “Does AI help?” It is:
Which pedagogical policies help which learners preserve which cognitive processes under which conditions?
10.3 Evaluate against active baselines
Weak evaluations compare AI tutoring against nothing. Strong evaluations compare it against active learning, human tutoring, worked examples, existing adaptive practice, teacher-led review, or human-AI co-pilots. The Harvard physics RCT is notable because it used an authentic active-learning comparator rather than a passive baseline. (Nature)
11. Deployment principles
11.1 Build backward from independent performance
Start with what learners must be able to do without AI. Then design AI-supported practice to prepare them for that independent performance. If the final goal is unaided reasoning, practice must include unaided reasoning.
11.2 Separate practice mode from assessment mode
Learners and teachers should know when AI help is allowed, what kind of help is allowed, and what counts as independent mastery. AI-supported homework should not be treated as equivalent to unaided assessment.
11.3 Make the help policy explicit
The policy should define when the tutor may ask a question, give a hint, show a worked example, reveal an answer, correct directly, request self-explanation, or escalate to a human. If the policy exists only as a prompt, it is too fragile.
11.4 Treat prompts as instructional code
Tutor prompts, rubrics, hints, examples, and policies should be versioned, reviewed, tested, and connected to learning outcome data. The success of structured AI tutoring depends on instructional engineering, not just model choice.
11.5 Instrument for substitution
The system should detect patterns that suggest cognitive offloading: requests for final answers before attempts, copying tutor output, skipping explanation prompts, high assisted correctness with low unassisted correctness, and rapid hint escalation. These signals should be treated as learning-risk indicators, not automatically as misconduct.
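These signals can be computed directly from interaction logs. The sketch below assumes illustrative field names and thresholds; the output is a set of learning-risk indicators for teachers, not a misconduct detector:

```python
def substitution_risks(session: dict) -> list[str]:
    """Collect learning-risk indicators from one session's interaction log."""
    risks = []
    if session["answer_requests_before_attempt"] > 0:
        risks.append("asked for answers before attempting")
    if session["copied_tutor_output_ratio"] > 0.5:
        risks.append("copied most tutor output verbatim")
    if session["skipped_explanations"] > 2:
        risks.append("skipped self-explanation prompts")
    if session["assisted_accuracy"] - session["unassisted_accuracy"] > 0.3:
        risks.append("high assisted but low unassisted accuracy")
    if session["mean_seconds_to_hint"] < 15:
        risks.append("rapid hint escalation")
    return risks
```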
11.6 Preserve teacher agency
Teachers should be able to inspect curriculum grounding, see what hints the tutor gives, adjust pedagogical settings, review learner evidence, and override the system. Tutor autonomy should increase only where evidence supports it.
11.7 Govern model changes
A deployed tutor can change when the underlying model changes. Learning systems therefore need model-version logging, regression tests, benchmark gates, and revalidation protocols before updates affect learners.
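A regression gate can be expressed as an explicit check over pedagogical benchmark metrics before an update ships. The metric names and tolerance below are illustrative assumptions:

```python
def passes_regression_gate(old_metrics: dict, new_metrics: dict,
                           tolerance: float = 0.02) -> bool:
    """Block a model update if any pedagogical metric regresses beyond tolerance."""
    higher_is_better = {"scaffolding_quality", "misconception_detection"}
    for name, old in old_metrics.items():
        new = new_metrics[name]
        if name in higher_is_better:
            regressed = old - new > tolerance   # e.g. weaker scaffolding
        else:
            regressed = new - old > tolerance   # e.g. higher answer-leak rate
        if regressed:
            return False
    return True
```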
12. Research agenda
12.1 Long-term retention and transfer
Most generative tutor studies still measure short-term outcomes. The field needs semester-scale and year-scale studies with delayed post-tests, far-transfer tasks, and usage logs that distinguish productive AI use from substitution.
12.2 Learner modeling for cognitive offloading
The field needs better models of help-seeking and offloading. A learner who asks for an explanation after attempting a problem is different from a learner who immediately asks for a final answer. The model should track not only what the learner knows, but how the learner uses help.
12.3 Subject-specific pedagogical harms
SafeTutors reports that tutoring harms can be subject-specific. (arXiv) Mathematics, writing, programming, language learning, science labs, clinical reasoning, and workplace simulations have different answer boundaries, misconception structures, and transfer goals. Evaluation should not assume one universal tutor policy.
12.4 Dialogic training data
ConvoLearn shows one route: build datasets around learning-science dimensions and train models toward constructive dialogue. (arXiv) More work is needed across age groups, cultures, languages, accessibility needs, and domains.
12.5 Pedagogical reasoning models
PedagogicalRL-Thinking argues that educational alignment should shape not only the visible response but also the model’s pedagogical reasoning process. It introduces pedagogical reasoning prompting and a thinking reward to improve instructional decision-making. (arXiv) This is a promising direction, but it must be validated against real student learning, not only proxy ratings.
12.6 Human-AI tutoring configurations
The field should compare multiple operating modes:
| Mode | Description | Best use case |
|---|---|---|
| Autonomous AI tutor | AI directly tutors the learner. | Low-stakes practice with strong guardrails and checks. |
| Human-supervised AI tutor | AI drafts, human approves or edits. | Higher-stakes tutoring, younger learners, uncertain domains. |
| Tutor co-pilot | AI suggests moves to a human tutor. | Scaling tutor quality and supporting novice tutors. |
| Teacher dashboard assistant | AI summarizes evidence and recommends interventions. | Classroom orchestration. |
| Simulation debriefer | AI analyzes performance after role-play or practice. | Workplace learning and professional education. |
The LearnLM/Eedi and Tutor CoPilot studies represent different points in this design space: AI-drafted messages with human gating versus AI suggestions that help human tutors compose better moves. (arXiv)
12.7 Equity and access
Cognitive offloading may widen gaps if learners with strong metacognition use AI productively while learners with weaker foundations use it substitutively. (University of Technology Sydney) Equity evaluation should therefore measure not only average learning gains but differential effects by prior knowledge, self-regulation, language, disability, access, and teacher support.
13. Conclusion
Generative AI tutors will not be judged by whether they sound like patient teachers. They will be judged by whether learners can later perform without them.
The strongest current evidence supports a balanced position. Generative AI can produce meaningful learning gains when embedded in structured, grounded, scaffolded systems. It can support human tutors, personalize explanations, accelerate feedback, and make practice more responsive. But unrestricted AI access can also create false mastery, cognitive offloading, answer substitution, and long-term performance losses.
The next generation of AI tutoring should therefore be designed around learning preservation. The central instructional question is not “How can the tutor be more helpful?” It is “What help preserves the learner’s necessary cognitive work?”
That question leads to a concrete design agenda: attempt gates, hint ladders, self-explanation prompts, misconception contrast, faded scaffolding, retrieval checks, unassisted checkpoints, confidence calibration, grounded generation, teacher dashboards, and explicit pedagogical policies. It also leads to a concrete evaluation agenda: response quality and assisted performance are only the first levels. The decisive outcomes are unassisted near-transfer, delayed retention, far transfer, self-regulation, and equity.
The frontier is not a chatbot that answers every question. It is a tutor that knows when not to answer.
Appendix A: Failure-mode taxonomy for learning-preserving AI tutors
| Category | Failure mode | Observable signal | Mitigation |
|---|---|---|---|
| Help regulation | Answer over-disclosure | Final answer before attempt | Attempt gate |
| Help regulation | Premature worked solution | Full solution when learner needs small cue | Hint ladder |
| Help regulation | Over-scaffolding | Learner succeeds only with step-by-step prompts | Faded scaffolding |
| Diagnosis | Misconception reinforcement | Tutor validates wrong reasoning | Misconception contrast |
| Diagnosis | Shallow feedback | Generic praise or vague correction | Error-specific feedback |
| Dialogue | Multi-turn drift | Scaffolding degrades into answer-giving | Dialogue-state tracking |
| Dialogue | Tutor sycophancy | Tutor agrees with incorrect learner | Correctness-over-rapport policy |
| Cognitive effort | Cognitive offloading | Learner asks for output before thinking | Attempt and explanation gates |
| Cognitive effort | Lack of retrieval | Tutor supplies facts immediately | Recall-before-explain prompts |
| Cognitive effort | Lack of transfer | Only identical assisted practice | Near- and far-transfer checks |
| Mastery | False mastery | High assisted, low unassisted performance | Unassisted checkpoints |
| Metacognition | Poor confidence calibration | High confidence with wrong reasoning | Confidence-performance comparison |
| Agency | Low learner agency | Tutor chooses every move | Learner choice and planning prompts |
| Equity | Inequitable adaptation | Lower-performing learners get narrower tasks | Monitor transfer opportunities by subgroup |
| Governance | Uninspected model drift | Tutor behavior changes after model update | Regression tests and model-version logging |
Appendix B: Evaluation framework
B1. Minimum viable evaluation
A minimally credible AI tutor evaluation should include:
- A baseline stronger than “no support.”
- Measures of assisted performance.
- Measures of unassisted near-transfer.
- Delayed retention when feasible.
- Dialogue logs coded for help type and hint level.
- Subgroup analysis.
- Evidence of tutor accuracy and grounding.
- A record of answer reveals and worked-example use.
B2. Strong evaluation
A strong evaluation should add:
- Random assignment where possible.
- Active comparison conditions such as human tutoring, existing adaptive practice, or teacher-led review.
- Far-transfer tasks.
- Self-explanation quality measures.
- Confidence calibration.
- Help-seeking behavior analysis.
- Teacher dashboard use.
- Longitudinal follow-up.
- Model-version tracking.
- Cost and implementation analysis.
B3. Core rubric
| Dimension | Weak tutor | Strong tutor |
|---|---|---|
| Accuracy | Often fluent but unverified | Grounded and verified where needed |
| Help timing | Gives answers on request | Requires attempts and regulates help |
| Feedback | Generic or verbose | Specific, local, actionable |
| Scaffolding | None or excessive | Graduated and faded |
| Misconceptions | Misses or reinforces them | Identifies and contrasts them |
| Dialogue | Explains at learner | Elicits and responds to learner thinking |
| Retrieval | Supplies information immediately | Prompts recall first |
| Transfer | Optimizes current task | Schedules unassisted transfer |
| Metacognition | Ignores confidence and strategy | Builds planning, monitoring, checking |
| Evaluation | Measures satisfaction and completion | Measures retention, transfer, self-regulation |
Appendix C: Annotated bibliography
C1. Historical tutoring and intelligent tutoring systems
Bloom (1984), “The 2 Sigma Problem.” Establishes one-to-one tutoring and mastery learning as aspirational benchmarks for instructional effectiveness. (Sage Journals)
VanLehn (2011), “The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems.” Compares human and computer tutoring and helps situate AI tutors within the broader tutoring literature. (Tandfonline)
Kulik and Fletcher (2016), “Effectiveness of Intelligent Tutoring Systems.” Meta-analysis of controlled ITS evaluations; useful for historical comparison and for the warning that effects depend on measurement and alignment. (Sage Journals)
Roschelle et al. (2016), ASSISTments field trial. Shows the value of immediate feedback plus teacher analytics in a classroom homework system. (Sage Journals)
Nickow, Oreopoulos, and Quan (2024), tutoring meta-analysis. Summarizes experimental evidence on tutoring as one of the stronger education interventions, while emphasizing implementation variation. (Sage Journals)
Leite et al. (2025), K–12 ITS meta-analysis. Updates the ITS evidence base and identifies moderators such as worked examples, duration, outcome type, and immediate measurement. (arXiv)
C2. Generative AI tutoring trials and field evidence
Kestin et al. (2025), AI tutoring in undergraduate physics. Shows that a carefully scaffolded GPT-4 tutor can outperform an active-learning classroom comparator in short-term learning outcomes. (Nature)
Bastani et al. (2025), generative AI without guardrails in high-school mathematics. Central warning study: unrestricted AI can improve practice performance while reducing later unassisted performance; guarded tutoring mitigates the harm. (SSRN)
De Simone et al. (2025), World Bank Nigeria trial. Evaluates a structured GPT-4/Copilot-supported after-school English program in a lower-resource context. (Open Knowledge Repository)
Wang et al. (2025), Tutor CoPilot. Demonstrates human-AI tutoring support, with AI helping tutors use more effective instructional moves. (arXiv)
LearnLM Team, Google DeepMind and Eedi (2025/2026). Classroom RCT showing that a pedagogically fine-tuned model, supervised by expert tutors, can support novel mathematics problem solving. (arXiv)
Three Years with Classroom AI in Introductory Programming (2026). Longitudinal evidence on how student-AI interaction practices evolve across AI-supported programming cohorts. (arXiv)
C3. Pedagogical safety, benchmarks, and tutor behavior
SafeTutors (2026). Defines tutoring safety around educational harms and shows that multi-turn interactions can reveal much worse pedagogical failure than single-turn tests. (arXiv)
MathDial (2023). Dialogue tutoring dataset showing that problem-solving ability does not automatically produce good tutoring behavior. (arXiv)
MRBench / Maurya et al. (2025). Provides a taxonomy for evaluating AI tutor responses across pedagogical dimensions such as mistake identification, guidance, and actionability. (ACL Anthology)
MathTutorBench (2025). Evaluates open-ended pedagogical capabilities of LLM tutors and reinforces the solver-tutor gap. (arXiv)
TutorBench (2025). Assesses adaptive explanation, actionable feedback, and active-learning hint generation; reports that frontier models still struggle with core tutoring skills. (Scale)
BEA 2025 Shared Task. Large shared task on pedagogical ability assessment, showing substantial remaining room for improvement in tutor-response evaluation. (arXiv)
KMP-Bench (2026). K–8 mathematical pedagogical benchmark with dialogue and skill modules; useful for evaluating multi-turn tutoring and granular pedagogical skills. (arXiv)
SHAPE (2026). Evaluates tutoring behavior under answer-inducing student prompts and proposes graph-augmented gating based on prerequisite and mastery inference. (arXiv)
EduGuardBench (2026). Evaluates professional fidelity and teaching-specific harms in LLMs acting as teachers, including the ability to convert inappropriate requests into teachable moments. (arXiv)
Answer Leakage Robustness (2026). Studies cases where learners actively try to obtain final answers, making answer-boundary maintenance a first-class tutoring evaluation problem. (arXiv)
C4. Dialogic tutoring and model alignment
ConvoLearn (2026). Provides a learning-science grounded dataset for dialogic AI tutors across dimensions such as cognitive engagement, formative assessment, accountability, metacognition, and power dynamics. (arXiv)
Scarlatos et al. (2025), training LLM tutors for learning outcomes. Uses student modeling and pedagogical rubrics to train tutors toward improved student correctness while preserving pedagogical quality. (arXiv)
PedagogicalRL-Thinking (2026). Extends pedagogical alignment to reasoning models through pedagogical reasoning prompting and a thinking reward. (arXiv)
Hierarchical Pedagogical Oversight (2025/2026). Uses structured oversight to detect pedagogical failures such as sycophancy and overly direct answer-giving. (arXiv)
C5. Cognitive effort, offloading, and learning science
OECD Digital Education Outlook 2026. Policy-level synthesis distinguishing generative AI as learning partner from generative AI as shortcut, with emphasis on teaching principles and cognitive effort. (OECD)
Lodge and Loble (2026), cognitive offloading and education. Argues that AI offloading can be beneficial or detrimental depending on domain knowledge, metacognition, and instructional guidance. (University of Technology Sydney)
Chi and colleagues, self-explanation research. Establishes self-explanation as a durable learning mechanism relevant to tutor prompts and dialogue evaluation. (ScienceDirect)
Sweller and cognitive load theory. Provides the theoretical basis for reducing extraneous load while preserving productive learning effort. (Wiley Online Library)
Roediger and Karpicke, retrieval practice. Supports the need for recall and unassisted checks in AI tutoring systems. (Sage Journals)
Kapur, productive failure. Supports the idea that well-designed struggle can prepare learners for later instruction and deeper understanding. (Tandfonline)
C6. Grounding, assessment, and deployment
Pardos and Bhandari (2024), ChatGPT-generated help. Evaluates AI-generated hints and help content, relevant to authoring and feedback design. (PLOS)
LLM-powered assessment RAG for higher education (2026). Demonstrates grounded generation over rubrics, exemplars, and instructor feedback at realistic essay-assessment scale. (arXiv)
Supplied base paper on generative AI tutors and adaptive learning systems. Provides the initial evidence synthesis and architecture that this paper repositions around learning preservation.
Supplied 2026 source update map. Identifies newer work on LearnLM/Eedi, cognitive offloading, benchmark expansion, pedagogical reasoning models, answer-boundary robustness, and RAG deployment.