Human Learning · Checklist · May 1, 2026 · Yellow — detail controls

Designing AI Assistance That Preserves Learning: An LX Implementation Checklist

Quick Answer

This checklist helps learning experience leads, learning engineers, and instructional designers build or procure AI tutoring systems that scaffold cognition rather than substitute for it. Run it during design review, vendor selection, or pre-launch hardening. It operationalizes seven design controls — objective classification, attempt-first gating, hint ladders, self-explanation, verification, scaffold fading, and AI-off transfer assessment — that separate learning-preserving tutors from answer engines. Pair with the cognitive offloading explainer for threat-model context.

This checklist hardens AI tutoring and AI-assisted training systems against the canonical failure mode documented by Bastani et al. (PNAS 2025): assisted performance rises while independent capacity falls. It is for learning experience leads, learning engineers, and instructional designers building or procuring such systems, and it operationalizes the design controls in the source synthesis on AI assistance, critical thinking, and cognitive offloading. For threat-model background, read what cognitive offloading is in AI-assisted learning first. Per the source brief, prompt-string templates and vendor-specific procurement language are intentionally omitted; the controls below are stated at the architectural and instructional level.

Checks: 20 total (13 MUST, 5 SHOULD, 2 NICE)

How to use this checklist

Run this checklist at two points: at design review or procurement (gate before build/buy), and again per release or quarterly thereafter (drift check). The owner is whoever holds instructional-design accountability for the program — typically the LX lead working with a learning engineer. "Done" means every MUST is implemented and verifiable, every SHOULD has either an implementation or a documented exception, and the assisted-versus-unassisted performance gap is measurable and reported.

Define the learning objective and offload boundary

3 checks

Classify each learning objective as a target skill or a support skill

MUST

Why it matters

Bastani et al.'s GPT Base failure mode (+48% practice performance, −17% later unassisted exam) is what happens when AI generates exactly the cognitive operations the learner is supposed to practice. Without an explicit list, the system cannot know what to withhold.

How to implement

Produce a per-unit table mapping each objective to "target" (must not be AI-generated) or "support" (safe to offload). Tie to the task analysis or Bloom-level breakdown the team already uses.
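
A minimal sketch of such a table expressed as data rather than prose, assuming hypothetical objective identifiers and artifact labels; the point is that every objective resolves to a machine-checkable rule about what the AI may generate.

```python
from dataclasses import dataclass
from enum import Enum

class SkillKind(Enum):
    TARGET = "target"    # operations the learner must practice; AI must not generate them
    SUPPORT = "support"  # operations that are safe to offload

@dataclass(frozen=True)
class ObjectiveRule:
    objective_id: str                    # from the existing task analysis or Bloom-level breakdown
    kind: SkillKind
    forbidden_outputs: tuple[str, ...]   # artifact types the AI may not produce for this objective

# Hypothetical per-unit table; IDs and artifact names are illustrative only.
UNIT_OBJECTIVES = [
    ObjectiveRule("U3.1-derive-closed-form", SkillKind.TARGET,
                  ("full_solution", "worked_proof")),
    ObjectiveRule("U3.2-format-citations", SkillKind.SUPPORT, ()),
]

def generation_allowed(objective_id: str, artifact_type: str) -> bool:
    """Return True only if the requested artifact is not forbidden for this objective."""
    for rule in UNIT_OBJECTIVES:
        if rule.objective_id == objective_id:
            return artifact_type not in rule.forbidden_outputs
    return False  # unknown objectives default to no generation
```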

Verify it's done

An auditor can pull the table for any unit and trace each objective to a system rule that constrains what AI is permitted to generate during that unit.

Specify what the AI must NOT generate for target skills

MUST

Why it matters

Without a negative specification, the model defaults to fluent completion of whatever the learner asks. Positive guidance alone reproduces the GPT Base condition.

How to implement

For each target skill, write a "do not generate" list — full solutions, first drafts, complete code blocks, finished proofs, polished essays — and encode it as a system policy or retrieval gate.
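
A minimal sketch of the negative specification encoded as a policy gate in front of the model, assuming hypothetical skill IDs and a hypothetical upstream classifier that labels what artifact the learner is asking for.

```python
# Per target skill, the artifacts the AI may never produce; labels are illustrative.
DO_NOT_GENERATE: dict[str, set[str]] = {
    "essay-argumentation": {"polished_essay", "first_draft"},
    "proof-writing": {"finished_proof", "full_solution"},
}

def gate_request(skill_id: str, requested_artifact: str) -> str:
    """Decide how the system may respond; a forbidden request is downgraded, never fulfilled."""
    if requested_artifact in DO_NOT_GENERATE.get(skill_id, set()):
        return "respond_at_hint_level"   # hint-level response or refusal, per the ladder
    return "respond_normally"
```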

Verify it's done

Direct "do it for me" probes across each target skill produce a hint-level response or refusal, not the artifact.

Align AI behavior with the unassisted assessment

MUST

Why it matters

If the assessment measures unassisted reasoning but the formative loop offloads it, the program produces performance-learning dissociation rather than learning.

How to implement

For every summative or transfer assessment, document which cognitive operations the learner must perform alone, and confirm the formative experience required them to perform those operations N times without AI assistance before the assessment.

Verify it's done

An instructional reviewer can map each assessment item to formative episodes where the learner produced the same operation unassisted.

Gate help with attempt-first interaction

3 checks

Require a visible learner attempt before substantive help

MUST

Why it matters

Attempt-first gating is the operational form of productive struggle. Without it, the system rewards offloading at the moment of first difficulty.

How to implement

Block hint levels above "restate goal" and "focusing question" until the learner has submitted an attempt, sketch, or written commitment to an approach. Require minimum content; empty submissions do not unlock the ladder.
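
A minimal sketch of the gate, assuming a hypothetical per-task session record of prior attempts; the minimum-content threshold is illustrative.

```python
MIN_ATTEMPT_CHARS = 80                               # illustrative minimum-content threshold
ALWAYS_AVAILABLE = {"restate_goal", "focusing_question"}

def hint_available(requested_level: str, prior_attempts: list[str]) -> bool:
    """Unlock levels above the lowest two only after a substantive learner attempt."""
    if requested_level in ALWAYS_AVAILABLE:
        return True
    # Empty or trivial submissions do not unlock the ladder.
    return any(len(attempt.strip()) >= MIN_ATTEMPT_CHARS for attempt in prior_attempts)
```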

Verify it's done

Trace logs show no hint at "strategic" level or above without a prior learner artifact in the same task. Spot-check at least twenty sessions per release.

Reject paraphrased "do it for me" requests for target skills

SHOULD

Why it matters

Even with attempt-first gating, learners reframe requests to extract solutions ("just show me a worked example exactly like mine"). The system has to recognize the reframing and decline it.

How to implement

Classify incoming requests against the negative-generation list from the prior domain; respond at the minimum sufficient hint level instead of complying with the rephrased request.

Verify it's done

A scripted set of paraphrased extraction prompts produces hint-level responses, not solutions, in at least 95% of attempts.

Constrain the hint surface for low-prior-knowledge learners

SHOULD

Why it matters

Lee et al. (CHI 2025) document a Matthew effect — high-prior-knowledge learners use AI as an amplifier, low-prior-knowledge learners use it as a substitute. Identical hint surfaces magnify the gap.

How to implement

Use a prior-knowledge probe or early-task signal to start novices at restricted hint levels (no full worked substeps for the first N tasks) and widen as competence shows in process data.
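
A minimal sketch of the restriction, assuming a hypothetical prior-knowledge probe scored 0–1 and integer hint levels on the six-level ladder defined in the next domain (1 = restate goal, 6 = full solution); the cutoffs are illustrative.

```python
N_RESTRICTED_TASKS = 5       # illustrative length of the restricted window
NOVICE_MAX_LEVEL = 4         # strategic hints at most; no worked substeps or full solutions
DEFAULT_MAX_LEVEL = 6

def max_hint_level(prior_knowledge_score: float, tasks_completed: int,
                   competence_shown_in_process_data: bool) -> int:
    """Cap the hint surface for low-prior-knowledge learners until competence shows."""
    is_novice = prior_knowledge_score < 0.5                       # illustrative cutoff
    in_window = tasks_completed < N_RESTRICTED_TASKS
    if is_novice and in_window and not competence_shown_in_process_data:
        return NOVICE_MAX_LEVEL
    return DEFAULT_MAX_LEVEL
```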

Verify it's done

Hint-level distributions differ by cohort; novice cohorts show a lower median hint level used per task in the first unit than experienced cohorts.

Implement a hint ladder, not an answer button

3 checks

Implement graded hint levels with explicit semantics

MUST

Why it matters

A single "help" affordance collapses to "give answer." Kestin et al.'s PS2 Pal and Bastani et al.'s GPT Tutor both succeeded by structuring help into discrete levels.

How to implement

Define at least six levels — restate goal, focusing question, concept name, strategic hint, worked substep, full solution — with authored or generation-constrained outputs at each level.
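
A minimal sketch of the ladder as an explicit enumeration, so that every help response in the trace logs carries exactly one level tag.

```python
from enum import IntEnum

class HintLevel(IntEnum):
    RESTATE_GOAL = 1
    FOCUSING_QUESTION = 2
    CONCEPT_NAME = 3
    STRATEGIC_HINT = 4
    WORKED_SUBSTEP = 5
    FULL_SOLUTION = 6
```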

Verify it's done

A design document enumerates the levels and their semantics; system traces tag every help response with one level.

Default to the minimum sufficient hint level

MUST

Why it matters

Bias toward lower levels protects struggle. Bias toward higher levels (or single-shot answers) collapses to the GPT Base failure mode regardless of how many levels exist on paper.

How to implement

Start every help request at the lowest level not yet tried for this task; require an additional learner action — re-attempt or explicit escalation — to advance the ladder.
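
A minimal sketch of the escalation rule, using integer levels on the 1–6 ladder above; the requirement that a re-attempt or explicit escalation precede any advance is the load-bearing part.

```python
def next_hint_level(levels_granted: list[int], learner_acted_since_last_hint: bool,
                    max_level: int = 6) -> int | None:
    """Default to the minimum sufficient level and advance one rung at a time."""
    if not levels_granted:
        return 1                               # first request starts at the lowest rung
    if not learner_acted_since_last_hint:
        return None                            # no re-attempt or escalation: do not advance
    return min(max(levels_granted) + 1, max_level)
```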

Verify it's done

Across a sample of completed tasks, the modal hint level used is below "worked substep," and full solutions stay below a defined per-unit threshold.

Ground hints and worked solutions in expert-authored material

SHOULD

Why it matters

PS2 Pal used prewritten solutions; the GPT Tutor condition used structured prompts. Spontaneously generated full solutions risk hallucination and drift from the learning objective.

How to implement

For each task, supply expert-authored or instructor-vetted solutions and intermediate hints as the source of truth; constrain generation to paraphrase and explain rather than invent.
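
A minimal sketch of the grounding constraint, assuming a hypothetical authored-hint store keyed by task and level; anything without an authored source goes to the review queue instead of being generated from scratch.

```python
# Hypothetical source-of-truth store; keys and text are illustrative.
AUTHORED_HINTS: dict[tuple[str, int], str] = {
    ("task-42", 4): "Consider what happens to the boundary term as x approaches zero.",
}

def grounded_hint_source(task_id: str, hint_level: int) -> str | None:
    """Return the authored material the AI may paraphrase and explain; None means the
    task lacks vetted material and must be routed to the flagged review queue."""
    return AUTHORED_HINTS.get((task_id, hint_level))
```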

Verify it's done

Hint and solution outputs trace to authored source material for at least 90% of tasks per unit; the rest sit in a flagged review queue.

Require self-explanation and verification as workflow steps

3 checks

Require learner reconstruction after every AI explanation

MUST

Why it matters

Fluent AI output produces an illusion-of-understanding effect. Reconstruction forces the cognitive work that produces actual encoding rather than recognition.

How to implement

After any AI explanation, gate progression on a learner-produced summary, paraphrase, or worked answer to a near-isomorphic prompt. Advance state only on acceptable reconstruction.
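
A minimal sketch of the progression gate, assuming a rubric score in 0–1 for the learner's reconstruction (human- or model-scored) and an illustrative acceptance threshold.

```python
RECONSTRUCTION_THRESHOLD = 0.7   # illustrative rubric cutoff

def may_advance(ai_explanation_shown: bool, reconstruction_score: float | None) -> bool:
    """After an AI explanation, advance state only on an acceptable learner reconstruction."""
    if not ai_explanation_shown:
        return True                           # nothing to reconstruct
    if reconstruction_score is None:
        return False                          # no summary, paraphrase, or isomorph answer yet
    return reconstruction_score >= RECONSTRUCTION_THRESHOLD
```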

Verify it's done

Task-transition logs show a learner-authored reconstruction event between every AI explanation and the next problem; reject rates and reconstruction quality are surfaced to the dashboard.

Make verification a first-class, scored task

MUST

Why it matters

Automation bias predicts uncritical acceptance of AI output. If verification is implicit, learners skip it, and accepted-but-wrong explanations propagate as misconceptions.

How to implement

Periodically present AI outputs — some intentionally wrong — and require the learner to cite a source, compare to a known case, predict a failure mode, or run a test before accepting each one. Score the verification, not just the final answer.
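
A minimal sketch of scoring the verification behaviour itself, assuming seeded items whose ground truth (correct or intentionally wrong AI output) is known to the system; field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class VerificationItem:
    ai_output_is_wrong: bool    # ground truth for the seeded item
    learner_accepted: bool      # did the learner accept the AI output?
    evidence_provided: bool     # citation, comparison case, predicted failure mode, or test run

def verification_score(item: VerificationItem) -> float:
    """Score the accept/reject call, with full credit only when evidence is supplied."""
    correct_call = item.learner_accepted != item.ai_output_is_wrong
    base = 1.0 if correct_call else 0.0
    return base if item.evidence_provided else base * 0.5

def false_acceptance_rate(items: list[VerificationItem]) -> float:
    """Share of intentionally wrong outputs the learner accepted anyway."""
    wrong = [i for i in items if i.ai_output_is_wrong]
    return sum(i.learner_accepted for i in wrong) / len(wrong) if wrong else 0.0
```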

Verify it's done

Each unit contains at least one explicit verification task; error-detection accuracy and false-acceptance rate are reported per learner.

Apply cognitive forcing functions at high-stakes decision points

NICE

Why it matters

Buçinca et al. (CSCW 2021) show that cognitive forcing functions reduce overreliance on AI suggestions in decision tasks compared to passive presentation of AI output.

How to implement

At high-stakes decision points, require the learner to commit to an answer before the AI's suggestion is shown, or to explain their disagreement before overriding their own answer.
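
A minimal sketch of the two forcing functions named above, assuming the commitment is logged with a timestamp before the suggestion is rendered.

```python
from datetime import datetime

def may_show_ai_suggestion(learner_commit_time: datetime | None) -> bool:
    """Reveal the AI suggestion only after the learner has committed to an answer."""
    return learner_commit_time is not None

def may_override_committed_answer(disagreement_rationale: str) -> bool:
    """Require an explicit rationale before the learner replaces their committed answer."""
    return bool(disagreement_rationale.strip())
```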

Verify it's done

System logs show a learner commitment timestamp prior to AI-suggestion display in tasks tagged as forcing-function decision points.

Fade scaffolds and assess transfer AI-off

3 checks

Define and implement an explicit scaffold fading policy

MUST

Why it matters

The intelligent tutoring systems (ITS) literature is consistent: scaffolds that do not fade become permanent crutches. Without a written policy, fading does not happen.

How to implement

Choose a regime — mastery-based (hint surface narrows after demonstrated competence), time-based (hints constrict over weeks), or transfer-gated (full assistance returns only after an AI-off check passes) — and encode it in the learner state model.
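
A minimal sketch of the mastery-based regime, assuming a mastery estimate in 0–1 held in the learner state model and integer levels on the 1–6 ladder; the time-based and transfer-gated regimes swap the trigger, not the shape of the function. Thresholds are illustrative.

```python
def allowed_max_level(mastery_estimate: float) -> int:
    """Mastery-based fading: the hint surface narrows as competence is demonstrated."""
    if mastery_estimate < 0.4:
        return 5    # worked substeps still available early on
    if mastery_estimate < 0.7:
        return 4    # strategic hints at most
    return 2        # only restate-goal and focusing-question once mastery is high
```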

Verify it's done

A written fading policy exists per unit; system traces show hint surface narrowing for individual learners over time, traceable to the chosen trigger.

Run an AI-off delayed transfer assessment for every unit

MUST

Why it matters

Bastani et al.'s −17% later unassisted exam is the canonical signal of the failure mode. Without AI-off assessment, the program literally cannot detect it.

How to implement

Schedule at least one assessment per unit, delayed by days or weeks, with AI fully unavailable. Cover near, medium, and far transfer items; include at least one item targeting metacognitive or adversarial transfer.

Verify it's done

An AI-off assessment exists for every unit; results are stored and reportable separately from assisted-formative scores.

Report the assisted-versus-unassisted performance gap, not just assisted scores

SHOULD

Why it matters

A program dashboard that shows only assisted performance hides the dissociation. The gap, not the level, is the diagnostic.

How to implement

For each unit and cohort, compute assisted formative score and AI-off transfer score; surface the delta over time and flag widening gaps.
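
A minimal sketch of the gap metric and alert, assuming both scores are normalised to 0–1 per unit and cohort; the alert threshold is illustrative.

```python
GAP_ALERT_THRESHOLD = 0.15   # illustrative

def performance_gap(assisted_formative: float, ai_off_transfer: float) -> float:
    """A positive gap means assisted performance outruns unassisted transfer (the warning sign)."""
    return assisted_formative - ai_off_transfer

def gap_alert(gaps_over_time: list[float]) -> bool:
    """Flag when the latest gap breaches the threshold or a positive gap is widening."""
    if not gaps_over_time:
        return False
    latest = gaps_over_time[-1]
    widening = len(gaps_over_time) >= 2 and latest > gaps_over_time[-2]
    return latest > GAP_ALERT_THRESHOLD or (widening and latest > 0)
```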

Verify it's done

A program dashboard displays the gap per cohort; alerts trigger when the gap exceeds a defined threshold.

Instrument process, not just artifacts

3 checks

Log learning-process events, not just chat turns

MUST

Why it matters

Chat logs alone miss the learning signal. Attempts, revisions, hint escalations, time-on-task, and self-explanations are the actionable instrumentation.

How to implement

Define an event schema covering attempt submission, hint level requested, hint level granted, reconstruction submitted and accepted, verification outcome, and escalation events. Log per learner per task.
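
A minimal sketch of the schema as typed records; field names are illustrative, and a real schema would add program, unit, and release identifiers.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

EventType = Literal[
    "attempt_submitted",
    "hint_requested",
    "hint_granted",
    "reconstruction_submitted",
    "reconstruction_accepted",
    "verification_outcome",
    "escalation",
]

@dataclass
class LearningEvent:
    learner_id: str
    task_id: str
    event_type: EventType
    timestamp: datetime
    hint_level: int | None = None    # populated on hint_requested / hint_granted
    payload: dict | None = None      # e.g. reconstruction text, verification evidence
```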

Verify it's done

A schema document exists; sampling 100 tasks recovers all event types; analytics queries can compute hint-ladder distributions and reconstruction quality per learner.

Track confidence calibration and error-detection performance

NICE

Why it matters

Lee et al. associate confidence in AI with reduced critical-thinking enaction. Calibration data is a leading indicator of overreliance before performance drops appear.

How to implement

Periodically prompt learners to rate confidence in their own answer and in AI output; compute calibration (Brier score or analogous) and error-detection rate against seeded errors.
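
A minimal sketch of the two metrics, assuming confidence ratings in 0–1 and seeded-error items with known ground truth.

```python
def brier_score(confidences: list[float], was_correct: list[bool]) -> float:
    """Mean squared distance between stated confidence and actual correctness (lower is better)."""
    pairs = list(zip(confidences, was_correct))
    if not pairs:
        raise ValueError("no ratings to score")
    return sum((c - float(ok)) ** 2 for c, ok in pairs) / len(pairs)

def error_detection_rate(seeded_wrong_items_flagged: list[bool]) -> float:
    """Share of intentionally wrong AI outputs the learner flagged as wrong."""
    if not seeded_wrong_items_flagged:
        return 0.0
    return sum(seeded_wrong_items_flagged) / len(seeded_wrong_items_flagged)
```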

Verify it's done

Calibration metrics appear in the learner record; trend lines are visible per cohort and per learner.

Surface process metrics to coach and teacher dashboards

SHOULD

Why it matters

Wang et al.'s Tutor CoPilot work shows expert-mediated AI assistance outperforms unmediated AI. The coach needs visibility to mediate; without it, mediation collapses to oversight theater.

How to implement

Build a coach view exposing per-learner hint-ladder distribution, reconstruction quality, AI-off versus assisted gap, and verification accuracy. Update daily for cohorts and in real time for live tutoring.

Verify it's done

A coach can identify, within five minutes, the three learners in a cohort with the largest assisted-unassisted gap and the heaviest reliance on high-level hints.

Govern the human-in-the-loop and procurement review

2 checks

Define the teacher or coach role explicitly against the AI's role

MUST

Why it matters

AI tutoring is not autonomous. Without a defined human role, the program drifts to AI-as-surrogate by default, regardless of what the design documents say.

How to implement

Specify what the human owns (objective design, hint-ladder authorship, AI-off assessment design, escalation handling) and what the AI owns (just-in-time hint delivery within authored material, formative feedback, instrumentation). Train coaches and teachers against that split.

Verify it's done

A RACI or equivalent document exists; randomly sampled coaches or teachers can name their AI-related responsibilities and the boundary against AI-owned tasks.

Require AI-off outcome evidence in procurement and renewal

MUST

Why it matters

Vendors typically present engagement and assisted-task scores. Procurement that accepts these without AI-off outcomes funds the failure mode at scale across whole programs.

How to implement

Add evaluation criteria requiring vendors to report unassisted delayed transfer outcomes for cohorts using their system, with comparison to a non-AI control or a prior baseline. Reject submissions that report only assisted metrics. See related controls in the checklists index for adjacent governance artifacts.

Verify it's done

The procurement scoring rubric includes AI-off outcome evidence as a non-optional criterion; rejected vendors' responses are auditable.

Acceptance criteria

The checklist is fully implemented when, for every unit in the program, an auditor can produce: (1) a target-vs-support objective table with corresponding negative-generation rules, (2) trace logs showing attempt-first gating and a modal hint level below "worked substep," (3) authored solutions and hints backing at least 90% of tasks, (4) a learner-reconstruction event after every AI explanation and at least one scored verification task, (5) a written fading policy and at least one AI-off delayed assessment per unit, (6) a process-event log and a coach dashboard exposing the assisted-versus-unassisted gap, and (7) a defined human role and procurement rubric requiring AI-off outcome evidence. The program reports the assisted-unassisted gap per cohort over time; a stable or narrowing gap on transfer assessments — not assisted-formative scores — is the criterion of success.
