Human Learning · May 1, 2026 · Model Version: GPT 5.5

Effects of AI Assistance on Critical Thinking and Cognitive Offloading

Human Learning and Knowledge Systems


AI Assistance, Critical Thinking, and Cognitive Offloading: Designing for Learning Instead of Cognitive Debt

AI assistance creates a new class of learning-system problem: tools that improve assisted performance while potentially weakening the learner’s independent capacity for judgment, transfer, and error detection. In learning-science terms, the issue is not “AI use” in general; it is the interaction among cognitive offloading, productive struggle, metacognitive monitoring, and scaffold fading. The field now has enough empirical signal to move past slogans. In a preregistered field experiment with nearly 1,000 high-school mathematics students, access to a GPT-4-like tutor increased practice performance by 48% in an unrestricted “GPT Base” condition and 127% in a guarded “GPT Tutor” condition, but when AI was removed, the unrestricted group scored 17% lower than students who never had AI access (Bastani et al., PNAS 2025). In a CHI 2025 survey of 319 knowledge workers contributing 936 real work examples, workers reported enacting critical thinking in only 59.29% of AI-assisted examples; higher task-specific confidence in AI predicted less critical thinking, while higher self-confidence and confidence in evaluating AI predicted more (Lee et al., CHI 2025). In contrast, a structured Harvard physics AI tutor RCT with 194 eligible students found substantially better post-test performance than in-class active learning, with an estimated effect size of 0.63 SD (0.73–1.3 SD under quantile regression to address ceiling effects), while median AI time-on-task was 49 minutes versus a 60-minute class period (Kestin et al., Scientific Reports 2025). The paper below synthesizes the current evidence into an architectural account: when AI functions as an answer engine, it can short-circuit the cognitive operations education is meant to cultivate; when it functions as a contingent scaffold, it can increase feedback quality, pacing, and access without collapsing learner agency.

The core distinction: performance support is not learning support

The mistake running through much AI-in-education discourse is treating task completion as evidence of learning. This is the same measurement error that cognitive tutors, worked examples, calculators, and search engines have forced the field to confront repeatedly: a learner can produce a correct artifact without building the internal schema required to reproduce, explain, transfer, or critique the result.

AI assistance amplifies that error because the system can now perform operations that were previously reliable traces of cognition: drafting an argument, selecting evidence, generating code, solving algebra, summarizing a paper, debugging prose, or proposing a decision. If the output is the only measured object, AI creates a false positive for learning.

The educationally relevant question is therefore not:

Did the learner complete the task better with AI?

It is:

What cognitive work did the learner perform, what work was offloaded, and what capacity remains when the scaffold is removed?

This paper uses critical thinking to mean the family of cognitive and metacognitive operations involved in analyzing claims, evaluating evidence, selecting criteria, detecting uncertainty, generating alternatives, and justifying decisions. That definition is compatible with major educational frameworks: Bloom’s higher-order categories of analysis, synthesis, and evaluation (Bloom et al., 1956); Halpern’s emphasis on transfer, metacognitive monitoring, and real-world reasoning (Halpern, American Psychologist 1998); and the Paul–Elder model’s attention to standards such as clarity, accuracy, relevance, logic, and fairness (Paul & Elder, Foundation for Critical Thinking 2008).

Cognitive offloading refers to using external tools or actions to reduce internal cognitive demands (Risko & Gilbert, Trends in Cognitive Sciences 2016). Offloading is not inherently bad. Writing, diagrams, calculators, search engines, flashcards, notebooks, checklists, and worked examples are all offloading technologies. The learning problem begins when offloading replaces the very operations that the learner needs to practice, monitor, and eventually internalize.

A calculator can offload arithmetic while preserving mathematical modeling. A map can offload route memory while preserving navigation judgment. A worked example can reduce extraneous load while supporting schema construction. But an AI assistant can offload problem framing, decomposition, evidence search, solution generation, explanation, style, and evaluation at once. That breadth changes the instructional design problem.

A mechanism model: how AI assistance changes cognition

AI assistance affects critical thinking through at least six mechanisms. Some are beneficial, some harmful, and most are conditional on design.

1. Load reduction can support learning—or remove the work that creates learning

Cognitive Load Theory distinguishes among intrinsic load, extraneous load, and germane processing (Sweller, Cognitive Science 1988; Sweller et al., Educational Psychology Review 1998). Good instructional design reduces unnecessary burdens while preserving the cognitive work that builds schemas. AI can reduce extraneous load by clarifying instructions, providing examples, translating jargon, or offering timely hints. That can help learners focus on the disciplinary structure of the problem.

The danger is that AI systems often reduce germane load as well. If the assistant decomposes the task, selects the method, performs the inference, and writes the justification, the learner’s working memory is spared—but so is the learner’s opportunity to construct the schema.

This is the central lesson of the Bastani et al. mathematics experiment. The unrestricted GPT Base interface improved practice performance while harming later unassisted exam performance. The guarded GPT Tutor, which provided hints and teacher-designed supports rather than direct answers, preserved the practice benefit without the observed 17% exam penalty (Bastani et al., PNAS 2025).

2. Offloading changes what learners monitor

Traditional learning tasks force learners to monitor their own solution path. AI-assisted tasks shift monitoring from “Can I solve this?” to “Is this output acceptable?” That is not necessarily easier. In expert work, evaluating an AI output may require deeper domain knowledge than producing a first draft.

Lee et al. found that generative AI shifts knowledge work toward verification, response integration, and task stewardship. The same study found that confidence in AI doing the task negatively correlated with perceived enaction of critical thinking, while confidence in oneself and confidence in evaluating AI outputs positively correlated with critical thinking (Lee et al., CHI 2025).

That pattern matters for education because novices often lack the evaluative expertise that AI workflows demand. A novice who cannot solve a problem independently is also poorly positioned to detect whether the AI’s solution is incomplete, hallucinated, misapplied, or subtly irrelevant.

3. Fluency produces illusions of understanding

Generative AI produces fluent, structured, authoritative language. Fluency is cognitively seductive: people often experience coherent explanations as evidence that they understand the underlying material. But recognition and comprehension are not the same as retrieval, reconstruction, or transfer.

This aligns with older evidence on digital memory and search. Sparrow, Liu, and Wegner showed that when people expect information to remain available online, they are less likely to remember the information itself and more likely to remember where to find it (Sparrow et al., Science 2011). AI extends that effect from fact retrieval to reasoning retrieval: the learner may remember that “the model can explain it” without being able to reconstruct the explanation.

4. Automation bias and overreliance reduce verification

Automation bias is the tendency to accept automated recommendations even when they are wrong, especially under time pressure or when the system appears competent (Parasuraman & Riley, Human Factors 1997; Mosier et al., International Journal of Aviation Psychology 1998). AI assistants inherit this risk and intensify it through conversational confidence.

Lee et al. reported several inhibitors of critical thinking in AI-assisted work: trust and reliance on AI, trivialization of tasks, lack of time, lack of domain knowledge for inspection, and difficulty revising prompts or improving outputs. In their sample, participants explicitly described accepting AI outputs by default when they doubted their own ability or considered the task simple (Lee et al., CHI 2025).

The learning-system implication is direct: “use AI critically” is not a sufficient intervention. Systems must make verification, comparison, justification, and uncertainty visible in the workflow.

5. AI can collapse productive struggle

Productive struggle is not the same as frustration. It is effortful engagement with a problem at the edge of current competence, supported enough to avoid failure loops but not so much that the learner stops thinking. Good scaffolds preserve productive struggle by sequencing help: prompt, hint, subgoal, worked step, explanation, then answer.

Most general-purpose chatbots invert that sequence. They provide full answers early, often in polished form. This is efficient for task completion and risky for learning. The unrestricted GPT Base condition in Bastani et al. is a concrete example of this failure mode: students used the tool as a crutch, practice scores rose, and later independent performance fell (Bastani et al., PNAS 2025).

6. AI can also increase access to high-quality feedback

The evidence is not anti-AI. Structured AI tutoring can work. Kestin et al.’s Harvard physics tutor was not an open chatbot. It was a constrained, pedagogically engineered system: expert-crafted prompts, stepwise scaffolding, active engagement, self-pacing, and prewritten solutions to reduce hallucinated instruction. In that design, students learned more in less time than in an active-learning class and reported higher engagement and motivation (Kestin et al., Scientific Reports 2025).

Tutor CoPilot shows a related pattern on the human-facing side. Rather than replacing tutors, it supports human tutors in real time with expert-like guidance. In a randomized trial involving roughly 900 tutors and 1,800 K–12 students, students whose tutors had Tutor CoPilot were about 4 percentage points more likely to master topics; gains were larger for less-experienced tutors in some reports (Wang et al., EdWorkingPaper 2024/2025).

The design principle is not “AI off” or “AI on.” It is AI as scaffold, not surrogate.

Evidence map: what we know so far

Field evidence in mathematics: the performance-learning dissociation

The strongest direct evidence on AI assistance and learning harm comes from Bastani et al.’s high-school mathematics field experiment. The study compared three conditions during practice sessions:

  1. Control: students used standard resources such as notes and textbooks.
  2. GPT Base: students used a standard GPT-4 chat interface.
  3. GPT Tutor: students used a GPT-4 interface with safeguards designed with teachers, including problem-specific solutions and instructions to provide hints rather than full answers.

The key finding is the dissociation between assisted practice performance and unassisted exam performance:

Condition  | Practice effect        | Later unassisted exam effect
GPT Base   | +48% versus control    | −17% versus control
GPT Tutor  | +127% versus control   | statistically indistinguishable from control

(Bastani et al., PNAS 2025)

This is the canonical pattern for cognitive offloading risk: the tool increases immediate success while reducing the learner’s later independent performance. The guarded tutor did not produce a statistically significant exam advantage, but it avoided the exam penalty observed in the unrestricted condition. That is already educationally important. A system that boosts practice without damaging later independence is categorically different from one that boosts practice by doing the learning-relevant work.

The design lesson is sharp: direct-answer access is not equivalent to tutoring. Tutoring is a contingent instructional process that maintains the learner’s responsibility for sense-making.

Knowledge-work evidence: critical thinking shifts from production to oversight

Lee et al.’s CHI 2025 study provides a complementary view outside formal schooling. The authors surveyed 319 knowledge workers who used generative AI at least weekly and collected 936 first-hand examples of AI use. Participants self-reported critical thinking in 555 of 936 examples (59.29%). They also reported that AI reduced effort for many Bloom-aligned cognitive activities: examples marked “much less effort” or “less effort” comprised 72% for knowledge, 79% for comprehension, 69% for application, 72% for analysis, 76% for synthesis, and 55% for evaluation (Lee et al., CHI 2025).

The regression results matter more than the descriptive percentages. Task-specific confidence in AI negatively predicted critical thinking enaction (β = −0.69, p < .001). Confidence in oneself (β = 0.26, p = .026), confidence in evaluating AI output (β = 0.31, p = .046), and general tendency to reflect on work (β = 0.52, p < .001) positively predicted critical thinking. In other words, users who already had domain confidence and reflective habits were more likely to use AI critically; users who trusted AI for the task were less likely to do so (Lee et al., CHI 2025).

That pattern creates a Matthew effect in learning systems. Learners with stronger prior knowledge can use AI as an amplifier. Learners with weaker prior knowledge may use AI as a substitute for the very reasoning practice they need.

Cross-sectional evidence: AI use, offloading, and critical-thinking scores

Gerlich’s 2025 study in Societies surveyed 666 UK participants and reported associations among AI tool use, cognitive offloading, and critical-thinking measures. The paper reports negative associations between AI use and critical thinking and a mediating role for cognitive offloading. It also reports a mediation model in which the total effect of AI usage on critical thinking was significant (b = −0.42, SE = 0.08, p < .001), the indirect effect through cognitive offloading was significant (b = −0.25, SE = 0.06, p < .001), and the direct effect remained significant (b = −0.17, SE = 0.05, p < .01) (Gerlich, Societies 2025).

This evidence should be used carefully. It is cross-sectional, relies substantially on self-report, and cannot establish causality. It is still useful as convergent evidence for a mechanism that stronger experimental studies are beginning to isolate: high reliance on AI is associated with lower engagement in independent evaluation and reasoning.

Neural and behavioral evidence: promising but early

Kosmyna et al.’s “Your Brain on ChatGPT” preprint used EEG during essay writing, comparing participants writing with ChatGPT, with search, or without external tools. The study reports weaker brain connectivity in the LLM group and describes “cognitive debt” when participants later wrote without AI (Kosmyna et al., arXiv 2025). The study attracted wide attention because it appears to show neural correlates of reduced engagement. It should be treated as early evidence, not settled science: the sample was small, the task was narrow, and the paper was a preprint at the time of reporting.

The stronger takeaway is not “AI reduces brain activity” in general. The defensible takeaway is narrower: if a writing task is structured so that the AI can perform planning and drafting, learners may engage less in the neural, linguistic, and behavioral processes normally exercised by independent writing. That is exactly what cognitive-offloading theory would predict.

Positive counterevidence: structured AI tutoring can improve learning

The Harvard PS2 Pal study is the best current counterweight to broad claims that AI necessarily erodes learning. The tutor outperformed an active-learning class in a randomized crossover design. The key is that the system was pedagogically constrained: structured sequence, active engagement, expert prompts, prewritten solutions, and careful attention to cognitive load and feedback (Kestin et al., Scientific Reports 2025).

This matches the broader intelligent-tutoring-systems literature. ITS meta-analyses have generally found positive effects, though estimates vary by population, domain, comparison condition, and implementation. VanLehn reported that step-based tutoring can approach the effectiveness of human tutoring in some domains (VanLehn, Educational Psychologist 2011). Ma et al. reported a mean effect size around g = 0.41 across ITS studies (Ma et al., Journal of Educational Psychology 2014). Kulik and Fletcher reported a median effect size around 0.66 in their meta-analysis (Kulik & Fletcher, Review of Educational Research 2016). More recent syntheses of generative AI in education report positive average effects but substantial heterogeneity; the important question is no longer whether AI can help, but which designs preserve learner cognition while scaling feedback.

Why generic chatbots are poor default learning systems

General-purpose AI assistants optimize helpfulness, fluency, and user satisfaction. Learning systems optimize durable change in the learner. Those objectives diverge in predictable ways.

The answer-first failure mode

A chatbot trained to be helpful tends to answer the learner’s literal request. If the learner asks, “Solve this,” the system solves it. If the learner asks, “Write my thesis statement,” the system writes it. If the learner asks, “Summarize this reading,” the system compresses it.

But learning often requires the system to resist the request:

  • “Before I solve it, what is your first step?”
  • “Choose which assumption you want to test.”
  • “Explain why this evidence supports your claim.”
  • “I can give a hint, but not the full answer yet.”
  • “Compare two solution paths and tell me which is more robust.”

Those moves are not friction for its own sake. They are the interactional form of scaffolding.

The verification burden shifts to the least prepared user

AI-generated answers often require domain knowledge to evaluate. This creates a paradox: the less a learner knows, the more they need help, but the less able they are to judge the help. That makes unrestricted AI most dangerous at the point of maximum novice dependence.

In expert work, this is a quality-assurance problem. In education, it is a developmental problem. The learner who accepts plausible but wrong explanations may build misconceptions. The learner who accepts correct but unexplained answers may build dependency. The learner who accepts polished language may mistake surface coherence for argument quality.

Prompting skill is not the same as disciplinary skill

“Prompt engineering” can improve interactions, but it is not a substitute for disciplinary reasoning. A student can become skilled at eliciting answers without becoming skilled at evaluating methods. Conversely, a strong learner can use simple prompts productively because they bring domain schemas to the interaction.

AI literacy curricula should therefore avoid teaching prompts as magic incantations. They should teach prompt use as a small part of a broader epistemic workflow: goal setting, hypothesis generation, source triangulation, uncertainty assessment, explanation reconstruction, and independent transfer.

A taxonomy of AI assistance by cognitive risk

Not all AI assistance carries the same learning risk. The risk depends on which cognitive operations the system performs.

Low-risk offloading: reduce barriers without replacing reasoning

Low-risk uses include:

  • translating interface text;
  • reading aloud;
  • formatting citations;
  • converting notes into flashcards after the learner has selected key concepts;
  • generating accessibility accommodations;
  • scheduling spaced practice;
  • cleaning transcription errors;
  • producing alternate representations of learner-generated work.

These supports reduce friction and extraneous load. They do not usually replace the core disciplinary cognition.

Moderate-risk offloading: useful when contingent and bounded

Moderate-risk uses include hints, explanations, examples, feedback, and Socratic questioning. These can improve learning when they are contingent on learner input and aligned to instructional goals.

The same function can shift from helpful to harmful depending on timing. An explanation after an attempt can consolidate learning. An explanation before any attempt can remove productive struggle. Feedback that points to a misconception can support revision. Feedback that rewrites the answer can replace revision.

High-risk offloading: the learner no longer performs the target skill

High-risk uses include:

  • generating full essays;
  • solving full problem sets;
  • producing code without requiring trace-level explanation;
  • selecting evidence for an argument;
  • deciding which interpretation is correct;
  • evaluating the quality of its own answer without external standards.

These uses may be acceptable in professional productivity contexts when the human has expertise and accountability. They are poor defaults for learning contexts unless the learning objective is explicitly about auditing AI output.

Design patterns that preserve critical thinking

Pattern 1: Attempt-first gating

The system should require a learner attempt before providing substantive help. The attempt can be incomplete, but it must reveal the learner’s current model.

Bad pattern

Student: Solve this quadratic.
AI: Here is the full solution.

Better pattern

Student: Solve this quadratic.
AI: First identify whether you want to factor, complete the square, or use the quadratic formula. Show your first step. I’ll respond to that.

Attempt-first gating supports retrieval, generation, and metacognitive monitoring. It also gives the system evidence for diagnosis.
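
The sketch below illustrates attempt-first gating in Python. The function and message names are illustrative placeholders, not drawn from any cited system: the assistant returns only a focusing prompt until a learner attempt exists, and only then responds at the current hint level.

    def respond(request: str, learner_attempt: str | None, hint_level: int) -> str:
        """Gate substantive help on the presence of a learner attempt."""
        if not learner_attempt or not learner_attempt.strip():
            # No attempt yet: ask for a first step instead of solving.
            return ("Before I help, show your first step or name the method "
                    "you want to try (factor, complete the square, or the formula).")
        # An attempt exists: later logic can respond to its content.
        return f"Thanks. Responding to '{request}' at hint level {hint_level}, based on your attempt."

    print(respond("Solve this quadratic.", None, hint_level=1))
    print(respond("Solve this quadratic.", "I tried factoring: (x + 2)(x + 3)", hint_level=1))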

Pattern 2: Hint ladders instead of answers

A learning-oriented AI should implement help levels:

  1. Restate the goal.
  2. Ask a focusing question.
  3. Identify the relevant concept.
  4. Give a strategic hint.
  5. Give a worked substep.
  6. Provide a partial solution.
  7. Provide the full solution only after effort or explicit instructional justification.

This is the principle behind many intelligent tutoring systems and the guarded GPT Tutor condition in Bastani et al. The point is not to withhold help indefinitely; it is to make help contingent and minimally sufficient.
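
A minimal hint-ladder sketch follows; the ladder text is illustrative rather than taken from the GPT Tutor deployment. Each request for help advances exactly one level, so the full solution appears only after the earlier levels have been used.

    HINT_LADDER = [
        "Restate the goal in your own words.",
        "Focusing question: which quantity is unknown?",
        "Relevant concept: this is a system-of-equations problem.",
        "Strategic hint: write one equation per constraint before solving.",
        "Worked substep: the first equation follows from the total count.",
        "Partial solution up to the final step.",
        "Full solution.",
    ]

    def next_hint(level: int) -> tuple[int, str]:
        """Return the current hint and the new level, never skipping ahead."""
        level = min(level, len(HINT_LADDER) - 1)
        return level + 1, HINT_LADDER[level]

    level = 0
    for _ in range(3):  # the learner asks for help three times
        level, hint = next_hint(level)
        print(hint)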

Pattern 3: Self-explanation requirements

Self-explanation is one of the most reliable ways to turn worked material into learning. AI systems should ask learners to explain:

  • why a step is valid;
  • what assumption is being used;
  • how two examples differ;
  • where an error entered;
  • how the method would change under a new condition.

This is especially important after AI gives any explanation. The learner should reconstruct the reasoning in their own words before moving on.

Pattern 4: Verification as a first-class task

If AI changes work from production to oversight, then verification must be taught and assessed. A system can require learners to:

  • cite a non-AI source;
  • compare two generated answers;
  • identify uncertainty;
  • flag assumptions;
  • run a test case;
  • explain why a source is credible;
  • predict where the AI is likely to fail.

This converts AI oversight into a learning target rather than an invisible burden.

Pattern 5: Scaffold fading

Scaffolds should fade as competence increases. A learner who always receives hints never practices problem selection and strategy choice. A learner who always receives explanations never practices explanation generation.

Fading can be implemented by the following moves (a minimal policy sketch follows the list):

  • reducing hint specificity;
  • delaying feedback;
  • switching from multiple-choice prompts to open response;
  • requiring independent solution before AI comparison;
  • moving from stepwise support to whole-task practice;
  • periodically removing AI entirely for transfer checks.
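
A minimal fading-policy sketch, assuming a mastery estimate in [0, 1] from a learner-state model and the hint-ladder levels sketched earlier; the thresholds are illustrative placeholders, not empirically derived values.

    def max_hint_level(mastery: float, transfer_check: bool = False) -> int:
        """Cap available support as estimated mastery rises (thresholds are placeholders)."""
        if transfer_check:
            return 0      # AI off: independent transfer check
        if mastery < 0.3:
            return 5      # novices may reach worked substeps
        if mastery < 0.7:
            return 3      # intermediate learners get strategic hints at most
        return 1          # near-mastery learners get focusing questions only

    print(max_hint_level(0.2))                       # 5
    print(max_hint_level(0.8))                       # 1
    print(max_hint_level(0.5, transfer_check=True))  # 0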

Pattern 6: Productive friction

Some AI friction is not a UX defect. Learning systems need carefully designed friction: pauses, predictions, confidence ratings, explanation prompts, and commit-before-reveal interactions. Buçinca et al. showed in AI decision-making contexts that cognitive forcing functions can reduce overreliance compared with simple explanations (Buçinca et al., CHI 2021). The same idea applies to learning: force a small act of cognition before the model acts.
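
A commit-before-reveal sketch in the spirit of cognitive forcing functions; the data structure and wording are hypothetical. The learner must commit a prediction and a confidence rating before the model’s answer is shown, and the pair can be retained for later calibration analysis.

    from dataclasses import dataclass

    @dataclass
    class Commitment:
        prediction: str
        confidence: float  # learner-stated confidence, 0.0 to 1.0

    def reveal_answer(commitment: Commitment | None, model_answer: str) -> str:
        """Show the model's answer only after the learner has committed."""
        if commitment is None:
            return "Commit a prediction and a confidence rating before seeing the answer."
        return (f"You predicted '{commitment.prediction}' at confidence "
                f"{commitment.confidence:.0%}. The model's answer is: {model_answer}. "
                "Where do the two differ, and why?")

    print(reveal_answer(None, "x = 3"))
    print(reveal_answer(Commitment("x = 2", 0.6), "x = 3"))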

Pattern 7: AI as critic, not ghostwriter

For writing, argumentation, and design tasks, the safest default is often not “generate the artifact” but “critique the learner’s artifact.” The learner produces a draft; AI responds as a reviewer against explicit criteria. This preserves authorship and shifts AI toward feedback.

A strong writing workflow might be:

  1. Student writes thesis and outline unaided.
  2. AI identifies missing assumptions or weak evidence.
  3. Student revises.
  4. AI generates counterarguments.
  5. Student responds.
  6. Human or rubric-based assessment evaluates final reasoning.

The key is that the student remains responsible for argument construction.
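
A minimal prompt-construction sketch for the critic role; the rubric criteria and wording are illustrative, not the prompts used in any cited study. The instruction explicitly forbids rewriting, so the learner keeps authorship.

    RUBRIC = [
        "claim clarity",
        "evidence relevance",
        "warrant stated explicitly",
        "counterargument addressed",
    ]

    def critic_prompt(student_draft: str) -> str:
        """Build a review-only prompt: critique against criteria, never rewrite."""
        criteria = "\n".join(f"- {c}" for c in RUBRIC)
        return (
            "You are a reviewer. Do not rewrite the draft or supply replacement text.\n"
            f"Evaluate the draft against these criteria:\n{criteria}\n"
            "For each criterion, quote the relevant passage (or note its absence), "
            "name one weakness, and ask one question the author should answer.\n\n"
            f"Draft:\n{student_draft}"
        )

    print(critic_prompt("School uniforms improve focus because students compare clothing less."))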

A reference architecture for learning-preserving AI assistance

Component 1: Learning objective classifier

The system must know whether a requested operation is part of the learning target. If the objective is “learn to write evidence-based claims,” AI should not write the claim. If the objective is “learn to evaluate AI-generated claims,” AI may generate a claim for critique.

Component 2: Learner-state model

The system should maintain a model of current learner understanding: prior attempts, errors, confidence, hint history, latency, revision quality, and transfer performance. This does not require invasive surveillance; it requires instrumenting learning interactions rather than only collecting final outputs.

Component 3: Help policy

The help policy determines what assistance is permissible at each stage (a minimal sketch follows the list). It should encode:

  • maximum answer completeness before an attempt;
  • hint level progression;
  • criteria for giving worked examples;
  • when to ask a metacognitive question;
  • when to fade support;
  • when to escalate to a human.
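
A minimal sketch of an explicit, inspectable help policy; field names and default values are illustrative, and a real policy would be authored per learning objective and domain.

    from dataclasses import dataclass

    @dataclass
    class HelpPolicy:
        require_attempt: bool = True           # attempt-first gating
        max_level_before_attempt: int = 1      # focusing questions only
        max_level_after_attempt: int = 5       # up to worked substeps
        worked_example_after_errors: int = 2   # repeated errors on a step unlock a worked substep
        metacognitive_prompt_every: int = 3    # ask for a confidence or strategy check every N turns
        fade_after_consecutive_correct: int = 3
        escalate_to_human_at_level: int = 6    # beyond this, route to a teacher or tutor

    print(HelpPolicy())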

Component 4: Verification layer

The verification layer checks AI-generated instructional content against trusted sources, teacher-authored solutions, rubrics, or external tools. Kestin et al. avoided relying solely on GPT-4 for correctness by enriching prompts with step-by-step answers (Kestin et al., Scientific Reports 2025). Bastani et al.’s GPT Tutor similarly used teacher-designed problem information and hints (Bastani et al., PNAS 2025).

Component 5: Transfer assessment

Every AI-supported learning system needs periodic AI-off assessment. If the learner cannot perform without the scaffold, the system has produced assisted performance, not durable learning.

Transfer checks should vary along several dimensions:

  • near transfer: similar problem, different numbers;
  • medium transfer: same concept, new context;
  • far transfer: different domain or ill-structured case;
  • adversarial transfer: detect a plausible but wrong AI answer;
  • metacognitive transfer: explain when AI should and should not be trusted.

Assessment: what to measure instead of output quality alone

Measure independent performance

The primary outcome must include unassisted performance. Bastani et al.’s contribution is precisely that they measured what happened when AI was removed. Without that step, the GPT Base condition would have looked beneficial.

Measure process, not only artifacts

Learning systems should capture:

  • number and type of learner attempts before help;
  • hint levels used;
  • whether the learner revised after feedback;
  • self-explanation quality;
  • verification actions;
  • source triangulation;
  • confidence calibration;
  • time spent before answer reveal;
  • persistence after error.

These are closer to the cognitive mechanisms than final output quality.
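
A minimal instrumentation sketch; the event names and fields are illustrative. Each learning-relevant action is logged as a timestamped event so process measures can be computed alongside final outputs.

    import json
    import time

    def log_event(log: list, kind: str, **details) -> None:
        """Append a timestamped learning-process event (attempt, hint, revision, ...)."""
        log.append({"t": time.time(), "kind": kind, **details})

    events: list = []
    log_event(events, "attempt", correct=False, latency_s=42)
    log_event(events, "hint", level=2)
    log_event(events, "revision", after_feedback=True)
    log_event(events, "self_explanation", quality="partial")
    log_event(events, "verification", action="ran_test_case")
    print(json.dumps(events, indent=2))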

Measure calibration

Critical thinking requires knowing when one might be wrong. AI assistance can inflate confidence because the output looks polished. Assessment should ask learners to estimate confidence, justify confidence, and update confidence after feedback.
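
One simple way to quantify calibration is the Brier score, the mean squared gap between stated confidence and actual correctness; this is a standard measure, not one reported in the cited studies. Lower is better, and a learner who is consistently confident but often wrong scores poorly.

    def brier_score(confidences: list[float], outcomes: list[bool]) -> float:
        """Mean squared difference between stated confidence and correctness (0 = perfect)."""
        return sum((c - float(o)) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

    # A learner who is 90% confident but right only half the time is poorly calibrated.
    print(brier_score([0.9, 0.9, 0.9, 0.9], [True, False, True, False]))  # approximately 0.41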

Measure error detection

A robust AI-era assessment should include intentionally flawed AI outputs. Learners should identify:

  • hallucinated citations;
  • invalid assumptions;
  • incorrect intermediate steps;
  • overgeneralizations;
  • missing counterevidence;
  • biased framing;
  • unsupported causal claims.

If students cannot audit AI, they are not prepared for AI-mediated work.

Measure retention and transfer after delay

Immediate post-tests overestimate learning when scaffolds are strong. Delayed post-tests, cumulative retrieval, and transfer tasks are essential. This is especially important for AI systems because the short-term performance boost can be large while independent learning is flat or negative.

Instructional patterns by domain

Mathematics and quantitative reasoning

AI should emphasize strategy selection, representation, and error analysis. Direct solution generation should be delayed.

Effective moves:

  • ask learners to classify the problem type;
  • require a first equation or diagram;
  • give hints tied to misconceptions;
  • ask learners to predict whether an answer is reasonable;
  • require checking with an alternate method;
  • periodically remove AI for exam-like practice.

Risky moves:

  • full symbolic solution before learner attempt;
  • automatic simplification without explanation;
  • answer checking without requiring reasoning;
  • multi-step solutions copied into homework.

The GPT Base versus GPT Tutor contrast shows the difference between answer availability and pedagogical scaffolding (Bastani et al., PNAS 2025).

Writing and argumentation

AI should function as a reader, critic, and counterargument generator more often than as a drafter.

Effective moves:

  • ask students to generate thesis and evidence first;
  • have AI identify weak warrants or missing assumptions;
  • require students to accept, reject, or revise AI feedback with justification;
  • generate counterarguments after the student’s draft;
  • use rubrics to ground critique.

Risky moves:

  • AI-generated first drafts;
  • AI-selected evidence;
  • AI-written reflection statements;
  • style polishing before argument structure is evaluated.

The learning target in writing is not only grammatical fluency. It is judgment: audience, evidence, structure, stance, and revision.

Programming

Programming assistance should distinguish syntax support from algorithmic reasoning. Syntax offloading is often acceptable after conceptual understanding is established. Algorithm design, debugging strategy, and code comprehension should remain learner-owned.

Effective moves:

  • require pseudocode before code generation;
  • ask learners to predict output;
  • generate tests before implementation;
  • ask students to explain each line of AI-suggested code;
  • require debugging traces;
  • compare AI solution with learner solution.

Risky moves:

  • generating complete functions for novices;
  • accepting code without tests;
  • using AI to bypass reading unfamiliar codebases;
  • treating passing tests as proof of understanding.

Recent computing-education work on Copilot and novice programming repeatedly points to the same tension: productivity gains can coexist with reduced comprehension and overreliance (Prather et al., ICER 2023; Shihab et al., ICER 2025).

Professional learning and workplace training

In workplaces, AI assistance should be paired with deliberate practice. If AI handles routine cases, humans lose the repetitions that maintain judgment. This is Bainbridge’s “ironies of automation” applied to knowledge work: automation removes routine practice and leaves humans responsible for rare exceptions (Bainbridge, Automatica 1983).

Effective moves:

  • AI handles low-risk drafting, but humans audit samples;
  • workers complete periodic AI-off drills;
  • systems log uncertainty and escalation cases;
  • teams review AI failures as learning cases;
  • experts annotate why AI suggestions were accepted or rejected.

Risky moves:

  • total automation of routine judgment;
  • no practice on edge cases;
  • treating AI oversight as passive approval;
  • removing domain experts before novices build schemas.

The role of teachers and learning designers

AI does not remove the need for instructional design. It raises the cost of poor instructional design.

Teachers and designers must now specify:

  1. Which cognitive operations are learning targets?
  2. Which operations may be safely offloaded?
  3. When should help be withheld, hinted, or given?
  4. What evidence shows the learner can perform independently?
  5. How will students learn to audit AI?
  6. How will scaffolds fade?
  7. How will the system prevent fluent output from substituting for understanding?

This shifts the educator’s role from content delivery alone toward cognitive environment design. The best AI tutoring studies are not victories of model capability alone; they are victories of instructional constraint.

Open problems

1. Longitudinal cognitive effects

Most evidence is short-term. We need semester- and year-scale studies that track whether AI-supported learners retain skills, transfer them, and calibrate their confidence. The most important outcomes are not immediate grades but durable independent competence.

2. Differential effects by prior knowledge

AI likely helps high-prior-knowledge learners more safely because they can evaluate outputs. Novices may be more vulnerable to offloading and hallucinated explanations. Studies should stratify by prior knowledge, self-regulation, reading skill, and domain confidence.

3. Measuring critical thinking in AI-mediated work

Existing critical-thinking assessments were not designed for AI workflows. The field needs tasks that measure:

  • AI output auditing;
  • source triangulation;
  • uncertainty reasoning;
  • adversarial evaluation;
  • explanation reconstruction;
  • decision accountability;
  • transfer after AI support.

4. Scaffold fading policies

We lack strong evidence on optimal fading schedules for AI tutors. Should systems fade after mastery, after time, after confidence calibration, or after successful transfer? The answer likely varies by domain and learner.

5. Metacognitive interventions

Learners need to know when they are relying on AI in ways that feel productive but reduce learning. This requires metacognitive dashboards, reflection prompts, and assessments that reveal the performance-learning gap.

6. Teacher-facing versus student-facing AI

Tutor CoPilot suggests that AI may sometimes be safer and more effective when it augments human instructors rather than directly tutoring students. The field should compare student-facing answer systems, student-facing tutors, teacher-facing copilots, and hybrid models.

7. Institutional incentive alignment

If schools grade only final artifacts, students will rationally use AI to optimize artifacts. If workplaces reward only throughput, workers will rationally offload judgment. Learning-preserving AI requires assessment and incentive systems that value process, justification, and independent transfer.

Design principles for durable AI-assisted learning

  1. Assisted performance is not evidence of learning. Always measure AI-off transfer.
  2. Do not offload the target skill. Offload barriers, not the cognition being taught.
  3. Require attempts before answers. Learner work must precede substantive AI help.
  4. Use hint ladders. Provide the minimum help needed to keep productive struggle alive.
  5. Make verification explicit. Treat AI auditing as a learnable skill.
  6. Require self-explanation. Learners should reconstruct reasoning after help.
  7. Fade scaffolds. Support should decrease as competence increases.
  8. Instrument process. Capture attempts, revisions, hints, explanations, and transfer.
  9. Use AI as critic before creator. Preserve learner authorship wherever the target is judgment.
  10. Keep humans in the loop for high-stakes learning. Teachers remain essential as designers, diagnosticians, and accountability anchors.

Conclusion

AI assistance is neither inherently corrosive nor inherently liberating for critical thinking. It is a powerful offloading technology whose learning effects depend on what it offloads, when it intervenes, and whether the learner remains responsible for sense-making. The current evidence supports a clear position: unrestricted answer engines can produce a performance-learning gap, while structured, pedagogically constrained systems can improve feedback, pacing, and engagement without necessarily damaging independent performance.

The design frontier is not better prompting alone. It is the construction of AI learning environments that preserve productive struggle, require metacognitive monitoring, support verification, and fade toward independence. The goal is not to prevent learners from using AI. The goal is to prevent AI from silently replacing the cognitive work that education exists to develop.

References

  • Abrami, P. C., Bernard, R. M., Borokhovski, E., Waddington, D. I., Wade, C. A., & Persson, T. “Strategies for Teaching Students to Think Critically: A Meta-Analysis.” Review of Educational Research, 2015. https://doi.org/10.3102/0034654314551063
  • Bainbridge, L. “Ironies of Automation.” Automatica, 1983. https://doi.org/10.1016/0005-1098(83)90046-8
  • Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. “Generative AI Without Guardrails Can Harm Learning: Evidence from High School Mathematics.” Proceedings of the National Academy of Sciences, 2025. https://hamsabastani.github.io/education_llm.pdf
  • Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. Taxonomy of Educational Objectives: The Classification of Educational Goals. David McKay, 1956. https://archive.org/details/taxonomyofeducat0000bloo
  • Buçinca, Z., Malaya, M. B., & Gajos, K. Z. “To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-Assisted Decision-Making.” CHI, 2021. https://doi.org/10.1145/3411764.3445172
  • Chi, M. T. H., Bassok, M., Lewis, M. W., Reimann, P., & Glaser, R. “Self-Explanations: How Students Study and Use Examples in Learning to Solve Problems.” Cognitive Science, 1989. https://doi.org/10.1207/s15516709cog1302_1
  • Freeman, S., Eddy, S. L., McDonough, M., Smith, M. K., Okoroafor, N., Jordt, H., & Wenderoth, M. P. “Active Learning Increases Student Performance in Science, Engineering, and Mathematics.” PNAS, 2014. https://doi.org/10.1073/pnas.1319030111
  • Gerlich, M. “AI Tools in Society: Impacts on Cognitive Offloading and the Future of Critical Thinking.” Societies, 2025. https://doi.org/10.3390/soc15010006
  • Halpern, D. F. “Teaching Critical Thinking for Transfer Across Domains: Dispositions, Skills, Structure Training, and Metacognitive Monitoring.” American Psychologist, 1998. https://doi.org/10.1037/0003-066X.53.4.449
  • Kestin, G., Miller, K., Klales, A., Milbourne, T., Ponti, G., et al. “AI Tutoring Outperforms In-Class Active Learning: An RCT Introducing a Novel Research-Based Design in an Authentic Educational Setting.” Scientific Reports, 2025. https://doi.org/10.1038/s41598-025-97652-6
  • Kosmyna, N., Liao, X.-H., et al. “Your Brain on ChatGPT: Accumulation of Cognitive Debt When Using an AI Assistant for Essay Writing Task.” arXiv, 2025. https://arxiv.org/abs/2506.08872
  • Kulik, J. A., & Fletcher, J. D. “Effectiveness of Intelligent Tutoring Systems: A Meta-Analytic Review.” Review of Educational Research, 2016. https://doi.org/10.3102/0034654315581420
  • Lee, H.-P., Sarkar, A., Tankelevitch, L., Drosos, I., Rintel, S., Banks, R., & Wilson, N. “The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects from a Survey of Knowledge Workers.” CHI, 2025. https://doi.org/10.1145/3706598.3713778
  • Ma, W., Adesope, O. O., Nesbit, J. C., & Liu, Q. “Intelligent Tutoring Systems and Learning Outcomes: A Meta-Analysis.” Journal of Educational Psychology, 2014. https://doi.org/10.1037/a0037123
  • Mosier, K. L., Skitka, L. J., Heers, S., & Burdick, M. “Automation Bias: Decision Making and Performance in High-Tech Cockpits.” International Journal of Aviation Psychology, 1998. https://doi.org/10.1207/s15327108ijap0801_3
  • Parasuraman, R., & Riley, V. “Humans and Automation: Use, Misuse, Disuse, Abuse.” Human Factors, 1997. https://doi.org/10.1518/001872097778543886
  • Paul, R., & Elder, L. The Miniature Guide to Critical Thinking Concepts and Tools. Foundation for Critical Thinking, 2008. https://www.criticalthinking.org
  • Prather, J., Denny, P., Leinonen, J., et al. “The Robots Are Here: Navigating the Generative AI Revolution in Computing Education.” ICER, 2023. https://doi.org/10.1145/3568813.3600139
  • Risko, E. F., & Gilbert, S. J. “Cognitive Offloading.” Trends in Cognitive Sciences, 2016. https://doi.org/10.1016/j.tics.2016.07.002
  • Shihab, E., et al. “The Effects of GitHub Copilot on Computing Students’ Programming Effectiveness, Efficiency, and Processes in Brownfield Programming Tasks.” ICER, 2025. https://arxiv.org/abs/2506.10051
  • Sparrow, B., Liu, J., & Wegner, D. M. “Google Effects on Memory: Cognitive Consequences of Having Information at Our Fingertips.” Science, 2011. https://doi.org/10.1126/science.1207745
  • Sweller, J. “Cognitive Load During Problem Solving: Effects on Learning.” Cognitive Science, 1988. https://doi.org/10.1207/s15516709cog1202_4
  • Sweller, J., van Merriënboer, J. J. G., & Paas, F. “Cognitive Architecture and Instructional Design.” Educational Psychology Review, 1998. https://doi.org/10.1023/A:1022193728205
  • VanLehn, K. “The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems.” Educational Psychologist, 2011. https://doi.org/10.1080/00461520.2011.611369
  • Wang, R., Zhang, Q., Robinson, C., Loeb, S., & Demszky, D. “Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise.” EdWorkingPaper / arXiv, 2024–2025. https://arxiv.org/abs/2410.03017
  • Ward, A. F., Duke, K., Gneezy, A., & Bos, M. W. “Brain Drain: The Mere Presence of One’s Own Smartphone Reduces Available Cognitive Capacity.” Journal of the Association for Consumer Research, 2017. https://doi.org/10.1086/691462