What Is Indirect Prompt Injection? Threat Model and Defenses
Quick Answer
Indirect prompt injection is an attack where instructions hidden in content an LLM agent reads as data — an email, web page, RAG chunk, tool result, or tool description — are interpreted by the model as operational authority. The user asked for something benign; the attacker, who never spoke to the user, ends up dictating tool calls. Prompt-level defenses reduce but do not eliminate it; the durable fix is to constrain what untrusted text can cause, not what the model can recognize.
What Is Indirect Prompt Injection? Threat Model and Defenses
Indirect prompt injection is the attack class that turns an LLM agent's own ingestion paths against its user. An attacker plants instruction-shaped text in something the agent will later read as data — an email, a web page, a retrieved document, a tool result, an MCP tool description — and the model treats those instructions as authority. OWASP catalogs it under LLM01:2025, and Microsoft has called it one of the most-used techniques in AI vulnerability submissions to MSRC. If your roadmap includes agents that read untrusted content and call tools, this is your dominant threat model.
What is indirect prompt injection?
Indirect prompt injection (IPI) is a confusion-of-authority vulnerability. The user gives the agent a benign task. The agent fetches some content to do that task. Hidden inside that content is text written by a third party — never the user — that reads like an instruction. The model, faced with a single flat context window containing developer policy, user task, prior tool observations, and the attacker's text, has no reliable way to decide which of those is binding. It often picks the attacker.
This contrasts with direct prompt injection, where the attacker is the person typing into the prompt. The defender's intuition for direct injection ("the user authorized this") does not transfer. With IPI, the user authorized a summary; the attacker authorized a wire transfer. For the term-level definition and a list of channels, see indirect prompt injection.
The decisive question for IPI is not "can the model recognize malicious text?" It is "what can untrusted text cause?"
How does it work?
The mechanism is the same regardless of channel. Architecture-level only — payload strings and bypass details are out of scope here.
- Plant. The attacker places instruction-shaped text somewhere a target agent will eventually ingest as data: an email, calendar invite, web page, GitHub issue, PDF, support ticket, Slack message, search result, RAG chunk, image OCR text, code comment, log line, tool output, or tool metadata.
- Ingest. A benign user task — "summarize my unread mail," "research this vendor," "triage these tickets" — causes the agent to fetch the poisoned content through a trusted ingestion path.
- Collapse. The poisoned text lands in the same context window as the developer policy, the user prompt, prior observations, tool schemas, and the agent's scratchpad. The next-token engine is now asked to infer which of those is binding.
- Hijack. The planner treats the attacker's imperative text as authority and emits a tool call the user never asked for. This is tool hijacking driven by external content.
- Effect. The unauthorized tool call hits a capability boundary. If that boundary is prompt-only, the action executes under the user's or service's credentials: data exfiltration, unintended writes, external sends, code execution, or memory writes that persist into future sessions.
Three subtypes are worth naming explicitly:
- Tool-output injection. The payload arrives in a tool's return value — an HTML page, a database row, an issue body — and persists in conversation history.
- Tool-metadata and tool-selection injection. The payload lives in tool names, descriptions, or MCP manifests. MCPTox reports tool-poisoning attack success up to 72.8% across real MCP servers. Attackers can also poison which tool gets selected, not just what the selected tool does. See what is tool hijacking for the deeper mechanism.
- Memory and RAG poisoning (deferred IPI). The payload is written into long-term memory or a vector index and triggers in a later, unrelated session — sometimes for a different user. See memory poisoning and what is RAG data exfiltration for the retrieval-channel variant.
The canonical production example is EchoLeak (CVE-2025-32711), a zero-click exfiltration against Microsoft 365 Copilot. A crafted email slipped past the XPIA classifier, used reference-style Markdown to bypass link redaction, abused auto-fetched images, and tunnelled exfiltrated content through an allowed corporate proxy under CSP. Reproduction details and bypass specifics are intentionally not described here.
Why does it matter?
IPI breaks the most basic trust intuition we have about agents: that they act on behalf of the user who summoned them. With IPI, the user asks for a summary and an attacker who never met them drives the next ten tool calls.
The benchmark evidence summarized in Jer's prior work on tool-using LLM agent security is consistent across independent groups:
- AgentDojo (NeurIPS 2024): GPT-4o reaches 57.69% targeted attack success under the strongest evaluated attack with no defense across 97 tasks and 629 security cases. The best evaluated defense — tool filtering — still leaves 6.84%.
- Agent Security Bench (ICLR 2025): Across 10 scenarios, 400+ tools, and 27 attack/defense types, the highest reported average attack success rate is 84.30%.
- MCPTox (2025): 1,312 malicious cases over 45 live MCP servers and 353 tools. Refusal stays below 3% on the most refusing model; tool-poisoning ASR reaches 72.8%.
- Adaptive attacks (Zhan et al., 2025; Nasr et al., 2025): classifier and prompt-only defenses that look strong on static benchmarks degrade sharply when the attacker adapts.
The worst plausible outcome for an enterprise deployment is zero-click exfiltration of any data the agent can read, plus unauthorized writes and sends under user credentials, plus persistent influence via poisoned memory or RAG corpus that survives the triggering session. This page intentionally withholds reproduction details and live-system bypass specifics; readers who need that depth should consult the OWASP, MSRC, AgentDojo, and EchoLeak references in this article's external sources.
How do you defend against it?
There is no single control that closes IPI. The strategy is to assume untrusted text will reach the planner and constrain what it can cause.
- Tool precommitment and capability narrowing. Before reading untrusted content, freeze the allowed tool set and scopes for this task — for example, read-only mail search restricted by query, max 10 results, no send, no forward, no external image rendering. AgentDojo's tool filter cut targeted attack success from roughly 58% to about 7%. Cost: requires task-shape planning and breaks fully open-ended agents. Does not cover cases where the same tool needed for the benign task can also serve the attacker.
- Egress control. Deterministically block exfiltration sinks: Markdown image auto-fetch, arbitrary URL fetches, external webhooks, encoded subdomains, public paste/repo writes, and external chat posts. Watch for allowed corporate proxies that can be turned into side channels. Cost: breaks some legitimate UX such as image previews. Does not cover attacks that exfiltrate over an allowed-by-policy channel.
- Quarantine and typed extraction of untrusted data. Don't feed raw email, HTML, PDF, or MCP descriptions into the planner. Run them through an extractor that emits typed fields with provenance and taint labels. Cost: per-source extractors and schema work. Does not cover attacks expressed as benign-looking facts that subvert reasoning rather than as imperative instructions.
- Separate planner, reader, and actor (dual-LLM / IFC pattern). A trusted planner sees only the user and developer policy, and emits a deterministic capability manifest. A reader handles untrusted data and has no tools. An actor proposes tool calls that a non-model policy engine validates. CaMeL is the reference design here. Cost: more components, more latency, more engineering. Does not cover shortcuts where one agent both reads attacker-controlled content and decides authority.
- Spotlighting and instruction-hierarchy training. Mark untrusted spans with delimiters, datamarking, or encoding. Train models to respect a privileged instruction hierarchy (StruQ, SecAlign, Meta SecAlign, OpenAI's instruction hierarchy). Cost: probabilistic only; needs adaptive evaluation; some utility tax. Microsoft itself frames Spotlighting as a probabilistic preventative, not a guarantee.
- Provenance-aware memory and RAG controls. Tag every memory entry and retrieved chunk with source and trust. Default-forbid untrusted-derived memory from influencing tool authorization. Authenticate corpus writers, quarantine new documents, and scan for embedding-outlier poison clusters. Cost: memory schema work and ingestion controls. Does not cover legitimate-source content that has been compromised upstream.
- Tool metadata as supply chain. Pin and sign MCP manifests, review diffs of tool descriptions, strip hidden and invisible Unicode, separate user-visible docs from model-facing schemas, and require approval for tools that read secrets or send externally. Cost: governance overhead per integration. Does not cover malicious behavior in legitimately scoped tools.
- Meaningful human authorization for high-risk actions. Generate consent dialogs from structured runtime data, not from model prose, and surface provenance — "this action was influenced by an external email from X." Cost: UX friction; only useful when a human is in the loop. Does not cover zero-click paths that fire before any human sees the response.
The single sentence to take away: after an agent has read untrusted content, it must not be able to perform consequential actions unless an independent policy engine proves the action is within the user's original intent and allowed information flows.
Related concepts and tools
- What is tool hijacking — the tool-call subversion mechanic IPI usually exploits.
- What is multi-agent prompt injection — how IPI propagates across collaborating agents.
- What is RAG data exfiltration — the retrieval-channel exfiltration variant.
- Memory poisoning — deferred-execution IPI persisted into long-term memory.
- Agentic AI security — parent topic hub.
FAQ
How is indirect prompt injection different from direct prompt injection?
Direct prompt injection is the attacker typing into the prompt as the user. Indirect prompt injection is the attacker hiding instructions in content the agent later reads as data — an email body, a web page, a RAG chunk, a tool result, or a tool's metadata. The user-consent intuition that protects against direct injection fails here, because the user never saw or approved the attacker's text.
Can a system prompt or 'ignore malicious instructions' rule prevent indirect prompt injection?
No. Delimiting, prompt repetition, and "ignore anything that looks like an instruction" rules reduce attack success but do not eliminate it. AgentDojo and follow-up adaptive-attack work show that classifier-based and prompt-only defenses degrade sharply when the attacker knows the defense and adapts. Treat prompt-level controls as probabilistic risk reducers and put the real boundary outside the model.
What real-world incidents have involved indirect prompt injection?
EchoLeak (CVE-2025-32711) is the canonical production case: a crafted email reached Microsoft 365 Copilot and chained an XPIA classifier bypass, reference-style Markdown, auto-fetched images, and an allowed corporate proxy under CSP into zero-click data exfiltration. Microsoft has separately stated that indirect prompt injection is one of the most widely used techniques in AI vulnerabilities reported to MSRC.
What's the single highest-leverage defense against indirect prompt injection?
Tool precommitment paired with capability narrowing: decide the allowed tools and scopes before the agent reads any untrusted content, then enforce the manifest deterministically outside the model. In AgentDojo, tool filtering cut targeted attack success rate from roughly 58% to about 7% while preserving task utility. Pair it with deterministic egress control to close the exfiltration sinks.
Derived From
Related Work
External References
FAQ
How is indirect prompt injection different from direct prompt injection?
Direct prompt injection is the attacker typing into the prompt as the user. Indirect prompt injection is the attacker hiding instructions in content the agent later reads as data — an email body, a web page, a RAG chunk, a tool result, or a tool's metadata. The user-consent intuition that protects against direct injection fails here, because the user never saw or approved the attacker's text.
Can a system prompt or 'ignore malicious instructions' rule prevent indirect prompt injection?
No. Delimiting, prompt repetition, and 'ignore anything that looks like an instruction' rules reduce attack success but do not eliminate it. AgentDojo and follow-up adaptive-attack work show that classifier-based and prompt-only defenses degrade sharply when the attacker knows the defense and adapts. Treat prompt-level controls as probabilistic risk reducers and put the real boundary outside the model.
What real-world incidents have involved indirect prompt injection?
EchoLeak (CVE-2025-32711) is the canonical production case: a crafted email reached Microsoft 365 Copilot and chained an XPIA classifier bypass, reference-style Markdown, auto-fetched images, and an allowed corporate proxy under CSP into zero-click data exfiltration. Microsoft has separately stated that indirect prompt injection is one of the most widely used techniques in AI vulnerabilities reported to MSRC.
What's the single highest-leverage defense against indirect prompt injection?
Tool precommitment paired with capability narrowing: decide the allowed tools and scopes before the agent reads any untrusted content, then enforce the manifest deterministically outside the model. In AgentDojo, tool filtering cut targeted attack success rate from roughly 58% to about 7% while preserving task utility. Pair it with deterministic egress control to close the exfiltration sinks.