Hardening Multi-Agent Systems Against Prompt Injection
Quick Answer
An execution checklist for engineers and security reviewers responsible for multi-agent LLM systems that use tools, shared memory, or agent-to-agent delegation. It defends against prompt injection, memory poisoning, tool-manifest poisoning, and cross-agent infection by enforcing privilege separation, typed channels, authenticated tool metadata, sandboxed sinks, and partitioned memory. Use it per-release on production deployments; treat unchecked items as known gaps with documented exceptions.
This checklist hardens production multi-agent LLM systems — orchestrator, planner, workers, shared memory, tool/protocol broker, sandboxed executor — against prompt injection, memory poisoning, tool-manifest poisoning, and cross-agent infection. The audience is engineers and security reviewers who already understand the threat model in what is multi-agent prompt injection; this artifact is the execution layer, not the explainer. The doctrine throughout is assume partial compromise: every layer is judged by whether it reduces blast radius, not whether it perfectly detects malicious text. Some reproduction details and payload specifics are withheld in keeping with responsible-disclosure conventions for this risk tier.
How to use this checklist
Run it once at architectural review and again before each release that changes agent topology, tool surface, memory layout, or delegation patterns. The owner is whoever signs off on the system's threat model — typically a security architect paired with the agent platform's tech lead. "Done" means every MUST is implemented and verified, every SHOULD is implemented or has a written exception with a compensating control, and the system has been exercised against an adaptive evaluation harness as described in the source paper, Hardening Multi-Agent Systems Against Prompt Injection.
Architecture and privilege separation
3 checksSeparate planner, parser, and executor into distinct privilege tiers
MUSTWhy it matters
A single agent that both reads untrusted content and decides which tools to invoke is a confused deputy by construction. Untrusted text can propose actions that the same context will then authorize.
How to implement
Route untrusted content (web pages, email, documents, retrieved chunks, remote-agent replies) through a low-privilege parser whose only output is typed, schema-validated structures. Only those structures cross into the planner. The planner does not consume raw untrusted strings; the executor does not consume planner free-text.
Verify it's done
Trace any tool call back through the system and confirm the planner's input on that turn contains no untrusted free-text fields. A reviewer can read the type definitions at the parser/planner boundary and the planner/executor boundary.
Make tool access a per-session, per-task capability decision
MUSTWhy it matters
Static, ambient tool availability lets a compromised agent reach sinks unrelated to the current task. Capability scoping limits what an injection can cause even if it succeeds in influencing the planner.
How to implement
Bind the set of callable tools, their argument shapes, and their target resources to the specific task instance at session start. The broker rejects calls outside that bound set without consulting the model.
Verify it's done
Inject an out-of-scope tool name into the planner's input and confirm the broker refuses the call deterministically, with a refusal log entry citing the capability binding rather than a model-generated reason.
Forbid interpolation of untrusted variables into system or developer messages
MUSTWhy it matters
Authority leakage through prompt-template interpolation is one of the cheapest ways for untrusted content to acquire privileged framing. Once a hostile string lands in a system message, downstream layers treat it as policy.
How to implement
Templates that build system or developer messages must accept only values from a vetted, typed source — never raw user, retrieval, tool, or remote-agent text. Untrusted text is delivered exclusively in user-role messages with explicit source labels.
Verify it's done
Static analysis or a unit test renders every system/developer template with hostile fixtures and asserts that no fixture content appears in the rendered output.
Typed inter-agent channels and provenance
3 checksReplace free-form agent-to-agent messages with typed action requests
MUSTWhy it matters
Free-text delegation is the carrier wave for cross-agent infection: an injection in one worker propagates as a "goal" to the next. Schemas and enums collapse the surface that an injection can ride.
How to implement
Define schemas for every inter-agent message: capability identifier, typed arguments, policy-checked action descriptors. Free-text fields, where unavoidable, are explicitly marked untrusted and never routed back into a planner without re-parsing.
Verify it's done
Audit logs show every agent-to-agent message conforming to a registered schema. A red-team test that sends a free-text "ignore previous instructions" payload between workers either fails schema validation or lands in a field that downstream agents treat as untrusted.
Tag every message and memory entry with origin and trust class
MUSTWhy it matters
Without provenance, the model cannot apply different rules to user instructions versus tool outputs versus retrieved web content. Tags also enable deterministic policy upstream of the model.
How to implement
Attach origin metadata (user, system, retrieval, tool output, remote agent, memory) and a trust class to every message and memory record at the moment of ingest. Carry tags through summarization, compression, and memory writes; never strip them.
Verify it's done
Pick any message in a recent trace and confirm it has an unbroken provenance chain back to its source. Summaries and memory entries inherit the lowest trust class of their inputs.
Apply deterministic policy on cross-trust transitions
SHOULDWhy it matters
Tags are only useful if something acts on them. The model should not be the sole enforcer of "do not let retrieved content trigger irreversible actions."
How to implement
Before the executor accepts a tool call, a deterministic policy layer reads the trust class of the inputs that produced it and gates side-effectful or cross-principal calls accordingly. Decisions are logged with the inputs that triggered them.
Verify it's done
Force an adversarial retrieval input into the pipeline and confirm the policy layer blocks or downgrades the resulting tool call without invoking the model for adjudication.
Tool and agent registry authentication
3 checksTreat tool manifests and agent cards as untrusted unless signed
MUSTWhy it matters
Tool hijacking and agent-card spoofing bias the planner's selection before any prompt-level safeguard runs. An unauthenticated registry entry is an unauthenticated instruction.
How to implement
Sign manifests and capability descriptors at publication; verify signatures at load and on refresh. Reject or quarantine unsigned or signature-mismatched entries. See what is tool hijacking for the threat model.
Verify it's done
A registry with a tampered manifest fails load with a signature error rather than degrading silently. Logs record the rejected fingerprint.
Issue audience-bound, short-lived, scoped credentials per tool call
MUSTWhy it matters
Long-lived ambient credentials let a single injection escalate to anything the agent's identity can reach. Audience-bound tokens contain the blast.
How to implement
The broker mints a credential per tool invocation, bound to the specific tool, target resource, and call window. No token passthrough from upstream agents or sessions. Refuse any tool definition that requests credential reuse across calls.
Verify it's done
Inspect any tool call in production traces and confirm its credential has a single-call audience, a sub-minute lifetime, and the minimum scopes required for that call's arguments.
Validate returned tool artifacts against declared output types
SHOULDWhy it matters
Tool outputs are an injection vector as potent as user input — particularly when an agent treats them as ground truth. Output validation forces tool replies through the same untrusted-content path as web retrieval.
How to implement
Each tool declares its output schema in the registry. The broker validates returns against that schema and stamps them with a tool-output trust class before they reach any agent.
Verify it's done
A tool that returns extra free-text fields beyond its declared schema either has those fields stripped or causes a logged validation error; downstream agents see only the typed result.
Sandboxed execution and scoped sinks
3 checksExecute side-effectful actions in sandboxes with no ambient reach
MUSTWhy it matters
When detection fails — and the source paper is explicit that it will fail some non-trivial fraction of the time — the sandbox is what stops a misdirected action from touching unrelated credentials, networks, or filesystems.
How to implement
Run executors in environments with explicit, allow-listed network egress, no inherited credentials, no shared filesystem with other tasks, and resource ceilings. Per-agent and per-task isolation, not just per-process.
Verify it's done
Attempt a deliberately out-of-scope egress or filesystem read from inside an executor and confirm the sandbox denies it at the platform layer, independent of any model-side check.
Apply output-side sink validation before irreversible calls
MUSTWhy it matters
The dangerous moment is the transition from broker to sink (external API, remote agent, code execution). A last-line check at this transition catches cases that slipped through earlier layers.
How to implement
Before dispatch, validate the call's target, arguments, and trust-class lineage against a sink-specific policy: allowed recipients for outbound mail, allowed domains for HTTP, allowed paths for filesystem writes, allowed call shapes for finance or admin APIs.
Verify it's done
A trace shows a sink-policy decision record on every irreversible call, including the trust classes of its argument inputs. A test that crafts a plausible but out-of-policy call is blocked at the sink, not earlier.
Disallow code execution from untrusted-derived plans
MUSTWhy it matters
Code-execution sinks turn any planner-level injection into arbitrary computation. They deserve the strictest gating in the system.
How to implement
Code execution requires either a fully trusted-input lineage or explicit human approval. Generated code runs in a sandbox with no credentials, no network, and no persistent storage by default; outbound capabilities are added per-call with justification.
Verify it's done
A run whose plan derives from retrieved or remote-agent content cannot reach a code-execution sink without a recorded human approval event.
Memory partitioning and minimization
2 checksPartition memory by task, principal, and sensitivity class
MUSTWhy it matters
Shared memory is the persistence layer for memory poisoning; a single poisoned write can reactivate weeks later in an unrelated session. Partitioning bounds the reach of a successful poisoning.
How to implement
Memory namespaces are scoped to a task, a principal, and a sensitivity class. Reads across partitions require an explicit policy decision and are logged. Memory writes inherit the trust class of their inputs.
Verify it's done
A poisoned memory entry written in one task's namespace does not surface in retrieval for a different task or principal. Cross-partition reads appear in audit logs with policy justifications.
Minimize sensitive context before remote calls and remote-agent delegations
SHOULDWhy it matters
Sensitive context the agent never sees on a given call cannot be exfiltrated on that call. Minimization reduces both the value and the reachability of memory as an exfiltration target.
How to implement
Before delegating to a remote agent or calling an external tool, strip context to the fields declared necessary by the call's schema. Apply a deny-list for known sensitive types regardless of schema.
Verify it's done
A remote-agent call's payload contains only fields enumerated in its action schema. Sensitive identifiers absent from the schema are absent from the wire.
Detection, human review, and adaptive evaluation
4 checksRequire human approval at the plan level for irreversible or high-impact actions
MUSTWhy it matters
Per-step approval produces fatigue and rubber-stamping. Plan-level review concentrates human attention where automation is weakest: the moment of authorizing a coherent set of consequential actions.
How to implement
Define a risk taxonomy for plans (irreversibility, cross-principal effects, financial or admin scope). Plans above threshold pause for human review with the full plan, its provenance lineage, and the trust classes of inputs that produced it.
Verify it's done
Approval events exist in the audit trail for every above-threshold plan. A plan synthesized from untrusted inputs cannot execute its high-risk steps without a recorded human decision.
Monitor trajectories for anomalous tool-call patterns and trust-class violations
SHOULDWhy it matters
Detection is supporting instrumentation, not the security boundary, but it shortens dwell time when prevention layers are bypassed.
How to implement
Log every tool call with its arguments' trust-class lineage. Alert on patterns the source paper flags as suspicious: tool calls whose arguments derive entirely from low-trust sources, sudden delegation chains that reach new sinks, repeated near-policy-edge calls, or memory writes that immediately influence other principals' retrievals.
Verify it's done
A synthetic injection that produces a low-trust-derived tool call generates an alert with the lineage attached. The on-call runbook references this alert class.
Run masked-re-execution or trajectory-consistency checks on high-risk plans
NICEWhy it matters
Re-execution with untrusted segments masked exposes plans whose actions depend on injected content rather than on the legitimate task.
How to implement
For plans flagged by the risk taxonomy, re-run the planner with untrusted-source segments redacted or shuffled and compare the resulting tool-call set. Divergence beyond a configured threshold blocks execution pending review.
Verify it's done
A plan whose tool-call set changes materially when untrusted inputs are masked is blocked, with both trajectories preserved for review.
Evaluate against adaptive multi-agent benchmarks before each release
SHOULDWhy it matters
Static red-team prompts decay quickly. Adaptive harnesses across topology variants — planner+memory, debate, federated — are the closest available proxy for in-the-wild novelty.
How to implement
Integrate AgentDojo, InjecAgent, or BIPIA-style harnesses into the release pipeline with adaptive attacks enabled. Report attack success rate, propagation depth, sink reachability, and utility under no attack. Track regressions release over release.
Verify it's done
A release dashboard shows the four metrics for the current and prior releases. Regressions on attack success or propagation depth block the release or carry a written exception.
Acceptance criteria
The checklist is fully implemented when: every MUST is in place with a verified test or audit signal, every SHOULD is in place or covered by a documented compensating control, and a recent adaptive evaluation run is on file with non-regressing attack-success and propagation metrics. A reviewer should be able to sample any tool call from production traces and reconstruct its provenance lineage, the capability binding that authorized it, the credential scope it used, and the sink-policy decision that admitted it. Synthetic injections delivered through retrieval, tool outputs, and remote-agent replies should fail to reach irreversible sinks without producing either a deterministic policy block, a sandbox denial, or a recorded human approval. Cross-agent infection tests should not propagate beyond the originating worker's trust partition, and poisoned memory writes should not surface across task or principal boundaries. For broader navigation across this pillar, see agentic AI security.