Hardening Multi-Agent Systems Against Prompt Injection

Executive Summary

Prompt injection is now widely treated as the leading security risk for LLM-integrated applications, and multi-agent systems worsen the problem because natural-language control signals move across more boundaries: user-to-orchestrator, planner-to-worker, agent-to-agent, memory-to-agent, tool-to-agent, and protocol-to-agent. In such systems, a malicious instruction does not merely need to fool one model once; it can be replayed, amplified, persisted in memory, reinterpreted by other agents, or converted into dangerous tool calls. Official guidance from OWASP, NIST, OpenAI, Anthropic, Google DeepMind, and the National Cyber Security Centre converges on the same conclusion: prompt injection is not a narrow string-filtering problem, but a system-security problem rooted in the absence of a reliable command–data boundary inside current LLMs.

Recent empirical work shows that agentic systems are already vulnerable in realistic settings. AgentDojo introduced 97 realistic tasks and 629 security cases for agents operating over untrusted data; InjecAgent evaluated 1,054 cases across 17 user tools and 62 attacker tools and found meaningful attack success even on strong models; Prompt Infection showed lateral, virus-like propagation across interconnected agents; MAD-Spear showed that compromising even a small subset of debaters can corrupt consensus quality; and ToolHijacker demonstrated that tool descriptions themselves can become an injection surface that biases tool selection. Official browser-agent reports further show that adaptive prompt injection remains practical enough that even very low attack-success rates remain operationally significant.

The central claim of this paper is that multi-agent systems should be engineered under an “assume partial compromise” doctrine. The objective is not perfect detection of malicious text; it is to prevent compromised text from acquiring privilege, to prevent compromised agents from moving laterally, and to constrain the consequences of misalignment when it occurs. The most defensible architecture combines: strict privilege separation between planning and execution; typed, schema-constrained inter-agent messages; provenance and authentication for tool and agent metadata; scoped credentials and sandboxed executors; output-side sink controls; anomaly detection and re-execution checks; adversarially trained models; and evaluation against adaptive attacks rather than only static benchmarks. The evidence suggests that this layered strategy is materially stronger than single-layer filters, though it imposes real cost, latency, and usability trade-offs.

Because no target platform was specified, this paper assumes a generic enterprise multi-agent stack composed of an orchestrator, specialised LLM agents, shared memory or retrieval, local and remote tools, and optional inter-agent protocols. The analysis is therefore intended to apply across architecture families exemplified in the literature rather than to any single vendor product.

Abstract

Multi-agent LLM systems promise higher capability through decomposition, role specialisation, tool use, and agent-to-agent delegation. Those same design choices also increase the prompt-injection attack surface by multiplying communication channels, trust boundaries, and opportunities for privilege transfer. This paper studies how prompt injection changes when moving from single-agent pipelines to multi-agent systems, and argues that the dominant risk is no longer mere instruction override, but cross-boundary control-flow corruption: injected content can spread laterally between agents, contaminate shared memory, bias tool selection, manipulate consensus, or exploit protocol and session state.

The paper synthesises academic benchmarks, original attack papers, and official industry guidance to develop a system-and-attacker model for platform-agnostic multi-agent deployments. It then proposes a defence-in-depth architecture that combines architectural separation, protocol hardening, model-side robustness, deterministic policy enforcement, sandboxing, provenance, scoped authentication, anomaly detection, and adaptive evaluation. The literature strongly indicates that such layered designs outperform standalone prompt engineering or classifiers, but also that many reported gains shrink under adaptive attacks and that low-level containment controls remain indispensable even when model robustness improves.

The main practical conclusion is that hardening multi-agent systems against prompt injection requires security architecture, not merely safer prompts. Future work should prioritise formal authority semantics for agent communications, scalable provenance for tool and agent metadata, robust training against cross-agent propagation, and benchmarks that measure both utility and blast-radius containment under adaptive adversaries.

Introduction

Multi-agent LLM systems have moved from conceptual demonstrations to general orchestration frameworks. AutoGen models applications as interacting conversable agents; CAMEL studies role-playing societies of communicating agents; MetaGPT turns standard operating procedures into multi-agent workflows; and ChatDev uses specialised agents to coordinate software design, coding, and testing. These systems increase capability by decomposing a task into specialised sub-problems and by allowing human, model, and tool interactions to be mixed in a single workflow.

However, those gains come with a structural security cost. A single-agent assistant can be corrupted when untrusted content is concatenated with trusted instructions, but a multi-agent system adds new dimensions: a planner can relay corrupted goals to workers; a worker can deposit poisoned summaries in shared memory; a compromised remote agent can return malicious artefacts; a tool registry can bias discovery before execution even begins; and an injected agent inside a debate or consensus system can distort outcomes indirectly by influencing the other agents’ beliefs. In other words, the control plane of the system is no longer coextensive with a single prompt; it is distributed across messaging, memory, metadata, and execution substrates.

This paper takes the position that prompt injection in multi-agent systems should be analysed as a confused-deputy and authority-propagation problem. Current LLMs do not reliably separate instructions from data; official guidance now explicitly warns that treating prompt injection as ordinary input sanitisation or as an SQL-injection analogue is misleading. The correct design question is not “How do we perfectly detect malicious prompts?” but “Which messages are allowed to influence planning, which sinks can exercise privilege, how is provenance preserved, and what happens if one agent is compromised?”

The contribution of this paper is therefore analytical rather than empirical: it unifies the current literature into a platform-agnostic security model, derives a taxonomy of multi-agent prompt injection, proposes a layered defence architecture, and specifies an evaluation methodology suitable for research or high-assurance engineering. Because no specific product, framework, or deployment context was specified, all platform-specific details are stated as assumptions rather than facts about any one implementation.

Background and Related Work

Prompt injection and the missing command–data boundary

Modern prompt injection research begins from the observation that LLM-integrated applications blur the distinction between instructions and content. Greshake et al. introduced indirect prompt injection as a way to exploit applications remotely by placing malicious prompts in retrieved data, showing impacts such as data theft, arbitrary code-like behaviour, and manipulation of downstream API use. Liu et al. later formalised prompt injection attacks and benchmarks across multiple tasks and models, while Yi et al. introduced BIPIA, the first benchmark focused specifically on indirect prompt injection from external content. These works established two persistent facts: LLMs are broadly vulnerable, and mitigation must address how authority is represented in context rather than only filtering particular strings.

The same conclusion appears in official guidance. The NIST Generative AI Profile distinguishes direct prompt injection from indirect prompt injection and explicitly notes that such attacks can steal proprietary information or trigger malicious code execution in interconnected systems. The NCSC argues that prompt injection is not SQL injection because the underlying models do not enforce a robust separation between instructions and data. OWASP’s 2025 LLM risk taxonomy places prompt injection at the top of the list. Together, these sources suggest that prompt injection is a foundational architectural weakness rather than a corner-case exploit.

Multi-agent systems and the expansion of attack surface

Multi-agent systems extend this weakness by introducing more channels and more semantics. In AutoGen, agents converse to solve tasks; in CAMEL, role-playing structures coordination; in MetaGPT, SOP-like workflows encode role-specific responsibilities; and in ChatDev, specialised agents exchange programming and natural-language artefacts across phases of development. Security-relevant authority is therefore distributed across messages, roles, task decomposition logic, and tool invocations. A compromise in any one location may be amplified by delegation, summarisation, or memory persistence.

This systems view is reinforced by recent protocol work. The Model Context Protocol documents that tool descriptions and annotations should be considered untrusted unless obtained from a trusted server, that hosts must obtain explicit user consent before invoking tools, and that arbitrary data access and code execution require careful security controls. Similarly, the Agent2Agent documentation describes secure agent-to-agent communication with capability discovery through agent cards and enterprise-grade authentication and authorisation. These features are useful, but they also formalise new machine-readable surfaces that can be poisoned, spoofed, replayed, or over-privileged if not authenticated and policy-checked.

Empirical agent-security results

The benchmark literature makes the risk concrete. AgentDojo models tool-using agents over untrusted data with 97 realistic tasks and 629 security cases, and reports that state-of-the-art LLMs still fail many tasks even absent attack while current attacks break some important security properties. InjecAgent introduces 1,054 cases across tool-rich settings and finds meaningful vulnerability even for strong prompting strategies, with a boosted attacker nearly doubling success against a ReAct-prompted GPT-4 baseline. These studies show that baseline agent competence cannot be assumed and that even partial attack success is significant when real tools are attached.

More recent work identifies multi-agent-specific and tool-level effects. Prompt Infection shows that malicious prompts can self-replicate across agents in a virus-like manner. MAD-Spear shows that debate systems can be corrupted by compromising only a subset of agents and exploiting conformity pressures during consensus formation. ToolHijacker shows that poisoning a tool document can bias tool retrieval and selection before execution, and reports that several existing defences are insufficient in that setting. Web and coding-agent analyses further show that untrusted content can lead to credential exfiltration, domain-validation bypass, or unsafe execution if planners, browsers, and tool ecosystems are not isolated.

On the defence side, the literature has progressed from prompt engineering and classifiers toward structural defences. Spotlighting uses provenance-preserving transformations and reports reduction of attack success from above 50% to below 2% in its experiments. BIPIA reports substantial black-box mitigation and near-zero success under a white-box boundary-aware defence. SecAlign uses preference optimisation and reports attack success under 10% with similar utility. MELON uses masked re-execution and tool comparison to outperform state of the art on AgentDojo. CaMeL and the subsequent design-patterns work push further by treating prompt injection as an architectural problem and enforcing security through privileged/quarantined separation, explicit control-flow extraction, and capability-based execution.

Official product work aligns with the research trend. Google’s secure-agent framework emphasises well-defined human controllers, carefully limited powers, and observable actions; OpenAI advises structured outputs, avoidance of untrusted variables in developer messages, and human approval for tool actions; Anthropic combines reinforcement-learning-based robustness, classifiers, red teaming, and sandboxing, while also stating plainly that no browser agent is immune and that even a 1% attack-success rate remains meaningful. These statements support a larger conclusion: the field is converging on bounded-autonomy and defence-in-depth rather than “prompt hardening” alone.

System and Attacker Models

Assumptions, assets, and security objectives

Because the platform is unspecified, the system model assumed here is a generic multi-agent pipeline with six logical components: a user-facing orchestrator; one or more privileged planning agents; lower-privilege worker agents; shared memory or retrieval; tool and protocol adapters; and execution environments for high-risk actions. The protected assets are user intent, secrets, credentials, private data, policy state, tool outputs, memory contents, audit logs, and the integrity of control-flow decisions. The primary security goals are confidentiality, integrity, availability, and provenance-preserving accountability. The key additional requirement in multi-agent settings is authority confinement: no untrusted content should gain the ability to alter privileged planning or invoke high-risk sinks without evidence, policy approval, and containment. This modelling choice is consistent with existing multi-agent frameworks and official secure-agent guidance.

The attacker is assumed to be adaptive and may control one or more of the following: direct user input; external content such as webpages, emails, PDFs, or files; compromised tools or APIs; tool metadata such as descriptions or annotations; agent-to-agent messages from remote or federated participants; session or event channels in protocol implementations; and content later replayed from memory or retrieval. The attacker’s goals may include system-prompt leakage, tool misuse, private-data exfiltration, planner corruption, consensus manipulation, task derailment, denial of service, lateral spread to other agents, or persistence in memory for later activation. This adversary model is narrower than full host compromise but broader than single-turn jailbreaking, and it reflects both academic and official protocol threat descriptions.

Multi-agent systems also require explicit modelling of channels. A useful way to distinguish channels is by trust and by control semantics: user instructions, developer policy, inter-agent tasking, observational context, tool return data, protocol metadata, and sink invocations should not be treated equivalently. When systems collapse several of these channels into a shared free-form prompt, they create exactly the ambiguity exploited by prompt injection. OpenAI’s guidance to use structured outputs between nodes and to avoid placing untrusted variables into developer messages is therefore not merely a prompting tip; it is a statement about preserving authority boundaries.

The diagram shows the core thesis of the threat model: the high-risk transitions are not only input-to-model, but also planner-to-worker, memory-to-agent, registry-to-broker, and broker-to-sink. Prompt injection becomes more dangerous as soon as natural-language content can cross into any of those control-bearing paths. That is why protocol guidance emphasises user consent, tool safety, scoped authorisation, and rejection of anti-patterns such as token passthrough or weak session handling.

Taxonomy and Defence Strategy

Taxonomy of prompt injection in multi-agent systems

The multi-agent setting introduces additional attack classes beyond the standard direct/indirect split. The most useful analytical distinction is by propagation path: vertical attacks cross privilege levels; horizontal attacks spread between peers; temporal attacks persist across time; registry-level attacks poison discovery and selection; and protocol attacks corrupt stateful coordination. This framing fits the published attacks better than a simple “direct versus indirect” taxonomy.

Attack class	Mechanism	Why multi-agent systems amplify it	Typical impact
Direct override	Adversarial user input attempts to replace or outrank policy	Planning agents may relay corrupted goals to many workers	Misaligned tasking, safety bypass, policy leakage
Indirect content injection	Malicious instructions embedded in retrieved content	Shared memory and summaries can replay the payload to other agents	Data exfiltration, sink misuse, task hijack
Cross-agent infection	One compromised agent embeds instructions for peers	Lateral spread through delegation and message passing	System-wide corruption, stealthy propagation
Consensus or conformity poisoning	A subset of agents emits plausible but false advice	Debate and voting mechanisms can amplify persuasive errors	Corrupted consensus, misinformation, bad decisions
Tool-manifest poisoning	Poisoned tool descriptions or metadata bias tool choice	Discovery and selection happen before execution safeguards	Malicious tool selection, privilege pivoting
Memory poisoning	Injected content is stored and later reactivated	Persistent memory survives beyond the original observation	Delayed compromise, hard-to-debug recurrence
Protocol or session injection	Malicious events, replayed sessions, or forged metadata	Inter-agent protocols introduce stateful surfaces and registries	Impersonation, event hijack, unauthorised actions
Social-engineering injection	Content frames malicious actions as urgent or legitimate	Approval UIs and humans experience fatigue in long workflows	Unsafe approvals, privacy leakage, persistence of bad plans

This taxonomy synthesises Greshake et al. on indirect prompt injection, AgentDojo and InjecAgent on tool-rich agent settings, Prompt Infection on lateral spread, MAD-Spear on consensus distortion, ToolHijacker on poisoned discovery artefacts, and MCP security guidance on session hijack and untrusted tool descriptions.

Layered defence architecture

A defensible multi-agent design must ensure that risky content stays in low-privilege channels unless and until it is transformed into validated, typed, policy-approved structures. The most promising designs in the literature share this property even when they differ in implementation details. CaMeL achieves it by separating privileged and quarantined reasoning and enforcing capability-aware execution; the design-patterns paper generalises similar ideas into code-then-execute, dual-LLM, plan-then-execute, action-selector, map-reduce, and context-minimisation patterns; and official vendor guidance repeatedly recommends isolating planning from execution and constraining data flow between nodes.

This architecture is deliberately asymmetric. Untrusted input may inform perception, but not directly determine execution. The planner may decide what class of action is needed, but the executor requires sink-side authorisation, and the output guardrail checks whether the intended tool call or data release is aligned with the user’s goal. This design is consistent with OpenAI’s tool-call and pre-flight output validation, Anthropic’s sandboxing and plan review, and Google’s emphasis on limited powers and observability.

Defence classes and comparative analysis

The first defence class is architectural. Here the goal is to separate trusted control from untrusted content, minimise context given to privileged components, and convert free-form language into typed intermediate representations before any dangerous action occurs. CaMeL is the clearest research example: it explicitly extracts control and data flow from the trusted query and reports solving 77% of AgentDojo tasks with provable security, versus 84% for an undefended system. The design-patterns paper argues that such patterns can provide provable resistance under explicit modelling assumptions, which is stronger than empirical filtering alone.

The second class is prompt augmentation and channel marking. Spotlighting, boundary awareness, explicit reminders, and related methods try to help the model distinguish user intent from retrieved data by adding durable provenance signals or reinforced instructions. These techniques are cheap and easy to deploy, and some can be highly effective against non-adaptive or moderately adaptive attacks. But they still rely on the model respecting the distinction, and therefore should be viewed as a low-cost layer rather than as the security boundary itself.

The third class is model-level robustness. SecAlign uses preference optimisation to train the model to prefer secure responses and reports attack success below 10% with similar utility. Anthropic reports that it uses reinforcement learning with simulated web content to improve browser-agent robustness, while Google DeepMind reports adversarial evaluation and fine-tuning to improve resistance. These are promising advances, but the empirical literature also shows that adaptive attacks still matter: Google DeepMind reports that in 16 of 24 defence–attack pairs, adaptive attacks matched or outperformed non-adaptive attacks, meaning conventional offline evaluations can overstate security.

The fourth class is detection and re-execution. This includes classifiers, self-reflection, perplexity, and methods such as MELON that compare trajectories under masking or re-execution. MELON reports better attack prevention and utility preservation than prior approaches on AgentDojo. Yet detector trade-offs remain sharp. In Google DeepMind’s Gemini study, retrieved-data classifiers showed very high false-positive rates in some settings, while certain in-context defences lowered attack success at the cost of high null-response rates. OpenAI’s prompt-injection detector benchmark likewise shows very strong ROC AUC on its dataset but a substantial disparity in recall at 1% false-positive rate across models. These results imply that detectors are useful supporting instrumentation, but brittle as sole gatekeepers.

The fifth class is sandboxing and least privilege. Anthropic’s Claude Code sandbox isolates filesystem and network access and keeps sensitive credentials out of the sandbox entirely. Google’s secure-agent guidance similarly argues that powers must be carefully limited and observable. In multi-agent settings this principle must be extended per agent, per tool, and per sink: planners should not hold broad credentials; worker agents should have task-scoped capabilities; and execution should occur in environments where prompt injection can at worst waste tokens or produce a blocked plan rather than exfiltrate secrets. This is the most dependable method for reducing blast radius when detection fails.

The sixth class is protocol and cryptographic hardening. MCP authorisation uses OAuth-based discovery and audience-bound tokens; its security guidance forbids token passthrough, warns about session hijacking, and recommends scope minimisation. A2A similarly assumes authenticated, structured agent interaction with capability discovery via agent cards. For prompt injection, the lesson is that metadata itself should be authenticated and provenance-preserving. Tool descriptions, agent cards, capability manifests, and returned artefacts should be signed or attestable, and credentials should be audience-bound, short-lived, and tied to explicit scopes. Patterns from verifiable credentials, software signing, and provenance systems such as Sigstore and C2PA are directly relevant here even though they are not prompt-injection-specific.

Defence family	Effectiveness	Overhead	Complexity	Deployability	Residual risk
Prompt augmentation and channel marking	Medium to high against basic IPI; weaker against strong adaptive attacks	Low	Low	High	Medium
Input/output classifiers and anomaly detectors	Medium; often useful as tripwires but sensitive to thresholds and dataset shift	Medium	Medium	High	Medium to high
Robust training and preference optimisation	High in published studies; depends on training coverage and model access	High training, low inference	High	Medium	Medium
Re-execution and trajectory-consistency checks	High against several indirect attacks; costly at runtime	High	Medium to high	Medium	Medium to low
Architectural privilege separation and code-then-execute	High to very high when assumptions hold	Medium to high	High	Medium	Low to medium
Structured outputs and typed inter-agent channels	High for reducing propagation paths; does not solve persuasion upstream	Low	Medium	High	Medium
Sandboxing, scoped credentials, and sink permissions	High for impact containment, not for attack prevention per se	Medium	Medium to high	Medium	Low to medium
Protocol authentication and provenance	Medium for discovery/session/tool poisoning; insufficient against semantic manipulation alone	Low to medium	Medium to high	Medium	Medium
Human approval, plan review, and observability	Medium; strongest for high-impact actions, weakest under fatigue	Medium human cost	Medium	High	Medium

These ratings are an analytical synthesis of current evidence rather than a single benchmark result. They are grounded in the reported performance of Spotlighting, BIPIA, SecAlign, MELON, CaMeL, Google DeepMind’s adaptive evaluations, Anthropic’s browser and sandbox guidance, OpenAI’s structured-output and guardrail documentation, and MCP/A2A security specifications.

The main design implication is straightforward: no single defence family should be treated as sufficient. Marking and training reduce attack success; protocol authentication protects discovery and session state; but only least privilege, typed control channels, and sandboxed execution reliably reduce worst-case impact when manipulation succeeds.

Evaluation Methodology and Expected Results

Experimental design

Because no implementation or platform was supplied, no new experiments are executed here. Instead, this section specifies a methodology suitable for a conference submission or a reproducible engineering evaluation. The best starting point is to combine standard agent-security benchmarks with a new multi-agent harness. AgentDojo should be used for realistic tool-using tasks and adaptive-defence evaluation; InjecAgent should be used for broad tool-integrated indirect injection coverage; BIPIA should be used for content-only indirect prompt injection; and a platform-specific supplement should be added for browser or coding behaviours where web navigation, file access, shell access, or remote protocols materially change the sink surface.

The reference system should instantiate at least four topology families: a central planner with stateless workers; a planner with shared memory; a debate or ensemble system; and a federated system with remote-agent or protocol adapters. The same tasks should be run across multiple orchestration frameworks or framework-equivalent abstractions, for example AutoGen-like messaging, CAMEL-like role specialisation, and SOP-style workflows modelled after MetaGPT. Where protocol interoperability is relevant, the harness should expose capability discovery, session state, and OAuth-scoped tool access through MCP- or A2A-like adapters. This allows comparison of semantic, registry-level, and protocol-level failure modes without tying the paper to a single vendor.

Baselines should include: an undefended multi-agent ReAct-style stack; prompt-only hardening; structured outputs only; classifier-only detection; robustly trained model variants such as SecAlign-type fine-tuning; trajectory-based detection such as MELON; architectural separation such as CaMeL or a dual-LLM design; sink-only containment through sandboxing and scoped credentials; and a combined layered stack. Attacks should cover direct override, indirect content injection, memory poisoning, cross-agent propagation, consensus poisoning, tool-manifest poisoning, protocol/session injection, and blended social-engineering variants. Crucially, both non-adaptive and adaptive attacks must be included.

Metrics should be wider than attack success rate. At minimum, the study should report: attack success rate; secure task completion rate; utility under no attack; privilege-escalation rate; exfiltration success; lateral propagation depth measured in number of newly contaminated agents; memory persistence duration; false-positive rate; null-response rate; user-approval burden; runtime latency; token cost; and post-compromise blast radius, defined as the number and sensitivity of sinks that can still be reached after a successful injection. Existing work already shows why these extra metrics matter: Google DeepMind reports both attack success and utility-related trade-offs, while debate attacks affect consensus efficiency rather than only single-agent correctness.

Expected outcomes and how to report them

The literature supports three expected findings. First, adaptive attacks will materially outperform non-adaptive attacks against many prompt-based or detector-only defences. Google DeepMind’s results already show this pattern, and it is consistent with broader robustness experience in machine learning security. Second, architectural separation and sink-side containment will define the Pareto frontier for high-assurance settings: they may cost more latency or engineering effort, but they reduce the consequence of remaining attack success. Third, combined stacks will outperform single techniques on overall security–utility trade-offs, with structured channels limiting propagation, robust training lowering compliance, and sandboxing capping damage.

Results should be presented in four forms. The first is an attack–utility Pareto plot showing secure task completion versus latency or token cost. The second is a propagation matrix showing whether compromise spreads from one infected channel or agent to others. The third is a sink-level risk table reporting whether secrets, data exports, external communications, and code execution remain reachable under compromise. The fourth is an ablation over topology—single planner, planner plus memory, debate, federated A2A/MCP-style system—to show how architecture alone changes risk. Single-number benchmarking is inadequate because a defence that lowers raw attack success but increases false positives or human fatigue may still be worse in deployment.

A useful reporting convention is to separate prevention from containment. Prevention asks whether the malicious instruction changed the system’s internal plan or selected sink. Containment asks whether the system still prevented secrecy loss, unauthorised external communication, or unsafe execution even after some internal misalignment occurred. This distinction helps compare model-side defences with sandboxing and authorisation controls in a scientifically fair way. It also fits the source–sink framing used in official guidance: an attack is only operationally dangerous when an untrusted source is connected to a dangerous sink.

Discussion, Recommendations, and Future Work

Trade-offs, limitations, deployment considerations, privacy, and usability

The strongest defences are often the least convenient. Sandboxing, typed schemas, approval checkpoints, secure token brokerage, and separate planner/executor roles all add latency, engineering complexity, or user friction. Anthropic’s own documentation notes that repeated permissions create approval fatigue, which motivated plan-level review. Google’s secure-agent guidance similarly centres observability and controlled power rather than invisible autonomy. The implication is that high security in multi-agent systems will usually look more like operating-system design than like an unconstrained chatbot.

Privacy creates a further tension. Shared memory improves continuity and utility, but it also increases the chance that sensitive context leaks across tasks or departments. Anthropic explicitly notes cross-context privacy risk in extended agent interactions, while OpenAI warns that models may send more data to connected MCPs than users intended. Consequently, memory should be partitioned by task, principal, and sensitivity class, and data minimisation should be enforced before any remote call. This is not only a privacy measure; it is a prompt-injection mitigation because less sensitive context means fewer valuable targets for exfiltration.

Protocol security is necessary but insufficient. MCP and A2A can provide authorisation, discovery, structured messages, and better auditability, yet they do not by themselves solve semantic manipulation. MCP’s own security guidance warns against token passthrough, session hijacking, over-broad scopes, and untrusted tool descriptions. In other words, a perfectly authenticated malicious tool description is still malicious. Authentication proves origin; it does not prove benevolence or relevance. Architectures must therefore combine protocol integrity with semantic policy checks and low-privilege handling of untrusted metadata.

Another important limitation is that empirical results are still heavily benchmark-dependent. Spotlighting, BIPIA, SecAlign, MELON, and CaMeL all report strong gains, but they do so under different tasks, attack generators, and adversary assumptions. Official vendor benchmarks also differ in operating points and outcome definitions. Meanwhile, Google’s April 2026 web-scale analysis suggests that malicious prompt injections in the wild remain less sophisticated than the strongest research attacks, though interest is increasing. The correct inference is neither complacency nor panic: the strongest academic attacks are ahead of broad attacker operationalisation, but the incentive landscape is moving in the dangerous direction.

Practical recommendations for practitioners

Assume some prompt injections will succeed and design for constrained consequences. Treat every agent as a potentially confusable deputy and ensure that dangerous sinks require separate authorisation, scoped credentials, and containment.
Keep untrusted content out of privileged prompts. Do not interpolate untrusted data into developer messages; convert external content into validated, typed structures before it reaches the planner.
Separate planning, parsing, and execution. Use low-privilege parsing for untrusted inputs, a higher-trust planner for abstract action selection, and a sandboxed executor for side-effectful operations.
Constrain inter-agent communication. Replace free-form agent-to-agent instructions with schemas, enums, capability identifiers, and policy-checked action requests wherever possible.
Treat tool metadata, agent cards, and protocol artefacts as untrusted inputs unless authenticated and provenance-preserving. Sign and attest tool manifests and remote-agent descriptors; reject token passthrough and broad, permanent scopes.
Use sandboxing and keep secrets out of the execution environment. If a prompt injection reaches execution, the sandbox should still prevent credential theft, arbitrary outbound network access, or unrestricted filesystem reads.
Instrument detection, but do not rely on it as the only control. Use output guardrails, pre-flight response validation, self-reflection, anomaly detection, and trace analysis as supporting layers and observability tools.
Evaluate with adaptive attacks and topology-aware benchmarks before deployment. Single-turn or non-adaptive tests will systematically overestimate security in real multi-agent environments.

Open research questions

How should authority be represented formally in natural-language agent ecosystems? Existing systems still lack a robust semantic equivalent of typed capability passing or information-flow labels for free-form language.
Can cross-agent propagation be bounded provably in realistic systems with memory, retrieval, and remote protocols? Current proofs are promising but depend on controlled architectural assumptions.
What is the right training objective for multi-agent robustness? Single-model robustness does not automatically imply safety under delegation, summarisation, or consensus dynamics.
How can provenance be made usable at scale for tools, agent cards, memories, and summaries? Signing everything is not enough unless downstream agents can reason about trust, freshness, and relevance.
How should benchmarks measure containment rather than only prevention? Existing datasets are improving, but blast radius, lateral spread, and sink reachability remain under-measured.
How can human oversight remain effective without causing fatigue? Approval at every step does not scale, yet too little review undermines security.
What are the right abstractions for secure inter-agent protocols? A2A and MCP are important starts, but secure-by-default agent interoperability remains immature.
How much of prompt injection in the wild will transition from experimentation to organised abuse? Current observations suggest rising interest but still limited sophistication; that may not remain true as agents gain more authority.

Conclusion

Hardening multi-agent systems against prompt injection requires a shift in mindset. The relevant question is not whether a model can be persuaded by malicious text in principle; current models can. The correct question is whether the surrounding system preserves authority boundaries, controls propagation, limits privileges, authenticates metadata, and contains the consequences of inevitable failures. The literature and official guidance now align on this point. Prompt injection should be treated as a first-class systems-security problem for agentic AI, especially in multi-agent settings where compromise can spread laterally and persist temporally.

The most credible near-term path is layered: provenance and authentication at the protocol and metadata layers; typed control channels between agents; planner–executor separation; least privilege and sink controls; sandboxed execution; robustly trained models; and adaptive evaluation that measures both utility and containment. Future work should aim for stronger formal guarantees, better cross-topology benchmarks, and more usable oversight mechanisms. Until then, organisations deploying multi-agent systems should assume that prompt injection is a chronic operational reality and architect their systems so that compromised language does not become compromised authority.