Hardening Compound AI Systems for Multi-Step Automation
Quick Answer
This checklist hardens compound AI systems — multi-step LLM automation built from orchestrators, tool brokers, retrievers, memory, and validators — against indirect prompt injection, retrieval poisoning, excessive agency, and audit gaps. Audience is platform, ML, and security engineers preparing such a system for production. Run it as a pre-launch review and re-run per release. Some reproduction details are withheld pending vendor coordination.
A compound AI system is a workflow runtime in which a non-deterministic model is one component among orchestrators, tool brokers, retrievers, memory, validators, and human gates. Hardening it is a systems-engineering job: the model proposes, extracts, ranks, and drafts; the runtime authorizes, validates, executes, records, and stops. This checklist covers the system boundary — not the model, not a single agent. For threat-model context on why models are non-principals, see agent capability control and indirect prompt injection.
How to use this checklist
Run it as a pre-launch review for any compound AI system that mutates external state, and re-run on every workflow change or tool registry update. Each check is owned by the platform team that runs the orchestrator; security review signs off on the MUST tier before production traffic. The reference architecture, evidence anchors (AgentDojo, PoisonedRAG), and rationale for each check live in the source paper, Compound AI Systems and Orchestration Patterns for Multi-Step Automation. "Done" means every MUST is implemented and verifiable from logs or config, every SHOULD has either an implementation or a documented exception, and the regression suite in domain 7 runs in CI on every change to prompts, tools, or graph topology.
Architecture and workflow shape
3 checks
Express the workflow as a typed graph with bounded steps, retries, and wall-clock budgets
MUST
Why it matters
Free-form ReAct loops over high-impact tools have no structural ceiling on cost, blast radius, or recursion. A typed graph forces each node to declare its purpose, input/output schema, allowed tools, and timeout — converting "the agent decided" into "the runtime allowed." Bounded steps also stop loops induced by a poisoned tool output that asks the agent to keep retrying.
How to implement
Define the workflow as a directed graph in the orchestrator (LangGraph, custom DAG, or equivalent). Each node carries a typed I/O contract, a tool allowlist, a max-retry count, and a per-node and per-run wall-clock budget enforced by the orchestrator, not the model.
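A minimal sketch of such a node contract and step budget, assuming a Python orchestrator; the names (NodeSpec, RunBudget, charge_step) are illustrative, not a LangGraph API:

```python
from dataclasses import dataclass

# Hypothetical shapes -- NodeSpec/RunBudget and their fields are illustrative.
@dataclass(frozen=True)
class NodeSpec:
    name: str
    input_schema: dict              # typed I/O contract (JSON-schema style)
    output_schema: dict
    allowed_tools: frozenset        # tool allowlist enforced by the broker
    max_retries: int = 1
    wall_clock_s: float = 30.0      # per-node budget, enforced by the orchestrator

@dataclass
class RunBudget:
    max_steps: int
    steps_used: int = 0

    def charge_step(self) -> None:
        # The runtime, not the model, decides when a run has gone on too long.
        self.steps_used += 1
        if self.steps_used > self.max_steps:
            raise RuntimeError("budget-exceeded")

node = NodeSpec("draft_refund", {"order_id": "string"}, {"draft_id": "string"},
                frozenset({"create_refund_draft"}))
budget = RunBudget(max_steps=2)
budget.charge_step()
```

Because the orchestrator raises on the budget breach, a runaway run terminates with a loggable budget-exceeded event rather than quietly continuing.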
Verify it's done
A reviewer can read the graph definition for the workflow and enumerate every node's tools, schemas, and budgets without running the system. Runs that exceed step or wall-clock budgets terminate with a logged budget-exceeded event.
Make every run replayable from a deterministic seed of inputs and version pins
MUST
Why it matters
Without replay, incident response is guesswork. The combination of model nondeterminism, retrieved context, and tool outputs means a defect cannot be reproduced from a timestamp alone — you need the exact prompt assembly, retrieval set, tool responses, and version pins.
How to implement
Persist, per run, the input event, identity and risk tier, prompt template versions, model and adapter versions, tool registry hash, retrieved document IDs with content hashes, every tool call with arguments and response, and the random seed where supported. Provide a replay command that reconstructs the run against a recorded harness.
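One possible shape for the per-run record, as a sketch; the field names are assumptions, not a standard trace schema:

```python
import hashlib

# Illustrative trace record; field names are assumptions, not a standard schema.
def record_run(input_event, identity, template_versions, model_version,
               registry_hash, retrieved_docs, tool_calls, seed=None):
    return {
        "input_event": input_event,
        "identity": identity,
        "prompt_templates": template_versions,   # version pins, not raw text
        "model": model_version,
        "tool_registry_hash": registry_hash,
        # retrieval pinned by ID *and* content hash so replay detects drift
        "retrieved": [
            {"id": d["id"],
             "sha256": hashlib.sha256(d["text"].encode()).hexdigest()}
            for d in retrieved_docs
        ],
        "tool_calls": tool_calls,                # full arguments and responses
        "seed": seed,
    }

trace = record_run({"type": "refund_request"}, {"user": "u1", "tier": "high"},
                   {"draft": "v12"}, "model-x@2025-01", "sha256:abc",
                   [{"id": "doc-9", "text": "refund policy"}],
                   [{"tool": "lookup_order", "args": {}, "resp": {}}])
```

Hashing retrieved content rather than storing only IDs means replay can detect that a document changed between the incident and the investigation.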
Verify it's done
Pick a recent production run; re-execute it from its trace and confirm the orchestrator reaches the same terminal state (or a documented divergence point) using only persisted artifacts.
Forbid an "omnipotent assistant" surface
MUST
Why it matters
An agent that simultaneously holds email, browser, filesystem, database, payments, and code execution is a lethal trifecta by construction. One injection in any input channel reaches every sink. Splitting the surface is the single highest-leverage architectural control.
How to implement
Decompose by capability: separate workflows (or separate nodes with disjoint tool allowlists) for browse-only research, drafting, and state-mutating execution. No node may hold a network egress tool plus a privileged write tool plus access to untrusted external content.
Verify it's done
For every workflow node, at most two of the following three are present: an untrusted-input source, a privileged-write tool, an external-egress tool. Exceptions are documented with compensating controls.
Authority and capability
3 checks
Treat the model as a non-principal and bind a capability set per workflow step
MUST
Why it matters
Credentials belong to tools and the runtime, not the model. Without a per-step capability set, the agent inherits ambient authority from whatever credentials the orchestrator process holds, and a single injected instruction can reach any tool the process can call.
How to implement
Configure each graph node with an explicit capability descriptor — allowed tools, forbidden tools, data scopes, mutation rights, recipient or target allowlists. The tool broker rejects any call outside the active node's capability set. Per-tool sandboxing depth lives in agent capability control; this check is the system-level binding.
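A minimal broker-side sketch of this binding, with hypothetical node and tool names:

```python
class CapabilityDenied(Exception):
    pass

# Per-node capability descriptors; node and tool names are illustrative.
CAPABILITIES = {
    "research": {"allowed_tools": {"web_search", "read_doc"}, "mutation": False},
    "execute":  {"allowed_tools": {"create_refund_draft"}, "mutation": True},
}

def broker_call(active_node: str, tool: str, args: dict) -> dict:
    caps = CAPABILITIES[active_node]
    if tool not in caps["allowed_tools"]:
        # Denied at the broker; a capability-denied event lands in the trace.
        raise CapabilityDenied(f"{tool} outside capability set of {active_node}")
    return {"node": active_node, "tool": tool, "args": args}
```

The denial happens in the broker, so an injected instruction that convinces the model to request a forbidden tool still fails deterministically.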
Verify it's done
A test that asks the agent in node A to call a tool only allowed in node B fails at the broker with a capability-denied event in the trace.
Gate material consequences with deterministic policy-as-code
MUST
Why it matters
"Ask the model whether this refund is reasonable" is not a control — the model is the system being attacked. Amount thresholds, recipient allowlists, separation-of-duties rules, and rate limits must be enforced by code that does not consult an LLM.
How to implement
Implement a policy engine (OPA, Cedar, or equivalent) that evaluates each candidate state-changing action against rules expressed as code. The orchestrator submits the proposed action; the engine returns allow/deny/require-approval; the broker honors the decision.
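A toy version of such rules in plain Python, for illustration only; a production system would express them in OPA or Cedar, and the rule IDs, thresholds, and allowlist here are invented:

```python
# Minimal policy-as-code sketch; rule IDs, caps, and the allowlist are illustrative.
APPROVED_RECIPIENTS = {"acct-001", "acct-002"}

def evaluate(action: dict) -> dict:
    # Deterministic rules only: no LLM is consulted anywhere in this path.
    if action.get("type") != "refund":
        return {"decision": "deny", "rule": "default-deny"}
    if action["recipient"] not in APPROVED_RECIPIENTS:
        return {"decision": "deny", "rule": "recipient-allowlist"}
    if action["amount_cents"] > 50_000:
        return {"decision": "require-approval", "rule": "refund-amount-cap"}
    return {"decision": "allow", "rule": "refund-default"}
```

Returning the rule ID alongside the decision is what makes the trace requirement in the verification step checkable.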
Verify it's done
Policy decisions appear as discrete events in the trace with the rule ID that fired. A unit test suite covers each policy rule with positive and negative cases and runs in CI.
Prefer action-specific tools over general API wrappers
SHOULD
Why it matters
A tool exposed as run_sql(query) or http_request(url, method, body) lets the model author arbitrary intent. A tool exposed as create_refund_draft(order_id, amount) constrains intent to a known, auditable shape and lets the policy engine reason about arguments.
How to implement
For each business action the agent may take, expose a narrow tool whose argument schema names the entities involved. Reserve general-purpose tools for read-only contexts behind a tighter capability gate.
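For contrast with a general run_sql(query), a narrow tool might look like the following sketch; the names and the ord- prefix check are hypothetical:

```python
# Action-specific tool: the argument schema names the entities involved,
# so the policy engine can reason about them directly. Names are illustrative.
def create_refund_draft(order_id: str, amount_cents: int) -> dict:
    if not order_id.startswith("ord-"):
        raise ValueError("unrecognized order id shape")
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return {"action": "refund_draft", "order_id": order_id,
            "amount_cents": amount_cents}
```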
Verify it's done
The tool registry contains no general execution primitives in the capability set of any node that can mutate external state, except those flagged with a documented exception and approval requirement.
Context construction
3 checks
Tag every context element with a trust label and propagate it to policy decisions
MUST
Why it matters
Without trust labels, the prompt is a flat string and the model cannot — and the runtime will not — distinguish a developer instruction from a tool output that contains an injected instruction. AgentDojo reported up to 70% tool-output injection success against frontier models; trust labels are the substrate that makes downstream defenses possible. See information flow control for the underlying model.
How to implement
Define a fixed label set (system, developer, user, tool-trusted, tool-untrusted, retrieved-public, agent-generated). The orchestrator wraps every context segment with its label at assembly time. The tool broker and policy engine consume labels when deciding whether a proposed action is permitted given the active context's taint.
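The labeled assembly can be sketched as follows, using the fixed label set above; the helper names are illustrative:

```python
from enum import Enum

class Trust(Enum):
    SYSTEM = "system"
    DEVELOPER = "developer"
    USER = "user"
    TOOL_TRUSTED = "tool-trusted"
    TOOL_UNTRUSTED = "tool-untrusted"
    RETRIEVED_PUBLIC = "retrieved-public"
    AGENT_GENERATED = "agent-generated"

UNTRUSTED = {Trust.TOOL_UNTRUSTED, Trust.RETRIEVED_PUBLIC}

def assemble(segments):
    # Every segment keeps its label; the flat prompt string is derived from
    # this structure, never the other way around.
    return [{"label": label, "text": text} for label, text in segments]

def context_taint(context):
    # The taint set is what the broker and policy engine consume.
    return {seg["label"] for seg in context} & UNTRUSTED

ctx = assemble([(Trust.SYSTEM, "You are a refund assistant."),
                (Trust.TOOL_UNTRUSTED, "<html>IGNORE PREVIOUS...</html>")])
```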
Verify it's done
Inspect a recorded prompt assembly and confirm every segment carries a label. The policy engine's decision log references taint labels for state-changing actions.
Build minimal per-step context and separate planning from execution
SHOULD
Why it matters
Passing the entire transcript into every step turns one compromised tool output into a permanent prompt injection for the rest of the run. A planner that holds untrusted text should not also hold the credentials and tool surface to act on it.
How to implement
Each node receives only the typed inputs it declared, plus the system prompt for its role. Planning nodes operate over redacted, summarized, or label-stripped views. Execution nodes receive structured action arguments, not free-form planner narratives. Single-agent variants of this pattern are detailed in tool-using agent hardening.
Verify it's done
A diff of context fed to planner versus executor nodes shows the executor cannot read raw untrusted-tool content; only typed action parameters reach it.
Block state-mutating tool calls when the active context contains untrusted content unless an independent policy check passes
SHOULD
Why it matters
This is the structural defense against indirect prompt injection that does not depend on detecting payloads. If the planner read attacker-controlled text, the runtime treats subsequent state-changing intent as suspect by default.
How to implement
The tool broker reads the taint of the context that produced the action, and either denies state-changing tools, requires human approval, or requires the action to pass a policy rule independent of the model's justification (e.g., recipient on a pre-approved list, amount under a per-tier cap).
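The broker decision can be sketched as a pure function of taint and an independent policy result; the tool and label names are illustrative:

```python
MUTATING_TOOLS = {"send_email", "commit_refund", "delete_record"}
UNTRUSTED_LABELS = {"tool-untrusted", "retrieved-public"}

def gate_action(tool: str, context_labels: set, independent_policy_ok: bool) -> str:
    # No payload detection: the decision keys on taint plus a policy check
    # that never consults the model's justification.
    if tool not in MUTATING_TOOLS:
        return "allow"
    if context_labels & UNTRUSTED_LABELS:
        return "allow" if independent_policy_ok else "escalate"
    return "allow"
```

Note that a read-only tool passes regardless of taint; only state-changing intent from a tainted context is treated as suspect.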
Verify it's done
A regression test injects benign-looking instructions into a tool output that ask the agent to perform a mutating action; the broker denies or escalates the resulting tool call and logs the taint-driven decision.
Retrieval and memory
3 checks
Quarantine newly ingested documents and carry provenance and trust tier through to the prompt
MUST
Why it matters
PoisonedRAG demonstrated 90% attack success with five malicious documents in a corpus of millions. A retrieval substrate that loses provenance at retrieval time is a single ingestion away from compromising every downstream agent run.
How to implement
Stage ingested content in a quarantine tier subject to schema checks, content scanning, and (where appropriate) human review before promotion to the high-trust index. Persist source URI, ingestion time, source tier, and content hash on each chunk; render those fields into the prompt alongside the chunk so policy and the model both see provenance.
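A sketch of the quarantine-then-promote flow, with invented tier names and field names:

```python
import hashlib

# Illustrative ingestion record; tier names and fields are assumptions.
def ingest_chunk(text: str, source_uri: str) -> dict:
    return {
        "text": text,
        "source_uri": source_uri,
        "tier": "quarantine",    # stays here until promotion checks pass
        "sha256": hashlib.sha256(text.encode()).hexdigest(),
    }

def promote(chunk: dict, checks_passed: bool) -> dict:
    # Schema checks, content scanning, and review gate the tier change.
    if checks_passed:
        chunk["tier"] = "high-trust"
    return chunk

def retrieve(index, min_tier="high-trust"):
    # Quarantined content never reaches high-trust retrieval.
    return [c for c in index if c["tier"] == min_tier]
```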
Verify it's done
A sampled retrieval result in a recorded trace shows the source tier and URI for each chunk. Newly ingested content does not appear in high-trust retrieval until promoted.
Govern memory writes with schema, provenance, and write-policy
MUST
Why it matters
Memory turns a transient compromise into a persistent one. An agent that can author free-form rules into long-term memory can give a future session whatever instructions an attacker wants, with no remaining attack surface to detect.
How to implement
Define a typed schema for memory entries with an explicit author field (user, system, agent), a provenance pointer to the run that produced the entry, and a write-policy that forbids agents from writing security-relevant fields (capability hints, tool preferences, policy directives). All writes flow through a memory broker that enforces the schema.
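A minimal memory-broker sketch; the forbidden field names are the examples from this check, and everything else is illustrative:

```python
class MemoryWriteDenied(Exception):
    pass

# Security-relevant fields agents may never author.
AGENT_FORBIDDEN_FIELDS = {"capability_hint", "tool_preference", "policy_directive"}

def write_memory(author: str, field: str, value: str, run_id: str) -> dict:
    if author == "agent" and field in AGENT_FORBIDDEN_FIELDS:
        raise MemoryWriteDenied(f"agent may not write {field}")
    # Every entry records who wrote it and which run produced it.
    return {"author": author, "field": field, "value": value,
            "provenance_run": run_id}
```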
Verify it's done
A test in which the agent attempts to write a policy-shaped instruction into memory is rejected at the broker. Memory entries in production are inspectable by author and provenance.
Pin and review tool registry entries as a supply chain
SHOULD
Why it matters
Tool descriptions, names, and argument schemas are part of the prompt. A registry that pulls tool metadata from a third-party source at runtime is a prompt-injection channel and a name-collision channel. MCP-style ecosystems make this attack class material.
How to implement
Pin tool registry entries by content hash. Treat additions and changes as code review with a security gate. Reject tool descriptions that contain instructional content directed at the model rather than schema documentation.
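Content-hash pinning can be sketched with canonical JSON; the registry shape is illustrative:

```python
import hashlib, json

def registry_hash(registry: dict) -> str:
    # Canonical JSON (sorted keys) so the hash is stable across key order.
    return hashlib.sha256(
        json.dumps(registry, sort_keys=True).encode()).hexdigest()

def verify_registry(deployed: dict, signed_hash: str) -> bool:
    # Any change to a name, schema, or description changes the hash.
    return registry_hash(deployed) == signed_hash

reviewed = {"lookup_order": {"args": {"order_id": "string"},
                             "description": "Fetch an order by id."}}
pin = registry_hash(reviewed)
```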
Verify it's done
The deployed tool registry hash matches the reviewed and signed registry artifact for the release. Diffs to tool descriptions are visible in change history.
Validation and human gates
2 checks
Validate every state-changing action with checks independent of the actor model
MUST
Why it matters
A model that proposes an action and a model that judges the action share failure modes — including the same prompt injection. Independent validators (schema, deterministic tests, simulators, separate-model critics with disjoint context) are how probabilistic output becomes an enforceable contract.
How to implement
For each mutating tool, define validators that run on the proposed arguments before commit: schema validation, business rules (amount caps, allowlists), simulation/dry-run where the tool supports it, and where applicable a critic model that sees only the action and policy, not the planner's narrative.
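A sketch of a validator chain over proposed arguments, with invented caps and ID shapes:

```python
# Validators run on proposed arguments before commit; each is independent of
# the actor model. Caps and shapes are illustrative.
def validate_refund(args: dict) -> list:
    return [
        ("schema",     isinstance(args.get("amount_cents"), int)),
        ("amount-cap", isinstance(args.get("amount_cents"), int)
                       and args["amount_cents"] <= 50_000),
        ("order-id",   str(args.get("order_id", "")).startswith("ord-")),
    ]

def commit_allowed(results: list) -> bool:
    # Every validator decision is logged before commit; one failure blocks.
    return all(ok for _, ok in results)
```

Keeping results as a list of named decisions, rather than a single boolean, is what lets the trace show which validator fired.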
Verify it's done
Every mutating tool call in a recorded trace shows at least one validator decision logged before commit. Disabling the actor model and replaying validators against a corpus of known-bad actions yields the expected denials.
Stage mutations as draft → policy check → commit, and require human approval for irreversible actions
MUST
Why it matters
Drafts are reversible; commits are not. Splitting the lifecycle gives policy and humans a place to intervene without halting the agent's reasoning. Irreversible actions (payments, deletes, external messages) deserve a human in the loop with the concrete action in front of them.
How to implement
The mutating-tool API accepts a draft, returns a draft ID, and exposes a separate commit call gated by the policy engine and, for irreversible classes, a human approver. The approval UI presents the resolved arguments, source documents with taint labels, and a stated rollback plan — not the agent's free-form justification.
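The draft → policy check → commit lifecycle can be sketched as follows; the action classes and function names are illustrative:

```python
import uuid

DRAFTS = {}
IRREVERSIBLE = {"payment", "delete", "external_message"}

def create_draft(action: dict) -> str:
    # Drafts are reversible; nothing external happens here.
    draft_id = str(uuid.uuid4())
    DRAFTS[draft_id] = {"action": action, "state": "draft"}
    return draft_id

def commit(draft_id: str, policy_decision: str, human_approved: bool = False) -> dict:
    draft = DRAFTS[draft_id]
    if policy_decision != "allow":
        raise PermissionError("policy denied commit")
    if draft["action"]["class"] in IRREVERSIBLE and not human_approved:
        raise PermissionError("human approval required for irreversible action")
    draft["state"] = "committed"
    return draft
```

Because commit is a separate call gated outside the model, the agent's tool surface cannot skip the draft state.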
Verify it's done
Trace inspection shows a draft event, a policy decision, and (for irreversible actions) a human approval event preceding every commit. Bypassing draft state is not possible from the agent's tool surface.
Observability and replay
2 checks
Log prompts, tool calls, taint labels, policy decisions, validator results, approvals, and side effects on a replayable trace
MUST
Why it matters
Incident response on a compound system without traces is forensic guesswork. The trace is also the only way to evaluate whether defenses are firing in production rather than only in CI.
How to implement
Emit structured events for prompt assembly (with segment labels), each tool call (arguments, response, taint), each policy decision (rule ID, outcome), each validator outcome, each approval, and each external side effect, keyed by run ID and node ID. Store with retention appropriate to the system's risk tier and make traces searchable by responders.
Verify it's done
A responder can reconstruct any production run end-to-end from the trace store within the retention window, including all defense decisions, without needing access to the live system.
Track cost, model calls, and tool calls per run and alert on budget breaches
SHOULD
Why it matters
Cost-exhaustion and denial-of-wallet attacks exploit recursive tool use, retry loops, and induced multi-agent debate. A budget alarm catches both attacks and quality regressions (cascading retries on a flaky tool).
How to implement
Per run, count model calls, tokens, tool calls, and wall-clock; compare against per-workflow budgets; emit a structured alert above thresholds and terminate runs above hard ceilings.
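A sketch of the comparison, assuming two-level budgets (an alert threshold and a hard ceiling); the counter names and numbers are illustrative:

```python
# Per-run counters vs per-workflow budgets: "alert" thresholds page on-call,
# "hard" ceilings terminate the run. Thresholds are illustrative.
def check_budget(counters: dict, budgets: dict):
    alerts = sorted(k for k, b in budgets.items()
                    if counters.get(k, 0) > b["alert"])
    terminate = any(counters.get(k, 0) > b["hard"]
                    for k, b in budgets.items())
    return ("terminate" if terminate else "continue", alerts)

budgets = {"model_calls": {"alert": 20, "hard": 50},
           "tool_calls":  {"alert": 30, "hard": 100}}
```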
Verify it's done
A synthetic run that exceeds the configured budget terminates and produces an alert routed to the on-call channel. Per-run cost dashboards exist and are reviewed.
Security testing in CI
3 checks
Maintain an indirect prompt-injection regression suite covering tool outputs, retrieval, memory, and handoffs
MUST
Why it matters
Without a regression suite, every prompt or graph change can silently regress defenses. The suite encodes the threat model as executable tests so that "did we break the trust labels" is a CI failure, not a postmortem finding.
How to implement
Build a corpus of injection-bearing inputs across the four channels (tool outputs, retrieved documents, memory entries, inter-node handoffs). Each test asserts a defense outcome — denial, escalation, or label propagation — not the absence of a behavior. Run on every change to prompts, tools, graph topology, or policy.
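One such test might look like the sketch below, where gate() is a stand-in for the deployed taint-aware broker; note that the assertion targets a defense outcome, not the absence of some model behavior:

```python
# Stand-in for the deployed broker; in CI this would be the real one.
def gate(tool: str, taint: set) -> str:
    mutating = {"send_email", "commit_refund"}
    return "escalate" if tool in mutating and taint else "allow"

def test_tool_output_injection_channel():
    # A poisoned tool output asks the agent to exfiltrate via email; the
    # broker must deny or escalate the resulting mutating call.
    decision = gate("send_email", taint={"tool-untrusted"})
    assert decision in {"deny", "escalate"}

test_tool_output_injection_channel()
```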
Verify it's done
CI fails on a deliberately weakened defense (e.g., disabling taint propagation) and passes on the current main branch. Coverage of the four channels is tracked.
Test poisoned retrieval and poisoned tool metadata as part of CI, evaluated alongside utility
SHOULD
Why it matters
Defenses that block attacks by killing utility are not defenses. Adversarial robustness and task success must be evaluated in the same harness so a regression in either is visible.
How to implement
Inject a small number of crafted documents into a staging retrieval index and a fixture tool registry; run the workflow against benign tasks and against tasks designed to trigger the poisoned content; report attack success rate and task success rate side by side. Block release on regressions in either.
Verify it's done
The release report shows both metrics with thresholds, and a release with degraded utility but improved adversarial robustness is flagged for review rather than auto-approved.
Run cost-exhaustion and denial-of-wallet tests
NICE
Why it matters
Budget controls are easy to misconfigure and easy to regress. A scheduled adversarial test verifies that hard ceilings actually halt runaway runs in the deployed configuration.
How to implement
Maintain a small set of adversarial inputs designed to induce retry loops, recursive tool calls, or planner/critic ping-pong. Run against staging on a schedule; assert that budgets terminate the run and the alert path fires.
Verify it's done
The most recent scheduled run shows budget-driven termination at the expected step count, and the alert channel received the corresponding notification.
Acceptance criteria
The system is hardened when a security reviewer can read the workflow graph, the tool registry, and the policy ruleset offline and predict what the system will and will not do under attack. Every mutating action in production traces shows a taint-aware policy decision, an independent validator outcome, and (for irreversible classes) a human approval, with all of these events keyed to a replayable run. The CI pipeline fails on regressions in the indirect-prompt-injection suite and in the poisoned-retrieval suite, and reports task utility alongside adversarial robustness on every release. Per-run cost and step budgets terminate runaway executions, and the on-call team has exercised trace replay on a real incident or drill within the last quarter. Per-tool sandboxing depth is covered separately by agent capability control; handoff and contamination concerns in multi-agent topologies are covered by multi-agent prompt-injection defense.