Applied Intelligence · Checklist · May 1, 2026 · Yellow — detail controls

Hardening Function Calling and Tool-Use Reliability for Production LLM Agents

Quick Answer

This checklist hardens LLM agents that call tools with side effects against the failure modes that produce wrong refunds, deleted production resources, duplicated mutations, and exfiltration via tool outputs. The audience is platform engineers, agent developers, and security reviewers who already understand tool use and want a control set beyond 'enable function calling.' Run it per release on every workflow that mutates state, and quarterly across the agent fleet. Some reproduction details are withheld pending vendor coordination.

A tool call is a distributed-systems boundary, not a model feature. This checklist hardens the path from planner to executor — schemas, tool design, state, idempotency, observation typing, and evals — so that no single layer (and especially not the model alone) is asked to carry reliability. It targets platform engineers, agent developers, and security reviewers shipping agents that mutate real state. For threat-model background, see what tool-use reliability actually means. Some reproduction details for recent excessive-agency incidents are withheld; this checklist describes defenses, not attack chains.

Checks: 20 total — 8 MUST, 9 SHOULD, 3 NICE

How to use this checklist

Run it per release on every workflow that touches production state, and once per quarter across the agent fleet. Each check has an owner — typically the agent platform team for runtime controls and the workflow owner for tool-surface controls. The check set is derived from the tool-use reliability stack paper; "done" means every MUST is verifiable in audit, every SHOULD has either an implementation or a documented exception, and the acceptance criteria below are met under adversarial eval.

Schema and decoding enforcement

4 checks

Enforce strict schemas on every tool call

MUST

Why it matters

Free-form JSON parsing is a known source of silent corruption — wrong types, missing required fields, hallucinated tool names. Strict schema enforcement at decode time eliminates a whole class of bugs before they reach the executor.

How to implement

If the provider exposes a strict / Structured Outputs mode, enable it on every function-calling endpoint. For self-hosted models, use a constrained decoder (Guidance, Outlines, XGrammar, llama.cpp grammars). Compile and pin every tool schema; do not generate schemas at runtime from natural language.

Verify it's done

Construct a synthetic prompt that should produce a schema-violating call. Confirm the runtime rejects it before dispatch and emits a structured decode error rather than a malformed tool invocation.
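A pinned schema looks like the sketch below (all names are illustrative, not from the source). Closing the object with `additionalProperties: false` and an explicit `required` list is what gives strict mode something to enforce:

```python
# Illustrative pinned tool schema: compiled once, checked into the repo,
# never generated at runtime from natural language.
REFUND_DRAFT_SCHEMA = {
    "name": "create_refund_draft",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "pattern": "^ord_[a-z0-9]{8}$"},
            "reason": {"type": "string", "enum": ["damaged", "late", "duplicate"]},
            "amount_cents": {"type": "integer", "minimum": 1},
        },
        # Closing the object is what makes strict mode strict: no extra
        # fields, no silently-missing required fields.
        "required": ["order_id", "reason", "amount_cents"],
        "additionalProperties": False,
    },
}
```

Under a provider's strict mode or a constrained decoder, this is the artifact you compile and pin; a schema change then becomes a reviewable diff rather than a runtime surprise.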

Validate every decoded tool-call object before dispatch

SHOULD

Why it matters

Even with a strict decoder upstream, provider feature flags, parallel-call paths, and unsupported schema features can silently fall back to permissive decoding. The validator is the last deterministic checkpoint before side effects.

How to implement

Run a deterministic JSON Schema validator on the decoded object inside the runtime, after the model returns and before the dispatcher executes. Treat the validator's verdict — not the model's claim — as the source of truth.

Verify it's done

Inject a tool-call object that uses a schema feature your decoder does not support. Confirm the runtime rejects the call rather than passing it through unchecked.
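A minimal sketch of that last deterministic checkpoint, using only the standard library (in production, run a full JSON Schema validator; the schema and field names here are assumptions for illustration):

```python
PARAMS = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount_cents": {"type": "integer"},
    },
    "required": ["order_id", "amount_cents"],
    "additionalProperties": False,
}

def validate_tool_call(args: dict, params_schema: dict) -> list[str]:
    """Deterministic pre-dispatch check; an empty list means dispatchable."""
    errors = []
    props = params_schema["properties"]
    for field in params_schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field: {field}")
    if not params_schema.get("additionalProperties", True):
        for field in args:
            if field not in props:
                errors.append(f"unexpected field: {field}")
    type_map = {"string": str, "integer": int, "number": (int, float), "boolean": bool}
    for field, value in args.items():
        spec = props.get(field)
        if spec is None:
            continue
        # bool is a subclass of int in Python, so check it explicitly.
        if spec["type"] == "integer" and isinstance(value, bool):
            errors.append(f"{field}: expected integer, got boolean")
        elif not isinstance(value, type_map[spec["type"]]):
            errors.append(f"{field}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{field}: {value!r} not in enum")
    return errors
```

The key property is that the dispatcher consults this verdict, not the model's own claim that the call is well-formed.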

Test schema coverage against your decoder

NICE

Why it matters

JSONSchemaBench shows that real-world JSON Schema coverage varies by roughly 2× across constrained-decoding frameworks. A schema that "looks fine" may be silently downgraded.

How to implement

For each production schema, compile it under your decoder and run a positive sample (valid call accepted) and a negative sample (invalid call rejected). Track schema coverage as a CI gate for any schema change.

Verify it's done

A coverage report exists in CI listing each tool schema and its compile / accept / reject status.

Make refusal and clarification first-class schema outputs

SHOULD

Why it matters

If the only schema-valid output is a tool call, the model will fabricate one under ambiguity rather than ask. This is a primary source of wrong-parameter errors in τ-bench-style settings.

How to implement

Add need_clarification and refuse shapes to the response union for every workflow. Train or prompt the model to use them on missing-parameter, ambiguous-entity, and out-of-policy inputs.

Verify it's done

Run an ambiguous input through the agent and confirm it can return a need_clarification shape instead of guessing a tool call.
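One way to express the response union is a `oneOf` over three closed shapes (a sketch with assumed field names, not a prescribed format):

```python
# The planner's response schema is a union, not a bare tool call: under
# ambiguity the schema-valid path is to ask, not to guess.
RESPONSE_UNION = {
    "oneOf": [
        {
            "type": "object",
            "properties": {
                "kind": {"const": "tool_call"},
                "tool": {"type": "string"},
                "arguments": {"type": "object"},
            },
            "required": ["kind", "tool", "arguments"],
            "additionalProperties": False,
        },
        {
            "type": "object",
            "properties": {
                "kind": {"const": "need_clarification"},
                "question": {"type": "string"},
                "missing_fields": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["kind", "question"],
            "additionalProperties": False,
        },
        {
            "type": "object",
            "properties": {
                "kind": {"const": "refuse"},
                "policy_reason": {"type": "string"},
            },
            "required": ["kind", "policy_reason"],
            "additionalProperties": False,
        },
    ]
}
```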

Tool surface design

4 checks

Split high-impact tools into read, draft, validate, and commit

MUST

Why it matters

A single tool that both decides and acts on production state has no checkpoint between model intent and irreversible side effect. This is the structural pattern behind recent AI-mediated production-deletion incidents.

How to implement

Replace do_thing with get_*, create_*_draft, validate_*, and submit_*. Make submit_* accept only an opaque draft ID plus a separately minted approval token, never raw parameters.

Verify it's done

Audit the tool catalog. No mutating tool both takes free-form parameters and produces an irreversible side effect in a single call.
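The commit half of a split tool can be sketched like this (hypothetical function and store names; the stores stand in for a durable database):

```python
def submit_refund(draft_id: str, approval_token: str,
                  drafts: dict, approvals: set) -> dict:
    """Commit step of a split tool: accepts only an opaque draft ID plus a
    separately minted approval token — never raw parameters, so there is
    nothing at commit time for the model to hallucinate."""
    if draft_id not in drafts:
        raise KeyError(f"unknown draft: {draft_id}")
    if (draft_id, approval_token) not in approvals:
        raise PermissionError("no approval token matches this draft")
    # The committed values are exactly the validated draft's values.
    return dict(drafts[draft_id], status="committed")
```

Everything decision-shaped happened earlier, at `create_*_draft` and `validate_*` time, where it could still be inspected and rejected.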

Use opaque IDs and narrow enums; never free-form entity strings on mutating tools

SHOULD

Why it matters

Free-form customer, order, or reason parameters invite the model to hallucinate plausible-but-wrong values. Empirical studies of function-calling errors put parameter-value mismatch at 70–90% of remaining failures once tool selection is correct.

How to implement

Have a retrieval step return candidate IDs; constrain the mutating tool's parameters to those IDs and to a closed enum of reasons / categories. Use string types only for genuinely free-form fields like a user-authored note.

Verify it's done

Review schemas for every mutating tool. No free-form string parameter resolves to an entity, action, or reason code.
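A sketch of the binding step, assuming a retrieval stage that returns candidate IDs (schema and parameter names are illustrative):

```python
import copy

BASE_PARAMS = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "reason": {"type": "string", "enum": ["damaged", "late", "duplicate"]},
    },
    "required": ["order_id", "reason"],
    "additionalProperties": False,
}

def bind_candidates(base_params: dict, candidate_ids: list[str]) -> dict:
    """Narrow the mutating tool's entity parameter to the IDs retrieval
    actually returned, so the model cannot invent an order_id."""
    bound = copy.deepcopy(base_params)
    bound["properties"]["order_id"] = {"type": "string", "enum": candidate_ids}
    return bound
```

The per-call schema the decoder enforces is `bound`, not `BASE_PARAMS`: a hallucinated ID is now a decode error, not a wrong refund.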

Shortlist tools per task via a router

SHOULD

Why it matters

Adding semantically related tools measurably degrades selection accuracy. The full tool belt also expands the attack surface for indirect prompt injection.

How to implement

Implement a tool router that filters the candidate set by user permissions, task domain, environment, and side-effect class before the planner sees it. Keep the per-call belt small.

Verify it's done

For a representative low-authority request, log the candidate tool set the planner saw and confirm it excludes high-impact and out-of-domain tools.
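The router is deterministic code, not a model call — a sketch with assumed catalog fields:

```python
def route_tools(catalog: list[dict], user_perms: set, domain: str, env: str,
                max_belt: int = 8) -> list[dict]:
    """Pre-filter the candidate set before the planner sees it: the planner
    never learns about tools the caller cannot use, that belong to another
    domain or environment, or that are destructive."""
    belt = [t for t in catalog
            if t["domain"] == domain
            and t["env"] == env
            and t["required_perm"] in user_perms
            and t["side_effect"] != "destructive"]
    return belt[:max_belt]
```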

Keep destructive tools off the default belt

NICE

Why it matters

A destructive tool that is never directly callable cannot be hijacked into a single-step deletion. Routing through a deletion-request tool adds a deterministic approval step the model cannot bypass.

How to implement

Replace direct delete_* exposure with create_deletion_request plus an out-of-band approval flow. Do not register delete_* on default agent loops.

Verify it's done

No agent profile lists a raw destructive tool in its default belt; the only path to destruction is via an approved request object.

State, idempotency, and authority

4 checks

Scope credentials by environment

MUST

Why it matters

Prompts that say "never touch production" are not access controls. The Replit and PocketOS/Railway production-deletion incidents are archetypal examples of agents reaching environments they had credentials for but were instructed to avoid.

How to implement

Mint per-task credentials scoped to a single environment via the platform's IAM. Staging tasks should hold staging-only credentials; the cloud or database layer — not the agent — must enforce the boundary. See agent capability control for the broader pattern.

Verify it's done

Attempt a production-scoped action with a staging-scoped credential. The cloud/database layer rejects it; the rejection is visible in the platform audit log.

Require deterministic policy approval for destructive actions

MUST

Why it matters

A model-generated "are you sure?" plus a model-generated "yes" is not approval; it is two model turns. Destructive action without an out-of-context gate is the structural definition of excessive agency.

How to implement

Mint approval tokens outside the model context, scoped to a specific validated action object, time-bounded, single-use. The executor accepts the action only with a matching token; the model cannot mint or forward tokens.

Verify it's done

Inspect audit logs for a sample of destructive operations. Every commit has a matching approval-token issuance event from outside the model context.

Accept and enforce idempotency keys on every mutating tool

MUST

Why it matters

Agent loops retry. Webhooks duplicate. Without idempotency keys, a retry produces a duplicate refund, email, or deployment.

How to implement

Require an idempotency key on every mutating endpoint. Persist (key, result) and return the original result on replay. Generate keys per planned action, not per HTTP attempt.

Verify it's done

Replay the same tool call with the same key and confirm a single side effect — one row inserted, one email sent, one deployment created.
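The server-side contract can be sketched in a few lines (an in-memory dict stands in for the durable `(key, result)` store):

```python
class IdempotentExecutor:
    """Persist (tool, key) -> result; a retry with the same key replays the
    stored result instead of re-running the side effect."""

    def __init__(self):
        self._results: dict[tuple[str, str], dict] = {}
        self.side_effects = 0  # counts real mutations, for demonstration

    def execute(self, tool: str, idempotency_key: str, args: dict) -> dict:
        if (tool, idempotency_key) in self._results:
            return self._results[(tool, idempotency_key)]  # replay: no new effect
        self.side_effects += 1  # stand-in for the actual mutation
        result = {"tool": tool, "key": idempotency_key, "args": args}
        self._results[(tool, idempotency_key)] = result
        return result
```

Generating the key once per planned action (and reusing it across HTTP retries) is what makes the agent loop's retries harmless.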

Provide dry-run and rollback paths for high-impact tools

SHOULD

Why it matters

A dry-run lets the planner — and a human reviewer — inspect the exact result shape before committing. A rollback path bounds blast radius when something does go wrong.

How to implement

For each high-impact tool, expose a dry-run mode that returns the same response schema minus side effects, and either a transactional commit or an explicit reverse operation.

Verify it's done

Run each high-impact tool in dry-run mode in CI and assert the response shape matches the live shape minus mutation fields.

Observations and taint

4 checks

Return typed, minimized observations from every tool

SHOULD

Why it matters

Dumping raw HTML, email bodies, or log streams into the planner context inflates token use and gives indirect-injection content a foothold. A structured projection makes the planner work with facts, not text.

How to implement

Define an output schema for every tool. Return only the fields the planner needs; truncate or summarize free-form content into structured fields where possible.

Verify it's done

Sample production tool responses in traces. Each is a schema-typed object, not an unbounded string blob.
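A projection for an email tool might look like this (field names and bounds are illustrative):

```python
def project_email(raw: dict) -> dict:
    """Project a raw email into the typed, bounded fields the planner needs;
    no unbounded body is dumped into context."""
    return {
        "from": raw["from"],
        "subject": raw["subject"][:200],
        "received_at": raw["received_at"],
        "body_excerpt": raw["body"][:500],   # bounded excerpt, not the raw body
        "attachment_count": len(raw.get("attachments", [])),
    }
```

The bounded excerpt both caps token cost and shrinks the surface an injected instruction inside the body can occupy.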

Tag every observation with provenance and trust level

MUST

Why it matters

Without provenance, the planner cannot distinguish a developer-authored system instruction from an attacker-authored field in a retrieved document. Source-sink reasoning requires a source label.

How to implement

Tag every context entry with one of trusted_system | developer | user | internal_retrieved | external_untrusted. Propagate the tag through tool outputs that wrap or quote other content.

Verify it's done

Inspect a sampled trace. Every context entry carries a provenance tag, including fields nested inside tool results.
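Taint propagation through wrapping can be sketched as taking the least-trusted source in the chain (a sketch; the five-level taxonomy comes from the check above, the helper names are assumptions):

```python
TRUST_ORDER = ["trusted_system", "developer", "user",
               "internal_retrieved", "external_untrusted"]

def tag_entry(content: str, source: str) -> dict:
    """Attach a provenance tag to a context entry."""
    if source not in TRUST_ORDER:
        raise ValueError(f"unknown trust level: {source}")
    return {"content": content, "source": source}

def wrapped_source(inner_source: str, wrapper_source: str) -> str:
    """A tool output that quotes or wraps other content inherits the
    least-trusted source in the chain — a trusted tool quoting an
    untrusted document stays untrusted."""
    return max(inner_source, wrapper_source, key=TRUST_ORDER.index)
```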

Block untrusted content from authorizing tool calls or satisfying approvals

MUST

Why it matters

This is the source-sink rule that makes indirect prompt injection survivable: untrusted-tainted text must not be able to mint approval tokens, satisfy confirmation gates, or trigger external-transmission sinks on its own.

How to implement

In the policy gate, reject any action whose authorization chain depends on an external_untrusted source. Require explicit, separately-sourced re-confirmation for any tool call influenced by such content.

Verify it's done

Run an indirect-injection test where a retrieved document contains instructions to email or exfiltrate data. The policy gate blocks the resulting call before dispatch.

Treat external-transmission tools as DLP sinks

SHOULD

Why it matters

Email, HTTP, browser navigation, chat post, and file share are exfiltration sinks. Sensitive-tagged data flowing into one of them is the worst plausible outcome of an indirect-injection compromise.

How to implement

Classify outbound tools as sinks and block sensitive-tagged data from reaching them without explicit, out-of-context user confirmation. Pair with the controls in tool-using agent hardening.

Verify it's done

Attempt a sensitive→external flow in a test harness. The DLP layer blocks or escalates before the sink executes.
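The sink gate itself is a small deterministic decision, assuming payloads carry sensitivity tags from upstream classification (tool names here are illustrative):

```python
SINK_TOOLS = {"send_email", "http_post", "post_chat_message",
              "share_file", "navigate_browser"}  # illustrative sink catalog

def gate_sink(tool: str, payload_tags: set, out_of_context_confirmed: bool) -> str:
    """DLP decision for outbound tools: sensitive-tagged data reaches a sink
    only with explicit, out-of-context user confirmation."""
    if tool not in SINK_TOOLS:
        return "allow"
    if "sensitive" in payload_tags and not out_of_context_confirmed:
        return "block"  # or escalate to a human reviewer
    return "allow"
```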

Evaluation, audit, and operations

4 checks

Evaluate workflows on final state, not call similarity

MUST

Why it matters

An agent can produce the right sequence of calls and still leave the database in the wrong state — or vice versa. Final-state evaluation is the only signal that matches what users care about.

How to implement

Adopt τ-bench's pattern: define a goal state, run the agent against a sandboxed environment, diff the final state against the goal. Score on diff equivalence.

Verify it's done

The eval harness fails a run that hits the right call sequence but leaves the wrong final state, and vice versa.
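The scoring core of a final-state harness reduces to a state diff (a sketch in the τ-bench spirit; state representation and the `ignore` set are assumptions):

```python
def final_state_passes(final: dict, goal: dict,
                       ignore: frozenset = frozenset()) -> bool:
    """A run passes iff the sandbox's final state matches the goal state on
    every field outside `ignore`, regardless of the call sequence taken."""
    keys = (set(final) | set(goal)) - ignore
    return all(final.get(k) == goal.get(k) for k in keys)
```

Fields like timestamps go in `ignore`; everything users actually care about — balances, statuses, row counts — is diffed directly.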

Include indirect-prompt-injection cases in security evals

SHOULD

Why it matters

AgentDojo and EchoLeak show that production-shaped agents remain vulnerable to indirect injection at high rates. Without injection cases, your eval suite measures cooperation, not robustness.

How to implement

Add AgentDojo-style cases where attacker-controlled content lives in tool outputs — emails, tickets, logs, retrieved documents. Report attack-success-rate alongside utility-under-attack so a fix that drops both is visible.

Verify it's done

The eval suite contains injection cases for every external-content tool, and CI fails on attack-success-rate regressions.

Log replayable traces and convert incidents into regression evals

SHOULD

Why it matters

A trace that can be replayed end-to-end turns every incident into a permanent test case and every change into a measurable delta.

How to implement

Capture prompt, tool candidate set, decoded tool-call object, validator decision, policy decision, executor result, and observation for every run. Make the trace deterministically replayable against a recorded sandbox state.

Verify it's done

Pick a sampled trace. Replay it end-to-end and confirm the replay reaches the same final state.

Run chaos drills against the agent runtime

NICE

Why it matters

Real production fails partially: credentials get revoked mid-task, APIs return half-results, webhooks duplicate, approval tokens expire. Agents that have never seen these conditions handle them creatively, which is bad.

How to implement

Quarterly, inject failures into a staging fleet — revoked credentials, partial 5xx, expired approval tokens, duplicate webhooks, malicious-looking log content. Record outcomes and convert any unsafe behavior into a regression eval.

Verify it's done

A quarterly drill record exists with failure modes injected, agent behaviors observed, and follow-up evals filed.

Acceptance criteria

The checklist is fully implemented when every MUST is enforced by a runtime control rather than a prompt instruction, and every SHOULD has either an implementation or a documented exception with a named owner. Final-state evals pass at a defined threshold for each workflow class, and pass^k is tracked for any agent with autonomous mutation authority. Indirect-prompt-injection cases live in CI, and attack-success-rate is reported alongside utility under attack. Audit logs show that every destructive commit carries an out-of-context approval token, every mutating call carries an idempotency key, and every observation carries a provenance tag. Pair these controls with the prompt-injection architecture in tool-using agent hardening and the capability bounds in agent capability control; reliability and security on tool-using agents are the same problem viewed from two angles.
