AI Security · Checklist · May 1, 2026 · Yellow — detail controls

Hardening Agentic Patch Validation in Automated Vulnerability Repair Pipelines

Quick Answer

Use this to harden an agentic automated vulnerability repair (AVR) pipeline so a model-authored patch cannot pass acceptance by suppressing the proof-of-concept, editing tests, or overfitting the validator. The audience is security engineers, AppSec leads, and AI-platform owners running AVR. Apply once per pipeline, then re-review whenever validation assets, the agent's tool surface, or risk routing change.

This checklist hardens the validation layer of an agentic automated vulnerability repair (AVR) pipeline against the failure mode the literature consistently surfaces: patches that pass build-and-crash gates by suppressing the proof-of-concept, swallowing input, editing tests, or overfitting whatever the validator happens to check. It is derived from the source paper on agentic patch validation and applies the staged-evidence architecture as executable controls. Some reproduction-grade detail (validator-bypass prompts, sanitizer-suppression recipes) is withheld; the controls themselves are stated in full.

Checks: 19 total (12 MUST, 7 SHOULD)

How to use this checklist

Run it once per AVR pipeline before any agent-authored patch is allowed to merge, then re-run whenever validation assets, agent tool scope, or risk routing change. Owner is the security engineer or AppSec lead accountable for the pipeline, with sign-off from the AI-platform owner. "Done" means every MUST is implemented and verifiable by an external reviewer reading logs and configuration, not by re-running the agent. Pair this with agent capability control and tool-use reliability hardening for the agent's surrounding controls.

Patch provenance and workspace integrity

3 checks

Apply the candidate as a diff to a clean checkout in an isolated container

MUST

Why it matters

If the validator runs inside the same workspace the agent mutated, any tool-driven side effect (modified test fixtures, regenerated lockfiles, cached build artifacts) silently joins the patch. PVBench and AutoPatchBench both observe agents passing basic checks via state the patch itself does not contain.

How to implement

Treat the agent's output as a unified diff only. Apply it with git apply (or equivalent) to a fresh clone at a pinned base commit, inside a network-isolated container with deterministic compiler and dependency versions. Discard the agent's working tree.
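A minimal sketch of the clean-apply step, assuming git and Docker are available. The image name `validator@…` and the entrypoint `/src/ci/validate.sh` are illustrative, not part of any real pipeline:

```python
import subprocess
import tempfile

def clean_apply(repo_url: str, base_commit: str, diff_path: str) -> str:
    """Apply an agent-authored patch as a unified diff to a fresh clone.

    Only the diff crosses the trust boundary; the agent's own working
    tree is discarded. Returns the prepared workspace path.
    """
    workdir = tempfile.mkdtemp(prefix="avr-validate-")
    subprocess.run(["git", "clone", repo_url, workdir], check=True)
    subprocess.run(["git", "checkout", base_commit], cwd=workdir, check=True)
    # --check first so a fuzzy or partial apply fails loudly.
    subprocess.run(["git", "apply", "--check", diff_path], cwd=workdir, check=True)
    subprocess.run(["git", "apply", diff_path], cwd=workdir, check=True)
    return workdir

def validation_container_cmd(workdir: str, image_digest: str) -> list:
    """Docker invocation for a network-isolated, read-only validation run."""
    return [
        "docker", "run", "--rm",
        "--network", "none",           # no egress from the validator
        "-v", f"{workdir}:/src:ro",    # workspace mounted read-only
        f"validator@{image_digest}",   # image pinned by digest, not tag
        "/src/ci/validate.sh",
    ]
```

Pinning by digest rather than tag is what lets a reviewer re-derive the exact environment later.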

Verify it's done

A reviewer can re-derive the validation environment from the recorded base commit, container image digest, and diff alone, with no reference to the agent's runtime state.

Treat tests, harnesses, build, CI, sanitizer, and lock files as privileged assets

MUST

Why it matters

A repair agent that can edit the validator can pass it. The paper's check-circumvention pattern — assertion removal, sanitizer-flag changes, harness rewrites — depends on this capability.

How to implement

Enforce path-based write policy on the agent: source tree only. Tests, fuzz harnesses, build scripts, CI configuration, sanitizer flags, dependency locks, generated code, Dockerfiles, static-analysis rules, and benchmark scripts require a separate change request reviewed independently of the source patch. See agent capability control for the capability-scoping pattern.
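A write-policy sketch using fnmatch-style globs; the glob list is illustrative and must be adapted to the repository layout:

```python
from fnmatch import fnmatch

# Hypothetical privileged globs -- adjust per repository.
# Note: fnmatch's "*" matches across path separators.
PRIVILEGED_GLOBS = [
    "tests/*", "test/*", "*_test.*", "fuzz/*",
    "Makefile", "CMakeLists.txt", "BUILD*", ".github/*",
    "Dockerfile*", "*.lock", "package-lock.json",
    "sanitizers/*", ".clang-tidy", "benchmarks/*",
]

def privileged(path: str) -> bool:
    return any(fnmatch(path, g) for g in PRIVILEGED_GLOBS)

def enforce_write_policy(changed_paths):
    """Split a candidate diff into source edits and privileged-asset edits.

    Privileged edits never merge with the source patch; they go to a
    separately reviewed change request.
    """
    source = [p for p in changed_paths if not privileged(p)]
    priv = [p for p in changed_paths if privileged(p)]
    return source, priv
```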

Verify it's done

Diff inspection on a sample of accepted patches shows changes only under the source paths; any validation-asset diff is in a separately reviewed PR with a human approver.

Record full agent trajectory and a provenance attestation

SHOULD

Why it matters

Validator failure modes are often only visible in the trajectory — for example, a patch that "succeeded" because the agent silently rebuilt the test corpus. Without provenance, post-mortems are unactionable.

How to implement

Persist prompts, tool calls, tool outputs, container image digests, compiler versions, dependency versions, random seeds, model identifiers, and patch diffs. Emit an SBOM delta and a signed attestation linking trajectory to the candidate diff.
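A sketch of the attestation payload, linking trajectory and diff by content hash; field names are illustrative and actual signing (sigstore, HSM, or existing artifact-signing machinery) is elided:

```python
import hashlib
import json
from datetime import datetime, timezone

def attest(trajectory: dict, diff_text: str, env: dict) -> dict:
    """Link an agent trajectory to a candidate diff by content hash.

    `env` carries the reproducibility fields: image digest, compiler and
    dependency versions, random seeds, model identifier.
    """
    return {
        "diff_sha256": hashlib.sha256(diff_text.encode()).hexdigest(),
        "trajectory_sha256": hashlib.sha256(
            json.dumps(trajectory, sort_keys=True).encode()
        ).hexdigest(),
        "environment": env,
        "created": datetime.now(timezone.utc).isoformat(),
    }
```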

Verify it's done

For any merged patch, an auditor can reconstruct the full chain from prompt to merged commit in one query.

Build and exploit-reproduction gates

3 checks

Build a release, debug-with-assertions, and sanitized matrix

MUST

Why it matters

A single release build hides undefined behavior, latent assertion violations, and memory errors that the patch did not actually fix. The AutoPatchBench results — ~60% pass at build-and-crash collapsing to 5–11% under deeper checks — are largely explained by missing build configurations.

How to implement

For each candidate, build at minimum: release, debug with assertions enabled, and one or more sanitized configurations (ASan, UBSan, MSan, TSan as the language and platform allow).

Verify it's done

CI logs for every candidate show all matrix configurations attempted, with explicit pass/fail per configuration; missing configurations block acceptance.

Confirm the PoC reproduces on the vulnerable baseline before testing the candidate

MUST

Why it matters

A reproducer that no longer triggers on the unpatched revision tells you nothing about the candidate. Silently broken reproducers are a common failure mode in benchmark-derived pipelines.

How to implement

Run the PoC against the pinned vulnerable revision with the same harness, sanitizer flags, and timeouts you will use against the candidate. Capture exit codes, stderr, timeouts, and resource consumption. Abort validation if the baseline does not reproduce.
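A sketch of the baseline gate, assuming a sanitizer-marker heuristic for the failure signal; the marker list and exception name are illustrative:

```python
# Illustrative failure markers -- extend per sanitizer and platform.
SANITIZER_MARKERS = ("AddressSanitizer", "UndefinedBehaviorSanitizer", "SEGV")

class BaselineNotReproduced(RuntimeError):
    pass

def reproduces(exit_code: int, stderr: str) -> bool:
    """Did this run exhibit the documented failure signal?"""
    return exit_code != 0 and any(m in stderr for m in SANITIZER_MARKERS)

def gate_baseline(run_poc):
    """Abort validation unless the PoC still fires on the unpatched revision.

    `run_poc` is a callable returning (exit_code, stderr) from executing
    the PoC against the pinned vulnerable revision with the same harness,
    flags, and timeouts later used against the candidate.
    """
    code, err = run_poc()
    if not reproduces(code, err):
        raise BaselineNotReproduced(
            f"PoC did not reproduce on baseline (exit={code}); "
            "reproducer or environment has drifted"
        )
```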

Verify it's done

Every acceptance log records a baseline-reproduction step with the documented failure signal, dated to the same run as the candidate evaluation.

Verify non-reproduction without harness, input-path, or flag drift

MUST

Why it matters

The simplest false positive is a patch that "fixes" the bug by altering the path the input takes — early return, input rejection, swallowed exception. Non-crash is necessary but not sufficient.

How to implement

Diff the harness, sanitizer configuration, and input-handling path between baseline and candidate runs and require equivalence. Reject patches whose non-reproduction depends on changes outside the source diff or on changes to the input-acceptance contract.
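The equivalence requirement can be sketched as a byte-level comparison of the validation artifacts between the two runs; the artifact-map shape is an assumption:

```python
import hashlib

def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def check_no_drift(baseline: dict, candidate: dict) -> list:
    """Compare harness bytes, sanitizer flags, and input blobs between runs.

    Both arguments map artifact name -> bytes. Returns the list of drifted
    artifacts; an empty list means non-reproduction can be attributed to
    the source diff alone. Any drift routes to human review.
    """
    drifted = []
    for name in sorted(set(baseline) | set(candidate)):
        if _digest(baseline.get(name, b"")) != _digest(candidate.get(name, b"")):
            drifted.append(name)
    return drifted
```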

Verify it's done

Acceptance log shows byte-equal harness and flags between baseline and candidate; any divergence is flagged for human review.

Behavior-witness and regression testing

3 checks

Require at least one PoC+ behavior-witness test per accepted patch

MUST

Why it matters

PVBench's central finding is that PoC non-reproduction has a 42.3% false-discovery rate against tests that encode the expected post-patch behavior. Without a behavior witness, "fixed" means only "no longer crashes on this specific input."

How to implement

Generate or hand-author an output, intermediate-state, or self-checking test that asserts the correct post-patch behavior on the PoC input and on at least one neighbor input. Require that the test fails on the vulnerable revision for the right reason — wrong output, missing invariant — and passes on the candidate.
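The fail-for-the-right-reason requirement can be sketched as an acceptance predicate over the two test runs; the result-dict shape and reason labels are assumptions:

```python
# A crash or timeout on baseline is deliberately NOT accepted: that only
# re-proves non-crash, not correct post-patch behavior.
ACCEPTED_FAILURE_REASONS = {"wrong_output", "missing_invariant"}

def witness_ok(baseline_result: dict, candidate_result: dict) -> bool:
    """Accept a behavior-witness test only if it fails on the vulnerable
    revision for a semantic reason and passes on the candidate.

    Each result dict carries {"passed": bool, "reason": str or None}.
    """
    return (
        not baseline_result["passed"]
        and baseline_result.get("reason") in ACCEPTED_FAILURE_REASONS
        and candidate_result["passed"]
    )
```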

Verify it's done

Every merged patch is accompanied by a PoC+ test artifact with a recorded fail-on-baseline / pass-on-candidate pair.

Gate on changed-line and differential coverage of the patch region

MUST

Why it matters

Patches whose changed branches are not exercised by validation are accepted on faith. CodeRover-S shows similarity metrics are uncorrelated with plausibility (point-biserial −0.008); coverage of the change is one of the few cheap signals that is not.

How to implement

Collect line and branch coverage for the diff region during validation. Reject if changed branches are unexercised. Where feasible for high-risk components, run mutation testing on the changed region and require a minimum mutation score.
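A minimal sketch of the gate, assuming branch identifiers from the diff region and a hit-count map from the coverage tool:

```python
def coverage_gate(changed_branches, hit_counts) -> list:
    """Return the changed branches validation never exercised.

    `changed_branches` is an iterable of branch ids from the diff region;
    `hit_counts` maps branch id -> execution count from the validation run.
    Any uncovered branch blocks acceptance with an explicit log line.
    """
    uncovered = [b for b in changed_branches if hit_counts.get(b, 0) == 0]
    for b in uncovered:
        print(f"REJECT: changed branch {b} unexercised by validation")
    return uncovered
```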

Verify it's done

Coverage report per candidate shows ≥1 hit on every changed branch; uncovered branches block acceptance with an explicit log line.

Classify the PoC input and assert the correct behavior class

SHOULD

Why it matters

Conflating "should crash" with "should be rejected" with "should be accepted and processed correctly" is how crash-suppression patches pass. The classification is the contract the patch has to honor.

How to implement

For each PoC, label the input as valid, invalid-but-accepted, invalid-and-rejected, or security-prohibited, and assert the class-appropriate behavior — correct output, structured rejection, denial — not non-crash.
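The class-to-assertion contract can be sketched as a lookup table; the assertion-family names are illustrative:

```python
from enum import Enum

class PocClass(Enum):
    VALID = "valid"
    INVALID_ACCEPTED = "invalid-but-accepted"
    INVALID_REJECTED = "invalid-and-rejected"
    PROHIBITED = "security-prohibited"

# "no_crash" is deliberately absent from every set: non-crash alone
# never satisfies the behavior contract for any class.
ALLOWED_ASSERTIONS = {
    PocClass.VALID: {"correct_output", "state_invariant"},
    PocClass.INVALID_ACCEPTED: {"correct_output", "state_invariant"},
    PocClass.INVALID_REJECTED: {"structured_rejection"},
    PocClass.PROHIBITED: {"explicit_denial"},
}

def assertion_matches(poc_class: PocClass, assertion_family: str) -> bool:
    """Reject candidates whose assertion family does not honor the class."""
    return assertion_family in ALLOWED_ASSERTIONS[poc_class]
```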

Verify it's done

Acceptance metadata includes the PoC input classification and the assertion family used; mismatches between class and assertion type are rejected.

Fuzz, differential, and variant analysis

4 checks

Run a validation fuzz campaign on sanitized builds, seeded with the PoC and project corpus

MUST

Why it matters

Fuzzing is where AutoPatchBench's plausible patches collapsed from ~60% to 5–11%. It is the cheapest control with the largest precision gain.

How to implement

Seed with the original PoC and the project's regression corpus. Run on sanitized builds with a time budget proportional to risk: minutes for triage routing, hours for release candidates, continuous post-merge for high-criticality components.

Verify it's done

Per-candidate fuzz logs record seed corpus, sanitizer mode, time budget, and outcome; no merge proceeds without a campaign result attached.

Treat fuzz output as multi-signal — sanitizer findings, timeouts, coverage collapse, and grammar drift

SHOULD

Why it matters

A patch that suddenly rejects the entire input grammar will produce zero crashes and full pass — and zero coverage of the parser. Crashes alone are an inadequate fuzz oracle.

How to implement

Compare candidate-run fuzz coverage to baseline-run coverage; alert on collapse. Compare accept/reject ratios on a labeled grammar sample; alert on drift. Treat new sanitizer findings and new timeout clusters as rejection signals.
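The four signals can be sketched as one verdict function; the threshold values and run-dict fields are illustrative placeholders, to be tuned per component:

```python
def fuzz_verdict(baseline: dict, candidate: dict,
                 coverage_drop_pct: float = 20.0,
                 accept_drift_pct: float = 10.0) -> list:
    """Evaluate the four fuzz signals; any returned signal rejects the patch.

    Each run dict carries: new_findings, new_timeout_clusters (ints),
    edge_coverage (int), accept_ratio (0..1 on a labeled grammar sample).
    """
    signals = []
    if candidate["new_findings"] > 0:
        signals.append("new sanitizer findings")
    if candidate["new_timeout_clusters"] > 0:
        signals.append("new timeout clusters")
    drop = 100.0 * (baseline["edge_coverage"] - candidate["edge_coverage"]) \
           / baseline["edge_coverage"]
    if drop > coverage_drop_pct:
        signals.append(f"coverage collapse ({drop:.0f}%)")
    drift = 100.0 * abs(baseline["accept_ratio"] - candidate["accept_ratio"])
    if drift > accept_drift_pct:
        signals.append(f"grammar accept-ratio drift ({drift:.0f} pts)")
    return signals
```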

Verify it's done

Fuzz acceptance criteria document four explicit signals (new findings, new timeouts, coverage delta, grammar drift), each with a numeric threshold.

Run differential or metamorphic execution against a reference patch when available

SHOULD

Why it matters

When a developer patch or upstream fix exists, runtime divergence from it is a strong root-cause signal. The USENIX 2025 SoK reports VulnFix at 96.0% test-pass but 10.4% ground-truth-oriented success — differential execution is one way to close that gap.

How to implement

Generate or reuse an input set; run candidate and reference patch under instrumentation; compare runtime state, output, and side effects. Flag semantic drift for human review.

Verify it's done

Where a reference is available, every acceptance includes a differential-execution result; absence is logged with a reason.

Run variant analysis and reject net-new high-confidence static findings in the changed region

SHOULD

Why it matters

A real fix at the root cause usually covers sibling sites; a symptom-suppression patch does not. Static analysis on the changed region also catches the easy cases the agent introduces.

How to implement

Search the codebase for sites matching the same root-cause pattern (CWE, API misuse, taint shape) and require evidence the patch covers them or that they are unaffected. Run static analysis on the diff and reject net-new high-confidence findings.

Verify it's done

Acceptance log includes a variant-search result and a static-analysis delta; sibling sites without evidence block merge.

Specification, invariants, and root-cause evidence

3 checks

Require a structured root-cause report from the agent — and use it as evidence, not proof

MUST

Why it matters

A report forces the agent to commit to a hypothesis the validator can compare against. It also exposes the reasoning gap between "patch makes crash go away" and "patch restores the violated invariant."

How to implement

Require fields: CWE, trigger path, crash site, root-cause site, invariant violated, why the patch restores it, why valid behavior is preserved. Treat the report as a comparison artifact for the human reviewer; never as a substitute for runtime evidence.

Verify it's done

Every candidate has a complete report; reviewers can point to the field that contradicts a runtime signal when rejecting a patch.

Encode component invariants and check the patch against them

SHOULD

Why it matters

Memory, type, parser, error-handling, security, performance, and ABI/API invariants are the contract the component has with its callers. A patch that crash-suppresses by violating one of these is not a fix.

How to implement

For each touched component, maintain an invariant set covering the seven families above. Run the candidate through invariant checks (assertions, property tests, contract tests) and reject violations.

Verify it's done

A documented invariant manifest exists per security-critical component, and acceptance logs reference which invariants were checked per patch.

Detect and route check-circumvention and crash-suppression patterns to human review

MUST

Why it matters

Early returns, swallowed exceptions, removed assertions, pre-assertion variable rewrites, sanitizer-flag changes, and over-allocations to dodge bounds logic are the recurring shapes of false fixes across PVBench, AutoPatchBench, and the USENIX SoK.

How to implement

Run a static pattern detector over the candidate diff for these shapes. Any hit forces routing to human security review with the matched pattern surfaced. Suppressing the pattern requires explicit reviewer acknowledgment that the suppression is the intended fix.
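A toy version of such a detector over unified-diff lines; the regexes are illustrative starting points, not a complete rule set, and would need per-language extension:

```python
import re

# Illustrative shapes from the recurring false-fix patterns.
CIRCUMVENTION_PATTERNS = {
    "early-return guard": re.compile(r"^\+\s*(return\b|goto\s+\w+;?\s*$)"),
    "swallowed exception": re.compile(
        r"^\+\s*(except\s*(Exception)?\s*:\s*(pass)?$|catch\s*\(\s*\.\.\.\s*\))"),
    "removed assertion": re.compile(r"^-\s*assert"),
    "over-allocation": re.compile(r"^\+.*(malloc|alloc)\s*\(\s*\w+\s*\*\s*\d+"),
}

def detect_circumvention(diff_text: str) -> list:
    """Return names of suspicious shapes found in a unified diff.

    A hit does not prove a bad patch; it forces routing to human security
    review with the matched pattern surfaced alongside the diff.
    """
    hits = set()
    for line in diff_text.splitlines():
        for name, pat in CIRCUMVENTION_PATTERNS.items():
            if pat.search(line):
                hits.add(name)
    return sorted(hits)
```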

Verify it's done

Detector rules are version-controlled, hit rates are tracked, and no patch matching a rule reaches automerge without an acknowledgment record.

Risk routing, acceptance, and post-merge audit

3 checks

Route every candidate through an explicit assurance-level gate

MUST

Why it matters

A single accept/reject gate cannot serve a pipeline that handles both low-risk parser fixes and crypto changes. Risk routing is what keeps the deeper controls affordable on the bulk of patches.

How to implement

Define five modes — advisory, triage, mitigation, automerge, prohibited — with documented entry criteria per component and severity. Default route is triage; promotion to automerge requires meeting all MUSTs above and component eligibility. Build on the routing pattern in tool-use reliability hardening for the agent's tool calls during validation.
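A routing sketch with the five modes; the entry criteria here (prefix list, severity branch) are placeholders standing in for the documented per-component criteria:

```python
from enum import Enum

class Route(Enum):
    ADVISORY = "advisory"
    TRIAGE = "triage"
    MITIGATION = "mitigation"
    AUTOMERGE = "automerge"
    PROHIBITED = "prohibited"

# Illustrative prohibited path prefixes -- maintain in version control.
PROHIBITED_PREFIXES = ("crypto/", "auth/", "sandbox/", "kernel/", "allocator/")

def route(paths, severity: str, all_musts_pass: bool,
          component_eligible: bool) -> Route:
    """Default to triage; promote to automerge only when every MUST holds
    and the component is eligible. Prohibited paths always win."""
    if any(p.startswith(PROHIBITED_PREFIXES) for p in paths):
        return Route.PROHIBITED
    if severity == "critical":
        return Route.MITIGATION   # placeholder criterion for illustration
    if all_musts_pass and component_eligible:
        return Route.AUTOMERGE
    return Route.TRIAGE
```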

Verify it's done

Every patch in the audit log has a recorded route and route reason; route distribution is reported monthly.

Prohibit automerge for security-critical components

MUST

Why it matters

Even with the controls above, validator precision under adversarial generation is bounded. Crypto, authentication, sandbox boundaries, kernel, memory allocators, deserializers, and safety-critical components do not tolerate the residual false-positive rate.

How to implement

Maintain a list of prohibited-from-automerge components in version control. Patches touching those paths require human security review and cannot be promoted to automerge by any agent or reviewer below a defined approval level.

Verify it's done

Path-based policy blocks automerge for the prohibited list at the merge gate; bypasses require documented exception approval.

Track false-closure rate and validator precision as production security metrics

SHOULD

Why it matters

Without longitudinal metrics, validator drift and overfitting go unnoticed until an incident. The vulnerability-management state itself becomes the compromised asset.

How to implement

Track per-component: false-closure rate (patches accepted then re-opened), post-merge fuzz recurrence, rollback rate, and validator precision/recall against a sealed evaluation set with hidden tests and known-bad patches. Review quarterly. The reliability framing in tool-use reliability applies directly to the validator as a control surface.
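The two headline metrics reduce to simple ratios; the definitions below follow the text, with function names chosen for illustration:

```python
def false_closure_rate(accepted: int, reopened: int) -> float:
    """Share of accepted patches later re-opened (post-merge fuzz
    recurrence, rollback, or re-reported vulnerability)."""
    return reopened / accepted if accepted else 0.0

def validator_precision(true_accepts: int, false_accepts: int) -> float:
    """Precision against the sealed evaluation set: true_accepts are
    correctly accepted genuine fixes, false_accepts are accepted
    known-bad patches planted in the set."""
    total = true_accepts + false_accepts
    return true_accepts / total if total else 1.0
```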

Verify it's done

Metrics dashboard exists, sealed eval set is rotated, and a written quarterly review records action items on regressions.

Acceptance criteria

The checklist is fully implemented when, for every agent-authored patch reaching merge, an external reviewer can produce in under thirty minutes: the pinned base commit and container digest, the patch diff, a baseline reproduction record, a sanitized-build matrix result, a PoC+ behavior-witness pair, a coverage delta showing every changed branch exercised, a fuzz campaign result with the four documented signals, an invariant or differential-execution check, a structured root-cause report, and a recorded assurance route. Patches touching prohibited components never reach automerge. False-closure rate, post-merge fuzz recurrence, and validator precision against a sealed evaluation set are tracked and reviewed quarterly, and patches matching check-circumvention patterns route to human review without exception. Anything short of this leaves the validator itself as the weakest link in the AVR pipeline.
