Agentic Binary Reverse Engineering: State of the Art, Architecture, Benchmarks, Failure Modes, and Research Agenda

Abstract

Agentic binary reverse engineering is the emerging practice of using large language model systems that can plan, invoke tools, observe results, revise hypotheses, and continue analysis of compiled programs without direct human step-by-step control. The field has moved quickly from LLM-assisted decompiler summarization toward tool-grounded agents that use Ghidra, IDA, radare2, angr, GDB, Python, sandboxes, and structured reporting interfaces to perform static, dynamic, and hybrid binary analysis. The state of the art is not a single model or a single decompiler; it is an execution architecture: decomposed, feedback-driven, evidence-preserving, sandboxed, and evaluated by deterministic outcomes rather than prose plausibility. Recent systems such as Project Naptime, Microsoft Project Ire, ClearAgent, FORGE, ReCopilot, LLM4Decompile, DisasLLM, and emerging reverse-engineering benchmarks show major progress in tool use, vulnerability discovery, malware classification, function understanding, and decompilation. At the same time, current agents remain brittle under obfuscation, long-horizon dynamic analysis, anti-analysis tricks, context pressure, prompt or tool manipulation, and non-standard architectures. This paper surveys the current state of agentic binary reverse engineering, distinguishes it from adjacent LLM-for-code work, reviews benchmark evidence, analyzes failure modes, and proposes a research agenda for safe, reliable, and operationally useful systems.

1. Definition and Scope

Binary reverse engineering is the process of recovering the behavior, structure, and intent of compiled software when source code is absent, incomplete, or untrusted. Classical workflows combine disassembly, decompilation, string and symbol analysis, control-flow recovery, data-flow reasoning, emulation, debugging, symbolic or concolic execution, and human hypothesis testing. “Agentic” binary reverse engineering adds a decision-making loop: an LLM or multi-agent system observes binary artifacts, chooses analysis actions, calls tools, interprets outputs, stores intermediate evidence, and iterates toward a goal such as malware classification, C2 extraction, vulnerability discovery, patch analysis, flag recovery, or protocol reconstruction.

Recent work defines agentic reverse engineering systems as collections of LLM-based agents equipped with tools and tasked with analyzing a binary on a given device; these systems are usually categorized by whether they perform static, dynamic, or hybrid analysis. Static analysis examines the binary without execution, dynamic analysis inspects behavior at runtime, and hybrid analysis combines both. The literature also notes that most existing agents decompile binaries and operate over outputs from tools such as Ghidra or IDA, but the strongest systems increasingly interleave decompilation with disassembly, debugging, execution, and evidence validation. (arXiv)

This paper focuses on legitimate defensive and research uses: vulnerability discovery and validation, malware triage, firmware analysis, software supply-chain assessment, and analyst augmentation. It does not provide exploit procedures, live target guidance, or operational malware instructions.

2. From LLM Assistance to Agentic Reverse Engineering

The first wave of LLM use in reverse engineering was assistive: paste decompiler output into a model and ask for summaries, variable names, probable algorithms, or vulnerability hypotheses. That remains useful, but it is not agentic. The agentic shift is the move from one-shot interpretation to a closed loop: inspect, hypothesize, act, verify, and revise.

Google Project Zero’s Project Naptime articulated this design shift for vulnerability research. It emphasized space for reasoning, interactive environments, specialized tools, automatic verification, and sampling across multiple independent hypotheses. Naptime’s architecture provides an agent with a code browser, Python tool, debugger, and reporter, and it uses automatic verification to check whether the agent’s output satisfies the task. On CyberSecEval 2-style tasks, Google reported up to a 20× improvement over the original evaluation approach, while also cautioning that isolated benchmark challenges differ substantially from fully autonomous security research in large, ambiguous systems. (Project Zero)

DARPA’s AI Cyber Challenge provides the broader context for autonomous cyber reasoning systems. AIxCC ran from 2023 to 2025 and challenged teams to build fully autonomous systems using LLMs to discover and patch vulnerabilities in real-world C and Java open-source software. The final competition ran about 143 hours, involved seven finalist teams, and evaluated analysis of 53 challenge projects with large cloud and LLM budgets. The AIxCC SoK also contrasts AIxCC with DARPA’s earlier Cyber Grand Challenge: CGC focused on binary security for custom binaries, while AIxCC focused on vulnerability discovery and remediation in real-world open-source development. (arXiv)

For binary reverse engineering specifically, the current state of the art is best understood as a convergence of four threads: binary-specialized models, tool-augmented agents, feedback-driven execution architectures, and deterministic benchmarks.

3. Core Technical Architecture

A modern agentic binary reverse-engineering system typically contains five layers.

First, it has an ingestion and triage layer that identifies file type, architecture, symbols, sections, strings, imports, entropy, packer indicators, and execution environment requirements. Second, it has a static-analysis layer around tools such as Ghidra, IDA, radare2, angr, objdump, strings, readelf, Binary Ninja, or custom analyzers. Third, it has a dynamic layer that can execute or emulate the sample in a controlled environment, set breakpoints, inspect registers and memory, capture traces, and test hypotheses. Fourth, it has an LLM planning layer that decides which artifact or hypothesis to pursue next. Fifth, it has an evidence and reporting layer that records claims, supporting observations, tool outputs, uncertainty, and validation status.

Microsoft’s Project Ire is a clear industrial example. Microsoft describes it as an autonomous agent that classifies software by fully reverse engineering a file without prior clues about origin or purpose. It uses advanced language models and callable reverse-engineering and binary-analysis tools. The system reconstructs control-flow graphs using frameworks such as angr and Ghidra, performs iterative function analysis, builds a chain of evidence, invokes a validator, and emits a final malicious-or-benign report. (Microsoft Research)

The newest research systems increasingly treat architecture as the central contribution. FORGE, for example, argues that many prior LLM-based binary-analysis systems use a one-pass paradigm: static tools build a fixed representation, and the model reasons over that snapshot. FORGE instead frames binary analysis as feedback-driven execution, with a reasoning–action–observation loop and a Dynamic Forest of Agents that decomposes tasks, coordinates parallel exploration, bounds per-agent context, and preserves evidence for validation. (arXiv)

This design trend matters because binary analysis is partially observable. Types, variable names, source-level structure, and intent are usually missing. No single decompiler output is authoritative. A good agent must reason over partial evidence, decide what information would reduce uncertainty, call the right tool, and then update its working model.

4. Model-Level Advances

Agentic systems depend on the base model, but the model alone is not sufficient. The strongest systems combine general reasoning models with binary-specific representations, domain-specific fine-tuning, structured memory, and tool-grounded validation.

LLM4Decompile is a major example of model-level specialization. It introduced an open-source family of models from 1.3B to 33B parameters trained for binary decompilation. The work distinguishes direct end-to-end decompilation from refinement of conventional decompiler output, and reports that the 6.7B direct model achieved 45.4% successful decompilation on HumanEval and 18.0% on ExeBench, outperforming Ghidra and GPT-4o by more than 100% in re-executability in the authors’ benchmarks. The same paper, however, reports that obfuscation sharply degrades both Ghidra and LLM4Decompile, with control-flow flattening and bogus control flow causing large drops in decompilation success. (arXiv)

ReCopilot takes another model-centric approach: it is an expert LLM for binary-analysis tasks that incorporates continued pretraining, supervised fine-tuning, and direct preference optimization on binary-relevant data. It uses variable data flow and call graphs to improve context awareness and reports state-of-the-art results on tasks such as function-name recovery and variable-type inference, outperforming existing tools and LLMs by 13% on its benchmark. (arXiv)

Other specialized model work addresses earlier stages in the pipeline. DisasLLM targets disassembly of obfuscated executables, especially cases involving junk bytes that confuse instruction-boundary recovery. It uses an LLM-based classifier to decide whether decoded instructions are correct and a strategy for end-to-end disassembly, with the authors reporting superior performance over prior disassembly approaches on heavily obfuscated executables. (arXiv) WaDec targets WebAssembly and reports a fine-tuned LLM approach that reduces code inflation and achieves nontrivial recompilability and re-execution rates for Wasm decompilation. (arXiv)

The model-level lesson is clear: general frontier models are useful, but binary-specific models improve recurring subtasks such as decompilation, type recovery, naming, summarization, and disassembly correction. The agent-level lesson is equally important: even good specialized models need tool grounding and verification because reverse engineering requires external evidence.

5. Tool Use and Feedback-Driven Execution

The strongest agentic systems now resemble disciplined analyst workflows more than chatbots. They inspect a binary, build hypotheses, generate scripts, run tools, collect observations, and preserve an audit trail.

Project Naptime’s conclusions are instructive: when expert humans would need iterative reasoning, hypothesis formation, and validation, models also need that flexibility; otherwise benchmarks understate or mischaracterize their actual capability. Google also noted that smaller models struggle with complex tool environments, while larger models can partially use the flexibility required for real-world scenarios. (Project Zero)

FORGE provides stronger evidence that execution structure is a first-order factor. It evaluated 3,457 real-world firmware binaries and reported 1,274 vulnerabilities across 591 unique binaries with 72.3% precision. It also found that its structured, multi-agent execution model improved coverage and verified-vulnerability yield compared with static and simpler LLM-based paradigms. The key claim is not merely that LLMs help; it is that decomposition, parallel exploration, bounded context, and evidence-driven validation determine whether LLM-based analysis scales. (arXiv)

ClearAgent represents a related direction for vulnerability detection at the binary level. It proposes an agentic binary-analysis framework with an LLM-friendly and analyzer-friendly binary interface, enabling the agent to iteratively explore for buggy code and attempt to verify candidate bug reports by constructing concrete triggering inputs. (HKUST)

The current state of the art therefore favors systems that expose reverse-engineering tools through constrained, semantically meaningful interfaces rather than giving a model raw, unrestricted shell access. The agent needs enough power to inspect, execute, and validate, but the interface must reduce unnecessary context, avoid unsafe actions, and make evidence review possible.

6. Empirical Benchmarks and What They Show

Benchmarking is improving, but the field is still young. Current benchmarks measure different slices of the problem, and no single benchmark captures real-world reverse engineering end to end.

BinMetric evaluates LLMs on binary-analysis tasks using 1,000 questions derived from 20 real-world open-source projects across six tasks, including decompilation, code summarization, call-site reconstruction, signature recovery, algorithm classification, and assembly instruction generation. The study concludes that LLMs show potential but still struggle with precise binary lifting and assembly synthesis. (arXiv)

AgentRE-Bench focuses more directly on long-horizon agent behavior. It gives an LLM agent compiled ELF binaries and static-analysis tools, then scores whether it can identify C2 infrastructure, encoding schemes, anti-analysis techniques, and communication protocols. It emphasizes deterministic scoring, 10–25 tool-call chains, a 25-call budget, and tasks ranging from simple reverse shells to metamorphic droppers with RC4 encryption, control-flow flattening, triple anti-debugging, and self-modifying code. (agentre-bench.ai) Its public results also show a useful caution: hallucination calibration can dominate raw reasoning depth, because models that over-claim techniques are penalized and often score worse. (agentre-bench.ai)

CREBench narrows the focus to cryptographic binary reverse engineering. It gives agents binaries and Ghidra decompiled pseudocode, places them in a sandboxed agent framework, and scores four subtasks: algorithm identification, key or IV extraction, wrapper-level reimplementation, and flag recovery. The benchmark includes 48 manually reimplemented cryptographic algorithms and variations for key usage and binary complexity. The authors report that the best evaluated model recovered flags in 59% of challenges under pass@3, while a human expert baseline scored substantially higher; they also identify dynamic analysis as a relative weakness. (arXiv)

The “Agent Failure Patterns in CTF Reverse Engineering” study evaluated three LLM agents on 24 reverse-engineering challenges from 2025 CTF events. At least one agent solved 88% of the tasks, showing strong capability, but the study identified four recurring weaknesses: training bias, over-trust in observations, context limitation, and plan persistence. It also found that removing decompiler access had little effect on success rates for many tasks and that agents often reasoned directly from assembly, using decompilers mainly for navigation.

Human-subject evidence also matters. “Decompiling the Synergy” surveyed 153 practitioners and studied 48 participants across more than 109 hours of software reverse engineering. It found that LLM assistance can narrow the novice–expert gap, increase novice comprehension by about 98%, speed known-algorithm triage by up to 2.4×, and improve recovery of artifacts such as symbols, comments, and types by at least 66%. It also found hallucinations, unhelpful suggestions, expert overreliance risks, and little performance improvement for experts in some conditions. The paper’s conclusion is balanced: LLMs augment analysts rather than replace them.

7. The State of the Art

The state of the art in agentic binary reverse engineering can be summarized in one sentence: the best systems are tool-grounded, feedback-driven, evidence-preserving, sandboxed, and specialized for bounded tasks.

The most mature capabilities are triage, summarization, function naming, variable and type suggestions, known-algorithm recognition, string and constant recovery, simple key or protocol extraction, CTF-style crackme solving, assisted decompilation, and report generation. These tasks align with LLM strengths: pattern recognition, code-like reasoning, summarization, and script generation. They also benefit from clear verification signals, such as whether a generated input is accepted, whether a crash is reproduced, or whether a malware report is supported by evidence.

The strongest industrial signal is Microsoft Project Ire. In public Windows-driver tests, Microsoft reported precision of 0.98 and recall of 0.83, with only 2% benign files flagged as threats. In a harder real-world evaluation of nearly 4,000 files slated for manual reverse-engineering review, Project Ire operated autonomously on files created after the model training cutoff and achieved 0.89 precision with 0.26 recall and a 4% false-positive rate. Microsoft states that the prototype will be leveraged inside Defender as a Binary Analyzer. (Microsoft Research)

The strongest academic system-level signal is FORGE: large-scale firmware evaluation, dynamic decomposition, multi-agent exploration, and evidence-driven validation. Its results suggest that the biggest gains may come from the architecture of reasoning and validation rather than from simply swapping in a larger model. (arXiv)

The strongest benchmark signal is mixed. Agents can solve many short-horizon or medium-horizon reverse-engineering challenges, but they are fragile when dynamic analysis becomes central, when obfuscation invalidates familiar representations, when tool outputs are misleading, or when the task requires prolonged memory of earlier evidence.

8. Failure Modes

Current systems fail in ways that are familiar to experienced reverse engineers but amplified by LLM behavior.

Obfuscation remains a major obstacle. The 2026 survey of agentic reverse-engineering systems identifies obfuscation and tokenization as core static-analysis challenges; token-heavy decompiler output reduces coverage, while obfuscated binaries add misleading complexity. The same survey argues that agentic pipelines need explicit deobfuscation components and smarter tokenization strategies. (arXiv) LLM4Decompile’s obfuscation results reinforce this point: conventional obfuscation can sharply reduce decompilation success for both LLM-based and traditional approaches. (arXiv)

Dynamic analysis is still brittle. Agents can use GDB, sandboxes, and execution feedback, but runtime analysis is interactive, architecture-dependent, stateful, and vulnerable to anti-debugging, timing, environment checks, and analysis dead ends. The agentic RE survey identifies lack of guardrails, timeouts, reliance on emulation, and unsafe execution as central challenges for dynamic and hybrid systems. (arXiv) CREBench similarly found dynamic analysis to be a relative weakness, and its failure examples include agents getting trapped in low-level GDB interactions without converting observations into successful submissions. (arXiv)

Hallucination and over-trust are persistent problems. Reverse engineering produces partial, ambiguous evidence. Agents may treat a decompiler artifact, a guessed function name, or an early algorithm hypothesis as fact. The CTF failure-pattern study found over-trust in observations, training bias, context limitation, and plan persistence as recurring weaknesses. AgentRE-Bench’s public results similarly suggest that over-claiming techniques can dominate performance because fabricated techniques are penalized. (agentre-bench.ai)

Context management remains unresolved. Real binaries generate massive disassembly, pseudocode, traces, logs, and tool outputs. Agents must remember earlier evidence without flooding context windows. FORGE’s Dynamic Forest of Agents and bounded per-agent context is one current answer, but the general problem of long-horizon, evidence-consistent reverse engineering remains open. (arXiv)

Safety and containment are first-order requirements. Agentic RE systems may execute untrusted binaries, run debugger commands, modify files, invoke network tools, or follow instructions embedded in malware strings or decompiler output. The agentic RE survey warns that dynamic-analysis agents often have broad command execution capabilities and that malicious binaries could trick agents into unsafe actions. (arXiv)

9. Human–Agent Collaboration

The likely near-term deployment model is not full analyst replacement. It is analyst acceleration with auditable evidence. Human reverse engineers are still better at deciding which uncertainty matters, recognizing misleading tool outputs, adapting to unfamiliar architectures, and making risk judgments under ambiguity. LLM agents are useful for high-volume triage, first-pass summaries, artifact recovery, repetitive scripting, and exploring multiple hypotheses cheaply.

The human-study evidence supports this. LLMs helped novices substantially but did not reliably improve expert performance, and in some cases experts were harmed by hallucinated vulnerability suggestions or noisy artifacts. The best role for LLMs, according to that study, is as a quick, under-relied-upon filter rather than a replacement for expertise.

For operational teams, this implies that every agentic RE output should include evidence, uncertainty, and reproducibility information. A report that says “this is malware” or “this function is vulnerable” is less useful than a report that links claims to functions, control-flow paths, data-flow evidence, dynamic observations, and validation status.

10. Security Implications

Agentic binary reverse engineering changes both defensive and offensive economics. Defensively, it can reduce malware-triage backlogs, increase firmware-analysis coverage, improve vulnerability-discovery throughput, and help less experienced analysts perform useful first-pass analysis. Project Ire’s Defender-oriented roadmap and FORGE’s firmware-scale evaluation are concrete examples of this defensive direction. (Microsoft Research)

The risk side is also real. Benchmarks such as AgentRE-Bench explicitly measure whether agents can identify C2 infrastructure, encoding schemes, anti-analysis techniques, communication protocols, and embedded secrets in compiled ELF binaries. (agentre-bench.ai) CREBench evaluates agents on algorithm identification, key extraction, behavioral reimplementation, and flag recovery in cryptographic reverse-engineering tasks. (arXiv) These capabilities are dual-use: they can support malware defense and software assurance, but they can also reduce the cost of analyzing proprietary or adversary software.

This dual-use profile argues for strict sandboxing, tool allowlists, network isolation, evidence logging, human approval for risky dynamic actions, and deployment policies aligned to legitimate authorization.

11. Research Agenda

The field needs progress in six areas.

First, evaluation must become more realistic and more reproducible. Benchmarks should combine deterministic scoring with long-horizon tasks, multi-architecture binaries, packed and obfuscated samples, realistic malware-like behaviors, firmware constraints, and human baselines. BinMetric, CREBench, AgentRE-Bench, and the CTF failure-pattern work are early steps, but none fully captures real enterprise malware analysis, IoT firmware auditing, or large commercial software reverse engineering. (arXiv)

Second, agents need better semantic memory. The system should maintain an evidence graph: functions, strings, xrefs, hypotheses, tool outputs, dynamic observations, confidence, contradictions, and unresolved questions. This is more robust than a flat chat transcript.

Third, dynamic analysis needs safer and more capable interfaces. Agents need high-level debugger actions, trace summarization, time-control mechanisms, emulator introspection, anti-analysis detection, and policy enforcement. Raw shell access is powerful but unsafe and inefficient.

Fourth, decompilation should be treated as one view, not ground truth. The CTF failure-pattern study found that decompilers were often used for navigation rather than core understanding, and sometimes obscured instruction-level evidence. Future agents should fluidly move among bytes, assembly, pseudocode, CFGs, data-flow slices, traces, and runtime state.

Fifth, verification must be central. The strongest systems use validators, concrete triggering inputs, evidence replay, or deterministic scoring. Project Naptime’s emphasis on automatic verification, Project Ire’s chain of evidence and validator, ClearAgent’s attempt to construct concrete triggering inputs, and FORGE’s discovery–validation workflow all point in the same direction. (Project Zero)

Sixth, safety research must address adversarial binaries. A binary can contain strings, symbols, debug artifacts, or runtime outputs that attempt to influence the agent. A robust system must treat all target-controlled text as untrusted data, isolate execution, and prevent tool outputs from becoming unverified instructions.

12. Conclusion

Agentic binary reverse engineering has advanced from prompt-based decompiler assistance to autonomous and semi-autonomous systems that can use tools, explore hypotheses, validate evidence, and produce auditable reports. The best current systems are not merely larger LLMs; they are carefully engineered analysis environments with constrained tools, iterative feedback, structured memory, decomposition, and verification.

The state of the art is strong enough to matter. Agents can assist malware classification, accelerate novice reverse engineers, solve many CTF-style binary tasks, improve firmware vulnerability discovery, refine decompiler output, and perform useful artifact recovery. At the same time, current systems are not reliable general-purpose reverse engineers. They struggle with obfuscation, long-horizon dynamic analysis, context drift, misleading evidence, anti-analysis techniques, and safety constraints.

The next breakthrough is likely to come from better execution architecture rather than model scale alone: evidence graphs, safe dynamic-analysis APIs, multi-agent decomposition, calibrated uncertainty, deterministic validation, and human-centered review. In practice, the most credible near-term deployment model is an auditable reverse-engineering copilot or triage agent that increases analyst reach while preserving human judgment for high-impact conclusions.

References

Challenges and Future Directions in Agentic Reverse Engineering Systems. arXiv. https://arxiv.org/html/2604.14317v1
Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models. Google Project Zero, 2024. https://projectzero.google/2024/06/project-naptime.html
SoK: DARPA's AI Cyber Challenge (AIxCC) — Competition Design, Architectures, and Lessons Learned. arXiv. https://arxiv.org/pdf/2602.07666
Project Ire Autonomously Identifies Malware at Scale. Microsoft Research blog. https://www.microsoft.com/en-us/research/blog/project-ire-autonomously-identifies-malware-at-scale/
FORGE: Feedback-Driven Execution for LLM-Based Binary Analysis. arXiv. https://arxiv.org/html/2604.15136v1
LLM4Decompile: Decompiling Binary Code with Large Language Models. arXiv. https://arxiv.org/html/2403.05286v3
ReCopilot: Reverse Engineering Copilot in Binary Analysis. arXiv 2505.16366. https://arxiv.org/abs/2505.16366
DisasLLM: Disassembling Obfuscated Executables with LLM. arXiv. https://arxiv.org/html/2407.08924v1
WaDec: Decompiling WebAssembly Using Large Language Model. arXiv 2406.11346. https://arxiv.org/abs/2406.11346
ClearAgent: Agentic Binary Analysis for Effective Vulnerability Detection. HKUST Research Portal. https://researchportal.hkust.edu.hk/en/publications/clearagent-agentic-binary-analysis-for-effective-vulnerability-de/
BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models. arXiv. https://arxiv.org/html/2505.07360v1
AgentRE-Bench — LLM Reverse Engineering Benchmark. https://www.agentre-bench.ai/
CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering. arXiv. https://arxiv.org/html/2604.03750v1