Harnessing Large Language Models for Enhanced Malware Reverse Engineering

Project Lupine: A Practical Workflow for Fine-Tuned Code Models in Ghidra

Jeremy Richards
Presentation: SecTor 2023 (Toronto, ON, Canada)
Conference Dates: October 23-26, 2023
Recording: SecTor 2023 - Harnessing Large Language Models for Enhanced Malware Reverse Engineering

Executive Summary
1. Introduction
1. Background: Reverse Engineering and LLMs
1. Related Work
1. Project Lupine Overview
1. Dataset Engineering
1. Model Selection and Fine-Tuning
1. IDE Integration and Analyst-in-the-Loop Feedback
1. Results, Limitations, and Risk Considerations
1. Recommendations for Practitioners
1. Future Work
1. Conclusion
Appendix A: Implementation Snapshot (Training and Integration)
References

Executive Summary

Malware reverse engineering is a high-skill, high-effort activity that often requires analysts to triage large binaries and identify a small subset of functions that are truly security-relevant. This presentation-derived write-up summarizes Project Lupine, a workflow that applies fine-tuned large language models (LLMs) to accelerate routine reverse engineering tasks by automatically generating (1) descriptive function names, (2) concise summaries, and (3) step-by-step explanations from decompiled code.

The core idea is simple: a model that can reliably translate noisy decompiler output into consistent natural-language annotations can reduce analyst time spent on repetitive triage and increase time spent on deeper reasoning and validation. Project Lupine combines dataset generation (synthetic and real-world), parameter-efficient fine-tuning, a lightweight inference server, and a Ghidra plugin to place LLM-assisted annotations directly into the analyst's primary workspace.

Key takeaways include: (a) task performance improves significantly with purposeful fine-tuning on decompiled functions, (b) model size and context window materially impact summary and step-by-step quality, and (c) successful deployment requires safeguards against hallucination, prompt injection, and data exfiltration risks - especially when using shared or remote community endpoints.

1. Introduction

Reverse engineering is labor-intensive by nature: analysts must interpret assembly, map low-level operations to higher-level concepts, and recognize patterns across unfamiliar codebases. Malware analysis compounds these challenges through obfuscation, anti-debugging, packers, and intentional ambiguity. As the volume and complexity of malicious software increases, teams need automation that accelerates the boring parts while highlighting the interesting ones.

Large language models have demonstrated strong capabilities in pattern recognition over code-like text and in translating complex artifacts into human-readable explanations. The motivating question for this work is: can an LLM understand decompiled code well enough to provide actionable annotations for malware reverse engineering?

Project Lupine targets three high-value, repeatable tasks that appear early in most reverse engineering workflows:

Explain a decompiled function step by step.
Provide a short summary of what the function does.
Suggest a descriptive new function name based on observed behavior.

Automating these tasks enables analysts to quickly scan function lists and cross-references (xrefs) using meaningful names and consistent summaries, improving navigation and prioritization.

2. Background: Reverse Engineering and LLMs

A common approach to reversing an unknown binary begins with basic file and architecture identification (e.g., PE32 vs. ELF; x86 vs. x64), followed by structural inspection (headers, sections, imports/exports, resources). Analysts then pivot using cross-references: they look for interesting APIs, strings, and call graphs, and progressively trace execution from entry points toward suspicious logic.

Figure 2. Baseline reverse engineering workflow used to prioritize interesting functions.

LLMs are a strong fit for this context because they can map recurring low-level patterns into higher-level concepts and express them in natural language. For decompiled code specifically, models can:

Recognize repeated idioms and common library or API usage patterns.
Translate machine-level operations into higher-level intent (e.g., persistence, injection, network communication).
Bridge to natural language by producing consistent explanations and naming conventions.

These benefits come with failure modes that matter in malware analysis: hallucinations (confidently incorrect explanations), over-generalization, and susceptibility to misdirection through adversarial code constructs or prompt injection.

3. Related Work

A growing ecosystem of reverse engineering assistants has explored LLM-driven annotation and explanation within common tools. Examples include:

G-3PO (Ghidra assistant) and Gepetto (IDA Pro), which request explanatory comments and variable/function names from hosted models.
AI assistants for debuggers (e.g., Pwndbg and GEF integrations) that help interpret debugging context.
Whole-program summarization experiments that attempt to describe larger codebases from decompiled output.
Import table analysis tools that map Windows APIs to likely behaviors and associated ATT&CK techniques.

On the research side, work such as LmPa and DIRTY highlights the value of combining program analysis with learned models to improve decompiler outputs and to predict variable names and types. These lines of work motivate treating LLM assistance as an augmentation layer on top of traditional analysis - not as a replacement.

4. Project Lupine Overview

Project Lupine is designed to improve analyst velocity by embedding LLM assistance directly into reverse engineering workflows. It consists of:

A dataset generation pipeline that produces paired examples of decompiled functions and desired annotations.
A fine-tuned local LLM optimized for reverse engineering annotation tasks.
A lightweight inference service (local or remote) that exposes a small API for model queries.
Ghidra plugins that request annotations and write them back into the decompiler view as comments and function names.
An optional feedback loop that allows analysts to submit corrected labels to improve future training runs.

Figure 1. Project Lupine architecture and analyst-in-the-loop feedback cycle.

5. Dataset Engineering

Fine-tuning depends on high-quality, task-aligned data. Lupine's training examples pair decompiled code with three labels: a descriptive function name, a short summary, and a step-by-step explanation. The project uses two complementary data sources:

Synthetic binaries generated to exercise specific techniques or API patterns.
Real-world malware samples processed through an analysis pipeline to extract candidate functions for labeling.

5.1 Synthetic data generation (conceptual)

Synthetic data provides a scalable starting point: it can be generated in large quantities and targeted at specific behaviors. In Lupine, a generator produces source code intended to demonstrate distinct adversary techniques by using designated operating system APIs. The workflow compiles and executes generated code to ensure it functions, then decompiles the resulting binaries and retains the decompiled functions alongside their intended descriptions.

Because synthetic samples can encode their own ground truth (the requested behavior), they enable training on consistent labels. At the same time, synthetic-only training is insufficient for robust malware analysis: real malware includes anti-analysis tricks, non-idiomatic code, and noisy artifacts that must be represented in the dataset.

5.2 Real-world malware processing pipeline

To incorporate real samples, Lupine leverages an automated pipeline that identifies and decompiles interesting regions of code, then prepares them for labeling and model training. A representative workflow includes:

Static inspection of PE structure and packer hints (e.g., pefile, Detect It Easy).
Pattern matching and classification (e.g., YARA rules).
Capability discovery to locate suspicious behaviors and offsets (e.g., capa).
Decompilation of selected functions (e.g., radare2).
LLM interaction to propose labels and store responses for review (e.g., via LangChain).

Figure 3. Real-world malware processing pipeline for candidate extraction and labeling.

A key practical detail is that the pipeline stores results in a structured analysis object (for example, a dictionary-of-dictionaries), separating capability findings (function offsets) from decompilation output (function bodies). This structure supports downstream review tools and repeatable training dataset creation.

As referenced in the source material, a curated dataset of labeled function examples was published on Hugging Face as dyngnosis/function_names_v2.

6. Model Selection and Fine-Tuning

The project evaluated different starting points for code-focused language models. Early experiments with a StarCoder-derived approach were not satisfactory for the target reverse engineering tasks, motivating a shift toward Code Llama-family models.

Several training considerations emerged as especially important:

Task definition: clearly separate naming, summarization, and step-by-step explanation tasks.
Prompt format: align the fine-tuning prompt style with the base model's instruction format to reduce format drift.
Context budgeting: treat the available context window as a fixed resource and maximize informative inputs (for example, include xrefs or memory dereferences when possible).
Tokenization hygiene: avoid training on samples that become silently truncated; truncated samples can poison learning and reduce reliability.

6.1 Parameter-efficient fine-tuning

To make fine-tuning feasible on modest hardware, Lupine applies parameter-efficient techniques (PEFT). QLoRA-style approaches combine quantization with low-rank adapters so that large base models can be adapted without updating all weights. In the project's implementation, adapters were applied to attention projection modules (q_proj, k_proj, v_proj, o_proj) with a rank of 16, alpha of 16, and dropout of 0.05.

Model size was observed to be a practical driver of output quality, especially for summaries and step-by-step explanations. In this project, 34B-parameter models were required to achieve consistently usable results for these tasks (with the expectation that improved data and higher-context training could reduce that requirement over time).

7. IDE Integration and Analyst-in-the-Loop Feedback

A distinguishing goal of Project Lupine is workflow integration: analysts should not need to context-switch to benefit from model assistance. The Ghidra plugins are designed to run from within the decompiler view and to write results directly back into the program database by:

Renaming the current function to a descriptive, behavior-aligned name.
Adding a summary comment (and optionally a step-by-step explanation) to the function.
Allowing an analyst to edit and submit corrections to improve future model versions.

7.1 Plugin modes and shortcuts

Three plugins are described in the source material, each with configured shortcut keys:

llm.py (CTRL-ALT-L): calls a local LLM to request a new function name and description, then renames the function and updates the comment. The plugin expects api_server.py on localhost:8000.
llm_remote.pt (CTRL-ALT-O): calls the Project Lupine community server to request a new function name and description. It sends the sample hash, function offset, and decompiled code to the community endpoint.
llm_suggest (CTRL-SHIFT-K): submits analyst edits (name and/or summary) back to the community server, along with the sample hash, function offset, decompiled code, and updated comment.

Figure 4. Analyst-in-the-loop plugin interaction and feedback cycle.

Remote mode can provide shared learning benefits, but should be used only with appropriate governance and data-handling controls.

8. Results, Limitations, and Risk Considerations

Project Lupine demonstrated that a fine-tuned model can produce useful reverse engineering annotations that improve navigation and triage - such as naming functions based on observed API usage and summarizing multi-step routines. The approach can also highlight notable constants (for example, encryption-related constants) and provide structured explanations that help analysts decide where to focus deeper effort.

From the project's conclusions, several operational lessons stand out:

Matching the pre-training prompt format is important, as is starting from a semi-working base model.
Context should be treated like a budget: maximize useful context per call; when space permits, add xrefs, memory dereferences, and (where available) dynamic analysis traces.
When context is tight due to large functions, chunking and intermediate summarization can help.
An analyst-in-the-loop feedback mechanism supports continuous learning and drives adoption when it improves day-to-day workflow.
Community models can exhibit confident hallucinations (and failure modes such as repetition or not knowing when to stop) and can be susceptible to injection or misdirection.

8.1 Operational security considerations

LLM-assisted reverse engineering introduces new operational risks. Teams should plan for:

Data handling: decompiled code may contain proprietary content; prefer local inference for sensitive samples.
Network controls: if remote services are used, constrain outbound connectivity and log requests/responses.
Input sanitation: treat model inputs as attacker-controlled; resist prompt injection by isolating system instructions and applying strict templates.
Output verification: use cross-checking (for example, API resolution, xrefs, dynamic traces) before accepting outputs as fact.

9. Recommendations for Practitioners

Based on the project's experience, the following practices improve results and reduce risk when applying LLMs to reverse engineering:

9.1 Data and labeling

Start with synthetic data to bootstrap, but incorporate real malware early to capture anti-analysis and obfuscation patterns.
Design review workflows so analyst corrections are low-friction; adoption depends on making feedback part of the normal workflow.
Remove label leakage artifacts (for example, verbose debug prints) that allow the model to learn shortcuts instead of behavior.

9.2 Prompting and context management

Treat context as a budget: allocate tokens to the most informative artifacts first (decompiled code, key strings, imports, xrefs).
When context is tight, use chunking and intermediate summarization to keep large functions within model limits.
Use strict output formats and length constraints to reduce repetition and improve readability in an IDE.

9.3 Training and evaluation

Track evaluation performance separately for naming vs. summarization vs. step-by-step tasks; a single score can hide weaknesses.
Maintain a held-out set of real-world samples for regression testing across model versions.
Prefer incremental improvements: adjust data quality, context construction, and prompt templates before expanding the model footprint.

10. Future Work

Future directions identified for Project Lupine focus on expanding analyst utility and improving model robustness:

Rule generation: produce YARA, Sigma, or Snort detections informed by extracted behaviors.
Auto-analyst workflows: seed exploration with known APIs, follow LLM-suggested leads, and incorporate dynamic analysis when needed.
Reporting: export structured findings (for example, Markdown) for threat intelligence and incident response.
Smaller models: explore quantized 7B and 13B variants to reduce hardware requirements while maintaining acceptable quality.

11. Conclusion

Project Lupine illustrates a pragmatic path to using LLMs as reverse engineering copilots: focus on concrete, high-frequency tasks; build training data that reflects real analyst workflows; integrate output directly into the RE environment; and keep analysts in the loop. With appropriate safeguards, fine-tuned models can improve triage speed and help analysts identify high-value functions faster - while still requiring rigorous verification for final conclusions.

Appendix A: Implementation Snapshot (Training and Integration)

This appendix captures representative implementation parameters and integration details from the source material. These values are not normative; they are provided to document one working configuration used in the project.

A.1 Training configuration (representative)

Component	Value (representative)
Training dataset	dyngnosis/function_names_v2 (Hugging Face)
Train/test split	train_test_split(test_size=0.01)
Base model	codellama/CodeLlama-34b-hf
Loading	load_in_8bit=True, torch_dtype=float16, device_map='auto'
LoRA target modules	q_proj, k_proj, v_proj, o_proj
LoRA hyperparameters	r=16, lora_alpha=16, lora_dropout=0.05, bias='none', task_type='CAUSAL_LM'
Effective batch size	batch_size=4 (per_device_train_batch_size=1 with gradient_accumulation_steps=4)
Max steps	10000
Learning rate / warmup	learning_rate=3e-4, warmup_steps=50
Precision	fp16=True
Logging / eval / save	logging_steps=10, eval_steps=500, save_steps=500
Optimizer	adamw_torch
Sequence handling	group_by_length=True

A.2 Ghidra integration (representative)

Plugin	Behavior and assumptions
llm.py (CTRL-ALT-L)	Local inference; renames function and updates comment; expects api_server.py on localhost:8000.
llm_remote.pt (CTRL-ALT-O)	Community inference; sends sample hash, function offset, and decompiled code to community server.
llm_suggest (CTRL-SHIFT-K)	Submits analyst edits (name/comment) back to community server for continuous improvement.

References

dyngnosis/function_names_v2 - dataset reference (Hugging Face).
Code Llama - Meta AI code model family.
Dettmers et al. QLoRA (arXiv:2305.14314).
G-3PO - Ghidra assistant (tool reference).
Gepetto - IDA Pro assistant (tool reference).
GPT-WPRE - whole-program reverse engineering prototype.
IATelligence - import table analysis with ATT&CK mapping.
LmPa - LLM + program analysis for decompilation improvements.
DIRTY (USENIX 2022) - learned variable names and types for decompiler output.
SecTor 2023 presentation recording: https://vimeo.com/883085410?h=a1f709610d