This research evaluates how well local Large Language Models (LLMs), specifically the Qwen2.5-Coder series, deobfuscate binaries protected by Control Flow Flattening (CFF). Using a closed-loop workflow that combines Ghidra decompilation, Ollama-orchestrated prompting, and behavioral verification, the study tests whether the models can recover RC4 logic from stripped, obfuscated C code. The findings indicate that structural recovery is achievable, but smaller models (7B-14B) suffer from critical reasoning failures, including data-flow loss, incorrect operator precedence, and self-audit hallucinations. The research concludes that LLMs currently work best as hypothesis generators inside a rigorous, behaviorally verified analysis framework, not as autonomous deobfuscation engines.
Research Overview: AI-Driven Reverse Engineering
- Objective: Determine whether local LLMs running on restricted GPU slices (NVIDIA H200 MIG) can deobfuscate CFF-protected binaries without prior algorithmic knowledge.
- Core Philosophy: Advocates for an iterative, behavior-driven approach where LLMs act as "hypothesis generators" to assist human analysts.
- Toolchain Integration: Combines Ghidra (decompilation), Ollama (LLM orchestration), GCC (compilation), and Python (verification).
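
The toolchain above can be tied together by a small driver loop. The following is a minimal sketch under stated assumptions, not the study's actual script: the endpoint URL is Ollama's default local address, and the model tag, the prompt wording, the names `ask_ollama` and `recover`, the feedback-on-failure step, and the `verify` callable (standing in for GCC compilation plus behavioral comparison) are all illustrative.

```python
import json
import urllib.request
from typing import Callable, Optional, Tuple

# Assumptions: Ollama's default local endpoint and an example model tag.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5-coder:14b"


def ask_ollama(prompt: str) -> str:
    """Send one non-streaming prompt to a local Ollama server and return its text."""
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def recover(
    decompiled: str,
    generate: Callable[[str], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 4,
) -> Optional[str]:
    """Prompt, verify, and re-prompt with the failure report until a candidate passes."""
    feedback = ""
    for _ in range(max_rounds):
        prompt = "Recover readable C from this flattened decompilation.\n" + decompiled
        if feedback:
            # Assumption: the previous failure report is fed back to the model.
            prompt += "\nThe previous attempt failed verification:\n" + feedback
        candidate = generate(prompt)
        # Hypothetical verify stage: compile the candidate, then compare behavior.
        passed, feedback = verify(candidate)
        if passed:
            return candidate
    return None  # no candidate passed; hand the case back to the analyst
```

Passing the stages in as callables keeps the loop testable without a GPU: `recover(text, ask_ollama, verify)` in real use, or stub functions in a unit test.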
Methodology: The Closed-Loop Deobfuscation Workflow
- Obfuscation Target: A stripped RC4 implementation modified with manual CFF, utilizing a central dispatcher loop and opaque state variables (e.g., 0xDEAD0001).
- Execution Cycle: Implements a "Decompile → Prompt → Recover → Compile → Behavioral Test" pipeline.
- Verification Mechanism: Employs compare_recovered.py, a custom script that checks for byte-for-byte identity in the 256-byte S-box array and output ciphertext against a Python ground-truth.
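
To make the target and the check concrete, here is a hedged Python sketch of a ground-truth RC4, the same key schedule rewritten as a dispatcher loop, and a byte-for-byte comparison. It is not the study's compare_recovered.py or its C target: only the 0xDEAD0001 constant comes from the description above, and the other state values, the function names, and the choice to illustrate flattening in Python instead of C are assumptions.

```python
def ksa(key: bytes) -> list[int]:
    """Reference RC4 key scheduling: returns the 256-byte S-box."""
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]
    return S


def prga(S: list[int], data: bytes) -> bytes:
    """Reference RC4 keystream generation, XORed with the input bytes."""
    S = S.copy()  # leave the caller's S-box untouched
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) & 0xFF])
    return bytes(out)


def ksa_flattened(key: bytes) -> list[int]:
    """The same key schedule as a central dispatcher loop over opaque states."""
    # Only 0xDEAD0001 appears in the source; the remaining values are made up.
    INIT, CHECK, MIX, SWAP, DONE = (
        0xDEAD0001, 0xDEAD0002, 0xDEAD0003, 0xDEAD0004, 0xDEAD0005,
    )
    S: list[int] = []
    i = j = 0
    state = INIT
    while state != DONE:
        if state == INIT:
            S = list(range(256))
            state = CHECK
        elif state == CHECK:
            state = MIX if i < 256 else DONE
        elif state == MIX:
            j = (j + S[i] + key[i % len(key)]) & 0xFF
            state = SWAP
        elif state == SWAP:
            S[i], S[j] = S[j], S[i]
            i += 1
            state = CHECK
    return S


def compare(candidate_ksa, key: bytes, plaintext: bytes) -> dict[str, bool]:
    """Check a candidate key schedule against the reference, byte for byte."""
    expected = ksa(key)
    got = candidate_ksa(key)
    return {
        "sbox": got == expected,
        "ciphertext": prga(got, plaintext) == prga(expected, plaintext),
    }
```

Calling `compare(ksa_flattened, b"Key", b"Plaintext")` reports whether both the S-box and the ciphertext match. The flattened version hides the loop structure behind state transitions but performs the same swaps in the same order, which is why a behavioral check can confirm a recovery that is hard to judge by reading.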
Technical Challenges: LLM Reasoning Failures
- Data-Flow Corruption: Significant tendency to drop critical accumulator variables, such as the RC4 'j' variable, during logic reconstruction.
- Operator Precedence Errors: Incorrect implementation of bitwise logic, such as masking only the state variable instead of the intended sum with & 0xff.
- Variable Aliasing: Errors in mapping obfuscated state variables to the correct temporary variables during recovery.
- Self-Audit Hallucinations: Instances where the model provides false-positive "PASS" reports, claiming code correctness despite failing behavioral verification.
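
Two of the failure modes above are easy to reproduce in isolation. The sketch below is illustrative only: the `keep_j` switch and the helper names are assumptions, and the exact faulty expressions the models produced are not given in the source. It models a dropped `j` accumulator in the key schedule and a mask applied to one operand instead of the whole sum.

```python
def ksa(key: bytes, keep_j: bool = True) -> list[int]:
    """RC4 key schedule; keep_j=False models a recovery that drops the accumulator."""
    S = list(range(256))
    j = 0
    for i in range(256):
        carried = j if keep_j else 0  # the dropped-accumulator bug forgets the old j
        j = (carried + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]
    return S


def index_correct(a: int, b: int) -> int:
    """Intended form: the mask covers the whole sum."""
    return (a + b) & 0xFF


def index_misscoped(a: int, b: int) -> int:
    """Faulty form: the mask covers only one operand, so the sum can exceed 255."""
    return a + (b & 0xFF)


def out_of_range_pairs() -> int:
    """Count byte pairs for which the faulty form leaves the 0..255 index range."""
    return sum(
        1
        for a in range(256)
        for b in range(256)
        if index_misscoped(a, b) > 255
    )


if __name__ == "__main__":
    print("S-boxes equal:", ksa(b"Key") == ksa(b"Key", keep_j=False))
    print("out-of-range index pairs:", out_of_range_pairs(), "of", 256 * 256)
```

Both variants look plausible on the page, which is the point: the dropped accumulator still yields a valid permutation of 0..255, and the mis-scoped mask only misbehaves when the sum crosses 255, so neither is reliably caught by reading the code.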
Model Performance and Scaling
- Qwen2.5-Coder 7B: Demonstrated high failure rates and struggled significantly with maintaining essential data-flow integrity.
- Qwen2.5-Coder 14B: Capable of structural recovery but required multiple iterations (v1 through v4) to achieve successful behavioral equivalence.
- Scaling Requirements: Concludes that while 14B models show promise, 70B+ class models are likely required for reliable, production-grade deobfuscation tasks.
Industry and Defensive Implications
- Shift in Workflow: Malware analysis is moving toward hybrid human-AI workflows where automated verification is the primary safeguard.
- Requirement for Harnesses: Successful AI-assisted RE necessitates a rigorous behavioral verification harness to catch silent logic failures.
- Future Tooling: Integration of local LLMs into headless decompiler environments presents a significant opportunity for rapid triage.
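
One way to read the harness requirement is that the verdict should come only from measured behavior, never from the model's own report. The sketch below assumes a recovered routine can be exposed as a Python callable taking `(key, data)`; the `Verdict` and `judge` names are hypothetical, and the three test vectors are the widely published RC4 key/plaintext/ciphertext triples, not the study's own data.

```python
from dataclasses import dataclass
from typing import Callable

# Widely published RC4 test vectors: (key, plaintext, expected ciphertext hex).
VECTORS = [
    (b"Key", b"Plaintext", "bbf316e8d940af0ad3"),
    (b"Wiki", b"pedia", "1021bf0420"),
    (b"Secret", b"Attack at dawn", "45a01f645fc35b383552544b9bf5"),
]


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4, used here as an example of a candidate that should pass."""
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) & 0xFF])
    return bytes(out)


@dataclass(frozen=True)
class Verdict:
    claimed_pass: bool   # what the model said about its own output
    measured_pass: bool  # what the test vectors actually showed

    @property
    def hallucinated(self) -> bool:
        """True when the model claimed success that the measurements contradict."""
        return self.claimed_pass and not self.measured_pass


def judge(candidate: Callable[[bytes, bytes], bytes], claimed_pass: bool) -> Verdict:
    """Run every vector; a crash counts as a failure, and the claim is only recorded."""
    try:
        measured = all(
            candidate(key, data).hex() == expected for key, data, expected in VECTORS
        )
    except Exception:
        measured = False
    return Verdict(claimed_pass, measured)
```

For example, `judge(rc4, claimed_pass=True)` yields a verdict with `hallucinated` false, while a candidate that returns its input unchanged alongside a claimed PASS is flagged. Recording the claim separately from the measurement also makes it possible to track how often self-audits are wrong.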