
This research evaluates how well local Large Language Models (LLMs), specifically the Qwen2.5-Coder series, deobfuscate binaries protected by Control Flow Flattening (CFF). Using a closed-loop workflow that combines Ghidra decompilation, Ollama-orchestrated prompting, and behavioral verification, the study tests whether the models can recover RC4 logic from stripped, obfuscated C code. The findings indicate that structural recovery is achievable, but smaller models (7B-14B) suffer from critical reasoning failures, including data-flow loss, incorrect operator precedence, and self-audit hallucinations. The research concludes that LLMs currently work best as hypothesis generators inside a rigorous, behaviorally verified analysis framework rather than as autonomous deobfuscation engines.

  • Research Overview: AI-Driven Reverse Engineering

    • Objective: Determine if local LLMs running on restricted GPU slices (NVIDIA H200 MIG) can deobfuscate CFF-protected binaries without prior algorithmic knowledge.
    • Core Philosophy: Advocates for an iterative, behavior-driven approach where LLMs act as "hypothesis generators" to assist human analysts.
    • Toolchain Integration: Combines Ghidra (decompilation), Ollama (LLM orchestration), GCC (compilation), and Python (verification).
  • Methodology: The Closed-Loop Deobfuscation Workflow

    • Obfuscation Target: A stripped RC4 implementation modified with manual CFF, utilizing a central dispatcher loop and opaque state variables (e.g., 0xDEAD0001).
    • Execution Cycle: Implements a "Decompile → Prompt → Recover → Compile → Behavioral Test" pipeline.
    • Verification Mechanism: Uses compare_recovered.py, a custom script that checks that the 256-byte S-box array and the output ciphertext match a Python ground-truth implementation byte for byte.
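
To make the obfuscation target and the verification step concrete, here is a minimal Python sketch. It is not the study's C sample or its compare_recovered.py: the function names are hypothetical, and every state constant other than 0xDEAD0001 is an assumption chosen for illustration. It shows a plain RC4 key schedule, the same logic flattened into a dispatcher loop driven by an opaque state variable, and a byte-for-byte comparison of S-box and ciphertext.

```python
def ksa_reference(key: bytes) -> list:
    """Plain RC4 key-scheduling algorithm, used here as ground truth."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    return s


# Opaque state constants (illustrative values modeled on 0xDEAD0001).
ST_INIT, ST_CHECK, ST_MIX, ST_SWAP, ST_EXIT = (
    0xDEAD0001, 0xDEAD0002, 0xDEAD0003, 0xDEAD0004, 0xDEAD0005)


def ksa_flattened(key: bytes) -> list:
    """Same key schedule, but the loop is flattened into a dispatcher."""
    s = list(range(256))
    i = j = 0
    state = ST_INIT
    while state != ST_EXIT:  # central dispatcher loop
        if state == ST_INIT:
            i, j = 0, 0
            state = ST_CHECK
        elif state == ST_CHECK:  # original loop condition
            state = ST_MIX if i < 256 else ST_EXIT
        elif state == ST_MIX:  # update the accumulator j
            j = (j + s[i] + key[i % len(key)]) & 0xFF
            state = ST_SWAP
        elif state == ST_SWAP:  # swap and advance i
            s[i], s[j] = s[j], s[i]
            i += 1
            state = ST_CHECK
    return s


def rc4_crypt(s_box: list, data: bytes) -> bytes:
    """RC4 keystream generation (PRGA) XORed with the input data."""
    s = list(s_box)  # work on a copy so the caller's S-box is untouched
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def behaviorally_equivalent(candidate_ksa, key: bytes, plaintext: bytes) -> bool:
    """Compare S-box and ciphertext byte for byte against the reference."""
    ref_s = ksa_reference(key)
    cand_s = candidate_ksa(key)
    return (cand_s == ref_s
            and rc4_crypt(cand_s, plaintext) == rc4_crypt(ref_s, plaintext))
```

Calling behaviorally_equivalent(ksa_flattened, b"Key", b"Plaintext") checks the flattened variant against the reference. Comparing the intermediate S-box as well as the final ciphertext helps localize a failure to the key schedule or to the keystream stage.
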
  • Technical Challenges: LLM Reasoning Failures

    • Data-Flow Corruption: Significant tendency to drop critical accumulator variables, such as the RC4 'j' variable, during logic reconstruction.
    • Operator Precedence Errors: Misplaced bitwise masks, such as applying & 0xff to the state variable alone rather than to the whole sum.
    • Variable Aliasing: Errors in mapping obfuscated state variables to the correct temporary variables during recovery.
    • Self-Audit Hallucinations: Instances where the model provides false-positive "PASS" reports, claiming code correctness despite failing behavioral verification.
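
The sketch below illustrates, under stated assumptions, how two of these failure modes look in an RC4 key schedule and why a behavioral check catches them even when a model's self-audit says "PASS". The buggy variants are hypothetical reconstructions of the kinds of errors described above, not the models' actual output, and all function names are illustrative.

```python
def ksa_reference(key: bytes) -> list:
    """Correct RC4 key schedule: the whole sum is masked with & 0xFF."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    return s


def ksa_dropped_j(key: bytes) -> list:
    """Data-flow loss: the accumulator j is not carried between rounds."""
    s = list(range(256))
    for i in range(256):
        j = (s[i] + key[i % len(key)]) & 0xFF  # previous j is missing
        s[i], s[j] = s[j], s[i]
    return s


def ksa_misplaced_mask(key: bytes) -> list:
    """Mask applied to one operand only, so the sum is never reduced."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j & 0xFF) + s[i] + key[i % len(key)]  # j can exceed 255
        s[i], s[j] = s[j], s[i]  # out-of-range index once j > 255
    return s


def harness_verdict(candidate_ksa, key: bytes) -> str:
    """Return PASS only if the candidate's S-box matches the reference."""
    try:
        matches = candidate_ksa(key) == ksa_reference(key)
    except IndexError:
        # In Python the bad index raises; in C it would be a silent
        # out-of-bounds access, which is why output comparison matters.
        return "FAIL"
    return "PASS" if matches else "FAIL"


def final_verdict(model_self_audit: str, candidate_ksa, key: bytes) -> str:
    """The model's own claim is recorded but never trusted."""
    return harness_verdict(candidate_ksa, key)
```

For example, final_verdict("PASS", ksa_dropped_j, b"Key") ignores the claimed "PASS" and reports the harness result instead. The dropped-accumulator variant runs without any error, so only a comparison against known-good output reveals it.
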
  • Model Performance and Scaling

    • Qwen2.5-Coder 7B: Demonstrated high failure rates and struggled significantly with maintaining essential data-flow integrity.
    • Qwen2.5-Coder 14B: Capable of structural recovery but required multiple iterations (v1 through v4) to achieve successful behavioral equivalence.
    • Scaling Requirements: Concludes that while 14B models show promise, 70B+ class models are likely required for reliable, production-grade deobfuscation tasks.
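
The iterative recovery described above (v1 through v4 for the 14B model) can be pictured as a retry loop in which each failed verification report is fed back into the next prompt. The following is a rough sketch under several assumptions: the prompt wording, the function names, and the four-round limit are hypothetical, the Ollama call assumes a default local server at port 11434 with the qwen2.5-coder:14b tag pulled, and the compile-and-test step is abstracted into a verify callable rather than invoking GCC.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

BASE_PROMPT = (  # hypothetical prompt wording
    "Recover readable C from this control-flow-flattened decompilation. "
    "Return code only.\n\n"
)


def ask_ollama(prompt: str, model: str = "qwen2.5-coder:14b") -> str:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def recover(decompiled: str, ask_model, verify, max_rounds: int = 4):
    """Prompt, verify, and retry with feedback until a candidate passes.

    ask_model(prompt) returns candidate source code as a string.
    verify(candidate) returns (ok, report); in a real setup it would
    compile the candidate and compare its behavior with ground truth.
    Returns (round_number, candidate) on success, or None on failure.
    """
    prompt = BASE_PROMPT + decompiled
    for round_no in range(1, max_rounds + 1):
        candidate = ask_model(prompt)
        ok, report = verify(candidate)
        if ok:
            return round_no, candidate
        # Feed the behavioral failure back so the next attempt can fix it.
        prompt = (
            BASE_PROMPT + decompiled
            + "\n\nPrevious attempt:\n" + candidate
            + "\n\nVerification report:\n" + report
        )
    return None
```

Passing ask_ollama as ask_model would connect the loop to a local model, while a stub function makes the loop testable offline. Keeping verify as the only source of a pass verdict reflects the point that the model's own assessment is not trusted.
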
  • Industry and Defensive Implications

    • Shift in Workflow: Malware analysis is moving toward hybrid human-AI workflows where automated verification is the primary safeguard.
    • Requirement for Harnesses: Successful AI-assisted RE necessitates a rigorous behavioral verification harness to catch silent logic failures.
    • Future Tooling: Integration of local LLMs into headless decompiler environments presents a significant opportunity for rapid triage.

