EVOHUNT: Evolving LLM Audit Playbooks
Abstract
An LLM agent for vulnerability discovery and validation is more than a model. It combines three components: (i) an underlying LLM for code analysis; (ii) a general-purpose agent harness, such as Codex or OpenCode, for repository navigation, tool use, context management, and long-horizon execution; and (iii) an audit playbook, the domain-specific procedural knowledge that guides the LLM and harness toward effective vulnerability discovery. Prior work relies on human-supplied playbooks in several forms, including prompt engineering, role play, manually designed audit workflows, curated vulnerability knowledge bases (e.g., via RAG), and heuristics. This raises two research questions. (RQ1) Acquisition: Is human curation necessary, or can playbook creation be fully automated? (RQ2) Transfer: Can an evolved playbook transfer the audit procedure to weaker agents and improve their capability? We present EVOHUNT, which instantiates a playbook evolution environment over open-source repositories for security auditing. Three agents drive the evolution loop: (i) an audit agent rolls out the current playbook and produces findings and evidence; (ii) an evaluator scores outcomes against ground truth; and (iii) a reviser commits updates to the playbook based on failure analysis. The playbook format is unconstrained: starting from an empty playbook, EVOHUNT freely adds or removes workflows, heuristics, vulnerability knowledge, or any other domain-specific content. The evolved playbook requires only minor adaptation to run under a different LLM or harness. We evaluate EVOHUNT on 813 open-source security advisories for evolution and 371 held-out advisories for testing. For acquisition, playbook evolution raises end-to-end exploits for Codex/GPT5.4-xhigh 6× (1.1% → 6.2%), and the evolved OpenCode/GLM5.1 playbook surpasses OpenAI Codex Security on every metric (11.3% vs. 9.2%), showing that open-source evolution can outperform a dedicated commercial product.
For transfer, the GLM-evolved playbook gives the strongest student lift (27B: 2.4% → 6.5%; A3B: 1.1% → 4.6%) and yields 2.4× more A3B matches than GPT transfer.
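
The three-agent evolution loop described above can be illustrated with a minimal sketch. This is not EVOHUNT's implementation: the function names (`evolve`, `toy_audit`, `toy_evaluate`, `toy_revise`), the representation of a playbook as a list of strings, and the toy advisories are all assumptions made for illustration. In the real system each role would be an LLM agent running inside a harness, and the playbook format is unconstrained.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical representation: a playbook is a list of free-form entries.
Playbook = List[str]
Advisory = Dict[str, str]


def evolve(
    playbook: Playbook,
    advisories: List[Advisory],
    audit: Callable[[Playbook, Advisory], Optional[str]],
    evaluate: Callable[[Advisory, Optional[str]], bool],
    revise: Callable[[Playbook, List[Advisory]], Playbook],
    rounds: int,
) -> Tuple[Playbook, List[float]]:
    """Run audit -> evaluate -> revise for a fixed number of rounds."""
    history: List[float] = []
    for _ in range(rounds):
        # (i) Audit agent rolls out the current playbook on each advisory.
        findings = [audit(playbook, adv) for adv in advisories]
        # (ii) Evaluator scores each finding against ground truth.
        scores = [evaluate(adv, f) for adv, f in zip(advisories, findings)]
        history.append(sum(scores) / len(scores))
        # (iii) Reviser updates the playbook based on the failed cases.
        failures = [adv for adv, ok in zip(advisories, scores) if not ok]
        playbook = revise(playbook, failures)
    return playbook, history


# Toy stand-ins for the three agents (assumptions, not the paper's agents).
def toy_audit(playbook: Playbook, advisory: Advisory) -> Optional[str]:
    # "Finds" a bug only if the playbook already covers its class.
    return advisory["cls"] if advisory["cls"] in playbook else None


def toy_evaluate(advisory: Advisory, finding: Optional[str]) -> bool:
    # Ground truth here is simply the advisory's labeled class.
    return finding == advisory["cls"]


def toy_revise(playbook: Playbook, failures: List[Advisory]) -> Playbook:
    # Add one entry per vulnerability class that was missed.
    updated = list(playbook)
    for adv in failures:
        if adv["cls"] not in updated:
            updated.append(adv["cls"])
    return updated


# Made-up advisories; the identifiers and classes are illustrative only.
advisories = [
    {"id": "A1", "cls": "path-traversal"},
    {"id": "A2", "cls": "sql-injection"},
    {"id": "A3", "cls": "path-traversal"},
]

# Start from an empty playbook, as the abstract describes.
final_playbook, history = evolve(
    [], advisories, toy_audit, toy_evaluate, toy_revise, rounds=2
)
print(final_playbook, history)
```

In this toy setting the empty playbook matches nothing in the first round, the reviser adds the two missed classes, and the second round matches every advisory. The sketch only shows the control flow; it says nothing about how the actual reviser performs failure analysis or how held-out testing is done.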