AI Autonomous Cyber Attack Benchmarking
Abstract
Frontier AI systems are increasingly capable of interactive cybersecurity tasks, including codebase inspection, vulnerability detection, and exploitation. However, systematic evaluation of their offensive capabilities remains constrained by limited access to open, reproducible, multi-host cyber-range infrastructure. Existing public benchmarks capture important isolated skills, such as CTF solving, vulnerability reproduction, and exploit generation, but they often abstract away the operational structure of realistic intrusions: discovering exposed services, gaining an initial foothold, collecting internal information, and expanding compromise across networked hosts. This gap makes it difficult to observe emerging cyber risks early, because frontier AI systems are not routinely evaluated under conditions that preserve the end-to-end structure of realistic cyber attacks. In this paper, we introduce AGENTCYBERRANGE, the first open, multi-range evaluation infrastructure for measuring the autonomous cyber attack capability of frontier AI systems in realistic cyber ranges. AGENTCYBERRANGE consists of a benchmark suite with 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges containing 156 internal hosts, together with CAGE, an evaluation toolchain for scalable system execution, task orchestration, result collection, and automatic verification. The benchmark covers two core stages of realistic attacks: web exploitation and post-exploitation. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex achieves the highest success rates, solving 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks; with more concrete hints, these rates increase to 33.0% and 46.3%, respectively. We further observe that evaluated systems identify out-of-benchmark vulnerabilities, including previously unknown vulnerabilities in popular projects, and mutate payloads to bypass host defenses.
These results show that open, end-to-end cyber-range evaluation is necessary for observing emerging offensive capabilities under realistic and reproducible conditions.