AICHILLES: Uncovering AI-Evolved System Weaknesses

Arxiv pdf 2026-06-01T00:00:00

Abstract

The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 1260% score improvements over human-designed algorithms. While these results are promising, a practical concern is that AI-evolved programs can perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to identify hidden weaknesses in AI-evolved systems programs. To this end, we develop AICHILLES. Given a baseline program P and an AI-evolved program P[], AICHILLES searches for valid workloads where P[] regresses relative to P in correctness, runtime, memory usage, or output quality. To tackle the diversity in system applications, weakness types, and potential bugs, AICHILLES combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AICHILLES finds 49 distinct hidden weaknesses. We also show that explicitly including AICHILLES in the AI-driven development lifecycle can mitigate several of these weaknesses.
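To make the idea of a differential oracle concrete, the following is a minimal sketch and not the AICHILLES implementation. It checks correctness only and substitutes seeded random sampling for the constraint inference and coverage guidance described in the abstract. The names `baseline_sort`, `evolved_sort`, `differential_oracle`, and `find_regressions`, along with the toy sorting programs and the workload bounds, are all hypothetical and chosen for illustration.

```python
import random
from typing import Callable, List, Optional, Tuple

Program = Callable[[List[int]], List[int]]


def baseline_sort(xs: List[int]) -> List[int]:
    """Hypothetical baseline program P: a plain, correct sort."""
    return sorted(xs)


def evolved_sort(xs: List[int]) -> List[int]:
    """Hypothetical evolved program P[]: silently drops duplicates."""
    return sorted(set(xs))


def differential_oracle(
    p: Program, p_evolved: Program, workload: List[int]
) -> Optional[str]:
    """Return a description of the regression, or None if outputs agree."""
    expected = p(list(workload))
    try:
        actual = p_evolved(list(workload))
    except Exception as exc:  # a crash in the evolved program is a regression
        return f"crash: {exc!r}"
    if actual != expected:
        return f"output mismatch: expected {expected}, got {actual}"
    return None


def find_regressions(
    p: Program,
    p_evolved: Program,
    trials: int = 200,
    max_len: int = 8,      # assumed workload constraint: list length in [1, max_len]
    max_value: int = 9,    # assumed workload constraint: values in [0, max_value]
    seed: int = 0,
) -> List[Tuple[List[int], str]]:
    """Sample valid workloads and collect those where p_evolved regresses."""
    rng = random.Random(seed)
    found = []
    for _ in range(trials):
        n = rng.randint(1, max_len)
        workload = [rng.randint(0, max_value) for _ in range(n)]
        verdict = differential_oracle(p, p_evolved, workload)
        if verdict is not None:
            found.append((workload, verdict))
    return found


if __name__ == "__main__":
    regressions = find_regressions(baseline_sort, evolved_sort)
    print(f"{len(regressions)} regressing workloads found")
```

In this toy setting every reported workload contains a duplicate value, which is exactly the input class the evolved program mishandles. Extending the oracle to runtime, memory usage, or output quality would mean comparing measured costs against a tolerance instead of testing outputs for equality.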
