AICHILLES: Uncovering AI-Evolved System Weaknesses
Abstract
The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 1260% score improvements over human-designed algorithms. While these results are promising, there are practical concerns that AI-evolved programs may perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to uncover such hidden weaknesses in AI-evolved systems programs. To this end, we develop AICHILLES. Given a baseline program P and an AI-evolved program P[], AICHILLES searches for valid workloads on which P[] regresses relative to P in correctness, runtime, memory usage, or output quality. To handle the diversity of system applications, weakness types, and potential bugs, AICHILLES combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AICHILLES finds 49 distinct hidden weaknesses. We also show that explicitly including AICHILLES in the AI-driven development lifecycle can mitigate several of these weaknesses.
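
To make the idea of a differential oracle concrete, the following is a minimal Python sketch, not the AICHILLES implementation. It assumes programs are plain callables, that workloads come from some external generator, and that a caller-supplied `is_valid` predicate stands in for the inferred workload constraints. All names (`measure`, `find_regressions`, `Regression`) and the threshold parameters are hypothetical, and output-quality checks are omitted.

```python
import time
import tracemalloc
from typing import Any, Callable, Iterable, List, NamedTuple


class Regression(NamedTuple):
    workload: Any
    kind: str      # "correctness", "runtime", or "memory"
    detail: str


def measure(program: Callable[[Any], Any], workload: Any):
    """Run one program on one workload; return (output, seconds, peak bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    output = program(workload)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return output, elapsed, peak


def find_regressions(
    baseline: Callable[[Any], Any],
    evolved: Callable[[Any], Any],
    workloads: Iterable[Any],
    is_valid: Callable[[Any], bool],
    slowdown: float = 2.0,      # assumed threshold, not from the paper
    mem_blowup: float = 2.0,    # assumed threshold, not from the paper
) -> List[Regression]:
    """Report valid workloads where `evolved` regresses relative to `baseline`."""
    found: List[Regression] = []
    for w in workloads:
        if not is_valid(w):
            continue  # only valid workloads count as evidence of a weakness
        out_b, t_b, m_b = measure(baseline, w)
        out_e, t_e, m_e = measure(evolved, w)
        if out_e != out_b:
            found.append(Regression(w, "correctness", f"{out_e!r} != {out_b!r}"))
        elif t_e > slowdown * t_b:
            found.append(Regression(w, "runtime", f"{t_e:.6f}s vs {t_b:.6f}s"))
        elif m_e > mem_blowup * m_b:
            found.append(Regression(w, "memory", f"{m_e}B vs {m_b}B"))
    return found
```

For example, with `sorted` as the baseline and a hypothetical evolved variant `lambda xs: sorted(set(xs))`, the workload `[2, 2, 1]` is reported as a correctness regression because the evolved variant drops duplicates. Single-run timings are noisy, so a practical oracle would likely repeat measurements before reporting a runtime regression.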