SkillMutator: LLM Agent Skill Attacks
Abstract
Large language model (LLM) agents increasingly extend their capabilities at runtime by loading Agent Skills: composite artifacts that pair a natural-language specification, SKILL.md, with executable scripts and task-specific resources. Because a skill's behavior is determined jointly by natural-language instructions and executable code, assessing its safety requires reasoning across both modalities. This makes skills useful, but also creates a language-and-code cross-modal attack surface. An attacker can present a benign-looking workflow in SKILL.md while embedding implicit directives that steer the agent to exfiltrate sensitive files, even when the accompanying scripts and resources appear harmless. Despite the rapid growth of skill marketplaces, this attack surface remains understudied. Prior work typically treats skills either as prompt-injection vectors or as code artifacts for static scanning, leaving attacks that emerge from the interaction between the two modalities largely unmeasured. In our evaluation, an open-source skill scanner detects only 2% to 8% of such attacks, while a commercial scanner detects only 9% to 17%. To address this gap, we introduce SkillMutator, the first benchmark for install-time detection of language-and-code cross-modal attacks on Agent Skills. It emulates an adversarial skill-mutation process across 13 attack categories and iteratively refines malicious skills using scanner feedback, making injected behaviors difficult to distinguish from legitimate workflows. This benchmark enables systematic measurement of cross-modal attacks in realistic Agent Skill settings. We further propose a four-phase reasoning-trajectory distillation framework that distills frontier-teacher traces into smaller open-weight models through four structured reasoning stages, producing a locally deployable scanner that avoids third-party content exposure and excessive API cost. On the strongest subset of SkillMutator (n = 76), our scanner improves detection from 17.1% for the base model (Qwen2.5-Coder-7B-Instruct) to 88.2%, surpassing GPT-4o-mini (23.7%) and GPT-5.4-mini (79.0%) and reaching frontier-level GPT-5.4 (86.8%). These results show that practical defense against cross-modal attacks is feasible without relying on costly third-party frontier models.