UniAttack: Automated Multi-Layer LLM Jailbreaking

Arxiv pdf 2026-06-01T00:00:00

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UniAttack, an adversarial testing framework designed from a _defense-oriented perspective_ to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UniAttack extracts minimal but high-impact attack features from diverse existing attacks, optimizes them via a specialized attacker LLM, and composes them into flexible templates through an automated refinement process. This feature-centric construction enables one-shot attacks that generalize across multiple models and safety categories, providing a practical tool for assessing LLM robustness. Our evaluation results show that, compared to the baselines, UniAttack achieves an average attack success rate (ASR) improvement of 64.63%-248.82% on models deployed with multi-layered defense mechanisms, at only 0.03%-4.96% of the baselines' cost. The UniAttack artifact is available at https://anonymous.4open.science/r/UniAttack-Artifact-30F1.
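
An ASR improvement above 100% makes sense only as a relative gain over a baseline, since ASR itself cannot exceed 100%. The abstract does not state the exact formula, so the following is a minimal sketch that assumes ASR is the fraction of attempts judged successful and that improvement is measured relative to the baseline ASR. The function names and the outcome lists are hypothetical and chosen for illustration; they are not results from the paper.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attempts marked successful (True).

    Assumption: each attempt has already been judged as a boolean outcome.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def relative_improvement(new_asr, baseline_asr):
    """Return the relative gain of new_asr over baseline_asr, in percent.

    Assumption: "improvement" means (new - baseline) / baseline.
    """
    if baseline_asr <= 0:
        raise ValueError("baseline_asr must be positive")
    return (new_asr - baseline_asr) / baseline_asr * 100.0


# Made-up outcomes for illustration only (not data from the paper).
baseline_outcomes = [True, False, False, False]   # 1 of 4 succeed
candidate_outcomes = [True, True, True, False]    # 3 of 4 succeed

baseline_asr = attack_success_rate(baseline_outcomes)
candidate_asr = attack_success_rate(candidate_outcomes)
gain_percent = relative_improvement(candidate_asr, baseline_asr)

print(f"baseline ASR:  {baseline_asr:.2%}")
print(f"candidate ASR: {candidate_asr:.2%}")
print(f"relative improvement: {gain_percent:.2f}%")
```

Under this reading, moving from a 25% to a 75% ASR is a 200% relative improvement, which is how a reported figure can exceed 100%.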
