arXiv (Computer Science - Cryptography and Security) • 2w
HarmRLVR: Weaponizing Verifiable Rewards to Reverse LLM Safety Alignment
HarmRLVR is an attack framework that uses Reinforcement Learning with Verifiable Rewards (RLVR) to strip safety guardrails from Large Language Models (LLMs). Using the Group Relative Policy Optimization (GRPO) algorithm and a dataset of only 64 harmful prompts, an attacker can rapidly reverse alignment in open-source models including Llama, Qwen, and DeepSeek. In contrast to traditional harmful fine-tuning, HarmRLVR reaches a 96.01% attack success rate and a harmfulness score of 4.94 out of 5 while preserving the model's general and reasoning capabilities. This makes it an efficient route to producing uncensored, malicious content.