FlagThis - Cybersecurity news

arXiv (Computer Science - Cryptography and Security) • 2w

HarmRLVR: Weaponizing Verifiable Rewards to Reverse LLM Safety Alignment

Vulnerability Analysis • 🏢 Meta • #HarmRLVR #LLMSecurity #AIAlignment #Meta #DeepSeek

HarmRLVR is an attack framework that uses Reinforcement Learning with Verifiable Rewards (RLVR) to strip safety guardrails from Large Language Models (LLMs). Using the Group Relative Policy Optimization (GRPO) algorithm and a dataset of only 64 harmful prompts, attackers can rapidly reverse alignment in open-source models including Llama, Qwen, and DeepSeek. Compared with traditional harmful fine-tuning, HarmRLVR is reported to reach a 96.01% attack success rate and an average harmfulness score of 4.94 out of 5 while preserving the model's general and reasoning capabilities, making it a highly efficient way to produce models that generate uncensored, malicious content.
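For readers unfamiliar with GRPO, the step that gives the algorithm its name is the group-relative advantage: several responses are sampled for one prompt, each receives a scalar reward, and every reward is normalized against the mean and spread of its own group. The sketch below shows only that generic normalization. It is not taken from the HarmRLVR paper; the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread.

    Hypothetical helper for illustration: `rewards` holds one scalar score
    per sampled response to the same prompt. `eps` (an assumed constant)
    avoids division by zero when every response scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 or 0.0 by a verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above their group's average receive a positive advantage and are reinforced, while below-average responses are discouraged, so no separately trained value model is needed.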

Links: arXiv (Computer Science - Cryptography and Security), ResearchGate, Adwardlee, Ai, OpenReview, DBLP, ZDNET

