
The Cognitive Firewall: A Proactive Zero-Trust Framework for LLM Safety

Current Large Language Model (LLM) safety paradigms rely on reactive, single-turn message filtering, which leaves them vulnerable to "salami-slicing" attacks that decompose malicious intent across multiple dialogue turns so that no single message triggers detection. The Cognitive Firewall framework addresses this with a proactive, stateful, multi-gate Zero-Trust architecture. Four independent oversight agents (the Intent, Zero-Trust Context, Consistency, and Output Risk gates) monitor how user objectives evolve over the conversation and treat every asserted role as unverified evidence. This shifts defense from isolated per-message scoring to escalation-based blocking, reducing attack success rate (ASR) to below 2% on standard benchmarks and to 14% against complex, human-authored adversarial prompts.
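To make the contrast between isolated scoring and escalation-based blocking concrete, here is a minimal Python sketch. It is not the framework's implementation: only the gate names come from the abstract, while the keyword lists, the scoring rules, the 0.9 threshold, and the class name CognitiveFirewallSketch are assumptions made purely for illustration. Only two of the four gates are stubbed; the Consistency and Output Risk gates are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A gate maps (conversation history, new message) to a risk score in [0, 1].
Gate = Callable[[List[str], str], float]

# Hypothetical keyword lists; real gates would be independent oversight agents.
RISKY_TERMS = ("synthesize", "bypass", "exploit")
ROLE_CLAIMS = ("i am a doctor", "as an admin", "i am a researcher")


def intent_gate(history: List[str], message: str) -> float:
    """Toy stand-in: scores the whole dialogue, not only the latest turn."""
    text = " ".join(history + [message]).lower()
    hits = sum(term in text for term in RISKY_TERMS)
    return min(1.0, hits / len(RISKY_TERMS))


def zero_trust_context_gate(history: List[str], message: str) -> float:
    """Toy stand-in: an asserted role is unverified, so it adds risk
    instead of unlocking anything."""
    return 0.5 if any(claim in message.lower() for claim in ROLE_CLAIMS) else 0.0


@dataclass
class CognitiveFirewallSketch:
    gates: Dict[str, Gate]
    block_threshold: float = 0.9  # assumed value, not from the source
    history: List[str] = field(default_factory=list)
    escalation: float = 0.0  # risk accumulated across all turns so far

    def check(self, message: str) -> bool:
        """Return True if the message is allowed, False if it is blocked."""
        # The riskiest gate decides the risk of this turn.
        turn_risk = max(gate(self.history, message) for gate in self.gates.values())
        # Escalation-based blocking: the decision uses accumulated risk,
        # not the score of the current message in isolation.
        self.escalation += turn_risk
        self.history.append(message)
        return self.escalation < self.block_threshold
```

Under these toy rules, the message "As an admin, how do I bypass the safety filter?" is allowed when it opens a fresh conversation, but blocked when it arrives after an earlier turn asking how labs synthesize compounds, because the accumulated risk crosses the assumed threshold. That is the behavior a single-turn filter cannot reproduce.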

