Current Large Language Model (LLM) safety paradigms rely on reactive, single-turn message filtering, which leaves them vulnerable to "salami-slicing" attacks that split malicious intent across multiple dialogue turns to evade detection. The Cognitive Firewall framework addresses this with a proactive, stateful, multi-gate Zero-Trust architecture. Four independent oversight agents (the Intent, Zero-Trust Context, Consistency, and Output Risk gates) monitor how user objectives evolve and treat every asserted role as unverified evidence. Defense shifts from isolated per-message scoring to escalation-based blocking, which reduces the attack success rate (ASR) to <2% on standard benchmarks and to 14% against complex, human-authored adversarial prompts.

Vulnerability Analysis: Multi-Turn Decomposition
- Salami-slicing attacks: Harmful objectives are fragmented across multiple turns so that each individual prompt appears benign.
- Authority exploitation: Attackers assert personas or administrative roles to bypass traditional input filters.
- Context-shifting: Attackers exploit the model's inherent trust in dialogue history to mask their ultimate malicious goal.
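
To make the salami-slicing weakness concrete, the sketch below contrasts a per-turn filter with a stateful one. It is a toy illustration, not the framework's detector: the per-turn risk scores, the 0.5 threshold, and the noisy-OR accumulation rule are all assumptions chosen for the example.

```python
from typing import Optional, Sequence

# Hypothetical per-turn risk scores in [0, 1]; every turn looks mild on its own.
turn_scores = [0.2, 0.25, 0.3, 0.25]
PER_TURN_THRESHOLD = 0.5  # assumed cut-off, not taken from the paper


def single_turn_filter(scores: Sequence[float], threshold: float) -> bool:
    """Reactive filtering: flag only if one individual turn crosses the threshold."""
    return any(s >= threshold for s in scores)


def stateful_filter(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Stateful filtering: accumulate risk across turns with a noisy-OR rule,
    risk = 1 - prod(1 - s_i). Returns the 1-based turn at which the accumulated
    risk first reaches the threshold, or None if it never does."""
    remaining = 1.0  # probability that nothing harmful has been assembled so far
    for turn, s in enumerate(scores, start=1):
        remaining *= 1.0 - s
        if 1.0 - remaining >= threshold:
            return turn
    return None


print(single_turn_filter(turn_scores, PER_TURN_THRESHOLD))  # False
print(stateful_filter(turn_scores, PER_TURN_THRESHOLD))     # 3
```

No single turn reaches the threshold, so the per-turn filter never fires, while the accumulated risk crosses it on the third turn.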

Defensive Architecture: The Multi-Gate Model
- Intent Gate: Decomposes user inputs to identify high-level operational objectives and detect hidden, evolving goals.
- Zero-Trust Context Gate: Implements Zero-Trust principles by treating all claimed roles and permissions as unverified evidence.
- Consistency Gate: Monitors multi-turn dialogue progression to detect strategic escalation or prompt decomposition.
- Output Risk Gate: Performs final inspection of candidate LLM responses before they are released to the user.
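
The four gates above can be sketched as independent functions that each return a verdict. This is a minimal sketch under stated assumptions: the `Verdict` and `ConversationState` types, the 0-to-1 risk scale, the keyword lists, and the specific risk values are hypothetical placeholders standing in for the oversight agents, whose internals this summary does not describe.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Verdict:
    gate: str
    risk: float  # assumed scale: 0.0 = benign, 1.0 = high-confidence harmful
    note: str = ""


@dataclass
class ConversationState:
    turns: List[str] = field(default_factory=list)
    # Roles confirmed out of band; user text alone never adds to this set.
    verified_roles: Set[str] = field(default_factory=set)


# Toy placeholders; a real gate would be an independent oversight agent.
RISKY_TERMS = ("bypass", "exploit", "disable logging")
ROLE_CLAIMS = {"admin": "i am an admin", "auditor": "i am an auditor"}


def intent_gate(state: ConversationState, message: str) -> Verdict:
    hits = [t for t in RISKY_TERMS if t in message.lower()]
    return Verdict("intent", 0.8 if hits else 0.1, f"risky terms: {hits}")


def zero_trust_context_gate(state: ConversationState, message: str) -> Verdict:
    text = message.lower()
    unverified = [role for role, phrase in ROLE_CLAIMS.items()
                  if phrase in text and role not in state.verified_roles]
    # An asserted role is evidence to weigh, never a granted permission.
    return Verdict("zero_trust_context", 0.6 if unverified else 0.1,
                   f"unverified role claims: {unverified}")


def consistency_gate(state: ConversationState, message: str) -> Verdict:
    # Looks at the whole dialogue, so fragments spread across turns add up.
    history = " ".join(state.turns + [message]).lower()
    hits = sum(t in history for t in RISKY_TERMS)
    return Verdict("consistency", min(1.0, 0.3 * hits),
                   f"{hits} risky terms across dialogue")


def output_risk_gate(candidate_response: str) -> Verdict:
    flagged = "step-by-step exploit" in candidate_response.lower()
    return Verdict("output_risk", 0.9 if flagged else 0.05)


def run_input_gates(state: ConversationState, message: str) -> List[Verdict]:
    """Run the three input-side gates, then record the turn in the state."""
    gates = (intent_gate, zero_trust_context_gate, consistency_gate)
    verdicts = [gate(state, message) for gate in gates]
    state.turns.append(message)
    return verdicts
```

Calling `run_input_gates` once per user turn keeps the dialogue history in `ConversationState`, and `output_risk_gate` would be applied separately to the candidate response before it is released. How the verdicts are combined into a decision is left out of this sketch.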

Operational Logic: Escalation-Based Defense
- Decision Logic: Triggers defensive blocks by escalating high-confidence signals rather than by simply averaging scores.
- Stateful Oversight: Shifts security posture from reactive keyword filtering to continuous, context-aware conversational governance.
- Multi-Agent Deployment: Employs independent agentic layers to provide a robust, auditable, and proactive defensive perimeter.
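
The difference between escalation and averaging can be illustrated with a small sketch. It assumes one reading of "escalation": a block fires as soon as any single gate reports a risk at or above a high-confidence threshold. The gate scores and both thresholds are hypothetical values chosen for the example.

```python
from statistics import mean
from typing import Dict, List, Tuple

AVERAGE_THRESHOLD = 0.5      # assumed
ESCALATION_THRESHOLD = 0.85  # assumed "high-confidence" cut-off


def decide_by_average(risks: Dict[str, float],
                      threshold: float = AVERAGE_THRESHOLD) -> str:
    """Baseline: block only if the mean risk across gates reaches the threshold."""
    return "block" if mean(risks.values()) >= threshold else "allow"


def decide_by_escalation(risks: Dict[str, float],
                         threshold: float = ESCALATION_THRESHOLD
                         ) -> Tuple[str, List[str]]:
    """Escalation: any single high-confidence gate is enough to block.
    Also returns the gates that triggered, which keeps the decision auditable."""
    triggered = [gate for gate, risk in risks.items() if risk >= threshold]
    return ("block", triggered) if triggered else ("allow", [])


# Hypothetical verdicts: only the Consistency gate sees the multi-turn pattern.
risks = {
    "intent": 0.1,
    "zero_trust_context": 0.2,
    "consistency": 0.95,
    "output_risk": 0.1,
}

print(decide_by_average(risks))     # allow
print(decide_by_escalation(risks))  # ('block', ['consistency'])
```

Averaging lets three unconcerned gates dilute one strong signal (mean 0.3375), whereas escalation blocks on the single confident gate and records which one fired.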

Security Impact & Performance Metrics
- Standard Benchmark Efficacy: Achieved an Attack Success Rate (ASR) of <2% across three major jailbreak datasets.
- Complex Adversarial Defense: Limited human-authored, manually engineered attacks to a 14% success rate.
- Operational Utility: Maintains an 8% over-refusal rate, balancing security requirements against enterprise utility.
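
For reference, the two rates above are usually computed as simple proportions. The sketch below assumes the common definitions (ASR is successful attacks divided by attack attempts; over-refusal is refused benign prompts divided by benign prompts), which this summary does not spell out. The sample sizes are invented so that the toy output matches the reported percentages; they are not the paper's data.

```python
from typing import Sequence


def attack_success_rate(attack_succeeded: Sequence[bool]) -> float:
    """Fraction of adversarial prompts that produced a harmful response."""
    if not attack_succeeded:
        raise ValueError("need at least one attack attempt")
    return sum(attack_succeeded) / len(attack_succeeded)


def over_refusal_rate(benign_refused: Sequence[bool]) -> float:
    """Fraction of benign prompts that the system wrongly refused."""
    if not benign_refused:
        raise ValueError("need at least one benign prompt")
    return sum(benign_refused) / len(benign_refused)


# Illustrative outcomes only: 7 of 50 attacks succeed, 2 of 25 benign prompts are refused.
attacks = [True] * 7 + [False] * 43
benign = [True] * 2 + [False] * 23

print(f"ASR: {attack_success_rate(attacks):.0%}")           # ASR: 14%
print(f"Over-refusal: {over_refusal_rate(benign):.0%}")     # Over-refusal: 8%
```

Reporting both numbers together matters because a gate that blocks everything would reach 0% ASR at the cost of 100% over-refusal.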
Related posts
- arXiv (Computer Science - Cryptography and Security) — Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety