Research detailed in arXiv:2606.27567 identifies a fundamental architectural flaw in shared-embedding sequence models where instructions and data are processed via a unified attention-aggregation pipeline. This "instruction-data conflation" mirrors the Von Neumann architecture's overlap of code and data, rendering prompt injection a structural vulnerability rather than a patchable alignment bug. Mathematical proofs utilizing Total Variation Distance (TVD) demonstrate the impossibility of Semantic-Faithful Control (SFC), proving that trusted instructions and untrusted data are statistically inseparable. This flaw enables authoritative action hijacking, including refusal bypasses and unauthorized tool execution, effectively neutralizing current in-pipeline classifiers and alignment-based defenses.
-
Threat Model: Instruction-Data Conflation
- Defines Prompted Action Models (PAMs) as systems where control signals (instructions) and variable inputs (data) share a single embedding space.
- Draws a direct parallel to the Von Neumann architecture, where the lack of separation between code and data enabled decades of buffer overflow exploits.
- Establishes that prompt injection is a systemic property of the architecture, not a failure of the training set or RLHF alignment.
-
Attack Mechanics: The Invariance Gap
- Utilizes Total Variation Distance (TVD) to prove the mathematical impossibility of provenance recovery, meaning the model cannot reliably distinguish the source of a token.
- Identifies the "Invariance Gap," where semantic-equivalence classes allow adversarial inputs to bypass finite training sets and classifiers.
- Employs attention-map analysis to demonstrate "Control-Path Exposure," showing how untrusted data can hijack the model's internal attention mechanism to trigger authoritative actions.
-
Systemic Security Impact
- High success rates in hijacking "authoritative actions," resulting in unauthorized tool execution, memory writes, and refusal bypasses.
- Demonstrates that current state-of-the-art in-pipeline classifiers fail when faced with semantic-equivalent adversarial inputs.
- Quantifies representation overlap across production tokenizers, proving that trusted and untrusted streams are processed identically at the latent level.
-
Countermeasures and Architectural Requirements
- Argues that "in-pipeline" defenses (filtering and alignment) are insufficient due to the finite-coverage invariance gap.
- Proposes a fundamental shift toward the physical or logical separation of instruction and data channels.
- Suggests implementing architectural boundaries similar to Data Execution Prevention (DEP) or ASLR to isolate control paths from user-supplied data.
-
Conclusion: The Path to Robust Agentic AI
- Concludes that shared-embedding models are inherently insecure for high-stakes agentic workflows.
- Asserts that true security requires a move away from single-stream sequence processing for control logic.
- Warns that until architectural separation is achieved, prompt injection remains a permanent risk factor.
Related posts
- arXiv (Computer Science - Cryptography and Security) — TEMPO-Diffusion: Temporally Exposed Malicious Poisoning of Diffusion Models
- arXiv (Computer Science - Cryptography and Security) — HauntAttack: When Attack Follows Reasoning as a Shadow
- arXiv (Computer Science - Cryptography and Security) — On the Inseparability of Instructions and Data in Shared-Embedding Sequence Models
- arXiv (Computer Science - Cryptography and Security) — Decomposing Memorization Reduction in Privacy-Preserving Fine-Tuning of SLMs for CSIRTs
- Hacking-and-security
- Themoonlight
- Rdworldonline
- Youtube
- Assets
- Xhan77
- Proceedings
- Tempo
- Huggingface
- Roboticscenter
- Emergentmind
- Github
- Openaccess
- Openreview
- Alphaxiv
- Dailysecurity
- Roboticscenter
- Stat
- Aclanthology
- Mansisak
- Openreview