VLM Visual-Token Side Channels
Abstract
A vision encoder compresses image pixels into semantic embeddings, and in doing so it acts as an implicit privacy boundary between the image and the language model: the resulting states emphasize semantic content and attenuate the pixel-local detail needed for exact text recovery. Encoder-free vision-language models (VLMs) remove this boundary, routing image patches directly into the language-model token stream. We show that this design choice exposes an architectural privacy attack surface: the intermediate visual tokens form a pre-output side channel. Under a token-access adversary, decoders invert the visual-token streams of two encoder-free VLMs, Gemma4 and Fuyu, into recognizable image structure and readable held-out access codes (top-$k$ exact 21/24 and 22/24, and 42/48 and 46/48 on an independent larger split), while matched encoder-based controls localize the target region but recover no exact strings: Qwen3-VL and InternVL on both splits (0/24 and 0/48), and LLaVA-1.5 on the larger split (0/48). Controlled within-model ablations identify the operative variable as the spatial sampling fidelity of the visual-token grid, specifically character-direction sampling density, rather than token or value count (Fisher $p = 6.52 \times 10^{-7}$, channel projection vs. spatial pooling). The channel is not confined to exported tokens: Gemma4 layer-0 key-value cache tensors are themselves directly invertible ( __ grad = 0.4202 vs. 0.0045 shuffled), placing the side channel on the key-value cache that production serving stacks persist for decoding efficiency. The channel survives clutter and realistic document degradation, transfers zero-shot to public document images, and resists value-level defenses such as additive noise and quantization; mitigation must instead reduce the spatial sampling. The vision encoder thus functions as a privacy boundary whose removal should be treated as a first-class privacy decision in VLM deployment.
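The contrast between channel projection and spatial pooling can be pictured with a toy grid. The sketch below is illustrative only and is not the ablation code behind the reported numbers: the helper names (`spatial_pool`, `channel_project`, `value_count`), the grid dimensions, and the random projection matrix are all assumptions, and the character direction is assumed to be the grid width, as for horizontal text. It shows two reductions that keep the same number of values while differing in how densely the width is sampled.

```python
import random

def spatial_pool(grid, factor):
    """Average `factor` adjacent tokens along the width (assumed character direction)."""
    pooled = []
    for row in grid:
        new_row = []
        for start in range(0, len(row), factor):
            group = row[start:start + factor]
            dim = len(group[0])
            new_row.append([sum(tok[d] for tok in group) / len(group) for d in range(dim)])
        pooled.append(new_row)
    return pooled

def channel_project(grid, matrix):
    """Map every token through `matrix` (out_dim x in_dim); the grid shape is unchanged."""
    return [[[sum(w * x for w, x in zip(w_row, tok)) for w_row in matrix]
             for tok in row] for row in grid]

def value_count(grid):
    """Total number of scalar values held by the token grid."""
    return sum(len(tok) for row in grid for tok in row)

# Hypothetical toy dimensions: 4 rows, 16 columns, 8 channels per token.
rng = random.Random(0)
H, W, C = 4, 16, 8
grid = [[[rng.gauss(0, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]
proj = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(C // 2)]

pooled = spatial_pool(grid, 2)          # half the columns, full channel width
projected = channel_project(grid, proj) # all columns, half the channel width

print(value_count(pooled), value_count(projected))  # equal value budgets
print(len(pooled[0]), len(projected[0]))            # columns per row differ
```

Both outputs carry the same value budget, so any difference in recoverability between them would have to come from the width sampling density rather than from the amount of data retained.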