Split-View PDF Attacks on LLMs

Arxiv pdf 2026-06-01T00:00:00
arXiv Paper — PDF not available. Only the Executive Summary is available here. To read or download the full paper, visit the arXiv abstract page.

Abstract

Document-to-LLM applications typically read uploaded PDFs by first translating them into text through a hidden extraction layer that users cannot observe or audit. We show that this layer enables split-view PDFs : one document can have two semantic views before model reasoning. By mining specification-permitted or implementation-tolerated representation gaps at the PDF render/extract boundary, we instantiate 25 extraction gaps (EG) in which extractors return attacker-controlled or extractor-dependent text while the rendered page shows benign or different content. The gaps form four families: semantic overrides, hidden semantic injection, reading-order splits, and font-decoding splits, and 14 gaps have no exact path/mechanism-level match in prior PDF-to-LLM attacks. We evaluate these gaps on 16 PDF processing stacks and 7 commercial LLM services that accept PDFs through official APIs and web chat. Each gap causes render-extract divergence on at least one stack. Under a gap-level exposure criterion, every evaluated service exposes at least one gap, with 12/25 to 21/25 exposed gaps. Across deployment variants, exposure is driven mainly by the ingestion stack—not model identity alone—that constructs the models document context: APIs, cloud backends, web frontends, and local runtimes can expose different views of the same PDF. We further show that tested safety filters cover only selected hidden-text constructions. To support triage, we also develop a static screening scanner whose rules trigger on all 25 benchmark gaps in our self-test, and we discuss dual-view consistency as a longer-term defense direction.

Loading executive summary...

LINK COPIED TO CLIPBOARD