Software Dark Matter: SBOM Visibility Gaps
Abstract
Modern software supply chains have evolved into vast, heterogeneous networks where transparency, the granular understanding of all software components, is now a critical security requirement. While Software Bills of Materials (SBOMs) have emerged as the primary mechanism for this transparency, current industry practices rely on a metadata-centric paradigm that assumes an artifact is defined solely by its package manager declarations. We posit that this assumption is fundamentally flawed, creating a systemic visibility gap we define as Software Dark Matter (SDM). SDM represents the set of security-critical files present in an artifact's filesystem that are unaccounted for by its associated metadata. While various efforts in industry and academia have attempted to close this gap through numerous heuristics, there is no generalized model that allows a comprehensive exploration of this phenomenon. This is primarily because these studies focus on the effect (i.e., lack of transparency) rather than exploring its causes. Further, without a clear understanding of such causes, it is difficult to establish methods that address this lack of transparency. In this paper, we hypothesize that existing solutions and operational pipelines fail for systemic reasons, in particular the informational divergence between a software manifest and its physical reality. We implement a reference tool, DARKFILES, and use it to analyze four disjoint ecosystems: DockerHub, Maven Central, plugin/extension marketplaces (Jenkins plugins and OpenVSX), and a real-world enterprise environment.
Software Bills of Materials (SBOMs) [2] have established themselves as a mechanism to provide much-needed _transparency_ regarding the software stacks provided by vendors.
The guiding principle behind SBOMs is that, when software vendors disclose a comprehensive view of the software components included in their products, software consumers are able to take adequate action to minimize their attack surface. Thus, SBOMs are increasingly treated as a foundation for software supply-chain security [25]. They are used to determine exposure during incident response (e.g., react4shell), to drive governance workflows, and, importantly, to justify risk claims about deployed artifacts.
Our research makes the following contributions. First, we introduce a general-purpose metric for artifact fidelity, calculating SDM as the ratio of untracked files to the total file count. Second, we introduce Packaging Lag, a phenomenon where official metadata remains out of date across multiple versions before catching up to an artifact's actual content. Third, we demonstrate that SDM exposes vulnerable software that is invisible to SBOM-driven pipelines, both by cross-referencing untracked packages against known CVE databases and through the direct discovery of three confirmed high-severity CVEs. Finally, we show that SDM is highly correlated with the presence of sensitive information, including secrets and cryptographic keys.
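To make the SDM ratio concrete, the following minimal sketch computes the fraction of files in an artifact that its metadata does not account for. It is an illustration under stated assumptions, not the DARKFILES implementation: the function name `sdm_ratio`, the path normalization step, and the input format (two plain lists of POSIX paths) are hypothetical, and the sketch counts every untracked file rather than filtering for security-critical ones.

```python
import posixpath


def sdm_ratio(filesystem_files, tracked_files):
    """Return (ratio, untracked) for one artifact.

    filesystem_files: paths actually present in the artifact's filesystem.
    tracked_files: paths claimed by the artifact's metadata (e.g. package
        manager file lists). Both input formats are assumptions made for
        this sketch.
    """
    # Normalize so that "/usr/lib/./x" and "/usr/lib/x" compare equal.
    present = {posixpath.normpath(p) for p in filesystem_files}
    tracked = {posixpath.normpath(p) for p in tracked_files}

    # Untracked files: present on disk but absent from the metadata.
    untracked = present - tracked

    # An empty artifact has no dark matter by convention (assumption).
    if not present:
        return 0.0, untracked
    return len(untracked) / len(present), untracked


if __name__ == "__main__":
    on_disk = [
        "/usr/bin/app",
        "/usr/lib/libfoo.so",
        "/opt/vendor/blob.bin",
        "/etc/app.conf",
    ]
    declared = ["/usr/bin/app", "/usr/lib/./libfoo.so"]
    ratio, dark = sdm_ratio(on_disk, declared)
    print(f"SDM ratio: {ratio:.2f}")
    for path in sorted(dark):
        print("untracked:", path)
```

In this toy input, two of the four files on disk are declared by the metadata, so half of the artifact would be invisible to a pipeline that trusts the manifest alone.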