MASCOT-Android Source Code Collection

Arxiv pdf 2026-06-01T00:00:00

Abstract

Compared with binaries and decompiled code, malware source code more directly reflects the attacker's original intent. However, the scarcity of source code and the high cost of manual review make such datasets difficult to build and maintain. We propose MASCOT-Android, a curated dataset of Android malware source code, together with an automated collection framework for scalable discovery of malware source code on GitHub. A key finding of our work is that repository-level documentation alone provides a strong signal for malware source code collection. Our model extracts character-level TF-IDF features from 8,772 malware and 25,747 benign README documents and trains a LinearSVC classifier to distinguish malware repositories from benign ones. This README-only model achieves an accuracy of 96.28% and a false positive rate (FPR) of 1.06% in local evaluation. In addition, the model outputs confidence scores, allowing users to adjust the decision threshold to balance FPR against coverage, which is practical in real-world malware source code collection. We conducted two case studies. In the first, we constructed an Android malware code-reuse graph and combined it with LLM-based code detection to assess traces of LLM assistance in malware development. The results suggest that LLMs are already contributing, at least to some extent, to the development and propagation of malware. In the second, we performed symbolic information ablation experiments, gradually removing different types of symbolic information from malware source code to assess their impact on malware detection performance. This study shows that import statements carry highly informative signals because they are related to API usage, whereas comments and class names have limited discriminative value. In summary, we present a curated dataset of Android malware source code and an automated collection model, and our case studies highlight the value of source code for studying both LLM-assisted malware development and the role of symbolic information in malware detection.
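
The README-only classifier described in the abstract can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the toy README strings, the n-gram range, `sublinear_tf`, `C`, and the helper names `build_classifier` and `flag_repositories` are all assumptions made for illustration. The abstract states only that character-level TF-IDF features feed a LinearSVC whose confidence scores can be thresholded.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for README texts (hypothetical, not taken from the dataset).
# Label 1 = malware repository, 0 = benign repository.
TRAIN_READMES = [
    ("Android RAT with SMS stealer, keylogger and hidden C2 panel", 1),
    ("Banking trojan builder: overlay injection and SMS interception", 1),
    ("Ransomware sample for Android that encrypts files and locks the screen", 1),
    ("A simple to-do list app built with Jetpack Compose", 0),
    ("Weather widget for Android using a public forecast API", 0),
    ("Open-source podcast player with offline downloads", 0),
]


def build_classifier():
    """Character-level TF-IDF features followed by a linear SVM.

    The hyperparameters here are assumptions, not values from the paper.
    """
    return make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4), sublinear_tf=True),
        LinearSVC(C=1.0, random_state=0),
    )


def flag_repositories(model, readmes, threshold=0.0):
    """Return (text, score, flagged) triples.

    The score is the signed distance to the separating hyperplane. Raising
    the threshold flags fewer repositories (lower FPR, lower coverage).
    """
    scores = model.decision_function(readmes)
    return [(text, float(s), bool(s >= threshold)) for text, s in zip(readmes, scores)]


texts, labels = zip(*TRAIN_READMES)
model = build_classifier().fit(list(texts), list(labels))

candidates = [
    "Android spyware with remote control panel and SMS stealer",
    "Minimal notes app for Android with Markdown support",
]
for text, score, flagged in flag_repositories(model, candidates, threshold=0.0):
    print(f"{score:+.3f}  flagged={flagged}  {text}")
```

A threshold of 0.0 reproduces the classifier's default decision boundary. In a collection setting, one would presumably pick the threshold on held-out data so that the measured FPR stays below a chosen budget.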
