MIDI Seminar: Stefano Zacchiroli
Seminar title and speaker
Software Heritage: Analyzing the Global Graph of Public Software Development.
Stefano Zacchiroli, IRIF laboratory, Université de Paris
Date and venue
Tuesday, March 24, 2020, 4 p.m.
The Software Heritage project has assembled the largest existing archive of publicly available software source code and associated development history: more than 6 billion unique source code files and 1 billion unique commits, coming from more than 90 million software development projects.
In this talk we will review the project's background, current status, and future directions, with a focus on its graph-based data model and its exploitation. The archive is a Merkle DAG whose nodes stand for source code development artifacts such as source files, code trees, commits, releases, and version control system (VCS) snapshots. The graph is typed, fully deduplicated, and global, which makes it possible to keep track of all the different places (e.g., different VCS repositories) from which a given artifact has been distributed. The graph is big, with about 200 billion edges and 20 billion nodes, and it is growing exponentially, doubling every 2 years. The graph's network topology and growth dynamics are being studied, but they are still largely unknown at this stage.
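To make the data model above more concrete, here is a minimal sketch of a typed, deduplicated Merkle DAG in Python. It is an illustration only, not Software Heritage's actual implementation: the `MerkleDAG` class, the SHA-1 hashing scheme, the node type names, and the example.org URLs are all assumptions made for this example. Each node's identifier is derived from its type, its payload, and the identifiers of its children, so identical artifacts imported from different repositories collapse into a single node.

```python
import hashlib


class MerkleDAG:
    """Toy content-addressed graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.nodes = {}    # node ID -> (type, payload, sorted tuple of child IDs)
        self.origins = {}  # snapshot node ID -> set of origin URLs

    def add(self, node_type, payload=b"", children=()):
        """Insert a node and return its ID; identical nodes are stored once."""
        children = tuple(sorted(children))
        h = hashlib.sha1()
        h.update(node_type.encode() + b"\0" + payload + b"\0")
        for child_id in children:
            h.update(child_id.encode())
        node_id = h.hexdigest()
        self.nodes.setdefault(node_id, (node_type, payload, children))
        return node_id

    def add_origin(self, url, snapshot_id):
        """Record that a snapshot was observed at the given origin URL."""
        self.origins.setdefault(snapshot_id, set()).add(url)

    def reaches(self, source_id, target_id):
        """Depth-first search: is target_id reachable from source_id?"""
        stack, seen = [source_id], set()
        while stack:
            node_id = stack.pop()
            if node_id == target_id:
                return True
            if node_id not in seen:
                seen.add(node_id)
                stack.extend(self.nodes[node_id][2])
        return False

    def origins_of(self, node_id):
        """All origin URLs whose snapshot contains the given artifact."""
        return sorted(
            url
            for snapshot_id, urls in self.origins.items()
            if self.reaches(snapshot_id, node_id)
            for url in urls
        )


def import_repo(dag, url):
    """Import the same tiny repository (one file, one commit) from a URL."""
    file_id = dag.add("file", b"hello world\n")
    tree_id = dag.add("tree", children=[file_id])
    commit_id = dag.add("commit", b"initial commit", [tree_id])
    snapshot_id = dag.add("snapshot", children=[commit_id])
    dag.add_origin(url, snapshot_id)
    return file_id


dag = MerkleDAG()
file_a = import_repo(dag, "https://example.org/repo-a")  # hypothetical origin
file_b = import_repo(dag, "https://example.org/repo-b")  # hypothetical origin

print(file_a == file_b)        # both imports yield the same file node
print(len(dag.nodes))          # the DAG holds 4 nodes, not 8
print(dag.origins_of(file_a))  # both origins are recorded for the file
```

Because identifiers depend only on content, importing the same history twice adds no new nodes. Only the origin records grow, and these are what let the sketch answer "where has this artifact been distributed from?".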
We will discuss the state of the art in operating, analyzing, and querying the Software Heritage graph, as well as early results in applying graph compression techniques to make it more manageable. We will conclude with an in-depth discussion of open questions, challenges, and actionable research directions.