-
ActionsRemaker: Reproducing GitHub Actions
Hao-Nan Zhu,
Kevin Z. Guan,
Robert M. Furth,
Cindy Rubio-González
Abstract: Mining Continuous Integration and Continuous Delivery (CI/CD) has enabled new
research opportunities for the software engineering (SE) research community. However, it remains
a challenge to reproduce CI/CD build processes, which is crucial for several areas of research
within SE such as fault localization and repair. In this paper, we present ActionsRemaker, a
reproducer for GitHub Actions builds. We describe the challenges of reproducing GitHub Actions
builds and the design of ActionsRemaker. Evaluation of ActionsRemaker demonstrates its ability
to reproduce fail-pass pairs: of 180 pairs from 67 repositories, 130 (72.2%) from 43
repositories are reproducible. We also discuss reasons for unreproducibility. ActionsRemaker is
publicly available at
https://github.com/bugswarm/actions-remaker,
and a demo of the tool can be found at
https://youtu.be/flblSqoxeA.
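The abstract above does not describe ActionsRemaker's internals, so the following Python sketch is an illustration only of the general idea of replaying a GitHub Actions job inside a pinned container; the workflow path, job name, and base image are hypothetical placeholders, and this is not how ActionsRemaker itself is implemented.

```python
# Minimal sketch (not ActionsRemaker): replay the `run:` steps of one GitHub
# Actions job inside a pinned Docker image. Assumes PyYAML and a local Docker
# daemon; the workflow path, job id, and image are placeholder assumptions.
import subprocess
import yaml  # pip install pyyaml

WORKFLOW_FILE = ".github/workflows/ci.yml"  # hypothetical workflow path
JOB_NAME = "build"                          # hypothetical job id
IMAGE = "ubuntu:22.04"                      # pinned base image for reproducibility


def extract_run_commands(workflow_file: str, job: str) -> list[str]:
    """Collect the shell commands of a job's `run:` steps."""
    with open(workflow_file) as f:
        workflow = yaml.safe_load(f)
    steps = workflow["jobs"][job]["steps"]
    return [step["run"] for step in steps if "run" in step]


def replay_in_container(commands: list[str], image: str, repo_dir: str) -> int:
    """Run the collected commands sequentially inside a fresh container."""
    script = " && ".join(commands)
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{repo_dir}:/workspace", "-w", "/workspace",
         image, "bash", "-lc", script],
        check=False,
    )
    return result.returncode  # non-zero suggests the build did not reproduce


if __name__ == "__main__":
    cmds = extract_run_commands(WORKFLOW_FILE, JOB_NAME)
    exit_code = replay_in_container(cmds, IMAGE, repo_dir=".")
    print(f"replayed {len(cmds)} steps, exit code {exit_code}")
```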
-
On the Reproducibility of Software Defect Datasets
Hao-Nan Zhu,
Cindy Rubio-González
Abstract: Software defect datasets are crucial to facilitating the evaluation and
comparison of techniques in fields such as fault localization, test generation, and
automated program repair. However, software defect artifacts are not immune to breakage. In
this paper, we conduct a study on the reproducibility of software
defect artifacts. First, we study five state-of-the-art Java defect datasets. Despite the
multiple strategies applied by dataset maintainers to ensure reproducibility, all datasets
are prone to breakages. Second, we conduct a case study in which we systematically test the
reproducibility of 1,795 software artifacts during a 13-month period. We find that 62.6% of
the artifacts break at least once, and 15.3% of the artifacts break multiple times. We manually
investigate the root causes of breakages and handcraft 10 patches, which are automatically
applied to 1,055 distinct artifacts in 2,948 fixes. Based on the nature of the root causes,
we propose automated dependency caching and artifact isolation to prevent further breakage.
In particular, we show that isolating artifacts to eliminate external dependencies increases
reproducibility to 95% or higher, which is on par with the level of reproducibility
exhibited by the most reliable manually curated dataset.
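As a rough illustration of the dependency-caching idea mentioned above (not the paper's actual tooling), the following Python sketch downloads an artifact's pinned dependencies into a local cache once and then installs strictly from that cache on later rebuilds, so registry-side changes cannot break the build; the requirements file and cache directory are assumed placeholders.

```python
# Minimal sketch of dependency caching for build reproducibility: fetch all
# wheels/sdists once while the registry still serves them, then install
# offline on every rebuild. Illustrative only; paths are placeholder
# assumptions, not the paper's infrastructure.
import subprocess
from pathlib import Path

REQUIREMENTS = "requirements.txt"  # hypothetical pinned requirements of an artifact
CACHE_DIR = Path("dep-cache")      # local directory that travels with the artifact


def populate_cache() -> None:
    """One-time step: download every dependency into the local cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    subprocess.run(
        ["pip", "download", "-r", REQUIREMENTS, "-d", str(CACHE_DIR)],
        check=True,
    )


def install_from_cache() -> None:
    """Every rebuild: install strictly from the cache, never from the network."""
    subprocess.run(
        ["pip", "install", "--no-index",
         "--find-links", str(CACHE_DIR), "-r", REQUIREMENTS],
        check=True,
    )


if __name__ == "__main__":
    populate_cache()
    install_from_cache()
```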
-
On the Real-World Effectiveness of Static Bug Detectors at Finding Null Pointer Exceptions
David Tomassi,
Cindy Rubio-González
In the 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE)
Abstract: Static bug detectors aim to help developers automatically find and
prevent bugs. In this experience paper, we study the effectiveness of static bug detectors at
identifying Null Pointer Dereferences or Null Pointer Exceptions (NPEs). NPEs pervade all
programming domains from systems to web development. Specifically, our study measures the
effectiveness of five Java static bug detectors: CheckerFramework, ERADICATE, INFER, NULLAWAY,
and SPOTBUGS. We conduct our study on 102 real-world and reproducible NPEs from 42 open-source
projects found in the BUGSWARM and DEFECTS4J datasets. We apply two known methods to determine
whether a bug is found by a given tool, and introduce two new methods that leverage stack trace
and code coverage information. Additionally, we provide a categorization of the tools’
capabilities and the bug characteristics to better understand the strengths and weaknesses of
the tools. Overall, the tools under study only find 30 out of 102 bugs (29.4%), with the
majority found by ERADICATE. Based on our observations, we identify and discuss opportunities to
make the tools more effective and useful.
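As an illustration of how stack trace information can be used to decide whether a warning corresponds to a bug (the paper's exact matching criteria may differ), the following Python sketch marks a bug as potentially found when a warning's file and line fall near a frame of the NPE's stack trace; the warning and trace data shown are hypothetical.

```python
# Minimal sketch of stack-trace-based matching: a bug counts as "potentially
# found" if some analyzer warning points at (or near) a file/line that appears
# among the exception's stack frames. Illustration only, with hypothetical data.
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    file: str
    line: int


def stack_trace_match(warnings: list[Location],
                      trace_frames: list[Location],
                      tolerance: int = 0) -> bool:
    """True if some warning falls within `tolerance` lines of a stack frame."""
    for w in warnings:
        for f in trace_frames:
            if w.file == f.file and abs(w.line - f.line) <= tolerance:
                return True
    return False


if __name__ == "__main__":
    # Hypothetical analyzer warnings and an NPE stack trace.
    warnings = [Location("src/main/java/app/Parser.java", 42)]
    frames = [Location("src/main/java/app/Parser.java", 43),
              Location("src/main/java/app/Main.java", 10)]
    print(stack_trace_match(warnings, frames, tolerance=2))  # True
```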
-
Fixing Dependency Errors for Python Build Reproducibility
Suchita Mukherjee,
Abigail Almanza,
Cindy Rubio-González
In ISSTA 2021: Proceedings of the 30th ACM SIGSOFT International Symposium on Software
Testing and Analysis
Abstract: Software reproducibility is important for reusability and the cumulative
progress of research. An important manifestation of unreproducible software is the changed
outcome of software builds over time. While enhancing code reuse, the use of open-source
dependency packages hosted on centralized repositories such as PyPI can have adverse effects
on build reproducibility. Frequent updates to these packages often cause their latest
versions to have breaking changes for applications using them. Large Python applications
risk their historical builds becoming unreproducible due to the widespread usage of Python
dependencies, and the lack of uniform practices for dependency version specification.
Manually fixing dependency errors requires expensive developer time and effort, while
automated approaches face challenges of parsing unstructured build logs, finding transitive
dependencies, and exploring an exponential search space of dependency versions. In this
paper, we investigate how open-source Python projects specify dependency versions, and how
their reproducibility is impacted by dependency packages. We propose PyDFix, a tool to detect
and fix unreproducibility in Python builds caused by dependency errors. PyDFix is evaluated
on two bug datasets BugSwarm and BugsInPy, both of which are built from real-world
open-source projects. PyDFix analyzes a total of 2,702 builds, identifying 1,921 (71.1%) of
them to be unreproducible due to dependency errors. From these, PyDFix provides a complete
fix for 859 (44.7%) builds, and partial fixes for an additional 632 (32.9%) builds.
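As a rough, simplified illustration of the kind of repair PyDFix automates (not its actual log parsing or version search, which the paper describes as considerably more involved), the following Python sketch extracts an offending package name from a hypothetical build-log line and pins it to an exact version in a requirements list; the log format, regex, and chosen version are all assumptions.

```python
# Minimal sketch (not PyDFix): find the package blamed in a build-log error and
# pin it to an older version in the requirements. Log line, regex, and version
# are hypothetical placeholders for illustration.
import re
from typing import Optional

LOG_LINE = "ERROR: package 'somedep' requires Python >=3.8, build used 3.6"  # hypothetical
REQUIREMENTS_IN = ["somedep", "requests==2.25.1"]                             # hypothetical


def find_offending_package(log_line: str) -> Optional[str]:
    """Pull a package name out of a (hypothetical) error line."""
    match = re.search(r"package '([A-Za-z0-9_.-]+)'", log_line)
    return match.group(1) if match else None


def pin_package(requirements: list[str], package: str, version: str) -> list[str]:
    """Replace an unpinned entry for `package` with an exact version pin."""
    fixed = []
    for entry in requirements:
        name = re.split(r"[=<>!~]", entry, maxsplit=1)[0]
        fixed.append(f"{package}=={version}" if name == package else entry)
    return fixed


if __name__ == "__main__":
    pkg = find_offending_package(LOG_LINE)
    if pkg:
        print(pin_package(REQUIREMENTS_IN, pkg, version="1.2.0"))
```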
-
A Note About: Critical Review of BugSwarm for Fault Localization and Program Repair
David A. Tomassi,
Cindy Rubio-González
Abstract: Datasets play an important role in the advancement of software tools and
facilitate their evaluation. BugSwarm is an infrastructure to automatically create a large
dataset of real-world reproducible failures and fixes. In this paper, we respond to Durieux
and Abreu's critical review of the BugSwarm dataset, referred to in this paper as
CriticalReview. We replicate CriticalReview's study and find several incorrect claims and
assumptions about the BugSwarm dataset. We discuss these incorrect claims and other
contributions listed by CriticalReview. Finally, we discuss general misconceptions about
BugSwarm, and our vision for the use of the infrastructure and dataset.
-
Bugs in the Wild: Examining the Effectiveness of Static Analyzers at Finding Real-World Bugs
David A. Tomassi
In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference
and Symposium on the Foundations of Software Engineering
Abstract: Static analysis is a powerful technique to find software bugs. In past
years, a few static analysis tools have become available for developers to find certain
kinds of bugs in their programs. However, there is no evidence of how effective the tools
are at finding bugs in real-world software. In this paper, we present a preliminary study on
the popular static analyzers ErrorProne and SpotBugs. Specifically, we consider 320 real
Java bugs from the BugSwarm dataset, and determine which of these bugs can potentially be
found by the analyzers, and how many are indeed detected. We find that 30.3% and 40.3% of
the bugs are candidates for detection by ErrorProne and SpotBugs, respectively. Our
evaluation shows that the analyzers are relatively easy to incorporate into the tool chain
of diverse projects that use the Maven build system. However, the analyzers are not as
effective at detecting the bugs under study, with only one bug successfully detected by
SpotBugs.
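As an illustration of how such analyzers can be incorporated into a Maven-based tool chain, as the study notes is straightforward, the following Python sketch compiles a checked-out project and invokes the SpotBugs Maven plugin's check goal, reporting whether the build was failed by reported bugs; the project path is a placeholder and plugin behavior may vary across versions.

```python
# Minimal sketch: run SpotBugs on a Maven project via its Maven plugin and
# report whether any bugs were flagged. The project path is a hypothetical
# checkout; plugin versions and goal behavior may differ in practice.
import subprocess

PROJECT_DIR = "path/to/maven-project"  # hypothetical checkout of a study artifact


def run_spotbugs(project_dir: str) -> bool:
    """Return True if the SpotBugs check goal failed the build (bugs reported)."""
    result = subprocess.run(
        ["mvn", "compile", "com.github.spotbugs:spotbugs-maven-plugin:check"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    # The check goal exits non-zero when bugs are found (or on other build errors).
    return result.returncode != 0


if __name__ == "__main__":
    print("SpotBugs flagged issues:", run_spotbugs(PROJECT_DIR))
```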