Result and Artifact Review and Badging

An experimental result is not fully established unless it can be independently reproduced. A variety of recent studies, primarily in the biomedical field, have revealed that an uncomfortably large number of research results found in the literature fail this test, because of sloppy experimental methods, flawed statistical analyses, or in rare cases, fraud. Publishers can promote the integrity of the research ecosystem by developing review processes that increase the likelihood that results can be independently replicated and reproduced. An extreme approach would be to require completely independent reproduction of results as part of the refereeing process. An intermediate approach is to require that artifacts associated with the work undergo a formal audit. By "artifact" we mean a digital object that was either created by the authors to be used as part of the study or generated by the experiment itself. For example, artifacts can be software systems, scripts used to run experiments, input datasets, raw data collected in the experiment, or scripts used to analyze results.

Additional benefits ensue if the research artifacts are themselves made publicly available so that any interested party may audit them. This also enables replication experiments to be performed, which, because they inevitably are done under slightly different conditions, serve to verify the robustness of the original results. And perhaps more importantly, well-formed and documented artifacts allow others to build directly upon the previous work through reuse and repurposing.

A number of ACM conferences and journals have already instituted formal processes for artifact review. Here we provide terminology and standards for review processes of these types in order to promote a base level of uniformity which will enable labeling of successfully reviewed papers across ACM publications choosing to adopt such practices.

Of course, there remain many circumstances in which such enhanced review will not be feasible. As a result, such review processes are encouraged, but remain completely optional for ACM journals and conferences, and when they are made available, it is recommended that participation by authors also be made optional. Authors who do agree to such additional review, and whose work meets established standards, will be rewarded with appropriate labeling both in the text of the article and in the metadata displayed in the ACM Digital Library. Specific labels, or badges, are proposed below.

Terminology.

A variety of research communities have embraced the goal of reproducibility in experimental science. Unfortunately, the terminology in use has not been uniform. Because of this we find it necessary to define our terms. The following definitions are inspired by the International Vocabulary of Metrology (VIM); see the Appendix for details.
 

  • Repeatability (Same team, same experimental setup)

    • The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation.

  • Replicability (Different team, same experimental setup)

    • The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author’s own artifacts.

  • Reproducibility (Different team, different experimental setup)

    • The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.
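The three terms differ only in who performs the experiment and whether the experimental setup is the original one. The following minimal Python sketch encodes that distinction. The names `Term` and `classify` are hypothetical and chosen for illustration only; they are not part of any ACM tooling.

```python
from enum import Enum


class Term(Enum):
    REPEATABILITY = "Repeatability"
    REPLICABILITY = "Replicability"
    REPRODUCIBILITY = "Reproducibility"


def classify(same_team: bool, same_setup: bool) -> Term:
    """Map (team, experimental setup) to the term defined above.

    The combination (same team, different setup) is not named in the
    definitions, so it is rejected rather than guessed at.
    """
    if same_team and same_setup:
        return Term.REPEATABILITY
    if not same_team and same_setup:
        return Term.REPLICABILITY
    if not same_team and not same_setup:
        return Term.REPRODUCIBILITY
    raise ValueError("same team with a different setup is not defined here")
```

For a computational experiment, "same setup" corresponds to running the author's own artifacts, and "different setup" to artifacts developed completely independently.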

The concepts of repeatability and reproducibility are taken directly from the VIM. Repeatability is something we expect of any well-controlled experiment. Results that are not repeatable are rarely suitable for publication. The proposed intermediate concept of replicability stems from the unique properties of computational experiments, i.e., that the measurement procedure/system, being virtual, is more easily portable, enabling inspection and exercise by others. While reproducibility is the ultimate goal, this initiative seeks to take an intermediate step, that is, to promote practices that lead to better replicability. We fully acknowledge that simple replication of results using author-supplied artifacts is a weak form of reproducibility. Nevertheless, it is an important first step, and the auditing processes that go well beyond traditional refereeing will begin to raise the bar for experimental research in computing.

Badging.

We recommend that three separate badges related to artifact review be associated with research articles in ACM publications: Artifacts Evaluated, Artifacts Available and Results Validated. These badges are considered independent and any one, two or all three can be applied to any given paper depending on review procedures developed by the journal or conference.

Artifacts Evaluated
This badge is applied to papers whose associated artifacts have successfully completed an independent audit. Artifacts need not be made publicly available to be considered for this badge. However, they do need to be made available to reviewers. Two levels are distinguished, only one of which should be applied in any instance:
 

  • Artifacts Evaluated – Functional

    The artifacts associated with the research are found to be documented, consistent, complete, and exercisable, and to include appropriate evidence of verification and validation.
    • Notes

      • Documented: At minimum, an inventory of artifacts is included, and sufficient description provided to enable the artifacts to be exercised.

      • Consistent: The artifacts are relevant to the associated paper, and contribute in some inherent way to the generation of its main results.

      • Complete: To the extent possible, all components relevant to the paper in question are included. (Proprietary artifacts need not be included. If they are required to exercise the package then this should be documented, along with instructions on how to obtain them. Proxies for proprietary data should be included so as to demonstrate the analysis.)

      • Exercisable: Included scripts and/or software used to generate the results in the associated paper can be successfully executed, and included data can be accessed and appropriately manipulated.

  • Artifacts Evaluated – Reusable
    The artifacts associated with the paper are of a quality that significantly exceeds minimal functionality. That is, they have all the qualities of the Artifacts Evaluated – Functional level, but, in addition, they are very carefully documented and well-structured to the extent that reuse and repurposing are facilitated. In particular, norms and standards of the research community for artifacts of this type are strictly adhered to.

Artifacts Available
This badge is applied to papers in which associated artifacts have been made permanently available for retrieval.

  • Artifacts Available

    Author-created artifacts relevant to this paper have been placed on a publicly accessible archival repository. A DOI or link to this repository along with a unique identifier for the object is provided.
    • Notes

      • We do not mandate the use of specific repositories. Publisher repositories (such as the ACM Digital Library), institutional repositories, or open commercial repositories (e.g., figshare or Dryad) are acceptable. In all cases, repositories used to archive data should have a declared plan to enable permanent accessibility. Personal web pages are not acceptable for this purpose.

      • Artifacts do not need to have been formally evaluated in order for an article to receive this badge. In addition, they need not be complete in the sense described above. They simply need to be relevant to the study and add value beyond the text in the article. Such artifacts could be something as simple as the data from which the figures are drawn, or as complex as a complete software system under study.

Results Validated
This badge is applied to papers in which the main results of the paper have been successfully obtained by a person or team other than the authors. Two levels are distinguished:

  • Results Replicated

    The main results of the paper have been obtained in a subsequent study by a person or team other than the authors, using, in part, artifacts provided by the author.

  • Results Reproduced

    The main results of the paper have been independently obtained in a subsequent study by a person or team other than the authors, without the use of author-supplied artifacts.


In each case, exact replication or reproduction of results is not required, or even expected. Instead, the results must be in agreement to within a tolerance deemed acceptable for experiments of the given type. In particular, differences in the results should not change the main claims made in the paper.
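As an illustration of agreement "to within a tolerance", the sketch below compares a replicated value with the originally reported one using a relative tolerance. The function name `agrees_within` and the numbers in the usage note are hypothetical; this document does not prescribe a tolerance, which is left to the norms of each community.

```python
def agrees_within(original: float, obtained: float, rel_tol: float) -> bool:
    """Return True if `obtained` differs from `original` by at most
    `rel_tol` times the magnitude of `original`.

    `rel_tol` is an assumption supplied by the reviewer; an acceptable
    value depends on the type of experiment.
    """
    if rel_tol < 0:
        raise ValueError("rel_tol must be non-negative")
    return abs(obtained - original) <= rel_tol * abs(original)
```

For example, with a made-up reported value of 12.4 and a 5% tolerance, `agrees_within(12.4, 11.9, 0.05)` is true while `agrees_within(12.4, 10.0, 0.05)` is false. A numerical check of this kind covers only the first requirement; whether a difference changes the main claims of the paper remains a judgment for the reviewers.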

It is easy to see how research articles that develop algorithms or software systems could be labeled as described above. Here, the artifacts could be implementations of algorithms or complete software systems, and replication would involve exercise of software, typically software provided by the author. However, we intend these badges to be applicable to other types of research as well. For example, artifacts associated with human-subject studies of novel human-computer interface modalities might be the collected data, as well as the scripts developed to analyze the data. "Replication" might focus on a careful inspection of the experimental protocol along with independent analysis of the collected data.


Review Procedures.
The descriptions of badges provided above do not specify the details of the review process itself. For example: Should reviews occur before or after acceptance of a paper? How many reviewers should there be? Should the reviewers be anonymous, or should they be allowed to interact openly with the authors? How should artifacts be packaged for review? What specific metrics should be used to assess quality? Current grassroots efforts to evaluate artifacts and formally test replicability have answered these questions in different ways. We believe that it is still too early to establish more specific guidelines for artifact and replicability review. Indeed, there is sufficient diversity among the various communities in the computing field that this may not be desirable at all. We do believe that the broad definitions given above provide a framework that will allow badges to have general comparability among communities.

Because there may be some variation in review procedures among ACM’s publication venues, badges included in PDFs and in ACM Digital Library metadata should be linked to a brief explanation of the particular review process which led to the awarding of the badge.

We acknowledge that the Artifacts Available and Results Validated badges described above make sense even if they result from action that occurs after publication. Editors-in-Chief and Conference Steering Committee Chairs (or an appropriate SIG Chair should a conference not have an extant steering committee) will have the authority to award these badges post-publication if warranted. For Results Validated, a peer-reviewed publication which reports the replication or reproduction must be submitted as evidence, and if awarded, the badge will contain a link to this paper.
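The badge rules stated in this document can be summarized in a small data model. The Python sketch below is illustrative only: the names `Badge`, `is_valid_combination`, and `may_award_after_publication`, and the `evidence` parameter, are hypothetical and do not describe any ACM system. Treating Artifacts Evaluated as ineligible for post-publication award is an assumption of the sketch, made because the paragraph above names only the other two badges.

```python
from enum import Enum
from typing import Optional, Set


class Badge(Enum):
    EVALUATED_FUNCTIONAL = "Artifacts Evaluated – Functional"
    EVALUATED_REUSABLE = "Artifacts Evaluated – Reusable"
    AVAILABLE = "Artifacts Available"
    REPLICATED = "Results Replicated"
    REPRODUCED = "Results Reproduced"


EVALUATED_LEVELS = {Badge.EVALUATED_FUNCTIONAL, Badge.EVALUATED_REUSABLE}
RESULTS_VALIDATED = {Badge.REPLICATED, Badge.REPRODUCED}


def is_valid_combination(badges: Set[Badge]) -> bool:
    """Badges are independent, but only one Artifacts Evaluated level
    should be applied to a given paper."""
    return len(badges & EVALUATED_LEVELS) <= 1


def may_award_after_publication(badge: Badge,
                                evidence: Optional[str] = None) -> bool:
    """Apply the post-publication rule described above.

    `evidence` stands for a link to the peer-reviewed publication that
    reports the replication or reproduction (hypothetical parameter).
    """
    if badge is Badge.AVAILABLE:
        return True
    if badge in RESULTS_VALIDATED:
        # Results Validated requires a peer-reviewed report as evidence.
        return bool(evidence)
    # Assumption: Artifacts Evaluated is not named in the
    # post-publication paragraph, so it is treated as ineligible here.
    return False
```

A paper carrying Artifacts Evaluated – Reusable, Artifacts Available, and Results Reproduced passes `is_valid_combination`, whereas a set containing both Artifacts Evaluated levels does not.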

Appendix

Concepts from the International Vocabulary of Metrology

The primary reference for terminology in physical measurement is

International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (VIM), 3rd edition, JCGM 200:2012, http://www.bipm.org/en/publications/guides/vim.html.

The connection to experimental computer science is this: the result of an experiment can be thought of as a measurement, albeit one of a virtual object. As a result, the following VIM definitions of concepts such as repeatability and reproducibility are apropos.

Measurement repeatability: measurement precision under a set of repeatability conditions of measurement.

  • Measurement precision: closeness of agreement between indications or measured quantity values obtained by replicate measurements on the same or similar objects under specified conditions.

  • Repeatability condition of measurement: condition of measurement, out of a set of conditions that includes the same measurement procedure, same operators, same measuring system, same operating conditions and same location, and replicate measurements on the same or similar objects over a short period of time.

Measurement reproducibility: measurement precision under reproducibility conditions of measurement.

  • Reproducibility condition of measurement: condition of measurement, out of a set of conditions that includes different locations, operators, measuring systems, and replicate measurements on the same or similar objects.

Approved June 8, 2016
