What makes an analysis reproducible?

The goal of reproducible data analysis is to document and communicate your analysis so that other researchers can easily follow your procedure and replicate its results.

The list of demands and best practices can seem overwhelming at times, but as Klein et al. (2018) note, even small steps are helpful, and often reduce effort on the part of the original analyst:

[the adoption of reproducible workflows] can be piecemeal -- each incremental step towards complete transparency adds positive value.


Introductions


Open-source tools

The tools we use for analysis should allow colleagues to see what we have done, and re-run or adapt our steps. Freely available tools that can be picked up by any colleague are often more helpful than proprietary, commercial tools, but either will do. Thankfully, there are several sets of tools to choose from:

Analysis frameworks

  • JASP and Jamovi are easy-to-use, graphical interfaces for statistical analysis. Both make fully reproducible analyses possible without the need to write code.
  • The R programming language and the corresponding RStudio interface are probably the most common analysis tools in the social sciences.
  • Python and Jupyter provide an alternative approach.
  • Julia is a programming language focussed on high-performance numerical computation.

Resources for learning R

If you are new to R, there are several fantastic resources to help you get started:


Open and documented data formats

Like the tools used for analysis, data is most useful if it can be reused by anyone.

  • Text-based data formats such as comma-separated values are probably the most widely understood formats for tabular data. If in doubt, the Library of Congress provides some guidelines around data formats for long-term archival.
  • Wickham (2014) provides guidelines and examples for creating tidy datasets.
  • It's often helpful to create a codebook with further information about your data. The codebook package for R can help you create an overview of your data automatically, and is especially helpful for survey data.
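As a minimal sketch of the points above, the following Python snippet writes a small tidy table as comma-separated values and pairs it with a hand-written codebook. The variable names, values, and descriptions are hypothetical and chosen only for illustration; the codebook package for R automates a much richer version of this.

```python
import csv
import io

# Hypothetical tidy data: one row per observation, one column per variable.
rows = [
    {"participant": 1, "condition": "control", "rt_ms": 512},
    {"participant": 2, "condition": "treatment", "rt_ms": 478},
]

# Hypothetical codebook: one entry per variable, describing meaning and unit.
codebook = {
    "participant": "Anonymous participant identifier",
    "condition": "Experimental condition (control or treatment)",
    "rt_ms": "Response time in milliseconds",
}


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


data_csv = to_csv(rows, ["participant", "condition", "rt_ms"])
codebook_csv = to_csv(
    [{"variable": name, "description": text} for name, text in codebook.items()],
    ["variable", "description"],
)

# Simple consistency check: every data column should appear in the codebook.
undocumented = set(rows[0]) - set(codebook)
```

Keeping the codebook in the same plain-text format as the data means both can be read with any tool, and the final check flags variables that were added to the data but never documented.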

Public code

As with the above steps, publishing code can seem daunting, but Barnes (2010) makes a compelling argument to publish your computer code: It is good enough. The code is often a vital part of the results you report, as it documents the precise steps in the analysis that produced them. For added transparency, syntax files can be shared alongside data in a public online repository.


Further steps

Beyond sharing your code, you can take several additional steps to further increase the ease with which fellow researchers (and you yourself) can follow and reproduce your analysis:

  • Give your files (data, analysis scripts, etc.) self-explanatory names or follow a standardized folder structure.
  • Self-contained projects reduce the degree to which a programme is tied to any specific computer, ensuring that analyses can be re-run by others.
  • Literate programming is the practice of interweaving code with the narrative of the analysis, such as explanations and discussions of the intermediate steps. For example, Notebooks in RStudio or Jupyter contain not only code, but also the output of every analysis step, and can also contain your commentary. They also produce HTML and PDF reports that make it possible to inspect all results without re-running the analysis. Comments in scripts also facilitate understanding.
  • Consistency can also help add clarity to code. For added standardization, you might consider adopting naming conventions in R, following a style guide such as the one proposed in Advanced R or the Google style guide, or letting a package like formatR do the work for you.
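One way to keep a project self-contained is to build every file path relative to the project folder rather than hard-coding a location on one computer. The following Python sketch illustrates the idea; the folder layout (data/raw) and the file name survey.csv are hypothetical, and the snippet assumes scripts are run from the project's root folder.

```python
from pathlib import Path


def project_path(root, *parts):
    """Build a path inside the project, starting from its root folder."""
    return Path(root).joinpath(*parts)


# Assumption: the working directory is the project root, so "." refers to it.
PROJECT_ROOT = Path(".")

# Hypothetical location of a raw data file within the project.
raw_data = project_path(PROJECT_ROOT, "data", "raw", "survey.csv")
```

Because the resulting path is relative, the same script works unchanged when the project folder is copied to a colleague's computer.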

Version control

Version control systems track changes to code (and other files) over time, documenting the history of a project, and providing backups of earlier versions to fall back on.

Version control resources

Dependency management

Your results may depend on the specific versions of the software you used in your analysis -- both the analysis framework, and the plugins and packages you installed on your computer. Documenting these dependencies, such as the package versions you relied on, helps others to recreate the exact environment that you conducted the analysis in. This helps avoid works-on-my-machine errors that can be extremely difficult to pin down.

Dependency management resources

Approaches

  • In R, the packrat package stores a snapshot of your package library and allows others to reproduce the exact same state. The checkpoint package will reinstall packages as they were available on a specific date. The built-in sessionInfo() command lists the versions of every loaded package.
  • In Python, virtualenv helps manage environments on a per-project basis. Conda aims to manage dependencies in any language, but is most common in the Python world.
  • Containers are the latest addition to the dependency management toolkit. Instead of just recreating the set of installed packages, containers capture an entire system, and often contain instructions for automatically setting up the system from scratch. This ensures that every part of the analysis environment can be reproduced exactly, and safely transferred between computers if desired. Thus, an entire analysis can be packaged and (re-)run on almost any computer, including external online services.
    Containers are probably most useful where dependencies go beyond a single analysis framework and associated packages, for example in the case of complex toolchains.
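For Python users, a rough counterpart to R's sessionInfo() can be sketched with the standard library alone. The function name environment_report and the package names passed to it are hypothetical; the snippet records the interpreter version, the platform, and the installed version of each named package (or None if it is not installed).

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect the Python version, platform, and versions of named packages."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            report["packages"][name] = None
    return report


# Hypothetical package names; replace with the ones your analysis imports.
report = environment_report(["pip", "definitely-not-installed-package-xyz"])
```

Saving such a report alongside the results documents the environment, but unlike packrat, Conda, or a container it does not recreate that environment for others.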

Containers

Containers are frequently used in software engineering, and are a very stable and dependable technology that is slowly finding its way into scientific practice. The most common software for managing containers is Docker.

  • Green & Clyburne-Sherin (2018) motivate and review containers in the context of reproducible analyses in the social sciences. Their manuscript introduces the commercial CodeOcean service, which can run analysis containers over the internet.
  • The Singularity project is building a container system designed specifically to meet the needs of researchers and academic institutions. The project's motivation is to increase the "mobility of compute" around high-performance computing systems, and provide a central hub for sharing containers.
  • The Scientific Filesystem aims to make containers more accessible and transparent.