What makes an analysis reproducible?
The goal of reproducible data analysis is to document and communicate your analysis so that other researchers can easily follow your procedure and replicate its results.
The list of demands and best practices can seem overwhelming at times, but as Klein et al. (2018) note, even small steps are helpful, and often reduce effort on the part of the original analyst:
[the adoption of reproducible workflows] can be piecemeal -- each incremental step towards complete transparency adds positive value.
Introductions
- A great introduction to this topic with a focus on psychology is the practical guide for transparency in psychological science by Klein et al. (2018), who also provide extensive supplemental resources with practical tips for reproducible analyses, and propose an exemplary folder structure for data in online repositories.
- Daniel Lakens has written an excellent and very comprehensive step-by-step tutorial on computational reproducibility using R and Markdown.
Open-source tools
The tools we use for analysis should allow colleagues to see what we have done, and re-run or adapt our steps. Freely available tools that can be picked up by any colleague are often more helpful than proprietary, commercial tools, but either will do. Thankfully, there are several sets of tools to choose from:
Analysis frameworks
- JASP and Jamovi are easy-to-use, graphical interfaces for statistical analysis. Both support fully reproducible analyses without requiring any code.
- The R programming language and the corresponding RStudio interface are probably the most common analysis tools in the social sciences.
- Python and Jupyter provide an alternative approach.
- Julia is a programming language focused on high-performance numerical computation.
Resources for learning R
If you are new to R, there are several fantastic resources to help you get started:
- RStudio maintains a repository of R Tutorials, and Lorne Campbell has put together another list of resources for learning to useR.
- The Software Carpentry offers an introduction to R for non-programmers.
- RStudio's R Cheatsheets are a fantastic resource for finding R commands for any particular purpose.
- R for Data Science by Grolemund & Wickham is a more comprehensive, and more advanced, tutorial that covers all aspects of data analysis.
Open and documented data formats
Like the tools used for analysis, data is most useful if it can be reused by anyone.
- Text-based data formats such as comma-separated values (CSV) are probably the most widely understood format for tabular data. If in doubt, the Library of Congress provides guidelines on data formats for long-term archival.
- Wickham (2014) provides guidelines and examples for creating tidy datasets.
- It's often helpful to create a codebook with further information about your data. The codebook package for R can generate an overview of your data automatically, and is especially helpful for survey data.
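As a concrete illustration of the text-based format mentioned above, here is a minimal comma-separated file: a header row names the columns, and each subsequent line holds one observation. (The file name and variables are made up for this sketch.)

```shell
# Write a minimal comma-separated data file: a header row
# naming the columns, then one observation per line.
cat > responses.csv <<'EOF'
participant,condition,score
1,control,42
2,treatment,57
EOF

# Any tool can read it back -- even standard shell utilities.
head -n 1 responses.csv   # shows the header row
```

Because the file is plain text, it can be opened by R, Python, a spreadsheet, or any future tool, which is exactly what makes the format so durable.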
Public code
As with the above steps, publishing code can seem daunting, but Barnes (2010) makes a compelling argument to publish your computer code: it is good enough. Code is often a vital part of the results you report, as it documents the precise steps of the analysis. For added transparency, syntax files can be shared alongside data in a public online repository.
Further steps
Beyond sharing your code, you can take several additional steps to further increase the ease with which fellow researchers (and yourself) can follow and reproduce your analysis:
- Give your files (data, analysis scripts, etc.) self-explanatory names or follow a standardized folder structure.
- Self-contained projects reduce the degree to which a programme is tied to any specific computer, ensuring that analyses can be re-run by others.
- Literate programming is the practice of interweaving code with the narrative of the analysis, such as explanations and discussions of intermediate steps. For example, notebooks in RStudio or Jupyter contain not only code, but also the output of every analysis step, and can include your commentary. They can also produce HTML and PDF reports that make it possible to inspect all results without re-running the analysis. Comments in scripts likewise facilitate understanding.
- Consistency can also add clarity to code. For added standardization, you might adopt naming conventions in R, follow a styleguide such as the one proposed in Advanced R or the Google Styleguide, or let a package like formatR do the work for you.
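To make the folder-structure suggestion above concrete, here is one possible layout for a self-contained project. The folder names are purely illustrative, not a prescribed standard; the point is that scripts use paths relative to the project root, so the whole folder can be moved or shared as a unit.

```shell
# One example layout for a self-contained analysis project.
mkdir -p example_project/data_raw   # original data, treated as read-only
mkdir -p example_project/scripts    # analysis code, using only relative paths
mkdir -p example_project/output     # figures and tables, regenerated by the scripts
touch example_project/README.md     # explains the contents and how to re-run the analysis
```

With this separation, a colleague can delete `output`, re-run the scripts, and check that the results regenerate from the raw data alone.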
Version control
Version control systems track changes to code (and other files) over time, documenting the history of a project, and providing backups of earlier versions to fall back on.
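For orientation before diving into the resources below, the everyday Git workflow boils down to a handful of commands. This is a sketch on a throwaway repository; the names and file are invented for illustration.

```shell
# Initialize a repository and record a first snapshot of a file.
git init -q demo_repo
echo "x,y" > demo_repo/data.csv

# Git needs an author identity for commits; set one locally for this demo repo.
git -C demo_repo config user.name "Jane Analyst"
git -C demo_repo config user.email "jane@example.com"

# Stage the file and commit it with a descriptive message.
git -C demo_repo add data.csv
git -C demo_repo commit -q -m "Add raw data"

# The log documents the project's history, one line per snapshot.
git -C demo_repo log --oneline
```

Each commit is a recoverable earlier version of the project, which is exactly the fallback the paragraph above describes.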
Version control resources
- Bryan (2017) provides an excellent introduction to the most popular such system, Git, in the context of research data and R (if you have a moment to talk about version control). The author also shares her extensive step-by-step guide, Happy Git and GitHub for the useR.
- Vuorre & Curley (2018) provide a tutorial on curating research assets with Git, tailored specifically to psychology.
Dependency management
Your results may depend on the specific versions of the software used in your analysis -- both the analysis framework, and the plugins and packages you installed on your computer. Documenting these dependencies, such as the package versions you relied on, helps others recreate the exact environment in which you conducted the analysis. This helps avoid "works on my machine" errors that can be extremely difficult to pin down.
Dependency management resources
Approaches
- In R, the packrat package stores a snapshot of your package library and allows others to reproduce the exact same state. The checkpoint package will reinstall packages as they were available on a specific date. The built-in sessionInfo() command lists the versions of every active package.
- In Python, virtualenv helps manage environments on a per-project basis. Conda aims to manage dependencies in any language, but is most common in the Python world.
- Containers are the latest addition to the dependency management toolkit. Instead of just recreating the set of installed packages, containers capture an entire system, and often contain instructions for automatically setting it up from scratch. This ensures that every part of the analysis environment can be reproduced exactly, and safely transferred between computers if desired. Thus, an entire analysis can be packaged and (re-)run on almost any computer, including external online services.
Containers are probably most useful where dependencies go beyond a single analysis framework and associated packages, for example in the case of complex toolchains.
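For the per-project approach in Python, recording dependencies can be as simple as freezing the package list of an isolated environment. A minimal sketch; the `.venv` and `requirements.txt` names are conventional but arbitrary.

```shell
# Create an isolated, per-project Python environment.
python3 -m venv .venv

# Record the exact versions of every package installed in it.
.venv/bin/pip freeze > requirements.txt

# Colleagues can later recreate the same environment with:
#   python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
```

Committing `requirements.txt` alongside the analysis code documents the exact package versions the results depend on.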
Containers
Containers are frequently used in software engineering, and are a very stable and dependable technology that is slowly finding its way into scientific practice. The most common software for managing containers is Docker.
- Green & Clyburne-Sherin (2018) motivate and review containers in the context of reproducible analyses in the social sciences. Their manuscript introduces the commercial CodeOcean service, which can run analysis containers over the internet.
- The Singularity project is building a container system designed specifically to meet the needs of researchers and academic institutions. The project's motivation is to increase the "mobility of compute" around high-performance computing systems, and provide a central hub for sharing containers.
- The Scientific Filesystem aims to make containers more accessible and transparent.