The reproducibility crisis

I like to start off talks about reproducibility in science with some humor. This video is a few years old, but it has some timeless insights.

https://www.youtube.com/embed/66oNv_DJuPc

What’s the point? That even the most well-meaning of us can make careless errors that undermine the reproducibility of science.

But, is it a crisis?

In 2016, Nature published the results of a survey of 1,500 scientists (Baker 2016). They were asked a number of questions, including the following:

Is there a reproducibility crisis?

Yes, a significant crisis
Yes, a slight crisis
No crisis
Don’t know

Baker 2016

Do we all agree?

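# Load supporting functions; the %>% pipe below assumes the tidyverse is attached
source(params$supporting_functions)
survey <- readr::read_csv(params$survey_data_fn)

# Order the response levels so the bars appear in a meaningful sequence
survey$crisis <- ordered(survey$crisis,
                         levels = c("Significant", "Slight", "No crisis", "Don't know"))

# Count responses per category, dropping missing answers
survey %>%
  dplyr::filter(!is.na(crisis)) %>%
  ggplot2::ggplot() +
  ggplot2::aes(x = crisis, fill = crisis) +
  ggplot2::geom_bar(stat = "count") +
  # Counts are numeric, so the y axis needs a continuous (not discrete) scale
  ggplot2::scale_y_continuous(breaks = c(0, 5, 10, 15), limits = c(0, 15))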
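
The ordered() call above matters because bars and tables follow factor-level order, which would otherwise be alphabetical. A minimal sketch, using made-up responses rather than the actual survey data, shows the effect on a simple tabulation:

```r
# Made-up responses for illustration only; these are not the survey data
responses <- c("Slight", "Significant", NA, "No crisis", "Significant")

# Declare the levels in the order we want them reported
crisis <- ordered(responses,
                  levels = c("Significant", "Slight", "No crisis", "Don't know"))

# table() follows the level order and drops NA values by default
counts <- table(crisis)
print(counts)
```

Levels with no responses (here "Don't know") still appear in the table with a count of zero, which keeps plots and tables comparable across data sets.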

Problems with reproducibility extend beyond behavioral science

Why is reproducibility hard?

Baker 2016

A manifesto for reproducible science

Munafo et al., 2017

What does reproducible science look like?

What do we mean by ‘reproducibility’?

Goodman et al., 2016

Methods reproducibility
- Enough details about materials & methods recorded (& reported)
- Same results with same materials & methods
Results reproducibility
- Same results from independent study
Inferential reproducibility
- Same inferences from one or more studies or reanalyses

Achieving methods reproducibility

How do we record enough details about all of our data, materials, and methods so we can fully reproduce our analysis? Have we captured (and reported) enough details about all of the steps of our project:

Data collection
Cleaning
Visualization
Analysis
Reporting
Manuscript, talk, poster preparation.

One way to answer the question is to think about that introductory video. Another way is to think about the ‘hit by a truck’ scenario:

In other words, could someone else (in your lab or another lab) pick up where you left off without significant loss of time and momentum if you weren’t there? This is a high bar to reach, but it’s feasible with the right tools and mindset.

No one is irreplaceable, but we can all strive to be indispensable.

Using R (and RStudio) for reproducible science

So, how can we use R (and RStudio) to make our own work more reproducible? There are several ways.

RStudio projects

RStudio has a ‘projects’ function that I strongly recommend you use.

You create a new project using the ‘File/New Project’ menu item
I recommend storing all projects in a new directory with a sensible name (‘projects/1st_year_proj’)
This creates an *.Rproj file (the asterisk is a wildcard character meaning that any number of characters can precede the period)
I recommend turning off saving data to the .RData file (Tools/Global Options… and/or Tools/Project Options…)
The ‘File/Open Project…’ command opens a ‘fresh’ workspace

I create a new project every time I start a new ‘thing’. That occurs surprisingly often. Here’s an example of some recent projects I’ve worked on.

Using RStudio projects helps keep your files and settings organized. It’s easy to switch between projects. It reduces mental effort (what directory am I in?) and, above all, avoids directory-setting commands like setwd() that work only on your computer and are therefore not reproducible. RStudio projects also integrate with version control systems such as Git (and hosting services like GitHub).
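
To make the setwd() point concrete, here is a minimal sketch of the difference between a machine-specific path and a project-relative one. The directory and file names are hypothetical, chosen only for illustration:

```r
# Fragile: hard-codes one person's machine (hypothetical path)
# setwd("/Users/me/projects/1st_year_proj")

# Portable: the path is relative to the project root, which RStudio
# uses as the working directory when the project is opened
data_path <- file.path("data", "survey.csv")  # hypothetical file

if (file.exists(data_path)) {
  survey <- read.csv(data_path)
} else {
  message("No file at ", data_path, " (relative to ", getwd(), ")")
}
```

Because the path never mentions a user name or drive, the same script should run unchanged on a collaborator's computer once they open the project.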

Reproducible workflows

So, besides using RStudio projects, what other characteristics do reproducible workflows have? Reproducible workflows are:

Scripted, automated = minimize human-dependent steps.
Well-documented
Kind to your future (forgetful) self
Transparent to me & colleagues == transparent to others

Scripting a reproducible workflow

Think of each step of your data workflow. Most data-related projects involve something like this:

# Import/gather data

# Clean data

# Visualize data

# Analyze data

# Report findings

Imagine writing R code to handle each step.

# Import data
my_data <- read.csv("path/2/data_file.csv")

# Clean data
my_data$gender <- tolower(my_data$gender) # make lower case
...

You could put all the code in one script file, or even better, have separate scripts for each step that you source() one by one.

# Import data
source("R/Import_data.R") # source() runs scripts, loads functions

# Clean data
source("R/Clean_data.R")

# Visualize data
source("R/Visualize_data.R")
...
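
The sequence of source() calls above can also be wrapped in a small helper that checks for missing files before running anything. This is only a sketch: run_steps() is a hypothetical function, not part of R or of the bootcamp code, and the script names simply repeat the ones used above.

```r
# Step scripts, listed in the order they should run
steps <- file.path("R", c("Import_data.R", "Clean_data.R", "Visualize_data.R"))

# Hypothetical helper: stop early if a script is missing, then run each in turn
run_steps <- function(scripts) {
  missing <- scripts[!file.exists(scripts)]
  if (length(missing) > 0) {
    stop("Missing script(s): ", paste(missing, collapse = ", "))
  }
  for (s in scripts) {
    message("Running ", s)
    source(s)  # evaluated in the global environment, so later steps see earlier results
  }
  invisible(scripts)
}

# Usage, from the project root:
# run_steps(steps)
```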

Strengths of scripts

Automating workflows with scripts has many strengths. You keep R commands in files that can be re-run. Separate pieces of your workflow are kept separately.

You can create a “Master.R” script that can be run to regenerate the full sequence of results. Find an error in a raw data file? No problem; just fix the data file and re-run the “Master.R” script. Get hit by a bus? No problem; your colleagues can easily pick up where you left off. Here’s the Make_site.R script for this bootcamp:

# make_bootcamp_site.R
#
# Updates and makes R bootcamp site

# Update and clean data from registrants and survey
source("R/registrants.R")
source("R/survey.R")
update_registration_data()
update_survey_data()

# Gilmore's talks
rmarkdown::render(input = "talks/bootcamp-day-1-intro.Rmd", 
                  output_format = "ioslides_presentation")

rmarkdown::render(input = "talks/slow-r.Rmd", 
                  output_format = c("html_document"))

rmarkdown::render(input = "talks/r-eproducible-science.Rmd", 
                  output_format = c("html_document"))

# Render site last so that updated versions get copied to docs/
rmarkdown::render_site()

So, scripts are great. But there’s a way to take the virtues of scripts and make them even more useful and powerful: R Markdown documents.

Reproducible reports and papers with R Markdown

R Markdown is an add-on package to R, developed by the RStudio team. R Markdown allows you to write a single text document that combines text, code, images, videos, and equations into one document. R Markdown then lets you render that document in a number of formats: a PDF or MS Word document, a web page or site, slides, a blog, or even a book.

One tool to rule them all and in the R-ness, bind them.

If you don’t speak LOTR, just ignore that.

What is R Markdown?

R Markdown extends Markdown¹, a lightweight markup ‘language’ used in lots of blogging engines and wikis. Markdown is a simple formatting syntax for authoring HTML documents and can also be converted to PDF and MS Word documents. It’s designed to be easy for humans to write and for machines to read.

You can learn the basics of R Markdown in a very short time. Here are several resources we recommend:

RStudio has an R Markdown page with extensive documentation, including some very useful cheatsheets
Hadley Wickham, the author of ggplot2 and other R packages, has an online tutorial
Psychologists Mike Frank and Chris Hartgerink have produced a nice tutorial that they gave at the 2017 Society for Improving Psychological Science (SIPS) meeting.

The anatomy of an R Markdown file

Let’s create a simple R Markdown file so we can see how this works. From the ‘File’ menu select ‘New File…’ and then the R Markdown file type.

Notice that the ‘New R Markdown’ window lets us choose different types of documents, presentations, an interactive web application using Shiny, or another file from some template. We’ll just use the defaults and create a new Untitled document that gives as its default output an HTML file.

Let’s expand the Source panel so we can see the full file.

The template shows us the core components of an R Markdown file:

Header

The header is marked off from the rest of the document by dashes ---. The header contains metadata about the document, contained in different ‘fields’: title, author, date, and output.

---
title: "Untitled"
author: "Rick Gilmore"
date: "7/27/2018"
output: html_document
---

All of these can be edited. The title can and often should be different from the file name. The date field is filled out automatically ². The output field shows that this document will be converted into an html_document (.html file) that we can open in any web browser.
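
Because the header can contain inline R code, the date field can be computed at render time instead of typed by hand, for example with date: "`r Sys.time()`". The expression can be tried in the console first. The format string below is an assumption about the desired layout, not something the template requires:

```r
# Current date and time, formatted as a date-time stamp
stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
print(stamp)
```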

Body text

The body text starts with the double hash marks ##. R Markdown follows Markdown’s convention of using hash marks to specify heading levels as in an outline. One hash mark means the 1st (top) level, two hash marks mean the 2nd level, etc.³

Note that we can include clickable web links by surrounding URLs with angle brackets <>, make text boldface by surrounding it with double asterisks **boldface**, or put it in italics with single asterisks *italics*.

R Markdown allows other kinds of content to be inserted in body text:

Named links
Images:
Equations: \(e = mc^2\)

and even video or audio recordings using HTML.

Code chunks

Code chunks are separated from the body text by triple back-ticks ‘```’.

Let’s look at the second code chunk:

Text in braces {r cars} tells R Markdown that this chunk contains code written in R ⁴ and gives the chunk the name cars. The name is optional but, if given, must be unique within a file; it can help in debugging a long R Markdown file. In this case, the chunk runs the summary() command on the cars dataset.

When you create your own R Markdown documents, you will put your R code inside a chunk. You can create new chunks by clicking on a blank line in the R Markdown document and typing CTRL+ALT+I (Cmd+Option+I on macOS).

The virtue of putting code in chunks is that you can run them piece by piece from within the document. For example, clicking on the small right arrow icon runs the current chunk.

Scrolling down to the next code chunk, we see that it plots data from the pressure dataset: plot(pressure).

Returning to the first chunk, called setup, we see that chunks themselves can have options that specify whether the code is displayed in the document (echo=FALSE hides it), whether the chunk is evaluated (eval=TRUE), and so forth. This allows the user to customize how the document executes each chunk. See the RStudio documentation for more information about what chunk options suit your needs.
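
For readers without RStudio open, the pieces described above (header, heading, named chunks, chunk options) can be sketched by writing a minimal file from the console. This is an approximation of the default template, not a verbatim copy, and the file is written to a temporary directory chosen here for illustration:

```r
fence <- strrep("`", 3)  # three back-ticks open and close a code chunk

rmd_lines <- c(
  "---",
  'title: "Untitled"',
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r setup, include=FALSE}"),  # runs, but shows nothing in the output
  "knitr::opts_chunk$set(echo = TRUE)",
  fence,
  "",
  "## R Markdown",
  "",
  paste0(fence, "{r cars}"),                  # chunk named 'cars'
  "summary(cars)",
  fence,
  "",
  paste0(fence, "{r pressure, echo=FALSE}"),  # shows the plot but hides the code
  "plot(pressure)",
  fence
)

rmd_file <- file.path(tempdir(), "test.Rmd")
writeLines(rmd_lines, rmd_file)
```

Opening the resulting file in RStudio and clicking Knit, or calling rmarkdown::render(rmd_file) (which needs the rmarkdown package and pandoc), should produce an HTML report.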

Rendering output

Edit the body text if you like, then render the document using the Knit button. If we have not saved the file, we may be prompted for a file name; you can use test.Rmd for now. This will generate a test.html file (per the output: html_document in the header). Let’s open that file. We can see that the document combines the body text, links, R code chunks, and R code outputs, including plots, in a very readable way.

One of the virtues of R Markdown, of course, is that we could produce different output formats for the same file, either by changing the output field in the document header or by issuing a command in the console:

rmarkdown::render("test.Rmd", 
                  output_format = 'pdf_document')
rmarkdown::render("test.Rmd", 
                  output_format = 'word_document')

# More than one output_format
rmarkdown::render("test.Rmd", 
                  output_format = c('html_document',
                                    'pdf_document',
                                    'word_document'))

R Markdown using 2019 R bootcamp data

We can use an R Markdown document bootcamp-survey.Rmd to analyze the survey data. Let’s open it up and see how it looks.

The default format is an html_document, and I’ve added some additional parameters in the header to produce a table of contents (toc: TRUE) with numbered sections (number_sections: TRUE) and a ‘floating’ menu-like table of contents (toc_float: TRUE).

output:
  html_document:
    toc: TRUE
    toc_depth: 3
    toc_float: TRUE
    number_sections: TRUE

I’ve also added parameters so I can easily produce outputs in different formats. Notice that I’ve added comments about what I did and why, so that the R Markdown file is like a combination lab notebook and data report. And by creating it in R Markdown, I can satisfy many audiences with different needs.

Your adviser likes PDFs? No problem. Your collaborator prefers MS Word? Got it covered. Need to give a quick brown bag talk you can give from any web browser? Easy. R Markdown can become one of your super-powers.

More about R Markdown

There is a book called R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, and Garrett Grolemund. Here is a link to the e-version of the book: https://bookdown.org/yihui/rmarkdown/.

Why write reproducible papers/reports in R Markdown?

The previous example showed how we might create reproducible data analysis reports in R Markdown. It’s only a short step to writing full papers this way. But let’s talk about why we might want to do this.

The following section is copied verbatim from Mike Frank & Chris Hartgerink’s tutorial on GitHub.

There are three reasons to write reproducible papers. To be right, to be reproducible, and to be efficient. There are more, but these are convincing to us. In more depth:

To avoid errors. Using an automated method for scraping APA-formatted stats out of PDFs, [@Nuijten2015-ul] found that over 10% of p-values in published papers were inconsistent with the reported details of the statistical test, and 1.6% were what they called “grossly” inconsistent, e.g. difference between the p-value and the test statistic meant that one implied statistical significance and the other did not. Nearly half of all papers had errors in them.

To promote computational reproducibility. Computational reproducibility means that other people can take your data and get the same numbers that are in your paper. Even if you don’t have errors, it can still be very hard to recover the numbers from published papers because of ambiguities in analysis. Creating a document that literally specifies where all the numbers come from in terms of code that operates over the data removes all this ambiguity.

To create spiffy documents that can be revised easily. This is actually a really big neglected one for us. At least one of us used to tweak tables and figures by hand constantly, leading to a major incentive never to rerun analyses because it would mean re-pasting and re-illustratoring all the numbers and figures in a paper. That’s a bad thing! It means you have an incentive to be lazy and to avoid redoing your stuff. And you waste tons of time when you do. In contrast, with a reproducible document, you can just rerun with a tweak to the code. You can even specify what you want the figures and tables to look like before you’re done with all the data collection (e.g., for purposes of preregistraion or a registered report).
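
As a small illustration of numbers that come from code rather than from copy-and-paste, the sketch below builds a reported sentence directly from a fitted test object. It uses R's built-in cars dataset; the wording of the sentence is made up for illustration. In an R Markdown paper the same values would typically be placed in the text with inline `r ...` expressions.

```r
# Fit the test once; every reported number is pulled from this object
fit <- cor.test(cars$speed, cars$dist)

# Build the reported sentence from the fitted object, not by hand
sentence <- sprintf(
  "Speed and stopping distance were correlated, r(%d) = %.2f, p %s.",
  as.integer(fit$parameter),  # degrees of freedom
  fit$estimate,               # correlation coefficient
  if (fit$p.value < .001) "< .001" else sprintf("= %.3f", fit$p.value)
)
print(sentence)
```

If the data or the analysis changes, rerunning the code updates every number in the sentence, so the text cannot drift out of step with the results.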

Advanced topics in R-eproducible science

There is so much more to say about how to do R-eproducible science. Here are some advanced topics we can spend a few minutes discussing, depending on your level of interest. These materials will remain available to you for later reference.

Version control with git and GitHub
Web sites, blogs, (even books) with R Markdown
Write reproducible papers in R Markdown using papaja
- Make this from this

Version control

Track changes is great, right? But if you’ve ever written a lengthy document with other people, you’ve experienced the challenge of tracking versions across time. At some point, the changes become too extensive to track, and so the author(s) decide to accept or reject a bunch and create a new version. This is how version control becomes an extension of the track changes problem. Most of us have experienced something like this sequence: ‘paper.docx’, ‘paper_new.docx’, ‘paper_new_new.docx’, ‘paper_new_new_ROG.docx’, etc.

My current scheme with colleagues is something like this: ‘nsf_grant_2018-08-16v1.docx’, ‘nsf_grant_2018-08-16v2.docx’, etc. That is, each person who modifies the document saves it as a new version. It doesn’t avoid conflicts if we’re working in parallel, but it does help us track down where we went astray.

Imagine a scheme for doing this automatically with your R and RStudio files. RStudio incorporates two ‘version control’ systems from the software development world, ‘git’ and ‘subversion’. I use ‘git’ and a web-based service for managing projects that use git called GitHub.

Rick’s GitHub workflow

We don’t have time to go into git and GitHub here, but I strongly recommend Jenny Bryan’s tutorial Happy Git and GitHub for the useR. In the meantime, this is the workflow I use for almost every project I do that will involve R:

Create a repo on GitHub
Copy repo URL
File/New Project.../Version Control/Git
Paste repo URL
Select local name for repo and directory where it lives.
Open project within RStudio File/Open Project...
Commit (save a commented snapshot) early & often, and push to GitHub

These videos show this workflow in action.

This way, I always have local and web-based copies of my latest work. It’s easy to share with collaborators. I just send them a URL to the project. This is how Michael and I worked together on the bootcamp. And it’s how I create all of my teaching and talk slides.

When using a GitHub workflow with collaborators, you must get in the habit of ‘pushing’ your work to GitHub when you finish and ‘pulling’ the latest version from GitHub when you start working again. Unlike Dropbox or Box, GitHub does not sync your files automatically.

Web sites with R Markdown

When I say most of my work these days is in RStudio, I’m not kidding. It’s easy to create simple websites that start out as R Markdown documents. The bootcamp’s website is an example.

If you’re curious about this, please feel free to examine the ‘guts’ of the bootcamp’s repository. It includes a _site.yml file that contains site configuration parameters, an index.Rmd home page for the site and other *.Rmd files that get converted into pages, and directories for files.

To create the site, I simply enter rmarkdown::render_site() from the console, commit the changes, and push them to GitHub. GitHub has its own web hosting service called GitHub Pages that makes it easy to create and modify simple websites.

The future of reproducible psychological science…

I’m very optimistic about the future of psychological science. Our science is harder than physics, chemistry, and engineering because it incorporates phenomena from all of them. And the survey data show that failures to replicate in these other fields are common, more common than most outsiders would imagine. Behavioral scientists are becoming more mindful of the challenges to robustness in our work and are making strides to bolster it.

Let’s learn more, faster

At the end of the day, by making our research more reproducible, our findings will be more robust, and by sharing our findings, displays, data, and analysis code more widely and openly, we’ll all learn more, faster. That’s Databrary’s motto, by the way.

So, I think that in the very near future we’ll see the following:

Transparent, reproducible, open workflows across the publication cycle
Preregistration common
Replications commonly conducted and published
Openly shared materials + data + code
[@Munafo2017-dc]: reproducible practices across the workflow
- Including procedure videos [@Gilmore2017-eh]
Data, materials (displays, code) stored in central repository
- Open Science Framework (OSF)
- Databrary
[@Gilmore2017-eh]: video and reproducible behavioral science
Access/analyze materials/displays via repository application programming interfaces (APIs)

These will help us build a ‘cumulative’ science [@Mischel2011-br] of behavior.

Learn from my mistakes

I still consider myself a ‘student of R’. I learn new tricks and techniques all the time. You can teach an old dog new tricks. He just has to be motivated. It takes longer, and you have to be patient.

But you can learn from my mistakes:

Script everything you possibly can
- If you have to repeat something, make a function or write a parameterized script
Document all the time
- Comment your code
- Update README files
Don’t be afraid to ask
Don’t be afraid to work in the open
Learn from others
Just do it!
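
As a small illustration of the first item in the list above, a cleaning step that would otherwise be retyped for every column can be turned into a function. The function name and the example values are made up for illustration:

```r
# Hypothetical helper: standardize free-text labels
clean_labels <- function(x) {
  tolower(trimws(x))  # strip surrounding whitespace, then make lower case
}

# Made-up values for illustration
gender <- c(" Female", "MALE ", "female")
clean_labels(gender)
```

Once the step lives in a function, a fix made in one place applies everywhere the function is used.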

In many ways, learning R is like acquiring a super power.

So, go be super-powerful. But remember that with great power comes great responsibility.

Supporting Materials

This document was produced on 2019-08-23 14:12:21 in RStudio version 1.2.1335 using R Markdown. The code and materials used to generate the slides may be found at https://github.com/psu-psychology/r-bootcamp-2019/. Information about the R Session that produced the slides is as follows:

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
## [5] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.1  
## [9] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2        cellranger_1.1.0  pillar_1.4.2     
##  [4] compiler_3.6.1    tools_3.6.1       zeallot_0.1.0    
##  [7] digest_0.6.20     viridisLite_0.3.0 lubridate_1.7.4  
## [10] jsonlite_1.6      evaluate_0.14     nlme_3.1-140     
## [13] gtable_0.3.0      lattice_0.20-38   pkgconfig_2.0.2  
## [16] rlang_0.4.0       cli_1.1.0         rstudioapi_0.10  
## [19] yaml_2.2.0        haven_2.1.1       xfun_0.9         
## [22] withr_2.1.2       xml2_1.2.2        httr_1.4.1       
## [25] knitr_1.24        vctrs_0.2.0       generics_0.0.2   
## [28] hms_0.5.0         grid_3.6.1        tidyselect_0.2.5 
## [31] glue_1.3.1        R6_2.4.0          readxl_1.3.1     
## [34] rmarkdown_1.15    modelr_0.1.5      magrittr_1.5     
## [37] backports_1.1.4   scales_1.0.0      htmltools_0.3.6  
## [40] rvest_0.3.4       assertthat_0.2.1  colorspace_1.4-1 
## [43] stringi_1.4.3     lazyeval_0.2.2    munsell_0.5.0    
## [46] broom_0.5.2       crayon_1.3.4
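
Session details like those above can also be saved alongside the results, so that the record travels with the project. A minimal sketch follows; the file location is a hypothetical choice, and in a real project the file would more likely live in the project directory:

```r
# Save the session details to a text file (hypothetical location)
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```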

References & notes

¹ Hypertext Markup Language (HTML) is the core language of web pages. ‘Markdown’ is a markup language for easily writing a file that gets converted to HTML.
² A best practice is to change the date field so that it is automatically updated each time the file is rendered: date: "`r Sys.time()`".
³ HTML has different heading levels that these map to: <H1>, <H2>, etc.
⁴ R Markdown can contain code written in languages other than R!

R-eproducible Psychological Science

Rick Gilmore

2019-08-23 14:12:18

Themes