Goals

  • Download and clean data from the 2018 R Bootcamp Survey
  • Visualize data

Preliminaries

Load required packages.

library(tidyverse)
library(googlesheets)

Load data and examine

The survey data are stored in a Google Sheet. We’ll use the googlesheets package to open it and create a data frame. Documentation about the package can be found here.

There are some idiosyncrasies in using the googlesheets package in an R Markdown document because it requires interaction with the console, so I created a separate R script, Get_bootcamp_googlesheet.R, to extract the survey data. If you try to execute the next chunk, it may give you an error, or it may ask you to allow googlesheets to access information in your Google profile. This just allows R to grab the data from the Google Sheet using your Google account.

survey_url <- "https://docs.google.com/spreadsheets/d/1-YB0iWUNN_9oxBhz221NFiyBOcwMfHziFeUiUvQwn7k/edit?usp=sharing"

bootcamp_by_url <- gs_url(survey_url)

bootcamp_sheets <- gs_ws_ls(bootcamp_by_url)

boot_data <- bootcamp_by_url %>%
  gs_read(bootcamp_sheets[1])
          
write_csv(boot_data, path=params$data_file_out)

This script downloads the data and saves it as a CSV file at ../data/survey_2018.csv. We can then load this file.
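The googlesheets package has since been retired in favor of googlesheets4. Here is a minimal sketch of the same download with the newer package, assuming the sheet is readable by anyone with the link. The helper name read_bootcamp_sheet is hypothetical and is not part of the original script.

```r
# Hypothetical helper: download the first worksheet of a Google Sheet
# with googlesheets4 and save it as a CSV file.
read_bootcamp_sheet <- function(sheet_url, csv_out) {
  # Skip the OAuth prompt. This only works for sheets that are publicly
  # readable, which is an assumption about this survey sheet.
  googlesheets4::gs4_deauth()
  boot_data <- googlesheets4::read_sheet(sheet_url, sheet = 1)
  readr::write_csv(boot_data, csv_out)
  invisible(boot_data)
}

# Possible usage (needs network access):
# read_bootcamp_sheet(survey_url, "../data/survey_2018.csv")
```

Wrapping the download in a function keeps the network call out of the main document, much like the separate script does.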

I also created a test data file, data/survey-test.csv, so I could see how everything worked before y’all filled out your responses. The R/Make_test_survey.R file shows how I did this. It’s a great, reproducible practice to simulate the data you expect, then run it through your pipeline.
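To give a flavor of what simulating survey data can look like, here is a rough sketch in base R. It is not the contents of R/Make_test_survey.R. The number of respondents, the value ranges, and the use of the short column names from the cleaning step are all assumptions made for illustration.

```r
# Hypothetical simulation of survey responses (not the author's script).
set.seed(2018)  # make the fake data reproducible
n <- 20         # number of simulated respondents (arbitrary)

days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
crisis_opts <- c("Yes, a significant crisis", "Yes, a slight crisis",
                 "No", "Don't know")

survey_test <- data.frame(
  Timestamp     = format(Sys.time(), "%m/%d/%Y %H:%M:%S"),
  R_exp         = sample(c("none", "limited", "some", "lots", "pro"),
                         n, replace = TRUE),
  Banjo         = sample(1:10, n, replace = TRUE),
  Psych_age_yrs = sample(18:60, n, replace = TRUE),
  # Stored as text, like the raw survey responses
  Sleep_hrs     = as.character(sample(seq(5, 10, by = 0.5), n, replace = TRUE)),
  Fav_day       = sample(days, n, replace = TRUE),
  Crisis        = sample(crisis_opts, n, replace = TRUE),
  stringsAsFactors = FALSE
)

str(survey_test)
```

Running the cleaning and plotting steps on a data frame like this shows whether the pipeline works before any real responses arrive.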


# Choose data from respondents
survey <- read_csv(params$data_file_in)
## Parsed with column specification:
## cols(
##   Timestamp = col_character(),
##   `Your current level of experience/expertise with R` = col_character(),
##   `Your enthusiasm for banjo music` = col_integer(),
##   `How old do you feel (in years)` = col_integer(),
##   `Preferred number of hours spent sleeping/day` = col_character(),
##   `Favorite day of the week` = col_character(),
##   `Is there a reproducibility 'crisis' in psychology?` = col_character()
## )
survey
## # A tibble: 56 x 7
##    Timestamp `Your current l… `Your enthusias… `How old do you…
##    <chr>     <chr>                       <int>            <int>
##  1 7/24/201… pro                            10               45
##  2 8/14/201… lots                            5               29
##  3 8/15/201… limited                         1               35
##  4 8/15/201… limited                         2               25
##  5 8/15/201… limited                         1               27
##  6 8/15/201… lots                            3               19
##  7 8/15/201… limited                         2               30
##  8 8/15/201… pro                             2               26
##  9 8/15/201… limited                         1               26
## 10 8/15/201… limited                         3               25
## # ... with 46 more rows, and 3 more variables: `Preferred number of hours
## #   spent sleeping/day` <chr>, `Favorite day of the week` <chr>, `Is there
## #   a reproducibility 'crisis' in psychology?` <chr>

The str() or ‘structure’ command is also a great way to see what you’ve got.

str(survey)
## Classes 'tbl_df', 'tbl' and 'data.frame':    56 obs. of  7 variables:
##  $ Timestamp                                         : chr  "7/24/2018 14:18:42" "8/14/2018 11:37:15" "8/15/2018 9:37:17" "8/15/2018 9:37:37" ...
##  $ Your current level of experience/expertise with R : chr  "pro" "lots" "limited" "limited" ...
##  $ Your enthusiasm for banjo music                   : int  10 5 1 2 1 3 2 2 1 3 ...
##  $ How old do you feel (in years)                    : int  45 29 35 25 27 19 30 26 26 25 ...
##  $ Preferred number of hours spent sleeping/day      : chr  "8" "8.5" "6" "8" ...
##  $ Favorite day of the week                          : chr  "Sunday" "Saturday" "Sunday" "Friday" ...
##  $ Is there a reproducibility 'crisis' in psychology?: chr  "Yes, a significant crisis" "Yes, a slight crisis" "Yes, a significant crisis" "Yes, a significant crisis" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 7
##   .. ..$ Timestamp                                         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Your current level of experience/expertise with R : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Your enthusiasm for banjo music                   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ How old do you feel (in years)                    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Preferred number of hours spent sleeping/day      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Favorite day of the week                          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Is there a reproducibility 'crisis' in psychology?: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Clearly, we need to do some cleaning before we can do anything with this.

Cleaning data

Let’s start by renaming variables.

names(survey) <- c("Timestamp",
                  "R_exp",
                  "Banjo",
                  "Psych_age_yrs",
                  "Sleep_hrs",
                  "Fav_day",
                  "Crisis")
# complete.cases() flags rows with no NAs; keep only those rows
survey <- survey[complete.cases(survey),]
survey
## # A tibble: 56 x 7
##    Timestamp     R_exp  Banjo Psych_age_yrs Sleep_hrs Fav_day Crisis      
##    <chr>         <chr>  <int>         <int> <chr>     <chr>   <chr>       
##  1 7/24/2018 14… pro       10            45 8         Sunday  Yes, a sign…
##  2 8/14/2018 11… lots       5            29 8.5       Saturd… Yes, a slig…
##  3 8/15/2018 9:… limit…     1            35 6         Sunday  Yes, a sign…
##  4 8/15/2018 9:… limit…     2            25 8         Friday  Yes, a sign…
##  5 8/15/2018 9:… limit…     1            27 10        Saturd… Yes, a slig…
##  6 8/15/2018 9:… lots       3            19 7.5       Friday  Yes, a slig…
##  7 8/15/2018 9:… limit…     2            30 10        Saturd… Yes, a slig…
##  8 8/15/2018 9:… pro        2            26 9         Saturd… Yes, a slig…
##  9 8/15/2018 9:… limit…     1            26 8         Sunday  Yes, a slig…
## 10 8/15/2018 9:… limit…     3            25 10        Saturd… Yes, a slig…
## # ... with 46 more rows
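The same two steps could also be written as a pipeline. This is a sketch that assumes dplyr and tidyr (both loaded by tidyverse) and uses a tiny made-up data frame, raw, standing in for the survey.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the raw survey, with two of the long column names
raw <- tibble::tibble(
  `Your enthusiasm for banjo music` = c(10L, NA, 3L),
  `Favorite day of the week`        = c("Sunday", "Friday", NA)
)

cleaned <- raw %>%
  rename(Banjo   = `Your enthusiasm for banjo music`,
         Fav_day = `Favorite day of the week`) %>%
  drop_na()  # keep only rows with no missing values

cleaned
```

Renaming by name rather than by position keeps working if the columns of the sheet are ever reordered.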

Now, let’s make sure we have numbers where we expect them.

survey$Sleep_hrs <- readr::parse_number(survey$Sleep_hrs)
survey
## # A tibble: 56 x 7
##    Timestamp     R_exp  Banjo Psych_age_yrs Sleep_hrs Fav_day Crisis      
##    <chr>         <chr>  <int>         <int>     <dbl> <chr>   <chr>       
##  1 7/24/2018 14… pro       10            45       8   Sunday  Yes, a sign…
##  2 8/14/2018 11… lots       5            29       8.5 Saturd… Yes, a slig…
##  3 8/15/2018 9:… limit…     1            35       6   Sunday  Yes, a sign…
##  4 8/15/2018 9:… limit…     2            25       8   Friday  Yes, a sign…
##  5 8/15/2018 9:… limit…     1            27      10   Saturd… Yes, a slig…
##  6 8/15/2018 9:… lots       3            19       7.5 Friday  Yes, a slig…
##  7 8/15/2018 9:… limit…     2            30      10   Saturd… Yes, a slig…
##  8 8/15/2018 9:… pro        2            26       9   Saturd… Yes, a slig…
##  9 8/15/2018 9:… limit…     1            26       8   Sunday  Yes, a slig…
## 10 8/15/2018 9:… limit…     3            25      10   Saturd… Yes, a slig…
## # ... with 46 more rows
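To see why parse_number() suits a free-text field, here is a small sketch with made-up responses (these strings are illustrative and are not taken from the survey).

```r
library(readr)

# Made-up responses of the kind a free-text field can produce
sleep_raw <- c("8", "8.5", "7 hours", "about 9")

# parse_number() keeps the first number it finds and drops surrounding text
sleep_num <- parse_number(sleep_raw)
sleep_num
```

A response with no digits at all would come back as NA with a parsing warning, so it is worth checking for new NAs after this step.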

Looks good. Let’s save that cleaned file so we don’t have to do this again.

write_csv(survey, path="../data/survey_clean.csv")

We may want to make the R_exp variable ordered.

(survey_responses <- unique(survey$R_exp))
## [1] "pro"                      "lots"                    
## [3] "limited"                  "none, limited, lots, pro"
## [5] "none"

This shows us the different survey response values. It looks like somebody checked all the levels. Let’s change that to limited.

survey$R_exp[survey$R_exp == "none, limited, lots, pro"] <- "limited"
survey$R_exp <- ordered(survey$R_exp, levels=c("none",
                                               "limited",
                                               "some",
                                               "lots",
                                               "pro"))

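Here is a small sketch of what the ordering buys us, using the five distinct response values shown above rather than the full survey.

```r
# The five distinct responses seen in the survey
r_exp <- c("pro", "lots", "limited", "none, limited, lots, pro", "none")

# Recode the "checked everything" response, then impose the order
r_exp[r_exp == "none, limited, lots, pro"] <- "limited"
r_exp <- ordered(r_exp, levels = c("none", "limited", "some", "lots", "pro"))

table(r_exp)     # counts per level, including the unused level "some"
r_exp >= "lots"  # comparisons respect the level order
```

Because the levels are ordered, plots and tables list them from none to pro instead of alphabetically.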
Visualization

Now, we follow Mike Meyer’s advice: “Plot your data!”

Descriptive plots

R_exp_hist <- survey %>%
  ggplot() +
  aes(x=R_exp) +
  geom_histogram(stat = "count") # R_exp is discrete
## Warning: Ignoring unknown parameters: binwidth, bins, pad
R_exp_hist

Distribution of prior R experience
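The warning in the output above appears because geom_histogram() is designed for continuous data. For a discrete variable, geom_bar() counts the cases in each category by default. This is a sketch with made-up values; the axis labels are my own wording.

```r
library(ggplot2)

# Made-up responses standing in for survey$R_exp
toy <- data.frame(
  R_exp = ordered(c("pro", "lots", "limited", "limited", "none"),
                  levels = c("none", "limited", "some", "lots", "pro"))
)

R_exp_bar <- ggplot(toy) +
  aes(x = R_exp) +
  geom_bar() +  # counts rows per category; no stat argument needed
  labs(x = "Prior R experience", y = "Number of respondents")
R_exp_bar
```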

Sleep_hrs_hist <- survey %>%
  ggplot() +
  aes(x=Sleep_hrs) +
  geom_histogram() # Sleep_hrs is continuous
Sleep_hrs_hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Distribution of preferred sleep hrs/day
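The stat_bin() message in the output above suggests choosing the bin width explicitly. Since sleep was mostly reported in whole or half hours, half-hour bins are one reasonable option. This is a sketch with made-up values, and the bin width is a judgment call.

```r
library(ggplot2)

# Made-up values standing in for survey$Sleep_hrs
toy_sleep <- data.frame(Sleep_hrs = c(6, 7.5, 8, 8, 8.5, 9, 10, 10))

Sleep_hrs_hist <- ggplot(toy_sleep) +
  aes(x = Sleep_hrs) +
  geom_histogram(binwidth = 0.5)  # half-hour bins instead of the default 30 bins
Sleep_hrs_hist
```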

Banjo_hist <- survey %>%
  ggplot() +
  aes(x=Banjo) +
  geom_histogram(bins = 10)
Banjo_hist

Distribution of Enthusiasm for Banjo Music


Does R experience have any relation to banjo music enthusiasm or one’s psychological age?

Banjo_vs_r_exp <- survey %>%
  ggplot() +
  aes(x=Banjo, y=Psych_age_yrs) +
  facet_grid(. ~ R_exp) +
  geom_point()
  # + stat_smooth()
Banjo_vs_r_exp

crisis_plot <- survey %>%
  ggplot() +
  aes(x=Crisis) +
  geom_bar()
crisis_plot

Data documentation (codebook)

Every data set should be documented. You can generate a template data codebook with some useful summary information using the package dataMaid.

if(!require(dataMaid)){install.packages('dataMaid')}
## Loading required package: dataMaid
## 
## Attaching package: 'dataMaid'
## The following object is masked from 'package:dplyr':
## 
##     summarize
library(dataMaid)
dataMaid::makeCodebook(data = survey, 
                       reportTitle = 'Codebook for 2018 R bootcamp survey', 
                       replace = TRUE)
## Data report generation is finished. Please wait while your output file is being rendered.

Then, we can look at the codebook_survey.Rmd file and edit it as needed, especially the section with the code descriptions.

# Codebook summary table

------------------------------------------------------------------------------
Label   Variable                Class         # unique  Missing  Description  
                                                values                        
------- ----------------------- ----------- ---------- --------- -------------
        **[Timestamp]**         POSIXct              1  0.00 %   Time & date survey was completed            

        **[R\_exp]**            ordered              5  0.00 %   Levels of R experience: {"none", "limited", "some", "lots", "pro"}

        **[Banjo]**             integer             10  0.00 %   Level of enthusiasm for banjo music [1,10]              

        **[Psych\_age\_yrs]**   integer             33  0.00 %   Age participant reports feeling              

        **[Sleep\_hrs]**        numeric             50  0.00 %   Preferred number of hours/day spent sleeping

        **[Fav\_day]**          Date                 1  0.00 %   Favorite day of the week: {'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'}              

        **[Crisis]**            character            4  0.00 %   Is there a reproducibility 'crisis' in psychology: {"Yes, a significant crisis", "Yes, a slight crisis", "No", "Don't know"}
------------------------------------------------------------------------------

Analysis

I could use a document like this to plan my analysis before I conduct it. If I used simulated data, I could make sure that my workflow will run when I get real (cleaned) data. I could even preregister my analysis plan before I conduct it. That doesn’t preclude later exploratory analyses, but it does hold me and my collaborators accountable for what I predicted in advance.

Notes

Notice that I sometimes put a label like Banjo-vs-r-exp in the braces {} for a given ‘chunk’ of R code. The main reasons to do this are:

  • It sometimes makes it easier to debug your code.
  • In some cases, you can have this ‘chunk’ name serve as the file name for a figure you generate within a chunk.
  • In a bit, we’ll see how these chunk names are useful for making tables, figures, and equations that generate their own numbers.
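As a sketch of the second point, knitr builds figure file names from the fig.path chunk option plus the chunk label. The directory name figs/ below is a hypothetical choice, not something this document sets.

```r
# Assumes the knitr package, which runs the chunks in an R Markdown file.
# "figs/" is a hypothetical output directory chosen for illustration.
knitr::opts_chunk$set(fig.path = "figs/")

# With this option, a chunk whose header reads {r Banjo-vs-r-exp} saves
# its first figure as figs/Banjo-vs-r-exp-1.png when the device is PNG.
```

Setting the option once near the top of the document applies it to every chunk that follows.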