Goals

  • Download and clean data from the 2019 R Bootcamp Survey
  • Visualize data
  • Demonstrate scripting of data gathering and cleaning

Preliminaries

Load required packages.

library(tidyverse)
## ── Attaching packages ──────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(googlesheets)
library(dataMaid)
## 
## Attaching package: 'dataMaid'
## The following object is masked from 'package:dplyr':
## 
##     summarize

Load data and examine

The survey data are stored in a Google Sheet. We’ll use the googlesheets package to open it and create a data frame. Documentation about the package can be found here.

Using the googlesheets package inside an R Markdown document is awkward because it requires interaction with the console, so I created a separate R function to download these data. If you open the R/survey.R file, you will see a function that looks like this:

get_survey_data <- function(verbose = FALSE,
                            sheet_url = "https://docs.google.com/spreadsheets/d/1YrKFrPz38FV-JbJKlp7wGQAw9IagVJFMuLbCLY7HMeA/",
                            sheet_name = 'PSU R Bootcamp 2019 Survey (Responses)') {
  # Download 2019 Bootcamp registration data from GoogleSheet
  library(googledrive)
  library(googlesheets)
  
  drive_auth(use_oob = TRUE)
  options(httr_oob_default = TRUE)
  
  survey_gs <- googlesheets::gs_title(sheet_name)
  survey_data <- googlesheets::gs_read(ss = survey_gs,
                                       ws = 'Form Responses 1')
  survey_data
}

We’ll load a previously saved version of the raw survey data here.

source(params$supporting_functions)
survey <- readr::read_csv(params$data_file_in)
## Parsed with column specification:
## cols(
##   Timestamp = col_character(),
##   `Your current level of experience/expertise with R` = col_character(),
##   `Other programming languages you know` = col_character(),
##   `Your enthusiasm for "Game of Thrones" Season 8.` = col_double(),
##   `Your favorite beverage` = col_character(),
##   `Age in years` = col_double(),
##   `Preferred number of hours spent sleeping/day` = col_character(),
##   `Favorite day of the week` = col_character(),
##   `Are your data tidy?` = col_character(),
##   `Is there a reproducibility crisis?` = col_character()
## )

Inspecting the data

The str() or ‘structure’ command is a great way to see what you’ve got.

str(survey)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 42 obs. of  10 variables:
##  $ Timestamp                                        : chr  "6/25/2019 12:27:56" "7/17/2019 15:45:04" "7/18/2019 16:12:20" "7/18/2019 16:46:18" ...
##  $ Your current level of experience/expertise with R: chr  "some" "some" "some" "none" ...
##  $ Other programming languages you know             : chr  "Python, Unix/Linux shell programming, Swift" "None" "None" "SPSS/SAS syntax, some mplus, lisrel (way back when)" ...
##  $ Your enthusiasm for "Game of Thrones" Season 8.  : num  1 1 1 3 1 1 8 4 1 1 ...
##  $ Your favorite beverage                           : chr  "Coffee" "Tea" "Wine" "Spirits" ...
##  $ Age in years                                     : num  32 26 112 52 24 31 29 26 21 43 ...
##  $ Preferred number of hours spent sleeping/day     : chr  "6" "11" "9" "7-8" ...
##  $ Favorite day of the week                         : chr  "Thursday" "Saturday" "Saturday" "Friday" ...
##  $ Are your data tidy?                              : chr  "Yes" "Yes" "That's a personal question" "That's a personal question" ...
##  $ Is there a reproducibility crisis?               : chr  NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Timestamp = col_character(),
##   ..   `Your current level of experience/expertise with R` = col_character(),
##   ..   `Other programming languages you know` = col_character(),
##   ..   `Your enthusiasm for "Game of Thrones" Season 8.` = col_double(),
##   ..   `Your favorite beverage` = col_character(),
##   ..   `Age in years` = col_double(),
##   ..   `Preferred number of hours spent sleeping/day` = col_character(),
##   ..   `Favorite day of the week` = col_character(),
##   ..   `Are your data tidy?` = col_character(),
##   ..   `Is there a reproducibility crisis?` = col_character()
##   .. )

Clearly, we need to do some cleaning before we can do anything with this.

Cleaning data

Let’s start by turning Timestamp into a proper date and time. The lubridate package helps us convert character strings (chr) into dates and times. The Timestamp variable is in a format common in the U.S. – month/day/year – but this format is not universal. The mdy_hms() function in lubridate parses the month/day/year (mdy) format followed by the time in hours, minutes, and seconds (hms) into a proper date-time object. By default, it assumes the time zone is UTC (Coordinated Universal Time, or ‘Zulu’ time).

survey$Timestamp <- lubridate::mdy_hms(survey$Timestamp)
survey$Timestamp
##  [1] "2019-06-25 12:27:56 UTC" "2019-07-17 15:45:04 UTC"
##  [3] "2019-07-18 16:12:20 UTC" "2019-07-18 16:46:18 UTC"
##  [5] "2019-07-18 17:02:34 UTC" "2019-07-19 11:40:44 UTC"
##  [7] "2019-07-22 11:12:50 UTC" "2019-07-22 16:35:17 UTC"
##  [9] "2019-07-23 09:30:45 UTC" "2019-07-23 09:56:12 UTC"
## [11] "2019-07-23 17:55:25 UTC" "2019-07-24 08:46:34 UTC"
## [13] "2019-07-24 11:11:21 UTC" "2019-07-24 14:39:17 UTC"
## [15] "2019-07-25 14:14:39 UTC" "2019-07-29 17:16:34 UTC"
## [17] "2019-07-30 16:31:40 UTC" "2019-07-31 11:15:38 UTC"
## [19] "2019-08-02 11:51:48 UTC" "2019-08-06 10:47:38 UTC"
## [21] "2019-08-08 10:18:27 UTC" "2019-08-09 09:32:28 UTC"
## [23] "2019-08-09 10:22:19 UTC" "2019-08-09 17:15:10 UTC"
## [25] "2019-08-12 15:30:56 UTC" "2019-08-13 15:48:58 UTC"
## [27] "2019-08-15 11:33:13 UTC" "2019-08-16 12:24:55 UTC"
## [29] "2019-08-16 14:46:58 UTC" "2019-08-17 15:36:31 UTC"
## [31] "2019-08-18 13:32:52 UTC" "2019-08-20 12:48:42 UTC"
## [33] "2019-08-20 15:13:40 UTC" "2019-08-20 17:21:10 UTC"
## [35] "2019-08-20 22:27:34 UTC" "2019-08-20 23:06:26 UTC"
## [37] "2019-08-21 09:22:22 UTC" "2019-08-21 09:30:07 UTC"
## [39] "2019-08-21 09:30:39 UTC" "2019-08-21 10:03:09 UTC"
## [41] "2019-08-21 10:09:30 UTC" "2019-08-21 12:30:22 UTC"
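
To make the parsing step concrete, here is a minimal base R sketch of the same kind of conversion, using the first two time stamps from the str() output above. It only illustrates what mdy_hms() is doing for us and is not a replacement for it; the names raw_stamps and parsed are made up for this example.

```r
# Two raw time stamps in U.S. month/day/year format (copied from the survey output)
raw_stamps <- c("6/25/2019 12:27:56", "7/17/2019 15:45:04")

# Parse with an explicit format string and an explicit time zone
parsed <- as.POSIXct(raw_stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Once parsed, date-times can be compared and subtracted
later <- parsed[2] > parsed[1]
gap <- difftime(parsed[2], parsed[1], units = "days")
print(parsed)
print(gap)
```

Spelling out the format string makes the month/day/year assumption visible, which is exactly the assumption that mdy_hms() encodes in its name.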

We also note that the Other programming languages you know question will need some work to be useful for data analysis. Let’s look at this variable specifically:

survey$`Other programming languages you know`
##  [1] "Python, Unix/Linux shell programming, Swift"                                
##  [2] "None"                                                                       
##  [3] "None"                                                                       
##  [4] "SPSS/SAS syntax, some mplus, lisrel (way back when)"                        
##  [5] "None"                                                                       
##  [6] "SPSS/SAS syntax, MS DOS"                                                    
##  [7] "SPSS/SAS syntax, Mplus"                                                     
##  [8] "Javascript/HTML/CSS, SPSS/SAS syntax"                                       
##  [9] "None"                                                                       
## [10] "None"                                                                       
## [11] "SPSS/SAS syntax"                                                            
## [12] "SPSS/SAS syntax"                                                            
## [13] "Java"                                                                       
## [14] "Python"                                                                     
## [15] "SPSS/SAS syntax, Unix/Linux shell programming"                              
## [16] "SPSS/SAS syntax"                                                            
## [17] "Python, SPSS/SAS syntax"                                                    
## [18] "SPSS/SAS syntax"                                                            
## [19] "SPSS/SAS syntax, Mplus"                                                     
## [20] "Python, MATLAB, SPSS/SAS syntax, Unix/Linux shell programming"              
## [21] "Python"                                                                     
## [22] "Python, Praat"                                                              
## [23] "None"                                                                       
## [24] "Python, SPSS/SAS syntax"                                                    
## [25] "SPSS/SAS syntax"                                                            
## [26] "SPSS/SAS syntax"                                                            
## [27] "None"                                                                       
## [28] "None"                                                                       
## [29] "SPSS/SAS syntax, Lisrel, MPlus"                                             
## [30] "Python, MATLAB"                                                             
## [31] "SPSS/SAS syntax"                                                            
## [32] "Python, MATLAB, SPSS/SAS syntax"                                            
## [33] "None"                                                                       
## [34] "I have experience with using SPSS scripts, but not with writing them myself"
## [35] "SPSS/SAS syntax"                                                            
## [36] "SPSS/SAS syntax"                                                            
## [37] "SPSS/SAS syntax"                                                            
## [38] "None"                                                                       
## [39] "Python, MATLAB"                                                             
## [40] "SPSS/SAS syntax"                                                            
## [41] "C/C++"                                                                      
## [42] "SPSS/SAS syntax"

We’ll have to parse this into different languages. Note that we can refer to the Other programming languages you know variable by wrapping its name in back-tick (`) characters. A bit later, we’ll want to simplify these variable names. For now, the following function from R/survey.R does most of what we need to do.

clean_other_languages <- function(df) {
  # Clean the 'Other programming languages you know' field
  out_df <- df
  
  # Create Booleans for different languages/language categories
  python <- stringr::str_detect(df$`Other programming languages you know`, "(P|p)ython")
  spss_sas <- stringr::str_detect(df$`Other programming languages you know`, "SPSS/SAS")
  mplus <- stringr::str_detect(df$`Other programming languages you know`, "(M|m)plus")
  lisrel <- stringr::str_detect(df$`Other programming languages you know`, "(L|l)isrel")
  none <- stringr::str_detect(df$`Other programming languages you know`, "None")
  js_html_css <- stringr::str_detect(df$`Other programming languages you know`, "HTML")
  java <- stringr::str_detect(df$`Other programming languages you know`, "Java")
  unix <- stringr::str_detect(df$`Other programming languages you know`, "nix")
  swift <- stringr::str_detect(df$`Other programming languages you know`, "Swift")
  msdos <- stringr::str_detect(df$`Other programming languages you know`, "MS DOS")
  
  # Create new fields for each language; easier to gather separately
  out_df$python <- NA
  out_df$python[python == TRUE] <- "python"
  
  out_df$spss_sas <- NA
  out_df$spss_sas[spss_sas == TRUE] <- "spss_sas"
  
  out_df$mplus <- NA
  out_df$mplus[mplus == TRUE] <- "mplus"
  
  out_df$lisrel <- NA
  out_df$lisrel[lisrel == TRUE] <- "lisrel"
  
  out_df$none <- NA
  out_df$none[none == TRUE] <- "none"
  
  out_df$js_html_css <- NA
  out_df$js_html_css[js_html_css == TRUE] <- "js_html_css"
  
  out_df$java <- NA
  out_df$java[java == TRUE] <- "java"
  
  out_df$unix <- NA
  out_df$unix[unix == TRUE] <- "unix"
  
  out_df$swift <- NA
  out_df$swift[swift == TRUE] <- "swift"
  
  out_df$msdos <- NA
  out_df$msdos[msdos == TRUE] <- "msdos"
  
  return(out_df)
}

Let’s run it.

survey <- clean_other_languages(survey)

This function creates a new character variable for each language we pulled out of the Other programming languages you know column. Each new variable holds the language name when that language was detected and NA otherwise. Later, we’ll combine these into a single variable.
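
Pattern matching of this kind deserves a quick check against the raw responses. The base R sketch below uses three responses copied from the output above (the vector name langs is made up for this example). It suggests two things to watch for: the pattern "Java" also matches "Javascript", and "(M|m)plus" does not match the spelling "MPlus". Word boundaries and case-insensitive matching are one possible remedy; whether they suit the full data set is an assumption worth testing.

```r
# Three responses copied from the survey output
langs <- c("Javascript/HTML/CSS, SPSS/SAS syntax",
           "Java",
           "SPSS/SAS syntax, Lisrel, MPlus")

# A bare "Java" pattern matches any string containing those four letters
java_loose <- grepl("Java", langs)

# Word boundaries restrict the match to "Java" as a whole word
java_strict <- grepl("\\bJava\\b", langs, perl = TRUE)

# "(M|m)plus" requires a lowercase "plus" after the first letter
mplus_strict <- grepl("(M|m)plus", langs)

# Ignoring case matches any capitalization
mplus_loose <- grepl("mplus", langs, ignore.case = TRUE)

print(data.frame(langs, java_loose, java_strict, mplus_strict, mplus_loose))
```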

Next, we notice that Preferred number of hours spent sleeping/day is a character (chr) variable, and we really want this to be a number.

survey$`Preferred number of hours spent sleeping/day`
##  [1] "6"     "11"    "9"     "7-8"   "7.5"   "7.7"   "7.5"   "8"    
##  [9] "9"     "7"     "7.249" "9"     "9"     "9"     "7"     "5.5"  
## [17] "9"     "8"     "12"    "7"     "9"     "8"     "10"    "10"   
## [25] "8-9"   "8"     "7-8"   "8"     "9"     "8"     "8"     "9"    
## [33] "9"     "8"     "8"     "10"    "7.5"   "8"     "9-10"  "7"    
## [41] "9"     "9"

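The following function fixes the “7-8” and “8-9” responses by replacing them with their midpoints. It does not handle the “9-10” response, so as.numeric() converts that value to NA and issues a coercion warning.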

clean_hrs_sleep <- function(df) {
  # `Preferred number of hours spent sleeping/day`
  
  # "7-8"
  clean_this <- df$`Preferred number of hours spent sleeping/day` == "7-8"
  df$`Preferred number of hours spent sleeping/day`[clean_this] <- "7.5"

  # 8-9
  clean_this_too <- df$`Preferred number of hours spent sleeping/day` == "8-9"
  df$`Preferred number of hours spent sleeping/day`[clean_this_too] <- "8.5"

  df$`Preferred number of hours spent sleeping/day` <- as.numeric(df$`Preferred number of hours spent sleeping/day`)
  
  return(df)
}

So, we’ll run that.

survey <- clean_hrs_sleep(survey)
## Warning in clean_hrs_sleep(survey): NAs introduced by coercion
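
The coercion warning comes from the “9-10” response, which the function does not handle explicitly. As a hedged alternative, the sketch below converts any response of the form “a-b” to its midpoint rather than listing each range by hand. The function name range_to_midpoint is hypothetical and is not part of R/survey.R.

```r
# Convert strings such as "7-8" to the midpoint of the range; plain numbers pass through
range_to_midpoint <- function(x) {
  x <- trimws(x)
  is_range <- grepl("^[0-9.]+[[:space:]]*-[[:space:]]*[0-9.]+$", x)
  out <- rep(NA_real_, length(x))

  # Plain numeric strings: convert directly (anything unparseable stays NA)
  out[!is_range] <- suppressWarnings(as.numeric(x[!is_range]))

  # Ranges: split on the hyphen and average the two ends
  parts <- strsplit(x[is_range], "[[:space:]]*-[[:space:]]*")
  out[is_range] <- vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
  out
}

sleep_raw <- c("6", "7-8", "8-9", "9-10", "7.249")
sleep_num <- range_to_midpoint(sleep_raw)
print(sleep_num)
```

Whether a midpoint is the right summary of a range is a judgment call; the point is only that one rule can cover every range that turns up in the data.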

Now, we’re ready to create shorter but still human-readable names for the variables. My preferred style is to use lowercase names with underscores.

clean_survey_names <- function(df) {
  # Create shorter names for variables
  df <- dplyr::rename(df, time_stamp = Timestamp)
  df <- dplyr::rename(df, r_exp = `Your current level of experience/expertise with R`)
  df <- dplyr::rename(df, other_langs = `Other programming languages you know`)
  df <- dplyr::rename(df, beverage = `Your favorite beverage`)
  df <- dplyr::rename(df, age_yrs = `Age in years`)
  df <- dplyr::rename(df, sleep_hrs = `Preferred number of hours spent sleeping/day`)
  df <- dplyr::rename(df, got_s8 =  `Your enthusiasm for \"Game of Thrones\" Season 8.`)
  df <- dplyr::rename(df, day = `Favorite day of the week`)
  df <- dplyr::rename(df, tidy_data = `Are your data tidy?`)
  
  return(df) 
}

Let’s clean the names.

survey <- clean_survey_names(survey)
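
If the list of variables grows, one option is to keep the old and new names together in a single lookup vector. The base R sketch below shows the idea on a toy data frame; rename_from_lookup, new_names, and toy are names invented for this example, and the real function in R/survey.R uses dplyr::rename() as shown above.

```r
# Toy data frame with three of the original survey column names
toy <- data.frame(Timestamp = "6/25/2019 12:27:56",
                  `Age in years` = 32,
                  `Your favorite beverage` = "Coffee",
                  check.names = FALSE,
                  stringsAsFactors = FALSE)

# Lookup: names are the old column names, values are the new ones
new_names <- c("Timestamp" = "time_stamp",
               "Age in years" = "age_yrs",
               "Your favorite beverage" = "beverage")

# Rename only the columns that appear in the lookup; leave the rest alone
rename_from_lookup <- function(df, lookup) {
  hits <- names(df) %in% names(lookup)
  names(df)[hits] <- unname(lookup[names(df)[hits]])
  df
}

toy <- rename_from_lookup(toy, new_names)
print(names(toy))
```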

Next, we’ll want to make a ‘tidier’ data frame with the languages a person reports knowing. A tidy data file for this would be ‘longer’, with a column called, say, lang_known, and with the values in the other fields duplicated for each language a person knows.

gather_known_langs <- function(df) {
  # Create tidy data tibble when there are multiple languages known
  df1 <- dplyr::select(df, time_stamp, python:msdos)
  df2 <- tidyr::gather(df1, "lang", "lang_known", -time_stamp)
  df3 <- dplyr::filter(df2, !is.na(lang_known))
  df4 <- dplyr::select(df3, -lang)
  df5 <- dplyr::left_join(df, df4, by = 'time_stamp')
  df6 <- dplyr::select(df5, -other_langs, -(python:msdos))
  return(df6)
}

This function relies on tidyr::gather() to do the reshaping. Michael is about to go into great detail about this sort of data munging.

survey <- gather_known_langs(survey)
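
To see what ‘longer’ means here, the base R sketch below reshapes a toy data frame with two language columns into one row per person per known language. It mirrors the gather-then-filter idea on made-up data (the names wide, long, and id are hypothetical) and is not the function from R/survey.R.

```r
# Toy 'wide' data: one row per person, one column per language (NA = not known)
wide <- data.frame(id = 1:3,
                   python = c("python", NA, "python"),
                   spss_sas = c(NA, "spss_sas", "spss_sas"),
                   stringsAsFactors = FALSE)

# Stack the language columns on top of each other, keeping the id alongside
lang_cols <- c("python", "spss_sas")
long <- do.call(rbind, lapply(lang_cols, function(col) {
  data.frame(id = wide$id, lang_known = wide[[col]], stringsAsFactors = FALSE)
}))

# Drop the languages a person does not know, then sort by person
long <- long[!is.na(long$lang_known), ]
long <- long[order(long$id), ]
rownames(long) <- NULL
print(long)
```

Person 3 knows two languages and so appears twice in the long version, which is the duplication described above.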

Finally, we’ll clean up the Is there a reproducibility crisis? variable, which appears as crisis in the cleaned data.

survey <- clean_repro_crisis(survey)

Cleaning data with functions

You’ll note that I’ve written separate functions to deal with each step of the data cleaning. This seems like good practice to me since I really want to think about each variable separately. If I create a separate function to clean each variable, I can also keep separate things separate.

At the top of the R/survey.R file, you’ll see functions that combine these simpler functions. The one that combines the cleaning steps looks like this:

clean_survey_data <- function(df) {
  # Clean the 2019 R Bootcamp Survey Data
  
  df_0 <- clean_timestamp(df)
  df_1 <- clean_other_languages(df_0)
  df_2 <- clean_hrs_sleep(df_1)
  df_3 <- clean_survey_names(df_2)
  df_4 <- gather_known_langs(df_3)
  df_5 <- clean_repro_crisis(df_4)
  
  return(df_5)
}
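
The same chaining idea can be written so that the list of steps is itself data. The sketch below is only an illustration with two made-up cleaning steps (drop_missing_age and add_age_decade are hypothetical stand-ins, not functions from R/survey.R); it applies each step in order using base R’s Reduce().

```r
# Hypothetical stand-ins for real cleaning functions: each takes and returns a data frame
drop_missing_age <- function(df) df[!is.na(df$age_yrs), , drop = FALSE]
add_age_decade <- function(df) {
  df$age_decade <- 10 * floor(df$age_yrs / 10)
  df
}

# Apply a list of cleaning steps to a data frame, one after another
run_steps <- function(df, steps) {
  Reduce(function(d, step) step(d), steps, df)
}

toy <- data.frame(age_yrs = c(32, NA, 26))
cleaned <- run_steps(toy, list(drop_missing_age, add_age_decade))
print(cleaned)
```

Adding or reordering a step then means editing one list rather than renumbering intermediate data frames.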

This means that to update and clean the survey data, something I’ve done many times, I just run:

survey <- get_survey_data()
survey <- clean_survey_data(survey)

This automates the entire process and makes it REPRODUCIBLE.

The last step is to save the new cleaned file so we don’t have to do this again.

save_survey_data(survey, params$data_file_out)
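
The save_survey_data() function is not shown here. As a minimal sketch, assuming it does little more than write a CSV file, a stand-in might look like the following. The name save_survey_csv, the toy data, and the temporary file path are all hypothetical.

```r
# Hypothetical stand-in: write a data frame to a CSV file and return the path invisibly
save_survey_csv <- function(df, path) {
  write.csv(df, file = path, row.names = FALSE)
  invisible(path)
}

toy <- data.frame(age_yrs = c(32, 26),
                  beverage = c("Coffee", "Tea"),
                  stringsAsFactors = FALSE)

# Save to a temporary file, then read it back to confirm the round trip
out_file <- tempfile(fileext = ".csv")
save_survey_csv(toy, out_file)
reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
print(reloaded)
```

Reading the file back in right after saving it is a cheap way to confirm that what was saved is what we meant to save.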

Indeed, I’ve written an update_survey_data() function that combines all of these gathering, cleaning, and saving steps:

update_survey_data <- function() {
  save_survey_data(
    clean_survey_data(
      get_survey_data()
      )
  )
}

Visualization

Now, we follow Mike Meyer’s advice: “Plot your data!” Who’s Mike Meyer? Rick’s stats professor from grad school. He had us use ‘Splus’, a forerunner of R, and he was an inspiring and funny professor.

Descriptive plots

R_exp_hist <- survey %>%
  ggplot() +
  aes(x=r_exp) +
  geom_histogram(stat = "count") # R_exp is discrete
## Warning: Ignoring unknown parameters: binwidth, bins, pad
R_exp_hist
Distribution of prior R experience

We observe that this is not ordered in the way we’d expect, so let’s fix that.

survey$r_exp <- ordered(survey$r_exp, c("none", "limited", "some", "pro"))

R_exp_hist <- survey %>%
  ggplot() +
  aes(x=r_exp) +
  geom_histogram(stat = "count") # R_exp is discrete
## Warning: Ignoring unknown parameters: binwidth, bins, pad
R_exp_hist

Much better!
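
Why does this work? By default, R orders the levels of a categorical variable alphabetically, which is rarely the order we mean. The small base R sketch below (with a made-up vector of responses) shows the difference between the default order and the explicit order we supplied.

```r
# A few made-up responses using the survey's four experience levels
r_exp <- c("some", "none", "pro", "limited", "some")

# Default factor levels are sorted alphabetically
default_levels <- levels(factor(r_exp))

# An ordered factor uses the order we give it
r_exp_ord <- ordered(r_exp, c("none", "limited", "some", "pro"))
ordered_levels <- levels(r_exp_ord)

# Ordered factors also support comparisons: is "some" more than "none"?
some_beats_none <- r_exp_ord[1] > r_exp_ord[2]

print(default_levels)
print(ordered_levels)
print(table(r_exp_ord))
```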

Sleep_hrs_hist <- survey %>%
  ggplot() +
  aes(x=sleep_hrs) +
  geom_histogram() # Sleep_hrs is continuous
Sleep_hrs_hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Distribution of preferred sleep hrs/day

Data documentation (codebook)

Every data set should be documented. You can generate a template data codebook with some useful summary information using the package dataMaid.

if(!require(dataMaid)){install.packages('dataMaid')}
library(dataMaid)
dataMaid::makeCodebook(data = survey, 
                       reportTitle = 'Codebook for 2019 R bootcamp survey', 
                       replace = TRUE)
## Data report generation is finished. Please wait while your output file is being rendered.

Then, we can look at the codebook_survey.Rmd file and edit it as needed, especially the section with the code descriptions.

---------------------------------------------------------------------------
Label   Variable            Class         # unique  Missing   Description  
                                            values                         
------- ------------------- ----------- ---------- ---------- -------------
        **[time\_stamp]**   POSIXct             15   0.00 %                

        **[r\_exp]**        ordered              4   0.00 %                

        **[got\_s8]**       numeric              7   0.00 %                

        **[beverage]**      character            6   0.00 %                

        **[age\_yrs]**      numeric             13   0.00 %                

        **[sleep\_hrs]**    numeric              8   0.00 %                

        **[day]**           character            5   0.00 %                

        **[tidy\_data]**    character            3   0.00 %                

        **[crisis]**        factor               1  100.00 %               

        **[lang\_known]**   character           10   0.00 %                
---------------------------------------------------------------------------

Analysis

I could use a document like this to plan out my analysis before I conduct it. If I used simulated data, I could make sure that my workflow will run when I get real (cleaned) data. I could even preregister my analysis plan before I conduct it. That doesn’t preclude later exploratory analyses, but it does hold me and my collaborators accountable for what I predicted in advance.
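
Here is a minimal sketch of what simulating data for this purpose might look like. The variable names follow the cleaned survey, but the sample size, the value ranges, and the planned summary are assumptions chosen only for illustration.

```r
set.seed(2019)  # fixed seed so the simulated data are the same on every run
n <- 40         # assumed sample size

# Simulate three of the cleaned survey variables with plausible (made-up) ranges
sim_survey <- data.frame(
  r_exp = sample(c("none", "limited", "some", "pro"), n, replace = TRUE),
  age_yrs = sample(18:70, n, replace = TRUE),
  sleep_hrs = round(runif(n, min = 5, max = 12), 1),
  stringsAsFactors = FALSE
)
sim_survey$r_exp <- ordered(sim_survey$r_exp, c("none", "limited", "some", "pro"))

# A planned analysis step can now be run and debugged before the real data arrive
sleep_by_exp <- aggregate(sleep_hrs ~ r_exp, data = sim_survey, FUN = mean)
print(sleep_by_exp)
```

Because the simulated variables have the same names and types as the cleaned data, the analysis code should run unchanged once the real file is swapped in.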

Notes

Notice that I sometimes put a label like R-exp-hist in the braces {} for a given ‘chunk’ of R code. The main reasons to do this are:

  • It sometimes makes it easier to debug your code.
  • In some cases, you can have this ‘chunk’ name serve as the file name for a figure you generate within a chunk.
  • These chunk names are useful for making tables, figures, and equations that generate their own numbers.