Load required packages.
library(tidyverse)
## ── Attaching packages ──────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(googlesheets)
library(dataMaid)
##
## Attaching package: 'dataMaid'
## The following object is masked from 'package:dplyr':
##
## summarize
The survey data are stored in a Google Sheet. We’ll use the googlesheets
package to open it and create a data frame. Documentation about the package can be found here.
There are some idiosyncrasies in using the googlesheets
package in an R Markdown document because it requires interaction with the console, so I created a separate R function to gather/get/download these data. If you open the R/survey.R
file, you will see a function that looks like this:
get_survey_data <- function(verbose = FALSE,
                            sheet_url = "https://docs.google.com/spreadsheets/d/1YrKFrPz38FV-JbJKlp7wGQAw9IagVJFMuLbCLY7HMeA/",
                            sheet_name = 'PSU R Bootcamp 2019 Survey (Responses)') {
# Download 2019 Bootcamp registration data from GoogleSheet
library(googledrive)
library(googlesheets)
drive_auth(use_oob = TRUE)
options(httr_oob_default = TRUE)
survey_gs <- googlesheets::gs_title(sheet_name)
survey_data <- googlesheets::gs_read(ss = survey_gs,
ws = 'Form Responses 1')
survey_data
}
We’ll load a previously saved version of the raw survey data here.
source(params$supporting_functions)
survey <- readr::read_csv(params$data_file_in)
## Parsed with column specification:
## cols(
## Timestamp = col_character(),
## `Your current level of experience/expertise with R` = col_character(),
## `Other programming languages you know` = col_character(),
## `Your enthusiasm for "Game of Thrones" Season 8.` = col_double(),
## `Your favorite beverage` = col_character(),
## `Age in years` = col_double(),
## `Preferred number of hours spent sleeping/day` = col_character(),
## `Favorite day of the week` = col_character(),
## `Are your data tidy?` = col_character(),
## `Is there a reproducibility crisis?` = col_character()
## )
The str()
or ‘structure’ command is also a great way to see what you’ve got.
str(survey)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 42 obs. of 10 variables:
## $ Timestamp : chr "6/25/2019 12:27:56" "7/17/2019 15:45:04" "7/18/2019 16:12:20" "7/18/2019 16:46:18" ...
## $ Your current level of experience/expertise with R: chr "some" "some" "some" "none" ...
## $ Other programming languages you know : chr "Python, Unix/Linux shell programming, Swift" "None" "None" "SPSS/SAS syntax, some mplus, lisrel (way back when)" ...
## $ Your enthusiasm for "Game of Thrones" Season 8. : num 1 1 1 3 1 1 8 4 1 1 ...
## $ Your favorite beverage : chr "Coffee" "Tea" "Wine" "Spirits" ...
## $ Age in years : num 32 26 112 52 24 31 29 26 21 43 ...
## $ Preferred number of hours spent sleeping/day : chr "6" "11" "9" "7-8" ...
## $ Favorite day of the week : chr "Thursday" "Saturday" "Saturday" "Friday" ...
## $ Are your data tidy? : chr "Yes" "Yes" "That's a personal question" "That's a personal question" ...
## $ Is there a reproducibility crisis? : chr NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. Timestamp = col_character(),
## .. `Your current level of experience/expertise with R` = col_character(),
## .. `Other programming languages you know` = col_character(),
## .. `Your enthusiasm for "Game of Thrones" Season 8.` = col_double(),
## .. `Your favorite beverage` = col_character(),
## .. `Age in years` = col_double(),
## .. `Preferred number of hours spent sleeping/day` = col_character(),
## .. `Favorite day of the week` = col_character(),
## .. `Are your data tidy?` = col_character(),
## .. `Is there a reproducibility crisis?` = col_character()
## .. )
Clearly, we need to do some cleaning before we can do anything with this.
Let’s start by turning Timestamp into a proper date and time. The lubridate package helps us convert character strings (chr) into dates and times. The Timestamp variable is in a format common in the U.S. – month/day/year – but this format is not universal. The mdy_hms() function in lubridate parses the month/day/year (mdy) format followed by a time in hours, minutes, and seconds (hms), and returns date-times in UTC (Coordinated Universal Time, or ‘Zulu’ time), a more flexible and universal representation.
survey$Timestamp <- lubridate::mdy_hms(survey$Timestamp)
survey$Timestamp
## [1] "2019-06-25 12:27:56 UTC" "2019-07-17 15:45:04 UTC"
## [3] "2019-07-18 16:12:20 UTC" "2019-07-18 16:46:18 UTC"
## [5] "2019-07-18 17:02:34 UTC" "2019-07-19 11:40:44 UTC"
## [7] "2019-07-22 11:12:50 UTC" "2019-07-22 16:35:17 UTC"
## [9] "2019-07-23 09:30:45 UTC" "2019-07-23 09:56:12 UTC"
## [11] "2019-07-23 17:55:25 UTC" "2019-07-24 08:46:34 UTC"
## [13] "2019-07-24 11:11:21 UTC" "2019-07-24 14:39:17 UTC"
## [15] "2019-07-25 14:14:39 UTC" "2019-07-29 17:16:34 UTC"
## [17] "2019-07-30 16:31:40 UTC" "2019-07-31 11:15:38 UTC"
## [19] "2019-08-02 11:51:48 UTC" "2019-08-06 10:47:38 UTC"
## [21] "2019-08-08 10:18:27 UTC" "2019-08-09 09:32:28 UTC"
## [23] "2019-08-09 10:22:19 UTC" "2019-08-09 17:15:10 UTC"
## [25] "2019-08-12 15:30:56 UTC" "2019-08-13 15:48:58 UTC"
## [27] "2019-08-15 11:33:13 UTC" "2019-08-16 12:24:55 UTC"
## [29] "2019-08-16 14:46:58 UTC" "2019-08-17 15:36:31 UTC"
## [31] "2019-08-18 13:32:52 UTC" "2019-08-20 12:48:42 UTC"
## [33] "2019-08-20 15:13:40 UTC" "2019-08-20 17:21:10 UTC"
## [35] "2019-08-20 22:27:34 UTC" "2019-08-20 23:06:26 UTC"
## [37] "2019-08-21 09:22:22 UTC" "2019-08-21 09:30:07 UTC"
## [39] "2019-08-21 09:30:39 UTC" "2019-08-21 10:03:09 UTC"
## [41] "2019-08-21 10:09:30 UTC" "2019-08-21 12:30:22 UTC"
We also note that the Other programming languages you know
question will need some work to be useful for data analysis. Let’s look at this variable specifically:
survey$`Other programming languages you know`
## [1] "Python, Unix/Linux shell programming, Swift"
## [2] "None"
## [3] "None"
## [4] "SPSS/SAS syntax, some mplus, lisrel (way back when)"
## [5] "None"
## [6] "SPSS/SAS syntax, MS DOS"
## [7] "SPSS/SAS syntax, Mplus"
## [8] "Javascript/HTML/CSS, SPSS/SAS syntax"
## [9] "None"
## [10] "None"
## [11] "SPSS/SAS syntax"
## [12] "SPSS/SAS syntax"
## [13] "Java"
## [14] "Python"
## [15] "SPSS/SAS syntax, Unix/Linux shell programming"
## [16] "SPSS/SAS syntax"
## [17] "Python, SPSS/SAS syntax"
## [18] "SPSS/SAS syntax"
## [19] "SPSS/SAS syntax, Mplus"
## [20] "Python, MATLAB, SPSS/SAS syntax, Unix/Linux shell programming"
## [21] "Python"
## [22] "Python, Praat"
## [23] "None"
## [24] "Python, SPSS/SAS syntax"
## [25] "SPSS/SAS syntax"
## [26] "SPSS/SAS syntax"
## [27] "None"
## [28] "None"
## [29] "SPSS/SAS syntax, Lisrel, MPlus"
## [30] "Python, MATLAB"
## [31] "SPSS/SAS syntax"
## [32] "Python, MATLAB, SPSS/SAS syntax"
## [33] "None"
## [34] "I have experience with using SPSS scripts, but not with writing them myself"
## [35] "SPSS/SAS syntax"
## [36] "SPSS/SAS syntax"
## [37] "SPSS/SAS syntax"
## [38] "None"
## [39] "Python, MATLAB"
## [40] "SPSS/SAS syntax"
## [41] "C/C++"
## [42] "SPSS/SAS syntax"
We’ll have to parse this into separate languages. Note that we can refer to the Other programming languages you know variable by wrapping its name in back-tick (`) characters. A bit later, we’ll want to simplify these variable names. For now, the following function from R/survey.R does most of what we need to do.
clean_other_languages <- function(df) {
# Clean the 'Other programming languages you know' field
out_df <- df
# Create Booleans for different languages/language categories
python <- stringr::str_detect(df$`Other programming languages you know`, "(P|p)ython")
spss_sas <- stringr::str_detect(df$`Other programming languages you know`, "SPSS/SAS")
mplus <- stringr::str_detect(df$`Other programming languages you know`, "(M|m)plus")
lisrel <- stringr::str_detect(df$`Other programming languages you know`, "(L|l)isrel")
none <- stringr::str_detect(df$`Other programming languages you know`, "None")
js_html_css <- stringr::str_detect(df$`Other programming languages you know`, "HTML")
java <- stringr::str_detect(df$`Other programming languages you know`, "Java")
unix <- stringr::str_detect(df$`Other programming languages you know`, "nix")
swift <- stringr::str_detect(df$`Other programming languages you know`, "Swift")
msdos <- stringr::str_detect(df$`Other programming languages you know`, "MS DOS")
# Create new fields for each language; easier to gather separately
out_df$python <- NA
out_df$python[python == TRUE] <- "python"
out_df$spss_sas <- NA
out_df$spss_sas[spss_sas == TRUE] <- "spss_sas"
out_df$mplus <- NA
out_df$mplus[mplus == TRUE] <- "mplus"
out_df$lisrel <- NA
out_df$lisrel[lisrel == TRUE] <- "lisrel"
out_df$none <- NA
out_df$none[none == TRUE] <- "none"
out_df$js_html_css <- NA
out_df$js_html_css[js_html_css == TRUE] <- "js_html_css"
out_df$java <- NA
out_df$java[java == TRUE] <- "java"
out_df$unix <- NA
out_df$unix[unix == TRUE] <- "unix"
out_df$swift <- NA
out_df$swift[swift == TRUE] <- "swift"
out_df$msdos <- NA
out_df$msdos[msdos == TRUE] <- "msdos"
return(out_df)
}
Let’s run it.
survey <- clean_other_languages(survey)
This function creates new Boolean variables for each language we pulled out of the Other programming languages you know
column. Later, we’ll combine these into a single variable.
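As a quick sanity check (this snippet is mine, not part of R/survey.R), we can glance at the original column alongside the new language columns:
# Quick check on the new columns created by clean_other_languages()
dplyr::select(survey, `Other programming languages you know`, python:msdos)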
Next, we notice that the Preferred number of hours spent sleeping/day variable is a character (chr) variable, and we really want it to be a number.
survey$`Preferred number of hours spent sleeping/day`
## [1] "6" "11" "9" "7-8" "7.5" "7.7" "7.5" "8"
## [9] "9" "7" "7.249" "9" "9" "9" "7" "5.5"
## [17] "9" "8" "12" "7" "9" "8" "10" "10"
## [25] "8-9" "8" "7-8" "8" "9" "8" "8" "9"
## [33] "9" "8" "8" "10" "7.5" "8" "9-10" "7"
## [41] "9" "9"
The following function fixes the “7-8” and “8-9” problems:
clean_hrs_sleep <- function(df) {
# `Preferred number of hours spent sleeping/day`
# "7-8"
clean_this <- df$`Preferred number of hours spent sleeping/day` == "7-8"
df$`Preferred number of hours spent sleeping/day`[clean_this] <- "7.5"
# 8-9
clean_this_too <- df$`Preferred number of hours spent sleeping/day` == "8-9"
df$`Preferred number of hours spent sleeping/day`[clean_this_too] <- "8.5"
df$`Preferred number of hours spent sleeping/day` <- as.numeric(df$`Preferred number of hours spent sleeping/day`)
return(df)
}
So, we’ll run that.
survey <- clean_hrs_sleep(survey)
## Warning in clean_hrs_sleep(survey): NAs introduced by coercion
The warning is expected: one response (“9-10”) isn’t handled by the function, so as.numeric() turns it into NA.
Now, we’re ready to create shorter, but still-human-readable names for the variables. My preferred style is to use lowercase names with underscores.
clean_survey_names <- function(df) {
# Create shorter names for variables
df <- dplyr::rename(df, time_stamp = Timestamp)
df <- dplyr::rename(df, r_exp = `Your current level of experience/expertise with R`)
df <- dplyr::rename(df, other_langs = `Other programming languages you know`)
df <- dplyr::rename(df, beverage = `Your favorite beverage`)
df <- dplyr::rename(df, age_yrs = `Age in years`)
df <- dplyr::rename(df, sleep_hrs = `Preferred number of hours spent sleeping/day`)
df <- dplyr::rename(df, got_s8 = `Your enthusiasm for \"Game of Thrones\" Season 8.`)
df <- dplyr::rename(df, day = `Favorite day of the week`)
df <- dplyr::rename(df, tidy_data = `Are your data tidy?`)
return(df)
}
Let’s clean the names.
survey <- clean_survey_names(survey)
Next, we’ll want to make a ‘tidier’ data frame with the languages a person reports knowing. A tidy data file for this would be ‘longer’, with a column called, say, lang_known, and duplicate values in the other fields.
gather_known_langs <- function(df) {
# Create tidy data tibble when there are multiple languages known
df1 <- dplyr::select(df, time_stamp, python:msdos)
df2 <- tidyr::gather(df1, "lang", "lang_known", -time_stamp)
df3 <- dplyr::filter(df2, !is.na(lang_known))
df4 <- dplyr::select(df3, -lang)
df5 <- dplyr::left_join(df, df4, by = 'time_stamp')
df6 <- dplyr::select(df5, -other_langs, -(python:msdos))
return(df6)
}
We’ll use the tidyr::gather() function for this. Michael is about to go into great detail about this sort of data munging.
survey <- gather_known_langs(survey)
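If gather() is new to you, here is a tiny, self-contained illustration (not part of the survey pipeline) of how it stacks several columns into key/value pairs, just as gather_known_langs() does with the language columns:
# Toy example: two language columns become long 'lang'/'lang_known' pairs
toy <- tibble::tibble(id = 1:2,
                      python = c("python", NA),
                      unix = c(NA, "unix"))
toy_long <- tidyr::gather(toy, "lang", "lang_known", -id)
dplyr::filter(toy_long, !is.na(lang_known))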
Finally, we’ll clean up the Is there a reproducibility crisis? variable, which we’ll rename crisis.
survey <- clean_repro_crisis(survey)
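The clean_repro_crisis() function lives in R/survey.R and isn’t shown here; a minimal sketch of what it might do, assuming it just renames the column and converts it to a factor, looks like this:
# Sketch only; the real clean_repro_crisis() is defined in R/survey.R
clean_repro_crisis <- function(df) {
  df <- dplyr::rename(df, crisis = `Is there a reproducibility crisis?`)
  df$crisis <- factor(df$crisis)
  return(df)
}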
You’ll note that I’ve written separate functions to deal with each step of the data cleaning. This seems like good practice to me since I really want to think about each variable separately. If I create a separate function to clean each variable, I can also keep separate things separate.
At the top of the R/survey.R file, you’ll see functions that combine these simpler functions. To combine the steps I used to clean the data, I run:
clean_survey_data <- function(df) {
# Clean the 2019 R Bootcamp Survey Data
df_0 <- clean_timestamp(df)
df_1 <- clean_other_languages(df_0)
df_2 <- clean_hrs_sleep(df_1)
df_3 <- clean_survey_names(df_2)
df_4 <- gather_known_langs(df_3)
df_5 <- clean_repro_crisis(df_4)
return(df_5)
}
This means that to update and clean the survey data, something I’ve done many times, I just run:
survey <- get_survey_data()
survey <- clean_survey_data(survey)
This automates the entire process and makes it REPRODUCIBLE.
The last step is to save the new cleaned file so we don’t have to do this again.
save_survey_data(survey, params$data_file_out)
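The save_survey_data() function is also defined in R/survey.R; a minimal sketch, assuming it simply writes a CSV (the default path here is a placeholder, not the real one), might look like:
# Sketch only; the real save_survey_data() is defined in R/survey.R
save_survey_data <- function(df, file_name = "data/survey_clean.csv") {
  readr::write_csv(df, file_name)
}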
Indeed, I’ve written an update_survey_data()
function that combines all of these gathering, cleaning, and saving steps:
update_survey_data <- function() {
save_survey_data(
clean_survey_data(
get_survey_data()
)
)
}
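With that in place, refreshing the saved survey data is a single (interactive) call:
update_survey_data()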
Now, we follow Mike Meyer’s advice: “Plot your data!” Who’s Mike Meyer? Rick’s stats professor from grad school. He had us use ‘S-PLUS’, a forerunner of R, and he was an inspiring and funny professor.
R_exp_hist <- survey %>%
ggplot() +
aes(x=r_exp) +
geom_histogram(stat = "count") # R_exp is discrete
## Warning: Ignoring unknown parameters: binwidth, bins, pad
R_exp_hist
We observe that this is not ordered in the way we’d expect, so let’s fix that.
survey$r_exp <- ordered(survey$r_exp, c("none", "limited", "some", "pro"))
R_exp_hist <- survey %>%
ggplot() +
aes(x=r_exp) +
geom_histogram(stat = "count") # R_exp is discrete
## Warning: Ignoring unknown parameters: binwidth, bins, pad
R_exp_hist
Much better!
Sleep_hrs_hist <- survey %>%
ggplot() +
aes(x=sleep_hrs) +
geom_histogram() # Sleep_hrs is continuous
Sleep_hrs_hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
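Following ggplot2’s suggestion, we could choose an explicit binwidth; a bin of one hour seems reasonable for hours of sleep (this tweak is mine, not in the original):
# Same histogram with an explicit 1-hour binwidth
survey %>%
  ggplot() +
  aes(x = sleep_hrs) +
  geom_histogram(binwidth = 1)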
Every data set should be documented. You can generate a template data codebook with some useful summary information using the dataMaid package.
if(!require(dataMaid)){install.packages('dataMaid')}
library(dataMaid)
dataMaid::makeCodebook(data = survey,
reportTitle = 'Codebook for 2019 R bootcamp survey',
replace = TRUE)
## Data report generation is finished. Please wait while your output file is being rendered.
Then, we can look at the codebook_survey.Rmd
file and edit it as needed, especially the section with the code descriptions.
| Label | Variable    | Class     | # unique values | Missing  | Description |
|-------|-------------|-----------|-----------------|----------|-------------|
|       | time_stamp  | POSIXct   | 15              | 0.00 %   |             |
|       | r_exp       | ordered   | 4               | 0.00 %   |             |
|       | got_s8      | numeric   | 7               | 0.00 %   |             |
|       | beverage    | character | 6               | 0.00 %   |             |
|       | age_yrs     | numeric   | 13              | 0.00 %   |             |
|       | sleep_hrs   | numeric   | 8               | 0.00 %   |             |
|       | day         | character | 5               | 0.00 %   |             |
|       | tidy_data   | character | 3               | 0.00 %   |             |
|       | crisis      | factor    | 1               | 100.00 % |             |
|       | lang_known  | character | 10              | 0.00 %   |             |
I could use a document like this to plan out my analysis before I conduct it. If I used simulated data, I could make sure that my workflow will run when I get the real (cleaned) data. I could even preregister my analysis plan before I conduct it. That doesn’t preclude later exploratory analyses, but it does hold me and my collaborators accountable for what I predicted in advance.
Notice that I sometimes put a label like R-exp-hist inside the curly braces {} for a given ‘chunk’ of R code. The main reasons to do this are: