```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Goals

- Download and clean data from the 2019 R Bootcamp Survey
- Visualize data
- Demonstrate scripting of data gathering and cleaning

# Preliminaries

Load required packages.

```{r load-packages}
library(tidyverse)
library(googlesheets)
library(dataMaid)
```

# Load data and examine

The survey data are stored in a [Google Sheet](https://docs.google.com/spreadsheets/d/1YrKFrPz38FV-JbJKlp7wGQAw9IagVJFMuLbCLY7HMeA/edit?usp=sharing). 
We'll use the `googlesheets` package to open it and create a data frame. Documentation about the package can be found [here](https://cran.r-project.org/web/packages/googlesheets/vignettes/basic-usage.html).

The `googlesheets` package is awkward to use inside an R Markdown document because it requires interaction with the console, so I wrote a separate R function to download these data.
If you open the `R/survey.R` file, you will see a function that looks like this:

```
get_survey_data <- function(verbose = FALSE,
                            sheet_url = "https://docs.google.com/spreadsheets/d/1YrKFrPz38FV-JbJKlp7wGQAw9IagVJFMuLbCLY7HMeA/",
                            sheet_name = 'PSU R Bootcamp 2019 Survey (Responses)') {
  # Download 2019 Bootcamp registration data from GoogleSheet
  library(googledrive)
  library(googlesheets)
  
  drive_auth(use_oob = TRUE)
  options(httr_oob_default = TRUE)
  
  survey_gs <- googlesheets::gs_title(sheet_name)
  survey_data <- googlesheets::gs_read(ss = survey_gs,
                                       ws = 'Form Responses 1')
  survey_data
}
```

We'll load a previously saved version of the raw survey data here.

```{r load-raw-survey}
source(params$supporting_functions)
survey <- readr::read_csv(params$data_file_in)
```

## Inspecting the data 

The `str()` ('structure') function is a great way to see what you've got.

```{r}
str(survey)
```

Clearly, we need to do some cleaning before we can do anything with this.

## Cleaning data

Let's start by turning `Timestamp` into a proper date and time.
The `lubridate` package helps us manipulate character strings (`chr`) into dates and times.
The `Timestamp` variable is in a format common in the U.S. -- month/day/year -- but this format is not universal.
The `mdy_hms()` function in `lubridate` parses a month/day/year (mdy) date followed by a time in hours, minutes, and seconds (hms) and returns a date-time object that, by default, is expressed in UTC (Coordinated Universal Time, or 'Zulu' time).

```{r}
survey$Timestamp <- lubridate::mdy_hms(survey$Timestamp)
survey$Timestamp
```
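As a minimal sketch of what `mdy_hms()` does, here is a single made-up timestamp string; the exact format of the values in the real `Timestamp` column is an assumption.

```r
# A made-up timestamp in U.S. month/day/year order (hypothetical value)
raw_stamp <- "4/15/2019 13:45:10"

# Parse it; the result is a POSIXct date-time in UTC by default
parsed <- lubridate::mdy_hms(raw_stamp)

class(parsed)
format(parsed, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

Once the column holds real date-times rather than character strings, sorting, differencing, and plotting over time behave as expected.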

We also note that the `Other programming languages you know` question will need some work to be useful for data analysis.
Let's look at this variable specifically:

```{r}
survey$`Other programming languages you know`
```

We'll have to parse this into different languages.
Note that we can refer to the `Other programming languages you know` variable using back-tick "`" characters.
A bit later, we'll want to simplify these variable names.
For now, the following function from `R/survey.R` does most of what we need to do.

```
clean_other_languages <- function(df) {
  # Clean the 'Other programming languages you know' field
  out_df <- df
  
  # Create Booleans for different languages/language categories
  python <- stringr::str_detect(df$`Other programming languages you know`, "(P|p)ython")
  spss_sas <- stringr::str_detect(df$`Other programming languages you know`, "SPSS/SAS")
  mplus <- stringr::str_detect(df$`Other programming languages you know`, "(M|m)plus")
  lisrel <- stringr::str_detect(df$`Other programming languages you know`, "(L|l)isrel")
  none <- stringr::str_detect(df$`Other programming languages you know`, "None")
  js_html_css <- stringr::str_detect(df$`Other programming languages you know`, "HTML")
  java <- stringr::str_detect(df$`Other programming languages you know`, "Java")
  unix <- stringr::str_detect(df$`Other programming languages you know`, "nix")
  swift <- stringr::str_detect(df$`Other programming languages you know`, "Swift")
  msdos <- stringr::str_detect(df$`Other programming languages you know`, "MS DOS")
  
  # Create new fields for each language; easier to gather separately
  out_df$python <- NA
  out_df$python[python == TRUE] <- "python"
  
  out_df$spss_sas <- NA
  out_df$spss_sas[spss_sas == TRUE] <- "spss_sas"
  
  out_df$mplus <- NA
  out_df$mplus[mplus == TRUE] <- "mplus"
  
  out_df$lisrel <- NA
  out_df$lisrel[lisrel == TRUE] <- "lisrel"
  
  out_df$none <- NA
  out_df$none[none == TRUE] <- "none"
  
  out_df$js_html_css <- NA
  out_df$js_html_css[js_html_css == TRUE] <- "js_html_css"
  
  out_df$java <- NA
  out_df$java[java == TRUE] <- "java"
  
  out_df$unix <- NA
  out_df$unix[unix == TRUE] <- "unix"
  
  out_df$swift <- NA
  out_df$swift[swift == TRUE] <- "swift"
  
  out_df$msdos <- NA
  out_df$msdos[msdos == TRUE] <- "msdos"
  
  return(out_df)
}
```

Let's run it.

```{r}
survey <- clean_other_languages(survey)
```

This function creates a new character variable for each language we pulled out of the `Other programming languages you know` column: it holds the language name if the respondent listed that language and `NA` otherwise.
Later, we'll combine these into a single variable.
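The same idea can be written more compactly by looping over a named vector of patterns. This is only a sketch on made-up responses, not the function in `R/survey.R`; it uses base R's `grepl()` rather than `stringr::str_detect()` so that it runs without extra packages, and the `responses` and `patterns` objects are hypothetical.

```r
# Made-up responses; the real survey strings may differ
responses <- c("Python, SPSS/SAS", "None", "python, Java", NA)

# Name = new column to create, value = regular expression to search for
patterns <- c(python = "(P|p)ython",
              spss_sas = "SPSS/SAS",
              none = "None",
              java = "Java")

df <- data.frame(other = responses, stringsAsFactors = FALSE)

for (lang in names(patterns)) {
  # grepl() returns FALSE for missing responses, so they end up as NA below
  found <- grepl(patterns[[lang]], df$other)
  df[[lang]] <- ifelse(found, lang, NA_character_)
}

df
```

One thing to watch with pattern matching like this: a pattern such as `"Java"` would also match a response containing "JavaScript", so it is worth checking the patterns against the actual responses.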

Next, we notice that `Preferred number of hours spent sleeping/day` is a `chr` (character) variable, and we really want it to be a number.

```{r}
survey$`Preferred number of hours spent sleeping/day`
```

The following function fixes the "7-8" and "8-9" problems:

```
clean_hrs_sleep <- function(df) {
  # `Preferred number of hours spent sleeping/day`
  
  # "7-8"
  clean_this <- df$`Preferred number of hours spent sleeping/day` == "7-8"
  df$`Preferred number of hours spent sleeping/day`[clean_this] <- "7.5"

  # 8-9
  clean_this_too <- df$`Preferred number of hours spent sleeping/day` == "8-9"
  df$`Preferred number of hours spent sleeping/day`[clean_this_too] <- "8.5"

  df$`Preferred number of hours spent sleeping/day` <- as.numeric(df$`Preferred number of hours spent sleeping/day`)
  
  return(df)
}
```

So, we'll run that.

```{r}
survey <- clean_hrs_sleep(survey)
```
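The function above handles the two ranges that actually appear in the data. If other ranges could show up, a more general approach is to average whatever numbers are separated by a hyphen. The helper below is a hypothetical sketch, not part of `R/survey.R`, and it assumes every entry is either a single number or two numbers joined by "-".

```r
range_to_midpoint <- function(x) {
  # Split each string on "-" and average the pieces:
  # "7-8" becomes 7.5, while "7" stays 7
  parts <- strsplit(x, "-", fixed = TRUE)
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

sleep_raw <- c("7", "7-8", "8-9", "6.5")
sleep_num <- range_to_midpoint(sleep_raw)
sleep_num
```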

Now, we're ready to create shorter, but still-human-readable names for the variables.
My preferred style is to use lowercase names with underscores.

```
clean_survey_names <- function(df) {
  # Create shorter names for variables
  df <- dplyr::rename(df, time_stamp = Timestamp)
  df <- dplyr::rename(df, r_exp = `Your current level of experience/expertise with R`)
  df <- dplyr::rename(df, other_langs = `Other programming languages you know`)
  df <- dplyr::rename(df, beverage = `Your favorite beverage`)
  df <- dplyr::rename(df, age_yrs = `Age in years`)
  df <- dplyr::rename(df, sleep_hrs = `Preferred number of hours spent sleeping/day`)
  df <- dplyr::rename(df, got_s8 =  `Your enthusiasm for \"Game of Thrones\" Season 8.`)
  df <- dplyr::rename(df, day = `Favorite day of the week`)
  df <- dplyr::rename(df, tidy_data = `Are your data tidy?`)
  
  return(df) 
}
```

Let's clean the names.

```{r}
survey <- clean_survey_names(survey)
```
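The renames can also be written as a single call, because `dplyr::rename()` accepts several `new_name = old_name` pairs at once. A minimal sketch on a toy data frame (the values are made up):

```r
# Toy data frame with two of the original survey column names
toy <- data.frame(Timestamp = "4/15/2019 13:45:10",
                  `Age in years` = 30,
                  check.names = FALSE)

# Several new_name = old_name pairs in one call
toy <- dplyr::rename(toy,
                     time_stamp = Timestamp,
                     age_yrs = `Age in years`)

names(toy)
```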

Next, we'll want to make a 'tidier' data frame with the languages a person reports knowing.
A tidy data file for this would be 'longer', with a column called, say, `lang_known`, and with duplicate values in the other fields.

```
gather_known_langs <- function(df) {
  # Create tidy data tibble when there are multiple languages known
  df1 <- dplyr::select(df, time_stamp, python:msdos)
  df2 <- tidyr::gather(df1, "lang", "lang_known", -time_stamp)
  df3 <- dplyr::filter(df2, !is.na(lang_known))
  df4 <- dplyr::select(df3, -lang)
  df5 <- dplyr::left_join(df, df4, by = 'time_stamp')
  df6 <- dplyr::select(df5, -other_langs, -(python:msdos))
  return(df6)
}
```

The function above uses `tidyr::gather()` for this.
Michael is about to go into great detail about this sort of data munging.

```{r}
survey <- gather_known_langs(survey)
```
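To see the reshaping step in isolation, here is a sketch on a toy data frame. It uses `tidyr::pivot_longer()`, the newer counterpart to `gather()`, which is a substitution on my part rather than what `R/survey.R` does; the `id` column and the values are made up.

```r
# Toy data: one row per respondent, one column per language (NA = not known)
toy <- data.frame(id = 1:2,
                  python = c("python", NA),
                  java = c("java", "java"),
                  stringsAsFactors = FALSE)

# One row per respondent-language pair; rows with NA are dropped
long <- tidyr::pivot_longer(toy,
                            cols = c(python, java),
                            names_to = "lang",
                            values_to = "lang_known",
                            values_drop_na = TRUE)

# Keep only the identifier and the language known
long <- long[, c("id", "lang_known")]
long
```

Respondent 1 now appears on two rows, one for each language, which is the 'longer' shape described above.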

Finally, we'll clean up the `Is there a reproducibility crisis?` variable, which we've already renamed `crisis`.

```{r}
survey <- clean_repro_crisis(survey)
```

## Cleaning data with functions

You'll note that I've written separate functions to deal with each step of the data cleaning.
This seems like good practice to me since I really want to think about each variable separately.
If I create a separate function to clean each variable, I can also keep separate things separate.

At the top of the `R/survey.R` file, you'll see functions that combine these simpler functions.
To combine the steps I used to clean the data, I run:

```
clean_survey_data <- function(df) {
  # Clean the 2019 R Bootcamp Survey Data
  
  df_0 <- clean_timestamp(df)
  df_1 <- clean_other_languages(df_0)
  df_2 <- clean_hrs_sleep(df_1)
  df_3 <- clean_survey_names(df_2)
  df_4 <- gather_known_langs(df_3)
  df_5 <- clean_repro_crisis(df_4)
  
  return(df_5)
}
```

This means that to update and clean the survey data, something I've done many times, I just run:

```
survey <- get_survey_data()
survey <- clean_survey_data(survey)
```

This automates the entire process and makes it **REPRODUCIBLE**.
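Because every cleaning step takes a data frame and returns a data frame, the steps can also be kept in a list and applied in order. The sketch below illustrates that pattern with two toy stand-in functions; `trim_names()`, `lower_names()`, and `apply_steps()` are hypothetical and are not in `R/survey.R`.

```r
# Toy stand-ins for cleaning steps: each takes and returns a data frame
trim_names <- function(df) {
  names(df) <- trimws(names(df))
  df
}
lower_names <- function(df) {
  names(df) <- tolower(names(df))
  df
}

# Apply a list of steps in order, feeding each result to the next step
apply_steps <- function(df, steps) {
  Reduce(function(acc, step) step(acc), steps, init = df)
}

toy <- data.frame(` Age ` = 30, check.names = FALSE)
cleaned <- apply_steps(toy, list(trim_names, lower_names))
names(cleaned)
```

Adding or reordering a step then means editing one list rather than renumbering intermediate variables.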

The last step is to save the new cleaned file so we don't have to do this again.

```{r save-survey-data}
save_survey_data(survey, params$data_file_out)
```
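The definition of `save_survey_data()` is not shown here. As a rough idea of what a function like it might do, the following hypothetical sketch writes a data frame to a CSV file and reads it back; the name `save_survey_sketch()` and the use of `write.csv()` are assumptions, not the actual code in `R/survey.R`.

```r
save_survey_sketch <- function(df, file_out) {
  # Write the data frame without row names so it reads back cleanly
  utils::write.csv(df, file_out, row.names = FALSE)
  invisible(file_out)
}

# Try it on a toy data frame and a temporary file
tmp <- tempfile(fileext = ".csv")
save_survey_sketch(data.frame(age_yrs = c(30, 41)), tmp)
round_trip <- utils::read.csv(tmp)
round_trip
```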

Indeed, I've written an `update_survey_data()` function that combines all of these gathering, cleaning, and saving steps:

```
update_survey_data <- function() {
  save_survey_data(
    clean_survey_data(
      get_survey_data()
      )
  )
}
```

# Visualization

Now, we follow Mike Meyer's advice: "Plot your data!"
Who's Mike Meyer?
Rick's stats professor from grad school.
He had us use 'S-PLUS', a forerunner of R, and he was an inspiring and funny professor.

## Descriptive plots

```{r R-exp-hist, fig.cap="Distribution of prior R experience"}
R_exp_hist <- survey %>%
  ggplot() +
  aes(x=r_exp) +
  geom_bar() # r_exp is discrete, so count the responses in each category
R_exp_hist
```

We observe that this is not ordered in the way we'd expect, so let's fix that.

```{r}
survey$r_exp <- ordered(survey$r_exp, c("none", "limited", "some", "pro"))

R_exp_hist <- survey %>%
  ggplot() +
  aes(x=r_exp) +
  geom_bar() # r_exp is discrete, so count the responses in each category
R_exp_hist
```

Much better!
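To see why the order changes, here is a minimal sketch on a made-up vector. Without explicit levels, R sorts the categories alphabetically; supplying `levels` fixes the order and, with `ordered = TRUE`, makes comparisons between categories meaningful.

```r
# Made-up responses
x <- c("some", "none", "pro", "limited", "none")

# Default factor: levels are sorted alphabetically
levels(factor(x))

# Ordered factor: levels follow the order we specify
x_ord <- factor(x,
                levels = c("none", "limited", "some", "pro"),
                ordered = TRUE)
levels(x_ord)

# Comparisons now respect that order
x_ord[2] < x_ord[1]   # is "none" less than "some"?
```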

```{r Sleep_hrs_hist, fig.cap="Distribution of preferred sleep hrs/day"}
Sleep_hrs_hist <- survey %>%
  ggplot() +
  aes(x=sleep_hrs) +
  geom_histogram() # sleep_hrs is continuous
Sleep_hrs_hist
```
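By default, `geom_histogram()` uses 30 bins and prints a message suggesting you pick a better value. A minimal sketch with an explicit bin width follows; the toy data and the choice of `binwidth = 0.5` (to line up with half-hour values such as 7.5) are assumptions.

```r
library(ggplot2)

# Made-up sleep values
toy <- data.frame(sleep_hrs = c(6, 7, 7.5, 8, 8, 8.5, 9))

sleep_hist_binned <- ggplot(toy) +
  aes(x = sleep_hrs) +
  geom_histogram(binwidth = 0.5) # one bar per half hour
sleep_hist_binned
```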

# Data documentation (codebook)

Every data set should be documented.
You can generate a template data codebook with some useful summary information using the package `dataMaid`.

```{r make_dataMaid_codebook}
if(!require(dataMaid)){install.packages('dataMaid')}
library(dataMaid)
dataMaid::makeCodebook(data = survey, 
                       reportTitle = 'Codebook for 2019 R bootcamp survey', 
                       replace = TRUE)
```

Then, we can look at the `codebook_survey.Rmd` file and edit it as needed, especially the section with the code descriptions.

```
---------------------------------------------------------------------------
Label   Variable            Class         # unique  Missing   Description  
                                            values                         
------- ------------------- ----------- ---------- ---------- -------------
        **[time\_stamp]**   POSIXct             15   0.00 %                

        **[r\_exp]**        ordered              4   0.00 %                

        **[got\_s8]**       numeric              7   0.00 %                

        **[beverage]**      character            6   0.00 %                

        **[age\_yrs]**      numeric             13   0.00 %                

        **[sleep\_hrs]**    numeric              8   0.00 %                

        **[day]**           character            5   0.00 %                

        **[tidy\_data]**    character            3   0.00 %                

        **[crisis]**        factor               1  100.00 %               

        **[lang\_known]**   character           10   0.00 %                
---------------------------------------------------------------------------
```

# Analysis

I could use a document like this to plan my analysis **before** I conduct it.
If I used simulated data, I could make sure that my workflow will run when I get real (cleaned) data.
I could even preregister my analysis plan before I conduct it.
That doesn't preclude later exploratory analyses, but it does hold me and my collaborators accountable for what I predicted in advance.

# Notes

Notice that I sometimes put a label like `R-exp-hist` in the braces `{}` for a given 'chunk' of R code. The main reasons to do this are:

- It sometimes makes it easier to debug your code.
- In some cases, you can have this 'chunk' name serve as the file name for a figure you generate within a chunk.
- These chunk names are useful for making tables, figures, and equations that generate their own numbers.