Load required packages.
library(tidyverse)
library(googlesheets)
The survey data are stored in a Google Sheet. We’ll use the googlesheets package to open it and create a data frame. Documentation about the package can be found here.
There are some idiosyncrasies in using the googlesheets package in an R Markdown document because it requires interaction with the console, so I created a separate R script, Get_bootcamp_googlesheet.R, to extract the survey data. If you try to execute the next chunk, it may give you an error, or it may ask you to allow googlesheets to access information in your Google profile. This just allows R to grab the data from the Google Sheet using your Google account.
survey_url <- "https://docs.google.com/spreadsheets/d/1-YB0iWUNN_9oxBhz221NFiyBOcwMfHziFeUiUvQwn7k/edit?usp=sharing"
bootcamp_by_url <- gs_url(survey_url)
bootcamp_sheets <- gs_ws_ls(bootcamp_by_url)
boot_data <- bootcamp_by_url %>%
gs_read(bootcamp_sheets[1])
write_csv(boot_data, path=params$data_file_out)
This script downloads the data and saves it as a CSV file at ../data/survey_2018.csv. We can then load this file.
I also created a test data file, data/survey-test.csv, so I could see how everything worked before y’all filled out your responses. The R/Make_test_survey.R file shows how I did this. It’s a great, reproducible practice to simulate the data you expect, then run it through your pipeline.
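To make the idea of simulating data concrete, here is a minimal sketch in base R. It is not the contents of R/Make_test_survey.R. The object name survey_sim, the number of fake respondents, the age range, and the sleep range are all assumptions chosen for illustration. The response options are borrowed from values that appear elsewhere in this document.

```r
# Hypothetical sketch of simulating survey responses.
# This is NOT the actual R/Make_test_survey.R script.
set.seed(2018)  # fix the random seed so the fake data are reproducible
n <- 20         # assumed number of fake respondents

r_levels <- c("none", "limited", "some", "lots", "pro")
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
crisis_levels <- c("Yes, a significant crisis", "Yes, a slight crisis",
                   "No", "Don't know")

# Fake timestamps one minute apart, formatted like the Google Form output
start_time <- as.POSIXct("2018-08-15 09:00:00", tz = "UTC")

survey_sim <- data.frame(
  Timestamp     = format(start_time + 60 * seq_len(n), "%m/%d/%Y %H:%M:%S"),
  R_exp         = sample(r_levels, n, replace = TRUE),
  Banjo         = sample(1:10, n, replace = TRUE),               # scale 1 to 10
  Psych_age_yrs = sample(18:70, n, replace = TRUE),              # assumed range
  Sleep_hrs     = sample(seq(5, 10, by = 0.5), n, replace = TRUE), # assumed range
  Fav_day       = sample(days, n, replace = TRUE),
  Crisis        = sample(crisis_levels, n, replace = TRUE),
  stringsAsFactors = FALSE
)

str(survey_sim)
```

A data frame like this could be written to a CSV file and passed through the same cleaning and plotting steps as the real responses.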
# Choose data from respondents
survey <- read_csv(params$data_file_in)
## Parsed with column specification:
## cols(
## Timestamp = col_character(),
## `Your current level of experience/expertise with R` = col_character(),
## `Your enthusiasm for banjo music` = col_integer(),
## `How old do you feel (in years)` = col_integer(),
## `Preferred number of hours spent sleeping/day` = col_character(),
## `Favorite day of the week` = col_character(),
## `Is there a reproducibility 'crisis' in psychology?` = col_character()
## )
survey
## # A tibble: 56 x 7
## Timestamp `Your current l… `Your enthusias… `How old do you…
## <chr> <chr> <int> <int>
## 1 7/24/201… pro 10 45
## 2 8/14/201… lots 5 29
## 3 8/15/201… limited 1 35
## 4 8/15/201… limited 2 25
## 5 8/15/201… limited 1 27
## 6 8/15/201… lots 3 19
## 7 8/15/201… limited 2 30
## 8 8/15/201… pro 2 26
## 9 8/15/201… limited 1 26
## 10 8/15/201… limited 3 25
## # ... with 46 more rows, and 3 more variables: `Preferred number of hours
## # spent sleeping/day` <chr>, `Favorite day of the week` <chr>, `Is there
## # a reproducibility 'crisis' in psychology?` <chr>
The str() or ‘structure’ command is also a great way to see what you’ve got.
str(survey)
## Classes 'tbl_df', 'tbl' and 'data.frame': 56 obs. of 7 variables:
## $ Timestamp : chr "7/24/2018 14:18:42" "8/14/2018 11:37:15" "8/15/2018 9:37:17" "8/15/2018 9:37:37" ...
## $ Your current level of experience/expertise with R : chr "pro" "lots" "limited" "limited" ...
## $ Your enthusiasm for banjo music : int 10 5 1 2 1 3 2 2 1 3 ...
## $ How old do you feel (in years) : int 45 29 35 25 27 19 30 26 26 25 ...
## $ Preferred number of hours spent sleeping/day : chr "8" "8.5" "6" "8" ...
## $ Favorite day of the week : chr "Sunday" "Saturday" "Sunday" "Friday" ...
## $ Is there a reproducibility 'crisis' in psychology?: chr "Yes, a significant crisis" "Yes, a slight crisis" "Yes, a significant crisis" "Yes, a significant crisis" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 7
## .. ..$ Timestamp : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Your current level of experience/expertise with R : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Your enthusiasm for banjo music : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ How old do you feel (in years) : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Preferred number of hours spent sleeping/day : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Favorite day of the week : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Is there a reproducibility 'crisis' in psychology?: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
Clearly, we need to do some cleaning before we can do anything with this.
Let’s start by renaming variables.
names(survey) <- c("Timestamp",
"R_exp",
"Banjo",
"Psych_age_yrs",
"Sleep_hrs",
"Fav_day",
"Crisis")
# complete.cases() flags rows with no NAs; subsetting on it drops incomplete rows
survey <- survey[complete.cases(survey),]
survey
## # A tibble: 56 x 7
## Timestamp R_exp Banjo Psych_age_yrs Sleep_hrs Fav_day Crisis
## <chr> <chr> <int> <int> <chr> <chr> <chr>
## 1 7/24/2018 14… pro 10 45 8 Sunday Yes, a sign…
## 2 8/14/2018 11… lots 5 29 8.5 Saturd… Yes, a slig…
## 3 8/15/2018 9:… limit… 1 35 6 Sunday Yes, a sign…
## 4 8/15/2018 9:… limit… 2 25 8 Friday Yes, a sign…
## 5 8/15/2018 9:… limit… 1 27 10 Saturd… Yes, a slig…
## 6 8/15/2018 9:… lots 3 19 7.5 Friday Yes, a slig…
## 7 8/15/2018 9:… limit… 2 30 10 Saturd… Yes, a slig…
## 8 8/15/2018 9:… pro 2 26 9 Saturd… Yes, a slig…
## 9 8/15/2018 9:… limit… 1 26 8 Sunday Yes, a slig…
## 10 8/15/2018 9:… limit… 3 25 10 Saturd… Yes, a slig…
## # ... with 46 more rows
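The tibble still has 56 rows after this step, so no respondent had a missing value. To see what complete.cases() does when there are gaps, here is a small sketch on a made-up data frame. The name toy and its values are hypothetical.

```r
# Toy data frame (made-up values) with two incomplete rows
toy <- data.frame(
  Banjo   = c(10, NA, 3),
  Fav_day = c("Sunday", "Friday", NA),
  stringsAsFactors = FALSE
)

keep <- complete.cases(toy)  # TRUE where a row has no NA in any column
keep
sum(!keep)                   # how many rows would be dropped

toy_clean <- toy[keep, ]     # keep only the complete rows
toy_clean
```

Checking sum(!keep) before subsetting is a cheap way to find out how much data you are about to throw away.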
Now, let’s make sure we have numbers where we expect them.
survey$Sleep_hrs <- readr::parse_number(survey$Sleep_hrs)
survey
## # A tibble: 56 x 7
## Timestamp R_exp Banjo Psych_age_yrs Sleep_hrs Fav_day Crisis
## <chr> <chr> <int> <int> <dbl> <chr> <chr>
## 1 7/24/2018 14… pro 10 45 8 Sunday Yes, a sign…
## 2 8/14/2018 11… lots 5 29 8.5 Saturd… Yes, a slig…
## 3 8/15/2018 9:… limit… 1 35 6 Sunday Yes, a sign…
## 4 8/15/2018 9:… limit… 2 25 8 Friday Yes, a sign…
## 5 8/15/2018 9:… limit… 1 27 10 Saturd… Yes, a slig…
## 6 8/15/2018 9:… lots 3 19 7.5 Friday Yes, a slig…
## 7 8/15/2018 9:… limit… 2 30 10 Saturd… Yes, a slig…
## 8 8/15/2018 9:… pro 2 26 9 Saturd… Yes, a slig…
## 9 8/15/2018 9:… limit… 1 26 8 Sunday Yes, a slig…
## 10 8/15/2018 9:… limit… 3 25 10 Saturd… Yes, a slig…
## # ... with 46 more rows
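Sleep_hrs was read in as character, which usually means at least one response was not a bare number. Here is a rough sketch of how readr::parse_number() treats that kind of input. The responses in sleep_raw are invented for illustration and are not taken from the survey.

```r
library(readr)

# Made-up responses of the kind a free-text survey field can collect
sleep_raw <- c("8", "8.5", "about 7", "9 hours", "a lot")

# parse_number() pulls out the first number it finds in each string;
# a string with no number at all becomes NA (with a parsing warning)
sleep_num <- parse_number(sleep_raw)
sleep_num

# Count the responses that could not be converted
sum(is.na(sleep_num))
```

It is worth counting the NA values after a conversion like this, since a warning is easy to miss when a document is knitted.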
Looks good. Let’s save that cleaned file so we don’t have to do this again.
write_csv(survey, path="../data/survey_clean.csv")
We may want to make the R_exp variable ordered.
(survey_responses <- unique(survey$R_exp))
## [1] "pro" "lots"
## [3] "limited" "none, limited, lots, pro"
## [5] "none"
This shows us the different survey response values. It looks like somebody checked all the levels. Let’s change that to limited.
survey$R_exp[survey$R_exp == "none, limited, lots, pro"] <- "limited"
survey$R_exp <- ordered(survey$R_exp, levels=c("none",
"limited",
"some",
"lots",
"pro"))
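Here is a small sketch of what ordered() buys us and why the recoding step had to come first. The vectors responses and r_exp are illustrative stand-ins, not the survey data itself.

```r
r_levels  <- c("none", "limited", "some", "lots", "pro")
responses <- c("pro", "lots", "limited", "none, limited, lots, pro", "none")

r_exp <- ordered(responses, levels = r_levels)

# Any value that is not one of the levels silently becomes NA,
# which is why the multi-answer response was recoded beforehand
r_exp

# Ordered factors support order-aware comparisons
r_exp >= "lots"

# Tabulations follow the level order and include unused levels like "some"
table(r_exp)
```

The same level order is what ggplot2 uses to arrange the bars along the x axis.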
Now, we follow Mike Meyer’s advice: “Plot your data!”
R_exp_hist <- survey %>%
ggplot() +
aes(x=R_exp) +
geom_histogram(stat = "count") # R_exp is discrete
## Warning: Ignoring unknown parameters: binwidth, bins, pad
R_exp_hist
Sleep_hrs_hist <- survey %>%
ggplot() +
aes(x=Sleep_hrs) +
geom_histogram() # Sleep_hrs is continuous
Sleep_hrs_hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
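The message above is ggplot2 nudging us to choose a bin width ourselves. Here is one possible sketch, assuming half-hour bins suit the sleep responses. The data frame sleep_demo holds made-up values standing in for survey$Sleep_hrs.

```r
library(ggplot2)

# Made-up sleep values on a half-hour grid, standing in for survey$Sleep_hrs
sleep_demo <- data.frame(Sleep_hrs = c(6, 7.5, 8, 8, 8.5, 9, 10, 10))

Sleep_hrs_hist_demo <- ggplot(sleep_demo) +
  aes(x = Sleep_hrs) +
  geom_histogram(binwidth = 0.5)  # explicit bin width, so no stat_bin() message

Sleep_hrs_hist_demo
```

Whether 0.5 is the right width depends on how finely people actually reported their hours, so treat it as a starting point.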
Banjo_hist <- survey %>%
ggplot() +
aes(x=Banjo) +
geom_histogram(bins = 10)
Banjo_hist
Does R experience have any relation to banjo music enthusiasm or one’s psychological age?
Banjo_vs_r_exp <- survey %>%
ggplot() +
aes(x=Banjo, y=Psych_age_yrs) +
facet_grid(. ~ R_exp) +
geom_point()
# + stat_smooth()
Banjo_vs_r_exp
crisis_plot <- survey %>%
ggplot() +
aes(x=Crisis) +
geom_bar()
crisis_plot
Every data set should be documented. You can generate a template data codebook with some useful summary information using the package dataMaid.
if(!require(dataMaid)){install.packages('dataMaid')}
## Loading required package: dataMaid
##
## Attaching package: 'dataMaid'
## The following object is masked from 'package:dplyr':
##
## summarize
library(dataMaid)
dataMaid::makeCodebook(data = survey,
reportTitle = 'Codebook for 2018 R bootcamp survey',
replace = TRUE)
## Data report generation is finished. Please wait while your output file is being rendered.
Then, we can look at the codebook_survey.Rmd file and edit it as needed, especially the section with the code descriptions.
# Codebook summary table

-----------------------------------------------------------------------------------------------------------------------
Label   Variable                Class       # unique   Missing   Description
                                            values
------- ----------------------- ----------- ---------- --------- ------------------------------------------------------
        **[Timestamp]**         POSIXct     1          0.00 %    Time & date survey was completed

        **[R\_exp]**            ordered     5          0.00 %    Levels of R experience: {'none', 'limited', 'some',
                                                                 'lots', 'pro'}

        **[Banjo]**             integer     10         0.00 %    Level of enthusiasm for banjo music [1,10]

        **[Psych\_age\_yrs]**   integer     33         0.00 %    Age participant reports feeling

        **[Sleep\_hrs]**        numeric     50         0.00 %    Preferred number of hours/day spent sleeping

        **[Fav\_day]**          Date        1          0.00 %    Favorite day of the week: {'Mon', 'Tue', 'Wed', 'Thu',
                                                                 'Fri', 'Sat', 'Sun'}

        **[Crisis]**            character   4          0.00 %    Is there a reproducibility 'crisis' in psychology:
                                                                 {'Yes, a significant crisis', 'Yes, a slight crisis',
                                                                 'No', 'Don't know'}
-----------------------------------------------------------------------------------------------------------------------
I could use a document like this to plan my analysis before I conduct it. If I used simulated data, I could make sure that my workflow will run when I get real (cleaned) data. I could even preregister my analysis plan before I collect the data. That doesn’t preclude later exploratory analyses, but it does hold me and my collaborators accountable for what I predicted in advance.
Notice that I sometimes put a label like Banjo-vs-r-exp in the brackets {} for a given ‘chunk’ of R code. The main reasons to do this are: