ioslides_presentation, pdf_document, and word_document formats
Load required packages.
library(tidyverse)
library(googlesheets)
The survey data are stored in a Google Sheet. We’ll use the googlesheets package to open it and create a data frame. Documentation about the package can be found here.
There are some idiosyncrasies in using the googlesheets package in an R Markdown document because it requires interaction with the console, so I created a separate R script, Get_bootcamp_googlesheet.R, to extract the survey data. If you try to execute the next chunk, it may give you an error, or it may ask you to allow googlesheets to access information in your Google profile.
# Set eval=FALSE so I can render non-notebook formats
survey_url <- "https://docs.google.com/spreadsheets/d/1Ay56u6g4jyEEdlmV2NHxTLBlcjI2gHavta-Ik0kGrpg/edit?usp=sharing"
bootcamp_by_url <- survey_url %>%
  extract_key_from_url() %>%
  gs_key()
Auto-refreshing stale OAuth token.
Sheet successfully identified: "PSU Psychology R Bootcamp Survey (Responses)"
bootcamp_sheets <- gs_ws_ls(bootcamp_by_url)
boot_data <- bootcamp_by_url %>%
  gs_read(bootcamp_sheets[1])
Accessing worksheet titled 'Form Responses 1'.
Downloading: 700 B
Downloading: 710 B
Parsed with column specification:
cols(
Timestamp = col_character(),
`Your current level of experience/expertise with R` = col_character(),
`Your enthusiasm for Game of Thrones` = col_integer(),
`Age in years` = col_integer(),
`Preferred number of hours spent sleeping/day` = col_character(),
`Favorite day of the week` = col_character(),
`Are your data tidy?` = col_character()
)
# Path passed positionally: newer readr versions call this argument `file`, not `path`
write_csv(boot_data, "../data/survey.csv")
This script downloads the data and saves it as a CSV file, data/survey.csv. We can then load this file.
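As an aside on the download chunk above, the "key" that identifies a sheet is just the part of the URL that follows /d/. Here is a minimal base R sketch of that idea. It is not how googlesheets implements extract_key_from_url(); the regular expression is my own illustration.

```r
# Illustration only: pull the sheet key out of a Google Sheets URL with base R.
# This is NOT the googlesheets implementation, just a regex sketch of the idea.
survey_url <- "https://docs.google.com/spreadsheets/d/1Ay56u6g4jyEEdlmV2NHxTLBlcjI2gHavta-Ik0kGrpg/edit?usp=sharing"

# The key is the path segment that follows "/d/".
sheet_key <- sub("^.*/d/([^/]+).*$", "\\1", survey_url)
sheet_key
```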
I also created a test data file, data/survey-test.csv, so I could see how everything worked before y’all filled out your responses. The R/Make_test_survey.R file shows how I did this. It’s a great, reproducible practice to simulate the data you expect, then run it through your pipeline.
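To give a flavor of what simulating survey data can look like, here is a minimal base R sketch. It is not the contents of R/Make_test_survey.R. The number of respondents, the value ranges, and the day and tidy-data options are assumptions for illustration, and the Timestamp column is left out for brevity.

```r
# A sketch of simulating survey responses; this is not the author's
# R/Make_test_survey.R, and the value ranges below are assumptions.
set.seed(2017)  # make the simulated data reproducible
n <- 20         # hypothetical number of respondents

survey_test <- data.frame(
  R_exp     = sample(c("none", "limited", "some", "lots", "pro"), n, replace = TRUE),
  GoT       = sample(1:10, n, replace = TRUE),
  Age_yrs   = sample(18:60, n, replace = TRUE),
  # Kept as character to mimic how the raw sleep column is read in
  Sleep_hrs = as.character(sample(5:10, n, replace = TRUE)),
  Fav_day   = sample(c("Friday", "Saturday", "Sunday"), n, replace = TRUE),
  Tidy_data = sample(c("Yes", "No"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Written to a temporary file here; a real script would write to data/survey-test.csv.
test_path <- tempfile(fileext = ".csv")
write.csv(survey_test, test_path, row.names = FALSE)
str(survey_test)
```

Setting the seed means anyone who reruns the script gets the same fake respondents, which makes pipeline bugs easier to track down.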
# Created test data set for testing.
# survey <- read_csv("../data/survey-test.csv")
# Or choose data from respondents
survey <- read_csv("../data/survey.csv")
Parsed with column specification:
cols(
Timestamp = col_character(),
`Your current level of experience/expertise with R` = col_character(),
`Your enthusiasm for Game of Thrones` = col_integer(),
`Age in years` = col_integer(),
`Preferred number of hours spent sleeping/day` = col_character(),
`Favorite day of the week` = col_character(),
`Are your data tidy?` = col_character()
)
survey
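The "Parsed with column specification" message is readr telling us which column types it guessed. If you would rather declare the types yourself, read_csv() accepts a col_types argument. Here is a small sketch using a made-up two-row file with short, hypothetical column names rather than the real survey file.

```r
library(readr)

# Hypothetical two-row file with short column names, written to a temp path.
tmp <- tempfile(fileext = ".csv")
writeLines(c("GoT,Age_yrs,Sleep_hrs", "10,28,8!!!", "3,25,7"), tmp)

# Declaring col_types up front fixes the types and suppresses the
# column-specification message.
demo <- read_csv(
  tmp,
  col_types = cols(
    GoT       = col_integer(),
    Age_yrs   = col_integer(),
    Sleep_hrs = col_character()
  )
)
demo
```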
The str() or ‘structure’ command is also a great way to see what you’ve got.
str(survey)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 39 obs. of 7 variables:
$ Timestamp : chr NA "8/13/2017 23:29:24" "8/14/2017 12:01:12" "8/15/2017 12:42:09" ...
$ Your current level of experience/expertise with R: chr NA "some" "some" "some" ...
$ Your enthusiasm for Game of Thrones : int NA 10 10 10 10 10 10 3 9 10 ...
$ Age in years : int NA 28 22 24 28 24 23 25 37 25 ...
$ Preferred number of hours spent sleeping/day : chr NA "8!!!" "7" "10" ...
$ Favorite day of the week : chr NA "Friday" "Friday" "Saturday" ...
$ Are your data tidy? : chr NA "Yes" "That's a personal question" "No" ...
- attr(*, "spec")=List of 2
..$ cols :List of 7
.. ..$ Timestamp : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Your current level of experience/expertise with R: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Your enthusiasm for Game of Thrones : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Age in years : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Preferred number of hours spent sleeping/day : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Favorite day of the week : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Are your data tidy? : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
Clearly, we need to do some cleaning before we can do anything with this.
Let’s start by renaming variables.
names(survey) <- c("Timestamp",
                   "R_exp",
                   "GoT",
                   "Age_yrs",
                   "Sleep_hrs",
                   "Fav_day",
                   "Tidy_data")
# complete.cases() flags rows with no NAs; subsetting on it drops the incomplete rows
survey <- survey[complete.cases(survey), ]
survey
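To see what complete.cases() is doing, here is a toy example with made-up values. Note that it drops any row with at least one NA, not just rows that are entirely empty, so it is worth checking how many rows you lose.

```r
# Toy data frame (made up for illustration): one all-NA row, one partial-NA row.
toy <- data.frame(
  GoT     = c(NA, 10, 3),
  Age_yrs = c(NA, 28, NA)
)

keep <- complete.cases(toy)  # logical vector: TRUE where a row has no NAs
keep                         # FALSE TRUE FALSE
toy[keep, ]                  # only the fully observed row survives
```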
Now, let’s make sure we have numbers where we expect them. That person who really likes 8 hours (“8!!!”) is a problem (for me, not them).
survey$Sleep_hrs <- readr::parse_number(survey$Sleep_hrs)
survey
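Here is a quick sketch of what parse_number() does with the raw values we saw in the str() output. It keeps the first number it finds in each string and discards the surrounding characters, so an answer with more than one number in it would lose information.

```r
library(readr)

# Raw values taken from the str() output above.
sleep_raw <- c("8!!!", "7", "10")
sleep_num <- parse_number(sleep_raw)
sleep_num         # 8 7 10
class(sleep_num)  # "numeric"
```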
Looks good. Let’s save that cleaned file so we don’t have to do this again.
# Path passed positionally: newer readr versions call this argument `file`, not `path`
write_csv(survey, "../data/survey_clean.csv")
We may want to make the R_exp variable ordered.
(survey_responses <- unique(survey$R_exp))
[1] "some" "none" "limited" "pro"
This shows us the different survey response values.
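Before converting R_exp, it may help to see what an ordered factor buys us. The sketch below uses a few made-up responses with the survey's five-point scale as levels. It shows that comparisons respect the level order and that a level nobody chose still exists with a count of zero.

```r
# Responses made up for illustration; the levels follow the survey's scale.
r_exp <- ordered(c("some", "none", "pro", "limited"),
                 levels = c("none", "limited", "some", "lots", "pro"))

levels(r_exp)    # all five levels, including the unused "lots"
r_exp < "some"   # FALSE TRUE FALSE TRUE: comparisons follow the level order
table(r_exp)     # unused levels appear with a count of 0
```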
survey$R_exp <- ordered(survey$R_exp, levels=c("none",
                                               "limited",
                                               "some",
                                               "lots",
                                               "pro"))
Now, we follow Mike Meyer’s advice: “Plot your data!”
R_exp_hist <- survey %>%
  ggplot() +
  aes(x=R_exp) +
  geom_histogram(stat = "count") # R_exp is discrete
Ignoring unknown parameters: binwidth, bins, pad
R_exp_hist
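The "Ignoring unknown parameters" warning appears to come from geom_histogram() handing its binning settings to a stat that only counts. One alternative for a discrete variable is geom_bar(), which counts rows per category by default. Here is a sketch with made-up responses, not the survey data.

```r
library(ggplot2)

# Made-up responses for illustration.
toy <- data.frame(
  R_exp = ordered(c("some", "none", "some", "pro", "limited"),
                  levels = c("none", "limited", "some", "lots", "pro"))
)

# geom_bar() counts rows per category by default, so no stat override is needed.
R_exp_bar <- ggplot(toy) +
  aes(x = R_exp) +
  geom_bar()
R_exp_bar
```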
Sleep_hrs_hist <- survey %>%
  ggplot() +
  aes(x=Sleep_hrs) +
  geom_histogram() # Sleep_hrs is continuous
Sleep_hrs_hist
Got_hist <- survey %>%
  ggplot() +
  aes(x=GoT) +
  geom_histogram()
Got_hist
Looks like we are of two minds about GoT.
Does R experience have any relation to GoT enthusiasm?
GoT_vs_r_exp <- survey %>%
  ggplot() +
  aes(x=GoT, y=Age_yrs) +
  facet_grid(. ~ R_exp) +
  geom_point()
# + stat_smooth()
GoT_vs_r_exp
tidy_hist <- survey %>%
  ggplot() +
  aes(x=Tidy_data) +
  geom_histogram(stat = "count")
tidy_hist
I could use a document like this to plan my analysis before I conduct it. If I used simulated data, I could make sure that my workflow will run when I get real (cleaned) data. I could even preregister my analysis plan. That doesn’t preclude later exploratory analyses, but it does hold me and my collaborators accountable for what I predicted in advance.
Notice that I sometimes put a label like got-vs-r-exp in the brackets for a given ‘chunk’ of R code. The main reasons to do this are: