The goal of this document is to provide a basic introduction to using the tidyr and dplyr packages in R for data tidying and wrangling.

tidyr walkthrough

The tidyr package provides a small number of functions for reshaping data into a tidy format. Tidy data are defined by:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit (e.g., persons, schools, counties) forms a table.

Imagine that someone sends you a dataset from a longitudinal study in which ratings of well-being and anxiety were measured at three waves, and it looks like this:

df <- data.frame(subid=1:10, sub_w1=rnorm(10, 5, 1), sub_w2=rnorm(10, 6, 1), sub_w3=rnorm(10, 7, 1),
                 anx_w1=rnorm(10, 9, 1), anx_w2=rnorm(10, 6, 1), anx_w3=rnorm(10, 7, 1))

kable(round(df,3))
subid sub_w1 sub_w2 sub_w3 anx_w1 anx_w2 anx_w3
1 3.075 4.462 6.207 10.182 4.936 6.348
2 6.122 7.340 7.801 8.974 8.296 6.907
3 6.033 5.461 8.211 9.231 8.171 7.158
4 5.690 5.081 6.766 12.203 4.173 6.401
5 6.186 6.417 5.693 7.759 7.116 6.528
6 4.734 8.484 8.089 8.457 6.732 7.292
7 5.067 6.901 6.475 10.134 6.303 7.668
8 4.043 5.184 6.726 8.408 4.748 7.086
9 3.951 4.942 6.585 6.897 7.127 6.611
10 4.404 7.828 9.912 10.882 4.790 5.839
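gather: convert wide data into long (key-value) format

This wide layout is not tidy: each row contains six observations. The gather function stacks the measure columns into key-value pairs, producing the df_long object used in the next step. A minimal sketch (the set.seed call is an addition here so the numbers are reproducible; the original example drew fresh random values):

```r
library(tidyr)

set.seed(1)  # added for reproducibility; not in the original example
df <- data.frame(subid=1:10, sub_w1=rnorm(10, 5, 1), sub_w2=rnorm(10, 6, 1), sub_w3=rnorm(10, 7, 1),
                 anx_w1=rnorm(10, 9, 1), anx_w2=rnorm(10, 6, 1), anx_w3=rnorm(10, 7, 1))

# stack every column except subid into a key column ("time") and a value column
df_long <- df %>% gather(key=time, value=value, -subid)
head(df_long)
```

Each subid now appears six times, once per measure-wave combination.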

separate: split the values of a variable at a position in the character string

Now, the time variable encodes two pieces of information: the measure (sub versus anx) and the wave (1-3). This is a job for separate!

df_long <- df_long %>% separate(time, into=c("measure", "time"), sep = "_")
head(df_long)
##   subid measure time    value
## 1     1     sub   w1 3.074873
## 2     2     sub   w1 6.122216
## 3     3     sub   w1 6.032673
## 4     4     sub   w1 5.690036
## 5     5     sub   w1 6.186475
## 6     6     sub   w1 4.733549

Cool, but we see that time has the ‘w’ prefix and isn’t a number. If your analysis uses a numeric (continuous) time representation (e.g., multilevel models), this won’t work. Let’s parse the number out of it.

df_long <- df_long %>% mutate(time=parse_number(time))
head(df_long)
##   subid measure time    value
## 1     1     sub    1 3.074873
## 2     2     sub    1 6.122216
## 3     3     sub    1 6.032673
## 4     4     sub    1 5.690036
## 5     5     sub    1 6.186475
## 6     6     sub    1 4.733549

This now qualifies as tidy. But it is not necessarily right for every application. For example, in longitudinal SEM (e.g., latent curve models), time is usually encoded by specific loadings onto intercept and slope factors. This requires a ‘wide’ data format similar to where we started. Let’s use tidyr to demonstrate how to go backwards in our transformation process – long-to-wide.

spread: convert key-value pairs into separate columns (long-to-wide)

We can imagine an intermediate step in which we have the values of each measure as columns, instead of encoding them with respect to both measure and time.

df_long %>% spread(key=measure, value=value)
##    subid time       anx      sub
## 1      1    1 10.181831 3.074873
## 2      1    2  4.936028 4.462167
## 3      1    3  6.348020 6.206762
## 4      2    1  8.973985 6.122216
## 5      2    2  8.295925 7.339596
## 6      2    3  6.907473 7.801088
## 7      3    1  9.230804 6.032673
## 8      3    2  8.170932 5.461016
## 9      3    3  7.158098 8.210505
## 10     4    1 12.202934 5.690036
## 11     4    2  4.172569 5.081356
## 12     4    3  6.401017 6.765562
## 13     5    1  7.758699 6.186475
## 14     5    2  7.116117 6.417462
## 15     5    3  6.527680 5.693236
## 16     6    1  8.456882 4.733549
## 17     6    2  6.732005 8.484119
## 18     6    3  7.292282 8.089112
## 19     7    1 10.133610 5.067451
## 20     7    2  6.302825 6.900962
## 21     7    3  7.667511 6.474607
## 22     8    1  8.407892 4.043383
## 23     8    2  4.748029 5.184476
## 24     8    3  7.085874 6.726062
## 25     9    1  6.896648 3.951086
## 26     9    2  7.126776 4.942077
## 27     9    3  6.611164 6.584615
## 28    10    1 10.881976 4.403815
## 29    10    2  4.790130 7.828304
## 30    10    3  5.838759 9.911949

unite: paste together the values of two or more variables (usually as strings)

This is moving in the right direction, but if we want the column to encode both time and variable, we need to unite the time- and measure-related information. The unite function does exactly this, essentially pasting together the values of multiple columns into a single column.

df_wide <- df_long %>% unite(col="vartime", measure, time)
head(df_wide)
##   subid vartime    value
## 1     1   sub_1 3.074873
## 2     2   sub_1 6.122216
## 3     3   sub_1 6.032673
## 4     4   sub_1 5.690036
## 5     5   sub_1 6.186475
## 6     6   sub_1 4.733549

Looks promising. Let’s go back to spread now that we have a key that encodes all variable (column) information.

df_wide <- df_wide %>% spread(key=vartime, value=value)

We’ve now transformed our long-form dataset back into a wide dataset.

advanced reshaping

If you find yourself needing more advanced reshaping powers, look at the reshape2 package, a predecessor of tidyr. Although tidyr is more recent, it is also deliberately simpler and does not offer robust facilities for reshaping lists and arrays. Moreover, for data.frame objects, the dcast function from reshape2 offers a flexible syntax for specifying how multi-dimensional data should be reshaped into a 2-D data.frame. Here are a couple of resources:

Reshape2 tutorial: http://seananderson.ca/2013/10/19/reshape.html

Further extensions using data.table package: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html
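As a quick taste of the dcast syntax mentioned above, here is a minimal sketch using the built-in airquality dataset (the variable choices are illustrative additions, not from the original):

```r
library(reshape2)

# melt to long format: one row per (Month, Day, variable) measurement
aq_long <- melt(airquality, id.vars=c("Month", "Day"), na.rm=TRUE)

# cast back to 2-D: rows are months, columns are variables, cells are means
dcast(aq_long, Month ~ variable, fun.aggregate=mean)
```

The formula interface (rows ~ columns) plus an aggregation function is what makes dcast flexible for collapsing multi-dimensional data.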

dplyr walkthrough

Now that we have basic tools to tidy data, let’s discuss data wrangling using dplyr.

Let’s start with the survey from our bootcamp. What’s the average age of individuals in the bootcamp, stratified by R expertise?

Note that summarize removes a single level of grouping. Here, we have only one grouping variable, so the output of summarize will be ungrouped.

survey <- read_csv("../data/survey-test.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   Timestamp = col_datetime(format = ""),
##   R_exp = col_character(),
##   GoT = col_integer(),
##   Age_yrs = col_integer(),
##   Sleep_hrs = col_double(),
##   Fav_date = col_date(format = ""),
##   Tidy_data = col_character()
## )
survey %>% group_by(R_exp) %>% summarize(m_age=mean(Age_yrs), sd_age=sd(Age_yrs))
## # A tibble: 5 x 3
##     R_exp m_age    sd_age
##     <chr> <dbl>     <dbl>
## 1 limited  38.9 10.764654
## 2    lots  49.2  5.473167
## 3    none  39.9 12.368868
## 4     pro  38.8  8.024961
## 5    some  41.8 10.086294
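To see the grouping behavior noted above in action, here is a minimal sketch with the built-in mtcars data (an addition to the original): with two grouping variables, summarize peels off only the innermost one.

```r
library(dplyr)

# group by two variables, then summarize
by_cyl_gear <- mtcars %>% group_by(cyl, gear) %>% summarize(m_mpg = mean(mpg))

# summarize dropped the innermost group (gear); the cyl grouping remains
group_vars(by_cyl_gear)
```

A second summarize call on this result would therefore aggregate across gear within cyl, and its output would be fully ungrouped.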

What if I want to have means and SDs for several continuous variables by R expertise? The summarize_at function provides functionality to specify several variables using vars() and potentially several summary functions using funs().

survey %>% group_by(R_exp) %>% summarize_at(vars(Age_yrs, Sleep_hrs, GoT), funs(m=mean, sd=sd))
## # A tibble: 5 x 7
##     R_exp Age_yrs_m Sleep_hrs_m GoT_m Age_yrs_sd Sleep_hrs_sd   GoT_sd
##     <chr>     <dbl>       <dbl> <dbl>      <dbl>        <dbl>    <dbl>
## 1 limited      38.9    7.852436   4.9  10.764654    0.7213635 1.911951
## 2    lots      49.2    8.283258   2.7   5.473167    0.8662253 1.418136
## 3    none      39.9    8.241380   4.8  12.368868    0.5319107 2.658320
## 4     pro      38.8    7.746333   5.0   8.024961    1.2624823 1.154701
## 5    some      41.8    7.586250   4.3  10.086294    1.2497898 2.540779

We can also make this output more beautiful using techniques we’ve already seen above; R is programming with data, so we just extend our pipeline a bit. The extract function here is like separate, but with a bit more oomph from regular expressions. Regular expressions are a more intermediate topic, but there is a tutorial here: http://www.regular-expressions.info/tutorial.html.

survey %>% group_by(R_exp) %>% summarize_at(vars(Age_yrs, Sleep_hrs, GoT), funs(m=mean, sd=sd)) %>%
  gather(key=var, value=value, -R_exp) %>% 
  extract(col="var", into=c("variable", "statistic"), regex=("(.*)_(.*)$")) %>%
  spread(key=statistic, value=value) %>% arrange(variable, R_exp)
## # A tibble: 15 x 4
##      R_exp  variable         m         sd
##      <chr>     <chr>     <dbl>      <dbl>
##  1 limited   Age_yrs 38.900000 10.7646541
##  2    lots   Age_yrs 49.200000  5.4731669
##  3    none   Age_yrs 39.900000 12.3688677
##  4     pro   Age_yrs 38.800000  8.0249611
##  5    some   Age_yrs 41.800000 10.0862943
##  6 limited       GoT  4.900000  1.9119507
##  7    lots       GoT  2.700000  1.4181365
##  8    none       GoT  4.800000  2.6583203
##  9     pro       GoT  5.000000  1.1547005
## 10    some       GoT  4.300000  2.5407785
## 11 limited Sleep_hrs  7.852436  0.7213635
## 12    lots Sleep_hrs  8.283258  0.8662253
## 13    none Sleep_hrs  8.241380  0.5319107
## 14     pro Sleep_hrs  7.746333  1.2624823
## 15    some Sleep_hrs  7.586250  1.2497898

Revisiting the univbct dataset using dplyr

Let’s now return to the univbct data where we have longitudinal observations of job satisfaction, commitment, and readiness to deploy. Let’s enact the core ‘verbs’ of dplyr.

filter: obtaining observations (rows) based on some criteria

  data(univbct, package="multilevel")

Filter only men in company A

  company_A_men <- filter(univbct, COMPANY=="A" & GENDER==1)
  #print 10 observations at random to check the accuracy of the filter
  kable(company_A_men %>% sample_n(10))
BTN COMPANY MARITAL GENDER HOWLONG RANK EDUCATE AGE JOBSAT1 COMMIT1 READY1 JOBSAT2 COMMIT2 READY2 JOBSAT3 COMMIT3 READY3 TIME JSAT COMMIT READY SUBNUM
133 704 A 2 1 0 11 2 19 4.000000 4.000000 4.00 4.000000 3.666667 3.75 3.000000 3.000000 3.00 0 4.000000 4.000000 4.00 183
323 404 A 2 1 1 16 4 36 4.000000 4.000000 3.25 4.000000 3.666667 2.75 4.000000 3.666667 2.75 1 4.000000 3.666667 2.75 454
229 1022 A 1 1 2 13 2 19 1.666667 3.000000 1.75 1.666667 1.666667 1.50 1.333333 2.000000 2.50 0 1.666667 3.000000 1.75 333
333 4042 A 1 1 2 13 2 20 3.333333 4.333333 3.00 3.333333 3.666667 3.75 3.666667 3.666667 3.50 2 3.666667 3.666667 3.50 468
53 4 A 2 1 0 17 4 36 4.000000 4.666667 3.25 NA 4.000000 3.50 3.666667 4.333333 3.50 1 NA 4.000000 3.50 107
73 4 A 2 1 5 18 5 44 5.000000 5.000000 4.00 NA 5.000000 3.00 4.000000 4.333333 3.75 0 5.000000 5.000000 4.00 130
6 4 A 2 1 0 16 2 39 3.333333 3.666667 1.25 NA NA 1.75 3.333333 3.666667 1.50 2 3.333333 3.666667 1.50 13
267 404 A 2 1 2 15 2 30 3.333333 3.666667 2.75 4.000000 4.000000 3.00 4.000000 3.666667 3.50 2 4.000000 3.666667 3.50 381
279 144 A 1 1 3 13 2 20 4.333333 3.333333 2.50 4.000000 3.333333 2.50 3.333333 3.000000 2.50 2 3.333333 3.000000 2.50 393
209 299 A 4 1 3 13 2 26 2.000000 2.666667 2.00 1.000000 1.000000 1.75 2.666667 2.000000 2.50 1 1.000000 1.000000 1.75 291

What about the number of people in companies A and B?

  filter(univbct, COMPANY %in% c("A","B")) %>% nrow()
## [1] 750

Or counts by company and battalion

  univbct %>% group_by(BTN, COMPANY) %>% count()
## # A tibble: 43 x 3
## # Groups:   BTN, COMPANY [43]
##      BTN COMPANY     n
##    <dbl>  <fctr> <int>
##  1     4       A    66
##  2     4       B    15
##  3     4       C    12
##  4     4       D    30
##  5     4     HHC    18
##  6   104       A    12
##  7   104     HHC     3
##  8   124       A    42
##  9   144       A    30
## 10   299       A    39
## # ... with 33 more rows

select: obtaining variables (columns) based on some criteria

Let’s start by keeping only the three core dependent variables over time: jobsat, commit, ready. Keep SUBNUM as well for unique identification.

dvs_only <- univbct %>% select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3, COMMIT1, COMMIT2, COMMIT3, READY1, READY2, READY3)

If you have many variables with similar names, try starts_with(). Note that in this case starts_with("READY") also brings in the bare READY column, not just READY1-READY3. You can mix different selection mechanisms within a single select call; look at the cheatsheet.

dvs_only <- univbct %>% select(SUBNUM, starts_with("JOBSAT"), starts_with("COMMIT"), starts_with("READY"))

Other selection mechanisms:

  - contains: variable name contains a literal string
  - starts_with: variable name starts with a prefix
  - ends_with: variable name ends with a suffix
  - matches: variable name matches a regular expression
  - one_of: variable is one of the elements in a character vector. Example: select(one_of(c("A", "B")))
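These mechanisms can be mixed freely within one select call. A minimal sketch on the built-in iris data (an illustration added here, not from the original):

```r
library(dplyr)

# combine a bare column name, starts_with, and a regex-based matches
iris %>%
  select(Species, starts_with("Petal"), matches("^Sepal\\.W")) %>%
  names()
```

The output columns appear in the order the selection arguments listed them, not in their original dataset order.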

Note that select and filter can be combined to subset both observations and variables of interest. For example, look at readiness to deploy in battalion 299 only

univbct %>% filter(BTN==299) %>% select(SUBNUM, READY1, READY2, READY3) %>% head
##   SUBNUM READY1 READY2 READY3
## 1      4    2.5   3.25   3.00
## 2      4    2.5   3.25   3.00
## 3      4    2.5   3.25   3.00
## 4      7    2.0   1.75   1.25
## 5      7    2.0   1.75   1.25
## 6      7    2.0   1.75   1.25

Select is also useful for dropping variables that are not of interest.

nojobsat <- univbct %>% select(-starts_with("JOBSAT"))
names(nojobsat)
##  [1] "BTN"     "COMPANY" "MARITAL" "GENDER"  "HOWLONG" "RANK"    "EDUCATE"
##  [8] "AGE"     "COMMIT1" "READY1"  "COMMIT2" "READY2"  "COMMIT3" "READY3" 
## [15] "TIME"    "JSAT"    "COMMIT"  "READY"   "SUBNUM"

mutate: add one or more variables that are a function of other variables

Mean of commit scores over waves.

  univbct <- univbct %>% rowwise() %>% mutate(commitmean=mean(c(COMMIT1, COMMIT2, COMMIT3))) %>% ungroup()
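rowwise() computations can be slow on large datasets. A vectorized alternative with the same NA behavior uses rowMeans; here is a minimal sketch on toy data (an addition, shown on a made-up data frame rather than univbct):

```r
library(dplyr)

d <- data.frame(COMMIT1 = c(3, 4, NA), COMMIT2 = c(4, 4, 5), COMMIT3 = c(5, 3, 4))

# rowwise version, as above
d1 <- d %>% rowwise() %>% mutate(commitmean = mean(c(COMMIT1, COMMIT2, COMMIT3))) %>% ungroup()

# vectorized equivalent: no rowwise() needed
d2 <- d %>% mutate(commitmean = rowMeans(cbind(COMMIT1, COMMIT2, COMMIT3)))

all.equal(d1$commitmean, d2$commitmean)  # TRUE
```

Both versions return NA for any row with a missing value; pass na.rm=TRUE to either mean call to change that.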

Mutate can manipulate several variables in one call. Here, mean center any variable that starts with COMMIT and add the suffix _cm for clarity. Also compute the percentile rank for each of these columns, with _pct as suffix. Note the use of the vars function here, which acts identically to select, but in the context of a summary or mutation operation on specific variables.

  meancent <- function(x) { x - mean(x, na.rm=TRUE) }
  univbct <- univbct %>% mutate_at(vars(starts_with("COMMIT")), funs(cm=meancent, pct=percent_rank))
  univbct %>% select(starts_with("COMMIT")) %>% summarize_all(mean, na.rm=TRUE) %>% gather()
## # A tibble: 15 x 2
##               key         value
##             <chr>         <dbl>
##  1        COMMIT1  3.616702e+00
##  2        COMMIT2  3.467514e+00
##  3        COMMIT3  3.537473e+00
##  4         COMMIT  3.540303e+00
##  5     commitmean  3.537767e+00
##  6     COMMIT1_cm -2.195134e-16
##  7     COMMIT2_cm  1.601433e-16
##  8     COMMIT3_cm -7.797226e-17
##  9      COMMIT_cm -8.193985e-17
## 10  commitmean_cm -1.015464e-16
## 11    COMMIT1_pct  4.146716e-01
## 12    COMMIT2_pct  4.174271e-01
## 13    COMMIT3_pct  4.028798e-01
## 14     COMMIT_pct  4.125408e-01
## 15 commitmean_pct  4.228116e-01

arrange: reorder observations into a specific order

Order data by ascending battalion, company, then subnum

  univbct <- univbct %>% arrange(BTN, COMPANY, SUBNUM)

Descending sort: descending battalion, ascending company, ascending subnum

  univbct <- univbct %>% arrange(desc(BTN), COMPANY, SUBNUM)

A more realistic example: preparation for multilevel analysis

In MLM, one strategy for disentangling within- versus between-person effects is to include both within-person-centered variables and person means in the model (Curran & Bauer, 2011).

We can achieve this easily for our three DVs here using a single pipeline that combines tidying and mutation. Using -2 as the sep argument to separate splits the string at the second-to-last position (i.e., starting at the right).

For reshaping to work smoothly, we need a unique identifier for each row. Also, univbct is stored in a dangerously untidy format in which variables with suffix 1-3 indicate a ‘wide format’, but the data is also in long format under variables such as ‘JSAT’ and ‘COMMIT.’

Take a look:

  univbct %>% select(SUBNUM, starts_with("JOBSAT"), JSAT) %>% head(n=20)
## # A tibble: 20 x 5
##    SUBNUM  JOBSAT1  JOBSAT2  JOBSAT3     JSAT
##     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1    103 2.000000 2.333333 3.333333 2.000000
##  2    103 2.000000 2.333333 3.333333 2.333333
##  3    103 2.000000 2.333333 3.333333 3.333333
##  4    129 3.666667 4.333333 4.666667 3.666667
##  5    129 3.666667 4.333333 4.666667 4.333333
##  6    129 3.666667 4.333333 4.666667 4.666667
##  7    171 3.666667 4.000000       NA 3.666667
##  8    171 3.666667 4.000000       NA 4.000000
##  9    171 3.666667 4.000000       NA       NA
## 10    202 1.333333 2.000000 4.333333 1.333333
## 11    202 1.333333 2.000000 4.333333 2.000000
## 12    202 1.333333 2.000000 4.333333 4.333333
## 13    270 4.000000 3.666667 5.000000 4.000000
## 14    270 4.000000 3.666667 5.000000 3.666667
## 15    270 4.000000 3.666667 5.000000 5.000000
## 16    296 4.000000 4.000000 4.000000 4.000000
## 17    296 4.000000 4.000000 4.000000 4.000000
## 18    296 4.000000 4.000000 4.000000 4.000000
## 19    348 3.333333 3.000000 3.333333 3.333333
## 20    348 3.333333 3.000000 3.333333 3.000000

We first need to eliminate this insanity. Group by subject number and retain only the first row (i.e., keep the wide version).

  univbct <- univbct %>% group_by(SUBNUM) %>% filter(row_number() == 1) %>% 
  select(-JSAT, -COMMIT, -READY) %>% ungroup()

Now, let’s get the data into a conventional long format for MLM (e.g., for use with lmer)

  forMLM <- univbct %>% select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3, COMMIT1, COMMIT2, COMMIT3, READY1, READY2, READY3) %>% 
  gather(key="key", value="value", -SUBNUM) %>% 
  separate(col="key", into=c("variable", "occasion"), -2) %>%
  spread(key=variable, value=value) %>% mutate(occasion=as.numeric(occasion))

Now, let’s perform the centering described above. You could do this in one pipeline – I just separated things here for conceptual clarity.

  forMLM <- forMLM %>% group_by(SUBNUM) %>% 
  mutate_at(vars(COMMIT, JOBSAT, READY), funs(wicent=meancent, pmean=mean)) %>%
  ungroup()

  head(forMLM)
## # A tibble: 6 x 11
##   SUBNUM occasion   COMMIT   JOBSAT READY COMMIT_wicent JOBSAT_wicent
##    <dbl>    <dbl>    <dbl>    <dbl> <dbl>         <dbl>         <dbl>
## 1      1        1 1.666667 1.666667  2.75    -0.4444443     -0.222222
## 2      1        2 1.666667 1.000000  1.00    -0.4444443     -0.888889
## 3      1        3 3.000000 3.000000  3.00     0.8888887      1.111111
## 4      2        1 1.666667 3.666667  3.00     0.2222227     -0.222222
## 5      2        2 1.333333 4.000000  2.00    -0.1111113      0.111111
## 6      2        3 1.333333 4.000000  1.75    -0.1111113      0.111111
## # ... with 4 more variables: READY_wicent <dbl>, COMMIT_pmean <dbl>,
## #   JOBSAT_pmean <dbl>, READY_pmean <dbl>