The goal of this document is to provide a basic introduction to using the tidyr and dplyr packages in R for data tidying and wrangling.
One of the most irritating problems you may encounter in the tidyverse world is when code that previously worked suddenly throws an inexplicable error. For example:
> survey %>% group_by(R_exp) %>%
summarize(m_age=mean(Psych_age_yrs), sd_age=sd(Psych_age_yrs))
Error in summarize(., m_age = mean(Psych_age_yrs), sd_age = sd(Psych_age_yrs)) :
argument "by" is missing, with no default
Because dplyr (and sometimes tidyr) uses fairly intuitive verbs such as ‘summarize’ and ‘select’, it can share function names with other packages. For example, Hmisc has a summarize function that operates differently. Likewise, the predecessor to dplyr was called plyr; although largely outmoded, it has a few functions that remain very useful. But… these same-named functions behave differently (the syntax is not the same!).
This points to the problem of what are called ‘namespace collisions.’ That is, when R looks for a function (or any object) in the Global environment, it searches through a ‘path’. You can see the nitty gritty using searchpaths(). But the TL;DR is that if you – or any function you call – loads another package, that package may override a dplyr function and make your code crash!
There are two simple defenses: (1) be explicit about the namespace when calling a function, e.g., dplyr::summarize; or (2) load library(tidyverse) last, since it at least handles collisions within the tidyverse! Here is an example of output that portends a namespace collision:
library(dplyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
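If you are ever unsure which attached package will ‘win’ for a given name, base R’s find() lists every package on the search path that defines it, in the order R will look (a quick sketch; the output reflects the library() calls above):

find("summarize") # "package:Hmisc" "package:dplyr" ; the first match is what a bare call uses

When in doubt, call dplyr::summarize() explicitly.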
The tidyr package provides a small number of functions for reshaping data into a tidy format. Tidy data are defined by:

1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table
Imagine a longitudinal study in which ratings of well-being and anxiety are measured three times. Now imagine that someone sends you a dataset that looks like this:
df <- data.frame(subid=1:10,
sub_w1=rnorm(10, 5, 1), sub_w2=rnorm(10, 6, 1), sub_w3=rnorm(10, 7, 1),
anx_w1=rnorm(10, 9, 1), anx_w2=rnorm(10, 6, 1), anx_w3=rnorm(10, 7, 1))
library(knitr) # provides kable() for formatted tables
kable(round(df, 3))
subid | sub_w1 | sub_w2 | sub_w3 | anx_w1 | anx_w2 | anx_w3 |
---|---|---|---|---|---|---|
1 | 5.276 | 5.081 | 8.271 | 9.620 | 6.084 | 6.868 |
2 | 4.986 | 6.595 | 9.144 | 9.561 | 6.803 | 7.236 |
3 | 4.282 | 4.665 | 8.166 | 10.898 | 7.117 | 7.099 |
4 | 5.672 | 5.232 | 6.932 | 9.245 | 6.479 | 7.783 |
5 | 3.994 | 8.014 | 5.517 | 8.540 | 6.098 | 6.700 |
6 | 4.657 | 6.967 | 8.138 | 8.848 | 7.643 | 6.641 |
7 | 4.103 | 6.451 | 7.428 | 9.484 | 4.418 | 5.059 |
8 | 5.628 | 6.997 | 8.337 | 8.883 | 6.798 | 6.456 |
9 | 4.696 | 4.872 | 6.430 | 8.283 | 5.832 | 7.261 |
10 | 4.567 | 6.159 | 8.437 | 7.907 | 6.344 | 7.686 |
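Before separate can do its work, we need the data in long format, with one row per subject x variable-wave combination. A gather step like the following (a sketch; this step is assumed by the commands below) stacks the six measurement columns into key-value pairs:

df_long <- df %>% gather(key="time", value="value", -subid)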
Now, the time variable has both information about the measure (sub versus anx) and time (1-3). This is a job for separate!
df_long <- df_long %>% separate(time, into=c("measure", "time"), sep = "_")
head(df_long)
## subid measure time value
## 1 1 sub w1 5.275683
## 2 2 sub w1 4.985548
## 3 3 sub w1 4.282360
## 4 4 sub w1 5.672307
## 5 5 sub w1 3.993627
## 6 6 sub w1 4.657330
nrow(df_long)
## [1] 60
Cool, but we see that time has the ‘w’ prefix and isn’t a number. If your analysis uses a numeric (continuous) time representation (e.g., multilevel models), this won’t work. Let’s parse the number out of it using readr’s parse_number.
df_long <- df_long %>% mutate(time=parse_number(time))
head(df_long)
## subid measure time value
## 1 1 sub 1 5.275683
## 2 2 sub 1 4.985548
## 3 3 sub 1 4.282360
## 4 4 sub 1 5.672307
## 5 5 sub 1 3.993627
## 6 6 sub 1 4.657330
This now qualifies as tidy. But it is not necessarily right for every application. For example, in longitudinal SEM (e.g., latent curve models), time is usually encoded by specific loadings onto intercept and slope factors. This requires a ‘wide’ data format similar to where we started. Let’s use tidyr to demonstrate how to go backwards in our transformation process – long-to-wide.
We can imagine an intermediate step in which we have the values of each measure as columns, instead of encoding them with respect to both measure and time.
print(df_intermediate <- df_long %>% spread(key=measure, value=value))
## subid time anx sub
## 1 1 1 9.619590 5.275683
## 2 1 2 6.084015 5.080863
## 3 1 3 6.867555 8.270652
## 4 2 1 9.560709 4.985548
## 5 2 2 6.802824 6.595293
## 6 2 3 7.236272 9.143847
## 7 3 1 10.898357 4.282360
## 8 3 2 7.116813 4.665417
## 9 3 3 7.099039 8.165602
## 10 4 1 9.245252 5.672307
## 11 4 2 6.479136 5.232391
## 12 4 3 7.783039 6.932239
## 13 5 1 8.540470 3.993627
## 14 5 2 6.097641 8.013735
## 15 5 3 6.700154 5.516544
## 16 6 1 8.847970 4.657330
## 17 6 2 7.642710 6.967485
## 18 6 3 6.640836 8.137576
## 19 7 1 9.484030 4.102833
## 20 7 2 4.418370 6.450891
## 21 7 3 5.058520 7.427647
## 22 8 1 8.882659 5.627684
## 23 8 2 6.798393 6.997446
## 24 8 3 6.456400 8.337145
## 25 9 1 8.283270 4.695705
## 26 9 2 5.832436 4.871811
## 27 9 3 7.260793 6.429599
## 28 10 1 7.906891 4.567234
## 29 10 2 6.343541 6.158679
## 30 10 3 7.686283 8.436685
df_intermediate %>% nrow()
## [1] 30
This is moving in the right direction, but if we want the column to encode both time and variable, we need to unite the time- and measure-related information. The unite function does exactly this, essentially pasting together the values of multiple columns into a single column.
df_wide <- df_long %>% unite(col="vartime", measure, time)
head(df_wide)
## subid vartime value
## 1 1 sub_1 5.275683
## 2 2 sub_1 4.985548
## 3 3 sub_1 4.282360
## 4 4 sub_1 5.672307
## 5 5 sub_1 3.993627
## 6 6 sub_1 4.657330
Looks promising. Let’s go back to spread now that we have a key that encodes all variable (column) information.
df_wide <- df_wide %>% spread(key=vartime, value=value)
We’ve now transformed our long-form dataset back into a wide dataset.
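As a quick sanity check (a sketch), the wide frame should again have one row per subject; spread orders the new columns alphabetically by key:

head(df_wide)
nrow(df_wide) # 10, one row per subid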
If you find yourself needing more advanced reshaping powers, look at the reshape2 package, a predecessor of tidyr. Even though tidyr is more recent, it is also simpler and does not offer robust facilities for reshaping lists and arrays. Moreover, for data.frame objects, the dcast function from reshape2 offers a flexible syntax for specifying how multi-dimensional data should be reshaped into a 2-D data.frame. Here are a couple of resources:
Reshape2 tutorial: http://seananderson.ca/2013/10/19/reshape.html
Further extensions using data.table package: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html
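As a brief illustration (a sketch, assuming reshape2 is installed), dcast rebuilds our wide layout via a formula interface: the left-hand side defines the rows and the right-hand side defines the columns:

library(reshape2)
df_wide2 <- dcast(df_long, subid ~ measure + time, value.var="value") # columns anx_1 ... sub_3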
Now that we have basic tools to tidy data, let’s discuss data wrangling using dplyr.
Let’s start with the survey from our bootcamp. What’s the average age of individuals in the bootcamp, stratified by R expertise?
Note that summarize removes a single level of grouping. Here, we only have one grouping variable, so the output of summarize will be ‘ungrouped.’
survey <- read_csv("../data/survey_clean.csv")
## Parsed with column specification:
## cols(
## Timestamp = col_character(),
## R_exp = col_character(),
## Banjo = col_integer(),
## Psych_age_yrs = col_integer(),
## Sleep_hrs = col_double(),
## Fav_day = col_character(),
## Crisis = col_character()
## )
survey %>% group_by(R_exp) %>% dplyr::summarize(m_age=mean(Psych_age_yrs), sd_age=sd(Psych_age_yrs))
## # A tibble: 5 x 3
## R_exp m_age sd_age
## <chr> <dbl> <dbl>
## 1 limited 33.4 24.1
## 2 lots 38.5 40.3
## 3 none 27.8 16.6
## 4 none, limited, lots, pro 1000 NA
## 5 pro 35.5 13.4
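To see the one-level-of-grouping rule in action, here is a sketch that groups by two variables; after summarize, only the innermost grouping (Fav_day) has been dropped:

survey %>% group_by(R_exp, Fav_day) %>%
summarize(m_age=mean(Psych_age_yrs)) %>%
group_vars() # "R_exp" remains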
What if I want to have means and SDs for several continuous variables by R expertise? The summarize_at function provides functionality to specify several variables using vars() and potentially several summary functions using funs().
survey %>% group_by(R_exp) %>% summarize_at(vars(Psych_age_yrs, Sleep_hrs, Banjo), funs(m=mean, sd=sd))
## # A tibble: 5 x 7
## R_exp Psych_age_yrs_m Sleep_hrs_m Banjo_m Psych_age_yrs_sd Sleep_hrs_sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 limi… 33.4 8.68 4.75 24.1 2.97
## 2 lots 38.5 8.91 4.09 40.3 2.44
## 3 none 27.8 8.2 3.9 16.6 1.48
## 4 none… 1000 24 1 NA NA
## 5 pro 35.5 8.5 6 13.4 0.707
## # ... with 1 more variable: Banjo_sd <dbl>
We can also make this more beautiful using techniques we’ve already seen above… R is programming with data. We just extend our data pipeline a bit. The extract function here is like separate, but with a bit more oomph, using regular expressions. Regular expressions are a more intermediate topic, but there is a tutorial here: http://www.regular-expressions.info/tutorial.html.
survey %>% group_by(R_exp) %>% summarize_at(vars(Psych_age_yrs, Sleep_hrs, Banjo), funs(m=mean, sd=sd)) %>%
gather(key=var, value=value, -R_exp) %>%
extract(col="var", into=c("variable", "statistic"), regex=("(.*)_(.*)$")) %>%
spread(key=statistic, value=value) %>% arrange(variable, R_exp)
## # A tibble: 15 x 4
## R_exp variable m sd
## <chr> <chr> <dbl> <dbl>
## 1 limited Banjo 4.75 3.05
## 2 lots Banjo 4.09 2.02
## 3 none Banjo 3.9 2.08
## 4 none, limited, lots, pro Banjo 1 NA
## 5 pro Banjo 6 5.66
## 6 limited Psych_age_yrs 33.4 24.1
## 7 lots Psych_age_yrs 38.5 40.3
## 8 none Psych_age_yrs 27.8 16.6
## 9 none, limited, lots, pro Psych_age_yrs 1000 NA
## 10 pro Psych_age_yrs 35.5 13.4
## 11 limited Sleep_hrs 8.68 2.97
## 12 lots Sleep_hrs 8.91 2.44
## 13 none Sleep_hrs 8.2 1.48
## 14 none, limited, lots, pro Sleep_hrs 24 NA
## 15 pro Sleep_hrs 8.5 0.707
Let’s examine the univbct data, which contains longitudinal observations of job satisfaction, commitment, and readiness to deploy. From the documentation (?univbct):
This data set contains the complete data set used in Bliese and Ployhart (2002). The data is longitudinal data converted to univariate (i.e., stacked) form. Data were collected at three time points. A data frame with 22 columns and 1485 observations from 495 individuals.
data(univbct, package="multilevel")
str(univbct)
## 'data.frame': 1485 obs. of 22 variables:
## $ BTN : num 1022 1022 1022 1004 1004 ...
## $ COMPANY: Factor w/ 8 levels "A","B","C","D",..: 6 6 6 4 4 4 2 2 2 2 ...
## $ MARITAL: num 1 1 1 4 4 4 2 2 2 2 ...
## $ GENDER : num 1 1 1 1 1 1 1 1 1 1 ...
## $ HOWLONG: num 2 2 2 0 0 0 0 0 0 1 ...
## $ RANK : num 12 12 12 13 13 13 15 15 15 14 ...
## $ EDUCATE: num 2 2 2 2 2 2 2 2 2 2 ...
## $ AGE : num 20 20 20 24 24 24 24 24 24 23 ...
## $ JOBSAT1: num 1.67 1.67 1.67 3.67 3.67 ...
## $ COMMIT1: num 1.67 1.67 1.67 1.67 1.67 ...
## $ READY1 : num 2.75 2.75 2.75 3 3 3 3.75 3.75 3.75 2.5 ...
## $ JOBSAT2: num 1 1 1 4 4 ...
## $ COMMIT2: num 1.67 1.67 1.67 1.33 1.33 ...
## $ READY2 : num 1 1 1 2 2 2 3.75 3.75 3.75 3.25 ...
## $ JOBSAT3: num 3 3 3 4 4 4 4 4 4 3 ...
## $ COMMIT3: num 3 3 3 1.33 1.33 ...
## $ READY3 : num 3 3 3 1.75 1.75 1.75 1.75 1.75 1.75 3 ...
## $ TIME : num 0 1 2 0 1 2 0 1 2 0 ...
## $ JSAT : num 1.67 1 3 3.67 4 ...
## $ COMMIT : num 1.67 1.67 3 1.67 1.33 ...
## $ READY : num 2.75 1 3 3 2 1.75 3.75 3.75 1.75 2.5 ...
## $ SUBNUM : num 1 1 1 2 2 2 3 3 3 4 ...
We have 1485 observations of military personnel nested within companies, which are nested within battalions: https://en.wikipedia.org/wiki/Battalion.
Let’s enact the core ‘verbs’ of dplyr to understand and improve the structure of these data.
Filter only men in company A
company_A_men <- filter(univbct, COMPANY=="A" & GENDER==1)
#print 10 observations at random to check the accuracy of the filter
kable(company_A_men %>% sample_n(10))
 | BTN | COMPANY | MARITAL | GENDER | HOWLONG | RANK | EDUCATE | AGE | JOBSAT1 | COMMIT1 | READY1 | JOBSAT2 | COMMIT2 | READY2 | JOBSAT3 | COMMIT3 | READY3 | TIME | JSAT | COMMIT | READY | SUBNUM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | 3066 | A | 1 | 1 | 2 | 13 | 2 | 19 | 1.333333 | 2.666667 | 1.00 | 3.333333 | 2.666667 | 2.75 | 3.000000 | 2.333333 | 2.75 | 1 | 3.333333 | 2.666667 | 2.75 | 98 |
73 | 4 | A | 2 | 1 | 5 | 18 | 5 | 44 | 5.000000 | 5.000000 | 4.00 | NA | 5.000000 | 3.00 | 4.000000 | 4.333333 | 3.75 | 0 | 5.000000 | 5.000000 | 4.00 | 130 |
215 | 4042 | A | 4 | 1 | 2 | 17 | 4 | 36 | 4.000000 | 4.666667 | 3.50 | 4.000000 | 4.000000 | 4.00 | 4.000000 | 4.000000 | 4.00 | 1 | 4.000000 | 4.000000 | 4.00 | 296 |
66 | 1022 | A | 2 | 1 | 5 | 16 | 2 | 31 | 3.333333 | 3.666667 | 2.75 | 3.000000 | 3.000000 | 3.00 | 3.666667 | 3.666667 | 2.50 | 2 | 3.666667 | 3.666667 | 2.50 | 120 |
120 | 4 | A | 2 | 1 | 3 | 15 | 2 | 33 | 4.000000 | 4.666667 | 2.50 | 3.666667 | 4.666667 | 1.75 | 4.666667 | 4.666667 | 2.00 | 2 | 4.666667 | 4.666667 | 2.00 | 175 |
337 | 1022 | A | 1 | 1 | 1 | 12 | 3 | 21 | 2.333333 | 3.333333 | 2.00 | NA | NA | NA | NA | NA | NA | 0 | 2.333333 | 3.333333 | 2.00 | 474 |
112 | 3066 | A | 1 | 1 | 2 | 12 | 2 | 20 | 2.666667 | 1.333333 | 1.75 | 3.000000 | 3.666667 | NA | 3.000000 | 3.000000 | 3.00 | 0 | 2.666667 | 1.333333 | 1.75 | 173 |
175 | 4 | A | 2 | 1 | 2 | 17 | 3 | 34 | 3.666667 | 4.000000 | 3.25 | 4.000000 | 4.000000 | 3.75 | 3.666667 | 5.000000 | 3.25 | 0 | 3.666667 | 4.000000 | 3.25 | 238 |
77 | 3066 | A | 2 | 1 | 0 | 21 | 5 | 23 | 3.333333 | 4.000000 | 3.00 | 3.666667 | 3.666667 | 4.00 | 3.666667 | 3.666667 | 4.25 | 1 | 3.666667 | 3.666667 | 4.00 | 133 |
307 | 104 | A | 2 | 1 | 0 | 15 | 2 | 23 | 3.000000 | NA | 4.00 | 3.333333 | 4.333333 | 3.50 | 3.666667 | 4.000000 | 2.25 | 0 | 3.000000 | NA | 4.00 | 428 |
What about the number of people in companies A and B?
filter(univbct, COMPANY %in% c("A","B")) %>% nrow()
## [1] 750
Or counts by company and battalion
univbct %>% group_by(BTN, COMPANY) %>% count()
## # A tibble: 43 x 3
## # Groups: BTN, COMPANY [43]
## BTN COMPANY n
## <dbl> <fct> <int>
## 1 4 A 66
## 2 4 B 15
## 3 4 C 12
## 4 4 D 30
## 5 4 HHC 18
## 6 104 A 12
## 7 104 HHC 3
## 8 124 A 42
## 9 144 A 30
## 10 299 A 39
## # ... with 33 more rows
Let’s start by keeping only the three core dependent variables over time: jobsat, commit, ready. Keep SUBNUM as well for unique identification.
dvs_only <- univbct %>% dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3,
COMMIT1, COMMIT2, COMMIT3,
READY1, READY2, READY3)
If you have many variables with similar names, you might try starts_with(). Note that in this case it brings in the stacked variables “COMMIT” and “READY”, too, since those names also begin with the same prefixes. Note that you can mix different selection mechanisms within select. Look at the cheatsheet.
dvs_only <- univbct %>% dplyr::select(SUBNUM, starts_with("JOBSAT"), starts_with("COMMIT"), starts_with("READY"))
Other selection mechanisms:

* contains: variable name contains a literal string
* starts_with: variable name starts with a prefix
* ends_with: variable name ends with a suffix
* matches: variable name matches a regular expression
* one_of: variable is one of the elements in a character vector. Example: select(one_of(c("A", "B")))
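For instance (a quick sketch against univbct; names() just reports which columns each helper selected):

univbct %>% dplyr::select(contains("COMMIT")) %>% names() # literal substring
univbct %>% dplyr::select(matches("^READY[0-9]$")) %>% names() # regex: READY1-READY3, but not READY
univbct %>% dplyr::select(one_of(c("BTN", "COMPANY"))) %>% names()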
Note that select and filter can be combined to subset both observations and variables of interest. For example, look at readiness to deploy in battalion 299 only
univbct %>% filter(BTN==299) %>% dplyr::select(SUBNUM, READY1, READY2, READY3) %>% head
## SUBNUM READY1 READY2 READY3
## 1 4 2.5 3.25 3.00
## 2 4 2.5 3.25 3.00
## 3 4 2.5 3.25 3.00
## 4 7 2.0 1.75 1.25
## 5 7 2.0 1.75 1.25
## 6 7 2.0 1.75 1.25
Select is also useful for dropping variables that are not of interest.
nojobsat <- univbct %>% dplyr::select(-starts_with("JOBSAT"))
names(nojobsat)
## [1] "BTN" "COMPANY" "MARITAL" "GENDER" "HOWLONG" "RANK" "EDUCATE"
## [8] "AGE" "COMMIT1" "READY1" "COMMIT2" "READY2" "COMMIT3" "READY3"
## [15] "TIME" "JSAT" "COMMIT" "READY" "SUBNUM"
(Row-wise) mean of commit scores over waves. Note how you can use select() within a mutate to run a function on a subset of the data. (If any wave is missing, rowMeans will return NA unless you pass na.rm=TRUE.)
univbct <- univbct %>% mutate(commitmean=rowMeans(dplyr::select(., COMMIT1, COMMIT2, COMMIT3)))
Mutate can manipulate several variables in one call. Here, mean center any variable that starts with COMMIT and add the suffix _cm for clarity. Also compute the percentile rank for each of these columns, with _pct as suffix. Note the use of the vars function here, which acts identically to select, but in the context of a summary or mutation operation on specific variables.
meancent <- function(x) { x - mean(x, na.rm=TRUE) } #simple worker function to mean center a variable
univbct <- univbct %>% mutate_at(vars(starts_with("COMMIT")), funs(cm=meancent, pct=percent_rank))
univbct %>% dplyr::select(starts_with("COMMIT")) %>% summarize_all(mean, na.rm=TRUE) %>% gather()
## key value
## 1 COMMIT1 3.616702e+00
## 2 COMMIT2 3.467514e+00
## 3 COMMIT3 3.537473e+00
## 4 COMMIT 3.540303e+00
## 5 commitmean 3.537767e+00
## 6 COMMIT1_cm -2.195134e-16
## 7 COMMIT2_cm 1.601433e-16
## 8 COMMIT3_cm -7.797226e-17
## 9 COMMIT_cm -8.193985e-17
## 10 commitmean_cm -1.015464e-16
## 11 COMMIT1_pct 4.146716e-01
## 12 COMMIT2_pct 4.174271e-01
## 13 COMMIT3_pct 4.028798e-01
## 14 COMMIT_pct 4.125408e-01
## 15 commitmean_pct 4.228116e-01
Order data by ascending battalion, company, then subnum
univbct <- univbct %>% arrange(BTN, COMPANY, SUBNUM)
Descending sort: descending battalion, ascending company, ascending subnum
univbct <- univbct %>% arrange(desc(BTN), COMPANY, SUBNUM)
In MLM, one strategy for disentangling within- versus between-person effects is to include both within-person-centered variables and person means in the model (Curran & Bauer, 2011).
We can achieve this easily for our three DVs here using a single pipeline that combines tidying and mutation. Using -1 as the sep argument to separate splits the string one character from the right end, so ‘JOBSAT1’ becomes ‘JOBSAT’ and ‘1’.
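A quick demonstration of that splitting behavior on a throwaway tibble (a sketch):

tibble(key=c("JOBSAT1", "COMMIT2")) %>%
separate(col="key", into=c("variable", "occasion"), sep=-1) # yields JOBSAT/1 and COMMIT/2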
For reshaping to work smoothly, we need a unique identifier for each row. Also, univbct is stored in a dangerously untidy format in which variables with suffixes 1-3 represent a ‘wide format’, while the same data are also stored in long format under variables such as ‘JSAT’ and ‘COMMIT.’
Take a look:
univbct %>% dplyr::select(SUBNUM, starts_with("JOBSAT"), JSAT) %>% head(n=20)
## SUBNUM JOBSAT1 JOBSAT2 JOBSAT3 JSAT
## 1 103 2.000000 2.333333 3.333333 2.000000
## 2 103 2.000000 2.333333 3.333333 2.333333
## 3 103 2.000000 2.333333 3.333333 3.333333
## 4 129 3.666667 4.333333 4.666667 3.666667
## 5 129 3.666667 4.333333 4.666667 4.333333
## 6 129 3.666667 4.333333 4.666667 4.666667
## 7 171 3.666667 4.000000 NA 3.666667
## 8 171 3.666667 4.000000 NA 4.000000
## 9 171 3.666667 4.000000 NA NA
## 10 202 1.333333 2.000000 4.333333 1.333333
## 11 202 1.333333 2.000000 4.333333 2.000000
## 12 202 1.333333 2.000000 4.333333 4.333333
## 13 270 4.000000 3.666667 5.000000 4.000000
## 14 270 4.000000 3.666667 5.000000 3.666667
## 15 270 4.000000 3.666667 5.000000 5.000000
## 16 296 4.000000 4.000000 4.000000 4.000000
## 17 296 4.000000 4.000000 4.000000 4.000000
## 18 296 4.000000 4.000000 4.000000 4.000000
## 19 348 3.333333 3.000000 3.333333 3.333333
## 20 348 3.333333 3.000000 3.333333 3.000000
We first need to eliminate this insanity. Group by subject number and retain only the first row (i.e., keep the wide version).
univbct <- univbct %>% group_by(SUBNUM) %>% filter(row_number() == 1) %>%
dplyr::select(-JSAT, -COMMIT, -READY) %>% ungroup()
First, let’s get the data into a conventional format (long) for MLM (e.g., using lmer).
#use -1 as argument to separate to split at the last character
forMLM <- univbct %>% dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3,
COMMIT1, COMMIT2, COMMIT3,
READY1, READY2, READY3) %>%
gather(key="key", value="value", -SUBNUM) %>%
separate(col="key", into=c("variable", "occasion"), -1) %>%
spread(key=variable, value=value) %>% mutate(occasion=as.numeric(occasion))
Now, let’s perform the centering described above. You could do this in one pipeline – I just separated things here for conceptual clarity.
forMLM <- forMLM %>% group_by(SUBNUM) %>%
mutate_at(vars(COMMIT, JOBSAT, READY), funs(wicent=meancent, pmean=mean)) %>%
ungroup()
head(forMLM)
## # A tibble: 6 x 11
## SUBNUM occasion COMMIT JOBSAT READY COMMIT_wicent JOBSAT_wicent
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1.67 1.67 2.75 -0.444 -0.222
## 2 1 2 1.67 1 1 -0.444 -0.889
## 3 1 3 3 3 3 0.889 1.11
## 4 2 1 1.67 3.67 3 0.222 -0.222
## 5 2 2 1.33 4 2 -0.111 0.111
## 6 2 3 1.33 4 1.75 -0.111 0.111
## # ... with 4 more variables: READY_wicent <dbl>, COMMIT_pmean <dbl>,
## # JOBSAT_pmean <dbl>, READY_pmean <dbl>
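With the data in this shape, the within/between decomposition slots directly into a multilevel model. A minimal sketch (assuming the lme4 package is installed; this model specification is illustrative, not from the original analyses):

library(lme4)
m <- lmer(JOBSAT ~ occasion + COMMIT_wicent + COMMIT_pmean + (1 | SUBNUM), data=forMLM)
summary(m)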