The goal of this document is to provide a basic introduction to data wrangling using functions from the so-called ‘tidyverse’ approach. The tidyverse (https://www.tidyverse.org) is a set of data science packages in R that are intended to provide a consistent paradigm for working with data. This approach unifies a previously inchoate landscape of different functions and packages in R that could be daunting to new users.

Although we do not claim that the tidyverse approach is best according to all possible criteria, we do believe that it is the best paradigm for working with data in R for social scientists, many of whom do not have a formal background in computer programming.

Here, we will draw primarily from the tidyr and dplyr packages in R.

For an excellent book-length treatment of the tidyverse approach, see R for Data Science by Hadley Wickham and Garrett Grolemund.

Before we start: beware namespace collisions!

One of the most irritating problems you may encounter in the tidyverse world (and more generally, in R) is when code that previously worked suddenly throws an inexplicable error.

For example:

> survey %>% group_by(R_exp) %>% 
summarize(m_age=mean(Psych_age_yrs), sd_age=sd(Psych_age_yrs))

Error in summarize(., m_age = mean(Psych_age_yrs), sd_age = sd(Psych_age_yrs)) : 
argument "by" is missing, with no default

Because dplyr (and sometimes tidyr) uses fairly intuitive data wrangling verbs such as ‘summarize’ and ‘select’, it can share function names with other packages. For example, Hmisc has a summarize function that does not operate in the same way as summarize in dplyr. Likewise, the predecessor to dplyr was called plyr. Although largely outmoded, plyr has a few remaining functions that may be useful. But many of those functions have the same names as their dplyr counterparts while behaving differently (the syntax is not the same!), which is a common source of collisions when using dplyr.

This points to the problem of what are called ‘namespace collisions.’ When R looks up a function (or any object), it searches through a ‘search path’ of attached packages and environments. You can see the nitty gritty using searchpaths(). But the TL;DR is that if you (or any package you load) attach another package, that package may mask a dplyr function of the same name and make your code crash!
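To see which packages are candidates for masking, you can inspect the search path yourself:

search()       #attached packages in search order; more recently attached packages appear earlier and mask same-named objects further down
searchpaths()  #the same list, with full paths on disk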

What to do?

  1. Watch out for warnings about objects being ‘masked’ when packages are loaded.
  2. Explicitly specify the package where your desired function lives using the double colon operator. Example: dplyr::summarize.
  3. Load tidyverse packages using library(tidyverse). This at least handles collisions within the tidyverse!

Example of output that portends a namespace collision:

library(dplyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
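Given that Hmisc now masks summarize, the earlier error can be resolved by qualifying the call with the double colon operator:

survey %>% group_by(R_exp) %>% 
  dplyr::summarize(m_age=mean(Psych_age_yrs), sd_age=sd(Psych_age_yrs))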

Data pipelines

Although not properly a part of dplyr, the tidyverse paradigm encourages the use of so-called data pipelines when writing the syntax for a multi-step data transformation procedure. The pipe operator %>% is provided by the magrittr package, which is loaded by dplyr. Data pipelines are intended to provide a readable syntax that mirrors the order in which data operations are performed. You can think of this as a recipe for data wrangling. For example:

1. Read data from the file: 'mydata.csv'
2. Rename SDATE to submission_date; rename ed to education
3. Keep only subjects with ID greater than 9 (the first 9 were in-lab pilot subjects)
4. Drop the timestamp and survey_settings variables
5. Compute the log of income as a new variable called log_income
6. Arrange the data by ID, then submission_date (where each subject submitted many surveys)
7. Ensure that ID and submission_date are the left-most columns in the data

We would write this out as a dplyr pipeline using the pipe operator %>% to chain together data operations.

dataset <- read.csv("mydata.csv") %>%
  rename(submission_date=SDATE, education=ed) %>%
  filter(ID > 9) %>%
  select(-timestamp, -survey_settings) %>%
  mutate(log_income = log(income)) %>%
  arrange(ID, submission_date) %>%
  select(ID, submission_date, everything())

This pipeline approach contrasts with traditional functional syntax, which usually follows mathematical notation, such as \(f(g(x))\). In this notation, first we apply \(g\) to \(x\), then we apply \(f\) to the resulting value (i.e., evaluation proceeds from inner to outer).

In R, we might see something like this:

arrange(
  summarize(
    filter(data, variable == "value"), 
    Total = sum(variable)
  ), 
  desc(Total)
)

Although this is not a terrible syntax, it gets confusing to keep track of “the output of this function is the input to the next one.”

In a data pipeline approach, the equivalent syntax would be:

data %>%
  filter(variable == "value") %>%
  summarize(Total = sum(variable)) %>%
  arrange(desc(Total))

This is easier to read from left to right.

(Aside) The history of ‘pipes’

An alternative to the \(f(g(x))\) syntax emerged long ago in Unix shell programming, which uses | as the pipe operator.

find . -iname '*.pdf' | grep -v 'figure' | sort -n

In a Linux shell, this says

1. In the current directory (.), find all files called (something).pdf
2. Remove any file that contains the string 'figure'
3. Sort the files in ascending numeric order

The idea of “pipes” and “redirection” in shell scripting is that the command can be read from left to right, where the pipe | indicates that the output of the left command is provided as input to the right command.

Use of ‘this’ reference in tidyverse

Sometimes it is useful to refer to the current dataset or variable explicitly in tidyverse data wrangling syntax.

dplyr/magrittr tends to hide this from us for convenience, but it’s there under the hood.

iris %>% filter(Sepal.Length > 7)

is the same as

iris %>% filter(., Sepal.Length > 7)

So, ‘.’ refers to the current dataset or variable (depending on context) in dplyr operations. And if you don’t specify where the ‘.’ falls in your syntax, it will always be passed as the first argument to the downstream function.
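If the downstream function expects the data somewhere other than its first argument, you can place the ‘.’ explicitly. A small sketch using base R’s lm, whose first argument is a formula rather than a data frame:

#the piped data is inserted wherever '.' appears, here as the data argument
iris %>% lm(Sepal.Length ~ Petal.Length, data = .)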

Special values to watch out for in data wrangling

R has a handful of special values that crop up constantly in data wrangling: NA (a missing value), NaN (not a number), Inf and -Inf (infinite values), and NULL (an empty object). The most important is NA, which propagates through most computations unless you explicitly remove it (e.g., with na.rm=TRUE), as we will see below.
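A quick demonstration of how these values behave in simple computations:

mean(c(1, 2, NA))               #NA: missingness propagates by default
mean(c(1, 2, NA), na.rm=TRUE)   #1.5: missing values removed before computing
log(0)                          #-Inf
0/0                             #NaN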

dplyr fundamentals

The dplyr package is at the core of data wrangling in the tidyverse, providing a set of wrangling verbs that collectively form a coherent language for data transformations.

Core verbs

  1. filter: subset or remove observations (rows)
  2. select: subset or remove a group of columns (variables)
  3. mutate: add or modify one or more variables (transforming the data)
  4. summarize: collapse multiple values into a single value (e.g., by summing or taking means; aggregation)
  5. arrange: change the order of observations (i.e., sort the data)

Additional important verbs

  1. join: combine datasets on matching variable(s) (i.e., merge two datasets)
  2. group_by: divide dataset according to one or more categorical variables (factors)
  3. ungroup: remove grouping from data operations
  4. rename: change the names of one or more variables
  5. recode: change the values of a discrete variable (especially factor)
  6. slice: subset rows based on numeric order
  7. distinct: remove observations that are duplicates (cf. unique)
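A minimal sketch of a few of these verbs, using the survey data introduced in the next section:

survey %>% rename(experience=r_exp)   #rename a column (new_name=old_name)
survey %>% slice(1:5)                 #keep only the first five rows
survey %>% distinct(r_exp)            #unique values of r_exp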

A first pass through our bootcamp survey

To learn dplyr, let’s start with the survey from our bootcamp. What’s the average age of individuals in the bootcamp, stratified by R expertise?

Note that summarize removes a single level of grouping. Here, we only have one grouping variable, r_exp, so the output of summarize will be ‘ungrouped.’

survey <- read_csv("../data/csv/survey.csv") %>%
  mutate(r_exp=ordered(r_exp, levels=c("none", "limited", "some", "lots", "pro")))

survey %>% group_by(r_exp) %>% 
  dplyr::summarize(n=n(), m_age=mean(age_yrs), sd_age=sd(age_yrs)) %>% 
  kable_table()
r_exp n m_age sd_age
none 15 31 12.0
limited 13 29 7.8
some 26 30 16.9
lots 3 32 0.0
pro 1 28 NA

What if I want to have means and SDs for several continuous variables grouped by R expertise? The summarize_at function lets us specify several variables using vars() and, potentially, several summary functions by passing them in a named list.

survey %>% group_by(r_exp) %>% 
  summarize_at(vars(age_yrs, sleep_hrs, got_s8), list(m=mean, sd=sd)) %>% 
  kable_table()
r_exp age_yrs_m sleep_hrs_m got_s8_m age_yrs_sd sleep_hrs_sd got_s8_sd
none 31 8.3 3.9 12.0 1.3 3.2
limited 29 9.0 4.2 7.8 1.5 3.0
some 30 NA 3.5 16.9 NA 3.2
lots 32 7.0 2.0 0.0 0.0 0.0
pro 28 7.2 1.0 NA NA NA

Let’s slow this down:

group_by

survey %>% group_by(r_exp)

This tells dplyr to divide the survey data into a set of smaller data.frame objects, one per level of r_exp. Internally, this looks something like the output below. After this division of the dataset into chunks, summarize will work on each chunk individually.

## $none
##            time_stamp r_exp got_s8 beverage
## 1 2019-07-18 16:46:18  none      3  Spirits
## 2 2019-07-18 16:46:18  none      3  Spirits
## 3 2019-07-18 16:46:18  none      3  Spirits
## 4 2019-07-19 11:40:44  none      1     Wine
## 5 2019-07-19 11:40:44  none      1     Wine
## 6 2019-07-23 09:56:12  none      1    Water
## 
## $limited
##            time_stamp   r_exp got_s8 beverage
## 1 2019-07-23 09:30:45 limited      1  Spirits
## 2 2019-07-24 11:11:21 limited      7    Water
## 3 2019-07-31 11:15:38 limited      1      Tea
## 4 2019-08-02 11:51:48 limited      3    Water
## 5 2019-08-02 11:51:48 limited      3    Water
## 6 2019-08-12 15:30:56 limited      4    Water
## 
## $some
##            time_stamp r_exp got_s8 beverage
## 1 2019-06-25 12:27:56  some      1   Coffee
## 2 2019-06-25 12:27:56  some      1   Coffee
## 3 2019-06-25 12:27:56  some      1   Coffee
## 4 2019-07-17 15:45:04  some      1      Tea
## 5 2019-07-18 16:12:20  some      1     Wine
## 6 2019-07-18 17:02:34  some      1   Coffee
## 
## $lots
##            time_stamp r_exp got_s8 beverage
## 1 2019-08-06 10:47:38  lots      2     Wine
## 2 2019-08-06 10:47:38  lots      2     Wine
## 3 2019-08-06 10:47:38  lots      2     Wine
## 
## $pro
##            time_stamp r_exp got_s8 beverage
## 1 2019-07-23 17:55:25   pro      1   Coffee

summarize_at

All variants of summarize, including summarize_at, transform a dataset that has many rows into a dataset that has a single row per grouping unit. If you do not use group_by, summarize will yield an overall summary statistic for the entire dataset. For example, to get the mean and SD of hours slept last night, irrespective of R experience, we could just use a simple summarize:

survey %>% 
  summarize(m_sleep=mean(sleep_hrs, na.rm=T), sd_sleep=sd(sleep_hrs, na.rm=T)) %>% 
  kable_table()
m_sleep sd_sleep
8.2 1.3

But because we used group_by(r_exp) above, we got unique summaries of the variables at each level of R experience.

The summarize_at function accepts two primary arguments. First, we specify a set of variables that we wish to summarize in the same way (i.e., compute the same summary statistics). Second, we specify which statistics we wish to compute. In our case, the syntax was:

summarize_at(vars(age_yrs, sleep_hrs, got_s8), list(m=mean, sd=sd))

The vars() function specifies the unquoted names of variables in the dataset we wish to summarize, separated by commas.

The list() object created here asks dplyr to compute the mean and SD of each variable in the vars() statement at each level of R experience (the group_by basis). The names of the list elements (left side) — here, m and sd — become the suffixes added for each variable. The value of the element (right side) — here, mean and sd — are the functions that should be used to compute a summary statistic (they should return one number per grouped variable).

Passing arguments to summary functions

Notice above how the mean and SD for sleep were missing (NA) for the ‘some’ group. This is because someone (or maybe a few people) didn’t fill out that item! In R, many functions have an na.rm argument that indicates whether to remove missing data before computing the statistic. If we wish to pass this argument to mean and sd here, we have two options:

First, we can pass arguments as additional arguments to summarize_at like this:

summarize_at(
  vars(some, variables, here), 
  list(s1=summary1function, s2=summary2function), 
  args_passed_on_1=TRUE, args_passed_on_2="Hot dog"
)

Here, the values of args_passed_on_1 and args_passed_on_2 will be provided as additional arguments to every function in the list (here, we have two of them).

Second, to be even more precise, we can switch to what dplyr and purrr call ‘lambdas’, where we use the tilde (~) to write out the exact function calls.

summarize_at(
  vars(some, variables, here), 
  list(~mean(., na.rm=TRUE), ~sd(., na.rm=TRUE))
)

The lambda approach uses . to denote ‘this’ to dplyr — that is, whatever data dplyr is chewing on at that moment.

survey %>% group_by(r_exp) %>% 
  summarize_at(
    vars(age_yrs, sleep_hrs, got_s8), 
    list(m=mean, sd=sd), na.rm=TRUE
  ) %>% 
  kable_table()
r_exp age_yrs_m sleep_hrs_m got_s8_m age_yrs_sd sleep_hrs_sd got_s8_sd
none 31 8.3 3.9 12.0 1.3 3.2
limited 29 9.0 4.2 7.8 1.5 3.0
some 30 8.0 3.5 16.9 1.2 3.2
lots 32 7.0 2.0 0.0 0.0 0.0
pro 28 7.2 1.0 NA NA NA

Making a summarize pipeline even more beautiful

We can also make the output more beautiful using tidying techniques we’ve already seen in the tidyr tutorial. In particular, notice that the output above mixes columns containing means with columns containing SDs.

We can just extend our data pipeline a bit. The extract function from tidyr is like separate, but with a bit more oomph courtesy of regular expressions. Regular expressions are a more intermediate topic; there is a useful tutorial at http://www.regular-expressions.info/tutorial.html.

survey %>% group_by(r_exp) %>% 
  summarize_at(vars(age_yrs, sleep_hrs, got_s8), list(m=mean, sd=sd)) %>%
  gather(key=var, value=value, -r_exp) %>% 
  extract(col="var", into=c("variable", "statistic"), regex=("(.*)_(.*)$")) %>%
  spread(key=statistic, value=value) %>% arrange(variable, r_exp) %>%
  kable_table()
r_exp variable m sd
none age_yrs 31.3 12.0
limited age_yrs 29.1 7.8
some age_yrs 30.4 16.9
lots age_yrs 32.0 0.0
pro age_yrs 28.0 NA
none got_s8 3.9 3.2
limited got_s8 4.2 3.0
some got_s8 3.5 3.2
lots got_s8 2.0 0.0
pro got_s8 1.0 NA
none sleep_hrs 8.3 1.3
limited sleep_hrs 9.0 1.5
some sleep_hrs NA NA
lots sleep_hrs 7.0 0.0
pro sleep_hrs 7.2 NA

Examining a more complex multilevel dataset using dplyr

Let’s examine the univbct data, which contains longitudinal observations of job satisfaction, commitment, and readiness to deploy. From the documentation ?univbct:

This data set contains the complete data set used in Bliese and Ployhart (2002). The data is longitudinal data converted to univariate (i.e., stacked) form. Data were collected at three time points. A data frame with 22 columns and 1485 observations from 495 individuals.

We have 1485 observations of military personnel nested within companies, which are nested within battalions: https://en.wikipedia.org/wiki/Battalion.

data(univbct, package="multilevel")
str(univbct)
## 'data.frame':    1485 obs. of  22 variables:
##  $ BTN    : num  1022 1022 1022 1004 1004 ...
##  $ COMPANY: Factor w/ 8 levels "A","B","C","D",..: 6 6 6 4 4 4 2 2 2 2 ...
##  $ MARITAL: num  1 1 1 4 4 4 2 2 2 2 ...
##  $ GENDER : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ HOWLONG: num  2 2 2 0 0 0 0 0 0 1 ...
##  $ RANK   : num  12 12 12 13 13 13 15 15 15 14 ...
##  $ EDUCATE: num  2 2 2 2 2 2 2 2 2 2 ...
##  $ AGE    : num  20 20 20 24 24 24 24 24 24 23 ...
##  $ JOBSAT1: num  1.67 1.67 1.67 3.67 3.67 ...
##  $ COMMIT1: num  1.67 1.67 1.67 1.67 1.67 ...
##  $ READY1 : num  2.75 2.75 2.75 3 3 3 3.75 3.75 3.75 2.5 ...
##  $ JOBSAT2: num  1 1 1 4 4 ...
##  $ COMMIT2: num  1.67 1.67 1.67 1.33 1.33 ...
##  $ READY2 : num  1 1 1 2 2 2 3.75 3.75 3.75 3.25 ...
##  $ JOBSAT3: num  3 3 3 4 4 4 4 4 4 3 ...
##  $ COMMIT3: num  3 3 3 1.33 1.33 ...
##  $ READY3 : num  3 3 3 1.75 1.75 1.75 1.75 1.75 1.75 3 ...
##  $ TIME   : num  0 1 2 0 1 2 0 1 2 0 ...
##  $ JSAT   : num  1.67 1 3 3.67 4 ...
##  $ COMMIT : num  1.67 1.67 3 1.67 1.33 ...
##  $ READY  : num  2.75 1 3 3 2 1.75 3.75 3.75 1.75 2.5 ...
##  $ SUBNUM : num  1 1 1 2 2 2 3 3 3 4 ...

Let’s enact the core ‘verbs’ of dplyr to understand and improve the structure of these data.

filter: obtaining observations (rows) based on some criteria

Objective: filter only men in company A

company_A_men <- filter(univbct, COMPANY=="A" & GENDER==1)
#print 10 observations at random to check the accuracy of the filter
#p=11 just shows the first 11 columns to keep it on one page for formatting
company_A_men %>% sample_n(10) %>% kable_table(p=11)
BTN COMPANY MARITAL GENDER HOWLONG RANK EDUCATE AGE JOBSAT1 COMMIT1 READY1
4 A 2 1 5 18 5 44 5.0 5.0 4.0
3066 A 1 1 0 NA 2 19 5.0 5.0 4.0
124 A 2 1 4 14 2 25 4.0 3.7 3.5
404 A 1 1 2 14 2 22 1.7 2.3 1.0
144 A 2 1 5 15 2 NA 3.7 3.7 4.0
3066 A 2 1 1 15 2 30 3.3 4.0 3.0
4 A 2 1 5 14 3 21 3.0 1.7 2.0
1022 A 1 1 2 12 2 18 1.7 2.0 1.5
3066 A 2 1 3 13 3 21 3.0 3.7 2.8
3066 A 1 1 1 13 2 19 2.7 3.0 3.0

How many people are in companies A and B?

filter(univbct, COMPANY %in% c("A","B")) %>% nrow()
## [1] 750

What about counts by company and battalion?

univbct %>% group_by(BTN, COMPANY) %>% count() %>%
  kable_table(n=12)
BTN COMPANY n
4 A 66
4 B 15
4 C 12
4 D 30
4 HHC 18
104 A 12
104 HHC 3
124 A 42
144 A 30
299 A 39
299 B 30
299 C 27

select: obtaining variables (columns) based on some criteria

Let’s start by keeping only the three core dependent variables over time: jobsat, commit, ready. Keep SUBNUM as well for unique identification.

dvs_only <- univbct %>% dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3, 
                                      COMMIT1, COMMIT2, COMMIT3, 
                                      READY1, READY2, READY3)

If you have many variables with a common prefix, you might try starts_with(). Note that in this case it also brings in the bare “READY” variable, not just READY1-READY3. You can mix different selection mechanisms within select; look at the cheatsheet.

dvs_only <- univbct %>% dplyr::select(SUBNUM, starts_with("JOBSAT"), starts_with("COMMIT"), starts_with("READY"))

Other selection mechanisms:

  • contains: variable name contains a literal string
  • starts_with: variable names start with a string
  • ends_with: variable names end with a string
  • num_range: variables that have a common prefix (e.g., ‘reasoning’) and a numeric range (e.g., 1-20)
  • matches: variable name matches a regular expression
  • one_of: variable is one of the elements in a character vector. Example: select(one_of(c("A", "B")))

See ?select_helpers for more details.
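For instance, a few of these helpers applied to univbct (the patterns here are illustrative):

univbct %>% dplyr::select(contains("COMMIT"))        #variables containing 'COMMIT'
univbct %>% dplyr::select(matches("^READY[0-9]$"))   #READY1-READY3, but not the bare READY
univbct %>% dplyr::select(num_range("JOBSAT", 1:3))  #JOBSAT1, JOBSAT2, JOBSAT3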

select + filter

Note that select and filter can be combined to subset both observations and variables of interest.

For example, look at readiness to deploy in battalion 299 only:

univbct %>% filter(BTN==299) %>% dplyr::select(SUBNUM, READY1, READY2, READY3) %>% 
  kable_table(n=6)
SUBNUM READY1 READY2 READY3
4 2.5 3.2 3.0
4 2.5 3.2 3.0
4 2.5 3.2 3.0
7 2.0 1.8 1.2
7 2.0 1.8 1.2
7 2.0 1.8 1.2

Select is also useful for dropping variables that are not of interest using a kind of subtraction syntax.

nojobsat <- univbct %>% dplyr::select(-starts_with("JOBSAT"))
names(nojobsat)
##  [1] "BTN"     "COMPANY" "MARITAL" "GENDER"  "HOWLONG" "RANK"    "EDUCATE"
##  [8] "AGE"     "COMMIT1" "READY1"  "COMMIT2" "READY2"  "COMMIT3" "READY3" 
## [15] "TIME"    "JSAT"    "COMMIT"  "READY"   "SUBNUM"

mutate: add one or more variables that are a function of other variables

(Row-wise) mean of commit scores over waves. Note how you can use select() within mutate to run a function on a subset of the columns.

univbct <- univbct %>% 
  mutate(commitmean=rowMeans(dplyr::select(., COMMIT1, COMMIT2, COMMIT3)))

Mutate can manipulate several variables in one call. Here, we mean-center any variable that starts with COMMIT and add the suffix _cm for clarity. We also compute the percentile rank for each of these columns, with _pct as the suffix. Note the use of the vars function here, which acts identically to select, but in the context of a summary or mutation operation on specific variables.

meancent <- function(x) { x - mean(x, na.rm=TRUE) } #simple worker function to mean center a variable

univbct <- univbct %>% 
  mutate_at(vars(starts_with("COMMIT")), list(cm=meancent, pct=percent_rank))

univbct %>% 
  dplyr::select(starts_with("COMMIT")) %>% 
  summarize_all(mean, na.rm=TRUE) %>% gather() %>%
  kable_table()
key value
COMMIT1 3.62
COMMIT2 3.47
COMMIT3 3.54
COMMIT 3.54
commitmean 3.54
COMMIT1_cm 0.00
COMMIT2_cm 0.00
COMMIT3_cm 0.00
COMMIT_cm 0.00
commitmean_cm 0.00
COMMIT1_pct 0.44
COMMIT2_pct 0.44
COMMIT3_pct 0.43
COMMIT_pct 0.44
commitmean_pct 0.49

arrange: reorder observations in a specific order

Order data by ascending battalion, company, then subnum

univbct <- univbct %>% arrange(BTN, COMPANY, SUBNUM)

Descending sort: descending battalion, ascending company, ascending subnum

univbct <- univbct %>% arrange(desc(BTN), COMPANY, SUBNUM)

A more realistic example: preparation for multilevel analysis

In MLM, one strategy for disentangling within- versus between-person effects is to include both within-person-centered variables and person means in the model (Curran & Bauer, 2011).

We can achieve this easily for our three DVs here using a single pipeline that combines tidying and mutation. Using -1 as the sep argument to separate splits each string one character from the end (i.e., counting from the right), peeling the trailing wave number off the variable name.
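A minimal illustration of this sep=-1 behavior on two made-up variable names:

#negative sep values count the split position from the right of the string
tibble(key=c("JOBSAT1", "COMMIT2")) %>% 
  separate(col="key", into=c("variable", "occasion"), sep=-1)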

For reshaping to work smoothly, we need a unique identifier for each row. Also, univbct is stored in a dangerously untidy format in which variables with suffix 1-3 indicate a ‘wide format’, but the data is also in long format under variables such as ‘JSAT’ and ‘COMMIT.’ In other words, there is a peculiar redundancy in the data that is altogether confusing.

Take a look:

univbct %>% dplyr::select(SUBNUM, starts_with("JOBSAT"), JSAT) %>% kable_table(n=12)
SUBNUM JOBSAT1 JOBSAT2 JOBSAT3 JSAT
103 2.0 2.3 3.3 2.0
103 2.0 2.3 3.3 2.3
103 2.0 2.3 3.3 3.3
129 3.7 4.3 4.7 3.7
129 3.7 4.3 4.7 4.3
129 3.7 4.3 4.7 4.7
171 3.7 4.0 NA 3.7
171 3.7 4.0 NA 4.0
171 3.7 4.0 NA NA
202 1.3 2.0 4.3 1.3
202 1.3 2.0 4.3 2.0
202 1.3 2.0 4.3 4.3

We first need to eliminate this insanity. Group by subject number and retain only the first row (i.e., keep the wide version).

univbct <- univbct %>% group_by(SUBNUM) %>% filter(row_number() == 1) %>% 
  dplyr::select(-JSAT, -COMMIT, -READY) %>% ungroup()
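Equivalently, the distinct verb from the list above collapses to the first row per subject in one step; this sketch would produce the same result as the pipeline above:

#keep the first row for each SUBNUM, retaining all other columns
univbct %>% distinct(SUBNUM, .keep_all=TRUE) %>% 
  dplyr::select(-JSAT, -COMMIT, -READY)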

Next, let’s get the data into a conventional long format for MLM (e.g., for use with lmer)

#use sep=-1 to split one character from the right end of the string
forMLM <- univbct %>% dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3, 
                                    COMMIT1, COMMIT2, COMMIT3, 
                                    READY1, READY2, READY3) %>% 
  gather(key="key", value="value", -SUBNUM) %>% 
  separate(col="key", into=c("variable", "occasion"), -1) %>%
  spread(key=variable, value=value) %>% mutate(occasion=as.numeric(occasion))

Now, let’s perform the centering described above. You could do this in one pipeline – I just separated things here for conceptual clarity.

forMLM <- forMLM %>% group_by(SUBNUM) %>% 
  mutate_at(vars(COMMIT, JOBSAT, READY), list(wic=meancent, pm=mean)) %>%
  ungroup()

forMLM %>% kable_table(n=10) %>% kable_styling(font_size = 14)
SUBNUM occasion COMMIT JOBSAT READY COMMIT_wic JOBSAT_wic READY_wic COMMIT_pm JOBSAT_pm READY_pm
1 1 1.7 1.7 2.8 -0.44 -0.22 0.50 2.1 1.9 2.2
1 2 1.7 1.0 1.0 -0.44 -0.89 -1.25 2.1 1.9 2.2
1 3 3.0 3.0 3.0 0.89 1.11 0.75 2.1 1.9 2.2
2 1 1.7 3.7 3.0 0.22 -0.22 0.75 1.4 3.9 2.2
2 2 1.3 4.0 2.0 -0.11 0.11 -0.25 1.4 3.9 2.2
2 3 1.3 4.0 1.8 -0.11 0.11 -0.50 1.4 3.9 2.2
3 1 3.3 4.0 3.8 -0.11 0.00 0.67 3.4 4.0 3.1
3 2 3.3 4.0 3.8 -0.11 0.00 0.67 3.4 4.0 3.1
3 3 3.7 4.0 1.8 0.22 0.00 -1.33 3.4 4.0 3.1
4 1 3.0 3.3 2.5 -0.11 0.00 -0.42 3.1 3.3 2.9
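With the data in this shape, the within/between decomposition can enter a multilevel model directly. A sketch, assuming the lme4 package is installed (the model specification is illustrative, not part of the original analysis):

library(lme4)
#job satisfaction as a function of within-person and between-person commitment
m <- lmer(JOBSAT ~ COMMIT_wic + COMMIT_pm + (1 | SUBNUM), data=forMLM)
summary(m)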