The goal of this document is to provide a basic introduction to data wrangling using functions from the so-called ‘tidyverse’ approach. The tidyverse (https://www.tidyverse.org) is a set of data science packages in R
that are intended to provide a consistent paradigm for working with data. This approach unifies a previously inchoate landscape of different functions and packages in R
that could be daunting to new users.
Although we do not claim that the tidyverse approach is best according to all possible criteria, we do believe that it is the best paradigm for working with data in R for social scientists, many of whom do not have a formal background in computer programming.
Here, we will draw primarily from the tidyr and dplyr packages in R.
For an excellent book-length treatment of the tidyverse approach, see R for Data Science by Hadley Wickham and Garrett Grolemund.
One of the most irritating problems you may encounter in the tidyverse world (and more generally, in R) is when code that previously worked suddenly throws an inexplicable error.
For example:
> survey %>% group_by(R_exp) %>%
summarize(m_age=mean(Psych_age_yrs), sd_age=sd(Psych_age_yrs))
Error in summarize(., m_age = mean(Psych_age_yrs), sd_age = sd(Psych_age_yrs)) :
argument "by" is missing, with no default
Because dplyr (and sometimes tidyr) uses fairly intuitive data wrangling verbs such as summarize and select as function names, it can share function names with other packages. For example, Hmisc has a summarize function that does not operate in the same way as summarize in dplyr. Also, the predecessor to dplyr was called plyr. Although largely outmoded, plyr has a few remaining functions that may be useful. But many of those functions have the same names as their dplyr counterparts while operating differently (the syntax is not the same!), which is a common source of collisions when using dplyr.
This points to the problem of what are called ‘namespace collisions.’ That is, when R looks for a function (or any object), it searches through a ‘path’ of loaded packages and environments. You can see the nitty gritty using searchpaths(). But the TL;DR is that if you (or any function you call) load another package, that package may mask a dplyr function and make your code crash!
Two ways to defend yourself: call the function you want by its explicit namespace, as in dplyr::summarize, or load library(tidyverse) after other packages. The latter at least handles collisions within the tidyverse!
Example of output that portends a namespace collision:
library(dplyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
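Once masking has occurred, you can still reach the function you want by qualifying it with its package namespace. A minimal sketch, reusing the survey example from above:
#qualify the call so that Hmisc's summarize cannot intercept it
survey %>% group_by(R_exp) %>%
  dplyr::summarize(m_age=mean(Psych_age_yrs), sd_age=sd(Psych_age_yrs))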
Although not properly a part of dplyr, the tidyverse paradigm encourages the use of so-called data pipelines when writing the syntax for a multi-step data transformation procedure. The pipe operator %>% is provided by the magrittr package, which is loaded by dplyr. Data pipeline syntax is intended to provide a readable expression of the order in which data operations are performed. You can think of this as a recipe for data wrangling. For example:
1. Read data from the file: 'mydata.csv'
2. Rename SDATE to submission_date; rename ed to education
3. Keep only subjects with IDs of 10 or greater (the first 9 were in-lab pilot subjects)
4. Drop the timestamp and survey_settings variables
5. Compute the log of income as a new variable called log_income
6. Arrange the data by ID, then submission_date (where each subject submitted many surveys)
7. Ensure that ID and submission_date are the left-most columns in the data
We would write this out as a dplyr pipeline, using the pipe operator %>% to chain together the data operations:
dataset <- read.csv("mydata.csv") %>%
  rename(submission_date=SDATE, education=ed) %>%
  filter(ID >= 10) %>%
  select(-timestamp, -survey_settings) %>%
  mutate(log_income = log(income)) %>%
  arrange(ID, submission_date) %>%
  select(ID, submission_date, everything())
This pipeline approach contrasts with traditional functional programming syntax, which usually follows mathematical notation such as \(f(g(x))\). In this notation, we first apply \(g(x)\), then apply \(f\) to the resulting value (i.e., we work from inner to outer). In R, we might see something like this:
arrange(summarize(
  filter(data, variable == "value"), Total = sum(variable)),
  desc(Total)
)
Although this is not a terrible syntax, it gets confusing to keep track of “the output of this function is the input to the next one.”
In a data pipeline approach, the equivalent syntax would be:
data %>%
filter(variable == "value") %>%
summarize(Total = sum(variable)) %>%
arrange(desc(Total))
This is easier to read from left to right.
An alternative to the \(f(g(x))\) syntax emerged long ago in Unix terminal programming, which uses | as the pipe operator:
find . -iname '*.pdf' | grep -v 'figure' | sort -n
In the Linux shell, this says:
1. In the current directory (.), find all files called (something).pdf
2. Remove any file that contains the string 'figure'
3. Sort the files in ascending numeric order
The idea of “pipes” and “redirection” in shell scripting is that the command can be read from left to right, where the pipe | indicates that the output of the left command is provided as input to the right command.
Sometimes it is useful to refer to the current dataset or variable explicitly in tidyverse data wrangling syntax.
dplyr/magrittr tends to hide this from us for convenience, but it’s there under the hood.
iris %>% filter(Sepal.Length > 7)
is the same as
iris %>% filter(., Sepal.Length > 7)
So, ‘.’ refers to the current dataset or variable (depending on context) in dplyr operations. And if you don’t specify where the ‘.’ falls in your syntax, it will always be passed as the first argument to the downstream function.
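You can also place the ‘.’ somewhere other than the first argument. When ‘.’ appears as a top-level argument, magrittr no longer inserts the piped value as the first argument. Two toy examples (illustrative only):
4 %>% rep("a", times = .)                    #same as rep("a", times = 4)
letters %>% paste0("item_", .) %>% head(3)   #"item_a" "item_b" "item_c"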
The dplyr package is at the core of data wrangling in the tidyverse, providing a set of wrangling verbs that collectively form a coherent language for data transformations.
To learn dplyr, let’s start with the survey from our bootcamp. What’s the average age of individuals in the bootcamp, stratified by R expertise?
Note that summarize removes a single level of grouping. Here, we only have one grouping variable, r_exp, so the output of summarize will be ungrouped.
survey <- read_csv("../data/csv/survey.csv") %>%
mutate(r_exp=ordered(r_exp, levels=c("none", "limited", "some", "lots", "pro")))
survey %>% group_by(r_exp) %>%
dplyr::summarize(n=n(), m_age=mean(age_yrs), sd_age=sd(age_yrs)) %>%
kable_table()
r_exp | n | m_age | sd_age |
---|---|---|---|
none | 15 | 31 | 12.0 |
limited | 13 | 29 | 7.8 |
some | 26 | 30 | 16.9 |
lots | 3 | 32 | 0.0 |
pro | 1 | 28 | NA |
What if I want to have means and SDs for several continuous variables grouped by R expertise? The summarize_at function provides this functionality: we specify several variables using vars() and potentially several summary functions by passing them in a named list.
survey %>% group_by(r_exp) %>%
summarize_at(vars(age_yrs, sleep_hrs, got_s8), list(m=mean, sd=sd)) %>%
kable_table()
r_exp | age_yrs_m | sleep_hrs_m | got_s8_m | age_yrs_sd | sleep_hrs_sd | got_s8_sd |
---|---|---|---|---|---|---|
none | 31 | 8.3 | 3.9 | 12.0 | 1.3 | 3.2 |
limited | 29 | 9.0 | 4.2 | 7.8 | 1.5 | 3.0 |
some | 30 | NA | 3.5 | 16.9 | NA | 3.2 |
lots | 32 | 7.0 | 2.0 | 0.0 | 0.0 | 0.0 |
pro | 28 | 7.2 | 1.0 | NA | NA | NA |
Let’s slow this down:
survey %>% group_by(r_exp)
This tells dplyr to divide the survey data into a set of smaller data.frame objects, one per level of r_exp. Internally, this looks something like the output below. After this division of the dataset into chunks, summarize will work on each chunk individually.
## $none
## time_stamp r_exp got_s8 beverage
## 1 2019-07-18 16:46:18 none 3 Spirits
## 2 2019-07-18 16:46:18 none 3 Spirits
## 3 2019-07-18 16:46:18 none 3 Spirits
## 4 2019-07-19 11:40:44 none 1 Wine
## 5 2019-07-19 11:40:44 none 1 Wine
## 6 2019-07-23 09:56:12 none 1 Water
##
## $limited
## time_stamp r_exp got_s8 beverage
## 1 2019-07-23 09:30:45 limited 1 Spirits
## 2 2019-07-24 11:11:21 limited 7 Water
## 3 2019-07-31 11:15:38 limited 1 Tea
## 4 2019-08-02 11:51:48 limited 3 Water
## 5 2019-08-02 11:51:48 limited 3 Water
## 6 2019-08-12 15:30:56 limited 4 Water
##
## $some
## time_stamp r_exp got_s8 beverage
## 1 2019-06-25 12:27:56 some 1 Coffee
## 2 2019-06-25 12:27:56 some 1 Coffee
## 3 2019-06-25 12:27:56 some 1 Coffee
## 4 2019-07-17 15:45:04 some 1 Tea
## 5 2019-07-18 16:12:20 some 1 Wine
## 6 2019-07-18 17:02:34 some 1 Coffee
##
## $lots
## time_stamp r_exp got_s8 beverage
## 1 2019-08-06 10:47:38 lots 2 Wine
## 2 2019-08-06 10:47:38 lots 2 Wine
## 3 2019-08-06 10:47:38 lots 2 Wine
##
## $pro
## time_stamp r_exp got_s8 beverage
## 1 2019-07-23 17:55:25 pro 1 Coffee
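You can produce chunks like these yourself. One sketch uses base R's split() (dplyr's internal grouped representation differs, but the logic is the same):
survey %>%
  dplyr::select(time_stamp, r_exp, got_s8, beverage) %>%
  split(.$r_exp) %>%    #one data.frame per level of r_exp
  lapply(head)          #show just the first rows of each chunk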
All variants of summarize, including summarize_at, transform from a dataset that has many rows to a dataset that has a single row per grouping unit. If you do not use group_by, summarize will yield a single overall summary statistic for the entire dataset. For example, to get the mean and SD of hours slept last night, irrespective of R experience, we could just use a simple summarize:
survey %>%
summarize(m_sleep=mean(sleep_hrs, na.rm=T), sd_sleep=sd(sleep_hrs, na.rm=T)) %>%
kable_table()
m_sleep | sd_sleep |
---|---|
8.2 | 1.3 |
But because we used group_by(r_exp) above, we got unique summaries of the variables at each level of R experience.
The summarize_at function accepts two primary arguments. First, we specify a set of variables that we wish to summarize in the same way (i.e., compute the same summary statistics). Second, we specify which statistics we wish to compute. In our case, the syntax was:
summarize_at(vars(age_yrs, sleep_hrs, got_s8), list(m=mean, sd=sd))
The vars() function specifies the unquoted names of variables in the dataset we wish to summarize, separated by commas. The list() object asks dplyr to compute the mean and SD of each variable in the vars() statement at each level of R experience (the group_by basis). The names of the list elements (left side), here m and sd, become the suffixes added to each variable. The values of the elements (right side), here mean and sd, are the functions used to compute a summary statistic (each should return one number per grouped variable).
Passing arguments to summary functions
Notice above how the mean and SD for sleep were missing (NA) for the ‘some’ group. This is because someone (or maybe a few people) didn’t fill out that item! In R, many functions have an na.rm argument that indicates whether to remove missing data before computing the statistic. If we wish to pass this argument to mean and sd here, we have two options:
First, we can pass arguments as additional arguments to summarize_at, like this:
summarize_at(
vars(some, variables, here),
list(s1=summary1function, s2=summary2function),
args_passed_on_1=TRUE, args_passed_on_2="Hot dog"
)
Here, the values of args_passed_on_1 and args_passed_on_2 will be provided as additional arguments to every function in the list (here, we have two of them).
Second, to be even more precise, we can switch to what dplyr and purrr call ‘lambdas’, where we use the tilde (~) to write out the exact function calls.
summarize_at(
  vars(some, variables, here),
  list(m=~mean(., na.rm=TRUE), sd=~sd(., na.rm=TRUE))
)
The lambda approach uses . to denote ‘this’ to dplyr, that is, whatever data dplyr is chewing on at that moment. Here is the first approach in action, passing na.rm=TRUE on to both mean and sd; notice that the ‘some’ group now has sleep statistics:
survey %>% group_by(r_exp) %>%
summarize_at(
vars(age_yrs, sleep_hrs, got_s8),
list(m=mean, sd=sd), na.rm=TRUE
) %>%
kable_table()
r_exp | age_yrs_m | sleep_hrs_m | got_s8_m | age_yrs_sd | sleep_hrs_sd | got_s8_sd |
---|---|---|---|---|---|---|
none | 31 | 8.3 | 3.9 | 12.0 | 1.3 | 3.2 |
limited | 29 | 9.0 | 4.2 | 7.8 | 1.5 | 3.0 |
some | 30 | 8.0 | 3.5 | 16.9 | 1.2 | 3.2 |
lots | 32 | 7.0 | 2.0 | 0.0 | 0.0 | 0.0 |
pro | 28 | 7.2 | 1.0 | NA | NA | NA |
We can also make the output more beautiful using tidying techniques we’ve already seen in the tidyr tutorial. Remember that R is all about programming for data science. In particular, notice that we have some columns that are means and others that are SDs. We can just extend our data pipeline a bit. The extract function from tidyr is like separate, but with a bit more oomph, using regular expressions. Regular expressions are a more intermediate topic, but there is a useful tutorial here: http://www.regular-expressions.info/tutorial.html.
survey %>% group_by(r_exp) %>%
  summarize_at(vars(age_yrs, sleep_hrs, got_s8), list(m=mean, sd=sd)) %>%
  gather(key=var, value=value, -r_exp) %>%
  extract(col="var", into=c("variable", "statistic"), regex="(.*)_(.*)$") %>%
  spread(key=statistic, value=value) %>% arrange(variable, r_exp) %>%
  kable_table()
r_exp | variable | m | sd |
---|---|---|---|
none | age_yrs | 31.3 | 12.0 |
limited | age_yrs | 29.1 | 7.8 |
some | age_yrs | 30.4 | 16.9 |
lots | age_yrs | 32.0 | 0.0 |
pro | age_yrs | 28.0 | NA |
none | got_s8 | 3.9 | 3.2 |
limited | got_s8 | 4.2 | 3.0 |
some | got_s8 | 3.5 | 3.2 |
lots | got_s8 | 2.0 | 0.0 |
pro | got_s8 | 1.0 | NA |
none | sleep_hrs | 8.3 | 1.3 |
limited | sleep_hrs | 9.0 | 1.5 |
some | sleep_hrs | NA | NA |
lots | sleep_hrs | 7.0 | 0.0 |
pro | sleep_hrs | 7.2 | NA |
Let’s examine the univbct data, which contains longitudinal observations of job satisfaction, commitment, and readiness to deploy. From the documentation (?univbct):
This data set contains the complete data set used in Bliese and Ployhart (2002). The data is longitudinal data converted to univariate (i.e., stacked) form. Data were collected at three time points. A data frame with 22 columns and 1485 observations from 495 individuals.
We have 1485 observations of military personnel nested within companies, which are nested within battalions: https://en.wikipedia.org/wiki/Battalion.
data(univbct, package="multilevel")
str(univbct)
## 'data.frame': 1485 obs. of 22 variables:
## $ BTN : num 1022 1022 1022 1004 1004 ...
## $ COMPANY: Factor w/ 8 levels "A","B","C","D",..: 6 6 6 4 4 4 2 2 2 2 ...
## $ MARITAL: num 1 1 1 4 4 4 2 2 2 2 ...
## $ GENDER : num 1 1 1 1 1 1 1 1 1 1 ...
## $ HOWLONG: num 2 2 2 0 0 0 0 0 0 1 ...
## $ RANK : num 12 12 12 13 13 13 15 15 15 14 ...
## $ EDUCATE: num 2 2 2 2 2 2 2 2 2 2 ...
## $ AGE : num 20 20 20 24 24 24 24 24 24 23 ...
## $ JOBSAT1: num 1.67 1.67 1.67 3.67 3.67 ...
## $ COMMIT1: num 1.67 1.67 1.67 1.67 1.67 ...
## $ READY1 : num 2.75 2.75 2.75 3 3 3 3.75 3.75 3.75 2.5 ...
## $ JOBSAT2: num 1 1 1 4 4 ...
## $ COMMIT2: num 1.67 1.67 1.67 1.33 1.33 ...
## $ READY2 : num 1 1 1 2 2 2 3.75 3.75 3.75 3.25 ...
## $ JOBSAT3: num 3 3 3 4 4 4 4 4 4 3 ...
## $ COMMIT3: num 3 3 3 1.33 1.33 ...
## $ READY3 : num 3 3 3 1.75 1.75 1.75 1.75 1.75 1.75 3 ...
## $ TIME : num 0 1 2 0 1 2 0 1 2 0 ...
## $ JSAT : num 1.67 1 3 3.67 4 ...
## $ COMMIT : num 1.67 1.67 3 1.67 1.33 ...
## $ READY : num 2.75 1 3 3 2 1.75 3.75 3.75 1.75 2.5 ...
## $ SUBNUM : num 1 1 1 2 2 2 3 3 3 4 ...
Let’s enact the core ‘verbs’ of dplyr to understand and improve the structure of these data.
Objective: filter only men in company A
company_A_men <- filter(univbct, COMPANY=="A" & GENDER==1)
#print 10 observations at random to check the accuracy of the filter
#p=11 just shows the first 11 columns to keep it on one page for formatting
company_A_men %>% sample_n(10) %>% kable_table(p=11)
BTN | COMPANY | MARITAL | GENDER | HOWLONG | RANK | EDUCATE | AGE | JOBSAT1 | COMMIT1 | READY1 |
---|---|---|---|---|---|---|---|---|---|---|
4 | A | 2 | 1 | 5 | 18 | 5 | 44 | 5.0 | 5.0 | 4.0 |
3066 | A | 1 | 1 | 0 | NA | 2 | 19 | 5.0 | 5.0 | 4.0 |
124 | A | 2 | 1 | 4 | 14 | 2 | 25 | 4.0 | 3.7 | 3.5 |
404 | A | 1 | 1 | 2 | 14 | 2 | 22 | 1.7 | 2.3 | 1.0 |
144 | A | 2 | 1 | 5 | 15 | 2 | NA | 3.7 | 3.7 | 4.0 |
3066 | A | 2 | 1 | 1 | 15 | 2 | 30 | 3.3 | 4.0 | 3.0 |
4 | A | 2 | 1 | 5 | 14 | 3 | 21 | 3.0 | 1.7 | 2.0 |
1022 | A | 1 | 1 | 2 | 12 | 2 | 18 | 1.7 | 2.0 | 1.5 |
3066 | A | 2 | 1 | 3 | 13 | 3 | 21 | 3.0 | 3.7 | 2.8 |
3066 | A | 1 | 1 | 1 | 13 | 2 | 19 | 2.7 | 3.0 | 3.0 |
How many people are in companies A and B?
filter(univbct, COMPANY %in% c("A","B")) %>% nrow()
## [1] 750
What about counts by company and battalion?
univbct %>% group_by(BTN, COMPANY) %>% count() %>%
kable_table(n=12)
BTN | COMPANY | n |
---|---|---|
4 | A | 66 |
4 | B | 15 |
4 | C | 12 |
4 | D | 30 |
4 | HHC | 18 |
104 | A | 12 |
104 | HHC | 3 |
124 | A | 42 |
144 | A | 30 |
299 | A | 39 |
299 | B | 30 |
299 | C | 27 |
Let’s start by keeping only the three core dependent variables over time: jobsat, commit, ready. Keep SUBNUM as well for unique identification.
dvs_only <- univbct %>% dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3,
COMMIT1, COMMIT2, COMMIT3,
READY1, READY2, READY3)
If you have many variables with similar names, you might try starts_with(). Note that in this case, starts_with("READY") brings in the long-format READY variable, too (and likewise COMMIT for starts_with("COMMIT")), not just the numbered waves. Note that you can mix different selection mechanisms within select. Look at the cheatsheet.
dvs_only <- univbct %>% dplyr::select(SUBNUM, starts_with("JOBSAT"), starts_with("COMMIT"), starts_with("READY"))
Other selection mechanisms include ends_with(), contains(), matches() (regular expressions), num_range(), one_of(), and everything(). See ?select_helpers for more details; a few are sketched below.
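A brief illustrative sketch of a few helpers (the object names on the left are arbitrary; the column names come from univbct above):
jobsat_waves <- univbct %>% dplyr::select(num_range("JOBSAT", 1:3)) #prefix + numeric range: JOBSAT1-JOBSAT3
commit_vars <- univbct %>% dplyr::select(contains("COMMIT"))        #literal substring, case-insensitive by default
ready_waves <- univbct %>% dplyr::select(matches("^READY[0-9]$"))   #regular expression: READY1-READY3 only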
Note that select and filter can be combined to subset both observations and variables of interest.
For example, look at readiness to deploy in battalion 299 only:
univbct %>% filter(BTN==299) %>% dplyr::select(SUBNUM, READY1, READY2, READY3) %>%
kable_table(n=6)
SUBNUM | READY1 | READY2 | READY3 |
---|---|---|---|
4 | 2.5 | 3.2 | 3.0 |
4 | 2.5 | 3.2 | 3.0 |
4 | 2.5 | 3.2 | 3.0 |
7 | 2.0 | 1.8 | 1.2 |
7 | 2.0 | 1.8 | 1.2 |
7 | 2.0 | 1.8 | 1.2 |
Select is also useful for dropping variables that are not of interest using a kind of subtraction syntax.
nojobsat <- univbct %>% dplyr::select(-starts_with("JOBSAT"))
names(nojobsat)
## [1] "BTN" "COMPANY" "MARITAL" "GENDER" "HOWLONG" "RANK" "EDUCATE"
## [8] "AGE" "COMMIT1" "READY1" "COMMIT2" "READY2" "COMMIT3" "READY3"
## [15] "TIME" "JSAT" "COMMIT" "READY" "SUBNUM"
(Row-wise) mean of commit scores over waves. Note how you can use select() within a mutate to run a function on a subset of the data.
univbct <- univbct %>%
mutate(commitmean=rowMeans(dplyr::select(., COMMIT1, COMMIT2, COMMIT3)))
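An equivalent sketch using dplyr's rowwise(), which makes the per-row computation explicit (the name univbct_rw is arbitrary; this is typically slower than rowMeans on large data):
univbct_rw <- univbct %>%
  rowwise() %>%  #treat each row as its own group
  mutate(commitmean = mean(c(COMMIT1, COMMIT2, COMMIT3))) %>%
  ungroup()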
Mutate can manipulate several variables in one call. Here, we mean center any variable that starts with COMMIT and add the suffix _cm for clarity. We also compute the percentile rank for each of these columns, with _pct as the suffix. Note the use of the vars function here, which acts identically to select, but in the context of a summary or mutation operation on specific variables.
meancent <- function(x) { x - mean(x, na.rm=TRUE) } #simple worker function to mean center a variable
univbct <- univbct %>%
mutate_at(vars(starts_with("COMMIT")), list(cm=meancent, pct=percent_rank))
univbct %>%
dplyr::select(starts_with("COMMIT")) %>%
summarize_all(mean, na.rm=TRUE) %>% gather() %>%
kable_table()
key | value |
---|---|
COMMIT1 | 3.62 |
COMMIT2 | 3.47 |
COMMIT3 | 3.54 |
COMMIT | 3.54 |
commitmean | 3.54 |
COMMIT1_cm | 0.00 |
COMMIT2_cm | 0.00 |
COMMIT3_cm | 0.00 |
COMMIT_cm | 0.00 |
commitmean_cm | 0.00 |
COMMIT1_pct | 0.44 |
COMMIT2_pct | 0.44 |
COMMIT3_pct | 0.43 |
COMMIT_pct | 0.44 |
commitmean_pct | 0.49 |
Order data by ascending battalion, company, then subnum
univbct <- univbct %>% arrange(BTN, COMPANY, SUBNUM)
Descending sort: descending battalion, ascending company, ascending subnum
univbct <- univbct %>% arrange(desc(BTN), COMPANY, SUBNUM)
In MLM, one strategy for disentangling within- versus between-person effects is to include both within-person-centered variables and person means in the model (Curran & Bauer, 2011). We can achieve this easily for our three DVs here using a single pipeline that combines tidying and mutation. Using -1 as the sep argument to separate splits each string before its final character (negative positions count from the right end), so that, for example, JOBSAT1 becomes variable JOBSAT and occasion 1.
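A toy sketch of that splitting behavior on a hypothetical two-row input:
tibble(key = c("JOBSAT1", "COMMIT2")) %>%
  separate(col = "key", into = c("variable", "occasion"), sep = -1)
#yields variable = JOBSAT, COMMIT and occasion = 1, 2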
For reshaping to work smoothly, we need a unique identifier for each row. Also, univbct is stored in a dangerously untidy format in which variables with the suffixes 1-3 indicate a ‘wide format’, but the data are also in long format under variables such as ‘JSAT’ and ‘COMMIT.’ In other words, there is a peculiar redundancy in the data that is altogether confusing.
Take a look:
univbct %>% dplyr::select(SUBNUM, starts_with("JOBSAT"), JSAT) %>% kable_table(n=12)
SUBNUM | JOBSAT1 | JOBSAT2 | JOBSAT3 | JSAT |
---|---|---|---|---|
103 | 2.0 | 2.3 | 3.3 | 2.0 |
103 | 2.0 | 2.3 | 3.3 | 2.3 |
103 | 2.0 | 2.3 | 3.3 | 3.3 |
129 | 3.7 | 4.3 | 4.7 | 3.7 |
129 | 3.7 | 4.3 | 4.7 | 4.3 |
129 | 3.7 | 4.3 | 4.7 | 4.7 |
171 | 3.7 | 4.0 | NA | 3.7 |
171 | 3.7 | 4.0 | NA | 4.0 |
171 | 3.7 | 4.0 | NA | NA |
202 | 1.3 | 2.0 | 4.3 | 1.3 |
202 | 1.3 | 2.0 | 4.3 | 2.0 |
202 | 1.3 | 2.0 | 4.3 | 4.3 |
We first need to eliminate this insanity. Group by subject number and retain only the first row (i.e., keep the wide version).
univbct <- univbct %>% group_by(SUBNUM) %>% filter(row_number() == 1) %>%
dplyr::select(-JSAT, -COMMIT, -READY) %>% ungroup()
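For the record, distinct() offers an equivalent one-step route to ‘first row per subject.’ A sketch of what the step above could have looked like (shown for illustration only; univbct has already been deduplicated at this point):
univbct %>%
  distinct(SUBNUM, .keep_all = TRUE) %>%  #keep the first row for each SUBNUM
  dplyr::select(-JSAT, -COMMIT, -READY)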
Now, let’s get the data into a conventional long format for MLM (e.g., for use with lmer):
#use -1 as the sep argument to separate: split before the last character (the occasion digit)
forMLM <- univbct %>% dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3,
COMMIT1, COMMIT2, COMMIT3,
READY1, READY2, READY3) %>%
gather(key="key", value="value", -SUBNUM) %>%
separate(col="key", into=c("variable", "occasion"), -1) %>%
spread(key=variable, value=value) %>% mutate(occasion=as.numeric(occasion))
Now, let’s perform the centering described above. You could do this in one pipeline – I just separated things here for conceptual clarity.
forMLM <- forMLM %>% group_by(SUBNUM) %>%
mutate_at(vars(COMMIT, JOBSAT, READY), list(wic=meancent, pm=mean)) %>%
ungroup()
forMLM %>% kable_table(n=10) %>% kable_styling(font_size = 14)
SUBNUM | occasion | COMMIT | JOBSAT | READY | COMMIT_wic | JOBSAT_wic | READY_wic | COMMIT_pm | JOBSAT_pm | READY_pm |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1.7 | 1.7 | 2.8 | -0.44 | -0.22 | 0.50 | 2.1 | 1.9 | 2.2 |
1 | 2 | 1.7 | 1.0 | 1.0 | -0.44 | -0.89 | -1.25 | 2.1 | 1.9 | 2.2 |
1 | 3 | 3.0 | 3.0 | 3.0 | 0.89 | 1.11 | 0.75 | 2.1 | 1.9 | 2.2 |
2 | 1 | 1.7 | 3.7 | 3.0 | 0.22 | -0.22 | 0.75 | 1.4 | 3.9 | 2.2 |
2 | 2 | 1.3 | 4.0 | 2.0 | -0.11 | 0.11 | -0.25 | 1.4 | 3.9 | 2.2 |
2 | 3 | 1.3 | 4.0 | 1.8 | -0.11 | 0.11 | -0.50 | 1.4 | 3.9 | 2.2 |
3 | 1 | 3.3 | 4.0 | 3.8 | -0.11 | 0.00 | 0.67 | 3.4 | 4.0 | 3.1 |
3 | 2 | 3.3 | 4.0 | 3.8 | -0.11 | 0.00 | 0.67 | 3.4 | 4.0 | 3.1 |
3 | 3 | 3.7 | 4.0 | 1.8 | 0.22 | 0.00 | -1.33 | 3.4 | 4.0 | 3.1 |
4 | 1 | 3.0 | 3.3 | 2.5 | -0.11 | 0.00 | -0.42 | 3.1 | 3.3 | 2.9 |
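With the data in this shape, a within/between decomposition model in the spirit of Curran and Bauer (2011) might look like the following sketch (assumes the lme4 package is installed; the choice of predictors is purely illustrative):
library(lme4)
#random intercept per person; COMMIT_wic carries the within-person effect,
#COMMIT_pm the between-person effect, of commitment on job satisfaction
m <- lmer(JOBSAT ~ COMMIT_wic + COMMIT_pm + (1 | SUBNUM), data = forMLM)
summary(m)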