tidyr reminders

The tidyr package provides a small number of functions for reshaping data into a tidy format. Tidy data are defined by:

Each variable forms a column
Each observation forms a row
Each type of observational unit (e.g., persons, schools, counties) forms a table.

Imagine a dataset where you have ratings of well-being and anxiety measured 4 times in a longitudinal study.

The dataset looks like this:

df <- data.frame(subid=1:10, 
                 well_w1=rnorm(10, 5, 1), 
                 well_w2=rnorm(10, 6, 1), 
                 well_w3=rnorm(10, 7, 1),
                 anx_w1=rnorm(10, 9, 1), 
                 anx_w2=rnorm(10, 6, 1), 
                 anx_w3=rnorm(10, 7, 1))

df

subid	well_w1	well_w2	well_w3	anx_w1	anx_w2	anx_w3
1	7.19	6.30	6.94	9.48	5.37	5.99
2	4.82	7.63	7.10	9.23	6.55	5.20
3	4.82	5.38	8.56	8.86	6.16	5.38
4	2.49	6.47	7.83	11.41	5.30	7.91
5	4.44	7.42	7.54	7.08	5.96	7.26
6	4.86	6.11	6.41	10.66	6.71	7.58
7	6.09	7.87	5.10	11.12	4.28	5.96
8	4.38	4.97	7.19	8.81	6.22	8.16
9	4.09	4.66	6.42	6.88	6.48	9.15
10	3.41	6.55	5.46	9.28	9.38	6.35

gather: gather many related columns into a key-value pair

This is not especially tidy. We have three columns that represent the same variable on three occasions. It would be cleaner to have a time variable (key) and two variables representing well-being and anxiety.

Objective: Use gather to bring together the wellbeing columns.

df_long <- df %>% gather(key=time, value=wellbeing, well_w1, well_w2, well_w3)
df_long

subid	anx_w1	anx_w2	anx_w3	time	wellbeing
1	9.48	5.37	5.99	well_w1	7.19
2	9.23	6.55	5.20	well_w1	4.82
3	8.86	6.16	5.38	well_w1	4.82
4	11.41	5.30	7.91	well_w1	2.49
5	7.08	5.96	7.26	well_w1	4.44
6	10.66	6.71	7.58	well_w1	4.86
7	11.12	4.28	5.96	well_w1	6.09
8	8.81	6.22	8.16	well_w1	4.38
9	6.88	6.48	9.15	well_w1	4.09
10	9.28	9.38	6.35	well_w1	3.41
1	9.48	5.37	5.99	well_w2	6.30
2	9.23	6.55	5.20	well_w2	7.63
3	8.86	6.16	5.38	well_w2	5.38
4	11.41	5.30	7.91	well_w2	6.47
5	7.08	5.96	7.26	well_w2	7.42
6	10.66	6.71	7.58	well_w2	6.11
7	11.12	4.28	5.96	well_w2	7.87
8	8.81	6.22	8.16	well_w2	4.97
9	6.88	6.48	9.15	well_w2	4.66
10	9.28	9.38	6.35	well_w2	6.55
1	9.48	5.37	5.99	well_w3	6.94
2	9.23	6.55	5.20	well_w3	7.10
3	8.86	6.16	5.38	well_w3	8.56
4	11.41	5.30	7.91	well_w3	7.83
5	7.08	5.96	7.26	well_w3	7.54
6	10.66	6.71	7.58	well_w3	6.41
7	11.12	4.28	5.96	well_w3	5.10
8	8.81	6.22	8.16	well_w3	7.19
9	6.88	6.48	9.15	well_w3	6.42
10	9.28	9.38	6.35	well_w3	5.46

Better, but now our time variable is a mix of variable information and time information. We can retain just the last character as time using mutate from dplyr and parse_number from readr. I’ve given you the syntax below for free – you’ll just need to adapt it to your naming scheme.

df_long <- df_long %>% mutate(time=parse_number(time))
df_long

subid	anx_w1	anx_w2	anx_w3	time	wellbeing
1	9.48	5.37	5.99	1	7.19
2	9.23	6.55	5.20	1	4.82
3	8.86	6.16	5.38	1	4.82
4	11.41	5.30	7.91	1	2.49
5	7.08	5.96	7.26	1	4.44
6	10.66	6.71	7.58	1	4.86
7	11.12	4.28	5.96	1	6.09
8	8.81	6.22	8.16	1	4.38
9	6.88	6.48	9.15	1	4.09
10	9.28	9.38	6.35	1	3.41
1	9.48	5.37	5.99	2	6.30
2	9.23	6.55	5.20	2	7.63
3	8.86	6.16	5.38	2	5.38
4	11.41	5.30	7.91	2	6.47
5	7.08	5.96	7.26	2	7.42
6	10.66	6.71	7.58	2	6.11
7	11.12	4.28	5.96	2	7.87
8	8.81	6.22	8.16	2	4.97
9	6.88	6.48	9.15	2	4.66
10	9.28	9.38	6.35	2	6.55
1	9.48	5.37	5.99	3	6.94
2	9.23	6.55	5.20	3	7.10
3	8.86	6.16	5.38	3	8.56
4	11.41	5.30	7.91	3	7.83
5	7.08	5.96	7.26	3	7.54
6	10.66	6.71	7.58	3	6.41
7	11.12	4.28	5.96	3	5.10
8	8.81	6.22	8.16	3	7.19
9	6.88	6.48	9.15	3	6.42
10	9.28	9.38	6.35	3	5.46

Okay, but now anxiety feels left out… shouldn’t the same approach apply?

When you use the subtraction syntax (subtracting a variable), gather assumes that all variables except subid should be gathered.

Objective: Gather both well-being and anxiety columns into one key-value pair.

The data should look something like this (just the first 6 rows here):

##    subid    time     value
## 1      1 well_w1  7.188648
## 2      2 well_w1  4.822453
## 3      3 well_w1  4.814725
## 4      4 well_w1  2.493464
## 5      5 well_w1  4.442689
## 6      6 well_w1  4.856441

df_long <- df %>% gather(key=time, value=value, -subid)
df_long

subid	time	value
1	well_w1	7.19
2	well_w1	4.82
3	well_w1	4.82
4	well_w1	2.49
5	well_w1	4.44
6	well_w1	4.86
7	well_w1	6.09
8	well_w1	4.38
9	well_w1	4.09
10	well_w1	3.41
1	well_w2	6.30
2	well_w2	7.63
3	well_w2	5.38
4	well_w2	6.47
5	well_w2	7.42
6	well_w2	6.11
7	well_w2	7.87
8	well_w2	4.97
9	well_w2	4.66
10	well_w2	6.55
1	well_w3	6.94
2	well_w3	7.10
3	well_w3	8.56
4	well_w3	7.83
5	well_w3	7.54
6	well_w3	6.41
7	well_w3	5.10
8	well_w3	7.19
9	well_w3	6.42
10	well_w3	5.46
1	anx_w1	9.48
2	anx_w1	9.23
3	anx_w1	8.86
4	anx_w1	11.41
5	anx_w1	7.08
6	anx_w1	10.66
7	anx_w1	11.12
8	anx_w1	8.81
9	anx_w1	6.88
10	anx_w1	9.28
1	anx_w2	5.37
2	anx_w2	6.55
3	anx_w2	6.16
4	anx_w2	5.30
5	anx_w2	5.96
6	anx_w2	6.71
7	anx_w2	4.28
8	anx_w2	6.22
9	anx_w2	6.48
10	anx_w2	9.38
1	anx_w3	5.99
2	anx_w3	5.20
3	anx_w3	5.38
4	anx_w3	7.91
5	anx_w3	7.26
6	anx_w3	7.58
7	anx_w3	5.96
8	anx_w3	8.16
9	anx_w3	9.15
10	anx_w3	6.35

separate: split the values of a variable at a position in the character string

Now, the time variable has both information about the measure (well versus anx) and time (1-3). This is a job for separate!

Objective: Take the wellbeing + anxiety key-value pair and separate the measure (well-being versus anxiety) and the time (1, 2, 3) into separate variables.

df_long <- df_long %>% separate(time, into=c("measure", "time"), sep = "_")
head(df_long)

subid	measure	time	value
1	well	w1	7.19
2	well	w1	4.82
3	well	w1	4.82
4	well	w1	2.49
5	well	w1	4.44
6	well	w1	4.86

nrow(df_long)

## [1] 60

Cool, but we see that time has the ‘w’ prefix and isn’t a number. If your analysis uses a numeric (continuous) time representation (e.g., multilevel models), this won’t work. Let’s parse the number out of it, as above. Again, I’ve given you the syntax, you just need to make sure it matches your naming.

df_long <- df_long %>% mutate(time=parse_number(time))
head(df_long)

subid	measure	time	value
1	well	1	7.19
2	well	1	4.82
3	well	1	4.82
4	well	1	2.49
5	well	1	4.44
6	well	1	4.86

This now qualifies as tidy. But it is not necessarily right for every application. For example, in longitudinal SEM (e.g., latent curve models), time is usually encoded by specific loadings onto intercept and slope factors. This requires a ‘wide’ data format similar to where we started. Let’s use tidyr to demonstrate how to go backwards in our transformation process – long-to-wide.

spread: convert a key-value

We can imagine an intermediate step in which we have the values of each measure as columns, instead of encoding them with respect to both measure and time.

Objective: We want to use spread to achieve a dataset that looks like this:

##    subid time       anx     well
## 1      1    1  9.478812 7.188648
## 2      1    2  5.374468 6.302645
## 3      1    3  5.989688 6.937929
## 4      2    1  9.229342 4.822453
## 5      2    2  6.546099 7.634392
## 6      2    3  5.195278 7.096602
## 7      3    1  8.862074 4.814725
## 8      3    2  6.155699 5.378153
## 9      3    3  5.379621 8.562951

df_intermediate <- df_long %>% spread(key=measure, value=value)
df_intermediate

subid	time	anx	well
1	1	9.48	7.19
1	2	5.37	6.30
1	3	5.99	6.94
2	1	9.23	4.82
2	2	6.55	7.63
2	3	5.20	7.10
3	1	8.86	4.82
3	2	6.16	5.38
3	3	5.38	8.56
4	1	11.41	2.49
4	2	5.30	6.47
4	3	7.91	7.83
5	1	7.08	4.44
5	2	5.96	7.42
5	3	7.26	7.54
6	1	10.66	4.86
6	2	6.71	6.11
6	3	7.58	6.41
7	1	11.12	6.09
7	2	4.28	7.87
7	3	5.96	5.10
8	1	8.81	4.38
8	2	6.22	4.97
8	3	8.16	7.19
9	1	6.88	4.09
9	2	6.48	4.66
9	3	9.15	6.42
10	1	9.28	3.41
10	2	9.38	6.55
10	3	6.35	5.46

df_intermediate %>% nrow() #show number of rows

## [1] 30

unite: paste together values of two variables (usually as string)

This is moving in the right direction, but if we want the column to encode both time and variable, we need to unite the time- and measure-related information. The unite function does exactly this, essentially pasting together the values of multiple columns into a single column.

Objective: Use unite to achieve a dataset that looks like this:

##   subid vartime    value
## 1     1  well_w1 7.188648
## 2     2  well_w1 4.822453
## 3     3  well_w1 4.814725
## 4     4  well_w1 2.493464
## 5     5  well_w1 4.442689
## 6     6  well_w1 4.856441

df_wide <- df_long %>% unite(col="vartime", measure, time, sep="_w")
head(df_wide)

subid	vartime	value
1	well_w1	7.19
2	well_w1	4.82
3	well_w1	4.82
4	well_w1	2.49
5	well_w1	4.44
6	well_w1	4.86

Looks promising. Let’s go back to spread now that we have a key that encodes all variable (column) information.

Objective: Use spread to generate a long-form dataset that matches the original format.

df_wide <- df_wide %>% spread(key=vartime, value=value)
df_wide

subid	anx_w1	anx_w2	anx_w3	well_w1	well_w2	well_w3
1	9.48	5.37	5.99	7.19	6.30	6.94
2	9.23	6.55	5.20	4.82	7.63	7.10
3	8.86	6.16	5.38	4.82	5.38	8.56
4	11.41	5.30	7.91	2.49	6.47	7.83
5	7.08	5.96	7.26	4.44	7.42	7.54
6	10.66	6.71	7.58	4.86	6.11	6.41
7	11.12	4.28	5.96	6.09	7.87	5.10
8	8.81	6.22	8.16	4.38	4.97	7.19
9	6.88	6.48	9.15	4.09	4.66	6.42
10	9.28	9.38	6.35	3.41	6.55	5.46

We’ve now transformed our long-form dataset back into a wide dataset.

Tidyr Wrangling Exercises Key

You!

8/22/2019

tidyr reminders

separate: split the values of a variable at a position in the character string

spread: convert a key-value

unite: paste together values of two variables (usually as string)