The tidyr package provides a small number of functions for reshaping data into a tidy format. Tidy data are defined by:
Imagine a dataset where you have ratings of well-being and anxiety measured 4 times in a longitudinal study.
The dataset looks like this:
df <- data.frame(subid=1:10,
well_w1=rnorm(10, 5, 1),
well_w2=rnorm(10, 6, 1),
well_w3=rnorm(10, 7, 1),
anx_w1=rnorm(10, 9, 1),
anx_w2=rnorm(10, 6, 1),
anx_w3=rnorm(10, 7, 1))
df
subid | well_w1 | well_w2 | well_w3 | anx_w1 | anx_w2 | anx_w3 |
---|---|---|---|---|---|---|
1 | 7.19 | 6.30 | 6.94 | 9.48 | 5.37 | 5.99 |
2 | 4.82 | 7.63 | 7.10 | 9.23 | 6.55 | 5.20 |
3 | 4.82 | 5.38 | 8.56 | 8.86 | 6.16 | 5.38 |
4 | 2.49 | 6.47 | 7.83 | 11.41 | 5.30 | 7.91 |
5 | 4.44 | 7.42 | 7.54 | 7.08 | 5.96 | 7.26 |
6 | 4.86 | 6.11 | 6.41 | 10.66 | 6.71 | 7.58 |
7 | 6.09 | 7.87 | 5.10 | 11.12 | 4.28 | 5.96 |
8 | 4.38 | 4.97 | 7.19 | 8.81 | 6.22 | 8.16 |
9 | 4.09 | 4.66 | 6.42 | 6.88 | 6.48 | 9.15 |
10 | 3.41 | 6.55 | 5.46 | 9.28 | 9.38 | 6.35 |
Now, the time variable has both information about the measure (well versus anx) and time (1-3). This is a job for separate!
Objective: Take the wellbeing + anxiety key-value pair and separate
the measure (well-being versus anxiety) and the time (1, 2, 3) into separate variables.
Cool, but we see that time has the ‘w’ prefix and isn’t a number. If your analysis uses a numeric (continuous) time representation (e.g., multilevel models), this won’t work. Let’s parse the number out of it, as above. Again, I’ve given you the syntax, you just need to make sure it matches your naming.
#df_long <- df_long %>% mutate(time=parse_number(time))
#head(df_long)
This now qualifies as tidy. But it is not necessarily right for every application. For example, in longitudinal SEM (e.g., latent curve models), time is usually encoded by specific loadings onto intercept and slope factors. This requires a ‘wide’ data format similar to where we started. Let’s use tidyr to demonstrate how to go backwards in our transformation process – long-to-wide.
We can imagine an intermediate step in which we have the values of each measure as columns, instead of encoding them with respect to both measure and time.
Objective: We want to use spread
to achieve a dataset that looks like this:
## subid time anx well
## 1 1 1 9.478812 7.188648
## 2 1 2 5.374468 6.302645
## 3 1 3 5.989688 6.937929
## 4 2 1 9.229342 4.822453
## 5 2 2 6.546099 7.634392
## 6 2 3 5.195278 7.096602
## 7 3 1 8.862074 4.814725
## 8 3 2 6.155699 5.378153
## 9 3 3 5.379621 8.562951
This is moving in the right direction, but if we want the column to encode both time and variable, we need to unite the time- and measure-related information. The unite function does exactly this, essentially pasting together the values of multiple columns into a single column.
Objective: Use unite
to achieve a dataset that looks like this:
## subid vartime value
## 1 1 well_w1 7.188648
## 2 2 well_w1 4.822453
## 3 3 well_w1 4.814725
## 4 4 well_w1 2.493464
## 5 5 well_w1 4.442689
## 6 6 well_w1 4.856441
Looks promising. Let’s go back to spread
now that we have a key that encodes all variable (column) information.
Objective: Use spread
to generate a long-form dataset that matches the original format.
We’ve now transformed our long-form dataset back into a wide dataset.