Most data wrangling can be accomplished using data.frame
object (or tbl
objects in dplyr
). These objects consist of rows and columns, forming a rectangular structure.
gapminder %>% kable_table(n=6)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
A variable contains all values measuring an attribute (e.g., neuroticism) across units (e.g., people).
Columns in data.frames are typically labeled and represent variables.
Moreover, in a data.frame, all values in a given column should have the same data type, such as character strings. But a data.frame
is different from a matrix
object because columns can differ in terms of data type. For example, in the gapminder
, the country
column is a factor, whereas lifeExp
is a numeric column.
“A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector.” Citation
An observation contains all values measured on the same unit across attributes.
Rows in a data.frame typically represent observations.
Datasets contain values, which can be numbers, strings, factors, and so on.
Each value belongs to a single observation and a single variable – that is, each value is a datum.
To communicate effectively about data structure, manipulation, and processing, we need a common set of data wrangling verbs.
The data import and transformation cheatsheets (https://www.rstudio.com/resources/cheatsheets/) provide a succinct overview of the tools we will use in bootcamp.
Base R provides a number of other tools for data management, but does not have a unifying conceptual framework.
For clarity in your training, we’ll stick closely to functions in the tidyr and dplyr packages developed by Hadley Wickham and others. See https://www.tidyverse.org.
In general, we will try to wrangle each dataset in R into what we call tidy form. The idea of ‘tidying’ is to modify the structure of a dataset to facilitate analysis. Furthermore, the term ‘tidy’ is intended to refer to a specific and shared framework, not a subjective view of what each user thinks is tidy.
Specifically, we can judge a dataset as tidy or messy according how rows, columns, and tables are matched up with observations, variables, and types.
Building on Wickham (2014), tidy data have the following three properties:
Furthermore, variables that are part of the design (participant number, experimental condition, county ID, etc.), or that may be key categorical moderators, should typically be placed first (columns to the left), and measured variables thereafter (columns to the right).
This sounds easy, right?! But in truth, we have all seen data that look more like this:
0.6226 WARNING Movie2 stim could not be imported and won't be available
17.6226 DATA version J
102.5349 DATA Keypress: space
110.4876 DATA Keypress: v
110.6203 DATA Keypress: v
111.0859 DATA Keypress: v
112.6459 DATA Keypress: t
116.4982 DATA Keypress: v
118.5406 DATA Keypress: v
For a more detailed treatment of tidy data, see: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
The tidyr
package provides four core functions to aid in converting messy data into tidy form. We may also need functions from dplyr
at times. Each of these verbs is also a function that does transforms the dataset – with the goal of making it more tidy.
Note: gather
and spread
are complements. And separate
and unite
are complements.
Now, let’s look at a series of datasets (from Wickham 2014) and consider how tidy or messy they are.
Here is our first mess. Notice that the column headers are values, not variable names. This is untidy and hard to look at. We effectively have the data in a cross-tabulated format, but religion
and income
are not variables in the dataset.
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k |
---|---|---|---|---|---|---|
Agnostic | 27 | 34 | 60 | 81 | 76 | 137 |
Atheist | 12 | 27 | 37 | 52 | 35 | 70 |
Buddhist | 27 | 21 | 30 | 34 | 33 | 58 |
Catholic | 418 | 617 | 732 | 670 | 638 | 1116 |
Don’t know/refused | 15 | 14 | 15 | 11 | 10 | 35 |
Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 |
Hindu | 1 | 9 | 7 | 9 | 11 | 34 |
Historically Black Prot | 228 | 244 | 236 | 238 | 197 | 223 |
Jehovah’s Witness | 20 | 27 | 24 | 24 | 21 | 30 |
Jewish | 19 | 19 | 25 | 25 | 30 | 95 |
In the tidy version, religion and income become variables, and the number of observations in each religion x income combination is a frequency column. This is now tidy insofar as each value in the frequency column represents a unique combination of the religion and income factors, which are coded as variables.
religion | income | freq |
---|---|---|
Agnostic | <$10k | 27 |
Agnostic | >150k | 84 |
Agnostic | $10-20k | 34 |
Agnostic | $100-150k | 109 |
Agnostic | $20-30k | 60 |
Agnostic | $30-40k | 81 |
Agnostic | $40-50k | 76 |
Agnostic | $50-75k | 137 |
Agnostic | $75-100k | 122 |
Agnostic | Don’t know/refused | 96 |
To achieve the above transformation, we want to gather the many columns of income into a single income column.
tidy1 <- mess1 %>% gather(key=income, value=freq, -religion)
Here, we tell tidyr
that we wish to create a lookup (‘key’) column called income
whose correponding values will be called freq
(here, representing the frequency of this religion x income combination). Furthermore, as additional arguments to gather
, we provide the columns that should be combined, representing levels of the key variable. By specifying -religion
, we are saying ‘all columns except religion.’ The alternative would be to provide a comma-separate list of columns like this mess1 %>% gather(key=income, value=freq, `<$10k`, `$10-20k`, etc.)
In our second mess, we have a weather dataset from the Global Historical Climatology Network for one weather station MX17004 in Mexico. The data represent minimum and maximum temperatures measured across 31 days for five months. The days within each month are on the columns, the months are encoded as a variable month
, and the min and max temperatures are separated by row, as identified by the element
variable.
#show a subset of column that fit on the page
mess2 %>% dplyr::select(id:d13) %>% kable_table(n=8)
id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10 | d11 | d12 | d13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MX17004 | 2010 | 1 | tmax | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
MX17004 | 2010 | 1 | tmin | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
MX17004 | 2010 | 2 | tmax | NA | 27.3 | 24.1 | NA | NA | NA | NA | NA | NA | NA | 29.7 | NA | NA |
MX17004 | 2010 | 2 | tmin | NA | 14.4 | 14.4 | NA | NA | NA | NA | NA | NA | NA | 13.4 | NA | NA |
MX17004 | 2010 | 3 | tmax | NA | NA | NA | NA | 32.1 | NA | NA | NA | NA | 34.5 | NA | NA | NA |
MX17004 | 2010 | 3 | tmin | NA | NA | NA | NA | 14.2 | NA | NA | NA | NA | 16.8 | NA | NA | NA |
MX17004 | 2010 | 4 | tmax | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
MX17004 | 2010 | 4 | tmin | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
id | year | month | day | tmax | tmin |
---|---|---|---|---|---|
MX17004 | 2010 | 1 | 30 | 27.8 | 14.5 |
MX17004 | 2010 | 2 | 2 | 27.3 | 14.4 |
MX17004 | 2010 | 2 | 3 | 24.1 | 14.4 |
MX17004 | 2010 | 2 | 11 | 29.7 | 13.4 |
MX17004 | 2010 | 2 | 23 | 29.9 | 10.7 |
MX17004 | 2010 | 3 | 5 | 32.1 | 14.2 |
MX17004 | 2010 | 3 | 10 | 34.5 | 16.8 |
MX17004 | 2010 | 3 | 16 | 31.1 | 17.6 |
To clean this up, we need to bring all of the day columns together using gather
so that we can encode day
as a variable and temperature
as a variable. We also may want to have max and min temperature as separate columns (i.e., variables), rather than keeping that as a key-value pair. That is, the tmin
and tmax
values denote the type of observation, which would usually be represented as separate variables in tidy format.
Here is the basic approach:
#use num_range() to select variables called d1--d31
tidy2 <- mess2 %>% gather(key=day, value=temperature, num_range("d", 1:31)) %>%
mutate(day=parse_number(day)) %>% #remove 'd'
spread(key=element, value=temperature) %>%
na.omit()
Notice how spread takes a key – element
– that has values 'tmin'
or 'tmax'
and puts the values of these rows onto columns. This is a kind of ‘long-to-wide’ conversion and we would expect here for the number of rows in the dataset drop two-fold with the spread
compare to the preceding step where we’ve gathered the day columns.
In our third mess, we have multiple variables stored in one column. More specifically, in these data, the ‘m014’ etc. columns represent a combination of sex (m/f) and age range (e.g., 014 is 0–14). The country and year columns are ‘tidy’ because they represent variables, but the sex + age columns are not.
#use select to select a few columns that can fit on the page
mess3 %>% dplyr::select(country:f1524) %>% kable_table(n=5)
country | year | m014 | m1524 | m2534 | m3544 | m4554 | m5564 | m65 | mu | f014 | f1524 |
---|---|---|---|---|---|---|---|---|---|---|---|
AD | 2000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | NA | NA |
AE | 2000 | 2 | 4 | 4 | 6 | 5 | 12 | 10 | NA | 3 | 16 |
AF | 2000 | 52 | 228 | 183 | 149 | 129 | 94 | 80 | NA | 93 | 414 |
AG | 2000 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NA | 1 | 1 |
AL | 2000 | 2 | 19 | 21 | 14 | 24 | 19 | 16 | NA | 3 | 11 |
country | year | sex | age_range | freq |
---|---|---|---|---|
AD | 2000 | m | 0-14 | 0 |
AE | 2000 | m | 0-14 | 2 |
AF | 2000 | m | 0-14 | 52 |
AG | 2000 | m | 0-14 | 0 |
AL | 2000 | m | 0-14 | 2 |
AM | 2000 | m | 0-14 | 2 |
AN | 2000 | m | 0-14 | 0 |
AO | 2000 | m | 0-14 | 186 |
We essentially need to parse apart the ‘m’ from the ‘014’ components of each value, which is a job for separate
. Note that we also need to gather the wacky sex + age columns first to make this easier.
tidy3 <- mess3 %>% gather(key=sex_age, value=freq, -country, -year) %>%
separate(sex_age, into=c("sex", "age_range"), sep=1)
Here, we gather all columns except country
and year
into a single key-value pair. This is an intermediate stage of the dataset that is semi-tidy. We then separate the sex and age components of the values into separate variables, resulting in a tidy dataset.
country | year | sex | age_range | freq |
---|---|---|---|---|
AD | 2000 | m | 014 | 0 |
AE | 2000 | m | 014 | 2 |
AF | 2000 | m | 014 | 52 |
AG | 2000 | m | 014 | 0 |
AL | 2000 | m | 014 | 2 |
AM | 2000 | m | 014 | 2 |
AN | 2000 | m | 014 | 0 |
AO | 2000 | m | 014 | 186 |
This is pretty close. The age_range
variable is still a little clunky because it isn’t easy to read. We could modify this further using mutate
and recode
from dplyr
, but that’s not the immediate emphasis here.
tidy3 <- tidy3 %>% mutate(age_range=recode(age_range,
"014"="0-14",
"1524"="15-24",
"2534"="25-34",
"3544"="35-44",
"4554"="45-54",
"5564"="55-64",
"65"="65+",
"u"="unknown", .default=NA_character_
))
Although the least common of the tidying verbs (in my experience), unite
is the complement to separate and can be used to bring together multiple variables that we wish to store as a single variable. For example, we may have first name and last name stored in separate variables, but wish to put them together for display or exporting purposes. Sometimes, we also use unite
as an intermediate stage in tidying, bringing together variables, reshaping the data, then re-separating them.
first_name | last_name | age | favorite_color |
---|---|---|---|
Graham | Doe | 11 | Purple |
Kieran | Helali | 9 | Blue |
Charlotte | Stafford | 11 | Pink |
If we wanted to have a full_name
, we could use unite
to combine first_name
and last_name
and then get rid of those individual columns.
df4_united <- df4 %>% unite(col = "full_name", first_name, last_name, sep=" ")
full_name | age | favorite_color |
---|---|---|
Graham Doe | 11 | Purple |
Kieran Helali | 9 | Blue |
Charlotte Stafford | 11 | Pink |
If you find yourself needing more advanced reshaping powers, look at the reshape2
package, a predecessor of tidyr
. Even though tidyr
is more recent, it is also more simplified and does not offer robust facilities for reshaping lists
and arrays
. Moreover, for data.frame
objects, the dcast
function from reshape2
offers a flexible syntax for specifying how multi-dimensional data should be reshaped into a 2-D data.frame. The dcast
approach has been extended by the data.table
package, which provides a fast, flexible, and memory-efficient approach to basic data reshaping (though there is a learning curve). Here are a couple of resources:
Reshape2 tutorial: http://seananderson.ca/2013/10/19/reshape.html
Further extensions using data.table package: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html