Introduction to joining datasets using dplyr

This document is a minor adaptation of Michael Hallquist’s minor adaptation of Jenny Bryan’s dplyr joins tutorial (with a little help from animations by Garrick Aden-Buie): http://stat545.com/bit001_dplyr-cheatsheet.html. The goal is to develop an intuition of the four major types of two-table join operations: inner, left, right, and full. We’ll also get into using joins to identify areas of match or mismatch between two datasets (using semi- and anti-joins).

First, we import the data into data.frame objects. We’re generating the data from scratch using character strings. But the join functions below will work with other data types (e.g., numeric data).

superheroes <-"
name,     alignment, gender, publisher
Magneto,  bad,       male,   Marvel
Storm,    good,      female, Marvel
Mystique, bad,       female, Marvel
Batman,   good,      male,   DC
Joker,    bad,       male,   DC
Catwoman, bad,       female, DC
Hellboy,  good,      male,   Dark Horse Comics
"


superheroes <- read_csv(superheroes, trim_ws = TRUE, skip = 1) #"trim_ws" eliminates extra spaces in dataset; #skip=1 tells R not to read in the first line of data as data (instead leaving it as column names) 

publishers <- "
publisher, yr_founded
DC,        1934
Marvel,    1939
Image,     1992
"
publishers <- read_csv(publishers, trim_ws = TRUE, skip = 1)

superheroes
## # A tibble: 7 x 4
##   name     alignment gender publisher        
##   <chr>    <chr>     <chr>  <chr>            
## 1 Magneto  bad       male   Marvel           
## 2 Storm    good      female Marvel           
## 3 Mystique bad       female Marvel           
## 4 Batman   good      male   DC               
## 5 Joker    bad       male   DC               
## 6 Catwoman bad       female DC               
## 7 Hellboy  good      male   Dark Horse Comics
publishers
## # A tibble: 3 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 DC              1934
## 2 Marvel          1939
## 3 Image           1992

Mutating joins: inner, left, right, full

“Mutating” joins combine variables from two datasets on the basis of one or more keys that match between datasets. In the case of these datasets, notice they share the “publisher” column.

NB: By default, dplyr will search for common columns across datsets as the matching keys (natural join). If you want to control the process, specify the key using “by”.

inner join

Require match in both datasets (non-matching rows are dropped)

For those of you who are visual learners, conceptually, imagine the following two simple datasets:

An inner join combines the two datasets and drops the non-matching rows like so:

Lets try it with our superhero data.

(ijsp <- inner_join(x=superheroes, y=publishers)) #*NB*: the extra parentheses enclosing the line of code tell R to both save the new dataframe as ijsp AND to print the result simultaneously
## Joining, by = "publisher"
## # A tibble: 6 x 5
##   name     alignment gender publisher yr_founded
##   <chr>    <chr>     <chr>  <chr>          <int>
## 1 Magneto  bad       male   Marvel          1939
## 2 Storm    good      female Marvel          1939
## 3 Mystique bad       female Marvel          1939
## 4 Batman   good      male   DC              1934
## 5 Joker    bad       male   DC              1934
## 6 Catwoman bad       female DC              1934

Same idea, just explicit declaration of key (i.e., “publisher”)

inner_join(superheroes, publishers, by="publisher") #note that we've cut the x=, y= as this isn't necessary to specify
## # A tibble: 6 x 5
##   name     alignment gender publisher yr_founded
##   <chr>    <chr>     <chr>  <chr>          <int>
## 1 Magneto  bad       male   Marvel          1939
## 2 Storm    good      female Marvel          1939
## 3 Mystique bad       female Marvel          1939
## 4 Batman   good      male   DC              1934
## 5 Joker    bad       male   DC              1934
## 6 Catwoman bad       female DC              1934

Notice both Hellboy (from the superheroes dataset) and Image comics (from the publishers dataset) were dropped.

left join

Keep all rows in left-hand ‘x’ dataset (i.e., superheroes). Add columns from publishers where there is a match. Fill in NA for non-matching observations.

(ljsp <- left_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 7 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA

right join

Keep all rows in right-hand ‘y’ dataset (i.e., publishers). Add columns from superheroes where there is a match. Fill in NA for non-matching observations.

(superheroes %>% right_join(publishers)->rjsp) #note the shift to using dplyr piping; just another way to do this code
## Joining, by = "publisher"
## # A tibble: 7 x 5
##   name     alignment gender publisher yr_founded
##   <chr>    <chr>     <chr>  <chr>          <int>
## 1 Batman   good      male   DC              1934
## 2 Joker    bad       male   DC              1934
## 3 Catwoman bad       female DC              1934
## 4 Magneto  bad       male   Marvel          1939
## 5 Storm    good      female Marvel          1939
## 6 Mystique bad       female Marvel          1939
## 7 <NA>     <NA>      <NA>   Image           1992

full join

Keep all rows in left-hand ‘x’ (superheroes) and right-hand ‘y’ (publishers) datasets. Resulting dataset will have all columns of both datasets, but filling in NA for any non-matches on either side (denoted as blank spaces below).

(superheroes %>% full_join(publishers)->fjsp)
## Joining, by = "publisher"
## # A tibble: 8 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA
## 8 <NA>     <NA>      <NA>   Image                   1992

One-to-many join

Note that when there are non-unique matches, the join adds all possible combinations. This occurs in a one-to-many join.

Let’s say you wanted to see if students’ final grades were associated with how many extra credit points they recieved on multiple assignments during a course.

(extra_credit <- data_frame(x = c("Jeff", "Karla", "Jasmine", "Karla"), z = c(2, 5, 4, 4)))
## # A tibble: 4 x 2
##   x           z
##   <chr>   <dbl>
## 1 Jeff        2
## 2 Karla       5
## 3 Jasmine     4
## 4 Karla       4
(final_grade <- data_frame(x = c("Jeff", "Karla", "Ping"), y = c(70, 80, 92)))
## # A tibble: 3 x 2
##   x         y
##   <chr> <dbl>
## 1 Jeff     70
## 2 Karla    80
## 3 Ping     92
final_grade %>% left_join(extra_credit)
## Joining, by = "x"
## # A tibble: 4 x 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 Jeff     70     2
## 2 Karla    80     5
## 3 Karla    80     4
## 4 Ping     92    NA

Filtering joins: semi_join and anti_join

Filtering joins use specific criteria to identify observations (rows) from one table that exist or don’t exist in another table.

These joins are typically used for diagnosing mismatch between two overlapping datasets.

semi_join

retain observations (rows) in x that match in y

Notice that this is different from the left_join shown above as the data from y is not kept.

semi_join(superheroes, publishers)
## Joining, by = "publisher"
## # A tibble: 6 x 4
##   name     alignment gender publisher
##   <chr>    <chr>     <chr>  <chr>    
## 1 Magneto  bad       male   Marvel   
## 2 Storm    good      female Marvel   
## 3 Mystique bad       female Marvel   
## 4 Batman   good      male   DC       
## 5 Joker    bad       male   DC       
## 6 Catwoman bad       female DC

Now, observations in publishers that match in superheroes

semi_join(publishers, superheroes)
## Joining, by = "publisher"
## # A tibble: 2 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 DC              1934
## 2 Marvel          1939

This can be useful if you have a dataset of your data of interest and another dataset that indicates which of your participants/observations you want to remove or filter out.

anti_join

observations in x that are not matched in y Note that this is similar to setdiff in base R

Observations in superheroes that don’t match in publishers

anti_join(superheroes, publishers)
## Joining, by = "publisher"
## # A tibble: 1 x 4
##   name    alignment gender publisher        
##   <chr>   <chr>     <chr>  <chr>            
## 1 Hellboy good      male   Dark Horse Comics

Now, observations in publishers that don’t match in superheroes

publishers %>% anti_join(superheroes)
## Joining, by = "publisher"
## # A tibble: 1 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 Image           1992

This can be useful if you are trying to identify extra participants/observations that may have snuck into one dataset (x) or been deleted in another (y).

Joining multiple datasets

Joining can be done repeatedly across multiple datasets. The following code, for instance, joins datasets two at a time from left to right in the list. The result of a two-table join becomes the ‘x’ dataset for the next join of a new dataset ‘y’.

df1 <- data.frame(id=1:10, x=rnorm(10), y=runif(10))
df2 <- data.frame(id=1:11, z=rnorm(11), a=runif(11))
df3 <- data.frame(id=2:10, b=rnorm(9), c=runif(9))

dftemp<-full_join(df1,df2)
## Joining, by = "id"
dffinal<-full_join(dftemp,df3)
## Joining, by = "id"
#alternative way to combine:
Reduce(function(...) { full_join(...) }, list(df1, df2, df3))
## Joining, by = "id"
## Joining, by = "id"
##    id           x         y           z         a          b         c
## 1   1 -0.09569160 0.3693569 -0.01760451 0.9243227         NA        NA
## 2   2 -0.14770983 0.6340806 -1.61827355 0.2826797  1.0949237 0.5117386
## 3   3 -0.65144863 0.9072347  0.23176152 0.6260628 -0.1040951 0.9134037
## 4   4  0.07920755 0.3096684  0.80828051 0.7909440 -0.7617629 0.7029727
## 5   5  1.74176576 0.7919466  0.13911298 0.5984088  0.1863756 0.9299028
## 6   6 -0.01983633 0.3393136 -0.99427005 0.2287662 -1.0331383 0.8669464
## 7   7  1.33266902 0.8971444  1.13504122 0.2095490  0.8389904 0.1482519
## 8   8 -1.07062129 0.2526074 -0.67868542 0.6491620  0.7316112 0.5641820
## 9   9  0.62097325 0.9162661  0.54734353 0.0572775  2.6786502 0.3328617
## 10 10  0.51406927 0.1650579 -0.55699672 0.4676513 -1.5073461 0.6264076
## 11 11          NA        NA -0.12816097 0.7760353         NA        NA

Alternative using pipeline (less extensible)

(mergedf = df1 %>% full_join(df2) %>% full_join(df3))
## Joining, by = "id"
## Joining, by = "id"
##    id           x         y           z         a          b         c
## 1   1 -0.09569160 0.3693569 -0.01760451 0.9243227         NA        NA
## 2   2 -0.14770983 0.6340806 -1.61827355 0.2826797  1.0949237 0.5117386
## 3   3 -0.65144863 0.9072347  0.23176152 0.6260628 -0.1040951 0.9134037
## 4   4  0.07920755 0.3096684  0.80828051 0.7909440 -0.7617629 0.7029727
## 5   5  1.74176576 0.7919466  0.13911298 0.5984088  0.1863756 0.9299028
## 6   6 -0.01983633 0.3393136 -0.99427005 0.2287662 -1.0331383 0.8669464
## 7   7  1.33266902 0.8971444  1.13504122 0.2095490  0.8389904 0.1482519
## 8   8 -1.07062129 0.2526074 -0.67868542 0.6491620  0.7316112 0.5641820
## 9   9  0.62097325 0.9162661  0.54734353 0.0572775  2.6786502 0.3328617
## 10 10  0.51406927 0.1650579 -0.55699672 0.4676513 -1.5073461 0.6264076
## 11 11          NA        NA -0.12816097 0.7760353         NA        NA