Introduction to joining datasets using dplyr

This document is a minor adaptation of Jenny Bryan’s dplyr joins tutorial: http://stat545.com/bit001_dplyr-cheatsheet.html. The goal is to develop an intuition of the four major types of two-table join operations: inner, left, right, and full.

First, we import the data into data.frame objects.

superheroes <-"
name, alignment, gender,         publisher
Magneto,       bad,   male,            Marvel
Storm,      good, female,            Marvel
Mystique,       bad, female,            Marvel
Batman,      good,   male,                DC
Joker,       bad,   male,                DC
Catwoman,       bad, female,                DC
Hellboy,      good,   male, Dark Horse Comics
"
superheroes <- read_csv(superheroes, trim_ws = TRUE, skip = 1)

publishers <- "
publisher, yr_founded
DC,       1934
Marvel,       1939
Image,       1992
"
publishers <- read_csv(publishers, trim_ws = TRUE, skip = 1)

Mutating joins: inner, left, right, outer

“Mutating” joins combine variables from two datasets on the basis of one or more keys that match between datasets.

NB: By default, dplyr will search for common columns across datsets as the matching keys (natural join). If you want to control the process, specify the key using “by”

inner join

Require match in both datasets (non-matching rows are dropped)

ijsp <- inner_join(superheroes, publishers)
## Joining, by = "publisher"
print(ijsp)
## # A tibble: 6 x 5
##       name alignment gender publisher yr_founded
##      <chr>     <chr>  <chr>     <chr>      <int>
## 1  Magneto       bad   male    Marvel       1939
## 2    Storm      good female    Marvel       1939
## 3 Mystique       bad female    Marvel       1939
## 4   Batman      good   male        DC       1934
## 5    Joker       bad   male        DC       1934
## 6 Catwoman       bad female        DC       1934

Same idea, just explicit declaration of key

inner_join(superheroes, publishers, by="publisher")
## # A tibble: 6 x 5
##       name alignment gender publisher yr_founded
##      <chr>     <chr>  <chr>     <chr>      <int>
## 1  Magneto       bad   male    Marvel       1939
## 2    Storm      good female    Marvel       1939
## 3 Mystique       bad female    Marvel       1939
## 4   Batman      good   male        DC       1934
## 5    Joker       bad   male        DC       1934
## 6 Catwoman       bad female        DC       1934

left join

Keep all rows in left-hand ‘x’ dataset (i.e., superheroes). Add columns from publishers where there is a match. Fill in NA for non-matching observations.

ljsp <- left_join(superheroes, publishers)
## Joining, by = "publisher"
print(ljsp)
## # A tibble: 7 x 5
##       name alignment gender         publisher yr_founded
##      <chr>     <chr>  <chr>             <chr>      <int>
## 1  Magneto       bad   male            Marvel       1939
## 2    Storm      good female            Marvel       1939
## 3 Mystique       bad female            Marvel       1939
## 4   Batman      good   male                DC       1934
## 5    Joker       bad   male                DC       1934
## 6 Catwoman       bad female                DC       1934
## 7  Hellboy      good   male Dark Horse Comics         NA

right join

Keep all rows in right-hand ‘y’ dataset (i.e., publishers). Add columns from superheroes where there is a match. Fill in NA for non-matching observations.

superheroes %>% right_join(publishers)
## Joining, by = "publisher"
## # A tibble: 7 x 5
##       name alignment gender publisher yr_founded
##      <chr>     <chr>  <chr>     <chr>      <int>
## 1   Batman      good   male        DC       1934
## 2    Joker       bad   male        DC       1934
## 3 Catwoman       bad female        DC       1934
## 4  Magneto       bad   male    Marvel       1939
## 5    Storm      good female    Marvel       1939
## 6 Mystique       bad female    Marvel       1939
## 7     <NA>      <NA>   <NA>     Image       1992

full join

Keep all rows in left-hand ‘x’ (superheroes) and right-hand ‘y’ (publishers) datasets. Resulting dataset will have all columns of both datasets, but filling in NA for any non-matches on either side.

fjsp <- full_join(superheroes, publishers)
## Joining, by = "publisher"
print(fjsp)
## # A tibble: 8 x 5
##       name alignment gender         publisher yr_founded
##      <chr>     <chr>  <chr>             <chr>      <int>
## 1  Magneto       bad   male            Marvel       1939
## 2    Storm      good female            Marvel       1939
## 3 Mystique       bad female            Marvel       1939
## 4   Batman      good   male                DC       1934
## 5    Joker       bad   male                DC       1934
## 6 Catwoman       bad female                DC       1934
## 7  Hellboy      good   male Dark Horse Comics         NA
## 8     <NA>      <NA>   <NA>             Image       1992

One-to-many join

Note that when there are non-unique matches, the join adds all possible combinations. This occurs in a one-to-many join.

df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))

df1 %>% left_join(df2)
## Joining, by = "x"
## # A tibble: 5 x 3
##       x     y     z
##   <dbl> <int> <chr>
## 1     1     1     a
## 2     1     1     b
## 3     1     2     a
## 4     1     2     b
## 5     2     3     a

Filtering joins: semi_join and anti_join

Filtering joins use specific criteria to identify observations (rows) from one table that exist or don’t exist in another table.

These joins are typically used for diagnosing mismatch between two overlapping datasets.

semi_join

retain observations (rows) in x that match in y

semi_join(x = publishers, y = superheroes)
## Joining, by = "publisher"
## # A tibble: 2 x 2
##   publisher yr_founded
##       <chr>      <int>
## 1    Marvel       1939
## 2        DC       1934

Now, observations in superheroes that match in publishers

semi_join(superheroes, publishers)
## Joining, by = "publisher"
## # A tibble: 6 x 4
##       name alignment gender publisher
##      <chr>     <chr>  <chr>     <chr>
## 1   Batman      good   male        DC
## 2    Joker       bad   male        DC
## 3 Catwoman       bad female        DC
## 4  Magneto       bad   male    Marvel
## 5    Storm      good female    Marvel
## 6 Mystique       bad female    Marvel

anti_join

observations in x that are not matched in y Note that this is similar to setdiff in base R

Observations in superheroes that don’t match in publishers

anti_join(superheroes, publishers)
## Joining, by = "publisher"
## # A tibble: 1 x 4
##      name alignment gender         publisher
##     <chr>     <chr>  <chr>             <chr>
## 1 Hellboy      good   male Dark Horse Comics

Now, observations in publishers that don’t match in superheroes

publishers %>% anti_join(superheroes)
## Joining, by = "publisher"
## # A tibble: 1 x 2
##   publisher yr_founded
##       <chr>      <int>
## 1     Image       1992

Joining multiple datasets

Fancy approach to multiple dataset merge. Joins datasets two at a time from left to right in the list. The result of a two-table join becomes the ‘x’ dataset for the next join of a new dataset ‘y’.

df1 <- data.frame(id=1:10, x=rnorm(10), y=runif(10))
df2 <- data.frame(id=1:11, z=rnorm(11), a=runif(11))
df3 <- data.frame(id=2:10, b=rnorm(9), c=runif(9))

Reduce(function(...) { full_join(...) }, list(df1, df2, df3))
## Joining, by = "id"
## Joining, by = "id"
##    id          x          y           z         a          b         c
## 1   1  0.7010162 0.68818062 -0.82735364 0.8579486         NA        NA
## 2   2 -0.4716938 0.09264434  0.16983211 0.6870368 -0.5014704 0.1445381
## 3   3  0.3615511 0.05852267 -0.98554653 0.7464957 -0.9822050 0.4879376
## 4   4 -1.2185181 0.86834456  0.45097187 0.7460212  0.6448889 0.7811944
## 5   5  0.3267126 0.15820262 -0.08241624 0.1008412 -1.7557947 0.8599357
## 6   6  0.7079773 0.06090102 -1.36810154 0.5401580 -1.5348371 0.1214809
## 7   7  0.3504285 0.21795279  1.26590645 0.1162464  0.2336147 0.1907346
## 8   8 -0.6225727 0.80950215  1.56302943 0.6595951 -1.1160290 0.8013162
## 9   9 -0.3130711 0.01486292  1.44205305 0.2240288 -1.8307086 0.9960257
## 10 10 -0.3283722 0.43208181 -1.61823938 0.7665094  1.1224169 0.2523477
## 11 11         NA         NA  1.15820346 0.8743222         NA        NA

Alternative using pipeline (less extensible)

mergedf = df1 %>% full_join(df2) %>% full_join(df3)
## Joining, by = "id"
## Joining, by = "id"