This document is a minor adaptation of Jenny Bryan’s dplyr joins tutorial: http://stat545.com/bit001_dplyr-cheatsheet.html. The goal is to develop an intuition of the four major types of two-table join operations: inner, left, right, and full.
First, we import the data into data.frame objects.
superheroes <-"
name, alignment, gender, publisher
Magneto, bad, male, Marvel
Storm, good, female, Marvel
Mystique, bad, female, Marvel
Batman, good, male, DC
Joker, bad, male, DC
Catwoman, bad, female, DC
Hellboy, good, male, Dark Horse Comics
"
superheroes <- read_csv(superheroes, trim_ws = TRUE, skip = 1)
publishers <- "
publisher, yr_founded
DC, 1934
Marvel, 1939
Image, 1992
"
publishers <- read_csv(publishers, trim_ws = TRUE, skip = 1)
“Mutating” joins combine variables from two datasets on the basis of one or more keys that match between datasets.
NB: By default, dplyr will search for common columns across datsets as the matching keys (natural join). If you want to control the process, specify the key using “by”
Require match in both datasets (non-matching rows are dropped)
ijsp <- inner_join(superheroes, publishers)
## Joining, by = "publisher"
print(ijsp)
## # A tibble: 6 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
Same idea, just explicit declaration of key
inner_join(superheroes, publishers, by="publisher")
## # A tibble: 6 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
Keep all rows in left-hand ‘x’ dataset (i.e., superheroes). Add columns from publishers where there is a match. Fill in NA for non-matching observations.
ljsp <- left_join(superheroes, publishers)
## Joining, by = "publisher"
print(ljsp)
## # A tibble: 7 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
Keep all rows in right-hand ‘y’ dataset (i.e., publishers). Add columns from superheroes where there is a match. Fill in NA for non-matching observations.
superheroes %>% right_join(publishers)
## Joining, by = "publisher"
## # A tibble: 7 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Batman good male DC 1934
## 2 Joker bad male DC 1934
## 3 Catwoman bad female DC 1934
## 4 Magneto bad male Marvel 1939
## 5 Storm good female Marvel 1939
## 6 Mystique bad female Marvel 1939
## 7 <NA> <NA> <NA> Image 1992
Keep all rows in left-hand ‘x’ (superheroes) and right-hand ‘y’ (publishers) datasets. Resulting dataset will have all columns of both datasets, but filling in NA for any non-matches on either side.
fjsp <- full_join(superheroes, publishers)
## Joining, by = "publisher"
print(fjsp)
## # A tibble: 8 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
## 8 <NA> <NA> <NA> Image 1992
Note that when there are non-unique matches, the join adds all possible combinations. This occurs in a one-to-many join.
df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))
df1 %>% left_join(df2)
## Joining, by = "x"
## # A tibble: 5 x 3
## x y z
## <dbl> <int> <chr>
## 1 1 1 a
## 2 1 1 b
## 3 1 2 a
## 4 1 2 b
## 5 2 3 a
Filtering joins use specific criteria to identify observations (rows) from one table that exist or don’t exist in another table.
These joins are typically used for diagnosing mismatch between two overlapping datasets.
retain observations (rows) in x that match in y
semi_join(x = publishers, y = superheroes)
## Joining, by = "publisher"
## # A tibble: 2 x 2
## publisher yr_founded
## <chr> <int>
## 1 Marvel 1939
## 2 DC 1934
Now, observations in superheroes that match in publishers
semi_join(superheroes, publishers)
## Joining, by = "publisher"
## # A tibble: 6 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Batman good male DC
## 2 Joker bad male DC
## 3 Catwoman bad female DC
## 4 Magneto bad male Marvel
## 5 Storm good female Marvel
## 6 Mystique bad female Marvel
observations in x that are not matched in y Note that this is similar to setdiff in base R
Observations in superheroes that don’t match in publishers
anti_join(superheroes, publishers)
## Joining, by = "publisher"
## # A tibble: 1 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Hellboy good male Dark Horse Comics
Now, observations in publishers that don’t match in superheroes
publishers %>% anti_join(superheroes)
## Joining, by = "publisher"
## # A tibble: 1 x 2
## publisher yr_founded
## <chr> <int>
## 1 Image 1992
Fancy approach to multiple dataset merge. Joins datasets two at a time from left to right in the list. The result of a two-table join becomes the ‘x’ dataset for the next join of a new dataset ‘y’.
df1 <- data.frame(id=1:10, x=rnorm(10), y=runif(10))
df2 <- data.frame(id=1:11, z=rnorm(11), a=runif(11))
df3 <- data.frame(id=2:10, b=rnorm(9), c=runif(9))
Reduce(function(...) { full_join(...) }, list(df1, df2, df3))
## Joining, by = "id"
## Joining, by = "id"
## id x y z a b c
## 1 1 0.7010162 0.68818062 -0.82735364 0.8579486 NA NA
## 2 2 -0.4716938 0.09264434 0.16983211 0.6870368 -0.5014704 0.1445381
## 3 3 0.3615511 0.05852267 -0.98554653 0.7464957 -0.9822050 0.4879376
## 4 4 -1.2185181 0.86834456 0.45097187 0.7460212 0.6448889 0.7811944
## 5 5 0.3267126 0.15820262 -0.08241624 0.1008412 -1.7557947 0.8599357
## 6 6 0.7079773 0.06090102 -1.36810154 0.5401580 -1.5348371 0.1214809
## 7 7 0.3504285 0.21795279 1.26590645 0.1162464 0.2336147 0.1907346
## 8 8 -0.6225727 0.80950215 1.56302943 0.6595951 -1.1160290 0.8013162
## 9 9 -0.3130711 0.01486292 1.44205305 0.2240288 -1.8307086 0.9960257
## 10 10 -0.3283722 0.43208181 -1.61823938 0.7665094 1.1224169 0.2523477
## 11 11 NA NA 1.15820346 0.8743222 NA NA
Alternative using pipeline (less extensible)
mergedf = df1 %>% full_join(df2) %>% full_join(df3)
## Joining, by = "id"
## Joining, by = "id"