These exercises are slightly adapted from Jenny Bryan’s wonderful dplyr overview here: https://stat545.com/block009_dplyr-intro.html. Thus, the answers can be found on those pages, but I’ve removed the solutions here so you can work through the exercises directly.

Load dplyr and gapminder

Here I load dplyr directly; you could instead load the tidyverse, which includes dplyr among other packages we use incidentally below. Also load gapminder.

library(dplyr)
library(gapminder)

Say hello to the Gapminder tibble

The gapminder data frame is a special kind of data frame: a tibble.

gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

Its tibble-ness is why we get nice compact printing. For a reminder of the problems with base data frame printing, go type iris in the R Console or, better yet, print a data frame to screen that has lots of columns.

Note how gapminder’s class() includes tbl_df; the “tibble” terminology is a nod to this.

class(gapminder)
## [1] "tbl_df"     "tbl"        "data.frame"

There will be some functions, like print(), that know about tibbles and do something special. There will be others, like summary(), that do not; in that case the regular data frame treatment happens, because every tibble is also a regular data frame.

To turn any data frame into a tibble use as_tibble():

as_tibble(iris)
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

Think before you create excerpts of your data …

If you feel the urge to store a little snippet of your data:

(canada <- gapminder[241:252, ])
## # A tibble: 12 x 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## 11 Canada  Americas   2002    79.8 31902268    33329.
## 12 Canada  Americas   2007    80.7 33390141    36319.

Stop and ask yourself …

Do I want to create mini datasets for each level of some factor (or unique combination of several factors) … in order to compute or graph something?

If YES, use proper data aggregation techniques or faceting in ggplot2; don’t subset the data. Or, more realistically, only subset the data as a temporary measure while you develop your elegant code for computing on or visualizing these data subsets.

If NO, then maybe you really do need to store a copy of a subset of the data. But seriously consider whether you can achieve your goals by simply using the subset = argument of, e.g., the lm() function, to limit computation to your excerpt of choice. Lots of functions offer a subset = argument!
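
For instance, here is a minimal sketch of the subset = idea with lm(); the model formula and the object name fit_canada are just illustrations, not part of the exercises:

# fit a model using only the Canadian rows, without storing a separate excerpt
fit_canada <- lm(lifeExp ~ year, data = gapminder, subset = country == "Canada")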

Copies and excerpts of your data clutter your workspace, invite mistakes, and sow general confusion. Avoid whenever possible.

Reality can also lie somewhere in between. You will find the workflows presented below can help you accomplish your goals with minimal creation of temporary, intermediate objects.

Use filter() to subset data row-wise.

filter() takes logical expressions and returns the rows for which all are TRUE.
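
For example (a generic illustration with arbitrary values, not one of the exercise answers below):

filter(gapminder, continent == "Oceania", year == 2002)   # comma-separated conditions are combined with AND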

Objectives:

  1. Find observations where life expectancy is less than 29
  2. Find observations from Rwanda after the year 1979
  3. Find observations from both Rwanda and Afghanistan
#answers here

NB. Under no circumstances should you subset your data the way I did at first:

excerpt <- gapminder[241:252, ]

Why is this a terrible idea?

  • It is not self-documenting. What is so special about rows 241 through 252?
  • It is fragile. This line of code will produce different results if someone changes the row order of gapminder, e.g. sorts the data earlier in the script.

Meet the new pipe operator

Before we go any further, we should exploit the new pipe operator that the tidyverse imports from the magrittr package by Stefan Bache. This is going to change your data analytical life. You no longer need to enact multi-operation commands by nesting them inside each other, like so many Russian nesting dolls. This new syntax leads to code that is much easier to write and to read.

Here’s what it looks like: %>%. The RStudio keyboard shortcut: Ctrl + Shift + M (Windows), Cmd + Shift + M (Mac).

Let’s demo, then I’ll explain. Print the first 6 rows of the gapminder data by piping it (%>%) into head().

#answer here

This is equivalent to head(gapminder). The pipe operator takes the thing on the left-hand-side and pipes it into the function call on the right-hand-side – literally, drops it in as the first argument.
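
In general, x %>% f(y) is another way of writing f(x, y). A toy illustration (the numbers are arbitrary):

c(4, 9, 16) %>% sqrt()   # same as sqrt(c(4, 9, 16))
## [1] 2 3 4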

Never fear, you can still specify other arguments to this function! To see the first 3 rows of Gapminder, how would you pass the n = 3 argument using pipe syntax?

#answer here

I’ve advised you to think “gets” whenever you see the assignment operator, <-. Similarly, you should think “then” whenever you see the pipe operator, %>%.

You are probably not impressed yet, but the magic will soon happen.

Use select() to subset the data on variables or columns.

Use select() to subset the data on variables. Specifically, select the year and lifeExp columns.

#answer here

Revel in the convenience of dplyr

Objective: get the year and lifeExp variables for Cambodia

#answer here

Use mutate() to add new variables

Imagine we wanted to recover each country’s GDP. After all, the Gapminder data has a variable for population (pop) and GDP per capita (gdpPercap). Let’s multiply them together.

mutate() is a function that defines and inserts new variables into a tibble. You can refer to existing variables by name.
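
For example, a minimal sketch with a made-up variable name (pop_millions), unrelated to the GDP exercise:

gapminder %>%
  mutate(pop_millions = pop / 1e6)   # new column computed from an existing one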

#answer here

Hmmmm … those GDP numbers are almost uselessly large and abstract. Consider the advice of Randall Munroe of xkcd:

“One thing that bothers me is large numbers presented without context… ‘If I added a zero to this number, would the sentence containing it mean something different to me?’ If the answer is ‘no,’ maybe the number has no business being in the sentence in the first place.”

Maybe it would be more meaningful to consumers of my tables and figures to stick with GDP per capita. But what if I reported GDP per capita relative to some benchmark country? Since Canada is my adopted home, I’ll go with that.

I need to create a new variable that is gdpPercap divided by Canadian gdpPercap, taking care that I always divide two numbers that pertain to the same year.

Objectives:

  • Filter down to the rows for Canada.
  • Create a new temporary variable in a copy of gapminder called my_gap:
    • Extract the gdpPercap variable from the Canadian data.
    • Replicate it once per country in the dataset, so it has the right length.
  • Divide raw gdpPercap by this Canadian figure.
  • Discard the temporary variable of replicated Canadian gdpPercap.
#answer here

Note that mutate() builds new variables sequentially, so you can reference earlier ones (like tmp) when defining later ones (like gdpPercapRel). Also, you can get rid of a variable by setting it to NULL.
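
Here is a small sketch of that pattern with hypothetical variable names (tmp, lifeExp_frac), separate from the exercise above:

gapminder %>%
  mutate(tmp = lifeExp / 100,            # temporary helper variable
         lifeExp_frac = round(tmp, 2),   # a later variable can refer to tmp
         tmp = NULL)                     # drop the helper by setting it to NULL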

How could we sanity check that this worked? The Canadian values for gdpPercapRel better all be 1!

# answer here

Use arrange() to row-order data in a principled way

arrange() reorders the rows in a data frame.
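
For example (a generic illustration, not the exercise below):

gapminder %>%
  arrange(pop)   # rows ordered from smallest to largest population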

Objective: Order the data by year then country, as opposed to by country then year.

#answer here

Or maybe you want just the data from 2007, sorted on life expectancy?

#answer here

Oh, you’d like to sort on life expectancy in descending order? Then use desc().

#answer here

I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.

Use rename() to rename variables

When I first cleaned this Gapminder excerpt, I was a camelCase person, but now I’m all about snake_case. So I am vexed by the variable names I chose when I cleaned this data years ago. Let’s rename some variables!

Objectives

  1. Rename gdpPercap to gdp_percap
  2. Rename gdpPercapRel to gdp_percap_rel
#answer here

select() can rename and reposition variables

You’ve seen simple use of select(). There are two tricks you might enjoy:

  1. select() can rename the variables you request to keep.
  2. select() can be used with everything() to hoist a variable up to the front of the tibble.
gapminder %>%
  filter(country == "Burundi", year > 1996) %>% 
  select(yr = year, lifeExp, gdpPercap) %>% 
  select(gdpPercap, everything())
## # A tibble: 3 x 3
##   gdpPercap    yr lifeExp
##       <dbl> <int>   <dbl>
## 1      463.  1997    45.3
## 2      446.  2002    47.4
## 3      430.  2007    49.6

everything() is one of several helpers for variable selection. Read its help to see the rest.
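
Here are a couple of sketches using two other helpers, starts_with() and contains(); consult the select() help for the full list:

gapminder %>% select(contains("Exp"))                    # keeps lifeExp
gapminder %>% select(starts_with("co"), everything())    # country and continent first, then the rest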

group_by() is a mighty weapon

I have found that friends, family, and collaborators love to ask seemingly innocuous questions like, “which country experienced the sharpest 5-year drop in life expectancy?” In fact, that is a totally natural question to ask. But if you are using a language that doesn’t know about data, it’s an incredibly annoying question to answer.

dplyr offers powerful tools to solve this class of problem.

  • group_by() adds extra structure to your dataset – grouping information – which lays the groundwork for computations within the groups.
  • summarize() takes a dataset with \(n\) observations, computes requested summaries, and returns a dataset with 1 observation.
  • Window functions take a dataset with \(n\) observations and return a dataset with \(n\) observations.
  • mutate() and summarize() will honor groups.
  • You can also do very general computations on your groups with do(), though elsewhere in this course, I advocate for other approaches that I find more intuitive, using the purrr package.

Combined with the verbs you already know, these new tools allow you to solve an extremely diverse set of problems with relative ease.
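
Before we start counting, here is a minimal sketch of the group_by() + summarize() pattern, grouping by year just for illustration (med_pop is a made-up name):

gapminder %>%
  group_by(year) %>%
  summarize(med_pop = median(pop))   # one row per year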

Counting things up

Let’s start with simple counting. How many observations do we have per continent?

#answer here

Let us pause here to think about the tidyverse. You could get these same frequencies using table() from base R.

table(gapminder$continent)
## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24
str(table(gapminder$continent))
##  'table' int [1:5(1d)] 624 300 396 360 24
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:5] "Africa" "Americas" "Asia" "Europe" ...

But the object of class table that is returned makes downstream computation a bit fiddlier than you’d like. For example, it’s too bad the continent levels come back only as names and not as a proper factor, with the original set of levels. This is an example of how the tidyverse smooths transitions where you want the output of step i to become the input of step i + 1.

The tally() function is a convenience function that knows to count rows. It honors groups.

Objective: Use tally() to obtain counts of observations by continent.

#answer here

The count() function is even more convenient: it does the grouping and the counting in one step.

Objective: Use count() to obtain the same result (counts by continent).

#answer here

What if we wanted to add the number of unique countries for each continent? You can compute multiple summaries inside summarize(). Use the n_distinct() function to count the number of distinct countries within each continent.
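
As a quick illustration of n_distinct() on its own (a toy vector, unrelated to the data):

n_distinct(c("a", "a", "b"))
## [1] 2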

#answer here

General summarization

The functions you’ll apply within summarize() include classical statistical summaries, like mean(), median(), var(), sd(), mad(), IQR(), min(), and max(). Remember they are functions that take \(n\) inputs and distill them down into 1 output.

Although this may be statistically ill-advised, let’s compute the average life expectancy by continent.

#answer here

summarize_at() applies the same summary function(s) to multiple variables. Let’s compute average and median life expectancy and GDP per capita by continent by year … but only for 1952 and 2007.

#answer here

Let’s focus just on Asia. What are the minimum and maximum life expectancies seen by year?

#answer here

Of course it would be much more interesting to see which country contributed these extreme observations. Is the minimum (maximum) always coming from the same country? We tackle that with window functions shortly.

Grouped mutate

Sometimes you don’t want to collapse the \(n\) rows for each group into one row. You want to keep your groups, but compute within them.

Computing with group-wise summaries

Let’s make a new variable that is the years of life expectancy gained (lost) relative to 1952, for each individual country. We group by country and use mutate() to make a new variable. The first() function extracts the first value from a vector. Notice that first() is operating on the vector of life expectancies within each country group.

Note: From this point forward, I give you Jenny’s answers verbatim since most of what follows would be considered intermediate dplyr skills, and this is an introductory lesson.

gapminder %>% 
  group_by(country) %>% 
  select(country, year, lifeExp) %>% 
  mutate(lifeExp_gain = lifeExp - first(lifeExp)) %>% 
  filter(year < 1963)
## # A tibble: 426 x 4
## # Groups:   country [142]
##    country      year lifeExp lifeExp_gain
##    <fct>       <int>   <dbl>        <dbl>
##  1 Afghanistan  1952    28.8         0   
##  2 Afghanistan  1957    30.3         1.53
##  3 Afghanistan  1962    32.0         3.20
##  4 Albania      1952    55.2         0   
##  5 Albania      1957    59.3         4.05
##  6 Albania      1962    64.8         9.59
##  7 Algeria      1952    43.1         0   
##  8 Algeria      1957    45.7         2.61
##  9 Algeria      1962    48.3         5.23
## 10 Angola       1952    30.0         0   
## # … with 416 more rows

Within country, we take the difference between life expectancy in year \(i\) and life expectancy in 1952. Therefore we always see zeroes for 1952 and, for most countries, a sequence of positive and increasing numbers.

Window functions

Window functions take \(n\) inputs and give back \(n\) outputs. Furthermore, the output depends on all the values. So rank() is a window function but log() is not. Here we use window functions based on ranks and offsets.
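
A toy illustration of the “n in, n out” idea (arbitrary values):

min_rank(c(30, 10, 20))   # each output depends on the whole vector
## [1] 3 1 2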

Let’s revisit the worst and best life expectancies in Asia over time, but retaining info about which country contributes these extreme values.

gapminder %>%
  filter(continent == "Asia") %>%
  select(year, country, lifeExp) %>%
  group_by(year) %>%
  filter(min_rank(desc(lifeExp)) < 2 | min_rank(lifeExp) < 2) %>% 
  arrange(year) %>%
  print(n = Inf)
## # A tibble: 24 x 3
## # Groups:   year [12]
##     year country     lifeExp
##    <int> <fct>         <dbl>
##  1  1952 Afghanistan    28.8
##  2  1952 Israel         65.4
##  3  1957 Afghanistan    30.3
##  4  1957 Israel         67.8
##  5  1962 Afghanistan    32.0
##  6  1962 Israel         69.4
##  7  1967 Afghanistan    34.0
##  8  1967 Japan          71.4
##  9  1972 Afghanistan    36.1
## 10  1972 Japan          73.4
## 11  1977 Cambodia       31.2
## 12  1977 Japan          75.4
## 13  1982 Afghanistan    39.9
## 14  1982 Japan          77.1
## 15  1987 Afghanistan    40.8
## 16  1987 Japan          78.7
## 17  1992 Afghanistan    41.7
## 18  1992 Japan          79.4
## 19  1997 Afghanistan    41.8
## 20  1997 Japan          80.7
## 21  2002 Afghanistan    42.1
## 22  2002 Japan          82  
## 23  2007 Afghanistan    43.8
## 24  2007 Japan          82.6

We see that (min = Afghanistan, max = Japan) is the most frequent result, but Cambodia and Israel pop up at least once each as the min or max, respectively. That table should make you impatient for our upcoming work on tidying and reshaping data! Wouldn’t it be nice to have one row per year?

How did that actually work? First, I store and view a partial pipeline that leaves off the filter() statement. All of these operations should be familiar.

asia <- gapminder %>%
  filter(continent == "Asia") %>%
  select(year, country, lifeExp) %>%
  group_by(year)
asia
## # A tibble: 396 x 3
## # Groups:   year [12]
##     year country     lifeExp
##    <int> <fct>         <dbl>
##  1  1952 Afghanistan    28.8
##  2  1957 Afghanistan    30.3
##  3  1962 Afghanistan    32.0
##  4  1967 Afghanistan    34.0
##  5  1972 Afghanistan    36.1
##  6  1977 Afghanistan    38.4
##  7  1982 Afghanistan    39.9
##  8  1987 Afghanistan    40.8
##  9  1992 Afghanistan    41.7
## 10  1997 Afghanistan    41.8
## # … with 386 more rows

Now we apply a window function – min_rank(). Since asia is grouped by year, min_rank() operates within mini-datasets, each for a specific year. Applied to the variable lifeExp, min_rank() returns the rank of each country’s observed life expectancy. FYI, the min part just specifies how ties are broken. Here is an explicit peek at these within-year life expectancy ranks, in both the (default) ascending and descending order.
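
As a quick aside on tie-breaking, here is how min_rank() compares to dense_rank() on a toy vector with a tie (x is just a throwaway example):

x <- c(10, 20, 20, 30)
min_rank(x)      # tied values share the smallest applicable rank; the next rank is skipped
## [1] 1 2 2 4
dense_rank(x)    # tied values share a rank, but no ranks are skipped
## [1] 1 2 2 3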

For concreteness, I use mutate() to actually create these variables, even though I dropped this in the solution above. Let’s look at a bit of that.

asia %>%
  mutate(le_rank = min_rank(lifeExp),
         le_desc_rank = min_rank(desc(lifeExp))) %>% 
  filter(country %in% c("Afghanistan", "Japan", "Thailand"), year > 1995)
## # A tibble: 9 x 5
## # Groups:   year [3]
##    year country     lifeExp le_rank le_desc_rank
##   <int> <fct>         <dbl>   <int>        <int>
## 1  1997 Afghanistan    41.8       1           33
## 2  2002 Afghanistan    42.1       1           33
## 3  2007 Afghanistan    43.8       1           33
## 4  1997 Japan          80.7      33            1
## 5  2002 Japan          82        33            1
## 6  2007 Japan          82.6      33            1
## 7  1997 Thailand       67.5      12           22
## 8  2002 Thailand       68.6      12           22
## 9  2007 Thailand       70.6      12           22

Afghanistan tends to present 1’s in the le_rank variable, Japan tends to present 1’s in the le_desc_rank variable, and other countries, like Thailand, present less extreme ranks.

You can understand the original filter() statement now:

filter(min_rank(desc(lifeExp)) < 2 | min_rank(lifeExp) < 2)

These two sets of ranks are formed on the fly, within year group, and filter() retains rows with rank less than 2, which means … the row with rank = 1. Since we do this for both ascending and descending ranks, we get both the min and the max.

If we had wanted just the min OR the max, an alternative approach using top_n() would have worked.

gapminder %>%
  filter(continent == "Asia") %>%
  select(year, country, lifeExp) %>%
  arrange(year) %>%
  group_by(year) %>%
  #top_n(1, wt = lifeExp)       ## gets the max
  top_n(1, wt = desc(lifeExp))  ## gets the min
## # A tibble: 12 x 3
## # Groups:   year [12]
##     year country     lifeExp
##    <int> <fct>         <dbl>
##  1  1952 Afghanistan    28.8
##  2  1957 Afghanistan    30.3
##  3  1962 Afghanistan    32.0
##  4  1967 Afghanistan    34.0
##  5  1972 Afghanistan    36.1
##  6  1977 Cambodia       31.2
##  7  1982 Afghanistan    39.9
##  8  1987 Afghanistan    40.8
##  9  1992 Afghanistan    41.7
## 10  1997 Afghanistan    41.8
## 11  2002 Afghanistan    42.1
## 12  2007 Afghanistan    43.8

Grand Finale

So let’s answer that “simple” question: which country experienced the sharpest 5-year drop in life expectancy? Recall that this excerpt of the Gapminder data only has data every five years, e.g. for 1952, 1957, etc. So this really means looking at life expectancy changes between adjacent timepoints.
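
The solution below uses lag(), which shifts a vector back by one position so each value can be compared to the one before it. A toy illustration (arbitrary values):

lag(c(5, 7, 12))   # returns NA, 5, 7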

At this point, that’s just too easy, so let’s do it by continent while we’re at it.

gapminder %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  ## within country, take (lifeExp in year i) - (lifeExp in year i - 1)
  ## positive means lifeExp went up, negative means it went down
  mutate(le_delta = lifeExp - lag(lifeExp)) %>% 
  ## within country, retain the worst lifeExp change = smallest or most negative
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %>% 
  ## within continent, retain the row with the lowest worst_le_delta
  top_n(-1, wt = worst_le_delta) %>% 
  arrange(worst_le_delta)
## # A tibble: 5 x 3
## # Groups:   continent [5]
##   continent country     worst_le_delta
##   <fct>     <fct>                <dbl>
## 1 Africa    Rwanda             -20.4  
## 2 Asia      Cambodia            -9.10 
## 3 Americas  El Salvador         -1.51 
## 4 Europe    Montenegro          -1.46 
## 5 Oceania   Australia            0.170

Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.

Break the code into pieces, starting at the top, and inspect the intermediate results. That’s certainly how I was able to write such a thing. These commands do not leap fully formed out of anyone’s forehead – they are built up gradually, with lots of errors and refinements along the way. I’m not even sure it’s a great idea to do so much manipulation in one fell swoop. Is the statement above really hard for you to read? If yes, then by all means break it into pieces and make some intermediate objects. Your code should be easy to write and read when you’re done.

In later tutorials, we’ll explore more of dplyr, such as operations based on two datasets.