Establishing the Problem

If you find yourself copying and pasting the same thing more than three times, you should probably write a function to handle it.

Individuals who work in data science usually find themselves doing repetitive tasks and analyses. How can we 1) automate those tasks so we are not copy-pasting, and 2) scale up something (e.g., a script or function) that works at a low level so that it works at a higher level (e.g., on multiple files or inputs)?

Some examples of common problems that might be answered by “scaling up”:

  • Reading multiple data files into your active Environment

  • Fitting multiple models to multiple groups of observations

  • Obtaining multiple graphs for each subject

  • Applying a custom function

Session Goals

The goals of this session are two-fold:

  1. Gain a deeper understanding of functions,

  2. Learn how to apply functions to “Scale it Up”

Before We Begin

Make sure you have the following packages installed if you would like to follow along (you will also need the pacman package itself, which you can install with install.packages("pacman")).

# Load and/or install
pacman::p_load(tidyverse, broom, gapminder)

Functions

Functions consist of formals (their arguments) and a body, and are created in the form my_function <- function(formals) {body}. You must assign a function to an object to be able to reuse it, unless you use an anonymous function.

hello_world <- function() {
    print("Hello, world!")
}
hello_world()
## [1] "Hello, world!"

Congratulations! Now let’s take it up a notch…

## Multi-line functions run from top to bottom...
hello_world_name <- function(name) {
    message <- paste0("Hello, ", name, "!")  # paste0() is paste() with sep = ""
    return(message)
}

hello_world_name("Dan")
## [1] "Hello, Dan!"

A note about return(): if you do not explicitly specify what should be returned from a function, R returns the result of the last expression evaluated in the function body.

A function can only return a single object. If you want to return more than one thing from a function (e.g., a data frame plus a few summary values), bundle those pieces into a list and return that.
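
For example, here is a minimal sketch (the function name and values are just for illustration) of bundling several results into a list so they come back together:

range_and_mean <- function(x) {
    # Return multiple values by wrapping them in a single list
    list(min = min(x), max = max(x), mean = mean(x))
}

results <- range_and_mean(1:10)
results$mean  # access the individual pieces with $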

Begin to Scale it Up: apply and map

Apply functions are essentially simplified for loops. lapply(), used below, takes a list (or vector) as input and returns a list as output.

num_list <- lapply(1:10, print)  # avoid the name "list", which would mask base::list()
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
lapply(num_list, sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068
## 
## [[6]]
## [1] 2.44949
## 
## [[7]]
## [1] 2.645751
## 
## [[8]]
## [1] 2.828427
## 
## [[9]]
## [1] 3
## 
## [[10]]
## [1] 3.162278
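
To make the “simplified for loop” idea concrete, here is roughly the base R loop that the first lapply() call above replaces (just a sketch):

# Roughly what lapply(1:10, print) is doing for us
out <- vector("list", 10)
for (i in 1:10) {
    out[[i]] <- print(i)
}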

Map functions from the purrr package are like the apply family, but have a few advantages. For example, you don't always have to write out a full function, and there are type-specific map variants (e.g., map_dbl(), map_chr(), map_df()).

my_list <- list(letters = letters, numbers = 1:26)

So, maybe we want to extract the 3rd element from each part of the list… This won’t work:

lapply(my_list, 3)
Error in match.fun(FUN) : '3' is not a function, character or symbol

But this does!

map(my_list, 3)
## $letters
## [1] "c"
## 
## $numbers
## [1] 3

And this!

my_list %>% map(., 3)
## $letters
## [1] "c"
## 
## $numbers
## [1] 3

If you wanted to stay in base R, you’d have to do this:

lapply(my_list, "[[", 3)
## $letters
## [1] "c"
## 
## $numbers
## [1] 3

The [[ should look familiar: it is the double-bracket extraction operator from the indexing tutorial.
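
As a quick aside on the type-specific variants mentioned earlier, functions such as map_dbl() and map_chr() return an atomic vector instead of a list. A minimal sketch:

map_dbl(1:5, ~.x^2)             # numeric vector, not a list
map_chr(letters[1:3], toupper)  # character vector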

More map: Variations

?map

The basic map() takes a list (or vector) as input and returns a list:

map(1:10, print)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 6
## 
## [[7]]
## [1] 7
## 
## [[8]]
## [1] 8
## 
## [[9]]
## [1] 9
## 
## [[10]]
## [1] 10

map_df returns a data frame instead of a list; a data frame also works as input, since it is just a list of columns:

df <- data.frame(x = 1:10, y = 11:20, z = letters[1:10])
map_df(df, mean)
## # A tibble: 1 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1   5.5  15.5    NA

map_if applies a function only to the elements that meet a condition; elements that do not are returned unaltered.

map_if(df, is.integer, mean)
## $x
## [1] 5.5
## 
## $y
## [1] 15.5
## 
## $z
##  [1] a b c d e f g h i j
## Levels: a b c d e f g h i j

We can pass additional arguments like so:

?scale  # scale() takes three arguments

map_df(df[1:2], scale, center = FALSE)
## # A tibble: 10 x 2
##        x     y
##    <dbl> <dbl>
##  1 0.153 0.662
##  2 0.306 0.722
##  3 0.459 0.782
##  4 0.612 0.843
##  5 0.764 0.903
##  6 0.917 0.963
##  7 1.07  1.02 
##  8 1.22  1.08 
##  9 1.38  1.14 
## 10 1.53  1.20
# Alternatively map(df[1:2], scale, center=FALSE)

More map: map2

map2 allows a function to take two inputs. Here the second input has length one, so it is recycled for every element of the first.

map2(1:10, 2, function(x, y) {
    x * y
})
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 8
## 
## [[5]]
## [1] 10
## 
## [[6]]
## [1] 12
## 
## [[7]]
## [1] 14
## 
## [[8]]
## [1] 16
## 
## [[9]]
## [1] 18
## 
## [[10]]
## [1] 20

Anonymous functions

One of the benefits of map over the *apply family is that we can use shorthand, or anonymous, functions. If we explicitly define a function inside map, it takes this form:

map2(1:10, 2, function(x, y) {
    x * y
})
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 8
## 
## [[5]]
## [1] 10
## 
## [[6]]
## [1] 12
## 
## [[7]]
## [1] 14
## 
## [[8]]
## [1] 16
## 
## [[9]]
## [1] 18
## 
## [[10]]
## [1] 20

But we can take a shortcut and write an anonymous function in this form:

map2(1:10, 2, ~.x * .y)
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 8
## 
## [[5]]
## [1] 10
## 
## [[6]]
## [1] 12
## 
## [[7]]
## [1] 14
## 
## [[8]]
## [1] 16
## 
## [[9]]
## [1] 18
## 
## [[10]]
## [1] 20
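
The same shorthand works for single-input map functions, where .x stands for the current element. For example, using the my_list object from earlier (a small sketch):

map_int(my_list, ~length(.x))  # .x is the current list element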

A Quick Scaling Up Example

All of this mapping is very basic, and it might be hard to see the power of mapping functions over lists. As a quick example, let's work through a more practical solution to one of the common problems brought up at the beginning of this tutorial.

I have a folder of data (looking at you EPrime!) and I want to read all of the contents into R.

source("../R/scale_up_utils.R")

# Function to create lots of data in a folder
my_dir <- lots_of_data()

# List the files in the folder
file_names <- list.files(my_dir, pattern = "\\.csv$", full.names = TRUE)  # pattern is a regular expression

# Read it in
df <- purrr::map(file_names, readr::read_csv)
length(df)
## [1] 30

What is this black magic? Well, as we saw above, we now have a list of length 30 in which each element is a separate data frame. We can see each individual data frame by subsetting our list.

head(df[[1]])
## # A tibble: 6 x 4
##   subject_nr block  pic      RT
##        <int> <chr>  <chr> <dbl>
## 1       2582 block1 pic1   95.5
## 2       2582 block2 pic2  300. 
## 3       2582 block1 pic3  186. 
## 4       2582 block2 pic4  173. 
## 5       2582 block1 pic1   48.0
## 6       2582 block2 pic2  212.

This might be useful for some analyses, for example if you want to run analyses at the participant level, but more often than not we want a single 2-dimensional data frame in tidy data format. To get there, we need to collapse our list of many data frames into one, and for that we use the verb reduce.

reduce collapses a list into a single object by repeatedly applying a two-argument function, carrying the result forward each time. In this case we will use rbind, so our data are combined by stacking each data frame on top of the previous one.
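
As a toy illustration of what reduce() does, it folds a vector or list into a single value by applying the function pairwise from left to right:

reduce(1:5, `+`)  # ((((1 + 2) + 3) + 4) + 5) = 15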

df_reduced <- df %>% reduce(., rbind)

df_reduced[c(1:4, 300:304), ]
## # A tibble: 9 x 4
##   subject_nr block  pic      RT
##        <int> <chr>  <chr> <dbl>
## 1       2582 block1 pic1   95.5
## 2       2582 block2 pic2  300. 
## 3       2582 block1 pic3  186. 
## 4       2582 block2 pic4  173. 
## 5       7670 block2 pic4  183. 
## 6       7670 block1 pic1   38.0
## 7       7670 block2 pic2  257. 
## 8       7670 block1 pic3  130. 
## 9       7670 block2 pic4   94.2

Getting data from a folder (start to finish)

How succinct can we make this process? Fairly short if we assume we already have the raw data in a folder.

# The whole thing!
df <- list.files(my_dir, pattern = "\\.csv$", full.names = TRUE) %>% 
    map(., read_csv) %>% 
    reduce(., rbind)

Utilizing pipes and tidyverse verbs, we can accomplish a fairly complex task in three lines of code.
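
If you prefer, purrr also provides map_dfr(), which maps and row-binds in a single step; a sketch of an equivalent pipeline (assuming plain row-binding is all we need):

df <- list.files(my_dir, pattern = "\\.csv$", full.names = TRUE) %>% 
    map_dfr(read_csv)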

Make sure to delete the random folder we created so it isn’t taking up unnecessary space.

# Clean up
unlink(my_dir, recursive = TRUE)

Our Gapminder Scaling Problem

Let's try scaling something a bit more complicated. Here, we will use the gapminder data set to fit a separate linear model to each of many groups of observations.

A large part of this section of the tutorial is based on the chapter “Many Models” from Hadley Wickham's wonderful book R for Data Science (Wickham and Grolemund 2017).

The Gapminder data set

This data set lives in the gapminder package and is based on world economic, health, and related data collected by the Gapminder Foundation, a non-profit Swedish organization aimed at educating consumers with evidence-based facts about the world.

The Gapminder website has many, many data files available for public use, and lots of cool visualizations to play around with. However, we will be most interested in the following:

Variable    Class    Description
country     factor   Country
continent   factor   Continent of the country
year        integer  Year the data were sampled
lifeExp     numeric  Life expectancy at birth, in years
pop         integer  Population
gdpPercap   numeric  GDP per capita (adjusted)

The broom package

This package from David Robinson converts the output of a variety of model functions into tidy data frames. We will be using broom as we develop a pipeline and scale up our functions.

# You can check citations in R if you can't remember!
citation("broom")
## 
## To cite package 'broom' in publications use:
## 
##   David Robinson and Alex Hayes (2018). broom: Convert Statistical
##   Analysis Objects into Tidy Tibbles. R package version 0.5.0.
##   https://CRAN.R-project.org/package=broom
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {broom: Convert Statistical Analysis Objects into Tidy Tibbles},
##     author = {David Robinson and Alex Hayes},
##     year = {2018},
##     note = {R package version 0.5.0},
##     url = {https://CRAN.R-project.org/package=broom},
##   }

Mind the “gap”

Because this data frame contains several kinds of information (numeric measures, time, and grouping variables), we can begin to ask and look at very cool things (“PLOT YOUR DATA!”).

Question: Is GDP increasing across all countries?

Start by plotting the data:

gapminder %>% ggplot(., aes(year, gdpPercap, group = country)) + geom_line(alpha = 0.3, 
    size = 2) + theme_bw(base_size = 15)

Oh my, something is happening with a few countries. I wonder which continent the most extreme trends are on:

gapminder %>% ggplot(., aes(year, gdpPercap, group = country, colour = continent)) + 
    geom_line(alpha = 0.3, size = 2) + theme_bw(base_size = 15)

Let’s pull out one country and see if we can find the abnormal culprit.

gapminder %>% filter(gdpPercap >= 90000) %>% distinct(country)  # We only need to see it once
## # A tibble: 1 x 1
##   country
##   <fct>  
## 1 Kuwait

Well, well, well… OK, let's plot just this country and begin by fitting a single model.

Let’s compare Kuwait to another country (Canada):
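
A minimal sketch of one way to draw that comparison (one line per country, with a linear trend overlaid):

gapminder %>% 
    filter(country %in% c("Kuwait", "Canada")) %>% 
    ggplot(aes(year, gdpPercap, colour = country)) + 
    geom_line(size = 2) + 
    geom_smooth(method = "lm", se = FALSE, linetype = "dashed") + 
    theme_bw(base_size = 15)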

It looks like a linear trend fits Canada's GDP over the years quite well!

Starting simple: A single solution

OK, so how do we fit that linear model to just one country (e.g., Kuwait or Canada)? Well, first we need to subset or filter our data by country.

kw <- filter(gapminder, country == "Kuwait")
head(kw)
## # A tibble: 6 x 6
##   country continent  year lifeExp     pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Kuwait  Asia       1952    55.6  160000   108382.
## 2 Kuwait  Asia       1957    58.0  212846   113523.
## 3 Kuwait  Asia       1962    60.5  358266    95458.
## 4 Kuwait  Asia       1967    64.6  575003    80895.
## 5 Kuwait  Asia       1972    67.7  841934   109348.
## 6 Kuwait  Asia       1977    69.3 1140357    59265.

Now we can apply our linear model to just Kuwait’s data.

fit1 <- lm(gdpPercap ~ year, data = kw)
fit_summary <- summary(fit1)
print(fit1)
## 
## Call:
## lm(formula = gdpPercap ~ year, data = kw)
## 
## Coefficients:
## (Intercept)         year  
##     3200775        -1584

And we can clean this up a bit with broom’s tidy function.

clean_summary <- tidy(fit_summary)
print(clean_summary)
## # A tibble: 2 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  3200775.   633274.      5.05 0.000496
## 2 year           -1584.      320.     -4.95 0.000577

We can also “augment” our original data frame with some useful additions with broom’s augment function.

fit_augment <- fit1 %>% augment(.)
head(fit_augment)
## # A tibble: 6 x 9
##   gdpPercap  year .fitted .se.fit  .resid   .hat .sigma .cooksd .std.resid
##       <dbl> <int>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
## 1   108382.  1952 108892.  10387.   -509. 0.295  20161. 2.10e-4    -0.0317
## 2   113523.  1957 100972.   9072.  12551. 0.225  19594. 8.06e-2     0.745 
## 3    95458.  1962  93052.   7863.   2406. 0.169  20143. 1.94e-3     0.138 
## 4    80895.  1967  85132.   6818.  -4237. 0.127  20105. 4.09e-3    -0.237 
## 5   109348.  1972  77213.   6020.  32135. 0.0991 16708. 1.72e-1     1.77  
## 6    59265.  1977  69293.   5579. -10027. 0.0851 19857. 1.40e-2    -0.548
fit_augment %>% ggplot(., aes(year, .fitted)) + geom_line(size = 2) + geom_line(aes(year, 
    gdpPercap), size = 2, colour = "red")

This should look familiar from above. So, it looks like we successfully fitted a model to Kuwait’s data. But what if we want/need to compute a separate linear model for each country? Well, we have a few options:

1). Copy and paste the above code for each country

2). Develop some sort of for loop or apply script with a custom linear model function

Well, the first option seems intuitive and logical, but how many countries are there? That will be how many times we have to copy and paste our above code.

gapminder %>% select(., country) %>% distinct(., country) %>% tally()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   142

Hm, copy-pasting 142 times seems ridiculous and not a sensible solution. How about a for loop? I will be using broom's glance function, which provides a single row of summary statistics for each model and will be useful for comparing model fits.

countries <- gapminder %>% select(., country) %>% distinct(., country)

tidy_fits <- data.frame()

for (x in 1:nrow(countries)) {
    countryx <- countries[x, ]
    df <- filter(gapminder, country == countryx$country)
    
    fitx <- lm(gdpPercap ~ year, data = df)
    xtidy <- glance(fitx)
    xtidy$country <- countryx$country
    
    tidy_fits <- rbind(tidy_fits, xtidy)
}

tidy_fits %>% select(., country, everything()) %>% head(.)
## # A tibble: 6 x 12
##   country   r.squared adj.r.squared sigma statistic   p.value    df logLik
##   <fct>         <dbl>         <dbl> <dbl>     <dbl>     <dbl> <int>  <dbl>
## 1 Afghanis…   0.00539       -0.0941  113.    0.0542   8.21e-1     2  -72.7
## 2 Albania     0.678          0.646   710.   21.1      9.98e-4     2  -94.7
## 3 Algeria     0.782          0.760   641.   35.9      1.34e-4     2  -93.5
## 4 Angola      0.129          0.0422 1141.    1.48     2.51e-1     2 -100. 
## 5 Argentina   0.706          0.677  1059.   24.0      6.23e-4     2  -99.5
## 6 Australia   0.969          0.966  1432.  318.       6.61e-9     2 -103. 
## # ... with 4 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
## #   df.residual <int>

OK… that's acceptable, and it can help us conceptually understand what we need to do in order to accomplish our task, but it's clunky. Isn't there anything else?

The tidyverse Scale Up

Well, we do have a third option:

3). Some tidyverse alternative which is probably easier (spoiler: it is)

Nested data frames

You might imagine that map can help us both consolidate our code and make it more readable for future reference. Conceptually, we want to do exactly what the for loop above did: extract the grouped data for each country, apply a linear model to each country's data, and then add the tidied linear model summary to a tidy data frame.

In order to do this within a tidyverse framework, we need to nest our data frame by our grouping variable. A nested data frame is a special data frame with one row per group, where each row contains (in a list-column) another data frame holding only that group's observations.

by_country <- gapminder %>% group_by(., country) %>% nest(.)

head(by_country[[1, 2]])
## # A tibble: 6 x 5
##   continent  year lifeExp      pop gdpPercap
##   <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Asia       1952    28.8  8425333      779.
## 2 Asia       1957    30.3  9240934      821.
## 3 Asia       1962    32.0 10267083      853.
## 4 Asia       1967    34.0 11537966      836.
## 5 Asia       1972    36.1 13079460      740.
## 6 Asia       1977    38.4 14880372      786.

Next, we need to do a little setup and create a specialized linear model function:

country_model <- function(x) {
    lm(gdpPercap ~ year, data = x)
}

That’s it! Just wrapping our linear model in a nice function! Now we are ready to apply it to each of our nested data frames.

models <- map(by_country$data, country_model)

We can now add a model column to this nested data frame, containing the fitted model output for each country.

by_country <- by_country %>% mutate(model = map(data, country_model))

head(by_country)
## # A tibble: 6 x 3
##   country     data              model   
##   <fct>       <list>            <list>  
## 1 Afghanistan <tibble [12 × 5]> <S3: lm>
## 2 Albania     <tibble [12 × 5]> <S3: lm>
## 3 Algeria     <tibble [12 × 5]> <S3: lm>
## 4 Angola      <tibble [12 × 5]> <S3: lm>
## 5 Argentina   <tibble [12 × 5]> <S3: lm>
## 6 Australia   <tibble [12 × 5]> <S3: lm>

Finally, we can get our comparison table by applying glance to each country's model and then unnesting the result.

tidy_fits2 <- by_country %>% mutate(glance = map(model, broom::glance)) %>% 
    unnest(glance)

head(tidy_fits2)
## # A tibble: 6 x 14
##   country  data     model r.squared adj.r.squared sigma statistic  p.value
##   <fct>    <list>   <lis>     <dbl>         <dbl> <dbl>     <dbl>    <dbl>
## 1 Afghani… <tibble… <S3:…   0.00539       -0.0941  113.    0.0542  8.21e-1
## 2 Albania  <tibble… <S3:…   0.678          0.646   710.   21.1     9.98e-4
## 3 Algeria  <tibble… <S3:…   0.782          0.760   641.   35.9     1.34e-4
## 4 Angola   <tibble… <S3:…   0.129          0.0422 1141.    1.48    2.51e-1
## 5 Argenti… <tibble… <S3:…   0.706          0.677  1059.   24.0     6.23e-4
## 6 Austral… <tibble… <S3:…   0.969          0.966  1432.  318.      6.61e-9
## # ... with 6 more variables: df <int>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
## #   deviance <dbl>, df.residual <int>

References

Wickham, H., and G. Grolemund. 2017. R for Data Science. O’Reilly Media. https://books.google.com/books?id=-7RhvgAACAAJ.