If you find yourself copying and pasting the same thing more than three times,
you should probably write a function to handle it.
Individuals who work in data science usually find themselves doing repetitive tasks and analyses. How can we: 1) automate to avoid copy-pasting, and 2) scale up something (e.g., a script or function) that works at a lower level so it works at a higher level (e.g., on multiple files or inputs)?
Some examples of common problems that might be answered by “scaling up”:
Reading multiple data files into your active Environment
Fitting multiple models to multiple groups of observations
Obtaining multiple graphs for each subject
Applying a custom function
The goals of this session are two-fold:
Gain a deeper understanding of functions,
Learn how to apply functions to “Scale it Up”
Make sure you have the following packages installed if you would like to follow along.
# Load and/or install
pacman::p_load(tidyverse, broom, gapminder)
Functions consist of formals (the arguments) and a body, and take the form my_function <- function(formals){body}. You must assign a function to an object to be able to use it, unless you use an anonymous function.
hello_world <- function() {
print("Hello, world!")
}
hello_world()
## [1] "Hello, world!"
Congratulations! Now let’s take it up a notch…
## A function body spanning multiple lines runs top to bottom...
hello_world_name <- function(name) {
message <- paste0("Hello, ", name, "!") # paste0 is paste without any spaces
return(message)
}
hello_world_name("Dan")
## [1] "Hello, Dan!"
A note about return(): if you do not manually specify what should be returned from a function, R returns the value of the last evaluated expression in the function.
If you want to return more than a single object from a function (e.g., a data frame plus a summary), you will need to bundle the pieces into a list.
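For example (a small sketch, not from the original tutorial): the first function relies on the implicit last-expression return, and the second bundles multiple results into a named list:

```r
# Implicit return: the value of the last expression is what the caller gets
add_one <- function(x) {
  x + 1  # no return() needed
}

# Returning more than one thing: bundle the pieces into a named list
min_max <- function(x) {
  list(min = min(x), max = max(x))
}

add_one(2)           # 3
min_max(c(5, 1, 9))  # $min is 1, $max is 9
```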
apply
and map
Apply functions are essentially simplified for
loops. These functions take a list (or vector) as input and return a list as output.
nums <- lapply(1:10, print)  # named `nums`, not `list`, which would mask base::list()
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
lapply(nums, sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236068
##
## [[6]]
## [1] 2.44949
##
## [[7]]
## [1] 2.645751
##
## [[8]]
## [1] 2.828427
##
## [[9]]
## [1] 3
##
## [[10]]
## [1] 3.162278
Map functions from the purrr
package are like apply, but have a few advantages. For example, you don't always have to write out full functions, and there are type-specific map
functions.
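For instance (a quick sketch, just to illustrate), the type-specific variants return atomic vectors of the named type instead of lists:

```r
library(purrr)

# map_dbl returns a numeric (double) vector instead of a list
map_dbl(1:3, ~.x^2)            # 1 4 9

# map_chr returns a character vector
map_chr(c("a", "b"), toupper)  # "A" "B"
```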
my_list <- list(letters = letters, numbers = 1:26)
So, maybe we want to extract the 3rd element from each part of the list… This won’t work:
lapply(my_list, 3)
Error in match.fun(FUN) : '3' is not a function, character or symbol
But this does!
map(my_list, 3)
## $letters
## [1] "c"
##
## $numbers
## [1] 3
And this!
my_list %>% map(., 3)
## $letters
## [1] "c"
##
## $numbers
## [1] 3
If you wanted to stay in base R, you’d have to do this:
lapply(my_list, "[[", 3)
## $letters
## [1] "c"
##
## $numbers
## [1] 3
The [[
should look familiar as the left half of the [[]]
from the indexing tutorial.
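As a quick aside (an illustration, not part of the original tutorial): `[[` is itself a function in R, which is why lapply can accept it as a character string and pass 3 along as its index argument.

```r
my_list <- list(letters = letters, numbers = 1:26)

# Normal indexing syntax...
my_list[["numbers"]][3]      # 3

# ...is shorthand for calling `[[` like any other function
`[[`(my_list, "numbers")[3]  # 3
`[[`(my_list$numbers, 3)     # 3
```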
map: Variations (?map)
Base map takes a list as input and returns a list:
map(1:10, print)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 5
##
## [[6]]
## [1] 6
##
## [[7]]
## [1] 7
##
## [[8]]
## [1] 8
##
## [[9]]
## [1] 9
##
## [[10]]
## [1] 10
map_df takes a data frame as input and returns a data frame:
df <- data.frame(x = 1:10, y = 11:20, z = letters[1:10])
map_df(df, mean)
## # A tibble: 1 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 5.5 15.5 NA
map_if
applies a function if a condition is met, and if not will return the object unaltered.
map_if(df, is.integer, mean)
## $x
## [1] 5.5
##
## $y
## [1] 15.5
##
## $z
## [1] a b c d e f g h i j
## Levels: a b c d e f g h i j
We can pass additional arguments like so:
?scale  # Takes three arguments
map_df(df[1:2], scale, center = FALSE)
## # A tibble: 10 x 2
## x y
## <dbl> <dbl>
## 1 0.153 0.662
## 2 0.306 0.722
## 3 0.459 0.782
## 4 0.612 0.843
## 5 0.764 0.903
## 6 0.917 0.963
## 7 1.07 1.02
## 8 1.22 1.08
## 9 1.38 1.14
## 10 1.53 1.20
# Alternatively map(df[1:2], scale, center=FALSE)
map: map2
map2 allows two inputs to be supplied to a function.
map2(1:10, 2, function(x, y) {
x * y
})
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 8
##
## [[5]]
## [1] 10
##
## [[6]]
## [1] 12
##
## [[7]]
## [1] 14
##
## [[8]]
## [1] 16
##
## [[9]]
## [1] 18
##
## [[10]]
## [1] 20
One of the benefits of map
over *apply
is that we can use shorthand, or anonymous, functions. If we explicitly define a function in map
, it takes this form:
map2(1:10, 2, function(x, y) {
x * y
})
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 8
##
## [[5]]
## [1] 10
##
## [[6]]
## [1] 12
##
## [[7]]
## [1] 14
##
## [[8]]
## [1] 16
##
## [[9]]
## [1] 18
##
## [[10]]
## [1] 20
But we can take a shortcut and write an anonymous function in this form:
map2(1:10, 2, ~.x * .y)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 8
##
## [[5]]
## [1] 10
##
## [[6]]
## [1] 12
##
## [[7]]
## [1] 14
##
## [[8]]
## [1] 16
##
## [[9]]
## [1] 18
##
## [[10]]
## [1] 20
All of this mapping is very basic, and it might be hard to see the power of mapping functions over lists. As a quick example, let's examine a more practical solution to one of the common problems brought up at the beginning of this tutorial.
I have a folder of data (looking at you, EPrime
!) and I want to read all of its contents into R
.
source("../R/scale_up_utils.R")
# Function to create lots of data in a folder
my_dir <- lots_of_data()
# List the files in the folder
file_names <- list.files(my_dir, pattern = "*.csv", full.names = TRUE)
# Read it in
df <- purrr::map(file_names, readr::read_csv)
length(df)
## [1] 30
What is this black magic? Well, as we saw above, we now have a special list of length 30 in which each element is a separate data frame. We can see each individual data frame by subsetting our list.
head(df[[1]])
## # A tibble: 6 x 4
## subject_nr block pic RT
## <int> <chr> <chr> <dbl>
## 1 2582 block1 pic1 95.5
## 2 2582 block2 pic2 300.
## 3 2582 block1 pic3 186.
## 4 2582 block2 pic4 173.
## 5 2582 block1 pic1 48.0
## 6 2582 block2 pic2 212.
This might be useful for some analyses, for example if you want to run analyses at the participant level, but more often than not we want a 2-dimensional data frame in tidy data format. To get there, we need to reduce, or simplify, our nested list of many data frames using the verb reduce
.
reduce
combines list elements via a user-specified function. In this case we will use rbind
so that our data frames are combined by stacking them on top of one another.
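Before applying it to our data frames, here is a minimal sketch of reduce on its own: it folds a list down to a single value by repeatedly applying a two-argument function.

```r
library(purrr)

# Fold a vector with `+`: ((((1 + 2) + 3) + 4) + 5)
reduce(1:5, `+`)  # 15

# The same idea with rbind: stack small data frames on top of one another
d1 <- data.frame(x = 1)
d2 <- data.frame(x = 2)
d3 <- data.frame(x = 3)
reduce(list(d1, d2, d3), rbind)  # a single 3-row data frame
```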
df_reduced <- df %>% reduce(., rbind)
df_reduced[c(1:4, 300:304), ]
## # A tibble: 9 x 4
## subject_nr block pic RT
## <int> <chr> <chr> <dbl>
## 1 2582 block1 pic1 95.5
## 2 2582 block2 pic2 300.
## 3 2582 block1 pic3 186.
## 4 2582 block2 pic4 173.
## 5 7670 block2 pic4 183.
## 6 7670 block1 pic1 38.0
## 7 7670 block2 pic2 257.
## 8 7670 block1 pic3 130.
## 9 7670 block2 pic4 94.2
How succinct can we make this process? Fairly succinct, if we assume the raw data is already sitting in a folder.
# The whole thing!
df <- list.files(my_dir, pattern = "*.csv", full.names = TRUE) %>% map(., read_csv) %>%
reduce(., rbind)
Utilizing pipes and tidyverse
verbs, we can accomplish a fairly complex task in three lines of code.
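As an aside, purrr also offers map_dfr(), which folds the map-then-row-bind steps into a single call (it relies on dplyr's bind_rows() under the hood, and newer purrr releases favor list_rbind() instead). A self-contained sketch using two throwaway CSVs:

```r
library(purrr)
library(readr)

# Write two small CSVs to a temporary folder
tmp <- tempfile()
dir.create(tmp)
write_csv(data.frame(id = 1:2), file.path(tmp, "a.csv"))
write_csv(data.frame(id = 3:4), file.path(tmp, "b.csv"))

# map_dfr() maps read_csv over the files and row-binds the results
df <- list.files(tmp, pattern = "*.csv", full.names = TRUE) %>%
  map_dfr(read_csv)

nrow(df)  # 4

# Clean up the temporary folder
unlink(tmp, recursive = TRUE)
```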
Make sure to delete the random folder we created so it isn’t taking up unnecessary space.
# Clean up
unlink(my_dir, recursive = TRUE)
Let’s try scaling something else on an issue that might be a bit more complicated. Here, we will use the gapminder
data set to fit individual linear models to many different groups of observations.
A large majority of this section of the tutorial is based on the chapter Many Models
from Hadley Wickham’s wonderful book R for Data Science
(Wickham and Grolemund 2017).
This is a data set that lives in a package, based on world economic, health, and other data collected by the Gapminder Foundation. Gapminder is a non-profit Swedish organization aimed at educating consumers with evidence-based facts about the world.
The Gapminder website has many, many data files available for public use, and lots of cool visualizations to play around with. However, we will be most interested in the following:
Variable | Class | Description |
---|---|---|
country | factor | Country |
continent | factor | Continent of Country |
year | integer | Year data was sampled |
lifeExp | numeric | Life expectancy at birth in years |
pop | integer | Population |
gdpPercap | numeric | GDP per capita (adjusted) |
The broom package
This package from David Robinson provides model output from a variety of model functions in a tidy format. We will be using broom
as we develop a pipeline and scale up our functions.
# You can check citations in R if you can't remember!
citation("broom")
##
## To cite package 'broom' in publications use:
##
## David Robinson and Alex Hayes (2018). broom: Convert Statistical
## Analysis Objects into Tidy Tibbles. R package version 0.5.0.
## https://CRAN.R-project.org/package=broom
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {broom: Convert Statistical Analysis Objects into Tidy Tibbles},
## author = {David Robinson and Alex Hayes},
## year = {2018},
## note = {R package version 0.5.0},
## url = {https://CRAN.R-project.org/package=broom},
## }
Because this data frame has many types of data (numeric, time, groups), we can begin to ask and look at very cool things (“PLOT YOUR DATA!”).
Question: Is GDP increasing across all countries?
Start by plotting the data:
gapminder %>% ggplot(., aes(year, gdpPercap, group = country)) + geom_line(alpha = 0.3,
size = 2) + theme_bw(base_size = 15)
Oh, my! Something is happening with a few countries. I wonder which continent the most extreme trends are on:
gapminder %>% ggplot(., aes(year, gdpPercap, group = country, colour = continent)) +
geom_line(alpha = 0.3, size = 2) + theme_bw(base_size = 15)
Let’s pull out one country and see if we can find the abnormal culprit.
gapminder %>% filter(gdpPercap >= 90000) %>% distinct(country) # We only need to see it once
## # A tibble: 1 x 1
## country
## <fct>
## 1 Kuwait
Well, well, well…ok, let’s plot just this country and begin to fit just one model.
Let’s compare Kuwait to another country (Canada):
It looks like a linear trend fits Canada's GDP over the years quite well!
OK, so how do we fit that linear model to just one country (e.g., Kuwait or Canada)? Well, first we need to subset or filter our data by country.
kw <- filter(gapminder, country == "Kuwait")
head(kw)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1952 55.6 160000 108382.
## 2 Kuwait Asia 1957 58.0 212846 113523.
## 3 Kuwait Asia 1962 60.5 358266 95458.
## 4 Kuwait Asia 1967 64.6 575003 80895.
## 5 Kuwait Asia 1972 67.7 841934 109348.
## 6 Kuwait Asia 1977 69.3 1140357 59265.
Now we can apply our linear model to just Kuwait’s data.
fit1 <- lm(gdpPercap ~ year, data = kw)
fit_summary <- summary(fit1)
print(fit1)
##
## Call:
## lm(formula = gdpPercap ~ year, data = kw)
##
## Coefficients:
## (Intercept) year
## 3200775 -1584
And we can clean this up a bit with broom
’s tidy
function.
clean_summary <- tidy(fit_summary)
print(clean_summary)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3200775. 633274. 5.05 0.000496
## 2 year -1584. 320. -4.95 0.000577
We can also “augment” our original data frame with some useful additions with broom
’s augment
function.
fit_augment <- fit1 %>% augment(.)
head(fit_augment)
## # A tibble: 6 x 9
## gdpPercap year .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 108382. 1952 108892. 10387. -509. 0.295 20161. 2.10e-4 -0.0317
## 2 113523. 1957 100972. 9072. 12551. 0.225 19594. 8.06e-2 0.745
## 3 95458. 1962 93052. 7863. 2406. 0.169 20143. 1.94e-3 0.138
## 4 80895. 1967 85132. 6818. -4237. 0.127 20105. 4.09e-3 -0.237
## 5 109348. 1972 77213. 6020. 32135. 0.0991 16708. 1.72e-1 1.77
## 6 59265. 1977 69293. 5579. -10027. 0.0851 19857. 1.40e-2 -0.548
fit_augment %>% ggplot(., aes(year, .fitted)) + geom_line(size = 2) + geom_line(aes(year,
    gdpPercap), size = 2, colour = "red")  # use column names in aes(), not fit_augment$...
This should look familiar from above. So, it looks like we successfully fitted a model to Kuwait’s data. But what if we want/need to compute a separate linear model for each country? Well, we have a few options:
1). Copy and paste the above code for each country
2). Develop some sort of for
loop or apply
script with a custom linear model function
Well, the first option seems intuitive and logical, but how many countries are there? That is how many times we would have to copy and paste the code above.
gapminder %>% select(., country) %>% distinct(., country) %>% tally()
## # A tibble: 1 x 1
## n
## <int>
## 1 142
Hmm, copy-pasting 142 times seems ridiculous and not a logical solution. How about a for
loop? I will be using broom
's glance
function, which provides, for each model, a single line of summary statistics that will be useful for comparing model fit.
countries <- gapminder %>% select(., country) %>% distinct(., country)
tidy_fits <- data.frame()
for (x in 1:nrow(countries)) {
countryx <- countries[x, ]
df <- filter(gapminder, country == countryx$country)
fitx <- lm(gdpPercap ~ year, data = df)
xtidy <- glance(fitx)
xtidy$country <- countryx$country
tidy_fits <- rbind(tidy_fits, xtidy)
}
tidy_fits %>% select(., country, everything()) %>% head(.)
## # A tibble: 6 x 12
## country r.squared adj.r.squared sigma statistic p.value df logLik
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 Afghanis… 0.00539 -0.0941 113. 0.0542 8.21e-1 2 -72.7
## 2 Albania 0.678 0.646 710. 21.1 9.98e-4 2 -94.7
## 3 Algeria 0.782 0.760 641. 35.9 1.34e-4 2 -93.5
## 4 Angola 0.129 0.0422 1141. 1.48 2.51e-1 2 -100.
## 5 Argentina 0.706 0.677 1059. 24.0 6.23e-4 2 -99.5
## 6 Australia 0.969 0.966 1432. 318. 6.61e-9 2 -103.
## # ... with 4 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
## # df.residual <int>
OK…that’s acceptable, and it can help us conceptually understand what we need to do in order to accomplish our task, but it’s clunky and unintuitive. Isn’t there anything else?
tidyverse Scale Up
Well, we do have a third option:
3). Some tidyverse
alternative which is probably easier (spoiler: it is)
You might imagine that using map
might help us both consolidate our code and make it more readable for future reference. Conceptually, we want to do the exact same as the for
loop above: extract grouped data by country, apply a linear model to each country’s data, and then add the tidied linear model summary to a tidy data frame.
In order to do this within a tidyverse
framework, we need to nest our data frame by our grouping variable. A nested data frame is a special data frame in which each row holds another data frame containing only the observations for that level of the grouping variable.
by_country <- gapminder %>% group_by(., country) %>% nest(.)
head(by_country[[1, 2]])
## # A tibble: 6 x 5
## continent year lifeExp pop gdpPercap
## <fct> <int> <dbl> <int> <dbl>
## 1 Asia 1952 28.8 8425333 779.
## 2 Asia 1957 30.3 9240934 821.
## 3 Asia 1962 32.0 10267083 853.
## 4 Asia 1967 34.0 11537966 836.
## 5 Asia 1972 36.1 13079460 740.
## 6 Asia 1977 38.4 14880372 786.
Next, we need to do some setup and create a specialized linear model function.
country_model <- function(x) {
lm(gdpPercap ~ year, data = x)
}
That’s it! Just wrapping our linear model in a nice function
! Now we are ready to apply it to each of our nested data frames.
models <- map(by_country$data, country_model)
We can now add to this nested data frame a model column that will contain all of the model output for each country.
by_country <- by_country %>% mutate(model = map(data, country_model))
head(by_country)
## # A tibble: 6 x 3
## country data model
## <fct> <list> <list>
## 1 Afghanistan <tibble [12 × 5]> <S3: lm>
## 2 Albania <tibble [12 × 5]> <S3: lm>
## 3 Algeria <tibble [12 × 5]> <S3: lm>
## 4 Angola <tibble [12 × 5]> <S3: lm>
## 5 Argentina <tibble [12 × 5]> <S3: lm>
## 6 Australia <tibble [12 × 5]> <S3: lm>
Finally, we can get our comparison table by applying glance
to each country’s model and then unnesting the data.
tidy_fits2 <- by_country %>% mutate(glance = map(model, broom::glance)) %>%
unnest(glance)
head(tidy_fits2)
## # A tibble: 6 x 14
## country data model r.squared adj.r.squared sigma statistic p.value
## <fct> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani… <tibble… <S3:… 0.00539 -0.0941 113. 0.0542 8.21e-1
## 2 Albania <tibble… <S3:… 0.678 0.646 710. 21.1 9.98e-4
## 3 Algeria <tibble… <S3:… 0.782 0.760 641. 35.9 1.34e-4
## 4 Angola <tibble… <S3:… 0.129 0.0422 1141. 1.48 2.51e-1
## 5 Argenti… <tibble… <S3:… 0.706 0.677 1059. 24.0 6.23e-4
## 6 Austral… <tibble… <S3:… 0.969 0.966 1432. 318. 6.61e-9
## # ... with 6 more variables: df <int>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>
Wickham, H., and G. Grolemund. 2017. R for Data Science. O’Reilly Media. https://books.google.com/books?id=-7RhvgAACAAJ.