Load/install required packages the long way. We will revisit this example in the exercises at the end to redo it with our new functional programming skills. If you cannot get this document to work, you can view it interactively at this link.
if(!require("learnr")){install.packages("learnr"); library("learnr")}
if(!require("tidyverse")){install.packages("tidyverse"); library("tidyverse")}
if(!require("purrr")){install.packages("purrr"); library("purrr")}
if(!require("here")){install.packages("here"); library("here")}
#setwd(here::here('functional_programming'))
Functions in R take some input argument (a ‘formal’) and execute some code based off of the input (‘body’)
Functions are useful for creating ‘shortcuts’ that you use often and are not already implemented in R or an R package
In order to use a function you create, you must first ‘define’ it, much like you would by assigning a value to an object in R
To assign a function to an object (name), use the function(input){body}
call:
hello_world <- function() { print('Hello, world!') }
Here, hello_world
is the name of the function, and print()
is what the function does (everything inside the {}
)
Notice that this function does not have any input!
First, reproduce the function above, and run it. Next, try modifying the function’s input and body to allow it to print your name instead of ‘world’.
hello_world <- function(name){
message <- paste0('Hello, ',name,'!')
return(message)
}
hello_world('Dan')
That’s neat, but useless. Let’s walk through another example with some real life statistical significance.
set.seed(1999)
z_score <- function(score, values){
grand_mean <- mean(values, na.rm = TRUE)
sdt <- sd(values, na.rm = TRUE)
z <- (score - grand_mean) / sdt
return(z)
}
values <- runif(20, min = 0, max = 20)
score <- values[1]
z_score(score, values)
## [1] 0.7294446
The last line of the function (return(z)
) is the function output
In this example, we want what we stored in the object z
to be given back
For multiple returns, we need to store our final output in a list or dataframe
set.seed(1999)
z_score <- function(score, values){
grand_mean <- mean(values, na.rm = TRUE)
sdt <- sd(values, na.rm = TRUE)
z <- (score - grand_mean) / sdt
return(z)
}
values <- runif(20, min = 0, max = 20)
score <- values[1]
z_score(score, values)
Again, this is nice, but as it stands now you would have to run z_score()
on each individual score to get the z score value
We can unlock the full potential of R functions by combining it with loops, conditional logic, and other functionals (e.g., lapply()
)
Basic R loops are similar to other programming languages (e.g., python, MatLab, etc.)
Tells R to evaluate something until a certain point is reached (a ‘while’ loop), or until the end of a vector is reached (a ‘for’ loop)
In other words, loops cycle through ‘decision-trees’ until a specified break-point is reached (see flowchart below)
All loops iterate along some sequence
Iterators are (generally) numbers, and are initialized by calling: for (i in 1:10) {loop something here}
i
is the iterator, and tells the for
statement to cycle through the loop 10 times (1:10
):
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
for (i in 1:10){
## Loops cycle through what is inside the {} brackets
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
Try to use what you know to loop through numbers 1 - 10 and square each value
Write the R code required to print each number’s squared value:
for(i in 1:10){
print(i^2)
}
So when might this come in handy? Consider a list of values that you want to run the same function on; for example, we might want to rename the subject column of a dataframe to extract only the numeric values:
list <- c('subject_1','subject_2','subject_3','subject_4')
# Instead of manually counting how many iterations we need
# we can instead look at the `length()` of the list we
# just created
length(list)
## [1] 4
# To extract just the numeric values, we use the
# tidyverse function `readr::parse_number()`
readr::parse_number('subject_45')
## [1] 45
# Initialize an empty vector to fill
list_sans_text <- vector('numeric')
# Let's try combining each of these into a for loop:
for (i in 1:length(list)){
list_sans_text <- readr::parse_number(i)
}
print(list_sans_text)
## [1] 4
Hmm…that didn’t work. Why?
How would we tell the loop to use the value instead of the position? Try to fix the code below so that the output is a list of only subject numbers.
list <- c('subject_1','subject_2','subject_3','subject_4')
# Initialize an empty vector to fill
list_sans_text <- vector('numeric')
for (i in 1:length(list)){
list_sans_text <- readr::parse_number(i)
}
print(list_sans_text)
list <- c('subject_1','subject_2','subject_3','subject_4')
# Initialize an empty vector to fill (`double()` here is
# synonomous with `vector('numeric')`)
list_sans_text <- double()
for (i in 1:length(list)){
list_sans_text[i] <- readr::parse_number(list[i])
}
print(list_sans_text)
loops
correspond to a position within! If you want to use an actual value (character or numeric) you need to extract it first!x <- 1
while( x <= 10){
y <- x
print(y^2)
x <- x+1
}
x <- 1
while( x <= 10){
y <- x
p <- x-1
print(y^p)
x <- x+1
}
data(starwars)
starwars <- as.data.frame(starwars)
head(starwars)
x_values <- c('height','mass')
for (i in 1:length(x_values)){
new <- paste0(x_values[i],'_centered')
values <- as.numeric(starwars[,x_values[i]])
mean <- mean(values, na.rm=TRUE)
std <- sd(values, na.rm=TRUE)
for(j in 1:nrow(starwars)) {
# Here is our z-score function within the loop
starwars[j,new] <- (starwars[j,x_values[i]] - mean) / std
}
}
# Check to make sure we have values by printing first 6 rows
head(as_tibble(starwars))
Was that more complicated then it needed to be? Most certainly. That’s why most people leave loops behind in favor of the apply
family. But before we do, note two important concepts detailed in the example above:
You can nest loops within loops (theoretically infinitely)
The easy way of extracting an element from a dataframe or list (df$column
) does not play well within loops
Conditional logic in programming evaluates a statement as either TRUE
or FALSE
and performs code based on the statement. In this way, it is much like loops.
The basic conditional logic functions are if()
and else()
, which are often used together
if()
evaluates a statement:
x <- 1
# 'If x equals 1, print to the console 'This is TRUE''
if (x == 1) {'This is TRUE'}
## [1] "This is TRUE"
Notice that we use ==
instead of =
When the statement is not TRUE
, else()
can be used as a follow up:
x <- 2
# 'If x equals 1, print 'This is TRUE' to the console,
# if x does not equal 1, print 'This is FALSE' to the console'
if (x == 1) {'This is TRUE'} else {'This is FALSE'}
## [1] "This is FALSE"
ifelse()
:x <- 2
ifelse(x == 1, 'This is TRUE!', 'This is FALSE!')
## [1] "This is FALSE!"
ifelse()
has the benefit of being vectorized, which means that it operates on a case-by-case level instead of the vector as a wholeset.seed(1999)
x = runif(10)
# This doesn't replace each value
y <- if (x < .5) {TRUE} else {FALSE}
# Instead, it evaluates the first element of the vector 'x'
x[1]; y
## [1] 0.7681209
## [1] FALSE
# ifelse() will work on each element
y2 <- ifelse(x < .5, TRUE, FALSE)
y2
## [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
How might we use conditional logic with tidyverse’s mutate()
function to change starwars$gender
to capital letters?
data(starwars)
data(starwars)
starwars <- starwars %>%
mutate(., gender2 = ifelse(.$gender == 'male', 'Male',
ifelse('female','Female',.$gender)))
The apply family–as the name suggests–applies a function across certain elements
Think of this family as a simplified and optimized loops
The apply family consists largely of:
lapply()
- the simplest apply; it takes a function and applies it to each element of a list
sapply()
and vapply
- which returns simplified vectors instead of lists
purrr::map()
- which can handle multiple inputs
z_score()
function to multiple casesdata(starwars)
head(starwars)
myList <- c('height','mass')
# Returns list
myValues <- lapply(starwars[myList], scale)
head(myValues$height); head(myValues$mass)
## [,1]
## [1,] -0.06781696
## [2,] -0.21161731
## [3,] -2.25358235
## [4,] 0.79498517
## [5,] -0.70053852
## [6,] 0.10474347
## [,1]
## [1,] -0.1198643
## [2,] -0.1316667
## [3,] -0.3854181
## [4,] 0.2283063
## [5,] -0.2850978
## [6,] 0.1338871
# Returns vector
myValues2 <- sapply(starwars[,c('height','mass')], scale)
head(myValues2)
## height mass
## [1,] -0.06781696 -0.1198643
## [2,] -0.21161731 -0.1316667
## [3,] -2.25358235 -0.3854181
## [4,] 0.79498517 0.2283063
## [5,] -0.70053852 -0.2850978
## [6,] 0.10474347 0.1338871
purrr::map()
is largely the same as lapply()
, but returns consistent values, and has the ability to be scaled to multiple input values using purrr::map2()
and purrr::pmap()
set.seed(1999)
df <- data_frame(a = runif(20), b = runif(20), c = runif(20), d = runif(20))
# The default for runif() should be a mean of 0.5
# map_dbl() returns 'double' (a numeric value with decimal points) not a list
df %>% map_dbl(., function(x) mean(x))
## a b c d
## 0.5616511 0.5792588 0.5180648 0.5688397
# In this example, it is functionally equivelent to:
df %>% sapply(., mean)
## a b c d
## 0.5616511 0.5792588 0.5180648 0.5688397
See what happens if you use map()
instead of map_dbl()
df <- data_frame(a = runif(20), b = runif(20), c = runif(20), d = runif(20))
df %>% map_dbl(., function(x) mean(x))
purrr::map()
’s utility becomes more apparent when we consider a fairly common issue in data science: split-apply-combine, or map-reducedata(starwars)
# Omiting incomplete cases so it plays nicely with `lm()`
starwars <- as.data.frame(starwars) %>% na.omit(.)
# Model the relationship between height and mass for each gender
models <- starwars %>%
split(., .$gender) %>% # split
map(~lm(height~mass, data = .)) %>% # apply
map(summary) # combine
models
## $female
##
## Call:
## lm(formula = height ~ mass, data = .)
##
## Residuals:
## 5 7 44 61 62 87
## -14.6919 -3.8925 12.3387 4.1448 1.1465 0.9543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.7755 23.6398 6.632 0.00268 **
## mass 0.1616 0.4231 0.382 0.72197
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.03 on 4 degrees of freedom
## Multiple R-squared: 0.03518, Adjusted R-squared: -0.206
## F-statistic: 0.1458 on 1 and 4 DF, p-value: 0.722
##
##
## $male
##
## Call:
## lm(formula = height ~ mass, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.816 -3.963 1.588 6.588 29.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.6923 14.0774 8.076 7.07e-08 ***
## mass 0.8062 0.1630 4.947 6.80e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.66 on 21 degrees of freedom
## Multiple R-squared: 0.5381, Adjusted R-squared: 0.5162
## F-statistic: 24.47 on 1 and 21 DF, p-value: 6.8e-05
~
in the example above; this is short hand for an annoynmous function which can be written out in long form as function(x) lm(height~mass, data = x)
1) Write a function that will automatically and “tidily” give the mean, SD, and median of starwars$height
in a dataframe.
data(starwars)
starwars <- as.data.frame(starwars)
summary_stats <- function(){
}
summary_stats(starwars$height)
data(starwars)
starwars <- as.data.frame(starwars)
## Here's one possible approach...
summary_stats <- function(x){
name <- as.character(x)
mean <- mean(x, na.rm = TRUE)
sd <- sqrt(var(x, na.rm = TRUE))
median <- median(x, na.rm = TRUE)
data.frame(mean=mean,sd=sd,median=median)
}
summary_stats(starwars$height)
2) Notice how we had to type the same thing four times to check and load four different packages at the beginning of this tutorial. This violates a fundamental rule of programming (never copy and paste the same thing). Try to check and load each required package using lapply()
. Remember, lapply()
takes a list as input, and returns a list as output. Hint for lapply()
to work in this case, you must also pass the input argument character.only=TRUE
.
paks <- c('learnr','tidyverse','purrr','here')
paks <- c('learnr','tidyverse','purrr','here')
if (any(lapply(paks, require, character.only=TRUE) == FALSE)) {
lapply(paks, install.packages(paks))
}
3) Use conditional logic to add a column to the starwars
dataset splitting it by median starwars$height
.
data(starwars)
data(starwars)
medianHeight <- median(starwars$height, na.rm = TRUE)
starwars$median_split <- ifelse(starwars$height >= medianHeight, 'Above','Below')
4) Return to the starwars
data set. Use a functional (apply()
or purrr::map()
) to iterate across every column (not just one) and find out how many NA
s there are in each. Then, can you replace each NA
value with 999
? Hint use the base function replace()
inside your functional.
data(starwars)
data(starwars)
# How many NAs do we have in our data?
starwarsNAs <- starwars %>%
purrr::map(., function(x) sum(is.na(x)))
# Let's replace these NAs
starwars_NA_replace <- starwars %>%
purrr::map(function(x) replace(x, is.na(x), 999))
## Hmm...creates a list, but we want a dataframe!
starwars_NA_replace <- as_tibble(starwars_NA_replace)