Setup

Load/install required packages the long way. We will revisit this example in the exercises at the end to redo it with our new functional programming skills. If you cannot get this document to work, you can view it interactively at this link.

if(!require("learnr")){install.packages("learnr"); library("learnr")}

if(!require("tidyverse")){install.packages("tidyverse"); library("tidyverse")}

if(!require("purrr")){install.packages("purrr"); library("purrr")}

if(!require("here")){install.packages("here"); library("here")}

#setwd(here::here('functional_programming'))

Introduction to Functions

  • Functions in R take some input argument (a ‘formal’) and execute some code based off of the input (‘body’)

  • Functions are useful for creating ‘shortcuts’ that you use often and are not already implemented in R or an R package

  • In order to use a function you create, you must first ‘define’ it, much like you would by assigning a value to an object in R

  • To assign a function to an object (name), use the function(input){body} call:

hello_world <- function() { print('Hello, world!') }
  • Here, hello_world is the name of the function, and print() is what the function does (everything inside the {})

  • Notice that this function does not have any input!

First, reproduce the function above, and run it. Next, try modifying the function’s input and body to allow it to print your name instead of ‘world’.

hello_world <- function(name){
  message <- paste0('Hello, ',name,'!')
  return(message)
}

hello_world('Dan')

Functions (cont.)

That’s neat, but useless. Let’s walk through another example with some real life statistical significance.

set.seed(1999)

z_score <- function(score, values){
  grand_mean <- mean(values, na.rm = TRUE)
  sdt <- sd(values, na.rm = TRUE)
  z <- (score - grand_mean) / sdt
  return(z)
}

values <- runif(20, min = 0, max = 20)
score <- values[1]

z_score(score, values)
## [1] 0.7294446
  • The last line of the function (return(z)) is the function output

  • In this example, we want what we stored in the object z to be given back

  • For multiple returns, we need to store our final output in a list or dataframe

set.seed(1999)

z_score <- function(score, values){
  grand_mean <- mean(values, na.rm = TRUE)
  sdt <- sd(values, na.rm = TRUE)
  z <- (score - grand_mean) / sdt
  return(z)
}

values <- runif(20, min = 0, max = 20)
score <- values[1]

z_score(score, values)
  • Again, this is nice, but as it stands now you would have to run z_score() on each individual score to get the z score value

  • We can unlock the full potential of R functions by combining it with loops, conditional logic, and other functionals (e.g., lapply())

Loops

  • Basic R loops are similar to other programming languages (e.g., python, MatLab, etc.)

  • Tells R to evaluate something until a certain point is reached (a ‘while’ loop), or until the end of a vector is reached (a ‘for’ loop)

  • In other words, loops cycle through ‘decision-trees’ until a specified break-point is reached (see flowchart below)

Loop Flowchart

Loop Flowchart

For Loops

  • All loops iterate along some sequence

  • Iterators are (generally) numbers, and are initialized by calling: for (i in 1:10) {loop something here}

  • i is the iterator, and tells the for statement to cycle through the loop 10 times (1:10):

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

for (i in 1:10){
  ## Loops cycle through what is inside the {} brackets
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Try to use what you know to loop through numbers 1 - 10 and square each value

Write the R code required to print each number’s squared value:

for(i in 1:10){
  print(i^2)
}

So when might this come in handy? Consider a list of values that you want to run the same function on; for example, we might want to rename the subject column of a dataframe to extract only the numeric values:

list <- c('subject_1','subject_2','subject_3','subject_4')

# Instead of manually counting how many iterations we need
# we can instead look at the `length()` of the list we
# just created
length(list)
## [1] 4
# To extract just the numeric values, we use the
# tidyverse function `readr::parse_number()`
readr::parse_number('subject_45')
## [1] 45
# Initialize an empty vector to fill
list_sans_text <- vector('numeric')

# Let's try combining each of these into a for loop:
for (i in 1:length(list)){
  list_sans_text <- readr::parse_number(i)
}

print(list_sans_text)
## [1] 4

Hmm…that didn’t work. Why?

  • Iterators are (generally) numbers, and therefore correspond to a position in a vector or list, not the value itself

How would we tell the loop to use the value instead of the position? Try to fix the code below so that the output is a list of only subject numbers.

list <- c('subject_1','subject_2','subject_3','subject_4')

# Initialize an empty vector to fill
list_sans_text <- vector('numeric')

for (i in 1:length(list)){
  list_sans_text <- readr::parse_number(i)
}

print(list_sans_text)
list <- c('subject_1','subject_2','subject_3','subject_4')

# Initialize an empty vector to fill (`double()` here is
# synonomous with `vector('numeric')`)
list_sans_text <- double()

for (i in 1:length(list)){
  list_sans_text[i] <- readr::parse_number(list[i])
}

print(list_sans_text)
  • Remember, iterators in loops correspond to a position within! If you want to use an actual value (character or numeric) you need to extract it first!

While Loops

  • While loops are generally the same as for loops, but instead of iterating along a sequence of values, they repeat until a condition is met
x <- 1

while( x <= 10){
  y <- x
  print(y^2)
  x <- x+1
}
  • Can you change the code above so that it uses the previous value as the power?
x <- 1

while( x <= 10){
  y <- x
  p <- x-1
  print(y^p)
  x <- x+1
}

A Hefty Loop Example

  • The above are simple examples. Consider a more hefty example whereby you want to run quickly z score variables across multiple columns:
data(starwars)
starwars <- as.data.frame(starwars)
head(starwars)
x_values <- c('height','mass')

for (i in 1:length(x_values)){
  
  new <- paste0(x_values[i],'_centered')
  values <- as.numeric(starwars[,x_values[i]])
  mean <- mean(values, na.rm=TRUE)
  std <- sd(values, na.rm=TRUE)
  
  for(j in 1:nrow(starwars)) {
    # Here is our z-score function within the loop
    starwars[j,new] <- (starwars[j,x_values[i]] - mean) / std
    }
}

# Check to make sure we have values by printing first 6 rows
head(as_tibble(starwars))

Was that more complicated then it needed to be? Most certainly. That’s why most people leave loops behind in favor of the apply family. But before we do, note two important concepts detailed in the example above:

  1. You can nest loops within loops (theoretically infinitely)

  2. The easy way of extracting an element from a dataframe or list (df$column) does not play well within loops

Conditional Logic

Conditional logic in programming evaluates a statement as either TRUE or FALSE and performs code based on the statement. In this way, it is much like loops.

  • The basic conditional logic functions are if() and else(), which are often used together

  • if() evaluates a statement:

x <- 1

# 'If x equals 1, print to the console 'This is TRUE''
if (x == 1) {'This is TRUE'}
## [1] "This is TRUE"
  • Notice that we use == instead of =

  • When the statement is not TRUE, else() can be used as a follow up:

x <- 2

# 'If x equals 1, print 'This is TRUE' to the console,
# if x does not equal 1, print 'This is FALSE' to the console'
if (x == 1) {'This is TRUE'} else {'This is FALSE'}
## [1] "This is FALSE"
  • We can combine these two statements in a vectorized base function, ifelse():
x <- 2

ifelse(x == 1, 'This is TRUE!', 'This is FALSE!')
## [1] "This is FALSE!"
  • ifelse() has the benefit of being vectorized, which means that it operates on a case-by-case level instead of the vector as a whole
set.seed(1999)
x = runif(10)

# This doesn't replace each value
y <- if (x < .5) {TRUE} else {FALSE}

# Instead, it evaluates the first element of the vector 'x'
x[1]; y
## [1] 0.7681209
## [1] FALSE
# ifelse() will work on each element

y2 <- ifelse(x < .5, TRUE, FALSE)

y2
##  [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

How might we use conditional logic with tidyverse’s mutate() function to change starwars$gender to capital letters?

data(starwars)
data(starwars)

starwars <- starwars %>% 
  mutate(., gender2 = ifelse(.$gender == 'male', 'Male',
                             ifelse('female','Female',.$gender)))

Apply Family (with a Tidyverse flavor)

  • The apply family–as the name suggests–applies a function across certain elements

  • Think of this family as a simplified and optimized loops

  • The apply family consists largely of:

  1. lapply() - the simplest apply; it takes a function and applies it to each element of a list

  2. sapply() and vapply - which returns simplified vectors instead of lists

  3. purrr::map() - which can handle multiple inputs

  • Let’s apply the z_score() function to multiple cases
data(starwars)
head(starwars)
myList <- c('height','mass')

# Returns list
myValues <- lapply(starwars[myList], scale)
head(myValues$height); head(myValues$mass)
##             [,1]
## [1,] -0.06781696
## [2,] -0.21161731
## [3,] -2.25358235
## [4,]  0.79498517
## [5,] -0.70053852
## [6,]  0.10474347
##            [,1]
## [1,] -0.1198643
## [2,] -0.1316667
## [3,] -0.3854181
## [4,]  0.2283063
## [5,] -0.2850978
## [6,]  0.1338871
# Returns vector
myValues2 <- sapply(starwars[,c('height','mass')], scale)
head(myValues2)
##           height       mass
## [1,] -0.06781696 -0.1198643
## [2,] -0.21161731 -0.1316667
## [3,] -2.25358235 -0.3854181
## [4,]  0.79498517  0.2283063
## [5,] -0.70053852 -0.2850978
## [6,]  0.10474347  0.1338871
  • purrr::map() is largely the same as lapply(), but returns consistent values, and has the ability to be scaled to multiple input values using purrr::map2() and purrr::pmap()
set.seed(1999)
df <- data_frame(a = runif(20), b = runif(20), c = runif(20), d = runif(20))

# The default for runif() should be a mean of 0.5
# map_dbl() returns 'double' (a numeric value with decimal points) not a list
df %>% map_dbl(., function(x) mean(x))
##         a         b         c         d 
## 0.5616511 0.5792588 0.5180648 0.5688397
# In this example, it is functionally equivelent to:
df %>% sapply(., mean)
##         a         b         c         d 
## 0.5616511 0.5792588 0.5180648 0.5688397

See what happens if you use map() instead of map_dbl()

df <- data_frame(a = runif(20), b = runif(20), c = runif(20), d = runif(20))

df %>% map_dbl(., function(x) mean(x))
  • purrr::map()’s utility becomes more apparent when we consider a fairly common issue in data science: split-apply-combine, or map-reduce
data(starwars)
# Omiting incomplete cases so it plays nicely with `lm()`
starwars <- as.data.frame(starwars) %>% na.omit(.)

# Model the relationship between height and mass for each gender
models <- starwars %>% 
  split(., .$gender) %>% # split
  map(~lm(height~mass, data = .)) %>% # apply
  map(summary) # combine

models
## $female
## 
## Call:
## lm(formula = height ~ mass, data = .)
## 
## Residuals:
##        5        7       44       61       62       87 
## -14.6919  -3.8925  12.3387   4.1448   1.1465   0.9543 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 156.7755    23.6398   6.632  0.00268 **
## mass          0.1616     0.4231   0.382  0.72197   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.03 on 4 degrees of freedom
## Multiple R-squared:  0.03518,    Adjusted R-squared:  -0.206 
## F-statistic: 0.1458 on 1 and 4 DF,  p-value: 0.722
## 
## 
## $male
## 
## Call:
## lm(formula = height ~ mass, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.816  -3.963   1.588   6.588  29.099 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 113.6923    14.0774   8.076 7.07e-08 ***
## mass          0.8062     0.1630   4.947 6.80e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.66 on 21 degrees of freedom
## Multiple R-squared:  0.5381, Adjusted R-squared:  0.5162 
## F-statistic: 24.47 on 1 and 21 DF,  p-value: 6.8e-05
  • Note the use of ~ in the example above; this is short hand for an annoynmous function which can be written out in long form as function(x) lm(height~mass, data = x)

Intermediate Exercises

1) Write a function that will automatically and “tidily” give the mean, SD, and median of starwars$height in a dataframe.

data(starwars)
starwars <- as.data.frame(starwars)

summary_stats <- function(){
  
}

summary_stats(starwars$height)
data(starwars)
starwars <- as.data.frame(starwars)

## Here's one possible approach...

summary_stats <- function(x){
  name <- as.character(x)
  mean <- mean(x, na.rm = TRUE)
  sd <- sqrt(var(x, na.rm = TRUE))
  median <- median(x, na.rm = TRUE)
  
  data.frame(mean=mean,sd=sd,median=median)
}

summary_stats(starwars$height)

2) Notice how we had to type the same thing four times to check and load four different packages at the beginning of this tutorial. This violates a fundamental rule of programming (never copy and paste the same thing). Try to check and load each required package using lapply(). Remember, lapply() takes a list as input, and returns a list as output. Hint for lapply() to work in this case, you must also pass the input argument character.only=TRUE.

paks <- c('learnr','tidyverse','purrr','here')
paks <- c('learnr','tidyverse','purrr','here')

if (any(lapply(paks, require, character.only=TRUE) == FALSE)) {
  lapply(paks, install.packages(paks))
}

3) Use conditional logic to add a column to the starwars dataset splitting it by median starwars$height.

data(starwars)
data(starwars)

medianHeight <- median(starwars$height, na.rm = TRUE)

starwars$median_split <- ifelse(starwars$height >= medianHeight, 'Above','Below')

4) Return to the starwars data set. Use a functional (apply() or purrr::map()) to iterate across every column (not just one) and find out how many NAs there are in each. Then, can you replace each NA value with 999? Hint use the base function replace() inside your functional.

data(starwars)
data(starwars)

# How many NAs do we have in our data?
starwarsNAs <- starwars %>%
  purrr::map(., function(x) sum(is.na(x)))

# Let's replace these NAs
starwars_NA_replace <- starwars %>%
  purrr::map(function(x) replace(x, is.na(x), 999))

## Hmm...creates a list, but we want a dataframe!
starwars_NA_replace <- as_tibble(starwars_NA_replace)