## Setup

Load/install required packages the long way. We will revisit this example in the exercises at the end to redo it with our new functional programming skills.

If you cannot get this document to work, you can view it interactively at this [link](http://bit.ly/functionalprogramming).

```{r setup, warning = FALSE, message = FALSE, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
if (!require(pacman)) {
  install.packages("pacman"); library(pacman)
}
p_load(learnr, tidyverse, purrr, here)
```

```{r setup2, eval = FALSE}
if(!require("learnr")){install.packages("learnr"); library("learnr")}
if(!require("tidyverse")){install.packages("tidyverse"); library("tidyverse")}
if(!require("purrr")){install.packages("purrr"); library("purrr")}
if(!require("here")){install.packages("here"); library("here")}
#setwd(here::here('functional_programming'))
```

## Introduction to Functions

- Functions in R take input arguments ('formals') and execute code (the 'body') based on that input
- Functions are useful for creating 'shortcuts' for tasks that you perform often and that are not already implemented in R or an R package
- In order to use a function you create, you must first 'define' it, much like you would by assigning a value to an object in R
- To assign a function to an object (name), use the `function(input){body}` call:

```{r functions1}
hello_world <- function() {
  print('Hello, world!')
}
```

- Here, `hello_world` is the name of the function, and `print()` is what the function does (everything inside the `{}`)
- Notice that this function does not have any input!

*First, reproduce the function above, and run it.*

*Next, try modifying the function's input and body to allow it to print your name instead of 'world'.*

```{r functions-ex-1, exercise=TRUE, exercise.lines = 10}

```

```{r functions-ex-1-solution}
hello_world <- function(name){
  message <- paste0('Hello, ',name,'!')
  return(message)
}
hello_world('Dan')
```

## Functions (cont.)

That's neat, but useless. Let's walk through another example with some real-life statistical significance.

```{r functions2}
set.seed(1999)
z_score <- function(score, values){
  grand_mean <- mean(values, na.rm = TRUE)
  sdt <- sd(values, na.rm = TRUE)
  z <- (score - grand_mean) / sdt
  return(z)
}
values <- runif(20, min = 0, max = 20)
score <- values[1]
z_score(score, values)
```

- The last line of the function (`return(z)`) determines the function's output
- In this example, we want what we stored in the object `z` to be given back
- To return multiple values, we need to store our final output in a list or dataframe

```{r func-exc, exercise=TRUE, exercise.lines = 20}
set.seed(1999)
z_score <- function(score, values){
  grand_mean <- mean(values, na.rm = TRUE)
  sdt <- sd(values, na.rm = TRUE)
  z <- (score - grand_mean) / sdt
  return(z)
}
values <- runif(20, min = 0, max = 20)
score <- values[1]
z_score(score, values)
```

- Again, this is nice, but as it stands now we have only run `z_score()` on one *individual* score
- We can unlock the full potential of R functions by combining them with loops, conditional logic, and functionals (e.g., `lapply()`)

## Loops

- Basic R loops are similar to loops in other programming languages (e.g., Python, MATLAB, etc.)
- Tells R to evaluate something until a certain point is reached (a 'while' loop), or until the end of a vector is reached (a 'for' loop)
- In other words, loops cycle through 'decision-trees' until a specified break-point is reached (see flowchart below)

![Loop Flowchart](https://i2.wp.com/blog.datacamp.com/wp-content/uploads/2015/07/flowchart1.png?w=450?tap_a=5644-dce66f&tap_s=10907-287229)

## For Loops

- All loops iterate along some sequence
- Iterators are (generally) numbers, and are initialized by calling: `for (i in 1:10) {loop something here}`
- `i` is the iterator, and tells the `for` statement to cycle through the loop 10 times (`1:10`): `r 1:10`

```{r for-loops-1}
for (i in 1:10){
  ## Loops cycle through what is inside the {} brackets
  print(i)
}
```

*Try to use what you know to loop through numbers 1 - 10 and square each value*

Write the R code required to print each number's squared value:

```{r for-loops-1-ex, exercise=TRUE, exercise.lines = 10}

```

```{r for-loops-1-ex-solution}
for(i in 1:10){
  print(i^2)
}
```

So when might this come in handy? Consider a list of values that you want to run the same function on; for example, we might want to rename the subject column of a dataframe to extract only the numeric values:

```{r for-loops-2}
list <- c('subject_1','subject_2','subject_3','subject_4')
# Instead of manually counting how many iterations we need
# we can instead look at the `length()` of the list we
# just created
length(list)
# To extract just the numeric values, we use the
# tidyverse function `readr::parse_number()`
readr::parse_number('subject_45')
# Initialize an empty vector to fill
list_sans_text <- vector('numeric')
# Let's try combining each of these into a for loop:
for (i in 1:length(list)){
  list_sans_text <- readr::parse_number(i)
}
print(list_sans_text)
```

Hmm...that didn't work. Why?


- Iterators are (generally) numbers, and therefore correspond to a *position* in a vector or list, not the value itself

*How would we tell the loop to use the value instead of the position?*

*Try to fix the code below so that the output is a list of only subject numbers.*

```{r for-loops-2-ex, exercise=TRUE, exercise.lines = 10}
list <- c('subject_1','subject_2','subject_3','subject_4')
# Initialize an empty vector to fill
list_sans_text <- vector('numeric')
for (i in 1:length(list)){
  list_sans_text <- readr::parse_number(i)
}
print(list_sans_text)
```

```{r for-loops-2-ex-solution}
list <- c('subject_1','subject_2','subject_3','subject_4')
# Initialize an empty vector to fill (`double()` here is
# synonymous with `vector('numeric')`)
list_sans_text <- double()
for (i in 1:length(list)){
  list_sans_text[i] <- readr::parse_number(list[i])
}
print(list_sans_text)
```

- Remember, iterators in loops correspond to a position within a vector or list! If you want to use an actual value (character or numeric), you need to extract it first!

## While Loops

- While loops are generally the same as for loops, but instead of iterating along a sequence of values, they repeat for as long as a condition remains `TRUE`

```{r while-loop-example, exercise = TRUE, exercise.lines = 10}
x <- 1
while( x <= 10){
  y <- x
  print(y^2)
  x <- x+1
}
```

- Can you change the code above so that it uses the previous value as the power?

```{r while-loop-example-solution}
x <- 1
while( x <= 10){
  y <- x
  p <- x-1
  print(y^p)
  x <- x+1
}
```

## A Hefty Loop Example

- The above are simple examples.
Consider a heftier example in which you want to quickly z-score variables across multiple columns:

```{r hefty-loop}
data(starwars)
starwars <- as.data.frame(starwars)
head(starwars)
x_values <- c('height','mass')
for (i in 1:length(x_values)){
  new <- paste0(x_values[i],'_centered')
  values <- as.numeric(starwars[,x_values[i]])
  # Named col_mean/col_sd so we do not mask the mean() and sd() functions
  col_mean <- mean(values, na.rm=TRUE)
  col_sd <- sd(values, na.rm=TRUE)
  for(j in 1:nrow(starwars)) {
    # Here is our z-score formula within the loop
    starwars[j,new] <- (starwars[j,x_values[i]] - col_mean) / col_sd
  }
}
# Check to make sure we have values by printing first 6 rows
head(as_tibble(starwars))
```

Was that more complicated than it needed to be? Most certainly. That's why most people leave loops behind in favor of the `apply` family. But before we do, note two important concepts detailed in the example above:

1) You can nest loops within loops (theoretically infinitely)
2) The easy way of extracting an element from a dataframe or list (`df$column`) does not play well within loops, because `$` does not accept a column name stored in a variable; use `df[, name]` or `df[[name]]` instead

## Conditional Logic

Conditional logic in programming evaluates a statement as either `TRUE` or `FALSE` and runs code based on the result. In this way, it is much like loops.

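
As a minimal sketch of that idea (the variable names `is_large` and `count` are illustrative and not part of the tutorial's data): a comparison is an ordinary R expression that returns a logical value, and a `while` loop branches on exactly the same kind of value on every pass.

```{r conditional-sketch}
x <- 3
# A comparison returns a logical value that can be stored like any other
is_large <- x > 2
is_large
# A while loop re-checks its condition before every pass and
# stops as soon as the condition evaluates to FALSE
count <- 0
while (count < x) {
  count <- count + 1
}
count
```
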

- The basic conditional logic statements are `if()` and `else`, which are often used together
- `if()` evaluates a statement:

```{r conditional-logic1}
x <- 1
# 'If x equals 1, print to the console 'This is TRUE''
if (x == 1) {'This is TRUE'}
```

- Notice that we use `==` (which tests equality) instead of `=` (which assigns a value)
- When the statement is not `TRUE`, `else` can be used as a follow up:

```{r conditional-logic2}
x <- 2
# 'If x equals 1, print 'This is TRUE' to the console,
# if x does not equal 1, print 'This is FALSE' to the console'
if (x == 1) {'This is TRUE'} else {'This is FALSE'}
```

- We can combine these two statements in a vectorized base function, `ifelse()`:

```{r conditional-logic3}
x <- 2
ifelse(x == 1, 'This is TRUE!', 'This is FALSE!')
```

- `ifelse()` has the benefit of being vectorized, which means that it operates on each element of a vector instead of the vector as a whole

```{r conditional-logic4}
set.seed(1999)
x <- runif(10)
# This doesn't replace each value
y <- if (x < .5) {TRUE} else {FALSE}
# Instead, it evaluates only the first element of the vector 'x'
# (with a warning in R < 4.2.0; R >= 4.2.0 stops with an error instead)
x[1]; y
```

```{r conditional-logic4-1}
# ifelse() will work on each element
y2 <- ifelse(x < .5, TRUE, FALSE)
y2
```

*How might we use conditional logic with tidyverse's `mutate()` function to capitalize the values of `starwars$gender`?*

```{r for-conditional-1-ex, exercise=TRUE, exercise.lines = 10}
data(starwars)
```

```{r for-conditional-1-ex-solution}
data(starwars)
# Depending on your dplyr version, `gender` may be coded
# 'masculine'/'feminine' instead; adjust the values compared below if so
starwars <- starwars %>%
  mutate(gender2 = ifelse(gender == 'male', 'Male',
                   ifelse(gender == 'female', 'Female', gender)))
```

## Apply Family (with a Tidyverse flavor)

- The apply family--as the name suggests--applies a function across certain elements
- Think of this family as simplified and optimized loops
- The apply family consists largely of:
    1) `lapply()` - the simplest apply; it takes a function and applies it to each element of a list
    2) `sapply()` and `vapply()` - which return simplified vectors instead of lists
    3) `purrr::map()` - which can handle
multiple inputs

- Let's z-score multiple columns at once (here using the base function `scale()` in place of our `z_score()` function)

```{r apply1}
data(starwars)
head(starwars)
```

```{r apply1-1}
myList <- c('height','mass')
# Returns list
myValues <- lapply(starwars[myList], scale)
head(myValues$height); head(myValues$mass)
# Returns a simplified result (here a matrix with one column per variable)
myValues2 <- sapply(starwars[,c('height','mass')], scale)
head(myValues2)
```

- `purrr::map()` is largely the same as `lapply()`, but its typed variants (e.g., `map_dbl()`) return a consistent output type, and it can be scaled to multiple input values using `purrr::map2()` and `purrr::pmap()`

```{r map1}
set.seed(1999)
# tibble() replaces the deprecated data_frame()
df <- tibble(a = runif(20), b = runif(20), c = runif(20), d = runif(20))
# With the default range (0 to 1), runif() values should have a mean near 0.5
# map_dbl() returns 'double' (a numeric value with decimal points) not a list
df %>% map_dbl(., function(x) mean(x))
# In this example, it is functionally equivalent to:
df %>% sapply(., mean)
```

**See what happens if you use `map()` instead of `map_dbl()`**

```{r map1-ex, exercise=TRUE, exercise.lines = 10}
df <- tibble(a = runif(20), b = runif(20), c = runif(20), d = runif(20))
df %>% map_dbl(., function(x) mean(x))
```

- `purrr::map()`'s utility becomes more apparent when we consider a fairly common issue in data science: split-apply-combine, or map-reduce

```{r map2}
data(starwars)
# Omitting incomplete cases so it plays nicely with `lm()`
starwars <- as.data.frame(starwars) %>% na.omit(.)
# Model the relationship between height and mass for each gender
models <- starwars %>%
  split(., .$gender) %>%               # split
  map(~lm(height~mass, data = .)) %>%  # apply
  map(summary)                         # combine
models
```

- Note the use of `~` in the example above; this is shorthand for an *anonymous function*, which can be written out in long form as `function(x) lm(height~mass, data = x)`

## Intermediate Exercises

**1)** Write a function that will automatically and "tidily" give the mean, SD, and median of `starwars$height` in a dataframe.

```{r intermediate-ex-1, exercise=TRUE, exercise.lines = 10}
data(starwars)
starwars <- as.data.frame(starwars)
summary_stats <- function(){

}
summary_stats(starwars$height)
```

```{r intermediate-ex-1-solution}
data(starwars)
starwars <- as.data.frame(starwars)
## Here's one possible approach...
summary_stats <- function(x){
  mean <- mean(x, na.rm = TRUE)
  sd <- sqrt(var(x, na.rm = TRUE))
  median <- median(x, na.rm = TRUE)
  data.frame(mean=mean,sd=sd,median=median)
}
summary_stats(starwars$height)
```

**2)** Notice how we had to type the same thing four times to check and load four different packages at the beginning of this tutorial. This violates a fundamental rule of programming (never copy and paste the same thing). Try to check and load each required package using `lapply()`. Remember, `lapply()` takes a list (or vector) as input, and returns a list as output. *Hint* for `lapply()` to work in this case, you must also pass the argument `character.only=TRUE` through to `require()`.

```{r intermediate-ex-2, exercise=TRUE, exercise.lines = 10}
paks <- c('learnr','tidyverse','purrr','here')
```

```{r intermediate-ex-2-solution}
paks <- c('learnr','tidyverse','purrr','here')
# require() returns FALSE for any package that could not be loaded
loaded <- unlist(lapply(paks, require, character.only=TRUE))
if (any(!loaded)) {
  install.packages(paks[!loaded])
  lapply(paks[!loaded], library, character.only=TRUE)
}
```

**3)** Use conditional logic to add a column to the `starwars` dataset splitting it by median `starwars$height`.

```{r intermediate-ex-3, exercise=TRUE, exercise.lines = 10}
data(starwars)
```

```{r intermediate-ex-3-solution}
data(starwars)
medianHeight <- median(starwars$height, na.rm = TRUE)
starwars$median_split <- ifelse(starwars$height >= medianHeight, 'Above','Below')
```

**4)** Return to the `starwars` data set. Use a functional (e.g., `lapply()` or `purrr::map()`) to iterate across every column (not just one) and find out how many `NA`s there are in each. Then, can you replace each `NA` value with `999`? *Hint* use the base function `replace()` inside your functional.

```{r intermediate-ex-4, exercise=TRUE, exercise.lines = 10}
data(starwars)
```

```{r intermediate-ex-4-solution}
data(starwars)
# How many NAs do we have in our data?
starwarsNAs <- starwars %>% purrr::map(., function(x) sum(is.na(x)))
# Let's replace these NAs
starwars_NA_replace <- starwars %>% purrr::map(function(x) replace(x, is.na(x), 999))
## Hmm...creates a list, but we want a dataframe!
starwars_NA_replace <- as_tibble(starwars_NA_replace)
```
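
As a closing sketch, the same NA count and replacement can also be written in base R with `vapply()`, which the apply-family list mentioned but did not demonstrate. The toy data frame `toy` and the helper `count_na()` are hypothetical names introduced only for illustration; the same calls should also work on `starwars`, though that is an assumption rather than something shown here.

```{r vapply-sketch}
# Hypothetical toy data standing in for `starwars`
toy <- data.frame(height = c(172, NA, 96),
                  mass   = c(77, NA, NA),
                  name   = c('Luke', 'C-3PO', NA),
                  stringsAsFactors = FALSE)

# Hypothetical helper: count the missing values in one column
count_na <- function(x) sum(is.na(x))

# vapply() checks that every result is a single integer and
# returns a named vector rather than a list
na_counts <- vapply(toy, count_na, integer(1))
na_counts

# replace() swaps each NA for 999; assigning into toy_filled[]
# keeps the data frame structure instead of producing a plain list
toy_filled <- toy
toy_filled[] <- lapply(toy, function(x) replace(x, is.na(x), 999))
toy_filled
```

Declaring the expected result type (`integer(1)`) makes `vapply()` fail loudly if a column ever returns something unexpected, which is the same guarantee `purrr::map_int()` offers on the tidyverse side.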