Functional Programming in R

Intro

This is a guide on how to carry out the basics of functional programming in R. Most of the guide follows material from Hadley Wickham’s Advanced R, but I’ve tried to provide additional value by simplifying a complex topic with relevant, real-world examples.

Sample Data

Let’s first create a sample dataset consisting of some basic employee data.

set.seed(999)

dat1 <- tibble(emp_id = sample(100000:200000,
                               size = 30),
               department = c(rep("HR",10),
                              rep("IT",10),
                              rep("Sales",10)),
               salary_k = round(c(rnorm(10,60,5),
                                  rnorm(10,100,7),
                                  rnorm(10,40,20)),2),
               age = sample(20:66,30,T),
               gender = sample(c("Male","Female","Non-binary"),30,T,
                               prob = c(0.45,0.45,0.1)),
               business_unit_id = sample(c("B01","AH2"),30,T),
               emp_status = sample(c("Active","Termed"),30,T,
                                   prob = c(0.8,0.2)),
               tenure_yrs = sample(1:15,30,TRUE))

head(dat1)
## # A tibble: 6 × 8
##   emp_id department salary_k   age gender business_unit_id emp_status tenure_yrs
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>           <int>
## 1 138211 HR             60.0    62 Female B01              Active             10
## 2 155878 HR             53.6    63 Female AH2              Active              5
## 3 107821 HR             54.4    60 Female AH2              Active              5
## 4 105305 HR             61.5    20 Female B01              Active              5
## 5 136118 HR             61.4    47 Female B01              Termed              1
## 6 151590 HR             49.8    62 Male   AH2              Active             10

Functionals

Functional: A function that takes in another function as its input, and then returns a vector

Functional Programming: Programming style that emphasizes the use of functions, especially those that take in other functions as inputs.

Example: We create a function that a

functionalfunc <- function(func1){
  func1(dat1$age)
}

functionalfunc(mean)
## [1] 41.66667
mean(dat1$age) == functionalfunc(mean)
## [1] TRUE

PURRR Package & MAP() Functions

The purrr package contains a variety of functionals with a consistent interface, and said functions are often simpler to follow than the base R equivalents.

The family of map() functions take in a function and a vector or list, and transform the input vector/list by applying the supplied function to each element of the input list/vector. The output is an object of the same length as the input

MAP()

The “base” map() function is distinct from the other map_* functions discussed further below in that it always returns a list output.

We’ll look at 2 examples, one of which iterates over an input vector, and one of which iterates over an input list.

Example 1: Log transform each element of the vector of employee salaries

age <- dat1$age

unlist(map(.x = age,
           .f = log))
##  [1] 4.127134 4.143135 4.094345 2.995732 3.850148 4.127134 4.043051 3.332205
##  [9] 3.784190 3.737670 3.663562 3.555348 3.988984 3.784190 3.044522 2.995732
## [17] 4.077537 3.688879 4.043051 3.178054 3.218876 3.332205 3.135494 4.143135
## [25] 3.526361 3.367296 3.663562 3.178054 3.806662 4.127134

In the example above, we supplied the log() function and our salary vector as our inputs, and used map() to apply the log transform to each element.


For-Loop Comparison

We can compare our results to what we’d get with a for-loop using base R:

output <- vector("list",length = length(age))

for(i in seq_along(age)){
  output[[i]] <- log(age[[i]])
}

unlist(output)==unlist(map(.x = age,
                           .f = log))
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

*.apply Comparison

And finally, we can also compare to using lapply():

unlist(lapply(X = age, FUN = log))
##  [1] 4.127134 4.143135 4.094345 2.995732 3.850148 4.127134 4.043051 3.332205
##  [9] 3.784190 3.737670 3.663562 3.555348 3.988984 3.784190 3.044522 2.995732
## [17] 4.077537 3.688879 4.043051 3.178054 3.218876 3.332205 3.135494 4.143135
## [25] 3.526361 3.367296 3.663562 3.178054 3.806662 4.127134
lapply(X = age, FUN = log)==unlist(map(.x = age, .f = log))
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Note: lapply() is the base R equivalent to map(), but lapply() doesn’t support the helper fns that map() does. BUT if you don’t need anything other than lapply() itself, might as well reduce dependencies and just use that!


MAP Revisited

Now let’s look at MAP-iterating over an input list.

Example 2: Calculate the average (mean) for each numeric column in our employee dataset:

dat1 %>% 
  select(where(is.numeric)) %>% 
  map(mean) %>% 
  unlist()
##       emp_id     salary_k          age   tenure_yrs 
## 141749.66667     66.69633     41.66667      7.50000

For-Loop Comparison

Here’s the equivalent code using 2 for-loops:

numcols <- vector("list", length = length(names(dat1)))

for(i in seq_along(names(dat1))){
  numcols[[i]] <- is.numeric(dat1[[i]])
}

numcols <- unlist(numcols)

means <- vector("list", length(dat1[numcols]))
names(means) <- names(dat1[numcols])

for(i in seq_along(dat1[numcols])){
  means[[i]] <- mean(dat1[numcols][[i]])
}

unlist(means)
##       emp_id     salary_k          age   tenure_yrs 
## 141749.66667     66.69633     41.66667      7.50000

As we can see, map() is a helluva lot more convenient!


MAP & Data Frame Iteration

Note: map() output must be the same length as the input. When operating over a data frame input whose length is equal to N columns, the map() list output must also have length equal to N columns.


Example: Return the unique values for each column:

# Iterate over each column in input dataframe
map(.x = dat1, .f = unique) -> output

# Verify output length = input length = ncols(df)
length(output)==length(dat1)
## [1] TRUE
# View output
output
## $emp_id
##  [1] 138211 155878 107821 105305 136118 151590 104586 122418 140606 130070
## [11] 114776 150894 128534 130977 118784 116113 169159 139110 154657 150997
## [21] 161782 132449 156989 187244 186174 195322 112459 131186 181702 140579
## 
## $department
## [1] "HR"    "IT"    "Sales"
## 
## $salary_k
##  [1]  60.04  53.59  54.44  61.50  61.38  49.75  60.07  62.91  59.83  59.42
## [11]  95.49 112.21 102.56  99.53 101.98 103.97  91.05 103.05  96.04  93.54
## [21]  63.30  60.84  62.08  39.63  17.06  11.84  34.35  31.64  59.93  37.87
## 
## $age
##  [1] 62 63 60 20 47 57 28 44 42 39 35 54 21 59 40 24 25 23 34 29 45
## 
## $gender
## [1] "Female"     "Male"       "Non-binary"
## 
## $business_unit_id
## [1] "B01" "AH2"
## 
## $emp_status
## [1] "Active" "Termed"
## 
## $tenure_yrs
##  [1] 10  5  1  2 12  8  6 11 14 15  4  9  7
# Iterate over each output element and calculate their lengths
map(.x = output, .f = length)
## $emp_id
## [1] 30
## 
## $department
## [1] 3
## 
## $salary_k
## [1] 30
## 
## $age
## [1] 21
## 
## $gender
## [1] 3
## 
## $business_unit_id
## [1] 2
## 
## $emp_status
## [1] 2
## 
## $tenure_yrs
## [1] 13

MAP and Atomic Vector Outputs

In the examples above, we used unlist() to convert the list-type output of map() into a simpler vector format. As a shortcut, the following map_* extensions produce atomic vector outputs:

map_lgl(): Return a logical value for each column re: whether or not it’s type numeric:

dat1 %>% 
  map_lgl(is.numeric)
##           emp_id       department         salary_k              age 
##             TRUE            FALSE             TRUE             TRUE 
##           gender business_unit_id       emp_status       tenure_yrs 
##            FALSE            FALSE            FALSE             TRUE

map_chr: Return the data type of each column:

dat1 %>% 
  map_chr(typeof)
##           emp_id       department         salary_k              age 
##        "integer"      "character"         "double"        "integer" 
##           gender business_unit_id       emp_status       tenure_yrs 
##      "character"      "character"      "character"        "integer"

map_int(): Return an integer count of the number of missing values in each column:

missing <- function(x){
  sum(is.na(x))
}

dat1 %>% map_int(missing)
##           emp_id       department         salary_k              age 
##                0                0                0                0 
##           gender business_unit_id       emp_status       tenure_yrs 
##                0                0                0                0

map_dbl(): Return the proportion of above average values for each numeric column:

aboveavg <- function(x){
  sum(x > mean(x))/length(x)
}

dat1 %>% 
  select(where(is.numeric)) %>% 
  map_dbl(aboveavg)
##     emp_id   salary_k        age tenure_yrs 
##  0.4000000  0.3333333  0.5000000  0.4666667

Base R & Atomic Vector Outputs

Both sapply() and vapply() take in a function and and an input object, iterate the fn over each input element, and return an atomic output vector.

Example: Return the proportion of above average values for each numeric column:

aboveavg <- function(x){
  sum(x > mean(x))/length(x)
}

sapply(X = select_if(dat1, is.numeric),
       FUN = aboveavg)
##     emp_id   salary_k        age tenure_yrs 
##  0.4000000  0.3333333  0.5000000  0.4666667
vapply(X = select_if(dat1, is.numeric),
       FUN = aboveavg, FUN.VALUE = 1)
##     emp_id   salary_k        age tenure_yrs 
##  0.4000000  0.3333333  0.5000000  0.4666667

MAP & Anonymous Functions & Shortcuts

Instead of using map() with a pre-defined/explicit function, we can use in-line anonymous functions.

Example: Return the proportion of above average values for each numeric column:

aboveavg <- function(x){
  sum(x > mean(x))/length(x)
}

dat1 %>% 
  select(where(is.numeric)) %>% 
  map_dbl(.f = function(x) sum(x > mean(x))/length(x))
##     emp_id   salary_k        age tenure_yrs 
##  0.4000000  0.3333333  0.5000000  0.4666667

One downside to anonymous functions is that the syntax can be verbose.

dat1 %>% 
  select(where(is.numeric)) %>% 
  map_dbl(.f = ~sum(.x > mean(.x))/length(.x))
##     emp_id   salary_k        age tenure_yrs 
##  0.4000000  0.3333333  0.5000000  0.4666667

Multi-Step MAPs

In the examples above, we used tidyverse syntax to first select columns that met a certain condition, and then passed those columns to our map() call.

We could instead split this into a multi-map-step process:

Example: Compute the standard deviation for all numeric columns:

map_lgl(.x = dat1,
        .f = is.numeric) -> numcols

map_dbl(.x = dat1[numcols],
        .f = sd)
##       emp_id     salary_k          age   tenure_yrs 
## 25058.493207    27.346545    15.234564     4.614445

This same process can be used for any combination of iterative function calls:

Example: Return the number of distinct values for each column containing “id” in the column name:

map_lgl(.x = names(dat1),
        .f = ~grepl(x = .x,
                   pattern = "id")) -> idcols

map_dbl(.x = dat1[idcols],
        .f = ~length(unique(x = .x)))
##           emp_id business_unit_id 
##               30                2

Passing Additional Arguments with (…)

We often wish to pass along additional arguments to the function being called by map().

We’ll look at 2 versions of the same example, one with the (…) shortcut, and one without.

Example A: Use the describe() function from the psych package to provide descriptive statistics for each of the first 3 numeric variables and supply the optional argument (skew = TRUE):

dat1 %>% 
  select_if(is.numeric) %>% 
  select(1:3) %>% 
  map(.f = ~describe(.x, skew = TRUE))
## $emp_id
##    vars  n     mean       sd   median  trimmed     mad    min    max range skew
## X1    1 30 141749.7 25058.49 138660.5 140251.6 24803.9 104586 195322 90736 0.45
##    kurtosis      se
## X1    -0.71 4575.03
## 
## $salary_k
##    vars  n mean    sd median trimmed   mad   min    max  range  skew kurtosis
## X1    1 30 66.7 27.35  61.11   67.55 33.15 11.84 112.21 100.37 -0.03    -0.99
##      se
## X1 4.99
## 
## $age
##    vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 30 41.67 15.23     41   41.71 23.72  20  63    43 0.06    -1.53 2.78

Now let’s look at the (…) shortcut map() provides.

Example B (Shortcut): Use the describe() function from the psych package to provide descriptive statistics for each of the first 3 numeric variables and supply the optional argument (skew = TRUE):

dat1 %>% 
  select_if(is.numeric) %>% 
  select(1:3) %>% 
  map(describe, skew = TRUE)
## $emp_id
##    vars  n     mean       sd   median  trimmed     mad    min    max range skew
## X1    1 30 141749.7 25058.49 138660.5 140251.6 24803.9 104586 195322 90736 0.45
##    kurtosis      se
## X1    -0.71 4575.03
## 
## $salary_k
##    vars  n mean    sd median trimmed   mad   min    max  range  skew kurtosis
## X1    1 30 66.7 27.35  61.11   67.55 33.15 11.84 112.21 100.37 -0.03    -0.99
##      se
## X1 4.99
## 
## $age
##    vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 30 41.67 15.23     41   41.71 23.72  20  63    43 0.06    -1.53 2.78

Hadley Wickham’s Advanced R provides a great graphic for understanding how the ... shortcut works, shown below:

map (…) shortcut Source: Advanced R by Wickham


Note: Supplying the additional arguments via the (…) map shortcut isn’t quite equivalent to placing them inside the function call

In the first example we looked at above, the 2 methods still produced equivalent results, but this is not the case for dynamic contexts.


Let’s take a look at a different example:

Example 2: We notice that our age variable is only measured as an integer, and we want to add a stochastic noise term to each element (a.k.a. a jitter) to better account for continuous variability

Jitter Function: Let’s first create a fitter fn that adds a noise/error term

jitter <- function(x,e) x + e

Now let’s look at 2 different methods of iterating this function:

Example 2A: Apply a jitter function to add a noise term to each age value:

map_dbl(.x = dat1$age,
        .f = ~jitter(x = .x, 
                     e = rnorm(1)))
##  [1] 63.47997 64.41430 59.42057 20.06353 47.03211 61.03753 56.50550 26.82393
##  [9] 45.03278 43.43092 39.87222 34.33823 53.87425 43.38543 20.69894 20.21987
## [17] 59.19968 40.01864 56.83562 22.66630 25.42263 26.80327 23.19167 62.30717
## [25] 34.46624 29.81711 39.68331 23.52101 45.43313 60.16190

As we can see, a different jitter amount was applied to each element of the input age vector (which is what we want!)

Example 2B (MAP (…) Shortcut): Apply a jitter function to add a noise term to each age value:

map_dbl(.x = dat1$age,
        .f = jitter,
        rnorm(1))
##  [1] 61.92906 62.92906 59.92906 19.92906 46.92906 61.92906 56.92906 27.92906
##  [9] 43.92906 41.92906 38.92906 34.92906 53.92906 43.92906 20.92906 19.92906
## [17] 58.92906 39.92906 56.92906 23.92906 24.92906 27.92906 22.92906 62.92906
## [25] 33.92906 28.92906 38.92906 23.92906 44.92906 61.92906

Because we used the map (…) shortcut to pass the secondary error term argument (instead of placing it inside the jitter function call), the same jitter amount was applied to each element.

Bottom Line: In certain applications/contexts, passing additional arguments via (…) may not be ideal


Iterating/Varying over Arguments Other than (X)

Let’s suppose we think we have some outliers w/respect to our employee salaries, and we want to calculate an avg salary using a trimmed mean.

In this example, we aren’t iterating over (X), but rather the (trim = …) argument supplied to mean()!

Example A (For Loop): Calculate a trimmed avg employee salary using a series of different trim arguments

trims <- c(0, 0.1, 0.2, 0.5)
output <- vector("list", length = length(trims))

for(i in seq_along(trims)){
  output[[i]] <- mean(dat1$salary_k,trim = trims[[i]])
}

unlist(output)
## [1] 66.69633 67.54667 66.95556 61.11000

Now let’s see what happens when we use map() in the standard fashion (in which .x is the iterative arg)

Example B (Bad MAP Use): Calculate a trimmed avg employee salary using a series of different trim arguments

map_dbl(.x = dat1$salary_k,
        .f = mean,
        trims)
## Error in mean.default(.x[[i]], ...): 'trim' must be numeric of length one

There are 2 main problems with this approach in this context:

Instead, let’s correctly specify our trims as the iterative elements:

Example C (Good MAP Use): Calculate a trimmed avg employee salary using a series of different trim arguments

map_dbl(.x = trims,
        .f = ~mean(x = dat1$salary_k,
                   trim = .x))
## [1] 66.69633 67.54667 66.95556 61.11000

We can verify that these results match the correct results from our for-loop:

unlist(output)==map_dbl(.x = trims,
                        .f = ~mean(x = dat1$salary_k,
                                   trim = .x))
## [1] TRUE TRUE TRUE TRUE

PURRR-Style

Many real-world data science problems can be streamlined by using multiple purrr functions in combination.

Multi-Purrr Example

Example: Let’s suppose we want to model the relationship between salary and tenure by department.

We can start by splitting our dataset into 3 subsets, one for each department:

split(x = dat1,
      f = dat1$department) -> deplist

map(.x = deplist,
    .f = head,
    3)
## $HR
## # A tibble: 3 × 8
##   emp_id department salary_k   age gender business_unit_id emp_status tenure_yrs
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>           <int>
## 1 138211 HR             60.0    62 Female B01              Active             10
## 2 155878 HR             53.6    63 Female AH2              Active              5
## 3 107821 HR             54.4    60 Female AH2              Active              5
## 
## $IT
## # A tibble: 3 × 8
##   emp_id department salary_k   age gender business_unit_id emp_status tenure_yrs
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>           <int>
## 1 114776 IT             95.5    39 Male   AH2              Active              6
## 2 150894 IT            112.     35 Male   B01              Active              2
## 3 128534 IT            103.     54 Female B01              Active             11
## 
## $Sales
## # A tibble: 3 × 8
##   emp_id department salary_k   age gender business_unit_id emp_status tenure_yrs
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>           <int>
## 1 161782 Sales          63.3    25 Male   AH2              Active              1
## 2 132449 Sales          60.8    28 Female B01              Active             15
## 3 156989 Sales          62.1    23 Non-b… B01              Active             11

Now we want to fit a linear regression model to each of these subsets:

deplist %>% 
  map(.f = ~lm(salary_k ~ tenure_yrs,
               data = .x)) -> depmods

names(depmods) <- names(deplist)

depmods
## $HR
## 
## Call:
## lm(formula = salary_k ~ tenure_yrs, data = .x)
## 
## Coefficients:
## (Intercept)   tenure_yrs  
##     60.6374      -0.3974  
## 
## 
## $IT
## 
## Call:
## lm(formula = salary_k ~ tenure_yrs, data = .x)
## 
## Coefficients:
## (Intercept)   tenure_yrs  
##     98.4360       0.2008  
## 
## 
## $Sales
## 
## Call:
## lm(formula = salary_k ~ tenure_yrs, data = .x)
## 
## Coefficients:
## (Intercept)   tenure_yrs  
##     44.1994      -0.2577

Great! For the final step, let’s extract the slopes so we can compare the effect of tenure upon salary across departments:

map(.x = depmods,
    .f = ~.x$coefficients) 
## $HR
## (Intercept)  tenure_yrs 
##  60.6373811  -0.3973527 
## 
## $IT
## (Intercept)  tenure_yrs 
##   98.436015    0.200798 
## 
## $Sales
## (Intercept)  tenure_yrs 
##  44.1993642  -0.2577323

Chaining Purrr Steps

Instead of carrying out each step separately, we can chain the entire process and create a dataframe of the coefficients:

dat1 %>% 
  split(f = dat1$department) %>% 
  map(.f = ~lm(salary_k ~ tenure_yrs,
               data = .x)) %>% 
  map_df(.f = coef) %>% 
  as.data.frame() %>% 
  rename(intercept = `(Intercept)`,
         slope = tenure_yrs) %>% 
  mutate(department = c("HR","IT","Sales"))
##   intercept      slope department
## 1  60.63738 -0.3973527         HR
## 2  98.43601  0.2007980         IT
## 3  44.19936 -0.2577323      Sales

Thanks for reading!