This is a guide on how to carry out the basics of functional programming in R. Most of the guide follows material from Hadley Wickham’s Advanced R, but I’ve tried to provide additional value by simplifying a complex topic with relevant, real-world examples.
Let’s first create a sample dataset consisting of some basic employee data.
set.seed(999)
dat1 <- tibble(emp_id = sample(100000:200000,
size = 30),
department = c(rep("HR",10),
rep("IT",10),
rep("Sales",10)),
salary_k = round(c(rnorm(10,60,5),
rnorm(10,100,7),
rnorm(10,40,20)),2),
age = sample(20:66,30,T),
gender = sample(c("Male","Female","Non-binary"),30,T,
prob = c(0.45,0.45,0.1)),
business_unit_id = sample(c("B01","AH2"),30,T),
emp_status = sample(c("Active","Termed"),30,T,
prob = c(0.8,0.2)),
tenure_yrs = sample(1:15,30,TRUE))
head(dat1)## # A tibble: 6 × 8
## emp_id department salary_k age gender business_unit_id emp_status tenure_yrs
## <int> <chr> <dbl> <int> <chr> <chr> <chr> <int>
## 1 138211 HR 60.0 62 Female B01 Active 10
## 2 155878 HR 53.6 63 Female AH2 Active 5
## 3 107821 HR 54.4 60 Female AH2 Active 5
## 4 105305 HR 61.5 20 Female B01 Active 5
## 5 136118 HR 61.4 47 Female B01 Termed 1
## 6 151590 HR 49.8 62 Male AH2 Active 10
Functional: A function that takes in another function as its input, and then returns a vector
Functional Programming: Programming style that emphasizes the use of functions, especially those that take in other functions as inputs.
Example: We create a function that a
functionalfunc <- function(func1){
func1(dat1$age)
}
functionalfunc(mean)## [1] 41.66667
mean(dat1$age) == functionalfunc(mean)## [1] TRUE
The purrr package contains a variety of functionals with a consistent interface, and said functions are often simpler to follow than the base R equivalents.
The family of map() functions take in a function and a vector or list, and transform the input vector/list by applying the supplied function to each element of the input list/vector. The output is an object of the same length as the input
The “base” map() function is distinct from the other map_* functions discussed further below in that it always returns a list output.
We’ll look at 2 examples, one of which iterates over an input vector, and one of which iterates over an input list.
Example 1: Log transform each element of the vector of employee salaries
age <- dat1$age
unlist(map(.x = age,
.f = log))## [1] 4.127134 4.143135 4.094345 2.995732 3.850148 4.127134 4.043051 3.332205
## [9] 3.784190 3.737670 3.663562 3.555348 3.988984 3.784190 3.044522 2.995732
## [17] 4.077537 3.688879 4.043051 3.178054 3.218876 3.332205 3.135494 4.143135
## [25] 3.526361 3.367296 3.663562 3.178054 3.806662 4.127134
In the example above, we supplied the log() function and our salary vector as our inputs, and used map() to apply the log transform to each element.
We can compare our results to what we’d get with a for-loop using base R:
output <- vector("list",length = length(age))
for(i in seq_along(age)){
output[[i]] <- log(age[[i]])
}
unlist(output)==unlist(map(.x = age,
.f = log))## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
And finally, we can also compare to using lapply():
unlist(lapply(X = age, FUN = log))## [1] 4.127134 4.143135 4.094345 2.995732 3.850148 4.127134 4.043051 3.332205
## [9] 3.784190 3.737670 3.663562 3.555348 3.988984 3.784190 3.044522 2.995732
## [17] 4.077537 3.688879 4.043051 3.178054 3.218876 3.332205 3.135494 4.143135
## [25] 3.526361 3.367296 3.663562 3.178054 3.806662 4.127134
lapply(X = age, FUN = log)==unlist(map(.x = age, .f = log))## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Note: lapply() is the base R equivalent to map(), but lapply() doesn’t support the helper fns that map() does. BUT if you don’t need anything other than lapply() itself, might as well reduce dependencies and just use that!
Now let’s look at MAP-iterating over an input list.
Example 2: Calculate the average (mean) for each numeric column in our employee dataset:
dat1 %>%
select(where(is.numeric)) %>%
map(mean) %>%
unlist()## emp_id salary_k age tenure_yrs
## 141749.66667 66.69633 41.66667 7.50000
Here’s the equivalent code using 2 for-loops:
numcols <- vector("list", length = length(names(dat1)))
for(i in seq_along(names(dat1))){
numcols[[i]] <- is.numeric(dat1[[i]])
}
numcols <- unlist(numcols)
means <- vector("list", length(dat1[numcols]))
names(means) <- names(dat1[numcols])
for(i in seq_along(dat1[numcols])){
means[[i]] <- mean(dat1[numcols][[i]])
}
unlist(means)## emp_id salary_k age tenure_yrs
## 141749.66667 66.69633 41.66667 7.50000
As we can see, map() is a helluva lot more convenient!
Note: map() output must be the same length as the input. When operating over a data frame input whose length is equal to N columns, the map() list output must also have length equal to N columns.
Example: Return the unique values for each column:
# Iterate over each column in input dataframe
map(.x = dat1, .f = unique) -> output
# Verify output length = input length = ncols(df)
length(output)==length(dat1)## [1] TRUE
# View output
output## $emp_id
## [1] 138211 155878 107821 105305 136118 151590 104586 122418 140606 130070
## [11] 114776 150894 128534 130977 118784 116113 169159 139110 154657 150997
## [21] 161782 132449 156989 187244 186174 195322 112459 131186 181702 140579
##
## $department
## [1] "HR" "IT" "Sales"
##
## $salary_k
## [1] 60.04 53.59 54.44 61.50 61.38 49.75 60.07 62.91 59.83 59.42
## [11] 95.49 112.21 102.56 99.53 101.98 103.97 91.05 103.05 96.04 93.54
## [21] 63.30 60.84 62.08 39.63 17.06 11.84 34.35 31.64 59.93 37.87
##
## $age
## [1] 62 63 60 20 47 57 28 44 42 39 35 54 21 59 40 24 25 23 34 29 45
##
## $gender
## [1] "Female" "Male" "Non-binary"
##
## $business_unit_id
## [1] "B01" "AH2"
##
## $emp_status
## [1] "Active" "Termed"
##
## $tenure_yrs
## [1] 10 5 1 2 12 8 6 11 14 15 4 9 7
# Iterate over each output element and calculate their lengths
map(.x = output, .f = length)## $emp_id
## [1] 30
##
## $department
## [1] 3
##
## $salary_k
## [1] 30
##
## $age
## [1] 21
##
## $gender
## [1] 3
##
## $business_unit_id
## [1] 2
##
## $emp_status
## [1] 2
##
## $tenure_yrs
## [1] 13
In the examples above, we used unlist() to convert the list-type output of map() into a simpler vector format. As a shortcut, the following map_* extensions produce atomic vector outputs:
map_lgl(): Return a logical value for each column re: whether or not it’s type numeric:
dat1 %>%
map_lgl(is.numeric)## emp_id department salary_k age
## TRUE FALSE TRUE TRUE
## gender business_unit_id emp_status tenure_yrs
## FALSE FALSE FALSE TRUE
map_chr: Return the data type of each column:
dat1 %>%
map_chr(typeof)## emp_id department salary_k age
## "integer" "character" "double" "integer"
## gender business_unit_id emp_status tenure_yrs
## "character" "character" "character" "integer"
map_int(): Return an integer count of the number of missing values in each column:
missing <- function(x){
sum(is.na(x))
}
dat1 %>% map_int(missing)## emp_id department salary_k age
## 0 0 0 0
## gender business_unit_id emp_status tenure_yrs
## 0 0 0 0
map_dbl(): Return the proportion of above average values for each numeric column:
aboveavg <- function(x){
sum(x > mean(x))/length(x)
}
dat1 %>%
select(where(is.numeric)) %>%
map_dbl(aboveavg)## emp_id salary_k age tenure_yrs
## 0.4000000 0.3333333 0.5000000 0.4666667
Both sapply() and vapply() take in a function and and an input object, iterate the fn over each input element, and return an atomic output vector.
Example: Return the proportion of above average values for each numeric column:
aboveavg <- function(x){
sum(x > mean(x))/length(x)
}
sapply(X = select_if(dat1, is.numeric),
FUN = aboveavg)## emp_id salary_k age tenure_yrs
## 0.4000000 0.3333333 0.5000000 0.4666667
vapply(X = select_if(dat1, is.numeric),
FUN = aboveavg, FUN.VALUE = 1)## emp_id salary_k age tenure_yrs
## 0.4000000 0.3333333 0.5000000 0.4666667
Instead of using map() with a pre-defined/explicit function, we can use in-line anonymous functions.
Example: Return the proportion of above average values for each numeric column:
aboveavg <- function(x){
sum(x > mean(x))/length(x)
}
dat1 %>%
select(where(is.numeric)) %>%
map_dbl(.f = function(x) sum(x > mean(x))/length(x))## emp_id salary_k age tenure_yrs
## 0.4000000 0.3333333 0.5000000 0.4666667
One downside to anonymous functions is that the syntax can be verbose.
dat1 %>%
select(where(is.numeric)) %>%
map_dbl(.f = ~sum(.x > mean(.x))/length(.x))## emp_id salary_k age tenure_yrs
## 0.4000000 0.3333333 0.5000000 0.4666667
In the examples above, we used tidyverse syntax to first select columns that met a certain condition, and then passed those columns to our map() call.
We could instead split this into a multi-map-step process:
Example: Compute the standard deviation for all numeric columns:
map_lgl(.x = dat1,
.f = is.numeric) -> numcols
map_dbl(.x = dat1[numcols],
.f = sd)## emp_id salary_k age tenure_yrs
## 25058.493207 27.346545 15.234564 4.614445
This same process can be used for any combination of iterative function calls:
Example: Return the number of distinct values for each column containing “id” in the column name:
map_lgl(.x = names(dat1),
.f = ~grepl(x = .x,
pattern = "id")) -> idcols
map_dbl(.x = dat1[idcols],
.f = ~length(unique(x = .x)))## emp_id business_unit_id
## 30 2
We often wish to pass along additional arguments to the function being called by map().
...We’ll look at 2 versions of the same example, one with the (…) shortcut, and one without.
Example A: Use the describe() function from the psych package to provide descriptive statistics for each of the first 3 numeric variables and supply the optional argument (skew = TRUE):
dat1 %>%
select_if(is.numeric) %>%
select(1:3) %>%
map(.f = ~describe(.x, skew = TRUE))## $emp_id
## vars n mean sd median trimmed mad min max range skew
## X1 1 30 141749.7 25058.49 138660.5 140251.6 24803.9 104586 195322 90736 0.45
## kurtosis se
## X1 -0.71 4575.03
##
## $salary_k
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 30 66.7 27.35 61.11 67.55 33.15 11.84 112.21 100.37 -0.03 -0.99
## se
## X1 4.99
##
## $age
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 30 41.67 15.23 41 41.71 23.72 20 63 43 0.06 -1.53 2.78
Now let’s look at the (…) shortcut map() provides.
Example B (Shortcut): Use the describe() function from the psych package to provide descriptive statistics for each of the first 3 numeric variables and supply the optional argument (skew = TRUE):
dat1 %>%
select_if(is.numeric) %>%
select(1:3) %>%
map(describe, skew = TRUE)## $emp_id
## vars n mean sd median trimmed mad min max range skew
## X1 1 30 141749.7 25058.49 138660.5 140251.6 24803.9 104586 195322 90736 0.45
## kurtosis se
## X1 -0.71 4575.03
##
## $salary_k
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 30 66.7 27.35 61.11 67.55 33.15 11.84 112.21 100.37 -0.03 -0.99
## se
## X1 4.99
##
## $age
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 30 41.67 15.23 41 41.71 23.72 20 63 43 0.06 -1.53 2.78
Hadley Wickham’s Advanced R provides a great graphic for
understanding how the ... shortcut works, shown below:
Note: Supplying the additional arguments via the (…) map shortcut isn’t quite equivalent to placing them inside the function call
In the first example we looked at above, the 2 methods still produced equivalent results, but this is not the case for dynamic contexts.
Let’s take a look at a different example:
Example 2: We notice that our age variable is only measured as an integer, and we want to add a stochastic noise term to each element (a.k.a. a jitter) to better account for continuous variability
Jitter Function: Let’s first create a fitter fn that adds a noise/error term
jitter <- function(x,e) x + eNow let’s look at 2 different methods of iterating this function:
Example 2A: Apply a jitter function to add a noise term to each age value:
map_dbl(.x = dat1$age,
.f = ~jitter(x = .x,
e = rnorm(1)))## [1] 63.47997 64.41430 59.42057 20.06353 47.03211 61.03753 56.50550 26.82393
## [9] 45.03278 43.43092 39.87222 34.33823 53.87425 43.38543 20.69894 20.21987
## [17] 59.19968 40.01864 56.83562 22.66630 25.42263 26.80327 23.19167 62.30717
## [25] 34.46624 29.81711 39.68331 23.52101 45.43313 60.16190
As we can see, a different jitter amount was applied to each element of the input age vector (which is what we want!)
Example 2B (MAP (…) Shortcut): Apply a jitter function to add a noise term to each age value:
map_dbl(.x = dat1$age,
.f = jitter,
rnorm(1))## [1] 61.92906 62.92906 59.92906 19.92906 46.92906 61.92906 56.92906 27.92906
## [9] 43.92906 41.92906 38.92906 34.92906 53.92906 43.92906 20.92906 19.92906
## [17] 58.92906 39.92906 56.92906 23.92906 24.92906 27.92906 22.92906 62.92906
## [25] 33.92906 28.92906 38.92906 23.92906 44.92906 61.92906
Because we used the map (…) shortcut to pass the secondary error term argument (instead of placing it inside the jitter function call), the same jitter amount was applied to each element.
Bottom Line: In certain applications/contexts, passing additional arguments via (…) may not be ideal
Let’s suppose we think we have some outliers w/respect to our employee salaries, and we want to calculate an avg salary using a trimmed mean.
trims = c(0,0.1,0.2,0.3)In this example, we aren’t iterating over (X), but rather the (trim = …) argument supplied to mean()!
Example A (For Loop): Calculate a trimmed avg employee salary using a series of different trim arguments
trims <- c(0, 0.1, 0.2, 0.5)
output <- vector("list", length = length(trims))
for(i in seq_along(trims)){
output[[i]] <- mean(dat1$salary_k,trim = trims[[i]])
}
unlist(output)## [1] 66.69633 67.54667 66.95556 61.11000
Now let’s see what happens when we use map() in the standard fashion (in which .x is the iterative arg)
Example B (Bad MAP Use): Calculate a trimmed avg employee salary using a series of different trim arguments
map_dbl(.x = dat1$salary_k,
.f = mean,
trims)## Error in mean.default(.x[[i]], ...): 'trim' must be numeric of length one
There are 2 main problems with this approach in this context:
Instead, let’s correctly specify our trims as the iterative elements:
Example C (Good MAP Use): Calculate a trimmed avg employee salary using a series of different trim arguments
map_dbl(.x = trims,
.f = ~mean(x = dat1$salary_k,
trim = .x))## [1] 66.69633 67.54667 66.95556 61.11000
We can verify that these results match the correct results from our for-loop:
unlist(output)==map_dbl(.x = trims,
.f = ~mean(x = dat1$salary_k,
trim = .x))## [1] TRUE TRUE TRUE TRUE
Many real-world data science problems can be streamlined by using multiple purrr functions in combination.
Example: Let’s suppose we want to model the relationship between salary and tenure by department.
We can start by splitting our dataset into 3 subsets, one for each department:
split(x = dat1,
f = dat1$department) -> deplist
map(.x = deplist,
.f = head,
3)## $HR
## # A tibble: 3 × 8
## emp_id department salary_k age gender business_unit_id emp_status tenure_yrs
## <int> <chr> <dbl> <int> <chr> <chr> <chr> <int>
## 1 138211 HR 60.0 62 Female B01 Active 10
## 2 155878 HR 53.6 63 Female AH2 Active 5
## 3 107821 HR 54.4 60 Female AH2 Active 5
##
## $IT
## # A tibble: 3 × 8
## emp_id department salary_k age gender business_unit_id emp_status tenure_yrs
## <int> <chr> <dbl> <int> <chr> <chr> <chr> <int>
## 1 114776 IT 95.5 39 Male AH2 Active 6
## 2 150894 IT 112. 35 Male B01 Active 2
## 3 128534 IT 103. 54 Female B01 Active 11
##
## $Sales
## # A tibble: 3 × 8
## emp_id department salary_k age gender business_unit_id emp_status tenure_yrs
## <int> <chr> <dbl> <int> <chr> <chr> <chr> <int>
## 1 161782 Sales 63.3 25 Male AH2 Active 1
## 2 132449 Sales 60.8 28 Female B01 Active 15
## 3 156989 Sales 62.1 23 Non-b… B01 Active 11
Now we want to fit a linear regression model to each of these subsets:
deplist %>%
map(.f = ~lm(salary_k ~ tenure_yrs,
data = .x)) -> depmods
names(depmods) <- names(deplist)
depmods## $HR
##
## Call:
## lm(formula = salary_k ~ tenure_yrs, data = .x)
##
## Coefficients:
## (Intercept) tenure_yrs
## 60.6374 -0.3974
##
##
## $IT
##
## Call:
## lm(formula = salary_k ~ tenure_yrs, data = .x)
##
## Coefficients:
## (Intercept) tenure_yrs
## 98.4360 0.2008
##
##
## $Sales
##
## Call:
## lm(formula = salary_k ~ tenure_yrs, data = .x)
##
## Coefficients:
## (Intercept) tenure_yrs
## 44.1994 -0.2577
Great! For the final step, let’s extract the slopes so we can compare the effect of tenure upon salary across departments:
map(.x = depmods,
.f = ~.x$coefficients) ## $HR
## (Intercept) tenure_yrs
## 60.6373811 -0.3973527
##
## $IT
## (Intercept) tenure_yrs
## 98.436015 0.200798
##
## $Sales
## (Intercept) tenure_yrs
## 44.1993642 -0.2577323
Instead of carrying out each step separately, we can chain the entire process and create a dataframe of the coefficients:
dat1 %>%
split(f = dat1$department) %>%
map(.f = ~lm(salary_k ~ tenure_yrs,
data = .x)) %>%
map_df(.f = coef) %>%
as.data.frame() %>%
rename(intercept = `(Intercept)`,
slope = tenure_yrs) %>%
mutate(department = c("HR","IT","Sales"))## intercept slope department
## 1 60.63738 -0.3973527 HR
## 2 98.43601 0.2007980 IT
## 3 44.19936 -0.2577323 Sales
Thanks for reading!