9_2_my_first_functional

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

9.2.1 purrr::map()

purrr::map() takes a vector and a function, calls the function once for each element of the vector, and returns the results as a list. In other words, map(1:3, f) is equivalent to list(f(1), f(2), f(3)).

squaring <- function(x) x^2

map(.x = 1:3, 
    .f = squaring)
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

The name map refers to the mathematical idea of a mapping: .f supplies the rule that maps each element of .x to a result.

The idea is as simple as:

simple_map <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

The real purrr::map() function has a few differences:

  1. It is written in C, so it is faster.
  2. It supports a few shortcuts (see 9.2.2).
  • Note

    map() is basically equivalent to lapply() with a few more helpers.
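
As a quick check on the note above, the sketch below compares simple_map() with base lapply() on the squaring example. It uses only base R; simple_map() and squaring() are repeated from earlier so that the block runs on its own.

```r
# Definitions repeated from above so the block is self-contained
squaring <- function(x) x^2

simple_map <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

from_loop   <- simple_map(1:3, squaring)
from_lapply <- lapply(1:3, squaring)

# Both are unnamed lists holding 1, 4 and 9
identical(from_loop, from_lapply)
```

One difference this input does not show: simple_map() builds its output with vector("list", ...), so it drops any names on x, whereas lapply() keeps them.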

Producing atomic vectors

map() always returns a list.

To return an atomic vector instead, use one of the typed variants:

# map_chr() always returns a character vector
map_chr(mtcars, typeof)
     mpg      cyl     disp       hp     drat       wt     qsec       vs 
"double" "double" "double" "double" "double" "double" "double" "double" 
      am     gear     carb 
"double" "double" "double" 
# map_lgl() always returns a logical vector
map_lgl(iris, is.double)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
        TRUE         TRUE         TRUE         TRUE        FALSE 
# map_int() always returns an integer vector
n_unique <- function(x) length(unique(x))
map_int(mtcars, n_unique)
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
  25    3   27   22   22   29   30    2    2    3    6 
# map_dbl() always returns a double vector!
map_dbl(mtcars, mean)
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

Notice that mtcars is a data frame, and data frames are lists containing vectors of the same length, so the map functions iterate over its columns.

Property:

All map-variant functions return an output vector of the same length as the input.

Thus each call of .f needs to return a single value.

pair <- function(x) c(x, x)
map_dbl(1:3, pair)
#> ! Result must be length 1, not 2.

Similarly, the return type must be correct.

map_dbl(1:2, as.character)
#> Error: Can't coerce element 1 from a character to a double

map_dbl() fails because it cannot coerce each result to a length-1 double.

In either case, it’s often useful to switch back to map(), because map() can accept any type of output. That allows you to see the problematic output, and figure out what to do with it.
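
A minimal sketch of that debugging step, reusing pair() from above (purrr is assumed to be installed):

```r
library(purrr)

pair <- function(x) c(x, x)

# map_dbl(1:3, pair) fails, so fall back to map() to look at the raw results
raw <- map(1:3, pair)
str(raw)

# lengths() confirms that every result has length 2 rather than 1
lengths(raw)
```

Once the shape of the output is visible, it is easier to decide whether to fix .f or to keep the list.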

Equivalents in base R

sapply() and vapply() can also return atomic vectors.

  1. sapply():

    Avoid it, because it tries to simplify the result and may return a matrix, a list, or a vector depending on the input.

  2. vapply():

    It works but is less succinct. For example, the equivalent of map_dbl(x, mean, na.rm = TRUE) is vapply(x, mean, na.rm = TRUE, FUN.VALUE = double(1)).
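
The difference can be seen in a small base R sketch. The list x and the object names below are made up for illustration.

```r
x <- list(a = 1:3, b = 4:6)

# sapply() chooses the output shape from whatever the results look like
as_vector <- sapply(x, mean)        # named double vector
as_matrix <- sapply(x, range)       # 2 x 2 matrix, one column per element
as_list   <- sapply(list(), mean)   # empty list, not an empty double vector

# vapply() fixes the shape up front through FUN.VALUE
fixed       <- vapply(x, mean, FUN.VALUE = double(1))
fixed_empty <- vapply(list(), mean, FUN.VALUE = double(1))  # length-0 double

str(list(as_vector, as_matrix, as_list, fixed, fixed_empty))
```

The empty-input case is the one that tends to bite inside functions: code written for a double vector suddenly receives a list.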

9.2.2 Anonymous functions and shortcuts

Anonymous Function

Instead of using map() with an existing function, we can create an inline anonymous function:

map_dbl(mtcars, function(x) (3 * sum(x))^2 ) 
         mpg          cyl         disp           hp         drat           wt 
  3719883.69    352836.00 490591490.49 198302724.00    119211.37     95392.03 
        qsec           vs           am         gear         carb 
  2936013.71      1764.00      1521.00    125316.00     72900.00 

Shortcut

Use the twiddle ~ to write a formula in place of function(x).

Inside the formula, .x stands for the argument.

After the twiddle we can call a named function or write any expression.

In either case, .x refers to the current element of the list that .f is being called on.

map_dbl(mtcars, ~ (3 * sum(.x))^2 )
         mpg          cyl         disp           hp         drat           wt 
  3719883.69    352836.00 490591490.49 198302724.00    119211.37     95392.03 
        qsec           vs           am         gear         carb 
  2936013.71      1764.00      1521.00    125316.00     72900.00 

purrr functions translate formulas, created by ~ (pronounced “twiddle”), into functions.

We can see this translation with as_mapper():

as_mapper(~ (3 * sum(.x))^2 )
<lambda>
function (..., .x = ..1, .y = ..2, . = ..1) 
(3 * sum(.x))^2
attr(,"class")
[1] "rlang_lambda_function" "function"             

The generated function supports .x and .y for two-argument functions, and ..1, ..2, ..3, etc. for functions with an arbitrary number of arguments.
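
To make the placeholders concrete, here is a small sketch that calls the generated functions directly. The names add_two and add_three are illustrative only.

```r
library(purrr)

# .x and .y are the first and second arguments
add_two <- as_mapper(~ .x + .y)
add_two(1, 2)

# ..1, ..2, ..3 address arguments by position
add_three <- as_mapper(~ ..1 + ..2 + ..3)
add_three(1, 2, 3)
```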

The shortcut is also useful for generating random data.

x <- map(1:3, ~ runif(2))
str(x)
List of 3
 $ : num [1:2] 0.0879 0.813
 $ : num [1:2] 0.798 0.351
 $ : num [1:2] 0.575 0.587

Indexing

Powered by purrr::pluck(), the map() family can also be used for indexing. .f can be:

  • a character vector, to select elements by name

  • an integer vector, to select by position

  • a list, to select by both name and position (useful for nested lists!)

x <- list(
  list(-1, x = 1, y = c(2), z = "a"),
  list(-2, x = 4, y = c(5, 6), z = "b"),
  list(-3, x = 8, y = c(9, 10, 11))
)

# Select by name
# for each element in x, index by name "x"
map_dbl(x, "x")
[1] 1 4 8
# Or by position
# for each element in x, index the first element
map_dbl(x, 1)
[1] -1 -2 -3
# Or by both
# for each element in x, index by name "y", and then index the first element.
map_dbl(x, list("y", 1))
[1] 2 5 9

Don’t confuse:

Notice: indexing a component that does not exist returns NULL by default (see ?pluck). So although map(x, "z") works, map_chr(x, "z") does not, because NULL cannot be coerced into a character string, unless we provide a default.

# You'll get an error if a component doesn't exist:
map_chr(x, "z")
#> Error: Result 3 must be a single string, not NULL of length 0
# Unless you supply a .default value
map_chr(x, "z", .default = NA)
[1] "a" "b" NA 
# or simply use map()
map(x, "z")
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
NULL
  • In base R, lapply() accepts the function either as a symbol or as a string, e.g. lapply(1:3, squaring) is equivalent to lapply(1:3, "squaring").

9.2.3 Passing additional arguments

In other words, passing arguments with ...

Additional arguments

For example, we can supply na.rm = T to mean().

Method 1: pass it inside an anonymous function

x <- list(1:5, c(NA, NA, 2, 10))

# We can do it this way.
# quick review: map() returns a list
map(x, ~mean(.x, na.rm = T))
[[1]]
[1] 3

[[2]]
[1] 6
# quick review: map_dbl() returns an atomic vec
map_dbl(x, ~mean(.x, na.rm = T))
[1] 3 6

Method 2 (simpler): pass it directly after f

map(x, mean, na.rm = T)
[[1]]
[1] 3

[[2]]
[1] 6
map_dbl(x, mean, na.rm = T)
[1] 3 6

Don’t confuse:

Don’t combine the twiddle with additional arguments meant for the function called inside it: the latter are silently ignored.

Arguments placed after an anonymous function are passed to the anonymous function itself, not to the named function inside it. Because ~mean(.x) never refers to na.rm, the argument has no effect:

map(x, ~mean(.x), na.rm = T)
[[1]]
[1] 3

[[2]]
[1] NA

However, we can pass additional arguments to an anonymous function this way, as long as the anonymous function itself uses them: as named parameters, or via .y or ..1, ..2, ..3, … in a formula.

map(1:3, function(x, y){x^2 + y}, y = 10000)
[[1]]
[1] 10001

[[2]]
[1] 10004

[[3]]
[1] 10009

is equivalent to

map(1:3, ~.x^2 + .y, .y = 10000)
[[1]]
[1] 10001

[[2]]
[1] 10004

[[3]]
[1] 10009

is equivalent to

map(1:3, .f =~ ..1^2 + ..2, ..2 = 10000)
[[1]]
[1] 10001

[[2]]
[1] 10004

[[3]]
[1] 10009

Properties:

  • Any arguments after f in the map() call are inserted after the individual element in each f() call.

  • map() is only vectorised over its first argument. If an argument after f is a vector, it will be passed along as is.
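
The second property can be seen in a short sketch. The helper add() is hypothetical and only there to show what each call receives.

```r
library(purrr)

# Hypothetical helper used only for this illustration
add <- function(x, y) x + y

# y is not iterated over: every call receives the whole vector c(10, 100)
res <- map(1:2, add, y = c(10, 100))
str(res)
```

Each element of the result therefore has length two, one value per element of y.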

Difference between method 1 and method 2

Method 1: the extra argument is evaluated for every call of f.

Method 2: the extra argument is evaluated only once, when map() is called.

my_func <- function(a, b) a + b
x <- rep(0, 5)

# evaluated for every f call
map_dbl(x, ~my_func(.x, runif(n = 1)) )
[1] 0.2499515 0.7997119 0.2330579 0.9677542 0.7710998
# evaluated only once!
map_dbl(x, my_func, runif(n = 1))
[1] 0.2193924 0.2193924 0.2193924 0.2193924 0.2193924

9.2.4 Argument names

Tip: always pass arguments by name

This makes the code easier to read. Otherwise the reader needs to remember the argument order of the function.

Why map() uses .x and .f

map() uses the unusual names .x and .f to avoid clashing with functions that themselves have arguments called x or f.

For example, recall our simple_map(), which uses f as the argument name for the function.

simple_map <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

Now suppose the function we want to apply also has an argument named f.

map_dbl() still works! :D

Its function argument is called .f, so f = 10 cannot be matched to it and is instead passed through ... to my_func.

my_func <- function(a, f){a + f}

map_dbl(1:3, my_func, f = 10)
[1] 11 12 13

But simple_map() would seize f = 10 for its own f argument, wrongly treating it as the function to iterate over 1:3.

simple_map(1:3, my_func, f = 10)
#> Error in f(x[[i]], ...) : could not find function "f"

Note that simple_map(1:3, my_func, f = 10) is equivalent to simple_map(x = 1:3, f = 10, my_func).
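
One way to see this matching without triggering the error is to ask R how it would match the call. The sketch below uses a hypothetical helper, show_matching(), that has the same signature as simple_map() but only reports the result of match.call().

```r
# Hypothetical helper with simple_map()'s signature; it only reports the matching
show_matching <- function(x, f, ...) {
  cl <- match.call(expand.dots = FALSE)
  as.list(cl)[-1]
}

my_func <- function(a, f) a + f

# The named argument f = 10 is matched first; my_func is left over for ...
matched <- show_matching(1:3, my_func, f = 10)
str(matched)
```

Named arguments are matched before positional ones, which is why my_func ends up in ... instead of f.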

The error becomes even harder to understand when the function we provide itself expects f to be a function.

# f is supposed to be a function here!
bootstrap_summary <- function(x, f) {
  f(sample(x, replace = TRUE))
}


# f is seized by simple_map()  ):
simple_map(mtcars, bootstrap_summary, f = mean)
#> Error in mean.default(x[[i]], ...): 'trim' must be numeric of length one

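This is essentially calling simple_map(x = mtcars, f = mean, bootstrap_summary), so each column is summarised with mean(x[[i]], bootstrap_summary), where bootstrap_summary is matched by position to trim, resulting in the error.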

Takeaways

  • The .x and .f names avoid conflicts when the function passed as .f itself has arguments named x or f.

  • If the function also has arguments named .x or .f, wrap it in an anonymous function instead.

  • Name arguments explicitly when passing them.
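
The anonymous-function workaround from the takeaways can be sketched with simple_map() itself, where the clash is on f rather than .f. The definitions are repeated so that the block runs on its own.

```r
simple_map <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

my_func <- function(a, f) a + f

# The wrapper has no argument called f, so nothing can be mis-matched:
# f = 10 is fixed inside the wrapper instead of travelling through ...
res <- simple_map(1:3, function(a) my_func(a, f = 10))
unlist(res)
```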

Compare with apply() family

Base functions that pass along ... use a variety of naming conventions to prevent undesired argument matching:

  • The apply family mostly uses capital letters (e.g. X and FUN).

What is ...

... is essentially the set of extra arguments that the current function collects and passes on to another function.

e.g.

map(.x, .f, ..., .progress = FALSE)

apply(X, MARGIN, FUN, ..., simplify = TRUE)
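
A minimal base R sketch of how ... forwards arguments. The wrapper forward_mean() is hypothetical and exists only for this illustration.

```r
# Hypothetical wrapper: anything not matched to x is collected in ...
# and forwarded unchanged to mean()
forward_mean <- function(x, ...) {
  mean(x, ...)
}

values <- c(1, 2, NA, 9)

forward_mean(values)                 # NA, because nothing was forwarded
forward_mean(values, na.rm = TRUE)   # na.rm travels through ... to mean()
```

map() and apply() do the same thing with the arguments that follow .f or FUN.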

9.2.6 Exercises

Q1

  1. Use as_mapper() to explore how purrr generates anonymous functions for the integer, character, and list helpers. What helper allows you to extract attributes? Read the documentation to find out.
as_mapper(2)
function (x, ...) 
pluck_raw(x, list(2), .default = NULL)
<environment: 0x10c0934b0>
as_mapper("cool")
function (x, ...) 
pluck_raw(x, list("cool"), .default = NULL)
<environment: 0x10c1064b0>
as_mapper(list(1, "nice"))
function (x, ...) 
pluck_raw(x, list(1, "nice"), .default = NULL)
<environment: 0x10c16b108>
?as_mapper
as_mapper(runif)
function (n, min = 0, max = 1) 
.Call(C_runif, n, min, max)
<bytecode: 0x10c169958>
<environment: namespace:stats>
as_mapper(mean)
function (x, ...) 
UseMethod("mean")
<bytecode: 0x10e2e74f8>
<environment: namespace:base>
as_mapper(head)
function (x, ...) 
UseMethod("head")
<bytecode: 0x108a2f870>
<environment: namespace:utils>
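
The outputs above do not answer the last part of the question. As far as the documentation in ?as_mapper goes, the helper for attributes is purrr::attr_getter(). A small sketch follows; the attribute name unit and the objects x and y are made up for illustration.

```r
library(purrr)

# Two vectors carrying a made-up "unit" attribute
x <- structure(1:3, unit = "cm")
y <- structure(4:6, unit = "kg")

# attr_getter() returns a function that extracts the named attribute
get_unit <- attr_getter("unit")
get_unit(x)

# Because it is an ordinary function, it can be used as .f
map_chr(list(x, y), get_unit)
```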

Q2

  1. map(1:3, ~ runif(2)) is a useful pattern for generating random numbers, but map(1:3, runif(2)) is not. Why not? Can you explain why it returns the result that it does?
map(1:3, ~runif(2))
[[1]]
[1] 0.5265459 0.4452324

[[2]]
[1] 0.83659045 0.09955172

[[3]]
[1] 0.87268165 0.05890994

This essentially makes an anonymous function using the twiddle shortcut ~. It is equivalent to

map(1:3, .f = function(x){runif(2)} )
[[1]]
[1] 0.3592562 0.4744019

[[2]]
[1] 0.831641 0.497323

[[3]]
[1] 0.9760719 0.4039886

The function body runif(2) runs once per iteration, three times in total, each time generating a pair of Uniform(0, 1) draws.

On the other hand, runif(2) on its own is not a function, nor is it a formula that the twiddle would turn into one. It is evaluated once, before map() runs, and yields a numeric vector of length two. as_mapper() turns such a vector into an extractor function, as if we had called map(1:3, list(0.25, 0.34)) with the two random numbers as successive indices. In this version of purrr (1.0.2) it seems that an index between 0 and 1 behaves like index 1; other versions may behave differently.

map(1:3, runif(2))
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

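We effectively indexed each length-one vector by one and then indexed the resulting length-one vector by one again.

If the indices are not valid, the result is NULL, the default of pluck_raw. Here runif(2, min = 2) keeps the default max = 1, so min exceeds max and both draws are NaN (hence the warning):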

map(1:3, runif(2,min = 2))
Warning in runif(2, min = 2): NAs produced
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL
as_mapper(~runif(2))
<lambda>
function (..., .x = ..1, .y = ..2, . = ..1) 
runif(2)
attr(,"class")
[1] "rlang_lambda_function" "function"             
as_mapper(runif(2))
function (x, ...) 
pluck_raw(x, list(0.245325515512377, 0.336805780418217), .default = NULL)
<environment: 0x10bf4dac0>

Q3

  1. Use the appropriate map() function to:

    1. Compute the standard deviation of every column in a numeric data frame.

    2. Compute the standard deviation of every numeric column in a mixed data frame. (Hint: you’ll need to do it in two steps.)

    3. Compute the number of levels for every factor in a data frame.

  1. Let’s use the first four columns of iris.

map(iris[,1:4], sd)
$Sepal.Length
[1] 0.8280661

$Sepal.Width
[1] 0.4358663

$Petal.Length
[1] 1.765298

$Petal.Width
[1] 0.7622377
  2. Let’s first use map_lgl() to build a logical index of the numeric columns,

    then use it to extract the numeric part of the data frame and repeat what we did in part 1.

(num_index <- map_lgl(iris, is.numeric))
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
        TRUE         TRUE         TRUE         TRUE        FALSE 
map(iris[num_index], sd)
$Sepal.Length
[1] 0.8280661

$Sepal.Width
[1] 0.4358663

$Petal.Length
[1] 1.765298

$Petal.Width
[1] 0.7622377
  3. Let’s first turn any character columns in CO2 into factors (Plant, Type and Treatment are already factors in CO2, so this is only a safeguard).
CO22 <- CO2 %>%
  mutate(across(where(is.character), as.factor))

Repeat what we did in part 2

(fac_index <- map_lgl(CO22, is.factor))
    Plant      Type Treatment      conc    uptake 
     TRUE      TRUE      TRUE     FALSE     FALSE 
map(CO22[fac_index], levels) %>% 
  map(length)
$Plant
[1] 12

$Type
[1] 2

$Treatment
[1] 2
# or, more concisely:
map(CO22[fac_index], ~length(levels(.x)))
$Plant
[1] 12

$Type
[1] 2

$Treatment
[1] 2
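
Since the question asks for the appropriate map() function, a possible variant uses the typed versions so that the results come back as atomic vectors. This is a sketch on iris only, not the book's solution.

```r
library(purrr)

# Part 2: standard deviation of every numeric column, as a double vector
num_cols <- map_lgl(iris, is.numeric)
sd_iris  <- map_dbl(iris[num_cols], sd)
sd_iris

# Part 3: number of levels of every factor column, as an integer vector
fac_cols <- map_lgl(iris, is.factor)
n_levels <- map_int(iris[fac_cols], nlevels)
n_levels
```

nlevels() is the base R shorthand for length(levels(x)).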

Q4

The following code simulates the performance of a t-test for non-normal data. Extract the p-value from each test, then visualise.

trials <- map(1:100, ~ t.test(rpois(10, 10), rpois(7, 10)))
trials_p_values <- trials %>% map_dbl("p.value")

hist(trials_p_values)

Solution from the book:

library(ggplot2)

df_trials <- tibble::tibble(p_value = map_dbl(trials, "p.value"))

df_trials %>%
  ggplot(aes(x = p_value, fill = p_value < 0.05)) +
  geom_dotplot(binwidth = .01) +  # geom_histogram() as alternative
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    legend.position = "top"
  )

Q5

The following code uses a map nested inside another map to apply a function to every element of a nested list. Why does it fail, and what do you need to do to make it work?

x <- list(
  list(1, c(3, 9)),
  list(c(3, 6), 7, c(4, 7, 6))
)

triple <- function(x) x * 3
# map(x, map, .f = triple)



#> Error in .f(.x[[i]], ...): unused argument (function (.x, .f, ...)
#> {
#> .f <- as_mapper(.f, ...)
#> .Call(map_impl, environment(), ".x", ".f", "list")
#> })

Because .f = triple is matched by name to the outer map(), the outer call uses triple as its .f and passes map as an extra argument to triple(), which has no such parameter, hence the "unused argument" error. To fix it, make map the outer .f and pass triple by position, so that it reaches the inner map() through ...:

map(.x = x, .f = map,  triple)
[[1]]
[[1]][[1]]
[1] 3

[[1]][[2]]
[1]  9 27


[[2]]
[[2]][[1]]
[1]  9 18

[[2]][[2]]
[1] 21

[[2]][[3]]
[1] 12 21 18
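
An alternative that avoids relying on argument matching altogether is to write the inner map() explicitly. The sketch below checks that both spellings give the same result.

```r
library(purrr)

x <- list(
  list(1, c(3, 9)),
  list(c(3, 6), 7, c(4, 7, 6))
)

triple <- function(x) x * 3

# Explicit inner call: each element of x is itself mapped over with triple
nested <- map(x, function(inner) map(inner, triple))

# Positional version from above
positional <- map(x, map, triple)

identical(nested, positional)
```

The explicit version is longer, but it makes clear which map() receives triple.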

Q6

Use map() to fit linear models to the mtcars dataset using the formulas stored in this list:

formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)

Let’s do so by iterating through the list of formulas!

map(formulas,  ~lm(formula = .x, data = mtcars)  )
[[1]]

Call:
lm(formula = .x, data = mtcars)

Coefficients:
(Intercept)         disp  
   29.59985     -0.04122  


[[2]]

Call:
lm(formula = .x, data = mtcars)

Coefficients:
(Intercept)    I(1/disp)  
      10.75      1557.67  


[[3]]

Call:
lm(formula = .x, data = mtcars)

Coefficients:
(Intercept)         disp           wt  
   34.96055     -0.01772     -3.35083  


[[4]]

Call:
lm(formula = .x, data = mtcars)

Coefficients:
(Intercept)    I(1/disp)           wt  
     19.024     1142.560       -1.798  

Alternatively, since the first argument of lm() is formula, we can pass lm directly:

models <- map(formulas, lm, data = mtcars)

Q7

Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below, then extract the \(R^2\) of the model fit (Hint: you can compute the \(R^2\) with summary().)

bootstrap <- function(df) {
  df[sample(nrow(df), replace = TRUE), , drop = FALSE]
}

bootstraps <- map(1:10, ~ bootstrap(mtcars))

Let’s fit the model to each bootstrapped copy of mtcars.

fits_list <- map(bootstraps, .f = lm, formula = mpg ~ disp)
summaries_list <- map(fits_list, summary)
(r_sqr_list <- map(summaries_list, "r.squared"))
[[1]]
[1] 0.725219

[[2]]
[1] 0.7385835

[[3]]
[1] 0.688559

[[4]]
[1] 0.7337701

[[5]]
[1] 0.6097428

[[6]]
[1] 0.6770883

[[7]]
[1] 0.6623904

[[8]]
[1] 0.7088759

[[9]]
[1] 0.7912166

[[10]]
[1] 0.6943253

To do it in one pipeline:

bootstraps %>% map(.f = lm, formula = mpg ~ disp) %>%
  map(summary) %>%
  map_dbl("r.squared")
 [1] 0.7252190 0.7385835 0.6885590 0.7337701 0.6097428 0.6770883 0.6623904
 [8] 0.7088759 0.7912166 0.6943253