Lesson 12 | 3 June 2020

Vectorization in R

Vectorization is central to R and many of its functions, though it is not unique to R. A vectorized function works on a whole vector of values at once rather than on a single value. Instead of writing a loop that visits each element and applies a function inside the loop, you call the function once on the whole vector, which can often reduce the code to a single line.
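As a minimal sketch of the difference, the snippet below squares a vector first with an explicit loop and then with a single vectorized expression. The vector x and the result names are illustrative and not part of the lesson's data.

```r
x <- c(1, 2, 3, 4, 5)  # illustrative input

# Loop version: visit each element and store the result
squared_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squared_loop[i] <- x[i]^2
}

# Vectorized version: the operator works on the whole vector at once
squared_vec <- x^2

# Both approaches should give the same values
identical(squared_loop, squared_vec)
```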

Common Vectorized Functions in R

lapply( ) - loop over a list and evaluate a function on each element

sapply( ) - same as lapply( ) but try to simplify the result

apply( ) - apply a function over the margins of an array

tapply( ) - apply a function over subsets of a vector

mapply( ) - multivariate version of lapply

split( ) - auxiliary function used with lapply( ) and sapply( ) because it splits objects into subpieces

lapply( )

lapply( ) loops over a list and evaluates a function on each element. lapply( ) always returns a list, regardless of the class of the input.

str(lapply)

## function (X, FUN, ...)

X - the list we would like to apply some function to
FUN - the function we would like to apply to each element in the list
… - specify any other arguments to send to the function

head(lapply) # R-level wrapper; the loop itself runs in C via .Internal( )

##                                          
## 1 function (X, FUN, ...)                 
## 2 {                                      
## 3     FUN <- match.fun(FUN)              
## 4     if (!is.vector(X) || is.object(X)) 
## 5         X <- as.list(X)                
## 6     .Internal(lapply(X, FUN))

As the source shows, when X is not a plain vector (or is an object with a class, such as a data frame), lapply( ) converts it to a list with as.list( ) before handing the work to internal C code. To learn more about R’s C interface, see http://adv-r.had.co.nz/C-interface.html.
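The following sketch illustrates that lapply( ) accepts inputs that are not lists and still returns a list. The objects v and df are made up for illustration.

```r
v <- c(1, 4, 9)            # an atomic vector, not a list
roots <- lapply(v, sqrt)   # one list element per vector element
is.list(roots)             # check the class of the result

# A data frame is treated as a list of its columns
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
col_means <- lapply(df, mean)
names(col_means)           # the column names carry over to the result
```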

Let’s go through some examples. Throughout these examples, we’ll be using rnorm( ) to generate random numbers from a defined normal distribution and runif( ) to generate uniform random variables. So what arguments do rnorm( ) and runif( ) take?

str(rnorm) # sample size, then the mean and standard deviation of the distribution

## function (n, mean = 0, sd = 1)

str(runif) # sample size, and the lower and upper limits of the distribution

## function (n, min = 0, max = 1)

Example 1. Take the mean of each element in a list.

x <- list(a = 1:5, b = rnorm(10)) # list w/ 2 elements.
l <- lapply(x, mean)
l # new values assembled in a new list

## $a
## [1] 3
## 
## $b
## [1] 0.2953307

Example 2. Take the mean of each element in a list.

x <- list(a=1:4, b=rnorm(10), c=rnorm(20,1), d= rnorm(100, 5))
lapply(x, mean)

## $a
## [1] 2.5
## 
## $b
## [1] 0.6096763
## 
## $c
## [1] 0.9237171
## 
## $d
## [1] 5.146349

Example 3. Apply a function to a vector in lapply( ).

You can use lapply( ) to evaluate a function multiple times, each time with a different argument. Below is an example where I call the runif( ) function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.

x <- 1:4 
#mean(x)
lapply(x, runif)

## [[1]]
## [1] 0.9027863
## 
## [[2]]
## [1] 0.9363437 0.3141462
## 
## [[3]]
## [1] 0.6517832 0.2041103 0.4341976
## 
## [[4]]
## [1] 0.6625782 0.7502077 0.7816101 0.5590214

Example 4. Add additional FUN arguments.

x <- 1:4
lapply(x, runif, min=0, max=10)

## [[1]]
## [1] 1.604497
## 
## [[2]]
## [1] 6.821681 8.059541
## 
## [[3]]
## [1] 0.4648676 9.5136075 4.6491326
## 
## [[4]]
## [1] 0.1959291 0.1552903 1.7185262 1.5683099

So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.

Example 5. Anonymous functions.

Anonymous functions don’t have a name. They can be created inside the lapply( ) call, but they do not exist outside of it.

For example, let’s make an anonymous function for extracting the first column of each matrix.

x <- list(a = matrix(1:4, 2, 2), b = matrix( 1:6, 3, 2))

lapply(x, function(elt) elt[,1])

## $a
## [1] 1 2
## 
## $b
## [1] 1 2 3
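For comparison, here is a sketch of the same extraction written once with a named function and once with an anonymous one. The name first_col is hypothetical and chosen only for this illustration.

```r
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))

# Named version: the function is stored in the workspace and can be reused
first_col <- function(elt) elt[, 1]
named_res <- lapply(x, first_col)

# Anonymous version: the function is defined inline and not stored anywhere
anon_res <- lapply(x, function(elt) elt[, 1])

# Both calls should produce the same list
identical(named_res, anon_res)
```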

sapply( )

sapply( ) will try to simplify the result of lapply if possible.

If the result is a list where every element is length 1, then a vector is returned
If the result is a list where every element is a vector of the same length (>1), a matrix is returned.
If it can’t figure things out, a list is returned.
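The three simplification rules above can be sketched with small made-up inputs. The list contents and result names here are illustrative assumptions.

```r
x <- list(a = 1:4, b = 5:8)

# Every result has length 1, so a named vector comes back
len1 <- sapply(x, mean)

# Every result has length 2 (min and max), so a matrix comes back,
# with one column per list element
len2 <- sapply(x, range)

# Results have different lengths (1 and 4), so the output stays a list
mixed <- sapply(list(a = 1:2, b = 1:5), function(v) v[v > 1])

class(len1)
dim(len2)
class(mixed)
```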

Example 1. Take the mean of each element in a list.

x <- list(a=1:4, b=rnorm(10), c=rnorm(20,1), d=rnorm(100,5))
lapply(x,mean)

## $a
## [1] 2.5
## 
## $b
## [1] -0.4353834
## 
## $c
## [1] 0.5798263
## 
## $d
## [1] 4.942518

sapply(x,mean)

##          a          b          c          d 
##  2.5000000 -0.4353834  0.5798263  4.9425177

split( )

The benefit of combining split( ) with lapply( ) or sapply( ) is to take a data structure, split it into subsets defined by another variable, and apply a function over those subsets.

str(split)

## function (x, f, drop = FALSE, ...)

x is a vector (or list) or data frame
f is a factor (or coerced to one) or a list of factors
drop indicates whether empty factor levels should be dropped

Example 1

Let’s use the gl( ) function to “generate levels” in a factor variable. An R factor is used to store categorical data as levels. It can store both character and integer types of data.

str(gl)

## function (n, k, length = n * k, labels = seq_len(n), ordered = FALSE)

?gl

n an integer giving the number of levels
k an integer giving the number of replications.

x <- c(rnorm(10), runif(10), rnorm(10,1))
f <- gl(3,10)
split(x,f)

## $`1`
##  [1] -0.59551470  0.02627083 -0.07124994 -0.49352066  0.60030041 -0.33435199
##  [7]  1.02329748  0.24600562  0.16653438  1.88206784
## 
## $`2`
##  [1] 0.9134681 0.2727243 0.9025976 0.5090428 0.8845266 0.2132206 0.1856311
##  [8] 0.8944454 0.4285311 0.6840332
## 
## $`3`
##  [1] -0.1115023  1.4412868  0.1431598  0.9567117  1.9758419 -0.3024723
##  [7]  2.9196340  2.9653031 -0.4777817  2.6517442

lapply(split(x,f), mean)

## $`1`
## [1] 0.2449839
## 
## $`2`
## [1] 0.5888221
## 
## $`3`
## [1] 1.216193
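The same split-then-apply pattern works when the grouping variable is a column of a data frame. The sketch below uses the built-in mtcars data (mpg is fuel economy, cyl is the number of cylinders); the choice of columns is only an illustration.

```r
# One numeric vector of mpg values per distinct cyl value
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)
names(mpg_by_cyl)

# Number of cars and mean mpg within each group
group_sizes <- sapply(mpg_by_cyl, length)
group_means <- sapply(mpg_by_cyl, mean)
group_sizes
group_means
```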

apply( )

apply( ) applies a function over the margins of an array.

str(apply)

## function (X, MARGIN, FUN, ...)

X - the object we would like to apply some function to
MARGIN - specifies if the function is applied to rows or columns, 1 = row and 2 = column.
FUN - the function we would like to apply
… - specify any other arguments to send to the function
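A small sketch of what MARGIN does, using a made-up 2 x 3 matrix rather than the stock data introduced next:

```r
# Filled column by column: the rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums <- apply(m, MARGIN = 1, FUN = sum)  # one value per row
col_sums <- apply(m, MARGIN = 2, FUN = sum)  # one value per column

row_sums
col_sums
```

With MARGIN = 1 the result has as many values as the matrix has rows; with MARGIN = 2 it has as many values as the matrix has columns.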

Example 1. Calculating the mean price of each of the stocks over the 10 days.

d <- as.data.frame(matrix(
  c(185.74, 184.26, 162.21, 159.04, 164.87, 
         162.72, 157.89, 159.49, 150.22, 151.02, 
         1.47, 1.56, 1.39, 1.43, 1.42, 
         1.36, NA, 1.43, 1.57, 1.54,
         1605, 1580, 1490, 1520, 1550, 
         1525, 1495, 1485, 1470, 1510, 
         95.05, 97.49, 88.57, 85.55, 92.04, 
         91.70, 89.88, 93.17, 90.12, 92.14), ncol=4, nrow=10,
  dimnames = list(c("Day1","Day2","Day3","Day4","Day5",
                    "Day6","Day7","Day8","Day9","Day10"),
                  c("Stock1", "Stock2", "Stock3", "Stock4"))))
d

##       Stock1 Stock2 Stock3 Stock4
## Day1  185.74   1.47   1605  95.05
## Day2  184.26   1.56   1580  97.49
## Day3  162.21   1.39   1490  88.57
## Day4  159.04   1.43   1520  85.55
## Day5  164.87   1.42   1550  92.04
## Day6  162.72   1.36   1525  91.70
## Day7  157.89     NA   1495  89.88
## Day8  159.49   1.43   1485  93.17
## Day9  150.22   1.57   1470  90.12
## Day10 151.02   1.54   1510  92.14

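AVG <- apply(X=d, MARGIN=2, FUN=mean) # Stock2 comes back NA because of the missing Day7 value
AVG <- apply(X=d, MARGIN=2, FUN=mean, na.rm=TRUE) # na.rm=TRUE is passed on to mean( ) and drops the NA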
AVG

##      Stock1      Stock2      Stock3      Stock4 
##  163.746000    1.463333 1523.000000   91.571000

colMeans(d, na.rm=TRUE) # built-in shortcut that gives the same result

##      Stock1      Stock2      Stock3      Stock4 
##  163.746000    1.463333 1523.000000   91.571000

Example 2. Find max of each stock.

apply(X=d, MARGIN=2, FUN=max, na.rm=TRUE) # find max

##  Stock1  Stock2  Stock3  Stock4 
##  185.74    1.57 1605.00   97.49

Example 3. Calculate the 20th and 80th percentile of each stock.

The probs argument is passed through to quantile( ) to tell it which percentiles to calculate.

apply(X=d, MARGIN=2, FUN=quantile, probs=c(0.2, 0.8), na.rm=TRUE)

##      Stock1 Stock2 Stock3 Stock4
## 20% 156.516  1.408   1489 89.618
## 80% 168.748  1.548   1556 93.546

Example 4. Plot the data.

par(mfrow=c(2,2))
apply(X=d, MARGIN=2, FUN=plot, type="l", main="stock", ylab= "Price", xlab="Day")

## NULL

Example 5. Sum each row.

apply(X=d, MARGIN=1, FUN=sum, na.rm=TRUE)

##    Day1    Day2    Day3    Day4    Day5    Day6    Day7    Day8    Day9   Day10 
## 1887.26 1863.31 1742.17 1766.02 1808.33 1780.78 1742.77 1739.09 1711.91 1754.70

Example 6. Plot market trends.

plot(apply(X=d, MARGIN=1, FUN=sum, na.rm=TRUE), type="l", ylab= "Total Market Value", xlab="Day", main="Market Trend")
points(apply(d, 1, FUN=sum, na.rm=TRUE), pch=16, col="blue")

tapply( )

tapply( ) applies a function to subsets of a variable or vector. It is a specialized loop/subsetting function that is more concise than subsetting repeatedly with square brackets or the subset( ) function. It lets you divide a variable into groups defined by one or more other variables and then apply a function to each group.

LungCapData <- read.table("LungCapData.txt", header=TRUE)
str(LungCapData)

## 'data.frame':    725 obs. of  6 variables:
##  $ LungCap  : num  6.47 10.12 9.55 11.12 4.8 ...
##  $ Age      : int  6 18 16 14 5 11 8 11 15 11 ...
##  $ Height   : num  62.1 74.7 69.7 71 56.9 58.7 63.3 70.4 70.5 59.2 ...
##  $ Smoke    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Gender   : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 2 2 2 2 ...
##  $ Caesarean: Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...

attach(LungCapData)

attach( ) attaches a data frame (or list) to the search path, so it becomes possible to refer to the variables in the data frame by their names alone, rather than as components of the data frame (e.g., in the example above, you would use Age rather than LungCapData$Age).

str(tapply)

## function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)

X = an atomic object, typically a vector
INDEX = grouping variable same length as X and used to create subsets of the data
FUN = function
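… = additional arguments to pass to the function
simplify = if TRUE (the default) and FUN always returns a single value, the result is simplified to a vector or array; if FALSE, the result is returned in list form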
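Because LungCapData.txt may not be at hand, here is a sketch of the same idea on the built-in mtcars data. The choice of mpg as X and cyl as INDEX is an assumption made only for illustration.

```r
# Mean miles per gallon for each number of cylinders
mean_mpg <- tapply(X = mtcars$mpg, INDEX = mtcars$cyl, FUN = mean)
mean_mpg

# The same grouping spelled out with split( ) and sapply( )
mean_mpg_split <- sapply(split(mtcars$mpg, mtcars$cyl), mean)

# Compare the values from the two approaches
all.equal(as.vector(mean_mpg), as.vector(mean_mpg_split))
```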

Example 1. Calculate mean age of smokers and non-smokers separately.

tapply(X=Age, INDEX=Smoke, FUN=mean, na.rm=T)

##       no      yes 
## 12.03549 14.77922

tapply(X=Age, INDEX=Smoke, FUN=mean, na.rm=T, simplify=FALSE) # returns a list format

## $no
## [1] 12.03549
## 
## $yes
## [1] 14.77922

What does this look like with square brackets?

mean(Age[Smoke=="no"])

## [1] 12.03549

mean(Age[Smoke=="yes"])

## [1] 14.77922

Example 2. Apply the summary function to groups.

tapply(Age, Smoke, summary)

## $no
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   12.00   12.04   15.00   19.00 
## 
## $yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   13.00   15.00   14.78   17.00   19.00

Example 3. Apply the quantile function to groups.

tapply(Age, Smoke, quantile, probs=c(0.2,0.8))

## $no
## 20% 80% 
##   8  16 
## 
## $yes
## 20% 80% 
##  12  17

Example 4. ‘subset’ based on multiple variables/vectors. Calculate the mean Age for Smoker/NonSmoker and male/female.

tapply(X=Age, INDEX=list(Smoke,Gender), FUN=mean, na.rm=T)

##       female     male
## no  12.12739 11.94910
## yes 14.75000 14.81818

What does this look like with square brackets?

mean(Age[Smoke=="no" & Gender=="female"])

## [1] 12.12739

mean(Age[Smoke=="no" & Gender=="male"])

## [1] 11.9491

mean(Age[Smoke=="yes" & Gender=="female"])

## [1] 14.75

mean(Age[Smoke=="yes" & Gender=="male"])

## [1] 14.81818

The by( ) function does the same job as tapply( ), but it returns an object of class "by", which prints each group separately. As shown below, c( ) converts that object to a plain numeric vector.

by(Age, list(Smoke, Gender), mean, na.rm=T)

## : no
## : female
## [1] 12.12739
## ------------------------------------------------------------ 
## : yes
## : female
## [1] 14.75
## ------------------------------------------------------------ 
## : no
## : male
## [1] 11.9491
## ------------------------------------------------------------ 
## : yes
## : male
## [1] 14.81818

temp <- by(Age, list(Smoke, Gender), mean, na.rm=T)
temp[4]

## [1] 14.81818

class(temp)

## [1] "by"

temp2 <- c(temp) # convert to a vector
temp2

## [1] 12.12739 14.75000 11.94910 14.81818

class(temp2)

## [1] "numeric"

mapply( )

mapply( ) is a multivariate version of the sapply( ) and lapply( ) functions. It applies a function in parallel over a set of arguments, meaning element by element: the first elements of each argument are passed together, then the second elements, and so on.

sapply( ), lapply( ), and tapply( ) each apply a function over the elements of a single object. So what happens if you have two lists you want to apply a function over? sapply( ) and lapply( ) can’t be used for that purpose. You could instead write a for loop that indexes both lists and passes the corresponding elements of each list to the function.

But mapply( ) can take multiple list arguments and apply a function to the elements in the lists in parallel.
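The loop and the mapply( ) call described above can be sketched side by side. The lists a and b and the addition function are illustrative assumptions.

```r
a <- list(1, 2, 3)      # illustrative inputs
b <- list(10, 20, 30)

# Loop version: index both lists by position
sums_loop <- vector("list", length(a))
for (i in seq_along(a)) {
  sums_loop[[i]] <- a[[i]] + b[[i]]
}

# mapply( ) version: pairs a[[1]] with b[[1]], a[[2]] with b[[2]], and so on.
# SIMPLIFY = FALSE keeps the result as a list, like the loop version.
sums_mapply <- mapply(function(x, y) x + y, a, b, SIMPLIFY = FALSE)

# Compare the values from the two approaches
identical(unlist(sums_loop), unlist(sums_mapply))
```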

str(mapply)

## function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)

FUN is a function to apply
… contains arguments to apply over
MoreArgs is a list of other arguments to FUN
SIMPLIFY indicates whether the result should be simplified
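A brief sketch of MoreArgs and SIMPLIFY, using rep( ) with made-up inputs: x changes from call to call, while times is supplied once through MoreArgs and is the same for every call.

```r
# Keep the raw results: a list with one vector per call
res_list <- mapply(rep, x = 1:3, MoreArgs = list(times = 2), SIMPLIFY = FALSE)

# Default SIMPLIFY = TRUE: every result has length 2, so they are
# combined into a matrix with one column per call
res_mat <- mapply(rep, x = 1:3, MoreArgs = list(times = 2))

res_list
res_mat
```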

Example 1

list(rep(1,4), rep(2,3), rep(3,2), rep(4,1)) # tedious to type

## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4

mapply(rep, 1:4, 4:1)

## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4

Example 2

noise <- function(n, mean, sd) {
  set.seed(100)
  rnorm(n, mean, sd)
}
noise(5,1,2)

## [1] -0.004384701  1.263062331  0.842165820  2.773569619  1.233942541

noise(1:5, 1:5, 2) # passing vectors directly does not work as intended: rnorm( ) takes length(n) = 5 as the sample size

## [1] -0.004384701  2.263062331  2.842165820  5.773569619  5.233942541

mapply(noise, 1:5, 1:5, 2) # this is how it should be:

## [[1]]
## [1] -0.004384701
## 
## [[2]]
## [1] 0.9956153 2.2630623
## 
## [[3]]
## [1] 1.995615 3.263062 2.842166
## 
## [[4]]
## [1] 2.995615 4.263062 3.842166 5.773570
## 
## [[5]]
## [1] 3.995615 5.263062 4.842166 6.773570 5.233943

# which is the same as:
list(noise(1,1,2), noise(2,2,2), 
     noise(3,3,2), noise(4,4,2),
     noise(5,5,2))

## [[1]]
## [1] -0.004384701
## 
## [[2]]
## [1] 0.9956153 2.2630623
## 
## [[3]]
## [1] 1.995615 3.263062 2.842166
## 
## [[4]]
## [1] 2.995615 4.263062 3.842166 5.773570
## 
## [[5]]
## [1] 3.995615 5.263062 4.842166 6.773570 5.233943

Common purrr vectorized functions

Vectorization Continued: Purrr

Map functions are vectorized functions available through the purrr package. They are very similar to the vectorized functions already available in R, so this will give you more exposure to vectorization. In general, map functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input.

Here is also a great site that explores in more detail the functional tools of map functions: https://adv-r.hadley.nz/functionals.html#map. Also here is the purrr cheatsheet: https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf.

library(purrr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

map( ) - just like lapply( ) in that it will loop over an object and evaluate a function on each element of that object. Then, it will return a list. But it is inconvenient to return a list when a simpler data structure would do, so there are four more specific variants:

map_lgl( ), map_int( ), map_dbl( ), and map_chr( ) - each returns an atomic vector of the indicated type

map_dfr( ) and map_dfc( ) return a data frame created by row-binding and column-binding respectively. They require dplyr to be installed.

map( )

map( )

Now let’s repeat the first lapply( ) example using map( ).

Example 1.

x <- list(a = 1:5, b = rnorm(10)) # list w/ 2 elements.
l <- map(x, mean)
l

## $a
## [1] 3
## 
## $b
## [1] 0.01139972

Example 2.

values <- 1:10
map(values, function(x) {rnorm(10, x)})

## [[1]]
##  [1] 0.97068329 0.61114575 1.51085626 0.08618581 3.31029682 0.56191002
##  [7] 1.76406062 1.26196129 1.77340460 0.18562088
## 
## [[2]]
##  [1] 1.5615494 1.2797784 2.2309445 0.8422705 2.2470760 1.9088864 3.7573756
##  [8] 1.8620704 1.8888065 1.3099857
## 
## [[3]]
##  [1] 2.778206 3.182908 3.417323 4.065402 3.970202 2.898371 4.403203 1.223224
##  [9] 3.622867 2.477717
## 
## [[4]]
##  [1] 5.322231 3.636560 5.319066 4.043779 2.121344 3.552938 2.261402 4.178865
##  [9] 5.897466 1.728075
## 
## [[5]]
##  [1] 5.980464 3.601174 6.824872 6.381299 4.161148 4.738004 4.931156 4.621116
##  [9] 7.581959 5.129834
## 
## [[6]]
##  [1] 5.286975 6.637994 6.201692 5.930083 5.907510 6.448903 4.935644 4.837581
##  [9] 7.648522 3.937904
## 
## [[7]]
##  [1] 7.012750 5.912472 7.270539 8.008452 4.925595 7.896822 6.950004 5.654651
##  [9] 5.068788 7.709582
## 
## [[8]]
##  [1] 7.842095 8.216368 8.817362 9.727176 7.896230 7.442878 9.428301 7.107043
##  [9] 6.842429 7.469704
## 
## [[9]]
##  [1] 11.445683  8.167504  9.413520  7.821317  7.825965  8.667077 10.363114
##  [8]  8.530853  9.842876  7.542006
## 
## [[10]]
##  [1]  9.599694  9.223583  9.630703 11.240101  9.892566 10.172594 10.254601
##  [8]  9.385466  8.570785  9.669025

1:10 %>%
  map(~ rnorm(10, .x))

## [[1]]
##  [1]  1.1283861  2.0181200  0.7444263  0.6974590  2.6151907  0.2262866
##  [7]  1.4240024  0.4160530  1.4150357 -0.5452617
## 
## [[2]]
##  [1] 1.481250 1.720208 3.007457 1.530430 2.297897 1.582206 1.149619 2.689046
##  [9] 1.539804 3.348184
## 
## [[3]]
##  [1] 3.4430714 2.8490738 3.4555489 2.9598453 3.4561210 2.5915750 0.8635061
##  [8] 3.1568219 3.6600489 2.0181656
## 
## [[4]]
##  [1] 2.886356 3.562652 3.483889 4.418996 4.134155 5.034686 5.653503 3.982053
##  [9] 3.975797 4.250247
## 
## [[5]]
##  [1] 4.662875 4.886646 4.901117 5.264087 5.138984 4.757731 5.059031 4.822728
##  [9] 5.794680 5.006738
## 
## [[6]]
##  [1] 5.370210 5.747510 5.309578 6.202542 6.846381 6.632074 6.201414 5.908929
##  [9] 6.289484 5.945315
## 
## [[7]]
##  [1] 4.958150 7.358369 6.627399 8.268309 9.168600 5.760277 7.589874 7.124019
##  [9] 6.476292 7.620228
## 
## [[8]]
##  [1] 8.708222 7.906802 7.704803 6.914185 7.375185 7.766993 7.749183 8.953895
##  [9] 7.734027 9.895276
## 
## [[9]]
##  [1]  8.570009 10.575547  9.161941  7.914547  9.576937  9.028172  8.643297
##  [8]  9.852626  9.513365 10.018203
## 
## [[10]]
##  [1]  8.978521  9.438332  8.987444  6.979186 10.332350 11.240512 10.671350
##  [8]  8.669966  9.149420  8.211169

Example 3.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary)

## $`4`
## 
## Call:
## lm(formula = mpg ~ wt, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1513 -1.9795 -0.6272  1.9299  5.2523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   39.571      4.347   9.104 7.77e-06 ***
## wt            -5.647      1.850  -3.052   0.0137 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.332 on 9 degrees of freedom
## Multiple R-squared:  0.5086, Adjusted R-squared:  0.454 
## F-statistic: 9.316 on 1 and 9 DF,  p-value: 0.01374
## 
## 
## $`6`
## 
## Call:
## lm(formula = mpg ~ wt, data = .)
## 
## Residuals:
##      Mazda RX4  Mazda RX4 Wag Hornet 4 Drive        Valiant       Merc 280 
##        -0.1250         0.5840         1.9292        -0.6897         0.3547 
##      Merc 280C   Ferrari Dino 
##        -1.0453        -1.0080 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   28.409      4.184   6.789  0.00105 **
## wt            -2.780      1.335  -2.083  0.09176 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.165 on 5 degrees of freedom
## Multiple R-squared:  0.4645, Adjusted R-squared:  0.3574 
## F-statistic: 4.337 on 1 and 5 DF,  p-value: 0.09176
## 
## 
## $`8`
## 
## Call:
## lm(formula = mpg ~ wt, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1491 -1.4664 -0.8458  1.5711  3.7619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.8680     3.0055   7.942 4.05e-06 ***
## wt           -2.1924     0.7392  -2.966   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.024 on 12 degrees of freedom
## Multiple R-squared:  0.423,  Adjusted R-squared:  0.3749 
## F-statistic: 8.796 on 1 and 12 DF,  p-value: 0.01179

map_lgl( ), map_int( ), map_dbl( ), and map_chr( )

purrr uses the convention that suffixes, like _dbl, refer to the type of the output. All map_*( ) functions can take any type of vector as input.

Example 1.

map_chr(mtcars, typeof) # always returns a character vector

##      mpg      cyl     disp       hp     drat       wt     qsec       vs 
## "double" "double" "double" "double" "double" "double" "double" "double" 
##       am     gear     carb 
## "double" "double" "double"

map_lgl(mtcars, is.double) # always returns a logical vector

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

n_unique <- function(x) { length(unique(x)) }
map_int(mtcars, n_unique) # always returns an integer vector

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##   25    3   27   22   22   29   30    2    2    3    6

map_dbl(mtcars, mean) # always returns a double vector (also known as floats)

##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500

Example 2.

1:10 %>%
  map(rnorm, n = 10) %>%  # output a list
  map_dbl(mean)           # output an atomic vector

##  [1] 0.9614181 1.8371975 2.9805346 4.3129193 4.8374819 6.3585267 7.1639700
##  [8] 8.0847131 8.7010062 9.9062764

map_dfr( ) and map_dfc( )

All the purrr functions you’ve seen above return lists or vectors, but you may want to return a data frame.

Example 1.

myFunction <- function(arg1) {
  col <- arg1 * 2
  as.data.frame(col) # return a one-column data frame visibly
}
values <- c(1, 3, 5, 7, 9)
df_rows <- map_dfr(values, myFunction) # binds the results row-wise: 5 rows, 1 column
df_cols <- map_dfc(values, myFunction) # binds the results column-wise: 1 row, 5 columns

Example 2.

mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map_dfr(~ as.data.frame(t(as.matrix(coef(.)))))

##   (Intercept)        wt
## 1    39.57120 -5.647025
## 2    28.40884 -2.780106
## 3    23.86803 -2.192438

Introduction to Vectorization in R

Anastasia Bernat
