suppressPackageStartupMessages(library("tidyverse"))
package ‘tidyverse’ was built under R version 3.6.3
suppressPackageStartupMessages(library("stringr"))
# The package microbenchmark is used for timing code.
suppressPackageStartupMessages(library("microbenchmark"))
package ‘microbenchmark’ was built under R version 3.6.3
1. Compute the mean of every column in mtcars.
2. Determine the type of each column in nycflights13::flights.
3. Compute the number of unique values in each column of iris.
4. Generate 10 random normals for each of \(\mu = -10, 0, 10,\) and \(100\).
1. To calculate the mean of every column in mtcars, apply the function mean() to each column, and use map_dbl(), since the results are numeric.
map_dbl(mtcars, mean)
mpg cyl disp hp drat wt qsec vs
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500
am gear carb
0.406250 3.687500 2.812500
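As a quick sanity check, the result of map_dbl() can be compared with base R's colMeans(), which should agree because every column of mtcars is numeric. This is only a sketch for comparison and assumes that purrr is installed; the variable names are illustrative.

```r
library(purrr)

# Column means computed two ways: with purrr and with base R.
means_purrr <- map_dbl(mtcars, mean)
means_base <- colMeans(mtcars)

# Both are named double vectors; all.equal() tolerates tiny
# floating-point differences between the two computations.
all.equal(means_purrr, means_base)
```

The map_dbl() version generalizes more easily, since mean() can be swapped for any function that returns a single number.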
2. To calculate the type of every column in nycflights13::flights, apply the function typeof(), discussed in the section on Vector basics, and use map_chr(), since the results are character.
map_chr(nycflights13::flights, typeof)
year month day dep_time sched_dep_time dep_delay
"integer" "integer" "integer" "integer" "integer" "double"
arr_time sched_arr_time arr_delay carrier flight tailnum
"integer" "integer" "double" "character" "integer" "character"
origin dest air_time distance hour minute
"character" "character" "double" "double" "double" "double"
time_hour
"double"
3. There is no base R function that directly calculates the number of unique values in a vector, although dplyr provides n_distinct(). For a single column, the number of unique values can be calculated like so,
length(unique(iris$Species))
[1] 3
To apply this to all columns, we can give map_int() an anonymous function. We can write the anonymous function using the standard R syntax.
map_int(iris, function(x) length(unique(x)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
35 23 43 22 3
We could also use the compact, one-sided formula shortcut that purrr provides.
map_int(iris, ~ length(unique(.)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
35 23 43 22 3
In these examples, the map_int() function is used since length() returns an integer. However, the map_dbl() function will also work, because purrr coerces the integer results to doubles.
map_dbl(iris, ~ length(unique(.)))
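If dplyr is available, its n_distinct() function counts unique values directly, so the anonymous function can be dropped altogether. The following is a minimal sketch of that alternative, not part of the original solution; `counts` is an illustrative name.

```r
suppressPackageStartupMessages(library(dplyr))
library(purrr)

# n_distinct() returns a single integer per column,
# so it can be passed to map_int() by name.
counts <- map_int(iris, n_distinct)
counts
```

The counts should match those from length(unique(x)) above.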
4. To generate 10 random normals for each of \(\mu = -10, 0, 10,\) and \(100\), map over the vector of means and pass each one to rnorm(). The result is a list of numeric vectors.
map(c(-10, 0, 10, 100), ~ rnorm(n = 10, mean = .))
[[1]]
[1] -10.562685 -9.450440 -11.205591 -7.427206 -9.121196 -7.236307 -9.484426 -11.450771
[9] -9.985221 -10.683619
[[2]]
[1] 0.63546587 0.71664803 -0.98568053 -1.51530864 -0.04996717 -1.49228515 -0.13181626
[8] 0.05997331 0.64866414 -0.42070499
[[3]]
[1] 8.313399 9.594709 9.968042 10.572528 10.234050 8.862336 10.997314 11.195132 10.603083
[10] 10.287259
[[4]]
[1] 98.80427 101.22868 100.94155 100.79101 101.12101 99.55110 99.75927 99.56236 102.02652
[10] 101.00852
Since a single call of rnorm() returns a numeric vector with a length greater than one, we cannot use map_dbl(), which requires the function to return a numeric vector of length one. The map functions pass any additional arguments to the function being called.
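To illustrate the point about additional arguments, the sketch below writes the same computation twice: once with the formula shortcut and once by passing n = 10 through map() to rnorm(). The seed value and variable names are arbitrary choices for the illustration.

```r
library(purrr)

means <- c(-10, 0, 10, 100)

# Formula shortcut: each element is passed explicitly as `mean`.
set.seed(42)
draws_formula <- map(means, ~ rnorm(n = 10, mean = .))

# Extra arguments: `n = 10` is forwarded to rnorm(), so each element of
# `means` fills the next unmatched argument of rnorm(), which is `mean`.
set.seed(42)
draws_dots <- map(means, rnorm, n = 10)

# With the same seed, both forms produce the same draws.
identical(draws_formula, draws_dots)
```

Because n is matched by name, the mapped value ends up in mean even though it is the first positional argument of the call.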
The function is.factor() indicates whether a vector is a factor.
is.factor(diamonds$color)
[1] TRUE
Checking every column of a data frame for whether it is a factor is a job for a map_*() function. Since the result of is.factor() is logical, we will use map_lgl() to apply is.factor() to the columns of the data frame.
map_lgl(diamonds, is.factor)
carat cut color clarity depth table price x y z
FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
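The logical vector returned by map_lgl() can also be used as an index, for example to list the names of the factor columns. This is a small sketch that assumes ggplot2 is installed, since that is where diamonds lives; `is_fct` is an illustrative name.

```r
library(purrr)

# diamonds ships with ggplot2; referencing it with :: avoids
# attaching the whole package.
is_fct <- map_lgl(ggplot2::diamonds, is.factor)

# Keep only the names whose entry is TRUE.
names(is_fct)[is_fct]
```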
What does map(1:5, runif) do? Why?
Map functions work with any vectors, not just lists. As with lists, the map functions will apply the function to each element of the vector. In the following examples, the inputs to map() are atomic vectors (logical, character, integer, double).
map(c(TRUE, FALSE, TRUE), ~ !.)
[[1]]
[1] FALSE
[[2]]
[1] TRUE
[[3]]
[1] FALSE
map(c("Hello", "World"), str_to_upper)
[[1]]
[1] "HELLO"
[[2]]
[1] "WORLD"
map(1:5, ~ rnorm(.))
[[1]]
[1] 0.06280466
[[2]]
[1] 0.4185305 -0.3795624
[[3]]
[1] -0.2601337 -0.7149337 1.2696833
[[4]]
[1] 0.11534827 0.05702938 0.88862189 1.32397427
[[5]]
[1] 0.5122246 1.0382322 1.6076809 -1.8047729 0.6420032
map(c(-0.5, 0, 1), ~ rnorm(1, mean = .))
[[1]]
[1] -2.224334
[[2]]
[1] 0.2706889
[[3]]
[1] 0.9579823
It is important to be aware that while the input of map() can be any vector, the output is always a list.
map(1:5, runif)
[[1]]
[1] 0.1494658
[[2]]
[1] 0.6528250 0.4799463
[[3]]
[1] 0.9744443 0.2791404 0.8208076
[[4]]
[1] 0.6709032 0.5937742 0.9961677 0.2599329
[[5]]
[1] 0.56860716 0.61701854 0.07568036 0.22553429 0.54821350
This expression is equivalent to running the following.
list(
  runif(1),
  runif(2),
  runif(3),
  runif(4),
  runif(5)
)
[[1]]
[1] 0.8088577
[[2]]
[1] 0.6489569 0.3369204
[[3]]
[1] 0.09497879 0.66540476 0.89104245
[[4]]
[1] 0.01484305 0.79000957 0.27765970 0.19737707
[[5]]
[1] 0.23060943 0.19025157 0.56040339 0.06336879 0.16675676
The map() function loops through the numbers 1 to 5. For each value, it calls runif() with that number as the first argument, which is the number of samples to draw. The result is a length five list with numeric vectors of sizes one through five, each with random samples from a uniform distribution. Note that although the input to map() was an integer vector, the return value was a list.
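The description above can be checked directly: the result is a list, and its elements have lengths one through five. A minimal sketch, with `draws` as an illustrative name:

```r
library(purrr)

draws <- map(1:5, runif)

# The result is a list even though the input was an integer vector.
is.list(draws)

# Element i holds i draws, so the lengths run from 1 to 5.
map_int(draws, length)
```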
What does map(-2:2, rnorm, n = 5) do? Why? What does map_dbl(-2:2, rnorm, n = 5) do? Why?
Consider the first expression.
map(-2:2, rnorm, n = 5)
[[1]]
[1] -1.875677 -2.475980 -3.953484 -2.906089 -2.607832
[[2]]
[1] -1.7310002 -0.5087884 -1.4219197 -2.3649502 -0.4751831
[[3]]
[1] -1.0501727 -1.0422769 0.8935098 0.3735523 0.7320833
[[4]]
[1] 1.4035647 -0.1525906 2.6630770 -0.3736245 1.0380134
[[5]]
[1] 2.1716007 4.2711560 0.9351085 4.0590603 0.6201520
This expression takes samples of size five from five normal distributions with means of -2, -1, 0, 1, and 2, all with the same standard deviation (1). It returns a list in which each element is a numeric vector of length 5.
However, if we instead use map_dbl(), the expression raises an error.
# map_dbl(-2:2, rnorm, n = 5)
#> Result 1 must be a single double, not a double vector of length 5
This is because the map_dbl() function requires the function it applies to each element to return a numeric vector of length one. If the function returns either a non-numeric vector or a numeric vector with a length greater than one, map_dbl() will raise an error. The reason for this strictness is that map_dbl() guarantees that it will return a numeric vector of the same length as its input vector.
This concept applies to the other map_*() functions. The function map_chr() requires that the function always return a character vector of length one; map_int() requires that the function always return an integer vector of length one; map_lgl() requires that the function always return a logical vector of length one. Use the map() function if the function will return values of varying types or lengths.
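The length-one requirement can be seen by contrasting a call that reduces each sample to a single number with one that does not. The sketch below uses tryCatch() to capture the error so that the script keeps running; the names `ok` and `result` are illustrative, and the exact error message depends on the purrr version.

```r
library(purrr)

# One value per element: map_dbl() succeeds and returns a double vector
# with the same length as the input.
ok <- map_dbl(-2:2, ~ mean(rnorm(5, mean = .)))
length(ok)

# Five values per element: map_dbl() raises an error, which is caught here
# and replaced by a marker string.
result <- tryCatch(
  map_dbl(-2:2, rnorm, n = 5),
  error = function(e) "error"
)
result
```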
To return a double vector, we could use map() followed by flatten_dbl().
flatten_dbl(map(-2:2, rnorm, n = 5))
[1] -3.78746056 -1.57532860 -1.25197942 -3.40964151 -2.16499929 -1.10871102 -0.79198392
[8] -0.11331084 -0.70837810 -1.36750714 -0.74433696 -0.05375261 -1.39895763 0.75811861
[15] 0.17177696 -0.91846162 -0.92356516 1.62452989 0.59117241 0.81927221 2.34703113
[22] 1.22385230 4.09175355 1.51842380 2.37895695
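A base R alternative, which may be useful where the flatten_*() functions are unavailable or discouraged (newer purrr releases, to my understanding, steer users away from them), is to call unlist() on the list. This is a sketch of that substitute rather than the approach used above.

```r
library(purrr)

draws <- map(-2:2, rnorm, n = 5)

# unlist() concatenates the five length-5 vectors into a single
# double vector of length 25.
flat <- unlist(draws)
length(flat)
typeof(flat)
```

Unlike flatten_dbl(), unlist() does not check types, so it will silently coerce if the list elements are not all doubles.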
Rewrite map(x, function(df) lm(mpg ~ wt, data = df)) to eliminate the anonymous function.
The code in this question does not run as given because x is not defined, so I will use the following code.
x <- split(mtcars, mtcars$cyl)
map(x, function(df) lm(mpg ~ wt, data = df))
$`4`
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
39.571 -5.647
$`6`
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
28.41 -2.78
$`8`
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
23.868 -2.192
We can eliminate the use of an anonymous function using the ~ shortcut.
map(x, ~ lm(mpg ~ wt, data = .))
$`4`
Call:
lm(formula = mpg ~ wt, data = .)
Coefficients:
(Intercept) wt
39.571 -5.647
$`6`
Call:
lm(formula = mpg ~ wt, data = .)
Coefficients:
(Intercept) wt
28.41 -2.78
$`8`
Call:
lm(formula = mpg ~ wt, data = .)
Coefficients:
(Intercept) wt
23.868 -2.192
Though not the intent of this question, another way to eliminate the anonymous function is to define a named one.
run_reg <- function(df) {
  lm(mpg ~ wt, data = df)
}
map(x, run_reg)
$`4`
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
39.571 -5.647
$`6`
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
28.41 -2.78
$`8`
Call:
lm(formula = mpg ~ wt, data = df)
Coefficients:
(Intercept) wt
23.868 -2.192
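Whichever form is used, the result is a list of fitted models, which can itself be mapped over. As a sketch of a possible next step (not part of the exercise), the slope on wt can be pulled out of each model with map_dbl(); `models` and `slopes` are illustrative names.

```r
library(purrr)

x <- split(mtcars, mtcars$cyl)
models <- map(x, ~ lm(mpg ~ wt, data = .))

# coef() returns a named vector of coefficients for each model;
# [["wt"]] extracts the slope as a single number.
slopes <- map_dbl(models, ~ coef(.)[["wt"]])
round(slopes, 3)
```

The names of the result are the cylinder counts used by split(), so each slope stays attached to its group.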