Summaries and exercises from R for Data Science by Hadley Wickham & Garrett Grolemund.
The pipe allows you to create a logical order to code. Foo Foo hops, then scoops, then bops.
The pipe won't work for two classes of functions: functions that use the current environment, such as assign(), get(), and load(), and functions that use lazy evaluation, such as tryCatch(), try(), suppressMessages(), and suppressWarnings().
> x <- 10
> #Does not work because the pipe assigns
> #it to a temporary environment
> "x" %>% assign(100)
> x
[1] 10
> #It does work if you are explicit about the environment
> env <- environment()
> "x" %>% assign(100, envir = env)
> x
[1] 100
> #Lazy functions only evaluate the arguments
> #when the function uses them, so the pipe
> #won't work
> tryCatch(stop("!"), error = function(e) "An error")[1] "An error"
Error in eval(lhs, parent, parent): !
It's best not to use the pipe when your pipes are longer than about ten steps, when you have multiple inputs or outputs, or when your operations form a complex dependency graph rather than a linear sequence.
The "tee" pipe, %T>%, returns the left-hand side instead of the right-hand side, which is useful for functions called for their side-effects (like plotting or printing).
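The two outputs below come from pipelines along these lines (a sketch following the magrittr example): with %>%, str() receives the NULL returned by plot(), while with %T>% it receives the matrix.
> rnorm(100) %>%
+   matrix(ncol = 2) %>%
+   plot() %>%
+   str()
>
> rnorm(100) %>%
+   matrix(ncol = 2) %>%
+   plot() %T>%
+   str()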
NULL
num [1:50, 1:2] 0.251 -0.978 -0.244 0.603 -0.464 ...
For many functions like cor(), where you’re passing them vectors that come from a data frame and not the data frame itself, you can use %$%.
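The output below comes from a call like this (a sketch, assuming magrittr is attached):
> mtcars %$%
+   cor(disp, mpg)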
[1] -0.8475514
You can use %<>% to perform a function and assign at the same time.
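A small sketch of the assignment pipe, which modifies mtcars in place:
> mtcars <- mtcars %>%
+   transform(cyl = cyl * 2)
> # is equivalent to
> mtcars %<>% transform(cyl = cyl * 2)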
Imagine you were doing the following transformations:
> df <- tibble::tibble(
+ a = rnorm(10),
+ b = rnorm(10),
+ c = rnorm(10),
+ d = rnorm(10)
+ )
>
> df$a <- (df$a - min(df$a, na.rm = TRUE)) /
+ (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
> #Spot the error
> df$b <- (df$b - min(df$b, na.rm = TRUE)) /
+ (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
> df$c <- (df$c - min(df$c, na.rm = TRUE)) /
+ (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
> df$d <- (df$d - min(df$d, na.rm = TRUE)) /
+ (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
Normally you would copy and paste, but that often results in errors: df$b has an error (its denominator uses min(df$a, ...) instead of min(df$b, ...)).
The following code has one input, df$a, which can be substituted with \(x\).
[1] 0.5309992 0.0000000 0.6226931 0.6557572 0.4130389 0.7992905 0.6612732
[8] 0.0466836 0.2791986 1.0000000
[1] 0.5309992 0.0000000 0.6226931 0.6557572 0.4130389 0.7992905 0.6612732
[8] 0.0466836 0.2791986 1.0000000
We can make it even simpler by substituting with an existing function.
[1] 0.5309992 0.0000000 0.6226931 0.6557572 0.4130389 0.7992905 0.6612732
[8] 0.0466836 0.2791986 1.0000000
Now we can create a custom function.
> rescale01 <- function(x) {
+ rng <- range(x, na.rm = TRUE)
+ (x - rng[1]) / (rng[2] - rng[1])
+ }
> rescale01(c(0, 5, 10))
[1] 0.0 0.5 1.0
There are three steps to creating a function:
Pick a name that describes what the function does, such as rescale01().
List the inputs, or arguments, inside function().
Place the code you have developed in the { } body of the function.
It's best to test the function with a few different inputs.
[1] 0.0 0.5 1.0
[1] 0.00 0.25 0.50 NA 1.00
Now we can simplify the original example.
> df$a <- rescale01(df$a)
> df$b <- rescale01(df$b)
> df$c <- rescale01(df$c)
> df$d <- rescale01(df$d)
And if there's an error you only have to change it in one place.
[1] 0 0 0 0 0 0 0 0 0 0 NaN
> rescale01 <- function(x) {
+ rng <- range(x, na.rm = TRUE, finite = TRUE)
+ (x - rng[1]) / (rng[2] - rng[1])
+ }
> rescale01(x)
 [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
[8] 0.7777778 0.8888889 1.0000000 Inf
Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value and na.rm was FALSE? Adding na.rm as a parameter doesn't change the output: the setting finite = TRUE already drops all non-finite elements, and NA is a non-finite element.
> #Original Function
> rescale01 <- function(x) {
+ rng <- range(x, na.rm = TRUE, finite = TRUE)
+ (x - rng[1]) / (rng[2] - rng[1])
+ }
> rescale01(c(NA, 1:5))
[1]   NA 0.00 0.25 0.50 0.75 1.00
> #Updated Function
> rescale02 <- function(x, na.rm = FALSE) {
+ rng <- range(x, na.rm = na.rm, finite = TRUE)
+ (x - rng[1]) / (rng[2] - rng[1])
+ }
>
> rescale02(c(NA, 1:5), na.rm = FALSE)
> rescale02(c(NA, 1:5), na.rm = TRUE)
[1]   NA 0.00 0.25 0.50 0.75 1.00
[1] NA 0.00 0.25 0.50 0.75 1.00
In rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0 and Inf is mapped to 1.
> rescale03 <- function(x) {
+ rng <- range(x, na.rm = TRUE, finite = TRUE)
+ y <- (x - rng[1]) / (rng[2] - rng[1])
+ y[y == -Inf] <- 0
+ y[y == Inf] <- 1
+ y
+ }
>
> rescale03(c(Inf, -Inf, 0:5, NA))
[1] 1.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0  NA
mean(is.na(x)) calculates the proportion of NA values in a vector.
[1] 0.2857143
x / sum(x, na.rm = TRUE) standardizes a vector so that it sums to one.
[1] 0.04761905 0.09523810 0.14285714 0.19047619 0.23809524 0.28571429
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) calculates the coefficient of variation.
> coef_variation <- function(x, na.rm = FALSE) {
+ sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
+ }
>
> coef_variation(c(1:10, NA), na.rm = TRUE)
[1] 0.5504819
\[Var(x)=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2\]
\[Skew(x)=\frac{\frac{1}{n-2}\left(\sum_{i=1}^{n}(x_i-\bar{x})^3\right)}{Var(x)^{3/2}}\]
> variance <- function(x, na.rm = TRUE) {
+ n <- length(x)
+ m <- mean(x, na.rm = TRUE)
+ sq_err <- (x - m)^2
+ sum(sq_err) / (n - 1)
+ }
>
> var(1:10)
> variance(1:10)
[1] 9.166667
[1] 9.166667
> skewness <- function(x, na.rm = FALSE) {
+ n <- length(x)
+ m <- mean(x, na.rm = na.rm)
+ v <- var(x, na.rm = na.rm)
+ (sum((x - m) ^ 3) / (n - 2)) / v ^ (3 / 2)
+ }
>
> skewness(c(5, 10, 15, 100))
[1] 1.463378
Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.
> both_na <- function(x, y) {
+ stopifnot(length(x)==length(y))
+ sum(is.na(x) & is.na(y))
+ }
>
> both_na(
+ c(NA, NA, 1, 2),
+ c(NA, 1, NA, 2)
+ )
[1] 1
[1] 3
Error in both_na(c(NA, NA, 1, 2, NA, NA, 1, 3), c(NA, 1, NA, 2, NA, NA, : length(x) == length(y) is not TRUE
> is_directory <- function(x) file.info(x)$isdir
> is_readable <- function(x) file.access(x, 4) == 0
is_directory() checks whether the path in x is a directory. is_readable() checks whether the path in x exists and the user has permission to open it.
Function names and formatting should be easy to read and understand.
> # Too short
> f()
>
> # Not a verb, or descriptive
> my_awesome_function()
>
> # Long, but clear
> impute_missing()
> collapse_years()
You should try to use "snake_case", where each lowercase word is separated by an underscore, or "camelCase", but be consistent!
> # Try not to switch around styles, pick one.
> col_mins <- function(x, y) {}
> rowMaxes <- function(y, x) {}
With a common prefix, autocomplete allows you to see all the members of the family:
> # Good
> input_select()
> input_checkbox()
> input_text()
>
> # Not so good
> select_input()
> checkbox_input()
> text_input()
Avoid overriding existing functions and variables.
Use comments (#) to explain why you're doing something. You should also break your file into readable chunks with section breaks (Cmd/Ctrl + Shift + R). These will then be displayed in the code navigation drop-down at the bottom-left of the editor.
> #(Cmd/Ctrl + Shift + R) Section breaks
>
> # Load data --------------------------------------
>
> # Plot data --------------------------------------
> f1 <- function(string, prefix) {
+ substr(string, 1, nchar(prefix)) == prefix
+ }
>
> f2 <- function(x) {
+ if (length(x) <= 1) return(NULL)
+ x[-length(x)]
+ }
>
> f3 <- function(x, y) {
+ rep(y, length.out = length(x))
+ }
f1() tests whether each element of the character vector string starts with the string prefix; rename it has_prefix(). f2() drops the last element of the vector x; rename it drop_last(). f3() repeats y once for each element of x; rename it repeat_value().
[1]  TRUE  TRUE FALSE
[1] 1 2 3
[1] 5 5 5 5
rnorm() samples from the univariate normal distribution, while MASS::mvrnorm() samples from the multivariate normal distribution. The main arguments of rnorm() are n, mean, and sd; the main arguments of MASS::mvrnorm() are n, mu, and Sigma. To be consistent they should have the same names.
norm_r() and norm_d() start with the distribution type and end with the function; rnorm() and dnorm() start with the function and end with the distribution. R uses this second format.
You'll often have to conditionally execute code.
> if (condition) {
+ # code executed when condition is TRUE
+ } else {
+ # code executed when condition is FALSE
+ }
The goal of this function is to return a logical vector describing whether or not each element of a vector is named. A function returns the last value that it computed.
> has_name <- function(x) {
+ nms <- names(x)
+ if (is.null(nms)) {
+ rep(FALSE, length(x))
+ } else {
+ !is.na(nms) & nms != ""
+ }
+ }
The condition must evaluate to either TRUE or FALSE. If it's a vector, you'll get a warning message; if it's an NA, you'll get an error.
> if (c(TRUE, FALSE)) {}
> #> Warning in if (c(TRUE, FALSE)) {: the
> #condition has length > 1 and only the
> #> first element will be used
> #> NULL
>
> if (NA) {}
> #> Error in if (NA) {: missing value where
> #TRUE/FALSE needed
You can use || (or) and && (and) to combine multiple logical expressions. These operators are "short-circuiting": as soon as || sees its first TRUE it returns TRUE without computing anything else, and as soon as && sees its first FALSE it returns FALSE.
You should never use | or & in an if statement: these are vectorised operations that apply to multiple values (vectorised meaning the function operates on all elements of a vector without looping over them one at a time). If you do have a logical vector, you can use any() or all() to collapse it to a single value.
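A minimal sketch (not from the book's code) of collapsing a logical vector before testing it:
> x <- c(TRUE, TRUE, FALSE)
> if (any(x)) "at least one is TRUE" else "none are TRUE"
[1] "at least one is TRUE"
> if (all(x)) "all are TRUE" else "not all are TRUE"
[1] "not all are TRUE"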
== is also vectorised, which means that it’s easy to get more than one output. Either check the length is already 1, collapse with all() or any(), or use the non-vectorised identical(). identical() is very strict: it always returns either a single TRUE or a single FALSE, and doesn’t coerce types. This means that you need to be careful when comparing integers and doubles:
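The FALSE below comes from a comparison of an integer and a double along these lines:
> identical(0L, 0)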
[1] FALSE
You also need to be wary of floating point numbers. Instead use dplyr::near() for comparisons.
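The outputs that follow come from a comparison along these lines (a sketch):
> x <- sqrt(2) ^ 2
> x
> x == 2
> x - 2
> dplyr::near(x, 2)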
[1] 2
[1] FALSE
[1] 4.440892e-16
[1] TRUE
You can add numerous else if statements if there are a number of conditions.
Or, the switch() function allows you to evaluate selected code based on position or name.
> function(x, y, op) {
+ switch(op,
+ plus = x + y,
+ minus = x - y,
+ times = x * y,
+ divide = x / y,
+ stop("Unknown op!")
+ )
+ }
Also, cut() can be used to discretise continuous variables.
Both if and function should (almost) always be followed by squiggly brackets ({}), and the contents should be indented by two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else. Always indent the code inside curly braces.
> function(x,y){
+ # Good
+ if (y < 0 && debug) {
+ message("Y is negative")
+ }
+
+ if (y == 0) {
+ log(x)
+ } else {
+ y ^ x
+ }
+ }
>
> function(x,y){
+ # Bad
+ if (y < 0 && debug)
+ message("Y is negative")
+
+ if (y == 0) {
+ log(x)
+ }
+ else {
+ y ^ x
+ }
+ }
For very short statements, one line is okay, but spreading the code over multiple lines is easier to read.
if and ifelse()? .ifelse() has two potential answers.[1] "yes"
Write a greeting function that says "good morning", "good afternoon", or "good evening", depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(); that will make it easier to test your function.)
> library(lubridate)
> greet <- function(time = lubridate::now()) {
+ hr <- lubridate::hour(time)
+ if (hr < 12) {
+ print("good morning")
+ } else if (hr < 17) {
+ print("good afternoon")
+ } else {
+ print("good evening")
+ }
+ }
>
> greet()
> greet(ymd_h("2017-01-08 05"))
> greet(ymd_h("2017-01-08 13"))
> greet(ymd_h("2017-01-08 20"))[1] "good afternoon"
[1] "good morning"
[1] "good afternoon"
[1] "good evening"
Implement a fizzbuzz function. It takes a single number as input. If the number is divisible by three, it returns "fizz". If it's divisible by five, it returns "buzz". If it's divisible by three and five, it returns "fizzbuzz". Otherwise, it returns the number. Make sure you first write working code before you create the function.
> fizzbuzz <- function(x) {
+ stopifnot(length(x) == 1)
+ stopifnot(is.numeric(x))
+ if (!(x %% 3) && !(x %% 5)) {
+ "fizzbuzz"
+ } else if (!(x %% 3)) {
+ "fizz"
+ } else if (!(x %% 5)) {
+ "buzz"
+ } else {
+ as.character(x)
+ }
+ }
>
> fizzbuzz(6)
> fizzbuzz(10)
> fizzbuzz(15)
> fizzbuzz(2)[1] "fizz"
[1] "buzz"
[1] "fizzbuzz"
[1] "2"
> fizzbuzz_vec <- function(x) {
+ case_when(!(x %% 3) & !(x %% 5) ~ "fizzbuzz",
+ !(x %% 3) ~ "fizz",
+ !(x %% 5) ~ "buzz",
+ TRUE ~ as.character(x)
+ )
+ }
> fizzbuzz_vec(c(0, 1, 2, 3, 5, 9, 10, 12, 15))[1] "fizzbuzz" "1" "2" "fizz" "buzz" "fizz" "buzz"
[8] "fizz" "fizzbuzz"
> fizzbuzz_vec2 <- function(x) {
+ y <- as.character(x)
+ # put the individual cases first - any elements divisible by both 3 and 5
+ # will be overwritten with fizzbuzz later
+ y[!(x %% 3)] <- "fizz"
+ y[!(x %% 3)] <- "buzz"
+ y[!(x %% 3) & !(x %% 5)] <- "fizzbuzz"
+ y
+ }
>
> fizzbuzz_vec2(c(0, 1, 2, 3, 5, 9, 10, 12, 15))[1] "fizzbuzz" "1" "2" "buzz" "5" "buzz" "10"
[8] "buzz" "fizzbuzz"
How could you use cut() to simplify this set of nested if-else statements? How would you change the call to cut() if I'd used < instead of <=?
> #less than or equal to
> temp <- seq(-10, 50, by = 5)
> cut(temp, c(-Inf, 0, 10, 20, 30, Inf),
+ right = TRUE,
+ labels = c("freezing", "cold",
+ "cool", "warm", "hot")
+ )
 [1] freezing freezing freezing cold cold cool cool warm
[9] warm hot hot hot hot
Levels: freezing cold cool warm hot
> #less than
> temp <- seq(-10, 50, by = 5)
> cut(temp, c(-Inf, 0, 10, 20, 30, Inf),
+ right = FALSE,
+ labels = c("freezing", "cold",
+ "cool", "warm", "hot")
+ )
 [1] freezing freezing cold cold cool cool warm warm
[9] hot hot hot hot hot
Levels: freezing cold cool warm hot
switch() with numeric values?[1] "apple"
[1] "banana"
> # only uses the integer part
> switch(1.2, "apple", "banana", "cantaloupe")
> switch(2.8, "apple", "banana", "cantaloupe")[1] "apple"
[1] "banana"
What does this switch() call do? What happens if x is "e"? When switch() encounters an argument with a missing value, like a = , it falls through and returns the value of the next argument with a non-missing value, so "a" and "b" both return "ab", and "c" and "d" both return "cd". If x is "e" or "f", nothing matches and switch() invisibly returns NULL, so nothing is printed.
> switcheroo <- function(x) {
+ switch(x,
+ a = ,
+ b = "ab",
+ c = ,
+ d = "cd"
+ )
+ }
>
> switcheroo("a")
> switcheroo("b")
> switcheroo("c")
> switcheroo("d")
> switcheroo("e")
> switcheroo("f")[1] "ab"
[1] "ab"
[1] "cd"
[1] "cd"
A function's arguments typically fall into two sets: those that supply the data to compute on, and those that control the details of the computation.
log() - the data is x, and the detail is the base of the logarithm.
mean() - the data is x, and the details are how much data to trim from the ends (trim) and how to handle missing values (na.rm).
t.test() - the data are x and y, and the details of the test are alternative, mu, paired, var.equal, and conf.level.
str_c() - you can supply any number of strings to ..., and the details of the concatenation are controlled by sep and collapse.
The data arguments should come first, followed by the detail arguments, which should usually have default values.
> # Compute confidence interval around mean
> # using normal approximation
> mean_ci <- function(x, conf = 0.95) {
+ se <- sd(x) / sqrt(length(x))
+ alpha <- 1 - conf
+ mean(x) + se * qnorm(c(alpha / 2,
+ 1 - alpha / 2))
+ }
>
> x <- runif(100)
> mean_ci(x)
[1] 0.4002615 0.5165031
[1] 0.3819986 0.5347660
If you override the default value of a detail argument, you should use the full name:
> # Good
> mean(1:10, na.rm = TRUE)
>
> # Bad
> mean(x = 1:10, , FALSE)
> mean(, TRUE, x = c(1:10, NA))
You should place a space around = in function calls, and always put a space after a comma. Using whitespace makes it easier to skim the function for the important components.
> # Good
> average <- mean(feet / 12 + inches, na.rm = TRUE)
>
> # Bad
> average<-mean(feet/12+inches,na.rm=TRUE)
Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names:
x, y, z: vectors.
w: a vector of weights.
df: a data frame.
i, j: numeric indices (typically rows and columns).
n: length, or number of rows.
p: number of columns.
Otherwise, consider matching the names of arguments in existing R functions. For example, use na.rm to determine if missing values should be removed.
It’s easy to call your function with invalid inputs. To avoid this problem, it’s often useful to make constraints explicit.
> wt_mean <- function(x, w) {
+ sum(x * w) / sum(w)
+ }
> wt_var <- function(x, w) {
+ mu <- wt_mean(x, w)
+ sum(w * (x - mu) ^ 2) / sum(w)
+ }
> wt_sd <- function(x, w) {
+ sqrt(wt_var(x, w))
+ }
What happens if x and w are not the same length?
[1] 7.666667
In this case, because of R’s vector recycling rules, we don’t get an error.
It’s good practice to check important preconditions, and throw an error (with stop()), if they are not true:
> wt_mean <- function(x, w) {
+ if (length(x) != length(w)) {
+ stop("`x` and `w` must be the same length", call. = FALSE)
+ }
+ sum(w * x) / sum(w)
+ }
Be careful not to take this too far, though; there's a trade-off between robustness and the effort of checking everything. For example, you could also check na.rm:
> wt_mean <- function(x, w, na.rm = FALSE) {
+ if (!is.logical(na.rm)) {
+ stop("`na.rm` must be logical")
+ }
+ if (length(na.rm) != 1) {
+ stop("`na.rm` must be length 1")
+ }
+ if (length(x) != length(w)) {
+ stop("`x` and `w` must be the same length", call. = FALSE)
+ }
+
+ if (na.rm) {
+ miss <- is.na(x) | is.na(w)
+ x <- x[!miss]
+ w <- w[!miss]
+ }
+ sum(w * x) / sum(w)
+ }
This is a lot of extra work for little additional gain. A useful compromise is the built-in stopifnot(): it checks that each argument is TRUE, and produces a generic error message if not.
> wt_mean <- function(x, w, na.rm = FALSE) {
+ stopifnot(is.logical(na.rm), length(na.rm) == 1)
+ stopifnot(length(x) == length(w))
+
+ if (na.rm) {
+ miss <- is.na(x) | is.na(w)
+ x <- x[!miss]
+ w <- w[!miss]
+ }
+ sum(w * x) / sum(w)
+ }
> wt_mean(1:6, 6:1, na.rm = "foo")Error in wt_mean(1:6, 6:1, na.rm = "foo"): is.logical(na.rm) is not TRUE
Many functions in R take an arbitrary number of inputs:
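The two outputs below come from calls like these (a sketch):
> sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> stringr::str_c("a", "b", "c", "d", "e", "f")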
[1] 55
[1] "abcdef"
They rely on a special argument: ... (pronounced dot-dot-dot). This special argument captures any number of arguments that aren’t otherwise matched.
It’s useful because you can then send those ... on to another function. This is a useful catch-all if your function primarily wraps another function.
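For example, a commas() wrapper around stringr::str_c(); the output below comes from a definition and call along these lines (a sketch):
> commas <- function(...) stringr::str_c(..., collapse = ", ")
> commas(letters[1:10])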
[1] "a, b, c, d, e, f, g, h, i, j"
> rule <- function(..., pad = "-") {
+ title <- paste0(...)
+ width <- getOption("width") -
+ nchar(title) - 5
+ cat(title, " ", stringr::str_dup(pad,
+ width), "\n", sep = "")
+ }
> rule("Important output")Important output -----------------------------------------------------------
However, any misspelled arguments will not raise an error. This makes it easy for typos to go unnoticed:
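For example, a misspelled na.rm silently ends up in ...; the output below comes from a call along these lines:
> x <- c(1, 2)
> sum(x, na.mr = TRUE)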
[1] 4
If you just want to capture the values of the ..., use list(...).
Arguments in R are lazily evaluated: they’re not computed until they’re needed. That means if they’re never used, they’re never called.
commas(letters, collapse = "-") do? Why?[1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z"
Error in str_c(..., collapse = ", "): formal argument "collapse" matched by multiple actual arguments
> commas <- function(..., collapse = ", ") {
+ str_c(..., collapse = collapse)
+ }
> commas(letters, collapse = "-")[1] "a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z"
pad argument, e.g. rule("Title", pad = "-+"). Why doesn’t this currently work? How could you fix it?stringr::str_length().> rule <- function(..., pad = "-") {
+ title <- paste0(...)
+ width <- getOption("width") - nchar(title) - 5
+ cat(title, " ", str_dup(pad, width), "\n", sep = "")
+ }
>
> rule("Important output")Important output -----------------------------------------------------------
Valuable output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> rule2 <- function(..., pad = "-") {
+ title <- paste0(...)
+ width <- getOption("width") - nchar(title) - 5
+ cat(title, " ", str_dup(pad,
+ width/stringr::str_length(pad)),
+ "\n", sep = "")
+ }
> rule2("Valuable output", pad = "-+")Valuable output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
What does the trim argument to mean() do? When might you use it? The trim argument trims a fraction of observations from each end of the vector before the mean is calculated; this is useful for limiting the influence of outliers.
The default value for the method argument to cor() is c("pearson", "kendall", "spearman"). What does that mean? What value is used by default? It means the method argument can take one of those three values; the first value, "pearson", is used by default.
There are two things you should consider when returning a value: whether returning early makes your function easier to read, and whether you can make your function pipeable.
You can use return() to end the function early.
> complicated_function <- function(x, y, z) {
+ if (length(x) == 0 || length(y) == 0) {
+ return(0)
+ }
+
+ # Complicated code here
+ }
If the first block is very long, by the time you get to the else, you may have forgotten the condition.
> f <- function() {
+ if (x) {
+ # Do
+ # something
+ # that
+ # takes
+ # many
+ # lines
+ # to
+ # express
+ } else {
+ # return something short
+ }
+ }
Instead, use an early return for the simple case.
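A sketch of the early-return version of the skeleton above (x and the returned value are placeholders):
> f <- function() {
+   if (!x) {
+     return(something_short)
+   }
+
+   # Do
+   # something
+   # that
+   # takes
+   # many
+   # lines
+   # to
+   # express
+ }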
If you want to write your own pipeable functions, it’s important to think about the return value. Knowing the return value’s object type will mean that your pipeline will “just work”. For example, with dplyr and tidyr the object type is the data frame.
There are two basic types of pipeable functions: transformations and side-effects. With transformations, an object is passed to the function’s first argument and a modified object is returned. With side-effects, the passed object is not transformed. Instead, the function performs an action on the object, like drawing a plot or saving a file. Side-effects functions should “invisibly” return the first argument, so that while they’re not printed they can still be used in a pipeline. For example, this simple function prints the number of missing values in a data frame:
> show_missings <- function(df) {
+ n <- sum(is.na(df))
+ cat("Missing values: ", n, "\n", sep = "")
+
+ invisible(df)
+ }
If we call it interactively, the invisible() means that the input df doesn't get printed out:
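For example, calling it on mtcars, which has no missing values (the call producing the output below):
> show_missings(mtcars)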
Missing values: 0
But it’s still there, it’s just not printed by default:
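A sketch of the calls behind the output below:
> x <- show_missings(mtcars)
> class(x)
> dim(x)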
Missing values: 0
[1] "data.frame"
[1] 32 11
And we can still use it in a pipe:
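A sketch of a pipeline producing the two counts below; it introduces missing values into mpg between the two calls:
> mtcars %>%
+   show_missings() %>%
+   mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>%
+   show_missings()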
Missing values: 0
Missing values: 18
The last component of a function is its environment. The environment of a function controls how R finds the value associated with a name. For example, take this function:
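The function in question looks like this (a sketch matching the outputs below):
> f <- function(x) {
+   x + y
+ }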
In many programming languages, this would be an error, because y is not defined inside the function. In R, this is valid code because R uses rules called lexical scoping to find the value associated with a name. Since y is not defined inside the function, R will look in the environment where the function was defined:
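A sketch of the calls behind the two results below:
> y <- 100
> f(10)
> y <- 1000
> f(10)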
[1] 110
[1] 1010
This allows you to do devious things like:
> `+` <- function(x, y) {
+ if (runif(1) < 0.1) {
+ sum(x, y)
+ } else {
+ sum(x, y) * 1.1
+ }
+ }
> table(replicate(1000, 1 + 2))
3 3.3
88 912
There are two types of vectors: atomic vectors and lists.
Atomic vectors are homogeneous, while lists can be heterogeneous.
NULL is often used to represent the absence of a vector, while NA is used to represent the absence of a value in a vector.
You can check the type of a vector with typeof():
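The two outputs below come from calls like these:
> typeof(letters)
> typeof(1:10)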
[1] "character"
[1] "integer"
You can check its length with length():
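A sketch of the call behind the output below:
> x <- list("a", "b", 1:10)
> length(x)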
[1] 3
Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create augmented vectors, which build on additional behaviour. There are three important types of augmented vector: factors, dates and date-times, and tibbles/data frames.
The four most important types of atomic vectors are logical, integer, double, and character.
Logical vectors can take only three possible values: FALSE, TRUE, and NA.
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[1] TRUE TRUE FALSE NA
Integer and double vectors are known collectively as numeric vectors. In R, numbers are doubles by default. To make an integer, place an L after the number:
[1] "double"
[1] "integer"
[1] 1.5
Integers vs. Doubles
[1] 2
[1] 4.440892e-16
Instead of comparing floating point numbers using ==, you should use dplyr::near() which allows for some numerical tolerance.
Integers have one special value, NA, while doubles have four: NA, NaN, Inf, and -Inf. All three special values NaN, Inf, and -Inf can arise during division:
[1] -Inf  NaN  Inf
Avoid using == to check for these other special values. Instead use the helper functions is.finite(), is.infinite(), is.nan(), and is.na().
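The four TRUEs below come from calls such as these (assumed inputs):
> is.finite(1)
> is.infinite(-Inf)
> is.nan(NaN)
> is.na(NA)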
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.
Each unique string is only stored in memory once, and every use of the string points to that representation. This reduces the amount of memory needed by duplicated strings.
Each type of atomic vector has its own missing value:
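The typed missing values that print as NA below are:
> NA            # logical
> NA_integer_   # integer
> NA_real_      # double
> NA_character_ # character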
[1] NA
[1] NA
[1] NA
[1] NA
You can always use NA and it will be converted to the correct type. However, there are some functions that are strict about their inputs.
What is the difference between is.finite(x) and !is.infinite(x)? is.finite() is FALSE for NA and NaN as well as for Inf and -Inf, while !is.infinite() is TRUE for NA and NaN, because they are not infinite:
[1]  TRUE FALSE FALSE FALSE FALSE
[1] TRUE TRUE TRUE FALSE FALSE
Read the source code for dplyr::near(). (Hint: to see the source code, drop the ().) How does it work?
function (x, y, tol = .Machine$double.eps^0.5)
{
abs(x - y) < tol
}
<bytecode: 0x000000001cf0a838>
<environment: namespace:dplyr>
It checks whether |x - y| is less than tol, which defaults to .Machine$double.eps^0.5; .Machine$double.eps is the smallest positive floating-point number x such that 1 + x != 1.
The range of integer values that R can represent in an integer vector is \(\pm(2^{31}-1)\):
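You can check this limit directly:
> .Machine$integer.max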
[1] 2147483647
Doubles can represent numbers up to about \(\pm 1.8 \times 10^{308}\):
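The largest representable double:
> .Machine$double.xmax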
[1] 1.797693e+308
> tibble(
+ x = c(1.8, 1.5, 1.2, 0.8, 0.5, 0.2,
+ -0.2, -0.5, -0.8, -1.2, -1.5, -1.8),
+ `Round down` = floor(x),
+ `Round up` = ceiling(x),
+ `Round towards zero` = trunc(x),
+ `Nearest, round half to even` = round(x)
+ )
# A tibble: 12 x 5
x `Round down` `Round up` `Round towards zero` `Nearest, round half to e~
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1.8 1 2 1 2
2 1.5 1 2 1 2
3 1.2 1 2 1 1
4 0.8 0 1 0 1
5 0.5 0 1 0 0
6 0.2 0 1 0 0
7 -0.2 -1 0 0 0
8 -0.5 -1 0 0 0
9 -0.8 -1 0 0 -1
10 -1.2 -2 -1 -1 -1
11 -1.5 -2 -1 -1 -2
12 -1.8 -2 -1 -1 -2
[1] TRUE FALSE TRUE FALSE TRUE TRUE NA
[1] 1235 134 NA
[1] 1.0 3.5 1000.0 NA 12234.9 1234.0 123.0 1.0
It’s useful to review some of the important tools for working with vectors.
How to convert from one type to another, and when that happens automatically.
How to tell if an object is a specific type of vector.
What happens when you work with vectors of different lengths.
How to name the elements of a vector.
How to pull out elements of interest.
There are two ways to convert, or coerce, one type of vector to another:
Explicit coercion happens when you call a function like as.logical(), as.integer(), as.double(), or as.character(). Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr col_types specification.
Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector. For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.
The most important type of implicit coercion: using a logical vector in a numeric context. In this case TRUE is converted to 1 and FALSE converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:
> x <- sample(1:20, 100, replace = TRUE)
> z <- x > 10
> sum(z) # how many are greater than 10?
[1] 56
> # what proportion are greater than 10?
> scales::percent(mean(z))
[1] "56%"
It’s also important to understand what happens when you try and create a vector containing multiple types with c(): the most complex type always wins.
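The three outputs below come from combinations like these:
> typeof(c(TRUE, 1L))
> typeof(c(1L, 1.5))
> typeof(c(1.5, "a"))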
[1] "integer"
[1] "double"
[1] "character"
To test the type of a vector, you can use the is_* functions provided by purrr:
> is_logical(TRUE)
> is_integer(2L)
> is_double(2.5)
> is_numeric(3.5)
> is_character("x")
> is_atomic(1)
> is_list(list(1:10,2:5))
> is_vector(c("x","y"))[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
Each predicate also comes with a “scalar” version, like is_scalar_atomic(), which checks that the length is 1. This is useful, for example, if you want to check that an argument to your function is a single logical value.
recycling - the shorter vector is repeated, or recycled, to the same length as the longer vector.
vectorized - the function will operate on all elements of a vector without needing to loop through and act on each element one at a time.
[1] 102 105 103 104 109 108 101 110 106 107
[1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
R will expand the shortest vector to the same length as the longest, so called recycling. This is silent except when the length of the longer is not an integer multiple of the length of the shorter:
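The warning and output below come from an expression like:
> 1:10 + 1:3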
Warning in 1:10 + 1:3: longer object length is not a multiple of shorter object
length
[1] 2 4 6 5 7 9 8 10 12 11
The vectorised functions in tidyverse will throw errors when you recycle anything other than a scalar. If you do want to recycle, you’ll need to do it yourself with rep().
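A sketch of the tibble() calls behind the error and the two tibbles below:
> tibble(x = 1:4, y = 1:2)
> tibble(x = 1:4, y = rep(1:2, 2))
> tibble(x = 1:4, y = rep(1:2, each = 2))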
Error: Tibble columns must have compatible sizes.
* Size 4: Existing data.
* Size 2: Column `y`.
i Only values of size one are recycled.
# A tibble: 4 x 2
x y
<int> <int>
1 1 1
2 2 2
3 3 1
4 4 2
# A tibble: 4 x 2
x y
<int> <int>
1 1 1
2 2 1
3 3 2
4 4 2
All types of vectors can be named. You can name them during creation with c():
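The named vector below can be created like this:
> c(x = 1, y = 2, z = 4)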
x y z
1 2 4
Or after the fact with purrr::set_names():
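For example (a reconstruction of the call behind the output below):
> purrr::set_names(1:3, c("a", "b", "c"))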
a b c
1 2 3
filter() only works with tibbles, so we'll need a new tool for vectors: [. [ is the subsetting function, and is called like x[a]. There are four types of things that you can subset a vector with:
Subsetting with positive integers keeps the elements at those positions:
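A sketch of the vector x used in the outputs of this section, and of the call behind the first one:
> x <- c("one", "two", "three", "four", "five")
> x[c(3, 2, 5)]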
[1] "three" "two" "five"
By repeating a position, you can actually make a longer output than input:
[1] "one" "one" "five" "five" "five" "two"
Negative values drop the elements at the specified positions:
[1] "two" "four"
It’s an error to mix positive and negative values:
Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts
The error message mentions subsetting with zero, which returns no values:
character(0)
Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions:
[1] 10  3  5  8  1
[1] 10 NA 8 NA
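If you have a named vector, you can subset it with a character vector; the named output below comes from calls like (a sketch):
> x <- c(abc = 1, def = 2, xyz = 5)
> x[c("xyz", "def")]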
xyz def
5 2
The simplest type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but it is useful when subsetting matrices (and other high-dimensional structures) because it lets you select all the rows or all the columns by leaving that index blank. For example, if x is 2d, x[1, ] selects the first row and all the columns, and x[, -1] selects all rows and all columns except the first.
There is an important variation of [ called [[. [[ only ever extracts a single element, and always drops names. It's a good idea to use it whenever you want to make it clear that you're extracting a single item, as in a for loop.
What does mean(is.na(x)) tell you about a vector x? What about sum(!is.finite(x))? mean(is.na(x)) gives the proportion of missing values, and sum(!is.finite(x)) counts the elements that are NA, NaN, Inf, or -Inf:
[1] 0.2857143
[1] 4
Carefully read the documentation of is.vector(). What does it actually test for? Why does is.atomic() not agree with the definition of atomic vectors above? is.vector() only checks whether the object has no attributes other than names:
[1] TRUE
is.atomic() explicitly checks whether an object is one of the atomic types ("logical", "integer", "numeric", "complex", "character", and "raw") or NULL:
[1] TRUE
[1] FALSE
Compare and contrast setNames() with purrr::set_names(). setNames() takes two arguments: a vector to be named and a vector of names to apply to its elements:
a b c d
1 2 3 4
a b c d
"a" "b" "c" "d"
set_names() has more ways to set the names than setNames():
a b c d
1 2 3 4
a b c d
1 2 3 4
a b c d
"a" "b" "c" "d"
A B C
1 2 3
Error: `nm` must be `NULL` or a character vector the same length as `x`
> last_value <- function(x) {
+ if (length(x)) {
+ x[[length(x)]]
+ } else {
+ x
+ }
+ }
>
> last_value(numeric())
numeric(0)
[1] 1
[1] 10
> even_indices <- function(x) {
+ if (length(x)) {
+ x[seq_along(x) %% 2 == 0]
+ } else {
+ x
+ }
+ }
> even_indices(numeric())
numeric(0)
numeric(0)
[1] 2 4 6 8 10
[1] "b" "d" "f" "h" "j" "l" "n" "p" "r" "t" "v" "x" "z"
> not_last <- function(x) {
+ n <- length(x)
+ if (n) {
+ x[-n]
+ } else {
+ # n == 0
+ x
+ }
+ }
>
> not_last(1:3)
[1] 1 2
> even_numbers2 <- function(x) {
+ x[!is.infinite(x) & !is.nan(x) & (x %% 2 == 0)]
+ }
> even_numbers2(c(0:4, NA, NaN, Inf, -Inf))
[1]  0  2  4 NA
Why is x[-which(x > 0)] not the same as x[x <= 0]? They treat NaN differently: which() ignores NA and NaN, so -which(x > 0) keeps NaN, whereas NaN <= 0 evaluates to NA, so x[x <= 0] returns NA in its place:
[1]   -1    0 -Inf  NaN   NA
[1] -1 0 -Inf NA NA
Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with list():
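The printed list below comes from calls like these:
> x <- list(1, 2, 3)
> x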
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
A very useful tool for working with lists is str() because it focuses on the structure, not the contents.
List of 3
$ : num 1
$ : num 2
$ : num 3
List of 3
$ a: num 1
$ b: num 2
$ c: num 3
Unlike atomic vectors, list() can contain a mix of objects:
List of 4
$ : chr "a"
$ : int 1
$ : num 1.5
$ : logi TRUE
Lists can even contain other lists.
List of 2
$ :List of 2
..$ : num 1
..$ : num 2
$ :List of 2
..$ : num 3
..$ : num 4
To explain more complicated list manipulation functions, it’s helpful to have a visual representation of lists. For example, take these three lists:
> x1 <- list(c(1, 2), c(3, 4))
> x2 <- list(list(1, 2), list(3, 4))
> x3 <- list(1, list(2, list(3)))
There are three ways to subset a list, which can be illustrated with a list named a:
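A sketch of the list a used in the examples that follow (its definition matches the str() output below):
> a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))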
[ extracts a sub-list. The result will always be a list.
List of 2
$ a: int [1:3] 1 2 3
$ b: chr "a string"
List of 1
$ d:List of 2
..$ : num -1
..$ : num -5
Like with vectors, you can subset with a logical, integer, or character vector.
[[ extracts a single component from a list. It removes a level of hierarchy from the list.
int [1:3] 1 2 3
List of 2
$ : num -1
$ : num -5
$ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.
[1] 1 2 3
[1] 1 2 3
The distinction between [ and [[ is really important for lists, because [[ drills down into the list while [ returns a new, smaller list. Compare the code and output above with the visual representation
Subsetting a tibble works the same way as a list; a data frame can be thought of as a list of columns. The key difference between a list and a tibble is that all the elements (columns) of a tibble must have the same length (number of rows). Lists can have vectors with different lengths as elements.
[1] 1 2
Any vector can contain arbitrary additional metadata through its attributes. You can think of attributes as named list of vectors that can be attached to any object. You can get and set individual attribute values with attr() or see them all at once with attributes().
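A sketch of the calls behind the output below (getting a missing attribute, setting two attributes, then listing them all):
> x <- 1:10
> attr(x, "greeting")
> attr(x, "greeting") <- "Hi!"
> attr(x, "farewell") <- "Bye!"
> attributes(x)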
NULL
$greeting
[1] "Hi!"
$farewell
[1] "Bye!"
There are three very important attributes that are used to implement fundamental parts of R: names (used to name the elements of a vector), dimensions (which make a vector behave like a matrix or array), and class (which implements the S3 object-oriented system).
Here’s what a typical generic function looks like:
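For example, as.Date(), whose definition is printed below:
> as.Date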
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x0000000012fe60a0>
<environment: namespace:base>
The call to “UseMethod” means that this is a generic function, and it will call a specific method, a function, based on the class of the first argument. (All methods are functions; not all functions are methods). You can list all the methods for a generic with methods():
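The listing below comes from a call like:
> methods("as.Date")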
[1] as.Date.character as.Date.default as.Date.factor
[4] as.Date.numeric as.Date.POSIXct as.Date.POSIXlt
[7] as.Date.vctrs_sclr* as.Date.vctrs_vctr*
see '?methods' for accessing help and source code
For example, if x is a character vector, as.Date() will call as.Date.character(); if it’s a factor, it’ll call as.Date.factor().
You can see the specific implementation of a method with getS3method():
function (x, ...)
{
if (inherits(x, "Date"))
x
else if (is.null(x))
.Date(numeric())
else if (is.logical(x) && all(is.na(x)))
.Date(as.numeric(x))
else stop(gettextf("do not know how to convert '%s' to class %s",
deparse1(substitute(x)), dQuote("Date")), domain = NA)
}
<bytecode: 0x000000001d0dc8e8>
<environment: namespace:base>
function (x, origin, ...)
{
if (missing(origin)) {
if (!length(x))
return(.Date(numeric()))
if (!any(is.finite(x)))
return(.Date(x))
stop("'origin' must be supplied")
}
as.Date(origin, ...) + x
}
<bytecode: 0x000000001e2d03e8>
<environment: namespace:base>
The most important S3 generic is print(): it controls how the object is printed when you type its name at the console. Other important generics are the subsetting functions [, [[, and $.
Atomic vectors and lists are the building blocks for other important vector types like factors and dates. These are called augmented vectors, because they are vectors with additional attributes, including class. Because augmented vectors have a class, they behave differently to the atomic vector on which they are built. Four important augmented vectors: factors, dates, date-times, and tibbles (data frames).
Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:
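A sketch of the factor behind the output below:
> x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
> typeof(x)
> attributes(x)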
[1] "integer"
$levels
[1] "ab" "cd" "ef"
$class
[1] "factor"
Dates in R are numeric vectors that represent the number of days since 1 January 1970.
[1] 365
[1] "double"
$class
[1] "Date"
Date-times are numeric vectors with class POSIXct that represent the number of seconds since 1 January 1970. (“POSIXct” stands for “Portable Operating System Interface”, calendar time.)
[1] 3600
attr(,"tzone")
[1] "UTC"
[1] "double"
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] "UTC"
The tzone attribute is optional. It controls how the time is printed, not what absolute time it refers to.
[1] "1969-12-31 17:00:00 PST"
[1] "1969-12-31 20:00:00 EST"
There is another type of date-times called POSIXlt. These are built on top of named lists:
[1] "list"
$names
[1] "sec" "min" "hour" "mday" "mon" "year" "wday" "yday"
[9] "isdst" "zone" "gmtoff"
$class
[1] "POSIXlt" "POSIXt"
$tzone
[1] "US/Eastern" "EST" "EDT"
POSIXcts are always easier to work with, so if you find you have a POSIXlt, you should convert it to a regular date-time with lubridate::as_datetime().
Tibbles are augmented lists: they have class "tbl_df" + "tbl" + "data.frame", and names (column names) and row.names attributes:
[1] "list"
$names
[1] "x" "y"
$row.names
[1] 1 2 3 4 5
$class
[1] "tbl_df" "tbl" "data.frame"
The difference between a tibble and a list is that all the elements of a data frame must be vectors with the same length. All functions that work with tibbles enforce this constraint.
Traditional data.frames have a very similar structure:
[1] "list"
$names
[1] "x" "y"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5
The main difference is the class. The class of tibble includes “data.frame” which means tibbles inherit the regular data frame behaviour by default.
01:00:00
[1] "hms" "difftime"
[1] "double"
$units
[1] "secs"
$class
[1] "hms" "difftime"
# A tibble: 5 x 2
x y
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
Error: Tibble columns must have compatible sizes.
* Size 3: Existing data.
* Size 4: Column `y`.
i Only values of size one are recycled.
Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
Imperative programming - On the imperative side you have tools like for loops and while loops, which make iteration very explicit, so it’s obvious what’s happening. However, for loops are quite verbose, and require quite a bit of bookkeeping code that is duplicated for every for loop.
Functional programming (FP) - offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. With FP you can solve many common iteration problems with less code, more ease, and fewer errors.
Imagine we have this simple tibble:
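A sketch of the tibble (the same df is defined again later in the section on modifying objects):
> df <- tibble::tibble(
+   a = rnorm(10),
+   b = rnorm(10),
+   c = rnorm(10),
+   d = rnorm(10)
+ )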
You want to compute the median of each column. You could do it with copy-and-paste:
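The four medians below come from copy-and-pasted calls like:
> median(df$a)
> median(df$b)
> median(df$c)
> median(df$d)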
[1] -0.3534617
[1] 0.305703
[1] -0.481693
[1] 0.3886924
However, that could be a long and messy process. Instead, we could use a for loop:
> output <- vector("double", ncol(df)) # 1. output
> for (i in seq_along(df)) { # 2. sequence
+ output[[i]] <- median(df[[i]]) # 3. body
+ }
> output
[1] -0.3534617  0.3057030 -0.4816930  0.3886924
Every for loop has three components:
output <- vector("double", length(x)). Before you start the loop, you must always allocate sufficient space for the output. This is very important for efficiency: if you grow the for loop at each iteration using c() (for example), your for loop will be very slow.A general way of creating an empty vector of given length is the vector() function. It has two arguments: the type of the vector (“logical”, “integer”, “double”, “character”, etc) and the length of the vector.
The sequence: i in seq_along(df). This determines what to loop over: each run of the for loop will assign i to a different value from seq_along(df). It's useful to think of i as a pronoun, like "it". You might not have seen seq_along() before. It's a safe version of the familiar 1:length(l), with an important difference: if you have a zero-length vector, seq_along() does the right thing:
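A sketch of the comparison behind the two outputs below:
> y <- vector("double", 0)
> seq_along(y)
> 1:length(y)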
integer(0)
[1] 1 0
The body: output[[i]] <- median(df[[i]]). This is the code that does the work. It's run repeatedly, each time with a different value for i. The first iteration will run output[[1]] <- median(df[[1]]), the second will run output[[2]] <- median(df[[2]]), and so on.
Write for loops to compute the mean of every column in mtcars, determine the type of each column in nycflights13::flights, compute the number of unique values in each column of iris, and generate 10 random normals from distributions with means of -10, 0, 10, and 100.
The mean of every column in mtcars:
> output <- vector("double", ncol(mtcars))
> names(output) <- names(mtcars)
> for (i in names(mtcars)) {
+ output[i] <- mean(mtcars[[i]])
+ }
> output
        mpg         cyl        disp          hp        drat          wt        qsec
20.090625 24.750000 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb
0.437500 0.406250 3.687500 2.812500
type of each column in nycflights13::flights
> output <- vector("list", ncol(nycflights13::flights))
> names(output) <- names(nycflights13::flights)
> for (i in names(nycflights13::flights)) {
+ output[[i]] <- class(nycflights13::flights[[i]])
+ }
> output
$year
[1] "integer"
$month
[1] "integer"
$day
[1] "integer"
$dep_time
[1] "integer"
$sched_dep_time
[1] "integer"
$dep_delay
[1] "numeric"
$arr_time
[1] "integer"
$sched_arr_time
[1] "integer"
$arr_delay
[1] "numeric"
$carrier
[1] "character"
$flight
[1] "integer"
$tailnum
[1] "character"
$origin
[1] "character"
$dest
[1] "character"
$air_time
[1] "numeric"
$distance
[1] "numeric"
$hour
[1] "numeric"
$minute
[1] "numeric"
$time_hour
[1] "POSIXct" "POSIXt"
the number of unique values in each column of iris
> data("iris")
> iris_uniq <- vector("double", ncol(iris))
> names(iris_uniq) <- names(iris)
> for (i in names(iris)) {
+ iris_uniq[i] <- n_distinct(iris[[i]])
+ }
> iris_uniq
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
35 23 43 22 3
10 random normals from distributions with means of -10, 0, 10, and 100
> n <- 10
> # values of the mean
> mu <- c(-10, 0, 10, 100)
> normals <- vector("list", length(mu))
> for (i in seq_along(normals)) {
+ normals[[i]] <- rnorm(n, mean = mu[i])
+ }
> normals
[[1]]
[1] -11.305450 -8.969553 -10.597783 -9.490975 -12.033553 -8.355739
[7] -8.497958 -7.966495 -9.667332 -10.892709
[[2]]
[1] 0.73334559 -0.39124276 0.02113921 1.01163921 0.59841898 -1.04622886
[7] 1.04824229 0.21131357 1.18336142 -1.82724122
[[3]]
[1] 10.671651 10.374447 11.521582 10.735073 9.589511 10.806080 10.519606
[8] 9.582236 11.563617 10.219076
[[4]]
[1] 100.06090 101.17170 100.95919 100.57750 101.63769 102.98548 102.08713
[8] 100.72449 98.89897 99.43802
Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors.
\(A\)
[1] "abcdefghijklmnopqrstuvwxyz"
Answer
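The loop pastes the letters together one at a time, so it can be replaced by a single vectorised call (a sketch):
> str_c(letters, collapse = "")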
[1] "abcdefghijklmnopqrstuvwxyz"
\(B\)
> x <- sample(100)
> std <- 0
> for (i in seq_along(x)) {
+ std <- std + (x[i] - mean(x)) ^ 2
+ }
> std <- sqrt(std / (length(x) - 1))
> std
[1] 29.01149
Answer
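The loop computes the standard deviation, so it can be replaced by:
> sd(x)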
[1] 29.01149
\(C\)
> x <- runif(100)
> out <- vector("numeric", length(x))
> out[1] <- x[1]
> for (i in 2:length(x)) {
+ out[i] <- out[i - 1] + x[i]
+ }
> out
  [1] 0.1440869 1.0679018 1.3891570 2.1269189 2.6886382 3.3821259
[7] 3.9923716 4.6016887 4.9330813 5.3848960 5.7287877 5.7609090
[13] 6.6276744 7.3797277 8.0035553 8.3853359 8.9505014 9.2988127
[19] 9.7329107 10.4536404 11.4071862 12.2734312 12.2970024 12.7295530
[25] 13.1116046 13.2240063 13.7072301 14.2046590 15.0448627 15.9517964
[31] 16.0222470 16.6757728 16.7059309 17.3781322 17.4546802 18.0476092
[37] 18.7081376 19.6130017 20.1667295 20.7087546 21.6915076 22.6413271
[43] 23.5487731 24.0550506 25.0443236 25.5013848 26.4970597 27.1068480
[49] 27.5307976 27.5383474 28.4668664 29.0783425 29.6132066 30.4023030
[55] 31.3647640 31.5877037 31.9190192 32.4822475 33.2450028 33.9801238
[61] 34.5088791 34.6028091 35.0269705 35.1175262 35.3690433 35.5147748
[67] 35.6804987 36.2560811 36.5100650 36.9092297 36.9531377 37.5071875
[73] 37.7281001 38.4011549 39.2863424 39.7743467 40.3965827 40.9600051
[79] 41.4264888 42.1911425 42.4171137 42.9757523 43.9440758 44.3726887
[85] 44.6519331 44.8845706 45.4585461 45.5974488 46.1936807 46.8595879
[91] 47.8160022 48.8087514 48.9079801 49.8920008 49.9395958 50.2672004
[97] 51.0444748 51.8900169 52.4357557 53.1680187
Answer
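The loop computes a cumulative sum, so it can be replaced by:
> cumsum(x)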
[1] 0.1440869 1.0679018 1.3891570 2.1269189 2.6886382 3.3821259
[7] 3.9923716 4.6016887 4.9330813 5.3848960 5.7287877 5.7609090
[13] 6.6276744 7.3797277 8.0035553 8.3853359 8.9505014 9.2988127
[19] 9.7329107 10.4536404 11.4071862 12.2734312 12.2970024 12.7295530
[25] 13.1116046 13.2240063 13.7072301 14.2046590 15.0448627 15.9517964
[31] 16.0222470 16.6757728 16.7059309 17.3781322 17.4546802 18.0476092
[37] 18.7081376 19.6130017 20.1667295 20.7087546 21.6915076 22.6413271
[43] 23.5487731 24.0550506 25.0443236 25.5013848 26.4970597 27.1068480
[49] 27.5307976 27.5383474 28.4668664 29.0783425 29.6132066 30.4023030
[55] 31.3647640 31.5877037 31.9190192 32.4822475 33.2450028 33.9801238
[61] 34.5088791 34.6028091 35.0269705 35.1175262 35.3690433 35.5147748
[67] 35.6804987 36.2560811 36.5100650 36.9092297 36.9531377 37.5071875
[73] 37.7281001 38.4011549 39.2863424 39.7743467 40.3965827 40.9600051
[79] 41.4264888 42.1911425 42.4171137 42.9757523 43.9440758 44.3726887
[85] 44.6519331 44.8845706 45.4585461 45.5974488 46.1936807 46.8595879
[91] 47.8160022 48.8087514 48.9079801 49.8920008 49.9395958 50.2672004
[97] 51.0444748 51.8900169 52.4357557 53.1680187
Combine your function-writing and for-loop skills: write a for loop that prints() the lyrics to the children's song "Alice the camel", convert the nursery rhyme "Ten in the Bed" to a loop, and convert the song "99 Bottles of Beer on the Wall" to a function.
\(A\)
> humps <- c("five", "four", "three", "two", "one", "no")
> for (i in humps) {
+ cat(str_c("Alice the camel has ", rep(i, 3), " humps.",
+ collapse = "\n"
+ ), "\n")
+ if (i == "no") {
+ cat("Now Alice is a horse.\n")
+ } else {
+ cat("So go, Alice, go.\n")
+ }
+ cat("\n")
+ }
Alice the camel has five humps.
Alice the camel has five humps.
Alice the camel has five humps.
So go, Alice, go.
Alice the camel has four humps.
Alice the camel has four humps.
Alice the camel has four humps.
So go, Alice, go.
Alice the camel has three humps.
Alice the camel has three humps.
Alice the camel has three humps.
So go, Alice, go.
Alice the camel has two humps.
Alice the camel has two humps.
Alice the camel has two humps.
So go, Alice, go.
Alice the camel has one humps.
Alice the camel has one humps.
Alice the camel has one humps.
So go, Alice, go.
Alice the camel has no humps.
Alice the camel has no humps.
Alice the camel has no humps.
Now Alice is a horse.
\(B\)
> numbers <- c(
+ "ten", "nine", "eight", "seven", "six", "five",
+ "four", "three", "two", "one"
+ )
> for (i in numbers) {
+ cat(str_c("There were ", i, " in the bed\n"))
+ cat("and the little one said\n")
+ if (i == "one") {
+ cat("I'm lonely...")
+ } else {
+ cat("Roll over, roll over\n")
+ cat("So they all rolled over and one fell out.\n")
+ }
+ cat("\n")
+ }
There were ten in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.
There were nine in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.
There were eight in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.
There were seven in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.
There were six in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.
There were five in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.
There were four in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.
There were three in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.
There were two in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.
There were one in the bed
and the little one said
I'm lonely...
\(C\)
> bottles <- function(n) {
+ if (n > 1) {
+ str_c(n, " bottles")
+ } else if (n == 1) {
+ "1 bottle"
+ } else {
+ "no more bottles"
+ }
+ }
>
> beer_bottles <- function(total_bottles) {
+ # print each lyric
+ for (current_bottles in seq(total_bottles, 0)) {
+ # first line
+ cat(str_to_sentence(str_c(bottles(current_bottles), " of beer on the wall, ", bottles(current_bottles), " of beer.\n")))
+ # second line
+ if (current_bottles > 0) {
+ cat(str_c(
+ "Take one down and pass it around, ", bottles(current_bottles - 1),
+ " of beer on the wall.\n"
+ ))
+ } else {
+ cat(str_c("Go to the store and buy some more, ", bottles(total_bottles), " of beer on the wall.\n")) }
+ cat("\n")
+ }
+ }
> beer_bottles(3)
3 bottles of beer on the wall, 3 bottles of beer.
Take one down and pass it around, 2 bottles of beer on the wall.
2 bottles of beer on the wall, 2 bottles of beer.
Take one down and pass it around, 1 bottle of beer on the wall.
1 bottle of beer on the wall, 1 bottle of beer.
Take one down and pass it around, no more bottles of beer on the wall.
No more bottles of beer on the wall, no more bottles of beer.
Go to the store and buy some more, 3 bottles of beer on the wall.
> add_to_vector <- function(n) {
+ output <- vector("integer", 0)
+ for (i in seq_len(n)) {
+ output <- c(output, i)
+ }
+ output
+ }
> add_to_vector(10)
 [1]  1  2  3  4  5  6  7  8  9 10
> add_to_vector_2 <- function(n) {
+ output <- vector("integer", n)
+ for (i in seq_len(n)) {
+ output[[i]] <- i
+ }
+ output
+ }
> add_to_vector_2(10)
 [1]  1  2  3  4  5  6  7  8  9 10
> library(microbenchmark)
> timings <- microbenchmark(add_to_vector(10000), add_to_vector_2(10000), times = 10)
> timings
Unit: microseconds
expr min lq mean median uq max
add_to_vector(10000) 74960.7 76436.5 82950.63 79618.15 82265.7 117413.2
add_to_vector_2(10000) 374.0 382.0 638.15 555.90 936.8 951.4
neval cld
10 b
10 a
Appending to a vector takes much longer than pre-allocating the vector.
There are four variations on the basic theme of the for loop: modifying an existing object instead of creating a new object, looping over names or values instead of indices, handling outputs of unknown length, and handling sequences of unknown length.
Sometimes you want to use a for loop to modify an existing object. For example, when we wanted to rescale every column in a data frame it wasn’t very efficient:
> df <- tibble(
+ a = rnorm(10),
+ b = rnorm(10),
+ c = rnorm(10),
+ d = rnorm(10)
+ )
> rescale01 <- function(x) {
+ rng <- range(x, na.rm = TRUE)
+ (x - rng[1]) / (rng[2] - rng[1])
+ }
>
> df$a <- rescale01(df$a)
> df$b <- rescale01(df$b)
> df$c <- rescale01(df$c)
> df$d <- rescale01(df$d)
To solve this with a for loop we again think about the three components:
Output: we already have the output — it’s the same as the input
Sequence: we can think about a data frame as a list of columns, so we can iterate over each column with seq_along(df).
Body: apply rescale01().
This gives us:
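A sketch of that loop:
> for (i in seq_along(df)) {
+   df[[i]] <- rescale01(df[[i]])
+ }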
Typically you’ll be modifying a list or data frame with this sort of loop, so remember to use [[, not [.
There are three basic ways to loop over a vector.
Looping over the numeric indices with for (i in seq_along(xs)) and extracting the value with x[[i]].
Loop over the elements: for (x in xs). This is most useful if you only care about side-effects, like plotting or saving a file, because it’s difficult to save the output efficiently.
Loop over the names: for (nm in names(xs)). This gives you name, which you can use to access the value with x[[nm]]. This is useful if you want to use the name in a plot title or a file name. If you’re creating named output, make sure to name the results vector like so:
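For example (a sketch):
> results <- vector("list", length(x))
> names(results) <- names(x)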
Iteration over the numeric indices is the most general form, because given the position you can extract both the name and the value:
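A sketch of extracting both from the index:
> for (i in seq_along(x)) {
+   name <- names(x)[[i]]
+   value <- x[[i]]
+ }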
Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:
> means <- c(0, 1, 2)
>
> output <- double()
> for (i in seq_along(means)) {
+ n <- sample(100, 1)
+ output <- c(output, rnorm(n, means[[i]]))
+ }
> str(output)
 num [1:144] -0.831 0.219 -0.806 1.523 -0.697 ...
But this is not very efficient, because in each iteration R has to copy all the data from the previous iterations. A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:
> out <- vector("list", length(means))
> for (i in seq_along(means)) {
+ n <- sample(100, 1)
+ out[[i]] <- rnorm(n, means[[i]])
+ }
> str(out)
List of 3
$ : num [1:87] 0.2105 0.13719 0.00106 0.4588 -1.41733 ...
$ : num [1:54] 0.965 2.433 -0.512 1.621 -0.187 ...
$ : num [1:11] 4.676 1.363 -0.211 1.046 2.086 ...
num [1:152] 0.2105 0.13719 0.00106 0.4588 -1.41733 ...
unlist() will flatten a list of vectors into a single vector. A stricter option is to use purrr::flatten_dbl() — it will throw an error if the input isn’t a list of doubles.
This pattern occurs in other places too:
You might be generating a long string. Instead of paste()ing together each iteration with the previous, save the output in a character vector and then combine that vector into a single string with paste(output, collapse = "").
You might be generating a big data frame. Instead of sequentially rbind()ing in each iteration, save the output in a list, then use dplyr::bind_rows(output) to combine the output into a single data frame.
Watch out for this pattern. Whenever you see it, switch to a more complex result object, and then combine in one step at the end.
Sometimes you don’t even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can’t do that sort of iteration with the for loop. Instead, you can use a while loop. A while loop is simpler than for loop because it only has two components, a condition and a body:
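The basic skeleton is:
> while (condition) {
+   # body
+ }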
A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can’t rewrite every while loop as a for loop:
> for (i in seq_along(x)) {
+ # body
+ }
>
> # Equivalent to
> i <- 1
> while (i <= length(x)) {
+ # body
+ i <- i + 1
+ }
Here's how we could use a while loop to find how many tries it takes to get three heads in a row:
> flip <- function() sample(c("T", "H"), 1)
>
> flips <- 0
> nheads <- 0
>
> while (nheads < 3) {
+ if (flip() == "H") {
+ nheads <- nheads + 1
+ } else {
+ nheads <- 0
+ }
+ flips <- flips + 1
+ }
> flips
[1] 9
files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.> files <- dir("data/", pattern = "\\.csv$", full.names = TRUE)
> files
> #> [1] "data//file1.csv" "data//file2.csv" "data//file3.csv"> for (i in seq_along(files)) {
+ df_list[[i]] <- read_csv(files[[i]])
+ }
> #> Parsed with column specification:
> #> cols(
> #> X1 = col_double(),
> #> X2 = col_character()
> #> )
> #> Parsed with column specification:
> #> cols(
> #> X1 = col_double(),
> #> X2 = col_character()
> #> )
> #> Parsed with column specification:
> #> cols(
> #> X1 = col_double(),
> #> X2 = col_character()
> #> )
> print(df_list)
> #> [[1]]
> #> # A tibble: 2 x 2
> #> X1 X2
> #> <dbl> <chr>
> #> 1 1 a
> #> 2 2 b
> #>
> #> [[2]]
> #> # A tibble: 2 x 2
> #> X1 X2
> #> <dbl> <chr>
> #> 1 3 c
> #> 2 4 d
> #>
> #> [[3]]
> #> # A tibble: 2 x 2
> #> X1 X2
> #> <dbl> <chr>
> #> 1 5 e
> #> 2 6 f
> df <- dplyr::bind_rows(df_list)
> print(df)
> #> # A tibble: 6 x 2
> #> X1 X2
> #> <dbl> <chr>
> #> 1 1 a
> #> 2 2 b
> #> 3 3 c
> #> 4 4 d
> #> 5 5 e
> #> 6 6 f
Alternatively, the data frames could be combined as they are read, but appending inside the loop is less efficient than binding the list once at the end.
What happens if you use for (nm in names(x)) and x has no names? What if only some of the elements are named? What if the names are not unique? If x has no names, the loop never runs, because names(x) is NULL and the sequence has length zero:
> x <- c(11, 12, 13)
> print(names(x))
> #> NULL
> for (nm in names(x)) {
+ print(nm)
+ print(x[[nm]])
+ }
Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print:
> show_mean(iris)
> #> Sepal.Length: 5.84
> #> Sepal.Width: 3.06
> #> Petal.Length: 3.76
> #> Petal.Width: 1.20
> show_mean <- function(df, digits = 2) {
+ # Get max length of all variable names in the dataset
+ maxstr <- max(str_length(names(df)))
+ for (nm in names(df)) {
+ if (is.numeric(df[[nm]])) {
+ cat(
+ str_c(str_pad(str_c(nm, ":"), maxstr + 1L, side = "right"),
+ format(mean(df[[nm]]), digits = digits, nsmall = digits),
+ sep = " "
+ ),
+ "\n"
+ )
+ }
+ }
+ }
> show_mean(iris)
Sepal.Length: 5.84
Sepal.Width: 3.06
Petal.Length: 3.76
Petal.Width: 1.20
What does this code do? How does it work?
> trans <- list(
+ disp = function(x) x * 0.0163871,
+ am = function(x) {
+ factor(x, labels = c("auto", "manual"))
+ }
+ )
> for (var in names(trans)) {
+ mtcars[[var]] <- trans[[var]](mtcars[[var]])
+ }
This code mutates the disp and am columns:
disp is multiplied by 0.0163871 (converting cubic inches to litres).
am is replaced by a factor variable with the levels "auto" and "manual".
The code works by looping over a named list of functions. It calls the named function in the list on the column of mtcars with the same name, and replaces the values of that column. In other words, trans[["disp"]] is a function, and trans[["disp"]](mtcars[["disp"]]) applies that function to the column of mtcars with the same name.
For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly.
To see why this is important, consider (again) this simple data frame:
Imagine you want to compute the mean of every column. You could do that with a for loop:
> output <- vector("double", length(df))
> for (i in seq_along(df)) {
+ output[[i]] <- mean(df[[i]])
+ }
> output
[1] 0.2969105 0.1263920 -0.4363532 0.1929435
You realise that you’re going to want to compute the means of every column pretty frequently, so you extract it out into a function:
> col_mean <- function(df) {
+ output <- vector("double", length(df))
+ for (i in seq_along(df)) {
+ output[i] <- mean(df[[i]])
+ }
+ output
+ }
But then you think it’d also be helpful to be able to compute the median, and the standard deviation, so you copy and paste your col_mean() function and replace the mean() with median() and sd():
> col_median <- function(df) {
+ output <- vector("double", length(df))
+ for (i in seq_along(df)) {
+ output[i] <- median(df[[i]])
+ }
+ output
+ }
>
> col_sd <- function(df) {
+ output <- vector("double", length(df))
+ for (i in seq_along(df)) {
+ output[i] <- sd(df[[i]])
+ }
+ output
+ }
Now it’s time to think about how to generalize it.
What would you do if you saw a set of functions like this:
> f1 <- function(x) abs(x - mean(x)) ^ 1
> f2 <- function(x) abs(x - mean(x)) ^ 2
> f3 <- function(x) abs(x - mean(x)) ^ 3
You’d notice that there’s a lot of duplication, and extract it out into an additional argument:
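The generalised version takes the power as an extra argument:
f <- function(x, i) abs(x - mean(x)) ^ i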
We can do exactly the same thing with col_mean(), col_median() and col_sd() by adding an argument that supplies the function to apply to each column:
> col_summary <- function(df, fun) {
+ out <- vector("double", length(df))
+ for (i in seq_along(df)) {
+ out[i] <- fun(df[[i]])
+ }
+ out
+ }
> col_summary(df, median)
[1] 0.6253409 0.2379445 -0.2843706 0.4614654
> col_summary(df, mean)
[1] 0.2969105 0.1263920 -0.4363532 0.1929435
The purrr package provides functions that eliminate the need for many common for loops. The apply family of functions in base R (apply(), lapply(), tapply(), etc) solve a similar problem, but purrr is more consistent.
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
How can you solve the problem for a single element of the list? Once you’ve solved that problem, purrr takes care of generalizing your solution to every element in the list.
If you’re solving a complex problem, how can you break it down into bite-sized pieces that allow you to advance one small step towards a solution? With purrr, you get lots of small pieces that you can compose together with the pipe.
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
Read the documentation for apply(). In the 2d case, what two for loops does it generalise?
For an object with two dimensions, such as a matrix or data frame, apply() replaces looping over the rows or columns. The apply() function is used like apply(X, MARGIN, FUN, ...), where X is a matrix or array, FUN is a function to apply, and ... are additional arguments passed to FUN.
When MARGIN = 1, the function is applied to each row. For example, the following calculates the row means of a 5 x 3 matrix of random normals:
> X <- matrix(rnorm(15), nrow = 5)
> X
[,1] [,2] [,3]
[1,] 0.6944777 0.5483574 -0.07868662
[2,] 1.3358468 -1.4439711 0.44619633
[3,] -0.6631026 -0.3226765 0.45548651
[4,] 1.2309291 0.9318699 -1.02463961
[5,] 0.2355328 -1.8492699 1.48251032
> apply(X, 1, mean)
[1] 0.38804949 0.11269068 -0.17676421 0.37938648 -0.04374227
That is equivalent to this for-loop.
> X_row_means <- vector("numeric", length = nrow(X))
> for (i in seq_len(nrow(X))) {
+ X_row_means[[i]] <- mean(X[i, ])
+ }
> X_row_means
[1] 0.38804949 0.11269068 -0.17676421 0.37938648 -0.04374227
Now generate another random matrix:
> X <- matrix(rnorm(15), nrow = 5)
> X
[,1] [,2] [,3]
[1,] -0.5445332 0.6140945 -0.3722522
[2,] 0.8828511 0.8138215 1.2358328
[3,] 0.2435959 -0.9422863 -1.4996260
[4,] -0.7618232 -0.3257123 0.4565981
[5,] -1.5681726 0.1806481 -0.6348601
When MARGIN = 2, apply() is equivalent to a for loop looping over columns. For example, the following calculates the column means of the matrix above:
> apply(X, 2, mean)
[1] -0.34961640 0.06811311 -0.16286146
> X_col_means <- vector("numeric", length = ncol(X))
> for (i in seq_len(ncol(X))) {
+ X_col_means[[i]] <- mean(X[, i])
+ }
> X_col_means
[1] -0.34961640 0.06811311 -0.16286146
Adapt col_summary() so that it only applies to numeric columns. You might want to start with an is_numeric() function that returns a logical vector that has a TRUE corresponding to each numeric column.
The original col_summary() function is:
> col_summary <- function(df, fun) {
+ out <- vector("double", length(df))
+ for (i in seq_along(df)) {
+ out[i] <- fun(df[[i]])
+ }
+ out
+ }
The adapted version adds extra logic to only apply the function to numeric columns.
> col_summary2 <- function(df, fun) {
+ # create an empty vector which will store whether each
+ # column is numeric
+ numeric_cols <- vector("logical", length(df))
+ # test whether each column is numeric
+ for (i in seq_along(df)) {
+ numeric_cols[[i]] <- is.numeric(df[[i]])
+ }
+ # find the indexes of the numeric columns
+ idxs <- which(numeric_cols)
+ # find the number of numeric columns
+ n <- sum(numeric_cols)
+ # create a vector to hold the results
+ out <- vector("double", n)
+ # apply the function only to numeric vectors
+ for (i in seq_along(idxs)) {
+ out[[i]] <- fun(df[[idxs[[i]]]])
+ }
+ # name the vector
+ names(out) <- names(df)[idxs]
+ out
+ }
Let’s test that col_summary2() works by creating a small data frame with some numeric and non-numeric columns.
> df <- tibble(
+ X1 = c(1, 2, 3),
+ X2 = c("A", "B", "C"),
+ X3 = c(0, -1, 5),
+ X4 = c(TRUE, FALSE, TRUE)
+ )
> col_summary2(df, mean)
X1 X3
2.000000 1.333333
The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. There is one function for each type of output:
map() makes a list.
map_lgl() makes a logical vector.
map_int() makes an integer vector.
map_dbl() makes a double vector.
map_chr() makes a character vector.
Each function takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length (and has the same names) as the input. The type of the vector is determined by the suffix to the map function.
The chief benefit of using functions like map() is not speed, but clarity: they make your code easier to write and to read.
We can use these functions to perform the same computations as the last for loop. Those summary functions returned doubles, so we need to use map_dbl():
> df <- tibble(
+ a = rnorm(10),
+ b = rnorm(10),
+ c = rnorm(10),
+ d = rnorm(10)
+ )
>
> map_dbl(df, mean)
a b c d
0.08452007 -0.31266076 -0.12060485 0.17055680
> map_dbl(df, median)
a b c d
0.2558613 -0.2808854 -0.1146915 0.1351632
> map_dbl(df, sd)
a b c d
0.9527619 1.0748197 0.9835811 0.9954018
Compared to using a for loop, the focus is on the operation being performed (i.e. mean(), median(), sd()), not on the bookkeeping required to loop over every element and store the output. This is even more apparent if we use the pipe:
> df %>% map_dbl(mean)
a b c d
0.08452007 -0.31266076 -0.12060485 0.17055680
> df %>% map_dbl(median)
a b c d
0.2558613 -0.2808854 -0.1146915 0.1351632
> df %>% map_dbl(sd)
a b c d
0.9527619 1.0748197 0.9835811 0.9954018
There are a few differences between map_*() and col_summary():
All purrr functions are implemented in C. This makes them a little faster at the expense of readability.
The second argument, .f, the function to apply, can be a formula, a character vector, or an integer vector.
map_*() uses ... (dot-dot-dot) to pass along additional arguments to .f each time it’s called. For example, you can pass trim to mean():
> map_dbl(df, mean, trim = 0.5)
a b c d
0.2558613 -0.2808854 -0.1146915 0.1351632
The map functions also preserve names:
> z <- list(x = 1:3, y = 4:5)
> map_int(z, length)
x y
3 2
There are a few shortcuts that you can use with .f in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits up the mtcars dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:
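A sketch of that toy example, assuming the split is by cyl as in the book (the grouping used to produce the model output printed later may differ):
models <- mtcars %>%
  split(.$cyl) %>%
  map(function(df) lm(mpg ~ wt, data = df))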
The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut: a one-sided formula.
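With the formula shortcut the same fit looks like this (again a sketch):
models <- mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ wt, data = .))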
Here we’ve used . as a pronoun: it refers to the current list element (in the same way that i referred to the current index in the for loop).
When you’re looking at many models, you might want to extract a summary statistic like the \(R^2\). To do that we need to first run summary() and then extract the component called r.squared. We could do that using the shorthand for anonymous functions:
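Assuming the models list sketched above, the anonymous-function shorthand looks like:
models %>%
  map(summary) %>%
  map_dbl(~.$r.squared)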
16 24 32
0.5086326 0.4645102 0.4229655
But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.
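With the string shortcut (same assumption about models):
models %>%
  map(summary) %>%
  map_dbl("r.squared")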
16 24 32
0.5086326 0.4645102 0.4229655
You can also use an integer to select elements by position:
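For example, with a toy list of lists, selecting the second element of each produces the output below:
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)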
[1] 2 5 8
The apply family of functions in base R have some similarities with the purrr functions:
lapply() is basically identical to map(), except that map() is consistent with all the other functions in purrr, and you can use the shortcuts for .f.
Base sapply() is a wrapper around lapply() that automatically simplifies the output. This is useful for interactive work but is problematic in a function because you never know what sort of output you’ll get:
> x1 <- list(
+ c(0.27, 0.37, 0.57, 0.91, 0.20),
+ c(0.90, 0.94, 0.66, 0.63, 0.06),
+ c(0.21, 0.18, 0.69, 0.38, 0.77)
+ )
> x2 <- list(
+ c(0.50, 0.72, 0.99, 0.38, 0.78),
+ c(0.93, 0.21, 0.65, 0.13, 0.27),
+ c(0.39, 0.01, 0.38, 0.87, 0.34)
+ )
>
> threshold <- function(x, cutoff = 0.8) x[x > cutoff]
> x1 %>% sapply(threshold) %>% str()
List of 3
$ : num 0.91
$ : num [1:2] 0.9 0.94
$ : num(0)
> x2 %>% sapply(threshold) %>% str()
num [1:3] 0.99 0.93 0.87
vapply() is a safe alternative to sapply() because you supply an additional argument that defines the type. The only problem with vapply() is that it’s a lot of typing: vapply(df, is.numeric, logical(1)) is equivalent to map_lgl(df, is.numeric). One advantage of vapply() over purrr’s map functions is that it can also produce matrices; the map functions only ever produce vectors.
Write code that uses one of the map functions to compute:
(A) The mean of every column in mtcars.
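One way, as a sketch:
map_dbl(mtcars, mean)  # the printed values reflect the (possibly modified) mtcars in the original session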
mpg cyl disp hp drat wt qsec
20.090625 24.750000 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb
0.437500 0.406250 3.687500 2.812500
(B) The type of each column in nycflights13::flights.
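A sketch using typeof():
map_chr(nycflights13::flights, typeof)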
year month day dep_time sched_dep_time
"integer" "integer" "integer" "integer" "integer"
dep_delay arr_time sched_arr_time arr_delay carrier
"double" "integer" "integer" "double" "character"
flight tailnum origin dest air_time
"integer" "character" "character" "character" "double"
distance hour minute time_hour
"double" "double" "double" "double"
(C) The number of unique values in each column of iris.
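Two equivalent sketches, either of which matches the repeated output below:
map_int(iris, ~length(unique(.)))
map_int(iris, dplyr::n_distinct)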
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
35 23 43 22 3
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
35 23 43 22 3
(D) 10 random normals from distributions with means of -10, 0, 10, and 100.
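A sketch that returns a list of four numeric vectors:
map(c(-10, 0, 10, 100), rnorm, n = 10)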
[[1]]
[1] -10.765580 -10.354792 -8.630696 -9.222892 -9.983079 -10.862739
[7] -8.446143 -10.687488 -11.758374 -9.282130
[[2]]
[1] -0.1717519 0.3488467 -0.8176360 0.2375417 -0.6415341 0.5376673
[7] -1.2825508 -0.9099092 1.0666013 0.2527361
[[3]]
[1] 9.023351 9.738028 9.988458 10.318579 9.676457 10.322253 10.330490
[8] 11.816240 11.948324 11.313721
[[4]]
[1] 99.46283 99.29963 100.60308 99.22069 101.46597 100.22511 98.90328
[8] 98.23925 100.98210 99.17732
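A single logical vector indicating whether each column of a data frame is a factor can be built with map_lgl(); the output below is consistent with ggplot2’s diamonds data (an assumption):
map_lgl(ggplot2::diamonds, is.factor)  # cut, color and clarity are (ordered) factors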
carat cut color clarity depth table price x y z
FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
What happens when you use the map functions on vectors that aren’t lists? What does map(1:5, runif) do? Why?
Map functions work with any vectors, not just lists, but the output is always a list.
> map(1:5, runif)
[[1]]
[1] 0.135653
[[2]]
[1] 0.7360124 0.3761619
[[3]]
[1] 0.9430826 0.3133162 0.2449193
[[4]]
[1] 0.2607224 0.7893160 0.2471994 0.4369824
[[5]]
[1] 0.61941998 0.45527702 0.37159786 0.04816784 0.88527555
The map() function loops through the numbers 1 to 5. For each value, it calls runif() with that number as the first argument, which is the number of samples to draw. The result is a length-five list of numeric vectors of sizes one through five, each containing random samples from a uniform distribution.
What does map(-2:2, rnorm, n = 5) do? Why? What does map_dbl(-2:2, rnorm, n = 5) do? Why?
> map(-2:2, rnorm, n = 5)
[[1]]
[1] -1.065540 -1.428568 -1.889852 -2.568519 -2.483179
[[2]]
[1] 0.5256277 -1.5588777 -1.4982390 -0.5576297 -1.0534056
[[3]]
[1] 0.91434208 1.11712743 -0.05561116 -0.63470759 -0.47005863
[[4]]
[1] 1.7068155 1.5203229 1.9596955 0.3492204 1.3292990
[[5]]
[1] 1.040654 1.068034 1.237006 1.469892 2.135265
This expression takes samples of size five from five normal distributions with means of -2, -1, 0, 1, and 2, but the same standard deviation (1). It returns a list in which each element is a numeric vector of length 5.
However, if we use map_dbl() instead, the expression raises an error:
> map_dbl(-2:2, rnorm, n = 5)
Error: Result 1 must be a single double, not a double vector of length 5
This is because the map_dbl() function requires the function it applies to each element to return a numeric vector of length one.
To return a numeric vector, use flatten_dbl() to coerce the list returned by map() to a numeric vector:
> flatten_dbl(map(-2:2, rnorm, n = 5))
[1] -4.3235094 -2.8643875 -1.3219671 -1.5742265 -1.9908245 -0.3568805
[7] -0.4634469 0.6073714 -2.3118770 -0.2788020 0.7030964 -2.7938192
[13] 1.1452780 1.1432346 0.8491727 0.4047326 -1.3385323 3.3168344
[19] 0.4558347 1.3934177 2.8659056 1.3200135 1.9807593 1.8088077
[25] 1.6873271
Rewrite map(x, function(df) lm(mpg ~ wt, data = df)) to eliminate the anonymous function.
> map(x, ~ lm(mpg ~ wt, data = .))
$`16`
Call:
lm(formula = mpg ~ wt, data = .)
Coefficients:
(Intercept) wt
39.571 -5.647
$`24`
Call:
lm(formula = mpg ~ wt, data = .)
Coefficients:
(Intercept) wt
28.41 -2.78
$`32`
Call:
lm(formula = mpg ~ wt, data = .)
Coefficients:
(Intercept) wt
23.868 -2.192
When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail. When this happens, you’ll get an error message, and no output.
safely() is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:
result is the original result. If there was an error, this will be NULL.
error is an error object. If the operation was successful, this will be NULL.
The try() function in base R is similar, but because it sometimes returns the original result and sometimes returns an error object, it’s more difficult to work with.
Let’s illustrate this with a simple example, log():
> safe_log <- safely(log)
> str(safe_log(10))
List of 2
$ result: num 2.3
$ error : NULL
> str(safe_log("a"))
List of 2
$ result: NULL
$ error :List of 2
..$ message: chr "non-numeric argument to mathematical function"
..$ call : language .Primitive("log")(x, base)
..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
When the function succeeds, the result element contains the result and the error element is NULL. When the function fails, the result element is NULL and the error element contains an error object.
safely() is designed to work with map:
> x <- list(1, 10, "a")
> y <- x %>% map(safely(log))
> str(y)
List of 3
$ :List of 2
..$ result: num 0
..$ error : NULL
$ :List of 2
..$ result: num 2.3
..$ error : NULL
$ :List of 2
..$ result: NULL
..$ error :List of 2
.. ..$ message: chr "non-numeric argument to mathematical function"
.. ..$ call : language .Primitive("log")(x, base)
.. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
This would be easier to work with if we had two lists: one of all the errors and one of all the output. That’s easy to get with purrr::transpose():
> y <- y %>% transpose()
> str(y)
List of 2
$ result:List of 3
..$ : num 0
..$ : num 2.3
..$ : NULL
$ error :List of 3
..$ : NULL
..$ : NULL
..$ :List of 2
.. ..$ message: chr "non-numeric argument to mathematical function"
.. ..$ call : language .Primitive("log")(x, base)
.. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
Typically you’ll either look at the values of x where y is an error, or work with the values of y that are OK:
> is_ok <- y$error %>% map_lgl(is_null)
> x[!is_ok]
[[1]]
[1] "a"
> y$result[is_ok] %>% flatten_dbl()
[1] 0.000000 2.302585
Purrr provides two other useful adverbs:
Like safely(), possibly() always succeeds. It’s simpler than safely(), because you give it a default value to return when there is an error:
> x %>% map_dbl(possibly(log, NA_real_))
[1] 0.000000 2.302585 NA
quietly() performs a similar role to safely(), but instead of capturing errors, it captures printed output, messages, and warnings:
> x <- list(1, -1)
> x %>% map(quietly(log)) %>% str()
List of 2
$ :List of 4
..$ result : num 0
..$ output : chr ""
..$ warnings: chr(0)
..$ messages: chr(0)
$ :List of 4
..$ result : num NaN
..$ output : chr ""
..$ warnings: chr "NaNs produced"
..$ messages: chr(0)
Often you have multiple related inputs that you need to iterate along in parallel. That’s the job of the map2() and pmap() functions. For example, imagine you want to simulate some random normals with different means:
> mu <- list(5, 10, -3)
> mu %>% map(rnorm, n = 5) %>% str()
List of 3
$ : num [1:5] 5.1 5.02 6.59 3.8 3.81
$ : num [1:5] 10.88 10.87 8.13 11.11 12
$ : num [1:5] -1.7 -3.83 -2.77 -3.57 -1.6
What if you also want to vary the standard deviation? One way to do that would be to iterate over the indices and index into vectors of means and sds:
> sigma <- list(1, 5, 10)
> seq_along(mu) %>% map(~ rnorm(5, mu[[.]], sigma[[.]])) %>% str()
List of 3
$ : num [1:5] 5.82 5.71 5.92 5.47 4.45
$ : num [1:5] 2.56 6.63 9.02 14.79 13.53
$ : num [1:5] 1.82 8.18 -15.91 -15.49 -7.08
But that obfuscates the intent of the code. Instead we could use map2(), which iterates over two vectors in parallel:
> map2(mu, sigma, rnorm, n = 5) %>% str()
List of 3
$ : num [1:5] 7.64 4.3 5.76 4.39 4.22
$ : num [1:5] 1.3 11.19 8.13 11.87 7.21
$ : num [1:5] 4.052 15.024 -0.904 -11.016 -11.34
map2() generates this series of function calls:
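As a sketch, with the mu and sigma lists above, the call expands to roughly:
list(
  rnorm(mu[[1]], sigma[[1]], n = 5),
  rnorm(mu[[2]], sigma[[2]], n = 5),
  rnorm(mu[[3]], sigma[[3]], n = 5)
)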
Note that the arguments that vary for each call come before the function; arguments that are the same for every call come after.
Like map(), map2() is just a wrapper around a for loop:
> map2 <- function(x, y, f, ...) {
+ out <- vector("list", length(x))
+ for (i in seq_along(x)) {
+ out[[i]] <- f(x[[i]], y[[i]], ...)
+ }
+ out
+ }
You could also imagine map3(), map4(), map5(), map6(), etc., but that would get tedious quickly. Instead, purrr provides pmap(), which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:
> n <- list(1, 3, 5)
> args1 <- list(n, mu, sigma)
> args1 %>% pmap(rnorm) %>% str()
List of 3
$ : num 5.44
$ : num [1:3] 6.5 3.27 11.96
$ : num [1:5] 9.856 -0.434 -5.895 -14.533 -10.192
With an unnamed argument list, pmap() matches the list elements to rnorm()’s arguments by position, so the calls look like rnorm(1, 5, 1), rnorm(3, 10, 5), and rnorm(5, -3, 10).
If you don’t name the list’s elements, pmap() will use positional matching when calling the function. That’s a little fragile, and makes the code harder to read, so it’s better to name the arguments:
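A sketch of the named version, following the book:
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>%
  pmap(rnorm) %>%
  str()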
That generates longer, but safer, calls such as rnorm(mean = 5, sd = 1, n = 1).
Since the arguments are all the same length, it makes sense to store them in a data frame:
> params <- tribble(
+ ~mean, ~sd, ~n,
+ 5, 1, 1,
+ 10, 5, 3,
+ -3, 10, 5
+ )
> params %>%
+ pmap(rnorm)
[[1]]
[1] 4.920801
[[2]]
[1] 14.479088 5.613026 1.594877
[[3]]
[1] 4.354443 10.125541 -11.431938 -23.574923 4.859846
There’s one more step up in complexity - as well as varying the arguments to the function you might also vary the function itself:
> f <- c("runif", "rnorm", "rpois")
> param <- list(
+ list(min = -1, max = 1),
+ list(sd = 5),
+ list(lambda = 10)
+ )
To handle this case, you can use invoke_map():
> invoke_map(f, param, n = 5) %>% str()
List of 3
$ : num [1:5] -0.3153 0.3146 -0.8008 0.6843 0.0278
$ : num [1:5] 6.11 5.3 -1.24 1.07 -1.55
$ : int [1:5] 8 12 7 13 11
The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
And again, you can use tribble() to make creating these matching pairs a little easier:
> sim <- tribble(
+ ~f, ~params,
+ "runif", list(min = -1, max = 1),
+ "rnorm", list(sd = 5),
+ "rpois", list(lambda = 10)
+ )
> sim %>%
+ mutate(sim = invoke_map(f, params, n = 10))
# A tibble: 3 x 3
f params sim
<chr> <list> <list>
1 runif <named list [2]> <dbl [10]>
2 rnorm <named list [1]> <dbl [10]>
3 rpois <named list [1]> <int [10]>
walk() is an alternative to map() that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here’s a very simple example:
> x <- list(1, "a", 3)
> x %>% walk(print)
[1] 1
[1] "a"
[1] 3
walk() is generally not that useful compared to walk2() or pwalk(). For example, if you had a list of plots and a vector of file names, you could use pwalk() to save each file to the corresponding location on disk:
> library(ggplot2)
> plots <- mtcars %>%
+ split(.$cyl) %>%
+ map(~ggplot(., aes(mpg, wt)) + geom_point())
> paths <- stringr::str_c(names(plots), ".pdf")
>
> pwalk(list(paths, plots), ggsave, path = tempdir())
walk(), walk2() and pwalk() all invisibly return .x, the first argument. This makes them suitable for use in the middle of pipelines.
Purrr provides a number of other functions that abstract over other types of for loops. They’re used less frequently than the map functions, but they’re useful to know about.
A number of functions work with predicate functions that return either a single TRUE or FALSE.
keep() and discard() keep elements of the input where the predicate is TRUE or FALSE, respectively:
> iris %>% keep(is.factor) %>% str()
'data.frame': 150 obs. of 1 variable:
$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> iris %>% discard(is.factor) %>% str()
'data.frame': 150 obs. of 4 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
some() and every() determine if the predicate is true for any or for all of the elements:
> x <- list(1:5, letters, list(10))
> x %>% some(is_character)
[1] TRUE
> x %>% every(is_vector)
[1] TRUE
detect() finds the first element where the predicate is true; detect_index() returns its position:
> x <- sample(10)
> x
[1] 7 2 10 6 8 4 3 5 9 1
> x %>% detect(~ . > 5)
[1] 7
> x %>% detect_index(~ . > 5)
[1] 1
head_while() and tail_while() take elements from the start or end of a vector while a predicate is true:
> x %>% head_while(~ . > 5)
[1] 7
> x %>% tail_while(~ . > 5)
integer(0)
Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This is useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:
> dfs <- list(
+ age = tibble(name = "John", age = 30),
+ sex = tibble(name = c("John", "Mary"), sex = c("M", "F")),
+ trt = tibble(name = "Mary", treatment = "A")
+ )
>
> dfs %>% reduce(full_join)
# A tibble: 2 x 4
name age sex treatment
<chr> <dbl> <chr> <chr>
1 John 30 M <NA>
2 Mary NA F A
Or maybe you have a list of vectors, and want to find the intersection:
> vs <- list(
+ c(1, 3, 5, 6, 10),
+ c(1, 2, 3, 7, 8, 10),
+ c(1, 2, 3, 4, 8, 9, 10)
+ )
>
> vs %>% reduce(intersect)
[1] 1 3 10
The reduce function takes a “binary” function (i.e. a function with two primary inputs), and applies it repeatedly to a list until there is only a single element left.
accumulate() is similar, but it keeps all the interim results. You could use it to implement a cumulative sum:
> x <- sample(10)
> x
[1] 3 5 2 8 7 1 10 4 6 9
> x %>% accumulate(`+`)
[1] 3 8 10 18 25 26 36 40 46 55
Implement your own version of every() using a for loop.
> # Use ... to pass arguments to the function
> every2 <- function(.x, .p, ...) {
+ for (i in .x) {
+ if (!.p(i, ...)) {
+ # If any element is FALSE, we know not all of them were TRUE
+ return(FALSE)
+ }
+ }
+ # if nothing was FALSE, then every element was TRUE
+ TRUE
+ }
>
> every2(1:3, function(x) {
+ x > 1
+ })
[1] FALSE
> every2(1:3, function(x) {
+ x > 0
+ })
[1] TRUE
A possible base R equivalent of col_summary(), which applies a summary function to every numeric column in a data frame, is the following, but it has bugs for some inputs:
> col_sum3 <- function(df, f) {
+ is_num <- sapply(df, is.numeric)
+ df_num <- df[, is_num]
+
+ sapply(df_num, f)
+ }
$Sepal.Length
[1] 5.843333
$Sepal.Width
[1] 3.057333
$Petal.Length
[1] 3.758
$Petal.Width
[1] 1.199333
The cause of these bugs is the behavior of sapply(). The sapply() function does not guarantee the type of vector it returns, and will return different types of vectors depending on its inputs. If no columns are selected, instead of returning an empty numeric vector, it returns an empty list. This causes an error because a list can’t be used with [.