Functions are a fundamental building block of the R language. You’ve probably used dozens (or even hundreds) of functions written by others, but in order to take your R game to the next level, you’ll need to learn to write your own functions. This course will teach you the fundamentals of writing functions in R so that, among other things, you can make your code more readable, avoid coding errors, and automate repetitive tasks.
Before we embark, we’ll make sure you’re ready for this course by reviewing some of the prerequisites. You’ll review the syntax of writing a function in R, the basic data types in R, subsetting and writing for loops. It won’t all be review, we’ll introduce you to a few new things that will be helpful throughout the course.
The function template is a useful way to start writing a function:
my_fun <- function(arg1, arg2) {
# body
}
my_fun is the variable that you want to assign your function to, arg1 and arg2 are arguments to the function. The template has two arguments, but you can specify any number of arguments, each separated by a comma. You then replace # body with the R code that your function will execute, referring to the inputs by the argument names your specified.
Let’s get to writing functions!
ratio() that takes arguments x and y and returns their ratio, x / y.# Define ratio() function
ratio <- function(x, y) {
x / y
}
# Call ratio() with arguments 3 and 4
ratio(3, 4)
## [1] 0.75
How did you call your function ratio() in the previous exercise? Do you remember the two ways to specify the arguments? (If you have forgotten it might be useful to review the video from Intermediate R)
You probably either did ratio(3, 4), which relies on matching by position, or ratio(x = 3, y = 4), which relies on matching by name.
For functions you and others use often, it’s okay to use positional matching for the first one or two arguments. These are usually the data to be computed on. Good examples are the x argument to the summary functions (mean(), sd(), etc.) and the x and y arguments to plotting functions.
However, beyond the first couple of arguments you should always use matching by name. It makes your code much easier for you and others to read. This is particularly important if the argument is optional, because it has default. When overriding a default value, it’s good practice to use the name.
Notice that when you call a function, you should place a space after a comma, not before (just like in regular English). Using whitespace makes it easier to skim the function for the important components.
Here’s a rather cryptic call to mean(). Rewrite the call to follow the best practices we just discussed. Use the natural ordering of the arguments, which you can find by typing ?mean in the console.
# Rewrite the call to follow best practices
mean(c(1:9, NA), trim = 0.1, na.rm = TRUE)
## [1] 5
Consider the following function that takes a numeric value, x, as input:
f <- function(x) {
if (TRUE) {
return(x + 1)
}
x
}
What will be the result of calling f(2)?
Great job! The body of the conditional is always evaluated and the function returns early without ever running x.
The next few questions are designed to test your understanding of scoping. For first timers, these concepts will be hard and you might not get it the first time through. We decided to include it in the first chapter because it’s good to know now, so when something unexpected happens, you know what to come back to and review.
For now, take a careful look at the function featured in each question. Try to predict what the function will return without running it! Being able to reason about a function is an important skill for R programmers.
Consider the following:
y <- 10
f <- function(x) {
x + y
}
What will f(10) return?
Correct! Because y is not passed in as an argument to the function, R looks outside of the function environment.
Consider the following:
y <- 10
f <- function(x) {
y <- 5
x + y
}
What will f(10) return?
Nice one! The value of x is passed in as an argument to the function and the value of y is defined inside of the function.
This one’s a little different. Instead of predicting what the function will return, you need to predict the value of a variable. We run the following in a fresh session in the console:
f <- function(x) {
y <- 5
x + y
}
f(5)
Now, what will typing y return?
Correct! Even though y is set equal to 5 within the body of the function, the object does not exist in the global environment.
Which of these is NOT one of the atomic types of vectors in R?
logicalintegerdoublecharacterfactorThere a few ways to subset a list. Throughout the course we’ll mostly use double bracket ([[]]) subsetting by index and by name.
That is, my_list[[1]] extracts the first element of the list my_list, and my_list[["name"]] extracts the element in my_list that is called name. If the list is nested you can travel down the hierarchy by recursive subsetting. For example, my_list[[1]][["name]] is the element called name inside the first element of my_list.
A data frame is just a special kind of list, so you can use double bracket subsetting on data frames too, my_df[[1]] will extract the first column of a data frame and my_df[["name"]] will extract the column named name from the data frame.
I’ve set up a list called tricky_list in your workspace. Use the function typeof() combined with double bracket subsetting to answer the following questions.
What type of object is the…
tricky_list?x in tricky_list?x in tricky_list?Note: Use the double bracket notation to extract elements of the list.
tricky_list <- list(
nums = c(0.5433085, -1.1538035, 1.2599351, 0.3498936, -0.2438708, 1.9490014, -0.6785343, 0.3676652, -1.1371768, 0.8308079),
y = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE),
x = list("hello", "hi!", "goodbye!", "bye!"),
model = lm(formula = mpg ~ wt, data = mtcars)
)
# 2nd element in tricky_list
typeof(tricky_list[[2]])
## [1] "logical"
# Element called x in tricky_list
typeof(tricky_list[["x"]])
## [1] "list"
# 2nd element inside the element called x in tricky_list
typeof(tricky_list[["x"]][[2]])
## [1] "character"
Often you won’t know exactly what is inside a list. But, you’ll need to figure it out to get some useful piece of data. Extracting elements from the output of the names() and str() functions is a great way to explore the structure of a list.
Calling names() on a list will give you names at the top level of the list and str() will give you a full description of the entire list (which can sometimes be a little overwhelming).
tricky_list has a regression model stored in it. Let’s see if we can drill down and pull out the slope estimate corresponding to the wt variable.
names() on tricky_list to guess where the regression model is stored.names() and str() on the model element of tricky_list.coefficients element of the model element of tricky_list.wt element of the coefficients element of the model element of tricky_list.Note: Use the double bracket notation to extract elements of the list.
# Guess where the regression model is stored
names(tricky_list)
## [1] "nums" "y" "x" "model"
# Use names() and str() on the model element
names(tricky_list$model)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
str(tricky_list$model)
## List of 12
## $ coefficients : Named num [1:2] 37.29 -5.34
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
## $ residuals : Named num [1:32] -2.28 -0.92 -2.09 1.3 -0.2 ...
## ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ effects : Named num [1:32] -113.65 -29.116 -1.661 1.631 0.111 ...
## ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:32] 23.3 21.9 24.9 20.1 18.9 ...
## ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "wt"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.18 1.05
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 30
## $ xlevels : Named list()
## $ call : language lm(formula = mpg ~ wt, data = mtcars)
## $ terms :Classes 'terms', 'formula' language mpg ~ wt
## .. ..- attr(*, "variables")= language list(mpg, wt)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "mpg" "wt"
## .. .. .. ..$ : chr "wt"
## .. ..- attr(*, "term.labels")= chr "wt"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(mpg, wt)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
## $ model :'data.frame': 32 obs. of 2 variables:
## ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ wt
## .. .. ..- attr(*, "variables")= language list(mpg, wt)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
## .. .. .. .. ..$ : chr "wt"
## .. .. ..- attr(*, "term.labels")= chr "wt"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(mpg, wt)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
## - attr(*, "class")= chr "lm"
# Subset the coefficients element
tricky_list$model$coefficients
## (Intercept) wt
## 37.285126 -5.344472
# Subset the wt element
tricky_list$model$coefficients[["wt"]]
## [1] -5.344472
Excellent work! Another way is to use the coef() function on the model element of tricky_list.
Let’s take a look at the sequence component of our for loop:
i in 1:ncol(df)
Each time our for loop iterates, i takes the next value in 1:ncol(df). This is a pretty common model for a sequence: a sequence of consecutive integers designed to index over one dimension of our data.
What might surprise you is that this isn’t the best way to generate such a sequence, especially when you are using for loops inside your own functions. Let’s look at an example where df is an empty data frame:
df <- data.frame()
1:ncol(df)
for (i in 1:ncol(df)) {
print(median(df[[i]]))
}
Our sequence is now the somewhat non-sensical: 1, 0. You might think you wouldn’t be silly enough to use a for loop with an empty data frame, but once you start writing your own functions, there’s no telling what the input will be.
A better method is to use the seq_along() function. This function generates a sequence along the index of the object passed to it, but handles the empty case much better.
1:ncol(df) sequence in our for loop with seq_along(df).df to an empty data frame.for loop to verify there is no error.df1 <- data.frame(a = c(-1.26272674, 0.05566754, 1.10385911, -0.74148033, -1.06092338, 1.30082893, 1.05572799, 0.50386200, 2.96581138, 0.09361089), b = c(1.03361615, -0.67150813, -0.71684510, -2.02185287, -0.59079083, -1.09392460, 0.01939328, -0.94027939, 0.28912993, -0.21311041), c = c(1.3972461, -0.4300333, -0.8542174, -0.2505189, -1.6230256, 0.3762698, -3.0804333, -0.7753514, -2.4670001, -2.4857207), d = c(0.864732549, -0.779882564, -0.937921427, 0.106842685, -0.503514794, 0.126978932, 0.007054644, -0.554996198, 0.260304380, 1.027297975))
# Replace the 1:ncol(df) sequence
for (i in seq_along(df1)) {
print(median(df1[[i]]))
}
## [1] 0.2987364
## [1] -0.6311495
## [1] -0.8147844
## [1] 0.05694866
# Change the value of df
df <- data.frame()
# Repeat for loop to verify there is no error
for (i in seq_along(df)) {
print(median(df[[i]]))
}
Our for loop does a good job displaying the column medians, but we might want to store these medians in a vector for future use.
Before you start the loop, you must always allocate sufficient space for the output, let’s say an object called output. This is very important for efficiency: if you grow the for loop at each iteration (e.g. using c()), your for loop will be very slow.
A general way of creating an empty vector of given length is the vector() function. It has two arguments: the type of the vector ("logical", "integer", "double", "character", etc.) and the length of the vector.
Then, at each oteration of the loop you must store the output in the corresponding entry of the output vector, i.e. assign the result to output[[i]]. (You might ask why we are using double brackets here when output is a vector. It’s primarily for generalizability: this subsetting will wotk whether output is a vector or a list.)
Let’s edit our loop to store the medians, rather than printing them to the console.
"double" vector of length ncol(df) and name it output.ith element of output.output to the console.df2 <- data.frame(a = c(-2.023584106, 0.095603871, -0.108814482, -1.562952654, -0.001695936, -0.662148862, -1.453288540, -1.224299431, 0.095396832, -0.787936455), b = c(2.1864750, -1.1326362, -1.4270034, -1.3547144, -0.9374406, 2.2645230, -1.8042590, -0.9469231, -1.1271026, 0.1081373), c = c(-0.5798420, 0.1221429, -1.0678397, 1.0096803, 0.1396902, -0.7214353, 1.4382972, 0.3411223, -0.5589626, 0.3901412), d = c(-0.4541943, -1.1309516, -0.9768001, -1.2112622, -1.1094934, -0.3615033, -0.9000185, 0.6892541, 0.1882872, -1.1494923))
# Create new double vector: output
output <- vector("double", ncol(df2))
# Alter the loop
for (i in seq_along(df2)) {
# Change code to store result in output
output[[i]] <- median(df2[[i]])
}
# Print output
output
## [1] -0.7250427 -1.0370128 0.1309165 -0.9384093
Writing your own functions is one way to reduce duplication in your code. In this chapter, you’ll learn when to write a function, how to get started and what to keep in mind when you are writing. You’ll also learn to appreciate that functions have two audiences: the computer (which runs the code) and humans (who need to be able to understand the code).
We have a snippet of code that successfully rescales a column to be between 0 and 1:
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
Our goal over the next few exercises is to turn this snippet, written to work on the a column in the data frame df, into a general purpose rescale01() function that we can apply to any vector.
The first step of turning a snippet into a function is to examine the snippet and decide how many inputs there are, then rewrite the snippet to refer to these inputs using temporary names. These inputs will become the arguments to our function, so choosing good names for them is important. (We’ll talk more about naming arguments in a later exercise.)
In this snippet, there is one input: the numeric vector to be rescaled (currently df$a). What would be a good name for this input? It’s quite common in R to refer to a vector of data simply as x (like in the mean function), so we will follow that convention here.
x containing the numbers 1 through 10.x instead of referring to the data frame column df$a.# Define example vector x
x <- c(1:10)
# Rewrite this snippet to refer to x
(x - min(x, na.rm = TRUE)) /
(max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
## [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
## [8] 0.7777778 0.8888889 1.0000000
Our next step is to examine our snippet and see if we can write it more clearly.
Take a close look at our rewritten snippet. Do you see any duplication?
(x - min(x, na.rm = TRUE)) /
(max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
One obviously duplicated statement is min(x, na.rm = TRUE). It makes more sense for us just to calculate it once, store the result, and then refer to it when needed. In fact, since we also need the maximum value of x, it would be even better to calculate the range once, then refer to the first and second elements when they are needed.
What should we call this intermediate variable? You’ll soon get the message that using good names is an important part of writing clear code! I suggest we call it rng (for “range”).
rng to contain the range of x using the function range(). Specify the na.rm() argument to automatically ignore any NAs in the vector.# Define example vector x
x <- 1:10
# Define rng
rng <- range(x, na.rm = TRUE)
# Rewrite this snippet to refer to the elements of rng
(x - rng[1]) /
(rng[2] - rng[1])
## [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
## [8] 0.7777778 0.8888889 1.0000000
What do you need to write a function? You need a name for the function, you need to know the arguments to the function, and you need code that forms the body of the function.
We now have all these pieces ready to put together. It’s time to write the function!
rescale01():x.rng <- range(x, na.rm = TRUE) (x - rng[1]) / (rng[2] - rng[1])# Define example vector x
x <- 1:10
# Use the function template to create the rescale01 function
rescale01 <- function(x) {
# body
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
# Test your function, call rescale01 using the vector x as the argument
rescale01(x)
## [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
## [8] 0.7777778 0.8888889 1.0000000
Let’s tackle a new problem. We want to write a function, both_na() that counts at how many positions two vectors, x and y, both have a missing value.
How do we get started? Should we start writing our function?
both_na <- function(x, y) {
# something goes here?
}
No! We should start by solving a simple example problem first. Let’s define an x and y where we know what the answer both_na(x, y) should be.
Let’s start with:
x <- c( 1, 2, NA, 3, NA)
y <- c(NA, 3, NA, 3, 4)
Then both_na(x, y) should return 1, since there is only one element that is missing in both x and y, the third element.
(Notice we introduced a couple of extra spaces to each vector. Adding spaces to x and y to make them match up makes it much easier to see what the correct result is. Code formatting is an important aspect of writing clear code.)
Your first task is to write a line of code that gets to that answer.
Write a line of code that finds the number of positions where both x and y have missing values. You may find the is.na() and sum() functions useful, as well as the & operator.
# Define example vectors x and y
x <- c( 1, 2, NA, 3, NA)
y <- c(NA, 3, NA, 3, 4)
# Count how many elements are missing in both x and y
sum(is.na(x) & is.na(y))
## [1] 1
Great! You now have a snippet of code that works:
sum(is.na(x) & is.na(y))
You’ve already figured out it should have two inputs (the two vectors to compare) and we’ve even given them reasonable names: x and y. Our snippet is also so simple we can’t write it any clearer.
So, let’s go ahead and turn this snippet into a function!
both_na().both_na() should take two arguments, x and y.# Define example vectors x and y
x <- c( 1, 2, NA, 3, NA)
y <- c(NA, 3, NA, 3, 4)
# Turn this snippet into a function: both_na()
both_na <- function(x, y) {
sum(is.na(x) & is.na(y))
}
We have a function that works in at least one situation, but we should probably check it works in others.
Consider the following vectors:
x <- c(NA, NA, NA)
y1 <- c( 1, NA, NA)
y2 <- c( 1, NA, NA, NA)
What would you expect both_na(x, y1) to return? What about both_na(x, y2)? Does your function return what you expected? Try it and see!
x, y1 and y2 according the above examples.both_na() with x and y1.both_na() with x and y2.# Define x, y1 and y2
x <- c(NA, NA, NA)
y1 <- c(1, NA, NA)
y2 <- c(1, NA, NA, NA)
# Call both_na on x, y1
both_na(x, y1)
## [1] 2
# Call both_na on x, y2
both_na(x, y2)
## Warning in is.na(x) & is.na(y): longer object length is not a multiple of
## shorter object length
## [1] 3
Consider the following function, f2(), which has also been loaded into your workspace:
f2 <- function(x) {
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
f2() isn’t a very good function name! Which of the following would make a good name for this function?
rmv()remove_first()shorten()removed_last()remove_last()Perfect! It’s a verb, and descriptive. Can you brainstorm other good names for this function?
It’s not just your function that needs a good name. Your arguments need good names too!
Take a look at this function, which calculates a confidence interval for a population mean:
mean_ci <- function(c, nums) {
se <- sd(nums) / sqrt(length(nums))
alpha <- 1 - c
mean(nums) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
The argument nums is a sample of data and the argument c controls the level of the confidence interval. For example, if c = 0.95, we get a 95% confidence interval. Are c and nums good arguments names?
c is a particularly bad name. It is completely non-descriptive and it’s the name of an existing function in R. What might be better? Maybe something like confidence, since it reveals the purpose of the argument: to control the level of confidence for the interval. Another option might be level, since it’s the same name used for the confint function in base R and your users may already be familiar with that name for this parameter.
nums is not inherently bad, but since it’s the placeholder for the vector of data, a name like x would be more recognizable to users.
Rewrite the mean_ci() function to take arguments named level and x instead of c and nums, in that order for now.
# Rewrite mean_ci to take arguments named level and x
mean_ci <- function(level, x) {
se <- sd(x) / sqrt(length(x))
alpha <- 1 - level
mean(x) + se * qnorm(c(alpha / 2, level / 2))
}
Aside from giving your arguments good names, you should put some thought into what order your arguments are in and if they should have defaults or not.
Arguments are often one of two types:
Generally, data arguments should come first. Detail arguments should go on the end, and usually should have default values.
Take another look at our function:
mean_ci <- function(level, x) {
se <- sd(x) / sqrt(length(x))
alpha <- 1 - level
mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x is the data to be used for the calculation, so it’s a data argument and should come first.
level is a detail argument; it just controls the level of confidence for which the interval is constructed. It should come after the data arguments and we should give it a default value, say 0.95, since it is common to want a 95% confidence interval.
x, to the front.level, to the end and give it the default value 0.95.# Alter the arguments to mean_ci
mean_ci <- function(x, level = 0.95) {
se <- sd(x) / sqrt(length(x))
alpha <- 1 - level
mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
One of your colleagues has noticed if you pass mean_ci() an empty vector it returns a confidence interval with missing values at both ends (try it: mean_ci(numeric(0))). In this case, they decided it would make more sense to produce a warning "x was empty" and return c(-Inf, Inf) and have edited the function to be:
mean_ci <- function(x, level = 0.95) {
if (length(x) == 0) {
warning("`x` was empty", call. = FALSE)
interval <- c(-Inf, Inf)
} else {
se <- sd(x) / sqrt(length(x))
alpha <- 1 - level
interval <- mean(x) +
se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
interval
}
Notice how hard it is now to follow the logic of the function. If you want to know what happens in the empty x case, you need to read the entire function to check if anything happens to interval before the function returns. There isn’t much to read in this case, but if this was a longer function you might be scrolling through pages of code.
This is a case where an early return() makes sense. If x is empty, the function should immediately return c(-Inf, Inf).
Edit the mean_ci function using an if statement to check for the case when x is empty and if so, to produce the same warning as the code above then immediately return() c(-Inf, Inf).
# Alter the mean_ci function
mean_ci <- function(x, level = 0.95) {
if (length(x) == 0) {
warning("`x` was empty", call. = FALSE)
return(c(-Inf, Inf))
}
se <- sd(x) / sqrt(length(x))
alpha <- 1 - level
mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
Over the next few exercises, we’ll practice everything you’ve learned so far. Here’s a poorly written function, which is also available in your workspace:
f <- function(x, y) {
x[is.na(x)] <- y
cat(sum(is.na(x)), y, "\n")
x
}
Your job is to turn it in to a nicely written function.
What does this function do? Let’s try to figure it out by passing in some arguments.
x with the values: 1, 2, NA, 4 and 5.f() with the arguments x = x, and y = 3.f() with the arguments x = x, and y = 10.# Define a numeric vector x with the values 1, 2, NA, 4 and 5
x <- c(1, 2, NA, 4, 5)
# Call f() with the arguments x = x and y = 3
f(x=x, y=3)
## 0 3
## [1] 1 2 3 4 5
# Call f() with the arguments x = x and y = 10
f(x=x, y=10)
## 0 10
## [1] 1 2 10 4 5
Did you figure out what f() does? f() takes a vector x and replaces any missing values in it with the value y.
Imagine you came across the line df$z <- f(df$z, 0) a little further on in the code. What does that line do? Now you know, it replaces any missing values in the column df$z with the value 0. But anyone else who comes across that line is going to have to go back and find the definition of f and see if they can reason it out.
Let’s rename our function and arguments to make it obvious to everyone what we are trying to achieve.
f() to replace_missings().y argument to replacement.df$z with 0’s using your new function. Make sure you assign the result back to df$z.# Rename the function f() to replace_missings()
replace_missings <- function(x, replacement) {
# Change the name of the y argument to replacement
x[is.na(x)] <- replacement
cat(sum(is.na(x)), replacement, "\n")
x
}
# Rewrite the call on df$z to match our new names
df$z <- replace_missings(df$z, 0)
## Warning in is.na(x): is.na() applied to non-(list or vector) of type 'NULL'
## 0 0
Great! Now when someone comes across
df$z <- replace_missings(df$z, replacement = 0)
in your code, it’s pretty obvious what you are trying to achieve. The body of our replace_missings() function is still a little messy. There is certainly some unnecessary duplication.
is_miss, a logical vector that identifies the missing values in x.is_miss instead of is.na(x).replace_missings <- function(x, replacement) {
# Define is_miss
is_miss <- is.na(x)
# Rewrite rest of function to refer to is_miss
x[is_miss] <- replacement
cat(sum(is_miss), replacement, "\n")
x
}
Did you notice replace_missings() prints some output to the console? Try it:
replace_missings(df$z, replacement = 0)
That output isn’t exactly self-explanatory. It would be much nicer to say “2 missing values replaced by the value 0”.
It is also bad practice to use cat() for anything other than a print() method (a function designed just to display output). Having an important message just print to the screen makes it very hard for other people who might be programming with your function to capture the output and handle it appropriately.
The official R way to supply simple diagnostic information is the message() function. The unnamed arguments are pasted together with no separator (and no need for a newline at the end) and by default are printed to the screen.
Let’s make our function nicer for users by using message() and making the output self-contained.
message() and output sum(is_miss) missings replaced by the value replacement.df$z with 0’s, as we’ve done before.replace_missings <- function(x, replacement) {
is_miss <- is.na(x)
x[is_miss] <- replacement
# Rewrite to use message()
message(sum(is_miss), " missings replaces by the value ", replacement)
x
}
# Check your new function by running on df$z
replace_missings(df$z, replacement = 0)
## 0 missings replaces by the value 0
## numeric(0)
You already know how to use a for loop. The goal of this chapter is to teach you how to use the map functions in the purrr package which remove the code that’s duplicated across multiple for loops. After completing this chapter you’ll be able to solve new iteration problems with greater ease (faster and with fewer bugs).
Imagine we have a data frame called df:
df <- data.frame(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
We want to compute the median of each column. You could do this with copy and paste (df is available in your workspace, so go ahead and try it in the console):
median(df[[1]])
median(df[[2]])
median(df[[3]])
median(df[[4]])
But that’s a lot of repetition! Let’s start by seeing how we could reduce the duplication by using a for loop.
We’ve provided some code to get you started. Fill in the body of the for loop to calculate the median of each column and store the results in output.
# Initialize output vector
output <- vector("double", ncol(df))
# Fill in the body of the for loop
for (i in seq_along(df)) {
output[i] <- median(df[[i]])
}
# View the result
output
## [1] NA
Now, imagine you need to do this to another data frame df2. You copy and paste the for loop, and edit every reference to df to be df2 instead.
output <- vector("double", ncol(df2))
for (i in seq_along(df2)) {
output[[i]] <- median(df2[[i]])
}
output
And then you realize you have another data frame df3 for which you also want the column medians. You copy and paste…and realize you’ve copied and pasted two times. Time to write a function!
Turn the for loop snippet into a function called col_median() that takes one argument df and returns the vector of column medians.
# Turn this code into col_median()
output <- vector("double", ncol(df))
col_median <- function(df) {
for (i in seq_along(df)) {
output[[i]] <- median(df[[i]])
}
output
}
col_median
## function(df) {
## for (i in seq_along(df)) {
## output[[i]] <- median(df[[i]])
## }
## output
## }
What if instead of medians of every column you actually want means?
Let’s write a col_mean() function that returns the vector of column means.
col_median() function into the editor.col_mean() function by editing col_median() to find the column means instead.# Create col_mean() function to find column means
col_mean <- function(df) {
output <- numeric(length(df))
for (i in seq_along(df)) {
output[[i]] <- mean(df[[i]])
}
output
}
You now have functions for column medians and means, what about one for standard deviations?
col_median <- function(df) {
output <- numeric(length(df))
for (i in seq_along(df)) {
output[[i]] <- median(df[[i]])
}
output
}
col_mean <- function(df) {
output <- numeric(length(df))
for (i in seq_along(df)) {
output[[i]] <- mean(df[[i]])
}
output
}
col_median function into the editor.col_median function to find the column standard deviations instead.col_sd for your new function.# Define col_sd() function
col_sd <- function(df) {
output <- numeric(length(df))
for (i in seq_along(df)) {
output[[i]] <- sd(df[[i]])
}
output
}
We just copied and pasted the function col_median two times. That’s a sure sign we need to write a function. How can we write a function that will take column summaries for any summary function we provide?
Let’s look at a simpler example first. Consider the functions f1(), f2() and f3() that take a vector x and return deviations from the mean value raised to the powers 1, 2, and 3 respectively:
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
How could you remove the duplication in this set of function definitions?
Hopefully, you would suggest writing a single function with two arguments: x and power. That way, one function reproduces all the functionality of f1(), f2() and f3(), and more.
f() to take a second argument power.f() so that the absolute deviations raised to power are returned.# Add a second argument called power
f <- function(x, power) {
# Edit the body to return absolute deviations raised to power
abs(x - mean(x)) ** power
}
You just saw that we can remove the duplication in our set of summary functions by requiring the function doing the summary as an input. This leads to creating the col_summary function:
col_summary <- function(df, fun) {
output <- vector("numeric", length(df))
for (i in seq_along(df)) {
output[[i]] <- fun(df[[i]])
}
output
}
It may be kind of surprising that you can pass a function as an argument to another function, so let’s verify first that it worked. We’ve found the column means and medians using our old col_mean() and col_median() functions. Your job is to repeat the calculations using col_summary() instead and verify that it works.
Not only does col_summary() remove the duplication in the functions we’ve already written, it also allows us to apply any summary to the columns of a data frame. Verify this, by finding the column interquartile ranges using the function IQR(). (For more info, see ?IQR.)
df by specifying median as the fun argument to col_summary().df by specifying mean as the fun argument to col_summary().df using col_summary().# Find the column medians using col_median() and col_summary()
col_median(df)
## [1] NA
col_summary(df, median)
## [1] NA
# Find the column means using col_mean() and col_summary()
col_mean(df)
## [1] NaN
col_summary(df, mean)
## [1] NaN
# Find the column IQRs using col_summary()
col_summary(df, IQR)
## [1] NA
Nice! If you’ve used the lapply() or sapply() functions before, then you’ve passed functions as arguments. If not, you may want to review them in the Intermediate R course before moving on to the next video.
All the map functions in purrr take a vector, .x, as the first argument, then return .f applied to each element of .x. The type of object that is returned is determined by function suffix (the part after _):
map() returns a list or data framemap_lgl() returns a logical vectormap_int() returns a integer vectormap_dbl() returns a double vectormap_chr() returns a character vectorLet’s repeat our column summaries using a map function instead of our col_summary() function.
Which map function shall we use? In our case, every summary we calculated returned a single numeric value, so we’ll use map_dbl().
Use map_dbl() to find the…
df.df.df.# Load the purrr package
library(purrr)
## Warning: package 'purrr' was built under R version 3.4.4
# Use map_dbl() to find column means
map_dbl(df, mean)
## z
## NaN
# Use map_dbl() to column medians
map_dbl(df, median)
## z
## NA
# Use map_dbl() to find column standard deviations
map_dbl(df, sd)
## z
## NA
The map functions use the ... (“dot dot dot”) argument to pass along additional arguments to .f each time it’s called. For example, we can pass the trim argument to the mean() function:
map_dbl(df, mean, trim = 0.5)
Multiple arguments can be passed along using commas to separate them. For example, we can also pass the na.rm argument to mean():
map_dbl(df, mean, trim = 0.5, na.rm = TRUE)
You don’t have to specify the arguments by name, but it is good practice!
You may be wondering why the arguments to map() are .x and .f and not x and f? It’s because .x and .f are very unlikely to be argument names you might pass through the ..., thereby preventing confusion about whether an argument belongs to map() or to the function being mapped.
Let’s get a bit of practice with this. We’ll apply our new knowledge to a subset of the planes data frame available in the nycflights13 package. Use map_dbl() to find the average and 5th percentile of each column in planes.
planes by combining`map_dbl() with mean().planes again, but this time exclude missing values from the calculation.planes by combining map_dbl with quantile(). Don’t forget to exclude missing values!planes <- data.frame(year = c(1956, 1975, 1977, 1996, 2010, NA), engines = c(4, 1, 2, 2, 2, 1), seats = c(102, 4, 139, 142, 20, 2), speed = c(232, 108, 432, NA, NA, NA))
# Find the mean of each column
map_dbl(planes, mean)
## year engines seats speed
## NA 2.00000 68.16667 NA
# Find the mean of each column, excluding missing values
map_dbl(planes, mean, na.rm = TRUE)
## year engines seats speed
## 1982.80000 2.00000 68.16667 257.33333
# Find the 5th percentile of each column, excluding missing values
map_dbl(planes, quantile, probs = 0.05, na.rm = TRUE)
## year engines seats speed
## 1959.8 1.0 2.5 120.4
Great work! Other nice features of the map functions are that they’re implemented in C, which makes them really fast, and that they generally preserve names from their input in the output.
Choosing the right map function is important. You can always use map(), which will return a list. However, if you know what type of output you expect, you are better to use the corresponding function. That way, if you expect one thing and get another, you’ll know immediately because the map function will return an error.
For example, try running:
map_lgl(df, mean)
The map functions are what we call type consistent. This means you know exactly what type of output to expect regardless of the input. map_lgl() either returns either a logical vector or an error. map_dbl() returns either a double or an error.
One way to check the output type is to run the corresponding function on the first element. For example, mean(df[[1]]) returns a single numeric value, suggesting map_dbl().
Let’s get some more practice with the map functions. We’ve created a new data frame, df3, in your workspace. Use a map function to find which columns are numeric, the type of each column, and a summary of each column.
Remember to choose the appropriate map function based on the output you expect for each of the following:
df3 by combining a map function with is.numeric().df3 by combining a map function with typeof().df3 by combining a map function with summary().df3 <- data.frame(A = c(-2.13641409, -0.54173437, 1.25655044, 0.17926296, -1.86171148, -0.46128828, -0.52251896, 0.06505055, -0.03916377, 0.30235985), B = c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B"), C = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), D = c(0.8053080, -0.6494968, -0.1815008, 0.8507846, 0.2905451, 0.7030261, 0.6552918, -1.3814747, -0.6142817, -2.1614109))
# Find the columns that are numeric
map_lgl(df3, is.numeric)
## A B C D
## TRUE FALSE TRUE TRUE
# Find the type of each column
map_chr(df3, typeof)
## A B C D
## "double" "integer" "double" "double"
# Find a summary of each column
map(df3, summary)
## $A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.1364 -0.5369 -0.2502 -0.3760 0.1507 1.2566
##
## $B
## A B
## 5 5
##
## $C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.25 5.50 5.50 7.75 10.00
##
## $D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.16141 -0.64069 0.05452 -0.16832 0.69109 0.85078
Excellent work! In chapter 5 we’ll learn more about why type consistent functions are better to program with, and why you should avoid type inconsistent functions like sapply().
Our goal is to fit a separate linear regression of miles per gallon (mpg) against weight (wt) for each group of cars in our list of data frames, where each data frame in our list represents a different group. How should we get started?
First, let’s confirm the structure of this list of data frames. Then, we’ll solve a simpler problem first: fit the regression to the first group of cars.
cyl with str().cyl and assign it to the variable four_cyls.four_cyls as the data argument to lm().cyl <- list(`4` = mtcars[which(mtcars$cyl == 4), ], `6` = mtcars[which(mtcars$cyl == 6), ], `8` = mtcars[which(mtcars$cyl == 8), ])
# Examine the structure of cyl
str(cyl)
## List of 3
## $ 4:'data.frame': 11 obs. of 11 variables:
## ..$ mpg : num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
## ..$ cyl : num [1:11] 4 4 4 4 4 4 4 4 4 4 ...
## ..$ disp: num [1:11] 108 146.7 140.8 78.7 75.7 ...
## ..$ hp : num [1:11] 93 62 95 66 52 65 97 66 91 113 ...
## ..$ drat: num [1:11] 3.85 3.69 3.92 4.08 4.93 4.22 3.7 4.08 4.43 3.77 ...
## ..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
## ..$ qsec: num [1:11] 18.6 20 22.9 19.5 18.5 ...
## ..$ vs : num [1:11] 1 1 1 1 1 1 1 1 0 1 ...
## ..$ am : num [1:11] 1 0 0 1 1 1 0 1 1 1 ...
## ..$ gear: num [1:11] 4 4 4 4 4 4 3 4 5 5 ...
## ..$ carb: num [1:11] 1 2 2 1 2 1 1 1 2 2 ...
## $ 6:'data.frame': 7 obs. of 11 variables:
## ..$ mpg : num [1:7] 21 21 21.4 18.1 19.2 17.8 19.7
## ..$ cyl : num [1:7] 6 6 6 6 6 6 6
## ..$ disp: num [1:7] 160 160 258 225 168 ...
## ..$ hp : num [1:7] 110 110 110 105 123 123 175
## ..$ drat: num [1:7] 3.9 3.9 3.08 2.76 3.92 3.92 3.62
## ..$ wt : num [1:7] 2.62 2.88 3.21 3.46 3.44 ...
## ..$ qsec: num [1:7] 16.5 17 19.4 20.2 18.3 ...
## ..$ vs : num [1:7] 0 0 1 1 1 1 0
## ..$ am : num [1:7] 1 1 0 0 0 0 1
## ..$ gear: num [1:7] 4 4 3 3 4 4 5
## ..$ carb: num [1:7] 4 4 1 1 4 4 6
## $ 8:'data.frame': 14 obs. of 11 variables:
## ..$ mpg : num [1:14] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
## ..$ cyl : num [1:14] 8 8 8 8 8 8 8 8 8 8 ...
## ..$ disp: num [1:14] 360 360 276 276 276 ...
## ..$ hp : num [1:14] 175 245 180 180 180 205 215 230 150 150 ...
## ..$ drat: num [1:14] 3.15 3.21 3.07 3.07 3.07 2.93 3 3.23 2.76 3.15 ...
## ..$ wt : num [1:14] 3.44 3.57 4.07 3.73 3.78 ...
## ..$ qsec: num [1:14] 17 15.8 17.4 17.6 18 ...
## ..$ vs : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ am : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ gear: num [1:14] 3 3 3 3 3 3 3 3 3 3 ...
## ..$ carb: num [1:14] 2 4 3 3 3 4 4 4 2 2 ...
# Extract the first element into four_cyls
four_cyls <- cyl[[1]]
# Fit a linear regression of mpg on wt using four_cyls
lm(mpg ~ wt, data = four_cyls)
##
## Call:
## lm(formula = mpg ~ wt, data = four_cyls)
##
## Coefficients:
## (Intercept) wt
## 39.571 -5.647
Great! We now have a snippet of code that performs the operation we want on one data frame. One option would be to turn this into a function, for example:
fit_reg <- function(df) {
lm(mpg ~ wt, data = df)
}
Then pass this function into map():
map(cyl, fit_reg)
But it seems a bit much to define a function for such a specific model when we only want to do this once. Instead of defining the function in the global environment, we will just use the function anonymously inside our call to map().
What does this mean? Instead of referring to our function by name in map(), we define it on the fly in the .f argument to map().
Rewrite the map() call to use the anonymous function function(df) lm(mpg ~ wt, data = df).
# Rewrite to call an anonymous function
map(cyl, function(df) lm(mpg ~ wt, data = df))
## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 39.571 -5.647
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 28.41 -2.78
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 23.868 -2.192
Writing anonymous functions takes a lot of extra key strokes, so purrr provides a shortcut that allows you to write an anonymous function as a one-sided formula instead.
In R, a one-sided formula starts with a ~, followed by an R expression. In purrr’s map functions, the R expression can refer to an element of the .x argument using the . character.
Let’s take a look at an example. Imagine, instead of a regression on each data frame in cyl, we wanted to know the mean displacement for each data frame. One way to do this would be to use an anonymous function:
map_dbl(cyl, function(df) mean(df$disp))
To perform the same operation using the formula shortcut, we replace the function definition (function(df)) with the ~, then when we need to refer to the element of cyl the function operates on (in this case df), we use a ..
map_dbl(cyl, ~ mean(.$disp))
See how much less typing it involves! It also saves you from coming up with an argument name. Can you rewrite our previous anonymous function using this formula shortcut instead?
Rewrite our call to map() to use the formula notation instead of an anonymous function.
# Rewrite to use the formula shortcut instead
map(cyl, ~ lm(mpg ~ wt, data = .))
## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 39.571 -5.647
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 28.41 -2.78
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 23.868 -2.192
There are also some useful shortcuts that come in handy when you want to subset each element of the .x argument. If the .f argument to a map function is set equal to a string, let’s say "name", then purrr extracts the "name" element from every element of .x.
This is a really common situation you find yourself in when you work with nested lists. For example, if we have a list of where every element contains an a and b element:
list_of_results <- list(
list(a = 1, b = "A"),
list(a = 2, b = "C"),
list(a = 3, b = "D")
)
We might want to pull out the a element from every entry. We could do it with the string shortcut like this:
map(list_of_results, "a")
Now take our list of regresssion models:
map(cyl, ~ lm(mpg ~ wt, data = .))
It might be nice to extract the slope coefficient from each model. You’ll do this in a few steps: first fit the models, then get the coefficients from each model using the coef() function, then pull out the wt estimate using the string shortcut.
models.map and the coef() function to extract the coefficients from each model, and save it in the variable coefs.map and the string shortcut to extract the slope wt element from coefficients vectors.# Save the result from the previous exercise to the variable models
models <- map(cyl, ~ lm(mpg ~ wt, data = .))
# Use map and coef to get the coefficients for each model: coefs
coefs <- map(models, coef)
# Use string shortcut to extract the wt coefficient
map(coefs, "wt")
## $`4`
## [1] -5.647025
##
## $`6`
## [1] -2.780106
##
## $`8`
## [1] -2.192438
Another useful shortcut for subsetting is to pass a numeric vector as the .f argument. This works just like passing a string but subsets by index rather than name. For example, with your previous list_of_results:
list_of_results <- list(
list(a = 1, b = "A"),
list(a = 2, b = "C"),
list(a = 3, b = "D")
)
Another way to pull out the a element from each list, is to pull out the first element:
map(list_of_results, 1)
Let’s pull out the slopes from our models again, but this time using numeric subsetting. Also, since we are pulling out a single numeric value from each element, let’s use map_dbl().
Extract the second element from each vector in coefs using the numeric shortcut and map_dbl().
coefs <- map(models, coef)
# use map_dbl with the numeric shortcut to pull out the second element
map_dbl(coefs, 2)
## 4 6 8
## -5.647025 -2.780106 -2.192438
purrr also includes a pipe operator: %>%. The pipe operator is another shortcut that saves typing, but also increases readability. The explanation of the pipe operator is quite simple: x %>% f(y) is another way of writing f(x, y). That is, the left hand side of the pipe, x, becomes the first argument to the function, f(), on the right hand side of the pipe.
Take a look at our code to get our list of models:
cyl <- split(mtcars, mtcars$cyl)
map(cyl, ~ lm(mpg ~ wt, data = .))
We split the data frame mtcars and save it as the variable cyl. We then pass cyl as the first argument to map to fit the models. We could rewrite this using the pipe operator as:
split(mtcars, mtcars$cyl) %>%
map(~ lm(mpg ~ wt, data = .))
We read this as “split the data frame mtcars on cyl, then use map() on the result.”
One of the powerful things about the pipe is we can chain together many operations. Here is our complete code, written with pipes, instead assigning each step to a variable and using it in the next step:
mtcars %>%
split(mtcars$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(coef) %>%
map_dbl("wt")
We’ve written some code in the editor to pull out the R2 from each model. Rewrite the last two lines to use a pipe instead.
Rewrite the last two lines to use a pipe instead.
# Define models (don't change)
models <- mtcars %>%
split(mtcars$cyl) %>%
map(~ lm(mpg ~ wt, data = .))
# Rewrite to be a single command using pipes
summaries <- models %>%
map(summary) %>%
map_dbl("r.squared")
Fantastic! We’ll use the pipe operator a lot in the remaining chapters, so you’ll get lots of practice using it.
Now you’ve seen how useful the map functions are for reducing duplication, we’ll introduce you to a few more functions in purrr that allow you to handle more complicated inputs and outputs. In particular, you’ll learn how to deal with functions that might return an error, how to iterate over multiple arguments and how to iterate over functions that have no output at all.
safely() is an adverb; it takes a verb and modifies it. That is, it takes a function as an argument and it returns a function as its output. The function that is returned is modified so it never throws an error (and never stops the rest of your computation!).
Instead, it always returns a list with two elements:
result is the original result. If there was an error, this will be NULL.error is an error object. If the operation was successful this will be NULL.Let’s try to make the readLines() function safe.
readLines() function to safely(), and assign the output to safe_readLines.safe_readLines() on the string "http://example.org" to read the example homepage HTML file.safe_readLines() on "http://asdfasdasdkfjlda", a nonsense web address that shouldn’t be found.# Create safe_readLines() by passing readLines() to safely()
safe_readLines <- safely(readLines)
# Call safe_readLines() on "http://example.org"
safe_readLines("http://example.org")
## $result
## [1] "<!doctype html>"
## [2] "<html>"
## [3] "<head>"
## [4] " <title>Example Domain</title>"
## [5] ""
## [6] " <meta charset=\"utf-8\" />"
## [7] " <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />"
## [8] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />"
## [9] " <style type=\"text/css\">"
## [10] " body {"
## [11] " background-color: #f0f0f2;"
## [12] " margin: 0;"
## [13] " padding: 0;"
## [14] " font-family: \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;"
## [15] " "
## [16] " }"
## [17] " div {"
## [18] " width: 600px;"
## [19] " margin: 5em auto;"
## [20] " padding: 50px;"
## [21] " background-color: #fff;"
## [22] " border-radius: 1em;"
## [23] " }"
## [24] " a:link, a:visited {"
## [25] " color: #38488f;"
## [26] " text-decoration: none;"
## [27] " }"
## [28] " @media (max-width: 700px) {"
## [29] " body {"
## [30] " background-color: #fff;"
## [31] " }"
## [32] " div {"
## [33] " width: auto;"
## [34] " margin: 0 auto;"
## [35] " border-radius: 0;"
## [36] " padding: 1em;"
## [37] " }"
## [38] " }"
## [39] " </style> "
## [40] "</head>"
## [41] ""
## [42] "<body>"
## [43] "<div>"
## [44] " <h1>Example Domain</h1>"
## [45] " <p>This domain is established to be used for illustrative examples in documents. You may use this"
## [46] " domain in examples without prior coordination or asking for permission.</p>"
## [47] " <p><a href=\"http://www.iana.org/domains/example\">More information...</a></p>"
## [48] "</div>"
## [49] "</body>"
## [50] "</html>"
##
## $error
## NULL
# Call safe_readLines() on "http://asdfasdasdkfjlda"
safe_readLines("http://asdfasdasdkfjlda")
## Warning in file(con, "r"): InternetOpenUrl failed: 'The server name or
## address could not be resolved'
## $result
## NULL
##
## $error
## <simpleError in file(con, "r"): cannot open the connection>
One feature of safely() is that it plays nicely with the map() functions. Consider this list containing the two URLs from the last exercise, plus one additional URL to make things more interesting:
urls <- list(
example = "http://example.org",
rproj = "http://www.r-project.org",
asdf = "http://asdfasdasdkfjlda"
)
We are interested in quickly downloading the HTML files at each URL. You might try:
map(urls, readLines)
But it results in an error, Error in file(con, "r") : cannot open the connection, and no output for any of the URLs. Go on, try it!
We can solve this problem by using our safe_readLines() instead.
map() on urls with the safe_readLines() function instead and assign the results to html.str() on html to examine the output.result from one of the two elements that was successful using double square bracket subsetting.error from the element that was unsuccessful, again using double square bracket subsetting.# Define safe_readLines()
safe_readLines <- safely(readLines)
# Use the safe_readLines() function with map(): html
html <- map(urls, safe_readLines)
## Warning in file(con, "r"): InternetOpenUrl failed: 'The server name or
## address could not be resolved'
# Call str() on html
str(html)
## List of 3
## $ example:List of 2
## ..$ result: chr [1:50] "<!doctype html>" "<html>" "<head>" " <title>Example Domain</title>" ...
## ..$ error : NULL
## $ rproj :List of 2
## ..$ result: chr [1:119] "<!DOCTYPE html>" "<html lang=\"en\">" " <head>" " <meta charset=\"utf-8\">" ...
## ..$ error : NULL
## $ asdf :List of 2
## ..$ result: NULL
## ..$ error :List of 2
## .. ..$ message: chr "cannot open the connection"
## .. ..$ call : language file(con, "r")
## .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
# Extract the result from one of the successful elements
html[[1]][["result"]]
## [1] "<!doctype html>"
## [2] "<html>"
## [3] "<head>"
## [4] " <title>Example Domain</title>"
## [5] ""
## [6] " <meta charset=\"utf-8\" />"
## [7] " <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />"
## [8] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />"
## [9] " <style type=\"text/css\">"
## [10] " body {"
## [11] " background-color: #f0f0f2;"
## [12] " margin: 0;"
## [13] " padding: 0;"
## [14] " font-family: \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;"
## [15] " "
## [16] " }"
## [17] " div {"
## [18] " width: 600px;"
## [19] " margin: 5em auto;"
## [20] " padding: 50px;"
## [21] " background-color: #fff;"
## [22] " border-radius: 1em;"
## [23] " }"
## [24] " a:link, a:visited {"
## [25] " color: #38488f;"
## [26] " text-decoration: none;"
## [27] " }"
## [28] " @media (max-width: 700px) {"
## [29] " body {"
## [30] " background-color: #fff;"
## [31] " }"
## [32] " div {"
## [33] " width: auto;"
## [34] " margin: 0 auto;"
## [35] " border-radius: 0;"
## [36] " padding: 1em;"
## [37] " }"
## [38] " }"
## [39] " </style> "
## [40] "</head>"
## [41] ""
## [42] "<body>"
## [43] "<div>"
## [44] " <h1>Example Domain</h1>"
## [45] " <p>This domain is established to be used for illustrative examples in documents. You may use this"
## [46] " domain in examples without prior coordination or asking for permission.</p>"
## [47] " <p><a href=\"http://www.iana.org/domains/example\">More information...</a></p>"
## [48] "</div>"
## [49] "</body>"
## [50] "</html>"
# Extract the error from the element that was unsuccessful
html[[3]][["error"]]
## <simpleError in file(con, "r"): cannot open the connection>
We now have output that contains the HTML for each of the two URLs on which readLines() was successful and the error for the other. But the output isn’t that easy to work with, since the results and errors are buried in the inner-most level of the list.
purrr provides a function transpose() that reshapes a list so the inner-most level becomes the outer-most level. In otherwords, it turns a list-of-lists “inside-out”. Consider the following list:
nested_list <- list(
x1 = list(a = 1, b = 2),
x2 = list(a = 3, b = 4)
)
If I need to extract the a element in x1, I could do nested_list[["x1"]][["a"]]. However, if I transpose the list first, the order of subsetting reverses. That is, to extract the same element I could also do transpose(nested_list)[["a"]][["x1"]].
This is really handy for safe output, since we can grab all the results or all the errors really easily.
transpose(html).transpose(html) and assign to the variable res.transpose(html) and assign to the variable errs.# Define safe_readLines() and html
safe_readLines <- safely(readLines)
html <- map(urls, safe_readLines)
## Warning in file(con, "r"): InternetOpenUrl failed: 'The server name or
## address could not be resolved'
# Examine the structure of transpose(html)
str(transpose(html))
## List of 2
## $ result:List of 3
## ..$ example: chr [1:50] "<!doctype html>" "<html>" "<head>" " <title>Example Domain</title>" ...
## ..$ rproj : chr [1:119] "<!DOCTYPE html>" "<html lang=\"en\">" " <head>" " <meta charset=\"utf-8\">" ...
## ..$ asdf : NULL
## $ error :List of 3
## ..$ example: NULL
## ..$ rproj : NULL
## ..$ asdf :List of 2
## .. ..$ message: chr "cannot open the connection"
## .. ..$ call : language file(con, "r")
## .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
# Extract the results: res
res <- transpose(html)[["result"]]
# Extract the errors: errs
errs <- transpose(html)[["error"]]
What you do with the errors and results is up to you. But, commonly you’ll want to collect all the results for the elements that were successful and examine the inputs for all those that weren’t.
map_lgl() with is_null() to create a logical vector, is_ok, that is TRUE when errs is NULL.res with is_ok.urls with !is_ok.# Initialize some objects
safe_readLines <- safely(readLines)
html <- map(urls, safe_readLines)
## Warning in file(con, "r"): InternetOpenUrl failed: 'The server name or
## address could not be resolved'
res <- transpose(html)[["result"]]
errs <- transpose(html)[["error"]]
# Create a logical vector is_ok
is_ok <- map_lgl(errs, is_null)
# Extract the successful results
res[is_ok]
## $example
## [1] "<!doctype html>"
## [2] "<html>"
## [3] "<head>"
## [4] " <title>Example Domain</title>"
## [5] ""
## [6] " <meta charset=\"utf-8\" />"
## [7] " <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />"
## [8] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />"
## [9] " <style type=\"text/css\">"
## [10] " body {"
## [11] " background-color: #f0f0f2;"
## [12] " margin: 0;"
## [13] " padding: 0;"
## [14] " font-family: \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;"
## [15] " "
## [16] " }"
## [17] " div {"
## [18] " width: 600px;"
## [19] " margin: 5em auto;"
## [20] " padding: 50px;"
## [21] " background-color: #fff;"
## [22] " border-radius: 1em;"
## [23] " }"
## [24] " a:link, a:visited {"
## [25] " color: #38488f;"
## [26] " text-decoration: none;"
## [27] " }"
## [28] " @media (max-width: 700px) {"
## [29] " body {"
## [30] " background-color: #fff;"
## [31] " }"
## [32] " div {"
## [33] " width: auto;"
## [34] " margin: 0 auto;"
## [35] " border-radius: 0;"
## [36] " padding: 1em;"
## [37] " }"
## [38] " }"
## [39] " </style> "
## [40] "</head>"
## [41] ""
## [42] "<body>"
## [43] "<div>"
## [44] " <h1>Example Domain</h1>"
## [45] " <p>This domain is established to be used for illustrative examples in documents. You may use this"
## [46] " domain in examples without prior coordination or asking for permission.</p>"
## [47] " <p><a href=\"http://www.iana.org/domains/example\">More information...</a></p>"
## [48] "</div>"
## [49] "</body>"
## [50] "</html>"
##
## $rproj
## [1] "<!DOCTYPE html>"
## [2] "<html lang=\"en\">"
## [3] " <head>"
## [4] " <meta charset=\"utf-8\">"
## [5] " <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">"
## [6] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">"
## [7] " <title>R: The R Project for Statistical Computing</title>"
## [8] ""
## [9] " <link rel=\"icon\" type=\"image/png\" href=\"/favicon-32x32.png\" sizes=\"32x32\" />"
## [10] " <link rel=\"icon\" type=\"image/png\" href=\"/favicon-16x16.png\" sizes=\"16x16\" />"
## [11] ""
## [12] " <!-- Bootstrap -->"
## [13] " <link href=\"/css/bootstrap.min.css\" rel=\"stylesheet\">"
## [14] " <link href=\"/css/R.css\" rel=\"stylesheet\">"
## [15] ""
## [16] " <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->"
## [17] " <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->"
## [18] " <!--[if lt IE 9]>"
## [19] " <script src=\"https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js\"></script>"
## [20] " <script src=\"https://oss.maxcdn.com/respond/1.4.2/respond.min.js\"></script>"
## [21] " <![endif]-->"
## [22] " </head>"
## [23] " <body>"
## [24] " <div class=\"container page\">"
## [25] " <div class=\"row\">"
## [26] " <div class=\"col-xs-12 col-sm-offset-1 col-sm-2 sidebar\" role=\"navigation\">"
## [27] "<div class=\"row\">"
## [28] "<div class=\"col-xs-6 col-sm-12\">"
## [29] "<p><a href=\"/\"><img src=\"/Rlogo.png\" width=\"100\" height=\"78\" alt = \"R\" /></a></p>"
## [30] "<p><small><a href=\"/\">[Home]</a></small></p>"
## [31] "<h2 id=\"download\">Download</h2>"
## [32] "<p><a href=\"http://cran.r-project.org/mirrors.html\">CRAN</a></p>"
## [33] "<h2 id=\"r-project\">R Project</h2>"
## [34] "<ul>"
## [35] "<li><a href=\"/about.html\">About R</a></li>"
## [36] "<li><a href=\"/logo/\">Logo</a></li>"
## [37] "<li><a href=\"/contributors.html\">Contributors</a></li>"
## [38] "<li><a href=\"/news.html\">Whatâs New?</a></li>"
## [39] "<li><a href=\"/bugs.html\">Reporting Bugs</a></li>"
## [40] "<li><a href=\"http://developer.R-project.org\">Development Site</a></li>"
## [41] "<li><a href=\"/conferences.html\">Conferences</a></li>"
## [42] "<li><a href=\"/search.html\">Search</a></li>"
## [43] "</ul>"
## [44] "</div>"
## [45] "<div class=\"col-xs-6 col-sm-12\">"
## [46] "<h2 id=\"r-foundation\">R Foundation</h2>"
## [47] "<ul>"
## [48] "<li><a href=\"/foundation/\">Foundation</a></li>"
## [49] "<li><a href=\"/foundation/board.html\">Board</a></li>"
## [50] "<li><a href=\"/foundation/members.html\">Members</a></li>"
## [51] "<li><a href=\"/foundation/donors.html\">Donors</a></li>"
## [52] "<li><a href=\"/foundation/donations.html\">Donate</a></li>"
## [53] "</ul>"
## [54] "<h2 id=\"help-with-r\">Help With R</h2>"
## [55] "<ul>"
## [56] "<li><a href=\"/help.html\">Getting Help</a></li>"
## [57] "</ul>"
## [58] "<h2 id=\"documentation\">Documentation</h2>"
## [59] "<ul>"
## [60] "<li><a href=\"http://cran.r-project.org/manuals.html\">Manuals</a></li>"
## [61] "<li><a href=\"http://cran.r-project.org/faqs.html\">FAQs</a></li>"
## [62] "<li><a href=\"http://journal.r-project.org\">The R Journal</a></li>"
## [63] "<li><a href=\"/doc/bib/R-books.html\">Books</a></li>"
## [64] "<li><a href=\"/certification.html\">Certification</a></li>"
## [65] "<li><a href=\"/other-docs.html\">Other</a></li>"
## [66] "</ul>"
## [67] "<h2 id=\"links\">Links</h2>"
## [68] "<ul>"
## [69] "<li><a href=\"http://www.bioconductor.org\">Bioconductor</a></li>"
## [70] "<li><a href=\"/other-projects.html\">Related Projects</a></li>"
## [71] "<li><a href=\"/gsoc.html\">GSoC</a></li>"
## [72] "</ul>"
## [73] "</div>"
## [74] "</div>"
## [75] " </div>"
## [76] " <div class=\"col-xs-12 col-sm-7\">"
## [77] " <h1>The R Project for Statistical Computing</h1>"
## [78] "<h2 id=\"getting-started\">Getting Started</h2>"
## [79] "<p>R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To <strong><a href=\"http://cran.r-project.org/mirrors.html\">download R</a></strong>, please choose your preferred <a href=\"http://cran.r-project.org/mirrors.html\">CRAN mirror</a>.</p>"
## [80] "<p>If you have questions about R like how to download and install the software, or what the license terms are, please read our <a href=\"http://cran.R-project.org/faqs.html\">answers to frequently asked questions</a> before you send an email.</p>"
## [81] "<h2 id=\"news\">News</h2>"
## [82] "<ul>"
## [83] "<li><p>You can now support the R Foundation with a renewable subscription as a <a href=\"https://www.r-project.org/foundation/donations.html\">supporting member</a></p></li>"
## [84] "<li><p><a href=\"https://cran.r-project.org/src/base/R-3\"><strong>R version 3.5.1 (Feather Spray)</strong></a> has been released on 2018-07-02.</p></li>"
## [85] "<li><p>The R Foundation has been awarded the Personality/Organization of the year 2018 award by the professional association of German market and social researchers.</p></li>"
## [86] "</ul>"
## [87] "<h2 id=\"news-via-twitter\">News via Twitter</h2>"
## [88] "<a class=\"twitter-timeline\""
## [89] " href=\"https://twitter.com/_R_Foundation?ref_src=twsrc%5Etfw\""
## [90] " data-width=\"400\""
## [91] " data-show-replies=\"false\""
## [92] " data-chrome=\"noheader,nofooter,noborders\""
## [93] " data-dnt=\"true\""
## [94] " data-tweet-limit=\"3\">News from the R Foundation</a>"
## [95] "<script async"
## [96] " src=\"https://platform.twitter.com/widgets.js\""
## [97] " charset=\"utf-8\"></script>"
## [98] "<!--- (Boilerplate for release run-in)"
## [99] "- [**R version 3.1.3 (Smooth Sidewalk) prerelease versions**](http://cran.r-project.org/src/base-prerelease) will appear starting February 28. Final release is scheduled for 2015-03-09."
## [100] "-->"
## [101] " </div>"
## [102] " </div>"
## [103] " <div class=\"raw footer\">"
## [104] " © The R Foundation. For queries about this web site, please contact"
## [105] "\t<script type='text/javascript'>"
## [106] "<!--"
## [107] "var s=\"=b!isfg>#nbjmup;xfcnbtufsAs.qspkfdu/psh#?uif!xfcnbtufs=0b?\";"
## [108] "m=\"\"; for (i=0; i<s.length; i++) {if(s.charCodeAt(i) == 28){m+= '&';} else if (s.charCodeAt(i) == 23) {m+= '!';} else {m+=String.fromCharCode(s.charCodeAt(i)-1);}}document.write(m);//-->"
## [109] "\t</script>;"
## [110] " for queries about R itself, please consult the "
## [111] " <a href=\"help.html\">Getting Help</a> section."
## [112] " </div>"
## [113] " </div>"
## [114] " <!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->"
## [115] " <script src=\"https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js\"></script>"
## [116] " <!-- Include all compiled plugins (below), or include individual files as needed -->"
## [117] " <script src=\"/js/bootstrap.min.js\"></script>"
## [118] " </body>"
## [119] "</html>"
# Find the URLs that were unsuccessful
urls[!is_ok]
## $asdf
## [1] "http://asdfasdasdkfjlda"
We’ll use random number generation as an example throughout the remaining exercises in this chapter. To get started, let’s imagine simulating 5 random numbers from a Normal distribution. You can do this in R with the rnorm() function. For example, to generate 5 random numbers from a Normal distribution with mean zero, we can do:
rnorm(n = 5)
Now, imagine you want to do this three times, but each time with a different sample size. You already know how! Let’s use the map() function to get it done.
n containing the values: 5, 10, and 20.map() to iterate over n, each time applying the function rnorm().# Create a list n containing the values: 5, 10, and 20
n = list(5, 10, 20)
# Call map() on n with rnorm() to simulate three samples
map(n, rnorm)
## [[1]]
## [1] -1.1469904 2.4975007 -1.9272133 0.7783345 0.9114511
##
## [[2]]
## [1] -0.2337036 1.8748620 -1.5758529 -2.3189536 -0.7428074 -1.8256433
## [7] -0.4730371 0.4290537 1.5295782 -1.3502995
##
## [[3]]
## [1] -0.99018610 1.04436651 -0.30661648 -0.34647854 -0.80708796
## [6] -0.46017581 0.59775163 2.06227455 0.20230164 1.03245125
## [11] -0.24261787 0.70598204 -0.42377479 1.34759524 0.44965158
## [16] 0.79605774 0.01229766 -1.32845745 0.15156184 0.81738876
Ok, but now imagine we don’t just want to vary the sample size, we also want to vary the mean. The mean can be specified in rnorm() by the argument mean. Now there are two arguments to rnorm() we want to vary: n and mean.
The map2() function is designed exactly for this purpose; it allows iteration over two objects. The first two arguments to map2() are the objects to iterate over and the third argument .f is the function to apply.
Let’s use map2() to simulate three samples with different sample sizes and different means.
mu containing the values: 1, 5, and 10.map() call to use map2() with both n and mu.# Initialize n
n <- list(5, 10, 20)
# Create a list mu containing the values: 1, 5, and 10
mu = list(1, 5, 10)
# Edit to call map2() on n and mu with rnorm() to simulate three samples
map2(n, mu, rnorm)
## [[1]]
## [1] 0.9758931 -0.1812981 1.5555424 1.9331116 1.0600187
##
## [[2]]
## [1] 3.727427 4.307103 5.489234 4.603160 3.738368 5.066826 4.053531
## [8] 6.393750 3.698077 4.475885
##
## [[3]]
## [1] 8.762064 8.326619 9.532079 9.957813 10.210206 8.198173 9.687371
## [8] 10.155119 8.206832 8.471932 9.117131 9.935629 10.920511 11.419466
## [15] 11.655665 9.686627 9.728569 9.697320 8.981511 10.536834
But wait, there’s another argument to rnorm() we might want to vary: sd, the standard deviation of the Normal distribution. You might think there is a map3() function, but there isn’t. Instead purrr provides a pmap() function that iterates over 2 or more arguments.
First, let’s take a look at pmap() for the situation we just solved: iterating over two arguments. Instead of providing each otem to iterate as arguments, pmap() takes a list of arguments as its input. For examle, we could replicate our previous example, iterating over both n and mu with the following:
n <- list(5, 10, 20)
mu <- list(1, 5, 10)
pmap(list(n, mu), rnorm)
Notice how we had to put our two items to iterate over (n and mu) into a list.
Let’s expand this code to iterate over varying standard deviations too.
sd with the values: 0.1, 1 and 0.1.pmap() to also iterate over sd.# Initialize n and mu
n <- list(5, 10, 20)
mu <- list(1, 5, 10)
# Create a sd list with the values: 0.1, 1 and 0.1
sd <- list(0.1, 1, 0.1)
# Edit this call to pmap() to iterate over the sd list as well
pmap(list(n, mu, sd), rnorm)
## [[1]]
## [1] 1.0528919 1.1648386 0.8886547 1.0393355 1.0712239
##
## [[2]]
## [1] 4.152407 4.597224 6.697670 5.089153 4.062108 4.383448 6.371211
## [8] 5.521881 5.859346 5.762934
##
## [[3]]
## [1] 10.029109 9.943092 9.886960 9.957687 9.954770 9.777668 9.965139
## [8] 10.055390 9.972113 9.948000 10.044777 10.081238 10.066976 9.972944
## [15] 10.038387 9.996991 9.980090 9.874104 10.084952 9.891329
Compare the following two calls to pmap() (run them in the console and compare their output too!):
pmap(list(n, mu, sd), rnorm)
pmap(list(mu, n, sd), rnorm)
What’s the difference? By default pmap() matches the elements of the list to the arguments in the function by position. In the first case, n to the n argument of rnorm(), mu to the mean argument of rnorm(), and sd to the sd argument of rnorm(). In the second case mu gets matched to the n argument of rnorm(), which is clearly not what we intended!
Instead of relying on this positional matching, a safer alternative is to provide names in our list. The name of each element should be the argument name we want to match it to.
Let’s fix up that second call.
Without changing their ordering, simply name the elements of the argument list to properly match the arguments of rnorm().
# Name the elements of the argument list
pmap(list(mean = mu, n = n, sd = sd), rnorm)
## [[1]]
## [1] 1.1519608 1.0735480 1.0280553 0.9261175 0.9922490
##
## [[2]]
## [1] 4.131899 4.949653 5.228590 4.832954 3.812897 5.707803 3.509142
## [8] 2.491301 6.604584 5.137850
##
## [[3]]
## [1] 9.964222 9.961345 10.080620 9.974861 9.760178 10.160124 9.860835
## [8] 9.882635 10.047251 9.820728 9.937227 10.065364 9.993196 10.101329
## [15] 9.978600 9.985901 10.003213 9.904783 9.959685 9.957789
Sometimes it’s not the arguments to a function you want to iterate over, but a set of functions themselves. Imagine that instead of varying the parameters to rnorm() we want to simulate from different distributions, say, using rnorm(), runif(), and rexp(). How do we iterate over calling these functions?
In purrr, this is handled by the invoke_map() function. The first argument is a list of functions. In our example, something like:
f <- list("rnorm", "runif", "rexp")
The second argument specifies the arguments to the functions. In the simplest case, all the functions take the same argument, and we can specify it directly, relying on ... to pass it to each function. In this case, call each function with the argument n = 5:
invoke_map(f, n = 5)
In more complicated cases, the functions may take different arguments, or we may want to pass different values to each function. In this case, we need to supply invoke_map() with a list, where each element specifies the arguments to the corresponding function.
Let’s use this approach to simulate three samples from the following three distributions: Normal(10, 1), Uniform(0, 5), and Exponential(5).
We’ve given you some code to get you started.
min and max elements to runif_params with values 0 and 5 respectively.rate element to rexp_params with value 5.invoke_map() of f() using the params list as the second argument, keeping n = 5 as a global argument.# Define list of functions
f <- list("rnorm", "runif", "rexp")
# Parameter list for rnorm()
rnorm_params <- list(mean = 10)
# Add a min element with value 0 and max element with value 5
runif_params <- list(min = 0, max = 5)
# Add a rate element with value 5
rexp_params <- list(rate = 5)
# Define params for each function
params <- list(
rnorm_params,
runif_params,
rexp_params
)
# Call invoke_map() on f supplying params as the second argument
invoke_map(f, params, n = 5)
## [[1]]
## [1] 9.898867 9.106245 10.191681 9.598026 9.432234
##
## [[2]]
## [1] 2.3012579 0.7346494 2.6671266 0.3859850 0.9904662
##
## [[3]]
## [1] 0.53597570 0.45202542 0.07089529 0.09059730 0.09841491
walk() operates just like map() except it’s designed for functions that don’t return anything. You use walk() for functions with side effects like printing, plotting or saving.
Let’s check that our simulated samples are in fact what we think they are by plotting a histogram for each one.
We’ve included code from our simulation of three samples from three different distributions, except we’ve bumped up the sample size to 50. We’ve also added some names to f and params to help us remember what matches up with what.
sims.walk() on sims with the hist() function to create a histogram of each sample.# Define list of functions
f <- list(Normal = "rnorm", Uniform = "runif", Exp = "rexp")
# Define params
params <- list(
Normal = list(mean = 10),
Uniform = list(min = 0, max = 5),
Exp = list(rate = 5)
)
# Assign the simulated samples to sims
sims <- invoke_map(f, params, n = 50)
# Use walk() to make a histogram of each element in sims
walk(sims, hist)
Those histograms were pretty good, but they really needed better breaks for the bins on the x-axis. That means we need to vary two arguments to hist(): x and breaks. Remember map2()? That allowed us to iterate over two arguments. Guess what? There is a walk2(), too!
Let’s use walk2() to improve those histograms with better breaks.
We’ve loaded sims in your workplace.
breaks argument to hist() is "Sturges". Replace "Sturges" in the breaks_list list with reasonable breaks for the histograms. Let’s use seq(6, 16, 0.5) for the Normal, seq(0, 5, 0.25) for the Uniform and seq(0, 1.5, 0.1) for the Exponential.walk2() to create a histogram for each sample with the breaks in breaks_list.# Replace "Sturges" with reasonable breaks for each sample
breaks_list <- list(
Normal = seq(6, 16, 0.5),
Uniform = seq(0, 5, 0.25),
Exp = seq(0, 1.5, 0.1)
)
# Use walk2() to make histograms with the right breaks
walk2(sims, breaks_list, hist)
In the previous exercise, we hard-coded the breaks, but that was a little lazy. Those breaks probably won’t be great if we change the parameters of our simulation.
A better idea would be to generate reasonable breaks based on the actual values in our simulated samples. This is a great chance to review our function writing skills and combine our own function with purrr.
Let’s start by writing our own function find_breaks(), which copies the default breaks in the ggplot2 package: break the range of the data in 30 bins.
How do we start? Simple, of course! Here’s a snippet of code that works for the first sample:
rng <- range(sims[[1]], na.rm = TRUE)
seq(rng[1], rng[2], length.out = 30)
Your job in this exercise is to turn that snippet into a function.
In the next exercise, we’ll combine find_breaks() with map() and walk2() to create histograms with sensible breaks.
find_breaks(), which takes a single argument x and return the sequence of breaks.find_breaks() on sims[[1]].# Turn this snippet into find_breaks()
find_breaks <- function(x) {
rng <- range(x, na.rm = TRUE)
seq(rng[1], rng[2], length.out = 30)
}
# Call find_breaks() on sims[[1]]
find_breaks(sims[[1]])
## [1] 7.701320 7.833121 7.964923 8.096724 8.228525 8.360326 8.492128
## [8] 8.623929 8.755730 8.887532 9.019333 9.151134 9.282935 9.414737
## [15] 9.546538 9.678339 9.810141 9.941942 10.073743 10.205544 10.337346
## [22] 10.469147 10.600948 10.732750 10.864551 10.996352 11.128154 11.259955
## [29] 11.391756 11.523557
Now that we have find_breaks(), we can find nice breaks for all the samples using map(). Then, pass the result into walk2() to get nice (but custom breaks) for our samples.
map() to iterate find_breaks() over sims and assign the result to nice_breaks.nice_breaks as the second argument to walk2() to iterate over both the simulations and calculated breaks to plot histograms.# Use map() to iterate find_breaks() over sims: nice_breaks
nice_breaks <- map(sims, find_breaks)
# Use nice_breaks as the second argument to walk2()
walk2(sims, nice_breaks, hist)
Ugh! Nice breaks but those plots had UUUUGLY labels and tytles. The x-axis labels are easy to fix if we don’t mind every plot having its x-axis labeled the same way. We can use the ... argument to any of the map() or walk() functions to pass in further arguments to the function .f. In this case, we might decide we don’t want any labels on the x-axis, in which case we need to pass an empty string to the xlab argument of hist():
walk2(sims, nice_breaks, hist, xlab = "")
But, what about the titles? We don’t want them to be the same for each plot. How can we iterate over the arguments x, breaks and main? You guessed it, there is a pwalk() function that works just like pmap().
Let’s use pwalk() to tidy up these plots. Also, let’s increase our sample size to 1000.
nice_titles that contains the character strings: "Normal(10, 1)", Uniform(0, 5)", "Exp(5)".pwalk() instead of walk2() to iterate over the x, breaks and main arguments to hist(). Like for pmap(), the first argument to pwalk() should be a list() of arguments to hist() using matching by name. Keep the xlab = "" argument as-is to keep things clean.# Increase the sample size to 1000
sims <- invoke_map(f, params, n = 1000)
# Compute nice_breaks (don't change this)
nice_breaks <- map(sims, find_breaks)
# Create a vector of nice_titles
nice_titles <- c("Normal(10, 1)", "Uniform(0, 5)", "Exp(5)")
# Use pwalk() instead of walk2()
pwalk(list(sims, nice_breaks, main = nice_titles), hist, xlab = "")
One of the nice things about the walk() functions is that they return the object you passed to them. This means they can easily be used in pipelins (a pipeline is just a short way of saying “a statement with lots of pipes”).
To illustrate, we’ll return to our first example of making histograms for each sample:
walk(sims, hist)
Take a look at what gets returned:
tmp <- walk(sims, hist)
str(tmp)
It’s our original sims object. That means we can pipe the sims object along to other functions. For example, we might want some basic summary statistics on each sample as well as our histograms.
We’ve converted walk(sims, hist) to a piped statement: sims %>% walk(hist). Pipe the result into map using summary() as the .f argument.
# Pipe this along to map(), using summary() as .f
sims %>%
walk(hist) %>%
map(summary)
## $Normal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.392 9.367 10.061 10.045 10.713 13.328
##
## $Uniform
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002306 1.412243 2.531916 2.558309 3.764220 4.998873
##
## $Exp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0003988 0.0548013 0.1339406 0.1945181 0.2686363 1.7022091
In this chapter we’ll focus on writing functions that don’t surprise you or your users. We’ll expose you to some functions that work 95% of the time, and 5% of the time fail in surprising ways. You’ll learn which functions you should avoid using inside a function and which you should use with care.
Recall our both_na() function from Chapter2, that finds the number of entries where vectors x and y both have missing values:
both_na <- function(x, y) {
sum(is.na(x) & is.na(y))
}
We had an example where the behaviour was a little suprising:
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na(x, y)
The function works and returns 3, but we certainly didn’t design this function with the idea that people could pass in different length arguments.
Using stopifnot() is a quick way to have your functions stop, if a condition isn’t met. stopifnot() takes logical expressions as arguments and if any are FALSE an error will occur.
stopifnot() to both_na() that checks arguments x and y have the same length.both_na() to verify it returns an error.# Define troublesome x and y
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na <- function(x, y) {
# Add stopifnot() to check length of x and y
stopifnot(length(x) == length(y))
sum(is.na(x) & is.na(y))
}
# Call both_na() on x and y
both_na(x, y)
## Error: length(x) == length(y) is not TRUE
Using stop() instead of stopifnot() allows you to specify a more informative error message. Recall the general pattern for using stop() is:
if (condition) {
stop("Error", call. = FALSE)
}
Writing good error messages is an important part of writing a good function! We recommend your error tells the user what should betrue, not what is false. For example, here a good error would be "x and y must have the same length", rather than the bad error "x and y don't have the same length".
Let’s use this pattern to write a better check for the length of x and y.
condition with a logical statement that evaluates to TRUE when x and y have different lengths."x and y must have the same length".both_na() to verigy it returns a more informative error.# Define troublesome x and y
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na <- function(x, y) {
# Replace condition with logical
if (length(x) != length(y)) {
# Replace "Error" with better message
stop("x and y must have the same length", call. = FALSE)
}
sum(is.na(x) & is.na(y))
}
# Call both_na on x and y
both_na(x, y)
## Error: x and y must have the same length
Side effects describe the things that happen when you run a function that alters the state of your R session. If foo() is a function with no side effects (a.k.a. pure), then when we run x <- foo(), the only change we expect is that the variable x now has a new value. No other variables in the global environment should be changed or created, no output should be printed, no plots displayed, no files saved, no options changed. We know exactly the changes to the state of the session just be reading the call to the function.
Can you identify which of these functions doesn’t have side effects?
show_missings <- function(x) {
n <- sum(is.na(x))
cat("Missing values: ", n, "\n", sep = "")
}
replace_missings <- function(x, replacement) {
x[is.na(x)] <- replacement
}
plot_missings <- function(x) {
plot(seq_along(x), is.na(x))
x
}
exclude_missings <- function(x) {
options(na.action = "na.exclude")
}
show_missings()replace_missings()plot_missings()exclude_missings()Correct, great job! Of course functions with side effects are crucial for data analysis. You need to be aware of them, and deliberate in their usage. It’s ok to use them if the side effect is desired, but don’t surprise users with unexpected side effects.
sapply() is another common offender returning unstable types. The type of output returned from sapply() depends on the type of input.
Consider the following data frame and two calls to sapply():
df <- data.frame(
a = 1L,
b = 1.5,
y = Sys.time(),
z = ordered(1)
)
A <- sapply(df[1:4], class)
B <- sapply(df[3:4], class)
What type of objects will be A and B be?
A and B will both be character vectors.A and B will both be lists.A will be a list, B will be a character vectorA, will be a list, B will be a character matrixThis unpredictable behavior is a sign that you shouldn’t rely on sapply() inside your own functions.
So, what do you do? Use alternate functions that are type consistent! And you already know a whole set: the map() functions in purrr.
In this example, when we call class() on the columns of the data frame we are expecting character output, so our function of choice should be: map_chr():
df <- data.frame(
a = 1L,
b = 1.5,
y = Sys.time(),
z = ordered(1)
)
A <- map_chr(df[1:4], class)
B <- map_chr(df[3:4], class)
Except that gives us errors. This is a good thing! It alerts us that our assumption (that class() would return purely character output) is wrong.
Let’s look at a couple of solutions. First, we could use map() instead of map_chr(). Our result will always be a list, no matter the input.
The first two chunks give some sapply() calls, and demonstrate the type inconsistency by calling str() on each result.
X, Y and Z using map() instead of sapply().str() on your X, Y and Z to demonstrate type consistency of map().df <- data.frame(
a = 1L,
b = 1.5,
y = Sys.time(),
z = ordered(1)
)
# sapply calls
A <- sapply(df[1:4], class)
B <- sapply(df[3:4], class)
C <- sapply(df[1:2], class)
# Demonstrate type inconsistency
str(A)
## List of 4
## $ a: chr "integer"
## $ b: chr "numeric"
## $ y: chr [1:2] "POSIXct" "POSIXt"
## $ z: chr [1:2] "ordered" "factor"
str(B)
## chr [1:2, 1:2] "POSIXct" "POSIXt" "ordered" "factor"
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "y" "z"
str(C)
## Named chr [1:2] "integer" "numeric"
## - attr(*, "names")= chr [1:2] "a" "b"
# Use map() to define X, Y and Z
X <- map(df[1:4], class)
Y <- map(df[3:4], class)
Z <- map(df[1:2], class)
# Use str() to check type inconsistency
str(X)
## List of 4
## $ a: chr "integer"
## $ b: chr "numeric"
## $ y: chr [1:2] "POSIXct" "POSIXt"
## $ z: chr [1:2] "ordered" "factor"
str(Y)
## List of 2
## $ y: chr [1:2] "POSIXct" "POSIXt"
## $ z: chr [1:2] "ordered" "factor"
str(Z)
## List of 2
## $ a: chr "integer"
## $ b: chr "numeric"
If we wrap our solution into a function, we can be confident that this function will always return a list because we’ve used a type consistent function, map():
col_classes <- function(df) {
map(df, class)
}
But what if you wanted this function to always return a character string?
One option would be to decide what should happen if class() returns something longer than length 1. For example, we might simply take the first element of the vector returned by class().
class_list.map_chr() along with the numeric subsetting shortcut, to extract the first element from every item in class_list.col_classes <- function(df) {
# Assign list output to class_list
class_list <- map(df, class)
#Use map_chr() to extract first element in class_list
map_chr(class_list, 1)
}
# Check that our new function is type consistent
df %>% col_classes() %>% str()
## Named chr [1:4] "integer" "numeric" "POSIXct" "ordered"
## - attr(*, "names")= chr [1:4] "a" "b" "y" "z"
df[3:4] %>% col_classes() %>% str()
## Named chr [1:2] "POSIXct" "ordered"
## - attr(*, "names")= chr [1:2] "y" "z"
df[1:2] %>% col_classes() %>% str()
## Named chr [1:2] "integer" "numeric"
## - attr(*, "names")= chr [1:2] "a" "b"
Another option would be to simply fail. We could rely on map_chr()’s type consistency to fail for us:
col_classes <- function(df) {
map_chr(df, class)
}
df %>% col_classes() %>% str()
Or, check the condition ourselves and return an informative error message. We’ll implement this approach in this exercise.
As you write more functions, you’ll find you often come across this tension between implementing a function that does something sensible when something surprising happens, or simply fails when something surprising happens. Our recommendation is to fail when you are writing functions that you’ll use behind the scenes for programming and to do something sensible when writing functions for users to use interactively.
(And by the way, flatten_chr() is yet another useful function in purrr. It takes a list and removes its hierarchy. The suffix _chr indicates that this is another type consistent function, and will either return a character string or an error message.)
condition with a logical statement that uses any() and map_dbl() to check if any element in class_list has length greater than 1.col_classes <- function(df) {
class_list <- map(df, class)
# Add a check that no element of class_list has length > 1
if (any(map_dbl(class_list, length) > 1)) {
stop("Some columns have more than one class", call. = FALSE)
}
# Use flatten_chr() to return a character vector
flatten_chr(class_list)
}
# Check that our new function is type consistent
df %>% col_classes() %>% str()
## Error: Some columns have more than one class
df[3:4] %>% col_classes() %>% str()
## Error: Some columns have more than one class
df[1:2] %>% col_classes() %>% str()
## chr [1:2] "integer" "numeric"
Let’s take a look at a function that uses the non-standard evaluation (NSE) function filter() from the dplyr package:
big_x <- function(df, threshold) {
dplyr::filter(df, x > threshold)
}
This big_x() function attempts to return all rows in df where the x column exceeds a certain threshold. Let’s get a feel for how it might be used.
We’ve placed diamonds_sub, a 20 row subset of the diamonds data from the ggplot2 package in your workspace. Use big_x() to find all rows in diamonds_sub where the x column is greater than 7.
# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, 7)
## Warning: package 'bindrcpp' was built under R version 3.4.4
## # A tibble: 2 x 10
## carat cut color clarity depth table price x y z
## <dbl> <int> <int> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.35 3 4 4 60.9 54 10471 7.18 7.15 4.36
## 2 1.63 3 3 1 62 55 7229 7.57 7.5 4.68
Now, let’s see how this function might fail. There are two instances in which the non-standard evaluation of filter() could cause surprising results:
x column doesn’t exist in df.threshold column in df.Let’s illustrate these failures. In each case we’ll use big_x() in the same way as the previous exercise, so we should expect the same output. However, not only do we get unexpected outputs, there is no indication (i.e. error message) that lets us know something might have gone wrong.
x and give it the value 1.big_x() to find all rows in diamonds_sub where the x column is greater than 7.threshold column in diamonds_sub with the value 100.big_x() to find all rows in diamonds_sub where the x column is greater than 7.# Remove the x column from diamonds
diamonds_sub$x <- NULL
# Create variable x with value 1
x <- 1
# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, 7)
## # A tibble: 0 x 9
## # ... with 9 variables: carat <dbl>, cut <int>, color <int>,
## # clarity <int>, depth <dbl>, table <dbl>, price <int>, y <dbl>, z <dbl>
# Create a threshold column with value 100
diamonds_sub$threshold <- 100
# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, 7)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <int>, color <int>,
## # clarity <int>, depth <dbl>, table <dbl>, price <int>, y <dbl>,
## # z <dbl>, threshold <dbl>
To avoid the problems caused by non-standard evaluation functions, you could avoid using them. In our example, we could achieve the same results by using standard subsetting (i.e. []) instead of filter(). For more insight into dealing with NSE and how to write your own non-standard evaluation functions, we recommend reading Hadley’s vignette on the topic. Also, programming with the NSE functions in dplyr will be easier in a future version.
If you do need to use non-standard evaluation functions, it’s up to you to provide protection against the problem cases. That means you need to know what the problem cases are, to check for them, and to fail explicitly.
To see what that might look like, let’s rewrite big_x() to fail for our problem cases.
Write a check for each of the following:
x is not in names(df), stop with the message "df must contain variable called x".threshold is in names(df), stop with the message "df must not contain variable called threshold".Remember to use the argument call. = FALSE in each call to stop() so that the call is not a part of the error message.
big_x <- function(df, threshold) {
# Write a check for x not being in df
if (!"x" %in% names(df)) {
stop("df must contain variable called x", call. = FALSE)
}
# Write a check for threshold being in df
if ("threshold" %in% names (df)) {
stop("df must not contain variable called threshold", call. = FALSE)
}
dplyr::filter(df, x > threshold)
}
In general, you want to avoid having the return value of your own functions depend on any global options. That way, you and others can reason about your functions without needing to know the current state of the options.
It is, however, okay to have side effects of a function depend on global options. For example, the print() function uses getOption("digits") as the default for the digits argument. This gives users some control over how results are displayed, but doesn’t change the underlying computation.
Let’s take a look at an example function that uses a global default sensibly. The print.lm() function has the options digits with default max(3, getOption("digits") - 3).
We’ve fit a regression model of fuel efficiency on weight using the mtcars dataset.
summary() to take a look at the fitted regression model. Pay particular attention to number of decimal places reported.digits option to 2: options(digits = 2).summary(). Notice the number of decimal places has changed, but there is no change to the underlying fit object.# Fit a regression model
fit <- lm(mpg ~ wt, data = mtcars)
# Look at the summary of the model
summary(fit)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
# Set the global digits option to 2
options(digits = 2)
# Take another look at the summary
summary(fit)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.543 -2.365 -0.125 1.410 6.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.285 1.878 19.86 < 2e-16 ***
## wt -5.344 0.559 -9.56 1.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3 on 30 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.745
## F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10