Chapter 15 Functions
Functions allow you to automate common tasks in a more powerful and general way than copying and pasting. Writing a function has 3 big advantages:
You can give a function an evocative name that makes your code easier to understand
As requirements change, you only need to update code in one place, instead of many
You eliminate the chance of making incidental mistakes when you copy and paste
Prerequisites
library(tidyverse)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df
rng1 <- range(df$a, na.rm = TRUE)
rng1
[1] -2.392289 2.051636
None.
When should you write a function?
Answer: when you copy and paste a block of code more than twice
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
df
To rewrite this as a function, first figure out what the inputs are. Here there is only one: the vector being rescaled (df$a in the first line). Creating the function then takes three steps:
1. Pick a name for the function.
2. List the inputs, or arguments, to the function.
3. Place the code you have developed in the body of the function, a { block that immediately follows function(…).
# range() returns the minimum and maximum of a vector
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0,5,10))
[1] 0.0 0.5 1.0
# test with different inputs
rescale01(c(1,2,3,NA, 5))
[1] 0.00 0.25 0.50 NA 1.00
We can now simplify the original code with this function:
library(tibble)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
df$a
[1] 0.4185576 0.9440421 0.8372570 0.0000000 0.9109720 0.5342870 0.4982837
[8] 0.3155818 1.0000000 0.7430001
df$b
[1] 0.5817967 0.5698511 0.0000000 0.9808045 1.0000000 0.7010986 0.6378013
[8] 0.5781051 0.2233968 0.2868422
df$c
[1] 0.27002818 1.00000000 0.84519963 0.54433445 0.27935139 0.00000000 0.78181957
[8] 0.65830083 0.69281630 0.07192731
df$d
[1] 0.3093271 0.8444363 0.0000000 0.9475431 0.6310155 1.0000000 0.8329480
[8] 0.4633139 0.9300175 0.4341780
Now, if we discover that we need to handle infinite values, we only need to change the function:
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
df$a
[1] 0.00000000 0.23644565 1.00000000 0.08290783 0.37141227 0.30269086 0.24381466
[8] 0.99788512 0.53050533 0.48883139
df$b
[1] 0.23924610 0.00000000 0.75608943 0.41153726 0.15258224 0.29927755 0.23271249
[8] 1.00000000 0.25795056 0.09907989
df$c
[1] 0.18814529 0.41358598 0.68311228 0.88936122 0.26587056 0.64924870 0.31643881
[8] 0.00000000 1.00000000 0.04308001
df$d
[1] 0.9316141 1.0000000 0.4976616 0.8795419 0.6274105 0.0000000 0.3567989
[8] 0.8379030 0.5763909 0.4061070
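To see why the finite = TRUE change matters, here is a small sketch that runs both versions on a vector containing Inf. The names rescale01_v1 and rescale01_v2 are made up for this comparison only, so that both versions can exist side by side.

```r
# First version: range() includes Inf, so the denominator becomes Inf
rescale01_v1 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Second version: finite = TRUE makes range() ignore Inf, -Inf and NA
rescale01_v2 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(1:10, Inf)
v1 <- rescale01_v1(x)  # finite values collapse to 0, Inf becomes NaN
v2 <- rescale01_v2(x)  # finite values span 0 to 1, Inf stays Inf
v1
v2
```

With the first version the single Inf ruins the scaling of every other value, which is the problem the one-line fix addresses.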
Exercises
- Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, and na.rm was FALSE?
A single missing value means that x contains at least one NA. If x contains any NA and na.rm = FALSE, range() returns NA for both limits, so every element of the result is NA. I can confirm this by testing a version of the function that takes na.rm as an argument:
rescale01_alt <- function(x, na.rm = FALSE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01_alt(c(NA, 1:5), na.rm = FALSE)
[1] NA NA NA NA NA NA
In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1.
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
y <- (x - rng[1]) / (rng[2] - rng[1])
y[y == -Inf] <- 0
y[y == Inf] <- 1
y
}
rescale01(c(Inf, -Inf, 0:5, NA))
[1] 1.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 NA
Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?
mean(is.na(x))
x / sum(x, na.rm = TRUE)
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
This function rescales a vector so that its elements can be read as weights. If all elements of x are non-negative, the result sums to 1.
weights <- function(x) {
x / sum(x, na.rm = TRUE)
}
y <- weights(0:5)
y
[1] 0.00000000 0.06666667 0.13333333 0.20000000 0.26666667 0.33333333
sum(y)
[1] 1
Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.
both_na <- function(x, y) {
sum(is.na(x) & is.na(y))
}
both_na(c(NA, NA, 1, 2),
c(NA, 1, NA, 2))
[1] 1
both_na(c(NA, NA, 1, 2, NA, NA, 1),
c(NA, 1, NA, 2, NA, NA, 1))
[1] 3
What do the following functions do? Why are they useful even though they are so short?
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
The function is_directory checks whether the path in x is a directory. The function is_readable checks whether the path in x is readable, meaning that the file exists and the user has permission to open it. These functions are useful even though they are short because their names make it much clearer what the code is doing.
Functions are for humans and computers
Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are OK if the function computes a very well-known noun (e.g. mean() is better than compute_mean()).
If the function name uses multiple words, use snake_case, where each lowercase word is separated by an underscore. Also avoid reusing the most common names from base R, to avoid confusion.
Use comments, lines starting with #, to explain the “why” of your code. You generally should avoid comments that explain the “what” or the “how”.
Use long lines of - or = to make it easy to spot the breaks:
# load data -----------------
# Plot data =================
Exercise
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f2 <- function(x) {
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
f1 tests whether each element of string starts with prefix; a better name would be has_prefix(). The function f2 drops the last element; a better name for f2 is drop_last(). The function f3 repeats y until it is as long as x. This is a harder one to name. I would say something like recycle (R’s name for this behavior), or expand.
Compare and contrast rnorm() and MASS::mvrnorm(). How could you make them more consistent?
rnorm() samples from the univariate normal distribution, while MASS::mvrnorm() samples from the multivariate normal distribution. The main arguments in rnorm() are n, mean, and sd. The main arguments in MASS::mvrnorm() are n, mu, and Sigma. To be consistent they should have the same names. However, this is difficult. In general, it is better to be consistent with more widely used functions, e.g. mvrnorm() should follow the conventions of rnorm(). However, while mean is correct in the multivariate case, sd does not make sense there, because the spread is described by a covariance matrix. Both functions are internally consistent, though; it would be bad to have mu and sd, or mean and Sigma.
Make a case for why norm_r(), norm_d() etc would be better than rnorm(), dnorm(). Make a case for the opposite.
If named norm_r() and norm_d(), the names group the family of functions related to the normal distribution. If named rnorm() and dnorm(), functions are grouped into families by the action they perform: r* functions always sample from distributions (rnorm, rbinom, runif, rexp), and d* functions calculate the probability density or mass of a distribution (dnorm, dbinom, dunif, dexp).
Conditional Execution
An if statement allows you to conditionally execute code. It looks like this:
if (condition) {
# code executed when condition is true
} else {
# code executed when condition is false
}
To get help on if, surround it with backticks after the ?:
?`if`
Remember that the function returns the last value that it computed:
has_name <- function(x) {
nms <- names(x)
if (is.null(nms)) {
rep(FALSE, length(x))
} else {
!is.na(nms) & nms !=""
}
}
has_name(c("wilson", "","robert"))
[1] FALSE FALSE FALSE
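The call above returns all FALSE because the vector has no names at all, so the first branch runs. As a small illustrative sketch (the example vector partly_named is made up), here is the same function applied to a partially named vector, which exercises the else branch:

```r
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))
  } else {
    !is.na(nms) & nms != ""
  }
}

# Only the first and third elements carry a name
partly_named <- c(a = 1, 2, c = 3)
named_flags <- has_name(partly_named)
named_flags  # TRUE FALSE TRUE
```

In both cases the value of the last expression evaluated inside the if/else is what the function returns.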
Conditions
The condition must evaluate to a single TRUE or FALSE. If it’s a vector of length greater than one, you’ll get an error in R 4.2.0 and later (earlier versions used the first element and issued a warning); if it’s an NA, you’ll get an error. Watch out for these messages in your own code.
You can use || (or) and && (and) to combine multiple logical expressions. These operators are “short-circuiting”: as soon as || sees the first TRUE it returns TRUE without computing anything else. As soon as && sees the first FALSE it returns FALSE.
You should never use | or & in an if statement. These are vectorized operations that apply to multiple values (that is why you use them in filter()). If you do have a logical vector, you can use any() or all() to collapse it to a single value.
Be careful when testing for equality. == is vectorized, which means that it’s easy to get more than one output. Either check the length is already 1, collapse with all() or any(), or use the non-vectorized identical().
identical() is very strict. It always returns either a single TRUE or a single FALSE, and it doesn’t coerce types. This means that you need to be careful when comparing integers and doubles.
Also be wary of floating point numbers. Use dplyr::near() for comparisons. And remember that x == NA doesn’t do anything useful.
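The points above can be checked with a few one-liners. This is only a sketch in base R: isTRUE(all.equal()) is used here as a stand-in for dplyr::near() so that the example does not need dplyr, and all variable names are made up for illustration.

```r
x <- c(1, 2, 3)

# == is vectorized: three comparisons, three results
x == 2

# Collapse to a single TRUE/FALSE before using the result in if
any_two <- any(x == 2)   # TRUE
all_two <- all(x == 2)   # FALSE

# identical() does not coerce types: integer 2L versus double 2
same_type <- identical(2L, 2)   # FALSE

# Floating point: sqrt(2)^2 is not exactly 2
exact <- sqrt(2)^2 == 2                      # FALSE
nearly <- isTRUE(all.equal(sqrt(2)^2, 2))    # TRUE

# Comparing with NA yields NA; use is.na() instead
na_cmp <- NA == NA     # NA
na_test <- is.na(NA)   # TRUE

# && short-circuits: the right-hand side is never evaluated here
short <- FALSE && stop("never reached")   # FALSE
```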
Multiple Conditions
You can chain multiple if statements together:
if (this) {
# do that
} else if (that) {
# do something else
} else {
# do this
}
If you end up with a long series of chained if statements, consider using the switch() function instead. It allows you to evaluate selected code based on position or name.
switch_demo <- function(x,y,op) {
switch(op,
plus = x + y,
minus = x - y,
times = x * y,
divide = x / y,
sqr = x ^ y,
stop("unknown op!")
)
}
switch_demo(4,2,"sqr")
[1] 16
Code Style
Both if and function should (almost) always be followed by curly brackets ({}), and the contents should be indented by two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else. Always indent the code inside curly braces.
Exercises
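What’s the difference between if and ifelse()? Carefully read the help and construct three examples that illustrate the key differences.
The keyword if tests a single condition and runs one of two branches of code, while ifelse() is a vectorized function that tests each element of a logical vector and returns a vector of the same length.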
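Here are three short examples of my own for this exercise. They are only a sketch, and the variable names are made up for illustration:

```r
x <- c(-1, 0, 1)

# 1. ifelse() is vectorized: one result per element of the test
signs <- ifelse(x >= 0, "non-negative", "negative")

# 2. if expects a single TRUE or FALSE, so reduce the vector first
if (all(x >= 0)) {
  summary_label <- "all non-negative"
} else {
  summary_label <- "some negative"
}

# 3. ifelse() propagates NA, whereas an NA condition in if is an error
with_na <- ifelse(c(TRUE, NA, FALSE), "yes", "no")

signs
summary_label
with_na
```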
Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(). That will make it easier to test your function.)
library(lubridate)
greet <- function(time = lubridate::now()) {
hr <- hour(time)
# I don't know what to do about times after midnight,
# are they evening or morning?
if (hr < 12) {
print("good morning")
} else if (hr < 17) {
print("good afternoon")
} else {
print("good evening")
}
}
greet()
[1] "good afternoon"
greet(ymd_h("2017-01-08:05"))
[1] "good morning"
greet(ymd_h("2017-01-08:13"))
[1] "good afternoon"
greet(ymd_h("2017-01-08:20"))
[1] "good evening"
Implement a fizzbuzz function. It takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number. Make sure you first write working code before you create the function.
fizzbuzz <- function(x) {
  stopifnot(length(x) == 1)
  stopifnot(is.numeric(x))
  # this could be made more efficient by minimizing the
  # number of tests
  if (!(x %% 3) && !(x %% 5)) {
    print("fizzbuzz")
  } else if (!(x %% 3)) {
    print("fizz")
  } else if (!(x %% 5)) {
    print("buzz")
  } else {
    # not divisible by three or five: return the number itself
    print(x)
  }
}
fizzbuzz(5)
[1] "buzz"
fizzbuzz(15)
[1] "fizzbuzz"
fizzbuzz(4)
[1] 4
How could you use cut() to simplify this set of nested if-else statements?
# original code
if (temp <= 0) {
"freezing"
} else if (temp <= 10) {
"cold"
} else if (temp <= 20) {
"cool"
} else if (temp <= 30) {
"warm"
} else {
"hot"
}
Two advantages of using cut() are that it works on vectors, whereas if only works on a single value, and that changing the comparisons from <= to < only requires changing the right argument, whereas the if version would need four operators changed.
temp <- seq(-10, 50, by = 5)
cut(temp, c(-Inf, 0, 10, 20, 30, Inf), right = TRUE,
labels = c("freezing", "cold", "cool", "warm", "hot"))
[1] freezing freezing freezing cold cold cool cool warm
[9] warm hot hot hot hot
Levels: freezing cold cool warm hot
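To make the point about the right argument concrete, this sketch compares right = TRUE (equivalent to using <=) with right = FALSE (equivalent to using <). The names breaks, closed_right and closed_left are made up for illustration.

```r
temp <- seq(-10, 50, by = 5)
breaks <- c(-Inf, 0, 10, 20, 30, Inf)
labels <- c("freezing", "cold", "cool", "warm", "hot")

# right = TRUE: intervals are (lower, upper], matching temp <= upper
closed_right <- cut(temp, breaks, right = TRUE, labels = labels)

# right = FALSE: intervals are [lower, upper), matching temp < upper
closed_left <- cut(temp, breaks, right = FALSE, labels = labels)

# The two versions differ only at the break points themselves
differs <- as.character(closed_right) != as.character(closed_left)
temp[differs]
```

Only the temperatures that sit exactly on a boundary (0, 10, 20 and 30) change category.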
Function Arguments
The arguments to a function typically fall into two broad sets: one set supplies the data to compute on, and the other supplies arguments that control the details of the computation. For example:
In log(), the data is x, and the detail is the base of the logarithm.
In mean(), the data is x, and the details are how much data to trim from the ends (trim) and how to handle missing values (na.rm).
In t.test(), the data are x and y, and the details of the test are alternative, mu, paired, var.equal and conf.level.
In str_c(), you supply any number of strings to …, and the details of the concatenation are controlled by sep and collapse.
Choosing Names
The names of the arguments are important to readers. Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names:
x, y, z: vectors
w: a vector of weights
df: a data frame
i, j: numeric indices
n: length, or number of rows
p: number of columns
Checking Values
It’s good practice to check important preconditions and throw an error with stop() if they are not true:
wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("'x' and 'w' must be the same length", call. = FALSE)
  }
  # a weighted mean divides by the sum of the weights
  sum(w * x) / sum(w)
}
wt_mean(1:6, 2:7)
[1] 4.148148
See also stopifnot(): it checks that each argument is TRUE and produces a generic error message if not.
wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  # a weighted mean divides by the sum of the weights
  sum(w * x) / sum(w)
}
wt_mean(1:6, NA)
Error: length(x) == length(w) is not TRUE
Note that when using stopifnot() you assert what should be true rather than checking for what might be wrong.
Dot-Dot-Dot (…)
Many functions in R take an arbitrary number of inputs. How do these functions work? They rely on a special argument: … This special argument captures any number of arguments that aren’t otherwise matched.
It’s useful because you can then send those … on to another function. This is a useful catch-all if your function primarily wraps another function. For example:
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
[1] "a, b, c, d, e, f, g, h, i, j"
rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output from wilson")
Important output from wilson ------------------------------------------------
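If you want to look at what … actually captured, one option is to turn it into a list. The helper below is hypothetical (describe_dots is not from any package) and is only a sketch of that idea:

```r
# Hypothetical helper: report how many arguments arrived through ...
# and what they were called
describe_dots <- function(...) {
  args <- list(...)
  list(n = length(args), names = names(args))
}

info <- describe_dots(1, b = "two", c = 3:5)
info$n      # 3
info$names  # "" "b" "c"
```

Unnamed arguments show up with an empty string as their name, which is the same convention that names() uses for partially named vectors.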
Lazy Evaluation
Arguments in R are lazily evaluated. This means that they’re not computed until they’re needed. If they are never used, they are never evaluated. This is an important property of R.
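A minimal sketch of what lazy evaluation means in practice. Both functions are made up for illustration:

```r
# The second argument is never used, so it is never evaluated
# and the stop() call inside it never runs
ignore_second <- function(x, y) {
  x * 2
}
lazy_result <- ignore_second(21, stop("this is never evaluated"))
lazy_result  # 42

# Lazy evaluation also lets a default refer to another argument
scale_by <- function(x, factor = length(x)) {
  x / factor
}
scaled <- scale_by(c(2, 4, 6))  # divides by 3
scaled
```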
Exercise
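What does commas(letters, collapse = “-”) do? Why? The argument collapse is passed to str_c() as part of …, so the call becomes str_c(letters, collapse = “-”, collapse = “, ”). Because collapse is now supplied twice, R stops with an error saying that the formal argument collapse is matched by multiple actual arguments; it does not combine the letters into one string.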
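One way to check what this call does without needing stringr is the sketch below. It swaps in base R’s paste() for str_c(), which is an assumption made only to keep the example self-contained, and commas_base is a made-up name. The outcome is captured with tryCatch() so the script keeps running.

```r
# Base R stand-in for commas(): paste() instead of stringr::str_c()
commas_base <- function(...) paste(..., collapse = ", ")

ok <- commas_base(letters[1:3])
ok  # "a, b, c"

# Passing collapse again means it is supplied twice to paste()
msg <- tryCatch(
  commas_base(letters, collapse = "-"),
  error = function(e) conditionMessage(e)
)
msg  # the error message, not a collapsed string
```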
It’d be nice if you could supply multiple characters to the pad argument, e.g. rule(“Title”, pad = “-+”). Why doesn’t this currently work? How could you fix it? It does not work because it duplicates pad by the width minus the length of the string. This is implicitly assuming that pad is only one character. I could adjust the code to calculate the length of pad. The trickiest part is handling what to do if width is not a multiple of the number of characters of pad.
rule <- function(..., pad = "-+") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output from wilson")
Important output from wilson -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
rule("Important output from wilson", pad="*X*")
Important output from wilson *X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X*
So the solution is to work out how many whole copies of pad fit in the width, and then add part of pad to fill any remainder.
rule <- function(..., pad = "-+") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  padchar <- nchar(pad)
  cat(title, " ",
    stringr::str_dup(pad, width %/% padchar),
    # if width is not a multiple of nchar(pad), add the leading part of pad
    stringr::str_sub(pad, 1, width %% padchar),
    "\n", sep = ""
  )
}
rule("Important output from wilson")
Important output from wilson -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
rule("Important output from wilson", pad="*X*")
Important output from wilson *X**X**X**X**X**X**X**X**X**X**X**X**X**X**X**X*
What does the trim argument to mean() do? When might you use it? The trim argument removes a fraction of the observations from each end of the sorted vector before calculating the mean. This is useful for calculating a measure of central tendency that is robust to outliers, in the same spirit as discarding values beyond a chosen percentile.
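A quick sketch with made-up data shows the effect of trim on a vector with one extreme value:

```r
x <- c(1:9, 1000)  # one extreme value

plain <- mean(x)                 # (45 + 1000) / 10 = 104.5
trimmed <- mean(x, trim = 0.1)   # drops 1 value from each end: mean(2:9) = 5.5

plain
trimmed
```

With ten values and trim = 0.1, one observation is removed from each end, so the outlier no longer affects the result.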
The default value for the method argument to cor() is c(“pearson”, “kendall”, “spearman”). What does that mean? What value is used by default? It means that the method argument can take one of those three values. The first value, “pearson”, is used by default.
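A common idiom for this kind of default in your own functions is match.arg(). The function below is hypothetical (center is not from any package) and is only a sketch of the pattern, not a description of how cor() is written:

```r
# Hypothetical function using a vector of allowed values as the default
center <- function(x, type = c("mean", "median")) {
  type <- match.arg(type)  # picks "mean" when type is not supplied
  switch(type,
    mean = mean(x),
    median = median(x)
  )
}

x <- c(1, 2, 10)
default_center <- center(x)            # uses "mean"
median_center <- center(x, "median")   # 2
default_center
median_center
```

match.arg() also rejects values that are not in the list, so a typo in type produces an error instead of a silent fallback.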
Return Values
Figuring out what your function should return is usually easy. There are two things you should consider when returning a value:
Does returning early make your function easier to read?
Can you make your function pipeable?
Explicit Return statements
You can choose to return early by using return(). A common reason to do this is because the inputs are empty:
complicated_function <- function(x, y, z) {
  if (length(x) == 0 || length(y) == 0) {
    return(0)
  }
  # complicated code here
}
Another reason is because you have an if statement with one complex block and one simple block.
f <- function() {
  if (x) {
    # do
    # something
    # that
    # takes
    # many
    # lines
    # to
    # express
  } else {
    # return something short
  }
}
You can rewrite it with an early return for the simple case:
f <- function() {
  if (!x) {
    return(something_short)
  }
  # do
  # something
  # that
  # takes
  # many
  # lines
  # to
  # express
}
Writing Pipeable Functions
There are two types of pipeable functions: transformation and side-effect.
In transformation functions, there’s a clear primary object that is passed in as the first argument, and a modified version is returned by the function.
Side-effect functions are primarily called to perform an action, like drawing a plot or saving a file, not transforming an object. These functions should invisibly return the first argument, so that they can still be used in a pipeline.
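Here is a minimal sketch of one function of each type, using base R only. The names drop_incomplete and report_rows are made up for illustration. The calls are nested so the example runs without any pipe operator; the comment shows how the same chain would read in a pipe.

```r
# Transformation: returns a modified copy of its first argument
drop_incomplete <- function(df) {
  df[stats::complete.cases(df), , drop = FALSE]
}

# Side effect: prints a message, then invisibly returns its input unchanged
report_rows <- function(df) {
  cat("Rows:", nrow(df), "\n")
  invisible(df)
}

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# With a pipe this would read:
#   df %>% report_rows() %>% drop_incomplete() %>% report_rows()
result <- report_rows(drop_incomplete(report_rows(df)))
```

Because report_rows() hands its input back invisibly, it can sit in the middle of a chain without interrupting it and without printing the data frame.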
Environment
The last component of a function is its environment. This is not something you need to understand deeply when you first start writing functions. However, it’s important to know a little bit about environments because they are crucial to how functions work. If a name is not defined inside the function, R does not treat this as an error: it uses lexical scoping to look up the value associated with that name in the environment where the function was defined.
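A short sketch of lexical scoping in action (the variable names first and second are made up for illustration):

```r
f <- function(x) {
  x + y  # y is not defined inside f
}

y <- 100
first <- f(10)   # 110: y is found in the environment where f was defined

y <- 1000
second <- f(10)  # 1010: the lookup happens when f runs, not when f is created
first
second
```

In many languages this would be an error; in R it works, which is convenient but also means a function can silently depend on whatever happens to be defined outside it.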
