R - Functions
A function is a set of statements organized together to perform a
specific task. R has a large number of in-built functions and the user
can create their own functions.
In R, a function is an object so the R interpreter is able to pass
control to the function, along with arguments that may be necessary for
the function to accomplish the actions.
The function in turn performs its task and returns control to the
interpreter as well as any result which may be stored in other
objects.
Function Definition
An R function is created by using the keyword function. The basic
syntax of an R function definition is as follows −
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#function_name <- function(arg_1, arg_2, ...) {
# Function body
#}
Function Components
The different parts of a function are −
1. Function Name − This is the actual name of the
function. It is stored in the R environment as an object with this
name.
2. Arguments − An argument is a placeholder. When a
function is invoked, you pass a value to the argument. Arguments are
optional; that is, a function may contain no arguments. Also arguments
can have default values.
3. Function Body − The function body contains a
collection of statements that defines what the function does.
4. Return Value − The return value of a function is
the last expression in the function body to be evaluated.
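As a small illustration of these four components, the sketch below defines a hypothetical function, rect_area (not part of the original text), and inspects it with the base functions formals() and body().

```r
# Hypothetical example: the name is rect_area, the arguments are width and
# height (height has a default value), and the body is the code in braces.
rect_area <- function(width, height = 1) {
  area <- width * height
  area  # last evaluated expression, so this is the return value
}

names(formals(rect_area))  # argument names: "width" "height"
body(rect_area)            # the function body as an unevaluated expression
rect_area(3, 4)            # 12
rect_area(3)               # 3, because height falls back to its default
```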
R has many in-built functions that can be called directly
in a program without defining them first. We can also create
and use our own functions, referred to as user-defined
functions.
Built-in Function
Simple examples of in-built functions are seq(),
mean(), max(), sum()
and paste(). They are called directly by user-written
programs.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
## [1] 32 33 34 35 36 37 38 39 40 41 42 43 44
# Find mean of numbers from 25 to 82.
print(mean(25:82))
## [1] 53.5
# Find sum of numbers from 41 to 68.
print(sum(41:68))
## [1] 1526
When we execute the above code, it produces the results shown in the
lines beginning with ##.
User-defined Function
We can create user-defined functions in R. They are specific to what
a user wants and once created they can be used like the built-in
functions. Below is an example of how a function is created and
used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
Calling a Function
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
# Call the function new.function supplying 6 as an argument.
new.function(6)
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
When we execute the above code, it produces the result shown in the
lines beginning with ##.
Calling a Function without an Argument
# Create a function without an argument.
new.function <- function() {
for(i in 1:5) {
print(i^2)
}
}
# Call the function without supplying an argument.
new.function()
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
When we execute the above code, it produces the result shown in the
lines beginning with ##.
Calling a Function with Argument Values (by position and by
name)
The arguments to a function call can be supplied in the same
sequence as defined in the function or they can be supplied in a
different sequence but assigned to the names of the arguments.
# Create a function with arguments.
new.function <- function(a,b,c) {
result <- a * b + c
print(result)
}
# Call the function by position of arguments.
new.function(5,3,11)
## [1] 26
# Call the function by names of the arguments.
new.function(a = 11, b = 5, c = 3)
## [1] 58
When we execute the above code, it produces the results shown in the
lines beginning with ##.
Calling a Function with Default Argument
We can define default values for the arguments in the function definition
and call the function without supplying any argument to get the default
result. We can also call such functions with new argument values
to get a non-default result.
# Create a function with arguments.
new.function <- function(a = 3, b = 6) {
result <- a * b
print(result)
}
# Call the function without giving any argument.
new.function()
## [1] 18
# Call the function giving new values for the arguments.
new.function(9,5)
## [1] 45
When we execute the above code, it produces the results shown in the
lines beginning with ##.
Lazy Evaluation of Function
Arguments to functions are evaluated lazily, which means they are
evaluated only when needed by the function body.
# Create a function with arguments.
new.function <- function(a, b) {
print(a^2)
print(a)
print(b)
}
# Evaluate the function without supplying one of the arguments.
#new.function(6)
When we execute the above code with the call new.function(6)
uncommented, it prints 36 and 6 and then stops with an error at
print(b), because b was not supplied and has no default value.
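To observe this behaviour without halting a script, the call can be wrapped in tryCatch(). The sketch below is only an illustration: lazy_demo is a hypothetical stand-in for new.function that returns a value instead of printing.

```r
# Hypothetical variant of the function above.
lazy_demo <- function(a, b) {
  first <- a^2   # only `a` is needed here, so this works without `b`
  first + b      # `b` is evaluated here; if it is missing, R signals an error
}

# Capture the error message instead of stopping the script.
msg <- tryCatch(lazy_demo(6), error = function(e) conditionMessage(e))
print(msg)

# Supplying both arguments works as usual.
lazy_demo(6, 1)  # 37
```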
R Functions ..
A function is a block of code which only runs when it is
called.
You can pass data, known as parameters, into a function.
A function can return data as a result.
Creating a Function
To create a function, use the function() keyword:
Example
my_function <- function() { # create a function with the name my_function
print("Hello World!")
}
Call a Function
To call a function, use the function name followed by parentheses,
like my_function():
Example
my_function <- function() {
print("Hello World!")
}
my_function() # call the function named my_function
## [1] "Hello World!"
Arguments
Information can be passed into functions as arguments.
Arguments are specified after the function name, inside the
parentheses. You can add as many arguments as you want, just separate
them with a comma.
The following example has a function with one argument (fname). When
the function is called, we pass along a first name, which is used inside
the function to print the full name:
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}
my_function("Peter")
## [1] "Peter Griffin"
my_function("Lois")
## [1] "Lois Griffin"
my_function("Stewie")
## [1] "Stewie Griffin"
Parameters or Arguments?
The terms “parameter” and “argument” can be used for the same thing:
information that is passed into a function.
From a function’s perspective:
A parameter is the variable listed inside the parentheses in the
function definition.
An argument is the value that is sent to the function when it is
called.
Number of Arguments
By default, a function must be called with the correct number of
arguments: if your function expects 2 arguments, you have
to call it with exactly 2 arguments, no more and no fewer:
Example
This function expects 2 arguments, and gets 2 arguments:
my_function <- function(fname, lname) {
paste(fname, lname)
}
my_function("Peter", "Griffin")
## [1] "Peter Griffin"
If you try to call the function with 1 or 3 arguments, you will get
an error:
Example
This function expects 2 arguments, and gets 1 argument:
my_function <- function(fname, lname) {
paste(fname, lname)
}
#my_function("Peter")
Default Parameter Value
The following example shows how to use a default parameter
value.
If we call the function without an argument, it uses the default
value:
Example
my_function <- function(country = "Norway") {
paste("I am from", country)
}
my_function("Sweden")
## [1] "I am from Sweden"
my_function("India")
## [1] "I am from India"
my_function() # will get the default value, which is Norway
## [1] "I am from Norway"
my_function("USA")
## [1] "I am from USA"
Return Values
To let a function return a result, use the return() function:
Example
my_function <- function(x) {
return (5 * x)
}
print(my_function(3))
## [1] 15
print(my_function(5))
## [1] 25
print(my_function(9))
## [1] 45
The output of the code above is shown in the lines beginning with ##.
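Since the return value of a function is the last expression evaluated in its body, return() is optional at the end of a function. The sketch below compares the two styles; both function names are hypothetical.

```r
# Explicit return, as in the example above.
times_five_explicit <- function(x) {
  return(5 * x)
}

# Implicit return: the value of the last evaluated expression is returned.
times_five_implicit <- function(x) {
  5 * x
}

times_five_explicit(3)  # 15
times_five_implicit(3)  # 15
```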
Nested Functions
There are two ways to create a nested function:
- Call a function within another function.
- Write a function within a function.
Example
Call a function within another function:
Nested_function <- function(x, y) {
a <- x + y
return(a)
}
Nested_function(Nested_function(2,2), Nested_function(3,3))
## [1] 10
Example Explained
The function tells x to add y.
The first input Nested_function(2,2) is “x” of the main
function.
The second input Nested_function(3,3) is “y” of the main
function.
The output is therefore (2+2) + (3+3) = 10.
Example
Write a function within a function:
Outer_func <- function(x) {
Inner_func <- function(y) {
a <- x + y
return(a)
}
return (Inner_func)
}
output <- Outer_func(3) # To call the Outer_func
output(5)
## [1] 8
Example Explained
You cannot call Inner_func directly because it has
been defined (nested) inside Outer_func.
We need to call Outer_func first in order to call Inner_func as a
second step.
We create a new variable called output and assign it the function
returned by Outer_func(3), in which “x” is 3.
We then call output with the desired value of “y”, which in
this case is 5.
The output is therefore 8 (3 + 5).
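Because the inner function keeps access to the x it was created with, the same outer function can produce several independent functions. A minimal sketch of this idea, with make_adder, add3 and add10 as hypothetical names:

```r
# The outer function returns a new function that remembers x.
make_adder <- function(x) {
  function(y) {
    x + y
  }
}

add3 <- make_adder(3)    # x is fixed at 3 for this function
add10 <- make_adder(10)  # x is fixed at 10 for this one

add3(5)   # 8
add10(5)  # 15
```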
Recursion
R also accepts function recursion, which means a defined function
can call itself.
Recursion is a common mathematical and programming concept. It has
the benefit that you can work through data step by step to reach a
result.
The developer should be very careful with recursion as it can be
quite easy to slip into writing a function which never terminates, or
one that uses excess amounts of memory or processor power. However, when
written correctly, recursion can be a very efficient and
mathematically-elegant approach to programming.
In this example, tri_recursion() is a function that
we have defined to call itself (“recurse”). We use the
k variable as the data, which decrements
(-1) every time we recurse. The recursion ends when
k is no longer greater than 0 (i.e. when it is 0).
To a new developer it can take some time to work out how exactly
this works; the best way to find out is by testing and modifying it.
Example
tri_recursion <- function(k) {
if (k > 0) {
result <- k + tri_recursion(k - 1)
print(result)
} else {
result = 0
return(result)
}
}
tri_recursion(6)
## [1] 1
## [1] 3
## [1] 6
## [1] 10
## [1] 15
## [1] 21
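For comparison, here is a sketch of the same recursion written without printing intermediate results, so that only the final value is returned. The name tri_sum is hypothetical; the check against sum(1:k) is added here for illustration.

```r
# Recursive sum 1 + 2 + ... + k, returning only the final value.
tri_sum <- function(k) {
  if (k <= 0) {
    return(0)  # base case: stops the recursion
  }
  k + tri_sum(k - 1)
}

tri_sum(6)  # 21
sum(1:6)    # 21, the same result computed directly
```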
19. R Functions…
19.1 Introduction
One of the best ways to improve your reach as a data scientist is to
write functions. Functions allow you to automate common tasks in a more
powerful and general way than copy-and-pasting. Writing a function has
three big advantages over using copy-and-paste:
1. You can give a function an evocative name that makes your code
easier to understand.
2. As requirements change, you only need to update code in one
place, instead of many.
3. You eliminate the chance of making incidental mistakes when you
copy and paste (i.e. updating a variable name in one place, but not in
another).
Writing good functions is a lifetime journey. Even after using R for
many years I still learn new techniques and better ways of approaching
old problems. The goal of this chapter is not to teach you every
esoteric detail of functions but to get you started with some pragmatic
advice that you can apply immediately.
As well as practical advice for writing functions, this chapter also
gives you some suggestions for how to style your code. Good code style
is like correct punctuation. Youcanmanagewithoutit, but it sure makes
things easier to read! As with styles of punctuation, there are many
possible variations. Here we present the style we use in our code, but
the most important thing is to be consistent.
19.1.1 Prerequisites
19.2 When should you write a function?
You should consider writing a function whenever you’ve copied and
pasted a block of code more than twice (i.e. you now have three copies
of the same code). For example, take a look at this code. What does it
do?
df <- tibble::tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
You might be able to puzzle out that this rescales each column to
have a range from 0 to 1. But did you spot the mistake? I made an error
when copying-and-pasting the code for df$b: I forgot to change an a to a
b. Extracting repeated code out into a function is a good idea because
it prevents you from making this type of mistake.
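There is some duplication in this code. Each rescaling works on a single
input, a column of df, which we can store in a temporary variable x. We’re
also computing the range of the data three times, so it makes sense to do
it in one step:
x <- df$a
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])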
## [1] 0.3331202 0.2554806 0.5560026 0.4972072 0.9374437 0.7440388 0.5335841
## [8] 0.5296189 0.0000000 1.0000000
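The text that follows refers to a function named rescale01, whose definition does not appear above. A minimal sketch, assuming it simply wraps the two lines just shown:

```r
# Rescale a numeric vector so that it lies between 0 and 1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10))  # 0.0 0.5 1.0
```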
There are three key steps to creating a new function:
1. You need to pick a name for the function. Here
I’ve used rescale01 because this function rescales a vector to lie
between 0 and 1.
2. You list the inputs, or arguments, to the
function inside function. Here we have just one argument. If we had more
the call would look like function(x, y, z).
3. You place the code you have developed in the body of
the function, a { block that immediately follows function(...).
Note the overall process: I only made the function after I’d figured
out how to make it work with a simple input. It’s easier to start with
working code and turn it into a function; it’s harder to create a
function and then try to make it work.
We can simplify the original example now that we have a
function:
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
Compared to the original, this code is easier to understand and
we’ve eliminated one class of copy-and-paste errors. There is still
quite a bit of duplication since we’re doing the same thing to multiple
columns. We’ll learn how to eliminate that duplication in iteration,
once you’ve learned more about R’s data structures in vectors.
Another advantage of functions is that if our requirements change,
we only need to make the change in one place. For example, we might
discover that some of our variables include infinite values, and
rescale01() fails:
x <- c(1:10, Inf)
rescale01(x)
## [1] 0 0 0 0 0 0 0 0 0 0 NaN
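Because the code lives in a function, a fix only has to be made in one place. The sketch below is one possible fix, assuming infinite values should be ignored when computing the range; it relies on the finite argument of base range().

```r
# Variant of rescale01() that ignores infinite values when finding the range.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(1:10, Inf)
rescale01(x)  # finite values now lie between 0 and 1; Inf stays Inf
```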
This is an important part of the “do not repeat yourself” (or DRY)
principle. The more repetition you have in your code, the more places
you need to remember to update when things change (and they always do!),
and the more likely you are to create bugs over time.
19.2.1 Exercises
1. Why is TRUE not a parameter to rescale01()? What would happen if
x contained a single missing value, and na.rm was FALSE?
2. In the second variant of rescale01(), infinite values are left
unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is
mapped to 1.
3. Practice turning the following code snippets into functions.
Think about what each function does. What would you call it? How many
arguments does it need? Can you rewrite it to be more expressive or less
duplicative?
mean(is.na(x))
## [1] 0
x / sum(x, na.rm = TRUE)
## [1] 0 0 0 0 0 0 0 0 0 0 NaN
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
## [1] NaN
4. Write your own functions to compute the variance and skewness of
a numeric vector. Variance is defined as
\[
\operatorname{Var}(x)=\frac{1}{n-1}
\sum_{i=1}^n\left(x_i-\bar{x}\right)^2
\]
where
\[
\bar{x}=\left(\sum_{i=1}^n x_i\right) / n
\]
is the sample mean. Skewness is defined as
\[
\operatorname{Skew}(x)=\frac{\frac{1}{n-2}\left(\sum_{i=1}^n\left(x_i-\bar{x}\right)^3\right)}{\operatorname{Var}(x)^{3/2}} .
\]
5. Write both_na(), a function that takes two vectors of the same
length and returns the number of positions that have an NA in both
vectors.
6. What do the following functions do? Why are they useful even
though they are so short?
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
7. Read the complete lyrics to “Little Bunny Foo Foo”. There’s a lot
of duplication in this song. Extend the initial piping example to
recreate the complete song, and use functions to reduce the
duplication.
19.3 Functions are for humans and computers
It’s important to remember that functions are not just for the
computer, but are also for humans. R doesn’t care what your function is
called, or what comments it contains, but these are important for human
readers. This section discusses some things that you should bear in mind
when writing functions that humans can understand.
The name of a function is important. Ideally, the name of your
function will be short, but clearly evoke what the function does. That’s
hard! But it’s better to be clear than short, as RStudio’s autocomplete
makes it easy to type long names.
If your function name is composed of multiple words, I recommend
using “snake_case”, where each lowercase word is separated by an
underscore. camelCase is a popular alternative. It doesn’t really matter
which one you pick, the important thing is to be consistent: pick one or
the other and stick with it. R itself is not very consistent, but
there’s nothing you can do about that. Make sure you don’t fall into the
same trap by making your code as consistent as possible.
# Never do this!
col_mins <- function(x, y) {}
rowMaxes <- function(y, x) {}
If you have a family of functions that do similar things, make sure
they have consistent names and arguments. Use a common prefix to
indicate that they are connected. That’s better than a common suffix
because autocomplete allows you to type the prefix and see all the
members of the family.
# Good
#input_select()
#input_checkbox()
#input_text()
# Not so good
#select_input()
#checkbox_input()
#text_input()
A good example of this design is the stringr package: if you don’t
remember exactly which function you need, you can type str_ and jog your
memory.
Where possible, avoid overriding existing functions and variables.
It’s impossible to do in general because so many good names are already
taken by other packages, but avoiding the most common names from base R
will avoid confusion.
# Don't do this!
T <- FALSE
c <- 10
mean <- function(x) sum(x)
Use comments, lines starting with #, to explain the
“why” of your code. You generally should avoid comments that explain the
“what” or the “how”. If you can’t understand what the code does from
reading it, you should think about how to rewrite it to be more clear.
Do you need to add some intermediate variables with useful names? Do you
need to break out a subcomponent of a large function so you can name it?
However, your code can never capture the reasoning behind your
decisions: why did you choose this approach instead of an alternative?
What else did you try that didn’t work? It’s a great idea to capture
that sort of thinking in a comment.
Comments can also be used to break a file into sections with header
lines. RStudio provides a keyboard shortcut to create these headers
(Cmd/Ctrl + Shift + R), and will display them in the code navigation
drop-down at the bottom-left of the editor.
19.3.1 Exercises
1. Read the source code for each of the following three functions,
puzzle out what they do, and then brainstorm better names.
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f2 <- function(x) {
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
2. Take a function that you’ve written recently and spend 5 minutes
brainstorming a better name for it and its arguments.
3. Compare and contrast rnorm() and MASS::mvrnorm(). How could you
make them more consistent?
4. Make a case for why norm_r(), norm_d() etc would be better than
rnorm(), dnorm(). Make a case for the opposite.
19.4 Conditional execution
An if statement allows you to conditionally execute code. It looks
like this:
#if (condition) {
# code executed when condition is TRUE
#} else {
# code executed when condition is FALSE
#}
To get help on if you need to surround it in backticks:
?`if`. The help isn’t particularly helpful if you’re not
already an experienced programmer, but at least you know how to get to
it!
Here’s a simple function that uses an if statement. The goal of this
function is to return a logical vector describing whether or not each
element of a vector is named.
has_name <- function(x) {
nms <- names(x)
if (is.null(nms)) {
rep(FALSE, length(x))
} else {
!is.na(nms) & nms != ""
}
}
This function takes advantage of the standard return rule: a
function returns the last value that it computed. Here that is either
one of the two branches of the if statement.
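A quick check of has_name() on a partly named vector and on an unnamed one; the inputs are made up for illustration, and the function is repeated so the snippet runs on its own.

```r
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))    # no names attribute at all
  } else {
    !is.na(nms) & nms != ""  # a missing or empty name counts as unnamed
  }
}

has_name(c(a = 1, 2, b = 3))  # TRUE FALSE TRUE
has_name(1:3)                 # FALSE FALSE FALSE
```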
19.4.1 Conditions
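The condition must evaluate to either TRUE or FALSE. If it’s a
vector of length greater than one, older versions of R give a warning
(shown below) and use only the first element, while R 4.2.0 and later
raise an error; if it’s an NA, you’ll get an
error. Watch out for these messages in your own code: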
#if (c(TRUE, FALSE)) {}
#> Warning in if (c(TRUE, FALSE)) {: the condition has length > 1 and only the
#> first element will be used
#> NULL
#if (NA) {}
#> Error in if (NA) {: missing value where TRUE/FALSE needed
You can use || (or) and &&
(and) to combine multiple logical expressions. These operators are
“short-circuiting”: as soon as || sees the first TRUE it returns TRUE
without computing anything else. As soon as &&
sees the first FALSE it returns FALSE. You should never use
| or & in an if statement: these
are vectorised operations that apply to multiple values (that’s why you
use them in filter()). If you do have a logical vector, you can use
any() or all() to collapse it to a
single value.
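A short sketch of short-circuiting and of collapsing a logical vector before using it in an if statement. The values and variable names are made up for illustration.

```r
# && stops at the first FALSE, so the stop() call is never evaluated.
FALSE && stop("not reached")  # FALSE

# || stops at the first TRUE.
TRUE || stop("not reached")   # TRUE

# Collapse a logical vector to a single value before using it in if.
v <- c(2, 5, -1)
all_positive <- all(v > 0)  # FALSE
any_negative <- any(v < 0)  # TRUE
if (any_negative) {
  message("v contains a negative value")
}
```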
Be careful when testing for equality. == is
vectorised, which means that it’s easy to get more than one output.
Either check the length is already 1, collapse with
all() or any(), or use the
non-vectorised identical().
identical() is very strict: it always returns either a
single TRUE or a single FALSE, and doesn’t coerce types. This means that
you need to be careful when comparing integers and doubles:
identical(0L, 0)
## [1] FALSE
You also need to be wary of floating point numbers:
x <- sqrt(2) ^ 2
x
## [1] 2
x == 2
## [1] FALSE
x - 2
## [1] 4.440892e-16
Instead use dplyr::near() for comparisons, as
described in comparisons.
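The idea behind a tolerance-based comparison can be sketched in base R. The helper is_near below is hypothetical and written in the spirit of dplyr::near(); the default tolerance is an assumption chosen for this example.

```r
x <- sqrt(2)^2
x == 2  # FALSE, because of floating point rounding

# Hypothetical helper: treat two numbers as equal if they differ by less
# than a small tolerance.
is_near <- function(a, b, tol = sqrt(.Machine$double.eps)) {
  abs(a - b) < tol
}

is_near(x, 2)            # TRUE
isTRUE(all.equal(x, 2))  # TRUE, a base R alternative
```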
And remember, x == NA doesn’t do anything
useful!
19.4.2 Multiple conditions
You can chain multiple if statements together:
#if (this) {
# do that
#} else if (that) {
# do something else
#} else {
#
#}
But if you end up with a very long series of chained if statements,
you should consider rewriting. One useful technique is the
switch() function. It allows you to evaluate selected
code based on position or name.
#> function(x, y, op) {
#> switch(op,
#> plus = x + y,
#> minus = x - y,
#> times = x * y,
#> divide = x / y,
#> stop("Unknown op!")
#> )
#> }
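To try the switch() example, the function needs a name. In the sketch below it is assigned to calc, a hypothetical name chosen here, and the error case is captured with tryCatch() so the script keeps running.

```r
calc <- function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")  # unnamed last argument: used when nothing matches
  )
}

calc(6, 3, "plus")    # 9
calc(6, 3, "divide")  # 2
tryCatch(calc(6, 3, "power"), error = function(e) conditionMessage(e))
```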
Another useful function that can often eliminate long chains of if
statements is cut(). It’s used to discretise continuous
variables.
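A minimal sketch of cut(), using made-up scores and labels. By default the intervals are closed on the right, so the breaks below define (0, 50], (50, 70] and (70, 100].

```r
scores <- c(35, 58, 72, 91)  # illustrative data

grade <- cut(scores,
  breaks = c(0, 50, 70, 100),
  labels = c("low", "medium", "high")
)

as.character(grade)  # "low" "medium" "high" "high"
```

Because cut() is vectorised, one call classifies every element of scores at once, which a chain of if statements cannot do.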
19.4.3 Code style
Both if and function should (almost) always be followed by squiggly
brackets ({}), and the contents should be indented by
two spaces. This makes it easier to see the hierarchy in your code by
skimming the left-hand margin.
An opening curly brace should never go on its own line and should
always be followed by a new line. A closing curly brace should always go
on its own line, unless it’s followed by else. Always indent the code
inside curly braces.
# Good
#if (y < 0 && debug) {
# message("Y is negative")
#}
#if (y == 0) {
# log(x)
#} else {
# y ^ x
#}
# Bad
#if (y < 0 && debug)
#message("Y is negative")
#if (y == 0) {
# log(x)
#}
#else {
# y ^ x
#}
It’s ok to drop the curly braces if you have a very short if
statement that can fit on one line:
y <- 10
x <- if (y < 20) "Too low" else "Too high"
19.4.4 Exercises
1. What’s the difference between if and
ifelse()? Carefully read the help and construct three
examples that illustrate the key differences.
2. Write a greeting function that says “good morning”, “good
afternoon”, or “good evening”, depending on the time of day. (Hint: use
a time argument that defaults to lubridate::now(). That
will make it easier to test your function.)
3. Implement a fizzbuzz function. It takes a single number as input.
If the number is divisible by three, it returns “fizz”. If it’s
divisible by five it returns “buzz”. If it’s divisible by three and
five, it returns “fizzbuzz”. Otherwise, it returns the number. Make sure
you first write working code before you create the function.
4. How could you use cut() to simplify this set of
nested if-else statements?
#if (temp <= 0) {
# "freezing"
#} else if (temp <= 10) {
# "cold"
#} else if (temp <= 20) {
# "cool"
#} else if (temp <= 30) {
# "warm"
#} else {
# "hot"
#}
How would you change the call to cut() if I’d used < instead of
<=? What is the other chief advantage of cut() for this problem?
(Hint: what happens if you have many values in temp?)
5. What happens if you use switch() with numeric values?
6. What does this switch() call do? What happens if x is “e”?
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)
Experiment, then carefully read the documentation.
19.5 Function arguments
The arguments to a function typically fall into two broad sets: one
set supplies the data to compute on, and the other supplies arguments
that control the details of the computation. For example:
- In log(), the data is x, and the
detail is the base of the logarithm.
- In mean(), the data is x, and
the details are how much data to trim from the ends
(trim) and how to handle missing values
(na.rm).
- In t.test(), the data are x and
y, and the details of the test are
alternative, mu,
paired, var.equal, and
conf.level.
- In str_c() you can supply any number of strings
to ..., and the details of the concatenation are
controlled by sep and collapse.
Generally, data arguments should come first. Detail arguments should
go on the end, and usually should have default values. You specify a
default value in the same way you call a function with a named
argument:
# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
se <- sd(x) / sqrt(length(x))
alpha <- 1 - conf
mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x <- runif(100)
mean_ci(x)
## (two numbers near 0.5; the exact values vary from run to run)
mean_ci(x, conf = 0.99)
## (a wider interval around the same mean)
The default value should almost always be the most common value. The
few exceptions to this rule are to do with safety. For example, it makes
sense for na.rm to default to FALSE
because missing values are important. Even though na.rm =
TRUE is what you usually put in your code, it’s a bad idea to
silently ignore missing values by default.
When you call a function, you typically omit the names of the data
arguments, because they are used so commonly. If you override the
default value of a detail argument, you should use the full name:
# Good
#mean(1:10, na.rm = TRUE)
# Bad
#mean(x = 1:10, , FALSE)
#mean(, TRUE, x = c(1:10, NA))
You can refer to an argument by its unique prefix
(e.g. mean(x, n = TRUE)), but this is generally best
avoided given the possibilities for confusion.
Notice that when you call a function, you should put spaces
around = and always put a space
after a comma, not before (just like in regular English). Using
whitespace makes it easier to skim the function for the important
components.
# Good
#average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
#average<-mean(feet/12+inches,na.rm=TRUE)
19.5.1 Choosing names
The names of the arguments are also important. R doesn’t care, but
the readers of your code (including future-you!) will. Generally you
should prefer longer, more descriptive names, but there are a handful of
very common, very short names. It’s worth memorising these:
- x, y, z:
vectors.
- w: a vector of weights.
- df: a data frame.
- i, j: numeric indices (typically
rows and columns).
- n: length, or number of rows.
- p: number of columns.
Otherwise, consider matching names of arguments in existing R
functions. For example, use na.rm to determine if
missing values should be removed.
19.5.2 Checking values
What happens if x and w are not
the same length?
wt_mean(1:6, 1:3)
## [1] 7.666667
In this case, because of R’s vector recycling rules, we don’t get an
error.
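The wt_mean() used above is not defined in this text. A minimal sketch that is consistent with the output shown, assuming a plain weighted mean:

```r
# Weighted mean of x with weights w (no input checks yet).
wt_mean <- function(x, w) {
  sum(x * w) / sum(w)
}

# w is recycled to length 6 inside x * w, but sum(w) still uses only the
# three supplied weights, so the result is 46 / 6.
wt_mean(1:6, 1:3)  # 7.666667
```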
It’s good practice to check important preconditions, and throw an
error (with stop()), if they are not true:
wt_mean <- function(x, w) {
if (length(x) != length(w)) {
stop("`x` and `w` must be the same length", call. = FALSE)
}
sum(w * x) / sum(w)
}
Be careful not to take this too far. There’s a tradeoff between how
much time you spend making your function robust, versus how long you
spend writing it. For example, if you also added a
na.rm argument, I probably wouldn’t check it
carefully:
wt_mean <- function(x, w, na.rm = FALSE) {
if (!is.logical(na.rm)) {
stop("`na.rm` must be logical")
}
if (length(na.rm) != 1) {
stop("`na.rm` must be length 1")
}
if (length(x) != length(w)) {
stop("`x` and `w` must be the same length", call. = FALSE)
}
if (na.rm) {
miss <- is.na(x) | is.na(w)
x <- x[!miss]
w <- w[!miss]
}
sum(w * x) / sum(w)
}
This is a lot of extra work for little additional gain. A useful
compromise is the built-in stopifnot(): it checks that
each argument is TRUE, and produces a generic error
message if not.
wt_mean <- function(x, w, na.rm = FALSE) {
stopifnot(is.logical(na.rm), length(na.rm) == 1)
stopifnot(length(x) == length(w))
if (na.rm) {
miss <- is.na(x) | is.na(w)
x <- x[!miss]
w <- w[!miss]
}
sum(w * x) / sum(w)
}
#wt_mean(1:6, 6:1, na.rm = "foo")
#> Error in wt_mean(1:6, 6:1, na.rm = "foo"): is.logical(na.rm) is not TRUE
Note that when using stopifnot() you assert what
should be true rather than checking for what might be wrong.
19.5.3 Dot-dot-dot (...)
Many functions in R, such as sum() and stringr::str_c(), accept any
number of inputs. How do these functions work? They rely on a special
argument: ... (pronounced dot-dot-dot). This special argument
captures any number of arguments that aren’t otherwise matched.
It’s useful because you can then send those ... on to
another function. This is a useful catch-all if your function primarily
wraps another function. For example, I commonly create these helper
functions that wrap around str_c():
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
## [1] "a, b, c, d, e, f, g, h, i, j"
rule <- function(..., pad = "-") {
title <- paste0(...)
width <- getOption("width") - nchar(title) - 5
cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
## Important output -----------------------------------------------------------
Here ... lets me forward on any arguments that I
don’t want to deal with to str_c(). It’s a very
convenient technique. But it does come at a price: any misspelled
arguments will not raise an error. This makes it easy for typos to go
unnoticed:
x <- c(1, 2)
sum(x, na.mr = TRUE)
## [1] 4
If you just want to capture the values of the ...,
use list(...).
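A small sketch of capturing ... with list(...); the function names count_args and arg_names are hypothetical.

```r
# Count how many arguments were passed through ...
count_args <- function(...) {
  args <- list(...)
  length(args)
}

# Return the names given to the arguments passed through ...
arg_names <- function(...) {
  names(list(...))
}

count_args(1, "a", TRUE)  # 3
arg_names(a = 1, b = 2)   # "a" "b"
```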
19.5.4 Lazy evaluation
Arguments in R are lazily evaluated: they’re not computed until
they’re needed. That means if they’re never used, they’re never called.
This is an important property of R as a programming language, but is
generally not important when you’re writing your own functions for data
analysis. You can read more about lazy evaluation at http://adv-r.had.co.nz/Functions.html#lazy-evaluation.
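A minimal sketch of lazy evaluation: the second argument below would raise an error if it were evaluated, but the function never uses it. The name first_only is hypothetical.

```r
first_only <- function(x, y) {
  x  # y is never used, so the expression supplied for it is never evaluated
}

first_only(1, stop("this error is never triggered"))  # 1
```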
19.5.5 Exercises
1. What does commas(letters, collapse = "-") do?
Why?
2. It’d be nice if you could supply multiple characters to the
pad argument, e.g. rule("Title", pad = "-+"). Why doesn’t this
currently work? How could you fix it?
3. What does the trim argument to
mean() do? When might you use it?
4. The default value for the method argument to
cor() is c("pearson", "kendall", "spearman"). What does that
mean? What value is used by default?
19.6 Return values
Figuring out what your function should return is usually
straightforward: it’s why you created the function in the first place!
There are two things you should consider when returning a value:
1. Does returning early make your function easier to read?
2. Can you make your function pipeable?
19.6.1 Explicit return statements
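A function usually returns the value of the last statement it
evaluates, but you can return early with return(). One reason to do so
is that you have an if statement
with one complex block and one simple block. For example, you might
write an if statement like this: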
f <- function() {
if (x) {
# Do
# something
# that
# takes
# many
# lines
# to
# express
} else {
# return something short
}
}
But if the first block is very long, by the time you get to the
else, you’ve forgotten the condition.
One way to rewrite it is to use an early return for the simple
case:
f <- function() {
if (!x) {
return(something_short)
}
# Do
# something
# that
# takes
# many
# lines
# to
# express
}
This tends to make the code easier to understand, because you don’t
need quite so much context to understand it.
19.6.2 Writing pipeable functions
If you want to write your own pipeable functions, it’s important to
think about the return value. Knowing the return value’s object type
will mean that your pipeline will “just work”. For example, with
dplyr and tidyr the object type is the
data frame.
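The examples that follow call show_missings(), whose definition does not appear in this text. A minimal sketch consistent with the output shown, assuming it counts NA values, reports the count with cat(), and hands the data frame back through invisible():

```r
show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  invisible(df)  # return the input without printing it at the console
}
```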
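A function such as show_missings(), which prints a summary and returns
its input invisibly, does not print the data frame when called
interactively. But it’s still there, it’s just not printed by default: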
x <- show_missings(mtcars)
## Missing values: 0
class(x)
## [1] "data.frame"
dim(x)
## [1] 32 11
And we can still use it in a pipe:
mtcars %>%
show_missings() %>%
mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>%
show_missings()
## Missing values: 0
## Missing values: 18
19.7 Environment
The last component of a function is its environment. This is not
something you need to understand deeply when you first start writing
functions. However, it’s important to know a little bit about
environments because they are crucial to how functions work. The
environment of a function controls how R finds the value associated with
a name. For example, take this function:
f <- function(x) {
x + y
}
In many programming languages, this would be an error, because
y is not defined inside the function. In R, this is
valid code because R uses rules called lexical scoping
to find the value associated with a name. Since y is
not defined inside the function, R will look in the
environment where the function was defined:
y <- 100
f(10)
## [1] 110
y <- 1000
f(10)
## [1] 1010
This behaviour seems like a recipe for bugs, and indeed you should
avoid creating functions like this deliberately, but by and large it
doesn’t cause too many problems (especially if you regularly restart R
to get to a clean slate).
The advantage of this behaviour is that from a language standpoint
it allows R to be very consistent. Every name is looked up using the
same set of rules. For f() that includes the behaviour
of two things that you might not expect: { and
+. This allows you to do devious things like:
`+` <- function(x, y) {
if (runif(1) < 0.1) {
sum(x, y)
} else {
sum(x, y) * 1.1
}
}
table(replicate(1000, 1 + 2))
##
## 3 3.3
## 97 903
rm(`+`)
This is a common phenomenon in R. R places few limits on your power.
You can do many things that you can’t do in other programming languages.
You can do many things that 99% of the time are extremely
ill-advised (like overriding how addition works!). But
this power and flexibility is what makes tools like
ggplot2 and dplyr possible. Learning
how to make best use of this flexibility is beyond the scope of this
book, but you can read about it in Advanced R.