If we can do anything in R, then we already know how to use functions. This tutorial will cover how to create our own functions. We’ll also review some looping functions, including the infamous apply() family of functions.
We already know that when we use a function, we need to know what the arguments are. For example, if we want to use the mean() function, we look at the documentation by typing ?mean().
The first function we write will be our own version of a function that calculates the mean of a numeric vector. Since we can’t just call it mean (or, if we do, we will replace that base function in our environment), we will call our function myMean.
Here is the basic structure of creating a function.
myMean <- function( ){
}
A function is created by using another function called function(). The parentheses will contain the parameters we want our function to have, and the curly braces will contain the operation that will be performed on the parameters. First, we’ll need a vector of numbers, so we’ll call it the x parameter.
myMean <- function(x){
#do something to x
}
Now we’ll calculate the average of the numbers in x.
myMean <- function(x){
total_count_of_values <- length(x)
total_sum_of_values <- sum(x)
average_of_values <- total_sum_of_values/total_count_of_values
average_of_values
}
We’ve found how many numbers are in the x vector, we’ve added up all the values in that vector, and we’ve found the average. In a function, the last line will always be returned. You can also use the return() function, but it’s not necessary.
myMean <- function(x){
total_count_of_values <- length(x)
total_sum_of_values <- sum(x)
average_of_values <- total_sum_of_values/total_count_of_values
return(average_of_values)
}
Let’s try our function.
my_vector <- c(1, 3, 5, 2, 6, 9, 0)
vector_mean <- myMean(x = my_vector)
vector_mean
## [1] 3.714286
Like most programming languges, R has for and while loops. We’ll just review for loops and move on to apply() functions, which are more commonly used in R.
For loops are used to repeat an operation a set number of times. The basic outline is
for(i in sequence){
}
The sequence parameter is typically a vector. The i parameter is a variable that will take on the values in the sequence vector. For instance, if sequence was the vector c(1, 2, 3) then the i will take on each of those values in turn.
Here we use our myMean() function to find the average of three vectors.
my_list <- list(c(1, 5, 9, 3), 1:10, c(23, 42))
my_averages <- c()
for(i in c(1, 2, 3)){
my_averages[i] <- myMean(my_list[[i]])
}
my_averages
## [1] 4.5 5.5 32.5
We use the i parameter as a variable to extract different members of my_list and add the average to different positions in the my_averages vector.
We could use the i parameter to specify a column in a data frame. Let’s use a for loop to calculate the means of a few columns in the chicago_air dataset from the region5air package.
library(region5air)
data(chicago_air)
head(chicago_air)
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
## 2 2013-01-02 0.020 15 0.61 1 4
## 3 2013-01-03 0.021 28 0.17 1 5
## 4 2013-01-04 0.028 18 0.62 1 6
## 5 2013-01-05 0.025 26 0.48 1 7
## 6 2013-01-06 0.026 36 0.47 1 1
chicago_avgs <- c()
for(i in c("ozone", "temp", "solar")){
chicago_avgs[i] <- myMean(chicago_air[, i])
}
chicago_avgs
## ozone temp solar
## NA NA 0.8410411
It looks like there is a problem with myMean(). We didn’t account for NAs. A properly written function would need to take into account that sort of thing, but we can also deal with NAs in the for loop.
chicago_avgs <- c()
for(i in c("ozone", "temp", "solar")){
numeric_series <- chicago_air[, i]
numeric_series <- numeric_series[!is.na(numeric_series)]
chicago_avgs[i] <- myMean(numeric_series)
}
chicago_avgs
## ozone temp solar
## 0.03567257 54.83984375 0.84104110
In R, the most efficient way to do loops is to use the apply() functions. These are functions that have apply() at the end of their name (such as lapply(), tapply(), and mapply()) and apply functions to each member of a vector, list, or column in a data frame.
apply() takes a data frame (or matrix) as the first argument. The second argument specifies if you want to apply a function to the rows (1) or columns (2), and the third argument is the function you want to apply to each row or column. Additional arguments can be used to pass on to the function being applied to each row or column.air <- chicago_air[, c("ozone", "temp", "solar")]
air_max <- apply(air,
MARGIN = 2, # we are applying the max() function to each column
FUN = max,
na.rm = TRUE# na.rm is being passed to the max() function
)
air_max
## ozone temp solar
## 0.081 92.000 1.490
lapply() applies a function to each member of a list. Here we find the length of each vector in my_list.lapply(my_list, length)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 10
##
## [[3]]
## [1] 2
sapply() does the same thing as lapply() but the output is simplified as much as possible. Also, for any of the apply() functions, the FUN argument can be specified inside the function itself. Here we find how many NAs are in each column of the air data frame. (Since a data frame is really a list of columns, air can be used as the list argument.)sapply(air, function(column){
the_NAs <- is.na(column) # returns a logical vector, TRUE if NA
sum(the_NAs) # since TRUE is equivalent, this gives a count of NAs
})
## ozone temp solar
## 26 109 0
tapply() takes (typically) a vector of values that a function will be applied to. The second argument is a list of one or more factors that will split the first vector into sections that the function will then be applied to. The third argument is the function, and additional arguments can be passed on to the function. Here we find the max ozone values in the chicago_air data frame by month.tapply(chicago_air$ozone, list(chicago_air$month), max, na.rm = TRUE)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.034 0.041 0.052 0.062 0.081 0.074 0.043 0.066 0.078 0.057 0.030 0.030
Go to the exercises page for this tutorial and test your comprehension of this material.