Have you ever done anything like this (the example is from Hadley Wickham’s book Advanced R Programming):
# Generate a sample dataset
create_dataframe <- function(na_value = -99) {
set.seed(1014)
df <- data.frame(replicate(6, sample(c(1:10,
na_value),
6, rep = TRUE)))
names(df) <- letters[1:6]
df
}
df <- create_dataframe()
Now, let’s replace all those -99 values with NAs:
df$a[df$a == -99] <- NA
df$b[df$b == -99] <- NA
df$c[df$c == -98] <- NA
df$d[df$d == -99] <- NA
df$e[df$e == -99] <- NA
df$f[df$g == -99] <- NA
Ok, this is painful, and error-prone. We might make copy-paste errors.
Can anyone spot an error?
Let’s do the right thing: start again, and write a function:
df <- create_dataframe()
df
## a b c d e f
## 1 1 6 1 5 -99 1
## 2 10 4 4 -99 9 3
## 3 7 9 5 4 1 4
## 4 2 9 3 8 6 8
## 5 1 10 5 9 8 6
## 6 6 2 1 3 8 5
fix_missing <- function(x) {
x[x == -99] <- NA
x
}
Now we can apply the function to each variable in the data frame:
For example:
df$d
## [1] 5 -99 4 8 9 3
contains one -99 value.
df$a <- fix_missing(df$a)
df$b <- fix_missing(df$b)
df$c <- fix_missing(df$c)
df$d <- fix_missing(df$d)
df$e <- fix_missing(df$e)
df$f <- fix_missing(df$e)
This is better, but still involves copying and pasting (should be avoided at all costs). In R, we would do something like this:
df <- create_dataframe()
df
## a b c d e f
## 1 1 6 1 5 -99 1
## 2 10 4 4 -99 9 3
## 3 7 9 5 4 1 4
## 4 2 9 3 8 6 8
## 5 1 10 5 9 8 6
## 6 6 2 1 3 8 5
df[] <- lapply(df, fix_missing)
df
## a b c d e f
## 1 1 6 1 5 NA 1
## 2 10 4 4 NA 9 3
## 3 7 9 5 4 1 4
## 4 2 9 3 8 6 8
## 5 1 10 5 9 8 6
## 6 6 2 1 3 8 5
But what if we have a new dataset with -98 as a missing value, instead of -99?
df <- create_dataframe(na_value = -98)
df
## a b c d e f
## 1 1 6 1 5 -98 1
## 2 10 4 4 -98 9 3
## 3 7 9 5 4 1 4
## 4 2 9 3 8 6 8
## 5 1 10 5 9 8 6
## 6 6 2 1 3 8 5
Let’s write a new function:
fix_missing <- function(x, na_value = -99) {
x[x == na_value] <- NA
x
}
Now we can pass in the correct value as an argument1 When using lapply, arguments to the function being applied are passed as argmuments to lapply(). :
df[] <- lapply(df, fix_missing, na_value = -98)
df
## a b c d e f
## 1 1 6 1 5 NA 1
## 2 10 4 4 NA 9 3
## 3 7 9 5 4 1 4
## 4 2 9 3 8 6 8
## 5 1 10 5 9 8 6
## 6 6 2 1 3 8 5
There are three key steps to creating a new function:
Pick a name for the function. Here we called the function fix_missing because that’s a fairly accurate description of what is does.
List the inputs, or arguments, to the function. Here we have two arguments: x and na_value
Place the code in the body of the function. This is the bit inside the curly braces {}. The return value of the function will be the last expression evaluated inside the function. R only returns a single object; if you want more than one object to be returned, you need to place these inside a list.
fix_missing <- function(x, na_value = -99) {
x[x == na_value] <- NA
x
}
The first argument, x, does not have a default value, whereas the second argument, na_value has a default value of -99. Aguments with default values can be omitted when calling the function, argument without default values are required.
Now, let’s do a little exercise:
x^k and a description (as.character) of the function call.power_fun_1 <- function(x, k = 2) {
x^k
}
power_fun_1(3)
## [1] 9
power_fun_2 <- function(x, k = 2) {
list(result = x^k,
call = paste(as.character(x), "to the power of",
as.character(k)))
}
power_fun_2(2, 3)
## $result
## [1] 8
##
## $call
## [1] "2 to the power of 3"
power_fun_3 <- function(x, k = 2) {
out <- list(result = x^k,
x = x,
k = k,
call = paste(as.character(x), "to the power of",
as.character(k)))
class(out) <- "some_class"
out
}
plot.some_class <- function(z) {
curve(x^z$k, -2*z$x, 2*z$x, lwd = 3,
col = "violet", ylab = z$call,
xlab = "x")
points(z$x, z$result, pch = 19, col = "blue", cex = 3)
}
Power function plot
out <- power_fun_3(2, k = 3)
plot(out)
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tint_0.0.3
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.4 magrittr_1.5 rprojroot_1.1
## [4] tools_3.3.2 htmltools_0.3.5 parallel_3.3.2
## [7] yaml_2.1.14 Rcpp_0.12.8 stringi_1.1.2
## [10] rmarkdown_1.2.9000 knitr_1.15.4 stringr_1.1.0
## [13] digest_0.6.10 evaluate_0.10