Let’s say you’re programming in R, and you find that you’re repeating the same procedure, analysis, or custom plot several times. For example, say you’re plotting multiple scatterplots, and you want to include a regression line on each plot (perhaps among many other customizations). Is there a way to repeat the same custom procedure many times without having to repeat code? As you guessed by the title of this document, the answer is yes! With custom functions!
One of the coolest things you can do in R is write your own custom functions. Custom functions allow you to define a specific action you are interested in, from calculating a unique statistic to creating a custom plot, which you can then easily apply to new data.
To make a custom function, you need to define one or more input(s) followed by some action and an output. Your function will take the input and perform the action based on that input, and spit out some output in the form of transformed data, summary statistics, or a plot.
For example, let’s create a custom function that replicates the standard mean() function in R:
My.Mean <- function(x) { # Name the function and define the input
# Define the action, in this case, the sum divided by the number of values
Output <- sum(x) / length(x)
return(Output) # Return the output of the function
}
Once you’ve defined a function and given it a name, you can then use the function on new data:
My.Mean(0:10) # Should be 5
## [1] 5
My.Mean(rep(0, 100)) # Should be 0
## [1] 0
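As a quick sanity check, you can compare the custom function against R's built-in mean(). This is a minimal sketch: the test vector and the seed are assumptions made up for illustration, and My.Mean is redefined here only so the block runs on its own.

```r
# Re-definition of the one-input My.Mean so this block is self-contained
My.Mean <- function(x) {
  Output <- sum(x) / length(x)
  return(Output)
}

set.seed(1)             # Assumed seed, only so the example is reproducible
Test.Data <- runif(50)  # Hypothetical test vector

# all.equal() compares numbers while allowing for tiny rounding differences
all.equal(My.Mean(Test.Data), mean(Test.Data))
```

If the two functions agree, all.equal() returns TRUE.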
The function My.Mean only had one input (the vector x). However, we can define multiple inputs. For example, let’s adjust the function so that it has an input remove.outliers which, when TRUE, automatically removes outliers before calculating the mean:
My.Mean <- function(
x, # x is a data vector
remove.outliers = F # this says whether or not to remove outliers
) {
# Now we define separate actions depending on the user's input
# value for remove.outliers
## First, let's define the action when we do NOT remove outliers
if(remove.outliers == F) {
Output <- sum(x) / length(x)
} # End of action
## Next, let's define the action when we DO remove outliers
if(remove.outliers == T) {
# Determine the mean and standard deviation of the original vector
Orig.Mean <- sum(x) / length(x)
Orig.SD <- sd(x)
# Create a logical vector indicating which values are outliers (more than
# 2 standard deviations above or below the mean)
Outlier.log <- (x > Orig.Mean + 2 * Orig.SD) | (x < Orig.Mean - 2 * Orig.SD)
# Create a new vector called x.noout with no outliers
x.noout <- x[Outlier.log == FALSE]
# Calculate the mean of the new vector
Output <- sum(x.noout) / length(x.noout)
} # End of action
# Finally, we can return the output of the function
return(Output)
}
A few notes on the function:
In the function input definition, I added “= F” after “remove.outliers” to serve as a default value. If you include a default value for a function input, then R will use that default whenever the user does not specify the input value. You can even include a default data vector for x by including “x = rnorm(10)” for example.
When you have a logical input (such as remove.outliers, which is either TRUE or FALSE), it is common to use several if statements that define the action for when the input is TRUE and when it is FALSE.
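The two notes above can be sketched with a small function. Describe.Center and its input use.median are hypothetical names invented for this illustration, not part of the example above. Note that this sketch uses a single if/else rather than two separate if statements; for a logical input that is always TRUE or FALSE, the two approaches should behave the same way.

```r
# Hypothetical function, for illustration only
Describe.Center <- function(x, use.median = FALSE) {  # FALSE is the default
  if (use.median) {
    Output <- median(x)  # Action when use.median is TRUE
  } else {
    Output <- mean(x)    # Action when use.median is FALSE
  }
  return(Output)
}

Values <- c(1, 2, 3, 4, 100)  # Made-up data with one large value

Describe.Center(Values)                     # Default is used, returns the mean: 22
Describe.Center(Values, use.median = TRUE)  # Returns the median: 3
```

Because use.median has a default, the first call works without mentioning it at all.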
Here is an example of the function in action:
# Create a vector Data with 100 samples from a standard normal dist with one outlier at 500
Data <- c(
rnorm(n = 100, mean = 0, sd = 1),
500 # Here's the outlier!
)
# Since we are not removing outliers, the mean should be influenced by the outlier
My.Mean(Data, remove.outliers = F)
## [1] 4.837
# Now we are removing the outliers, so the mean should be close to 0
My.Mean(Data, remove.outliers = T)
## [1] -0.1142
Once you’ve defined a custom function in R, you can then use that function in other functions! For example, if you want to apply your new function to many groups, you can use it in the tapply() command:
with(ChickWeight, tapply(X = weight, INDEX = Chick, FUN = My.Mean,
remove.outliers = T # You can add additional inputs to the FUN here
))
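The same mechanism works with any function, not just custom ones: arguments placed after FUN are handed on to FUN for each group. As a minimal sketch, the block below passes trim = 0.1 to the built-in mean() and groups by Diet instead of Chick (both choices are assumptions made only for this illustration).

```r
# Extra arguments after FUN are passed on to FUN for every group.
# Here the built-in mean() receives trim = 0.1 (a 10% trimmed mean).
Trimmed.By.Diet <- with(ChickWeight,
                        tapply(X = weight, INDEX = Diet, FUN = mean, trim = 0.1))
Trimmed.By.Diet

# The same calculation written with an explicit anonymous function
Same.Result <- with(ChickWeight,
                    tapply(X = weight, INDEX = Diet,
                           FUN = function(w) mean(w, trim = 0.1)))

all.equal(Trimmed.By.Diet, Same.Result)  # The two versions should agree
```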
If you find that you’re frequently adding certain things to plots, like reference lines for the median or mean, then creating a custom plotting function can really make things easier for you!
Let’s create a custom histogram function My.Hist with several custom inputs:
My.Hist <- function(x, add.median = T, add.mean = T, add.legend = T) {
hist(x)
if(add.median == T) {abline(v = median(x), lwd = 3)}
if(add.mean == T) {abline(v = mean(x), lty = 2, lwd = 3)}
if(add.legend == T) {
legend("topright", c("median", "mean"), lwd = c(3, 3), lty = c(1, 2))
}
}
Ok, let’s try it out!
My.Hist(rexp(100), add.median = T, add.mean = T, add.legend = T)
Of course, you can get much fancier by adding as many custom procedures or plotting parameters as you’d like.
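One way to accept extra plotting parameters without listing each one is R's special "..." input, which collects any additional arguments and lets you pass them along to another function. The function below, My.Hist.Dots, is a hypothetical variant written only for this sketch; it is not the My.Hist defined above.

```r
# Hypothetical variant of a histogram function, for illustration only
My.Hist.Dots <- function(x, add.median = TRUE, ...) {
  hist(x, ...)  # Any extra arguments go straight to hist()
  if (add.median) {
    abline(v = median(x), lwd = 3)  # Add a reference line at the median
  }
}

# main, xlab, and col are not inputs of My.Hist.Dots; they are passed to hist()
My.Hist.Dots(rexp(100), main = "Exponential sample", xlab = "Value", col = "gray")
```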
Let’s create a custom scatterplot with some additional features:
My.Scatter <- function(x = rnorm(1000), y = rnorm(1000), add.mean = T,
add.regression = T, add.conclusion = T) {
plot(x, y, pch = 21, bg = gray(.5, alpha = .5))
if(add.mean == T) { # Add mean reference lines
abline(h = mean(y), col = gray(.5))
abline(v = mean(x), col = gray(.5))
} # Close add.mean
if(add.conclusion == T) { # Add the conclusion sentence to the top of the plot
C.Test <- cor.test(x, y)
Coefficient <- C.Test$estimate
p.value <- C.Test$p.value
if(p.value <= .05) {
Conclusion <- paste("Note: The correlation coefficient is ", round(Coefficient, 2),
".\nIt IS significantly different from 0 (p = ", round(p.value, 3),
")", sep = "")
}
if(p.value > .05) {
Conclusion <- paste("Note: The correlation coefficient is ", round(Coefficient, 2),
".\nIt is NOT significantly different from 0 (p = ", round(p.value, 3),
")", sep = "")
}
mtext(text = Conclusion, side = 3, line = 1, cex = .8)
} # Close add.conclusion
if(add.regression == T) { # Add a regression line and change its color based
# on its p-value.
Model <- lm(y ~ x)
p.value <- anova(Model)$"Pr(>F)"[1]
if(p.value <= 0.05) {Reg.Line.Col <- "red"}
if(p.value > 0.05) {Reg.Line.Col <- "black"}
abline(Model, lty = 2, lwd = 4, col = Reg.Line.Col) # Reuse the fitted model
} # Close add.regression
}
Phew, that was a lot of code, but trust me it will be worth it when we use the function multiple times! Let’s see our custom scatterplot in action! I’ll create two plots, one where there is no significant correlation (or there shouldn’t be…), and one where there should be:
# Generate data
x <- rnorm(100)
y.uncorr <- rnorm(100) # y that should be uncorrelated (though we could get a Type I error)
y.corr <- x + rnorm(100, 0, .1) # y that should be correlated
par(mfrow = c(1, 2)) # Create a plotting grid
# Run custom scatterplots (with default input values)
My.Scatter(x, y.uncorr)
My.Scatter(x, y.corr)
Cool huh? Now anytime you want to create a scatterplot and automatically include reference lines and regression lines (or anything else you want to add to the function), you can simply run your custom function instead of adding all the additional code each time.
Each time you open a new R session, R won’t remember your custom functions, so you have to redefine them. Thankfully, there is an easy way to automate this process so you don’t have to manually re-create your custom functions for each analysis. To do this, create an R script file containing your custom function definitions (I call mine “Nathaniels Custom R Functions.R”). Then, when you start a new R script where you want to reload your custom functions, include a source command as follows:
setwd(dir = "/Users/Nathaniel/Dropbox") # Set your working directory
source(file = "Nathaniels Custom R Functions.R") # Load your custom functions!
When you run the source command, R will automatically execute all the code in your custom function file, making those functions available in your current session!
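If you want to try this out without creating a real file first, here is a minimal sketch that writes a function definition to a temporary file and then sources it. The function Double.It and the temporary file are hypothetical, made up only for this illustration.

```r
# Write a small function definition to a temporary .R file
# (the file and the function name are hypothetical)
Function.File <- tempfile(fileext = ".R")
writeLines("Double.It <- function(x) { return(x * 2) }", con = Function.File)

# Only source the file if it actually exists
if (file.exists(Function.File)) {
  source(file = Function.File)
}

exists("Double.It")  # TRUE once the file has been sourced
Double.It(21)        # Returns 42
```

Checking file.exists() first is optional, but it avoids an error if the path is misspelled or the working directory is not what you expected.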