Let’s say you’re programming in R, and you find that you’re repeating the same procedure, analysis, or custom plot several times. For example, say you’re plotting multiple scatterplots, and you want to include a regression line on each plot (perhaps among many other customizations). Is there a way to repeat the same custom procedure many times without having to repeat code? As you guessed by the title of this document, the answer is yes! With custom functions!

One of the coolest things you can do in R is write your own custom functions. Custom functions allow you to define a specific action you are interested in, from calculating a unique statistic to creating a custom plot, which you can then easily apply to new data.

To make a custom function, you need to define one or more inputs, followed by some action and an output. Your function will take the input, perform the action based on that input, and spit out some output in the form of transformed data, summary statistics, or a plot.

Replicating the mean() function

For example, let’s create a custom function that replicates the standard mean() function in R:

My.Mean <- function(x) {  # Name the function and define the input
  
  # Define the action, in this case, the sum divided by the number of values
  
  Output <- sum(x) / length(x)
  
  return(Output) # Return the output of the function
  
  }

Once you’ve defined a function and given it a name, you can then use the function on new data:

My.Mean(0:10)  # Should be 5
## [1] 5
My.Mean(rep(0, 100)) # Should be 0
## [1] 0
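
A quick way to convince yourself that a hand-written function behaves as intended is to compare it with the built-in equivalent on data you have not checked by hand. The sketch below repeats the My.Mean definition so that it runs on its own; Test.Data and the seed are arbitrary choices made for illustration only.

```r
# Same definition as above, repeated so this block is self-contained
My.Mean <- function(x) {
  Output <- sum(x) / length(x)
  return(Output)
}

set.seed(1)              # arbitrary seed, only so the example is reproducible
Test.Data <- runif(50)   # hypothetical test vector of 50 random values

# all.equal() compares two numbers while allowing for tiny rounding differences
all.equal(My.Mean(Test.Data), mean(Test.Data))
## [1] TRUE
```

Using all.equal() rather than == is the safer comparison here, because two correct calculations can differ in the last few decimal places.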

Creating a custom mean function that can remove outliers

The function My.Mean had only one input (the vector x). However, we can define multiple inputs. For example, let’s adjust the function so that it has an input, remove.outliers, which, when TRUE, removes outliers before calculating the mean. Here, an outlier is any value more than 2 standard deviations above or below the mean:

My.Mean <- function(
                    x, # x is a data vector
                    remove.outliers = F # this says whether or not to remove outliers
                    ) {
  
# Now we define separate actions depending on the value the user
# supplies for the remove.outliers input
  
## First, let's define the action when we do NOT remove outliers
  
  if(remove.outliers == F) {
    
    Output <- sum(x) / length(x)
    
    } # End of action
  
## Next, let's define the action when we DO remove outliers
  
    if(remove.outliers == T) {
    
  # Determine the mean and standard deviation of the original vector
      
      Orig.Mean <- sum(x) / length(x)
      Orig.SD <- sd(x)
      
  # Create a logical vector indicating which values are outliers (more than
  # 2 standard deviations above or below the mean)
      
      Outlier.log <- (x > Orig.Mean + 2 * Orig.SD) | (x < Orig.Mean - 2 * Orig.SD)
      
  # Create a new vector called x.noout with no outliers
      x.noout <- x[Outlier.log == FALSE]
      
  # Calculate the mean of the new vector
      Output <- sum(x.noout) / length(x.noout)
    
    } # End of action
 

# Finally, we can return the output of the function
  
  return(Output)
  }

A few notes on the function:

  1. In the function input definition, I added “= F” after “remove.outliers” to serve as a default value. If you include a default value for a function input and the user does not specify that input, R will use the default. You can even include a default data vector for x by writing “x = rnorm(10)”, for example.

  2. When you have a logical input (such as remove.outliers which is either True or False), it is common to use several if statements that define the action for when the input is True and when it is False.
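
To see a default value at work in isolation, here is a minimal sketch built around a hypothetical function, Center.Vector, which is not part of the code above. It also uses if ... else, a common alternative to two separate if statements when an input can only be TRUE or FALSE, and it spells out TRUE and FALSE, since T and F are ordinary variables that can be overwritten.

```r
# Hypothetical function, used only to illustrate default input values
Center.Vector <- function(x, center = TRUE) {

  if (center) {
    Output <- x - mean(x)   # subtract the mean from every value
  } else {
    Output <- x             # leave the data untouched
  }

  return(Output)
}

Center.Vector(c(1, 2, 3))                  # center not given, so the default TRUE is used
# returns -1 0 1

Center.Vector(c(1, 2, 3), center = FALSE)  # default overridden by the user
# returns 1 2 3
```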

Here is an example of the function in action:

# Create a vector Data with 100 samples from a standard normal dist with one outlier at 500
Data <- c(
          rnorm(n = 100, mean = 0, sd = 1),
          500 # Here's the outlier!
          )

# Since we are not removing outliers, the mean should be influenced by the outlier
My.Mean(Data, remove.outliers = F)
## [1] 4.837

# Now we are removing the outliers. The mean should be close to 0
My.Mean(Data, remove.outliers = T)
## [1] -0.1142

Using your custom functions with other R functions

Once you’ve defined a custom function in R, you can then use that function in other functions! For example, if you want to apply your new function to many groups, you can pass it to the tapply() function. Here the groups are the individual chicks in R’s built-in ChickWeight dataset:

with(ChickWeight, tapply(X = weight, INDEX = Chick, FUN = My.Mean, 
            remove.outliers = T # You can add additional inputs to the FUN here
            ))
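
The same pattern works for any custom function whose extra inputs you want to set once for all groups. As a sketch, the hypothetical function Prop.Above below (an illustration that is not part of the code above, with an arbitrary cutoff of 150) computes the proportion of values above a cutoff, and tapply() applies it to each Diet group in the built-in ChickWeight data.

```r
# Hypothetical function: proportion of values in x that exceed a cutoff
Prop.Above <- function(x, cutoff = 100) {
  Output <- sum(x > cutoff) / length(x)
  return(Output)
}

# Any named input placed after FUN is passed on to Prop.Above for every group
Diet.Props <- with(ChickWeight,
                   tapply(X = weight, INDEX = Diet, FUN = Prop.Above, cutoff = 150))

Diet.Props  # one proportion per diet (ChickWeight has four diets)
```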

Plotting Functions

If you find that you’re frequently adding certain things to plots, like reference lines for the median or mean, then creating a custom plotting function can really make things easier for you!

Custom Histogram with added reference lines

Let’s create a custom histogram function My.Hist with several custom inputs:

  1. add.median = Adds a vertical solid line for the median.
  2. add.mean = Adds a vertical dashed line for the mean.
  3. add.legend = Adds a legend to the plot with mean and median info.

My.Hist <- function(x, add.median = T, add.mean = T, add.legend = T) {
  
  hist(x)
  
  if(add.median == T) {abline(v = median(x), lwd = 3)}
  
  if(add.mean == T) {abline(v = mean(x), lty = 2, lwd = 3)}
  
  if(add.legend == T) {
    
    legend("topright", c("median", "mean"), lwd = c(3, 3), lty = c(1, 2))
    
  }
  }

Ok, let’s try it out!

My.Hist(rexp(100), add.median = T, add.mean = T, add.legend = T)

[Figure: histogram produced by My.Hist, with a solid median line, a dashed mean line, and a legend]

Of course, you can get much fancier by adding as many custom procedures or plotting parameters as you’d like.
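
One way to pass extra plotting parameters without listing each one as an input is R’s special ... input, which forwards any additional named inputs to another function. The sketch below uses a hypothetical, cut-down variant called My.Hist.Dots (not the My.Hist defined above) to show the idea; the title, color, and number of breaks are arbitrary.

```r
# Hypothetical variant of a custom histogram that forwards extra inputs to hist()
My.Hist.Dots <- function(x, add.median = TRUE, ...) {

  Hist.Out <- hist(x, ...)  # main, col, breaks, etc. are handed straight to hist()

  if (add.median) {
    abline(v = median(x), lwd = 3)
  }

  invisible(Hist.Out)       # return the histogram object without printing it
}

# main, col and breaks are not inputs of My.Hist.Dots; they travel through ...
My.Hist.Dots(rexp(100), main = "Exponential sample", col = "skyblue", breaks = 20)
```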

Custom Scatterplot with added reference lines and regression line

Let’s create a custom scatterplot function, My.Scatter, with some additional features:

  1. add.mean = Adds a reference line for the mean of each variable.
  2. add.regression = Adds a regression line. If the correlation is significant (p <= .05), I’ll make the line red; if not, it will be black.
  3. add.conclusion = Adds a sentence to the top of the plot stating the correlation coefficient and its p-value.

My.Scatter <- function(x = rnorm(1000), y = rnorm(1000), add.mean = T,
                       add.regression = T, add.conclusion = T) {
  
  plot(x, y, pch = 21, bg = gray(.5, alpha = .5))
  
  if(add.mean == T) { # Add mean reference lines
    
    abline(h = mean(y), col = gray(.5))
    abline(v = mean(x), col = gray(.5))
    
    } # Close add.mean
  

  if(add.conclusion == T) { # Add the conclusion sentence to the top of the plot
    
    C.Test <- cor.test(x, y)
    
    Coefficient <- C.Test$estimate
    p.value <- C.Test$p.value
    
    if(p.value <= .05) {
      
      Conclusion <- paste("Note: The correlation coefficient is ", round(Coefficient, 2), 
                     ".\nIt IS significantly different from 0 (p = ", round(p.value, 3), 
                     ")", sep = "")
    }
    
    if(p.value > .05) {
      
      Conclusion <- paste("Note: The correlation coefficient is ", round(Coefficient, 2), 
                     ".\nIt is NOT significantly different from 0 (p = ", round(p.value, 3), 
                     ")", sep = "")
    }
    
    mtext(text = Conclusion, side = 3, line = 1, cex = .8)
    
  } # Close add.conclusion
  
  if(add.regression == T) { # Add a regression line and change its color based
                            # on its p-value.
    
    Model <- lm(y ~ x)
    p.value <- anova(Model)$"Pr(>F)"[1]
    
    if(p.value <= 0.05) {Reg.Line.Col <- "red"}
    if(p.value > 0.05) {Reg.Line.Col <- "black"}
    
    abline(Model, lty = 2, lwd = 4, col = Reg.Line.Col) # Reuse the model fitted above
    
    } # Close add.regression
  
  }

Phew, that was a lot of code, but trust me, it will be worth it when we use the function multiple times! Let’s see our custom scatterplot in action! I’ll create two plots, one where there is no significant correlation (or there shouldn’t be…), and one where there should be:

# Generate data
x <- rnorm(100)
y.uncorr <- rnorm(100) # y that should be uncorrelated (though we could get a Type I error)
y.corr <- x + rnorm(100, 0, .1) # y that should be correlated

par(mfrow = c(1, 2)) # Create a plotting grid

# Run custom scatterplots (with default input values)
My.Scatter(x, y.uncorr)
My.Scatter(x, y.corr)

[Figure: two side-by-side scatterplots produced by My.Scatter, x against y.uncorr (left) and x against y.corr (right)]

Cool huh? Now anytime you want to create a scatterplot and automatically include reference lines and regression lines (or anything else you want to add to the function), you can simply run your custom function instead of adding all the additional code each time.
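
The payoff grows when the same function is called in a loop. As a rough sketch, the code below defines a deliberately small, hypothetical function called Quick.Scatter (a stand-in so the block runs on its own, not the full function above) and applies it to three predictors of mpg in the built-in mtcars dataset. The choice of dataset and columns is only for illustration.

```r
# Hypothetical, minimal scatterplot function with a dashed regression line
Quick.Scatter <- function(x, y, ...) {
  plot(x, y, pch = 21, bg = gray(.5, alpha = .5), ...)
  abline(lm(y ~ x), lty = 2, lwd = 2)
}

Predictors <- c("wt", "hp", "disp")  # columns of mtcars to plot against mpg

par(mfrow = c(1, 3))                 # one row, three plots

for (Var in Predictors) {
  Quick.Scatter(mtcars[[Var]], mtcars$mpg, xlab = Var, ylab = "mpg")
}

par(mfrow = c(1, 1))                 # reset the plotting grid
```

Every plot gets the same styling and regression line, and a change made inside the function shows up in all of them at once.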

Saving your custom functions for later use

Each time you open a new R session, R won’t remember your custom functions, so you have to redefine them. Thankfully, there is an easy way to automate this process so you don’t have to manually re-create your custom functions for each analysis. To do this, save your custom function definitions in an R script file (I call mine “Nathaniels Custom R Functions.R”). Then, at the top of any new R script where you want to use your custom functions, include a source command as follows:

setwd(dir = "/Users/Nathaniel/Dropbox") # Set your working directory
source(file = "Nathaniels Custom R Functions.R") # Load your custom functions!

When you run the source command, R will execute all the code in your custom function file, which makes the functions available in your new script!
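
If you’d like to try this without touching your own folders, here is a self-contained sketch that writes a tiny function file to R’s temporary directory and then sources it. The file name my_functions.R and the function Double.It are hypothetical and exist only for this example.

```r
# Build a path inside the session's temporary directory (hypothetical file name)
Function.File <- file.path(tempdir(), "my_functions.R")

# Write a one-function script to that file
writeLines(c(
  "Double.It <- function(x) {",
  "  return(x * 2)",
  "}"
), con = Function.File)

# Running source() executes the file, which defines Double.It in this session
source(file = Function.File)

Double.It(21)
## [1] 42
```

Passing the full path to source() also means you don’t need to change your working directory first.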