Motivation

Whether you are performing EDA, developing models or writing reproducible reports, a number of operations in R are common across all projects. It is a useful idea to start extracting these common code fragments, generalizing them and keeping them handy so that they can be easily brought into your next project.

This vignette looks at some hopefully helpful code fragments to start off your utility collection. In general it’s a good idea to create a function for each code fragment; this will be helpful should you want to create a package from your code. We won’t cover package creation here. Instead we assume that you will be cutting and pasting from a file or using the R function source("filename") to include the utility code in the current file. If you use the source method, remember to include the file containing the utility code in the project should you distribute it.
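As a minimal sketch of the source() approach, assume a hypothetical utility file named utils.R that holds a single hypothetical helper, percentMissing(). Neither name comes from this vignette; a temporary file is used so the example runs anywhere.

```r
# Hypothetical utility file. In a real project this would sit alongside your
# other files; a temporary location is used here so the sketch is self-contained.
utils_file <- file.path(tempdir(), "utils.R")

writeLines(c(
  "# percentMissing - percentage of NA values in a vector (hypothetical helper)",
  "percentMissing <- function(x) {",
  "  100 * mean(is.na(x))",
  "}"
), utils_file)

# source() evaluates the file; by default its functions are defined
# in the global environment
source(utils_file)

# the helper is now available as if it had been typed into this file
pct <- percentMissing(c(1, NA, 3, NA))
print(pct)
```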

A note on Markdown R chunks

You can of course paste your utility functions or code fragments into a Markdown file. The code is then placed into code chunks that are executed when you elect to run the chunk or Knit the Markdown file. There are various options which control the output of the code within a chunk. Code chunks have the following format:

   ```{r NAME, OPTIONS}
        #code goes here
   ```

NAME is a title for the code chunk, and it is a very good idea to choose something descriptive. The NAME associated with an R chunk has to be unique across the code chunks in the Markdown file. Naming chunks helps when navigating the document: the names are listed in a drop-down at the bottom of the RStudio editor pane, and selecting a name from the drop-down moves the editor cursor to that code chunk.

OPTIONS is made up of a series of comma-delimited options, each assigned a value of TRUE or FALSE to turn it on or off. The options applicable to code development are:
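A minimal sketch, assuming the knitr options echo, eval, include, message and warning are among the ones of interest (this selection is an assumption, not a definitive list). Rather than repeating them in every chunk header, they can be set once near the top of the file with knitr's opts_chunk$set():

```r
# Assumes the knitr package is installed, as it is whenever R Markdown is in use.
library(knitr)

# Set defaults for every chunk that follows; an individual chunk can still
# override any of them in its own header.
opts_chunk$set(
  echo    = TRUE,   # show the source code in the output
  eval    = TRUE,   # run the code
  include = TRUE,   # keep the chunk's code and results in the output
  message = FALSE,  # hide messages such as package start-up notes
  warning = FALSE   # hide warnings
)

# Inspect one of the current defaults
opts_chunk$get("message")
```

A single chunk can then deviate from these defaults, for example with a header such as {r load-data, echo=FALSE}, where load-data is a hypothetical chunk name.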

Package Loading Utility Function

Most R files have to include libraries, and they do this using the library() method. However, to use a library the containing package must be installed, and this requirement is usually just mentioned as a comment in the R file. There is no need for this: code can be written to check whether the package is installed and to install it if it isn’t present.

This simple utility function will install a package if it is missing before loading the library.

# setupPackage - loads a library, first ensuring that the associated package is installed.
# packageName  - a string specifying the library to be loaded.
# loud         - a boolean which defaults to FALSE; if TRUE, extra messages may be displayed (assuming chunk options allow).
# Note: for testing, use the detach() method to remove a package, e.g. detach("package:tidyverse", unload = TRUE)
setupPackage <- function(packageName, loud = FALSE) {

  # the character.only argument tells the method to expect packageName to be a character string
  if (!require(packageName, character.only = TRUE)) {
    install.packages(packageName, dependencies = TRUE, verbose = loud)
  }

  library(packageName, character.only = TRUE, verbose = loud)
}

Example usage:

setupPackage("tidyverse")
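Projects usually need more than one package. As a sketch only, a hypothetical plural variant, setupPackages(), could loop over a character vector of names. It uses requireNamespace() for the installation check, which is a swapped-in alternative to the require() call used above.

```r
# setupPackages - hypothetical plural variant of setupPackage.
# packageNames  - character vector of packages to install (if missing) and attach.
setupPackages <- function(packageNames) {
  for (packageName in packageNames) {
    # requireNamespace() returns FALSE if the package cannot be loaded,
    # and it does not attach the package to the search path
    if (!requireNamespace(packageName, quietly = TRUE)) {
      install.packages(packageName, dependencies = TRUE)
    }
    library(packageName, character.only = TRUE)
  }

  # return the names invisibly so the call prints nothing
  invisible(packageNames)
}

# Example using packages that ship with R, so nothing is downloaded
loaded <- setupPackages(c("stats", "tools"))
```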

Automatically Plot Attributes

During EDA it is desirable to generate simple plots for each attribute. It is tedious to produce each individual plot. The following function (which is a work in progress) will produce bar charts for all factor and logical variables within the supplied dataframe.

The code selects the column names for all attributes of type factor (logical attributes are first converted to factor) and then produces a graph for each attribute. The slightly tricky bit is that some of the methods require the attribute as a symbol and not as a string such as "attribute". To remedy this, !!ensym is used to convert the string name into a symbol. The function also makes use of the fct_infreq() method from the forcats package, which reorders the levels within a factor by frequency.
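The string-to-symbol step can be tried in isolation. The sketch below uses rlang's sym(), which is designed for converting a string into a symbol, whereas ensym() is mainly intended for capturing function arguments. It is a swapped-in alternative for illustration, and the built-in iris data set is assumed as example data.

```r
library(ggplot2)  # assumed installed
library(rlang)    # assumed installed; provides sym()

col_name <- "Species"  # a column name held as a string

# sym() turns the string into a symbol
col_symbol <- sym(col_name)

# !! unquotes the symbol inside aes(), so ggplot2 sees the column Species
p <- ggplot(iris, aes(x = !!col_symbol)) +
  geom_bar()

print(p)
```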

Versions of this function could be created to produce different types of plots or to work with different attribute types, such as numeric.

# boxPlotsForAllFactors - produces a bar chart (despite the name) for each factor
#                         and logical attribute within a dataframe.
# df                    - the dataframe
# Requires the tidyverse (ggplot2, dplyr and forcats)
boxPlotsForAllFactors <- function(df) {

  # recode logicals as factors; the result is a copy, so df itself is unchanged
  factor_data <- dplyr::mutate_if(df, is.logical, as.factor)

  # get all factor columns
  factor_data <- dplyr::select_if(factor_data, is.factor)

  # get the column names (will be of type string)
  col_names <- colnames(factor_data)

  # loop through the column names and produce a plot
  for (col_name in col_names) {
    # get the number of levels within the factor
    nlev <- nlevels(factor_data[[col_name]])

    # don't use colour if there are more than 25 levels; the legends mess up the plots
    if (nlev <= 25) {
      print(
        ggplot(data = factor_data, aes(fill = !!ensym(col_name))) +
          geom_bar(aes(x = forcats::fct_infreq(!!ensym(col_name)))) +
          theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
          xlab(col_name)
      )
    } else {
      print(
        ggplot(data = factor_data) +
          geom_bar(aes(x = forcats::fct_infreq(!!ensym(col_name)))) +
          theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
          xlab(col_name)
      )
    }
  }

}
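As suggested earlier, a version for numeric attributes follows the same pattern. The sketch below is one possible shape, not a finished utility: the name histogramsForAllNumerics, the bins argument and its default of 30 are all assumptions. It refers to each column with ggplot2's .data[[col_name]] pronoun, which accepts the string name directly, and it returns the plots invisibly so that they can be reused.

```r
library(ggplot2)  # assumed installed

# histogramsForAllNumerics - hypothetical numeric counterpart of the function above.
# df   - the dataframe
# bins - number of histogram bins (an assumed default of 30)
histogramsForAllNumerics <- function(df, bins = 30) {
  # keep the names of the numeric columns only
  col_names <- names(df)[vapply(df, is.numeric, logical(1))]

  plots <- list()
  for (col_name in col_names) {
    # .data[[col_name]] looks the column up by its string name
    p <- ggplot(df, aes(x = .data[[col_name]])) +
      geom_histogram(bins = bins) +
      xlab(col_name)

    print(p)
    plots[[col_name]] <- p
  }

  # return the plots invisibly so they can be saved or modified later
  invisible(plots)
}

# Example with the built-in iris data set, which has four numeric columns
plots <- histogramsForAllNumerics(iris)
names(plots)
```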

Where to Next

Creating your own collection of utility functions and code fragments will initially be time consuming, but it will be a great time saver down the track. Start your collection today.