In part 1 of this tutorial we tidied up some data from the IDEMdata package, and in part 2 we used that data to create precipitation graphs. In this part we will use functions and loops to automate the creation of these graphs.

Functions

Functions in R are created using the following general format:

myFunction <- function(){
  
}

So functions are created just like other variables in R, with the assignment operator <-. Arguments for the function are placed inside the brackets () and the statements for the function are placed inside the braces {}. The last line of the statements will be returned when the function is used.

For example, here is a function that calculates the mean of a vector.

myMean <- function(vector){
  sum.of.vector <- sum(vector)
  number.of.values <- length(vector)
  average <- sum.of.vector/number.of.values
  average
}

Now we can use this like any other function in R.

myMean(vector = c(5, 7, 3.14, 12, 9))
## [1] 7.228

Plot function

We’re going to write our own function that will take watershed data and create a precipitation graph. We’ll call our function prPlot() and it’ll have the following general outline:

prPlot <- function(date, precip, sample, trellis = TRUE, sample.label){
  
  if(trellis == FALSE){
    
    # lattice plot statements
    
  } else if(trellis == TRUE){
    
    # lattice plot statements with conditional
    
  } 
  
  # double y scale plot statements
  
}

The requirements for the arguments are that date has the "Date" class and that precip and sample have the "numeric" class; that the three vectors are of the same length (and are ordered appropriately); that trellis is either TRUE or FALSE; and that sample.label is a string.

Now we’ll place code for the plot in braces.

prPlot <- function(date, precip, sample, trellis = TRUE, sample.label){
  
  if(trellis == FALSE){
    
    rain <- xyplot(precip ~ date, type = "h", ylab = "Precipitation (in)",
                   xlab = "Date")

    sample <- xyplot(sample  ~ date, type = "p", ylab = sample.label, 
                     ylim = c(0, 2*max(sample, na.rm = TRUE)))
    
  } else if(trellis == TRUE){
    
    # create a time variable with the "POSIXlt"" class
    time <- as.POSIXlt(date)
    
    # create a character vector with the format "Month, Year" 
    # with a three-letter abbreviation for the month
    month.year <- format(time, "%b, %Y")
    
    # make the character vector a factor with levels in the appropriate order
    month.year <- factor(month.year, levels = unique(month.year))
    
    # determine the number of rows for the trellis grid
    rows <- length(levels(month.year))/3
    
    # create a vector with days of the month
    mday <- time$mday
    
    rain <- xyplot(precip ~ mday | month.year, type = "h",
                   ylab = "Precipitation (in)", layout = c(3, rows), as.table = TRUE, 
                   xlab = "")
    
    sample <- xyplot(sample  ~ mday | month.year, type = "p", ylab = sample.label,
                  ylim = c(0, 2*max(sample, na.rm = TRUE)))
    
  } 
  
  doubley <- doubleYScale(rain, sample, add.ylab2 = TRUE, under = TRUE, 
                          text = c( "Precipitation (in)", sample.label),
                          column = 2, type = c("l", "p"))
  
  update(doubley,
         par.settings = simpleTheme(col = c("black", "red"), pch = c(NA, 16)))
  
}

But before we can use this function, we need data in the proper format. So we’ll write a function that subsets data that is in the same format as the wide.precip.df data frame and calls the prPlot function.

prSubsetPlot <- function(data, site, site.col, date.col, precip.col, 
                         sample.col, sample.name, conditioned = TRUE,
                         save = TRUE, file.prefix = NULL, 
                         file.extension = "png"){
  
  # subset down to rows with the user selected site
  site.df <- data[data[, site.col] == site, c(date.col, sample.col)]
  
  # create a data frame with precipitation values for every day, no duplicates
  precip.df <- unique(data[, c(date.col, precip.col)])
  
  # create a data frame with precipitaiton and sample columns, and a date
  # column with unique dates
  site.df <- merge(precip.df, site.df, all = TRUE)
  
  plot <- prPlot(date = site.df[, date.col], precip = site.df[, precip.col],
                     sample = site.df[, sample.col], trellis = conditioned,
                     sample.label = sample.name)
  
  if(save){
    
    # create a file name using site and sample names
    file.name <- paste(site, sample.col, "precip_graph", sep = "_")
    
    # add the extension to the end of the file name
    file.name <- paste(file.name, file.extension, sep = ".")
    
    # add the prefix to the beginning
    file.name <- paste0(file.prefix, file.name)
    
    # open the appropriate graphing device with location for the file
    trellis.device(device = file.extension, filename = file.name)
     
    print(plot)
    
    # turn off the graphing device
    dev.off()
    
  } else {
    
    plot
    
    }
  
}

The prSubsetPlot function takes a data frame (the argument data) that is in the wide format, subsets it down to a particular sampling site, creates a precipitation graph of the specified sampled data, and saves it. Let’s see if it works. The following code should save a .png file in your working directory.

# get the latest version of the 'IDEMdata' package
library(devtools)
install_github("InDEM/IDEMdata")
library(IDEMdata)
data(wide_precip)

prSubsetPlot(data = wide.precip.df, site = "LMG-05-0009", site.col = "station", 
                 date.col = "date", precip.col = "precipitation",  sample.col = "sat", 
                 sample.name = "% Saturation", conditioned = "TRUE", save = TRUE, 
                 file.extension = "png")

Loops

Like most programming languages, R has while loops and for loops. Of the two, for loops are more commonly used, so we’ll cover how they are written. But in R, it’s even more common to use a special family of loop operations called apply() functions. We’ll end up using one of those to automate our precipitation graphs, but first we’ll go over the for loop.

For loop

In most programming languages, the for loop would typically be set up in this way:

for(i in sequence){
  
  # operation indexed by i
  
}

The index variable, traditionally named i, is given an initial value of 1 (or 0, or whatever initial value makes sense for the situation). The sequence variable would be a sequence of values that you want the index variable to take while the statements in the brackets are iterated. In most cases it is a sequence of integers. And at the end of a single iteration of the loop, the index variable is advanced to the next value in the sequence (in most cases, it is just the next integer, i + 1).

Lets look at a real example using the myMean() function we created above.

list.of.vectors <- list(1:20, rnorm(10), rnorm(30))

vector.means <- c()

for(i in 1:3){
  vector.means[i] <- myMean(list.of.vectors[[i]])
}

vector.means
## [1] 10.50000 -0.01681  0.18692

So we create the empty vector vector.means to collect the mean values in, and the index variable i takes an initial value of 1. Then the for loop runs over the sequence 1:3. This means that in the first iteration the average of the first vector in the list list.of.vectors is placed as the first value in vector.means, i.e. we have the expression vector.means[1] <- myMean(list.of.vectors[[1]]). After the first loop is completed the index variable advances one step, so i takes on a value of 2 for the next iteration, and we have vector.means[2] <- myMean(list.of.vectors[[2]]). This happens a third time, when the index value is 3, and the for loop stops after this iteration because i has taken the last value in the sequence.

apply functions

Experienced R programmers typically do not use for loops, especially in situations like the example we just looked at. This is an ideal situation for the lapply() function. lappy() is one member of a family of functions that are referred to as apply functions. Type ?apply and ?lapply to see descriptions of the many related functions, and visit this page for some examples of how these functions can be used.

lapply() takes just two arguments (at a minimum), lapply(X, FUN, ...). X is a list of objects that you want to perform an operation on, and FUN is the function that you want to apply to each object. In our situation, we want to apply myMean() to list.of.vectors.

lapply(list.of.vectors, myMean)
## [[1]]
## [1] 10.5
## 
## [[2]]
## [1] -0.01681
## 
## [[3]]
## [1] 0.1869

lapply() always returns a list of objects. If we wanted the output to be a vector, we could use the sapply() function which simplifies what is returned.

sapply(list.of.vectors, myMean)
## [1] 10.50000 -0.01681  0.18692

Automating the graphs

Now, if we want to make a lot of precipitation graphs, we need to set up our code so that we can easily loop through all of the arguments necessary for each plot. For example, if we wanted to plot 3 different sampling substances at 2 sites, we could represent these scenarios as rows in a data frame.

substance <- c("sat", "e_coli", "ph")
station <- c("LMG-05-0009", "LMG-05-0004")
scenarios <- expand.grid(substance, station, stringsAsFactors = FALSE)
colnames(scenarios) <- c("substance", "station")
scenarios
##   substance     station
## 1       sat LMG-05-0009
## 2    e_coli LMG-05-0009
## 3        ph LMG-05-0009
## 4       sat LMG-05-0004
## 5    e_coli LMG-05-0004
## 6        ph LMG-05-0004

One way to run the prSubsetPlot() function on all of these scenarios is to use a for loop that loops through the index of rows on the data frame. The only thing we’re missing is the label for the substance on the graph, so first we’ll create a vector of substance names.

substance.names <- scenarios[, "substance"]
substance.names[substance.names == "sat"] <- "% Saturation"
substance.names[substance.names == "e_coli"] <- "E. coli (MPN/100mL)"
substance.names[substance.names == "ph"] <- "pH (SU)"

for(i in 1:dim(scenarios)[1]){
  prSubsetPlot(data = wide.precip.df, site = scenarios[i, "station"], 
                   site.col = "station", date.col = "date", 
                   precip.col = "precipitation",  sample.col = scenarios[i, "substance"], 
                   sample.name = substance.names[i], conditioned = "TRUE", save = TRUE, 
                   file.extension = "png")
}

We could also use an apply function called mapply().

mapply(FUN = prSubsetPlot, site = scenarios[, "station"],  
       sample.col = scenarios[, "substance"], sample.name = substance.names,
       MoreArgs = list(data = wide.precip.df, site.col = "station", date.col = "date", 
                       precip.col = "precipitation", conditioned = "TRUE", save = TRUE, 
                       file.extension = "png"))

Finally, to put all of these functions together and make the automation of precipitation graphs as simple as possible, we make one function that will loop through all of the plots and save them to the working directory.

precipPlot <- function(data, date.column = "date", sites, site.column = "station",
                       substances, substance.names, precip.column = "precipitation",
                       trellis = "TRUE", save = TRUE, file.extension = "png", ...){
  
  substance.names.df <- data.frame(substance = substances, 
                                   substance.name = substance.names,
                                   stringsAsFactors = FALSE)
  
  scenarios <- expand.grid(substances, sites, stringsAsFactors = FALSE)
  colnames(scenarios) <- c("substance", "site")
  scenarios <- merge(scenarios, substance.names.df)
  
  mapply(FUN = prSubsetPlot, site = scenarios[, "site"], 
         sample.col = scenarios[, "substance"], 
         sample.name = scenarios[, "substance.name"],
         MoreArgs = list(data = data, site.col = site.column, date.col = date.column,
                         precip.col = precip.column, conditioned = trellis,
                         save = save, file.extension = file.extension))

}

This function has been added to the IDEMdata package. To see the help file, type ?precipPlot. Below is how you would use this function to save a plot for every substance at every site in the wide.precip.df data frame.

stations <- unique(wide.precip.df$station[!is.na(wide.precip.df$station)])
samples <- colnames(wide.precip.df)[8:34]

# to get the names for the substances, load the deep_river_chemistry
# data frame and reshape it to get the names in the proper order
data(deep_river_chemistry)
library(reshape2)
wide.df <- dcast(deep_river_chemistry, STATION_NAME + ACTIVITY_NO + ACTIVITY_END_DATE 
                      + WATERBODY_NAME + UTM_EAST + UTM_NORTH + COUNTY_NAME 
                      ~ SUBSTANCE_NAME, value.var = 'LAB_RESULT')
names <- colnames(wide.df)[8:34]

precipPlot(data = wide.precip.df, sites = stations, substances = samples,
           substance.names = names)

R Markdown

This tutorial was created using RStudio’s R Markdown. The code can be found on GitHub.