In part 1 of this tutorial we tidied up some data from the IDEMdata
package, and in part 2 we used that data to create precipitation graphs. In this part we will use functions and loops to automate the creation of these graphs.
Functions in R are created using the following general format:
myFunction <- function(){
}
So functions are created just like other variables in R, with the assignment operator <-
. Arguments for the function are placed inside the brackets ()
and the statements for the function are placed inside the braces {}
. The last line of the statements will be returned when the function is used.
For example, here is a function that calculates the mean of a vector.
myMean <- function(vector){
sum.of.vector <- sum(vector)
number.of.values <- length(vector)
average <- sum.of.vector/number.of.values
average
}
Now we can use this like any other function in R.
myMean(vector = c(5, 7, 3.14, 12, 9))
## [1] 7.228
We’re going to write our own function that will take watershed data and create a precipitation graph. We’ll call our function prPlot()
and it’ll have the following general outline:
prPlot <- function(date, precip, sample, trellis = TRUE, sample.label){
if(trellis == FALSE){
# lattice plot statements
} else if(trellis == TRUE){
# lattice plot statements with conditional
}
# double y scale plot statements
}
The requirements for the arguments are that date
has the "Date"
class and that precip
and sample
have the "numeric"
class; that the three vectors are of the same length (and are ordered appropriately); that trellis
is either TRUE
or FALSE
; and that sample.label
is a string.
Now we’ll place code for the plot in braces.
prPlot <- function(date, precip, sample, trellis = TRUE, sample.label){
if(trellis == FALSE){
rain <- xyplot(precip ~ date, type = "h", ylab = "Precipitation (in)",
xlab = "Date")
sample <- xyplot(sample ~ date, type = "p", ylab = sample.label,
ylim = c(0, 2*max(sample, na.rm = TRUE)))
} else if(trellis == TRUE){
# create a time variable with the "POSIXlt"" class
time <- as.POSIXlt(date)
# create a character vector with the format "Month, Year"
# with a three-letter abbreviation for the month
month.year <- format(time, "%b, %Y")
# make the character vector a factor with levels in the appropriate order
month.year <- factor(month.year, levels = unique(month.year))
# determine the number of rows for the trellis grid
rows <- length(levels(month.year))/3
# create a vector with days of the month
mday <- time$mday
rain <- xyplot(precip ~ mday | month.year, type = "h",
ylab = "Precipitation (in)", layout = c(3, rows), as.table = TRUE,
xlab = "")
sample <- xyplot(sample ~ mday | month.year, type = "p", ylab = sample.label,
ylim = c(0, 2*max(sample, na.rm = TRUE)))
}
doubley <- doubleYScale(rain, sample, add.ylab2 = TRUE, under = TRUE,
text = c( "Precipitation (in)", sample.label),
column = 2, type = c("l", "p"))
update(doubley,
par.settings = simpleTheme(col = c("black", "red"), pch = c(NA, 16)))
}
But before we can use this function, we need data in the proper format. So we’ll write a function that subsets data that is in the same format as the wide.precip.df
data frame and calls the prPlot
function.
prSubsetPlot <- function(data, site, site.col, date.col, precip.col,
sample.col, sample.name, conditioned = TRUE,
save = TRUE, file.prefix = NULL,
file.extension = "png"){
# subset down to rows with the user selected site
site.df <- data[data[, site.col] == site, c(date.col, sample.col)]
# create a data frame with precipitation values for every day, no duplicates
precip.df <- unique(data[, c(date.col, precip.col)])
# create a data frame with precipitaiton and sample columns, and a date
# column with unique dates
site.df <- merge(precip.df, site.df, all = TRUE)
plot <- prPlot(date = site.df[, date.col], precip = site.df[, precip.col],
sample = site.df[, sample.col], trellis = conditioned,
sample.label = sample.name)
if(save){
# create a file name using site and sample names
file.name <- paste(site, sample.col, "precip_graph", sep = "_")
# add the extension to the end of the file name
file.name <- paste(file.name, file.extension, sep = ".")
# add the prefix to the beginning
file.name <- paste0(file.prefix, file.name)
# open the appropriate graphing device with location for the file
trellis.device(device = file.extension, filename = file.name)
print(plot)
# turn off the graphing device
dev.off()
} else {
plot
}
}
The prSubsetPlot
function takes a data frame (the argument data
) that is in the wide format, subsets it down to a particular sampling site, creates a precipitation graph of the specified sampled data, and saves it. Let’s see if it works. The following code should save a .png
file in your working directory.
# get the latest version of the 'IDEMdata' package
library(devtools)
install_github("InDEM/IDEMdata")
library(IDEMdata)
data(wide_precip)
prSubsetPlot(data = wide.precip.df, site = "LMG-05-0009", site.col = "station",
date.col = "date", precip.col = "precipitation", sample.col = "sat",
sample.name = "% Saturation", conditioned = "TRUE", save = TRUE,
file.extension = "png")
Like most programming languages, R has while loops and for loops. Of the two, for loops are more commonly used, so we’ll cover how they are written. But in R, it’s even more common to use a special family of loop operations called apply()
functions. We’ll end up using one of those to automate our precipitation graphs, but first we’ll go over the for loop.
In most programming languages, the for loop would typically be set up in this way:
for(i in sequence){
# operation indexed by i
}
The index variable, traditionally named i
, is given an initial value of 1
(or 0
, or whatever initial value makes sense for the situation). The sequence
variable would be a sequence of values that you want the index variable to take while the statements in the brackets are iterated. In most cases it is a sequence of integers. And at the end of a single iteration of the loop, the index variable is advanced to the next value in the sequence (in most cases, it is just the next integer, i + 1
).
Lets look at a real example using the myMean()
function we created above.
list.of.vectors <- list(1:20, rnorm(10), rnorm(30))
vector.means <- c()
for(i in 1:3){
vector.means[i] <- myMean(list.of.vectors[[i]])
}
vector.means
## [1] 10.50000 -0.01681 0.18692
So we create the empty vector vector.means
to collect the mean values in, and the index variable i
takes an initial value of 1
. Then the for loop runs over the sequence 1:3
. This means that in the first iteration the average of the first vector in the list list.of.vectors
is placed as the first value in vector.means
, i.e. we have the expression vector.means[1] <- myMean(list.of.vectors[[1]])
. After the first loop is completed the index variable advances one step, so i
takes on a value of 2
for the next iteration, and we have vector.means[2] <- myMean(list.of.vectors[[2]])
. This happens a third time, when the index value is 3
, and the for loop stops after this iteration because i
has taken the last value in the sequence.
Experienced R programmers typically do not use for loops, especially in situations like the example we just looked at. This is an ideal situation for the lapply()
function. lappy()
is one member of a family of functions that are referred to as apply
functions. Type ?apply
and ?lapply
to see descriptions of the many related functions, and visit this page for some examples of how these functions can be used.
lapply()
takes just two arguments (at a minimum), lapply(X, FUN, ...)
. X
is a list of objects that you want to perform an operation on, and FUN
is the function that you want to apply to each object. In our situation, we want to apply myMean()
to list.of.vectors
.
lapply(list.of.vectors, myMean)
## [[1]]
## [1] 10.5
##
## [[2]]
## [1] -0.01681
##
## [[3]]
## [1] 0.1869
lapply()
always returns a list of objects. If we wanted the output to be a vector, we could use the sapply()
function which simplifies what is returned.
sapply(list.of.vectors, myMean)
## [1] 10.50000 -0.01681 0.18692
Now, if we want to make a lot of precipitation graphs, we need to set up our code so that we can easily loop through all of the arguments necessary for each plot. For example, if we wanted to plot 3 different sampling substances at 2 sites, we could represent these scenarios as rows in a data frame.
substance <- c("sat", "e_coli", "ph")
station <- c("LMG-05-0009", "LMG-05-0004")
scenarios <- expand.grid(substance, station, stringsAsFactors = FALSE)
colnames(scenarios) <- c("substance", "station")
scenarios
## substance station
## 1 sat LMG-05-0009
## 2 e_coli LMG-05-0009
## 3 ph LMG-05-0009
## 4 sat LMG-05-0004
## 5 e_coli LMG-05-0004
## 6 ph LMG-05-0004
One way to run the prSubsetPlot()
function on all of these scenarios is to use a for loop that loops through the index of rows on the data frame. The only thing we’re missing is the label for the substance on the graph, so first we’ll create a vector of substance names.
substance.names <- scenarios[, "substance"]
substance.names[substance.names == "sat"] <- "% Saturation"
substance.names[substance.names == "e_coli"] <- "E. coli (MPN/100mL)"
substance.names[substance.names == "ph"] <- "pH (SU)"
for(i in 1:dim(scenarios)[1]){
prSubsetPlot(data = wide.precip.df, site = scenarios[i, "station"],
site.col = "station", date.col = "date",
precip.col = "precipitation", sample.col = scenarios[i, "substance"],
sample.name = substance.names[i], conditioned = "TRUE", save = TRUE,
file.extension = "png")
}
We could also use an apply
function called mapply()
.
mapply(FUN = prSubsetPlot, site = scenarios[, "station"],
sample.col = scenarios[, "substance"], sample.name = substance.names,
MoreArgs = list(data = wide.precip.df, site.col = "station", date.col = "date",
precip.col = "precipitation", conditioned = "TRUE", save = TRUE,
file.extension = "png"))
Finally, to put all of these functions together and make the automation of precipitation graphs as simple as possible, we make one function that will loop through all of the plots and save them to the working directory.
precipPlot <- function(data, date.column = "date", sites, site.column = "station",
substances, substance.names, precip.column = "precipitation",
trellis = "TRUE", save = TRUE, file.extension = "png", ...){
substance.names.df <- data.frame(substance = substances,
substance.name = substance.names,
stringsAsFactors = FALSE)
scenarios <- expand.grid(substances, sites, stringsAsFactors = FALSE)
colnames(scenarios) <- c("substance", "site")
scenarios <- merge(scenarios, substance.names.df)
mapply(FUN = prSubsetPlot, site = scenarios[, "site"],
sample.col = scenarios[, "substance"],
sample.name = scenarios[, "substance.name"],
MoreArgs = list(data = data, site.col = site.column, date.col = date.column,
precip.col = precip.column, conditioned = trellis,
save = save, file.extension = file.extension))
}
This function has been added to the IDEMdata
package. To see the help file, type ?precipPlot
. Below is how you would use this function to save a plot for every substance at every site in the wide.precip.df
data frame.
stations <- unique(wide.precip.df$station[!is.na(wide.precip.df$station)])
samples <- colnames(wide.precip.df)[8:34]
# to get the names for the substances, load the deep_river_chemistry
# data frame and reshape it to get the names in the proper order
data(deep_river_chemistry)
library(reshape2)
wide.df <- dcast(deep_river_chemistry, STATION_NAME + ACTIVITY_NO + ACTIVITY_END_DATE
+ WATERBODY_NAME + UTM_EAST + UTM_NORTH + COUNTY_NAME
~ SUBSTANCE_NAME, value.var = 'LAB_RESULT')
names <- colnames(wide.df)[8:34]
precipPlot(data = wide.precip.df, sites = stations, substances = samples,
substance.names = names)
This tutorial was created using RStudio’s R Markdown. The code can be found on GitHub.