In this tutorial we will be using data from the IDEMdata package to manipulate data sets, create custom graphs, and use functions and loops to automate the creation of these graphs.
To access the data, we need to download the package from a GitHub repository. The repository address is https://github.com/InDEM/IDEMdata. To install it from GitHub, we use the devtools package.
library(devtools)
install_github("InDEM/IDEMdata")
Now we load the IDEMdata package and take a look at the package description help file.
library(IDEMdata)
?IDEMdata
In the bottom right panel of RStudio you will see the file that describes what’s in this package. At the very bottom of that file, click on “Index” and you will see all of the data sets that are in the package.
The “deep_river” data frames are data that were collected by sampling the Deep River - Portage Burns Watershed in Lake and Porter Counties. These data are just a subset of sampling that took place in the watershed between April 2013 and March 2014. The purpose of the sampling was to get a baseline assessment of the watershed for TMDL development and watershed planning. More information about the study can be found here.
In part 1 of this tutorial we will tidy up and merge two data frames from the IDEMdata package. In part 2 we will design graphs that visualize the relationship between daily precipitation and the sampled data from the Deep River - Portage Burns Watershed. In part 3 we will write functions and use loops to automate the creation of these graphs.
Let’s take a look at the deep_river_chemistry data frame.
data(deep_river_chemistry)
chem.df <- deep_river_chemistry # rename for convenience
remove(deep_river_chemistry) # a little clean-up
View(chem.df)
The first thing we need to deal with are duplicate rows in the data frame. We can find those duplicates by using the duplicated() funtion, which returns a logical vector.
duplicates <- duplicated(chem.df)
head(duplicates, 10)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The duplicates vector will have a value of TRUE for any row that is a duplicate of a previous row. We’ll use the negation of this logical vector to subset out all of the rows that are not duplicates.
chem.df <- chem.df[!duplicates, ]
Another thing to notice is that there are some negative values in the LAB_RESULT column that don’t make sense. For example:
chem.df[c(36, 71), c("ACTIVITY_END_DATE", "SUBSTANCE_NAME", "LAB_RESULT", "LAB_RESULT_TEXT")]
## ACTIVITY_END_DATE SUBSTANCE_NAME LAB_RESULT LAB_RESULT_TEXT
## 36 4/8/2013 13:00 Nitrogen, Ammonia -1 <.1
## 73 6/10/2013 12:20 Coliforms (Total) -2 >2419.6
The LAB_RESULT column has a numeric class, and we can see that the LAB_RESULT_TEXT column has < and > symbols. These are threshold values, and the negative values in the numeric column are there to mark threshold results. For maxium thresholds (i.e., text values with >) we’re going to just replace the numeric value in the LAB_RESULT column with the max threshold value. For example, if the text value is ">2419.6" we’ll insert the numeric value 2419.6 into the LAB_RESULT column. For text values that are minimum thresholds (<) we’ll replace the numeric value with half of the minimum threshold (e.g., for "<.1" we insert the numeric value 0.05).
First, the maximum thresholds:
# create a subset of rows with >
max.thresh.df <- chem.df[grepl(">", chem.df$LAB_RESULT_TEXT, , fixed = T), ]
# create a character vector of LAB_RESULT_TEXT values with > removed
max.thresh.character <- gsub(">", "", max.thresh.df$LAB_RESULT_TEXT, fixed = T)
# convert the character vector to numeric
max.thresh.numeric <- as.numeric(max.thresh.character)
# replace those negative numeric values with the numeric values vector
chem.df[grepl(">", chem.df$LAB_RESULT_TEXT, fixed = T), "LAB_RESULT"] <- max.thresh.numeric
Now we replace the minimum threshold values with half the threshold:
# create a subset of rows with <
min.thresh.df <- chem.df[grepl("<", chem.df$LAB_RESULT_TEXT, , fixed = T), ]
# create a character vector of LAB_RESULT_TEXT values with > removed
min.thresh.character <- gsub("<", "", min.thresh.df$LAB_RESULT_TEXT, fixed = T)
# convert the character vector to numeric
min.thresh.numeric <- as.numeric(min.thresh.character)
# replace those negative numeric values with the numeric values vector
chem.df[grepl("<", chem.df$LAB_RESULT_TEXT, fixed = T), "LAB_RESULT"] <- min.thresh.numeric/2
The chem.df data frame is in a “long” format. This means that there is one “value” column and one “variable” column (in this data frame, the numeric values are in the LAB_RESULT column and the variables are in the SUBSTANCE_NAME column). We may want this data in a wide format, meaning there are columns for every variable. To do this, we use the reshape2 package, as explained in a previous tutorial.
It’s helpful to identify the minimum columns we need to keep in order to preserve all of the values. In other words, what columns are necessary to identify a unique record? In the dcast() function below, the columns to the left of ~ in the formula are the columns identified as necessary to preserve every value (plus a few that are kept for convenince), the column on the right of ~ is the variable column, and LAB_RESULT is identified as the value.var.
library(reshape2)
wide.chem.df <- dcast(chem.df, STATION_NAME + ACTIVITY_NO + ACTIVITY_END_DATE
+ WATERBODY_NAME + UTM_EAST + UTM_NORTH + COUNTY_NAME
~ SUBSTANCE_NAME, value.var = 'LAB_RESULT')
View(wide.chem.df)
In order to make this data frame useful, we need to rename the columns. We can use the colname() function to rename some of the columns. Here we rename the first 7 columns, as well as one of the turbidity columns:
colnames(wide.chem.df)[c(1:7, 32)] <- c("station", "activity", "date", "waterbody",
"utm_e", "utm_n", "county", "turbidity_hach")
However, it is more efficient to use regular expressions to do this. We won’t go into how regular expressions work (a thorough introduction to string manipulation in R can be found here), but here is some code that will tidy up our column names, courtesy of Eric Bailey:
# Convert column names to lower case
colnames(wide.chem.df) <- tolower(colnames(wide.chem.df))
# Get rid of all the abbreviations (ex. (as caco3))
colnames(wide.chem.df) <- gsub("\\(.*\\)", "", colnames(wide.chem.df))
# Get rid of non-alphanumeric characters
colnames(wide.chem.df) <- gsub("[\\,\\.%\\+]", "", colnames(wide.chem.df))
# Remove leading and trailing whitespace
colnames(wide.chem.df) <- gsub("^\\s|\\s$", "", colnames(wide.chem.df))
# Replace remaining spaces with periods
colnames(wide.chem.df) <- gsub(" ", "\\_", colnames(wide.chem.df))
The other data set in the IDEMdata package that we will be using for this tutorial is the precip_hobart data frame.
data(precip_hobart)
head(precip_hobart)
## Date Precipitation
## 1 2013-04-01 0
## 2 2013-04-02 0
## 3 2013-04-03 0
## 4 2013-04-04 0
## 5 2013-04-05 0
## 6 2013-04-06 0
To make things simpler for creating graphs, we will want to add the precipitation values to the wide.chem.df data frame. We’ll use the merge() function to do this, but first we need to make sure that the date columns for both data frames are in the same format.
First we look at the date column in the wide.chem.df data frame.
class(wide.chem.df$date)
## [1] "character"
head(wide.chem.df$date)
## [1] "6/24/2013 11:55" "7/22/2013 11:35" "4/8/2013 13:00" "5/6/2013 12:00"
## [5] "6/10/2013 12:20" "9/9/2013 11:25"
Right now the date column is a has the "character" class. We need to convert it to a date/time class. The simplest way to convert to a date in R is to use the as.Date() function. (See this site for a good overview of date and time classes in R.)
wide.chem.df$date <- as.Date(wide.chem.df$date,
format = "%m/%d/%Y %H:%M",
tz = "America/Chicago")
class(wide.chem.df$date)
## [1] "Date"
head(wide.chem.df$date)
## [1] "2013-06-24" "2013-07-22" "2013-04-08" "2013-05-06" "2013-06-10"
## [6] "2013-09-09"
Now we need to make the date column in the precip_hobart data frame the same class.
precip_hobart$Date <- as.Date(precip_hobart$Date,
tz = "America/Chicago")
class(precip_hobart$Date)
## [1] "Date"
head(precip_hobart$Date)
## [1] "2013-04-01" "2013-04-02" "2013-04-03" "2013-04-04" "2013-04-05"
## [6] "2013-04-06"
We will also need a Julian day column.
colnames(precip_hobart) <- c("date", "precipitation")
precip_hobart <- data.frame(jday = 1:dim(precip_hobart)[1], precip_hobart)
Now we’re ready to merge the precipitation data with the sampling data. We’ll use the merge() function that was introduced in this tutorial.
wide.precip.df <- merge(wide.chem.df, precip_hobart, all = T)
In part 2 of the tutorial we will use this data to create precipitation graphs.
This tutorial was created using RStudio’s R Markdown. The code can be found on GitHub.