Author: Jeff Walker
Created: 2014-05-14
Updated: 2014-05-20
This document provides some exploratory data analysis of the MyRWA water quality database for the Malden River.
Load the packages used for this document and set knitr options.
library(myrwaR)
library(dplyr)
library(lubridate)
library(ggplot2)
theme_set(theme_bw())
opts_chunk$set(tidy = FALSE)
Load data from the Access database using the myrwaR package.
ch <- db_connect(file.path('..','..','..','data','MysticDB_20140510.accdb'))
wq <- db_results(ch)
close(ch)
The stations located within the Malden River sub-basin were identified in QGIS and saved to a csv file called locations_malden_river.csv
locations <- read.csv(file.path('..','..','gis','csv','locations_malden_river.csv'), as.is=TRUE)
locations <- unique(locations$LocationID)
The wq data frame is filtered to include only these stations within the Malden River subbasin:
wq <- filter(wq, LocationID %in% locations)
wq <- droplevels(wq)
The majority of stations are part of the MyRWA HOTSPOT program. The MyRWA Baseline (BASE) and MWRA CSORWM programs include only one station each (MAR036 and MWRA176, respectively). There are also a few stations from municipalities (MUNI).
ggplot(wq, aes(LocationID,ProjectID)) + geom_tile(height=0.5, width=0.9) + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) + labs(y='')
We can also see which parameters were analyzed in each program. The only consistent parameters across all four include ECOLI, ENT, and TEMP_WATER.
ggplot(wq, aes(CharacteristicID, ProjectID)) + geom_tile(height=0.5, width=0.9) + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) + labs(y='')
We can also see the number of samples for each program. Across all the projects, DO, DO_SAT, ECOLI, SPCOND, and TEMP_WATER are the most common. The MUNI stations primarily include ECOLI data, and also single values of the PPC_* data. TP, TSS, and NO23 are only measured in the BASE project. Turbidity was only measured in the MWRA CSORWM project.
ggplot(wq, aes(CharacteristicID, fill=ProjectID)) + geom_bar(position='stack') + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) + labs(y='No. Samples')
Finally, the temporal distribution of samples for each project can be seen from this figure. For each month/year and project, a black tile indicates at least one sample of any parameter was collected. The Baseline project includes data from 2000-2014. The CSORWM project was from 2002-2012 (although more data may now be available).
ggplot(wq, aes(floor_date(Datetime, unit="month"), ProjectID)) + geom_tile(height=0.5) + labs(x='Month-Year')
The CSORWM data were also only collected from May - January, while the Baseline and Hotspot data were collected year-round.
ggplot(wq, aes(factor(month(Datetime)), ProjectID)) + geom_tile(width=0.9, height=0.5) + labs(x='Month')
Based on these summaries, two separate analyses are warrented. The first focuses on the consistent long-term datasets, which include Baseline sampling at MAR036 and the CSORWM sampling at MWRA176.
Extract the Baseline and CSORWM samples, and only the primary characteristics.
wq.lt <- filter(wq, ProjectID %in% c('BASE', 'CSORWM'), CharacteristicID %in% c("DO", "DO_SAT", "ECOLI", "ENT", "FCOLI", "NO23", "PH", "SPCOND", "TEMP_WATER", "TP", "TSS", "TURB")) %.% droplevels()
Plot the raw data over time. There are some clear outliers:
It also appears that for SPCOND, the BASE samples are an order of magnitude higher than the CSORWM samples. I assume this was an error when inputing the data and that CSORWM are actually in mS/cm.
ggplot(wq.lt, aes(Datetime, ResultValue, color=ProjectID)) + geom_point() + facet_wrap(~CharacteristicID+Units, scales='free_y') + labs(y='Value') + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
## Warning: Removed 1 rows containing missing values (geom_point).
Identify which rows contain outliers:
outlier_index <- with(wq.lt, which((CharacteristicID=="DO_SAT" & ResultValue > 600) | (CharacteristicID=="PH" & ResultValue < 1) | (CharacteristicID=="TP" & ResultValue > 1) | (CharacteristicID=="TSS" & ResultValue > 1000)))
print(outlier_index) # row indices for outliers
## [1] 324 690 691 1597
Remove those rows from wq.lt and save in a separate data frame wq.lt.outlier:
wq.lt.outlier <- wq.lt[outlier_index, ] # data frame saving outliers
wq.lt <- wq.lt[-outlier_index,]
Fix the SPCOND measurements, which appear to be off by a factor 1000:
wq.lt[which(wq.lt$CharacteristicID=="SPCOND" & wq.lt$ProjectID=="CSORWM"), "ResultValue"] <- wq.lt[which(wq.lt$CharacteristicID=="SPCOND" & wq.lt$ProjectID=="CSORWM"), "ResultValue"]*1000
Plot the samples again without the outliers:
ggplot(wq.lt, aes(Datetime, ResultValue, color=ProjectID)) + geom_point() + facet_wrap(~CharacteristicID+Units, scales='free_y') + labs(y='Value') + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
## Warning: Removed 1 rows containing missing values (geom_point).
To see if there is salt water intrusion, plot the specific conductance. The MWRA collects 'bottom' and 'surface' samples specified by SampleDepthType, the MyRWA Baseline samples are collected at the surface, but have SampleDepthType of NA. This figure shows that salt water is indeed present in the bottom samples collected by MWRA.
ggplot(filter(wq.lt, CharacteristicID=="SPCOND"), aes(Datetime, ResultValue, color=SampleDepthType)) + geom_point() + facet_wrap(~ProjectID, ncol=1)
To compare the surface samples, filter out the MWRA bottom samples.
ggplot(filter(wq.lt, CharacteristicID=="SPCOND", (SampleDepthType == 'S' | is.na(SampleDepthType))), aes(Datetime, ResultValue, color=ProjectID)) + geom_point() + labs(y="Specific Conductivity (uS/cm)")
The DO data show very low concentrations in the MWRA bottom samples, and some Baseline samples below 5 mg/L.
ggplot(filter(wq.lt, CharacteristicID %in% c("DO", "DO_SAT")), aes(Datetime, ResultValue, color=SampleDepthType)) + geom_point() + facet_grid(CharacteristicID+Units~ProjectID, scales='free_y') + labs(y="")
## Warning: Removed 1 rows containing missing values (geom_point).
Seasonality is shown by plotting concentrations by Julian Day (1=Jan 1).
ggplot(filter(wq.lt, CharacteristicID %in% c("DO", "DO_SAT")), aes(yday(Datetime), ResultValue, color=SampleDepthType)) + geom_point() + facet_grid(CharacteristicID+Units~ProjectID, scales='free_y') + labs(x="Julian Day", y="")
## Warning: Removed 1 rows containing missing values (geom_point).
pH values seem to be higher at the surface than at the bottom among MWRA data. The Baseline data ends in 2006 and shows an odd downward trend.
ggplot(filter(wq.lt, CharacteristicID == 'PH'), aes(Datetime, ResultValue, color=SampleDepthType)) + geom_point() + facet_wrap(~ProjectID, ncol=1) + labs(y="pH")
Total Phosphorus (TP) was only collected by MyRWA. There appears to be an outlier at the end of 2004, which is below a typical detection limit.
ggplot(filter(wq.lt, CharacteristicID == 'TP'), aes(Datetime, ResultValue)) + geom_point() + facet_wrap(~ProjectID, ncol=1) + labs(y="Total Phosphorus (mg/L)")
Removing the outlier and appending it to the previous wq.lt.outlier dataframe.
outlier_index <- with(wq.lt, which(ProjectID=="BASE" & CharacteristicID=="TP" & ResultValue < 0.01))
print(outlier_index)
## [1] 514
wq.lt.outlier <- rbind(wq.lt.outlier, wq.lt[outlier_index, ]) # data frame saving outliers
wq.lt <- wq.lt[-outlier_index,]
Replotting the data without the outlier.
ggplot(filter(wq.lt, CharacteristicID == 'TP'), aes(Datetime, ResultValue)) + geom_point() + geom_hline(yint=0, alpha=0) + facet_wrap(~ProjectID, ncol=1) + labs(y="Total Phosphorus (mg/L)")
Plotting by Julian Day shows no apparent seasonality except perhaps fewer high values in winter.
ggplot(filter(wq.lt, CharacteristicID == 'TP'), aes(yday(Datetime), ResultValue)) + geom_point() + geom_hline(yint=0, alpha=0) +facet_wrap(~ProjectID, ncol=1) + labs(x="Julian Day", y="Total Phosphorus (mg/L)")
Next step is to compare TP measurements with antecedent precipitation and flow to see if higher values are associated with storm events.
Extract the municipal data to wq.muni.
wq.muni <- filter(wq, ProjectID=="MUNI") %.% droplevels()
Check number of samples for each location and characteristic. All but one location only includes ECOLI data. The station MALMRO includes the other parameters.
ggplot(wq.muni, aes(LocationID, fill=CharacteristicID)) + geom_bar() + labs(y="No. Samples") + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
Plot the ECOLI concentrations at each site over time. Hard to tell which sample goes with which location. A lot of samples collected in Dec 2011 - Jan 2012, why the winter?
ggplot(filter(wq.muni, CharacteristicID=="ECOLI"), aes(Datetime, ResultValue, color=LocationID)) + geom_point() + labs(y="ECOLI") + scale_y_log10()
Easier to distinguish by location.
ggplot(filter(wq.muni, CharacteristicID=="ECOLI"), aes(Datetime, ResultValue)) + geom_point() + labs(y="ECOLI") + scale_y_log10() + facet_wrap(~LocationID, nrow=2) + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
Compare the distributions of ECOLI between the sites.
ggplot(filter(wq.muni, CharacteristicID=="ECOLI"), aes(LocationID, ResultValue)) + geom_boxplot() + labs(y="ECOLI") + scale_y_log10() + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
Extract the hotspot data to wq.hot.
wq.hot <- filter(wq, ProjectID=="HOTSPOT") %.% droplevels()
Check number of samples for each location and characteristic. Seems well distributed. Some locations have many more samples though.
ggplot(wq.hot, aes(LocationID, fill=CharacteristicID)) + geom_bar() + labs(y="No. Samples") + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
Compare characteristics using tiles. Shows that TSS and TP were only sampled at ELP001. The most common characteristics are TEMP_Water, SPCOND, SALINITY, ECOLI, DO_SAT, and DO.
ggplot(wq.hot, aes(LocationID, CharacteristicID)) + geom_tile(width=0.9, height=0.5) + labs(y="") + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
All the hotspot ECOLI samples without specifying the location. Some values vey high (>50,000).
ggplot(filter(wq.hot, CharacteristicID=="ECOLI"), aes(Datetime, ResultValue)) + geom_point() + labs(y="") + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))