Exploratory Data Analysis of Malden River WQ Data

Author: Jeff Walker
Created: 2014-05-14
Updated: 2014-05-20

This document provides some exploratory data analysis of the MyRWA water quality database for the Malden River.

Packages

Load the packages used for this document and set knitr options.

library(myrwaR)
library(dplyr)
library(lubridate)
library(ggplot2)
theme_set(theme_bw())
opts_chunk$set(tidy = FALSE)

Load Data

Load data from the Access database using the myrwaR package.

ch <- db_connect(file.path('..','..','..','data','MysticDB_20140510.accdb'))
wq <- db_results(ch) 
close(ch)

Stations

The stations located within the Malden River sub-basin were identified in QGIS and saved to a csv file called locations_malden_river.csv

locations <- read.csv(file.path('..','..','gis','csv','locations_malden_river.csv'), as.is=TRUE)
locations <- unique(locations$LocationID)

The wq data frame is filtered to include only these stations within the Malden River subbasin:

wq <- filter(wq, LocationID %in% locations)
wq <- droplevels(wq)

Summary of Sampling Programs

The majority of stations are part of the MyRWA HOTSPOT program. The MyRWA Baseline (BASE) and MWRA CSORWM programs include only one station each (MAR036 and MWRA176, respectively). There are also a few stations from municipalities (MUNI).

ggplot(wq, aes(LocationID,ProjectID)) + geom_tile(height=0.5, width=0.9) + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) + labs(y='')

plot of chunk project_station

We can also see which parameters were analyzed in each program. The only consistent parameters across all four include ECOLI, ENT, and TEMP_WATER.

ggplot(wq, aes(CharacteristicID, ProjectID)) + geom_tile(height=0.5, width=0.9) + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) + labs(y='')

plot of chunk project_characteristic

We can also see the number of samples for each program. Across all the projects, DO, DO_SAT, ECOLI, SPCOND, and TEMP_WATER are the most common. The MUNI stations primarily include ECOLI data, and also single values of the PPC_* data. TP, TSS, and NO23 are only measured in the BASE project. Turbidity was only measured in the MWRA CSORWM project.

ggplot(wq, aes(CharacteristicID, fill=ProjectID)) + geom_bar(position='stack') + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) + labs(y='No. Samples')

plot of chunk project_characteristic_stack

Finally, the temporal distribution of samples for each project can be seen from this figure. For each month/year and project, a black tile indicates at least one sample of any parameter was collected. The Baseline project includes data from 2000-2014. The CSORWM project was from 2002-2012 (although more data may now be available).

ggplot(wq, aes(floor_date(Datetime, unit="month"), ProjectID)) + geom_tile(height=0.5) + labs(x='Month-Year')

plot of chunk project_monthyear

The CSORWM data were also only collected from May - January, while the Baseline and Hotspot data were collected year-round.

ggplot(wq, aes(factor(month(Datetime)), ProjectID)) + geom_tile(width=0.9, height=0.5) + labs(x='Month')

plot of chunk project_month

Based on these summaries, two separate analyses are warrented. The first focuses on the consistent long-term datasets, which include Baseline sampling at MAR036 and the CSORWM sampling at MWRA176.

Long-term Datasets

Extract the Baseline and CSORWM samples, and only the primary characteristics.

wq.lt <- filter(wq, ProjectID %in% c('BASE', 'CSORWM'), CharacteristicID %in% c("DO", "DO_SAT", "ECOLI", "ENT", "FCOLI", "NO23", "PH", "SPCOND", "TEMP_WATER", "TP", "TSS", "TURB")) %.% droplevels()

Plot the raw data over time. There are some clear outliers:

It also appears that for SPCOND, the BASE samples are an order of magnitude higher than the CSORWM samples. I assume this was an error when inputing the data and that CSORWM are actually in mS/cm.

ggplot(wq.lt, aes(Datetime, ResultValue, color=ProjectID)) + geom_point() + facet_wrap(~CharacteristicID+Units, scales='free_y') + labs(y='Value') + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
## Warning: Removed 1 rows containing missing values (geom_point).

plot of chunk lt_sample_ts

Identify which rows contain outliers:

outlier_index <- with(wq.lt, which((CharacteristicID=="DO_SAT" & ResultValue > 600) | (CharacteristicID=="PH" & ResultValue < 1) | (CharacteristicID=="TP" & ResultValue > 1) | (CharacteristicID=="TSS" & ResultValue > 1000)))
print(outlier_index) # row indices for outliers
## [1]  324  690  691 1597

Remove those rows from wq.lt and save in a separate data frame wq.lt.outlier:

wq.lt.outlier <- wq.lt[outlier_index, ] # data frame saving outliers
wq.lt <- wq.lt[-outlier_index,]

Fix the SPCOND measurements, which appear to be off by a factor 1000:

wq.lt[which(wq.lt$CharacteristicID=="SPCOND" & wq.lt$ProjectID=="CSORWM"), "ResultValue"] <- wq.lt[which(wq.lt$CharacteristicID=="SPCOND" & wq.lt$ProjectID=="CSORWM"), "ResultValue"]*1000

Plot the samples again without the outliers:

ggplot(wq.lt, aes(Datetime, ResultValue, color=ProjectID)) + geom_point() + facet_wrap(~CharacteristicID+Units, scales='free_y') + labs(y='Value') + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
## Warning: Removed 1 rows containing missing values (geom_point).

plot of chunk lt_sample_ts2

Specific Conductivity

To see if there is salt water intrusion, plot the specific conductance. The MWRA collects 'bottom' and 'surface' samples specified by SampleDepthType, the MyRWA Baseline samples are collected at the surface, but have SampleDepthType of NA. This figure shows that salt water is indeed present in the bottom samples collected by MWRA.

ggplot(filter(wq.lt, CharacteristicID=="SPCOND"), aes(Datetime, ResultValue, color=SampleDepthType)) + geom_point() + facet_wrap(~ProjectID, ncol=1)

plot of chunk ls_spcond

To compare the surface samples, filter out the MWRA bottom samples.

ggplot(filter(wq.lt, CharacteristicID=="SPCOND", (SampleDepthType == 'S' | is.na(SampleDepthType))), aes(Datetime, ResultValue, color=ProjectID)) + geom_point() + labs(y="Specific Conductivity (uS/cm)")

plot of chunk ls_spcond_surf

Dissolved Oxygen

The DO data show very low concentrations in the MWRA bottom samples, and some Baseline samples below 5 mg/L.

ggplot(filter(wq.lt, CharacteristicID %in% c("DO", "DO_SAT")), aes(Datetime, ResultValue, color=SampleDepthType)) + geom_point() + facet_grid(CharacteristicID+Units~ProjectID, scales='free_y') + labs(y="")
## Warning: Removed 1 rows containing missing values (geom_point).

plot of chunk ls_do

Seasonality is shown by plotting concentrations by Julian Day (1=Jan 1).

ggplot(filter(wq.lt, CharacteristicID %in% c("DO", "DO_SAT")), aes(yday(Datetime), ResultValue, color=SampleDepthType)) + geom_point() + facet_grid(CharacteristicID+Units~ProjectID, scales='free_y') + labs(x="Julian Day", y="")
## Warning: Removed 1 rows containing missing values (geom_point).

plot of chunk ls_do_jday

pH

pH values seem to be higher at the surface than at the bottom among MWRA data. The Baseline data ends in 2006 and shows an odd downward trend.

ggplot(filter(wq.lt, CharacteristicID == 'PH'), aes(Datetime, ResultValue, color=SampleDepthType)) + geom_point() + facet_wrap(~ProjectID, ncol=1) + labs(y="pH")

plot of chunk ls_ph

Total Phosphorus

Total Phosphorus (TP) was only collected by MyRWA. There appears to be an outlier at the end of 2004, which is below a typical detection limit.

ggplot(filter(wq.lt, CharacteristicID == 'TP'), aes(Datetime, ResultValue)) + geom_point() + facet_wrap(~ProjectID, ncol=1) + labs(y="Total Phosphorus (mg/L)")

plot of chunk ls_tp

Removing the outlier and appending it to the previous wq.lt.outlier dataframe.

outlier_index <- with(wq.lt, which(ProjectID=="BASE" & CharacteristicID=="TP" & ResultValue < 0.01))
print(outlier_index)
## [1] 514
wq.lt.outlier <- rbind(wq.lt.outlier, wq.lt[outlier_index, ]) # data frame saving outliers
wq.lt <- wq.lt[-outlier_index,]

Replotting the data without the outlier.

ggplot(filter(wq.lt, CharacteristicID == 'TP'), aes(Datetime, ResultValue)) + geom_point() + geom_hline(yint=0, alpha=0) + facet_wrap(~ProjectID, ncol=1) + labs(y="Total Phosphorus (mg/L)")

plot of chunk ls_tp2

Plotting by Julian Day shows no apparent seasonality except perhaps fewer high values in winter.

ggplot(filter(wq.lt, CharacteristicID == 'TP'), aes(yday(Datetime), ResultValue)) + geom_point() + geom_hline(yint=0, alpha=0) +facet_wrap(~ProjectID, ncol=1) + labs(x="Julian Day", y="Total Phosphorus (mg/L)")

plot of chunk ls_tp_jday

Next step is to compare TP measurements with antecedent precipitation and flow to see if higher values are associated with storm events.

Municipal Dataset

Extract the municipal data to wq.muni.

wq.muni <- filter(wq, ProjectID=="MUNI") %.% droplevels()

Check number of samples for each location and characteristic. All but one location only includes ECOLI data. The station MALMRO includes the other parameters.

ggplot(wq.muni, aes(LocationID, fill=CharacteristicID)) + geom_bar() + labs(y="No. Samples") + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

plot of chunk muni_sample_no

ECOLI

Plot the ECOLI concentrations at each site over time. Hard to tell which sample goes with which location. A lot of samples collected in Dec 2011 - Jan 2012, why the winter?

ggplot(filter(wq.muni, CharacteristicID=="ECOLI"), aes(Datetime, ResultValue, color=LocationID)) + geom_point() + labs(y="ECOLI") + scale_y_log10()

plot of chunk muni_ecoli

Easier to distinguish by location.

ggplot(filter(wq.muni, CharacteristicID=="ECOLI"), aes(Datetime, ResultValue)) + geom_point() + labs(y="ECOLI") + scale_y_log10() + facet_wrap(~LocationID, nrow=2) + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

plot of chunk muni_ecoli2

Compare the distributions of ECOLI between the sites.

ggplot(filter(wq.muni, CharacteristicID=="ECOLI"), aes(LocationID, ResultValue)) + geom_boxplot() + labs(y="ECOLI") + scale_y_log10() + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

plot of chunk muni_ecoli_box

Hotspot Dataset

Extract the hotspot data to wq.hot.

wq.hot <- filter(wq, ProjectID=="HOTSPOT") %.% droplevels()

Check number of samples for each location and characteristic. Seems well distributed. Some locations have many more samples though.

ggplot(wq.hot, aes(LocationID, fill=CharacteristicID)) + geom_bar() + labs(y="No. Samples") + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

plot of chunk hot_sample_no

Compare characteristics using tiles. Shows that TSS and TP were only sampled at ELP001. The most common characteristics are TEMP_Water, SPCOND, SALINITY, ECOLI, DO_SAT, and DO.

ggplot(wq.hot, aes(LocationID, CharacteristicID)) + geom_tile(width=0.9, height=0.5) + labs(y="") + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

plot of chunk hot_sample_tile

ECOLI

All the hotspot ECOLI samples without specifying the location. Some values vey high (>50,000).

ggplot(filter(wq.hot, CharacteristicID=="ECOLI"), aes(Datetime, ResultValue)) + geom_point() + labs(y="") + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

plot of chunk hot_ecoli