Exploratory Water Quality Data Analysis

–this time with HOTSPOT Data

Yi Xuan (MyRWA Programmer Intern)

Summer 2014, Mystic River Watershed Association

========================================================

Reference: Guidance from and previous work done by Jeff Walker (PhD Tufts), http://rpubs.com/heflopod/

Data Source: MyRWA’s Access Database

After pulling out data from the database, select a random HOTSPOT sampling date and futher select only ECOLI data

HOTSPOT (Hotspot Mystic River Watershed MA) were water data collected by MyRWA staff and volunteers

#override the default wrapping width
options(width=130)
##changing DateHour format
wq2$DateHour <-as.Date(wq2$DateHour,'%m/%d/%Y') 
#select a subset from the data.frame
hotspotBarPlot<-subset(subset(wq2,  DateHour=="2013-08-15") ,  CharacteristicID=="ECOLI")
#check out what's in the data.frame
head(hotspotBarPlot,5)
##         DateHour LocationTypeID LocationID VisitID CharacteristicID ResultValue     Units Qualifier FlagID SampleTypeID
## 12348 2013-08-15             27 MEDOF01030   30676            ECOLI         568 MPN/100ml      <NA>   <NA>            S
## 12352 2013-08-15             27 MEDOFWB010   30678            ECOLI        2452 MPN/100ml      <NA>   <NA>            S
## 12355 2013-08-15             22      WNB04   30677            ECOLI        6932 MPN/100ml      <NA>   <NA>            S
## 12364 2013-08-15             27 MEDOF50009   30679            ECOLI         242 MPN/100ml      <NA>   <NA>            S
## 12372 2013-08-15             27 MEDOF04100   30680            ECOLI         626 MPN/100ml      <NA>   <NA>            S
##       CharacteristicName SampleType            Datetime ProjectID SampleDepthType                        Comment MunicipalityID
## 12348   Escherichia coli          S 2013-08-15 07:26:00   HOTSPOT            <NA>        Some suspended material        Medford
## 12352   Escherichia coli          S 2013-08-15 08:23:00   HOTSPOT            <NA>                           <NA>        Medford
## 12355   Escherichia coli          S 2013-08-15 08:03:00   HOTSPOT            <NA>                           <NA>        Medford
## 12364   Escherichia coli          S 2013-08-15 08:45:00   HOTSPOT            <NA> Insufficient flow for YSI data        Medford
## 12372   Escherichia coli          S 2013-08-15 09:02:00   HOTSPOT            <NA>                           <NA>        Medford
##                WaterBodyID Latitude Longitude WaterType LocationTypeName Precip.48 Weather
## 12348 Mystic River (Fresh)    42.42    -71.13     Fresh      Storm Sewer         0     Dry
## 12352    Winter Hill Brook    42.40    -71.10     Fresh      Storm Sewer         0     Dry
## 12355    Winter Hill Brook    42.40    -71.10     Fresh     River/Stream         0     Dry
## 12364 Mystic River (Fresh)    42.42    -71.10     Fresh      Storm Sewer         0     Dry
## 12372 Mystic River (Fresh)    42.42    -71.10     Fresh      Storm Sewer         0     Dry

Levels of ECOLI found during Hotspot sampling 08/15/2013

ggplot(hotspotBarPlot, aes(x=LocationID, y=ResultValue, width =.65)) +

  geom_bar(fill="lightgoldenrod1", colour="black",stat = "identity") + 
  #control x axis labeling and its text display angle
  theme(axis.text.x=element_text(angle=45, hjust=1, vjust=NULL)) +
  labs(title="Levels of ECOLI Found \n During Hotspot sampling 08/15/2013",
       y='ResultValue, MPN/100ml') +
  geom_text(aes(label=ResultValue, x = LocationID, y = ResultValue), 
            position = position_dodge(width = 0.8), vjust=-.6) +
      # add two horizontal lines intercept at particular values
  geom_hline(aes(yintercept=1260), colour="#CC0033", size=1.25)+
  geom_hline(aes(yintercept=235), colour="#FF9900", size=1.25)+
  #title and its size specification
  theme(plot.title = element_text(size = rel(1.5)))+
  #adding text/ standard to the plot
  annotate("text", label = "Boating Standard, 1,260 E.coli/100ml", 
           x = 10, y = 1000, size = 4, colour = "black") +
  annotate("text", label = "Swimming Standard, 235 E.coli/100ml", 
           x = 10, y = 130, size = 4, colour = "black") +
  scale_y_log10(limits=c(1,20000),breaks=c(1, 10, 100,1000,10000))

plot of chunk unnamed-chunk-4

On that data, only three sites, all from Winchester had ECOLI levels under regulatory limits.The worst site was from Winter Hill Brook (WINB04) at Medford, which had ECOLI at 6932 MPN per 100ml.

Now let’s take a look at ECOLI data from Lexington

#select a subset of the data.frame
multiplot <- subset(subset(wq2, MunicipalityID=="Lexington" ), CharacteristicID=="ECOLI" )
#override the default wrapping width
options(width=130)
head(multiplot,6)
##        DateHour LocationTypeID LocationID VisitID CharacteristicID ResultValue     Units Qualifier FlagID SampleTypeID
## 5457 2005-08-30             22     MIB032   27354            ECOLI        1102 CFU/100ml      <NA>   <NA>            S
## 5494 2005-08-30             22     FEB018   27362            ECOLI         586 CFU/100ml      <NA>   <NA>            S
## 5495 2005-08-30             22    MUB019W   27361            ECOLI        1041 CFU/100ml      <NA>   <NA>            S
## 5496 2005-08-30             22    MUB019E   27359            ECOLI        3921 CFU/100ml      <NA>   <NA>            S
## 5497 2005-08-30             22     MIB030   27357            ECOLI        5654 CFU/100ml      <NA>   <NA>            S
## 5530 2005-08-30             22     SHB183   27367            ECOLI        2317 CFU/100ml      <NA>   <NA>            S
##      CharacteristicName SampleType            Datetime ProjectID SampleDepthType                                      Comment
## 5457   Escherichia coli          S 2005-08-30 06:49:00   HOTSPOT               S                            1st called SIB047
## 5494   Escherichia coli          S 2005-08-30 07:53:00   HOTSPOT               S                                         <NA>
## 5495   Escherichia coli          S 2005-08-30 07:35:00   HOTSPOT               S                                no oily sheen
## 5496   Escherichia coli          S 2005-08-30 07:28:00   HOTSPOT               S                                   oily sheen
## 5497   Escherichia coli          S 2005-08-30 07:06:00   HOTSPOT               S 1st called SIB027, strong H2S odor, low flow
## 5530   Escherichia coli          S 2005-08-30 08:15:00   HOTSPOT               S                             slight H2S smell
##      MunicipalityID       WaterBodyID Latitude Longitude WaterType LocationTypeName Precip.48 Weather
## 5457      Lexington        Mill Brook    42.43    -71.20     Fresh     River/Stream      0.57     Wet
## 5494      Lexington   Fessenden Brook    42.44    -71.20     Fresh     River/Stream      0.57     Wet
## 5495      Lexington      Munroe Brook    42.43    -71.19     Fresh     River/Stream      0.57     Wet
## 5496      Lexington      Munroe Brook    42.43    -71.19     Fresh     River/Stream      0.57     Wet
## 5497      Lexington        Mill Brook    42.43    -71.19     Fresh     River/Stream      0.57     Wet
## 5530      Lexington Shaker Glen Brook    42.45    -71.20     Fresh     River/Stream      0.57     Wet
ggplot(multiplot, aes(DateHour, ResultValue,color=Weather)) + 
  #setting the shape and size of points
  geom_point( stat="identity",
             shape= 20,size = 4,na.rm=TRUE)  +  
  
  #overwriting default colour, assign new colours to points
  scale_colour_manual(values=c("red", "blue"))+
  
  #warp multiple plots by locationID, set arbitrary row number
  facet_wrap(~LocationID, nrow=5, scale="free_x") +
  
  #specify the x axis datetime tick mark
  scale_x_date(breaks="12 month",limits=c(min(multiplot$DateHour),max=max(multiplot$DateHour)),
               labels = date_format("%Y")) +
  
  labs(title="Multi plot of ECOLI levels From Lexiton Mulnicipality Data ", 
       y='ResultValue, Ecoli Count')+
  #set title size and colour
  theme(plot.title = element_text(face="bold",size = rel(2), colour = "dodgerblue4"))+
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))+
  
    # add two horizontal lines intercept at particular values
  geom_hline(aes(yintercept=1260), colour="maroon1", size=1.25,alpha=0.5)+
  geom_hline(aes(yintercept=235), colour="steelblue", size=1.25,alpha=0.5)+
  
  # set log y axis and its ticks points, and y axis break/ tick marks
  scale_y_log10(limits=c(1,1.2e5),breaks=c(1, 10, 100,1000,10000)) 

plot of chunk unnamed-chunk-5 Turn out there aren’t many sites and samples from Lexington. Upon futher inspection, in reality Mystic River doesn’t flow through Lexington, where just some ponds and small streams scatter trhoughout.

A summary of all the HOTSPOT sites in the database, Arlington comes up on top with 2220 sites

summary(wq2$MunicipalityID)
##   Arlington     Belmont      Boston  Burlington   Cambridge Charlestown     Chelsea East Boston     Everett   Lexington 
##        2220        1125         235          87         758           0        1413         418         232         209 
##      Malden     Medford     Melrose     Reading      Revere  Somerville    Stoneham   Wakefield  Winchester    Winthrop 
##         855        1181         416          12         638         690         197          14         932         220 
##      Woburn        NA's 
##        1019          31

Excel style, a quick achive plot for ECOLI level at WIN2-5d, a site at Aberjona River, Winchester, MA

ggplot(archivePlot, aes(x=Datetime, y=ResultValue,color=Weather)) + 
  labs(title="Achive plot of ECOLI levels found at site WIN2-5d ", 
       y='ResultValue, Ecoli Count')+
    #hide vertical gridlines and make it color black
    theme(panel.grid.minor.x=element_blank(), 
          panel.grid.major.x=element_blank(), 
          panel.grid.major.y=element_line(colour = 'lightyellow', size = 0.5),
          panel.grid.minor.y=element_line(colour = 'lightyellow', size = 0.5))+
    # add two horizontal lines intercept at particular values
  geom_hline(aes(yintercept=1260), colour="#CC0033", size=1.25,alpha=0.5)+
  geom_hline(aes(yintercept=235), colour="#FF9900", size=1.25,alpha=0.5)+
  geom_point(aes(), stat="identity",
             shape= 19,size = 5,na.rm=TRUE)  +
  geom_text(aes(label=ResultValue, x = Datetime, y = ResultValue), 
            position = position_dodge(width = 0.8), vjust=-.7)+
  #set title size, font and colour
  theme(plot.title = element_text(face="bold",size = rel(2), colour = "lemonchiffon3"))+
  # set log y axis and its range, and y axis break/ tick marks
  scale_y_log10(limits=c(1,1.0e5),breaks=c(1, 10, 100,1000,10000)) 
## ymax not defined: adjusting position using y instead

plot of chunk unnamed-chunk-8

Next, let’s check out what’s going on for the Aberjona River. Visualizing using geom_tile

Aberjona<-subset(subset(wq2,  WaterBodyID=="Aberjona River"), LocationTypeName="Storm Sewer")

ggplot(Aberjona, aes(LocationID, CharacteristicID)) +
  geom_tile(aes(fill=Weather),width=0.9, height=0.5) + 
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

plot of chunk unnamed-chunk-9

From the geom_tile visualization, we can tell that the majority of the samples during dry, and some during rainy days (thanks to MyRWA’s staff and volunteers).

Important water quality parameters (ECOLI, Salinity, DO and the like) were captured for most of the sites at the Aberjona River. On the other hand, Potassium, TSS and TP(Total Phosphorus), not so much.

ggplot(Aberjona, aes(LocationID, CharacteristicID)) +
  geom_tile(aes(fill=LocationTypeName),width=0.9, height=0.5) + 
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

plot of chunk unnamed-chunk-10

Water quality visualization by LocationTypeName, which has two values: River/Stream and STorm Sewer. Sites initial started with ABR (Aberjona River) are River/Stream type, on the other hand, other sites with initial started with WIN (Winchester, MA) or WOB (Woburn, MA) are Storm Sewer type, and from local’s combined sewer system.