Analysis of common non-point source pollutants

For now, I’m interested in just a few water quality parameters that have also been measured at other places in Story County.

Flow
Nitrate (As with Hach test strips, what’s actually being measured is inorganic nitrogen, or nitrate + nitrite, normalized as weight of nitrogen in each compound, but we’ll call it “nitrate” for short)
Total phosphorus
E. coli bacteria (an indicator of fecal contamination)
Total suspended solids or TSS (a measure of how muddy the water is, involving drying and weighing the sediment after it’s settled out)
Turbidity (closely related to TSS, uses a sensor to measure how muddy the water is by how much light is scattered)

After filtering out just the relevant information, we’ll rearrange the data table into a “tidy” arrangement to make it easier to work with. Each sampling date is now a row, each water quality parameter measured (flow, E. coli, nitrate, etc.) is a column. This process reduces a giant data table (10,632 row, 28 columns) to a much more manageable size (258 rows, 12 columns). A few rows are shown for demonstration.

DNR_nutrients <- DNR %>%
  # Pull out just the analytes of interest: flow, E. coli, nitrate+nitrite, total phosphorus,
  #TSS and turbidity
  filter(cas_rn %in% c("FLOW", "68583-22-2", "NIT-NO3-NO2", "PHOSP-PHOSP", "TSS", "TURB")) %>%
  # Then pull out just the relevant column. Site number and name, sample date,
  # the shorthand code for the analyte (cas_rn), and result
  # Columns dealing with units and detection limits can't easily be pivoted, so we'll deal with those
  # in a separate analysis
  select(siteID, name, sampleDate, cas_rn, result) %>%
  # Tidy the dataset so that results for each variable show up in a separate column.
    pivot_wider( names_from = cas_rn, values_from = result) %>%
    mutate(sampleYear = year(sampleDate), sampleMonth = month(sampleDate), 
           sampleDay = day(sampleDate))%>%
    rename(`E-COLI` = `68583-22-2`)
head(DNR)

head(DNR_nutrients)

Years of data available

Reviewing the simpler table by date, I’m better able to see where we have missing data. Monitoring started in October of 1998 has gone through June of 2020. Hydrologists often work with “water years” (Oct-Sept) to better account for snow-pack, but that requires some additional data wrangling, so I’ll stick with calendar years. There were some months missed in 2008 and 2009 due to budget cuts associated with the recession. In Sept 2015, they had to come back a different day to get an E. coli sample, so we have an extra row for that month. To enable better comparisons across years, I’ll drop years for which we have only a few months data. This leaves 19 complete years to work with.

DNR_someyears <- filter(DNR_nutrients, sampleYear %in% c(1999:2007, 2010:2019))

Summary statistics

Having data in a tidy format makes it much easier to graph and summarize! A single line of code will now give me the mean, median, minimum, maximum and quartiles for each variable.

summary(DNR_someyears)

##      siteID             name             sampleDate              FLOW       
##  Min.   :10850002   Length:229         Min.   :1999-01-11   Min.   :  -1.0  
##  1st Qu.:10850002   Class :character   1st Qu.:2003-10-06   1st Qu.:  34.5  
##  Median :10850002   Mode  :character   Median :2010-07-13   Median : 200.0  
##  Mean   :10850002                      Mean   :2009-07-23   Mean   : 471.1  
##  3rd Qu.:10850002                      3rd Qu.:2015-04-02   3rd Qu.: 576.5  
##  Max.   :10850002                      Max.   :2019-12-09   Max.   :7300.0  
##                                                             NA's   :19      
##       TURB            E-COLI           TSS          NIT-NO3-NO2   
##  Min.   :  1.70   Min.   :   10   Min.   :  1.00   Min.   : 0.85  
##  1st Qu.:  4.10   1st Qu.:  180   1st Qu.:  7.00   1st Qu.: 7.00  
##  Median :  8.95   Median :  400   Median : 17.00   Median : 9.80  
##  Mean   : 25.49   Mean   : 1494   Mean   : 53.22   Mean   :10.72  
##  3rd Qu.: 25.25   3rd Qu.: 1200   3rd Qu.: 54.00   3rd Qu.:14.00  
##  Max.   :360.00   Max.   :24000   Max.   :740.00   Max.   :23.00  
##  NA's   :1        NA's   :10      NA's   :1        NA's   :1      
##   PHOSP-PHOSP       sampleYear    sampleMonth       sampleDay    
##  Min.   :0.0500   Min.   :1999   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:0.2400   1st Qu.:2003   1st Qu.: 4.000   1st Qu.: 6.00  
##  Median :0.4700   Median :2010   Median : 7.000   Median :10.00  
##  Mean   :0.8491   Mean   :2009   Mean   : 6.511   Mean   : 9.41  
##  3rd Qu.:1.0000   3rd Qu.:2015   3rd Qu.: 9.000   3rd Qu.:13.00  
##  Max.   :4.1000   Max.   :2019   Max.   :12.000   Max.   :27.00  
##  NA's   :1

We’ll explore some of these in with more detail.

How are the data distributed?

With over 200 observations, we can get a sense for the shape or distribution of the data. Here’s a histogram of nitrate. Most of the observations are clustered near the middle, around 9 mg/L. This is as close as you’re likely to get to a normal distribution (the classic “bell curve”) in water quality data. The mean (10.7 mg/L) is a little higher than the median (9.8 mg/L).

ggplot(data = DNR_someyears) +
    geom_histogram(mapping = aes(x = `NIT-NO3-NO2`), bins = 6) +
  labs(x = "Nitrate + Nitrite (mg/L)")

Here’s a histogram for total suspended solids. Looks different, doesn’t it? It has a skewed distribution, like income. In most countries there’s a lot of poor people, a smaller number of middle income people, and a few very wealthy people. Similarly, in the South Skunk River there’s a lot of a clear water days, a smaller number of slightly muddy days, and a few very muddy days. The influence of those very muddy days pulls the the average up, so it no longer represents conditions on a typical day. The mean (53 mg/L) is much higher than the median (18 mg/L) points. This is important, because some of the most common statistical methods (like the t-tests and ANOVAs) are not very reliable with skewed data.

ggplot(data = DNR_someyears) +
    geom_histogram(mapping = aes(x = `TSS`), bins = 6) +
  labs(x = "Total suspended solids (mg/L)")

Boxplots

How to read a boxplot

“Box and whiskers” plots pack a lot of information into one graph. You may be familiar with percentiles from standardized tests. The box ends at the bottom 25% (quartile) and top 25% of the data. The box is split at the 50th percentile, the median or typical value. The “whiskers” extend to the minimum and maximum. If there are data points a long way from the median (1.5 times the interquartile range), they are considered outliers and marked with a point.

With practice, you can see the distribution of the data. Notice the lopsided shape of the boxplot for total suspended solids, it’s not evenly split. The bottom two quartiles are compressed into a much narrower range (1-17 mg/L) than the third quartile (17-54 mg/L) and the top quartile includes a lot of outliers (200-700 mg/L). That’s a sign of a skewed distribution.

ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x = `NIT-NO3-NO2`)) + 
  labs(x = "Nitrate + Nitrite (mg/L)", title = "Boxplot for all 19 years of monthly data") +
    theme(axis.title.y=element_blank(),
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank())

ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x = TSS))+
  labs(x = "Total suspended solids (mg/L)", title = "Boxplot for all 19 years of monthly data") +
  theme(axis.title.y=element_blank(),
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank())

Confused by the terminology? Here’s a simpler way to put it: the water in the South Skunk is usualy fairly clear (low in suspended solids) but when it’s muddy, such as after a heavy rain, it can get really muddy. Nitrate, carried in groundwater and tile drainage, does not change as much, but is consistently high.

Annual patterns

Water quality varies a lot from year to year, as shown by these boxplots. 2011-2013 were especially dry years. Interestingly, while these years had low TSS and turbidity, they had high total phosphorus. Nitrate concentrations shot up after 2011–if yields are lower than expected, there’s going to be extra nitrogen in the soil that can wash out in the next year.

ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= sampleYear, y = `FLOW`, group = sampleYear)) +
  labs(x = "Year", y = "Flow (cfs)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= sampleYear, y = `TURB`, group = sampleYear))+
  labs(x = "Year", y = "Turbidity (NTU)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= sampleYear, y = `TSS`, group = sampleYear)) +
  labs(x = "Year", y = "Total Suspended Solids (mg/L)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= sampleYear, y = `NIT-NO3-NO2`, group = sampleYear)) +
  labs(x = "Year", y = "Nitrate + Nitrite (mg/L)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= sampleYear, y = `PHOSP-PHOSP`, group = sampleYear))+
  labs(x = "Year", y = "Total Phosophorus (mg/L)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= sampleYear, y = `E-COLI`, group = sampleYear))+
  labs(x = "Year", y = "E. coli (MPN/100mL)", title = "Boxplots for all 19 years of monthly data, sorted by Year")

Seasonal patterns

Water quality also has seasonal patterns. Are you comfortable enough reading boxplots now that you can spot some? Here we have boxplots for all 19 years, sorted by month.

# Adding a text representation of month, i.e. 3 -> Mar
DNR_someyears <- mutate(DNR_someyears, month_txt = month(sampleMonth, label = TRUE))
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= month_txt, y = `FLOW`, group = month_txt)) +
  labs(x = "Month", y = "Flow (cfs)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= month_txt, y = `TURB`, group = month_txt))+
  labs(x = "Month", y = "Turbidity (NTU)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= month_txt, y = `TSS`, group = month_txt)) +
  labs(x = "Month", y = "Total Suspended Solids (mg/L)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= month_txt, y = `NIT-NO3-NO2`, group = month_txt)) +
  labs(x = "Month", y = "Nitrate + Nitrite (mg/L)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= month_txt, y = `PHOSP-PHOSP`, group = month_txt))+
  labs(x = "Month", y = "Total Phosophorus (mg/L)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
  geom_boxplot(mapping = aes(x= month_txt, y = `E-COLI`, group = month_txt))+
  labs(x = "Month", y = "E. coli (MPN/100mL)", title = "Data from all 19 years, sorted by month")

Here’s what I noticed.

Flow is usually high in late spring, usually low in August and winter, and quite variable in March and October.
Muddy water (high turbidity and suspended solids) is most common from March thru July.
Nitrate is highest in May and June (when the weather is warm and wet, but before crops have matured) and lowest in August (when the weather is dry and crops are mature)
Total phosphorus is low in May and June and highly variable at other times of year.
E. coli is higher late in the year (Jul-Oct) but always quite variable

How does water quality relate to streamflow?

We know that heavy rains can erode soil from cropland and wash pollutants into streams. We know that at higher flows, the water has more power to erode streams banks and stir up mud from the stream bottom. Volunteers with the Squaw Creek Watershed Coalition have helped us document some dramatic examples of how water clarity and phosphorus can change over the course of a week in response to rains.

Is this pattern present with DNR’s monthly samples? Here we’ll plot some pollutants against flow, and add a moving average (a LOESS smooth line) to show the general trend.

ggplot(data = DNR_someyears, mapping = aes(x = `FLOW`, y= `TSS`))+
  geom_point() + 
  geom_smooth(method = lm) +
  labs(x = "Flow (cfs)", y = "Total suspended solids (mg/L)", title = "Water quality vs. Flow (all years), with LOESS smooth line")
ggplot(data = DNR_someyears, mapping = aes(x = `FLOW`, y= `E-COLI`))+
  geom_point() + 
  geom_smooth(method = lm) +
  labs(x = "Flow (cfs)", y = "E. coli (MPN/100mL)", title = "Water quality vs. Flow (all years), with LOESS smooth line")
ggplot(data = DNR_someyears, mapping = aes(x = `FLOW`, y= `NIT-NO3-NO2`))+
  geom_point() + 
  geom_smooth(method = lm) +
  labs(x = "Flow (cfs)", y = "Nitrate + Nitrite (mg/L)", title = "Water quality vs. Flow (all years), with LOESS smooth line")
ggplot(data = DNR_someyears, mapping = aes(x = `FLOW`, y= `PHOSP-PHOSP`))+
  geom_point() + 
  geom_smooth(method = lm) +
  labs(x = "Flow (cfs)", y = "Total phosphorus (mg/L)", title = "Water quality vs. Flow (all years), with LOESS smooth line") +
  # the smooth line wants to go below zero. We'll adjust the viewing area.
  coord_cartesian(ylim = c(0, 5))

Pretty messy! If there is a relationship, generally:

TSS increases with flow
E. coli increases with flow
Nitrate stays the same or decreases with flow
Total phosphorus decreases with flow! This could indicate point sources of pollution (although I think the site is above the wastewater treatment plant), or dissolved phosphorus from drainage tiles.

The South Skunk River in Story County is enjoyed as a water trail and a smallmouth bass fishery, so we want to understand, and help improve its water quality!

Exploratory Data Analysis of the South Skunk River (v1.1)

Daniel Haug, Prairie Rivers of Iowa

2020-08-06

Introduction

Data available