The South Skunk River at 16th St, a little upstream of the monitoring site
The Iowa DNR has been collecting monthly water quality samples from the South Skunk River since 1998, from a site near the Ames Wastewater Treatment Plant. Data can be downloaded from AQuIA, an online portal to DNR’s water assessment database. https://programs.iowadnr.gov/aquia/Sites/10850002
Map of the long-term monitoring site on the South Skunk River
AQuIA has some nice graphing and summary tools to explore the data, but to dive in deeper and get some insights we can apply to data we’re collecting from smaller tributaries, we’ll need to download it.
Importing the data and getting it an a usable form it up can be time-consuming, so I’ve automated the process with R scripts, so that it can easily be repeated when the database has been updated.
The dataset includes tests for over 290 different analytes (things you analyze), including:
Some “analytes” are monitored on a monthly basis, others infrequently. Many of these chemicals tested for were not present above the detection limit. Here is a summary of what was tested for, and what was found.
For now, I’m interested in just a few water quality parameters that have also been measured at other places in Story County.
After filtering out just the relevant information, we’ll rearrange the data table into a “tidy” arrangement to make it easier to work with. Each sampling date is now a row, each water quality parameter measured (flow, E. coli, nitrate, etc.) is a column. This process reduces a giant data table (10,632 row, 28 columns) to a much more manageable size (258 rows, 12 columns). A few rows are shown for demonstration.
DNR_nutrients <- DNR %>%
# Pull out just the analytes of interest: flow, E. coli, nitrate+nitrite, total phosphorus,
#TSS and turbidity
filter(cas_rn %in% c("FLOW", "68583-22-2", "NIT-NO3-NO2", "PHOSP-PHOSP", "TSS", "TURB")) %>%
# Then pull out just the relevant column. Site number and name, sample date,
# the shorthand code for the analyte (cas_rn), and result
# Columns dealing with units and detection limits can't easily be pivoted, so we'll deal with those
# in a separate analysis
select(siteID, name, sampleDate, cas_rn, result) %>%
# Tidy the dataset so that results for each variable show up in a separate column.
pivot_wider( names_from = cas_rn, values_from = result) %>%
mutate(sampleYear = year(sampleDate), sampleMonth = month(sampleDate),
sampleDay = day(sampleDate))%>%
rename(`E-COLI` = `68583-22-2`)
head(DNR)
head(DNR_nutrients)
Reviewing the simpler table by date, I’m better able to see where we have missing data. Monitoring started in October of 1998 has gone through June of 2020. Hydrologists often work with “water years” (Oct-Sept) to better account for snow-pack, but that requires some additional data wrangling, so I’ll stick with calendar years. There were some months missed in 2008 and 2009 due to budget cuts associated with the recession. In Sept 2015, they had to come back a different day to get an E. coli sample, so we have an extra row for that month. To enable better comparisons across years, I’ll drop years for which we have only a few months data. This leaves 19 complete years to work with.
DNR_someyears <- filter(DNR_nutrients, sampleYear %in% c(1999:2007, 2010:2019))
Having data in a tidy format makes it much easier to graph and summarize! A single line of code will now give me the mean, median, minimum, maximum and quartiles for each variable.
summary(DNR_someyears)
## siteID name sampleDate FLOW
## Min. :10850002 Length:229 Min. :1999-01-11 Min. : -1.0
## 1st Qu.:10850002 Class :character 1st Qu.:2003-10-06 1st Qu.: 34.5
## Median :10850002 Mode :character Median :2010-07-13 Median : 200.0
## Mean :10850002 Mean :2009-07-23 Mean : 471.1
## 3rd Qu.:10850002 3rd Qu.:2015-04-02 3rd Qu.: 576.5
## Max. :10850002 Max. :2019-12-09 Max. :7300.0
## NA's :19
## TURB E-COLI TSS NIT-NO3-NO2
## Min. : 1.70 Min. : 10 Min. : 1.00 Min. : 0.85
## 1st Qu.: 4.10 1st Qu.: 180 1st Qu.: 7.00 1st Qu.: 7.00
## Median : 8.95 Median : 400 Median : 17.00 Median : 9.80
## Mean : 25.49 Mean : 1494 Mean : 53.22 Mean :10.72
## 3rd Qu.: 25.25 3rd Qu.: 1200 3rd Qu.: 54.00 3rd Qu.:14.00
## Max. :360.00 Max. :24000 Max. :740.00 Max. :23.00
## NA's :1 NA's :10 NA's :1 NA's :1
## PHOSP-PHOSP sampleYear sampleMonth sampleDay
## Min. :0.0500 Min. :1999 Min. : 1.000 Min. : 1.00
## 1st Qu.:0.2400 1st Qu.:2003 1st Qu.: 4.000 1st Qu.: 6.00
## Median :0.4700 Median :2010 Median : 7.000 Median :10.00
## Mean :0.8491 Mean :2009 Mean : 6.511 Mean : 9.41
## 3rd Qu.:1.0000 3rd Qu.:2015 3rd Qu.: 9.000 3rd Qu.:13.00
## Max. :4.1000 Max. :2019 Max. :12.000 Max. :27.00
## NA's :1
We’ll explore some of these in with more detail.
With over 200 observations, we can get a sense for the shape or distribution of the data. Here’s a histogram of nitrate. Most of the observations are clustered near the middle, around 9 mg/L. This is as close as you’re likely to get to a normal distribution (the classic “bell curve”) in water quality data. The mean (10.7 mg/L) is a little higher than the median (9.8 mg/L).
ggplot(data = DNR_someyears) +
geom_histogram(mapping = aes(x = `NIT-NO3-NO2`), bins = 6) +
labs(x = "Nitrate + Nitrite (mg/L)")
Here’s a histogram for total suspended solids. Looks different, doesn’t it? It has a skewed distribution, like income. In most countries there’s a lot of poor people, a smaller number of middle income people, and a few very wealthy people. Similarly, in the South Skunk River there’s a lot of a clear water days, a smaller number of slightly muddy days, and a few very muddy days. The influence of those very muddy days pulls the the average up, so it no longer represents conditions on a typical day. The mean (53 mg/L) is much higher than the median (18 mg/L) points. This is important, because some of the most common statistical methods (like the t-tests and ANOVAs) are not very reliable with skewed data.
ggplot(data = DNR_someyears) +
geom_histogram(mapping = aes(x = `TSS`), bins = 6) +
labs(x = "Total suspended solids (mg/L)")
How to read a boxplot
“Box and whiskers” plots pack a lot of information into one graph. You may be familiar with percentiles from standardized tests. The box ends at the bottom 25% (quartile) and top 25% of the data. The box is split at the 50th percentile, the median or typical value. The “whiskers” extend to the minimum and maximum. If there are data points a long way from the median (1.5 times the interquartile range), they are considered outliers and marked with a point.
With practice, you can see the distribution of the data. Notice the lopsided shape of the boxplot for total suspended solids, it’s not evenly split. The bottom two quartiles are compressed into a much narrower range (1-17 mg/L) than the third quartile (17-54 mg/L) and the top quartile includes a lot of outliers (200-700 mg/L). That’s a sign of a skewed distribution.
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x = `NIT-NO3-NO2`)) +
labs(x = "Nitrate + Nitrite (mg/L)", title = "Boxplot for all 19 years of monthly data") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x = TSS))+
labs(x = "Total suspended solids (mg/L)", title = "Boxplot for all 19 years of monthly data") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Confused by the terminology? Here’s a simpler way to put it: the water in the South Skunk is usualy fairly clear (low in suspended solids) but when it’s muddy, such as after a heavy rain, it can get really muddy. Nitrate, carried in groundwater and tile drainage, does not change as much, but is consistently high.
Water quality varies a lot from year to year, as shown by these boxplots. 2011-2013 were especially dry years. Interestingly, while these years had low TSS and turbidity, they had high total phosphorus. Nitrate concentrations shot up after 2011–if yields are lower than expected, there’s going to be extra nitrogen in the soil that can wash out in the next year.
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= sampleYear, y = `FLOW`, group = sampleYear)) +
labs(x = "Year", y = "Flow (cfs)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= sampleYear, y = `TURB`, group = sampleYear))+
labs(x = "Year", y = "Turbidity (NTU)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= sampleYear, y = `TSS`, group = sampleYear)) +
labs(x = "Year", y = "Total Suspended Solids (mg/L)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= sampleYear, y = `NIT-NO3-NO2`, group = sampleYear)) +
labs(x = "Year", y = "Nitrate + Nitrite (mg/L)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= sampleYear, y = `PHOSP-PHOSP`, group = sampleYear))+
labs(x = "Year", y = "Total Phosophorus (mg/L)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= sampleYear, y = `E-COLI`, group = sampleYear))+
labs(x = "Year", y = "E. coli (MPN/100mL)", title = "Boxplots for all 19 years of monthly data, sorted by Year")
Water quality also has seasonal patterns. Are you comfortable enough reading boxplots now that you can spot some? Here we have boxplots for all 19 years, sorted by month.
# Adding a text representation of month, i.e. 3 -> Mar
DNR_someyears <- mutate(DNR_someyears, month_txt = month(sampleMonth, label = TRUE))
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= month_txt, y = `FLOW`, group = month_txt)) +
labs(x = "Month", y = "Flow (cfs)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= month_txt, y = `TURB`, group = month_txt))+
labs(x = "Month", y = "Turbidity (NTU)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= month_txt, y = `TSS`, group = month_txt)) +
labs(x = "Month", y = "Total Suspended Solids (mg/L)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= month_txt, y = `NIT-NO3-NO2`, group = month_txt)) +
labs(x = "Month", y = "Nitrate + Nitrite (mg/L)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= month_txt, y = `PHOSP-PHOSP`, group = month_txt))+
labs(x = "Month", y = "Total Phosophorus (mg/L)", title = "Data from all 19 years, sorted by month")
ggplot(data = DNR_someyears) +
geom_boxplot(mapping = aes(x= month_txt, y = `E-COLI`, group = month_txt))+
labs(x = "Month", y = "E. coli (MPN/100mL)", title = "Data from all 19 years, sorted by month")
Here’s what I noticed.
We know that heavy rains can erode soil from cropland and wash pollutants into streams. We know that at higher flows, the water has more power to erode streams banks and stir up mud from the stream bottom. Volunteers with the Squaw Creek Watershed Coalition have helped us document some dramatic examples of how water clarity and phosphorus can change over the course of a week in response to rains.
Is this pattern present with DNR’s monthly samples? Here we’ll plot some pollutants against flow, and add a moving average (a LOESS smooth line) to show the general trend.
ggplot(data = DNR_someyears, mapping = aes(x = `FLOW`, y= `TSS`))+
geom_point() +
geom_smooth(method = lm) +
labs(x = "Flow (cfs)", y = "Total suspended solids (mg/L)", title = "Water quality vs. Flow (all years), with LOESS smooth line")
ggplot(data = DNR_someyears, mapping = aes(x = `FLOW`, y= `E-COLI`))+
geom_point() +
geom_smooth(method = lm) +
labs(x = "Flow (cfs)", y = "E. coli (MPN/100mL)", title = "Water quality vs. Flow (all years), with LOESS smooth line")
ggplot(data = DNR_someyears, mapping = aes(x = `FLOW`, y= `NIT-NO3-NO2`))+
geom_point() +
geom_smooth(method = lm) +
labs(x = "Flow (cfs)", y = "Nitrate + Nitrite (mg/L)", title = "Water quality vs. Flow (all years), with LOESS smooth line")
ggplot(data = DNR_someyears, mapping = aes(x = `FLOW`, y= `PHOSP-PHOSP`))+
geom_point() +
geom_smooth(method = lm) +
labs(x = "Flow (cfs)", y = "Total phosphorus (mg/L)", title = "Water quality vs. Flow (all years), with LOESS smooth line") +
# the smooth line wants to go below zero. We'll adjust the viewing area.
coord_cartesian(ylim = c(0, 5))
Pretty messy! If there is a relationship, generally:
The South Skunk River in Story County is enjoyed as a water trail and a smallmouth bass fishery, so we want to understand, and help improve its water quality!