`

# The "DNR" placeholders will stand in for any csv file downloaded from DNR's AQuIA
# Within a session, the DNR object can be used in other scripts.
DNR_input <- "data/07080105ecoli.csv"
DNR_output <- "data/output.csv"
# Document the input when you run variations: 
# downloaded: 2020-07-23
# sites: All sites in South SKunk River HUC8 (07080105)
# analytes: E. coli
# dates: all


# import from csv.  Some of the error flags have letter codes
# They are rare enough that R can't guess type from the first 1000 rows, so we have to specify
DNR <- read_csv(DNR_input, 
  col_types = cols(qualFlag = col_character(), 
    quantFlag = col_character(), quantLimit = col_double(), 
    remark = col_character(), 
    sampleDate = col_datetime(format = "%m/%d/%Y %H:%M:%S"))) %>%
# Optional - Break apart sample date
mutate(sampleYear = year(sampleDate), sampleMonth = month(sampleDate), sampleDay = day(sampleDate)) %>%
# Simplify date-time.  For most purposes, time of day is not important.
  mutate(sampleDate = as_date(sampleDate))
View(DNR)

Sites

Several stretches of the South Skunk River are on the Impaired Waters List because of elevated E. coli bacteria. To better understand this, we will download E. coli data collected by the Iowa DNR, from all sites in the South Skunk River basin (HUC 07080105).

sites <- distinct(DNR, siteID, name, latitude, longitude)
leaflet(data = sites) %>% addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
  addMarkers(~longitude, ~latitude, popup = ~as.character(name), label = ~as.character(siteID))
# Select just those columns of interest. Since the input data does not include other analytes,
# we can safely rename "result" to E. coli without any complicated filtering or pivoting
DNR_ecoli <- select(DNR, siteID, siteName = name, sampleDate, sampleYear, sampleMonth, ecoli = result)

Introduction to E. coli

E. coli is a bacteria that is found in the guts of warm-blooded animals, including livestock (like pigs, cows, and chickens), wildlife (such as geese and deer), and humans. It has been used as an indicator of fecal contamination. Swimming and other forms of primary contact recreation in waters with high levels of E. coli can increase the risk of waterborne illnesses that cause diarrhea.

Bacteria in a petri dish is the classic example of exponential growth, with populations quickly doubling in size. E. coli is highly variable, which is why it is often plotted on a log scale, with each tic mark going up by factors of 10: 10, 100, 1000, 10,000, 100,000. As you can see in the graph below, E. coli levels in both the South Skunk and Indian Creek not only exceeds the single sample maximum (235 MPN/100mL), it’s often 100 times greater!

library(scales)
example_sites <- filter(DNR_ecoli, siteID %in% c(   10500001,10850002))
ggplot(data = example_sites) + 
  geom_point(mapping = aes(x = sampleDate, y = ecoli, color= siteName)) + 
  geom_hline(yintercept = 235, color = 'red')+
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  labs(x = "Date", y = "E. coli bacteria (MPN/100mL") +
  theme_bw() + theme(legend.position = "bottom")

E. coli upstream vs downstream

library(scales)
skunk_sites <- filter(DNR_ecoli, siteID %in% c(10850003,10850002,10620001))
ggplot(data = skunk_sites) + 
  geom_point(mapping = aes(x = sampleDate, y = ecoli, color= siteName)) + 
  geom_hline(yintercept = 235, color = 'red')+
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x))) +
  labs(x = "Date", y = "E. coli bacteria (MPN/100mL") +
  theme_bw() + theme(legend.position = "bottom")

Summarizing E. coli for a recreational season

To summarise conditions during a recreational season, DNR uses the geometric mean. It transforms the data much like we saw in the log scale graph, so it’s not overly influenced by extremely high readings and gives us a better idea of the typical conditions: is it around 10, 100, or 1000 colonies per 100mL?

The recreational season is defined as March 15 thru Nov 15. We’re less concerned about E. coli during the winter, because no one is going for a swim!

# Filters by month
# E. coli standard aplies only to the "recreational season" of March 15-Nov 15
DNR_geomean <- DNR_ecoli %>%
  filter(sampleMonth %in% c(3:11)) %>%
# Calculate geometric mean, round to whole digits
# and summarise by site and year
# Counts samples per year
  group_by(siteID, siteName, sampleYear) %>%
  summarise(count = n(),
    geomean = round(exp(mean(log(ecoli))), digits = 0))
DNR_geomean

To evaluate beaches at publicly owned lakes, DNR requires weekly samples. To evaluate rivers, DNR requires 7 samples (at least monthly). We’ll look at just those sites with enough data to draw conclusions. In evaluating the seasonal geomean rather than a single sample, the relevant standard is 126 MPN/100mL.

DNR_geomean_season <- DNR_geomean %>%
  filter(count >= 7)
ggplot(data = DNR_geomean_season) +
  geom_col(mapping = aes(x = siteName, y = geomean, fill = factor(sampleYear)), position = "dodge")+
  geom_hline(yintercept = 126, color = 'red') +
  labs(x = "Site", y = "Seasonal geomean of E. coli (MPN/100mL)", title = "E. coli by Site") +   
  coord_flip()

  #ggsave("output/ecoli_bargraph.png", my_plot, width = 6, height = 7)
DNR_geomean_select <- DNR_geomean %>%
  filter(count >= 7, siteID %in% c(10850003,10850002,10620001))
ggplot(data = DNR_geomean_select) +
  geom_col(mapping = aes(x= factor(sampleYear), y = geomean, fill = siteName), position = "dodge2") +
  geom_hline(yintercept = 126, color = 'red') +
  coord_flip() +
  labs(x = "Year", y = "Seasonal geomean of E. coli (MPN/100mL)", title = "E. coli in the South Skunk River")