Module 4: Water Quality

install.packages(“knitr”) install.packages(“tidyverse”) install.packages(“lubridate”)

library(tidyverse)
library(lubridate)

Science question

How often do rivers in the U.S. exceed the EPA limit for water quality?

Environmental Learning Objectives:

Understand the variability of concentrations of dissolved substances in stream water and identify some reasons for this variability.
Find regulatory guidelines for nitrate concentrations in surface water, find real-time nitrate concentration data, and recognize when observed concentrations exceed guidelines
Evaluate natural and anthropogenic factors that cause the concentrations of nitrate to change over time
Define the meaning of “statistical probability,” and use this term as a regulatory tool to evaluate a potential drinking water resource

Data Science Learning Objectives

Practice importing data sets that aren’t in a standard header format
Apply if-statements and ifelse()()
Apply loops for scaling-up analyses
Pasting strings and numbers together
Communicating with an API

Assumed R knowledge

mutate()
group_by()
ggplot()
geom_line()
geom_point()
select()
summarize()
mean()
filter()
rename()
read_csv()
lm()
rbind()
geom_abline()
geom_smooth()
geom_histogram()
head()

R knowledge covered in module

read_delim()
rename()
slice()
as.numeric()
paste0()
for loop
if-else statement
ifelse()
vector()
geom_hline()

Why this matters

Water is one of the earth’s most precious resources, and is involved in almost many biogeochemical processes. Human activities often lead to problems with water quality. Concentrations of dissolved substances in freshwater varies over time and space, and depends on surrounding anthropogenic factors. Water quality significantly impacts not only ecosystem health (plants, animals, microorganisms, and other species) but also humans who depend on freshwater resources for recreation, tourism, and our very basic needs of waste management and drinking water. This module aims to help students understand the variability of concentrations of dissolved substances in stream water and identify some reasons for this variability. You will explore and evaluate water quality concerns with real-time data from regulatory entities, and learn about water quality ecosystem and human health implications.

Outline:

Explore impaired streams and real-time nitrate concentrations
Download data
Import data
Visualize the data
Analyze the data
Scaling up to multiple sites
Analyzing new state
Analyzing multiple time periods

Step 1: Explore impaired streams and real-time nitrate concentrations

For the second part of this module, we will be further exploring impaired water bodies near us and federal guidelines for safety using publicly available EPA water quality data.

Question 1: (provide answers)

Explore federal regulatory guidelines. The US Environmental Protection Agency lists water quality regulations for both Human Drinking Water http://water.epa.gov/drink/contaminants/index.cfm

What is the Maximum Contaminant Level (MCL) for nitrate (reported as nitrate -nitrogen) in mg/L.

EPA_limit <- 10

What are the potential health impacts of consuming water with concentrations above this limit?

One potential health impacts of consuming water with concentration above this MCL is infants younger than six months old could become very ill and it could be fatal if they are not treated. Additionally, some symptoms of consuming this water include shortness of breath and blue-baby syndrome.
What are the common sources of nitrate in water?

Common sources of nirate water include runoff from using fertilizer, septic tanks leaking, sewage, and erosion of natural deposits.

Step 2: Download data

Collect, plot, and explain the water quality data to do an assessment of acute, local, water quality impacts.

During this portion of the module, you will search for and work with local water quality data, and generate plots that will allow you to further analyze the data, answering questions as you go.

Watershed managers need to know how often a risk presents itself in a watershed in order for action to be taken. For financially strapped government’s, priority is given to problems that have the highest probability of occurring and which are associated with the most severe impacts. “Blue-baby syndrome” is a condition that leads to infant mortality. It is caused by the ingestion of nitrate in drinking water which subsequently bonds to oxygen sites on hemoglobin in the blood of the infant. This impairs the circulation of oxygen in the bloodstream and causes the baby to turn blue. Obviously, society would like to avoid this outcome.

Let’s assume that the Cedar River, Iowa, is being considered for use as a drinking water supply by the municipality of Palo, Iowa (note, this is probably not happening in reality). Is this river water safe for human consumption? To answer this question, we will first explore the variability of nitrate within this river.

Open the U.S. Geological Survey’s “Water Quality Watch” web page: http://waterwatch.usgs.gov/wqwatch/
Under the “State” menu, select “Iowa” and allow the map to refresh.
Under the measurement menu, select “Nitrate.”
Move your mouse over the available triangles until you find “Cedar River at Blairs Ferry Rd, at Palo, IA.” This triangle is equidistant from the top (North) and the bottom (South) of the map, and about an inch to the left of the right (East) boundary of the state.
Once you find it, click once on the triangle, and the click again on the eight-digit hydrologic unit code (HUC) number “05464420” which identifies this site.
Now you are in the official USGS data site for the Cedar River at Blairs Ferry Rd, Palo, IA. Scroll down and explore the plots of the available data. We would like to see whether nitrate concentrations change over the course of the agricultural season.
Scroll back up to the box labelled “Available Parameters.”
Uncheck all except for “NO3+NO2,water,insitu”
Change the starting date to April 1, 2014 and the end date to September 1, 2014. Enter these dates in YEAR-MO-DY format (i.e. 2014-04-01, and 2014-09-01, respectively).
Under the heading “Output Format,” select the “Tab-separated” circle and click “GO.”
Copy the data into a text file (in RStudio: File -> New file -> Text File) and save as site_05464420.txt into your data directory for this module (you should have creating a Module 4 directory with different sub directories for data, etc. and a R project for the Module).

Step 3: Import data

Try reading in the data with the read_csv() function

stream_data <- read_csv("site_05464420.txt")

## Parsed with column specification:
## cols(
##   `agency_cd site_no datetime    tz_cd   44004_99133 44004_99133_cd` = col_character()
## )

Did it work? Now open the file in a text editor. How is this file different that other files we have worked with? What are the header rows?

Try a more general function (read_delim()) for import files that are not comma delimitated. " for the delim tells the function to look for a tab to separate the cells.

stream_data <- read_delim("data_raw/site_05464420.txt", delim = "\t")

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

Look at the imported data. Did it work? Look at the file again in a text editor (i.e., In RStudio you can open the file, file -> open file).
What is line number has the header row? Is there a symbol that designates a comment in the file?

Now use the comment option in the read_delim. You will need to add the comment symbol below in place of [INSERT SYMBOL]

stream_data <- read_delim("data_raw/site_05464420.txt", 
                          delim = "\t", 
                          comment = "#")

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

Do the headers look good now?

head(stream_data)

## # A tibble: 6 x 6
##   agency_cd site_no  datetime         tz_cd `44004_99133` `44004_99133_cd`
##   <chr>     <chr>    <chr>            <chr> <chr>         <chr>           
## 1 5s        15s      20d              6s    14n           10s             
## 2 USGS      05464420 2014-04-01 00:00 CDT   3.77          A               
## 3 USGS      05464420 2014-04-01 00:15 CDT   3.76          A               
## 4 USGS      05464420 2014-04-01 00:30 CDT   3.76          A               
## 5 USGS      05464420 2014-04-01 00:45 CDT   3.77          A               
## 6 USGS      05464420 2014-04-01 01:00 CDT   3.76          A

Which column is the nitrate observation? Let’s give them better names. The number below in rename is the column number

stream_data <- stream_data %>% 
    rename("nitrate" = 5)

What is the class of the datetime and nitrate columns?

Is this what you expected?

Why might the class be wrong? Look at the top rows of the dataframe and in the raw data. Why is the first row different that all the other rows?

Now, we need to remove the first row

There are two ways:

stream_data <- stream_data[-1, ] is the way the we learned in the data carpentry module. This removes the first row

A way that works with tidyverse and pipes. slice is a function that selects particular rows.

stream_data <- stream_data %>% 
    slice(-1)

head(stream_data)

## # A tibble: 6 x 6
##   agency_cd site_no  datetime         tz_cd nitrate `44004_99133_cd`
##   <chr>     <chr>    <chr>            <chr> <chr>   <chr>           
## 1 USGS      05464420 2014-04-01 00:00 CDT   3.77    A               
## 2 USGS      05464420 2014-04-01 00:15 CDT   3.76    A               
## 3 USGS      05464420 2014-04-01 00:30 CDT   3.76    A               
## 4 USGS      05464420 2014-04-01 00:45 CDT   3.77    A               
## 5 USGS      05464420 2014-04-01 01:00 CDT   3.76    A               
## 6 USGS      05464420 2014-04-01 01:15 CDT   3.76    A

Now fix the class issue with the datetime and nitrate columns. What is the format of the datatime? Since it has the year, month, day, hour, and minute we will use the ymd_hm() function (this is similar to the ymd() that we used in the data carpentry module)

stream_data <- stream_data %>% 
    mutate(datetime = ymd_hm(datetime))
head(stream_data)

## # A tibble: 6 x 6
##   agency_cd site_no  datetime            tz_cd nitrate `44004_99133_cd`
##   <chr>     <chr>    <dttm>              <chr> <chr>   <chr>           
## 1 USGS      05464420 2014-04-01 00:00:00 CDT   3.77    A               
## 2 USGS      05464420 2014-04-01 00:15:00 CDT   3.76    A               
## 3 USGS      05464420 2014-04-01 00:30:00 CDT   3.76    A               
## 4 USGS      05464420 2014-04-01 00:45:00 CDT   3.77    A               
## 5 USGS      05464420 2014-04-01 01:00:00 CDT   3.76    A               
## 6 USGS      05464420 2014-04-01 01:15:00 CDT   3.76    A

Now lets fix the nitrate column by converting from character to numeric using as.numeric()

stream_data <- stream_data %>%
    mutate(nitrate = as.numeric(nitrate))
head(stream_data)

## # A tibble: 6 x 6
##   agency_cd site_no  datetime            tz_cd nitrate `44004_99133_cd`
##   <chr>     <chr>    <dttm>              <chr>   <dbl> <chr>           
## 1 USGS      05464420 2014-04-01 00:00:00 CDT      3.77 A               
## 2 USGS      05464420 2014-04-01 00:15:00 CDT      3.76 A               
## 3 USGS      05464420 2014-04-01 00:30:00 CDT      3.76 A               
## 4 USGS      05464420 2014-04-01 00:45:00 CDT      3.77 A               
## 5 USGS      05464420 2014-04-01 01:00:00 CDT      3.76 A               
## 6 USGS      05464420 2014-04-01 01:15:00 CDT      3.76 A

Finally, remove the nitrate missing values

stream_data %>%
  filter(!is.na(nitrate)) %>%
  select(nitrate)

## # A tibble: 13,897 x 1
##    nitrate
##      <dbl>
##  1    3.77
##  2    3.76
##  3    3.76
##  4    3.77
##  5    3.76
##  6    3.76
##  7    3.76
##  8    3.76
##  9    3.76
## 10    3.76
## # … with 13,887 more rows

Now combine all data cleaning steps into a single piped command

stream_data <- read_delim("data_raw/site_05464420.txt", 
                          delim = "\t", 
                          comment = "#")

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

stream_data <- stream_data %>%
  slice(-1) %>%
  rename('nitrate' = 5) %>% 
  mutate(nitrate = as.numeric(nitrate))  %>%  #YOU NEEDED A MUTATE HERE
  filter(!is.na(nitrate)) %>%
  mutate(datetime = ymd_hm(datetime)) #YOU HAD AN EXTRA PIPE HERE

head(stream_data)

## # A tibble: 6 x 6
##   agency_cd site_no  datetime            tz_cd nitrate `44004_99133_cd`
##   <chr>     <chr>    <dttm>              <chr>   <dbl> <chr>           
## 1 USGS      05464420 2014-04-01 00:00:00 CDT      3.77 A               
## 2 USGS      05464420 2014-04-01 00:15:00 CDT      3.76 A               
## 3 USGS      05464420 2014-04-01 00:30:00 CDT      3.76 A               
## 4 USGS      05464420 2014-04-01 00:45:00 CDT      3.77 A               
## 5 USGS      05464420 2014-04-01 01:00:00 CDT      3.76 A               
## 6 USGS      05464420 2014-04-01 01:15:00 CDT      3.76 A

Step 4: Visualize the data

Create plot with datetime on the x-axis and nitrate on the y-axis (this is a time-series plot of nitrate). Then use the geom_hline() function to add a horizontal line at the EPA concentration (the EPA_limit variable that you defined above)

ggplot(data = stream_data,
       mapping = aes(x = datetime,
                     y = nitrate,
                     color = EPA_limit)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = EPA_limit) +
  labs(x = "Date and Time" , y = "Nitrate (chl)" , title = "Time-Series Plot of Nitrate")

Question 2: Does the location exceed the EPA limit?

Answer 2:

Yes, it exceeds the EPA limit.

Step 5: Analyze the data

Calculate the percentage of time that the stream exceeds the EPA limit.

Hints: you will use the ifelse(), sum(), n(), mutate(), and summarize() functions. The ifelse() function is likely new to you. You use the ifelse() function within the mutate function. For example following code will set the value of new_variable equal to 1 IF ’old_variableis greater than 1, otherwise (i.e., ELSE) set the value ofnew_variable` equal to 0.

mutate(new_variable = ifelse(old_variable > 1, 1, 0))

Be sure you print out your results for calculating the percentage of time that the steam exceeds the EPA limit so that it shows up in the knitted document.

exceed <- stream_data %>% 
  mutate(exceed_limit = ifelse(nitrate > 10, 1, 0)) %>%
  summarize(exceed_limit = sum(exceed_limit),
            total = n()) %>%

  mutate(summarize = (exceed_limit/total)) 
    
head(exceed)

## # A tibble: 1 x 3
##   exceed_limit total summarize
##          <dbl> <int>     <dbl>
## 1         3747 13897     0.270

  #INSERT CALCULATIONS TO DETERMINE THE PERCENTAGE OF OBSERVATIONS
  #THAT EXCEED THE EPA LIMIT

Step 6: Scaling up to multiple sites

Now we want to calculate the probability of exceeding the EPA limit for all gauged streams and river in Iowa. It would be very slow to manual click and download files for all gauges and time points that we are interested in.
Fortunately there is a way around this that is more advanced (but hey you are studying environmental informatics). The process of automating the download and calculations requires a few new concepts.

First, many sources of data have an application programming interface (API) that is a set of clearly defined methods of communication. The USGS has a API that will use.

Copy and paste the following command into a web browser.

https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05464420&period=&begin_date=2014-04-01&end_date=2014-09-01

See how this produces the same file as what you downloaded manually.

Now look at the web address. The format of the web address is the AP. There is “&site_no”, “&begin_date”, “&end_date” in the address. We can modify these to download other sites and other time periods.

Create a variable called site and set it equal to our focal site number.
Similarly, create variables for the begin and end date

site <- "&site_no"
begin_date <- "&begin_date"
end_date <- "&"

Now use the paste0() function combine the components of address with the variables to create a new web address where we can easily change the address with new sites and time periods

url <- paste0("https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=", 
              site, 
              "&period=&begin_date=",
              begin_date, 
              "&end_date=", 
              end_date)
url

## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=&site_no&period=&begin_date=&begin_date&end_date=&"

Now download the file at the web address using R

download.file(url = url,
              dest = paste0("data_raw/site_", site, ".txt"))

We are now one step closer to automating.

Second, we need to be able to loop through a vector of different sites and download each of those sites. We will use the concept of a “for loop”.

Create a vector of sites. The following was obtained from the USGS website and represents all of the active sites in Iowa

site <- c("05412500",
          "05418400",
          "05418720",
          "05420500",
          "05455100",
          "05465500",
          "05464420",
          "05481000",
          "05482000",
          "05482300",
          "05482500",
          "05483600", 
          "05484000", 
          "05484500",
          "06808500", 
          "06817000")

A “for loop” will increment through a vector. The following example loops through the values 1 through 10 and names the current value of the loop. It prints the current value of i.

for (i in 1:10) {
  print(i)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

We can use this current value of i to subset the site vector. Note that we change the highest value in the loop from 10 to the length of the site vector

for (i in 1:length(site)) {
  print(site[i])
}

## [1] "05412500"
## [1] "05418400"
## [1] "05418720"
## [1] "05420500"
## [1] "05455100"
## [1] "05465500"
## [1] "05464420"
## [1] "05481000"
## [1] "05482000"
## [1] "05482300"
## [1] "05482500"
## [1] "05483600"
## [1] "05484000"
## [1] "05484500"
## [1] "06808500"
## [1] "06817000"

Now we can use the subsetted site vector to modify the site value in the url.
See that we replaced site with site[i]

for (i in 1:length(site)) {
  url <- paste0("https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=", 
                site[i], 
                "&period=&begin_date=", 
                begin_date, 
                "&end_date=", 
                end_date)
  print(url)
}

## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05412500&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05418400&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05418720&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05420500&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05455100&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05465500&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05464420&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05481000&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05482000&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05482300&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05482500&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05483600&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05484000&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05484500&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=06808500&period=&begin_date=&begin_date&end_date=&"
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=06817000&period=&begin_date=&begin_date&end_date=&"

url <- vector("character", length = length(site))
for (i in 1:length(site)) {
  url[i] <- paste0("https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=", 
                site[i], 
                "&period=&begin_date=", 
                begin_date, 
                "&end_date=", 
                end_date)
}
print(url)

##  [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05412500&period=&begin_date=&begin_date&end_date=&"
##  [2] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05418400&period=&begin_date=&begin_date&end_date=&"
##  [3] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05418720&period=&begin_date=&begin_date&end_date=&"
##  [4] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05420500&period=&begin_date=&begin_date&end_date=&"
##  [5] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05455100&period=&begin_date=&begin_date&end_date=&"
##  [6] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05465500&period=&begin_date=&begin_date&end_date=&"
##  [7] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05464420&period=&begin_date=&begin_date&end_date=&"
##  [8] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05481000&period=&begin_date=&begin_date&end_date=&"
##  [9] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05482000&period=&begin_date=&begin_date&end_date=&"
## [10] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05482300&period=&begin_date=&begin_date&end_date=&"
## [11] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05482500&period=&begin_date=&begin_date&end_date=&"
## [12] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05483600&period=&begin_date=&begin_date&end_date=&"
## [13] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05484000&period=&begin_date=&begin_date&end_date=&"
## [14] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=05484500&period=&begin_date=&begin_date&end_date=&"
## [15] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=06808500&period=&begin_date=&begin_date&end_date=&"
## [16] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=06817000&period=&begin_date=&begin_date&end_date=&"

Third, combine your code from above that you used to calculate the probability of exceeding the EPA threshold, the API example, and the “for loop”" to calculate the probability of exceeding the EPA threshold for all the streams and rivers in Iowa.

#Vector of site numbers
site <- c("05412500",
          "05418400",
          "05418720",
          "05420500",
          "05455100",
          "05465500",
          "05464420",
          "05481000",
          "05482000", 
          "05482300", 
          "05482500", 
          "05483600", 
          "05484000", 
          "05484500",
          "06808500", 
          "06817000") 

#Create a vector to hold the calculation for each gauge
site_exceed <- vector("double", length = length(site)) 

EPA_limit <- 10

for (i in 1:length(site)) {
  
    download.file(url = paste0("https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_99133=on&format=rdb&site_no=", 
                               site[i], 
                               "&period=&begin_date=", 
                               begin_date, 
                               "&end_date=", 
                               end_date),
                dest = paste0("data_raw/", site[i], ".txt"))

  stream_data <- read_delim(paste0("data_raw/", site[i], ".txt"), 
                            delim = "\t", 
                            comment = "#")
  
  if (ncol(stream_data) > 5) { 
    
  stream_data <- read_delim("data_raw/site_05464420.txt", 
                          delim = "\t", 
                          comment = "#") 
stream_data <- stream_data %>%
  slice(-1) %>%
  rename('nitrate' = 5) %>% 
  mutate(nitrate = as.numeric(nitrate))  %>%  
  filter(!is.na(nitrate)) %>%
  mutate(datetime = ymd_hm(datetime)) 

head(stream_data)

  exceed <- stream_data %>% 
  mutate(exceed_limit = ifelse(nitrate > 5, 1, 0)) %>%
  summarize(exceed_limit = sum(exceed_limit),
            total = n()) %>%

  mutate(summarize = (exceed_limit/total)) 
    
head(exceed)
    
    site_exceed[i] <-  1
  }else{
    site_exceed[i] <- NA
  }
  
}

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `43610_99133` = col_character(),
##   `43610_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character()
## )

## Warning: 1 parsing failure.
## row col  expected    actual                    file
##   1  -- 4 columns 5 columns 'data_raw/05418400.txt'

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character()
## )

## Warning: 1 parsing failure.
## row col  expected    actual                    file
##   1  -- 4 columns 5 columns 'data_raw/05418720.txt'

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `174657_99133` = col_character(),
##   `174657_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character()
## )

## Warning: 1 parsing failure.
## row col  expected    actual                    file
##   1  -- 4 columns 5 columns 'data_raw/05455100.txt'

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44062_99133` = col_character(),
##   `44062_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )
## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )
## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character()
## )

## Warning: 1 parsing failure.
## row col  expected    actual                    file
##   1  -- 4 columns 5 columns 'data_raw/05481000.txt'

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44240_99133` = col_character(),
##   `44240_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44246_99133` = col_character(),
##   `44246_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44262_99133` = col_character(),
##   `44262_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character()
## )

## Warning: 1 parsing failure.
## row col  expected    actual                    file
##   1  -- 4 columns 5 columns 'data_raw/05483600.txt'

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44294_99133` = col_character(),
##   `44294_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44302_99133` = col_character(),
##   `44302_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44654_99133` = col_character(),
##   `44654_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44719_99133` = col_character(),
##   `44719_99133_cd` = col_character()
## )

## Parsed with column specification:
## cols(
##   agency_cd = col_character(),
##   site_no = col_character(),
##   datetime = col_character(),
##   tz_cd = col_character(),
##   `44004_99133` = col_character(),
##   `44004_99133_cd` = col_character()
## )

#combine the site ID and exceed columns into a data frame (tibble)
site_exceed_combined <- tibble(site = site, 
                               exceed = site_exceed)
print(site_exceed_combined)

## # A tibble: 16 x 2
##    site     exceed
##    <chr>     <dbl>
##  1 05412500      1
##  2 05418400     NA
##  3 05418720     NA
##  4 05420500      1
##  5 05455100     NA
##  6 05465500      1
##  7 05464420      1
##  8 05481000     NA
##  9 05482000      1
## 10 05482300      1
## 11 05482500      1
## 12 05483600     NA
## 13 05484000      1
## 14 05484500      1
## 15 06808500      1
## 16 06817000      1

Fourth, make a bar graph showing the probability of exceeding the EPA threshold for each station in your analysis.

In the geom_bar(), you will use stat = "identity (which say “just use the y variable that I provide”)

Be sure your station numbers are readable on the plot.

ggplot(data = site_exceed_combined, mapping = aes(x = site, y = exceed)) +
 geom_bar(stat = 'identity') +
    theme(panel.grid = element_blank(),
        text = element_text(size = 16),
        axis.text.x = element_text(color = "grey20",
                                   size = 12,
                                   angle = 90,
                                   hjust = 0.5,
                                   vjust = 0.5),
        axis.text.y = element_text(color = "grey20",
                                   size = 12)) +
            labs(x = "Station", y = "Exceeds the EPA Threshold", title = "Probability of Exceeding the EPA Threshold at Sites in Iowa")

## Warning: Removed 5 rows containing missing values (position_stack).

Step 6: Analyzing a new state

Question 3 Answer the following question by reusing and modifying the code above: What was the probability of exceeding the EPA threshold for all gauges in Virginia between 2014-04-01 and 2014-09-01?

You will get the list of Virginia gauges by 1) Going to https://waterwatch.usgs.gov/ 2) Selecting Virginia and Nitrate from the drop-down options 3) Clicking “Site List”

The numbers in front of each station are the numbers used in the API. Be sure your answer prints out to the document when you knit it.

Answer 3:

print(site_exceed)

##  [1]  1 NA NA  1 NA  1  1 NA  1  1  1 NA  1  1  1  1

Question 4: How does the probability of exceeding the EPA threshold differ between Iowa and Virginia? How might this be related to differences in land-use between the states?

Answer 4:

Step 7 (Optional - Not graded): Analyzing multiple time periods

Optional Question 5:

Answer the following question by reusing and modifying the code above: What was the probability of exceeding the EPA threshold for station “05412500” in Iowa each year between April 1 and Sept 1 for years 2014 though 2018? (you will calculate the probability of exceeding separately for each year). What do you need to modify to evaluate different years rather than sites? Think about how the example above uses the site vector and modify the code so that it loops through different time periods.

Be sure your answer prints out to the document when you knit it.

Optional Answer 5:

#INSERT CODE

print(site_exceed)

##  [1]  1 NA NA  1 NA  1  1 NA  1  1  1 NA  1  1  1  1

Optional Question 6:

Is the probability of exceeding the EPA threshold changing over time? Show with a plot.

Optional Answer 6:

#ggplot(data = site_exceed_combined, mapping = aes(x = years, y = exceed)) +
  #geom_line() +
  #geom_point() +
  #labs(x = "Year", y = "Proability of Exceeding EPA Threshold", 
       #title = "Iowa: Proabability of nitrate exceeding EPA threshold")

Reference

This module was initially developed by

Castendyk, D. and Gibson, C. 30 June 2015. Project EDDIE: Water Quality. Project EDDIE Module 6, Version 1. http://cemast.illinoisstate.edu/data-for-students/modules/water-quality.shtml. Module development was supported by NSF DEB 1245707.