Automating Data Retrieval from Federal Agencies - Part 2

Last time we looked at how to automate data retrieval from USGS. If you haven't seen that tutorial yet, please check it out here before continuing.

At the end of that post, I hinted that you could tweak the script to download from multiple sites (i.e. two or more stream gauges). Well, here's an example of how that can work.

Downloading from multiple sites

I'm going to blaze through some of the initial setup, since it was covered in the last post, but here is the basic code again.

library(tidyverse) # for all your data wrangling needs
library(lubridate) # for clean handling of dates and times
library(plotly)    # for easy interactive JavaScript plots

# the basic address and filepath to USGS water quality data
base_url <- "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?"

# dates we're interested in
start_date <- "2020-08-01"
end_date <- "2020-08-31"

# these are the site numbers for the Sheboygan, Milwaukee and Menomonee rivers.
# Recognize the 04 part of this address?
sites <- c("04086000", "04087000", "04087120")

Build a function for pasting multiple sites together

We need a function that can paste our different sites together into a character string that matches USGS's URL format. In this case, that means collapsing a vector of sites and pasting the "&site_no=" query parameter in front of each site.

insert_sites <- function(sites){
  
  # repeat the "&site_no=" flag once per site
  site_call <- rep("&site_no=", length.out = length(sites))
  
  # pair each flag with its site number and collapse into one string
  stringr::str_c(site_call, sites, collapse = "")
  
}
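
Quick sanity check: calling the function on our site vector should give back one long string with "&site_no=" in front of each site number.

insert_sites(sites)
## [1] "&site_no=04086000&site_no=04087000&site_no=04087120"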

Everything else that follows is the same as we had last time. We already had a function like this for parameters, since we downloaded two different parameters in part 1.

# We're downloading discharge (00060) and gauge height (00065).
# The trailing empty string makes sure the last parameter also picks up
# "=on&" when everything is collapsed below.
parameters <- c("cb_00060", "cb_00065", "")

# Function to concatenate multiple parameters into a single string

insert_param <- function(parameters){
  
  stringr::str_c(parameters, collapse = "=on&")
  
}
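
Same idea here. Thanks to the trailing empty string in parameters, every parameter code should come out with "=on&" stapled to the end:

insert_param(parameters)
## [1] "cb_00060=on&cb_00065=on&"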

# now I can simply paste this all together into a final url
usgs_full_url <- paste0(base_url,
                        insert_param(parameters),
                        "format=rdb",
                        insert_sites(sites),
                        "&period=&begin_date=",start_date,
                        "&end_date=",end_date)

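Printing the finished URL is an easy way to confirm that all the pieces snapped together in the right order (this is also the link you can paste into a browser to inspect the raw file):

usgs_full_url
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_00060=on&cb_00065=on&format=rdb&site_no=04086000&site_no=04087000&site_no=04087120&period=&begin_date=2020-08-01&end_date=2020-08-31"
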
my_usgs <- read_delim(file = usgs_full_url,
                      delim = "\t",       # this says that the file is tab delimited
                      comment = "#",      # this says to ignore any line that starts with #
                      col_names = TRUE)   # this data set has column names

# Let's take a look at what we got so far
my_usgs
## # A tibble: 8,921 x 8
##    agency_cd site_no datetime tz_cd `157440_00060` `157440_00060_c~
##    <chr>     <chr>   <chr>    <chr> <chr>          <chr>           
##  1 5s        15s     20d      6s    14n            10s             
##  2 USGS      040860~ 2020-08~ CDT   <NA>           <NA>            
##  3 USGS      040860~ 2020-08~ CDT   378            P               
##  4 USGS      040860~ 2020-08~ CDT   380            P               
##  5 USGS      040860~ 2020-08~ CDT   366            P               
##  6 USGS      040860~ 2020-08~ CDT   361            P               
##  7 USGS      040860~ 2020-08~ CDT   400            P               
##  8 USGS      040860~ 2020-08~ CDT   392            P               
##  9 USGS      040860~ 2020-08~ CDT   383            P               
## 10 USGS      040860~ 2020-08~ CDT   373            P               
## # ... with 8,911 more rows, and 2 more variables: `223508_00065` <chr>,
## #   `223508_00065_cd` <chr>

Troubleshoot unwanted header data

At first glance, this looks exactly the same as what we got last time. However, there is a catch. USGS doesn't combine the header data and spit out a single unified file. Instead, it returns each site as its own block containing "#"-commented metadata, new header data, and the data itself. You can check out what this looks like by pasting the URL into your browser or following this link.

Before we can continue, we’ll need to find a way to clean out those extra headers if we want to convert our data types into their proper format.

We can do this by using a regular expression to find the rows that contain letters where we'd expect to find numbers or dates. A great package for this type of string manipulation is stringr, which has a useful cheatsheet here.

# Find the indices for rows that contain header information by searching for letters (i.e. [:alpha:])
problem_indices <- str_which(my_usgs$datetime, pattern =  "[:alpha:]")

# See if we've found the unwanted header data
my_usgs[problem_indices,]
## # A tibble: 5 x 8
##   agency_cd site_no datetime tz_cd `157440_00060` `157440_00060_c~
##   <chr>     <chr>   <chr>    <chr> <chr>          <chr>           
## 1 5s        15s     20d      6s    14n            10s             
## 2 agency_cd site_no datetime tz_cd 157473_00060   157473_00060_cd 
## 3 5s        15s     20d      6s    14n            10s             
## 4 agency_cd site_no datetime tz_cd 157524_00060   157524_00060_cd 
## 5 5s        15s     20d      6s    14n            10s             
## # ... with 2 more variables: `223508_00065` <chr>, `223508_00065_cd` <chr>

OK, this makes sense. We asked for 3 sites, and each block arrived with its own header row and format row. read_delim() already used the first block's header row for column names, which leaves the five stray rows above: the first block's format row, plus a header row and format row from each of the other two blocks. Now we can clean this up and move on to plotting.

# get rid of the rows that contain header information,
# drop the metadata flag columns (just to keep things looking clean),
# and hand out nice R-friendly names, all in one select() call
my_usgs <- my_usgs[-problem_indices,] %>%
  select(site_no,
         date_time = datetime,
         discharge_cfs = ends_with("00060"),
         gauge_height_ft = ends_with("00065"))

# reformat the data types to be factor, date time, and numbers
my_usgs <-  my_usgs %>%
  mutate(site_no = factor(site_no),
         date_time = ymd_hm(date_time),
         discharge_cfs = as.numeric(discharge_cfs),
         gauge_height_ft = as.numeric(gauge_height_ft))
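
If you want some reassurance that as.numeric() didn't silently mangle anything, a quick NA count per column makes a cheap sanity check. The only NAs we should see are the readings that were genuinely missing in the original file. (This step is optional and not part of the pipeline.)

# optional sanity check: count the missing values in each column
colSums(is.na(my_usgs))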

# OK, this should look better now. Let's take a peek
my_usgs
## # A tibble: 8,916 x 4
##    site_no  date_time           discharge_cfs gauge_height_ft
##    <fct>    <dttm>                      <dbl>           <dbl>
##  1 04086000 2020-08-01 00:00:00            NA            3.11
##  2 04086000 2020-08-01 00:15:00           378            3.06
##  3 04086000 2020-08-01 00:30:00           380            3.06
##  4 04086000 2020-08-01 00:45:00           366            3.04
##  5 04086000 2020-08-01 01:00:00           361            3.03
##  6 04086000 2020-08-01 01:15:00           400            3.1 
##  7 04086000 2020-08-01 01:30:00           392            3.09
##  8 04086000 2020-08-01 01:45:00           383            3.07
##  9 04086000 2020-08-01 02:00:00           373            3.05
## 10 04086000 2020-08-01 02:15:00           360            3.03
## # ... with 8,906 more rows

Looks good. Let’s prepare for plotting.

# Get the data into long format for ggplot2
my_usgs_long <- my_usgs %>%
  pivot_longer(-c(date_time, site_no)) %>% # stack both measurements into name/value columns
  filter(!is.na(value))                    # and drop the missing readings

# Prep some nicer looking parameter names
pretty_names <- c(
  "discharge_cfs" = "Discharge (cfs)",
  "gauge_height_ft" = "Gauge Height (ft)"
)

# And some human readable names instead of site numbers
pretty_sites <- c(
  "04086000" = "Sheboygan", 
  "04087000" = "Milwaukee",
  "04087120" = "Menomonee"
)

Plot our results

p <- my_usgs_long %>%
  ggplot(aes(date_time, value, col=name)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  facet_grid(name~site_no, scales = "free_y", 
             labeller = labeller(name = as_labeller(pretty_names),
                                 site_no = as_labeller(pretty_sites))) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "USGS Discharge and Gauge Height",
       subtitle = paste(start_date, "-", end_date),
       y = "",
       x = "")

p

This looks OK, but it may be easier to make comparisons if we have all the sites on the same graph. Let’s do that and then make the plot interactive again.
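
One quick note before the code: below I use recode() with the !!! ("bang-bang-bang") splice operator, which unpacks our pretty_sites named vector into individual old = new arguments. A one-liner shows the idea:

# splicing the named vector is equivalent to writing
# recode(x, "04086000" = "Sheboygan", "04087000" = "Milwaukee", ...)
recode("04087000", !!!pretty_sites)
## [1] "Milwaukee"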

p2 <- my_usgs_long %>%
  mutate(nice_sites = recode(site_no, !!!pretty_sites)) %>%
  ggplot(aes(date_time, value, col=nice_sites)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  facet_grid(name~., scales = "free_y", 
             labeller = labeller(name = as_labeller(pretty_names))) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(title = "USGS Discharge and Gauge Height",
       subtitle = paste(start_date, "-", end_date),
       y = "",
       x = "",
       col="")

ggplotly(p2)
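
And if you'd like to share the interactive version outside of an R session, the htmlwidgets package can write it to a standalone HTML file (the file name here is just an example):

# save the interactive plot as a self-contained HTML page
htmlwidgets::saveWidget(ggplotly(p2), "usgs_rivers.html")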

We’re done for now

Awesome, we did it. We managed to automate retrieval from not one but three USGS sites and plot them all in an interactive graph.