Automating Data Retrieval from Federal Agencies

NOAA, USGS, and other government agencies have a wealth of publicly available data. The trick is understanding the organization of this data and how best to access it.

For example, watershed and stream gauge data are organized by their hydrologic unit code (HUC). This code is essentially an address that breaks watersheds into incrementally smaller subsections, two digits at a time.

Here’s an example that shows how this works for the Upper Marble Canyon watershed in Death Valley, CA.

Here at UW-Milwaukee, we’re in the Great Lakes region. As you can see from the map above, the first two digits for watersheds that drain to the Great Lakes are 04, and the site we’ll look up later will reflect that.

To look up your river or watershed of interest, you can follow the directions here: Find My Watershed USGS Tutorial

But honestly, it’s usually faster to just google it (e.g. “USGS your river/watershed”). This usually returns a webpage with the HUC somewhere near the top, which is also useful for seeing which parameters are available at a particular site.

Understanding how URLs work

Stop thinking about data as something you have on your computer. If you have an internet connection, you have access to data all over the world. That map I showed above… I never saved it to my computer; I simply linked it into this R Markdown page.
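(Side note: linking a remote image into an R Markdown page can be as simple as pointing a code chunk at its web address. The URL below is just a placeholder, not the actual map.)

knitr::include_graphics("https://example.com/huc_map.png")  # placeholder URL; include_graphics() can take a web address instead of a local file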

The trick, just like before, is understanding the address system.

Here is the link to the data we’re going to use today. It is one month of river gauge data from the Sheboygan River, just north of us here in Milwaukee, Wisconsin. https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_00060=on&cb_00065=on&format=rdb&site_no=04086000&period=&begin_date=2020-08-01&end_date=2020-08-31

Wow! That is a long URL, but there’s a reason for that: it contains both an address and a query.

You can think of everything that ends with a “/” as a folder or part of a file path. It’s the same as when you look at a file path to a folder on your computer, but now you’re getting information from a remote computer.

If we look at just that part, we have our basic URL address: https://nwis.waterdata.usgs.gov/usa/nwis/uv/.

The part that comes next starts with a ? because it is a query (i.e. a request for data), and everything after it specifies which data we’re interested in. Notice that each piece of the request is separated with an “&” symbol.

The hardest part of this particular query is understanding the parameter codes. Government agencies have this documented, but often their websites are not too friendly. Here’s a site that lists all of USGS’s parameter codes… Be warned, there are a lot and it gets messy. USGS Parameter Codes

Translating this part of the URL left to right, we get this:

  1. “cb_00060=on&cb_00065=on”: give me discharge and gauge height data
  2. “&format=rdb”: in tab-delimited format
  3. “&site_no=04086000”: for site number 04086000 (the Sheboygan River)
  4. “&period=”: for times that
  5. “&begin_date=2020-08-01”: start August 1, 2020
  6. “&end_date=2020-08-31”: and end August 31, 2020
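If you’d like R to confirm that breakdown, you can split the query portion of the URL on the “&” separators. Here’s a quick base R sketch (the string is just the query part of the URL above):

query <- "cb_00060=on&cb_00065=on&format=rdb&site_no=04086000&period=&begin_date=2020-08-01&end_date=2020-08-31"
strsplit(query, "&", fixed = TRUE)[[1]]   # one element per piece of the request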

Not too bad when you break it down, right?

All right, enough background, let’s finally code!

Load up the packages we’ll use

library(tidyverse) # for all your data wrangling needs
library(lubridate) # for clean handling of dates and times
library(plotly)    # for easy interactive JavaScript plots

Construct a URL to get our USGS data

The main goal here is to break the URL into smaller pieces that we can change as needed in our scripts. Better yet, we could turn these static R objects into user-defined inputs in a shiny web app.

# the basic address and filepath to USGS water quality data
base_url <- "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?"

# dates we're interested in
start_date <- "2020-08-01"
end_date <- "2020-08-31"

# this is the site number for the Sheboygan River.
# Recognize the 04 part of this address?
site <- "04086000"

# We're downloading discharge and gauge height, respectively.
# The trailing "" makes insert_param() (defined below) end the collapsed
# string with "=on&", so the rest of the URL can attach cleanly.
parameters <- c("cb_00060", "cb_00065", "")

# this is a quick little function to concatenate multiple parameters
# mostly because I'm too lazy to type "=on&" more than once.
insert_param <- function(parameters){
  stringr::str_c(parameters, collapse = "=on&")
}

# now I can simply paste this all together into a final url
usgs_full_url <- paste0(base_url,
                        insert_param(parameters),
                        "format=rdb&site_no=",site,
                        "&period=&begin_date=",start_date,
                        "&end_date=",end_date)

Read in the data and clean it up

Now that we’ve constructed the proper URL, we can read the data into R.

Now at this point you may be thinking that it would have been faster just to go to a USGS website and download the data from a browser… and for one download that’s probably true.

But what if you have 10 sites you’re interested in? What if you need to update your data every month, week, or day? What if you want to connect the data to your very own RShiny applet?

The power of coding is that you can create simple scripts that are flexible and scalable. This same simple approach (with some tweaking) would work for any of the situations mentioned above.
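Just to sketch that idea (the next post will do it properly), you could wrap the URL construction above in a small helper and map it over a vector of site numbers. The build_usgs_url() name below is mine, not from any package.

# a hypothetical wrapper around the same paste0() call from the previous section
build_usgs_url <- function(site, start_date, end_date, parameters) {
  paste0(base_url,
         insert_param(parameters),
         "format=rdb&site_no=", site,
         "&period=&begin_date=", start_date,
         "&end_date=", end_date)
}

# one URL per site; purrr::map_chr() comes with the tidyverse we already loaded
sites <- c("04086000", "04087000", "04087120")
urls  <- purrr::map_chr(sites, build_usgs_url,
                        start_date = start_date,
                        end_date = end_date,
                        parameters = parameters)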

Get the data

# read in the usgs data with the url we constructed above
my_usgs <- read_delim(file = usgs_full_url,
                      delim = "\t",       # this says that the file is tab delimited
                      comment = "#",      # this says to ignore any line that starts with #
                      col_names = TRUE)   # this data set has column names
                         

#Let's take a look at what we got so far
my_usgs
## # A tibble: 2,977 x 8
##    agency_cd site_no datetime tz_cd `157440_00060` `157440_00060_c~
##    <chr>     <chr>   <chr>    <chr> <chr>          <chr>           
##  1 5s        15s     20d      6s    14n            10s             
##  2 USGS      040860~ 2020-08~ CDT   <NA>           <NA>            
##  3 USGS      040860~ 2020-08~ CDT   378            P               
##  4 USGS      040860~ 2020-08~ CDT   380            P               
##  5 USGS      040860~ 2020-08~ CDT   366            P               
##  6 USGS      040860~ 2020-08~ CDT   361            P               
##  7 USGS      040860~ 2020-08~ CDT   400            P               
##  8 USGS      040860~ 2020-08~ CDT   392            P               
##  9 USGS      040860~ 2020-08~ CDT   383            P               
## 10 USGS      040860~ 2020-08~ CDT   373            P               
## # ... with 2,967 more rows, and 2 more variables: `223508_00065` <chr>,
## #   `223508_00065_cd` <chr>

Hmmm… Looks like we have some extra columns we don’t need and an extra row of header data. Let’s clean this up a bit more and make sure that everything gets formatted properly.

# get rid of the 1st row that we don't want and the meta-data flag columns 
my_usgs <- my_usgs[-1,] %>%
  select(-c(site_no, ends_with("_cd")))

# add some nice names
names(my_usgs) <- c("date_time", "discharge_cfs", "gauge_height_ft")

# reformat the data types to be date-times and numbers
my_usgs <-  my_usgs %>%
  mutate(date_time = ymd_hm(date_time),
         discharge_cfs = as.numeric(discharge_cfs),
         gauge_height_ft = as.numeric(gauge_height_ft))

# OK, this should look better now. Let's take a peek
my_usgs
## # A tibble: 2,976 x 3
##    date_time           discharge_cfs gauge_height_ft
##    <dttm>                      <dbl>           <dbl>
##  1 2020-08-01 00:00:00            NA            3.11
##  2 2020-08-01 00:15:00           378            3.06
##  3 2020-08-01 00:30:00           380            3.06
##  4 2020-08-01 00:45:00           366            3.04
##  5 2020-08-01 01:00:00           361            3.03
##  6 2020-08-01 01:15:00           400            3.1 
##  7 2020-08-01 01:30:00           392            3.09
##  8 2020-08-01 01:45:00           383            3.07
##  9 2020-08-01 02:00:00           373            3.05
## 10 2020-08-01 02:15:00           360            3.03
## # ... with 2,966 more rows

All right, now let’s see what these data look like.

Make a plot

Static EDA plot

# Quick and dirty base R plot

plot(my_usgs$date_time, my_usgs$discharge_cfs, type = "l", main = paste("USGS Site No: ", site))

That looks like real data. Now let’s pretty it up a bit.

Static pretty plot

# Make some nicer looking names

pretty_names <- c(
  "discharge_cfs" = "Discharge (cfs)",
  "gauge_height_ft" = "Gauge Height (ft)"
)

p <- my_usgs %>%
  pivot_longer(-date_time) %>%
  filter(!is.na(value)) %>%
  ggplot(aes(date_time, value, col=name)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  facet_wrap(~name, ncol = 1, scales = "free_y", 
             labeller = as_labeller(pretty_names)) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = paste0("USGS Site No: ",site),
       subtitle = paste(start_date, "-", end_date),
       y = "",
       x = "")

p

This looks pretty good, but we can take it a step further by making this interactive. Amazingly, we’re just one line away from doing this thanks to the incredible plotly package.

Interactive pretty plot

ggplotly(p)

We’re done for now

In my next post, I’ll show you how to take this script and tweak it to handle multiple sites at once. At our data science club meeting on Friday, we’ll take this script and turn it into an interactive shiny web app.

For now, copy this code into your console and play around with different dates, parameters, and site numbers. You’ll quickly see how powerful a little automation is.

Here are a few suggestions for different sites around Milwaukee:

  • “04087000” = “Milwaukee River at Estabrook Park”
  • “04087120” = “Menomonee River at Wauwatosa”
  • “04087159” = “Kinnickinnic River at 11th St”
  • “04087170” = “Milwaukee River Mouth”

This last site contains a lot of parameters beyond discharge and gauge height. Here are a few ideas to try out.

  • “00010” = “Water Temperature (C)”
  • “00095” = “Specific Conductance (uS)”
  • “63680” = “Turbidity (FNU)”
  • “00300” = “Dissolved Oxygen (mg/L)”
  • “00400” = “pH”
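For example, to pull water temperature from the Milwaukee River Mouth site, you could (in principle) swap the objects from earlier like this and then rebuild usgs_full_url. I’m assuming the same cb_ checkbox pattern works for every parameter code, so treat this as a starting point rather than a guarantee.

site <- "04087170"                                       # Milwaukee River Mouth
parameters <- c("cb_00060", "cb_00065", "cb_00010", "")  # add water temperature (00010)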

Have fun!