NOAA, USGS, and other government agencies have a wealth of publicly available data. The trick is understanding the organization of this data and how best to access it.
For example, watershed and stream gauge data are organized by hydrologic unit code (HUC). This code is essentially an address that breaks watersheds into incrementally smaller subsections, two digits at a time.
Here's an example that shows how this works for the Upper Marble Canyon watershed in Death Valley, CA.
Here at UW-Milwaukee, we're in the Great Lakes region. So, as you can see from the map above, the first two digits for watersheds that drain to the Great Lakes are 04, and the site we'll look up later will reflect that.
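If it helps to see the nesting spelled out, here's a minimal sketch that peels an 8-digit HUC apart two digits at a time. The code 04030101 is just an illustrative Great Lakes example here; look up the actual HUC for your own watershed.

# each pair of digits zooms in on a smaller piece of the watershed
huc <- "04030101"
substr(huc, 1, 2) # "04"       - region (the Great Lakes)
substr(huc, 1, 4) # "0403"     - subregion
substr(huc, 1, 6) # "040301"   - basin
substr(huc, 1, 8) # "04030101" - subbasin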
To look up your river or watershed of interest, you can follow the directions here: Find My Watershed USGS Tutorial
But honestly, it's usually faster to just google it (e.g. "USGS your river/watershed"). That usually returns a page with the HUC somewhere near the top, and it's also useful for seeing which parameters are available at a particular site.
Stop thinking about data as something you have on your computer. If you have an internet connection, you have access to data all over the world. That map I showed above… I never saved it to my computer, but instead simply linked it into this R Markdown page.
The trick, just like before, is understanding the address system.
Here is the link to the data we're going to use today: one month of river gauge data from the Sheboygan River, just north of us here in Milwaukee, Wisconsin. https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_00060=on&cb_00065=on&format=rdb&site_no=04086000&period=&begin_date=2020-08-01&end_date=2020-08-31
Wow! That is a long url, but there’s a reason for that. That url contains both an address and a query.
You can think of everything that ends with a "/" as a folder, or part of a file path. It's the same as a file path to a folder on your own computer, except now you're getting information from a remote computer.
If we look at just that portion, we have our basic url address https://nwis.waterdata.usgs.gov/usa/nwis/uv/.
Now, the part that comes next starts with a ? because it is a query (i.e. a request for data), and everything that follows specifies which data we're interested in. Notice that each request is separated with an "&" symbol.
The hardest part of this particular query is understanding the parameter codes. Government agencies have these documented, but their websites are often not too friendly. Here's a site that lists all of USGS's parameter codes… Be warned, there are a lot, and it gets messy. USGS Parameter Codes
Translating this part of the url from left to right, we get this:
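- cb_00060=on: turn on parameter 00060 (discharge, in cubic feet per second)
- cb_00065=on: turn on parameter 00065 (gauge height, in feet)
- format=rdb: return the data as a tab-delimited text file
- site_no=04086000: the site number for the Sheboygan River gauge
- period=: left blank, since we give explicit start and end dates instead
- begin_date=2020-08-01: the first day of data we want
- end_date=2020-08-31: the last day of data we want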
Not too bad when you break it down, right?
All right, enough background. Let's finally code!
library(tidyverse) # for all your data wrangling needs
library(lubridate) # for clean handling of dates and times
library(plotly) # for easy interactive JavaScript plots
The main goal here is to break the url into smaller pieces that we can change as needed in our scripts. Better yet, we could turn these static R objects into user-defined inputs in a Shiny web app.
# the basic address and filepath to USGS water quality data
base_url <- "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?"
# dates we're interested in
start_date <- "2020-08-01"
end_date <- "2020-08-31"
# this is the site number for the Sheboygan river.
# Recognize the 04 part of this address?
site <- "04086000"
# We're downloading discharge (00060) and gauge height (00065).
# The trailing empty string is a small trick: it makes sure the
# last parameter also gets "=on&" appended when we collapse below.
parameters <- c("cb_00060", "cb_00065", "")
# this is a quick little function to concatenate multiple parameters
# mostly because I'm too lazy to type "=on&" more than once.
insert_param <- function(parameters){
  stringr::str_c(parameters, collapse = "=on&")
}
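# a quick check of what the helper returns for our parameters
insert_param(parameters)
## [1] "cb_00060=on&cb_00065=on&"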
# now I can simply paste this all together into a final url
usgs_full_url <- paste0(base_url,
                        insert_param(parameters),
                        "format=rdb&site_no=", site,
                        "&period=&begin_date=", start_date,
                        "&end_date=", end_date)
Now that we’ve constructed the proper url, we can read the data into R.
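First, though, a quick sanity check: printing the rebuilt url should give back exactly the long url we started with.

# if the pieces came together correctly, this matches the original url
usgs_full_url
## [1] "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?cb_00060=on&cb_00065=on&format=rdb&site_no=04086000&period=&begin_date=2020-08-01&end_date=2020-08-31"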
At this point you may be thinking that it would have been faster to just go to a USGS website and download the data from a browser… and for one download, that's probably true.
But what if you have 10 sites you're interested in? What if you need to update your data every month, week, or day? What if you want to connect the data to your very own Shiny applet?
The power of coding is that you can create simple scripts that are flexible and scalable. This same simple approach (with some tweaking) would work for any of the situations mentioned above.
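To give you a taste, here's a minimal sketch of the multi-site version, assuming the objects we defined above. The build_usgs_url() helper and the second site number are hypothetical placeholders; we'll flesh this out properly later.

# wrap the url recipe from above in a small helper function
build_usgs_url <- function(site){
  paste0(base_url,
         insert_param(parameters),
         "format=rdb&site_no=", site,
         "&period=&begin_date=", start_date,
         "&end_date=", end_date)
}
# then build a url for every site in one line
# (swap in the real site numbers you care about)
my_sites <- c("04086000", "04087000")
urls <- purrr::map_chr(my_sites, build_usgs_url)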
# read in the usgs data with the url we constructed above
my_usgs <- read_delim(file = usgs_full_url,
                      delim = "\t",     # this says that the file is tab delimited
                      comment = "#",    # this says to ignore any line that starts with #
                      col_names = TRUE) # this data set has column names
# Let's take a look at what we got so far
my_usgs
## # A tibble: 2,977 x 8
## agency_cd site_no datetime tz_cd `157440_00060` `157440_00060_c~
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 5s 15s 20d 6s 14n 10s
## 2 USGS 040860~ 2020-08~ CDT <NA> <NA>
## 3 USGS 040860~ 2020-08~ CDT 378 P
## 4 USGS 040860~ 2020-08~ CDT 380 P
## 5 USGS 040860~ 2020-08~ CDT 366 P
## 6 USGS 040860~ 2020-08~ CDT 361 P
## 7 USGS 040860~ 2020-08~ CDT 400 P
## 8 USGS 040860~ 2020-08~ CDT 392 P
## 9 USGS 040860~ 2020-08~ CDT 383 P
## 10 USGS 040860~ 2020-08~ CDT 373 P
## # ... with 2,967 more rows, and 2 more variables: `223508_00065` <chr>,
## # `223508_00065_cd` <chr>
Hmmm… Looks like we have some extra columns we don’t need and an extra row of header data. Let’s clean this up a bit more and make sure that everything gets formatted properly.
# get rid of the 1st row that we don't want and the meta-data flag columns
my_usgs <- my_usgs[-1,] %>%
  select(-c(site_no, ends_with("_cd")))
# add some nice names
names(my_usgs) <- c("date_time", "discharge_cfs", "gauge_height_ft")
# reformat the data types to be date times, and numbers
my_usgs <- my_usgs %>%
  mutate(date_time = ymd_hm(date_time),
         discharge_cfs = as.numeric(discharge_cfs),
         gauge_height_ft = as.numeric(gauge_height_ft))
# OK, this should look better now. Let's take a peek
my_usgs
## # A tibble: 2,976 x 3
## date_time discharge_cfs gauge_height_ft
## <dttm> <dbl> <dbl>
## 1 2020-08-01 00:00:00 NA 3.11
## 2 2020-08-01 00:15:00 378 3.06
## 3 2020-08-01 00:30:00 380 3.06
## 4 2020-08-01 00:45:00 366 3.04
## 5 2020-08-01 01:00:00 361 3.03
## 6 2020-08-01 01:15:00 400 3.1
## 7 2020-08-01 01:30:00 392 3.09
## 8 2020-08-01 01:45:00 383 3.07
## 9 2020-08-01 02:00:00 373 3.05
## 10 2020-08-01 02:15:00 360 3.03
## # ... with 2,966 more rows
All right, now let's see what these data look like.
# Quick and dirty base R plot
plot(my_usgs$date_time, my_usgs$discharge_cfs, type = "l",
     main = paste0("USGS Site No: ", site))
That looks like real data. Now let’s pretty it up a bit.
# Make some nicer looking names
pretty_names <- c(
"discharge_cfs" = "Discharge (cfs)",
"gauge_height_ft" = "Gauge Height (ft)"
)
p <- my_usgs %>%
  pivot_longer(-date_time) %>%
  filter(!is.na(value)) %>%
  ggplot(aes(date_time, value, col = name)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  facet_wrap(~name, ncol = 1, scales = "free_y",
             labeller = as_labeller(pretty_names)) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = paste0("USGS Site No: ", site),
       subtitle = paste(start_date, "-", end_date),
       y = "",
       x = "")
p
This looks pretty good, but we can take it a step further by making this interactive. Amazingly, we’re just one line away from doing this thanks to the incredible plotly package.
ggplotly(p)
In my next post, I'll show you how to take this script and tweak it to handle multiple sites at once. At our data science club meeting on Friday, we'll take this script and turn it into an interactive Shiny web app.
For now, copy this code into your console and play around with different dates, parameters, and site numbers. You’ll quickly see how powerful a little automation is.
Here are a few suggestions for different sites around Milwaukee:
This last site contains a lot of parameters beyond discharge and gauge height. Here are a few ideas to try out.
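For example, swapping the parameter codes is all it takes to request different data. Two standard USGS codes worth trying (assuming the site actually reports them; check the parameter code list from earlier to confirm):

# 00010 is water temperature (deg C), 00300 is dissolved oxygen (mg/L)
# note the trailing "" again so the last "=on&" gets added
parameters <- c("cb_00010", "cb_00300", "")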
Have fun!