Last time we looked at how to automate data retrieval from USGS. If you haven’t seen that tutorial yet, please check it out here before continuing.
At the end of that post, I hinted that you could tweak the script to include downloads from multiple sites (i.e. two or more stream gauges). Here is an example of how that can work.
I’m going to blaze through some of the initial setup since it was covered in the last post, but here is the basic code again.
library(tidyverse) # for all your data wrangling needs
library(lubridate) # for clean handling of dates and times
library(plotly) # for easy interactive JavaScript plots
# the basic address and filepath to USGS water quality data
base_url <- "https://nwis.waterdata.usgs.gov/usa/nwis/uv/?"
# dates we're interested in
start_date <- "2020-08-01"
end_date <- "2020-08-31"
# these are the site numbers for the Sheboygan, Milwaukee and Menomonee rivers.
# Recognize the 04 part of this address?
sites <- c("04086000", "04087000", "04087120")
We need a function that can paste our different sites together into a character string that matches USGS’s URL format. In this case, that means collapsing a vector of sites and pasting the “&site_no=” command in front of each site.
insert_sites <- function(sites){
  site_call <- rep("&site_no=", length.out = length(sites))
  stringr::str_c(site_call, sites, collapse = "")
}
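Quick spot check (an aside of my own, not part of the original script): calling the function on our sites vector should collapse the three gauges into one query fragment.
insert_sites(sites)
## [1] "&site_no=04086000&site_no=04087000&site_no=04087120"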
Everything else that follows is the same as last time. We already had a function like this for the parameters, since we downloaded two different parameters in part 1.
# We're downloading discharge and gauge height, respectively.
# The trailing empty string makes the collapsed string below end in "=on&",
# which lets us paste "format=rdb" right after it in the final URL.
parameters <- c("cb_00060", "cb_00065", "")
# Function to concatenate multiple parameters into a single string
insert_param <- function(parameters){
  stringr::str_c(parameters, collapse = "=on&")
}
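Just to see what that gives us with our parameter vector (another quick spot check of my own):
insert_param(parameters)
## [1] "cb_00060=on&cb_00065=on&"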
# now I can simply paste this all together into a final url
usgs_full_url <- paste0(base_url,
                        insert_param(parameters),
                        "format=rdb",
                        insert_sites(sites),
                        "&period=&begin_date=", start_date,
                        "&end_date=", end_date)
my_usgs <- read_delim(file = usgs_full_url,
                      delim = "\t", # this says that the file is tab delimited
                      comment = "#", # this says to ignore any line that starts with #
                      col_names = TRUE) # this data set has column names
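One optional tweak of my own (not in the original workflow): since the embedded header rows force every column to be read as text anyway, you can make that explicit with col_types, which skips readr's type guessing entirely. We convert the types properly later, so nothing is lost.
my_usgs <- read_delim(file = usgs_full_url,
                      delim = "\t",
                      comment = "#",
                      col_names = TRUE,
                      col_types = cols(.default = col_character()))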
#Let's take a look at what we got so far
my_usgs
## # A tibble: 8,921 x 8
## agency_cd site_no datetime tz_cd `157440_00060` `157440_00060_c~
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 5s 15s 20d 6s 14n 10s
## 2 USGS 040860~ 2020-08~ CDT <NA> <NA>
## 3 USGS 040860~ 2020-08~ CDT 378 P
## 4 USGS 040860~ 2020-08~ CDT 380 P
## 5 USGS 040860~ 2020-08~ CDT 366 P
## 6 USGS 040860~ 2020-08~ CDT 361 P
## 7 USGS 040860~ 2020-08~ CDT 400 P
## 8 USGS 040860~ 2020-08~ CDT 392 P
## 9 USGS 040860~ 2020-08~ CDT 383 P
## 10 USGS 040860~ 2020-08~ CDT 373 P
## # ... with 8,911 more rows, and 2 more variables: `223508_00065` <chr>,
## # `223508_00065_cd` <chr>
At first glance, this looks exactly the same as what we got last time… However, there is a catch. USGS doesn’t combine the sites and spit out a single unified file. Instead, it returns each site as its own block containing “#”-commented metadata, a fresh set of header rows, and then the data itself. You can check out what this looks like by pasting the URL into your browser or following this link.
Before we can continue, we’ll need to clean out those extra header rows so we can convert our data types into their proper formats.
We can do this by using a regular expression to find the indices of rows that contain letters where we’d expect to find numbers or dates. A great package for this type of string manipulation is stringr, and it has a useful cheatsheet here.
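If you haven’t used str_which() before, here’s a toy illustration (a made-up vector, just to show the idea): it returns the positions of the elements that match a pattern, and "[:alpha:]" matches any letter.
str_which(c("2020-08-01 00:15", "datetime", "20d"), pattern = "[:alpha:]")
## [1] 2 3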
# Find the indices for rows that contain header information by searching for letters (i.e. [:alpha:])
problem_indices <- str_which(my_usgs$datetime, pattern = "[:alpha:]")
# See if we've found the unwanted header data
my_usgs[problem_indices,]
## # A tibble: 5 x 8
## agency_cd site_no datetime tz_cd `157440_00060` `157440_00060_c~
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 5s 15s 20d 6s 14n 10s
## 2 agency_cd site_no datetime tz_cd 157473_00060 157473_00060_cd
## 3 5s 15s 20d 6s 14n 10s
## 4 agency_cd site_no datetime tz_cd 157524_00060 157524_00060_cd
## 5 5s 15s 20d 6s 14n 10s
## # ... with 2 more variables: `223508_00065` <chr>, `223508_00065_cd` <chr>
Ok, this makes sense. We asked for 3 sites, so the header information shows up 3 times; the first set became our column names when the file was read, and the leftovers are the stray rows we just found. Now we can clean this up and move on to plotting.
# get rid of the rows that contain header information
# and drop the meta-data flag columns (just to keep things looking clean)
my_usgs <- my_usgs[-problem_indices,] %>%
  select(site_no,
         date_time = datetime,
         discharge_cfs = ends_with("00060"),
         gauge_height_ft = ends_with("00065"))
# make sure the names are nice and R friendly
# (the select() above already renamed them, so this just makes it explicit)
names(my_usgs) <- c("site_no", "date_time", "discharge_cfs", "gauge_height_ft")
# reformat the data types to be factor, date-time, and numeric
my_usgs <- my_usgs %>%
  mutate(site_no = factor(site_no),
         date_time = ymd_hm(date_time),
         discharge_cfs = as.numeric(discharge_cfs),
         gauge_height_ft = as.numeric(gauge_height_ft))
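An optional sanity check of my own: if every timestamp matched the ymd_hm format, this should return 0; anything higher means some date-times failed to parse and were silently turned into NA.
# count date-times that failed to parse
sum(is.na(my_usgs$date_time))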
# OK, this should look better now. Let's take a peek
my_usgs
## # A tibble: 8,916 x 4
## site_no date_time discharge_cfs gauge_height_ft
## <fct> <dttm> <dbl> <dbl>
## 1 04086000 2020-08-01 00:00:00 NA 3.11
## 2 04086000 2020-08-01 00:15:00 378 3.06
## 3 04086000 2020-08-01 00:30:00 380 3.06
## 4 04086000 2020-08-01 00:45:00 366 3.04
## 5 04086000 2020-08-01 01:00:00 361 3.03
## 6 04086000 2020-08-01 01:15:00 400 3.1
## 7 04086000 2020-08-01 01:30:00 392 3.09
## 8 04086000 2020-08-01 01:45:00 383 3.07
## 9 04086000 2020-08-01 02:00:00 373 3.05
## 10 04086000 2020-08-01 02:15:00 360 3.03
## # ... with 8,906 more rows
Looks good. Let’s prepare for plotting.
# Get data into long format for ggplot2
my_usgs_long <- my_usgs %>%
  pivot_longer(-c(date_time, site_no)) %>%
  filter(!is.na(value))
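If you want to see what the reshape did, peek at the first few rows: each row of the long table is now one measurement, with the old column name stored in "name" and the reading itself in "value", which is exactly the shape the faceting below expects.
head(my_usgs_long)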
# Prep some nicer looking parameter names
pretty_names <- c(
"discharge_cfs" = "Discharge (cfs)",
"gauge_height_ft" = "Gauge Height (ft)"
)
# And some human readable names instead of site numbers
pretty_sites <- c(
"04086000" = "Sheboygan",
"04087000" = "Milwaukee",
"04087120" = "Menomonee"
)
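Those named vectors do double duty: as_labeller() uses them to relabel the facet strips, and in the second plot below we splice pretty_sites into recode() with !!! to swap site numbers for river names. A tiny illustration of that recode trick (my own aside):
recode("04087000", !!!pretty_sites)
## [1] "Milwaukee"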
p <- my_usgs_long %>%
  ggplot(aes(date_time, value, col = name)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  facet_grid(name ~ site_no, scales = "free_y",
             labeller = labeller(name = as_labeller(pretty_names),
                                 site_no = as_labeller(pretty_sites))) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "USGS Discharge and Gauge Height",
       subtitle = paste(start_date, "-", end_date),
       y = "",
       x = "")
p
This looks OK, but it may be easier to make comparisons if we have all the sites on the same graph. Let’s do that and then make the plot interactive again.
p2 <- my_usgs_long %>%
  mutate(nice_sites = recode(site_no, !!!pretty_sites)) %>%
  ggplot(aes(date_time, value, col = nice_sites)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  facet_grid(name ~ ., scales = "free_y",
             labeller = labeller(name = as_labeller(pretty_names))) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(title = "USGS Discharge and Gauge Height",
       subtitle = paste(start_date, "-", end_date),
       y = "",
       x = "",
       col = "")
ggplotly(p2)
Awesome, we did it. We managed to automate retrieval from not one but three USGS sites and plot them all in a single interactive graph.