What is dataRetrieval?
dataRetrieval is an R package that is used to get data from USGS and EPA data sources into R.
This tutorial will use streamflow data acquired through dataRetrieval to demonstrate how hydrograph data can be analyzed in R. We will also cover how to calculate useful features in R and graph information using the ggplot package.
dataRetrieval is available on the CRAN repository, which is a clearinghouse for different kinds of R packages. In order to install it, enter the following into your R terminal:
install.packages("dataRetrieval")
Whenever you begin a new R session, you must attach the package in order to use the tools in dataRetrieval.
library("dataRetrieval")
Additionally, you must have an internet connection in order for the dataRetrieval functions to successfully retrieve the data.
The NWIS is the National Water Information System. It is a system put together by the USGS for the capture and retrieval of water quality data such as: * groundwater data * surface water quality * rating curves * peak flow * daily aggregated statistics * current real-time data
More information about the NWIS and a complete list of all available can be found at the following website: https://waterdata.usgs.gov/nwis
For this first example, we will pull information from the Milwaukee River streamgage located at Estabrook Park in Milwaukee, WI. The ID for this site is 04087000. This ID was found at the USGS’s website.
MKE_SiteID <- '04087000'
The readNWISdv function is used to retrieve daily values data. For this particular site, the only data available is gage height and discharge. The codes for gage height and discharge are “00060” and “00065” respectively.
The parameterCd parameter determines which statistic will be returned. “00060” is the code for the daily mean discharge, and it is the default for the readNWISdv function.
readNWISdv(MKE_SiteID, parameterCd = '00060', startDate = "2017-06-01", endDate = "2017-06-06")
In the data below, the first three columns display the agency providing the data, the site code, and the date of the record respectively. Following that is the value of the requested parameter and a column for qualifying codes. The most common codes are “P” for provisional data and “A” for approved data. A full list of the possible qualifying codes can be found here.
## agency_cd site_no Date X_00060_00003 X_00060_00003_cd
## 1 USGS 04087000 2017-06-01 983 A
## 2 USGS 04087000 2017-06-02 838 A
## 3 USGS 04087000 2017-06-03 796 A
## 4 USGS 04087000 2017-06-04 729 A
## 5 USGS 04087000 2017-06-05 694 A
## 6 USGS 04087000 2017-06-06 630 A
The readNWISuv data is used to retrieve instantaneous value data. It accepts the same parameters as readNWISdv.
readNWISuv(MKE_SiteID, parameterCd = '00060' , "2017-06-01","2017-06-06")
## agency_cd site_no dateTime X_00060_00000 X_00060_00000_cd
## 1 USGS 04087000 2017-06-01 05:00:00 1070 A
## 2 USGS 04087000 2017-06-01 05:15:00 1070 A
## 3 USGS 04087000 2017-06-01 05:30:00 1070 A
## 4 USGS 04087000 2017-06-01 05:45:00 1080 A
## 5 USGS 04087000 2017-06-01 06:00:00 1050 A
## 6 USGS 04087000 2017-06-01 06:15:00 1050 A
## tz_cd
## 1 UTC
## 2 UTC
## 3 UTC
## 4 UTC
## 5 UTC
## 6 UTC
You may notice that the returned dataset has column names that are based on the parameter and statistic codes. dataRetrieval has a built-in convenience function to clean up the column names called renameNWISColumns into something more human-readable.
MKEdata <- readNWISuv(MKE_SiteID, parameterCd = '00060', "2017-06-01", "2017-06-06")
MKEdata <- renameNWISColumns(MKEdata)
head(MKEdata)
## agency_cd site_no dateTime Flow_Inst Flow_Inst_cd tz_cd
## 1 USGS 04087000 2017-06-01 05:00:00 1070 A UTC
## 2 USGS 04087000 2017-06-01 05:15:00 1070 A UTC
## 3 USGS 04087000 2017-06-01 05:30:00 1070 A UTC
## 4 USGS 04087000 2017-06-01 05:45:00 1080 A UTC
## 5 USGS 04087000 2017-06-01 06:00:00 1050 A UTC
## 6 USGS 04087000 2017-06-01 06:15:00 1050 A UTC
Hydrologic data such as discharge, gage height, and other variables are time-series data. Time-series data is multiple measurements of the same variable over time. When examining time-series data, it is important to note three things:
When working with Time-series data, a common way of visualizing the change of the measured variable is by plotting time on the X axis and the measurement on the Y axis. In this tutorial, we will use the ggplot2 package to visualize time-series data.
library(ggplot2)
MKEdata <- readNWISuv(MKE_SiteID, parameterCd = '00060', "2017-06-02", "2017-06-05")
MKEdata <- renameNWISColumns(MKEdata)
ggplot(MKEdata, aes(x=dateTime, y=Flow_Inst)) + geom_line()
This data is a good example of the basic traits of a discharge hydrograph following a rain event.
Hydrograph traits will vary depending on factors within the river’s drainage basin. For example, the Milwaukee River’s drainage basin area upstream of the Estabrook Park gage is very large.
The drainage basin data was gathered from the USGS Streamstats Tool, and imported using the readOGR command within the rgdal package.
library(leaflet)
library(rgdal)
MKEmap <- leaflet()
MKEmap <- setView(MKEmap, lng=-88.0801, lat = 43.4170, zoom= 8.5)
MKEmap <- addTiles(MKEmap)
MKEmap <- addPolygons(MKEmap, data=basin)
MKEmap <- addMarkers(MKEmap, lng = -87.909235, lat = 43.100139, popup="Milwaukee River at Estabrook Gage")
MKEmap
Some of the factors inside the drainage basin that affect hydrograph response are Basin Area, Land Use, Soil/Rock Type, soil Moisture, and Land Slope.
Outside factors that can influence the hydrograph response include temperature, type of precipitation, precipitation timing, and precipitation intensity. Like the drainage basin polygon data, this precipitation information is generally not available through the dataRetrieval functions and will have to come from another source.
As an example of how drainage basin area and land use influences streamflow, let’s look at data from streamgages in the three Milwaukee-area rivers: the Milwaukee, Menomonee and Kinnickinnic.
First, let’s set up the streamgage information and import the shapefiles for the drainage basins.
rivercodes <- c("04087000","04087120","04087159")
MKEbasin <- readOGR("shpfls\\mkebasin\\mkeatEstabrook\\globalwatershed.shp", layer = "globalwatershed", GDAL1_integer64_policy = TRUE)
MNEbasin <- readOGR("shpfls\\mkebasin\\menoatTosa\\globalwatershed.shp", layer = "globalwatershed", GDAL1_integer64_policy = TRUE)
KKbasin <- readOGR("shpfls\\mkebasin\\kkat11th\\globalwatershed.shp", layer = "globalwatershed", GDAL1_integer64_policy = TRUE)
Then let’s plot the drainage basins on a map to see how they compare:
rivsmap <- leaflet()
rivsmap <- setView(rivsmap, lng=-88.033143, lat = 43.091546, zoom=10)
rivsmap <- addTiles(rivsmap)
rivsmap <- addPolygons(rivsmap, data=MKEbasin)
rivsmap <- addPolygons(rivsmap, data=MNEbasin, color = "#914285")
rivsmap <- addPolygons(rivsmap, data=KKbasin, color = "#160042")
rivsmap <- addMarkers(rivsmap, lng = -87.909, lat = 43.09998, popup="Milwaukee River at Estabrook Park")
rivsmap <- addMarkers(rivsmap, lng = -87.9998, lat = 43.04544, popup="Menomonee at Wauwatosa")
rivsmap <- addMarkers(rivsmap, lng = -87.92643, lat = 42.99733, popup="Kinnickinnic at 11th St")
rivsmap
It is clear from the map that the MIlwaukee drainage basin is the largest, followed by the Menomonee and Kinnickinnic drainage basins.
Next, let’s pull the data for each of the streamgages and see how their hydrographs look for a single storm:
First, we’ll pull the data using the dataretrieval tool.
rivsdata <- readNWISuv(c(rivercodes), parameterCd = '00060' ,"2017-06-02","2017-06-05")
rivsdata <- renameNWISColumns(rivsdata)
Next, we’ll use a for-loop to use the site codes to add site names in a new column. In order for the if-statements to work, the site_no column has to be changed to integer-type values.
rivsdata$site_no <- as.integer(rivsdata$site_no)
for(i in 1:nrow(rivsdata)){
if (rivsdata$site_no[i] == 04087000){
rivsdata$site_name[i] <- "Milwaukee"
rivsdata$drgarea[i] <- 1306.80
} else if (rivsdata$site_no[i] == 04087120){
rivsdata$site_name[i] <- "Menomonee"
rivsdata$drgarea[i] <- 231.09
} else{
rivsdata$site_name[i] <-"Kinnickinnic"
rivsdata$drgarea[i] <- 34.70
}}
Next, we’ll graph the data. When each river is on the same graph.
Instead of putting the data on the same graph, we’ll stack them using the
facet_grid function. We will also use the scales option to allow the graphs to have an inconsistent scale. Instead of comparing the exact discharge values, this graphic is better for comparing the shapes of the hydrographs.