Loading and Plotting USGS Flow Data

Jeffrey D. Walker 2013-01-08

This tutorial explains how to load and plot a daily flow time series. The data were downloaded from the USGS NWIS for Station 01102500, which is located on the Aberjona River in Massachusetts. The daily flow data were downloaded to a tab-delimited text file for the years 2000-2011 (click here to download the original file).

The text file was first loaded into Excel, and the columns for the Date, Flow, and Flag were saved to a csv file in a folder called data (data/dv_01102500.csv).

Load CSV File

CSV files are loaded in R using the read.csv() function.

q <- read.csv("./data/dv_01102500.csv")
str(q)
## 'data.frame':    4383 obs. of  3 variables:
##  $ Date: Factor w/ 4383 levels "1/1/2000","1/1/2001",..: 1 133 265 301 313 325 337 349 361 13 ...
##  $ Flow: num  15 15 15 24 74 45 30 19 18 39 ...
##  $ Flag: Factor w/ 4 levels "A","A:e","P",..: 1 1 1 1 1 1 1 1 1 1 ...

The str() function reports some useful information about the dataframe q.

First, we notice that it converted both the Date and Flag columns to factor variables. By default, the read.csv() function assumes all columns that are not numeric (contain some characters that are not digits) are factors. For the flag column, this makes sense since it is a categorical variable, but it would be better to have the Date column be converted to one of R's native datetime objects. Unfortunately, dates can be confusing in R, there are different datetime types such as Date, POSIXct, and POSIXlt (see this for more details).

We can prevent read.csv() from converting the Date and Flag columns to factors by setting the argument as.is=TRUE.

q <- read.csv("./data/dv_01102500.csv", as.is = TRUE)
str(q)
## 'data.frame':    4383 obs. of  3 variables:
##  $ Date: chr  "1/1/2000" "1/2/2000" "1/3/2000" "1/4/2000" ...
##  $ Flow: num  15 15 15 24 74 45 30 19 18 39 ...
##  $ Flag: chr  "A" "A" "A" "A" ...

Now we see that Date and Flag are both character vectors.

There are many ways to convert a character vector of dates to one of the datetime types, but the easiest way is to use the lubridate package which provides a set of functions depending on the format of the datetime string. In this case, dates are stored in the format MM/DD/YYYY and thus we use the mdy() function (there are a number of other date parsing functions including dmy(), mdy_hms(), etc.).

library(lubridate)
q$Date <- mdy(q$Date)
## 4383 parsed with %m/%d/%Y
str(q)
## 'data.frame':    4383 obs. of  3 variables:
##  $ Date: POSIXct, format: "2000-01-01" "2000-01-02" ...
##  $ Flow: num  15 15 15 24 74 45 30 19 18 39 ...
##  $ Flag: chr  "A" "A" "A" "A" ...

We now see that the Date column is of type POSIXct, which represents dates as the number of seconds since the beginning of 1970 (in UTC) as a numeric vector. Finally, we can convert the Flag column to a factor using the factor() function.

q$Flag <- factor(q$Flag)
str(q)
## 'data.frame':    4383 obs. of  3 variables:
##  $ Date: POSIXct, format: "2000-01-01" "2000-01-02" ...
##  $ Flow: num  15 15 15 24 74 45 30 19 18 39 ...
##  $ Flag: Factor w/ 4 levels "A","A:e","P",..: 1 1 1 1 1 1 1 1 1 1 ...

Using the table() function, we see there are four types of flags in the dataset with most rows having flag A.

table(q$Flag)
## 
##    A  A:e    P  P:e 
## 3895   31  454    3

The lubridate package also provides a series of functions for extracting parts of a date value, such as the month(), year() and yday(), the latter of which returns the julian day (number of days since Jan 1 of the given year).

q$Year <- year(q$Date)
q$Month <- month(q$Date)
q$JDay <- yday(q$Date)
head(q)
##         Date Flow Flag Year Month JDay
## 1 2000-01-01   15    A 2000     1    1
## 2 2000-01-02   15    A 2000     1    2
## 3 2000-01-03   15    A 2000     1    3
## 4 2000-01-04   24    A 2000     1    4
## 5 2000-01-05   74    A 2000     1    5
## 6 2000-01-06   45    A 2000     1    6

Plotting

The flow dataset can now be plotted in a variety of ways using the ggplot2 package.

Timeseries

First, we simple plot the time series as a line chart using the geom_line() geometry.

library(ggplot2)
ggplot(q, aes(Date, Flow)) + geom_line() + labs(x = "Date", y = "Flow (cfs)")

plot of chunk plot.ts

Because flow data are generally log-normally distributed, it sometimes helps to plot flow time series on a semi-log plot. To transform the y-axis to a log scale, we use the scale_y_log10() function.

ggplot(q, aes(Date, Flow)) + geom_line() + scale_y_log10() + labs(x = "Date", 
    y = "Flow (cfs)")

plot of chunk plot.ts.log

We can also add minor tick marks by setting the breaks argument in scale_y_log10() to the output of the trans_breaks() function, which requires the scales package, and adding the annotation_logticks() function with argument sides=l to specify that log-scale tick marks should only be put on the left axis.

library(scales)
ggplot(q, aes(Date, Flow)) + geom_line() + labs(x = "Date", y = "Flow (cfs)") + 
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x)) + annotation_logticks(sides = "l")

plot of chunk plot.ts.log2

Monthly Boxplots

To see the distributions of flows by month, we use the geom_boxplot() geometry.

ggplot(q, aes(Month, Flow)) + geom_boxplot() + labs(x = "Month", y = "Flow (cfs)")

plot of chunk plot.box.month

Hmm, this doesn't look right… Because the geom_boxplot() geometry expects the x-values (indicated by the first argument to the aes() function) to be a factor, or categorical, variable. So we simply convert the Month column to a factor using the factor() function.

ggplot(q, aes(factor(Month), Flow)) + geom_boxplot() + labs(x = "Month", y = "Flow (cfs)")

plot of chunk plot.box.month2

We can again transform the y-axis to a log scale.

ggplot(q, aes(factor(Month), Flow)) + geom_boxplot() + labs(x = "Month", y = "Flow (cfs)") + 
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x)) + annotation_logticks(sides = "l")

plot of chunk plot.box.month.log

And we see that the flows are generally higher in the spring due to snow melt, and lower in the late summer due to more evapotranspiration.

Seasonal Patterns

Another way to see the seasonal patterns in the data is to plot the data points against julian day.

ggplot(q, aes(JDay, Flow)) + geom_point() + labs(x = "Julian Day", y = "Flow (cfs)") + 
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x)) + annotation_logticks(sides = "l")

plot of chunk plot.jday

This plot thus shows the general seasonal pattern in the flow data. We can add a LOESS smooth to the data using the geom_smooth() function.

ggplot(q, aes(JDay, Flow)) + geom_point() + geom_smooth() + labs(x = "Julian Day", 
    y = "Flow (cfs)") + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x)) + 
    annotation_logticks(sides = "l")

plot of chunk plot.jday.smooth

Facet Plots

We can also using the facet_wrap() function to create small multiple plots. For example, we could create individual time series plots for each year with the julian day on the x-axis.

library(scales)
ggplot(q, aes(JDay, Flow)) + geom_line() + labs(x = "Julian Day", y = "Flow (cfs)") + 
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x)) + annotation_logticks(sides = "l") + 
    facet_wrap(~Year)

plot of chunk plot.ts.facet.year

Combining some of these methods, we can create boxplots of each Year faceted by Month. This type of plot is useful for seeing if there are any seasonal trends in the data. For example, are flows in the spring increasing or decreasing over time? This type of hypothesis can be tested using the seasonal Kendall test. Note that we need to rotate the x-axis labels to avoid overlap, so we use theme() to set the angle of the axis.text.x property.

library(scales)
ggplot(q, aes(factor(Year), Flow)) + geom_boxplot() + labs(x = "Year", y = "Flow (cfs)") + 
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x)) + annotation_logticks(sides = "l") + 
    facet_wrap(~Month) + theme(axis.text.x = element_text(angle = -90, hjust = 0, 
    vjust = 0))

plot of chunk plot.ts.facet.month

Now try making that chart in Excel!