Group Information

The Unnamed Flood Group: Kristine Vorwerk and Brough Morris

Data and Subgroup Description

Data source:

Current Water Data for the Nation. USGS: 2016. [online] (accessed 12/1/16, https://waterdata.usgs.gov/nwis/rt)

Summary of the dataset:

We chose a subset of the USGS streamflow data, which records streamflow and other characteristics of water bodies across the nation. The subset was a record of data for four sites in the Westfield River Watershed near the town of Westfield, Mass. The data consisted of 3 columns and 106,350 rows. The columns corresponded to the date on which the observation was made, the streamflow in cubic feet per second, and the location at which the observation was made. To manipulate the data, we used the lubridate package to create three new variables corresponding to the year, month, and day of each observation and the dplyr package to select specific time intervals of interest.

Of the sites we chose, three were on tributaries and one was downstream of their confluence. Two of the tributaries have dams, which were built for flood control during the period covered by our data. The Army Corps of Engineers has estimated that the two dams have collectively prevented approximately 500 million dollars in flood damages. We wished to further explore the impact of these dams and examine the extent to which they have controlled floods.

Pre-Graphing Data Wrangling

This section includes libraries, initial data wrangling, and the definition of our color palette.

#Get the libraries
library(lubridate)
library(ggplot2)
library(dplyr)
library(stringr)
library(maps)
library(mapdata)
library(ggmap)
library(plotly)
library(ggthemes)
library(scales)

#Import the data and clean it
data<-read.csv("C:\\Users\\kvorw_000\\Desktop\\water_data2.csv")
data <- data %>%
  na.omit()
data$date<-as.character(data$date)
data$date<-mdy(data$date)
data<-data%>%
  mutate(month=month(date), day=day(date), year=year(date))%>%
  mutate(mean_discharge=as.numeric(as.character(mean_discharge)))

#make a color palette
#colors in the order downstream1, downstream2, west1, west2, middle1, middle2, east1, east2, downstream3, east3, combined_data1
colors = c("darkgreen", "lightgreen","red", "pink", "deeppink4", "gray", "darkblue", "blue", "olivedrab", "lightblue", "turquoise2")

1. Westfield River Watershed near Westfield, Mass

#make data frames for all four sites and each individual site
sites.df <- data.frame(name = c("Downstream", "West Branch", "Middle Branch", "East Branch"),
lon = c(-72.6994444, -72.8961111, -72.8730555, -72.8647222),
lat = c(42.1066666, 42.2372222, 42.2586111, 42.2877777))
site1.df <- data.frame(name = c("Downstream"),
lon = c(-72.6994444),
lat = c(42.1066666))
site2.df <- data.frame(name = c("West Branch"),
lon = c(-72.8961111),
lat = c(42.2372222))
site3.df <- data.frame(name = c("Middle Branch"),
lon = c(-72.8730555),
lat = c(42.2580))
site4.df <- data.frame(name = c("East Branch"),
lon = c(-72.8647222),
lat = c(42.28400))
dam1.df <- data.frame(name = c("Littleville Dam"),
lon = c(-72.8815),
lat = c(42.2645))
dam2.df <- data.frame(name = c("Knightville Dam"),
lon = c(-72.8612),
lat = c(42.2901))

#retrieve the map
ll_means <- colMeans(sites.df[2:3])
sq_map2 <- get_map(location = ll_means,
maptype = "terrain", source = "google", zoom = 11)

#plot the points on the map
ggmap(sq_map2) +
  #point for the downstream branch
  geom_point(data = site1.df, color=colors[1], size = 4) + 
  geom_text(data = site1.df, aes(label = paste(" ", as.character(name), sep="")),
  angle = 0, hjust = 0, color = colors[1]) +
  #point for the west branch
  geom_point(data = site2.df, color=colors[3], size = 4) + 
  geom_text(data = site2.df, aes(label = paste(" ", as.character(name), sep="")),
  angle = 0, hjust = 0, color = colors[3]) +
  #point for the middle branch
  geom_point(data = site3.df, color=colors[5], size = 4) + 
  geom_text(data = site3.df, aes(label = paste(" ", as.character(name), sep="")),
  angle = 0, hjust = 0, color = colors[5]) +
  #point for the east branch
  geom_point(data = site4.df, color=colors[7], size = 4) + 
  geom_text(data = site4.df, aes(label = paste(" ", as.character(name), sep="")),
  angle = 0, hjust = 0, color = colors[7]) +
  #point for littleville dam
  geom_point(data = dam1.df, color=colors[5], size = 2) + 
  geom_text(data = dam1.df, aes(label = paste(" ", as.character(name), sep="")),
  angle = 0, hjust = 0, color = colors[5]) +
  #point for knightville dam
  geom_point(data = dam2.df, color=colors[7], size = 2) + 
  geom_text(data = dam2.df, aes(label = paste(" ", as.character(name), sep="")),
  angle = 0, hjust = 0, color = colors[7])+
  labs(title="Westfield River Watershed near Westfield, Mass", x="", y="")

This graphic shows the four sites we chose. It is important to note that the three tributary sites are all in the mountains, on rivers that run through tight valleys. These valleys funnel flood waters downstream to the wide, flat floodplain. This floodplain is occupied by the city of Westfield, Mass., which experienced severe flooding prior to the construction of the dams.

To create this graphic, the maps, mapdata, and ggmap packages were used. We wanted the points for the different sites to be different colors, which we kept consistent throughout all of our graphs. Doing this required creating a color palette and using several different data frames: one for each point on the map, and one overall data frame whose mean coordinates told R which map to retrieve. We chose a terrain map because it most clearly showed the valleys and floodplain.
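Indexing the palette by position (colors[7], colors[5], and so on) works but is easy to get wrong. One possible alternative, sketched here rather than what we did, is a named vector keyed by site. The name site_colors is hypothetical; the color values are the ones from our palette.

```r
# Named version of part of our palette (site_colors is a hypothetical name)
site_colors <- c(Downstream = "darkgreen", West = "red",
                 Middle = "deeppink4", East = "darkblue")

# Look up colors by site name instead of by position
site_colors[c("East", "Middle")]

# A named vector can be passed as scale_color_manual(values = site_colors);
# ggplot2 then matches colors to the levels of the mapped variable by name
```

With names attached, the order of the vector no longer has to match the order of the factor levels in each plot.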

2. Maximum yearly streamflow

#make a new data set which has the maximum yearly flow at each of the four sites
data1<-data%>%
  filter(location=="east")%>%
  group_by(year)%>%
summarize(location="East", max_flow=max(mean_discharge)/1000)

data2<-data%>%
  filter(location=="west")%>%
  group_by(year)%>%
summarize(location="West", max_flow=max(mean_discharge)/1000)

data3<-data%>%
  filter(location=="downstream")%>%
  group_by(year)%>%
summarize(location="Downstream", max_flow=max(mean_discharge)/1000)

data4<-data%>%
  filter(location=="middle")%>%
  group_by(year)%>%
summarize(location="Middle", max_flow=max(mean_discharge)/1000)

data_all<-rbind(data3,data2,data4,data1)

#get the colors necessary in order
colors1<-c(colors[1], colors[7], colors[5], colors[3])

#plot the maximum flow, facetted by location 
g<-ggplot(data_all, aes(year, max_flow, color=location)) +
  geom_point() +
  geom_smooth(method="loess") +
  geom_vline(xintercept=1941, color=colors[7], size=1.2) +
  geom_vline(xintercept=1965, color=colors[5], size=1.2) +
  facet_wrap(~location) +
  scale_color_manual(values=colors1) + 
  labs(
    title="Maximum yearly streamflow",
    subtitle="Thousands of cubic feet per second, vertical lines represent dam constructions",
    caption="Data from USGS",
    x="",
    y=""
  )+
  theme_few()+
  theme(legend.position="none")
g

This graph shows the maximum yearly flow rate for each river. The time of construction of each dam is indicated by a vertical line whose color matches the river it was built on. There was no statistically significant change in the trend line for maximum flow rate when either dam was constructed; however, the extreme outliers disappear after dam construction. The east and middle tributaries stop having these outliers after their respective dams are built, and the downstream river stops having them after both dams are built. The immediate dip in the trend line after dam construction is likely due to the loss of these high outliers.

This graph used dplyr to transform the original data set into a new data set containing the maximum flow rate recorded during each year for each location. Faceting was used to separate the graphs by location, and each was fit with a smoothed trend line. The colors from the first graphic were carried over by setting the color aesthetic to correspond to location.
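The four per-site blocks in the code above follow the same pattern, so the yearly maxima could plausibly also be computed in one pass by grouping on both location and year. The sketch below assumes a small made-up data frame (toy and its values are hypothetical, not the USGS data) with the same column names as our data.

```r
library(dplyr)

# Hypothetical toy data with the same columns as our cleaned data set
toy <- data.frame(
  location = c("east", "east", "east", "west", "west", "west"),
  year = c(1940, 1940, 1941, 1940, 1941, 1941),
  mean_discharge = c(1200, 3400, 800, 500, 2500, 900)
)

# One grouped summary replaces a separate filter/summarize block per site
toy_max <- toy %>%
  group_by(location, year) %>%
  summarize(max_flow = max(mean_discharge) / 1000, .groups = "drop")

toy_max
```

The result has one row per site and year, which is the shape facet_wrap(~location) expects.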

3. Flood frequency before and after dam construction (East Site) and Flood frequency before, between, and after construction of the two dams (Downstream Site)

#create data sets for each location 
downstream_data <- data %>%
  filter(location == "downstream")
east_data <- data %>%
  filter(location == "east")
middle_data <- data %>%
  filter(location == "middle")
west_data <- data %>%
  filter(location == "west")

#Count the number of days of data for the East set before and after the dam was built.
east_data_counts <- east_data %>%
  mutate(dam = ifelse(year >= 1941, "After", "Before")) %>%
  group_by(dam) %>%
  summarize(days = n())

#Make the barplot of flood proportion for the East set.

east_data_sd <- east_data %>%
  mutate(sd = sd(mean_discharge)) %>%
  mutate(total_mean = mean(mean_discharge)) %>%
  mutate(flood = as.numeric(mean_discharge >= total_mean + 2*sd))
east_fpy <- east_data_sd %>%
  group_by(year) %>%
  summarize(fpy = mean(flood))

east_dam <- east_fpy %>%
  mutate(dam = ifelse(year >= 1941, "After", "Before")) %>%
  group_by(dam) %>%
  summarize(mean_fpy = mean(fpy)) %>%
  mutate(sample = ifelse(dam == "After", "25902", "11451")) %>%
  mutate(sample = as.numeric(sample)) %>%
  mutate(error = 1/sqrt(sample))

limits <- aes(ymin = mean_fpy-error, ymax = mean_fpy+error)

colors2<-c(colors[7], colors[8])

ggplot(east_dam, aes(x = reorder(dam, mean_fpy), mean_fpy, fill=dam)) +
  geom_bar(stat = "identity")+
  geom_errorbar(limits, width = 0.25, size=1)+
  scale_fill_manual(values=colors2)+
  labs(x = "",
       y = "",
       title = "Flood frequency before and after dam construction\n(East Site)",
       subtitle="Proportion of days where the streamflow was more than two standard deviations above average",
      caption="Data from USGS"
      )+
  theme_few()+
  theme(legend.position="none") +
  scale_y_continuous(labels = scales::percent)

#Counting the number of days of data for the Downstream set before, between, and after each dam was constructed.

downstream_data_counts <- downstream_data %>%
  mutate(dam = ifelse(year >= 1941, ifelse(year >= 1965, "After", "Between"), "Before")) %>%
  group_by(dam) %>%
  summarize(days = n())

#Making the bar plot of flood proportion for the Downstream set.

downstream_data_sd <- downstream_data %>%
  mutate(sd = sd(mean_discharge)) %>%
  mutate(total_mean = mean(mean_discharge)) %>%
  mutate(flood = as.numeric(mean_discharge >= total_mean + 2*sd))
downstream_fpy <- downstream_data_sd %>%
  group_by(year) %>%
  summarize(fpy = mean(flood))

downstream_dam <- downstream_fpy %>%
  mutate(dam = ifelse(year >= 1941, ifelse(year >= 1965, "After", "Between"), "Before")) %>%
  group_by(dam) %>%
  summarize(mean_fpy = mean(fpy)) %>%
  mutate(sample = ifelse(dam == "After", "18962", ifelse(dam == "Between", "8766", "9685"))) %>%
  mutate(sample = as.numeric(sample)) %>%
  mutate(error = 1/sqrt(sample))

limits <- aes(ymin = mean_fpy-error, ymax = mean_fpy+error)
colors3<-c(colors[1], colors[2], colors[9])

ggplot(downstream_dam, aes(x = reorder(dam, mean_fpy), mean_fpy, fill=dam)) +
  geom_bar(stat = "identity")+
  geom_errorbar(limits, width = 0.25, size=1)+
  scale_fill_manual(values=colors3)+
  labs(x = "",
       y = "",
       title = "Flood frequency before, between, and after construction of the two dams\n(Downstream Site)",
       subtitle="Proportion of days where the streamflow was more than two standard deviations above average",
      caption="Data from USGS"
      )+
  theme_few()+
  theme(legend.position="none") +
  scale_y_continuous(labels = scales::percent)

These two graphs show the proportion of days on which streamflow was more than two standard deviations above average, split by period relative to dam construction. The first graph is for the east river and the second is for the downstream river. In both graphs the proportion increases after each dam is built, though with our current sample size none of these increases is statistically significant. This pattern could be explained by the dams taking individual massive flood events and spreading them out, changing a single day of massive flooding into several days of higher-than-average streamflow.

We plotted these graphs using the geom_bar and geom_errorbar functions. To get the proportion we made a binary variable, “flood”, that was 1 if the flow rate was at or above our two standard deviation threshold and 0 if it was not. We then took the mean of “flood” within each year and averaged those yearly proportions within each time period (before, between, or after the dams). We also counted the number of daily observations n in each period and drew error bars of half-width 1/sqrt(n).
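The error bars in the code above have half-width 1/sqrt(n). For a proportion p estimated from n independent observations, the standard error is sqrt(p(1-p)/n), which is at most 0.5/sqrt(n), so 1/sqrt(n) is roughly a conservative 95% margin of error. The sketch below compares the two; the proportion p = 0.03 is made up for illustration, and treating daily flows as independent is an assumption that likely does not hold for streamflow.

```r
# Sample sizes taken from the East site code above
n <- c(Before = 11451, After = 25902)

# Hypothetical flood proportion, for illustration only
p <- 0.03

conservative <- 1 / sqrt(n)                 # half-width used in our plots
wald_95 <- 1.96 * sqrt(p * (1 - p) / n)     # normal-approximation 95% margin

round(rbind(conservative, wald_95), 4)
```

For small proportions like this one, the conservative bars are noticeably wider than the normal-approximation margin, so they may understate how well the proportions are resolved.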

4. Daily streamflow of the East Branch before and after dam construction

#make a new variable that stores the day and month of each data point
data_bydate<-data%>%
  mutate(monthday=make_datetime(year=2000, month=month, day=day))%>%
  mutate(monthday=as.Date(monthday))%>%
  mutate(year=as.numeric(year))

#get data for the average flow on each day for a decade before the dam was built
data_before<-data_bydate%>%
  filter(location=="east")%>%
  filter(1915<=year&year<=1925)%>%
  na.omit()%>%
  group_by(monthday)%>%
  summarize(daily_average=mean(mean_discharge), Period="Before")

#get data for the average flow on each day for a decade after the dam was built
data_after<-data_bydate%>%
  filter(location=="east")%>%
  filter(1945<=year&year<=1955)%>%
  na.omit()%>%
  group_by(monthday)%>%
  summarize(daily_average=mean(mean_discharge), Period="After")

data_beforeandafter<-rbind(data_before, data_after)

colors4<-c(colors[7], colors[8])

ggplot(data_beforeandafter, aes(x=monthday, y=daily_average, color=Period, fill=Period))+
  geom_line()+
  scale_color_manual(values=colors4) + 
  scale_fill_manual(values=colors4) +
  geom_smooth(method="loess", span=.5)+
  scale_x_date(labels = date_format("%b"))+
  labs(
    title="Daily streamflow of the East Branch before and after dam construction",
    subtitle="Cubic feet per second",
    caption="Data from USGS",
    x="",
    y="",
    color="", 
    fill=""
    )  +
  theme_few()

This graph shows the daily streamflow in the east river averaged over two periods of about a decade each: one before the dam was constructed (1915 to 1925) and one after (1945 to 1955). There is a spike in the spring months that corresponds to snowmelt, followed by a dip in the summer and a smaller spike in the fall, when trees lose their leaves and begin to draw less water. There is no statistically significant difference between the two trend lines, indicating that the dam did not have a significant impact on the annual cycle of streamflow. The steep dip in the trend line at the left edge of the graph and the steep increase at the right edge are likely artifacts of cutting a cyclical trend at the year boundary. The levels are approximately the same at the two edges, so if the trend line took into account that the annual data loops around, those steep segments would be flattened out.

In order to group by day of the year, we first used lubridate to make a new variable. This variable stored the day and month as part of a datetime object in which the year was arbitrarily assigned to be 2000 (since year was a required variable). Two data sets corresponding to a decade before and after the dam was constructed were then made and the average flow rate for each day of the year was found. These datasets were recombined and graphed using a line plot, with one line corresponding to each time period. Smooth fit lines were included for each to capture the trend in a clearer fashion, and the sensitivity was adjusted so the lines mirrored the trend closely.
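One way to let a smoother respect the annual cycle, and so reduce the edge artifacts in the trend lines discussed above, is to pad the series with a copy of itself shifted one year earlier and one year later, fit on the padded data, and keep only the middle year. The sketch below does this with base R's loess on a synthetic seasonal curve; the data and the names doy, flow, and padded are made up for illustration and this is not the fit used in our graph.

```r
set.seed(1)

# Synthetic seasonal series: one value per day of a 366-day year
doy <- 1:366
flow <- 500 + 300 * sin(2 * pi * doy / 366) + rnorm(366, sd = 50)

# Pad with copies shifted one year back and one year forward
padded <- data.frame(
  doy = c(doy - 366, doy, doy + 366),
  flow = rep(flow, 3)
)

# Plain fit ignores the cycle; padded fit sees neighbours across the year boundary.
# The span is divided by 3 so both fits use a similar window in days.
plain_fit <- loess(flow ~ doy, data = data.frame(doy, flow), span = 0.5)
cyclic_fit <- loess(flow ~ doy, data = padded, span = 0.5 / 3)

plain <- predict(plain_fit, newdata = data.frame(doy = doy))
cyclic <- predict(cyclic_fit, newdata = data.frame(doy = doy))

# Gap between the first and last day of the year under each fit
gap_plain <- abs(plain[[1]] - plain[[366]])
gap_cyclic <- abs(cyclic[[1]] - cyclic[[366]])
c(gap_plain, gap_cyclic)
```

Because the padded fit treats December 31 and January 1 as neighbours, its two ends should meet closely, whereas the plain fit estimates each end from one side only.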

5. Yearly streamflow comparing the sum of all three tributaries to the downstream site

#combine all the data into one big set, and drop all the days that did not have measurements for the middle river
east_data1 <- east_data %>%
  mutate(merge_date = as.character(date))
middle_data1 <- middle_data %>%
  filter(!date %in% as.Date(c("1915-12-26", "1927-03-14", "1942-03-22", "1948-03-22",
                              "1950-04-04", "1952-12-11", "1955-10-15", "1957-01-23",
                              "1959-04-03", "1962-12-06", "1973-02-05", "1988-11-21")))%>%
  mutate(merge_date = as.character(date))
west_data1 <- west_data %>%
  mutate(merge_date = as.character(date))
downstream_data1 <- downstream_data %>%
  mutate(merge_date = as.character(date))
merged_data <- merge(east_data1, middle_data1, by = "merge_date")
merged_data1 <- merge(merged_data, west_data1, by = "merge_date")
merged_data2 <- merge(merged_data1, downstream_data1, by = "merge_date")

#Rename the mess of a merge that got created.
#the six per-site columns repeat with the prefixes e., m., w., and d.
site_cols <- c("location", "date", "mean_discharge", "month", "day", "year")
merge_names <- c("merge_date", paste0(rep(c("e.", "m.", "w.", "d."), each = 6), site_cols))
merged_data3 <- merged_data2
names(merged_data3) <- merge_names

#Add the columns comparing the cumulative flow rates and the downstream flow rate.

merged_data4 <- merged_data3 %>%
  mutate(sum_discharge = e.mean_discharge+w.mean_discharge+m.mean_discharge) %>%
  mutate(difference_discharge = d.mean_discharge-sum_discharge)

#combine the relevant data into one dataframe for graphing
downstream<-merged_data4%>%
  mutate(Flow="Downstream", discharge=d.mean_discharge)%>%
  select(e.year, discharge,Flow)
#named tributary_sum rather than sum to avoid masking the base sum() function
tributary_sum<-merged_data4%>%
  mutate(Flow="Sum of tributaries", discharge=sum_discharge)%>%
  select(e.year, discharge,Flow)
data_all2<-rbind(downstream,tributary_sum)

#Graphing the sum of the upstream flow rates and the flow rate for downstream in two different colors, then including the difference as a line. (Also marked dams and 0 line)
colors5<-c(colors[1], colors[11])
ggplot(data_all2)+
  geom_point(aes(e.year, discharge, color=Flow))+
  scale_color_manual(values=colors5)+
  geom_line(data=merged_data4, aes(d.year, difference_discharge), size=1.2)+
  geom_hline(yintercept = 0, size=1.2) +
  geom_vline(xintercept=1941, color=colors[7], size=1.2) +
  geom_vline(xintercept=1965, color=colors[5], size=1.2) +
    labs(
    title="Yearly streamflow comparing the sum of all three tributaries\nto the downstream site",
    subtitle="Cubic feet per second, vertical lines represent dam constructions",
    caption="Data from USGS",
    x="",
    y="",
    color=""
    )  +
  theme_few()

This graph compares the combined flow rate of all three tributaries to the flow rate downstream. The black line shows the difference between those two values over time. After the second dam was constructed, the variance in the black line decreased dramatically. Because the large spikes all correspond to significant flow rate outliers, this is most likely due to the lack of any such outliers after the dams were built. After dam construction, the black line fluctuates close to 0, indicating a nearly one-to-one correspondence between water leaving the tributaries and water arriving downstream on days without large flood events.

We plotted this using a combination of ggplot2's geom_point and geom_line functions. The middle river dataset exclusively included the maximum flow rate for each year, so to get comparable data we combined the flow rates for each river by date using the merge function, which filtered out the data for days that were not in all four sets. This gave us a dataset with a row for each date and a column for each streamflow value. We stacked the downstream and summed-tributary values into one data frame and drew them with a single geom_point layer colored by flow type, added the difference as a geom_line layer, and used geom_hline and geom_vline to mark the zero line and the times of dam construction.
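By default, merge keeps only the keys present in both inputs, which is why dates missing from any one site drop out of the combined data. A minimal sketch with made-up data frames (all names and values here are hypothetical):

```r
# Hypothetical daily records for three sites; the middle site is missing a day
east <- data.frame(merge_date = c("2000-01-01", "2000-01-02", "2000-01-03"),
                   e.flow = c(10, 20, 30))
middle <- data.frame(merge_date = c("2000-01-01", "2000-01-03"),
                     m.flow = c(1, 3))
west <- data.frame(merge_date = c("2000-01-01", "2000-01-02", "2000-01-03"),
                   w.flow = c(5, 6, 7))

# Reduce applies merge pairwise; each step keeps only dates found in both inputs
merged <- Reduce(function(x, y) merge(x, y, by = "merge_date"),
                 list(east, middle, west))

# Sum of the tributary flows on the dates that survived the merge
merged$sum_flow <- merged$e.flow + merged$m.flow + merged$w.flow
merged
```

Giving each site's flow column a distinct name before merging, as in this sketch, also avoids having to rename the merged columns afterwards.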

Conclusions

Our analysis of the flow rates at the four sites suggests that the dams did indeed serve their purpose. Flood frequency was not reduced by the construction of the dams, but the unusually severe floods were successfully stopped. We believe this is because the dams spread floods out over a longer period of time, transforming a dangerous surge of water into a controlled increase in flow and thereby protecting the downstream community. We also saw that the general annual trend in flow rate was preserved before and after dam construction. This is a sign that the dams managed to reduce flooding without affecting the downstream environment more than necessary.

Given further time, we would like to explore the data on a finer timescale. For some of the sites we examined, data is being collected remotely at 15-minute intervals. However, this data is only available for the past 120 days, so we did not use it in this project. That said, it would be fascinating to look at how the water levels change during a flood, and how the patterns of change differ between the tributaries with and without a dam. In particular, we would be interested in seeing whether our theory that the dams spread out floods over a longer time period is accurate.

Citations

Current Water Data for the Nation. USGS: 2016. [online] (accessed 12/1/16, https://waterdata.usgs.gov/nwis/rt)

Littleville Lake Flood Risk Management Project. US Army Corps of Engineers: 2016. [online] (accessed 12/13/16, http://www.nae.usace.army.mil/Missions/Civil-Works/Flood-Risk-Management/Massachusetts/Littleville/)

Knightville Dam Flood Risk Management Project. US Army Corps of Engineers: 2016. [online] (accessed 12/13/16, http://www.nae.usace.army.mil/Missions/Civil-Works/Flood-Risk-Management/Massachusetts/Knightville/)

Data Visualization Final Report

Kristine Vorwerk and Brough Morris

December 19, 2016