This document is being developed as part of a Learning and Teaching project, funded by the University of the Sunshine Coast. The project’s formal title is “Develop your own Lifelong Learning Resource: Learners as Active Collaborators”, but the shortened title is “Write your own…”
The R programming language is being used to develop this resource, enabling students to build an e-portfolio of text with the R code required to run the various hydrological methods embedded within the resource. Using the ‘knitr’ library, students are then able to re-run the code for different datasets and locations and update the methods as they change. The intent is to develop a lifelong learning resource that can be adapted for any subject or endeavour.
To create the linked Table of Contents in the left-hand frame you will need to follow these instructions. Each module is embedded into this main document by adding a ‘chunk’ and including in the chunk’s argument list: child=“module1.Rmd”, replacing the 1 with the appropriate module number, as sketched below.
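For example, the main document might contain a chunk like the following for each module (a minimal sketch; the file name is whatever you have called that module’s .Rmd file):

````
```{r child="module1.Rmd"}
```
````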
Referencing (or citations) in knitr is surprisingly easy using the knitcitations package, once you have figured out how to do it! The learning path is steep, so I am providing some tips in the hope that it will work for you the first time. However, this may not be the best way to do it, so if you come up with a better method, please let me and everyone else know on the discussion board.
Most modern references include a DOI (or Digital Object Identifier), which makes it very easy to reference documents in your .Rmd file.
To install the knitcitations package you need to run the following code:
library(devtools)
install_github("knitcitations", "cboettig")
library("knitcitations")
library("RefManageR")
cleanbib()
bib <- read.bibtex("/Users/helenfairweather/Documents/SunshineCoast/USC/Teaching/ENG330/2014/IFDResources/references.bib")
#I have left the following in as it was part of my journey to figure out how to implement the knitcitations code to both store references that have a DOI, and those that are entered manually.
#refs <- lapply("10.7158/W11-858.2012.16.1",ref)
#write.bibtex(refs, file = "refs.bib")
For example, a relevant paper on IFD temporal patterns is French and Jones (2012), and this reference is created using the citet or citep functions. Experiment with the use of these two functions (one puts the whole reference in parentheses and the other just the year), as in the sketch below.
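A minimal sketch, assuming the setup chunk above has been run; citep gives a parenthetical citation, while citet puts only the year in parentheses:

```r
# Cite the French and Jones (2012) paper by its DOI.
citep("10.7158/W11-858.2012.16.1")   # e.g. (French and Jones, 2012)
citet("10.7158/W11-858.2012.16.1")   # e.g. French and Jones (2012)
```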
There are of course documents you will need to reference that do not have a DOI, and in this case you will need to input the relevant fields into a ‘.bib’ file. The format for entering these ‘old’ references is surprisingly simple. For example, Hill and Mein (1996) is such a reference; I have copied its details from the article itself (see below). These details can be entered into a text file within the RStudio environment.
After the identifier @article, the first entry is a ‘key’ to make it easier to reference. The code to enter the Hill and Mein (1996) reference is “r citet(bib[[“Hill1996”]])” (or citep, depending on how you want the parentheses configured), where “Hill1996” is the ‘key’.
@article{Hill1996,
abstract = {Anomalies in design flood estimates can arise from the design temporal patterns and losses. This study shows the effect that intense bursts within longer duration design storms can have on flood estimates, and recommends filtering of the temporal patterns. It also presents the results of a pilot study in which losses are derived to be consistent with the nature of design rainfalls; the result was a significant improvement in design flood accuracy.},
author = {Hill, P. J. and Mein, R. G.},
journal = {23rd Hydrology and Water Resources Symposium Hobart Australia 21-24 May 1996},
month = may,
number = {5},
pages = {445--51},
title = {{Incompatibilities between Storm Temporal Patterns and Losses for Design Flood Estimation.}},
year = {1996}
}
To generate a bibliography, the following code is entered (add echo=FALSE to the chunk options if you do not want the code block to print).
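A minimal sketch of such a chunk (assuming knitcitations has been loaded as above):

```r
# Print the list of works cited so far; place this in a chunk with
# echo=FALSE and results="asis" so only the formatted references appear.
bibliography()
```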
You should also reference the packages you have used in producing your document, as well as RStudio itself. Fortunately Boettiger (2014) has made this possible also. Your bibliography will also include these citations.
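A sketch of citing packages through their citation() entries (I believe citet and citep accept the bibentry objects that citation() returns, but check this in your own setup):

```r
# Cite the packages used in this document via their citation() entries.
citet(citation("knitr"))
citet(citation("knitcitations"))
citet(citation("RefManageR"))
```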
This page was created in RStudio: RStudio Team (2012), with the packages knitr: Xie (2014a); Xie (2013); Xie (2014b), knitcitations: Boettiger (2014), and RefManageR: McLean (2014).
[1] C. Boettiger. knitcitations: Citations for knitr markdown files. R package version 1.0-1. 2014.
[2] R. French and M. Jones. “Design rainfall temporal patterns in Australian Rainfall and Runoff: Durations exceeding one hour”. In: AJWR 16.1 (2012). DOI: 10.7158/w11-858.2012.16.1.
[3] P. J. Hill and R. G. Mein. “Incompatibilities between Storm Temporal Patterns and Losses for Design Flood Estimation”. In: 23rd Hydrology and Water Resources Symposium, Hobart, Australia, 21-24 May 1996 (May 1996), pp. 445-51.
[4] M. McLean. Straightforward Bibliography Management in R Using the RefManageR Package. arXiv: 1403.2036 [cs.DL]. Submitted. 2014.
[5] RStudio Team. RStudio: Integrated Development Environment for R. RStudio, Inc.. Boston, MA, 2012.
[6] Y. Xie. Dynamic Documents with R and knitr. ISBN 978-1482203530. Boca Raton, Florida: Chapman and Hall/CRC, 2013.
[7] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.
[8] Y. Xie. knitr: A general-purpose package for dynamic report generation in R. R package version 1.6. 2014.
The period of reduced rainfall from the mid-1990s to late 2010, and the subsequent flooding events in the summer of 2011 in Queensland, provide a good example of the contrasting problems faced by engineers whose work intersects with the water cycle.
Understanding, monitoring and modelling the behaviour of the water cycle, from the amount of water available in a dry period through to the excess water in wet times, provides the underpinning theory of hydrology.
Nearly all engineering works interact with the water cycle at some point in time (e.g. roads, buildings, dams and culverts). Calculating the volume of water to be routed through and around a structure, or contained within its boundaries, is the practical application of hydrology.
For the dry times, engineers involved in the water supply sector need to be able to design and manage water storages and associated infrastructure so as to provide a continuous water supply. The design criteria for water supply systems are such that water is provided with a minimum of disruption for domestic, industrial and agricultural use.
For the wet times, engineers need to be able to build structures that mitigate the damage large volumes of water can cause to infrastructure, the environment and indeed life. Building infrastructure to mitigate against very large rainfall events can be cost-prohibitive, and so society has accepted a level of risk that a flooding event will occur. This risk is quantified through the probabilities of occurrence of floods of various magnitudes.
As we have seen with the Queensland Flood Commission of Inquiry, working in this area brings major responsibilities and the public can be very quick to blame the engineers for getting it wrong. Therefore it is very important to have a good grounding in the basic principles and theories.
These basic principles cover: the statistical techniques for understanding the frequency of events and assessing the risk of their occurrence (probabilities); the components of the water cycle and how they affect the hydrological behaviour of a region; and the physical processes that govern the movement of water through and within the landscape. This course gives an introduction to these basic principles. This module covers the basics of the hydrologic cycle: rainfall, evaporation and infiltration.
When you have completed this module of the course you should be able to:
The key elements of the hydrologic cycle are the processes that transport water between the earth, groundwater, oceans and the atmosphere (rainfall, evapotranspiration, runoff) and the stores that detain water on the surface of the earth (dams and lakes, rivers, vegetation), beneath the earth (soil moisture and groundwater) and in the oceans.
The water balance is a key principle that underpins our understanding of the hydrologic cycle and the interactions between the stores and processes that move water between these stores.
Hydrology is an environmental science spanning the disciplines of engineering, biology, chemistry, physics, geomorphology, meteorology, geology, ecology, oceanography and, increasingly, the social environmental sciences such as planning, economics and political conflict resolution (Clifford 2002).
The occurrence, circulation and distribution of the waters of the Earth is the foundation of the study of hydrology and embraces the physical, chemical, and biological reactions of water in natural and man-made environments. The complex nature of the hydrologic cycle is such that weather patterns, soil types, topography and other geologic factors are important components of the study of hydrology.
When dealing with the hydrologic cycle we are generally focused only on the movement of water within our region of interest. However, the movement of water at the global scale drives the climatic conditions that, in turn, drive the regional movement of water, so the global hydrologic cycle is also very important.
Solar heating is the primary driver of the hydrologic cycle at the global scale. Solar radiation is a key input into evaporation from water (both oceans and freshwater bodies) and from the land surface. The evaporated water enters the atmosphere as water vapour and is transported by winds around the globe. When conditions are suitable this water vapour condenses to form clouds. These clouds hold the moisture that will potentially fall as precipitation over the land and oceans when another set of atmospheric conditions (mainly temperature and the pressure exerted by the water vapour) is met.
Precipitation falls as rainfall, snow or hail. Once it reaches the earth’s surface it is stored temporarily in hollows on the surface, as soil moisture or as snow. Once all the temporary stores are filled, excess rainfall runs off into streams, rivers, lakes and man-made reservoirs, or infiltrates into the soil, from where it may continue to infiltrate to the groundwater store. Eventually the water in the rivers, lakes, reservoirs and groundwater makes its way to the oceans and is evaporated again to complete the global water cycle.
Trenberth et al. (2007) provide a representation of the global hydrological cycle (Figure 1). The values assigned to each of the main water storage units (called reservoirs in Trenberth et al. (2007)) are slightly different to those provided by Ladson (2009, p. 4), with the exception of soil moisture, which differs by a factor of about 8 (Table 1). This highlights the uncertainty associated with trying to quantify the elements of the global hydrologic cycle.
Figure 1: The hydrological cycle. Estimates of the main water reservoirs are given in plain font in 10³ km³, and the flow of moisture through the system in italics (10³ km³ yr⁻¹) (Trenberth et al. 2007)
Table 1: Estimates of the main global water stores (10³ km³) from Trenberth et al. (2007) and Ladson (2009).

| Water Store | Trenberth et al. (2007) (10³ km³) | Ladson (2009) (10³ km³) |
|---|---|---|
| Ice (includes permafrost) | 26372 | 24000 |
| Groundwater | 15300 | 23000 |
| Lakes and rivers | 178 | 178 |
| Soil moisture | 122 | 16 |
| Wetlands | 11 | |
| Atmosphere | 12.7 | 13 |
| Oceans | 1335040 | 1338000 |
As water moves through the landscape, the atmosphere and the oceans it transports and redistributes energy, salt, nutrients and minerals. This redistribution of energy creates the ‘weather’ systems that we experience daily. The movement of nutrients, salts and minerals creates problems (e.g. salinity of agricultural lands) and opportunities (deposition of material creating fertile river valleys). It is the hydrologist’s role to understand how this water can be controlled to mitigate the problems and maximise the opportunities.
The part of the water cycle that hydrologists are predominantly interested in is a very small component of the total water budget. For example, freshwater makes up only 2.5% of the total global water budget, and surface water represents only 1.3% of this freshwater. Groundwater makes up about 30% of the freshwater reserves, with the remaining majority held in glaciers and ice caps (Figure 2).
Figure 2: Breakdown of the Earth’s water.
Hydrologists often use a water balance to account for the water in the part of the water cycle being studied. A water balance is an application of the principle of conservation of mass (the continuity equation) (UNESCO 1971). Conservation of energy is also an important principle in the study of hydrology, particularly in relation to measuring fluxes or phase changes of water vapour in the atmosphere.
Putting a boundary around the inflows and outflows of the water cycle of interest can be problematic, but essentially the water balance equation relies on the principle of conservation of mass. Because water can be considered incompressible for the water cycles hydrologists are generally interested in, conservation of mass is equivalent to conservation of volume, and so the water balance equation becomes:
\(I\Delta t-O\Delta t=\Delta S\)   (Equation 1)
where
\(I\) is inflow [units of flowrate]
\(O\) is outflow [units of flowrate]
\(S\) is the storage [unit of volume]
\(t\) is the time interval [units of time]
The units for the flowrate, volume and time values in Equation 1 must of course be compatible. See Ladson (2009, p. 5) for information on units.
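As a small worked example of Equation 1 (illustrative numbers only, with consistent units):

```r
# Change in storage over a month from constant inflow and outflow rates.
I  <- 120               # inflow (ML/day)
O  <- 95                # outflow (ML/day)
dt <- 30                # time interval (days)
dS <- I * dt - O * dt   # Equation 1
dS                      # change in storage (ML); here 750 ML
```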
As well as at the global scale, the water cycle can be considered at the national or regional scale and a water balance constructed. At the national scale the inflow of interest is precipitation (\(P\)), and the outflows are total evaporation (\(E\)) and the amount of water discharged to the sea (\(Q\)). At this scale a steady-state water balance approach is adopted (Welsh et al. 2007): the change in storage is considered to be zero, so \(I=O\), which gives \(P=E+Q+e\), where \(e\) is an error term, or the unaccounted-for volume.
This balance is applied by Ladson (2009) for Australia for an annual time step, using average measurements from the period 1918–1967 (Figure 3). This water balance demonstrates the uncertainty in the estimates made at this scale. For example, the river discharge to the sea (\(Q\)) is of the same order of magnitude as the unaccounted-for water (the error term \(e\)) in the water balance.
Figure 3: Hydrological cycle conceptualised by Pidwirny (2010), with Ladson’s (2009) estimate of the water balance for the Australian mainland shown around the borders.
When a water balance is applied at a regional scale, a water catchment is generally the area of study. At this scale a more detailed water balance can be applied and as well as \(P\) the inflows may include surface inflows from another catchment (\(I_s\)) and groundwater inflows (\(I_g\)).
The catchment of interest may not have an outlet to the ocean and therefore the outflows are also a little different to the water balance at the global scale: evapotranspiration (\(ET\)), subsurface outflow (\(Q_g\)), streamflow (\(Q\)) and diversions (\(D\)).
At some time scales at the catchment level the changes in storage levels do contribute and are therefore included in the water balance equation. The storage types can be broken down into surface storage in lakes and dams (\(S_s\)), soil moisture storage (\(S_{sm}\)) and groundwater storage (\(S_g\)). There are other water stores in the catchment (e.g. vegetation, rivers and channels, and snow and ice), but the change in their capacity is generally zero over the time period used in most water balances.
A catchment water balance can be a key tool for analysing the impacts of land use change and climate change (Hatton 2001). For example, if steady-state conditions are assumed (i.e. changes in storage = 0), the impact on discharge of changes in rainfall and evapotranspiration can be investigated.
Access to data is one of the biggest hurdles to undertaking a water balance. For example, if we were to apply a water balance to the Mooloolah catchment, in which the University of the Sunshine Coast sits, we would need access to average rainfall, evapotranspiration and runoff from this catchment. It is difficult to find data at this catchment scale, but data is available at the Sunshine Coast Region scale from the Queensland Government Wetland Hydro-Climate Tool.
The following plot (Figure 4) was created using the annual rainfall and runoff Probability Exceedance Curve data for the Qld Southern Coastal Lowlands sub-bioregion (Qld sub-bioregion code: 12.4). These data were ‘scraped’ from the underlying HTML using several R libraries (RCurl and XML), so you will need to install these packages (e.g. install.packages(“RCurl”)) before running the .Rmd file. A sketch of this kind of scraping is shown after Figure 4.
Figure 4: Annual rainfall and runoff Probability Exceedance Curves for the Qld Southern Coastal Lowlands sub-bioregion (Qld sub-bioregion code: 12.4).
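A hedged sketch of the kind of HTML scraping described above (the URL below is a placeholder, not the actual Wetland Hydro-Climate Tool address, and the table holding the exceedance data would need to be identified by inspecting the real page):

```r
# Read all HTML tables from a page into a list of data frames.
library(RCurl)
library(XML)
url  <- "http://example.gov.au/hydro-climate/sub-bioregion-12.4"   # placeholder URL
page <- getURL(url)                                 # download the raw HTML
tabs <- readHTMLTable(page, stringsAsFactors = FALSE)
str(tabs, max.level = 1)   # inspect which table holds the rainfall and runoff exceedance data
```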
The following data are the annual rainfall and runoff time series from the Queensland Government Wetland Hydro-Climate Tool. If you want to re-run this code for another location (and you do!) you will need to go to the site, download new data, and change the directory name in the code within the .Rmd document. Figure 5 shows the relationship between runoff and rainfall for this area.
Figure 5: Relationship between runoff and rainfall for the Qld Southern Coastal Lowlands sub-bioregion (Qld sub-bioregion code: 12.4); the green line is the 1:1 line, the line of best fit is in red, and the shaded area represents the uncertainty in the fit.
If we were to conduct a water balance for each year in the record we would need annual evaporation or evapotranspiration data averaged over the same area. Unfortunately these data are not readily available; however, if we assume that rainfall is the only input, and runoff and evaporation are the only outputs, we can estimate annual evaporation as the difference between rainfall and runoff. The histogram of these differences (Figure 6) shows that the majority of the estimated annual evaporation values lie between 500 and 1500 mm.
Figure 6: Estimate of Evaporation as the difference between rainfall and runoff for the Qld Southern Coastal Lowlands sub-bioregion (Qld sub-bioregion code: 12.4).
The mean of the estimated annual evaporation is 1104 mm, which is a good match with the long-term Areal Actual Average Annual Evapotranspiration from the Bureau of Meteorology.
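A minimal sketch of that calculation, assuming rain and runoff are annual vectors (in mm) covering the same years (the object names here are hypothetical):

```r
# Estimate annual evaporation as the closure term of a simple water balance.
evap_est <- rain - runoff                      # rainfall minus runoff, mm/year
hist(evap_est, xlab = "Estimated evaporation (mm)",
     main = "Estimated annual evaporation (rainfall - runoff)")
mean(evap_est)                                 # compare with the BoM long-term estimate
```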
This module will introduce you to several concepts you may not have encountered before: Intensity-Frequency-Duration curves, generally called IFD curves, temporal rainfall patterns and areal reduction factors. These three elements are the building blocks for the design rainfall event - an historically important input into hydrological studies.
In your first assignment you will be asked to create the design rainfall event for a site in the Mary River Catchment. Therefore, much of the material for this module will be provided to you after the assignment has been submitted.
The Intensity-Frequency-Duration (IFD) curves used to create a design rainfall event are unique to each location. Data that generate the curves have been derived from statistical analyses of daily and sub-daily rainfall over various periods across Australia. More information is available from the Bureau of Meteorology Design Rainfall portal and the Australian Rainfall and Runoff Revision project.
In Figure 7 the Intensity-Frequency-Duration curves (Institution of Engineers Australia, 1987) at the location of the Bureau of Meteorology’s weather station in the Beerburrum Forest are plotted. This plot is the same as that used in the example for Assignment 1; however, an additional line has been added (dashed purple line in Figure 7), which shows a recent rainfall event recorded at the Bureau of Meteorology’s Beerburrum weather station (Lat: -26.96, Lon: 152.96).
library(ggplot2)
library(reshape)
library(rmarkdown) #this library is required to create the figure caption.
#read the data file into R (note: I pasted the Beerburrum IFD data into a text file in RStudio - there is no extension on this file (ie. no .txt or .csv))
dirv<-"/Users/helenfairweather/Documents/SunshineCoast/USC/Teaching/LTG/2014/e-resource/IFData"
fname<-"BeerburrumForestBomWx_87"
#Use the read.table command to import data into R
ifd<-read.table(paste(dirv,fname,sep="/"),sep=",",header=T)
#The column containing the duration data is not numeric as it also includes text describing the time unit used (i.e. mins or hrs). Therefore the gsub command is used to strip only the numbers out of this column and place them into a new column.
ifd[1:5,9]<-as.numeric(gsub("Mins","",ifd[1:5,1]))
ifd[6,9]<-as.numeric(gsub("Hr","",ifd[6,1]))*60
ifd[7:13,9]<-as.numeric(gsub("Hrs","",ifd[7:13,1]))*60
#names are assigned to each of the columns to make it easier for accessing the data to plot.
names(ifd)<-c("xlab","ari1","ari2","ari5","ari10","ari20","ari50","ari100","mins")
# a vector of labels for the x-axis is read from the first column in the original ifd dataset (renamed xlab).
xlab<-ifd$xlab
#this line of code creates the breaks for the x-axis (e.g. the points where the grid lines are created)
bmins<-ifd$mins
#the following lines of code are for the y-axis and create the vector of labels (ylab) and breaks (brain) for the rainfall axis. The line below is commented out as it is the long way around; the subsequent 3 lines provide a simpler method - see if you can interpret what is happening here.
#ylab<-c(0.3,0.4,0.6,0.8,1,2,3,4,6,7,8,10,20,30,40,60,80,100,200,300,400,600)
seq_v<-c(1,2,3,4,6,7,8,10)
ylab<-c(seq_v[3:9]/10,seq_v,seq_v*10,seq_v[1:6]*100)
brain<-ylab
#the reshape library (specifically the melt command) can be used to format the data for input into ggplot, as we can plot the data with one command using a group option.
ifdl<-melt(ifd[,2:8])
ifdl$mins<-ifd[,9]
#when you use the attach command you can access the names of the columns directly (ie. you don't need to use ifdl$ before the column names.)
attach(ifdl)
#I have used the ggplot2 library to create this graph - it is a bit of a steep learning curve, but in my opinion it is well worth it as it allows you to create high-level plots with a lot of control.
#this creates a vector of colours for plotting each of the IFD curves for all the ARIs
colvec=c("dark blue","dark green","brown","light blue","red","purple","green")
#This creates a vector of labels to go with the colour vector to identify the ARIs
labvec=c("1 Year(lower curve)","2 Years","5 Years","10 Years","20 Years","50 Years","100 Years(upper curve)")
#when the data were 'melted' all the column names starting with "ari" (ie. columns 2:8 - see code ifdl<-melt(ifd[,2:8])), were put into a column 'variable'. This next line of code makes this a factor so that it can be grouped to create a line for each ARI.
ifdl$variable<-as.factor(ifdl$variable)
#the zoo package enables the creation of timeseries. As hydrological data almost always is temporal, it is important to develop the skills to efficiently handle dates and times. The zoo library provides some nice features to enable this.
library(zoo)
#the following two functions make it easier to deal with dates and times.
library(chron)
library(lubridate)
#Download the csv file for the weather station of interest, which will contain the last 72 hours of observations. There are several header rows at the start of this file; instead of manipulating these in R, I simply deleted them in Excel first. Read the data from this file using the read.csv command.
rrain<-"IDQ60901.95566.csv"
rrv_temp<-read.csv(paste(dirv,rrain,sep="/"),sep=",")
#There are a lot of data in this file, but for our purposes we only want the date/time and the rainfall amounts, which are in columns 5 and 25 respectively.
rrv<-rrv_temp[,c(5,25)]
colnames(rrv)<-c("Date.Time","rain")
#The Date.Time column is not in a pretty format, and doesn't include a month or year, so these have to be entered manually.
monv<-8
yv<-2014
#The following loop extracts the start and finish date - it is very long, but that is because of the messy date.time format in the data file (eg. 24/03:30pm). There is probably a much more elegant solution than this, so if you do find one, I would be very pleased to hear about it.
for(i in 1:2){
#the first time through the loop the first value is extracted from the data file - this is in fact the last date.time, as the data is in descending order.
ci=1
if(i==2){ci=nrow(rrv)}
#the following line separates the date and time, which are separated by a /.
bits <- unlist(strsplit(as.character(rrv[ci,1]), '/'))
#The first element of the separated 'bits' is the day value, and this is converted to a numeric.
stday<-as.numeric(bits[1])
#the second element of the separated 'bits' contains the time information. The hour value is easily extracted and made numeric.
bits<-unlist(strsplit(bits[2],":"))
sthr<-as.numeric(bits[1])
#The following If statements test for am or pm, and if the latter, adds 12 hours to the value.
if(nchar(gsub("am","",bits[2]))==4){
stmin<-as.numeric(gsub("pm","",bits[2]))
sthr<-sthr+12
}else
{stmin<-as.numeric(gsub("am","",bits[2]))}
#if the minutes value is a single digit (i.e. the observation is on the hour), a trailing 0 is appended to produce the time in the format hh:mm:ss
if(nchar(stmin)==1){stmin=paste(stmin,0,sep="")}
timech<-paste(sthr,stmin,"00",sep=":")
datech<-paste(stday,monv,yv,sep="/")
dt<-chron(dates=datech,times=timech,format=c(dates = "d/m/y", times = "h:m:s"))
if(i==1){findt=dt}
else{stdt=dt}
}
#the zoo library can very easily create the time series for the data from the first and last date, separated by 30 mins.
tseq<-seq(stdt,findt,by=times("00:30:00"))
#the zoo command applies this time seq to the measured rainfall.
zrain<-zoo(rev(rrv$rain),tseq)
#the rainfall data are cumulative over a 24-hour period, resetting to 0 at 9:01 AM each day. Therefore the indices of the 9:30 AM values are used to create a vector of half-hourly totals, using the diff function for all times except 9:30.
i930<-which(hour(index(zrain))==9)
indevent<-c(zrain[1],diff(zrain[1:i930[1]]),zrain[i930[2]],diff(zrain[(i930[2]):i930[3]]),zrain[i930[4]],diff(zrain[i930[4]:i930[5]]),zrain[i930[6]],diff(zrain[i930[6]:nrow(rrv)]))
#the following block of code applies a rollapplyr function to sum the data for each of the durations across all of the data, and returns the maximum for each duration.
max_dur<-NULL
dt_eventends<-NULL
dur_hrs<-c(0.5,1,2,3,6,12,24,48)
for(i in 1:length(dur_hrs)) {
sumv<-rollapplyr(indevent,dur_hrs[i]*2,sum)
iv<-which.max(sumv)
timv<-paste(dur_hrs[i],00,00,sep=":")
dt_eventends<-c(dt_eventends,index(sumv[iv]))
max_dur<-c(max_dur,max(sumv))
}
dur_hrs<-c(dur_hrs,72)
max_dur<-c(max_dur,sum(indevent))
dt_eventends<-c(dt_eventends,index(indevent[length(indevent)]))
dt_max_dur<-chron(dt_eventends, format=c(dates = "d/m/y", times = "h:m:s"))
#A data frame is created with these data (durations in minutes, maximum values, their intensity and the time they occurred) as input. The cbind function binds each of these vectors together as columns in a data frame.
rrv<-as.data.frame(cbind(dur_hrs*60,max_dur,max_dur/(dur_hrs),dt_max_dur))
names(rrv)<-c("minv","max_dur","Intensity","dt")
write.csv(rrv,"event_rain.csv")
#this next line of code creates the information for the plot - this is stored in the variable ifdp. Search ggplot2 and ggplot to understand all the elements of this plotting procedure. The data from the recent rainfall event is added through the second geom_line and first geom_point call.
ifdp<-ggplot(ifdl)+geom_line(aes(x=mins,y=value,group=variable,colour=variable))+
geom_line(data=rrv,linetype="dashed",col="purple",aes(x=rrv$minv,y=rrv$Intensity))+
geom_point(data=rrv,col="purple",aes(x=rrv$minv,y=rrv$Intensity))+
annotate("text",label=paste("Rain Event from",index(indevent[1]),"to",index(indevent[length(indevent)]),sep=" "),x = 500,y=1.25,size=4,colour="purple")+
scale_x_log10(limits=c(5,4320),breaks=bmins,labels=xlab)+scale_y_log10(limits=c(0.3,1000),breaks=brain,labels=ylab)+
xlab("DURATION IN MINUTES OR HOURS")+ylab("RAINFALL INTENSITY IN MILLIMETRES PER HOUR")+
scale_color_manual(name="AVERAGE RECURRENCE INTERVAL",values=colvec,labels=labvec)+
theme(legend.justification = c(0, 0),legend.position = c(0, 0),panel.background = element_rect(colour = "black",fill="white"),panel.grid.major = element_line(colour = 'grey'),panel.grid.minor = element_line(colour = 'grey'),plot.title = element_text(lineheight=.8, face="bold"))+
guides(col = guide_legend(reverse = TRUE))+
ggtitle("DESIGN RAINFALL INTENSITY CHART\n Beerburrum Forest BoM weather station location")
#to plot the graph you simply type the plot variable name.
ifdp
Figure 7: 1987 Design Rainfall Intensity Chart produced for the Beerburrum Forest BoM weather station location, with the recent rainfall event shown by the dashed purple line. Measured durations are indicated by the closed dots.
In conjunction with the revision of Australian Rainfall and Runoff (Institution of Engineers Australia, 1987), the Bureau of Meteorology has produced revised IFD data for Australia. Apart from changes to the underlying data (the dataset used in the statistical analyses has increased considerably), additional durations have been added and the terminology has changed. The change in terminology is an attempt to make the probabilistic nature of rainfall (and flood) events better understood across the community.
Previously the terms ‘Average Recurrence Interval’ (ARI) and ‘Annual Exceedance Probability’ (AEP) were used within the engineering profession in Australia. They were thought to be interchangeable; however, the two were often confused and people would sometimes refer to the ‘Annual Recurrence Interval’, which of course has no meaning.
A probability framework (ie. AEP), expressed as a percentage, has been adopted by the Engineers Australia National Committee on Water Engineering (responsible for delivering the revision of ARR) as it is believed this will better convey the probabilistic nature of rainfall and subsequent flood events. For example, the 1% AEP has a 1% chance of occurring each and every year, and therefore these events can occur in consecutive years (or even more than one in any given year). Previously when the ‘recurrence interval’ or ‘return period’ terminology was used, it was inferred as meaning that these events occurred at regular intervals.
In the revised ARR, the ‘Exceedances per Year’ (EY) terminology has been adopted for events that occur frequently (e.g. a couple of times a year). So for events with an AEP greater than 50%, EY is used. For example, 2EY relates to events of a magnitude that occurs, on average, twice a year.
The National Committee on Water Engineering, when considering the terminology to adopt for the revised ARR, was concerned that whatever was adopted was clear in meaning, technically correct, practical and acceptable. The revised ARR website has a terminology chapter which provides additional material. In this chapter, take particular note of the discussion around the seasonality of data in Australia, and the wet and dry periods.
Another major difference with the 2013 IFD data is the format of the output. As you saw in Figure 7, IFD data were previously provided as intensities; however, the 2013 IFD data are currently provided as depths for each duration and AEP. In future releases there will be an option to download intensities for these new IFD data as well.
Figure 8 is an example of the 2013 IFD depths, with the maximum depths for each duration from the recent weather event (21/8/2014 to 24/8/2014) also added.
Figure 8: 2013 Design Rainfall Intensity Chart produced for the Beerburrum Forest BoM weather station location (BoM 2013), with the recent rainfall event shown by the dashed purple line. Measured durations are indicated by the closed dots.
Temporal patterns are applied to the rainfall intensity (or depth, for the new IFD data) to distribute the total rainfall across the required duration in a pattern that is meant to statistically represent the median rainfall pattern for a given zone. The core assumption of using this median value for the temporal pattern, and for the other input parameters to the design storm, is that it will result in a flood estimate of a similar probability to that of the rainfall; this is known as the assumption of NEUTRALITY (Institution of Engineers Australia, 1987). The need to research this assumption was identified in Institution of Engineers Australia (1987); however, little work has been done to date. Subsequent writers have identified that this assumption is invariably not satisfied (French and Jones, 2012).
In spite of all these issues, the temporal pattern remains an important part of producing the design storm for the time being. The temporal patterns are presented in Institution of Engineers Australia (1987). In that publication, Australia is broken up into eight zones, with the Sunshine Coast in zone 3. This zone extends from northern NSW through to the top of Cape York.
Areal Reduction Factors (ARFs) are the ratio between rainfall at a point and that same rainfall averaged across an area. They account for the spatial variability that is inherent in all rainfall events.
The ARFs in the current version of Australian Rainfall and Runoff (Institution of Engineers Australia, 1987) are based on studies from the United States. There has been some literature in recent times on ARFs. Project 2 of the review of Australian Rainfall and Runoff concerns the spatial patterns of design rainfall, and the reports on this project will help you in your assignment.
[1] Institution of Engineers Australia. Australian Rainfall and Runoff. 1987.
[2] R. French and M. Jones. “Design rainfall temporal patterns in Australian Rainfall and Runoff: Durations exceeding one hour”. In: AJWR 16.1 (2012). DOI: 10.7158/w11-858.2012.16.1.
In this module we will investigate the processes responsible for the generation of runoff, which, in turn, becomes streamflow. It is this streamflow that hydrologists are primarily interested in, from understanding the amount of water available in drought times through to the impacts of floods.
Surface runoff, prior to becoming streamflow, can also be of interest to a hydrologist, as a flood event can happen anywhere. The primary location of floods, though, is a drainage channel (i.e. streamflow overtopping a bank).
In this course the focus is on hydrology as it relates to flooding, but the analysis of low-flow conditions is also important for hydrologists.
When you have completed this module of the course you should be able to:

* Describe the conditions for surface runoff to commence
* Describe the components of streamflow for different catchment characteristics
* Describe the different methodologies for measuring and modelling surface runoff and streamflow
* Interpret a hyetograph and a hydrograph
In this course we are only concerned with the case where rainfall is required before a surface runoff event occurs. This is not always the case, as a runoff event may occur through other mechanisms (e.g. a broken water main).
Runoff can be generated by rainfall that occurs on all, or part, of the catchment (Figure 1). There are several processes that are precursors to a runoff event occurring.
Figure 1: Runoff is generated from rainfall that occurs on all, or part of, the catchment.
Not all rain reaches the earth’s surface, as it can be held up in intercept storage. Rain that does reach the surface of the earth can be held up in depression storage (Figure 2).
Figure 2: Interception and Depression Storage need to be considered in hydrological analyses and modelling.
Plants, grass, buildings, and tree trunks and branches intercept some precipitation before it reaches the earth’s surface. The amount of precipitation intercepted depends on its form; for example, much more snow is likely to be intercepted by tree branches and trunks than rainfall. In Queensland we are primarily concerned with precipitation as rainfall; where rainfall is used in this document, consider that it could also apply to snowfall or hail.
A number of processes occur subsequent to the interception of rainfall by the natural or man-made structures above or on the surface of the earth. Accounting for the amount (or flux) of water that moves through these processes is important for hydrology.
The interception store is the amount of rain held by vegetation, which therefore does not become runoff. Some of the water that is intercepted will subsequently be lost from this store. These losses are defined as:

* the interception loss,
* stemflow, and
* throughfall.
The *interception loss* is defined as the amount of water that evaporates from the interception store; it is considered a loss in the context of the water balance. *Stemflow* is defined as water that flows down stems and trunks to the ground when the interception store fills or snow/ice melts. *Throughfall* is defined as the water that drips to the ground from leaves when the interception store fills or snow/ice melts, plus precipitation that passes directly through the vegetation canopy.
Rain can be retained in puddles during a storm; this is known in hydrology as depression storage. Rain retained in depression storage is eventually infiltrated or evaporated. In contrast, rain can be temporarily retained as it spreads across the surface; this is known as detention storage. Detention storage excludes depression storage, and is defined as the volume of water spread across the surface en route to the drainage line (i.e. the volume of flowing water on the land surface that has not yet reached a river, channel or creek). Ladson (2011) calls this overland flow.
Figure 3: Example of depression storage (picture from the USDA “From the Surface Down” publication, 2010).
The soil type is characterised by its texture and structure (these are different things). These characteristics, in turn, govern the infiltration of water through the soil. Therefore a soil survey is an important input for hydrological analyses.
This is a good reference from the US Natural Resources Conservation Service for a simple explanation of soil development. A Refresher Course in Soil Science, from the Australian Society of Soil Science (1985) is more specific to Queensland. An electronic version of this publication is available from the Queensland Department of Environment, Land and Water Library.
There are many digital maps and resources available through the Queensland Government Information Service that you can map in R.
[Figure: example map produced in R from Queensland Government spatial data]
It may seem counter-intuitive, but even when a soil is fully saturated, water can still infiltrate through it. The texture and structure of the soil both influence the infiltration rate of water through it. A sandy soil holds less total water than fine-textured soils (such as clays), and this affects the infiltration rate, \(i\). The precipitation rate, \(p\), also affects the infiltration rate. The rate of infiltration into the soil is:

governed by supply constraints when \[i>p\]

and governed by soil constraints when \[i<p\].
This module will introduce some basic probability concepts and procedures used in hydrological frequency analysis. The methods that have been used historically for flood frequency analyses were developed prior to the advent of the computing power we now have access to. Consequently, the methods employed to analyse flow data are becoming more sophisticated. The R environment is particularly useful for understanding this increased complexity.
When you have completed this module of the course you should be able to:
The revision of AR&R, particularly Book 3, Chapter 2, Flood Frequency, has produced material that is particularly valuable for understanding the new methods.
William King from Coastal Carolina University has produced a handy tutorial on Probability functions in R.
The subject is also covered well by Ladson (2011) in Chapter 6: Floods and Flood Frequency Analysis.
For some basic statistics and probability theory, see the book by G. Jay Kerns. This book can be accessed through R with three simple commands:
install.packages("IPSUR")
library(IPSUR)
read(IPSUR)
In flood frequency analysis we are primarily interested in analysing the probability of floods of certain sizes, therefore we are generally only interested in the flood peaks. Given the stochastic nature of rainfall, and many of the processes that contribute to flooding, a probability framework is required in order to assess risk, as the flood peaks are treated as random variables.
One of the primary assumptions in applying the flood probability framework is that each flood peak is statistically independent of all other flood peaks. This has generally been shown to be the case.
To understand probabilities, and hence to be able to work with the probability of the occurrence of hydrological events, we need to be able to describe and compute probabilities. This is usually done using a ‘probability density function’, or pdf.
The flood probability model is generally described by its probability density function (pdf). For sample data, the pdf is a smoothed curve applied to a histogram of measurements. There are many forms of a pdf, depending on the shape of this curve. When the underlying data fits a pdf that resembles a bell-like shape it is called a ‘normal’ (or Gaussian) distribution. This is an excellent resource for explaining probability distributions in R.
There are several distributions that are used in hydrology. Each of these curves can be described by a mathematical equation, characterised by one or more parameters (e.g. location, scale and skewness) which usually have some physical meaning in hydrology. For example, the normal distribution is characterised by the mean (\(\mu\)) and standard deviation (\(\sigma\)).
The values of the standard normal distribution (the special normal distribution which has \(\mu = 0\) and \(\sigma = 1\)) can be generated by using the dnorm function, which returns the height of the normal curve at values along the \(x\)-axis. (Variables which follow a standard normal distribution are usually designated \(z\).) Wrapping the curve function around this result produces the familiar ‘bell-shaped’ curve. The point on the curve where z = 1 is shown by the green circle and line.
curve(dnorm(x), xlim=c(-3,3), col='red', lwd=3, xlab="Z", ylab="density", main="Probability Density Function")
points(1, dnorm(1), col='green', pch=19, cex=3)
lines(c(1,1), c(0,dnorm(1)) ,col='green', lwd=2)
[Figure: standard normal probability density function, with the point at z = 1 marked in green]
To find the density value at a particular point along the \(x\)-axis (i.e. the height of the curve at that point), the value is passed to the dnorm() function:
dnorm(1)
## [1] 0.242
The probability that \(z\) will be greater than a particular value is given by the area under the pdf to the right of the \(z\)-value on the \(x\)-axis. You may remember looking up these ‘\(Z\)-scores’ (values on the \(x\)-axis) in a table to return the \(p\)-value (region under the normal curve) when conducting hypothesis tests. The good news is that you no longer have to do that; you only need to enter the correct command into R!
The area under the curve (shaded in blue below) from a specified \(z\) value on the \(x\)-axis to the maximum value on the \(x\)-axis (which is \(\infty\) for the normal distribution) gives the probability of the random variable \(Z\) being greater than or equal to the particular value \(z\) being examined; i.e. it gives \(P(Z \geq z)\). The total area under the pdf curve is \(1\).
[Figure: standard normal pdf with the area to the right of z shaded blue]
You can also specify different values for \(\mu\) and \(\sigma\) in the dnorm function (rather than 0 and 1 respectively). The following pdf is for the case with \(\mu = 10\) and \(\sigma = 2\):
[Figure: normal probability density function with \(\mu = 10\) and \(\sigma = 2\)]
The cumulative distribution function (cdf) is the integral of the pdf (i.e. the area under the curve), or the total accumulated probability up to a certain value. It is represented as \(P(z)= \int_{-\infty}^{z}{p(z)dz}\). Therefore, like the pdf, there is a mathematical curve that describes the cdf. The cdf of ‘normally’ distributed data takes on a familiar s-shape and can be generated for the standard normal distribution (\(\mu = 0\) and \(\sigma = 1\)) using the pnorm function inside the curve function.
It stands to reason that if a value (\(z\)) is passed to the pnorm function, it will return the value on the cdf curve, which is equivalent to the area under the pdf curve to the left of \(z\). It should be noted that, in general, capital letters represent the random variable (in this case \(Z\)), and lower-case letters represent an outcome or realisation of that random variable (e.g. \(z\)).
par(mfrow=c(2, 1)) # Makes a 2x1 grid on which to place plots
z_value <- 1
# PDF
curve(dnorm(x), xlim=c(-3,3), col='red', lwd=3, xlab="Z", ylab="density", main="Probability Density Function", las=1)
min_v <- -3
max_v <- z_value
xvals <- c(min_v, seq(min_v,max_v,0.01), max_v)
yvals <- c(0, dnorm(seq(min_v,max_v,0.01)), 0)
polygon(xvals, yvals, col="skyblue")
lines( c(z_value, z_value), c(0, 0.5), col='green', lwd=2)
text(max_v+0.2, dnorm(z_value),
"The blue area =\n value on cdf: P(Z<z))",
pos=4)
# CDF
min_v <- -3
max_v <- 3
curve(pnorm(x), col='blue', lwd=3, xlim=c(min_v, max_v), xlab="Z", main="Cumulative Distribution Function", las=1)
points(z_value, pnorm(z_value), col="green", pch=19, cex=2)
lines( c(z_value, z_value), c(0, pnorm(z_value)), col='green', lwd=2)
lines(c(-3.5,z_value), c(pnorm(z_value), pnorm(z_value)), col='green', lwd=2)
[Figure: standard normal pdf (top, with the area to the left of z shaded blue) and the corresponding cdf (bottom, with the point at z marked)]
par(mfrow=c(1,1))
The probability theory used to estimate the probability of floods of certain magnitudes occurring is taken from sampling theory (see the book by G. Jay Kerns) and the theory of Bernoulli trials (see page D-21 in this resource).
The binomial distribution represents the successes or failures of repeated trials (e.g. tossing a coin 100 times and recording the occurrence of heads or tails). As we are dealing with discrete events (that is, one head or two heads, but not 2.8034 heads), the representation of the probability function is called a probability mass function. The graphical representation of a mass function is usually a column chart (as opposed to a curve for a continuous pdf). Essentially these theories result in the equation that allows the probability of \(k\) successes in \(n\) trials to be calculated (the binomial formula):
\(P(k \text{ successes in } n \text{ trials})=\frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}\)
The Annual Exceedance Probability (AEP) is the probability (p) that is of interest to us in the previous equation. We can apply the above equation to determine the probability of occurrence of various scenarios. For example, if we consider peaks that occur in an annual time period, we can calculate the probability that exactly three 50 year floods will occur in a 100 year period.
If we consider a series that represents the maximum floods over a certain period (eg. annual) or threshold (eg. Peak over Threshold or PoT), the magnitude of these floods can be related to the probability of exceedance (in the case of annual peaks) or their return interval (in the case of PoT). Fortunately there is a relationship between the probability of exceedance (AEP, or \(p\)) and the return interval (ARI): \(ARI=\frac{1}{p}\).
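In R this relationship is a one-liner; for example, a 1% AEP event has an ARI of 100 years:

```r
aep <- 0.01      # annual exceedance probability
ari <- 1 / aep   # average recurrence interval (years)
ari
```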
Therefore we need to develop skills in being able to calculate these exceedance probabilities.
If we wanted to work out the probability that exactly three 50-year floods will occur in a 100-year period, we could of course apply the above equations. For example, in this situation, our k=3 and n=100. Our ARI = 50, therefore \(p=\frac{1}{50}\). Plugging these values into our first equation yields:
\(P(3 \text{ in } 100)=\frac{100!}{3!(100-3)!}(\frac{1}{50})^{3}(\frac{49}{50})^{97} = 0.182\), so therefore there is an 18.2% chance that three 50-year floods will occur in any 100 year period.
At this stage we don’t know the magnitude of that flood for a given area. This will come through the analysis of actual streamflow data.
Applying this equation every time we wish to calculate these probabilities seems onerous; fortunately R has a built-in function, dbinom(k, n, p), for the probability of \(k\) successes in \(n\) trials. For the example above, dbinom(3, 100, 1/50) gives \(P(3 \text{ in } 100) = 18.2276\%\), as shown in the sketch below.
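```r
# Probability of exactly three 50-year floods in 100 years.
dbinom(3, size = 100, prob = 1/50)   # returns approximately 0.182276, i.e. 18.2%
```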
But what if we wanted to know the probability of 3 or more 50-year floods occurring in a 100-year period? We know that the sum of the probabilities of all possible numbers of 50-year floods occurring in 100 years must equal unity (this includes the probability of there being no 50-year events in a 100-year period), so \(P(3 \text{ or more in } 100) = P(3 \text{ in } 100) + P(4 \text{ in } 100) + \ldots + P(100 \text{ in } 100)\).
However, it would take a long time to apply each of these combinations to the first equation. We could instead calculate the probabilities of 0 in 100, 1 in 100 and 2 in 100 and subtract the sum of these probabilities from 1: \(P(3 \text{ or more in } 100) = 1 - [P(0 \text{ in } 100) + P(1 \text{ in } 100) + P(2 \text{ in } 100)]\). This also seems a bit onerous, but fortunately we can again apply the dbinom(k,n,p) function and wrap a sum function around it: 1-sum(dbinom(0:2,100,1/50)) = 0.3233. In R we could also easily apply the first option directly, e.g. sum(dbinom(3:100,100,1/50)). You will see this gives the same answer, as in the sketch below.
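```r
# Probability of three or more 50-year floods in 100 years, two ways.
1 - sum(dbinom(0:2, 100, 1/50))   # complement of 0, 1 or 2 occurrences
sum(dbinom(3:100, 100, 1/50))     # summing the upper tail directly
# both return approximately 0.3233
```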
To understand the shape of this distribution you can wrap a plot function around the dbinom function for the points 0:100. (The binomial distribution is computed at discrete values, such as 0, 1, 2, …, 100; for example, in 100 years we can have 0 years, 1 year, 2 years and so on with a flood, but it makes no sense to talk about 2.45604 years having a flood. This means the probability function is not continuous.) The blue dot in the following graph corresponds to the probability of 2 occurrences of 50-year floods in 100 years. It makes sense that this has the highest probability, given that, on average, we could expect two 50-year floods in any 100-year period.
x <- seq(0, 100, 1)
plot(x,dbinom(x,100,1/50),col="red",lwd=3,type="h", las=1, xlab="Number of Occurrences of 50 Year Floods in 100 Years", ylab="Probability of Occurence", main="Probability Distribution for number of \n Occurences of 50 Year Floods in 100 Years")
points(2,dbinom(2,100,1/50), pch=19, col ="blue")
[Figure: probability distribution for the number of occurrences of 50-year floods in 100 years]
Exercise: If you were asked to design a bridge to be safe during a 50 Year flood, what is the chance that the design flood will be exceeded four or more times in a 100 year period?
In design, however, we are most interested in the probability of occurrence of 1 or more ‘failures’ in \(n\) years. Using the previous logic, this equates to:
\(P(1 \text{ or more in } 100) = 1 - P(0 \text{ in } 100)\)
Applying the previous relationships yields: \[P(1 \text{ or more in } n) = 1-\frac{n!}{0!(n-0)!}p^0(1-p)^{n}=1-(1-p)^{n}\] (since \(0! = 1\) and \(p^0 = 1\)). Therefore the probability of one or more 2000-year floods in a 30-year period is given by: \(P(1 \text{ or more in } 30) = 1- (1-\frac{1}{2000})^{30} = 0.0149\), or 1.49%.
Again we can use the dbinom function to calculate this: 1-dbinom(0,30,1/2000) = 1.49%.
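The two equivalent calculations side by side:

```r
# One or more 2000-year floods in a 30-year period.
1 - (1 - 1/2000)^30        # direct formula: 1 - (1 - p)^n
1 - dbinom(0, 30, 1/2000)  # via the binomial probability of zero events
# both return approximately 0.0149 (1.49%)
```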
Exercise: What is the probability of one or more 200 year floods in a 20 year period?
Now consider the case of a project with a design lifetime of 100 years. The design discharge we are interested in is a discharge with a 100-year return period. In this case, what is the probability of ‘failure’ within the lifetime of the project? Here \(n=100\) and the return period is also 100, so the annual exceedance probability is 0.01. The probability of failure is 1 - dbinom(0,100,0.01), which equates to a 63.4% probability of a 100-year flood occurring in any 100-year period!
What is the probability of 100-year floods occurring in two consecutive years? In this case \(n=2\), and we can either directly apply dbinom(2,2,0.01), or multiply the probability of one 100-year flood in 1 year (i.e. \(n\) and \(k =1\)): dbinom(1,1,0.01), by the same probability (since the events are independent): dbinom(1,1,0.01)*dbinom(1,1,0.01). Therefore the probability (expressed as a percentage) of 100-year floods occurring in two consecutive years is 0.01%: it is very unlikely.
We can also turn these relationships around to determine the design return period for an acceptable risk of failure. For example: a hydraulic construction site on a stream needs the protection of a coffer dam/tunnel for a period of three consecutive years. If the total acceptable risk of failure during this three-year period is estimated to be 0.5%, what must be the design return period for the protective structure? In this case we know the answer, \(P\), to the equation \(P = 1 - (1-p)^{n}\). Given that \(P=0.005\) (i.e. a 0.5% acceptable risk of failure) and \(n=3\), the equation can be rearranged to calculate the exceedance probability \(p = 0.167\%\). The ARI is the reciprocal of this: \(ARI=\frac{1}{p}\) = 599 years. There is no doubt an inverse binomial function in R to calculate this: can you find one?
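A sketch of the rearranged calculation (the inverse-function question above is left for you):

```r
# Back-calculate the design ARI from an acceptable risk of failure.
P <- 0.005               # acceptable risk of failure over the whole period
n <- 3                   # years of protection required
p <- 1 - (1 - P)^(1/n)   # rearranged from P = 1 - (1 - p)^n
c(AEP = p, ARI = 1/p)    # p is about 0.00167 (0.167%), ARI about 599 years
```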
Exercise: The total acceptable risk of failure for a bridge is 5%. Given that this bridge has a design life of 50 years, what is the ARI that the bridge should be designed for?
There are several other statistical distributions that are important in hydrology. These can be broken down into different distribution ‘families’. Each distribution has a corresponding function in R that enables you to extract a value on the density curve (prefix d), a value on the cumulative distribution curve (prefix p), and a range of other statistical properties. The function names are shown in Table 4.1.
Table 4.1: Distributions common in hydrology with the corresponding R function name. Note each function shown is prefixed by the letter d, but this can be changed to p (and a few other options) depending on whether a value on the density or cumulative distribution curve is required.
| Family | Distribution Name | Related R Function |
|---|---|---|
| Normal family | Normal | dnorm |
| | Log-normal (that is, \(\log X\) has a normal distribution) | dlnorm |
| Discrete | Binomial (or Bernoulli) | dbinom |
| Gamma family | Pearson type III | dpearson (in the SuppDists package) |
| | Log-Pearson type III | dlgamma3 (in the FAdist package) |
| Extreme value | Type I: Gumbel (for max or min) | dgumbel (in the FAdist package) |
| | Type III: Weibull (for min) | dweibull |
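As a quick sketch of how these density functions are called (base-R functions only; the parameter values are purely illustrative):

```r
# Density values from three of the families in Table 4.1.
dnorm(1, mean = 0, sd = 1)          # normal density at x = 1
dlnorm(2, meanlog = 0, sdlog = 1)   # log-normal density at x = 2
dbinom(2, size = 10, prob = 0.2)    # binomial: P(exactly 2 successes in 10 trials)
```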
The starting point for understanding the driving hydrological processes in a river system or catchment is to extract flow data for the location(s) of interest. In Queensland, flow data are available through the Queensland Government Water Monitoring Data Portal. For current streamflow data, click on Open Stations and select the relevant region through the map, or select the basin and then the gauging station through the links in the left-hand frame.
Data can be downloaded for any available timestep, up to the full length of the data record, under the ‘Custom Outputs’ tab. For the example below, daily stream discharge (Megalitres/Day) from the Mary River at Bellbird Creek gauging station (Station Number: 138110A) has been used. These data are available from 01/10/1959.
Data are available from the Qld Water Monitoring Data Portal in a structure that requires some manipulation prior to analysis. Several variables can be downloaded at a time, but the structure is always the same. The first column contains the date and time, then the next six columns contain the Mean, Min and Max values and the corresponding quality codes for the first variable. If there is more than one variable, this pattern is repeated. The last column contains the metadata about the station and a description of the quality codes. The structure for one variable is shown in the table below.
| Time and Date | Station number, Variable Code, Variable Name and Units | | | | | | Metadata |
|---|---|---|---|---|---|---|---|
| | Mean | Qual | Min | Qual | Max | Qual | |
| 1st Date | value | value | value | value | value | value | Sites: |
| 2nd Date | value | value | value | value | value | value | Station Number - Station Name Lat Long Elev |
| … | … | … | … | … | … | … | From row 8 onwards, there is a description of the Qual codes |
| Last Date | value | value | value | value | value | value | |
| Miscellaneous text | | | | | | | |
| Miscellaneous text | | | | | | | |
The following code is commented to assist your understanding of one way to import the data into R.
# The station number is represented by the variable stn.
# To bring in data for a different location, you only need to change stn,
# provided the csv file is placed in the same directory.
stn<- "138110A"
#fn is a variable that is created by pasting together the directory (change to point to your local directory), the station number (variable stn) and the file extension (.csv).
fn<-paste("/Users/helenfairweather/Documents/SunshineCoast/USC/Teaching/ENG330/2014/Data/",stn,".csv",sep="")
#fn<-paste("//usc.internal/usc/home/hfairwea/Desktop/Teaching/ENG3302014Sem2/RHTML/data/Daily/",stn,".csv",sep="")
#extract the first part of the file to get the variable name.
#before reading in the data determine the number of columns in the csv file. This is stored in the ncol variable.
ncol <- max(count.fields(fn, sep = ","))
# The first column always contains the date, and the last column the metadata.
# For every variable of interest, the file contains six values:
# the mean value, min value and max value, plus the corresponding quality code for each of these.
# Therefore the ncol minus 2 (to remove the date and metadata columns) divided by 6 will give the number of variables,
# which is stored in nv.
nv<-as.integer((ncol-2)/6)
#To extract the variable names, the first 4 rows of the data file are read into the hd variable
hd<-read.csv(fn,header=FALSE,nrows=4,sep=",")
#An empty vector to store the variable names is created and a loop used to capture these names for the number of variables (nv). The first variable name is sourced from row 3, column 2, and each subsequent variable name from the same row (3), but incremented by 6 columns.
varn<-character(0)
cn<-2
for(i in 1:nv)
{
varn[i]<-as.character(hd[3,cn])
cn=cn+6
}
#The data are extracted from the file by using the read.csv function and skipping the first four rows.
ddtemp<-read.csv(fn,header=FALSE,skip=4,sep=",")
#The extra column at the end of the file is removed in the next line of code.
dd<-ddtemp[1:(nrow(ddtemp)),1:(ncol-1)]
#If the two extra lines at the end of the file are not removed, the following code is used in lieu of above.
#dd<-ddtemp[1:(nrow(ddtemp)-2),1:(ncol-1)]
#Names are assigned to the columns using the concatenate function and in a loop given by the number of variables.
cnames<-NULL
for(c in 1:nv){cnames<-c(cnames,paste(c("Mean","Qual","Min","Qual","Max","Qual"),rep(c,6)))}
names(dd)<-c("Date",cnames)
Once the data are entered into R, they should first be interrogated to ascertain the number of missing values. The summary command produces the following result for this dataset.
library(xtable)
tab<-xtable(summary(dd))
print(tab, type="html")
| | Date | Mean 1 | Qual 1 | Min 1 | Qual 1 | Max 1 | Qual 1 |
|---|---|---|---|---|---|---|---|
| 1 | 1/01/00: 1 | Min. : 0 | Min. : 9.0 | Min. : 0 | Min. : 9.0 | Min. : 0 | Min. : 9.0 |
| 2 | 1/01/01: 1 | 1st Qu.: 27 | 1st Qu.: 9.0 | 1st Qu.: 24 | 1st Qu.: 9.0 | 1st Qu.: 30 | 1st Qu.: 9.0 |
| 3 | 1/01/02: 1 | Median : 79 | Median : 9.0 | Median : 73 | Median : 9.0 | Median : 86 | Median : 9.0 |
| 4 | 1/01/03: 1 | Mean : 452 | Mean : 14.6 | Mean : 239 | Mean : 14.6 | Mean : 872 | Mean : 14.6 |
| 5 | 1/01/04: 1 | 3rd Qu.: 213 | 3rd Qu.: 9.0 | 3rd Qu.: 187 | 3rd Qu.: 9.0 | 3rd Qu.: 238 | 3rd Qu.: 9.0 |
| 6 | 1/01/05: 1 | Max. :156586 | Max. :200.0 | Max. :30057 | Max. :200.0 | Max. :329157 | Max. :200.0 |
| 7 | (Other):20064 | NA’s :216 | | NA’s :216 | | NA’s :216 | |
There are libraries to handle these missing data, but we will not cover this at this time. The na.rm=TRUE option (which instructs R to remove the missing values, or NAs, before computation) will be included in functions where these missing data would cause a problem.
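For example, using the daily mean discharges (column 2 of dd) imported above:

mean(dd[,2])              #returns NA because of the missing values
mean(dd[,2],na.rm=TRUE)   #the NAs are removed before the mean is computed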
Several libraries are available to facilitate the relatively easy plotting of hydrological timeseries data. These are zoo, hydroTSM, lubridate and lattice. The zoo package is designed to create timeseries data, which is important for hydrological analyses. The first requirement to create a zoo object is to create an index of dates (and time, if relevant). Even though the dates are included in the imported file, they are not necessarily read as such. They may be coerced into dates using date_ind<-as.Date(dd$Date,format="%d/%m/%Y",origin="1970-01-01"). However, if this is not successful, you can read the first and last dates using a similar function, and create a sequence from this first to last date in increments of one day.
In the data frame previously created for discharge, the Mean, Min and Max discharge values are in columns 2, 4 and 6.
(Columns 3, 5 and 7 contain the quality codes for the mean, min and max respectively.) If other variables are included, they are accessed by adding 6 to these column numbers. Using these columns, the zoo command creates a timeseries object with the associated date index. The hydroTSM library’s plot command will produce a line graph of all the time series in the zoo object.
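A minimal sketch of these two steps, assuming the dd data frame created above and that the dates coerce successfully (otherwise create the date_ind sequence from the first to the last date, as described above):

library(zoo)
#coerce the date strings to Date objects to use as the time index
date_ind<-as.Date(dd$Date,format="%d/%m/%Y",origin="1970-01-01")
#columns 2, 4 and 6 hold the Mean, Min and Max discharges; zoo pairs them with the date index
zd<-zoo(dd[,c(2,4,6)],date_ind)
#plotting the zoo object produces a line graph of each of the three series
plot(zd,xlab="",main=varn[1])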
Further investigation of these data can be conducted using the lattice library, particularly the splom function, which produces a matrix plot of the three parameters (Mean, Min and Max) for each variable (in this case, Discharge (ML/Day)) in our zoo object. The matrix contains two plots for each pair of parameters, one with the x and y axes swapped relative to the other. This is instructive for investigating the relationships between the parameters. In this case, the plot that demonstrates the strongest relationship is between the Mean and Maximum flows, indicating the Maximum flows have a large effect on the Mean values.
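A minimal sketch of the matrix plot, using the zoo object created above:

library(lattice)
#scatterplot matrix of the Mean, Min and Max daily discharges
splom(coredata(zd),main=varn[1])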
A boxplot is useful for investigating the behaviour of the stream over certain time periods, for example each month of the year. The daily2monthly function in the hydroTSM library is used to first create monthly sums from the daily data, using the m <- daily2monthly(zd[,3], FUN=sum, na.rm=TRUE) command. A vector with the three-letter abbreviations of the month names is extracted using the cmonth <- format(time(m), "%b") command, and these are then converted to factors using the command: months <- factor(cmonth, levels=unique(cmonth), ordered=TRUE). (The use of ordered tells R that the levels of the factor have a natural order in which they appear.) These data are then used to create the boxplot using boxplot(coredata(m) ~ months, col="lightblue", main="ML/month"). The `m` object is a timeseries, and `coredata(m)` extracts the values of interest, without the related dates.
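Collected together, the boxplot commands described above are:

library(hydroTSM)
#aggregate the daily flows (column 3 of the zoo object) into monthly totals
m <- daily2monthly(zd[,3], FUN=sum, na.rm=TRUE)
#extract the three-letter month abbreviations and convert them to an ordered factor
cmonth <- format(time(m), "%b")
months <- factor(cmonth, levels=unique(cmonth), ordered=TRUE)
#boxplot of the monthly totals, grouped by month
boxplot(coredata(m) ~ months, col="lightblue", main="ML/month")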
The boxplot is a representation of all available data, showing the median (the thick black line inside the blue box) and the interquartile range (represented by the blue box). The whiskers extend to the most extreme data point that is no more than 1.5 times the length of the box away from the box. The extreme points are indicated by the open circles.
The boxplot graphically shows that the wettest months are generally from December through to March, though it is evident that extreme flows have occurred in April, July and October. It is also evident that the highest flow occurred in April. Applying the command index(zd[which.max(zd[,3]),]) gives the date of this occurrence as 1989-04-25 12:00:00.
Like the previous boxplots, flow duration curves are a summary of the hydrology of a stream, but provide added information in terms of the percentage of time that flows of given magnitudes are exceeded. The following code produces a plot showing the flow duration curve for all the data (blue curve). This curve can be produced with the fdc (flow duration curve) function from the hydroTSM library. In the example provided, all available daily data have been included in the construction of this curve.
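A minimal sketch, assuming the fdc function from hydroTSM is used to construct the curve from all the available daily data:

library(hydroTSM)
#flow duration curve: % of time each flow is equalled or exceeded, plotted with a log flow axis
fdc(zd[,3], main="Flow duration curve", ylab=varn[1])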
Any time step can be used to construct the flow duration curve, but a daily time step will provide the most information. The graph can be used in two ways:
An estimate can be determined through visual inspection, or through the application of R functions. For example, the quantile function will return the flow that is exceeded a certain percentage of time: quantile(zd[,3],1-tfe,na.rm=T). In this case, the tfe variable is the proportion of time that flow is equalled or exceeded. The flow duration curve is the cumulative distribution function with the axes swapped around, so the x-axis (% of time equalled or exceeded) represents 1 minus the cdf probabilities; therefore the input to the quantile function is 1-tfe. So the flow that is equalled or exceeded 20% of the time is ceiling(quantile(zd[,3],0.8,na.rm=T)), or 315.
To find the % of time a certain flow is equalled or exceeded the Empirical Cumulative Distribution Function (ecdf) function can be used. For example, a flow of 315 is equalled or exceeded 20% of the time. The blue arrows demonstrate this on the figure below.
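For example, with tfe set to 0.2 (flow equalled or exceeded 20% of the time):

tfe<-0.2
#flow that is equalled or exceeded 20% of the time (the 80th percentile of the daily flows)
ceiling(quantile(zd[,3],1-tfe,na.rm=TRUE))
#percentage of time a flow of 315 ML/day is equalled or exceeded: 1 minus the empirical cdf at 315
1-ecdf(na.omit(coredata(zd[,3])))(315)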
The flow duration curve can also be applied to a subset of data, eg. data from months with high flows. The example below shows this for the high flow months (Jan - Feb) and low flow months (Aug-Sep). Examples of applying the quantile function to these subsets of data are also shown, demonstrating the changing statistical characteristics of this stream in wet versus dry times.
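A sketch of one way these monthly subsets might be created and interrogated with the quantile function:

#subset the daily flows by month: high flow months (Jan-Feb) and low flow months (Aug-Sep)
wet<-zd[format(time(zd),"%m") %in% c("01","02"),3]
dry<-zd[format(time(zd),"%m") %in% c("08","09"),3]
#flow equalled or exceeded 20% of the time in each subset
quantile(coredata(wet),0.8,na.rm=TRUE)
quantile(coredata(dry),0.8,na.rm=TRUE)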
A steep section at the top end of the flow duration curve indicates a relatively small number of very high flows, which is confirmed by the boxplots shown previously. According to Ladson (2011, p.143) a steep section at the low flow end of the curve indicates that baseflow is insignificant. Baseflow will be investigated in more detail in a later module.
Exercise: Bring in data from the Victorian Water Monitoring Portal for Joyces Creek at Strathlea Gauge no. 407320 and compare with the graphs in Ladson, p.142.
Much of the material presented in this part of the module is from the revised Book IV of ARR, Estimation of Peak Discharge.
Previously, flow duration curves were used to obtain the percentage of time that a flow is exceeded. As previously noted, these percentiles are used in the reverse sense to that normally applied by statisticians. For example, the flow that is exceeded 10% of the time (a low percentile) will be a relatively high flow, as flows are below this value 90% of the time.
The flow duration curve is useful for determining the ‘yield’ of a catchment (eg. how reliable a flow regime is for water supply); however, for determining design flows for a given level of risk, a flood probability model is required. In this flood probability model, flow peaks (from a certain time period, usually annual) are considered as random variables. The procedure fits a statistical (probability) model to a series of flood peaks, and relies on the assumption that the data are independent.
Hydrologists are generally interested in the behaviour of streamflows at the extreme ends of the spectrum: droughts and floods. The magnitude of these extreme events is inversely related to their frequency of occurrence. In relation to floods, the peak of the river flow in some defined period (usually one year) is considered to be a random variable (q) (“the measurement associated with the next physical sample from the population” (Millard, 1998)). The values taken by a random variable can be discrete or continuous. A discrete random variable has a finite number of values; for example, tossing a single coin can result in only two values: Heads or Tails. A continuous random variable can take on an infinite number of values (eg. the peak of the river flow, q).
The set of random flood peaks is represented by an upper-case Q, while a specific realisation is represented by the lower case (e.g. q). The first representation of the probability model of the data to be analysed can be obtained from the histogram of that data. The probability density function (pdf) that we previously investigated is the limiting form of the histogram of q as the number of samples approaches infinity.
The frequency of occurrence is represented by either the AEP or the ARI. The recurrence interval is itself a random variable; the ARI is its average or expected value, ie. the average time between occurrences of events of a given magnitude (or AEP).
The frequency analysis of flood data can be undertaken using an Annual Peak or a Peak over Threshold (PoT) timeseries. By convention, the AEP is generally related to the annual series, and ARI to the PoT. This latter series is also known as the partial duration series. Any frequency analysis of data timeseries requires an assumption of independence.
The objective of flood frequency analysis is to define the relationship between floods of a certain magnitude and their frequency of occurrence (expressed as AEP or ARI). In contrast to the flow duration curve, which is constructed from all the data, flood frequency analysis extracts the peaks from each year (annual flood series) or the independent peaks over a specified threshold (partial series).
The first step in investigating the behaviour of a particular river is to plot the daily data. In the following analyses, daily data from the Mary River at Bellbird Creek gauging station (Station Number: 138110A) are used. Once the daily data are read in, the maximum discharges are extracted and plotted over the time series.
The daily data are read into R in the usual way (eg. read.csv). As in the previous code, the headings are stripped and the variable names extracted before the data are read in. In this case, the data have been stored in the variable called dd. A timeseries is again created using zoo and a daily timestep. For this example, the daily maximum values are extracted and the zoo function used to create the timeseries, zd.
In terms of analysis, the annual series may seem to be the easiest to extract. However, the time of the year in which the majority of the floods occur influences whether the assumption of independence is met. This independence means that each flood peak included in the analysis is physically independent of the other floods and of their causes. For example, if a large rainfall event causes a flood and leaves very wet antecedent conditions, and is then followed by another large rainfall event, the subsequent flood peak may not be independent of the previous flood.
The months that define the water year are therefore very important. For example, if the calendar year is used in Qld, the largest floods in two consecutive years may not be independent; the first flood may have occurred in late Dec and the second in early Jan, and they may in fact both be part of the same flood.
Before we extract the annual peaks, the timeseries is adjusted so that discharges are assigned to the relevant water year. If the water year commences in October, adding 3 months to the time index creates a time series in which the month assigned the value 1 is actually the tenth month (ie. October). This is very easy to achieve in R using a zoo object, and there is no need to use if statements to account for the Dec to Jan transition.
The code to create this index of water years is as.numeric(as.yearmon(time(zd)) + 3/12) %/% 1. This adjusts the time value in the zoo object zd by +3 months (eg. October becomes the 1st month of the water year). The as.yearmon function converts each date to a year-month value, and as.numeric coupled with %/% 1 returns the integer corresponding to the correct ‘water year’.
library(hydroTSM)
stn<- "138110A"
#fn is a variable that is created by pasting together the directory (change to point to your local directory), the station number (variable stn) and the file extension (.csv).
fn<-paste("/Users/helenfairweather/Documents/SunshineCoast/USC/Teaching/ENG330/2014/Data/",stn,".csv",sep="")
#fn<-paste("//usc.internal/usc/home/hfairwea/Desktop/Teaching/ENG3302014Sem2/RHTML/data/Daily/",stn,".csv",sep="")
#extract the first part of the file to get the variable name.
#before reading in the data determine the number of columns in the csv file. This is stored in the ncol variable.
ncol <- max(count.fields(fn, sep = ","))
nrows<-length(count.fields(fn, sep = ","))
# The first column always contains the date, and the last column the metadata.
# For every variable of interest, the file contains six values:
# the mean value, min value and max value, plus the corresponding quality code for each of these.
# Therefore the ncol minus 2 (to remove the date and metadata columns) divided by 6 will give the number of variables,
# which is stored in nv.
nv<-as.integer((ncol-2)/6)
#To extract the variable names, the first 4 rows of the data file are read into the hd variable
hd<-read.csv(fn,header=FALSE,nrows=4,sep=",")
#An empty vector to store the variable names is created and a loop used to capture these names for the number of variables (nv). The first variable name is sourced from row 3, column 2, and each subsequent variable name from the same row (3), but incremented by 6 columns.
varn<-character(0)
cn<-2
for(i in 1:nv)
{
varn[i]<-as.character(hd[3,cn])
cn=cn+6
}
nrws<-length(count.fields(fn, sep = ","))
#The data are extracted from the file by using the read.csv function and skipping the first four rows.
ddtemp<-read.csv(fn,header=FALSE,skip=4,sep=",",nrows=nrws-6)
#The extra column at the end of the file is removed in the next line of code.
dd<-ddtemp[1:(nrow(ddtemp)),1:(ncol-1)]
#If the two extra lines at the end of the file are not removed, the following code is used in lieu of above.
#dd<-ddtemp[1:(nrow(ddtemp)-2),1:(ncol-1)]
#Names are assigned to the columns using the concatenate function and in a loop given by the number of variables.
cnames<-NULL
for(c in 1:nv){cnames<-c(cnames,paste(c("Mean","Qual","Min","Qual","Max","Qual"),rep(c,6)))}
names(dd)<-c("Date",cnames)
#date_ind<-as.Date(dd$Date,format="%d/%m/%Y",origin="1970-01-01")
#stdate<-as.Date(substr(toString(dd$Date[1]),1,nchar(toString(dd$Date[1]))-5),format="%d/%m/%y")
#windows
#stdate<- as.POSIXct(dd$Date[1],format="%H:%M:%S %d/%m/%Y")-100*365.25
#findate<-as.POSIXct(dd$Date[nrow(dd)],format="%H:%M:%S %d/%m/%Y")
#mac
#the start date is shifted back by 100 years (in days) to correct the century when the two-digit year format places it in the 2000s rather than the 1900s
stdate<- as.Date(dd$Date[1],format="%d/%m/%y")-100*365.25
findate<-as.Date(dd$Date[nrow(dd)],format="%d/%m/%y")
date_ind<-seq(stdate,findate,by="day")
#can automate the number of column numbers (eg. c(2, 4, 6)) based on the number of variables
zd<-zoo(dd[,c(2,4,6)],date_ind)
nma<-3
zd$yr<-as.numeric(as.yearmon(time(zd)) + nma/12) %/% 1
#peaks extracted using water year. To adjust the start of the water year change the nma (number of months to adjust) by 12 - the month that represents the starting year
#minyr<-year(index(zd[1]))
#maxyr<-year(index(zd[nrow(zd)]))
minyr<-as.integer(zd$yr[1])
maxyr<-as.integer(zd$yr[nrow(zd)])
nyrs<-as.integer(maxyr-(minyr-1))
#pre-allocate vectors for the annual peak values and the dates on which they occur
maxv<-numeric(nyrs)
indd<-as.Date(rep(NA,nyrs))
for(i in 1:nyrs)
{temp<-subset(zd,zd$yr==(i+minyr-1))   #data for one water year
maxv[i]<-max(temp[,3],na.rm=T)   #annual peak taken from the daily Max column
indd[i]<-as.Date(index(temp[which.max(temp[,3])]))   #date on which the peak occurred
}
plot(zd[,3],col="blue",xlab="",ylab=varn[1],main="Daily Maximum Flows \nAnnual Peaks shown in red dots")
points(indd,maxv,col="red",pch=20)
abline(v=seq(index(zd[1]),index(zd[nrow(zd)]),by="year"),col="grey")
(Figure: daily maximum flows, with the annual peaks shown as red dots)
The annual peaks extracted in the previous code are used to further investigate the probability distribution characteristics of the annual series. A histogram of the annual peak flows shows the positively skewed nature of the annual series for this gauging station, which is to be expected. Note the difference between the histogram produced using all the monthly values and the one produced using the peaks only - the skew isn’t as high. Think about why this might be the case.
library(lattice)
wyv<-aggregate(zd, zd$yr, max,na.rm=TRUE)
histogram(coredata(wyv[,3]),xlab=varn[1],main="Histogram of Annual Peaks")
(Figure: histogram of annual peaks)
To further analyse and plot the probability characteristics of the annual series, we need to construct a cumulative probability distribution function that fits the annual peaks. This fit can be estimated empirically (by plotting the peak discharges against an estimated probability) or theoretically (by undertaking a Goodness of Fit (GOF) test).
Fitting an empirical distribution is relatively straightforward, now that we have our annual peaks extracted for the required water year. The empirical distribution is also known as a probability plot, where AEP estimates are plotted (plotting position) against each of the peaks.
The first step in calculating the probability plotting position is to rank the peak discharges from lowest to highest. The plotting position (or estimated AEP) for each of these discharges is then calculated using the following formula: \[P_{(m)}=\frac{m-\alpha}{n+1-2\alpha}\]
where \(P_{(m)}\) is the AEP, \(m\) is the rank of the peak discharge, \(n\) is the number of peak discharges, and \(\alpha\) is a constant whose value is selected to preserve desirable statistical characteristics.
Various values of \(\alpha\) are provided for a range of probability distributions. In the implementation that follows, \(\alpha=0.4\) is used, as is the case in Ladson (2008). This is known as the Cunnane formula:
\[P_{(m)}=\frac{m-0.4}{n+0.2}\]
If we assume a log-normal distribution, we can plot the peak discharges (on a log scale) against the standard normal deviate, \(z\), using the qnorm function, which gives the inverse of the normal cumulative distribution.
rind<-rank(maxv) #ranked indices
n<-length(maxv) #number of discharges in the annual series
pp<-(rind-0.4)/(n+0.2)
plot(qnorm(pp),maxv,log="y",col="blue",pch=20,xlab="z",ylab=varn[1])
(Figure: annual peak discharges on a log scale, plotted against the standard normal deviate z)
Ladson (2008) provides an explanation for plotting the equivalent AEP labels on the x-axis in Excel. We will go through this in the Lecture, but easier still is to use R to fit a family of distributions and test the GOF. This is more advanced than time allows for this course, but it is something you might like to pursue in your own time.
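If you do want to explore this, a minimal sketch (assuming the MASS package is available, and noting this is not the method prescribed for this course) is to fit a log-normal distribution to the annual peaks and apply a Kolmogorov-Smirnov test. Because the parameters are estimated from the same data, the resulting p-value is approximate only.

library(MASS)
#fit a log-normal distribution to the annual peaks by maximum likelihood
fit<-fitdistr(maxv,"lognormal")
#Kolmogorov-Smirnov goodness of fit test against the fitted distribution
ks.test(maxv,"plnorm",meanlog=fit$estimate["meanlog"],sdlog=fit$estimate["sdlog"])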
The plotting position formula should not be used to obtain the AEP of an observed flood. This is done after the probability distribution has been estimated.
Note that Ladson (2008) provides detailed instructions for the fitting of the LP3 distribution with product moments, which is not recommended in the revised chapter of AR&R.