1 Introduction

This module will introduce some basic probability concepts and procedures used in Hydrological Frequency Analysis. The methods that have been used historically for flood frequency analyses were developed prior to the advent of the computing power that we now have access to. Consequently, the methods employed to analyse flow data are becoming more sophisticated. The R environment is particularly useful for understanding this increased complexity.

2 Objectives

When you have completed this module of the course you should be able to:

3 Resources

The revision of AR&R, particularly Book 3, Chapter 2, Flood Frequency, has produced material that is particularly valuable for understanding the new methods.

William King from Coastal Carolina University has produced a handy tutorial on Probability functions in R.

The subject is also covered well by Ladson (2011) in Chapter 6: Floods and Flood Frequency Analysis.

For some basic statistics and probability theory, see the book by G Jay Kearns. This book can be accessed through R by three simple commands:

install.packages("IPSUR")
library(IPSUR)
read(IPSUR)

4 Probability Theory

In flood frequency analysis we are primarily interested in analysing the probability of floods of certain sizes, and therefore we are generally only interested in the flood peaks. Given the stochastic nature of rainfall, and of many of the processes that contribute to flooding, the flood peaks are treated as random variables, and a probability framework is required in order to assess risk.

One of the primary assumptions in applying the flood probability framework is that each flood peak is statistically independent of all other flood peaks. This has generally been shown to be the case.

To understand probabilities, and hence to be able to work with the probability of the occurrence of hydrological events, we need to be able to describe and compute probabilities. This is usually done using a ‘probability density function’, or pdf.

The flood probability model is generally described by its probability density function (pdf). For sample data, the pdf can be thought of as a smoothed curve fitted to a histogram of measurements. There are many forms of pdf, depending on the shape of this curve. When the underlying data fit a pdf that resembles a bell-like shape, the distribution is called ‘normal’ (or Gaussian). The tutorial by William King listed in the Resources section is an excellent resource for explaining probability distributions in R.

There are several distributions that are used in hydrology. Each of these curves can be described by a mathematical equation, characterised by one or more parameters (e.g. location, scale and skew) which usually have some physical meaning in hydrology. For example, the normal distribution is characterised by the mean (\(\mu\)) and standard deviation (\(\sigma\)). Though there are very few applications of the normal distribution in hydrology, it is instructive to introduce probability concepts through its use.

The values of the standard normal distribution (the special normal distribution which has \(\mu = 0\) and \(\sigma = 1\)) can be generated by using the dnorm function, which returns the height of the normal curve at values along the \(x\)-axis.

Variables which follow a standard normal distribution are usually designated \(z\). In general, capital letters represent the random variable (in this case \(Z\)), and lower-case letters represent an outcome or realisation of that random variable (eg. \(z\)).

Wrapping the curve function around this result produces the familiar ‘bell-shaped’ curve. The point on the curve where z = 1 is shown by the green circle and line.

curve(dnorm(x), xlim=c(-3,3), col='red', lwd=3, xlab="Z", ylab="density", main="Probability Density Function")
points(c(1,1), c(0,dnorm(1)), col='green', pch=19, cex=2, type="b")

[Figure: the standard normal probability density function, with the point at z = 1 marked by the green circle and line]

#lines(c(1,1), c(0,dnorm(1)) ,col='green', lwd=2) #changed to type="b" in above function, which plots both points and line.

To find the density value of a particular value along the \(x\)-axis (ie. the height of the curve at that point), the value is passed to the dnorm() function:

dnorm(1)
## [1] 0.242

The probability that \(z\) will be greater than a particular value is given by the area under the pdf to the right of the \(z\)-value on the \(x\)-axis. You may remember looking up these ‘\(Z\)-scores’ (values on the \(x\)-axis) in a table to return the \(p\)-value (region under the normal curve) when conducting hypothesis tests. The good news is that you no longer have to do that; you only need to enter the correct command into R!

The area under the curve (shaded in blue below) from a specified \(z\) value on the \(x\)-axis to the maximum value on the \(x\)-axis (which is \(\infty\) for the normal distribution) gives the probability of the random variable \(Z\) being greater than or equal to the particular value \(z\) being examined; ie. it gives \(P(Z \geq z)\). The total area under the pdf curve is \(1\).

[Figure: standard normal pdf with the area under the curve to the right of a given z value shaded in blue, representing \(P(Z \geq z)\)]

You can also specify different values for \(\mu\) and \(\sigma\) in the dnorm function (rather than 0 and 1 respectively). The following pdf is for the case with \(\mu = 10\) and \(\sigma = 2\):
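The code used to generate this figure is not reproduced here; a minimal sketch along the lines of the earlier curve call (with illustrative axis limits) would be:

curve(dnorm(x, mean=10, sd=2), xlim=c(2,18), col='red', lwd=3, xlab="x", ylab="density", main="Probability Density Function")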

[Figure: probability density function for a normal distribution with \(\mu = 10\) and \(\sigma = 2\)]

The cumulative distribution function (cdf) is the integral of the pdf (ie. the accumulated area under the curve), or the total accumulated probability up to a certain value. It is represented as \(P(z)= \int_{-\infty}^{z}{p(u)\,du}\). Therefore, like the pdf, there is a mathematical curve that describes the cdf. The cdf of ‘normally’ distributed data takes on the familiar s-shape and can be generated for the standard normal distribution (\(\mu = 0\) and \(\sigma = 1\)) using the pnorm function inside the curve function.

It stands to reason that if a value (\(z\)) is passed to the pnorm function, it will return the value on the cdf curve, which is equivalent to the area under the pdf curve to the left of \(z\).
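As an illustration (a minimal sketch; the exact plotting code for the figure below is not shown), the cdf can be drawn and interrogated as follows:

curve(pnorm(x), xlim=c(-3,3), col='blue', lwd=3, xlab="Z", ylab="cumulative probability", main="Cumulative Distribution Function")

pnorm(1)      # P(Z <= 1) = 0.841, the area under the pdf to the left of z = 1
1 - pnorm(1)  # P(Z >= 1) = 0.159, the area under the pdf to the right of z = 1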

[Figure: cumulative distribution function (s-shaped curve) for the standard normal distribution, generated with pnorm]

4.1 Probability Functions

The probability theory that is used to estimate the probability of floods of certain magnitudes occurring draws on sampling theory (see the book by G Jay Kearns) and the theory of Bernoulli Trials (see page D-21 in this resource).

The binomial distribution represents the number of successes or failures in repeated trials (eg. tossing a coin 100 times and recording the occurrence of Heads or Tails). As we are dealing with discrete events (that is, one Head or two Heads, but not 2.8034 Heads), the representation of the probability function is called a probability mass function. The graphical representation of a mass function is usually a column chart (as opposed to a smooth curve for a continuous pdf). Essentially these theories result in the equation that allows the probability of k successes in n trials to be calculated (the binomial formula):

\(P(k \text{ successes in } n \text{ trials})=\frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}\)

The Annual Exceedance Probability (AEP) is the probability (p) that is of interest to us in the previous equation. We can apply the above equation to determine the probability of occurrence of various scenarios. For example, if we consider peaks that occur in an annual time period, we can calculate the probability that exactly three 50 year floods will occur in a 100 year period.

If we consider a series that represents the maximum floods over a certain period (eg. annual) or above a threshold (eg. Peak over Threshold or PoT), the magnitude of these floods can be related to the probability of exceedance (in the case of annual peaks) or their average recurrence interval (in the case of PoT). Fortunately there is a relationship between the probability of exceedance (AEP, or \(p\)) and the average recurrence interval (ARI): \(ARI=\frac{1}{p}\).

Therefore we need to develop skills in being able to calculate these exceedance probabilities.

4.1.1 Exceedance Probabilities

If we wanted to work out the probability that exactly three 50-year floods will occur in a 100-year period, we could of course apply the above equations. For example, in this situation, our k=3 and n=100. Our ARI = 50, therefore \(p=\frac{1}{50}\). Plugging these values into our first equation yields:

\(P(3 \text{ in } 100)=\frac{100!}{3!(100-3)!}(\frac{1}{50})^{3}(\frac{49}{50})^{97} = 0.182\); therefore there is an 18.2% chance that three 50-year floods will occur in any 100-year period.

At this stage we don’t know the magnitude of that flood for a given area. This will come through the analysis of actual streamflow data.

Applying this equation every time we wish to calculate these probabilities seems onerous; fortunately R has a built-in function, dbinom(k,n,p), that calculates the probability of k successes in n trials. For the example above, dbinom(3,100,1/50) returns 0.1823, ie. 18.23%.
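The calculation can be checked against the binomial formula directly:

choose(100, 3) * (1/50)^3 * (49/50)^97   # direct application of the binomial formula
dbinom(3, 100, 1/50)                     # same result: 0.1823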

But what if we wanted to know the probability of 3 or more 50-year floods occurring in a 100-year period? We know that the sum of the probabilities of all possible numbers of 50-year floods occurring in 100 years must equal unity (this includes the probability of there being no 50-year events in a 100-year period), so \(P(3 \text{ or more in } 100) = P(3 \text{ in } 100) + P(4 \text{ in } 100) + \dots + P(100 \text{ in } 100)\).

However, it would take a long time to substitute each of these combinations into the first equation. We could take a different tack and just calculate the probabilities of 0 in 100, 1 in 100 and 2 in 100 and subtract the sum of these probabilities from 1: \(P(3 \text{ or more in } 100) = 1 - [P(0 \text{ in } 100) + P(1 \text{ in } 100) + P(2 \text{ in } 100)]\). This also seems a bit onerous, but fortunately we can again apply the dbinom(k,n,p) function and wrap a sum function around it: 1-sum(dbinom(0:2,100,1/50)) = 0.3233. In R we could also easily apply the first option directly, eg. sum(dbinom(3:100,100,1/50)). You will see this gives the same answer.
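Both approaches can be checked in R:

1 - sum(dbinom(0:2, 100, 1/50))   # 1 minus P(0, 1 or 2 in 100) = 0.3233
sum(dbinom(3:100, 100, 1/50))     # summing P(3 in 100) ... P(100 in 100) gives the same answer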

To understand the shape of this distribution you can wrap a plot function around the dbinom function for the points from 0:100. (The binomial distribution is computed at discrete values, such as 0, 1, 2, … 100; for example, in 100 years we can have 0 years or 1 year or two years and so on with a flood, but it makes no sense to talk about 2.45604 years having a flood. This means the distribution is not continuous.) The blue dot in the following graph corresponds to the probability of 2 occurrences of 50-year floods in 100 years. It makes sense that this has the highest probability, given that, on average, we could expect two 50-year floods in any 100-year period.

x <- seq(0, 100, 1)
plot(x, dbinom(x, 100, 1/50), col="red", lwd=3, type="h", las=1,
     xlab="Number of Occurrences of 50 Year Floods in 100 Years",
     ylab="Probability of Occurrence",
     main="Probability Distribution for number of \n Occurrences of 50 Year Floods in 100 Years")

points(2, dbinom(2, 100, 1/50), pch=19, col="blue")

[Figure: probability distribution for the number of occurrences of 50-year floods in 100 years, with the probability of 2 occurrences marked by the blue dot]

Exercise: If you were asked to design a bridge to be safe during a 50 Year flood, what is the chance that the design flood will be exceeded four or more times in a 100 year period?

In design, however, we are most interested in the probability of occurrence of 1 or more ‘failures’ in n years. Using the previous logic, this equates to:

\(P(1 \text{ or more in } n) = 1 - P(0 \text{ in } n)\)

Applying the previous relationships yields:

\(P(1 \text{ or more in } n) = 1-\frac{n!}{0!(n-0)!}p^0(1-p)^{n}=1-(1-p)^{n}\) (since \(0! = 1\) and \(p^0 = 1\)). Therefore the probability of one or more 2000-year floods in a 30-year period is given by: \(P(1 \text{ or more in } 30) = 1- (1-\frac{1}{2000})^{30} = 0.0149\), or 1.49%.

Again we can use the dbinom function to calculate this: 1-dbinom(0,30,1/2000) = 1.49%.
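In R:

1 - (1 - 1/2000)^30          # analytical form: 0.0149
1 - dbinom(0, 30, 1/2000)    # equivalent calculation using dbinom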

Exercise: What is the probability of one or more 200 year floods in a 20 year period?

Now consider the case of a project with a design lifetime of 100 years. The design discharge we are interested in is a discharge with a 100-year return period. What is the probability of ‘failure’ during the lifetime of the project? In this case n=100, the return period is also 100 years, and so the annual exceedance probability is 0.01. The probability of failure is therefore 1 - dbinom(0,100,0.01), which equates to a 63.4% probability of a 100-year flood occurring in any 100-year period!

What is the probability of 100-year floods occurring in two consecutive years? In this case our \(n=2\), and we can either directly apply dbinom(2,2,0.01) or multiply the probability of one 100-year flood in 1 year (ie. \(n = 1\) and \(k = 1\)), dbinom(1,1,0.01), by the same probability (since the events are independent): dbinom(1,1,0.01)*dbinom(1,1,0.01). Therefore the probability (expressed as a percentage) of 100-year floods occurring in two consecutive years is 0.01%: it is very unlikely.
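Both calculations give the same result:

dbinom(2, 2, 0.01)                        # 1e-04, ie. 0.01%
dbinom(1, 1, 0.01) * dbinom(1, 1, 0.01)   # same result: independent annual events multiply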

We can also turn these relationships around to determine the design return period for an acceptable risk of failure. For example: a hydraulic construction site on a stream needs the protection of a coffer dam/tunnel for a period of three consecutive years. If the total acceptable risk of failure during this three-year period is estimated to be 0.5%, what must be the design return period for the protective structure? In this case we know the total risk of failure, \(P = 1 - (1-p)^{n}\), and need to solve for the annual exceedance probability \(p\). Given that P=0.005 (ie. 0.5% acceptable risk of failure) and \(n=3\), the equation can be rearranged to give the exceedance probability (p) = 0.167%. The ARI is the reciprocal of this: \(ARI=\frac{1}{p}\) = 599 years. There is no doubt an inverse binomial function in R to calculate this: can you find one?
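A minimal sketch of this rearrangement in R (the variable names are illustrative only):

P <- 0.005              # acceptable total risk of failure over the design period
n <- 3                  # design period in years
p <- 1 - (1 - P)^(1/n)  # rearranging P = 1 - (1 - p)^n for the annual exceedance probability
ARI <- 1/p              # approximately 599 years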

Exercise: The total acceptable risk of failure for a bridge is 5%. Given that this bridge has a design life of 50 years, what is the ARI that the bridge should be designed for?

There are several other statistical distributions that are important for use in hydrology. These can be broken down into different distribution ‘families’. Each of these distributions has a corresponding function in R that enables you to extract a value of the density curve (d), the cumulative distribution curve (p) and a range of other statistical properties. The letters in brackets precede the function names shown in Table 4.1; a short example of this prefix convention is given after the table.

Table 4.1: Distributions common in hydrology with the appropriate function name. Note each function is shown preceded by the letter d, but this can be changed to p (and a few other options) depending on whether the value on the density or cumulative distribution curve is required.

Family          Distribution Name                                             Related R Function
Normal Family   Normal                                                        dnorm
                Log-normal (that is, \(\log X\) has a normal distribution)    dlnorm
Discrete        Binomial                                                      dbinom
Gamma Family    Pearson type III                                              dpearson (available in library SuppDists)
                Log-Pearson type III                                          dlgamma3 (available in library FAdist)
Extreme Value   Type I - Gumbel (for max or min)                              dgumbel (available in library FAdist)
                Type III - Weibull (for min)                                  dweibull
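As a quick illustration of the prefix convention, using the normal distribution from base R:

dnorm(1)        # density: the height of the pdf at x = 1
pnorm(1)        # cumulative probability: P(X <= 1)
qnorm(0.975)    # quantile: the value of x for which P(X <= x) = 0.975
rnorm(5)        # five random values drawn from the distribution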


5 Streamflow Data

The starting point for understanding the driving hydrological processes in a river system or catchment is to extract flow data for the location(s) of interest. In Queensland, flow data are available through the Queensland Government Water Monitoring Data Portal. For current streamflow data, click on Open Stations, select the relevant region (through the map) or basin, and then the gauging station through the links in the frame to the left.

Data can be downloaded for any available timestep up to the length of the data availability, under the ‘Custom Outputs’ tab. For the example below, daily stream discharge (Megalitres/Day) from the Mary River at Bellbird Creek gauging station (Station Number: 138110A) has been used. These data are available from 01/10/1959.

Data are available from the Qld Water Monitoring Data Portal in a structure that requires some manipulation prior to being able to analyse it. Several variables can be downloaded at a time, but the structure is always the same. The first column contains the Date and Time data, then the next six columns contain the Mean, Min and Max values and corresponding quality codes for the first variable. If there is more than one variable, this pattern is repeated again. The last column always contains the metadata about the station and a description of the quality codes. The structure for one variable is shown in the table below.

Time       station number            station number            station number
and        Variable Code             Variable Code             Variable Code
Date       Variable Name and Units   Variable Name and Units   Variable Name and Units
           Mean     Qual             Min      Qual             Max      Qual
1st Date   value    value            value    value            value    value     Sites:
2nd Date   value    value            value    value            value    value     Station Number - Station Name Lat Long Elev
…          …        …                …        …                …        …         From row 8 onwards, there is a description of the Qual codes
Last Date  value    value            value    value            value    value
Miscellaneous text
Miscellaneous text

The following code is commented to assist your understanding of one way to import the data into R.

# The station number is represented by the variable stn.  
# To bring in data for a different location, you only need to change stn, 
# provided the csv file is placed in the same directory.
stn<- "138110A"

#fn is a variable that is created by pasting together the directory (change to point to your local directory), the station number (variable stn) and the file extension (.csv).
fn<-paste("/Users/helenfairweather/Documents/SunshineCoast/USC/Teaching/ENG330/2014/Data/",stn,".csv",sep="")
#fn<-paste("//usc.internal/usc/home/hfairwea/Desktop/Teaching/ENG3302014Sem2/RHTML/data/Daily/",stn,".csv",sep="")



#before reading in the data, determine the number of columns in the csv file. This is stored in the ncol variable.
ncol <- max(count.fields(fn, sep = ","))

# The first column always contains the date, and the last column the metadata.
# For every variable of interest, the file contains six values: 
# the mean value, min value and max value, plus the corresponding quality code for each of these.
# Therefore the ncol minus 2 (to remove the date and metadata columns) divided by 6 will give the number of variables, 
# which is stored in nv.
nv<-as.integer((ncol-2)/6)

#To extract the variable names, the first 4 rows of the data file are read into the hd variable
hd<-read.csv(fn,header=FALSE,nrows=4,sep=",")

#An empty vector to store the variable names is created and a loop used to capture these names for the number of variables (nv).  The first variable name is sourced from row 3, column 2, and each subsequent variable name from the same row (3), but incremented by 6 columns.
varn<-character(0)
cn<-2
for(i in 1:nv)
{
  
  varn[i]<-as.character(hd[3,cn])
  cn=cn+6
  
}

nrws<-length(count.fields(fn, sep = ","))
#The data are extracted from the file by using the read.csv function and skipping the first four rows.
ddtemp<-read.csv(fn,header=FALSE,skip=4,sep=",",nrows=nrws-6)

#The extra column at the end of the file is removed in the next line of code.
dd<-ddtemp[1:(nrow(ddtemp)),1:(ncol-1)]
#If the two extra lines at the end of the file are not removed, the following code is used in lieu of above.
#dd<-ddtemp[1:(nrow(ddtemp)-2),1:(ncol-1)]

#Names are assigned to the columns using the concatenate function and in a loop given by the number of variables. 
cnames<-NULL
for(c in 1:nv){cnames<-c(cnames,paste(c("Mean","Qual","Min","Qual","Max","Qual"),rep(c,6)))}
names(dd)<-c("Date",cnames)

Once the data are entered into R, they should first be interrogated to ascertain the number of missing values. The summary command produces the following result for this dataset.

library(xtable)
tab<-xtable(summary(dd))
print(tab, type="html")
    Date             Mean 1            Qual 1           Min 1            Qual 1           Max 1             Qual 1
1   1/01/00:    1    Min.   :     0    Min.   :  9.0    Min.   :    0    Min.   :  9.0    Min.   :     0    Min.   :  9.0
2   1/01/01:    1    1st Qu.:    27    1st Qu.:  9.0    1st Qu.:   24    1st Qu.:  9.0    1st Qu.:    30    1st Qu.:  9.0
3   1/01/02:    1    Median :    79    Median :  9.0    Median :   73    Median :  9.0    Median :    86    Median :  9.0
4   1/01/03:    1    Mean   :   452    Mean   : 14.6    Mean   :  239    Mean   : 14.6    Mean   :   873    Mean   : 14.6
5   1/01/04:    1    3rd Qu.:   213    3rd Qu.:  9.0    3rd Qu.:  187    3rd Qu.:  9.0    3rd Qu.:   238    3rd Qu.:  9.0
6   1/01/05:    1    Max.   :156586    Max.   :200.0    Max.   :30057    Max.   :200.0    Max.   :329157    Max.   :200.0
7   (Other):20062    NA's   :   216                     NA's   :   216                    NA's   :   216

There are libraries to handle these missing data, but we will not cover this at this time. The na.rm=TRUE option (which instructs R to remove the missing values, or NAs, before computation) will be included in functions where these missing data would cause a problem.

Several libraries are available to facilitate the relatively easy plotting of hydrological timeseries data. These are zoo, hydroTSM, lubridate and lattice. The zoo package is designed to create timeseries data, which is important for hydrological analyses. The first requirement to create a zoo object is to create an index of dates (and time, if relevant). Even though the dates are included in the imported file, they are not necessarily read as such. They may be coerced into dates using date_ind<-as.Date(dd$Date,format="%d/%m/%Y",origin="1970-01-01"). However, if this is not successful, you can read the first and last dates using a similar function, and create a sequence from this first to last date in increments of one day.

In the data frame previously created for discharge, the Mean, Min and Max discharge values are in columns 2, 4 and 6.
(Columns 3, 5 and 7 contain the quality codes for the mean, min and max respectively.) If other variables are included, they are accessed by adding 6 to these column numbers. Using these columns, the zoo command creates a timeseries object with the associated date index. The hydroTSM library’s plot command will produce a line graph of all the time series in the zoo object.
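The full code is not reproduced here; a minimal sketch of creating and plotting the zoo object (assuming the data frame dd and the date index described above, and using the name zd that is referred to later in this module) is:

library(zoo)
library(hydroTSM)

# date index for the timeseries (coerced from the Date column as described above)
date_ind <- as.Date(dd$Date, format="%d/%m/%Y")

# columns 2, 4 and 6 of dd hold the Mean, Min and Max discharge values
zd <- zoo(dd[, c(2, 4, 6)], order.by = date_ind)

# plot all three series against the date index
plot(zd)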

[Figure: time series plot of the Mean, Min and Max daily discharge (ML/day) for the Mary River at Bellbird Creek]

Further investigation of these data can be conducted using the lattice library, particularly the splom function, which produces a matrix plot of the three parameters (Mean, Min and Max) for each variable (in this case, Discharge (ML/Day)) in our zoo object. The matrix that is produced consists of two plots per pair of parameters, each with the x and y axes reversed. This is instructive for investigating the relationships between the parameters. In this case, the plot that demonstrates the strongest relationship is between the Mean and Maximum flows, indicating that the Maximum flows have a large effect on the Mean values.
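One possible way to produce such a scatterplot matrix (a sketch, assuming the zd object from above):

library(lattice)
library(zoo)

# scatterplot matrix of the Mean, Min and Max daily discharge values
splom(~ as.data.frame(coredata(zd)))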

[Figure: scatterplot matrix (splom) of the Mean, Min and Max daily discharge values]

A boxplot is useful for investigating the behaviour of the stream over certain time periods, for example each month of the year. The daily2monthly function in the hydroTSM library is used to first create monthly sums from the daily data, using the m <- daily2monthly(zd[,3], FUN=sum, na.rm=TRUE) command. A vector with the three-letter abbreviations of the month names is extracted using the cmonth <- format(time(m), "%b") command, which are then converted to factors using the command: months <- factor(cmonth, levels=unique(cmonth), ordered=TRUE). Factors are variables in R which take on a limited number of different values.

(The use of ordered tells R that the levels of the factor have a natural order in which they appear.) These data are then used to create the boxplot using boxplot(coredata(m) ~ months, col="lightblue", main="ML/month"). The m object is a timeseries, and coredata(m) extracts the values of interest, without the related dates.
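Putting these commands together (using the zd object from above):

library(hydroTSM)

# monthly totals from the daily maximum flows (column 3 of zd)
m <- daily2monthly(zd[, 3], FUN=sum, na.rm=TRUE)

# three-letter month abbreviations, converted to an ordered factor
cmonth <- format(time(m), "%b")
months <- factor(cmonth, levels=unique(cmonth), ordered=TRUE)

# boxplot of the monthly totals grouped by month
boxplot(coredata(m) ~ months, col="lightblue", main="ML/month")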

The boxplot is a representation of all available data, showing the median (the thick black line inside the blue box) and the interquartile range (represented by the blue box). The whiskers extend to the most extreme data point that is no more than 1.5 times the length of the box away from the box. Points beyond the whiskers are shown as open circles.

[Figure: boxplots of monthly discharge totals (ML/month) for each month of the year]

The boxplot graphically shows that the wettest months are generally from December through to March, though it is evident that extreme flows have occurred in April, July and October. It is also evident that the highest flow occurred in April. Applying the command index(zd[which.max(zd[,3]),]) gives the date of this occurrence as 1989-04-25. The which.max function returns the position of the largest value in the maximum daily flows (column 3 of the variable zd); this position is then used to subset zd (in this case zd[10800]). Wrapping index around this returns the corresponding date.

6 Flow Duration Curves

Like the previous boxplots, flow duration curves are a summary of the hydrology of a stream, but provide added information in terms of the percentage of time that flows of given magnitudes are exceeded. The figure below shows the flow duration curve for all the data (blue curve), produced by the fdc command from the hydroTSM library. In the example provided, all available daily data have been included in the construction of this curve. Note the log scale on the y-axis; as noted earlier, hydrology data rarely fit a normal distribution.
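The exact code is not reproduced here; a minimal sketch using fdc (with the daily series in zd, as above) would be:

library(hydroTSM)

# flow duration curve from all available daily flows (column 3 of zd);
# fdc uses a log scale on the y-axis by default
fdc(zd[, 3], main="Flow Duration Curve", ylab="Q (ML/day)")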

Any time step can be used to construct the flow duration curve, but a daily time step will provide the most information. The graph can be used in two ways:

  1. determine the percentage of time that a given flow is exceeded, and
  2. find the magnitude of flow that is exceeded a certain percentage of time.

An estimate can be determined through a visual inspection, or through the application of R functions. For example, the quantile function will return the flow that is exceeded a certain percentage of time: quantile(zd[,3],1-tfe,na.rm=T). In this case, the tfe variable is the percentage of time that flow is equalled or exceeded. The flow duration curve is the cumulative distribution function with the axes swapped around: the x-axis (% time equalled or exceeded) represents 1 minus the cdf probability. Therefore the input into the quantile function is 1-tfe. So the flow that is equalled or exceeded 20% of the time is ceiling(quantile(zd[,3],0.8,na.rm=T)), or 315.

To find the % of time a certain flow is equalled or exceeded, the empirical cumulative distribution function (ecdf) can be used. For example, a flow of 315 is equalled or exceeded 20% of the time. The blue arrows demonstrate this on the figure below.
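For example (a sketch; the object name Fz is illustrative only):

# flow equalled or exceeded 20% of the time (tfe = 0.2)
tfe <- 0.2
ceiling(quantile(zd[, 3], 1 - tfe, na.rm=TRUE))   # 315

# percentage of time a flow of 315 ML/day is equalled or exceeded
Fz <- ecdf(coredata(zd[, 3]))
1 - Fz(315)                                       # approximately 0.2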

[Figure: flow duration curve for all daily data (log scale on the y-axis), with blue arrows marking the flow of 315 ML/day that is equalled or exceeded 20% of the time]

The flow duration curve can also be applied to a subset of data, eg. data from months with high flows. The example below shows this for the high flow months (Jan - Feb) and low flow months (Aug-Sep). Examples of applying the quantile function to these subsets of data are also shown, demonstrating the changing statistical characteristics of this stream in wet versus dry times.
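The exact code is not shown; one possible way to subset the daily series by month (assuming the zd object above) is:

# high-flow months (January and February) and low-flow months (August and September)
wet <- zd[format(index(zd), "%m") %in% c("01", "02"), 3]
dry <- zd[format(index(zd), "%m") %in% c("08", "09"), 3]

fdc(wet, main="Flow Duration Curve: Jan-Feb")
quantile(coredata(wet), 0.8, na.rm=TRUE)   # flow equalled or exceeded 20% of the time in the wet months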

A steep section at the top end of the flow duration curve indicates a relatively small number of very high flows, which is confirmed by the boxplots shown previously. According to Ladson (2011, p.143) a steep section at the low flow end of the curve indicates that baseflow is insignificant. Baseflow will be investigated in more detail in a later module.

[Figure: flow duration curves for all data and for the high-flow (Jan-Feb) and low-flow (Aug-Sep) months]

Exercise: Bring in data from the Victorian Water Monitoring Portal for Joyces Creek at Strathlea Gauge no. 407320 and compare with the graphs in Ladson, p.142. The format of these data is very similar to the Qld water monitoring data.