Exercise 4: R and rainfall analysis

I. Introductory commands

Some important commands are listed below. The . indicate blanks where you would fill in missing info.

?...                      # Loads help files for a specific command   
?getwd                  # Loads help files for "getwd"   
??getwd               # Searches for the word "getwd" in all the help files   
setwd(.)                # Sets the working directory to .   
getwd()                 # Shows the current working directory   
dir()                     # List the files in the working directory   
ls()                      #  List the variables in your R workspace   

# "#" Indicates a comment.  The text . is ignored   
install.packages(".")           # Gets user-created packages from online repositories   
library(.)                      #  Loads packages that you've installed

II. Running commands

Example 1: Starting R, change directories

Start R studio (“Start”>“RStudio”)

First, put the name of your class directory into a variable “mydir”.

mydir = "C:/myRdir" #  This puts the text for your directory in the variable mydir
  # -->  The text is whatever directory your GEOG576 HW is in.
mydir  # I always type the name of the variable after creating it to see if R did what I expect.

Now, go to your directory using setwd(), and list the contents:

setwd(mydir)
dir()

Example 2: Creating a script

Click “File>New File > RScript” Put the commands from the two boxes above into the Rscript window, highlight the text and click >Run. You can also run the lines by clicking on the first line, then click “>Run” repeatedly to run each line.

III. Data structures in R

A. Vectors and matrices

Vectors and matrices contain only one class of data (either numbers or text, but not both).
A vector is created using “c”:

x = c(10,20,30)
x  # Again, I like to type the variable after creating it to check its contents

## [1] 10 20 30

y = c("Joe","Harry","Larry")
y

## [1] "Joe"   "Harry" "Larry"

What happens if you type x = c(1,“Joe”)?

Find out the class of each object using class():

class(x)

## [1] "numeric"

class(y)

## [1] "character"

You can refer to elements of a vector or matrix with [ ]. What do each of the following do?

x[2]
x/5
x[2]/5
y[3]
paste(y[1],y[2],sep=" ")
paste0(y[1],y[2])
x2 = as.character(x)
x3 = as.numeric(x2)

Boolean operators. What do each of the following do?

x>15
x[x>15]
which(x>15)
length(x<15)
length(x[x<15])

B. Data frames and Reading in data

Data frames are perhaps the most widely used data format in R, especially for data analysis. They can contain a variety of data types, including text and numbers. Data frames also have names for each of their columns.
To experiment with this, we will load some precipitation data and plot it.

1. Load the data from a published google sheet using read.csv.

Here I have published some data in a google sheet to the web using File -> Published to the web, csv option. Doing this gives you a url that can be referenced in R to read in the data.

a. Alternative 1: Load from the url

poway.url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRD8gtm7VHx0OAu6z21tB-6USuGSxDAEA4z3Jls3-da7nZDp765WEO64NjuloXruymYlAJglDosLVWm/pub?gid=39591027&single=true&output=csv"
poway.precip.hourly = read.csv(url(poway.url))

See how the functions read.csv() and url() work. Type:

?read.csv
?read.url

b. Alternative 2: Load from a downloaded csv file

*** Copy and paste this url into your brower: https://docs.google.com/spreadsheets/d/e/2PACX-1vRD8gtm7VHx0OAu6z21tB-6USuGSxDAEA4z3Jls3-da7nZDp765WEO64NjuloXruymYlAJglDosLVWm/pub?gid=39591027&single=true&output=csv

and save the csv file into a directory on your machine.

Now, load the csv. Make sure you used setwd to get into the directory where you put the file:

poway.precip.hourly = read.csv("poway_filled.csv")

Now, look at the variable that you just loaded:

poway.precip.hourly

But this prints out a lot of data–too much! To just see the first, or last, 5 lines of x:

head(poway.precip.hourly)

##   X           Date.time Rainfall.in
## 1 1 1962-10-04 15:00:00        0.01
## 2 2 1962-10-04 16:00:00        0.01
## 3 3 1962-10-04 17:00:00        0.00
## 4 4 1962-10-04 18:00:00        0.00
## 5 5 1962-10-04 19:00:00        0.00
## 6 6 1962-10-04 20:00:00        0.00

tail(poway.precip.hourly)

##             X           Date.time Rainfall.in
## 400028 400028 2008-05-23 10:00:00        0.00
## 400029 400029 2008-05-23 11:00:00        0.00
## 400030 400030 2008-05-23 12:00:00        0.00
## 400031 400031 2008-05-23 13:00:00        0.00
## 400032 400032 2008-05-23 14:00:00        0.12
## 400033 400033 2008-05-23 15:00:00        0.04

2. Look at subsets of the rainfall data.

What do each of the following commands do?:

names(poway.precip.hourly)
poway.precip.hourly[,1]
poway.precip.hourly[,2]
poway.precip.hourly[1:10,]
poway.precip.hourly[1:10,"Precip.inches"]

C. Dates in R.

1. Date and time variables can be created using “strptime”, which creates something called a POSIX date-time object.

For more information about dates and date-time objects, review a Quick introduction to R, and see https://www.stat.berkeley.edu/~s133/dates.html.

Check out the help for strptime:

?strptime

Date.time.POSIX = strptime(poway.precip.hourly$Date.time,format="%Y-%m-%d %H:%M")
head(Date.time.POSIX)

## [1] "1962-10-04 15:00:00 PDT" "1962-10-04 16:00:00 PDT"
## [3] "1962-10-04 17:00:00 PDT" "1962-10-04 18:00:00 PDT"
## [5] "1962-10-04 19:00:00 PDT" "1962-10-04 20:00:00 PDT"

Notice that date.time has the time zone associated with it. The computer guesses the time zone, but you can also specify the time zone with tz (if needed, see ?strptime).

III. Plotting data

Check out the plot command in help using ?plot

A. Plot the time series of rainfall using your date.time object and the rainfall column of x:

plot(Date.time.POSIX,poway.precip.hourly$Rainfall.in,type="l")  # "l" means plot it as a "line"

Now change the axis labels:

plot(Date.time.POSIX,poway.precip.hourly$Rainfall.in,type="l",xlab="",ylab="Hourly precipitation at poway, inches")

IV. Time series management

A. Aggregate your houly data to daily and plot it.

1. Use the aggregate function

dates = format(Date.time.POSIX,"%Y-%m-%d")  # What does this do?
poway.precip.daily = aggregate(poway.precip.hourly$Rainfall.in,by=list(dates),FUN=sum)
names(poway.precip.daily)  # Are these descriptive names?

## [1] "Group.1" "x"

  #  No, so replace with better names:
names(poway.precip.daily) = c("Date.text","Precip.inches")
#  The date is now of class "text", which R cannot read as a Date.  So, use as.Date to convert to a date object:
poway.precip.daily$Date = as.Date(poway.precip.daily$Date.text,"%Y-%m-%d")

2. Plot the time series of daily precipitation.

See the plot command in III.A. Your output should look like this:

3. Sort the precipitation data to find the largest daily events.

daily.sorted = sort(poway.precip.daily$Precip.inches,decreasing=TRUE)
daily.sorted[1:20]

##  [1] 3.40 2.68 2.63 2.50 2.38 2.36 2.32 2.24 2.20 2.10 2.10 2.07 2.04 2.00 1.89
## [16] 1.89 1.84 1.80 1.76 1.73

B. Time series management: hydroTSM

The package “hydroTSM” allows you to explore and summarize hydrological time series data.

1. First, install the package, which you only need to do once on a given computer:

install.packages("hydroTSM")

Now load the library:

library(hydroTSM)

2. Generate time series object and plot

You first need to convert your time series into a “zoo” class, which is a time series object.

poway.daily.ts = zoo(poway.precip.daily$Precip.inches,order.by=poway.precip.daily$Date)
head(poway.daily.ts)

## 1962-10-04 1962-10-05 1962-10-06 1962-10-07 1962-10-08 1962-10-09 
##       0.02       0.00       0.00       0.00       0.00       0.00

plot(poway.daily.ts,xlab="",ylab="Daily rainfall, inches")

See how many days have good data using the “dwi” function (use ?dwi to see what it does)

dwi(poway.daily.ts,out.unit="years")

## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 
##   89  365  366  365  365  365  366  365  365  365  366  365  365  365  366  365 
## 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 
##  365  365  366  365  365  365  366  365  365  365  366  365  365  365  366  365 
## 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 
##  365  365  366  365  365  365  366  365  365  365  366  365  365  365  144

Are there any years you should leave out of the analysis?

3. Annual (water year) precipitation: sum and daily maxima

Unfortunately, hydroTSM doesn’t have a way to summarize on water year. For that, we have to generate a list of dates we want to define the water year, label the time series with those water years, and sum the precipitation values for those labels (yikes!).

Note: I found the solution below at http://stackoverflow.com/questions/22073881/hydrological-year-time-series after googling “R hydroTSM october monthly2annual”)

a. Generate a series of dates to “break” the time series up by the October 1 water year.

# First, generate a vector of "breaks", starting on Oct 1 and going by year.
  # 1962 is the first year of data at Poway.
breaks <- seq(as.Date("1962-10-01"), length=47, by="year")  
  # Length is the number of years you want the breaks for.
# labels.wy = seq(1963,length=43)  # Why do the labels start in 1963?
years.breaks = as.numeric(format(breaks,"%Y"))
labels.water.year = years.breaks[2:length(breaks)]  # Why are we leaving off the first year in the water years label?
poway.precip.daily$wateryear <- cut(poway.precip.daily$Date, breaks,labels=labels.water.year)

Check to make sure the wateryears were assigned correctly by looking at the first 600 lines:

poway.precip.daily[1:400,]  # See the first 400 days of daily data

b. Now calculate the sum of precipitation by wateryear using “aggregate”:

x.annual.wy = aggregate(poway.precip.daily$Precip.inches,by=list(poway.precip.daily$wateryear),FUN=sum)
names(x.annual.wy) = c("Year","Rainfall.inches")
# x.daily$Group.1 is a "factor" class, which is annoying and doesn't plot right.

# check to see the results make sense
head(x.annual.wy)

##   Year Rainfall.inches
## 1 1963            9.22
## 2 1964            7.56
## 3 1965            8.88
## 4 1966           14.16
## 5 1967           12.80
## 6 1968            6.50

c. Plot as a line plot

# Generate a vector of numerical values of year:
years = as.numeric(as.character(x.annual.wy$Year))
  # why do we need as.character?  What happens if you don't use "as.character"?
plot(years,x.annual.wy$Rainfall.inches,type="l",ylab="Annual precip, inches",xlab="Water year")

d. Plot as a histogram

Usually, rainfall is plotted as a histogram, not a line, since the rainfall happens over a year, not at a point.

# Note:  I found this code by googling "R barplot of annual precipitation"
b.plot <- barplot(height = x.annual.wy$Rainfall.inches, names.arg = years,
                  xlab="Year", ylab="Precipitation (in)")

e. Find the maximum daily rainfall for each water year.

I won’t show the code for this…see part b above. Hint: what function should you use instead of the sum?

##   Group.1    x
## 1    1963 1.44
## 2    1964 1.16
## 3    1965 0.88
## 4    1966 2.38
## 5    1967 2.63
## 6    1968 0.88

f. And plot:

g. Do a trend test to see if there has been a trend in daily max, first using both the linear model (lm):

#  Linear model
lm.daily = lm(x.dailymax.wy$Rainfall.inches ~ years)
summary(lm.daily)

## 
## Call:
## lm(formula = x.dailymax.wy$Rainfall.inches ~ years)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92978 -0.40172 -0.05106  0.30430  1.88954 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.060252  13.993587   0.433    0.667
## years       -0.002298   0.007048  -0.326    0.746
## 
## Residual standard error: 0.6346 on 44 degrees of freedom
## Multiple R-squared:  0.00241,    Adjusted R-squared:  -0.02026 
## F-statistic: 0.1063 on 1 and 44 DF,  p-value: 0.7459

And add the line to your plot:

plot(years,x.dailymax.wy$Rainfall.inches,type="l",ylab="Maximum daily precip, inches",xlab="Water year")
abline(lm.daily)

h. Check the normality of the residuals. Note: I found this solution at http://www.r-tutor.com/elementary-statistics/simple-linear-regression/normal-probability-plot-residuals

# Calculate the "standardized residuals
lm.stdres = rstandard(lm.daily) 
qqnorm(lm.stdres, ylab="Standardized Residuals", xlab="Normal Scores", main="Maximum daily rainfall:  Annual maximum series") 
qqline(lm.stdres)

Are the residuals normally distributed?

i. Now do a trend test using Mann-Kendall:

install.packages("Kendall")

library(Kendall)
poway.mk.test = MannKendall(x.dailymax.wy$Rainfall.inches)
summary(poway.mk.test)

## Score =  -49 , Var(Score) = 11144.33
## denominator =  1030.992
## tau = -0.0475, 2-sided pvalue =0.64933