Exercise 4: R and rainfall analysis

I. Introductory commands

Some important commands are listed below. The . indicate blanks where you would fill in missing info.

?...                      # Loads help files for a specific command   
?getwd                  # Loads help files for "getwd"   
??getwd               # Searches for the word "getwd" in all the help files   
setwd("C:\mypath...")               # Sets the working directory to whatever path you specify   
getwd()                 # Shows the current working directory   
dir()                     # List the files in the working directory   
ls()                      #  List the variables in your R workspace   

# "#" Indicates a comment.  The text . is ignored   
install.packages(".")           # Gets user-created packages from online repositories   
library(.)                      #  Loads packages that you've installed

II. Running commands

Example 1: Starting R, change directories

Start R studio (“Start”>“RStudio”)

First, put the name of your class directory into a variable “mydir”.

mydir = "C:/myRdir" #  This puts the text for your directory in the variable mydir
  # -->  The text is whatever directory your GEOG576 HW is in.
  # YOU NEED TO CHANGE THE TEXT in "" to your directory, for exmaple:
  # mydir = "C:/User/tbiggs/GEOG576".
  # See the instructions in the previous section of this exercise:
    # C.  Set the working directory for your data folder."
  # If you need additional help, a video on how to copy paths in windows is here:  https://www.youtube.com/watch?v=G4qyzclix10
mydir  # I always type the name of the variable after creating it to see if R did what I expect.

Now, go to your directory using setwd(), and list the contents:

setwd(mydir)
dir()

Example 2: Creating a script

Click “File>New File > RScript” Put the commands from the two boxes above into the Rscript window, highlight the text and click >Run. You can also run the lines by clicking on the first line, then click “>Run” repeatedly to run each line.

III. Data structures in R

A. Data frames and Reading in data

1. Load the data from a published google sheet using “read.csv”.

Here I have published some data in a google sheet to the web using File -> Publish to the web, csv option. Doing this gives you a url that can be referenced in R to read in the data.

a. Alternative 1: Load from the url

poway.url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vT4Ov3nGPpsPjPz1ZgbkOipHGArx4V4g_LTZl-Gdv4G-3NwY-ZE07s2g8TEMQOWU2peawDaeVS3z6Br/pub?output=csv"
poway.precip.daily = read.csv(url(poway.url))

See how the functions read.csv() and url() work. Type:

?read.csv
?url

b. Alternative 2: Load from a downloaded csv file

*** Copy and paste this url into your brower: https://docs.google.com/spreadsheets/d/e/2PACX-1vT4Ov3nGPpsPjPz1ZgbkOipHGArx4V4g_LTZl-Gdv4G-3NwY-ZE07s2g8TEMQOWU2peawDaeVS3z6Br/pub?output=csv

and save the csv file into a directory on your machine. You could call this file “poway_precip_daily.csv”.

Now, load the csv. Make sure you used setwd to get into the directory where you put the file:

#poway.precip.daily = read.csv("poway_precip_daily.csv")

Now, look at the variable that you just loaded:

poway.precip.daily

But this prints out a lot of data–too much! To just see the first, or last, 5 lines of x:

head(poway.precip.daily)

##       STATION     DATE LATITUDE LONGITUDE ELEVATION                NAME
## 1 USC00047111 1/1/1900 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 2 USC00047111 1/2/1900 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 3 USC00047111 1/3/1900 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 4 USC00047111 1/4/1900 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 5 USC00047111 1/5/1900 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 6 USC00047111 1/6/1900 33.01953  -117.031     197.5 POWAY VALLEY, CA US
##   PRCP.0.1mm PRCP.mm PRCP_ATTRIBUTES
## 1          0     0.0           P,,6,
## 2          0     0.0           P,,6,
## 3        457    45.7            ,,6,
## 4        531    53.1            ,,6,
## 5          0     0.0           P,,6,
## 6          0     0.0           P,,6,

tail(poway.precip.daily)

##           STATION      DATE LATITUDE LONGITUDE ELEVATION                NAME
## 26329 USC00047111 7/28/2022 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 26330 USC00047111 7/29/2022 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 26331 USC00047111 7/30/2022 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 26332 USC00047111 7/31/2022 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 26333 USC00047111  9/6/2022 33.01953  -117.031     197.5 POWAY VALLEY, CA US
## 26334 USC00047111  9/7/2022 33.01953  -117.031     197.5 POWAY VALLEY, CA US
##       PRCP.0.1mm PRCP.mm PRCP_ATTRIBUTES
## 26329          0       0        ,,7,1600
## 26330          0       0        ,,7,1600
## 26331          0       0        ,,7,1600
## 26332          0       0        ,,7,1600
## 26333          0       0        ,,H,1600
## 26334          0       0        ,,H,1600

2. Look at subsets of the rainfall data.

What do each of the following commands do?:

names(poway.precip.daily)
poway.precip.daily[,1]
poway.precip.daily[,2]
poway.precip.daily[1:10,]
poway.precip.daily[1:10,"PRCP.mm"]

B. Dates and times in R.

Date variables can be created using “as.Date”, which creates something called a Date date-time object.

For more information about dates and date-time objects, review “A Quick introduction to R”, and see https://www.stat.berkeley.edu/~s133/dates.html.

Check out the help for as.Date:

?as.Date

What is the class of your DATE object?

class(poway.precip.daily$DATE)

## [1] "character"

Characters won’t plot right…Dates have to be a “Date” class.

poway.precip.daily$Date = as.Date(poway.precip.daily$DATE,format="%m/%d/%Y")
  # I often use "Date" in lower case to indicate it's my own formatted Date...
head(poway.precip.daily$Date)

## [1] "1900-01-01" "1900-01-02" "1900-01-03" "1900-01-04" "1900-01-05"
## [6] "1900-01-06"

What class is the Date column?

Now look at subsets of the data by year and by precipitation threshold:

poway.precip.daily[poway.precip.daily$Date >= as.Date("1990-01-01"),]
poway.precip.daily[poway.precip.daily$PRCP.mm>50,]

IV. Plotting data

Check out the plot command in help using ?plot

A. Plot the time series of rainfall using your date.time object and the rainfall column of x:

plot(poway.precip.daily$Date,poway.precip.daily$PRCP.mm,type="l")  # "l" means plot it as a "line"

Now change the axis labels:

plot(poway.precip.daily$Date,poway.precip.daily$PRCP.mm,type="l",xlab="",ylab="Daily precipitation at poway, mm")

There’s a big gap in the middle. You can restrict the range of the x-axis using the “xlim” option:

plot(poway.precip.daily$Date,poway.precip.daily$PRCP.mm,type="l",xlab="",ylab="Daily precipitation at poway, mm",xlim=as.Date(c("1955-01-01","2022-12-31")))

Or, you can create a new data.frame with just the data from ~1955-2022. How can you find out the first date with data after the 1910-1950s gap? Hint: use View…

Now, use that date to create a new data.frame with just the data from 195x-2022. Note: *** you have to change the string “19xx-xx-xx” to the date you want to use ***:

poway.precip.daily.sub = poway.precip.daily[poway.precip.daily$Date >= as.Date("19xx-xx-xx"),]
  # *** This line will not work if you don't change the "19xx-xx-xx" to the date you want to use, 
  # where 19xx is changed to e.g. 1930 or 1955 for example, and xx-xx is the month and day, 
  # 10-01 or 01-01 or 02-15, for example.

Now plot the smaller data.frame, and make sure it removed the earlier data.
I’m not showing the code for how to do this…Hint: you don’t need the xlab option anymore:

B. Time series analysis and summaries

1. Sort the precipitation data to find the largest daily events.

daily.sorted = sort(poway.precip.daily$PRCP.mm,decreasing=TRUE)
daily.sorted[1:20]

##  [1] 86.9 79.5 78.0 77.0 73.7 71.1 63.0 62.0 61.0 59.2 58.9 58.4 58.2 57.7 56.4
## [16] 56.4 55.6 55.1 54.1 53.1

2. Annual (water year) precipitation: sum and daily maxima

To calculate total precipitation for a water year, we need to have a column that has the water year for each day, and sum the precipitation values by water year.

Note: I found the solution below at http://stackoverflow.com/questions/22073881/hydrological-year-time-series after googling “R hydroTSM october monthly2annual”)

a. Generate a series of dates to “break” the time series up by the October 1 water year.

# First, generate a vector of "breaks", starting on Oct 1 and going by year.
  # 1962 is the first year of data at Poway.
breaks <- seq(as.Date("1956-10-01"), length=(2023-1956), by="year")  
  # Length is the number of years you want the breaks for.
years.breaks = as.numeric(format(breaks,"%Y"))
labels.water.year = years.breaks[2:length(breaks)]  # Why are we leaving off the first year in the water years label?
poway.precip.daily.sub$WaterYear <- cut(poway.precip.daily.sub$Date, breaks,labels=labels.water.year)

Check to make sure the hydroYears were assigned correctly by looking at the first 600 lines:

poway.precip.daily.sub[1:400,]  # See the first 400 days of daily data

b. Now calculate the sum of precipitation by WaterYear using “aggregate”:

x.annual.rainfall.wy = aggregate(poway.precip.daily.sub$PRCP.mm,by=list(poway.precip.daily.sub$WaterYear),FUN=sum)
names(x.annual.rainfall.wy) = c("WaterYear","Rainfall.mm")
# x.daily$Group.1 is a "factor" class, which is annoying and doesn't plot right.

# check to see the results make sense
head(x.annual.rainfall.wy)

##   WaterYear Rainfall.mm
## 1      1957       297.9
## 2      1958        82.8
## 3      1959       128.6
## 4      1960       261.2
## 5      1961       125.1
## 6      1962       312.6

Do these values seem reasonable? How can you know?

What class is WaterYear?
Convert it to a numerical value. Note that you have to do “as.character” first to turn a factor class into a character class…

x.annual.rainfall.wy$WaterYear = as.numeric(as.character(x.annual.rainfall.wy$WaterYear))

c. Make sure each water year has complete data for all days of the year:

x.annual.wy.count = aggregate(poway.precip.daily.sub$PRCP.mm,by=list(poway.precip.daily.sub$WaterYear),FUN=length)
names(x.annual.wy.count) = c("WaterYear","Count.dates")
x.annual.wy.count$WaterYear = as.numeric(as.character(x.annual.wy.count$WaterYear))
head(x.annual.wy.count)

##   WaterYear Count.dates
## 1      1957         365
## 2      1958         214
## 3      1959         365
## 4      1960         364
## 5      1961         365
## 6      1962         365

How many years have less than 365 days of data? Hint: create a new data.frame from x.annual.wy.count where Count.dates is >= 365…

How could you remove the WaterYears with less than 365 days of data? I’m not showing you the code for this.

d. Plot as a line plot–include the years with less than 365 days of data

plot(x.annual.rainfall.wy$WaterYear,x.annual.rainfall.wy$Rainfall.mm,type="l",ylab="Annual precip, mm",xlab="Water year")

e. Plot as a bar chart

Usually, rainfall is plotted as a bar chart, not a line, since the rainfall happens over a year, not at a point.

# Note:  I found this code by googling "R barplot of annual precipitation"
b.plot <- barplot(height = x.annual.rainfall.wy$Rainfall.mm, names.arg = x.annual.rainfall.wy$WaterYear,
                  xlab="Year", ylab="Precipitation (mm)")

e2. Create a new data.frame that excludes WaterYears with less than 365 days of data:

…and plot WaterYears with all 365 days as a blue dot:

x.annual.rainfall.wy.sub = x.annual.rainfall.wy[x.annual.wy.count$Count.dates>=365,]
plot(x.annual.rainfall.wy$WaterYear,x.annual.rainfall.wy$Rainfall.mm,type="l",ylab="Annual precip, mm",xlab="Water year")
points(x.annual.rainfall.wy.sub$WaterYear,x.annual.rainfall.wy.sub$Rainfall.mm,col="blue")

# And add a legend
legend("topleft",c("All WY",">365 days of data"),col=c("black","blue"),pch=c(NA,1),lty=c(1,NA))

Do WaterYears with <365 days of data make a big difference? In which years?

Make a plot showing the # of days with data (y-axis) vs the WaterYear (x-axis)

f. Find the maximum daily rainfall for each water year.

I won’t show the code for this…see part b above. Hint: what function should you use instead of the sum? Note: below, I call the output data frame “x.dailymax.wy”.

##   Group.1    x
## 1    1957 32.8
## 2    1958 19.6
## 3    1959 24.1
## 4    1960 27.9
## 5    1961 24.9
## 6    1962 58.2

g. And plot (I’m not showing the code)

why is this plot funny? How do you fix it? I’m not showing the code…hint: what class is WaterYear?

Add a trend line:

h. Do a trend test to see if there has been a trend in daily max, first using both the linear model (lm):

#  Linear model
lm.daily = lm(Dailymax.rainfall.mm ~ WaterYear, x.dailymax.wy)
  # The first variable Dailymax.rainfal.mm is the name of the column with the daily maximum rainfall.
  # The second variable WaterYear is the name of the column with the Water Year.
  # The third variable x.dailymax.wy is the data.frame that has the first two variables.
summary(lm.daily)

## 
## Call:
## lm(formula = Dailymax.rainfall.mm ~ WaterYear, data = x.dailymax.wy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33.30 -11.72  -2.60  10.45  38.79 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -205.6330   202.3697  -1.016    0.313
## WaterYear      0.1237     0.1017   1.216    0.228
## 
## Residual standard error: 15.74 on 64 degrees of freedom
## Multiple R-squared:  0.0226, Adjusted R-squared:  0.007326 
## F-statistic:  1.48 on 1 and 64 DF,  p-value: 0.2283

Add a trendline to your plot:

plot(x.dailymax.wy$WaterYear,x.dailymax.wy$Dailymax.rainfall.mm,type="l",ylab="Maximum daily precip, inches",xlab="Water year")
abline(lm.daily)

Plot the residuals on a normal probability plot:

lm.stdres = rstandard(lm.daily) 
qqnorm(lm.stdres, ylab="Standardized Residuals", xlab="Normal Scores", main="Maximum daily rainfall:  Annual maximum series") 
qqline(lm.stdres)

i. Now do a trend test using Mann-Kendall:

install.packages("Kendall")

library(Kendall)
poway.mk.test = MannKendall(x.dailymax.wy$Dailymax.rainfall.mm)
summary(poway.mk.test)

## Score =  227 , Var(Score) = 32639
## denominator =  2138.992
## tau = 0.106, 2-sided pvalue =0.21095