The USACE river gauge closest to the proposed diversion, which is at river mile 60, is the Alliance station at river mile 62.5.
Import a csv that contains the Alliance gage stage data. Downloaded from http://rivergages.mvr.usace.army.mil/WaterControl/shefgraph-historic.cfm?sid=01390 on 02/01/2019
setwd('C:\\Users\\Jack\\Documents\\AB\\PVA MS Sturgeon 2019')
rawDat <- read.csv('USACEAllianceGageStage.csv')
dat <- rawDat
names(dat) <- c("date", "stage")
head(dat)
The first column contains the date and time of measurement. The second column contains the measurement for the river stage in feet, using ‘M’ to denote missing data. It is imported as a factor because of the missing value code.
Recode the ‘M’ and 1000001 for missing data to NA for R, then convert to numeric from factor.
dat$stage <- as.character(dat$stage)
dat$stage[dat$stage == 'M'] <- NA
dat$stage <- as.numeric(dat$stage)
NAs introduced by coercion
dat$stage[dat$stage > 1000] <- NA
head(dat)
The first entry is dated 1911, which is far outside of the rest of the series, so this row will be removed.
dat <- dat[-1,]
head(dat)
sd(dat$stage, na.rm = TRUE)
[1] 2.37885
summary(dat$stage)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.060 2.190 3.660 4.233 6.353 15.190 108
The mean and sd match the original spreadsheet data after the same filtering process.
The times can just be dropped, because there is never more than one measurement per day…
sum(table(dat$date) > 1)
[1] 0
dateTimeChar <- as.character(dat$date)
head(dateTimeChar)
[1] "7/18/2008 9:00" "7/19/2008 8:00" "7/20/2008 8:00" "7/21/2008 8:00" "7/22/2008 8:00" "7/23/2008 8:00"
timeChar <- unlist(strsplit(dateTimeChar, split = "[ ]+"))[seq(2, length(dat$date), by = 2)]
head(timeChar)
[1] "9:00" "8:00" "8:00" "8:00" "8:00" "8:00"
hourChar <- unlist(strsplit(timeChar, split = ":"))[seq(1, (length(dateTimeChar) - 1), by = 2)]
head(hourChar)
[1] "9" "8" "8" "8" "8" "8"
summary(as.numeric(hourChar))
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.000 8.000 8.000 8.001 8.000 9.000
… and because all of the measurments occur at 8 or 9 O’Clock.
Convert the column named “date” from a factor to an object of class “Date” (this ignores the times per the above).
dat$date <- as.Date(x = dat$date, format = '%m/%d/%Y')
head(dat)
Split the date into additional columns for month and year. Then just rearrange column order so stage is last.
dat$month <- as.numeric(format(dat$date, "%m"))
dat$year <- as.numeric(format(dat$date, "%Y"))
dat$day <- 1:nrow(dat) # DELETE THIS
dat$day <- as.numeric(strftime(dat$date, format = "%j")) # day of year
dat <- dat[,c("date", "day", "month", "year", "stage")]
head(dat)
This is the data set that we will work with.
Count and proportion of missing values
sum(is.na(dat$stage))
[1] 108
sum(is.na(dat$stage)) / length(dat$stage)
[1] 0.02803738
There is a relatively small proportion of missing values, so it is probably safe to just leave them in the data set.
hist(dat$stage, col = 'gray', xlab = 'Stage (feet)', main = 'Distribution of stage measurements at USACE Alliance gage')
Over all of the data the river stage is clearly bimodal. Let’s see if if a seasonal pattern can explain this.
for(i in 2008:2018){
hist(dat$stage[dat$year == i], col = 'gray', xlab = 'Stage (feet)', main = paste('Distribution of stage measurements at USACE Alliance gage in ', i))
}
Most of the years appear to have a similar bimodal pattern, so it seems likely that there is a monthly cycle in stage that needs to be accounted for.
for(i in 1:12){
hist(dat$stage[dat$month == i],
col = 'gray',
xlab = 'Stage (feet)',
main = paste('Distribution of stage measurements at USACE Alliance gage in month', i),
xlim = c(0, max(dat$stage, na.rm = TRUE))
)
}
The monthly distributions do appear to be unimodal, except for December, which appears to be bimodal. There is a perhaps interesting pattern where the spring months (March, April, May, and June) appear to be left skewed, while August through December appear to be right skewed, and January, February, and July are symmetric. This also shows why the distribution of the data overall is bimodal, because the left-skewed months all have similar high means and the right-skewed months all have similar low means.
for(i in 1:12){
hist(log(dat$stage[dat$month == i]), col = 'gray', xlab = 'Stage (feet)', main = paste('Distribution of stage measurements at USACE Alliance gage in month', i))
}
plot(dat$date, dat$stage, bty = 'l', xlab = 'Date', ylab = 'Stage (feet)', type = 'l')
acf(dat$stage, na.action = na.pass, main = "")
In the raw time series there is a great deal of autocorrelation that only slowly decreases with lag. We need to see how much of this is just because of the seasonal periodicicty or yearly trend, if present, and then see if any autocorrelation remains after removing these sources.
fit.seasonal <- lm(stage ~ as.factor(month), data = dat, na.action = na.exclude)
summary(fit.seasonal)
Call:
lm(formula = stage ~ as.factor(month), data = dat, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-4.6842 -1.2649 -0.2308 1.0875 11.9886
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.46687 0.10506 42.516 < 2e-16 ***
as.factor(month)2 -0.24149 0.15912 -1.518 0.129
as.factor(month)3 0.74128 0.15430 4.804 1.62e-06 ***
as.factor(month)4 1.84350 0.15444 11.936 < 2e-16 ***
as.factor(month)5 2.08564 0.15361 13.578 < 2e-16 ***
as.factor(month)6 1.47736 0.15502 9.530 < 2e-16 ***
as.factor(month)7 0.08719 0.15139 0.576 0.565
as.factor(month)8 -1.28892 0.14925 -8.636 < 2e-16 ***
as.factor(month)9 -1.87608 0.14982 -12.522 < 2e-16 ***
as.factor(month)10 -1.77603 0.14914 -11.909 < 2e-16 ***
as.factor(month)11 -1.76842 0.14959 -11.822 < 2e-16 ***
as.factor(month)12 -1.26543 0.14936 -8.472 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.934 on 3732 degrees of freedom
(108 observations deleted due to missingness)
Multiple R-squared: 0.3407, Adjusted R-squared: 0.3388
F-statistic: 175.3 on 11 and 3732 DF, p-value: < 2.2e-16
boxplot(stage ~ month, data = dat)
There is a significant monthly pattern of variation across all years.
aov(fit.seasonal)
Call:
aov(formula = fit.seasonal)
Terms:
as.factor(month) Residuals
Sum of Squares 7216.369 13964.999
Deg. of Freedom 11 3732
Residual standard error: 1.934415
Estimated effects may be unbalanced
108 observations deleted due to missingness
TukeyHSD(aov(fit.seasonal))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = fit.seasonal)
$`as.factor(month)`
diff lwr upr p adj
2-1 -0.241491477 -0.76184338 0.278860428 0.9360338
3-1 0.741283840 0.23669773 1.245869946 0.0001026
4-1 1.843503556 1.33845421 2.348552901 0.0000000
5-1 2.085643622 1.58333325 2.587953990 0.0000000
6-1 1.477362955 0.97043285 1.984293061 0.0000000
7-1 0.087190336 -0.40785407 0.582234740 0.9999895
8-1 -1.288915198 -1.77697504 -0.800855353 0.0000000
9-1 -1.876080473 -2.36601332 -1.386147631 0.0000000
10-1 -1.776034833 -2.26372596 -1.288343703 0.0000000
11-1 -1.768418611 -2.25759631 -1.279240915 0.0000000
12-1 -1.265427373 -1.75385787 -0.776996872 0.0000000
3-2 0.982775317 0.44491136 1.520639275 0.0000002
4-2 2.084995033 1.54669647 2.623293593 0.0000000
5-2 2.327135099 1.79140550 2.862864700 0.0000000
6-2 1.718854432 1.17879089 2.258917976 0.0000000
7-2 0.328681813 -0.20024111 0.857604738 0.6713174
8-2 -1.047423721 -1.56981525 -0.525032190 0.0000000
9-2 -1.634588996 -2.15873086 -1.110447136 0.0000000
10-2 -1.534543356 -2.05659042 -1.012496291 0.0000000
11-2 -1.526927134 -2.05036320 -1.003491066 0.0000000
12-2 -1.023935896 -1.54667374 -0.501198052 0.0000000
4-3 1.102219716 0.57914575 1.625293681 0.0000000
5-3 1.344359782 0.82392992 1.864789641 0.0000000
6-3 0.736079115 0.21118897 1.260969260 0.0002917
7-3 -0.654093505 -1.16751389 -0.140673117 0.0018820
8-3 -2.030199039 -2.53688824 -1.523509839 0.0000000
9-3 -2.617364314 -3.12585790 -2.108870732 0.0000000
10-3 -2.517318673 -3.02365272 -2.010984623 0.0000000
11-3 -2.509702451 -3.01746849 -2.001936412 0.0000000
12-3 -2.006711213 -2.51375745 -1.499664976 0.0000000
5-4 0.242140066 -0.27873894 0.763019074 0.9353076
6-4 -0.366140601 -0.89147608 0.159194881 0.4917664
7-4 -1.756313220 -2.27018888 -1.242437557 0.0000000
8-4 -3.132418754 -3.63956927 -2.625268236 0.0000000
9-4 -3.719584029 -4.22853729 -3.210630765 0.0000000
10-4 -3.619538389 -4.12633408 -3.112742697 0.0000000
11-4 -3.611922167 -4.12014855 -3.103695787 0.0000000
12-4 -3.108930929 -3.61643816 -2.601423698 0.0000000
6-5 -0.608280667 -1.13098348 -0.085577852 0.0079496
7-5 -1.998453286 -2.50963727 -1.487269306 0.0000000
8-5 -3.374558821 -3.87898177 -2.870135870 0.0000000
9-5 -3.961724096 -4.46795951 -3.455488685 0.0000000
10-5 -3.861678455 -4.36574466 -3.357612250 0.0000000
11-5 -3.854062233 -4.35956685 -3.348557615 0.0000000
12-5 -3.351070995 -3.85585259 -2.846289404 0.0000000
7-6 -1.390172619 -1.90589686 -0.874448381 0.0000000
8-6 -2.766278153 -3.27530167 -2.257254637 0.0000000
9-6 -3.353443428 -3.86426308 -2.842623775 0.0000000
10-6 -3.253397788 -3.76206778 -2.744727791 0.0000000
11-6 -3.245781566 -3.75587699 -2.735686137 0.0000000
12-6 -2.742790328 -3.25216925 -2.233411410 0.0000000
8-7 -1.376105534 -1.87329340 -0.878917673 0.0000000
9-7 -1.963270809 -2.46229741 -1.464244210 0.0000000
10-7 -1.863225169 -2.36005109 -1.366399249 0.0000000
11-7 -1.855608947 -2.35389418 -1.357323712 0.0000000
12-7 -1.352617709 -1.85016943 -0.855065992 0.0000000
9-8 -0.587165275 -1.07926384 -0.095066710 0.0054922
10-8 -0.487119635 -0.97698640 0.002747129 0.0528873
11-8 -0.479503413 -0.97085016 0.011843334 0.0634294
12-8 0.023487825 -0.46711503 0.514090680 1.0000000
10-9 0.100045640 -0.39168724 0.591778518 0.9999545
11-9 0.107661863 -0.38554540 0.600869124 0.9999079
12-9 0.610653100 0.11818692 1.103119281 0.0029831
11-10 0.007616222 -0.48336428 0.498596722 1.0000000
12-10 0.510607460 0.02037141 1.000843512 0.0325051
12-11 0.502991238 0.01127631 0.994706162 0.0395487
acf(residuals(fit.seasonal), na.action = na.pass, main = "")
The seasonal (monthly) pattern did reduce the autocorrelation in the time series, but not completely.
plot(dat$month, dat$stage)
plot(dat$month, residuals(fit.seasonal))
There is still a seasonal periodicity in the results, but in the opposite direction as before, which might be a consequence of how early months were skewed one way and later months skewed the other way.
The monthly variation seems very regular and might be fit just as well by a periodic function.
fit.cos <- lm(stage ~ cos(2*pi/365*day) + sin(2*pi/365*day),
data = dat,
na.action = na.exclude)
summary(fit.cos)
Call:
lm(formula = stage ~ cos(2 * pi/365 * day) + sin(2 * pi/365 *
day), data = dat, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-4.8390 -1.2874 -0.2913 1.1213 12.0278
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.30760 0.03208 134.28 <2e-16 ***
cos(2 * pi/365 * day) -0.80721 0.04522 -17.85 <2e-16 ***
sin(2 * pi/365 * day) 1.73275 0.04545 38.12 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.96 on 3741 degrees of freedom
(108 observations deleted due to missingness)
Multiple R-squared: 0.3216, Adjusted R-squared: 0.3212
F-statistic: 886.7 on 2 and 3741 DF, p-value: < 2.2e-16
aov(fit.cos)
Call:
aov(formula = fit.cos)
Terms:
cos(2 * pi/365 * day) sin(2 * pi/365 * day) Residuals
Sum of Squares 1228.832 5582.957 14369.579
Deg. of Freedom 1 1 3741
Residual standard error: 1.959874
Estimated effects may be unbalanced
108 observations deleted due to missingness
# Need to sort residuals for the lines function to plot correctly
tempData <- data.frame(x = dat$month, y = fitted.values(fit.cos))
tempData <- tempData[order(tempData[,1]),]
plot(dat$month, dat$stage)
points(dat$month, fitted.values(fit.cos), col = 2, pch = 16)
plot(dat$day, dat$stage)
points(dat$day, fitted.values(fit.cos), col = 2, pch = 16)
plot(dat$month, residuals(fit.cos))
The residuals still have the same pattern, but using this function will be easier for modeling than using 12 different means, and the model seems just as accurate.
These results represent the linear model: \[x_t = U_1cos(2\pi\omega t) + U_2sin(2\pi\omega t)\].
Using the trigonometric identity \[cos(\alpha+\beta) = cos(\alpha)cos(\beta) - sin(\alpha)sin(\beta)\] this is a linear represenation of the periodic function \[x_t = Asin(2\pi\omega t + \phi)\], where \(x_t\) is the stage at month \(t\), \(A\) is the amplitude, \(\omega\) is the number of cycles per month (inverse period), and \(\phi\) is the phase or offset of the function as number of months.
In the linear transformation, \(U_1 = Acos\phi\) and \(U_2 = -Asin\phi\), and are the coefficients estimated by the linear model. Using this we can get that \(A = \sqrt{U_1^2 + U_2^2}\) and that \(\phi = tan^-1(-U_2/U_1)\). In this model, the frequency \(\omega\) is assumed to be known as 12, since it appears that there is one cycle per year, and this agrees with the natural history of the system.
Based on the estimates of the linear model, we can get the following estimates of the parameters of the periodic function.
names(fit.cos)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
[8] "df.residual" "na.action" "xlevels" "call" "terms" "model"
myIntercept <- coefficients(fit.cos)[[1]]
U1 <- coefficients(fit.cos)[[2]]
U2 <- coefficients(fit.cos)[[3]]
A <- sqrt(U1^2 + U2^2)
phi <- atan(-U2 / U1)
offsetMonths <- 2*pi / phi
myIntercept
[1] 4.307601
A
[1] 1.911551
phi
[1] 1.134835
This results in a periodic model of \[x_t = 4.3 + 1.9*cos(2\pi/365*day + 2\pi/365*1.13)\].
myPred <- A*sin(2*pi/365*(1:365) + 2*pi/365*phi) + myIntercept
plot(dat$day, dat$stage)
lines(1:365, myPred, col = 'red', lwd = 3)
hist(residuals(fit.cos))
acf(residuals(fit.cos), na.action = na.pass, main = "")
fit.trend <- lm(stage ~ year, data = dat, na.action=na.exclude)
summary(fit.trend)
Call:
lm(formula = stage ~ year, data = dat, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-4.3250 -2.0550 -0.5425 2.1231 10.7400
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -161.8748 25.5724 -6.330 2.74e-10 ***
year 0.0825 0.0127 6.496 9.36e-11 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 2.366 on 3742 degrees of freedom
(108 observations deleted due to missingness)
Multiple R-squared: 0.01115, Adjusted R-squared: 0.01089
F-statistic: 42.19 on 1 and 3742 DF, p-value: 9.357e-11
boxplot(stage ~ year, data = dat)
There is a significant annual trend across all months, but the amount of variation explained is so small that this cannot really be considered as important.
acf(residuals(fit.trend), na.action = na.pass, main = "")
The annual trend did not reduce the autocorrelation in the time series much, if at all.
plot(dat$year, dat$stage)
plot(dat$year, residuals(fit.trend))
There is no real trend, and removing the seasonal periodicity still results in significant autocorrelation. In order to model river stage completely, we finally need a model for the autocorrelation of the error term, which is represented by the residuals of the season model.
library(forecast)
package <U+393C><U+3E31>forecast<U+393C><U+3E32> was built under R version 3.4.4
set.seed(1)
fit.aa <- auto.arima(residuals(fit.cos), allowdrift = FALSE, max.q = 0, d = 0)
fit.aa
Series: residuals(fit.cos)
ARIMA(3,0,0) with zero mean
Coefficients:
ar1 ar2 ar3
0.519 0.2682 0.1745
s.e. 0.016 0.0176 0.0160
sigma^2 estimated as 0.4487: log likelihood=-3871.86
AIC=7751.72 AICc=7751.73 BIC=7776.73
acf(residuals(fit.aa), na.action = na.pass, main = "")
hist(residuals(fit.aa), col = "gray")
arCoeffs <- fit.aa$coef
arSigma2 <- fit.aa$sigma2
The final model of river stage is then the periodic function from before with the addition of an AR(3) model of autocorrelation, in which the current value is a linear function of the previous three values plus an error term.
The complete model is then \[x_t = 4.3 + 1.9*cos(2\pi/365*day + 2\pi/365*1.13) + w_t + 0.519w_{t-1} + 0.2682w_{t-2} + 0.1745w_{t-3}\] where \[w_t\] is sampled from the residuals after the periodic and AR fits.
First we generate a time series using normal errors with the sd estiamted from the AR model.
set.seed(1)
simStageNorm <- A*sin(2*pi/365*dat$day + 2*pi/365*phi) + myIntercept + arima.sim(n = length(dat$stage),
list(ar = arCoeffs),
rand.gen = rnorm, sd = sqrt(arSigma2)
)
plot(dat$date, dat$stage, type = 'l')
lines(dat$date, simStageNorm, type = 'l', col = 'red')
Here the red line is the simulated time series and the black line is the observed time series. This looks good, but it does not have the same occasional large spikes as in the original data.
Now we will simulate the time series using the same model but with errors sampled from the observed residuals.
This is a helper function to enable sampling from the observed residual distribution within the arima.sim function.
rg <- function(n, res) sample(res, n, replace = TRUE)
set.seed(1)
simStageSamp <- A*sin(2*pi/365*dat$day + 2*pi/365*phi) + myIntercept + arima.sim(n = length(dat$stage),
model = list(ar = arCoeffs),
rand.gen = rg, res = na.exclude(residuals(fit.aa))
)
plot(dat$date, dat$stage, type = 'l')
lines(dat$date, simStageSamp, type = 'l', col = 'red')
Here the red line is the simulated time series and the black line is the observed time series. This seems to mimic the time series better than the previous simulation.