Coregonus hoyi, often known as the bloater, is in the family Salmonidae and a species of freshwater whitefish . It is silver in colour, herring-like fish, 25.5 centimetres long. It is found in the Great Lakes and also in Lake Nipigon, inhabits underwater slopes.
The task is to analyze the egg depositions of Lake Huron Bloasters by using different analysis methods. The final goal is to choose the best model from different set of models and give relevant forecasts of egg depositions for the next 5 years.
We load the following packages
library(TSA)
## Warning: package 'TSA' was built under R version 3.5.3
library(tseries)
library(lmtest)
library(dLagM)
## Warning: package 'dLagM' was built under R version 3.5.3
## Warning: package 'nardl' was built under R version 3.5.3
## Warning: package 'dynlm' was built under R version 3.5.3
library(fUnitRoots)
## Warning: package 'fUnitRoots' was built under R version 3.5.3
## Warning: package 'timeSeries' was built under R version 3.5.3
## Warning: package 'fBasics' was built under R version 3.5.3
library(FitAR)
## Warning: package 'FitAR' was built under R version 3.5.3
## Warning: package 'leaps' was built under R version 3.5.3
## Warning: package 'ltsa' was built under R version 3.5.2
## Warning: package 'bestglm' was built under R version 3.5.3
library(forecast)
egg_data1 <- read.csv("eggs.csv", header = TRUE)
head(egg_data1)
## year eggs
## 1 1981 0.0402
## 2 1982 0.0602
## 3 1983 0.1205
## 4 1984 0.1807
## 5 1985 0.7229
## 6 1986 0.5321
class(egg_data1)
## [1] "data.frame"
In the below code, we are converting the data to a time series data to perform the analysis. A random walk graph is generated using the plot function and the following points are observed:
- Trend : We see a upward trend.
- Behaviour : The series appears to be auto regressive due to multiple succeeding points.
- Seasonality : We see no seasonality in this graph.
- Change in Variance: Change in variance is found.
- Change in point: Year 1990 shows there is a sudden jump in the Deposition of Eggs
egg_data2 <-ts(as.vector(egg_data1$eggs), start= 1981, end=1996)
class(egg_data2)
## [1] "ts"
plot(egg_data2,type='o',ylab='Random Walk')
acf(egg_data2)
pacf(egg_data2)
adf.test(egg_data2)
##
## Augmented Dickey-Fuller Test
##
## data: egg_data2
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary
In the scatterplot, we observe a positive upward trend. This shows that there is a strong correlation between egg depositions in an year as compared to the previous year.
We then find the correlation to be around 74% which justifies the upward trend of the scatterplot.
plot(y= egg_data2, x = zlag(egg_data2), ylab = 'Egg Depositions', xlab = 'Previous Year Egg Depositions')
y = egg_data2
x = zlag(egg_data2)
index = 2:length(x)
cor(y[index],x[index])
## [1] 0.7445657
data.transform = BoxCox.ar(egg_data2, method = "yule-walker")
## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1
## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1
data.transform$ci
## [1] 0.1 0.8
lambda = 0.45 # The mid point of the interval
data.bc = ((egg_data2^lambda)-1)/lambda
plot(data.bc, type = "o", ylab = "Egg Depositions (in millions)", main = "Transformed - Egg depositions between 1981 and 1996")
adf.test(data.bc)
##
## Augmented Dickey-Fuller Test
##
## data: data.bc
## Dickey-Fuller = -1.6769, Lag order = 2, p-value = 0.6955
## alternative hypothesis: stationary
#Let's calculate the first difference and plot the first differenced series
diff.egg_data2 = diff(data.bc)
plot(diff.egg_data2, type = "o")
The adf test for the differenced series is done. The null hypothesis states that the series is non stationary whereas the alternate hypothesis states that the series is stationary. The p-value is 0.0443 which is less than 0.05, hence we can reject the null hypothesis and the series is stationary now.
Also the acf and pacf do not show any significant lags.
We have a stationary series now and we can proceed ahead with fitting the models.
adf.test(diff.egg_data2)
##
## Augmented Dickey-Fuller Test
##
## data: diff.egg_data2
## Dickey-Fuller = -3.6798, Lag order = 2, p-value = 0.0443
## alternative hypothesis: stationary
acf(diff.egg_data2)
pacf(diff.egg_data2)
eacf(diff.egg_data2, ar.max = 3, ma.max = 3)
## AR/MA
## 0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o
res = armasubsets(y = diff.egg_data2, nar = 3, nma = 6, y.name = 'test', ar.method = 'yw')
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 4 linear dependencies found
## Reordering variables and trying again:
plot(res)
# ARIMA(0,1,1) - CSS
model011css = arima(diff.egg_data2, order = c(0,1,1), method = 'CSS')
coeftest(model011css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.093338 0.056532 -19.34 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,1) - ML
model011ml = arima(diff.egg_data2, order = c(0,1,1), method = 'ML')
coeftest(model011ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.99994 0.66766 -1.4977 0.1342
# ARIMA(1,1,0) - CSS
model110css = arima(diff.egg_data2, order = c(1,1,0), method = 'CSS')
coeftest(model110css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.40609 0.24460 -1.6602 0.09687 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(1,1,0) - ML
model110ml = arima(diff.egg_data2, order = c(1,1,0), method = 'ML')
coeftest(model110ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.38056 0.23493 -1.6199 0.1053
# ARIMA (1,1,1) - CSS
model111css = arima(diff.egg_data2, order = c(1,1,1), method = 'CSS')
coeftest(model111css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.030947 0.290460 0.1065 0.9152
## ma1 -1.021947 0.201926 -5.0610 4.171e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA (1,1,1) - ML
model111ml = arima(diff.egg_data2, order = c(1,1,1), method = 'ML')
coeftest(model111ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.11474 0.26992 0.4251 0.670764
## ma1 -1.00000 0.33205 -3.0116 0.002599 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,3) - CSS
model013css = arima(diff.egg_data2, order = c(0,1,3), method = 'CSS')
coeftest(model013css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.10117 0.29725 -3.7045 0.0002118 ***
## ma2 -0.11909 0.33351 -0.3571 0.7210194
## ma3 0.14673 0.34631 0.4237 0.6717891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,3) - ML
model013ml = arima(diff.egg_data2, order = c(0,1,3), method = 'ML')
coeftest(model013ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.852252 0.536615 -1.5882 0.1122
## ma2 -0.102618 0.286176 -0.3586 0.7199
## ma3 -0.045118 0.512224 -0.0881 0.9298
# ARIMA(3,1,3) - CSS
model313css = arima(diff.egg_data2, order = c(3,1,3), method = 'CSS')
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model313css)
## Warning in sqrt(diag(se)): NaNs produced
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -1.2016973 0.0067545 -177.9105 < 2e-16 ***
## ar2 -0.8077002 0.0231155 -34.9419 < 2e-16 ***
## ar3 -0.1124583 0.0998924 -1.1258 0.26025
## ma1 1.9195706 NA NA NA
## ma2 1.9992451 NA NA NA
## ma3 -1.3935177 0.6294151 -2.2140 0.02683 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(3,1,3) - ML
model313ml = arima(diff.egg_data2, order = c(3,1,3), method = 'ML')
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model313ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.35427 0.42161 0.8403 0.400749
## ar2 -0.67968 0.26271 -2.5872 0.009675 **
## ar3 -0.41455 0.32151 -1.2894 0.197269
## ma1 -1.31542 0.60118 -2.1881 0.028666 *
## ma2 1.42062 1.09000 1.3033 0.192465
## ma3 -0.59339 0.60515 -0.9806 0.326809
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,5) - CSS
model015css = arima(diff.egg_data2, order = c(0,1,5), method = 'CSS')
coeftest(model015css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.016594 0.375033 -2.7107 0.006715 **
## ma2 0.013365 0.364525 0.0367 0.970752
## ma3 -0.753976 0.239229 -3.1517 0.001623 **
## ma4 1.192805 0.411640 2.8977 0.003759 **
## ma5 -0.538537 0.506024 -1.0643 0.287215
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,5) - ML
model015ml = arima(diff.egg_data2, order = c(0,1,5), method = 'ML')
coeftest(model015ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.824906 0.739712 -1.1152 0.2648
## ma2 0.038341 0.314834 0.1218 0.9031
## ma3 -0.732093 0.840193 -0.8713 0.3836
## ma4 1.065250 0.965775 1.1030 0.2700
## ma5 -0.089888 0.354698 -0.2534 0.7999
# ARIMA (0,1,2) - CSS
model012css = arima(diff.egg_data2, order = c(0,1,2), method = 'CSS')
coeftest(model012css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.017026 0.274754 -3.7016 0.0002143 ***
## ma2 -0.087178 0.308221 -0.2828 0.7772974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA (0,1,2) - ML
model012ml = arima(diff.egg_data2, order = c(0,1,2), method = 'ML')
coeftest(model012ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.88465 0.42637 -2.0748 0.0380 *
## ma2 -0.11535 0.25484 -0.4527 0.6508
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(1,1,2) - CSS
model112css = arima(diff.egg_data2, order = c(1,1,2), method = 'CSS')
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model112css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.329019 0.010368 31.7337 < 2.2e-16 ***
## ma1 -2.494136 0.107804 -23.1359 < 2.2e-16 ***
## ma2 1.189979 0.159874 7.4432 9.825e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(1,1,2) - ML
model112ml = arima(diff.egg_data2, order = c(1,1,2), method = 'ML')
coeftest(model112ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.022041 0.984412 0.0224 0.9821
## ma1 -0.904711 0.996659 -0.9077 0.3640
## ma2 -0.095271 0.937432 -0.1016 0.9191
residual.analysis <- function(model, std = TRUE){
library(TSA)
library(FitAR)
if (std == TRUE){
res.model = rstandard(model)
}else{
res.model = residuals(model)
}
par(mfrow=c(3,2))
plot(res.model,type='o',ylab='Standardised residuals', main="Time series plot of standardised residuals")
abline(h=0)
hist(res.model,main="Histogram of standardised residuals")
qqnorm(res.model,main="QQ plot of standardised residuals")
qqline(res.model, col = 2)
acf(res.model,main="ACF of standardised residuals")
print(shapiro.test(res.model))
k=0
LBQPlot(res.model, lag.max = length(model$residuals)-1 , StartLag = k + 1, k = 0, SquaredQ = FALSE)
par(mfrow=c(1,1))
}
residual.analysis(model = model011css)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.9093, p-value = 0.1321
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model011ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.95773, p-value = 0.653
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model110css)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.96159, p-value = 0.7199
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model110ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.95896, p-value = 0.6743
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model111css)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.95293, p-value = 0.5717
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model111ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.94804, p-value = 0.4941
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model013css)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.89171, p-value = 0.07118
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model013ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.94279, p-value = 0.4188
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model313css)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.93626, p-value = 0.3377
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model313ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.95724, p-value = 0.6445
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model015css)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.90275, p-value = 0.1049
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model015ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.96794, p-value = 0.8265
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model012css)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.92679, p-value = 0.2442
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model012ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.94768, p-value = 0.4887
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model112css)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.9373, p-value = 0.3497
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
residual.analysis(model = model112ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.9475, p-value = 0.486
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
sort.score <- function(x, score = c("bic", "aic")){
if (score == "aic"){
x[with(x, order(AIC)),]
} else if (score == "bic") {
x[with(x, order(BIC)),]
} else {
warning('score = "x" only accepts valid arguments ("aic","bic")')
}
}
sort.score(AIC(model110ml, model011ml, model111ml, model013ml, model313ml, model015ml, model012ml, model112ml), score = "aic")
## df AIC
## model011ml 2 23.12446
## model012ml 3 24.93160
## model111ml 3 24.94140
## model013ml 4 26.92313
## model112ml 4 26.93109
## model110ml 2 27.05798
## model015ml 6 27.39690
## model313ml 7 29.34726
sort.score(BIC(model110ml, model011ml, model111ml, model013ml, model313ml, model015ml, model012ml, model112ml), score = "bic")
## df BIC
## model011ml 2 24.40258
## model012ml 3 26.84878
## model111ml 3 26.85857
## model110ml 2 28.33609
## model013ml 4 29.47936
## model112ml 4 29.48732
## model015ml 6 31.23124
## model313ml 7 33.82067
predict(model011ml,n.ahead = 5,newxreg = NULL,se.fit=TRUE)
## $pred
## Time Series:
## Start = 1997
## End = 2001
## Frequency = 1
## [1] 0.1148626 0.1148626 0.1148626 0.1148626 0.1148626
##
## $se
## Time Series:
## Start = 1997
## End = 2001
## Frequency = 1
## [1] 0.4491633 0.4491633 0.4491633 0.4491633 0.4491633
fit=Arima(egg_data2,c(0,1,1))
plot(forecast(fit,h=5))
fit
## Series: egg_data2
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## 0.0304
## s.e. 0.2545
##
## sigma^2 estimated as 0.1885: log likelihood=-8.25
## AIC=20.51 AICc=21.51 BIC=21.92
We analysed the egg depositions of Lake Huron Bloasters by using various analysis methods. We found the best fitting model for the data to be ARIMA(0,1,1). We started with understanding the properties of the data, converting it into a time series, checking the correlation, transforming it and differencing it once before modelling. After each step the adf test was conducted to make sure that the time series is stationary in order to fit the best model. Certain models were identified through EACF, AIC and BIC functions. These models were tested through the coefficient test and then the residuals were checked for normality as well. ARIMA(0,1,1) had the lowest values for AIC and BIC and also had normality behavious during the residual analysis. Further, predictions were made using the ARIMA(0,1,1) model for the next 5 years for the Egg Depositions.