Executive Statement
- The dataset used in this assignment includes egg depositions in millions of age-3 Lake Huron Bloaters between 1981 and 1996.
- A set of possible models will be identified after analyzing and transforming of time series data.
- By fitting different models to the time series data and examining residuals for each model a set of eligible models will be used for further model selection using AIC and BIC scored.
- Finally the best model will be used to predict the changes for the next 5 years.
Load required packages
library(TSA)
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
library(fUnitRoots)
## Loading required package: timeDate
##
## Attaching package: 'timeDate'
## The following objects are masked from 'package:TSA':
##
## kurtosis, skewness
## Loading required package: timeSeries
## Loading required package: fBasics
library(tseries)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:timeSeries':
##
## time<-
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forecast)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.Arima TSA
## fitted.fracdiff fracdiff
## plot.Arima TSA
## residuals.fracdiff fracdiff
Load required data and convert it into time series
egg<- read.csv("eggs.csv",header = T)
class(egg)
## [1] "data.frame"
egg.ts <- ts(as.vector(egg[2]), start=1981)
class(egg.ts)
## [1] "ts"
Plot time series data and check for stationarity
plot(egg.ts , type='o', main='Time series of yearly egg depositions of Lake Huron Bloasters', ylab='egg depositions(in millions)')

par(mfrow=c(1,2))
acf(egg.ts)
pacf(egg.ts)

par(mfrow=c(1,1))
adf.test(egg.ts)
##
## Augmented Dickey-Fuller Test
##
## data: egg.ts
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary
- In the time series plot there is an upward trend with no obvious seasonality, no obvious intervention, and no changing variance and no succeeding observations. Bouncing observations around the mean level imply the existence of moving average behavior.
- A slowly decaying pattern in ACF and very high first correlation in PACF implies the existence of trend and nonstationarity.
- With a p-value of 0.5469 we cannot reject the null hypothesis which states that the series is non-stationary.
Fit quadratic model
t = time(egg.ts)
t2 = t^2
model1 = lm(egg.ts~ t + t2)
summary(model1)
##
## Call:
## lm(formula = egg.ts ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50896 -0.25523 -0.02701 0.16615 0.96322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.647e+04 2.141e+04 -2.170 0.0491 *
## t 4.665e+01 2.153e+01 2.166 0.0494 *
## t2 -1.171e-02 5.415e-03 -2.163 0.0498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4092 on 13 degrees of freedom
## Multiple R-squared: 0.5932, Adjusted R-squared: 0.5306
## F-statistic: 9.479 on 2 and 13 DF, p-value: 0.00289
plot(ts(fitted(model1)),ylab='Yearly egg depositions', main = "Fitted quadratic curve to Yearly egg depositions", ylim=c(0,2.2))
lines(as.vector(egg.ts),type="o")

res.model1 = rstudent(model1)
plot(y = res.model1, x = as.vector(time(egg.ts)),xlab = 'Time', ylab='Standardized Residuals',type='o', main='residual plot for model 1')
abline(h=0)

qqnorm(res.model1)
qqline(res.model1, col = 2)

shapiro.test(res.model1)
##
## Shapiro-Wilk normality test
##
## data: res.model1
## W = 0.87948, p-value = 0.03809
acf(res.model1)

- A quadratic model is chosen as observing the time series plot reveals that the trend is not linear and therefore a quadratic trend is a better fit.
- The overlay plot shows that the general trend is captured with the quadratic model, it does not capture the fluctuation in the plot.
- All terms are significant in the quadratic model with R-squared value of 0.5932. This means that the quadratic model only explains 59.32% of the variability in the data.
- The p value for the Shapiro-Wilk normality test is 0.03809, therefore the residual distribution is not normal.
- The ACF plot of the residuals shows significant auto-correlation in the residuals.
- Therefore, a quadratic model does not capture the true trend of this time series.
First differencing of time series
egg.diff = diff(BC.egg)
par(mfrow=c(1,1))
plot(egg.diff,type='o',ylab='Yearly egg depositions')

adf.test(egg.diff)
##
## Augmented Dickey-Fuller Test
##
## data: egg.diff
## Dickey-Fuller = -3.6096, Lag order = 2, p-value = 0.04931
## alternative hypothesis: stationary
par(mfrow=c(1,2))
acf(egg.diff)
pacf(egg.diff)

par(mfrow=c(1,1))
- Trend is observed after first differencing.
- ADF test produces a p-value of 0.04931, which is very close to 0.05. This indicates that the series is not stationary with the first differencing.
- There is no significant lag seen in ACF and PACF. This suggests us to fit trend models which we know would not be appropriate from previous trials.
- We will therefore apply the second differencing.
Second differencing of time series
egg.diff2 = diff(BC.egg, differences = 2)
par(mfrow=c(1,1))
plot(egg.diff2,type='o',ylab="early egg depositions")

adf.test(egg.diff2)
##
## Augmented Dickey-Fuller Test
##
## data: egg.diff2
## Dickey-Fuller = -3.1108, Lag order = 2, p-value = 0.1492
## alternative hypothesis: stationary
par(mfrow=c(1,2))
acf(egg.diff2)
pacf(egg.diff2)

par(mfrow=c(1,1))
- No trend is observed in the time series plot.
- This is confirmed with the ADF test which produces a p-value of 0.1492, indicating that the series is stationary with the second differencing.
- There is no significant lag seen in ACF plot, and 1 significant lag seen in PACF plot.
- Possible model ARIMA(1,2,0)
Use EACF and BIC table to find other possible models
eacf(egg.diff2, ar.max = 3, ma.max = 3)
## AR/MA
## 0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o
res = armasubsets(y=egg.diff2,nar=3,nma=3,y.name='test',ar.method='ols')
## Warning in ar.ols(x, aic = aic, order.max = order.max, na.action =
## na.action, : model order: 7 singularities in the computation of the
## projection matrix results are only valid up to model order 6
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 2 linear dependencies found
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : nvmax reduced to 4
plot(res)
## Warning in log(vr): NaNs produced
## Warning in log(vr): NaNs produced
## Warning in log(vr): NaNs produced

- From EACF table, we get possible models ARIMA(1,2,0), ARIMA(1,2,1), ARIMA(0,2,1)
- In the BIC table, shaded columns correspond to AR(3), MA(2) and MA(3) coefficients, so ARIMA(3,2,2), ARIMA(3,2,3) are possible models in the set of candidate model.
- The dataset contains 16 observations, so ARIMA(3,2,3) is too big for this dataset and will be excluded.
Candidate models summary
- In conclusion the set of candidate models is {ARIMA(1,2,0), ARIMA(1,2,1), ARIMA(0,2,1), ARIMA(3,2,2)}.
Fit ARIMA(1,2,0)
model.120=arima(egg.ts, order = c(1,2,0), method='CSS')
coeftest(model.120)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.45944 0.23810 -1.9296 0.05365 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.120=arima(egg.ts, order = c(1,2,0), method='ML')
coeftest(model.120)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.42966 0.22743 -1.8892 0.05886 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- AR1 coefficient of ARIMA(1,2,0) model is not significant at 5% significance in both tests.
- ARIMA(1,2,0) model is not a good fit for this data, so it will not be included in the final model selection.
Fit ARIMA(1,2,1)
model.121=arima(egg.ts, order = c(1,2,1), method='CSS')
coeftest(model.121)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.073817 0.284315 0.2596 0.7951
## ma1 -1.132556 0.074796 -15.1419 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.121=arima(egg.ts, order = c(1,2,1), method='ML')
coeftest(model.121)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.071763 0.269250 0.2665 0.7898
## ma1 -0.999998 0.236871 -4.2217 2.425e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- AR1 coefficient of ARIMA(1,2,1) model is not significant at 5% significance in both tests.
- This suggests that ARIMA(1,2,1) model is not a good fit for this data. AR1 term should be omitted, which gives ARIMA(0,2,1).
Fit ARIMA(0,2,1)
model.021=arima(egg.ts, order = c(0,2,1), method='CSS')
coeftest(model.021)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.066739 0.071847 -14.847 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(rstandard(model.021), ylab='Standardized residuals', type='o', main='Time series plot of standardized residuals
for the second difference of the Box-Cox transformed egg deposition series')
abline(h=0)

e1=residuals(model.021)
qqnorm(e1)
qqline(e1)

shapiro.test(e1)
##
## Shapiro-Wilk normality test
##
## data: e1
## W = 0.92621, p-value = 0.2121
par(mfrow=c(1,2))
acf(residuals(model.021))
pacf(residuals(model.021))

par(mfrow=c(1,1))
Box.test(residuals(model.021), lag=6, type="Ljung-Box", fitdf=0)
##
## Box-Ljung test
##
## data: residuals(model.021)
## X-squared = 7.7309, df = 6, p-value = 0.2585
tsdiag(model.021, gof=15, omit.initial=F)

- MA1 coefficient of ARIMA(0,2,1) model is significant at 5% significance in ‘CSS’ test.
- The p value for the Shapiro-Wilk normality test is 0.2121, therefore the residual distribution is normal.
- From ACF and PACF of the residuals, we can conclude that there is no significant information on the correlation structure of the series left in the residuals. This is also supported by the Box-Ljung test at 5% level of significance.
- There is no p-value for the Ljung-Box statistic below alpha.
- ARIMA(0,2,1) model is a good fit for this time series data, therefore it will be included in the final model selection.
Fit model ARIMA(3,2,2)
model.322=arima(egg.ts, order = c(3,2,2), method='CSS')
coeftest(model.322)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.139922 0.348205 -0.4018 0.68780
## ar2 -0.198602 0.258520 -0.7682 0.44235
## ar3 -0.544728 0.277057 -1.9661 0.04928 *
## ma1 -0.683391 0.360419 -1.8961 0.05795 .
## ma2 0.045948 0.329455 0.1395 0.88908
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.322=arima(egg.ts, order = c(3,2,2), method='ML')
coeftest(model.322)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.132356 0.472342 0.2802 0.77932
## ar2 -0.099038 0.323371 -0.3063 0.75940
## ar3 -0.384399 0.314000 -1.2242 0.22088
## ma1 -0.996970 0.589396 -1.6915 0.09074 .
## ma2 0.199121 0.472870 0.4211 0.67369
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Only AR3 coefficient of ARIMA(3,2,2) model is significant at 5% significance using ‘CSS’ test.
- No coefficient of ARIMA(3,2,2) model is significant at 5% significance using ‘ML’ test.
- ARIMA(3,2,2) is not a good fit for this time series data. Insignificant terms should be omitted, which gives ARIMA(3,2,0).
Fit model ARIMA(3,2,0)
model.320=arima(egg.ts, order = c(3,2,0), method='CSS')
coeftest(model.320)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.53949 0.23534 -2.2924 0.02188 *
## ar2 -0.33499 0.27022 -1.2397 0.21508
## ar3 -0.45802 0.26462 -1.7309 0.08348 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.323=arima(egg.ts, order = c(3,2,0), method='ML')
coeftest(model.320)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.53949 0.23534 -2.2924 0.02188 *
## ar2 -0.33499 0.27022 -1.2397 0.21508
## ar3 -0.45802 0.26462 -1.7309 0.08348 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Only AR1 coefficient of ARIMA(3,2,0) model is significant at 5% significance using both test.
- ARIMA(3,2,0) is not a good fit for this time series data. Insignificant terms should be omitted, which gives ARIMA(1,2,0).
- ARIMA(1,2,0) has already been shown to be a bad fit for this time series data, so it will not be included in the final model selection.
Model selection
- Only model ARIMA(0,2,1) is included in the final model selection, so we will proceed to prediction using this model.
Prediction using selected model
predict(model.021, n.ahead=5, newxreg = NULL, se.fit=TRUE)
## Warning in predict.Arima(model.021, n.ahead = 5, newxreg = NULL, se.fit =
## TRUE): MA part of model is not invertible
## $pred
## Time Series:
## Start = 1997
## End = 2001
## Frequency = 1
## [1] 1.080342 1.136584 1.192826 1.249068 1.305310
##
## $se
## Time Series:
## Start = 1997
## End = 2001
## Frequency = 1
## [1] 0.4567535 0.6722845 0.8553529 1.0243099 1.1858809
fit=Arima(egg.ts, c(0,2,1), lambda = 0.5)
plot(forecast(fit, h=5))

Conclusion
- From analyzing the Lake Huron Bloaters egg depositions data, the following set of candidate models were identified: {ARIMA(1,2,0), ARIMA(1,2,1), ARIMA(0,2,1), ARIMA(3,2,2)}.
- Candidate models were used to fit the time series data using both least squares estimate and maximum likelihood estimate methods to determine if they are a good fit for the time series data.
- ARIMA(0,2,1) is the only model that provides a good fit, therefore it was used for prediction and forecasting.
- Egg depositions of Lake Huron Bloaters from 1997 to 2001 are predicted in millions to be 1.080342, 1.136584, 1.192826, 1.249068, 1.305310 with standard errors of 0.4567535, 0.6722845, 0.8553529, 1.0243099, 1.1858809 respectively.