Coregonus Hoyi is a fish found in water lobbies like lake huran where it habitats in underwater slopes. In this assignment, we see how the Egg depositions of age-3 Lake Huron Bloaters has changed from 1981 to 1996. We also aim to predict the changes in Egg depositions in the next 5 year using relevant model fitting and functions on R studio. The egg series data consists of two variables – year and egg Egg Depositions (in millions) over the years.
To undertake this research, time series methods on R Studio are being used to infer from the egg series data set.
Use of simple visualisations tool for modelling and predicting trends in time series. We examine efficiency and quality of all the methods used investigation and study the reliability of the fitted model. Check stationary in the time series and also use varies parameters to diagnose best model fit for the series.
library(TSA)
library(timeSeries)
library(lmtest)
library(readr)
library(zoo)
library(FitAR)
library(forecast)
library(tseries)
library(dLagM)
library(fUnitRoots)
We Read in the egg series data and convert it into a time series. We plot the series to check for visual inferences in the egg series data.
data111 <- read_csv("eggs.csv")
## Parsed with column specification:
## cols(
## year = col_integer(),
## eggs = col_double()
## )
data111
## # A tibble: 16 x 2
## year eggs
## <int> <dbl>
## 1 1981 0.0402
## 2 1982 0.0602
## 3 1983 0.120
## 4 1984 0.181
## 5 1985 0.723
## 6 1986 0.532
## 7 1987 0.432
## 8 1988 0.482
## 9 1989 1.15
## 10 1990 2.10
## 11 1991 1.57
## 12 1992 1.56
## 13 1993 0.763
## 14 1994 0.843
## 15 1995 1.01
## 16 1996 1.02
Converting the data file into time series format
data11 <- ts(as.vector(data111$eggs), start=1981, end=1996, frequency = 1)
class(data111)
## [1] "tbl_df" "tbl" "data.frame"
In the time series plot there is a clear existence of upward moving trend for egg depositions.Seasonality and changing variance are to be found. Succeeding observations imply the existence of autoregressive behavior. Egg depositions reached its peak in 1989 post which we see a decline followed by a upward trend agin in 1992.
plot(data11, ylab = "Egg Depositions (in millions)", main = "Time series plot of egg depositions of bloaters between 1981 and 1996", type = "o",)
Slowly decaying pattern in ACF and very high first correlation in PACF implies the existence of trend and stationarity.
par(mfrow = c(1,2))
acf(data11)
pacf(data11)
par(mfrow = c(1,1))
With a p-value of 0.54, we confirm that the series is stationary.
adf.test(data11)
##
## Augmented Dickey-Fuller Test
##
## data: data11
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary
Calculating the correlation between one year’s egg Deposition with that of the successive year’s revenue. Correlation value of 0.74 implies that in this series, there is strong relationship between yearly egg depositions
y = data11
x = zlag(data11)
index = 2:length(x)
cor(y[index], x[index])
## [1] 0.7445657
We plot a scatter plot to compare consecutive years results
plot(y = y, x = x, ylab = "Egg Depositions (in millions)", xlab = "Previous Year's Depositions", main = "Scatter Plot Comparing Consecutive Time series plot of egg depositions of bloaters")
Now we progress to construct and interpret the time series by use least squares to fit a linear time trend to this time series. We are using Intercept and Slope values in linear trend model to see if there is any significance.
model1 = lm(data11~time(data11))
summary(model1)
##
## Call:
## lm(formula = data11 ~ time(data11))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4048 -0.2768 -0.1933 0.2536 1.1857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -165.98275 49.58836 -3.347 0.00479 **
## time(data11) 0.08387 0.02494 3.363 0.00464 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4598 on 14 degrees of freedom
## Multiple R-squared: 0.4469, Adjusted R-squared: 0.4074
## F-statistic: 11.31 on 1 and 14 DF, p-value: 0.004642
plot(data11, type = "o", ylab='Millions Units',xlab='Year', main = "Fitted linear trend model - Egg depositions between 1981 and 1996")
abline(model1, col = "Blue", lty = 2)
The summary shows that both the slope & intercept are not statistically significant at less 5% significance level. The adjusted R-Square value is not very strong at 40% variance, which indicates that this model will not be a good fit.
Standardized residuals are saved and further fitted to the time series plot.
res.model1 = rstudent(model1)
plot(y = res.model1, x = as.vector(time(data11)), xlab = 'Year',type='o', ylab='Standardized Residuals',main = "Residuals of linear trend model - Egg depositions between 1981 and 1996")
The below QQ Plot shows that this linear model doesn’t have a good fit as the data points do not closly follow the red dotted line at all, indicating that this is not normal. We shall confirm this with the Shapiro test for normality.
qqnorm(res.model1)
qqline(res.model1, col = 2, lwd = 1, lty = 2)
shapiro.test(res.model1)
##
## Shapiro-Wilk normality test
##
## data: res.model1
## W = 0.7726, p-value = 0.001205
As indicated with the QQ Plot, since the p-value is <0.05, it is not normal.
Now we progress to construct and interpret the time series by least squares approach to fit a quadratic time trend to this time series.
t = time(data11)
t2 = t^2
model2 = lm(data11 ~ t + t2)
summary(model2)
##
## Call:
## lm(formula = data11 ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50896 -0.25523 -0.02701 0.16615 0.96322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.647e+04 2.141e+04 -2.170 0.0491 *
## t 4.665e+01 2.153e+01 2.166 0.0494 *
## t2 -1.171e-02 5.415e-03 -2.163 0.0498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4092 on 13 degrees of freedom
## Multiple R-squared: 0.5932, Adjusted R-squared: 0.5306
## F-statistic: 9.479 on 2 and 13 DF, p-value: 0.00289
P value of the quadratic model suggest that the model is significant as the p vale is less than 0.05, however being very close to zero also means that we need to reject the null hypotheses. Hence, we look at the r squared value 0.53, its suggest that the model is more significant than linear model.
plot(ts(fitted(model2)), ylim = c(min(c(fitted(model2), as.vector(data11))), max(c(fitted(model2), as.vector(data11)))), ylab = "y", main = "Fitted quadratic trend model - Egg depositions between 1981 and 1996", type = "l", lty = 2, col = "blue")
lines(as.vector(data11), type = "o")
Standardized residuals are saved and further fitted to the time series plot.
res.model2 = rstudent(model2)
plot(y = res.model2, x = as.vector(time(data11)), xlab = 'Year', ylab = 'Egg Depositions (in millions)', type = 'o')
abline(h = 0, col = "Blue")
qqnorm(res.model2)
qqline(res.model2, col = 2, lwd = 1, lty = 2)
shapiro.test(res.model2)
##
## Shapiro-Wilk normality test
##
## data: res.model2
## W = 0.87948, p-value = 0.03809
The time series plot fits well with the quadratic model from the better. QQ Plot has a better fit with the data points & the red dotted line.
After the looking at the output of each individual model to the time series, it is clear that quadratic model is the closest to the actual time series. Quadratic model is the best fit to this time series data as it is statistically significant & the best fit of normality of the time series. Harmonic model was not utlized here as there is no seasonal or a cyclic pattern within the original dataset.
Applying the Box-Cox transformationdata11.tr = BoxCox.ar(data11, method = "yule-walker")
## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1
## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1
data11.tr$ci
## [1] 0.1 0.8
lambda = 0.45 # The mid point of the interval
data11.bc = ((data11^lambda)-1)/lambda
plot(data11.bc, type = "o", ylab = "Egg Depositions (in millions)", main = "Transformed - Egg depositions between 1981 and 1996")
adf.test(data11.bc)
##
## Augmented Dickey-Fuller Test
##
## data: data11.bc
## Dickey-Fuller = -1.6769, Lag order = 2, p-value = 0.6955
## alternative hypothesis: stationary
Above plot shows that there isn’t a lot of difference when compared to the original time series plot. Hence we move not to to make other check to the data.Dickey-Fuller Test to confirm non-stationary. p-value of 0.69 indicates the series is non-stationary as the p-value is > 0.05 and hence we cannot reject the null hypothesis.We can suggest that there is still a strong evidence of trend within the series. Hence we proceed to apply the first difference and see the results.
Above plot shows that there isn’t a lot of difference when compared to the original time series plot. Hence we move not to to make other check to the data.Dickey-Fuller Test to confirm non-stationary. p-value of 0.69 indicates the series is non-stationary as the p-value is > 0.05 and hence we cannot reject the null hypothesis.We can suggest that there is still a strong evidence of trend within the series. Hence we proceed to apply the first difference and see the results.
data11.bc.diff = diff(data11.bc, differences = 1)
plot(data11.bc.diff, type = "o", ylab = "Egg Depositions (in millions)", main = "First Difference of the data11 Deposition series")
We can see that there is changing variance in the series after taking the first difference. Hence willing to fit and try this on models.
Performing the ADF Testadf.test(data11.bc.diff)
##
## Augmented Dickey-Fuller Test
##
## data: data11.bc.diff
## Dickey-Fuller = -3.6798, Lag order = 2, p-value = 0.0443
## alternative hypothesis: stationary
Dickey-Fuller Test to constitutionality. p-value of 0.05 indicates the series stationary as the p-value is > 0.05 and hence we cannot accept the null hypothesis.
par(mfrow = c(1,2))
acf(data11.bc.diff)
pacf(data11.bc.diff)
par(mfrow = c(1,1))
ACF & PACF plot shows that there is no significant lags at all here, indicating the existence of white noise data.
We confirm existence of white noise by seeing that both p and q are equal to 0.
eacf(data11.bc.diff, ar.max = 3, ma.max = 3)
## AR/MA
## 0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o
# ARIMA(0,1,0), ARIMA(0,1,1), ARIMA(1,1,0)
We limit the orders p and q at 3 as the eacf() function return with an error and displays null. From the output of eacf we confirm that the following models can be included for further processing - ARIMA(0,1,0), ARIMA(0,1,1), ARIMA(1,1,0)
res = armasubsets(y = data11.bc.diff, nar = 3, nma = 6, y.name = 'test', ar.method = 'yw')
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 4 linear dependencies found
## Reordering variables and trying again:
plot(res)
# ARIMA(0,1,3), ARIMA(3,1,3), ARMIA(0,1,5)
In the BIC table, shaded columns correspond to AR(1), AR(3), AR(7), MA(2), MA(3) and MA(4) coefficients and from here we can include ARIMA(1,1,1),ARIMA(0,1,2),ARIMA(0,1,5),ARIMA(3,1,3),ARIMA(0,1,3),ARIMA(0,1,1),ARIMA(1,1,0) models as the other combinations give large models.
We proceed to find some candidate ARIMA models using various tools. We perform z test to correlation and confident of different models.
Moel Testing# ARIMA(0,1,1) - CSS
model_011_css = arima(data11.bc.diff, order = c(0,1,1), method = 'CSS')
coeftest(model_011_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.093338 0.056532 -19.34 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,1) - ML
model_011_ml = arima(data11.bc.diff, order = c(0,1,1), method = 'ML')
coeftest(model_011_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.99994 0.66766 -1.4977 0.1342
# ARIMA(1,1,0) - CSS
model_110_css = arima(data11.bc.diff, order = c(1,1,0), method = 'CSS')
coeftest(model_110_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.40609 0.24460 -1.6602 0.09687 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(1,1,0) - ML
model_110_ml = arima(data11.bc.diff, order = c(1,1,0), method = 'ML')
coeftest(model_110_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.38056 0.23493 -1.6199 0.1053
# ARIMA(0,1,3) - CSS
model_013_css = arima(data11.bc.diff, order = c(0,1,3), method = 'CSS')
coeftest(model_013_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.10117 0.29725 -3.7045 0.0002118 ***
## ma2 -0.11909 0.33351 -0.3571 0.7210194
## ma3 0.14673 0.34631 0.4237 0.6717891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,3) - ML
model_013_ml = arima(data11.bc.diff, order = c(0,1,3), method = 'ML')
coeftest(model_013_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.852252 0.536615 -1.5882 0.1122
## ma2 -0.102618 0.286176 -0.3586 0.7199
## ma3 -0.045118 0.512224 -0.0881 0.9298
# ARIMA(3,1,3) - CSS
model_313_css = arima(data11.bc.diff, order = c(3,1,3), method = 'CSS')
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model_313_css)
## Warning in sqrt(diag(se)): NaNs produced
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -1.2016973 0.0067545 -177.9105 < 2e-16 ***
## ar2 -0.8077002 0.0231155 -34.9419 < 2e-16 ***
## ar3 -0.1124583 0.0998924 -1.1258 0.26025
## ma1 1.9195706 NA NA NA
## ma2 1.9992451 NA NA NA
## ma3 -1.3935177 0.6294154 -2.2140 0.02683 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(3,1,3) - ML
model_313_ml = arima(data11.bc.diff, order = c(3,1,3), method = 'ML')
coeftest(model_313_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.35329 0.41854 0.8441 0.398620
## ar2 -0.68002 0.26353 -2.5804 0.009868 **
## ar3 -0.41454 0.32147 -1.2895 0.197226
## ma1 -1.31528 0.56912 -2.3111 0.020828 *
## ma2 1.42450 0.89398 1.5934 0.111063
## ma3 -0.59677 0.51624 -1.1560 0.247685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,5) - CSS
model_015_css = arima(data11.bc.diff, order = c(0,1,5), method = 'CSS')
coeftest(model_015_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.016594 0.375033 -2.7107 0.006715 **
## ma2 0.013365 0.364525 0.0367 0.970752
## ma3 -0.753976 0.239229 -3.1517 0.001623 **
## ma4 1.192805 0.411640 2.8977 0.003759 **
## ma5 -0.538537 0.506024 -1.0643 0.287215
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,5) - ML
model_015_ml = arima(data11.bc.diff, order = c(0,1,5), method = 'ML')
coeftest(model_015_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.824906 0.739713 -1.1152 0.2648
## ma2 0.038341 0.314834 0.1218 0.9031
## ma3 -0.732093 0.840191 -0.8713 0.3836
## ma4 1.065250 0.965774 1.1030 0.2700
## ma5 -0.089888 0.354698 -0.2534 0.7999
We sort the above models to see the best fit for the data
sort.score(AIC(model_110_ml, model_011_ml, model_013_ml, model_313_ml, model_015_ml), score = "aic")
## df AIC
## model_011_ml 2 23.12446
## model_013_ml 4 26.92313
## model_110_ml 2 27.05798
## model_015_ml 6 27.39690
## model_313_ml 7 29.34748
sort.score(BIC(model_110_ml, model_011_ml, model_013_ml, model_313_ml, model_015_ml), score = "bic")
## df BIC
## model_011_ml 2 24.40258
## model_110_ml 2 28.33609
## model_013_ml 4 29.47936
## model_015_ml 6 31.23124
## model_313_ml 7 33.82088
# ARIMA (0,1,2) - CSS
model_012_css = arima(data11.bc.diff, order = c(0,1,2), method = 'CSS')
coeftest(model_012_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.017026 0.274754 -3.7016 0.0002143 ***
## ma2 -0.087178 0.308221 -0.2828 0.7772974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA (0,1,2) - ML
model_012_ml = arima(data11.bc.diff, order = c(0,1,2), method = 'ML')
coeftest(model_012_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.88465 0.42637 -2.0748 0.0380 *
## ma2 -0.11535 0.25484 -0.4527 0.6508
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA (1,1,1) - CSS
model_111_css = arima(data11.bc.diff, order = c(1,1,1), method = 'CSS')
coeftest(model_111_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.030947 0.290460 0.1065 0.9152
## ma1 -1.021947 0.201926 -5.0610 4.171e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA (1,1,1) - ML
model_111_ml = arima(data11.bc.diff, order = c(1,1,1), method = 'ML')
coeftest(model_111_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.11474 0.26992 0.4251 0.670764
## ma1 -1.00000 0.33205 -3.0116 0.002599 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally we check the residual of the model selected. There is no problem in the residuals of ARIMA(0,1,1) model.
residual.analysis(model = model_011_css)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.9093, p-value = 0.1321
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
par(mfrow=c(1,1))
From the above residual facts we confirm that ARIMA(0,1,1) seems to be the best fit to the hence we fit the model to the data.
Based on fitting each of the model, we can confirm that, the ARIMA(0,1,1) model successfully fits pattern of the original series. ARIMA(0,1,1) model provides the most accurate prediction when compare to the other progression model.