The aim of the investigation is to find the best ARIMA model, from a set of possible models, for the provided egg depositions time series and to make predictions for the next 5 years. The series was first analysed through visualisation. The set of candidate models was then identified by specifying the parameters of ARIMA(p,d,q) with the help of:
the properties of sample ACF and PACF
the extended autocorrelation function (EACF)
the Bayesian Information Criterion (BIC)
The model parameters were then estimated and tested for significance. Diagnostic checks were made to test the goodness of fit of the estimated models; they included analysis of residuals from the fitted models and analysis of overparameterised models. The best model was selected based on AIC and BIC values and was used to produce forecasts.
The eggs dataset used for the analysis contains yearly egg depositions (in millions) of age-3 Lake Huron Bloaters between 1981 and 1996.
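The analysis relies on several R packages whose library() calls are not shown in the chunks below; the following preamble is an assumption inferred from the functions used throughout:
library(readr)     # read_csv()
library(TSA)       # BoxCox.ar(), eacf(), armasubsets(), rstandard() for Arima fits
library(tseries)   # adf.test()
library(lmtest)    # coeftest()
library(forecast)  # Arima(), forecast(), ndiffs()
library(FitAR)     # LBQPlot()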
The following code chunk imports the dataset and displays its first observations.
eggs <- read_csv("~/Documents/Postgrads/Sem 1 2019/Time Series/assignment2/eggs.csv")
## Parsed with column specification:
## cols(
## year = col_double(),
## eggs = col_double()
## )
head(eggs)
## # A tibble: 6 x 2
## year eggs
## <dbl> <dbl>
## 1 1981 0.0402
## 2 1982 0.0602
## 3 1983 0.120
## 4 1984 0.181
## 5 1985 0.723
## 6 1986 0.532
Checking the class of the eggs object showed that the data was read in as a data frame, so it was converted to a time series, eggs_ts. The eggs_ts series is very short, with only 16 observations available.
class(eggs)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
eggs_ts <- ts(eggs[2], start = 1981, end = 1996, frequency = 1)
class(eggs_ts)
## [1] "ts"
eggs_ts
## Time Series:
## Start = 1981
## End = 1996
## Frequency = 1
## eggs
## [1,] 0.0402
## [2,] 0.0602
## [3,] 0.1205
## [4,] 0.1807
## [5,] 0.7229
## [6,] 0.5321
## [7,] 0.4317
## [8,] 0.4819
## [9,] 1.1546
## [10,] 2.0984
## [11,] 1.5663
## [12,] 1.5562
## [13,] 0.7631
## [14,] 0.8434
## [15,] 1.0141
## [16,] 1.0241
The following code creates a time series plot of our data.
plot(eggs_ts, type = "o", main = "Time series plot of Egg Depositions", ylab = "Millions")
Figure 1 - Time series plot of Egg Depositions
The time series plot in Figure 1 shows a general upward trend in the egg depositions series over the observed period. There is no seasonality, no changing variance and no obvious intervention, and the correlation between successive observations suggests autoregressive behaviour. The points in 1985 and around 1990 are possible outliers.
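As a quick optional check of the autoregressive behaviour (an addition, not part of the original code), the series can be plotted against its first lag and the lag-1 sample correlation computed:
# Scatter of each observation against the previous year's value;
# a clear positive relationship supports autoregressive behaviour.
y <- as.numeric(eggs_ts)
plot(x = y[-length(y)], y = y[-1],
     xlab = "Previous year's depositions (millions)",
     ylab = "Depositions (millions)",
     main = "Egg depositions vs. first lag")
cor(y[-1], y[-length(y)])  # lag-1 sample correlation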
The following code chunk creates a function f to plot sample ACF and PACF together for any object and also produces these plots for the egg depositions series.
f <- function(x) {
  # Draw the sample ACF and PACF of x side by side
  par(mfrow = c(1, 2))
  acf(x)
  pacf(x)
  par(mfrow = c(1, 1))  # reset the plotting layout
}
f(eggs_ts)
Figure 2 - ACF and PACF for Eggs Series
As the series is quite short, we cannot observe a perfectly smooth decaying pattern in the ACF shown in Figure 2. Still, the decaying early lags in the ACF, together with a very high first-order correlation in the PACF, imply a trend and the nonstationary nature of the series.
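As an optional formal check (an addition, not run in the original analysis), the unit root test used later in this report could also be applied to the raw series:
# Standard DF test on the undifferenced series; H0: a unit root is present
adf.test(eggs_ts, k = 0)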
For further analysis we first need to eliminate the effect of any trend in the series. First, the Shapiro-Wilk test does not reject normality of the series at the 5% level. We then apply a Box-Cox power transformation: the 95% confidence interval for \(\lambda\) was reported as [0.1, 0.8], so the square root transformation (\(\lambda = 0.5\)) was selected. The transformation improved the normality of the series further (the Shapiro-Wilk p-value rises from 0.37 to 0.76), so we continue with the square root of the series.
shapiro.test(eggs_ts)
##
## Shapiro-Wilk normality test
##
## data: eggs_ts
## W = 0.94201, p-value = 0.3744
bc <- BoxCox.ar(eggs_ts, method = "yule-walker")
bc$ci
## [1] 0.1 0.8
eggs_bc <- sqrt(eggs_ts)
shapiro.test(eggs_bc)
##
## Shapiro-Wilk normality test
##
## data: eggs_bc
## W = 0.96562, p-value = 0.7636
first_diff <- diff(eggs_bc)
adf.test(first_diff, k=0)
##
## Augmented Dickey-Fuller Test
##
## data: first_diff
## Dickey-Fuller = -3.4211, Lag order = 0, p-value = 0.07484
## alternative hypothesis: stationary
The main concern is making the eggs_ts series stationary. We take the first difference of the series and apply the Dickey-Fuller unit root test of the \(H_0\) that the process is difference nonstationary. The augmented DF test allows for higher-order autoregressive processes, and its result is sensitive to the number of lags included. Controlling the lags lets us avoid a stationarity test that is too complex to be supported by our data, so, considering the length of the series, we apply the standard DF test by setting k = 0.
The reported p-value for the first difference is greater than 0.05, so we cannot reject \(H_0\) at the 5% level: the first difference does not make the series stationary, and we move on to the second difference.
plot(first_diff, type="o", main = "Time series plot of the first difference of Egg Depositions", ylab = "Millions")
Figure 3 - Time series plot of the first difference of Egg Depositions
The plot in Figure 3 shows the first difference of the series; however, it is hard to draw a conclusion about stationarity from the plot alone because of the length of the series.
Taking the second difference:
sec_diff <- diff(eggs_bc, differences = 2)
adf.test(sec_diff, k = 0)
## Warning in adf.test(sec_diff, k = 0): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: sec_diff
## Dickey-Fuller = -4.8698, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
The reported p-value for the standard DF test of the second difference is < 0.01, so we can reject \(H_0\): the second difference makes the series stationary and will be used for further analysis.
plot(sec_diff, type="o", main = "Time series plot of the second difference of Egg Depositions", ylab = "Millions")
Figure 4 - Time series plot of the second difference of Egg Depositions
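As a cross-check of the differencing order (an addition, assuming the forecast package is loaded), forecast::ndiffs() estimates the number of differences required using the same ADF test:
# Estimated number of differences for the transformed series
ndiffs(eggs_bc, test = "adf")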
Figure 5 shows the ACF and PACF plots for the second difference.
f(sec_diff)
Figure 5 - ACF and PACF for the second difference of Eggs Series
There are no significant lags in the ACF and one significant lag in the PACF in Figure 5, so we include the ARIMA(1,2,0) model in the set of candidate models.
The following code displays the EACF of the second difference. Because of the length of the series, we restrict the maximum numbers of AR and MA parameters:
eacf(sec_diff, ar.max = 2, ma.max = 2)
## AR/MA
## 0 1 2
## 0 o o o
## 1 o o o
## 2 o o o
The vertex of the EACF is at p = 0 and q = 0, which suggests a white noise series. Still, from here we can include the neighbouring models with p = 0, q = 1; p = 1, q = 0; and p = 1, q = 1. Correspondingly, the EACF contributes the candidate models ARIMA(1,2,0), ARIMA(0,2,1) and ARIMA(1,2,1).
The following code displays the BIC table for the second difference:
res <- armasubsets(sec_diff, nar=3, nma=3, ar.method = "ols")
plot(res)
Figure 6 - BIC table
From the BIC table in Figure 6, we can observe that the shaded columns correspond to AR(3), MA(2) and MA(3) terms. From here we include ARIMA(3,2,2) in the set of possible models; we do not include ARIMA(3,2,3) because it is a rather large model considering the length of our series.
Finally, the complete set of possible models for parameter estimation is {ARIMA(1,2,0), ARIMA(0,2,1), ARIMA(1,2,1), ARIMA(3,2,2)}.
Both maximum likelihood (ML) and conditional sum of squares (CSS, a least squares approach) estimation will be used to fit the models in our candidate set.
model_120_ml <- arima(eggs_bc, order = c(1,2,0), method = "ML")
model_120_css <- arima(eggs_bc, order = c(1,2,0), method = "CSS")
coeftest(model_120_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.38247 0.23461 -1.6302 0.1031
coeftest(model_120_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.40832 0.24441 -1.6706 0.0948 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The AR(1) coefficient was found to be insignificant at the 5% level by both methods.
model_021_ml <- arima(eggs_bc, order = c(0,2,1), method = "ML")
model_021_css <- arima(eggs_bc, order = c(0,2,1), method = "CSS")
coeftest(model_021_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.00000 0.48155 -2.0766 0.03784 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(model_021_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.086178 0.059217 -18.343 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The MA(1) coefficient was found to be significant at the 5% level by both methods.
model_121_ml <- arima(eggs_bc, order = c(1,2,1), method = "ML")
model_121_css <- arima(eggs_bc, order = c(1,2,1), method = "CSS")
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model_121_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.11146 0.26986 0.4130 0.67959
## ma1 -1.00000 0.30735 -3.2536 0.00114 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(model_121_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.4341337 0.0033778 128.53 < 2.2e-16 ***
## ma1 -1.9383524 0.0128774 -150.52 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The AR(1) coefficient is not significant at the 5% level according to the ML method, while the CSS method found all coefficients significant.
model_322_ml <- arima(eggs_bc, order = c(3,2,2), method = "ML")
model_322_css <- arima(eggs_bc, order = c(3,2,2), method = "CSS")
coeftest(model_322_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.22239 0.68892 0.3228 0.7468
## ar2 -0.34608 0.40149 -0.8620 0.3887
## ar3 -0.50546 0.31324 -1.6136 0.1066
## ma1 -1.04431 0.78780 -1.3256 0.1850
## ma2 0.56885 1.27510 0.4461 0.6555
coeftest(model_322_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.331185 0.309311 -1.0707 0.28430
## ar2 -0.352551 0.257240 -1.3705 0.17053
## ar3 -0.593116 0.288819 -2.0536 0.04002 *
## ma1 -0.375666 0.360589 -1.0418 0.29750
## ma2 0.010242 0.365960 0.0280 0.97767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ARIMA(3,2,2) is not well supported by either method: no coefficient is significant under ML, and only the AR(3) coefficient is significant under CSS.
The following code creates a residual.analysis function to run the residual checks in a consistent way. We check the residuals for randomness, normality and independence (absence of autocorrelation).
residual.analysis <- function(model, std = TRUE){
  # Use standardised residuals by default, raw residuals otherwise
  if (std == TRUE){
    res.model = rstandard(model)
  } else {
    res.model = residuals(model)
  }
  par(mfrow = c(3, 2))
  # Time series plot: look for randomness and absence of trend
  plot(res.model, type = 'o', ylab = 'Standardised residuals',
       main = "Time series plot of standardised residuals")
  abline(h = 0)
  # Histogram and QQ plot: look for deviations from normality
  hist(res.model, main = "Histogram of standardised residuals")
  qqnorm(res.model, main = "QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  # Sample ACF: look for leftover autocorrelation at individual lags
  acf(res.model, main = "ACF of standardised residuals")
  # Shapiro-Wilk test of normality of the residuals
  print(shapiro.test(res.model))
  # Ljung-Box p-values across lags (LBQPlot is from the FitAR package)
  k = 0
  LBQPlot(res.model, lag.max = length(model$residuals) - 1,
          StartLag = k + 1, k = 0, SquaredQ = FALSE)
  par(mfrow = c(1, 1))
}
Residual analysis for ARIMA(1,2,0):
residual.analysis(model_120_ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.95483, p-value = 0.5698
Figure 7 - Residuals plots for ARIMA(1,2,0)
First, the time series plot of the residuals shows no obvious trend; the residuals are smaller at the beginning but overall seem randomly distributed. There are no major deviations from normality: the Shapiro-Wilk p-value is > 0.05, the histogram looks roughly symmetric and the majority of points are aligned with the red line in the QQ plot. To check independence, we consider the sample ACF of the residuals, which shows no evidence of autocorrelation at individual lags. The Ljung-Box test provides an overall test of the residual correlations as a whole: several points fall below the 5% line, so there is evidence that the model is not successful in capturing the autocorrelation in the series, and it is not suitable for forecasting.
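The LBQPlot function comes from the FitAR package, which may not be available on current versions of CRAN. If it is unavailable, a similar plot can be built from stats::Box.test; this is a sketch, not part of the original analysis:
# Ljung-Box p-values across lags for the ARIMA(1,2,0) residuals.
# fitdf = 1 accounts for the one estimated AR coefficient, so lags
# start at 2 to keep the degrees of freedom positive.
res <- residuals(model_120_ml)
lags <- 2:10
lb_p <- sapply(lags, function(L)
  Box.test(res, lag = L, type = "Ljung-Box", fitdf = 1)$p.value)
plot(lags, lb_p, ylim = c(0, 1), xlab = "Lag", ylab = "p-value",
     main = "Ljung-Box test p-values")
abline(h = 0.05, lty = 2, col = "red")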
Residual analysis for ARIMA(0,2,1):
residual.analysis(model_021_ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.94253, p-value = 0.3812
Figure 8 - Residuals plots for ARIMA(0,2,1)
Residual analysis for ARIMA(1,2,1):
residual.analysis(model_121_ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.93404, p-value = 0.2822
Figure 9 - Residuals plots for ARIMA(1,2,1)
Residual analysis for ARIMA(3,2,2):
residual.analysis(model_322_ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.94539, p-value = 0.4204
Figure 10 - Residuals plots for ARIMA(3,2,2)
However, ARIMA(3,2,2) is a larger model whose components were found not significant at the 5% level, so we will go for the smaller models for forecasting.
For model selection we will compare the AIC and BIC values of the models to decide on the best in the set. The following code creates a function sort.score which sorts the models by their AIC or BIC values.
sort.score <- function(x, score = c("bic", "aic")){
  # Sort a table of information criteria (the output of AIC() or BIC())
  # in ascending order; match.arg() validates the score argument and
  # avoids comparing against the full default vector.
  score <- match.arg(score)
  if (score == "aic"){
    x[with(x, order(AIC)), ]
  } else {
    x[with(x, order(BIC)), ]
  }
}
sort.score(AIC(model_021_ml, model_120_ml, model_121_ml, model_322_ml), score = "aic")
## df AIC
## model_021_ml 2 3.452967
## model_121_ml 3 5.280205
## model_120_ml 2 7.409780
## model_322_ml 6 8.596308
sort.score(BIC(model_021_ml, model_120_ml, model_121_ml, model_322_ml), score = "bic")
## df BIC
## model_021_ml 2 4.731082
## model_121_ml 3 7.197377
## model_120_ml 2 8.687894
## model_322_ml 6 12.430652
Both AIC and BIC agree on the ARIMA(0,2,1) model, so we will continue the analysis with this model.
We can use overfitting as another tool to detect anomalies in the goodness of fit. The ARIMA(1,2,1) model adds an AR(1) component to our selected ARIMA(0,2,1) model; it was estimated above, and the AR(1) coefficient was not significant at the 5% level according to the ML method.
Another model to test is ARIMA(0,2,2), which adds an MA(2) component:
arima(eggs_bc, order = c(0,2,1))
##
## Call:
## arima(x = eggs_bc, order = c(0, 2, 1))
##
## Coefficients:
## ma1
## -1.0000
## s.e. 0.4815
##
## sigma^2 estimated as 0.0464: log likelihood = 0.27, aic = 1.45
arima(eggs_bc, order = c(0,2,2))
##
## Call:
## arima(x = eggs_bc, order = c(0, 2, 2))
##
## Coefficients:
## ma1 ma2
## -0.8877 -0.1121
## s.e. 0.4038 0.2543
##
## sigma^2 estimated as 0.04642: log likelihood = 0.36, aic = 3.27
The \(\theta_1\) estimate in ARIMA(0,2,2) is not significantly different from the \(\theta\) estimate in our initial model, and the MA(2) coefficient is not significantly different from 0. The AIC is also smaller for the initial model. These points support ARIMA(0,2,1) as a suitable model for forecasting.
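To make the claim about the MA(2) coefficient concrete, its z statistic can be computed directly from the estimate and standard error printed above:
z <- -0.1121 / 0.2543  # estimate / standard error, about -0.44
2 * pnorm(-abs(z))     # two-sided p-value, about 0.66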
After specifying the best model, we can proceed to forecast egg depositions for the next five years. The following code makes predictions 5 years ahead and plots them with their prediction intervals. In the Arima() call we specify the order of differencing and lambda = 0.5 so that the square root transformation is inverted when forecasting.
predict(model_021_ml, n.ahead = 5, newxreg = NULL, se.fit = TRUE)
## $pred
## Time Series:
## Start = 1997
## End = 2001
## Frequency = 1
## [1] 1.066077 1.120175 1.174274 1.228373 1.282471
##
## $se
## Time Series:
## Start = 1997
## End = 2001
## Frequency = 1
## [1] 0.2224791 0.3243162 0.4087202 0.4848820 0.5561978
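Note that predict() returns forecasts on the transformed (square-root) scale; squaring the point forecasts recovers the original scale in millions (a quick check, not part of the original output):
# Back-transform the point forecasts to the original scale
predict(model_021_ml, n.ahead = 5)$pred^2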
fit <- Arima(eggs_ts, c(0,2,1), lambda = .5)
plot(forecast(fit, h=5))
Figure 11 - Forecasts of Egg depositions for the next 5 years
To make predictions of yearly egg depositions for the next five years, ARIMA models were specified and fitted to our series. The main concern was ensuring the stationarity of the series considering its length. A square root transformation and second differencing of the time series were applied before specifying the set of possible models. The orders of the candidate models were selected using the sample ACF and PACF, the EACF and the BIC table. The parameters of the candidate models were estimated using maximum likelihood and least squares methods, diagnostic checks were performed on all candidate models, and larger models such as ARIMA(3,2,2) were eliminated. The best model was selected based on AIC and BIC values.
In conclusion, the model chosen for forecasting was ARIMA(0,2,1), a choice supported by residual analysis and overfitting checks. Finally, the square root transformation was inverted to produce predictions on the original scale with the chosen model.