The aim of the investigation is to find the best ARIMA model, from a set of possible models, for the provided egg depositions time series and to make predictions for the next 5 years. The series was first analysed through visualisation. The set of candidate models was then identified by specifying the parameters of ARIMA(p,d,q) with the help of:
the properties of sample ACF and PACF
the extended autocorrelation function (EACF)
the Bayesian Information Criterion (BIC)
The model parameters were then estimated and tested for significance. Diagnostic checks were made to test the goodness of fit of the estimated models; they included analysis of residuals from the fitted models and analysis of overparameterised models. The best model was selected based on AIC and BIC values and was used to produce forecasts.
The eggs dataset used for the analysis contains yearly egg depositions (in millions) of age-3 Lake Huron Bloaters between 1981 and 1996.
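The analysis relies on several R packages whose library() calls are not shown in the chunks below; the following preamble is an assumption inferred from the functions used throughout:
library(readr)     # read_csv()
library(TSA)       # BoxCox.ar(), eacf(), armasubsets(), rstandard() for Arima fits
library(tseries)   # adf.test()
library(lmtest)    # coeftest()
library(forecast)  # Arima(), forecast(), ndiffs()
library(FitAR)     # LBQPlot()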
The following code chunk imports the dataset and displays its first observations.
eggs <- read_csv("~/Documents/Postgrads/Sem 1 2019/Time Series/assignment2/eggs.csv")
## Parsed with column specification:
## cols(
## year = col_double(),
## eggs = col_double()
## )
head(eggs)
## # A tibble: 6 x 2
## year eggs
## <dbl> <dbl>
## 1 1981 0.0402
## 2 1982 0.0602
## 3 1983 0.120
## 4 1984 0.181
## 5 1985 0.723
## 6 1986 0.532
Checking the class of the eggs object showed that the data was read in as a data frame, so it was converted to a time series, eggs_ts. The eggs_ts series is very short, with only 16 observations available.
class(eggs)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
eggs_ts <- ts(eggs[2], start = 1981, end = 1996, frequency = 1)
class(eggs_ts)
## [1] "ts"
eggs_ts
## Time Series:
## Start = 1981
## End = 1996
## Frequency = 1
## eggs
## [1,] 0.0402
## [2,] 0.0602
## [3,] 0.1205
## [4,] 0.1807
## [5,] 0.7229
## [6,] 0.5321
## [7,] 0.4317
## [8,] 0.4819
## [9,] 1.1546
## [10,] 2.0984
## [11,] 1.5663
## [12,] 1.5562
## [13,] 0.7631
## [14,] 0.8434
## [15,] 1.0141
## [16,] 1.0241
The following code creates a time series plot of our data.
plot(eggs_ts, type = "o", main = "Time series plot of Egg Depositions", ylab = "Millions")
Figure 1 - Time series plot of Egg Depositions
The time series plot in Figure 1 shows a general upward trend in the egg depositions series over the observed period. There is no seasonality, no changing variance and no obvious intervention, and the correlation between successive observations suggests autoregressive behaviour. The points in 1985 and around 1990 are possible outliers.
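As a quick optional check of the autoregressive behaviour (an addition, not part of the original code), the series can be plotted against its first lag and the lag-1 sample correlation computed:
# Scatter of each observation against the previous year's value;
# a clear positive relationship supports autoregressive behaviour.
y <- as.numeric(eggs_ts)
plot(x = y[-length(y)], y = y[-1],
     xlab = "Previous year's depositions (millions)",
     ylab = "Depositions (millions)",
     main = "Egg depositions vs. first lag")
cor(y[-1], y[-length(y)])  # lag-1 sample correlation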
The following code chunk creates a function f to plot sample ACF and PACF together for any object and also produces these plots for the egg depositions series.
f <- function(x) {
  # Draw the sample ACF and PACF of x side by side
  par(mfrow = c(1, 2))
  acf(x)
  pacf(x)
  par(mfrow = c(1, 1))  # reset the plotting layout
}
f(eggs_ts)
Figure 2 - ACF and PACF for Eggs Series
As the series is quite short, we cannot observe a perfectly smooth decaying pattern in the ACF shown in Figure 2. Still, the decaying early lags in the ACF, together with a very high first-order correlation in the PACF, imply a trend and the nonstationary nature of the series.
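As an optional formal check (an addition, not run in the original analysis), the unit root test used later in this report could also be applied to the raw series:
# Standard DF test on the undifferenced series; H0: a unit root is present
adf.test(eggs_ts, k = 0)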
For further analysis we first need to eliminate the effect of any trend in the series. First, the Shapiro-Wilk test does not reject normality of the series at the 5% level. We then apply a Box-Cox power transformation: the 95% confidence interval for \(\lambda\) was reported as [0.1, 0.8], so the square root transformation (\(\lambda = 0.5\)) was selected. The transformation improved the normality of the series further (the Shapiro-Wilk p-value rises from 0.37 to 0.76), so we continue with the square root of the series.
shapiro.test(eggs_ts)
##
## Shapiro-Wilk normality test
##
## data: eggs_ts
## W = 0.94201, p-value = 0.3744
bc <- BoxCox.ar(eggs_ts, method = "yule-walker")
bc$ci
## [1] 0.1 0.8
eggs_bc <- sqrt(eggs_ts)
shapiro.test(eggs_bc)
##
## Shapiro-Wilk normality test
##
## data: eggs_bc
## W = 0.96562, p-value = 0.7636
first_diff <- diff(eggs_bc)
adf.test(first_diff, k=0)
##
## Augmented Dickey-Fuller Test
##
## data: first_diff
## Dickey-Fuller = -3.4211, Lag order = 0, p-value = 0.07484
## alternative hypothesis: stationary
The main concern is making the eggs_ts series stationary. We take the first difference of the series and apply the Dickey-Fuller unit root test of the \(H_0\) that the process is difference nonstationary. The augmented DF test allows for higher-order autoregressive processes, and its result is sensitive to the number of lags included. Controlling the lags lets us avoid a stationarity test that is too complex to be supported by our data, so, considering the length of the series, we apply the standard DF test by setting k = 0.
The reported p-value for the first difference is greater than 0.05, so we cannot reject \(H_0\) at the 5% level: the first difference does not make the series stationary, and we move on to the second difference.
plot(first_diff, type="o", main = "Time series plot of the first difference of Egg Depositions", ylab = "Millions")
Figure 3 - Time series plot of the first difference of Egg Depositions
The plot in Figure 3 shows the first difference of the series; however, it is hard to draw a conclusion about stationarity from the plot alone because of the length of the series.
Taking the second difference:
sec_diff <- diff(eggs_bc, differences = 2)
adf.test(sec_diff, k = 0)
## Warning in adf.test(sec_diff, k = 0): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: sec_diff
## Dickey-Fuller = -4.8698, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
The reported p-value for the standard DF test of the second difference is < 0.01, so we can reject \(H_0\): the second difference makes the series stationary and will be used for further analysis.
plot(sec_diff, type="o", main = "Time series plot of the second difference of Egg Depositions", ylab = "Millions")
Figure 4 - Time series plot of the second difference of Egg Depositions
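As a cross-check of the differencing order (an addition, assuming the forecast package is loaded), forecast::ndiffs() estimates the number of differences required using the same ADF test:
# Estimated number of differences for the transformed series
ndiffs(eggs_bc, test = "adf")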
Figure 5 shows the ACF and PACF plots for the second difference.
f(sec_diff)
Figure 5 - ACF and PACF for the second difference of Eggs Series
There are no significant lags in the ACF and one significant lag in the PACF in Figure 5, so we include the ARIMA(1,2,0) model in the set of candidate models.
The following code displays the EACF of the second difference. Because of the length of the series, we restrict the maximum numbers of AR and MA parameters:
eacf(sec_diff, ar.max = 2, ma.max = 2)
## AR/MA
## 0 1 2
## 0 o o o
## 1 o o o
## 2 o o o
The vertex of the EACF is at p = 0 and q = 0, which suggests a white noise series. Still, from here we can include the neighbouring models with p = 0, q = 1; p = 1, q = 0; and p = 1, q = 1. Correspondingly, the EACF contributes the candidate models ARIMA(1,2,0), ARIMA(0,2,1) and ARIMA(1,2,1).
The following code displays the BIC table for the second difference:
res <- armasubsets(sec_diff, nar=3, nma=3, ar.method = "ols")
plot(res)
Figure 6 - BIC table
From the BIC table in Figure 6, we can observe that the shaded columns correspond to AR(3), MA(2) and MA(3) terms. From here we include ARIMA(3,2,2) in the set of possible models; we do not include ARIMA(3,2,3) because it is a rather large model considering the length of our series.
Finally, the complete set of possible models for parameter estimation is {ARIMA(1,2,0), ARIMA(0,2,1), ARIMA(1,2,1), ARIMA(3,2,2)}.
Both maximum likelihood (ML) and conditional sum of squares (CSS, a least squares approach) estimation will be used to fit the models in our candidate set.
model_120_ml <- arima(eggs_bc, order = c(1,2,0), method = "ML")
model_120_css <- arima(eggs_bc, order = c(1,2,0), method = "CSS")
coeftest(model_120_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.38247 0.23461 -1.6302 0.1031
coeftest(model_120_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.40832 0.24441 -1.6706 0.0948 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The AR(1) coefficient was found to be insignificant at the 5% level by both methods.
model_021_ml <- arima(eggs_bc, order = c(0,2,1), method = "ML")
model_021_css <- arima(eggs_bc, order = c(0,2,1), method = "CSS")
coeftest(model_021_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.00000 0.48155 -2.0766 0.03784 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(model_021_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -1.086178 0.059217 -18.343 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The MA(1) coefficient was found to be significant at the 5% level by both methods.
model_121_ml <- arima(eggs_bc, order = c(1,2,1), method = "ML")
model_121_css <- arima(eggs_bc, order = c(1,2,1), method = "CSS")
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model_121_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.11146 0.26986 0.4130 0.67959
## ma1 -1.00000 0.30735 -3.2536 0.00114 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(model_121_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.4341337 0.0033778 128.53 < 2.2e-16 ***
## ma1 -1.9383524 0.0128774 -150.52 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The AR(1) coefficient is not significant at the 5% level according to the ML method, while the CSS method found all coefficients significant.
model_322_ml <- arima(eggs_bc, order = c(3,2,2), method = "ML")
model_322_css <- arima(eggs_bc, order = c(3,2,2), method = "CSS")
coeftest(model_322_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.22239 0.68892 0.3228 0.7468
## ar2 -0.34608 0.40149 -0.8620 0.3887
## ar3 -0.50546 0.31324 -1.6136 0.1066
## ma1 -1.04431 0.78780 -1.3256 0.1850
## ma2 0.56885 1.27510 0.4461 0.6555
coeftest(model_322_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.331185 0.309311 -1.0707 0.28430
## ar2 -0.352551 0.257240 -1.3705 0.17053
## ar3 -0.593116 0.288819 -2.0536 0.04002 *
## ma1 -0.375666 0.360589 -1.0418 0.29750
## ma2 0.010242 0.365960 0.0280 0.97767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ARIMA(3,2,2) is not well supported by either method: no coefficient is significant under ML, and only the AR(3) coefficient is significant under CSS.
The following code creates a residual.analysis function to run the residual checks in a consistent way. We check the residuals for randomness, normality and independence (absence of autocorrelation).
residual.analysis <- function(model, std = TRUE){
  # Use standardised residuals by default, raw residuals otherwise
  if (std == TRUE){
    res.model = rstandard(model)
  } else {
    res.model = residuals(model)
  }
  par(mfrow = c(3, 2))
  # Time series plot: look for randomness and absence of trend
  plot(res.model, type = 'o', ylab = 'Standardised residuals',
       main = "Time series plot of standardised residuals")
  abline(h = 0)
  # Histogram and QQ plot: look for deviations from normality
  hist(res.model, main = "Histogram of standardised residuals")
  qqnorm(res.model, main = "QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  # Sample ACF: look for leftover autocorrelation at individual lags
  acf(res.model, main = "ACF of standardised residuals")
  # Shapiro-Wilk test of normality of the residuals
  print(shapiro.test(res.model))
  # Ljung-Box p-values across lags (LBQPlot is from the FitAR package)
  k = 0
  LBQPlot(res.model, lag.max = length(model$residuals) - 1,
          StartLag = k + 1, k = 0, SquaredQ = FALSE)
  par(mfrow = c(1, 1))
}
Residual analysis for ARIMA(1,2,0):
residual.analysis(model_120_ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.95483, p-value = 0.5698
Figure 7 - Residuals plots for ARIMA(1,2,0)
First, the time series plot of the residuals shows no obvious trend; the residuals are smaller at the beginning but overall seem randomly distributed. There are no major deviations from normality: the Shapiro-Wilk p-value is > 0.05, the histogram looks roughly symmetric and the majority of points are aligned with the red line in the QQ plot. To check independence, we consider the sample ACF of the residuals, which shows no evidence of autocorrelation at individual lags. The Ljung-Box test provides an overall test of the residual correlations as a whole: several points fall below the 5% line, so there is evidence that the model is not successful in capturing the autocorrelation in the series, and it is not suitable for forecasting.
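The LBQPlot function comes from the FitAR package, which may not be available on current versions of CRAN. If it is unavailable, a similar plot can be built from stats::Box.test; this is a sketch, not part of the original analysis:
# Ljung-Box p-values across lags for the ARIMA(1,2,0) residuals.
# fitdf = 1 accounts for the one estimated AR coefficient, so lags
# start at 2 to keep the degrees of freedom positive.
res <- residuals(model_120_ml)
lags <- 2:10
lb_p <- sapply(lags, function(L)
  Box.test(res, lag = L, type = "Ljung-Box", fitdf = 1)$p.value)
plot(lags, lb_p, ylim = c(0, 1), xlab = "Lag", ylab = "p-value",
     main = "Ljung-Box test p-values")
abline(h = 0.05, lty = 2, col = "red")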
Residual analysis for ARIMA(0,2,1):
residual.analysis(model_021_ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.94253, p-value = 0.3812
Figure 8 - Residuals plots for ARIMA(0,2,1)
Residual analysis for ARIMA(1,2,1):
residual.analysis(model_121_ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.93404, p-value = 0.2822
Figure 9 - Residuals plots for ARIMA(1,2,1)
Residual analysis for ARIMA(3,2,2):
residual.analysis(model_322_ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.94539, p-value = 0.4204
Figure 10 - Residuals plots for ARIMA(3,2,2)
However, ARIMA(3,2,2) is a larger model whose components were found not significant at the 5% level, so we will go for the smaller models for forecasting.
For model selection we will compare the AIC and BIC values of the models to decide on the best in the set. The following code creates a function sort.score which sorts the models by their AIC or BIC values.
sort.score <- function(x, score = c("bic", "aic")){
  # Sort a table of information criteria (the output of AIC() or BIC())
  # in ascending order; match.arg() validates the score argument and
  # avoids comparing against the full default vector.
  score <- match.arg(score)
  if (score == "aic"){
    x[with(x, order(AIC)), ]
  } else {
    x[with(x, order(BIC)), ]
  }
}
sort.score(AIC(model_021_ml, model_120_ml, model_121_ml, model_322_ml), score = "aic")
## df AIC
## model_021_ml 2 3.452967
## model_121_ml 3 5.280205
## model_120_ml 2 7.409780
## model_322_ml 6 8.596308
sort.score(BIC(model_021_ml, model_120_ml, model_121_ml, model_322_ml), score = "bic")
## df BIC
## model_021_ml 2 4.731082
## model_121_ml 3 7.197377
## model_120_ml 2 8.687894
## model_322_ml 6 12.430652
Both AIC and BIC agree on the ARIMA(0,2,1) model, so we will continue the analysis with this model.
We can use overfitting as another tool to detect anomalies in the goodness of fit. The ARIMA(1,2,1) model adds an AR(1) component to our selected ARIMA(0,2,1) model; it was estimated above, and the AR(1) coefficient was not significant at the 5% level according to the ML method.
Another model to test is ARIMA(0,2,2), which adds an MA(2) component:
arima(eggs_bc, order = c(0,2,1))
##
## Call:
## arima(x = eggs_bc, order = c(0, 2, 1))
##
## Coefficients:
## ma1
## -1.0000
## s.e. 0.4815
##
## sigma^2 estimated as 0.0464: log likelihood = 0.27, aic = 1.45
arima(eggs_bc, order = c(0,2,2))
##
## Call:
## arima(x = eggs_bc, order = c(0, 2, 2))
##
## Coefficients:
## ma1 ma2
## -0.8877 -0.1121
## s.e. 0.4038 0.2543
##
## sigma^2 estimated as 0.04642: log likelihood = 0.36, aic = 3.27
The \(\theta_1\) estimate in ARIMA(0,2,2) is not significantly different from the \(\theta\) estimate in our initial model, and the MA(2) coefficient is not significantly different from 0. The AIC is also smaller for the initial model. These points support ARIMA(0,2,1) as a suitable model for forecasting.
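To make the claim about the MA(2) coefficient concrete, its z statistic can be computed directly from the estimate and standard error printed above:
z <- -0.1121 / 0.2543  # estimate / standard error, about -0.44
2 * pnorm(-abs(z))     # two-sided p-value, about 0.66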
After specifying the best model, we can proceed to forecast egg depositions for the next five years. The following code makes predictions 5 years ahead and plots them with their prediction intervals. In the Arima() call we specify the order of differencing and lambda = 0.5 so that the square root transformation is inverted when forecasting.
predict(model_021_ml, n.ahead = 5, newxreg = NULL, se.fit = TRUE)
## $pred
## Time Series:
## Start = 1997
## End = 2001
## Frequency = 1
## [1] 1.066077 1.120175 1.174274 1.228373 1.282471
##
## $se
## Time Series:
## Start = 1997
## End = 2001
## Frequency = 1
## [1] 0.2224791 0.3243162 0.4087202 0.4848820 0.5561978
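Note that predict() returns forecasts on the transformed (square-root) scale; squaring the point forecasts recovers the original scale in millions (a quick check, not part of the original output):
# Back-transform the point forecasts to the original scale
predict(model_021_ml, n.ahead = 5)$pred^2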
fit <- Arima(eggs_ts, c(0,2,1), lambda = .5)
plot(forecast(fit, h=5))
Figure 11 - Forecasts of Egg depositions for the next 5 years
To make predictions of yearly egg depositions for the next five years, ARIMA models were specified and fitted to our series. The main concern was ensuring the stationarity of the series considering its length. A square root transformation and second differencing of the time series were applied before specifying the set of possible models. The orders of the candidate models were selected using the sample ACF and PACF, the EACF and the BIC table. The parameters of the candidate models were estimated using maximum likelihood and least squares methods, diagnostic checks were performed on all candidate models, and larger models such as ARIMA(3,2,2) were eliminated. The best model was selected based on AIC and BIC values.
In conclusion, the model chosen for forecasting was ARIMA(0,2,1), a choice supported by residual analysis and overfitting checks. Finally, the square root transformation was inverted to produce predictions on the original scale with the chosen model.