rm(list = ls())
library(easypackages)
checkForPack = c("knitr","car","jsonlite","lattice","stringr","sp","tidyr","forecast", "dplyr",
            "readr","expsmooth","TSA","dynlm","Hmisc","car","AER","tseries","lmtest",
            "FitAR","fUnitRoots","FSAdata", "kableExtra")

checkNInst = checkForPack %in% rownames(installed.packages())
if ( any(!checkNInst) ) { install.packages( checkForPack[!checkNInst] ) }
libraries(checkForPack)

source('C:/Arjun/RMIT/7. Time Series/Functions/residual.analysis.R')
set.seed(738)

Problem

The purpose of this assignment is to analyze the Egg depositions(in millions) of Lake Huron Bloasters(Coregonus hoyi) of age-3 Lake Huron Bloaters between years 1981 and 1996. the task is to try all possible candidate models and find the best fitting model and give a 5-step ahead forecasts of egg depositions.

are available in BloaterLH dataset of FSAdata package.

Data Description

The data is available under BloaterLH dataset of FSAdata package. The data is for 16 years time period between1981 - 1996. The dataset is available in csv format as well.

data("BloaterLH")
bloaterData = BloaterLH
bloaterData = ts(bloaterData[,2], start = bloaterData[1,1], frequency = 1)

Descriptive Analysis

par(mar=c(5,4,4,2),cex.main=1, cex.lab=1, cex.axis=1)

plot(bloaterData,
     #axes = TRUE,
     ylab='Egg Depositions (in millions)',
     xlab='Year',
     type='o',
     col = c("orange"),
     lwd=2,
     main = "Time series plot of Egg Depositions of Bloaters (in millions)")

axis(side=1, at=c(1981:1996))
legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"),lwd=2,
       c("Egg Depositions (in Millions)"))
Fig 1.: Time Series Plot of Egg Depositions of Bloaters (in millions)

Fig 1.: Time Series Plot of Egg Depositions of Bloaters (in millions)

From the above time series plot(Fig 1.) we can comment on the fundamental concepts mentioned below:

Let us also plot ACFand PACF for the dataset to infer about the Behaviour of the series

par(mar=c(5,4,4,2),mfrow=c(1,2),cex.main=0.75, cex.lab=0.75, cex.axis=0.75)

acf(bloaterData, main = "ACF plot for Egg Depositions")
pacf(bloaterData, main = "PACF plot for Egg Depositions")
Fig 2.: ACF and PACF Plots of Eggs Depositions series

Fig 2.: ACF and PACF Plots of Eggs Depositions series

y = bloaterData
x = zlag(bloaterData)

excludeNAs = 2:length(x)

PearsonCor = cor(y[excludeNAs], x[excludeNAs])

cor = knitr::kable(PearsonCor, caption = "Correlation") %>%
  column_spec(1, bold = T, color = "white", background = "#FF4040") 
cor
Correlation
x
0.7445657

Eventhough we notice much of a Autoregressive behaviour in our dataset it would be worth including Moving Average parameters while modelling as the series have only 16 data points. This takes us to the next step of modelling where we can try linear, Quadratic and ARIMA models.

Model Selection

Trend Models

Linear Trend Model

## 
## Call:
## lm(formula = bloaterData ~ time(bloaterData))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4048 -0.2768 -0.1933  0.2536  1.1857 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       -165.98275   49.58836  -3.347  0.00479 **
## time(bloaterData)    0.08387    0.02494   3.363  0.00464 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4598 on 14 degrees of freedom
## Multiple R-squared:  0.4469, Adjusted R-squared:  0.4074 
## F-statistic: 11.31 on 1 and 14 DF,  p-value: 0.004642

The linear trend model is significant with p-value 0.004[<0.005]. The coefficients and intercept terms are significant. Residual errors are small however, the R2 values are pretty much low which questions whether the model is a good fit for this data set.

But before we fit the model to the time series data let us try Quadratic trend model as well

Quadratic Trend Model

## 
## Call:
## lm(formula = bloaterData ~ t + t2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50896 -0.25523 -0.02701  0.16615  0.96322 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -4.647e+04  2.141e+04  -2.170   0.0491 *
## t            4.665e+01  2.153e+01   2.166   0.0494 *
## t2          -1.171e-02  5.415e-03  -2.163   0.0498 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4092 on 13 degrees of freedom
## Multiple R-squared:  0.5932, Adjusted R-squared:  0.5306 
## F-statistic: 9.479 on 2 and 13 DF,  p-value: 0.00289

The quadratic trend model is significant with p-value 0.002[<0.005]. The coefficients[quadratic term included] and intercept terms are significant. Residual errors are small and we notice that the R2 values have improved as well. Now let us fit these two trend models to see their fit on this data set. The below code shows the fits of the both models on the Bloater dataset.

Model Fitting

plot(ts(as.vector(bloaterData)),
     ylab='Egg Depositions (in millions)',
     xlab='Year',
     type='o',
     col = c("orange"),
     lwd = 2,
     main = "Fitted Linear Curve to the Bloater Egg Deposition data")

lines(as.vector(fitted(lmModel)), col="red", lwd = 2)
lines(as.vector(fitted(quadModel)), col="black", lwd = 2)

legend("topleft",lty=1, text.width = 20, col=c("orange","red","black"), bty = "n",
       c("Egg Deposition", "Fitted Linear Curve", "Fitted Quadratic Curve"))
Fig 3.: Fitted Curves to the Bloater Egg Deposition data

Fig 3.: Fitted Curves to the Bloater Egg Deposition data

From Fig 3. we can also see that the models are not a that good a fit for the given data points. However, compared to the Quadratic models the linear model seem tofit well visually eventhough its R2 values are much lower.

As we discussed in the visual inspection section we have to bring the fact into consideration that the time series has only 16 data points hence it might be worth not to rule out the Moving Average behaviour in the data set.

Hence, in the following section we will try multiple ARIMA models for the given time series data and plot for the fits to choose the best.

ARIMA Model

Before we proceed to the ARIMA Models we need to make the series stationary. We can perform an adf test to check whether our time series data is stationary or not.

ADF Test

  • Assumtions:
  • H0: The given series is non-stationary
  • Ha: Otherwise; stationary
testAdf = adf.test(bloaterData, alternative = "stationary")
testAdf
## 
##  Augmented Dickey-Fuller Test
## 
## data:  bloaterData
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary
testAdf$p.value
## [1] 0.5468942
  • Interpretation:

With * 0.5468942. * > 0.05 we infer that there is no enough statistical evidence & we fail to reject the H0:. Hence the series is non-stationary.

Box-Cox Tranformation

In order to make the time series stationary, we apply Box-Cox transformation and see if it helos in making the series stationary.

bloaterData.BC = BoxCox.ar(bloaterData)
Fig 4.: Box-Cox Transformations with optimal lambda values

Fig 4.: Box-Cox Transformations with optimal lambda values

bloaterData.BC$ci
## [1] 0.7 1.0

From the above (Fig 4.) the optimal values for \(\lambda\) are within the confidence intervals between 0.7 and 1. But the actual optimal value for \(\lambda\) is at 0 as the log likelihood shows it is optimal and hence we can say that the optimal \(/lambda\) value is between 0 and 1. So we pick 0.5 as the optimal \(\lambda\) value.

lambda = 0.5

bloaterData.tr = (bloaterData^lambda - 1)/lambda

par(mar=c(5,4,4,2),cex.main=0.75, cex.lab=0.75, cex.axis=0.75)

plot(bloaterData.tr,
     ylab='Egg Depositions (in millions)',
     xlab='Year',
     type='o',
     col = c("orange"),
     lwd = 2,
     main = "Time series plot of the transformed Bloater Egg Deposition data")

legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"), 
       c("Egg Depositions (in Millions)"))
Fig 5.: Time series plot of the transformed data

Fig 5.: Time series plot of the transformed data

From the above (Fig 5.) we witness that the BoxCox tranformation for the lambda value does not change the property of the series that much, one reason may be because the optimal value is closer to 1.

As next step we can try differencing the series.

Differencing

bloaterData.diff = diff(bloaterData.tr, differences = 1)
testAdf1 = adf.test(bloaterData.diff, alternative = "stationary")
testAdf1
## 
##  Augmented Dickey-Fuller Test
## 
## data:  bloaterData.diff
## Dickey-Fuller = -3.6096, Lag order = 2, p-value = 0.04931
## alternative hypothesis: stationary

With * 0.0493111. * < 0.05 we reject the H0:. Hence the transformed first differenced series is stationary now.

Model Selection - Possible candidate models

To find the possible candidate models, we have more than one methods to decide them, First let us plot for ACF & PACF to find the possible candidate models

plot(bloaterData.diff,
     ylab='Egg Depositions (in millions)',
     xlab='Year',
     type='o',
     col = c("orange"),
     lwd = 2,
     main = "Time series plot of the transformed differenced Bloater Egg Deposition data")

legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"), 
       c("Egg Depositions (in Millions)"))
Fig 6.: Time series plot of the transformed differenced series

Fig 6.: Time series plot of the transformed differenced series

par(mar=c(5,4,4,2),mfrow=c(1,2),cex.main=0.75, cex.lab=0.75, cex.axis=0.75)

acf(bloaterData.diff, main = "ACF plot for Egg Depositions")
pacf(bloaterData.diff, main = "PACF plot for Egg Depositions")
Fig 7.: ACF and PACF Plots of transformed differenced series

Fig 7.: ACF and PACF Plots of transformed differenced series

From (Fig 6.) & (Fig 7.); it is evident that the trend component is not present and hence differencing has made the series stationary. However we do not see any significant lags neither in ACF nor PACF which might be because of the changing variance after differencing we notice from the time series plot.

Let us try other methods to find the candidate model such as EACF - Extended Auto Correlation function & armasubsets.

eacf(bloaterData.diff, ar.max = 3, ma.max =3)
## AR/MA
##   0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o
armaSubset = armasubsets(y=bloaterData.diff,
                   nar=3,
                   nma=3,
                   y.name='test',
                   ar.method='ols')
plot(armaSubset)
     #main = "Model Selection")

title('BIC Table: Model Selection', line = 5)
Fig 8 : Model selection: BIC Table

Fig 8 : Model selection: BIC Table

Possible candidate models are as follows:-

  • ARIMA (0,1,1)
  • ARIMA (1,1,0)
  • ARIMA (1,1,1)
  • ARIMA (2,1,0)
  • ARIMA (2,1,1)
  • ARIMA (2,1,2)
  • ARIMA (3,1,0)
  • ARIMA (3,1,1)
  • ARIMA (3,1,2)

Model fitting

ARIMA(0,1,1)

model_011_css = arima(bloaterData.tr,order=c(0,1,1),method='CSS')
coeftest(model_011_css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ma1  0.11073    0.24969  0.4435   0.6574
model_011_ml = arima(bloaterData.tr,order=c(0,1,1),method='ML')
coeftest(model_011_ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ma1  0.10424    0.24358  0.4279   0.6687

ARIMA(1,1,0)

model_110_css = arima(bloaterData.tr,order=c(1,1,0),method='CSS')
coeftest(model_110_css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1  0.10525    0.25639  0.4105   0.6814
model_110_ml = arima(bloaterData.tr,order=c(1,1,0),method='ML')
coeftest(model_110_ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1  0.09895    0.24880  0.3977   0.6908

ARIMA(1,1,1)

model_111_css = arima(bloaterData.tr,order=c(1,1,1),method='CSS')
coeftest(model_111_css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1 0.075038   0.930129  0.0807   0.9357
## ma1 0.032822   0.958494  0.0342   0.9727
model_111_ml = arima(bloaterData.tr,order=c(1,1,1),method='ML')
coeftest(model_111_ml)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value Pr(>|z|)
## ar1 -0.0088178  1.0057864 -0.0088   0.9930
## ma1  0.1124288  0.9735777  0.1155   0.9081

ARIMA(2,1,0)

model_210_css = arima(bloaterData.tr,order=c(2,1,0),method='CSS')
coeftest(model_210_css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.107408   0.255575  0.4203   0.6743
## ar2 -0.080573   0.256663 -0.3139   0.7536
model_210_ml = arima(bloaterData.tr,order=c(2,1,0),method='ML')
coeftest(model_210_ml)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.106498   0.250366  0.4254   0.6706
## ar2 -0.072083   0.244712 -0.2946   0.7683

ARIMA(2,1,1)

model_211_css = arima(bloaterData.tr,order=c(2,1,1),method='CSS')
coeftest(model_211_css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value Pr(>|z|)    
## ar1  1.064074   0.023735  44.8314   <2e-16 ***
## ar2  0.046656   0.051237   0.9106   0.3625    
## ma1 -1.736208   0.079414 -21.8628   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_211_ml = arima(bloaterData.tr,order=c(2,1,1),method='ML')
coeftest(model_211_ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1  0.55226    0.67390  0.8195   0.4125
## ar2 -0.17720    0.24423 -0.7255   0.4681
## ma1 -0.45246    0.65222 -0.6937   0.4879

ARIMA(2,1,2)

model_212_css = arima(bloaterData.tr,order=c(2,1,2),method='CSS')
coeftest(model_212_css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.399526   0.051183  7.8058 5.913e-15 ***
## ar2  0.960428   0.121759  7.8879 3.072e-15 ***
## ma1 -1.114982   0.342314 -3.2572  0.001125 ** 
## ma2 -1.139801   0.525849 -2.1675  0.030193 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_212_ml = arima(bloaterData.tr,order=c(2,1,2),method='ML')
coeftest(model_212_ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error  z value  Pr(>|z|)    
## ar1  0.88833    0.31889   2.7857  0.005342 ** 
## ar2 -0.99851    0.01820 -54.8637 < 2.2e-16 ***
## ma1 -0.91998    0.40354  -2.2798  0.022622 *  
## ma2  0.99530         NA       NA        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ARIMA(3,1,0)

model_310_css = arima(bloaterData.tr,order=c(3,1,0),method='CSS')
coeftest(model_310_css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.072525   0.244775  0.2963   0.7670
## ar2 -0.051084   0.244967 -0.2085   0.8348
## ar3 -0.305571   0.244344 -1.2506   0.2111
model_310_ml = arima(bloaterData.tr,order=c(3,1,0),method='ML')
coeftest(model_310_ml)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.082833   0.238029  0.3480   0.7278
## ar2 -0.046516   0.233081 -0.1996   0.8418
## ar3 -0.260505   0.226622 -1.1495   0.2503

ARIMA(3,1,1)

model_311_css = arima(bloaterData.tr,order=c(3,1,1),method='CSS')
coeftest(model_311_css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.196688   0.497225  0.3956   0.6924
## ar2 -0.066782   0.256388 -0.2605   0.7945
## ar3 -0.278693   0.277251 -1.0052   0.3148
## ma1 -0.146421   0.500625 -0.2925   0.7699
model_311_ml = arima(bloaterData.tr,order=c(3,1,1),method='ML')
coeftest(model_311_ml)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.068907   0.536409  0.1285   0.8978
## ar2 -0.044973   0.238886 -0.1883   0.8507
## ar3 -0.261481   0.228515 -1.1443   0.2525
## ma1  0.015399   0.532599  0.0289   0.9769

ARIMA(3,1,3)

model_313_css = arima(bloaterData.tr,order=c(3,1,3),method='CSS')
coeftest(model_313_css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.11242    0.18966  0.5928  0.553342    
## ar2  0.77153    0.26142  2.9514  0.003164 ** 
## ar3  0.84955    0.30361  2.7982  0.005139 ** 
## ma1 -0.57250    0.14589 -3.9243 8.699e-05 ***
## ma2 -1.23955    0.26947 -4.5999 4.226e-06 ***
## ma3 -2.06809    0.31578 -6.5492 5.783e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_313_ml = arima(bloaterData.tr,order=c(3,1,3),method='ML')
coeftest(model_313_ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1  0.28669    0.94699  0.3027   0.7621
## ar2 -0.67513    0.55025 -1.2269   0.2198
## ar3 -0.42075    0.71567 -0.5879   0.5566
## ma1 -0.33089    1.21308 -0.2728   0.7850
## ma2  0.85258    0.81080  1.0515   0.2930
## ma3  0.25260    0.84456  0.2991   0.7649

Residual Analysis

We will perform Residual Analysis check on the chosen good models

Linear Trend Model

residual.analysis(lmModel, check = 1)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.83416, p-value = 0.008034
Fig 9.: Residual Analysis of Linear Trend Model

Fig 9.: Residual Analysis of Linear Trend Model

From the Shapiro Wilk test for normality we reject the null hypothesis as the p-value < 0.05 and confirm that the residuals is not normally distributed. Also from (Fig 9.),we infer that the historgram of the residual plot shows that they are skewed, with ACF & PACF plot show first significant lag.

Quadratic Trend Model

residual.analysis(quadModel, check = 1)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.9367, p-value = 0.3106
Fig 10.: Residual Analysis of Quadratic Trend Model

Fig 10.: Residual Analysis of Quadratic Trend Model

From the Shapiro Wilk test for normality we fail to reject the null hypothesis as the p-value > 0.05 and confirm that the residuals is normally distributed. Also from (Fig 10.),we infer that the historgram of the residual plot shows that they are slightly skewed, with ACF & PACF plot does not have any significant lag(s).

ARIMA212

residual.analysis(model_212_css,check = 0) #added a small criteria to the function to skip                                                 #Shapiro Wilk Test
## [1] "Shapiro Wilk Test of Normality is not necessary for css models"
Fig 11.: Residual Analysis of ARIMA212 Model

Fig 11.: Residual Analysis of ARIMA212 Model

from (Fig 11.),we infer that the historgram of the residual plot shows that they are not normal, with ACF & PACF plot does not have any significant lag(s). Also the L-Jung Box Test shows that the model has some significant p values.

ARIMA313

residual.analysis(model_313_css,check = 0) #added a small criteria to the function to skip                                                 #Shapiro Wilk Test
## [1] "Shapiro Wilk Test of Normality is not necessary for css models"
Fig 12.: Residual Analysis of ARIMA311 Model

Fig 12.: Residual Analysis of ARIMA311 Model

from (Fig 12.),we infer that the historgram of the residual plot shows that they are not normal, with ACF & PACF plot does not have any significant lag(s). Also the L-Jung Box Test shows that the model has some significant p values.

Comparitively we have ARIMA(2,1,2) as a good fit as it seems to have good residual analysis results.

Forecasting

fit = Arima(bloaterData, model = model_212_css, lambda=0.5)
bloaterDataForecast = forecast(fit,h=5)
knitr::kable(bloaterDataForecast, 
             caption = "5-Step ahead Forecast for Bloater Egg Depositions Series")
5-Step ahead Forecast for Bloater Egg Depositions Series
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
1997 0.9929115 0.3839321 1.885888 0.1765197 2.473563
1998 1.0149084 0.1120255 2.822908 -0.0004587 4.146395
1999 0.9938055 -0.0079927 4.339713 -0.4414995 7.066294
2000 1.0064789 -0.3301828 6.661994 -1.9877575 11.671427
2001 0.9913263 -1.4701063 10.264250 -5.6710397 19.120522
par(mar=c(5,4,4,2),cex.main=1, cex.lab=1, cex.axis=1)

plot(bloaterDataForecast, 
     col = c("orange"), 
     type="o",
     main = "Bloater Egg Depositions with 5 year Forecasts",
     ylab = "Egg Depostions (in Millions)",
     xlab = "Year",
     lwd = 2,
     ylim = c(0,3))

legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"), 
       c("Egg Depositions (in Millions)"))
Fig 13.: Original series with 5 step ahead forecast

Fig 13.: Original series with 5 step ahead forecast

From (Fig 13.), we can notice a large variance in the forecast which might be because of the short series we have used and other limitations of restricting to the analysis performed.

Conclusion

The model that the ARIMA(2,1,2) model fits the series the best with significant co-efficients.

References