rm(list = ls())
library(easypackages)
checkForPack = c("knitr","car","jsonlite","lattice","stringr","sp","tidyr","forecast", "dplyr",
            "readr","expsmooth","TSA","dynlm","Hmisc","car","AER","tseries","lmtest",
            "FitAR","fUnitRoots","FSAdata", "kableExtra")

checkNInst = checkForPack %in% rownames(installed.packages())
if ( any(!checkNInst) ) { install.packages( checkForPack[!checkNInst] ) }
libraries(checkForPack)

source('C:/Arjun/RMIT/7. Time Series/Functions/residual.analysis.R')
set.seed(738)

Problem

The purpose of this assignment is to analyze the Egg depositions(in millions) of Lake Huron Bloasters(Coregonus hoyi) of age-3 Lake Huron Bloaters between years 1981 and 1996. the task is to try all possible candidate models and find the best fitting model and give a 5-step ahead forecasts of egg depositions.

are available in BloaterLH dataset of FSAdata package.

Data Description

The data is available under BloaterLH dataset of FSAdata package. The data is for 16 years time period between1981 - 1996. The dataset is available in csv format as well.

data("BloaterLH")
bloaterData = BloaterLH
bloaterData = ts(bloaterData[,2], start = bloaterData[1,1], frequency = 1)

Descriptive Analysis

par(mar=c(5,4,4,2),cex.main=1, cex.lab=1, cex.axis=1)

plot(bloaterData,
     #axes = TRUE,
     ylab='Egg Depositions (in millions)',
     xlab='Year',
     type='o',
     col = c("orange"),
     lwd=2,
     main = "Time series plot of Egg Depositions of Bloaters (in millions)")

axis(side=1, at=c(1981:1996))
legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"),lwd=2,
       c("Egg Depositions (in Millions)"))

Fig 1.: Time Series Plot of Egg Depositions of Bloaters (in millions)

From the above time series plot(Fig 1.) we can comment on the fundamental concepts mentioned below:

Trend: Visible evidence of trend, signifying that there is increase in the Egg depositions of the Bloater species.
Changing variance: No apparent change in the variance
Seasonality: No repeating pattern seen, signifying no apparent indication of seasonality effect.This is expected as this is a yearly data set.
Intervention: No apparent Intervention point noticed. However there might be a intervention point in the year 1988 & 1990 but we need to research on the whether it was actually an intervention.

Let us also plot ACFand PACF for the dataset to infer about the Behaviour of the series

par(mar=c(5,4,4,2),mfrow=c(1,2),cex.main=0.75, cex.lab=0.75, cex.axis=0.75)

acf(bloaterData, main = "ACF plot for Egg Depositions")
pacf(bloaterData, main = "PACF plot for Egg Depositions")

Fig 2.: ACF and PACF Plots of Eggs Depositions series

Autoregressive behaviour or Moving Average: From the above time series plot(Fig 1.) & ACF & PACF plot(Fig 2.) we can infer that the data posses a Auto regressive behaviour due to the successive data points.

y = bloaterData
x = zlag(bloaterData)

excludeNAs = 2:length(x)

PearsonCor = cor(y[excludeNAs], x[excludeNAs])

cor = knitr::kable(PearsonCor, caption = "Correlation") %>%
  column_spec(1, bold = T, color = "white", background = "#FF4040")

cor

Correlation
x
0.7445657

The correlation factor is also supporting our evidence that there is high correlation between succeding data points. i.e. the egg depositions for Bloaters prior year impacts the next years depositions in millions.

Eventhough we notice much of a Autoregressive behaviour in our dataset it would be worth including Moving Average parameters while modelling as the series have only 16 data points. This takes us to the next step of modelling where we can try linear, Quadratic and ARIMA models.

Model Selection

Trend Models

Linear Trend Model

## 
## Call:
## lm(formula = bloaterData ~ time(bloaterData))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4048 -0.2768 -0.1933  0.2536  1.1857 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       -165.98275   49.58836  -3.347  0.00479 **
## time(bloaterData)    0.08387    0.02494   3.363  0.00464 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4598 on 14 degrees of freedom
## Multiple R-squared:  0.4469, Adjusted R-squared:  0.4074 
## F-statistic: 11.31 on 1 and 14 DF,  p-value: 0.004642

The linear trend model is significant with p-value 0.004[<0.005]. The coefficients and intercept terms are significant. Residual errors are small however, the R² values are pretty much low which questions whether the model is a good fit for this data set.

But before we fit the model to the time series data let us try Quadratic trend model as well

Quadratic Trend Model

## 
## Call:
## lm(formula = bloaterData ~ t + t2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50896 -0.25523 -0.02701  0.16615  0.96322 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -4.647e+04  2.141e+04  -2.170   0.0491 *
## t            4.665e+01  2.153e+01   2.166   0.0494 *
## t2          -1.171e-02  5.415e-03  -2.163   0.0498 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4092 on 13 degrees of freedom
## Multiple R-squared:  0.5932, Adjusted R-squared:  0.5306 
## F-statistic: 9.479 on 2 and 13 DF,  p-value: 0.00289

The quadratic trend model is significant with p-value 0.002[<0.005]. The coefficients[quadratic term included] and intercept terms are significant. Residual errors are small and we notice that the R² values have improved as well. Now let us fit these two trend models to see their fit on this data set. The below code shows the fits of the both models on the Bloater dataset.

Model Fitting

plot(ts(as.vector(bloaterData)),
     ylab='Egg Depositions (in millions)',
     xlab='Year',
     type='o',
     col = c("orange"),
     lwd = 2,
     main = "Fitted Linear Curve to the Bloater Egg Deposition data")

lines(as.vector(fitted(lmModel)), col="red", lwd = 2)
lines(as.vector(fitted(quadModel)), col="black", lwd = 2)

legend("topleft",lty=1, text.width = 20, col=c("orange","red","black"), bty = "n",
       c("Egg Deposition", "Fitted Linear Curve", "Fitted Quadratic Curve"))

Fig 3.: Fitted Curves to the Bloater Egg Deposition data

From Fig 3. we can also see that the models are not a that good a fit for the given data points. However, compared to the Quadratic models the linear model seem tofit well visually eventhough its R² values are much lower.

As we discussed in the visual inspection section we have to bring the fact into consideration that the time series has only 16 data points hence it might be worth not to rule out the Moving Average behaviour in the data set.

Hence, in the following section we will try multiple ARIMA models for the given time series data and plot for the fits to choose the best.

ARIMA Model

Before we proceed to the ARIMA Models we need to make the series stationary. We can perform an adf test to check whether our time series data is stationary or not.

ADF Test

Assumtions:
H₀: The given series is non-stationary
H_a: Otherwise; stationary

testAdf = adf.test(bloaterData, alternative = "stationary")
testAdf

## 
##  Augmented Dickey-Fuller Test
## 
## data:  bloaterData
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary

testAdf$p.value

## [1] 0.5468942

Interpretation:

With * 0.5468942. * > 0.05 we infer that there is no enough statistical evidence & we fail to reject the H₀:. Hence the series is non-stationary.

Box-Cox Tranformation

In order to make the time series stationary, we apply Box-Cox transformation and see if it helos in making the series stationary.

bloaterData.BC = BoxCox.ar(bloaterData)

Fig 4.: Box-Cox Transformations with optimal lambda values

bloaterData.BC$ci

## [1] 0.7 1.0

From the above (Fig 4.) the optimal values for \(\lambda\) are within the confidence intervals between 0.7 and 1. But the actual optimal value for \(\lambda\) is at 0 as the log likelihood shows it is optimal and hence we can say that the optimal \(/lambda\) value is between 0 and 1. So we pick 0.5 as the optimal \(\lambda\) value.

lambda = 0.5

bloaterData.tr = (bloaterData^lambda - 1)/lambda

par(mar=c(5,4,4,2),cex.main=0.75, cex.lab=0.75, cex.axis=0.75)

plot(bloaterData.tr,
     ylab='Egg Depositions (in millions)',
     xlab='Year',
     type='o',
     col = c("orange"),
     lwd = 2,
     main = "Time series plot of the transformed Bloater Egg Deposition data")

legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"), 
       c("Egg Depositions (in Millions)"))

Fig 5.: Time series plot of the transformed data

From the above (Fig 5.) we witness that the BoxCox tranformation for the lambda value does not change the property of the series that much, one reason may be because the optimal value is closer to 1.

As next step we can try differencing the series.

Differencing

bloaterData.diff = diff(bloaterData.tr, differences = 1)
testAdf1 = adf.test(bloaterData.diff, alternative = "stationary")
testAdf1

## 
##  Augmented Dickey-Fuller Test
## 
## data:  bloaterData.diff
## Dickey-Fuller = -3.6096, Lag order = 2, p-value = 0.04931
## alternative hypothesis: stationary

With * 0.0493111. * < 0.05 we reject the H₀:. Hence the transformed first differenced series is stationary now.

Model Selection - Possible candidate models

To find the possible candidate models, we have more than one methods to decide them, First let us plot for ACF & PACF to find the possible candidate models

plot(bloaterData.diff,
     ylab='Egg Depositions (in millions)',
     xlab='Year',
     type='o',
     col = c("orange"),
     lwd = 2,
     main = "Time series plot of the transformed differenced Bloater Egg Deposition data")

legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"), 
       c("Egg Depositions (in Millions)"))

Fig 6.: Time series plot of the transformed differenced series

par(mar=c(5,4,4,2),mfrow=c(1,2),cex.main=0.75, cex.lab=0.75, cex.axis=0.75)

acf(bloaterData.diff, main = "ACF plot for Egg Depositions")
pacf(bloaterData.diff, main = "PACF plot for Egg Depositions")

Fig 7.: ACF and PACF Plots of transformed differenced series

From (Fig 6.) & (Fig 7.); it is evident that the trend component is not present and hence differencing has made the series stationary. However we do not see any significant lags neither in ACF nor PACF which might be because of the changing variance after differencing we notice from the time series plot.

Let us try other methods to find the candidate model such as EACF - Extended Auto Correlation function & armasubsets.

eacf(bloaterData.diff, ar.max = 3, ma.max =3)

## AR/MA
##   0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o

armaSubset = armasubsets(y=bloaterData.diff,
                   nar=3,
                   nma=3,
                   y.name='test',
                   ar.method='ols')
plot(armaSubset)
     #main = "Model Selection")

title('BIC Table: Model Selection', line = 5)

Fig 8 : Model selection: BIC Table

Possible candidate models are as follows:-

ARIMA (0,1,1)
ARIMA (1,1,0)
ARIMA (1,1,1)
ARIMA (2,1,0)
ARIMA (2,1,1)
ARIMA (2,1,2)
ARIMA (3,1,0)
ARIMA (3,1,1)
ARIMA (3,1,2)

Model fitting

ARIMA(0,1,1)

model_011_css = arima(bloaterData.tr,order=c(0,1,1),method='CSS')
coeftest(model_011_css)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ma1  0.11073    0.24969  0.4435   0.6574

model_011_ml = arima(bloaterData.tr,order=c(0,1,1),method='ML')
coeftest(model_011_ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ma1  0.10424    0.24358  0.4279   0.6687

ARIMA(1,1,0)

model_110_css = arima(bloaterData.tr,order=c(1,1,0),method='CSS')
coeftest(model_110_css)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1  0.10525    0.25639  0.4105   0.6814

model_110_ml = arima(bloaterData.tr,order=c(1,1,0),method='ML')
coeftest(model_110_ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1  0.09895    0.24880  0.3977   0.6908

ARIMA(1,1,1)

model_111_css = arima(bloaterData.tr,order=c(1,1,1),method='CSS')
coeftest(model_111_css)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1 0.075038   0.930129  0.0807   0.9357
## ma1 0.032822   0.958494  0.0342   0.9727

model_111_ml = arima(bloaterData.tr,order=c(1,1,1),method='ML')
coeftest(model_111_ml)

## 
## z test of coefficients:
## 
##       Estimate Std. Error z value Pr(>|z|)
## ar1 -0.0088178  1.0057864 -0.0088   0.9930
## ma1  0.1124288  0.9735777  0.1155   0.9081

ARIMA(2,1,0)

model_210_css = arima(bloaterData.tr,order=c(2,1,0),method='CSS')
coeftest(model_210_css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.107408   0.255575  0.4203   0.6743
## ar2 -0.080573   0.256663 -0.3139   0.7536

model_210_ml = arima(bloaterData.tr,order=c(2,1,0),method='ML')
coeftest(model_210_ml)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.106498   0.250366  0.4254   0.6706
## ar2 -0.072083   0.244712 -0.2946   0.7683

ARIMA(2,1,1)

model_211_css = arima(bloaterData.tr,order=c(2,1,1),method='CSS')
coeftest(model_211_css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value Pr(>|z|)    
## ar1  1.064074   0.023735  44.8314   <2e-16 ***
## ar2  0.046656   0.051237   0.9106   0.3625    
## ma1 -1.736208   0.079414 -21.8628   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_211_ml = arima(bloaterData.tr,order=c(2,1,1),method='ML')
coeftest(model_211_ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1  0.55226    0.67390  0.8195   0.4125
## ar2 -0.17720    0.24423 -0.7255   0.4681
## ma1 -0.45246    0.65222 -0.6937   0.4879

ARIMA(2,1,2)

model_212_css = arima(bloaterData.tr,order=c(2,1,2),method='CSS')
coeftest(model_212_css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.399526   0.051183  7.8058 5.913e-15 ***
## ar2  0.960428   0.121759  7.8879 3.072e-15 ***
## ma1 -1.114982   0.342314 -3.2572  0.001125 ** 
## ma2 -1.139801   0.525849 -2.1675  0.030193 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_212_ml = arima(bloaterData.tr,order=c(2,1,2),method='ML')
coeftest(model_212_ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error  z value  Pr(>|z|)    
## ar1  0.88833    0.31889   2.7857  0.005342 ** 
## ar2 -0.99851    0.01820 -54.8637 < 2.2e-16 ***
## ma1 -0.91998    0.40354  -2.2798  0.022622 *  
## ma2  0.99530         NA       NA        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ARIMA(3,1,0)

model_310_css = arima(bloaterData.tr,order=c(3,1,0),method='CSS')
coeftest(model_310_css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.072525   0.244775  0.2963   0.7670
## ar2 -0.051084   0.244967 -0.2085   0.8348
## ar3 -0.305571   0.244344 -1.2506   0.2111

model_310_ml = arima(bloaterData.tr,order=c(3,1,0),method='ML')
coeftest(model_310_ml)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.082833   0.238029  0.3480   0.7278
## ar2 -0.046516   0.233081 -0.1996   0.8418
## ar3 -0.260505   0.226622 -1.1495   0.2503

ARIMA(3,1,1)

model_311_css = arima(bloaterData.tr,order=c(3,1,1),method='CSS')
coeftest(model_311_css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.196688   0.497225  0.3956   0.6924
## ar2 -0.066782   0.256388 -0.2605   0.7945
## ar3 -0.278693   0.277251 -1.0052   0.3148
## ma1 -0.146421   0.500625 -0.2925   0.7699

model_311_ml = arima(bloaterData.tr,order=c(3,1,1),method='ML')
coeftest(model_311_ml)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.068907   0.536409  0.1285   0.8978
## ar2 -0.044973   0.238886 -0.1883   0.8507
## ar3 -0.261481   0.228515 -1.1443   0.2525
## ma1  0.015399   0.532599  0.0289   0.9769

ARIMA(3,1,3)

model_313_css = arima(bloaterData.tr,order=c(3,1,3),method='CSS')
coeftest(model_313_css)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.11242    0.18966  0.5928  0.553342    
## ar2  0.77153    0.26142  2.9514  0.003164 ** 
## ar3  0.84955    0.30361  2.7982  0.005139 ** 
## ma1 -0.57250    0.14589 -3.9243 8.699e-05 ***
## ma2 -1.23955    0.26947 -4.5999 4.226e-06 ***
## ma3 -2.06809    0.31578 -6.5492 5.783e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_313_ml = arima(bloaterData.tr,order=c(3,1,3),method='ML')
coeftest(model_313_ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1  0.28669    0.94699  0.3027   0.7621
## ar2 -0.67513    0.55025 -1.2269   0.2198
## ar3 -0.42075    0.71567 -0.5879   0.5566
## ma1 -0.33089    1.21308 -0.2728   0.7850
## ma2  0.85258    0.81080  1.0515   0.2930
## ma3  0.25260    0.84456  0.2991   0.7649

Residual Analysis

We will perform Residual Analysis check on the chosen good models

Linear Trend Model

residual.analysis(lmModel, check = 1)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.83416, p-value = 0.008034

Fig 9.: Residual Analysis of Linear Trend Model

From the Shapiro Wilk test for normality we reject the null hypothesis as the p-value < 0.05 and confirm that the residuals is not normally distributed. Also from (Fig 9.),we infer that the historgram of the residual plot shows that they are skewed, with ACF & PACF plot show first significant lag.

Quadratic Trend Model

residual.analysis(quadModel, check = 1)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.9367, p-value = 0.3106

Fig 10.: Residual Analysis of Quadratic Trend Model

From the Shapiro Wilk test for normality we fail to reject the null hypothesis as the p-value > 0.05 and confirm that the residuals is normally distributed. Also from (Fig 10.),we infer that the historgram of the residual plot shows that they are slightly skewed, with ACF & PACF plot does not have any significant lag(s).

ARIMA212

residual.analysis(model_212_css,check = 0) #added a small criteria to the function to skip                                                 #Shapiro Wilk Test

## [1] "Shapiro Wilk Test of Normality is not necessary for css models"

Fig 11.: Residual Analysis of ARIMA212 Model

from (Fig 11.),we infer that the historgram of the residual plot shows that they are not normal, with ACF & PACF plot does not have any significant lag(s). Also the L-Jung Box Test shows that the model has some significant p values.

ARIMA313

residual.analysis(model_313_css,check = 0) #added a small criteria to the function to skip                                                 #Shapiro Wilk Test

## [1] "Shapiro Wilk Test of Normality is not necessary for css models"

Fig 12.: Residual Analysis of ARIMA311 Model

from (Fig 12.),we infer that the historgram of the residual plot shows that they are not normal, with ACF & PACF plot does not have any significant lag(s). Also the L-Jung Box Test shows that the model has some significant p values.

Comparitively we have ARIMA(2,1,2) as a good fit as it seems to have good residual analysis results.

Forecasting

fit = Arima(bloaterData, model = model_212_css, lambda=0.5)
bloaterDataForecast = forecast(fit,h=5)
knitr::kable(bloaterDataForecast, 
             caption = "5-Step ahead Forecast for Bloater Egg Depositions Series")

5-Step ahead Forecast for Bloater Egg Depositions Series
	Point Forecast	Lo 80	Hi 80	Lo 95	Hi 95
1997	0.9929115	0.3839321	1.885888	0.1765197	2.473563
1998	1.0149084	0.1120255	2.822908	-0.0004587	4.146395
1999	0.9938055	-0.0079927	4.339713	-0.4414995	7.066294
2000	1.0064789	-0.3301828	6.661994	-1.9877575	11.671427
2001	0.9913263	-1.4701063	10.264250	-5.6710397	19.120522

par(mar=c(5,4,4,2),cex.main=1, cex.lab=1, cex.axis=1)

plot(bloaterDataForecast, 
     col = c("orange"), 
     type="o",
     main = "Bloater Egg Depositions with 5 year Forecasts",
     ylab = "Egg Depostions (in Millions)",
     xlab = "Year",
     lwd = 2,
     ylim = c(0,3))

legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"), 
       c("Egg Depositions (in Millions)"))

Fig 13.: Original series with 5 step ahead forecast

From (Fig 13.), we can notice a large variance in the forecast which might be because of the short series we have used and other limitations of restricting to the analysis performed.

Conclusion

The model that the ARIMA(2,1,2) model fits the series the best with significant co-efficients.

References

Rob J Hyndman. 2018. Time Series. [ONLINE] Available at: https://robjhyndman.com/categories/time-series/. [Accessed 06 May 2018].
H., R., 2017. Time Series Analysis and Its Applications: With R Examples (Springer Texts in Statistics). Springer.
D., J., 2009. Time Series Analysis. Springer-Verlag New York, LLC.

Bloater Fish Egg Depositions

MATH1318 Time Series Analysis- Assignment 2

Arjun Balaji (s3629999)

6 May 2018

Problem

Data Description

Descriptive Analysis

Model Selection

Trend Models

Linear Trend Model

Quadratic Trend Model

Model Fitting

ARIMA Model

ADF Test

Box-Cox Tranformation

Differencing

Model Selection - Possible candidate models

Model fitting

ARIMA(0,1,1)

ARIMA(1,1,0)

ARIMA(1,1,1)

ARIMA(2,1,0)

ARIMA(2,1,1)

ARIMA(2,1,2)

ARIMA(3,1,0)

ARIMA(3,1,1)

ARIMA(3,1,3)

Residual Analysis

Linear Trend Model

Quadratic Trend Model

ARIMA212

ARIMA313

Forecasting

Conclusion

References