rm(list = ls())
library(easypackages)
checkForPack = c("knitr","car","jsonlite","lattice","stringr","sp","tidyr","forecast", "dplyr",
"readr","expsmooth","TSA","dynlm","Hmisc","car","AER","tseries","lmtest",
"FitAR","fUnitRoots","FSAdata", "kableExtra")
checkNInst = checkForPack %in% rownames(installed.packages())
if ( any(!checkNInst) ) { install.packages( checkForPack[!checkNInst] ) }
libraries(checkForPack)
source('C:/Arjun/RMIT/7. Time Series/Functions/residual.analysis.R')
set.seed(738)The purpose of this assignment is to analyze the Egg depositions(in millions) of Lake Huron Bloasters(Coregonus hoyi) of age-3 Lake Huron Bloaters between years 1981 and 1996. the task is to try all possible candidate models and find the best fitting model and give a 5-step ahead forecasts of egg depositions.
are available in BloaterLH dataset of FSAdata package.
The data is available under BloaterLH dataset of FSAdata package. The data is for 16 years time period between1981 - 1996. The dataset is available in csv format as well.
data("BloaterLH")
bloaterData = BloaterLH
bloaterData = ts(bloaterData[,2], start = bloaterData[1,1], frequency = 1)par(mar=c(5,4,4,2),cex.main=1, cex.lab=1, cex.axis=1)
plot(bloaterData,
#axes = TRUE,
ylab='Egg Depositions (in millions)',
xlab='Year',
type='o',
col = c("orange"),
lwd=2,
main = "Time series plot of Egg Depositions of Bloaters (in millions)")
axis(side=1, at=c(1981:1996))
legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"),lwd=2,
c("Egg Depositions (in Millions)"))Fig 1.: Time Series Plot of Egg Depositions of Bloaters (in millions)
From the above time series plot(Fig 1.) we can comment on the fundamental concepts mentioned below:
Let us also plot ACFand PACF for the dataset to infer about the Behaviour of the series
par(mar=c(5,4,4,2),mfrow=c(1,2),cex.main=0.75, cex.lab=0.75, cex.axis=0.75)
acf(bloaterData, main = "ACF plot for Egg Depositions")
pacf(bloaterData, main = "PACF plot for Egg Depositions")Fig 2.: ACF and PACF Plots of Eggs Depositions series
y = bloaterData
x = zlag(bloaterData)
excludeNAs = 2:length(x)
PearsonCor = cor(y[excludeNAs], x[excludeNAs])
cor = knitr::kable(PearsonCor, caption = "Correlation") %>%
column_spec(1, bold = T, color = "white", background = "#FF4040") cor| x |
|---|
| 0.7445657 |
Eventhough we notice much of a Autoregressive behaviour in our dataset it would be worth including Moving Average parameters while modelling as the series have only 16 data points. This takes us to the next step of modelling where we can try linear, Quadratic and ARIMA models.
##
## Call:
## lm(formula = bloaterData ~ time(bloaterData))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4048 -0.2768 -0.1933 0.2536 1.1857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -165.98275 49.58836 -3.347 0.00479 **
## time(bloaterData) 0.08387 0.02494 3.363 0.00464 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4598 on 14 degrees of freedom
## Multiple R-squared: 0.4469, Adjusted R-squared: 0.4074
## F-statistic: 11.31 on 1 and 14 DF, p-value: 0.004642
The linear trend model is significant with p-value 0.004[<0.005]. The coefficients and intercept terms are significant. Residual errors are small however, the R2 values are pretty much low which questions whether the model is a good fit for this data set.
But before we fit the model to the time series data let us try Quadratic trend model as well
##
## Call:
## lm(formula = bloaterData ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50896 -0.25523 -0.02701 0.16615 0.96322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.647e+04 2.141e+04 -2.170 0.0491 *
## t 4.665e+01 2.153e+01 2.166 0.0494 *
## t2 -1.171e-02 5.415e-03 -2.163 0.0498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4092 on 13 degrees of freedom
## Multiple R-squared: 0.5932, Adjusted R-squared: 0.5306
## F-statistic: 9.479 on 2 and 13 DF, p-value: 0.00289
The quadratic trend model is significant with p-value 0.002[<0.005]. The coefficients[quadratic term included] and intercept terms are significant. Residual errors are small and we notice that the R2 values have improved as well. Now let us fit these two trend models to see their fit on this data set. The below code shows the fits of the both models on the Bloater dataset.
plot(ts(as.vector(bloaterData)),
ylab='Egg Depositions (in millions)',
xlab='Year',
type='o',
col = c("orange"),
lwd = 2,
main = "Fitted Linear Curve to the Bloater Egg Deposition data")
lines(as.vector(fitted(lmModel)), col="red", lwd = 2)
lines(as.vector(fitted(quadModel)), col="black", lwd = 2)
legend("topleft",lty=1, text.width = 20, col=c("orange","red","black"), bty = "n",
c("Egg Deposition", "Fitted Linear Curve", "Fitted Quadratic Curve"))Fig 3.: Fitted Curves to the Bloater Egg Deposition data
From Fig 3. we can also see that the models are not a that good a fit for the given data points. However, compared to the Quadratic models the linear model seem tofit well visually eventhough its R2 values are much lower.
As we discussed in the visual inspection section we have to bring the fact into consideration that the time series has only 16 data points hence it might be worth not to rule out the Moving Average behaviour in the data set.
Hence, in the following section we will try multiple ARIMA models for the given time series data and plot for the fits to choose the best.
Before we proceed to the ARIMA Models we need to make the series stationary. We can perform an adf test to check whether our time series data is stationary or not.
testAdf = adf.test(bloaterData, alternative = "stationary")
testAdf##
## Augmented Dickey-Fuller Test
##
## data: bloaterData
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary
testAdf$p.value## [1] 0.5468942
With * 0.5468942. * > 0.05 we infer that there is no enough statistical evidence & we fail to reject the H0:. Hence the series is non-stationary.
In order to make the time series stationary, we apply Box-Cox transformation and see if it helos in making the series stationary.
bloaterData.BC = BoxCox.ar(bloaterData)Fig 4.: Box-Cox Transformations with optimal lambda values
bloaterData.BC$ci## [1] 0.7 1.0
From the above (Fig 4.) the optimal values for \(\lambda\) are within the confidence intervals between 0.7 and 1. But the actual optimal value for \(\lambda\) is at 0 as the log likelihood shows it is optimal and hence we can say that the optimal \(/lambda\) value is between 0 and 1. So we pick 0.5 as the optimal \(\lambda\) value.
lambda = 0.5
bloaterData.tr = (bloaterData^lambda - 1)/lambda
par(mar=c(5,4,4,2),cex.main=0.75, cex.lab=0.75, cex.axis=0.75)
plot(bloaterData.tr,
ylab='Egg Depositions (in millions)',
xlab='Year',
type='o',
col = c("orange"),
lwd = 2,
main = "Time series plot of the transformed Bloater Egg Deposition data")
legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"),
c("Egg Depositions (in Millions)"))Fig 5.: Time series plot of the transformed data
From the above (Fig 5.) we witness that the BoxCox tranformation for the lambda value does not change the property of the series that much, one reason may be because the optimal value is closer to 1.
As next step we can try differencing the series.
bloaterData.diff = diff(bloaterData.tr, differences = 1)
testAdf1 = adf.test(bloaterData.diff, alternative = "stationary")
testAdf1##
## Augmented Dickey-Fuller Test
##
## data: bloaterData.diff
## Dickey-Fuller = -3.6096, Lag order = 2, p-value = 0.04931
## alternative hypothesis: stationary
With * 0.0493111. * < 0.05 we reject the H0:. Hence the transformed first differenced series is stationary now.
To find the possible candidate models, we have more than one methods to decide them, First let us plot for ACF & PACF to find the possible candidate models
plot(bloaterData.diff,
ylab='Egg Depositions (in millions)',
xlab='Year',
type='o',
col = c("orange"),
lwd = 2,
main = "Time series plot of the transformed differenced Bloater Egg Deposition data")
legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"),
c("Egg Depositions (in Millions)"))Fig 6.: Time series plot of the transformed differenced series
par(mar=c(5,4,4,2),mfrow=c(1,2),cex.main=0.75, cex.lab=0.75, cex.axis=0.75)
acf(bloaterData.diff, main = "ACF plot for Egg Depositions")
pacf(bloaterData.diff, main = "PACF plot for Egg Depositions")Fig 7.: ACF and PACF Plots of transformed differenced series
From (Fig 6.) & (Fig 7.); it is evident that the trend component is not present and hence differencing has made the series stationary. However we do not see any significant lags neither in ACF nor PACF which might be because of the changing variance after differencing we notice from the time series plot.
Let us try other methods to find the candidate model such as EACF - Extended Auto Correlation function & armasubsets.
eacf(bloaterData.diff, ar.max = 3, ma.max =3)## AR/MA
## 0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o
armaSubset = armasubsets(y=bloaterData.diff,
nar=3,
nma=3,
y.name='test',
ar.method='ols')
plot(armaSubset)
#main = "Model Selection")
title('BIC Table: Model Selection', line = 5)Fig 8 : Model selection: BIC Table
Possible candidate models are as follows:-
model_011_css = arima(bloaterData.tr,order=c(0,1,1),method='CSS')
coeftest(model_011_css)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.11073 0.24969 0.4435 0.6574
model_011_ml = arima(bloaterData.tr,order=c(0,1,1),method='ML')
coeftest(model_011_ml)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.10424 0.24358 0.4279 0.6687
model_110_css = arima(bloaterData.tr,order=c(1,1,0),method='CSS')
coeftest(model_110_css)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.10525 0.25639 0.4105 0.6814
model_110_ml = arima(bloaterData.tr,order=c(1,1,0),method='ML')
coeftest(model_110_ml)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.09895 0.24880 0.3977 0.6908
model_111_css = arima(bloaterData.tr,order=c(1,1,1),method='CSS')
coeftest(model_111_css)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.075038 0.930129 0.0807 0.9357
## ma1 0.032822 0.958494 0.0342 0.9727
model_111_ml = arima(bloaterData.tr,order=c(1,1,1),method='ML')
coeftest(model_111_ml)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.0088178 1.0057864 -0.0088 0.9930
## ma1 0.1124288 0.9735777 0.1155 0.9081
model_210_css = arima(bloaterData.tr,order=c(2,1,0),method='CSS')
coeftest(model_210_css)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.107408 0.255575 0.4203 0.6743
## ar2 -0.080573 0.256663 -0.3139 0.7536
model_210_ml = arima(bloaterData.tr,order=c(2,1,0),method='ML')
coeftest(model_210_ml)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.106498 0.250366 0.4254 0.6706
## ar2 -0.072083 0.244712 -0.2946 0.7683
model_211_css = arima(bloaterData.tr,order=c(2,1,1),method='CSS')
coeftest(model_211_css)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 1.064074 0.023735 44.8314 <2e-16 ***
## ar2 0.046656 0.051237 0.9106 0.3625
## ma1 -1.736208 0.079414 -21.8628 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_211_ml = arima(bloaterData.tr,order=c(2,1,1),method='ML')
coeftest(model_211_ml)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.55226 0.67390 0.8195 0.4125
## ar2 -0.17720 0.24423 -0.7255 0.4681
## ma1 -0.45246 0.65222 -0.6937 0.4879
model_212_css = arima(bloaterData.tr,order=c(2,1,2),method='CSS')
coeftest(model_212_css)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.399526 0.051183 7.8058 5.913e-15 ***
## ar2 0.960428 0.121759 7.8879 3.072e-15 ***
## ma1 -1.114982 0.342314 -3.2572 0.001125 **
## ma2 -1.139801 0.525849 -2.1675 0.030193 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_212_ml = arima(bloaterData.tr,order=c(2,1,2),method='ML')
coeftest(model_212_ml)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.88833 0.31889 2.7857 0.005342 **
## ar2 -0.99851 0.01820 -54.8637 < 2.2e-16 ***
## ma1 -0.91998 0.40354 -2.2798 0.022622 *
## ma2 0.99530 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_310_css = arima(bloaterData.tr,order=c(3,1,0),method='CSS')
coeftest(model_310_css)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.072525 0.244775 0.2963 0.7670
## ar2 -0.051084 0.244967 -0.2085 0.8348
## ar3 -0.305571 0.244344 -1.2506 0.2111
model_310_ml = arima(bloaterData.tr,order=c(3,1,0),method='ML')
coeftest(model_310_ml)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.082833 0.238029 0.3480 0.7278
## ar2 -0.046516 0.233081 -0.1996 0.8418
## ar3 -0.260505 0.226622 -1.1495 0.2503
model_311_css = arima(bloaterData.tr,order=c(3,1,1),method='CSS')
coeftest(model_311_css)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.196688 0.497225 0.3956 0.6924
## ar2 -0.066782 0.256388 -0.2605 0.7945
## ar3 -0.278693 0.277251 -1.0052 0.3148
## ma1 -0.146421 0.500625 -0.2925 0.7699
model_311_ml = arima(bloaterData.tr,order=c(3,1,1),method='ML')
coeftest(model_311_ml)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.068907 0.536409 0.1285 0.8978
## ar2 -0.044973 0.238886 -0.1883 0.8507
## ar3 -0.261481 0.228515 -1.1443 0.2525
## ma1 0.015399 0.532599 0.0289 0.9769
model_313_css = arima(bloaterData.tr,order=c(3,1,3),method='CSS')
coeftest(model_313_css)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.11242 0.18966 0.5928 0.553342
## ar2 0.77153 0.26142 2.9514 0.003164 **
## ar3 0.84955 0.30361 2.7982 0.005139 **
## ma1 -0.57250 0.14589 -3.9243 8.699e-05 ***
## ma2 -1.23955 0.26947 -4.5999 4.226e-06 ***
## ma3 -2.06809 0.31578 -6.5492 5.783e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_313_ml = arima(bloaterData.tr,order=c(3,1,3),method='ML')
coeftest(model_313_ml)##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.28669 0.94699 0.3027 0.7621
## ar2 -0.67513 0.55025 -1.2269 0.2198
## ar3 -0.42075 0.71567 -0.5879 0.5566
## ma1 -0.33089 1.21308 -0.2728 0.7850
## ma2 0.85258 0.81080 1.0515 0.2930
## ma3 0.25260 0.84456 0.2991 0.7649
We will perform Residual Analysis check on the chosen good models
residual.analysis(lmModel, check = 1)##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.83416, p-value = 0.008034
Fig 9.: Residual Analysis of Linear Trend Model
From the Shapiro Wilk test for normality we reject the null hypothesis as the p-value < 0.05 and confirm that the residuals is not normally distributed. Also from (Fig 9.),we infer that the historgram of the residual plot shows that they are skewed, with ACF & PACF plot show first significant lag.
residual.analysis(quadModel, check = 1)##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.9367, p-value = 0.3106
Fig 10.: Residual Analysis of Quadratic Trend Model
From the Shapiro Wilk test for normality we fail to reject the null hypothesis as the p-value > 0.05 and confirm that the residuals is normally distributed. Also from (Fig 10.),we infer that the historgram of the residual plot shows that they are slightly skewed, with ACF & PACF plot does not have any significant lag(s).
residual.analysis(model_212_css,check = 0) #added a small criteria to the function to skip #Shapiro Wilk Test## [1] "Shapiro Wilk Test of Normality is not necessary for css models"
Fig 11.: Residual Analysis of ARIMA212 Model
from (Fig 11.),we infer that the historgram of the residual plot shows that they are not normal, with ACF & PACF plot does not have any significant lag(s). Also the L-Jung Box Test shows that the model has some significant p values.
residual.analysis(model_313_css,check = 0) #added a small criteria to the function to skip #Shapiro Wilk Test## [1] "Shapiro Wilk Test of Normality is not necessary for css models"
Fig 12.: Residual Analysis of ARIMA311 Model
from (Fig 12.),we infer that the historgram of the residual plot shows that they are not normal, with ACF & PACF plot does not have any significant lag(s). Also the L-Jung Box Test shows that the model has some significant p values.
Comparitively we have ARIMA(2,1,2) as a good fit as it seems to have good residual analysis results.
fit = Arima(bloaterData, model = model_212_css, lambda=0.5)
bloaterDataForecast = forecast(fit,h=5)
knitr::kable(bloaterDataForecast,
caption = "5-Step ahead Forecast for Bloater Egg Depositions Series")| Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 | |
|---|---|---|---|---|---|
| 1997 | 0.9929115 | 0.3839321 | 1.885888 | 0.1765197 | 2.473563 |
| 1998 | 1.0149084 | 0.1120255 | 2.822908 | -0.0004587 | 4.146395 |
| 1999 | 0.9938055 | -0.0079927 | 4.339713 | -0.4414995 | 7.066294 |
| 2000 | 1.0064789 | -0.3301828 | 6.661994 | -1.9877575 | 11.671427 |
| 2001 | 0.9913263 | -1.4701063 | 10.264250 | -5.6710397 | 19.120522 |
par(mar=c(5,4,4,2),cex.main=1, cex.lab=1, cex.axis=1)
plot(bloaterDataForecast,
col = c("orange"),
type="o",
main = "Bloater Egg Depositions with 5 year Forecasts",
ylab = "Egg Depostions (in Millions)",
xlab = "Year",
lwd = 2,
ylim = c(0,3))
legend("topleft",lty=1, bty = "n" ,text.width = 15, col=c("orange"),
c("Egg Depositions (in Millions)"))Fig 13.: Original series with 5 step ahead forecast
From (Fig 13.), we can notice a large variance in the forecast which might be because of the short series we have used and other limitations of restricting to the analysis performed.
The model that the ARIMA(2,1,2) model fits the series the best with significant co-efficients.
Rob J Hyndman. 2018. Time Series. [ONLINE] Available at: https://robjhyndman.com/categories/time-series/. [Accessed 06 May 2018].
H., R., 2017. Time Series Analysis and Its Applications: With R Examples (Springer Texts in Statistics). Springer.
D., J., 2009. Time Series Analysis. Springer-Verlag New York, LLC.