Setting up the working directory
setwd("C:/Users/Saikat/Desktop/Time Series/Assignment 2")
Loading the necessary packages
library(TSA)
## Warning: package 'TSA' was built under R version 3.5.3
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
library(tseries)
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.5.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(fUnitRoots)
## Warning: package 'fUnitRoots' was built under R version 3.5.3
## Loading required package: timeDate
##
## Attaching package: 'timeDate'
## The following objects are masked from 'package:TSA':
##
## kurtosis, skewness
## Loading required package: timeSeries
## Warning: package 'timeSeries' was built under R version 3.5.3
##
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
##
## time<-
## Loading required package: fBasics
## Warning: package 'fBasics' was built under R version 3.5.3
library(dLagM)
## Warning: package 'dLagM' was built under R version 3.5.3
## Loading required package: nardl
## Warning: package 'nardl' was built under R version 3.5.3
## Loading required package: dynlm
## Warning: package 'dynlm' was built under R version 3.5.3
library(FitAR)
## Warning: package 'FitAR' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 3.5.3
## Loading required package: ltsa
## Warning: package 'ltsa' was built under R version 3.5.2
## Loading required package: bestglm
## Warning: package 'bestglm' was built under R version 3.5.3
library(forecast)
##
## Attaching package: 'forecast'
## The following object is masked from 'package:FitAR':
##
## BoxCox
## The following object is masked from 'package:dLagM':
##
## forecast
Here the read.csv() function has been used to read the data of the egg depositions (in millions) of age-3 between years 1981 to 1996. The data has been read stored as a dataframe.
eggs=read.csv("C:/Users/Saikat/Desktop/Time Series/Assignment 2/eggs.csv")
class(eggs)
## [1] "data.frame"
The objective of this analysis is to analyze the egg depositions of Lake Huron Bloasters by using various time series analysis methods and finally to choose the best model among a set of possible models for this dataset and give forecasts of egg depositions for the next 5 years.
Here the dataset has been converted into a time series, to be ensured the class() function has been used, And finally the time series values has been put into a graph.
rownames(eggs)<-seq(from=1981, to=1996)
eggsdata=ts(as.vector(eggs['eggs']), start=1981, end=1996)
class(eggsdata)
## [1] "ts"
plot(eggsdata[,'eggs'],type='o',ylab='Egg Depositions of Age-3(in millions)')
From the plot we are trying to find out the 5 main characteristics of the data:
Trends: Allthough there are some small changing local trends, the overall series seems to have an upward trend.
Seasonality: The data does not look like to have a constant seasonality.
Intervention/Changing point:The data has upward and downward movements throughout the series. Like in the year 1984 we can observe an upward movement followed by a steep downward movement in the year 1985. Again in the year 1988, we can see an upward movement followed by another downward movement in 1990. But the data does not look to have any changing point as there is no drastic shift in the flow which is followed by constant data.
Changing Variance: The data seems to have changing variance.
Behaviour: Behaviour: The series looks like non-stationary and has a non constant mean which changes with respect to time. The series has Auto Regressive as well as Moving Average characteristics. So we can conclude an ARIMA model will be a better fit for the series.
As this is a non-stationary series we will start our analysis with the very basic trend analysis models,the Linear model and Quadratic model. We will also perform the residual analysis on both the models if needed. If any of these models become the best fit for the prediction of this particular series, we will stop further analysis and will make our prediction accordingly.
We are performing the linear modelling for this series because the summary statistics we will obtain from that will help us to decide how well the model fits the series. The summary statistics will come with a p-value for F statistics and R^2 value.
If the p-value belongs to less than 0.05, we will reject the null hypothesis Ho, which is; the model does not fit the data. On the other hand,R^2, the value of the goodness coefficient shows how well the variation in the series can be explained by the model.
linmod<-lm(eggsdata~time(eggsdata))
summary(linmod)
##
## Call:
## lm(formula = eggsdata ~ time(eggsdata))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4048 -0.2768 -0.1933 0.2536 1.1857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -165.98275 49.58836 -3.347 0.00479 **
## time(eggsdata) 0.08387 0.02494 3.363 0.00464 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4598 on 14 degrees of freedom
## Multiple R-squared: 0.4469, Adjusted R-squared: 0.4074
## F-statistic: 11.31 on 1 and 14 DF, p-value: 0.004642
plot(eggsdata,type='o',ylab='y')
abline(linmod)
Here the observed p-value is less than 0.05, thus we can reject the Null Hypothesis Ho and say that the model fits the series. On the other hand the observed R^2 value is 0.4074 which means only 40.74 percentage of the variation in the egg deposition series can be explained by the Linear trend model.Although the model fits this series,the observed R^2 value is much lesser than 0.8, so this model will not be the best model for the prediction.That’s why there is no point of doing the residual analysis of the linear model.
Just like the Linear model, we will again look for the p-value for F statistics and R^2 value in the Quadratic model too
t<- time(eggsdata)
t2<- t^2
quadmod<-lm(eggsdata~ t + t2)
summary(quadmod)
##
## Call:
## lm(formula = eggsdata ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50896 -0.25523 -0.02701 0.16615 0.96322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.647e+04 2.141e+04 -2.170 0.0491 *
## t 4.665e+01 2.153e+01 2.166 0.0494 *
## t2 -1.171e-02 5.415e-03 -2.163 0.0498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4092 on 13 degrees of freedom
## Multiple R-squared: 0.5932, Adjusted R-squared: 0.5306
## F-statistic: 9.479 on 2 and 13 DF, p-value: 0.00289
plot(ts(fitted(quadmod)), ylim = c(min(c(fitted(quadmod),
as.vector(eggsdata))), max(c(fitted(quadmod),as.vector(eggsdata)))),
ylab='y' , main = "Fitted quadratic curve ", type="l",lty=2,col="red")
lines(as.vector(eggsdata),type="o")
Here the observed p-value is less than 0.05, thus we can reject the Null Hypothesis Ho and say that the model fits the series and the observed R^2 value is 0.5306 which means only 53.06% of the variation in the egg deposition series can be explained by the Quadratic trend model. Like the linear model, the observed R^2 value of the quadratic model is also much lesser than the expected 0.8. So the quadratic model too will not be the best model for the prediction.Thus there is no point of doing the residual analysis of this model.
As both linear and quadratic models have been failed to be the best fitted modles, we will further proceed to the ARIMA modelling. First we have plotted the ACF and the PACF and then we have done the ADF test on the original to test whether trend is present at the data or not.
####Plotting the ACF and PACF
par(mfrow=c(1,2))
acf(eggsdata)
pacf(eggsdata)
####ADF test on the original data
adf.test(eggsdata)
##
## Augmented Dickey-Fuller Test
##
## data: eggsdata
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary
The slowly decaying pattern of the ACF and a very high first correlation in the PACF indicates that there is trend and non stationarity present in the series. Furthermore the p-value generated from the ADF test confirms the presence of the non stationarity in the series. If it had been less than 0.05, we could say that there is no non stationarity present in the series.
Now, to further proceed with the ARIMA model, we need to get rid of the changing variance first. So to do so, we have applied the Box-Cox transformation.
#Determining the value of lambda
eggsdata.transform = BoxCox.ar(eggsdata, method='yule-walker')
## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1
## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1
eggsdata.transform$ci
## [1] 0.1 0.8
lambda=0.45 #the mid point of the interval
After doing the Box-Cox transformation, we have plotted a QQ plot and also done the Shapiro Wilk test to check the normality.
BC.eggsdata = (eggsdata^lambda-1)/lambda
par(mfrow=c(1,1))
qqnorm(BC.eggsdata)
qqline(BC.eggsdata, col = 2)
shapiro.test(BC.eggsdata)
##
## Shapiro-Wilk normality test
##
## data: BC.eggsdata
## W = 0.96269, p-value = 0.7107
From the Shapiro Wilk test we have failed to reject the null hypothesis which is the series is normal.
Now in order to remove the trend we have done differencing on the data and then have done an ADF test on it to check whether it has been detrended or not.
#First differencing
diff.BC.eggsdata = diff(BC.eggsdata)
plot(diff.BC.eggsdata,type='o',ylab='Egg Depositions(in millions)')
#ADF test on the first differenced data
adf.test(diff.BC.eggsdata)
##
## Augmented Dickey-Fuller Test
##
## data: diff.BC.eggsdata
## Dickey-Fuller = -3.6798, Lag order = 2, p-value = 0.0443
## alternative hypothesis: stationary
As the p-value of the ADF test is below 0.05, we can safely say that there is no trend present in the series anymore.Thus we have been able to successfully reject the null hypothesis which says that the series is in non-stationarity.Now we can say our series is stationary.
To determine our probable ARIMA model, we will now plot the the ACF and PACF again. From the ACF we will get the value of q and from PACF we will get the value PACF.
acf(diff.BC.eggsdata)
pacf(diff.BC.eggsdata)
But there is no significant lag in both the ACF and the PACF,so we could not get any value for p and q. Also the series looks like a white noise. So now we will plot EACF table, to get our probable models.
#EACF table
eacf(diff.BC.eggsdata,ar.max = 3, ma.max = 3)
## AR/MA
## 0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o
From the EACF table we have got three probable models which are- ARIMA(0,1,1),ARIMA(1,1,0),ARIMA(1,1,1).The ‘1’ in the middle is the value of ‘d’, as we have done only the first order differencing. Now we will chart the BIC table to get some more probable models.
#BIC table
res3 = armasubsets(y=diff.BC.eggsdata,nar=4,nma=4,y.name='test',ar.method='yule-walker')
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 4 linear dependencies found
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : nvmax reduced to 4
plot(res3)
From the BIC table we have taken probable models which are- ARIMA(3,1,2),ARIMA(3,1,3).
So our final set of probable models are ARIMA(0,1,1),ARIMA(1,1,0),ARIMA(1,1,1),ARIMA(3,1,2) and ARIMA(3,1,3). Now we will do the parameter estimation of each models. We have checked the conditional sum of squares (CSS) and the maximum likelihood estimation (ML) of each models and find out whether they are significant or not.If the value of Pr(>|z|) comes under .05, we can say that model is significant and can proceed for the final modelling.
#CSS
model_011_css = arima(BC.eggsdata,order=c(0,1,1),method='CSS')
coeftest(model_011_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.11857 0.24784 0.4784 0.6324
#ML
model_011_ml = arima(BC.eggsdata,order=c(0,1,1),method='ML')
coeftest(model_011_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.11173 0.24186 0.462 0.6441
According to coefficient test MA(1) is not significant.
#CSS
model_110_css = arima(BC.eggsdata,order=c(1,1,0),method='CSS')
coeftest(model_110_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.11393 0.25603 0.445 0.6563
#ML
model_110_ml = arima(BC.eggsdata,order=c(1,1,0),method='ML')
coeftest(model_110_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.10720 0.24864 0.4311 0.6664
According to coefficient test AR(1) is not significant.
#CSS
model_111_css = arima(BC.eggsdata,order=c(1,1,1),method='CSS')
coeftest(model_111_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.104270 0.937663 0.1112 0.9115
## ma1 0.010553 0.985521 0.0107 0.9915
#ML
model_111_ml = arima(BC.eggsdata,order=c(1,1,1),method='ML')
coeftest(model_111_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.0016698 0.9762099 0.0017 0.9986
## ma1 0.1100801 0.9440422 0.1166 0.9072
According to coefficient test ARIMA(1,1,1) is not significant.
#CSS
model_312_css = arima(BC.eggsdata,order=c(3,1,2),method='CSS')
coeftest(model_312_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 1.236391 0.259979 4.7557 1.977e-06 ***
## ar2 -0.783025 0.358715 -2.1829 0.02905 *
## ar3 0.097833 0.298755 0.3275 0.74331
## ma1 -1.770577 0.117307 -15.0936 < 2.2e-16 ***
## ma2 1.483194 0.128004 11.5871 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#ML
model_312_ml = arima(BC.eggsdata,order=c(3,1,2),method='ML')
coeftest(model_312_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.51649 1.04689 0.4934 0.62176
## ar2 -0.84463 0.33909 -2.4909 0.01274 *
## ar3 -0.17694 0.36206 -0.4887 0.62506
## ma1 -0.54251 1.29038 -0.4204 0.67417
## ma2 0.99996 0.45441 2.2006 0.02777 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to coefficient test, we can not conclude whether ARIMA(3,1,2) is significant or not. The reason is CSS is telling us that AR(1),AR(2),MA(1),MA(2) are significant. On the other hand MLE is showing us that only AR(2) and MA (2) are significant.
#CSS
model_313_css = arima(BC.eggsdata,order=c(3,1,3),method='CSS')
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model_313_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.070009 0.157254 -0.4452 0.6561761
## ar2 0.772293 0.237953 3.2456 0.0011721 **
## ar3 1.067447 0.228710 4.6673 3.052e-06 ***
## ma1 -0.510962 0.140180 -3.6450 0.0002673 ***
## ma2 -1.375972 0.200311 -6.8692 6.457e-12 ***
## ma3 -2.211411 0.258193 -8.5649 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#ML
model_313_ml = arima(BC.eggsdata,order=c(3,1,3),method='ML')
coeftest(model_313_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.33479 0.85521 0.3915 0.6955
## ar2 -0.66116 0.60223 -1.0979 0.2723
## ar3 -0.42614 0.74304 -0.5735 0.5663
## ma1 -0.38936 1.02182 -0.3810 0.7032
## ma2 0.81219 0.81831 0.9925 0.3209
## ma3 0.28031 0.85772 0.3268 0.7438
Again,according to coefficient test, we can not conclude whether ARIMA(313) is significant ot not. As CSS is showing some other values and MLE is showing some other.
Now we will sort the AIC and BIC value of all the probable models. The model which will have the least value among all the models will be our best fit.
# Sorting AIC and BIC values
sort.score <- function(x, score = c("bic", "aic")){
if (score == "aic"){
x[with(x, order(AIC)),]
} else if (score == "bic") {
x[with(x, order(BIC)),]
} else {
warning('score = "x" only accepts valid arguments ("aic","bic")')
}
}
sort.score(AIC(model_011_ml,model_110_ml,model_111_ml,model_312_ml,model_313_ml), score = "aic")
## df AIC
## model_011_ml 2 21.43716
## model_110_ml 2 21.45089
## model_111_ml 3 23.43715
## model_312_ml 6 26.78844
## model_313_ml 7 28.70119
#sort.score(BIC(model_011_ml,model_110_ml,model_111_ml,model_312_ml,model_313_ml), score = "bic" )
According to the AIC and BIC table we can conclude that ARIMA(0,1,1) is the best fit to predict on this particular data.
Now we will test the overfitting of our best model which is ARIMA(0,1,1). To do so we will check whether ARIMA(1,1,1) and ARIMA(0,1,2) is significant or not. We have already checked ARIMA(1,1,1) is significant or not and we found it is insignificant. So now we will only test whether ARIMA(0,1,2) is significant or not. If ARIMA(0,1,2) comes insignificant, we can proceed with ARIMA(0,1,1) model to do the forecast.
# ARIMA(0,1,2)
#CSS
model_012_css = arima(BC.eggsdata,order=c(0,1,2),method='CSS')
coeftest(model_012_css)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.124990 0.366349 0.3412 0.7330
## ma2 0.010491 0.434850 0.0241 0.9808
#ML
model_012_ml = arima(BC.eggsdata,order=c(0,1,2),method='ML')
coeftest(model_012_ml)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.1127669 0.3212634 0.351 0.7256
## ma2 0.0017938 0.3595978 0.005 0.9960
Both the CSS and the MLE is indicating that ARIMA(0,1,2) is insignificant. Hence we can say that only ARIMA(0,1,1) is our best fit to do the prediction.
Here we will do the residual analysis of our predictive model which is ARIMA(0,1,1). This residual analysis includes time series plot of the standardised residuals,histogram of the standardised residuals, QQ Plot of the standardised residuals,ACF of the standardised residuals,Shapiro-Wilk test of the standardised residuals and the Ljung-Box test.
residual.analysis <- function(model, std = TRUE){
library(TSA)
library(FitAR)
if (std == TRUE){
res.model = rstandard(model)
}else{
res.model = residuals(model)
}
par(mfrow=c(3,2))
plot(res.model,type='o',ylab='Standardised residuals', main="Time series plot of standardised residuals")
abline(h=0)
hist(res.model,main="Histogram of standardised residuals")
qqnorm(res.model,main="QQ plot of standardised residuals")
qqline(res.model, col = 2)
acf(res.model,main="ACF of standardised residuals")
print(shapiro.test(res.model))
k=0
LBQPlot(res.model, lag.max = length(model$residuals)-1 , StartLag = k + 1, k = 0, SquaredQ = FALSE)
par(mfrow=c(1,1))
}
residual.analysis(model = model_011_ml)
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.94389, p-value = 0.3994
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length
par(mfrow=c(1,1))
The residual analysis indicates
-standardised normals are randomly distributed over time. -histogram looks normally distributed and form a bell shaped curve. -QQ plot looks not very alligned. -ACF plot does not have any significant lags -Shapiro-Wilk normality test indicates that residuals are random with the p-value of 0.3994 -All p-values in Ljung-Box test plot are above the alpha value.
Although QQplot does not look very alligned and the p-value of the Shapiro-Wilk test is not very high, but the other analysises are satisfying enough to go ahead with the model ARIMA(0,1,1) for the final forecasting.
We will now do the final forecastong based on the ARIMA(0,1,1) model for the upcoming 5 years.
fit = Arima(eggsdata,c(0,1,1), lambda = .45)
plot(forecast(fit,h=5))
As per the forecasting based on the ARIMA(0,1,1) model, the deposition of the eggs should not vary much and should maintain a constant flow for the upcoming 5 years.