This is time series data for Australian monthly gas production from January 1956 to August 1995. The objective is to analyse the data and build a time series model to forecast 12 periods into the future.
## Jan Feb Mar Apr May Jun
## 1956 1709 1646 1794 1878 2173 2321
## Mar Apr May Jun Jul Aug
## 1995 46287 49013 56624 61739 66600 60054
## [1] FALSE
## [1] "ts"
This time series has a total of 476 entries with monthly frequency.
ts.plot(AU.gas, xlab = "Year", ylab = "Gas Production",
        main = "Australian Gas Production 1956 - 1995") ## plotting the ts object AU.gas
We observe that monthly gas production between 1956 and roughly 1970 stays at a flat level.
Post 1970 there is a gradual increase in production, i.e. an upward trend, with peaks and troughs indicating some seasonality in the data.
Doing an additive decomposition of the AU gas production dataset (a minimal sketch follows below) we observe:
There is a trend component taking an upward route around the 1970s
There is a strong seasonal component
There is a random component showing roughly constant variance until the mid 1970s
This is a monthly time series with a frequency of 12 observations per year
Visual observation of the decomposed plot clearly tells us that, since both trend and seasonality are present in this series, it is not stationary
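A sketch of how such an additive decomposition can be produced with base R's decompose (the author's exact call is not shown in the output):
decomposed.add <- decompose(AU.gas, type = "additive") ## classical additive decomposition
plot(decomposed.add) ## panels: observed, trend, seasonal, random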
Hypotheses for the ADF test:
Null hypothesis H0: The time series is non-stationary
Alternative hypothesis Ha: The time series is stationary
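The output below is produced by the tseries package's ADF test applied to the full series:
library(tseries)
adf.test(AU.gas) ## default lag order = trunc((n - 1)^(1/3)) = 7 for 476 observations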
##
## Augmented Dickey-Fuller Test
##
## data: AU.gas
## Dickey-Fuller = -2.7131, Lag order = 7, p-value = 0.2764
## alternative hypothesis: stationary
p-value = 0.2764
Since the p-value is much greater than the significance level of 0.05, we fail to reject the null hypothesis H0.
Time series AU.gas is not stationary.
Since the AU gas production series is seasonal, it can be de-seasonalized by subtracting the seasonal component of the series.
## decomposing the AU series
decomposed = stl(AU.gas, s.window = "periodic")
seasonal = decomposed$time.series[,1] ## seasonal component
trend = decomposed$time.series[,2] ## trend component
remainder = decomposed$time.series[,3] ## random component
# removing the seasonality
des.data = AU.gas - seasonal
plot(des.data, ylab = "Production units", main = "De-Seasonalized Series") ## plotting the de-seasonalized data
## splitting into training (1956-1993) and test (1994 onwards) windows
train.AUgas = window(AU.gas, start = c(1956,1), end = c(1993,12), frequency = 12)
test.AUgas = window(AU.gas, start = c(1994,1), frequency = 12)
## Plotting the train and test sets
autoplot(train.AUgas, series = "Train") +
  autolayer(test.AUgas, series = "Test") +
  ggtitle("AU gas Training and Test data") +
  xlab("Year") + ylab("Production") +
  guides(colour = guide_legend(title = "Series"))
The ADF test showed that the series is non-stationary, so as a first step we have to stationarize it, which can be done by differencing. This is taken care of by the d parameter when modelling in the ARIMA(p,d,q) format.
Testing whether the series is stationary now by running the ADF test again on the differenced data (a sketch of the differencing follows):
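A minimal sketch, assuming the differenced series was created by a first-order difference of the full series:
Augas.diff <- diff(AU.gas) ## first-order (lag-1) differencing
adf.test(Augas.diff)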
## Warning in adf.test(Augas.diff): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: Augas.diff
## Dickey-Fuller = -17.029, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
Since the p-value from the test has dropped to 0.01 (below the 0.05 significance level), we reject H0 and conclude that the differenced series Augas.diff is now stationary. Differencing does make the series stationary.
Autocorrelation (ACF) and partial autocorrelation (PACF, which nullifies intermediary effects) of the original AU gas time series:
The ACF and PACF reflect the trend in the data, with significant correlations up to very large lags for the original AU gas production dataset.
Autocorrelation (ACF) and partial autocorrelation (PACF) of the differenced series Augas.diff:
We observe that the autocorrelations at lag 1 and lag 2 are significant, as are the partial autocorrelations up to very high lags (a sketch of the plotting commands follows).
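A minimal sketch of the correlogram commands, assuming base R's acf/pacf:
acf(AU.gas, main = "ACF - AU.gas") ## autocorrelations of original series
pacf(AU.gas, main = "PACF - AU.gas") ## partial autocorrelations of original series
acf(Augas.diff, main = "ACF - differenced")
pacf(Augas.diff, main = "PACF - differenced")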
Building a manual ARIMA(p,d,q) model with seasonal effects (P,D,Q)
ARIMA(p, d, q) model: ARIMA is defined by 3 parameters
p: number of autoregressive terms
d: degree of differencing needed to stationarize the series
q: number of moving average terms
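The fit below reproduces the Call shown in the output (man.arima is the name used later for the residual checks):
man.arima <- arima(train.AUgas, order = c(1, 1, 1), seasonal = c(1, 1, 1), method = "ML")
man.arima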
##
## Call:
## arima(x = train.AUgas, order = c(1, 1, 1), seasonal = c(1, 1, 1), method = "ML")
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.3059 -0.7346 0.1202 -0.5929
## s.e. 0.1174 0.0894 0.0944 0.0791
##
## sigma^2 estimated as 2234482: log likelihood = -3868.81, aic = 7747.61
AIC is 7747.61; lower AIC is considered better.
## Plotting the forecast of the manual ARIMA for 12 periods ahead
plot(forecast(man.arima, h = 12), shadecols = "oldstyle")
For an ARIMA model to be considered reasonable for a series, it is important to check whether the residuals behave like white noise. Towards that goal the Box-Ljung test is applied. The Box-Ljung test checks whether the residuals of the fitted model are independently distributed (white noise) or serially correlated.
H0: Residuals are independently distributed (white noise)
H1: Residuals exhibit serial correlation
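The test output below corresponds to:
Box.test(man.arima$residuals, lag = 350, type = "Ljung-Box")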
##
## Box-Ljung test
##
## data: man.arima$residuals
## X-squared = 360.79, df = 350, p-value = 0.3341
We purposefully take a large lag value of 350 to show that the p-value of 0.3341 for the hypothesis test is greater than the significance level of 0.05; hence we cannot reject the null hypothesis, and the residuals can be treated as white noise.
We now let auto.arima decide on the best parameters (a sketch of the call follows).
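A minimal sketch, assuming forecast::auto.arima with its default stepwise search (any extra tuning arguments the author may have used are not visible in the output):
library(forecast)
auto.fit <- auto.arima(train.AUgas, seasonal = TRUE)
auto.fit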
## Series: train.AUgas
## ARIMA(1,1,2)(0,1,2)[12]
##
## Coefficients:
## ar1 ma1 ma2 sma1 sma2
## 0.8310 -1.2848 0.3160 -0.4382 -0.0993
## s.e. 0.0827 0.1098 0.0887 0.0535 0.0526
##
## sigma^2 estimated as 2219950: log likelihood=-3865.2
## AIC=7742.4 AICc=7742.59 BIC=7766.96
Auto modelling gives us (p,d,q) = (1,1,2) and a seasonal order of (P,D,Q) = (0,1,2) with period 12.
##
## Box-Ljung test
##
## data: auto.fit$residuals
## X-squared = 358.18, df = 350, p-value = 0.37
We check the accuracy of our manual ARIMA model on the test data, i.e. test.AUgas, covering the period January 1994 to August 1995.
There is a visible difference between the modelled and actual values for the test period observations.
Vec1 <- cbind(test.AUgas, as.data.frame(forecast(man.arima, h = 20))[, 1])
ts.plot(Vec1, col = c("blue", "red"), main = "Gas Production: Actual vs Forecast")
legend("bottomright", legend = c("Actual", "Forecast"), col = c("blue", "red"), cex = 0.8, lty = 1)
## ME RMSE MAE MPE MAPE MASE
## Training set 13.17663 1473.357 848.3123 0.1274631 3.890809 0.4835891
## Test set 2472.67931 3935.265 3094.7245 4.3989870 5.983214 1.7641795
## ACF1 Theil's U
## Training set -0.02465557 NA
## Test set -0.01304396 0.7843535
## ME RMSE MAE MPE MAPE MASE
## Training set 21.00259 1460.247 845.3155 0.3948736 3.886841 0.4818808
## Test set 2517.76019 3998.049 3220.9565 4.4221990 6.204546 1.8361394
## ACF1 Theil's U
## Training set -0.002459325 NA
## Test set 0.008021776 0.7850042
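The two accuracy tables above (manual model first, then the auto model) can be generated with forecast::accuracy; a minimal sketch, assuming the same 20-period horizon used for the plot:
accuracy(forecast(man.arima, h = 20), test.AUgas) ## manual ARIMA(1,1,1)(1,1,1)
accuracy(forecast(auto.fit, h = 20), test.AUgas) ## auto ARIMA(1,1,2)(0,1,2)[12]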
We find that the manual ARIMA model is neck and neck with the auto ARIMA model on most accuracy measures, such as:
Root mean squared error (RMSE)
Mean absolute percentage error (MAPE)
But when it comes to model fit based on AIC, auto ARIMA performs better, giving 7742 against the manual model's 7747.
We can always build a more complex, better-fitting model by taking higher-order parameters. But such dependence on observations from the long past, in this case letting 1956 influence 1995, is questionable, as the technology of gas exploration and the patterns of its consumption and usage have changed tremendously due to economic factors.
We can improve the model by dropping the long-past observations and using only the post-1970 era, when a clear trend emerges (a sketch follows below).
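A minimal sketch of this refit; the variable names here are hypothetical:
AUgas.post70 <- window(AU.gas, start = c(1970, 1)) ## keep only post-1970 observations
refit.auto <- auto.arima(AUgas.post70, seasonal = TRUE) ## refit on the shorter series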