Machine Learning

Executive Summary
Data Exploration
Stationarity
- Dicky Fuller test for Stationarity
- De-seasonalize the data
ARIMA Modelling and Forecasting
Conclusion

Executive Summary

This is Time series data for Australian monthly gas production between 1956 - Jan to 1995 - August Objective is to use analyse and build a time series model to forecast 12 periods in future

Data Exploration

library(ggplot2)
library(tseries)

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(forecast)

## Registered S3 methods overwritten by 'forecast':
##   method             from    
##   fitted.fracdiff    fracdiff
##   residuals.fracdiff fracdiff

library(kableExtra)

AU.gas = ts(gas, start = c(1956,1), frequency = 12)
head(AU.gas)

##       Jan  Feb  Mar  Apr  May  Jun
## 1956 1709 1646 1794 1878 2173 2321

tail(AU.gas)

##        Mar   Apr   May   Jun   Jul   Aug
## 1995 46287 49013 56624 61739 66600 60054

any(is.na(AU.gas))  #checking for any missing values

## [1] FALSE

class(AU.gas)   # checking the class of dataset

## [1] "ts"

This time series has total of 476 entries with monthly frequency

ts.plot(AU.gas,xlab = "Year", ylab= "Gas Production",
                main = "Australian Gas Production 1956 - 1995 ") ## plotting the ts object AU gas dataset

We observe that Monthly gas production between 1956 to almost 1970 is flattened level
Post 1970 era there is gradual increase in production i.e showing an upward trend with peaks and troughs indicating some seasonality in data

plot(decompose(AU.gas, type = "additive"))

# checking for Trend, Seasonal and Irregular components

Doing a additive decomposition of AU gas production dataset we observe

There is trend component taking upward route around 1970s
There is strong seasonal component
There is random component which shows constant variance till somewhere mid 1970s

This is monthly time series with frequency of 12 observations per year

Stationarity

Visual observation of decomposed plot clearly tells us that since both trend and seasonality are present in this series hence it is Not Stationary

Dicky Fuller test for Stationarity

Hypothesis for ADF test

Null hypothesis Ho : Time series non-stationary

Alternative hypothesis Ha : Time series is stationary

adf.test(AU.gas)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  AU.gas
## Dickey-Fuller = -2.7131, Lag order = 7, p-value = 0.2764
## alternative hypothesis: stationary

p-value = 0.2764

Since p-value is much greater than significant value of 0.05 we fail to reject the Null Hypothesis Ho

Time series AU.gas is not-stationary Time series AU.gas is not-stationary

De-seasonalize the data

Since AU gas production data series is seasonal data it can be de-sesonalized by taking out the seasonal component of series

## decomposing the AU series 
decomposed = stl(AU.gas, s.window = "periodic")
seasonal = decomposed$time.series[,1]  ## seasonal component
trend = decomposed$time.series[,2]    ## trend component
remainder = decomposed$time.series[,3]  ## random component 

# removing the seasonality
des.data = AU.gas - seasonal
plot(des.data, ylab= "Production units", main = "De-Seasonalized Series") ## plotting the de-seasonlized data

ARIMA Modelling and Forecasting

ARIMA model

Before Forecasting method we will need to split dataset into test and train data

train.AUgas = window(AU.gas, start=c(1956,1), end = c(1993,12), frequency=12)
test.AUgas = window(AU.gas, start=c(1994,1), frequency=12)
## Plotting the train and Test set 
autoplot(train.AUgas, series="Train") +
 autolayer(test.AUgas, series="Test") +
 ggtitle("AU gas  Traning and Test data") +
 xlab("Year") + ylab("Production") +
 guides(colour=guide_legend(title="Forecast"))

ADF test proved that series is non-stationary. So as first step we will have to stationarize the series which can be done by differencing. This will be taken care by selecting d parameter in ARIMA[p,d,q] format modelling

Augas.diff = diff(AU.gas, differences = 2)

Testing whether series is stationary now by doing ADF test again on differenced data

adf.test(Augas.diff)

## Warning in adf.test(Augas.diff): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Augas.diff
## Dickey-Fuller = -17.029, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Since p-value suggested by test has gone down significantly we can say that differenced series AUgas.diff is stationary now. Differencing does make point in making series stationary

Checking ACF and PACF

Autocorrelation or ACF and Partial Autocorrelation PACF (after nullifying intermediary affects) of original time series AU gas dataset

ACF and PACF show the trended data observations and considerable significance upto very large lags for the original dataset of AU gas production

par(mfrow=c(1,2))
acf(AU.gas, lag.max = 50)
pacf(AU.gas,lag.max = 50)

Autocorrelation or ACF and Partial Autocorrelation PACF (after nullifying intermediary affects) of Differenced time series AUdiff gas dataset

We observe that the correlations of lag 1, lag 2 are significant and so are Partial ACFs upto a very high lags

par(mfrow=c(1,2))
acf(Augas.diff, lag.max = 50)
pacf(Augas.diff,lag.max = 50)

Building a manual ARIMA[p,d,q] model with seasonal effects [P,D,Q]

ð´ðð¼ðð´(ð, ð, ð) Model: ARIMA is defined by 3 parameters ð: No of autoregressive terms ð: No of differencing to stationarize the series ð: No of moving average terms

man.arima = arima(train.AUgas, order = c(1,1,1), seasonal = c(1,1,1), method = 'ML')
man.arima

## 
## Call:
## arima(x = train.AUgas, order = c(1, 1, 1), seasonal = c(1, 1, 1), method = "ML")
## 
## Coefficients:
##          ar1      ma1    sar1     sma1
##       0.3059  -0.7346  0.1202  -0.5929
## s.e.  0.1174   0.0894  0.0944   0.0791
## 
## sigma^2 estimated as 2234482:  log likelihood = -3868.81,  aic = 7747.61

AIC is 7747.61 ; Lower is considered better

Forecasting

## Plotting the forecast of manual arima for 12 advance periods
plot(forecast(man.arima, h=12), shadecols = "oldstyle")

For ARIMA model is assumed to be reasonable for a series, it is important to check whether the residuals are following white noise or not. Towards that goal Box-Ljung test is applied. Box-Ljung test: This checks whether the residuals of time series data are stationary or not.

H0: Residuals are stationary

H1: Residuals are not stationary

Box.test(man.arima$residuals, type = "Ljung-Box", lag = 350)

## 
##  Box-Ljung test
## 
## data:  man.arima$residuals
## X-squared = 360.79, df = 350, p-value = 0.3341

acf(man.arima$residuals, lag.max = 50)

hist(man.arima$residuals) ## checking the normal distribution of residuals

Purposefully taking a large lag value of 350 to display that p-value for hypothesis test is 0.3341 greater than significance level of 0.05 hence we cannot reject the null hypothesis

Auto ARIMA model

We let auto arima decide on best parameters

auto.fit = auto.arima(train.AUgas, trace = F, seasonal = T)
auto.fit

## Series: train.AUgas 
## ARIMA(1,1,2)(0,1,2)[12] 
## 
## Coefficients:
##          ar1      ma1     ma2     sma1     sma2
##       0.8310  -1.2848  0.3160  -0.4382  -0.0993
## s.e.  0.0827   0.1098  0.0887   0.0535   0.0526
## 
## sigma^2 estimated as 2219950:  log likelihood=-3865.2
## AIC=7742.4   AICc=7742.59   BIC=7766.96

Looks like auto modelling gives us p,d,q = 1,1,2 and seasonal order of P,D,Q ~= 0,1,2

plot(forecast(auto.fit, h=12), ylab = "Gas Production", xlab = "Year")

Box.test(auto.fit$residuals, type = "Ljung-Box", lag = 350)

## 
##  Box-Ljung test
## 
## data:  auto.fit$residuals
## X-squared = 358.18, df = 350, p-value = 0.37

We check on the accuracy of our manual ARIMA model on our test data ie. test.AUgas Period from Jan-1994 to 1995 Aug

There is visible difference between modelled and actual values for the test period observations

Vec1<- cbind(test.AUgas ,as.data.frame(forecast(man.arima, h=20))[,1])
ts.plot(Vec1, col=c("blue", "red"), main="Gas Production: Actual vs Forecast")
legend("bottomright", legend=c("Actual", "Forecast"),col=c("blue", "red"), cex=0.8, lty= 1:1)

## Accuracy of the manual arima model 
accuracy(forecast(man.arima, 24), test.AUgas)

##                      ME     RMSE       MAE       MPE     MAPE      MASE
## Training set   13.17663 1473.357  848.3123 0.1274631 3.890809 0.4835891
## Test set     2472.67931 3935.265 3094.7245 4.3989870 5.983214 1.7641795
##                     ACF1 Theil's U
## Training set -0.02465557        NA
## Test set     -0.01304396 0.7843535

## Accuracy of the Auto arima model 
accuracy(forecast(auto.fit, 24), test.AUgas)

##                      ME     RMSE       MAE       MPE     MAPE      MASE
## Training set   21.00259 1460.247  845.3155 0.3948736 3.886841 0.4818808
## Test set     2517.76019 3998.049 3220.9565 4.4221990 6.204546 1.8361394
##                      ACF1 Theil's U
## Training set -0.002459325        NA
## Test set      0.008021776 0.7850042

Conclusion

We find that mostly the Manual Arima model is neck to neck in most of the accuracy parameters like

Root mean squared error
Mean absolute percentage error

But when it comes to actual model based on AIC Auto arima performs better giving 7742 against manual models 7747

We can always build a complex better fitting model by taking higher parameters. But such dependecy on the observations of long past such as in this case 1956 on 1995 is questionable as technology of gas exploration , its consumption and usage has gone up tremendously due to economic factors

We can improve model by dropping the long past observations and taking only the post 1970 era when a clear trend happens out