#install.packages("forecast")
library(forecast)
## Warning: package 'forecast' was built under R version 3.6.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tseries)
## Warning: package 'tseries' was built under R version 3.6.3
gas <- forecast::gas

View(gas)
summary(gas)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1646    2675   16788   21415   38629   66600
class(gas)
## [1] "ts"
gas_aft1970 = window(gas, start = c(1970,1), frequency = 12)

After forecast package is installed, now we can call the dataset.

Here, we have only considered the data after 1970, to avoid the old data.

##PLOT and GRAPHS##
plot(gas_aft1970, main = "Australian Gas Production")

hist(gas_aft1970,col = "Blue")

monthplot(gas_aft1970)

seasonplot(gas_aft1970, year.labels = T, col = 1:40)

ggseasonplot(gas_aft1970,polar = TRUE)

gglagplot(gas_aft1970)

gas.ts = ts(gas_aft1970, frequency = 4)
decompose.ts = decompose(gas.ts, type = "additive")
plot(decompose.ts)

plot(decompose.ts$trend)

tsoutliers(gas_aft1970)
## $index
## [1] 236 248 270 272 273 279 296 300
## 
## $replacements
## [1] 54627.97 49350.52 53319.41 52810.33 50829.25 42811.20 57936.25 45755.42
##any(is.na(gas))

From the above plot, it can be seen that the series has a continuous growth after 1970, and till 1970, the rate of growth was not so good or only a slight growth can be seen. The production of Gas increased over a period of time i.e. 40 years.

The upward trend which can be observed seems to be showing some seasonality exists, but there is extremely high variance also in the plot. The timeline involved is 40 years, with a cycle of 10 years (1960, 1970, 1980, 1990). This is also clear that the trend is positive.

A large number of lower values (<10000) depicts lower production of gas are from the early years, prior to 1970’s.

 The month plot for the time series object shows a clear increasing trend in production of gas during each month till the month of July, and then declines. But, each year-to-year there is always growth in months.

 The Seasonal Plot shows that there is a trend each year that, in the month July the production is at its peak (clearly seen in the 2nd Seasonal plot i.e. polar plot) and the months of July and August shows maximum growth each year.

 Each color in the lag plot indicates the month of the variable on the vertical axis. The lines connecting are in a chronological order.
 The relationship is strongly positive at lag 12, reflecting the strong seasonality in the data.

Insights from Decomposition Plot:  The additive decomposition plot is used to here to break the seasonality and to show, all the components individually. This shows there lies strong seasonality across the years, especially after 1970.  There is random component which shows constant variance till somewhere mid-1970s.  This also shows the Trend Component as positive throughout the years especially after 1970, though there lie outliers.

## 3. Periodicity 
frequency(gas_aft1970)
## [1] 12

To check, the periodicity of the dataset, the frequency can be checked. > frequency(gas_aft1970) [1] 12

As this time series data has one observation each month with a monthly sampling frequency or monthly periodicity and so this dataset is a monthly time series. Data periodicity is described by specifying periodic time intervals into which the dates of the observations fall.

## 4. stationary 

diff1<-diff(gas_aft1970)
plot(diff1, main = "Diffrencing Australian Gas Production")

adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -16.915, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
## Deseasonalize with Log Tranformation & Decomposition
adf.test(gas_aft1970)
## Warning in adf.test(gas_aft1970): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  gas_aft1970
## Dickey-Fuller = -4.7629, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
logofgas <- log(gas_aft1970)
plot(logofgas, main = "log(gas_aft1970)")

# removing the seasonality
AU.gas = ts(gas_aft1970, start = c(1970,1), frequency = 12)
head(AU.gas)
##       Jan  Feb  Mar  Apr  May  Jun
## 1970 3345 4220 4874 5064 5951 6774
decomposed = stl(AU.gas, s.window = "periodic")
##deseason.data = AU.gas - seasonal
##plot(deseason.data, ylab= "Production units", main = "De-Seasonalized Series") 

Run the above code without # to get De-Seasonalized Series plot.

#plot(decompose.ts$trend)
decompgas_gas = decompose(gas_aft1970, type="multiplicative")
plot (decompgas_gas)

logofgas_decompose = stl(logofgas, s.window = "p")
plot(logofgas_decompose, main = "Decomposed")

seasonal <- logofgas_decompose$time.series[,1]
plot(seasonal)

library(forecast)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
gas.seasonal <- exp(seasadj(logofgas_decompose))
autoplot(cbind(gas_aft1970, SeasonallyAdjusted=gas.seasonal))+xlab("Time") +ylab("Gas Production")

A stationary time series is the one, for which mean, variance, and auto-correlation does not change over a period of time. So, to determine a time-series test is stationary or not, generally we do ADF (Augmented Dickey-fuller Test). But, let’s try to inspect visually first. Then, we can go for ADF.

This plot talks about the increase in production of Gas in Australia after 1970. In no way the plot looks like as the stationary one. Still we need a confirmation and to do so we would apply ADF which is basically a hypothesis test, where: The Null Hypothesis is “The given time series is Non-Stationary” Hence, the alternate Hypothesis is “Time series is stationary” > adf.test(diff1) Augmented Dickey-Fuller Test

data: diff1 Dickey-Fuller = -16.915, Lag order = 6, p-value = 0.01 alternative hypothesis: stationary

As, the p-value is less than 5%, we will not reject the null hypothesis. Hence, it is now clear that the time series is non-stationary.

So, we need to convert the timeseries into a stationary one. After that, we will again do an ADF test to verify if it’s stationary or not. Differencing is a method of transforming a time series dataset. It can be used to remove the series dependence on time, so-called temporal dependence. Differencing can help stabilize the mean of the time series by removing changes in the level of a time series, and so eliminating (or reducing) trend and seasonality. As, we can see above from visual inspection, this dataset is not stationary. So, we need to apply differencing technique to obtain a stationary time series.

After, Differencing, it is now obvious that, the plot is stationary after the conversion. Still, we will use the Augmented Dickey Fuller test to confirm. > adf.test(diff1)

Augmented Dickey-Fuller Test

data: diff1 Dickey-Fuller = -19.321, Lag order = 7, p-value = 0.01 alternative hypothesis: stationary

Now, the P-value is less than, 5%, so we will not reject the null hypothesis. Hence, timeseries is stationary. De-seasonalize the series if seasonality is present: To remove the seasonality component, we can go for decomposition, but before that let’s do a log transformation:

Here, we can compare the two graphs keeping side by side:

It means, there were a lot of skewness in the normal plot because of the seasonality and after taking the log of gas with respect of to the time, the skewness in log transformation is reduced a lot.

After decomposition, we can observe that there is definitely a trend along-with a semiannual seasonal component and there also lies residuals or white noise. But we consider the trend component as the most significant component.

##Finding p,d,q

diff1<-diff(gas_aft1970)
plot(diff1, main = "Diffrencing Australian Gas Production")

acf(diff1, lag =  50, main = "Auto Correlation(q)")

pacf(diff1, lag = 50, main = "Partial Auto Correlation(p)")

## 5. ARIMA

library(forecast)
#Splitting train and test

train.ts = window(gas_aft1970, start=1970, end=c(1993,12))
test.ts = window(gas, start=1994)

## Plotting the train and Test set 
autoplot(train.ts, series="Train") +
  autolayer(test.ts, series="Test") +
  ggtitle("Australia gas Traning and Test data") +
  xlab("Year") + ylab("Production") +
  guides(colour=guide_legend(title="Forecast"))

Insights from above plots: • The blue line indicates the confidence intervals and the vertical lines show the strength of correlation at various lags. At the 1st lag (correlation between y(t) & y(t-1)), the correlation is greater than 0.2. At the 2nd lag, the correlation is lesser than 0.2 but still greater than the confidence level. Initial ten lags auto correlation is beyond significance. Negative side is also considered for significance level. • ACF or auto-correlation plot gives the p value to be 2, and same value is given by pacf or partial auto-correlation plot. • Hence, p=10, d= 1 (as 1st order differencing) and q=2 as seen from the above plots to be taken for working on ARIMA model.

##Manual ARIMA
gas.arima.fit <- arima(train.ts, c(10,1,2))
summary(gas.arima.fit)
## 
## Call:
## arima(x = train.ts, order = c(10, 1, 2))
## 
## Coefficients:
##         ar1      ar2      ar3      ar4      ar5      ar6      ar7      ar8
##       0.046  -0.6648  -0.4108  -0.3090  -0.2427  -0.5663  -0.2907  -0.4705
## s.e.  0.082   0.0673   0.0566   0.0637   0.0528   0.0524   0.0673   0.0618
##           ar9     ar10      ma1     ma2
##       -0.4111  -0.3991  -0.3779  0.6970
## s.e.   0.0498   0.0697   0.0808  0.0704
## 
## sigma^2 estimated as 3959303:  log likelihood = -2591,  aic = 5207.99
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
## Training set 523.2352 1986.342 1423.788 2.422547 4.918772 0.5895432 -0.1132808
fcast <- forecast(gas.arima.fit, h=12) ## forecasting for next 12 periods
plot(fcast)

#########################

Insights from above plots: • The blue line indicates the confidence intervals and the vertical lines show the strength of correlation at various lags. At the 1st lag (correlation between y(t) & y(t-1)), the correlation is greater than 0.2. At the 2nd lag, the correlation is lesser than 0.2 but still greater than the confidence level. Initial ten lags auto correlation is beyond significance. Negative side is also considered for significance level. • ACF or auto-correlation plot gives the p value to be 2, and same value is given by pacf or partial auto-correlation plot. • Hence, p=10, d= 1 (as 1st order differencing) and q=2 as seen from the above plots to be taken for working on ARIMA model. #########################

ARIMA model: > summary(gas.arima.fit)

Call: arima(x = train.ts, order = c(10, 1, 2))

Coefficients: ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8 ar9 0.046 -0.6648 -0.4108 -0.3090 -0.2427 -0.5663 -0.2907 -0.4705 -0.4111 s.e. 0.082 0.0673 0.0566 0.0637 0.0528 0.0524 0.0673 0.0618 0.0498 ar10 ma1 ma2 -0.3991 -0.3779 0.6970 s.e. 0.0697 0.0808 0.0704

sigma^2 estimated as 3959303: log likelihood = -2591, aic = 5207.99

Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 523.2352 1986.342 1423.788 2.422547 4.918772 0.5895432 -0.1132808 > fcast <- forecast(gas.arima.fit, h=12) ## forecasting for next 12 periods > plot(fcast)

Box.test(gas.arima.fit$residuals, type = "Ljung-Box", lag = 50)
## 
##  Box-Ljung test
## 
## data:  gas.arima.fit$residuals
## X-squared = 120.94, df = 50, p-value = 8.23e-08
hist(gas.arima.fit$residuals)

Box-Ljung test: This is to check whether the residuals of time series data are stationary or not. • H0: Residuals are stationary • H1: Residuals are not stationary > Box.test(gas.arima.fit$residuals, type = “Ljung-Box”, lag = 50)

Box-Ljung test

data: gas.arima.fit$residuals X-squared = 120.94, df = 50, p-value = 8.23e-08

As, the P value is less than significant level i.e. 0.005, we will not reject the null hypothesis and will say that the model or residuals are now stationary. A histogram will be presentable enough to show that, the residuals are normally distributed:

## Auto Arima
auto.fit = auto.arima(train.ts, trace = F, seasonal = T)
auto.fit
## Series: train.ts 
## ARIMA(2,1,1)(0,1,2)[12] 
## 
## Coefficients:
##          ar1     ar2      ma1     sma1     sma2
##       0.5017  0.2057  -0.9583  -0.4404  -0.1236
## s.e.  0.0738  0.0722   0.0426   0.0676   0.0639
## 
## sigma^2 estimated as 3535010:  log likelihood=-2463.67
## AIC=4939.33   AICc=4939.64   BIC=4961.03
plot(forecast(auto.fit, h=12), ylab = "Gas Production", xlab = "Year")

Box.test(auto.fit$residuals, type = "Ljung-Box", lag = 100)
## 
##  Box-Ljung test
## 
## data:  auto.fit$residuals
## X-squared = 160.28, df = 100, p-value = 0.0001235
hist(auto.fit$residuals)

Series: train.ts ARIMA(2,1,1)(0,1,2)[12]

Coefficients: ar1 ar2 ma1 sma1 sma2 0.5017 0.2057 -0.9583 -0.4404 -0.1236 s.e. 0.0738 0.0722 0.0426 0.0676 0.0639

sigma^2 estimated as 3535010: log likelihood=-2463.67 AIC=4939.33 AICc=4939.64 BIC=4961.03

##Actual vs forecast
VectorErr<- cbind(test.ts, as.data.frame(forecast(gas.arima.fit, h=20))[,1])
ts.plot(VectorErr, col=c("blue", "red"), main="Gas Production: Actual vs Forecast")
legend("bottomright", legend=c("Actual", "Forecast"),col=c("blue", "red"), cex=0.8, lty= 1:1)

##Actual vs Forecast

accuracy(forecast(gas.arima.fit, h=20), test.ts)
##                     ME     RMSE      MAE      MPE      MAPE      MASE
## Training set  523.2352 1986.342 1423.788 2.422547  4.918772 0.5400949
## Test set     4977.1502 6579.285 5641.627 8.785987 10.499831 2.1400757
##                    ACF1 Theil's U
## Training set -0.1132808        NA
## Test set      0.3756065  1.176766
accuracy(forecast(auto.fit, h=20), test.ts)
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set  -79.06125 1820.459 1260.153 -0.4954352 4.077759 0.4780222
## Test set     2637.01567 4082.543 3318.157  4.6201750 6.370687 1.2586983
##                      ACF1 Theil's U
## Training set -0.005838212        NA
## Test set      0.034941965 0.7936968

Accuracy of Manual ARIMA model: > accuracy(forecast(gas.arima.fit, h=20), test.ts) ME RMSE MAE MPE MAPE MASE ACF1 Training set 523.2352 1986.342 1423.788 2.422547 4.918772 0.5400949 -0.1132808 Test set 4977.1502 6579.285 5641.627 8.785987 10.499831 2.1400757 0.3756065 Theil’s U Training set NA Test set 1.176766 Accuracy of Auto-ARIMA model: > accuracy(forecast(auto.fit, h=20), test.ts) ME RMSE MAE MPE MAPE MASE ACF1 Training set -79.06125 1820.459 1260.153 -0.4954352 4.077759 0.4780222 -0.005838212 Test set 2637.01567 4082.543 3318.157 4.6201750 6.370687 1.2586983 0.034941965 Theil’s U Training set NA Test set 0.7936968

ME  RSME    MAE MPE MAPE    MASE    ACF1

Manual ARIMA 523.2352 1986.342 1423.788 2.422547 4.918772 0.5400949 -0.1132808 Auto-ARIMA -79.06125 1820.459 1260.153 -0.49543 4.077759 0.4780222 -0.00583821

Interpretation:

Though both the models are neck to neck as compared with all the factors especially. But we would recommend the manual ARIMA model. Because the MAPE in Manual ARIMA is better than, Auto-ARIMA.

We, generally check two things while checking the accuracy and they are: MAPE: Mean Absolute Percentage Error RMSE: Root Mean Squared Error AIC = 4939.33 in auto-ARIMA, but AIC was 5207.99 in case of manual ARIMA and lower is considered as the better, in this case only Auto-ARIMA is better.

To improve the model, we should drop the long past observations and hence, we took only the post 1970 observations, when a clear trend begins. On the other hand, considering very old data is also questionable and irrational as the technology used must have been updated.