Introduction

The aim of this notebook is to build an ARIMA model and use it to make forecasts.

The data utilized, is a record of natural gas delivered to consumers in the United States. It was gotten from the website of the US Energy Information Administration.The series starts from January 2001 and ends March 2020 and it records contain amounts delivered to individual states and the total for the country.

Total gas delivered to the U.S. will be the focus of this forecast.

Data Exploration

#Read in the data
df <- read.csv("ng_us.csv")
summary(df)
##        Date          U.S.            Alabama          Alaska     
##          :  1   Min.   :1240196   Min.   :18885   Min.   : 3522  
##  Apr-2001:  1   1st Qu.:1558990   1st Qu.:29694   1st Qu.: 5634  
##  Apr-2002:  1   Median :1804380   Median :40270   Median : 7950  
##  Apr-2003:  1   Mean   :1914824   Mean   :41146   Mean   : 7781  
##  Apr-2004:  1   3rd Qu.:2179096   3rd Qu.:52377   3rd Qu.: 9644  
##  Apr-2005:  1   Max.   :3150779   Max.   :69167   Max.   :13523  
##  (Other) :226   NA's   :1         NA's   :2       NA's   :10     
##     Arizona         Arkansas       California        Colorado    
##  Min.   :12164   Min.   :11956   Min.   :131809   Min.   :14192  
##  1st Qu.:21447   1st Qu.:17288   1st Qu.:168320   1st Qu.:22180  
##  Median :25860   Median :20580   Median :184112   Median :29725  
##  Mean   :26927   Mean   :21450   Mean   :185427   Mean   :32309  
##  3rd Qu.:32396   3rd Qu.:25365   3rd Qu.:199681   3rd Qu.:42858  
##  Max.   :47047   Max.   :38506   Max.   :260454   Max.   :62803  
##  NA's   :1       NA's   :2       NA's   :2        NA's   :1      
##   Connecticut       Delaware     the.District.of.Columbia    Florida      
##  Min.   : 7197   Min.   : 2081   Min.   : 844             Min.   : 30338  
##  1st Qu.:12822   1st Qu.: 3826   1st Qu.:1168             1st Qu.: 64259  
##  Median :16375   Median : 5682   Median :1942             Median : 89976  
##  Mean   :17091   Mean   : 5962   Mean   :2577             Mean   : 88148  
##  3rd Qu.:20091   3rd Qu.: 8115   3rd Qu.:3889             3rd Qu.:107879  
##  Max.   :33550   Max.   :11006   Max.   :6343             Max.   :151038  
##  NA's   :1       NA's   :4       NA's   :3                NA's   :7       
##     Georgia          Hawaii          Idaho          Illinois     
##  Min.   :20512   Min.   :196.0   Min.   : 2812   Min.   : 35663  
##  1st Qu.:31415   1st Qu.:224.0   1st Qu.: 4643   1st Qu.: 47268  
##  Median :44041   Median :236.0   Median : 6555   Median : 68010  
##  Mean   :44209   Mean   :237.6   Mean   : 7078   Mean   : 82647  
##  3rd Qu.:54667   3rd Qu.:246.5   3rd Qu.: 8630   3rd Qu.:119318  
##  Max.   :89216   Max.   :309.0   Max.   :16946   Max.   :185421  
##  NA's   :3       NA's   :1       NA's   :2       NA's   :2       
##     Indiana            Iowa           Kansas         Kentucky    
##  Min.   : 24580   Min.   : 9546   Min.   :10552   Min.   : 8890  
##  1st Qu.: 34964   1st Qu.:17158   1st Qu.:14450   1st Qu.:12763  
##  Median : 47750   Median :23136   Median :17476   Median :17689  
##  Mean   : 52012   Mean   :24774   Mean   :19804   Mean   :19254  
##  3rd Qu.: 66582   3rd Qu.:32208   3rd Qu.:25196   3rd Qu.:24331  
##  Max.   :107065   Max.   :50650   Max.   :35696   Max.   :47093  
##  NA's   :1        NA's   :3       NA's   :2       NA's   :1      
##    Louisiana          Maine          Maryland     Massachusetts  
##  Min.   : 72958   Min.   : 2234   Min.   : 7131   Min.   :18034  
##  1st Qu.: 91821   1st Qu.: 4350   1st Qu.:10415   1st Qu.:25712  
##  Median :102652   Median : 5355   Median :14646   Median :30968  
##  Mean   :103850   Mean   : 5546   Mean   :17225   Mean   :34001  
##  3rd Qu.:115313   3rd Qu.: 6404   3rd Qu.:24157   3rd Qu.:42309  
##  Max.   :142020   Max.   :12161   Max.   :36626   Max.   :61282  
##  NA's   :3        NA's   :1       NA's   :2       NA's   :2      
##     Michigan        Minnesota      Mississippi       Missouri    
##  Min.   : 23502   Min.   :12708   Min.   :14228   Min.   : 9788  
##  1st Qu.: 41794   1st Qu.:21232   1st Qu.:24292   1st Qu.:13664  
##  Median : 59847   Median :28287   Median :30972   Median :17302  
##  Mean   : 69548   Mean   :33257   Mean   :32067   Mean   :22826  
##  3rd Qu.: 98784   3rd Qu.:45988   3rd Qu.:39833   3rd Qu.:32795  
##  Max.   :140545   Max.   :70320   Max.   :59851   Max.   :51986  
##  NA's   :1        NA's   :5       NA's   :2       NA's   :1      
##     Montana         Nebraska         Nevada      New.Hampshire 
##  Min.   : 2006   Min.   : 4338   Min.   :11123   Min.   : 888  
##  1st Qu.: 3248   1st Qu.: 9362   1st Qu.:17874   1st Qu.:3848  
##  Median : 4920   Median :11048   Median :20736   Median :4876  
##  Mean   : 5426   Mean   :12333   Mean   :20901   Mean   :4753  
##  3rd Qu.: 7494   3rd Qu.:16024   3rd Qu.:23957   3rd Qu.:5694  
##  Max.   :11654   Max.   :23232   Max.   :32635   Max.   :8702  
##  NA's   :2       NA's   :2       NA's   :5       NA's   :1     
##    New.Jersey       New.Mexico       New.York      North.Carolina 
##  Min.   : 28906   Min.   : 5767   Min.   : 53370   Min.   : 9338  
##  1st Qu.: 38790   1st Qu.: 9727   1st Qu.: 74672   1st Qu.:16400  
##  Median : 48681   Median :11408   Median : 87874   Median :26215  
##  Mean   : 54730   Mean   :12146   Mean   :100219   Mean   :28796  
##  3rd Qu.: 70794   3rd Qu.:14119   3rd Qu.:128012   3rd Qu.:37776  
##  Max.   :102052   Max.   :21885   Max.   :185541   Max.   :65655  
##  NA's   :4        NA's   :3       NA's   :1        NA's   :3      
##   North.Dakota        Ohio           Oklahoma         Oregon     
##  Min.   :  963   Min.   : 27989   Min.   :25036   Min.   : 8018  
##  1st Qu.: 2662   1st Qu.: 44406   1st Qu.:37438   1st Qu.:15246  
##  Median : 4043   Median : 61051   Median :44056   Median :18674  
##  Mean   : 4169   Mean   : 72169   Mean   :44669   Mean   :19220  
##  3rd Qu.: 5341   3rd Qu.:100417   3rd Qu.:51433   3rd Qu.:23223  
##  Max.   :10240   Max.   :155070   Max.   :68150   Max.   :34946  
##  NA's   :1       NA's   :1        NA's   :2       NA's   :1      
##   Pennsylvania     Rhode.Island   South.Carolina   South.Dakota  
##  Min.   : 25208   Min.   : 3594   Min.   : 8087   Min.   : 1058  
##  1st Qu.: 48862   1st Qu.: 6266   1st Qu.:13554   1st Qu.: 3654  
##  Median : 70773   Median : 7102   Median :17682   Median : 4743  
##  Mean   : 71555   Mean   : 7360   Mean   :18246   Mean   : 4898  
##  3rd Qu.: 90600   3rd Qu.: 8382   3rd Qu.:21812   3rd Qu.: 6260  
##  Max.   :156462   Max.   :11811   Max.   :33072   Max.   :10500  
##  NA's   :5        NA's   :1       NA's   :1       NA's   :5      
##    Tennessee         Texas             Utah          Vermont      
##  Min.   : 9922   Min.   :202412   Min.   : 5260   Min.   : 293.0  
##  1st Qu.:14044   1st Qu.:259317   1st Qu.:10214   1st Qu.: 436.5  
##  Median :20124   Median :289057   Median :13046   Median : 688.5  
##  Mean   :22377   Mean   :288742   Mean   :14705   Mean   : 812.8  
##  3rd Qu.:28104   3rd Qu.:314211   3rd Qu.:18616   3rd Qu.:1125.0  
##  Max.   :54567   Max.   :381797   Max.   :31301   Max.   :1997.0  
##  NA's   :3       NA's   :1        NA's   :2       NA's   :2       
##     Virginia       Washington    West.Virginia     Wisconsin    
##  Min.   :10937   Min.   :10256   Min.   : 3155   Min.   :14880  
##  1st Qu.:21099   1st Qu.:17194   1st Qu.: 4899   1st Qu.:21575  
##  Median :28548   Median :22937   Median : 6471   Median :31260  
##  Mean   :31683   Mean   :23389   Mean   : 7440   Mean   :35729  
##  3rd Qu.:40174   3rd Qu.:28757   3rd Qu.:10069   3rd Qu.:49378  
##  Max.   :73947   Max.   :43918   Max.   :15151   Max.   :75087  
##  NA's   :4       NA's   :3       NA's   :3       NA's   :1      
##     Wyoming     
##  Min.   : 3337  
##  1st Qu.: 4673  
##  Median : 5775  
##  Mean   : 6060  
##  3rd Qu.: 7306  
##  Max.   :11139  
##  NA's   :4

The data has missing values, the variable of interest U.S. has one missing value, for the purpose of this notebook it will be dropped. Another alternative would have been to explore the data and then impute the value.

There was a slight steady increase in gas delivered to consumers from 2001 to 2011, there is also a clear seasonal pattern, this could be because of yearly seasonal changes. This can be confirmed with a seasonal plot.

ng.ts <- ts(df, start = c(2001, 1), end = c(2020, 3), frequency = 12)
autoplot(ng.ts[,"U.S."], main = "Natual Gas Delievered To Consumers in the US (MMcf) ", ylab="Volume of Gas", xlab= "Year")

Seasonal plots of the data show that in general higher volumes are delivered in the fall and winter months while there is dip in the summer months that starts in the spring

sub.ng <- ng.ts[,c("U.S.","Texas","New.York", "Louisiana", "California")]
ggseasonplot(sub.ng[,"U.S."], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Delievered Gas For the United States")

The total consumption is compared to 4 chosen states - Texas, New York, Louisiana and California. New York and California appear to have no clear trend while Texas had a clear downward trend that leveled around the year 2008 to 2010 and then subsequently went up. All the data has seasonal pattern except for Louisiana

autoplot(sub.ng, facets=TRUE, main = "Natual Gas Delievered To Consumers in the US and Across Texas, New York, Louisana and California in MMcf", ylab="Volume of Gas", xlab= "Year")

Comparing the seasonal trends, California and Louisiana have the same trend as noted earlier, but Texas has an opposite trend peak delivery happens in the summer months while Louisiana does not appear to have a seasonal trend as was noted earlier.

ggseasonplot(sub.ng[,"U.S."], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Total Natural Gas Delivered")

ggseasonplot(sub.ng[,"Texas"], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Natural Gas Delivered to Texas")

ggseasonplot(sub.ng[,"Louisiana"], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Natural Gas Delivered to Louisiana")
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_text).

ggseasonplot(sub.ng[,"California"], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Natural Gas Delivered to California")
## Warning: Removed 1 row(s) containing missing values (geom_path).

## Warning: Removed 1 rows containing missing values (geom_text).

ggseasonplot(sub.ng[,"New.York"], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Natural Gas Delivered to New.York")

Model Building

For the purpose of this forecast the data will be split into train and test datasets. The model will be built with data from January 2001 to December 2018 and the data from 2019 to 2020 will be test dataset and will be used to assess the accuracy of the model.

train <- window(sub.ng[,"U.S."], start=2001, end=c(2018,12))
test <- window(sub.ng[,"U.S."], start=2019)
summary(train)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 1240196 1542508 1752837 1878520 2158041 3103644

Prior to modelling the Box-Cox transformation will be utilized to stabilize the data because of its variance.

lambda <- BoxCox.lambda(train)
train_log <- BoxCox(train, lambda)

Baseline Model

Next a baseline model is built to compare accuracy of the models developed

base1 <- snaive(train_log,h = 15)
base_model_trnf <-base1$mean %>% InvBoxCox(lambda = lambda)
accuracy(base_model_trnf, test)
##                ME     RMSE      MAE      MPE     MAPE       ACF1 Theil's U
## Test set 57069.33 128992.5 101411.7 2.146516 4.075138 -0.4021055  0.395514
autoplot(test) + autolayer(base_model_trnf, series="Seasonal Naive Forecast", PI = FALSE) + xlab("Year") + 
  ylab("Natual Gas Volume") +
  ggtitle("Forecasts For Monthly Natual Gas Volumes Given to Consumers in the US") +
  guides(color=guide_legend(title="Forecast"))
## Warning: Ignoring unknown parameters: PI

ARIMA model

First step is to difference the data if the data is not stationary. From the ealier charts this data is seasonal with some trend, as such it would need a seasonal and then trend difference. The KPSS test is used to test that hypothesis. Fail to reject the hypothesis that the data is stationary so differencing will not be required

train_log %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 1.2929 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Due to the seasonality in the data, a seasonal arima would be fitted, utilize a auto.arima to create a baseline arima model

model_2 <- auto.arima(train_log)
summary(model_2)
## Series: train_log 
## ARIMA(1,0,2)(0,1,2)[12] with drift 
## 
## Coefficients:
##          ar1      ma1      ma2     sma1     sma2    drift
##       0.8649  -0.3469  -0.1967  -0.6428  -0.2012  33.8337
## s.e.  0.1051   0.1326   0.1155   0.0870   0.0903   6.3130
## 
## sigma^2 estimated as 1636943:  log likelihood=-1752.69
## AIC=3519.39   AICc=3519.96   BIC=3542.61
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -51.5341 1224.962 929.5004 -0.2356597 2.591217 0.6404845
##                     ACF1
## Training set -0.01939876

More models will be built, the model with the lowest AIC will be utilized after residual diagnostics.

model_3 <- Arima(train_log,order=c(1,1,2), seasonal=c(1,1,2))
summary(model_3)
## Series: train_log 
## ARIMA(1,1,2)(1,1,2)[12] 
## 
## Coefficients:
##          ar1      ma1     ma2     sar1     sma1     sma2
##       0.4481  -0.9410  0.0116  -0.3713  -0.3097  -0.4589
## s.e.  0.2686   0.3109  0.2668   0.2206   0.2088   0.1593
## 
## sigma^2 estimated as 1639979:  log likelihood=-1745.12
## AIC=3504.23   AICc=3504.81   BIC=3527.43
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 134.9609 1222.997 928.1018 0.3232208 2.577188 0.6395208
##                     ACF1
## Training set -0.01622972
model_4 <- Arima(train_log,order=c(3,1,1), seasonal=c(2,2,2))
summary(model_4)
## Series: train_log 
## ARIMA(3,1,1)(2,2,2)[12] 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1    sar1     sar2     sma1    sma2
##       0.4327  -0.0484  0.1723  -0.9722  0.1979  -0.2098  -1.8121  0.8986
## s.e.  0.0788   0.0857  0.0796   0.0388  0.2537   0.1633   0.7272  0.7634
## 
## sigma^2 estimated as 1632271:  log likelihood=-1671.65
## AIC=3361.31   AICc=3362.3   BIC=3390.58
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -57.47257 1175.966 869.2471 -0.209414 2.432543 0.5989662
##                     ACF1
## Training set -0.00550892

model_4 has the lowest AIC and so the residuals will be checked before utilizing the model for forecast

Residual Diagnostics

The residual diagnostics for model_4 is ok the, fail to reject the hypothesis for residual independence at alpha = 0.05, thus there should not be issues with serial correlation. Also the residuals appear to be normally distributed

checkresiduals(model_4)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,1)(2,2,2)[12]
## Q* = 25.093, df = 16, p-value = 0.06821
## 
## Model df: 8.   Total lags used: 24

Data Forecast

Model4 is better than the benchmark model because it has a smaller error, MAPE is chosen as the error to utilize. model_4: 2.892746%
benchmark: 4.075138

Backtransform the data and test

forecast_data <- forecast(model_4)
fm <- forecast_data$mean %>% InvBoxCox(lambda = lambda)
accuracy(fm[1:15], test)
##                 ME     RMSE     MAE       MPE     MAPE      ACF1 Theil's U
## Test set -23965.43 97593.98 72871.9 -1.016425 2.892746 -0.421733 0.3002681
accuracy(base_model_trnf, test)
##                ME     RMSE      MAE      MPE     MAPE       ACF1 Theil's U
## Test set 57069.33 128992.5 101411.7 2.146516 4.075138 -0.4021055  0.395514
forecast_data %>% autoplot()

Compare forecasted values to actual values

print("Forecasted Values vs Actual Test Data")
## [1] "Forecasted Values vs Actual Test Data"
print(fm)
##          Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep
## 2019 3229887 2711320 2582711 2122791 1946070 1947106 2176185 2139264 2012460
## 2020 3282544 2703523 2630088 2117729 2005475 2000622 2226208 2173342 2060489
##          Oct     Nov     Dec
## 2019 2087922 2508881 2842786
## 2020 2125084 2539985 2954300
print(test)
##          Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep
## 2019 3150779 2774691 2662590 1987983 1905919 1903468 2182171 2207655 1996237
## 2020 3031489 2796596 2468789                                                
##          Oct     Nov     Dec
## 2019 2096670 2514631 2884386
## 2020