The aim of this notebook is to build an ARIMA model and use it to make forecasts.
The data utilized, is a record of natural gas delivered to consumers in the United States. It was gotten from the website of the US Energy Information Administration.The series starts from January 2001 and ends March 2020 and it records contain amounts delivered to individual states and the total for the country.
Total gas delivered to the U.S. will be the focus of this forecast.
#Read in the data
df <- read.csv("ng_us.csv")
summary(df)
## Date U.S. Alabama Alaska
## : 1 Min. :1240196 Min. :18885 Min. : 3522
## Apr-2001: 1 1st Qu.:1558990 1st Qu.:29694 1st Qu.: 5634
## Apr-2002: 1 Median :1804380 Median :40270 Median : 7950
## Apr-2003: 1 Mean :1914824 Mean :41146 Mean : 7781
## Apr-2004: 1 3rd Qu.:2179096 3rd Qu.:52377 3rd Qu.: 9644
## Apr-2005: 1 Max. :3150779 Max. :69167 Max. :13523
## (Other) :226 NA's :1 NA's :2 NA's :10
## Arizona Arkansas California Colorado
## Min. :12164 Min. :11956 Min. :131809 Min. :14192
## 1st Qu.:21447 1st Qu.:17288 1st Qu.:168320 1st Qu.:22180
## Median :25860 Median :20580 Median :184112 Median :29725
## Mean :26927 Mean :21450 Mean :185427 Mean :32309
## 3rd Qu.:32396 3rd Qu.:25365 3rd Qu.:199681 3rd Qu.:42858
## Max. :47047 Max. :38506 Max. :260454 Max. :62803
## NA's :1 NA's :2 NA's :2 NA's :1
## Connecticut Delaware the.District.of.Columbia Florida
## Min. : 7197 Min. : 2081 Min. : 844 Min. : 30338
## 1st Qu.:12822 1st Qu.: 3826 1st Qu.:1168 1st Qu.: 64259
## Median :16375 Median : 5682 Median :1942 Median : 89976
## Mean :17091 Mean : 5962 Mean :2577 Mean : 88148
## 3rd Qu.:20091 3rd Qu.: 8115 3rd Qu.:3889 3rd Qu.:107879
## Max. :33550 Max. :11006 Max. :6343 Max. :151038
## NA's :1 NA's :4 NA's :3 NA's :7
## Georgia Hawaii Idaho Illinois
## Min. :20512 Min. :196.0 Min. : 2812 Min. : 35663
## 1st Qu.:31415 1st Qu.:224.0 1st Qu.: 4643 1st Qu.: 47268
## Median :44041 Median :236.0 Median : 6555 Median : 68010
## Mean :44209 Mean :237.6 Mean : 7078 Mean : 82647
## 3rd Qu.:54667 3rd Qu.:246.5 3rd Qu.: 8630 3rd Qu.:119318
## Max. :89216 Max. :309.0 Max. :16946 Max. :185421
## NA's :3 NA's :1 NA's :2 NA's :2
## Indiana Iowa Kansas Kentucky
## Min. : 24580 Min. : 9546 Min. :10552 Min. : 8890
## 1st Qu.: 34964 1st Qu.:17158 1st Qu.:14450 1st Qu.:12763
## Median : 47750 Median :23136 Median :17476 Median :17689
## Mean : 52012 Mean :24774 Mean :19804 Mean :19254
## 3rd Qu.: 66582 3rd Qu.:32208 3rd Qu.:25196 3rd Qu.:24331
## Max. :107065 Max. :50650 Max. :35696 Max. :47093
## NA's :1 NA's :3 NA's :2 NA's :1
## Louisiana Maine Maryland Massachusetts
## Min. : 72958 Min. : 2234 Min. : 7131 Min. :18034
## 1st Qu.: 91821 1st Qu.: 4350 1st Qu.:10415 1st Qu.:25712
## Median :102652 Median : 5355 Median :14646 Median :30968
## Mean :103850 Mean : 5546 Mean :17225 Mean :34001
## 3rd Qu.:115313 3rd Qu.: 6404 3rd Qu.:24157 3rd Qu.:42309
## Max. :142020 Max. :12161 Max. :36626 Max. :61282
## NA's :3 NA's :1 NA's :2 NA's :2
## Michigan Minnesota Mississippi Missouri
## Min. : 23502 Min. :12708 Min. :14228 Min. : 9788
## 1st Qu.: 41794 1st Qu.:21232 1st Qu.:24292 1st Qu.:13664
## Median : 59847 Median :28287 Median :30972 Median :17302
## Mean : 69548 Mean :33257 Mean :32067 Mean :22826
## 3rd Qu.: 98784 3rd Qu.:45988 3rd Qu.:39833 3rd Qu.:32795
## Max. :140545 Max. :70320 Max. :59851 Max. :51986
## NA's :1 NA's :5 NA's :2 NA's :1
## Montana Nebraska Nevada New.Hampshire
## Min. : 2006 Min. : 4338 Min. :11123 Min. : 888
## 1st Qu.: 3248 1st Qu.: 9362 1st Qu.:17874 1st Qu.:3848
## Median : 4920 Median :11048 Median :20736 Median :4876
## Mean : 5426 Mean :12333 Mean :20901 Mean :4753
## 3rd Qu.: 7494 3rd Qu.:16024 3rd Qu.:23957 3rd Qu.:5694
## Max. :11654 Max. :23232 Max. :32635 Max. :8702
## NA's :2 NA's :2 NA's :5 NA's :1
## New.Jersey New.Mexico New.York North.Carolina
## Min. : 28906 Min. : 5767 Min. : 53370 Min. : 9338
## 1st Qu.: 38790 1st Qu.: 9727 1st Qu.: 74672 1st Qu.:16400
## Median : 48681 Median :11408 Median : 87874 Median :26215
## Mean : 54730 Mean :12146 Mean :100219 Mean :28796
## 3rd Qu.: 70794 3rd Qu.:14119 3rd Qu.:128012 3rd Qu.:37776
## Max. :102052 Max. :21885 Max. :185541 Max. :65655
## NA's :4 NA's :3 NA's :1 NA's :3
## North.Dakota Ohio Oklahoma Oregon
## Min. : 963 Min. : 27989 Min. :25036 Min. : 8018
## 1st Qu.: 2662 1st Qu.: 44406 1st Qu.:37438 1st Qu.:15246
## Median : 4043 Median : 61051 Median :44056 Median :18674
## Mean : 4169 Mean : 72169 Mean :44669 Mean :19220
## 3rd Qu.: 5341 3rd Qu.:100417 3rd Qu.:51433 3rd Qu.:23223
## Max. :10240 Max. :155070 Max. :68150 Max. :34946
## NA's :1 NA's :1 NA's :2 NA's :1
## Pennsylvania Rhode.Island South.Carolina South.Dakota
## Min. : 25208 Min. : 3594 Min. : 8087 Min. : 1058
## 1st Qu.: 48862 1st Qu.: 6266 1st Qu.:13554 1st Qu.: 3654
## Median : 70773 Median : 7102 Median :17682 Median : 4743
## Mean : 71555 Mean : 7360 Mean :18246 Mean : 4898
## 3rd Qu.: 90600 3rd Qu.: 8382 3rd Qu.:21812 3rd Qu.: 6260
## Max. :156462 Max. :11811 Max. :33072 Max. :10500
## NA's :5 NA's :1 NA's :1 NA's :5
## Tennessee Texas Utah Vermont
## Min. : 9922 Min. :202412 Min. : 5260 Min. : 293.0
## 1st Qu.:14044 1st Qu.:259317 1st Qu.:10214 1st Qu.: 436.5
## Median :20124 Median :289057 Median :13046 Median : 688.5
## Mean :22377 Mean :288742 Mean :14705 Mean : 812.8
## 3rd Qu.:28104 3rd Qu.:314211 3rd Qu.:18616 3rd Qu.:1125.0
## Max. :54567 Max. :381797 Max. :31301 Max. :1997.0
## NA's :3 NA's :1 NA's :2 NA's :2
## Virginia Washington West.Virginia Wisconsin
## Min. :10937 Min. :10256 Min. : 3155 Min. :14880
## 1st Qu.:21099 1st Qu.:17194 1st Qu.: 4899 1st Qu.:21575
## Median :28548 Median :22937 Median : 6471 Median :31260
## Mean :31683 Mean :23389 Mean : 7440 Mean :35729
## 3rd Qu.:40174 3rd Qu.:28757 3rd Qu.:10069 3rd Qu.:49378
## Max. :73947 Max. :43918 Max. :15151 Max. :75087
## NA's :4 NA's :3 NA's :3 NA's :1
## Wyoming
## Min. : 3337
## 1st Qu.: 4673
## Median : 5775
## Mean : 6060
## 3rd Qu.: 7306
## Max. :11139
## NA's :4
The data has missing values, the variable of interest U.S. has one missing value, for the purpose of this notebook it will be dropped. Another alternative would have been to explore the data and then impute the value.
There was a slight steady increase in gas delivered to consumers from 2001 to 2011, there is also a clear seasonal pattern, this could be because of yearly seasonal changes. This can be confirmed with a seasonal plot.
ng.ts <- ts(df, start = c(2001, 1), end = c(2020, 3), frequency = 12)
autoplot(ng.ts[,"U.S."], main = "Natual Gas Delievered To Consumers in the US (MMcf) ", ylab="Volume of Gas", xlab= "Year")
Seasonal plots of the data show that in general higher volumes are delivered in the fall and winter months while there is dip in the summer months that starts in the spring
sub.ng <- ng.ts[,c("U.S.","Texas","New.York", "Louisiana", "California")]
ggseasonplot(sub.ng[,"U.S."], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Delievered Gas For the United States")
The total consumption is compared to 4 chosen states - Texas, New York, Louisiana and California. New York and California appear to have no clear trend while Texas had a clear downward trend that leveled around the year 2008 to 2010 and then subsequently went up. All the data has seasonal pattern except for Louisiana
autoplot(sub.ng, facets=TRUE, main = "Natual Gas Delievered To Consumers in the US and Across Texas, New York, Louisana and California in MMcf", ylab="Volume of Gas", xlab= "Year")
Comparing the seasonal trends, California and Louisiana have the same trend as noted earlier, but Texas has an opposite trend peak delivery happens in the summer months while Louisiana does not appear to have a seasonal trend as was noted earlier.
ggseasonplot(sub.ng[,"U.S."], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Total Natural Gas Delivered")
ggseasonplot(sub.ng[,"Texas"], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Natural Gas Delivered to Texas")
ggseasonplot(sub.ng[,"Louisiana"], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Natural Gas Delivered to Louisiana")
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_text).
ggseasonplot(sub.ng[,"California"], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Natural Gas Delivered to California")
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_text).
ggseasonplot(sub.ng[,"New.York"], year.labels=TRUE) + ylab("Volume of Gas") + ggtitle("Seasonal Plots of Natural Gas Delivered to New.York")
For the purpose of this forecast the data will be split into train and test datasets. The model will be built with data from January 2001 to December 2018 and the data from 2019 to 2020 will be test dataset and will be used to assess the accuracy of the model.
train <- window(sub.ng[,"U.S."], start=2001, end=c(2018,12))
test <- window(sub.ng[,"U.S."], start=2019)
summary(train)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1240196 1542508 1752837 1878520 2158041 3103644
Prior to modelling the Box-Cox transformation will be utilized to stabilize the data because of its variance.
lambda <- BoxCox.lambda(train)
train_log <- BoxCox(train, lambda)
Next a baseline model is built to compare accuracy of the models developed
base1 <- snaive(train_log,h = 15)
base_model_trnf <-base1$mean %>% InvBoxCox(lambda = lambda)
accuracy(base_model_trnf, test)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 57069.33 128992.5 101411.7 2.146516 4.075138 -0.4021055 0.395514
autoplot(test) + autolayer(base_model_trnf, series="Seasonal Naive Forecast", PI = FALSE) + xlab("Year") +
ylab("Natual Gas Volume") +
ggtitle("Forecasts For Monthly Natual Gas Volumes Given to Consumers in the US") +
guides(color=guide_legend(title="Forecast"))
## Warning: Ignoring unknown parameters: PI
First step is to difference the data if the data is not stationary. From the ealier charts this data is seasonal with some trend, as such it would need a seasonal and then trend difference. The KPSS test is used to test that hypothesis. Fail to reject the hypothesis that the data is stationary so differencing will not be required
train_log %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 1.2929
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Due to the seasonality in the data, a seasonal arima would be fitted, utilize a auto.arima to create a baseline arima model
model_2 <- auto.arima(train_log)
summary(model_2)
## Series: train_log
## ARIMA(1,0,2)(0,1,2)[12] with drift
##
## Coefficients:
## ar1 ma1 ma2 sma1 sma2 drift
## 0.8649 -0.3469 -0.1967 -0.6428 -0.2012 33.8337
## s.e. 0.1051 0.1326 0.1155 0.0870 0.0903 6.3130
##
## sigma^2 estimated as 1636943: log likelihood=-1752.69
## AIC=3519.39 AICc=3519.96 BIC=3542.61
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -51.5341 1224.962 929.5004 -0.2356597 2.591217 0.6404845
## ACF1
## Training set -0.01939876
More models will be built, the model with the lowest AIC will be utilized after residual diagnostics.
model_3 <- Arima(train_log,order=c(1,1,2), seasonal=c(1,1,2))
summary(model_3)
## Series: train_log
## ARIMA(1,1,2)(1,1,2)[12]
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1 sma2
## 0.4481 -0.9410 0.0116 -0.3713 -0.3097 -0.4589
## s.e. 0.2686 0.3109 0.2668 0.2206 0.2088 0.1593
##
## sigma^2 estimated as 1639979: log likelihood=-1745.12
## AIC=3504.23 AICc=3504.81 BIC=3527.43
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 134.9609 1222.997 928.1018 0.3232208 2.577188 0.6395208
## ACF1
## Training set -0.01622972
model_4 <- Arima(train_log,order=c(3,1,1), seasonal=c(2,2,2))
summary(model_4)
## Series: train_log
## ARIMA(3,1,1)(2,2,2)[12]
##
## Coefficients:
## ar1 ar2 ar3 ma1 sar1 sar2 sma1 sma2
## 0.4327 -0.0484 0.1723 -0.9722 0.1979 -0.2098 -1.8121 0.8986
## s.e. 0.0788 0.0857 0.0796 0.0388 0.2537 0.1633 0.7272 0.7634
##
## sigma^2 estimated as 1632271: log likelihood=-1671.65
## AIC=3361.31 AICc=3362.3 BIC=3390.58
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -57.47257 1175.966 869.2471 -0.209414 2.432543 0.5989662
## ACF1
## Training set -0.00550892
model_4 has the lowest AIC and so the residuals will be checked before utilizing the model for forecast
The residual diagnostics for model_4 is ok the, fail to reject the hypothesis for residual independence at alpha = 0.05, thus there should not be issues with serial correlation. Also the residuals appear to be normally distributed
checkresiduals(model_4)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,1)(2,2,2)[12]
## Q* = 25.093, df = 16, p-value = 0.06821
##
## Model df: 8. Total lags used: 24
Model4 is better than the benchmark model because it has a smaller error, MAPE is chosen as the error to utilize. model_4: 2.892746%
benchmark: 4.075138
Backtransform the data and test
forecast_data <- forecast(model_4)
fm <- forecast_data$mean %>% InvBoxCox(lambda = lambda)
accuracy(fm[1:15], test)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -23965.43 97593.98 72871.9 -1.016425 2.892746 -0.421733 0.3002681
accuracy(base_model_trnf, test)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 57069.33 128992.5 101411.7 2.146516 4.075138 -0.4021055 0.395514
forecast_data %>% autoplot()
Compare forecasted values to actual values
print("Forecasted Values vs Actual Test Data")
## [1] "Forecasted Values vs Actual Test Data"
print(fm)
## Jan Feb Mar Apr May Jun Jul Aug Sep
## 2019 3229887 2711320 2582711 2122791 1946070 1947106 2176185 2139264 2012460
## 2020 3282544 2703523 2630088 2117729 2005475 2000622 2226208 2173342 2060489
## Oct Nov Dec
## 2019 2087922 2508881 2842786
## 2020 2125084 2539985 2954300
print(test)
## Jan Feb Mar Apr May Jun Jul Aug Sep
## 2019 3150779 2774691 2662590 1987983 1905919 1903468 2182171 2207655 1996237
## 2020 3031489 2796596 2468789
## Oct Nov Dec
## 2019 2096670 2514631 2884386
## 2020