Introduction

Forecasting any time series can be tricky: there is a multitude of models to choose from, and each univariate model has tons of options which can be tweaked to improve forecasts. Luckily, R has incredible packages and functions which make the process of guessing, checking, and tweaking a little simpler and quicker.

In this article we will be forecasting US natural gas consumption. Natural gas serves as a great commodity to forecast. Consumption of natural gas is highly seasonal, with massive increases in the winter months, as it is still the primary source of heating for areas of the US where temperatures regularly drop below freezing. The demand for natural gas is also fairly inelastic, with the elasticity of demand estimated at -0.21 [-0.35, -0.07] in a great piece from the Energy Institute at Haas at UC Berkeley. There is a slight upward trend in natural gas consumption starting in 2010, likely the result of ramped-up shale production in the US driving prices lower for the amply available fuel. Much of the year-to-year variance in natural gas consumption is therefore really a function of weather driving demand. Nonetheless, we will look at some time-series methods which attempt to forecast natural gas consumption based solely on past natural gas consumption.

Loading Packages and Data

Thanks to the amazing Rob Hyndman, the fpp2 package is leveraged, which loads some stalwart packages like ggplot2, forecast, fma, and expsmooth. The source data has been downloaded in .csv format from FRED and shows natural gas consumption in billion cubic feet (bcf) at monthly intervals, not seasonally adjusted. While the fredr package could be utilized, I personally love the FRED website and just enjoy pulling directly from there.
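
For anyone who does prefer pulling programmatically, a minimal fredr sketch might look like the following. This is an assumption-laden sketch: it requires a FRED API key, and the series ID is inferred from the NATURALGAS column name in the downloaded file.

# Hypothetical alternative to the manual CSV download (requires an API key
# from fred.stlouisfed.org; the key string below is a placeholder)
library(fredr)
fredr_set_key("your-fred-api-key")
gas <- fredr(series_id = "NATURALGAS")  # returns a tibble of date/value observations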

library(fpp2)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✔ ggplot2   3.3.6      ✔ fma       2.4   
## ✔ forecast  8.17.0     ✔ expsmooth 2.3
## 
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
Y <- read.csv("NaturalGas.csv", header = TRUE)
tail(Y)
##           DATE NATURALGAS
## 259 2021-07-01     2388.0
## 260 2021-08-01     2410.9
## 261 2021-09-01     2110.2
## 262 2021-10-01     2237.7
## 263 2021-11-01     2660.0
## 264 2021-12-01     2979.7

Formatting and Exploring Data

In order to forecast time-series data using the methods discussed here, the data must first be converted to R's time-series (ts) format. To make this step a little easier to replicate with other datasets, which can have varying start and end dates, yearA and monthA are defined to give an automated reference for the start argument, and yearZ and monthZ for the end argument. These can be seen inside the call to ts(), along with frequency set to 12 since the data is monthly. Only the second column of the data is converted to time-series format, as this is the column containing the variable of interest in FRED files.

The plotted time series reveals incredible visual confirmation of seasonality, with rhythmic spikes in consumption. The peak-to-trough range of a given yearly cycle looks to be around 1,000 bcf with incredible consistency! Evidence of an upward trend is noticeable as well, though to a lesser degree.

yearA <- year(min(Y$DATE))
monthA <- month(min(Y$DATE))
yearZ <- year(max(Y$DATE))
monthZ <- month(max(Y$DATE))

Y.ts <- ts(Y[,2], start = c(yearA, monthA), 
           end = c(yearZ, monthZ), 
           frequency = 12)

tail(Y.ts) 
##         Jul    Aug    Sep    Oct    Nov    Dec
## 2021 2388.0 2410.9 2110.2 2237.7 2660.0 2979.7
autoplot(Y.ts)+
  ggtitle("Natural Gas Consumption")+
  ylab("in bcf")

Stationarity

To properly utilize the time-series techniques featured in this article we need data which satisfies stationarity. A stationary time series is one whose statistical properties do not depend on when the outcome is observed. Put another way, the variance in the data over time needs to be random, like white noise. However, many time-series datasets contain both seasonal and trend components which make the data non-stationary. As such, the trend and seasonality components need to be controlled for in order to model and forecast effectively.

While the original plot of natural gas consumption clearly shows this data is non-stationary, there are some diagnostics we can run to get a little more insight. The first is the auto-correlation function (ACF), followed by the partial auto-correlation function (PACF).

The ACF gives us the auto-correlation of a series with its lagged values. Simply put, the ACF shows how much the data is related to its own past. The ACF run here reflects all sources of auto-correlation, including trend and seasonality. Notice the rhythmic pattern in the ACF plot, which is not desirable.

The PACF dives a little deeper. It examines the correlation between the series and each lag after removing the effects of the shorter lags in between; this is where the partial in PACF's name comes from. Correlations here can be helpful in forecasting, but only to a certain degree, as too many can create some multicollinearity concerns. Ideally the lines on the PACF graphic would all spike up and down while staying within the blue dotted lines. Note the first tick will almost always fall outside the desired range.

acf(Y.ts)

pacf(Y.ts)
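
Beyond these visual diagnostics, a formal unit-root test can be run. Here is a sketch using the KPSS test from the urca package (assuming it is installed; fpp2 does not load it). A test statistic above the critical values rejects stationarity and suggests differencing.

# KPSS test: the null hypothesis is that the series is stationary
library(urca)
summary(ur.kpss(Y.ts))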

Obtaining Stationarity

Luckily, two of the three methods/functions used here have the ability to control for non-stationarity built into the actual functions, requiring little effort. Nonetheless, to obtain stationary data manually we can use diff(). While other transformations such as the logarithm are possible, differencing is simpler, easier to interpret, and typically works well. Differencing converts the data into the literal differences between consecutive readings: instead of the full nominal amount for each time period, only the change from the prior time period is assessed. It should be noted that a second difference is possible, however, this is seldom needed.

Note in the plot of the differenced data how the slight post-2010 upward trend is no longer present.

DY <- diff(Y.ts)
tail(DY)
##         Jul    Aug    Sep    Oct    Nov    Dec
## 2021  173.1   22.9 -300.7  127.5  422.3  319.7
autoplot(DY)+
  ggtitle("Natural Gas Consumption Differenced")+
  ylab("in bcf")

Visualizing Seasonality

While no actual manipulation of the data is needed here, as the R functions can handle the seasonality, it is interesting to use these two visualizations to see the seasonality of natural gas consumption after differencing.

ggseasonplot(DY)+
  ggtitle("Seasonal Plot")

ggsubseriesplot(DY)

Seasonal Naive Model

This rather basic method serves in this instance as a good baseline against which to compare the other models. Essentially, forecasts are predicated on prior time periods from congruent seasons, meaning December forecasts are made using previous December data, not March data, given the seasonality. (A quick check of this mechanic appears after the model output below.)

fit_snaive <- snaive(DY)
summary(fit_snaive)
## 
## Forecast method: Seasonal naive method
## 
## Model Information:
## Call: snaive(y = DY) 
## 
## Residual sd: 152.996 
## 
## Error measures:
##                     ME    RMSE      MAE      MPE     MAPE MASE       ACF1
## Training set -1.412749 152.996 110.3331 9.473702 143.1075    1 -0.3093341
## 
## Forecasts:
##          Point Forecast      Lo 80      Hi 80       Lo 95        Hi 95
## Jan 2022          124.6  -71.47225  320.67225 -175.266628  424.4666278
## Feb 2022         -250.6 -446.67225  -54.52775 -550.466628   49.2666278
## Mar 2022         -426.2 -622.27225 -230.12775 -726.066628 -126.3333722
## Apr 2022         -377.1 -573.17225 -181.02775 -676.966628  -77.2333722
## May 2022         -144.7 -340.77225   51.37225 -444.566628  155.1666278
## Jun 2022          121.1  -74.97225  317.17225 -178.766628  420.9666278
## Jul 2022          173.1  -22.97225  369.17225 -126.766628  472.9666278
## Aug 2022           22.9 -173.17225  218.97225 -276.966628  322.7666278
## Sep 2022         -300.7 -496.77225 -104.62775 -600.566628   -0.8333722
## Oct 2022          127.5  -68.57225  323.57225 -172.366628  427.3666278
## Nov 2022          422.3  226.22775  618.37225  122.433372  722.1666278
## Dec 2022          319.7  123.62775  515.77225   19.833372  619.5666278
## Jan 2023          124.6 -152.68803  401.88803 -299.475452  548.6754520
## Feb 2023         -250.6 -527.88803   26.68803 -674.675452  173.4754520
## Mar 2023         -426.2 -703.48803 -148.91197 -850.275452   -2.1245480
## Apr 2023         -377.1 -654.38803  -99.81197 -801.175452   46.9754520
## May 2023         -144.7 -421.98803  132.58803 -568.775452  279.3754520
## Jun 2023          121.1 -156.18803  398.38803 -302.975452  545.1754520
## Jul 2023          173.1 -104.18803  450.38803 -250.975452  597.1754520
## Aug 2023           22.9 -254.38803  300.18803 -401.175452  446.9754520
## Sep 2023         -300.7 -577.98803  -23.41197 -724.775452  123.3754520
## Oct 2023          127.5 -149.78803  404.78803 -296.575452  551.5754520
## Nov 2023          422.3  145.01197  699.58803   -1.775452  846.3754520
## Dec 2023          319.7   42.41197  596.98803 -104.375452  743.7754520
fcast_snaive <- forecast(fit_snaive, h = 24)
plot(fcast_snaive, include=60, showgap = FALSE)
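
As the quick check promised above: the first twelve point forecasts from the seasonal naive method should simply repeat the final twelve observed differences, month for month. A small sketch:

# Each month's forecast equals the same month's value from the prior year,
# so this comparison should return TRUE
all.equal(as.numeric(head(fcast_snaive$mean, 12)),
          as.numeric(tail(DY, 12)))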

ETS Model

The ETS model derives its name from error, trend, and seasonality, and is actually the forecasting framework behind the exponential smoothing method. These components of the time series fuel the forecasting ability of the method. There are actually eighteen possible iterations of ETS:

Additive Error (A)

  Trend \ Seasonal        None (N)   Additive (A)   Multiplicative (M)
  None (N)                NN         NA             NM
  Additive (A)            AN         AA             AM
  Additive damped (Ad)    AdN        AdA            AdM

Multiplicative Error (M)

  Trend \ Seasonal        None (N)   Additive (A)   Multiplicative (M)
  None (N)                NN         NA             NM
  Additive (A)            AN         AA             AM
  Additive damped (Ad)    AdN        AdA            AdM

Fortunately, ets() in R checks which option is best. By default it compares the corrected Akaike Information Criterion (AICc) for each iteration, which can be switched to the Bayesian Information Criterion via the ic argument of ets(). In this instance of modeling natural gas consumption it determined (M, N, M) was the best option.
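
If you want to override the automatic search, ets() also accepts a three-letter model string and the ic argument mentioned above. A short sketch of both options (the specific choices here are purely illustrative):

fit_ets_bic <- ets(Y.ts, ic = "bic")     # select the ETS variant by BIC instead of AICc
fit_ets_ana <- ets(Y.ts, model = "ANA")  # force additive error, no trend, additive seasonality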

fit_ets <- ets(Y.ts)
summary(fit_ets)
## ETS(M,N,M) 
## 
## Call:
##  ets(y = Y.ts) 
## 
##   Smoothing parameters:
##     alpha = 0.4794 
##     gamma = 0.1687 
## 
##   Initial states:
##     l = 1876.981 
##     s = 1.2308 0.9464 0.8367 0.7912 0.9021 0.8909
##            0.8288 0.8568 0.9583 1.1533 1.2442 1.3605
## 
##   sigma:  0.0484
## 
##      AIC     AICc      BIC 
## 3915.883 3917.818 3969.522 
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.726504 106.9282 81.66587 0.06763396 3.796497 0.7272631
##                    ACF1
## Training set 0.03082804
fcast_ets <- forecast(fit_ets, h = 24)
plot(fcast_ets, include=60, showgap = FALSE)

summary(fcast_ets)
## 
## Forecast method: ETS(M,N,M)
## 
## Model Information:
## ETS(M,N,M) 
## 
## Call:
##  ets(y = Y.ts) 
## 
##   Smoothing parameters:
##     alpha = 0.4794 
##     gamma = 0.1687 
## 
##   Initial states:
##     l = 1876.981 
##     s = 1.2308 0.9464 0.8367 0.7912 0.9021 0.8909
##            0.8288 0.8568 0.9583 1.1533 1.2442 1.3605
## 
##   sigma:  0.0484
## 
##      AIC     AICc      BIC 
## 3915.883 3917.818 3969.522 
## 
## Error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.726504 106.9282 81.66587 0.06763396 3.796497 0.7272631
##                    ACF1
## Training set 0.03082804
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 2022       3366.279 3157.664 3574.895 3047.229 3685.330
## Feb 2022       3018.026 2810.564 3225.487 2700.741 3335.311
## Mar 2022       2827.349 2615.571 3039.127 2503.462 3151.236
## Apr 2022       2328.055 2140.416 2515.694 2041.086 2615.025
## May 2022       2164.142 1978.197 2350.087 1879.764 2448.521
## Jun 2022       2187.573 1988.647 2386.498 1883.343 2491.803
## Jul 2022       2406.493 2176.221 2636.765 2054.322 2758.664
## Aug 2022       2372.674 2134.888 2610.461 2009.012 2736.337
## Sep 2022       2120.907 1899.154 2342.659 1781.766 2460.048
## Oct 2022       2215.644 1974.755 2456.533 1847.236 2584.051
## Nov 2022       2537.862 2251.753 2823.971 2100.296 2975.428
## Dec 2022       3049.305 2693.712 3404.899 2505.472 3593.138
## Jan 2023       3366.916 2951.474 3782.358 2731.552 4002.280
## Feb 2023       3018.597 2635.391 3401.802 2432.534 3604.659
## Mar 2023       2827.884 2459.095 3196.673 2263.870 3391.898
## Apr 2023       2328.495 2016.972 2640.018 1852.062 2804.929
## May 2023       2164.552 1867.832 2461.271 1710.758 2618.345
## Jun 2023       2187.986 1881.012 2494.961 1718.509 2657.464
## Jul 2023       2406.948 2061.676 2752.221 1878.900 2934.997
## Aug 2023       2373.123 2025.389 2720.857 1841.309 2904.937
## Sep 2023       2121.308 1804.065 2438.551 1636.127 2606.489
## Oct 2023       2216.063 1878.086 2554.040 1699.172 2732.954
## Nov 2023       2538.342 2143.834 2932.850 1934.994 3141.690
## Dec 2023       3049.882 2567.163 3532.601 2311.626 3788.137

ARIMA Model

While exponential smoothing looks to harness trend and seasonality, ARIMA relies on the autocorrelation in the data. ARIMA stands for autoregressive integrated moving average, as it looks for both autoregressive and moving average information to potentially use. Similar to ETS modeling, ARIMA comes in multiple varieties based on the number of autoregressive terms (p), the number of nonseasonal differences required for stationarity (d), and the number of lagged forecast errors in the prediction equation (q). Along with this, seasonal ARIMA models can have two iterations of (p, d, q): one for the non-seasonal and one for the seasonal part of the model.

The auto.arima() function is incredible and does a lot of the work for us, similar to the ets() function. The function searches for the best ARIMA model for our data as judged by AICc. In this instance ARIMA(2,0,1)(2,1,2)[12] with drift was selected.
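
Once the search has settled on a specification, the same model can be refit directly with forecast::Arima(), which is convenient when new data arrives and you do not want to re-run the whole search. A sketch using the orders selected below:

# Refit the chosen ARIMA(2,0,1)(2,1,2)[12] with drift without searching again
fit_manual <- Arima(Y.ts, order = c(2, 0, 1),
                    seasonal = c(2, 1, 2),
                    include.drift = TRUE)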

fit_arima <- auto.arima(Y.ts, stepwise = TRUE, approximation = FALSE,
                        trace = TRUE)
## 
##  ARIMA(2,0,2)(1,1,1)[12] with drift         : 3084.839
##  ARIMA(0,0,0)(0,1,0)[12] with drift         : 3222.206
##  ARIMA(1,0,0)(1,1,0)[12] with drift         : 3150.252
##  ARIMA(0,0,1)(0,1,1)[12] with drift         : 3126.736
##  ARIMA(0,0,0)(0,1,0)[12]                    : 3231.477
##  ARIMA(2,0,2)(0,1,1)[12] with drift         : 3084.815
##  ARIMA(2,0,2)(0,1,0)[12] with drift         : 3172.192
##  ARIMA(2,0,2)(0,1,2)[12] with drift         : Inf
##  ARIMA(2,0,2)(1,1,0)[12] with drift         : 3149.433
##  ARIMA(2,0,2)(1,1,2)[12] with drift         : 3082.76
##  ARIMA(2,0,2)(2,1,2)[12] with drift         : Inf
##  ARIMA(2,0,2)(2,1,1)[12] with drift         : Inf
##  ARIMA(1,0,2)(1,1,2)[12] with drift         : 3082.213
##  ARIMA(1,0,2)(0,1,2)[12] with drift         : 3082.589
##  ARIMA(1,0,2)(1,1,1)[12] with drift         : 3084.411
##  ARIMA(1,0,2)(2,1,2)[12] with drift         : 3074.597
##  ARIMA(1,0,2)(2,1,1)[12] with drift         : 3074.498
##  ARIMA(1,0,2)(2,1,0)[12] with drift         : 3112.144
##  ARIMA(1,0,2)(1,1,0)[12] with drift         : 3149.142
##  ARIMA(0,0,2)(2,1,1)[12] with drift         : 3108.703
##  ARIMA(1,0,1)(2,1,1)[12] with drift         : 3079.165
##  ARIMA(1,0,3)(2,1,1)[12] with drift         : 3075.898
##  ARIMA(0,0,1)(2,1,1)[12] with drift         : 3117.01
##  ARIMA(0,0,3)(2,1,1)[12] with drift         : 3102.127
##  ARIMA(2,0,1)(2,1,1)[12] with drift         : 3073.177
##  ARIMA(2,0,1)(1,1,1)[12] with drift         : 3082.859
##  ARIMA(2,0,1)(2,1,0)[12] with drift         : 3113.149
##  ARIMA(2,0,1)(2,1,2)[12] with drift         : 3073.157
##  ARIMA(2,0,1)(1,1,2)[12] with drift         : 3080.72
##  ARIMA(1,0,1)(2,1,2)[12] with drift         : 3079.04
##  ARIMA(2,0,0)(2,1,2)[12] with drift         : 3087.56
##  ARIMA(3,0,1)(2,1,2)[12] with drift         : Inf
##  ARIMA(1,0,0)(2,1,2)[12] with drift         : 3092.102
##  ARIMA(3,0,0)(2,1,2)[12] with drift         : 3080.341
##  ARIMA(3,0,2)(2,1,2)[12] with drift         : Inf
##  ARIMA(2,0,1)(2,1,2)[12]                    : Inf
## 
##  Best model: ARIMA(2,0,1)(2,1,2)[12] with drift
summary(fit_arima)
## Series: Y.ts 
## ARIMA(2,0,1)(2,1,2)[12] with drift 
## 
## Coefficients:
##          ar1      ar2      ma1    sar1     sar2     sma1    sma2   drift
##       1.3119  -0.3226  -0.8765  0.3919  -0.3338  -1.1617  0.3694  2.6683
## s.e.  0.0879   0.0819   0.0546  0.2061   0.0779   0.2149  0.1939  1.2481
## 
## sigma^2 = 10441:  log likelihood = -1527.21
## AIC=3072.41   AICc=3073.16   BIC=3104.18
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.118156 98.23301 74.15446 -0.1367927 3.446969 0.6603713
##                      ACF1
## Training set -0.009873442
fcast_arima <- forecast(fit_arima, h = 24)
plot(fcast_arima, include=40, showgap = FALSE)

summary(fcast_arima)
## 
## Forecast method: ARIMA(2,0,1)(2,1,2)[12] with drift
## 
## Model Information:
## Series: Y.ts 
## ARIMA(2,0,1)(2,1,2)[12] with drift 
## 
## Coefficients:
##          ar1      ar2      ma1    sar1     sar2     sma1    sma2   drift
##       1.3119  -0.3226  -0.8765  0.3919  -0.3338  -1.1617  0.3694  2.6683
## s.e.  0.0879   0.0819   0.0546  0.2061   0.0779   0.2149  0.1939  1.2481
## 
## sigma^2 = 10441:  log likelihood = -1527.21
## AIC=3072.41   AICc=3073.16   BIC=3104.18
## 
## Error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.118156 98.23301 74.15446 -0.1367927 3.446969 0.6603713
##                      ACF1
## Training set -0.009873442
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 2022       3339.322 3208.373 3470.270 3139.053 3539.590
## Feb 2022       2922.631 2779.809 3065.454 2704.203 3141.060
## Mar 2022       2810.978 2664.492 2957.464 2586.947 3035.009
## Apr 2022       2323.019 2174.529 2471.509 2095.923 2550.115
## May 2022       2181.173 2031.150 2331.197 1951.732 2410.614
## Jun 2022       2210.335 2058.955 2361.716 1978.819 2441.852
## Jul 2022       2415.178 2262.531 2567.826 2181.724 2648.633
## Aug 2022       2420.711 2266.857 2574.565 2185.412 2656.010
## Sep 2022       2210.264 2055.255 2365.274 1973.198 2447.331
## Oct 2022       2301.644 2145.525 2457.763 2062.881 2540.408
## Nov 2022       2702.395 2545.210 2859.580 2462.001 2942.789
## Dec 2022       3125.311 2967.100 3283.523 2883.348 3367.275
## Jan 2023       3437.845 3272.557 3603.134 3185.058 3690.633
## Feb 2023       2981.835 2813.748 3149.922 2724.768 3238.902
## Mar 2023       2913.391 2743.508 3083.274 2653.578 3173.205
## Apr 2023       2377.365 2206.003 2548.727 2115.290 2639.440
## May 2023       2230.378 2057.670 2403.086 1966.243 2494.512
## Jun 2023       2239.185 2065.206 2413.164 1973.108 2505.262
## Jul 2023       2495.370 2320.177 2670.563 2227.436 2763.305
## Aug 2023       2478.223 2301.864 2654.581 2208.506 2747.940
## Sep 2023       2285.761 2108.282 2463.240 2014.330 2557.192
## Oct 2023       2382.846 2204.289 2561.404 2109.767 2655.926
## Nov 2023       2729.372 2549.777 2908.967 2454.706 3004.039
## Dec 2023       3234.573 3053.979 3415.167 2958.379 3510.768

Conclusions

First off, I think it is important to emphasize the limitations of univariate time-series forecasting. Past performance does not indicate future performance. However, these forecasting methods can provide a reasonable range of expectations in the short term.

Comparing the three methods used, ARIMA performed the best. One of the simplest ways of seeing this is to look at the residual standard deviation of each model (a quick accuracy() comparison follows the table).

Model    Residual Standard Deviation
ARIMA    102.18 bcf
ETS      144.22 bcf
SNaive   152.99 bcf
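
The in-sample error measures can also be pulled side by side with accuracy(). A quick sketch comparing RMSE across the three fits (note that fit_snaive was fit on the differenced series, so its errors are on the differenced scale):

# One RMSE per model, from each fit's training-set accuracy measures
rbind(SNaive = accuracy(fit_snaive)[, "RMSE"],
      ETS    = accuracy(fit_ets)[, "RMSE"],
      ARIMA  = accuracy(fit_arima)[, "RMSE"])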

Visually, we can look at the ACF of each model's residuals via checkresiduals(). Notice how the ARIMA residual ACF stays best inside of the blue dotted lines. The spread of the residuals in the final histogram of each also shows ARIMA with the tightest spread and highest centrality.

checkresiduals(fit_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,1)(2,1,2)[12] with drift
## Q* = 24.174, df = 17, p-value = 0.1148
## 
## Model df: 7.   Total lags used: 24
checkresiduals(fit_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,M)
## Q* = 64.057, df = 10, p-value = 6.139e-10
## 
## Model df: 14.   Total lags used: 24
checkresiduals(fit_snaive)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 121.04, df = 24, p-value = 6.328e-15
## 
## Model df: 0.   Total lags used: 24