Forecasting any time series can be tricky: there is a multitude of modeling options available, and each of these univariate models has plenty of knobs which can be tweaked to improve forecasts. Luckily R has incredible packages and functions which make the process of guessing, checking, and tweaking a little simpler and quicker.
In this article we will be forecasting US natural gas consumption. Natural gas serves as a great commodity to forecast. Consumption of natural gas is highly seasonal, with massive increases in the winter months, as it is still the primary source of heating for areas of the US where temperatures regularly drop below freezing. The demand for natural gas is fairly inelastic, with elasticity of demand estimated at -0.21 [-0.35, -0.07] in a great piece from the Energy Institute at Haas at UC Berkeley. There is a slight upward trend in natural gas consumption starting in 2010, likely the result of ramped-up shale production in the US driving prices lower for the amply available fuel. Therefore much of the year-to-year variance in natural gas consumption is really a function of weather driving demand. Nonetheless, we will look at some time-series methods which attempt to forecast natural gas consumption based solely on past natural gas consumption.
Thanks to the amazing Rob Hyndman, the fpp2 package is leveraged here, which loads some stalwart packages like ggplot2 and forecast. The source data has been downloaded in .csv format from FRED and shows natural gas consumption in billion cubic feet (bcf) at monthly intervals, not seasonally adjusted. While the fredr package could be utilized, I personally love the FRED website and just enjoy pulling directly from there.
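For anyone who does prefer the API route, a minimal fredr sketch might look like this; note the series ID and placeholder key are my assumptions, not something used in the analysis below.
library(fredr)                           # assumes fredr is installed
fredr_set_key("YOUR_FRED_API_KEY")       # hypothetical placeholder key
gas <- fredr(series_id = "NATURALGAS")   # assumed FRED series ID for this data
head(gas[, c("date", "value")])          # analogous to the CSV's DATE/NATURALGAS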
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✔ ggplot2 3.3.6 ✔ fma 2.4
## ✔ forecast 8.17.0 ✔ expsmooth 2.3
##
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Y <- read.csv("NaturalGas.csv", header = TRUE)
tail(Y)
## DATE NATURALGAS
## 259 2021-07-01 2388.0
## 260 2021-08-01 2410.9
## 261 2021-09-01 2110.2
## 262 2021-10-01 2237.7
## 263 2021-11-01 2660.0
## 264 2021-12-01 2979.7
In order to forecast time-series data using the methods discussed here, the data will have to be converted to R's time-series (ts) format. To make this step a little easier to replicate with other datasets, which can have varying start and end dates, yearA and monthA are defined to give an automated reference for the start argument, and yearZ and monthZ for the end argument. These can be seen inside the call to ts(), along with frequency = 12 since the data is monthly. Only the second column of the data is converted to time-series format, as this is the column containing the variable of interest in FRED files.
The plotted time series reveals incredible visual confirmation of seasonality, with rhythmic spikes in consumption. The peak-to-trough range of a given yearly cycle looks to be around 1,000 bcf with incredible consistency! Evidence of an upward trend is noticeable as well, though to a lesser degree.
yearA <- year(min(Y$DATE))
monthA <- month(min(Y$DATE))
yearZ <- year(max(Y$DATE))
monthZ <- month(max(Y$DATE))
Y.ts <- ts(Y[,2], start = c(yearA, monthA),
end = c(yearZ, monthZ),
frequency = 12)
tail(Y.ts)
## Jul Aug Sep Oct Nov Dec
## 2021 2388.0 2410.9 2110.2 2237.7 2660.0 2979.7
autoplot(Y.ts)+
ggtitle("Natural Gas Consumption")+
ylab("in bcf")
To properly utilize the time-series techniques featured in this article we need data which satisfies stationarity. A stationary time series is one whose properties are not dependent upon when the outcome is observed. Put another way, the variance in the data over time needs to be random, like white noise. However, many time-series datasets contain both seasonal and trend components which make the data non-stationary. As such, the trend and seasonality components need to be controlled for to effectively model and forecast.
While the original plot of natural gas consumption clearly shows this data is non-stationary, there are some tests we can run to get a little more insight. The first is the autocorrelation function (ACF), followed by the partial autocorrelation function (PACF).
The ACF gives the autocorrelation of a series with its lagged values. Simply put, it shows how much the data is related to its own past. The ACF run here reflects all sources of autocorrelation, including trend and seasonality. Notice the rhythmic pattern in the ACF plot, which is not desirable.
The PACF dives a little deeper. The "partial" in its name comes from the fact that it measures the correlation between the series and each lag after removing the effects of the shorter, intervening lags. Correlations here can be helpful in forecasting, but only to a certain degree, as too many significant lags can create multicollinearity concerns. Ideally the spikes in the PACF plot would stay within the blue dashed lines. Note the first lag will almost always fall outside the desired range.
acf(Y.ts)
pacf(Y.ts)
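As a side note, the forecast package loaded by fpp2 also ships ggplot-based versions of these diagnostics; a quick sketch:
ggAcf(Y.ts)        # ggplot analogue of acf()
ggPacf(Y.ts)       # ggplot analogue of pacf()
ggtsdisplay(Y.ts)  # series, ACF, and PACF in a single panel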
Luckily, two of the three methods/functions used here have the ability to control for non-stationarity built into the functions themselves, requiring little effort. Nonetheless, to manually obtain stationary data we can use diff(). While other transformations are possible, like taking logarithms, differencing is simpler, easier to interpret, and typically works well. Differencing is the process of converting the data into the literal differences between consecutive readings: instead of looking at the full nominal amount for each time period, only the change from the prior period is assessed. It should be noted that a second difference can be taken, however this is seldom needed.
Note in the plot of the differenced data how the slight upward trend post 2010 is no longer present.
DY <- diff(Y.ts)
tail(DY)
## Jul Aug Sep Oct Nov Dec
## 2021 173.1 22.9 -300.7 127.5 422.3 319.7
autoplot(DY)+
ggtitle("Natural Gas Consumption Differenced")+
ylab("in bcf")
While no manual manipulation of the data is needed here, since the R functions can handle it, it is interesting to use these two visualizations to see the seasonality of natural gas consumption after differencing.
ggseasonplot(DY)+
ggtitle("Seasonal Plot")
ggsubseriesplot(DY)
The first method used is the seasonal naive (snaive) forecast. This rather basic method serves in this instance as a good baseline against which to compare other models. Essentially, forecasts are predicated on prior time periods from congruent seasons: given the seasonality, December forecasts are made using previous December data, not March data.
fit_snaive <- snaive(DY)
summary(fit_snaive)
##
## Forecast method: Seasonal naive method
##
## Model Information:
## Call: snaive(y = DY)
##
## Residual sd: 152.996
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -1.412749 152.996 110.3331 9.473702 143.1075 1 -0.3093341
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2022 124.6 -71.47225 320.67225 -175.266628 424.4666278
## Feb 2022 -250.6 -446.67225 -54.52775 -550.466628 49.2666278
## Mar 2022 -426.2 -622.27225 -230.12775 -726.066628 -126.3333722
## Apr 2022 -377.1 -573.17225 -181.02775 -676.966628 -77.2333722
## May 2022 -144.7 -340.77225 51.37225 -444.566628 155.1666278
## Jun 2022 121.1 -74.97225 317.17225 -178.766628 420.9666278
## Jul 2022 173.1 -22.97225 369.17225 -126.766628 472.9666278
## Aug 2022 22.9 -173.17225 218.97225 -276.966628 322.7666278
## Sep 2022 -300.7 -496.77225 -104.62775 -600.566628 -0.8333722
## Oct 2022 127.5 -68.57225 323.57225 -172.366628 427.3666278
## Nov 2022 422.3 226.22775 618.37225 122.433372 722.1666278
## Dec 2022 319.7 123.62775 515.77225 19.833372 619.5666278
## Jan 2023 124.6 -152.68803 401.88803 -299.475452 548.6754520
## Feb 2023 -250.6 -527.88803 26.68803 -674.675452 173.4754520
## Mar 2023 -426.2 -703.48803 -148.91197 -850.275452 -2.1245480
## Apr 2023 -377.1 -654.38803 -99.81197 -801.175452 46.9754520
## May 2023 -144.7 -421.98803 132.58803 -568.775452 279.3754520
## Jun 2023 121.1 -156.18803 398.38803 -302.975452 545.1754520
## Jul 2023 173.1 -104.18803 450.38803 -250.975452 597.1754520
## Aug 2023 22.9 -254.38803 300.18803 -401.175452 446.9754520
## Sep 2023 -300.7 -577.98803 -23.41197 -724.775452 123.3754520
## Oct 2023 127.5 -149.78803 404.78803 -296.575452 551.5754520
## Nov 2023 422.3 145.01197 699.58803 -1.775452 846.3754520
## Dec 2023 319.7 42.41197 596.98803 -104.375452 743.7754520
fcast_snaive <- forecast(fit_snaive, h = 24)
plot(fcast_snaive, include=60, showgap = FALSE)
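As an aside, since snaive() handles the seasonal lag itself, it can also be fit on the undifferenced series to produce forecasts on the original bcf scale; a quick sketch for comparison:
fcast_snv_lvl <- snaive(Y.ts, h = 24)               # seasonal naive on the raw series
plot(fcast_snv_lvl, include = 60, showgap = FALSE)  # forecasts in bcf, not differences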
The ETS model derives its name from error, trend, and seasonality, and is the forecasting framework behind the exponential smoothing method. These components of the time series fuel the method's forecasting ability. There are actually eighteen possible iterations of ETS:
With additive errors (A), the possible trend-seasonal combinations are:

| Trend Component | Seasonal: None (N) | Seasonal: Additive (A) | Seasonal: Multiplicative (M) |
|---|---|---|---|
| None (N) | NN | NA | NM |
| Additive (A) | AN | AA | AM |
| Additive damped (Ad) | AdN | AdA | AdM |

With multiplicative errors (M), the same combinations apply:

| Trend Component | Seasonal: None (N) | Seasonal: Additive (A) | Seasonal: Multiplicative (M) |
|---|---|---|---|
| None (N) | NN | NA | NM |
| Additive (A) | AN | AA | AM |
| Additive damped (Ad) | AdN | AdA | AdM |
Fortunately, ets() in R checks which option is best. By default it compares the corrected Akaike Information Criterion (AICc) for each iteration, which can be switched to the Bayesian Information Criterion (BIC) via the ic argument of ets(). In this instance of modeling natural gas consumption it determined ETS(M,N,M) was the best option.
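If the automatic pick ever needs overriding, both the selection criterion and the specification can be set explicitly; a small sketch using arguments of ets():
fit_ets_bic <- ets(Y.ts, ic = "bic")     # select by BIC instead of the default AICc
fit_ets_mnm <- ets(Y.ts, model = "MNM")  # force multiplicative error, no trend,
                                         # multiplicative seasonality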
fit_ets <- ets(Y.ts)
summary(fit_ets)
## ETS(M,N,M)
##
## Call:
## ets(y = Y.ts)
##
## Smoothing parameters:
## alpha = 0.4794
## gamma = 0.1687
##
## Initial states:
## l = 1876.981
## s = 1.2308 0.9464 0.8367 0.7912 0.9021 0.8909
## 0.8288 0.8568 0.9583 1.1533 1.2442 1.3605
##
## sigma: 0.0484
##
## AIC AICc BIC
## 3915.883 3917.818 3969.522
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.726504 106.9282 81.66587 0.06763396 3.796497 0.7272631
## ACF1
## Training set 0.03082804
fcast_ets <- forecast(fit_ets, h = 24)
plot(fcast_ets, include=60, showgap = FALSE)
summary(fcast_ets)
##
## Forecast method: ETS(M,N,M)
##
## Model Information:
## ETS(M,N,M)
##
## Call:
## ets(y = Y.ts)
##
## Smoothing parameters:
## alpha = 0.4794
## gamma = 0.1687
##
## Initial states:
## l = 1876.981
## s = 1.2308 0.9464 0.8367 0.7912 0.9021 0.8909
## 0.8288 0.8568 0.9583 1.1533 1.2442 1.3605
##
## sigma: 0.0484
##
## AIC AICc BIC
## 3915.883 3917.818 3969.522
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.726504 106.9282 81.66587 0.06763396 3.796497 0.7272631
## ACF1
## Training set 0.03082804
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2022 3366.279 3157.664 3574.895 3047.229 3685.330
## Feb 2022 3018.026 2810.564 3225.487 2700.741 3335.311
## Mar 2022 2827.349 2615.571 3039.127 2503.462 3151.236
## Apr 2022 2328.055 2140.416 2515.694 2041.086 2615.025
## May 2022 2164.142 1978.197 2350.087 1879.764 2448.521
## Jun 2022 2187.573 1988.647 2386.498 1883.343 2491.803
## Jul 2022 2406.493 2176.221 2636.765 2054.322 2758.664
## Aug 2022 2372.674 2134.888 2610.461 2009.012 2736.337
## Sep 2022 2120.907 1899.154 2342.659 1781.766 2460.048
## Oct 2022 2215.644 1974.755 2456.533 1847.236 2584.051
## Nov 2022 2537.862 2251.753 2823.971 2100.296 2975.428
## Dec 2022 3049.305 2693.712 3404.899 2505.472 3593.138
## Jan 2023 3366.916 2951.474 3782.358 2731.552 4002.280
## Feb 2023 3018.597 2635.391 3401.802 2432.534 3604.659
## Mar 2023 2827.884 2459.095 3196.673 2263.870 3391.898
## Apr 2023 2328.495 2016.972 2640.018 1852.062 2804.929
## May 2023 2164.552 1867.832 2461.271 1710.758 2618.345
## Jun 2023 2187.986 1881.012 2494.961 1718.509 2657.464
## Jul 2023 2406.948 2061.676 2752.221 1878.900 2934.997
## Aug 2023 2373.123 2025.389 2720.857 1841.309 2904.937
## Sep 2023 2121.308 1804.065 2438.551 1636.127 2606.489
## Oct 2023 2216.063 1878.086 2554.040 1699.172 2732.954
## Nov 2023 2538.342 2143.834 2932.850 1934.994 3141.690
## Dec 2023 3049.882 2567.163 3532.601 2311.626 3788.137
While exponential smoothing looks to harness trend and seasonality, ARIMA relies on autocorrelation in the data. ARIMA stands for autoregressive integrated moving average, as it looks for both autoregressive and moving-average information to potentially use. Similar to ETS modeling, ARIMA comes in multiple varieties based on the number of autoregressive terms (p), the number of nonseasonal differences required for stationarity (d), and the number of lagged forecast errors in the prediction equation (q). Along with this, seasonal ARIMA models carry two sets of (p, d, q) terms, one for the non-seasonal and one for the seasonal part of the model.
The auto.arima() function is incredible and, much like ets(), does a lot of the work for us. The function searches for the best ARIMA model for our data as judged by AICc. In this instance ARIMA(2,0,1)(2,1,2)[12] with drift was selected.
fit_arima <- auto.arima(Y.ts, stepwise = TRUE, approximation = FALSE,
trace = TRUE)
##
## ARIMA(2,0,2)(1,1,1)[12] with drift : 3084.839
## ARIMA(0,0,0)(0,1,0)[12] with drift : 3222.206
## ARIMA(1,0,0)(1,1,0)[12] with drift : 3150.252
## ARIMA(0,0,1)(0,1,1)[12] with drift : 3126.736
## ARIMA(0,0,0)(0,1,0)[12] : 3231.477
## ARIMA(2,0,2)(0,1,1)[12] with drift : 3084.815
## ARIMA(2,0,2)(0,1,0)[12] with drift : 3172.192
## ARIMA(2,0,2)(0,1,2)[12] with drift : Inf
## ARIMA(2,0,2)(1,1,0)[12] with drift : 3149.433
## ARIMA(2,0,2)(1,1,2)[12] with drift : 3082.76
## ARIMA(2,0,2)(2,1,2)[12] with drift : Inf
## ARIMA(2,0,2)(2,1,1)[12] with drift : Inf
## ARIMA(1,0,2)(1,1,2)[12] with drift : 3082.213
## ARIMA(1,0,2)(0,1,2)[12] with drift : 3082.589
## ARIMA(1,0,2)(1,1,1)[12] with drift : 3084.411
## ARIMA(1,0,2)(2,1,2)[12] with drift : 3074.597
## ARIMA(1,0,2)(2,1,1)[12] with drift : 3074.498
## ARIMA(1,0,2)(2,1,0)[12] with drift : 3112.144
## ARIMA(1,0,2)(1,1,0)[12] with drift : 3149.142
## ARIMA(0,0,2)(2,1,1)[12] with drift : 3108.703
## ARIMA(1,0,1)(2,1,1)[12] with drift : 3079.165
## ARIMA(1,0,3)(2,1,1)[12] with drift : 3075.898
## ARIMA(0,0,1)(2,1,1)[12] with drift : 3117.01
## ARIMA(0,0,3)(2,1,1)[12] with drift : 3102.127
## ARIMA(2,0,1)(2,1,1)[12] with drift : 3073.177
## ARIMA(2,0,1)(1,1,1)[12] with drift : 3082.859
## ARIMA(2,0,1)(2,1,0)[12] with drift : 3113.149
## ARIMA(2,0,1)(2,1,2)[12] with drift : 3073.157
## ARIMA(2,0,1)(1,1,2)[12] with drift : 3080.72
## ARIMA(1,0,1)(2,1,2)[12] with drift : 3079.04
## ARIMA(2,0,0)(2,1,2)[12] with drift : 3087.56
## ARIMA(3,0,1)(2,1,2)[12] with drift : Inf
## ARIMA(1,0,0)(2,1,2)[12] with drift : 3092.102
## ARIMA(3,0,0)(2,1,2)[12] with drift : 3080.341
## ARIMA(3,0,2)(2,1,2)[12] with drift : Inf
## ARIMA(2,0,1)(2,1,2)[12] : Inf
##
## Best model: ARIMA(2,0,1)(2,1,2)[12] with drift
summary(fit_arima)
## Series: Y.ts
## ARIMA(2,0,1)(2,1,2)[12] with drift
##
## Coefficients:
## ar1 ar2 ma1 sar1 sar2 sma1 sma2 drift
## 1.3119 -0.3226 -0.8765 0.3919 -0.3338 -1.1617 0.3694 2.6683
## s.e. 0.0879 0.0819 0.0546 0.2061 0.0779 0.2149 0.1939 1.2481
##
## sigma^2 = 10441: log likelihood = -1527.21
## AIC=3072.41 AICc=3073.16 BIC=3104.18
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.118156 98.23301 74.15446 -0.1367927 3.446969 0.6603713
## ACF1
## Training set -0.009873442
fcast_arima <- forecast(fit_arima, h = 24)
plot(fcast_arima, include=40, showgap = FALSE)
summary(fcast_arima)
##
## Forecast method: ARIMA(2,0,1)(2,1,2)[12] with drift
##
## Model Information:
## Series: Y.ts
## ARIMA(2,0,1)(2,1,2)[12] with drift
##
## Coefficients:
## ar1 ar2 ma1 sar1 sar2 sma1 sma2 drift
## 1.3119 -0.3226 -0.8765 0.3919 -0.3338 -1.1617 0.3694 2.6683
## s.e. 0.0879 0.0819 0.0546 0.2061 0.0779 0.2149 0.1939 1.2481
##
## sigma^2 = 10441: log likelihood = -1527.21
## AIC=3072.41 AICc=3073.16 BIC=3104.18
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.118156 98.23301 74.15446 -0.1367927 3.446969 0.6603713
## ACF1
## Training set -0.009873442
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2022 3339.322 3208.373 3470.270 3139.053 3539.590
## Feb 2022 2922.631 2779.809 3065.454 2704.203 3141.060
## Mar 2022 2810.978 2664.492 2957.464 2586.947 3035.009
## Apr 2022 2323.019 2174.529 2471.509 2095.923 2550.115
## May 2022 2181.173 2031.150 2331.197 1951.732 2410.614
## Jun 2022 2210.335 2058.955 2361.716 1978.819 2441.852
## Jul 2022 2415.178 2262.531 2567.826 2181.724 2648.633
## Aug 2022 2420.711 2266.857 2574.565 2185.412 2656.010
## Sep 2022 2210.264 2055.255 2365.274 1973.198 2447.331
## Oct 2022 2301.644 2145.525 2457.763 2062.881 2540.408
## Nov 2022 2702.395 2545.210 2859.580 2462.001 2942.789
## Dec 2022 3125.311 2967.100 3283.523 2883.348 3367.275
## Jan 2023 3437.845 3272.557 3603.134 3185.058 3690.633
## Feb 2023 2981.835 2813.748 3149.922 2724.768 3238.902
## Mar 2023 2913.391 2743.508 3083.274 2653.578 3173.205
## Apr 2023 2377.365 2206.003 2548.727 2115.290 2639.440
## May 2023 2230.378 2057.670 2403.086 1966.243 2494.512
## Jun 2023 2239.185 2065.206 2413.164 1973.108 2505.262
## Jul 2023 2495.370 2320.177 2670.563 2227.436 2763.305
## Aug 2023 2478.223 2301.864 2654.581 2208.506 2747.940
## Sep 2023 2285.761 2108.282 2463.240 2014.330 2557.192
## Oct 2023 2382.846 2204.289 2561.404 2109.767 2655.926
## Nov 2023 2729.372 2549.777 2908.967 2454.706 3004.039
## Dec 2023 3234.573 3053.979 3415.167 2958.379 3510.768
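As an aside, once auto.arima() has identified a specification, the same model can be refit directly with Arima(), which is useful for locking the order in place when rerunning on updated data; a minimal sketch:
fit_manual <- Arima(Y.ts, order = c(2, 0, 1),  # nonseasonal (p, d, q)
                    seasonal = c(2, 1, 2),     # seasonal (P, D, Q)
                    include.drift = TRUE)      # matches the selected model above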
First off, I think it is important to emphasize the limitations of univariate time-series forecasting. Past performance does not indicate future performance. However, these forecasting methods can provide a reasonable range of expectations in the short term.
Comparing the three methods used, ARIMA performed the best. One of the simplest ways of seeing this is to look at the residual standard deviation of each model.
| Model | Residual Standard Deviation |
|---|---|
| ARIMA | 102.18 bcf |
| ETS | 144.22 bcf |
| SNaive | 152.99 bcf |
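A more rigorous comparison would hold out recent observations and score each model on data it never saw; a minimal sketch, with the split point chosen purely for illustration:
train <- window(Y.ts, end = c(2019, 12))    # training sample
test  <- window(Y.ts, start = c(2020, 1))   # holdout sample
h <- length(test)
fc_snv <- snaive(train, h = h)
fc_ets <- forecast(ets(train), h = h)
fc_arm <- forecast(auto.arima(train), h = h)
accuracy(fc_snv, test)["Test set", "RMSE"]  # lower holdout RMSE is better
accuracy(fc_ets, test)["Test set", "RMSE"]
accuracy(fc_arm, test)["Test set", "RMSE"]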
Visually, we can compare the ACF of each model's residuals via checkresiduals(). Notice how the ARIMA residuals stay best inside the blue dashed lines. The spread of the residuals in each model's histogram also shows ARIMA with the tightest spread and strongest centrality. The Ljung-Box tests below tell the same story: only the ARIMA residuals return a p-value (0.1148) consistent with white noise.
checkresiduals(fit_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,1)(2,1,2)[12] with drift
## Q* = 24.174, df = 17, p-value = 0.1148
##
## Model df: 7. Total lags used: 24
checkresiduals(fit_ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,M)
## Q* = 64.057, df = 10, p-value = 6.139e-10
##
## Model df: 14. Total lags used: 24
checkresiduals(fit_snaive)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 121.04, df = 24, p-value = 6.328e-15
##
## Model df: 0. Total lags used: 24