Following the common theme of energy data, I decided to use the historical Henry Hub natural gas closing price. The data spans nearly a decade, from 4/26/2012 to 4/18/2022.
Henry Hub is the benchmark index for natural gas prices in the United States: when local markets price their natural gas, they tend to do so as a differential to Henry Hub.
# Clear the workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
cat("\f") # Clear the console
library("feasts")
library("seasonal")
library("tsibble")
library("tsibbledata")
library("dplyr")
library("ggplot2")
library("forecast")
library("fable")
library("weathermetrics")
library("seasonalview")
library("tibbletime")
library("fpp3")
library("fma")
library("fpp")
library("fpp2")
library("grid")
library("gridExtra")
library("lubridate")
library("seasonal")
setwd("/Users/spoll/OneDrive/Documents/Boston College/Predictive Analytics_Forecasting/Week 7")
# Load the datasets
raw_energydata <- read.csv("NYMEX Close.csv"
, check.names = FALSE
, stringsAsFactors = FALSE
, na.strings = ""
)
raw_energydata <- data.frame(raw_energydata)
class(raw_energydata$Date)
# rename column
names(raw_energydata)[1] <- 'Date'
class(raw_energydata$Date)
raw_energydata$Date <- mdy(raw_energydata$Date)
names(raw_energydata)[2] <- 'Close'
Looking at the totals for closing price by month, the results make sense: the most expensive months are the energy-intensive ones. A lot of heating is still gas-powered, so the winter months see higher demand. February sits at the bottom of the list, though that is partly an artifact of summing daily closes, since February has fewer trading days than other months (a quick check of monthly averages is sketched after the output below). There does seem to be some seasonality in the raw time series, but I don't see an obvious trend.
# Create test and train data frames
hh_train <- head(raw_energydata, nrow(raw_energydata)-6) %>%
dplyr::select("Date", "Close")
hh_test <- tail(raw_energydata, 6) %>%
dplyr::select("Date", "Close")
bymonth <- aggregate(Close ~ month(Date), data = hh_train, FUN = sum)
bymonth %>% arrange(., -Close)
## month(Date) Close
## 1 10 740.712
## 2 11 718.761
## 3 12 693.189
## 4 3 671.456
## 5 1 663.796
## 6 8 661.538
## 7 9 660.241
## 8 7 649.134
## 9 5 640.663
## 10 4 636.539
## 11 6 627.294
## 12 2 595.295
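As a quick follow-up (a sketch, not part of the original analysis), averaging the closing price per month removes the effect of unequal trading-day counts, which the raw sums above do not account for.
# Sketch: average closing price per calendar month (bymonth_avg is a new,
# illustrative object); a fairer comparison than summing across unequal day counts
bymonth_avg <- aggregate(Close ~ month(Date), data = hh_train, FUN = mean)
bymonth_avg %>% arrange(., -Close)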
ggplot(data = hh_train,
mapping = aes(x = Date,
y = Close)) +
geom_line() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title="Historical Henry Hub Closing Price")
The decomposition shows strong seasonality but no obvious trend, except at the tail end of the time series.
# Create Time Series
hh_ts <- ts(hh_train$Close, frequency = 12)
hh_test_ts <- ts(hh_test$Close, frequency = 12)
plot(hh_ts)
plot(decompose(hh_ts))
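As a hedged alternative to the classical decomposition (a sketch under the same frequency-12 setup, not part of the original analysis), an STL decomposition with robust fitting gives a smoother trend estimate that is less distorted by the spike at the end of the series.
# Sketch: STL decomposition of the same ts object; robust = TRUE downweights
# the extreme observations near the end of the series
plot(stl(hh_ts, s.window = "periodic", robust = TRUE))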
The one caveat with the Henry Hub data is that it is a market-based index, so there are no closing prices on Saturdays or Sundays because the market is closed. Since tsibble() needs a regular, gap-free time series, I used tsibble::fill_gaps() to insert the missing dates. These dates have no closing price and default to NA. The ACF plots do not handle the NAs properly, so I filled them with the mean closing price of $3.13. We see the autocorrelations never reach zero, even after differencing. The undifferenced ACF plot shows strong seasonality, but I wonder whether that is partly an artifact of the default values I inserted for Saturdays and Sundays. Once differenced, the seasonality becomes less apparent and the plots show some sign of decay. Looking at the ACF and PACF plots, I figured I could try transforming the closing price to improve the differencing; I took log(Close), but it did not improve much. Roughly speaking, the exponentially decaying, somewhat sinusoidal ACF combined with a PACF that spikes at lag 6 and then drops off sharply may suggest an ARIMA(p,d,0) model.
# Test for Stationarity
hh_tsibble <- tsibble(hh_train, index = Date)
hh_tsibble <- tsibble::fill_gaps(hh_tsibble)
# Replace NAs (the weekend rows added by fill_gaps) with the rounded training-set mean;
# only Close actually contains NAs after filling the gaps
for (i in 1:ncol(hh_tsibble)) {
  hh_tsibble[, i][is.na(hh_tsibble[, i])] <- round(mean(hh_train[, i], na.rm = TRUE), 2)
}
gg_tsdisplay(hh_tsibble, y = Close)
hh_tsibble %>% ACF(Close) %>%
autoplot() + labs(subtitle = "Henry Hub Closing Price")
hh_tsibble %>% PACF(Close) %>%
autoplot() + labs(subtitle = "Henry Hub Closing Price")
hh_tsibble %>% ACF(difference(Close)) %>%
autoplot() + labs(subtitle = "Henry Hub Closing Price")
hh_tsibble %>% PACF(difference(Close)) %>%
autoplot() + labs(subtitle = "Henry Hub Closing Price")
hh_log <- hh_tsibble
hh_log$Close <- log(hh_tsibble$Close)
hh_log_ts <- tsibble(hh_log, index = Date)
hh_log_ts %>% ACF(Close) %>%
autoplot() + labs(subtitle = "Henry Hub Closing Price")
hh_log_ts %>% PACF(Close) %>%
autoplot() + labs(subtitle = "Henry Hub Closing Price")
hh_log_ts %>% ACF(difference(Close)) %>%
autoplot() + labs(subtitle = "Henry Hub Closing Price")
hh_log_ts %>% PACF(difference(Close)) %>%
autoplot() + labs(subtitle = "Henry Hub Closing Price")
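To test whether the strong seasonal pattern in the ACF is an artifact of the constant weekend fill, a hedged alternative (a sketch, assuming tidyr is available; hh_locf is a new, illustrative object) is to carry the last weekday close forward into the weekend gaps and recompute the differenced ACF.
# Sketch: fill weekend gaps by carrying the previous close forward (LOCF)
# instead of inserting the overall mean, then recheck the differenced ACF
library(tidyr)
hh_locf <- tsibble(hh_train, index = Date) %>%
  tsibble::fill_gaps() %>%
  tidyr::fill(Close, .direction = "down")
hh_locf %>% ACF(difference(Close)) %>%
  autoplot() + labs(subtitle = "Henry Hub Closing Price (weekends filled by LOCF)")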
The next step was to look at the Ljung-Box and Box-Pierce tests. Both diagnostics return p-values below 0.05, so we reject the null hypothesis of no autocorrelation: the closing price series, and even its first difference, is significantly autocorrelated rather than white noise.
# Ljung Box Test
hh_tsibble %>% mutate(diff_close = difference(Close)) %>%
features(diff_close, ljung_box, lag = 10)
## # A tibble: 1 x 2
## lb_stat lb_pvalue
## <dbl> <dbl>
## 1 4286. 0
hh_tsibble %>%
features(Close, ljung_box, lag = 10)
## # A tibble: 1 x 2
## lb_stat lb_pvalue
## <dbl> <dbl>
## 1 16221. 0
hh_tsibble %>% mutate(diff_close = difference(Close)) %>%
features(diff_close, box_pierce, lag = 10)
## # A tibble: 1 x 2
## bp_stat bp_pvalue
## <dbl> <dbl>
## 1 4276. 0
The KPSS unit root test agrees with the ACF and PACF plots: the test statistic is large and the reported p-value is 0.01, so the level series is non-stationary and needs differencing.
# Unit Root Test
hh_tsibble %>% features(Close, unitroot_kpss)
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 3.51 0.01
hh_tsibble %>% gg_tsdisplay(difference(Close), plot_type = 'partial')
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
The unitroot_ndiffs feature shows that a single order of differencing should suffice.
hh_tsibble %>% features(Close, unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 1
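As a companion check (a sketch, not in the original output), the KPSS test on the once-differenced series should no longer reject stationarity if a single difference is enough.
# Sketch: KPSS on the first-differenced closing price (dropping the leading NA)
hh_tsibble %>%
  mutate(diff_close = difference(Close)) %>%
  filter(!is.na(diff_close)) %>%
  features(diff_close, unitroot_kpss)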
BUILD MODELS
Holt-Winters’: The first model built is the Holt-Winters additive and multiplicative methods. The additive method beats the multiplicative method by every statistic except for ME and MPE. With RMSE holding the most weight, the additive method is the optimal method of the two.
hh_hw_add <- hw(hh_ts, seasonal = "additive")
hh_hw_mult <- hw(hh_ts, seasonal = "multiplicative")
hh_hw_plot <- autoplot(hh_ts,
main = "Henry Hub Closing Price Holt-Winters'") +
autolayer(fitted(hh_hw_add), series = "Fitted Additive") +
autolayer(fitted(hh_hw_mult), series = "Fitted Multiplicative") +
ylab("Henry Hub Closing Price") +
theme(plot.title = element_text(hjust = 0.5))
accuracy(hh_hw_add)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.003627123 0.1194982 0.07397497 0.03442373 2.309168 0.3168859
## ACF1
## Training set 0.03255199
accuracy(hh_hw_mult)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.00108205 0.1198045 0.07441295 -0.09692943 2.322041 0.3187621
## ACF1
## Training set 0.09361968
hh_hw_plot
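Since the comparison above uses training-set accuracy only, a hedged out-of-sample check (a sketch, not part of the original comparison) is to re-fit both methods with a 6-step horizon and score them against the 6 held-out closing prices.
# Sketch: 6-step-ahead forecasts scored against the held-out test closes
hh_hw_add_6 <- hw(hh_ts, seasonal = "additive", h = 6)
hh_hw_mult_6 <- hw(hh_ts, seasonal = "multiplicative", h = 6)
accuracy(hh_hw_add_6, hh_test$Close)
accuracy(hh_hw_mult_6, hh_test$Close)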
Holt-Winters’ Damped: Interesting outcome here where the multiplicative damped actually outperformed the additive damped method. The dampening of the models seems to benefit the multiplicative method more favorably.
hh_hwd_add <- hw(hh_ts, seasonal = "additive", damped = T)
hh_hwd_mult <- hw(hh_ts, seasonal = "multiplicative", damped = T)
hh_hwd_plot <- autoplot(hh_ts,
main = "Henry Hub Closing Price Holt-Winters' Damped") +
autolayer(fitted(hh_hwd_add), series = "Fitted Additive Damped") +
autolayer(fitted(hh_hwd_mult), series = "Fitted Multiplicative Damped") +
ylab("Henry Hub Closing Price") +
theme(plot.title = element_text(hjust = 0.5))
accuracy(hh_hwd_add)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002130839 0.1222695 0.07672619 -0.0100341 2.396479 0.3286712
## ACF1
## Training set 0.104823
accuracy(hh_hwd_mult)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002093357 0.1220342 0.07555259 -0.01445787 2.351806 0.3236439
## ACF1
## Training set 0.03770432
hh_hwd_plot
ARIMA: auto.arima() selected an ARIMA(3,1,1) model. Its training RMSE is the best of the models so far, and the residuals are tightly distributed around zero, although the residual Ljung-Box test still rejects white noise (p ≈ 0.0003), so some autocorrelation remains.
hh_arima <- auto.arima(hh_ts)
hh_arima
## Series: hh_ts
## ARIMA(3,1,1)
##
## Coefficients:
## ar1 ar2 ar3 ma1
## -1.0131 -0.1338 -0.0523 0.8875
## s.e. 0.0414 0.0286 0.0208 0.0367
##
## sigma^2 = 0.01411: log likelihood = 1812.9
## AIC=-3615.79 AICc=-3615.77 BIC=-3586.58
checkresiduals(hh_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,1)
## Q* = 49.508, df = 20, p-value = 0.0002604
##
## Model df: 4. Total lags used: 24
hh_arima_fc <- forecast(hh_arima, h = 12)
accuracy(hh_arima_fc)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002600981 0.1186609 0.07312599 0.000738404 2.277189 0.3132491
## ACF1
## Training set 0.0005399664
hh_arima_fc_plot <- autoplot(hh_arima_fc
, main = "Henry Hub Closing Price Forecast using ARIMA(3,1,1)") +
ylab("Henry Hub Closing Price")
hh_arima_fc_plot
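Because the residual Ljung-Box test above still rejects white noise for the ARIMA(3,1,1) fit, one hedged follow-up (a sketch, considerably slower and not part of the original run) is to widen auto.arima's search by disabling the stepwise shortcut and the likelihood approximation.
# Sketch: exhaustive (non-stepwise) ARIMA search; slow on a long daily series
hh_arima_full <- auto.arima(hh_ts, stepwise = FALSE, approximation = FALSE)
hh_arima_full
checkresiduals(hh_arima_full)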
ETS: The ets() selection resulted in an ETS(M,Ad,N) model with alpha = 0.9456 and a damped trend (phi = 0.8). As with the ARIMA fit, the residual Ljung-Box p-value is below 0.05, so some autocorrelation remains in the residuals. The forecast stays at elevated levels and follows the increasing trend at the tail end of the time series.
hh_ets <- ets(hh_ts)
hh_ets
## ETS(M,Ad,N)
##
## Call:
## ets(y = hh_ts)
##
## Smoothing parameters:
## alpha = 0.9456
## beta = 1e-04
## phi = 0.8
##
## Initial states:
## l = 1.8838
## b = 0.1601
##
## sigma: 0.0331
##
## AIC AICc BIC
## 8227.540 8227.573 8262.593
checkresiduals(hh_ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,Ad,N)
## Q* = 46.874, df = 19, p-value = 0.0003722
##
## Model df: 5. Total lags used: 24
hh_ets_fc <- forecast(hh_ets, h = 12)
accuracy(hh_ets_fc)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.00218696 0.119424 0.07251632 -0.008712112 2.256979 0.3106375
## ACF1
## Training set -0.07702683
hh_ets_fc %>% autoplot() +
labs(title="Henry Hub Closing Price Forecast using ETS(M,Ad,N)") +
ylab("Henry Hub Closing Price") +
theme(plot.title = element_text(hjust = 0.5))
Neural Net: The nnetar() fit resulted in an NNAR(33,1,17)[12] model. Thirty-three lagged inputs for p seems a bit high. Since the data is seasonal, P is automatically set to 1. The seasonal period of 12 comes from the frequency assigned when the ts object was created, so I will use that to forecast. The hidden-layer size k was not specified, so it defaults to k = (p + P + 1)/2, rounded to the nearest integer. The neural net forecast shows that after 36 forecast periods, the Henry Hub closing price should return to the $4 range. Overall, the neural net model has the lowest training RMSE of all the models tested (a summary sketch follows the forecast plot below) and will be selected for the test set.
hh_nnet <- nnetar(hh_ts)
hh_nnet
## Series: hh_ts
## Model: NNAR(33,1,17)[12]
## Call: nnetar(y = hh_ts)
##
## Average of 20 networks, each of which is
## a 33-17-1 network with 596 weights
## options were - linear output units
##
## sigma^2 estimated as 0.005895
hh_nnet_fc <- forecast(hh_nnet, h = 36)
accuracy(hh_nnet_fc)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0003123617 0.07677892 0.05644649 -0.06657598 1.874948 0.2417993
## ACF1
## Training set 0.003220883
hh_nnet_fc %>% autoplot() +
labs(title="Henry Hub Closing Price Forecast using NNAR(33,1,17)[12]") +
ylab("Henry Hub Closing Price") +
theme(plot.title = element_text(hjust = 0.5))
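To back the claim that the neural net has the lowest training RMSE, here is a small hedged summary (a sketch; rmse_summary is a new, illustrative object, not part of the original output) that collects the training RMSEs of every model fitted above.
# Sketch: gather the training-set RMSEs of all fitted models in one table
rmse_summary <- data.frame(
  model = c("HW additive", "HW multiplicative", "HW additive damped",
            "HW multiplicative damped", "ARIMA(3,1,1)", "ETS(M,Ad,N)",
            "NNAR(33,1,17)[12]"),
  RMSE = c(accuracy(hh_hw_add)[, "RMSE"], accuracy(hh_hw_mult)[, "RMSE"],
           accuracy(hh_hwd_add)[, "RMSE"], accuracy(hh_hwd_mult)[, "RMSE"],
           accuracy(hh_arima_fc)[, "RMSE"], accuracy(hh_ets_fc)[, "RMSE"],
           accuracy(hh_nnet_fc)[, "RMSE"])
)
rmse_summary[order(rmse_summary$RMSE), ]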
PREDICTION
Just as the earlier neural net forecast showed, the model predicts a retraction in price, although this time the price ends up around the $5.50 range. The RMSE reported is the training-set RMSE; it is very low, which indicates the model fits the historical data closely.
# Predictions
hh_predict <- predict(hh_nnet, newdata = hh_test_ts)
hh_predict
## Jan Feb Mar Apr May Jun Jul Aug
## 213 7.318659 7.249111 7.134577 7.319988 7.189914 6.964994
## 214 6.595948 6.689468 6.501102 6.680946 6.698656 6.806314 6.607046 6.536861
## 215 5.660215 5.323413
## Sep Oct Nov Dec
## 213 6.772669 6.888789 6.691848 6.576040
## 214 6.294000 6.347819 6.141768 6.055011
## 215
summary(hh_predict)
##
## Forecast method: NNAR(33,1,17)[12]
##
## Model Information:
##
## Average of 20 networks, each of which is
## a 33-17-1 network with 596 weights
## options were - linear output units
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0003123617 0.07677892 0.05644649 -0.06657598 1.874948 0.2417993
## ACF1
## Training set 0.003220883
##
## Forecasts:
## Jan Feb Mar Apr May Jun Jul Aug
## 213 7.318659 7.249111 7.134577 7.319988 7.189914 6.964994
## 214 6.595948 6.689468 6.501102 6.680946 6.698656 6.806314 6.607046 6.536861
## 215 5.660215 5.323413
## Sep Oct Nov Dec
## 213 6.772669 6.888789 6.691848 6.576040
## 214 6.294000 6.347819 6.141768 6.055011
## 215
hh_predict %>% autoplot(main = "Henry Hub Closing Price Predictions using NNAR(33,1,17)[12]") +
ylab("Henry Hub Closing Price") +
theme(plot.title = element_text(hjust = 0.5))
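The predict() call above returns the training-set error measures and a multi-step forecast rather than a score against the held-out data, so as a final hedged check (a sketch, not part of the original output), a 6-step neural net forecast can be compared directly with the 6 test-set closes.
# Sketch: score a 6-step-ahead neural net forecast against the actual test closes
hh_nnet_fc6 <- forecast(hh_nnet, h = 6)
accuracy(hh_nnet_fc6, hh_test$Close)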