In this study, we build a machine learning model to forecast the electricity demand of Houston, TX, at hourly resolution, using historical demand, weather, and other related data.
We scraped the website of the Electric Reliability Council of Texas (ERCOT) to download historical demand for Houston, covering the oldest data available, May 7, 2024, through today.
It is known that “Electricity use is higher during the workday than in the evening; higher on weekdays than on weekends and major holidays; higher in summer and winter than in fall and spring; and higher when cold weather leads to more heating, or when warm weather leads to greater use of air-conditioning.” (The New York Times, 2020) Therefore, we first want to obtain the corresponding weather and calendar/holiday data. For the weather-related variables, we downloaded ECMWF’s ERA5 data from April 2024 to January 11, 2026, with the following variables: 2m dewpoint temperature (dewpoint_2m), 2m temperature (temperature_2m), 10m u-component of wind, 10m v-component of wind, surface solar radiation downwards (SSRD) (which is important for capturing the effect of behind-the-meter solar generation on demand), and total precipitation (TP). SSRD and TP are accumulated parameters, so we shifted them forward by one step (a lead of 1) to align them with the other, instantaneous, variables. From the 10m u- and v-components of wind we calculated the wind speed, and with the help of the temperature variables we also calculated apparent ("feels like") temperatures.
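For concreteness, here is a minimal sketch of these derivations. The data frame and column names (weather, u10, v10, dewpoint_2m, temperature_2m, ssrd, tp) and the apparent-temperature formula (a Steadman-style formula without the radiation term) are assumptions for illustration; the actual preprocessing may differ in detail.
library(dplyr)
# Minimal sketch, assuming hourly rows sorted by DateTime and the column names above
weather <- weather %>%
  arrange(DateTime) %>%
  mutate(
    # Wind speed from the 10m u/v components
    wind_speed = sqrt(u10^2 + v10^2),
    # Water vapour pressure (hPa) from the 2m dewpoint (Magnus-type formula)
    vapour_pressure = 6.105 * exp(17.27 * dewpoint_2m / (237.7 + dewpoint_2m)),
    # Steadman-style apparent ("feels like") temperature, non-radiative version
    feels_like = temperature_2m + 0.33 * vapour_pressure - 0.70 * wind_speed - 4.00,
    # SSRD and TP are accumulated values; shift them one step earlier so they
    # line up with the instantaneous variables at each timestamp
    ssrd = lead(ssrd, 1),
    tp = lead(tp, 1)
  )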
We employed the lares library and the Nager.Date API to label federal, state, and Houston city holidays and bridge days in our study period: New Year’s Day, Martin Luther King Jr. Day, Presidents’ Day, Spring Holiday, Memorial Day, Juneteenth, Independence Day, Labor Day, Election Day (November 5, 2024), Veterans Day, Thanksgiving Day, Day After Thanksgiving, Christmas Day, Day After Christmas, and bridge days of long weekends: July 5, 2024, November 10, 2025, and January 2, 2026.
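As an illustration, here is a minimal sketch of pulling the US public-holiday list from the Nager.Date API (v3 endpoint); the state/city observances, Election Day, and the bridge days listed above still have to be added by hand, and the exact labeling used in this study may differ.
library(jsonlite)
library(dplyr)
library(lubridate)
# Fetch US public holidays for 2024-2026 from the Nager.Date API (v3)
us_holidays <- bind_rows(lapply(2024:2026, function(yr) {
  h <- fromJSON(paste0("https://date.nager.at/api/v3/PublicHolidays/", yr, "/US"))
  tibble(date = as_date(h$date), name = h$localName)
}))
# Hand-picked extra dates used in this study (bridge days, Election Day)
extra_days <- as_date(c("2024-07-05", "2024-11-05", "2025-11-10", "2026-01-02"))
holiday_dates <- sort(unique(c(us_holidays$date, extra_days)))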
We also derived other important features: cyclical encodings for hour of day and day of year, day of the week, a weekend indicator, a working-hour indicator (7 AM - 7 PM on business days), lags of the feels-like temperature (1, 2, 3, and 6 hours), its rolling mean over the past 6 hours, and its rolling maximum over the past 24 hours (we use only the maximum since Houston is cooling-dominated).
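A compact sketch of this feature engineering, assuming a data frame named weather with DateTime and feels_like columns (hypothetical names) and using the slider package for the rolling statistics:
library(dplyr)
library(lubridate)
library(slider)
features <- weather %>%
  arrange(DateTime) %>%
  mutate(
    # Cyclical encodings so hour 23 sits next to hour 0 and Dec 31 next to Jan 1
    hour     = hour(DateTime),
    doy      = yday(DateTime),
    hour_sin = sin(2 * pi * hour / 24),
    hour_cos = cos(2 * pi * hour / 24),
    doy_sin  = sin(2 * pi * doy / 365.25),
    doy_cos  = cos(2 * pi * doy / 365.25),
    wday       = wday(DateTime, week_start = 1),
    is_weekend = as.integer(wday >= 6),
    # Working hours: 7 AM - 7 PM on non-weekend days (holiday handling omitted here)
    is_working_hour = as.integer(hour >= 7 & hour <= 19 & is_weekend == 0),
    # Lags and rolling statistics of the feels-like temperature
    feels_like_lag1 = lag(feels_like, 1),
    feels_like_lag2 = lag(feels_like, 2),
    feels_like_lag3 = lag(feels_like, 3),
    feels_like_lag6 = lag(feels_like, 6),
    feels_like_roll_mean6 = slide_dbl(feels_like, mean, .before = 5,  .complete = TRUE),
    feels_like_roll_max24 = slide_dbl(feels_like, max,  .before = 23, .complete = TRUE)
  )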
All data are available in the file houston_ercot_weather_data.csv. Let’s take a quick look!
ercot_weather_data <- read_csv(file = "houston_ercot_weather_data.csv")
# Note: During Daylight Saving Time "fall back" transitions
# (November 3, 2024 and November 2, 2025),
# the repeated hour's demand values were averaged.
ercot_weather_data$DateTime <- lubridate::with_tz(ercot_weather_data$DateTime, tzone = "America/Chicago")
ercot_weather_data$wday <- as.factor(ercot_weather_data$wday)
ercot_weather_data$holiday <- as.factor(ercot_weather_data$holiday)
ercot_weather_data$is_weekend <- as.factor(ercot_weather_data$is_weekend)
ercot_weather_data$is_working_hour <- as.factor(ercot_weather_data$is_working_hour)
fig1 <- ercot_weather_data %>%
plot_ly(x = ~DateTime, y = ~Demand) %>%
add_lines() %>%
layout(
title = list(text = "Actual Hourly Loads of Houston, TX (in MW)", font = list(size = 16))
)
fig1
We notice some outlier days with much lower consumption. To get a better view, we plot the demand against the 2m temperature together with the working-hour indicator.
ercot_weather_data %>%
mutate(
working_label = factor(
ifelse(is_working_hour == "1", "Yes", "No"),
levels = c("Yes", "No")
),
demand_formatted = paste0(round(Demand / 1000, 2), "k"),
datetime_formatted = format(DateTime, "%B %d, %Y, %H:%M")
) %>%
plot_ly(
x = ~temperature_2m,
y = ~Demand,
color = ~working_label,
colors = c("Yes" = "#0072B2", "No" = "#E69F00"),
type = "scatter",
mode = "markers",
marker = list(opacity = 0.6),
text = ~paste0(
datetime_formatted,
" (2m_Temp: ", round(temperature_2m, 2),
", Demand: ", demand_formatted, ") "
),
hoverinfo = "text"
) %>%
layout(
xaxis = list(title = "Hourly 2m Temperature in Celsius", zeroline = FALSE),
yaxis = list(title = "Hourly Demand in Houston (in MW)"),
legend = list(title = list(text = "Working Hour"))
)
Although we could not find detailed information on the ERCOT website about historical load-shed events, LLMs and search engines provide this much-needed info! We exclude the following dates, when major blackouts took place, and look at the scatter plot again: the 2024 Houston derecho (May 16-17, 2024), May 28, 2024, Hurricane Beryl (July 6-20, 2024), January 21-22, 2025, and October 25-26, 2025.
outage_days <- c(as_date(c("2024-05-16", "2024-05-17", "2024-05-28")),
                 seq(as_date("2024-07-06"), as_date("2024-07-20"), by = "day"),
                 as_date(c("2025-01-21", "2025-01-22", "2025-10-25", "2025-10-26")))
ercot_weather_data <- ercot_weather_data %>%
filter(!(as_date(DateTime) %in% outage_days))
Now we separate the data into training and testing sets. Although time-series cross-validation would better estimate out-of-sample performance in a production-grade analysis, we choose a single train/test split for the sake of simplicity. Our testing data consists of the last month of available data, from December 11, 2025 to January 10, 2026:
ercot_weather_data <- ercot_weather_data %>% na.omit()
train_data <- ercot_weather_data %>%
filter(DateTime < as_date("2025-12-11")) %>%
as_tibble()
test_data <- ercot_weather_data %>%
filter(DateTime >= as_date("2025-12-11"))
We need forecasting models that can use both the information in the predictor variables and the time-series dynamics (e.g., trends and seasonalities). We use the automl_reg() function from the modeltime.h2o library. It trains and cross-validates multiple machine learning and deep learning models (XGBoost, GBM, GLMs, Random Forest, etc.) and then trains Stacked Ensemble models on top of them. The best model is then selected based on a stopping metric. Now, we fit on our training data with a 10-minute runtime budget:
h2o.init(
nthreads = -1,
ip = 'localhost',
port = 54321
)
model_spec <- automl_reg(mode = 'regression') %>%
set_engine(
engine = 'h2o',
max_runtime_secs = 600,
seed = 786
)
model_fit_h2o <- model_spec %>%
fit(Demand ~ ., data = train_data)
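Optionally, we can inspect which model families H2O AutoML tried and how they ranked. This is a sketch using the automl_leaderboard() helper, assuming a reasonably recent version of modeltime.h2o that exports it:
# Inspect the H2O AutoML leaderboard (best model at the top,
# ranked by the default sort metric for regression)
automl_leaderboard(model_fit_h2o)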
Finally, we can predict on the test dataset:
forecast_test_data <- test_data %>% select(-Demand)
forecasts <- predict(model_fit_h2o, forecast_test_data)
# Pair the timestamps with the predictions (.pred column returned by predict())
forecasts <- tibble(DateTime = forecast_test_data$DateTime,
                    Forecast = forecasts$.pred)
h2o.shutdown(prompt = FALSE)
Let’s compare forecasts and the realized values:
all_data <- forecasts %>%
left_join(test_data, by = "DateTime") %>%
rename(Realized = Demand)
fig2 <- plot_ly(all_data, x = ~DateTime) %>%
add_lines(y = ~Realized, name = "Realized",
line = list(color = "#2563eb", width = 1.5)) %>%
add_lines(y = ~Forecast, name = "Forecast",
line = list(color = "#dc2626", width = 1.5)) %>%
layout(
title = list(text = "Forecast vs. Realized Values", font = list(size = 16)),
xaxis = list(title = "Datetime", rangeslider = list(visible = TRUE)),
yaxis = list(title = "Value"),
hovermode = "x unified",
legend = list(x = 0.01, y = 0.99, bgcolor = "rgba(255,255,255,0.8)")
)
fig2
Our forecasts and the actual demand data are available in the file forecast_realized_data.csv.
Visual inspection of forecasts is informative, but we need quantitative metrics to rigorously assess model performance. We calculate several standard forecast accuracy measures:
correlation <- cor(all_data$Forecast, all_data$Realized)
ggplot(all_data, aes(x = Forecast, y = Realized)) +
geom_point(alpha = 0.4, color = "#3b82f6") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray40", linewidth = 1) +
geom_smooth(method = "lm", se = TRUE, color = "#dc2626", fill = "#fecaca") +
labs(title = "Forecast vs Realized: Scatter Plot",
subtitle = paste("Correlation:", round(correlation, 3),
"| Red line: fitted regression | Dashed: perfect forecast"),
x = "Forecast", y = "Realized") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
all_data$Error <- all_data$Realized - all_data$Forecast
all_data$AbsError <- abs(all_data$Error)
all_data$SqError <- all_data$Error^2
all_data$PctError <- (all_data$Error / all_data$Realized) * 100
ggplot(all_data, aes(x = Error)) +
geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "#3b82f6",
color = "white", alpha = 0.7) +
geom_density(color = "#dc2626", linewidth = 1) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") +
geom_vline(xintercept = mean(all_data$Error), color = "#dc2626", linewidth = 1) +
labs(title = "Distribution of Forecast Errors",
subtitle = paste("Mean Error (red line):", round(mean(all_data$Error), 1)),
x = "Error (Realized - Forecast)", y = "Density") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
all_data$Hour <- lubridate::hour(all_data$DateTime)
hourly_stats <- all_data %>%
group_by(Hour) %>%
summarise(
MeanError = mean(Error),
MAE = mean(AbsError),
RMSE = sqrt(mean(SqError)),
.groups = "drop"
)
ggplot(hourly_stats, aes(x = Hour)) +
geom_bar(aes(y = MeanError), stat = "identity", fill = "#3b82f6", alpha = 0.7) +
geom_line(aes(y = MAE), color = "#dc2626", linewidth = 1) +
geom_point(aes(y = MAE), color = "#dc2626", size = 2) +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_x_continuous(breaks = 0:23) +
labs(title = "Forecast Error by Hour of Day",
subtitle = "Blue bars: Mean Error | Red line: MAE",
x = "Hour", y = "Error (Realized - Forecast)") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
ggplot(all_data, aes(x = DateTime, y = Error)) +
geom_line(color = "#3b82f6", alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
geom_smooth(method = "loess", se = TRUE, color = "#dc2626", fill = "#fecaca", span = 0.2) +
labs(title = "Forecast Error Over Time",
subtitle = "Red line: smoothed trend (LOESS)",
x = "DateTime", y = "Error (Realized - Forecast)") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
all_data$Date <- lubridate::as_date(all_data$DateTime)
daily_stats <- all_data %>%
group_by(Date) %>%
summarise(
MeanError = mean(Error),
MAE = mean(AbsError),
RMSE = sqrt(mean(SqError)),
.groups = "drop"
)
ggplot(daily_stats, aes(x = Date)) +
geom_col(aes(y = MeanError), fill = "#3b82f6", alpha = 0.7) +
geom_line(aes(y = MAE), color = "#dc2626", linewidth = 1) +
geom_point(aes(y = MAE), color = "#dc2626", size = 2) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Daily Forecast Performance",
subtitle = "Blue bars: Mean Error | Red line: MAE",
x = "Date", y = "Error (Realized - Forecast)") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1))
New Year’s Day and its bridge day (Friday, Jan 2nd) show worse performance compared to other days! Let’s see the overall performance metrics:
evaluation_metrics <- all_data %>%
summarise(
RMSE = sqrt(mean((Realized - Forecast)^2)),
MAE = mean(abs(Realized - Forecast)),
MAPE = mean(abs((Realized - Forecast) / Realized)) * 100,
R_squared = 1 - sum((Realized - Forecast)^2) / sum((Realized - mean(Realized))^2)
)
evaluation_metrics %>%
pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>%
mutate(Value = round(Value, 2)) %>%
knitr::kable(caption = "Forecast Accuracy Metrics")
| Metric | Value |
|---|---|
| RMSE | 306.09 |
| MAE | 227.67 |
| MAPE | 1.92 |
| R_squared | 0.93 |
We can also examine how the model performs during different periods, particularly during peak demand hours when accurate forecasts are most critical, or weekends/weekdays/holidays:
all_data_with_hour <- all_data %>%
mutate(
Hour = lubridate::hour(DateTime),
Period = case_when(
Hour >= 7 & Hour <= 22 ~ "Peak (7 AM - 10 PM)",
TRUE ~ "Off-Peak"
)
)
period_metrics <- all_data_with_hour %>%
group_by(Period) %>%
summarise(
RMSE = sqrt(mean((Realized - Forecast)^2)),
MAE = mean(abs(Realized - Forecast)),
MAPE = mean(abs((Realized - Forecast) / Realized)) * 100,
.groups = "drop"
)
period_metrics %>%
mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
knitr::kable(caption = "Forecast Accuracy by Time Period")
| Period | RMSE | MAE | MAPE |
|---|---|---|---|
| Off-Peak | 251.79 | 199.93 | 1.85 |
| Peak (7 AM - 10 PM) | 329.90 | 241.54 | 1.95 |
all_data_workday <- all_data %>%
  mutate(
    # is_weekend and holiday are 0/1 indicator factors
    Workday = case_when(
      is_weekend == "0" & holiday == "0" ~ "Working Day",
      is_weekend == "1" & holiday == "0" ~ "Weekend (Non-Holiday)",
      TRUE ~ "Holiday or Bridge Day"
    )
  )
workday_metrics <- all_data_workday %>%
group_by(Workday) %>%
summarise(
RMSE = sqrt(mean((Realized - Forecast)^2)),
MAE = mean(abs(Realized - Forecast)),
MAPE = mean(abs((Realized - Forecast) / Realized)) * 100,
.groups = "drop"
)
workday_metrics %>%
mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
knitr::kable(caption = "Forecast Accuracy by Working Days")
| Workday | RMSE | MAE | MAPE |
|---|---|---|---|
| Holiday or Bridge Day | 495.72 | 385.62 | 3.30 |
| Weekend (Non-Holiday) | 263.36 | 192.03 | 1.64 |
| Working Day | 268.45 | 210.39 | 1.75 |
We see that the forecasts for the holidays (especially New Year’s Day) and the following bridge day (Friday, January 2, 2026) perform poorly. This can be attributed to the fact that our training data contain only one historical New Year’s Day (2025) and relatively few holidays and bridge days overall, so a longer training history would likely improve these forecasts considerably.
What else can be done? Better preprocessing (e.g., improved anomaly detection, more complete outage information), more feature engineering (e.g., new data or external regressors, a closer look at holidays and bridge days), a longer training period, and experimenting with other methods (e.g., pretrained time-series forecasting models) would all be worthwhile. For the sake of simplicity, we took only one grid point for the city of Houston (longitude -95.5, latitude 29.75) when retrieving the meteorological variables. But if possible (e.g., if consumption data were available per neighborhood), consumption-weighted weather features would likely give more accurate results, as sketched below.
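As a purely hypothetical sketch (the per-neighborhood consumption shares and multi-point weather data are not part of this study), consumption-weighted weather features could be computed like this:
library(dplyr)
# Hypothetical inputs:
#   weather_points: DateTime, point_id, temperature_2m (one row per grid point and hour)
#   weights:        point_id, consumption_share (shares summing to 1)
weighted_weather <- weather_points %>%
  left_join(weights, by = "point_id") %>%
  group_by(DateTime) %>%
  summarise(
    temperature_2m_weighted = sum(temperature_2m * consumption_share),
    .groups = "drop"
  )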