Introduction and Problem Definition

In this study, we build a machine learning model to forecast the electricity demand of Houston, TX, at hourly resolution, using historical demand, weather, and other related data.

We scraped the website of the Electric Reliability Council of Texas (ERCOT) to download historical demand for Houston, from the oldest data available (May 7, 2024) up to today.

It is known that “Electricity use is higher during the workday than in the evening; higher on weekdays than on weekends and major holidays; higher in summer and winter than in fall and spring; and higher when cold weather leads to more heating, or when warm weather leads to greater use of air-conditioning.” (The New York Times, 2020) Therefore, we first obtain the corresponding weather and calendar/holiday data. For the weather-related variables, we downloaded ECMWF’s ERA5 data from April 2024 to January 11, 2026, covering the following variables: 2m dewpoint temperature (dewpoint_2m), 2m temperature (temperature_2m), 10m u-component of wind, 10m v-component of wind, surface solar radiation downwards (SSRD) (important for capturing the effect of behind-the-meter solar generation on demand), and total precipitation (TP). SSRD and TP are accumulated parameters, so we shifted them by one step (a lead of 1) to align them with the other, instantaneous, variables. From the 10m u- and v-components of wind we calculated the wind speed, and from the temperature variables we also calculated apparent (feels like) temperatures.
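As an illustration of these derivations, here is a minimal sketch, assuming a data frame era5 holding the raw ERA5 columns (u10, v10, dewpoint_2m and temperature_2m in °C, and the accumulated ssrd and tp); Steadman’s apparent-temperature formula is used here as one of several reasonable choices:

library(dplyr)

era5 <- era5 %>%
  arrange(DateTime) %>%
  mutate(
    # wind speed (m/s) from the 10m u/v components
    wind_speed_10m = sqrt(u10^2 + v10^2),
    # actual vapour pressure (hPa) from the dewpoint via the Magnus formula
    vapour_pressure = 6.105 * exp(17.27 * dewpoint_2m / (237.7 + dewpoint_2m)),
    # Steadman's apparent ("feels like") temperature (°C)
    feels_like = temperature_2m + 0.33 * vapour_pressure - 0.70 * wind_speed_10m - 4.00,
    # SSRD and TP are accumulated over the hour, so lead them by one step
    # to align with the instantaneous variables
    ssrd = lead(ssrd, 1),
    tp   = lead(tp, 1)
  )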

We employed the lares library and the Nager.Date API to label the federal, state, and Houston city holidays and bridge days in our study period: New Year’s Day, Martin Luther King Jr. Day, Presidents’ Day, Spring Holiday, Memorial Day, Juneteenth, Independence Day, Labor Day, Election Day (November 5, 2024), Veterans Day, Thanksgiving Day, Day After Thanksgiving Day, Christmas Day, Day After Christmas Day, and the bridge days of long weekends: July 5, 2024; November 10, 2025; and January 2, 2026.
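As an example of the holiday lookup, US public holidays for a given year can be pulled from the Nager.Date API as follows (a minimal sketch; state/city observances and the bridge days were labeled separately):

library(dplyr)
library(jsonlite)

# Federal/public holidays for 2025 from the Nager.Date REST API
us_holidays_2025 <- fromJSON("https://date.nager.at/api/v3/PublicHolidays/2025/US") %>%
  transmute(date = as.Date(date), holiday_name = name)

head(us_holidays_2025)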

Other important features are also derived: cyclical encodings for hour of day and day of year, the day of the week, a weekend indicator, a working-hour indicator (7 AM - 7 PM on business days), lags of the feels-like temperature (1, 2, 3, and 6 hours), a rolling mean of the feels-like temperature (past 6 hours), and a rolling maximum of the feels-like temperature (past 24 hours); we took only the maximum since Houston is cooling-dominated.
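A condensed sketch of how such features can be built is below, assuming the hourly data frame ercot_weather_data, a feels_like column as derived above, and a 0/1 holiday flag (column names are illustrative); the rolling windows use the zoo package:

library(dplyr)
library(lubridate)
library(zoo)

ercot_weather_data <- ercot_weather_data %>%
  arrange(DateTime) %>%
  mutate(
    # cyclical encodings so hour 23 sits next to hour 0 and day 365 next to day 1
    hour_sin = sin(2 * pi * hour(DateTime) / 24),
    hour_cos = cos(2 * pi * hour(DateTime) / 24),
    doy_sin  = sin(2 * pi * yday(DateTime) / 365.25),
    doy_cos  = cos(2 * pi * yday(DateTime) / 365.25),
    # weekend and working-hour (7 AM - 7 PM on non-holiday business days) indicators
    is_weekend = as.integer(wday(DateTime) %in% c(1, 7)),
    is_working_hour = as.integer(hour(DateTime) >= 7 & hour(DateTime) < 19 &
                                   !wday(DateTime) %in% c(1, 7) & holiday == 0),
    wday = wday(DateTime, label = TRUE),
    # lagged and rolling feels-like temperature features
    feels_like_lag1  = lag(feels_like, 1),
    feels_like_lag2  = lag(feels_like, 2),
    feels_like_lag3  = lag(feels_like, 3),
    feels_like_lag6  = lag(feels_like, 6),
    feels_like_mean6 = rollmeanr(feels_like, k = 6, fill = NA),
    feels_like_max24 = rollmaxr(feels_like, k = 24, fill = NA)
  )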

A First Glimpse at the Data

All data is available in the file houston_ercot_weather_data.csv. Let’s take a quick look!

# Data wrangling, dates, and interactive plotting libraries
library(tidyverse)
library(lubridate)
library(plotly)

ercot_weather_data <- read_csv(file = "houston_ercot_weather_data.csv")
# Note: During Daylight Saving Time "fall back" transitions 
# (November 3, 2024 and November 2, 2025), 
# the repeated hour's demand values were averaged.
ercot_weather_data$DateTime <- lubridate::with_tz(ercot_weather_data$DateTime, tzone = "America/Chicago")
ercot_weather_data$wday <- as.factor(ercot_weather_data$wday)
ercot_weather_data$holiday <- as.factor(ercot_weather_data$holiday)
ercot_weather_data$is_weekend <- as.factor(ercot_weather_data$is_weekend)
ercot_weather_data$is_working_hour <- as.factor(ercot_weather_data$is_working_hour)
fig1 <- ercot_weather_data %>%
  plot_ly(x = ~DateTime, y = ~Demand) %>%
  add_lines() %>%
  layout(
    title = list(text = "Actual Hourly Loads of Houston, TX (in MW)", font = list(size = 16))
  )
fig1

We notice some outlier days with much lower consumption. To get a better picture, we look at demand together with the 2m temperature and the working-hour indicator.

ercot_weather_data %>%
  mutate(
    working_label = factor(
      ifelse(is_working_hour == "1", "Yes", "No"),
      levels = c("Yes", "No")
    ),
    demand_formatted = paste0(round(Demand / 1000, 2), "k"),
    datetime_formatted = format(DateTime, "%B %d, %Y, %H:%M")
  ) %>%
  plot_ly(
    x = ~temperature_2m,
    y = ~Demand,
    color = ~working_label,
    colors = c("Yes" = "#0072B2", "No" = "#E69F00"),
    type = "scatter",
    mode = "markers",
    marker = list(opacity = 0.6),
    text = ~paste0(
      datetime_formatted,
      " (2m_Temp: ", round(temperature_2m, 2),
      ", Demand: ", demand_formatted, ") "
    ),
    hoverinfo = "text"
  ) %>%
  layout(
    xaxis = list(title = "Hourly 2m Temperature in Celsius", zeroline = FALSE),
    yaxis = list(title = "Hourly Demand in Houston (in MW)"),
    legend = list(title = list(text = "Working Hour"))
  )

Although we could not find detailed information on the ERCOT website about historical load-shed events, LLMs and search engines provide this much-needed info! We exclude the following dates, on which major outages took place, and look at the scatter plot again: the 2024 Houston derecho (May 16-17, 2024), May 28, 2024, Hurricane Beryl (July 6-20, 2024), January 21-22, 2025, and October 25-26, 2025.

# Dates affected by major outages / load-shed events
outage_days <- c(
  as_date(c("2024-05-16", "2024-05-17", "2024-05-28")),
  seq(as_date("2024-07-06"), as_date("2024-07-20"), by = "day"),
  as_date(c("2025-01-21", "2025-01-22", "2025-10-25", "2025-10-26"))
)
ercot_weather_data <- ercot_weather_data %>%
  filter(!(as_date(DateTime) %in% outage_days))

Now, we should separate the training and testing data. Although time-series cross-validation would give a better estimate of out-of-sample performance in a production-grade analysis, we choose a single train/test split for the sake of simplicity (a sketch of a rolling-origin scheme is shown right after the split). Our testing data consists of the last month of the available data, from December 11, 2025 to January 10, 2026:

ercot_weather_data <- ercot_weather_data %>% na.omit()
train_data <- ercot_weather_data %>%
  filter(DateTime < as_date("2025-12-11")) %>%
  as_tibble()
test_data <- ercot_weather_data %>%
  filter(DateTime >= as_date("2025-12-11"))
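For completeness, the rolling-origin cross-validation mentioned above could be set up with the timetk package roughly as follows (a sketch only; the window sizes are illustrative):

library(timetk)

cv_splits <- time_series_cv(
  data        = ercot_weather_data,
  date_var    = DateTime,
  initial     = "12 months",   # training window
  assess      = "1 month",     # assessment window
  skip        = "1 month",     # slide between forecast origins
  cumulative  = TRUE,          # grow the training window over time
  slice_limit = 5
)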

We need predictive models that can use both the information in the predictor variables and the time-series dynamics (e.g., trends and seasonalities). We use the automl_reg() function of the modeltime.h2o library. It trains and cross-validates multiple machine learning and deep learning models (XGBoost, GBM, GLMs, Random Forests, etc.) and then trains Stacked Ensemble models on top of them. The best model is then selected based on a stopping metric. Now, we fit on our training data for 10 minutes:

library(h2o)
library(tidymodels)
library(modeltime.h2o)

# Start a local H2O cluster using all available cores
h2o.init(
  nthreads = -1,
  ip       = 'localhost',
  port     = 54321
)

model_spec <- automl_reg(mode = 'regression') %>%
  set_engine(
    engine                     = 'h2o',
    max_runtime_secs           = 600,
    seed                       = 786
  )

model_fit_h2o <- model_spec %>%
  fit(Demand ~ ., data = train_data)

A First Look at the Forecasts

Finally, we can predict on the test dataset:

forecast_test_data <- test_data %>% select(-Demand)

# Predict on the test period and attach the corresponding timestamps
forecasts <- predict(model_fit_h2o, forecast_test_data) %>%
  bind_cols(forecast_test_data %>% select(DateTime)) %>%
  rename(Forecast = .pred)
  
h2o.shutdown(prompt = FALSE)

Let’s compare forecasts and the realized values:

all_data <- forecasts %>%
  left_join(test_data, by = "DateTime") %>% 
  rename(Realized = Demand)

fig2 <- plot_ly(all_data, x = ~DateTime) %>%
  add_lines(y = ~Realized, name = "Realized", 
            line = list(color = "#2563eb", width = 1.5)) %>%
  add_lines(y = ~Forecast, name = "Forecast", 
            line = list(color = "#dc2626", width = 1.5)) %>%
  layout(
    title = list(text = "Forecast vs. Realized Values", font = list(size = 16)),
    xaxis = list(title = "Datetime", rangeslider = list(visible = TRUE)),
    yaxis = list(title = "Value"),
    hovermode = "x unified",
    legend = list(x = 0.01, y = 0.99, bgcolor = "rgba(255,255,255,0.8)")
  )

fig2

Our forecasts and the actual demand data are available in the file forecast_realized_data.csv.

Model Evaluation Metrics

Visual inspection of forecasts is informative, but we need quantitative metrics to rigorously assess model performance. We calculate several standard forecast accuracy measures:
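For reference, with realized values $y_t$, forecasts $\hat{y}_t$, their mean $\bar{y}$, and $n$ test observations, the headline metrics computed later in this section are

$$\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}, \qquad \mathrm{MAE} = \tfrac{1}{n}\sum_{t=1}^{n}\lvert y_t - \hat{y}_t \rvert,$$

$$\mathrm{MAPE} = \tfrac{100}{n}\sum_{t=1}^{n}\left\lvert \tfrac{y_t - \hat{y}_t}{y_t} \right\rvert, \qquad R^2 = 1 - \frac{\sum_{t}(y_t - \hat{y}_t)^2}{\sum_{t}(y_t - \bar{y})^2}.$$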

Scatter plot with regression line

correlation <- cor(all_data$Forecast, all_data$Realized)

ggplot(all_data, aes(x = Forecast, y = Realized)) +
  geom_point(alpha = 0.4, color = "#3b82f6") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray40", linewidth = 1) +
  geom_smooth(method = "lm", se = TRUE, color = "#dc2626", fill = "#fecaca") +
  labs(title = "Forecast vs Realized: Scatter Plot",
       subtitle = paste("Correlation:", round(correlation, 3), 
                       "| Red line: fitted regression | Dashed: perfect forecast"),
       x = "Forecast", y = "Realized") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Error distribution

all_data$Error <- all_data$Realized - all_data$Forecast
all_data$AbsError <- abs(all_data$Error)
all_data$SqError <- all_data$Error^2
all_data$PctError <- (all_data$Error / all_data$Realized) * 100

ggplot(all_data, aes(x = Error)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "#3b82f6", 
                 color = "white", alpha = 0.7) +
  geom_density(color = "#dc2626", linewidth = 1) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") +
  geom_vline(xintercept = mean(all_data$Error), color = "#dc2626", linewidth = 1) +
  labs(title = "Distribution of Forecast Errors",
       subtitle = paste("Mean Error (red line):", round(mean(all_data$Error), 1)),
       x = "Error (Realized - Forecast)", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Hourly error patterns

all_data$Hour <- lubridate::hour(all_data$DateTime)

hourly_stats <- all_data %>%
  group_by(Hour) %>%
  summarise(
    MeanError = mean(Error),
    MAE = mean(AbsError),
    RMSE = sqrt(mean(SqError)),
    .groups = "drop"
  )

ggplot(hourly_stats, aes(x = Hour)) +
  geom_bar(aes(y = MeanError), stat = "identity", fill = "#3b82f6", alpha = 0.7) +
  geom_line(aes(y = MAE), color = "#dc2626", linewidth = 1) +
  geom_point(aes(y = MAE), color = "#dc2626", size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_x_continuous(breaks = 0:23) +
  labs(title = "Forecast Error by Hour of Day",
       subtitle = "Blue bars: Mean Error | Red line: MAE",
       x = "Hour", y = "Error (Realized - Forecast)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Error over time

ggplot(all_data, aes(x = DateTime, y = Error)) +
  geom_line(color = "#3b82f6", alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
  geom_smooth(method = "loess", se = TRUE, color = "#dc2626", fill = "#fecaca", span = 0.2) +
  labs(title = "Forecast Error Over Time",
       subtitle = "Red line: smoothed trend (LOESS)",
       x = "DateTime", y = "Error (Realized - Forecast)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Daily aggregated performance

all_data$Date <- lubridate::as_date(all_data$DateTime)
daily_stats <- all_data %>%
  group_by(Date) %>%
  summarise(
    MeanError = mean(Error),
    MAE = mean(AbsError),
    RMSE = sqrt(mean(SqError)),
    .groups = "drop"
  )

ggplot(daily_stats, aes(x = Date)) +
  geom_col(aes(y = MeanError), fill = "#3b82f6", alpha = 0.7) +
  geom_line(aes(y = MAE), color = "#dc2626", linewidth = 1) +
  geom_point(aes(y = MAE), color = "#dc2626", size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Daily Forecast Performance",
       subtitle = "Blue bars: Mean Error | Red line: MAE",
       x = "Date", y = "Error (Realized - Forecast)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))

New Year’s Day and its bridge day (Friday, January 2) show clearly worse performance than the other days! Let’s look at the overall performance metrics:

evaluation_metrics <- all_data %>%
  summarise(
    RMSE = sqrt(mean((Realized - Forecast)^2)),
    MAE  = mean(abs(Realized - Forecast)),
    MAPE = mean(abs((Realized - Forecast) / Realized)) * 100,
    R_squared = 1 - sum((Realized - Forecast)^2) / sum((Realized - mean(Realized))^2)
  )

evaluation_metrics %>%
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>%
  mutate(Value = round(Value, 2)) %>%
  knitr::kable(caption = "Forecast Accuracy Metrics")
Forecast Accuracy Metrics

Metric       Value
RMSE (MW)   306.09
MAE (MW)    227.67
MAPE (%)      1.92
R_squared     0.93

We can also examine how the model performs across different periods, particularly during peak demand hours, when accurate forecasts are most critical, as well as across weekdays, weekends, and holidays:

all_data_with_hour <- all_data %>%
  mutate(
    Hour = lubridate::hour(DateTime),
    Period = case_when(
      Hour >= 7 & Hour <= 22 ~ "Peak (7 AM - 10 PM)",
      TRUE ~ "Off-Peak"
    )
  )

period_metrics <- all_data_with_hour %>%
  group_by(Period) %>%
  summarise(
    RMSE = sqrt(mean((Realized - Forecast)^2)),
    MAE  = mean(abs(Realized - Forecast)),
    MAPE = mean(abs((Realized - Forecast) / Realized)) * 100,
    .groups = "drop"
  )

period_metrics %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  knitr::kable(caption = "Forecast Accuracy by Time Period")
Forecast Accuracy by Time Period

Period                 RMSE (MW)   MAE (MW)   MAPE (%)
Off-Peak                  251.79     199.93       1.85
Peak (7 AM - 10 PM)       329.90     241.54       1.95

all_data_workday <- all_data %>%
  mutate(
    # the is_weekend and holiday factors use "0"/"1" levels
    Workday = case_when(
      is_weekend == "0" & holiday == "0" ~ "Working Day",
      is_weekend == "1" & holiday == "0" ~ "Weekend (Non-Holiday)",
      TRUE ~ "Holiday or Bridge Day"
    )
  )

workday_metrics <- all_data_workday %>%
  group_by(Workday) %>%
  summarise(
    RMSE = sqrt(mean((Realized - Forecast)^2)),
    MAE  = mean(abs(Realized - Forecast)),
    MAPE = mean(abs((Realized - Forecast) / Realized)) * 100,
    .groups = "drop"
  )

workday_metrics %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  knitr::kable(caption = "Forecast Accuracy by Working Days")
Forecast Accuracy by Working Days

Workday                  RMSE (MW)   MAE (MW)   MAPE (%)
Holiday or Bridge Day       495.72     385.62       3.30
Weekend (Non-Holiday)       263.36     192.03       1.64
Working Day                 268.45     210.39       1.75

We see that the forecasts for holidays (especially New Year’s Day) and the following bridge day (Friday, January 2, 2026) perform poorly. This can be attributed to the fact that the training data contains only one historical New Year’s Day (2025) and, more generally, relatively few holidays and bridge days. A longer training history would therefore likely improve these forecasts considerably.

Future Research

What else can be done? Better preprocessing (e.g., improved anomaly detection, more complete outage information), further feature engineering (e.g., new data/external regressors, a closer look at holidays and bridge days), a longer training history, and experimenting with other methods (e.g., pretrained time-series forecasting models) would all be worthwhile. For the sake of simplicity, we took only one grid point for the city of Houston (longitude -95.5, latitude 29.75) when extracting the meteorological variables. But if possible (e.g., if consumption data for each neighborhood were available), consumption-weighted weather features would give more accurate results.
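As a closing illustration, if per-neighborhood consumption shares were available, a consumption-weighted temperature could be built roughly like this (a sketch with entirely hypothetical object and column names):

library(dplyr)

# Hypothetical input: one row per (DateTime, zone) with that zone's 2m temperature
# and its share of total consumption (shares sum to 1 within each hour)
weighted_weather <- zone_weather %>%
  group_by(DateTime) %>%
  summarise(
    temperature_2m_weighted = sum(consumption_share * temperature_2m),
    .groups = "drop"
  )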