W13_Discussion

Author

Jincheng Xie

Introduction

In this notebook, I use the vic_elec dataset (Victoria, Australia, 2012–2014) from the fpp3 package to build an LSTM model for half-hourly electricity demand forecasting and compare it with an ETS baseline.

This dataset shows strong daily and weekly seasonality, a clear relationship with temperature, and some holiday-related variation. These characteristics make it a useful case for comparing a recurrent deep learning model with a simpler statistical benchmark.

Load data from fpp3

library(fpp3)
library(readr)

vic_elec |>
  as_tibble() |>
  write_csv("vic_elec.csv")

cat(nrow(vic_elec), "rows written to vic_elec.csv\n")
52608 rows written to vic_elec.csv

Setup

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder
from pytorch_forecasting.models import RecurrentNetwork
from pytorch_forecasting.metrics import MAE

import torch
from lightning.pytorch import Trainer, seed_everything
from lightning.pytorch.loggers import CSVLogger
from lightning.pytorch.callbacks import EarlyStopping

seed_everything(66, workers=True)
66

Data

Inspect

The vic_elec data contains half-hourly electricity demand, temperature, calendar information, and a holiday indicator. I convert the time fields to datetime format and sort the data chronologically before creating any time-based features.

df = pd.read_csv("vic_elec.csv")
df["Time"] = pd.to_datetime(df["Time"])
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Time").reset_index(drop=True)

df.head()
                       Time       Demand  Temperature       Date  Holiday
0 2011-12-31 13:00:00+00:00  4382.825174        21.40 2012-01-01     True
1 2011-12-31 13:30:00+00:00  4263.365526        21.05 2012-01-01     True
2 2011-12-31 14:00:00+00:00  4048.966046        20.70 2012-01-01     True
3 2011-12-31 14:30:00+00:00  3877.563330        20.55 2012-01-01     True
4 2011-12-31 15:00:00+00:00  4036.229746        20.40 2012-01-01     True

The first two weeks already show the main short-term pattern: repeated daily peaks and lower demand at weekends.

fig, ax = plt.subplots(figsize=(10, 3.5))
ax.plot(df["Time"].iloc[:48 * 14], df["Demand"].iloc[:48 * 14], lw=0.9)
ax.set_ylabel("Demand (MW)")
ax.set_title("Demand over the first two weeks")
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Feature engineering

I create calendar features from the timestamp, together with lag and rolling-demand features required by the assignment. After removing the initial missing rows caused by these transformations, I rebuild a continuous time_idx and define a single group identifier for the series.

df["hour"] = df["Time"].dt.hour
df["minute"] = df["Time"].dt.minute
df["day_of_week"] = df["Time"].dt.dayofweek
df["month"] = df["Time"].dt.month
df["hour_fraction"] = df["hour"] + df["minute"] / 60

df["lag_1"] = df["Demand"].shift(1)
df["lag_48"] = df["Demand"].shift(48)
df["rolling_mean_48"] = df["Demand"].rolling(48).mean()
df["rolling_std_48"] = df["Demand"].rolling(48).std()

df = df.dropna().reset_index(drop=True)
df["time_idx"] = np.arange(len(df))
df["group"] = "vic_elec"

df[["Time", "Demand", "hour", "day_of_week", "month", "lag_1", "lag_48",
    "rolling_mean_48", "rolling_std_48", "time_idx", "group"]].head()
                       Time       Demand  ...  time_idx     group
0 2012-01-01 13:00:00+00:00  4367.914468  ...         0  vic_elec
1 2012-01-01 13:30:00+00:00  4164.015958  ...         1  vic_elec
2 2012-01-01 14:00:00+00:00  3898.239882  ...         2  vic_elec
3 2012-01-01 14:30:00+00:00  3752.016756  ...         3  vic_elec
4 2012-01-01 15:00:00+00:00  3941.369156  ...         4  vic_elec

[5 rows x 11 columns]

Train / validation / test split

I split the series chronologically into 70% training, 15% validation, and 15% test.

max_prediction_length = 48
max_encoder_length = 336

n_total = len(df)
train_end = int(n_total * 0.70)
val_end = int(n_total * 0.85)

train_cutoff = df.loc[train_end, "time_idx"]
val_cutoff = df.loc[val_end, "time_idx"]

Scaling

Continuous variables are standardised using the training span only.

features_to_scale = ["Demand", "Temperature"]

scaler = StandardScaler()
scaler.fit(df.loc[df["time_idx"] <= train_cutoff, features_to_scale])
StandardScaler()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
df[features_to_scale] = scaler.transform(df[features_to_scale])

demand_mean = scaler.mean_[0]
demand_std = scaler.scale_[0]

PyTorch Forecasting dataset

I next build the TimeSeriesDataSet used for training, validation, and testing. Temperature and clock-based variables are treated as known inputs, while demand is the target series to be predicted.

categorical_cols = ["hour", "day_of_week", "month", "Holiday"]

for col in categorical_cols:
    df[col] = df[col].astype(str).astype("category")

categorical_encoders = {
    col: NaNLabelEncoder(add_nan=True) for col in categorical_cols
}

training = TimeSeriesDataSet(
    df[df["time_idx"] <= train_cutoff],
    time_idx="time_idx",
    target="Demand",
    group_ids=["group"],
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    time_varying_known_reals=["time_idx", "Temperature", "hour_fraction"],
    time_varying_unknown_reals=["Demand"],
    time_varying_known_categoricals=categorical_cols,
    categorical_encoders=categorical_encoders,
    add_relative_time_idx=True,
    add_encoder_length=True,
)

validation = TimeSeriesDataSet.from_dataset(
    training,
    df[df["time_idx"] <= val_cutoff],
    min_prediction_idx=train_cutoff + 1,
    stop_randomization=True,
)

testing = TimeSeriesDataSet.from_dataset(
    training,
    df,
    min_prediction_idx=val_cutoff + 1,
    stop_randomization=True,
)

batch_size = 64

train_dl = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dl = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0)
test_dl = testing.to_dataloader(train=False, batch_size=batch_size, num_workers=0)

LSTM model

I use an LSTM through the RecurrentNetwork wrapper as the main deep learning model. It is trained with hidden size 32, one recurrent layer, dropout 0.2, learning rate 0.01, MAE loss, and early stopping on validation loss.

lstm_model = RecurrentNetwork.from_dataset(
    training,
    cell_type="LSTM",
    hidden_size=32,
    rnn_layers=1,
    dropout=0.2,
    learning_rate=0.01,
    loss=MAE(),
    log_interval=0,
    log_val_interval=1,
)

Training

from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint

logger = CSVLogger("lightning_logs", name="lstm_run")
early_stop = EarlyStopping(monitor="val_loss", patience=3, mode="min")
checkpoint_cb = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)

trainer = Trainer(
    max_epochs=15,
    accelerator="auto",
    logger=logger,
    callbacks=[early_stop, checkpoint_cb],
    enable_progress_bar=False,
    enable_model_summary=False,
    log_every_n_steps=50,
)

trainer.fit(lstm_model, train_dl, val_dl)

best_path = checkpoint_cb.best_model_path
best_lstm = RecurrentNetwork.load_from_checkpoint(best_path)

Loss curves

metrics = pd.read_csv(os.path.join(logger.log_dir, "metrics.csv"))

fig, ax = plt.subplots(figsize=(8, 4))

if "train_loss_epoch" in metrics.columns:
    train_curve = metrics[["epoch", "train_loss_epoch"]].dropna()
    ax.plot(train_curve["epoch"], train_curve["train_loss_epoch"], label="train")

if "val_loss" in metrics.columns:
    val_curve = metrics[["epoch", "val_loss"]].dropna()
    ax.plot(val_curve["epoch"], val_curve["val_loss"], label="validation")

ax.set_xlabel("Epoch")
ax.set_ylabel("Scaled MAE loss")
ax.set_title("Training and validation loss")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Test-set evaluation

To compare the LSTM and ETS fairly, I evaluate both models on the same fixed seven-day test horizon (336 half-hourly observations) immediately after the validation period.

pred_out = best_lstm.predict(test_dl, return_x=True)
pred = pred_out.output
x = pred_out.x

pred_np = pred.detach().cpu().numpy()
y_true = x["decoder_target"].detach().cpu().numpy()

n_days = 7
pred_days = pred_np[::max_prediction_length][:n_days]
true_days = y_true[::max_prediction_length][:n_days]

pred_unscaled = pred_days * demand_std + demand_mean
true_unscaled = true_days * demand_std + demand_mean

lstm_mae = mean_absolute_error(true_unscaled.ravel(), pred_unscaled.ravel())
lstm_rmse = np.sqrt(mean_squared_error(true_unscaled.ravel(), pred_unscaled.ravel()))
lstm_mape = np.mean(
    np.abs((true_unscaled - pred_unscaled) / true_unscaled)
) * 100

pd.DataFrame({
    "Model": ["LSTM"],
    "RMSE": [round(lstm_rmse, 1)],
    "MAE": [round(lstm_mae, 1)],
    "MAPE (%)": [round(lstm_mape, 2)],
    "Test horizon": [f"{n_days} days (336 half-hours)"]
})

Predictions vs actuals

The plot below compares the seven-day LSTM forecast with the observed demand. The daily pattern is captured well, although some peaks are slightly underestimated.

fig, ax = plt.subplots(figsize=(11, 4.5))

for i in range(n_days):
    start = i * max_prediction_length
    steps = range(start, start + max_prediction_length)

    ax.plot(
        steps, true_unscaled[i], "k-", alpha=0.8,
        label="Actual" if i == 0 else None
    )
    ax.plot(
        steps, pred_unscaled[i], "--", alpha=0.9,
        label="LSTM forecast" if i == 0 else None
    )

ax.set_xlabel("Half-hour step")
ax.set_ylabel("Demand (MW)")
ax.set_title("LSTM forecast vs actual demand")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Baseline: ETS

As a statistical benchmark, I fit an automatic ETS model on the most recent 30 days of the training period and forecast the same 336 test steps.

from statsforecast import StatsForecast
from statsforecast.models import AutoETS

# reload the unscaled series for ETS
raw = pd.read_csv("vic_elec.csv")
raw["Time"] = pd.to_datetime(raw["Time"])
raw = raw.sort_values("Time").reset_index(drop=True).iloc[48:].reset_index(drop=True)

train_demand = raw.iloc[: train_cutoff + 1]["Demand"].values
test_demand = raw.iloc[val_cutoff + 1 :]["Demand"].values

ets_train = pd.DataFrame({
    "unique_id": "vic_elec",
    "ds": raw.iloc[: train_cutoff + 1]["Time"].tail(48 * 30).values,
    "y": train_demand[-48 * 30 :],
})

sf = StatsForecast(
    models=[AutoETS(season_length=48, model="ZNA")],
    freq="30min",
    n_jobs=1,
)

horizon = n_days * max_prediction_length
fc = sf.forecast(df=ets_train, h=horizon)

ets_pred = fc["AutoETS"].values[:horizon]
ets_true = test_demand[:horizon]

ets_mae = mean_absolute_error(ets_true, ets_pred)
ets_rmse = np.sqrt(mean_squared_error(ets_true, ets_pred))
ets_mape = np.mean(np.abs((ets_true - ets_pred) / ets_true)) * 100

pd.DataFrame({
    "Model": ["LSTM", "ETS"],
    "RMSE": [round(lstm_rmse, 1), round(ets_rmse, 1)],
    "MAE": [round(lstm_mae, 1), round(ets_mae, 1)],
    "MAPE (%)": [round(lstm_mape, 2), round(ets_mape, 2)],
    "Test horizon": [f"{n_days} days (336 half-hours)", f"{n_days} days (336 half-hours)"]
})
  Model    RMSE    MAE  MAPE (%)             Test horizon
0  LSTM   158.0  120.1      2.32  7 days (336 half-hours)
1   ETS  1220.5  997.8     19.16  7 days (336 half-hours)

Model comparison

The table below compares the two models on the same 336-step test horizon.

results = pd.DataFrame({
    "Model": ["LSTM", "ETS"],
    "RMSE": [lstm_rmse, ets_rmse],
    "MAE": [lstm_mae, ets_mae],
    "MAPE (%)": [lstm_mape, ets_mape],
}).round(2)

results.style.hide(axis="index")
Model RMSE MAE MAPE (%)
LSTM 157.990000 120.080000 2.320000
ETS 1220.540000 997.770000 19.160000

The plot below compares the LSTM and ETS forecasts against the observed demand over the same test horizon.

fig, ax = plt.subplots(figsize=(11, 4.5))
steps = np.arange(horizon)

ax.plot(steps, ets_true, "k-", alpha=0.85, label="Actual")
ax.plot(steps, pred_unscaled.ravel(), "--", alpha=0.9, label="LSTM")
ax.plot(steps, ets_pred, ":", alpha=0.9, label="ETS")

ax.set_xlabel("Half-hour step")
ax.set_ylabel("Demand (MW)")
ax.set_title("LSTM and ETS forecasts vs actual demand")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Challenges encountered

A few practical issues came up in this workflow.

  • Overfitting. Training loss fell faster than validation loss, so early stopping was needed.
  • Training time. Even a small LSTM was slow on CPU because the data were half-hourly and the input windows were long.
  • Hyperparameter tuning. Results changed with learning rate, hidden size, and encoder length.
  • Data preparation. Preparing the TimeSeriesDataSet required careful handling of time_idx, split boundaries, and categorical variables.

Part 2 – Real-world application and interpretability

A realistic application is day-ahead electricity demand forecasting for grid operators and energy retailers. Forecasts at half-hour resolution help with generation scheduling, reserve planning, and market decisions. In this example, the LSTM performed better than ETS on the same seven-day test horizon, which suggests that deep learning can be helpful when demand depends on several drivers such as temperature, calendar effects, and short-term seasonal patterns.

The main weakness of the LSTM is interpretability. Although it forecasts well, it is harder to explain why a specific prediction is high or low, whereas ETS is easier to interpret. There are also some practical limitations: recurrent models usually need more historical data, take more computation than ETS, and may be harder to justify to non-technical stakeholders. In future work, interpretability could be improved by using a more interpretable model such as Temporal Fusion Transformer, applying post-hoc methods such as SHAP, or presenting the deep model alongside a simpler statistical baseline.

Conclusion

On the matched seven-day test horizon, the LSTM outperformed the ETS baseline on RMSE, MAE, and MAPE. The trade-off is that it required more tuning, more computation, and offered weaker interpretability.