library(fpp3)
library(readr)
vic_elec |>
as_tibble() |>
write_csv("vic_elec.csv")
cat(nrow(vic_elec), "rows written to vic_elec.csv\n")52608 rows written to vic_elec.csv
In this notebook, I use the vic_elec dataset (Victoria, Australia, 2012–2014) from the fpp3 package to build an LSTM model for half-hourly electricity demand forecasting and compare it with an ETS baseline.
This dataset shows strong daily and weekly seasonality, a clear relationship with temperature, and some holiday-related variation. These characteristics make it a useful case for comparing a recurrent deep learning model with a simpler statistical benchmark.
library(fpp3)
library(readr)
vic_elec |>
as_tibble() |>
write_csv("vic_elec.csv")
cat(nrow(vic_elec), "rows written to vic_elec.csv\n")52608 rows written to vic_elec.csv
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder
from pytorch_forecasting.models import RecurrentNetwork
from pytorch_forecasting.metrics import MAE
import torch
from lightning.pytorch import Trainer, seed_everything
from lightning.pytorch.loggers import CSVLogger
from lightning.pytorch.callbacks import EarlyStopping
seed_everything(66, workers=True)66
The vic_elec data contains half-hourly electricity demand, temperature, calendar information, and a holiday indicator. I convert the time fields to datetime format and sort the data chronologically before creating any time-based features.
df = pd.read_csv("vic_elec.csv")
df["Time"] = pd.to_datetime(df["Time"])
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Time").reset_index(drop=True)
df.head() Time Demand Temperature Date Holiday
0 2011-12-31 13:00:00+00:00 4382.825174 21.40 2012-01-01 True
1 2011-12-31 13:30:00+00:00 4263.365526 21.05 2012-01-01 True
2 2011-12-31 14:00:00+00:00 4048.966046 20.70 2012-01-01 True
3 2011-12-31 14:30:00+00:00 3877.563330 20.55 2012-01-01 True
4 2011-12-31 15:00:00+00:00 4036.229746 20.40 2012-01-01 True
The first two weeks already show the main short-term pattern: repeated daily peaks and lower demand at weekends.
fig, ax = plt.subplots(figsize=(10, 3.5))
ax.plot(df["Time"].iloc[:48 * 14], df["Demand"].iloc[:48 * 14], lw=0.9)
ax.set_ylabel("Demand (MW)")
ax.set_title("Demand over the first two weeks")
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()I create calendar features from the timestamp, together with lag and rolling-demand features required by the assignment. After removing the initial missing rows caused by these transformations, I rebuild a continuous time_idx and define a single group identifier for the series.
df["hour"] = df["Time"].dt.hour
df["minute"] = df["Time"].dt.minute
df["day_of_week"] = df["Time"].dt.dayofweek
df["month"] = df["Time"].dt.month
df["hour_fraction"] = df["hour"] + df["minute"] / 60
df["lag_1"] = df["Demand"].shift(1)
df["lag_48"] = df["Demand"].shift(48)
df["rolling_mean_48"] = df["Demand"].rolling(48).mean()
df["rolling_std_48"] = df["Demand"].rolling(48).std()
df = df.dropna().reset_index(drop=True)
df["time_idx"] = np.arange(len(df))
df["group"] = "vic_elec"
df[["Time", "Demand", "hour", "day_of_week", "month", "lag_1", "lag_48",
"rolling_mean_48", "rolling_std_48", "time_idx", "group"]].head() Time Demand ... time_idx group
0 2012-01-01 13:00:00+00:00 4367.914468 ... 0 vic_elec
1 2012-01-01 13:30:00+00:00 4164.015958 ... 1 vic_elec
2 2012-01-01 14:00:00+00:00 3898.239882 ... 2 vic_elec
3 2012-01-01 14:30:00+00:00 3752.016756 ... 3 vic_elec
4 2012-01-01 15:00:00+00:00 3941.369156 ... 4 vic_elec
[5 rows x 11 columns]
I split the series chronologically into 70% training, 15% validation, and 15% test.
max_prediction_length = 48
max_encoder_length = 336
n_total = len(df)
train_end = int(n_total * 0.70)
val_end = int(n_total * 0.85)
train_cutoff = df.loc[train_end, "time_idx"]
val_cutoff = df.loc[val_end, "time_idx"]Continuous variables are standardised using the training span only.
features_to_scale = ["Demand", "Temperature"]
scaler = StandardScaler()
scaler.fit(df.loc[df["time_idx"] <= train_cutoff, features_to_scale])StandardScaler()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
StandardScaler()
df[features_to_scale] = scaler.transform(df[features_to_scale])
demand_mean = scaler.mean_[0]
demand_std = scaler.scale_[0]I next build the TimeSeriesDataSet used for training, validation, and testing. Temperature and clock-based variables are treated as known inputs, while demand is the target series to be predicted.
categorical_cols = ["hour", "day_of_week", "month", "Holiday"]
for col in categorical_cols:
df[col] = df[col].astype(str).astype("category")
categorical_encoders = {
col: NaNLabelEncoder(add_nan=True) for col in categorical_cols
}
training = TimeSeriesDataSet(
df[df["time_idx"] <= train_cutoff],
time_idx="time_idx",
target="Demand",
group_ids=["group"],
max_encoder_length=max_encoder_length,
max_prediction_length=max_prediction_length,
time_varying_known_reals=["time_idx", "Temperature", "hour_fraction"],
time_varying_unknown_reals=["Demand"],
time_varying_known_categoricals=categorical_cols,
categorical_encoders=categorical_encoders,
add_relative_time_idx=True,
add_encoder_length=True,
)
validation = TimeSeriesDataSet.from_dataset(
training,
df[df["time_idx"] <= val_cutoff],
min_prediction_idx=train_cutoff + 1,
stop_randomization=True,
)
testing = TimeSeriesDataSet.from_dataset(
training,
df,
min_prediction_idx=val_cutoff + 1,
stop_randomization=True,
)
batch_size = 64
train_dl = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dl = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0)
test_dl = testing.to_dataloader(train=False, batch_size=batch_size, num_workers=0)I use an LSTM through the RecurrentNetwork wrapper as the main deep learning model. It is trained with hidden size 32, one recurrent layer, dropout 0.2, learning rate 0.01, MAE loss, and early stopping on validation loss.
lstm_model = RecurrentNetwork.from_dataset(
training,
cell_type="LSTM",
hidden_size=32,
rnn_layers=1,
dropout=0.2,
learning_rate=0.01,
loss=MAE(),
log_interval=0,
log_val_interval=1,
)from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint
logger = CSVLogger("lightning_logs", name="lstm_run")
early_stop = EarlyStopping(monitor="val_loss", patience=3, mode="min")
checkpoint_cb = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)
trainer = Trainer(
max_epochs=15,
accelerator="auto",
logger=logger,
callbacks=[early_stop, checkpoint_cb],
enable_progress_bar=False,
enable_model_summary=False,
log_every_n_steps=50,
)
trainer.fit(lstm_model, train_dl, val_dl)
best_path = checkpoint_cb.best_model_path
best_lstm = RecurrentNetwork.load_from_checkpoint(best_path)metrics = pd.read_csv(os.path.join(logger.log_dir, "metrics.csv"))
fig, ax = plt.subplots(figsize=(8, 4))
if "train_loss_epoch" in metrics.columns:
train_curve = metrics[["epoch", "train_loss_epoch"]].dropna()
ax.plot(train_curve["epoch"], train_curve["train_loss_epoch"], label="train")
if "val_loss" in metrics.columns:
val_curve = metrics[["epoch", "val_loss"]].dropna()
ax.plot(val_curve["epoch"], val_curve["val_loss"], label="validation")
ax.set_xlabel("Epoch")
ax.set_ylabel("Scaled MAE loss")
ax.set_title("Training and validation loss")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()To compare the LSTM and ETS fairly, I evaluate both models on the same fixed seven-day test horizon (336 half-hourly observations) immediately after the validation period.
pred_out = best_lstm.predict(test_dl, return_x=True)
pred = pred_out.output
x = pred_out.x
pred_np = pred.detach().cpu().numpy()
y_true = x["decoder_target"].detach().cpu().numpy()
n_days = 7
pred_days = pred_np[::max_prediction_length][:n_days]
true_days = y_true[::max_prediction_length][:n_days]
pred_unscaled = pred_days * demand_std + demand_mean
true_unscaled = true_days * demand_std + demand_mean
lstm_mae = mean_absolute_error(true_unscaled.ravel(), pred_unscaled.ravel())
lstm_rmse = np.sqrt(mean_squared_error(true_unscaled.ravel(), pred_unscaled.ravel()))
lstm_mape = np.mean(
np.abs((true_unscaled - pred_unscaled) / true_unscaled)
) * 100
pd.DataFrame({
"Model": ["LSTM"],
"RMSE": [round(lstm_rmse, 1)],
"MAE": [round(lstm_mae, 1)],
"MAPE (%)": [round(lstm_mape, 2)],
"Test horizon": [f"{n_days} days (336 half-hours)"]
})The plot below compares the seven-day LSTM forecast with the observed demand. The daily pattern is captured well, although some peaks are slightly underestimated.
fig, ax = plt.subplots(figsize=(11, 4.5))
for i in range(n_days):
start = i * max_prediction_length
steps = range(start, start + max_prediction_length)
ax.plot(
steps, true_unscaled[i], "k-", alpha=0.8,
label="Actual" if i == 0 else None
)
ax.plot(
steps, pred_unscaled[i], "--", alpha=0.9,
label="LSTM forecast" if i == 0 else None
)
ax.set_xlabel("Half-hour step")
ax.set_ylabel("Demand (MW)")
ax.set_title("LSTM forecast vs actual demand")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()As a statistical benchmark, I fit an automatic ETS model on the most recent 30 days of the training period and forecast the same 336 test steps.
from statsforecast import StatsForecast
from statsforecast.models import AutoETS
# reload the unscaled series for ETS
raw = pd.read_csv("vic_elec.csv")
raw["Time"] = pd.to_datetime(raw["Time"])
raw = raw.sort_values("Time").reset_index(drop=True).iloc[48:].reset_index(drop=True)
train_demand = raw.iloc[: train_cutoff + 1]["Demand"].values
test_demand = raw.iloc[val_cutoff + 1 :]["Demand"].values
ets_train = pd.DataFrame({
"unique_id": "vic_elec",
"ds": raw.iloc[: train_cutoff + 1]["Time"].tail(48 * 30).values,
"y": train_demand[-48 * 30 :],
})
sf = StatsForecast(
models=[AutoETS(season_length=48, model="ZNA")],
freq="30min",
n_jobs=1,
)
horizon = n_days * max_prediction_length
fc = sf.forecast(df=ets_train, h=horizon)
ets_pred = fc["AutoETS"].values[:horizon]
ets_true = test_demand[:horizon]
ets_mae = mean_absolute_error(ets_true, ets_pred)
ets_rmse = np.sqrt(mean_squared_error(ets_true, ets_pred))
ets_mape = np.mean(np.abs((ets_true - ets_pred) / ets_true)) * 100
pd.DataFrame({
"Model": ["LSTM", "ETS"],
"RMSE": [round(lstm_rmse, 1), round(ets_rmse, 1)],
"MAE": [round(lstm_mae, 1), round(ets_mae, 1)],
"MAPE (%)": [round(lstm_mape, 2), round(ets_mape, 2)],
"Test horizon": [f"{n_days} days (336 half-hours)", f"{n_days} days (336 half-hours)"]
}) Model RMSE MAE MAPE (%) Test horizon
0 LSTM 158.0 120.1 2.32 7 days (336 half-hours)
1 ETS 1220.5 997.8 19.16 7 days (336 half-hours)
The table below compares the two models on the same 336-step test horizon.
results = pd.DataFrame({
"Model": ["LSTM", "ETS"],
"RMSE": [lstm_rmse, ets_rmse],
"MAE": [lstm_mae, ets_mae],
"MAPE (%)": [lstm_mape, ets_mape],
}).round(2)
results.style.hide(axis="index")| Model | RMSE | MAE | MAPE (%) |
|---|---|---|---|
| LSTM | 157.990000 | 120.080000 | 2.320000 |
| ETS | 1220.540000 | 997.770000 | 19.160000 |
The plot below compares the LSTM and ETS forecasts against the observed demand over the same test horizon.
fig, ax = plt.subplots(figsize=(11, 4.5))
steps = np.arange(horizon)
ax.plot(steps, ets_true, "k-", alpha=0.85, label="Actual")
ax.plot(steps, pred_unscaled.ravel(), "--", alpha=0.9, label="LSTM")
ax.plot(steps, ets_pred, ":", alpha=0.9, label="ETS")
ax.set_xlabel("Half-hour step")
ax.set_ylabel("Demand (MW)")
ax.set_title("LSTM and ETS forecasts vs actual demand")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()A few practical issues came up in this workflow.
TimeSeriesDataSet required careful handling of time_idx, split boundaries, and categorical variables.A realistic application is day-ahead electricity demand forecasting for grid operators and energy retailers. Forecasts at half-hour resolution help with generation scheduling, reserve planning, and market decisions. In this example, the LSTM performed better than ETS on the same seven-day test horizon, which suggests that deep learning can be helpful when demand depends on several drivers such as temperature, calendar effects, and short-term seasonal patterns.
The main weakness of the LSTM is interpretability. Although it forecasts well, it is harder to explain why a specific prediction is high or low, whereas ETS is easier to interpret. There are also some practical limitations: recurrent models usually need more historical data, take more computation than ETS, and may be harder to justify to non-technical stakeholders. In future work, interpretability could be improved by using a more interpretable model such as Temporal Fusion Transformer, applying post-hoc methods such as SHAP, or presenting the deep model alongside a simpler statistical baseline.
On the matched seven-day test horizon, the LSTM outperformed the ETS baseline on RMSE, MAE, and MAPE. The trade-off is that it required more tuning, more computation, and offered weaker interpretability.