DATA624 - Homework 3

Author

Anthony Josue Roman

Exercise 5.1

Produce forecasts for the following series using whichever of Naive(), SeasonalNaive() or RandomWalkWithDrift() is more appropriate in each case:

Australian Population (global_economy)

Code
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

global_econ = pd.read_csv("global_economy.csv", parse_dates=["ds"])

aus_pop = (
    global_econ[global_econ["unique_id"] == "Australia"]
    [["unique_id", "ds", "Population"]]
    .rename(columns={"Population": "y"})
)

rwd = RandomWalkWithDrift()
sf = StatsForecast(models=[rwd], freq="Y")

forecast_pop = sf.forecast(df=aus_pop, h=10, level=[80, 95])
forecast_pop
unique_id ds RWD RWD-lo-80 RWD-lo-95 RWD-hi-80 RWD-hi-95
0 Australia 2017-12-31 2.485020e+07 2.475083e+07 2.469823e+07 2.494958e+07 2.500218e+07
1 Australia 2018-12-31 2.510148e+07 2.495973e+07 2.488470e+07 2.524322e+07 2.531825e+07
2 Australia 2019-12-31 2.535275e+07 2.517768e+07 2.508501e+07 2.552781e+07 2.562048e+07
3 Australia 2020-12-31 2.560402e+07 2.540020e+07 2.529230e+07 2.580784e+07 2.591574e+07
4 Australia 2021-12-31 2.585529e+07 2.562555e+07 2.550393e+07 2.608503e+07 2.620664e+07
5 Australia 2022-12-31 2.610656e+07 2.585287e+07 2.571858e+07 2.636025e+07 2.649454e+07
6 Australia 2023-12-31 2.635783e+07 2.608165e+07 2.593545e+07 2.663401e+07 2.678021e+07
7 Australia 2024-12-31 2.660910e+07 2.631155e+07 2.615404e+07 2.690665e+07 2.706416e+07
8 Australia 2025-12-31 2.686037e+07 2.654236e+07 2.637401e+07 2.717839e+07 2.734674e+07
9 Australia 2026-12-31 2.711164e+07 2.677390e+07 2.659510e+07 2.744939e+07 2.762818e+07
Code
import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

global_econ = pd.read_csv("global_economy.csv", parse_dates=["ds"])

aus_pop = (
    global_econ.loc[global_econ["unique_id"] == "Australia", ["unique_id", "ds", "Population"]]
    .rename(columns={"Population": "y"})
    .dropna()
)

rwd = RandomWalkWithDrift(alias="RWD")
sf = StatsForecast(models=[rwd], freq="YE")
forecast_pop = sf.forecast(df=aus_pop, h=10, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(aus_pop["ds"], aus_pop["y"], linewidth=1, label="History")
ax.plot(forecast_pop["ds"], forecast_pop["RWD"], linewidth=2, label="Forecast")
ax.fill_between(forecast_pop["ds"], forecast_pop["RWD-lo-95"], forecast_pop["RWD-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_pop["ds"], forecast_pop["RWD-lo-80"], forecast_pop["RWD-hi-80"], alpha=0.35, label="80% PI")
ax.set_title("Australian Population Forecast (Random Walk with Drift)")
ax.set_xlabel("Year")
ax.set_ylabel("Population")
ax.legend()
fig.tight_layout()
plt.show()

According to the Australian population series, there is a steady and consistent increase to the overall population with no noticeable yearly fluctuation. Therefore, a Random Walk with Drift model is suitable for continuing gradual long-term growth, by projecting the previously measured average increase into the future. The forecast results in a smooth continuation of the trend and agrees with what has happened in the past with respect to the population size. The forecast intervals show an increasing interval width over time; demonstrating an increase in unpredictability as the forecast horizon expands. This is also consistent with the nature of this model.

Bricks (aus_production)

Code
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_prod = pd.read_csv("aus_production.csv")

time_col = next(c for c in ["ds","date","Date","quarter","Quarter","month","Month","time","Time"] if c in aus_prod.columns)

ds = pd.to_datetime(aus_prod[time_col], errors="coerce")

if ds.isna().all():
    ds = pd.PeriodIndex(aus_prod[time_col].astype(str), freq="Q").to_timestamp()

bricks = pd.DataFrame({
    "unique_id": "Bricks",
    "ds": ds,
    "y": aus_prod["Bricks"]
}).dropna()

snaive = SeasonalNaive(4, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="QE")

forecast_bricks = sf.forecast(df=bricks, h=8, level=[80, 95])

forecast_bricks
unique_id ds SNaive SNaive-lo-80 SNaive-lo-95 SNaive-hi-80 SNaive-hi-95
0 Bricks 2005-06-30 428.0 366.061796 333.273691 489.938204 522.726309
1 Bricks 2005-09-30 397.0 335.061796 302.273691 458.938204 491.726309
2 Bricks 2005-12-31 355.0 293.061796 260.273691 416.938204 449.726309
3 Bricks 2006-03-31 435.0 373.061796 340.273691 496.938204 529.726309
4 Bricks 2006-06-30 428.0 340.406152 294.036769 515.593848 561.963231
5 Bricks 2006-09-30 397.0 309.406152 263.036769 484.593848 530.963231
6 Bricks 2006-12-31 355.0 267.406152 221.036769 442.593848 488.963231
7 Bricks 2007-03-31 435.0 347.406152 301.036769 522.593848 568.963231
Code
import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_prod = pd.read_csv("aus_production.csv")

time_col = next(c for c in ["ds","date","Date","quarter","Quarter","month","Month","time","Time"] if c in aus_prod.columns)
ds = pd.to_datetime(aus_prod[time_col], errors="coerce")

if ds.isna().all():
    ds = pd.PeriodIndex(aus_prod[time_col].astype(str), freq="Q").to_timestamp()

bricks = pd.DataFrame({"unique_id": "Bricks", "ds": ds, "y": aus_prod["Bricks"]}).dropna()

snaive = SeasonalNaive(4, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="QE")
forecast_bricks = sf.forecast(df=bricks, h=8, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(bricks["ds"], bricks["y"], linewidth=1, label="History")
ax.plot(forecast_bricks["ds"], forecast_bricks["SNaive"], linewidth=2, label="Forecast")
ax.fill_between(forecast_bricks["ds"], forecast_bricks["SNaive-lo-95"], forecast_bricks["SNaive-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_bricks["ds"], forecast_bricks["SNaive-lo-80"], forecast_bricks["SNaive-hi-80"], alpha=0.35, label="80% PI")
ax.set_title("Bricks Forecast (Seasonal Naive, period=4)")
ax.set_xlabel("Quarter")
ax.set_ylabel("Bricks")
ax.legend()
fig.tight_layout()
plt.show()

The bricks product line has strong quarterly seasonality, with distinct patterns of seasons comprising peaks and valleys each year, along with noticeable shorter-term fluctuations. Therefore, it is appropriate to use a Seasonal Naive model with a periodicity of 4 for forecasting because this model will repeat values that were realized in the same quarter of the previous year. The forecast continues the established seasonal pattern without adding any changes to trends formed from past observations. The forecast’s prediction intervals widen slightly as the forecast period lengthens, so they accommodate for uncertainty while still capturing the seasonality that has been previously observed in the data.

NSW Lambs (aus_livestock)

Code
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_livestock = pd.read_csv("aus_livestock.csv", parse_dates=["ds"])

uid = next(u for u in aus_livestock["unique_id"].unique() if "Lamb" in str(u))

lambs = aus_livestock.loc[aus_livestock["unique_id"] == uid, ["unique_id", "ds", "y"]].dropna()

snaive = SeasonalNaive(12, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="ME")

forecast_lambs = sf.forecast(df=lambs, h=12, level=[80, 95])

forecast_lambs
unique_id ds SNaive SNaive-lo-80 SNaive-lo-95 SNaive-hi-80 SNaive-hi-95
0 Australian Capital Territory_Lambs 2018-12-31 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
1 Australian Capital Territory_Lambs 2019-01-31 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
2 Australian Capital Territory_Lambs 2019-02-28 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
3 Australian Capital Territory_Lambs 2019-03-31 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
4 Australian Capital Territory_Lambs 2019-04-30 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
5 Australian Capital Territory_Lambs 2019-05-31 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
6 Australian Capital Territory_Lambs 2019-06-30 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
7 Australian Capital Territory_Lambs 2019-07-31 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
8 Australian Capital Territory_Lambs 2019-08-31 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
9 Australian Capital Territory_Lambs 2019-09-30 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
10 Australian Capital Territory_Lambs 2019-10-31 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
11 Australian Capital Territory_Lambs 2019-11-30 0.0 -7936.301301 -12137.525433 7936.301301 12137.525433
Code
import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_livestock = pd.read_csv("aus_livestock.csv", parse_dates=["ds"])

uid = next(u for u in aus_livestock["unique_id"].unique() if ("New South Wales" in str(u)) and ("Lamb" in str(u)))

lambs = (
    aus_livestock.loc[aus_livestock["unique_id"] == uid, ["unique_id", "ds", "y"]]
    .dropna()
    .sort_values("ds")
)

lambs = lambs.loc[lambs["y"] > 0]

snaive = SeasonalNaive(12, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="ME")
forecast_lambs = sf.forecast(df=lambs, h=12, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(lambs["ds"], lambs["y"], linewidth=1, label="History")
ax.plot(forecast_lambs["ds"], forecast_lambs["SNaive"], linewidth=2, label="Forecast")
ax.fill_between(forecast_lambs["ds"], forecast_lambs["SNaive-lo-95"], forecast_lambs["SNaive-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_lambs["ds"], forecast_lambs["SNaive-lo-80"], forecast_lambs["SNaive-hi-80"], alpha=0.35, label="80% PI")
ax.set_title(f"{uid} Forecast (Seasonal Naive, period=12)")
ax.set_xlabel("Year")
ax.set_ylabel("Lambs")
ax.legend()
fig.tight_layout()
plt.show()

The Lamb Series varies significantly both seasonally and short term (month to month) from one year to another and continues to show evidence of both. With the presence of monthly data and repeating seasonal behavior, it is appropriate to use a seasonal naive model with a period of 12, by repeating the value of the same month from the previous year. The prediction interval continues the previously established seasonal behavior and does not impose any additional trend structure on the forecast. As the forecast period extends into the future, the width of prediction intervals will continue to increase, indicating increasing uncertainty but sustaining the previously observed seasonal behavior.

Household wealth (hh_budget)

Code
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

hh = pd.read_csv("hh_budget.csv")

wealth = (
    hh.loc[hh["Country"] == "Australia", ["Year", "Wealth"]]
    .rename(columns={"Year": "ds", "Wealth": "y"})
)

wealth["ds"] = pd.to_datetime(wealth["ds"].astype(str))

wealth["unique_id"] = "Household_Wealth"
wealth = wealth[["unique_id", "ds", "y"]]

rwd = RandomWalkWithDrift(alias="RWD")
sf = StatsForecast(models=[rwd], freq="YE")

forecast_wealth = sf.forecast(df=wealth, h=10, level=[80, 95])

forecast_wealth
unique_id ds RWD RWD-lo-80 RWD-lo-95 RWD-hi-80 RWD-hi-95
0 Household_Wealth 2016-12-31 427.069971 397.454807 381.777486 456.685136 472.362457
1 Household_Wealth 2017-12-31 432.167043 389.343586 366.674217 474.990500 497.659869
2 Household_Wealth 2018-12-31 437.264114 383.688266 355.326925 490.839963 519.201304
3 Household_Wealth 2019-12-31 442.361186 379.221443 345.797282 505.500929 538.925089
4 Household_Wealth 2020-12-31 447.458257 375.467874 337.358467 519.448640 557.558047
5 Household_Wealth 2021-12-31 452.555329 372.191556 329.649545 532.919101 575.461113
6 Household_Wealth 2022-12-31 457.652400 369.256760 322.462935 546.048040 592.841865
7 Household_Wealth 2023-12-31 462.749471 366.577880 315.667715 558.921063 609.831228
8 Household_Wealth 2024-12-31 467.846543 364.097359 309.175858 571.595727 626.517228
9 Household_Wealth 2025-12-31 472.943614 361.774625 302.925317 584.112603 642.961911
Code
import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

hh = pd.read_csv("hh_budget.csv")

wealth = (
    hh.loc[hh["Country"] == "Australia", ["Year", "Wealth"]]
    .rename(columns={"Year": "ds", "Wealth": "y"})
    .dropna()
)

wealth["ds"] = pd.to_datetime(wealth["ds"].astype(str))
wealth["unique_id"] = "Household_Wealth"
wealth = wealth[["unique_id", "ds", "y"]]

rwd = RandomWalkWithDrift(alias="RWD")
sf = StatsForecast(models=[rwd], freq="YE")
forecast_wealth = sf.forecast(df=wealth, h=10, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(wealth["ds"], wealth["y"], linewidth=1, label="History")
ax.plot(forecast_wealth["ds"], forecast_wealth["RWD"], linewidth=2, label="Forecast")
ax.fill_between(forecast_wealth["ds"], forecast_wealth["RWD-lo-95"], forecast_wealth["RWD-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_wealth["ds"], forecast_wealth["RWD-lo-80"], forecast_wealth["RWD-hi-80"], alpha=0.35, label="80% PI")
ax.set_title("Household Wealth Forecast (Random Walk with Drift)")
ax.set_xlabel("Year")
ax.set_ylabel("Wealth")
ax.legend()
fig.tight_layout()
plt.show()

The series of household wealth represents a general increase in value; however there have been periods where the data appears to fluctuate greatly as a result of economic volatility. The lack of any seasonality and consistent growth over time justifies the use of a random walk with drift model as it allows the average growth rate from historical data to continue into future years. The forecast is consistent with the long term trend of increasing household wealth shown by the data. As the forecast time frame increases, the prediction intervals expand dramatically due to the increased uncertainty associated with longer timeframes as is typical with these models.

Australian takeaway food turnover (aus_retail)

Code
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_retail = pd.read_csv("aus_retail.csv")

date_col = next(c for c in ["ds","Month","month","Date","date"] if c in aus_retail.columns)

takeaway = (
    aus_retail.loc[aus_retail["Industry"] == "Takeaway food services", [date_col, "Turnover"]]
    .rename(columns={date_col: "ds", "Turnover": "y"})
)

takeaway["ds"] = pd.to_datetime(takeaway["ds"])

takeaway["unique_id"] = "Takeaway food services"
takeaway = takeaway[["unique_id", "ds", "y"]].dropna()

snaive = SeasonalNaive(12, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="ME")

forecast_takeaway = sf.forecast(df=takeaway, h=12, level=[80, 95])

forecast_takeaway
unique_id ds SNaive SNaive-lo-80 SNaive-lo-95 SNaive-hi-80 SNaive-hi-95
0 Takeaway food services 2018-12-31 92.3 -126.608211 -242.491218 311.208211 427.091218
1 Takeaway food services 2019-01-31 28.9 -190.008211 -305.891218 247.808211 363.691218
2 Takeaway food services 2019-02-28 354.9 135.991789 20.108782 573.808211 689.691218
3 Takeaway food services 2019-03-31 183.7 -35.208211 -151.091218 402.608211 518.491218
4 Takeaway food services 2019-04-30 29.1 -189.808211 -305.691218 248.008211 363.891218
5 Takeaway food services 2019-05-31 630.3 411.391789 295.508782 849.208211 965.091218
6 Takeaway food services 2019-06-30 20.4 -198.508211 -314.391218 239.308211 355.191218
7 Takeaway food services 2019-07-31 358.7 139.791789 23.908782 577.608211 693.491218
8 Takeaway food services 2019-08-31 99.7 -119.208211 -235.091218 318.608211 434.491218
9 Takeaway food services 2019-09-30 31.5 -187.408211 -303.291218 250.408211 366.291218
10 Takeaway food services 2019-10-31 393.2 174.291789 58.408782 612.108211 727.991218
11 Takeaway food services 2019-11-30 195.1 -23.808211 -139.691218 414.008211 529.891218
Code
import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_retail = pd.read_csv("aus_retail.csv")

takeaway = (
    aus_retail.loc[aus_retail["Industry"] == "Takeaway food services", ["Month", "Turnover"]]
    .rename(columns={"Month": "ds", "Turnover": "y"})
)

takeaway["ds"] = pd.to_datetime(takeaway["ds"])

takeaway = (
    takeaway.groupby("ds", as_index=False)["y"]
    .sum()
    .sort_values("ds")
)

takeaway["unique_id"] = "Australia_Takeaway food services"
takeaway = takeaway[["unique_id", "ds", "y"]]

snaive = SeasonalNaive(12, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="ME", n_jobs=1)

forecast_takeaway = sf.forecast(df=takeaway, h=12, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(takeaway["ds"], takeaway["y"], linewidth=1, label="History")
ax.plot(forecast_takeaway["ds"], forecast_takeaway["SNaive"], linewidth=2, label="Forecast")
ax.fill_between(forecast_takeaway["ds"], forecast_takeaway["SNaive-lo-95"], forecast_takeaway["SNaive-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_takeaway["ds"], forecast_takeaway["SNaive-lo-80"], forecast_takeaway["SNaive-hi-80"], alpha=0.35, label="80% PI")
ax.set_title("Australian Takeaway Food Turnover Forecast (Seasonal Naive, period=12)")
ax.set_xlabel("Month")
ax.set_ylabel("Turnover")
ax.legend()
fig.tight_layout()
plt.show()

There is considerable seasonality in the monthly turnover pattern of takeaway food turnover, with distinct peaks and troughs occurring every twelve months and there has been a clear increase in turnover over time. Because the pattern is clearly seasonal, a Seasonal Naive model having a period of 12 is a suitable method of forecasting, as it simply uses the most recent value of the same month from the prior year as the basis for making future forecasts. By using this method, the seasonal component is preserved while no trend adjustments will be made to the existing trend. The prediction intervals of the forecast will increase in width as the forecast period increases, with the prediction intervals changing according to how the results from the forecast differ from the expected return in monthly turnover based upon the seasonal pattern seen in historically recorded takeaway food turnover.

Exercise 5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

  1. Produce a time plot of the series.
Code
import pandas as pd
import matplotlib.pyplot as plt

gafa = pd.read_csv("gafa_stock.csv", parse_dates=["ds"])

fb = gafa.loc[gafa["unique_id"] == "FB_Close", ["unique_id", "ds", "y"]].copy()
fb = fb.sort_values("ds")

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(fb["ds"], fb["y"])
ax.set_title("Facebook Stock Price (Close)")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
fig.tight_layout()
plt.show()

From 2014-2019, Facebook’s closing stock price is shown to have a considerable amount of short-term volatility but mainly continues to rise overall through the majority of that time frame. The stock price continually rises up until around 2018 at which point there is an obvious decrease in stock value. The data does not have a seasonality component or repeating pattern within it. The entire series operates like an ordinary financial time series characterized by movements upward and downward and continuously changing levels over time.

  1. Produce forecasts using the drift method and plot them.
Code
from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

rwd = RandomWalkWithDrift(alias="RWD")
sf = StatsForecast(models=[rwd], freq="B", n_jobs=1)

h = 30
fc_rwd = sf.forecast(df=fb, h=h, level=[80,95])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(fb["ds"], fb["y"], label="History")
ax.plot(fc_rwd["ds"], fc_rwd["RWD"], label="Drift Forecast")
ax.fill_between(fc_rwd["ds"], fc_rwd["RWD-lo-95"], fc_rwd["RWD-hi-95"], alpha=0.2)
ax.fill_between(fc_rwd["ds"], fc_rwd["RWD-lo-80"], fc_rwd["RWD-hi-80"], alpha=0.35)
ax.set_title("Facebook Drift Forecast")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.legend()
fig.tight_layout()
plt.show()

Facebook’s stock price is projected into the future as a straight line based on the average of historical variations from Facebook’s stock price. Hence, there has been a slight increase to the total projection because of the upward trend across the entire dataset. The confidence intervals get wider over the extended forecast period demonstrating that uncertainty continues to increase further out in time; however because of considerable short term fluctuations in stock prices, it is uncertain that using a drift method of forecasting will provide a complete understanding of any rapid or significant shifts or reversals in stock prices.

  1. Show that the forecasts are identical to extending the line drawn between the first and last observations.
Code
import numpy as np

fb2 = fb.reset_index(drop=True)

y0 = fb2["y"].iloc[0]
yT = fb2["y"].iloc[-1]
n = len(fb2)

slope = (yT - y0) / (n - 1)

line_ext = np.array([yT + slope*k for k in range(1, h+1)])

comparison = pd.DataFrame({
    "Forecast": fc_rwd["RWD"].values,
    "LineExtension": line_ext,
    "Difference": fc_rwd["RWD"].values - line_ext
})

comparison.head()
Forecast LineExtension Difference
0 131.150760 131.150760 0.0
1 131.211523 131.211523 0.0
2 131.272287 131.272287 0.0
3 131.333051 131.333051 0.0
4 131.393815 131.393815 0.0

Drift forecast mathematically represents the continuous projection of linearly extending the line connecting the initial and last observation in the training set, and it will yield identical results when comparing the calculated drift forecast with manually calculating linear projection of the same last observation to the last drift forecast and finding an identical outcome (a difference of 0) demonstrating that this method is used for predicting future value(s) based upon an average slope of historical series.

  1. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
Code
from statsforecast.models import Naive

naive = Naive(alias="Naive")
sf2 = StatsForecast(models=[naive, rwd], freq="B", n_jobs=1)

fc2 = sf2.forecast(df=fb, h=h)

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(fb["ds"], fb["y"], label="History")
ax.plot(fc2["ds"], fc2["Naive"], label="Naive")
ax.plot(fc2["ds"], fc2["RWD"], label="Drift")
ax.set_title("Facebook Benchmark Comparison")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.legend()
fig.tight_layout()
plt.show()

The Naive and Drift forecasts are comparable over short horizons; however, the Drift forecast has a slight upward slope due to the addition of the average historical increases over the entire time-series of data. In contrast, the current most recent observations are declining, which is not adequately captured in the Drift forecast. The Naive forecast simply takes the last observation and utilizes that as the next value, thereby making the Naive forecast much more consistent with the random walk behavior observed in stock prices. Therefore, the Naive forecast will usually be a more appropriate benchmark for short-term forecasting of daily stock prices.

Exercise 5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive
from statsmodels.tsa.stattools import acf

aus_production = pd.read_csv("aus_production.csv")

date_col = next(c for c in ["ds","date","Date","quarter","Quarter"] if c in aus_production.columns)
aus_production[date_col] = pd.to_datetime(aus_production[date_col], errors="coerce")
aus_production = aus_production.dropna(subset=[date_col])

y_col = "Beer" if "Beer" in aus_production.columns else "y"

recent_production = (
    aus_production.loc[aus_production[date_col].dt.year >= 1992, [date_col, y_col]]
    .rename(columns={date_col: "ds", y_col: "y"})
    .dropna()
    .sort_values("ds")
)
recent_production["unique_id"] = "Beer"
recent_production = recent_production[["unique_id", "ds", "y"]]

snaive = SeasonalNaive(4, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="QE", n_jobs=1)

preds = sf.forecast(df=recent_production, h=10, fitted=True)
insample = sf.forecast_fitted_values()

insample = insample.merge(
    recent_production.rename(columns={"y": "y_actual"}),
    on=["unique_id", "ds"],
    how="inner"
)
resid = insample["y_actual"] - insample["SNaive"]

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.plot(insample["ds"], resid, linewidth=1)
ax.axhline(0, linewidth=1)
ax.set_title("Residuals over time (Seasonal Naive)")
ax.set_xlabel("Quarter")
ax.set_ylabel("Residual")
fig.tight_layout()
plt.show()

r = resid.dropna().to_numpy()
acf_vals = acf(r, nlags=24, fft=True)
n = len(r)
conf = 1.96 / np.sqrt(n)

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.stem(range(len(acf_vals)), acf_vals, basefmt=" ")
ax.fill_between([0, len(acf_vals)-1], [-conf, -conf], [conf, conf], alpha=0.2)
ax.set_title("Residual ACF")
ax.set_xlabel("Lag")
ax.set_ylabel("ACF")
fig.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.hist(r, bins=20)
ax.set_title("Residual histogram")
ax.set_xlabel("Residual")
ax.set_ylabel("Count")
fig.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.plot(recent_production["ds"], recent_production["y"], linewidth=1, label="History")
ax.plot(preds["ds"], preds["SNaive"], linewidth=2, label="Forecast")
ax.set_title("Beer production forecast (Seasonal Naive, period=4)")
ax.set_xlabel("Quarter")
ax.set_ylabel("Beer")
ax.legend()
fig.tight_layout()
plt.show()

The seasonal naïve model (evaluation period of four) clearly captures the strong seasonal nature of Australian beer production over four quarters per year. The original pattern observed in the historical beer production data will continue to have around 25% similarities to that of the seasonally naïve forecasting model (i.e., the forecast is essentially a replication of the last four quarters of the actual value, also known as forecast replication).

The residual time series plot shows that residuals fluctuate around zero (the average residual), but the magnitude of the fluctuations are almost evenly distributed between the type of trend that has been fit to the model and no historical trend; therefore, the model does capture most of the seasonality. However, there are still a large number of large positive residuals and a similar number of large negative residuals remaining in the residuals distribution; therefore, there is still a significant amount of variation remaining that is not captured by the model.

The residual ACF shows at least one statistically significant spike (clearly at lag four), which means that the residuals also contain some degree of seasonal autocorrelation. If the residuals were pure white noise, then the vast majority of the spikes would not exceed the 95% confidence boundaries.

The histogram of the residuals is centre approximately on the zero mark; however, the residuals are also very spread out and do not show a perfect symmetric distribution.

Based on the data collected from the residual analysis, the residuals are not perfect white noise due to the presence of remaining autocorrelation, and while the unchanged productivity delivered by the seasonal naïve modelling approach provides a reasonable baseline for future comparisons, more advanced seasonal modelling could result in improved production forecast accuracy.

Exercise 5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of Naive() or SeasonlNaive() is more appropriate in each case.

Australian Exports (global_economy)

Code
import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import Naive

ge = pd.read_csv("global_economy.csv")

ge["ds"] = pd.to_datetime(ge["ds"].astype(str), format="%Y")

aus_exports = ge[
    ge["unique_id"] == "Australia"
][["ds", "Exports"]].dropna().sort_values("ds")

aus_exports = aus_exports.rename(columns={"Exports": "y"})
aus_exports["unique_id"] = "Australia"
aus_exports = aus_exports[["unique_id", "ds", "y"]]

sf = StatsForecast(models=[Naive(alias="Naive")], freq="Y", n_jobs=1)
preds = sf.forecast(df=aus_exports, h=10)

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(aus_exports["ds"], aus_exports["y"], linewidth=1, label="History")
ax.plot(preds["ds"], preds["Naive"], linewidth=2, label="Forecast")
ax.set_title("Australian Exports Forecast (Naive)")
ax.set_xlabel("Year")
ax.set_ylabel("Exports")
ax.legend()
plt.show()

Bricks (aus_production)

Code
from statsforecast.models import SeasonalNaive

aus_production = pd.read_csv("aus_production.csv")
aus_production["ds"] = pd.to_datetime(aus_production["ds"])

bricks = aus_production[["ds","Bricks"]].dropna().copy()
bricks["unique_id"] = "Bricks"
bricks = bricks[["unique_id","ds","Bricks"]].rename(columns={"Bricks":"y"})

model = SeasonalNaive(4, alias="SNaive")
sf = StatsForecast(models=[model], freq="Q", n_jobs=1)

preds = sf.forecast(df=bricks, h=8, fitted=True)
fitted_vals = sf.forecast_fitted_values()

resid = fitted_vals["y"] - fitted_vals["SNaive"]

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(bricks["ds"], bricks["y"], label="History")
ax.plot(preds["ds"], preds["SNaive"], label="Forecast")
ax.set_title("Bricks Forecast (Seasonal Naive, period=4)")
ax.legend()
plt.show()

The Naive approach was applied to the Australian Exports series data (no obvious correlation). The exports have increased continuously plus inconsistently over time, however, there is no seasonal variation in them. The Naive forecasting method uses the last data point (observation) to produce a repetitive pattern of prediction into the future. Therefore, since there is an optimistic trend for increasing exports into the future, the Naive approach will not give a good representation of the overall trend because the values would look like flat lines rather than an increasing pattern. Thus, the Naive approach can only provide a basic benchmark of the data, not an accurate and legitimate long-range forecast.

The Seasonal Naive forecasting method (period of 4) is used in conjunction with the aus_production Bricks series location/volume. The use of this approach was appropriate because the data demonstrated standard repeating seasonal patterns (seasonally positive high peaks and low valleys) for each year. The Seasonal Naive method will replicate the previous year’s seasonal pattern of one year prior in how it forecasts future activity because it is inherently repetitive in nature - this will produce quarterly forecast error variance that reflects the historical trend of the past few years (i.e., provide for cyclic Quarterly Forecast error variance rather than a consistency of trend throughout each quarter).

Exercise 5.7

For your retail time series (from Exercise 7 in Section 2.10):

  1. Create a training dataset consisting of observations before 2011
Code
import numpy as np
import pandas as pd

aus_retail = pd.read_csv("aus_retail.csv")

if "ds" not in aus_retail.columns:
    date_col = next(c for c in ["Month", "month", "Date", "date"] if c in aus_retail.columns)
    aus_retail[date_col] = pd.to_datetime(aus_retail[date_col], errors="coerce")
    aus_retail = aus_retail.rename(columns={date_col: "ds"})

if "y" not in aus_retail.columns:
    y_col = next(c for c in ["Turnover", "turnover", "value", "Value"] if c in aus_retail.columns)
    aus_retail = aus_retail.rename(columns={y_col: "y"})

if "unique_id" not in aus_retail.columns:
    if "Series ID" in aus_retail.columns:
        aus_retail = aus_retail.rename(columns={"Series ID": "unique_id"})
    else:
        id_candidates = [c for c in ["Industry", "State", "Region", "Series"] if c in aus_retail.columns]
        aus_retail["unique_id"] = aus_retail[id_candidates].astype(str).agg("_".join, axis=1)

aus_retail["ds"] = pd.to_datetime(aus_retail["ds"], errors="coerce")
aus_retail["y"] = pd.to_numeric(aus_retail["y"], errors="coerce")
aus_retail = aus_retail.dropna(subset=["ds", "y"]).sort_values(["unique_id", "ds"])

rng = np.random.default_rng(12345678)
chosen_id = rng.choice(aus_retail["unique_id"].unique(), 1)[0]

data = aus_retail.loc[aus_retail["unique_id"] == chosen_id, ["unique_id", "ds", "y"]].copy()
train = data.loc[data["ds"] < "2011-01-01"].copy()
test  = data.loc[data["ds"] >= "2011-01-01"].copy()

chosen_id, train.shape, test.shape
('A3349908R', (345, 3), (96, 3))

The dataset identified by the identifier A3349908R was separated into a training set of all observations up until the end of 2010 (345) and assigned a test set containing an observations from 2011 and on​ (96). As the tests will be performed over the future documentation, the evaluation of the model will be based on the unobserved future data.

  1. Check that your data have been split appropriately by producing the following plot.
Code
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(train["ds"], train["y"], label="Train")
ax.plot(test["ds"], test["y"], label="Test")
ax.set_title(f"Retail series split (unique_id = {chosen_id})")
ax.set_xlabel("Date")
ax.set_ylabel("y")
ax.legend()
fig.tight_layout()
plt.show()

The plot provides evidence that the split between training and testing data occurred correctly. The training data ends just prior to (2011) and the testing data begins in (2011) without any overlap. Time ordering is retained in both segments as well since a strong seasonality can be observed even in both data sets. Test Data appear to be significantly elevated relative to training data indicating potential upward movement beyond (2011).

  1. Fit a seasonal naïve model using SeasonalNaive() applied to your training data.
Code
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

m = 12

sf = StatsForecast(
    models=[SeasonalNaive(m, alias="SNaive")],
    freq="ME",
    n_jobs=1
)

sf.fit(train)

sf

sf.forecast(df=train, h=len(test), fitted=True)
fitted_vals = sf.forecast_fitted_values()
fitted_vals.head()
unique_id ds y SNaive
0 A3349908R 1982-04-01 11.3 NaN
1 A3349908R 1982-05-01 11.6 NaN
2 A3349908R 1982-06-01 10.9 NaN
3 A3349908R 1982-07-01 11.4 NaN
4 A3349908R 1982-08-01 10.9 NaN

The training data was fitted with a Seasonal Naive model. The Seasonal Naive model had a seasonal component (m=12) which indicates monthly seasonality existed when this model was fit. Forecasts generated by this model entail that the forecast will equal the value from the same month in the previous year. The output demonstrates that forecasts were produced for the 96 months in the test horizon as expected.

  1. Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?
Code
from statsmodels.tsa.stattools import acf

preds_train = sf.forecast(df=train, h=len(test), fitted=True)
fitted = sf.forecast_fitted_values()

train_diag = train.merge(fitted[["unique_id", "ds", "SNaive"]], on=["unique_id", "ds"], how="left")
train_diag["resid"] = train_diag["y"] - train_diag["SNaive"]

r = train_diag["resid"].dropna().to_numpy()
acf_vals = acf(r, nlags=24, fft=True)
conf = 1.96 / np.sqrt(len(r))

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.plot(train_diag["ds"], train_diag["resid"], linewidth=1)
ax.axhline(0, linewidth=1)
ax.set_title("Residuals over time (Seasonal Naive)")
ax.set_xlabel("Date")
ax.set_ylabel("Residual")
fig.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.stem(range(len(acf_vals)), acf_vals, basefmt=" ")
ax.fill_between([0, len(acf_vals)-1], [-conf, -conf], [conf, conf], alpha=0.2)
ax.set_title("Residual ACF")
ax.set_xlabel("Lag")
ax.set_ylabel("ACF")
fig.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.hist(r, bins=20)
ax.set_title("Residual histogram")
ax.set_xlabel("Residual")
ax.set_ylabel("Count")
fig.tight_layout()
plt.show()

The plot of the residual time shows that the data cluster and that the variance is much larger than expected in later years indicating that the variance is not constant across the series. The plot of ACF indicates that there are certainly significant levels of autocorrelation (especially for the initial lags and near the seasonal lags), indicating that the errors are not independent.

The histogram of the residuals is, however, approximately centered around zero and has a small positive skew and large tails.

Therefore, the residuals are not uncorrelated, nor are they normally distributed; this suggests that the Seasonal Naïve model does not adequately capture all of the structure of the dependant variable, including its seasonal and dynamic components.

  1. Produce forecasts for the test data
Code
preds_test = sf.forecast(df=train, h=len(test), fitted=False)
preds_test = preds_test.merge(test[["unique_id", "ds", "y"]], on=["unique_id", "ds"], how="left")
preds_test.head()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(train["ds"], train["y"], label="Train")
ax.plot(test["ds"], test["y"], label="Test")
ax.plot(preds_test["ds"], preds_test["SNaive"], label="Seasonal Naive Forecast")
ax.set_title("Seasonal Naive forecast vs test")
ax.set_xlabel("Date")
ax.set_ylabel("y")
ax.legend()
fig.tight_layout()
plt.show()

Seasonal Naive forecasts repeat the same seasonal pattern as last season without being able to capture all the upward trend that was demonstrated in the test period; therefore, these forecasts have systematically projected below the true values for the test periods following 2011.

  1. Compare the accuracy of your forecasts against the actual values.
Code
import pandas as pd
from pandas.tseries.offsets import MonthEnd
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import rmse, mae, mape, mase
from functools import partial

data["ds"]  = pd.to_datetime(data["ds"]) + MonthEnd(0)

train = data.loc[data["ds"] <  "2011-01-01", ["unique_id","ds","y"]].copy()
test  = data.loc[data["ds"] >= "2011-01-01", ["unique_id","ds","y"]].copy()

fcst = sf.forecast(df=train, h=len(test))
preds = test.merge(fcst, on=["unique_id", "ds"], how="left")

preds_eval = preds.dropna(subset=["y", "SNaive"]).copy()
preds_eval = preds_eval.loc[preds_eval["y"] != 0]

evaluation = evaluate(
    df=preds_eval,
    train_df=train,
    metrics=[rmse, mae, mape, partial(mase, seasonality=12)],
)

evaluation
unique_id metric SNaive
0 A3349908R rmse 45.328759
1 A3349908R mae 40.792708
2 A3349908R mape 0.325858
3 A3349908R mase 7.934100

The seasonal naive model forecast error analysis was conducted using RMSE, MAE, MAPE, and MASE for the 2011+ test set data. The forecast error values are RMSE = 45.33 and MAE = 40.79, which means there is a large amount of average difference in forecasted versus actual data of approximately 45.33 and 40.79 respectively. The percentage of difference between forecasted and actual data was relatively high with a MAPE value of approximately 0.326 (or 32.6 percent). In addition to this, the MASE was 7.93, indicating that the seasonal naive forecasting method performed significantly worse than the basic seasonal in-sample benchmarking method for this dataset.

  1. How sensitive are the accuracy measures to the amount of training data used?
Code
def eval_with_cutoff(cutoff):
    tr = data.loc[data["ds"] < cutoff].copy()
    te = data.loc[data["ds"] >= cutoff].copy()
    sf_local = StatsForecast(models=[SeasonalNaive(m, alias="SNaive")], freq="ME", n_jobs=1)
    sf_local.fit(tr)
    pr = sf_local.forecast(df=tr, h=len(te), fitted=False).merge(te[["unique_id","ds","y"]], on=["unique_id","ds"], how="left")
    ev = evaluate(pr, train_df=tr, metrics=[rmse, mae, mape, partial(mase, seasonality=m)])
    ev["cutoff"] = str(cutoff)[:10]
    return ev

cutoffs = ["2009-01-01", "2010-01-01", "2011-01-01"]
sens = pd.concat([eval_with_cutoff(c) for c in cutoffs], ignore_index=True)
sens
unique_id metric SNaive cutoff
0 A3349908R rmse 58.353861 2009-01-01
1 A3349908R mae 52.932500 2009-01-01
2 A3349908R mape 0.455890 2009-01-01
3 A3349908R mase 11.588595 2009-01-01
4 A3349908R rmse 54.441381 2010-01-01
5 A3349908R mae 50.565741 2010-01-01
6 A3349908R mape 0.429018 2010-01-01
7 A3349908R mase 10.751542 2010-01-01
8 A3349908R rmse 45.328759 2011-01-01
9 A3349908R mae 40.792708 2011-01-01
10 A3349908R mape 0.325858 2011-01-01
11 A3349908R mase 7.934100 2011-01-01

The forecast evaluation was carried out over three cut-off years (2009, 2010, 2011) measuring the accuracy of forecasts with different training sample sizes. Both RMSE & MASE decreased in value as sample sizes increased; for example, RMSE of 58.35 at the 2009 cutoff compared to RMSE of 45.33 at the 2011 cut-off. MASEs went from 11.59 to 7.93 during this time period. Therefore, forecasts become more accurate with additional historical data used to estimate model parameters.