DATA624 - Homework 3

Author

Anthony Josue Roman

Exercise 5.1

Produce forecasts for the following series using whichever of Naive(), SeasonalNaive() or RandomWalkWithDrift() is more appropriate in each case:

Australian Population (global_economy)

Code

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

global_econ = pd.read_csv("global_economy.csv", parse_dates=["ds"])

aus_pop = (
    global_econ[global_econ["unique_id"] == "Australia"]
    [["unique_id", "ds", "Population"]]
    .rename(columns={"Population": "y"})
)

rwd = RandomWalkWithDrift()
sf = StatsForecast(models=[rwd], freq="Y")

forecast_pop = sf.forecast(df=aus_pop, h=10, level=[80, 95])
forecast_pop

	unique_id	ds	RWD	RWD-lo-80	RWD-lo-95	RWD-hi-80	RWD-hi-95
0	Australia	2017-12-31	2.485020e+07	2.475083e+07	2.469823e+07	2.494958e+07	2.500218e+07
1	Australia	2018-12-31	2.510148e+07	2.495973e+07	2.488470e+07	2.524322e+07	2.531825e+07
2	Australia	2019-12-31	2.535275e+07	2.517768e+07	2.508501e+07	2.552781e+07	2.562048e+07
3	Australia	2020-12-31	2.560402e+07	2.540020e+07	2.529230e+07	2.580784e+07	2.591574e+07
4	Australia	2021-12-31	2.585529e+07	2.562555e+07	2.550393e+07	2.608503e+07	2.620664e+07
5	Australia	2022-12-31	2.610656e+07	2.585287e+07	2.571858e+07	2.636025e+07	2.649454e+07
6	Australia	2023-12-31	2.635783e+07	2.608165e+07	2.593545e+07	2.663401e+07	2.678021e+07
7	Australia	2024-12-31	2.660910e+07	2.631155e+07	2.615404e+07	2.690665e+07	2.706416e+07
8	Australia	2025-12-31	2.686037e+07	2.654236e+07	2.637401e+07	2.717839e+07	2.734674e+07
9	Australia	2026-12-31	2.711164e+07	2.677390e+07	2.659510e+07	2.744939e+07	2.762818e+07

Code

import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

global_econ = pd.read_csv("global_economy.csv", parse_dates=["ds"])

aus_pop = (
    global_econ.loc[global_econ["unique_id"] == "Australia", ["unique_id", "ds", "Population"]]
    .rename(columns={"Population": "y"})
    .dropna()
)

rwd = RandomWalkWithDrift(alias="RWD")
sf = StatsForecast(models=[rwd], freq="YE")
forecast_pop = sf.forecast(df=aus_pop, h=10, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(aus_pop["ds"], aus_pop["y"], linewidth=1, label="History")
ax.plot(forecast_pop["ds"], forecast_pop["RWD"], linewidth=2, label="Forecast")
ax.fill_between(forecast_pop["ds"], forecast_pop["RWD-lo-95"], forecast_pop["RWD-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_pop["ds"], forecast_pop["RWD-lo-80"], forecast_pop["RWD-hi-80"], alpha=0.35, label="80% PI")
ax.set_title("Australian Population Forecast (Random Walk with Drift)")
ax.set_xlabel("Year")
ax.set_ylabel("Population")
ax.legend()
fig.tight_layout()
plt.show()

According to the Australian population series, there is a steady and consistent increase to the overall population with no noticeable yearly fluctuation. Therefore, a Random Walk with Drift model is suitable for continuing gradual long-term growth, by projecting the previously measured average increase into the future. The forecast results in a smooth continuation of the trend and agrees with what has happened in the past with respect to the population size. The forecast intervals show an increasing interval width over time; demonstrating an increase in unpredictability as the forecast horizon expands. This is also consistent with the nature of this model.

Bricks (aus_production)

Code

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_prod = pd.read_csv("aus_production.csv")

time_col = next(c for c in ["ds","date","Date","quarter","Quarter","month","Month","time","Time"] if c in aus_prod.columns)

ds = pd.to_datetime(aus_prod[time_col], errors="coerce")

if ds.isna().all():
    ds = pd.PeriodIndex(aus_prod[time_col].astype(str), freq="Q").to_timestamp()

bricks = pd.DataFrame({
    "unique_id": "Bricks",
    "ds": ds,
    "y": aus_prod["Bricks"]
}).dropna()

snaive = SeasonalNaive(4, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="QE")

forecast_bricks = sf.forecast(df=bricks, h=8, level=[80, 95])

forecast_bricks

	unique_id	ds	SNaive	SNaive-lo-80	SNaive-lo-95	SNaive-hi-80	SNaive-hi-95
0	Bricks	2005-06-30	428.0	366.061796	333.273691	489.938204	522.726309
1	Bricks	2005-09-30	397.0	335.061796	302.273691	458.938204	491.726309
2	Bricks	2005-12-31	355.0	293.061796	260.273691	416.938204	449.726309
3	Bricks	2006-03-31	435.0	373.061796	340.273691	496.938204	529.726309
4	Bricks	2006-06-30	428.0	340.406152	294.036769	515.593848	561.963231
5	Bricks	2006-09-30	397.0	309.406152	263.036769	484.593848	530.963231
6	Bricks	2006-12-31	355.0	267.406152	221.036769	442.593848	488.963231
7	Bricks	2007-03-31	435.0	347.406152	301.036769	522.593848	568.963231

Code

import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_prod = pd.read_csv("aus_production.csv")

time_col = next(c for c in ["ds","date","Date","quarter","Quarter","month","Month","time","Time"] if c in aus_prod.columns)
ds = pd.to_datetime(aus_prod[time_col], errors="coerce")

if ds.isna().all():
    ds = pd.PeriodIndex(aus_prod[time_col].astype(str), freq="Q").to_timestamp()

bricks = pd.DataFrame({"unique_id": "Bricks", "ds": ds, "y": aus_prod["Bricks"]}).dropna()

snaive = SeasonalNaive(4, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="QE")
forecast_bricks = sf.forecast(df=bricks, h=8, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(bricks["ds"], bricks["y"], linewidth=1, label="History")
ax.plot(forecast_bricks["ds"], forecast_bricks["SNaive"], linewidth=2, label="Forecast")
ax.fill_between(forecast_bricks["ds"], forecast_bricks["SNaive-lo-95"], forecast_bricks["SNaive-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_bricks["ds"], forecast_bricks["SNaive-lo-80"], forecast_bricks["SNaive-hi-80"], alpha=0.35, label="80% PI")
ax.set_title("Bricks Forecast (Seasonal Naive, period=4)")
ax.set_xlabel("Quarter")
ax.set_ylabel("Bricks")
ax.legend()
fig.tight_layout()
plt.show()

The bricks product line has strong quarterly seasonality, with distinct patterns of seasons comprising peaks and valleys each year, along with noticeable shorter-term fluctuations. Therefore, it is appropriate to use a Seasonal Naive model with a periodicity of 4 for forecasting because this model will repeat values that were realized in the same quarter of the previous year. The forecast continues the established seasonal pattern without adding any changes to trends formed from past observations. The forecast’s prediction intervals widen slightly as the forecast period lengthens, so they accommodate for uncertainty while still capturing the seasonality that has been previously observed in the data.

NSW Lambs (aus_livestock)

Code

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_livestock = pd.read_csv("aus_livestock.csv", parse_dates=["ds"])

uid = next(u for u in aus_livestock["unique_id"].unique() if "Lamb" in str(u))

lambs = aus_livestock.loc[aus_livestock["unique_id"] == uid, ["unique_id", "ds", "y"]].dropna()

snaive = SeasonalNaive(12, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="ME")

forecast_lambs = sf.forecast(df=lambs, h=12, level=[80, 95])

forecast_lambs

	unique_id	ds	SNaive-lo-80	SNaive-lo-95	SNaive-hi-80	SNaive-hi-95
0	Australian Capital Territory_Lambs	2018-12-31	-7936.301301	-12137.525433	7936.301301	12137.525433
1	Australian Capital Territory_Lambs	2019-01-31	-7936.301301	-12137.525433	7936.301301	12137.525433
2	Australian Capital Territory_Lambs	2019-02-28	-7936.301301	-12137.525433	7936.301301	12137.525433
3	Australian Capital Territory_Lambs	2019-03-31	-7936.301301	-12137.525433	7936.301301	12137.525433
4	Australian Capital Territory_Lambs	2019-04-30	-7936.301301	-12137.525433	7936.301301	12137.525433
5	Australian Capital Territory_Lambs	2019-05-31	-7936.301301	-12137.525433	7936.301301	12137.525433
6	Australian Capital Territory_Lambs	2019-06-30	-7936.301301	-12137.525433	7936.301301	12137.525433
7	Australian Capital Territory_Lambs	2019-07-31	-7936.301301	-12137.525433	7936.301301	12137.525433
8	Australian Capital Territory_Lambs	2019-08-31	-7936.301301	-12137.525433	7936.301301	12137.525433
9	Australian Capital Territory_Lambs	2019-09-30	-7936.301301	-12137.525433	7936.301301	12137.525433
10	Australian Capital Territory_Lambs	2019-10-31	-7936.301301	-12137.525433	7936.301301	12137.525433
11	Australian Capital Territory_Lambs	2019-11-30	-7936.301301	-12137.525433	7936.301301	12137.525433

Code

import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_livestock = pd.read_csv("aus_livestock.csv", parse_dates=["ds"])

uid = next(u for u in aus_livestock["unique_id"].unique() if ("New South Wales" in str(u)) and ("Lamb" in str(u)))

lambs = (
    aus_livestock.loc[aus_livestock["unique_id"] == uid, ["unique_id", "ds", "y"]]
    .dropna()
    .sort_values("ds")
)

lambs = lambs.loc[lambs["y"] > 0]

snaive = SeasonalNaive(12, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="ME")
forecast_lambs = sf.forecast(df=lambs, h=12, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(lambs["ds"], lambs["y"], linewidth=1, label="History")
ax.plot(forecast_lambs["ds"], forecast_lambs["SNaive"], linewidth=2, label="Forecast")
ax.fill_between(forecast_lambs["ds"], forecast_lambs["SNaive-lo-95"], forecast_lambs["SNaive-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_lambs["ds"], forecast_lambs["SNaive-lo-80"], forecast_lambs["SNaive-hi-80"], alpha=0.35, label="80% PI")
ax.set_title(f"{uid} Forecast (Seasonal Naive, period=12)")
ax.set_xlabel("Year")
ax.set_ylabel("Lambs")
ax.legend()
fig.tight_layout()
plt.show()

The Lamb Series varies significantly both seasonally and short term (month to month) from one year to another and continues to show evidence of both. With the presence of monthly data and repeating seasonal behavior, it is appropriate to use a seasonal naive model with a period of 12, by repeating the value of the same month from the previous year. The prediction interval continues the previously established seasonal behavior and does not impose any additional trend structure on the forecast. As the forecast period extends into the future, the width of prediction intervals will continue to increase, indicating increasing uncertainty but sustaining the previously observed seasonal behavior.

Household wealth (hh_budget)

Code

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

hh = pd.read_csv("hh_budget.csv")

wealth = (
    hh.loc[hh["Country"] == "Australia", ["Year", "Wealth"]]
    .rename(columns={"Year": "ds", "Wealth": "y"})
)

wealth["ds"] = pd.to_datetime(wealth["ds"].astype(str))

wealth["unique_id"] = "Household_Wealth"
wealth = wealth[["unique_id", "ds", "y"]]

rwd = RandomWalkWithDrift(alias="RWD")
sf = StatsForecast(models=[rwd], freq="YE")

forecast_wealth = sf.forecast(df=wealth, h=10, level=[80, 95])

forecast_wealth

	unique_id	ds	RWD	RWD-lo-80	RWD-lo-95	RWD-hi-80	RWD-hi-95
0	Household_Wealth	2016-12-31	427.069971	397.454807	381.777486	456.685136	472.362457
1	Household_Wealth	2017-12-31	432.167043	389.343586	366.674217	474.990500	497.659869
2	Household_Wealth	2018-12-31	437.264114	383.688266	355.326925	490.839963	519.201304
3	Household_Wealth	2019-12-31	442.361186	379.221443	345.797282	505.500929	538.925089
4	Household_Wealth	2020-12-31	447.458257	375.467874	337.358467	519.448640	557.558047
5	Household_Wealth	2021-12-31	452.555329	372.191556	329.649545	532.919101	575.461113
6	Household_Wealth	2022-12-31	457.652400	369.256760	322.462935	546.048040	592.841865
7	Household_Wealth	2023-12-31	462.749471	366.577880	315.667715	558.921063	609.831228
8	Household_Wealth	2024-12-31	467.846543	364.097359	309.175858	571.595727	626.517228
9	Household_Wealth	2025-12-31	472.943614	361.774625	302.925317	584.112603	642.961911

Code

import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

hh = pd.read_csv("hh_budget.csv")

wealth = (
    hh.loc[hh["Country"] == "Australia", ["Year", "Wealth"]]
    .rename(columns={"Year": "ds", "Wealth": "y"})
    .dropna()
)

wealth["ds"] = pd.to_datetime(wealth["ds"].astype(str))
wealth["unique_id"] = "Household_Wealth"
wealth = wealth[["unique_id", "ds", "y"]]

rwd = RandomWalkWithDrift(alias="RWD")
sf = StatsForecast(models=[rwd], freq="YE")
forecast_wealth = sf.forecast(df=wealth, h=10, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(wealth["ds"], wealth["y"], linewidth=1, label="History")
ax.plot(forecast_wealth["ds"], forecast_wealth["RWD"], linewidth=2, label="Forecast")
ax.fill_between(forecast_wealth["ds"], forecast_wealth["RWD-lo-95"], forecast_wealth["RWD-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_wealth["ds"], forecast_wealth["RWD-lo-80"], forecast_wealth["RWD-hi-80"], alpha=0.35, label="80% PI")
ax.set_title("Household Wealth Forecast (Random Walk with Drift)")
ax.set_xlabel("Year")
ax.set_ylabel("Wealth")
ax.legend()
fig.tight_layout()
plt.show()

The series of household wealth represents a general increase in value; however there have been periods where the data appears to fluctuate greatly as a result of economic volatility. The lack of any seasonality and consistent growth over time justifies the use of a random walk with drift model as it allows the average growth rate from historical data to continue into future years. The forecast is consistent with the long term trend of increasing household wealth shown by the data. As the forecast time frame increases, the prediction intervals expand dramatically due to the increased uncertainty associated with longer timeframes as is typical with these models.

Australian takeaway food turnover (aus_retail)

Code

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_retail = pd.read_csv("aus_retail.csv")

date_col = next(c for c in ["ds","Month","month","Date","date"] if c in aus_retail.columns)

takeaway = (
    aus_retail.loc[aus_retail["Industry"] == "Takeaway food services", [date_col, "Turnover"]]
    .rename(columns={date_col: "ds", "Turnover": "y"})
)

takeaway["ds"] = pd.to_datetime(takeaway["ds"])

takeaway["unique_id"] = "Takeaway food services"
takeaway = takeaway[["unique_id", "ds", "y"]].dropna()

snaive = SeasonalNaive(12, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="ME")

forecast_takeaway = sf.forecast(df=takeaway, h=12, level=[80, 95])

forecast_takeaway

	unique_id	ds	SNaive	SNaive-lo-80	SNaive-lo-95	SNaive-hi-80	SNaive-hi-95
0	Takeaway food services	2018-12-31	92.3	-126.608211	-242.491218	311.208211	427.091218
1	Takeaway food services	2019-01-31	28.9	-190.008211	-305.891218	247.808211	363.691218
2	Takeaway food services	2019-02-28	354.9	135.991789	20.108782	573.808211	689.691218
3	Takeaway food services	2019-03-31	183.7	-35.208211	-151.091218	402.608211	518.491218
4	Takeaway food services	2019-04-30	29.1	-189.808211	-305.691218	248.008211	363.891218
5	Takeaway food services	2019-05-31	630.3	411.391789	295.508782	849.208211	965.091218
6	Takeaway food services	2019-06-30	20.4	-198.508211	-314.391218	239.308211	355.191218
7	Takeaway food services	2019-07-31	358.7	139.791789	23.908782	577.608211	693.491218
8	Takeaway food services	2019-08-31	99.7	-119.208211	-235.091218	318.608211	434.491218
9	Takeaway food services	2019-09-30	31.5	-187.408211	-303.291218	250.408211	366.291218
10	Takeaway food services	2019-10-31	393.2	174.291789	58.408782	612.108211	727.991218
11	Takeaway food services	2019-11-30	195.1	-23.808211	-139.691218	414.008211	529.891218

Code

import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

aus_retail = pd.read_csv("aus_retail.csv")

takeaway = (
    aus_retail.loc[aus_retail["Industry"] == "Takeaway food services", ["Month", "Turnover"]]
    .rename(columns={"Month": "ds", "Turnover": "y"})
)

takeaway["ds"] = pd.to_datetime(takeaway["ds"])

takeaway = (
    takeaway.groupby("ds", as_index=False)["y"]
    .sum()
    .sort_values("ds")
)

takeaway["unique_id"] = "Australia_Takeaway food services"
takeaway = takeaway[["unique_id", "ds", "y"]]

snaive = SeasonalNaive(12, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="ME", n_jobs=1)

forecast_takeaway = sf.forecast(df=takeaway, h=12, level=[80, 95])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(takeaway["ds"], takeaway["y"], linewidth=1, label="History")
ax.plot(forecast_takeaway["ds"], forecast_takeaway["SNaive"], linewidth=2, label="Forecast")
ax.fill_between(forecast_takeaway["ds"], forecast_takeaway["SNaive-lo-95"], forecast_takeaway["SNaive-hi-95"], alpha=0.2, label="95% PI")
ax.fill_between(forecast_takeaway["ds"], forecast_takeaway["SNaive-lo-80"], forecast_takeaway["SNaive-hi-80"], alpha=0.35, label="80% PI")
ax.set_title("Australian Takeaway Food Turnover Forecast (Seasonal Naive, period=12)")
ax.set_xlabel("Month")
ax.set_ylabel("Turnover")
ax.legend()
fig.tight_layout()
plt.show()

There is considerable seasonality in the monthly turnover pattern of takeaway food turnover, with distinct peaks and troughs occurring every twelve months and there has been a clear increase in turnover over time. Because the pattern is clearly seasonal, a Seasonal Naive model having a period of 12 is a suitable method of forecasting, as it simply uses the most recent value of the same month from the prior year as the basis for making future forecasts. By using this method, the seasonal component is preserved while no trend adjustments will be made to the existing trend. The prediction intervals of the forecast will increase in width as the forecast period increases, with the prediction intervals changing according to how the results from the forecast differ from the expected return in monthly turnover based upon the seasonal pattern seen in historically recorded takeaway food turnover.

Exercise 5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.

Code

import pandas as pd
import matplotlib.pyplot as plt

gafa = pd.read_csv("gafa_stock.csv", parse_dates=["ds"])

fb = gafa.loc[gafa["unique_id"] == "FB_Close", ["unique_id", "ds", "y"]].copy()
fb = fb.sort_values("ds")

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(fb["ds"], fb["y"])
ax.set_title("Facebook Stock Price (Close)")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
fig.tight_layout()
plt.show()

From 2014-2019, Facebook’s closing stock price is shown to have a considerable amount of short-term volatility but mainly continues to rise overall through the majority of that time frame. The stock price continually rises up until around 2018 at which point there is an obvious decrease in stock value. The data does not have a seasonality component or repeating pattern within it. The entire series operates like an ordinary financial time series characterized by movements upward and downward and continuously changing levels over time.

Produce forecasts using the drift method and plot them.

Code

from statsforecast import StatsForecast
from statsforecast.models import RandomWalkWithDrift

rwd = RandomWalkWithDrift(alias="RWD")
sf = StatsForecast(models=[rwd], freq="B", n_jobs=1)

h = 30
fc_rwd = sf.forecast(df=fb, h=h, level=[80,95])

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(fb["ds"], fb["y"], label="History")
ax.plot(fc_rwd["ds"], fc_rwd["RWD"], label="Drift Forecast")
ax.fill_between(fc_rwd["ds"], fc_rwd["RWD-lo-95"], fc_rwd["RWD-hi-95"], alpha=0.2)
ax.fill_between(fc_rwd["ds"], fc_rwd["RWD-lo-80"], fc_rwd["RWD-hi-80"], alpha=0.35)
ax.set_title("Facebook Drift Forecast")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.legend()
fig.tight_layout()
plt.show()

Facebook’s stock price is projected into the future as a straight line based on the average of historical variations from Facebook’s stock price. Hence, there has been a slight increase to the total projection because of the upward trend across the entire dataset. The confidence intervals get wider over the extended forecast period demonstrating that uncertainty continues to increase further out in time; however because of considerable short term fluctuations in stock prices, it is uncertain that using a drift method of forecasting will provide a complete understanding of any rapid or significant shifts or reversals in stock prices.

Show that the forecasts are identical to extending the line drawn between the first and last observations.

Code

import numpy as np

fb2 = fb.reset_index(drop=True)

y0 = fb2["y"].iloc[0]
yT = fb2["y"].iloc[-1]
n = len(fb2)

slope = (yT - y0) / (n - 1)

line_ext = np.array([yT + slope*k for k in range(1, h+1)])

comparison = pd.DataFrame({
    "Forecast": fc_rwd["RWD"].values,
    "LineExtension": line_ext,
    "Difference": fc_rwd["RWD"].values - line_ext
})

comparison.head()

	Forecast	LineExtension
0	131.150760	131.150760
1	131.211523	131.211523
2	131.272287	131.272287
3	131.333051	131.333051
4	131.393815	131.393815

Drift forecast mathematically represents the continuous projection of linearly extending the line connecting the initial and last observation in the training set, and it will yield identical results when comparing the calculated drift forecast with manually calculating linear projection of the same last observation to the last drift forecast and finding an identical outcome (a difference of 0) demonstrating that this method is used for predicting future value(s) based upon an average slope of historical series.

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

Code

from statsforecast.models import Naive

naive = Naive(alias="Naive")
sf2 = StatsForecast(models=[naive, rwd], freq="B", n_jobs=1)

fc2 = sf2.forecast(df=fb, h=h)

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(fb["ds"], fb["y"], label="History")
ax.plot(fc2["ds"], fc2["Naive"], label="Naive")
ax.plot(fc2["ds"], fc2["RWD"], label="Drift")
ax.set_title("Facebook Benchmark Comparison")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.legend()
fig.tight_layout()
plt.show()

The Naive and Drift forecasts are comparable over short horizons; however, the Drift forecast has a slight upward slope due to the addition of the average historical increases over the entire time-series of data. In contrast, the current most recent observations are declining, which is not adequately captured in the Drift forecast. The Naive forecast simply takes the last observation and utilizes that as the next value, thereby making the Naive forecast much more consistent with the random walk behavior observed in stock prices. Therefore, the Naive forecast will usually be a more appropriate benchmark for short-term forecasting of daily stock prices.

Exercise 5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992.

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive
from statsmodels.tsa.stattools import acf

aus_production = pd.read_csv("aus_production.csv")

date_col = next(c for c in ["ds","date","Date","quarter","Quarter"] if c in aus_production.columns)
aus_production[date_col] = pd.to_datetime(aus_production[date_col], errors="coerce")
aus_production = aus_production.dropna(subset=[date_col])

y_col = "Beer" if "Beer" in aus_production.columns else "y"

recent_production = (
    aus_production.loc[aus_production[date_col].dt.year >= 1992, [date_col, y_col]]
    .rename(columns={date_col: "ds", y_col: "y"})
    .dropna()
    .sort_values("ds")
)
recent_production["unique_id"] = "Beer"
recent_production = recent_production[["unique_id", "ds", "y"]]

snaive = SeasonalNaive(4, alias="SNaive")
sf = StatsForecast(models=[snaive], freq="QE", n_jobs=1)

preds = sf.forecast(df=recent_production, h=10, fitted=True)
insample = sf.forecast_fitted_values()

insample = insample.merge(
    recent_production.rename(columns={"y": "y_actual"}),
    on=["unique_id", "ds"],
    how="inner"
)
resid = insample["y_actual"] - insample["SNaive"]

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.plot(insample["ds"], resid, linewidth=1)
ax.axhline(0, linewidth=1)
ax.set_title("Residuals over time (Seasonal Naive)")
ax.set_xlabel("Quarter")
ax.set_ylabel("Residual")
fig.tight_layout()
plt.show()

r = resid.dropna().to_numpy()
acf_vals = acf(r, nlags=24, fft=True)
n = len(r)
conf = 1.96 / np.sqrt(n)

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.stem(range(len(acf_vals)), acf_vals, basefmt=" ")
ax.fill_between([0, len(acf_vals)-1], [-conf, -conf], [conf, conf], alpha=0.2)
ax.set_title("Residual ACF")
ax.set_xlabel("Lag")
ax.set_ylabel("ACF")
fig.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.hist(r, bins=20)
ax.set_title("Residual histogram")
ax.set_xlabel("Residual")
ax.set_ylabel("Count")
fig.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.plot(recent_production["ds"], recent_production["y"], linewidth=1, label="History")
ax.plot(preds["ds"], preds["SNaive"], linewidth=2, label="Forecast")
ax.set_title("Beer production forecast (Seasonal Naive, period=4)")
ax.set_xlabel("Quarter")
ax.set_ylabel("Beer")
ax.legend()
fig.tight_layout()
plt.show()

The seasonal naïve model (evaluation period of four) clearly captures the strong seasonal nature of Australian beer production over four quarters per year. The original pattern observed in the historical beer production data will continue to have around 25% similarities to that of the seasonally naïve forecasting model (i.e., the forecast is essentially a replication of the last four quarters of the actual value, also known as forecast replication).

The residual time series plot shows that residuals fluctuate around zero (the average residual), but the magnitude of the fluctuations are almost evenly distributed between the type of trend that has been fit to the model and no historical trend; therefore, the model does capture most of the seasonality. However, there are still a large number of large positive residuals and a similar number of large negative residuals remaining in the residuals distribution; therefore, there is still a significant amount of variation remaining that is not captured by the model.

The residual ACF shows at least one statistically significant spike (clearly at lag four), which means that the residuals also contain some degree of seasonal autocorrelation. If the residuals were pure white noise, then the vast majority of the spikes would not exceed the 95% confidence boundaries.

The histogram of the residuals is centre approximately on the zero mark; however, the residuals are also very spread out and do not show a perfect symmetric distribution.

Based on the data collected from the residual analysis, the residuals are not perfect white noise due to the presence of remaining autocorrelation, and while the unchanged productivity delivered by the seasonal naïve modelling approach provides a reasonable baseline for future comparisons, more advanced seasonal modelling could result in improved production forecast accuracy.

Exercise 5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of Naive() or SeasonlNaive() is more appropriate in each case.

Australian Exports (global_economy)

Code

import pandas as pd
import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import Naive

ge = pd.read_csv("global_economy.csv")

ge["ds"] = pd.to_datetime(ge["ds"].astype(str), format="%Y")

aus_exports = ge[
    ge["unique_id"] == "Australia"
][["ds", "Exports"]].dropna().sort_values("ds")

aus_exports = aus_exports.rename(columns={"Exports": "y"})
aus_exports["unique_id"] = "Australia"
aus_exports = aus_exports[["unique_id", "ds", "y"]]

sf = StatsForecast(models=[Naive(alias="Naive")], freq="Y", n_jobs=1)
preds = sf.forecast(df=aus_exports, h=10)

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(aus_exports["ds"], aus_exports["y"], linewidth=1, label="History")
ax.plot(preds["ds"], preds["Naive"], linewidth=2, label="Forecast")
ax.set_title("Australian Exports Forecast (Naive)")
ax.set_xlabel("Year")
ax.set_ylabel("Exports")
ax.legend()
plt.show()

Bricks (aus_production)

Code

from statsforecast.models import SeasonalNaive

aus_production = pd.read_csv("aus_production.csv")
aus_production["ds"] = pd.to_datetime(aus_production["ds"])

bricks = aus_production[["ds","Bricks"]].dropna().copy()
bricks["unique_id"] = "Bricks"
bricks = bricks[["unique_id","ds","Bricks"]].rename(columns={"Bricks":"y"})

model = SeasonalNaive(4, alias="SNaive")
sf = StatsForecast(models=[model], freq="Q", n_jobs=1)

preds = sf.forecast(df=bricks, h=8, fitted=True)
fitted_vals = sf.forecast_fitted_values()

resid = fitted_vals["y"] - fitted_vals["SNaive"]

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(bricks["ds"], bricks["y"], label="History")
ax.plot(preds["ds"], preds["SNaive"], label="Forecast")
ax.set_title("Bricks Forecast (Seasonal Naive, period=4)")
ax.legend()
plt.show()

The Naive approach was applied to the Australian Exports series data (no obvious correlation). The exports have increased continuously plus inconsistently over time, however, there is no seasonal variation in them. The Naive forecasting method uses the last data point (observation) to produce a repetitive pattern of prediction into the future. Therefore, since there is an optimistic trend for increasing exports into the future, the Naive approach will not give a good representation of the overall trend because the values would look like flat lines rather than an increasing pattern. Thus, the Naive approach can only provide a basic benchmark of the data, not an accurate and legitimate long-range forecast.

The Seasonal Naive forecasting method (period of 4) is used in conjunction with the aus_production Bricks series location/volume. The use of this approach was appropriate because the data demonstrated standard repeating seasonal patterns (seasonally positive high peaks and low valleys) for each year. The Seasonal Naive method will replicate the previous year’s seasonal pattern of one year prior in how it forecasts future activity because it is inherently repetitive in nature - this will produce quarterly forecast error variance that reflects the historical trend of the past few years (i.e., provide for cyclic Quarterly Forecast error variance rather than a consistency of trend throughout each quarter).

Exercise 5.7

For your retail time series (from Exercise 7 in Section 2.10):

Create a training dataset consisting of observations before 2011

Code

import numpy as np
import pandas as pd

aus_retail = pd.read_csv("aus_retail.csv")

if "ds" not in aus_retail.columns:
    date_col = next(c for c in ["Month", "month", "Date", "date"] if c in aus_retail.columns)
    aus_retail[date_col] = pd.to_datetime(aus_retail[date_col], errors="coerce")
    aus_retail = aus_retail.rename(columns={date_col: "ds"})

if "y" not in aus_retail.columns:
    y_col = next(c for c in ["Turnover", "turnover", "value", "Value"] if c in aus_retail.columns)
    aus_retail = aus_retail.rename(columns={y_col: "y"})

if "unique_id" not in aus_retail.columns:
    if "Series ID" in aus_retail.columns:
        aus_retail = aus_retail.rename(columns={"Series ID": "unique_id"})
    else:
        id_candidates = [c for c in ["Industry", "State", "Region", "Series"] if c in aus_retail.columns]
        aus_retail["unique_id"] = aus_retail[id_candidates].astype(str).agg("_".join, axis=1)

aus_retail["ds"] = pd.to_datetime(aus_retail["ds"], errors="coerce")
aus_retail["y"] = pd.to_numeric(aus_retail["y"], errors="coerce")
aus_retail = aus_retail.dropna(subset=["ds", "y"]).sort_values(["unique_id", "ds"])

rng = np.random.default_rng(12345678)
chosen_id = rng.choice(aus_retail["unique_id"].unique(), 1)[0]

data = aus_retail.loc[aus_retail["unique_id"] == chosen_id, ["unique_id", "ds", "y"]].copy()
train = data.loc[data["ds"] < "2011-01-01"].copy()
test  = data.loc[data["ds"] >= "2011-01-01"].copy()

chosen_id, train.shape, test.shape

('A3349908R', (345, 3), (96, 3))

The dataset identified by the identifier A3349908R was separated into a training set of all observations up until the end of 2010 (345) and assigned a test set containing an observations from 2011 and on (96). As the tests will be performed over the future documentation, the evaluation of the model will be based on the unobserved future data.

Check that your data have been split appropriately by producing the following plot.

Code

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(train["ds"], train["y"], label="Train")
ax.plot(test["ds"], test["y"], label="Test")
ax.set_title(f"Retail series split (unique_id = {chosen_id})")
ax.set_xlabel("Date")
ax.set_ylabel("y")
ax.legend()
fig.tight_layout()
plt.show()

The plot provides evidence that the split between training and testing data occurred correctly. The training data ends just prior to (2011) and the testing data begins in (2011) without any overlap. Time ordering is retained in both segments as well since a strong seasonality can be observed even in both data sets. Test Data appear to be significantly elevated relative to training data indicating potential upward movement beyond (2011).

Fit a seasonal naïve model using SeasonalNaive() applied to your training data.

Code

from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

m = 12

sf = StatsForecast(
    models=[SeasonalNaive(m, alias="SNaive")],
    freq="ME",
    n_jobs=1
)

sf.fit(train)

sf

sf.forecast(df=train, h=len(test), fitted=True)
fitted_vals = sf.forecast_fitted_values()
fitted_vals.head()

	unique_id	ds	y	SNaive
0	A3349908R	1982-04-01	11.3	NaN
1	A3349908R	1982-05-01	11.6	NaN
2	A3349908R	1982-06-01	10.9	NaN
3	A3349908R	1982-07-01	11.4	NaN
4	A3349908R	1982-08-01	10.9	NaN

The training data was fitted with a Seasonal Naive model. The Seasonal Naive model had a seasonal component (m=12) which indicates monthly seasonality existed when this model was fit. Forecasts generated by this model entail that the forecast will equal the value from the same month in the previous year. The output demonstrates that forecasts were produced for the 96 months in the test horizon as expected.

Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?

Code

from statsmodels.tsa.stattools import acf

preds_train = sf.forecast(df=train, h=len(test), fitted=True)
fitted = sf.forecast_fitted_values()

train_diag = train.merge(fitted[["unique_id", "ds", "SNaive"]], on=["unique_id", "ds"], how="left")
train_diag["resid"] = train_diag["y"] - train_diag["SNaive"]

r = train_diag["resid"].dropna().to_numpy()
acf_vals = acf(r, nlags=24, fft=True)
conf = 1.96 / np.sqrt(len(r))

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.plot(train_diag["ds"], train_diag["resid"], linewidth=1)
ax.axhline(0, linewidth=1)
ax.set_title("Residuals over time (Seasonal Naive)")
ax.set_xlabel("Date")
ax.set_ylabel("Residual")
fig.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.stem(range(len(acf_vals)), acf_vals, basefmt=" ")
ax.fill_between([0, len(acf_vals)-1], [-conf, -conf], [conf, conf], alpha=0.2)
ax.set_title("Residual ACF")
ax.set_xlabel("Lag")
ax.set_ylabel("ACF")
fig.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(10, 3.8))
ax.hist(r, bins=20)
ax.set_title("Residual histogram")
ax.set_xlabel("Residual")
ax.set_ylabel("Count")
fig.tight_layout()
plt.show()

The plot of the residual time shows that the data cluster and that the variance is much larger than expected in later years indicating that the variance is not constant across the series. The plot of ACF indicates that there are certainly significant levels of autocorrelation (especially for the initial lags and near the seasonal lags), indicating that the errors are not independent.

The histogram of the residuals is, however, approximately centered around zero and has a small positive skew and large tails.

Therefore, the residuals are not uncorrelated, nor are they normally distributed; this suggests that the Seasonal Naïve model does not adequately capture all of the structure of the dependant variable, including its seasonal and dynamic components.

Produce forecasts for the test data

Code

preds_test = sf.forecast(df=train, h=len(test), fitted=False)
preds_test = preds_test.merge(test[["unique_id", "ds", "y"]], on=["unique_id", "ds"], how="left")
preds_test.head()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(train["ds"], train["y"], label="Train")
ax.plot(test["ds"], test["y"], label="Test")
ax.plot(preds_test["ds"], preds_test["SNaive"], label="Seasonal Naive Forecast")
ax.set_title("Seasonal Naive forecast vs test")
ax.set_xlabel("Date")
ax.set_ylabel("y")
ax.legend()
fig.tight_layout()
plt.show()

Seasonal Naive forecasts repeat the same seasonal pattern as last season without being able to capture all the upward trend that was demonstrated in the test period; therefore, these forecasts have systematically projected below the true values for the test periods following 2011.

Compare the accuracy of your forecasts against the actual values.

Code

import pandas as pd
from pandas.tseries.offsets import MonthEnd
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import rmse, mae, mape, mase
from functools import partial

data["ds"]  = pd.to_datetime(data["ds"]) + MonthEnd(0)

train = data.loc[data["ds"] <  "2011-01-01", ["unique_id","ds","y"]].copy()
test  = data.loc[data["ds"] >= "2011-01-01", ["unique_id","ds","y"]].copy()

fcst = sf.forecast(df=train, h=len(test))
preds = test.merge(fcst, on=["unique_id", "ds"], how="left")

preds_eval = preds.dropna(subset=["y", "SNaive"]).copy()
preds_eval = preds_eval.loc[preds_eval["y"] != 0]

evaluation = evaluate(
    df=preds_eval,
    train_df=train,
    metrics=[rmse, mae, mape, partial(mase, seasonality=12)],
)

evaluation

	unique_id	metric	SNaive
0	A3349908R	rmse	45.328759
1	A3349908R	mae	40.792708
2	A3349908R	mape	0.325858
3	A3349908R	mase	7.934100

The seasonal naive model forecast error analysis was conducted using RMSE, MAE, MAPE, and MASE for the 2011+ test set data. The forecast error values are RMSE = 45.33 and MAE = 40.79, which means there is a large amount of average difference in forecasted versus actual data of approximately 45.33 and 40.79 respectively. The percentage of difference between forecasted and actual data was relatively high with a MAPE value of approximately 0.326 (or 32.6 percent). In addition to this, the MASE was 7.93, indicating that the seasonal naive forecasting method performed significantly worse than the basic seasonal in-sample benchmarking method for this dataset.

How sensitive are the accuracy measures to the amount of training data used?

Code

def eval_with_cutoff(cutoff):
    tr = data.loc[data["ds"] < cutoff].copy()
    te = data.loc[data["ds"] >= cutoff].copy()
    sf_local = StatsForecast(models=[SeasonalNaive(m, alias="SNaive")], freq="ME", n_jobs=1)
    sf_local.fit(tr)
    pr = sf_local.forecast(df=tr, h=len(te), fitted=False).merge(te[["unique_id","ds","y"]], on=["unique_id","ds"], how="left")
    ev = evaluate(pr, train_df=tr, metrics=[rmse, mae, mape, partial(mase, seasonality=m)])
    ev["cutoff"] = str(cutoff)[:10]
    return ev

cutoffs = ["2009-01-01", "2010-01-01", "2011-01-01"]
sens = pd.concat([eval_with_cutoff(c) for c in cutoffs], ignore_index=True)
sens

	unique_id	metric	SNaive	cutoff
0	A3349908R	rmse	58.353861	2009-01-01
1	A3349908R	mae	52.932500	2009-01-01
2	A3349908R	mape	0.455890	2009-01-01
3	A3349908R	mase	11.588595	2009-01-01
4	A3349908R	rmse	54.441381	2010-01-01
5	A3349908R	mae	50.565741	2010-01-01
6	A3349908R	mape	0.429018	2010-01-01
7	A3349908R	mase	10.751542	2010-01-01
8	A3349908R	rmse	45.328759	2011-01-01
9	A3349908R	mae	40.792708	2011-01-01
10	A3349908R	mape	0.325858	2011-01-01
11	A3349908R	mase	7.934100	2011-01-01

The forecast evaluation was carried out over three cut-off years (2009, 2010, 2011) measuring the accuracy of forecasts with different training sample sizes. Both RMSE & MASE decreased in value as sample sizes increased; for example, RMSE of 58.35 at the 2009 cutoff compared to RMSE of 45.33 at the 2011 cut-off. MASEs went from 11.59 to 7.93 during this time period. Therefore, forecasts become more accurate with additional historical data used to estimate model parameters.