According to the Australian population series, there is a steady and consistent increase to the overall population with no noticeable yearly fluctuation. Therefore, a Random Walk with Drift model is suitable for continuing gradual long-term growth, by projecting the previously measured average increase into the future. The forecast results in a smooth continuation of the trend and agrees with what has happened in the past with respect to the population size. The forecast intervals show an increasing interval width over time; demonstrating an increase in unpredictability as the forecast horizon expands. This is also consistent with the nature of this model.
Bricks (aus_production)
Code
import pandas as pdfrom statsforecast import StatsForecastfrom statsforecast.models import SeasonalNaiveaus_prod = pd.read_csv("aus_production.csv")time_col =next(c for c in ["ds","date","Date","quarter","Quarter","month","Month","time","Time"] if c in aus_prod.columns)ds = pd.to_datetime(aus_prod[time_col], errors="coerce")if ds.isna().all(): ds = pd.PeriodIndex(aus_prod[time_col].astype(str), freq="Q").to_timestamp()bricks = pd.DataFrame({"unique_id": "Bricks","ds": ds,"y": aus_prod["Bricks"]}).dropna()snaive = SeasonalNaive(4, alias="SNaive")sf = StatsForecast(models=[snaive], freq="QE")forecast_bricks = sf.forecast(df=bricks, h=8, level=[80, 95])forecast_bricks
unique_id
ds
SNaive
SNaive-lo-80
SNaive-lo-95
SNaive-hi-80
SNaive-hi-95
0
Bricks
2005-06-30
428.0
366.061796
333.273691
489.938204
522.726309
1
Bricks
2005-09-30
397.0
335.061796
302.273691
458.938204
491.726309
2
Bricks
2005-12-31
355.0
293.061796
260.273691
416.938204
449.726309
3
Bricks
2006-03-31
435.0
373.061796
340.273691
496.938204
529.726309
4
Bricks
2006-06-30
428.0
340.406152
294.036769
515.593848
561.963231
5
Bricks
2006-09-30
397.0
309.406152
263.036769
484.593848
530.963231
6
Bricks
2006-12-31
355.0
267.406152
221.036769
442.593848
488.963231
7
Bricks
2007-03-31
435.0
347.406152
301.036769
522.593848
568.963231
Code
import pandas as pdimport matplotlib.pyplot as pltfrom statsforecast import StatsForecastfrom statsforecast.models import SeasonalNaiveaus_prod = pd.read_csv("aus_production.csv")time_col =next(c for c in ["ds","date","Date","quarter","Quarter","month","Month","time","Time"] if c in aus_prod.columns)ds = pd.to_datetime(aus_prod[time_col], errors="coerce")if ds.isna().all(): ds = pd.PeriodIndex(aus_prod[time_col].astype(str), freq="Q").to_timestamp()bricks = pd.DataFrame({"unique_id": "Bricks", "ds": ds, "y": aus_prod["Bricks"]}).dropna()snaive = SeasonalNaive(4, alias="SNaive")sf = StatsForecast(models=[snaive], freq="QE")forecast_bricks = sf.forecast(df=bricks, h=8, level=[80, 95])fig, ax = plt.subplots(figsize=(10, 4))ax.plot(bricks["ds"], bricks["y"], linewidth=1, label="History")ax.plot(forecast_bricks["ds"], forecast_bricks["SNaive"], linewidth=2, label="Forecast")ax.fill_between(forecast_bricks["ds"], forecast_bricks["SNaive-lo-95"], forecast_bricks["SNaive-hi-95"], alpha=0.2, label="95% PI")ax.fill_between(forecast_bricks["ds"], forecast_bricks["SNaive-lo-80"], forecast_bricks["SNaive-hi-80"], alpha=0.35, label="80% PI")ax.set_title("Bricks Forecast (Seasonal Naive, period=4)")ax.set_xlabel("Quarter")ax.set_ylabel("Bricks")ax.legend()fig.tight_layout()plt.show()
The bricks product line has strong quarterly seasonality, with distinct patterns of seasons comprising peaks and valleys each year, along with noticeable shorter-term fluctuations. Therefore, it is appropriate to use a Seasonal Naive model with a periodicity of 4 for forecasting because this model will repeat values that were realized in the same quarter of the previous year. The forecast continues the established seasonal pattern without adding any changes to trends formed from past observations. The forecast’s prediction intervals widen slightly as the forecast period lengthens, so they accommodate for uncertainty while still capturing the seasonality that has been previously observed in the data.
NSW Lambs (aus_livestock)
Code
import pandas as pdfrom statsforecast import StatsForecastfrom statsforecast.models import SeasonalNaiveaus_livestock = pd.read_csv("aus_livestock.csv", parse_dates=["ds"])uid =next(u for u in aus_livestock["unique_id"].unique() if"Lamb"instr(u))lambs = aus_livestock.loc[aus_livestock["unique_id"] == uid, ["unique_id", "ds", "y"]].dropna()snaive = SeasonalNaive(12, alias="SNaive")sf = StatsForecast(models=[snaive], freq="ME")forecast_lambs = sf.forecast(df=lambs, h=12, level=[80, 95])forecast_lambs
unique_id
ds
SNaive
SNaive-lo-80
SNaive-lo-95
SNaive-hi-80
SNaive-hi-95
0
Australian Capital Territory_Lambs
2018-12-31
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
1
Australian Capital Territory_Lambs
2019-01-31
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
2
Australian Capital Territory_Lambs
2019-02-28
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
3
Australian Capital Territory_Lambs
2019-03-31
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
4
Australian Capital Territory_Lambs
2019-04-30
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
5
Australian Capital Territory_Lambs
2019-05-31
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
6
Australian Capital Territory_Lambs
2019-06-30
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
7
Australian Capital Territory_Lambs
2019-07-31
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
8
Australian Capital Territory_Lambs
2019-08-31
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
9
Australian Capital Territory_Lambs
2019-09-30
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
10
Australian Capital Territory_Lambs
2019-10-31
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
11
Australian Capital Territory_Lambs
2019-11-30
0.0
-7936.301301
-12137.525433
7936.301301
12137.525433
Code
import pandas as pdimport matplotlib.pyplot as pltfrom statsforecast import StatsForecastfrom statsforecast.models import SeasonalNaiveaus_livestock = pd.read_csv("aus_livestock.csv", parse_dates=["ds"])uid =next(u for u in aus_livestock["unique_id"].unique() if ("New South Wales"instr(u)) and ("Lamb"instr(u)))lambs = ( aus_livestock.loc[aus_livestock["unique_id"] == uid, ["unique_id", "ds", "y"]] .dropna() .sort_values("ds"))lambs = lambs.loc[lambs["y"] >0]snaive = SeasonalNaive(12, alias="SNaive")sf = StatsForecast(models=[snaive], freq="ME")forecast_lambs = sf.forecast(df=lambs, h=12, level=[80, 95])fig, ax = plt.subplots(figsize=(10, 4))ax.plot(lambs["ds"], lambs["y"], linewidth=1, label="History")ax.plot(forecast_lambs["ds"], forecast_lambs["SNaive"], linewidth=2, label="Forecast")ax.fill_between(forecast_lambs["ds"], forecast_lambs["SNaive-lo-95"], forecast_lambs["SNaive-hi-95"], alpha=0.2, label="95% PI")ax.fill_between(forecast_lambs["ds"], forecast_lambs["SNaive-lo-80"], forecast_lambs["SNaive-hi-80"], alpha=0.35, label="80% PI")ax.set_title(f"{uid} Forecast (Seasonal Naive, period=12)")ax.set_xlabel("Year")ax.set_ylabel("Lambs")ax.legend()fig.tight_layout()plt.show()
The Lamb Series varies significantly both seasonally and short term (month to month) from one year to another and continues to show evidence of both. With the presence of monthly data and repeating seasonal behavior, it is appropriate to use a seasonal naive model with a period of 12, by repeating the value of the same month from the previous year. The prediction interval continues the previously established seasonal behavior and does not impose any additional trend structure on the forecast. As the forecast period extends into the future, the width of prediction intervals will continue to increase, indicating increasing uncertainty but sustaining the previously observed seasonal behavior.
The series of household wealth represents a general increase in value; however there have been periods where the data appears to fluctuate greatly as a result of economic volatility. The lack of any seasonality and consistent growth over time justifies the use of a random walk with drift model as it allows the average growth rate from historical data to continue into future years. The forecast is consistent with the long term trend of increasing household wealth shown by the data. As the forecast time frame increases, the prediction intervals expand dramatically due to the increased uncertainty associated with longer timeframes as is typical with these models.
Australian takeaway food turnover (aus_retail)
Code
import pandas as pdfrom statsforecast import StatsForecastfrom statsforecast.models import SeasonalNaiveaus_retail = pd.read_csv("aus_retail.csv")date_col =next(c for c in ["ds","Month","month","Date","date"] if c in aus_retail.columns)takeaway = ( aus_retail.loc[aus_retail["Industry"] =="Takeaway food services", [date_col, "Turnover"]] .rename(columns={date_col: "ds", "Turnover": "y"}))takeaway["ds"] = pd.to_datetime(takeaway["ds"])takeaway["unique_id"] ="Takeaway food services"takeaway = takeaway[["unique_id", "ds", "y"]].dropna()snaive = SeasonalNaive(12, alias="SNaive")sf = StatsForecast(models=[snaive], freq="ME")forecast_takeaway = sf.forecast(df=takeaway, h=12, level=[80, 95])forecast_takeaway
There is considerable seasonality in the monthly turnover pattern of takeaway food turnover, with distinct peaks and troughs occurring every twelve months and there has been a clear increase in turnover over time. Because the pattern is clearly seasonal, a Seasonal Naive model having a period of 12 is a suitable method of forecasting, as it simply uses the most recent value of the same month from the prior year as the basis for making future forecasts. By using this method, the seasonal component is preserved while no trend adjustments will be made to the existing trend. The prediction intervals of the forecast will increase in width as the forecast period increases, with the prediction intervals changing according to how the results from the forecast differ from the expected return in monthly turnover based upon the seasonal pattern seen in historically recorded takeaway food turnover.
Exercise 5.2
Use the Facebook stock price (data set gafa_stock) to do the following:
From 2014-2019, Facebook’s closing stock price is shown to have a considerable amount of short-term volatility but mainly continues to rise overall through the majority of that time frame. The stock price continually rises up until around 2018 at which point there is an obvious decrease in stock value. The data does not have a seasonality component or repeating pattern within it. The entire series operates like an ordinary financial time series characterized by movements upward and downward and continuously changing levels over time.
Produce forecasts using the drift method and plot them.
Facebook’s stock price is projected into the future as a straight line based on the average of historical variations from Facebook’s stock price. Hence, there has been a slight increase to the total projection because of the upward trend across the entire dataset. The confidence intervals get wider over the extended forecast period demonstrating that uncertainty continues to increase further out in time; however because of considerable short term fluctuations in stock prices, it is uncertain that using a drift method of forecasting will provide a complete understanding of any rapid or significant shifts or reversals in stock prices.
Show that the forecasts are identical to extending the line drawn between the first and last observations.
Drift forecast mathematically represents the continuous projection of linearly extending the line connecting the initial and last observation in the training set, and it will yield identical results when comparing the calculated drift forecast with manually calculating linear projection of the same last observation to the last drift forecast and finding an identical outcome (a difference of 0) demonstrating that this method is used for predicting future value(s) based upon an average slope of historical series.
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
The Naive and Drift forecasts are comparable over short horizons; however, the Drift forecast has a slight upward slope due to the addition of the average historical increases over the entire time-series of data. In contrast, the current most recent observations are declining, which is not adequately captured in the Drift forecast. The Naive forecast simply takes the last observation and utilizes that as the next value, thereby making the Naive forecast much more consistent with the random walk behavior observed in stock prices. Therefore, the Naive forecast will usually be a more appropriate benchmark for short-term forecasting of daily stock prices.
Exercise 5.3
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992.
The seasonal naïve model (evaluation period of four) clearly captures the strong seasonal nature of Australian beer production over four quarters per year. The original pattern observed in the historical beer production data will continue to have around 25% similarities to that of the seasonally naïve forecasting model (i.e., the forecast is essentially a replication of the last four quarters of the actual value, also known as forecast replication).
The residual time series plot shows that residuals fluctuate around zero (the average residual), but the magnitude of the fluctuations are almost evenly distributed between the type of trend that has been fit to the model and no historical trend; therefore, the model does capture most of the seasonality. However, there are still a large number of large positive residuals and a similar number of large negative residuals remaining in the residuals distribution; therefore, there is still a significant amount of variation remaining that is not captured by the model.
The residual ACF shows at least one statistically significant spike (clearly at lag four), which means that the residuals also contain some degree of seasonal autocorrelation. If the residuals were pure white noise, then the vast majority of the spikes would not exceed the 95% confidence boundaries.
The histogram of the residuals is centre approximately on the zero mark; however, the residuals are also very spread out and do not show a perfect symmetric distribution.
Based on the data collected from the residual analysis, the residuals are not perfect white noise due to the presence of remaining autocorrelation, and while the unchanged productivity delivered by the seasonal naïve modelling approach provides a reasonable baseline for future comparisons, more advanced seasonal modelling could result in improved production forecast accuracy.
Exercise 5.4
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of Naive() or SeasonlNaive() is more appropriate in each case.
The Naive approach was applied to the Australian Exports series data (no obvious correlation). The exports have increased continuously plus inconsistently over time, however, there is no seasonal variation in them. The Naive forecasting method uses the last data point (observation) to produce a repetitive pattern of prediction into the future. Therefore, since there is an optimistic trend for increasing exports into the future, the Naive approach will not give a good representation of the overall trend because the values would look like flat lines rather than an increasing pattern. Thus, the Naive approach can only provide a basic benchmark of the data, not an accurate and legitimate long-range forecast.
The Seasonal Naive forecasting method (period of 4) is used in conjunction with the aus_production Bricks series location/volume. The use of this approach was appropriate because the data demonstrated standard repeating seasonal patterns (seasonally positive high peaks and low valleys) for each year. The Seasonal Naive method will replicate the previous year’s seasonal pattern of one year prior in how it forecasts future activity because it is inherently repetitive in nature - this will produce quarterly forecast error variance that reflects the historical trend of the past few years (i.e., provide for cyclic Quarterly Forecast error variance rather than a consistency of trend throughout each quarter).
Exercise 5.7
For your retail time series (from Exercise 7 in Section 2.10):
Create a training dataset consisting of observations before 2011
Code
import numpy as npimport pandas as pdaus_retail = pd.read_csv("aus_retail.csv")if"ds"notin aus_retail.columns: date_col =next(c for c in ["Month", "month", "Date", "date"] if c in aus_retail.columns) aus_retail[date_col] = pd.to_datetime(aus_retail[date_col], errors="coerce") aus_retail = aus_retail.rename(columns={date_col: "ds"})if"y"notin aus_retail.columns: y_col =next(c for c in ["Turnover", "turnover", "value", "Value"] if c in aus_retail.columns) aus_retail = aus_retail.rename(columns={y_col: "y"})if"unique_id"notin aus_retail.columns:if"Series ID"in aus_retail.columns: aus_retail = aus_retail.rename(columns={"Series ID": "unique_id"})else: id_candidates = [c for c in ["Industry", "State", "Region", "Series"] if c in aus_retail.columns] aus_retail["unique_id"] = aus_retail[id_candidates].astype(str).agg("_".join, axis=1)aus_retail["ds"] = pd.to_datetime(aus_retail["ds"], errors="coerce")aus_retail["y"] = pd.to_numeric(aus_retail["y"], errors="coerce")aus_retail = aus_retail.dropna(subset=["ds", "y"]).sort_values(["unique_id", "ds"])rng = np.random.default_rng(12345678)chosen_id = rng.choice(aus_retail["unique_id"].unique(), 1)[0]data = aus_retail.loc[aus_retail["unique_id"] == chosen_id, ["unique_id", "ds", "y"]].copy()train = data.loc[data["ds"] <"2011-01-01"].copy()test = data.loc[data["ds"] >="2011-01-01"].copy()chosen_id, train.shape, test.shape
('A3349908R', (345, 3), (96, 3))
The dataset identified by the identifier A3349908R was separated into a training set of all observations up until the end of 2010 (345) and assigned a test set containing an observations from 2011 and on (96). As the tests will be performed over the future documentation, the evaluation of the model will be based on the unobserved future data.
Check that your data have been split appropriately by producing the following plot.
Code
import matplotlib.pyplot as pltfig, ax = plt.subplots(figsize=(10, 4))ax.plot(train["ds"], train["y"], label="Train")ax.plot(test["ds"], test["y"], label="Test")ax.set_title(f"Retail series split (unique_id = {chosen_id})")ax.set_xlabel("Date")ax.set_ylabel("y")ax.legend()fig.tight_layout()plt.show()
The plot provides evidence that the split between training and testing data occurred correctly. The training data ends just prior to (2011) and the testing data begins in (2011) without any overlap. Time ordering is retained in both segments as well since a strong seasonality can be observed even in both data sets. Test Data appear to be significantly elevated relative to training data indicating potential upward movement beyond (2011).
Fit a seasonal naïve model using SeasonalNaive() applied to your training data.
The training data was fitted with a Seasonal Naive model. The Seasonal Naive model had a seasonal component (m=12) which indicates monthly seasonality existed when this model was fit. Forecasts generated by this model entail that the forecast will equal the value from the same month in the previous year. The output demonstrates that forecasts were produced for the 96 months in the test horizon as expected.
Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?
The plot of the residual time shows that the data cluster and that the variance is much larger than expected in later years indicating that the variance is not constant across the series. The plot of ACF indicates that there are certainly significant levels of autocorrelation (especially for the initial lags and near the seasonal lags), indicating that the errors are not independent.
The histogram of the residuals is, however, approximately centered around zero and has a small positive skew and large tails.
Therefore, the residuals are not uncorrelated, nor are they normally distributed; this suggests that the Seasonal Naïve model does not adequately capture all of the structure of the dependant variable, including its seasonal and dynamic components.
Seasonal Naive forecasts repeat the same seasonal pattern as last season without being able to capture all the upward trend that was demonstrated in the test period; therefore, these forecasts have systematically projected below the true values for the test periods following 2011.
Compare the accuracy of your forecasts against the actual values.
The seasonal naive model forecast error analysis was conducted using RMSE, MAE, MAPE, and MASE for the 2011+ test set data. The forecast error values are RMSE = 45.33 and MAE = 40.79, which means there is a large amount of average difference in forecasted versus actual data of approximately 45.33 and 40.79 respectively. The percentage of difference between forecasted and actual data was relatively high with a MAPE value of approximately 0.326 (or 32.6 percent). In addition to this, the MASE was 7.93, indicating that the seasonal naive forecasting method performed significantly worse than the basic seasonal in-sample benchmarking method for this dataset.
How sensitive are the accuracy measures to the amount of training data used?
Code
def eval_with_cutoff(cutoff): tr = data.loc[data["ds"] < cutoff].copy() te = data.loc[data["ds"] >= cutoff].copy() sf_local = StatsForecast(models=[SeasonalNaive(m, alias="SNaive")], freq="ME", n_jobs=1) sf_local.fit(tr) pr = sf_local.forecast(df=tr, h=len(te), fitted=False).merge(te[["unique_id","ds","y"]], on=["unique_id","ds"], how="left") ev = evaluate(pr, train_df=tr, metrics=[rmse, mae, mape, partial(mase, seasonality=m)]) ev["cutoff"] =str(cutoff)[:10]return evcutoffs = ["2009-01-01", "2010-01-01", "2011-01-01"]sens = pd.concat([eval_with_cutoff(c) for c in cutoffs], ignore_index=True)sens
unique_id
metric
SNaive
cutoff
0
A3349908R
rmse
58.353861
2009-01-01
1
A3349908R
mae
52.932500
2009-01-01
2
A3349908R
mape
0.455890
2009-01-01
3
A3349908R
mase
11.588595
2009-01-01
4
A3349908R
rmse
54.441381
2010-01-01
5
A3349908R
mae
50.565741
2010-01-01
6
A3349908R
mape
0.429018
2010-01-01
7
A3349908R
mase
10.751542
2010-01-01
8
A3349908R
rmse
45.328759
2011-01-01
9
A3349908R
mae
40.792708
2011-01-01
10
A3349908R
mape
0.325858
2011-01-01
11
A3349908R
mase
7.934100
2011-01-01
The forecast evaluation was carried out over three cut-off years (2009, 2010, 2011) measuring the accuracy of forecasts with different training sample sizes. Both RMSE & MASE decreased in value as sample sizes increased; for example, RMSE of 58.35 at the 2009 cutoff compared to RMSE of 45.33 at the 2011 cut-off. MASEs went from 11.59 to 7.93 during this time period. Therefore, forecasts become more accurate with additional historical data used to estimate model parameters.