1. Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.
Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation period containing the last 12 months of data (year 2001). She then fit a forecasting model to sales, using the training period.
Partition the data into the training and validation periods as explained above.
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1995 | 1665 | 2398 | 2841 | 3547 | 3753 | 3715 | 4350 | 3566 | 5022 | 6423 | 7601 | 19756 |
1996 | 2500 | 5198 | 7225 | 4806 | 5901 | 4951 | 6179 | 4752 | 5496 | 5835 | 12600 | 28542 |
1997 | 4717 | 5703 | 9958 | 5305 | 6492 | 6631 | 7350 | 8177 | 8573 | 9690 | 15152 | 34061 |
1998 | 5921 | 5815 | 12421 | 6370 | 7609 | 7225 | 8121 | 7979 | 8093 | 8477 | 17915 | 30114 |
1999 | 4827 | 6470 | 9639 | 8821 | 8722 | 10209 | 11277 | 12552 | 11637 | 13607 | 21822 | 45061 |
2000 | 7615 | 9850 | 14558 | 11587 | 9333 | 13082 | 16733 | 19889 | 23933 | 25391 | 36025 | 80722 |
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2001 | 10243 | 11267 | 21827 | 17357 | 15998 | 18602 | 26155 | 28587 | 30505 | 30821 | 46634 | 104661 |
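The partition described above can be sketched in a few lines of Python. This is only an illustration: the values are the rounded figures from the two tables, and the names `sales_by_year`, `series`, `train`, and `valid` are hypothetical, not taken from the original analysis.

```python
# Monthly sales by year, copied (rounded) from the tables above.
sales_by_year = {
    1995: [1665, 2398, 2841, 3547, 3753, 3715, 4350, 3566, 5022, 6423, 7601, 19756],
    1996: [2500, 5198, 7225, 4806, 5901, 4951, 6179, 4752, 5496, 5835, 12600, 28542],
    1997: [4717, 5703, 9958, 5305, 6492, 6631, 7350, 8177, 8573, 9690, 15152, 34061],
    1998: [5921, 5815, 12421, 6370, 7609, 7225, 8121, 7979, 8093, 8477, 17915, 30114],
    1999: [4827, 6470, 9639, 8821, 8722, 10209, 11277, 12552, 11637, 13607, 21822, 45061],
    2000: [7615, 9850, 14558, 11587, 9333, 13082, 16733, 19889, 23933, 25391, 36025, 80722],
    2001: [10243, 11267, 21827, 17357, 15998, 18602, 26155, 28587, 30505, 30821, 46634, 104661],
}

# Flatten into one chronological list of (year, month, sales) records.
series = [
    (year, month, value)
    for year in sorted(sales_by_year)
    for month, value in enumerate(sales_by_year[year], start=1)
]

# Hold out the last 12 months (year 2001) as the validation period.
n_valid = 12
train, valid = series[:-n_valid], series[-n_valid:]

print(len(train), len(valid))  # 72 12
print(train[-1])               # (2000, 12, 80722)
print(valid[0])                # (2001, 1, 10243)
```

The split is by time order rather than at random, so the validation period is always the most recent stretch of the series.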
(a) Why was the data partitioned?
Partitioning data helps address the problem of overfitting a model and allows the modeler to evaluate the performance of a model by measuring forecast errors. Models are trained on the earlier portion and their predictive performance is assessed on the later portion. In the souvenir sales example, the analyst uses the training set to build a forecasting model and then evaluates it by comparing the forecasts the model produces for the next year with the actual data in the validation set.
(b) Why did the analyst choose a 12-month validation period?
The validation period should mimic the forecast horizon: the store wants forecasts for the next 12 months, so the model is evaluated on 12 months. Additionally, too long a validation period would leave too little of the most recent data in the training period for building a model. The data frequency and the forecasting goal also need to be considered when deciding on the length of the validation period.
(c) What is the naive forecast for the validation period? (assume that you must provide forecasts for 12 months ahead)
The naive forecast repeats the last data point in the training set, i.e., the most recent observation. The forecast for Jan 2001 therefore equals the Dec 2000 value (80722). Because forecasts must be provided 12 months ahead, no 2001 actuals are available to update it, so the forecast for Feb 2001 equals the forecast for Jan 2001, and the same value carries through the rest of the forecast period. When plotted, the naive forecast is a flat horizontal line. There is also a seasonal naive forecast, which accounts for the seasonal cycle in a time series by repeating the value from the same season one cycle earlier.
Before deploying a forecast, I will plot the entire Souvenir Sales time series data set to see if there is seasonality.
The time series shows a seasonal pattern with a small spike in sales early in the year, followed by a slow, steady incline, and a large spike at the end of the year. This is supported by both the season plot and the monthly plot of the entire time series.
The lack of overlap in the month lines, as well as the relative straightness of the lines, confirms the presence of seasonality for the Souvenir Sales time series data set.
Because there is seasonality, I will show both naive (as required in the question) and seasonal naive forecasts for 2001. A seasonal naive forecast is the appropriate choice for this data. The first table below gives the naive forecast and the second gives the seasonal naive forecast.
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2001 | 80722 | 80722 | 80722 | 80722 | 80722 | 80722 | 80722 | 80722 | 80722 | 80722 | 80722 | 80722 |
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2001 | 7615 | 9850 | 14558 | 11587 | 9333 | 13082 | 16733 | 19889 | 23933 | 25391 | 36025 | 80722 |
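Both forecast tables can be reproduced with a minimal Python sketch. Only the year 2000 values are needed, since that is the last season of the training period. The function names `naive_forecast` and `seasonal_naive_forecast` are hypothetical helpers written for this illustration, not calls from any forecasting library.

```python
# Last 12 months of the training period (year 2000), from the data table.
train_2000 = [7615, 9850, 14558, 11587, 9333, 13082,
              16733, 19889, 23933, 25391, 36025, 80722]


def naive_forecast(history, horizon):
    """Repeat the most recent observation for every step ahead."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the value from the same season in the last observed cycle."""
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


naive_2001 = naive_forecast(train_2000, horizon=12)
snaive_2001 = seasonal_naive_forecast(train_2000, horizon=12, season_length=12)

print(naive_2001)   # twelve copies of 80722
print(snaive_2001)  # identical to the year 2000 values
```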
In the forecast plots, the blue line is the forecast and the shaded grey areas are prediction intervals.
(d) Compute the RMSE and MAPE for the naive forecasts.
For the naive forecast, the validation period (test set) RMSE = 56099.07 and the MAPE = 290.95049%.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
---|---|---|---|---|---|---|---|---|
Training set | 1113.477 | 10460.73 | 5506.879 | -25.27554 | 61.16191 | 1.47054 | -0.1968879 | NA |
Test set | -50500.287 | 56099.07 | 54490.114 | -287.13834 | 290.95049 | 14.55087 | 0.3182456 | 6.649124 |
For the seasonal naive forecast, the validation period (test set) RMSE = 9542.346 and the MAPE = 27.27926%.
Set | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
---|---|---|---|---|---|---|---|---|
Training set | 3401.360 | 6467.818 | 3744.800 | 22.39270 | 25.64127 | 1.000000 | 0.4140974 | NA |
Test set | 7828.278 | 9542.346 | 7828.278 | 27.27926 | 27.27926 | 2.090439 | 0.2264895 | 0.7373759 |
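The validation-period RMSE and MAPE in the two tables can be checked by hand. The sketch below assumes the usual definitions (error = actual minus forecast, MAPE in percent) and uses the rounded values from the data tables, so its results agree with the reported figures only up to small rounding differences. The helper names `rmse` and `mape` are hypothetical.

```python
import math

# Actual 2001 sales (validation period) and 2000 sales (last training year).
actual_2001 = [10243, 11267, 21827, 17357, 15998, 18602,
               26155, 28587, 30505, 30821, 46634, 104661]
sales_2000 = [7615, 9850, 14558, 11587, 9333, 13082,
              16733, 19889, 23933, 25391, 36025, 80722]


def rmse(actual, forecast):
    """Root mean squared error."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(ratios) / len(ratios)


naive = [sales_2000[-1]] * 12   # Dec 2000 repeated for every month
snaive = list(sales_2000)       # same month of the previous year

for name, forecast in [("naive", naive), ("seasonal naive", snaive)]:
    print(name, round(rmse(actual_2001, forecast), 2),
          round(mape(actual_2001, forecast), 2))
```

On both measures the seasonal naive forecast is far closer to the 2001 actuals than the flat naive forecast.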
(e1) Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period).
The histogram below shows the frequency distribution of seasonal naive forecasting errors. The snaive forecast method is consistently underpredicting sales.
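As a rough text substitute for the histogram, the seasonal naive forecast errors for 2001 can be computed and binned directly. The bin width of 5000 is an arbitrary choice made for this sketch, and the values are the rounded figures from the data tables.

```python
from collections import Counter

actual_2001 = [10243, 11267, 21827, 17357, 15998, 18602,
               26155, 28587, 30505, 30821, 46634, 104661]
snaive_2001 = [7615, 9850, 14558, 11587, 9333, 13082,
               16733, 19889, 23933, 25391, 36025, 80722]

# Forecast error = actual - forecast; positive means underprediction.
errors = [a - f for a, f in zip(actual_2001, snaive_2001)]

# Count errors per bin; each key is the lower edge of its bin.
bin_width = 5000  # assumed bin width, chosen only for illustration
counts = Counter((e // bin_width) * bin_width for e in errors)

for lower in sorted(counts):
    print(f"[{lower}, {lower + bin_width}): {'#' * counts[lower]}")

print(all(e > 0 for e in errors))  # True: every month is underpredicted
```

Every error is positive, which is the underprediction described above, and the single largest error belongs to December.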
(e2) Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period.
What we see in the plots below is a seasonal naive forecast that consistently underpredicts actual sales.
Most residuals are over the zero line, further illustrating underpredicting.
The distribution of error terms in the training period is not normal; we can surmise this because the points do not follow a straight line at a 45-degree angle from the lower left-hand corner.
Residuals from the validation period are not normally distributed either; the points do not follow a 45-degree line and the errors are positive, indicating underprediction.
(e3) What can you say about the behavior of the naive forecasts?
Actual sales of souvenirs in 2001 are quite different from the naive forecast. This is evident in the plot “2001 Souvenir Sales: Actual Sales and Naive and Seasonal Naive Forecasts”, which shows both actual and forecast values: the red line shows actual sales and the blue line shows the seasonal naive forecast. (The straight green line is the naive forecast, which does not take seasonality into consideration.) As we saw earlier, the time series exhibits seasonality, so a seasonal naive forecast is more appropriate. Although the seasonal naive forecast tracks actual souvenir sales more closely, it consistently underpredicts, as the four plots above show. Furthermore, the errors for both the training and validation periods are not normally distributed.
The seasonal naive forecast is generally a consistent underprediction of the actual data.
(f) The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for year 2002?
The forecaster must recombine the training and validation sets and refit the chosen model to the full data set before generating and deploying a forecast for 2002. The validation set holds the most recent data, which is valuable because it is likely to lead to a more accurate forecast. In addition, generating a forecast from the training set alone forces the forecaster to forecast further into the future than she would with the recombined data: to forecast 2002 from only the training set, it would be necessary to forecast both 2001 and 2002. The further out the forecast horizon, the less accurate the forecast becomes.
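The effect of recombining can be illustrated with a small Python sketch. The analyst's actual model is not specified, so the seasonal naive method stands in for it here purely as an assumption. Only the last two years are listed because that stand-in looks back just one season, and `seasonal_naive_forecast` is a hypothetical helper.

```python
# Last training year (2000) and the validation year (2001), from the tables.
train_tail = [7615, 9850, 14558, 11587, 9333, 13082,
              16733, 19889, 23933, 25391, 36025, 80722]
valid = [10243, 11267, 21827, 17357, 15998, 18602,
         26155, 28587, 30505, 30821, 46634, 104661]


def seasonal_naive_forecast(history, horizon, season_length=12):
    """Stand-in model: repeat the last observed season (assumption)."""
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


# Recombined data: 2002 is only 12 steps ahead and uses the 2001 values.
full = train_tail + valid
forecast_2002_full = seasonal_naive_forecast(full, horizon=12)

# Training data only: 2002 is steps 13 to 24 ahead and ignores 2001 entirely.
forecast_2002_train_only = seasonal_naive_forecast(train_tail, horizon=24)[12:]

print(forecast_2002_full[-1])        # 104661 (Dec 2001 carried forward)
print(forecast_2002_train_only[-1])  # 80722 (Dec 2000 carried forward)
```

With the training data alone, the 2002 forecast is built from values that are already a year out of date.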
2. Forecasting Shampoo Sales: The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a three-year period.
If the goal is forecasting sales in future months, which of the following steps should be taken? (choose one or more)
partition the data into training and validation periods
Yes. This is a necessary step in forecasting: to find the best forecasting model, candidate models need to be fitted to one portion of the data and then tested for predictive accuracy on data they have not seen. This can only be done with partitioning.
examine time plots of the series and of model forecasts only for the training period
No. Plots of both the training and validation periods should be examined. A plot of the forecasts that the model trained on the training data produces for the validation period can be compared to the actual data. It is important to “eyeball” how the forecasts and the actual data look when plotted together.
look at MAPE and RMSE values for the training period
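No, the MAPE and RMSE are less important considerations in the training period. They describe how well the model fits the data it was built on, not how well it forecasts new data.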
look at MAPE and RMSE values for the validation period
Yes, this is an essential step in order to evaluate predictive performance of a forecast. The validation period is a more objective basis than the training period, and therefore these computations are important to consider in the test set.
compute naive forecasts
Yes, naive and seasonal naive forecasts provide an important baseline against which other forecasts and their errors can be compared. They are often not the model of choice; however, they should always be run.