R Markdown

1. Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.

Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation period containing the last 12 months of data (year 2001). She then fit a forecasting model to sales, using the training period.

Partition the data into the training and validation periods as explained above.

Training Set

Year    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
1995   1665   2398   2841   3547   3753   3715   4350   3566   5022   6423   7601  19756
1996   2500   5198   7225   4806   5901   4951   6179   4752   5496   5835  12600  28542
1997   4717   5703   9958   5305   6492   6631   7350   8177   8573   9690  15152  34061
1998   5921   5815  12421   6370   7609   7225   8121   7979   8093   8477  17915  30114
1999   4827   6470   9639   8821   8722  10209  11277  12552  11637  13607  21822  45061
2000   7615   9850  14558  11587   9333  13082  16733  19889  23933  25391  36025  80722

Validation Set

Year    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
2001  10243  11267  21827  17357  15998  18602  26155  28587  30505  30821  46634 104661
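The partition can be reproduced in a few lines. The following is a minimal sketch in Python, not the analyst's actual code; it assumes the series is held as a plain list of the rounded monthly values printed in the tables, and the names `sales_by_year`, `train`, and `valid` are illustrative.

```python
# Rounded monthly sales as printed in the tables (Jan..Dec for each year).
sales_by_year = {
    1995: [1665, 2398, 2841, 3547, 3753, 3715, 4350, 3566, 5022, 6423, 7601, 19756],
    1996: [2500, 5198, 7225, 4806, 5901, 4951, 6179, 4752, 5496, 5835, 12600, 28542],
    1997: [4717, 5703, 9958, 5305, 6492, 6631, 7350, 8177, 8573, 9690, 15152, 34061],
    1998: [5921, 5815, 12421, 6370, 7609, 7225, 8121, 7979, 8093, 8477, 17915, 30114],
    1999: [4827, 6470, 9639, 8821, 8722, 10209, 11277, 12552, 11637, 13607, 21822, 45061],
    2000: [7615, 9850, 14558, 11587, 9333, 13082, 16733, 19889, 23933, 25391, 36025, 80722],
    2001: [10243, 11267, 21827, 17357, 15998, 18602, 26155, 28587, 30505, 30821, 46634, 104661],
}

# Flatten into one chronological series of 84 monthly observations.
series = [v for year in sorted(sales_by_year) for v in sales_by_year[year]]

# Hold out the last 12 months (year 2001) as the validation period.
n_valid = 12
train, valid = series[:-n_valid], series[-n_valid:]

print(len(train), len(valid))  # 72 12
print(train[-1], valid[0])     # 80722 10243
```

Splitting by position rather than at random keeps the time order intact, so the validation period always follows the training period.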

(a) Why was the data partitioned?

Partitioning the data helps guard against overfitting and lets the modeler evaluate a model's predictive performance by measuring forecast errors on data the model has not seen. Models are trained on the earlier portion of the series, and their predictive performance is assessed on the later portion. In the souvenir sales example, the analyst uses the training set to build a forecasting model and then evaluates it by comparing the model's forecasts for the following year with the actual data in the validation set.

(b) Why did the analyst choose a 12-month validation period?

The validation period should mimic the forecast horizon, which here is the next 12 months. Additionally, too long a validation period would leave too little of the most recent data in the training set for building the model. The data frequency and the forecasting goal also need to be considered when deciding on the length of the validation period.

(c) What is the naive forecast for the validation period? (assume that you must provide forecasts for 12 months ahead)

The naive forecast is the last data point in the training set, i.e., the most recent observation. Because forecasts must be provided 12 months ahead and no 2001 actuals are available when forecasting, every month of 2001 is forecast as the December 2000 value (80722). When plotted, this non-seasonal naive forecast is a flat horizontal line. There is also a seasonal naive forecast, which repeats the value from the same month of the previous year and so accounts for the seasonal cycles in a time series.

Before deploying a forecast, I will plot the entire Souvenir Sales time series data set to see if there is seasonality.

The time series shows a seasonal pattern with a small spike in sales early in the year, followed by a slow, steady incline, and a large spike at the end of the year. This is supported by both the season plot and the monthly plot of the entire time series.

The lack of overlap in the month lines, as well as the relative straightness of the lines, confirms the presence of seasonality for the Souvenir Sales time series data set.

Because there is seasonality, I will show both naive (as required in the question) and seasonal naive forecasts for 2001. A seasonal naive forecast is the appropriate choice for this data.

2001 (Validation period) Naive Forecast for Souvenir Sales

Year    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
2001  80722  80722  80722  80722  80722  80722  80722  80722  80722  80722  80722  80722

2001 (Validation period) Seasonal Naive Forecast for Souvenir Sales

Year    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
2001   7615   9850  14558  11587   9333  13082  16733  19889  23933  25391  36025  80722
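Both forecast tables follow from simple rules, sketched below in Python. This is an illustrative sketch rather than the code used to produce the report; the function names and the `train_tail` variable are assumptions, and only the final 12 training months (year 2000, rounded) are needed for either method.

```python
def naive_forecast(train, horizon):
    """Repeat the last training observation for every future step."""
    return [train[-1]] * horizon


def seasonal_naive_forecast(train, horizon, season_length=12):
    """Repeat the value observed in the same season of the last full cycle."""
    last_season = train[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


# Year 2000, the final 12 months of the training period (rounded values).
train_tail = [7615, 9850, 14558, 11587, 9333, 13082,
              16733, 19889, 23933, 25391, 36025, 80722]

print(naive_forecast(train_tail, 12))           # twelve copies of 80722
print(seasonal_naive_forecast(train_tail, 12))  # the year-2000 values, Jan..Dec
```

The modulo in the seasonal version lets the same function produce horizons longer than one year by cycling through the last observed season.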

Plot of seasonal naive forecast for 2001 with prediction intervals.

The blue line is the forecast and the shaded grey areas are the prediction intervals.

(d) Compute the RMSE and MAPE for the naive forecasts.

Error computations for naive forecast

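For the validation period (the "Test set" row), RMSE = 56099.07 and MAPE = 290.95%. The negative ME (-50500.287) indicates that the naive forecast overpredicts on average.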

Set                   ME      RMSE        MAE         MPE       MAPE      MASE        ACF1  Theil’s U
Training set    1113.477  10460.73   5506.879   -25.27554   61.16191   1.47054  -0.1968879         NA
Test set      -50500.287  56099.07  54490.114  -287.13834  290.95049  14.55087   0.3182456   6.649124

Error computations for seasonal naive forecast

For the validation period (the "Test set" row), RMSE = 9542.346 and MAPE = 27.28%.

Set                 ME      RMSE       MAE       MPE      MAPE      MASE       ACF1  Theil’s U
Training set  3401.360  6467.818  3744.800  22.39270  25.64127  1.000000  0.4140974         NA
Test set      7828.278  9542.346  7828.278  27.27926  27.27926  2.090439  0.2264895  0.7373759
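The validation-period RMSE and MAPE can be checked by hand. The sketch below, in Python, defines the forecast error as actual minus forecast, RMSE as the square root of the mean squared error, and MAPE as the mean of |error / actual| times 100. It is an illustration with assumed helper names (`rmse`, `mape`), and it uses the rounded values printed in the tables, so its results should agree with the reported figures only approximately.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error, with error = actual - forecast."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    pct = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(pct) / len(pct)


# Rounded values from the tables: 2001 actuals and the two 2001 forecasts.
actual_2001 = [10243, 11267, 21827, 17357, 15998, 18602,
               26155, 28587, 30505, 30821, 46634, 104661]
naive_2001 = [80722] * 12
snaive_2001 = [7615, 9850, 14558, 11587, 9333, 13082,
               16733, 19889, 23933, 25391, 36025, 80722]

for name, fc in [("naive", naive_2001), ("seasonal naive", snaive_2001)]:
    print(f"{name}: RMSE = {rmse(actual_2001, fc):.2f}, "
          f"MAPE = {mape(actual_2001, fc):.2f}%")
```

Small differences from the reported values are expected if the underlying file stores sales with decimals that the printed tables round away.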

(e1) Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period).

The histogram below shows the frequency distribution of the seasonal naive forecast errors (actual minus forecast) for the validation period. All 12 errors are positive, so the seasonal naive method consistently underpredicts sales.
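As a rough text stand-in for the histogram, the seasonal naive errors can be binned directly. This is a sketch in Python using the rounded table values; the bin width of 5000 is an arbitrary assumption, not the binning used in the original plot.

```python
from collections import Counter

# Rounded values from the tables: 2001 actuals and the seasonal naive forecast.
actual_2001 = [10243, 11267, 21827, 17357, 15998, 18602,
               26155, 28587, 30505, 30821, 46634, 104661]
snaive_2001 = [7615, 9850, 14558, 11587, 9333, 13082,
               16733, 19889, 23933, 25391, 36025, 80722]

# Forecast error = actual - forecast; positive means the forecast was too low.
errors = [a - f for a, f in zip(actual_2001, snaive_2001)]

bin_width = 5000  # assumed bin width, chosen only for illustration
counts = Counter((e // bin_width) * bin_width for e in errors)

# Print one row of '#' marks per non-empty bin, lowest bin first.
for lower in sorted(counts):
    upper = lower + bin_width - 1
    print(f"{lower:>6} to {upper:>6}: {'#' * counts[lower]}")

print("all errors positive:", all(e > 0 for e in errors))  # True
```

With these bins, most errors fall between 5000 and 9999, and December stands out as the single largest error.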

(e2) Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period.

What we see in the plots below is a seasonal naive forecast that consistently underpredicts actual sales.

Plot residuals over time

Most residuals are over the zero line, further illustrating underpredicting.

Normality Plot

The distribution of error terms in the training period is not normal; we can surmise this because the line is not at a 45-degree angle from the lower left-hand corner.

Residuals from the validation period are not normally distributed either; the line is not at a 45-degree angle, and the errors are positive, indicating underprediction.

(e3) What can you say about the behavior of the naive forecasts?

Actual sales of souvenirs in 2001 are quite different from the naive forecast. This is evident in the plot “2001 Souvenir Sales: Actual Sales and Naive and Seasonal Naive Forecasts”, which shows both actual and forecast values; the red line shows the actual sales and the blue line represents the seasonal naive forecast. (The straight green line is the naive forecast, which does not take seasonality into consideration.) As we saw earlier, the time series exhibits seasonality, so a seasonal naive forecast is more appropriate. Although the seasonal naive forecast tracks actual souvenir sales more closely, it consistently underpredicts, as is evident from the four plots above. Furthermore, the errors for both the training and validation periods are not normally distributed.

The seasonal naive forecast is generally a consistent underprediction of the actual data.

(f) The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for year 2002?

The forecaster must recombine the training and validation sets and refit the chosen model to the full series before generating and deploying a forecast for 2002. The validation set contains the most recent data, which is valuable because it is likely to lead to a more accurate forecast. In addition, generating a forecast based solely on the training set forces the forecaster to forecast further into the future than she would have to with the recombined data set: to forecast 2002 from the training set alone, it would be necessary to forecast both 2001 and 2002. The further out the forecast horizon, the less accurate the forecast becomes.
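The effect of recombining can be illustrated with a short Python sketch. The analyst's chosen model is not specified, so the seasonal naive method is used here purely as a stand-in; the function and variable names are assumptions, and only the last two years of rounded data are included to keep the example short.

```python
def seasonal_naive_forecast(series, horizon, season_length=12):
    """Stand-in model: repeat the same-month value from the last full cycle."""
    last_season = series[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


# Rounded values from the tables: end of training (2000) and validation (2001).
train_tail = [7615, 9850, 14558, 11587, 9333, 13082,
              16733, 19889, 23933, 25391, 36025, 80722]
valid = [10243, 11267, 21827, 17357, 15998, 18602,
         26155, 28587, 30505, 30821, 46634, 104661]

# Recombine, then forecast 2002 only 12 steps ahead from the full series.
full = train_tail + valid
forecast_2002 = seasonal_naive_forecast(full, 12)

# Without recombining, 2002 lies 13 to 24 steps ahead of the training data.
from_train_only = seasonal_naive_forecast(train_tail, 24)[12:]

print(forecast_2002[-1])    # 104661 (Dec 2001 carried forward)
print(from_train_only[-1])  # 80722 (Dec 2000 carried forward)
```

With this stand-in, the recombined forecast for 2002 reuses the 2001 values, while the training-only forecast is stuck at 2000 levels and ignores a full year of growth.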

2. Forecasting Shampoo Sales: The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a three-year period.

If the goal is forecasting sales in future months, which of the following steps should be taken? (choose one or more)

partition the data into training and validation periods

Yes, this is a necessary step in forecasting: to find the best forecasting model, each candidate model must be fitted to the data and then tested for predictive accuracy on data it was not fitted to. This can only be done with partitioning.

examine time plots of the series and of model forecasts only for the training period

No. Plots of both the training and validation periods should be examined. A plot of the forecasts that the model, fitted on the training data, produces for the validation period can be compared with the actual data. It is important to “eyeball” how the forecasts and the actual values look when plotted together.

look at MAPE and RMSE values for the training period

No, the MAPE and RMSE for the training period are less important considerations: they measure how well the model fits the data it was built on, not how well it predicts new data.

look at MAPE and RMSE values for the validation period

Yes, this is an essential step in order to evaluate predictive performance of a forecast. The validation period is a more objective basis than the training period, and therefore these computations are important to consider in the test set.

compute naive forecasts

Yes, naive and seasonal naive forecasts provide an important baseline against which to compare a model's forecasts and errors. They often may not be the model of choice; however, they should always be run.