Final Assignment 2

R Markdown

1. Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.

Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation period containing the last 12 months of data (year 2001). She then fit a forecasting model to sales, using the training period.

Partition the data into the training and validation periods as explained above.

Training Set

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
1995	1665	2398	2841	3547	3753	3715	4350	3566	5022	6423	7601	19756
1996	2500	5198	7225	4806	5901	4951	6179	4752	5496	5835	12600	28542
1997	4717	5703	9958	5305	6492	6631	7350	8177	8573	9690	15152	34061
1998	5921	5815	12421	6370	7609	7225	8121	7979	8093	8477	17915	30114
1999	4827	6470	9639	8821	8722	10209	11277	12552	11637	13607	21822	45061
2000	7615	9850	14558	11587	9333	13082	16733	19889	23933	25391	36025	80722

Validation Set

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2001	10243	11267	21827	17357	15998	18602	26155	28587	30505	30821	46634	104661
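The partition described above can be sketched in a few lines. The original analysis appears to have been done in R; this is an independent Python illustration, the values are typed in from the rounded tables above, and the names `sales_by_year` and `partition` are hypothetical.

```python
# Monthly souvenir sales, typed in from the rounded tables above.
sales_by_year = {
    1995: [1665, 2398, 2841, 3547, 3753, 3715, 4350, 3566, 5022, 6423, 7601, 19756],
    1996: [2500, 5198, 7225, 4806, 5901, 4951, 6179, 4752, 5496, 5835, 12600, 28542],
    1997: [4717, 5703, 9958, 5305, 6492, 6631, 7350, 8177, 8573, 9690, 15152, 34061],
    1998: [5921, 5815, 12421, 6370, 7609, 7225, 8121, 7979, 8093, 8477, 17915, 30114],
    1999: [4827, 6470, 9639, 8821, 8722, 10209, 11277, 12552, 11637, 13607, 21822, 45061],
    2000: [7615, 9850, 14558, 11587, 9333, 13082, 16733, 19889, 23933, 25391, 36025, 80722],
    2001: [10243, 11267, 21827, 17357, 15998, 18602, 26155, 28587, 30505, 30821, 46634, 104661],
}


def partition(series, n_valid):
    """Split a chronologically ordered list into (training, validation).

    The last n_valid observations form the validation period; the order is
    never shuffled, because the split must respect time.
    """
    if not 0 < n_valid < len(series):
        raise ValueError("n_valid must be between 1 and len(series) - 1")
    return series[:-n_valid], series[-n_valid:]


# Flatten the yearly rows into one chronological monthly series.
series = [value for year in sorted(sales_by_year) for value in sales_by_year[year]]

train, valid = partition(series, n_valid=12)
print(len(train), "training months;", len(valid), "validation months")
```

Unlike a random train/test split, the cut is made at a single point in time, so the validation period is always the most recent data.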

(a) Why was the data partitioned?

Partitioning data helps address the problem of overfitting and allows the modeler to evaluate a model's performance by measuring forecast errors on data it has not seen. Models are trained on the earlier portion of the series, and their predictive performance is assessed on the later portion. In the souvenir sales example, the analyst uses the training set to build a forecasting model and then evaluates it by comparing the model's forecasts for the following year with the actual data in the validation set.

(b) Why did the analyst choose a 12-month validation period?

The validation period should mimic the forecast horizon: the goal is to forecast the next 12 months, so the validation period is 12 months long. Additionally, too long a validation period would leave too little of the most recent data in the training set used to build the model. The data frequency and the forecasting goal also need to be considered when deciding on the length of the validation period.

(c) What is the naive forecast for the validation period? (assume that you must provide forecasts for 12 months ahead)

The naive forecast is the last data point in the training set, i.e., the most recent observation (Dec 2000). Since a naive forecast equals the previous value, the forecast for Jan 2001 is the same as the Dec 2000 value. In this non-seasonal naive forecast, the forecast for Feb 2001 equals the forecast for Jan 2001, and this carries through the rest of the forecast period. When plotted, the naive forecast is a horizontal line. There is also a seasonal naive forecast, which accounts for the seasonal cycles in a time series by repeating the value from the same month of the previous year.

Before deploying a forecast, I will plot the entire Souvenir Sales time series data set to see if there is seasonality.

The time series shows a seasonal pattern with a small spike in sales early in the year, followed by a slow, steady incline, and a large spike at the end of the year. This is supported by both the season plot and the monthly plot of the entire time series.

The lack of overlap in the month lines, as well as the relative straightness of the lines, confirms the presence of seasonality for the Souvenir Sales time series data set.
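Since the plots themselves are not reproduced here, a rough numerical check of the same seasonal pattern can be sketched as follows. This is an illustrative Python sketch, not the plotting code used in the original analysis; it uses the rounded training values from the tables above, and `train_by_year` and `monthly_shares` are hypothetical names.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Training period only (1995-2000), rounded values from the table above.
train_by_year = {
    1995: [1665, 2398, 2841, 3547, 3753, 3715, 4350, 3566, 5022, 6423, 7601, 19756],
    1996: [2500, 5198, 7225, 4806, 5901, 4951, 6179, 4752, 5496, 5835, 12600, 28542],
    1997: [4717, 5703, 9958, 5305, 6492, 6631, 7350, 8177, 8573, 9690, 15152, 34061],
    1998: [5921, 5815, 12421, 6370, 7609, 7225, 8121, 7979, 8093, 8477, 17915, 30114],
    1999: [4827, 6470, 9639, 8821, 8722, 10209, 11277, 12552, 11637, 13607, 21822, 45061],
    2000: [7615, 9850, 14558, 11587, 9333, 13082, 16733, 19889, 23933, 25391, 36025, 80722],
}


def monthly_shares(year_values):
    """Return each month's share of that year's total sales."""
    total = sum(year_values)
    return [v / total for v in year_values]


# Average share of annual sales by calendar month, across the training years.
shares = [monthly_shares(values) for values in train_by_year.values()]
avg_share = [sum(s[m] for s in shares) / len(shares) for m in range(12)]

# Month with the highest sales in each year.
peak_months = {
    year: MONTHS[max(range(12), key=lambda m: values[m])]
    for year, values in train_by_year.items()
}

for name, share in zip(MONTHS, avg_share):
    print(f"{name}: {share:.1%} of annual sales on average")
print("Peak month by year:", peak_months)
```

Working with shares of the annual total, rather than raw sales, keeps the upward trend from masking the within-year pattern.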

Because there is seasonality, I will show both naive (as required in the question) and seasonal naive forecasts for 2001. A seasonal naive forecast is the appropriate choice for this data.

2001 (Validation period) Naive Forecast for Souvenir Sales

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2001	80722	80722	80722	80722	80722	80722	80722	80722	80722	80722	80722	80722

2001 (Validation period) Seasonal Naive Forecast for Souvenir Sales

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2001	7615	9850	14558	11587	9333	13082	16733	19889	23933	25391	36025	80722
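Both forecasts in the tables above follow simple rules, which can be sketched as below. This is a minimal Python illustration (the original analysis appears to use R's forecasting tools); only the last 12 training months are needed, and the function names are hypothetical.

```python
def naive_forecast(train, horizon):
    """Repeat the last observed value for every step of the horizon."""
    return [train[-1]] * horizon


def seasonal_naive_forecast(train, horizon, period=12):
    """Repeat the last observed season: each forecast equals the value from
    the same position in the most recent full season of the training data."""
    if len(train) < period:
        raise ValueError("need at least one full season of training data")
    last_season = train[-period:]
    return [last_season[h % period] for h in range(horizon)]


# Last 12 months of the training period (Jan-Dec 2000), rounded values.
train_2000 = [7615, 9850, 14558, 11587, 9333, 13082,
              16733, 19889, 23933, 25391, 36025, 80722]

naive_2001 = naive_forecast(train_2000, horizon=12)
snaive_2001 = seasonal_naive_forecast(train_2000, horizon=12)

print("Naive:         ", naive_2001)
print("Seasonal naive:", snaive_2001)
```

With a 12-month horizon and a 12-month season, the seasonal naive forecast for 2001 is simply the 2000 row shifted forward by one year.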

Plot of seasonal naive forecast for 2001 with prediction intervals.

The blue line is the forecast and the shaded grey areas are the prediction intervals.

(d) Compute the RMSE and MAPE for the naive forecasts.

Error computations for naive forecast

For the naive forecast in the validation period (test set row below), RMSE = 56099.07 and MAPE = 290.95%.

	ME	RMSE	MAE	MPE	MAPE	MASE	ACF1	Theil’s U
Training set	1113.477	10460.73	5506.879	-25.27554	61.16191	1.47054	-0.1968879	NA
Test set	-50500.287	56099.07	54490.114	-287.13834	290.95049	14.55087	0.3182456	6.649124

Error computations for seasonal naive forecast

For the seasonal naive forecast in the validation period (test set row below), RMSE = 9542.346 and MAPE = 27.28%.

	ME	RMSE	MAE	MPE	MAPE	MASE	ACF1	Theil’s U
Training set	3401.360	6467.818	3744.800	22.39270	25.64127	1.000000	0.4140974	NA
Test set	7828.278	9542.346	7828.278	27.27926	27.27926	2.090439	0.2264895	0.7373759
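The test-set RMSE and MAPE figures can be cross-checked by hand with the standard definitions. The following is a hedged Python sketch rather than the code that produced the tables; because it uses the rounded values printed above, its results should land close to, but not exactly on, the reported numbers.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error: sqrt of the mean of squared errors."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    pct = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(pct) / len(pct)


# Validation period (2001) and the two forecasts, rounded values from the tables.
actual_2001 = [10243, 11267, 21827, 17357, 15998, 18602,
               26155, 28587, 30505, 30821, 46634, 104661]
naive_2001 = [80722] * 12
snaive_2001 = [7615, 9850, 14558, 11587, 9333, 13082,
               16733, 19889, 23933, 25391, 36025, 80722]

for label, forecast in [("naive", naive_2001), ("seasonal naive", snaive_2001)]:
    print(f"{label}: RMSE = {rmse(actual_2001, forecast):.2f}, "
          f"MAPE = {mape(actual_2001, forecast):.2f}%")
```

Errors are defined here as actual minus forecast, so a positive error means the forecast was too low.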

(e1) Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period).

The histogram below shows the frequency distribution of the seasonal naive forecast errors in the validation period. The errors are positive, so the seasonal naive (snaive) method consistently underpredicts sales.

(e2) Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period.

What we see in the plots below is a seasonal naive forecast that consistently underpredicts actual sales.

Plot residuals over time

Most residuals lie above the zero line, further illustrating the underprediction.

Normality Plot

The error terms in the training period do not appear to be normally distributed; we can surmise this because the points in the normal probability plot do not follow a straight 45-degree line rising from the lower left-hand corner.

The residuals from the validation period are not normally distributed either; the points do not follow the 45-degree line, and the errors are positive, indicating underprediction.

(e3) What can you say about the behavior of the naive forecasts?

Actual sales of souvenirs in 2001 are quite different from the naive forecast. This is evident in the plot “2001 Souvenir Sales: Actual Sales and Naive and Seasonal Naive Forecasts”, which shows both actual and forecast values; the red line shows the actual sales and the blue line represents the seasonal naive forecast. (The straight green line is the naive forecast, which does not take seasonality into consideration.) As we saw from the earlier plots, the time series exhibits seasonality, so a seasonal naive forecast is more appropriate. Although the seasonal naive forecast tracks actual souvenir sales more closely, it consistently underpredicts, as is evident from the four plots above. Furthermore, the errors for both the training and validation periods are not normally distributed.

Overall, the seasonal naive forecast consistently underpredicts the actual data.
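The underprediction can also be checked numerically from the validation-period errors. The sketch below is an illustrative Python example using the rounded table values; the 5000-unit bin width is an arbitrary assumption chosen only to produce a readable text histogram.

```python
from collections import Counter

# Validation period (2001) actuals and the seasonal naive forecast (2000 values).
actual_2001 = [10243, 11267, 21827, 17357, 15998, 18602,
               26155, 28587, 30505, 30821, 46634, 104661]
snaive_2001 = [7615, 9850, 14558, 11587, 9333, 13082,
               16733, 19889, 23933, 25391, 36025, 80722]

# Forecast error = actual - forecast; positive means the forecast was too low.
errors = [a - f for a, f in zip(actual_2001, snaive_2001)]
n_under = sum(1 for e in errors if e > 0)
mean_error = sum(errors) / len(errors)

# Crude text histogram: count errors in bins of width BIN_WIDTH (assumed value).
BIN_WIDTH = 5000
bin_counts = Counter(e // BIN_WIDTH for e in errors)

print(f"{n_under} of {len(errors)} months underpredicted; mean error = {mean_error:.1f}")
for b in sorted(bin_counts):
    lo, hi = b * BIN_WIDTH, (b + 1) * BIN_WIDTH
    print(f"[{lo:>6}, {hi:>6}): {'#' * bin_counts[b]}")
```

A mean error well above zero, together with every monthly error sharing the same sign, points to systematic bias rather than random scatter; here that is consistent with the sales growth from 2000 to 2001 that a seasonal naive forecast cannot capture.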

(f) The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for year 2002?

The forecaster must recombine the training and validation sets and refit the model to the full data set before generating and deploying a forecast for 2002. The validation set contains the most recent data, which is valuable because it is likely to lead to a more accurate forecast. In addition, generating a forecast based solely on the training set forces the forecaster to forecast further into the future than she would with the recombined data set: to forecast 2002 from the training set alone, it would be necessary to forecast both 2001 and 2002. The further out the forecast horizon, the less accurate the forecast becomes.
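The effect of recombining can be illustrated with a short sketch. The analyst's chosen model is not specified, so the seasonal naive method is used here purely as a stand-in; the Python code and the name `seasonal_naive_forecast` are assumptions made for illustration, with rounded values taken from the tables above.

```python
def seasonal_naive_forecast(history, horizon, period=12):
    """Stand-in model: repeat the most recent full season of the history."""
    last_season = history[-period:]
    return [last_season[h % period] for h in range(horizon)]


# Last year of the training period (2000) and the validation period (2001).
train_2000 = [7615, 9850, 14558, 11587, 9333, 13082,
              16733, 19889, 23933, 25391, 36025, 80722]
valid_2001 = [10243, 11267, 21827, 17357, 15998, 18602,
              26155, 28587, 30505, 30821, 46634, 104661]

# Option 1: recombine training and validation data, then forecast 12 months ahead.
full_history = train_2000 + valid_2001
forecast_2002_full = seasonal_naive_forecast(full_history, horizon=12)

# Option 2: keep only the training data and forecast 24 months ahead;
# months 13-24 of that forecast correspond to 2002.
forecast_2002_train_only = seasonal_naive_forecast(train_2000, horizon=24)[12:]

print("2002 forecast from full data:    ", forecast_2002_full)
print("2002 forecast from training only:", forecast_2002_train_only)
```

With this stand-in, the training-only forecast for 2002 is still just the 2000 values, two years stale, while the recombined forecast starts from the 2001 values.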

2. Forecasting Shampoo Sales: The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a three-year period.

If the goal is forecasting sales in future months, which of the following steps should be taken? (choose one or more)

partition the data into training and validation periods

Yes. This is a necessary step in forecasting: to find the best forecasting model, each candidate model must be fitted to the data and then tested for predictive accuracy on data that was not used for fitting. This can only be done by partitioning.

examine time plots of the series and of model forecasts only for the training period

No. Plots of both the training and validation periods should be examined. A plot of the forecasts for the validation period, generated from the training data, can be compared with the actual data. It is important to “eyeball” how the forecasts and the actual data look when plotted together.

look at MAPE and RMSE values for the training period

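No. The MAPE and RMSE for the training period are less important considerations: they measure how well the model fits the data it was trained on, not how well it forecasts new data.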

look at MAPE and RMSE values for the validation period

Yes, this is an essential step in evaluating the predictive performance of a forecast. The validation period provides a more objective basis than the training period, so these measures are important to consider for the test set.

compute naive forecasts

Yes, naive and seasonal naive forecasts provide an important baseline against which to compare a model's forecasts and errors. They often will not be the model of choice, but they should always be run.