Assignment

Due Monday, February 12th, 2018 at 11:59PM: Problems 1 and 2 from Chapter 3 of Shmueli

Souvenir Sales of Shop X in Queensland, Australia

The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia between 1995 and 2001.

Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation period containing the last 12 months of data (year 2001). She then fit a forecasting model to sales, using the training period.

Partition the data into the training and validation periods as explained above.

Table 1.1 - Training and Validation Periods

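#Load the forecast package (naive, snaive, tslm, accuracy, ggseasonplot) and pander (table output)
library(forecast)
library(pander)

#Read the SouvenirSales.csv file, which is then partitioned into Training and Validation Periods
Souvenir <- read.csv("SouvenirSales.csv")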
Souvenir.ts <- ts(Souvenir[,2], start = c(1995,1), frequency = 12)
nValid <- 12
nTrain <- length(Souvenir.ts) - nValid
Train.ts <- window(Souvenir.ts, start = c(1995,1), end = c(1995, nTrain))
Valid.ts <- window(Souvenir.ts, start = c(1995, nTrain + 1), end = c(1995, nTrain + nValid))

plot(Train.ts, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlim = c(1995, 2002), xlab = "Year", bty = "l")
lines(Souvenir.ts, col = "black")
lines(Valid.ts, col = "orange", lwd = 2)
legend(x = "topleft", legend = c("Training Period", "Validation Period"), col = c("black", "orange"), lty = c(1, 1), bty = "n")
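
The same partition can be sketched in base R without the CSV file. This is a minimal sketch, assuming the monthly values are typed in exactly as printed in Table 1.2 (the original file may carry more decimal places); the names sales.ts, train.ts and valid.ts are hypothetical. It passes calendar dates to window(), which makes the split point explicit:

```r
# Monthly sales, January 1995 to December 2001, copied from Table 1.2
sales <- c(
  1665, 2398, 2841, 3547, 3753, 3715, 4350, 3566, 5022, 6423, 7601, 19756,
  2500, 5198, 7225, 4806, 5901, 4951, 6179, 4752, 5496, 5835, 12600, 28542,
  4717, 5703, 9958, 5305, 6492, 6631, 7350, 8177, 8573, 9690, 15152, 34061,
  5921, 5815, 12421, 6370, 7609, 7225, 8121, 7979, 8093, 8477, 17915, 30114,
  4827, 6470, 9639, 8821, 8722, 10209, 11277, 12552, 11637, 13607, 21822, 45061,
  7615, 9850, 14558, 11587, 9333, 13082, 16733, 19889, 23933, 25391, 36025, 80722,
  10243, 11267, 21827, 17357, 15998, 18602, 26155, 28587, 30505, 30821, 46634, 104661
)
sales.ts <- ts(sales, start = c(1995, 1), frequency = 12)

# Split by calendar date: 1995-2000 for training, 2001 for validation
train.ts <- window(sales.ts, end = c(2000, 12))
valid.ts <- window(sales.ts, start = c(2001, 1))

length(train.ts)  # 72 months
length(valid.ts)  # 12 months
```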

Table 1.2 - Raw Data for the Training and Validation Periods of Souvenir Sales by Shop X in Queensland, Australia

pander(Train.ts, caption = "Training Period : January 1995 to December 2000", split.table = Inf)

Training Period : January 1995 to December 2000
	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
1995	1665	2398	2841	3547	3753	3715	4350	3566	5022	6423	7601	19756
1996	2500	5198	7225	4806	5901	4951	6179	4752	5496	5835	12600	28542
1997	4717	5703	9958	5305	6492	6631	7350	8177	8573	9690	15152	34061
1998	5921	5815	12421	6370	7609	7225	8121	7979	8093	8477	17915	30114
1999	4827	6470	9639	8821	8722	10209	11277	12552	11637	13607	21822	45061
2000	7615	9850	14558	11587	9333	13082	16733	19889	23933	25391	36025	80722

pander(Valid.ts, caption = "Validation Period : January 2001 to December 2001", split.table = Inf)

Validation Period : January 2001 to December 2001
	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2001	10243	11267	21827	17357	15998	18602	26155	28587	30505	30821	46634	104661

(a) Why was the data partitioned?

The data was partitioned so that we could fit a forecasting model on a “training period” (in our case, January 1995 to December 2000, represented above by the black line) and then test it against a separate “validation period” (here, January 2001 to December 2001, the orange line) to gauge how well it performs on data it has not seen.

(b) Why did the analyst choose a 12-month validation period?

To forecast the 12-month period representing the year 2002, the analyst chose a validation period of the same length: the 12 months of 2001. The validation period should be at least as long as the forecast horizon, so that the model is judged over a comparable span. If the validation period were extended much beyond 12 months, the training period would lose its most recent data, and the forecast models would be fit to dated information.

(c) What is the naive forecast for the validation period? (assume that you must provide forecasts for 12 months ahead)

Table 1.3 - Naive Forecast of the Validation Period

#Plot the naive forecast

Souvenir.naive <- naive(Train.ts, h = 12)
plot(Souvenir.naive, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlim = c(1995, 2002), xlab = "Year", bty = "l")

However, the data set shows clear seasonality in souvenir sales: sales tend to spike at the tail end of each year. To confirm the seasonality, we plotted the training period by season below.

Table 1.4 - Seasonal Plot of the Training Period

#Plot the ggseasonplot forecast

ggseasonplot(Train.ts, ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlab = "Year", bty = "l")

It is evident from the ggseasonplot that sales of souvenirs by Shop X are seasonal, peaking in November and December. Because sales drop sharply from December to January, the naive forecast, which simply repeats the December 2000 value, is rendered useless: actual sales in the first half of the year would inevitably fall outside its 80% and 95% prediction intervals. To adjust for that, we will attempt a Seasonal Naive forecast instead.

Table 1.5 - Seasonal Naive Forecast of the Validation Period

#Plot the seasonal naive forecast

Train.ss <- snaive(Train.ts, h = nValid)
plot(Train.ss, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlim = c(1995, 2002), xlab = "Year", bty = "l")
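
To make the difference between the two forecasts concrete, both can be written out by hand in base R. This is a minimal sketch, assuming only the 2000 values printed in Table 1.2; the names train.2000, naive.by.hand and snaive.by.hand are hypothetical and are not produced by the forecast package:

```r
# Sales for 2000, the last year of the training period (from Table 1.2)
train.2000 <- ts(c(7615, 9850, 14558, 11587, 9333, 13082,
                   16733, 19889, 23933, 25391, 36025, 80722),
                 start = c(2000, 1), frequency = 12)
h <- 12  # forecast horizon in months

# Naive: repeat the last observed value (December 2000) for every month
last.value <- train.2000[length(train.2000)]
naive.by.hand <- ts(rep(last.value, h), start = c(2001, 1), frequency = 12)

# Seasonal naive: repeat the value from the same month one year earlier
snaive.by.hand <- ts(as.numeric(train.2000), start = c(2001, 1), frequency = 12)

cbind(naive = naive.by.hand, seasonal.naive = snaive.by.hand)
```

The naive column is a flat line at the December level, while the seasonal naive column carries the whole yearly pattern forward.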

(d) Compute the RMSE and MAPE for the naive forecasts.

The table below reports the Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) of the naive forecast for the Training set and the Test set (the validation period).

Table 1.6 - RMSE and MAPE computations

#Calculate the RMSE and MAPE results

naive.forecast <- naive(Train.ts, h = 12)
pander(accuracy(naive.forecast, Souvenir.ts), caption = "RMSE and MAPE computations", split.table = Inf)

RMSE and MAPE computations
	ME	RMSE	MAE	MPE	MAPE	MASE	ACF1	Theil’s U
Training set	1113	10461	5507	-25.28	61.16	1.471	-0.1969	NA
Test set	-50500	56099	54490	-287.1	291	14.55	0.3182	6.649
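
The Test set figures can be reproduced by hand from the definitions of RMSE and MAPE. This is a minimal base-R sketch, assuming the 2001 values as printed in Table 1.2 and a naive forecast equal to December 2000 sales (80722); the variable names are hypothetical:

```r
# Actual 2001 sales (validation period, from Table 1.2)
actual <- c(10243, 11267, 21827, 17357, 15998, 18602,
            26155, 28587, 30505, 30821, 46634, 104661)

# Naive forecast: December 2000 sales repeated for all 12 months
forecast.naive <- rep(80722, length(actual))

# Forecast error = actual - forecast
err <- actual - forecast.naive

rmse <- sqrt(mean(err^2))              # root mean squared error
mape <- 100 * mean(abs(err / actual))  # mean absolute percentage error

round(c(RMSE = rmse, MAPE = mape), 1)
```

These values should agree with the Test set row above (RMSE of about 56099 and MAPE of about 291), up to rounding in the printed data.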

(e) Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period). Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts?

Table 1.7 - Histogram of Forecast Errors from the Naive Forecasts

#Generate a histogram of Forecast Errors from the Naive Forecasts (validation period)
#Forecast error = actual sales in 2001 minus the naive forecast
naive.errors <- Valid.ts - naive.forecast$mean
hist(naive.errors, ylab = "Frequency", xlab = "Forecast Error", bty = "l", main = "Histogram of Forecast Errors")

Table 1.8 - Plot of Realized Sales in 2001 to Seasonal Naive and Naive Forecasts

plot(Valid.ts, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Realized Sales as compared to Seasonal Naive and Naive Forecasts", xlab = "2001", bty = "l", col = "orange", lwd = 3, xaxt = "n")
axis(1, at = seq(2001, (2002-1/12), (1/12)), labels = c("Jan.", "Feb.", "Mar.", "Apr.", "May", "Jun.", "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."))
lines(naive.forecast$mean)
lines(Train.ss$mean, col = "blue", lwd = 2)
legend(x = "topleft", legend = c("Realized Sales", "Seasonal Naive Forecast", "Naive Forecast"), col = c("orange", "blue", "black"), lty = c(1, 1, 1), lwd = c(3, 2, 1), bty = "n")

As uncovered earlier in Table 1.4, sales at Souvenir Shop X are seasonal, with a significant lift occurring every year from October to December. The Naive Forecast ignores this pattern: it is a flat line at the December 2000 level, so it overpredicts sales in every month of 2001 except December. The Seasonal Naive Forecast tracks the Realized Sales for 2001 much more closely, as the two lines plotted in Table 1.8 show. Where it falls short is that it consistently underpredicts sales throughout the year.
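
The direction of the errors can be checked directly from the printed data. This is a minimal base-R sketch, assuming the 2000 and 2001 values from Table 1.2 and the convention that error = actual - forecast, so a positive error means the forecast was too low; the variable names are hypothetical:

```r
# Sales for 2000 (last training year) and 2001 (validation year), from Table 1.2
sales.2000 <- c(7615, 9850, 14558, 11587, 9333, 13082,
                16733, 19889, 23933, 25391, 36025, 80722)
actual.2001 <- c(10243, 11267, 21827, 17357, 15998, 18602,
                 26155, 28587, 30505, 30821, 46634, 104661)

naive.fc <- rep(sales.2000[12], 12)  # December 2000 repeated
snaive.fc <- sales.2000              # same month, one year earlier

# Error = actual - forecast; positive means underprediction
naive.err <- actual.2001 - naive.fc
snaive.err <- actual.2001 - snaive.fc

data.frame(month = month.abb, naive = naive.err, seasonal.naive = snaive.err)

all(snaive.err > 0)  # TRUE: seasonal naive is too low in every month
sum(naive.err < 0)   # 11: naive is too high in every month except December
```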

(f) The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for year 2002?

Now that the analyst has identified a model that performs satisfactorily against the validation set, she must recombine the Training Period and Validation Period data sets and re-estimate the model on the full Souvenir data set. She then uses the refitted model to forecast the upcoming year (2002). By recombining the two periods, she is able to forecast the upcoming year more accurately, because the model now uses the most recent information (the 2001 data) rather than forecasting off of dated information.
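
The recombining step can be sketched in base R. The problem does not say which model the analyst selected, so this sketch uses a seasonal naive forecast purely as a stand-in (an assumption), and it keeps only the years 2000 and 2001 from Table 1.2 for brevity; the names train.tail, valid, full and fc.2002 are hypothetical:

```r
# Last training year (2000) and the validation year (2001), from Table 1.2;
# earlier years are omitted only to keep the sketch short
train.tail <- ts(c(7615, 9850, 14558, 11587, 9333, 13082,
                   16733, 19889, 23933, 25391, 36025, 80722),
                 start = c(2000, 1), frequency = 12)
valid <- ts(c(10243, 11267, 21827, 17357, 15998, 18602,
              26155, 28587, 30505, 30821, 46634, 104661),
            start = c(2001, 1), frequency = 12)

# Recombine the two periods into one series ending December 2001
full <- ts(c(train.tail, valid), start = start(train.tail), frequency = 12)

# Stand-in model (an assumption): seasonal naive on the full series,
# so the 2002 forecast repeats the 2001 values rather than the 2000 values
fc.2002 <- ts(as.numeric(window(full, start = c(2001, 1))),
              start = c(2002, 1), frequency = 12)
fc.2002
```

Had the model been left fitted to the training period alone, the same stand-in would have projected the 2000 values forward and ignored the most recent year entirely.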

Forecasting Shampoo Sales

The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a three-year period.

If the goal is forecasting sales in the future months, which of the following steps should be taken?

partition the data into training and validation periods YES!
examine time plots of the series and of model forecasts only for the training period
look at MAPE and RMSE values for the training period
look at MAPE and RMSE values for the validation period YES!
compute naive forecasts YES!

- partition the data into training and validation periods

As with the prior problem set, we partition the data so that we can build and evaluate a forecast model. The model is constructed using the Training Period and tested against the Validation Period. Before applying a forecast model to real-world data, we have to confirm that it yields accurate results on data it was not fit to.

- look at MAPE and RMSE values for the validation period

To judge whether the forecast model yields accurate results, one must look at the Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE) for the validation period, as shown in the previous problem in Table 1.6. While MAPE and RMSE may also be calculated using the Training Period, doing so only measures the ‘goodness-of-fit’, or how closely the model fits the data it was built on, not how well it forecasts new data.

- compute naive forecasts

Naive forecasts should always be computed as a baseline. Not only are they easy to execute and understand, they can also be surprisingly hard to beat, so any more sophisticated model should be judged against them.

Week 3 : Beginning February 5th, 2018

Pete Wiernusz

2/11/2018