Assignment

Due Monday, February 12th, 2018 at 11:59PM: Problems 1 and 2 from Chapter 3 of Shmueli




Souvenir Sales of Shop X in Queensland, Australia

The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia between 1995 and 2001.

Back in 2001, the store wanted to use the data to forecase sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation period containing the last 12 months of data (year 2001). She then fit a forecasting model to sales, using the training period.

Partition the data into the training and validation periods as explained above.

Table 1.1 - Training and Validation Periods
#Upload of the SouvenirSales.csv file that was then partitioned for Training and Validation Periods

Souvenir <- read.csv("SouvenirSales.csv")
Souvenir.ts <- ts(Souvenir[,2], start = c(1995,1), frequency = 12)
nValid <- 12
nTrain <- length(Souvenir.ts) - nValid
Train.ts <- window(Souvenir.ts, start = c(1995,1), end = c(1995, nTrain))
Valid.ts <- window(Souvenir.ts, start = c(1995, nTrain + 1), end = c(1995, nTrain + nValid))

plot(Train.ts, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlim = c(1995, 2002), xlab = "Year", bty = "l")
lines(Souvenir.ts, col = "black")
lines(Valid.ts, col = "orange", lwd = 2)
legend(x = "topleft", legend = c("Training Period", "Validation Period"), col = c("black", "orange"), lty = c(1, 1), bty = "n")

Table 1.2 - Raw Data for the Training and Validation Periods of Souvenir Sales by Shop X in Queensland, Australia
pander(Train.ts, caption = "Training Period : January 1995 to December 2000", split.table = Inf)
Training Period : January 1995 to December 2000
  Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1995 1665 2398 2841 3547 3753 3715 4350 3566 5022 6423 7601 19756
1996 2500 5198 7225 4806 5901 4951 6179 4752 5496 5835 12600 28542
1997 4717 5703 9958 5305 6492 6631 7350 8177 8573 9690 15152 34061
1998 5921 5815 12421 6370 7609 7225 8121 7979 8093 8477 17915 30114
1999 4827 6470 9639 8821 8722 10209 11277 12552 11637 13607 21822 45061
2000 7615 9850 14558 11587 9333 13082 16733 19889 23933 25391 36025 80722
pander(Valid.ts, caption = "Validation Period : January 2001 to December 2001", split.table = Inf)
Validation Period : January 2001 to December 2001
  Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2001 10243 11267 21827 17357 15998 18602 26155 28587 30505 30821 46634 104661



(a) Why was the data partitioned?

The data was partitioned so that we could create a forecast model using a “training period” (in our cast the Training Period is from January 1995 to Decembe 2001 — represented above by the black line) which is later tested against another period, a “validation period” (the Validation Period for this particular problem is January 2001 to December 2001 — the orange line), to gauge its performance.



(b) Why did the analyst choose a 12-month validation period?

To forecast the 12-month period representing the year 2002, The analyst chose the prior 12-month validation period of 2001 (at least). If the chosen validation period were extended beyond the 12-month period then we would ultimately run the risk of our training period containing less recent information and forecast models would be created using dated informational points.



(c) What is the naive forecast for the validation period? (assume that you must provide forecasts for 12 months ahead)

Table 1.3 - Naive Forecast of the Validation Period
#Plot the naive forecast

Souvenir.naive <- naive(Train.ts, h = 12)
plot(Souvenir.naive, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlim = c(1995, 2002), xlab = "Year", bty = "l")

However, it is apparent from the data set that there is some seasonality to the sales of souvenirs as spikes in sales have a tendency to occur during the tail end of each year. To confirm the seasonality, we plotted the appriopriate information in the table below.

Table 1.4 - Seasonal Plot of the Training Period
#Plot the ggseasonplot forecast

ggseasonplot(Train.ts, ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlab = "Year", bty = "l")

It is evident from the ggseasonplot that there is some seasonality to the sales of souvenirs by Shop X during November and December. Given the immediate decline month-over-month from December to January, our previous naive forecasting model would render useless as the first half of the year would inevitably fall outside 80% and 95% ranges. To adjust for that, we will attempt a Seasonal Naive forecast instead.

Table 1.5 - Seasonal Naive Forecast of the Validation Period
#Plot the seasonal naive forecast

Train.ss <- snaive(Train.ts, h = 12 * frequency(12))
plot(Train.ss, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlim = c(1995, 2002), xlab = "Year", bty = "l")



(d) Compute the RMSE and MAPE for the naive forecasts.

Located in the Table below are the Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) results for the Training set and Test set.

Table 1.6 - RMSE and MAPE computations
#Calculate the RMSE and MAPE results

naive.forecast <- naive(Train.ts, h = 12)
pander(accuracy(naive.forecast, Souvenir.ts), caption = "RMSE and MAPE computations", split.table = Inf)
RMSE and MAPE computations
  ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set 1113 10461 5507 -25.28 61.16 1.471 -0.1969 NA
Test set -50500 56099 54490 -287.1 291 14.55 0.3182 6.649



(e) Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period). Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts?

Table 1.7 - Histogram of Forecast Errors from the Naive Forecasts
#Generate a histogram of Forecast Errots from the Naive Forecasts
Souvenir.lm <- tslm(Train.ts ~ trend + I(trend^2))
Souvenir.lm.pred <- forecast(Souvenir.lm, h = nValid, level = 0)
hist(Souvenir.lm.pred$residuals, ylab = "Frequency", xlab = "Forecast Error", bty = "l", main = "Histogram of Forecast Errors", ylim = c(0, 50))

Table 1.8 - Plot of Realized Sales in 2001 to Seasonal Naive and Naive Forecasts
plot(Valid.ts, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Realized Sales as compared to Seasonal Naive and Naive Forecasts", xlab = "2001", bty = "l", col = "orange", lwd = 3, xaxt = "n")
axis(1, at = seq(2001, (2002-1/12), (1/12)), labels = c("Jan.", "Feb.", "Mar.", "Apr.", "May", "Jun.", "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."))
lines(naive.forecast$mean)
lines(Train.ss$mean, col = "blue", lwd = 2)
legend(x = "topleft", legend = c("Realized Sales", "Seasonal Naive Forecast", "Naive Forecast"), col = c("orange", "blue", "black"), lty = c(1, 1), bty = "n")

As uncovered earlier in Table 1.4, there is some seasonality to the sales of souvenirs from the Souvenir Shop X throughout the years. With a significant lift occuring every year from October to December. Therefore, the Naive Forecast becomes inefficient for comparison purposes. However, in comparing the Realized Sales for the Souvenir Shop in 2001 to the Seasonal Naive Forecast we can see that the two lines plotted in Table 1.8 are closer together. Where the forecast unfortunately fails is that it consistently underpredicts sales throughout the year.



(f) The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for year 2002?

Now that the analyst has identified a model that yields a satisfactory performance against the validation set, she must recombine the Training Period and Validation Period data sets. At which point, she will utilize the full Souvenir data set to forecast for the upcoming year (2002). By recombining the two period sets, she is able to more accurately the forecast the upcoming year as she will be using the most recent information — rather than forecasting off of dated information.













Forecasting Shampoo Sales

The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a three-year period.

If the goal is forecasting sales in the future months, which of the following steps should be taken?
  • partition the data into training and validation periods YES!
  • examine time plots of the series and of model forecasts only for the training period
  • look at MAPE and RMSE values for the training period
  • look at MAPE and RMSE values for the validation period YES!
  • compute native forecasts YES!



- partition the data into training and validation periods

As with the prior problem set, we partitioned the data so that we could create a forecast model. This model was constructured using the Training Period, and tested against the Validatoin Period. In order to properly test the forecast model before simply applying it to real-world data, we had to ensure that it yielded accurate results.



- look at MAPE and RMSE values for the validation period

In order to effectively identify whether the forecast model can yield accurate results one must look at the Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE) results, as shown in the previous problem in Table 1.6. While MAPE and RMSE may be calculated using the Training Period, doing so will only measure the ‘goodness-of-fit’ or how closely the model fits the training period.



- compute naive forecasts

Naive forecasts should always be considered as a baseline. Besides, not only are they easy to execute and understand, they may be some of the most accurate models as well!