Due Monday, February 12th, 2018 at 11:59PM: Problems 1 and 2 from Chapter 3 of Shmueli
The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia between 1995 and 2001.
Back in 2001, the store wanted to use the data to forecase sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation period containing the last 12 months of data (year 2001). She then fit a forecasting model to sales, using the training period.
Partition the data into the training and validation periods as explained above.
#Upload of the SouvenirSales.csv file that was then partitioned for Training and Validation Periods
Souvenir <- read.csv("SouvenirSales.csv")
Souvenir.ts <- ts(Souvenir[,2], start = c(1995,1), frequency = 12)
nValid <- 12
nTrain <- length(Souvenir.ts) - nValid
Train.ts <- window(Souvenir.ts, start = c(1995,1), end = c(1995, nTrain))
Valid.ts <- window(Souvenir.ts, start = c(1995, nTrain + 1), end = c(1995, nTrain + nValid))
plot(Train.ts, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlim = c(1995, 2002), xlab = "Year", bty = "l")
lines(Souvenir.ts, col = "black")
lines(Valid.ts, col = "orange", lwd = 2)
legend(x = "topleft", legend = c("Training Period", "Validation Period"), col = c("black", "orange"), lty = c(1, 1), bty = "n")
pander(Train.ts, caption = "Training Period : January 1995 to December 2000", split.table = Inf)
| Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1995 | 1665 | 2398 | 2841 | 3547 | 3753 | 3715 | 4350 | 3566 | 5022 | 6423 | 7601 | 19756 |
| 1996 | 2500 | 5198 | 7225 | 4806 | 5901 | 4951 | 6179 | 4752 | 5496 | 5835 | 12600 | 28542 |
| 1997 | 4717 | 5703 | 9958 | 5305 | 6492 | 6631 | 7350 | 8177 | 8573 | 9690 | 15152 | 34061 |
| 1998 | 5921 | 5815 | 12421 | 6370 | 7609 | 7225 | 8121 | 7979 | 8093 | 8477 | 17915 | 30114 |
| 1999 | 4827 | 6470 | 9639 | 8821 | 8722 | 10209 | 11277 | 12552 | 11637 | 13607 | 21822 | 45061 |
| 2000 | 7615 | 9850 | 14558 | 11587 | 9333 | 13082 | 16733 | 19889 | 23933 | 25391 | 36025 | 80722 |
pander(Valid.ts, caption = "Validation Period : January 2001 to December 2001", split.table = Inf)
| Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 | 10243 | 11267 | 21827 | 17357 | 15998 | 18602 | 26155 | 28587 | 30505 | 30821 | 46634 | 104661 |
The data was partitioned so that we could create a forecast model using a “training period” (in our cast the Training Period is from January 1995 to Decembe 2001 — represented above by the black line) which is later tested against another period, a “validation period” (the Validation Period for this particular problem is January 2001 to December 2001 — the orange line), to gauge its performance.
To forecast the 12-month period representing the year 2002, The analyst chose the prior 12-month validation period of 2001 (at least). If the chosen validation period were extended beyond the 12-month period then we would ultimately run the risk of our training period containing less recent information and forecast models would be created using dated informational points.
#Plot the naive forecast
Souvenir.naive <- naive(Train.ts, h = 12)
plot(Souvenir.naive, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlim = c(1995, 2002), xlab = "Year", bty = "l")
However, it is apparent from the data set that there is some seasonality to the sales of souvenirs as spikes in sales have a tendency to occur during the tail end of each year. To confirm the seasonality, we plotted the appriopriate information in the table below.
#Plot the ggseasonplot forecast
ggseasonplot(Train.ts, ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlab = "Year", bty = "l")
It is evident from the ggseasonplot that there is some seasonality to the sales of souvenirs by Shop X during November and December. Given the immediate decline month-over-month from December to January, our previous naive forecasting model would render useless as the first half of the year would inevitably fall outside 80% and 95% ranges. To adjust for that, we will attempt a Seasonal Naive forecast instead.
#Plot the seasonal naive forecast
Train.ss <- snaive(Train.ts, h = 12 * frequency(12))
plot(Train.ss, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Souvenir Sales by Shop X in Queensland, Australia", xlim = c(1995, 2002), xlab = "Year", bty = "l")
Located in the Table below are the Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) results for the Training set and Test set.
#Calculate the RMSE and MAPE results
naive.forecast <- naive(Train.ts, h = 12)
pander(accuracy(naive.forecast, Souvenir.ts), caption = "RMSE and MAPE computations", split.table = Inf)
| ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U | |
|---|---|---|---|---|---|---|---|---|
| Training set | 1113 | 10461 | 5507 | -25.28 | 61.16 | 1.471 | -0.1969 | NA |
| Test set | -50500 | 56099 | 54490 | -287.1 | 291 | 14.55 | 0.3182 | 6.649 |
#Generate a histogram of Forecast Errots from the Naive Forecasts
Souvenir.lm <- tslm(Train.ts ~ trend + I(trend^2))
Souvenir.lm.pred <- forecast(Souvenir.lm, h = nValid, level = 0)
hist(Souvenir.lm.pred$residuals, ylab = "Frequency", xlab = "Forecast Error", bty = "l", main = "Histogram of Forecast Errors", ylim = c(0, 50))
plot(Valid.ts, ylim = c(0, 125000), ylab = "Sales of Souvenirs", main = "Realized Sales as compared to Seasonal Naive and Naive Forecasts", xlab = "2001", bty = "l", col = "orange", lwd = 3, xaxt = "n")
axis(1, at = seq(2001, (2002-1/12), (1/12)), labels = c("Jan.", "Feb.", "Mar.", "Apr.", "May", "Jun.", "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."))
lines(naive.forecast$mean)
lines(Train.ss$mean, col = "blue", lwd = 2)
legend(x = "topleft", legend = c("Realized Sales", "Seasonal Naive Forecast", "Naive Forecast"), col = c("orange", "blue", "black"), lty = c(1, 1), bty = "n")
As uncovered earlier in Table 1.4, there is some seasonality to the sales of souvenirs from the Souvenir Shop X throughout the years. With a significant lift occuring every year from October to December. Therefore, the Naive Forecast becomes inefficient for comparison purposes. However, in comparing the Realized Sales for the Souvenir Shop in 2001 to the Seasonal Naive Forecast we can see that the two lines plotted in Table 1.8 are closer together. Where the forecast unfortunately fails is that it consistently underpredicts sales throughout the year.
Now that the analyst has identified a model that yields a satisfactory performance against the validation set, she must recombine the Training Period and Validation Period data sets. At which point, she will utilize the full Souvenir data set to forecast for the upcoming year (2002). By recombining the two period sets, she is able to more accurately the forecast the upcoming year as she will be using the most recent information — rather than forecasting off of dated information.
The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a three-year period.
As with the prior problem set, we partitioned the data so that we could create a forecast model. This model was constructured using the Training Period, and tested against the Validatoin Period. In order to properly test the forecast model before simply applying it to real-world data, we had to ensure that it yielded accurate results.
In order to effectively identify whether the forecast model can yield accurate results one must look at the Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE) results, as shown in the previous problem in Table 1.6. While MAPE and RMSE may be calculated using the Training Period, doing so will only measure the ‘goodness-of-fit’ or how closely the model fits the training period.
Naive forecasts should always be considered as a baseline. Besides, not only are they easy to execute and understand, they may be some of the most accurate models as well!