Question 1

First we partition the data into a training set and a validation (testing) set:

library(readr)
library(forecast)

Souvenir.Data <- read_csv('data/SouvenirSales.csv')
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Sales = col_double()
## )
Souvenir.Data <- ts(Souvenir.Data[,2], 
                    start = c(1995, 1), end = c(2001, 12),
                    frequency = 12)
# Training partition: January 1995 - December 2000
Souv.Train <- window(Souvenir.Data,
                     start = c(1995, 1), end = c(2000, 12))
# Validation partition: January 2001 - December 2001
Souv.Valid <- window(Souvenir.Data,
                     start = c(2001, 1), end = c(2001, 12))


A) Why was the data partitioned?

We need to break the data into two partitions so that the model can be trained (fitted) on the earlier portion of the time series (January 1995 - December 2000) and then tested on the most recent portion (January 2001 - December 2001). It is important to separate the testing data from the training data because we need to evaluate the model on data it has never seen or been fitted to; this gives us an honest picture of how well the model actually predicts new values (new monthly souvenir sales).


B) Why did the analyst choose a 12-month validation period?

We had to use an entire year for our testing partition because that is our forecast horizon. The goal is to create a model that predicts the next year's monthly souvenir sales, so to properly test how well the model performs when predicting a full year ahead, we need at least an entire year of testing data.


C) What is the naive forecast for the validation period?

# Naive forecast over the 12-month validation horizon
naive.fc <- naive(Souv.Train, h = 12)
plot(Souvenir.Data,
     main = "Souvenir Sales Time Series",
     ylab = "Monthly Sales",
     xlab = 'Year',
     lwd = 2)
abline(v = 2001, lty = 2)                               # training/validation split
lines(naive.fc$fitted, lwd = 2, lty = 2, col = 'blue')  # fitted values (training)
lines(naive.fc$mean, lwd = 2, lty = 2, col = 'red')     # forecasts (validation)
legend("topleft",
       legend = c('Observed Sales Data',
                  'Fitted Sales',
                  'Predicted Sales'),
       lty = c(1,2,2),
       lwd = c(2),
       col = c('black', 'blue', 'red'))

The plot above shows the observed monthly souvenir sales data as a solid black line, the fitted monthly sales (training partition) using the naive model as a dashed blue line, and the predicted monthly sales (testing partition) as a dashed red line. The vertical dashed black line shows the demarcation between the training and testing partitions.

The actual predicted values for year 2001 were:

naive.fc$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001 80721.71 80721.71 80721.71 80721.71 80721.71 80721.71 80721.71
##           Aug      Sep      Oct      Nov      Dec
## 2001 80721.71 80721.71 80721.71 80721.71 80721.71
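
As a quick check (a sketch using the objects created above), the naive forecast simply carries the last training observation, the December 2000 sales value, forward for every month of the horizon:

last.obs <- as.numeric(tail(Souv.Train, 1))  # December 2000 sales
all(naive.fc$mean == last.obs)               # should be TRUE by construction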


D) Compute the RMSE and MAPE for the naive forecasts.

The table below reports the RMSE and MAPE (as a percentage) of the naive forecasts over the validation period (not the training fit).

library(pander)

perf.mtrc <- accuracy(naive.fc, Souv.Valid)
pandoc.table(perf.mtrc[2,c(2,5)], digits = 6)
---------------------
   RMSE       MAPE
---------- ----------
 56099.1     290.95
---------------------
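
For reference, a minimal sketch of how these two metrics are computed by hand from the validation errors (using the objects defined above); the results should match the accuracy() output:

err  <- as.numeric(Souv.Valid - naive.fc$mean)         # validation errors
rmse <- sqrt(mean(err^2))                              # root mean squared error
mape <- 100 * mean(abs(err) / as.numeric(Souv.Valid))  # mean absolute % error
c(RMSE = rmse, MAPE = mape)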


E) Plot a histogram of the forecast errors and a time plot of the forecasts and actual sales in the validation period.

forecast_errors <- Souv.Valid - naive.fc$mean  # actual minus forecast

par(mfrow = c(1,2))

hist(forecast_errors, breaks = 10,
     main = 'Distribution of Forecast Error',
     xlab = 'Forecast Error')

plot(Souv.Valid,
     main = "Sales: Validation Period",
     ylab = "Monthly Sales",
     lwd = 2,
     axes = FALSE)
axis(1, at=time(Souv.Valid), 
     labels=c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'))
axis(2)
lines(naive.fc$mean, lwd = 2, lty = 2, col = 'red')
legend("left",
       legend = c('Observed Sales Data',
                  'Predicted Sales'),
       lty = c(1,2),
       lwd = c(2),
       col = c('black', 'red'))

The histogram of the forecast errors shows that they are strongly skewed to the right. For most months the naive model over-predicts souvenir sales, producing negative errors (actual minus forecast). The time plot corroborates the histogram's message: the naive model over-predicts souvenir sales in every month except December, which it under-predicts. This is likely due to seasonality in souvenir sales; sales peak in December, and December 2000 is the value the naive model carried forward for all of its validation-period predictions.
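
Given that seasonality, a seasonal naive forecast, which repeats the value from the same calendar month one year earlier, is a natural alternative; a brief sketch of how it could be evaluated on the same validation period using snaive() from the forecast package:

# Seasonal naive: each month of 2001 is forecast with the corresponding
# month of 2000, so the December peak is carried into December 2001.
snaive.fc <- snaive(Souv.Train, h = 12)
accuracy(snaive.fc, Souv.Valid)[2, c("RMSE", "MAPE")]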


F) The analyst found a model that gives satisfactory performance on the validation set. What must they do to use the model for generating forecasts for the year 2002?

They must refit the model using the entire time-series data (1995 through 2001) in order to incorporate the most recent information into their predictions.
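
For example, with the naive model used above (a sketch; the analyst's satisfactory model would take the place of naive() here), the refit and 2002 forecasts would be:

# Refit on the full 1995-2001 series and forecast the 12 months of 2002
naive.full <- naive(Souvenir.Data, h = 12)
naive.full$mean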



Question 2

If an analyst wants to use the shampoo sales data to forecast sales in future months, they should take all of the following steps: