1 and 1a. Partition the souvenir data. Why was it partitioned?
library(forecast)  # provides snaive() and accuracy(), used below
souvenir <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)
souvenir.ts <- ts(souvenir$Sales, start=c(1995,1), frequency=12)
validLength <- 12
trainLength <- length(souvenir.ts) - validLength
SouvTrain <- window(souvenir.ts, start=c(1995,1), end=c(1995, trainLength))
SouvValid <- window(souvenir.ts, start=c(1995,trainLength+1))

To avoid overfitting, the analyst divides the time series into two periods: a training period, on which the model is fitted, and a validation period, on which it is tested. Forecast errors, the differences between the actual and forecasted values, can then be measured on data the model has not seen.

1b. Why did the analyst choose a 12-month validation period?

The analyst used a 12-month validation period because the goal is to forecast one year ahead, and 12 months covers a full seasonal cycle of the monthly data.

1c. What is the naive forecast for the validation period (assume that you must provide forecasts for 12 months ahead)?
# Seasonal naive: each 2001 month is forecast with the value from the same month of 2000
SouvnaiveForValid <- snaive(SouvTrain, h=validLength)
SouvnaiveForValid$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78
##           Aug      Sep      Oct      Nov      Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71

1d. Compute RMSE and MAPE for naive forecasts.

accuracy(SouvnaiveForValid, SouvValid)
##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set     7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
##                   ACF1 Theil's U
## Training set 0.4140974        NA
## Test set     0.2264895 0.7373759

The validation-period (test set) RMSE is about 9,542 and the MAPE is about 27.3%, so the forecasts miss actual sales by roughly 27% on average. This is not a particularly good fit. Because the ME equals the MAE (7,828.278), every validation-period error is positive, which means the naive forecast under-predicts in all 12 months.
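For reference, these measures can be reproduced by hand from the forecast errors. The sketch below uses base R only; the `actual` and `forecasted` vectors are made-up numbers for illustration, not the souvenir data, and `rmse` and `mape` are helper names introduced here rather than functions from the forecast package.

```r
# Hypothetical values for illustration only (not the souvenir data)
actual     <- c(100, 200, 400)
forecasted <- c( 90, 180, 300)

# Forecast error = actual - forecast; positive errors mean under-prediction
err <- actual - forecasted

# Root mean squared error, in the units of the series
rmse <- function(e) sqrt(mean(e^2))

# Mean absolute percentage error, relative to the actual values
mape <- function(e, y) mean(abs(e / y)) * 100

rmse(err)
mape(err, actual)

# ME equals MAE only when no error is negative
mean(err) == mean(abs(err))
```

The last line is the same check that can be read off the accuracy() table: when ME and MAE match, the forecasts never exceed the actual values.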

1e. Plot a histogram of forecast errors resulting from the naive forecasts (for the validation period). Also plot a time plot for the naive forecasts and actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts?
# Forecast errors in the validation period: actual minus naive forecast
SouvnaiveValResid <- SouvValid - SouvnaiveForValid$mean
myhist <- hist(SouvnaiveValResid, ylab="Frequency", xlab="Forecast Error", bty="l", main="")
multiplier <- myhist$counts / myhist$density
mydensity <- density(SouvnaiveValResid, na.rm=TRUE)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity)

plot(SouvValid, bty="l", xaxt="n", xlab="2001", yaxt="n", ylab="Souvenir sales, 000s")
axis(1, at=as.numeric(time(SouvValid)), labels=month.abb)  # one tick per observed month
axis(2, at=seq(0,105000,15000), labels=format(seq(0,105,15)), las=2)
lines(SouvnaiveForValid$mean, col=2, lty=2)
legend(2001, 90000, c("Actual","Forecast"), col=1:2, lty=1:2)

The plots confirm that the naive forecast sits below actual sales throughout the validation period, so it consistently under-predicts.

1f. The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for 2002?

The analyst must now recombine the training and validation periods and rerun the chosen model on the entire series, then use that refitted model to generate the forecasts for 2002.
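A minimal sketch of that last step, using a seasonal naive forecast as a stand-in for whatever model the analyst actually chose. It is written in base R on a made-up series (`full.ts` and `fc.2002` are hypothetical names); with the forecast package and the real data, the equivalent call would presumably be snaive(souvenir.ts, h=12).

```r
# Hypothetical monthly series standing in for the full 1995-2001 data
set.seed(1)
full.ts <- ts(round(runif(84, 1000, 5000)), start = c(1995, 1), frequency = 12)

# Use the whole series, validation period included: for seasonal naive,
# each 2002 month repeats the value from the same month of 2001
last.year <- window(full.ts, start = c(2001, 1), end = c(2001, 12))
fc.2002 <- ts(as.numeric(last.year), start = c(2002, 1), frequency = 12)
fc.2002
```

Had the model been fitted on the training period alone, the 2002 forecasts would be built from 2000 values and would ignore the most recent year of data.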

2. If the goal is forecasting sales in future months in the Shampoo data, which of the following steps should be taken?

Partition data into training and validation periods: Just as in Question 1, you have to partition the data so that the forecasting model is fitted on the training period and tested on the validation period.

Compute naive forecasts: This crucial step gives you an initial forecast, built from the training period and tested on the validation period, that serves as a baseline for judging whether a more elaborate model is needed.

Look at MAPE and RMSE values for the validation period: These values tell you how well the forecast for the validation period fits actual observed values.
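The three steps above can be sketched end to end in base R. The series below is made up for illustration (it is not the Shampoo data), and the names `sales.ts`, `train.ts`, `valid.ts`, and `naive.fc` are assumptions introduced here.

```r
# Hypothetical 36-month sales series with an upward trend (made-up numbers)
set.seed(42)
sales.ts <- ts(200 + 10 * (1:36) + rnorm(36, sd = 30),
               start = c(1995, 1), frequency = 12)

# Step 1: partition, holding out the last 12 months for validation
nValid <- 12
nTrain <- length(sales.ts) - nValid
train.ts <- window(sales.ts, end = time(sales.ts)[nTrain])
valid.ts <- window(sales.ts, start = time(sales.ts)[nTrain + 1])

# Step 2: naive forecast, i.e. the last training value carried forward
naive.fc <- rep(as.numeric(tail(train.ts, 1)), nValid)

# Step 3: error measures on the validation period
err <- as.numeric(valid.ts) - naive.fc
c(RMSE = sqrt(mean(err^2)),
  MAPE = mean(abs(err / as.numeric(valid.ts))) * 100)
```

A plain (non-seasonal) naive forecast is used here because the made-up series has a trend but no seasonal pattern; for strongly seasonal data such as the souvenir sales, the seasonal variant is the more natural baseline.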