library(forecast)  # needed for snaive() and accuracy()
souvenir <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)
souvenir.ts <- ts(souvenir$Sales, start = c(1995, 1), frequency = 12)
# Hold out the final 12 months (Jan-Dec 2001) as the validation period
validLength <- 12
trainLength <- length(souvenir.ts) - validLength
SouvTrain <- window(souvenir.ts, start = c(1995, 1), end = c(1995, trainLength))
SouvValid <- window(souvenir.ts, start = c(1995, trainLength + 1))
To avoid overfitting, the analyst must divide the time series into two periods: a training period, in which the model is developed, and a validation period, in which it is tested. Once that is done, the forecaster can measure the forecast errors, the differences between the forecasted values and the actual values.
The analyst used a 12-month validation period because the goal is to forecast one more year into the future, and a full year covers all of the seasonality in the dataset.
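As a quick sanity check on the partition (a small addition, using only the objects created above), the training window should span the 72 months from January 1995 through December 2000 and the validation window the 12 months of 2001:
# Confirm the split: 72 training months and 12 validation months (2001)
length(SouvTrain)
length(SouvValid)
start(SouvValid); end(SouvValid)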
# Seasonal naive forecast: each validation month is predicted by the same
# month's sales one year earlier in the training period
SouvnaiveForValid <- snaive(SouvTrain, h = validLength)
SouvnaiveForValid$mean
## Jan Feb Mar Apr May Jun Jul
## 2001 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## Aug Sep Oct Nov Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
# Error measures for the training period and the validation ("Test set") period
accuracy(SouvnaiveForValid, SouvValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set 7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
The RMSE is just over 9,542 and the MAPE is nearly 27.3%, meaning the forecasts deviate from the actual values by about 27% on average. This is not a particularly good fit, and a look at the data shows that the model is under-predicting.
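To see where those two numbers come from, they can be reproduced directly from the raw forecast errors; this short sketch uses only objects defined above (err is an illustrative helper name):
# Recompute the validation-period RMSE and MAPE by hand
err <- SouvValid - SouvnaiveForValid$mean   # forecast errors
sqrt(mean(err^2))                           # RMSE, roughly 9542
mean(abs(err / SouvValid)) * 100            # MAPE, roughly 27.3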
# Refit the seasonal naive model to the entire series for the final 12-month forecast
SouvForValid <- snaive(souvenir.ts, h = validLength)
# Residuals (forecast errors) in the validation period
SouvnaiveValResid <- SouvValid - SouvnaiveForValid$mean
# Histogram of the validation errors with an overlaid density curve
myhist <- hist(SouvnaiveValResid, ylab = "Frequency", xlab = "Forecast Error", main = "")
# Rescale the density estimate so it sits on the histogram's count scale
multiplier <- myhist$counts / myhist$density
mydensity <- density(SouvnaiveValResid, na.rm = TRUE)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity)
plot(SouvValid, bty = "l", xaxt = "n", xlab = "2001", yaxt = "n", ylab = "Souvenir sales, 000s")
# Monthly tick marks: 12 points spaced 1/12 apart, starting at January 2001
axis(1, at = seq(2001, by = 1/12, length.out = 12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
# Label the y-axis in thousands
axis(2, at = seq(0, 105000, 15000), labels = format(seq(0, 105, 15)), las = 2)
lines(SouvnaiveForValid$mean, col = 2, lty = 2)
legend(2001, 90000, c("Actual", "Forecast"), col = 1:2, lty = 1:2)
The plot confirms that the naive forecast consistently under-predicts the actual values.
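In fact, the accuracy output already implies this: the test-set ME equals the MAE and the MPE equals the MAPE, which can only happen if every error has the same positive sign. A small illustrative check on the residuals computed above:
# Every validation error is positive, i.e. actual sales exceed the forecast in all 12 months
all(SouvnaiveValResid > 0)
summary(as.numeric(SouvnaiveValResid))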
The analyst now has to recombine the training and validation periods and rerun the final model on the entire series.
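As a minimal sketch of that final step, the SouvForValid object refit to the full series above can be inspected and plotted to obtain the 12-month-ahead forecast:
# Final seasonal naive point forecasts for the 12 months after December 2001
SouvForValid$mean
plot(SouvForValid, bty = "l", main = "", ylab = "Souvenir sales")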
Partition the data into training and validation periods: Just as was done in Question 1, you partition the data into these periods so the forecasting model can be developed on one and tested on the other.
Compute naive forecasts: This crucial step, in which you build the model on the training period and test it on the validation period, produces your initial forecast and shows whether the model needs to be adjusted.
Look at the MAPE and RMSE values for the validation period: These values tell you how well the forecasts for the validation period match the actual observed values; a quick way to extract just those two numbers is sketched below.
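Here, the two validation-period values are pulled straight out of the accuracy() matrix (an illustrative addition rather than part of the original script):
# Extract only the validation-period RMSE and MAPE
accuracy(SouvnaiveForValid, SouvValid)["Test set", c("RMSE", "MAPE")]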