Question 1.
library(readxl)
library(forecast)
SouvenirSales <- read_excel("~/rProjects/Assignment_2/SouvenirSales.xlsx")
sales.ts <- ts(SouvenirSales$SalesT, start = c(1995, 1), end = c(2001, 12), freq = 12)
#Partition the data into the training and validation periods as explained in the problem
nValid <- 12
nTrain <- length(sales.ts) - nValid
train.ts <- window(sales.ts, start = c(1995, 1), end = c(1995, nTrain))
valid.ts <- window(sales.ts, start = c(1995, nTrain + 1), end = c(1995, nTrain + nValid))
seasonal <- window(sales.ts, start = c(2000, 1), end = c(2000, 12))
#Fit a quadratic trend model to the training period and forecast over the validation period
sales.lm <- tslm(train.ts ~ trend + I(trend^2))
sales.lm.pred <- forecast(sales.lm, h = nValid, level = 0)
plot(sales.lm.pred, ylim = c(0,110), xlab = "Date", main = " ", bty = "l", flty = 2)
mtext("Total Monthly Sales (Thousands)", side = 2, line = 2.3)
mtext("Training", line = -1, at = c(1998,85))
mtext("Validation", line = -1, at = c(2001.7,85))
lines(sales.lm$fitted, lwd = 2)
lines(valid.ts)
abline(v = 2001, col = "red", lwd = 3)
abline(h = 104, col = "red", lty = 2, lwd = 3)

1B.
The analyst chose a 12-month validation period because she was asked to forecast sales for the next 12 months. According to the text, it is best “to choose a validation period that mimics the forecast horizon, to allow the evaluation of actual predictive performance.”
1E.
Below is the histogram of forecast errors, followed by a plot of the validation period and the seasonal naive forecast on the same axes.
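The code that follows refers to a seasonal naive forecast object, snaive.pred, which is not created in the code above. A minimal sketch of how it could be defined, assuming the snaive() function from the forecast package and the training partition built in the previous part:

#Seasonal naive forecast over the validation horizon (assumed definition of snaive.pred)
snaive.pred <- snaive(train.ts, h = nValid)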
#Plot histogram of errors
hist(snaive.pred$residuals, ylab = "Frequency", xlab = "Forecast Error", bty = "l", main = "Histogram of Forecast Errors")

The histogram of errors does not appear to be normally distributed; the errors are much more frequent in the left tail. We can also see that we under-predict much less frequently than we over-predict.
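As a quick numeric check on that claim, assuming the snaive.pred object sketched above, the share of positive training residuals (actual minus forecast, so positive means under-prediction) can be computed directly:

#Proportion of positive (under-predicted) residuals in the training period
mean(snaive.pred$residuals > 0, na.rm = TRUE)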
#Plot actual validation period with seasonal naive forecast
plot(valid.ts, type = 'l', col = "black", ylab = "Total Monthly Sales (Thousands)", xlab = " ", main = "Validation Period with Seasonal Naive Forecast", xaxt = 'n')
axis(1, at = seq(2001, 2001 + 11/12, by = 1/12), labels = month.name, las = 2)
#Overlay the seasonal naive forecast on the same scale as the actuals
lines(snaive.pred$mean, lty = 2, col = "red")
legend(2001, 100, c("Actual", "Seasonal Naive Forecast"), col = 1:2, lty = 1:2)

Looking only at the validation period, we under-predict more than we over-predict, but in general the two curves follow a very similar path. This shows that naive forecasts can be effective prediction models even though they are the easiest to create.
#Residuals over time
plot(snaive.pred$residuals, main = "Residuals Over Time", bty="l")
abline(h = 0, col = "red", lty = 1, lwd = 1)

The residuals show that the forecast over-predicts the majority of the time in the training period, which is consistent with the histogram above.
#Normality plot for errors in the training period
#Exclude the first 12 residuals, which are NA because the seasonal naive forecast has no prior season for the first year
qqnorm(snaive.pred$residuals[13:72])
qqline(snaive.pred$residuals[13:72])

The points in the normality plot for the training period do not fall along the 45-degree line, which again shows that we do not have normally distributed errors. Again we see that most of the predictions are above the line, indicating over-prediction.
#Normality plot for errors in the validation period
qqnorm(valid.ts - snaive.pred$mean)
qqline(valid.ts - snaive.pred$mean)

Again, the normality plot for the validation period shows that we do not have normally distributed errors. Also, when looking at only the validation period, we see that we under-predict more in this period, which we first saw when we graphed the actual data with the seasonal naive forecast.
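The validation-period errors can be inspected the same way. A quick summary, assuming the objects defined above, shows their signs and spread (positive values indicate under-prediction):

#Summarize validation-period errors (actual minus seasonal naive forecast)
valid.err <- valid.ts - snaive.pred$mean
summary(as.numeric(valid.err))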
Question 2.
If the goal is to forecast sales in future months, the following steps should be taken:
- partition the data into training and validation periods
- compute naive forecasts
- look at MAPE and RMSE values for the validation period
I eliminated “examine time plots of the series and of model forecasts only for the training period” because the models need to be tested on the validation period; you cannot work solely with the training period. I also left out “look at MAPE and RMSE values for the training period” because, as stated in the text, the validation period serves as a more objective basis than the training period for assessing predictive accuracy. A code sketch of the retained steps follows below.
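A minimal sketch of those three steps, assuming the forecast package's snaive() and accuracy() functions and reusing the partition created in Question 1:

#1. Partition the data (train.ts and valid.ts from Question 1)
#2. Compute the seasonal naive forecast over the validation horizon
snaive.pred <- snaive(train.ts, h = nValid)
#3. Report RMSE and MAPE; the "Test set" row refers to the validation period
accuracy(snaive.pred, valid.ts)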