Homework 2: Questions 1 & 2 from Chapter 3


Problem 1: Souvenir Sales.

A. Why was the data partitioned?

Data is partitioned to guard against overfitting and to measure how accurately a model predicts data it has not seen. The analyst fits the model on the training period and then compares its forecasts with the actual values in the validation period to estimate how well it will predict future sales.

B. Why 12-month validation period?

The analyst was asked to forecast the next 12 months of sales, so she chose a 12-month validation period: “the main principle is to choose a validation period that mimics the forecast horizon to allow the evaluation of actual predictive performance.” More generally, the length of the validation period depends on the forecasting goal, the data frequency, and the forecast horizon.

C. What is the naive forecast for the validation period?

The naive forecast is defined as the most recent value of the series. For a seasonal series, the seasonal naive forecast is the value from the most recent identical season (for example, forecasting December sales using last December’s figures).
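As a minimal sketch of these two definitions, the base-R snippet below computes both forecasts by hand for a made-up monthly series. The series `toy.ts` and the helpers `naive_fc` and `seasonal_naive_fc` are illustrative assumptions; they are not part of the assignment data or of the forecast package.

```r
# Hypothetical 3-year monthly series (values are made up for illustration).
toy.ts <- ts(c(10, 12, 14, 13, 15, 18, 20, 22, 25, 27, 35, 60,
               11, 13, 15, 14, 16, 19, 21, 23, 26, 28, 37, 65,
               12, 14, 16, 15, 17, 20, 22, 24, 27, 30, 40, 70),
             start = c(1995, 1), frequency = 12)

# Naive forecast: repeat the last observed value for every future period.
naive_fc <- function(x, h) {
  rep(as.numeric(tail(x, 1)), h)
}

# Seasonal naive forecast: reuse the value from the same season one cycle back.
seasonal_naive_fc <- function(x, h) {
  m <- frequency(x)                      # periods per seasonal cycle (12 for monthly data)
  last.cycle <- as.numeric(tail(x, m))   # most recent full cycle
  rep(last.cycle, length.out = h)        # recycle if h exceeds one cycle
}

naive_fc(toy.ts, 3)           # 70 70 70
seasonal_naive_fc(toy.ts, 3)  # 12 14 16
```

The naive forecast is flat at the last observation (December of the final year), while the seasonal naive forecast repeats the January, February, and March values of the final year.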

library(forecast)
library(readr)
SouvenirSales <- read.csv("SouvenirSales.csv", header = TRUE, stringsAsFactors = FALSE)
head(SouvenirSales)
##     Date   Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
str(SouvenirSales)
## 'data.frame':    84 obs. of  2 variables:
##  $ Date : chr  "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
##  $ Sales: num  1665 2398 2841 3547 3753 ...
SouvenirSales.ts <- ts(SouvenirSales$Sales, start = c(1995, 1), frequency = 12)
plot(SouvenirSales.ts/1000, bty="l", xlab = "Year", ylab = "Sales (in thousands)", lwd = 2, col = "green", main = "Souvenir Sales in 1995-2001")

As the chart shows, souvenir sales exhibit a seasonal pattern, spiking toward the end of each year and then dropping sharply. A plausible explanation is that the increase is connected to the winter holidays.

ggseasonplot(SouvenirSales.ts/1000, ty="l", ylab = "Sales (in thousands)", lwd = 2, main = "Seasonal Plot for Souvenir Sales")

The seasonal plot above confirms that the data are seasonal. For that reason, the seasonal naive forecast is an appropriate forecasting method for this series.

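# Hold out the last 12 months for validation; the remaining months form the training period
fix.nvalid <- 12
fix.ntrain <- length(SouvenirSales.ts) - fix.nvalid
# Sales are rescaled to thousands in both partitions
train.ts <- window(SouvenirSales.ts/1000, start = c(1995, 1), end = c(1995, fix.ntrain))
valid.ts <- window(SouvenirSales.ts/1000, start = c(1995, fix.ntrain + 1), end = c(1995, fix.nvalid + fix.ntrain))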
SouvenirSalesLinear <-tslm(train.ts ~ trend + I(trend^2))
SouvenirSalesLinearPrediction <-forecast(SouvenirSalesLinear, h = fix.nvalid, level = 0)
plot(SouvenirSalesLinearPrediction, bty="l", ylab = "Sales (in thousands)", xlab = "Year", main = "Souvenir Sales")
lines(SouvenirSalesLinear$fitted.values, lwd =2, col = "red")
lines(valid.ts)

SouvenirSalesNaive <-snaive(train.ts, h = fix.nvalid)
SouvenirSalesNaive$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001  7.61503  9.84969 14.55840 11.58733  9.33256 13.08209 16.73278
##           Aug      Sep      Oct      Nov      Dec
## 2001 19.88861 23.93338 25.39135 36.02480 80.72171
plot(SouvenirSalesNaive, ylab = "Sales (in thousands)", xlab = "Year", main = "Seasonal Naive Forecast")

D. RMSE & MAPE?

accuracy(SouvenirSalesNaive, valid.ts)
##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 3.401361 6.467818 3.744801 22.39270 25.64127 1.000000
## Test set     7.828278 9.542346 7.828278 27.27926 27.27926 2.090439
##                   ACF1 Theil's U
## Training set 0.4140974        NA
## Test set     0.2264895 0.7373759

From the test set (validation) row, RMSE = 9.542 in thousands, which is 9,542.346 in the original sales units, and MAPE = 27.28%.
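The two error measures can also be reproduced by hand. The sketch below uses made-up numbers: the vectors `actual` and `predicted` and the helpers `rmse` and `mape` are illustrative assumptions, not output from the assignment. It also shows that RMSE depends on the units of the series while MAPE does not.

```r
# Made-up actual and forecast values, in thousands (illustrative only).
actual    <- c(100, 200, 400)
predicted <- c(90, 220, 380)

# Root mean squared error: same units as the series.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
# Mean absolute percentage error: unit-free, expressed in percent.
mape <- function(actual, predicted) 100 * mean(abs((actual - predicted) / actual))

rmse(actual, predicted)                 # sqrt(300), about 17.32
mape(actual, predicted)                 # about 8.33 (percent)

# Rescaling from thousands to original units multiplies RMSE by 1000 ...
rmse(actual * 1000, predicted * 1000)   # about 17320.51
# ... but leaves MAPE unchanged, because it is a ratio.
mape(actual * 1000, predicted * 1000)   # about 8.33 (percent)
```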

E. Plot a histogram.

# Forecast errors in the validation period (actual minus seasonal naive forecast, in thousands)
SouvenirValidErrors <- as.numeric(valid.ts - SouvenirSalesNaive$mean)
SouvenirHist <- hist(SouvenirValidErrors, ylab="Frequency", xlab="Forecast Error (in thousands)", main="Frequency of Seasonal Naive Forecasting Errors in Validation Period", bty="l")
# Scale the density curve to the count axis: count = density * n * bin width
multiplier <- length(SouvenirValidErrors) * diff(SouvenirHist$breaks[1:2])
err.density <- density(SouvenirValidErrors, na.rm=TRUE)
err.density$y <- err.density$y * multiplier
lines(err.density, col = "red", lwd = 2)

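The histogram suggests that the seasonal naive forecast underpredicts sales: most of the forecast errors are positive. The accuracy table points the same way, since in the validation period the mean error equals the mean absolute error (both 7.828), which can only happen when no error is negative.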

plot(valid.ts, bty="l", xaxt="n", xlab="Year: 2001", main="Naive Forecast versus Actual Sales", yaxt="n", ylab="Sales (in thousands)")
axis(1, at=as.numeric(time(valid.ts)), labels=month.abb)
axis(2, las=2)
lines(SouvenirSalesNaive$mean, col=2, lty=2)
legend(2001.3,100, c("Actual Sales","Naive Seasonal Forecast"), col=1:2, lty=1:2)

plot(SouvenirSalesNaive$residuals, xlab = "Year", ylab = "Residuals", main = "Residuals by Year", lwd = 2)

The fact that most residuals lie above the zero line is another illustration of underprediction.

qqnorm(SouvenirSalesNaive$residuals[13:72])
qqline(SouvenirSalesNaive$residuals[13:72])

The normal QQ plot shows that the training-period errors are not normally distributed: the points do not follow the reference line from the bottom-left corner to the top-right.

qqnorm(valid.ts - SouvenirSalesNaive$mean)
qqline(valid.ts - SouvenirSalesNaive$mean)

This plot shows that the validation-period errors are not normally distributed either. The residuals are again positive, which indicates that the model underpredicts.

In conclusion, the souvenir sales forecast for 2001 is very different from the actual sales. The seasonal naive forecast consistently underpredicts, as the plots above show. Additionally, the errors in both the training and validation periods are not normally distributed.

F. What must the analyst do for year 2002?

To generate a forecast for 2002, the analyst needs to combine the training and validation periods and rerun the model on the complete data set.
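A minimal base-R sketch of that final step follows. The series `full.ts` is a made-up stand-in for the complete 84-month data set (it does not contain the real sales), and `last.year` and `forecast.2002` are illustrative names. With the objects defined earlier in this report, the equivalent call would presumably be `snaive(SouvenirSales.ts/1000, h = 12)`.

```r
# Stand-in for the full 84-month series, Jan 1995 to Dec 2001
# (made-up values, NOT the real sales).
full.ts <- ts(seq_len(84), start = c(1995, 1), frequency = 12)

# Seasonal naive forecast for the 12 months after the series ends:
# each month of 2002 takes the value of the same month in 2001,
# the last observed year of the combined training + validation data.
last.year <- window(full.ts, start = c(2001, 1), end = c(2001, 12))
forecast.2002 <- ts(as.numeric(last.year), start = c(2002, 1), frequency = 12)
forecast.2002
```

The point of refitting on the full series is that the 2002 forecast then uses the 2001 observations, which were held out while the model was being evaluated.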

Problem 2: Forecasting Shampoo Sales

Partition the data into training and validation periods

Yes, this is an important step. Partitioning the data allows the analyst to evaluate candidate models on data that was not used to fit them, and so to choose the model that forecasts best.

Examine time plots of the series and of model forecasts only for the training period

No, plots for both the training and validation periods should be examined. The training period does not provide enough information on its own.

Look at MAPE and RMSE values for the training period

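No, not on their own. Training-period MAPE and RMSE measure how well the model fits the data it was built on; the validation-period values are the ones that indicate predictive accuracy.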

Look at MAPE and RMSE values for the validation period

Yes, this is an important step: it shows how closely forecasts from the model fitted on the training period match the validation data, and therefore how well the chosen model works.

Compute naive forecasts

Yes, it can be helpful. The most recent data points may be the most relevant for forecasting the future. Naive forecasts serve as a baseline, which is needed for comparison when evaluating a method’s predictive performance.
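To illustrate the baseline idea, the sketch below compares a candidate forecast with a naive forecast over a validation period. All numbers are made up, and `naive.fc`, `candidate.fc`, and `rmse` are hypothetical names used only for this example.

```r
# Made-up validation-period values (illustrative only).
actual       <- c(20, 22, 25, 30)
naive.fc     <- c(18, 18, 18, 18)   # baseline: last training value repeated
candidate.fc <- c(19, 23, 24, 28)   # forecasts from some hypothetical model

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse.naive     <- rmse(actual, naive.fc)
rmse.candidate <- rmse(actual, candidate.fc)

# A ratio below 1 means the candidate beats the naive baseline on this sample.
rmse.candidate / rmse.naive
```

A model that cannot beat the naive baseline in the validation period is hard to justify, however well it fits the training data.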