1. Souvenir Sales:

The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.

Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation period containing the last 12 months of data (year 2001). She then fit a forecasting model to sales, using the training period.

Partition the data into the training and validation periods as explained above.

  a. Why was the data partitioned?

Partitioning the data allows the analyst to test her model for accuracy: she fits the model to the training period and then evaluates its forecasts against the validation period. Once that is done, she can look at a number of error measures, residuals, and forecast-versus-actual comparisons to see how accurate her model is likely to be at predicting future sales.

  b. Why did the analyst choose a 12-month validation period?

First and foremost, because she was asked to forecast the next 12 months of sales. A 12-month validation period covers one full seasonal cycle (one year), so any seasonality in the data is accounted for. Having the validation period match the forecast horizon, and sit immediately before it in time, also makes the performance estimate more realistic. (A month plot illustrating the seasonality is sketched after the exploratory plot below.)

#libraries
library(forecast)
library(readr)
#read and review
souvenir <- read.csv("C:/Users/Edrick/Google Drive/678 - Predictive Analytics/R Files/SouvenirSales.csv", stringsAsFactors=FALSE)
head(souvenir)
##     Date   Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
str(souvenir)
## 'data.frame':    84 obs. of  2 variables:
##  $ Date : chr  "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
##  $ Sales: num  1665 2398 2841 3547 3753 ...
#time series
souvenir.ts <- ts(souvenir$Sales, start = c(1995, 1), frequency = 12)

#Plot data set
plot(souvenir.ts/1000, ylim = c(0,120), ylab = "Sales (in thousands)", xlab = "Time", main = "Souvenir Sales Over Time", bty = "l") 
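
The strong yearly pattern (with large December spikes) visible in this plot is the seasonality that the 12-month validation period needs to cover. As a quick visual check, a month plot groups the observations by calendar month; this is a minimal sketch using base R's monthplot() on the souvenir.ts object created above.

#Sketch: seasonality check - each panel shows one calendar month across the years
monthplot(souvenir.ts, ylab = "Sales", xlab = "Month", main = "Souvenir Sales by Calendar Month")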

#Partition data
nValid <- 12
nTrain <- length(souvenir.ts) - nValid

training.ts <- window(souvenir.ts/1000, start = c(1995, 1), end = c(1995, nTrain))
validation.ts <- window(souvenir.ts/1000, start = c(1995, nTrain + 1), end = c(1995, nTrain + nValid))

#Fit a quadratic trend model to the training period and forecast the validation period
souvenir.lm <- tslm(training.ts ~ trend + I(trend^2))
souvenir.lm.pred <- forecast(souvenir.lm, h = nValid, level = 0)

plot(souvenir.lm.pred, ylim = c(0,120), ylab = "Sales (in thousands)", xlab = "Time", main = "Souvenir Sales Over Time", bty = "l", flty = 2) 

lines(souvenir.lm$fitted, lwd = 2) 
lines(validation.ts)
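
As a side check (not part of the question), the validation-period accuracy of this quadratic trend model can be computed with forecast::accuracy(); a minimal sketch using the objects defined above:

#Sketch: validation accuracy of the quadratic trend model (forecasts are in thousands)
accuracy(souvenir.lm.pred$mean, validation.ts)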

  c. What is the naive forecast for the validation period? (assume that you must provide forecasts for 12 months ahead)

See below: first a table with the forecast values, then a plot of the forecasts for the validation period (year 2001).

#Naive Forecast (the partition is re-created here on the original, unscaled series)
validLength <- 12

trainLength <- length(souvenir.ts) - validLength

sSouvenirTrain <- window(souvenir.ts, start=c(1995,1), end=c(1995,trainLength))
sSouvenirValid <- window(souvenir.ts, start=c(1995,trainLength + 1), end=c(1995,trainLength + validLength))

# Seasonal naive forecast
snaiveForValid <- snaive(sSouvenirTrain, h=validLength)

# View Forecasts from the seasonal naive model
#Answer for 1C:
snaiveForValid$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78
##           Aug      Sep      Oct      Nov      Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
#Plot Seasonal naive forecast
Snaive.pred <- snaive(training.ts, h = nValid)

plot(Snaive.pred, ylab = "Sales (in thousands)", xlab = "Date", main = "Seasonal Naive Forecast")
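
Note that the forecasts above are seasonal naive forecasts (each month of 2001 is forecast by the same month of 2000), which is the sensible "naive" benchmark for strongly seasonal monthly data. For comparison, a plain naive forecast would simply repeat the last training observation (December 2000) for all 12 months; a minimal sketch:

#Sketch: plain (non-seasonal) naive forecast for comparison with the seasonal naive forecast
plainNaive <- naive(sSouvenirTrain, h = validLength)
plainNaive$mean
accuracy(plainNaive, sSouvenirValid)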

  d. Compute the RMSE and MAPE for the naive forecasts.

Shown below: the RMSE for the seasonal naive forecast on the validation period is about $9,542 in sales, and the MAPE is about 27.3% (the average percentage deviation from the actual values). A hand computation of both measures is sketched after the accuracy output.

#Accuracy - RMSE & MAPE
accuracy(snaiveForValid, sSouvenirValid)
##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set     7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
##                   ACF1 Theil's U
## Training set 0.4140974        NA
## Test set     0.2264895 0.7373759
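
As a sanity check on what these measures mean, the test-set RMSE and MAPE can be recomputed by hand from the validation-period forecast errors; a minimal sketch:

#Sketch: recompute the test-set RMSE and MAPE by hand
err <- sSouvenirValid - snaiveForValid$mean
sqrt(mean(err^2))                      #RMSE, in sales dollars (about 9542)
mean(abs(err / sSouvenirValid)) * 100  #MAPE, in percent (about 27.3)
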
  e. Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period). Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts?

Histogram found below. The naive forecasts under-predict sales in every period of the validation set. The histogram shows many more small (positive) errors than large ones. Taken together, the naive forecast consistently under-predicts sales, and all of its validation errors have the same sign (positive).

#validation residuals: actual sales minus the seasonal naive forecasts (in thousands)
NaiveResidual <- validation.ts - Snaive.pred$mean

#Plot the histogram of the validation-period forecast errors
myhist <- hist(NaiveResidual, ylab="Frequency", xlab="Forecast Error", main="Forecast Errors of Naive Forecast (Validation)", bty="l")
#rescale the kernel density estimate so it can be overlaid on a frequency (count) histogram
multiplier <- myhist$counts / myhist$density
mydensity <- density(NaiveResidual, na.rm=TRUE)
mydensity$y <- mydensity$y * multiplier[1]

# Add the density curve
lines(mydensity, lwd = 2, col = "blue")

#Plot Naive Forecast and Actual Sales (2001)
plot(sSouvenirValid/1000, bty="l", xaxt="n", xlab="Year: 2001", main="Naive Forecast vs. Actual Sales", yaxt="n", ylab="Sales (in thousands)")

axis(1, at=seq(2001,2001.917,0.08333), labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, las=2)

# Now add the forecasts and make the line red and dashed
lines(Snaive.pred$mean, col=2, lty=2)

# Add a legend
legend(2001.3,100, c("Actual","Forecast"), col=1:2, lty=1:2)

# plot all residuals
plot(Snaive.pred$residuals, bty="l", ylab = "Residuals")

The residuals over the training period also show that the model generally under-predicts the actual values.
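
A quick numeric check on the validation errors (NaiveResidual, defined above) supports this; a minimal sketch:

#Sketch: all validation-period errors are positive (actual above forecast)
summary(as.numeric(NaiveResidual))
mean(NaiveResidual > 0)   #proportion of under-predicted months; the ME = MAE output above implies this is 1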

#Normality plot (training-period residuals; the first 12 are NA for a seasonal naive model)
qqnorm(Snaive.pred$residuals[13:72])
qqline(Snaive.pred$residuals[13:72])

The normality plot of the training-period residuals shows that the errors are not normally distributed.

#Normality plot (Validation period only)
qqnorm(sSouvenirValid/1000 - Snaive.pred$mean)
qqline(sSouvenirValid/1000 - Snaive.pred$mean)

The normality plot for the validation period shows that those errors are not normally distributed either; all of the residuals are positive, again showing that the model under-predicts.
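
A formal test can back up the visual impression from the Q-Q plots; this is a sketch using shapiro.test() on the training-period residuals (with only 12 validation errors, a test on those alone has little power):

#Sketch: Shapiro-Wilk normality test on the non-NA training residuals
#a small p-value would be consistent with the departure from normality seen in the Q-Q plots
shapiro.test(as.numeric(Snaive.pred$residuals[13:72]))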

  f. The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for year 2002?

The analyst must recombine the training and validation periods into a single series, refit her chosen model to that full series, and then use the refit model to generate forecasts of souvenir sales for the 12 months of 2002.
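
A minimal sketch of that final step, using the seasonal naive model as a stand-in for whichever model she actually selected:

#Sketch: refit the chosen model to the full series (training + validation) and forecast 2002
finalForecast <- snaive(souvenir.ts, h = 12)
finalForecast$mean     #point forecasts for Jan-Dec 2002
plot(finalForecast, ylab = "Sales", xlab = "Time", main = "Souvenir Sales Forecast for 2002")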

2. Forecasting Shampoo Sales:

The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a three year period.

If the goal is forecasting sales in future months, which of the following steps should be taken? (choose one or more)

*partition the data into training and validation periods

YES. This should be done. By partitioning the data we can develop models on the training period and then test them for accuracy and fit on the validation period. Once we have done this, we can feel more confident about the quality of the forecasts of future sales.

*examine time plots of the series and of model forecasts only for the training period

NO. We would want to examine time plots for the validation period as well; the training period alone does not show how the model's forecasts behave on data it has not seen.

*look at MAPE and RMSE values for the training period

NO. We are interested in these values for the validation period (see below); training-period errors tend to understate how far off the forecasts will be on new data.

*look at MAPE and RMSE values for the validation period

Yes. We want to look at the MAPE and RMSE for the validation period. The MAPE tells us, as a percentage, how much the forecasts deviate from the actual values, and is scale-independent; the RMSE measures accuracy in the same units as the data, so it is scale-dependent. Looking at both of these for the validation period gives us a good idea of how well the model, fit on the training period, forecasts data it has not seen.

*compute naive forecasts

YES. This is helpful, if for nothing else than as a sanity check. The most recent data points are often the most relevant, which is what a naive forecast is based on, and naive forecasts give us a baseline against which to evaluate our model. I would compare our model to the naive forecasts to make sure it is at least in the same ballpark, as sketched below.
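
Putting those choices together, the workflow for the shampoo data would mirror the souvenir analysis. The sketch below assumes the file has been exported to ShampooSales.csv with a Sales column, that the series is monthly over three years, and that a 12-month validation period is used; the start date is arbitrary here.

#Sketch: partition the shampoo series, compute naive forecasts, and check validation accuracy
shampoo <- read.csv("ShampooSales.csv", stringsAsFactors = FALSE)
shampoo.ts <- ts(shampoo$Sales, start = c(1, 1), frequency = 12)   #36 monthly observations assumed

shampooValidLength <- 12
shampooTrainLength <- length(shampoo.ts) - shampooValidLength
shampooTrain <- window(shampoo.ts, end = c(1, shampooTrainLength))
shampooValid <- window(shampoo.ts, start = c(1, shampooTrainLength + 1))

#naive benchmark, then RMSE and MAPE on the validation period
shampooNaive <- naive(shampooTrain, h = shampooValidLength)
accuracy(shampooNaive, shampooValid)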