Ch.3: Q.1,2

Q.1 \(\textit{Souvenir Sales}\)

# Note: Requires the csv file "SouvenirSales.csv"

# Read in the data
souvenirSales <- read.csv("SouvenirSales.csv", stringsAsFactors=FALSE)

# Create time series
souvSales.ts <- ts(souvenirSales$Sales, start=c(1995,1), frequency=12)

We partition the data with the first 6 years (1995 through 2000) serving as a training period and the year of 2001 serving as a validation period. This partitioned data is visualized below.

library(forecast)
# Validation period length is 1 year
validLength <- 12

# Training period is the first 6 years (72 months)
trainLength <- length(souvSales.ts) - validLength

# Partition the data into training and validation periods
souvSalesTrain.ts <- window(souvSales.ts, start=c(1995,1), end=c(1995, trainLength))
souvSalesValid.ts <- window(souvSales.ts, start=c(1995,trainLength+1), end=c(1995,trainLength+validLength))

# Vizualize the partitioned data
plot(souvSales.ts,ylim=c(1664,104661),ylab="Sales", xlab="Year", bty="l", main="Partitioned Data")
arrows(x0=1995, y0=70000, x1 = 2000.75, y1 = 70000, length = 0.07, angle = 40,
       code = 3, col = par("fg"), lty = par("lty"),
       lwd = par("lwd"))
arrows(x0=2001.1, y0=70000, x1 = 2001.9, y1 = 70000, length = 0.07, angle = 40,
       code = 3, col = par("fg"), lty = par("lty"),
       lwd = par("lwd"))
arrows(x0=2001, y0=0, x1 = 2001, y1 = 104661, length = 0.07, angle = 40,
       code = 0, col = par("fg"), lty = par("lty"),
       lwd = par("lwd"))
arrows(x0=2002, y0=0, x1 = 2002, y1 = 104661, length = 0.07, angle = 40,
       code = 0, col = par("fg"), lty = par("lty"),
       lwd = par("lwd"))
text(x=c(1998),y=c(75000), "Training",cex=0.75)
text(x=c(2001.5),y=c(75000), "Validation",cex=0.75)

  1. The data was partitioned to reduce the possibility of overfitting, where in addition to the systematic component noise is also fitted, and as a consequence the predictive performance on new data will most likely be poor. Thus a portion of the data is used for model generation (training set) and a latter portion is used for model evaluation (validation set).

  2. A validation period should be the same as the forecast horizon so that we can evaluate the model’s predictive performance. Since the store’s goal is to forecast 12 months into the future, the validation period should be at least 12 months. A validation period shoter than 12 months will not allow us to forecast for the full year of 2002 and a validation period longer than 12 months uses less recent information which may be the most valuable information for forecasting purposes.

  3. Before any naive forecasts are generated, let’s investigate any possible seasonality. Below is a plot of the data broken down by month.

# set the outer margin area to the right a bit bigger
par(oma = c(0, 0, 0, 2))

# We have 7 years of data
xrange <- c(1995,2001)
# Get the range of the ridership to set up a nicely formatted plot
yrange <- range(souvSales.ts)

# set up the plot 
plot(xrange, yrange, type="n", xlab="Year", ylab="Sales", bty="l", las=1)

# Give each of the months its own color, line type, and character
colors <- rainbow(12) 
linetype <- c(1:12) 
plotchar <- c(1:12)

axis(1, at=seq(1995,2001,1), labels=format(seq(1995,2001,1)))

# add lines 
for (i in 1:12) { 
  currentMonth <- subset(souvSales.ts, cycle(souvSales.ts)==i)
  lines(seq(1995, 1995+length(currentMonth)-1,1), currentMonth, type="b", lwd=1,
      lty=linetype[i], col=colors[i], pch=plotchar[i]) 
} 

# add a title
title("Sales by Month")

# add a legend 
legend(2001.25, 100000, c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)

There does appear to be seasonality present in the data as indicated by the sales peaks during November and December every year. Thus a seasonal naive forecast model is appropriate. Below is a table of the seasonal naive forecasts for each month for the year 2001.

# Use the seasonal naive forecast
snaiveForValid.pred <- snaive(souvSalesTrain.ts, h=validLength)

# Point forecasts from the naive model
snaiveForValid.pred.means <- snaiveForValid.pred$mean

\[\textbf{Seasonal Naive Forecasts for 2001} \\ \begin{array}{l l l l} \text{Month} & \text{Forecast} & \text{Month} & \text{Forecast} \\ \hline \text{Jan} & 7615.03 & \text{Jul} & 16732.78 \\ \text{Feb} & 9849.69 & \text{Aug} & 19888.61 \\ \text{Mar} & 14558.40 & \text{Sep} & 23933.38 \\ \text{Apr} & 11587.33 & \text{Oct} & 25391.35 \\ \text{May} & 9332.56 & \text{Nov} & 36024.80 \\ \text{Jun} & 13082.09 & \text{Dec} & 80721.71 \\ \hline \end{array} \]

Next we plot the seasonal naive forecasts in the validation period.

snaive.forecastForValid <- forecast(snaiveForValid.pred, h=validLength)
plot(snaive.forecastForValid,ylim=c(1664,104661),ylab="Sales", xlab="Year", bty="l", main="Seasonal Naive Forecasts")
arrows(x0=1995, y0=70000, x1 = 2000.75, y1 = 70000, length = 0.07, angle = 40,
       code = 3, col = par("fg"), lty = par("lty"),
       lwd = par("lwd"))
arrows(x0=2001.1, y0=70000, x1 = 2001.9, y1 = 70000, length = 0.07, angle = 40,
       code = 3, col = par("fg"), lty = par("lty"),
       lwd = par("lwd"))
arrows(x0=2001, y0=0, x1 = 2001, y1 = 104661, length = 0.07, angle = 40,
       code = 0, col = par("fg"), lty = par("lty"),
       lwd = par("lwd"))
arrows(x0=2002, y0=0, x1 = 2002, y1 = 104661, length = 0.07, angle = 40,
       code = 0, col = par("fg"), lty = par("lty"),
       lwd = par("lwd"))
text(x=c(1998),y=c(75000), "Training",cex=0.75)
text(x=c(2001.5),y=c(75000), "Validation",cex=0.75)

  1. From the output of the “accuracy()” function, the \(\text{RMSE} \approx 9542\) and \(\text{MAPE} = 27.28\%\).
accuracy.measures <- accuracy(snaiveForValid.pred, souvSalesValid.ts)
  1. A plot of the histogram is shown below.
# Residuals for validation period
residualsForValid <- souvSalesValid.ts - snaiveForValid.pred$mean

# Plot of histogram
myhist <- hist(residualsForValid, ylab="Frequency", xlab="Forecast Error", main="", bty="l")
multiplier <- myhist$counts / myhist$density
mydensity <- density(residualsForValid, na.rm=TRUE)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity)

Although there does seem to be an approximate bell shaped curve, it is difficult to determine normality for the residuals at a first glance. Exluding the apparent outlier to the far right may improve the bell shape of the histogram. Note that we are always under-predicting (positive forecast errors) throughout the validation period. Further analysis may be done by examining the following normality plot.

qqnorm(residualsForValid, bty="l")
qqline(residualsForValid, bty="l")

This plot certainly does not show a 45-degree line, but let’s remove the apparent outlier to see if we get better results.

res.valid.no.outlier <- c(2628.21,1417.19,7268.44,5770.00,6665.23,5519.44,9422.37,8697.91,6572.03,5429.98,10609.58)
qqnorm(res.valid.no.outlier, bty="l")
qqline(res.valid.no.outlier, bty="l")

Although still not showing an exact 45-degree line, we do get a better angle with the removal of the outlier. However, with such a small sample, I would not conclude that the error terms are normally distributed and would gather more data and/or perform more normality tests. Now let us examine a line plot for the naive forecasts and the actual sales numbers for the validation period.

plot(souvSalesValid.ts, bty="l",xaxt="n", xlab="The Year 2001", yaxt="n", ylab="Sales",main="Line Plots of Data and Forecasts")
axis(1, at = time(souvSalesValid.ts), labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, las=2)
lines(snaiveForValid.pred.means, col=2, lty=2)
legend(2001,100000, c("Actual","Forecast"), col=1:2, lty=1:2)

As observed from the above plot, the naive forecasts are always under-predicting the data.

  1. If the analyst found a forecasting model that gives satisfactory performance on the validation set and then wishes to generate forecasts for the year 2002, then the chosen model should be re-run on the data for the combined training and validation periods. After doing this, the model can then forecast values for the year 2002.

\(\text{ }\)

Q.2 \(\textit{Forecasting Shampoo Sales}\)

If the goal is forecasting sales in future months, which of the following steps should be taken? (choose one or more)

\(\bullet\) partition the data into training and validation periods

-As explained previously, it is very important to partition the data to avoid overfitting.

\(\bullet\) examine time plots of the series and of the model forecasts only for the training period

-The word “only” here makes me cross off this step. While I would look at time plots of the series and of the model for the training period, my focus would be on the validation period since it is necessary to base measures of predictive performance, residual distribution, and prediction intervals on the validation period.

\(\bullet\) look at MAPE and RSME values for the training period

-Examining these accuracy measures for the training period does not help us with forecasting and only tells us information (goodness of fit) for the particular training period. To measure predictive performance, I would look at these accuracy measures for the validation period.

\(\bullet\) look at MAPE and RSME values for the validation period

-The MAPE measure would tell us percentage-wise how much on average the forecasts deviate from the actual data. This measure is very useful since it is scale-independent. The RSME is another useful deviation measure that has the same units as the data. It is very important these measures are obtained using the validation period as explained above.

\(\bullet\) compute naive forecasts

-I would compute naive forecasts because there is a chance of obtaining sufficient accuracy using a very simple method. Additionally, these naive forecasts could be used as a baseline for the comparison of more complex methods.