Question 1

First, we partition the data, as shown below:

#Read the csv
sales <- read.csv("sales.csv", header = TRUE)
str(sales)
## 'data.frame':    84 obs. of  2 variables:
##  $ Date : Factor w/ 84 levels "Apr-00","Apr-01",..: 31 24 52 3 59 45 38 10 80 73 ...
##  $ Sales: num  1665 2398 2841 3547 3753 ...
#Create Time Series Object
sales.ts <- ts(sales$Sales, start = c(1995,1), frequency = 12)
#Take a quick, unformatted first plot to gain a basic understanding of the data.
#Look for seasonality, trend, level, and noise.
plot(sales.ts, bty = "l")

# Set the length of the validation period to 12 months (one year)
validLength <- 12
# Set the length of the training period to everything else
trainLength <- length(sales.ts) - validLength
# Partition the data into training and validation periods
# (window() rolls a month index past 12 into later years, so c(1995, 72) is Dec 2000)
SalesTrain <- window(sales.ts, start=c(1995,1), end=c(1995, trainLength))
SalesValid <- window(sales.ts, start=c(1995,trainLength+1), end=c(1995,trainLength+validLength))
plot(SalesTrain, bty = "l")

plot(SalesValid, bty = "l")

When reviewing the unformatted plot of the complete time series, the seasonality of the data becomes evident. In a moment, we’ll break the sales out by month to validate that seasonality. Above, we also created rough plots of the training and validation partitions to confirm the window() calls worked as expected.

Seasonality validation:

# widen the outer margin on the right to make room for the legend
par(oma = c(0, 0, 0, 4))
# We have 7 years of data
xrange <- c(1995,2001)
# Get the range of the sales values to set up a nicely formatted plot
yrange <- range(sales.ts)
# set up the plot 
plot(xrange, yrange, type="n", xlab="Year", ylab="Souvenir Sales by Month", bty="l", las=1, yaxt = "n")
# Give each of the months its own color, line type, and character
colors <- rainbow(12) 
linetype <- c(1:12) 
plotchar <- c(1:12)
axis(1, at=seq(1995,2001), labels=format(seq(1995,2001)))
axis(2, at=seq(0,105000,17500), labels=format(seq(0,105000,17500)), cex.axis=0.5, las=2)
# add lines 
for (i in 1:12) { 
  currentmonth <- subset(sales.ts, cycle(sales.ts)==i)
  lines(seq(1995, 1995+length(currentmonth)-1,1), currentmonth, type="b", lwd=1,
      lty=linetype[i], col=colors[i], pch=plotchar[i]) 
} 
# add a title
title("Souvenir Sales Broken Out by Month")
# add a legend 
legend(2002, 105000, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)

In the plot above, we see minimal crossing of the monthly lines; this confirms the high degree of seasonality we suspected when reviewing the raw plot of the time series.
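As an additional check, a seasonal boxplot gives a more compact view of the same pattern. A minimal sketch, using the sales.ts object created above:

# One boxplot per calendar month; well-separated boxes indicate strong seasonality
boxplot(sales.ts ~ cycle(sales.ts), xlab = "Month", ylab = "Sales")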

Question 1A

The analyst partitioned the data to create training and validation (“holdout”) subsets of the time series. We use the training partition to fit forecasting models, and the validation partition to measure the predictive power of those models on data they have not seen.

Question 1B

The analyst chose a 12-month validation period for a few reasons. Ideally, the selected validation period will match the length of the forecast horizon we are ultimately predicting. In this example, the analyst was asked to forecast the sales for the 12 months of 2002.

The seasonality of the data also supports the selection of a 12-month validation period, as the seasonal pattern repeats every 12 months.

Finally, a 12-month validation period makes sense in this case because the series is long enough that the remaining 72 months still provide adequate data to train a model.
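As a quick sanity check (a minimal sketch, assuming the partition objects created above), we can confirm that the split leaves six full seasonal cycles for training:

# 72 training months = 6 full seasonal cycles; 12 validation months = the forecast horizon
length(SalesTrain) / 12
length(SalesValid)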

Question 1C

To determine the naive forecast, we use the naive() function from the forecast package:

# load the forecast package, which provides naive() and snaive()
library(forecast)
# develop naive forecast
naive.sales <- naive(SalesTrain, h = validLength)
# brief investigation of what we created - confirm we have naive forecasts
plot(naive.sales)

summary(naive.sales)
## 
## Forecast method: Naive method
## 
## Model Information:
## $drift
## [1] 0
## 
## $drift.se
## [1] 0
## 
## $sd
## [1] 10475.33
## 
## $call
## naive(y = SalesTrain, h = validLength)
## 
## 
## Error measures:
##                    ME     RMSE      MAE       MPE     MAPE    MASE
## Training set 1113.477 10460.73 5506.879 -25.27554 61.16191 1.47054
##                    ACF1
## Training set -0.1968879
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Feb 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Mar 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Apr 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## May 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Jun 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Jul 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Aug 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Sep 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Oct 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Nov 2001       80721.71 67315.75 94127.67 60219.06 101224.4
## Dec 2001       80721.71 67315.75 94127.67 60219.06 101224.4

From reviewing the summary of our naive forecast, we find a point forecast of 80,721.71 for each month of the validation period (2001); this is simply the last observed training value (December 2000) carried forward. The total forecast for the year is therefore 80,721.71 × 12 ≈ 968,661. In reality, this naive forecast is probably too high, as it carries the seasonal December peak forward into every month of the coming year.
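We can reproduce that arithmetic directly; a small sketch using the naive.sales object created above:

# Sum the twelve identical point forecasts for the validation year
sum(naive.sales$mean)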

As we discovered earlier, this time series exhibits strong seasonal characteristics. Let’s also fit the seasonal naive forecast:

# develop seasonal naive forecast
snaive.sales <- snaive(SalesTrain, h = validLength)
# brief investigation of what we created - confirm we have seasonal naive forecasts
plot(snaive.sales)

summary(snaive.sales)
## 
## Forecast method: Seasonal naive method
## 
## Model Information:
## $drift
## [1] 0
## 
## $drift.se
## [1] 0
## 
## $sd
## [1] 5547.644
## 
## $call
## snaive(y = SalesTrain, h = validLength)
## 
## 
## Error measures:
##                    ME     RMSE      MAE     MPE     MAPE MASE      ACF1
## Training set 3401.361 6467.818 3744.801 22.3927 25.64127    1 0.4140974
## 
## Forecasts:
##          Point Forecast      Lo 80    Hi 80      Lo 95    Hi 95
## Jan 2001        7615.03  -673.8117 15903.87 -5061.6594 20291.72
## Feb 2001        9849.69  1560.8483 18138.53 -2826.9994 22526.38
## Mar 2001       14558.40  6269.5583 22847.24  1881.7106 27235.09
## Apr 2001       11587.33  3298.4883 19876.17 -1089.3594 24264.02
## May 2001        9332.56  1043.7183 17621.40 -3344.1294 22009.25
## Jun 2001       13082.09  4793.2483 21370.93   405.4006 25758.78
## Jul 2001       16732.78  8443.9383 25021.62  4056.0906 29409.47
## Aug 2001       19888.61 11599.7683 28177.45  7211.9206 32565.30
## Sep 2001       23933.38 15644.5383 32222.22 11256.6906 36610.07
## Oct 2001       25391.35 17102.5083 33680.19 12714.6606 38068.04
## Nov 2001       36024.80 27735.9583 44313.64 23348.1106 48701.49
## Dec 2001       80721.71 72432.8683 89010.55 68045.0206 93398.40

For this time series, it seems the seasonal naive forecast will yield better results than the naive forecast, because it respects the seasonality of the time series.

Question 1D

To compute the RMSE and MAPE for the naive forecast, we call the accuracy() function:

accuracy(naive.sales, SalesValid)
##                      ME     RMSE       MAE        MPE      MAPE     MASE
## Training set   1113.477 10460.73  5506.879  -25.27554  61.16191  1.47054
## Test set     -50500.288 56099.07 54490.114 -287.13834 290.95050 14.55087
##                    ACF1 Theil's U
## Training set -0.1968879        NA
## Test set      0.3182456  6.649124
# For interest, let's call accuracy of the snaive forecast, since we bothered to create it
accuracy(snaive.sales, SalesValid)
##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set     7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
##                   ACF1 Theil's U
## Training set 0.4140974        NA
## Test set     0.2264895 0.7373759

In this example, we find the RMSE for the naive forecast on the test (validation) set is 56,099, and the MAPE is roughly 291%. These metrics indicate that the naive forecast is a poor fit for this time series.

Using the seasonal naive method, we achieve an RMSE of 9,542 and a MAPE of about 27%. The much lower validation errors suggest the seasonal naive forecast has considerably higher predictive power than the naive method in this example.
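For a compact side-by-side view, here is a sketch (assuming the two forecast objects created above) that extracts just the test-set RMSE and MAPE from each accuracy() call:

# Pull the test-set RMSE and MAPE for both methods into one small matrix
rbind(naive  = accuracy(naive.sales, SalesValid)["Test set", c("RMSE", "MAPE")],
      snaive = accuracy(snaive.sales, SalesValid)["Test set", c("RMSE", "MAPE")])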

Question 1E

Create a histogram of the forecast errors of the validation period, based on the naive forecast:

# We begin by plotting the naive forecast with the validation period actuals:
plot(naive.sales, yaxt="n")
axis(2, at=seq(0,105000,17500), labels=format(seq(0,105000,17500)), cex.axis=1, las=2)
lines(SalesValid, col=2)
legend(1995,87500, c("Forecast","Actual"), col=c("blue","red"), lty = 1)

# compute validation-period forecast errors (actual - forecast)
naive.valid.residuals <- SalesValid - naive.sales$mean
valid.hist <- hist(naive.valid.residuals, ylab = "Frequency", xlab = "Forecast Error", bty = "l", main = "")

#Again for interest, let's compare the snaive forecast to the actuals of the validation period:
plot(snaive.sales, yaxt="n")
axis(2, at=seq(0,105000,17500), labels=format(seq(0,105000,17500)), cex.axis=1, las=2)
lines(SalesValid, col=2)
legend(1995,87500, c("Forecast","Actual"), col=c("blue","red"), lty = 1)

From both the plot of the naive forecast versus the actuals and the histogram, we can clearly see that the naive forecast’s errors are not normally distributed. In 11 of 12 months, the forecast error (actual minus forecast) is negative, meaning the forecast is too high; in only one month do the actuals exceed the naive forecast. In other words, the naive forecast systematically over-forecasts the validation period.
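We can quantify that bias numerically; a brief sketch using the naive.valid.residuals object from above:

# Mean error (actual - forecast); a large negative value means the forecast is too high
mean(naive.valid.residuals)
# Count of months in which the forecast exceeded the actuals
sum(naive.valid.residuals < 0)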

As expected, we can also see from the plot of the seasonal naive forecast that it provides a much stronger prediction in this scenario.

Question 1F

To forecast 2002 sales, the analyst should apply the selected model to the entire original time series (training + validation) and generate the forecast from all known values. This allows the most recent data available (from 2001) to inform the model, and should yield the best possible forecast for 2002.
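A minimal sketch of that final step, assuming we select the seasonal naive model: refit on the full series and forecast the 12 months of 2002.

# Refit on the entire series (training + validation) and forecast 2002
final.snaive <- snaive(sales.ts, h = 12)
final.snaive$mean   # point forecasts for Jan-Dec 2002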

Question 2

Given the goal of forecasting future (unknown) months, the analyst should engage in the following steps:

  1. Partition the data into training and validation periods
  2. Examine time plots of the series and of the model forecasts for both the training and validation periods. (The bullet in the text refers only to the training period, which would not be my recommendation.)
  3. Compute the naive forecasts
  4. Review MAPE and RMSE for the training period to understand strength of fit.
  5. Review MAPE and RMSE for the validation period to understand predictive strength of the model.
  6. [Not listed in the text] Compare error summaries from the training and validation periods to ensure any non-naive model does not overfit the training data (see the sketch following this list).
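A brief sketch of step 6, using the objects created earlier: a validation MAPE far above the training MAPE would suggest the model is overfitting the training period.

# Compare the "Training set" row to the "Test set" row
err <- accuracy(snaive.sales, SalesValid)
err[, c("RMSE", "MAPE")]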