Question 1

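library(forecast)  # naive(), snaive(), accuracy()
library(knitr)     # kable()

# SV_sales is assumed to be loaded already as a data frame with a Sales column
SV_ts <- ts(SV_sales$Sales, start = c(1995,1), frequency = 12)

# Hold out the last 12 months as the validation period
ltest <- 12
ltrain <- length(SV_ts) - ltest

train_ts <- window(SV_ts, start=c(1995,1), end=c(1995, ltrain))
test_ts <- window(SV_ts, start=c(1995,ltrain+1), end=c(1995,ltrain +ltest))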

plot(SV_ts,ylab="Sales", xlab="Year", bty="l", main="Partitioned Souvenir Data")
abline(v=2001)
text(x=c(1998),y=c(75000), "Training",cex=0.75)
text(x=c(2001.5),y=c(75000), "Validation",cex=0.75)

Parts

A

Hindsight is always 20/20; by using all known previous observations in your forecasting, you can create an almost miraculously accurate prediction model. However, odds are this ‘wonder-child’ won’t hold up once introduced to future observations. The reason for this is “overfitting”, where the model fits the noise in the past data as well as its underlying pattern and is never tested against observations it has not seen.
By partitioning the data, the analyst gets two key sets, training and validation: the model is fitted on the first and its forecasts are tested against the second.
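
To make the overfitting point concrete, here is a minimal base R sketch. It does not use the souvenir data: the series, the seed, and both models are hypothetical, chosen only to show how a flexible model can look better on the training period than a simple one while being judged separately on a held-out period.

```r
# Hypothetical toy series: linear trend plus noise (not the souvenir data).
set.seed(1)
t_all <- 1:48
y_all <- 100 + 2 * t_all + rnorm(48, sd = 10)

# Hold out the last 12 points as a validation set.
train <- data.frame(t = t_all[1:36],  y = y_all[1:36])
valid <- data.frame(t = t_all[37:48], y = y_all[37:48])

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# A simple model and a deliberately over-flexible one, both fitted on train only.
fit_simple  <- lm(y ~ t, data = train)
fit_complex <- lm(y ~ poly(t, 8), data = train)

results <- data.frame(
  model      = c("linear", "degree-8 polynomial"),
  train_rmse = c(rmse(train$y, fitted(fit_simple)),
                 rmse(train$y, fitted(fit_complex))),
  valid_rmse = c(rmse(valid$y, predict(fit_simple, valid)),
                 rmse(valid$y, predict(fit_complex, valid)))
)
print(results)
```

The flexible model can never have a higher training RMSE than the linear one, because the linear fit is a special case of it. The validation column is the one that shows whether that extra flexibility carries over to unseen periods.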

B

In this situation the company hiring the analyst wants a forecast for the upcoming 12-month period. The general rule of thumb is that the validation period should be representative of the forecasting horizon. Therefore, a 12-month validation set is the minimum that should be used, with consideration also given to the data frequency and the forecasting goal.

C

SV_naive <- naive(train_ts, h = 12)
SV_naive
##          Point Forecast    Lo 80     Hi 80     Lo 95    Hi 95
## Jan 2001       80721.71 67315.75  94127.67 60219.059 101224.4
## Feb 2001       80721.71 61762.82  99680.60 51726.583 109716.8
## Mar 2001       80721.71 57501.90 103941.52 45210.077 116233.3
## Apr 2001       80721.71 53909.78 107533.64 39716.409 121727.0
## May 2001       80721.71 50745.07 110698.35 34876.389 126567.0
## Jun 2001       80721.71 47883.94 113559.48 30500.677 130942.7
## Jul 2001       80721.71 45252.87 116190.55 26476.795 134966.6
## Aug 2001       80721.71 42803.92 118639.50 22731.457 138712.0
## Sep 2001       80721.71 40503.82 120939.60 19213.758 142229.7
## Oct 2001       80721.71 38328.33 123115.09 15886.636 145556.8
## Nov 2001       80721.71 36259.16 125184.26 12722.110 148721.3
## Dec 2001       80721.71 34282.09 127161.33  9698.445 151745.0
plot(SV_naive, bty = "l")
abline(v = 2001)
text(x=c(2001.5),y=c(75000), "Validation",cex=0.75)
text(x=c(1998),y=c(75000), "Training",cex=0.75)

plot1 <- recordPlot()

SV_snaive <- snaive(train_ts, h= 12)
SV_snaive
##          Point Forecast      Lo 80    Hi 80      Lo 95    Hi 95
## Jan 2001        7615.03  -673.8117 15903.87 -5061.6594 20291.72
## Feb 2001        9849.69  1560.8483 18138.53 -2826.9994 22526.38
## Mar 2001       14558.40  6269.5583 22847.24  1881.7106 27235.09
## Apr 2001       11587.33  3298.4883 19876.17 -1089.3594 24264.02
## May 2001        9332.56  1043.7183 17621.40 -3344.1294 22009.25
## Jun 2001       13082.09  4793.2483 21370.93   405.4006 25758.78
## Jul 2001       16732.78  8443.9383 25021.62  4056.0906 29409.47
## Aug 2001       19888.61 11599.7683 28177.45  7211.9206 32565.30
## Sep 2001       23933.38 15644.5383 32222.22 11256.6906 36610.07
## Oct 2001       25391.35 17102.5083 33680.19 12714.6606 38068.04
## Nov 2001       36024.80 27735.9583 44313.64 23348.1106 48701.49
## Dec 2001       80721.71 72432.8683 89010.55 68045.0206 93398.40
plot(SV_snaive, bty = "l")
abline(v = 2001)
text(x=c(2001.5),y=c(75000), "Validation",cex=0.75)
text(x=c(1998),y=c(75000), "Training",cex=0.75)

plot2 <- recordPlot()
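
The point forecasts above follow directly from how the two methods work: every naive forecast equals the last training value (80721.71), and the seasonal naive forecasts repeat the last observed year, which is why its Dec 2001 value is the same 80721.71. Here is a minimal base R sketch of that logic, using a hypothetical toy series rather than the souvenir data; it illustrates the idea and is not the forecast package's implementation.

```r
# Hypothetical toy monthly series: two years of made-up values.
y <- ts(1:24, start = c(1999, 1), frequency = 12)
h <- 12              # forecast horizon
m <- frequency(y)    # season length (12 for monthly data)

# Naive: every future period takes the last observed value.
naive_fc <- rep(as.numeric(tail(y, 1)), h)

# Seasonal naive: each future period takes the value from the same
# season in the last observed cycle, recycled if h exceeds m.
last_cycle <- as.numeric(tail(y, m))
snaive_fc  <- rep(last_cycle, length.out = h)

# Attach time-series attributes so the forecasts start right after y ends.
next_start <- tsp(y)[2] + 1 / m
naive_fc   <- ts(naive_fc,  start = next_start, frequency = m)
snaive_fc  <- ts(snaive_fc, start = next_start, frequency = m)

print(naive_fc)
print(snaive_fc)
```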

D

Below are several accuracy measures that can be computed for any forecasting model. We are particularly interested in the RMSE and MAPE in this exercise. Note that the rows shown are for the training set.

kable(accuracy(SV_naive))
                   ME      RMSE       MAE        MPE      MAPE     MASE        ACF1
Training set 1113.477  10460.73  5506.879  -25.27554  61.16191  1.47054  -0.1968879
kable(accuracy(SV_snaive))
                   ME      RMSE       MAE        MPE      MAPE     MASE        ACF1
Training set  3401.36  6467.818    3744.8    22.3927  25.64127        1   0.4140974
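
For reference, here is a small base R sketch of how the two measures of interest are defined, with errors taken as actual minus forecast. The numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical actual values and forecasts (not the souvenir data).
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

errors <- actual - predicted               # -10, 20, 0

# RMSE: square root of the mean squared error, in the units of the data.
rmse <- sqrt(mean(errors^2))

# MAPE: mean absolute error as a percentage of the actual values.
mape <- mean(abs(errors / actual)) * 100

print(c(RMSE = rmse, MAPE = mape))

# With the forecast package loaded, passing the held-out series as the
# second argument, e.g. accuracy(SV_naive, test_ts), also reports a
# test-set row alongside the training-set row.
```

Here the RMSE is about 12.9 and the MAPE is about 6.7%.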

E

These histograms represent the distribution of forecasting errors generated from both the seasonal and basic naive models. We would hope to see a mean of zero in these distributions. We have also generated plots showing the naive forecasts compared to the actual (validation) observations that occurred.

f_naive_errors <- test_ts - SV_naive$mean
f_snaive_errors <- test_ts - SV_snaive$mean
hist(f_naive_errors, xlab = "Forecast Errors", main = "Distribution of Naive Forecasting Errors")

hist(f_snaive_errors, xlab = "Forecast Errors", main = "Distribution of Seasonal Naive Forecasting Errors")

plot1
lines(test_ts, col = "red")

plot2
lines(test_ts, col = "red")
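
The hope for a zero mean can also be checked numerically rather than by eye. The following base R sketch uses hypothetical actuals and forecasts; with the real data the same lines could be applied to f_naive_errors and f_snaive_errors.

```r
# Hypothetical validation values and forecasts (not the souvenir data).
actual    <- c(120, 135, 150, 170)
predicted <- c(110, 130, 140, 150)

errors <- actual - predicted        # 10, 5, 10, 20

# Mean error: a value far from zero suggests systematic bias.
# Positive means the forecasts run too low, negative too high.
me <- mean(errors)

# Share of periods in which the forecast was below the actual value.
share_under <- mean(errors > 0)

# The counts behind a histogram, without drawing it.
bins <- hist(errors, plot = FALSE)

print(c(ME = me, share_under = share_under))
print(bins$counts)
```

In this made-up case every error is positive, so the forecasts are consistently too low.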

F

Time to bring it all home! Before forecasting future periods, the two sets must be brought back together. Once the training and validation sets have been combined, the chosen model is re-estimated on the full series and then used to generate the forecast. This provides the best of both worlds: the initial split guards against overfitting when choosing a model, and the final fit uses a larger sample that includes the most recent observations.
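
Here is a minimal base R sketch of the recombination step, again on hypothetical toy series. With the actual data and the forecast package loaded, the equivalent would presumably be a call such as snaive(SV_ts, h = 12) on the full series.

```r
# Hypothetical training and validation series (made-up values).
train <- ts(1:24,  start = c(1999, 1), frequency = 12)
valid <- ts(25:36, start = c(2001, 1), frequency = 12)

# Recombine: c() drops the time attributes, so rebuild them from train.
full <- ts(c(train, valid), start = start(train), frequency = frequency(train))

# Sanity check: the combined series should end where validation ends.
stopifnot(isTRUE(all.equal(end(full), end(valid))))

# Final seasonal naive forecast for the next year, based on the full series.
m <- frequency(full)
final_fc <- ts(as.numeric(tail(full, m)),
               start = tsp(full)[2] + 1 / m, frequency = m)
print(final_fc)
```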

Question 2

To be successful in generating a model for forecasting shampoo sales the following steps should be taken.

1. Partition the data into training and validation periods

This is always good practice for validating any proposed forecasting method. The exceptions to this rule are when data is limited or infrequent; in these cases partitioning may actually degrade any attempt to forecast (one must ask the hard question first of “is this data worth modeling in its current state” in these situations).

2. Look at MAPE and RMSE values for the validation period

This is where our forecasting model earns its stripes. These values are important in determining whether or not “overfitting” or any other bias is present in the tested model. The MAPE (scale-free, so well suited to comparing series on different scales) and the RMSE are the de facto benchmarks against which a variety of forecasting models should be tested to determine overall accuracy and reliability.

3. Compute Naive Forecasts

The wonderful thing about naive forecasts is that they are incredibly easy to generate and can often satisfy the needed accuracy levels. It’s important to compute these forecasts (standard naive/seasonal naive) at the beginning of any forecasting analysis to hopefully avoid any unnecessary work. They also serve as a baseline against which any other method can be compared. This would be especially important when presenting time series data to an audience not literate in the definitions used in predictive analytics. Rather than droning on about a decrease in the RMSE value, you can describe your model’s efficiency (in percentage terms) against the practice of basing each forecast on the observation that came before.
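
As a sketch of that percentage framing, the snippet below compares the two RMSE values reported in Question 1, part D. Those are training-set figures, so this only illustrates the arithmetic; validation-period values would be the better basis for a real comparison.

```r
# Training-set RMSE values taken from the accuracy tables in Question 1, part D.
rmse_naive  <- 10460.73   # baseline: naive forecast
rmse_snaive <- 6467.818   # candidate: seasonal naive forecast

# Percentage reduction in RMSE relative to the naive baseline.
improvement_pct <- (1 - rmse_snaive / rmse_naive) * 100
print(round(improvement_pct, 1))
```

In words: on the training period, the seasonal naive method cuts the naive method's RMSE by roughly 38%.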