Question 1

## 'data.frame':    84 obs. of  2 variables:
##  $ Date : chr  "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
##  $ Sales: num  1665 2398 2841 3547 3753 ...

Plot souvenir sales:

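library(forecast)  # provides autoplot() for ts objects, naive() and accuracy()

SouvenirSales <- ts(souvenir$Sales, start=c(1995,1), frequency=12)
# bty is a base-graphics parameter and has no effect in autoplot(), so it is dropped
autoplot(SouvenirSales)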

Partition the dataset:

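#set length of validation period to 12 months
validperiod <- 12

#training period: everything before the validation period (84 - 12 = 72 months)
trainperiod <- length(SouvenirSales) - validperiod

#data partitioning
#the second element of start/end counts months from January 1995,
#so c(1995, 72) is December 2000 and c(1995, 73) is January 2001
souvtrain <- window(SouvenirSales, start=c(1995,1), end=c(1995, trainperiod))

souvvalid <- window(SouvenirSales, start=c(1995, trainperiod+1), end=c(1995, trainperiod + validperiod))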
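
The month-offset arithmetic in the window() calls can be checked on a stand-in series. This is a minimal sketch in base R, assuming a synthetic 84-month series; the name `demo` and its values are made up and are not the souvenir data.

```r
# Synthetic stand-in for the 84-month sales series (values are made up)
demo <- ts(seq_len(84), start = c(1995, 1), frequency = 12)

validperiod <- 12
trainperiod <- length(demo) - validperiod   # 72

# The partition written with explicit calendar dates
demo_train <- window(demo, end = c(2000, 12))
demo_valid <- window(demo, start = c(2001, 1))

# The month-offset form selects the same observations
demo_train2 <- window(demo, start = c(1995, 1), end = c(1995, trainperiod))
demo_valid2 <- window(demo, start = c(1995, trainperiod + 1),
                      end = c(1995, trainperiod + validperiod))

length(demo_train)   # 72
length(demo_valid)   # 12
start(demo_valid)    # 2001, month 1
```

Writing the dates explicitly is arguably easier to read, while the offset form adapts automatically if the validation length changes.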

a) Why was the data partitioned?

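The data was partitioned so that forecasting methods can be fitted on the training period and then evaluated on a held-out validation period. Performance on data the model has not seen is a more honest measure of forecast accuracy and helps detect overfitting to the training data.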

b) Why a 12-month validation period?

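Because you typically want the validation period to be at least as long as the forecasting period. The goal here is to forecast the next 12 months, so the last 12 months are held out; with monthly data this also covers one full seasonal cycle.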

c) Naive forecast for validation period:

#naive forecast
naive_for_valid = naive(souvtrain, h=validperiod)

naive_for_valid
##          Point Forecast    Lo 80     Hi 80     Lo 95    Hi 95
## Jan 2001       80721.71 67315.75  94127.67 60219.059 101224.4
## Feb 2001       80721.71 61762.82  99680.60 51726.583 109716.8
## Mar 2001       80721.71 57501.90 103941.52 45210.077 116233.3
## Apr 2001       80721.71 53909.78 107533.64 39716.409 121727.0
## May 2001       80721.71 50745.07 110698.35 34876.389 126567.0
## Jun 2001       80721.71 47883.94 113559.48 30500.677 130942.7
## Jul 2001       80721.71 45252.87 116190.55 26476.795 134966.6
## Aug 2001       80721.71 42803.92 118639.50 22731.457 138712.0
## Sep 2001       80721.71 40503.82 120939.60 19213.758 142229.7
## Oct 2001       80721.71 38328.33 123115.09 15886.636 145556.8
## Nov 2001       80721.71 36259.16 125184.26 12722.110 148721.3
## Dec 2001       80721.71 34282.09 127161.33  9698.445 151745.0

d) Compute RMSE and MAPE:

accuracy(naive_for_valid, souvvalid)
##                      ME     RMSE       MAE        MPE      MAPE     MASE
## Training set   1113.477 10460.73  5506.879  -25.27554  61.16191  1.47054
## Test set     -50500.288 56099.07 54490.114 -287.13834 290.95050 14.55087
##                    ACF1 Theil's U
## Training set -0.1968879        NA
## Test set      0.3182456  6.649124
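
The RMSE and MAPE columns follow directly from the forecast errors (actual minus forecast). This is a minimal sketch on made-up numbers, not the souvenir data; `actual` and `forecasted` are illustrative names.

```r
# Made-up actual values and forecasts for three periods
actual     <- c(100, 200, 400)
forecasted <- c(110, 180, 400)

err <- actual - forecasted              # forecast errors: -10, 20, 0

me   <- mean(err)                       # mean error (signed bias)
rmse <- sqrt(mean(err^2))               # root mean squared error
mape <- mean(abs(err / actual)) * 100   # mean absolute percentage error, in percent

rmse   # about 12.91
mape   # about 6.67
```

A negative ME, as in the test set row above, means the forecasts were above the actual values on average.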

e) Plot a histogram of forecast errors for validation period, as well as a time plot for naive forecasts and the actual sales numbers in the validation period. Explain the behavior of the naive forecasts.

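#forecast errors in the validation period (actual minus forecast)
#naive_for_valid$residuals holds the training-period residuals instead
valid_errors <- souvvalid - naive_for_valid$mean

#plot histogram
hist(valid_errors, breaks=20, probability=TRUE)

#add density line
lines(density(valid_errors))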

#plot actual values for the validation year (2001)
plot(souvvalid, bty="l", xaxt="n", xlab="The Year 2001", yaxt="n", ylab="Sales")

#add forecast line
lines(naive_for_valid$mean, col="red", lty=2)

#place the legend by keyword so it falls inside the plot region
legend("topleft", c("Actual","Forecast"), col=1:2, lty=1:2)

The naive forecast is a flat line at the last training value, the December 2000 sales of 80,721.71. It overestimates actual sales for almost the entire validation year because, as the first plot shows, sales drop sharply after every December. A seasonal naive forecast would have worked much better in this case.
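
The flat-line behavior can be reproduced by hand. This is a minimal sketch in base R, assuming a made-up series with a December spike; the name `toy` and its values are illustrative, not the souvenir data.

```r
# Two years of made-up monthly sales: 10 in every month except December (50)
toy <- ts(rep(c(rep(10, 11), 50), 2), start = c(1999, 1), frequency = 12)

toy_train <- window(toy, end = c(1999, 12))
toy_valid <- window(toy, start = c(2000, 1))

# Naive forecast: repeat the last training value (December 1999) for every month
last_value <- as.numeric(tail(toy_train, 1))
naive_fc <- ts(rep(last_value, length(toy_valid)),
               start = start(toy_valid), frequency = 12)

# Validation errors (actual minus forecast): negative wherever the forecast is too high
toy_valid - naive_fc
```

Because the training period ends on the seasonal peak, the forecast is too high in every month until the next December.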

f) What must the analyst do to use the forecasting model for generating forecasts for 2002?

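She needs to recombine the training and validation periods and refit the model on the complete series, so that the forecasts start in January 2002 rather than January 2001. She should also use a seasonal naive forecast instead of a naive forecast: the data shows a clear seasonal pattern, so a seasonal forecast will be more accurate.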
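
A seasonal naive forecast repeats the value from the same month one year earlier. The sketch below illustrates this in base R on made-up data and then applies it to the full series to forecast the following year. The names `toy` and `snaive_by_hand` are hypothetical; the forecast package offers snaive() for the same purpose.

```r
# Made-up monthly series with a December spike, three years long (1999-2001)
toy <- ts(rep(c(rep(10, 11), 50), 3), start = c(1999, 1), frequency = 12)

# Seasonal naive forecast by hand: repeat the last full season of the input
snaive_by_hand <- function(y, h = 12) {
  m <- frequency(y)
  last_season <- as.numeric(tail(y, m))
  ts(rep(last_season, length.out = h),
     start = tsp(y)[2] + 1 / m, frequency = m)
}

# Evaluate on a held-out final year first
toy_train <- window(toy, end = c(2000, 12))
toy_valid <- window(toy, start = c(2001, 1))
fc_valid  <- snaive_by_hand(toy_train, h = 12)
sqrt(mean((toy_valid - fc_valid)^2))   # validation RMSE

# For the actual forecast, use the full series so it starts in January 2002
fc_2002 <- snaive_by_hand(toy, h = 12)
start(fc_2002)
```

The validation RMSE is zero here only because the toy series repeats exactly; on real data the seasonal naive forecast would still have errors, just without the systematic post-December bias.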

Question 2

-partition the data into training and validation periods
-look at MAPE and RMSE values for the validation period
-compute naive forecasts