Chapter 3 Problem 1
Bring in the souvenir time series and look at the start date of the data.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.3
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.3
souvenir <- read.csv("SouvenirSales.csv", stringsAsFactors=FALSE)
head(souvenir)
## Date Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
Plot the time series.
appSouv <- ts(souvenir$Sales, start=c(1995,1), frequency=12)
autoplot(appSouv) + ylab("Sales")

We will next get the components of appSouv to look at the seasonality and trend of the data.
componentsOfAppSouv <- decompose(appSouv)
autoplot(componentsOfAppSouv)

There is definitely seasonality, with jumps happening in November and December each year. The heights of the peaks seem to increase from year to year, so the seasonality looks to be multiplicative. The trend is upward.
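Since the seasonal swings grow with the level of the series, a multiplicative decomposition could also be examined for comparison. This is only a sketch and was not part of the original run (decompose() defaults to an additive decomposition).
# Multiplicative decomposition for comparison (sketch; the run above used the additive default)
autoplot(decompose(appSouv, type="multiplicative"))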
We will then partition the data into training and validation periods.
# Set the length of the validation period to 12 months (one year)
validLength <- 12
# Set the length of the training period to everything else
trainLength <- length(appSouv) - validLength
# Partition the data into training and validation periods
sSouvTrain <- window(appSouv, start=c(1995,1), end=c(1995, trainLength))
sSouvValid <- window(appSouv, start=c(1995,trainLength+1), end=c(1995,trainLength+validLength))
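As a quick sanity check of the partition (a sketch, assuming the series runs January 1995 through December 2001, as the plots and forecasts below indicate), the two windows should contain 72 and 12 months respectively.
# Sanity check on the partition lengths (sketch)
length(sSouvTrain) # expect 72 (Jan 1995 - Dec 2000)
length(sSouvValid) # expect 12 (Jan 2001 - Dec 2001)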
(a) Why was the data partitioned?
When doing time series forecasting, we need to avoid overfitting and assess the predictive performance of the model. To do this we partition the data into training and validation periods. Because this is time series data, the split must respect time order: the earlier period is the training period and the later period is the validation period. We will not include a test period because omitting the most recent data from the performance evaluation does more harm than good.
(b) Why did the analyst choose a 12-month validation period?
Because we are forecasting 12 months into the future, the validation period should match the length of time that we will be forecasting.
(c) What is the naive forecast for the validation period?
From above we determined that this data is seasonal, so we will use the seasonal naive forecast.
# Use the seasonal naive forecast
snaiveForValid <- snaive(sSouvTrain, h=validLength)
# To see the point forecasts from the seasonal naive model
snaiveForValid$mean
## Jan Feb Mar Apr May Jun Jul
## 2001 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## Aug Sep Oct Nov Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
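Because the seasonal naive method simply repeats the most recent full season of the training period, these point forecasts should match the last 12 training observations. A quick check (a sketch, not part of the original output):
# The last season of the training period should equal snaiveForValid$mean (sketch)
tail(sSouvTrain, validLength)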
(d) Compute the RMSE and MAPE for the naive forecasts.
We will compute these using the accuracy function.
accuracy(snaiveForValid, sSouvValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set 7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
We are interested in the RMSE and MAPE for the Test set. We see that the RMSE ≈ 9542 and the MAPE ≈ 27.28%.
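As a cross-check of the accuracy() output, the validation-period RMSE and MAPE can be recomputed directly from the forecast errors. This is a sketch of the underlying formulas rather than part of the original analysis.
# Manual cross-check of the validation-period error measures (sketch)
validErrors <- as.numeric(sSouvValid - snaiveForValid$mean)
sqrt(mean(validErrors^2))                             # RMSE
mean(abs(validErrors / as.numeric(sSouvValid))) * 100 # MAPE (%)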
(e) Plot a histogram of the forecast errors from the naive forecasts. Also plot a time plot for the naive forecasts and the actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts?
First we plot a histogram of the forecast errors and overlay the density curve.
# Plot a histogram of the forecast errors on a density scale
hist(snaiveForValid$residuals, probability=TRUE,
     main="Histogram of forecast errors", xlab="Forecast error")
# Overlay the density curve
lines(density(snaiveForValid$residuals, na.rm=TRUE))

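For consistency with the ggplot2-based plots earlier, the same histogram and density overlay could also be drawn with ggplot2. This is an optional sketch; the data-frame name errDF is introduced here only for illustration.
# Optional ggplot2 version of the error histogram with a density overlay (sketch)
errDF <- data.frame(error = as.numeric(snaiveForValid$residuals))
ggplot(errDF, aes(x=error)) +
  geom_histogram(aes(y=..density..), bins=15, na.rm=TRUE) +
  geom_density(na.rm=TRUE)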
Next we plot the naive forecast and the actual sales numbers during the validation period.
# Plot the actual values from the validation period (2001)
plot(sSouvValid, bty="l", xaxt="n", xlab="The Year 2001", yaxt="n", ylab="Sales")
axis(1, at=seq(2001, 2001 + 11/12, by=1/12), labels=month.abb)
axis(2, las=2)
# Now add the forecasts and make the line red and dashed
lines(snaiveForValid$mean, col=2, lty=2)
# Add a legend
legend("topleft", c("Actual","Forecast"), col=1:2, lty=1:2)

Looking at the plot, we can see that the naive forecasts are consistently below the actual sales data, i.e., the model is underforecasting.
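The underforecasting can be quantified by the mean forecast error in the validation period, which is positive when the actuals exceed the forecasts. This is a sketch; it should reproduce the ME reported for the test set above.
# Mean forecast error in the validation period; positive => underforecasting (sketch)
mean(as.numeric(sSouvValid - snaiveForValid$mean))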
Chapter 3 Problem 2
For the Shampoo Sales file, if the goal is forecasting sales in future months, which of the following steps should be taken:
* partition the data into training and validation periods - YES; the size of the validation period should match the length of time being forecasted.
* examine time plots of the series and of model forecasts only for the training period - NO; for assessing the model we want to look at the validation period for both the series plot and the model forecasts.
* look at the MAPE and RMSE for the training period - NO; we want to look at these for the validation period.
* look at the MAPE and RMSE for the validation period - YES; we can use these to evaluate the model's predictive performance and decide whether it will provide an acceptable forecast.
* compute naive forecasts - YES; we can use these as our initial forecasts and assess their performance against the validation period. From assignment 1, we learned that this data is not seasonal, so the seasonal naive forecast would not be needed (see the sketch after this list).
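A minimal sketch of that naive-forecast workflow for the shampoo series follows. The file name ShampooSales.csv, its column name, the start date, and the partition used below are assumptions based on the problem description, not values taken from this write-up.
# Sketch of the (non-seasonal) naive-forecast workflow for the shampoo data.
# File name, column name, start date, and partition are assumptions.
shampoo <- read.csv("ShampooSales.csv", stringsAsFactors=FALSE)
shampooTS <- ts(shampoo$Sales, start=c(1995,1), frequency=12)
shampooValidLength <- 12
shampooTrain <- window(shampooTS, end=c(1995, length(shampooTS) - shampooValidLength))
shampooValid <- window(shampooTS, start=c(1995, length(shampooTS) - shampooValidLength + 1))
naiveShampoo <- naive(shampooTrain, h=shampooValidLength)
accuracy(naiveShampoo, shampooValid)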