Data is partitioned to address the problem of overfitting and to measure a model's predictive accuracy. To do so, analysts fit the model on a training set and then test it on a validation set to see how accurately it predicts future sales.
The analyst was asked to predict the next 12 months of sales, so she chose a 12-month validation period, because “the main principle is to choose a validation period that mimics the forecast horizon to allow the evaluation of actual predictive performance.” The length of the validation period depends closely on the forecasting goal, the data frequency, and the forecast horizon.
The naive forecast is defined as the most recent value of the series. In the case of a seasonal series, the seasonal naive forecast is the value from the most recent identical season (for example, forecasting December sales using last December’s figures).
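For concreteness, here is a minimal sketch contrasting the two baselines on a toy quarterly series (the toy numbers are illustrative assumptions, not the souvenir data):
# Toy quarterly series with a repeating seasonal pattern
library(forecast)
toy <- ts(c(10, 20, 30, 40, 12, 22, 32, 42), start = c(2019, 1), frequency = 4)
naive(toy, h = 4)$mean   # naive: repeats the last value (42, 42, 42, 42)
snaive(toy, h = 4)$mean  # seasonal naive: repeats the last season (12, 22, 32, 42)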
library(forecast)
# Base R's read.csv is used below, so the readr package is not required
SouvenirSales <- read.csv("SouvenirSales.csv", header = TRUE, stringsAsFactors = FALSE)
head(SouvenirSales)
## Date Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
str(SouvenirSales)
## 'data.frame': 84 obs. of 2 variables:
## $ Date : chr "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
## $ Sales: num 1665 2398 2841 3547 3753 ...
# Convert sales to a monthly time series starting in January 1995
SouvenirSales.ts <- ts(SouvenirSales$Sales, start = c(1995, 1), frequency = 12)
plot(SouvenirSales.ts/1000, bty="l", xlab = "Year", ylab = "Sales (in thousands)", lwd = 2, col = "green", main = "Souvenir Sales in 1995-2001")
As the chart shows, souvenir sales exhibit a seasonal pattern, spiking and then rapidly declining around the end of each year. If I were to hypothesize, I would suggest that the spike in sales is connected to the winter holidays.
ggseasonplot(SouvenirSales.ts/1000, ty="l", ylab = "Sales (in thousands)", lwd = 2, main = "Seasonal Plot for Souvenir Sales")
The seasonal plot for souvenir sales above confirms the seasonality of the data. Because of this, the seasonal naive forecast is an appropriate method for this series.
# Partition the series: last 12 months for validation, first 72 for training
fix.nvalid <- 12
fix.ntrain <- length(SouvenirSales.ts) - fix.nvalid
train.ts <- window(SouvenirSales.ts/1000, start = c(1995, 1), end = c(1995, fix.ntrain))
valid.ts <- window(SouvenirSales.ts/1000, start = c(1995, fix.ntrain + 1), end = c(1995, fix.nvalid + fix.ntrain))
# Fit a quadratic trend model on the training period for comparison
SouvenirSalesLinear <- tslm(train.ts ~ trend + I(trend^2))
SouvenirSalesLinearPrediction <- forecast(SouvenirSalesLinear, h = fix.nvalid, level = 0)
plot(SouvenirSalesLinearPrediction, bty = "l", ylab = "Sales (in thousands)", xlab = "Year", main = "Souvenir Sales")
lines(SouvenirSalesLinear$fitted.values, lwd = 2, col = "red")
# Overlay the actual validation values
lines(valid.ts)
# Seasonal naive forecast over the 12-month validation horizon
SouvenirSalesNaive <- snaive(train.ts, h = fix.nvalid)
SouvenirSalesNaive$mean
## Jan Feb Mar Apr May Jun Jul
## 2001 7.61503 9.84969 14.55840 11.58733 9.33256 13.08209 16.73278
## Aug Sep Oct Nov Dec
## 2001 19.88861 23.93338 25.39135 36.02480 80.72171
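As a quick check, these forecasts are simply the last full season of the training data:
# The seasonal naive forecast repeats the last observed season (the year 2000)
tail(train.ts, 12)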
plot(SouvenirSalesNaive, ylab = "Sales (in thousands)", xlab = "Year", main = "Seasonal Naive Forecast")
accuracy(SouvenirSalesNaive, valid.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 3.401361 6.467818 3.744801 22.39270 25.64127 1.000000
## Test set 7.828278 9.542346 7.828278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
We see that the validation-period RMSE is 9.542 and the MAPE is 27.28%; since sales are expressed in thousands, the RMSE corresponds to about $9,542.
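As a sanity check, both measures can be recomputed directly from the validation errors (a sketch using the objects defined above):
# Recompute the validation-set RMSE and MAPE by hand (sales in thousands)
err <- valid.ts - SouvenirSalesNaive$mean
sqrt(mean(err^2))                  # RMSE, about 9.542
mean(abs(err) / valid.ts) * 100    # MAPE, about 27.28%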
# Forecast errors in the validation period
SnaiveValidErrors <- valid.ts - SouvenirSalesNaive$mean
SouvenirHist <- hist(SnaiveValidErrors, ylab = "Frequency", xlab = "Forecast Error (in thousands)", main = "Frequency of Seasonal Naive Forecasting Errors in Validation Period", bty = "l")
# Rescale the density curve to the histogram's count axis and overlay it
multiplier <- SouvenirHist$counts / SouvenirHist$density
dens <- density(SnaiveValidErrors, na.rm = TRUE)
dens$y <- dens$y * multiplier[1]
lines(dens, col = "red", lwd = 2)
The histogram shows that the seasonal naive forecast underpredicts sales, with most of the forecast errors being positive.
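This can be quantified directly; in fact, because the test-set ME equals the MAE in the accuracy output above, every validation error must be positive:
# Fraction of validation-period errors that are positive (here, 1)
mean(SnaiveValidErrors > 0)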
plot(valid.ts, bty = "l", xaxt = "n", xlab = "Year: 2001", main = "Naive Forecast versus Actual Sales", yaxt = "n", ylab = "Sales (in thousands)")
axis(1, at = seq(2001, 2001 + 11/12, by = 1/12), labels = month.abb)
axis(2, las = 2)
lines(SouvenirSalesNaive$mean, col = 2, lty = 2)
legend(2001.3, 100, c("Actual Sales", "Seasonal Naive Forecast"), col = 1:2, lty = 1:2)
plot(SouvenirSalesNaive$residuals, xlab = "Year", ylab = "Residuals", main = "Residuals by Year", lwd = 2)
The fact that most residuals lie above the zero line is yet another illustration of underprediction.
# The first 12 seasonal naive residuals are NA, so only residuals 13-72 are plotted
qqnorm(SouvenirSalesNaive$residuals[13:72])
qqline(SouvenirSalesNaive$residuals[13:72])
The normal Q-Q plot shows that the error terms are not normally distributed: the points do not follow the 45-degree reference line from the bottom-left corner to the top-right.
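As a numerical complement to the visual check, a Shapiro-Wilk test could be applied to the same residuals (an added sketch, not part of the original analysis; a small p-value is evidence against normality):
# Shapiro-Wilk normality test on the non-NA training residuals
shapiro.test(as.numeric(SouvenirSalesNaive$residuals[13:72]))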
qqnorm(valid.ts - SouvenirSalesNaive$mean)
qqline(valid.ts - SouvenirSalesNaive$mean)
This plot verifies that the error terms in the validation period are not normally distributed either. Again, the residuals are positive, indicating that the model underpredicts.
In conclusion, the 2001 sales forecast for souvenirs is very different from the actual sales. The seasonal naive model consistently underpredicts, as the plots above show, and the errors for both the training and validation periods are not normally distributed.
To generate a forecast for 2002, the analyst needs to combine the training and validation periods and rerun the model on the complete data set, as sketched below.
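A sketch of that final step, reusing the full series defined earlier:
# Refit the seasonal naive model on all 84 months (1995-2001)
# and forecast the 12 months of 2002
SouvenirSales2002 <- snaive(SouvenirSales.ts/1000, h = 12)
SouvenirSales2002$mean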
Yes, this is an important step. Partitioning the data allows an analyst to develop the best forecasting model by testing it for accuracy and fit.
No, plots for both the training and validation periods should be examined. The training period does not provide enough information on its own.
No, the MAPE and RMSE matter more in the validation period, because they measure performance on data the model has not seen.
Yes, this is an important step for evaluating how well the chosen model, built on the training period, performs on the validation data.
Yes, it can be helpful. The most recent data points may be the most relevant for forecasting the future. Naive forecasts serve as a baseline, which is needed for comparison when evaluating a method’s predictive performance.