# Load Souvenir Sales file
souvenirSales <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)
# Check for missing values
any(is.na(souvenirSales))
## [1] FALSE
# Create time-series object
souvSales <- ts(souvenirSales$Sales, start = c(1995,1), frequency = 12)
# Plot souvenir sales
plot(souvSales, main = "Souvenir Sales", ylab = "Sales", bty = "l")
It appears that souvenir sales are highly seasonal, which makes sense given the seasonality of tourism.
Other ways of checking for seasonality are shown below: aggregating the series by quarter and by year, and plotting each month as its own series.
# Aggregate by Quarter
quarterly <- aggregate(souvSales, nfrequency = 4, FUN = sum)
plot(quarterly, main = "Quarterly Aggregation", ylab = "Sales", bty = "l")
# Aggregate by year
yearly <- aggregate(souvSales, nfrequency = 1, FUN = sum)
plot(yearly, main = "Annual Aggregation", ylab = "Sales", bty = "l")
## Create separate plots for each season (month)
# Make outer margin area to the right bigger
par(oma = c(0, 0, 0, 4))
# The data covers 7 years (1995-2001), so set the x-axis range accordingly
xrange <- c(1995,2001)
# Set the y-axis range wide enough to cover all of the sales values
yrange <- c(0,100000)
# Create plot
plot(xrange, yrange, type = "n", xlab = "Year", ylab = "Souvenir Sales", cex.axis = 0.8, bty = "l", las = 1)
# Give each month its own color, line type and character
colors <- rainbow(12)
linetype <- c(1:12)
plotchar <- c(1:12)
# Add lines
for (i in 1:12) {
  currentMonth <- subset(souvSales, cycle(souvSales) == i)
  lines(seq(1995, 1995 + length(currentMonth) - 1, 1),
        currentMonth, type = "b", lwd = 1,
        lty = linetype[i], col = colors[i], pch = plotchar[i])
}
# Add title
title("Souvenir Sales by Month")
# Add legend
legend(2001.9, 120000, 1:12, cex = 0.8, col = colors, pch = plotchar, lty = linetype, title = "Month", xpd = NA)
Seasonality is still visible at the quarterly aggregation level, while aggregating by year removes it. Together with the month-by-month plot, this confirms that souvenir sales are strongly seasonal.
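Two other quick checks (not part of the original analysis, but useful confirmation) are a month plot and a classical decomposition, both available in base R's stats package:
# Month plot: one sub-series per calendar month, with each month's mean marked
monthplot(souvSales, ylab = "Sales", main = "Souvenir Sales by Month of Year")
# Classical decomposition: the seasonal component should show a strong, repeating pattern
plot(decompose(souvSales))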
Partition the data into the training and validation periods.
# Set the validation period to 12 months (2001)
validLength <- 12
# Set the length of the training period (data - validation period)
trainLength <- length(souvSales) - validLength
# Partition the data into training and validation periods
sSalesTrain <- window(souvSales, start = c(1995,1), end = c(1995, trainLength))
sSalesValid <- window(souvSales, start = c(1995, trainLength + 1), end = c(1995, trainLength + validLength))
Part (a): Why was the data partitioned?
Bias is introduced when the same data are used both to choose a forecasting model and to assess its performance. Fitting a model to both the systematic and the noise components of the data is called overfitting, and an overfit model tends to perform poorly on new data. Partitioning the data is an important preliminary step in guarding against overfitting: with separate training and validation sets, forecast errors can be measured on data the model has not seen. The model can then be rerun on the combined, complete data set to forecast into the future.
Part (b): Why did the analyst choose a 12-month validation period?
The length of the validation period depends on the forecasting goal, the data frequency, and the forecast horizon. The forecast horizon in this problem is 12 months, and the data are given in monthly increments. Typically the validation period mimics the forecast horizon so that actual predictive performance can be evaluated. If the validation period is longer than the forecast horizon, the training period contains less recent information; if it is shorter, the predictive performance of long-term forecasts cannot be measured and seasonality might not be fully accounted for.
Part (c): What is the naive forecast for the validation period?
# Naive forecast for validation period
naiveValid <- naive(sSalesTrain, h = validLength)
naiveValid$mean
## Jan Feb Mar Apr May Jun Jul
## 2001 80721.71 80721.71 80721.71 80721.71 80721.71 80721.71 80721.71
## Aug Sep Oct Nov Dec
## 2001 80721.71 80721.71 80721.71 80721.71 80721.71
# Seasonal naive forecast for validation period
snaiveValid <- snaive(sSalesTrain, h = validLength)
snaiveValid$mean
## Jan Feb Mar Apr May Jun Jul
## 2001 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## Aug Sep Oct Nov Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
Part (d): Compute the RMSE and MAPE for the naive forecasts.
# Range of summary measures of the seasonal naive forecast accuracy
accuracy(snaiveValid, sSalesValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set 7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
# Range of summary measures of the naive forecast accuracy
accuracy(naiveValid, sSalesValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 1113.477 10460.73 5506.879 -25.27554 61.16191 1.47054
## Test set -50500.288 56099.07 54490.114 -287.13834 290.95050 14.55087
## ACF1 Theil's U
## Training set -0.1968879 NA
## Test set 0.3182456 6.649124
For the seasonal naive forecast on the validation period, the RMSE is ~9542.35 and the MAPE is ~27.28%. Comparing the naive and seasonal naive forecasts gives another indication that seasonality is present: the errors are much lower with the seasonal naive forecast.
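As a sanity check, the same two measures can be computed directly from the validation-period errors of the seasonal naive forecast; this reproduces the RMSE and MAPE reported by accuracy().
# Validation-period errors of the seasonal naive forecast
snaiveErrors <- sSalesValid - snaiveValid$mean
# RMSE: square root of the mean squared error (~9542.35)
sqrt(mean(snaiveErrors^2))
# MAPE: mean absolute percentage error, in percent (~27.28)
mean(abs(snaiveErrors / sSalesValid)) * 100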
Part (e): Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period). Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts?
## Create histogram with a density curve
histValid <- hist(snaiveValid$residuals, ylim = c(1, 50), ylab = "Frequency", xlab = "Forecast Error", labels = TRUE, main = "Histogram of Training Period Error Terms", cex = 0.5, bty = "l")
axis(2, at = seq(0, 50, 10))
multiplier <- histValid$counts / histValid$density
densityValid <- density(snaiveValid$residuals, na.rm = TRUE)
densityValid$y <- densityValid$y * multiplier[1]
lines(densityValid)
## Create a time plot
# Plot actual values from the validation period (2001)
par(oma = c(0, 2, 0, 0))
plot(sSalesValid, bty = "l", main = "Time Plot for Forecast & Actual Numbers", xaxt = "n", xlab = "Year of 2001", yaxt = "n", ylab = "Sales")
axis(1, at = seq(2001, by = 1/12, length.out = 12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), cex.axis = 0.75)
axis(2, cex.axis = 0.75, las = 2)
lines(snaiveValid$mean, col = 2, lty = 2)
legend(2001, 105000, c("Actual", "Forecast"), col = 1:2, lty = 1:2)
The histogram shows that, of the 60 error terms, 7 are negative and 53 are positive, meaning the model under-predicts far more often than it over-predicts (roughly 7.5 times as often). The density curve shows that the training-period error terms are not normally distributed. The time plot of the seasonal naive forecasts against the actual sales numbers shows that the seasonal naive model under-predicts actual sales, although it follows the same pattern as the actual series.
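The counts of negative and positive errors cited above can be verified directly from the training residuals (the first 12 residuals are NA because there is no earlier season to compare against):
# Number of negative training-period errors (7 in the histogram)
sum(snaiveValid$residuals < 0, na.rm = TRUE)
# Number of positive training-period errors (53 in the histogram)
sum(snaiveValid$residuals > 0, na.rm = TRUE)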
Part (f): The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for year 2002?
To deploy the model and generate forecasts for 2002, the analyst must combine the training and validation periods and rerun the chosen model on the complete data set. The refit model is then used to forecast future values.
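For illustration only: if the seasonal naive model were the one chosen, the deployment step would look like the sketch below (the analyst's actual model may differ).
# Refit on the complete series (training + validation) and forecast the 12 months of 2002
snaiveFull <- snaive(souvSales, h = 12)
snaiveFull$mean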
If the goal is forecasting sales in future months, which of the following steps should be taken?
Answer:
Yes - Data should be partitioned into training and validation periods.
No - The training set contains the data used for building or selecting models, while the validation set is used to assess each model's performance. The training and validation sets must be recombined before the final model is used to forecast sales in future months; some models can be estimated more accurately with a more complete data set. Time plots of the actual and predicted series over both the training and validation periods can help identify performance issues and opportunities for improvement. Leaving the validation period out discards the most recent, and therefore most valuable, information, and using only the training period would require forecasting further into the future.
No - Predictive accuracy measures are based on the validation period, not the training period.
Yes - The validation period is not used to select predictors or estimate model parameters, so it is more objective than the training period. The prediction errors it yields, from the model fit on the training period, are what the accuracy measures are based on.
Yes - A naive forecast is simply the most recent value of the series (or, for a seasonal naive forecast, the value from the most recent identical season), which may well be the most relevant value for forecasting the future. Naive forecasts serve as a baseline against which a method's predictive performance is evaluated, and sometimes a naive forecast works sufficiently well to be used as the actual forecast.