Problem 1
Souvenir Sales:
a. Why was the data partitioned?
To avoid overfitting in our problem it is a very important step before applying the forecasting and be able to assess the predictive performance of the model is data partitioning. She first partitions the data into periods, training period and validation period.She would develop her forecasting model using only the training period by fit a forecasting model to sales using training period.She would try it out on validation period and see how it performs. In particular, she can measure the forecast errors (residual), which are the differences between the predicted values and the actual values.
b. Why did the analyst choose a 12-months validation?
The analyst chose 12 month validation period because:
c. What is the naive forecast for the validation period?(assume that you must provide forecasts for 12 months ahead)
getwd()
## [1] "/Users/yusufsultan"
setwd("/Users/yusufsultan")
SouvenirSales <- read.csv("SouvenirSales.csv")
head(SouvenirSales)
## Date Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
str(SouvenirSales)
## 'data.frame': 84 obs. of 2 variables:
## $ Date : Factor w/ 84 levels "1-Apr","1-Aug",..: 38 32 56 14 62 50 44 20 80 74 ...
## $ Sales: num 1665 2398 2841 3547 3753 ...
summary(SouvenirSales)
## Date Sales
## 1-Apr : 1 Min. : 1665
## 1-Aug : 1 1st Qu.: 5884
## 1-Dec : 1 Median : 8772
## 1-Feb : 1 Mean : 14316
## 1-Jan : 1 3rd Qu.: 16889
## 1-Jul : 1 Max. :104661
## (Other):78
library(forecast)
## Warning: package 'forecast' was built under R version 3.2.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.2.5
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: timeDate
## This is forecast 7.3
SouvenirSales.ts <- ts(SouvenirSales$Sales, start = c(1995,1), end = c(2001, 12), freq = 12)
nValid <- 12
nTrain <- length(SouvenirSales.ts) - nValid
train.ts <- window(SouvenirSales.ts, start = c(1995, 1), end = c(1995,nTrain))
valid.ts <- window(SouvenirSales.ts, start = c(1995, nTrain + 1), end = c(1995,nTrain + nValid))
snaive.pred <- snaive(train.ts, h = nValid)
snaive.pred$mean
## Jan Feb Mar Apr May Jun Jul
## 2001 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## Aug Sep Oct Nov Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
plot (SouvenirSales.ts, bty = "l")
plot (train.ts, bty = "l")
plot (valid.ts, bty = "l")
naive.pred <- naive(train.ts, h = nValid)
plot(naive.pred)
snaive.pred <- snaive(train.ts, h = nValid)
plot(snaive.pred)
valid.ts
## Jan Feb Mar Apr May Jun Jul
## 2001 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 2001 28586.52 30505.41 30821.33 46634.38 104660.67
snaive.pred
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 7615.03 -673.8117 15903.87 -5061.6594 20291.72
## Feb 2001 9849.69 1560.8483 18138.53 -2826.9994 22526.38
## Mar 2001 14558.40 6269.5583 22847.24 1881.7106 27235.09
## Apr 2001 11587.33 3298.4883 19876.17 -1089.3594 24264.02
## May 2001 9332.56 1043.7183 17621.40 -3344.1294 22009.25
## Jun 2001 13082.09 4793.2483 21370.93 405.4006 25758.78
## Jul 2001 16732.78 8443.9383 25021.62 4056.0906 29409.47
## Aug 2001 19888.61 11599.7683 28177.45 7211.9206 32565.30
## Sep 2001 23933.38 15644.5383 32222.22 11256.6906 36610.07
## Oct 2001 25391.35 17102.5083 33680.19 12714.6606 38068.04
## Nov 2001 36024.80 27735.9583 44313.64 23348.1106 48701.49
## Dec 2001 80721.71 72432.8683 89010.55 68045.0206 93398.40
par(oma = c(0, 0, 0, 4))
xrange <- c(1995,2001)
yrange <- range(SouvenirSales.ts)
plot(xrange, yrange, type="n", xlab="Year", ylab="Souvenir Sales by Month", bty="l", las=1, yaxt = "n")
colors <- rainbow(12)
linetype <- c(14:25)
plotchar <- c(14:25)
axis(1, at=seq(1995,2001), labels=format(seq(1995,2001)))
axis(2, at=seq(0,110000,18000), labels=format(seq(0,110000,18000)), cex.axis=0.5, las=2)
for (i in 1:12) {
currentmonth <- subset(SouvenirSales.ts, cycle(SouvenirSales.ts)==i)
lines(seq(1995, 1995+length(currentmonth)-1,1), currentmonth, type="b", lwd=1,
lty=linetype[i], col=colors[i], pch=plotchar[i])
}
title("Souvenir Sales Broken Out by Month")
legend(2002, 105000, 13:24, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)
summary(snaive.pred)
##
## Forecast method: Seasonal naive method
##
## Model Information:
## $drift
## [1] 0
##
## $drift.se
## [1] 0
##
## $sd
## [1] 5547.644
##
## $call
## snaive(y = train.ts, h = nValid)
##
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 3401.361 6467.818 3744.801 22.3927 25.64127 1 0.4140974
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 7615.03 -673.8117 15903.87 -5061.6594 20291.72
## Feb 2001 9849.69 1560.8483 18138.53 -2826.9994 22526.38
## Mar 2001 14558.40 6269.5583 22847.24 1881.7106 27235.09
## Apr 2001 11587.33 3298.4883 19876.17 -1089.3594 24264.02
## May 2001 9332.56 1043.7183 17621.40 -3344.1294 22009.25
## Jun 2001 13082.09 4793.2483 21370.93 405.4006 25758.78
## Jul 2001 16732.78 8443.9383 25021.62 4056.0906 29409.47
## Aug 2001 19888.61 11599.7683 28177.45 7211.9206 32565.30
## Sep 2001 23933.38 15644.5383 32222.22 11256.6906 36610.07
## Oct 2001 25391.35 17102.5083 33680.19 12714.6606 38068.04
## Nov 2001 36024.80 27735.9583 44313.64 23348.1106 48701.49
## Dec 2001 80721.71 72432.8683 89010.55 68045.0206 93398.40
d. Compute the RMSE and MAPE for the naive forecasts.
accuracy(snaive.pred, valid.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set 7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
RMSE for Validation Period: 9,542 and MAPE: 27.3%
e.Plot a histogram of the forecast error that results from the naive forecasts(for the validation period).Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period .what can you say about the behavior of the naive forecasts?
Souv_Valid <- snaive(SouvenirSales.ts)
Souv_naive_Resid <- valid.ts - snaive.pred$mean
myhist <- hist(Souv_naive_Resid, ylab="Frequency", xlab="Forecast Error", bty="l", main="")
multiplier <- myhist$counts / myhist$density
mydensity <- density(Souv_naive_Resid, na.rm=TRUE)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity)
Naive Forecasts (showed in red line) follows same pattern trend as of actual sales numbers (black line) with a certain delta across validation period.
plot(valid.ts, bty="l", xaxt="n", xlab="2001", yaxt="n", ylab="Souvenir sales ,000s",lwd=3)
axis(1, at=seq(from = 2001, to = 2001.88, length.out = 12))
axis(2, at=seq(0,110000,15000), labels=format(seq(0,110,15)), las=2)
lines(snaive.pred$mean, col=34, lty=1 ,lwd=3)
legend(2001.6, 105000, c("Actual","Forecast"), col=1:2, lty=1:2 )
f.The analyst found a forecasting model that gives satisfactory performance on the validation set.What must she do to use the forecasting model for generating forecasts for year 2002
She must combine the training and validation periods into a single time series and then rerun the chosen model on the complete data. This final model is then used to forecast for year 2002.
=================================================================================
Problem 2
Forecasting Shampoo Sales:
a. Partitioning the data into training and validation periods
Yes, Partitioning the data into training and validation periods very important and will help us to measure the accuracy of our forecasting model.
b. Examine time plots of the series and of model forecasts only for the training period?
No, we need to examine time series data for both training period and validation period. Also, model forecasts for both training and validation periods need to be reviewed.
ShampooSales <- read.csv("ShampooSales.csv")
Month Sales
head(ShampooSales)
## Month Sales
## 1 Jan-95 266.0
## 2 Feb-95 145.9
## 3 Mar-95 183.1
## 4 Apr-95 119.3
## 5 May-95 180.3
## 6 Jun-95 168.5
structure of the data
str(ShampooSales)
## 'data.frame': 36 obs. of 2 variables:
## $ Month: Factor w/ 36 levels "Apr-95","Apr-96",..: 13 10 22 1 25 19 16 4 34 31 ...
## $ Sales: num 266 146 183 119 180 ...
summary with the time series of the data + Decomposition show the components of the time series
summary(ShampooSales)
## Month Sales
## Apr-95 : 1 Min. :119.3
## Apr-96 : 1 1st Qu.:192.4
## Apr-97 : 1 Median :280.1
## Aug-95 : 1 Mean :312.6
## Aug-96 : 1 3rd Qu.:411.1
## Aug-97 : 1 Max. :682.0
## (Other):30
ShampooSales.ts <- ts(ShampooSales$Sales, start = c(1995,1), end = c(1998), freq = 12)
ShampooSales_plots = decompose(ShampooSales.ts)
ShampooSales_plots$seasonal
## Jan Feb Mar Apr May Jun
## 1995 -22.035069 -5.060069 -51.016319 24.750347 -47.641319 3.504514
## 1996 -22.035069 -5.060069 -51.016319 24.750347 -47.641319 3.504514
## 1997 -22.035069 -5.060069 -51.016319 24.750347 -47.641319 3.504514
## 1998 -22.035069
## Jul Aug Sep Oct Nov Dec
## 1995 34.204514 27.589931 -4.012153 17.454514 34.433681 -12.172569
## 1996 34.204514 27.589931 -4.012153 17.454514 34.433681 -12.172569
## 1997 34.204514 27.589931 -4.012153 17.454514 34.433681 -12.172569
## 1998
plot(ShampooSales_plots)
ShampooSales.tsyear <- aggregate(ShampooSales.ts, nfrequency=1, FUN=mean)
ShampooSales.ts.zoom <- window(ShampooSales.tsyear, start = c(1995, 1), end = c(1997, 12))
## Warning in window.default(x, ...): 'end' value not changed
dev.new()
plot(ShampooSales.ts.zoom, xlab = "Time", ylab = "ShampooSales", ylim = c(100, 800), bty = "l")
dev.off()
## quartz_off_screen
## 2
totalRecords <- length(ShampooSales.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(ShampooSales.ts, start = c(1995,1), end = c(1995,nTrainigRecords))
valid.ts <- window(ShampooSales.ts, start = c (1995,nTrainigRecords+1), end = c(1995,totalRecords))
ShampooSales.lm <- tslm(train.ts ~ trend + I(trend^2))
ShampooSales.lm.pred <- forecast(ShampooSales.lm, h = nValidationRecords, level = 0)
plot(ShampooSales.lm.pred, ylim = c(100, 800),ylab = "ShampooSales", xlab = "Time", bty = "l", xaxt = "n", xlim = c(1995,1998), main ="", flty = 2)
axis(1, at = seq(1995, 1998, 1), labels = format(seq(1995, 1998, 1)))
lines(ShampooSales.lm$fitted, lwd = 2)
lines(valid.ts)
c.Look at MAPE and RMSE values for the training period
No, we don’t need to look MAPE and RMSE values for a training period.
d.Look at MAPE and RMSE values for the validation period
Yes, this is very important. It serves as a more objective basis than the training period to assess predictive accuracy (because records in the validation period are not used to select predictors or to estimate model parameters).
e.Compute naive forecasts
Yes, it’s a good idea to have naive forecasts that can help with two things:
1. As the actual forecasts of the series. Naive forecasts, they are easy to understand and easy to implement, can sometimes achieve sufficiently useful accuracy levels.naive forecasts are a serious contender.
2. As a baseline. When evaluating the predictive performance of a certain method, it is important to compare it to some baseline. Naive forecasts should always be considered as a baseline, and the comparative advantage of any other methods considered should be clearly shown.