Predictive analytics  MBA 678           Assignment # 2                Yusuf Sultan


 

Problem 1 

Souvenir Sales:

a. Why was the data partitioned?

To avoid overfitting in our problem it is a very important step before applying the forecasting and be able to assess the predictive performance of the model is data partitioning. She first partitions the data into periods, training period and validation period.She would develop her forecasting model using only the training period by fit a forecasting model to sales using training period.She would try it out on validation period and see how it performs. In particular, she can measure the forecast errors (residual), which are the differences between the predicted values and the actual values.


 

b. Why did the analyst choose a 12-months validation?

 The analyst chose 12 month validation period because:


 

c. What is the naive forecast for the validation period?(assume that you must provide forecasts for 12 months ahead)

    

getwd()
## [1] "/Users/yusufsultan"
setwd("/Users/yusufsultan")
SouvenirSales <- read.csv("SouvenirSales.csv")
head(SouvenirSales)
##     Date   Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
str(SouvenirSales)
## 'data.frame':    84 obs. of  2 variables:
##  $ Date : Factor w/ 84 levels "1-Apr","1-Aug",..: 38 32 56 14 62 50 44 20 80 74 ...
##  $ Sales: num  1665 2398 2841 3547 3753 ...
summary(SouvenirSales)
##       Date        Sales       
##  1-Apr  : 1   Min.   :  1665  
##  1-Aug  : 1   1st Qu.:  5884  
##  1-Dec  : 1   Median :  8772  
##  1-Feb  : 1   Mean   : 14316  
##  1-Jan  : 1   3rd Qu.: 16889  
##  1-Jul  : 1   Max.   :104661  
##  (Other):78
library(forecast)
## Warning: package 'forecast' was built under R version 3.2.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.2.5
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: timeDate
## This is forecast 7.3
SouvenirSales.ts <- ts(SouvenirSales$Sales, start = c(1995,1), end = c(2001, 12), freq = 12)
nValid <- 12
nTrain <- length(SouvenirSales.ts) - nValid
train.ts <- window(SouvenirSales.ts, start = c(1995, 1), end = c(1995,nTrain))
valid.ts <- window(SouvenirSales.ts, start = c(1995, nTrain + 1), end = c(1995,nTrain + nValid))
snaive.pred <- snaive(train.ts, h = nValid)
snaive.pred$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78
##           Aug      Sep      Oct      Nov      Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71

Plot the time series data

plot (SouvenirSales.ts, bty = "l")

Plot the training period

plot (train.ts, bty = "l")

Plot the Validation time series period

plot (valid.ts, bty = "l")

forecasts From naive method.

forecasts From Seasonal naive method.

naive.pred <- naive(train.ts, h = nValid)
plot(naive.pred)

snaive.pred <- snaive(train.ts, h = nValid)
plot(snaive.pred)

Actual Sales

valid.ts
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2001  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##            Aug       Sep       Oct       Nov       Dec
## 2001  28586.52  30505.41  30821.33  46634.38 104660.67

Seasonal Naive Sales Forecasting

snaive.pred
##          Point Forecast      Lo 80    Hi 80      Lo 95    Hi 95
## Jan 2001        7615.03  -673.8117 15903.87 -5061.6594 20291.72
## Feb 2001        9849.69  1560.8483 18138.53 -2826.9994 22526.38
## Mar 2001       14558.40  6269.5583 22847.24  1881.7106 27235.09
## Apr 2001       11587.33  3298.4883 19876.17 -1089.3594 24264.02
## May 2001        9332.56  1043.7183 17621.40 -3344.1294 22009.25
## Jun 2001       13082.09  4793.2483 21370.93   405.4006 25758.78
## Jul 2001       16732.78  8443.9383 25021.62  4056.0906 29409.47
## Aug 2001       19888.61 11599.7683 28177.45  7211.9206 32565.30
## Sep 2001       23933.38 15644.5383 32222.22 11256.6906 36610.07
## Oct 2001       25391.35 17102.5083 33680.19 12714.6606 38068.04
## Nov 2001       36024.80 27735.9583 44313.64 23348.1106 48701.49
## Dec 2001       80721.71 72432.8683 89010.55 68045.0206 93398.40
par(oma = c(0, 0, 0, 4))
xrange <- c(1995,2001)
yrange <- range(SouvenirSales.ts)
plot(xrange, yrange, type="n", xlab="Year", ylab="Souvenir Sales by Month", bty="l", las=1, yaxt = "n")
colors <- rainbow(12) 
linetype <- c(14:25) 
plotchar <- c(14:25)
axis(1, at=seq(1995,2001), labels=format(seq(1995,2001)))
axis(2, at=seq(0,110000,18000), labels=format(seq(0,110000,18000)), cex.axis=0.5, las=2)
for (i in 1:12) { 
  currentmonth <- subset(SouvenirSales.ts, cycle(SouvenirSales.ts)==i)
  lines(seq(1995, 1995+length(currentmonth)-1,1), currentmonth, type="b", lwd=1,
      lty=linetype[i], col=colors[i], pch=plotchar[i]) 
} 
title("Souvenir Sales Broken Out by Month")
legend(2002, 105000, 13:24, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)

summary(snaive.pred)
## 
## Forecast method: Seasonal naive method
## 
## Model Information:
## $drift
## [1] 0
## 
## $drift.se
## [1] 0
## 
## $sd
## [1] 5547.644
## 
## $call
## snaive(y = train.ts, h = nValid)
## 
## 
## Error measures:
##                    ME     RMSE      MAE     MPE     MAPE MASE      ACF1
## Training set 3401.361 6467.818 3744.801 22.3927 25.64127    1 0.4140974
## 
## Forecasts:
##          Point Forecast      Lo 80    Hi 80      Lo 95    Hi 95
## Jan 2001        7615.03  -673.8117 15903.87 -5061.6594 20291.72
## Feb 2001        9849.69  1560.8483 18138.53 -2826.9994 22526.38
## Mar 2001       14558.40  6269.5583 22847.24  1881.7106 27235.09
## Apr 2001       11587.33  3298.4883 19876.17 -1089.3594 24264.02
## May 2001        9332.56  1043.7183 17621.40 -3344.1294 22009.25
## Jun 2001       13082.09  4793.2483 21370.93   405.4006 25758.78
## Jul 2001       16732.78  8443.9383 25021.62  4056.0906 29409.47
## Aug 2001       19888.61 11599.7683 28177.45  7211.9206 32565.30
## Sep 2001       23933.38 15644.5383 32222.22 11256.6906 36610.07
## Oct 2001       25391.35 17102.5083 33680.19 12714.6606 38068.04
## Nov 2001       36024.80 27735.9583 44313.64 23348.1106 48701.49
## Dec 2001       80721.71 72432.8683 89010.55 68045.0206 93398.40
d. Compute the RMSE and MAPE for the naive forecasts.

accuracy(snaive.pred, valid.ts)
##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set     7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
##                   ACF1 Theil's U
## Training set 0.4140974        NA
## Test set     0.2264895 0.7373759

 

 

RMSE for Validation Period: 9,542 and MAPE: 27.3%


 

e.Plot a histogram of the forecast error that results from the naive forecasts(for the validation period).Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period .what can you say about the behavior of the naive forecasts? 

Souv_Valid <- snaive(SouvenirSales.ts)
Souv_naive_Resid <- valid.ts - snaive.pred$mean
myhist <- hist(Souv_naive_Resid, ylab="Frequency", xlab="Forecast Error", bty="l", main="")
multiplier <- myhist$counts / myhist$density
mydensity <- density(Souv_naive_Resid, na.rm=TRUE)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity)

Naive Forecasts (showed in red line) follows same pattern trend as of actual sales numbers (black line) with a certain delta across validation period.

plot(valid.ts, bty="l", xaxt="n", xlab="2001", yaxt="n", ylab="Souvenir sales ,000s",lwd=3)
axis(1, at=seq(from = 2001, to = 2001.88, length.out = 12))
axis(2, at=seq(0,110000,15000), labels=format(seq(0,110,15)), las=2)
lines(snaive.pred$mean, col=34, lty=1 ,lwd=3)
legend(2001.6, 105000, c("Actual","Forecast"), col=1:2, lty=1:2 )


 

f.The analyst found a forecasting model that gives satisfactory performance on the validation set.What must she do to use the forecasting model for generating forecasts for  year 2002

She must combine the training and validation periods into a single time series and then rerun the chosen model on the complete data. This final model is then used to forecast for year 2002.

=================================================================================

 

 

 

 

Problem 2 

Forecasting Shampoo Sales:

a. Partitioning the data into training and validation periods

 

Yes, Partitioning the data into training and validation periods very important and will help us to measure the accuracy of our forecasting model.


 

b. Examine time plots of the series and of model forecasts only for the training period?

 

No, we need to examine time series data for both training period and validation period. Also, model forecasts for both training and validation periods need to be reviewed.

 


ShampooSales <- read.csv("ShampooSales.csv")

Month Sales

head(ShampooSales)
##    Month Sales
## 1 Jan-95 266.0
## 2 Feb-95 145.9
## 3 Mar-95 183.1
## 4 Apr-95 119.3
## 5 May-95 180.3
## 6 Jun-95 168.5

structure of the data

str(ShampooSales)
## 'data.frame':    36 obs. of  2 variables:
##  $ Month: Factor w/ 36 levels "Apr-95","Apr-96",..: 13 10 22 1 25 19 16 4 34 31 ...
##  $ Sales: num  266 146 183 119 180 ...

summary with the time series of the data + Decomposition show the components of the time series

summary(ShampooSales)
##      Month        Sales      
##  Apr-95 : 1   Min.   :119.3  
##  Apr-96 : 1   1st Qu.:192.4  
##  Apr-97 : 1   Median :280.1  
##  Aug-95 : 1   Mean   :312.6  
##  Aug-96 : 1   3rd Qu.:411.1  
##  Aug-97 : 1   Max.   :682.0  
##  (Other):30
ShampooSales.ts <- ts(ShampooSales$Sales, start = c(1995,1), end = c(1998), freq = 12)
ShampooSales_plots = decompose(ShampooSales.ts)
ShampooSales_plots$seasonal
##             Jan        Feb        Mar        Apr        May        Jun
## 1995 -22.035069  -5.060069 -51.016319  24.750347 -47.641319   3.504514
## 1996 -22.035069  -5.060069 -51.016319  24.750347 -47.641319   3.504514
## 1997 -22.035069  -5.060069 -51.016319  24.750347 -47.641319   3.504514
## 1998 -22.035069                                                       
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1995  34.204514  27.589931  -4.012153  17.454514  34.433681 -12.172569
## 1996  34.204514  27.589931  -4.012153  17.454514  34.433681 -12.172569
## 1997  34.204514  27.589931  -4.012153  17.454514  34.433681 -12.172569
## 1998
plot(ShampooSales_plots)

ShampooSales.tsyear <-  aggregate(ShampooSales.ts, nfrequency=1, FUN=mean)
ShampooSales.ts.zoom <- window(ShampooSales.tsyear, start = c(1995, 1), end = c(1997, 12))
## Warning in window.default(x, ...): 'end' value not changed
dev.new()
plot(ShampooSales.ts.zoom, xlab = "Time", ylab = "ShampooSales", ylim = c(100, 800), bty = "l")
dev.off()
## quartz_off_screen 
##                 2
totalRecords <- length(ShampooSales.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(ShampooSales.ts, start = c(1995,1), end = c(1995,nTrainigRecords))
valid.ts <- window(ShampooSales.ts, start = c (1995,nTrainigRecords+1), end = c(1995,totalRecords))
ShampooSales.lm <- tslm(train.ts ~ trend + I(trend^2))
ShampooSales.lm.pred <- forecast(ShampooSales.lm, h = nValidationRecords, level = 0)
plot(ShampooSales.lm.pred, ylim = c(100, 800),ylab = "ShampooSales", xlab = "Time", bty = "l", xaxt = "n", xlim = c(1995,1998), main ="", flty = 2) 
axis(1, at = seq(1995, 1998, 1), labels = format(seq(1995, 1998, 1))) 
lines(ShampooSales.lm$fitted, lwd = 2) 
lines(valid.ts)


 

c.Look at MAPE and RMSE values for the training period

 

No, we don’t need to look MAPE and RMSE values for a training period.

 

 


 

d.Look at MAPE and RMSE values for the validation period

 

Yes, this is very important. It serves as a more objective basis than the training period to assess predictive accuracy (because records in the validation period are not used to select predictors or to estimate model parameters).


 

e.Compute naive forecasts

Yes, it’s a good idea to have naive forecasts that can help with two things:

1. As the actual forecasts of the series. Naive forecasts, they are easy to understand and easy to implement, can sometimes achieve sufficiently useful accuracy levels.naive forecasts are a serious contender.

2.  As a baseline. When evaluating the predictive performance of a certain method, it is important to compare it to some baseline. Naive forecasts should always be considered as a baseline, and the comparative advantage of any other methods considered should be clearly shown.