1: Forecasting Shampoo Sales: The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a 3-year period.

# Install pacman if it is missing, then use it to install (if needed) and load the required packages
if (!require("pacman")) install.packages("pacman")

## Loading required package: pacman

pacman::p_load("moments","extRemes","stringi", "ggplot2", "TTR", "forecast","zoo","xts")
op <- par(oma=c(5,7,1,1))
ShampooSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\ShampooSales.csv")
ShampooSales.ts <- ts(ShampooSales.data$Shampoo.Sales, start = c(1995,1), end = c(1997, 12), freq = 12)
ShampooSales.lm <- tslm(ShampooSales.ts ~ trend + I(trend^2)) 
par(op)

1A: Create a well-formatted time plot of the data.

The graph above clearly shows an upward, exponential trend with additive seasonality.

1:B Which of the four components (level, trend, seasonality, noise) seem to be present in this series?

All four components (level, trend, seasonality and noise) are present in this time series.

1:C Do you expect to see seasonality in sales of shampoo? Why?

Yes, we expect seasonality in sales of shampoo. Decomposing the time series gives the following seasonal component:

##             Jan        Feb        Mar        Apr        May        Jun
## 1995 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1996 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1997 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1995   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1996   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1997   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424

There are three plausible reasons for the seasonality:
1. People buy more shampoo during the summer months (June, July and August), as the positive values in the table above show.
2. People tend to buy more during off-season sales periods such as October and/or November.
3. April shows a positive value, which suggests that people stock up on shampoo after deferring their purchases during the winter season.
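As a quick check on the table above, the twelve additive seasonal indices from `decompose()` are centered, so they should sum to roughly zero, and sorting them shows which months sit above the yearly average. The sketch below copies the indices from the output above; the variable names are illustrative, and with only three years of data the ranking should be read with caution.

```r
# Seasonal indices copied from the decompose() output above (one value per month)
seasonal_idx <- c(Jan = -19.193924, Feb = -2.218924, Mar = -48.175174,
                  Apr = 27.591493, May = -44.800174, Jun = 6.345660,
                  Jul = 2.951910, Aug = 30.431076, Sep = -1.171007,
                  Oct = 20.295660, Nov = 37.274826, Dec = -9.331424)

# Additive indices are centered, so their sum should be close to zero
idx_sum <- sum(seasonal_idx)

# Months above the yearly average, largest first
above_avg <- sort(seasonal_idx[seasonal_idx > 0], decreasing = TRUE)

print(round(idx_sum, 4))
print(names(above_avg))
```

The months that come out above average are the ones the three reasons refer to: the summer months, October/November, and April.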

2 Forecasting Shampoo Sales: The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a three-year period. If the goal is forecasting sales in future months, which of the following steps should be taken? (choose one or more)

Step 1: partition the data into training and validation periods

# Data partitioning
ShampooSales.plots = decompose(ShampooSales.ts)
ShampooSales.plots$seasonal

##             Jan        Feb        Mar        Apr        May        Jun
## 1995 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1996 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1997 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1995   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1996   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1997   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424

plot(ShampooSales.plots)

ShampooSales.tsyear <-  aggregate(ShampooSales.ts, nfrequency=1, FUN=mean)
ShampooSales.ts.zoom <- window(ShampooSales.tsyear, start = c(1995, 1), end = c(1997, 12))

## Warning in window.default(x, ...): 'end' value not changed

dev.new()
plot(ShampooSales.ts.zoom, xlab = "Time", ylab = "ShampooSales", ylim = c(100, 800), bty = "l")
dev.off()

## png 
##   2

totalRecords <- length(ShampooSales.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(ShampooSales.ts, start = c(1995,1), end = c(1995,nTrainigRecords))
valid.ts <- window(ShampooSales.ts, start = c (1995,nTrainigRecords+1), end = c(1995,totalRecords))
ShampooSales.lm <- tslm(train.ts ~ trend + I(trend^2))
ShampooSales.lm.pred <- forecast(ShampooSales.lm, h = nValidationRecords, level = 0)
plot(ShampooSales.lm.pred, ylim = c(100, 800),ylab = "ShampooSales", xlab = "Time", bty = "l", xaxt = "n", xlim = c(1995,1998), main ="", flty = 2) 
axis(1, at = seq(1995, 1998, 1), labels = format(seq(1995, 1998, 1))) 
lines(ShampooSales.lm$fitted, lwd = 2) 
lines(valid.ts)
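The partitioning code above writes the split points as offsets from January 1995, for example `end = c(1995, nTrainigRecords)`. The sketch below, which uses a placeholder series of dummy values rather than the shampoo data, suggests that this is equivalent to the more readable calendar form `end = c(1996, 12)`. The names `train_offset`, `train_cal` and `valid_cal` are illustrative.

```r
# Placeholder monthly series with the same span as the shampoo data (dummy values)
x <- ts(seq_len(36), start = c(1995, 1), frequency = 12)

n_valid <- 12
n_train <- length(x) - n_valid

# Offset style used above: "period n_train counted from January 1995"
train_offset <- window(x, start = c(1995, 1), end = c(1995, n_train))

# Calendar style: the same split written as year/month
train_cal <- window(x, end = c(1996, 12))
valid_cal <- window(x, start = c(1997, 1))

# Both styles should select the same 24 training observations
same_split <- isTRUE(all.equal(as.numeric(train_offset), as.numeric(train_cal)))
print(same_split)
print(start(valid_cal))
```

The offset form adapts automatically when the validation length changes; the calendar form is easier to audit by eye.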

Step 2: examine time plots of the series and of model forecasts only for the training period?

No. We need to examine time plots of the series and of the model forecasts for both the training period and the validation period. In addition, if only the training period is used to generate the final forecasts, the model has to forecast further into the future.

Step 3: look at MAPE and RMSE values for the training period

No, we do not need to look at the MAPE and RMSE values for the training period.

Step 4: look at MAPE and RMSE values for the validation period

Yes. The validation period serves as a more objective basis than the training period to assess predictive accuracy, because records in the validation period are not used to select predictors or to estimate model parameters.

Step 5: compute naive forecasts

Yes. We need naive forecasts because they can serve two purposes:
1. As the actual forecasts of the series. Naive forecasts, which are simple to understand and easy to implement, can sometimes achieve sufficiently useful accuracy levels. Following the principle of “the simplest method that does the job”, naive forecasts are a serious contender.
2. As a baseline. When evaluating the predictive performance of a certain method, it is important to compare it to some baseline. Naive forecasts should always be considered as a baseline, and the comparative advantage of any other methods considered should be clearly shown.
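To make the two kinds of naive forecast concrete, here is a minimal base-R sketch. The helper names `naive_fc` and `seasonal_naive_fc` are hypothetical (they are not part of the forecast package), and the input is a dummy series used only for illustration.

```r
# Naive forecast: repeat the last observed value h times
naive_fc <- function(x, h) rep(tail(x, 1), h)

# Seasonal naive forecast: repeat the last full season (m periods), recycled to length h
seasonal_naive_fc <- function(x, h, m = 12) rep(tail(x, m), length.out = h)

# Dummy series of 24 "months" with values 1..24, for illustration only
x <- 1:24

naive_fc(x, 3)           # 24 24 24
seasonal_naive_fc(x, 3)  # 13 14 15
```

For a monthly series with seasonality, the seasonal version is usually the more sensible baseline, because it carries last year's monthly pattern forward.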

3 Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001. Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation period containing the last 12 months of data (year 2001). She then fit a forecasting model to sales, using the training period. Partition the data into the training and validation periods as explained above.

Q3(a) Why was the data partitioned?

Ans 3(a) Data partitioning addresses the problem of overfitting and is an important step before applying any forecasting method. The series is split into two periods. The analyst develops her forecasting model using only one of the periods (the training period). Once she has a model, she tries it out on the other period (the validation period) and sees how it performs. In particular, she can measure the forecast errors, which are the differences between the actual values and the predicted values.

Q3(b) Why did the analyst choose a 12-month validation period?

Ans 3(b) She chose a 12-month validation period so that the forecasting model is tested on every month of the seasonal cycle.

Q3(c) What is the naive forecast for the validation period? (assume that you must provide forecasts for 12 months ahead)

SouvenirSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\SouvenirSales.csv")
SouvenirSales.ts <- ts(SouvenirSales.data$Sales, start = c(1995,1), end = c(2001, 12), freq = 12)
nValid <- 12
nTrain <- length(SouvenirSales.ts) - nValid
train.ts <- window(SouvenirSales.ts, start = c(1995, 1), end = c(1995,nTrain))
valid.ts <- window(SouvenirSales.ts, start = c(1995, nTrain + 1), end = c(1995,nTrain + nValid))
snaive.pred <- snaive(train.ts, h = nValid)

## [1] "Actual Sales"

##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2001  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##            Aug       Sep       Oct       Nov       Dec
## 2001  28586.52  30505.41  30821.33  46634.38 104660.67

## [1] "Seasonal Naive Sales Forecasting"

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78
##           Aug      Sep      Oct      Nov      Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71

Q3(d) Compute the RMSE and MAPE for the naive forecasts

##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set     7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
##                   ACF1 Theil's U
## Training set 0.4140974        NA
## Test set     0.2264895 0.7373759
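The test-set row of this table can be reproduced by hand from the actual and seasonal naive values printed under Q3(c). The sketch below copies those 2001 figures and applies the usual definitions of ME, RMSE and MAPE; the variable names are illustrative.

```r
# 2001 actual sales and seasonal naive forecasts, copied from the output under Q3(c)
actual <- c(10243.24, 11266.88, 21826.84, 17357.33, 15997.79, 18601.53,
            26155.15, 28586.52, 30505.41, 30821.33, 46634.38, 104660.67)
snaive_fc <- c(7615.03, 9849.69, 14558.40, 11587.33, 9332.56, 13082.09,
               16732.78, 19888.61, 23933.38, 25391.35, 36024.80, 80721.71)

# Forecast error = actual - forecast
err <- actual - snaive_fc

me   <- mean(err)                        # mean error
rmse <- sqrt(mean(err^2))                # root mean squared error
mape <- mean(abs(err / actual)) * 100    # mean absolute percentage error

round(c(ME = me, RMSE = rmse, MAPE = mape), 3)
```

The results should agree with the "Test set" row above (ME about 7828.28, RMSE about 9542.3, MAPE about 27.28).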

Q3(e) Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period). Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts?

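Ans 3(e) The naive forecasts (red line) follow the same seasonal pattern as the actual sales numbers (blue line) in the validation period, but they fall below the actual sales in every month: the test-set ME equals the MAE (7828.278), so every forecast error is positive.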
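The plotting code for this question is not shown, so the following is only a sketch of how the histogram of validation-period errors could be drawn in base R. It reuses the 2001 values printed under Q3(c); the axis labels and variable names are assumptions.

```r
# 2001 actual sales and seasonal naive forecasts, copied from the output under Q3(c)
actual <- c(10243.24, 11266.88, 21826.84, 17357.33, 15997.79, 18601.53,
            26155.15, 28586.52, 30505.41, 30821.33, 46634.38, 104660.67)
snaive_fc <- c(7615.03, 9849.69, 14558.40, 11587.33, 9332.56, 13082.09,
               16732.78, 19888.61, 23933.38, 25391.35, 36024.80, 80721.71)

# Forecast errors for the validation period (actual - forecast)
errors <- actual - snaive_fc

# Histogram of the 12 errors; hist() also returns the bin counts invisibly
h <- hist(errors, main = "", xlab = "Forecast error (actual - naive forecast)",
          ylab = "Frequency")

# Check the sign of the errors: TRUE means the forecast was too low in every month
all_positive <- all(errors > 0)
print(all_positive)
```

A histogram that sits entirely to the right of zero indicates systematic under-forecasting, which is what one would expect from a seasonal naive forecast on a series with an upward trend.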

Q3(f) The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for year 2002?

Ans 3(f) She must recombine the training and validation periods into one long series and rerun the chosen method/model on the complete data. This final model is then used to forecast year 2002.
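The recombine-and-refit step can be sketched in base R as follows. The data here are simulated placeholders (not the souvenir sales), the model form (log sales on a linear trend plus month dummies) is an assumption chosen for illustration, and all object names are hypothetical.

```r
set.seed(1)

# Simulated placeholder data: 84 months (7 years) with trend and monthly pattern
n <- 84
trend <- seq_len(n)
month <- factor((trend - 1) %% 12 + 1)
y <- exp(7.5 + 0.02 * trend + 0.1 * as.integer(month) + rnorm(n, sd = 0.1))
full <- data.frame(y = y, trend = trend, month = month)

# Step 1: model chosen and checked using the training period only (first 72 months)
fit_train <- lm(log(y) ~ trend + month, data = full[1:72, ])

# Step 2: once the model form is accepted, refit it on training + validation combined
fit_final <- lm(log(y) ~ trend + month, data = full)

# Step 3: forecast the next 12 months (t = 85..96) from the refit model
future <- data.frame(trend = 85:96,
                     month = factor(1:12, levels = levels(month)))
fc_next_year <- exp(predict(fit_final, newdata = future))

round(fc_next_year)
```

Refitting on the full series means the final model uses the most recent year of data and only has to forecast 12 months ahead rather than 24.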

4 Analysis of Canadian Manufacturing Workers Work-Hours: The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers. Which one model of the following regression-based models would fit the series best?

- Linear trend model
- Linear trend model with seasonality
- Quadratic trend model
- Quadratic trend model with seasonality

Ans4 The series is annual, so it has no seasonality, although it may contain a business cycle. At this point we do not know how to model cyclical behavior, and that option is not among the choices given. We therefore compare the linear trend model and the quadratic trend model and select the one with the lowest RMSE/MAE.

WorkHours.data <- read.csv("D:\\Google Drive\\FA\\assignment\\CanadianWorkHours.csv")
WorkHours.ts <- ts(WorkHours.data$Hoursperweek, start = c(1966,1), end = c(2000, 1), freq = 1)
WorkHours.ts

## Time Series:
## Start = 1966 
## End = 2000 
## Frequency = 1 
##  [1] 37.2 37.0 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 36.9 36.7 36.7 36.5
## [15] 36.3 35.9 35.8 35.9 36.0 35.7 35.6 35.2 34.8 35.3 35.6 35.6 35.6 35.9
## [29] 36.0 35.7 35.7 35.5 35.6 36.3 36.5

# Data partitioning
totalRecords <- length(WorkHours.ts)
nValidationRecords <- 5
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(WorkHours.ts, start = c(1966,1), end = c(1966,nTrainigRecords))
valid.ts <- window(WorkHours.ts, start = c (1966,nTrainigRecords+1), end = c(1966,totalRecords))

#linear trend model
train.lm <- tslm(train.ts ~ trend)
train.lm.pred <- forecast(train.lm, h = nValidationRecords, level = 0)
summary(train.lm)

## 
## Call:
## tslm(formula = train.ts ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95181 -0.31426  0.02328  0.28599  0.74808 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.668046   0.153308 245.702  < 2e-16 ***
## trend       -0.083315   0.008636  -9.648 2.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4094 on 28 degrees of freedom
## Multiple R-squared:  0.7687, Adjusted R-squared:  0.7605 
## F-statistic: 93.08 on 1 and 28 DF,  p-value: 2.111e-10

accuracy(train.lm.pred,valid.ts)

##                        ME      RMSE       MAE         MPE      MAPE
## Training set 9.473975e-16 0.3955153 0.3226642 -0.01191583 0.8913215
## Test set     1.001342e+00 1.1216734 1.0013422  2.77249855 2.7724986
##                  MASE      ACF1 Theil's U
## Training set 1.585977 0.7728777        NA
## Test set     4.921852 0.4331874  3.166203

#Quadratic trend model
train.poly.lm <- tslm(train.ts ~ trend + I(trend^2))
train.poly.lm.pred <- forecast(train.poly.lm, h = nValidationRecords, level = 0)
summary(train.poly.lm)

## 
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91631 -0.22458  0.04514  0.26864  0.54400 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.982414   0.231481 164.084  < 2e-16 ***
## trend       -0.142259   0.034422  -4.133 0.000311 ***
## I(trend^2)   0.001901   0.001077   1.765 0.088905 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3948 on 27 degrees of freedom
## Multiple R-squared:  0.7927, Adjusted R-squared:  0.7773 
## F-statistic: 51.61 on 2 and 27 DF,  p-value: 5.958e-10

accuracy(train.poly.lm.pred,valid.ts)

##                    ME      RMSE       MAE         MPE     MAPE     MASE
## Training set 0.000000 0.3745042 0.3042046 -0.01072127 0.836695 1.495243
## Test set     0.557678 0.6987179 0.5576780  1.53970072 1.539701 2.741129
##                   ACF1 Theil's U
## Training set 0.7384102        NA
## Test set     0.4135712  1.993851

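Comparing the validation-period errors, the quadratic trend model has the lower RMSE (0.699 vs. 1.122) and the lower MAE (0.558 vs. 1.001), so we select the quadratic trend model.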
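The comparison can also be reproduced without the forecast package. The sketch below copies the 35 annual values from the printed series above and fits both trend models with `lm()` on the first 30 years, on the assumption that the `trend` term in `tslm()` is simply the time index 1, 2, 3, and so on. The object names are illustrative.

```r
# Weekly work hours, 1966-2000, copied from the printed series above
hours <- c(37.2, 37.0, 37.4, 37.5, 37.7, 37.7, 37.4, 37.2, 37.3, 37.2, 36.9, 36.7, 36.7, 36.5,
           36.3, 35.9, 35.8, 35.9, 36.0, 35.7, 35.6, 35.2, 34.8, 35.3, 35.6, 35.6, 35.6, 35.9,
           36.0, 35.7, 35.7, 35.5, 35.6, 36.3, 36.5)
dat <- data.frame(hours = hours, trend = seq_along(hours))

# Same split as above: last 5 years held out for validation
train <- dat[1:30, ]
valid <- dat[31:35, ]

fit_lin  <- lm(hours ~ trend, data = train)
fit_quad <- lm(hours ~ trend + I(trend^2), data = train)

# Validation RMSE for each model
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
rmse_lin  <- rmse(valid$hours, predict(fit_lin, newdata = valid))
rmse_quad <- rmse(valid$hours, predict(fit_quad, newdata = valid))

round(c(linear = rmse_lin, quadratic = rmse_quad), 3)
```

The two values should be close to the test-set RMSEs reported above (about 1.12 for the linear model and about 0.70 for the quadratic model).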

5 Forecasting Australian Wine Sales:

5(a) Which forecasting method would you choose if you had to choose the same method for all series? Why?

After carefully examining the six wine sales series, we see seasonality and different kinds of trend. If we must use the same method for all series, the one that may do the job with the least risk of under- or over-estimating is the seasonal naive method.

5(b) Fortified wine has the largest market share of the six types of wine considered. You are asked to focus on fortified wine sales alone and produce as accurate as possible forecasts for the next 2 months.

par(op) 
AustralianWines.data <- read.csv("D:\\Google Drive\\FA\\assignment\\AustralianWines.csv")
fortifiedwines.ts <- ts(AustralianWines.data$Fortified, start = c(1980,1), end = c(1994, 12), freq = 12)

#Start by partitioning the data using the period until December 1993 as the training period.
totalRecords <- length(fortifiedwines.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(fortifiedwines.ts, start = c(1980,1), end = c(1980,nTrainigRecords))
valid.ts <- window(fortifiedwines.ts, start = c (1980,nTrainigRecords+1), end = c(1980,totalRecords))

############# Linear Trend Model with Season ##################
#tslm function (which stands for time series linear model)
train.lm <- tslm(train.ts ~ trend + season)
# based on this model try to predict next 12 records i.e. validation period
train.lm.pred <- forecast(train.lm, h = nValidationRecords, level = 0)

## now use this model and run it over on combined data
fortifiedwines.lm.tns <- tslm(fortifiedwines.ts ~ trend + season)
fortifiedwines.lm.tns.pred <- forecast(fortifiedwines.lm.tns, h = 2, level = 0)

##       Jan  Feb
## 1995  837 1192

## Predicted Sales in Jan 1995 and Feb 1995 are 837000 litres and 1192000 litres respectively

5(b)(i) Create the “actual vs. forecast” plot. What can you say about model fit?

## actual (blue line) vs. forecast (red line) plot

## 
## Call:
## tslm(formula = fortifiedwines.ts ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -647.3 -200.8  -22.1  168.0  975.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2671.3344    87.7569  30.440  < 2e-16 ***
## trend        -10.1365     0.4417 -22.951  < 2e-16 ***
## season2      365.0698   112.1788   3.254  0.00138 ** 
## season3      794.0730   112.1814   7.078 3.85e-11 ***
## season4     1080.0761   112.1857   9.628  < 2e-16 ***
## season5     1593.0793   112.1918  14.200  < 2e-16 ***
## season6     1676.4157   112.1997  14.941  < 2e-16 ***
## season7     2355.8189   112.2092  20.995  < 2e-16 ***
## season8     2054.0220   112.2205  18.303  < 2e-16 ***
## season9     1128.7585   112.2336  10.057  < 2e-16 ***
## season10     921.8950   112.2483   8.213 5.64e-14 ***
## season11    1395.8315   112.2648  12.433  < 2e-16 ***
## season12    1569.7013   112.2831  13.980  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 307.2 on 167 degrees of freedom
## Multiple R-squared:  0.8842, Adjusted R-squared:  0.8759 
## F-statistic: 106.3 on 12 and 167 DF,  p-value: < 2.2e-16

##              ME     RMSE      MAE        MPE     MAPE      MASE      ACF1
## Training set  0 295.9102 228.2863 -0.5528882 7.605789 0.8340494 0.1772564

The model seems to capture the seasonality; however, the fitted values differ from the actual sales at the seasonal peaks and troughs.
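The January and February 1995 forecasts reported earlier can be checked by hand against the coefficients in the regression summary above. This sketch assumes the trend index runs 1 to 180 over January 1980 to December 1994, so January 1995 is t = 181, and that January is the baseline month (no `season1` coefficient). The variable names are illustrative.

```r
# Coefficients copied from the full-series regression summary above
intercept  <- 2671.3344
trend_coef <- -10.1365
season2    <- 365.0698   # February effect relative to January (the baseline month)

# The full series has 180 months, so the next two months are t = 181 and t = 182
t_jan <- 181
t_feb <- 182

fc_jan <- intercept + trend_coef * t_jan
fc_feb <- intercept + season2 + trend_coef * t_feb

round(c(Jan1995 = fc_jan, Feb1995 = fc_feb))  # 837 and 1192
```

These match the forecast output of 837 and 1192 for January and February 1995.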

5(b)(ii) Use the regression model to forecast sales in January and February 1994.

##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1994  919 1270 1690 1936 2497 2558 3266 2962 1951 1747 2188 2387

## Predicted Sales in Jan 1994 and Feb 1994 are 919,000 litres and 1,270,000 litres respectively

6 Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001. Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

6(a) Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Use this model to forecast the sales in February 2002.

SouvenirSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\SouvenirSales.csv")
SouvenirSales.ts <- ts(SouvenirSales.data$Sales, start = c(1995,1), end = c(2001, 12), freq = 12)
# Data partitioning
totalRecords <- length(SouvenirSales.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(SouvenirSales.ts, start = c(1995,1), end = c(1995,nTrainigRecords))
valid.ts <- window(SouvenirSales.ts, start = c (1995,nTrainigRecords+1), end = c(1995,totalRecords))
#Regression model using log of sales
logSouvenirSales.lm <- tslm(train.ts ~ season + trend,lambda = 0)
logSouvenirSales.lm.pred <- forecast(logSouvenirSales.lm, h = nValidationRecords, level = 0)
summary(logSouvenirSales.lm)

## 
## Call:
## tslm(formula = train.ts ~ season + trend, lambda = 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2e+10 on 12 and 59 DF,  p-value: < 2.2e-16

accuracy(logSouvenirSales.lm.pred,valid.ts)

##                    ME     RMSE      MAE       MPE     MAPE     MASE
## Training set  197.519 2865.154 1671.185 -1.472819 13.94047 0.446268
## Test set     4824.494 7101.444 5191.669 12.359434 15.51910 1.386367
##                   ACF1 Theil's U
## Training set 0.4381370        NA
## Test set     0.4245018 0.4610253

## Prediction using only training period data for month of feb 2002
logSouvenirSales.lm.pred <- forecast(logSouvenirSales.lm, h = nValidationRecords+2, level = 0)
cat(paste0("Expected Sales of Souvenirs in month of feb 2002 : ",round(logSouvenirSales.lm.pred$mean[14],0), " units"))

## Expected Sales of Souvenirs in month of feb 2002 : 17063 units
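The February 2002 figure can be traced back to the coefficients in the summary above. Because the model is fit to log(Sales) on the 72 training months only, February 2002 is 14 steps past the end of the training period (hence `h = nValidationRecords+2` and `$mean[14]`), and the prediction on the log scale has to be exponentiated. The sketch below assumes the trend index runs 1 to 72 over the training period; the variable names are illustrative.

```r
# Coefficients copied from the log-sales regression summary above
intercept  <- 7.646363
season2    <- 0.282015   # February effect relative to January (the baseline month)
trend_coef <- 0.021120

# Training period: 72 months (Jan 1995 to Dec 2000); Feb 2002 is 14 months later
t_feb2002 <- 72 + 14

# Prediction on the log scale, then back-transformed to the sales scale
log_fc <- intercept + season2 + trend_coef * t_feb2002
fc <- exp(log_fc)

print(round(log_fc, 4))
print(fc)
```

The back-transformed value should land within about one unit of the 17063 reported above; any small gap comes from using the rounded coefficients printed in the summary.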

CBA/B7/Term 2- FA1/RI/Home Work Assignment

Anurag Singhvi, Deb, Soubhagya Rout, Vineet Garg

Feb 07, 2017