Q1. Forecasting Shampoo Sales: The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a 3-year period.

Q1(a): Create a well-formatted time plot of the data.

library(forecast)  # provides tslm() and forecast()
ShampooSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\ShampooSales.csv")
ShampooSales.ts <- ts(ShampooSales.data$Shampoo.Sales, start = c(1995,1), end = c(1997, 12), freq = 12)
plot(ShampooSales.ts, xlab = "Time", ylab = "Shampoo Sales", bty = "l")
ShampooSales.lm <- tslm(ShampooSales.ts ~ trend + I(trend^2))  # quadratic trend model
lines(ShampooSales.lm$fitted, lwd = 2, col = "red")            # overlay the fitted quadratic trend

The time plot below shows an increasing, roughly quadratic trend.

Q1(b): Which of the four components (level, trend, seasonality, noise) seem to be present in this series?

All four components are present in this time series, as can be seen from the decomposed components below.

Q1(c): Do you expect to see seasonality in sales of shampoo? Why?

Though a little non-intuitive at first, a closer look at the monthly seasonal components below shows that there is seasonality in shampoo sales. Three possible reasons:

* People bathe more during summer and hence buy more shampoo during those months (June, July and August all show positive seasonal components in the table below).
* People tend to buy more during off-season sale periods such as October and November, just before winter.
* April also shows a positive component, which suggests people stocked up on shampoo after deferring purchases during the winter months.

##             Jan        Feb        Mar        Apr        May        Jun
## 1995 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1996 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1997 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1995   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1996   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1997   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424

Q2. Forecasting Shampoo Sales: The file ShampooSales.xls contains data on the monthly sales of a certain shampoo over a three-year period. If the goal is forecasting sales in future months, which of the following steps should be taken? (choose one or more)

Q2(a): Partitioning the data into training and validation periods

Yes, this step is very important and will help us measure the accuracy of our forecasting model.

Q2(b): Examine time plots of the series and of model forecasts only for the training period?

No, we need to examine the time plots for both the training and validation periods, and review the model forecasts for both periods as well.

# Decompose the series, then partition the data and fit a quadratic trend model
ShampooSales.plots = decompose(ShampooSales.ts)
ShampooSales.plots$seasonal
##             Jan        Feb        Mar        Apr        May        Jun
## 1995 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1996 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1997 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1995   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1996   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1997   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
plot(ShampooSales.plots)

# Yearly aggregate, to look at the overall trend without the seasonal component
ShampooSales.tsyear <- aggregate(ShampooSales.ts, nfrequency = 1, FUN = mean)
plot(ShampooSales.tsyear, xlab = "Time", ylab = "ShampooSales", ylim = c(100, 800), bty = "l")
totalRecords <- length(ShampooSales.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(ShampooSales.ts, start = c(1995,1), end = c(1995,nTrainigRecords))
valid.ts <- window(ShampooSales.ts, start = c (1995,nTrainigRecords+1), end = c(1995,totalRecords))
ShampooSales.lm <- tslm(train.ts ~ trend + I(trend^2))
ShampooSales.lm.pred <- forecast(ShampooSales.lm, h = nValidationRecords, level = 0)
plot(ShampooSales.lm.pred, ylim = c(100, 800),ylab = "ShampooSales", xlab = "Time", bty = "l", xaxt = "n", xlim = c(1995,1998), main ="", flty = 2) 
axis(1, at = seq(1995, 1998, 1), labels = format(seq(1995, 1998, 1))) 
lines(ShampooSales.lm$fitted, lwd = 2) 
lines(valid.ts)

Q2(c): Look at MAPE and RMSE values for the training period

No. MAPE and RMSE for the training period only measure how well the model fits the data it was trained on; they are not a reliable guide to predictive accuracy on new periods, which is what matters for forecasting.

Q2(d): Look at MAPE and RMSE values for the validation period

Yes, this is very important. It serves as a more objective basis than the training period to assess predictive accuracy (because records in the validation period are not used to select predictors or to estimate model parameters).
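These error measures can be computed with the forecast package's accuracy() function; the "Test set" row corresponds to the validation period (a sketch, assuming the ShampooSales.lm.pred and valid.ts objects created in the Q2(b) code above):

accuracy(ShampooSales.lm.pred, valid.ts)  # "Test set" row = validation-period ME, RMSE, MAE, MPE, MAPE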

Q2(e): Compute naive forecasts

Yes, it is a good idea to compute naive forecasts, as they may serve two purposes (see the sketch below):

* As the actual forecasts of the series. Naive forecasts, which are simple to understand and easy to implement, can sometimes achieve sufficiently useful accuracy levels. Following the principle of “the simplest method that does the job”, naive forecasts are a serious contender.
* As a baseline. When evaluating the predictive performance of a certain method, it is important to compare it to some baseline. Naive forecasts should always be considered as a baseline, and the comparative advantage of any other methods considered should be clearly shown.
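A minimal sketch of both flavours of naive baseline on the shampoo series (assuming the train.ts, valid.ts and nValidationRecords objects from the Q2(b) code above; naive() and snaive() are from the forecast package):

naive.pred  <- naive(train.ts, h = nValidationRecords)   # forecast = last observed value
snaive.pred <- snaive(train.ts, h = nValidationRecords)  # forecast = value of the same month one year earlier
accuracy(naive.pred, valid.ts)["Test set", c("RMSE", "MAPE")]   # baseline errors to beat
accuracy(snaive.pred, valid.ts)["Test set", c("RMSE", "MAPE")]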

Q3. Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001. Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation period containing the last 12 months of data (year 2001). She then fit a forecasting model to sales, using the training period. Partition the data into the training and validation periods as explained above.

Q3(a): Why was the data partitioned?

To address the problem of overfitting, an important step before applying any forecasting method is data partitioning: the series is split into two periods. The analyst develops her forecasting model using only the training period. Once she has a model, she tries it out on the validation period and sees how it performs. In particular, she can measure the forecast errors, which are the differences between the predicted and actual values.

Q3(b): Why did the analyst choose a 12-month validation period?

She chose a 12-month validation period because (a) the forecast is required for the next 12 months, and (b) a full year of validation data covers every calendar month, so the model can be tested against the complete monthly seasonal pattern.

Q3(c): What is the naive forecast for the validation period? (assume that you must provide forecasts for 12 months ahead)

SouvenirSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\SouvenirSales.csv")
SouvenirSales.ts <- ts(SouvenirSales.data$Sales, start = c(1995,1), end = c(2001, 12), freq = 12)
nValid <- 12
nTrain <- length(SouvenirSales.ts) - nValid
train.ts <- window(SouvenirSales.ts, start = c(1995, 1), end = c(1995,nTrain))
valid.ts <- window(SouvenirSales.ts, start = c(1995, nTrain + 1), end = c(1995,nTrain + nValid))
snaive.pred <- snaive(train.ts, h = nValid)  # seasonal naive: each month is forecast by the same month one year earlier
## [1] "Actual Sales"
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2001  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##            Aug       Sep       Oct       Nov       Dec
## 2001  28586.52  30505.41  30821.33  46634.38 104660.67
## [1] "Seasonal Naive Sales Forecasting"
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78
##           Aug      Sep      Oct      Nov      Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
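The printed comparison above can be reproduced along these lines (a sketch, using the objects defined in the code block just shown):

print("Actual Sales")
valid.ts                     # actual monthly sales in 2001
print("Seasonal Naive Sales Forecasting")
snaive.pred$mean             # seasonal naive forecasts for 2001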

Q3(d): Compute the RMSE and MAPE for the naive forecasts

RMSE for Validation Period: 9,542 and MAPE: 27.3%

##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set     7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
##                   ACF1 Theil's U
## Training set 0.4140974        NA
## Test set     0.2264895 0.7373759
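The table above can be reproduced with the forecast package's accuracy() function (assuming snaive.pred and valid.ts from Q3(c)):

accuracy(snaive.pred, valid.ts)  # "Test set" row = validation period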

Q3(e): Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period). Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts?

The seasonal naive forecasts (red line) follow the same pattern/trend as the actual sales numbers (blue line), but sit consistently below them, i.e. they under-forecast by a noticeable margin throughout the validation period.
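A sketch of the code behind these plots (assuming snaive.pred and valid.ts from Q3(c)):

# Histogram of forecast errors in the validation period
hist(valid.ts - snaive.pred$mean, xlab = "Forecast error", main = "Naive forecast errors (validation period)")
# Time plot: actual sales (blue) vs. seasonal naive forecasts (red)
plot(valid.ts, col = "blue", ylab = "Sales", xlab = "Time", bty = "l")
lines(snaive.pred$mean, col = "red", lty = 2, lwd = 2)
legend("topleft", legend = c("Actual", "Seasonal naive forecast"), col = c("blue", "red"), lty = c(1, 2), bty = "n")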

Q3(f): The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for year 2002?

She must combine the training and validation periods into a single time series and then rerun the chosen model on the complete data. This final model is then used to forecast for year 2002.
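A sketch of that final step, with the seasonal naive method from Q3(c) standing in for whatever model she chose (SouvenirSales.ts is the full 1995-2001 series created there):

final.pred <- snaive(SouvenirSales.ts, h = 12)  # refit the chosen method on training + validation data
final.pred$mean                                 # forecasts for the 12 months of 2002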

Q4. Analysis of Canadian Manufacturing Workers Work-Hours: The time series plot in the given figure describes the average annual number of weekly hours spent by Canadian manufacturing workers. Which of the following regression-based models would fit the series best?

Quadratic trend model. This is a yearly series, so there is no seasonality; it may be cyclical in nature (business cycles of roughly 4-5 years), but we do not yet know how to handle cyclical behaviour and no such model is among the given options. We choose the quadratic trend model because the series has both decreasing and increasing stretches (e.g., it decreases between roughly 1974 and 1988 and then increases). The validation RMSE and MAPE are also better for the quadratic trend model than for the linear trend model, as seen below.

WorkHours.data <- read.csv("D:\\Google Drive\\FA\\assignment\\CanadianWorkHours.csv")
WorkHours.ts <- ts(WorkHours.data$Hoursperweek, start = c(1966,1), end = c(2000, 1), freq = 1)
WorkHours.ts
## Time Series:
## Start = 1966 
## End = 2000 
## Frequency = 1 
##  [1] 37.2 37.0 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 36.9 36.7 36.7 36.5
## [15] 36.3 35.9 35.8 35.9 36.0 35.7 35.6 35.2 34.8 35.3 35.6 35.6 35.6 35.9
## [29] 36.0 35.7 35.7 35.5 35.6 36.3 36.5
# Data Partitioning
totalRecords <- length(WorkHours.ts)
nValidationRecords <- 5
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(WorkHours.ts, start = c(1966,1), end = c(1966,nTrainigRecords))
valid.ts <- window(WorkHours.ts, start = c (1966,nTrainigRecords+1), end = c(1966,totalRecords))

#linear trend model
train.lm <- tslm(train.ts ~ trend)
train.lm.pred <- forecast(train.lm, h = nValidationRecords, level = 0)
summary(train.lm)
## 
## Call:
## tslm(formula = train.ts ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95181 -0.31426  0.02328  0.28599  0.74808 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.668046   0.153308 245.702  < 2e-16 ***
## trend       -0.083315   0.008636  -9.648 2.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4094 on 28 degrees of freedom
## Multiple R-squared:  0.7687, Adjusted R-squared:  0.7605 
## F-statistic: 93.08 on 1 and 28 DF,  p-value: 2.111e-10
accuracy(train.lm.pred,valid.ts)
##                        ME      RMSE       MAE         MPE      MAPE
## Training set 9.473975e-16 0.3955153 0.3226642 -0.01191583 0.8913215
## Test set     1.001342e+00 1.1216734 1.0013422  2.77249855 2.7724986
##                  MASE      ACF1 Theil's U
## Training set 1.585977 0.7728777        NA
## Test set     4.921852 0.4331874  3.166203
#Quadratic trend model
train.poly.lm <- tslm(train.ts ~ trend + I(trend^2))
train.poly.lm.pred <- forecast(train.poly.lm, h = nValidationRecords, level = 0)
summary(train.poly.lm)
## 
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91631 -0.22458  0.04514  0.26864  0.54400 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.982414   0.231481 164.084  < 2e-16 ***
## trend       -0.142259   0.034422  -4.133 0.000311 ***
## I(trend^2)   0.001901   0.001077   1.765 0.088905 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3948 on 27 degrees of freedom
## Multiple R-squared:  0.7927, Adjusted R-squared:  0.7773 
## F-statistic: 51.61 on 2 and 27 DF,  p-value: 5.958e-10
accuracy(train.poly.lm.pred,valid.ts)
##                    ME      RMSE       MAE         MPE     MAPE     MASE
## Training set 0.000000 0.3745042 0.3042046 -0.01072127 0.836695 1.495243
## Test set     0.557678 0.6987179 0.5576780  1.53970072 1.539701 2.741129
##                   ACF1 Theil's U
## Training set 0.7384102        NA
## Test set     0.4135712  1.993851

Q5. Forecasting Australian Wine Sales:

Q5(a): Which forecasting method would you choose if you had to choose the same method for all series? Why?

After carefully observing the trends of the six wine series, we identify seasonality and different kinds of trends. If we had to use the same method for all series, the one most likely to do the job with the least risk of under- or over-estimating is the seasonal naive method; a sketch of applying it to every series follows.
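A minimal sketch of applying the seasonal naive method to every series (assuming AustralianWines.csv as read in Q5(b) below; the column layout, including a Month column, is an assumption):

AustralianWines.data <- read.csv("D:\\Google Drive\\FA\\assignment\\AustralianWines.csv")
for (wine in setdiff(names(AustralianWines.data), "Month")) {    # "Month" column name is assumed
  wine.ts  <- ts(AustralianWines.data[[wine]], start = c(1980, 1), end = c(1994, 12), freq = 12)
  train.ts <- window(wine.ts, end = c(1993, 12))                 # train through Dec 1993
  valid.ts <- window(wine.ts, start = c(1994, 1))                # validate on 1994
  cat(wine, "validation RMSE:", accuracy(snaive(train.ts, h = 12), valid.ts)["Test set", "RMSE"], "\n")
}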

Q5(b): Fortified wine has the largest market share of the six types of wine considered. You are asked to focus on fortified wine sales alone and produce forecasts that are as accurate as possible for the next 2 months.

AustralianWines.data <- read.csv("D:\\Google Drive\\FA\\assignment\\AustralianWines.csv")
fortifiedwines.ts <- ts(AustralianWines.data$Fortified, start = c(1980,1), end = c(1994, 12), freq = 12)

#Start by partitioning the data using the period until December 1993 as the training period.
totalRecords <- length(fortifiedwines.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(fortifiedwines.ts, start = c(1980,1), end = c(1980,nTrainigRecords))
valid.ts <- window(fortifiedwines.ts, start = c (1980,nTrainigRecords+1), end = c(1980,totalRecords))

############# Linear Trend Model with Season ##################
#tslm function (which stands for time series linear model)
train.lm <- tslm(train.ts ~ trend + season)
# Use this model to forecast the next 12 months, i.e. the validation period (1994)
train.lm.pred <- forecast(train.lm, h = nValidationRecords, level = 0)

## Now refit the same model on the combined (training + validation) data and forecast the next 2 months (Jan-Feb 1995)
fortifiedwines.lm.tns <- tslm(fortifiedwines.ts ~ trend + season)
fortifiedwines.lm.tns.pred <- forecast(fortifiedwines.lm.tns, h = 2, level = 0)
##       Jan  Feb
## 1995  837 1192
## Predicted Sales in Jan 1995 and Feb 1995 are 837,000 litres and 1,192,000 litres respectively

Q5(b)(i): Create the “actual vs. forecast” plot. What can you say about model fit?

## Actual (blue line) vs. forecast (red line) plot

## 
## Call:
## tslm(formula = fortifiedwines.ts ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -647.3 -200.8  -22.1  168.0  975.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2671.3344    87.7569  30.440  < 2e-16 ***
## trend        -10.1365     0.4417 -22.951  < 2e-16 ***
## season2      365.0698   112.1788   3.254  0.00138 ** 
## season3      794.0730   112.1814   7.078 3.85e-11 ***
## season4     1080.0761   112.1857   9.628  < 2e-16 ***
## season5     1593.0793   112.1918  14.200  < 2e-16 ***
## season6     1676.4157   112.1997  14.941  < 2e-16 ***
## season7     2355.8189   112.2092  20.995  < 2e-16 ***
## season8     2054.0220   112.2205  18.303  < 2e-16 ***
## season9     1128.7585   112.2336  10.057  < 2e-16 ***
## season10     921.8950   112.2483   8.213 5.64e-14 ***
## season11    1395.8315   112.2648  12.433  < 2e-16 ***
## season12    1569.7013   112.2831  13.980  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 307.2 on 167 degrees of freedom
## Multiple R-squared:  0.8842, Adjusted R-squared:  0.8759 
## F-statistic: 106.3 on 12 and 167 DF,  p-value: < 2.2e-16
##              ME     RMSE      MAE        MPE     MAPE      MASE      ACF1
## Training set  0 295.9102 228.2863 -0.5528882 7.605789 0.8340494 0.1772564

The model seems to capture the seasonality, although the fitted values miss the extremes somewhat: there are differences between fitted and actual values at the seasonal peaks and troughs.
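A sketch of the plotting code for the “actual vs. forecast” plot, following the colour scheme in the caption above (assuming train.lm.pred from the Q5(b) partition):

plot(fortifiedwines.ts, col = "blue", ylab = "Fortified wine sales", xlab = "Time", bty = "l")  # actual (blue)
lines(train.lm.pred$fitted, col = "red", lwd = 2)         # model fit over the training period (red)
lines(train.lm.pred$mean, col = "red", lwd = 2, lty = 2)  # forecasts over the validation period (red, dashed)
legend("topright", legend = c("Actual", "Forecast"), col = c("blue", "red"), lty = 1, bty = "n")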

Q5(b)(ii): Use the regression model to forecast sales in January and February 1994.

##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1994  919 1270 1690 1936 2497 2558 3266 2962 1951 1747 2188 2387
## Predicted Sales in Jan 1994 and Feb 1994 are 919,000 litres and 1,270,000 litres respectively
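The 1994 forecasts above come from the model fit on the training period only (a sketch, assuming train.lm.pred from the Q5(b) code):

round(train.lm.pred$mean)  # forecast(train.lm, h = 12): monthly forecasts for Jan-Dec 1994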

Q6. Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001. Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

Q6(a): Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Use this model to forecast the sales in February 2002.

SouvenirSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\SouvenirSales.csv")
SouvenirSales.ts <- ts(SouvenirSales.data$Sales, start = c(1995,1), end = c(2001, 12), freq = 12)
# Data Partitioning
totalRecords <- length(SouvenirSales.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(SouvenirSales.ts, start = c(1995,1), end = c(1995,nTrainigRecords))
valid.ts <- window(SouvenirSales.ts, start = c (1995,nTrainigRecords+1), end = c(1995,totalRecords))
# Regression model on log(Sales): lambda = 0 applies a log (Box-Cox) transform to the series
logSouvenirSales.lm <- tslm(train.ts ~ season + trend,lambda = 0)
logSouvenirSales.lm.pred <- forecast(logSouvenirSales.lm, h = nValidationRecords, level = 0)
summary(logSouvenirSales.lm)
## 
## Call:
## tslm(formula = train.ts ~ season + trend, lambda = 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2e+10 on 12 and 59 DF,  p-value: < 2.2e-16
accuracy(logSouvenirSales.lm.pred,valid.ts)
##                    ME     RMSE      MAE       MPE     MAPE     MASE
## Training set  197.519 2865.154 1671.185 -1.472819 13.94047 0.446268
## Test set     4824.494 7101.444 5191.669 12.359434 15.51910 1.386367
##                   ACF1 Theil's U
## Training set 0.4381370        NA
## Test set     0.4245018 0.4610253
## Forecast for February 2002 using only the training-period model (h = 14: the 12 validation months + 2 more)
logSouvenirSales.lm.pred <- forecast(logSouvenirSales.lm, h = nValidationRecords+2, level = 0)
cat(paste0("Expected Sales of Souvenirs in month of feb 2002 : ",round(logSouvenirSales.lm.pred$mean[14],0), " units"))
## Expected Sales of Souvenirs in month of feb 2002 : 17063 units

Q6(b): Create an ACF plot until lag-15 for the forecast errors. Now fit an AR model with lag 2 [ARIMA(2,0,0)] to the forecast errors.

  • ACF plot (up to lag 15) of the forecast errors from the regression model
totalRecords <- length(SouvenirSales.ts)
nValidationRecords <- 15
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(SouvenirSales.ts, start = c(1995,1), end = c(1995,nTrainigRecords))
valid.ts <- window(SouvenirSales.ts, start = c (1995,nTrainigRecords+1), end = c(1995,totalRecords))
train.lm <- tslm(train.ts ~ season + trend,lambda = 0)
train.lm.pred <- forecast(train.lm, h = nValidationRecords)
Acf(residuals(train.lm.pred),lag.max = nValidationRecords, main = "") 

  • Fit an AR model with lag 2 [ARIMA(2,0,0)] to the forecast errors.

train.arima <- Arima(train.lm$residuals, order = c(2,0,0))
train.arima.pred <- forecast(train.arima, h = nValidationRecords+1)
summary(train.arima)
## Series: train.lm$residuals 
## ARIMA(2,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2  intercept
##       0.2994  0.3623     0.0036
## s.e.  0.1142  0.1182     0.0488
## 
## sigma^2 estimated as 0.02132:  log likelihood=36.12
## AIC=-64.25   AICc=-63.62   BIC=-55.31
## 
## Training set error measures:
##                       ME   RMSE       MAE      MPE     MAPE     MASE
## Training set 0.005294917 0.1428 0.1157806 52.38468 173.3156 0.623725
##                    ACF1
## Training set 0.01731792

Examining the ACF plot and the coefficients of the AR(2) model (and their statistical significance), what can we learn about the regression model forecasts?

The ACF shows positive autocorrelation at the low lags, and both AR coefficients are statistically significant (ar1 = 0.30 with s.e. 0.11, ar2 = 0.36 with s.e. 0.12, each more than two standard errors from zero). The regression forecast errors are therefore not white noise: the regression model leaves systematic information in the residuals, and an AR(2) model of those residuals can capture it and be used to improve the forecasts.

Use the autocorrelation information to compute an improved forecast for January 2002, using the regression model and the AR(2) model above.

pred <- forecast(train.lm, h = nValidationRecords+1)
cat(paste0("Improved Sales forecast of Souvenirs in month of jan 2002 : ",round(exp(train.arima.pred$mean[16]) + pred$mean[16]), " units"))
## Improved Sales forecast of Souvenirs in month of jan 2002 : 12058 units
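One caveat: because the tslm was fit with lambda = 0, its residuals (and hence the AR(2) forecast of them) are on the log scale, whereas pred$mean is already back-transformed to the original scale. An arguably more consistent way to combine the two models is to apply the AR(2) correction on the log scale and then exponentiate; this is a suggested alternative sketch, not the computation shown above:

improved <- exp(log(pred$mean[16]) + train.arima.pred$mean[16])  # multiplicative AR(2) adjustment on the log scale
round(improved)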