# this code basically installed packages if not installed already and load the mentioned packages
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load("moments","extRemes","stringi", "ggplot2", "TTR", "forecast","zoo","xts")
op <- par(oma=c(5,7,1,1))
ShampooSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\ShampooSales.csv")
ShampooSales.ts <- ts(ShampooSales.data$Shampoo.Sales, start = c(1995,1), end = c(1997, 12), freq = 12)
ShampooSales.lm <- tslm(ShampooSales.ts ~ trend + I(trend^2))
par(op)
Above graph clearly showed that this is a Upward Exponantial Trend with additive Seasonality.
All 4 levels presnts in this time series
Yes, we expects seasonality in sales of shampoo. After decomposing time series
## Jan Feb Mar Apr May Jun
## 1995 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1996 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1997 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## Jul Aug Sep Oct Nov Dec
## 1995 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1996 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1997 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
There are 3 reasons of seasonality 1. People buy more shampoo during summer time (like jun, july and august) as shown in above table
2. People tends to buy more during Off season sales periods like Oct and/or November 3. April shows positive numbers which indicates people stocked up shampoo as they deffered their purchases during winter season.
Step 1: partition the data into training and validation periods
#Data Partioning
ShampooSales.plots = decompose(ShampooSales.ts)
ShampooSales.plots$seasonal
## Jan Feb Mar Apr May Jun
## 1995 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1996 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1997 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## Jul Aug Sep Oct Nov Dec
## 1995 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1996 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1997 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
plot(ShampooSales.plots)
ShampooSales.tsyear <- aggregate(ShampooSales.ts, nfrequency=1, FUN=mean)
ShampooSales.ts.zoom <- window(ShampooSales.tsyear, start = c(1995, 1), end = c(1997, 12))
## Warning in window.default(x, ...): 'end' value not changed
dev.new()
plot(ShampooSales.ts.zoom, xlab = "Time", ylab = "ShampooSales", ylim = c(100, 800), bty = "l")
dev.off()
## png
## 2
totalRecords <- length(ShampooSales.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(ShampooSales.ts, start = c(1995,1), end = c(1995,nTrainigRecords))
valid.ts <- window(ShampooSales.ts, start = c (1995,nTrainigRecords+1), end = c(1995,totalRecords))
ShampooSales.lm <- tslm(train.ts ~ trend + I(trend^2))
ShampooSales.lm.pred <- forecast(ShampooSales.lm, h = nValidationRecords, level = 0)
plot(ShampooSales.lm.pred, ylim = c(100, 800),ylab = "ShampooSales", xlab = "Time", bty = "l", xaxt = "n", xlim = c(1995,1998), main ="", flty = 2)
axis(1, at = seq(1995, 1998, 1), labels = format(seq(1995, 1998, 1)))
lines(ShampooSales.lm$fitted, lwd = 2)
lines(valid.ts)
Step 2: examine time plots of the series and of model forecasts only for the training period?
No, we need to combine time series data for both training period and validation period. If only the training period is used to generate forecasts, then it will require forecasting further into the future.
Step 3: look at MAPE and RMSE values for the training period No we don’t need to look MAPE and RMSE values for training period.
Step 4: look at MAPE and RMSE values for the validation period it serves as a more objective basis than the training period to assess predictive accuracy (because records in the validation period are not used to select predictors or to estimate model parameters).
step 5: compute naive forecasts Yes We need naive forecasts becuase it may serves 2 purposes 1. As the actual forecasts of the series. Naive forecasts, which are simple to understand and easy to implement, can sometimes achieve sufficiently useful accuracy levels. Following the principle of “the simplest method that does the job”, naive forecasts are a serious contender. 2. As a baseline. When evaluating the predictive performance of a certain method, it is important to compare it to some baseline. Naive forecasts should always be considered as a baseline, and the comparative advantage of any other methods considered should be clearly shown.
Q3(a) Why was the data partitioned? Ans 3(a) To address the problem of overfitting, an important step before applying any forecasting method is data partitioning, The series is split into two periods. She develop her forecasting model using only one of the periods. After she has a model, She try it out on another period and see how it performs. In particular, she can measure the forecast errors, which are the differences between the predicted values and the actual values.
Q3(b) Why did the analyst choose a 12-month validation period Ans 3(b)She took 12 month validation period to test forecasting mode to cover all the monthly seasonal trend.
Q3(c) What is the naive forecast for the validation period? (assume that you must provide forecasts for 12 months ahead)
SouvenirSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\SouvenirSales.csv")
SouvenirSales.ts <- ts(SouvenirSales.data$Sales, start = c(1995,1), end = c(2001, 12), freq = 12)
nValid <- 12
nTrain <- length(SouvenirSales.ts) - nValid
train.ts <- window(SouvenirSales.ts, start = c(1995, 1), end = c(1995,nTrain))
valid.ts <- window(SouvenirSales.ts, start = c(1995, nTrain + 1), end = c(1995,nTrain + nValid))
snaive.pred <- snaive(train.ts, h = nValid)
## [1] "Actual Sales"
## Jan Feb Mar Apr May Jun Jul
## 2001 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 2001 28586.52 30505.41 30821.33 46634.38 104660.67
## [1] "Seasonal Naive Sales Forecasting"
## Jan Feb Mar Apr May Jun Jul
## 2001 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## Aug Sep Oct Nov Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
Q3(d) Compute the RMSE and MAPE for the naive forecasts
## ME RMSE MAE MPE MAPE MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set 7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
Q3(e) Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period). Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts
Ans 3(e) Naive Forecasts (shows in red line) follows same pattern as of actual sales numbers(blue line) in thevalidation period.
Q3(f) The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the Ans3(f) forecasting model for generating forecasts for year 2002? She must recombined the training and validation periods into one long series and then chosen method/model is rerun on the complete data. This final model is then used to forecast for year 2002.
. Linear trend model . Linear trend model with seasonality . Quadratic trend model . Quadratic trend model with seasonality
Ans4 After looking into time series, this series is yearly series and hence no seasonality but may have business cyclic in nature. At this time we don’t know how to handle business cyclic based time series and moreover this option is not avialble on given problem so we will look into Linear trend model and Quadratic trend model, who give lowest RMSE/MAE, we will select that model.
WorkHours.data <- read.csv("D:\\Google Drive\\FA\\assignment\\CanadianWorkHours.csv")
WorkHours.ts <- ts(WorkHours.data$Hoursperweek, start = c(1966,1), end = c(2000, 1), freq = 1)
WorkHours.ts
## Time Series:
## Start = 1966
## End = 2000
## Frequency = 1
## [1] 37.2 37.0 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 36.9 36.7 36.7 36.5
## [15] 36.3 35.9 35.8 35.9 36.0 35.7 35.6 35.2 34.8 35.3 35.6 35.6 35.6 35.9
## [29] 36.0 35.7 35.7 35.5 35.6 36.3 36.5
#Data Partioning
totalRecords <- length(WorkHours.ts)
nValidationRecords <- 5
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(WorkHours.ts, start = c(1966,1), end = c(1966,nTrainigRecords))
valid.ts <- window(WorkHours.ts, start = c (1966,nTrainigRecords+1), end = c(1966,totalRecords))
#linear trend model
train.lm <- tslm(train.ts ~ trend)
train.lm.pred <- forecast(train.lm, h = nValidationRecords, level = 0)
summary(train.lm)
##
## Call:
## tslm(formula = train.ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95181 -0.31426 0.02328 0.28599 0.74808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.668046 0.153308 245.702 < 2e-16 ***
## trend -0.083315 0.008636 -9.648 2.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4094 on 28 degrees of freedom
## Multiple R-squared: 0.7687, Adjusted R-squared: 0.7605
## F-statistic: 93.08 on 1 and 28 DF, p-value: 2.111e-10
accuracy(train.lm.pred,valid.ts)
## ME RMSE MAE MPE MAPE
## Training set 9.473975e-16 0.3955153 0.3226642 -0.01191583 0.8913215
## Test set 1.001342e+00 1.1216734 1.0013422 2.77249855 2.7724986
## MASE ACF1 Theil's U
## Training set 1.585977 0.7728777 NA
## Test set 4.921852 0.4331874 3.166203
#Quadratic trend model
train.poly.lm <- tslm(train.ts ~ trend + I(trend^2))
train.poly.lm.pred <- forecast(train.poly.lm, h = nValidationRecords, level = 0)
summary(train.poly.lm)
##
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91631 -0.22458 0.04514 0.26864 0.54400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.982414 0.231481 164.084 < 2e-16 ***
## trend -0.142259 0.034422 -4.133 0.000311 ***
## I(trend^2) 0.001901 0.001077 1.765 0.088905 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3948 on 27 degrees of freedom
## Multiple R-squared: 0.7927, Adjusted R-squared: 0.7773
## F-statistic: 51.61 on 2 and 27 DF, p-value: 5.958e-10
accuracy(train.poly.lm.pred,valid.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.000000 0.3745042 0.3042046 -0.01072127 0.836695 1.495243
## Test set 0.557678 0.6987179 0.5576780 1.53970072 1.539701 2.741129
## ACF1 Theil's U
## Training set 0.7384102 NA
## Test set 0.4135712 1.993851
By comparing RMSE/MAE we will select Quadratic trend model
After carefully observing trends of 6 types of wine sales. we identify some seasonality and different kind of trends. If we asked to the same method then one which may do the job with less risk of under/over estimating is Seasoanl Naive Method.