ShampooSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\ShampooSales.csv")
ShampooSales.ts <- ts(ShampooSales.data$Shampoo.Sales, start = c(1995,1), end = c(1997, 12), freq = 12)
ShampooSales.lm <- tslm(ShampooSales.ts ~ trend + I(trend^2))
par(op)
Time Series plot below shows that this is an increasing quadratic trend.
All the 4 components are present in this time series as can be seen from following decomposed components
Though little non-intuitive to begin with but upon close look at the below monthly seasonal components we do see seasonality in sales of shampoo. There are possible 3 reasons of seasonality * People take more bath during summer and hence buy more shampoo during this time (like Jun, July and August) as shown in below table
* People tend to buy more during Off season sales periods like Oct and/or November (just before winter when they stop bathing :)) * April shows positive numbers which indicates people stocked up shampoo as they deffered their purchases during winter season.
## Jan Feb Mar Apr May Jun
## 1995 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1996 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1997 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## Jul Aug Sep Oct Nov Dec
## 1995 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1996 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1997 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
Yes, this step a very important and will help us to measure accuracy of our forecasting model.
No, we need to examine time series data for both training period as well as validation period. Also model forecasts for both training and validation periods need to be reviewed.
#Data Partioning
ShampooSales.plots = decompose(ShampooSales.ts)
ShampooSales.plots$seasonal
## Jan Feb Mar Apr May Jun
## 1995 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1996 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1997 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## Jul Aug Sep Oct Nov Dec
## 1995 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1996 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1997 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
plot(ShampooSales.plots)
ShampooSales.tsyear <- aggregate(ShampooSales.ts, nfrequency=1, FUN=mean)
ShampooSales.ts.zoom <- window(ShampooSales.tsyear, start = c(1995, 1), end = c(1997, 12))
## Warning in window.default(x, ...): 'end' value not changed
dev.new()
plot(ShampooSales.ts.zoom, xlab = "Time", ylab = "ShampooSales", ylim = c(100, 800), bty = "l")
dev.off()
## png
## 2
totalRecords <- length(ShampooSales.ts)
nValidationRecords <- 12
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(ShampooSales.ts, start = c(1995,1), end = c(1995,nTrainigRecords))
valid.ts <- window(ShampooSales.ts, start = c (1995,nTrainigRecords+1), end = c(1995,totalRecords))
ShampooSales.lm <- tslm(train.ts ~ trend + I(trend^2))
ShampooSales.lm.pred <- forecast(ShampooSales.lm, h = nValidationRecords, level = 0)
plot(ShampooSales.lm.pred, ylim = c(100, 800),ylab = "ShampooSales", xlab = "Time", bty = "l", xaxt = "n", xlim = c(1995,1998), main ="", flty = 2)
axis(1, at = seq(1995, 1998, 1), labels = format(seq(1995, 1998, 1)))
lines(ShampooSales.lm$fitted, lwd = 2)
lines(valid.ts)
No, we don’t need to look MAPE and RMSE values for training period.
Yes, this is very important. It serves as a more objective basis than the training period to assess predictive accuracy (because records in the validation period are not used to select predictors or to estimate model parameters).
Yes, its a good idea to have naive forecasts as it may serve 2 purposes * As the actual forecasts of the series. Naive forecasts, which are simple to understand and easy to implement, can sometimes achieve sufficiently useful accuracy levels. Following the principle of “the simplest method that does the job”, naive forecasts are a serious contender. * As a baseline. When evaluating the predictive performance of a certain method, it is important to compare it to some baseline. Naive forecasts should always be considered as a baseline, and the comparative advantage of any other methods considered should be clearly shown.
To address the problem of overfitting, an important step before applying any forecasting method is data partitioning. The series is split into two periods. She would develop her forecasting model using only the training period. After she has a model, she would try it out on validation period and see how it performs. In particular, she can measure the forecast errors, which are the differences between the predicted values and the actual values.
She took 12 month validation period since (a) the forecast is required for next 12 months and (b) to test forecasting model to cover all the monthly seasonality.
SouvenirSales.data <- read.csv("D:\\Google Drive\\FA\\assignment\\SouvenirSales.csv")
SouvenirSales.ts <- ts(SouvenirSales.data$Sales, start = c(1995,1), end = c(2001, 12), freq = 12)
nValid <- 12
nTrain <- length(SouvenirSales.ts) - nValid
train.ts <- window(SouvenirSales.ts, start = c(1995, 1), end = c(1995,nTrain))
valid.ts <- window(SouvenirSales.ts, start = c(1995, nTrain + 1), end = c(1995,nTrain + nValid))
snaive.pred <- snaive(train.ts, h = nValid)
## [1] "Actual Sales"
## Jan Feb Mar Apr May Jun Jul
## 2001 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 2001 28586.52 30505.41 30821.33 46634.38 104660.67
## [1] "Seasonal Naive Sales Forecasting"
## Jan Feb Mar Apr May Jun Jul
## 2001 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## Aug Sep Oct Nov Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
RMSE for Validation Period: 9,542 and MAPE: 27.3%
## ME RMSE MAE MPE MAPE MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set 7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
Naive Forecasts (showed in red line) follows same pattern/ trend as of actual sales numbers (blue line) with a certain delta across validation period.
She must combine the training and validation periods into a single time series and then rerun the chosen model on the complete data. This final model is then used to forecast for year 2002.
Quadratic trend moded: After looking into time series, this series is yearly series and hence no seasonality but may be business cyclic in nature e.g. changing over every 4-5 years. At this time we don’t know how to handle business cyclic based time series and moreover this option is not avialble in the given options. We will consider Quadratic trend model since in different times of series it has increasing as well as decreasing trends e.g. between 1974 to 1988 it is decreasing and then it is increasing. RMSE/MAPE is also better for Quadratic trend model as seen below
WorkHours.data <- read.csv("D:\\Google Drive\\FA\\assignment\\CanadianWorkHours.csv")
WorkHours.ts <- ts(WorkHours.data$Hoursperweek, start = c(1966,1), end = c(2000, 1), freq = 1)
WorkHours.ts
## Time Series:
## Start = 1966
## End = 2000
## Frequency = 1
## [1] 37.2 37.0 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 36.9 36.7 36.7 36.5
## [15] 36.3 35.9 35.8 35.9 36.0 35.7 35.6 35.2 34.8 35.3 35.6 35.6 35.6 35.9
## [29] 36.0 35.7 35.7 35.5 35.6 36.3 36.5
#Data Partioning
totalRecords <- length(WorkHours.ts)
nValidationRecords <- 5
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(WorkHours.ts, start = c(1966,1), end = c(1966,nTrainigRecords))
valid.ts <- window(WorkHours.ts, start = c (1966,nTrainigRecords+1), end = c(1966,totalRecords))
#linear trend model
train.lm <- tslm(train.ts ~ trend)
train.lm.pred <- forecast(train.lm, h = nValidationRecords, level = 0)
summary(train.lm)
##
## Call:
## tslm(formula = train.ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95181 -0.31426 0.02328 0.28599 0.74808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.668046 0.153308 245.702 < 2e-16 ***
## trend -0.083315 0.008636 -9.648 2.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4094 on 28 degrees of freedom
## Multiple R-squared: 0.7687, Adjusted R-squared: 0.7605
## F-statistic: 93.08 on 1 and 28 DF, p-value: 2.111e-10
accuracy(train.lm.pred,valid.ts)
## ME RMSE MAE MPE MAPE
## Training set 9.473975e-16 0.3955153 0.3226642 -0.01191583 0.8913215
## Test set 1.001342e+00 1.1216734 1.0013422 2.77249855 2.7724986
## MASE ACF1 Theil's U
## Training set 1.585977 0.7728777 NA
## Test set 4.921852 0.4331874 3.166203
#Quadratic trend model
train.poly.lm <- tslm(train.ts ~ trend + I(trend^2))
train.poly.lm.pred <- forecast(train.poly.lm, h = nValidationRecords, level = 0)
summary(train.poly.lm)
##
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91631 -0.22458 0.04514 0.26864 0.54400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.982414 0.231481 164.084 < 2e-16 ***
## trend -0.142259 0.034422 -4.133 0.000311 ***
## I(trend^2) 0.001901 0.001077 1.765 0.088905 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3948 on 27 degrees of freedom
## Multiple R-squared: 0.7927, Adjusted R-squared: 0.7773
## F-statistic: 51.61 on 2 and 27 DF, p-value: 5.958e-10
accuracy(train.poly.lm.pred,valid.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.000000 0.3745042 0.3042046 -0.01072127 0.836695 1.495243
## Test set 0.557678 0.6987179 0.5576780 1.53970072 1.539701 2.741129
## ACF1 Theil's U
## Training set 0.7384102 NA
## Test set 0.4135712 1.993851
After carefully observing trends of 6 types of wine sales. we identify some seasonality and different kind of trends. If we asked to the same method then one which may do the job with less risk of under/over estimating is Seasoanl Naive Method.