MBA678 Assignment 5

Ch.6: Q.2,4,5

Q.2 \(\textit{Analysis of Canadian Manufacturing Workers Work-Hours}\)

library(forecast)
leafhrs <- read.csv("CanadianWorkHours.csv",stringsAsFactors = FALSE)
#str(leafhrs)
#head(leafhrs)
#tail(leafhrs)
leafhrsTS <- ts(leafhrs$HoursPerWeek,start=c(1966,1),frequency=1)
yrange <- range(leafhrsTS)
plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Hours Per Week", bty="l", xaxt="n", yaxt="n")
lines(leafhrsTS,bty="l")
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,0.5), labels=format(seq(34.5,38.0,0.5)), las=2)

Which one model of the following regression-based models would fit the series best?

\(\bullet\) Linear trend model

\(\bullet\) Linear trend model with seasonality

\(\bullet\) Quadratic trend model

\(\bullet\) Quadratic trend model with seasonality

Note we have yearly data (average number of weekly hours given annually). Since seasons occur \(\textit{within}\) a year, we cannot do anything with models involving seasons, eliminating bullets 2 and 4 automatically. Now let’s fit both a linear (bullet 1) and a quadratic (bullet 3) model to the data and examine \(R^2\) values for the goodness of fit measure. To reiterate, we won’t look at linear and quadratic models with seasonality due to our data given yearly.

leafhrsLinear <- tslm(leafhrsTS ~ trend)
summary(leafhrsLinear)

## 
## Call:
## tslm(formula = leafhrsTS ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20457 -0.28761  0.04779  0.30210  1.23190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.416134   0.175818 212.811  < 2e-16 ***
## trend       -0.061373   0.008518  -7.205 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared:  0.6113, Adjusted R-squared:  0.5996 
## F-statistic: 51.91 on 1 and 33 DF,  p-value: 2.928e-08

leafhrsQuad <- tslm(leafhrsTS ~ trend + I(trend^2))
summary(leafhrsQuad)

## 
## Call:
## tslm(formula = leafhrsTS ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.1644156  0.2176599 175.340  < 2e-16 ***
## trend       -0.1827154  0.0278786  -6.554 2.21e-07 ***
## I(trend^2)   0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10

We see that the adjusted \(R^2\) for the linear and quadratic models was determined to be 0.5996 and 0.7465, respectively. So about 60% and about 75% of average annual number of weekly hours can be explained by the linear and quadratic models, respectively. Since the quadratic model adjusted \(R^2\) value is roughly 15 points higher than the linear model’s, the quadratic trend model (bullet 3) would fit the series best. We’ll now plot both models for visualization purposes.

yrange = range(leafhrsTS)
plot(c(1966,2000),yrange,type="n",xlab="Year",ylab="Hours Per Week",bty="l",xaxt="n",yaxt="n")
lines(leafhrsTS,bty="l")
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,0.5), labels=format(seq(34.5,38.0,0.5)), las=2)
# linear model
lines(leafhrsLinear$fitted,col="red")
# quadratic model
lines(leafhrsQuad$fitted,col="blue")
# legend
legend(1990,37.5,c("CWH","Linear","Quadratic"),lty=c(1,1,1),col=c("black","red","blue"),bty="n")

Q.4 \(\textit{Forecasting Department Store Sales}\)

The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?

\(\bullet\) Take a logarithm of the Quarter index

\(\bullet\) Take a logarithm of sales

\(\bullet\) Take an exponent of sales

\(\bullet\) Take an exponent of Quarter index

To account for an exponential trend, we take the logarithm (natural log) of the y-variable (sales).

The following is the output of a regression model with exponential trend and seasonality using the first 20 quarters as the training period.

sales <- read.csv("DeptStoreSales.csv",stringsAsFactors = FALSE)
salesTS <- ts(sales$Sales,start=c(1),frequency=4)
#str(sales)
#head(sales)
#tail(sales)
validLength <- 4
trainLength <- length(salesTS) - validLength
salesTrain <- window(salesTS, end=c(1,trainLength))
salesValid <- window(salesTS, start=c(1,trainLength+1))
# fit exponential trend with seasonality mode to training set
salesExpo <- tslm(salesTrain ~ trend + season,lambda=0)
summary(salesExpo)

## 
## Call:
## tslm(formula = salesTrain ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16

The estimated log-transformed coefficients are in relation to the base season (Quarter 1). We see that Quarter 2 has an etimated coefficient of 0.024956, however, this value does not have an associated statistically significant p-value (p=0.248). That is, Quarter 2 is not statistically significantly different than Quarter 1. However, practical significance should always be considered.

The forecasts for the validation period are given below:

validPeriodForecasts <- forecast(salesExpo, h=validLength)
validPeriodForecasts

##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84  63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04  66250.87
## 6 Q3       70920.09 67297.09 74738.13 65247.24  77086.15
## 6 Q4       93788.66 88997.41 98837.86 86286.57 101943.01

We see that the forecasts for quarters 21 and 22 are roughly 58793.71 and 60951.51, respectively.

The fitted model is shown below:

yrange = range(salesTS)
plot(c(1,7),yrange,type="n",xlab="Year",ylab="Sales (thousands)",bty="l",xaxt="n",yaxt="n")
lines(salesTS,bty="l")
axis(1,at=seq(1,7,1),labels=format(seq(1,7,1)))
axis(2,at=seq(45000,105000,5000),labels=format(seq(45,105,5)),las=2)
lines(salesExpo$fitted,col="red")
lines(validPeriodForecasts$mean,col="blue",lty=2)
legend(1,105000,c("Sales","Exponential Trend+Season","Forecasts"),lty=c(1,1,2),col=c("black","red","blue"),bty="n")

accuracy(validPeriodForecasts, salesValid)

##                      ME     RMSE      MAE         MPE     MAPE      MASE
## Training set   30.63141 1759.359 1388.794 -0.04013518 2.215018 0.4456262
## Test set     5395.00854 6076.912 5395.009  6.62904618 6.629046 1.7311114
##                   ACF1 Theil's U
## Training set 0.5540818        NA
## Test set     0.2156416 0.4259668

The MAPE for the test set is roughly 6.63%. The residual plot is reproduced below:

plot(salesExpo$residuals, type="o", bty="l")
abline(0,0,col="red")

This plot is on the log-transformed scale. Converting back yields:

plot(salesTrain - salesExpo$fitted, type="o", bty="l")
abline(0,0,col="red")

The residual plot appears to be roughly centered at 0 (9 positive residuals, 11 negative). We see that towards the end of the training period, we are underpredicting (positive residuals). I would say we will underpredict for quarters 21 and 22 as well. This is indeed confirmed by the following residual plot for the validation period.

salesValid - validPeriodForecasts$mean

##       Qtr1     Qtr2     Qtr3     Qtr4
## 6 2006.290 3948.490 6076.915 9548.339

plot(salesValid - validPeriodForecasts$mean, type="o", bty="l")

Looking at the residual plot, which of the following statements appear true?

\(\bullet\) Seasonality is not captured well.

This does not appear true. If seasonality was not captured well by the model, there would be noticable patterns in the residual plot.

\(\bullet\) The regression model fits the data well.

This appears to be true. The residuals are approximately centered at 0. We also recall the MAPE for the training period to be about 2.22%.

\(\bullet\) The trend in the data is not captured well by the model.

The appears to be true. The residual plot for the training period shows a somewhat U-shaped trend so a quadratic trend model might be more appropriate. The validation residual plot also exhibits a linear trend.

Which of the following solutions is adequate and a parsimonious solution for improving model fit?

\(\bullet\) Fit a quadratic trend model to the residuals (with \(\textit{Quarter}\) and \(\textit{Quarter}^2\).)

\(\bullet\) Fit a quadratic trend model to Sales (with \(\textit{Quarter}\) and \(\textit{Quarter}^2\).)

I would fit the quadratic trend model to Sales. This is appropriate since we witnessed a U-shaped pattern in the training period residual plot.

\(\textit{Souvenir Sales}\)

The two time plots in Figure 6.14 indicate the presence of both trend and seasonality. The trend appears linear and the seasonality is monthly. So there are \(12-1=11\) dummy variables for the monthly seasonality and 1 variable for the linear trend \(t\), making 12 predictors in the model. Assuming a quadratic trend adds 1 more predictor (\(t^2\)).

souvSales <- read.csv("SouvenirSales.csv",stringsAsFactors = FALSE)
#str(souvSales)
#head(souvSales)
#tail(souvSales)
souvSalesTS <- ts(souvSales$Sales,start=c(1995,1),frequency=12)
validLength <- 12
trainLength <- length(souvSalesTS) - validLength
souvSalesTrain <- window(souvSalesTS,end=c(1995,trainLength))
souvSalesValid <- window(souvSalesTS, start=c(1995,trainLength+1))
# fit linear trend with seasonality model to training set
modelA <- tslm(souvSalesTrain ~ trend + season)
summary(modelA)

## 
## Call:
## tslm(formula = souvSalesTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

Season 12 (December) has the highest sales on average. This is reasonable because the data comes from a souvenir shop at a beach resort town in Queensland, Australia. Combined with the holiday season, December is also the beginning of summer for Australia so it makes sense for souvenir shops to sell more when there are more tourists around.
The trend coefficient of 245.36 is the slope of the model, meaning for every 1 month that passes, sales go up on average by 245.36 (australian dollars).

First we run a regression model with log(Sales) as the output variable with a linear trend and monthly predictors.

modelB <- tslm(log(souvSalesTrain) ~ trend + season)
summary(modelB)

## 
## Call:
## tslm(formula = log(souvSalesTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

Now run an exponential trend model with Sales as the output variable.

souvSalesExpo <- tslm(souvSalesTrain ~ trend + season,lambda=0)
summary(souvSalesExpo)

## 
## Call:
## tslm(formula = souvSalesTrain ~ trend + season, lambda = 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2e+10 on 12 and 59 DF,  p-value: < 2.2e-16

As witnessed above, fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales with exponential trend (\(\log(y) = x \Leftarrow \Rightarrow y=e^x\)).
The estimated trend coefficient of model B is equal to 0.021120 on the log-scale. This means sales increase by roughly 2.1% per month.

str(souvSalesTS)

##  Time-Series [1:84] from 1995 to 2002: 1665 2398 2841 3547 3753 ...

We see that there are 84 time periods in the data ending at December 2001, so February 2002 corresponds to time period 86.

febForecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86 + modelB$coefficients["season2"]
exp(febForecast)

## (Intercept) 
##    17062.99

So the forecast for February 2002 is roughly 17063 (Australian dollars).

Accuracy measures for model A:

modelAForecast <- forecast(modelA,h=validLength)
accuracy(modelAForecast$mean,souvSalesValid)

##                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924

Accuracy measures for model B:

modelBForecast <- forecast(modelB,h=validLength)
accuracy(exp(modelBForecast$mean),souvSalesValid)

##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253

Plots of actuals and the two models for the validation period:

plot(souvSalesValid, bty="l", xaxt="n", xlab="Year 2001", yaxt="n", ylab="Sales (thousands Australian Dollars")
axis(1, at=seq(2001,2001+11/12,1/12), labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at=seq(10000,85000,5000), labels=format(seq(10,85,5)), las=2)
lines(modelAForecast$mean, col="blue")
lines(exp(modelBForecast$mean), col="red")
legend(2001,100000, c("Actuals", "Linear+Season (Model A)", "log(y) Linear+Season (Model B)"), lty=c(1,1,1), col=c("black", "blue", "red"), bty="n")

Model B had a MAPE of about 15.52% for the test set while Model A’s was about 26.67%. We also see from the above plot that towards the end of the validation period, Model A does not stay very close to the actual data while Model B does. We confirm this by comparing the residuals of Model A and model B.

souvSalesValid - modelAForecast$mean

##             Jan        Feb        Mar        Apr        May        Jun
## 2001 -4602.7902 -4943.8986  2081.2364   312.6398 -1275.8919   660.7014
##             Jul        Aug        Sep        Oct        Nov        Dec
## 2001  6848.3714  8795.3598  9740.9064  8945.3648 17810.0697 54646.0831

souvSalesValid - exp(modelBForecast$mean)

##             Jan        Feb        Mar        Apr        May        Jun
## 2001   463.2178 -1976.2148  1385.0914  2213.7892  -226.8377  1605.3926
##             Jul        Aug        Sep        Oct        Nov        Dec
## 2001  6260.7263  8995.4085  8640.9177  6291.0309  6489.6051 17751.8018

With the lower MAPE and lower residuals towards the end of the validation period, model B (log-transformed Sales Linear + Season) is preferred.

If the time period of interest is between 1995 and 2001, then no forecasting will be required. Instead, the goal will be completely descriptive in nature and there will be no need to partition the data. Understanding the “different components” of sales might refer to temperature/weather, economy, discount sales, etc., so we may wish to create time series for each relevant factor. Econometric models may then be employed to capture any correlations between sales and the different components via multivariate time series.

MBA678 Assignment 5

Ronald Coury

April 03, 2017