MBA 676: Assignment 5

2. The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers. Which one of the following regression-based models would work best?

The two seasonal methods wouldn’t work at all, because this yearly data is not seasonal. So, I’ll set up the time series, plot it and only try to fit linear and quadratic models to the data.

work <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
workTS <- ts(work$HoursPerWeek, start=c(1966), frequency=1)
yrange = range(workTS)

#Plotted
plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Average Canadian work hours, weekly", bty="l", xaxt="n", yaxt="n")

#Added time series
lines(workTS, bty="l")

#Added axes
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(34,38,1), labels=format(seq(34,38,1)), las=2)

#Linear trend
workLinear <- tslm(workTS ~ trend)
summary(workLinear)

## 
## Call:
## tslm(formula = workTS ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20457 -0.28761  0.04779  0.30210  1.23190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.416134   0.175818 212.811  < 2e-16 ***
## trend       -0.061373   0.008518  -7.205 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared:  0.6113, Adjusted R-squared:  0.5996 
## F-statistic: 51.91 on 1 and 33 DF,  p-value: 2.928e-08

Linear trend model: This series doesn’t have a linear trend, so we know it won’t be an ideal fit. The adjusted R-squared value of 0.5996 means that nearly 60% of the variation in work hours across the data can be explained by a straight line.

#Quadratic trend
workQuad <- tslm(workTS ~ trend + I(trend^2))
summary(workQuad)

## 
## Call:
## tslm(formula = workTS ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.1644156  0.2176599 175.340  < 2e-16 ***
## trend       -0.1827154  0.0278786  -6.554 2.21e-07 ***
## I(trend^2)   0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10

Quadratic trend model: Here, the adjusted R-squared value jumps up to 0.7465, indicating under 75% of variation explained by this more flexible model.

#Plotted
plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Average Canadian work hours, weekly", bty="l", xaxt="n", yaxt="n")

#Added time series
lines(workTS, bty="l")

#Added axes
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(34,38,0.5), labels=format(seq(34,38,0.5)), las=2)

#Added models
lines(workLinear$fitted, col="red")
lines(workQuad$fitted, col="blue")

#Added legend
legend(1984,37.5, c("ARPM", "Linear", "Quadratic"), lty=c(1,1,1,1,1), col=c("black", "red", "blue"), bty="n")

The reason for the difference is stark once the lines are plotted: The linear model can’t account for the later rise in hours worked, while the quadratic model can respond to it.

4a. The time series plot in Figure 6.12 describes actual quarterly sales for a department store over a 6-year period.The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for the trend, which of the following operations must be performed (either manually or by an R function)?

Take a logarithm of sales: This is used to fit an exponential trend.

4b. Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).

salesData <- read.csv("DeptStoreSales.csv")
salesTS <- ts(salesData$Sales, start=c(1,1), freq=4)

#Partitioned data
validLength2 <- 4
trainLength2 <- length(salesTS) - validLength2
trainTS <- window(salesTS, start= c(1,1), end= c(1, trainLength2))
validTS <- window(salesTS, start= c(1, trainLength2+1), end= c(1, trainLength2+validLength2))

#Fitted exponential model 
salesExpo <- tslm(trainTS ~ trend + season, lambda=0)
summary(salesExpo)

## 
## Call:
## tslm(formula = trainTS ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16

4c. A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower or approximately equal to the average Q1 sales?

Higher, because the coefficient for season2 is positive.

4d. Use this model to forecast sales in quarters 21 and 22.

#h=2 is how to get quarters 21, 22
validPeriodForecasts <- forecast(salesExpo, h=2)
validPeriodForecasts

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04 66250.87

4e. The plot shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from the regression model.

i. Recreate these plots.

validPeriodForecasts2 <- forecast(salesExpo, h=validLength2)
yrange = range(salesTS)

#Plotted
plot(c(1,7), yrange, type="n", xlab="Year", ylab="Department stores sales, thousands", bty="l", xaxt="n", yaxt="n")

#Added time series
lines(salesTS, bty="l")

#Added axes
axis(1, at=seq(1,24,1), labels=format(seq(1,24,1)))
axis(2, at=seq(45000,105000,10000), labels=format(seq(45,105,10)), las=1)

#Added models
lines(salesExpo$fitted, col="red", lwd=2)
lines(validPeriodForecasts2$mean, col="blue", lty=2)

#Added legend
legend(1,100000, c("ARPM", "Exponential Trend+Season", "Forecasts"), lty=c(1,1,2), col=c("black", "red", "blue"), bty="n")

#Simple plot
plot(trainTS - salesExpo$fitted, type="o", bty="l")

ii. Based on these plots, what can you say your forecasts for quarters 21 and 22? Are they likely to over-forecast, under-forecast or be reasonably close to the real sales values?

The blue dotted line on the first plot shows consistent but somewhat negligible under-forecasting, which is confirmed on the residual chart, where I’m under 11 times and over 9 times.

4f. Looking at the residual plot, which of the following statements appear true?

The regression model fits the data well is the only true one. The model sticks well to the seasonality and trend of the data, which is born out by the plot.

4g. Which of the following solutions is adequate and a parsimonious solution for improving model fit?

Fit a quadratic trend model to Sales.

5. In 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

5a. Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?

Like the example given on page 129 of the textbook, we see a quadratic trend and monthly seasonality, which means there are 13 predictors: 11 dummy variables for each month after January and t and t-squared for trend.

5b. Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.

souvData <- read.csv("SouvenirSales.csv")
souvTS <- ts(souvData$Sales, start=c(1995, 1), frequency=12)
yrange = range(souvTS)

#Partitioned data
validLength3 <- 12
trainLength3 <- length(souvTS) - validLength3

souvTrain <- window(souvTS, end=c(1995, trainLength3))
souvValid <- window(souvTS, start=c(1995, trainLength3+1))

#Model A
modelA <- tslm(souvTrain ~ trend + season)
summary(modelA)

## 
## Call:
## tslm(formula = souvTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable?

December is exponentially higher than all other months, which makes sense: It’s the holiday season and the beginning and warmest time of Australia’s summer.

ii. What does the trend coefficient of model A mean?

The trend is positive, so the trend coefficient is also positive. It means that sales go up monthly by an average of $245.36AUD across the dataset.

5c. Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.

modelB <- tslm(log(souvTrain) ~ trend + season)
summary(modelB)

## 
## Call:
## tslm(formula = log(souvTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

i. Fitting a model only to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?

An exponential trend.

ii. What does the estimated trend coefficient of model B mean?

Going to a log scale means that these coefficients are now in percentages, so it means It means that sales go up monthly by an average of 0.021% across the dataset.

iii. Use this model to forecast the sales in February 2002.

#Call forecast for period 86, plus add "season 2" because February is the second period.
febForecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86 + modelB$coefficients["season2"]
exp(febForecast)

## (Intercept) 
##    17062.99

The forecast for February 2002 — the 86th period — is $17,062.99 in sales.

5d. Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

First, we’ll start with the summary data, which will foreshadow the forecast data.

summary(modelA)

## 
## Call:
## tslm(formula = souvTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

Model A is relatively weak, with values for all but 2 months not statistically significant and with the model only explaining 75 percent of the variation.

summary(modelB)

## 
## Call:
## tslm(formula = log(souvTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

In Model B, all months are statistically significant and the R-squared value indicates that 93 percent of the variation is explained.

modelAForecast <- forecast(modelA, h=validLength3)
accuracy(modelAForecast$mean, souvValid)

##                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924

modelBForecast <- forecast(modelB, h=validLength3)
accuracy(exp(modelBForecast$mean), souvValid)

##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253

In the forecast accuracy measures, we see that Model B has far lower RMSE and MAPE values. That confirms that it’s a better model.

5e. How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.

In a well-known paper, Galit Shmueli, a co-author of our textbook, explains the differences between descriptive modeling and explanatory modeling. If our goal switched from descriptive to explanatory, we would be less likely to partition the dataset, since it has diminished statistical power in explanatory modeling. On the other hand, statistical significance of one variable’s effect on another is crucial in explanatory modeling, while it only has a minor effect on predictive power.