Homework 6: Problems 2, 4 & 5 from Chapter 6.


Problem 2: The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers.

Which one of the following regression-based models would work best?

library(fpp)
library(fma)
library(forecast)
library(ggplot2)
Work <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
WorkTS <- ts(Work$HoursPerWeek, start=c(1966), frequency=1)
yrange = range(WorkTS)
plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Average Canadian Work Hours per Week", bty="l", xaxt="n", yaxt="n")
lines(WorkTS, bty="l")
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(30,40,1), labels=format(seq(30,40,1)), las=2)

Linear trend

LinearTrend <- tslm(WorkTS ~ trend)
summary(LinearTrend)
## 
## Call:
## tslm(formula = WorkTS ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20457 -0.28761  0.04779  0.30210  1.23190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.416134   0.175818 212.811  < 2e-16 ***
## trend       -0.061373   0.008518  -7.205 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared:  0.6113, Adjusted R-squared:  0.5996 
## F-statistic: 51.91 on 1 and 33 DF,  p-value: 2.928e-08

The series has an adjusted R-squared of 0.6, which means that the linear trend explains 60% of the variation in hours worked.

Linear Trend with Seasonality

LinearTrendSeason <- tslm(WorkTS ~ trend + season)
summary(LinearTrendSeason)

This series does not have explicit seasonality. Because of that linear trend with seasonality is not an option.

Quadratic Trend

QuadTrend <- tslm(WorkTS ~ trend + I(trend^2))
summary(QuadTrend)
## 
## Call:
## tslm(formula = WorkTS ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.1644156  0.2176599 175.340  < 2e-16 ***
## trend       -0.1827154  0.0278786  -6.554 2.21e-07 ***
## I(trend^2)   0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10

This is a much better fit as compared to linear trend, as illustrated by the adjusted R-squared value of 0.75.This means that the quadratic trend explains 75% of the variation in hours worked.

Quadratic Trend with Seasonality

QuadTrendSeason <- tslm(WorkTS ~ trend + I(trend^2) + season)
summary(QuadTrendSeason)

This series does not have explicit seasonality. Because of that quadratic trend with seasonality is not an option.

plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Average Canadian Work Hours per Week", bty="l", xaxt="n", yaxt="n")
lines(WorkTS, bty="l")
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(30,40,1), labels=format(seq(30,40,1)), las=2)
lines(LinearTrend$fitted, col="red")
lines(QuadTrend$fitted, col="green")
legend(1966,36, c("Average Work Hours", "Linear Trend", "Quadratic Trend"), lty=c(1,1,1), col=c("black", "red", "green"), bty="n")

Based on the chart above, quadratic trend model fits the series best as it is able to capture not only the downward trend, but also the upward trend that starts in the late 1980s. Looking at the Adjusted R-squared values for each model, we also see that the model that used a quadratic trend had the highest explanation of variation in the average work hours.

Problem 4: Forecasting Department Store Sales: The time series plot shown in Figure 6.12 describes actual quarterly sales for a department store over a 6-year period.

a. The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?

Take a logarithm of the Quarter index

No

Take a logarithm of sales

Yes, this is used to fit an exponential trend. In R this is done by setting lambda to equal zero in the tslm function.

Take an exponent of sales

No

Take an exponent of Quarter index

No

b. Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).

DeptSales <- read.csv("DeptStoreSales.csv")
SalesTS <- ts(DeptSales$Sales, start=c(1,1), freq=4)
ValidLength <- 4
TrainLength <- length(SalesTS) - ValidLength
TrainTS <- window(SalesTS, start= c(1,1), end= c(1, TrainLength))
ValidTS <- window(SalesTS, start= c(1, TrainLength+1), end= c(1, TrainLength+ValidLength))
ExpoModel <- tslm(TrainTS ~ trend + season, lambda=0)
summary(ExpoModel)
## 
## Call:
## tslm(formula = TrainTS ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16

c. A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?

Even though the Season 2 coefficient is 0.024, which appears to be slightly higher than that shown in Season 1. However, the P-value illustrates the fact that Season 2 is not statistically different from Season 1. Which means that the average sales in Season 2 are about the same as in Season 1.

d. Use this model to forecast sales in quarters 21 and 22.

validPeriodForecasts <- forecast(ExpoModel, h=ValidLength)
validPeriodForecasts 
##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84  63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04  66250.87
## 6 Q3       70920.09 67297.09 74738.13 65247.24  77086.15
## 6 Q4       93788.66 88997.41 98837.86 86286.57 101943.01

e. The plot shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from the regression model.

i. Recreate these plots.

yrange = range(SalesTS)
plot(c(1,7), yrange, type="n", xlab="Year", ylab="Sales Dollars in Thousands", bty="l", xaxt="n", yaxt="n")
lines(SalesTS, bty="l")
axis(1, at=seq(1,24,1), labels=format(seq(1,24,1)))
axis(2, at=seq(45000,105000,10000), labels=format(seq(45,105,10)), las=1)
lines(ExpoModel$fitted,col="red")
lines(validPeriodForecasts$mean,col="blue",lty=2)
legend(1,100000, c("Actual Sales", "Exponential Trend+Season", "Forecasts"), lty=c(1,1,2), col=c("black", "red", "blue"), bty="n")

plot(ExpoModel$residuals, type="o", bty="l", main="Residuals for Training Period - Log-Transformed Units")

plot(TrainTS - ExpoModel$fitted, type="o", bty="l")
title(main = "Residuals - Original Scale")

ErrorValid <- ValidTS - validPeriodForecasts$mean
plot(ErrorValid, type="o", bty="l", main="Residuals in Validation Period")

ii. Based on these plots, what can you say about your forecasts for quarters 21 and 22? Are they likely to over-forecast, under-forecast or be reasonably close to the real sales values?

The forecast for Q21 and Q22 are likely to be lower as compared to the actual sales.

The first chart indicates a relatively small under-forecasting, as compared to actual sales. This is confirmed by the residuals plot: the parabolic shape indicates under-forecasting in the beginning and the end of the series. Additionally, residuals in the plot showing validation period confirm under-forecasting.

f. Looking at the residual plot, which of the following statements appear true?

Seasonality is not captured well

FALSE. The chart above shows seasonality.

The regression model fits the data well

TRUE. The model works well with the seasonality and trend, as confirmed by the charts above.

The trend in the data is not captured well by the model

FALSE. However, it is not a perfect fit as illustrated by the parabolic shape of the residuals plot.

g. Which of the following solutions is adequate and a parsimonious solution for improving model fit?

Fit a quadratic trend model to the residuals (with Quarter and Quarter-squared.)

NO.

Fit a quadratic trend model to Sales (with Quarter and Quarter-squared.)

Yes, to address the parabolic shape of residuals.

Problem 5: Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001. In 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

a. Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?

Based on the plots, it is clear that the forecasting model for this series should include both trend and seasonality predictors. Similar to the example in the book, the quadratic trend and monthly seasonality implies 13 predictors: 11 dummy varilables for each month after January and t and t-squared for trend.

b. Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.

SouvenirData <- read.csv("SouvenirSales.csv")
SouvenirTS <- ts(SouvenirData$Sales, start=c(1995, 1), frequency=12)
yrange = range(SouvenirTS)
ValidLength1 <- 12
TrainLength1 <- length(SouvenirTS) - ValidLength1

SouvenirTrain <- window(SouvenirTS, end=c(1995, TrainLength1))
SouvenirValid <- window(SouvenirTS, start=c(1995, TrainLength1+1))
ModelA <- tslm(SouvenirTrain ~ trend + season)
summary(ModelA)
## 
## Call:
## tslm(formula = SouvenirTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable?

As illustrated by the table, December has the highest average sales of the year, with November being the second highest. This can be explained by the holiday season as well as by increased number of tourists coming from the Northern hemispere.

ii. What does the trend coefficient of model A mean?

The trend coeffient of the model A is 245.36, which represents the increase per period in the model’s trend. What it means is we can expect a monthly increase of roughly AU $245 from the prior month before considering the seasonality impact of each period.

c. Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.

ModelB <- tslm(log(SouvenirTrain) ~ trend + season)
summary(ModelB)
## 
## Call:
## tslm(formula = log(SouvenirTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

i. Fitting a model only to log (Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?

It is equivalent to fitting an exponential trend model to Sales.

ii. What does the estimated trend coefficient of model B mean?

The trend coefficient estimates the percentage increase of each period compared to the previous period, before applying seasonality. In our example, it would be a 2.1% increase each month.

iii. Use this model to forecast the sales in February 2002.

Feb2002Forecast <- ModelB$coefficients["(Intercept)"] + ModelB$coefficients["trend"]*86 + ModelB$coefficients["season2"]
exp(Feb2002Forecast)
## (Intercept) 
##    17062.99

d. Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

ModelAForecast <- forecast(ModelA, h = ValidLength1)
accuracy(ModelAForecast$mean, SouvenirValid)
##                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
ModelBForecast <- forecast(ModelB, h = ValidLength1)
accuracy(exp(ModelBForecast$mean), SouvenirValid)
##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
plot(SouvenirValid, bty="l", xaxt="n", xlab="Year: 2001", yaxt="n", ylab="Sales (A$)")
axis(1, at=seq(2001,2001+11/12,1/12), labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at=seq(0,100000,10000), labels=format(seq(0,100,10)), las=2)
lines(ModelAForecast$mean, col="blue")
lines(exp(ModelBForecast$mean), col="red")
legend(2001,85000, c("Actuals", "Model A (Linear & Season)", "Model B (log(y) Linear + Season)"), lty=c(1,1,1), col=c("black", "blue", "red"), bty="n")

Model B is preferred because of the following reasons:

  • According to the plot above, both Model A and B are subject to under-forecasting. That said, it is still a better fit;
  • It has better MAPE and RMSE scores, which makes it preferable for forecasting;
  • Model B has a better value for an adjusted R-squared value - 93% as compared to 75% in case of Model A;
  • Looking at the raw data series, it is apparent that the magnitude of seasonality seems to be increaing over time, which is an indicator of exponential seasonality. This confirms the idea that Model B should be a better fit.

e. How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.

If the goal was to understand the different components of sales in the souvenir shop, I would model the data based on the classification of the souvenirs sold as opposed to the total dollar amount.Taking it one step further, I would also recommend collecting demographic information of the customers and the items they purchase to develop marketing campaigns depending on features of the most profitable cluster of clients.

That said, the information that is being collected now is incredibly useful in terms of informing the owners of their inventory needs as well as their staffing neeeds. Knowing that November and December are the most profitable months of the year - assuming because of the holidays - can be used to capitalize on holiday cheer by offering more holiday souvenirs, decorations and perhaps even costumes.