Chapter 6

2). Analysis of Canadian Manufacturing Workers Work-Hours: The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers. The data is available in CanadianWorkHours.CSV

The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers. Which one of the following regression-based models would work best?

First we will create a time series and plot the data

library(forecast)
library(ggplot2)
cwh <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)

# Create a time series 
cwh.TS<- ts(cwh$HoursPerWeek, start=c(1966), frequency=1)
yrange = range(cwh.TS)

#Plot
plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Avg Canadian Work Hours (per week)", bty="l", xaxt="n", yaxt="n")

#Added time series
lines(cwh.TS, bty="l")

#Added axes
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(20,50,1), labels=format(seq(20,50,1)), las=2)

box()
box(which = "plot")

Using the autoplot method (for fun)

autoplot(cwh.TS)

Linear Trend

# Linear trend
cwhLinear <- tslm(cwh.TS ~ trend)
summary(cwhLinear)
## 
## Call:
## tslm(formula = cwh.TS ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20457 -0.28761  0.04779  0.30210  1.23190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.416134   0.175818 212.811  < 2e-16 ***
## trend       -0.061373   0.008518  -7.205 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared:  0.6113, Adjusted R-squared:  0.5996 
## F-statistic: 51.91 on 1 and 33 DF,  p-value: 2.928e-08

Linear Trend with Seasonality

Non-seasonal data Cannot be modeled using a seasonal factor

Quadratic trend - 2 Ways to do this

# Quadratic trend - there are two ways to do this
# 1. Like the book
cwhQuad1 <- tslm(cwh.TS ~ trend + I(trend^2))
summary(cwhQuad1)
## 
## Call:
## tslm(formula = cwh.TS ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.1644156  0.2176599 175.340  < 2e-16 ***
## trend       -0.1827154  0.0278786  -6.554 2.21e-07 ***
## I(trend^2)   0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10
# 2. Another way - be sure to use raw=TRUE
cwhQuad2 <- tslm(cwh.TS ~ poly(trend, 2, raw=TRUE))
summary(cwhQuad2)
## 
## Call:
## tslm(formula = cwh.TS ~ poly(trend, 2, raw = TRUE))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 38.1644156  0.2176599 175.340  < 2e-16 ***
## poly(trend, 2, raw = TRUE)1 -0.1827154  0.0278786  -6.554 2.21e-07 ***
## poly(trend, 2, raw = TRUE)2  0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10

Quadratic trend with seasonality

Non-seasonal data Cannot be modeled using a seasonal factor

The data seems to not have any seasonality, as it is noted that non-seasonal data cannot be modelled using a seasonal factor. This would rule out using the linear trend with seasonality and using the quadratic trend with seasonality as options. By looking at the adjusted r-squared, the linear trend shows .59 while the quadratic trend shows .74. Based on this information it seems that the quadratic trend is the best choice. Plotting this data further supports the analysis.

#Plot
plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Avg Canadian Work Hours (per week)", bty="l", xaxt="n", yaxt="n")

#Added time series
lines(cwh.TS, bty="l")

#Added axes
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(20,50,1), labels=format(seq(20,50,1)), las=2)

# Add the linear trend model
lines(cwhLinear$fitted, col="blue")
lines(cwhQuad1$fitted, col="red")

# Add a legend
legend(1966, 36, c("Avg Work Hours", "Linear", "Quadratic"), lty=c(1,1,1), col=c("black", "blue", "red"), bty="n")

box()
box(which = "plot")

4). Forecasting Department Store Sales: The time series plot shown in Figure 6.12 describes actual quarterly sales for a department store over a 6-year period. The data is available in DepartmentStoreSales.xls

a.)The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?

An exponential trend in the series suggests a multiplicative change in the series over time. To fit an exponential trend, one must Take a logarithm of sales and then fit to a linear regression. The other steps Take a logarithm of the quarter index, take an exponent of sales and take an exponent of quarter index are not necessary

b.)Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods)

dss <- read.csv("DeptStoreSales.csv", stringsAsFactors = FALSE)

# Create a time series 
dss.TS<- ts(dss$Sales, start=c(1,1), frequency=4)

#Partition data
ValidLength <- 4
TrainLength <- length(dss.TS) - ValidLength
Train.TS <- window(dss.TS, start= c(1,1), end= c(1, TrainLength))
Valid.TS <- window(dss.TS, start= c(1, TrainLength+1), end= c(1, TrainLength+ValidLength))

#Exponential model 
sxpo <- tslm(Train.TS ~ trend + season, lambda=0)
summary(sxpo)
## 
## Call:
## tslm(formula = Train.TS ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16

c.)A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?

In review of the summary, quarter 2 average sales are close to equal to those of quarter 1. The p-value of .248 suggests that quarter 1 is statistically not much different from quarter 2. Based on this analysis, the quarters are Approximately equal

d.)Use this model to forecast sales in quarters 21 and 22.

validPeriodForecasts <- forecast(sxpo, h=ValidLength)
validPeriodForecasts 
##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84  63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04  66250.87
## 6 Q3       70920.09 67297.09 74738.13 65247.24  77086.15
## 6 Q4       93788.66 88997.41 98837.86 86286.57 101943.01

Quarter 21 forecasts sales of $58,793.71. Quarter 22 forecasts sales of $60,951.51.

e.)The plots shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from this regression model. i. Recreate these plots. ii. Based on these plots, what can you say about your forecasts for quarters Q21 and Q22? Are they likely to overforecast, under-forecast, or be reasonably close to the real sales values?

# Helps set up the plot
yrange = range(dss.TS)

# Set up the plot
plot(c(1,7), yrange, type="n", xlab="Year",  ylab="Sales (thousands)", bty="l", xaxt="n", yaxt="n")

# Add the time series cwh
lines(dss.TS, bty="l")

# Add the x-axis
axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))

# Add the y-axis
axis(2, at=seq(45000,105000,5000), labels=format(seq(45,105,5)), las=2)

# Add fitted line from training period
lines(sxpo$fitted, col="red")

# Add forecasts for valiation period
lines(validPeriodForecasts$mean, col="blue", lty=2)

# Add the legend
legend(1,105000, c("Actual Sales", "Exponential Trend + Season", "Forecasts"), lty=c(1,1,2), col=c("black", "red", "blue"), bty="n")

box()
box(which = "plot")

plot(dss.TS - sxpo$fitted, type="o", bty="l", xlab = "Year", ylab = "Residuals")
box()
box(which = "plot")
abline(h = 0)

Based in the residual plot, it would appear both quarter 21 and 22 are likely to be *under-forecasted** as the trend shows an increase in residuals as time progesses.

f.)Looking at the residual plot, which of the following statements appear true? . Seasonality is not captured well. . The regression model fits the data well. . The trend in the data is not captured well by the model.

Of the three statements, the residual plot shows that the regression model fits the data well holds true.

g.)Which of the following solutions is adequate and a parsimonious solution for improving model fit? . Fit a quadratic trend model to the residuals (with Quarter and Quarter2.) . Fit a quadratic trend model to Sales (with Quarter and Quarter2.)

The residual plot displays an almost u-shaped pattern. By fitting a quadratic trend model to sales it may help address the flucuation in residuals.

5). Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001. Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

a.)Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?

souvenirSales<- read.csv("SouvenirSales.csv")

# Create a time series 
souvenir.ts <- ts(souvenirSales$Sales, start = c(1995,1), frequency = 12)

#Partition data
validLength <- 12
trainLength <- length(souvenir.ts) - validLength
souvenirTrain <- window(souvenir.ts, end=c(1995, trainLength))
souvenirValid <- window(souvenir.ts, start=c(1995,trainLength+1))
souvenirLinearSeason <- tslm(souvenirTrain ~ trend + season)

Looking at the two time plots in figure 6.14, there seems to be seasonality and the increasing log values would suggest a quadratic trend. The total number of predictors in this model is 12 (1 for trend, 11 for seasononality)

b.)Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A. i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable? ii. What does the trend coefficient of model A mean?

# Plot Model A
ModelA <- tslm(souvenirTrain ~ trend + season)
summary(ModelA)
## 
## Call:
## tslm(formula = souvenirTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16
# Forecast Model A
ModelAForecast <- forecast(souvenirLinearSeason, h=souvenirValid)
## Warning in 1:h: numerical expression has 12 elements: only the first used

## Warning in 1:h: numerical expression has 12 elements: only the first used
accuracy(ModelAForecast, souvenirValid)
##                         ME      RMSE       MAE       MPE     MAPE     MASE
## Training set -5.684342e-14  5365.199  3205.089  6.967778 36.75088 0.855877
## Test set      8.251513e+03 17451.547 10055.276 10.533974 26.66568 2.685130
##                   ACF1 Theil's U
## Training set 0.4048039        NA
## Test set     0.3206228 0.9075924

December seems to have the highest average sales during the year. This could be due to the holiday season. As December is summer time in Australia, it is not typically known as peak tourist season as the weather can be described as ‘unbearable’ that time of year. As December is much higher comparitively to other months and it is not a peak travel month, it would be reasonable to assume the holidays are what leads to this spike in sales.

The trend coefficent of Model A (245.36) means that the average linear increase per month is $246.36.

c.)Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B. i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend? ii. What does the estimated trend coefficient of model B mean? iii. Use this model to forecast the sales in February 2002.

#Plot Model B
ModelB <- tslm(log(souvenirTrain) ~ trend + season)
summary(ModelB)
## 
## Call:
## tslm(formula = log(souvenirTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16
# Forecast Model B
ModelBForecast <- forecast(ModelB, h = souvenirValid)
## Warning in 1:h: numerical expression has 12 elements: only the first used

## Warning in 1:h: numerical expression has 12 elements: only the first used
accuracy(exp(ModelBForecast$mean), souvenirValid)
##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253

The estimated trend coefficient of Model B would imply that sales increase by 0.0211 (2.1%) on a monthly basis.

FebruaryForecast <- ModelB$coefficients["(Intercept)"] + ModelB$coefficients["trend"]*86 + ModelB$coefficients["season2"]

exp(FebruaryForecast)
## (Intercept) 
##    17062.99

The forecasted sales of February 2002 would be $17,062.

d.)Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

#Accuracy Model A
ModelAForecast <- forecast(ModelA, h = validLength)
accuracy(ModelAForecast$mean, souvenirValid)
##                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
#Accuracy Model B
ModelBForecast <- forecast(ModelB, h = validLength)
accuracy(exp(ModelBForecast$mean), souvenirValid)
##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253

In review of forecast performance, the plot would suggest that Model B is the better fit. The MAPE would also indicate this as the MAPE is lower for Model B (15.52 compared to 26.67). The r-squared value for model B also appears to be better number (93% versus 74%). When reviewing the accuracy of the 2 models in conjunction with the plot above it appears that model B would be superior.

e.)How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences. Comparing components of sales and time component sales are like comparing apples and oranges (or pears per the video). There would be no need to separate the data into training/validation periods as there is no need for forecasting when analyzing components of sales. Another difference to evaluate would be influence of factors such as dependent variables against the components of sales to determine how sales components could be impacted by external sources.