The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers. Which one of the following regression-based models would work best?
First, we create a time series and plot the data.
library(forecast)
library(ggplot2)
cwh <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
# Create a time series
cwh.TS <- ts(cwh$HoursPerWeek, start=c(1966), frequency=1)
yrange = range(cwh.TS)
#Plot
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Avg Canadian Work Hours (per week)", bty="l", xaxt="n", yaxt="n")
#Added time series
lines(cwh.TS, bty="l")
#Added axes
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(20,50,1), labels=format(seq(20,50,1)), las=2)
box()
Using the autoplot method (for fun)
autoplot(cwh.TS)
Linear Trend
# Linear trend
cwhLinear <- tslm(cwh.TS ~ trend)
summary(cwhLinear)
##
## Call:
## tslm(formula = cwh.TS ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
Linear Trend with Seasonality
Non-seasonal data cannot be modeled using a seasonal factor, so a linear trend with seasonality cannot be fit to this annual series.
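For completeness, a minimal sketch of the attempt (wrapped in try(), since tslm() stops with an error when a seasonal term is requested for an annual series):
# The series has frequency 1, so no seasonal factor is available
try(tslm(cwh.TS ~ trend + season))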
Quadratic Trend (two ways to do this)
# Quadratic trend - there are two ways to do this
# 1. Like the book
cwhQuad1 <- tslm(cwh.TS ~ trend + I(trend^2))
summary(cwhQuad1)
##
## Call:
## tslm(formula = cwh.TS ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
# 2. Another way - be sure to use raw=TRUE
cwhQuad2 <- tslm(cwh.TS ~ poly(trend, 2, raw=TRUE))
summary(cwhQuad2)
##
## Call:
## tslm(formula = cwh.TS ~ poly(trend, 2, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## poly(trend, 2, raw = TRUE)1 -0.1827154 0.0278786 -6.554 2.21e-07 ***
## poly(trend, 2, raw = TRUE)2 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
Quadratic Trend with Seasonality
Non-seasonal data cannot be modeled using a seasonal factor, so this model is also ruled out.
The data do not appear to have any seasonality (non-seasonal data cannot be modeled using a seasonal factor), which rules out both the linear trend with seasonality and the quadratic trend with seasonality. Comparing adjusted R-squared values, the linear trend gives about 0.60 while the quadratic trend gives about 0.75. Based on this, the quadratic trend is the best choice. Plotting the data further supports this analysis.
#Plot
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Avg Canadian Work Hours (per week)", bty="l", xaxt="n", yaxt="n")
#Added time series
lines(cwh.TS, bty="l")
#Added axes
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(20,50,1), labels=format(seq(20,50,1)), las=2)
# Add the fitted linear and quadratic trend lines
lines(cwhLinear$fitted, col="blue")
lines(cwhQuad1$fitted, col="red")
# Add a legend
legend(1966, 36, c("Avg Work Hours", "Linear", "Quadratic"), lty=c(1,1,1), col=c("black", "blue", "red"), bty="n")
box()
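As a quick numeric check (a sketch reusing the cwhLinear and cwhQuad1 fits from above), the adjusted R-squared values can be pulled directly from the model summaries:
# Extract adjusted R-squared from each fitted model
sapply(list(linear = cwhLinear, quadratic = cwhQuad1),
       function(m) summary(m)$adj.r.squared)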
a.) The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?
An exponential trend suggests a multiplicative change in the series over time. To fit it with a regression model, take the logarithm of Sales and then fit a linear regression. The other operations listed (taking a logarithm of the quarter index, taking an exponent of Sales, or taking an exponent of the quarter index) are not necessary.
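As a small illustration of the equivalence (a sketch assuming DeptStoreSales.csv with a Sales column, the same file read in part b below), fitting a linear regression to log(Sales) gives the same coefficients as letting tslm() apply the log via lambda = 0:
# Read the quarterly sales series (also created as dss.TS in part b)
dssTmp <- ts(read.csv("DeptStoreSales.csv")$Sales, start = c(1, 1), frequency = 4)
expo1 <- tslm(dssTmp ~ trend, lambda = 0)  # log applied internally (Box-Cox, lambda = 0)
expo2 <- tslm(log(dssTmp) ~ trend)         # manual log transform, then linear regression
coef(expo1)
coef(expo2)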
b.) Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).
dss <- read.csv("DeptStoreSales.csv", stringsAsFactors = FALSE)
# Create a time series
dss.TS <- ts(dss$Sales, start=c(1,1), frequency=4)
#Partition data
ValidLength <- 4
TrainLength <- length(dss.TS) - ValidLength
Train.TS <- window(dss.TS, start= c(1,1), end= c(1, TrainLength))
Valid.TS <- window(dss.TS, start= c(1, TrainLength+1), end= c(1, TrainLength+ValidLength))
# Exponential trend with seasonality (lambda = 0 applies a log transform)
sxpo <- tslm(Train.TS ~ trend + season, lambda=0)
summary(sxpo)
##
## Call:
## tslm(formula = Train.TS ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
c.) A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?
In the summary, the season2 coefficient is small and its p-value of 0.248 indicates that, after adjusting for trend, Q2 average sales are not statistically different from Q1 average sales. The quarters are therefore approximately equal.
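As a back-of-the-envelope check (a sketch using the sxpo fit above), the season2 coefficient can be converted to a multiplicative factor:
# About 1.025, i.e. Q2 roughly 2.5% above Q1 after adjusting for trend,
# and the difference is not statistically significant
exp(coef(sxpo)["season2"])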
d.) Use this model to forecast sales in quarters 21 and 22.
validPeriodForecasts <- forecast(sxpo, h=ValidLength)
validPeriodForecasts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
## 6 Q3 70920.09 67297.09 74738.13 65247.24 77086.15
## 6 Q4 93788.66 88997.41 98837.86 86286.57 101943.01
The forecast for quarter 21 is $58,793.71, and the forecast for quarter 22 is $60,951.51.
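These two values can also be pulled directly from the forecast object (a quick sketch):
# Point forecasts for quarters 21 and 22
head(validPeriodForecasts$mean, 2)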
e.) The plots shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from this regression model. i. Recreate these plots. ii. Based on these plots, what can you say about your forecasts for quarters Q21 and Q22? Are they likely to over-forecast, under-forecast, or be reasonably close to the real sales values?
# Helps set up the plot
yrange = range(dss.TS)
# Set up the plot
plot(c(1,7), yrange, type="n", xlab="Year", ylab="Sales (thousands)", bty="l", xaxt="n", yaxt="n")
# Add the department store sales series
lines(dss.TS, bty="l")
# Add the x-axis
axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))
# Add the y-axis
axis(2, at=seq(45000,105000,5000), labels=format(seq(45,105,5)), las=2)
# Add fitted line from training period
lines(sxpo$fitted, col="red")
# Add forecasts for validation period
lines(validPeriodForecasts$mean, col="blue", lty=2)
# Add the legend
legend(1,105000, c("Actual Sales", "Exponential Trend + Season", "Forecasts"), lty=c(1,1,2), col=c("black", "red", "blue"), bty="n")
box()
plot(dss.TS - sxpo$fitted, type="o", bty="l", xlab = "Year", ylab = "Residuals")
box()
abline(h = 0)
Based on the residual plot, both quarter 21 and quarter 22 are likely to be **under-forecasted**, as the residuals trend upward (increasingly positive) as time progresses.
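To see this drift numerically (a sketch reusing the same residual construction as the plot above):
# Last few training-period residuals (actual minus fitted)
tail(dss.TS - sxpo$fitted, 8)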
f.) Looking at the residual plot, which of the following statements appear true? (1) Seasonality is not captured well. (2) The regression model fits the data well. (3) The trend in the data is not captured well by the model.
Of the three statements, only the second holds true: the residual plot shows that the regression model fits the data well.
g.) Which of the following is an adequate and parsimonious solution for improving model fit? (1) Fit a quadratic trend model to the residuals (with Quarter and Quarter^2). (2) Fit a quadratic trend model to Sales (with Quarter and Quarter^2).
The residual plot displays an almost U-shaped pattern. Fitting a quadratic trend model to Sales would help address this curvature in the residuals.
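A minimal sketch of that option (reusing Train.TS from part b; the exact specification expected by the book may differ slightly):
# Add a quadratic trend term while keeping the seasonal dummies
sxpoQuad <- tslm(Train.TS ~ trend + I(trend^2) + season)
summary(sxpoQuad)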
a.) Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?
souvenirSales<- read.csv("SouvenirSales.csv")
# Create a time series
souvenir.ts <- ts(souvenirSales$Sales, start = c(1995,1), frequency = 12)
#Partition data
validLength <- 12
trainLength <- length(souvenir.ts) - validLength
souvenirTrain <- window(souvenir.ts, end=c(1995, trainLength))
souvenirValid <- window(souvenir.ts, start=c(1995,trainLength+1))
souvenirLinearSeason <- tslm(souvenirTrain ~ trend + season)
Looking at the two time plots in Figure 6.14, there appears to be monthly seasonality, and the series trends upward; the roughly linear increase in the log-scale plot suggests an exponential trend in the original sales. The total number of predictors in the model is 12 (1 for trend and 11 monthly dummy variables for seasonality).
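A quick way to confirm that count from the fitted model (a sketch using souvenirLinearSeason from above):
# Number of predictors, excluding the intercept: trend plus 11 monthly dummies
length(coef(souvenirLinearSeason)) - 1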
b.) Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A. i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable? ii. What does the trend coefficient of model A mean?
# Fit Model A
ModelA <- tslm(souvenirTrain ~ trend + season)
summary(ModelA)
##
## Call:
## tslm(formula = souvenirTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
# Forecast Model A over the validation period
ModelAForecast <- forecast(ModelA, h=validLength)
accuracy(ModelAForecast, souvenirValid)
## ME RMSE MAE MPE MAPE MASE
## Training set -5.684342e-14 5365.199 3205.089 6.967778 36.75088 0.855877
## Test set 8.251513e+03 17451.547 10055.276 10.533974 26.66568 2.685130
## ACF1 Theil's U
## Training set 0.4048039 NA
## Test set 0.3206228 0.9075924
December has the highest average sales during the year, likely due to the holiday season. Although December is summer in Australia, it is not typically known as peak tourist season, as the weather that time of year can be described as ‘unbearable’. Since December is much higher than the other months and is not a peak travel month, it is reasonable to attribute the spike in sales to the holidays.
The trend coefficient of Model A (245.36) means that average sales increase by about $245.36 per month, holding seasonality constant.
c.) Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B. i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend? ii. What does the estimated trend coefficient of model B mean? iii. Use this model to forecast the sales in February 2002.
# Fit Model B
ModelB <- tslm(log(souvenirTrain) ~ trend + season)
summary(ModelB)
##
## Call:
## tslm(formula = log(souvenirTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
# Forecast Model B over the validation period
ModelBForecast <- forecast(ModelB, h = validLength)
accuracy(exp(ModelBForecast$mean), souvenirValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales with an exponential trend. The estimated trend coefficient of Model B (0.0211) implies that sales increase by roughly 2.1% per month.
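A one-line check of that growth rate (a sketch using the ModelB fit):
# Convert the log-scale trend coefficient to a monthly percentage increase
(exp(coef(ModelB)["trend"]) - 1) * 100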
# February 2002 corresponds to t = 86 (the training period ends at t = 72, December 2000)
FebruaryForecast <- ModelB$coefficients["(Intercept)"] + ModelB$coefficients["trend"]*86 + ModelB$coefficients["season2"]
exp(FebruaryForecast)
## (Intercept)
## 17062.99
The forecasted sales for February 2002 are approximately $17,063.
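The same value can be obtained from forecast() (a sketch; February 2002 is 14 months beyond the training period, which ends in December 2000):
# Forecast 14 months ahead and back-transform the last value
exp(forecast(ModelB, h = 14)$mean[14])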
d.) Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.
#Accuracy Model A
ModelAForecast <- forecast(ModelA, h = validLength)
accuracy(ModelAForecast$mean, souvenirValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
#Accuracy Model B
ModelBForecast <- forecast(ModelB, h = validLength)
accuracy(exp(ModelBForecast$mean), souvenirValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
In terms of forecast performance, Model B is the better choice. Its validation-period MAPE is lower (15.52 versus 26.67 for Model A), as is its RMSE (about $7,101 versus $17,452). Model B's adjusted R-squared is also higher (0.93 versus 0.75). Taken together, these results indicate that Model B is preferable for forecasting.
e.) How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.
Forecasting future sales and describing the components of past sales are like comparing apples and oranges (or pears, per the video). First, there would be no need to partition the series into training and validation periods, since no forecasts are being evaluated; the model would be fit to the entire 1995-2001 period. Second, the analysis would focus on interpreting the estimated trend and seasonal components, and on external factors that may influence them, rather than on predictive accuracy.
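As a sketch of the descriptive alternative (assuming souvenir.ts from above), the model would be fit to the full 1995-2001 period with no holdout, and the estimated trend and seasonal coefficients interpreted directly:
# Descriptive fit over the entire series (no training/validation split)
ModelFull <- tslm(log(souvenir.ts) ~ trend + season)
summary(ModelFull)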