The two seasonal methods wouldn’t work at all, because this yearly data is not seasonal. So, I’ll set up the time series, plot it and only try to fit linear and quadratic models to the data.
work <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
workTS <- ts(work$HoursPerWeek, start=c(1966), frequency=1)
yrange = range(workTS)
#Plotted
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Average Canadian work hours, weekly", bty="l", xaxt="n", yaxt="n")
#Added time series
lines(workTS, bty="l")
#Added axes
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(34,38,1), labels=format(seq(34,38,1)), las=2)
#Linear trend
workLinear <- tslm(workTS ~ trend)
summary(workLinear)
##
## Call:
## tslm(formula = workTS ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
Linear trend model: This series doesn’t have a linear trend, so we know it won’t be an ideal fit. The adjusted R-squared value of 0.5996 means that nearly 60% of the variation in work hours across the data can be explained by a straight line.
#Quadratic trend
workQuad <- tslm(workTS ~ trend + I(trend^2))
summary(workQuad)
##
## Call:
## tslm(formula = workTS ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
Quadratic trend model: Here, the adjusted R-squared value jumps up to 0.7465, indicating under 75% of variation explained by this more flexible model.
#Plotted
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Average Canadian work hours, weekly", bty="l", xaxt="n", yaxt="n")
#Added time series
lines(workTS, bty="l")
#Added axes
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(34,38,0.5), labels=format(seq(34,38,0.5)), las=2)
#Added models
lines(workLinear$fitted, col="red")
lines(workQuad$fitted, col="blue")
#Added legend
legend(1984,37.5, c("ARPM", "Linear", "Quadratic"), lty=c(1,1,1,1,1), col=c("black", "red", "blue"), bty="n")
The reason for the difference is stark once the lines are plotted: The linear model can’t account for the later rise in hours worked, while the quadratic model can respond to it.
Take a logarithm of sales: This is used to fit an exponential trend.
salesData <- read.csv("DeptStoreSales.csv")
salesTS <- ts(salesData$Sales, start=c(1,1), freq=4)
#Partitioned data
validLength2 <- 4
trainLength2 <- length(salesTS) - validLength2
trainTS <- window(salesTS, start= c(1,1), end= c(1, trainLength2))
validTS <- window(salesTS, start= c(1, trainLength2+1), end= c(1, trainLength2+validLength2))
#Fitted exponential model
salesExpo <- tslm(trainTS ~ trend + season, lambda=0)
summary(salesExpo)
##
## Call:
## tslm(formula = trainTS ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
Higher, because the coefficient for season2 is positive.
#h=2 is how to get quarters 21, 22
validPeriodForecasts <- forecast(salesExpo, h=2)
validPeriodForecasts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
validPeriodForecasts2 <- forecast(salesExpo, h=validLength2)
yrange = range(salesTS)
#Plotted
plot(c(1,7), yrange, type="n", xlab="Year", ylab="Department stores sales, thousands", bty="l", xaxt="n", yaxt="n")
#Added time series
lines(salesTS, bty="l")
#Added axes
axis(1, at=seq(1,24,1), labels=format(seq(1,24,1)))
axis(2, at=seq(45000,105000,10000), labels=format(seq(45,105,10)), las=1)
#Added models
lines(salesExpo$fitted, col="red", lwd=2)
lines(validPeriodForecasts2$mean, col="blue", lty=2)
#Added legend
legend(1,100000, c("ARPM", "Exponential Trend+Season", "Forecasts"), lty=c(1,1,2), col=c("black", "red", "blue"), bty="n")
#Simple plot
plot(trainTS - salesExpo$fitted, type="o", bty="l")
The blue dotted line on the first plot shows consistent but somewhat negligible under-forecasting, which is confirmed on the residual chart, where I’m under 11 times and over 9 times.
The regression model fits the data well is the only true one. The model sticks well to the seasonality and trend of the data, which is born out by the plot.
Fit a quadratic trend model to Sales.
Like the example given on page 129 of the textbook, we see a quadratic trend and monthly seasonality, which means there are 13 predictors: 11 dummy variables for each month after January and t and t-squared for trend.
souvData <- read.csv("SouvenirSales.csv")
souvTS <- ts(souvData$Sales, start=c(1995, 1), frequency=12)
yrange = range(souvTS)
#Partitioned data
validLength3 <- 12
trainLength3 <- length(souvTS) - validLength3
souvTrain <- window(souvTS, end=c(1995, trainLength3))
souvValid <- window(souvTS, start=c(1995, trainLength3+1))
#Model A
modelA <- tslm(souvTrain ~ trend + season)
summary(modelA)
##
## Call:
## tslm(formula = souvTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
December is exponentially higher than all other months, which makes sense: It’s the holiday season and the beginning and warmest time of Australia’s summer.
The trend is positive, so the trend coefficient is also positive. It means that sales go up monthly by an average of $245.36AUD across the dataset.
modelB <- tslm(log(souvTrain) ~ trend + season)
summary(modelB)
##
## Call:
## tslm(formula = log(souvTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
An exponential trend.
Going to a log scale means that these coefficients are now in percentages, so it means It means that sales go up monthly by an average of 0.021% across the dataset.
#Call forecast for period 86, plus add "season 2" because February is the second period.
febForecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86 + modelB$coefficients["season2"]
exp(febForecast)
## (Intercept)
## 17062.99
The forecast for February 2002 — the 86th period — is $17,062.99 in sales.
First, we’ll start with the summary data, which will foreshadow the forecast data.
summary(modelA)
##
## Call:
## tslm(formula = souvTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
Model A is relatively weak, with values for all but 2 months not statistically significant and with the model only explaining 75 percent of the variation.
summary(modelB)
##
## Call:
## tslm(formula = log(souvTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
In Model B, all months are statistically significant and the R-squared value indicates that 93 percent of the variation is explained.
modelAForecast <- forecast(modelA, h=validLength3)
accuracy(modelAForecast$mean, souvValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
modelBForecast <- forecast(modelB, h=validLength3)
accuracy(exp(modelBForecast$mean), souvValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
In the forecast accuracy measures, we see that Model B has far lower RMSE and MAPE values. That confirms that it’s a better model.
In a well-known paper, Galit Shmueli, a co-author of our textbook, explains the differences between descriptive modeling and explanatory modeling. If our goal switched from descriptive to explanatory, we would be less likely to partition the dataset, since it has diminished statistical power in explanatory modeling. On the other hand, statistical significance of one variable’s effect on another is crucial in explanatory modeling, while it only has a minor effect on predictive power.