Ch.6: Q.2,4,5
Q.2 \(\textit{Analysis of Canadian Manufacturing Workers Work-Hours}\)
library(forecast)
leafhrs <- read.csv("CanadianWorkHours.csv",stringsAsFactors = FALSE)
#str(leafhrs)
#head(leafhrs)
#tail(leafhrs)
leafhrsTS <- ts(leafhrs$HoursPerWeek,start=c(1966,1),frequency=1)
yrange <- range(leafhrsTS)
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Hours Per Week", bty="l", xaxt="n", yaxt="n")
lines(leafhrsTS,bty="l")
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,0.5), labels=format(seq(34.5,38.0,0.5)), las=2)
Which one model of the following regression-based models would fit the series best?
\(\bullet\) Linear trend model
\(\bullet\) Linear trend model with seasonality
\(\bullet\) Quadratic trend model
\(\bullet\) Quadratic trend model with seasonality
Note we have yearly data (average number of weekly hours given annually). Since seasons occur \(\textit{within}\) a year, we cannot do anything with models involving seasons, eliminating bullets 2 and 4 automatically. Now let’s fit both a linear (bullet 1) and a quadratic (bullet 3) model to the data and examine \(R^2\) values for the goodness of fit measure. To reiterate, we won’t look at linear and quadratic models with seasonality due to our data given yearly.
leafhrsLinear <- tslm(leafhrsTS ~ trend)
summary(leafhrsLinear)
##
## Call:
## tslm(formula = leafhrsTS ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
leafhrsQuad <- tslm(leafhrsTS ~ trend + I(trend^2))
summary(leafhrsQuad)
##
## Call:
## tslm(formula = leafhrsTS ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
We see that the adjusted \(R^2\) for the linear and quadratic models was determined to be 0.5996 and 0.7465, respectively. So about 60% and about 75% of average annual number of weekly hours can be explained by the linear and quadratic models, respectively. Since the quadratic model adjusted \(R^2\) value is roughly 15 points higher than the linear model’s, the quadratic trend model (bullet 3) would fit the series best. We’ll now plot both models for visualization purposes.
yrange = range(leafhrsTS)
plot(c(1966,2000),yrange,type="n",xlab="Year",ylab="Hours Per Week",bty="l",xaxt="n",yaxt="n")
lines(leafhrsTS,bty="l")
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,0.5), labels=format(seq(34.5,38.0,0.5)), las=2)
# linear model
lines(leafhrsLinear$fitted,col="red")
# quadratic model
lines(leafhrsQuad$fitted,col="blue")
# legend
legend(1990,37.5,c("CWH","Linear","Quadratic"),lty=c(1,1,1),col=c("black","red","blue"),bty="n")
Q.4 \(\textit{Forecasting Department Store Sales}\)
\(\bullet\) Take a logarithm of the Quarter index
\(\bullet\) Take a logarithm of sales
\(\bullet\) Take an exponent of sales
\(\bullet\) Take an exponent of Quarter index
To account for an exponential trend, we take the logarithm (natural log) of the y-variable (sales).
sales <- read.csv("DeptStoreSales.csv",stringsAsFactors = FALSE)
salesTS <- ts(sales$Sales,start=c(1),frequency=4)
#str(sales)
#head(sales)
#tail(sales)
validLength <- 4
trainLength <- length(salesTS) - validLength
salesTrain <- window(salesTS, end=c(1,trainLength))
salesValid <- window(salesTS, start=c(1,trainLength+1))
# fit exponential trend with seasonality mode to training set
salesExpo <- tslm(salesTrain ~ trend + season,lambda=0)
summary(salesExpo)
##
## Call:
## tslm(formula = salesTrain ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
The estimated log-transformed coefficients are in relation to the base season (Quarter 1). We see that Quarter 2 has an etimated coefficient of 0.024956, however, this value does not have an associated statistically significant p-value (p=0.248). That is, Quarter 2 is not statistically significantly different than Quarter 1. However, practical significance should always be considered.
validPeriodForecasts <- forecast(salesExpo, h=validLength)
validPeriodForecasts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
## 6 Q3 70920.09 67297.09 74738.13 65247.24 77086.15
## 6 Q4 93788.66 88997.41 98837.86 86286.57 101943.01
We see that the forecasts for quarters 21 and 22 are roughly 58793.71 and 60951.51, respectively.
yrange = range(salesTS)
plot(c(1,7),yrange,type="n",xlab="Year",ylab="Sales (thousands)",bty="l",xaxt="n",yaxt="n")
lines(salesTS,bty="l")
axis(1,at=seq(1,7,1),labels=format(seq(1,7,1)))
axis(2,at=seq(45000,105000,5000),labels=format(seq(45,105,5)),las=2)
lines(salesExpo$fitted,col="red")
lines(validPeriodForecasts$mean,col="blue",lty=2)
legend(1,105000,c("Sales","Exponential Trend+Season","Forecasts"),lty=c(1,1,2),col=c("black","red","blue"),bty="n")
accuracy(validPeriodForecasts, salesValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 30.63141 1759.359 1388.794 -0.04013518 2.215018 0.4456262
## Test set 5395.00854 6076.912 5395.009 6.62904618 6.629046 1.7311114
## ACF1 Theil's U
## Training set 0.5540818 NA
## Test set 0.2156416 0.4259668
The MAPE for the test set is roughly 6.63%. The residual plot is reproduced below:
plot(salesExpo$residuals, type="o", bty="l")
abline(0,0,col="red")
This plot is on the log-transformed scale. Converting back yields:
plot(salesTrain - salesExpo$fitted, type="o", bty="l")
abline(0,0,col="red")
salesValid - validPeriodForecasts$mean
## Qtr1 Qtr2 Qtr3 Qtr4
## 6 2006.290 3948.490 6076.915 9548.339
plot(salesValid - validPeriodForecasts$mean, type="o", bty="l")
\(\bullet\) Seasonality is not captured well.
This does not appear true. If seasonality was not captured well by the model, there would be noticable patterns in the residual plot.
\(\bullet\) The regression model fits the data well.
This appears to be true. The residuals are approximately centered at 0. We also recall the MAPE for the training period to be about 2.22%.
\(\bullet\) The trend in the data is not captured well by the model.
The appears to be true. The residual plot for the training period shows a somewhat U-shaped trend so a quadratic trend model might be more appropriate. The validation residual plot also exhibits a linear trend.
\(\bullet\) Fit a quadratic trend model to the residuals (with \(\textit{Quarter}\) and \(\textit{Quarter}^2\).)
\(\bullet\) Fit a quadratic trend model to Sales (with \(\textit{Quarter}\) and \(\textit{Quarter}^2\).)
I would fit the quadratic trend model to Sales. This is appropriate since we witnessed a U-shaped pattern in the training period residual plot.
The two time plots in Figure 6.14 indicate the presence of both trend and seasonality. The trend appears linear and the seasonality is monthly. So there are \(12-1=11\) dummy variables for the monthly seasonality and 1 variable for the linear trend \(t\), making 12 predictors in the model. Assuming a quadratic trend adds 1 more predictor (\(t^2\)).
souvSales <- read.csv("SouvenirSales.csv",stringsAsFactors = FALSE)
#str(souvSales)
#head(souvSales)
#tail(souvSales)
souvSalesTS <- ts(souvSales$Sales,start=c(1995,1),frequency=12)
validLength <- 12
trainLength <- length(souvSalesTS) - validLength
souvSalesTrain <- window(souvSalesTS,end=c(1995,trainLength))
souvSalesValid <- window(souvSalesTS, start=c(1995,trainLength+1))
# fit linear trend with seasonality model to training set
modelA <- tslm(souvSalesTrain ~ trend + season)
summary(modelA)
##
## Call:
## tslm(formula = souvSalesTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
Season 12 (December) has the highest sales on average. This is reasonable because the data comes from a souvenir shop at a beach resort town in Queensland, Australia. Combined with the holiday season, December is also the beginning of summer for Australia so it makes sense for souvenir shops to sell more when there are more tourists around.
The trend coefficient of 245.36 is the slope of the model, meaning for every 1 month that passes, sales go up on average by 245.36 (australian dollars).
modelB <- tslm(log(souvSalesTrain) ~ trend + season)
summary(modelB)
##
## Call:
## tslm(formula = log(souvSalesTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
Now run an exponential trend model with Sales as the output variable.
souvSalesExpo <- tslm(souvSalesTrain ~ trend + season,lambda=0)
summary(souvSalesExpo)
##
## Call:
## tslm(formula = souvSalesTrain ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2e+10 on 12 and 59 DF, p-value: < 2.2e-16
As witnessed above, fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales with exponential trend (\(\log(y) = x \Leftarrow \Rightarrow y=e^x\)).
The estimated trend coefficient of model B is equal to 0.021120 on the log-scale. This means sales increase by roughly 2.1% per month.
str(souvSalesTS)
## Time-Series [1:84] from 1995 to 2002: 1665 2398 2841 3547 3753 ...
We see that there are 84 time periods in the data ending at December 2001, so February 2002 corresponds to time period 86.
febForecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86 + modelB$coefficients["season2"]
exp(febForecast)
## (Intercept)
## 17062.99
So the forecast for February 2002 is roughly 17063 (Australian dollars).
Accuracy measures for model A:
modelAForecast <- forecast(modelA,h=validLength)
accuracy(modelAForecast$mean,souvSalesValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
Accuracy measures for model B:
modelBForecast <- forecast(modelB,h=validLength)
accuracy(exp(modelBForecast$mean),souvSalesValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
Plots of actuals and the two models for the validation period:
plot(souvSalesValid, bty="l", xaxt="n", xlab="Year 2001", yaxt="n", ylab="Sales (thousands Australian Dollars")
axis(1, at=seq(2001,2001+11/12,1/12), labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at=seq(10000,85000,5000), labels=format(seq(10,85,5)), las=2)
lines(modelAForecast$mean, col="blue")
lines(exp(modelBForecast$mean), col="red")
legend(2001,100000, c("Actuals", "Linear+Season (Model A)", "log(y) Linear+Season (Model B)"), lty=c(1,1,1), col=c("black", "blue", "red"), bty="n")
Model B had a MAPE of about 15.52% for the test set while Model A’s was about 26.67%. We also see from the above plot that towards the end of the validation period, Model A does not stay very close to the actual data while Model B does. We confirm this by comparing the residuals of Model A and model B.
souvSalesValid - modelAForecast$mean
## Jan Feb Mar Apr May Jun
## 2001 -4602.7902 -4943.8986 2081.2364 312.6398 -1275.8919 660.7014
## Jul Aug Sep Oct Nov Dec
## 2001 6848.3714 8795.3598 9740.9064 8945.3648 17810.0697 54646.0831
souvSalesValid - exp(modelBForecast$mean)
## Jan Feb Mar Apr May Jun
## 2001 463.2178 -1976.2148 1385.0914 2213.7892 -226.8377 1605.3926
## Jul Aug Sep Oct Nov Dec
## 2001 6260.7263 8995.4085 8640.9177 6291.0309 6489.6051 17751.8018
With the lower MAPE and lower residuals towards the end of the validation period, model B (log-transformed Sales Linear + Season) is preferred.
If the time period of interest is between 1995 and 2001, then no forecasting will be required. Instead, the goal will be completely descriptive in nature and there will be no need to partition the data. Understanding the “different components” of sales might refer to temperature/weather, economy, discount sales, etc., so we may wish to create time series for each relevant factor. Econometric models may then be employed to capture any correlations between sales and the different components via multivariate time series.