Due Monday, March 26th, 2018 at 11:59PM: Problems 2, 4 and 5 from Chapter 6 of Shmueli
ca_workers <- read.xls("CanadianWorkHours.xls")
ca_workers.ts <- ts(ca_workers[,2], start = c(1966,1), frequency=1)
#Plotting the Canadian Worker Hours chart
yrange = range(ca_workers.ts)
plot(c(1966,2000), yrange, type = "n", main = "Avg. Annual Weekly Hours Spent By Canadian Manufacturing Workers", xlab = "Year", ylab = "Hours Per Week", bty = "l", xaxt = "n", yaxt = "n")
lines(ca_workers.ts, bty="l")
#X-axis
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
#Y-axis
axis(2, at=seq(34.5,38.0,.5), labels=format(seq(34.5,38.0,.5)))
To determine the best fit for the Canadian Work Hours series, we will look at all four regression-based models and compare the Adjusted R-squared values.
We begin by evaluating the Linear trend model:
ca_workers_linear <- tslm(ca_workers.ts ~ trend)
summary(ca_workers_linear)
##
## Call:
## tslm(formula = ca_workers.ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
Because there appears to be no seasonality, we will forego the Linear trend model with seasonality and Quadratic model with seasonality, and instead simply look at the Quadratic trend model:
ca_workers_quadratic <- tslm(ca_workers.ts ~ poly(trend, 2, raw = TRUE))
summary(ca_workers_quadratic)
##
## Call:
## tslm(formula = ca_workers.ts ~ poly(trend, 2, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## poly(trend, 2, raw = TRUE)1 -0.1827154 0.0278786 -6.554 2.21e-07 ***
## poly(trend, 2, raw = TRUE)2 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
Based on the Adjusted R-squared values of 0.5996 for the Linear trend model and 0.7465 for the Quadratic trend model, it can be determined that the Quadratic trend model is (of the four proposed) the best regression-based model to fit the Canadian Work Hours series.
plot(c(1966,2000), yrange, type = "n", main = "Avg. Annual Weekly Hours Spent By Canadian Manufacturing Workers", xlab = "Year", ylab = "Hours Per Week", bty = "l", xaxt = "n", yaxt = "n")
lines(ca_workers.ts, bty="l")
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,.5), labels=format(seq(34.5,38.0,.5)))
lines(ca_workers_linear$fitted, col="orange", lwd=2)
lines(ca_workers_quadratic$fitted, col="blue", lwd=2)
legend(x = "topright", legend = c("Linear trend model", "Quadratic trend model"), col = c("orange", "blue"), lty = c(1, 1), bty = "n")
dept <- read.xls("DepartmentStoreSales.xls")
dept.ts <- ts(dept[,2], start = c(1,1), frequency=4)
#Plotting the Quarterly Sales chart
yrange = range(dept.ts)
plot(c(1,7), yrange, type = "n", main = "Department Store Quarterly Sales", xlab = "Year", ylab = "Sales", bty = "l", xaxt = "n", yaxt = "n")
lines(dept.ts, bty="l")
lines(dept.ts, type = "o")
#X-axis
axis(1, at = seq(1,7,1), labels = format(seq(1,7,1)))
#Y-axis
axis(2, at = seq(40000,120000,20000), labels = format(seq(40000,120000,20000)))
The forecaster must take a logarithm of sales, which can be done as a function in R.
dept_nValid <- 4
dept_nTrain <- length(dept.ts) - dept_nValid
dept_train.ts <- window(dept.ts, start = c(1,1), end = c(1, dept_nTrain))
dept_valid.ts <- window(dept.ts, start = c(1, dept_nTrain + 1), end = c(1, dept_nTrain + dept_nValid))
dept_exponential_season <- tslm(dept_train.ts ~ trend + season, lambda = 0)
summary(dept_exponential_season)
##
## Call:
## tslm(formula = dept_train.ts ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
dept_sales_forecast <- forecast(dept_exponential_season, h = 2)
print(dept_sales_forecast)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
yrange = range(dept_train.ts)
#recreate charts found on page 139 in textbook
plot(c(1,6), yrange, type = "n", main = "Exponential Regression Model With Seasonality \n For Department Store Quarterly Sales", xlab = "Year", ylab = "Sales", bty = "l", xaxt = "n", yaxt = "n")
lines(dept_train.ts, bty="l")
lines(dept_train.ts, type = "o")
lines(dept_exponential_season$fitted.values, col = "red", lwd = 2)
#X-axis
axis(1, at = seq(1,6,1), labels = format(seq(1,6,1)))
#Y-axis
axis(2, at = seq(40000,120000,20000), labels = format(seq(40000,120000,20000)))
#recreate charts found on page 139 in textbook
plot(dept_train.ts - dept_exponential_season$fitted.values, main = "Exponential Regression Model With Seasonality Residuals \n For Department Store Quarterly Sales", xlab = "Year", ylab = "Residuals", bty = "l", type = "o", xlim=c(1,6))
#recreate charts found on page 139 in textbook
plot(dept_train.ts - dept_exponential_season$fitted.values, main = "Exponential Regression Model With Seasonality Residuals \n For Department Store Quarterly Sales", xlab = "Year", ylab = "Residuals", bty = "l", type = "o", xlim=c(1,6))
abline(h = 0, col = "pink", lwd = 4)
Because we saw periods of under-forecasting throughout year 5, it would be safe to assume that Q21 and Q22 (Q1 and Q2 of year 6) would also be under-forecast.
Based on the statements above, it appears as though the only statement that rings true is that the regression model fits the data well. We can see from the residual plot above (Chart 2.4) that we hover relatively close to the horizontal line at point 0 suggesting that we have adequately captured both seasonality and trend over the 6-year sales period.
The solution of fitting a quadratic trend model to Sales (with Quarter and Quarter^2) is both an adequate and parsimonious solution for improving model fit.
#Upload of the SouvenirSales.xls file
souvenir <- read.xls("SouvenirSales.xls")
#partitioned
souvenir.ts <- ts(souvenir[,2], start = c(1995,1), frequency = 12)
souvenir_nValid <- 12
souvenir_nTrain <- length(souvenir.ts) - souvenir_nValid
souvenir_train.ts <- window(souvenir.ts, start = c(1995,1), end = c(1995, souvenir_nTrain))
souvenir_valid.ts <- window(souvenir.ts, start = c(1995, souvenir_nTrain + 1), end = c(1995, souvenir_nTrain + souvenir_nValid))
plot(souvenir.ts, main = "Monthly Souvenir Sales by Shop X in Queensland, Australia", xlab = "Year", ylab = "Sales (Australian Dollars)", type = "n", xaxt = "n", yaxt = "n", bty = "l")
lines(souvenir.ts, bty = "l", lwd = 2, col = "000033", type = "o", pch = 18)
axis(1, at = seq(1995,2002, 1), labels = format(seq(1995,2002, 1)))
axis(2, at = seq(0, 125000, 25000), labels = format(seq(0, 125000, 25000), las = 2))
plot(log(souvenir.ts), main = "Monthly Souvenir Sales by Shop X in Queensland, Australia", xlab = "Year", ylab = "log(Sales)", type = "n", xaxt = "n", bty = "l")
lines(log(souvenir.ts), bty = "l", lwd = 2, col = "000033", type = "o", pch = 18)
axis(1, at = seq(1995,2002, 1), labels = format(seq(1995,2002, 1)))
From the two plots on page 140 in the textbook (and recreated above), Trend and Seasonality are two predictors that should be included in the regression model. Trend accounts for 1 variable, and because we are looking at monthly sales, there are 11 variables created as well (the 12 monthly minus 1 for the base). Therefore in total, there are 12.
souvenir_model_A <- tslm(souvenir_train.ts ~ trend + season)
summary(souvenir_model_A)
##
## Call:
## tslm(formula = souvenir_train.ts ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
plot(souvenir.ts, main = "Model A \n Monthly Souvenir Sales by Shop X in Queensland, Australia", xlab = "Year", ylab = "Sales (Australian Dollars)", type = "n", xaxt = "n", yaxt = "n", bty = "l")
lines(souvenir_train.ts, lwd = 2)
lines(souvenir_model_A$fitted.values, col = "#FCC02A", lwd=2)
axis(1, at = seq(1995,2002, 1), labels = format(seq(1995,2002, 1)))
axis(2, at = seq(0, 125000, 25000), labels = format(seq(0, 125000, 25000), las = 2))
legend(x = "topleft", legend = c("Souvenir sales", "Regression model"), col = c("black", "#FCC02A"), lty = c(1, 1), bty = "n")
i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable? December has the highest average monthly sales. This is perfectly reasonable since Australia has a Summer season that begins in December and goes through to February. Given the warmer temperatures and holiday period, one would expect a greater number of tourists to visit the Souvenir Shop than in cooler months.
ii. What does the trend coefficient of Model A mean? The trend coefficient of 245.36 means that month-over-month souvenir sales will typically increase by that amount over the long run.
souvenir_model_B <- tslm(souvenir_train.ts ~ trend + season, lambda=0)
summary(souvenir_model_B)
##
## Call:
## tslm(formula = souvenir_train.ts ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2e+10 on 12 and 59 DF, p-value: < 2.2e-16
plot(souvenir.ts, main = "Model B \n Monthly Log Sales by Shop X in Queensland, Australia", xlab = "Year", ylab = "Sales (Australian Dollars)", type = "n", xaxt = "n", yaxt = "n", bty = "l")
lines(souvenir_train.ts, lwd = 2)
lines(souvenir_model_B$fitted.values, col = "#008080", lwd=2)
axis(1, at = seq(1995,2002, 1), labels = format(seq(1995,2002, 1)))
axis(2, at = seq(0, 125000, 25000), labels = format(seq(0, 125000, 25000), las = 2))
legend(x = "topleft", legend = c("Souvenir sales", "Exponential model"), col = c("black", "#008080"), lty = c(1, 1), bty = "n")
i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend? An exponential trend.
ii. What does the estimated trend coefficient of Model B mean? The estimated trend coefficient of 0.021120 means that every month sales typically increase by approximately 2.1120%.
iii. Use this model to forecast the sales in February 2002.
Based on the forecast, sales for February 2002 will be $17,063 (Australian dollar).
model_B_forecast <- forecast(souvenir_model_B, h = 14)
pander(model_B_forecast$mean, caption = "Souvenir Sales Forecasts from January 2001 to February 2002", split.table = Inf)
| Â | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 | 9780 | 13243 | 20442 | 15144 | 16225 | 16996 | 19894 | 19591 | 21864 | 24530 | 40145 | 86909 |
| 2002 | 12601 | 17063 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
model_A_forecast <- forecast(souvenir_model_A, h = 14)
pander(accuracy(model_A_forecast, souvenir_valid.ts), caption = "Accuracy of Model A on Souvenir Sales", split.table = Inf)
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -5.684e-14 | 5365 | 3205 | 6.968 | 36.75 | 0.8559 | 0.4048 | NA |
| Test set | 8252 | 17452 | 10055 | 10.53 | 26.67 | 2.685 | 0.3206 | 0.9076 |
pander(accuracy(model_B_forecast, souvenir_valid.ts), caption = "Accuracy of Model B on Souvenir Sales", split.table = Inf)
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 197.5 | 2865 | 1671 | -1.473 | 13.94 | 0.4463 | 0.4381 | NA |
| Test set | 4824 | 7101 | 5192 | 12.36 | 15.52 | 1.386 | 0.4245 | 0.461 |
plot(souvenir_valid.ts, main = "Forecasting Monthly Sales by Shop X in Queensland, Australia in 2001", xlab = "Month", ylab = "Sales (Australian Dollars)", type = "n", xaxt = "n", yaxt = "n", bty = "l")
lines(souvenir_valid.ts, col = "black")
lines(model_A_forecast$mean, col="#FCC02A", lwd = 2)
lines(model_B_forecast$mean, col="#008080", lwd = 2)
axis(1, at = seq(2001,2001+11/12, 1/12), labels = c("Jan.", "Feb.", "Mar.", "Apr.", "May", "Jun.", "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."))
axis(2, at = seq(0, 125000, 25000), labels = format(seq(0, 125000, 25000), las = 2))
legend(x = "topleft", legend = c("Souvenir sales", "Model A", "Model B"), col = c("black", "#FCC02A", "#008080"), lty = c(1, 1), bty = "n")
When plotting actual monthly souvenir sales in 2001 against the forecasts from Model A and Model B, we can see that Model B is more preferable forecast model as it hugs actual results closer than Model A. Also, from the summary tables above, not only does Model B have a greater adjusted R-squared value (1 compared to 0.7476) but Model B also has lower MAPE (15.52 to 26.67) and MASE (1.386 to 2.685) values.