Assignment

Due Monday, March 26th, 2018 at 11:59PM: Problems 2, 4 and 5 from Chapter 6 of Shmueli




Analysis of Canadian Manufacturing Workers Work-Hours

The time series plot describes the average annual number of weekly hours spent by Canadian manufacturing workers.

  1. Which one of the following regression-based models would fit the series best?

Chart 1.1

ca_workers <- read.xls("CanadianWorkHours.xls")
ca_workers.ts <- ts(ca_workers[,2], start = c(1966,1), frequency=1)

#Plotting the Canadian Worker Hours chart
yrange = range(ca_workers.ts)

plot(c(1966,2000), yrange, type = "n", main = "Avg. Annual Weekly Hours Spent By Canadian Manufacturing Workers", xlab = "Year", ylab = "Hours Per Week", bty = "l", xaxt = "n", yaxt = "n")
lines(ca_workers.ts, bty="l")

#X-axis
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))

#Y-axis
axis(2, at=seq(34.5,38.0,.5), labels=format(seq(34.5,38.0,.5)))


To determine the best fit for the Canadian Work Hours series, we will look at all four regression-based models and compare the Adjusted R-squared values.
We begin by evaluating the Linear trend model:

ca_workers_linear <- tslm(ca_workers.ts ~ trend)
summary(ca_workers_linear)
## 
## Call:
## tslm(formula = ca_workers.ts ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20457 -0.28761  0.04779  0.30210  1.23190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.416134   0.175818 212.811  < 2e-16 ***
## trend       -0.061373   0.008518  -7.205 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared:  0.6113, Adjusted R-squared:  0.5996 
## F-statistic: 51.91 on 1 and 33 DF,  p-value: 2.928e-08


Because there appears to be no seasonality, we will forego the Linear trend model with seasonality and Quadratic model with seasonality, and instead simply look at the Quadratic trend model:

ca_workers_quadratic <- tslm(ca_workers.ts ~ poly(trend, 2, raw = TRUE))
summary(ca_workers_quadratic)
## 
## Call:
## tslm(formula = ca_workers.ts ~ poly(trend, 2, raw = TRUE))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 38.1644156  0.2176599 175.340  < 2e-16 ***
## poly(trend, 2, raw = TRUE)1 -0.1827154  0.0278786  -6.554 2.21e-07 ***
## poly(trend, 2, raw = TRUE)2  0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10


Based on the Adjusted R-squared values of 0.5996 for the Linear trend model and 0.7465 for the Quadratic trend model, it can be determined that the Quadratic trend model is (of the four proposed) the best regression-based model to fit the Canadian Work Hours series.



Chart 1.2

plot(c(1966,2000), yrange, type = "n", main = "Avg. Annual Weekly Hours Spent By Canadian Manufacturing Workers", xlab = "Year", ylab = "Hours Per Week", bty = "l", xaxt = "n", yaxt = "n")
lines(ca_workers.ts, bty="l")

axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,.5), labels=format(seq(34.5,38.0,.5)))

lines(ca_workers_linear$fitted, col="orange", lwd=2)
lines(ca_workers_quadratic$fitted, col="blue", lwd=2)

legend(x = "topright", legend = c("Linear trend model", "Quadratic trend model"), col = c("orange", "blue"), lty = c(1, 1), bty = "n")













Forecast Department Store Sales

The time series plot describes actual quarterly sales for a department store over a 6-year period.

Chart 2.1

dept <- read.xls("DepartmentStoreSales.xls")
dept.ts <- ts(dept[,2], start = c(1,1), frequency=4)

#Plotting the Quarterly Sales chart
yrange = range(dept.ts)

plot(c(1,7), yrange, type = "n", main = "Department Store Quarterly Sales", xlab = "Year", ylab = "Sales", bty = "l", xaxt = "n", yaxt = "n")
lines(dept.ts, bty="l")
lines(dept.ts, type = "o")

#X-axis
axis(1, at = seq(1,7,1), labels = format(seq(1,7,1)))

#Y-axis
axis(2, at = seq(40000,120000,20000), labels = format(seq(40000,120000,20000)))


  1. The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?

  • Take a logaritm of the Quarter index
  • Take a logarithm of sales
  • Take an exponent of sales
  • Take an exponent of Quarter index

The forecaster must take a logarithm of sales, which can be done as a function in R.



  1. Fit a regression model with an exponential trend and seasonality, usin only the first 20 quarters as the trainign period.

dept_nValid <- 4
dept_nTrain <- length(dept.ts) - dept_nValid
dept_train.ts <- window(dept.ts, start = c(1,1), end = c(1, dept_nTrain))
dept_valid.ts <- window(dept.ts, start = c(1, dept_nTrain + 1), end = c(1, dept_nTrain + dept_nValid))

dept_exponential_season <- tslm(dept_train.ts ~ trend + season, lambda = 0)
summary(dept_exponential_season)
## 
## Call:
## tslm(formula = dept_train.ts ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16



  1. From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?

Q2 average sales are not statistically significant from the base season of Q1, meaning that sales in both quarters are approximately equal to one another.



  1. Use this model to forecast sales in quarters 21 and 22.

dept_sales_forecast <- forecast(dept_exponential_season, h = 2)
print(dept_sales_forecast)
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04 66250.87



  1. The plot shown describe the fit (top) and forecast errors (bottom) from this regression model.

i. Recreate these plots.

Chart 2.3

yrange = range(dept_train.ts)

#recreate charts found on page 139 in textbook
plot(c(1,6), yrange, type = "n", main = "Exponential Regression Model With Seasonality \n For Department Store Quarterly Sales", xlab = "Year", ylab = "Sales", bty = "l", xaxt = "n", yaxt = "n")
lines(dept_train.ts, bty="l")
lines(dept_train.ts, type = "o")
lines(dept_exponential_season$fitted.values, col = "red", lwd = 2)
#X-axis
axis(1, at = seq(1,6,1), labels = format(seq(1,6,1)))

#Y-axis
axis(2, at = seq(40000,120000,20000), labels = format(seq(40000,120000,20000)))

Chart 2.3

#recreate charts found on page 139 in textbook
plot(dept_train.ts - dept_exponential_season$fitted.values, main = "Exponential Regression Model With Seasonality Residuals \n For Department Store Quarterly Sales", xlab = "Year", ylab = "Residuals", bty = "l", type = "o", xlim=c(1,6))


ii. Based on these plots, what can you say about your forecasts for quarters Q21 and Q22? Are they likely to overforecast, under-forecast, or be reasonably close to the real sales values?
To more easily identify whether we have over or under forecasted, we can replot our residuals chart with a horizontal break at point 0. Residual values well above that line are under-forecast, while values that fall well below the line were overforecasted.

Chart 2.4

#recreate charts found on page 139 in textbook
plot(dept_train.ts - dept_exponential_season$fitted.values, main = "Exponential Regression Model With Seasonality Residuals \n For Department Store Quarterly Sales", xlab = "Year", ylab = "Residuals", bty = "l", type = "o", xlim=c(1,6))
abline(h = 0, col = "pink", lwd = 4)

Because we saw periods of under-forecasting throughout year 5, it would be safe to assume that Q21 and Q22 (Q1 and Q2 of year 6) would also be under-forecast.



  1. Looking at the residual plot, which of the following statements appear true?

  • Seasonality is not captured well.
  • The regression model fits the data well.
  • The trend in the data is not captured well by the model.

Based on the statements above, it appears as though the only statement that rings true is that the regression model fits the data well. We can see from the residual plot above (Chart 2.4) that we hover relatively close to the horizontal line at point 0 suggesting that we have adequately captured both seasonality and trend over the 6-year sales period.



  1. Which of the following solutions is adequate and a parsimonious solution for improving model fit?

  • Fit a quadratic trend model to the residuals (with Quarter and Quarter^2.)
  • Fit a quadratic trend model to Sales (with Quarter and Quarter^2).

The solution of fitting a quadratic trend model to Sales (with Quarter and Quarter^2) is both an adequate and parsimonious solution for improving model fit.













Souvenir Sales

The data contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia between 1995 and 2001.

Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

  1. Based on the the two time plots, which predictors should be included in the regression model? What is the total number of predictors in the model?

Chart 3.1

#Upload of the SouvenirSales.xls file
souvenir <- read.xls("SouvenirSales.xls")

#partitioned
souvenir.ts <- ts(souvenir[,2], start = c(1995,1), frequency = 12)
souvenir_nValid <- 12
souvenir_nTrain <- length(souvenir.ts) - souvenir_nValid
souvenir_train.ts <- window(souvenir.ts, start = c(1995,1), end = c(1995, souvenir_nTrain))
souvenir_valid.ts <- window(souvenir.ts, start = c(1995, souvenir_nTrain + 1), end = c(1995, souvenir_nTrain + souvenir_nValid))

plot(souvenir.ts, main = "Monthly Souvenir Sales by Shop X in Queensland, Australia", xlab = "Year", ylab = "Sales (Australian Dollars)", type = "n", xaxt = "n", yaxt = "n", bty = "l")
lines(souvenir.ts, bty = "l", lwd = 2, col = "000033", type = "o", pch = 18)
axis(1, at = seq(1995,2002, 1), labels = format(seq(1995,2002, 1)))
axis(2, at = seq(0, 125000, 25000), labels = format(seq(0, 125000, 25000), las = 2))

Chart 3.2

plot(log(souvenir.ts), main = "Monthly Souvenir Sales by Shop X in Queensland, Australia", xlab = "Year", ylab = "log(Sales)", type = "n", xaxt = "n", bty = "l")
lines(log(souvenir.ts), bty = "l", lwd = 2, col = "000033", type = "o", pch = 18)
axis(1, at = seq(1995,2002, 1), labels = format(seq(1995,2002, 1)))

From the two plots on page 140 in the textbook (and recreated above), Trend and Seasonality are two predictors that should be included in the regression model. Trend accounts for 1 variable, and because we are looking at monthly sales, there are 11 variables created as well (the 12 monthly minus 1 for the base). Therefore in total, there are 12.



  1. Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.

souvenir_model_A <- tslm(souvenir_train.ts ~ trend + season)
summary(souvenir_model_A)
## 
## Call:
## tslm(formula = souvenir_train.ts ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16
plot(souvenir.ts, main = "Model A \n Monthly Souvenir Sales by Shop X in Queensland, Australia", xlab = "Year", ylab = "Sales (Australian Dollars)", type = "n", xaxt = "n", yaxt = "n", bty = "l")
lines(souvenir_train.ts, lwd = 2)
lines(souvenir_model_A$fitted.values, col = "#FCC02A", lwd=2)
axis(1, at = seq(1995,2002, 1), labels = format(seq(1995,2002, 1)))
axis(2, at = seq(0, 125000, 25000), labels = format(seq(0, 125000, 25000), las = 2))
legend(x = "topleft", legend = c("Souvenir sales", "Regression model"), col = c("black", "#FCC02A"), lty = c(1, 1), bty = "n")



i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable? December has the highest average monthly sales. This is perfectly reasonable since Australia has a Summer season that begins in December and goes through to February. Given the warmer temperatures and holiday period, one would expect a greater number of tourists to visit the Souvenir Shop than in cooler months.

ii. What does the trend coefficient of Model A mean? The trend coefficient of 245.36 means that month-over-month souvenir sales will typically increase by that amount over the long run.



  1. Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.

souvenir_model_B <- tslm(souvenir_train.ts ~ trend + season, lambda=0)
summary(souvenir_model_B)
## 
## Call:
## tslm(formula = souvenir_train.ts ~ trend + season, lambda = 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2e+10 on 12 and 59 DF,  p-value: < 2.2e-16
plot(souvenir.ts, main = "Model B \n Monthly Log Sales by Shop X in Queensland, Australia", xlab = "Year", ylab = "Sales (Australian Dollars)", type = "n", xaxt = "n", yaxt = "n", bty = "l")
lines(souvenir_train.ts, lwd = 2)
lines(souvenir_model_B$fitted.values, col = "#008080", lwd=2)
axis(1, at = seq(1995,2002, 1), labels = format(seq(1995,2002, 1)))
axis(2, at = seq(0, 125000, 25000), labels = format(seq(0, 125000, 25000), las = 2))
legend(x = "topleft", legend = c("Souvenir sales", "Exponential model"), col = c("black", "#008080"), lty = c(1, 1), bty = "n")



i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend? An exponential trend.



ii. What does the estimated trend coefficient of Model B mean? The estimated trend coefficient of 0.021120 means that every month sales typically increase by approximately 2.1120%.



iii. Use this model to forecast the sales in February 2002.
Based on the forecast, sales for February 2002 will be $17,063 (Australian dollar).

model_B_forecast <- forecast(souvenir_model_B, h = 14)
pander(model_B_forecast$mean, caption = "Souvenir Sales Forecasts from January 2001 to February 2002", split.table = Inf)
Souvenir Sales Forecasts from January 2001 to February 2002
  Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2001 9780 13243 20442 15144 16225 16996 19894 19591 21864 24530 40145 86909
2002 12601 17063 NA NA NA NA NA NA NA NA NA NA



  1. Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

Table 3.1

model_A_forecast <- forecast(souvenir_model_A, h = 14)
pander(accuracy(model_A_forecast, souvenir_valid.ts), caption = "Accuracy of Model A on Souvenir Sales", split.table = Inf)
  ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set -5.684e-14 5365 3205 6.968 36.75 0.8559 0.4048 NA
Test set 8252 17452 10055 10.53 26.67 2.685 0.3206 0.9076
Table: Accuracy of Model A on Souvenir Sales

Table 3.2

pander(accuracy(model_B_forecast, souvenir_valid.ts), caption = "Accuracy of Model B on Souvenir Sales", split.table = Inf)
Accuracy of Model B on Souvenir Sales
  ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
Training set 197.5 2865 1671 -1.473 13.94 0.4463 0.4381 NA
Test set 4824 7101 5192 12.36 15.52 1.386 0.4245 0.461
plot(souvenir_valid.ts, main = "Forecasting Monthly Sales by Shop X in Queensland, Australia in 2001", xlab = "Month", ylab = "Sales (Australian Dollars)", type = "n", xaxt = "n", yaxt = "n", bty = "l")
lines(souvenir_valid.ts, col = "black")
lines(model_A_forecast$mean, col="#FCC02A", lwd = 2)
lines(model_B_forecast$mean, col="#008080", lwd = 2)
axis(1, at = seq(2001,2001+11/12, 1/12), labels = c("Jan.", "Feb.", "Mar.", "Apr.", "May", "Jun.", "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."))
axis(2, at = seq(0, 125000, 25000), labels = format(seq(0, 125000, 25000), las = 2))
legend(x = "topleft", legend = c("Souvenir sales", "Model A", "Model B"), col = c("black", "#FCC02A", "#008080"), lty = c(1, 1), bty = "n")


When plotting actual monthly souvenir sales in 2001 against the forecasts from Model A and Model B, we can see that Model B is more preferable forecast model as it hugs actual results closer than Model A. Also, from the summary tables above, not only does Model B have a greater adjusted R-squared value (1 compared to 0.7476) but Model B also has lower MAPE (15.52 to 26.67) and MASE (1.386 to 2.685) values.



  1. How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.

I may seek to model out individual months and try to obtain more granular data to work from. If we began to break out months by day then we could account for various seasonality on a weekly level and may act accordingly to help spur sales in slower months. Also, given the strong surge in sales for December year-after-year, we could incorporate additional data sources as a means of shedding light on the effect various holiday and/or tourism trends have on sales.