Assignment 5

Chapter 6

Question 2-Analysis of Canadian Manufacturing Workers Work-Hours: The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers. The data is available in CanadianWorkHours.xls.

Which one model of the following regression-based models would fit the series best?

From the choices of a linear trend model, a linear trend model with seasonality, a quadratic trend model, and a quadreatic trend model with seasonality, the quadratic trend model best fits the time series depicted in figure 6.10 of the text. There is no apparent seasonality displayed and the plot is definitely not linear.

Question 4-Forecasting Department Store Sales: The time series plot shown in Figure 6.12 describes actual quarterly sales for a department store over a 6-year period. The data is available in DepartmentStoreSales.xls.

(a) The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?

Take a logarithm of sales.

(b) Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).

StoreSales <- read.csv("/Users/wendyhayes/Desktop/MBA 678-Predictive Analytics/DeptStoreSales.csv",stringsAsFactors=FALSE)
library(knitr)
library(forecast)

#I decided to just look at the data.
kable(StoreSales)

Quarter	Sales
1	50147
2	49325
3	57048
4	76781
5	48617
6	50898
7	58517
8	77691
9	50862
10	53028
11	58849
12	79660
13	51640
14	54119
15	65681
16	85175
17	56405
18	60031
19	71486
20	92183
21	60800
22	64900
23	76997
24	103337

StoreSalesTS <- ts(StoreSales$Sales, start=c(1,1), frequency=4)
validLength <- 4
trainLength <- length(StoreSalesTS)-validLength
SalesTrain <- window(StoreSalesTS, end=c(1,trainLength))
SalesTrain

##    Qtr1  Qtr2  Qtr3  Qtr4
## 1 50147 49325 57048 76781
## 2 48617 50898 58517 77691
## 3 50862 53028 58849 79660
## 4 51640 54119 65681 85175
## 5 56405 60031 71486 92183

SalesValid <- window(StoreSalesTS, start=c(1,trainLength +1))
validLength <- 4
trainLength <- length(StoreSalesTS) - validLength
salesTrain <- window(StoreSalesTS, end=c(1, trainLength))
salesValid <- window(StoreSalesTS, start=c(1, trainLength+1))
salesExpo <- tslm(salesTrain ~ trend + season, lambda=0)
summary(salesExpo)

## 
## Call:
## tslm(formula = salesTrain ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16

yrange=range(StoreSalesTS)
plot(c(1, 20), yrange, type="n", xlab="Quarter", ylab="Store Sales (in thousands)", bty="l", xaxt="n", yaxt="n")
title("Store Sales - Regression with Exponential Trend & Seasonality")
lines(as.vector(StoreSalesTS), bty="l")
axis(1, at=seq(1,20,1),labels=format(seq(1,20,1)))
axis(2, at=seq(45000,110000,5000), labels=format(seq(45,110,5)), las=2)
lines(as.vector(salesExpo$fitted), col="red")
legend("topleft", c("ARPM","Exponential Trend+Season"), lty=c(1,1), col=c("black","red"), bty="n")

(c) A partial output is shown in Table 6.7 (plotted above). From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?

After adjusting for trend, Q2 average sales are higher than the average Q1 sales.

(d) Use this model to forecast sales in quarters 21 and 22.

validPeriodForecasts <- forecast(salesExpo, h=validLength)
validPeriodForecasts

##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84  63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04  66250.87
## 6 Q3       70920.09 67297.09 74738.13 65247.24  77086.15
## 6 Q4       93788.66 88997.41 98837.86 86286.57 101943.01

yrange=range(StoreSalesTS)
plot(c(1, 10), yrange, type="n", xlab="Quarters", ylab="Store Sales (in thousands)", bty="l", xaxt="n", yaxt="n")
title("Forecast Store Sales")
lines(StoreSalesTS, bty="l")
axis(1, at=seq(1,10,1),labels=format(seq(1,10,1)))
axis(2, at=seq(45000,110000,5000), labels=format(seq(45,110,5)), las=2)
lines(salesExpo$fitted, col="red")
lines(validPeriodForecasts$mean, col="blue", lty=2)
legend("topleft", c("ARPM", "Exponential Trend+Season", "Forecasts"), lty=c(1,1,2), col=c("black","red","blue"), bty="n")

(e) The plots shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from this regression model.

i. Recreate these plots.

StoreSales <- read.csv("/Users/wendyhayes/Desktop/MBA 678-Predictive Analytics/DeptStoreSales.csv",stringsAsFactors=FALSE)

StoreSalesTS <- ts(StoreSales$Sales, start=c(1,1), frequency=4)
validLength <- 4
trainLength <- length(StoreSalesTS)-validLength
SalesTrain <- window(StoreSalesTS, end=c(1,trainLength))

SalesValid <- window(StoreSalesTS, start=c(1,trainLength +1))
library(forecast)
validLength <- 4
trainLength <- length(StoreSalesTS) - validLength
salesTrain <- window(StoreSalesTS, end=c(1, trainLength))
salesValid <- window(StoreSalesTS, start=c(1, trainLength+1))
salesExpo <- tslm(salesTrain ~ trend + season, lambda=0)

yrange=range(StoreSalesTS)
plot(c(1, 20), yrange, type="n", xlab="Quarter", ylab="Store Sales (in thousands)", bty="l", xaxt="n", yaxt="n")
title("Fit of Regression Model from Textbook")
lines(as.vector(StoreSalesTS), bty="l")
axis(1, at=seq(1,20,1),labels=format(seq(1,20,1)))
axis(2, at=seq(45000,110000,5000), labels=format(seq(45,110,5)), las=2)
lines(as.vector(salesExpo$fitted), col="red")
legend("topleft", c("ARPM","Exponential Trend+Season"), lty=c(1,1), col=c("black","red"), bty="n")

modelvalues <- c(salesExpo$fitted,validPeriodForecasts$mean)
residualvalues <- StoreSalesTS - modelvalues
plot(as.vector(residualvalues), xlab="Quarter",ylab="residuals", type="o",bty="l")
title("Residuals")

ii. Based on these plots, what can you say about your forecasts for quarters Q21 and Q22? Are they likely to overforecast, under-forecast, or be reasonably close to the real sales values?

The forecasts are likely to be under-forecasted as the residuals are going more positive, especially in the later quarters.

(f) Looking at the residual plot, which of the following statements appear true?

• Seasonality is not captured well. - False as there does not seem to be a seasonal pattern to the residuals.

• The regression model fits the data well. - False as the residuals don’t seem to be well distributed around zero and there seems like there is a pattern to the residuals.

• The trend in the data is not captured well by the model.-True as there seems to be a pattern to the residuals indicating that the trend isn’t captured well.

(g) Which of the following solutions is adequate and a parsimonious solution for improving model fit?

• Fit a quadratic trend model to the residuals (with Quarter and Quarter2.) - no

• Fit a quadratic trend model to Sales (with Quarter and Quarter2.)-Yes, this is the best solution for improving the model fit.

Question 5 - Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.

Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

(a) Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?

The exponential model seems to be the better of the two because the seasonal peaks are more constant and the trend seems more apparent. Therefore there would be 13 predictors: t, t^2, and 11 months.

(b) Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.

SouvSales<- read.csv("/Users/wendyhayes/Desktop/MBA 678-Predictive Analytics/SouvenirSales.csv",stringsAsFactors=FALSE)
library(forecast)

SouvSalesTS <- ts(SouvSales$Sales, start = c(1995, 1), frequency = 12)

validLength <- 12
trainLength <- length(SouvSalesTS)-validLength
SalesTrain <- window(SouvSalesTS, end=c(1995,trainLength))
SalesValid <- window(SouvSalesTS, start=c(1995,trainLength +1))

yrange = range (SouvSalesTS)

plot(c(1995,2000), yrange, type = "n", xlab = "Year", ylab = "Souvenir Sales (in thousands in Australian Dollars)", bty="l",xaxt = "n", yaxt = "n")
title("Model A")
lines(SalesTrain, bty="l")
axis(1, at=seq(1995,2000,1), labels=format(seq(1995,2000,1)))
axis(2, at=seq(1600,110000,10000),labels =format(seq(16,1100,100)), las=2)

SouvSalesLinearSeason <- tslm(SalesTrain ~ trend + season)
lines(SouvSalesLinearSeason$fitted, col="red")

summary(SouvSalesLinearSeason)

## 
## Call:
## tslm(formula = SalesTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

validAForecast <- forecast(SouvSalesLinearSeason, h=validLength)
accuracy(validAForecast,SalesValid)

##                         ME      RMSE       MAE       MPE     MAPE     MASE
## Training set -5.684342e-14  5365.199  3205.089  6.967778 36.75088 0.855877
## Test set      8.251513e+03 17451.547 10055.276 10.533974 26.66568 2.685130
##                   ACF1 Theil's U
## Training set 0.4048039        NA
## Test set     0.3206228 0.9075924

i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable?

December tends to have the highest average sales during the year. This is reasonable as there are most likely more tourists vacationing then.

ii. What does the trend coefficient of model A mean?

The trend coefficient of model A indicates that January has lower average sales.

(c) Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.

souvExpo <- tslm(SalesTrain ~ trend + season, lambda=0)
yrange=range(SouvSalesTS)


plot(c(1995,2000), yrange, type="n", xlab="Year", ylab="Souvenir Sales (in thousands in Australian Dollars)", bty="l", xaxt="n", yaxt="n")
title("Model B")
lines(SalesTrain, bty="l")
axis(1, at=seq(1995,2000,1), labels=format(seq(1995,2000,1)))
axis(2, at=seq(1600,110000,10000), labels=format(seq(16,1100,100)), las=2)
lines(souvExpo$fitted, col="red")
legend("topleft", c("ARPM", "Exponential Trend+Season"), lty=c(1,1), col=c("black","red"), bty="n")

summary(souvExpo)

## 
## Call:
## tslm(formula = SalesTrain ~ trend + season, lambda = 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2e+10 on 12 and 59 DF,  p-value: < 2.2e-16

i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?

This is equivalent to to fitting a model to Sales (in dollars) with an exponential trend.

ii. What does the estimated trend coefficient of model B mean?

The estimated trend coefficient multiplies the time in the exponent.

iii. Use this model to forecast the sales in February 2002.

validPeriodForecasts <- forecast(souvExpo, h=validLength)
Feb <- forecast(souvExpo, h=14)
yrange=range(SouvSalesTS+2)
plot(c(1995, 2003), yrange, type="n", xlab="Year", ylab="Souvenir Sales (in thousands in Australian Dollars)", bty="l", xaxt="n", yaxt="n")
title("Model B - Forecast")
lines(SouvSalesTS, bty="l")
axis(1, at=seq(1995,2003,1),labels=format(seq(1995,2003,1)))
axis(2, at=seq(1600,110000,10000), labels=format(seq(16,1100,100)), las=2)
lines(souvExpo$fitted, col="red")
#lines(validPeriodForecasts$mean, col="yellow", lty=2)
lines(Feb$mean,col="blue",lty=2)
legend(1995, 110000, c("ARPM", "Exponential Trend+Season", "Forecasts"), lty=c(1,1,2), col=c("black","red","blue"), bty="n")

accuracy(validPeriodForecasts, SalesValid)

##                    ME     RMSE      MAE       MPE     MAPE     MASE
## Training set  197.519 2865.154 1671.185 -1.472819 13.94047 0.446268
## Test set     4824.494 7101.444 5191.669 12.359434 15.51910 1.386367
##                   ACF1 Theil's U
## Training set 0.4381370        NA
## Test set     0.4245018 0.4610253

(d) Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

Lower MAPE in Model B indicates that Model B predicts the validation set better than Model A does.

Visually the plot in Model B seemed to be a better fit than in Model A.

(e) How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.

If I needed to look at the differences in the months in greater detail, I would look at the cycles in each individual month and model the months rather than looking at seasonality. I would try to retain more granualarity. I might seek out another type of model that could help look at the components differently.

Assignment 5

Wendy Hayes

4/1/2017

Chapter 6

Question 2-Analysis of Canadian Manufacturing Workers Work-Hours: The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers. The data is available in CanadianWorkHours.xls.

Which one model of the following regression-based models would fit the series best?

Question 4-Forecasting Department Store Sales: The time series plot shown in Figure 6.12 describes actual quarterly sales for a department store over a 6-year period. The data is available in DepartmentStoreSales.xls.

(a) The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?

Take a logarithm of sales.

(b) Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).

(c) A partial output is shown in Table 6.7 (plotted above). From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?

After adjusting for trend, Q2 average sales are higher than the average Q1 sales.

(d) Use this model to forecast sales in quarters 21 and 22.

(e) The plots shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from this regression model.

i. Recreate these plots.

ii. Based on these plots, what can you say about your forecasts for quarters Q21 and Q22? Are they likely to overforecast, under-forecast, or be reasonably close to the real sales values?

The forecasts are likely to be under-forecasted as the residuals are going more positive, especially in the later quarters.

(f) Looking at the residual plot, which of the following statements appear true?

• Seasonality is not captured well. - False as there does not seem to be a seasonal pattern to the residuals.

• The regression model fits the data well. - False as the residuals don’t seem to be well distributed around zero and there seems like there is a pattern to the residuals.

• The trend in the data is not captured well by the model.-True as there seems to be a pattern to the residuals indicating that the trend isn’t captured well.

(g) Which of the following solutions is adequate and a parsimonious solution for improving model fit?

• Fit a quadratic trend model to the residuals (with Quarter and Quarter2.) - no

• Fit a quadratic trend model to Sales (with Quarter and Quarter2.)-Yes, this is the best solution for improving the model fit.

Question 5 - Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.

(a) Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?

The exponential model seems to be the better of the two because the seasonal peaks are more constant and the trend seems more apparent. Therefore there would be 13 predictors: t, t^2, and 11 months.

(b) Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.

i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable?

December tends to have the highest average sales during the year. This is reasonable as there are most likely more tourists vacationing then.

ii. What does the trend coefficient of model A mean?

The trend coefficient of model A indicates that January has lower average sales.

(c) Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.

i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?

This is equivalent to to fitting a model to Sales (in dollars) with an exponential trend.

ii. What does the estimated trend coefficient of model B mean?

The estimated trend coefficient multiplies the time in the exponent.

iii. Use this model to forecast the sales in February 2002.

(d) Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

Lower MAPE in Model B indicates that Model B predicts the validation set better than Model A does.

Visually the plot in Model B seemed to be a better fit than in Model A.

(e) How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.