# Loading necessary libraries:
library(forecast)
library(tidytext)
library(stringr)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(pander)
For this assignment I will be completing the assigned problems (Problems 2, 4, and 5) from Chapter 6, Regression-Based Models: Capturing Trend and Seasonality, in the Practical Time Series Forecasting with R textbook by Galit Shmueli and Kenneth C. Lichtendahl Jr.
First, I will import the data set and look at the structure, head, and tail of the data.
CanadianHours <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")  # UTF-8-BOM prevents a mangled first column name
str(CanadianHours)
## 'data.frame': 35 obs. of 2 variables:
## $ Year : int 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
## $ Hours.per.week: num 37.2 37 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 ...
head(CanadianHours)
## Year Hours.per.week
## 1 1966 37.2
## 2 1967 37.0
## 3 1968 37.4
## 4 1969 37.5
## 5 1970 37.7
## 6 1971 37.7
tail(CanadianHours)
## Year Hours.per.week
## 30 1995 35.7
## 31 1996 35.7
## 32 1997 35.5
## 33 1998 35.6
## 34 1999 36.3
## 35 2000 36.5
Then I will create a time series object from the data set.
CWH_ts <- ts(CanadianHours$Hours.per.week, start=c(1966), frequency=1)
autoplot(CWH_ts)
Linear trend model:
CWH_linear <- tslm(CWH_ts ~ trend)
summary(CWH_linear)
##
## Call:
## tslm(formula = CWH_ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
The adjusted R\(^2\) value for the linear trend model is 0.5996.
Linear trend model with seasonality:
Seasonality cannot be modeled using annual (or yearly) data.
Quadratic trend model:
CWH_quad <- tslm(CWH_ts ~ trend + I(trend^2))
summary(CWH_quad)
##
## Call:
## tslm(formula = CWH_ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
The adjusted R\(^2\) value for the quadratic trend model is 0.7465.
Quadratic trend model with seasonality:
Seasonality cannot be modeled using annual (or yearly) data.
Since the quadratic trend model has the higher adjusted R\(^2\) value of 0.7465, this regression-based model explains the most variation and therefore fits the series best out of these two options.
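As a supplementary side-by-side (a sketch, not part of the original output), the forecast package's CV() helper reports information criteria and adjusted R\(^2\) for tslm objects and can confirm this comparison:
# Supplementary sketch: information criteria and adjusted R^2 for both models.
rbind(Linear = CV(CWH_linear), Quadratic = CV(CWH_quad))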
yrange = range(CWH_ts)
plot(c(1966,2000), yrange, type="n", xlab="Year", ylab="Average Annual Number of Weekly Hours Spent by Canadian Manufacturing Workers", bty="l", xaxt="n")
lines(CWH_ts, bty="l")
axis(1, at=seq(1966, 2000, 1), labels=format(seq(1966, 2000, 1)))
axis(2, at=seq(34.5, 38.0, 0.5), labels= format(seq(34.5, 38.0, 0.5)), las=2)
lines(CWH_linear$fitted, col="red")
lines(CWH_quad$fitted, col="blue")
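A legend would make the chart easier to read (a hypothetical addition, not in the original output):
legend("topright", c("Actual", "Linear trend", "Quadratic trend"), col=c("black", "red", "blue"), lty=1, bty="n")  # distinguish the two fitted lines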
First, I will import the data set and look at the structure, head, and tail of the data.
Dept_Sales <- read.csv("DeptStoreSales.csv", stringsAsFactors = FALSE)
str(Dept_Sales)
## 'data.frame': 24 obs. of 2 variables:
## $ Quarter: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Sales : int 50147 49325 57048 76781 48617 50898 58517 77691 50862 53028 ...
head(Dept_Sales)
## Quarter Sales
## 1 1 50147
## 2 2 49325
## 3 3 57048
## 4 4 76781
## 5 5 48617
## 6 6 50898
tail(Dept_Sales)
## Quarter Sales
## 19 19 71486
## 20 20 92183
## 21 21 60800
## 22 22 64900
## 23 23 76997
## 24 24 103337
Then I will create a time series object from the data set.
Dept_ts <- ts(Dept_Sales$Sales, start=1, frequency=4)
plot(Dept_ts, ylim = c(0,120000), ylab = "Quarterly Sales", xlab = "Year", main = "Time Series of Department Store Sales", bty = "l")
autoplot(Dept_ts)
Take a logarithm of the Quarter index
No, you will want to take the logarithm of the y variable, in this case the quarterly sales numbers.
Take a logarithm of sales
Yes, the logarithm of the quarterly sales values should be taken in order to fit an exponential trend model (see the sketch after these options).
Take an exponent of sales
No, you will want to take the logarithm of the y variable, in this case the quarterly sales numbers.
Take an exponent of Quarter index
No, you will want to take the logarithm of the y variable, in this case the quarterly sales numbers.
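As a supplementary sketch of that equivalence (not part of the assignment output), the log transform can be applied explicitly or via the lambda = 0 (Box-Cox) argument of tslm(); both fit a linear model to log(sales) and produce the same coefficients:
# Supplementary sketch: two equivalent ways to set up the exponential trend model.
fit_log    <- tslm(log(Dept_ts) ~ trend + season)          # explicit log transform
fit_lambda <- tslm(Dept_ts ~ trend + season, lambda = 0)   # Box-Cox with lambda = 0
all.equal(unname(coef(fit_log)), unname(coef(fit_lambda))) # coefficients match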
First, I will partition the data so that the last 4 quarters form the validation period and the first 20 quarters form the training period.
validLen <- 4                                           # hold out the last 4 quarters
trainLen <- length(Dept_ts) - validLen                  # 20 quarters for training
DeptTrain <- window(Dept_ts, end=c(1, trainLen))        # quarters 1-20
DeptValid <- window(Dept_ts, start=c(1, trainLen + 1))  # quarters 21-24
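A quick sanity check on the partition (supplementary):
length(DeptTrain)  # 20 quarters in the training period
length(DeptValid)  # 4 quarters in the validation period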
Once the validation and training periods are created, we can fit the exponential trend with seasonality to the training set; viewing the summary afterwards allows us to better understand the fit of the model. Setting lambda = 0 takes the logarithm of the sales data (the dependent variable) and also translates the fitted values back into the original scale.
Dept_Expo <- tslm(DeptTrain ~ trend + season, lambda = 0)
summary(Dept_Expo)
##
## Call:
## tslm(formula = DeptTrain ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
From the \(summary()\) we can see that all the quarters are statistically significant except for \(season2\). In other words, the second quarter of the year is not statistically significantly different from the first quarter of the year for the five years of training data we are looking at.
Q2 average sales are not statistically significantly different from Q1 when looking at the p-values. This means they are approximately equal to average Q1 sales, at least statistically speaking. Looking at the actual values for Q1 and Q2, we can see that Q2 sales differ from Q1 sales, but not by large amounts.
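A supplementary sketch to check this against the data, comparing average sales by quarter across the full series:
# Mean sales per quarter; the Q1 and Q2 averages should be close,
# consistent with the insignificant season2 dummy.
tapply(Dept_Sales$Sales, cycle(Dept_ts), mean)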
validPeriodForecasts <- forecast(Dept_Expo, h=validLen)
validPeriodForecasts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
## 6 Q3 70920.09 67297.09 74738.13 65247.24 77086.15
## 6 Q4 93788.66 88997.41 98837.86 86286.57 101943.01
Using the model developed for this question, we can see that the point-forecasts for Q21 and Q22 (which are the forecasts for Q1 and Q2 here) are $58,793.71 and $60,951.51, respectively.
yrange = range(Dept_ts)
plot(c(1,7), yrange, type="n", xlab="Quarter", ylab="Sales (in thousands)", bty="l", xaxt="n", yaxt="n")
lines(Dept_ts, bty="l")
axis(1, at=seq(1, 7, 1), labels=format(seq(1,7, 1)), las=2)
axis(2, at=seq(0, 110000, 25000), labels=format(seq(0, 110, 25)), las=2)
abline(v=6)
arrows(1,100000,6,100000,code=3,length = 0.1)
text(3, 110000, "Training")
abline(v=7)
arrows(6, 100000, 7, 100000, code=3, length = 0.1)
text(6.5, 110000, "Validation")
lines(Dept_Expo$fitted, col="red")
lines(validPeriodForecasts$mean, col="blue", lty=2)
legend(1,100000, c("Dept Sales", "Exponential Trend + Season", "Forecasts"), lty=c(1,1,2), col=c("black", "red", "blue"), bty="n")
From Figure 4, we can see that the point forecasts are under-forecasting. Next we can look at how accurately our point forecasts predict the validation period; in other words, how well our model performs according to the \(accuracy()\) function.
accuracy(validPeriodForecasts, DeptValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 30.63141 1759.359 1388.794 -0.04013518 2.215018 0.4456262
## Test set 5395.00854 6076.912 5395.009 6.62904618 6.629046 1.7311114
## ACF1 Theil's U
## Training set 0.5540818 NA
## Test set 0.2156416 0.4259668
We can see that the approximately 6.63% MAPE is not as accurate as the double-differencing MAPE of 4.06% or the Holt-Winters' MAPE of 1.67% from Assignment 4.
Re-creating the residuals plot from page 139 in the textbook.
plot(Dept_Expo$residuals, type="o", bty="l", xlab="Year", ylab="Residuals")
To get the residuals back into the original scale (sales dollars), we subtract the fitted values of the exponential trend model from the training set. This displays the same chart visually, but the y-axis is now back in the original scale: sales dollars.
plot(DeptTrain - Dept_Expo$fitted, type="o", bty="l", xlab="Year", ylab="Residuals")
abline(h=0)
checkresiduals(Dept_Expo)
##
## Breusch-Godfrey test for serial correlation of order up to 8
##
## data: Residuals from Linear regression model
## LM test = 10.517, df = 8, p-value = 0.2306
DeptValid - validPeriodForecasts$mean
## Qtr1 Qtr2 Qtr3 Qtr4
## 6 2006.290 3948.490 6076.915 9548.339
plot(DeptValid - validPeriodForecasts$mean, type="o", bty="l")
As mentioned, Figure 4 shows us that the point forecasts (validation period) for all 4 quarters being forecast are under-forecasts. However, when looking at the residuals for the training period, it looks like we are over-forecasting more than under-forecasting; we can see this because there are more negative residuals (11) than positive residuals (9).
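A quick way to verify those residual-sign counts (a supplementary sketch):
# Tally the signs of the training residuals; the counts are the same on the
# log scale and the original scale since the log transform is monotone.
table(sign(residuals(Dept_Expo)))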
Seasonality is not captured well
False. The residuals do not show seasonal patterns, which would indicate that the seasonality that exists is being captured in the model.
The regression model fits the data well
False. Referencing the \(R^2=1\) would indicate that the regression model fits the data perfectly. However, after looking at the point forecasts against the validation period, as well as the residuals and performance metrics (MAPE), it becomes clearer that the regression model does not fit the data perfectly. Additionally, the residuals are not evenly distributed around zero.
The trend in the data is not captured well by the model
True. The residuals seem to have a trend that is not captured in the model.
Fit a quadratic trend model to the residuals (with Quarter and Quarter\(^2\))
No; ways of improving the model fit should be applied to the dependent variable, in this case Sales.
Fit a quadratic trend model to Sales (with Quarter and Quarter\(^2\))
Yes, another trend model should be applied to Sales (the dependent variable) to see if capturing additional underlying trend produces a more accurate forecast model, as sketched below.
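As a sketch of what that follow-up model might look like (not fit in the original assignment), the quadratic term can simply be added to the formula:
# Supplementary sketch: exponential trend model with an added quadratic term.
Dept_Quad <- tslm(DeptTrain ~ trend + I(trend^2) + season, lambda = 0)
summary(Dept_Quad)  # check whether I(trend^2) is statistically significant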
Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.
First, I will import the data set and look at the structure, head, and tail of the data.
Souv_Sales <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)
str(Souv_Sales)
## 'data.frame': 84 obs. of 2 variables:
## $ Date : chr "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
## $ Sales: num 1665 2398 2841 3547 3753 ...
head(Souv_Sales)
## Date Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
tail(Souv_Sales)
## Date Sales
## 79 1-Jul 26155.15
## 80 1-Aug 28586.52
## 81 1-Sep 30505.41
## 82 1-Oct 30821.33
## 83 1-Nov 46634.38
## 84 1-Dec 104660.67
Then I will create a time series object from the data set.
Souv_ts <- ts(Souv_Sales$Sales, start=c(1995,1), frequency=12)
validLen_SS <- 12
TrainLen_SS <- length(Souv_ts) - validLen_SS
Souv_Train <- window(Souv_ts, start=c(1995, 1), end=c(1995, TrainLen_SS))
Souv_Valid <- window(Souv_ts, start=c(1995, TrainLen_SS + 1))
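A quick check that the partition lands where intended (supplementary):
end(Souv_Train)    # c(2000, 12): training ends December 2000
start(Souv_Valid)  # c(2001, 1): validation covers calendar year 2001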
yrange=range(Souv_ts)
Based on Figure 6.14, trend and seasonality should be accounted for in the model. It is hard to say for sure if the series has a quadratic trend or not. If we include the quadratic trend predictor (\(t^2\)) as well as a trend (\(t\)) predictor and the 11 predictors for monthly seasonality (12-1), there will be 13 predictors in the model. However, without the quadratic trend predictor (\(t^2\)) there will be 12 predictors.
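As a quick check of those predictor counts (a supplementary sketch, not part of the assignment output), the design-matrix dimensions confirm the arithmetic:
# trend + season: intercept + trend + 11 monthly dummies = 13 columns (12 predictors).
dim(model.matrix(tslm(Souv_Train ~ trend + season)))
# Adding I(trend^2) gives one more column (13 predictors).
dim(model.matrix(tslm(Souv_Train ~ trend + I(trend^2) + season)))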
Model_A <- tslm(Souv_Train ~ trend + season)
summary(Model_A)
##
## Call:
## tslm(formula = Souv_Train ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
For Model A, the month with the highest sales is \(season12\), which makes sense because that represents December and Christmas falls in that month. Additionally, summer begins in December in Australia, so a souvenir shop in a beach resort town would see its highest sales during this time period.
The trend coefficient of Model A is statistically significant (according to its p-value), which means there is a linear trend in the series. The linear trend coefficient of $245.36 represents the slope of the linear trend, meaning that sales increase by $245.36 on average for each successive time period over the series, holding all other variables constant.
Model_B <- tslm(log(Souv_Train) ~ trend + season)
summary(Model_B)
##
## Call:
## tslm(formula = log(Souv_Train) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales with an exponential trend.
The estimated trend coefficient of Model B represents the slope of the trend in terms of percentage increase; in this case it means that souvenir sales increase by about 2.1% per month, on average.
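A quick supplementary check of that interpretation (a sketch): the exact multiplicative effect of a log-scale coefficient is exp(beta) - 1.
exp(coef(Model_B)["trend"]) - 1  # roughly 0.0213, i.e. about a 2.1% monthly increase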
The souvenir sales data set has 84 observations (monthly data points) covering January 1995 to December 2001, so to forecast February 2002 we will need to forecast observation number 86. This task asks us to forecast beyond the validation period. Instead of recombining the training set and validation set and re-fitting the model, I will follow this assignment's class notes and use the better of the two models from the training period to forecast both January and February of 2002, which is the same as forecasting time periods 85 and 86.
Jan_Forecast <- Model_B$coefficients["(Intercept)"] + Model_B$coefficients["trend"]*85
Feb_Forecast <- Model_B$coefficients["(Intercept)"] + Model_B$coefficients["trend"]*86 + Model_B$coefficients["season2"]
exp(Jan_Forecast)
## (Intercept)
## 12601.02
exp(Feb_Forecast)
## (Intercept)
## 17062.99
The forecasts for January and February 2002 sales are $12,601.02 and $17,062.99, respectively.
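An equivalent route (a supplementary sketch, using the same fitted Model B): let forecast() extrapolate 14 periods past the end of the training set, so the last two point forecasts land on January and February 2002, then back-transform from logs.
# 12 validation months of 2001 plus Jan/Feb 2002.
fc_extended <- forecast(Model_B, h = validLen_SS + 2)
exp(tail(fc_extended$mean, 2))  # should match the hand-computed forecasts above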
Model_A_Forecast <- forecast(Model_A, h=validLen_SS)
accuracy(Model_A_Forecast$mean, Souv_Valid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
The MAPE for Model A is approximately 26.67%.
Model_B_Forecast <- forecast(Model_B, h=validLen_SS)
accuracy(exp(Model_B_Forecast$mean), Souv_Valid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
The MAPE for Model B is approximately 15.52%.
plot(c(1995,2002), c(0,120000), type="n", xlab="Year", ylab="Souvenir Sales", bty="l", xaxt="n", yaxt="n")
lines(Souv_ts, bty="l", lwd = 2)
axis(1, at=seq(1995,2002,2), labels=format(seq(1995,2002,2)))
axis(2, at=seq(0,120000, 20000), labels=format(seq(0,120000,20000)), las=2)
abline(v=2001)
arrows(1995, 120000, 2001, 120000, code=3, length=0.1)
text(1998, 115000, "Training")
abline(v=2002)
arrows(2001, 120000, 2002, 120000, code=3, length=0.1)
text(2001.5, 115000, "Validation")
lines(Model_A_Forecast$fitted, col="blue")
lines(Model_A_Forecast$mean, col="blue")
lines(exp(Model_B_Forecast$fitted),col="red")
lines(exp(Model_B_Forecast$mean), col = "red")
legend(1995, 110000, c("Souvenir Sales", "Model A", "Model B"), col=c("black", "blue", "red"), bty="n", lwd=c(2,1,1))
A few reasons why Model B is preferable for forecasting include:
The accuracy of Model B in predicting the validation set is preferable according to the MAPE value: 15.52% compared to Model A's MAPE of 26.67%.
Figure 7 shows us that, visually, Model B has better predictive power than Model A. Similarly, when we were first deciding on how many predictors should go into the model, the plot was not clear on whether a linear trend or an exponential trend was present; after this analysis we can see that Model B (where an exponential trend exists) is the better-suited model.
If the goal were to understand the different components of sales in the souvenir shop between 1995 and 2001, then forecasting would not be necessary. Therefore, all the data would be better used on a combined basis (not broken into validation and training sets). I would look at the combined data on a monthly basis to see how sales changed over months and years. If more data were available for different variables, like day of the week, or for different categories within the store, like product type (i.e., sunglasses vs. keychains), I would be interested in looking at sales levels across those categories.