Chapter 6 Problem 2

The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers. Which of the following regression-based models would fit the series best?

Linear trend model

Linear trend model with seasonality

Quadratic trend model

Quadratic trend model with seasonality

First we bring in the data and look at the head and tail.

library(forecast)
## Warning: package 'forecast' was built under R version 3.4.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.3
hours <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
str(hours)
## 'data.frame':    35 obs. of  2 variables:
##  $ Year        : int  1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
##  $ HoursPerWeek: num  37.2 37 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 ...
head(hours)
##   Year HoursPerWeek
## 1 1966         37.2
## 2 1967         37.0
## 3 1968         37.4
## 4 1969         37.5
## 5 1970         37.7
## 6 1971         37.7
tail(hours)
##    Year HoursPerWeek
## 30 1995         35.7
## 31 1996         35.7
## 32 1997         35.5
## 33 1998         35.6
## 34 1999         36.3
## 35 2000         36.5

Next we create a time series and plot it.

# Use the full column name rather than relying on partial matching of `$`
hoursTS <- ts(hours$HoursPerWeek, start=c(1966, 1), frequency=1)

autoplot(hoursTS)

We will now fit the regression-based models and compare their adjusted R-squared values to determine which is the best fit for this series. Looking at the plot, there seems to be a negative trend, and there is no seasonality because the data are annual averages. First we will look at the linear trend.

hoursLinear <- tslm(hoursTS ~ trend)
summary(hoursLinear)
## 
## Call:
## tslm(formula = hoursTS ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20457 -0.28761  0.04779  0.30210  1.23190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.416134   0.175818 212.811  < 2e-16 ***
## trend       -0.061373   0.008518  -7.205 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared:  0.6113, Adjusted R-squared:  0.5996 
## F-statistic: 51.91 on 1 and 33 DF,  p-value: 2.928e-08

The data are annual (frequency = 1), so there is no seasonal component to estimate and the linear trend model with seasonality does not apply. Next we look at the quadratic trend.

hoursQuad <- tslm(hoursTS ~ trend + I(trend^2))
summary(hoursQuad)
## 
## Call:
## tslm(formula = hoursTS ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.1644156  0.2176599 175.340  < 2e-16 ***
## trend       -0.1827154  0.0278786  -6.554 2.21e-07 ***
## I(trend^2)   0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10

Due to the lack of seasonality, the quadratic trend model with seasonality does not apply to this time series either. Comparing the adjusted R-squared values, the quadratic model (0.7465) is a better fit for this series than the linear model (0.5996).
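As a quick check on this comparison, adjusted R-squared can be recomputed from the multiple R-squared, the number of observations, and the number of predictors. The sketch below uses the values reported in the two summaries; the helper name adj_r2 is our own and is not part of any package.

```r
# Hypothetical helper: adjusted R-squared from R-squared, n observations, p predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 35  # annual observations, 1966 to 2000

adj_linear <- adj_r2(0.6113, n, p = 1)  # trend only
adj_quad   <- adj_r2(0.7614, n, p = 2)  # trend and trend^2

round(c(linear = adj_linear, quadratic = adj_quad), 3)
```

Both values should agree with the summaries up to rounding of the reported R-squared. The penalty for the extra term is small here, so the quadratic model keeps its advantage.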

Chapter 6 Problem 4

The time series plot shown in Figure 6.12 describes actual quarterly sales for a department store over a 6-year period.

(a) The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed?

* Take a logarithm of the Quarter index: No

* Take a logarithm of Sales: Yes

* Take an exponent of Sales: No

* Take an exponent of the Quarter index: No
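The reason only the logarithm of sales is needed can be seen on simulated data: if y = a * exp(b * t), then log(y) = log(a) + b * t, which is linear in t. The sketch below assumes illustrative values for a and b; they are not taken from the department store data.

```r
# Simulated series with a known exponential trend (values are illustrative only)
quarter <- 1:24
a <- 50000   # assumed starting level
b <- 0.01    # assumed growth rate per quarter
y <- a * exp(b * quarter)

# Regressing log(y) on the quarter index recovers slope b and intercept log(a)
fit <- lm(log(y) ~ quarter)
coef(fit)
```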

(b) Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period.

We will bring in the data and create a time series, then plot it.

dsales <- read.csv("DeptStoreSales.csv", stringsAsFactors = FALSE)

dsalesTS <- ts(dsales$Sales, frequency=4)

autoplot(dsalesTS)

We will split the time series into the training period of 20 quarters and the remaining 4 as the validation period. We then fit the regression model with exponential trend and seasonality.

validLength <- 4
trainLength <- length(dsalesTS) - validLength

dsalesTrain <- window(dsalesTS, end=c(1, trainLength))
dsalesValid <- window(dsalesTS, start=c(1,trainLength+1))

dsalesExpo <- tslm(dsalesTrain ~ trend + season, lambda=0)
summary(dsalesExpo)
## 
## Call:
## tslm(formula = dsalesTrain ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16

(c) A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?

The Q2 coefficient (season2 = 0.025) is not statistically significant (p-value = 0.248). After adjusting for trend, average Q2 sales are therefore approximately equal to average Q1 sales, which is the baseline season.
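To put the Q2 coefficient on a more readable scale, it can be converted to a percentage difference from Q1, together with a 95% confidence interval. This is a rough sketch that copies the estimate, standard error, and residual degrees of freedom from the season2 row of the summary above.

```r
# Values copied from the season2 row of the summary above
est <- 0.024956
se  <- 0.020764
df  <- 15

# 95% confidence interval on the log scale
ci_log <- est + c(-1, 1) * qt(0.975, df) * se

# Approximate percentage difference of Q2 relative to Q1, with its interval
pct    <- 100 * (exp(est) - 1)
ci_pct <- 100 * (exp(ci_log) - 1)

round(c(estimate = pct, lower = ci_pct[1], upper = ci_pct[2]), 2)
```

The interval covers zero, which is consistent with treating Q2 and Q1 as approximately equal.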

(d) Use this model to forecast sales in quarters 21 and 22.

validPeriodForecasts <- forecast(dsalesExpo, h=2)
validPeriodForecasts
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04 66250.87
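The point forecasts can be reproduced by hand from the coefficient table, which makes the role of each term visible. The sketch below copies the rounded coefficients from the summary of dsalesExpo, so it should match the forecast() output only up to rounding.

```r
# Coefficients copied from the summary of dsalesExpo (log scale)
b0      <- 10.748945
b_trend <- 0.011088
b_q2    <- 0.024956

# Quarter 21 is a Q1 (the baseline season); quarter 22 is a Q2
f21 <- exp(b0 + b_trend * 21)
f22 <- exp(b0 + b_trend * 22 + b_q2)

round(c(Q21 = f21, Q22 = f22), 1)
```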

(e) The plots shown in Figure 6.13 describe the fit and forecast errors from this regression model.

i. Recreate these plots

# Helps set up the plot
yrange = range(dsalesTS)

# Set up the plot
plot(c(1, 7), yrange, type="n", xlab="Quarter",  ylab="Sales")

# Add the time series dsales
lines(dsalesTS, bty="l")


# Add fitted line from training period
lines(dsalesExpo$fitted, col="red")

# Add forecasts for validation period
lines(validPeriodForecasts$mean, col="blue", lty=2)

plot(dsalesTrain - dsalesExpo$fitted, type="o", bty="l")

ii. Based on these plots, what can you say about your forecasts for quarters Q21 and Q22? Are they likely to over-forecast, under-forecast, or be reasonably close to the real forecast values?

Looking at the forecast and the residuals, the model is likely to under-forecast quarters 21 and 22. The first plot shows the forecast lying below the actual values, and the residuals are well above 0.

(f) Looking at the residual plot, which of the following statements appear true?

* Seasonality is not captured well: No, there is no pattern in the residuals to indicate that the seasonality is not being captured.

* The regression model fits the data well: No, there seems to be a trend in the residuals.

* The trend in the data is not captured well by the model: Yes, since there seems to be a trend in the residuals.

(g) Which of the following solutions is an adequate and parsimonious solution for improving model fit?

Fit a quadratic trend model to the residuals: No, this would not help the original model.

Fit a quadratic model to Sales: Yes, the quadratic model would help the fit of the overall model.

Chapter 6 Problem 5

(a) Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?

Based on the first graph, we can see there is seasonality, and it looks to be multiplicative since the peaks appear to grow over time. The second plot indicates there is a trend as well. The model therefore needs two kinds of predictors, trend and month, which enter the regression as 12 predictor variables in total: one trend term plus 11 monthly dummy variables.
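One way to count the predictor columns that a linear trend plus a monthly factor produce is to build the design matrix directly. The sketch below uses a made-up index for 84 months; the variable names are ours and no sales data are involved.

```r
# Illustrative design matrix for 84 months with a linear trend and a month factor
n_months <- 84
trend    <- seq_len(n_months)
month    <- factor(rep(1:12, length.out = n_months))

X <- model.matrix(~ trend + month)

# Drop the intercept column to count the predictors
n_predictors <- ncol(X) - 1
n_predictors
```

One month acts as the baseline and is absorbed into the intercept, which is why 12 months yield only 11 dummy columns.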

(b) Run a regression model with Sales as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.

#Bring in the data
ssales <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)

#Create time series
ssalesTS <- ts(ssales$Sales, start = c(1995,1), frequency=12)

#Plot
autoplot(ssalesTS)

Split into training and validation periods, then run the regression model.

#Split into training and validation periods
validLength <- 12
trainLength <- length(ssalesTS) - validLength

ssalesTrain <- window(ssalesTS, end=c(1995, trainLength))
ssalesValid <- window(ssalesTS, start=c(1995,trainLength+1))

# Fit the model to the training set
modelA <- tslm(ssalesTrain ~ trend + season)

# See the estimated regression equation
summary(modelA)
## 
## Call:
## tslm(formula = ssalesTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable?

From the coefficients we see that season 12 (December) has the highest average sales. This is reasonable for a souvenir shop because of holiday shopping.

ii. What does the trend coefficient of model A mean?

The trend coefficient (245.36) means that, holding the month effect fixed, sales increase on average by this amount with each successive month.

(c) Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Call this model B.

# Call it modelB and have the dependent variable transformed
modelB <- tslm(log(ssalesTrain) ~ trend + season)
summary(modelB)
## 
## Call:
## tslm(formula = log(ssalesTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales with what type of trend?

It is equivalent to fitting a model to Sales with an exponential trend (the same as setting lambda = 0 in the tslm function).

ii. What does the estimated trend coefficient of model B mean?

This means that, on average, sales increased by about 2.1% per month (exp(0.02112) - 1 ≈ 0.0213).
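The percentage interpretation follows from exponentiating the log-scale coefficient. A small sketch, using the rounded trend coefficient from summary(modelB):

```r
b_trend <- 0.021120  # trend coefficient from summary(modelB)

monthly_growth <- exp(b_trend) - 1        # multiplicative growth per month
annual_growth  <- exp(12 * b_trend) - 1   # the same rate compounded over 12 months

round(100 * c(monthly = monthly_growth, annual = annual_growth), 2)
```

For small coefficients the raw value is already close to the monthly rate, but the compounded annual figure is noticeably larger than 12 times the coefficient.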

iii. Use this model to forecast sales in February 2002.

February 2002 is time period 86 (counting January 1995 as period 1), so we will use modelB to forecast this month. Since the model predicts log(Sales), we need to convert the result back to Sales after forecasting.

febForecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86 + modelB$coefficients["season2"]
exp(febForecast)
## (Intercept) 
##    17062.99
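The period index and the forecast can be checked without the fitted model object. The sketch below assumes January 1995 is period 1 and copies the rounded coefficients from summary(modelB); the helper month_index is a name we made up for illustration.

```r
# Hypothetical helper: month index counted from January of start_year (index 1)
month_index <- function(year, month, start_year = 1995) {
  (year - start_year) * 12 + month
}

idx <- month_index(2002, 2)

# Rounded coefficients copied from summary(modelB)
b0      <- 7.646363
b_trend <- 0.021120
b_feb   <- 0.282015

# Back-transform from log(Sales) to Sales
feb_2002 <- exp(b0 + b_trend * idx + b_feb)
c(index = idx, forecast = round(feb_2002))
```

Because the coefficients are rounded, the result should match the value above only to within about a unit of sales.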

(d) Compare the two regression models (A and B) in terms of forecasting performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

We compare each model's forecasts against the validation period, starting with model A.

modelAForecast <- forecast(modelA, h=validLength)

accuracy(modelAForecast, ssalesValid)
##                         ME      RMSE       MAE       MPE     MAPE     MASE
## Training set -5.684342e-14  5365.199  3205.089  6.967778 36.75088 0.855877
## Test set      8.251513e+03 17451.547 10055.276 10.533974 26.66568 2.685130
##                   ACF1 Theil's U
## Training set 0.4048039        NA
## Test set     0.3206228 0.9075924

Next Model B

modelBForecast <- forecast(modelB, h=validLength)

# We have to transform the forecasts back to the original scale.
accuracy(exp(modelBForecast$mean), ssalesValid)
##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253

Looking at the accuracy outputs, model B is preferable. On the validation period it has a lower RMSE (7101 versus 17452 for model A) and a lower MAPE (15.5% versus 26.7%).
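For reference, the two measures used in this comparison can be written out in a few lines of base R. This is a sketch with made-up numbers, not the souvenir data; the function names rmse and mape are ours. Both measures are computed on the original sales scale, which is why the model B forecasts are passed through exp() before being compared.

```r
# Root mean squared error and mean absolute percentage error
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mape <- function(actual, pred) 100 * mean(abs((actual - pred) / actual))

# Made-up numbers, for illustration only
actual <- c(100, 200, 400)
pred   <- c(110, 180, 400)

c(RMSE = rmse(actual, pred), MAPE = mape(actual, pred))
```

RMSE is in the units of the series and penalizes large misses more heavily, while MAPE is scale-free, so a model that wins on both is a fairly safe choice.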

(e) How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.