Problems: 2, 4, 5, Pgs. 135-141

2. Which of the following regression-based models would fit the series best?

. Linear trend model

. Linear trend model with seasonality

. Quadratic trend model

. Quadratic trend model with seasonality

setwd("C:/Users/larms.LA-INSP5559/Documents/R/win-library/3.3/17_0401_assignment5")
cwhours<- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
str(cwhours)
## 'data.frame':    35 obs. of  2 variables:
##  $ Year          : int  1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
##  $ Hours_per_week: num  37.2 37 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 ...
head(cwhours)
##   Year Hours_per_week
## 1 1966           37.2
## 2 1967           37.0
## 3 1968           37.4
## 4 1969           37.5
## 5 1970           37.7
## 6 1971           37.7
library(forecast)

#create a time series
cworkhoursTS<- ts(cwhours$Hours_per_week, start = c(1966), frequency = 1)
#linear trend
cwhoursLM<- tslm(cworkhoursTS ~ trend)

#quadratic trend
cwhoursQM<- tslm(cworkhoursTS ~ trend + I(trend^2))

plot(cworkhoursTS, xlab = "Years", ylab = "Average Hours per Week", ylim = c(34.5, 38), bty = "l")
lines(cwhoursLM$fitted, col = "red", lwd = 2)
lines(cwhoursQM$fitted, col = "blue", lwd = 2)
legend(1990, 38, c("Linear Trend", "Quad. Trend", "Actual"), lty = c(1,1,1), col = c("red", "blue", "black"), lwd = c(2,2,1), bty = "n")

summary(cwhoursLM)
## 
## Call:
## tslm(formula = cworkhoursTS ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20457 -0.28761  0.04779  0.30210  1.23190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.416134   0.175818 212.811  < 2e-16 ***
## trend       -0.061373   0.008518  -7.205 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared:  0.6113, Adjusted R-squared:  0.5996 
## F-statistic: 51.91 on 1 and 33 DF,  p-value: 2.928e-08
summary(cwhoursQM)
## 
## Call:
## tslm(formula = cworkhoursTS ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.1644156  0.2176599 175.340  < 2e-16 ***
## trend       -0.1827154  0.0278786  -6.554 2.21e-07 ***
## I(trend^2)   0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10

To determine the "suitability" of a trend, the book suggests looking at a time plot of the (deseasonalized) series with the trend overlaid, and then reviewing the residual time plot and performance measures on the validation period.

I did not partition this data into training and validation periods. There is also no seasonality to remove, since the series contains one average per year, so I overlaid only the linear and quadratic trend lines. After reviewing this plot of Canadian work hours, I find that the quadratic trend model fits this time series best. Although the residuals of the two models are fairly close, the quadratic trend tracks the series more closely, with a lower residual standard error (0.40 vs. 0.51) and a higher adjusted R-squared (0.75 vs. 0.60).
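
As a quick numeric check of this conclusion, the sketch below (not part of the original assignment output) plots each model's residuals and uses forecast::CV(), which reports leave-one-out CV, AICc, BIC, and adjusted R-squared for linear-model fits, so fit is weighed against model complexity.

#compare the two trend models: residual plots and fit statistics
par(mfrow = c(2, 1))
plot(residuals(cwhoursLM), ylab = "Residuals", main = "Linear trend residuals", bty = "l")
abline(h = 0)
plot(residuals(cwhoursQM), ylab = "Residuals", main = "Quadratic trend residuals", bty = "l")
abline(h = 0)
par(mfrow = c(1, 1))
CV(cwhoursLM)
CV(cwhoursQM)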

4. Forecasting Department Store Sales: The time series plot shown in Figure 6.12 describes actual quarterly sales for a department store over a 6-year period.

#Read data, create time series, partition the data, & plot.
DeptSales<- read.csv("DeptSales.csv", stringsAsFactors = FALSE)
SalesTS<- ts(DeptSales$Sales, start = c(1), frequency = 4)
validationlen<- 4
traininglen<- length(SalesTS)-validationlen
salestrain<- window(SalesTS,end = c(1,traininglen))
salesvalid<- window(SalesTS, start = c(1,traininglen + 1))
yrange = range(SalesTS)
plot(c(1,7), yrange, type = "n", xlab = "Year", ylab = "Dept. Store Sales ($ Thousands)", bty = "l", xaxt = "n", yaxt = "n")
axis(1,at=seq(1,7,1), labels = format(seq(1,7,1)))
axis(2,at=seq(48000,105000,5000),labels = format(seq(48.0,105.0,5)),las = 2)
lines(SalesTS,bty = "l")

4. (a) The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?

The following operation must be performed:

. Take a logarithm of sales

According to the textbook, to fit an exponential trend, an analyst must replace the output variable y (sales, in this case) with log(y).
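
As a small illustration (not part of the original assignment), both ways of doing this in R give the same trend coefficient: transforming manually with log(), or letting tslm() apply the transform via lambda = 0. The object names below are introduced here just for this sketch.

#exponential trend two ways: manual log transform vs. lambda = 0
expTrendManual<- tslm(log(salestrain) ~ trend)
expTrendLambda<- tslm(salestrain ~ trend, lambda = 0)
coef(expTrendManual)["trend"]
coef(expTrendLambda)["trend"]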

4. (b) Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).

dssaleslmtrse<- tslm(salestrain ~ trend + season, lambda = 0)
yrange = range(SalesTS)
plot(c(1,7), yrange, type = "n", xlab = "Year", ylab = "Dept. Store Sales ($ Thousands)", bty = "l", xaxt = "n", yaxt = "n")
axis(1,at=seq(1,7,1), labels = format(seq(1,7,1)))
axis(2,at=seq(48000,105000,5000),labels = format(seq(48.0,105.0,5)),las = 2)
lines(SalesTS,bty = "l")
lines(dssaleslmtrse$fitted, col = "red", lwd = 1)

4. (c) A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?

summary(dssaleslmtrse)
## 
## Call:
## tslm(formula = salestrain ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16

Per the regression summary, average Q2 sales are approximately equal to average Q1 sales after adjusting for trend: the season2 coefficient is small (0.025 on the log scale) and not statistically significant (p = 0.248).
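
Since the model was fit with lambda = 0 (i.e., on log(Sales)), the seasonal coefficient can also be read as a multiplicative effect on the dollar scale; a one-line check using the model fitted above:

#exp of the season2 coefficient: about 1.025, i.e., Q2 roughly 2.5% above Q1 after adjusting for trend, and not statistically significant
exp(coef(dssaleslmtrse)["season2"])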

4. (d) Use this model to forecast sales in quarters 21 and 22.

Deptsalesforecast<- forecast(dssaleslmtrse, h=2)
Deptsalesforecast
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04 66250.87

4. (e) The plots shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from this regression model.

i. Recreate these plots.

yrange = range(salestrain)
plot(c(1,7), yrange, type = "n", xlab = "Year", ylab = "Dept. Store Sales ($ Thousands)", bty = "l", xaxt = "n", yaxt = "n")
axis(1,at=seq(1,7,1), labels = format(seq(1,7,1)))
axis(2,at=seq(48000,105000,5000),labels = format(seq(48.0,105.0,5)),las = 2)
lines(salestrain, bty =  "l", lwd = 1)
lines(Deptsalesforecast$fitted, col = "red", lwd = 2)

plot(Deptsalesforecast$residuals, xlab = "Year", ylab = "Residuals", bty = "l")
abline(h=0)

ii. Based on these plots, what can you say about your forecasts for quarters Q21 and Q22? Are they likely to over-forecast, under-forecast, or be reasonably close to the real sales values?

Overall, it looks like the forecasts for quarters 21 and 22 will be reasonably close to the real sales values; the residuals in years two and three are very close to zero.
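
A quick numeric look at those residuals (a sketch using the Deptsalesforecast object created above; nothing new is fitted here):

#residuals for years two and three of the training period
window(Deptsalesforecast$residuals, start = c(2, 1), end = c(3, 4))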

4. (f) Looking at the residual plot, which of the following statements appear true?

The following statements appear true:

. The regression model fits the data well.

4. (g) Which of the following solutions is adequate and a parsimonious solution for improving model fit?

. Fit a quadratic trend model to Sales (with Quarter and Quarter^2).

5. Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.

Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

#read data, create time series, & partition data
souvenir<- read.csv("souvenirsales.csv", stringsAsFactors=FALSE)
Souvenir_SalesTS<- ts(souvenir$Sales, start=c(1995,1), frequency=12)
validsouvLength <- 12
trainsouvLength <- length(Souvenir_SalesTS) - validsouvLength
souvenirWTrain <- window(Souvenir_SalesTS, start=c(1995,1), end=c(1995, trainsouvLength))
souvenirWValid <- window(Souvenir_SalesTS, start=c(1995,trainsouvLength+1), end=c(1995,trainsouvLength+validsouvLength))

5. (a) Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?

plot(Souvenir_SalesTS, type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "Sales (Thousands)", main = "Souvenir Sales")
lines(Souvenir_SalesTS, lwd = 2)
axis(1,at=seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
axis(2, at = seq(0,105000,5000),  labels = format(seq(0, 105, 5)), las = 2)

plot(log(Souvenir_SalesTS), type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "log(Sales)", main = "Log Souvenir Sales")
lines(log(Souvenir_SalesTS), lwd = 2)
axis(1,at=seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
axis(2, at = seq(7, 12, 1), labels = format(seq(7, 12, 1)), las = 2)

The predictors that should be included in the model are a linear trend and monthly seasonality (month dummy variables), i.e., two types of predictors in total: one trend term plus 11 month dummies, with January as the baseline.

5. (b) Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.

modela<- tslm(souvenirWTrain ~ trend + season)
forecastA<- forecast(modela, h=2)
accuracymodela<- accuracy(forecastA$mean, souvenirWValid)
plot(Souvenir_SalesTS, type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "Sales (Thousands)", main = "Model A")
lines(souvenirWTrain, lwd = 2)
lines(modela$fitted, col = "red", lwd = 2)
axis(1,at=seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
axis(2, at = seq(0,105000,5000),  labels = format(seq(0, 105, 5)), las = 2)

par(oma = c(0, 0, 0, 2))
xrange<- c(1,7)
yrange<- range(Souvenir_SalesTS)
plot(xrange, yrange, type="n", xlab="Year", ylab="Monthly Souvenir Sales (Thousands)", bty="l", las=1)
colors <- rainbow(12) 
linetype <- c(1:12) 
plotchar <- c(1:12)

axis(1, at = seq(1, 7, 1), labels = format(seq(1, 7, 1)))  # x-axis is the year index (1 = 1995, ..., 7 = 2001)

#plot one line per calendar month across the seven years
for (i in 1:12) { 
  currentMonth <- subset(Souvenir_SalesTS, cycle(Souvenir_SalesTS)==i)
  lines(currentMonth, type="b", lwd=1.5,
      lty=linetype[i], col=colors[i], pch=plotchar[i]) } 

title("Souvenir Sales by Month")

legend(7.5,80000, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)

i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable?

December tends to have the highest average sales during the year: its coefficient (season12) is by far the largest. This is reasonable because December brings the holiday season and, for a beach resort town in Queensland, the summer vacation period.
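
A quick check against the fitted coefficients (a small sketch; seasCoefs is a name introduced here, and January is the baseline month, so it has no dummy):

seasCoefs<- coef(modela)[grep("^season", names(coef(modela)))]
seasCoefs[which.max(seasCoefs)]
#season12 (December) has the largest seasonal coefficient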

ii. What does the trend coefficient of model A mean?

The trend coefficient of model A is the estimated average change in sales per unit of time, here one month. In this case, sales are estimated to increase by about $245 (the trend coefficient, 245.36) each month, holding the month of the year constant.

summary(modela)
## 
## Call:
## tslm(formula = souvenirWTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

5. (c) Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.

modelb<- tslm(log(souvenirWTrain) ~ trend + season)
plot(souvenirWTrain, type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "Sales (Thousands)", main = "Model B", lwd = 1)
lines(souvenirWTrain, lwd = 2)
lines(exp(modelb$fitted.values), col = "red", lwd = 2)
axis(1,at=seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
axis(2, at = seq(0, 105000, 5000), labels = format(seq(0, 105, 5)), las = 2)

i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?

Fitting a linear trend to log(Sales) is equivalent to fitting an exponential trend to Sales (in dollars).

ii. What does the estimated trend coefficient of model B mean?

summary(modelb)
## 
## Call:
## tslm(formula = log(souvenirWTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

The trend coefficient of model B is the estimated change in log(Sales) per unit of time, here one month. Because the model is fit on the log scale, this translates into a multiplicative effect on sales: sales are estimated to increase by about 2.1% per month (exp(0.0211) is roughly 1.021), holding the month of the year constant.
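
The 2.1% figure comes from exponentiating the trend coefficient; a one-line check with the modelb object fitted above:

#multiplicative monthly growth implied by the log-scale trend coefficient
exp(coef(modelb)["trend"])
#about 1.021, i.e., roughly a 2.1% increase per month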

iii. Use this model to forecast the sales in February 2002.

#refit model B with lambda = 0 (equivalent to modeling log(Sales)) so that forecast() returns point forecasts back-transformed to dollars
forecasttslm<- tslm(souvenirWTrain ~ trend + season, lambda = 0)
forecastmodeltslm<- forecast(forecasttslm, h = 2)
forecastmodeltslm
##          Point Forecast     Lo 80    Hi 80    Lo 95    Hi 95
## Jan 2001       9780.022  7459.322 12822.72 6437.463 14858.16
## Feb 2001      13243.095 10100.643 17363.21 8716.947 20119.38
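
Note that the training period ends in December 2000, so h = 2 produces forecasts for January and February 2001 (the start of the validation period), as shown above. To reach February 2002 from the end of the training period, the horizon would have to be extended to h = 14; a minimal sketch is below (forecastFeb2002 is a name introduced here), with the caveat that refitting on the full series would be the more natural choice for a true 2002 forecast.

#extend the horizon: month 14 beyond December 2000 is February 2002
forecastFeb2002<- forecast(forecasttslm, h = 14)
tail(forecastFeb2002$mean, 1)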

5. (d) Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

Model A forecast accuracy & summary:

ModelAaccuracy<- accuracy(forecastA$mean, souvenirWValid[1:2])  # error measures for model A on the first two validation months
summary(modela)
## 
## Call:
## tslm(formula = souvenirWTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

Model B forecast accuracy & summary:

ModelBaccuracy<- accuracy(forecastmodeltslm$mean, souvenirWValid[1:2])  # error measures for model B on the first two validation months
summary(modelb)
## 
## Call:
## tslm(formula = log(souvenirWTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

The MAPE and RMSE on the validation period are higher for model A than for model B, which indicates that the exponential trend model with seasonality (model B) forecasts the souvenir sales data better. In addition, the adjusted R-squared for model B (0.93) is greater than that of model A (0.75). Model B is therefore preferable for forecasting.
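
For reference, the two accuracy objects computed above were assigned but never printed; displaying them shows the RMSE and MAPE values behind this comparison (no new calculations, just printing the existing objects):

ModelAaccuracy
ModelBaccuracy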