Problems: 2, 4, 5, Pgs. 135-141
2. Which of the following regression-based models would fit the series best?
. Linear trend model
. Linear trend model with seasonality
. Quadratic trend model
. Quadratic trend model with seasonality
setwd("C:/Users/larms.LA-INSP5559/Documents/R/win-library/3.3/17_0401_assignment5")
cwhours<- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
str(cwhours)
## 'data.frame': 35 obs. of 2 variables:
## $ Year : int 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
## $ Hours_per_week: num 37.2 37 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 ...
head(cwhours)
## Year Hours_per_week
## 1 1966 37.2
## 2 1967 37.0
## 3 1968 37.4
## 4 1969 37.5
## 5 1970 37.7
## 6 1971 37.7
library(forecast)
#create a time series
cworkhoursTS<- ts(cwhours$Hours_per_week, start = c(1966), frequency = 1)
#linear trend
cwhoursLM<- tslm(cworkhoursTS ~ trend)
#quadratic trend
cwhoursQM<- tslm(cworkhoursTS ~ trend + I(trend^2))
plot(cworkhoursTS, xlab = "Years", ylab = "Average Hours per Week", ylim = c(34.5, 38), bty = "l")
lines(cwhoursLM$fitted, col = "red", lwd = 2)
lines(cwhoursQM$fitted, col = "blue", lwd = 2)
legend(1990, 38, c("Linear Trend", "Quad. Trend", "Actual"), lty = c(1, 1, 1), col = c("red", "blue", "black"), lwd = c(2, 2, 1), bty = "n")
summary(cwhoursLM)
##
## Call:
## tslm(formula = cworkhoursTS ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
summary(cwhoursQM)
##
## Call:
## tslm(formula = cworkhoursTS ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
In order to determine the suitability of a trend, the book suggests that analysts look at a time plot of the (deseasonalized) series with the trend overlaid, review the residual time plot, and compare performance measures on a validation period.
I did not partition this data into training and validation periods. There is also no seasonality to model, since the series contains one average per year, so I simply overlaid the linear and quadratic trend lines on the actual series. Based on this plot and the two summaries, the quadratic trend model is the best regression-based model for this time series: although the two fits are visually close, the quadratic model has a lower residual standard error (0.405 vs. 0.509) and a higher adjusted R-squared (0.747 vs. 0.600), and it captures the leveling off of work hours in the later years.
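As a follow-up (not required by the problem), here is a minimal sketch of how the two trend models could also be compared on a held-out validation period; the five-year validation length is an arbitrary assumption of mine.
#sketch: compare the linear and quadratic trends on a held-out validation period
nValid <- 5                                      #assumed validation length (last 5 years)
nTrain <- length(cworkhoursTS) - nValid
workTrain <- window(cworkhoursTS, end = 1966 + nTrain - 1)
workValid <- window(cworkhoursTS, start = 1966 + nTrain)
linTrain <- tslm(workTrain ~ trend)
quadTrain <- tslm(workTrain ~ trend + I(trend^2))
#validation-period accuracy (RMSE, MAE, MAPE) for each model
accuracy(forecast(linTrain, h = nValid), workValid)
accuracy(forecast(quadTrain, h = nValid), workValid)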
4. Forecasting Department Store Sales: The time series plot shown in Figure 6.12 describes actual quarterly sales for a department store over a 6-year period.
#Read data, create time series, partition the data, & plot.
DeptSales<- read.csv("DeptSales.csv", stringsAsFactors = FALSE)
SalesTS<- ts(DeptSales$Sales, start = c(1), frequency = 4)
validationlen<- 4
traininglen<- length(SalesTS)-validationlen
salestrain<- window(SalesTS,end = c(1,traininglen))
salesvalid<- window(SalesTS, start = c(1,traininglen + 1))
yrange = range(SalesTS)
plot(c(1,7), yrange, type = "n", xlab = "Year", ylab = "Dept. Store Sales ($ Thousands)", bty = "l", xaxt = "n", yaxt = "n")
axis(1,at=seq(1,7,1), labels = format(seq(1,7,1)))
axis(2,at=seq(48000,105000,5000),labels = format(seq(48.0,105.0,5)),las = 2)
lines(SalesTS,bty = "l")
4. (a) The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?
The following operation must be performed:
. Take a logarithm of sales
According to the textbook, to fit an exponential trend an analyst must replace the output variable y (Sales, in this case) with log(y).
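In tslm() there are two equivalent ways to do this (a minimal sketch, using the salestrain series defined above): transform the response explicitly, or pass lambda = 0 so the log transform, and the back-transform of forecasts, is handled automatically.
#sketch: two equivalent ways to fit an exponential trend with seasonality
expFitLog <- tslm(log(salestrain) ~ trend + season)         #fitted values stay on the log scale
expFitBC <- tslm(salestrain ~ trend + season, lambda = 0)   #Box-Cox with lambda = 0 (i.e., log); forecasts are back-transformed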
4. (b) Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).
#exponential trend with seasonality: lambda = 0 applies a log transform to Sales
dssaleslmtrse<- tslm(salestrain ~ trend + season, lambda = 0)
yrange = range(SalesTS)
plot(c(1,7), yrange, type = "n", xlab = "Year", ylab = "Dept. Store Sales ($ Thousands)", bty = "l", xaxt = "n", yaxt = "n")
axis(1,at=seq(1,7,1), labels = format(seq(1,7,1)))
axis(2,at=seq(48000,105000,5000),labels = format(seq(48.0,105.0,5)),las = 2)
lines(SalesTS,bty = "l")
lines(dssaleslmtrse$fitted, col = "red", lwd = 1)
4. (c) A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?
summary(dssaleslmtrse)
##
## Call:
## tslm(formula = salestrain ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
Per the regression summary, after adjusting for trend, average Q2 sales are approximately equal to average Q1 sales: the season2 coefficient is small (0.025 on the log scale) and not statistically significant (p = 0.248).
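As a quick check (my own addition), back-transforming the Q2 dummy gives the multiplicative effect of Q2 relative to Q1:
#sketch: multiplicative Q2 effect relative to Q1, after adjusting for trend
exp(coef(dssaleslmtrse)["season2"])   #about 1.025, i.e., ~2.5% higher, but not significant (p = 0.248)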
4. (d) Use this model to forecast sales in quarters 21 and 22.
Deptsalesforecast<- forecast(dssaleslmtrse, h=2)
Deptsalesforecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
4. (e) The plots shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from this regression model.
i. Recreate these plots.
yrange = range(salestrain)
plot(c(1,7), yrange, type = "n", xlab = "Year", ylab = "Dept. Store Sales ($ Thousands)", bty = "l", xaxt = "n", yaxt = "n")
axis(1,at=seq(1,7,1), labels = format(seq(1,7,1)))
axis(2,at=seq(48000,105000,5000),labels = format(seq(48.0,105.0,5)),las = 2)
lines(salestrain, bty = "l", lwd = 1)
lines(Deptsalesforecast$fitted, col = "red", lwd = 2)
plot(Deptsalesforecast$residuals, xlab = "Year", ylab = "Residuals", bty = "l")
abline(h=0)
ii. Based on these plots, what can you say about your forecasts for quarters Q21 and Q22? Are they likely to overforecast, under-forecast, or be reasonably close to the real sales values?
Overall, the forecasts for quarters 21 and 22 are likely to be reasonably close to the real sales values, since the residuals in years two and three stay very close to zero.
4. (f) Looking at the residual plot, which of the following statements appear true?
The following statements appear true:
. The regression model fits the data well.
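One way (not required by the exercise) to sanity-check this is to look for leftover autocorrelation in the training-period residuals:
#sketch: residual autocorrelation check for the fitted model
Acf(dssaleslmtrse$residuals, main = "ACF of Training Residuals")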
4. (g) Which of the following solutions is adequate and a parsimonious solution for improving model fit?
. Fit a quadratic trend model to Sales (with Quarter and Quarter²); a sketch of this model follows below.
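A minimal sketch of what that improvement could look like with tslm(); keeping the seasonal dummies in the model is my assumption, since the option text only mentions the trend terms.
#sketch of the improvement in (g): quadratic trend (trend and trend^2) fit to Sales on the training period
dsSalesQuad <- tslm(salestrain ~ trend + I(trend^2) + season)
summary(dsSalesQuad)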
5. Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.
Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.
#read data, create time series, & partition data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
souvenir<- read.csv("souvenirsales.csv", stringsAsFactors=FALSE)
Souvenir_SalesTS<- ts(souvenir$Sales, start=c(1995,1), frequency=12)
validsouvLength <- 12
trainsouvLength <- length(Souvenir_SalesTS) - validsouvLength
souvenirWTrain <- window(Souvenir_SalesTS, start=c(1995,1), end=c(1995, trainsouvLength))
souvenirWValid <- window(Souvenir_SalesTS, start=c(1995,trainsouvLength+1), end=c(1995,trainsouvLength+validsouvLength))
5. (a) Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?
plot(Souvenir_SalesTS, type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "Sales (Thousands)", main = "Souvenir Sales")
lines(Souvenir_SalesTS, lwd = 2)
axis(1,at=seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
axis(2, at = seq(0,105000,5000), labels = format(seq(0, 105, 5)), las = 2)
#the y-axis stays on the log scale, so R's default tick marks are kept (the dollar-scale tick positions would fall outside the plotted range)
plot(log(Souvenir_SalesTS), type = "n", xaxt = "n", xlab = "Year", ylab = "Log Souvenir Sales", main = "Log Souvenir Sales")
lines(log(Souvenir_SalesTS), lwd = 2)
axis(1, at = seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
Based on the two time plots, the model should include (as reflected in models A and B below):
. An exponential trend (a linear trend in log(Sales)), i.e., 1 trend predictor
. Multiplicative (monthly) seasonality, captured by 11 monthly dummy variables on the log scale
This gives a total of 12 predictors in the model.
5. (b) Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.
#model A: linear trend with additive monthly seasonality, fit to the training period
modela<- tslm(souvenirWTrain ~ trend + season)
forecastA<- forecast(modela, h=2)
accuracymodela<- accuracy(forecastA$mean, souvenirWValid)
plot(Souvenir_SalesTS, type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "Sales (Thousands)", main = "Model A")
lines(souvenirWTrain, lwd = 2)
lines(modela$fitted, col = "red", lwd = 2)
axis(1,at=seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
axis(2, at = seq(0,105000,5000), labels = format(seq(0, 105, 5)), las = 2)
par(oma = c(0, 0, 0, 2))
xrange<- c(1,7)
yrange<- range(Souvenir_SalesTS)
plot(xrange, yrange, type="n", xlab="Year", ylab="Monthly Souvenir Sales (Thousands)", bty="l", las=1)
colors <- rainbow(12)
linetype <- c(1:12)
plotchar <- c(1:12)
axis(1, at = seq(1, 7, 1), labels = format(seq(1, 7, 1)))   #x-axis is the year index (1 = 1995, ..., 7 = 2001)
for (i in 1:12) {
currentMonth <- subset(Souvenir_SalesTS, cycle(Souvenir_SalesTS)==i)
lines(currentMonth, type="b", lwd=1.5,
lty=linetype[i], col=colors[i], pch=plotchar[i]) }
title("Souvenir Sales by Month")
legend(7.5,80000, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)
i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable?
December has the highest average sales during the year (its seasonal coefficient, season12, is by far the largest). This is reasonable: December brings holiday shopping, and for a beach resort town in Queensland it is also the start of the summer tourist season.
ii. What does the trend coefficient of model A mean?
The trend coefficient of model A is the estimated change in sales per unit of time (one month). Holding the seasonal pattern constant, sales are estimated to increase by about $245 (the trend coefficient, 245.36) per month.
summary(modela)
##
## Call:
## tslm(formula = souvenirWTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
5. (c) Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.
#model B: linear trend and monthly seasonality fit to log(Sales), i.e., an exponential trend in the original scale
modelb<- tslm(log(souvenirWTrain) ~ trend + season)
plot(souvenirWTrain, type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "Sales (Thousands)", main = "Model B", lwd = 1)
lines(souvenirWTrain, lwd = 2)
lines(exp(modelb$fitted.values), col = "red", lwd = 2)   #back-transform the fitted values from the log scale to dollars
axis(1, at = seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
axis(2, at = seq(0, 105000, 5000), labels = format(seq(0, 105, 5)), las = 2)
i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?
An exponential trend: log(y) = b0 + b1*trend + ... is equivalent to y = exp(b0)*exp(b1*trend)*..., so a linear trend in log(Sales) corresponds to an exponential (multiplicative) trend in Sales.
ii. What does the estimated trend coefficient of model B mean?
summary(modelb)
##
## Call:
## tslm(formula = log(souvenirWTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
The trend coefficient of model B is the estimated change in log(Sales) per month. Because the model is fit on the log scale, this translates into a percentage change: holding the seasonal pattern constant, sales are estimated to grow by roughly 2.1% per month (exp(0.02112) is about 1.021).
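The monthly growth rate can be recovered directly from the fitted coefficient (a quick check, my own addition):
#sketch: monthly growth rate implied by model B's trend coefficient
exp(coef(modelb)["trend"]) - 1   #about 0.021, i.e., roughly 2.1% growth per month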
iii. Use this model to forecast the sales in February 2002.
#same model as model B, but fit with lambda = 0 so that forecast() returns sales back-transformed to dollars
forecasttslm<- tslm(souvenirWTrain ~ trend + season, lambda = 0)
forecastmodeltslm<- forecast(forecasttslm, h = 2)
forecastmodeltslm
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 9780.022 7459.322 12822.72 6437.463 14858.16
## Feb 2001 13243.095 10100.643 17363.21 8716.947 20119.38
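Because the training period ends in December 2000, the two-month horizon above lands on February 2001; February 2002 is 14 months past the end of the training period. A minimal sketch of extending the same model's horizon to reach it:
#sketch: February 2002 is 14 months beyond the end of the training period (Dec 2000)
forecastFeb2002 <- forecast(forecasttslm, h = 14)
tail(forecastFeb2002$mean, 1)   #point forecast for Feb 2002, back-transformed to dollars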
5. (d) Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.
Model A forecast accuracy & summary:
ModelAaccuracy<- accuracy(forecastA$mean, souvenirWValid[1:2])
summary(modela)
##
## Call:
## tslm(formula = souvenirWTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
Model B forecast accuracy & summary:
ModelBaccuracy<- accuracy(forecastmodeltslm$mean, souvenirWValid[1:2])
summary(modelb)
##
## Call:
## tslm(formula = log(souvenirWTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
Model B is preferable for forecasting. On the validation points, the RMSE and MAPE for model A are higher than for model B, which means the exponential trend with (multiplicative) seasonality fits the souvenir sales data better. In addition, model B's adjusted R-squared (0.93) is much higher than model A's (0.75), and all of model B's seasonal coefficients are statistically significant, whereas most of model A's are not.
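For reference, the validation-period accuracy objects computed above can be displayed directly to make the RMSE/MAPE comparison explicit:
#validation-period accuracy for each model (objects computed above)
ModelAaccuracy   #model A: linear trend + additive seasonality
ModelBaccuracy   #model B: exponential trend (fit on the log scale)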