#Libraries
library(forecast)
library(zoo)
library(timeDate)
The time series plot in Figure 6.10 describes the average annual number of weekly hours spent by Canadian manufacturing workers.
Which one of the following regression-based models would work best?
Work <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
Work.TS <- ts(Work$HoursPerWeek, start=c(1966), frequency=1)
yrange = range(Work.TS)
#Plotted
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Average Canadian Work Hours (per week)", bty="l", xaxt="n", yaxt="n")
#Added time series
lines(Work.TS, bty="l")
#Added axes
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(30,40,1), labels=format(seq(30,40,1)), las=2)
-Linear Trend
# Linear trend
WorkLinear <- tslm(Work.TS ~ trend)
summary(WorkLinear)
##
## Call:
## tslm(formula = Work.TS ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
-Linear Trend w/ Seasonality
# Linear trend with seasonality - N/A
# WorkLinearSeason <- tslm(Work.TS ~ trend + season)
# summary(WorkLinearSeason)
-Quadratic Trend
# Quadratic trend
WorkQuad <- tslm(Work.TS ~ trend + I(trend^2))
summary(WorkQuad)
##
## Call:
## tslm(formula = Work.TS ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
-Quadratic Trend w/ Seasonality
# Quadratic trend with seasonality - N/A
# WorkQuadSeason <- tslm(Work.TS ~ trend + I(trend^2) + season)
# summary(WorkQuadSeason)
The data set has no explicit seasonality. As such, Linear trend with seasonality and Quadratic seasonality are not viable options. Linear trend has an adj R-squared of .60 and the Quadratic trend has an adj R-squared of .75. Looking at these two options, it appears that the Quadratic trend explains the variation in hours worked better (75% vs. 60%. The plot below further supports this, as the quadratic model better reflects the dip and later rise (quadratic function) in the data.
#plot trend lines
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Average Canadian Work Hours (per week)", bty="l", xaxt="n", yaxt="n")
# Add the time series Work
lines(Work.TS, bty="l")
# Add the x-axis
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
# Add the y-axis
axis(2, at=seq(30,40,1), labels=format(seq(30,40,1)), las=2)
# Add the linear trend model
lines(WorkLinear$fitted, col="red")
lines(WorkQuad$fitted, col="blue")
# Add a legend
legend(1966,36, c("Avg Work Hours", "Linear", "Quadratic"), lty=c(1,1,1), col=c("black", "red", "blue"), bty="n")
Forecasting Department Store Sales: The time series plot shown in Figure 6.12 describes actual quarterly sales for a department store over a 6-year period.
The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?
Take a logarithm of the Quarter index - Not necessary
Take a logarithm of sales - Necessary, in order to account for an exponential trend in the data
Take an exponent of sales - Not necessary
Take an exponent of Quarter index - Not necessary
Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).
#Import data
DeptSales <- read.csv("DeptStoreSales.csv")
Sales.TS <- ts(DeptSales$Sales, start=c(1,1), freq=4)
#Partition data
ValidLength <- 4
TrainLength <- length(Sales.TS) - ValidLength
Train.TS <- window(Sales.TS, start= c(1,1), end= c(1, TrainLength))
Valid.TS <- window(Sales.TS, start= c(1, TrainLength+1), end= c(1, TrainLength+ValidLength))
#Exponential model
SalesExpo <- tslm(Train.TS ~ trend + season, lambda=0)
summary(SalesExpo)
##
## Call:
## tslm(formula = Train.TS ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?
Looking at the output, I would expect Q2 average sales to be approximately equal, but possibly higher. While the coefficient is positive at .024956 suggesting they would be higher, the P-value tells us that Q2 is not statistically significantly different than Q1. Therefore, we should expect them to be about the same.
Use this model to forecast sales in quarters 21 and 22.
validPeriodForecasts <- forecast(SalesExpo, h=ValidLength)
validPeriodForecasts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
## 6 Q3 70920.09 67297.09 74738.13 65247.24 77086.15
## 6 Q4 93788.66 88997.41 98837.86 86286.57 101943.01
Forecasted sales for quarters 21 and 22 are 58793.71 and 60951.51 respectively
The plot shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from the regression model.
i. Recreate these plots.
yrange = range(Sales.TS)
plot(c(1,7),yrange, type="n",xlab="Year",ylab="Sales Dollars (in thousands)",bty="l",xaxt="n",yaxt="n")
lines(Sales.TS,bty="l")
axis(1,at=seq(1,7,1),labels=format(seq(1,7,1)))
axis(2,at=seq(45000,105000,5000),labels=format(seq(45,105,5)),las=2)
lines(SalesExpo$fitted,col="red")
lines(validPeriodForecasts$mean,col="blue",lty=2)
legend(1,105000,c("Actual (sales)","Exponential Trend + Season","Forecasts"),lty=c(1,1,2),col=c("black","red","blue"),bty="n")
plot(SalesExpo$residuals, type="o", xlab = "Year", ylab = "Residuals", bty="l")
ii. Based on these plots, what can you say about your forecasts for quarters 21 and 22? Are they likely to over-forecast, under-forecast or be reasonably close to the real sales values?
I would say that the forecasted quarters should be reasonably close. The residuals show about an even mix above/below 0, signifying a relatively balanced model that is on average predicting well. However, the validation period quarters shown above with the forecasted values suggest the model under-predicts.
Looking at the residual plot, which of the following statements appear true?
Seasonality is not captured well - NOT TRUE
The regression model fits the data well - TRUE. With an even mix of residuals, and no real outliers, the regression model fits fairly well with the data.
The trend in the data is not captured well by the model - NOT TRUE. However, the residuals show a quadratic pattern, suggesting the trend is not fully captured by our model.
Which of the following solutions is adequate and a parsimonious solution for improving model fit?
Fit a quadratic trend model to the residuals (with Quarter and Quarter-squared.) - NO.
Fit a quadratic trend model to Sales (with Quarter and Quarter-squared.) - YES. By doing this we can hope that some of the U-shaped trend we saw in the residuals is addressed.
Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.
In 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.
Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?
Looking at the time plots we see there is definite seasonality, and the log values are increasing, indicating a quadratic trend. As such we would need 11 dummy variables for the other months (seasons) as well as one for t and another for t-squared, for a total of 13 predictors.
Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.
#Model A
SouvenirData <- read.csv("SouvenirSales.csv")
Souvenir.TS <- ts(SouvenirData$Sales, start=c(1995, 1), frequency=12)
yrange = range(Souvenir.TS)
#partition
ValidLength1 <- 12
TrainLength1 <- length(Souvenir.TS) - ValidLength1
SouvenirTrain <- window(Souvenir.TS, end=c(1995, TrainLength1))
SouvenirValid <- window(Souvenir.TS, start=c(1995, TrainLength1+1))
#Plot
ModelA <- tslm(SouvenirTrain ~ trend + season)
summary(ModelA)
##
## Call:
## tslm(formula = SouvenirTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
i. Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable?
December has by far the highest average sales during the year (November is a distant 2nd). This is reasonable because sales would increase near the holiday season where people buy more. And while it’s cold here in the US, December are warm months in Australia, and tourists are probably flocking to Australia that time of the year.
ii. What does the trend coefficient of model A mean?
The trend coefficient is 245.36, meaning the average (linear) increase each month is A$245.36 over the entire period
Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.
#Model B
ModelB <- tslm(log(SouvenirTrain) ~ trend + season)
summary(ModelB)
##
## Call:
## tslm(formula = log(SouvenirTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
i. Fitting a model only to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?
Fitting a model to log(Sales) with a linear trend is the same as fitting a Quadratic trend model to Sales
ii. What does the estimated trend coefficient of model B mean?
The estimated trend coefficient of Model B is 0.021120. As a log, this now is the change as a percent, meaning the model has an overall increase of ~2.1% in sales each month over the previous month.
iii. Use this model to forecast the sales in February 2002.
#84 obervations - Feb 2002 = obs #86
Feb2002Forecast <- ModelB$coefficients["(Intercept)"] + ModelB$coefficients["trend"]*86 + ModelB$coefficients["season2"]
exp(Feb2002Forecast)
## (Intercept)
## 17062.99
Forecasted sales in February 2002 are A$17,062.99
Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.
#Accuracy - Model A
ModelAForecast <- forecast(ModelA, h = ValidLength1)
accuracy(ModelAForecast$mean, SouvenirValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
#Accuracy - Model B
ModelBForecast <- forecast(ModelB, h = ValidLength1)
accuracy(exp(ModelBForecast$mean), SouvenirValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
# Plot
plot(SouvenirValid, bty="l", xaxt="n", xlab="Year: 2001", yaxt="n", ylab="Sales (A$)")
# X-axis
axis(1, at=seq(2001,2001+11/12,1/12), labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
# Y-axis
axis(2, at=seq(0,100000,10000), labels=format(seq(0,100,10)), las=2)
lines(ModelAForecast$mean, col="blue")
lines(exp(ModelBForecast$mean), col="red")
# Legend
legend(2001,85000, c("Actual", "Model A (Linear & Season)", "Model B (log(y) Linear + Season)"), lty=c(1,1,1), col=c("black", "blue", "red"), bty="n")
Looking at the forecast performance, we would prefer Model B. The above plot shows that Model B appears to be a better fit, especially in the later part of the data set. Looking at their accuracy, we see that Model A has a MAPE of 26.67 whereas Model B has a better MAPE of 15.52. Looking at how well they explain the data, Model A has an adjusted R-squared value of 0.7476 (75%). Model B again is better, with an R-squared value of 0.9306 (93%).
How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.
The biggest difference is that there is no forecasting required. We would only need to be descriptive in our analysis. This means we could look at the data in one period (no need to partition data). We may also want to look at separating it by month/season - this may help us look at what we can do to balance out sales in lower volume months. We may also want to include other factors to examine? Currency rates? Some tourist volume metric?
…END…