MBA 678

Chapter 6: Questions 2, 4 and 5

Question 2

To answer which model would best fit the Canadian Manufacturing Workers Work-Hours data, I began by pulling in the .csv and creating some plots.

WorkHours <- read.csv("~/Personal/CanadianWorkHours.csv", stringsAsFactors = FALSE)

workTS <- ts(WorkHours$Hours, start=c(1966, 1), frequency=1)

yrange = range(workTS)


plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Hours Per Week", bty="l", xaxt="n", yaxt="n")

lines(workTS, bty="l")

axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))

axis(2, at=seq(34.5,38,0.5), las=2)

After I had re-created the plot in the book, I ran some regression-based models. I wanted to do each of the four options, beginning with linear trend and linear trend with seasonality.

library(forecast)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: timeDate
## This is forecast 7.3
worklinear <- tslm(workTS ~ trend)
summary(worklinear)
## 
## Call:
## tslm(formula = workTS ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20457 -0.28761  0.04779  0.30210  1.23190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.416134   0.175818 212.811  < 2e-16 ***
## trend       -0.061373   0.008518  -7.205 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared:  0.6113, Adjusted R-squared:  0.5996 
## F-statistic: 51.91 on 1 and 33 DF,  p-value: 2.928e-08

When running the linear model with seasonality I got the following error: Error in tslm(workTS ~ trend + season) : Non-seasonal data cannot be modelled using a seasonal factor," which indicates to me that this data set cannot be modelled with either a linear model with seasonality or a quadratic model with seasonality. Because there is only one data point annually, I think the seasonality would be hard to determine - we don’t have enough information to compare groups of years as seasons.

That meant I wanted to run the quadratic trend model next.

workquadratic <- tslm(workTS ~ trend + I(trend^2))
summary(workquadratic)
## 
## Call:
## tslm(formula = workTS ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.1644156  0.2176599 175.340  < 2e-16 ***
## trend       -0.1827154  0.0278786  -6.554 2.21e-07 ***
## I(trend^2)   0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10

To compare the linear trend and quadratic trend models, I plotted both.

yrange = range(workTS)


plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Hours Per Week", bty="l", xaxt="n", yaxt="n")

lines(workTS, bty="l")

axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))

axis(2, at=seq(34.5,38,0.5), las=2)

lines(worklinear$fitted, col="red")
lines(workquadratic$fitted, col="blue")

legend(1966,36, c("Hours per week", "Linear", "Quadratic"), lty=c(1,1,1), col=c("black", "red", "blue"), bty="n")

Based on the results of the linear trend model and the quadratic trend model, as well as the visual, I think the quadratic trend model fits the series best.

Question 4

  1. To fit the Department Store sales data with an exponetial trend, you must take a logarithm of sales, as this is the y or output variable in the function.

  2. I began by plotting the data and then partitioned it into training and validation periods.

DeptSales <- read.csv("~/Personal/DeptStoreSales (1).csv", stringsAsFactors = FALSE)

salesTS <- ts(DeptSales$Sales, start=c(1), frequency=4)

yrange = range(salesTS)


plot(c(1,7), yrange, type="n", xlab="Year",  ylab="Sales", bty="l", xaxt="n", yaxt="n")

lines(salesTS, bty="l")

axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))

axis(2, at=seq(45000,105000, 10000), las=(2))

validLength <- 4
trainLength <- length(salesTS) - validLength

salesTrain <- window(salesTS, end=c(1, trainLength))
salesValid <- window(salesTS, start=c(1,trainLength+1))

Next I ran the regression model with an exponential trend and seasonality. I used lambda =0 to create the log needed for an exponential trend.

salesexponent <- tslm(salesTrain ~ trend + season, lambda=0)
summary(salesexponent)
## 
## Call:
## tslm(formula = salesTrain ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16
  1. Based on the output from our exponential trend model with seasonality, we can see the p value for season2 (or the second quarter) is not statistically significantly different than Q1. We can interpret this as meaning Q2 average sales are approximately equal to Q1.

  2. See below:

validLength = 2
validationforecast <- forecast(salesexponent, h=validLength)
validationforecast
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04 66250.87
  1. I recreated the plots from the text for part i.To begin I had to run the model on the entire time series, not just the training set, as the image in the book shows the model fitted to the entire set of data.
salesexponent <- tslm(salesTS ~ trend + season, lambda=0)

yrange = range(salesTS)

plot(c(1, 7), yrange, type="n", xlab="Year",  ylab="Sales", bty="l", xaxt="n", yaxt="n")

lines(salesTS, bty="l")

axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))

axis(2, at=seq(45000,105000,10000), las=2)

lines(salesexponent$fitted, col="red")

legend(1,105000, c("Sales", "Exponential with Seasonality"), lty=c(1,1), col=c("black", "red"), bty="n")

plot(salesexponent$residuals, type="o", bty="l")

  1. These plots look reasonably acceptable - the residuals seem to be centered generally above and below zero and the fitted line is very accurate in comparison to the actual reported sales. I think the forecasts for Q21 and Q22 will be close to the real sales values, but as the residuals are trending positive we may be underforecasting slightly. This would mean we predict a slightly lower value than actual for Quarters 21 and 22.
  1. Seasonality is captured well in the plot - there is no obvious pattern to the residuals. I think that the residual plot indicates the regression model fits the data well as the residuals are low and centered mainly around zero. *The trend in the data is not captured well by the model. There is a slight U-shape to the residual plot, which makes me feel a quadratic model may have yielded better results.

  2. As I mentioned in (f), there is a U-shape to the residuals, so I would suggest fitting a quadratic trend model to Sales.

Question 5

  1. Based on the plots, it seems there is an upward linear trend, as well as seasonality present in the Souvenier sales data. There should be 11 seasonal dummy variables (as there are 12 “seasons” - months) and the trend, meaning 12 predictors total.

  2. I read in the Souvenir Sales data and partitioned it into training and validation periods. Once I had done that I ran a regression model with trend and seasonality.

souvensales <- read.csv("~/Personal/SouvenirSales (1).csv",stringsAsFactors = FALSE)

salesTS <- ts(souvensales$Sales,start=c(1995,1),frequency=12)
validLength <- 12
trainLength <- length(salesTS) - validLength
souvensalesTrain <- window(salesTS,end=c(1995,trainLength))
souvensalesValid <- window(salesTS, start=c(1995,trainLength+1))

ModelA <- tslm(souvensalesTrain ~ trend + season)
summary(ModelA)
## 
## Call:
## tslm(formula = souvensalesTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16
  1. Based on the coefficients, it is clear December has the highest average sales. This is reasonable for a few reasons: December is summer in Australia, meaning peak season for beach resorts and Christmas and holiday shopping are likely highest in December.

  2. The trend coeffiecent’s p-value is statisitically significant, showing there is in fact a linear trend present in the data. The 245.36 is the slope of that trend line that for each month, there is an average increase of $245.36 in sales.

  1. To create Model B, I ran the same model but added log(souvensalesTrain to generate log(Sales) as the output.
ModelB <- tslm(log(souvensalesTrain) ~ trend + season)
summary(ModelB)
## 
## Call:
## tslm(formula = log(souvensalesTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16
  1. The model with log(Sales) output is the same as with an exponential trend.

  2. The trend coeffecient of Model B represents the increase month over month as a percentage, so 0.021 shows a 2.1% increase each month.

  3. This data has 84 points until the end of 2001, so to calculate the forecast for February 2002, we will need to use the 86th.

febforecast <- ModelB$coefficients["(Intercept)"] + ModelB$coefficients["trend"]*86 + ModelB$coefficients["season2"]

exp(febforecast)
## (Intercept) 
##    17062.99
  1. To compare Models A and B I calculated accuracy and also plotted the two outputs.
AForecast <- forecast(ModelA,h=validLength)
accuracy(AForecast$mean,souvensalesValid)
##                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
BForecast <- forecast(ModelB,h=validLength)
accuracy((exp(BForecast$mean)),souvensalesValid)
##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
plot(souvensalesValid)

lines(AForecast$mean, col="red")
lines(exp(BForecast$mean), col="green")

Based on the accuracy measures alone, it seems Model B is a better fit for the data. The plot confirms this as well.

  1. If I were doing this to understand the different components of sales between 1995 and 2001, I would not partition the data and use only a training period. If I don’t aim to forecast future values, removing some of the data would not be helpful. I would also aim to incorporate other factors, to see if things outside of a linear trend and monthly seasonality play a role.