Assignment 5

Read in the .csv files for problems 2, 4 & 5

CAWorkHours <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
StoreSales <- read.csv("DeptStoreSales.csv", stringsAsFactors = FALSE)
SouvSales <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)

Analysis of Canadian Manufacturing Workers Work-Hours

str(CAWorkHours)

## 'data.frame':    35 obs. of  2 variables:
##  $ Year        : int  1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
##  $ HoursPerWeek: num  37.2 37 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 ...

head(CAWorkHours)

##   Year HoursPerWeek
## 1 1966         37.2
## 2 1967         37.0
## 3 1968         37.4
## 4 1969         37.5
## 5 1970         37.7
## 6 1971         37.7

tail(CAWorkHours)

##    Year HoursPerWeek
## 30 1995         35.7
## 31 1996         35.7
## 32 1997         35.5
## 33 1998         35.6
## 34 1999         36.3
## 35 2000         36.5

# Create a time series object out of Canadian Work Hours
CAhoursTS <- ts(CAWorkHours$HoursPerWeek, start = c(1966), frequency = 1)

#Set up the plot
par(oma=c(0,0,2,0))
plot(c(1966,2000), ylim = c(34,38), xlim = c(1966,2000), type = "n", xlab = "Year", ylab = "Hours per Week", bty = "l", xaxt = "n", yaxt = "n")

# Add the time series CAhours
lines(CAhoursTS, bty = "l")

# Add the x and y axes
axis(1, at = seq(1966,2000,1), labels = format(seq(1966,2000,1)))
axis(2, at = seq(34,38,0.5), labels = format(seq(34,38,0.5)), las = 2, cex.axis = 0.6)

Which one model of the following regression-based models would fit the series best?

Set up the models we want to fit to the data set

# Linear Trend
CAhoursLinear <- tslm(CAhoursTS ~ trend)
summary(CAhoursLinear)

## 
## Call:
## tslm(formula = CAhoursTS ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20457 -0.28761  0.04779  0.30210  1.23190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.416134   0.175818 212.811  < 2e-16 ***
## trend       -0.061373   0.008518  -7.205 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared:  0.6113, Adjusted R-squared:  0.5996 
## F-statistic: 51.91 on 1 and 33 DF,  p-value: 2.928e-08

# Quadratic trend
CAhoursQuad <- tslm(CAhoursTS ~ trend + I(trend^2))
summary(CAhoursQuad)

## 
## Call:
## tslm(formula = CAhoursTS ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94503 -0.20964 -0.01652  0.31862  0.60160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.1644156  0.2176599 175.340  < 2e-16 ***
## trend       -0.1827154  0.0278786  -6.554 2.21e-07 ***
## I(trend^2)   0.0033706  0.0007512   4.487 8.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7465 
## F-statistic: 51.07 on 2 and 32 DF,  p-value: 1.1e-10

Visualization of fitted models

#Set up the plot
plot(c(1966,2000), ylim = c(34,38), xlim = c(1966,2000), type = "n", xlab = "Year", ylab = "Hours per Week", bty = "l", xaxt = "n", yaxt = "n")

# Add the time series CAhours
lines(CAhoursTS, bty = "l")

# Add the x and y axes
axis(1, at = seq(1966,2000,1), labels = format(seq(1966,2000,1)))
axis(2, at = seq(34,38,0.5), labels = format(seq(34,38,0.5)), las = 2, cex.axis = 0.6)

# Add the linear and quadratic trend models and trend models with seasonality
lines(CAhoursLinear$fitted, col = "blue")
lines(CAhoursQuad$fitted, col = "red")

# Add a legend
legend(1966, 36, c("Hours per Week", "Linear", "Quadratic"), lty = c(1,1,1), col = c("black", "blue", "red"), bty = "n")

The adjusted R^2 of the linear trend model and quadratic trend model are 0.5996 and 0.7465 respectively. The linear trend and quadratic trend models with seasonality cannot be fitted because non-seasonal data cannot be modelled using a seasonal factor.

• Linear trend model - The low Adj. R-sq. does not give a good explanation of variation in Canadian hours worked per week. • Linear trend model with seasonality - There is no seasonal component to the time series, so the model cannot be fitted to the data set. • Quadratic trend model - The higher Adj. R-Sq value gives a better explanation of the variation in Candian hours worked per week. This model would fit the series the best out of the 4 options. • Quadratic trend model with seasonality - There is no seasonal component to the time series, so the model cannot be fitted to the data set.

Forecasting Department Store Sales

The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)? • Take a logarithm of the Quarter index • Take a logarithm of sales • Take an exponent of sales • Take an exponent of Quarter index

An exponential trend implies a multiplicative increase or decrease of the series over time. To fit an exponential trend, the output variable (Sales) is replaced with log(Sales) and then fit a linear regression.

Plot the department store sales time series

# Create a time series object out of Department Store Sales
StoresalesTS <- ts(StoreSales$Sales, start = c(1), frequency = 4)

# Set up the plot
plot(c(1,7), ylim = c(40000, 120000), xlim = c(1,7),  type = "n", xlab = "Year", ylab = "Sales (in thousands)", bty = "l", xaxt = "n", yaxt = "n")
box()
box(which = "plot")
minor.tick(nx = 3, ny = 2, tick.ratio = 0.5)
lines(StoresalesTS, bty = "l")
axis(1, at = seq(1,7,1), labels = format(seq(1,7,1)))
axis(2, at = seq(40000, 120000, 10000), labels = format(seq(40, 120, 10)), las = 2)

Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).

Partition the data, so that the last 4 quarters are in the validation period and all previous quarters are in the training period.

valid.Length <- 4
train.Length <- length(StoresalesTS) - valid.Length

StoresalesTS.Train <- window(StoresalesTS, end = c(1, train.Length))
StoresalesTS.Valid <- window(StoresalesTS, start = c(1, train.Length+1))

Fit the regression model with exponential trend with seasonality. Use the tslm function with lambda = 0 to do this.

StoresalesExp <- tslm(StoresalesTS.Train ~ trend + season, lambda = 0)
summary(StoresalesExp)

## 
## Call:
## tslm(formula = StoresalesTS.Train ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16

# Plot the regression
plot(c(1,7), ylim = c(40000, 120000), xlim = c(1,7),  type = "n", xlab = "Year", ylab = "Sales (in thousands)", bty = "l", xaxt = "n", yaxt = "n")
box()
box(which = "plot")
minor.tick(nx = 3, ny = 2, tick.ratio = 0.5)

# Add the time series object
lines(StoresalesTS, bty = "l")

# Add the exponential fit
lines(StoresalesExp$fitted, col = "red")

# Add the x-axis and y-axis
axis(1, at = seq(1,7,1), labels = format(seq(1,7,1)))
axis(2, at = seq(40000, 120000, 10000), labels = format(seq(40, 120, 10)), las = 2)

From the output, after adjusting for trend, are Q2 average sales higher, lower or approximately equal to the average Q1 sales? Looking at the p-values, every variable is statistically significant from the base season, quarter 1, except season2, which has a p-value of 0.248. season4 has the largest coefficient, which means it is the largest of the 4 quarters. season2 has the smallest coefficient and is slightly higher than Q1 average sales, but as stated above this is the only season that is not statistically significant and as a result it could be higher or lower.
Use this model to forecast sales in quarters 21 and 22.

validPeriodForecasts <- forecast(StoresalesExp, h = valid.Length - 2)
validPeriodForecasts

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04 66250.87

The forecasts for Q21 and Q22 are $58,793.71 and $60,951.51 respectively.

The plots shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from this regression model.

Recreate these plots.

# Plot fit of regression model for department store sales
plot(c(1, 6), ylim = c(40000,100000), xlim = c(1,6), type="o", xlab="Year",  ylab="Sales", bty="l", xaxt="n", yaxt="n")
axis(1, at=seq(1,6,1), labels=format(seq(1,6,1)))
axis(2, at=seq(40000, 100000, 10000), labels=format(seq(40000, 100000, 10000)), las=0)
lines(StoresalesExp$fitted, col="red")
lines(StoresalesTS.Train)

# Plot the residuals in log-transformed units
plot(StoresalesExp$residuals, type = "o", bty = "l")

# Plot the residuals in the orginal scale
plot(StoresalesTS.Train - StoresalesExp$fitted, ylim = c(-4000, 4000), xlim = c(1,6), type = "o", bty = "l", xlab = "Quarter", ylab = "Residuals")
minor.tick(nx = 3, ny = 2, tick.ratio = 0.5)

Based on these plots, what can you say about your forecasts for quarters Q21 and Q22? Are they likely to over-forecast, under-forecast, or be reasonably close to the real sales values?

From the residual plot we can see that roughly the first year and last year is under-forecasted, and in between is over-forecasted. If the trend continues we can assume that Q21 and Q22 will be under-forecasted. We can verify this by plotting the residuals of the validation period.

# Residuals of the validation period
StoresalesTS.Valid - validPeriodForecasts$mean

##      Qtr1    Qtr2
## 6 2006.29 3948.49

# Plot the residuals of the validation period
valid.PeriodForecasts <- forecast(StoresalesExp, h = valid.Length)
plot(StoresalesTS.Valid - valid.PeriodForecasts$mean, type = "o", bty = "l", ylab = "Residuals")

The residuals for the validation period are all positive, meaning they are all under-forecasted.

To see how well the regression model fits the data, generate a plot showing the training period fit and validation period forecasts.

# Set up the plot
plot(c(1, 7), c(40000,120000), type="n", xlab="Year",  ylab="Sales", bty="l", xaxt="n", yaxt="n")

#Add the time series deptTS
lines(StoresalesTS, bty="l")

# Add the x-axis and y-axis
axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))
axis(2, at=seq(40000, 120000, 10000), labels=format(seq(40000, 120000, 10000)), las=0)
abline(v=6)
arrows(1, 110000, 6, 110000, code=3, length=0.1)
text(3.5, 115000, "Training", cex = 0.75)
abline(v=7)
arrows(6, 110000, 7, 110000, code=3, length=0.1)
text(6.5, 115000, "Validation", cex = 0.75)
lines(valid.PeriodForecasts$fitted, col="red")
lines(valid.PeriodForecasts$mean, col="blue", lty=2)

# Add the legend
legend(1,105000, c("Sales", "Exponential Trend + Season", "Forecasts"), lty=c(1,1,2), col=c("black", "red", "blue"), bty="n", cex = 0.7)

summary(valid.PeriodForecasts)

## 
## Forecast method: Linear regression model
## 
## Model Information:
## 
## Call:
## tslm(formula = StoresalesTS.Train ~ trend + season, lambda = 0)
## 
## Coefficients:
## (Intercept)        trend      season2      season3      season4  
##    10.74894      0.01109      0.02496      0.16534      0.43375  
## 
## 
## Error measures:
##                    ME     RMSE      MAE         MPE     MAPE      MASE
## Training set 30.63141 1759.359 1388.794 -0.04013518 2.215018 0.4456262
##                   ACF1
## Training set 0.5540818
## 
## Forecasts:
##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84  63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04  66250.87
## 6 Q3       70920.09 67297.09 74738.13 65247.24  77086.15
## 6 Q4       93788.66 88997.41 98837.86 86286.57 101943.01

Looking at the residual plot, which of the following statements appear to be true? • Seasonality is not captured well - The model seems to capture seasonality in the data. • The regression model fits the data well - The regression model gives a MAPE of 2.215 indicating that the model fits the data well. • The trend in the data is not captured well by the model - True. The dip in the middle of the series indicates that the trend is not captured. The residuals should not display a U-shaped trend if it is captured well.
Which of the following solutions is adequate and a parsimonious solution for improving model fit? • Fit a quadratic trend model to the residuals (with Quarter and Quarter.) • Fit a quadratic trend model to Sales (with Quarter and Quarter.)

To improve the model fit we would want to fit a quadratic trend model to Sales.

Souvenir Sales

Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in this model?

Both time plots show seasonality and trend. The first plot looks like it might have an exponential trend, but looking at the log time plot it looks like the trend may be more linear. In this case the seasons are months, therefore there would be 12 predictors, 1 for trend and 11 for seasonality.

Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Call this model A.

# Create a time series object
SouvsalesTS <- ts(SouvSales$Sales, start = c(1995,1), frequency = 12)

# Set the validation period length to the last 12 months and the previous months as the training set
valid.Length <- 12
train.Length <- length(SouvsalesTS) - valid.Length

# Partition the data into the training and validation periods set above
Souvsales.Train <- window(SouvsalesTS, start=c(1995,1), end=c(1995, train.Length))
Souvsales.Valid <- window(SouvsalesTS, start=c(1995,train.Length+1), end=c(1995,train.Length+valid.Length))

# Model A - Regression model with sales as output variable and with linear trend and monthly predictors
modelA <- tslm(Souvsales.Train ~ trend + season)
summary(modelA)

## 
## Call:
## tslm(formula = Souvsales.Train ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable?

modelAForecast <- forecast(modelA, h = valid.Length)
accuracy(modelAForecast, Souvsales.Valid)

##                         ME      RMSE       MAE       MPE     MAPE     MASE
## Training set -5.684342e-14  5365.199  3205.089  6.967778 36.75088 0.855877
## Test set      8.251513e+03 17451.547 10055.276 10.533974 26.66568 2.685130
##                   ACF1 Theil's U
## Training set 0.4048039        NA
## Test set     0.3206228 0.9075924

Based on the coefficients months 11 and 12 have the highest average sales during the year. This makes sense because in Australia the summer months are December through February. Months 11 and 12 would be around the beginning of summer.

The MAPE for modelA is not very good.

What does the trend coefficient of model A mean?

Souvenir sales increase by $245.36 on average for each successive time period.

Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Call this model B.

# Run regression model with log(Sales) with linear trend and monthly predictors
modelB <-tslm(log(Souvsales.Train) ~ trend + season)
summary(modelB)

## 
## Call:
## tslm(formula = log(Souvsales.Train) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?

Fitting a model to log(Sales) with linear trend is equivalent to fitting a model to Sales with exponential trend.

What does the estimated trend coefficient of model B mean? The estimated trend coefficient of model B is 0.021120, which means that Souvenir sales increase by rouhly 2.1% for each consecutive period.

# Generate forecasts using this model
modelBForecast <- forecast(modelB, h = valid.Length)
summary(modelBForecast)

## 
## Forecast method: Linear regression model
## 
## Model Information:
## 
## Call:
## tslm(formula = log(Souvsales.Train) ~ trend + season)
## 
## Coefficients:
## (Intercept)        trend      season2      season3      season4  
##     7.64636      0.02112      0.28201      0.69500      0.37387  
##     season5      season6      season7      season8      season9  
##     0.42171      0.44705      0.58338      0.54690      0.63557  
##    season10     season11     season12  
##     0.72949      1.20095      1.95220  
## 
## 
## Error measures:
##                        ME      RMSE       MAE         MPE     MAPE
## Training set 4.935341e-17 0.1709374 0.1382015 -0.03661472 1.534989
##                  MASE      ACF1
## Training set 0.444076 0.4593573
## 
## Forecasts:
##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 2001       9.188097  8.917220  9.458974  8.769890  9.606304
## Feb 2001       9.491232  9.220354  9.762109  9.073024  9.909439
## Mar 2001       9.925335  9.654457 10.196212  9.507127 10.343542
## Apr 2001       9.625329  9.354452  9.896207  9.207122 10.043537
## May 2001       9.694286  9.423408  9.965163  9.276078 10.112493
## Jun 2001       9.740741  9.469864 10.011619  9.322534 10.158949
## Jul 2001       9.898195  9.627318 10.169072  9.479988 10.316402
## Aug 2001       9.882831  9.611954 10.153708  9.464624 10.301038
## Sep 2001       9.992619  9.721742 10.263496  9.574412 10.410826
## Oct 2001      10.107664  9.836787 10.378542  9.689457 10.525872
## Nov 2001      10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001      11.372615 11.101738 11.643493 10.954408 11.790823

# Compute accuracy metrics by transforming the forecasts back to the original scale
accuracy(exp(modelBForecast$mean), Souvsales.Valid)

##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253

Use this model to forecast sales in February 2002.

# Forecast beyond the validation period to period 86, which corresponds to Feb 2002
febForecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86

Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

# Plot the actuals and the forecasts for the validation period
plot(Souvsales.Valid, bty = "l", xaxt = "n", yaxt = "n", xlab = "Year 2001", ylab = "Souvenir Sales")

# Add the axes
axis(1, at = seq(2001, 2001 + 11/12, 1/12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at = seq(0, 100000, 10000), labels = format(seq(0,100,10)))

# Add the actual, linear+season and log(linear+season) lines
lines(SouvsalesTS, bty = "l")
lines(modelAForecast$mean, col = "blue")
lines(modelBForecast$mean, col = "red")

# Add a legend
legend(2001, 110000, c("Actuals", "Linear+Season", "log(y) Linear+Season"), lty = c(1,1,1), col = c("black", "blue", "red"), bty = "n")

Model B is preferable for forecasting. It has a lower MAPE than Model A. ModelB residuals also trend closer to the actuals.

How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 to 2001? Mention two differences.

No forecasting is required if the goal is understanding the different components of sales in the souvenir shop between 1995 to 2001. Instead, linear regression would be used for modeling cross-sectional data to infer the affect of different components. I would use the sales sample to estimate a model using different components like climate, category of souvenir etc.The sales would be the outcome of the variable of interest, the independent variables would be the components. The standard error would be the variation due to taking the sample and the coefficient would be the beta estimated by the software. The p-values would be used for inferring the effects of each component to the sample of sales.

Assignment 5

Meryn Lounsbury

April 3, 2017