First, let’s take care of some preliminary actions:
setwd("~/Documents/MBA 678/Unit 5")
getwd()
## [1] "/Users/Josh/Documents/MBA 678/Unit 5"
library(forecast)
Canadian <- read.csv("Canadian_Wrk_Hrs.csv", stringsAsFactors = FALSE)
CanadianTS <- ts(Canadian$HoursPerWeek, start=c(1966), frequency=1)
yrange <- range(CanadianTS)
plot(c(1966, 2000), c(34.5,38), type="n", xlab="Year", ylab="Canadian Manufacturing Worker Work-Hours", bty="l", xaxt="n", yaxt="n")
axis(1, at=seq(1966,2000,2), labels=format(seq(1966,2000,2)))
# Add the y-axis
axis(2, at=seq(34.5, 38, .5), labels=format(seq(34.5, 38, .5)), las=0)
lines(CanadianTS)
Linear Trend Model: Not a good choice for this time series because the trend in the base data is not linear in nature.
Linear Trend Model with Seasonality: Not a good choice. The base time series does not show convincing evidence of seasonality (we’re already at year), nor does it contain a linear trend.
Quadratic Trend Model: This would probably be the strongest choice for forecasting this time series, as the shape of the curve is generally parabolic, and seasonality isn’t apparent.
Quadratic Trend Model with Seasonality: This approach would not be recommended because seasonality is not apparent in the time series. The quadratic trend would probably capture the future level of the series well.
To fit a regression-based model for exponential trend, we must take a logarithm of sales.
#Read the csv
dept <- read.csv("Dept_Str_Sls.csv", stringsAsFactors = FALSE)
#Create the ts object
deptTS <- ts(dept$Sales, start=c(1), frequency=4)
#Refining plot
yrange <- range(deptTS)
plot(c(1, 7), c(48000,114500), type="n", xlab="Year", ylab="Quarterly Department Store Sales", bty="l", xaxt="n", yaxt="n")
#Add the time series deptTS
lines(deptTS, bty="l")
# Add the x-axis
axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))
# Add the y-axis
axis(2, at=seq(48000, 114500, 9500), labels=format(seq(48000, 114500, 9500)), las=0)
abline(v=6)
arrows(1, 105000, 6, 105000, code=3, length=0.1)
text(3, 108000, "Training")
abline(v=7)
arrows(6, 105000, 7, 105000, code=3, length=0.1)
text(6.5, 108000, "Validation")
# Set the length of the validation period to 4 quarters
validLength <- 4
# Set the length of the training period to everything else
trainLength <- length(deptTS) - validLength
# Partition the data into training and validation periods
deptTrain <- window(deptTS, end=c(1, trainLength))
deptValid <- window(deptTS, start=c(1,trainLength+1), end=c(1,trainLength+validLength))
#Fit a regressions model with exponential trend
deptExpo <- tslm(deptTrain ~ trend + season, lambda=0)
lines(deptExpo$fitted, col = "blue")
summary(deptExpo)
##
## Call:
## tslm(formula = deptTrain ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
Even though the coefficients from the model summary are presented in log form, we can still make some interesting conclusions from them. Where the season 2 coefficient is 0.02, we can conclude that Q2 is, on the whole, slightly higher in sales than Q1. However, we should also note that the relationship is not statistically significant.
deptExpoFcst <- forecast(deptExpo, h=validLength)
deptExpoFcst
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
## 6 Q3 70920.09 67297.09 74738.13 65247.24 77086.15
## 6 Q4 93788.66 88997.41 98837.86 86286.57 101943.01
Reading from the console, we produced these forecasts:
Q 21: $58,794
Q 22: $60,952
Note the forecasts produced are in the original scale, as the forecast package automatically converts the values from the model where lambda <> 1.
#Plot the fitted values based on deptExpo
plot(c(1, 6), c(48000,114500), type="n", xlab="Year", ylab="Quarterly Department Store Sales", bty="l", xaxt="n", yaxt="n")
axis(1, at=seq(1,6,1), labels=format(seq(1,6,1)))
axis(2, at=seq(48000, 114500, 9500), labels=format(seq(48000, 114500, 9500)), las=0)
lines(deptExpo$fitted.values, col="blue")
lines(deptTrain)
legend(1,105000, c("Sales", "Exponential Trend+Season"), lty=c(1,1), col=c("black", "blue"), bty="n")
#Rediduals Plot
plot(deptTrain - deptExpo$fitted, type="o", bty="l")
title(main = "Residuals - Original Scale")
#Let's also show a "pretty" plot of the training fit with the validation forecast:
plot(c(1, 7), c(48000,114500), type="n", xlab="Year", ylab="Quarterly Department Store Sales", bty="l", xaxt="n", yaxt="n")
#Add the time series deptTS
lines(deptTS, bty="l")
# Add the x-axis
axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))
# Add the y-axis
axis(2, at=seq(48000, 114500, 9500), labels=format(seq(48000, 114500, 9500)), las=0)
abline(v=6)
arrows(1, 105000, 6, 105000, code=3, length=0.1)
text(3, 108000, "Training")
abline(v=7)
arrows(6, 105000, 7, 105000, code=3, length=0.1)
text(6.5, 108000, "Validation")
lines(deptExpoFcst$fitted, col="blue")
lines(deptExpoFcst$mean, col="red", lty=2)
# Add the legend
legend(1,105000, c("Sales", "Exponential Trend+Season", "Forecasts"), lty=c(1,1,2), col=c("black", "blue", "red"), bty="n")
From the residuals plot, we can discern a roughly parabolic shape. This indicates that we are tending to under-forecast the beginning and end of the series, with some over-forecasting in the middle of the training period.
Presuming the parabolic shape of the residuals continues into the validation period, we would expect our forecasts to under-forecast in quarters 21 and 22.
In the final plot above, we can confirm that this occurred; Q21 and Q22 are in fact under-forecasted.
Our residual plot in this example indicates we may have trouble with our model. Per our lecture, ideally, a residual plot should have no discernible patterns. As discussed in 4eii, we see a parabolic shape in this residual plot.
The parabolic residuals indicate that our regression model is not capturing the trend of the data well.
Where we believe the existing model is not capturing the trend of the data well, a reasonable next step would be to fit a quadratic trend model to the original time series and compare the results of that approach to the exponential trend model covered in question 4.
Based on the raw time series, any forecasting model for this series should include both trend, and seasonality predictors. We will end up with 12 predictors in this model (one for trend and 11 for the monthly seasons).
#Read the csv
sales <- read.csv("Souvenir_Sls.csv", header = TRUE, stringsAsFactors = FALSE)
#Create Time Series Object
sales.ts <- ts(sales$Sales, start = c(1995,1), frequency = 12)
# Set the length of the validation period to 12 months (one year)
validLength <- 12
# Set the length of the training period to everything else
trainLength <- length(sales.ts) - validLength
# Partition the data into training and validation periods
SalesTrain <- window(sales.ts, start=c(1995,1), end=c(1995, trainLength))
SalesValid <- window(sales.ts, start=c(1995,trainLength+1), end=c(1995,trainLength+validLength))
ModelA <- tslm(SalesTrain ~ trend + season)
summary(ModelA)
##
## Call:
## tslm(formula = SalesTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
See coefficients from the console, above, for Model A.
We find seasons 11 and 12 have the highest average sales during the year. This absolutely makes sense, where seasons 11 as 12 represent the months of November and December, with sales spurred by holiday shopping activity.
In Model A, the trend coefficient represents the increase per period in the model’s trend. In other words, we can expect each month to increase by AU$245 from the prior months, before considering the seasonality impact of each period.
#Create Model B
modelB <-tslm(log(SalesTrain) ~ trend + season)
summary(modelB)
##
## Call:
## tslm(formula = log(SalesTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
#sanity check
ModelC <- tslm(SalesTrain ~ trend + season, lambda = 0)
summary (ModelC)
##
## Call:
## tslm(formula = SalesTrain ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2e+10 on 12 and 59 DF, p-value: < 2.2e-16
Fitting a model to log(Sales) with linear trend is equivalent to fitting an exponential trend model to Sales.
With an exponential model, the trend coefficient estimates the percentage increase of each period compared to the previous period (before applying seasonality). In this case, we model a 2.1% increase for each month.
ModelB.fcst <-forecast(modelB, h=validLength)
#Before we can compare performance of the two methods, we must create the forecast for the validation period for Model A
ModelA.fcst <-forecast(ModelA, h=validLength)
#Let's plot the forecasts together
plot(c(1995,2002), c(0,120000), type="n", xlab="Year", ylab="Souvenir Sales", bty="l", xaxt="n", yaxt="n")
#Add the time series base
lines(sales.ts, bty="l", lwd = 2)
# Add the x-axis
axis(1, at=seq(1995,2002,2), labels=format(seq(1995,2002,2)))
# Add the y-axis
axis(2, at=seq(0,120000, 20000), labels=format(seq(0,120000,20000)), las=0)
abline(v=2001)
arrows(1995, 110000, 2001, 110000, code=3, length=0.1)
text(1998, 115000, "Training")
abline(v=2002)
arrows(2001, 110000, 2002, 110000, code=3, length=0.1)
text(2001.5, 115000, "Validation")
lines(ModelA$fitted, col="blue")
lines(ModelA.fcst$mean, col="blue", lty=2)
lines(SalesTrain - modelB$fitted, col="red")
lines(exp(ModelB.fcst$mean), col = "red", lty = 2)
# Add the legend
legend(1995,110000, c("Sales", "Linear Trend+Season Fitted", "Linear Trend+Season Fcst", "Exponential Trend+Season Fitted", "Exponential Trend+Season Forecast"), lty=c(1,1,2,1,2), col=c("black", "blue", "blue", "red", "red"), bty="n", lwd = c(2,1,1,1,1))
accuracy(ModelA.fcst, SalesValid)
## ME RMSE MAE MPE MAPE MASE
## Training set -5.684342e-14 5365.199 3205.089 6.967778 36.75088 0.855877
## Test set 8.251513e+03 17451.547 10055.276 10.533974 26.66568 2.685130
## ACF1 Theil's U
## Training set 0.4048039 NA
## Test set 0.3206228 0.9075924
accuracy(exp(ModelB.fcst$mean), SalesValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
In this example, Model B, the exponential model, provides the better results.
From the raw time series, we can see that the magnitude of the seasonality seems to be increasing over time - this is an indicator of exponential seasonality. Conceptually, this supports the notion that Model B should provide a better fit.
When we plot the forecasts for the validation period, we can see that both Model A and Model B under-forecast the actual result, but Model B comes much closer to matching the actual value.
When we call accuracy on both models, we find both have rather high MAPE values. However, Model B performs better, with 15.5% MAPE, versus Model A’s 26.7% MAPE.
In this scenario, our goal is descriptive, rather than predictive. That means any modeling can consume the whole time series; there’s no need to partition into a training and validation period.
Since we need to describe the change over time in simple terms, I might use Model A to demonstrate the health of the business in gross terms over the measurement period. Though not as robust in predictive power, Model A simply confirms that the business is growing.
I might also incorporate external information to better-understand the drivers of the pattern change; for example, showing a time series of total marketing impressions over the same period of time, to demonstrate the impact of marketing efforts.
Of any interest to a business would be a deeper understanding of what’s happening in the peak November-December time period, as those months are contributing the lion’s share of annual sales. We could use descriptive techniques to model a generic index for each month’s contribution to the year as a whole, which could inform future planning efforts.
Additionally, for a descriptive goal, I might seek the data at a lower level of grain, and break the sales for those periods into product categories, to better understand what types of products are driving sales at what points during the year.