# Loading necessary libraries:
library(forecast)
library(tidytext)
library(stringr)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(pander)
For this assignment I will be completing the assigned problems (Problems 2, 4, and 5) from Chapter 6, Regression-Based Models: Capturing Trend and Seasonality, in the Practical Time Series Forecasting with R textbook by Galit Shmueli and Kenneth C. Lichtendahl Jr.
First, I will import the data set and look at the structure, head, and tail of the data.
CanadianHours <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")  # UTF-8-BOM prevents a mangled first column name
str(CanadianHours)
## 'data.frame': 35 obs. of 2 variables:
## $ Year : int 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
## $ Hours.per.week: num 37.2 37 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 ...
head(CanadianHours)
## Year Hours.per.week
## 1 1966 37.2
## 2 1967 37.0
## 3 1968 37.4
## 4 1969 37.5
## 5 1970 37.7
## 6 1971 37.7
tail(CanadianHours)
## Year Hours.per.week
## 30 1995 35.7
## 31 1996 35.7
## 32 1997 35.5
## 33 1998 35.6
## 34 1999 36.3
## 35 2000 36.5
Then I will create a time series object from the data set.
CWH_ts <- ts(CanadianHours$Hours.per.week, start=c(1966), frequency=1)
autoplot(CWH_ts)
Linear trend model:
CWH_linear <- tslm(CWH_ts ~ trend)
summary(CWH_linear)
##
## Call:
## tslm(formula = CWH_ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
The adjusted R\(^2\) value for the linear trend model is 0.5996.
Linear trend model with seasonality:
Seasonality cannot be modeled using annual (or yearly) data.
Quadratic trend model:
CWH_quad <- tslm(CWH_ts ~ trend + I(trend^2))
summary(CWH_quad)
##
## Call:
## tslm(formula = CWH_ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
The adjusted R\(^2\) value for the quadratic trend model is 0.7465.
Quadratic trend model with seasonality:
Seasonality cannot be modeled using annual (or yearly) data.
Since the quadratic trend model has the higher adjusted R\(^2\) value of 0.7465, this regression-based model explains the most variation and therefore fits the series best out of these two options.
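As a supplementary side-by-side (a sketch, not part of the original output), the forecast package's CV() helper reports information criteria and adjusted R\(^2\) for tslm objects and can confirm this comparison:
# Supplementary sketch: information criteria and adjusted R^2 for both models.
rbind(Linear = CV(CWH_linear), Quadratic = CV(CWH_quad))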
yrange = range(CWH_ts)
plot(c(1966,2000), yrange, type="n", xlab="Year", ylab="Average Annual Number of Weekly Hours Spent by Canadian Manufacturing Workers", bty="l", xaxt="n")
lines(CWH_ts, bty="l")
axis(1, at=seq(1966, 2000, 1), labels=format(seq(1966, 2000, 1)))
axis(2, at=seq(34.5, 38.0, 0.5), labels= format(seq(34.5, 38.0, 0.5)), las=2)
lines(CWH_linear$fitted, col="red")
lines(CWH_quad$fitted, col="blue")
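A legend would make the chart easier to read (a hypothetical addition, not in the original output):
legend("topright", c("Actual", "Linear trend", "Quadratic trend"), col=c("black", "red", "blue"), lty=1, bty="n")  # distinguish the two fitted lines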
First, I will import the data set and look at the structure, head, and tail of the data.
Dept_Sales <- read.csv("DeptStoreSales.csv", stringsAsFactors = FALSE)
str(Dept_Sales)
## 'data.frame': 24 obs. of 2 variables:
## $ Quarter: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Sales : int 50147 49325 57048 76781 48617 50898 58517 77691 50862 53028 ...
head(Dept_Sales)
## Quarter Sales
## 1 1 50147
## 2 2 49325
## 3 3 57048
## 4 4 76781
## 5 5 48617
## 6 6 50898
tail(Dept_Sales)
## Quarter Sales
## 19 19 71486
## 20 20 92183
## 21 21 60800
## 22 22 64900
## 23 23 76997
## 24 24 103337
Then I will create a time series object from the data set.
Dept_ts <- ts(Dept_Sales$Sales, start=1, frequency=4)
plot(Dept_ts, ylim = c(0,120000), ylab = "Quarterly Sales", xlab = "Year", main = "Time Series of Department Store Sales", bty = "l")
autoplot(Dept_ts)
Take a logarithm of the Quarter index
No, you will want to take the logarithm of the y variable, in this case the quarterly sales numbers.
Take a logarithm of sales
Yes, the logarithm of the quarterly sales values should be taken in order to fit an exponential trend model (see the sketch after these options).
Take an exponent of sales
No, you will want to take the logarithm of the y variable, in this case the quarterly sales numbers.
Take an exponent of Quarter index
No, you will want to take the logarithm of the y variable, in this case the quarterly sales numbers.
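As a supplementary sketch of that equivalence (not part of the assignment output), the log transform can be applied explicitly or via the lambda = 0 (Box-Cox) argument of tslm(); both fit a linear model to log(sales) and produce the same coefficients:
# Supplementary sketch: two equivalent ways to set up the exponential trend model.
fit_log    <- tslm(log(Dept_ts) ~ trend + season)          # explicit log transform
fit_lambda <- tslm(Dept_ts ~ trend + season, lambda = 0)   # Box-Cox with lambda = 0
all.equal(unname(coef(fit_log)), unname(coef(fit_lambda))) # coefficients match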
First, I will partition the data so that the last 4 quarters form the validation period and the first 20 quarters form the training period.
validLen <- 4                                           # hold out the last 4 quarters
trainLen <- length(Dept_ts) - validLen                  # 20 quarters for training
DeptTrain <- window(Dept_ts, end=c(1, trainLen))        # quarters 1-20
DeptValid <- window(Dept_ts, start=c(1, trainLen + 1))  # quarters 21-24
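A quick sanity check on the partition (supplementary):
length(DeptTrain)  # 20 quarters in the training period
length(DeptValid)  # 4 quarters in the validation period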
Once the validation and training periods are created, we can fit the exponential trend with seasonality to the training set; viewing the summary afterwards allows us to better understand the fit of the model. Setting lambda = 0 takes the logarithm of the sales data (the dependent variable) and also translates the fitted values back into the original scale.
Dept_Expo <- tslm(DeptTrain ~ trend + season, lambda = 0)
summary(Dept_Expo)
##
## Call:
## tslm(formula = DeptTrain ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
From the \(summary()\) we can see that all the quarters are statistically significant except for \(season2\). In other words, the second quarter of the year is not statistically significantly different from the first quarter of the year for the five years of training data we are looking at.
Q2 average sales are not statistically significantly different from Q1 when looking at the p-values. This means they are approximately equal to average Q1 sales, at least statistically speaking. Looking at the actual values for Q1 and Q2, we can see that Q2 sales differ from Q1 sales, but not by large amounts.
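A supplementary sketch to check this against the data, comparing average sales by quarter across the full series:
# Mean sales per quarter; the Q1 and Q2 averages should be close,
# consistent with the insignificant season2 dummy.
tapply(Dept_Sales$Sales, cycle(Dept_ts), mean)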
validPeriodForecasts <- forecast(Dept_Expo, h=validLen)
validPeriodForecasts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
## 6 Q3 70920.09 67297.09 74738.13 65247.24 77086.15
## 6 Q4 93788.66 88997.41 98837.86 86286.57 101943.01
Using the model developed for this question, we can see that the point-forecasts for Q21 and Q22 (which are the forecasts for Q1 and Q2 here) are $58,793.71 and $60,951.51, respectively.
yrange = range(Dept_ts)
plot(c(1,7), yrange, type="n", xlab="Quarter", ylab="Sales (in thousands)", bty="l", xaxt="n", yaxt="n")
lines(Dept_ts, bty="l")
axis(1, at=seq(1, 7, 1), labels=format(seq(1,7, 1)), las=2)
axis(2, at=seq(0, 110000, 25000), labels=format(seq(0, 110, 25)), las=2)
abline(v=6)
arrows(1,100000,6,100000,code=3,length = 0.1)
text(3, 110000, "Training")
abline(v=7)
arrows(6, 100000, 7, 100000, code=3, length = 0.1)
text(6.5, 110000, "Validation")
lines(Dept_Expo$fitted, col="red")
lines(validPeriodForecasts$mean, col="blue", lty=2)
legend(1,100000, c("Dept Sales", "Exponential Trend + Season", "Forecasts"), lty=c(1,1,2), col=c("black", "red", "blue"), bty="n")
From Figure 4, we can see that the point forecasts are under-forecasting. Next we can look at how accurately our point forecasts predict the validation period; in other words, how well our model performs according to the \(accuracy()\) function.
accuracy(validPeriodForecasts, DeptValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 30.63141 1759.359 1388.794 -0.04013518 2.215018 0.4456262
## Test set 5395.00854 6076.912 5395.009 6.62904618 6.629046 1.7311114
## ACF1 Theil's U
## Training set 0.5540818 NA
## Test set 0.2156416 0.4259668
We can see that the approximately 6.63% MAPE is not as accurate as the double-differencing MAPE of 4.06% or the Holt-Winters' MAPE of 1.67% from Assignment 4.
Re-creating the residuals plot from page 139 in the textbook.
plot(Dept_Expo$residuals, type="o", bty="l", xlab="Year", ylab="Residuals")
To get the residuals back into the original scale (sales dollars), we subtract the fitted values of the exponential trend model from the training set. This displays the same chart visually, but the y-axis is now back in the original scale: sales dollars.
plot(DeptTrain - Dept_Expo$fitted, type="o", bty="l", xlab="Year", ylab="Residuals")
abline(h=0)
checkresiduals(Dept_Expo)
##
## Breusch-Godfrey test for serial correlation of order up to 8
##
## data: Residuals from Linear regression model
## LM test = 10.517, df = 8, p-value = 0.2306
DeptValid - validPeriodForecasts$mean
## Qtr1 Qtr2 Qtr3 Qtr4
## 6 2006.290 3948.490 6076.915 9548.339
plot(DeptValid - validPeriodForecasts$mean, type="o", bty="l")
As mentioned, Figure 4 shows us that the point forecasts (validation period) for all 4 quarters being forecast are under-forecasts. However, when looking at the residuals for the training period, it looks like we are over-forecasting more than under-forecasting; we can see this because there are more negative residuals (11) than positive residuals (9).
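A quick way to verify those residual-sign counts (a supplementary sketch):
# Tally the signs of the training residuals; the counts are the same on the
# log scale and the original scale since the log transform is monotone.
table(sign(residuals(Dept_Expo)))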
Seasonality is not captured well
False. The residuals do not show seasonal patterns, which would indicate that the seasonality that exists is being captured in the model.
The regression model fits the data well
False. Referencing the \(R^2=1\) would indicate that the regression model fits the data perfectly. However, after looking at the point forecasts against the validation period, as well as the residuals and performance metrics (MAPE), it becomes clearer that the regression model does not fit the data perfectly. Additionally, the residuals are not evenly distributed around zero.
The trend in the data is not captured well by the model
True. The residuals seem to have a trend that is not captured in the model.
Fit a quadratic trend model to the residuals (with Quarter and Quarter\(^2\))
No; ways of improving the model fit should be applied to the dependent variable, in this case Sales.
Fit a quadratic trend model to Sales (with Quarter and Quarter\(^2\))
Yes, another trend model should be applied to Sales (the dependent variable) to see if capturing additional underlying trend produces a more accurate forecast model, as sketched below.
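As a sketch of what that follow-up model might look like (not fit in the original assignment), the quadratic term can simply be added to the formula:
# Supplementary sketch: exponential trend model with an added quadratic term.
Dept_Quad <- tslm(DeptTrain ~ trend + I(trend^2) + season, lambda = 0)
summary(Dept_Quad)  # check whether I(trend^2) is statistically significant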
Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.
First, I will import the data set and look at the structure, head, and tail of the data.
Souv_Sales <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)
str(Souv_Sales)
## 'data.frame': 84 obs. of 2 variables:
## $ Date : chr "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
## $ Sales: num 1665 2398 2841 3547 3753 ...
head(Souv_Sales)
## Date Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
tail(Souv_Sales)
## Date Sales
## 79 1-Jul 26155.15
## 80 1-Aug 28586.52
## 81 1-Sep 30505.41
## 82 1-Oct 30821.33
## 83 1-Nov 46634.38
## 84 1-Dec 104660.67
Then I will create a time series object from the data set.
Souv_ts <- ts(Souv_Sales$Sales, start=c(1995,1), frequency=12)
validLen_SS <- 12
TrainLen_SS <- length(Souv_ts) - validLen_SS
Souv_Train <- window(Souv_ts, start=c(1995, 1), end=c(1995, TrainLen_SS))
Souv_Valid <- window(Souv_ts, start=c(1995, TrainLen_SS + 1))
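A quick check that the partition lands where intended (supplementary):
end(Souv_Train)    # c(2000, 12): training ends December 2000
start(Souv_Valid)  # c(2001, 1): validation covers calendar year 2001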
yrange=range(Souv_ts)
Based on Figure 6.14, trend and seasonality should be accounted for in the model. It is hard to say for sure if the series has a quadratic trend or not. If we include the quadratic trend predictor (\(t^2\)) as well as a trend (\(t\)) predictor and the 11 predictors for monthly seasonality (12-1), there will be 13 predictors in the model. However, without the quadratic trend predictor (\(t^2\)) there will be 12 predictors.
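As a quick check of those predictor counts (a supplementary sketch, not part of the assignment output), the design-matrix dimensions confirm the arithmetic:
# trend + season: intercept + trend + 11 monthly dummies = 13 columns (12 predictors).
dim(model.matrix(tslm(Souv_Train ~ trend + season)))
# Adding I(trend^2) gives one more column (13 predictors).
dim(model.matrix(tslm(Souv_Train ~ trend + I(trend^2) + season)))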
Model_A <- tslm(Souv_Train ~ trend + season)
summary(Model_A)
##
## Call:
## tslm(formula = Souv_Train ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
For Model A, the month with the highest sales is \(season12\), which makes sense because that represents December and Christmas falls in that month. Additionally, summer begins in December in Australia, so a souvenir shop in a beach resort town would see its highest sales during this time period.
The trend coefficient of Model A is statistically significant (according to its p-value), which means there is a linear trend in the series. The linear trend coefficient of $245.36 represents the slope of the linear trend, meaning that sales increase by $245.36 on average for each successive time period over the series, holding all other variables constant.
Model_B <- tslm(log(Souv_Train) ~ trend + season)
summary(Model_B)
##
## Call:
## tslm(formula = log(Souv_Train) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales with an exponential trend.
The estimated trend coefficient of Model B represents the slope of the trend in terms of percentage increase; in this case it means that souvenir sales increase by about 2.1% per month, on average.
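A quick supplementary check of that interpretation (a sketch): the exact multiplicative effect of a log-scale coefficient is exp(beta) - 1.
exp(coef(Model_B)["trend"]) - 1  # roughly 0.0213, i.e. about a 2.1% monthly increase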
The souvenir sales data set has 84 observations (monthly data points) covering January 1995 to December 2001, so to forecast February 2002 we will need to forecast observation number 86. This task asks us to forecast beyond the validation period. Instead of recombining the training set and validation set and re-fitting the model, I will follow this assignment's class notes and use the better of the two models from the training period to forecast both January and February of 2002, which is the same as forecasting time periods 85 and 86.
Jan_Forecast <- Model_B$coefficients["(Intercept)"] + Model_B$coefficients["trend"]*85
Feb_Forecast <- Model_B$coefficients["(Intercept)"] + Model_B$coefficients["trend"]*86 + Model_B$coefficients["season2"]
exp(Jan_Forecast)
## (Intercept)
## 12601.02
exp(Feb_Forecast)
## (Intercept)
## 17062.99
The forecasts for January and February 2002 sales are $12,601.02 and $17,062.99, respectively.
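An equivalent route (a supplementary sketch, using the same fitted Model B): let forecast() extrapolate 14 periods past the end of the training set, so the last two point forecasts land on January and February 2002, then back-transform from logs.
# 12 validation months of 2001 plus Jan/Feb 2002.
fc_extended <- forecast(Model_B, h = validLen_SS + 2)
exp(tail(fc_extended$mean, 2))  # should match the hand-computed forecasts above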
Model_A_Forecast <- forecast(Model_A, h=validLen_SS)
accuracy(Model_A_Forecast$mean, Souv_Valid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
The MAPE for Model A is approximately 26.67%.
Model_B_Forecast <- forecast(Model_B, h=validLen_SS)
accuracy(exp(Model_B_Forecast$mean), Souv_Valid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
The MAPE for Model B is approximately 15.52%.
plot(c(1995,2002), c(0,120000), type="n", xlab="Year", ylab="Souvenir Sales", bty="l", xaxt="n", yaxt="n")
lines(Souv_ts, bty="l", lwd = 2)
axis(1, at=seq(1995,2002,2), labels=format(seq(1995,2002,2)))
axis(2, at=seq(0,120000, 20000), labels=format(seq(0,120000,20000)), las=2)
abline(v=2001)
arrows(1995, 120000, 2001, 120000, code=3, length=0.1)
text(1998, 115000, "Training")
abline(v=2002)
arrows(2001, 120000, 2002, 120000, code=3, length=0.1)
text(2001.5, 115000, "Validation")
lines(Model_A_Forecast$fitted, col="blue")
lines(Model_A_Forecast$mean, col="blue")
lines(exp(Model_B_Forecast$fitted),col="red")
lines(exp(Model_B_Forecast$mean), col = "red")
legend(1995, 110000, c("Souvenir Sales", "Model A", "Model B"), col=c("black", "blue", "red"), bty="n", lwd=c(2,1,1))
A few reasons why Model B is preferable for forecasting include:
The accuracy of Model B in predicting the validation set is preferable according to the MAPE value: 15.52% compared to Model A's MAPE of 26.67%.
Figure 7 shows us that, visually, Model B has better predictive power than Model A. Similarly, when we were first deciding on how many predictors should go into the model, the plot was not clear on whether a linear trend or an exponential trend was present; after this analysis we can see that Model B (where an exponential trend exists) is the better-suited model.
If the goal were to understand the different components of sales in the souvenir shop between 1995 and 2001, then forecasting would not be necessary. Therefore, all the data would be better used on a combined basis (not broken into validation and training sets). I would look at the combined data on a monthly basis to see how sales changed over months and years. If more data were available for different variables, like day of the week, or for different categories within the store, like product type (i.e., sunglasses vs. keychains), I would be interested in looking at sales levels across those categories.