Home Depot Sales

The Home Depot chain of home improvement stores grew in the 1980s and 1990s faster than any other retailer in history. By 2005, it was the second largest retailer in the United States. But its extraordinary record of growth was slowed by the financial crisis of 2008. How do different methods of modeling time series compare for understanding these data?

Let’s first read the dataset into R and inspect its structure.
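Assuming the data are saved in a CSV file (the file name below is hypothetical), a minimal import would be:

df <- read.csv('HomeDepotSales.csv')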

str(df)
## 'data.frame':    72 obs. of  6 variables:
##  $ Sales: num  3.57 4.15 4 3.75 4.36 ...
##  $ year : num  1995 1995 1996 1996 1996 ...
##  $ Q1   : int  1 0 0 0 1 0 0 0 1 0 ...
##  $ Q2   : int  0 1 0 0 0 1 0 0 0 1 ...
##  $ Q3   : int  0 0 1 0 0 0 1 0 0 0 ...
##  $ Q4   : int  0 0 0 1 0 0 0 1 0 0 ...

As we can see, there are 72 observations of 6 variables. Sales and year are numeric variables, while Q1, Q2, Q3, and Q4 are indicator (dummy) variables for the four quarters.

Let’s plot the time series of Home Depot quarterly sales:

plot(df$year, df$Sales, xlab = 'Year', ylab = 'Sales', main = 'Home Depot Quarterly Sales 1995 - 2012')
lines(df$year, df$Sales)

As we can see, there was a consistent increasing trend until the end of 2006. After that, sales fell sharply, although they appear to have been recovering since. Throughout this period, there are also fluctuations around the trend that appear to be seasonal because they repeat every four quarters.

Smoothing Methods

First, let’s try a simple moving average (SMA). For data with a strong seasonal component, it’s a good idea to choose the moving-average length based on the period. The period of Home Depot sales is 4. We therefore apply SMAs of length 2 and 4.
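At each time \(t\), an SMA of length \(n\) is simply the average of the \(n\) most recent observations:

\[\text{SMA}_t = \frac{1}{n}\sum_{i=0}^{n-1} y_{t-i}\]

We use the SMA() function in the TTR package to compute it: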

library(TTR)
sma1 <- SMA(df$Sales, n = 2)
sma2 <- SMA(df$Sales, n = 4)
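As a quick cross-check (a sketch, not part of the original analysis), the length-4 trailing average can also be computed with base R’s stats::filter(); sides = 1 uses only current and past values, and the first n - 1 entries come out NA, matching SMA():

sma2.check <- stats::filter(df$Sales, rep(1/4, 4), sides = 1)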

sma1 and sma2 contain the smoothed series, which are plotted below:

par(mfrow = c(1, 2))
plot(df$year, df$Sales, xlab = 'Year', ylab = 'Sales', main = 'SMA of Length 2')
lines(df$year, df$Sales)
lines(df$year, sma1, col = 'red')
plot(df$year, df$Sales, xlab = 'Year', ylab = 'Sales', main = 'SMA of Length 4')
lines(df$year, df$Sales)
lines(df$year, sma2, col = 'red')

As expected, the SMA of longer length gives a smoother series. The SMA of length 4 smooths out most of the seasonal effects and has trouble tracking the sudden change in 2007, but it provides a good description of the trend. The SMA of length 2 is wigglier, capturing the seasonal effects and the sudden change better.

Second, let’s try an exponential moving average (EMA). We need to choose a smoothing weight. We will use \(\alpha\) = 0.5, which weights the current data value equally with all of the past combined. For comparison, we also try \(\alpha\) = 0.2, which makes the smoothing depend heavily on past data. The initial value is chosen as the first observed data point.
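Starting from \(s_1 = y_1\), the EMA follows the standard recursion

\[s_t = \alpha y_t + (1 - \alpha)s_{t-1}\]

We use the EMA() function in the TTR package, where ratio supplies \(\alpha\) and n = 1 starts the recursion at the first observation: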

ema1 <- EMA(df$Sales, ratio = 0.5, n = 1)
ema2 <- EMA(df$Sales, ratio = 0.2, n = 1)
par(mfrow = c(1, 2))
plot(df$year, df$Sales, xlab = 'Year', ylab = 'Sales', main = 'EMA (alpha = 0.5)')
lines(df$year, df$Sales)
lines(df$year, ema1, col = 'red')
plot(df$year, df$Sales, xlab = 'Year', ylab = 'Sales', main = 'EMA (alpha = 0.2)')
lines(df$year, df$Sales)
lines(df$year, ema2, col = 'red')

As we can see, the EMA with a larger \(\alpha\) produces a less smooth series. The EMA with \(\alpha\) = 0.5 follows the seasonal pattern more closely and tracks the sudden change better. The EMA with \(\alpha\) = 0.2 fails badly, because it is too smooth to capture the seasonal effects and responds too slowly to the trend (underestimating before 2007 and overestimating after).

Autoregressive Model

Let’s fit an autoregressive model. Because we know that the seasonal pattern is 4 quarters long, we use 4 lagged variables as the predictors. The autocorrelations at those lags are given by

acf(df$Sales, lag.max = 4, plot = FALSE)
## 
## Autocorrelations of series 'df$Sales', by lag
## 
##     0     1     2     3     4 
## 1.000 0.912 0.838 0.831 0.838

It’s interesting to see that the autocorrelation at lag 4 is 0.838, as large as at any other lag except lag 1, even though it is the most distant lag considered. This is further evidence of the seasonal effects.
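In general, an AR(4) model with an intercept has the form

\[y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \phi_4 y_{t-4} + \varepsilon_t\]

so each quarter is predicted from the previous four quarters.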

Now let’s use the ar() function to fit the autoregressive model:

ar1 <- ar(df$Sales, aic = FALSE, order.max = 4, demean = FALSE, intercept = TRUE, method = 'ols')
ar1
## 
## Call:
## ar(x = df$Sales, aic = FALSE, order.max = 4, method = "ols",     demean = FALSE, intercept = TRUE)
## 
## Coefficients:
##       1        2        3        4  
##  0.4695  -0.2551   0.2233   0.4871  
## 
## Intercept: 1.694 (0.4544) 
## 
## Order selected 4  sigma^2 estimated as  1.612

The fitted AR(4) model is:

\[\hat{y}_{t} = 1.694 + 0.4695\,y_{t-1} - 0.2551\,y_{t-2} + 0.2233\,y_{t-3} + 0.4871\,y_{t-4}\]

Unfortunately, ar1 does not provide the fitted values directly. But we can recover them from the residuals \(e\) using \(\hat{y} = y - e\) (the first four residuals, and hence the first four fitted values, are NA because the model needs four lags):

fitted.ar1 <- df$Sales - ar1$resid

Let’s plot the fitted series and the residuals:

par(mfrow = c(1, 2))
plot(df$year, df$Sales, xlab = 'Year', ylab = 'Sales', main = 'AR(4)')
lines(df$year, df$Sales)
lines(df$year, fitted.ar1, col = 'red')
plot(df$year, ar1$resid, xlab = 'Year', ylab = 'Residuals')
abline(0, 0)

As we can see, the AR(4) model follows the seasonal pattern well, but seems to have great difficulty capturing the sudden change in 2007. The residual plot shows some disturbance around the time of the financial crisis. Because sales at that time stopped resembling previous behavior with respect to growth, the 4 lagged variables are less successful predictors.

Multiple Regression-based Model

Let’s fit a multiple regression model to Sales using year and the dummy variables: year models the trend, and the dummy variables model the seasonal component. Since the linear trend holds only until the end of 2006, we fit the model to the part of the series before 2007. Note that we use the last quarter, Q4, as the baseline, so only the first three dummy variables enter the model.
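In other words, the model being fit is

\[Sales_t = \beta_0 + \beta_1\,year_t + \beta_2\,Q1_t + \beta_3\,Q2_t + \beta_4\,Q3_t + \varepsilon_t\]

with the quarterly effects measured relative to Q4.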

df.new <- df[df$year<2007,]
m <- lm(Sales ~ year + Q1 + Q2 + Q3, data = df.new)
summary(m)
## 
## Call:
## lm(formula = Sales ~ year + Q1 + Q2 + Q3, data = df.new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9176 -0.6004 -0.1046  0.3649  4.7269 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.515e+03  1.084e+02 -32.414  < 2e-16 ***
## year         1.762e+00  5.414e-02  32.554  < 2e-16 ***
## Q1           5.885e-01  4.967e-01   1.185 0.242652    
## Q2           1.755e+00  4.921e-01   3.567 0.000901 ***
## Q3           4.554e-01  4.878e-01   0.934 0.355686    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.178 on 43 degrees of freedom
## Multiple R-squared:  0.966,  Adjusted R-squared:  0.9629 
## F-statistic: 305.7 on 4 and 43 DF,  p-value: < 2.2e-16

The coefficient of year is positive and highly significant, confirming a positive linear trend in Sales. The significant coefficient of Q2 indicates that second-quarter sales are on average significantly higher than fourth-quarter sales (the baseline). Overall the model fits well, with a high \(R^2\) and a highly significant F-test.
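Reading the estimates off the output, the fitted pre-2007 model is

\[\widehat{Sales} = -3515 + 1.762\,year + 0.5885\,Q1 + 1.755\,Q2 + 0.4554\,Q3\]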

Comparison

So far, we have applied several different methods to the Home Depot series. Suppose we want to forecast the sales for the first quarter of 2013. (Note: the original series ends at the fourth quarter of 2012.) For the smoothing methods, we use the last smoothed value in the series as the forecast:

(yhat.sma1 <- sma1[length(sma1)])
## [1] 18.627
(yhat.sma2 <- sma2[length(sma2)])
## [1] 18.908
(yhat.ema1 <- ema1[length(ema1)])
## [1] 18.89295
(yhat.ema2 <- ema2[length(ema2)])
## [1] 18.38313

In the autoregressive model, we may use the predict() function to forecast:

(yhat.ar1 <- predict(ar1, n.ahead = 1, se.fit = FALSE))
## Time Series:
## Start = 73 
## End = 73 
## Frequency = 1 
## [1] 19.31656
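As a sanity check (a sketch, assuming the components that ar() returns with method = 'ols': the lag coefficients in ar1$ar and the intercept in ar1$x.intercept), the one-step forecast can be reproduced by hand:

n <- length(df$Sales)
# forecast = intercept + phi1*y[n] + phi2*y[n-1] + phi3*y[n-2] + phi4*y[n-3]
ar1$x.intercept + sum(ar1$ar * df$Sales[n:(n-3)])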

In the multiple regression, we also use the predict() function to forecast, supplying the new predictor values:

data.new <- data.frame(year = 2013, Q1 = 1, Q2 = 0, Q3 = 0)
yhat <- predict(m, newdata = data.new)

We know that the actual sales in the first quarter of 2013 were \(19.124\)B. Which model provides the best prediction? We can compute the absolute percentage error (APE) of each forecast.
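The APE of a forecast \(\hat{y}\) against the actual value \(y\) is

\[\text{APE} = \frac{|y - \hat{y}|}{|y|} \times 100\%\]

which is exactly what the code below computes: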

y.true <- 19.124
abs(y.true - yhat.sma1)/abs(y.true)*100
## [1] 2.598829
abs(y.true - yhat.sma2)/abs(y.true)*100
## [1] 1.129471
abs(y.true - yhat.ema1)/abs(y.true)*100
## [1] 1.208153
abs(y.true - yhat.ema2)/abs(y.true)*100
## [1] 3.87401
abs(y.true - yhat.ar1)/abs(y.true)*100
## Time Series:
## Start = 73 
## End = 73 
## Frequency = 1 
## [1] 1.006905
abs(y.true - yhat)/abs(y.true)*100
##        1 
## 76.65737

The AR(4) model offers the best prediction, yielding the smallest APE (1.01%). The multiple regression model gives the worst forecast: it was fit only to the pre-2007 data and extrapolates that linear trend, but the financial crisis changed the trend of the time series after 2007.