Motivation

Over the past few months, I have noticed a remarkable number of cars driving around Denver without license plates. Seeing so many new cars on the road raised several questions, the last of which is what actually drives the volume of new auto sales.

This project will look at that last question using publicly available national data and build forecasting models to examine this dynamic.

Data

Variables

Though there are many other variables that should probably be included in the analysis below (e.g., population growth, gas prices, etc.), limitations in data availability and time mean we will include only lags of three variables. These three variables are:

  • Sales of automobiles (in millions)
  • Real GDP (Stock-Watson Index)
  • Unemployment rate

As with many time series, car sales may be expected to be persistent; in other words, last month’s car sales volume will be ‘similar’ to this month’s sales figure. Hence, prior-month sales will be used for training the forecasting models. Real GDP is included to capture economic growth. Finally, the unemployment rate measures another aspect of people’s economic well-being and will also be included as a predictor.

While these last two factors may overlap to some extent, they are in fact distinct from each other and, as the jobless recovery following the Great Recession demonstrates, they do not necessarily track one another.

Since we will be forecasting auto sales, we will include lags of each of the above variables as predictors in our models. Specifically, the 1-, 2-, and 3-month prior values of each variable are used.
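
As a preview, the full specification implied by these lagged predictors is sketched below. The name sales.spec is purely illustrative; the lagged columns themselves are constructed in the Data munging section, and the models below use the shorthand SalesM ~ . instead.

# Illustrative model specification; the lagged columns are built during munging
sales.spec <- SalesM ~ L1SalesM + L2SalesM + L3SalesM +
  L1RGDP + L2RGDP + L3RGDP +
  L1Unemployment + L2Unemployment + L3Unemployment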

Data acquisition

Loading all required packages.

require(httr)
require(XML)
require(xlsx)
require(stringr)
require(tidyr)
require(dplyr)
require(lubridate)
require(zoo)
require(ggplot2)
require(caret)
require(randomForest)
require(doParallel)

Auto sales data

Let’s get the auto sales data first. Since the parameters of the data are set by an HTML form, we’ll need to check the page source to find the names of the form fields and their possible values. The parameters and values we want can then be stored in a list and passed as the body of the POST request.

fred.url <- 'http://research.stlouisfed.org/fred2/series/TOTALSA/downloaddata'
fred.query <- list('form[native_frequency]' = 'Monthly',
                   'form[units]' = 'lin',
                   'form[frequency]' = 'Monthly',
                   'form[obs_start_date]' = '1976-01-01',
                   'form[obs_end_date]' = '2014-11-01',
                   'form[file_format]' = 'txt',
                   'form[aggregation]' = 'Average')
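
If you would rather discover these field names programmatically than read the page source by hand, a quick sketch along these lines works; it simply lists the name attributes of the form’s input and select elements (dl.page is a throwaway name):

# Sketch: list the form field names on the FRED download page
dl.page <- htmlParse(content(GET(fred.url), 'text'), asText = TRUE)
unlist(xpathApply(dl.page, '//form//input | //form//select',
                  xmlGetAttr, 'name'))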

To avoid saving a file to disk, the content of the HTTP response is passed to read.table as a text connection, allowing read.table to create a data frame directly from the text. As a final step, we clean up any variables that are not relevant for the subsequent processing.

fred.response <- POST(fred.url, body = fred.query)
fred.con <- textConnection(content(fred.response, 'text'))
fred.data <- read.table(fred.con, header = TRUE, sep = "", skip = 10, 
                        stringsAsFactors = FALSE)

rm(list = c('fred.query', 'fred.response', 'fred.con'))

Real GDP data

Real GDP (RGDP) data is hosted as an Excel file, which complicates things a bit. To get the data in, we download the file to disk (the user will be prompted to choose where to save it; note that choose.dir is Windows-only) and use the xlsx R package to read it into R.

nber.url <- 'http://www.nber.org/cycles/BCDCFiguresData100920_ver5.xlsx'
nber.dir <- choose.dir(caption = 'Select directory in which to save NBER data.')
nber.loc <- str_c(nber.dir, '\\NBER_data.xlsx')
download.file(nber.url, nber.loc, mode = 'wb')
nber.data <- read.xlsx(nber.loc, sheetIndex = 2, startRow = 2)

rm(list = c('nber.dir', 'nber.loc'))

Unemployment rate data

Acquiring the unemployment data from the Bureau of Labor Statistics website will be similar to the steps we took to get the sales data.

bls.url <- 'http://data.bls.gov/timeseries/LNS14000000/pdq/SurveyOutputServlet'
bls.query <- list(request_action = 'get_data',
                  reformat = 'true',
                  from_results_page = 'false',
                  years_option = 'specific_years',
                  delimiter = 'comma',
                  output_type = 'default',
                  output_view = 'data',
                  to_year = 2014,
                  from_year = 1976,
                  output_format = 'text',
                  original_output_type = 'default',
                  annualAveragesRequested = 'false',
                  include_graphs = 'false'
                  )
bls.response <- GET(bls.url, query = bls.query, encode = 'form')

This time, however, the data is served within the response HTML and will need to be extracted using the XML package. Once again, the extracted data will be passed as a text connection, this time to read.csv, to create the data frame.

bls.html <- htmlTreeParse(readBin(bls.response$content,
                                  'text'), 
                          useInternalNodes = TRUE)
bls.raw <- xpathApply(bls.html, "//div/pre", xmlValue)
bls.con <- textConnection(str_replace_all(as.character(bls.raw),
                          'Â', ''))
bls.data <- read.csv(bls.con, skip = 11)
bls.data <- bls.data[,c(-length(names(bls.data)))]  # get rid of 'X' col

rm(list = c('bls.query', 'bls.response', 'bls.html', 'bls.raw', 'bls.con'))

Data munging

Together, these sources only yield a complete series for the date range 1976-01-01 through 2010-06-01, inclusive. Before merging the three datasets together, we’ll truncate them to ensure that they include only data in that range.

# FRED
fred.data <- fred.data[fred.data$DATE <= '2010-06-01', ]
names(fred.data) <- c('Month', 'SalesM')
fred.data$Month <- ymd(fred.data$Month)  # Month to posix

# NBER
nber.data <- nber.data[nber.data$Month.Year >= '1976-01-01' &
                         nber.data$Month.Year <= '2010-06-01',
                       c(1, 3)]
names(nber.data) <- c('Month', 'RGDP')
nber.data$Month <- ymd(nber.data$Month)  # Month to posix

The BLS data will require a bit more work because of the ‘wide’ format of the data, with each month included as a separate column:

head(bls.data)
##   Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 1976 7.9 7.7 7.6 7.7 7.4 7.6 7.8 7.8 7.6 7.7 7.8 7.8
## 2 1977 7.5 7.6 7.4 7.2 7.0 7.2 6.9 7.0 6.8 6.8 6.8 6.4
## 3 1978 6.4 6.3 6.3 6.1 6.0 5.9 6.2 5.9 6.0 5.8 5.9 6.0
## 4 1979 5.9 5.9 5.8 5.8 5.6 5.7 5.7 6.0 5.9 6.0 5.9 6.0
## 5 1980 6.3 6.3 6.3 6.9 7.5 7.6 7.8 7.7 7.5 7.5 7.5 7.2
## 6 1981 7.5 7.4 7.4 7.2 7.5 7.5 7.2 7.4 7.6 7.9 8.3 8.5

Before limiting the date range, we will need to build a single date column in YYYY-MM-DD format. Fortunately, the tidyr and dplyr packages make this a painless process.

# BLS
# Create single Month column
names(bls.data) <- c('Year', '01', '02', '03', '04', '05', '06', '07', '08',
                     '09', '10', '11', '12')
bls.data <- bls.data %>% gather(MonthNum, Unemployment, -Year) %>%
  unite_('Month', c('Year', 'MonthNum'), sep = '-') %>% arrange(Month)
bls.data$Month <- str_c(bls.data$Month, '-01')

# limit date range
bls.data <- bls.data[bls.data$Month <= '2010-06-01', ]
bls.data$Month <- ymd(bls.data$Month)  # Month to posix

Finally, we merge the three datasets together using the Reduce function.

# Merge data into single dataset
data <- Reduce(function(...) merge(..., all = TRUE), 
               list(fred.data, nber.data, bls.data))

rm(bls.data, nber.data, fred.data)
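
For readers less familiar with Reduce, the call above simply folds merge over the list of data frames. The nested equivalent is shown here purely for illustration (it is not run, since the individual data frames have just been removed):

# Illustrative equivalent of the Reduce call above (not run)
# data <- merge(merge(fred.data, nber.data, all = TRUE), bls.data, all = TRUE)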

Now the data look pretty good:

head(data)
##        Month SalesM RGDP Unemployment
## 1 1976-01-01   12.8 5074          7.9
## 2 1976-02-01   13.3 5111          7.7
## 3 1976-03-01   13.4 5088          7.6
## 4 1976-04-01   13.2 5099          7.7
## 5 1976-05-01   12.9 5136          7.4
## 6 1976-06-01   13.0 5152          7.6
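
As a quick aside, the persistence assumed in the Data section is easy to eyeball with the autocorrelation function of the sales series (an optional check, not required for anything below):

# Optional check: strong positive autocorrelation at short lags supports
# using lagged sales as predictors
acf(data$SalesM, lag.max = 24, main = 'ACF of monthly auto sales')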

In order to build a forecasting model, we will need lagged variants of the existing set of variables. As we decided in the Data section, we will need three lags of each variable for the modeling below. The zoo package makes working with leads and lags straightforward.

# create lags for each variable
data.ts <- read.zoo(data)
for (lag in 1:3) {
  data[, c(str_c('L', lag, 'SalesM'),
           str_c('L', lag, 'RGDP'), 
           str_c('L', lag, 'Unemployment'))
       ] <- stats::lag(data.ts, -lag, na.pad = TRUE)  # avoid dplyr::lag
}

rm(data.ts)

# remove rows with NA
data <- na.omit(data)

rownames(data) <- data$Month
data <- data[, -c(1, 3, 4)]  # drop Month (now in rownames) and contemporaneous RGDP/Unemployment

Our fully munged and lagged data look as follows:

head(data)
##            SalesM L1SalesM L1RGDP L1Unemployment L2SalesM L2RGDP
## 1976-04-01   13.2     13.4   5088            7.6     13.3   5111
## 1976-05-01   12.9     13.2   5099            7.7     13.4   5088
## 1976-06-01   13.0     12.9   5136            7.4     13.2   5099
## 1976-07-01   13.4     13.0   5152            7.6     12.9   5136
## 1976-08-01   13.0     13.4   5155            7.8     13.0   5152
## 1976-09-01   13.6     13.0   5133            7.8     13.4   5155
##            L2Unemployment L3SalesM L3RGDP L3Unemployment
## 1976-04-01            7.7     12.8   5074            7.9
## 1976-05-01            7.6     13.3   5111            7.7
## 1976-06-01            7.7     13.4   5088            7.6
## 1976-07-01            7.4     13.2   5099            7.7
## 1976-08-01            7.6     12.9   5136            7.4
## 1976-09-01            7.8     13.0   5152            7.6

Analysis

We will proceed to train and evaluate three forecasting models and then return to the question of which factor is most important in predicting future auto sales.

Before building the forecasting models, we need to split the data into training and test samples. The training sample will be used for model fitting and validation. We can then use the test sample to get a sense of how the model would perform on data it has not seen before. Because this is a time series, the split is chronological: we will set the last 20% of the observations aside as the test set and use the first 80% to train the models.

n <- dim(data)[1]
in.test <- seq(n - (n %/% 10 * 2), n)  # integer division
test <- data[in.test, ]
train <- data[-in.test, ]
rm(list = c('n', 'in.test'))
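
A quick optional check confirms that the split is chronological and roughly 80/20; since the row names are ISO-formatted dates, range() on them gives the date span of each sample:

# Optional sanity check: sizes and date spans of the two samples
dim(train); dim(test)
range(rownames(train)); range(rownames(test))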

Linear model

We are now ready to train the first model, a linear regression model. This model was chosen as a sensible starting point due to its simplicity and parsimony. It should yield a good first order approximation of the relationship among the included variables, a benchmark we can use when looking at the more complex random forest models below.

linear <- lm(SalesM ~ ., data = train)

To gauge the performance of this model, we will use two helper functions to calculate the mean squared error (MSE) and the 95% confidence interval (i.e., estimated value +/- 1.96 * standard error).

MSE <- function(actual, pred) {
  error <- actual - pred
  return(sum(error**2) / length(error))
}

SE_bounds <- function(pred, se) {
  return(list(upper = pred + 1.96 * se, lower = pred - 1.96 * se))
}
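
A toy example of the two helpers on made-up numbers:

MSE(c(1, 2, 3), c(1.1, 1.9, 3.2))  # 0.02
SE_bounds(10, 0.5)                  # upper = 10.98, lower = 9.02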

Linear model performance

Using the test data, we can assess how the model might perform on ‘new’ data.

linfit <- predict(linear, newdata = test, se.fit = TRUE)
lm.pred <- data.frame(date = rownames(data), actual = data$SalesM, 
                      predicted = c(rep(NA, dim(data)[1] - length(linfit$fit)), 
                                    linfit$fit),
                      se_l = c(rep(NA, dim(data)[1] - length(linfit$fit)),
                               SE_bounds(linfit$fit, linfit$se.fit)$lower),
                      se_u = c(rep(NA, dim(data)[1] - length(linfit$fit)),
                               SE_bounds(linfit$fit, linfit$se.fit)$upper))

tick.dates <- c('1976-04-01', '1980-01-01', '1985-01-01', '1990-01-01',
                '1995-01-01', '2000-01-01', '2005-01-01', '2010-01-01')
ggplot(lm.pred, aes(x = date, y = actual)) + geom_point(alpha = .7) + 
  geom_point(aes(y = predicted), color = 'red', alpha = .7) +
  scale_x_discrete(breaks = tick.dates) + 
  geom_smooth(aes(y = predicted, ymin = se_l, ymax = se_u, group = 1),
              color = 'red', linetype = 0, stat = 'identity') +
  labs(title = 'Linear model actual (black) and predicted (red) sales', x = 'Month', 
       y = 'Auto Sales (m)')

[Figure: linear model actual (black) and predicted (red) sales]

print(paste('Linear Model MSE:', round(MSE(test$SalesM, linfit$fit), 2)))
## [1] "Linear Model MSE: 1.31"

We see from the above plot that the model appears to fit the test data rather well. The MSE figure will help us below to make comparisons in predictive performance across different models.

Random forests

Before fitting a random forest model to the training data, we set up a cluster of worker processes to accelerate model fitting by taking advantage of the multicore processor on our machine. In my case, we have four processing cores. Due to an apparent bug in the caret package, we will need to take an extra step (loading foreach on each worker) to set everything up.

cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)
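
The hard-coded 4 should match the number of cores on your machine; if you would rather not hardcode it, the cluster size can be taken from parallel::detectCores() instead:

# Alternative: size the cluster from the machine's core count
# cl <- makePSOCKcluster(parallel::detectCores())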

To fit the random forest model, we will use 10-fold cross validation on the training set to fit and tune the model. Once a final model has been selected, we will again use the test sample to evaluate its out-of-sample performance.

randForest <- train(SalesM ~ ., data = train, method = 'rf',
                    trControl = trainControl(method = 'cv', number = 10),
                    prox = TRUE, allowParallel = TRUE, importance = TRUE)
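
Once training finishes, the cross-validation results and the value of mtry that caret settled on can be inspected (output omitted here):

# Inspect the resampling results and the selected tuning parameter
print(randForest)
randForest$bestTune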

Random forest performance

Let’s see how the model performs on the test set.

rf.fit <- predict(randForest, test)

rf.pred <- data.frame(Month = rownames(data), actual = data$SalesM,
                      predicted = c(rep(NA, dim(data)[1] - length(rf.fit)), 
                                        rf.fit))
ggplot(rf.pred, aes(x = Month, y = actual)) + geom_point(alpha = .7) + 
  geom_point(aes(y = predicted), color = 'red', alpha = .7) +
  scale_x_discrete(breaks = tick.dates) + 
  labs(title = 'Cross-validated random forest actual (black) and predicted (red) sales', 
       x = 'Month', y = 'Auto Sales (m)')

[Figure: cross-validated random forest actual (black) and predicted (red) sales]

print(paste('Random Forest MSE:', round(MSE(test$SalesM, rf.fit), 2)))
## [1] "Random Forest MSE: 3.59"

Surprisingly, the linear model appears to predict more accurately! Visual inspection confirms what the higher MSE tells us: the random forest model’s predictions do not track the realized sales values as well as the linear model’s.

Random forests with time slices

There is some evidence that standard cross validation, which randomly partitions the data into training and test sets, does not work as well for sequential data or data with a time component. Here, we will apply a different splitting method, essentially splitting the data on a rolling basis and using the resulting mini datasets to train and calibrate the model. This new approach is reflected in the trControl = trainControl(method = 'timeslice', ...) argument passed to the train method. (Since we will not be fitting any additional models, we can also shut down the cluster of worker processes once training completes.)
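
To make the rolling scheme concrete, here is a small sketch of how caret’s createTimeSlices partitions an ordered index (toy numbers; the actual model below uses initialWindow = 50 and horizon = 15):

# Toy illustration of rolling-origin slices on the indices 1:20
slices <- createTimeSlices(1:20, initialWindow = 5, horizon = 3,
                           fixedWindow = TRUE)
slices$train[[1]]  # 1 2 3 4 5
slices$test[[1]]   # 6 7 8
slices$train[[2]]  # 2 3 4 5 6
slices$test[[2]]   # 7 8 9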

randForest2 <- train(SalesM ~ ., data = train, method = 'rf',
                 trControl = trainControl(method = 'timeslice',
                                          initialWindow = 50,
                                          horizon = 15,
                                          fixedWindow = TRUE),
                 prox = TRUE, allowParallel = TRUE, importance = TRUE)
stopCluster(cl)

Random forest with time slices performance

Hopefully, the new data splits will improve model performance over the vanilla cross-validated random forest model.

rf.fit2 <- predict(randForest2, test)

rf.pred2 <- data.frame(Month = rownames(data), actual = data$SalesM,
                      predicted = c(rep(NA, dim(data)[1] - length(rf.fit2)), 
                                        rf.fit2))
ggplot(rf.pred2, aes(x = Month, y = actual)) + geom_point(alpha = .7) + 
  geom_point(aes(y = predicted), color = 'red', alpha = .7) +
  scale_x_discrete(breaks = tick.dates) + 
  labs(title = 'Time-sliced random forest actual (black) and predicted (red) sales',
       x = 'Month', y = 'Auto Sales (m)')

[Figure: time-sliced random forest actual (black) and predicted (red) sales]

print(paste('Random Forest with time slicing MSE:',
            round(MSE(test$SalesM, rf.fit2), 2)))
## [1] "Random Forest with time slicing MSE: 2.28"

Both the MSE and the chart show a clear improvement in out-of-sample performance. However, the linear model still outperforms the much more computationally intensive random forest model.

Driver of auto sales

So which of these variables seems to be most influential for predicting auto sales? To answer this question, we will look at variable importance statistics for each of the models above.

In the case of the linear model, we will need to center and scale the variables and fit the model again to ensure that the variables are of comparable units (or rather, unit-less). This way, we can use the size of a coefficient as an indicator of how important a variable is for predicting sales, provided that we are confident in the accuracy of the estimate (i.e., taking into account the standard errors).

zscore <- function(num.vect) {
  return((num.vect - mean(num.vect)) / sd(num.vect))
}
linear.scaled <- lm(SalesM ~ ., data = as.data.frame(apply(train, 2, zscore)))

# chart of coefficients (with SE)
lin.coef <- data.frame(summary(linear.scaled)$coefficients)[-1, ]
colnames(lin.coef) <- c('Est', 'SE', 't', 'p-val')
lin.coef$Est <- abs(lin.coef$Est)
new.order <- c('L1SalesM', 'L2SalesM', 'L3SalesM', 
               'L1RGDP', 'L2RGDP', 'L3RGDP', 
               'L1Unemployment', 'L2Unemployment', 'L3Unemployment')
ggplot(lin.coef, aes(x = rownames(lin.coef), y = Est)) + 
  geom_pointrange(stat='identity', 
                  aes(ymax = SE_bounds(lin.coef$Est, lin.coef$SE)[[1]], 
                      ymin = SE_bounds(lin.coef$Est, lin.coef$SE)[[2]])) +
  labs(title = 'Linear model coefficients',
       x = 'Variable', y = 'Magnitude (abs. value) & precision') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_discrete(limits = new.order)

[Figure: linear model coefficients, magnitude (abs. value) and precision by variable]

The linear model suggests that the previous month’s sales is a strong predictor of this month’s sales, as we might expect of a highly persistent time series. The previous month’s unemployment is also an important predictor of this month’s sales, though the estimated coefficient is less precise. Finally, the two-month lag of RGDP appears to have the most sizeable impact on sales.

For the random forest models, we will use %IncMSE, the mean increase in MSE when a variable’s values are permuted, as the primary measure of variable importance. A detailed description of the statistic is available on Wikipedia or in the caret documentation.

Note that the output below includes a second measure of variable importance, increase in node purity, and the two measures do not necessarily agree. Since we are examining different lags of the same (persistent) variables, the permutation-based %IncMSE is preferable here because it assesses a variable’s importance while taking the impact of the other lags into account, whereas node purity does not. Hence, we’ll use %IncMSE.

rf.imp <- importance(randForest$finalModel)
rf2.imp <- importance(randForest2$finalModel)
# print variable importance sorted on %IncMSE
print(rf.imp[order(rf.imp[,1], decreasing = TRUE),])
##                %IncMSE IncNodePurity
## L1SalesM         22.55        333.96
## L1RGDP           19.93        118.83
## L2SalesM         16.78        251.13
## L3RGDP           15.52        131.26
## L3SalesM         15.24        224.40
## L2RGDP           15.13        146.81
## L1Unemployment   15.13        110.08
## L2Unemployment   14.12         55.97
## L3Unemployment   10.91         40.65
print(rf2.imp[order(rf2.imp[,1], decreasing = TRUE),])
##                %IncMSE IncNodePurity
## L1SalesM        35.558        681.00
## L1RGDP          19.267         75.67
## L2SalesM        16.253        232.18
## L3RGDP          14.992         94.09
## L2RGDP          14.987         97.98
## L3SalesM        13.345        162.45
## L2Unemployment  13.019         24.02
## L1Unemployment  11.210         31.95
## L3Unemployment   9.243         17.34

Both random forest models support the evidence in the linear model regarding the importance of the first lag of sales. Interestingly, they also rank the impact of real GDP more highly than unemployment.

In sum, the first lag of sales is an important predictor for the next month’s sales. Lagged RGDP may also have predictive power, although in the linear model, only the second lag was significant at the 5% level. Whether real GDP really is a better predictor of future sales than unemployment is not completely clear, though there is some evidence that it is.

Auto sales this month

Since we have these three models trained and ready to go, let’s have a bit of fun and predict the volume of auto sales this month (December 2014). Unfortunately, Stock-Watson RGDP figures are not available after 2010, so the inputs below use completely fabricated RGDP numbers. (Some ad hoc testing showed that the random forest models are not particularly sensitive to the magnitude of any of the RGDP variables.)

new.data <- list(L1SalesM = 17.5,
                 L1RGDP = 14000,
                 L1Unemployment = 5.9,
                 L2SalesM = 16.8,
                 L2RGDP = 14000,
                 L2Unemployment = 5.8,
                 L3SalesM = 16.8,
                 L3RGDP = 14000,
                 L3Unemployment = 5.8)
predict(linear, new.data)
##     1 
## 17.37
predict(randForest, new.data)
## [1] 16.93
predict(randForest2, new.data)
## [1] 16.84

It looks like the three models more or less agree that roughly 17m cars will be sold in December. My gut tells me that this number might overshoot the actual number a bit, especially since I am not sure whether these models adequately take into account the seasonality of sales.