In this notebook we use auto sales data from the Federal Reserve Bank of St. Louis (FRED) to build a model and forecast auto sales for approximately the next two years.
The data has been downloaded to the machine as a CSV file, and we can read it into R with the code below.
library(tidyverse)
library(modeltime)
library(tidymodels)
library(timetk)
vehicle_sales <- read_csv("C:/Users/Owner/Documents/Data/TotalVehicleSales.csv")
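If you would rather pull the series directly from FRED instead of reading a local CSV, the tidyquant package can fetch it by its series ID. A minimal sketch, assuming tidyquant is installed (the rest of this notebook uses the CSV):
library(tidyquant)
# "TOTALSA" is the FRED series ID for total vehicle sales
vehicle_sales <- tq_get("TOTALSA", get = "economic.data", from = "1976-01-01") %>%
  select(DATE = date, TOTALSA = price)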
First we take a top-level look at the data. We can see below that we have 563 records of 2 variables. The data contains a column with the seasonally adjusted units sold in millions (TOTALSA) and a column with the month for the record (DATE). All records are filled and there is no missing data. The TOTALSA variable ranges from 8.923 million to 22.055 million, with a median of 15.385 million and a mean of 15.131 million. The DATE range for this data is from January 1976 through November 2022. This data will not need any transformation before the model is built.
dim(vehicle_sales)
[1] 563 2
glimpse(vehicle_sales)
Rows: 563
Columns: 2
$ DATE <date> 1976-01-01, 1976-02-01, 1976-03-01, 1976-04-01, 1976-05-01, 1976-06-01, 1976-07-01, 1976-…
$ TOTALSA <dbl> 12.814, 13.340, 13.378, 13.223, 12.962, 13.051, 13.466, 12.953, 13.610, 12.823, 13.332, 14…
summary(vehicle_sales)
DATE TOTALSA
Min. :1976-01-01 Min. : 8.923
1st Qu.:1987-09-16 1st Qu.:13.659
Median :1999-06-01 Median :15.385
Mean :1999-06-01 Mean :15.131
3rd Qu.:2011-02-15 3rd Qu.:16.951
Max. :2022-11-01 Max. :22.055
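If you want to verify those claims programmatically, a quick count confirms there are no missing values, and timetk (already loaded) can summarize the date index:
# Count missing values across the whole data set
sum(is.na(vehicle_sales))
# Summarize the date index: start, end, and whether the monthly scale is regular
vehicle_sales %>%
  tk_summary_diagnostics(.date_var = DATE)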
Let’s start with a quick time series plot of the data. The plot below shows that car sales vary quite a bit. We see significant drops around 1980, 2008, and 2020, coinciding with significant economic downturns. Sales tend to climb from those lows, with highs around 1986, 2001, and 2005. The latest data suggest car sales are still rebounding from a drop.
vehicle_sales %>%
  plot_time_series(DATE, TOTALSA)
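Before modeling, it can also help to look at the seasonal structure directly. timetk's seasonal diagnostics plot breaks the series out by month, quarter, and year:
vehicle_sales %>%
  plot_seasonal_diagnostics(DATE, TOTALSA)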
The first thing we need to do is split our data into training and testing sets. We will test on the last two years of data. Because the only independent variable we have is the date, we don't need to create a future data set.
In the plot below, the blue portion of the line indicates the data our model will be trained on and the red portion indicates the data withheld for testing. After our models are built we will predict monthly car sales for the next two years.
splits <- vehicle_sales %>%
  time_series_split(
    assess     = "2 years",
    cumulative = TRUE
  )
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(DATE, TOTALSA)
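As a quick sanity check on the split, we can confirm the date ranges each set covers:
# Training should end two years before the data does; testing covers the final two years
training(splits) %>% pull(DATE) %>% range()
testing(splits) %>% pull(DATE) %>% range()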
We will train three models and compare their results. Below is the code for training a prophet model, a glmnet model, and an xgboost model.
vhcl_sls_prpht <- prophet_reg(seasonality_yearly = TRUE) %>%
  set_engine("prophet") %>%
  fit(TOTALSA ~ DATE, training(splits))
vhcl_sls_prpht
parsnip model object
PROPHET Model
- growth: 'linear'
- n.changepoints: 25
- changepoint.range: 0.8
- yearly.seasonality: 'TRUE'
- weekly.seasonality: 'auto'
- daily.seasonality: 'auto'
- seasonality.mode: 'additive'
- changepoint.prior.scale: 0.05
- seasonality.prior.scale: 10
- holidays.prior.scale: 10
- logistic_cap: NULL
- logistic_floor: NULL
- extra_regressors: 0
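prophet_reg() exposes the changepoint settings printed above if the default trend turns out to be too rigid. A sketch with illustrative, untuned values:
vhcl_sls_prpht_flex <- prophet_reg(
    seasonality_yearly = TRUE,
    changepoint_num    = 50,  # default is 25
    changepoint_range  = 0.9  # default is 0.8
  ) %>%
  set_engine("prophet") %>%
  fit(TOTALSA ~ DATE, training(splits))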
vhcl_sls_glmnt <- linear_reg(penalty = 0.01) %>%
  set_engine("glmnet") %>%
  fit(TOTALSA ~ lubridate::month(DATE, label = TRUE) + as.numeric(DATE) + ., training(splits))
vhcl_sls_glmnt
parsnip model object
Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian")
Df %Dev Lambda
1 0 0.00 0.96600
2 1 3.02 0.88020
3 1 5.53 0.80200
4 1 7.61 0.73070
5 1 9.34 0.66580
6 1 10.78 0.60670
7 1 11.97 0.55280
8 1 12.96 0.50370
9 1 13.78 0.45890
10 1 14.46 0.41810
11 1 15.03 0.38100
12 1 15.50 0.34720
13 1 15.89 0.31630
14 1 16.22 0.28820
15 1 16.48 0.26260
16 1 16.71 0.23930
17 1 16.89 0.21800
18 1 17.05 0.19870
19 1 17.17 0.18100
20 1 17.28 0.16490
21 2 17.37 0.15030
22 1 17.44 0.13690
23 1 17.50 0.12480
24 2 17.55 0.11370
25 3 17.60 0.10360
26 3 17.67 0.09438
27 3 17.73 0.08599
28 3 17.78 0.07835
29 3 17.82 0.07139
30 4 17.85 0.06505
31 4 17.89 0.05927
32 4 17.93 0.05401
33 4 17.96 0.04921
34 4 17.98 0.04484
35 5 18.00 0.04085
36 5 18.02 0.03722
37 6 18.05 0.03392
38 6 18.06 0.03090
39 6 18.08 0.02816
40 8 18.10 0.02566
41 9 18.11 0.02338
42 9 18.13 0.02130
43 9 18.14 0.01941
44 8 18.15 0.01768
45 9 18.16 0.01611
46 8 18.16 0.01468
47 10 18.17 0.01338
48 11 18.17 0.01219
49 11 18.18 0.01111
50 11 18.18 0.01012
51 10 18.19 0.00922
52 11 18.19 0.00840
53 11 18.19 0.00766
54 11 18.19 0.00698
55 11 18.19 0.00636
56 10 18.20 0.00579
57 12 18.20 0.00528
58 11 18.20 0.00481
59 12 18.20 0.00438
60 12 18.20 0.00399
61 12 18.20 0.00364
62 13 18.20 0.00331
63 13 18.20 0.00302
64 12 18.20 0.00275
65 12 18.20 0.00251
66 12 18.20 0.00228
67 12 18.20 0.00208
68 12 18.20 0.00190
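The inline formula above works, but the same feature engineering can be expressed as a recipe and workflow, which keeps the preprocessing reusable across models. A sketch, not used elsewhere in this notebook:
glmnet_rec <- recipe(TOTALSA ~ DATE, training(splits)) %>%
  step_timeseries_signature(DATE) %>%   # expand DATE into calendar features
  step_rm(DATE) %>%                     # glmnet needs numeric predictors
  step_zv(all_predictors()) %>%         # drop zero-variance columns
  step_dummy(all_nominal_predictors())  # one-hot encode month labels, etc.

vhcl_sls_glmnt_wf <- workflow() %>%
  add_model(linear_reg(penalty = 0.01) %>% set_engine("glmnet")) %>%
  add_recipe(glmnet_rec) %>%
  fit(training(splits))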
vhcl_sls_xgboost <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  fit(TOTALSA ~ lubridate::month(DATE, label = TRUE) + as.numeric(DATE) + ., training(splits))
vhcl_sls_xgboost
parsnip model object
##### xgb.Booster
raw: 36.4 Kb
call:
xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0,
colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1,
subsample = 1), data = x$data, nrounds = 15, watchlist = x$watchlist,
verbose = 0, nthread = 1, objective = "reg:squarederror")
params (as set within xgb.train):
eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
xgb.attributes:
niter
callbacks:
cb.evaluation.log()
# of features: 13
niter: 15
nfeatures : 13
evaluation_log:
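boost_tree() also exposes the main xgboost hyperparameters if you want to move beyond the defaults printed above. The values below are illustrative, not tuned:
vhcl_sls_xgboost_tuned <- boost_tree(
    trees      = 500,
    learn_rate = 0.05,
    tree_depth = 4
  ) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  fit(TOTALSA ~ lubridate::month(DATE, label = TRUE) + as.numeric(DATE) + ., training(splits))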
Below we have the metrics for the models. Our glmnet model has the smallest RMSE, suggesting it might perform the best, while our prophet model explains the greatest amount of variation in the dependent variable. Looking at the plot, however, neither of those models responds much to movements in the data, whereas the xgboost model does. This makes it a little difficult to determine which model to choose. Including all three in a report is always an option, but if I had to pick one, my metric of choice is RMSE, so I would likely go with the glmnet model even though xgboost tends to be my favorite.
model_compare <- modeltime_table(vhcl_sls_prpht, vhcl_sls_glmnt, vhcl_sls_xgboost)
calibration <- model_compare %>%
  modeltime_calibrate(testing(splits))
calibration %>% modeltime_accuracy()
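For a static report, modeltime can render those metrics as a formatted table:
calibration %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(.interactive = FALSE)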
calibration %>%
  modeltime_forecast(
    new_data    = testing(splits),
    actual_data = vehicle_sales
  ) %>%
  plot_modeltime_forecast()
Our model could almost certainly be improved by adding variables such as a lag of sales and a lagged rolling average, both of which would likely improve performance. A convenient aspect of using only the date, however, is that the model can easily predict as far out as we want. Here we predict two years out. The plot below shows those forecasts for each of our models. Both the Prophet and glmnet models suggest sales above current levels, while the xgboost model predicts car sales consistent with current levels.
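As a sketch of the lag features mentioned above, timetk's augment helpers can add lagged values and a rolling average in one pipeline. Note that these engineered features would also have to be supplied for the forecast horizon, which is exactly the complication the date-only models avoid:
vehicle_sales_aug <- vehicle_sales %>%
  tk_augment_lags(TOTALSA, .lags = c(1, 12)) %>%  # 1- and 12-month lags
  tk_augment_slidify(
    TOTALSA,
    .f       = mean,
    .period  = 12,       # 12-month rolling mean
    .align   = "right",  # window looks backward, not ahead
    .partial = TRUE
  )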
vhcl_sls_forecast <- calibration %>%
  modeltime_refit(vehicle_sales) %>%
  modeltime_forecast(
    h           = "2 years",
    actual_data = vehicle_sales
  )
vhcl_sls_forecast %>%
  plot_modeltime_forecast()