Truck data analysis - final results

Introduction

In this report we present the results of the models considered to fit the monthly trucks licensing in Brazil. As suggested, the method for prediction and thus validation of the models will be done by taking the sum of the monthly predicted values for the testing period. The R code used in this analysis can be show by clicking in the “code” buttons on the right.

Data splitting

Data considering for training are those available until January 2019 and the validation data corresponds to the following 12-month period, as shown in the graph below.

splits <- dados_dessaz %>% 
  time_series_split(
    assess     = "12 months", 
    cumulative = TRUE
  )

splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, trucks)

Models considered

Several models were considered in the initial analysis (ARIMA, TSLM, VAR, XGBoost, random forest, prophet) and here we present the results of the 5 best models. From now on the models will be regarded to as the bold expression in the list bellow.

1 - ARIMA (log): this is a ARIMA model considering the trucks variable transformed to the log scale.
2 - ARIMA with regressors (log): this ARIMA model also considers the response variable in the log scale and takes ibcbr, corrente_de_comercio, working_days as exogenous variables.
3 - ARIMA with regressors (differences): this ARIMA model considers as the response the differences in the trucks series and as exogenous variables the working_days series and the differences in ibcbr, corrente_de_comercio series.
4 - Random Forest: the model considers the differences in the trucks series as the response variable and the remaining variables as predictors (all taken in their differences, excepting working_days).
5 - ETS: Error-Trend-Season (ETS) model using exponential smoothing state space model.

# model1
model_fit_arima <- arima_reg() %>%
  set_engine(engine = "auto_arima") %>%
  fit(log(trucks) ~ date, data = training(splits))

# model2 
model_fit_arima_reg <- arima_reg() %>%
  set_engine(engine = "auto_arima") %>%
  fit(log(trucks) ~ date + ibcbr + corrente_de_comercio + working_days, 
      data = training(splits))

# model3
model_fit_arima_reg_d <- arima_reg() %>%
  set_engine(engine = "auto_arima") %>%
  fit(trucks_d1 ~ date + ibcbr_d1 + corrente_de_comercio_d1 + working_days, 
      data = training(splits))

# model4
model_rf <- rand_forest() %>% 
  set_engine("ranger", importance = "permutation") %>% 
  set_mode("regression") %>%
  fit(trucks_d1 ~ as.numeric(date) + ibcbr_d1 +
        corrente_de_comercio_d1 +
        confianca_d1 + ipa_d1 + credito_d1 + consumo_de_energia_d1 +
        indice_de_commodities_d1 + working_days,
      data = training(splits))

# model5
model_fit_ets <- exp_smoothing() %>%
  set_engine(engine = "ets") %>%
  fit(trucks ~ date, data = training(splits))

models <- modeltime_table(
  model_fit_arima, # escala log
  model_fit_arima_reg, # escala log
  model_fit_arima_reg_d, # diff
  model_rf, # diff
  model_fit_ets # original
)

The table below shows the estimated parameters for the ARIMA models described above.

models %>%
  rename(.model_espec = .model_desc) %>% 
  bind_cols(tibble(.model_desc = c("1 - ARIMA (log)",
                                            "2 - ARIMA with regressors (log)",
                                            "3 - ARIMA with regressors (differences)",
                                            "4 - Random Forest",
                                            "5 - ETS"))) %>% 
  filter(str_count(.model_desc, "ARIMA")>0) %>% 
  select(Description = .model_desc, 
         Specification = .model_espec) %>% 
  gt::gt()

Description	Specification
1 - ARIMA (log)	ARIMA(2,1,2)(0,0,2)[12]
2 - ARIMA with regressors (log)	REGRESSION WITH ARIMA(2,1,2)(1,0,1)[12] ERRORS
3 - ARIMA with regressors (differences)	REGRESSION WITH ARIMA(0,0,2)(0,0,1)[12] ERRORS

  # kable(col.names = c("Description", "Specification")) %>% 
  # kable_styling(full_width = F)

Assessing model accuracy on test data

Again, instead of considering each monthly prediction separately, we take the sum of all the predictions on testing data (12-month period after January 2019). The figure below shows the 12-months sum of predicted values (solid black points) for each model and their corresponding confidence intervals (CI). The dashed black line represents the observed sum of the number of licensed trucks in the referred period. It is important to note two aspects in this figure. First, the models are ordered according to the sum of predict values. Second, the CI are considerably wider for the models that consider the response variable in its differences. This might be due to the way the intervals were computed: applying the reverse of the difference function on the estimated CI bounds (need o check the statistical consistency of this transformation).

# list of models in log scale
lista_log <- c(1, 2)

# list of models in differences
lista_diff <- c(3, 4)

# this funciton computes the inverse of the difference when needed
# (depending on the column is_diff)
# and return a dataframe with prediction and conf intervals
transform_diff <- function(dados) {
  # check if the model is in differences (column is_diff)
  if(dados %>% summarise(sum(is_diff)) %>% pull() != nrow(dados))
    return(dados %>% select(-is_diff))
  
  # last observed value in the trucks training data
  last <- testing(splits) %>% 
                       filter(date==max(date)) %>% 
                       select(trucks) %>% pull()
  
  value <- (diffinv(dados$.value, xi = last))[-1]
  # [-1] to remove the first value of the series
  
  conf_lo <- (diffinv(dados$.conf_lo, xi = last))[-1]
  
  conf_hi <- (diffinv(dados$.conf_hi, xi = last))[-1]
  
  return(tibble(.index = dados$.index,
         .value = value,
         .conf_lo = conf_lo,
         .conf_hi = conf_hi))
}


# obtain the monthly predictions of each model in the original
# data scale (inverse operations computed for log and diff models)
pred <- models %>%
  modeltime_calibrate(new_data = testing(splits)) %>%
  modeltime_forecast(
    new_data    = testing(splits),
    actual_data = dados_dessaz
  ) %>%
  # filter only predictions
  filter(.model_desc != "ACTUAL") %>%
  # transform the values in log scale back to original 
  mutate(.value = ifelse(.model_id %in% lista_log,
                           exp(.value), .value),
           .conf_lo = ifelse(.model_id %in% lista_log,
                           exp(.conf_lo), .conf_lo),
           .conf_hi = ifelse(.model_id %in% lista_log,
                           exp(.conf_hi), .conf_hi)) %>% 
  # transform the values in differences back to original
  mutate(is_diff = ifelse(.model_id %in% lista_diff,
                          TRUE, FALSE),
         .model = paste(as.character(.model_id),
                        .model_desc, sep=" - ")) %>% 
  select(-.model_id, -.model_desc, -.key) %>%
  group_by(.model) %>% 
  nest() %>%
  mutate(res = purrr::map(data, ~{transform_diff(.x)})) %>%
  select(res) %>%
  unnest(cols = res) 

# obtain the 12-month predicted sum for each model
res <- pred %>%
  # computes the 12-month sum of predictions
  summarise(across(where(is.numeric), ~sum(.x)))

# renaming the model descriptions
res <- res %>% 
  separate(.model, into = c("id", "desc"), sep = " - ") %>%
  mutate(.model_desc = case_when(id == 1 ~ "1 - ARIMA (log)",
                        id == 2 ~ "2 - ARIMA with regressors (log)",
                        id == 3 ~ "5 - ARIMA with regressors (differences)",
                        id == 4 ~ "3 - Random Forest (differences)",
                        id == 5 ~ "4 - ETS")) %>%
  select(-id, -desc) %>%
  relocate(.model_desc, .value)

# graph of 12-month prediction and CI 
res %>%
  # select and reorder the models names
  mutate(.model_desc = fct_reorder(.model_desc, .value)) %>%
  # add a column with the observed 12-month sum
  bind_cols(tibble(.observed = testing(splits) %>% 
                     select(trucks) %>% 
                     pull() %>% 
                     sum()))  %>%
  # start of the graph
  ggplot(aes(x = .model_desc, y = .value))+
  geom_point()+
  geom_errorbar(aes(ymin = .conf_lo, ymax = .conf_hi),
                width = 0.2, alpha = 0.5)+
  geom_hline(aes(yintercept = .observed, 
                 color = "Observed value"), linetype="dashed")+
  scale_color_manual(values = "black")+
  coord_flip()+
  labs(x=NULL, y="Trucks", color=NULL,
       title = "12-month prediction for the number of trucks")

The interactive graph below shows the original data, the monthly forecast and corresponding confidence interval for each model. Clicking on the model name in the legend enables/disables its value on the graph.

models %>%
  modeltime_calibrate(new_data = testing(splits)) %>%
  modeltime_forecast(
    new_data    = testing(splits),
    actual_data = dados_dessaz
  ) %>%
  filter(.model_desc != "ACTUAL") %>%
  # transform the values in log scale back to original 
  mutate(.value = ifelse(.model_id %in% lista_log,
                           exp(.value), .value),
           .conf_lo = ifelse(.model_id %in% lista_log,
                           exp(.conf_lo), .conf_lo),
           .conf_hi = ifelse(.model_id %in% lista_log,
                           exp(.conf_hi), .conf_hi)) %>% 
  # transform the values in differences back to original
  mutate(is_diff = ifelse(.model_id %in% lista_diff,
                          TRUE, FALSE),
         .model = paste(as.character(.model_id),
                        .model_desc, sep=" - ")) %>% 
  select(-.model_id, -.model_desc, -.key) %>%
  group_by(.model) %>% 
  nest() %>%
  mutate(res = purrr::map(data, ~{transform_diff(.x)})) %>%
  select(res) %>%
  unnest(cols = res) %>% 
  ungroup() %>% 
  separate(.model, c(".model_id", ".model_desc"), " - ") %>% 
  mutate(.model_id = as.integer(.model_id),
         .key = "prediction") %>% 
  select(.model_id, .model_desc, .index, .value, .conf_lo, .conf_hi) %>% 
  bind_rows(modeltime_table(model_fit_ets) %>%
              modeltime_calibrate(new_data = testing(splits)) %>%
              modeltime_forecast(
                new_data    = testing(splits),
                actual_data = dados_dessaz) %>%
              filter(.model_desc == "ACTUAL")) %>% 
  plot_modeltime_forecast(
              .legend_max_width = 25,
              .interactive      = T
            )

Finally, since we are considering the sum of the predicted values for the testing data, it does not make sense to compute metrics such as MSE, MAE, and MAPE (there is only one single test data). Instead, in the accuracy table below we show two metrics

root squared error (RSE): \(\sqrt{(y_{obs}-y_{pred})^2}\), which, in this case equals the absolute difference (\(|y_{obs}-y_{pred}|\)).
absolute percentage error (APE): \(100 \times \bigg|\frac{y_{obs}-y_{pred}}{y_{obs}}\bigg|\),

where \(y_{obs}\) is the sum of the observed values in the period and \(y_{pred}\) is the sum of the corresponding predictions. Note that these metrics are based solely on the point estimate given by each model.

res %>%
  bind_cols(tibble(.observed = testing(splits) %>% 
                     select(trucks) %>% 
                     pull() %>% 
                     sum())) %>%
  mutate(RSE = sqrt((.observed - .value)^2),
         APE = abs(.observed - .value)/.observed * 100) %>%
  mutate(APE = paste0(round(APE,2), "%")) %>% 
  select(-(.value:.observed)) %>%
  separate(.model_desc, into = c(".model_id", 
                            ".model_desc"), sep = " - ") %>%
  arrange(.model_id) %>% 
  table_modeltime_accuracy(.interactive = F)

Accuracy Table
.model_id	.model_desc	RSE	APE
1	ARIMA (log)	2160.11	2.15%
2	ARIMA with regressors (log)	2647.15	2.63%
3	Random Forest (differences)	8289.09	8.23%
4	ETS	525.45	0.52%
5	ARIMA with regressors (differences)	1606.02	1.6%

Finally, considering the results presented above, the model which performed the best and has a narrow confidence interval is Model 4 - ETS, the model using exponential smoothing state space model, and thus should be used to forecast.

Although Model 5 - ARIMA with regressors (differences) makes an accurate point estimation of the sum of licensed trucks in the testing period, we would no recomend this model since its confidence interval is considerably wider than that of Modelo 4 - ETS.

library(knitr)
library(rmarkdown)
render("results.Rmd", 
       html_document(pandoc_args = "--self-contained"))

Truck data analysis - final results

Magno

04/11/2021

Introduction

Data splitting

Models considered

Assessing model accuracy on test data