Introduction

In this report we present the results of the models considered to fit the monthly trucks licensing in Brazil. As suggested, the method for prediction and thus validation of the models will be done by taking the sum of the monthly predicted values for the testing period. The R code used in this analysis can be show by clicking in the “code” buttons on the right.

Data splitting

Data considering for training are those available until January 2019 and the validation data corresponds to the following 12-month period, as shown in the graph below.

splits <- dados_dessaz %>% 
  time_series_split(
    assess     = "12 months", 
    cumulative = TRUE
  )

splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, trucks)

Models considered

Several models were considered in the initial analysis (ARIMA, TSLM, VAR, XGBoost, random forest, prophet) and here we present the results of the 5 best models. From now on the models will be regarded to as the bold expression in the list bellow.

# model1
model_fit_arima <- arima_reg() %>%
  set_engine(engine = "auto_arima") %>%
  fit(log(trucks) ~ date, data = training(splits))

# model2 
model_fit_arima_reg <- arima_reg() %>%
  set_engine(engine = "auto_arima") %>%
  fit(log(trucks) ~ date + ibcbr + corrente_de_comercio + working_days, 
      data = training(splits))

# model3
model_fit_arima_reg_d <- arima_reg() %>%
  set_engine(engine = "auto_arima") %>%
  fit(trucks_d1 ~ date + ibcbr_d1 + corrente_de_comercio_d1 + working_days, 
      data = training(splits))

# model4
model_rf <- rand_forest() %>% 
  set_engine("ranger", importance = "permutation") %>% 
  set_mode("regression") %>%
  fit(trucks_d1 ~ as.numeric(date) + ibcbr_d1 +
        corrente_de_comercio_d1 +
        confianca_d1 + ipa_d1 + credito_d1 + consumo_de_energia_d1 +
        indice_de_commodities_d1 + working_days,
      data = training(splits))

# model5
model_fit_ets <- exp_smoothing() %>%
  set_engine(engine = "ets") %>%
  fit(trucks ~ date, data = training(splits))

models <- modeltime_table(
  model_fit_arima, # escala log
  model_fit_arima_reg, # escala log
  model_fit_arima_reg_d, # diff
  model_rf, # diff
  model_fit_ets # original
)

The table below shows the estimated parameters for the ARIMA models described above.

models %>%
  rename(.model_espec = .model_desc) %>% 
  bind_cols(tibble(.model_desc = c("1 - ARIMA (log)",
                                            "2 - ARIMA with regressors (log)",
                                            "3 - ARIMA with regressors (differences)",
                                            "4 - Random Forest",
                                            "5 - ETS"))) %>% 
  filter(str_count(.model_desc, "ARIMA")>0) %>% 
  select(Description = .model_desc, 
         Specification = .model_espec) %>% 
  gt::gt()
Description Specification
1 - ARIMA (log) ARIMA(2,1,2)(0,0,2)[12]
2 - ARIMA with regressors (log) REGRESSION WITH ARIMA(2,1,2)(1,0,1)[12] ERRORS
3 - ARIMA with regressors (differences) REGRESSION WITH ARIMA(0,0,2)(0,0,1)[12] ERRORS
  # kable(col.names = c("Description", "Specification")) %>% 
  # kable_styling(full_width = F)

Assessing model accuracy on test data

Again, instead of considering each monthly prediction separately, we take the sum of all the predictions on testing data (12-month period after January 2019). The figure below shows the 12-months sum of predicted values (solid black points) for each model and their corresponding confidence intervals (CI). The dashed black line represents the observed sum of the number of licensed trucks in the referred period. It is important to note two aspects in this figure. First, the models are ordered according to the sum of predict values. Second, the CI are considerably wider for the models that consider the response variable in its differences. This might be due to the way the intervals were computed: applying the reverse of the difference function on the estimated CI bounds (need o check the statistical consistency of this transformation).

# list of models in log scale
lista_log <- c(1, 2)

# list of models in differences
lista_diff <- c(3, 4)

# this funciton computes the inverse of the difference when needed
# (depending on the column is_diff)
# and return a dataframe with prediction and conf intervals
transform_diff <- function(dados) {
  # check if the model is in differences (column is_diff)
  if(dados %>% summarise(sum(is_diff)) %>% pull() != nrow(dados))
    return(dados %>% select(-is_diff))
  
  # last observed value in the trucks training data
  last <- testing(splits) %>% 
                       filter(date==max(date)) %>% 
                       select(trucks) %>% pull()
  
  value <- (diffinv(dados$.value, xi = last))[-1]
  # [-1] to remove the first value of the series
  
  conf_lo <- (diffinv(dados$.conf_lo, xi = last))[-1]
  
  conf_hi <- (diffinv(dados$.conf_hi, xi = last))[-1]
  
  return(tibble(.index = dados$.index,
         .value = value,
         .conf_lo = conf_lo,
         .conf_hi = conf_hi))
}


# obtain the monthly predictions of each model in the original
# data scale (inverse operations computed for log and diff models)
pred <- models %>%
  modeltime_calibrate(new_data = testing(splits)) %>%
  modeltime_forecast(
    new_data    = testing(splits),
    actual_data = dados_dessaz
  ) %>%
  # filter only predictions
  filter(.model_desc != "ACTUAL") %>%
  # transform the values in log scale back to original 
  mutate(.value = ifelse(.model_id %in% lista_log,
                           exp(.value), .value),
           .conf_lo = ifelse(.model_id %in% lista_log,
                           exp(.conf_lo), .conf_lo),
           .conf_hi = ifelse(.model_id %in% lista_log,
                           exp(.conf_hi), .conf_hi)) %>% 
  # transform the values in differences back to original
  mutate(is_diff = ifelse(.model_id %in% lista_diff,
                          TRUE, FALSE),
         .model = paste(as.character(.model_id),
                        .model_desc, sep=" - ")) %>% 
  select(-.model_id, -.model_desc, -.key) %>%
  group_by(.model) %>% 
  nest() %>%
  mutate(res = purrr::map(data, ~{transform_diff(.x)})) %>%
  select(res) %>%
  unnest(cols = res) 

# obtain the 12-month predicted sum for each model
res <- pred %>%
  # computes the 12-month sum of predictions
  summarise(across(where(is.numeric), ~sum(.x)))

# renaming the model descriptions
res <- res %>% 
  separate(.model, into = c("id", "desc"), sep = " - ") %>%
  mutate(.model_desc = case_when(id == 1 ~ "1 - ARIMA (log)",
                        id == 2 ~ "2 - ARIMA with regressors (log)",
                        id == 3 ~ "5 - ARIMA with regressors (differences)",
                        id == 4 ~ "3 - Random Forest (differences)",
                        id == 5 ~ "4 - ETS")) %>%
  select(-id, -desc) %>%
  relocate(.model_desc, .value)

# graph of 12-month prediction and CI 
res %>%
  # select and reorder the models names
  mutate(.model_desc = fct_reorder(.model_desc, .value)) %>%
  # add a column with the observed 12-month sum
  bind_cols(tibble(.observed = testing(splits) %>% 
                     select(trucks) %>% 
                     pull() %>% 
                     sum()))  %>%
  # start of the graph
  ggplot(aes(x = .model_desc, y = .value))+
  geom_point()+
  geom_errorbar(aes(ymin = .conf_lo, ymax = .conf_hi),
                width = 0.2, alpha = 0.5)+
  geom_hline(aes(yintercept = .observed, 
                 color = "Observed value"), linetype="dashed")+
  scale_color_manual(values = "black")+
  coord_flip()+
  labs(x=NULL, y="Trucks", color=NULL,
       title = "12-month prediction for the number of trucks")

The interactive graph below shows the original data, the monthly forecast and corresponding confidence interval for each model. Clicking on the model name in the legend enables/disables its value on the graph.

models %>%
  modeltime_calibrate(new_data = testing(splits)) %>%
  modeltime_forecast(
    new_data    = testing(splits),
    actual_data = dados_dessaz
  ) %>%
  filter(.model_desc != "ACTUAL") %>%
  # transform the values in log scale back to original 
  mutate(.value = ifelse(.model_id %in% lista_log,
                           exp(.value), .value),
           .conf_lo = ifelse(.model_id %in% lista_log,
                           exp(.conf_lo), .conf_lo),
           .conf_hi = ifelse(.model_id %in% lista_log,
                           exp(.conf_hi), .conf_hi)) %>% 
  # transform the values in differences back to original
  mutate(is_diff = ifelse(.model_id %in% lista_diff,
                          TRUE, FALSE),
         .model = paste(as.character(.model_id),
                        .model_desc, sep=" - ")) %>% 
  select(-.model_id, -.model_desc, -.key) %>%
  group_by(.model) %>% 
  nest() %>%
  mutate(res = purrr::map(data, ~{transform_diff(.x)})) %>%
  select(res) %>%
  unnest(cols = res) %>% 
  ungroup() %>% 
  separate(.model, c(".model_id", ".model_desc"), " - ") %>% 
  mutate(.model_id = as.integer(.model_id),
         .key = "prediction") %>% 
  select(.model_id, .model_desc, .index, .value, .conf_lo, .conf_hi) %>% 
  bind_rows(modeltime_table(model_fit_ets) %>%
              modeltime_calibrate(new_data = testing(splits)) %>%
              modeltime_forecast(
                new_data    = testing(splits),
                actual_data = dados_dessaz) %>%
              filter(.model_desc == "ACTUAL")) %>% 
  plot_modeltime_forecast(
              .legend_max_width = 25,
              .interactive      = T
            )

Finally, since we are considering the sum of the predicted values for the testing data, it does not make sense to compute metrics such as MSE, MAE, and MAPE (there is only one single test data). Instead, in the accuracy table below we show two metrics

where \(y_{obs}\) is the sum of the observed values in the period and \(y_{pred}\) is the sum of the corresponding predictions. Note that these metrics are based solely on the point estimate given by each model.

res %>%
  bind_cols(tibble(.observed = testing(splits) %>% 
                     select(trucks) %>% 
                     pull() %>% 
                     sum())) %>%
  mutate(RSE = sqrt((.observed - .value)^2),
         APE = abs(.observed - .value)/.observed * 100) %>%
  mutate(APE = paste0(round(APE,2), "%")) %>% 
  select(-(.value:.observed)) %>%
  separate(.model_desc, into = c(".model_id", 
                            ".model_desc"), sep = " - ") %>%
  arrange(.model_id) %>% 
  table_modeltime_accuracy(.interactive = F)
Accuracy Table
.model_id .model_desc RSE APE
1 ARIMA (log) 2160.11 2.15%
2 ARIMA with regressors (log) 2647.15 2.63%
3 Random Forest (differences) 8289.09 8.23%
4 ETS 525.45 0.52%
5 ARIMA with regressors (differences) 1606.02 1.6%

Finally, considering the results presented above, the model which performed the best and has a narrow confidence interval is Model 4 - ETS, the model using exponential smoothing state space model, and thus should be used to forecast.

Although Model 5 - ARIMA with regressors (differences) makes an accurate point estimation of the sum of licensed trucks in the testing period, we would no recomend this model since its confidence interval is considerably wider than that of Modelo 4 - ETS.

library(knitr)
library(rmarkdown)
render("results.Rmd", 
       html_document(pandoc_args = "--self-contained"))