library(dplyr)
library(forecast)
library(tidyverse)
library(randomForest)
library(tsibble)
library(readr)

Background

The purpose of this presentation is to explore a real world application of a Random Forest model on time series data. Random forest is one of the simplest, most robust, powerful and popular ML algorithms and we wanted to go “full circle” in exploring this popular algorithm, from a subsection of this week’s tree-based model topic, on time series data (something we covered in-depth earlier in the semester).

Approach

In completing this exploration of a Random Forest model we:

  • pre-process and visualize our data upon reading it in,
  • difference and transform the data so as to make it stationary,
  • utilize time delay embedding to bring our data into a form an ML model can handle,
  • train our model, consider its statistics, and compare its performance with that of a SNAIVE model.

Data

We use a time series from the German Statistical Office on German wage and income tax revenue from 1999 – 2019 (after tax redistribution). Note: we end up dropping entries from 2019 because they aren’t complete for the year.

The can be downloaded from STATWORX’s GitHub.


Pre-process and Visualize

In order to use random forest for time series data we transform, difference, and embed our data.

We read in and pre-process the German tax data so as to transform our csv file into a compatible time series format:

urlfile="https://raw.githubusercontent.com/Magnus-PS/DATA-624/main/tax.csv"
#mydata<-read_csv(url(urlfile))
#head(mydata)
# read in the csv file
tax_tbl <- readr::read_delim(file = urlfile, 
                             delim = ";", #split on ;
                             col_names = c("Year", "Type", month.abb), #label columns
                             skip = 1, #skip 1st row
                             na = c("...")) %>% #label ... entries as NA
    select(-Type) %>% #drop tax type label
    gather(Date, Value, -Year) %>% #unite date and value minus year
    unite("Date", c(Date, Year), sep = " ") %>% #combine month and year
    mutate( Date = Date %>% 
            lubridate::parse_date_time("m y") %>% 
            yearmonth()) %>% 
    drop_na() %>% #drop NAs
    as_tsibble(index = "Date") %>% #specify Date column as index
    filter(Date <= yearmonth(as.Date("2018-12-01"))) #drop early 2019 data
# convert to ts format
tax_ts <- as.ts(tax_tbl)

To do so we drop the “Type” column, exclude data from 2019, combine month and year into one column, and convert our data from tabular to time series format.

Next, we verify that there are no missing data issues and proceed to visualizing German Wage and Income Tax from 1999 to 2018:

#Explore missing data
#has_gaps(tax_tbl) # implicit missings
#colSums(is.na(tax_tbl[, "Value"])) # explicit missings
#Visualize German Tax Data
plot_org <- tax_tbl %>% 
  ggplot(aes(Date, Value / 1000)) + # to get the axis on a more manageable scale
  geom_line() +
  theme_minimal() +
  labs(title = "German Wage and Income Taxes 1999 - 2018", x = "Year", y = "Euros")
plot_org

We observe seasonality and trend in our data. German tax income, likely similar to American tax incomes, appears to have a major spike once a year (likely when personal taxes are due) with multiple smaller spikes throughout the year. Additionally, the total tax income is climbing on a year-to-year basis.

Data Preparation

What does this mean?

We have to difference our data to make our non-stationary time series stationary and remove the variance within our data that is present from level to level. This, differencing time series data to make it stationary, is an essential step for classical time series models much like it will be for our random forest model.

Reminder: making our data stationary means that the mean and variance of the series is stable and does not change over time. It implies stability.

Difference and Transform

We apply first order differencing, the difference between consecutive obervations (ie. one month to the next), to stabilize the mean of our time series and a log transformation to stabilize the variance. The prior plot is set above our transformed plot to highlight the difference:

# pretend we're in December 2017 and have to forecast the next twelve months
tax_ts_org <- window(tax_ts, end = c(2017, 12))
# log transform and difference the data
tax_ts_trf <- tax_ts_org %>% 
  log() %>% 
  diff(nsdiffs(tax_ts_org)) #required order of differencing: 1
# check out the difference! (pun)
plot_trf <- tax_ts_trf %>% 
  autoplot() +
  xlab("Year") +
  ylab("Euros") +
  ggtitle("German Wage and Income Taxes 1999 - 2018") +
  theme_minimal()
gridExtra::grid.arrange(plot_org, plot_trf)

It appears that first order differencing and log transformation have set our data straight. We proceed with stationary data.

Time Delay Embedding

From here, we’ll train our random forest model and make forecasts for 2018, which we’ll later compare to the actual values to assess the accuracy of our model. In order to do so, we’ll later have to reverse our transformations to bring them to the original scale.

Before doing so we have to bring our data into a form that a machine learning algorithm can handle. What we need to do, essentially, is transform a vector into a matrix (a structure that a ML algorithm can work with).

To do so, we make use of a concept called time delay embedding using the embed() function:

lag_order <- 6 # desired lag number (6 months)
horizon <- 12 # forecast duration (12 months)
tax_ts_mbd <- embed(tax_ts_trf, lag_order + 1) # embedding magic!

Time delay embedding represents a time series in a Euclidean space (a 3-D geometrical space) and allows us to proceed with any linear or non-linear regression method on time series data, whether Random Forest, Gradient Boosting, etc. We set our desired number of lags to be 6 months and the duration of our forecast to be 12 months.

Model Building

We make use of a direct forecasting strategy, split our data into training and test sets, and fit our “out of the box” random forest model:

#train-test  split
y_train <- tax_ts_mbd[, 1] # the target
X_train <- tax_ts_mbd[, -1] # everything but the target
y_test <- window(tax_ts, start = c(2018, 1), end = c(2018, 12)) # 2018
X_test <- tax_ts_mbd[nrow(tax_ts_mbd), c(1:lag_order)] # test set consisting of 6 most recent values (6 lags) of training set
#forecast
forecasts_rf <- numeric(horizon)
for (i in 1:horizon){
  # set seed
  set.seed(333)
  # fit the model
  fit_rf <- randomForest(X_train, y_train)
  # predict using the test set
  forecasts_rf[i] <- predict(fit_rf, X_test)
  # here is where we repeatedly reshape the training data to reflect the time distance
  # corresponding to the current forecast horizon.
  y_train <- y_train[-1] 
  X_train <- X_train[-nrow(X_train), ] 
}

We train 12 models and get 12 separate forecasts (one for 1 mo, one for 2mos, etc etc. up until 12 months) and before we can assess our Random Forest model, we have to transform the forecasts back to the original scale:

# calculate the exp term
exp_term <- exp(cumsum(forecasts_rf))
# extract the last observation from the time series (y_t)
last_observation <- as.vector(tail(tax_ts_org, 1))
# calculate the final predictions
backtransformed_forecasts <- last_observation * exp_term
# convert to ts format
y_pred <- ts(
  backtransformed_forecasts,
  start = c(2018, 1),
  frequency = 12
)
# add the forecasts to the original tibble
tax_tbl <- tax_tbl %>% 
  mutate(Forecast = c(rep(NA, length(tax_ts_org)), y_pred))
# visualize the forecasts
plot_fc <- tax_tbl %>% 
  ggplot(aes(x = Date)) +
  geom_line(aes(y = Value / 1000)) +
  geom_line(aes(y = Forecast / 1000), color = "blue") +
  theme_minimal() +
  labs(
    title = "Forecast of the German Wage and Income Tax for the Year 2018",
    x = "Year",
    y = "Euros"
  )
plot_fc

accuracy(y_pred, y_test)
##                ME   RMSE      MAE     MPE     MAPE      ACF1 Theil's U
## Test set 173403.6 368796 253322.1 2.05931 2.646831 0.2240158 0.0734833

We reverse the effects of earlier differencing and (log) transformation by exponentiating the cumulative sum of our transformed forecasts (reverse log) and multiplying the result with the last observation of our time series (reverse diff) … the resulting plot and output statistics hold some promise.

Based on the plot, our forecast appears to capture the trend and seasonality of the data.

When we consider accuracy metrics, on the surface it appears that we have a high RMSE but when we consider that we’d divided our tax value by 1000 when plotting it brings the high RMSE value into perspective. We’re dealing with large tax revenue values and so a larger RMSE is to be expected. Thus we put more of an emphasis on the MAPE. The MAPE (2.64%) is excellent. Especially when we consider that this is performance was with respect to unseen test data.

For comparison, we visit the performance statistics of a Seasonal Naive model:

snaive <- forecast(snaive(tax_ts_org), h = horizon)
tax_ts %>% 
  autoplot() +
  autolayer(snaive, PI = FALSE)

accuracy(snaive, y_test)
##                    ME     RMSE      MAE      MPE     MAPE     MASE      ACF1
## Training set 216938.7 470874.8 388618.5 2.852533 6.648711 1.000000 0.6303982
## Test set     485006.8 495096.8 485006.8 5.507943 5.507943 1.248028 0.1382461
##               Theil's U
## Training set         NA
## Test set     0.09267227

The plot looks decent although it doesn’t quite capture the upward trend in the same way our random forest model did. When we compare accuracy statistics, we see that our Random Forest model had a far lower RMSE value with a MAPE that was half the value of our seasonal naive model’s performance.

Our Random Forest model was a far better model than a seasonal naive model in predicting German wage and taxes for 2018.

Tuning (A Different Approach)

  1. Pick hyperparameters,
  2. Select an appropriate range of values for each
  3. Determine the “best” configuration

We’re going to create Time Slices in order to perform cross-validation on a time series. Essentially, training sets consiste only of observations that occurred prior to the observation that forms the test set. No future observations can be used in creating the forecast. The approach boils down to creating mutliple train/test splits, testing each hyperparameter with multiple training/testing sets.

Check out Hyndman’s article on cross-validation for time series: https://robjhyndman.com/hyndsight/tscv/

#Hold out last 12 observations from our train set because this is how far we want to predict into the future
caret::createTimeSlices(
  1:nrow(X_train),
  initialWindow = nrow(X_train) - horizon,
  horizon = horizon,
  fixedWindow = TRUE
)
## $train
## $train$Training197
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
## 
## 
## $test
## $test$Testing197
##  [1] 198 199 200 201 202 203 204 205 206 207 208 209
tr_control <- caret::trainControl(
  method = 'timeslice',
  initialWindow = nrow(X_train) - horizon,
  fixedWindow = TRUE
)

Random Forest 2 most important hyperparameters are ntree, the number of tress and mtry, the number of predictors that get sampled at each split in the tree.

Typically, good values for ntree are a few hundred or thousand (more trees, greater accuracy). This approach is computationally expensive, so for graduate students trying to do a demonstration: 500 trees is enough.

The mtry parameter is the number of predictors that get considered as plitting candidates at each node. A good rule of thumb is p (number of predictors)/3.

We will use the traditional methodology to find mtry. The variable holdout_result says the best model will produced with mtry 6.

tune_grid <- expand.grid(
  mtry = c(
    ncol(X_train), # p
    ncol(X_train) / 3, # p / 3
    ceiling(sqrt(ncol(X_train))) # square root of p
  )
)
holdout_result <- caret::train(
  data.frame(X_train),
  y_train,
  method = 'rf',
  trControl = tr_control,
  tuneGrid = tune_grid
)
holdout_result
## Random Forest 
## 
## 209 samples
##   6 predictor
## 
## No pre-processing
## Resampling: Rolling Forecasting Origin Resampling (1 held-out with a fixed window) 
## Summary of sample sizes: 197, 197, 197, 197, 197, 197, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared  MAE       
##   2     0.05345850  NaN       0.05345850
##   3     0.05240308  NaN       0.05240308
##   6     0.05164967  NaN       0.05164967
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.

Here we are creating k-fold cross validation, helping us figure out which configuration is ‘best’. Cross-validation works for time series, if it is autoregressive. This means that k-fold CV works if the predictors the model are lagged verions of the response.

Using repeated cross validation we get an mtry value of 2 for the best model.

tr_control <- trainControl(
  method = 'repeatedcv',
  number = 10, 
  repeats = 3
)
kfold_result <- caret::train(
  data.frame(X_train),
  y_train,
  method = 'rf',
  trControl = tr_control,
  tuneGrid = tune_grid
)
kfold_result
## Random Forest 
## 
## 209 samples
##   6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 188, 188, 187, 187, 189, 188, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared   MAE       
##   2     0.09321239  0.9730781  0.06349543
##   3     0.09487457  0.9710779  0.06309934
##   6     0.10588903  0.9614294  0.06653892
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
# set up our empty forecast tibble
forecasts_rf <- tibble(
  mtry_holdout = rep(0, horizon),
  mtry_kfold = rep(0, horizon)
)
# collect the two mtry values from the tuning step
mtrys <- c(
  holdout_result$bestTune[[1]],   
  kfold_result$bestTune[[1]]
)
# train the model in a double loop
for (i in seq_len(length(mtrys))) {
  for (j in seq_len(horizon)) {
    # set seed
    set.seed(2019)
    # fit the model
    fit_rf2 <- randomForest(X_train, y_train, mtry = mtrys[i])
    # predict using the test set
    forecasts_rf[j, i] <- predict(fit_rf2, X_test)
    # here is where we repeatedly reshape the training data to reflect the time                            # distance corresponding to the current forecast horizon.
    y_train <- y_train[-1] 
    X_train <- X_train[-nrow(X_train), ] 
  }
}
last_observation <- as.vector(tail(tax_ts_org, 1))
forecasts <- forecasts_rf %>% 
  purrr::map_df(function(x) exp(cumsum(x)) * last_observation)
accuracies <- forecasts %>% 
  purrr::map(function(x) accuracy(x, as.vector(y_test))) %>%
  do.call(rbind, .)
accuracies
##                ME     RMSE      MAE      MPE     MAPE
## Test set 453852.4 485538.1 453852.4 5.118042 5.118042
## Test set 350752.4 459037.6 362697.2 3.727779 3.881447

The tuned model with with k-fold has a reduced RMSE and a lower MAPE. The tuned model does perform better.

Conclusion

It appears that this was a good real world application of a Random Forest model. We were able to predict German wage and income taxes for a full year with less than 3% mean absolute percent error (MAPE) and further improve the model via hyper parameter tuning.

How might it be further improved?

  1. experiment with box cox rather than log transformation,
  2. compare to a boosting model (ie. AdaBoost or XGBoost)

Reference

This presentation was made with reference to the following resource(s):