I have a question about the randomForestSRC package. Specifically, I have found that the calculated VIMPs and OOB performance error seem to vary with the scale of the variables used. This is confusing when trying to interpret the meaning of a VIMP score, in particular whether it is different from zero. I have included the dataset and my work below as a reproducible example. (Of note, I use the package “labelled” in my work).

# packages
library(tidyverse)
library(labelled)
library(flextable)
library(randomForestSRC)

# presets
theme_set(theme_classic())

This was discovered when I performed a regression random forest on an outcome variable (elg.msts) initially reported in percent. We subsequently determined it should be reported as a decimal and I made a new variable (strain) by simple linear transformation, e.g. dividing the initial variable by 100. The R2 of the model stayed the same, but the VIMP scores and OOB performance error were also reduced by approximately a factor of 100 and 10000, respectively.

dta <- readRDS("Datasets/dta_ax_07292022.RDS") %>%
  mutate(id = as.factor(id)) %>% 
  labelled::set_variable_labels(id = "Patient")
# store results in lists
models = list()
# check data
dta[,-c(1)] %>% labelled::look_for()
##  pos variable    label                         col_type values               
##  1   age         Age                           dbl                           
##  2   sex         Sex                           ord      Male                 
##                                                         Female               
##  3   bmi         Body Mass Index               dbl                           
##  4   av          Aortic Valve Morphology       dbl                           
##  5   ai          Aortic Regurgitation          ord      None/Trivial         
##                                                         Mild                 
##                                                         Moderate             
##                                                         Severe               
##  6   aos         Aortic Stenosis               ord      None/Trivial         
##                                                         Mild                 
##                                                         Moderate             
##                                                         Severe               
##  7   htn         Hypertension                  ord      Yes                  
##                                                         No                   
##  8   stroke      Stroke                        ord      Yes                  
##                                                         No                   
##  9   dbts        Diabetes                      ord      Yes                  
##                                                         No                   
##  10  copd        Chronic Obstructive Pulmonar~ ord      Yes                  
##                                                         No                   
##  11  dld         Dyslipidemia                  ord      Yes                  
##                                                         No                   
##  12  cad         Coronary Artery Disease       dbl                           
##  13  ckd         Chronic Kidney Disease        ord      Yes                  
##                                                         No                   
##  14  hf          Heart Failure                 ord      Yes                  
##                                                         No                   
##  15  dmax        Maximum Aortic Diameter       dbl                           
##  16  sx_indic    Primary Indication            ord      Aneurysm             
##                                                         Aortic Valve         
##                                                         Other                
##  17  phenotype   Aneurysm Phenotype            ord      Small                
##                                                         Root Predominant     
##                                                         Ascending Predominant
##                                                         Tubular              
##  18  svensidx    Aorta Area to Height Index    dbl                           
##  19  rgn         Region                        ord      Proximal             
##                                                         Middle               
##                                                         Distal               
##  20  diam        Regional Aortic Diameter      dbl                           
##  21  uthick      Tissue Thickness              dbl                           
##  22  msts        Failure Stress (Cauchy)       dbl                           
##  23  stretch.max Failure Stretch               dbl                           
##  24  strain      Failure Strain                dbl                           
##  25  msts_pk     Failure Stress (Piola-Kircho~ dbl                           
##  26  elg.msts    Failure Strain                dbl
naniar::miss_var_summary(dta)
## # A tibble: 27 x 3
##    variable n_miss pct_miss
##    <chr>     <int>    <dbl>
##  1 dld           4    2.94 
##  2 stroke        1    0.735
##  3 hf            1    0.735
##  4 id            0    0    
##  5 age           0    0    
##  6 sex           0    0    
##  7 bmi           0    0    
##  8 av            0    0    
##  9 ai            0    0    
## 10 aos           0    0    
## # ... with 17 more rows

As demonstrated below, “strain” is a simple linear transformation of “elg.msts”, with the exception of a few observations which were edited during data cleaning between the first and second analysis.

ggplot(dta, aes(y=elg.msts, x=strain))+
  geom_point()+
  cowplot::theme_minimal_grid()

I then ran a random forest regression model. I kept the same parameters used in the actual analysis (e.g. 1000 trees) for reproducibility. For the original variable, the calculated R squared is ~0.74 and performance error is ~94.

# model of elg.msts e.g. Failure Strain (%)
l = list()
l$dta <- dta %>%
  select(-c(msts, stretch.max, strain, msts_pk))

l$var <- "elg.msts"
l$var_label <- "Failure Strain (%)"
l$labels <- data.frame(labs = sapply(l$dta, var_label)) %>%
  rownames_to_column(var="variable")

# random forest
l$mdl <- rfsrc(reformulate(".", l$var),
                data=as.data.frame(l$dta), 
                ntree=1000,
                tree.err=T, forest=T,
                importance=T, err.block=TRUE,
                na.action = "na.impute")
l$mdl
##                          Sample size: 136
##                     Was data imputed: yes
##                      Number of trees: 1000
##            Forest terminal node size: 5
##        Average no. of terminal nodes: 17.72
## No. of variables tried at each split: 8
##               Total no. of variables: 22
##        Resampling used to grow trees: swor
##     Resample size used to grow trees: 86
##                             Analysis: RF-R
##                               Family: regr
##                       Splitting rule: mse *random*
##        Number of random split points: 10
##                      (OOB) R squared: 0.74638667
##    (OOB) Requested performance error: 94.54161314

The calculated VIMP score for age is >800, with most other variables VIMP<100.

# gather VIMP scores into dataframe for plotting
l$df_vimp <- data.frame(vimp = l$mdl$importance) %>%
  rownames_to_column(var="variable") %>%
  arrange(desc(vimp)) %>%
  inner_join(., l$labels)
# VIMP plot
l$plot_vimp <- 
  ggplot(l$df_vimp, aes(y=vimp, x=reorder(labs, vimp)))+
  geom_bar(stat="identity", width=0.75,
           fill="royalblue", col="black")+
  labs(x="", y="Variable Importance (VIMP)", title=l$var_label)+
  ylim(0,1000)+
  coord_flip()+
  cowplot::theme_minimal_vgrid()
l$plot_vimp

models$pct <- l

When these steps are repeated with the strain variable, I get much smaller VIMPs and performance error, but nearly identical R2..

# Failure Strain (decimal) ----

l = list()
l$dta <- dta %>%
  select(-c(msts, stretch.max, elg.msts, msts_pk))

l$var <- "strain"
l$var_label <- "Failure Strain (-)"
l$labels <- data.frame(labs = sapply(l$dta, var_label)) %>%
  rownames_to_column(var="variable")

l$mdl <- rfsrc(reformulate(".", l$var),
                data=as.data.frame(l$dta), 
                ntree=1000,
                tree.err=T, forest=T,
                importance=T, err.block=TRUE,
                na.action = "na.impute")
l$mdl
##                          Sample size: 136
##                     Was data imputed: yes
##                      Number of trees: 1000
##            Forest terminal node size: 5
##        Average no. of terminal nodes: 17.644
## No. of variables tried at each split: 8
##               Total no. of variables: 22
##        Resampling used to grow trees: swor
##     Resample size used to grow trees: 86
##                             Analysis: RF-R
##                               Family: regr
##                       Splitting rule: mse *random*
##        Number of random split points: 10
##                      (OOB) R squared: 0.7437145
##    (OOB) Requested performance error: 0.00966163
l$df_vimp <- data.frame(vimp = l$mdl$importance) %>%
  rownames_to_column(var="variable") %>%
  arrange(desc(vimp)) %>%
  inner_join(., l$labels)
l$plot_vimp <- 
  ggplot(l$df_vimp, aes(y=vimp, x=reorder(labs, vimp)))+
  geom_bar(stat="identity", width=0.75,
           fill="royalblue", col="black")+
  labs(x="", y="Variable Importance (VIMP)", title=l$var_label)+
  ylim(0,.1)+
  coord_flip()+
  cowplot::theme_minimal_vgrid()

models$decimal <- l

# compare results

models$decimal$plot_vimp

models$pct$plot_vimp

Could anyone provide insight into this? I could not readily find an answer in the package documentation. I understand that the R2 (mean squared error) is corrected for variance, so should I take this to mean VIMP scores are not corrected for variance? My understanding is that the actual VIMP value is not useful outside of comparing values within a single model, but these results were suprising.