I have a question about the randomForestSRC package. Specifically, I have found that the calculated VIMPs and OOB performance error seem to vary with the scale of the variables used. This is confusing when trying to interpret the meaning of a VIMP score, in particular whether it is different from zero. I have included the dataset and my work below as a reproducible example. (Of note, I use the package “labelled” in my work).
# packages
library(tidyverse)
library(labelled)
library(flextable)
library(randomForestSRC)
# presets
theme_set(theme_classic())
This was discovered when I performed a regression random forest on an outcome variable (elg.msts) initially reported in percent. We subsequently determined it should be reported as a decimal and I made a new variable (strain) by simple linear transformation, e.g. dividing the initial variable by 100. The R2 of the model stayed the same, but the VIMP scores and OOB performance error were also reduced by approximately a factor of 100 and 10000, respectively.
dta <- readRDS("Datasets/dta_ax_07292022.RDS") %>%
mutate(id = as.factor(id)) %>%
labelled::set_variable_labels(id = "Patient")
# store results in lists
models = list()
# check data
dta[,-c(1)] %>% labelled::look_for()
## pos variable label col_type values
## 1 age Age dbl
## 2 sex Sex ord Male
## Female
## 3 bmi Body Mass Index dbl
## 4 av Aortic Valve Morphology dbl
## 5 ai Aortic Regurgitation ord None/Trivial
## Mild
## Moderate
## Severe
## 6 aos Aortic Stenosis ord None/Trivial
## Mild
## Moderate
## Severe
## 7 htn Hypertension ord Yes
## No
## 8 stroke Stroke ord Yes
## No
## 9 dbts Diabetes ord Yes
## No
## 10 copd Chronic Obstructive Pulmonar~ ord Yes
## No
## 11 dld Dyslipidemia ord Yes
## No
## 12 cad Coronary Artery Disease dbl
## 13 ckd Chronic Kidney Disease ord Yes
## No
## 14 hf Heart Failure ord Yes
## No
## 15 dmax Maximum Aortic Diameter dbl
## 16 sx_indic Primary Indication ord Aneurysm
## Aortic Valve
## Other
## 17 phenotype Aneurysm Phenotype ord Small
## Root Predominant
## Ascending Predominant
## Tubular
## 18 svensidx Aorta Area to Height Index dbl
## 19 rgn Region ord Proximal
## Middle
## Distal
## 20 diam Regional Aortic Diameter dbl
## 21 uthick Tissue Thickness dbl
## 22 msts Failure Stress (Cauchy) dbl
## 23 stretch.max Failure Stretch dbl
## 24 strain Failure Strain dbl
## 25 msts_pk Failure Stress (Piola-Kircho~ dbl
## 26 elg.msts Failure Strain dbl
naniar::miss_var_summary(dta)
## # A tibble: 27 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 dld 4 2.94
## 2 stroke 1 0.735
## 3 hf 1 0.735
## 4 id 0 0
## 5 age 0 0
## 6 sex 0 0
## 7 bmi 0 0
## 8 av 0 0
## 9 ai 0 0
## 10 aos 0 0
## # ... with 17 more rows
As demonstrated below, “strain” is a simple linear transformation of “elg.msts”, with the exception of a few observations which were edited during data cleaning between the first and second analysis.
ggplot(dta, aes(y=elg.msts, x=strain))+
geom_point()+
cowplot::theme_minimal_grid()
I then ran a random forest regression model. I kept the same parameters used in the actual analysis (e.g. 1000 trees) for reproducibility. For the original variable, the calculated R squared is ~0.74 and performance error is ~94.
# model of elg.msts e.g. Failure Strain (%)
l = list()
l$dta <- dta %>%
select(-c(msts, stretch.max, strain, msts_pk))
l$var <- "elg.msts"
l$var_label <- "Failure Strain (%)"
l$labels <- data.frame(labs = sapply(l$dta, var_label)) %>%
rownames_to_column(var="variable")
# random forest
l$mdl <- rfsrc(reformulate(".", l$var),
data=as.data.frame(l$dta),
ntree=1000,
tree.err=T, forest=T,
importance=T, err.block=TRUE,
na.action = "na.impute")
l$mdl
## Sample size: 136
## Was data imputed: yes
## Number of trees: 1000
## Forest terminal node size: 5
## Average no. of terminal nodes: 17.72
## No. of variables tried at each split: 8
## Total no. of variables: 22
## Resampling used to grow trees: swor
## Resample size used to grow trees: 86
## Analysis: RF-R
## Family: regr
## Splitting rule: mse *random*
## Number of random split points: 10
## (OOB) R squared: 0.74638667
## (OOB) Requested performance error: 94.54161314
The calculated VIMP score for age is >800, with most other variables VIMP<100.
# gather VIMP scores into dataframe for plotting
l$df_vimp <- data.frame(vimp = l$mdl$importance) %>%
rownames_to_column(var="variable") %>%
arrange(desc(vimp)) %>%
inner_join(., l$labels)
# VIMP plot
l$plot_vimp <-
ggplot(l$df_vimp, aes(y=vimp, x=reorder(labs, vimp)))+
geom_bar(stat="identity", width=0.75,
fill="royalblue", col="black")+
labs(x="", y="Variable Importance (VIMP)", title=l$var_label)+
ylim(0,1000)+
coord_flip()+
cowplot::theme_minimal_vgrid()
l$plot_vimp
models$pct <- l
When these steps are repeated with the strain variable, I get much smaller VIMPs and performance error, but nearly identical R2..
# Failure Strain (decimal) ----
l = list()
l$dta <- dta %>%
select(-c(msts, stretch.max, elg.msts, msts_pk))
l$var <- "strain"
l$var_label <- "Failure Strain (-)"
l$labels <- data.frame(labs = sapply(l$dta, var_label)) %>%
rownames_to_column(var="variable")
l$mdl <- rfsrc(reformulate(".", l$var),
data=as.data.frame(l$dta),
ntree=1000,
tree.err=T, forest=T,
importance=T, err.block=TRUE,
na.action = "na.impute")
l$mdl
## Sample size: 136
## Was data imputed: yes
## Number of trees: 1000
## Forest terminal node size: 5
## Average no. of terminal nodes: 17.644
## No. of variables tried at each split: 8
## Total no. of variables: 22
## Resampling used to grow trees: swor
## Resample size used to grow trees: 86
## Analysis: RF-R
## Family: regr
## Splitting rule: mse *random*
## Number of random split points: 10
## (OOB) R squared: 0.7437145
## (OOB) Requested performance error: 0.00966163
l$df_vimp <- data.frame(vimp = l$mdl$importance) %>%
rownames_to_column(var="variable") %>%
arrange(desc(vimp)) %>%
inner_join(., l$labels)
l$plot_vimp <-
ggplot(l$df_vimp, aes(y=vimp, x=reorder(labs, vimp)))+
geom_bar(stat="identity", width=0.75,
fill="royalblue", col="black")+
labs(x="", y="Variable Importance (VIMP)", title=l$var_label)+
ylim(0,.1)+
coord_flip()+
cowplot::theme_minimal_vgrid()
models$decimal <- l
# compare results
models$decimal$plot_vimp
models$pct$plot_vimp
Could anyone provide insight into this? I could not readily find an answer in the package documentation. I understand that the R2 (mean squared error) is corrected for variance, so should I take this to mean VIMP scores are not corrected for variance? My understanding is that the actual VIMP value is not useful outside of comparing values within a single model, but these results were suprising.