The data set

In this example, the training data set \(\mathcal{D}_n\) is composed of 2231 observations of 39 variables and a survival response \((T,\delta)\). It will be analysed with \(\texttt{randomForestSRC}\) package. \(~\) \(~\)

library("randomForestSRC")
library("survival")

data(peakVO2, package = "randomForestSRC")

dataSet <- peakVO2
head(dataSet, n=3)
##   age betablok dilver nifed acei angioten.II anti.arrhy anti.coag aspirin
## 1  74        1      0     0    0           1          1         1       1
## 2  77        1      0     0    1           0          0         0       1
## 3  79        0      0     0    0           0          0         0       1
##   digoxin nitrates vasodilator diuretic.loop diuretic.thiazide
## 1       0        0           0             1                 0
## 2       1        0           0             1                 0
## 3       1        0           0             1                 0
##   diuretic.potassium.spar lipidrx.statin insulin surgery.pacemaker surgery.cabg
## 1                       0              1       1                 0            0
## 2                       0              0       0                 0            0
## 3                       0              0       0                 0            0
##   surgery.pci surgery.aicd.implant resting.systolic.bp resting.hr smknow
## 1           0                    1                  90         74      0
## 2           0                    0                 134         53      0
## 3           0                    0                 108        104      0
##   q.wave.mi      bmi niddm lvef.metabl peak.rer peak.vo2 interval cad died
## 1         0 26.07807     0          15     1.12     11.9      374   1    0
## 2         0 20.83791     0          20     0.98     24.0      755   0    0
## 3         0 20.72785     0          40     1.15     11.2      225   0    1
##    ttodead      bun   sodium      hgb  glucose male black     crcl
## 1 1.207392 25.00000 141.0000 14.60000 89.00000    1     0 49.92778
## 2 3.611225 20.00000 138.0000 13.10000 90.00000    0     0 60.35000
## 3 0.238193 29.68983 139.8368 12.87517 98.74435    1     0 33.61172

\(~\)

A. Building the model

First, \(\mathcal{B}\) bootstrap samples are generated from the original data \(\mathcal{D}_n\) and used to grow an ensemble of trees (forest).

\(~\)

B <- 1000

# Building a RSF
RF_obj <- rfsrc(Surv(ttodead,died)~., dataSet, ntree = B,  membership = TRUE, importance=TRUE)

# Printing the RF object  
print(RF_obj)
##                          Sample size: 2231
##                     Number of deaths: 726
##                      Number of trees: 1000
##            Forest terminal node size: 15
##        Average no. of terminal nodes: 107.658
## No. of variables tried at each split: 7
##               Total no. of variables: 39
##        Resampling used to grow trees: swor
##     Resample size used to grow trees: 1410
##                             Analysis: RSF
##                               Family: surv
##                       Splitting rule: logrank *random*
##        Number of random split points: 10
##                           (OOB) CRPS: 0.1538013
##    (OOB) Requested performance error: 0.29473597

\(~\) The number of trees \(\mathcal{B}\) is a tuning parameter. Its choices is made in order to minimize the prediction error (1-C-index, calculated in an out-of-bag sample (not the training sample):

B. Predicting for new observations

A new observation of an hypothetical individual (not in \(\mathcal{D}_n\)) is collected. \(~\)

# Creating an hypothetical observation 
newdata <- data.frame(lapply(1:ncol(RF_obj$xvar),function(i){median(RF_obj$xvar[,i])}))
colnames(newdata) <- RF_obj$xvar.names
newdata [,which(RF_obj$xvar.names == "peak_vo2")] <- quantile(RF_obj$xvar$peak_vo2, 0.25)

newdata
##   age betablok dilver nifed acei angioten.II anti.arrhy anti.coag aspirin
## 1  55        1      0     0    1           0          0         0       0
##   digoxin nitrates vasodilator diuretic.loop diuretic.thiazide
## 1       1        0           0             1                 0
##   diuretic.potassium.spar lipidrx.statin insulin surgery.pacemaker surgery.cabg
## 1                       0              0       0                 0            0
##   surgery.pci surgery.aicd.implant resting.systolic.bp resting.hr smknow
## 1           0                    0                 110         75      0
##   q.wave.mi      bmi niddm lvef.metabl peak.rer peak.vo2 interval cad bun
## 1         0 27.77673     0          20      1.1     15.7      480   0  23
##     sodium      hgb  glucose male black     crcl
## 1 139.8042 13.58482 96.92888    1     0 83.13088

\(~\) We “drop” it in the built forest to obtain an ensemble prediction of his survival function:

\(~\)

y.pred <- predict(RF_obj,newdata = rbind(newdata,RF_obj$xvar)[1,])

par(cex.axis = 1.0, cex.lab = 1.0, cex.main = 1.0, mar = c(6.0,6,1,1), mgp = c(4, 1, 0))
plot(round(y.pred$time.interest,2),y.pred$survival[1,], type="l", xlab="Time (Year)",   
   ylab="Survival", col=1, lty=1, lwd=2)

\(~\) Can we trust this prediction? We had acces to the C-index, but we can also check the Brier Score curve. \(~\)

## obtain Brier score using KM censoring distribution estimators
bs.km <- get.brier.survival(RF_obj, cens.mode = "km")$brier.score

## plot the brier score
plot(bs.km, type = "s", col = 2, ylab="Brier Score")

C. Variable Importance

RSF provides a fully nonparametric measure of variable importance (VIMP). For that, we need a accuracy measure and the following steps: to calculate the VIMP of a variable \(X_j\) in the data set:

  1. For each tree in the forest we:
  1. Calculate the prediction error in the OOB data, denoted as \(errOOB_b\) (Brier score, C-index).

  2. Permute the variable \(x_j\) in the \(OOB_b\) data, which results in a perturbed OOB data, \(\widetilde{OOB_b^j}\).

  3. Compute \(err\widetilde{OOB_b^j}\).

  1. Calculate: \[\text{VIMP}(X_j) = \frac{1}{\mathcal{B}} \sum_{b =1 }^{\mathcal{B}}{ (err\widetilde{OOB_b^j} - errOOB_b)}.\]