In this example, the training data set \(\mathcal{D}_n\) is composed of 2231 observations of 39 variables and a survival response \((T,\delta)\). It will be analysed with \(\texttt{randomForestSRC}\) package. \(~\) \(~\)
library("randomForestSRC")
library("survival")
data(peakVO2, package = "randomForestSRC")
dataSet <- peakVO2
head(dataSet, n=3)
## age betablok dilver nifed acei angioten.II anti.arrhy anti.coag aspirin
## 1 74 1 0 0 0 1 1 1 1
## 2 77 1 0 0 1 0 0 0 1
## 3 79 0 0 0 0 0 0 0 1
## digoxin nitrates vasodilator diuretic.loop diuretic.thiazide
## 1 0 0 0 1 0
## 2 1 0 0 1 0
## 3 1 0 0 1 0
## diuretic.potassium.spar lipidrx.statin insulin surgery.pacemaker surgery.cabg
## 1 0 1 1 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## surgery.pci surgery.aicd.implant resting.systolic.bp resting.hr smknow
## 1 0 1 90 74 0
## 2 0 0 134 53 0
## 3 0 0 108 104 0
## q.wave.mi bmi niddm lvef.metabl peak.rer peak.vo2 interval cad died
## 1 0 26.07807 0 15 1.12 11.9 374 1 0
## 2 0 20.83791 0 20 0.98 24.0 755 0 0
## 3 0 20.72785 0 40 1.15 11.2 225 0 1
## ttodead bun sodium hgb glucose male black crcl
## 1 1.207392 25.00000 141.0000 14.60000 89.00000 1 0 49.92778
## 2 3.611225 20.00000 138.0000 13.10000 90.00000 0 0 60.35000
## 3 0.238193 29.68983 139.8368 12.87517 98.74435 1 0 33.61172
\(~\)
First, \(\mathcal{B}\) bootstrap samples are generated from the original data \(\mathcal{D}_n\) and used to grow an ensemble of trees (forest).
\(~\)
B <- 1000
# Building a RSF
RF_obj <- rfsrc(Surv(ttodead,died)~., dataSet, ntree = B, membership = TRUE, importance=TRUE)
# Printing the RF object
print(RF_obj)
## Sample size: 2231
## Number of deaths: 726
## Number of trees: 1000
## Forest terminal node size: 15
## Average no. of terminal nodes: 107.658
## No. of variables tried at each split: 7
## Total no. of variables: 39
## Resampling used to grow trees: swor
## Resample size used to grow trees: 1410
## Analysis: RSF
## Family: surv
## Splitting rule: logrank *random*
## Number of random split points: 10
## (OOB) CRPS: 0.1538013
## (OOB) Requested performance error: 0.29473597
\(~\) The number of trees \(\mathcal{B}\) is a tuning parameter. Its choices is made in order to minimize the prediction error (1-C-index, calculated in an out-of-bag sample (not the training sample):
A new observation of an hypothetical individual (not in \(\mathcal{D}_n\)) is collected. \(~\)
# Creating an hypothetical observation
newdata <- data.frame(lapply(1:ncol(RF_obj$xvar),function(i){median(RF_obj$xvar[,i])}))
colnames(newdata) <- RF_obj$xvar.names
newdata [,which(RF_obj$xvar.names == "peak_vo2")] <- quantile(RF_obj$xvar$peak_vo2, 0.25)
newdata
## age betablok dilver nifed acei angioten.II anti.arrhy anti.coag aspirin
## 1 55 1 0 0 1 0 0 0 0
## digoxin nitrates vasodilator diuretic.loop diuretic.thiazide
## 1 1 0 0 1 0
## diuretic.potassium.spar lipidrx.statin insulin surgery.pacemaker surgery.cabg
## 1 0 0 0 0 0
## surgery.pci surgery.aicd.implant resting.systolic.bp resting.hr smknow
## 1 0 0 110 75 0
## q.wave.mi bmi niddm lvef.metabl peak.rer peak.vo2 interval cad bun
## 1 0 27.77673 0 20 1.1 15.7 480 0 23
## sodium hgb glucose male black crcl
## 1 139.8042 13.58482 96.92888 1 0 83.13088
\(~\) We “drop” it in the built
forest to obtain an ensemble prediction of his survival function:
\(~\)
y.pred <- predict(RF_obj,newdata = rbind(newdata,RF_obj$xvar)[1,])
par(cex.axis = 1.0, cex.lab = 1.0, cex.main = 1.0, mar = c(6.0,6,1,1), mgp = c(4, 1, 0))
plot(round(y.pred$time.interest,2),y.pred$survival[1,], type="l", xlab="Time (Year)",
ylab="Survival", col=1, lty=1, lwd=2)
\(~\) Can we trust this prediction? We had acces to the C-index, but we can also check the Brier Score curve. \(~\)
## obtain Brier score using KM censoring distribution estimators
bs.km <- get.brier.survival(RF_obj, cens.mode = "km")$brier.score
## plot the brier score
plot(bs.km, type = "s", col = 2, ylab="Brier Score")
RSF provides a fully nonparametric measure of variable importance (VIMP). For that, we need a accuracy measure and the following steps: to calculate the VIMP of a variable \(X_j\) in the data set:
Calculate the prediction error in the OOB data, denoted as \(errOOB_b\) (Brier score, C-index).
Permute the variable \(x_j\) in the \(OOB_b\) data, which results in a perturbed OOB data, \(\widetilde{OOB_b^j}\).
Compute \(err\widetilde{OOB_b^j}\).