The data used in this demo comes from the UCI Machine Learning Repository; specifically, the heart disease data set. This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. The outcome is the presence of heart disease in the patient, integer valued from 0 (no presence) to 4.
The data set description can be found at http://archive.ics.uci.edu/ml/datasets/Heart+Disease.
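The code in this section relies on a few packages. A minimal setup chunk, assuming they are installed (skip it if they are already loaded elsewhere in the document):
# randomForest provides rfImpute(), randomForest() and varImpPlot();
# tree provides tree(), cv.tree() and prune.misclass(); ggplot2 is used for the plots
library(randomForest)
library(tree)
library(ggplot2)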
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header=FALSE)
head(data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
colnames(data) <- c( "age","sex", "cp", "trestbps", "chol","fbs", "restecg", "thalach",
"exang","oldpeak", "slope", "ca", "thal","hd")
dim(data)
## [1] 303 14
head(data)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal hd
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : num 1 1 1 1 0 1 0 0 1 1 ...
## $ cp : num 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
## $ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : num 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : num 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : chr "0.0" "3.0" "2.0" "0.0" ...
## $ thal : chr "6.0" "3.0" "7.0" "3.0" ...
## $ hd : int 0 2 1 0 0 0 3 0 2 1 ...
table(data == "?")
##
## FALSE TRUE
## 4236 6
data[data == "?"] <- NA
sapply(data, function(x) sum(is.na(x)))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal hd
## 0 0 0 4 2 0
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"
data$sex <- as.factor(data$sex)
data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)
Since two columns (“ca” and “thal”) contained “?”s (which we have since converted to NAs), R treats the levels of these factors as strings. We know they are integers, so we first convert the strings to integers.
data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca)
data$thal <- as.integer(data$thal)
data$thal <- as.factor(data$thal)
The next line replaces 0 with “Healthy” and all other values (1–4) with “Unhealthy”.
data$hd <- ifelse(test=data$hd == 0, yes="Healthy", no="Unhealthy")
data$hd <- as.factor(data$hd)
Typically, R packages encourage two ways of handling missing values: remove the samples (rows) that contain missing data, or impute the missing values.
In this session, we impute the missing values in the data set using the rfImpute() function. In most cases, the number of iterations is set between 4 and 6 (Breiman et al.). However, if you really want to micromanage the imputation, you can change the number of iterations, the number of trees grown at each iteration (the default is 300), and the number of variables considered at each split.
set.seed(42)
data.imputed <- rfImpute(hd ~ ., data = data, iter=6)
## ntree OOB 1 2
## 300: 17.49% 12.80% 23.02%
## ntree OOB 1 2
## 300: 16.83% 14.02% 20.14%
## ntree OOB 1 2
## 300: 17.82% 13.41% 23.02%
## ntree OOB 1 2
## 300: 17.49% 14.02% 21.58%
## ntree OOB 1 2
## 300: 17.16% 12.80% 22.30%
## ntree OOB 1 2
## 300: 18.15% 14.63% 22.30%
Question 1: How do you interpret the outputs?
- The 6 blocks correspond to the 6 imputation iterations.
- Column 1 corresponds to the first class (Healthy): 12.80% of these samples are misclassified.
- Column 2 corresponds to the second class (Unhealthy).
- The OOB column is the overall out-of-bag misclassification rate, which combines the two class error rates weighted by class size.
Question 2: Should we change the number of iterations, or is it good enough?
- The OOB error rate is not stable across iterations –> we could increase the number of iterations.
- An OOB error of 0 would be the ideal situation: all cases would fall in the case column and all controls in the control column –> no misclassification.
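If the OOB error were still unstable, the imputation could be re-run with more iterations and larger forests. A sketch (the object name data.imputed.more and the values 10 and 500 are purely illustrative and are not used below; iter and ntree are arguments accepted by rfImpute()):
set.seed(42)
# Illustrative only: more imputation iterations and more trees per iteration
data.imputed.more <- rfImpute(hd ~ ., data = data, iter = 10, ntree = 500)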
The tree library is used to construct classification trees. We now use the tree() function to fit a classification tree in order to predict the risk of heart disease (hd) from all the other variables.
set.seed(42)
tree<- tree(hd ~ ., data.imputed)
summary(tree)
##
## Classification tree:
## tree(formula = hd ~ ., data = data.imputed)
## Variables actually used in tree construction:
## [1] "thal" "cp" "age" "ca" "trestbps" "thalach"
## [7] "exang" "oldpeak" "chol" "restecg"
## Number of terminal nodes: 17
## Residual mean deviance: 0.5585 = 159.7 / 286
## Misclassification error rate: 0.1023 = 31 / 303
Why did the tree use only 10 of the 13 variables?
- Splits are chosen greedily; variables that never improve a split are not used to grow the tree, so they are not important here.
- The tree partitions the data into 17 regions (terminal nodes).
- Misclassification error rate 31/303: about 10% misclassification.
- For the residual mean deviance, a lower value is better.
- Most of the time we report the misclassification error rate to compare two models.
The summary() function lists the variables that are used as internal nodes in the tree, the number of terminal nodes, and the (training) error rate. We see that the training error rate is 10 %. The deviance (159.7) reported in the output of summary() is given by:
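This is the standard classification-tree deviance (as in ISLR), where n_mk is the number of observations in the m-th terminal node that belong to the k-th class and p̂_mk is the corresponding proportion:

$$-2 \sum_{m} \sum_{k} n_{mk} \log \hat{p}_{mk}$$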
A small deviance indicates a tree that provides a good fit to the (training) data. The residual mean deviance reported is simply the deviance divided by n−|T0|, which in this case is 303 − 17 = 286.
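As a quick check of the arithmetic reported by summary():
# residual mean deviance = deviance / (n - number of terminal nodes)
159.7 / (303 - 17)   # ≈ 0.5585, matching the output above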
Question 3: Why only 10 out of 13 variables were used in tree construction?
One of the most attractive properties of trees is that they can be graphically displayed. We use the plot() function to display the tree structure, and the text() function to display the node labels. The argument pretty = 0 instructs R to include the category names for any qualitative predictors, rather than simply displaying a letter for each category.
par(mai=c(0,0,0,0))
plot(tree)
text(tree , pretty = 0)
Question 4: Which predictor is the most important one?
- thal is the most important one: it sits at the root of the tree.
Question 5: What is the best cut point for this predictor?
- The split on thal at the value 3 is the best one for separating the observations into different regions.
- The left branch corresponds to observations for which the split condition is true (YES); the right branch to those for which it is false (NO).
In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error. We split the observations into a training set and a test set, build the tree using the training set, and evaluate its performance on the test data.
set.seed(42)
# Randomly select 80% of the rows for training; the rest form the test set
train.index <- sample(1:nrow(data.imputed), nrow(data.imputed) * 0.8)
train.data <- data.imputed[train.index, ]
test.data <- data.imputed[-train.index, ]
tree.train <- tree(hd ~ ., train.data)
tree.prediction <- predict(tree.train , test.data , type = "class")
Observed <- test.data[, 1]   # the response (hd) is the first column after rfImpute()
table(tree.prediction, Observed)
## Observed
## tree.prediction Healthy Unhealthy
## Healthy 33 6
## Unhealthy 6 16
Question 6: What percentage of test observations are correctly classified?
- Accuracy = (33 + 16) / (33 + 6 + 6 + 16) ≈ 80%, so the classification error is about 20%.
- 20% is higher than the training error on the full data set –> we should be careful about trusting this model too much.
Question 7: What is the sensitivity and specificity of this model?
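A small sketch for Questions 6 and 7, computed from the confusion matrix above and assuming “Unhealthy” is treated as the positive class:
conf <- table(tree.prediction, Observed)
accuracy <- sum(diag(conf)) / sum(conf)                                    # (33 + 16) / 61
sensitivity <- conf["Unhealthy", "Unhealthy"] / sum(conf[, "Unhealthy"])   # 16 / 22
specificity <- conf["Healthy", "Healthy"] / sum(conf[, "Healthy"])         # 33 / 39
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)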
The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. This is because the resulting tree might be too complex. A smaller tree with fewer splits (that is, fewer regions R1, . . . ,RJ) might lead to lower variance and better interpretation at the cost of a little bias. One possible solution to the process described above is to grow a very large tree (T0), and then prune it back in order to obtain a subtree. Intuitively, our goal is to select a subtree that leads to the lowest test error rate.
Here, we consider whether pruning the tree might lead to improved results. The function cv.tree() performs cross-validation in order to determine the optimal level of tree complexity.
set.seed(42)
cv <- cv.tree(tree.train , FUN = prune.misclass)
cv
## $size
## [1] 21 15 12 10 8 6 4 2 1
##
## $dev
## [1] 71 72 72 66 68 71 72 72 117
##
## $k
## [1] -Inf 0.0000000 0.3333333 1.0000000 2.5000000 3.0000000 4.0000000
## [8] 5.5000000 60.0000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
The cv.tree() function reports the number of terminal nodes of each tree considered (size), the corresponding error rate (dev), and the value of the cost-complexity parameter (k) used.
- size = number of leaf (terminal) nodes
- dev = number of cross-validation errors
- k = cost-complexity pruning parameter
Question 8: What is the optimum number of terminal nodes and why?
- The tree with 10 leaf nodes is the best model (see the plot below): it has the fewest cross-validation errors.
Question 9: How can we link this number to the bias-variance trade-off?
- The smaller tree has a slightly higher misclassification rate (more bias) but lower variance, and we still select it –> that is the trade-off.
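A one-line check of the answer to Question 8, using the cv object above:
cv$size[which.min(cv$dev)]   # size with the fewest cross-validation errors (10 here)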
The error rate is plotted as a function of tree size.
par(mfrow = c(1, 1))
plot(cv$size, cv$dev, type = "b", xlab="Tree size(model complexity)", ylab="Error rate")
We now apply the prune.misclass() function in order to prune the tree and obtain the ten-node tree.
set.seed(42)
prune.tree <- prune.misclass(tree.train , best = 10)
plot(prune.tree)
text(prune.tree , pretty = 0)
Once again, we apply the predict() function.
set.seed(42)
prune <- predict(prune.tree , test.data , type = "class")
Observed <- test.data[, 1]
table(prune, Observed)
## Observed
## prune Healthy Unhealthy
## Healthy 33 7
## Unhealthy 6 15
Question 10: How well does this pruned tree perform on the test data?
–> A linear model outperforms this decision tree because it uses all of the variables, while the decision tree only uses a subset of them. –> If the linear model were restricted to the same variables as the decision tree, the tree might outperform it.
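For Question 10, a quick sketch of the pruned tree's test accuracy, computed from the confusion matrix above:
conf.prune <- table(prune, Observed)
sum(diag(conf.prune)) / sum(conf.prune)   # (33 + 15) / 61, about 79% correct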
We use the same data to build a Random Forest. In this example, the response variable is binary, so by default randomForest() will set mtry = sqrt(13) ≈ 3.6, rounded down to 3. Also, by default, randomForest() grows 500 trees.
set.seed(42)
model <- randomForest(hd ~ ., data=data.imputed, proximity=TRUE)
model
##
## Call:
## randomForest(formula = hd ~ ., data = data.imputed, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.17%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 143 21 0.1280488
## Unhealthy 28 111 0.2014388
For most machine learning methods, you need to divide the data manually into a “training” set and a “test” set. This allows you to train the method using the training data, and then test it on data it was not originally trained on.
In contrast, Random Forests split the data into “training” and “test” sets for you. This is because Random Forests use bootstrapped data, so not every sample is used to build every tree. The training dataset is the bootstrapped data, and the test dataset consists of the remaining samples, which are called the Out-Of-Bag (OOB) data.
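The OOB error shown in the print-out can also be extracted from the fitted object directly; a small sketch using the err.rate matrix (which the plotting code further down relies on as well):
# Final OOB error rate after all 500 trees
model$err.rate[nrow(model$err.rate), "OOB"]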
Question 11: How do you interpret the OOB error rate? Can it be considered a measure to evaluate the performance of our prediction model, and why?
- It is acceptable as a performance estimate because each sample's OOB prediction comes only from trees that were not built with that sample.
Question 12: What is the False Positive Value (FPV) and False Negative Value (FNV) of this model?
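A sketch for Question 12, reading the counts from model$confusion (rows are observed classes, columns are predicted classes) and again assuming “Unhealthy” is the positive class:
conf.rf <- model$confusion[, 1:2]        # drop the class.error column
FP <- conf.rf["Healthy", "Unhealthy"]    # healthy samples predicted as unhealthy
FN <- conf.rf["Unhealthy", "Healthy"]    # unhealthy samples predicted as healthy
FPR <- FP / sum(conf.rf["Healthy", ])    # false positive rate
FNR <- FN / sum(conf.rf["Unhealthy", ])  # false negative rate
c(FP = FP, FN = FN, FPR = FPR, FNR = FNR)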
model$importance
## MeanDecreaseGini
## age 12.747981
## sex 4.377738
## cp 19.676051
## trestbps 10.694440
## chol 11.692057
## fbs 1.278029
## restecg 2.975147
## thalach 16.316744
## exang 8.264741
## oldpeak 15.873510
## slope 6.382230
## ca 18.521296
## thal 19.437404
varImpPlot(model)
Question 13: Based on the output, which variable is the most important one?
- cp has the largest MeanDecreaseGini, so it is the most important variable in the random forest.
Question 14: Compare the results with the outputs of Section 3.3 (Decision Tree) and explain why the results are inconsistent.
–> The random forest result is based on 500 trees (each considering a random subset of variables at every split), while the decision tree result is based on a single tree, so the importance rankings can differ.
Now check whether the random forest is actually big enough. Up to a point, the more trees in the forest, the better. You can tell you have made enough trees when the OOB error no longer improves.
oob.error.data <- data.frame(
Trees=rep(1:nrow(model$err.rate), times=3),
Type=rep(c("OOB", "Healthy", "Unhealthy"), each=nrow(model$err.rate)),
Error=c(model$err.rate[,"OOB"],
model$err.rate[,"Healthy"],
model$err.rate[,"Unhealthy"]))
ggplot(data=oob.error.data, aes(x=Trees, y=Error)) +
geom_line(aes(color=Type))
After building a random forest with 500 trees, the graph does not make it clear whether the OOB error has settled on a value or whether it would continue to decrease if we added more trees. So we do the whole thing again, but this time with more trees.
set.seed(7)
model <- randomForest(hd ~ ., data=data.imputed, ntree=1000, proximity=TRUE)
model
##
## Call:
## randomForest(formula = hd ~ ., data = data.imputed, ntree = 1000, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.15%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 140 24 0.1463415
## Unhealthy 31 108 0.2230216
oob.error.data <- data.frame(
Trees=rep(1:nrow(model$err.rate), times=3),
Type=rep(c("OOB", "Healthy", "Unhealthy"), each=nrow(model$err.rate)),
Error=c(model$err.rate[,"OOB"],
model$err.rate[,"Healthy"],
model$err.rate[,"Unhealthy"]))
ggplot(data=oob.error.data, aes(x=Trees, y=Error)) +
geom_line(aes(color=Type))
After building a random forest with 1,000 trees, the OOB error (18.15% in this run) is in the same range as before, and we can see convergence in the graph. If we want to compare this random forest to others with different values for mtry (which controls how many variables are considered at each split), then:
oob.values <- vector(length=13)
for(i in 1:13) {
temp.model <- randomForest(hd ~ ., data=data.imputed, mtry=i, ntree=1000)
oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
}
oob.values
## [1] 0.1716172 0.1716172 0.1617162 0.1815182 0.1815182 0.1914191 0.1947195
## [8] 0.1881188 0.1881188 0.2013201 0.1947195 0.1914191 0.1881188
## find the minimum error
min(oob.values)
## [1] 0.1617162
## find the optimal value for mtry...
which(oob.values == min(oob.values))
## [1] 3
Question 15: What was the default value for “mtry”?
- The default for classification is floor(sqrt(13)) = 3.
Question 16: Based on the output, what is the optimum value for “mtry”? Is this value the same as the default value?
- The optimum is mtry = 3, the same as the default; the best value of mtry gives the best trade-off between variance and bias.
Now, we create a prediction model using the best value for mtry:
model <- randomForest(hd ~ ., data=data.imputed,
ntree=1000,
proximity=TRUE,
mtry=3)
model
##
## Call:
## randomForest(formula = hd ~ ., data = data.imputed, ntree = 1000, proximity = TRUE, mtry = 3)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.17%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 145 19 0.1158537
## Unhealthy 30 109 0.2158273
Multidimensional scaling (MDS) is a means of visualizing how the samples are related to each other.
set.seed(42)
# Start by converting the proximity matrix into a distance matrix.
distance.matrix <- as.dist(1-model$proximity)
mds.stuff <- cmdscale(distance.matrix, eig=TRUE, x.ret=TRUE)
# calculate the percentage of variation that each MDS axis accounts for.
mds.var.per <- round(mds.stuff$eig/sum(mds.stuff$eig)*100, 1)
# now make a fancy looking plot that shows the MDS axes and the variation.
mds.values <- mds.stuff$points
mds.data <- data.frame(Sample=rownames(mds.values),
X=mds.values[,1],
Y=mds.values[,2],
Status=data.imputed$hd)
ggplot(data=mds.data, aes(x=X, y=Y, label=Sample)) +
geom_text(aes(color=Status)) +
theme_bw() +
xlab(paste("MDS1 - ", mds.var.per[1], "%", sep="")) +
ylab(paste("MDS2 - ", mds.var.per[2], "%", sep="")) +
ggtitle("MDS plot") +
theme(plot.title = element_text(hjust = 0.5))
Question 17: Could our model discriminate the cases and the controls perfectly?
–> Quite well: most samples of the same class cluster together, although there are some misclassifications. The first two MDS axes together explain only about 20% of the variation in the distance matrix (the exact percentages appear on the axis labels).
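If useful, the total variation captured by the two plotted axes can be checked from the mds.var.per vector computed above:
sum(mds.var.per[1:2])   # percentage of variation explained by MDS1 + MDS2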