This chapter is part of the project “Machine Learning in Medicine”, which brings together 12 members from different backgrounds.
Via this project, we hope to fulfill the following objectives:
The 57th case study (Chapter 13, Machine Learning in Medicine - Cookbook Two, by T. J. Cleophas and A. H. Zwinderman, Springer, 2014).
In this study, the dataset comprises 10 laboratory markers and the survival outcome of 200 patients hospitalized for sepsis. The objective is to predict the mortality risk of these patients using the 10 markers: gamma-glutamyl transferase (gammagt, U/l), aspartate aminotransferase (asat, U/l), alanine aminotransferase (alat, U/l), bilirubin (μmol/l), urea (mmol/l), creatinine (μmol/l), creatinine clearance (ml/min), erythrocyte sedimentation rate (ESR, mm), C-reactive protein (CRP, mg/l) and leucocyte count (×10^9/l). (Le Dong Nhat Nam, 20/01/2017)
In the original textbook, the authors adopted an SVM algorithm (KNIME Data Miner) to solve this classification problem. Le Dong Nhat Nam has successfully extended this experiment using a Support Vector Machine model and an Extreme Gradient Boosting Tree model (http://rpubs.com/ledongnhatnam/243186).
In this study, we suggest another point of view on the classification problem: choosing between parsimony and accuracy at all costs. We compare two cases: 1) a simple LDA or knn model already gives very good results; 2) stacking the knn and svm models gives even better accuracy.
# Load the required packages
library(data.table)  # fast data manipulation
library(ggplot2)     # plotting
library(mltools)     # folds() and one_hot()
library(class)       # knn()
library(LiblineaR)   # regularized linear classifiers
library(magrittr)    # %<>% pipe
library(tidyverse)   # dplyr, tidyr, ...
# Start from a clean workspace
rm(list=ls())
# Read the raw data (semicolon-separated) and give the variables meaningful names
df <- read.csv("svm.csv", sep=";")
df %<>% rename(Death=VAR00001, Ggt=VAR00002, asat=VAR00003, alat=VAR00004, bili=VAR00005, ureum=VAR00006,
               creat=VAR00007, cclear=VAR00008, esr=VAR00009, crp=VAR00010, leucos=VAR00011) %>% as_tibble()
# Drop rows with a missing outcome and recode it as a factor
df <- df %>% filter(!is.na(Death))
df$Death = recode_factor(df$Death, `0`="Survive", `1`="Death")
# ureum uses a decimal comma; replace it with a decimal point before converting to numeric
df$ureum <- as.character(df$ureum) %>% gsub(pattern = ",", replacement = ".", x = .) %>% as.numeric()
df_backup <- df
Look at the data
head(df)
## # A tibble: 6 × 11
## Death Ggt asat alat bili ureum creat cclear esr crp leucos
## <fctr> <int> <int> <int> <int> <dbl> <int> <int> <int> <int> <int>
## 1 Survive 20 23 34 2 3.4 89 -111 2 2 5
## 2 Survive 14 21 33 3 2.0 67 -112 7 3 6
## 3 Survive 30 35 32 4 5.6 58 -116 8 4 4
## 4 Survive 35 34 40 4 6.0 76 -110 6 5 7
## 5 Survive 23 33 22 4 6.1 95 -120 9 6 6
## 6 Survive 26 31 24 3 5.4 78 -132 8 4 8
In this study, we skip the exploratory step; you can find a full exploratory analysis in Le Dong Nhat Nam's RPubs document.
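For readers who want a quick look without leaving this chapter, here is a minimal sketch of the kind of exploratory plot we skip, assuming only the df tibble created above: it reshapes the markers to long format and draws per-marker boxplots by outcome.
# Hypothetical exploratory plot (not part of the original analysis):
# distribution of each marker in the Death vs Survive groups
df %>%
  gather(marker, value, -Death) %>%
  ggplot(aes(x = Death, y = value, fill = Death)) +
  geom_boxplot() +
  facet_wrap(~ marker, scales = "free_y")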
It is a classification problem. We quickly run an LDA model to assess the between-class separation of the Death and Survive groups.
set.seed(11)
# 80/20 train/test split
training <- df %>% sample_n(round(0.8*nrow(df)))
testing <- setdiff(df, training)
# LDA with equal priors for the two classes
lda_mod <- MASS::lda(Death~., data = training, prior = c(0.5,0.5))
lda_pred <- predict(lda_mod, testing[,-1])
caret::confusionMatrix(lda_pred$class, testing$Death, positive = "Death")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survive Death
## Survive 16 1
## Death 2 21
##
## Accuracy : 0.925
## 95% CI : (0.7961, 0.9843)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 2.456e-07
##
## Kappa : 0.8477
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9545
## Specificity : 0.8889
## Pos Pred Value : 0.9130
## Neg Pred Value : 0.9412
## Prevalence : 0.5500
## Detection Rate : 0.5250
## Detection Prevalence : 0.5750
## Balanced Accuracy : 0.9217
##
## 'Positive' Class : Death
##
plot(lda_mod)
We can see that a simple linear discriminant analysis separates the two groups “Death” and “Survive” well, apart from a few observations (cf. the plot and the Accuracy, Kappa, Sensitivity and Specificity above).
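To see which patients the LDA model gets wrong, a minimal sketch, assuming the testing tibble and lda_pred object above, filters the test set on disagreement between prediction and outcome:
# Hypothetical follow-up: inspect the misclassified test observations
testing[lda_pred$class != testing$Death, ]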
We know that the data can already be separated well by a simple model. However, in this study we want to go further. To do that, we build an ensemble learner from a knn model and an svm model. The final model is the result of several steps.
First, we use the mltools package to generate the cross-validation folds.
# Generate folds for cross-validation
training <- as.data.table(training)
testing <- as.data.table(testing)
# Add a row identifier so that out-of-fold predictions can be joined back later
training[, ID := seq_len(nrow(training))]
## Death Ggt asat alat bili ureum creat cclear esr crp leucos ID
## 1: Survive 36 15 29 2 6.3 89 -117 15 7 7 1
## 2: Survive 20 23 34 2 3.4 89 -111 2 2 5 2
## 3: Death 1500 709 1154 400 46.0 601 -10 7 7 8 3
## 4: Survive 30 35 32 4 5.6 58 -116 8 4 4 4
## 5: Survive 16 14 12 3 3.2 87 -124 11 3 3 5
## ---
## 156: Survive 14 14 24 2 2.0 78 -110 12 5 5 156
## 157: Death 26 34 27 2 54.0 666 -10 108 124 30 157
## 158: Death 194 277 257 67 17.0 231 -48 59 28 16 158
## 159: Death 385 435 387 145 35.0 276 -23 86 34 17 159
## 160: Death 376 421 465 176 36.0 256 -18 67 43 18 160
# Assign each row to one of 5 outcome-stratified folds
training[, FoldID := folds(Death, nfolds=5, stratified=TRUE, seed=2016)]
## Death Ggt asat alat bili ureum creat cclear esr crp leucos ID
## 1: Survive 36 15 29 2 6.3 89 -117 15 7 7 1
## 2: Survive 20 23 34 2 3.4 89 -111 2 2 5 2
## 3: Death 1500 709 1154 400 46.0 601 -10 7 7 8 3
## 4: Survive 30 35 32 4 5.6 58 -116 8 4 4 4
## 5: Survive 16 14 12 3 3.2 87 -124 11 3 3 5
## ---
## 156: Survive 14 14 24 2 2.0 78 -110 12 5 5 156
## 157: Death 26 34 27 2 54.0 666 -10 108 124 30 157
## 158: Death 194 277 257 67 17.0 231 -48 59 28 16 158
## 159: Death 385 435 387 145 35.0 276 -23 86 34 17 159
## 160: Death 376 421 465 176 36.0 256 -18 67 43 18 160
## FoldID
## 1: 2
## 2: 3
## 3: 3
## 4: 4
## 5: 2
## ---
## 156: 3
## 157: 2
## 158: 2
## 159: 1
## 160: 2
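Because the folds are stratified on the outcome, each fold should contain roughly the same proportion of deaths and survivors; a quick check of this, assuming the FoldID column created above:
# Hypothetical check (not in the original analysis): fold-by-outcome counts
training[, table(FoldID, Death)]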
Now we tune the knn model on the training set to determine the best parameter (the number of neighbours k).
# Base model 1: knn
knnCV <- list()
knnCV[["Features"]] <- names(training[,2:11])
knnCV[["Tuning"]] <- CJ(k=seq(1, 20))
knnCV[["BestScore"]] <- 0
for(i in seq_len(nrow(knnCV[["Tuning"]]))){
  accuracy <- numeric()
  pred <- list()
  parameter <- knnCV[["Tuning"]][i]
  # Cross-validation
  for(foldID in 1:5){
    # Create the train/test folds
    testFold <- training[FoldID == foldID,]
    trainFolds <- setdiff(training, testFold)
    # Train the knn model
    testFold %<>% mutate(Pred = knn(train = trainFolds[,knnCV$Features,with = FALSE], test = testFold[,knnCV$Features, with=FALSE], cl = trainFolds$Death, k = parameter$k))
    pred[[foldID]] <- as.data.table(testFold[, c("ID","FoldID", "Pred")])
    # Accuracy on the fold
    accuracy0 <- mean(testFold$Pred == testFold$Death)
    accuracy <- c(accuracy, accuracy0)
  }
  # The overall (mean) accuracy across folds
  accuracy0 <- mean(accuracy)
  # Fill the tuning grid with the accuracy
  knnCV$Tuning[i, Accuracy := accuracy0]
  if(accuracy0 > knnCV[["BestScore"]]){
    knnCV[["BestScore"]] <- accuracy0
    knnCV[["Bestparam"]] <- knnCV[["Tuning"]][i]
    knnCV[["BestPreds"]] <- rbindlist(pred)
  }
}
#The best parameter of knn model
knnCV[["Bestparam"]]
## k Accuracy
## 1: 5 0.9310239
The best parameter for the knn model is k = 5, with a cross-validated accuracy of 0.93.
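As a quick sanity check on this choice of k, a minimal sketch, assuming the knnCV$Tuning grid filled above, plots the cross-validated accuracy against k:
# Hypothetical visualization of the knn tuning results
ggplot(knnCV$Tuning, aes(x = k, y = Accuracy)) +
  geom_line() +
  geom_point()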
Next, we tune the svm model on the same training set. The tuning grid covers LiblineaR types 1, 6 and 7 (L2-regularized L2-loss support vector classification, L1-regularized logistic regression and L2-regularized logistic regression, respectively; all linear models) and six values of the cost parameter.
# Base model 2: regularized linear classifiers (LiblineaR)
svmCV <- list()
svmCV[["Features"]] <- names(training[,2:11])
svmCV[["Tuning"]] <- CJ(type=c(1,6,7), cost=c(.001, .01, .1, 1, 10, 100))
svmCV[["BestScore"]] <- 0
for(i in seq_len(nrow(svmCV[["Tuning"]]))){
  accuracy <- numeric()
  pred <- list()
  parameter <- svmCV[["Tuning"]][i]
  # Cross-validation
  for(foldID in 1:5){
    # Create the train/test folds
    testFold <- training[FoldID == foldID,]
    trainFolds <- setdiff(training, testFold)
    # Train the linear model
    logreg <- LiblineaR(data=trainFolds[, svmCV$Features, with=FALSE], target=trainFolds$Death, type=parameter$type, cost=parameter$cost)
    testFold[, Pred := predict(logreg, testFold[, svmCV$Features, with=FALSE])$predictions]
    pred[[foldID]] <- as.data.table(testFold[, c("ID","FoldID", "Pred")])
    # Accuracy on the fold
    accuracy0 <- mean(testFold$Pred == testFold$Death)
    accuracy <- c(accuracy, accuracy0)
  }
  # The overall (mean) accuracy across folds
  accuracy0 <- mean(accuracy)
  # Fill the tuning grid
  svmCV$Tuning[i, Accuracy := accuracy0]
  if(accuracy0 > svmCV[["BestScore"]]){
    svmCV[["BestScore"]] <- accuracy0
    svmCV[["Bestparam"]] <- svmCV[["Tuning"]][i]
    svmCV[["BestPreds"]] <- rbindlist(pred)
  }
}
# Check the best parameters
svmCV[["Bestparam"]]
## type cost Accuracy
## 1: 6 0.1 0.9501772
The best model is type 6 with cost = 0.1 and a cross-validated accuracy of 0.95. Note that in LiblineaR type 6 is L1-regularized logistic regression, a linear model, not a sigmoid-kernel SVM. It performs slightly better than the knn model. In the next step, we build an ensemble learner: a second LiblineaR model stacked on the out-of-fold predictions of knn and svm.
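Before stacking, it can help to see how accuracy varies over the whole (type, cost) grid rather than only at the best point; a minimal sketch, assuming the svmCV$Tuning grid above:
# Hypothetical visualization of the tuning grid (not in the original analysis)
ggplot(svmCV$Tuning, aes(x = factor(cost), y = factor(type), fill = Accuracy)) +
  geom_tile() +
  labs(x = "cost", y = "LiblineaR type")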
# Ensemble of the knn and LiblineaR base models
# Extract the best out-of-fold predictions
best.knn <- knnCV[["BestPreds"]]
best.svm <- svmCV[["BestPreds"]]
# Join the out-of-fold predictions back into the training set (by ID)
training[best.knn, pred.knn := Pred, on="ID"]
## Death Ggt asat alat bili ureum creat cclear esr crp leucos ID
## 1: Survive 36 15 29 2 6.3 89 -117 15 7 7 1
## 2: Survive 20 23 34 2 3.4 89 -111 2 2 5 2
## 3: Death 1500 709 1154 400 46.0 601 -10 7 7 8 3
## 4: Survive 30 35 32 4 5.6 58 -116 8 4 4 4
## 5: Survive 16 14 12 3 3.2 87 -124 11 3 3 5
## ---
## 156: Survive 14 14 24 2 2.0 78 -110 12 5 5 156
## 157: Death 26 34 27 2 54.0 666 -10 108 124 30 157
## 158: Death 194 277 257 67 17.0 231 -48 59 28 16 158
## 159: Death 385 435 387 145 35.0 276 -23 86 34 17 159
## 160: Death 376 421 465 176 36.0 256 -18 67 43 18 160
## FoldID pred.knn
## 1: 2 Survive
## 2: 3 Survive
## 3: 3 Death
## 4: 4 Survive
## 5: 2 Survive
## ---
## 156: 3 Survive
## 157: 2 Death
## 158: 2 Death
## 159: 1 Death
## 160: 2 Death
training[best.svm, pred.svm := Pred, on="ID"]
## Death Ggt asat alat bili ureum creat cclear esr crp leucos ID
## 1: Survive 36 15 29 2 6.3 89 -117 15 7 7 1
## 2: Survive 20 23 34 2 3.4 89 -111 2 2 5 2
## 3: Death 1500 709 1154 400 46.0 601 -10 7 7 8 3
## 4: Survive 30 35 32 4 5.6 58 -116 8 4 4 4
## 5: Survive 16 14 12 3 3.2 87 -124 11 3 3 5
## ---
## 156: Survive 14 14 24 2 2.0 78 -110 12 5 5 156
## 157: Death 26 34 27 2 54.0 666 -10 108 124 30 157
## 158: Death 194 277 257 67 17.0 231 -48 59 28 16 158
## 159: Death 385 435 387 145 35.0 276 -23 86 34 17 159
## 160: Death 376 421 465 176 36.0 256 -18 67 43 18 160
## FoldID pred.knn pred.svm
## 1: 2 Survive Survive
## 2: 3 Survive Survive
## 3: 3 Death Death
## 4: 4 Survive Survive
## 5: 2 Survive Survive
## ---
## 156: 3 Survive Survive
## 157: 2 Death Death
## 158: 2 Death Death
## 159: 1 Death Death
## 160: 2 Death Death
# One-hot encode the base-learner predictions so they can be used as features
training <- one_hot(training, cols=c("pred.knn", "pred.svm"), dropCols=FALSE)
ensCV <- list()
# The stacker sees the original markers plus the one-hot encoded base predictions
ensCV[["Features"]] <- setdiff(colnames(training), c("ID", "FoldID", "Death", "pred.knn", "pred.svm"))
ensCV[["Tuning"]] <- CJ(type=c(1,6,7), cost=c(.001, .01, .1, 1, 10, 100))
ensCV[["BestScore"]] <- 0
for(i in seq_len(nrow(ensCV[["Tuning"]]))){
  accuracy <- numeric()
  pred <- list()
  parameter <- ensCV[["Tuning"]][i]
  # Cross-validation
  for(foldID in 1:5){
    # Create the train/test folds
    testFold <- training[FoldID == foldID,]
    trainFolds <- setdiff(training, testFold)
    # Train the stacking model
    logreg <- LiblineaR(data=trainFolds[, ensCV$Features, with=FALSE], target=trainFolds$Death, type=parameter$type, cost=parameter$cost)
    testFold[, Pred := predict(logreg, testFold[, ensCV$Features, with=FALSE])$predictions]
    pred[[foldID]] <- as.data.table(testFold[, c("ID","FoldID", "Pred")])
    # Accuracy on the fold
    accuracy0 <- mean(testFold$Pred == testFold$Death)
    accuracy <- c(accuracy, accuracy0)
  }
  # The overall (mean) accuracy across folds
  accuracy0 <- mean(accuracy)
  # Insert the score into the tuning grid; keep the model if it is the best so far
  ensCV$Tuning[i, Accuracy := accuracy0]
  if(accuracy0 > ensCV[["BestScore"]]){
    ensCV[["BestScore"]] <- accuracy0
    ensCV[["Bestparam"]] <- ensCV[["Tuning"]][i]
    ensCV[["BestPreds"]] <- rbindlist(pred)
  }
}
# Check the best parameters
ensCV[["Bestparam"]]
## type cost Accuracy
## 1: 6 100 0.9628666
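Stacking pays off mainly when the base learners disagree on some observations; a minimal sketch, assuming the pred.knn and pred.svm columns joined into training above, checks how often that happens in the out-of-fold predictions:
# Hypothetical check: how often the two base learners disagree out-of-fold
training[, mean(pred.knn != pred.svm)]
# On the disagreement cases, what is the true outcome?
training[pred.knn != pred.svm, table(Death)]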
Now we compare the test-set predictions of the knn and svm base models with those of the ensemble.
# knn: refit on the full training set with the best k
testing[, pred.knn := knn(train=training[, knnCV$Features, with=FALSE], test=testing[, knnCV$Features, with=FALSE], cl=training$Death, k=knnCV$Bestparam$k)]
# svm: refit with the best (type, cost)
svm <- LiblineaR(data=training[, svmCV$Features, with=FALSE], target=training$Death, type=svmCV$Bestparam$type, cost=svmCV$Bestparam$cost)
testing[, pred.svm := predict(svm, testing[, svmCV$Features, with=FALSE])$predictions]
# ensemble: one-hot encode the base predictions, then apply the stacker
test <- one_hot(testing, cols=c("pred.knn", "pred.svm"), dropCols=FALSE)
logreg <- LiblineaR(data=training[, ensCV$Features, with=FALSE], target=training$Death, type=ensCV$Bestparam$type, cost=ensCV$Bestparam$cost)
test[, Pred.ensemble := predict(logreg, test[, ensCV$Features, with=FALSE])$predictions]
# Accuracy of knn
mean(testing$Death == testing$pred.knn)
## [1] 0.925
# Accuracy of svm
mean(testing$Death == testing$pred.svm)
## [1] 0.925
# Accuracy of ensemble learner
mean(test$Death == test$Pred.ensemble)
## [1] 0.95
Stacking the knn and svm models gave better results on the test set (0.95 versus 0.925 for each base model). In practice, however, we should weigh this gain against the loss of parsimony: the simple LDA model at the start of the chapter already reached 0.925 accuracy on the same test set.
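To report the ensemble with the same detail as the LDA model earlier, a minimal sketch, assuming the test table built above, reuses caret's confusion matrix:
# Hypothetical final report for the ensemble learner
caret::confusionMatrix(test$Pred.ensemble, test$Death, positive = "Death")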