Machine Learning in Medicine project

This chapter is part of the project “Machine Learning in Medicine”, which brings together 12 members from different backgrounds.

Through this project, we hope to fulfill the following objectives:

Introduction

The 57th case study (Chapter 13 - Machine Learning in Medicine-Cookbook Two by T. J. Cleophas and A. H. Zwinderman, Springer, 2014).

In this study, the dataset consists of 10 markers and the survival outcome of 200 patients hospitalized for sepsis. The objective is to predict the mortality risk of these patients from the 10 markers: gamma-glutamyl transferase (gammagt, U/l), aspartate aminotransferase and alanine aminotransferase (U/l), bilirubin (μmol/l), urea (mmol/l), creatinine (μmol/l), creatinine clearance (ml/min), erythrocyte sedimentation rate (ESR, mm), C-reactive protein (CRP, mg/l) and leucocyte count (×10^9/l). (Le Dong Nhat Nam, 20/01/2017)

In the original textbook, the authors used an SVM algorithm (in the KNIME data miner) to solve this classification problem. Le Dong Nhat Nam has successfully extended this experiment with a Support Vector Machine model and an Extreme Gradient Boosting tree model (http://rpubs.com/ledongnhatnam/243186).

In this study, we suggest another point of view on the classification problem: choosing between parsimony and accuracy at all costs. We compare two cases: 1) a simple LDA or knn model already gives very good results; 2) stacking the knn and svm models gives even better accuracy.

Import data

library(data.table)  # fast data manipulation and CJ()
library(ggplot2)     # plots
library(mltools)     # folds() and one_hot()
library(class)       # knn()
library(LiblineaR)   # linear SVM and regularized logistic regression
library(magrittr)    # %<>% pipe
library(tidyverse)   # dplyr, tidyr, readr, ...


# Start from a clean workspace and import the semicolon-separated data
rm(list = ls())
df <- read.csv("svm.csv", sep = ";")


df %<>%
  rename(Death = VAR00001, Ggt = VAR00002, asat = VAR00003, alat = VAR00004,
         bili = VAR00005, ureum = VAR00006, creat = VAR00007, cclear = VAR00008,
         esr = VAR00009, crp = VAR00010, leucos = VAR00011) %>%
  as_tibble()

# Drop rows with a missing outcome and label the outcome levels
df <- df %>% filter(!is.na(Death))
df$Death <- recode_factor(df$Death, `0` = "Survive", `1` = "Death")
# ureum uses a decimal comma; convert it to a point before coercing to numeric
df$ureum <- df$ureum %>% as.character() %>% gsub(pattern = ",", replacement = ".", x = .) %>% as.numeric()
df_backup <- df

Look at the data

head(df)
## # A tibble: 6 × 11
##     Death   Ggt  asat  alat  bili ureum creat cclear   esr   crp leucos
##    <fctr> <int> <int> <int> <int> <dbl> <int>  <int> <int> <int>  <int>
## 1 Survive    20    23    34     2   3.4    89   -111     2     2      5
## 2 Survive    14    21    33     3   2.0    67   -112     7     3      6
## 3 Survive    30    35    32     4   5.6    58   -116     8     4      4
## 4 Survive    35    34    40     4   6.0    76   -110     6     5      7
## 5 Survive    23    33    22     4   6.1    95   -120     9     6      6
## 6 Survive    26    31    24     3   5.4    78   -132     8     4      8

In this study, we skip the exploratory step; that analysis is available in the RPubs document by Le Dong Nhat Nam linked above.

Case 1: A simple LDA model

This is a classification problem. We quickly fit an LDA model to examine the between-class separation of the Death and Survive groups.

set.seed(11)
# 80/20 train/test split
training <- df %>% sample_n(round(0.8 * nrow(df)))
testing <- setdiff(df, training)
# LDA with equal priors for the two classes
lda_mod <- MASS::lda(Death ~ ., data = training, prior = c(0.5, 0.5))
lda_pred <- predict(lda_mod, testing[, -1])
caret::confusionMatrix(lda_pred$class, testing$Death, positive = "Death")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survive Death
##    Survive      16     1
##    Death         2    21
##                                           
##                Accuracy : 0.925           
##                  95% CI : (0.7961, 0.9843)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 2.456e-07       
##                                           
##                   Kappa : 0.8477          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9545          
##             Specificity : 0.8889          
##          Pos Pred Value : 0.9130          
##          Neg Pred Value : 0.9412          
##              Prevalence : 0.5500          
##          Detection Rate : 0.5250          
##    Detection Prevalence : 0.5750          
##       Balanced Accuracy : 0.9217          
##                                           
##        'Positive' Class : Death           
## 
plot(lda_mod)

We can see that a simple linear discriminant analysis separates the “Death” and “Survive” groups well, apart from a few observations (see the plot and the Accuracy, Kappa, Sensitivity and Specificity above).
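
To complement plot(lda_mod), we can look at the distribution of the discriminant scores on the test set. Below is a minimal sketch, assuming lda_pred and testing from above are still in scope:

# Density of the LD1 scores for each outcome group (a sketch)
ld_scores <- data.frame(LD1 = lda_pred$x[, 1], Class = testing$Death)
ggplot(ld_scores, aes(x = LD1, fill = Class)) +
  geom_density(alpha = 0.5) +
  labs(title = "Test-set LD1 scores by outcome")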

Case 2: Better accuracy at all costs

We know that the data can already be separated well by a simple model. In this study, however, we want to go further. To do that, we build an ensemble learner that stacks a knn and an svm model. The final model is the result of four steps:

  1. Tune the knn model with cross-validation and select the best model
  2. Tune the svm model with cross-validation and select the best model
  3. Merge the out-of-fold predictions of the knn and svm models into the training set, one-hot encode them, and train the ensemble learner on top
  4. Compare the accuracy of the knn, svm and stacked knn+svm models.

First, we use the mltools package to generate the cross-validation folds.

# Generate an ID column and stratified CV folds on the training set
training <- as.data.table(training)
testing <- as.data.table(testing)
training[, ID := seq_len(nrow(training))]
##        Death  Ggt asat alat bili ureum creat cclear esr crp leucos  ID
##   1: Survive   36   15   29    2   6.3    89   -117  15   7      7   1
##   2: Survive   20   23   34    2   3.4    89   -111   2   2      5   2
##   3:   Death 1500  709 1154  400  46.0   601    -10   7   7      8   3
##   4: Survive   30   35   32    4   5.6    58   -116   8   4      4   4
##   5: Survive   16   14   12    3   3.2    87   -124  11   3      3   5
##  ---                                                                  
## 156: Survive   14   14   24    2   2.0    78   -110  12   5      5 156
## 157:   Death   26   34   27    2  54.0   666    -10 108 124     30 157
## 158:   Death  194  277  257   67  17.0   231    -48  59  28     16 158
## 159:   Death  385  435  387  145  35.0   276    -23  86  34     17 159
## 160:   Death  376  421  465  176  36.0   256    -18  67  43     18 160
training[, FoldID := folds(Death, nfolds=5, stratified=TRUE, seed=2016)]
##        Death  Ggt asat alat bili ureum creat cclear esr crp leucos  ID
##   1: Survive   36   15   29    2   6.3    89   -117  15   7      7   1
##   2: Survive   20   23   34    2   3.4    89   -111   2   2      5   2
##   3:   Death 1500  709 1154  400  46.0   601    -10   7   7      8   3
##   4: Survive   30   35   32    4   5.6    58   -116   8   4      4   4
##   5: Survive   16   14   12    3   3.2    87   -124  11   3      3   5
##  ---                                                                  
## 156: Survive   14   14   24    2   2.0    78   -110  12   5      5 156
## 157:   Death   26   34   27    2  54.0   666    -10 108 124     30 157
## 158:   Death  194  277  257   67  17.0   231    -48  59  28     16 158
## 159:   Death  385  435  387  145  35.0   276    -23  86  34     17 159
## 160:   Death  376  421  465  176  36.0   256    -18  67  43     18 160
##      FoldID
##   1:      2
##   2:      3
##   3:      3
##   4:      4
##   5:      2
##  ---       
## 156:      3
## 157:      2
## 158:      2
## 159:      1
## 160:      2
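
Because the folds are stratified, every fold should preserve the Death/Survive ratio of the training set. A quick sanity check (a sketch, run on the training table built above):

# Count outcomes per fold; the class balance should be roughly constant
training[, .N, by = .(FoldID, Death)][order(FoldID, Death)]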

Now we tune the knn model on the training set to determine the best parameter.

# Base model 1: knn
knnCV <- list()
knnCV[["Features"]] <- names(training[, 2:11])
knnCV[["Tuning"]] <- CJ(k = seq(1, 20))  # grid of candidate k values
knnCV[["BestScore"]] <- 0

for(i in seq_len(nrow(knnCV[["Tuning"]]))){
  accuracy <- numeric()
  pred <- list()
  parameter <- knnCV[["Tuning"]][i]
  
  # Cross Validation
  for(foldID in 1:5){
    # Create the train/test folds
    testFold <- training[FoldID == foldID,]
    trainFolds <- setdiff(training,testFold)
    
    # Train knn model
    testFold %<>% mutate(Pred = knn(train = trainFolds[, knnCV$Features, with = FALSE],
                                    test = testFold[, knnCV$Features, with = FALSE],
                                    cl = trainFolds$Death, k = parameter$k))
    pred[[foldID]] <- as.data.table(testFold[, c("ID", "FoldID", "Pred")])
    
    # Accuracy of the fold
    accuracy0 <- mean(testFold$Pred == testFold$Death)
    accuracy <- c(accuracy, accuracy0)
  }
  
  # The overall cross-validated accuracy
  accuracy0 <- mean(accuracy)

  # Fill the Tuning matrix with accuracy
  knnCV$Tuning[i, Accuracy := accuracy0]

  if(accuracy0 > knnCV[["BestScore"]]){
    knnCV[["BestScore"]] <- accuracy0
    knnCV[["Bestparam"]] <- knnCV[["Tuning"]][i]
    knnCV[["BestPreds"]] <- rbindlist(pred)
  }
}
# The best parameter of the knn model
knnCV[["Bestparam"]]
##    k  Accuracy
## 1: 5 0.9310239

The best parameter for the knn model is k = 5, with a cross-validated accuracy of 0.93.
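
Since the whole tuning grid is kept in knnCV$Tuning, we can also plot the accuracy against k to see how sensitive knn is to this parameter. A minimal sketch, assuming the loop above has run:

# Cross-validated accuracy for each candidate k
ggplot(knnCV$Tuning, aes(x = k, y = Accuracy)) +
  geom_line() +
  geom_point() +
  labs(title = "knn: cross-validated accuracy by k")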

Next, we tune the svm model on the same training set. We use LiblineaR, where the type argument selects the solver: type 1 is an L2-regularized L2-loss support vector classifier (dual), type 6 is L1-regularized logistic regression, and type 7 is L2-regularized logistic regression (dual).

# Base model 2: LiblineaR (linear SVM / regularized logistic regression)
svmCV <- list()
svmCV[["Features"]] <- names(training[,2:11])
svmCV[["Tuning"]] <- CJ(type=c(1,6,7), cost=c(.001, .01, .1, 1, 10, 100))
svmCV[["BestScore"]] <- 0

for(i in seq_len(nrow(svmCV[["Tuning"]]))){
  accuracy <- numeric()
  pred <- list()
  parameter <- svmCV[["Tuning"]][i]
  
  # Cross Validation
  for(foldID in 1:5){
    # Create the train/test folds
    testFold <- training[FoldID == foldID,]
    trainFolds <- setdiff(training,testFold)
    
    # Train svm model
    logreg <- LiblineaR(data=trainFolds[, svmCV$Features, with=FALSE], target=trainFolds$Death, type= parameter$type, cost=parameter$cost)
    testFold[, Pred := predict(logreg, testFold[, svmCV$Features, with=FALSE])$predictions]
    pred[[foldID]] <- as.data.table(testFold[, c("ID", "FoldID", "Pred")])
    
    # Accuracy of the fold
    accuracy0 <- mean(testFold$Pred == testFold$Death)
    accuracy <- c(accuracy, accuracy0)
  }
  
  # The overall cross-validated accuracy
  accuracy0 <- mean(accuracy)
  
  # Fill the Tuning matrix
  svmCV$Tuning[i, Accuracy := accuracy0]
  
  if(accuracy0 > svmCV[["BestScore"]]){
    svmCV[["BestScore"]] <- accuracy0
    svmCV[["Bestparam"]] <- svmCV[["Tuning"]][i]
    svmCV[["BestPreds"]] <- rbindlist(pred)
  }
}
# Check the best parameters
svmCV[["Bestparam"]]
##    type cost  Accuracy
## 1:    6  0.1 0.9501772
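
As for knn, the full grid is kept in svmCV$Tuning, so the sensitivity to type and cost can be inspected directly. A sketch, assuming the loop above has run:

# Cross-validated accuracy for every (type, cost) pair
ggplot(svmCV$Tuning, aes(x = factor(cost), y = factor(type), fill = Accuracy)) +
  geom_tile() +
  labs(x = "cost", y = "LiblineaR type", title = "Cross-validated accuracy by solver and cost")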

The best model is type 6 (L1-regularized logistic regression) with cost = 0.1 and a cross-validated accuracy of 0.95, slightly better than the knn model. In the next step, we build an ensemble learner that uses the out-of-fold predictions of the knn and svm models as extra features.
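
Before building the stacked model, it may help to see what one_hot() from mltools does to a factor column. A toy sketch:

# one_hot() replaces a factor column by 0/1 indicator columns,
# here pred_Death and pred_Survive
one_hot(data.table(pred = factor(c("Survive", "Death", "Survive"))))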

# Ensemble of the knn and LiblineaR base models

# Extract the out-of-fold predictions of the best base models
best.knn <- knnCV[["BestPreds"]]
best.svm <- svmCV[["BestPreds"]]

# Merge the out-of-fold predictions into the training set (join on ID)
training[best.knn, pred.knn := Pred, on="ID"]
##        Death  Ggt asat alat bili ureum creat cclear esr crp leucos  ID
##   1: Survive   36   15   29    2   6.3    89   -117  15   7      7   1
##   2: Survive   20   23   34    2   3.4    89   -111   2   2      5   2
##   3:   Death 1500  709 1154  400  46.0   601    -10   7   7      8   3
##   4: Survive   30   35   32    4   5.6    58   -116   8   4      4   4
##   5: Survive   16   14   12    3   3.2    87   -124  11   3      3   5
##  ---                                                                  
## 156: Survive   14   14   24    2   2.0    78   -110  12   5      5 156
## 157:   Death   26   34   27    2  54.0   666    -10 108 124     30 157
## 158:   Death  194  277  257   67  17.0   231    -48  59  28     16 158
## 159:   Death  385  435  387  145  35.0   276    -23  86  34     17 159
## 160:   Death  376  421  465  176  36.0   256    -18  67  43     18 160
##      FoldID pred.knn
##   1:      2  Survive
##   2:      3  Survive
##   3:      3    Death
##   4:      4  Survive
##   5:      2  Survive
##  ---                
## 156:      3  Survive
## 157:      2    Death
## 158:      2    Death
## 159:      1    Death
## 160:      2    Death
training[best.svm, pred.svm := Pred, on="ID"]
##        Death  Ggt asat alat bili ureum creat cclear esr crp leucos  ID
##   1: Survive   36   15   29    2   6.3    89   -117  15   7      7   1
##   2: Survive   20   23   34    2   3.4    89   -111   2   2      5   2
##   3:   Death 1500  709 1154  400  46.0   601    -10   7   7      8   3
##   4: Survive   30   35   32    4   5.6    58   -116   8   4      4   4
##   5: Survive   16   14   12    3   3.2    87   -124  11   3      3   5
##  ---                                                                  
## 156: Survive   14   14   24    2   2.0    78   -110  12   5      5 156
## 157:   Death   26   34   27    2  54.0   666    -10 108 124     30 157
## 158:   Death  194  277  257   67  17.0   231    -48  59  28     16 158
## 159:   Death  385  435  387  145  35.0   276    -23  86  34     17 159
## 160:   Death  376  421  465  176  36.0   256    -18  67  43     18 160
##      FoldID pred.knn pred.svm
##   1:      2  Survive  Survive
##   2:      3  Survive  Survive
##   3:      3    Death    Death
##   4:      4  Survive  Survive
##   5:      2  Survive  Survive
##  ---                         
## 156:      3  Survive  Survive
## 157:      2    Death    Death
## 158:      2    Death    Death
## 159:      1    Death    Death
## 160:      2    Death    Death
# One-hot encode predictions
training <- one_hot(training, cols=c("pred.knn", "pred.svm"), dropCols=FALSE)

ensCV <- list()
ensCV[["Features"]] <- setdiff(colnames(training[]), c("ID", "FoldID", "Death", "pred.knn", "pred.svm"))
ensCV[["Tuning"]] <- CJ(type=c(1,6,7), cost=c(.001, .01, .1, 1, 10, 100))
ensCV[["BestScore"]] <- 0

for(i in seq_len(nrow(ensCV[["Tuning"]]))){
  accuracy <- numeric()
  pred <- list()
  parameter <- ensCV[["Tuning"]][i]
  
  # Cross Validation
  for(foldID in 1:5){
    # Create the train/test folds
    testFold <- training[FoldID == foldID,]
    trainFolds <- setdiff(training,testFold)
    
    # Train the stacked (ensemble) model
    logreg <- LiblineaR(data=trainFolds[, ensCV$Features, with=FALSE], target=trainFolds$Death, type= parameter$type, cost=parameter$cost)
    testFold[, Pred := predict(logreg, testFold[, ensCV$Features, with=FALSE])$predictions]
    pred[[foldID]] <- as.data.table(testFold[, c("ID", "FoldID", "Pred")])
    
    # Accuracy of the fold
    accuracy0 <- mean(testFold$Pred == testFold$Death)
    accuracy <- c(accuracy, accuracy0)
  }
  
  # The overall cross-validated accuracy
  accuracy0 <- mean(accuracy)
  
  # Store the accuracy in the tuning grid
  ensCV$Tuning[i, Accuracy := accuracy0]
  
  if(accuracy0 > ensCV[["BestScore"]]){
    ensCV[["BestScore"]] <- accuracy0
    ensCV[["Bestparam"]] <- ensCV[["Tuning"]][i]
    ensCV[["BestPreds"]] <- rbindlist(pred)
  }
}
# Check the best parameters
ensCV[["Bestparam"]]
##    type cost  Accuracy
## 1:    6  100 0.9628666
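
Before scoring the test set, we can line up the best cross-validated accuracies of the three models. A sketch using the lists built above:

# Best cross-validated accuracy of each model
data.table(model = c("knn", "svm", "stacked ensemble"),
           cv_accuracy = c(knnCV$BestScore, svmCV$BestScore, ensCV$BestScore))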

Now we compare the test-set predictions of the knn, svm and ensemble models.

# knn
testing[, pred.knn := knn(train=training[, knnCV$Features, with=FALSE], test=testing[, knnCV$Features, with=FALSE], cl=training$Death, k=knnCV$Bestparam$k)]
# svm
svm <- LiblineaR(data=training[, svmCV$Features, with=FALSE], target=training$Death, type=svmCV$Bestparam$type, cost=svmCV$Bestparam$cost)
testing[, pred.svm := predict(svm, testing[, svmCV$Features, with=FALSE])$predictions]
# ensemble
test <- one_hot(testing, cols=c("pred.knn", "pred.svm"), dropCols=FALSE)
logreg <- LiblineaR(data=training[, ensCV$Features, with=FALSE], target=training$Death, type= ensCV$Bestparam$type, cost=ensCV$Bestparam$cost)
test[, Pred.ensemble := predict(logreg, test[, ensCV$Features, with=FALSE])$predictions]
# Accuracy of knn
mean(testing$Death == testing$pred.knn)
## [1] 0.925
# Accuracy of svm
mean(testing$Death == testing$pred.svm)
## [1] 0.925
# Accuracy of ensemble learner
mean(test$Death == test$Pred.ensemble)
## [1] 0.95
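
Accuracy alone hides the error structure. A confusion matrix for the stacked model also reports sensitivity and specificity, as for the LDA model in Case 1. A sketch, assuming the test table from above:

# Confusion matrix of the ensemble on the test set
caret::confusionMatrix(factor(test$Pred.ensemble, levels = levels(test$Death)),
                       test$Death, positive = "Death")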

Conclusion

Stacking the knn and svm models gave better results (0.95 versus 0.925 on the test set). However, in practice we should make a trade-off between parsimony and model complexity: the simple LDA model of Case 1 already performs almost as well.