Context

The chapter 9 “Discriminant Analysis for Making a Diagnosis from Multiple Outcomes” from “Machine Learning in Medecine Cookbook - Tome 1 of Ton J. Cleophas and Aeilko H. Zwinderman” has suggested to use Linear Discriminant Analysis to make diagnosis of sepsis by analyzing multiple laboratory screenings.

What we learnt from this chapter:

  1. The traditional method - MANOVA (multivarite analysis of variance) is not appropriate for the studycase due to high positive correlation between the outcome variables.
  2. Discriminant Analysis applies an orthogonal modelling on the variables and consequently omits the efffect of correlation between variables.
  3. Discriminant Analysis is a supervised learning method, includes a grouping predictor variable and answers the scientific question “Is the diagnosis group a significant predictor of the outcome estimated with 10 lab values ?”

However we have some remarks for this chapter:

  1. All the statitical analysis are conducted in SPSS statisical software.
  2. The methodology in this chapter didn’t use a cross-validation to evaluate the model.

In this report, we suggest some additional features:

  1. The statitical analysis will be performed using R
  2. Integrate the cross-validation into the methodology
  3. Compare Linear Discriminant Analysis (LDA) with Quadratic Discriminant Analysis (QDA) and Regularized Discriminant Analysis (RDA)

Objectives

Predict the type of sepsis (urosepsis, bile duct sepsis, and airway sepsis) based on the laboratory screenings of 45 patients

0. Loading package and support functions

require(dplyr);require(ggplot2);require(gridExtra)

doPlots <- function(data.in, fun, ii, ncol=3) {
  pp <- list()
  for (i in ii) {
    p <- fun(data.in=data.in, i=i)
    pp <- c(pp, list(p))
  }
  do.call("grid.arrange", c(pp, ncol=ncol))
}

plotDensity <- function(data.in, i) {
  data <- data.frame(x=data.in[,i], Response=data.in$diagnosisgroup)
  p <- ggplot(data) + geom_density(aes(x=x, colour=factor(Response))) +
    #geom_line(aes(x=x), stat="density", size=1, alpha=1.0) +
    xlab(colnames(data.in)[i]) + theme_light()#+theme(axis.title.x = element_blank())
  return (p)
}

plotBox <- function(data.in, i) {
  data <- data.frame(y=data.in[,i], Response=data.in$diagnosisgroup)
  p <- ggplot(data, aes(x=factor(Response), y=y)) + geom_boxplot() +
     ylab(colnames(data.in)[i]) + theme_light() + theme(axis.title.x = element_blank())
  return (p)
}

1. Import data

file <- "https://raw.githubusercontent.com/tkvit/MLM_project/master/Chapter9_LDA/discriminantanalysis.csv"
df <- read.csv(file,header=TRUE)
raw <- df #backup data
df$diagnosisgroup <- factor(df$diagnosisgroup, levels = c(1,2,3), 
                            labels = c("Uro","Bile","Air"))

The data includes 11 variables: 1-10 correspond to laboratory screenings and 11 corresponds to the type of sepsis detected for the patient.

Var 1 = gammagt
Var 2 = asat
Var 3 = alat
Var 4 = bilirubine
Var 5 = ureum
Var 6 = creatinine
Var 7 = creatinine clearance
Var 8 = erythrocyte sedimentation rate
Var 9 = c-reactive protein
Var 10 = leucocyte count
Var 11 = type of sepsis (1–3 as described above)
head(df)
##   gammagt asat alat bili ureum creatinine creatinineclearance
## 1       8    5   28    4   2.5         79                 108
## 2      11   10   29    7   2.1         94                  89
## 3       7    8   30    7   2.2         79                  96
## 4       4    6   16    6   2.6         80                 120
## 5       1    6   15    6   2.2         84                 108
## 6      23    5   14    6   2.1         78                 120
##   erythrocytesr creactiveprotein leucos diagnosisgroup
## 1            19               18     16           Bile
## 2            18               15     15           Bile
## 3            20               16     14           Bile
## 4            17               17     19           Bile
## 5            21               18     20           Bile
## 6            18               17     21            Air
str(df)
## 'data.frame':    45 obs. of  11 variables:
##  $ gammagt            : int  8 11 7 4 1 23 12 31 22 30 ...
##  $ asat               : int  5 10 8 6 6 5 10 8 7 6 ...
##  $ alat               : int  28 29 30 16 15 14 17 27 26 25 ...
##  $ bili               : int  4 7 7 6 6 6 5 5 5 4 ...
##  $ ureum              : num  2.5 2.1 2.2 2.6 2.2 2.1 3.2 0.2 1.2 2.4 ...
##  $ creatinine         : int  79 94 79 80 84 78 85 68 74 69 ...
##  $ creatinineclearance: int  108 89 96 120 108 120 100 113 98 90 ...
##  $ erythrocytesr      : int  19 18 20 17 21 18 17 19 16 20 ...
##  $ creactiveprotein   : int  18 15 16 17 18 17 20 15 16 18 ...
##  $ leucos             : int  16 15 14 19 20 21 18 18 17 16 ...
##  $ diagnosisgroup     : Factor w/ 3 levels "Uro","Bile","Air": 2 2 2 2 2 3 3 3 3 3 ...

Only “diagnosisgroup” variable is factor. All the rest are numeric.

2. Exploration Data

In this step, we will plot the boxplot and density curves of variables in order to visualize their distribution. From this point, we use “Uro”,“Bile”,“Air” as abbreviation of urosepsis, bile duct sepsis, and airway sepsis

doPlots(data.in=df, fun=plotBox, ii=1:10, ncol=4)

The Linear Discriminant Analysis LDA is suggested by Fisher (1936) and Welch (1939). According to Welch, the allocation of an individual into a group is based on baysian model by minimizing the total probability of misclassification. According to Fisher, the decision is based on the minimal canonical distance between the individual and the closet centroid of group. Therefore, one of the assumptions of LDA is the equaity of the class covariances.

Boxplots showed the group Airways sepsis is well distinguished from the urosepsis and bile duct sepsis for most of variables

doPlots(data.in=df, fun=plotDensity, ii=1:5, ncol=3)

doPlots(data.in=df, fun=plotDensity, ii=6:10, ncol=3)

For creatinineclearance and leucos variables, the distribution of “Uro” and “Air” is quite similar

3. Quick t-test for “Uro” and “Bile”

uro <- filter(df, diagnosisgroup =="Uro")
bile <- filter(df, diagnosisgroup =="Bile")
air <-filter(df, diagnosisgroup =="Air")
f <- function(x,y){
  test <- t.test(x,y, paired=F)
  out <- data.frame(stat = test$statistic,
                    df   = test$parameter,
                    pval = test$p.value)
  return(out)
}

out <- data.frame(sapply(seq(ncol(uro)-1), function(x) f(uro[,x], bile[,x])))
out1 <- data.frame(sapply(seq(ncol(uro)-1), function(x) f(air[,x], uro[,x])))

colnames(out) <- colnames(uro[,-11])
colnames(out1) <- colnames(uro[,-11])

“Uro” vs “Bile”

print(out)
##       gammagt       asat      alat       bili      ureum creatinine
## stat 1.193045   1.757939  2.078944   2.009345   2.784985   2.472672
## df     25.614   24.06174  21.53695   22.97937   22.23297   21.84157
## pval 0.243786 0.09147432 0.0497501 0.05638737 0.01073173 0.02168119
##      creatinineclearance erythrocytesr creactiveprotein     leucos
## stat            -1.15023      1.961817         2.464239   1.781807
## df               30.9986      22.28859         21.07213   30.58236
## pval           0.2588458    0.06239108        0.0224106 0.08471406

The difference is statitically sigificant for “alat”, “ureum”, “creatinine” and “creactiveproteine”

“Air” vs “Uro”

print(out1)
##         gammagt       asat      alat      bili     ureum creatinine
## stat    1.82939   2.169881  1.411499  1.165456   1.49246   1.554292
## df     15.49078   12.68806  18.62082   18.6417  14.87933   16.32207
## pval 0.08665503 0.04962873 0.1745892 0.2585309 0.1564766  0.1392892
##      creatinineclearance erythrocytesr creactiveprotein    leucos
## stat         -0.01758492       1.20173       -0.9590779 0.3940968
## df              21.21756      19.58911         23.22281  20.29552
## pval           0.9861343     0.2437957        0.3473991 0.6976198

The difference is not statitically significant for creatininecleareance, creactiveproteine and leucos. # Methodology The LDA is applied on the dataset with cross-validation. The results (not shown here) is quite poor. To improve the accuracy of the prediction, we suggest 2 test cases:

  1. Case 1: Apply in addition a Quadratic Discriminant Analysis and Regularized Discriminant Analysis on the complete dataset
  2. Case 2 : Apply in addition a Quadratic Discriminant Analysis and Regularized Discriminant Analysis on the selected dataset (selected variables)

In order to evaluate the performance of these models, we calculated the accuracy for 500 iterations. At each iteration, the dataset is randomly separated into trainings (44 patients) and testing (1 patient). The reason we can only have 1 patient for testing dataset: the Quadratic Discrimnant Analysis requires the number of predictors < number of individuals within each group.

Case 1: LDA, QDA and RDA on the complete dataset

require(MASS);
require(klaR)

data <- df
lda <- NULL
qda <- NULL
rda <- NULL
set.seed(123)
for (i in 1:500){
  tri=sample(nrow(data),1); training <- data[-tri,]; testing <- data[tri,]
  
  #LDA
  lda_mod <- lda(diagnosisgroup ~., data = training)
  lda_pred <- predict(lda_mod,testing)
  lda_res <- cbind(Reality = testing$diagnosisgroup, LDA = lda_pred$class)
  lda_res <- as.data.frame(lda_res)
  lda[[i]] <- lda_res
  
  #QDA
  qda_mod <- qda(diagnosisgroup ~., data = training)
  qda_pred <- predict(qda_mod,testing)
  qda_res <- cbind(Reality = testing$diagnosisgroup, QDA = qda_pred$class)
  qda_res <- as.data.frame(qda_res)
  qda[[i]] <- qda_res
  
  #RDA
  
  rda_mod <- rda(diagnosisgroup~. ,data = training,gamma = 0.1, lamda = 0.8) #0.1, 0.8
  rda_pred <- predict(rda_mod, newdata = testing[,-ncol(testing)])$class
  rda_res <- cbind(Reality = testing$diagnosisgroup, RDA = rda_pred)
  rda_res <- as.data.frame(rda_res)
  rda[[i]] <- rda_res
  
  
}

lda <- do.call(rbind,lda)
qda <- do.call(rbind,qda)
rda <- do.call(rbind,rda)

accuracy <- function(data){
  cm <- table(data[,1],data[,2])
  n = sum(cm) # number of instances
  diag = diag(cm) # number of correctly classified instances per class 
  accuracy = sum(diag) / n ;
  return(accuracy)
}

acc_lda <- accuracy(lda)
acc_qda <- accuracy(qda)
acc_rda <- accuracy(rda)
cat ("LDA = ",acc_lda,"; QDA = ",acc_qda, "; RDA = ",acc_rda)
## LDA =  0.346 ; QDA =  0.504 ; RDA =  0.582

Results

Tuning parameters : gamma and lambda

The unique tuning parameters are the gamma and lamda of the RDA model. According to “Applied Preditive Modeling” of Max Kuhn and Kjell Johnson, the gamma and lamda are from 0 to 1. The gamma represents the balance between LDA and QDA, when the gamma = 0, the model gives the same results as LDA model. The lambda represents the correlation of variables in the dataset.

As we know, the variables are highly correlated between them (result not shown here). We used a loop for gamma from 0.1 to 0.9 and lamda from 0.5 to 0.9 in order to determine which values are optimal for the model.

As a result, we retain gamma = 0.1 and lamda = 0.8

Accuracy of the 3 models

The LDA model gave a poor accuracy compared to QDA and RDA. RDA is the compromise of LDA and QDA gave the best results.

LDA = 0.346 ; QDA = 0.504 ; RDA = 0.582

Case 2: LDA, QDA and RDA on the selected dataset

In this step, the author processes the selection of variables purely based on the observation of variables, machine learning techniques and without any medical knowledge.

Based on the theory and the observation of variables, we decided to tune the model by deselecting the following varialbes : bili, creactiveprotein, leucos, creatinineclearance. After various tests, we found the best results for omitting bibli, creativeprotein and leucos from the dataset.

The new dataset became :

df <- dplyr::select(df, -bili,-creactiveprotein, - leucos)

We applied the same models for this datasets.

Results

The interesting point here is that we observed a clear improvement for the LDA and QDA model. The accuracy of RDA model is almost unchanged.

LDA = 0.422 ; QDA = 0.612 ; RDA = 0.552

4. Conclusion

The QDA model gave the better prediction compared to the LDA model. The RDA model can be a good choice in some cases because it is the compromise of the LDA and QDA model. In our example, the RDA model gave the best accuracy on the complete dataset

By omitting some variables, we improved the accuracy of the LDA and QDA model. But we remind that the omission of these variables is purely based on the machine learning techniques and without any medical knowledge. In medical reality, whether it is acceptable to omit these variables is another question.

9. Reference

Cleophas, Ton J., Zwinderman, Aeilko H (2014). Machine Learning in Medicine - Cookbook. Springer. ISBN 978-3-319-04181-0

Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. http://doi.org/10.1007/978-1-4614-6849-3

Fisher R (1936). “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics, 7(2), 179–188

Welch B (1939). “Note on Discriminant Functions.” Biometrika, 31, 218–220.