The chapter 9 “Discriminant Analysis for Making a Diagnosis from Multiple Outcomes” from “Machine Learning in Medecine Cookbook - Tome 1 of Ton J. Cleophas and Aeilko H. Zwinderman” has suggested to use Linear Discriminant Analysis to make diagnosis of sepsis by analyzing multiple laboratory screenings.
What we learnt from this chapter:
However we have some remarks for this chapter:
In this report, we suggest some additional features:
Predict the type of sepsis (urosepsis, bile duct sepsis, and airway sepsis) based on the laboratory screenings of 45 patients
require(dplyr);require(ggplot2);require(gridExtra)
doPlots <- function(data.in, fun, ii, ncol=3) {
pp <- list()
for (i in ii) {
p <- fun(data.in=data.in, i=i)
pp <- c(pp, list(p))
}
do.call("grid.arrange", c(pp, ncol=ncol))
}
plotDensity <- function(data.in, i) {
data <- data.frame(x=data.in[,i], Response=data.in$diagnosisgroup)
p <- ggplot(data) + geom_density(aes(x=x, colour=factor(Response))) +
#geom_line(aes(x=x), stat="density", size=1, alpha=1.0) +
xlab(colnames(data.in)[i]) + theme_light()#+theme(axis.title.x = element_blank())
return (p)
}
plotBox <- function(data.in, i) {
data <- data.frame(y=data.in[,i], Response=data.in$diagnosisgroup)
p <- ggplot(data, aes(x=factor(Response), y=y)) + geom_boxplot() +
ylab(colnames(data.in)[i]) + theme_light() + theme(axis.title.x = element_blank())
return (p)
}
file <- "https://raw.githubusercontent.com/tkvit/MLM_project/master/Chapter9_LDA/discriminantanalysis.csv"
df <- read.csv(file,header=TRUE)
raw <- df #backup data
df$diagnosisgroup <- factor(df$diagnosisgroup, levels = c(1,2,3),
labels = c("Uro","Bile","Air"))
The data includes 11 variables: 1-10 correspond to laboratory screenings and 11 corresponds to the type of sepsis detected for the patient.
Var 1 = gammagt
Var 2 = asat
Var 3 = alat
Var 4 = bilirubine
Var 5 = ureum
Var 6 = creatinine
Var 7 = creatinine clearance
Var 8 = erythrocyte sedimentation rate
Var 9 = c-reactive protein
Var 10 = leucocyte count
Var 11 = type of sepsis (1–3 as described above)
head(df)
## gammagt asat alat bili ureum creatinine creatinineclearance
## 1 8 5 28 4 2.5 79 108
## 2 11 10 29 7 2.1 94 89
## 3 7 8 30 7 2.2 79 96
## 4 4 6 16 6 2.6 80 120
## 5 1 6 15 6 2.2 84 108
## 6 23 5 14 6 2.1 78 120
## erythrocytesr creactiveprotein leucos diagnosisgroup
## 1 19 18 16 Bile
## 2 18 15 15 Bile
## 3 20 16 14 Bile
## 4 17 17 19 Bile
## 5 21 18 20 Bile
## 6 18 17 21 Air
str(df)
## 'data.frame': 45 obs. of 11 variables:
## $ gammagt : int 8 11 7 4 1 23 12 31 22 30 ...
## $ asat : int 5 10 8 6 6 5 10 8 7 6 ...
## $ alat : int 28 29 30 16 15 14 17 27 26 25 ...
## $ bili : int 4 7 7 6 6 6 5 5 5 4 ...
## $ ureum : num 2.5 2.1 2.2 2.6 2.2 2.1 3.2 0.2 1.2 2.4 ...
## $ creatinine : int 79 94 79 80 84 78 85 68 74 69 ...
## $ creatinineclearance: int 108 89 96 120 108 120 100 113 98 90 ...
## $ erythrocytesr : int 19 18 20 17 21 18 17 19 16 20 ...
## $ creactiveprotein : int 18 15 16 17 18 17 20 15 16 18 ...
## $ leucos : int 16 15 14 19 20 21 18 18 17 16 ...
## $ diagnosisgroup : Factor w/ 3 levels "Uro","Bile","Air": 2 2 2 2 2 3 3 3 3 3 ...
Only “diagnosisgroup” variable is factor. All the rest are numeric.
In this step, we will plot the boxplot and density curves of variables in order to visualize their distribution. From this point, we use “Uro”,“Bile”,“Air” as abbreviation of urosepsis, bile duct sepsis, and airway sepsis
doPlots(data.in=df, fun=plotBox, ii=1:10, ncol=4)
The Linear Discriminant Analysis LDA is suggested by Fisher (1936) and Welch (1939). According to Welch, the allocation of an individual into a group is based on baysian model by minimizing the total probability of misclassification. According to Fisher, the decision is based on the minimal canonical distance between the individual and the closet centroid of group. Therefore, one of the assumptions of LDA is the equaity of the class covariances.
Boxplots showed the group Airways sepsis is well distinguished from the urosepsis and bile duct sepsis for most of variables
doPlots(data.in=df, fun=plotDensity, ii=1:5, ncol=3)
doPlots(data.in=df, fun=plotDensity, ii=6:10, ncol=3)
For creatinineclearance and leucos variables, the distribution of “Uro” and “Air” is quite similar
uro <- filter(df, diagnosisgroup =="Uro")
bile <- filter(df, diagnosisgroup =="Bile")
air <-filter(df, diagnosisgroup =="Air")
f <- function(x,y){
test <- t.test(x,y, paired=F)
out <- data.frame(stat = test$statistic,
df = test$parameter,
pval = test$p.value)
return(out)
}
out <- data.frame(sapply(seq(ncol(uro)-1), function(x) f(uro[,x], bile[,x])))
out1 <- data.frame(sapply(seq(ncol(uro)-1), function(x) f(air[,x], uro[,x])))
colnames(out) <- colnames(uro[,-11])
colnames(out1) <- colnames(uro[,-11])
“Uro” vs “Bile”
print(out)
## gammagt asat alat bili ureum creatinine
## stat 1.193045 1.757939 2.078944 2.009345 2.784985 2.472672
## df 25.614 24.06174 21.53695 22.97937 22.23297 21.84157
## pval 0.243786 0.09147432 0.0497501 0.05638737 0.01073173 0.02168119
## creatinineclearance erythrocytesr creactiveprotein leucos
## stat -1.15023 1.961817 2.464239 1.781807
## df 30.9986 22.28859 21.07213 30.58236
## pval 0.2588458 0.06239108 0.0224106 0.08471406
The difference is statitically sigificant for “alat”, “ureum”, “creatinine” and “creactiveproteine”
“Air” vs “Uro”
print(out1)
## gammagt asat alat bili ureum creatinine
## stat 1.82939 2.169881 1.411499 1.165456 1.49246 1.554292
## df 15.49078 12.68806 18.62082 18.6417 14.87933 16.32207
## pval 0.08665503 0.04962873 0.1745892 0.2585309 0.1564766 0.1392892
## creatinineclearance erythrocytesr creactiveprotein leucos
## stat -0.01758492 1.20173 -0.9590779 0.3940968
## df 21.21756 19.58911 23.22281 20.29552
## pval 0.9861343 0.2437957 0.3473991 0.6976198
The difference is not statitically significant for creatininecleareance, creactiveproteine and leucos. # Methodology The LDA is applied on the dataset with cross-validation. The results (not shown here) is quite poor. To improve the accuracy of the prediction, we suggest 2 test cases:
In order to evaluate the performance of these models, we calculated the accuracy for 500 iterations. At each iteration, the dataset is randomly separated into trainings (44 patients) and testing (1 patient). The reason we can only have 1 patient for testing dataset: the Quadratic Discrimnant Analysis requires the number of predictors < number of individuals within each group.
require(MASS);
require(klaR)
data <- df
lda <- NULL
qda <- NULL
rda <- NULL
set.seed(123)
for (i in 1:500){
tri=sample(nrow(data),1); training <- data[-tri,]; testing <- data[tri,]
#LDA
lda_mod <- lda(diagnosisgroup ~., data = training)
lda_pred <- predict(lda_mod,testing)
lda_res <- cbind(Reality = testing$diagnosisgroup, LDA = lda_pred$class)
lda_res <- as.data.frame(lda_res)
lda[[i]] <- lda_res
#QDA
qda_mod <- qda(diagnosisgroup ~., data = training)
qda_pred <- predict(qda_mod,testing)
qda_res <- cbind(Reality = testing$diagnosisgroup, QDA = qda_pred$class)
qda_res <- as.data.frame(qda_res)
qda[[i]] <- qda_res
#RDA
rda_mod <- rda(diagnosisgroup~. ,data = training,gamma = 0.1, lamda = 0.8) #0.1, 0.8
rda_pred <- predict(rda_mod, newdata = testing[,-ncol(testing)])$class
rda_res <- cbind(Reality = testing$diagnosisgroup, RDA = rda_pred)
rda_res <- as.data.frame(rda_res)
rda[[i]] <- rda_res
}
lda <- do.call(rbind,lda)
qda <- do.call(rbind,qda)
rda <- do.call(rbind,rda)
accuracy <- function(data){
cm <- table(data[,1],data[,2])
n = sum(cm) # number of instances
diag = diag(cm) # number of correctly classified instances per class
accuracy = sum(diag) / n ;
return(accuracy)
}
acc_lda <- accuracy(lda)
acc_qda <- accuracy(qda)
acc_rda <- accuracy(rda)
cat ("LDA = ",acc_lda,"; QDA = ",acc_qda, "; RDA = ",acc_rda)
## LDA = 0.346 ; QDA = 0.504 ; RDA = 0.582
The unique tuning parameters are the gamma and lamda of the RDA model. According to “Applied Preditive Modeling” of Max Kuhn and Kjell Johnson, the gamma and lamda are from 0 to 1. The gamma represents the balance between LDA and QDA, when the gamma = 0, the model gives the same results as LDA model. The lambda represents the correlation of variables in the dataset.
As we know, the variables are highly correlated between them (result not shown here). We used a loop for gamma from 0.1 to 0.9 and lamda from 0.5 to 0.9 in order to determine which values are optimal for the model.
As a result, we retain gamma = 0.1 and lamda = 0.8
The LDA model gave a poor accuracy compared to QDA and RDA. RDA is the compromise of LDA and QDA gave the best results.
LDA = 0.346 ; QDA = 0.504 ; RDA = 0.582
In this step, the author processes the selection of variables purely based on the observation of variables, machine learning techniques and without any medical knowledge.
Based on the theory and the observation of variables, we decided to tune the model by deselecting the following varialbes : bili, creactiveprotein, leucos, creatinineclearance. After various tests, we found the best results for omitting bibli, creativeprotein and leucos from the dataset.
The new dataset became :
df <- dplyr::select(df, -bili,-creactiveprotein, - leucos)
We applied the same models for this datasets.
The interesting point here is that we observed a clear improvement for the LDA and QDA model. The accuracy of RDA model is almost unchanged.
LDA = 0.422 ; QDA = 0.612 ; RDA = 0.552
The QDA model gave the better prediction compared to the LDA model. The RDA model can be a good choice in some cases because it is the compromise of the LDA and QDA model. In our example, the RDA model gave the best accuracy on the complete dataset
By omitting some variables, we improved the accuracy of the LDA and QDA model. But we remind that the omission of these variables is purely based on the machine learning techniques and without any medical knowledge. In medical reality, whether it is acceptable to omit these variables is another question.
Cleophas, Ton J., Zwinderman, Aeilko H (2014). Machine Learning in Medicine - Cookbook. Springer. ISBN 978-3-319-04181-0
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. http://doi.org/10.1007/978-1-4614-6849-3
Fisher R (1936). “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics, 7(2), 179–188
Welch B (1939). “Note on Discriminant Functions.” Biometrika, 31, 218–220.