Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
1. Encourage the use of Machine Learning techniques in medical research in Vietnam
2. Promote the use of the R statistical programming language, an open-source, leading tool for practicing data science.
Background
In this study, 4 inflammatory markers [C-reactive protein (CRP), erythrocyte sedimentation rate (ESR), leucocyte count (leucos), and fibrinogen] were measured in 150 patients with pneumonia. Based on the chest X-ray, clinical severity was classified as A (mild infection), B (medium severity), or C (severe infection). One scientific question was to assess whether these markers could adequately predict the severity of infection.
Our objective is to demonstrate the ability of cluster analysis, an unsupervised machine learning technique, to explore and identify groups (i.e. clusters) within the biomarker dataset.
Materials and method
The original dataset was provided in Chapter 2 of Machine Learning in Medicine - Cookbook Three (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS format (decisiontree.sav) from the website extras.springer.com.
Data analysis was performed in the R statistical programming language. Our demonstration consists of 4 steps: conventional data exploration, k-means clustering, hierarchical clustering, and cluster analysis with the mlr machine learning framework.
The following packages will be used during our demonstration:
library(tidyverse)
library(foreign)
library(RColorBrewer)
library(corrplot)
library(NbClust)
library(factoextra)
library(cluster)
library(clValid)
library(dendextend)
library(dplyr)
library(gplots)
library(mlr)
library(gridExtra)
Customizing the theme for all ggplot2 objects
my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 12),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#faefff"),
      strip.background = element_rect(fill = "#400156", color = "#400156", size = 0.5),
      strip.text = element_text(face = "bold", size = 10, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())
Results
1) Conventional data exploration
df=read.spss("decisiontree.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()
df
## # A tibble: 150 × 5
## CRP leucos fibrinogen ESR xrayseverity
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 120 5 11 60 A
## 2 100 5 11 56 A
## 3 94 4 11 60 A
## 4 92 5 11 58 A
## 5 100 5 11 52 A
## 6 108 6 17 48 A
## 7 92 5 14 48 A
## 8 100 5 11 54 A
## 9 88 5 11 54 A
## 10 98 5 8 60 A
## # ... with 140 more rows
Figure 1: Heatmap plot for exploring the biomarker dataset
dfscale <- df[,-5] %>% as.matrix() %>% scale() %>% as_tibble() %>%
  mutate(., Class = df$xrayseverity, Id = row.names(.))
dfscale %>% gather(CRP:ESR, key = "Marker", value = "Value") %>%
  ggplot(aes(x = reorder(Id, Value), y = reorder(Marker, Value), fill = Value)) +
  geom_tile(color = "black", show.legend = T) +
  facet_wrap(~Class, ncol = 1, shrink = T, scale = "free") +
  scale_fill_gradient2(low = "gold", mid = "#ff5656", high = "#890150", midpoint = 1) +
  theme(axis.text.x = element_blank()) +
  scale_y_discrete("Markers") +
  scale_x_discrete("Patient's Id")
Figure 2: Correlation matrix for exploring the relationship between 4 biomarkers
cor.mtest <- function(mat, conf.level = 0.95){
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  diag(lowCI.mat) <- diag(uppCI.mat) <- 1
  for(i in 1:(n-1)){
    for(j in (i+1):n){
      tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
      p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
      lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
      uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
    }
  }
  return(list(p.mat, lowCI.mat, uppCI.mat))
}
res1<-df[,-5]%>%cor.mtest(.,0.95)
df[,-5]%>%cor(.,method="pearson")%>%corrplot(.,p.mat=res1[[1]],sig.level=0.05,order="hclust",type="lower",method="pie",tl.col="black", tl.srt=45,col=brewer.pal(n=8, name="PuRd"))
Figure 3: Characteristics of 4 biomarkers among 3 subgroups of severity
df%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=Value,fill=xrayseverity,color=xrayseverity))+geom_density(alpha = 0.4, size=1) +
geom_rug() +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
facet_wrap( ~ Marker, scales = "free_y", ncol = 2)
2) K-means clustering method
First, we perform an exploratory analysis in order to identify the most appropriate clustering method, as well as the optimal number of clusters (k value).
df[,-5]%>%scale()%>%clValid(nClust=2:4,metric="euclidean",method="complete",clMethods = c("hierarchical","kmeans","pam"),validation = "internal",verbose=F)%>%summary()
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4
##
## Validation Measures:
## 2 3 4
##
## hierarchical Connectivity 0.0000 9.2468 21.6004
## Dunn 0.2584 0.1139 0.1128
## Silhouette 0.5389 0.4683 0.4496
## kmeans Connectivity 5.4421 15.3294 19.4476
## Dunn 0.1192 0.1036 0.0984
## Silhouette 0.5416 0.4727 0.4527
## pam Connectivity 5.4421 16.2782 32.0389
## Dunn 0.1192 0.0890 0.0812
## Silhouette 0.5416 0.4322 0.4397
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 0.0000 hierarchical 2
## Dunn 0.2584 hierarchical 2
## Silhouette 0.5416 kmeans 2
The results indicate that the most appropriate k value should be 2 for both the hierarchical and k-means methods.
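As an optional cross-check (not part of the original analysis), the NbClust package loaded above aggregates a large panel of validity indices and reports the k favoured by the majority of them. A minimal sketch on the same scaled data:
# Optional cross-check of the optimal k with NbClust (already loaded above).
# Results may differ slightly from clValid, since a different panel of indices is aggregated.
set.seed(123)  # k-means initialization is random
nb <- df[,-5] %>% scale() %>%
  NbClust(distance = "euclidean", min.nc = 2, max.nc = 4, method = "kmeans", index = "all")
fviz_nbclust(nb)  # factoextra summarizes the "votes" for each candidate k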
Figure 4: K-means cluster analysis
df[,-5]%>%scale()%>%kmeans(2)%>%fviz_cluster(data=df[,-5],geom = "point",pointsize=2,shape=19,frame.type="norm")+my_theme()+scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1")
fit<-df[,-5]%>%scale()%>%kmeans(centers=2)
fit$size
## [1] 96 54
fit$centers
## CRP leucos fibrinogen ESR
## 1 0.5630151 0.6887467 0.6609378 0.3941610
## 2 -1.0009157 -1.2244386 -1.1750005 -0.7007307
Just out of curiosity, we will verify whether those 2 clusters would help us identify the pneumonia severity in our patients.
df[,-5]%>%aggregate(.,by=list(cluster=fit$cluster),quantile)
## cluster CRP.0% CRP.25% CRP.50% CRP.75% CRP.100% leucos.0% leucos.25%
## 1 1 98 116 126 134 158 12 15
## 2 2 86 96 100 104 120 3 5
## leucos.50% leucos.75% leucos.100% fibrinogen.0% fibrinogen.25%
## 1 16 19 23 35.00 46.25
## 2 5 5 12 8.00 11.00
## fibrinogen.50% fibrinogen.75% fibrinogen.100% ESR.0% ESR.25% ESR.50%
## 1 54.50 65.00 80.00 44.0 62.0 70.0
## 2 11.00 14.00 38.00 44.0 52.5 58.0
## ESR.75% ESR.100%
## 1 88.0 130.0
## 2 60.0 76.0
df[,-5]%>%aggregate(.,by=list(cluster=fit$cluster),mean)
## cluster CRP leucos fibrinogen ESR
## 1 1 126.2917 16.572917 56.09375 76.02083
## 2 2 100.4815 5.314815 14.11111 57.68519
table(fit$cluster,df$xrayseverity)%>%knitr::kable()
| cluster | A | B | C |
|---------|---|---|---|
| 1 | 0 | 46 | 50 |
| 2 | 50 | 4 | 0 |
The answer seems to be positive: at the very least, we can distinguish between the “Mild” and “Severe” groups among our patients using 2 clusters…
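As a rough numerical summary (not part of the original analysis), this agreement can be quantified by collapsing the severity grades into a binary mild versus moderate-severe split and computing the purity of the 2-cluster solution, using base R only:
# Sketch: agreement between the 2 k-means clusters and a binary severity split
# (A = mild vs. B/C = moderate-severe)
severity2 <- ifelse(df$xrayseverity == "A", "mild", "moderate-severe")
tab <- table(cluster = fit$cluster, severity = severity2)
tab
# "Purity": proportion of patients falling in the majority severity class of their cluster
sum(apply(tab, 1, max)) / sum(tab)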
Figure 5: Characteristics of 4 biomarkers in 2 clusters
fit$centers %>% as_tibble() %>%
  mutate(Class = c("PC1", "PC2")) %>%
  gather(CRP:ESR, key = "Marker", value = "Value") %>%
  ggplot(aes(x = reorder(Marker, -Value), y = Value, fill = Class)) +
  geom_point(shape = 21, size = 5, stat = "identity") +
  facet_wrap(~Class, ncol = 1, scales = "free") +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1")
Note: the scales were standardized for all biomarkers
3) Hierarchical clustering method
Figure 6: Result of hierarchical cluster analysis
Rowv<-df[,-5]%>%scale %>%dist %>% hclust %>% as.dendrogram %>%
set("branches_k_color",k=2) %>% set("branches_lwd", 1) %>%
ladderize
Colv<-df[,-5]%>% scale %>% t %>% dist %>% hclust %>% as.dendrogram %>%
set("branches_k_color", k = 2, value = c("blue","red")) %>%
set("branches_lwd", 1) %>%
ladderize
heatmap.2(scale(df[,-5]), scale = "none", col = redblue(100),
Rowv = Rowv, Colv = Colv,
trace = "none", density.info = "none")
df[,-5]%>%scale()%>%eclust("hclust",k = 2,graph =TRUE)%>%fviz_dend(.,rect=TRUE,show_labels=TRUE)
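For symmetry with the k-means check in section 2, one can also cut the hierarchical tree into 2 groups and cross-tabulate the membership with the X-ray severity. A minimal sketch (not part of the original analysis), assuming the same complete-linkage tree as above:
# Sketch: cut the hierarchical tree into 2 clusters and compare with severity
hc <- df[,-5] %>% scale() %>% dist() %>% hclust(method = "complete")
hc2 <- cutree(hc, k = 2)
table(cluster = hc2, severity = df$xrayseverity)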
4) Cluster analysis with a machine learning framework (the mlr package)
Now it is time for us to unleash the power of the mlr package, one of the most comprehensive frameworks for machine learning in R.
In mlr, cluster analysis is treated as an unsupervised learning task. As a beginner, you must create 2 basic objects: a cluster task that contains the dataset (since this is an unsupervised learning problem, we don't need any target variable, and all features must be numerical), and a learner that will handle the task. The k-means algorithm will be adopted in our example:
library(mlr)
cluster.task = makeClusterTask(data=dfscale[-c(5,6)])
cluster.task
## Unsupervised task: dfscale[-c(5, 6)]
## Type: cluster
## Observations: 150
## Features:
## numerics factors ordered
## 4 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
cluster.lrn = makeLearner("cluster.kmeans", centers =2)
cluster.lrn
## Learner cluster.kmeans from package stats,clue
## Type: cluster
## Name: K-Means; Short name: kmeans
## Class: cluster.kmeans
## Properties: numerics,prob
## Predict-Type: response
## Hyperparameters: centers=2
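Here the hyperparameter centers = 2 was fixed from the clValid exploration above. As a hedged sketch (not part of the original analysis), mlr can also tune this hyperparameter itself, by resampling each candidate number of centers and comparing an internal quality measure such as the average silhouette:
# Sketch: grid search over centers = 2:4, scored by the silhouette over a 5-fold CV
ps <- makeParamSet(makeDiscreteParam("centers", values = 2:4))
tune.res <- tuneParams(cluster.lrn, cluster.task,
                       resampling = makeResampleDesc("CV", iters = 5L),
                       measures = silhouette, par.set = ps,
                       control = makeTuneControlGrid())
tune.res$x  # the number of centers with the best average silhouette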
The mlr package offers a very useful tool for visualizing how a machine learning algorithm behaves. This graphical display (available only for classification and clustering tasks) plots 2 paired features against each other in a 2-dimensional grid, while the predicted class labels or clusters are shown in colors.
Figure 7: K-means clustering result, based on a 10-fold cross-validation
p12<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("CRP","leucos"))+scale_color_brewer(palette = "Set1")
p13<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("CRP","fibrinogen"))+scale_color_brewer(palette = "Set1")
p14<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("CRP","ESR"))+scale_color_brewer(palette = "Set1")
p23<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("leucos","fibrinogen"))+scale_color_brewer(palette = "Set1")
p24<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("leucos","ESR"))+scale_color_brewer(palette = "Set1")
p34<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("fibrinogen","ESR"))+scale_color_brewer(palette = "Set1")
library(gridExtra)
grid.arrange(p12,p13,p14,p23,p24,p34,ncol=3)
Figure 8: Characteristics of clustered biomarkers among 3 Severity subgroups
df$PC<-train(cluster.lrn,cluster.task)%>%predict(.,cluster.task)%>%.$data%>%.$response%>%as.factor()
df%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=Marker,y=Value,fill=PC,color=PC))+geom_point(shape=21,alpha=0.5, size=3,position = "jitter") +
geom_rug() +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
facet_wrap(~xrayseverity, scales = "free", ncol = 2)
Figure 9A: Distribution of clustered biomarkers among 3 groups of severity
df%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=Marker,y=Value,fill=PC,color=PC))+geom_boxplot(alpha=0.6)+
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
facet_wrap(~xrayseverity, scales = "free", ncol = 1)+coord_flip()
Figure 9B: Distribution of clustered biomarkers among 3 groups of severity
df%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=Value,fill=PC,color=PC))+geom_histogram(alpha=0.5)+
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
facet_wrap(~xrayseverity, scales = "free", ncol =3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Like other machine learning tasks, cluster analysis can be validated through a resampling process, such as repeated cross-validation or bootstrapping.
Figures 10, 11 and 12 show the distribution of the 5 specified performance metrics of our k-means clustering algorithm, based on a 10x10 cross-validation:
Figures 10, 11 and 12: Performance of the k-means clustering algorithm as evaluated by 10x10 cross-validation
rdesc = makeResampleDesc("RepCV", folds=10,reps=10)
rdf <- resample(cluster.lrn, cluster.task, rdesc,
                measures = list(db, dunn, G1, G2, silhouette)) %>%
  .$measures.test %>%
  gather(., db:silhouette, key = "Metric", value = "Value")
rdf %>% ggplot(aes(x = iter, y = Value)) +
  geom_path(size = 1, aes(col = Metric)) +
  facet_wrap(~Metric, scales = "free", ncol = 1) +
  scale_color_brewer(palette = "Set1")
rdf %>% ggplot(aes(x = Metric, y = Value)) +
  geom_boxplot(alpha = 0.7, aes(fill = Metric, col = Metric)) +
  facet_wrap(~Metric, scales = "free", ncol = 2) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  coord_flip()
rdf %>% ggplot(aes(x = Value)) +
  geom_histogram(alpha = 0.6, aes(fill = Metric, col = Metric)) +
  facet_wrap(~Metric, scales = "free", ncol = 2) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1")
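The resampling scheme above also accepts a bootstrap design, which was mentioned as an alternative to repeated cross-validation. A minimal sketch (not part of the original analysis), keeping the same measures:
# Sketch: the same validation with bootstrap resampling instead of repeated CV
rdesc.boot <- makeResampleDesc("Bootstrap", iters = 30L)
resample(cluster.lrn, cluster.task, rdesc.boot,
         measures = list(db, dunn, G1, G2, silhouette))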
Note: db = Davies-Bouldin cluster separation measure (best = 0), defined as the ratio of within-cluster scatter to between-cluster separation, averaged over the clusters.
dunn refers to the Dunn index, defined as the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance.
G1 = Calinski-Harabasz pseudo F statistic, defined as the ratio of between-cluster variance to within-cluster variance.
G2 = Baker and Hubert's adaptation of Goodman-Kruskal's gamma statistic, defined as (number of concordant comparisons - number of discordant comparisons) / (number of concordant comparisons + number of discordant comparisons).
silhouette refers to Rousseeuw's silhouette internal cluster quality index; the silhouette value of an observation measures how similar it is to its own cluster compared with the other clusters, and the reported measure is the average of all silhouette values.
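For readers who want to see two of these indices computed directly, here is a minimal sketch (not part of the original analysis) using the cluster and clValid packages loaded above, applied to the 2-cluster k-means fit from section 2:
# Sketch: average silhouette width and Dunn index for the earlier k-means solution
d <- dist(scale(df[,-5]))
sil <- cluster::silhouette(fit$cluster, d)
mean(sil[, "sil_width"])  # average silhouette width (higher is better)
# clValid::dunn is namespaced here because mlr also exports an object named dunn
clValid::dunn(distance = d, clusters = fit$cluster)  # Dunn index (higher is better)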
Conclusion: Cluster analysis is a useful technique for revealing hidden information in multidimensional clinical data. In medicine, clustering can be applied to identify pathological patterns or to classify biomarker, molecular or gene clusters, which would help healthcare practitioners make the right decisions or develop targeted therapies.