Case study N46: Cluster analysis for Data Mining

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project has been initialized in 2016 and aims to:

1.Encourage using Machine Learning techniques in medical research in Vietnam

2.Promote the use of R statistical programming language, an open source and leading tool for practicing data science.

Background

In this study, 4 inflammatory markers [C-reactive protein (CRP), erythrocyte sedimentation rate (ESR), leucocyte count (leucos), and fibrinogen)] were measured in 150 patients with pneumonia. Based on X-ray chest, clinical severity was classified as A (mild infection), B (medium severity), C (severe infection). One scientific question was to assess whether the markers could adequately predict the severity of infection.

Our objective is to demonstrate the ability of Cluster analysis, an unsupervised machine learning technique for exploring and identifying groups (i.e clusters) within the biomarkers data set.

Materials and method

The original dataset was provided in Chaper 2, Machine Learning in Medicine-Cookbook Three (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics 2014). You can get the original data in SPSS format (decisiontree.sav) from the website: extras.springer.com.

Data analysis was performed in R statistical programming language. Our demonstration consists of 4 steps

Conventional data exploration
K-mean clustering method
Hierarchical clustering method
Cluster analysis by machine learning approach (Package mlr)

Following packages will be used during our demonstration:

library(tidyverse)
library(foreign)
library(RColorBrewer)
library(corrplot)
library(NbClust)
library(factoextra)
library(cluster)
library(clValid)
library(dendextend)
library(dplyr)
library(gplots)
library(mlr)
library(gridExtra)

Customizing the theme for all ggplot2 objects

my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 12),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#faefff"),
      strip.background = element_rect(fill = "#400156", color = "#400156", size =0.5),
      strip.text = element_text(face = "bold", size = 10, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())

Results

1) Conventional data exploration

df=read.spss("decisiontree.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()

df

## # A tibble: 150 × 5
##      CRP leucos fibrinogen   ESR xrayseverity
##    <dbl>  <dbl>      <dbl> <dbl>       <fctr>
## 1    120      5         11    60       A     
## 2    100      5         11    56       A     
## 3     94      4         11    60       A     
## 4     92      5         11    58       A     
## 5    100      5         11    52       A     
## 6    108      6         17    48       A     
## 7     92      5         14    48       A     
## 8    100      5         11    54       A     
## 9     88      5         11    54       A     
## 10    98      5          8    60       A     
## # ... with 140 more rows

Figure 1: Heatmap plot for exploring the biomarker dataset

dfscale<-df[,-5]%>%as.matrix()%>%scale()%>%as_tibble()%>%mutate(.,Class=df$xrayseverity,Id=row.names(.))

dfscale%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=reorder(Id,Value),y=reorder(Marker,Value),fill=Value))+geom_tile(color="black",show.legend=T)+facet_wrap(~Class,ncol=1,shrink=T,scale="free")+scale_fill_gradient2(low="gold",mid="#ff5656",high="#890150",midpoint=1)+theme(axis.text.x=element_blank())+scale_y_discrete("Markers")+scale_x_discrete("Patient's Id")

Figure 2: Correlation matrix for exploring the relationship between 4 biomarkers

cor.mtest <- function(mat, conf.level = 0.95){
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  diag(lowCI.mat) <- diag(uppCI.mat) <- 1
  for(i in 1:(n-1)){
    for(j in (i+1):n){
      tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
      p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
      lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
      uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
    }
  }
  return(list(p.mat, lowCI.mat, uppCI.mat))
}

res1<-df[,-5]%>%cor.mtest(.,0.95)

df[,-5]%>%cor(.,method="pearson")%>%corrplot(.,p.mat=res1[[1]],sig.level=0.05,order="hclust",type="lower",method="pie",tl.col="black", tl.srt=45,col=brewer.pal(n=8, name="PuRd"))

Figure 3: Characteristics of 4 biomarkers among 3 subgroups of severity

df%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=Value,fill=xrayseverity,color=xrayseverity))+geom_density(alpha = 0.4, size=1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ Marker, scales = "free_y", ncol = 2)

2) K-mean clustering method

First, we perform an exploratory analysis, in order to identify the most appropriate Clustering method, as well as the optimized number of cluster (K value)

df[,-5]%>%scale()%>%clValid(nClust=2:4,metric="euclidean",method="complete",clMethods = c("hierarchical","kmeans","pam"),validation = "internal",verbose=F)%>%summary()

## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 
## 
## Validation Measures:
##                                  2       3       4
##                                                   
## hierarchical Connectivity   0.0000  9.2468 21.6004
##              Dunn           0.2584  0.1139  0.1128
##              Silhouette     0.5389  0.4683  0.4496
## kmeans       Connectivity   5.4421 15.3294 19.4476
##              Dunn           0.1192  0.1036  0.0984
##              Silhouette     0.5416  0.4727  0.4527
## pam          Connectivity   5.4421 16.2782 32.0389
##              Dunn           0.1192  0.0890  0.0812
##              Silhouette     0.5416  0.4322  0.4397
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 0.0000 hierarchical 2       
## Dunn         0.2584 hierarchical 2       
## Silhouette   0.5416 kmeans       2

The results indicate that the most appropriate k value should be 2 for both hierarchical and k-mean methods

Figure 4: K-mean cluster analysis

df[,-5]%>%scale()%>%kmeans(2)%>%fviz_cluster(data=df[,-5],geom = "point",pointsize=2,shape=19,frame.type="norm")+my_theme()+scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1")

fit<-df[,-5]%>%scale()%>%kmeans(centers=2)

fit$size

## [1] 96 54

fit$centers

##          CRP     leucos fibrinogen        ESR
## 1  0.5630151  0.6887467  0.6609378  0.3941610
## 2 -1.0009157 -1.2244386 -1.1750005 -0.7007307

Just out of curiosity, we will try to verify whether those 2 clusers would help us to identify the pneumonia severity in our patients ?

df[,-5]%>%aggregate(.,by=list(cluster=fit$cluster),quantile)

##   cluster CRP.0% CRP.25% CRP.50% CRP.75% CRP.100% leucos.0% leucos.25%
## 1       1     98     116     126     134      158        12         15
## 2       2     86      96     100     104      120         3          5
##   leucos.50% leucos.75% leucos.100% fibrinogen.0% fibrinogen.25%
## 1         16         19          23         35.00          46.25
## 2          5          5          12          8.00          11.00
##   fibrinogen.50% fibrinogen.75% fibrinogen.100% ESR.0% ESR.25% ESR.50%
## 1          54.50          65.00           80.00   44.0    62.0    70.0
## 2          11.00          14.00           38.00   44.0    52.5    58.0
##   ESR.75% ESR.100%
## 1    88.0    130.0
## 2    60.0     76.0

df[,-5]%>%aggregate(.,by=list(cluster=fit$cluster),mean)

##   cluster      CRP    leucos fibrinogen      ESR
## 1       1 126.2917 16.572917   56.09375 76.02083
## 2       2 100.4815  5.314815   14.11111 57.68519

table(fit$cluster,df$xrayseverity)%>%knitr::kable()

A	B	C
0	46	50
50	4	0

The answer seems to be positive, at least we can distinguish between the “Mild” and “Severe” groups among out patients, using 2 clusters…

Figure 5 Characteristics of 4 biomarkers in 2 clusters

fit$centers%>%as_tibble()%>%mutate(Class=c("PC1","PC2"))%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=reorder(Marker,-Value),y=Value,fill=Class))+geom_point(shape=21,size=5,stat="identity")+facet_wrap(~Class,ncol=1,scales="free")+coord_flip()+scale_fill_brewer(palette = "Set1")+scale_color_brewer(palette = "Set1")

Note: the scales were standardized for all biomarkers

3) Hierarchical clustering method

Figure 6: Result of hierarchical cluster analysis

Rowv<-df[,-5]%>%scale %>%dist %>% hclust %>% as.dendrogram %>%
  set("branches_k_color",k=2) %>% set("branches_lwd", 1) %>%
  ladderize

Colv<-df[,-5]%>% scale %>% t %>% dist %>% hclust %>% as.dendrogram %>%
  set("branches_k_color", k = 2, value = c("blue","red")) %>%
  set("branches_lwd", 1) %>%
  ladderize

heatmap.2(scale(df[,-5]), scale = "none", col = redblue(100), 
          Rowv = Rowv, Colv = Colv,
          trace = "none", density.info = "none")

df[,-5]%>%scale()%>%eclust("hclust",k = 2,graph =TRUE)%>%fviz_dend(.,rect=TRUE,show_labels=TRUE)

4) Cluster analysis by machine learning approach (Package mlr)

Now, it’s time for us to unleash the power of mlr package. This is the most sophisticated framework for Machine learning in R

In mlr, the Clustering analysis is treated as an unsupervised learning task. For the beginners, you must create 2 basic objects: A cluster task that contains dataset (as this is an unsupervised learning problem, we dont need any target variable and all features must be numerical); a learner, that will handle the task. The k.mean algorithm will be adopted in our example:

library(mlr)
cluster.task = makeClusterTask(data=dfscale[-c(5,6)])

cluster.task

## Unsupervised task: dfscale[-c(5, 6)]
## Type: cluster
## Observations: 150
## Features:
## numerics  factors  ordered 
##        4        0        0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE

cluster.lrn = makeLearner("cluster.kmeans", centers =2)

cluster.lrn

## Learner cluster.kmeans from package stats,clue
## Type: cluster
## Name: K-Means; Short name: kmeans
## Class: cluster.kmeans
## Properties: numerics,prob
## Predict-Type: response
## Hyperparameters: centers=2

the mlr package offers a very useful tool to visualize how a machine learning algorithm functions. This graphical display method (only applied to classification / clustering tasks) represents 2 paired features in a 2 dimensioned plot, whilst the predicted class labels/clusters are presented in colors

Figure 7: K-mean clustering result, based on a 10 folds cross-validation

p12<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("CRP","leucos"))+scale_color_brewer(palette = "Set1")
p13<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("CRP","fibrinogen"))+scale_color_brewer(palette = "Set1")
p14<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("CRP","ESR"))+scale_color_brewer(palette = "Set1")
p23<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("leucos","fibrinogen"))+scale_color_brewer(palette = "Set1")
p24<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("leucos","ESR"))+scale_color_brewer(palette = "Set1")
p34<-plotLearnerPrediction(cluster.lrn,cluster.task,cv=10L,gridsize=100,features =c("fibrinogen","ESR"))+scale_color_brewer(palette = "Set1")

library(gridExtra)

grid.arrange(p12,p13,p14,p23,p24,p34,ncol=3)

Figure 8: Characteristics of clustered biomarkers among 3 Severity subgroups

df$PC<-train(cluster.lrn,cluster.task)%>%predict(.,cluster.task)%>%.$data%>%.$response%>%as.factor()

df%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=Marker,y=Value,fill=PC,color=PC))+geom_point(shape=21,alpha=0.5, size=3,position = "jitter") +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~xrayseverity, scales = "free", ncol = 2)

Figure 9A: Distribution of clustered biomarker among 3 groups of severity

df%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=Marker,y=Value,fill=PC,color=PC))+geom_boxplot(alpha=0.6)+
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~xrayseverity, scales = "free", ncol = 1)+coord_flip()

Figure 9B Distribution of clustered biomarker among 3 groups of severity

df%>%gather(CRP:ESR,key="Marker",value="Value")%>%ggplot(aes(x=Value,fill=PC,color=PC))+geom_histogram(alpha=0.5)+
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~xrayseverity, scales = "free", ncol =3)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Likes other Machine learning task, cluster analysis could be validated through a resampling process, such as repeated crossvalidation or bootstraping.

The figures 10,11 and 12 show us the distribution of 5 specified performance metric of our k-mean clustering algorithm, based on a 10x10 crossvaliration:

Figure 10,11,12 Performance of a K-mean clustering algorithm as evaluated by 10x10 Cross-validation

rdesc = makeResampleDesc("RepCV", folds=10,reps=10)
rdf<-resample(cluster.lrn, cluster.task, rdesc, measures = list(db,dunn,G1,G2,silhouette))%>%.$measures.test%>%gather(.,db:silhouette,key="Metric",value="Value")

rdf%>%ggplot(aes(x=iter,y=Value))+geom_path(size=1,aes(col=Metric))+facet_wrap(~Metric,scales="free",ncol=1)+scale_color_brewer(palette = "Set1")

rdf%>%ggplot(aes(x=Metric,y=Value))+geom_boxplot(alpha=0.7,aes(fill=Metric,col=Metric))+facet_wrap(~Metric,scales="free",ncol=2)+scale_color_brewer(palette = "Set1")+scale_fill_brewer(palette = "Set1")+coord_flip()

rdf%>%ggplot(aes(x=Value))+geom_histogram(alpha=0.6,aes(fill=Metric,col=Metric))+facet_wrap(~Metric,scales="free",ncol=2)+scale_color_brewer(palette = "Set1")+scale_fill_brewer(palette = "Set1")

Note: db = Davies-Bouldin cluster separation measure (Best =0), or Ratio of the within cluster scatter, to the between cluster separation, averaged over the clusters

dunn refers to Dunn index (Defined as the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance.)

G1= Calinski-Harabasz pseudo F statistic; Defined as ratio of between-cluster variance to within cluster variance.

G2 = Baker and Hubert adaptation of Goodman-Kruskal’s gamma statistic. Defined as: (number of concordant comparisons - number of discordant comparisons) / (number of concordant comparisons + number of discordant comparisons).

silhouette referes to Rousseeuw’s silhouette internal cluster quality index, Silhouette value of an observation is a measure of how similar an object is to its own cluster compared to other clusters. The measure is calculated as the average of all silhouette values.

Conclusion: Cluster analysis is an useful technique to reveal the hidden information in multidimensional clinical data. In medicine, clustering can be applied for identifying the pathological patterns or classifying the biomarker, molecular or gene custers that would help the healthcare practioners to make the right decision or to develop targeted therapies.

MLM Project Case Study N°46

Nhat Nam Le Dong (MD, PhD)

17 janvier 2017

Case study N46: Cluster analysis for Data Mining