The aim of this report is to analyze the characteristics (diagnostic measurements) of women of Pima Indian heritage aged at least 21. To identify groups that are similar with respect to the diagnostic measurements, partitioning clustering methods were applied - K-means, PAM and CLARA - as well as hierarchical clustering (AGNES and DIANA). In the first step, it was verified whether the data is clusterable. Next, the appropriate number of clusters was selected for each method. Then, the characteristics of women belonging to particular clusters were compared. It was also verified whether the average characteristics in each group were related to suffering from diabetes.
The data used in the report comes from Kaggle (https://www.kaggle.com/mathchi/diabetes-data-set). It originates from the National Institute of Diabetes and Digestive and Kidney Diseases and includes records for 768 women of Pima Indian heritage, all at least 21 years old. A description of the variables included in the dataset and their basic characteristics are presented below.
| Variable | Description |
|---|---|
| Pregnancies | Number of times pregnant |
| Glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-hour serum insulin (mu U/ml) |
| BMI | Body mass index (weight in kg/(height in m)^2) |
| DiabetesPedigreeFunction | Diabetes pedigree function |
| Age | Age (years) |
| Outcome | Whether the woman suffers from diabetes (1 = yes, 0 = no) |
Let’s see the first few rows of the dataset:
#first 10 rows of the dataset
library(kableExtra)
diabetes1[1:10, ] %>%
  kbl() %>%
  kable_styling()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |
#Basic statistics
summary(diabetes1)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Below I present the characteristics for healthy and ill females separately:
#relabel variable Outcome
library(kableExtra)
diabetes1$Outcome <- factor(diabetes1$Outcome, labels = c('No diabetes', 'Diabetes'))
summary(diabetes1[diabetes1$Outcome== "No diabetes",]) %>% kable() %>%
kable_styling()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|
| Min. : 0.000 | Min. : 0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. :0.0780 | Min. :21.00 | No diabetes:500 |
| 1st Qu.: 1.000 | 1st Qu.: 93 | 1st Qu.: 62.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.:25.40 | 1st Qu.:0.2298 | 1st Qu.:23.00 | Diabetes : 0 |
| Median : 2.000 | Median :107 | Median : 70.00 | Median :21.00 | Median : 39.00 | Median :30.05 | Median :0.3360 | Median :27.00 | |
| Mean : 3.298 | Mean :110 | Mean : 68.18 | Mean :19.66 | Mean : 68.79 | Mean :30.30 | Mean :0.4297 | Mean :31.19 | |
| 3rd Qu.: 5.000 | 3rd Qu.:125 | 3rd Qu.: 78.00 | 3rd Qu.:31.00 | 3rd Qu.:105.00 | 3rd Qu.:35.30 | 3rd Qu.:0.5617 | 3rd Qu.:37.00 | |
| Max. :13.000 | Max. :197 | Max. :122.00 | Max. :60.00 | Max. :744.00 | Max. :57.30 | Max. :2.3290 | Max. :81.00 | |
summary(diabetes1[diabetes1$Outcome== "Diabetes",]) %>% kable() %>%
kable_styling()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|
| Min. : 0.000 | Min. : 0.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.00 | Min. :0.0880 | Min. :21.00 | No diabetes: 0 |
| 1st Qu.: 1.750 | 1st Qu.:119.0 | 1st Qu.: 66.00 | 1st Qu.: 0.00 | 1st Qu.: 0.0 | 1st Qu.:30.80 | 1st Qu.:0.2625 | 1st Qu.:28.00 | Diabetes :268 |
| Median : 4.000 | Median :140.0 | Median : 74.00 | Median :27.00 | Median : 0.0 | Median :34.25 | Median :0.4490 | Median :36.00 | |
| Mean : 4.866 | Mean :141.3 | Mean : 70.82 | Mean :22.16 | Mean :100.3 | Mean :35.14 | Mean :0.5505 | Mean :37.07 | |
| 3rd Qu.: 8.000 | 3rd Qu.:167.0 | 3rd Qu.: 82.00 | 3rd Qu.:36.00 | 3rd Qu.:167.2 | 3rd Qu.:38.77 | 3rd Qu.:0.7280 | 3rd Qu.:44.00 | |
| Max. :17.000 | Max. :199.0 | Max. :114.00 | Max. :99.00 | Max. :846.0 | Max. :67.10 | Max. :2.4200 | Max. :70.00 | |
We can note that for the analysed measurements the means are lower for females not suffering from diabetes.
In the first step, we will verify whether the data has missing values and is clusterable.
#checking for missing values
any(is.na(diabetes1))
## [1] FALSE
We do not have a problem with missing values.
#limiting dataset to the eight diagnostic measurements (dropping Outcome)
diabetes <- diabetes1[, 1:8]
As the variables are on different scales, they should be standardized; this makes the variables comparable.
#scaling data
diabetes<- as.data.frame(lapply(diabetes, scale))
#diabetes <- scale(diabetes)
#correlation matrix
library(corrplot)
corrplot(cor(diabetes, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=40, tl.cex=0.9, number.cex=0.85, title="Diagnostic measurements", mar=c(0,0,1,0))
The correlation matrix shows that most of the pairwise relations are positive. The highest correlation occurs between the number of pregnancies and age (0.54).
library(factoextra)
get_clust_tendency(diabetes, 2, graph=TRUE, gradient=list(low="dark blue", high="white"), seed=1234)
## $hopkins_stat
## [1] 0.7097394
##
## $plot
To verify whether the data is clusterable, I used the Hopkins statistic, which tells “how well the data can be clustered”. The null hypothesis is that the dataset is uniformly distributed (i.e., contains no meaningful clusters); the alternative hypothesis is that it is not uniformly distributed (i.e., contains meaningful clusters). A value of the statistic close to 1 indicates that the data is highly clustered, random data tends to give values around 0.5, and uniformly distributed data tends to give values close to 0. In our case the Hopkins statistic equals 0.7097394, so we conclude that the data is clusterable.
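For intuition, a Hopkins-type statistic can also be computed by hand. The sketch below is illustrative only - the helper hopkins_sketch is hypothetical, and get_clust_tendency() performs an analogous computation internally - comparing nearest-neighbour distances of uniformly generated reference points (u) with those of randomly sampled real observations (w):
#hypothetical helper: Hopkins-type statistic for a numeric dataset
hopkins_sketch <- function(X, n = 50, seed = 1234) {
  set.seed(seed)
  X <- as.matrix(X)
  #reference points drawn uniformly within the observed range of each variable
  U <- apply(X, 2, function(col) runif(n, min(col), max(col)))
  u <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
  #nearest-neighbour distances of n randomly sampled real observations
  idx <- sample(nrow(X), n)
  w <- sapply(idx, function(i)
    min(sqrt(colSums((t(X[-i, , drop = FALSE]) - X[i, ])^2))))
  #close to 1: clustered; around 0.5: random (the convention used above)
  sum(u) / (sum(u) + sum(w))
}
hopkins_sketch(diabetes)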
In K-means clustering we want to define clusters so that the total intra-cluster variation is minimized. Each cluster is represented by its center (the mean of its members). The main disadvantages of this method are its sensitivity to outliers, its dependence on the initial random selection of cluster centers, and the need to choose the number of clusters in advance.
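As an aside, the sensitivity to the initial centers can be mitigated by running the algorithm from several random starts and keeping the best solution; a minimal sketch (k = 3 here only anticipates the choice justified below):
#multiple random starts reduce the risk of a poor local optimum
set.seed(1234)
km_check <- kmeans(diabetes, centers = 3, nstart = 25)
km_check$tot.withinss #total intra-cluster variation being minimized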
When selecting the appropriate number of clusters, four methods were used: the elbow method, the silhouette method, the gap statistic and the Caliński-Harabasz statistic.
# optimal number of clusters - elbow
k1=fviz_nbclust(diabetes, FUN = kmeans, method = "wss") +
geom_vline(xintercept = 3, linetype = 2, col="blue") +
labs(subtitle = "Elbow method - kmeans")
# optimal number of clusters - silhouette
k2=fviz_nbclust(diabetes, FUN = kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method - kmeans")
# optimal number of clusters - gap statistic
k3=fviz_nbclust(diabetes, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method - kmeans")
library(gridExtra)
grid.arrange(k1, k2, k3, ncol=3)
We see that the various metrics give different results, indicating 1, 2 or 3 clusters; therefore we will also use the Caliński-Harabasz statistic, which is often regarded as one of the more reliable measures. In addition, we will use the NbClust package to check which number of clusters is supported most often.
#Calinski Harabasz statistic
#install.packages("fpc")
library(fpc)
# values of the criterion according to the number of clusters
km_runs <- kmeansruns(diabetes, krange=2:10, criterion="ch", critout=TRUE) #avoid naming the result "kmeans", which would mask stats::kmeans
## 2 clusters 151.6371
## 3 clusters 156.534
## 4 clusters 144.5911
## 5 clusters 132.4852
## 6 clusters 126.1802
## 7 clusters 121.4491
## 8 clusters 114.0233
## 9 clusters 108.1808
## 10 clusters 103.3457
plot(1:10, km_runs$crit, xlab="Number of clusters", ylab="Calinski-Harabasz", col="blue")
The Caliński-Harabasz statistic takes its highest value for 3 clusters (156.534), suggesting that 3 clusters is the optimal choice.
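As a sanity check, the index can be recomputed by hand from a K-means fit, since CH(k) = (B/(k-1)) / (W/(n-k)), where B and W denote the between- and within-cluster sums of squares; the value may differ slightly from kmeansruns() because of random initialization.
#recompute the Calinski-Harabasz index for k = 3 from a kmeans fit
set.seed(1234)
km3 <- kmeans(diabetes, centers = 3, nstart = 25)
n <- nrow(diabetes); k <- 3
(km3$betweenss / (k - 1)) / (km3$tot.withinss / (n - k)) #should be close to the 156.534 reported above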
Now we will check which number of clusters is supported most often, using the NbClust package.
# Various measures
library("NbClust")
nb <- NbClust(diabetes, distance = "euclidean", min.nc = 2,
max.nc = 10, method = "kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 7 proposed 4 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
library("factoextra")
fviz_nbclust(nb)
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 4 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 7 proposed 4 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
Nine indices proposed 3 as the best number of clusters.
On the basis of these results, 3 was selected as the number of clusters for K-means and will be used in further analysis.
km3_diabetes<-eclust(diabetes, "kmeans", k=3, hc_metric="euclidean", graph=FALSE)
print(km3_diabetes)
## K-means clustering with 3 clusters of sizes 334, 214, 220
##
## Cluster means:
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 -0.5049117 -0.5674554 -0.4326831 -0.2519082 -0.3516108 -0.41856875
## 2 -0.2563359 0.6501723 0.2714609 0.8896891 1.0223295 0.63666531
## 3 1.0158927 0.2290601 0.3928342 -0.4829824 -0.4606387 0.01616176
## DiabetesPedigreeFunction Age
## 1 -0.1938818 -0.6261475
## 2 0.4130844 -0.1162438
## 3 -0.1074707 1.0636793
##
## Clustering vector:
## [1] 3 1 3 1 2 1 1 1 2 3 3 3 3 2 3 1 2 3 1 2 2 3 3 3 3 3 3 1 3 3 3 2 1 1 3 2 3
## [38] 3 1 2 2 3 3 2 3 2 1 1 3 1 1 1 1 2 2 1 2 2 2 2 1 3 1 2 3 1 1 3 1 2 1 2 3 2
## [75] 1 1 3 1 1 1 1 1 3 1 3 2 3 1 3 1 1 2 3 3 1 2 1 1 1 2 2 1 1 1 1 1 1 2 1 1 2
## [112] 2 1 1 2 3 3 1 1 1 2 1 1 3 1 2 2 1 2 3 2 3 2 3 1 1 1 1 1 2 3 3 1 3 2 1 3 2
## [149] 3 1 2 1 2 2 3 2 1 1 1 3 3 3 2 1 1 3 1 1 1 1 3 2 1 1 1 2 3 2 3 3 1 1 1 1 3
## [186] 3 2 2 2 2 1 3 3 3 3 2 1 1 2 2 1 1 1 1 3 1 2 3 1 3 1 2 3 2 3 2 2 2 1 3 2 3
## [223] 1 3 1 1 1 2 2 2 3 2 1 1 1 3 3 2 3 1 1 1 1 2 2 3 3 2 2 1 3 1 1 1 3 1 1 1 2
## [260] 3 2 1 1 3 1 3 1 2 1 1 3 1 3 1 3 2 1 1 3 1 1 3 3 3 3 3 2 2 1 2 1 1 2 2 3 2
## [297] 2 2 3 3 1 2 1 3 3 2 3 1 2 2 1 2 1 1 3 1 1 3 2 3 1 1 1 3 1 1 2 3 2 3 3 1 1
## [334] 3 1 2 1 3 2 3 1 1 1 3 3 3 1 1 1 1 1 3 1 1 1 3 2 3 3 2 2 3 3 3 2 1 3 1 1 2
## [371] 2 1 1 1 2 2 1 1 3 2 1 1 1 1 1 1 3 3 2 1 2 3 2 1 3 2 1 1 1 2 1 3 2 3 3 2 3
## [408] 1 3 2 2 2 2 1 2 2 1 2 1 1 2 1 2 1 2 2 1 2 2 2 1 1 1 1 1 1 3 3 1 3 2 1 1 3
## [445] 1 2 1 2 1 1 1 1 2 3 1 3 3 1 2 3 3 1 3 1 3 1 1 1 1 2 2 1 1 3 1 3 2 3 3 3 2
## [482] 1 1 1 1 2 2 2 1 3 1 1 1 3 1 3 1 1 3 2 1 1 1 3 1 3 2 1 1 3 3 1 3 1 1 2 3 3
## [519] 3 3 1 2 1 3 1 1 1 1 2 1 1 1 2 1 1 1 3 3 2 2 2 2 3 1 1 2 2 2 2 2 1 1 3 1 1
## [556] 2 1 3 3 3 3 2 1 1 1 1 1 3 2 2 1 1 1 1 2 1 1 1 3 2 2 1 3 3 2 1 3 1 2 1 3 2
## [593] 3 1 2 2 1 1 3 1 1 1 1 3 1 1 2 1 2 1 1 2 2 1 3 1 3 1 3 1 2 1 3 1 1 2 1 1 3
## [630] 1 3 1 1 1 3 3 3 1 2 1 1 1 3 1 1 2 2 2 3 1 1 1 2 1 1 2 1 2 3 2 3 2 2 2 3 2
## [667] 3 3 2 3 2 1 3 2 3 3 3 1 1 1 1 2 2 1 3 2 1 1 2 2 3 3 2 2 1 2 2 1 2 1 2 3 2
## [704] 1 1 1 1 2 3 2 2 3 3 2 1 2 2 3 2 3 1 2 2 3 3 2 2 1 1 1 1 3 2 1 3 1 1 3 1 1
## [741] 3 1 1 3 2 3 2 2 2 3 1 2 1 2 3 2 3 3 1 3 1 3 3 3 1 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 1602.691 1455.996 1296.025
## (between_SS / total_SS = 29.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "silinfo"
## [11] "nbclust" "data"
# plots
k=fviz_cluster(list(data=diabetes, cluster=km3_diabetes$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
a <- fviz_silhouette(km3_diabetes)
## cluster size ave.sil.width
## 1 1 334 0.25
## 2 2 214 0.10
## 3 3 220 0.15
grid.arrange(k, a, ncol=2)
We see that although 3 was indicated as the best number of clusters, the silhouette plot contains some values below 0; negative silhouette widths indicate observations that probably lie in the wrong cluster.
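The potentially mis-assigned observations can be inspected directly; eclust() stores the per-observation silhouette widths, which can be filtered as in this sketch:
#observations with negative silhouette width are closer to a neighbouring cluster
sil <- km3_diabetes$silinfo$widths
head(sil[sil[, "sil_width"] < 0, ])
sum(sil[, "sil_width"] < 0) #number of affected observations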
Now we will compare the means of the variables by cluster groups.
# statistics by clusters
library(flexclust)
d1 <- cclust(diabetes, 3, dist="euclidean")
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
stripes(d1) #flexclust::
# boxplots
groupBWplot(diabetes, km3_diabetes$cluster, alpha=0.05)
aggregate(diabetes, by=list(cluster=km3_diabetes$cluster), mean)
## cluster Pregnancies Glucose BloodPressure SkinThickness Insulin
## 1 1 -0.5049117 -0.5674554 -0.4326831 -0.2519082 -0.3516108
## 2 2 -0.2563359 0.6501723 0.2714609 0.8896891 1.0223295
## 3 3 1.0158927 0.2290601 0.3928342 -0.4829824 -0.4606387
## BMI DiabetesPedigreeFunction Age
## 1 -0.41856875 -0.1938818 -0.6261475
## 2 0.63666531 0.4130844 -0.1162438
## 3 0.01616176 -0.1074707 1.0636793
Cluster 3 mostly contains females with the highest numbers of pregnancies, the highest blood pressure and age, and the lowest insulin and skin thickness. Cluster 1 contains the youngest females, who have the lowest values for most of the measurements.
Checking the dependence of cluster division on suffering from diabetes.
table(diabetes1$Outcome, km3_diabetes$cluster)
##
## 1 2 3
## No diabetes 288 99 113
## Diabetes 46 115 107
Only 13.77% of the females in cluster 1 suffer from diabetes; in cluster 2 the share is 53.74% and in cluster 3 it is 48.64%.
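These shares can be computed directly as column-wise percentages of the contingency table:
#share of each Outcome within each cluster (column percentages)
round(prop.table(table(diabetes1$Outcome, km3_diabetes$cluster), margin = 2) * 100, 2)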
The PAM (Partitioning Around Medoids) algorithm is also a partitioning clustering method; however, it is less sensitive to outliers than K-means. PAM is based on medoids, each of which is an actual data point representing its cluster.
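For reference, eclust() with FUNcluster="pam" (used below) relies on cluster::pam(); a direct call, here with an arbitrary k = 2, would look like this sketch:
#direct PAM call; each returned medoid is an actual observation from the data
library(cluster)
pam_direct <- pam(diabetes, k = 2, metric = "euclidean")
pam_direct$medoids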
# optimal number of clusters - elbow
p1=fviz_nbclust(diabetes, FUN = pam, method = "wss") +
geom_vline(xintercept = 3, linetype = 2, col="blue") +
labs(subtitle = "Elbow method - PAM")
# optimal number of clusters - silhouette
p2=fviz_nbclust(diabetes, FUN = pam, method = "silhouette") +
labs(subtitle = "Silhouette method - PAM")
grid.arrange(p1, p2, ncol=2)
From the plots, the suggested number of clusters is 2 or 3. We’ll classify the observations into 2 and 3 clusters and compare the results.
pam2_diabetes <- eclust(diabetes, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=F)
print(pam2_diabetes)
## Medoids:
## ID Pregnancies Glucose BloodPressure SkinThickness Insulin
## [1,] 712 0.3427574 0.1596825 0.45952779 0.40518139 -0.5015400
## [2,] 316 -0.5475618 -0.2781921 -0.05711303 0.09174534 0.1232213
## BMI DiabetesPedigreeFunction Age
## [1,] -0.3034664 -0.09922567 0.5747433
## [2,] 0.2672982 -0.47347650 -0.6157094
## Clustering vector:
## [1] 1 2 1 2 2 1 2 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1
## [38] 1 2 1 2 1 1 1 1 1 2 2 1 2 2 2 1 1 1 2 1 2 1 2 2 1 2 2 1 1 1 1 2 1 2 2 1 2
## [75] 2 2 1 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2
## [112] 1 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 1 2 2 2 2 2 2 1 1 2 1 2 1 1 1
## [149] 1 2 2 1 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 1 2 2 1 2 2 2 2 1 1 2 1 1 1 2 2 2 1
## [186] 1 1 1 1 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 2 1 1 2 1 1 1 2 1
## [223] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 1 1 2 2 2 1 2 2 2 1 2 2 2 2
## [260] 1 1 2 2 1 2 1 2 2 2 2 1 2 1 2 1 2 1 2 1 2 2 1 1 1 1 1 2 2 2 1 2 2 2 2 1 1
## [297] 2 2 1 1 2 2 1 2 1 2 1 2 2 2 1 2 2 2 1 2 2 1 2 1 2 2 2 1 2 2 2 1 2 1 1 2 2
## [334] 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 2 2 1 1 2 2 1 2 1 1 2 2 1 1 1 2 2 1 2 2 1
## [371] 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 1 1 2 1 2 2 1 2 1 1 1 1 2 1
## [408] 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 1 2 1 1 2 2 1
## [445] 2 1 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 2 1 1 1 2 2 2 2 2 2 2 2 1 2 1 2 1 1 1 2
## [482] 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 2 2 1 1 2 2 1 1 1 1 1 2 2 1 1 2 1 2 2 2 1 1
## [519] 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 2 1 2 2 1 1 2 1 1 2 2 1 2 2
## [556] 1 2 1 1 1 1 2 2 1 2 2 2 1 1 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 1 2 1 1 1 2 1 2
## [593] 1 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1
## [630] 2 1 2 2 2 1 1 1 2 1 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 1 2 1 1 1 1 1 2
## [667] 1 1 1 1 1 2 1 2 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 1 2 1 2 2 2 2 2 1 1
## [704] 2 2 2 1 2 1 2 2 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 2 2 1 2 1 1 2 2 1 2 2 1 2 2
## [741] 1 2 2 1 1 1 2 2 2 1 2 2 2 2 1 1 1 1 2 1 2 1 1 1 2 1 1 2
## Objective function:
## build swap
## 2.476052 2.476052
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
## [11] "nbclust"
#plots
p2a=fviz_cluster(list(data=diabetes, cluster=pam2_diabetes$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
p2b <- fviz_silhouette(pam2_diabetes)
## cluster size ave.sil.width
## 1 1 322 0.11
## 2 2 446 0.21
grid.arrange(p2a, p2b, ncol=2)
The first cluster includes 322 observations and the second one 446.
aggregate(diabetes, by=list(cluster=pam2_diabetes$cluster), mean)
## cluster Pregnancies Glucose BloodPressure SkinThickness Insulin
## 1 1 0.7851521 0.2899376 0.3815503 -0.10157267 -0.1660649
## 2 2 -0.5668587 -0.2093272 -0.2754691 0.07333273 0.1198944
## BMI DiabetesPedigreeFunction Age
## 1 -0.03005876 0.11818436 0.8654905
## 2 0.02170161 -0.08532593 -0.6248609
The data was divided into younger, mostly healthy women (cluster 2) and older women with more health problems (cluster 1).
Checking the dependence of cluster division on suffering from diabetes.
table(diabetes1$Outcome, pam2_diabetes$cluster)
##
## 1 2
## No diabetes 165 335
## Diabetes 157 111
In cluster 1, 48.76% of the females (157 of 322) suffer from diabetes, compared to 24.89% in cluster 2.
pam3_diabetes <- eclust(diabetes, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
print(pam3_diabetes)
## Medoids:
## ID Pregnancies Glucose BloodPressure SkinThickness Insulin
## [1,] 712 0.3427574 0.1596825 0.4595278 0.4051814 -0.5015400
## [2,] 33 -0.2507887 -1.0288345 -0.5737538 -0.5978140 -0.2238683
## [3,] 412 -0.8443348 -0.2781921 0.1495433 0.5932430 0.8347551
## BMI DiabetesPedigreeFunction Age
## [1,] -0.3034664 -0.09922567 0.5747433
## [2,] -0.9122821 -0.61834778 -0.9558388
## [3,] 0.3053492 0.16938984 -0.7007418
## Clustering vector:
## [1] 1 2 1 2 3 2 2 2 3 1 1 1 1 3 1 2 3 2 3 3 3 1 1 1 1 1 1 2 1 1 1 3 2 2 1 3 1
## [38] 1 3 1 1 1 1 1 1 3 2 2 1 2 2 2 2 1 3 2 1 3 1 3 2 1 2 3 1 1 1 1 2 1 3 3 1 3
## [75] 3 2 1 1 2 2 2 2 1 2 1 3 1 3 1 2 2 3 1 1 3 1 2 2 2 3 1 2 2 2 2 3 2 1 2 3 3
## [112] 3 3 2 1 1 1 2 2 2 3 1 3 1 2 3 3 3 3 1 1 1 3 1 2 3 3 3 1 3 1 1 2 1 3 2 1 3
## [149] 1 2 3 2 1 3 1 1 2 3 2 1 1 1 3 2 1 1 1 1 2 1 1 1 2 3 2 1 1 3 1 1 2 3 2 2 1
## [186] 1 1 3 1 3 2 1 1 1 1 3 2 2 3 3 3 1 3 2 1 1 1 1 3 1 2 3 1 3 1 1 3 1 1 1 3 1
## [223] 2 1 2 3 2 3 3 3 1 3 2 2 2 1 1 3 1 2 2 3 2 3 3 1 1 3 3 2 1 2 2 3 1 3 1 2 3
## [260] 1 1 2 1 1 2 1 2 3 2 2 1 2 1 3 1 3 1 3 1 3 3 1 1 1 1 1 3 3 2 1 3 3 3 3 1 1
## [297] 3 3 1 1 2 3 1 1 1 3 1 2 3 3 1 3 1 2 1 3 2 1 3 1 3 1 1 1 3 3 3 1 3 1 1 2 2
## [334] 1 2 3 2 1 1 1 2 1 2 1 1 1 3 2 2 2 2 1 1 2 2 1 3 1 1 3 3 1 1 1 3 3 2 2 2 1
## [371] 3 2 3 3 3 1 2 3 1 3 3 2 3 3 2 2 1 1 1 3 3 1 3 1 1 3 1 3 2 1 2 1 1 1 1 3 1
## [408] 2 1 3 1 3 3 3 3 3 3 1 2 3 3 2 3 2 1 3 2 3 3 3 2 1 2 2 2 2 1 1 2 1 1 2 3 1
## [445] 2 3 2 3 3 3 2 2 3 1 3 1 1 2 1 1 1 2 1 1 1 2 2 3 2 3 3 3 3 1 2 1 3 1 1 1 3
## [482] 3 2 3 2 3 3 3 2 1 3 1 1 1 2 1 2 2 1 1 2 3 1 1 1 1 1 3 3 1 1 3 1 2 2 1 1 1
## [519] 1 1 2 3 2 1 2 2 2 2 3 2 3 3 3 2 3 2 1 1 3 3 1 3 1 3 3 1 1 3 1 1 2 3 1 2 3
## [556] 1 3 1 1 1 1 3 3 2 2 2 3 1 1 3 2 2 2 2 3 3 2 3 1 1 3 2 1 1 3 2 1 2 1 2 1 3
## [593] 1 3 3 3 1 2 1 2 2 2 1 1 2 3 3 2 3 2 2 3 3 1 1 2 1 2 1 2 3 3 1 3 2 3 2 2 1
## [630] 2 1 3 2 3 2 1 1 3 1 2 3 2 1 2 3 3 1 3 1 2 2 3 1 2 3 3 2 3 1 3 1 3 1 1 1 3
## [667] 1 1 1 1 1 2 1 3 1 1 1 2 2 3 2 3 3 1 1 3 2 2 3 3 1 1 3 1 2 3 1 2 3 3 3 1 1
## [704] 2 2 1 2 3 1 3 3 1 1 3 2 3 3 1 3 1 1 3 1 1 1 1 3 3 1 2 1 2 3 3 1 1 3 1 3 1
## [741] 1 2 2 1 1 1 3 3 3 1 1 3 2 3 1 3 1 1 2 1 2 1 2 1 3 1 1 2
## Objective function:
## build swap
## 2.355666 2.326927
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
## [11] "nbclust"
#plots
p3a=fviz_cluster(list(data=diabetes, cluster=pam3_diabetes$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
p3b <- fviz_silhouette(pam3_diabetes)
## cluster size ave.sil.width
## 1 1 317 0.09
## 2 2 206 0.25
## 3 3 245 0.14
grid.arrange(p3a, p3b, ncol=2)
The first cluster includes 317 observations, the second 206 and the third 245. Cluster 3 was created mainly by splitting one of the clusters from the 2-cluster PAM solution. The per-cluster silhouette widths remain low for both solutions, and negative silhouette values are still present.
aggregate(diabetes, by=list(cluster=pam3_diabetes$cluster), mean)
## cluster Pregnancies Glucose BloodPressure SkinThickness Insulin
## 1 1 0.7574908 0.2914001 0.39352163 -0.1568282 -0.2763971
## 2 2 -0.4207849 -0.6267914 -0.71871423 -0.6096819 -0.4698218
## 3 3 -0.6262975 0.1499804 0.09513786 0.7155470 0.7526578
## BMI DiabetesPedigreeFunction Age
## 1 0.07980411 -0.0172500 0.8462030
## 2 -0.73895926 -0.3772473 -0.6635917
## 3 0.51807226 0.3395151 -0.5369244
table(diabetes1$Outcome, pam3_diabetes$cluster)
##
## 1 2 3
## No diabetes 161 179 160
## Diabetes 156 27 85
In cluster 1, 49.21% of the females suffer from diabetes, compared to 13.11% in cluster 2 and 34.69% in cluster 3.
The CLARA algorithm is an extension of the PAM algorithm. Instead of finding medoids for the entire dataset, it draws small samples and applies the PAM algorithm to each of them. CLARA is recommended for large datasets as it reduces the computation time.
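For reference, a direct call to cluster::clara() looks like the sketch below. By default it draws subsamples of size min(n, 40 + 2*k), i.e. 46 observations for k = 3, which matches the 46-point “Best sample” shown in the output further down.
#direct CLARA call: PAM is run on several subsamples and the best medoids are kept
library(cluster)
clara_direct <- clara(diabetes, k = 3, metric = "euclidean", samples = 5, pamLike = TRUE)
clara_direct$sample #indices of the best subsample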
# optimal number of clusters - elbow
c1=fviz_nbclust(diabetes, FUN = clara, method = "wss") +
geom_vline(xintercept = 3, linetype = 2, col="blue") +
labs(subtitle = "Elbow method - CLARA")
# optimal number of clusters - silhouette
c2=fviz_nbclust(diabetes, FUN = clara, method = "silhouette") +
labs(subtitle = "Silhouette method - CLARA")
grid.arrange(c1, c2, ncol=2)
From the plots, the suggested number of clusters is 3.
cl3_diabetes <- eclust(diabetes, k=3 , FUNcluster="clara", hc_metric="euclidean", graph=F)
print(cl3_diabetes)
## Call: fun_clust(x = x, k = k)
## Medoids:
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## [1,] -0.2507887 0.2847896 0.4595278 0.1544326 -0.006937274 -0.4556704
## [2,] -0.8443348 -1.0288345 -0.3670975 0.2171198 -0.310640714 -0.2654155
## [3,] -1.1411079 0.9728784 0.6661841 1.1574279 1.667770269 1.2058890
## DiabetesPedigreeFunction Age
## [1,] -0.4493313 0.06454929
## [2,] -0.1505343 -0.87080644
## [3,] -0.6092933 -0.53067709
## Objective function: 2.442605
## Clustering vector: int [1:768] 1 2 1 2 2 1 2 2 3 1 1 1 1 3 1 2 3 1 ...
## Cluster sizes: 389 279 100
## Best sample:
## [1] 9 69 78 103 114 130 138 145 149 169 170 245 260 271 278 308 336 382 399
## [20] 406 422 455 467 487 490 491 517 554 556 566 568 575 582 609 611 616 630 646
## [39] 648 651 664 669 687 714 731 751
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
## [11] "nbclust"
aggregate(diabetes, by=list(cluster=cl3_diabetes$cluster), mean)
## cluster Pregnancies Glucose BloodPressure SkinThickness Insulin
## 1 1 0.4991545 0.2765081 0.3465043 -0.22523600 -0.2341962
## 2 2 -0.5581988 -0.7375906 -0.6133815 -0.06306286 -0.2589505
## 3 3 -0.3843366 0.9822614 0.3634326 1.05211339 1.6334952
## BMI DiabetesPedigreeFunction Age
## 1 -0.04121794 -0.10563439 0.5323365
## 2 -0.29328315 0.02702827 -0.6617305
## 3 0.97859779 0.33550892 -0.2245607
#plot the output
ca=fviz_cluster(list(data=diabetes, cluster=cl3_diabetes$cluster),
ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic())
cb <- fviz_silhouette(cl3_diabetes)
## cluster size ave.sil.width
## 1 1 389 0.09
## 2 2 279 0.21
## 3 3 100 0.15
grid.arrange(ca, cb, ncol=2)
The first cluster includes 389 observations, the second 279 and the third 100.
The last method that will be used is hierarchical clustering, which produces a tree-based representation of the objects. We distinguish two types of hierarchical clustering: agglomerative (AGNES), a bottom-up approach in which the most similar clusters are successively merged until a single cluster remains, and divisive (DIANA), a top-down approach in which clustering begins from the root and the most heterogeneous clusters are successively divided. Unlike in partitioning clustering, we do not need to define the number of clusters in advance.
Although there are many linkage methods (complete, single, average, centroid, Ward’s), the complete and Ward’s methods are usually preferred. Here only Ward’s method will be applied for AGNES and DIANA clustering.
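For comparison, an essentially equivalent Ward’s clustering can be obtained with base R (a sketch; eclust() used below additionally attaches the silhouette information):
#Ward's method on the Euclidean distance matrix, cut into k = 2 groups
d <- dist(diabetes, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
groups <- cutree(hc, k = 2)
table(groups)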
agnes <- eclust(diabetes, k=2, FUNcluster="agnes", hc_metric="euclidean", hc_method = "ward.D2" )
fviz_dend(agnes, show_labels = FALSE, palette = "jco", as.ggplot = TRUE)
#cluster size
cb <- fviz_silhouette(agnes)
## cluster size ave.sil.width
## 1 1 342 0.04
## 2 2 426 0.25
For the AGNES algorithm, cluster 1 includes 342 observations and cluster 2 includes 426.
#Suffering from diabetes and cluster belonging
table(diabetes1$Outcome, agnes$cluster)
##
## 1 2
## No diabetes 167 333
## Diabetes 175 93
In cluster 1, 51.17% of the females suffer from diabetes, compared to 21.83% in cluster 2.
diana <- eclust(diabetes, k=2, FUNcluster="diana", hc_metric="euclidean", hc_method = "ward.D2")
fviz_dend(diana, show_labels = FALSE, palette = "jco", as.ggplot = TRUE)
#cluster size
cb <- fviz_silhouette(diana)
## cluster size ave.sil.width
## 1 1 621 0.24
## 2 2 147 0.13
For the DIANA algorithm, cluster 1 includes 621 observations and cluster 2 includes 147.
#Suffering from diabetes and cluster belonging
table(diabetes1$Outcome, diana$cluster)
##
## 1 2
## No diabetes 451 49
## Diabetes 170 98
In cluster 1, 27.38% of the females suffer from diabetes, compared to 66.67% in cluster 2.
Finally, we will compare the clustering algorithms. To do this, the clValid package can be used. When comparing clustering algorithms, two types of validation measures are used:
internal measures:
- Connectivity - indicates the degree to which observations are placed in the same cluster as their nearest neighbours. It takes values between 0 and infinity and should be minimized.
- Silhouette coefficient - measures how well each observation lies within its cluster. Observations with a coefficient close to 1 are very well clustered, values around 0 mean the observation lies between two clusters, and values below 0 indicate it is in the wrong cluster.
- Dunn index - the ratio of the smallest distance between observations in different clusters to the largest intra-cluster distance; it should be maximized.
stability measures:
- The average proportion of non-overlap (APN) - measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.
- The average distance (AD) - measures the average distance between observations placed in the same cluster under both cases.
- The average distance between means (ADM) - measures the average distance between cluster centers for observations placed in the same cluster under both cases.
- The figure of merit (FOM) - measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns.
APN, ADM and FOM take values from 0 to 1; smaller values correspond to more consistent clustering results. AD takes values between 0 and infinity; smaller values are preferred.
Internal measures:
#install.packages("clValid")
#library(clValid)
#clmethods <- c("hierarchical","kmeans","pam")
#intern <- clValid(diabetes, nClust = 2:10, clMethods = clmethods, validation = "internal")
#summary(intern)
| Internal measure | Score | Method | Clusters |
|---|---|---|---|
| Connectivity | 12.0171 | hierarchical | 2 |
| Dunn | 0.1887 | hierarchical | 2 |
| Silhouette | 0.4437 | hierarchical | 2 |
It can be seen that hierarchical clustering with 2 clusters performs best on each internal measure (connectivity, Dunn and silhouette). For the K-means and PAM algorithms, different measures give different optimal numbers of clusters (2 or 3), which is consistent with the earlier results.
Stability measures:
#stab <- clValid(diabetes, nClust = 2:10,
#clMethods = clmethods, validation = "stability")
#optimalScores(stab)
| Stability measure | Score | Method | Clusters |
|---|---|---|---|
| APN | 0.003067485 | hierarchical | 2 |
| AD | 2.817856993 | kmeans | 10 |
| ADM | 0.070250715 | hierarchical | 3 |
| FOM | 0.918725745 | pam | 10 |
For the APN and ADM measures, hierarchical clustering with two and three clusters respectively gives the best score. For the AD measure, the k-means clustering with 10 clusters has the best score and according to FOM the PAM algorithm performs the best with ten clusters.
In this report, basic clustering methods were applied to the diabetes dataset. The measures used mostly indicated the existence of 2 or 3 clusters. The analysis of internal validation measures suggests that hierarchical clustering with two clusters would be the most appropriate for the diabetes dataset. The stability measures give inconsistent results; only APN pointed to hierarchical clustering with 2 clusters.