Introduction

The aim of this report is to analyze the characteristics (diagnostic measurements) of women of Pima Indian heritage aged at least 21. To identify groups of similar women based on these diagnostic measurements, partitioning clustering methods were applied - K-means, PAM and CLARA - followed by hierarchical clustering. In the first step, it was verified whether the data is clusterable. Next, the appropriate number of clusters was selected for each method. Then, the characteristics of women belonging to particular clusters were compared. It was also verified whether the average characteristics in each group were related to suffering from diabetes.

Dataset

The data used in the report comes from Kaggle (https://www.kaggle.com/mathchi/diabetes-data-set). It is originally from the National Institute of Diabetes and Digestive and Kidney Diseases and includes data for 768 women of Pima Indian heritage who are at least 21 years old. A description of the variables included in the dataset and their basic characteristics are presented below.

Variable Description
Pregnancies Number of times pregnant
Glucose Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure Diastolic blood pressure (mm Hg)
SkinThickness Triceps skin fold thickness (mm)
Insulin 2-hour serum insulin (mu U/ml)
BMI Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction Diabetes pedigree function
Age Age (years)
Outcome Diabetes status (1 = diabetes, 0 = no diabetes)

Let’s look at the first few rows of the dataset:

library(kableExtra)
# diabetes1 is assumed to have been loaded earlier, e.g. diabetes1 <- read.csv("diabetes.csv") (file name hypothetical)
diabetes1[1:10,] %>%
  kbl() %>%
  kable_styling()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
6 148 72 35 0 33.6 0.627 50 1
1 85 66 29 0 26.6 0.351 31 0
8 183 64 0 0 23.3 0.672 32 1
1 89 66 23 94 28.1 0.167 21 0
0 137 40 35 168 43.1 2.288 33 1
5 116 74 0 0 25.6 0.201 30 0
3 78 50 32 88 31.0 0.248 26 1
10 115 0 0 0 35.3 0.134 29 0
2 197 70 45 543 30.5 0.158 53 1
8 125 96 0 0 0.0 0.232 54 1
#Basic statistics
summary(diabetes1)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Below I present the characteristics for healthy and ill females separately:

#relabel variable Outcome
library(kableExtra)
diabetes1$Outcome <- factor(diabetes1$Outcome, labels = c('No diabetes', 'Diabetes'))
summary(diabetes1[diabetes1$Outcome== "No diabetes",]) %>% kable()  %>%
  kable_styling()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
Min. : 0.000 Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. :0.0780 Min. :21.00 No diabetes:500
1st Qu.: 1.000 1st Qu.: 93 1st Qu.: 62.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:25.40 1st Qu.:0.2298 1st Qu.:23.00 Diabetes : 0
Median : 2.000 Median :107 Median : 70.00 Median :21.00 Median : 39.00 Median :30.05 Median :0.3360 Median :27.00 NA
Mean : 3.298 Mean :110 Mean : 68.18 Mean :19.66 Mean : 68.79 Mean :30.30 Mean :0.4297 Mean :31.19 NA
3rd Qu.: 5.000 3rd Qu.:125 3rd Qu.: 78.00 3rd Qu.:31.00 3rd Qu.:105.00 3rd Qu.:35.30 3rd Qu.:0.5617 3rd Qu.:37.00 NA
Max. :13.000 Max. :197 Max. :122.00 Max. :60.00 Max. :744.00 Max. :57.30 Max. :2.3290 Max. :81.00 NA
summary(diabetes1[diabetes1$Outcome== "Diabetes",]) %>% kable()  %>%
  kable_styling()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. :0.0880 Min. :21.00 No diabetes: 0
1st Qu.: 1.750 1st Qu.:119.0 1st Qu.: 66.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.:30.80 1st Qu.:0.2625 1st Qu.:28.00 Diabetes :268
Median : 4.000 Median :140.0 Median : 74.00 Median :27.00 Median : 0.0 Median :34.25 Median :0.4490 Median :36.00 NA
Mean : 4.866 Mean :141.3 Mean : 70.82 Mean :22.16 Mean :100.3 Mean :35.14 Mean :0.5505 Mean :37.07 NA
3rd Qu.: 8.000 3rd Qu.:167.0 3rd Qu.: 82.00 3rd Qu.:36.00 3rd Qu.:167.2 3rd Qu.:38.77 3rd Qu.:0.7280 3rd Qu.:44.00 NA
Max. :17.000 Max. :199.0 Max. :114.00 Max. :99.00 Max. :846.0 Max. :67.10 Max. :2.4200 Max. :70.00 NA

We can note that for all of the analysed measurements the means are lower for females not suffering from diabetes (e.g., mean glucose 110 vs. 141).

Preliminary analysis

In the first step, we will verify whether the data has missing values and is clusterable.

#checking for missing values
any(is.na(diabetes1))
## [1] FALSE

We do not have a problem with missing values (no NA entries). Note, however, that several variables (Glucose, BloodPressure, SkinThickness, Insulin, BMI) contain zeros, which are physiologically implausible and most likely encode missing measurements.

#limiting dataset
diabetes <- diabetes1[, 1:8]  # keep the 8 diagnostic measurements, drop Outcome

As the variables are on different scales, they should be standardized (converted to z-scores) to make them comparable.

#scaling data
diabetes <- as.data.frame(lapply(diabetes, scale))
#diabetes <- scale(diabetes)

Correlation matrix

#correlation matrix
library(corrplot)
corrplot(cor(diabetes, use="complete"), method="number", type="upper", diag=F,
         tl.col="black", tl.srt=40, tl.cex=0.9, number.cex=0.85,
         title="Diagnostic measurements", mar=c(0,0,1,0))

From the correlation matrix we see that most of the relations are positive. The strongest correlation occurs between the number of pregnancies and age (0.54).

In the next step we will check whether the data is clusterable.

library(factoextra)
get_clust_tendency(diabetes, 2, graph=TRUE, gradient=list(low="dark blue",  high="white"), seed=1234)
## $hopkins_stat
## [1] 0.7097394
## 
## $plot

To verify whether the data is clusterable, I used the Hopkins statistic, which tells “how well the data can be clustered”. The null hypothesis is that the dataset is uniformly distributed (i.e., contains no meaningful clusters); the alternative hypothesis is that it is not uniformly distributed (i.e., contains meaningful clusters). A value close to 1 indicates highly clustered data, random data tends to give values around 0.5, and uniformly distributed data gives values close to 0. In our case the Hopkins statistic equals 0.7097394, so we conclude that the data is clusterable.
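For intuition, the statistic compares nearest-neighbour distances of real observations with those of points generated uniformly over the data range. Below is a minimal sketch of the computation; hopkins_manual is a hypothetical helper written for illustration, not a function from any package, and it assumes a sample of n = 50 points:

hopkins_manual <- function(X, n = 50, seed = 1234) {
  set.seed(seed)
  X <- as.matrix(X)
  # w: distances from n sampled real points to their nearest other real point
  idx <- sample(nrow(X), n)
  w <- sapply(idx, function(i) sqrt(min(rowSums(sweep(X[-i, , drop = FALSE], 2, X[i, ])^2))))
  # u: distances from n uniform points (within the data range) to their nearest real point
  U <- apply(X, 2, function(col) runif(n, min(col), max(col)))
  u <- apply(U, 1, function(p) sqrt(min(rowSums(sweep(X, 2, p)^2))))
  # values close to 1 suggest clusterable data (factoextra's convention)
  sum(u) / (sum(u) + sum(w))
}
hopkins_manual(diabetes)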

K-means clustering

In K-means clustering we want to define clusters so that the total intra-cluster variation is minimized. Each cluster is represented by its center, the mean of the points assigned to it. The main disadvantages of this method are its sensitivity to outliers, its dependence on the initial random selection of cluster centers, and the need to choose the number of clusters in advance.
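To make “total intra-cluster variation” concrete: for a fitted solution it is the sum over clusters of squared Euclidean distances to the cluster centers. A small illustrative check against the scaled data (a sketch, not part of the report’s pipeline):

km <- kmeans(diabetes, centers = 3, nstart = 25)
# recompute tot.withinss by hand: sum over clusters of squared distances to centers
manual_wss <- sum(sapply(seq_len(3), function(k) {
  pts <- as.matrix(diabetes[km$cluster == k, , drop = FALSE])
  sum(sweep(pts, 2, km$centers[k, ])^2)
}))
all.equal(manual_wss, km$tot.withinss)  # should be TRUE up to floating-point error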

Deciding on the number of clusters

When selecting the appropriate number of clusters, four methods were used - the elbow method, the silhouette method, the gap statistic and the Caliński-Harabasz statistic.

# optimal number of clusters - elbow
k1=fviz_nbclust(diabetes, FUN = kmeans, method = "wss") + 
  geom_vline(xintercept = 3, linetype = 2, col="blue") +
  labs(subtitle = "Elbow method - kmeans")

# optimal number of clusters - silhouette
k2=fviz_nbclust(diabetes, FUN = kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method - kmeans")

# optimal number of clusters - gap statistic
k3=fviz_nbclust(diabetes, kmeans, nstart = 25,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method - kmeans")

grid.arrange(k1, k2, k3, ncol=3)

We see that the various metrics give different results, indicating 1, 2 or 3 clusters; therefore we will also use the Caliński-Harabasz statistic, which is often regarded as one of the more reliable measures. In addition, we will use the NbClust package to check which number of clusters is supported most often.

#Calinski Harabasz statistic
#install.packages("fpc")
library(fpc)
# values of the criterion according to the number of clusters
km_ch <- kmeansruns(diabetes, krange=2:10, criterion="ch", critout=TRUE)  # avoid masking stats::kmeans
## 2  clusters  151.6371 
## 3  clusters  156.534 
## 4  clusters  144.5911 
## 5  clusters  132.4852 
## 6  clusters  126.1802 
## 7  clusters  121.4491 
## 8  clusters  114.0233 
## 9  clusters  108.1808 
## 10  clusters  103.3457
plot(1:10, km_ch$crit, xlab="Nb. of clusters", ylab="Calinski-Harabasz", col="blue")

The Caliński-Harabasz statistic takes its highest value for 3 clusters (156.534), suggesting that 3 clusters is the most appropriate choice.
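For reference, the Caliński-Harabasz index is a between- to within-cluster variance ratio, CH(k) = (B(k)/(k-1)) / (W(k)/(n-k)). A sketch of how it can be recovered from a k-means fit (ch_index is an illustrative helper, not a function from fpc):

ch_index <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}
ch_index(kmeans(diabetes, centers = 3, nstart = 25), nrow(diabetes))  # should be close to the 156.5 reported above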

Now we will check which number of clusters is supported most often, using the NbClust package.

# Various measures
library("NbClust")
nb <- NbClust(diabetes, distance = "euclidean", min.nc = 2,
              max.nc = 10, method = "kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 7 proposed 4 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
library("factoextra")
fviz_nbclust(nb)
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 4 proposed  2 as the best number of clusters
## * 9 proposed  3 as the best number of clusters
## * 7 proposed  4 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 2 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

9 indices proposed 3 as the best number of clusters.

On the basis of the obtained results, 3 was selected as the best number of clusters for K-means clustering and will be used in further analysis.

km3_diabetes<-eclust(diabetes, "kmeans", k=3, hc_metric="euclidean", graph=FALSE) 
print(km3_diabetes)
## K-means clustering with 3 clusters of sizes 334, 214, 220
## 
## Cluster means:
##   Pregnancies    Glucose BloodPressure SkinThickness    Insulin         BMI
## 1  -0.5049117 -0.5674554    -0.4326831    -0.2519082 -0.3516108 -0.41856875
## 2  -0.2563359  0.6501723     0.2714609     0.8896891  1.0223295  0.63666531
## 3   1.0158927  0.2290601     0.3928342    -0.4829824 -0.4606387  0.01616176
##   DiabetesPedigreeFunction        Age
## 1               -0.1938818 -0.6261475
## 2                0.4130844 -0.1162438
## 3               -0.1074707  1.0636793
## 
## Clustering vector:
##   [1] 3 1 3 1 2 1 1 1 2 3 3 3 3 2 3 1 2 3 1 2 2 3 3 3 3 3 3 1 3 3 3 2 1 1 3 2 3
##  [38] 3 1 2 2 3 3 2 3 2 1 1 3 1 1 1 1 2 2 1 2 2 2 2 1 3 1 2 3 1 1 3 1 2 1 2 3 2
##  [75] 1 1 3 1 1 1 1 1 3 1 3 2 3 1 3 1 1 2 3 3 1 2 1 1 1 2 2 1 1 1 1 1 1 2 1 1 2
## [112] 2 1 1 2 3 3 1 1 1 2 1 1 3 1 2 2 1 2 3 2 3 2 3 1 1 1 1 1 2 3 3 1 3 2 1 3 2
## [149] 3 1 2 1 2 2 3 2 1 1 1 3 3 3 2 1 1 3 1 1 1 1 3 2 1 1 1 2 3 2 3 3 1 1 1 1 3
## [186] 3 2 2 2 2 1 3 3 3 3 2 1 1 2 2 1 1 1 1 3 1 2 3 1 3 1 2 3 2 3 2 2 2 1 3 2 3
## [223] 1 3 1 1 1 2 2 2 3 2 1 1 1 3 3 2 3 1 1 1 1 2 2 3 3 2 2 1 3 1 1 1 3 1 1 1 2
## [260] 3 2 1 1 3 1 3 1 2 1 1 3 1 3 1 3 2 1 1 3 1 1 3 3 3 3 3 2 2 1 2 1 1 2 2 3 2
## [297] 2 2 3 3 1 2 1 3 3 2 3 1 2 2 1 2 1 1 3 1 1 3 2 3 1 1 1 3 1 1 2 3 2 3 3 1 1
## [334] 3 1 2 1 3 2 3 1 1 1 3 3 3 1 1 1 1 1 3 1 1 1 3 2 3 3 2 2 3 3 3 2 1 3 1 1 2
## [371] 2 1 1 1 2 2 1 1 3 2 1 1 1 1 1 1 3 3 2 1 2 3 2 1 3 2 1 1 1 2 1 3 2 3 3 2 3
## [408] 1 3 2 2 2 2 1 2 2 1 2 1 1 2 1 2 1 2 2 1 2 2 2 1 1 1 1 1 1 3 3 1 3 2 1 1 3
## [445] 1 2 1 2 1 1 1 1 2 3 1 3 3 1 2 3 3 1 3 1 3 1 1 1 1 2 2 1 1 3 1 3 2 3 3 3 2
## [482] 1 1 1 1 2 2 2 1 3 1 1 1 3 1 3 1 1 3 2 1 1 1 3 1 3 2 1 1 3 3 1 3 1 1 2 3 3
## [519] 3 3 1 2 1 3 1 1 1 1 2 1 1 1 2 1 1 1 3 3 2 2 2 2 3 1 1 2 2 2 2 2 1 1 3 1 1
## [556] 2 1 3 3 3 3 2 1 1 1 1 1 3 2 2 1 1 1 1 2 1 1 1 3 2 2 1 3 3 2 1 3 1 2 1 3 2
## [593] 3 1 2 2 1 1 3 1 1 1 1 3 1 1 2 1 2 1 1 2 2 1 3 1 3 1 3 1 2 1 3 1 1 2 1 1 3
## [630] 1 3 1 1 1 3 3 3 1 2 1 1 1 3 1 1 2 2 2 3 1 1 1 2 1 1 2 1 2 3 2 3 2 2 2 3 2
## [667] 3 3 2 3 2 1 3 2 3 3 3 1 1 1 1 2 2 1 3 2 1 1 2 2 3 3 2 2 1 2 2 1 2 1 2 3 2
## [704] 1 1 1 1 2 3 2 2 3 3 2 1 2 2 3 2 3 1 2 2 3 3 2 2 1 1 1 1 3 2 1 3 1 1 3 1 1
## [741] 3 1 1 3 2 3 2 2 2 3 1 2 1 2 3 2 3 3 1 3 1 3 3 3 1 1 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 1602.691 1455.996 1296.025
##  (between_SS / total_SS =  29.0 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "silinfo"     
## [11] "nbclust"      "data"
# plots
k=fviz_cluster(list(data=diabetes, cluster=km3_diabetes$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic()) 

a <- fviz_silhouette(km3_diabetes)
##   cluster size ave.sil.width
## 1       1  334          0.25
## 2       2  214          0.10
## 3       3  220          0.15
grid.arrange(k, a, ncol=2)

We see that although 3 was indicated as the best number of clusters, the silhouette plot contains some values below 0, i.e., observations that may have been assigned to the wrong cluster.

Now we will compare the means of the variables by cluster groups.

# statistics by clusters
library(flexclust)
d1 <- cclust(diabetes, 3, dist="euclidean")
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
stripes(d1) #flexclust::

# boxplots
groupBWplot(diabetes, km3_diabetes$cluster, alpha=0.05) 

aggregate(diabetes, by=list(cluster=km3_diabetes$cluster), mean)
##   cluster Pregnancies    Glucose BloodPressure SkinThickness    Insulin
## 1       1  -0.5049117 -0.5674554    -0.4326831    -0.2519082 -0.3516108
## 2       2  -0.2563359  0.6501723     0.2714609     0.8896891  1.0223295
## 3       3   1.0158927  0.2290601     0.3928342    -0.4829824 -0.4606387
##           BMI DiabetesPedigreeFunction        Age
## 1 -0.41856875               -0.1938818 -0.6261475
## 2  0.63666531                0.4130844 -0.1162438
## 3  0.01616176               -0.1074707  1.0636793

Cluster 3 consists mostly of women with the highest numbers of pregnancies, the highest blood pressure and age, and the lowest insulin and skin thickness. Cluster 1 contains the youngest women, for whom most of the measurements are the lowest.

Checking the dependence of cluster membership on suffering from diabetes

table(diabetes1$Outcome, km3_diabetes$cluster) 
##              
##                 1   2   3
##   No diabetes 288  99 113
##   Diabetes     46 115 107

Only 13.77% of the females in cluster 1 suffer from diabetes; in cluster 2 the share is 53.74%, and in cluster 3 it is 48.64%.
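These shares can be computed directly from the contingency table, for example:

# column-wise percentages of diabetics per cluster
round(100 * prop.table(table(diabetes1$Outcome, km3_diabetes$cluster), margin = 2), 2)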

PAM clustering

The PAM algorithm is also a partitioning clustering method, but it is less sensitive to outliers than K-means. PAM is based on medoids - actual data points that represent their clusters.
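To illustrate, the medoid of a cluster is the member with the smallest total dissimilarity to all other members. A minimal sketch (medoid_idx is a hypothetical helper, not part of the cluster package):

medoid_idx <- function(X) {
  D <- as.matrix(dist(X))   # pairwise Euclidean distances
  which.min(rowSums(D))     # index of the point minimizing total distance to the others
}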

Deciding on the number of clusters

# optimal number of clusters - elbow
p1=fviz_nbclust(diabetes, FUN = pam, method = "wss") + 
  geom_vline(xintercept = 3, linetype = 2, col="blue") +
  labs(subtitle = "Elbow method - PAM")

# optimal number of clusters - silhouette
p2=fviz_nbclust(diabetes, FUN = pam, method = "silhouette") +
  labs(subtitle = "Silhouette method - PAM")


grid.arrange(p1, p2, ncol=2)

From the plots, the suggested number of clusters is 2 or 3. We’ll classify the observations into 2 and 3 clusters and compare the results.

PAM with 2 clusters

pam2_diabetes <- eclust(diabetes, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=F)
print(pam2_diabetes)
## Medoids:
##       ID Pregnancies    Glucose BloodPressure SkinThickness    Insulin
## [1,] 712   0.3427574  0.1596825    0.45952779    0.40518139 -0.5015400
## [2,] 316  -0.5475618 -0.2781921   -0.05711303    0.09174534  0.1232213
##             BMI DiabetesPedigreeFunction        Age
## [1,] -0.3034664              -0.09922567  0.5747433
## [2,]  0.2672982              -0.47347650 -0.6157094
## Clustering vector:
##   [1] 1 2 1 2 2 1 2 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1
##  [38] 1 2 1 2 1 1 1 1 1 2 2 1 2 2 2 1 1 1 2 1 2 1 2 2 1 2 2 1 1 1 1 2 1 2 2 1 2
##  [75] 2 2 1 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2
## [112] 1 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 1 2 2 2 2 2 2 1 1 2 1 2 1 1 1
## [149] 1 2 2 1 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 1 2 2 1 2 2 2 2 1 1 2 1 1 1 2 2 2 1
## [186] 1 1 1 1 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 2 1 1 2 1 1 1 2 1
## [223] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 1 1 2 2 2 1 2 2 2 1 2 2 2 2
## [260] 1 1 2 2 1 2 1 2 2 2 2 1 2 1 2 1 2 1 2 1 2 2 1 1 1 1 1 2 2 2 1 2 2 2 2 1 1
## [297] 2 2 1 1 2 2 1 2 1 2 1 2 2 2 1 2 2 2 1 2 2 1 2 1 2 2 2 1 2 2 2 1 2 1 1 2 2
## [334] 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 2 2 1 1 2 2 1 2 1 1 2 2 1 1 1 2 2 1 2 2 1
## [371] 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 1 1 2 1 2 2 1 2 1 1 1 1 2 1
## [408] 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 1 2 1 1 2 2 1
## [445] 2 1 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 2 1 1 1 2 2 2 2 2 2 2 2 1 2 1 2 1 1 1 2
## [482] 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 2 2 1 1 2 2 1 1 1 1 1 2 2 1 1 2 1 2 2 2 1 1
## [519] 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 2 1 2 2 1 1 2 1 1 2 2 1 2 2
## [556] 1 2 1 1 1 1 2 2 1 2 2 2 1 1 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 1 2 1 1 1 2 1 2
## [593] 1 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1
## [630] 2 1 2 2 2 1 1 1 2 1 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 1 2 1 1 1 1 1 2
## [667] 1 1 1 1 1 2 1 2 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 1 2 1 2 2 2 2 2 1 1
## [704] 2 2 2 1 2 1 2 2 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 2 2 1 2 1 1 2 2 1 2 2 1 2 2
## [741] 1 2 2 1 1 1 2 2 2 1 2 2 2 2 1 1 1 1 2 1 2 1 1 1 2 1 1 2
## Objective function:
##    build     swap 
## 2.476052 2.476052 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## [11] "nbclust"
#plots
p2a=fviz_cluster(list(data=diabetes, cluster=pam2_diabetes$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic()) 

p2b <- fviz_silhouette(pam2_diabetes)
##   cluster size ave.sil.width
## 1       1  322          0.11
## 2       2  446          0.21
grid.arrange(p2a, p2b, ncol=2)

The first cluster includes 322 observations and the second one 446.

aggregate(diabetes, by=list(cluster=pam2_diabetes$cluster), mean)
##   cluster Pregnancies    Glucose BloodPressure SkinThickness    Insulin
## 1       1   0.7851521  0.2899376     0.3815503   -0.10157267 -0.1660649
## 2       2  -0.5668587 -0.2093272    -0.2754691    0.07333273  0.1198944
##           BMI DiabetesPedigreeFunction        Age
## 1 -0.03005876               0.11818436  0.8654905
## 2  0.02170161              -0.08532593 -0.6248609

The data was divided into younger, healthier women (cluster 2) and older women with worse diagnostic measurements (cluster 1).

Checking the dependence of cluster membership on suffering from diabetes

table(diabetes1$Outcome, pam2_diabetes$cluster)
##              
##                 1   2
##   No diabetes 165 335
##   Diabetes    157 111

In cluster 1, 48.76% of the females suffer from diabetes (157 of 322), compared to 24.89% in cluster 2.

PAM with 3 clusters

pam3_diabetes <- eclust(diabetes, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
print(pam3_diabetes)
## Medoids:
##       ID Pregnancies    Glucose BloodPressure SkinThickness    Insulin
## [1,] 712   0.3427574  0.1596825     0.4595278     0.4051814 -0.5015400
## [2,]  33  -0.2507887 -1.0288345    -0.5737538    -0.5978140 -0.2238683
## [3,] 412  -0.8443348 -0.2781921     0.1495433     0.5932430  0.8347551
##             BMI DiabetesPedigreeFunction        Age
## [1,] -0.3034664              -0.09922567  0.5747433
## [2,] -0.9122821              -0.61834778 -0.9558388
## [3,]  0.3053492               0.16938984 -0.7007418
## Clustering vector:
##   [1] 1 2 1 2 3 2 2 2 3 1 1 1 1 3 1 2 3 2 3 3 3 1 1 1 1 1 1 2 1 1 1 3 2 2 1 3 1
##  [38] 1 3 1 1 1 1 1 1 3 2 2 1 2 2 2 2 1 3 2 1 3 1 3 2 1 2 3 1 1 1 1 2 1 3 3 1 3
##  [75] 3 2 1 1 2 2 2 2 1 2 1 3 1 3 1 2 2 3 1 1 3 1 2 2 2 3 1 2 2 2 2 3 2 1 2 3 3
## [112] 3 3 2 1 1 1 2 2 2 3 1 3 1 2 3 3 3 3 1 1 1 3 1 2 3 3 3 1 3 1 1 2 1 3 2 1 3
## [149] 1 2 3 2 1 3 1 1 2 3 2 1 1 1 3 2 1 1 1 1 2 1 1 1 2 3 2 1 1 3 1 1 2 3 2 2 1
## [186] 1 1 3 1 3 2 1 1 1 1 3 2 2 3 3 3 1 3 2 1 1 1 1 3 1 2 3 1 3 1 1 3 1 1 1 3 1
## [223] 2 1 2 3 2 3 3 3 1 3 2 2 2 1 1 3 1 2 2 3 2 3 3 1 1 3 3 2 1 2 2 3 1 3 1 2 3
## [260] 1 1 2 1 1 2 1 2 3 2 2 1 2 1 3 1 3 1 3 1 3 3 1 1 1 1 1 3 3 2 1 3 3 3 3 1 1
## [297] 3 3 1 1 2 3 1 1 1 3 1 2 3 3 1 3 1 2 1 3 2 1 3 1 3 1 1 1 3 3 3 1 3 1 1 2 2
## [334] 1 2 3 2 1 1 1 2 1 2 1 1 1 3 2 2 2 2 1 1 2 2 1 3 1 1 3 3 1 1 1 3 3 2 2 2 1
## [371] 3 2 3 3 3 1 2 3 1 3 3 2 3 3 2 2 1 1 1 3 3 1 3 1 1 3 1 3 2 1 2 1 1 1 1 3 1
## [408] 2 1 3 1 3 3 3 3 3 3 1 2 3 3 2 3 2 1 3 2 3 3 3 2 1 2 2 2 2 1 1 2 1 1 2 3 1
## [445] 2 3 2 3 3 3 2 2 3 1 3 1 1 2 1 1 1 2 1 1 1 2 2 3 2 3 3 3 3 1 2 1 3 1 1 1 3
## [482] 3 2 3 2 3 3 3 2 1 3 1 1 1 2 1 2 2 1 1 2 3 1 1 1 1 1 3 3 1 1 3 1 2 2 1 1 1
## [519] 1 1 2 3 2 1 2 2 2 2 3 2 3 3 3 2 3 2 1 1 3 3 1 3 1 3 3 1 1 3 1 1 2 3 1 2 3
## [556] 1 3 1 1 1 1 3 3 2 2 2 3 1 1 3 2 2 2 2 3 3 2 3 1 1 3 2 1 1 3 2 1 2 1 2 1 3
## [593] 1 3 3 3 1 2 1 2 2 2 1 1 2 3 3 2 3 2 2 3 3 1 1 2 1 2 1 2 3 3 1 3 2 3 2 2 1
## [630] 2 1 3 2 3 2 1 1 3 1 2 3 2 1 2 3 3 1 3 1 2 2 3 1 2 3 3 2 3 1 3 1 3 1 1 1 3
## [667] 1 1 1 1 1 2 1 3 1 1 1 2 2 3 2 3 3 1 1 3 2 2 3 3 1 1 3 1 2 3 1 2 3 3 3 1 1
## [704] 2 2 1 2 3 1 3 3 1 1 3 2 3 3 1 3 1 1 3 1 1 1 1 3 3 1 2 1 2 3 3 1 1 3 1 3 1
## [741] 1 2 2 1 1 1 3 3 3 1 1 3 2 3 1 3 1 1 2 1 2 1 2 1 3 1 1 2
## Objective function:
##    build     swap 
## 2.355666 2.326927 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## [11] "nbclust"
aggregate(diabetes, by=list(cluster=pam3_diabetes$cluster), mean)
##   cluster Pregnancies    Glucose BloodPressure SkinThickness    Insulin
## 1       1   0.7574908  0.2914001    0.39352163    -0.1568282 -0.2763971
## 2       2  -0.4207849 -0.6267914   -0.71871423    -0.6096819 -0.4698218
## 3       3  -0.6262975  0.1499804    0.09513786     0.7155470  0.7526578
##           BMI DiabetesPedigreeFunction        Age
## 1  0.07980411               -0.0172500  0.8462030
## 2 -0.73895926               -0.3772473 -0.6635917
## 3  0.51807226                0.3395151 -0.5369244
#plots
p3a=fviz_cluster(list(data=diabetes, cluster=pam3_diabetes$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic()) 

p3b <- fviz_silhouette(pam3_diabetes)
##   cluster size ave.sil.width
## 1       1  317          0.09
## 2       2  206          0.25
## 3       3  245          0.14
grid.arrange(p3a, p3b, ncol=2)

The first cluster includes 317 observations, the second 206, and the third 245. Cluster 3 was created mainly by splitting one of the clusters from the 2-cluster PAM solution. The per-cluster silhouette widths are comparable to the 2-cluster case, and negative silhouette values are still present.

table(diabetes1$Outcome, pam3_diabetes$cluster)
##              
##                 1   2   3
##   No diabetes 161 179 160
##   Diabetes    156  27  85

In cluster 1, 49.21% of the females suffer from diabetes, compared to 13.11% in cluster 2 and 34.69% in cluster 3.

CLARA clustering

The CLARA algorithm is an extension of PAM. Instead of finding medoids for the entire dataset, it draws small samples, applies PAM to each, and keeps the best resulting set of medoids. It is recommended for large datasets, as it reduces computation time.
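A rough sketch of this idea (clara_sketch is an illustrative helper, not the cluster package's implementation, which is more refined - e.g., it carries the best medoids over into subsequent samples):

library(cluster)
clara_sketch <- function(X, k, samples = 5, sampsize = 50, seed = 1234) {
  set.seed(seed)
  X <- as.matrix(X)
  best <- NULL
  for (s in seq_len(samples)) {
    # run PAM on a small random subsample
    sub <- X[sample(nrow(X), sampsize), , drop = FALSE]
    med <- pam(sub, k)$medoids
    # score the candidate medoids on the FULL dataset
    D <- as.matrix(dist(rbind(med, X)))[-(1:k), 1:k, drop = FALSE]
    cost <- mean(apply(D, 1, min))  # average dissimilarity to the nearest medoid
    if (is.null(best) || cost < best$cost) best <- list(medoids = med, cost = cost)
  }
  best
}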

Deciding on the number of clusters

# optimal number of clusters - elbow
c1=fviz_nbclust(diabetes, FUN = clara, method = "wss") + 
  geom_vline(xintercept = 3, linetype = 2, col="blue") +
  labs(subtitle = "Elbow method - CLARA")

# optimal number of clusters - silhouette
c2=fviz_nbclust(diabetes, FUN = clara, method = "silhouette") +
  labs(subtitle = "Silhouette method - CLARA")


grid.arrange(c1, c2, ncol=2)

From the plots, the suggested number of clusters is 3.

cl3_diabetes <- eclust(diabetes, k=3 , FUNcluster="clara", hc_metric="euclidean", graph=F)
print(cl3_diabetes)
## Call:     fun_clust(x = x, k = k) 
## Medoids:
##      Pregnancies    Glucose BloodPressure SkinThickness      Insulin        BMI
## [1,]  -0.2507887  0.2847896     0.4595278     0.1544326 -0.006937274 -0.4556704
## [2,]  -0.8443348 -1.0288345    -0.3670975     0.2171198 -0.310640714 -0.2654155
## [3,]  -1.1411079  0.9728784     0.6661841     1.1574279  1.667770269  1.2058890
##      DiabetesPedigreeFunction         Age
## [1,]               -0.4493313  0.06454929
## [2,]               -0.1505343 -0.87080644
## [3,]               -0.6092933 -0.53067709
## Objective function:   2.442605
## Clustering vector:    int [1:768] 1 2 1 2 2 1 2 2 3 1 1 1 1 3 1 2 3 1 ...
## Cluster sizes:            389 279 100 
## Best sample:
##  [1]   9  69  78 103 114 130 138 145 149 169 170 245 260 271 278 308 336 382 399
## [20] 406 422 455 467 487 490 491 517 554 556 566 568 575 582 609 611 616 630 646
## [39] 648 651 664 669 687 714 731 751
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"      
## [11] "nbclust"
aggregate(diabetes, by=list(cluster=cl3_diabetes$cluster), mean)
##   cluster Pregnancies    Glucose BloodPressure SkinThickness    Insulin
## 1       1   0.4991545  0.2765081     0.3465043   -0.22523600 -0.2341962
## 2       2  -0.5581988 -0.7375906    -0.6133815   -0.06306286 -0.2589505
## 3       3  -0.3843366  0.9822614     0.3634326    1.05211339  1.6334952
##           BMI DiabetesPedigreeFunction        Age
## 1 -0.04121794              -0.10563439  0.5323365
## 2 -0.29328315               0.02702827 -0.6617305
## 3  0.97859779               0.33550892 -0.2245607
#plot the output
ca=fviz_cluster(list(data=diabetes, cluster=cl3_diabetes$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", ggtheme=theme_classic()) 

cb <- fviz_silhouette(cl3_diabetes)
##   cluster size ave.sil.width
## 1       1  389          0.09
## 2       2  279          0.21
## 3       3  100          0.15
grid.arrange(ca, cb, ncol=2)

The first cluster includes 389 observations, the second 279 and third 100 observations.

Hierarchical clustering

The last method that will be used is hierarchical clustering, which produces a tree-based representation of the objects (a dendrogram). We distinguish two types of hierarchical clustering: agglomerative (AGNES), a bottom-up approach in which the most similar clusters are successively merged until a single cluster remains; and divisive (DIANA), a top-down approach in which clustering begins from the root and the most heterogeneous clusters are successively divided. Unlike partitioning clustering, we do not need to define the number of clusters in advance.

Although there are many linkage methods (complete, single, average, centroid, Ward’s), the complete and Ward’s methods are usually preferred. Here only Ward’s method will be applied for AGNES and DIANA clustering.
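One way to support this choice is the agglomerative coefficient reported by cluster::agnes (values closer to 1 indicate a stronger clustering structure); a quick comparison, shown here as a sketch:

library(cluster)
# agglomerative coefficient for several linkage methods
sapply(c("average", "single", "complete", "ward"),
       function(m) agnes(diabetes, method = m)$ac)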

Agglomerative hierarchical clustering (AGNES)

agnes_diabetes <- eclust(diabetes, k=2, FUNcluster="agnes", hc_metric="euclidean", hc_method = "ward.D2")
fviz_dend(agnes_diabetes, show_labels = FALSE, palette = "jco", as.ggplot = TRUE)

#cluster size
cb <- fviz_silhouette(agnes_diabetes)
##   cluster size ave.sil.width
## 1       1  342          0.04
## 2       2  426          0.25

For the AGNES algorithm, cluster 1 includes 342 observations and cluster 2 includes 426.

#Suffering from diabetes and cluster belonging
table(diabetes1$Outcome, agnes_diabetes$cluster)
##              
##                 1   2
##   No diabetes 167 333
##   Diabetes    175  93
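In cluster 1, 51.17% of the females suffer from diabetes, compared to 21.83% in cluster 2.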

Divisive hierarchical clustering (DIANA)

diana_diabetes <- eclust(diabetes, k=2, FUNcluster="diana", hc_metric="euclidean", hc_method = "ward.D2")
fviz_dend(diana_diabetes, show_labels = FALSE, palette = "jco", as.ggplot = TRUE)

#cluster size
cb <- fviz_silhouette(diana_diabetes)
##   cluster size ave.sil.width
## 1       1  621          0.24
## 2       2  147          0.13

For the DIANA algorithm, cluster 1 includes 621 observations and cluster 2 includes 147.

#Suffering from diabetes and cluster belonging
table(diabetes1$Outcome, diana_diabetes$cluster)
##              
##                 1   2
##   No diabetes 451  49
##   Diabetes    170  98
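In cluster 1, 27.38% of the females suffer from diabetes, compared to 66.67% in cluster 2.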

Choosing the Best Clustering Algorithms

Finally, we will compare the clustering algorithms; for this, the clValid package can be used. When comparing clustering algorithms, two types of validation measures are considered:

internal measures:
- Connectivity - indicates the degree to which observations are placed in the same cluster as their nearest neighbors. It takes values between 0 and infinity and should be minimized.
- Silhouette coefficient - measures how well each observation fits its own cluster relative to the nearest neighboring cluster. Observations with a coefficient close to 1 are very well clustered, values around 0 indicate observations lying between two clusters, and values below 0 indicate observations probably placed in the wrong cluster.
- Dunn index - the ratio of the smallest distance between observations in different clusters to the largest intra-cluster distance; it should be maximized.

stability measures:
- The average proportion of non-overlap (APN) - measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.
- The average distance (AD) - measures the average distance between observations placed in the same cluster under both cases.
- The average distance between means (ADM) - measures the average distance between cluster centers for observations placed in the same cluster under both cases.
- The figure of merit (FOM) - measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns.

APN, ADM, and FOM take values from 0 to 1; smaller values correspond to more consistent clustering results. AD takes values between 0 and infinity, and smaller values are preferred.

Internal measures:

#install.packages("clValid")
#library(clValid)
#clmethods <- c("hierarchical","kmeans","pam")
#intern <- clValid(diabetes, nClust = 2:10, clMethods = clmethods, validation = "internal")
#summary(intern)
Internal measure Score Method Clusters
Connectivity 12.0171 hierarchical 2
Dunn 0.1887 hierarchical 2
Silhouette 0.4437 hierarchical 2

It can be seen that hierarchical clustering with 2 clusters performs the best for each internal measure (connectivity, Dunn, and silhouette). For the K-means and PAM algorithms, different measures give different optimal numbers of clusters (2 or 3), which is consistent with the earlier results.

Stability measures:

#stab <- clValid(diabetes, nClust = 2:10,
#clMethods = clmethods, validation = "stability")

#optimalScores(stab)
Stability measure Score Method Clusters
APN 0.003067485 hierarchical 2
AD 2.817856993 kmeans 10
ADM 0.070250715 hierarchical 3
FOM 0.918725745 pam 10

For the APN and ADM measures, hierarchical clustering with two and three clusters respectively gives the best score. For the AD measure, the k-means clustering with 10 clusters has the best score and according to FOM the PAM algorithm performs the best with ten clusters.

Conclusions

In this report, basic clustering methods were applied to the diabetes dataset. The measures used mostly indicated the existence of 2 or 3 clusters. The analysis of internal measures suggests that hierarchical clustering with two clusters would be the most appropriate for this dataset. The stability measures give inconsistent results; only APN pointed to hierarchical clustering with 2 clusters.

References

https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/book-preview/clustering_en_preview.pdf
https://www.kaggle.com/mathchi/diabetes-data-set