mydata <- read.csv("./HW4.csv",
header = TRUE,
sep = ';',
dec = ',')
colnames(mydata) <- c('Gender','Age', 'Residence_Type','Glucose_Level','BMI')
head(mydata)
## Gender Age Residence_Type Glucose_Level BMI
## 1 0 67 1 229 37
## 2 1 80 0 106 33
## 3 1 49 1 171 34
## 4 1 79 0 174 24
## 5 0 81 1 186 29
## 6 0 74 0 70 27
Gender: Gender of the patient/resident of Framingham (0 - male patient, 1 - female patient)
Age: Age of the patient/resident of Framingham (in years)
Residence_Type: Place where the patient/resident of Framingham lives (0 - rural area, 1 - urban area)
Glucose_Level: Glucose level (in mg/dL) of the patient/resident of Framingham
BMI: Body Mass Index (in kg/m2) of the patient/resident of Framingham
The sample has 100 observations. The unit of observation is a patient/resident of the town of Framingham, Massachusetts. The dataset was found on the Kaggle website (link: https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea).
Research question: Can the 100 patients/residents of Framingham be classified into meaningful groups based on 3 standardized variables (Age, Glucose Level, BMI), and how do the resulting groups differ?
mydata$BMI_z <- scale(mydata$BMI)
mydata$Glucose_z <- scale(mydata$Glucose_Level)
mydata$Age_z <- scale(mydata$Age)
mydata$GenderFactor <- factor(mydata$Gender,
levels = c(0,1),
labels = c('M','F'))
mydata$ResidenceTypeFactor <- factor(mydata$Residence_Type,
levels = c(0,1),
labels = c('Rural','Urban'))
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[,c("BMI_z", "Glucose_z", "Age_z")]),
type = "pearson")
## BMI_z Glucose_z Age_z
## BMI_z 1.00 0.21 0.28
## Glucose_z 0.21 1.00 0.23
## Age_z 0.28 0.23 1.00
##
## n= 100
##
##
## P
## BMI_z Glucose_z Age_z
## BMI_z 0.0369 0.0043
## Glucose_z 0.0369 0.0237
## Age_z 0.0043 0.0237
Now we will calculate dissimilarity, defined here as the Euclidean distance of each observation from the overall mean in the space of the three standardized variables. Below we look for outliers; with the head function we list the 10 units with the highest dissimilarity.
mydata$Dissimilarity <- sqrt(mydata$BMI_z^2 + mydata$Glucose_z^2 + mydata$Age_z^2)
head(mydata[order(-mydata$Dissimilarity),], 10)
## Gender Age Residence_Type Glucose_Level BMI BMI_z Glucose_z Age_z
## 18 0 70 1 252 27 -0.2936342 3.1559564 1.1696449
## 31 0 58 0 223 42 1.7267847 2.5275388 0.6515298
## 1 0 67 1 229 37 1.0533117 2.6575562 1.0401161
## 78 0 71 1 216 39 1.3227009 2.3758518 1.2128212
## 56 1 59 0 237 28 -0.1589396 2.8309128 0.6947060
## 66 0 75 0 221 33 0.5145334 2.4841997 1.3855262
## 8 1 88 0 88 16 -1.7752748 -0.3978533 1.9468176
## 75 0 50 1 192 43 1.8614793 1.8557821 0.3061197
## 28 0 60 0 213 36 0.9186171 2.3108431 0.7378823
## 48 1 3 0 81 16 -1.7752748 -0.5495403 -1.7231646
## GenderFactor ResidenceTypeFactor Dissimilarity
## 18 M Urban 3.378513
## 31 M Rural 3.129653
## 1 M Urban 3.042024
## 78 M Urban 2.977439
## 56 F Rural 2.919237
## 66 M Rural 2.890619
## 8 F Rural 2.664580
## 75 M Urban 2.646269
## 28 M Rural 2.593901
## 48 F Rural 2.534342
hist(mydata$Dissimilarity,
breaks = 20,
xlab = "Dissimilarity",
main = "Histogram of dissimilarities")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# EUCLIDEAN DISTANCE
distance <- get_dist(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
method = "euclidean")
distance2 <- distance^2
# DISSIMILARITY MATRIX
fviz_dist(distance2)
# HOPKINS STATISTIC
get_clust_tendency(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
n = nrow(mydata)- 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6520647
##
## $plot
## NULL
The calculated Hopkins statistic of 0.65 indicates that the data are suitable for clustering, since the value is higher than 0.5.
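Note that get_clust_tendency() draws random reference points, so the statistic varies slightly between runs. A minimal sketch for making it reproducible (the seed value 1 is an arbitrary choice):
set.seed(1)
get_clust_tendency(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
                   n = nrow(mydata) - 1,
                   graph = FALSE)$hopkins_stat   # reproducible Hopkins statistic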
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(factoextra)
WARD <- mydata[,c("BMI_z","Glucose_z","Age_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 100
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
set.seed(1)
library(factoextra)
library(NbClust)
OptNumber <- mydata[,c("BMI_z", "Glucose_z", "Age_z")] %>%
NbClust(distance = "euclidean",
min.nc = 2,
max.nc = 10,
method = "ward.D2",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 5 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Based on the indices above, we will use k = 3 (three clusters), since the majority rule points to three clusters.
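As an optional check, the votes of the individual indices can be visualized; a minimal sketch, assuming fviz_nbclust() accepts an NbClust object as in current versions of factoextra:
# Barplot of how many indices vote for each number of clusters
fviz_nbclust(OptNumber)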
k <- 3
# ASSIGNING EACH UNIT TO ONE CLUSTER
mydata$ClusterWard <- cutree(WARD,
k = 3)
head(mydata)
## Gender Age Residence_Type Glucose_Level BMI BMI_z Glucose_z
## 1 0 67 1 229 37 1.05331174 2.657556238
## 2 1 80 0 106 33 0.51453336 -0.007801046
## 3 1 49 1 171 34 0.64922795 1.400721096
## 4 1 79 0 174 24 -0.69771801 1.465729810
## 5 0 81 1 186 29 -0.02424503 1.725764667
## 6 0 74 0 70 27 -0.29363422 -0.787905616
## Age_z GenderFactor ResidenceTypeFactor Dissimilarity ClusterWard
## 1 1.0401161 M Urban 3.042024 1
## 2 1.6014075 F Rural 1.682056 2
## 3 0.2629434 F Urban 1.566096 1
## 4 1.5582312 F Rural 2.250169 1
## 5 1.6445838 M Urban 2.384011 1
## 6 1.3423499 M Rural 1.583957 2
# Showing the positions of the initial leaders, used as the starting point for k-means clustering
Initial_Leaders <- aggregate(mydata[,c("BMI_z","Glucose_z","Age_z")],
by = list(mydata$ClusterWard),
FUN = mean)
Initial_Leaders
## Group.1 BMI_z Glucose_z Age_z
## 1 1 0.3243763 1.9615806 0.7251834
## 2 2 0.5241544 -0.4675055 0.5631212
## 3 3 -0.6714361 -0.3344302 -0.8775417
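For reference, a minimal sketch of how these leaders could be supplied to kmeans() directly as starting centers; hkmeans() used below performs an equivalent hierarchical initialization internally, so this step is not required:
# k-means with the Ward cluster means as explicit starting centers
kmeans(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
       centers = Initial_Leaders[,c("BMI_z", "Glucose_z", "Age_z")])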
# K-MEANS CLUSTERING
K_MEANS <- hkmeans(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 3 clusters of sizes 16, 50, 34
##
## Cluster means:
## BMI_z Glucose_z Age_z
## 1 0.3714203 2.0521626 0.6866105
## 2 0.4256349 -0.4199563 0.4814153
## 3 -0.8007198 -0.3481408 -1.0310745
##
## Clustering vector:
## [1] 1 2 1 1 1 2 2 2 3 3 2 3 2 2 3 3 2 1 2 2 2 2 3 3 2 3 2 1 2 2 1 3 3 2 2 3 2
## [38] 2 3 2 3 2 2 2 2 2 2 3 3 1 2 2 3 2 1 1 2 2 2 1 2 2 1 2 3 1 3 2 2 3 2 2 3 3
## [75] 1 3 3 1 2 3 3 2 3 1 3 2 3 2 3 2 2 3 3 3 2 3 3 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 22.28590 79.79287 26.25871
## (between_SS / total_SS = 56.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
From the table above, we can see that cluster 1 has 16 units, cluster 2 has 50 units, and cluster 3 has 34 units. The ratio of the between-cluster sum of squares to the total sum of squares is 56.8%, which is above 50%, meaning the clustering explains a reasonable share of the variability and can be analysed further.
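The reported ratio and the cluster sizes can be recomputed directly from the components listed above:
# Share of total variability explained by the clustering, and the cluster sizes
K_MEANS$betweenss / K_MEANS$totss
K_MEANS$size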
fviz_cluster(K_MEANS,
palette = "Dark2",
repel = FALSE,
ggtheme = theme_classic())
mydata$ClusterK_Means <- K_MEANS$cluster
table(mydata$ClusterWard, mydata$ClusterK_Means)
##
## 1 2 3
## 1 16 1 0
## 2 0 42 0
## 3 0 7 34
From this table, we can see that 16 of the 17 units in Ward cluster 1 stayed in k-means cluster 1, while 1 unit was reclassified to cluster 2. All 42 units of Ward cluster 2 remained in cluster 2. Of the 41 units in Ward cluster 3, 34 remained in cluster 3 and 7 were reclassified to cluster 2.
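Since the cluster labels happen to correspond between the two solutions here, the share of units that kept their label can be checked with:
# Proportion of units with the same label in the Ward and k-means solutions
mean(mydata$ClusterWard == mydata$ClusterK_Means)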
Centroids <- K_MEANS$centers
Centroids
## BMI_z Glucose_z Age_z
## 1 0.3714203 2.0521626 0.6866105
## 2 0.4256349 -0.4199563 0.4814153
## 3 -0.8007198 -0.3481408 -1.0310745
library(tidyr)
figure <- as.data.frame(Centroids)
figure$ID <- 1:nrow(figure)
figure <- pivot_longer(figure, cols = c(BMI_z, Glucose_z, Age_z))
figure$groups <- factor(figure$ID,
levels = c(1,2,3),
labels = c("Cluster 1", "Cluster 2", "Cluster 3" ))
figure$nameFactor <- as.factor(figure$name)
ggplot(figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape= groups, col = groups), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")
From the chart, we can conclude that patients in Cluster 1 have above-average values of Age, BMI, and Glucose Level. Patients in Cluster 2 have above-average values of Age and BMI, but a below-average Glucose Level. Patients in Cluster 3 have below-average values of Age, BMI, and Glucose Level.
# H0: µ(var, cluster1) = µ(var, cluster2) = µ(var, cluster3)
# H1: At least one µi is different.
fit <- aov(cbind(BMI_z, Glucose_z, Age_z) ~ as.factor(ClusterK_Means),
data= mydata)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 33.065 16.5323 24.321 2.747e-09 ***
## Residuals 97 65.935 0.6797
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 80.321 40.160 208.55 < 2.2e-16 ***
## Residuals 97 18.679 0.193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 55.277 27.6384 61.316 < 2.2e-16 ***
## Residuals 97 43.723 0.4508
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the conducted ANOVA tests, we can conclude that H0 is rejected for all clustering variables, since all p-values < 0.001: for each variable, the mean differs in at least one cluster. We can also see that the differences between the clusters are largest for the variable Glucose_z (highest F-statistic) and smallest for the variable BMI_z (lowest F-statistic).
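One way to pull out the three F-statistics for a direct comparison; a sketch that refits one one-way ANOVA per variable rather than reusing the multivariate fit above:
# F-statistic of the one-way ANOVA for each clustering variable
sapply(c("BMI_z", "Glucose_z", "Age_z"), function(v) {
  summary(aov(mydata[[v]] ~ as.factor(mydata$ClusterK_Means)))[[1]][["F value"]][1]
})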
# H0: There is no association between cluster membership and the categorical variable.
# H1: There is an association between cluster membership and the categorical variable.
chisq_results <- chisq.test(mydata$GenderFactor, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$GenderFactor and as.factor(mydata$ClusterK_Means)
## X-squared = 4.3175, df = 2, p-value = 0.1155
addmargins(chisq_results$observed)
##
## mydata$GenderFactor 1 2 3 Sum
## M 12 25 15 52
## F 4 25 19 48
## Sum 16 50 34 100
round(chisq_results$expected, 2)
##
## mydata$GenderFactor 1 2 3
## M 8.32 26 17.68
## F 7.68 24 16.32
round(chisq_results$residuals, 2)
##
## mydata$GenderFactor 1 2 3
## M 1.28 -0.20 -0.64
## F -1.33 0.20 0.66
From the results of the Chi-squared test, we cannot reject H0 since the p-value > 0.05, so we cannot claim that there is an association between cluster membership and gender. In cluster 1 there appear to be more males and fewer females than expected, while in clusters 2 and 3 there appear to be more females and fewer males than expected, but none of these deviations is statistically different from 0 (all |residuals| < 1.96). Since the differences are not statistically significant, we cannot use this variable to validate our clustering; for validation, we would need other variables with statistically significant differences.
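The residual check used above can also be done programmatically; a minimal sketch flagging cells whose Pearson residuals exceed the 1.96 threshold in absolute value:
# TRUE marks cells that deviate significantly from the expected counts
abs(chisq_results$residuals) > 1.96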
# H0: There is no association between cluster membership and the categorical variable.
# H1: There is an association between cluster membership and the categorical variable.
chisq_results <- chisq.test(mydata$ResidenceTypeFactor, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$ResidenceTypeFactor and as.factor(mydata$ClusterK_Means)
## X-squared = 5.5656, df = 2, p-value = 0.06186
addmargins(chisq_results$observed)
##
## mydata$ResidenceTypeFactor 1 2 3 Sum
## Rural 5 32 17 54
## Urban 11 18 17 46
## Sum 16 50 34 100
round(chisq_results$expected, 2)
##
## mydata$ResidenceTypeFactor 1 2 3
## Rural 8.64 27 18.36
## Urban 7.36 23 15.64
round(chisq_results$residuals, 2)
##
## mydata$ResidenceTypeFactor 1 2 3
## Rural -1.24 0.96 -0.32
## Urban 1.34 -1.04 0.34
From the results of the Chi-squared test, we cannot reject H0 since the p-value > 0.05, so we cannot claim that there is an association between cluster membership and residence type. In cluster 1 there appear to be more urban and fewer rural patients than expected, in cluster 2 more rural and fewer urban patients than expected, and in cluster 3 more urban and fewer rural patients than expected, but none of these deviations is statistically different from 0 (all |residuals| < 1.96). Since the differences are not statistically significant, we cannot use this variable to validate our clustering either; for validation, we would need other variables with statistically significant differences.
Classification of 100 patients was based on 3 standardized variables (Age, Glucose, BMI).
For hierarchical clustering, Ward's algorithm was used, and based on the analysis of the dendrogram and the indices measuring the increase in heterogeneity, it was decided to classify patients into three clusters (groups). The classification was further optimized using k-means clustering. Hierarchical k-means clustering revealed 3 clusters of 16, 50, and 34 units.
Group 2 contains the most patients (50) and is characterized by having the lowest average glucose level among the groups. It has above-average values for the two other clustering variables (Age and BMI), and it has the highest average BMI among the groups.
The second biggest group is Group 3. It contains 34 patients and is characterized by below-average values of all clustering variables. It has the lowest average age and the lowest average BMI among the groups.
Group 1 contains the fewest patients (16) and is characterized by above-average values of all clustering variables. This group has the highest average age and the highest average glucose level among the groups.