mydata <- read.csv("./HW4.csv",
header = TRUE,
sep = ';',
dec = ',')
colnames(mydata) <- c('Gender','Age', 'Residence_Type','Glucose_Level','BMI')
head(mydata)
## Gender Age Residence_Type Glucose_Level BMI
## 1 0 67 1 229 37
## 2 1 80 0 106 33
## 3 1 49 1 171 34
## 4 1 79 0 174 24
## 5 0 81 1 186 29
## 6 0 74 0 70 27
Gender: Gender of the patient/resident of Framingham (0 - male patient, 1 - female patient)
Age: Age of the patient/resident of Framingham (in years)
Residence_Type: Place where the patient/resident of Framingham lives (0 - rural area, 1 - urban area)
Glucose_Level: Glucose level (in mg/dL) of the patient/resident of Framingham
BMI: Body Mass Index (in kg/m2) of the patient/resident of Framingham
The sample has 100 observations. The unit of observation is a patient/resident of the town of Framingham, Massachusetts. The dataset was found on the Kaggle website (link: https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea).
Research question: Can the 100 patients/residents of Framingham be classified into meaningful groups based on 3 standardized variables (Age, Glucose Level, BMI), and how do the resulting groups differ?
mydata$BMI_z <- scale(mydata$BMI)
mydata$Glucose_z <- scale(mydata$Glucose_Level)
mydata$Age_z <- scale(mydata$Age)
mydata$GenderFactor <- factor(mydata$Gender,
levels = c(0,1),
labels = c('M','F'))
mydata$ResidenceTypeFactor <- factor(mydata$Residence_Type,
levels = c(0,1),
labels = c('Rural','Urban'))
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[,c("BMI_z", "Glucose_z", "Age_z")]),
type = "pearson")
## BMI_z Glucose_z Age_z
## BMI_z 1.00 0.21 0.28
## Glucose_z 0.21 1.00 0.23
## Age_z 0.28 0.23 1.00
##
## n= 100
##
##
## P
## BMI_z Glucose_z Age_z
## BMI_z 0.0369 0.0043
## Glucose_z 0.0369 0.0237
## Age_z 0.0043 0.0237
Now we will calculate dissimilarity, defined here as the Euclidean distance of each observation from the overall mean in the space of the three standardized variables. Below we look for outliers; with the head function we list the 10 units with the highest dissimilarity.
mydata$Dissimilarity <- sqrt(mydata$BMI_z^2 + mydata$Glucose_z^2 + mydata$Age_z^2)
head(mydata[order(-mydata$Dissimilarity),], 10)
## Gender Age Residence_Type Glucose_Level BMI BMI_z Glucose_z Age_z
## 18 0 70 1 252 27 -0.2936342 3.1559564 1.1696449
## 31 0 58 0 223 42 1.7267847 2.5275388 0.6515298
## 1 0 67 1 229 37 1.0533117 2.6575562 1.0401161
## 78 0 71 1 216 39 1.3227009 2.3758518 1.2128212
## 56 1 59 0 237 28 -0.1589396 2.8309128 0.6947060
## 66 0 75 0 221 33 0.5145334 2.4841997 1.3855262
## 8 1 88 0 88 16 -1.7752748 -0.3978533 1.9468176
## 75 0 50 1 192 43 1.8614793 1.8557821 0.3061197
## 28 0 60 0 213 36 0.9186171 2.3108431 0.7378823
## 48 1 3 0 81 16 -1.7752748 -0.5495403 -1.7231646
## GenderFactor ResidenceTypeFactor Dissimilarity
## 18 M Urban 3.378513
## 31 M Rural 3.129653
## 1 M Urban 3.042024
## 78 M Urban 2.977439
## 56 F Rural 2.919237
## 66 M Rural 2.890619
## 8 F Rural 2.664580
## 75 M Urban 2.646269
## 28 M Rural 2.593901
## 48 F Rural 2.534342
hist(mydata$Dissimilarity,
breaks = 20,
xlab = "Dissimilarity",
main = "Histogram of dissimilarities")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# EUCLIDEAN DISTANCE
distance <- get_dist(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
method = "euclidean")
distance2 <- distance^2
# DISSIMILARITY MATRIX
fviz_dist(distance2)
# HOPKINS STATISTIC
get_clust_tendency(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
n = nrow(mydata)- 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6520647
##
## $plot
## NULL
The calculated Hopkins statistic of 0.65 indicates that the data are suitable for clustering, since the value is higher than 0.5.
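Note that get_clust_tendency() draws random reference points, so the statistic varies slightly between runs. A minimal sketch for making it reproducible (the seed value 1 is an arbitrary choice):
set.seed(1)
get_clust_tendency(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
                   n = nrow(mydata) - 1,
                   graph = FALSE)$hopkins_stat   # reproducible Hopkins statistic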
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(factoextra)
WARD <- mydata[,c("BMI_z","Glucose_z","Age_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 100
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
set.seed(1)
library(factoextra)
library(NbClust)
OptNumber <- mydata[,c("BMI_z", "Glucose_z", "Age_z")] %>%
NbClust(distance = "euclidean",
min.nc = 2,
max.nc = 10,
method = "ward.D2",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 5 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Based on the indices above, we will use k = 3 (three clusters), since the majority rule points to three clusters.
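As an optional check, the votes of the individual indices can be visualized; a minimal sketch, assuming fviz_nbclust() accepts an NbClust object as in current versions of factoextra:
# Barplot of how many indices vote for each number of clusters
fviz_nbclust(OptNumber)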
k <- 3
# ASSIGNING EACH UNIT TO ONE CLUSTER
mydata$ClusterWard <- cutree(WARD,
k = 3)
head(mydata)
## Gender Age Residence_Type Glucose_Level BMI BMI_z Glucose_z
## 1 0 67 1 229 37 1.05331174 2.657556238
## 2 1 80 0 106 33 0.51453336 -0.007801046
## 3 1 49 1 171 34 0.64922795 1.400721096
## 4 1 79 0 174 24 -0.69771801 1.465729810
## 5 0 81 1 186 29 -0.02424503 1.725764667
## 6 0 74 0 70 27 -0.29363422 -0.787905616
## Age_z GenderFactor ResidenceTypeFactor Dissimilarity ClusterWard
## 1 1.0401161 M Urban 3.042024 1
## 2 1.6014075 F Rural 1.682056 2
## 3 0.2629434 F Urban 1.566096 1
## 4 1.5582312 F Rural 2.250169 1
## 5 1.6445838 M Urban 2.384011 1
## 6 1.3423499 M Rural 1.583957 2
# Showing the positions of the initial leaders, used as the starting point for k-means clustering
Initial_Leaders <- aggregate(mydata[,c("BMI_z","Glucose_z","Age_z")],
by = list(mydata$ClusterWard),
FUN = mean)
Initial_Leaders
## Group.1 BMI_z Glucose_z Age_z
## 1 1 0.3243763 1.9615806 0.7251834
## 2 2 0.5241544 -0.4675055 0.5631212
## 3 3 -0.6714361 -0.3344302 -0.8775417
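For reference, a minimal sketch of how these leaders could be supplied to kmeans() directly as starting centers; hkmeans() used below performs an equivalent hierarchical initialization internally, so this step is not required:
# k-means with the Ward cluster means as explicit starting centers
kmeans(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
       centers = Initial_Leaders[,c("BMI_z", "Glucose_z", "Age_z")])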
# K-MEANS CLUSTERING
K_MEANS <- hkmeans(mydata[,c("BMI_z", "Glucose_z", "Age_z")],
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 3 clusters of sizes 16, 50, 34
##
## Cluster means:
## BMI_z Glucose_z Age_z
## 1 0.3714203 2.0521626 0.6866105
## 2 0.4256349 -0.4199563 0.4814153
## 3 -0.8007198 -0.3481408 -1.0310745
##
## Clustering vector:
## [1] 1 2 1 1 1 2 2 2 3 3 2 3 2 2 3 3 2 1 2 2 2 2 3 3 2 3 2 1 2 2 1 3 3 2 2 3 2
## [38] 2 3 2 3 2 2 2 2 2 2 3 3 1 2 2 3 2 1 1 2 2 2 1 2 2 1 2 3 1 3 2 2 3 2 2 3 3
## [75] 1 3 3 1 2 3 3 2 3 1 3 2 3 2 3 2 2 3 3 3 2 3 3 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 22.28590 79.79287 26.25871
## (between_SS / total_SS = 56.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
From the table above, we can see that cluster 1 has 16 units, cluster 2 has 50 units, and cluster 3 has 34 units. The ratio of the between-cluster sum of squares to the total sum of squares is 56.8%, which is above 50%, meaning the clustering explains a reasonable share of the variability and can be analysed further.
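The reported ratio and the cluster sizes can be recomputed directly from the components listed above:
# Share of total variability explained by the clustering, and the cluster sizes
K_MEANS$betweenss / K_MEANS$totss
K_MEANS$size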
fviz_cluster(K_MEANS,
palette = "Dark2",
repel = FALSE,
ggtheme = theme_classic())
mydata$ClusterK_Means <- K_MEANS$cluster
table(mydata$ClusterWard, mydata$ClusterK_Means)
##
## 1 2 3
## 1 16 1 0
## 2 0 42 0
## 3 0 7 34
From this table, we can see that 16 of the 17 units in Ward cluster 1 stayed in k-means cluster 1, while 1 unit was reclassified to cluster 2. All 42 units of Ward cluster 2 remained in cluster 2. Of the 41 units in Ward cluster 3, 34 remained in cluster 3 and 7 were reclassified to cluster 2.
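Since the cluster labels happen to correspond between the two solutions here, the share of units that kept their label can be checked with:
# Proportion of units with the same label in the Ward and k-means solutions
mean(mydata$ClusterWard == mydata$ClusterK_Means)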
Centroids <- K_MEANS$centers
Centroids
## BMI_z Glucose_z Age_z
## 1 0.3714203 2.0521626 0.6866105
## 2 0.4256349 -0.4199563 0.4814153
## 3 -0.8007198 -0.3481408 -1.0310745
library(tidyr)
figure <- as.data.frame(Centroids)
figure$ID <- 1:nrow(figure)
figure <- pivot_longer(figure, cols = c(BMI_z, Glucose_z, Age_z))
figure$groups <- factor(figure$ID,
levels = c(1,2,3),
labels = c("Cluster 1", "Cluster 2", "Cluster 3" ))
figure$nameFactor <- as.factor(figure$name)
ggplot(figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape= groups, col = groups), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")
From the chart, we can conclude that patients in Cluster 1 have above-average values of Age, BMI, and Glucose Level. Patients in Cluster 2 have above-average values of Age and BMI, but a below-average Glucose Level. Patients in Cluster 3 have below-average values of Age, BMI, and Glucose Level.
# H0: µ(var, cluster1) = µ(var, cluster2) = µ(var, cluster3)
# H1: At least one µi is different.
fit <- aov(cbind(BMI_z, Glucose_z, Age_z) ~ as.factor(ClusterK_Means),
data= mydata)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 33.065 16.5323 24.321 2.747e-09 ***
## Residuals 97 65.935 0.6797
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 80.321 40.160 208.55 < 2.2e-16 ***
## Residuals 97 18.679 0.193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 55.277 27.6384 61.316 < 2.2e-16 ***
## Residuals 97 43.723 0.4508
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the conducted ANOVA tests, we can conclude that H0 is rejected for all clustering variables, since all p-values < 0.001: for each variable, the mean differs in at least one cluster. We can also see that the differences between the clusters are largest for the variable Glucose_z (highest F-statistic) and smallest for the variable BMI_z (lowest F-statistic).
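One way to pull out the three F-statistics for a direct comparison; a sketch that refits one one-way ANOVA per variable rather than reusing the multivariate fit above:
# F-statistic of the one-way ANOVA for each clustering variable
sapply(c("BMI_z", "Glucose_z", "Age_z"), function(v) {
  summary(aov(mydata[[v]] ~ as.factor(mydata$ClusterK_Means)))[[1]][["F value"]][1]
})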
# H0: There is no association between cluster membership and the categorical variable.
# H1: There is an association between cluster membership and the categorical variable.
chisq_results <- chisq.test(mydata$GenderFactor, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$GenderFactor and as.factor(mydata$ClusterK_Means)
## X-squared = 4.3175, df = 2, p-value = 0.1155
addmargins(chisq_results$observed)
##
## mydata$GenderFactor 1 2 3 Sum
## M 12 25 15 52
## F 4 25 19 48
## Sum 16 50 34 100
round(chisq_results$expected, 2)
##
## mydata$GenderFactor 1 2 3
## M 8.32 26 17.68
## F 7.68 24 16.32
round(chisq_results$residuals, 2)
##
## mydata$GenderFactor 1 2 3
## M 1.28 -0.20 -0.64
## F -1.33 0.20 0.66
From the results of the Chi-squared test, we cannot reject H0 since the p-value > 0.05, so we cannot claim that there is an association between cluster membership and gender. In cluster 1 there appear to be more males and fewer females than expected, while in clusters 2 and 3 there appear to be more females and fewer males than expected, but none of these deviations is statistically different from 0 (all |residuals| < 1.96). Since the differences are not statistically significant, we cannot use this variable to validate our clustering; for validation, we would need other variables with statistically significant differences.
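The residual check used above can also be done programmatically; a minimal sketch flagging cells whose Pearson residuals exceed the 1.96 threshold in absolute value:
# TRUE marks cells that deviate significantly from the expected counts
abs(chisq_results$residuals) > 1.96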
# H0: There is no association between cluster membership and the categorical variable.
# H1: There is an association between cluster membership and the categorical variable.
chisq_results <- chisq.test(mydata$ResidenceTypeFactor, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$ResidenceTypeFactor and as.factor(mydata$ClusterK_Means)
## X-squared = 5.5656, df = 2, p-value = 0.06186
addmargins(chisq_results$observed)
##
## mydata$ResidenceTypeFactor 1 2 3 Sum
## Rural 5 32 17 54
## Urban 11 18 17 46
## Sum 16 50 34 100
round(chisq_results$expected, 2)
##
## mydata$ResidenceTypeFactor 1 2 3
## Rural 8.64 27 18.36
## Urban 7.36 23 15.64
round(chisq_results$residuals, 2)
##
## mydata$ResidenceTypeFactor 1 2 3
## Rural -1.24 0.96 -0.32
## Urban 1.34 -1.04 0.34
From the results of the Chi-squared test, we cannot reject H0 since the p-value > 0.05, so we cannot claim that there is an association between cluster membership and residence type. In cluster 1 there appear to be more urban and fewer rural patients than expected, in cluster 2 more rural and fewer urban patients than expected, and in cluster 3 more urban and fewer rural patients than expected, but none of these deviations is statistically different from 0 (all |residuals| < 1.96). Since the differences are not statistically significant, we cannot use this variable to validate our clustering either; for validation, we would need other variables with statistically significant differences.
Classification of 100 patients was based on 3 standardized variables (Age, Glucose, BMI).
For hierarchical clustering, Ward's algorithm was used, and based on the analysis of the dendrogram and the indices measuring the increase in heterogeneity, it was decided to classify patients into three clusters (groups). The classification was further optimized using k-means clustering. Hierarchical k-means clustering revealed 3 clusters of 16, 50, and 34 units.
Group 2 contains the most patients (50) and is characterized by having the lowest average glucose level among the groups. It has above-average values for the two other clustering variables (Age and BMI), and it has the highest average BMI among the groups.
The second biggest group is Group 3. It contains 34 patients and is characterized by below-average values of all clustering variables. It has the lowest average age and the lowest average BMI among the groups.
Group 1 contains the fewest patients (16) and is characterized by above-average values of all clustering variables. This group has the highest average age and the highest average glucose level among the groups.