The dataset used for this notebook contains information on a series
of 94 patients admitted in an argentinian high complexity hospital, in
the city of Córdoba, during a 10-year period with diagnosis of
intracerebral hematoma. The goal of this work is to attempt to perform
cluster analysis exploration based on multiple elemental variables of
patient mortality and survival after admission.
[Dataset:]https://www.kaggle.com/datasets/ffalco/cerebral-hemorrhage-dataset
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readxl)
library(foreign)
library(corrplot)
## corrplot 0.95 loaded
library(cluster)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(FactoMineR)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ psych::alpha() masks ggplot2::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dbscan)
##
## Attaching package: 'dbscan'
##
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(e1071)
file_path <- "C:/Users/admin/Desktop/unsupervied-learning/project/cluster/Dataset HIP NHSR.xlsx"
HIP <- read_excel(file_path, sheet = "Hoja1")
head(HIP)
## # A tibble: 6 × 32
## Edad Varón `Volumen (ml)` `Óbito en internación` `Vuelco ventricular` Cirugía
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 62 1 19 0 1 0
## 2 64 1 1 0 0 0
## 3 32 1 10 1 1 1
## 4 66 0 14 0 1 0
## 5 44 0 30 0 0 0
## 6 23 0 6 0 0 0
## # ℹ 26 more variables: Embolización <dbl>, `GCS al ingreso` <dbl>, HTA <dbl>,
## # `TAS al ingreso (mmHg)` <dbl>, `TAD al ingreso (mmHg)` <dbl>,
## # Izquierdo <dbl>, Infratentorial <dbl>, Ganglionar <dbl>, Capsular <dbl>,
## # Lobar <dbl>, DBT <dbl>, `Glucemia al ingreso (mg/dl)` <dbl>,
## # Obesidad <dbl>, Tabaco <dbl>, Alcohol <dbl>, ACO <dbl>,
## # `APP ACV hemorrágico` <dbl>,
## # `Cardiopatía (ICC- Arritmia - Valvulopatía)` <dbl>, …
# After loading the HIP set, we can look at it’s shape
dim(HIP)
## [1] 94 32
The HIP consists of 89 observations and 31 variables. Overall, this gives us 2759 data points. This dataset mainly contains binary variables and continuous variables. Continuous variables include blood pressure, blood sugar, bleeding volume, and GSC coma score. Next, the data of continuous variables will be standardized to facilitate subsequent dimensionality reduction and cluster analysis.
# Inspecting the dataset
str(HIP)
## tibble [94 × 32] (S3: tbl_df/tbl/data.frame)
## $ Edad : num [1:94] 62 64 32 66 44 23 58 56 50 59 ...
## $ Varón : num [1:94] 1 1 1 0 0 0 0 1 1 0 ...
## $ Volumen (ml) : num [1:94] 19 1 10 14 30 6 20 15 14 15 ...
## $ Óbito en internación : num [1:94] 0 0 1 0 0 0 1 0 1 0 ...
## $ Vuelco ventricular : num [1:94] 1 0 1 1 0 0 1 0 0 0 ...
## $ Cirugía : num [1:94] 0 0 1 0 0 0 0 0 0 0 ...
## $ Embolización : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ GCS al ingreso : num [1:94] 12 11 5 11 11 14 4 12 12 14 ...
## $ HTA : num [1:94] 1 1 0 1 1 0 1 1 1 1 ...
## $ TAS al ingreso (mmHg) : num [1:94] 180 250 200 190 200 120 220 220 200 230 ...
## $ TAD al ingreso (mmHg) : num [1:94] 100 120 100 110 90 80 100 110 110 110 ...
## $ Izquierdo : num [1:94] 0 1 1 1 0 1 0 1 1 1 ...
## $ Infratentorial : num [1:94] 0 1 0 0 0 0 0 0 0 0 ...
## $ Ganglionar : num [1:94] 1 0 1 1 1 0 1 1 1 1 ...
## $ Capsular : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ Lobar : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ DBT : num [1:94] 1 1 0 1 0 0 0 1 0 0 ...
## $ Glucemia al ingreso (mg/dl) : num [1:94] 180 280 110 200 100 80 130 300 110 130 ...
## $ Obesidad : num [1:94] 1 0 0 0 1 0 0 1 0 0 ...
## $ Tabaco : num [1:94] 0 0 1 1 0 0 1 1 1 0 ...
## $ Alcohol : num [1:94] 1 0 1 0 0 0 0 0 1 0 ...
## $ ACO : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ APP ACV hemorrágico : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ Cardiopatía (ICC- Arritmia - Valvulopatía): num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ Aneurismas - MAVs - FAVs - Cavernomas : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ HTA periparto : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ APP ACV isquémico : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ Neoplasia sólida : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ Cirrosis : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ Insuficiencia renal : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
## $ Cocaína - Metanfetaminas : num [1:94] 0 0 1 0 0 0 0 0 0 0 ...
## $ Coagulopatía : num [1:94] 0 0 0 0 0 0 0 0 0 0 ...
# Check for missing values
any(is.na(HIP))
## [1] FALSE
# Check for duplicate values
any(duplicated(HIP))
## [1] FALSE
As ‘Surgery’ seems to be an ambiguous category, including ‘External ventricular drainage’ (EVD) and ‘Hematoma evacuation’, I’ll divide both groups according to a series of parameters.
# Divide surgery into specific groups
#'EVD' is mean Ventricular drainage
HIP$EVD <- ifelse(HIP$Cirugía == 1 &
HIP$`Volumen (ml)` < 30.0 &
HIP$`Vuelco ventricular` == 1, 1, 0)
# 'Hematoma evacuation'
HIP$`Hematoma evacuation` <- ifelse(HIP$Cirugía == 1 &
HIP$EVD == 0, 1, 0)
There are actually duplications between the column of hypertension history and the columns of actual values of high and low blood pressure, so we can perform dimensionality reduction on this group of data and delete the actual values of high and low blood pressure of all patients.
HIP<-HIP[, !colnames(HIP) %in% c("TAS al ingreso (mmHg)", "TAD al ingreso (mmHg)")]
#There are also some noise options related to topic analysis, which are not strong and can be deleted and reduced in dimension.
HIP<-HIP[, !colnames(HIP) %in% c("Cirrosis", "Insuficiencia renal ","Cocaína - Metanfetaminas","Obesidad","Tabaco","Alcohol","ACO")]
head(HIP)
## # A tibble: 6 × 26
## Edad Varón `Volumen (ml)` `Óbito en internación` `Vuelco ventricular` Cirugía
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 62 1 19 0 1 0
## 2 64 1 1 0 0 0
## 3 32 1 10 1 1 1
## 4 66 0 14 0 1 0
## 5 44 0 30 0 0 0
## 6 23 0 6 0 0 0
## # ℹ 20 more variables: Embolización <dbl>, `GCS al ingreso` <dbl>, HTA <dbl>,
## # Izquierdo <dbl>, Infratentorial <dbl>, Ganglionar <dbl>, Capsular <dbl>,
## # Lobar <dbl>, DBT <dbl>, `Glucemia al ingreso (mg/dl)` <dbl>,
## # `APP ACV hemorrágico` <dbl>,
## # `Cardiopatía (ICC- Arritmia - Valvulopatía)` <dbl>,
## # `Aneurismas - MAVs - FAVs - Cavernomas` <dbl>, `HTA periparto` <dbl>,
## # `APP ACV isquémico` <dbl>, `Neoplasia sólida` <dbl>, …
#Plot a correlation heatmap
cor_matrix <- cor(HIP)
corrplot(cor_matrix, method = "circle", tl.cex = 0.5, tl.col = "red")
“Edad” (age) and “Óbito en internación” (hospital mortality rate) : This indicates that the older the age, the higher the in-hospital mortality rate may be, suggesting that the two are closely related and can be used for risk assessment. “Volumen (ml)” (volume) and “Vuelo ventricular” (ventricular excursion) : This may be related to pathological factors, such as hydrocephalus or tumors. “GCS al ingreso” (GCS score at admission) and “Cirugía” (Surgery) : There was a strong positive correlation between the GCS score at admission and whether surgery was received.
“Edad” (age) and “GCS al ingreso” (GCS score on admission) : The older the patient, the lower the GCS score may be, reflecting the potential disadvantage of elderly patients in terms of disease severity. “Glucemia al ingreso” (admission blood sugar) and “DBT” (diabetes mellitus) : The strong relationship between hyperglycemia and a history of diabetes was evident in the negative correlation.
pca_results <- PCA(HIP[sapply(HIP, is.numeric)], graph = FALSE)
# Visualizing PCA Results
fviz_pca_var(pca_results, col.var = "contrib")
Dim1 and Dim2 explain 14.9% and 10.1% of the variance, respectively, which totals about 25% of the variance in the dataset. The contribution (contrib) color gradient highlights how much each variable contributes to Dim1 and Dim2: Variables like Cirugía, EVD, and Óbito en internación (surgery, external ventricular drainage, and in-hospital mortality) have higher contributions to Dim1 or Dim2. Variables like Edad (age) and HTA (hypertension) are also significant contributors to Dim1 or Dim2. Dim1 appears to capture surgical interventions and their associations with mortality or other clinical outcomes. Dim2 might capture relationships involving demographic and clinical variables (e.g., age, hypertension, glucose levels). Variables closer to the origin have a lower influence on the variance.
# Distinguishing between binary and continuous numerical data
binary_columns <- HIP %>%
select_if(~ all(. %in% c(0, 1))) %>%
names()
continuous_columns <- HIP %>%
select_if(~ is.numeric(.) && !all(. %in% c(0, 1))) %>%
names()
# Normalizing continuous numeric data
scaled_data <- HIP %>%
mutate(across(all_of(continuous_columns), scale))
# Perform PCA analysis (numeric columns only)
numeric_data <- scaled_data %>%
select_if(is.numeric)
pca <- prcomp(numeric_data, center = TRUE, scale. = FALSE)
# View the cumulative variance ratio
cumulative_variance <- cumsum(pca$sdev^2 / sum(pca$sdev^2))
# Find the number of principal components that explains 90% of the variance
n_components_90 <- which(cumulative_variance >= 0.90)[1]
cat("Cumulative variance ratio:\n", cumulative_variance, "\n")
## Cumulative variance ratio:
## 0.3012149 0.4737458 0.6042632 0.6867418 0.7409382 0.7794866 0.8166575 0.8501539 0.8776441 0.8958274 0.912002 0.9248553 0.9369252 0.9478349 0.9577244 0.9661157 0.9734777 0.9791541 0.9843563 0.9885955 0.9922592 0.9955013 0.9976854 0.9994608 1 1
cat("The number of principal components required to explain 90% of the variance:", n_components_90, "\n")
## The number of principal components required to explain 90% of the variance: 11
# Visualization
plot(cumulative_variance, type = "b", xlab = "Number of principal components", ylab = "Cumulative variance ratio",
main = "Cumulative variance ratio plot")
abline(h = 0.90, col = "red", lty = 2)
The plot shows a noticeable “elbow” around 5–7 components, suggesting that fewer components might suffice for approximate clustering, but 11 is optimal for a 90% variance threshold.
As you can see, we will use the first 11th element.
# Prepare PCA-reduced data for clustering
pca_data <- as.data.frame(pca$x[, 1:11])
# Perform K-Means clustering
set.seed(123)
kmeans_result <- kmeans(pca$x[, 1:11], centers = 2, nstart = 25)
kmeans_result$cluster
## [1] 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 1 1 1
## [39] 2 2 1 1 2 2 2 2 1 2 2 2 1 1 2 1 2 2 2 2 1 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 1 2
## [77] 1 1 2 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2
# Visualize clustering results(PC1 vs. PC2 for 2D visualization)
plot_data <- as.data.frame(pca_data[, 1:2]) %>%
mutate(Cluster = as.factor(kmeans_result$cluster))
ggplot(plot_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2) +
labs(title = "K-Means Clustering Results", x = "PC1", y = "PC2") +
theme_minimal()
# # Function to calculate silhouette scores for different k values
silhouette_scores <- function(data, max_k = 10) {
sil_scores <- numeric(max_k - 1)
for (k in 2:max_k) {
set.seed(123) # Ensure reproducibility
kmeans_result <- kmeans(data, centers = k, nstart = 25)
sil <- silhouette(kmeans_result$cluster, dist(data))
sil_scores[k - 1] <- mean(sil[, 3]) # Extract mean silhouette score
}
return(sil_scores)
}
# Calculate silhouette scores for k = 2 to max_k
max_k <- 10 # Maximum number of clusters to test
sil_scores <- silhouette_scores(pca_data, max_k)
# Visualize silhouette scores
plot(2:max_k, sil_scores, type = "b", pch = 19, frame = FALSE,
xlab = "Number of Clusters (k)", ylab = "Average Silhouette Score",
main = "Silhouette Scores for Different k")
abline(v = which.max(sil_scores) + 1, col = "red", lty = 2)
best_k <- which.max(sil_scores) + 1
cat("Optimal number of clusters (k):", best_k, "\n")
## Optimal number of clusters (k): 2
A silhouette score of 0.25 suggests that the clusters are not strongly separated,K-means works well with spherical clusters but struggles with non-linear structures. The data might not conform to these assumptions.
# 1. Determine the optimal number of clusters using silhouette score
silhouette_scores <- function(hc, dist_matrix, max_k) {
sil_scores <- numeric(max_k - 1)
for (k in 2:max_k) {
clusters <- cutree(hc, k = k)
sil <- silhouette(clusters, dist_matrix)
sil_scores[k - 1] <- mean(sil[, 3])
}
return(sil_scores)
}
# Compute silhouette scores for k = 2 to k = 10
max_k <- 10
dist_matrix <- dist(pca_data, method = "euclidean")
hc <- hclust(dist_matrix, method = "ward.D2")
sil_scores <- silhouette_scores(hc, dist_matrix, max_k)
# Plot silhouette scores to find the optimal k
plot(2:max_k, sil_scores, type = "b", pch = 19, frame = FALSE,
xlab = "Number of Clusters (k)", ylab = "Average Silhouette Score",
main = "Silhouette Scores for Different k")
abline(v = which.max(sil_scores) + 1, col = "red", lty = 2)
# 2. Use the best k (from silhouette scores)
best_k <- which.max(sil_scores) + 1
cat("Optimal number of clusters (k):", best_k, "\n")
## Optimal number of clusters (k): 9
# Perform hierarchical clustering with the optimal k
clusters <- cutree(hc, k = best_k)
# 3. Visualize the dendrogram with cluster boundaries
plot(hc, labels = FALSE, hang = -1,
main = "Hierarchical Clustering Dendrogram with Optimal k",
xlab = "Samples", ylab = "Distance")
rect.hclust(hc, k = best_k, border = 2:5) # Highlight clusters with boxes
# 4. Compute silhouette score for the optimal clustering
silhouette_result <- silhouette(clusters, dist_matrix)
fviz_silhouette(silhouette_result)
## cluster size ave.sil.width
## 1 1 12 0.22
## 2 2 6 -0.05
## 3 3 15 0.24
## 4 4 15 0.11
## 5 5 25 0.24
## 6 6 8 0.11
## 7 7 5 0.29
## 8 8 5 0.21
## 9 9 3 0.39
avg_silhouette <- mean(silhouette_result[, 3])
cat("Average Silhouette Score:", avg_silhouette, "\n")
## Average Silhouette Score: 0.1954164
The dendrogram shows hierarchical clustering, but it may not reflect well-separated clusters. The branches seem to merge at relatively large distances, suggesting that the data does not naturally cluster into clear groups.The average silhouette score is 0.2, which indicates poor clustering quality
# # Determine epsilon using k-distance plot
k_distances <- kNNdist(pca$x[, 1:11], k = 4)
plot(sort(k_distances), type = "l", main = "k-Distance Plot", xlab = "Samples (sorted)", ylab = "Distance")
abline(h = 0.5, col = "red", lty = 2)
# Perform DBSCAN
dbscan_result <- dbscan(pca_data, eps = 1.2, minPts = 5)
# Visualize clustering
print(dbscan_result)
## DBSCAN clustering for 94 objects.
## Parameters: eps = 1.2, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 76 noise points.
##
## 0 1 2
## 76 13 5
##
## Available fields: cluster, eps, minPts, metric, borderPoints
pca_plot <- data.frame(PC1 = pca_data[, 1], PC2 = pca_data[, 2], Cluster = factor(dbscan_result$cluster))
# Visualize clustering
ggplot(pca_plot, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2) +
labs(title = "DBSCAN Clustering Results", x = "PC1", y = "PC2") +
theme_minimal()
# Remove noise points for silhouette analysis
valid_data <- pca_data[dbscan_result$cluster != 0, ]
valid_clusters <- dbscan_result$cluster[dbscan_result$cluster != 0]
distance_matrix <- dist(valid_data, method = "euclidean")
# Compute silhouette scores
silhouette_result <- silhouette(valid_clusters, distance_matrix)
fviz_silhouette(silhouette_result)
## cluster size ave.sil.width
## 1 1 13 0.53
## 2 2 5 0.63
# Average silhouette score
avg_silhouette <- mean(silhouette_result[, 3])
cat("Average Silhouette Score:", avg_silhouette, "\n")
## Average Silhouette Score: 0.5548983
High Proportion of Noise: DBSCAN identified 76 noise points, which might indicate: The dataset does not have a well-defined cluster structure.
Fuzzy C-Means (FCM) is a clustering algorithm that allows one piece of data to belong to two or more clusters. Unlike traditional clustering methods, such as k-means, which assign each data point to a single cluster, FCM assigns membership levels to each data point, reflecting its degree of association with multiple clusters.
The algorithm was first introduced by J. C. Dunn in 1973 and was later refined by James Bezdek in 1981. It is widely used in pattern recognition, image processing, and data analysis due to its flexibility in handling overlapping data points.
The FCM algorithm minimizes the following objective function:
\[ J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \|x_i - c_j\|^2 \]
Where:
- \(N\) = Number of data points - \(C\) = Number of clusters - \(u_{ij}\) = Membership degree of data point
\(x_i\) in cluster \(j\) - \(c_j\) = Center of cluster \(j\) - \(m\) = Fuzziness parameter (\(m > 1\)), controlling the level of
cluster overlap - \(\|\cdot\|\) =
Euclidean distance
Algorithm Steps
1. Initialization: Assign initial cluster centers and
membership degrees (\(u_{ij}\)).
2. Update Membership Degrees: \[
u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\|x_i - c_j\|}{\|x_i -
c_k\|} \right)^{2/(m-1)}}
\] 3. Update Cluster Centers: \[
c_j = \frac{\sum_{i=1}^{N} u_{ij}^m x_i}{\sum_{i=1}^{N} u_{ij}^m}
\] 4. Iterate: Repeat steps 2 and 3 until
convergence (membership values or cluster centers stabilize).
Applications
- Image Segmentation: FCM is effective in segmenting
images with overlapping features. - Market
Segmentation: Helps identify customer groups with shared
characteristics. - Medical Diagnosis: Analyzes medical
data to detect overlapping conditions.
Advantages
- Captures uncertainty and overlap in data. - More flexible than hard
clustering methods like k-means.
Limitations
- Sensitive to the choice of the fuzziness parameter (\(m\)). - Prone to local optima. -
Computationally expensive for large datasets.
optimal_clusters <- 2:6
sil_scores <- numeric(length(optimal_clusters))
for (i in seq_along(optimal_clusters)) {
fcm_tmp <- cmeans(pca_data[, 1:2], centers = optimal_clusters[i], iter.max = 100, m = 2)
tmp_labels <- apply(fcm_tmp$membership, 1, which.max)
sil_scores[i] <- mean(silhouette(tmp_labels, dist(pca_data[, 1:2]))[, 3])
}
plot(optimal_clusters, sil_scores, type = "b", pch = 19, col = "blue",
xlab = "Number of Clusters", ylab = "Average silhouette coefficient", main = "Number of clusters and silhouette coefficient")
pca_data <- as.data.frame(pca_data)
n_clusters <- 6
fcm_result <- cmeans(pca_data, centers = n_clusters, iter.max = 100, m = 2, method = "cmeans")
cat("Cluster centers:\n")
## Cluster centers:
print(fcm_result$centers)
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 1.5722474 0.16036983 -0.18818702 -0.32250012 0.03139698 -0.015778297
## 2 -0.2756626 -0.03760648 0.03894717 0.04452192 -0.01387584 0.004389157
## 3 -0.2754941 -0.03766853 0.03893829 0.04452349 -0.01385583 0.004383836
## 4 -0.2699112 -0.03967190 0.03859276 0.04457839 -0.01318566 0.004207021
## 5 -0.2746591 -0.03797421 0.03889261 0.04453140 -0.01375638 0.004357454
## 6 -0.2758218 -0.03754766 0.03895542 0.04452043 -0.01389472 0.004394187
## PC7 PC8 PC9 PC10 PC11
## 1 0.11796474 -0.035591471 -0.028449885 -0.0299422630 0.0077991801
## 2 -0.01075598 0.006051465 0.003591757 -0.0002535099 0.0007273682
## 3 -0.01076383 0.006046250 0.003591595 -0.0002561331 0.0007136823
## 4 -0.01102669 0.005877137 0.003585916 -0.0003421976 0.0002598342
## 5 -0.01080286 0.006020521 0.003590780 -0.0002691098 0.0006458495
## 6 -0.01074856 0.006056403 0.003591911 -0.0002510287 0.0007403051
cat("\nMembership matrix (first 10 rows):\n")
##
## Membership matrix (first 10 rows):
print(head(fcm_result$membership, 10))
## 1 2 3 4 5 6
## [1,] 0.11265561 0.1773145 0.1773340 0.1779698 0.1774300 0.1772961
## [2,] 0.14087724 0.1717095 0.1717239 0.1721981 0.1717953 0.1716959
## [3,] 0.16040946 0.1679214 0.1679209 0.1679079 0.1679184 0.1679219
## [4,] 0.14045214 0.1717655 0.1717837 0.1723771 0.1718732 0.1717484
## [5,] 0.06110500 0.1879687 0.1879450 0.1871627 0.1878275 0.1879911
## [6,] 0.08255860 0.1836417 0.1836224 0.1829903 0.1835270 0.1836599
## [7,] 0.27604195 0.1447107 0.1447208 0.1450548 0.1447707 0.1447011
## [8,] 0.14817529 0.1702496 0.1702641 0.1707396 0.1703354 0.1702360
## [9,] 0.05016453 0.1900945 0.1900788 0.1895519 0.1900009 0.1901093
## [10,] 0.05541407 0.1890113 0.1890000 0.1886095 0.1889432 0.1890220
# Get the cluster label for each point (based on maximum membership)
pca_data$cluster <- apply(fcm_result$membership, 1, which.max)
# Convert cluster centers to a data frame and set column names
centers_df <- as.data.frame(fcm_result$centers)
colnames(centers_df) <- colnames(pca_data)[1:ncol(centers_df)] # 使用 PCA 数据的列名
# Visualize clustering results
ggplot(pca_data, aes(x = pca_data[,1], y = pca_data[,2], color = factor(cluster))) +
geom_point(size = 2) +
geom_point(data = centers_df, aes(x = centers_df[,1], y = centers_df[,2]),
color = "red", size = 4, shape = 3) +
labs(title = "Fuzzy C-Means Clustering Results", x = "PC1", y = "PC2", color = "Cluster") +
theme_minimal()
# Print summary of the membership matrix
cat("Membership matrix summary:\n")
## Membership matrix summary:
summary(fcm_result$membership)
## 1 2 3 4
## Min. :0.04599 Min. :0.0946 Min. :0.09461 Min. :0.09494
## 1st Qu.:0.06881 1st Qu.:0.1623 1st Qu.:0.16229 1st Qu.:0.16266
## Median :0.09740 Median :0.1805 Median :0.18054 Median :0.18044
## Mean :0.14645 Mean :0.1707 Mean :0.17072 Mean :0.17067
## 3rd Qu.:0.18818 3rd Qu.:0.1863 3rd Qu.:0.18632 3rd Qu.:0.18605
## Max. :0.52660 Max. :0.1910 Max. :0.19098 Max. :0.19015
## 5 6
## Min. :0.09466 Min. :0.09459
## 1st Qu.:0.16234 1st Qu.:0.16226
## Median :0.18053 Median :0.18055
## Mean :0.17071 Mean :0.17072
## 3rd Qu.:0.18626 3rd Qu.:0.18635
## Max. :0.19085 Max. :0.19103
# Visualize the distribution of maximum membership values
max_membership <- apply(fcm_result$membership, 1, max)
hist(max_membership, main = "Distribution of Maximum Membership Values", xlab = "Maximum Membership", col = "skyblue")
pca_data$cluster <- apply(fcm_result$membership, 1, which.max)
# Calculate silhouette score
silhouette_score <- silhouette(pca_data$cluster, dist(pca_data[, 1:2]))
mean_sil <- mean(silhouette_score[, 3])
cat("Average silhouette score:", mean_sil, "\n")
## Average silhouette score: 0.2799081
# xtract hard cluster labels (from FCM membership matrix based on maximum membership)
hard_clusters <- apply(fcm_result$membership, 1, which.max)
# Combine hard cluster labels with the original data
data_with_clusters <- cbind(HIP, Cluster = hard_clusters)
# Calculate the mean of each cluster across all dimensions
cluster_centers <- data_with_clusters %>%
group_by(Cluster) %>%
summarise(across(everything(), mean)) %>%
arrange(Cluster)
# View cluster centers
print(cluster_centers)
## # A tibble: 3 × 27
## Cluster Edad Varón `Volumen (ml)` `Óbito en internación` `Vuelco ventricular`
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 52.4 0.68 69.9 1 0.8
## 2 4 59.0 0.741 20.5 0.333 0.370
## 3 6 45.0 0.476 17.1 0.167 0.143
## # ℹ 21 more variables: Cirugía <dbl>, Embolización <dbl>,
## # `GCS al ingreso` <dbl>, HTA <dbl>, Izquierdo <dbl>, Infratentorial <dbl>,
## # Ganglionar <dbl>, Capsular <dbl>, Lobar <dbl>, DBT <dbl>,
## # `Glucemia al ingreso (mg/dl)` <dbl>, `APP ACV hemorrágico` <dbl>,
## # `Cardiopatía (ICC- Arritmia - Valvulopatía)` <dbl>,
## # `Aneurismas - MAVs - FAVs - Cavernomas` <dbl>, `HTA periparto` <dbl>,
## # `APP ACV isquémico` <dbl>, `Neoplasia sólida` <dbl>, …
Considering individual differences among patients, the FCM fuzzy clustering algorithm was ultimately used to classify all patients into six groups. Among these, the second, third, and fourth groups were identified as high-mortality groups, while the remaining three groups were classified as low-mortality groups. The first three groups share a common characteristic: their GCS (Glasgow Coma Scale) scores range between 3 and 6. In contrast, the GCS scores for the remaining three groups exceed 11, indicating that this variable has a significant impact. Clinically, variables such as hypertension and hemorrhage volume were not prominently influential in this analysis and did not play a decisive role.
In the second group, although the hemorrhage volume was less than 40 ml, the mortality rate remained as high as 100%. Additionally, most intracerebral hemorrhage (ICH) patients were over 50 years old; however, one group comprised patients under 30 years old, raising the question of whether ICH is becoming more prevalent in younger individuals. Except for the first group, the other five groups shared a common characteristic: a majority of patients had a history of hypertension. This suggests that while hypertension is an important risk factor for ICH, it is not the determining factor for mortality.
Furthermore, surgical intervention was not identified as a significant factor in this analysis. These findings align with clinical conclusions to a large extent. However, given the small sample size and the lack of consideration for rescue time in ICH patients, this analysis has notable limitations.