I analyzed a Chronic Kidney Disease (CKD) dataset containing a combination of numerical laboratory measurements and categorical clinical indicators collected from patients. The features describe kidney function, blood health, metabolic conditions, and related comorbidities, including measurements such as blood urea, serum creatinine, hemoglobin, electrolyte levels, and the presence or absence of conditions like hypertension, diabetes, anemia, and edema.
The dataset consists of 24 clinical attributes and one class label, where the class indicates whether a patient was diagnosed with chronic kidney disease (ckd) or not (notckd). Although this label was available, it was not used during clustering and was only referenced later to assess how well the discovered clusters aligned with the true medical categories.
The dataset was obtained from the UCI Machine Learning Repository and was originally sourced via IEEE DataPort:
https://ieee-dataport.org/documents/chronic-kidney-disease-data-set
https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease
pkgs <- c(
"tidyverse","factoextra","cluster","flexclust","ClusterR","psych",
"gridExtra","ggplot2","ggfortify","dbscan","NbClust","clustertend",
"smacof","ape","fpc","StatMatch","dendextend","plotly","corrplot"
)
missing <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) {
stop(paste("Missing packages:", paste(missing, collapse=", "),
"\nInstall them first, then knit again."))
}
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
library(ClusterR)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(rlang)
##
## Attaching package: 'rlang'
##
## The following objects are masked from 'package:purrr':
##
## flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
library(ggfortify)
library(dbscan)
##
## Attaching package: 'dbscan'
##
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(NbClust)
library(clustertend)
## Package `clustertend` is deprecated. Use package `hopkins` instead.
library(smacof)
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
##
## The following object is masked from 'package:dbscan':
##
## clplot
##
## The following object is masked from 'package:flexclust':
##
## placeLabels
##
## The following object is masked from 'package:psych':
##
## rescale
##
## Loading required package: colorspace
## Loading required package: e1071
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:flexclust':
##
## bclust
##
## The following object is masked from 'package:ggplot2':
##
## element
##
##
## Attaching package: 'smacof'
##
## The following object is masked from 'package:psych':
##
## Procrustes
##
## The following object is masked from 'package:base':
##
## transform
library(ape)
##
## Attaching package: 'ape'
##
## The following object is masked from 'package:dplyr':
##
## where
library(fpc)
##
## Attaching package: 'fpc'
##
## The following object is masked from 'package:dbscan':
##
## dbscan
library(StatMatch)
## Loading required package: proxy
##
## Attaching package: 'proxy'
##
## The following objects are masked from 'package:stats':
##
## as.dist, dist
##
## The following object is masked from 'package:base':
##
## as.matrix
##
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loading required package: survival
##
## Attaching package: 'survey'
##
## The following object is masked from 'package:graphics':
##
## dotchart
##
## Loading required package: lpSolve
library(dendextend)
##
## ---------------------
## Welcome to dendextend version 1.19.1
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
##
## Attaching package: 'dendextend'
##
## The following objects are masked from 'package:ape':
##
## ladderize, rotate
##
## The following object is masked from 'package:stats':
##
## cutree
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
set.seed(123)
theme_set(theme_minimal())
The analysis proceeds as follows. First, the dataset is cleaned, encoded, and scaled to construct a suitable feature matrix for clustering. Next, cluster tendency and the optimal number of clusters are assessed. Multiple clustering algorithms are then applied in the original feature space and evaluated. Finally, dimensionality reduction using MDS and PCA is explored to assess whether reduced representations improve cluster separation and stability, including validation using train–test splits.
I first loaded the dataset and examined its structure to verify that:
variables were correctly typed
the class label contained no obvious formatting issues
missing values were consistently encoded
df0 <- read.csv("data.csv", stringsAsFactors = FALSE, na.strings = "?")
str(df0)
## 'data.frame': 401 obs. of 25 variables:
## $ age : chr "48" "7" "62" "48" ...
## $ bp : int 80 50 80 70 80 90 70 NA 100 90 ...
## $ sg : num 1.02 1.02 1.01 1 1.01 ...
## $ al : int 1 4 2 4 2 3 0 2 3 2 ...
## $ su : int 0 0 3 0 0 0 0 4 0 0 ...
## $ rbc : chr NA NA "normal" "normal" ...
## $ pc : chr "normal" "normal" "normal" "abnormal" ...
## $ pcc : chr "notpresent" "notpresent" "notpresent" "present" ...
## $ ba : chr "notpresent" "notpresent" "notpresent" "notpresent" ...
## $ bgr : int 121 NA 423 117 106 74 100 410 138 70 ...
## $ bu : num 36 18 53 56 26 25 54 31 60 107 ...
## $ sc : num 1.2 0.8 1.8 3.8 1.4 1.1 24 1.1 1.9 7.2 ...
## $ sod : num NA NA NA 111 NA 142 104 NA NA 114 ...
## $ pot : num NA NA NA 2.5 NA 3.2 4 NA NA 3.7 ...
## $ hemo : num 15.4 11.3 9.6 11.2 11.6 12.2 12.4 12.4 10.8 9.5 ...
## $ pcv : chr "44" "38" "31" "32" ...
## $ wbcc : chr "7800" "6000" "7500" "6700" ...
## $ rbcc : chr "5.2" NA NA "3.9" ...
## $ htn : chr "yes" "no" "no" "yes" ...
## $ dm : chr "yes" "no" "yes" "no" ...
## $ cad : chr "no" "no" "no" "no" ...
## $ appet: chr "good" "good" "poor" "poor" ...
## $ pe : chr "no" "no" "no" "yes" ...
## $ ane : chr "no" "no" "yes" "yes" ...
## $ class: chr "ckd" "ckd" "ckd" "ckd" ...
head(df0)
The class column was cleaned by converting all values to lowercase, trimming whitespace, and removing invalid entries. Only valid labels (ckd, notckd) were retained.
# Note: the rows dropped here (invalid class labels) were also the source of extreme outliers found later in the analysis
df0$class <- tolower(trimws(df0$class))
df0$class <- gsub(",", "", df0$class)
df0 <- df0 %>% filter(class %in% c("ckd", "notckd"))
y <- factor(df0$class, levels = c("ckd", "notckd"))
Binary clinical indicators were converted to numeric values (1 = present, 0 = absent), while missing values were preserved for later imputation.
df0$dm <- ifelse(df0$dm == "yes", 1, ifelse(is.na(df0$dm), NA, 0))
df0$cad <- ifelse(df0$cad == "yes", 1, ifelse(is.na(df0$cad), NA, 0))
df0$pe <- ifelse(df0$pe == "yes", 1, ifelse(is.na(df0$pe), NA, 0))
df0$ane <- ifelse(df0$ane == "yes", 1, ifelse(is.na(df0$ane), NA, 0))
df0$htn <- ifelse(df0$htn == "yes", 1, ifelse(is.na(df0$htn), NA, 0))
df0$rbc <- ifelse(df0$rbc == "abnormal", 1, ifelse(is.na(df0$rbc), NA, 0))
df0$pc <- ifelse(df0$pc == "abnormal", 1, ifelse(is.na(df0$pc), NA, 0))
df0$pcc <- ifelse(df0$pcc == "present", 1, ifelse(is.na(df0$pcc), NA, 0))
df0$ba <- ifelse(df0$ba == "present", 1, ifelse(is.na(df0$ba), NA, 0))
df0$appet <- ifelse(df0$appet == "poor", 1, ifelse(is.na(df0$appet), NA, 0))
The class label was then removed to form the clustering feature matrix.
df <- df0 %>% select(-class)
After encoding, I grouped the variables into two sets.
The first set contains laboratory-style numeric or ordinal measurements (for example age, blood pressure, specific gravity, albumin, sugar, urea, creatinine, electrolytes, and blood counts). I scaled these variables so that no single unit dominates the distance calculation.
The second set contains true binary indicators (0/1), such as diabetes, hypertension, anemia, appetite status, and abnormal cell findings. I kept these as 0/1 and combined them with the scaled numeric variables to form the final matrix X.
numeric_vars <- c("age","bp","bgr","bu","sc","sod","pot","hemo","pcv","wbcc","rbcc")
ordinal_vars <- c("sg","al","su")
binary_vars <- setdiff(names(df), c(numeric_vars, ordinal_vars))
df[c(numeric_vars, ordinal_vars)] <- lapply(df[c(numeric_vars, ordinal_vars)], as.numeric)
## Warning in lapply(df[c(numeric_vars, ordinal_vars)], as.numeric): NAs
## introduced by coercion
## Warning in lapply(df[c(numeric_vars, ordinal_vars)], as.numeric): NAs
## introduced by coercion
## Warning in lapply(df[c(numeric_vars, ordinal_vars)], as.numeric): NAs
## introduced by coercion
df[binary_vars] <- lapply(df[binary_vars], as.numeric)
summary(df[c(numeric_vars, ordinal_vars)])
## age bp bgr bu
## Min. : 2.00 Min. : 50.00 Min. : 22.0 Min. : 1.50
## 1st Qu.:42.00 1st Qu.: 70.00 1st Qu.: 99.0 1st Qu.: 27.00
## Median :54.50 Median : 80.00 Median :121.0 Median : 42.00
## Mean :51.42 Mean : 76.49 Mean :148.2 Mean : 57.45
## 3rd Qu.:64.00 3rd Qu.: 80.00 3rd Qu.:163.0 3rd Qu.: 66.00
## Max. :90.00 Max. :180.00 Max. :490.0 Max. :391.00
## NA's :9 NA's :12 NA's :44 NA's :19
## sc sod pot hemo
## Min. : 0.400 Min. : 4.5 Min. : 2.500 Min. : 3.10
## 1st Qu.: 0.900 1st Qu.:135.0 1st Qu.: 3.800 1st Qu.:10.30
## Median : 1.300 Median :138.0 Median : 4.400 Median :12.60
## Mean : 3.078 Mean :137.5 Mean : 4.631 Mean :12.52
## 3rd Qu.: 2.800 3rd Qu.:142.0 3rd Qu.: 4.900 3rd Qu.:15.00
## Max. :76.000 Max. :163.0 Max. :47.000 Max. :17.80
## NA's :17 NA's :87 NA's :88 NA's :52
## pcv wbcc rbcc sg al
## Min. : 9.00 Min. : 2200 Min. :2.100 Min. :1.005 Min. :0.00
## 1st Qu.:32.00 1st Qu.: 6500 1st Qu.:3.900 1st Qu.:1.010 1st Qu.:0.00
## Median :40.00 Median : 8000 Median :4.800 Median :1.020 Median :0.00
## Mean :38.86 Mean : 8400 Mean :4.707 Mean :1.017 Mean :1.02
## 3rd Qu.:45.00 3rd Qu.: 9800 3rd Qu.:5.400 3rd Qu.:1.020 3rd Qu.:2.00
## Max. :54.00 Max. :26400 Max. :8.000 Max. :1.025 Max. :5.00
## NA's :71 NA's :106 NA's :131 NA's :47 NA's :46
## su
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4514
## 3rd Qu.:0.0000
## Max. :5.0000
## NA's :49
summary(df[binary_vars])
## rbc pc pcc ba
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1903 Mean :0.2275 Mean :0.1063 Mean :0.0557
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :152 NA's :65 NA's :4 NA's :4
## htn dm cad appet
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.000
## Mean :0.3703 Mean :0.3375 Mean :0.08564 Mean :0.206
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.000
## NA's :2 NA's :2 NA's :2 NA's :1
## pe ane
## Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :0.0000
## Mean :0.191 Mean :0.1508
## 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.0000
## NA's :1 NA's :1
Missing values were handled differently depending on variable type:
Numeric variables were imputed using the median, which is robust to outliers.
Binary variables were imputed using the mode (most frequent value).
# Mode imputation helper: returns the most frequent non-NA value of x
mode_impute <- function(x) {
  ux <- na.omit(x)
  if (length(ux) == 0) return(NA)
  ux[which.max(tabulate(match(ux, ux)))]
}
df[numeric_vars] <- lapply(df[numeric_vars], function(x) { x[is.na(x)] <- median(x, na.rm=TRUE); x })
df[ordinal_vars] <- lapply(df[ordinal_vars], function(x) { x[is.na(x)] <- mode_impute(x); x })
df[binary_vars] <- lapply(df[binary_vars], function(x) { x[is.na(x)] <- mode_impute(x); x })
colSums(is.na(df))
## age bp sg al su rbc pc pcc ba bgr bu sc sod
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## pot hemo pcv wbcc rbcc htn dm cad appet pe ane
## 0 0 0 0 0 0 0 0 0 0 0
After imputation, no missing values remained in the dataset.
To ensure fair distance calculations:
Numeric variables were scaled to mean 0 and variance 1
Binary variables were kept as 0/1
X_num <- scale(df[c(numeric_vars, ordinal_vars)])
X_bin <- data.matrix(df[binary_vars])
X <- cbind(X_num, X_bin)
At this point, the final feature matrix X is suitable for distance-based clustering methods.
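Since X mixes scaled numeric columns with 0/1 indicators, a Gower dissimilarity would be a possible alternative to plain Euclidean distance, and the packages loaded above (cluster, StatMatch) both provide it. The snippet below is only a sketch and is not used in the analysis that follows; the object name d_gower is illustrative.
# Sketch: Gower dissimilarity on the mixed-type data frame, declaring the
# 0/1 columns as symmetric binary variables
d_gower <- cluster::daisy(
  df[c(numeric_vars, ordinal_vars, binary_vars)],
  metric = "gower",
  type = list(symm = binary_vars)
)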
Before applying clustering algorithms, I assessed whether the data contained meaningful cluster structure using the Hopkins statistic.
factoextra::get_clust_tendency(
X,
n = min(50, nrow(X)),
graph = TRUE,
gradient = list(low = "red", mid = "white", high = "blue"),
seed = 1
)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## $hopkins_stat
## [1] 0.8584354
##
## $plot
The Hopkins statistic was 0.86, indicating a strong tendency for clustering and confirming that the data is far from random.
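Since the clustertend package is deprecated (see its startup message above), the same tendency check could also be run with the newer hopkins package; the snippet below is only a sketch of that alternative and is not used elsewhere in the analysis.
# Alternative Hopkins computation with the hopkins package; values near 1
# again indicate a clusterable structure (m = 50 sampled rows, as above)
if (requireNamespace("hopkins", quietly = TRUE)) {
  set.seed(1)
  hopkins::hopkins(X, m = 50)
}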
Multiple methods were used to determine the optimal number of clusters:
Elbow Method (WSS): Looks for the point where the within-cluster sum of squares starts to level off.
Silhouette Method: Measures the separation between clusters, with higher values indicating better separation.
Gap Statistic: Compares the clustering structure to a random reference dataset.
Calinski-Harabasz index: The CH index measures the ratio of between-cluster variance to within-cluster variance, where higher values indicate better-defined clusters.
fviz_nbclust(X, kmeans, method = "wss", k.max = 10) + ggtitle("K-means: Elbow (WSS)")
fviz_nbclust(X, kmeans, method = "silhouette", k.max = 10) + ggtitle("K-means: Silhouette")
fviz_nbclust(X, kmeans, method = "gap_stat", k.max = 10) + ggtitle("K-means: Gap Statistic")
fviz_nbclust(X, FUNcluster = cluster::pam, method = "wss", k.max = 10) + ggtitle("PAM: Elbow (WSS)")
fviz_nbclust(X, FUNcluster = cluster::pam, method = "silhouette", k.max = 10) + ggtitle("PAM: Silhouette")
fviz_nbclust(X, FUNcluster = cluster::pam, method = "gap_stat", k.max = 10) + ggtitle("PAM: Gap Statistic")
fviz_nbclust(X, FUNcluster = cluster::clara, method = "wss", k.max = 10) + ggtitle("CLARA: Elbow (WSS)")
fviz_nbclust(X, FUNcluster = cluster::clara, method = "silhouette", k.max = 10) + ggtitle("CLARA: Silhouette")
fviz_nbclust(X, FUNcluster = cluster::clara, method = "gap_stat", k.max = 10) + ggtitle("CLARA: Gap Statistic")
fviz_nbclust(X, hcut, method = "wss", k.max = 10) + ggtitle("Hierarchical Clustering: Elbow (WSS)")
fviz_nbclust(X, hcut, method = "silhouette", k.max = 10) + ggtitle("Hierarchical Clustering: Silhouette")
fviz_nbclust(X, hcut, method = "gap_stat", k.max = 10) + ggtitle("Hierarchical Clustering: Gap Statistic")
set.seed(123)
nb <- NbClust::NbClust(
X, distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index = "ch"
)
nb$Best.nc
## Number_clusters Value_Index
## 2.0000 113.2369
From the silhouette plots, K-means, PAM, and CLARA all show their highest average silhouette width at k = 2, after which the values steadily decrease. This suggests that dividing the data into two clusters gives the clearest separation, while adding more clusters mainly breaks existing groups into smaller, less meaningful pieces. Hierarchical clustering shows a similar pattern, with silhouette values stabilizing around k = 2-5, but without a clear improvement beyond two clusters.
The Gap Statistic selected two clusters for all algorithms, and the Calinski–Harabasz index also supported this choice, with its maximum value occurring at k = 2 (CH = 113.24). This provided a clear and stable basis for using a two-cluster solution in subsequent analyses.
I ran clustering on the full feature matrix X using K-means and PAM.
set.seed(123)
km2_e <- eclust(X, "kmeans", k = 2, graph = FALSE)
fviz_cluster(km2_e, geom = "point", alpha = 0.5, stand = FALSE)
fviz_silhouette(km2_e)
## cluster size ave.sil.width
## 1 1 186 0.00
## 2 2 213 0.45
pam2_e <- factoextra::eclust(X, "pam", k = 2, graph = FALSE)
fviz_cluster(pam2_e, geom = "point", alpha = 0.5, stand = FALSE)
fviz_silhouette(pam2_e)
## cluster size ave.sil.width
## 1 1 228 0.43
## 2 2 171 0.01
df$class <- y
df$cluster1 <- km2_e$cluster
table(df$cluster1, df$class)
##
## ckd notckd
## 1 186 0
## 2 64 149
df$cluster2 <- pam2_e$cluster
table(df$cluster2, df$class)
##
## ckd notckd
## 1 79 149
## 2 171 0
Both partition-based methods produced similar patterns:
One cluster was relatively compact and well separated
The second cluster showed substantial overlap and dispersion
This indicates that while a two-cluster structure exists, perfect separation between CKD and non-CKD patients is not achievable in the original high-dimensional space.
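To put a single number on how closely each partition matches the held-out class label, the adjusted Rand index can be computed from the cross-tabulations above; a minimal sketch using flexclust (already loaded):
# Adjusted Rand index against the class label: 1 = perfect agreement, ~0 = chance level
flexclust::randIndex(table(km2_e$cluster, y))
flexclust::randIndex(table(pam2_e$cluster, y))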
fviz_cluster(
list(data = X, cluster = km2_e$cluster),
geom = "point",
ellipse.type = "norm",
stand = FALSE,
ggtheme = theme_minimal()
) +
ggtitle("K-means Clustering (Original Feature Space)")
fviz_cluster(
list(data = X, cluster = pam2_e$cluster),
geom = "point",
ellipse.type = "norm",
stand = FALSE,
ggtheme = theme_minimal()
) + ggtitle("PAM Clustering (Original Feature Space)")
The clustering results show some overlap between the clusters. However, these plots are only 2D projections used for visualization, so some overlap is expected when a higher-dimensional structure is compressed into two dimensions; the separation may be more distinct when additional dimensions are considered.
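As a quick check of that last point, the first three principal components of X can be inspected interactively with plotly (loaded above); this is only a sketch, and the names scores3 and p3 are illustrative.
# 3D view of the first three PCs of X, coloured by the K-means assignment
scores3 <- as.data.frame(prcomp(X)$x[, 1:3])
scores3$cluster <- factor(km2_e$cluster)
p3 <- plotly::plot_ly(
  scores3, x = ~PC1, y = ~PC2, z = ~PC3, color = ~cluster,
  type = "scatter3d", mode = "markers", marker = list(size = 3)
)
p3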
I also ran hierarchical clustering because it gives me a dendrogram view of how points merge step by step. I tried multiple linkage methods and then cut the tree into 2 clusters for consistency with the rest of the analysis.
This helps confirm whether the two-cluster structure is stable across different clustering approaches.
dist_euc <- dist(X)
hc_complete <- hclust(dist_euc, method = "complete")
hc_ward <- hclust(dist_euc, method = "ward.D2")
factoextra::fviz_dend(hc_complete, k = 2, cex = 0.6, rect = TRUE) +
ggtitle("Dendrogram: Complete linkage (k=2)")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
factoextra::fviz_dend(hc_ward, k = 2, cex = 0.6, rect = TRUE) +
ggtitle("Dendrogram: Ward.D2 (k=2)")
The Complete linkage dendrogram shows a highly unbalanced structure where most data points are grouped into one large blue cluster, leaving a very small red branch on the left. This indicates that the method is sensitive to outliers or specific distant points, merging them only at the very final step at a height of approximately 25.
In contrast, Ward’s method produces two well-balanced and distinct branches of similar size. The significant vertical height (around 50) before the final merge indicates a high degree of separation between these two groups, providing strong evidence that a two-cluster structure is a natural and stable fit for the underlying data.
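The visual impression that Ward linkage gives the cleanest structure can also be checked numerically with the agglomerative coefficient from cluster::agnes; a short sketch (values closer to 1 indicate stronger clustering structure):
# Agglomerative coefficient for several linkage methods (higher = clearer structure)
sapply(c("average", "single", "complete", "ward"), function(m) {
  cluster::agnes(X, method = m)$ac
})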
hc_cut <- factoextra::hcut(X, k = 2, stand = FALSE, hc_method = "ward.D2")
fviz_cluster(hc_cut, main = "Hierarchical (hcut) on X", geom = "point", stand = FALSE)
fviz_silhouette(hc_cut)
## cluster size ave.sil.width
## 1 1 174 0.48
## 2 2 225 -0.01
table(Cluster = hc_cut$cluster, Class = y)
## Class
## Cluster ckd notckd
## 1 25 149
## 2 225 0
The results from hierarchical clustering (using Ward’s method) show that the two-cluster structure is stable across different clustering approaches. Cluster 2 predominantly contains ckd cases, while Cluster 1 consists mostly of notckd cases, with only a few misclassifications. This represents the strongest class alignment observed so far.
The average silhouette width for hierarchical clustering, however, is low, primarily due to the spread-out nature of Cluster 2.
DBSCAN is useful because it can label some points as noise instead of forcing every point into a cluster.
I used the kNN distance plot to pick a reasonable eps value, then I ran DBSCAN.
dbscan::kNNdistplot(X, k = 10)
abline(h = 2.5, col = "red", lty = 2)
db <- dbscan::dbscan(X, eps = 2.5, minPts = 10)
Larger eps values merged most points into one cluster, so I selected eps = 2.5 to avoid a single-cluster solution while still keeping some separation.
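A small sensitivity check makes that choice explicit: rerunning DBSCAN over a few candidate eps values (the grid below is illustrative) and counting clusters and noise points.
# How the number of clusters and noise points reacts to eps (minPts fixed at 10)
for (eps in c(2.0, 2.5, 3.0, 3.5)) {
  res <- dbscan::dbscan(X, eps = eps, minPts = 10)
  cat("eps =", eps,
      "| clusters =", length(setdiff(unique(res$cluster), 0)),
      "| noise points =", sum(res$cluster == 0), "\n")
}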
fviz_cluster(db, data = X, geom = "point", stand = FALSE) +
ggtitle("DBSCAN clusters on X")
table(db$cluster,y)
## y
## ckd notckd
## 0 147 0
## 1 103 149
The plot and the table show that DBSCAN identifies only one dense cluster, which mixes all notckd cases with a large number of ckd cases, while the remaining 147 ckd patients are labeled as noise. Because the ckd cases are so spread out, a density-based approach does not work well for this dataset.
Overall, hierarchical clustering provides the best label separation among the methods tested in the original feature space, but the uneven compactness of the clusters suggests that the high-dimensional representation still contains noise and redundant variation. This motivated the use of dimensionality reduction to improve cluster compactness while preserving the strong class alignment.
While PCA is variance-based, MDS is distance-based and is useful for visualizing how points relate to each other based on pairwise distances. I applied MDS using Euclidean distance computed on the full preprocessed feature matrix X (scaled numeric variables plus binary indicators).
mds_euc <- cmdscale(dist_euc, k = 2)
plot(mds_euc, pch = 19, col = y, main = "MDS (cmdscale) from Euclidean distance on X")
legend("topright", legend = levels(y), col = 1:length(levels(y)), pch = 19)
The MDS plot visualizes the relationships between the data points based on Euclidean distance. It shows a clear separation between the ckd and notckd groups along Dimension 1, with the notckd cases (red) tightly grouped together and the ckd cases (black) spread across a much larger range.
However, there’s some overlap between the two groups, particularly in the lower range of Dim 1. This suggests that while there is some distinction between the groups, the separation is not perfect, and there is still a significant amount of variability within each group.
stress_by_dim <- sapply(1:25, function(k) {
smacof::mds(dist_euc, ndim = k, type = "ratio")$stress
})
data.frame(
Dimensions = 1:25,
Stress = stress_by_dim
)
plot(1:25, stress_by_dim, type = "b",
xlab = "Number of MDS dimensions",
ylab = "Stress",
main = "Stress vs MDS dimensionality")
I selected 5 dimensions for the MDS solution because this is the point where the stress value drops into the fair-to-good range (approximately 0.09) according to standard MDS guidelines. Beyond this point, additional dimensions continue to reduce stress, but the improvement becomes more gradual, indicating diminishing returns. A 5-dimensional solution therefore provides a good balance between preserving the distance structure of the data and keeping the representation reasonably low-dimensional.
mds <- smacof::mds(dist_euc, ndim = 5, type = "ratio")
mds$stress
## [1] 0.09198974
summary(mds)
##
## Configurations:
## D1 D2 D3 D4 D5
## 1 -0.2053 0.0976 0.1418 0.0922 0.0282
## 2 -0.1097 -0.1911 -0.0756 0.2668 0.7008
## 3 0.5553 0.5849 0.4615 -0.0747 0.1052
## 4 0.5820 -0.1412 -0.3052 0.1236 0.6274
## 5 0.1046 0.0707 -0.2181 0.1913 0.0562
## 6 0.1307 0.1868 -0.1430 0.3585 -0.0997
## 7 0.2885 -0.6171 0.0426 -0.7160 0.2844
## 8 0.1680 0.6993 0.4898 0.2211 0.4600
## 9 0.3931 0.1116 -0.1733 0.3562 -0.1453
## 10 0.5649 -0.3519 -0.5354 0.0238 -0.1173
## 11 0.6253 0.6958 0.4447 -0.2503 0.5380
## 12 0.6191 0.0631 0.3183 -0.2635 0.4516
## 13 0.6041 0.2177 -0.2612 -0.2465 0.0647
## 14 0.0998 -0.2765 0.0429 -0.3155 -0.1571
## 15 1.0324 0.1701 -0.3402 -0.0205 -0.0693
## 16 0.6155 -0.5852 -0.0683 0.3773 0.0458
## 17 -0.0399 -0.0021 -0.1048 0.0670 0.1639
## 18 -0.0397 -0.2438 0.0175 0.0493 -0.1293
## 19 0.1161 0.5570 0.2681 -0.1104 -0.4665
## 20 0.1962 -0.3530 -0.2268 -0.2594 0.1700
## 21 0.7587 -0.2114 -0.1308 -0.0488 -0.1057
## 22 1.5570 -2.1894 1.0329 -2.0110 0.6399
## 23 0.4786 -0.5179 0.1508 0.2946 0.1622
## 24 -0.1268 -0.1174 -0.3616 0.2380 0.1950
## 25 0.2772 0.1120 -0.0732 0.5662 0.0646
## 26 0.0867 -0.4155 -0.1738 -0.3315 -0.1596
## 27 0.1548 0.0907 -0.1522 -0.2505 -0.2954
## 28 0.5635 0.6002 0.1379 -0.1413 0.3342
## 29 -0.0209 0.3905 0.2294 -0.3165 0.0006
## 30 0.0503 0.3194 -0.4409 -0.1433 0.0680
## 31 0.0276 -0.4410 0.1274 -0.1259 -0.0256
## 32 0.3574 0.0989 -0.2534 0.2710 -0.2538
## 33 0.3222 0.2356 -0.1446 -0.0088 -0.2002
## 34 0.2799 -0.0331 0.0630 0.3262 -0.2921
## 35 0.2939 -0.1290 0.1826 -0.3645 0.1761
## 36 0.2556 0.3176 0.2675 -0.0069 -0.1928
## 37 0.1215 -0.0619 -0.1938 -0.2937 -0.1316
## 38 0.3506 -0.2493 -0.1248 -0.2415 -0.4848
## 39 0.2099 -0.1459 0.3035 0.0938 0.0781
## 40 0.2952 0.4036 -0.0621 -0.2154 -0.1415
## 41 0.2938 -0.0051 -0.2734 0.2537 -0.0470
## 42 -0.1103 0.0619 -0.3179 0.0092 0.1006
## 43 0.1881 0.1683 -0.2120 0.2318 -0.3997
## 44 0.6258 -0.5730 -0.3241 0.2121 0.0203
## 45 0.5526 0.0305 -0.0494 0.1656 0.1870
## 46 0.3774 -0.0976 0.2436 0.1087 0.1844
## 47 -0.0916 -0.1355 -0.1637 -0.0405 0.0084
## 48 -0.1545 0.1712 -0.2047 0.4642 0.3472
## 49 0.3518 0.2474 -0.8585 -0.4503 -0.2141
## 50 0.5007 0.3671 -0.9283 -0.2796 0.0229
## 51 0.2423 -0.3949 -0.3905 -0.2810 -0.0801
## 52 0.3740 0.0984 0.0328 0.3729 -0.1110
## 53 0.0966 -0.2714 -0.0628 0.2440 -0.3378
## 54 0.0953 0.7883 0.4033 -0.0928 -0.0472
## 55 0.1994 0.4257 -0.1511 0.0079 0.0316
## 56 0.3159 0.0747 -0.3476 0.3993 0.2460
## 57 0.9533 0.1781 0.1469 -0.3471 0.1926
## 58 0.1579 -0.2947 0.2725 -0.1541 -0.4173
## 59 0.4729 -0.1282 0.3526 -0.1230 -0.0791
## 60 0.2203 -0.2334 0.1178 0.1524 -0.4858
## 61 0.0482 0.0197 0.1633 0.0794 -0.3030
## 62 0.9071 -0.5940 2.1350 1.3514 0.5010
## 63 0.0440 -0.2782 -0.2768 0.2016 0.4925
## 64 0.3569 -0.3873 -0.1366 0.0314 0.1211
## 65 0.0366 0.0638 -0.2651 0.0404 -0.1488
## 66 -0.1877 0.2499 -0.1160 0.2899 -0.0621
## 67 0.1057 -0.0363 0.1519 -0.2145 0.1766
## 68 0.1617 0.2833 0.4500 0.0611 0.5120
## 69 0.1937 0.0296 -0.2318 -0.0685 0.1654
## 70 -0.3068 0.5979 0.4475 0.1270 0.3921
## 71 0.0172 0.7385 0.4860 -0.1761 0.0488
## 72 0.3772 0.0259 -0.6168 -0.2436 0.1111
## 73 0.3648 0.5181 0.0579 0.3241 0.0791
## 74 0.7039 -0.3042 -0.0897 0.4301 -0.3305
## 75 0.5168 -0.2456 -0.0267 0.2026 -0.1150
## 76 -0.0732 -0.1761 -0.3214 0.5049 0.1755
## 77 0.6118 -0.1173 -0.0483 0.3883 0.3685
## 78 0.2575 -0.0733 -0.2615 -0.2219 -0.0535
## 79 0.1516 -0.1636 0.0191 -0.1919 -0.2949
## 80 0.0701 0.3312 -0.2918 -0.0769 -0.0351
## 81 0.3498 -0.0168 -0.1987 -0.2178 -0.2787
## 82 0.4385 -0.0676 0.5027 -0.1549 -0.3415
## 83 -0.1197 -0.3087 -0.0914 0.0081 0.0418
## 84 -0.0237 0.0404 -0.0753 -0.0435 0.1568
## 85 0.9423 -0.6722 -0.0936 0.1654 0.1777
## 86 0.1698 0.0076 -0.0854 -0.2563 0.1306
## 87 0.0528 0.2375 0.5296 -0.3465 -0.0342
## 88 0.3535 0.2372 -0.0152 0.2493 -0.4706
## 89 0.3020 0.5853 -0.2874 0.3707 -0.1543
## 90 -0.1828 -0.1395 -0.0864 -0.0646 -0.0202
## 91 0.2825 0.6111 0.0389 0.1386 -0.1883
## 92 -0.1661 0.5746 -0.0884 -0.0571 0.4386
## 93 0.5961 -0.0645 0.0934 -0.1512 0.2685
## 94 0.8253 0.3476 0.1731 0.0618 -0.2316
## 95 0.1049 0.0564 -0.4232 -0.2338 -0.1330
## 96 0.0110 0.0607 -0.1227 0.1284 -0.3360
## 97 0.2013 0.2858 -0.0457 -0.0065 -0.0484
## 98 0.2358 -0.1552 -0.2668 -0.3040 0.0582
## 99 0.6652 -0.3362 -0.0846 0.4363 -0.9202
## 100 0.4810 0.8298 0.3367 0.4868 -1.1825
## 101 0.0822 0.0972 -0.0679 0.2649 0.4830
## 102 0.2918 0.0698 -0.3671 0.0769 -0.2876
## 103 -0.3736 -0.0556 -0.2573 0.1713 0.4213
## 104 0.6947 -0.2120 -0.0043 -0.5212 0.1106
## 105 0.0304 -0.1091 0.1399 0.0490 -0.2915
## 106 -0.1604 -0.0331 0.0617 -0.0644 -0.3826
## 107 0.4566 -0.5715 -0.0498 0.1211 -0.3834
## 108 0.3904 0.8060 0.0736 -0.0172 -0.2327
## 109 -0.0794 0.0112 -0.3442 0.0642 -0.1223
## 110 -0.0277 0.0016 0.1737 -0.1784 0.0437
## 111 -0.0373 0.1450 -0.2695 -0.0199 -0.3118
## 112 0.6635 0.4561 0.1513 -0.0774 0.1094
## 113 0.1405 -0.0234 -0.1589 -0.0310 0.4042
## 114 0.0151 0.3661 -0.0191 0.0020 -0.3100
## 115 -0.0345 0.0116 -0.2976 0.2319 0.5456
## 116 -0.0709 -0.0995 -0.2323 0.2957 -0.1990
## 117 0.0568 0.1689 -0.1033 0.1549 0.4332
## 118 -0.0736 0.0624 -0.0894 -0.2538 -0.0224
## 119 0.1132 0.1483 -0.2548 0.0971 0.3069
## 120 -0.0713 0.1242 -0.2314 -0.1328 -0.0120
## 121 0.1404 0.4490 0.5758 -0.1292 -0.1982
## 122 -0.1250 0.0728 0.0507 -0.1830 0.4365
## 123 0.3179 -0.7978 0.1321 0.0177 0.1200
## 124 -0.0267 0.6869 -0.3066 0.1100 0.1904
## 125 0.3066 -0.2489 -0.0774 0.1429 -0.4823
## 126 0.1078 0.1790 0.3247 -0.2501 -0.3343
## 127 0.2511 -0.0849 0.0815 -0.1027 -0.3609
## 128 0.4156 0.0914 -0.4835 -0.4059 0.3539
## 129 1.2279 -0.0174 1.5019 2.1380 0.4143
## 130 0.0082 -0.1282 0.1573 -0.3694 -0.1149
## 131 0.9073 -0.1362 -0.7389 0.0902 -0.1057
## 132 -0.2021 -0.0246 -0.5920 0.0372 0.4619
## 133 0.6988 -0.4237 -0.2587 -0.3658 -0.2949
## 134 0.1014 0.1659 0.5628 0.7207 -0.1354
## 135 0.2517 -0.0303 -0.4029 0.2317 -0.5038
## 136 -0.0091 0.3697 0.1764 0.0435 0.0202
## 137 0.0833 -0.0586 0.2449 0.1871 -0.2688
## 138 0.4123 0.0534 -0.0901 -0.1392 0.4304
## 139 -0.0106 0.1760 -0.2237 -0.0859 -0.2166
## 140 0.2156 -0.1679 -0.1173 0.1324 0.2699
## 141 0.0795 0.7007 0.2603 -0.2141 0.0907
## 142 0.5493 -0.4342 -0.2061 -0.1031 -0.1293
## 143 0.0531 -0.2827 0.1951 -0.0669 -0.3942
## 144 0.4189 0.1014 0.7561 0.1777 0.2854
## 145 0.2534 0.1297 -0.3538 0.1467 -0.1138
## 146 1.1710 -0.6673 0.2609 0.3429 0.1872
## 147 0.1440 0.5744 0.1824 0.2548 -0.1181
## 148 0.7053 0.3770 -0.5131 -0.3028 0.2786
## 149 0.2249 -0.8281 0.7891 -0.9017 0.3181
## 150 0.1161 -0.1598 -0.0672 -0.2945 0.0280
## 151 -0.2177 -0.2885 -0.0508 0.3079 0.5430
## 152 0.1687 -0.0077 0.0445 -0.1326 -0.4699
## 153 0.0265 -0.0792 -0.4036 0.0315 0.0925
## 154 1.1981 -0.0444 -0.2269 -0.1998 -0.1235
## 155 1.0998 0.2339 0.0897 0.2626 0.1386
## 156 0.0824 -0.1719 0.2062 0.2219 0.3213
## 157 0.1397 0.1146 0.1063 0.2031 -0.1345
## 158 0.1263 -0.1928 0.2142 -0.1255 0.2964
## 159 0.4199 0.3901 0.4295 -0.4672 0.3356
## 160 0.3750 0.3556 -0.2041 -0.3794 -0.0085
## 161 0.1790 -0.0833 -0.2551 -0.6224 -0.3264
## 162 0.0356 0.3495 -0.1701 0.0554 0.1324
## 163 0.0782 -0.2046 0.2863 -0.2273 0.0511
## 164 0.3299 -0.1578 -0.4370 0.0659 -0.2362
## 165 -0.2447 0.2387 -0.1380 0.3762 0.2347
## 166 -0.0692 0.2371 0.1684 -0.0644 -0.1539
## 167 -0.1763 -0.4038 -0.1819 -0.0063 0.2701
## 168 -0.2149 -0.3312 0.3042 0.2100 0.1556
## 169 0.3771 0.6256 0.4005 0.0086 0.3869
## 170 0.4456 0.1965 0.0933 -0.3180 -0.0774
## 171 0.4722 0.8438 0.4697 -0.3439 0.2359
## 172 0.5827 -0.1380 -0.4988 -0.5169 -0.0903
## 173 0.4000 0.5442 -0.0784 -0.2897 0.0245
## 174 -0.2929 -0.0357 -0.5233 0.2271 0.1849
## 175 0.0609 -0.4231 -0.0239 -0.1016 -0.0788
## 176 0.2088 -0.1988 0.1469 -0.6014 0.4944
## 177 0.5075 0.0594 -0.6211 0.4767 0.2133
## 178 0.2295 0.1505 0.3059 -0.0053 0.0888
## 179 0.2907 -0.3002 -0.0555 0.4731 -0.0426
## 180 0.1657 0.2581 -0.1690 0.1841 -0.1983
## 181 0.4972 0.6229 0.2867 0.0816 -0.1947
## 182 0.1081 0.1264 -0.9158 -0.2230 0.0763
## 183 -0.0884 -0.0556 -0.0735 -0.0588 -0.2204
## 184 -0.0522 -0.3429 -0.1315 0.1200 0.1353
## 185 0.6453 0.1058 0.3843 -0.1549 0.4376
## 186 -0.1947 -0.1950 -0.2066 0.4353 0.1696
## 187 -0.1043 -0.1393 -0.1345 0.1886 0.7082
## 188 -0.0049 0.1120 -0.5212 0.4170 0.2621
## 189 -0.2468 -0.2766 -0.1978 0.3566 0.0651
## 190 0.6728 0.1325 -0.0650 -0.0485 0.4450
## 191 0.2335 0.1460 -0.7693 0.2632 0.6226
## 192 0.5924 -0.1550 -0.3227 -0.0115 0.1420
## 193 -0.1131 0.1171 -0.0690 0.3339 -0.3698
## 194 0.9112 -1.0192 0.2332 0.1975 -0.1603
## 195 0.1137 0.1910 -0.1948 -0.3347 0.0224
## 196 0.4910 0.0054 0.3693 0.1293 -0.2570
## 197 0.8985 -0.2173 -0.2379 0.2814 -0.0761
## 198 0.3854 -0.6023 0.2084 -0.0420 -0.2147
## 199 0.5487 1.0821 -0.8993 -0.1236 -0.1112
## 200 0.2824 -0.1132 -0.4479 -0.1600 -0.3651
## 201 0.1317 -0.0732 0.1401 -0.2688 -0.4890
## 202 0.3448 -0.4459 -0.0769 -0.2537 -0.2360
## 203 0.2127 -0.3356 -0.0968 -0.4626 -0.1888
## 204 0.1858 -0.1371 0.2486 0.0743 -0.3153
## 205 0.6716 0.2283 0.1853 0.3844 0.0500
## 206 -0.1414 -0.0857 -0.0091 -0.2142 -0.0938
## 207 0.0655 -0.0235 -0.1410 -0.2000 0.1803
## 208 0.0679 0.2477 -0.1978 -0.2270 0.0988
## 209 0.0644 0.1354 0.4435 -0.2830 -0.1775
## 210 -0.2189 -0.2497 -0.0916 0.2072 0.1551
## 211 0.9777 0.1285 0.1576 0.1798 -0.1622
## 212 -0.0972 0.1345 -0.0640 0.3605 -0.5152
## 213 0.7748 0.1979 0.3576 0.1029 0.4574
## 214 0.5291 0.1789 -0.0036 0.1336 0.1885
## 215 -0.0994 0.0619 0.3528 -0.0294 -0.1394
## 216 0.0003 0.0381 -0.2214 0.5410 0.3958
## 217 -0.0866 0.0998 -0.2862 -0.1580 -0.0742
## 218 0.1209 0.1766 -0.3091 0.1226 -0.3517
## 219 -0.1019 -0.1027 -0.1612 0.3188 -0.1264
## 220 0.4625 0.0965 -0.1027 -0.1415 -0.4130
## 221 -0.0514 -0.0075 -0.3181 0.1811 -0.0185
## 222 0.0224 0.1017 0.1929 -0.2622 0.0156
## 223 -0.0810 -0.1150 0.0120 -0.3792 -0.0602
## 224 0.2381 0.6258 0.2088 -0.1725 -0.2018
## 225 -0.2304 -0.1982 -0.1416 -0.0427 0.2016
## 226 0.7342 1.0162 0.3708 -0.1039 0.1601
## 227 0.7590 0.2206 0.0143 0.3750 -0.1411
## 228 0.1157 -0.1172 -0.1340 -0.0497 -0.2187
## 229 -0.1075 -0.1027 0.0054 -0.1506 -0.0662
## 230 0.7723 -0.1057 -0.4025 -0.5037 0.5866
## 231 0.1584 0.2164 -0.1546 -0.3030 0.3370
## 232 0.1522 0.0884 0.2706 -0.1070 -0.3320
## 233 0.0009 0.0732 -0.0396 0.2859 -0.0907
## 234 -0.0483 0.1725 -0.1121 0.3944 -0.2235
## 235 -0.1933 0.0870 0.1385 0.5493 -0.2025
## 236 0.3533 -0.3270 -0.2138 0.2277 0.1940
## 237 0.1799 -0.3112 -0.1514 -0.1377 -0.3648
## 238 0.1770 0.3378 0.0277 -0.3497 0.0316
## 239 0.6111 -0.4369 0.4187 -0.0911 -0.4688
## 240 -0.0028 0.0026 -0.1018 0.3240 0.0351
## 241 0.2397 -0.0911 0.1837 -0.2320 0.0243
## 242 0.2116 -0.3110 0.0957 0.0367 0.1125
## 243 0.9207 0.3599 -0.1377 -0.1522 0.3074
## 244 -0.1723 0.4723 0.0808 0.0839 -0.1065
## 245 0.5166 0.6191 0.4714 -0.2113 0.0196
## 246 0.4967 -0.4111 -0.2533 0.1925 -0.5624
## 247 0.9312 -0.5833 0.1357 0.4414 -0.2614
## 248 -0.1121 -0.0334 0.1500 0.2537 -0.3206
## 249 0.4150 0.6988 0.2240 -0.3519 0.2215
## 250 1.5378 -0.5231 -0.0015 0.2325 -0.0229
## 251 -0.4223 -0.0269 -0.2153 0.0421 -0.1876
## 252 -0.6553 -0.0260 -0.1747 0.2172 -0.0626
## 253 -0.4575 -0.0897 -0.0961 0.0278 -0.1910
## 254 -0.4198 -0.0185 0.3225 -0.0038 -0.0856
## 255 -0.3841 -0.1773 0.0102 -0.2045 0.0974
## 256 -0.5787 0.1482 0.0349 0.1759 0.0284
## 257 -0.3678 0.1077 -0.1176 -0.1119 -0.2544
## 258 -0.4086 -0.0157 -0.0697 -0.0636 0.2365
## 259 -0.3466 0.0289 -0.0331 0.1052 -0.0271
## 260 -0.3645 -0.1470 0.2190 0.2243 0.0278
## 261 -0.3795 0.0577 -0.1209 0.1788 0.0112
## 262 -0.3171 -0.0786 0.1906 0.1105 -0.1121
## 263 -0.3167 0.0473 0.1324 0.0157 -0.0807
## 264 -0.3500 0.0581 0.0236 0.0764 -0.0490
## 265 -0.3131 -0.1455 0.1680 0.1874 -0.0702
## 266 -0.3858 0.1190 -0.1763 -0.0018 -0.1518
## 267 -0.2073 -0.0255 0.1210 0.0266 -0.0903
## 268 -0.4033 -0.0292 -0.0955 -0.0045 -0.1917
## 269 -0.4312 0.0774 0.0135 -0.0012 -0.1969
## 270 -0.4928 -0.1104 0.0472 0.2306 0.0185
## 271 -0.4046 -0.1922 -0.0461 0.2439 -0.0209
## 272 -0.3719 -0.1743 -0.1670 0.1521 -0.0977
## 273 -0.3328 -0.0961 0.2797 -0.0006 -0.0973
## 274 -0.1981 -0.0832 -0.0693 0.0570 -0.0980
## 275 -0.3602 -0.0956 -0.1106 0.2730 0.0535
## 276 -0.3011 -0.1327 0.2284 0.0946 -0.1426
## 277 -0.4089 -0.3141 0.1521 0.1243 0.2704
## 278 -0.4224 -0.2270 0.1579 -0.1280 0.1138
## 279 -0.4511 0.1140 -0.0167 -0.1412 0.2020
## 280 -0.5408 -0.1156 0.1608 0.1625 0.1686
## 281 -0.3942 0.0321 -0.0280 0.0899 -0.1234
## 282 -0.4497 0.1590 0.0860 -0.0239 -0.1749
## 283 -0.3193 -0.2261 0.0543 0.2171 0.1864
## 284 -0.3816 0.1894 -0.0565 -0.1823 -0.0146
## 285 -0.6178 -0.0163 0.1585 0.1672 0.0099
## 286 -0.3669 -0.0481 0.3115 -0.1596 -0.0143
## 287 -0.4244 0.1859 -0.0493 -0.2346 -0.1229
## 288 -0.3331 -0.1963 -0.0377 0.0044 0.0199
## 289 -0.4952 -0.0270 -0.1589 -0.2061 -0.1210
## 290 -0.4290 -0.0330 0.0303 0.0276 0.0577
## 291 -0.5267 0.1227 0.0064 -0.0831 -0.0009
## 292 -0.3734 -0.0711 0.1661 0.0590 -0.0707
## 293 -0.5071 0.0832 -0.1760 0.1802 -0.0217
## 294 -0.4462 -0.0329 0.1813 0.1596 -0.1741
## 295 -0.2696 0.0011 0.0779 -0.3853 0.0206
## 296 -0.5835 0.2185 -0.0353 -0.0090 0.0842
## 297 -0.4159 0.0105 0.1804 0.0559 0.1375
## 298 -0.4816 -0.0869 0.0253 -0.1920 0.0334
## 299 -0.3372 -0.1570 -0.1122 -0.0390 0.2017
## 300 -0.3991 0.1465 -0.1472 -0.4163 -0.0710
## 301 -0.4148 0.0485 -0.0540 -0.1072 0.1589
## 302 -0.5079 -0.1964 -0.0270 -0.1432 0.0557
## 303 -0.3949 -0.0751 -0.0865 0.1178 0.1132
## 304 -0.5111 0.0836 0.2097 -0.0723 0.0498
## 305 -0.2851 -0.2684 0.0731 0.1704 -0.0405
## 306 -0.3830 0.1219 -0.0509 0.1169 -0.0440
## 307 -0.4172 -0.0116 0.3194 0.0976 -0.0396
## 308 -0.3376 -0.0975 -0.1775 -0.1880 0.0863
## 309 -0.3645 -0.2146 0.0707 0.0839 -0.1013
## 310 -0.4307 0.0783 0.0860 -0.1717 0.1889
## 311 -0.3604 0.0405 -0.2048 -0.1441 0.1194
## 312 -0.4457 -0.0504 0.0673 -0.2326 0.0713
## 313 -0.3748 0.0585 0.2998 -0.2909 -0.0941
## 314 -0.4692 0.0952 0.0175 -0.0017 -0.2060
## 315 -0.4091 -0.1318 -0.1653 -0.0367 -0.0408
## 316 -0.3387 -0.2300 -0.0076 -0.0472 -0.0340
## 317 -0.3783 -0.1207 0.1734 0.2165 0.0261
## 318 -0.2516 -0.0492 -0.0134 -0.1237 -0.0494
## 319 -0.5290 -0.0176 0.2063 -0.1738 -0.0666
## 320 -0.2466 -0.1635 -0.1179 -0.0019 0.2421
## 321 -0.5015 0.1811 -0.0941 -0.2372 0.0996
## 322 -0.3257 0.0182 -0.1033 -0.2967 -0.0043
## 323 -0.3116 -0.1514 0.2450 -0.2908 0.0177
## 324 -0.4025 -0.1307 -0.0060 0.0654 -0.1701
## 325 -0.2763 -0.0529 -0.1511 0.1452 -0.1393
## 326 -0.4379 0.1359 0.2666 0.0384 -0.0683
## 327 -0.3711 -0.0294 -0.0502 -0.1293 0.1258
## 328 -0.5413 0.0237 -0.2305 -0.0539 0.2098
## 329 -0.4614 0.1246 -0.0154 0.1619 0.2378
## 330 -0.5197 -0.1825 -0.0950 -0.0284 0.1229
## 331 -0.1936 -0.0897 -0.0419 0.0923 -0.0543
## 332 -0.3294 -0.1905 0.0592 -0.1572 -0.1461
## 333 -0.5887 0.1356 -0.1325 0.0308 0.0982
## 334 -0.4912 -0.1176 0.2100 0.3269 0.0860
## 335 -0.3541 -0.2899 0.0935 0.2520 0.0045
## 336 -0.3623 0.1161 -0.0623 -0.2652 0.0927
## 337 -0.3828 -0.0669 -0.1639 0.0279 0.2516
## 338 -0.5615 -0.0099 0.1445 -0.0034 0.0803
## 339 -0.3745 -0.0466 0.2424 0.0096 -0.3267
## 340 -0.3455 -0.1950 -0.0488 0.1637 0.1469
## 341 -0.4792 -0.1305 0.0831 0.1046 0.1140
## 342 -0.3049 -0.1567 0.0145 -0.2024 -0.1988
## 343 -0.5025 0.0348 0.0956 -0.0787 0.2287
## 344 -0.5378 -0.1621 0.2072 -0.0230 0.1917
## 345 -0.3270 -0.0544 -0.1184 -0.3032 -0.0309
## 346 -0.5323 -0.1053 0.0007 0.0724 0.3573
## 347 -0.4989 -0.1019 0.2410 0.0517 0.2710
## 348 -0.5706 -0.0877 0.0777 -0.1238 0.1343
## 349 -0.4639 0.1315 0.0627 0.2198 0.0190
## 350 -0.6567 0.0627 -0.0791 0.0425 0.0716
## 351 -0.3962 -0.0549 -0.1012 -0.2352 -0.1992
## 352 -0.4045 -0.0384 -0.2740 0.1861 -0.0594
## 353 -0.4043 -0.1061 0.0250 -0.0251 0.1882
## 354 -0.4138 -0.3042 -0.0188 -0.0808 0.1839
## 355 -0.4760 -0.2414 -0.0611 -0.0104 0.1327
## 356 -0.4966 -0.2213 -0.0767 0.1007 0.2555
## 357 -0.6259 -0.0295 0.0897 0.0733 0.0936
## 358 -0.3209 0.0167 -0.1501 -0.2613 -0.1770
## 359 -0.3277 -0.1398 0.1624 -0.0783 0.1953
## 360 -0.4729 -0.0840 0.1152 -0.4371 -0.0919
## 361 -0.4742 -0.1793 0.1999 -0.0037 0.2441
## 362 -0.4998 -0.0732 0.1316 0.3198 0.0455
## 363 -0.4203 -0.0890 -0.2279 0.1124 -0.1184
## 364 -0.4374 -0.0796 0.2502 -0.1020 -0.2401
## 365 -0.2917 0.0103 0.0275 -0.2158 -0.2933
## 366 -0.4194 0.1213 -0.2538 0.2760 0.0587
## 367 -0.4475 0.0711 -0.0927 -0.1025 -0.2235
## 368 -0.5589 0.0114 0.2624 -0.2761 0.0373
## 369 -0.5478 0.0103 -0.0801 0.1883 -0.0260
## 370 -0.4270 0.1112 0.0177 -0.2440 -0.1050
## 371 -0.6336 -0.2394 0.0331 0.0205 0.1674
## 372 -0.4818 0.1792 -0.0839 -0.3556 -0.0320
## 373 -0.3465 -0.0333 -0.0154 -0.2052 -0.1162
## 374 -0.4627 0.1266 0.2077 -0.2214 -0.3315
## 375 -0.3964 0.1578 -0.0182 -0.1409 -0.2250
## 376 -0.5880 0.0371 -0.0181 -0.1627 -0.1150
## 377 -0.3393 -0.0903 0.0710 -0.1769 -0.1355
## 378 -0.3741 -0.0560 0.1273 -0.3249 -0.0251
## 379 -0.4930 -0.0897 0.2747 -0.0407 -0.1675
## 380 -0.4222 -0.0650 0.1315 -0.2616 0.1103
## 381 -0.4500 -0.1154 0.2677 -0.2214 -0.1524
## 382 -0.6265 0.0002 0.2875 0.0837 -0.0505
## 383 -0.3192 -0.1028 0.2834 -0.1856 -0.2983
## 384 -0.3532 0.0817 -0.2542 -0.2741 -0.0001
## 385 -0.4740 0.1046 0.0563 -0.1806 -0.0910
## 386 -0.5465 -0.0605 0.2915 0.0019 0.0775
## 387 -0.5707 -0.1793 0.0755 0.3244 0.0848
## 388 -0.5046 0.2256 0.0004 0.0321 -0.0588
## 389 -0.5729 0.0031 0.1739 0.1026 -0.0418
## 390 -0.4620 -0.1213 0.1924 0.0219 -0.1297
## 391 -0.5533 -0.0568 0.1980 0.1924 -0.0071
## 392 -0.3457 0.0573 0.1470 0.0044 -0.1330
## 393 -0.4735 -0.1694 0.0801 -0.1362 0.1733
## 394 -0.2381 0.0317 -0.0783 -0.0006 -0.1226
## 395 -0.3640 0.0452 0.1061 0.0322 -0.2185
## 396 -0.6677 0.0164 0.1146 0.0009 0.0558
## 397 -0.4833 -0.0855 0.0235 0.3562 0.1443
## 398 -0.5363 -0.1754 0.0907 0.1149 0.3411
## 399 -0.5520 0.0857 0.2465 -0.0269 -0.1075
##
##
## Stress per point (in %):
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 0.15 0.20 0.34 0.45 0.19 0.43 0.90 0.38 0.25 0.57 0.39 0.50 0.26 0.26 0.21 0.25
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 0.09 0.16 0.44 0.46 0.18 1.04 0.43 0.31 0.33 0.30 0.12 0.27 0.37 0.56 0.31 0.45
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 0.15 0.29 0.56 0.31 0.27 0.30 0.46 0.27 0.14 0.25 0.33 0.32 0.21 0.30 0.25 0.17
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 0.28 0.36 0.32 0.26 0.32 0.41 0.28 0.24 0.58 0.37 0.27 0.27 0.27 1.26 0.32 0.29
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 0.27 0.22 0.28 0.59 0.30 0.33 0.26 0.29 0.33 0.31 0.16 0.30 0.31 0.32 0.20 0.27
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 0.18 0.68 0.20 0.15 0.28 0.33 0.54 0.46 0.38 0.09 0.25 0.31 0.37 0.20 0.18 0.32
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 0.24 0.19 0.21 0.34 0.22 0.37 0.31 0.62 0.22 0.39 0.33 0.37 0.14 0.20 0.12 0.18
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 0.27 0.26 0.18 0.48 0.39 0.24 0.31 0.24 0.29 0.32 0.50 0.38 0.18 0.28 0.24 0.50
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 1.12 0.28 0.39 0.27 0.55 0.93 0.48 0.27 0.35 0.28 0.30 0.21 0.40 0.44 0.33 0.74
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 0.18 0.33 0.25 0.27 0.94 0.37 0.24 0.20 0.35 0.41 0.23 0.34 0.28 0.54 0.32 0.52
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 0.40 0.38 0.34 0.37 0.38 0.24 0.24 0.28 0.28 0.52 0.29 0.48 0.40 0.21 0.29 0.60
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 0.23 0.28 0.47 0.38 0.36 0.62 0.11 0.26 0.29 0.19 0.19 0.21 0.17 0.26 0.27 0.16
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 0.15 0.26 0.44 0.41 0.21 0.30 0.65 0.23 0.25 0.33 0.35 0.30 0.36 0.12 0.36 0.32
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 0.33 0.14 0.25 0.17 0.48 0.18 0.30 0.15 0.27 0.19 0.16 0.26 0.18 0.17 0.13 0.23
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 0.12 0.27 0.27 0.17 0.06 0.40 0.26 0.31 0.19 0.22 0.36 0.35 0.30 0.28 0.38 0.09
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 0.32 0.43 0.24 0.33 0.38 0.28 0.29 0.38 0.33 0.14 0.29 0.16 0.13 0.15 0.11 0.17
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 0.16 0.10 0.05 0.12 0.11 0.15 0.10 0.09 0.18 0.11 0.09 0.14 0.16 0.09 0.13 0.15
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 0.10 0.04 0.10 0.19 0.11 0.10 0.15 0.07 0.11 0.20 0.12 0.13 0.06 0.21 0.15 0.11
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 0.22 0.08 0.10 0.10 0.18 0.16 0.18 0.12 0.10 0.08 0.10 0.17 0.06 0.14 0.11 0.09
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 0.14 0.14 0.12 0.18 0.09 0.13 0.09 0.12 0.19 0.16 0.18 0.13 0.10 0.06 0.10 0.14
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 0.17 0.09 0.19 0.14 0.15 0.20 0.07 0.22 0.19 0.08 0.06 0.14 0.19 0.13 0.16 0.10
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 0.11 0.12 0.28 0.09 0.09 0.17 0.10 0.06 0.15 0.18 0.10 0.11 0.16 0.14 0.14 0.22
## 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368
## 0.06 0.25 0.13 0.17 0.06 0.14 0.12 0.28 0.11 0.14 0.17 0.13 0.13 0.28 0.12 0.09
## 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384
## 0.15 0.13 0.12 0.14 0.10 0.22 0.13 0.09 0.15 0.10 0.10 0.16 0.17 0.10 0.14 0.15
## 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399
## 0.12 0.12 0.12 0.12 0.08 0.11 0.10 0.08 0.13 0.08 0.15 0.07 0.10 0.12 0.07
X_mds <- mds$conf
dim(X_mds)
## [1] 399 5
The 5-dimensional MDS solution resulted in a stress value of 0.09, which indicates a good fit. This means that the original distance relationships between patients are well preserved in the reduced space.
Most observations show very low stress contributions, suggesting they are well represented by the 5D configuration. Only a few patients have higher stress values (like Patient 62 and 129), indicating possible outliers with more unusual clinical profiles.
Overall, this confirms that the 5-dimensional MDS representation is suitable for further clustering analysis.
km_mds <- factoextra::eclust(X_mds, "kmeans", k = 2, graph = FALSE)
fviz_cluster(list(data = X_mds, cluster = km_mds$cluster), geom="point", stand=FALSE)
fviz_silhouette(km_mds)
## cluster size ave.sil.width
## 1 1 190 0.04
## 2 2 209 0.49
table(km_mds$cluster,y)
## y
## ckd notckd
## 1 190 0
## 2 60 149
For K-means, MDS does not meaningfully improve clustering beyond what is already achieved in the original feature space; only four more data points end up in the correct group. This confirms that MDS is effective at preserving structure, but not necessarily at enhancing cluster separability for a variance-driven algorithm like K-means.
pam_mds <- factoextra::eclust(X_mds, "pam", k = 2, graph = FALSE)
fviz_cluster(list(data = X_mds, cluster = pam_mds$cluster), geom = "point", stand = FALSE)
fviz_silhouette(pam_mds)
## cluster size ave.sil.width
## 1 1 195 0.49
## 2 2 204 0.04
table(pam_mds$cluster,y)
## y
## ckd notckd
## 1 46 149
## 2 204 0
MDS provides a modest improvement for PAM by making the dominant cluster more compact and slightly reducing class mixing. However, the gain is limited, and the overall clustering behavior remains similar to that observed in the original feature space.
hc_mds <- hclust(dist(X_mds), method = "ward.D2")
hc_mds_cut <- cutree(hc_mds, k = 2)
fviz_cluster(
list(
data = X_mds[, 1:min(2, ncol(X_mds)), drop = FALSE],
cluster = hc_mds_cut
),
main = paste0("Hierarchical clustering on MDS"),
geom = "point",
stand = FALSE
)
sil_hc_mds <- silhouette(hc_mds_cut, dist(X_mds))
fviz_silhouette(sil_hc_mds)
## cluster size ave.sil.width
## 1 1 152 0.56
## 2 2 247 0.00
table(hc_mds_cut, y)
## y
## hc_mds_cut ckd notckd
## 1 6 146
## 2 244 3
MDS provides a clear improvement for hierarchical clustering by reducing class mixing and producing clusters that align more closely with the underlying CKD and non-CKD groups. While one cluster remains geometrically spread out, the separation in terms of true class membership becomes noticeably cleaner compared to the original feature space.
Among all methods, hierarchical clustering benefits the most from MDS. Overall, MDS does not radically change the cluster structure, but it does meaningfully reduce misclassification and slightly improves cluster compactness, particularly for weaker clusters, without consistently strengthening separation across all methods.
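Using the same adjusted Rand index as before, the class agreement of every partition in the original and MDS spaces can be summarised in one place; a sketch (the object name ari is illustrative):
# Adjusted Rand index of each partition against the class label
ari <- c(
  kmeans_X   = flexclust::randIndex(table(km2_e$cluster, y)),
  pam_X      = flexclust::randIndex(table(pam2_e$cluster, y)),
  hclust_X   = flexclust::randIndex(table(hc_cut$cluster, y)),
  kmeans_MDS = flexclust::randIndex(table(km_mds$cluster, y)),
  pam_MDS    = flexclust::randIndex(table(pam_mds$cluster, y)),
  hclust_MDS = flexclust::randIndex(table(hc_mds_cut, y))
)
round(sort(ari, decreasing = TRUE), 3)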
To ensure that this configuration was not due to random structure, I performed a Random Stress Test.
cat("Empirical stress:", mds$stress, "\n")
## Empirical stress: 0.09198974
stressvec <- smacof::randomstress(n = nrow(X), ndim = 5, nrep = 50)
cat("Mean random stress:", mean(stressvec), "\n")
## Mean random stress: 0.4999112
cat("Stress ratio (empirical / random):", mds$stress / mean(stressvec), "\n")
## Stress ratio (empirical / random): 0.1840122
plot(mds, main = "SMACOF MDS (point size shows stress per point)")
plot(mds$conf, pch = 19, col = y, main = "SMACOF MDS configuration colored by class")
The empirical stress (0.09) is substantially lower than the Mean Random Stress (0.499), resulting in a Stress Ratio of 0.184. This confirms that the 5D MDS representation captures significantly more structure than a random configuration.
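smacof also offers a permutation test that compares the empirical stress with stress values obtained from permuted dissimilarity matrices; the call below is a sketch (nrep is illustrative and the test can be slow to run).
# Permutation test on the dissimilarities underlying the 5-dimensional solution
perm <- smacof::permtest(mds, nrep = 100, verbose = FALSE)
perm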
I also performed a Mantel Test to validate the transformation of the correlation matrix into a distance (dissimilarity) matrix.
sim_num <- cor(X_num)
dis_from_sim <- smacof::sim2diss(sim_num, method = 1, to.dist = TRUE)
mantel.test(as.matrix(sim_num), as.matrix(dis_from_sim))
## $z.stat
## [1] -7.952564
##
## $p
## [1] 0.001
##
## $alternative
## [1] "two.sided"
The negative value here is expected and it indicates a strong inverse relationship between the two matrices. This makes sense because as the correlation between two variables increases, the distance between them must decrease. Also, since the p-value is well below the standard threshold of 0.05, we reject the null hypothesis that the two matrices are unrelated.
So we can say that there is a statistically significant relationship between our original correlations and the calculated distances.
While MDS focuses on preserving pairwise distances, PCA provides a variance-based view of the data. I therefore applied PCA to compare how variance-driven dimensionality reduction affects clustering performance.
I plotted a correlation matrix for the scaled numeric variables to check for any strong redundancy between them. This helps me understand how much of the variance might be captured by the first few principal components (PCs).
num_cor <- cor(as.matrix(X_num), method = "pearson")
corrplot::corrplot(num_cor, order = "alphabet", tl.cex = 0.75)
The correlation plot shows that certain variables, like hemo and pcv, are highly correlated, while others like pot and sod have weaker correlations. This suggests that PCA will likely capture significant variance with just a few components, as the data shows some degree of redundancy between the variables.
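To make that redundancy explicit, the strongly correlated pairs can be listed directly from num_cor; a small sketch (the 0.7 cutoff is illustrative):
# Variable pairs in the numeric block with |r| > 0.7
high_cor <- which(abs(num_cor) > 0.7 & upper.tri(num_cor), arr.ind = TRUE)
data.frame(
  var1 = rownames(num_cor)[high_cor[, 1]],
  var2 = colnames(num_cor)[high_cor[, 2]],
  r    = round(num_cor[high_cor], 2)
)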
I also ran a PCA on the scaled feature matrix X to reduce the dimensions and identify the primary drivers of variance in the data.
pca1 <- prcomp(X, center = FALSE, scale. = FALSE)  # numeric block was already centred/scaled above; binary columns stay 0/1
pca2 <- princomp(X, cor = FALSE)                   # alternative fit via princomp, kept for comparison
fviz_eig(pca1, addlabels = TRUE, main = "Scree plot: variance explained (PCA on X)")
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
eig <- factoextra::get_eigenvalue(pca1)
head(eig)
From the outputs, I observed that Dim.1 and Dim.2 account for 31.1% and 10.9% of the variance, respectively. Together, these first two components capture a cumulative total of only 42% of the information.
cum_var <- cumsum(eig$variance.percent)
cum_var
## [1] 31.11465 42.01136 50.26028 57.32449 63.67347 69.46141 74.69301
## [8] 78.78563 82.63228 85.58116 88.35236 90.74606 92.82611 94.34145
## [15] 95.31320 96.17149 96.95888 97.58304 98.14983 98.66543 99.13059
## [22] 99.48949 99.75278 100.00000
plot(cum_var, type = "l", xlab = "PC", ylab = "Cumulative variance (%)",
main = "Cumulative variance")
By plotting the cumulative variance, I observed that the first seven principal components capture 74.7% of the data’s total variance, with the curve flattening noticeably after about the 12th PC. This confirms that the leading seven dimensions carry the primary “signal” of the dataset, justifying their use as a low-dimensional representation that effectively simplifies the original 24 variables.
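If a variance-based reduced representation is wanted for later steps, the scores on these leading components can be extracted directly; a minimal sketch (the name X_pca7 is illustrative):
# Scores on the first seven principal components as a reduced feature matrix
X_pca7 <- pca1$x[, 1:7]
dim(X_pca7)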
I analyzed the variable map and contribution plots to see which clinical markers were actually driving the patterns in the data.
fviz_pca_var(pca1, repel = TRUE) + ggtitle("PCA variable map")
a1 <- fviz_contrib(pca1, choice = "var", axes = 1, top = 15) + ggtitle("Top contributions to PC1")
a2 <- fviz_contrib(pca1, choice = "var", axes = 2, top = 15) + ggtitle("Top contributions to PC2")
gridExtra::grid.arrange(a1, a2, ncol = 2)
The PCA results show that different clinical variables drive each principal component rather than a single dominant factor. PC1 is mainly influenced by blood-related markers such as packed cell volume, hemoglobin, and red blood cell count, together with renal indicators like blood urea and albumin, capturing overall blood health and kidney function. PC2 is primarily shaped by metabolic and electrolyte-related variables, including sugar, blood glucose, specific gravity, and sodium.
Together, these components indicate that patient separation in the PCA space is driven by a combination of blood health, kidney function, and metabolic regulation. By concentrating this clinically meaningful variation into a small number of dimensions, PCA provides a representation that aligns well with CKD-related structure in the data.
I generated two maps to examine how individual patients are positioned within the PCA space.
fviz_pca_ind(pca1, col.ind = y, repel = TRUE) + ggtitle("Individuals on PCA map (colored by class)")
fviz_pca_ind(pca1, col.ind = "cos2", gradient.cols = c("white","#2E9FDF","#FC4E07")) +
ggtitle("Individuals colored by cos2 quality")
The first plot shows a clear separation trend between ckd and notckd patients along the first principal component. The non-CKD group forms a tight and compact cluster on the left, while CKD patients are more spread out across the PCA space, reflecting greater clinical variability in disease severity.
The second plot uses cos2 values to indicate how well each patient is represented by the two-dimensional PCA map. Most points at the extremes of the plot, especially those far from the origin, show high cos2 values, meaning they are well captured by the first two components. Points closer to the center tend to have lower cos2 values, indicating more average clinical profiles that are not fully explained by the first two dimensions.
Overall, these plots confirm that the PCA projection captures the main structure of the data well, particularly for the more distinct and extreme patient profiles, while some overlap near the center is expected in a reduced two-dimensional view.
I generated a PCA biplot to visualize how individual patients and clinical variables interact within the same space.
fviz_pca_biplot(
  pca1,
  geom.ind = "point",
  col.ind = ifelse(y == "ckd", "blue", "red"),
  col.var = "black",
  repel = TRUE,
  title = "PCA Biplot"
)
The PCA biplot shows a clear separation pattern driven by clinically meaningful variables. Non-CKD patients cluster tightly on the left side of the plot, aligning with higher Hemoglobin (hemo), Packed Cell Volume (pcv), and Red Blood Cell Count (rbcc), which are markers of healthier blood profiles.
In contrast, CKD patients are more dispersed and extend toward the right side of the map, in the direction of renal and metabolic indicators such as Blood Urea (bu), Serum Creatinine (sc), Albumin (al), Blood Glucose (bgr), and Sugar (su). The strong alignment of bu and sc indicates their close relationship and central role in distinguishing diseased patients.
Overall, the biplot links the geometric separation of patients directly to underlying clinical signals, showing that variation in kidney function and blood health is the main driver of structure in the data.
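As a numeric counterpart to the biplot arrows, the variable-to-component correlations can be listed directly (a sketch via factoextra::get_pca_var(); var_info is an illustrative name, and sorting by the PC1 correlation is only for readability):
# Correlations between each clinical variable and the first two PCs;
# variables on the healthy side of the map and on the CKD side should
# show opposite-signed PC1 correlations.
var_info <- factoextra::get_pca_var(pca1)
round(var_info$cor[order(var_info$cor[, 1]), 1:2], 2)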
I applied a Varimax rotation to the first three principal components to create a cleaner, more interpretable structure by grouping variables onto specific axes.
pca_rot <- psych::principal(X, nfactors = 3, rotate = "varimax")
pca_rot
## Principal Components Analysis
## Call: psych::principal(r = X, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC3 RC2 h2 u2 com
## age 0.18 -0.04 0.54 0.327 0.67 1.2
## bp 0.26 0.09 0.25 0.142 0.86 2.2
## bgr -0.02 0.19 0.78 0.639 0.36 1.1
## bu 0.74 0.21 0.07 0.596 0.40 1.2
## sc 0.73 -0.08 0.04 0.543 0.46 1.0
## sod -0.59 -0.02 -0.09 0.358 0.64 1.0
## pot 0.24 -0.03 0.06 0.063 0.94 1.2
## hemo -0.69 -0.44 -0.23 0.727 0.27 1.9
## pcv -0.68 -0.45 -0.25 0.721 0.28 2.0
## wbcc -0.09 0.31 0.21 0.150 0.85 2.0
## rbcc -0.61 -0.38 -0.24 0.582 0.42 2.0
## sg -0.22 -0.54 -0.31 0.434 0.57 2.0
## al 0.21 0.76 0.19 0.654 0.35 1.3
## su -0.08 0.17 0.75 0.604 0.40 1.1
## rbc 0.16 0.49 -0.02 0.263 0.74 1.2
## pc 0.21 0.75 0.07 0.604 0.40 1.2
## pcc 0.02 0.62 0.14 0.404 0.60 1.1
## ba -0.02 0.59 -0.01 0.344 0.66 1.0
## htn 0.53 0.19 0.53 0.609 0.39 2.3
## dm 0.34 0.13 0.72 0.649 0.35 1.5
## cad 0.25 0.12 0.38 0.218 0.78 2.0
## appet 0.35 0.35 0.17 0.274 0.73 2.4
## pe 0.36 0.40 0.10 0.299 0.70 2.1
## ane 0.59 0.21 -0.02 0.395 0.61 1.2
##
## RC1 RC3 RC2
## SS loadings 4.16 3.54 2.90
## Proportion Var 0.17 0.15 0.12
## Cumulative Var 0.17 0.32 0.44
## Proportion Explained 0.39 0.33 0.27
## Cumulative Proportion 0.39 0.73 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 1011.61 with prob < 4e-106
##
## Fit based upon off diagonal values = 0.94
summary(pca_rot)
##
## Factor analysis with Call: psych::principal(r = X, nfactors = 3, rotate = "varimax")
##
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 207 and the objective function was 2.47
## The number of observations was 399 with Chi Square = 956.69 with prob < 1.1e-96
##
## The root mean square of the residuals (RMSA) is 0.07
print(loadings(pca_rot), digits = 3, cutoff = 0.35, sort = TRUE)
##
## Loadings:
## RC1 RC3 RC2
## bu 0.740
## sc 0.731
## sod -0.592
## hemo -0.693 -0.441
## pcv -0.677 -0.445
## rbcc -0.614 -0.383
## htn 0.535 0.534
## ane 0.593
## sg -0.540
## al 0.756
## pc 0.746
## pcc 0.620
## ba 0.586
## age 0.540
## bgr 0.776
## su 0.755
## dm 0.717
## bp
## pot
## wbcc
## rbc 0.486
## cad 0.377
## appet 0.354
## pe 0.362 0.397
##
## RC1 RC3 RC2
## SS loadings 4.162 3.535 2.902
## Proportion Var 0.173 0.147 0.121
## Cumulative Var 0.173 0.321 0.442
After Varimax rotation, the three components show clear and clinically interpretable structure. RC1 is primarily associated with kidney dysfunction and anemia, with strong positive loadings for Blood Urea (bu), Serum Creatinine (sc), Hypertension (htn), and Anemia (ane), and strong negative loadings for Hemoglobin (hemo), Packed Cell Volume (pcv), and Red Blood Cell Count (rbcc). This component captures the contrast between impaired renal function and healthy blood profiles.
RC2 represents a metabolic and diabetic dimension, dominated by Blood Glucose Random (bgr), Sugar (su), Diabetes Mellitus (dm), and Age. This axis reflects metabolic stress and glucose-related abnormalities commonly associated with CKD progression.
RC3 is driven mainly by infection and urine abnormality indicators, with high loadings for Albumin (al), Pus Cell (pc), Pus Cell Clumps (pcc), Bacteria (ba), and Specific Gravity (sg). This component reflects inflammatory and urinary pathology.
Overall, the top rotated loadings are driven by variables that are clinically relevant to kidney function, anemia, and metabolic regulation, which fits expectations for CKD-related separation.
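The rotated component scores are also available in case one wanted to repeat the clustering on these three interpretable axes rather than on the unrotated PCs. The following is only a sketch (psych::principal() stores per-patient scores in $scores by default; scores_rot and km_rot are illustrative names, and no results from this variant are reported here):
# Per-patient scores on the varimax-rotated components
# (RC1 = renal/anemia, RC2 = metabolic/diabetic, RC3 = urinary/infection).
scores_rot <- pca_rot$scores
head(round(scores_rot, 2))
# Optional: a quick K-means on the rotated scores, for comparison only.
set.seed(123)
km_rot <- kmeans(scores_rot, centers = 2, nstart = 50)
table(km_rot$cluster, y)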
I used the 70% variance threshold to determine the optimal number of dimensions for the rest of the analysis.
k_pc <- which(cum_var >= 70)[1]
k_pc
## [1] 7
X_pca_k <- pca1$x[, 1:k_pc, drop = FALSE]
From the output, I identified that the first 7 principal components are sufficient to retain the majority of the dataset’s clinical information. I then subset the PCA scores into a new matrix, X_pca_k, keeping only these 7 dimensions and discarding the remaining 17.
km_pca <- eclust(X_pca_k, "kmeans", k = 2, graph = FALSE)
fviz_cluster(km_pca, main = paste0("K-means on PCA ", k_pc, " PCs"), geom = "point", stand = FALSE)
fviz_silhouette(km_pca)
## cluster size ave.sil.width
## 1 1 187 0.09
## 2 2 212 0.50
table(km_pca$cluster, y)
## y
## ckd notckd
## 1 187 0
## 2 63 149
Compared to the original feature space, PCA leads to a modest improvement in the silhouette score and clustering quality for K-means by making the weaker cluster slightly more compact. However, the overall class alignment remains largely unchanged. This indicates that PCA mainly reduces noise and improves geometric cohesion, rather than uncovering a fundamentally different or clearer separation between CKD and non-CKD patients.
pam_pca <- eclust(X_pca_k, "pam", k = 2, graph = FALSE)
fviz_cluster(pam_pca, main = paste0("PAM on PCA (first ", k_pc, " PCs)"), geom = "point", stand = FALSE)
fviz_silhouette(pam_pca)
## cluster size ave.sil.width
## 1 1 176 0.54
## 2 2 223 0.08
table(pam_pca$cluster,y)
## y
## ckd notckd
## 1 27 149
## 2 223 0
For PAM, PCA provides a more noticeable improvement than for K-means. After dimensionality reduction, the main cluster becomes substantially more compact, and the weaker cluster shows a clear gain in separation. This is reflected in the increase in average silhouette width and the cleaner silhouette shape. While some overlap still remains, PCA reduces within-cluster dispersion and improves robustness, without distorting the underlying class structure.
dist_pca <- dist(X_pca_k)
hc_pca <- hclust(dist_pca, method = "ward.D2")
hc_pca_cut <- cutree(hc_pca, k = 2)
fviz_cluster(
list(
data = X_pca_k[, 1:min(2, ncol(X_pca_k)), drop = FALSE],
cluster = hc_pca_cut
),
main = paste0("Hierarchical clustering on PCA (first ", k_pc, " PCs)"),
geom = "point",
stand = FALSE
)
sil_hc_pca <- silhouette(hc_pca_cut, dist_pca)
fviz_silhouette(sil_hc_pca)
## cluster size ave.sil.width
## 1 1 151 0.58
## 2 2 248 0.05
table(hc_pca_cut,y)
## y
## hc_pca_cut ckd notckd
## 1 6 145
## 2 244 4
Applying PCA improves hierarchical clustering compared to the original feature space. In the original data, one cluster was compact while the other showed weak separation and slight dispersion. After PCA, the main cluster becomes more compact and the weaker cluster improves from negative to positive silhouette values. Overall, PCA reduces noise and leads to a more stable and better-defined hierarchical clustering structure, even though some overlap remains.
Among all methods, hierarchical clustering benefits the most from PCA, showing the largest gain in cluster compactness and overall silhouette quality.
par(mfrow = c(1, 2))
plot(pca1$x[, 1], pca1$x[, 2], col = y, pch = 19, xlab = "PC1", ylab = "PC2", main = "PCA: First Two Principal Components")
legend("topright", legend = levels(y), col = 1:length(levels(y)), pch = 19, cex = 0.8)
plot( mds_euc[, 1], mds_euc[, 2], col = y, pch = 19, xlab = "D1", ylab = "D2", main = "MDS: First Two Dimensions (Euclidean)")
legend("topright", legend = levels(y), col = 1:length(levels(y)), pch = 19, cex = 0.8)
par(mfrow = c(1, 1))
The PCA and MDS projections in two dimensions look almost identical, with both methods showing the same tight notckd cluster and a more dispersed CKD group. Since the visual separation is almost the same, I compared silhouette widths and the alignment between true class labels and cluster assignments to evaluate the underlying clustering quality.
sil_plot <- function(sil_obj, title_prefix) {
avg_sil <- round(mean(sil_obj[, "sil_width"]), 3)
fviz_silhouette(sil_obj) +
ggtitle(paste0(title_prefix,
"\nAvg silhouette width = ", avg_sil))
}
p_km_mds <- sil_plot(silhouette(km_mds$cluster, dist(X_mds)), "K-means (MDS, 5 Dimensions)")
## cluster size ave.sil.width
## 1 1 190 0.04
## 2 2 209 0.49
p_km_pca <- sil_plot(silhouette(km_pca$cluster, dist(X_pca_k)),"K-means (PCA, First 7 PCs)")
## cluster size ave.sil.width
## 1 1 187 0.09
## 2 2 212 0.50
table(km_mds$cluster,y)
## y
## ckd notckd
## 1 190 0
## 2 60 149
table(km_pca$cluster,y)
## y
## ckd notckd
## 1 187 0
## 2 63 149
gridExtra::grid.arrange(p_km_mds, p_km_pca, nrow = 2)
Both PCA and MDS produce almost identical clustering results. Each method separates CKD and notckd patients in nearly the same way, with only a few observations differing between clusters. The slightly higher silhouette width for PCA reflects marginally tighter clusters, but there is no meaningful improvement in class alignment compared to MDS.
p_pam_mds <- sil_plot(silhouette(pam_mds$cluster, dist(X_mds)), "PAM (MDS, 5 Dimensions)")
## cluster size ave.sil.width
## 1 1 195 0.49
## 2 2 204 0.04
p_pam_pca <- sil_plot(silhouette(pam_pca$cluster, dist(X_pca_k)), "PAM (PCA, First 7 PCs)")
## cluster size ave.sil.width
## 1 1 176 0.54
## 2 2 223 0.08
table(pam_mds$cluster,y)
## y
## ckd notckd
## 1 46 149
## 2 204 0
table(pam_pca$cluster,y)
## y
## ckd notckd
## 1 27 149
## 2 223 0
gridExtra::grid.arrange(p_pam_mds, p_pam_pca, nrow = 2)
For PAM, both MDS and PCA produce the same high-level clustering structure, with one cluster dominated by notckd patients and the other by ckd patients. However, PCA improves class alignment by reducing the number of CKD patients mixed into the notckd-dominated cluster. This is reflected in cleaner class alignment and a slightly higher average silhouette width, indicating more compact and coherent clusters under PCA.
p_hc_mds <- sil_plot(sil_hc_mds,"Hierarchical (MDS, 5 Dimensions)")
## cluster size ave.sil.width
## 1 1 152 0.56
## 2 2 247 0.00
p_hc_pca <- sil_plot(sil_hc_pca, "Hierarchical (PCA, First 7 PCs)")
## cluster size ave.sil.width
## 1 1 151 0.58
## 2 2 248 0.05
table(hc_mds_cut,y)
## y
## hc_mds_cut ckd notckd
## 1 6 146
## 2 244 3
table(hc_pca_cut,y)
## y
## hc_pca_cut ckd notckd
## 1 6 145
## 2 244 4
gridExtra::grid.arrange(p_hc_mds, p_hc_pca, nrow = 2)
For hierarchical clustering, MDS and PCA produce almost identical class alignment, with only a negligible difference in misclassified cases. In practice, both methods recover the same underlying structure. However, PCA consistently achieves a higher average silhouette width, indicating slightly more compact and better-defined clusters overall.
Overall, PCA performs slightly better than MDS for this dataset. Although both methods produce very similar visual structures and almost identical class alignment across clustering algorithms, PCA consistently yields higher average silhouette scores, indicating more compact and better-defined clusters.
MDS, even when using five dimensions, does not provide a meaningful advantage in terms of class alignment or clustering quality. In most cases it preserves the structure already present in the data rather than improving it; the only notable difference between the two reductions, the slightly cleaner PAM class alignment under PCA, is small and does not change the overall interpretation.
PCA therefore appears to be the more effective dimensionality reduction method here, as it achieves comparable or better class alignment while consistently improving internal clustering quality.
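One simple, package-free way to back up the claim that the two low-dimensional configurations are nearly identical is to correlate their pairwise patient distances; a correlation close to 1 would confirm that the 2-D PCA and MDS maps encode essentially the same geometry (a rough sketch; d_pca2 and d_mds2 are illustrative names):
# Correlation between interpoint distances in the 2-D PCA map and the
# 2-D Euclidean MDS map: a crude similarity measure of the two layouts.
d_pca2 <- dist(pca1$x[, 1:2])
d_mds2 <- dist(mds_euc[, 1:2])
cor(as.vector(d_pca2), as.vector(d_mds2))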
To evaluate whether dimensionality reduction improves clustering, I compared the average silhouette widths obtained from clustering on the original feature space and on the PCA-reduced space.
avg_sil <- data.frame(
Method = c(
"K-means (Original)", "K-means (PCA)",
"PAM (Original)", "PAM (PCA)",
"Hierarchical (Original)", "Hierarchical (PCA)"
),
Avg_Silhouette = c(
mean(silhouette(km2_e$cluster, dist(X))[, 3]),
mean(silhouette(km_pca$cluster, dist(X_pca_k))[, 3]),
mean(silhouette(pam2_e$cluster, dist(X))[, 3]),
mean(silhouette(pam_pca$cluster, dist(X_pca_k))[, 3]),
mean(silhouette(hc_cut$cluster, dist(X))[, 3]),
mean(silhouette(hc_pca_cut, dist(X_pca_k))[, 3])
)
)
avg_sil
Across all methods, PCA leads to higher average silhouette values. The improvement is most noticeable for hierarchical clustering and K-means, where PCA reduces noise and results in more compact and better-separated clusters. PAM shows a smaller but still consistent improvement.
Overall, this confirms that clustering in the PCA-reduced space provides more stable and meaningful group structure than clustering directly on the original high-dimensional data.
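The same comparison can be extended with an external criterion. Adding the adjusted Rand index (ARI) against the held-aside labels next to each silhouette value shows whether the geometric gains from PCA come at any cost in class alignment. This sketch reuses the partitions from the table above and flexclust::randIndex(), which is already loaded (values near 1 indicate near-perfect agreement with the true classes, values near 0 chance-level agreement):
# External validation: ARI vs. the true ckd/notckd labels for the same six partitions.
avg_sil$ARI <- c(
flexclust::randIndex(table(km2_e$cluster, y)),
flexclust::randIndex(table(km_pca$cluster, y)),
flexclust::randIndex(table(pam2_e$cluster, y)),
flexclust::randIndex(table(pam_pca$cluster, y)),
flexclust::randIndex(table(hc_cut$cluster, y)),
flexclust::randIndex(table(hc_pca_cut, y))
)
avg_sil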
I also tested whether the cluster structure transfers to unseen patients. To assess clustering stability, the data was split into training (80%) and test (20%) sets.
set.seed(123)
idx <- sample(seq_len(nrow(X_pca_k)), size = floor(0.8 * nrow(X_pca_k)))
X_train <- X_pca_k[idx, ]
X_test <- X_pca_k[-idx, ]
y_test <- y[-idx]
km_train <- eclust(X_train, "kmeans", k = 2, nstart = 50, graph = FALSE)
km_kcca <- flexclust::as.kcca(km_train, X_train)
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
km_pred <- predict(km_kcca, X_test)
print(table(km_pred,y_test))
## y_test
## km_pred ckd notckd
## 1 42 0
## 2 16 22
p1 <- fviz_cluster(km_train, geom = "point", stand = FALSE) +
ggtitle("Train Clusters (7 PCs)")
p2 <- fviz_cluster(list(data = X_test, cluster = km_pred), geom = "point", stand = FALSE) +
ggtitle("Predicted Test Clusters (7 PCs)")
gridExtra::grid.arrange(p1, p2, nrow = 2)
s1 <- fviz_silhouette(km_train) + ggtitle("K-means Train Silhouette")
## cluster size ave.sil.width
## 1 1 143 0.11
## 2 2 176 0.50
km_test_sil <- cluster::silhouette(km_pred, dist(X_test))
s2 <- fviz_silhouette(km_test_sil) + ggtitle("K-means Test Silhouette")
## cluster size ave.sil.width
## 1 1 42 0.00
## 2 2 38 0.49
In the training set, one cluster is clearly well defined, with a high average silhouette width (0.5), while the other cluster shows weaker separation (0.11), indicating some overlap.
When applied to the test set, a similar pattern is observed. The well-separated cluster remains stable with an average silhouette width of 0.49, while the weaker cluster again shows low separation (0.00). This consistency suggests that the clustering structure generalizes reasonably well to unseen data.
The test-set cluster composition shows that one cluster contains only CKD cases, while the other mixes 16 CKD with 22 non-CKD patients. Overall, clustering in the PCA-reduced space improves stability compared to the original feature space, although the class alignment on the test set remains imperfect.
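For intuition, the kcca-based prediction above effectively assigns each test patient to the nearest training centroid in the 7-PC space. A manual nearest-centroid sketch makes that mechanism explicit (centres, d_to_centres and manual_pred are illustrative names; the result is expected to match the kcca prediction, possibly up to label order):
# Manual nearest-centroid assignment: each test row goes to the closest
# k-means centre from the training fit.
centres <- km_train$centers
d_all <- as.matrix(dist(rbind(centres, X_test)))
d_to_centres <- d_all[-(1:nrow(centres)), 1:nrow(centres), drop = FALSE]
manual_pred <- apply(d_to_centres, 1, which.min)
table(manual_pred, km_pred)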
pam_train <- eclust(X_train, "pam", k = 2, graph = FALSE)
pam_kcca <- flexclust::as.kcca(pam_train, X_train)
pam_pred <- predict(pam_kcca, X_test)
print(table(Pred = pam_pred, True = y_test))
## True
## Pred ckd notckd
## 1 52 0
## 2 6 22
p3 <- fviz_cluster(pam_train, geom = "point", stand = FALSE) +
ggtitle("PAM Train Clusters (7 PCs)")
p4 <- fviz_cluster(list(data = X_test, cluster = pam_pred), geom = "point", stand = FALSE) +
ggtitle("PAM Predicted Test Clusters (7 PCs)")
gridExtra::grid.arrange(p3, p4, nrow = 2)
s3 <- fviz_silhouette(pam_train) + ggtitle("PAM Train Silhouette")
## cluster size ave.sil.width
## 1 1 178 0.09
## 2 2 141 0.54
pam_test_sil <- cluster::silhouette(pam_pred, dist(X_test))
s4 <- fviz_silhouette(pam_test_sil) + ggtitle("PAM Test Silhouette")
## cluster size ave.sil.width
## 1 1 52 0.02
## 2 2 28 0.53
PAM clustering on the first 7 principal components shows a pattern similar to K-means. In the training set, one cluster is well defined, with a high average silhouette width of 0.54, while the other cluster shows weaker separation (0.09), indicating some overlap.
When applied to the test set, the well-separated cluster remains stable with an average silhouette width of 0.53, while the weaker cluster again shows low separation (0.02). This consistency between training and test results suggests that, similar to K-means, PAM benefits from PCA-based dimensionality reduction by improving cluster stability, especially for the better-defined cluster. The class alignment is also better than K-means.
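The “better class alignment than K-means” observation can also be checked on the held-out patients with an adjusted Rand index (a small sketch using flexclust::randIndex(); higher values mean closer agreement with the true labels):
# Test-set agreement between predicted clusters and the true classes.
c(
kmeans_test = flexclust::randIndex(table(km_pred, y_test)),
pam_test = flexclust::randIndex(table(pam_pred, y_test))
)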
Across all methods, the data consistently forms two main clusters, but separation is not perfect due to overlap in patient profiles. In the original feature space, clustering methods tend to produce one compact cluster and one more diffuse cluster, which limits overall quality.
Applying dimensionality reduction improves how these clusters behave rather than changing their structure. Clustering on PCA-reduced data leads to higher and more stable silhouette scores across K-means, PAM, and hierarchical clustering, indicating tighter and better-separated groups. MDS preserves the same cluster assignments and class alignment, but its impact on clustering quality is smaller.
While PCA and MDS produce very similar visual patterns, PCA consistently results in more compact clusters and higher silhouette values. This makes PCA the more effective representation for clustering in this dataset, even though the underlying class alignment remains largely the same.