I analyzed a Chronic Kidney Disease (CKD) dataset containing a combination of numerical laboratory measurements and categorical clinical indicators collected from patients. The features describe kidney function, blood health, metabolic conditions, and related comorbidities, including measurements such as blood urea, serum creatinine, hemoglobin, electrolyte levels, and the presence or absence of conditions like hypertension, diabetes, anemia, and edema.
The dataset consists of 24 clinical attributes and one class label, where the class indicates whether a patient was diagnosed with chronic kidney disease (ckd) or not (notckd). Although this label was available, it was not used during clustering and was only referenced later to assess how well the discovered clusters aligned with the true medical categories.
The dataset was obtained from the UCI Machine Learning Repository and was originally sourced via IEEE DataPort:
https://ieee-dataport.org/documents/chronic-kidney-disease-data-set
https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease
pkgs <- c(
"tidyverse","factoextra","cluster","flexclust","ClusterR","psych",
"gridExtra","ggplot2","ggfortify","dbscan","NbClust","clustertend",
"smacof","ape","fpc","StatMatch","dendextend","plotly","corrplot"
)
missing <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) {
stop(paste("Missing packages:", paste(missing, collapse=", "),
"\nInstall them first, then knit again."))
}
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
library(ClusterR)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(rlang)
##
## Attaching package: 'rlang'
##
## The following objects are masked from 'package:purrr':
##
## flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
library(ggfortify)
library(dbscan)
##
## Attaching package: 'dbscan'
##
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(NbClust)
library(clustertend)
## Package `clustertend` is deprecated. Use package `hopkins` instead.
library(smacof)
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
##
## The following object is masked from 'package:dbscan':
##
## clplot
##
## The following object is masked from 'package:flexclust':
##
## placeLabels
##
## The following object is masked from 'package:psych':
##
## rescale
##
## Loading required package: colorspace
## Loading required package: e1071
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:flexclust':
##
## bclust
##
## The following object is masked from 'package:ggplot2':
##
## element
##
##
## Attaching package: 'smacof'
##
## The following object is masked from 'package:psych':
##
## Procrustes
##
## The following object is masked from 'package:base':
##
## transform
library(ape)
##
## Attaching package: 'ape'
##
## The following object is masked from 'package:dplyr':
##
## where
library(fpc)
##
## Attaching package: 'fpc'
##
## The following object is masked from 'package:dbscan':
##
## dbscan
library(StatMatch)
## Loading required package: proxy
##
## Attaching package: 'proxy'
##
## The following objects are masked from 'package:stats':
##
## as.dist, dist
##
## The following object is masked from 'package:base':
##
## as.matrix
##
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loading required package: survival
##
## Attaching package: 'survey'
##
## The following object is masked from 'package:graphics':
##
## dotchart
##
## Loading required package: lpSolve
library(dendextend)
##
## ---------------------
## Welcome to dendextend version 1.19.1
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
##
## Attaching package: 'dendextend'
##
## The following objects are masked from 'package:ape':
##
## ladderize, rotate
##
## The following object is masked from 'package:stats':
##
## cutree
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
set.seed(123)
theme_set(theme_minimal())
The analysis proceeds as follows. First, the dataset is cleaned, encoded, and scaled to construct a suitable feature matrix for clustering. Next, cluster tendency and the optimal number of clusters are assessed. Multiple clustering algorithms are then applied in the original feature space and evaluated. Finally, dimensionality reduction using MDS and PCA is explored to assess whether reduced representations improve cluster separation and stability, including validation using train–test splits.
I first loaded the dataset and examined its structure to verify that:
variables were correctly typed
the class label contained no obvious formatting issues
missing values were consistently encoded
df0 <- read.csv("data.csv", stringsAsFactors = FALSE, na.strings = "?")
str(df0)
## 'data.frame': 401 obs. of 25 variables:
## $ age : chr "48" "7" "62" "48" ...
## $ bp : int 80 50 80 70 80 90 70 NA 100 90 ...
## $ sg : num 1.02 1.02 1.01 1 1.01 ...
## $ al : int 1 4 2 4 2 3 0 2 3 2 ...
## $ su : int 0 0 3 0 0 0 0 4 0 0 ...
## $ rbc : chr NA NA "normal" "normal" ...
## $ pc : chr "normal" "normal" "normal" "abnormal" ...
## $ pcc : chr "notpresent" "notpresent" "notpresent" "present" ...
## $ ba : chr "notpresent" "notpresent" "notpresent" "notpresent" ...
## $ bgr : int 121 NA 423 117 106 74 100 410 138 70 ...
## $ bu : num 36 18 53 56 26 25 54 31 60 107 ...
## $ sc : num 1.2 0.8 1.8 3.8 1.4 1.1 24 1.1 1.9 7.2 ...
## $ sod : num NA NA NA 111 NA 142 104 NA NA 114 ...
## $ pot : num NA NA NA 2.5 NA 3.2 4 NA NA 3.7 ...
## $ hemo : num 15.4 11.3 9.6 11.2 11.6 12.2 12.4 12.4 10.8 9.5 ...
## $ pcv : chr "44" "38" "31" "32" ...
## $ wbcc : chr "7800" "6000" "7500" "6700" ...
## $ rbcc : chr "5.2" NA NA "3.9" ...
## $ htn : chr "yes" "no" "no" "yes" ...
## $ dm : chr "yes" "no" "yes" "no" ...
## $ cad : chr "no" "no" "no" "no" ...
## $ appet: chr "good" "good" "poor" "poor" ...
## $ pe : chr "no" "no" "no" "yes" ...
## $ ane : chr "no" "no" "yes" "yes" ...
## $ class: chr "ckd" "ckd" "ckd" "ckd" ...
head(df0)
The class column was cleaned by converting all values to lowercase, trimming whitespace, and removing invalid entries. Only valid labels (ckd, notckd) were retained.
# Note: the rows dropped here (invalid class labels) were also the source of extreme outliers found later in the analysis
df0$class <- tolower(trimws(df0$class))
df0$class <- gsub(",", "", df0$class)
df0 <- df0 %>% filter(class %in% c("ckd", "notckd"))
y <- factor(df0$class, levels = c("ckd", "notckd"))
Binary clinical indicators were converted to numeric values (1 = present, 0 = absent), while missing values were preserved for later imputation.
df0$dm <- ifelse(df0$dm == "yes", 1, ifelse(is.na(df0$dm), NA, 0))
df0$cad <- ifelse(df0$cad == "yes", 1, ifelse(is.na(df0$cad), NA, 0))
df0$pe <- ifelse(df0$pe == "yes", 1, ifelse(is.na(df0$pe), NA, 0))
df0$ane <- ifelse(df0$ane == "yes", 1, ifelse(is.na(df0$ane), NA, 0))
df0$htn <- ifelse(df0$htn == "yes", 1, ifelse(is.na(df0$htn), NA, 0))
df0$rbc <- ifelse(df0$rbc == "abnormal", 1, ifelse(is.na(df0$rbc), NA, 0))
df0$pc <- ifelse(df0$pc == "abnormal", 1, ifelse(is.na(df0$pc), NA, 0))
df0$pcc <- ifelse(df0$pcc == "present", 1, ifelse(is.na(df0$pcc), NA, 0))
df0$ba <- ifelse(df0$ba == "present", 1, ifelse(is.na(df0$ba), NA, 0))
df0$appet <- ifelse(df0$appet == "poor", 1, ifelse(is.na(df0$appet), NA, 0))
The class label was then removed to form the clustering feature matrix.
df <- df0 %>% select(-class)
After encoding, I grouped the variables into two sets.
The first set contains laboratory-style numeric or ordinal measurements (for example age, blood pressure, specific gravity, albumin, sugar, urea, creatinine, electrolytes, and blood counts). I scaled these variables so that no single unit dominates the distance calculation.
The second set contains true binary indicators (0/1), such as diabetes, hypertension, anemia, appetite status, and abnormal cell findings. I kept these as 0/1 and combined them with the scaled numeric variables to form the final matrix X.
numeric_vars <- c("age","bp","bgr","bu","sc","sod","pot","hemo","pcv","wbcc","rbcc")
ordinal_vars <- c("sg","al","su")
binary_vars <- setdiff(names(df), c(numeric_vars, ordinal_vars))
df[c(numeric_vars, ordinal_vars)] <- lapply(df[c(numeric_vars, ordinal_vars)], as.numeric)
## Warning in lapply(df[c(numeric_vars, ordinal_vars)], as.numeric): NAs
## introduced by coercion
## Warning in lapply(df[c(numeric_vars, ordinal_vars)], as.numeric): NAs
## introduced by coercion
## Warning in lapply(df[c(numeric_vars, ordinal_vars)], as.numeric): NAs
## introduced by coercion
df[binary_vars] <- lapply(df[binary_vars], as.numeric)
summary(df[c(numeric_vars, ordinal_vars)])
## age bp bgr bu
## Min. : 2.00 Min. : 50.00 Min. : 22.0 Min. : 1.50
## 1st Qu.:42.00 1st Qu.: 70.00 1st Qu.: 99.0 1st Qu.: 27.00
## Median :54.50 Median : 80.00 Median :121.0 Median : 42.00
## Mean :51.42 Mean : 76.49 Mean :148.2 Mean : 57.45
## 3rd Qu.:64.00 3rd Qu.: 80.00 3rd Qu.:163.0 3rd Qu.: 66.00
## Max. :90.00 Max. :180.00 Max. :490.0 Max. :391.00
## NA's :9 NA's :12 NA's :44 NA's :19
## sc sod pot hemo
## Min. : 0.400 Min. : 4.5 Min. : 2.500 Min. : 3.10
## 1st Qu.: 0.900 1st Qu.:135.0 1st Qu.: 3.800 1st Qu.:10.30
## Median : 1.300 Median :138.0 Median : 4.400 Median :12.60
## Mean : 3.078 Mean :137.5 Mean : 4.631 Mean :12.52
## 3rd Qu.: 2.800 3rd Qu.:142.0 3rd Qu.: 4.900 3rd Qu.:15.00
## Max. :76.000 Max. :163.0 Max. :47.000 Max. :17.80
## NA's :17 NA's :87 NA's :88 NA's :52
## pcv wbcc rbcc sg al
## Min. : 9.00 Min. : 2200 Min. :2.100 Min. :1.005 Min. :0.00
## 1st Qu.:32.00 1st Qu.: 6500 1st Qu.:3.900 1st Qu.:1.010 1st Qu.:0.00
## Median :40.00 Median : 8000 Median :4.800 Median :1.020 Median :0.00
## Mean :38.86 Mean : 8400 Mean :4.707 Mean :1.017 Mean :1.02
## 3rd Qu.:45.00 3rd Qu.: 9800 3rd Qu.:5.400 3rd Qu.:1.020 3rd Qu.:2.00
## Max. :54.00 Max. :26400 Max. :8.000 Max. :1.025 Max. :5.00
## NA's :71 NA's :106 NA's :131 NA's :47 NA's :46
## su
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4514
## 3rd Qu.:0.0000
## Max. :5.0000
## NA's :49
summary(df[binary_vars])
## rbc pc pcc ba
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1903 Mean :0.2275 Mean :0.1063 Mean :0.0557
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :152 NA's :65 NA's :4 NA's :4
## htn dm cad appet
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.000
## Mean :0.3703 Mean :0.3375 Mean :0.08564 Mean :0.206
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.000
## NA's :2 NA's :2 NA's :2 NA's :1
## pe ane
## Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :0.0000
## Mean :0.191 Mean :0.1508
## 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.0000
## NA's :1 NA's :1
Missing values were handled differently depending on variable type:
Numeric variables were imputed using the median, which is robust to outliers.
Binary variables were imputed using the mode (most frequent value).
# Mode imputation helper: returns the most frequent non-NA value of x
mode_impute <- function(x) {
  ux <- na.omit(x)
  if (length(ux) == 0) return(NA)
  ux[which.max(tabulate(match(ux, ux)))]
}
df[numeric_vars] <- lapply(df[numeric_vars], function(x) { x[is.na(x)] <- median(x, na.rm=TRUE); x })
df[ordinal_vars] <- lapply(df[ordinal_vars], function(x) { x[is.na(x)] <- mode_impute(x); x })
df[binary_vars] <- lapply(df[binary_vars], function(x) { x[is.na(x)] <- mode_impute(x); x })
colSums(is.na(df))
## age bp sg al su rbc pc pcc ba bgr bu sc sod
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## pot hemo pcv wbcc rbcc htn dm cad appet pe ane
## 0 0 0 0 0 0 0 0 0 0 0
After imputation, no missing values remained in the dataset.
To ensure fair distance calculations:
Numeric variables were scaled to mean 0 and variance 1
Binary variables were kept as 0/1
X_num <- scale(df[c(numeric_vars, ordinal_vars)])
X_bin <- data.matrix(df[binary_vars])
X <- cbind(X_num, X_bin)
At this point, the final feature matrix X is suitable for distance-based clustering methods.
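Since X mixes scaled numeric columns with 0/1 indicators, a Gower dissimilarity would be a possible alternative to plain Euclidean distance, and the packages loaded above (cluster, StatMatch) both provide it. The snippet below is only a sketch and is not used in the analysis that follows; the object name d_gower is illustrative.
# Sketch: Gower dissimilarity on the mixed-type data frame, declaring the
# 0/1 columns as symmetric binary variables
d_gower <- cluster::daisy(
  df[c(numeric_vars, ordinal_vars, binary_vars)],
  metric = "gower",
  type = list(symm = binary_vars)
)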
Before applying clustering algorithms, I assessed whether the data contained meaningful cluster structure using the Hopkins statistic.
factoextra::get_clust_tendency(
X,
n = min(50, nrow(X)),
graph = TRUE,
gradient = list(low = "red", mid = "white", high = "blue"),
seed = 1
)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## $hopkins_stat
## [1] 0.8584354
##
## $plot
The Hopkins statistic was 0.86, indicating a strong tendency for clustering and confirming that the data is far from random.
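Since the clustertend package is deprecated (see its startup message above), the same tendency check could also be run with the newer hopkins package; the snippet below is only a sketch of that alternative and is not used elsewhere in the analysis.
# Alternative Hopkins computation with the hopkins package; values near 1
# again indicate a clusterable structure (m = 50 sampled rows, as above)
if (requireNamespace("hopkins", quietly = TRUE)) {
  set.seed(1)
  hopkins::hopkins(X, m = 50)
}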
Multiple methods were used to determine the optimal number of clusters:
Elbow Method (WSS): Looks for the point where the within-cluster sum of squares starts to level off.
Silhouette Method: Measures the separation between clusters, with higher values indicating better separation.
Gap Statistic: Compares the clustering structure to a random reference dataset.
Calinski-Harabasz index: The CH index measures the ratio of between-cluster variance to within-cluster variance, where higher values indicate better-defined clusters.
fviz_nbclust(X, kmeans, method = "wss", k.max = 10) + ggtitle("K-means: Elbow (WSS)")
fviz_nbclust(X, kmeans, method = "silhouette", k.max = 10) + ggtitle("K-means: Silhouette")
fviz_nbclust(X, kmeans, method = "gap_stat", k.max = 10) + ggtitle("K-means: Gap Statistic")
fviz_nbclust(X, FUNcluster = cluster::pam, method = "wss", k.max = 10) + ggtitle("PAM: Elbow (WSS)")
fviz_nbclust(X, FUNcluster = cluster::pam, method = "silhouette", k.max = 10) + ggtitle("PAM: Silhouette")
fviz_nbclust(X, FUNcluster = cluster::pam, method = "gap_stat", k.max = 10) + ggtitle("PAM: Gap Statistic")
fviz_nbclust(X, FUNcluster = cluster::clara, method = "wss", k.max = 10) + ggtitle("CLARA: Elbow (WSS)")
fviz_nbclust(X, FUNcluster = cluster::clara, method = "silhouette", k.max = 10) + ggtitle("CLARA: Silhouette")
fviz_nbclust(X, FUNcluster = cluster::clara, method = "gap_stat", k.max = 10) + ggtitle("CLARA: Gap Statistic")
fviz_nbclust(X, hcut, method = "wss", k.max = 10) + ggtitle("Hierarchical Clustering: Elbow (WSS)")
fviz_nbclust(X, hcut, method = "silhouette", k.max = 10) + ggtitle("Hierarchical Clustering: Silhouette")
fviz_nbclust(X, hcut, method = "gap_stat", k.max = 10) + ggtitle("Hierarchical Clustering: Gap Statistic")
set.seed(123)
nb <- NbClust::NbClust(
X, distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index = "ch"
)
nb$Best.nc
## Number_clusters Value_Index
## 2.0000 113.2369
From the silhouette plots, K-means, PAM, and CLARA all show their highest average silhouette width at k = 2, after which the values steadily decrease. This suggests that dividing the data into two clusters gives the clearest separation, while adding more clusters mainly breaks existing groups into smaller, less meaningful pieces. Hierarchical clustering shows a similar pattern, with silhouette values stabilizing around k = 2-5, but without a clear improvement beyond two clusters.
The Gap Statistic selected two clusters for all algorithms, and the Calinski–Harabasz index also supported this choice, with its maximum value occurring at k = 2 (CH = 113.24). This provided a clear and stable basis for using a two-cluster solution in subsequent analyses.
I ran clustering on the full feature matrix X using K-means and PAM.
set.seed(123)
km2_e <- eclust(X, "kmeans", k = 2, graph = FALSE)
fviz_cluster(km2_e, geom = "point", alpha = 0.5, stand = FALSE)
fviz_silhouette(km2_e)
## cluster size ave.sil.width
## 1 1 186 0.00
## 2 2 213 0.45
pam2_e <- factoextra::eclust(X, "pam", k = 2, graph = FALSE)
fviz_cluster(pam2_e, geom = "point", alpha = 0.5, stand = FALSE)
fviz_silhouette(pam2_e)
## cluster size ave.sil.width
## 1 1 228 0.43
## 2 2 171 0.01
df$class <- y
df$cluster1 <- km2_e$cluster
table(df$cluster1, df$class)
##
## ckd notckd
## 1 186 0
## 2 64 149
df$cluster2 <- pam2_e$cluster
table(df$cluster2, df$class)
##
## ckd notckd
## 1 79 149
## 2 171 0
Both partition-based methods produced similar patterns:
One cluster was relatively compact and well separated
The second cluster showed substantial overlap and dispersion
This indicates that while a two-cluster structure exists, perfect separation between CKD and non-CKD patients is not achievable in the original high-dimensional space.
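To put a single number on how closely each partition matches the held-out class label, the adjusted Rand index can be computed from the cross-tabulations above; a minimal sketch using flexclust (already loaded):
# Adjusted Rand index against the class label: 1 = perfect agreement, ~0 = chance level
flexclust::randIndex(table(km2_e$cluster, y))
flexclust::randIndex(table(pam2_e$cluster, y))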
fviz_cluster(
list(data = X, cluster = km2_e$cluster),
geom = "point",
ellipse.type = "norm",
stand = FALSE,
ggtheme = theme_minimal()
) +
ggtitle("K-means Clustering (Original Feature Space)")
fviz_cluster(
list(data = X, cluster = pam2_e$cluster),
geom = "point",
ellipse.type = "norm",
stand = FALSE,
ggtheme = theme_minimal()
) + ggtitle("PAM Clustering (Original Feature Space)")
The clustering results show some overlap between the clusters. However, these plots are only 2D projections used for visualization, so some overlap is expected when a higher-dimensional structure is compressed into two dimensions; the separation may be more distinct when additional dimensions are considered.
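As a quick check of that last point, the first three principal components of X can be inspected interactively with plotly (loaded above); this is only a sketch, and the names scores3 and p3 are illustrative.
# 3D view of the first three PCs of X, coloured by the K-means assignment
scores3 <- as.data.frame(prcomp(X)$x[, 1:3])
scores3$cluster <- factor(km2_e$cluster)
p3 <- plotly::plot_ly(
  scores3, x = ~PC1, y = ~PC2, z = ~PC3, color = ~cluster,
  type = "scatter3d", mode = "markers", marker = list(size = 3)
)
p3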
I also ran hierarchical clustering because it gives me a dendrogram view of how points merge step by step. I tried multiple linkage methods and then cut the tree into 2 clusters for consistency with the rest of the analysis.
This helps confirm whether the two-cluster structure is stable across different clustering approaches.
dist_euc <- dist(X)
hc_complete <- hclust(dist_euc, method = "complete")
hc_ward <- hclust(dist_euc, method = "ward.D2")
factoextra::fviz_dend(hc_complete, k = 2, cex = 0.6, rect = TRUE) +
ggtitle("Dendrogram: Complete linkage (k=2)")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
factoextra::fviz_dend(hc_ward, k = 2, cex = 0.6, rect = TRUE) +
ggtitle("Dendrogram: Ward.D2 (k=2)")
The Complete linkage dendrogram shows a highly unbalanced structure where most data points are grouped into one large blue cluster, leaving a very small red branch on the left. This indicates that the method is sensitive to outliers or specific distant points, merging them only at the very final step at a height of approximately 25.
In contrast, Ward’s method produces two well-balanced and distinct branches of similar size. The significant vertical height (around 50) before the final merge indicates a high degree of separation between these two groups, providing strong evidence that a two-cluster structure is a natural and stable fit for the underlying data.
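The visual impression that Ward linkage gives the cleanest structure can also be checked numerically with the agglomerative coefficient from cluster::agnes; a short sketch (values closer to 1 indicate stronger clustering structure):
# Agglomerative coefficient for several linkage methods (higher = clearer structure)
sapply(c("average", "single", "complete", "ward"), function(m) {
  cluster::agnes(X, method = m)$ac
})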
hc_cut <- factoextra::hcut(X, k = 2, stand = FALSE, hc_method = "ward.D2")
fviz_cluster(hc_cut, main = "Hierarchical (hcut) on X", geom = "point", stand = FALSE)
fviz_silhouette(hc_cut)
## cluster size ave.sil.width
## 1 1 174 0.48
## 2 2 225 -0.01
table(Cluster = hc_cut$cluster, Class = y)
## Class
## Cluster ckd notckd
## 1 25 149
## 2 225 0
The results from hierarchical clustering (using Ward’s method) show that the two-cluster structure is stable across different clustering approaches. Cluster 2 predominantly contains ckd cases, while Cluster 1 consists mostly of notckd cases, with only a few misclassifications. This represents the strongest class alignment observed so far.
The average silhouette width for hierarchical clustering, however, is low, primarily due to the spread-out nature of Cluster 2.
DBSCAN is useful because it can label some points as noise instead of forcing every point into a cluster.
I used the kNN distance plot to pick a reasonable eps value, then I ran DBSCAN.
dbscan::kNNdistplot(X, k = 10)
abline(h = 2.5, col = "red", lty = 2)
db <- dbscan::dbscan(X, eps = 2.5, minPts = 10)
Larger eps values merged most points into one cluster, so I selected eps = 2.5 to avoid a single-cluster solution while still keeping some separation.
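A small sensitivity check makes that choice explicit: rerunning DBSCAN over a few candidate eps values (the grid below is illustrative) and counting clusters and noise points.
# How the number of clusters and noise points reacts to eps (minPts fixed at 10)
for (eps in c(2.0, 2.5, 3.0, 3.5)) {
  res <- dbscan::dbscan(X, eps = eps, minPts = 10)
  cat("eps =", eps,
      "| clusters =", length(setdiff(unique(res$cluster), 0)),
      "| noise points =", sum(res$cluster == 0), "\n")
}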
fviz_cluster(db, data = X, geom = "point", stand = FALSE) +
ggtitle("DBSCAN clusters on X")
table(db$cluster,y)
## y
## ckd notckd
## 0 147 0
## 1 103 149
The plot and the table show that DBSCAN identifies only one dense cluster, which mixes all notckd cases with a large number of ckd cases, while the remaining 147 ckd patients are labeled as noise. Because the ckd cases are so spread out, a density-based approach does not work well for this dataset.
Overall, hierarchical clustering provides the best label separation among the methods tested in the original feature space, but the uneven compactness of the clusters suggests that the high-dimensional representation still contains noise and redundant variation. This motivated the use of dimensionality reduction to improve cluster compactness while preserving the strong class alignment.
While PCA is variance-based, MDS is distance-based and is useful for visualizing how points relate to each other based on pairwise distances. I applied MDS using Euclidean distance computed on the full preprocessed feature matrix X (scaled numeric variables plus binary indicators).
mds_euc <- cmdscale(dist_euc, k = 2)
plot(mds_euc, pch = 19, col = y, main = "MDS (cmdscale) from Euclidean distance on X")
legend("topright", legend = levels(y), col = 1:length(levels(y)), pch = 19)
The MDS plot visualizes the relationships between the data points based on Euclidean distance. It shows a clear separation between the ckd and notckd groups along Dimension 1, with the notckd cases (red) tightly grouped together and the ckd cases (black) spread across a much larger range.
However, there’s some overlap between the two groups, particularly in the lower range of Dim 1. This suggests that while there is some distinction between the groups, the separation is not perfect, and there is still a significant amount of variability within each group.
stress_by_dim <- sapply(1:25, function(k) {
smacof::mds(dist_euc, ndim = k, type = "ratio")$stress
})
data.frame(
Dimensions = 1:25,
Stress = stress_by_dim
)
plot(1:25, stress_by_dim, type = "b",
xlab = "Number of MDS dimensions",
ylab = "Stress",
main = "Stress vs MDS dimensionality")
I selected 5 dimensions for the MDS solution because this is the point where the stress value drops into the fair-to-good range (approximately 0.09) according to standard MDS guidelines. Beyond this point, additional dimensions continue to reduce stress, but the improvement becomes more gradual, indicating diminishing returns. A 5-dimensional solution therefore provides a good balance between preserving the distance structure of the data and keeping the representation reasonably low-dimensional.
mds <- smacof::mds(dist_euc, ndim = 5, type = "ratio")
mds$stress
## [1] 0.09198974
summary(mds)
##
## Configurations:
## D1 D2 D3 D4 D5
## 1 -0.2053 0.0976 0.1418 0.0922 0.0282
## 2 -0.1097 -0.1911 -0.0756 0.2668 0.7008
## 3 0.5553 0.5849 0.4615 -0.0747 0.1052
## 4 0.5820 -0.1412 -0.3052 0.1236 0.6274
## 5 0.1046 0.0707 -0.2181 0.1913 0.0562
## 6 0.1307 0.1868 -0.1430 0.3585 -0.0997
## 7 0.2885 -0.6171 0.0426 -0.7160 0.2844
## 8 0.1680 0.6993 0.4898 0.2211 0.4600
## 9 0.3931 0.1116 -0.1733 0.3562 -0.1453
## 10 0.5649 -0.3519 -0.5354 0.0238 -0.1173
## 11 0.6253 0.6958 0.4447 -0.2503 0.5380
## 12 0.6191 0.0631 0.3183 -0.2635 0.4516
## 13 0.6041 0.2177 -0.2612 -0.2465 0.0647
## 14 0.0998 -0.2765 0.0429 -0.3155 -0.1571
## 15 1.0324 0.1701 -0.3402 -0.0205 -0.0693
## 16 0.6155 -0.5852 -0.0683 0.3773 0.0458
## 17 -0.0399 -0.0021 -0.1048 0.0670 0.1639
## 18 -0.0397 -0.2438 0.0175 0.0493 -0.1293
## 19 0.1161 0.5570 0.2681 -0.1104 -0.4665
## 20 0.1962 -0.3530 -0.2268 -0.2594 0.1700
## 21 0.7587 -0.2114 -0.1308 -0.0488 -0.1057
## 22 1.5570 -2.1894 1.0329 -2.0110 0.6399
## 23 0.4786 -0.5179 0.1508 0.2946 0.1622
## 24 -0.1268 -0.1174 -0.3616 0.2380 0.1950
## 25 0.2772 0.1120 -0.0732 0.5662 0.0646
## 26 0.0867 -0.4155 -0.1738 -0.3315 -0.1596
## 27 0.1548 0.0907 -0.1522 -0.2505 -0.2954
## 28 0.5635 0.6002 0.1379 -0.1413 0.3342
## 29 -0.0209 0.3905 0.2294 -0.3165 0.0006
## 30 0.0503 0.3194 -0.4409 -0.1433 0.0680
## 31 0.0276 -0.4410 0.1274 -0.1259 -0.0256
## 32 0.3574 0.0989 -0.2534 0.2710 -0.2538
## 33 0.3222 0.2356 -0.1446 -0.0088 -0.2002
## 34 0.2799 -0.0331 0.0630 0.3262 -0.2921
## 35 0.2939 -0.1290 0.1826 -0.3645 0.1761
## 36 0.2556 0.3176 0.2675 -0.0069 -0.1928
## 37 0.1215 -0.0619 -0.1938 -0.2937 -0.1316
## 38 0.3506 -0.2493 -0.1248 -0.2415 -0.4848
## 39 0.2099 -0.1459 0.3035 0.0938 0.0781
## 40 0.2952 0.4036 -0.0621 -0.2154 -0.1415
## 41 0.2938 -0.0051 -0.2734 0.2537 -0.0470
## 42 -0.1103 0.0619 -0.3179 0.0092 0.1006
## 43 0.1881 0.1683 -0.2120 0.2318 -0.3997
## 44 0.6258 -0.5730 -0.3241 0.2121 0.0203
## 45 0.5526 0.0305 -0.0494 0.1656 0.1870
## 46 0.3774 -0.0976 0.2436 0.1087 0.1844
## 47 -0.0916 -0.1355 -0.1637 -0.0405 0.0084
## 48 -0.1545 0.1712 -0.2047 0.4642 0.3472
## 49 0.3518 0.2474 -0.8585 -0.4503 -0.2141
## 50 0.5007 0.3671 -0.9283 -0.2796 0.0229
## 51 0.2423 -0.3949 -0.3905 -0.2810 -0.0801
## 52 0.3740 0.0984 0.0328 0.3729 -0.1110
## 53 0.0966 -0.2714 -0.0628 0.2440 -0.3378
## 54 0.0953 0.7883 0.4033 -0.0928 -0.0472
## 55 0.1994 0.4257 -0.1511 0.0079 0.0316
## 56 0.3159 0.0747 -0.3476 0.3993 0.2460
## 57 0.9533 0.1781 0.1469 -0.3471 0.1926
## 58 0.1579 -0.2947 0.2725 -0.1541 -0.4173
## 59 0.4729 -0.1282 0.3526 -0.1230 -0.0791
## 60 0.2203 -0.2334 0.1178 0.1524 -0.4858
## 61 0.0482 0.0197 0.1633 0.0794 -0.3030
## 62 0.9071 -0.5940 2.1350 1.3514 0.5010
## 63 0.0440 -0.2782 -0.2768 0.2016 0.4925
## 64 0.3569 -0.3873 -0.1366 0.0314 0.1211
## 65 0.0366 0.0638 -0.2651 0.0404 -0.1488
## 66 -0.1877 0.2499 -0.1160 0.2899 -0.0621
## 67 0.1057 -0.0363 0.1519 -0.2145 0.1766
## 68 0.1617 0.2833 0.4500 0.0611 0.5120
## 69 0.1937 0.0296 -0.2318 -0.0685 0.1654
## 70 -0.3068 0.5979 0.4475 0.1270 0.3921
## 71 0.0172 0.7385 0.4860 -0.1761 0.0488
## 72 0.3772 0.0259 -0.6168 -0.2436 0.1111
## 73 0.3648 0.5181 0.0579 0.3241 0.0791
## 74 0.7039 -0.3042 -0.0897 0.4301 -0.3305
## 75 0.5168 -0.2456 -0.0267 0.2026 -0.1150
## 76 -0.0732 -0.1761 -0.3214 0.5049 0.1755
## 77 0.6118 -0.1173 -0.0483 0.3883 0.3685
## 78 0.2575 -0.0733 -0.2615 -0.2219 -0.0535
## 79 0.1516 -0.1636 0.0191 -0.1919 -0.2949
## 80 0.0701 0.3312 -0.2918 -0.0769 -0.0351
## 81 0.3498 -0.0168 -0.1987 -0.2178 -0.2787
## 82 0.4385 -0.0676 0.5027 -0.1549 -0.3415
## 83 -0.1197 -0.3087 -0.0914 0.0081 0.0418
## 84 -0.0237 0.0404 -0.0753 -0.0435 0.1568
## 85 0.9423 -0.6722 -0.0936 0.1654 0.1777
## 86 0.1698 0.0076 -0.0854 -0.2563 0.1306
## 87 0.0528 0.2375 0.5296 -0.3465 -0.0342
## 88 0.3535 0.2372 -0.0152 0.2493 -0.4706
## 89 0.3020 0.5853 -0.2874 0.3707 -0.1543
## 90 -0.1828 -0.1395 -0.0864 -0.0646 -0.0202
## 91 0.2825 0.6111 0.0389 0.1386 -0.1883
## 92 -0.1661 0.5746 -0.0884 -0.0571 0.4386
## 93 0.5961 -0.0645 0.0934 -0.1512 0.2685
## 94 0.8253 0.3476 0.1731 0.0618 -0.2316
## 95 0.1049 0.0564 -0.4232 -0.2338 -0.1330
## 96 0.0110 0.0607 -0.1227 0.1284 -0.3360
## 97 0.2013 0.2858 -0.0457 -0.0065 -0.0484
## 98 0.2358 -0.1552 -0.2668 -0.3040 0.0582
## 99 0.6652 -0.3362 -0.0846 0.4363 -0.9202
## 100 0.4810 0.8298 0.3367 0.4868 -1.1825
## 101 0.0822 0.0972 -0.0679 0.2649 0.4830
## 102 0.2918 0.0698 -0.3671 0.0769 -0.2876
## 103 -0.3736 -0.0556 -0.2573 0.1713 0.4213
## 104 0.6947 -0.2120 -0.0043 -0.5212 0.1106
## 105 0.0304 -0.1091 0.1399 0.0490 -0.2915
## 106 -0.1604 -0.0331 0.0617 -0.0644 -0.3826
## 107 0.4566 -0.5715 -0.0498 0.1211 -0.3834
## 108 0.3904 0.8060 0.0736 -0.0172 -0.2327
## 109 -0.0794 0.0112 -0.3442 0.0642 -0.1223
## 110 -0.0277 0.0016 0.1737 -0.1784 0.0437
## 111 -0.0373 0.1450 -0.2695 -0.0199 -0.3118
## 112 0.6635 0.4561 0.1513 -0.0774 0.1094
## 113 0.1405 -0.0234 -0.1589 -0.0310 0.4042
## 114 0.0151 0.3661 -0.0191 0.0020 -0.3100
## 115 -0.0345 0.0116 -0.2976 0.2319 0.5456
## 116 -0.0709 -0.0995 -0.2323 0.2957 -0.1990
## 117 0.0568 0.1689 -0.1033 0.1549 0.4332
## 118 -0.0736 0.0624 -0.0894 -0.2538 -0.0224
## 119 0.1132 0.1483 -0.2548 0.0971 0.3069
## 120 -0.0713 0.1242 -0.2314 -0.1328 -0.0120
## 121 0.1404 0.4490 0.5758 -0.1292 -0.1982
## 122 -0.1250 0.0728 0.0507 -0.1830 0.4365
## 123 0.3179 -0.7978 0.1321 0.0177 0.1200
## 124 -0.0267 0.6869 -0.3066 0.1100 0.1904
## 125 0.3066 -0.2489 -0.0774 0.1429 -0.4823
## 126 0.1078 0.1790 0.3247 -0.2501 -0.3343
## 127 0.2511 -0.0849 0.0815 -0.1027 -0.3609
## 128 0.4156 0.0914 -0.4835 -0.4059 0.3539
## 129 1.2279 -0.0174 1.5019 2.1380 0.4143
## 130 0.0082 -0.1282 0.1573 -0.3694 -0.1149
## 131 0.9073 -0.1362 -0.7389 0.0902 -0.1057
## 132 -0.2021 -0.0246 -0.5920 0.0372 0.4619
## 133 0.6988 -0.4237 -0.2587 -0.3658 -0.2949
## 134 0.1014 0.1659 0.5628 0.7207 -0.1354
## 135 0.2517 -0.0303 -0.4029 0.2317 -0.5038
## 136 -0.0091 0.3697 0.1764 0.0435 0.0202
## 137 0.0833 -0.0586 0.2449 0.1871 -0.2688
## 138 0.4123 0.0534 -0.0901 -0.1392 0.4304
## 139 -0.0106 0.1760 -0.2237 -0.0859 -0.2166
## 140 0.2156 -0.1679 -0.1173 0.1324 0.2699
## 141 0.0795 0.7007 0.2603 -0.2141 0.0907
## 142 0.5493 -0.4342 -0.2061 -0.1031 -0.1293
## 143 0.0531 -0.2827 0.1951 -0.0669 -0.3942
## 144 0.4189 0.1014 0.7561 0.1777 0.2854
## 145 0.2534 0.1297 -0.3538 0.1467 -0.1138
## 146 1.1710 -0.6673 0.2609 0.3429 0.1872
## 147 0.1440 0.5744 0.1824 0.2548 -0.1181
## 148 0.7053 0.3770 -0.5131 -0.3028 0.2786
## 149 0.2249 -0.8281 0.7891 -0.9017 0.3181
## 150 0.1161 -0.1598 -0.0672 -0.2945 0.0280
## 151 -0.2177 -0.2885 -0.0508 0.3079 0.5430
## 152 0.1687 -0.0077 0.0445 -0.1326 -0.4699
## 153 0.0265 -0.0792 -0.4036 0.0315 0.0925
## 154 1.1981 -0.0444 -0.2269 -0.1998 -0.1235
## 155 1.0998 0.2339 0.0897 0.2626 0.1386
## 156 0.0824 -0.1719 0.2062 0.2219 0.3213
## 157 0.1397 0.1146 0.1063 0.2031 -0.1345
## 158 0.1263 -0.1928 0.2142 -0.1255 0.2964
## 159 0.4199 0.3901 0.4295 -0.4672 0.3356
## 160 0.3750 0.3556 -0.2041 -0.3794 -0.0085
## 161 0.1790 -0.0833 -0.2551 -0.6224 -0.3264
## 162 0.0356 0.3495 -0.1701 0.0554 0.1324
## 163 0.0782 -0.2046 0.2863 -0.2273 0.0511
## 164 0.3299 -0.1578 -0.4370 0.0659 -0.2362
## 165 -0.2447 0.2387 -0.1380 0.3762 0.2347
## 166 -0.0692 0.2371 0.1684 -0.0644 -0.1539
## 167 -0.1763 -0.4038 -0.1819 -0.0063 0.2701
## 168 -0.2149 -0.3312 0.3042 0.2100 0.1556
## 169 0.3771 0.6256 0.4005 0.0086 0.3869
## 170 0.4456 0.1965 0.0933 -0.3180 -0.0774
## 171 0.4722 0.8438 0.4697 -0.3439 0.2359
## 172 0.5827 -0.1380 -0.4988 -0.5169 -0.0903
## 173 0.4000 0.5442 -0.0784 -0.2897 0.0245
## 174 -0.2929 -0.0357 -0.5233 0.2271 0.1849
## 175 0.0609 -0.4231 -0.0239 -0.1016 -0.0788
## 176 0.2088 -0.1988 0.1469 -0.6014 0.4944
## 177 0.5075 0.0594 -0.6211 0.4767 0.2133
## 178 0.2295 0.1505 0.3059 -0.0053 0.0888
## 179 0.2907 -0.3002 -0.0555 0.4731 -0.0426
## 180 0.1657 0.2581 -0.1690 0.1841 -0.1983
## 181 0.4972 0.6229 0.2867 0.0816 -0.1947
## 182 0.1081 0.1264 -0.9158 -0.2230 0.0763
## 183 -0.0884 -0.0556 -0.0735 -0.0588 -0.2204
## 184 -0.0522 -0.3429 -0.1315 0.1200 0.1353
## 185 0.6453 0.1058 0.3843 -0.1549 0.4376
## 186 -0.1947 -0.1950 -0.2066 0.4353 0.1696
## 187 -0.1043 -0.1393 -0.1345 0.1886 0.7082
## 188 -0.0049 0.1120 -0.5212 0.4170 0.2621
## 189 -0.2468 -0.2766 -0.1978 0.3566 0.0651
## 190 0.6728 0.1325 -0.0650 -0.0485 0.4450
## 191 0.2335 0.1460 -0.7693 0.2632 0.6226
## 192 0.5924 -0.1550 -0.3227 -0.0115 0.1420
## 193 -0.1131 0.1171 -0.0690 0.3339 -0.3698
## 194 0.9112 -1.0192 0.2332 0.1975 -0.1603
## 195 0.1137 0.1910 -0.1948 -0.3347 0.0224
## 196 0.4910 0.0054 0.3693 0.1293 -0.2570
## 197 0.8985 -0.2173 -0.2379 0.2814 -0.0761
## 198 0.3854 -0.6023 0.2084 -0.0420 -0.2147
## 199 0.5487 1.0821 -0.8993 -0.1236 -0.1112
## 200 0.2824 -0.1132 -0.4479 -0.1600 -0.3651
## 201 0.1317 -0.0732 0.1401 -0.2688 -0.4890
## 202 0.3448 -0.4459 -0.0769 -0.2537 -0.2360
## 203 0.2127 -0.3356 -0.0968 -0.4626 -0.1888
## 204 0.1858 -0.1371 0.2486 0.0743 -0.3153
## 205 0.6716 0.2283 0.1853 0.3844 0.0500
## 206 -0.1414 -0.0857 -0.0091 -0.2142 -0.0938
## 207 0.0655 -0.0235 -0.1410 -0.2000 0.1803
## 208 0.0679 0.2477 -0.1978 -0.2270 0.0988
## 209 0.0644 0.1354 0.4435 -0.2830 -0.1775
## 210 -0.2189 -0.2497 -0.0916 0.2072 0.1551
## 211 0.9777 0.1285 0.1576 0.1798 -0.1622
## 212 -0.0972 0.1345 -0.0640 0.3605 -0.5152
## 213 0.7748 0.1979 0.3576 0.1029 0.4574
## 214 0.5291 0.1789 -0.0036 0.1336 0.1885
## 215 -0.0994 0.0619 0.3528 -0.0294 -0.1394
## 216 0.0003 0.0381 -0.2214 0.5410 0.3958
## 217 -0.0866 0.0998 -0.2862 -0.1580 -0.0742
## 218 0.1209 0.1766 -0.3091 0.1226 -0.3517
## 219 -0.1019 -0.1027 -0.1612 0.3188 -0.1264
## 220 0.4625 0.0965 -0.1027 -0.1415 -0.4130
## 221 -0.0514 -0.0075 -0.3181 0.1811 -0.0185
## 222 0.0224 0.1017 0.1929 -0.2622 0.0156
## 223 -0.0810 -0.1150 0.0120 -0.3792 -0.0602
## 224 0.2381 0.6258 0.2088 -0.1725 -0.2018
## 225 -0.2304 -0.1982 -0.1416 -0.0427 0.2016
## 226 0.7342 1.0162 0.3708 -0.1039 0.1601
## 227 0.7590 0.2206 0.0143 0.3750 -0.1411
## 228 0.1157 -0.1172 -0.1340 -0.0497 -0.2187
## 229 -0.1075 -0.1027 0.0054 -0.1506 -0.0662
## 230 0.7723 -0.1057 -0.4025 -0.5037 0.5866
## 231 0.1584 0.2164 -0.1546 -0.3030 0.3370
## 232 0.1522 0.0884 0.2706 -0.1070 -0.3320
## 233 0.0009 0.0732 -0.0396 0.2859 -0.0907
## 234 -0.0483 0.1725 -0.1121 0.3944 -0.2235
## 235 -0.1933 0.0870 0.1385 0.5493 -0.2025
## 236 0.3533 -0.3270 -0.2138 0.2277 0.1940
## 237 0.1799 -0.3112 -0.1514 -0.1377 -0.3648
## 238 0.1770 0.3378 0.0277 -0.3497 0.0316
## 239 0.6111 -0.4369 0.4187 -0.0911 -0.4688
## 240 -0.0028 0.0026 -0.1018 0.3240 0.0351
## 241 0.2397 -0.0911 0.1837 -0.2320 0.0243
## 242 0.2116 -0.3110 0.0957 0.0367 0.1125
## 243 0.9207 0.3599 -0.1377 -0.1522 0.3074
## 244 -0.1723 0.4723 0.0808 0.0839 -0.1065
## 245 0.5166 0.6191 0.4714 -0.2113 0.0196
## 246 0.4967 -0.4111 -0.2533 0.1925 -0.5624
## 247 0.9312 -0.5833 0.1357 0.4414 -0.2614
## 248 -0.1121 -0.0334 0.1500 0.2537 -0.3206
## 249 0.4150 0.6988 0.2240 -0.3519 0.2215
## 250 1.5378 -0.5231 -0.0015 0.2325 -0.0229
## 251 -0.4223 -0.0269 -0.2153 0.0421 -0.1876
## 252 -0.6553 -0.0260 -0.1747 0.2172 -0.0626
## 253 -0.4575 -0.0897 -0.0961 0.0278 -0.1910
## 254 -0.4198 -0.0185 0.3225 -0.0038 -0.0856
## 255 -0.3841 -0.1773 0.0102 -0.2045 0.0974
## 256 -0.5787 0.1482 0.0349 0.1759 0.0284
## 257 -0.3678 0.1077 -0.1176 -0.1119 -0.2544
## 258 -0.4086 -0.0157 -0.0697 -0.0636 0.2365
## 259 -0.3466 0.0289 -0.0331 0.1052 -0.0271
## 260 -0.3645 -0.1470 0.2190 0.2243 0.0278
## 261 -0.3795 0.0577 -0.1209 0.1788 0.0112
## 262 -0.3171 -0.0786 0.1906 0.1105 -0.1121
## 263 -0.3167 0.0473 0.1324 0.0157 -0.0807
## 264 -0.3500 0.0581 0.0236 0.0764 -0.0490
## 265 -0.3131 -0.1455 0.1680 0.1874 -0.0702
## 266 -0.3858 0.1190 -0.1763 -0.0018 -0.1518
## 267 -0.2073 -0.0255 0.1210 0.0266 -0.0903
## 268 -0.4033 -0.0292 -0.0955 -0.0045 -0.1917
## 269 -0.4312 0.0774 0.0135 -0.0012 -0.1969
## 270 -0.4928 -0.1104 0.0472 0.2306 0.0185
## 271 -0.4046 -0.1922 -0.0461 0.2439 -0.0209
## 272 -0.3719 -0.1743 -0.1670 0.1521 -0.0977
## 273 -0.3328 -0.0961 0.2797 -0.0006 -0.0973
## 274 -0.1981 -0.0832 -0.0693 0.0570 -0.0980
## 275 -0.3602 -0.0956 -0.1106 0.2730 0.0535
## 276 -0.3011 -0.1327 0.2284 0.0946 -0.1426
## 277 -0.4089 -0.3141 0.1521 0.1243 0.2704
## 278 -0.4224 -0.2270 0.1579 -0.1280 0.1138
## 279 -0.4511 0.1140 -0.0167 -0.1412 0.2020
## 280 -0.5408 -0.1156 0.1608 0.1625 0.1686
## 281 -0.3942 0.0321 -0.0280 0.0899 -0.1234
## 282 -0.4497 0.1590 0.0860 -0.0239 -0.1749
## 283 -0.3193 -0.2261 0.0543 0.2171 0.1864
## 284 -0.3816 0.1894 -0.0565 -0.1823 -0.0146
## 285 -0.6178 -0.0163 0.1585 0.1672 0.0099
## 286 -0.3669 -0.0481 0.3115 -0.1596 -0.0143
## 287 -0.4244 0.1859 -0.0493 -0.2346 -0.1229
## 288 -0.3331 -0.1963 -0.0377 0.0044 0.0199
## 289 -0.4952 -0.0270 -0.1589 -0.2061 -0.1210
## 290 -0.4290 -0.0330 0.0303 0.0276 0.0577
## 291 -0.5267 0.1227 0.0064 -0.0831 -0.0009
## 292 -0.3734 -0.0711 0.1661 0.0590 -0.0707
## 293 -0.5071 0.0832 -0.1760 0.1802 -0.0217
## 294 -0.4462 -0.0329 0.1813 0.1596 -0.1741
## 295 -0.2696 0.0011 0.0779 -0.3853 0.0206
## 296 -0.5835 0.2185 -0.0353 -0.0090 0.0842
## 297 -0.4159 0.0105 0.1804 0.0559 0.1375
## 298 -0.4816 -0.0869 0.0253 -0.1920 0.0334
## 299 -0.3372 -0.1570 -0.1122 -0.0390 0.2017
## 300 -0.3991 0.1465 -0.1472 -0.4163 -0.0710
## 301 -0.4148 0.0485 -0.0540 -0.1072 0.1589
## 302 -0.5079 -0.1964 -0.0270 -0.1432 0.0557
## 303 -0.3949 -0.0751 -0.0865 0.1178 0.1132
## 304 -0.5111 0.0836 0.2097 -0.0723 0.0498
## 305 -0.2851 -0.2684 0.0731 0.1704 -0.0405
## 306 -0.3830 0.1219 -0.0509 0.1169 -0.0440
## 307 -0.4172 -0.0116 0.3194 0.0976 -0.0396
## 308 -0.3376 -0.0975 -0.1775 -0.1880 0.0863
## 309 -0.3645 -0.2146 0.0707 0.0839 -0.1013
## 310 -0.4307 0.0783 0.0860 -0.1717 0.1889
## 311 -0.3604 0.0405 -0.2048 -0.1441 0.1194
## 312 -0.4457 -0.0504 0.0673 -0.2326 0.0713
## 313 -0.3748 0.0585 0.2998 -0.2909 -0.0941
## 314 -0.4692 0.0952 0.0175 -0.0017 -0.2060
## 315 -0.4091 -0.1318 -0.1653 -0.0367 -0.0408
## 316 -0.3387 -0.2300 -0.0076 -0.0472 -0.0340
## 317 -0.3783 -0.1207 0.1734 0.2165 0.0261
## 318 -0.2516 -0.0492 -0.0134 -0.1237 -0.0494
## 319 -0.5290 -0.0176 0.2063 -0.1738 -0.0666
## 320 -0.2466 -0.1635 -0.1179 -0.0019 0.2421
## 321 -0.5015 0.1811 -0.0941 -0.2372 0.0996
## 322 -0.3257 0.0182 -0.1033 -0.2967 -0.0043
## 323 -0.3116 -0.1514 0.2450 -0.2908 0.0177
## 324 -0.4025 -0.1307 -0.0060 0.0654 -0.1701
## 325 -0.2763 -0.0529 -0.1511 0.1452 -0.1393
## 326 -0.4379 0.1359 0.2666 0.0384 -0.0683
## 327 -0.3711 -0.0294 -0.0502 -0.1293 0.1258
## 328 -0.5413 0.0237 -0.2305 -0.0539 0.2098
## 329 -0.4614 0.1246 -0.0154 0.1619 0.2378
## 330 -0.5197 -0.1825 -0.0950 -0.0284 0.1229
## 331 -0.1936 -0.0897 -0.0419 0.0923 -0.0543
## 332 -0.3294 -0.1905 0.0592 -0.1572 -0.1461
## 333 -0.5887 0.1356 -0.1325 0.0308 0.0982
## 334 -0.4912 -0.1176 0.2100 0.3269 0.0860
## 335 -0.3541 -0.2899 0.0935 0.2520 0.0045
## 336 -0.3623 0.1161 -0.0623 -0.2652 0.0927
## 337 -0.3828 -0.0669 -0.1639 0.0279 0.2516
## 338 -0.5615 -0.0099 0.1445 -0.0034 0.0803
## 339 -0.3745 -0.0466 0.2424 0.0096 -0.3267
## 340 -0.3455 -0.1950 -0.0488 0.1637 0.1469
## 341 -0.4792 -0.1305 0.0831 0.1046 0.1140
## 342 -0.3049 -0.1567 0.0145 -0.2024 -0.1988
## 343 -0.5025 0.0348 0.0956 -0.0787 0.2287
## 344 -0.5378 -0.1621 0.2072 -0.0230 0.1917
## 345 -0.3270 -0.0544 -0.1184 -0.3032 -0.0309
## 346 -0.5323 -0.1053 0.0007 0.0724 0.3573
## 347 -0.4989 -0.1019 0.2410 0.0517 0.2710
## 348 -0.5706 -0.0877 0.0777 -0.1238 0.1343
## 349 -0.4639 0.1315 0.0627 0.2198 0.0190
## 350 -0.6567 0.0627 -0.0791 0.0425 0.0716
## 351 -0.3962 -0.0549 -0.1012 -0.2352 -0.1992
## 352 -0.4045 -0.0384 -0.2740 0.1861 -0.0594
## 353 -0.4043 -0.1061 0.0250 -0.0251 0.1882
## 354 -0.4138 -0.3042 -0.0188 -0.0808 0.1839
## 355 -0.4760 -0.2414 -0.0611 -0.0104 0.1327
## 356 -0.4966 -0.2213 -0.0767 0.1007 0.2555
## 357 -0.6259 -0.0295 0.0897 0.0733 0.0936
## 358 -0.3209 0.0167 -0.1501 -0.2613 -0.1770
## 359 -0.3277 -0.1398 0.1624 -0.0783 0.1953
## 360 -0.4729 -0.0840 0.1152 -0.4371 -0.0919
## 361 -0.4742 -0.1793 0.1999 -0.0037 0.2441
## 362 -0.4998 -0.0732 0.1316 0.3198 0.0455
## 363 -0.4203 -0.0890 -0.2279 0.1124 -0.1184
## 364 -0.4374 -0.0796 0.2502 -0.1020 -0.2401
## 365 -0.2917 0.0103 0.0275 -0.2158 -0.2933
## 366 -0.4194 0.1213 -0.2538 0.2760 0.0587
## 367 -0.4475 0.0711 -0.0927 -0.1025 -0.2235
## 368 -0.5589 0.0114 0.2624 -0.2761 0.0373
## 369 -0.5478 0.0103 -0.0801 0.1883 -0.0260
## 370 -0.4270 0.1112 0.0177 -0.2440 -0.1050
## 371 -0.6336 -0.2394 0.0331 0.0205 0.1674
## 372 -0.4818 0.1792 -0.0839 -0.3556 -0.0320
## 373 -0.3465 -0.0333 -0.0154 -0.2052 -0.1162
## 374 -0.4627 0.1266 0.2077 -0.2214 -0.3315
## 375 -0.3964 0.1578 -0.0182 -0.1409 -0.2250
## 376 -0.5880 0.0371 -0.0181 -0.1627 -0.1150
## 377 -0.3393 -0.0903 0.0710 -0.1769 -0.1355
## 378 -0.3741 -0.0560 0.1273 -0.3249 -0.0251
## 379 -0.4930 -0.0897 0.2747 -0.0407 -0.1675
## 380 -0.4222 -0.0650 0.1315 -0.2616 0.1103
## 381 -0.4500 -0.1154 0.2677 -0.2214 -0.1524
## 382 -0.6265 0.0002 0.2875 0.0837 -0.0505
## 383 -0.3192 -0.1028 0.2834 -0.1856 -0.2983
## 384 -0.3532 0.0817 -0.2542 -0.2741 -0.0001
## 385 -0.4740 0.1046 0.0563 -0.1806 -0.0910
## 386 -0.5465 -0.0605 0.2915 0.0019 0.0775
## 387 -0.5707 -0.1793 0.0755 0.3244 0.0848
## 388 -0.5046 0.2256 0.0004 0.0321 -0.0588
## 389 -0.5729 0.0031 0.1739 0.1026 -0.0418
## 390 -0.4620 -0.1213 0.1924 0.0219 -0.1297
## 391 -0.5533 -0.0568 0.1980 0.1924 -0.0071
## 392 -0.3457 0.0573 0.1470 0.0044 -0.1330
## 393 -0.4735 -0.1694 0.0801 -0.1362 0.1733
## 394 -0.2381 0.0317 -0.0783 -0.0006 -0.1226
## 395 -0.3640 0.0452 0.1061 0.0322 -0.2185
## 396 -0.6677 0.0164 0.1146 0.0009 0.0558
## 397 -0.4833 -0.0855 0.0235 0.3562 0.1443
## 398 -0.5363 -0.1754 0.0907 0.1149 0.3411
## 399 -0.5520 0.0857 0.2465 -0.0269 -0.1075
##
##
## Stress per point (in %):
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 0.15 0.20 0.34 0.45 0.19 0.43 0.90 0.38 0.25 0.57 0.39 0.50 0.26 0.26 0.21 0.25
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 0.09 0.16 0.44 0.46 0.18 1.04 0.43 0.31 0.33 0.30 0.12 0.27 0.37 0.56 0.31 0.45
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 0.15 0.29 0.56 0.31 0.27 0.30 0.46 0.27 0.14 0.25 0.33 0.32 0.21 0.30 0.25 0.17
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 0.28 0.36 0.32 0.26 0.32 0.41 0.28 0.24 0.58 0.37 0.27 0.27 0.27 1.26 0.32 0.29
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 0.27 0.22 0.28 0.59 0.30 0.33 0.26 0.29 0.33 0.31 0.16 0.30 0.31 0.32 0.20 0.27
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 0.18 0.68 0.20 0.15 0.28 0.33 0.54 0.46 0.38 0.09 0.25 0.31 0.37 0.20 0.18 0.32
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 0.24 0.19 0.21 0.34 0.22 0.37 0.31 0.62 0.22 0.39 0.33 0.37 0.14 0.20 0.12 0.18
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 0.27 0.26 0.18 0.48 0.39 0.24 0.31 0.24 0.29 0.32 0.50 0.38 0.18 0.28 0.24 0.50
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 1.12 0.28 0.39 0.27 0.55 0.93 0.48 0.27 0.35 0.28 0.30 0.21 0.40 0.44 0.33 0.74
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 0.18 0.33 0.25 0.27 0.94 0.37 0.24 0.20 0.35 0.41 0.23 0.34 0.28 0.54 0.32 0.52
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 0.40 0.38 0.34 0.37 0.38 0.24 0.24 0.28 0.28 0.52 0.29 0.48 0.40 0.21 0.29 0.60
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 0.23 0.28 0.47 0.38 0.36 0.62 0.11 0.26 0.29 0.19 0.19 0.21 0.17 0.26 0.27 0.16
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 0.15 0.26 0.44 0.41 0.21 0.30 0.65 0.23 0.25 0.33 0.35 0.30 0.36 0.12 0.36 0.32
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 0.33 0.14 0.25 0.17 0.48 0.18 0.30 0.15 0.27 0.19 0.16 0.26 0.18 0.17 0.13 0.23
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 0.12 0.27 0.27 0.17 0.06 0.40 0.26 0.31 0.19 0.22 0.36 0.35 0.30 0.28 0.38 0.09
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 0.32 0.43 0.24 0.33 0.38 0.28 0.29 0.38 0.33 0.14 0.29 0.16 0.13 0.15 0.11 0.17
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 0.16 0.10 0.05 0.12 0.11 0.15 0.10 0.09 0.18 0.11 0.09 0.14 0.16 0.09 0.13 0.15
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 0.10 0.04 0.10 0.19 0.11 0.10 0.15 0.07 0.11 0.20 0.12 0.13 0.06 0.21 0.15 0.11
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 0.22 0.08 0.10 0.10 0.18 0.16 0.18 0.12 0.10 0.08 0.10 0.17 0.06 0.14 0.11 0.09
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 0.14 0.14 0.12 0.18 0.09 0.13 0.09 0.12 0.19 0.16 0.18 0.13 0.10 0.06 0.10 0.14
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 0.17 0.09 0.19 0.14 0.15 0.20 0.07 0.22 0.19 0.08 0.06 0.14 0.19 0.13 0.16 0.10
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 0.11 0.12 0.28 0.09 0.09 0.17 0.10 0.06 0.15 0.18 0.10 0.11 0.16 0.14 0.14 0.22
## 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368
## 0.06 0.25 0.13 0.17 0.06 0.14 0.12 0.28 0.11 0.14 0.17 0.13 0.13 0.28 0.12 0.09
## 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384
## 0.15 0.13 0.12 0.14 0.10 0.22 0.13 0.09 0.15 0.10 0.10 0.16 0.17 0.10 0.14 0.15
## 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399
## 0.12 0.12 0.12 0.12 0.08 0.11 0.10 0.08 0.13 0.08 0.15 0.07 0.10 0.12 0.07
X_mds <- mds$conf
dim(X_mds)
## [1] 399 5
The 5-dimensional MDS solution resulted in a stress value of 0.09, which indicates a good fit. This means that the original distance relationships between patients are well preserved in the reduced space.
Most observations show very low stress contributions, suggesting they are well represented by the 5D configuration. Only a few patients have higher stress values (like Patient 62 and 129), indicating possible outliers with more unusual clinical profiles.
Overall, this confirms that the 5-dimensional MDS representation is suitable for further clustering analysis.
km_mds <- factoextra::eclust(X_mds, "kmeans", k = 2, graph = FALSE)
fviz_cluster(list(data = X_mds, cluster = km_mds$cluster), geom="point", stand=FALSE)
fviz_silhouette(km_mds)
## cluster size ave.sil.width
## 1 1 190 0.04
## 2 2 209 0.49
table(km_mds$cluster,y)
## y
## ckd notckd
## 1 190 0
## 2 60 149
For K-means, MDS does not meaningfully improve clustering beyond what is already achieved in the original feature space; only four more data points end up in the correct group. This confirms that MDS is effective at preserving structure, but not necessarily at enhancing cluster separability for a variance-driven algorithm like K-means.
pam_mds <- factoextra::eclust(X_mds, "pam", k = 2, graph = FALSE)
fviz_cluster(list(data = X_mds, cluster = pam_mds$cluster), geom = "point", stand = FALSE)
fviz_silhouette(pam_mds)
## cluster size ave.sil.width
## 1 1 195 0.49
## 2 2 204 0.04
table(pam_mds$cluster,y)
## y
## ckd notckd
## 1 46 149
## 2 204 0
MDS provides a modest improvement for PAM by making the dominant cluster more compact and slightly reducing class mixing. However, the gain is limited, and the overall clustering behavior remains similar to that observed in the original feature space.
hc_mds <- hclust(dist(X_mds), method = "ward.D2")
hc_mds_cut <- cutree(hc_mds, k = 2)
fviz_cluster(
list(
data = X_mds[, 1:min(2, ncol(X_mds)), drop = FALSE],
cluster = hc_mds_cut
),
main = paste0("Hierarchical clustering on MDS"),
geom = "point",
stand = FALSE
)
sil_hc_mds <- silhouette(hc_mds_cut, dist(X_mds))
fviz_silhouette(sil_hc_mds)
## cluster size ave.sil.width
## 1 1 152 0.56
## 2 2 247 0.00
table(hc_mds_cut, y)
## y
## hc_mds_cut ckd notckd
## 1 6 146
## 2 244 3
MDS provides a clear improvement for hierarchical clustering by reducing class mixing and producing clusters that align more closely with the underlying CKD and non-CKD groups. While one cluster remains geometrically spread out, the separation in terms of true class membership becomes noticeably cleaner compared to the original feature space.
Among all methods, hierarchical clustering benefits the most from MDS. Overall, MDS does not radically change the cluster structure, but it does meaningfully reduce misclassification and slightly improves cluster compactness, particularly for weaker clusters, without consistently strengthening separation across all methods.
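Using the same adjusted Rand index as before, the class agreement of every partition in the original and MDS spaces can be summarised in one place; a sketch (the object name ari is illustrative):
# Adjusted Rand index of each partition against the class label
ari <- c(
  kmeans_X   = flexclust::randIndex(table(km2_e$cluster, y)),
  pam_X      = flexclust::randIndex(table(pam2_e$cluster, y)),
  hclust_X   = flexclust::randIndex(table(hc_cut$cluster, y)),
  kmeans_MDS = flexclust::randIndex(table(km_mds$cluster, y)),
  pam_MDS    = flexclust::randIndex(table(pam_mds$cluster, y)),
  hclust_MDS = flexclust::randIndex(table(hc_mds_cut, y))
)
round(sort(ari, decreasing = TRUE), 3)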
To ensure that this configuration was not due to random structure, I performed a Random Stress Test.
cat("Empirical stress:", mds$stress, "\n")
## Empirical stress: 0.09198974
stressvec <- smacof::randomstress(n = nrow(X), ndim = 5, nrep = 50)
cat("Mean random stress:", mean(stressvec), "\n")
## Mean random stress: 0.4999112
cat("Stress ratio (empirical / random):", mds$stress / mean(stressvec), "\n")
## Stress ratio (empirical / random): 0.1840122
plot(mds, main = "SMACOF MDS (point size shows stress per point)")
plot(mds$conf, pch = 19, col = y, main = "SMACOF MDS configuration colored by class")
The empirical stress (0.09) is substantially lower than the Mean Random Stress (0.499), resulting in a Stress Ratio of 0.184. This confirms that the 5D MDS representation captures significantly more structure than a random configuration.
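smacof also offers a permutation test that compares the empirical stress with stress values obtained from permuted dissimilarity matrices; the call below is a sketch (nrep is illustrative and the test can be slow to run).
# Permutation test on the dissimilarities underlying the 5-dimensional solution
perm <- smacof::permtest(mds, nrep = 100, verbose = FALSE)
perm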
I also performed a Mantel Test to validate the transformation of the correlation matrix into a distance (dissimilarity) matrix.
sim_num <- cor(X_num)
dis_from_sim <- smacof::sim2diss(sim_num, method = 1, to.dist = TRUE)
mantel.test(as.matrix(sim_num), as.matrix(dis_from_sim))
## $z.stat
## [1] -7.952564
##
## $p
## [1] 0.001
##
## $alternative
## [1] "two.sided"
The negative value here is expected and it indicates a strong inverse relationship between the two matrices. This makes sense because as the correlation between two variables increases, the distance between them must decrease. Also, since the p-value is well below the standard threshold of 0.05, we reject the null hypothesis that the two matrices are unrelated.
So we can say that there is a statistically significant relationship between our original correlations and the calculated distances.
While MDS focuses on preserving pairwise distances, PCA provides a variance-based view of the data. I therefore applied PCA to compare how variance-driven dimensionality reduction affects clustering performance.
I plotted a correlation matrix for the scaled numeric variables to check for any strong redundancy between them. This helps me understand how much of the variance might be captured by the first few principal components (PCs).
num_cor <- cor(as.matrix(X_num), method = "pearson")
corrplot::corrplot(num_cor, order = "alphabet", tl.cex = 0.75)
The correlation plot shows that certain variables, like hemo and pcv, are highly correlated, while others like pot and sod have weaker correlations. This suggests that PCA will likely capture significant variance with just a few components, as the data shows some degree of redundancy between the variables.
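To make that redundancy explicit, the strongly correlated pairs can be listed directly from num_cor; a small sketch (the 0.7 cutoff is illustrative):
# Variable pairs in the numeric block with |r| > 0.7
high_cor <- which(abs(num_cor) > 0.7 & upper.tri(num_cor), arr.ind = TRUE)
data.frame(
  var1 = rownames(num_cor)[high_cor[, 1]],
  var2 = colnames(num_cor)[high_cor[, 2]],
  r    = round(num_cor[high_cor], 2)
)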
I also ran a PCA on the scaled feature matrix X to reduce the dimensions and identify the primary drivers of variance in the data.
pca1 <- prcomp(X, center = FALSE, scale. = FALSE)  # numeric block was already centred/scaled above; binary columns stay 0/1
pca2 <- princomp(X, cor = FALSE)                   # alternative fit via princomp, kept for comparison
fviz_eig(pca1, addlabels = TRUE, main = "Scree plot: variance explained (PCA on X)")
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
eig <- factoextra::get_eigenvalue(pca1)
head(eig)
From the outputs, I observed that Dim.1 and Dim.2 account for 31.1% and 10.9% of the variance, respectively. Together, these first two components capture a cumulative total of only 42% of the information.
cum_var <- cumsum(eig$variance.percent)
cum_var
## [1] 31.11465 42.01136 50.26028 57.32449 63.67347 69.46141 74.69301
## [8] 78.78563 82.63228 85.58116 88.35236 90.74606 92.82611 94.34145
## [15] 95.31320 96.17149 96.95888 97.58304 98.14983 98.66543 99.13059
## [22] 99.48949 99.75278 100.00000
plot(cum_var, type = "l", xlab = "PC", ylab = "Cumulative variance (%)",
main = "Cumulative variance")
By plotting the cumulative variance, I observed that the first seven principal components capture 74.7% of the data’s total variance, with the curve flattening noticeably after about the 12th PC. This confirms that the leading seven dimensions carry the primary “signal” of the dataset, justifying their use as a low-dimensional representation that effectively simplifies the original 24 variables.
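If a variance-based reduced representation is wanted for later steps, the scores on these leading components can be extracted directly; a minimal sketch (the name X_pca7 is illustrative):
# Scores on the first seven principal components as a reduced feature matrix
X_pca7 <- pca1$x[, 1:7]
dim(X_pca7)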
I analyzed the variable map and contribution plots to see which clinical markers were actually driving the patterns in the data.
fviz_pca_var(pca1, repel = TRUE) + ggtitle("PCA variable map")
a1 <- fviz_contrib(pca1, choice = "var", axes = 1, top = 15) + ggtitle("Top contributions to PC1")
a2 <- fviz_contrib(pca1, choice = "var", axes = 2, top = 15) + ggtitle("Top contributions to PC2")
gridExtra::grid.arrange(a1, a2, ncol = 2)
The PCA results show that different clinical variables drive each principal component rather than a single dominant factor. PC1 is mainly influenced by blood-related markers such as packed cell volume, hemoglobin, and red blood cell count, together with renal indicators like blood urea and albumin, capturing overall blood health and kidney function. PC2 is primarily shaped by metabolic and electrolyte-related variables, including sugar, blood glucose, specific gravity, and sodium.
Together, these components indicate that patient separation in the PCA space is driven by a combination of blood health, kidney function, and metabolic regulation. By concentrating this clinically meaningful variation into a small number of dimensions, PCA provides a representation that aligns well with CKD-related structure in the data.
I generated two maps to examine how individual patients are positioned within the PCA space.
fviz_pca_ind(pca1, col.ind = y, repel = TRUE) + ggtitle("Individuals on PCA map (colored by class)")
fviz_pca_ind(pca1, col.ind = "cos2", gradient.cols = c("white","#2E9FDF","#FC4E07")) +
ggtitle("Individuals colored by cos2 quality")
The first plot shows a clear separation trend between ckd and notckd patients along the first principal component. The non-CKD group forms a tight and compact cluster on the left, while CKD patients are more spread out across the PCA space, reflecting greater clinical variability in disease severity.
The second plot uses cos2 values to indicate how well each patient is represented by the two-dimensional PCA map. Most points at the extremes of the plot, especially those far from the origin, show high cos2 values, meaning they are well captured by the first two components. Points closer to the center tend to have lower cos2 values, indicating more average clinical profiles that are not fully explained by the first two dimensions.
Overall, these plots confirm that the PCA projection captures the main structure of the data well, particularly for the more distinct and extreme patient profiles, while some overlap near the center is expected in a reduced two-dimensional view.
I generated a PCA biplot to visualize how individual patients and clinical variables interact within the same space.
fviz_pca_biplot(
  pca1,
  geom.ind = "point",
  col.ind = ifelse(y == "ckd", "blue", "red"),
  col.var = "black",
  repel = TRUE,
  title = "PCA Biplot"
)
The PCA biplot shows a clear separation pattern driven by clinically meaningful variables. Non-CKD patients cluster tightly on the left side of the plot, aligning with higher Hemoglobin (hemo), Packed Cell Volume (pcv), and Red Blood Cell Count (rbcc), which are markers of healthier blood profiles.
In contrast, CKD patients are more dispersed and extend toward the right side of the map, in the direction of renal and metabolic indicators such as Blood Urea (bu), Serum Creatinine (sc), Albumin (al), Blood Glucose (bgr), and Sugar (su). The strong alignment of bu and sc indicates their close relationship and central role in distinguishing diseased patients.
Overall, the biplot links the geometric separation of patients directly to underlying clinical signals, showing that variation in kidney function and blood health is the main driver of structure in the data.
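As a numeric counterpart to the biplot arrows, the variable-to-component correlations can be listed directly (a sketch via factoextra::get_pca_var(); var_info is an illustrative name, and sorting by the PC1 correlation is only for readability):
# Correlations between each clinical variable and the first two PCs;
# variables on the healthy side of the map and on the CKD side should
# show opposite-signed PC1 correlations.
var_info <- factoextra::get_pca_var(pca1)
round(var_info$cor[order(var_info$cor[, 1]), 1:2], 2)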
I applied a Varimax rotation to the first three principal components to create a cleaner, more interpretable structure by grouping variables onto specific axes.
pca_rot <- psych::principal(X, nfactors = 3, rotate = "varimax")
pca_rot
## Principal Components Analysis
## Call: psych::principal(r = X, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC3 RC2 h2 u2 com
## age 0.18 -0.04 0.54 0.327 0.67 1.2
## bp 0.26 0.09 0.25 0.142 0.86 2.2
## bgr -0.02 0.19 0.78 0.639 0.36 1.1
## bu 0.74 0.21 0.07 0.596 0.40 1.2
## sc 0.73 -0.08 0.04 0.543 0.46 1.0
## sod -0.59 -0.02 -0.09 0.358 0.64 1.0
## pot 0.24 -0.03 0.06 0.063 0.94 1.2
## hemo -0.69 -0.44 -0.23 0.727 0.27 1.9
## pcv -0.68 -0.45 -0.25 0.721 0.28 2.0
## wbcc -0.09 0.31 0.21 0.150 0.85 2.0
## rbcc -0.61 -0.38 -0.24 0.582 0.42 2.0
## sg -0.22 -0.54 -0.31 0.434 0.57 2.0
## al 0.21 0.76 0.19 0.654 0.35 1.3
## su -0.08 0.17 0.75 0.604 0.40 1.1
## rbc 0.16 0.49 -0.02 0.263 0.74 1.2
## pc 0.21 0.75 0.07 0.604 0.40 1.2
## pcc 0.02 0.62 0.14 0.404 0.60 1.1
## ba -0.02 0.59 -0.01 0.344 0.66 1.0
## htn 0.53 0.19 0.53 0.609 0.39 2.3
## dm 0.34 0.13 0.72 0.649 0.35 1.5
## cad 0.25 0.12 0.38 0.218 0.78 2.0
## appet 0.35 0.35 0.17 0.274 0.73 2.4
## pe 0.36 0.40 0.10 0.299 0.70 2.1
## ane 0.59 0.21 -0.02 0.395 0.61 1.2
##
## RC1 RC3 RC2
## SS loadings 4.16 3.54 2.90
## Proportion Var 0.17 0.15 0.12
## Cumulative Var 0.17 0.32 0.44
## Proportion Explained 0.39 0.33 0.27
## Cumulative Proportion 0.39 0.73 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 1011.61 with prob < 4e-106
##
## Fit based upon off diagonal values = 0.94
summary(pca_rot)
##
## Factor analysis with Call: psych::principal(r = X, nfactors = 3, rotate = "varimax")
##
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 207 and the objective function was 2.47
## The number of observations was 399 with Chi Square = 956.69 with prob < 1.1e-96
##
## The root mean square of the residuals (RMSA) is 0.07
print(loadings(pca_rot), digits = 3, cutoff = 0.35, sort = TRUE)
##
## Loadings:
## RC1 RC3 RC2
## bu 0.740
## sc 0.731
## sod -0.592
## hemo -0.693 -0.441
## pcv -0.677 -0.445
## rbcc -0.614 -0.383
## htn 0.535 0.534
## ane 0.593
## sg -0.540
## al 0.756
## pc 0.746
## pcc 0.620
## ba 0.586
## age 0.540
## bgr 0.776
## su 0.755
## dm 0.717
## bp
## pot
## wbcc
## rbc 0.486
## cad 0.377
## appet 0.354
## pe 0.362 0.397
##
## RC1 RC3 RC2
## SS loadings 4.162 3.535 2.902
## Proportion Var 0.173 0.147 0.121
## Cumulative Var 0.173 0.321 0.442
After Varimax rotation, the three components show clear and clinically interpretable structure. RC1 is primarily associated with kidney dysfunction and anemia, with strong positive loadings for Blood Urea (bu), Serum Creatinine (sc), Hypertension (htn), and Anemia (ane), and strong negative loadings for Hemoglobin (hemo), Packed Cell Volume (pcv), and Red Blood Cell Count (rbcc). This component captures the contrast between impaired renal function and healthy blood profiles.
RC2 represents a metabolic and diabetic dimension, dominated by Blood Glucose Random (bgr), Sugar (su), Diabetes Mellitus (dm), and Age. This axis reflects metabolic stress and glucose-related abnormalities commonly associated with CKD progression.
RC3 is driven mainly by infection and urine abnormality indicators, with high loadings for Albumin (al), Pus Cell (pc), Pus Cell Clumps (pcc), Bacteria (ba), and Specific Gravity (sg). This component reflects inflammatory and urinary pathology.
Overall, the top rotated loadings are driven by variables that are clinically relevant to kidney function, anemia, and metabolic regulation, which fits expectations for CKD-related separation.
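The rotated component scores are also available in case one wanted to repeat the clustering on these three interpretable axes rather than on the unrotated PCs. The following is only a sketch (psych::principal() stores per-patient scores in $scores by default; scores_rot and km_rot are illustrative names, and no results from this variant are reported here):
# Per-patient scores on the varimax-rotated components
# (RC1 = renal/anemia, RC2 = metabolic/diabetic, RC3 = urinary/infection).
scores_rot <- pca_rot$scores
head(round(scores_rot, 2))
# Optional: a quick K-means on the rotated scores, for comparison only.
set.seed(123)
km_rot <- kmeans(scores_rot, centers = 2, nstart = 50)
table(km_rot$cluster, y)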
I used the 70% variance threshold to determine the optimal number of dimensions for the rest of the analysis.
k_pc <- which(cum_var >= 70)[1]
k_pc
## [1] 7
X_pca_k <- pca1$x[, 1:k_pc, drop = FALSE]
From the output, I identified that the first 7 principal components are sufficient to retain the majority of the dataset’s clinical information. I then subset the PCA scores into a new matrix, X_pca_k, keeping only these 7 dimensions and discarding the remaining 17.
km_pca <- eclust(X_pca_k, "kmeans", k = 2, graph = FALSE)
fviz_cluster(km_pca, main = paste0("K-means on PCA ", k_pc, " PCs"), geom = "point", stand = FALSE)
fviz_silhouette(km_pca)
## cluster size ave.sil.width
## 1 1 187 0.09
## 2 2 212 0.50
table(km_pca$cluster, y)
## y
## ckd notckd
## 1 187 0
## 2 63 149
Compared to the original feature space, PCA leads to a modest improvement in the silhouette score and clustering quality for K-means by making the weaker cluster slightly more compact. However, the overall class alignment remains largely unchanged. This indicates that PCA mainly reduces noise and improves geometric cohesion, rather than uncovering a fundamentally different or clearer separation between CKD and non-CKD patients.
pam_pca <- eclust(X_pca_k, "pam", k = 2, graph = FALSE)
fviz_cluster(pam_pca, main = paste0("PAM on PCA (first ", k_pc, " PCs)"), geom = "point", stand = FALSE)
fviz_silhouette(pam_pca)
## cluster size ave.sil.width
## 1 1 176 0.54
## 2 2 223 0.08
table(pam_pca$cluster,y)
## y
## ckd notckd
## 1 27 149
## 2 223 0
For PAM, PCA provides a more noticeable improvement than for K-means. After dimensionality reduction, the main cluster becomes substantially more compact, and the weaker cluster shows a clear gain in separation. This is reflected in the increase in average silhouette width and the cleaner silhouette shape. While some overlap still remains, PCA reduces within-cluster dispersion and improves robustness, without distorting the underlying class structure.
dist_pca <- dist(X_pca_k)
hc_pca <- hclust(dist_pca, method = "ward.D2")
hc_pca_cut <- cutree(hc_pca, k = 2)
fviz_cluster(
list(
data = X_pca_k[, 1:min(2, ncol(X_pca_k)), drop = FALSE],
cluster = hc_pca_cut
),
main = paste0("Hierarchical clustering on PCA (first ", k_pc, " PCs)"),
geom = "point",
stand = FALSE
)
sil_hc_pca <- silhouette(hc_pca_cut, dist_pca)
fviz_silhouette(sil_hc_pca)
## cluster size ave.sil.width
## 1 1 151 0.58
## 2 2 248 0.05
table(hc_pca_cut,y)
## y
## hc_pca_cut ckd notckd
## 1 6 145
## 2 244 4
Applying PCA improves hierarchical clustering compared to the original feature space. In the original data, one cluster was compact while the other showed weak separation and slight dispersion. After PCA, the main cluster becomes more compact and the weaker cluster improves from negative to positive silhouette values. Overall, PCA reduces noise and leads to a more stable and better-defined hierarchical clustering structure, even though some overlap remains.
Among all methods, hierarchical clustering benefits the most from PCA, showing the largest gain in cluster compactness and overall silhouette quality.
par(mfrow = c(1, 2))
plot(pca1$x[, 1], pca1$x[, 2], col = y, pch = 19, xlab = "PC1", ylab = "PC2", main = "PCA: First Two Principal Components")
legend("topright", legend = levels(y), col = 1:length(levels(y)), pch = 19, cex = 0.8)
plot( mds_euc[, 1], mds_euc[, 2], col = y, pch = 19, xlab = "D1", ylab = "D2", main = "MDS: First Two Dimensions (Euclidean)")
legend("topright", legend = levels(y), col = 1:length(levels(y)), pch = 19, cex = 0.8)
par(mfrow = c(1, 1))
The PCA and MDS projections in two dimensions look almost identical, with both methods showing the same tight notckd cluster and a more dispersed CKD group. Since the visual separation is almost the same, I compared silhouette widths and the alignment between true class labels and cluster assignments to evaluate the underlying clustering quality.
sil_plot <- function(sil_obj, title_prefix) {
avg_sil <- round(mean(sil_obj[, "sil_width"]), 3)
fviz_silhouette(sil_obj) +
ggtitle(paste0(title_prefix,
"\nAvg silhouette width = ", avg_sil))
}
p_km_mds <- sil_plot(silhouette(km_mds$cluster, dist(X_mds)), "K-means (MDS, 5 Dimensions)")
## cluster size ave.sil.width
## 1 1 190 0.04
## 2 2 209 0.49
p_km_pca <- sil_plot(silhouette(km_pca$cluster, dist(X_pca_k)),"K-means (PCA, First 7 PCs)")
## cluster size ave.sil.width
## 1 1 187 0.09
## 2 2 212 0.50
table(km_mds$cluster,y)
## y
## ckd notckd
## 1 190 0
## 2 60 149
table(km_pca$cluster,y)
## y
## ckd notckd
## 1 187 0
## 2 63 149
gridExtra::grid.arrange(p_km_mds, p_km_pca, nrow = 2)
Both PCA and MDS produce almost identical clustering results. Each method separates CKD and notckd patients in nearly the same way, with only a few observations differing between clusters. The slightly higher silhouette width for PCA reflects marginally tighter clusters, but there is no meaningful improvement in class alignment compared to MDS.
p_pam_mds <- sil_plot(silhouette(pam_mds$cluster, dist(X_mds)), "PAM (MDS, 5 Dimensions)")
## cluster size ave.sil.width
## 1 1 195 0.49
## 2 2 204 0.04
p_pam_pca <- sil_plot(silhouette(pam_pca$cluster, dist(X_pca_k)), "PAM (PCA, First 7 PCs)")
## cluster size ave.sil.width
## 1 1 176 0.54
## 2 2 223 0.08
table(pam_mds$cluster,y)
## y
## ckd notckd
## 1 46 149
## 2 204 0
table(pam_pca$cluster,y)
## y
## ckd notckd
## 1 27 149
## 2 223 0
gridExtra::grid.arrange(p_pam_mds, p_pam_pca, nrow = 2)
For PAM, both MDS and PCA produce the same high-level clustering structure, with one cluster dominated by notckd patients and the other by ckd patients. However, PCA improves class alignment by reducing the number of CKD patients mixed into the notckd-dominated cluster. This is reflected in cleaner class alignment and a slightly higher average silhouette width, indicating more compact and coherent clusters under PCA.
p_hc_mds <- sil_plot(sil_hc_mds,"Hierarchical (MDS, 5 Dimensions)")
## cluster size ave.sil.width
## 1 1 152 0.56
## 2 2 247 0.00
p_hc_pca <- sil_plot(sil_hc_pca, "Hierarchical (PCA, First 7 PCs)")
## cluster size ave.sil.width
## 1 1 151 0.58
## 2 2 248 0.05
table(hc_mds_cut,y)
## y
## hc_mds_cut ckd notckd
## 1 6 146
## 2 244 3
table(hc_pca_cut,y)
## y
## hc_pca_cut ckd notckd
## 1 6 145
## 2 244 4
gridExtra::grid.arrange(p_hc_mds, p_hc_pca, nrow = 2)
For hierarchical clustering, MDS and PCA produce almost identical class alignment, with only a negligible difference in misclassified cases. In practice, both methods recover the same underlying structure. However, PCA consistently achieves a higher average silhouette width, indicating slightly more compact and better-defined clusters overall.
Overall, PCA performs slightly better than MDS for this dataset. Although both methods produce very similar visual structures and almost identical class alignment across clustering algorithms, PCA consistently yields higher average silhouette scores, indicating more compact and better-defined clusters.
MDS, even when using five dimensions, does not provide a meaningful advantage in terms of class alignment or clustering quality. In most cases it preserves the structure already present in the data rather than improving it; the only notable difference between the two reductions, the slightly cleaner PAM class alignment under PCA, is small and does not change the overall interpretation.
PCA therefore appears to be the more effective dimensionality reduction method here, as it achieves comparable or better class alignment while consistently improving internal clustering quality.
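One simple, package-free way to back up the claim that the two low-dimensional configurations are nearly identical is to correlate their pairwise patient distances; a correlation close to 1 would confirm that the 2-D PCA and MDS maps encode essentially the same geometry (a rough sketch; d_pca2 and d_mds2 are illustrative names):
# Correlation between interpoint distances in the 2-D PCA map and the
# 2-D Euclidean MDS map: a crude similarity measure of the two layouts.
d_pca2 <- dist(pca1$x[, 1:2])
d_mds2 <- dist(mds_euc[, 1:2])
cor(as.vector(d_pca2), as.vector(d_mds2))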
To evaluate whether dimensionality reduction improves clustering, I compared the average silhouette widths obtained from clustering on the original feature space and on the PCA-reduced space.
avg_sil <- data.frame(
Method = c(
"K-means (Original)", "K-means (PCA)",
"PAM (Original)", "PAM (PCA)",
"Hierarchical (Original)", "Hierarchical (PCA)"
),
Avg_Silhouette = c(
mean(silhouette(km2_e$cluster, dist(X))[, 3]),
mean(silhouette(km_pca$cluster, dist(X_pca_k))[, 3]),
mean(silhouette(pam2_e$cluster, dist(X))[, 3]),
mean(silhouette(pam_pca$cluster, dist(X_pca_k))[, 3]),
mean(silhouette(hc_cut$cluster, dist(X))[, 3]),
mean(silhouette(hc_pca_cut, dist(X_pca_k))[, 3])
)
)
avg_sil
Across all methods, PCA leads to higher average silhouette values. The improvement is most noticeable for hierarchical clustering and K-means, where PCA reduces noise and results in more compact and better-separated clusters. PAM shows a smaller but still consistent improvement.
Overall, this confirms that clustering in the PCA-reduced space provides more stable and meaningful group structure than clustering directly on the original high-dimensional data.
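The same comparison can be extended with an external criterion. Adding the adjusted Rand index (ARI) against the held-aside labels next to each silhouette value shows whether the geometric gains from PCA come at any cost in class alignment. This sketch reuses the partitions from the table above and flexclust::randIndex(), which is already loaded (values near 1 indicate near-perfect agreement with the true classes, values near 0 chance-level agreement):
# External validation: ARI vs. the true ckd/notckd labels for the same six partitions.
avg_sil$ARI <- c(
flexclust::randIndex(table(km2_e$cluster, y)),
flexclust::randIndex(table(km_pca$cluster, y)),
flexclust::randIndex(table(pam2_e$cluster, y)),
flexclust::randIndex(table(pam_pca$cluster, y)),
flexclust::randIndex(table(hc_cut$cluster, y)),
flexclust::randIndex(table(hc_pca_cut, y))
)
avg_sil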
I also tested whether the cluster structure transfers to unseen patients. To assess clustering stability, the data was split into training (80%) and test (20%) sets.
set.seed(123)
idx <- sample(seq_len(nrow(X_pca_k)), size = floor(0.8 * nrow(X_pca_k)))
X_train <- X_pca_k[idx, ]
X_test <- X_pca_k[-idx, ]
y_test <- y[-idx]
km_train <- eclust(X_train, "kmeans", k = 2, nstart = 50, graph = FALSE)
km_kcca <- flexclust::as.kcca(km_train, X_train)
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
km_pred <- predict(km_kcca, X_test)
print(table(km_pred,y_test))
## y_test
## km_pred ckd notckd
## 1 42 0
## 2 16 22
p1 <- fviz_cluster(km_train, geom = "point", stand = FALSE) +
ggtitle("Train Clusters (7 PCs)")
p2 <- fviz_cluster(list(data = X_test, cluster = km_pred), geom = "point", stand = FALSE) +
ggtitle("Predicted Test Clusters (7 PCs)")
gridExtra::grid.arrange(p1, p2, nrow = 2)
s1 <- fviz_silhouette(km_train) + ggtitle("K-means Train Silhouette")
## cluster size ave.sil.width
## 1 1 143 0.11
## 2 2 176 0.50
km_test_sil <- cluster::silhouette(km_pred, dist(X_test))
s2 <- fviz_silhouette(km_test_sil) + ggtitle("K-means Test Silhouette")
## cluster size ave.sil.width
## 1 1 42 0.00
## 2 2 38 0.49
In the training set, one cluster is clearly well defined, with a high average silhouette width (0.5), while the other cluster shows weaker separation (0.11), indicating some overlap.
When applied to the test set, a similar pattern is observed. The well-separated cluster remains stable with an average silhouette width of 0.49, while the weaker cluster again shows low separation (0.00). This consistency suggests that the clustering structure generalizes reasonably well to unseen data.
The test-set cluster composition shows that one cluster contains only CKD cases, while the other mixes 16 CKD with 22 non-CKD patients. Overall, clustering in the PCA-reduced space improves stability compared to the original feature space, although the class alignment on the test set remains imperfect.
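For intuition, the kcca-based prediction above effectively assigns each test patient to the nearest training centroid in the 7-PC space. A manual nearest-centroid sketch makes that mechanism explicit (centres, d_to_centres and manual_pred are illustrative names; the result is expected to match the kcca prediction, possibly up to label order):
# Manual nearest-centroid assignment: each test row goes to the closest
# k-means centre from the training fit.
centres <- km_train$centers
d_all <- as.matrix(dist(rbind(centres, X_test)))
d_to_centres <- d_all[-(1:nrow(centres)), 1:nrow(centres), drop = FALSE]
manual_pred <- apply(d_to_centres, 1, which.min)
table(manual_pred, km_pred)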
pam_train <- eclust(X_train, "pam", k = 2, graph = FALSE)
pam_kcca <- flexclust::as.kcca(pam_train, X_train)
pam_pred <- predict(pam_kcca, X_test)
print(table(Pred = pam_pred, True = y_test))
## True
## Pred ckd notckd
## 1 52 0
## 2 6 22
p3 <- fviz_cluster(pam_train, geom = "point", stand = FALSE) +
ggtitle("PAM Train Clusters (7 PCs)")
p4 <- fviz_cluster(list(data = X_test, cluster = pam_pred), geom = "point", stand = FALSE) +
ggtitle("PAM Predicted Test Clusters (7 PCs)")
gridExtra::grid.arrange(p3, p4, nrow = 2)
s3 <- fviz_silhouette(pam_train) + ggtitle("PAM Train Silhouette")
## cluster size ave.sil.width
## 1 1 178 0.09
## 2 2 141 0.54
pam_test_sil <- cluster::silhouette(pam_pred, dist(X_test))
s4 <- fviz_silhouette(pam_test_sil) + ggtitle("PAM Test Silhouette")
## cluster size ave.sil.width
## 1 1 52 0.02
## 2 2 28 0.53
PAM clustering on the first 7 principal components shows a pattern similar to K-means. In the training set, one cluster is well defined, with a high average silhouette width of 0.54, while the other cluster shows weaker separation (0.09), indicating some overlap.
When applied to the test set, the well-separated cluster remains stable with an average silhouette width of 0.53, while the weaker cluster again shows low separation (0.02). This consistency between training and test results suggests that, similar to K-means, PAM benefits from PCA-based dimensionality reduction by improving cluster stability, especially for the better-defined cluster. The class alignment is also better than K-means.
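The “better class alignment than K-means” observation can also be checked on the held-out patients with an adjusted Rand index (a small sketch using flexclust::randIndex(); higher values mean closer agreement with the true labels):
# Test-set agreement between predicted clusters and the true classes.
c(
kmeans_test = flexclust::randIndex(table(km_pred, y_test)),
pam_test = flexclust::randIndex(table(pam_pred, y_test))
)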
Across all methods, the data consistently forms two main clusters, but separation is not perfect due to overlap in patient profiles. In the original feature space, clustering methods tend to produce one compact cluster and one more diffuse cluster, which limits overall quality.
Applying dimensionality reduction improves how these clusters behave rather than changing their structure. Clustering on PCA-reduced data leads to higher and more stable silhouette scores across K-means, PAM, and hierarchical clustering, indicating tighter and better-separated groups. MDS preserves the same cluster assignments and class alignment, but its impact on clustering quality is smaller.
While PCA and MDS produce very similar visual patterns, PCA consistently results in more compact clusters and higher silhouette values. This makes PCA the more effective representation for clustering in this dataset, even though the underlying class alignment remains largely the same.