Chronic Kidney Disease (CKD) is a significant global health challenge characterized by the gradual loss of kidney function over time. Kidneys play a critical role in filtering waste and excess fluids from the blood; when this function is impaired, dangerous levels of fluid, electrolytes, and wastes can build up in the body. Early detection of CKD is crucial for preventing the progression to End-Stage Renal Disease (ESRD), yet the disease is often asymptomatic in its early stages, leading to delayed diagnosis.
In recent years, the integration of Artificial Intelligence (AI) and Machine Learning (ML) into healthcare has provided powerful tools for medical diagnostics. Unlike traditional statistical methods, machine learning algorithms can model complex, non-linear relationships between various clinical parameters—such as blood pressure, specific gravity, and blood glucose—to predict disease outcomes with high accuracy. This project explores the application of data-driven techniques to analyze clinical records, aiming to enhance the predictive accuracy of CKD diagnosis and relevant clinical indicators like hemoglobin levels.
The dataset used in this study was obtained from the UCI Machine Learning Repository (specifically the “Chronic_Kidney_Disease” dataset), which is also hosted on Kaggle. The data was originally collected from Apollo Hospitals in India over a two-month period.
The dataset contains clinical records for 400 patients. It comprises 25 variables in total, including 11 numeric (continuous) variables and 14 nominal (categorical) variables; the raw CSV also carries an id column, which is dropped before modeling.
Key Dataset Attributes:
* classification — indicates whether the patient has CKD (ckd) or not (notckd).
* hemo — Hemoglobin level, used as the dependent variable for the regression analysis portion of this study.
* age (Age in years), bp (Blood Pressure), sg (Specific Gravity), al (Albumin), su (Sugar).
* bgr (Blood Glucose Random), bu (Blood Urea), sc (Serum Creatinine), sod (Sodium), pot (Potassium).
* rbc (Red Blood Cells), pc (Pus Cell), pcv (Packed Cell Volume), wc (White Blood Cell Count), rc (Red Blood Cell Count).
* htn (Hypertension), dm (Diabetes Mellitus), cad (Coronary Artery Disease).

The raw data reflects real-world clinical conditions and includes missing values and mixed data types, requiring significant preprocessing and cleaning to ensure model robustness.
This study aims to leverage machine learning techniques to extract actionable insights from clinical data related to Chronic Kidney Disease (CKD). The primary objectives of the study are framed by the following research questions:
Question 1: To what extent can clinical variables be used to accurately predict hemoglobin (hemo) levels, and how do different feature selection strategies impact the predictive accuracy of linear versus rule-based regression models?
Question 2: Which machine learning approach delivers the optimal diagnostic performance for detecting Chronic Kidney Disease (CKD), and how do ensemble methods compare to baseline models?
We start by loading the necessary R packages required for data analysis, visualization, and modeling. The key libraries include:
library(doParallel) # parallel backend to speed up caret model training
if (!getDoParRegistered()) registerDoParallel(makePSOCKcluster(max(1, detectCores() - 2)))
library(readr)      # data import
library(dplyr)      # data manipulation
library(stringr)    # string cleaning
library(ggplot2)    # visualization
library(gridExtra)  # arranging multiple plots
library(caret)      # model training and evaluation
library(corrplot)   # correlation matrix plots
# Load Data
if (file.exists("kidney_disease.csv")) {
df <- read_csv("kidney_disease.csv", show_col_types = FALSE)
cat("Dataset Loaded Successfully.\n")
} else {
stop("Error: kidney_disease.csv not found.")
}
## Dataset Loaded Successfully.
We perform a preliminary inspection to understand the dataset’s structure, check for duplicates, and identify missing values.
# 1. Quick Overview of Data Types and Sample Values
str(df)
## spc_tbl_ [400 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:400] 0 1 2 3 4 5 6 7 8 9 ...
## $ age : num [1:400] 48 7 62 48 51 60 68 24 52 53 ...
## $ bp : num [1:400] 80 50 80 70 80 90 70 NA 100 90 ...
## $ sg : num [1:400] 1.02 1.02 1.01 1 1.01 ...
## $ al : num [1:400] 1 4 2 4 2 3 0 2 3 2 ...
## $ su : num [1:400] 0 0 3 0 0 0 0 4 0 0 ...
## $ rbc : chr [1:400] NA NA "normal" "normal" ...
## $ pc : chr [1:400] "normal" "normal" "normal" "abnormal" ...
## $ pcc : chr [1:400] "notpresent" "notpresent" "notpresent" "present" ...
## $ ba : chr [1:400] "notpresent" "notpresent" "notpresent" "notpresent" ...
## $ bgr : num [1:400] 121 NA 423 117 106 74 100 410 138 70 ...
## $ bu : num [1:400] 36 18 53 56 26 25 54 31 60 107 ...
## $ sc : num [1:400] 1.2 0.8 1.8 3.8 1.4 1.1 24 1.1 1.9 7.2 ...
## $ sod : num [1:400] NA NA NA 111 NA 142 104 NA NA 114 ...
## $ pot : num [1:400] NA NA NA 2.5 NA 3.2 4 NA NA 3.7 ...
## $ hemo : num [1:400] 15.4 11.3 9.6 11.2 11.6 12.2 12.4 12.4 10.8 9.5 ...
## $ pcv : chr [1:400] "44" "38" "31" "32" ...
## $ wc : chr [1:400] "7800" "6000" "7500" "6700" ...
## $ rc : chr [1:400] "5.2" NA NA "3.9" ...
## $ htn : chr [1:400] "yes" "no" "no" "yes" ...
## $ dm : chr [1:400] "yes" "no" "yes" "no" ...
## $ cad : chr [1:400] "no" "no" "no" "no" ...
## $ appet : chr [1:400] "good" "good" "poor" "poor" ...
## $ pe : chr [1:400] "no" "no" "no" "yes" ...
## $ ane : chr [1:400] "no" "no" "yes" "yes" ...
## $ classification: chr [1:400] "ckd" "ckd" "ckd" "ckd" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. age = col_double(),
## .. bp = col_double(),
## .. sg = col_double(),
## .. al = col_double(),
## .. su = col_double(),
## .. rbc = col_character(),
## .. pc = col_character(),
## .. pcc = col_character(),
## .. ba = col_character(),
## .. bgr = col_double(),
## .. bu = col_double(),
## .. sc = col_double(),
## .. sod = col_double(),
## .. pot = col_double(),
## .. hemo = col_double(),
## .. pcv = col_character(),
## .. wc = col_character(),
## .. rc = col_character(),
## .. htn = col_character(),
## .. dm = col_character(),
## .. cad = col_character(),
## .. appet = col_character(),
## .. pe = col_character(),
## .. ane = col_character(),
## .. classification = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
# 2. Dimensions
cat("Rows:", dim(df)[1], "| Columns:", dim(df)[2], "\n")
## Rows: 400 | Columns: 26
# 3. Duplicate Check
num_duplicates <- sum(duplicated(df))
if (num_duplicates == 0) {
cat("No duplicate rows found.\n")
} else {
cat("Number of duplicate rows:", num_duplicates, "\n")
}
## No duplicate rows found.
# 4. Missing Value Analysis
na_counts <- colSums(is.na(df))
total_na_cols <- sum(na_counts > 0)
if (total_na_cols > 0) {
cat("Total columns with missing values:", total_na_cols, "\n")
cat("Top columns with missing data:\n")
print(sort(na_counts[na_counts > 0], decreasing = TRUE))
} else {
cat("No missing values found in the raw dataset.\n")
}
## Total columns with missing values: 24
## Top columns with missing data:
## rbc rc wc pot sod pcv pc hemo su sg al bgr bu
## 152 130 105 88 87 70 65 52 49 47 46 44 19
## sc bp age pcc ba htn dm cad appet pe ane
## 17 12 9 4 4 2 2 2 1 1 1
Observation: 24 of the 26 columns contain missing values. The most affected are rbc (152 missing), rc (130), and wc (105), confirming that careful imputation will be required before modeling.
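To put these counts in perspective, a quick sketch (using the df loaded above) converts them to percentages of the 400 records:

# Share of missing values per column, as a percentage of all rows
na_pct <- round(100 * colMeans(is.na(df)), 1)
print(sort(na_pct[na_pct > 0], decreasing = TRUE))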
Before visualization, we perform a quick cleaning step to handle strings and convert types for plotting purposes.
clean_for_viz <- function(df) {
clean_str <- function(x) {
x <- str_trim(x)
x <- str_replace_all(x, "\t", "")
na_strings <- c("", "?", "NA", "NaN")
x[x %in% na_strings] <- NA
return(x)
}
df_clean <- df %>% mutate(across(where(is.character), clean_str))
# Numeric conversion
cols_to_numeric <- c("age", "bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wc", "rc")
cols_present <- intersect(names(df_clean), cols_to_numeric)
df_clean[cols_present] <- lapply(df_clean[cols_present], as.numeric)
# Target
df_clean$classification <- ifelse(grepl("notckd", df_clean$classification), "notckd",
ifelse(grepl("ckd", df_clean$classification), "ckd", NA)
)
return(df_clean)
}
df_viz <- clean_for_viz(df)
df_viz <- df_viz %>% filter(!is.na(classification))
We examine the overall correlation between numeric variables to identify potential relationships and multicollinearity.
# Filter numeric columns for correlation
numeric_cols <- df_viz %>% select(where(is.numeric))
if (ncol(numeric_cols) > 1) {
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")
corrplot::corrplot(cor_matrix,
method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45, title = "Correlation Matrix", mar = c(0, 0, 1, 0),
addCoef.col = "black", number.cex = 0.7
)
}
Observation:
* Target (classification): shows meaningful correlations with several features. Positive correlations with risk factors (like sc, bu) and negative correlations with health indicators (like hemo, pcv, rc) suggest these are strong predictors for CKD.
* Regression target (hemo): exhibits very strong positive correlations with pcv and rc, and negative correlations with sc and bu, confirming its strong relationship with other kidney function markers.
* Aside from the tightly related cluster (hemo, pcv, rc), most other variables show low multicollinearity.

Target Variable Analysis (classification)

In this section, we analyze the features in relation to the diagnosis of Chronic Kidney Disease (ckd vs notckd).
1. Class Distribution
ggplot(df_viz, aes(x = classification, fill = classification)) +
geom_bar() +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
labs(title = "Distribution of Target Variable (Classification)", x = "Class", y = "Count") +
theme_minimal() +
scale_fill_brewer(palette = "Set1") +
theme(legend.position = "none")
Observation:
* The dataset is imbalanced, containing noticeably more ckd cases than notckd cases.

2. Feature Separation Analysis
We visualize key numeric predictors to see how well they separate the two classes.
# Select key variables that typically show strong separation
key_vars <- c("hemo", "sc", "bu", "sg")
for (var in key_vars) {
if (var %in% names(df_viz)) {
p1 <- ggplot(df_viz, aes(x = classification, y = .data[[var]], fill = classification)) +
geom_boxplot(alpha = 0.7) +
theme_minimal() +
labs(title = paste("Boxplot of", var), x = "Class", y = var) +
theme(legend.position = "none")
p2 <- ggplot(df_viz, aes(x = .data[[var]], fill = classification)) +
geom_density(alpha = 0.5) +
theme_minimal() +
labs(title = paste("Density of", var), x = var, y = "Density")
grid.arrange(p1, p2, ncol = 2)
}
}
Observation:
* hemo (Hemoglobin), sc (Serum Creatinine), and sg (Specific Gravity) show distinct separation between the two classes.
* Some value ranges are class-exclusive (e.g., the elevated range of sc is exclusively ckd).

Regression Target Analysis (hemo)

Here we focus on understanding the predictors for Hemoglobin (hemo).
1. Target Distribution
ggplot(df_viz, aes(x = hemo)) +
geom_histogram(aes(y = ..density..), binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
geom_density(color = "red", size = 1) +
labs(title = "Distribution of Hemoglobin (hemo)", x = "Hemoglobin", y = "Density") +
theme_minimal()
2. Top Correlations with Hemoglobin
# Calculate correlations with hemo
if ("hemo" %in% names(numeric_cols)) {
hemo_cor <- sort(abs(cor(numeric_cols, use = "pairwise.complete.obs")[, "hemo"]), decreasing = TRUE)
top_predictors <- names(hemo_cor)[2:5] # Top 4 excluding hemo itself
top_predictors <- setdiff(top_predictors, "id") # Ensure ID is excluded
plot_list <- list()
for (var in top_predictors) {
if (!is.na(var)) {
p <- ggplot(df_viz, aes(x = .data[[var]], y = hemo)) +
geom_point(alpha = 0.6, color = "darkblue") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
labs(
title = paste("hemo vs", var),
subtitle = paste("Corr:", round(hemo_cor[var], 2)),
x = var, y = "Hemoglobin"
) +
theme_minimal()
plot_list[[var]] <- p
}
}
if (length(plot_list) > 0) {
grid.arrange(grobs = plot_list, ncol = 2)
}
}
Observation:
* hemo shows strong linear relationships with variables like pcv (Packed Cell Volume) and rc (Red Blood Cell Count).

In this section, the dataset is prepared for modeling by cleaning and structuring the data. This includes ensuring that all variables are in the appropriate format and removing irrelevant or unnecessary columns that do not contribute to the modeling process.
First, string inconsistencies are addressed by removing unwanted characters, such as tab spaces, and standardizing text entries. String values that contain meaningless or invalid information are identified and recoded as null values.
Next, variables representing numerical information are detected and converted to the appropriate numeric data types, ensuring that they can be processed smoothly and consistently in subsequent analyses and modeling stages.
clean_str <- function(x) {
x <- str_trim(x)
x <- str_replace_all(x, "\t", "")
na_strings <- c("", "?", "NA", "NaN")
x[x %in% na_strings] <- NA
return(x)
}
df_processed <- df %>% mutate(across(where(is.character), clean_str))
# Numeric conversion
cols_to_numeric <- c("age", "bp", "sg", "al", "su", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wc", "rc")
df_processed[intersect(names(df_processed), cols_to_numeric)] <- lapply(df_processed[intersect(names(df_processed), cols_to_numeric)], as.numeric)
Categorical variables are mapped to numeric binary values (0/1) to facilitate correlation analysis and model training. This direct mapping is performed based on prior knowledge obtained from the earlier EDA process. By defining a fixed, rule-based encoding scheme prior to the train–test split, the risk of data leakage is avoided, as the transformation does not rely on information derived from the data distribution or target variable and is applied consistently to both training and testing datasets.
# Encode Categoricals (0/1 mapping)
df_processed$rbc <- ifelse(df_processed$rbc == "normal", 0, ifelse(df_processed$rbc == "abnormal", 1, NA))
df_processed$pc <- ifelse(df_processed$pc == "normal", 0, ifelse(df_processed$pc == "abnormal", 1, NA))
df_processed$pcc <- ifelse(df_processed$pcc == "notpresent", 0, ifelse(df_processed$pcc == "present", 1, NA))
df_processed$ba <- ifelse(df_processed$ba == "notpresent", 0, ifelse(df_processed$ba == "present", 1, NA))
df_processed$htn <- ifelse(df_processed$htn == "no", 0, ifelse(df_processed$htn == "yes", 1, NA))
df_processed$dm <- ifelse(grepl("no", df_processed$dm, ignore.case = TRUE), 0,
ifelse(grepl("yes", df_processed$dm, ignore.case = TRUE), 1, NA)
)
df_processed$cad <- ifelse(grepl("no", df_processed$cad, ignore.case = TRUE), 0,
ifelse(grepl("yes", df_processed$cad, ignore.case = TRUE), 1, NA)
)
df_processed$appet <- ifelse(df_processed$appet == "good", 0, ifelse(df_processed$appet == "poor", 1, NA))
df_processed$pe <- ifelse(df_processed$pe == "no", 0, ifelse(df_processed$pe == "yes", 1, NA))
df_processed$ane <- ifelse(df_processed$ane == "no", 0, ifelse(df_processed$ane == "yes", 1, NA))
Next, we standardize the target variable classification
and remove rows with missing targets or unnecessary ID columns.
df_processed$classification <- ifelse(grepl("notckd", df_processed$classification), "notckd",
ifelse(grepl("ckd", df_processed$classification), "ckd", NA)
)
if ("id" %in% names(df_processed)) df_processed <- df_processed %>% select(-id)
df_processed <- df_processed[!is.na(df_processed$classification), ]
The dataset is split into training (70%) and testing (30%) sets to facilitate model evaluation on unseen data. The training set is used to fit and tune the models, while the testing set provides an unbiased assessment of the model’s performance, helping to ensure that the results generalize well to new data.
# Train/Test Split
set.seed(42)
trainIndex <- createDataPartition(df_processed$classification, p = .7, list = FALSE, times = 1)
train_data <- df_processed[trainIndex, ]
test_data <- df_processed[-trainIndex, ]
# Ensure Factor
train_data$classification <- factor(train_data$classification, levels = c("notckd", "ckd"))
test_data$classification <- factor(test_data$classification, levels = c("notckd", "ckd"))
Finally, missing values are addressed to ensure data completeness. To prevent data leakage, the median for numeric variables and the mode for categorical variables are calculated using only the training data. These values are then applied consistently to both the training and testing sets, maintaining the integrity of the evaluation process.
# Imputation (Median/Mode) using Train stats
pred_cols <- setdiff(names(train_data), "classification")
for (col in pred_cols) {
if (is.numeric(train_data[[col]])) {
val <- median(train_data[[col]], na.rm = TRUE)
} else {
val <- names(which.max(table(train_data[[col]])))
}
train_data[[col]][is.na(train_data[[col]])] <- val
test_data[[col]][is.na(test_data[[col]])] <- val
}
cat("Train Size:", nrow(train_data), "| Test Size:", nrow(test_data), "\n")
## Train Size: 280 | Test Size: 120
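As a quick sanity check (a sketch; createDataPartition stratifies on the outcome, so the class proportions should be nearly identical across the split):

# Verify that the stratified split preserved class proportions
prop.table(table(train_data$classification))
prop.table(table(test_data$classification))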
This study aims to predict hemoglobin (hemo) levels using clinical variables through regression analysis. We compare the impact of three feature selection strategies on two modeling approaches: Linear Regression (lm), representing a linear model, and Decision Tree (rpart), representing a rule-based model.
The following feature selection methods were evaluated:

* Manual correlation-based selection: retaining predictors with an absolute Pearson correlation above 0.5 with hemo.
* Recursive Feature Elimination (RFE): using Random Forest functions to search for the optimal feature subset.
* Principal Component Analysis (PCA): applied as a pre-processing step to reduce dimensionality.
All models were trained and evaluated using 5-fold cross-validation (CV = 5) to ensure robustness and reduce overfitting. Model performance was assessed using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²). RMSE and MAE measure prediction error magnitude, while R² quantifies the proportion of variance in hemoglobin levels explained by the model.
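For reference, with $y_i$ the observed hemoglobin values, $\hat{y}_i$ the predictions, and $n$ the test-set size, the error metrics are defined as:

$$\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\textstyle\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}, \qquad \mathrm{MAE} = \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\lvert \hat{y}_i - y_i \rvert$$

As implemented in the evaluate_model helper below, R² is computed as the squared Pearson correlation between predictions and actuals.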
library(glmnet)
# Target Variable
target_var <- "hemo"
# Store results
comparison_results <- data.frame()
# Define Base Models to Test
models_to_test <- c("lm", "rpart")
# Helper Function for Evaluation
evaluate_model <- function(model, test_data, target, method_name, fs_name) {
preds <- predict(model, newdata = test_data)
actuals <- test_data[[target]]
mse <- mean((preds - actuals)^2)
rmse <- sqrt(mse)
mae <- mean(abs(preds - actuals))
r2 <- cor(preds, actuals)^2
return(data.frame(
FeatureSelection = fs_name,
Model = method_name,
RMSE = rmse,
MAE = mae,
R2 = r2
))
}
train_ctrl <- trainControl(method = "cv", number = 5)
In this approach, features were manually selected based on their linear association with the target variable, hemoglobin (hemo). Specifically, predictor variables with an absolute Pearson correlation coefficient greater than 0.5 with hemo were retained for model training.
# Correlation Matrix
df_numeric <- train_data %>% select(where(is.numeric))
cor_matrix <- cor(df_numeric, use = "pairwise.complete.obs")
target_cor <- abs(cor_matrix[target_var, ])
# Select logic
selected_feats_corr <- names(target_cor)[target_cor > 0.5 & names(target_cor) != target_var]
# Subset
train_corr <- train_data %>% select(all_of(c(target_var, selected_feats_corr)))
test_corr <- test_data %>% select(all_of(c(target_var, selected_feats_corr)))
# Train & Evaluate
for (m in models_to_test) {
set.seed(123)
model <- train(as.formula(paste(target_var, "~ .")), data = train_corr, method = m, trControl = train_ctrl)
comparison_results <- rbind(comparison_results, evaluate_model(model, test_corr, target_var, m, "Correlation > 0.5"))
}
In this approach, we use Recursive Feature Elimination (RFE) with Random Forest functions to find the optimal subset of features.
x_train <- train_data %>% select(-all_of(target_var))
y_train <- train_data[[target_var]]
set.seed(123)
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5, verbose = FALSE)
# Limiting sizes for speed in report
rfe_res <- rfe(x = x_train, y = y_train, sizes = c(1:10), rfeControl = rfe_ctrl)
selected_feats_rfe <- predictors(rfe_res)
train_rfe <- train_data %>% select(all_of(c(target_var, selected_feats_rfe)))
test_rfe <- test_data %>% select(all_of(c(target_var, selected_feats_rfe)))
for (m in models_to_test) {
set.seed(123)
model <- train(as.formula(paste(target_var, "~ .")), data = train_rfe, method = m, trControl = train_ctrl)
comparison_results <- rbind(comparison_results, evaluate_model(model, test_rfe, target_var, m, "RFE"))
}
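To see which variables RFE actually retained (and, optionally, how cross-validated error varied with subset size), a quick inspection might look like:

# Features retained by RFE
print(selected_feats_rfe)
# Optional: plot(rfe_res, type = c("g", "o"))  # CV error vs. subset size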
In this approach, we use Principal Component Analysis (PCA) as a pre-processing step within the training control to reduce dimensionality.
for (m in models_to_test) {
set.seed(123)
model <- train(
as.formula(paste(target_var, "~ .")),
data = train_data,
method = m,
trControl = train_ctrl,
preProcess = c("center", "scale", "pca")
)
comparison_results <- rbind(comparison_results, evaluate_model(model, test_data, target_var, m, "PCA"))
}
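By default, caret's "pca" pre-processing retains enough components to explain 95% of the variance; a sketch to check how many components that is for this training set:

# How many principal components caret retains (default variance threshold: 0.95)
pp <- preProcess(train_data %>% select(where(is.numeric), -all_of(target_var)),
                 method = c("center", "scale", "pca"))
pp$numComp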
# Final Table
comparison_results <- comparison_results %>%
arrange(RMSE) %>%
select(FeatureSelection, Model, RMSE, MAE, R2)
knitr::kable(comparison_results, caption = "Regression Model Comparison", digits = 3)
| FeatureSelection | Model | RMSE | MAE | R2 |
|---|---|---|---|---|
| RFE | lm | 1.173 | 0.913 | 0.839 |
| Correlation > 0.5 | lm | 1.311 | 0.997 | 0.795 |
| PCA | lm | 1.321 | 1.051 | 0.784 |
| RFE | rpart | 1.396 | 1.149 | 0.760 |
| Correlation > 0.5 | rpart | 1.610 | 1.221 | 0.686 |
| PCA | rpart | 1.821 | 1.416 | 0.588 |
# Visualization
ggplot(comparison_results, aes(x = FeatureSelection, y = R2, fill = Model)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) +
labs(
title = "Feature Selection Comparison (R-Squared)",
subtitle = "Higher is better",
y = "R-Squared",
x = "Feature Selection Method"
) +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
axis.text.x = element_text(angle = 0, hjust = 0.5),
legend.position = "top"
)
Key Findings
RFE with Linear Regression (lm) provided the best overall performance, achieving the lowest RMSE (1.17) and highest R² (0.839).
Linear Regression (lm), representing a linear model, consistently outperformed Decision Tree (rpart), representing a rule-based model, across all feature selection methods. This suggests that the relationship between the predictors and hemoglobin levels is largely linear.
PCA performed the worst among the feature selection strategies, likely because compressing the predictors into principal components discards some predictive information while also reducing interpretability.
RFE demonstrated clear advantages over manual correlation-based selection by efficiently identifying the most relevant features, improving predictive accuracy for both model types.
To visually assess the model performance, we plot the predictions from the best performing model: Linear Regression with RFE Feature Selection.
# Retrain the best model (lm + RFE) for visualization
set.seed(123)
model_viz <- train(as.formula(paste(target_var, "~ .")), data = train_rfe, method = "lm", trControl = train_ctrl)
preds_viz <- predict(model_viz, newdata = test_rfe)
# Create Plot Data
plot_data <- data.frame(
Actual = test_rfe[[target_var]],
Predicted = preds_viz
)
ggplot(plot_data, aes(x = Actual, y = Predicted)) +
geom_point(color = "steelblue", alpha = 0.7, size = 2) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
labs(
title = "Actual vs Predicted Hemoglobin",
subtitle = "Linear Regression (RFE Selection)",
x = "Actual Value",
y = "Predicted Value"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
We performed a three-stage model comparison to systematically evaluate model performance:

1. Baseline (Raw): simple models (knn, rpart, glm) trained without hyperparameter tuning.
2. Baseline (Tuned): the same simple models trained with hyperparameter tuning.
3. Enhanced (Tuned): more advanced models (rf, kknn, glmnet) trained with tuning to achieve higher accuracy.

This structure illustrates the progression of our modeling approach:

> "We started simple, optimized what we had, and then upgraded to advanced algorithms to achieve state-of-the-art results."
# Feature Selection (Top 15 correlated)
df_numeric <- train_data %>%
mutate(classification = as.numeric(classification) - 1) %>%
select(where(is.numeric))
cor_matrix <- cor(df_numeric, use = "pairwise.complete.obs")
class_cor <- sort(abs(cor_matrix["classification", ]), decreasing = TRUE)
top_15 <- names(class_cor)[names(class_cor) != "classification"][1:15]
train_data_final <- train_data %>% select(all_of(c("classification", top_15)))
test_data_final <- test_data %>% select(all_of(c("classification", top_15)))
All experiments were conducted using 5-fold
cross-validation (cv = 5).
As previously noted, the dataset exhibits class
imbalance. To address this issue, we applied the SMOTE
(Synthetic Minority Over-sampling Technique) method during
model training.
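A quick look at the raw training-set class counts (a sketch) makes the imbalance that SMOTE corrects explicit:

# Class counts in the training set before SMOTE resampling
table(train_data_final$classification)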
train_control <- trainControl(
method = "cv", number = 5, sampling = "smote",
classProbs = TRUE, summaryFunction = twoClassSummary
)
In this section, we used accuracy,
recall, and precision as the
evaluation metrics.
Among these, we primarily focus on recall, as this is a
healthcare dataset where minimizing Type II errors (false
negatives) is particularly important.
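In terms of confusion-matrix counts (TP, TN, FP, FN, with ckd treated as the positive class), these metrics are:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$

Maximizing recall directly minimizes the false-negative count, which is why it is our primary metric here.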
evaluate_res <- function(model, test_df) {
probs <- predict(model, test_df, type = "prob")[, "ckd"]
pred_class <- ifelse(probs > 0.5, "ckd", "notckd")
cm <- confusionMatrix(factor(pred_class, levels = c("notckd", "ckd")), test_df$classification, positive = "ckd")
list(Accuracy = cm$overall[["Accuracy"]], Recall = cm$byClass[["Recall"]], Precision = cm$byClass[["Precision"]])
}
run_experiment <- function(methods, tune = FALSE, label) {
res_list <- list()
for (m in methods) {
grid <- NULL
if (tune) {
if (m == "knn") grid <- expand.grid(k = seq(1, 21, 2))
if (m == "rpart") grid <- expand.grid(cp = seq(0.001, 0.05, 0.002))
if (m == "rf") grid <- expand.grid(mtry = c(2, 4, 6, 8))
if (m == "kknn") grid <- expand.grid(kmax = c(5, 7, 9), distance = 2, kernel = c("optimal", "rectangular"))
if (m == "glmnet") grid <- expand.grid(alpha = seq(0, 1, 0.2), lambda = 10^seq(-4, -1, length = 5))
}
set.seed(42)
fit <- train(classification ~ ., data = train_data_final, method = m, trControl = train_control, metric = "Sens", tuneGrid = grid, tuneLength = if (tune) 5 else 1)
metrics <- evaluate_res(fit, test_data_final)
res_list[[m]] <- data.frame(Method = m, Category = label, t(unlist(metrics)))
}
do.call(rbind, res_list)
}
# Run Comparison
res1 <- run_experiment(c("glm", "rpart", "knn"), FALSE, "Baseline (Raw)")
res2 <- run_experiment(c("glm", "rpart", "knn"), TRUE, "Baseline (Tuned)")
res3 <- run_experiment(c("glmnet", "rf", "kknn"), TRUE, "Enhanced (Tuned)")
final_results <- rbind(res1, res2, res3) %>% arrange(Category)
knitr::kable(final_results, caption = "Model Performance Comparison")
| Method | Category | Accuracy | Recall | Precision |
|---|---|---|---|---|
| glm | Baseline (Raw) | 0.975 | 0.960 | 1.000 |
| rpart | Baseline (Raw) | 0.942 | 0.933 | 0.972 |
| knn | Baseline (Raw) | 0.892 | 0.840 | 0.984 |
| glm | Baseline (Tuned) | 0.975 | 0.960 | 1.000 |
| rpart | Baseline (Tuned) | 0.983 | 1.000 | 0.974 |
| knn | Baseline (Tuned) | 0.850 | 0.760 | 1.000 |
| glmnet | Enhanced (Tuned) | 0.975 | 0.960 | 1.000 |
| rf | Enhanced (Tuned) | 0.983 | 0.987 | 0.987 |
| kknn | Enhanced (Tuned) | 0.983 | 0.973 | 1.000 |
Key Findings
* Stage 1 — Baseline Raw (glm, rpart, knn): glm performed strongly out of the box (Recall 0.96, perfect Precision), while knn struggled with Recall (~0.84) and dropped further after tuning (0.76), despite high Precision.
* Stage 2 — Baseline Tuned (glm, rpart, knn): tuning lifted rpart to perfect Recall (1.00), showing that hyperparameter optimization helps, but the decline of tuned knn suggests that simple models still hit a performance ceiling.
* Stage 3 — Enhanced Tuned (glmnet, rf, kknn): the ensemble rf achieved the best overall balance (Accuracy 0.983, Recall 0.987, Precision 0.987), and the weighted k-nearest neighbors model (kknn) increased Recall to 0.973 with perfect Precision.

Conclusion: The Random Forest model is selected as the optimal model because it achieves the best overall balance of accuracy, recall, and precision, while delivering the second-highest recall, which is critical in healthcare settings to minimize false negatives. Additionally, its ensemble structure enables strong generalization and effective modeling of complex nonlinear relationships, making it more robust than simpler models.
To further validate the model’s strong performance, we examined the top features contributing to the Random Forest’s predictions.
# Retrain RF for Importance Plot
set.seed(42)
rf_final <- train(classification ~ ., data = train_data_final, method = "rf", trControl = train_control, metric = "Sens")
vi <- varImp(rf_final)$importance
vi$Feature <- rownames(vi)
ggplot(vi, aes(x = reorder(Feature, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Feature Importance (Random Forest)", x = "Feature", y = "Importance") +
theme_minimal()
The bar chart above shows the relative importance of clinical features in predicting the target variable.
Top Predictors:

* hemo (hemoglobin), pcv (packed cell volume), rc (red blood cell count), and sg (specific gravity) are the most influential features.
* Their high importance scores indicate that variations in these measurements are the strongest drivers of the model's predictions.
This study has successfully addressed both of our objectives: evaluating hemoglobin level prediction from clinical variables and identifying the most effective machine learning approach for Chronic Kidney Disease (CKD) diagnosis. In the regression analysis, Linear Regression (lm) consistently outperformed rule-based models like Decision Tree (rpart) across all feature selection methods, indicating a primarily linear relationship between the selected clinical predictors and hemoglobin. Furthermore, Recursive Feature Elimination (RFE) proved to be the best feature selection strategy, identifying critical features more effectively than manual correlation-based selection and Principal Component Analysis (PCA) and thereby enhancing predictive accuracy.
Regarding diagnostic performance for CKD, our study concludes that ensemble methods offer more robust classification results than baseline models. After addressing class imbalance with the SMOTE technique and using 5-fold cross-validation, the Random Forest (rf) model was identified as the optimal diagnostic approach: it achieved the best overall balance of accuracy, recall, and precision. Critically, its recall was high, which matters in CKD diagnosis for identifying patients at potential risk. Feature importance analysis further highlighted that hemoglobin (hemo), red blood cell count (rc), packed cell volume (pcv), and specific gravity (sg) are the primary drivers of CKD detection. These results demonstrate that integrating advanced ensemble learning with optimized feature selection can significantly enhance clinical screening and diagnosis for CKD.
Despite the high performance of our models, several limitations should be noted. The analysis was based on a relatively small dataset of 400 records obtained via Kaggle, which may not fully represent the demographic and clinical diversity of broader global populations. Furthermore, the dataset was collected roughly nine years ago, and many clinical standards, diagnostic technologies, and baseline patient health profiles have evolved since then. Future research should focus on validating these algorithms against large real-world clinical datasets to ensure generalizability, and on incorporating modern diagnostic biomarkers that were not available at the time of the original data collection.