1 Introduction

1.1 Background Study

Chronic Kidney Disease (CKD) is a significant global health challenge characterized by the gradual loss of kidney function over time. Kidneys play a critical role in filtering waste and excess fluids from the blood; when this function is impaired, dangerous levels of fluid, electrolytes, and wastes can build up in the body. Early detection of CKD is crucial for preventing the progression to End-Stage Renal Disease (ESRD), yet the disease is often asymptomatic in its early stages, leading to delayed diagnosis.

In recent years, the integration of Artificial Intelligence (AI) and Machine Learning (ML) into healthcare has provided powerful tools for medical diagnostics. Unlike traditional statistical methods, machine learning algorithms can model complex, non-linear relationships between various clinical parameters—such as blood pressure, specific gravity, and blood glucose—to predict disease outcomes with high accuracy. This project explores the application of data-driven techniques to analyze clinical records, aiming to enhance the predictive accuracy of CKD diagnosis and relevant clinical indicators like hemoglobin levels.

1.2 Dataset Information

The dataset used in this study was obtained from the UCI Machine Learning Repository (specifically the “Chronic_Kidney_Disease” dataset), which is also hosted on Kaggle. The data was originally collected from Apollo Hospitals in India over a two-month period.

The dataset contains clinical records for 400 patients. It comprises 25 variables in total, including 11 numeric (continuous) variables and 14 nominal (categorical) variables.

Key Dataset Attributes:

  • Target Variable (Classification): classification — indicating whether the patient has CKD (ckd) or not (notckd).
  • Target Variable (Regression): hemo — Hemoglobin level, used as the dependent variable for the regression analysis portion of this study.
  • Clinical Features: The dataset covers a wide range of physiological measurements and indicators, including but not limited to:
    • Demographics: age (Age in years).
    • Vital Signs: bp (Blood Pressure).
    • Urinalysis: sg (Specific Gravity), al (Albumin), su (Sugar).
    • Blood Test Results: bgr (Blood Glucose Random), bu (Blood Urea), sc (Serum Creatinine), sod (Sodium), pot (Potassium).
    • Blood Count: rbc (Red Blood Cells), pc (Pus Cell), pcv (Packed Cell Volume), wc (White Blood Cell Count), rc (Red Blood Cell Count).
    • Disease History: htn (Hypertension), dm (Diabetes Mellitus), cad (Coronary Artery Disease).

The raw data reflects real-world clinical conditions and includes missing values and mixed data types, requiring significant preprocessing and cleaning to ensure model robustness.

1.3 Research Objectives

Based on the formulated research questions, the primary objectives of this study are:

  1. To evaluate regression performance on physiological indicators: Specifically, to determine the extent to which clinical variables can accurately predict hemoglobin (hemo) levels. This involves assessing how different feature selection strategies influence the predictive accuracy when comparing linear models against rule-based regression models.
  2. To identify the optimal diagnostic approach for CKD: By comparing the performance of various machine learning classifiers. The objective is to benchmark ensemble methods (such as Random Forest) against baseline models to determine which approach delivers the highest sensitivity and overall accuracy for detecting Chronic Kidney Disease.

2 Research Questions

This study aims to leverage machine learning techniques to extract actionable insights from clinical data related to Chronic Kidney Disease (CKD). To guide this investigation, we have formulated the following research questions:

Question 1: To what extent can clinical variables be used to accurately predict hemoglobin (hemo) levels, and how do different feature selection strategies impact the predictive accuracy of linear versus rule-based regression models?

Question 2: Which machine learning approach delivers the optimal diagnostic performance for detecting Chronic Kidney Disease (CKD), and how do ensemble methods compare to baseline models?

3 Data Loading

We start by loading the necessary R packages required for data analysis, visualization, and modeling. The key libraries include:

  • doParallel: Enables parallel processing to speed up computation.
  • readr: Provides fast and friendly functions to read data.
  • dplyr: A grammar of data manipulation for filtering, selecting, and transforming data.
  • stringr: Designed to make string manipulation easy (e.g., cleaning text data).
  • ggplot2 & gridExtra: Used for creating and arranging elegant data visualizations.
  • caret: A comprehensive framework for training and evaluating machine learning models.
  • corrplot: Visualizes the correlation matrix of the dataset.
library(doParallel)
if (!getDoParRegistered()) registerDoParallel(makePSOCKcluster(max(1, detectCores() - 2)))

library(readr)
library(dplyr)
library(stringr)
library(ggplot2)
library(gridExtra)
library(caret)
library(corrplot)
# Load Data
if (file.exists("kidney_disease.csv")) {
    df <- read_csv("kidney_disease.csv", show_col_types = FALSE)
    cat("Dataset Loaded Successfully.\n")
} else {
    stop("Error: kidney_disease.csv not found.")
}
## Dataset Loaded Successfully.

3.0.1 Data Inspection

We perform a preliminary inspection to understand the dataset’s structure, check for duplicates, and identify missing values.

# 1. Quick Overview of Data Types and Sample Values
str(df)
## spc_tbl_ [400 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id            : num [1:400] 0 1 2 3 4 5 6 7 8 9 ...
##  $ age           : num [1:400] 48 7 62 48 51 60 68 24 52 53 ...
##  $ bp            : num [1:400] 80 50 80 70 80 90 70 NA 100 90 ...
##  $ sg            : num [1:400] 1.02 1.02 1.01 1 1.01 ...
##  $ al            : num [1:400] 1 4 2 4 2 3 0 2 3 2 ...
##  $ su            : num [1:400] 0 0 3 0 0 0 0 4 0 0 ...
##  $ rbc           : chr [1:400] NA NA "normal" "normal" ...
##  $ pc            : chr [1:400] "normal" "normal" "normal" "abnormal" ...
##  $ pcc           : chr [1:400] "notpresent" "notpresent" "notpresent" "present" ...
##  $ ba            : chr [1:400] "notpresent" "notpresent" "notpresent" "notpresent" ...
##  $ bgr           : num [1:400] 121 NA 423 117 106 74 100 410 138 70 ...
##  $ bu            : num [1:400] 36 18 53 56 26 25 54 31 60 107 ...
##  $ sc            : num [1:400] 1.2 0.8 1.8 3.8 1.4 1.1 24 1.1 1.9 7.2 ...
##  $ sod           : num [1:400] NA NA NA 111 NA 142 104 NA NA 114 ...
##  $ pot           : num [1:400] NA NA NA 2.5 NA 3.2 4 NA NA 3.7 ...
##  $ hemo          : num [1:400] 15.4 11.3 9.6 11.2 11.6 12.2 12.4 12.4 10.8 9.5 ...
##  $ pcv           : chr [1:400] "44" "38" "31" "32" ...
##  $ wc            : chr [1:400] "7800" "6000" "7500" "6700" ...
##  $ rc            : chr [1:400] "5.2" NA NA "3.9" ...
##  $ htn           : chr [1:400] "yes" "no" "no" "yes" ...
##  $ dm            : chr [1:400] "yes" "no" "yes" "no" ...
##  $ cad           : chr [1:400] "no" "no" "no" "no" ...
##  $ appet         : chr [1:400] "good" "good" "poor" "poor" ...
##  $ pe            : chr [1:400] "no" "no" "no" "yes" ...
##  $ ane           : chr [1:400] "no" "no" "yes" "yes" ...
##  $ classification: chr [1:400] "ckd" "ckd" "ckd" "ckd" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   age = col_double(),
##   ..   bp = col_double(),
##   ..   sg = col_double(),
##   ..   al = col_double(),
##   ..   su = col_double(),
##   ..   rbc = col_character(),
##   ..   pc = col_character(),
##   ..   pcc = col_character(),
##   ..   ba = col_character(),
##   ..   bgr = col_double(),
##   ..   bu = col_double(),
##   ..   sc = col_double(),
##   ..   sod = col_double(),
##   ..   pot = col_double(),
##   ..   hemo = col_double(),
##   ..   pcv = col_character(),
##   ..   wc = col_character(),
##   ..   rc = col_character(),
##   ..   htn = col_character(),
##   ..   dm = col_character(),
##   ..   cad = col_character(),
##   ..   appet = col_character(),
##   ..   pe = col_character(),
##   ..   ane = col_character(),
##   ..   classification = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# 2. Dimensions
cat("Rows:", dim(df)[1], "| Columns:", dim(df)[2], "\n")
## Rows: 400 | Columns: 26
# 3. Duplicate Check
num_duplicates <- sum(duplicated(df))
if (num_duplicates == 0) {
    cat("No duplicate rows found.\n")
} else {
    cat("Number of duplicate rows:", num_duplicates, "\n")
}
## No duplicate rows found.
# 4. Missing Value Analysis
na_counts <- colSums(is.na(df))
total_na_cols <- sum(na_counts > 0)

if (total_na_cols > 0) {
    cat("Total columns with missing values:", total_na_cols, "\n")
    cat("Top columns with missing data:\n")
    print(sort(na_counts[na_counts > 0], decreasing = TRUE))
} else {
    cat("No missing values found in the raw dataset.\n")
}
## Total columns with missing values: 24 
## Top columns with missing data:
##   rbc    rc    wc   pot   sod   pcv    pc  hemo    su    sg    al   bgr    bu 
##   152   130   105    88    87    70    65    52    49    47    46    44    19 
##    sc    bp   age   pcc    ba   htn    dm   cad appet    pe   ane 
##    17    12     9     4     4     2     2     2     1     1     1

Observation:

  • The dataset structure shows a mix of numeric and character variables.
  • There are no duplicate rows, ensuring data uniqueness.
  • However, significant missing values are observed across several columns, which will be handled in the Data Cleaning section.

4 Data Understanding (EDA)

4.0.1 Data Preparation for EDA

Before visualization, we perform a quick cleaning step to handle strings and convert types for plotting purposes.

clean_for_viz <- function(df) {
    clean_str <- function(x) {
        x <- str_trim(x)
        x <- str_replace_all(x, "\t", "")
        na_strings <- c("", "?", "NA", "NaN")
        x[x %in% na_strings] <- NA
        return(x)
    }
    df_clean <- df %>% mutate(across(where(is.character), clean_str))

    # Numeric conversion
    cols_to_numeric <- c("age", "bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wc", "rc")
    cols_present <- intersect(names(df_clean), cols_to_numeric)
    df_clean[cols_present] <- lapply(df_clean[cols_present], as.numeric)

    # Target
    df_clean$classification <- ifelse(grepl("notckd", df_clean$classification), "notckd",
        ifelse(grepl("ckd", df_clean$classification), "ckd", NA)
    )
    return(df_clean)
}

df_viz <- clean_for_viz(df)
df_viz <- df_viz %>% filter(!is.na(classification))

4.0.2 Correlation Analysis

We examine the overall correlation between numeric variables to identify potential relationships and multicollinearity.

# Filter numeric columns for correlation
numeric_cols <- df_viz %>% select(where(is.numeric))
if (ncol(numeric_cols) > 1) {
    cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")
    corrplot::corrplot(cor_matrix,
        method = "color", type = "upper", order = "hclust",
        tl.col = "black", tl.srt = 45, title = "Correlation Matrix", mar = c(0, 0, 1, 0),
        addCoef.col = "black", number.cex = 0.7
    )
}

Observation:

  • Classification (classification): Shows meaningful correlations with several features. Positive correlations with risk factors (like sc, bu) and negative correlations with health indicators (like hemo, pcv, rc) suggest these are strong predictors for CKD.
  • Hemoglobin (hemo): As the regression target, hemo exhibits very strong positive correlations with pcv and rc, and negative correlations with sc and bu. This confirms its strong relationship with other kidney function markers.
  • Multicollinearity: While some clusters exist (e.g., hemo-pcv-rc), most other variables show low multicollinearity.

4.0.3 Classification EDA (Target: classification)

In this section, we analyze the features in relation to the diagnosis of Chronic Kidney Disease (ckd vs notckd).

1. Class Distribution

ggplot(df_viz, aes(x = classification, fill = classification)) +
    geom_bar() +
    geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
    labs(title = "Distribution of Target Variable (Classification)", x = "Class", y = "Count") +
    theme_minimal() +
    scale_fill_brewer(palette = "Set1") +
    theme(legend.position = "none")

Observation:

  • There is a clear class imbalance, with more ckd cases than notckd.

2. Feature Separation Analysis

We visualize key numeric predictors to see how well they separate the two classes.

# Select key variables that typically show strong separation
key_vars <- c("hemo", "sc", "bu", "sg")

for (var in key_vars) {
    if (var %in% names(df_viz)) {
        p1 <- ggplot(df_viz, aes(x = classification, y = .data[[var]], fill = classification)) +
            geom_boxplot(alpha = 0.7) +
            theme_minimal() +
            labs(title = paste("Boxplot of", var), x = "Class", y = var) +
            theme(legend.position = "none")

        p2 <- ggplot(df_viz, aes(x = .data[[var]], fill = classification)) +
            geom_density(alpha = 0.5) +
            theme_minimal() +
            labs(title = paste("Density of", var), x = var, y = "Density")

        grid.arrange(p1, p2, ncol = 2)
    }
}

Observation:

  • Variables like hemo (Hemoglobin), sc (Serum Creatinine), and sg (Specific Gravity) show distinct separation between the two classes.
  • In some ranges, the classes are almost perfectly separated (e.g., very high sc is exclusively ckd).
  • Modeling Implication: This suggests that Tree-based models (like Random Forest or Decision Trees) are likely to perform very well as they can easily learn these threshold-based splits.

4.0.4 Regression EDA (Target: hemo)

Here we focus on understanding the predictors for Hemoglobin (hemo).

1. Target Distribution

ggplot(df_viz, aes(x = hemo)) +
    geom_histogram(aes(y = ..density..), binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
    geom_density(color = "red", size = 1) +
    labs(title = "Distribution of Hemoglobin (hemo)", x = "Hemoglobin", y = "Density") +
    theme_minimal()

2. Top Correlations with Hemoglobin

# Calculate correlations with hemo
if ("hemo" %in% names(numeric_cols)) {
    hemo_cor <- sort(abs(cor(numeric_cols, use = "pairwise.complete.obs")[, "hemo"]), decreasing = TRUE)
    top_predictors <- names(hemo_cor)[2:5] # Top 4 excluding hemo itself
    top_predictors <- setdiff(top_predictors, "id") # Ensure ID is excluded

    plot_list <- list()
    for (var in top_predictors) {
        if (!is.na(var)) {
            p <- ggplot(df_viz, aes(x = .data[[var]], y = hemo)) +
                geom_point(alpha = 0.6, color = "darkblue") +
                geom_smooth(method = "lm", color = "red", se = TRUE) +
                labs(
                    title = paste("hemo vs", var),
                    subtitle = paste("Corr:", round(hemo_cor[var], 2)),
                    x = var, y = "Hemoglobin"
                ) +
                theme_minimal()
            plot_list[[var]] <- p
        }
    }

    if (length(plot_list) > 0) {
        grid.arrange(grobs = plot_list, ncol = 2)
    }
}

Observation:

  • hemo shows strong linear relationships with variables like pcv (Packed Cell Volume) and rc (Red Blood Cell Count).
  • The scatter plots confirm that these relationships are largely linear, supporting the use of Linear Regression or Regularized Regression (Lasso/Ridge).

5 Data Cleaning & Preparation

In this section, the dataset is prepared for modeling by cleaning and structuring the data. This includes ensuring that all variables are in the appropriate format and removing irrelevant or unnecessary columns that do not contribute to the modeling process.

5.0.1 Data Cleaning & Type Conversion

First, string inconsistencies are addressed by removing unwanted characters, such as tab spaces, and standardizing text entries. String values that contain meaningless or invalid information are identified and recoded as null values.

Next, variables representing numerical information are detected and converted to the appropriate numeric data types, ensuring that they can be processed smoothly and consistently in subsequent analyses and modeling stages.

clean_str <- function(x) {
    x <- str_trim(x)
    x <- str_replace_all(x, "\t", "")
    na_strings <- c("", "?", "NA", "NaN")
    x[x %in% na_strings] <- NA
    return(x)
}

df_processed <- df %>% mutate(across(where(is.character), clean_str))

# Numeric conversion
cols_to_numeric <- c("age", "bp", "sg", "al", "su", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wc", "rc")
df_processed[intersect(names(df_processed), cols_to_numeric)] <- lapply(df_processed[intersect(names(df_processed), cols_to_numeric)], as.numeric)

5.0.2 Categorical Encoding

Categorical variables are mapped to numeric binary values (0/1) to facilitate correlation analysis and model training. This direct mapping is performed based on prior knowledge obtained from the earlier EDA process. By defining a fixed, rule-based encoding scheme prior to the train–test split, the risk of data leakage is avoided, as the transformation does not rely on information derived from the data distribution or target variable and is applied consistently to both training and testing datasets.

# Encode Categoricals (0/1 mapping)
df_processed$rbc <- ifelse(df_processed$rbc == "normal", 0, ifelse(df_processed$rbc == "abnormal", 1, NA))
df_processed$pc <- ifelse(df_processed$pc == "normal", 0, ifelse(df_processed$pc == "abnormal", 1, NA))
df_processed$pcc <- ifelse(df_processed$pcc == "notpresent", 0, ifelse(df_processed$pcc == "present", 1, NA))
df_processed$ba <- ifelse(df_processed$ba == "notpresent", 0, ifelse(df_processed$ba == "present", 1, NA))
df_processed$htn <- ifelse(df_processed$htn == "no", 0, ifelse(df_processed$htn == "yes", 1, NA))
df_processed$dm <- ifelse(grepl("no", df_processed$dm, ignore.case = TRUE), 0,
    ifelse(grepl("yes", df_processed$dm, ignore.case = TRUE), 1, NA)
)
df_processed$cad <- ifelse(grepl("no", df_processed$cad, ignore.case = TRUE), 0,
    ifelse(grepl("yes", df_processed$cad, ignore.case = TRUE), 1, NA)
)
df_processed$appet <- ifelse(df_processed$appet == "good", 0, ifelse(df_processed$appet == "poor", 1, NA))
df_processed$pe <- ifelse(df_processed$pe == "no", 0, ifelse(df_processed$pe == "yes", 1, NA))
df_processed$ane <- ifelse(df_processed$ane == "no", 0, ifelse(df_processed$ane == "yes", 1, NA))

5.0.3 Target Variable Processing

Next, we standardize the target variable classification and remove rows with missing targets or unnecessary ID columns.

df_processed$classification <- ifelse(grepl("notckd", df_processed$classification), "notckd",
    ifelse(grepl("ckd", df_processed$classification), "ckd", NA)
)

if ("id" %in% names(df_processed)) df_processed <- df_processed %>% select(-id)
df_processed <- df_processed[!is.na(df_processed$classification), ]

5.0.4 Train-Test Split

The dataset is split into training (70%) and testing (30%) sets to facilitate model evaluation on unseen data. The training set is used to fit and tune the models, while the testing set provides an unbiased assessment of the model’s performance, helping to ensure that the results generalize well to new data.

# Train/Test Split
set.seed(42)
trainIndex <- createDataPartition(df_processed$classification, p = .7, list = FALSE, times = 1)
train_data <- df_processed[trainIndex, ]
test_data <- df_processed[-trainIndex, ]

# Ensure Factor
train_data$classification <- factor(train_data$classification, levels = c("notckd", "ckd"))
test_data$classification <- factor(test_data$classification, levels = c("notckd", "ckd"))

5.0.5 Missing Value Imputation

Finally, missing values are addressed to ensure data completeness. To prevent data leakage, the median for numeric variables and the mode for categorical variables are calculated using only the training data. These values are then applied consistently to both the training and testing sets, maintaining the integrity of the evaluation process.

# Imputation (Median/Mode) using Train stats
pred_cols <- setdiff(names(train_data), "classification")
for (col in pred_cols) {
    if (is.numeric(train_data[[col]])) {
        val <- median(train_data[[col]], na.rm = TRUE)
    } else {
        val <- names(which.max(table(train_data[[col]])))
    }
    train_data[[col]][is.na(train_data[[col]])] <- val
    test_data[[col]][is.na(test_data[[col]])] <- val
}

cat("Train Size:", nrow(train_data), "| Test Size:", nrow(test_data), "\n")
## Train Size: 280 | Test Size: 120

6 Regression Analysis

This study aims to predict hemoglobin (hemo) levels using clinical variables through regression analysis. We compare the impact of three feature selection strategies on two modeling approaches: Linear Regression (lm), representing a linear model, and Decision Tree (rpart), representing a rule-based model.

The following feature selection methods were evaluated:

  • Manual Selection: Features were selected based on their Pearson correlation with the target variable, retaining only variables with an absolute correlation coefficient greater than 0.5.
  • Recursive Feature Elimination (RFE): An automated feature selection technique using a Random Forest estimator to iteratively remove less important variables.
  • Principal Component Analysis (PCA): A dimensionality reduction approach that transforms the original variables into a smaller set of uncorrelated principal components while preserving most of the variance.

All models were trained and evaluated using 5-fold cross-validation (CV = 5) to ensure robustness and reduce overfitting. Model performance was assessed using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²). RMSE and MAE measure prediction error magnitude, while R² quantifies the proportion of variance in hemoglobin levels explained by the model.

library(glmnet)

# Target Variable
target_var <- "hemo"

# Store results
comparison_results <- data.frame()

# Define Base Models to Test
models_to_test <- c("lm", "rpart")

# Helper Function for Evaluation
evaluate_model <- function(model, test_data, target, method_name, fs_name) {
    preds <- predict(model, newdata = test_data)
    actuals <- test_data[[target]]

    mse <- mean((preds - actuals)^2)
    rmse <- sqrt(mse)
    mae <- mean(abs(preds - actuals))
    r2 <- cor(preds, actuals)^2

    return(data.frame(
        FeatureSelection = fs_name,
        Model = method_name,
        RMSE = rmse,
        MAE = mae,
        R2 = r2
    ))
}

train_ctrl <- trainControl(method = "cv", number = 5)

6.0.1 Manual Feature Selection (Correlation)

In this approach, features were manually selected based on their linear association with the target variable, hemoglobin (hemo). Specifically, predictor variables with an absolute Pearson correlation coefficient greater than 0.5 with hemo were retained for model training.

# Correlation Matrix
df_numeric <- train_data %>% select(where(is.numeric))
cor_matrix <- cor(df_numeric, use = "pairwise.complete.obs")
target_cor <- abs(cor_matrix[target_var, ])

# Select logic
selected_feats_corr <- names(target_cor)[target_cor > 0.5 & names(target_cor) != target_var]

# Subset
train_corr <- train_data %>% select(all_of(c(target_var, selected_feats_corr)))
test_corr <- test_data %>% select(all_of(c(target_var, selected_feats_corr)))

# Train & Evaluate
for (m in models_to_test) {
    set.seed(123)
    model <- train(as.formula(paste(target_var, "~ .")), data = train_corr, method = m, trControl = train_ctrl)
    comparison_results <- rbind(comparison_results, evaluate_model(model, test_corr, target_var, m, "Correlation > 0.5"))
}

6.0.2 Recursive Feature Elimination (RFE)

In this approach, we use Recursive Feature Elimination (RFE) with Random Forest functions to find the optimal subset of features.

x_train <- train_data %>% select(-all_of(target_var))
y_train <- train_data[[target_var]]

set.seed(123)
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5, verbose = FALSE)
# Limiting sizes for speed in report
rfe_res <- rfe(x = x_train, y = y_train, sizes = c(1:10), rfeControl = rfe_ctrl)

selected_feats_rfe <- predictors(rfe_res)

train_rfe <- train_data %>% select(all_of(c(target_var, selected_feats_rfe)))
test_rfe <- test_data %>% select(all_of(c(target_var, selected_feats_rfe)))

for (m in models_to_test) {
    set.seed(123)
    model <- train(as.formula(paste(target_var, "~ .")), data = train_rfe, method = m, trControl = train_ctrl)
    comparison_results <- rbind(comparison_results, evaluate_model(model, test_rfe, target_var, m, "RFE"))
}

6.0.3 Principal Component Analysis (PCA)

In this approach, we use Principal Component Analysis (PCA) as a pre-processing step within the training control to reduce dimensionality.

for (m in models_to_test) {
    set.seed(123)
    model <- train(
        as.formula(paste(target_var, "~ .")),
        data = train_data,
        method = m,
        trControl = train_ctrl,
        preProcess = c("center", "scale", "pca")
    )
    comparison_results <- rbind(comparison_results, evaluate_model(model, test_data, target_var, m, "PCA"))
}

6.0.4 Regression Results Comparison

# Final Table
comparison_results <- comparison_results %>%
    arrange(RMSE) %>%
    select(FeatureSelection, Model, RMSE, MAE, R2)

knitr::kable(comparison_results, caption = "Regression Model Comparison", digits = 3)
Regression Model Comparison
FeatureSelection Model RMSE MAE R2
RFE lm 1.173 0.913 0.839
Correlation > 0.5 lm 1.311 0.997 0.795
PCA lm 1.321 1.051 0.784
RFE rpart 1.396 1.149 0.760
Correlation > 0.5 rpart 1.610 1.221 0.686
PCA rpart 1.821 1.416 0.588
# Visualization
ggplot(comparison_results, aes(x = FeatureSelection, y = R2, fill = Model)) +
    geom_bar(stat = "identity", position = "dodge", width = 0.7) +
    labs(
        title = "Feature Selection Comparison (R-Squared)",
        subtitle = "Higher is better",
        y = "R-Squared",
        x = "Feature Selection Method"
    ) +
    scale_fill_brewer(palette = "Set2") +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold", size = 14),
        axis.text.x = element_text(angle = 0, hjust = 0.5),
        legend.position = "top"
    )

Key Findings

  • RFE with Linear Regression (lm) provided the best overall performance, achieving the lowest RMSE (1.17) and highest R² (0.839).

  • Linear Regression (lm), representing a linear model, consistently outperformed Decision Tree (rpart), representing a rule-based model, across all feature selection methods. This suggests that the relationship between the predictors and hemoglobin levels is largely linear.

  • PCA performed the worst among feature selection strategies, likely due to reduced interpretability and potential loss of predictive information from non-linear relationships.

  • RFE demonstrated clear advantages over manual correlation-based selection by efficiently identifying the most relevant features, improving predictive accuracy for both model types.

6.0.5 Actual vs Predicted Plot

To visually assess the model performance, we plot the predictions from the best performing model: Linear Regression with RFE Feature Selection.

# Retrain the best model (lm + RFE) for visualization
set.seed(123)
model_viz <- train(as.formula(paste(target_var, "~ .")), data = train_rfe, method = "lm", trControl = train_ctrl)
preds_viz <- predict(model_viz, newdata = test_rfe)

# Create Plot Data
plot_data <- data.frame(
    Actual = test_rfe[[target_var]],
    Predicted = preds_viz
)

ggplot(plot_data, aes(x = Actual, y = Predicted)) +
    geom_point(color = "steelblue", alpha = 0.7, size = 2) +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
    labs(
        title = "Actual vs Predicted Hemoglobin",
        subtitle = "Linear Regression (RFE Selection)",
        x = "Actual Value",
        y = "Predicted Value"
    ) +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold"))

7 Classification

7.0.1 Comprehensive 3-Tier Model Comparison

We performed a three-stage model comparison to systematically evaluate model performance:

  1. Baseline (Raw): Simple models (knn, rpart, glm) trained without hyperparameter tuning.
  2. Baseline (Tuned): Simple models with hyperparameter optimization applied to improve performance.
  3. Enhanced (Advanced): Advanced algorithms (rf, kknn, glmnet) trained with tuning to achieve higher accuracy.

This structure illustrates the progression of our modeling approach:
> “We started simple, optimized what we had, and then upgraded to advanced algorithms to achieve state-of-the-art results.”

# Feature Selection (Top 15 correlated)
df_numeric <- train_data %>%
    mutate(classification = as.numeric(classification) - 1) %>%
    select(where(is.numeric))
cor_matrix <- cor(df_numeric, use = "pairwise.complete.obs")
class_cor <- sort(abs(cor_matrix["classification", ]), decreasing = TRUE)
top_15 <- names(class_cor)[names(class_cor) != "classification"][1:15]

train_data_final <- train_data %>% select(all_of(c("classification", top_15)))
test_data_final <- test_data %>% select(all_of(c("classification", top_15)))

7.0.2 Train Control and Sampling Method

All experiments were conducted using 5-fold cross-validation (cv = 5).
As previously noted, the dataset exhibits class imbalance. To address this issue, we applied the SMOTE (Synthetic Minority Over-sampling Technique) method during model training.

train_control <- trainControl(
    method = "cv", number = 5, sampling = "smote",
    classProbs = TRUE, summaryFunction = twoClassSummary
)

7.0.3 Evaluation Metrics

In this section, we used accuracy, recall, and precision as the evaluation metrics.
Among these, we primarily focus on recall, as this is a healthcare dataset where minimizing Type II errors (false negatives) is particularly important.

evaluate_res <- function(model, test_df) {
    probs <- predict(model, test_df, type = "prob")[, "ckd"]
    pred_class <- ifelse(probs > 0.5, "ckd", "notckd")
    cm <- confusionMatrix(factor(pred_class, levels = c("notckd", "ckd")), test_df$classification, positive = "ckd")
    list(Accuracy = cm$overall[["Accuracy"]], Recall = cm$byClass[["Recall"]], Precision = cm$byClass[["Precision"]])
}

7.0.4 Perform Experiment

run_experiment <- function(methods, tune = FALSE, label) {
    res_list <- list()
    for (m in methods) {
        grid <- NULL
        if (tune) {
            if (m == "knn") grid <- expand.grid(k = seq(1, 21, 2))
            if (m == "rpart") grid <- expand.grid(cp = seq(0.001, 0.05, 0.002))
            if (m == "rf") grid <- expand.grid(mtry = c(2, 4, 6, 8))
            if (m == "kknn") grid <- expand.grid(kmax = c(5, 7, 9), distance = 2, kernel = c("optimal", "rectangular"))
            if (m == "glmnet") grid <- expand.grid(alpha = seq(0, 1, 0.2), lambda = 10^seq(-4, -1, length = 5))
        }
        set.seed(42)
        fit <- train(classification ~ ., data = train_data_final, method = m, trControl = train_control, metric = "Sens", tuneGrid = grid, tuneLength = if (tune) 5 else 1)
        metrics <- evaluate_res(fit, test_data_final)
        res_list[[m]] <- data.frame(Method = m, Category = label, t(unlist(metrics)))
    }
    do.call(rbind, res_list)
}

# Run Comparison
res1 <- run_experiment(c("glm", "rpart", "knn"), FALSE, "Baseline (Raw)")
res2 <- run_experiment(c("glm", "rpart", "knn"), TRUE, "Baseline (Tuned)")
res3 <- run_experiment(c("glmnet", "rf", "kknn"), TRUE, "Enhanced (Tuned)")

final_results <- rbind(res1, res2, res3) %>% arrange(Category)
knitr::kable(final_results, caption = "Model Performance Comparison")
Model Performance Comparison
Method Category Accuracy Recall Precision
glm glm Baseline (Raw) 0.9750000 0.9600000 1.0000000
rpart rpart Baseline (Raw) 0.9416667 0.9333333 0.9722222
knn knn Baseline (Raw) 0.8916667 0.8400000 0.9843750
glm1 glm Baseline (Tuned) 0.9750000 0.9600000 1.0000000
rpart1 rpart Baseline (Tuned) 0.9833333 1.0000000 0.9740260
knn1 knn Baseline (Tuned) 0.8500000 0.7600000 1.0000000
glmnet glmnet Enhanced (Tuned) 0.9750000 0.9600000 1.0000000
rf rf Enhanced (Tuned) 0.9833333 0.9866667 0.9866667
kknn kknn Enhanced (Tuned) 0.9833333 0.9733333 1.0000000

Key Findings

  1. Limitations of Simple Models (knn, rpart)
    • knn struggled with Recall (~0.84) and dropped further after tuning (0.76), despite high Precision.
    • Basic distance-based methods cannot capture the dataset’s complexity.
  2. Effect of Tuning (rpart, knn)
    • Tuning improved Recall (0.933 → 1.000) and reach the highest among all model.
    • However the underperformance of tuned-knn also shows that hyperparameter optimization helps, but simple models still hit a performance ceiling.
  3. Advanced Models (kknn, rf)
    • Weighted KNN (kknn) increased Recall to 0.973 with perfect Precision.
    • Random Forest achieved highest Recall (0.987) and strong Precision (0.987), balancing overall performance.

Conclusion: The Random Forest model is selected as the optimal model because it achieves the best overall balance of accuracy, recall, and precision, while delivering the second-highest recall, which is critical in healthcare settings to minimize false negatives. Additionally, its ensemble structure enables strong generalization and effective modeling of complex nonlinear relationships, making it more robust than simpler models.

7.0.5 Feature Importance (Random Forest)

To further validate the model’s strong performance, we examined the top features contributing to the Random Forest’s predictions.

# Retrain RF for Importance Plot
set.seed(42)
rf_final <- train(classification ~ ., data = train_data_final, method = "rf", trControl = train_control, metric = "Sens")
vi <- varImp(rf_final)$importance
vi$Feature <- rownames(vi)
ggplot(vi, aes(x = reorder(Feature, Overall), y = Overall)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(title = "Feature Importance (Random Forest)", x = "Feature", y = "Importance") +
    theme_minimal()

The bar chart above shows the relative importance of clinical features in predicting the target variable.

Top Predictors:
* hemo (hemoglobin), pcv (packed cell volume), rc (red blood cell count), and sg (specific gravity) are the most influential features.
* Their high importance scores indicate that variations in these measurements are the strongest drivers of the model’s predictions.

8 Conclusion

This study has successfully addressed both of our objectives of evaluating hemoglobin level prediction through clinical variables and identifying the most effective machine learning approach for Chronic Kidney Disease (CKD) diagnosis. In regression analysis, it was determined that Linear Regression (lm) consistently outperformed rule-based models like Decision Tree (rpart) across all feature selection methods. This indicates a primarily linear relationship between selected clinical predictors with hemoglobin. Furthermore, Recursive Feature Elimination (RFE) is proved to be the best feature selection strategy. It effectively identifies critical features to enhance predictive accuracy compared to manual correlation-based selection and Principal Component Analysis (PCA).

Regarding diagnostic performance for CKD, our study concludes that ensemble methods offer more robust classification results than baseline models. After addressing class imbalance with the SMOTE technique and utilizing 5-fold cross-validation, Random Forest (rf) model was identified as the most optimal diagnostic approach. It achieved the best overall balance of accuracy, recall, and precision. Critically, it had a high recall, which is important in the CKD diagnosis to identify patients with potential risk. Feature importance analysis further highlighted that hemoglobin (hb), red blood cell count (rc), packed cell volume (pcv) and specific gravity (sg) are the primary drivers of CKD detection. These results demonstrate that integrating advanced ensemble learning with optimized feature selection can significantly enhance clinical screening and diagnosis for CKD.

Despite the high performance of our models, there are several limitations that should be noted. This analysis was based on a relatively small dataset of 400 records from Kaggle website. This may not fully represent the demographic and clinical diversity of broader global populations. Furthermore, this dataset originates from approximately nine years ago. Many clinical standards, diagnostic technologies, and baseline health profiles of patients have evolved nowadays. Future research should focus on validating these algorithms against real-world large clinical datasets to ensure generalizability, and explore the inclusion of modern diagnostic biomarkers that were not available at the time of the original data collection.