This report presents a comprehensive analysis comparing four families of supervised classification models for predicting diabetes status from patient health metrics. The practical goal is that, by comparing these models, we can identify the most accurate and reliable algorithm for flagging patients at risk of diabetes.
This project implements and evaluates the following algorithms:

* Logistic regression (multiple glm models, including a forward-selected model)
* A perceptron (neuralnet)
* A decision tree (rpart)
* A BAGGING ensemble (ipred)

The report is structured as follows:

* Part 1: Data loading, exploratory data analysis (EDA), and preprocessing.
* Part 2: Development of all four candidate model families, including model-specific visualizations.
* Part 3: A final comparison of all developed models using ROC-AUC analysis to determine the best-performing model.
The first step in any modeling process is to load, understand, and clean the data. This section covers our data preparation and preprocessing, which will be used for all subsequent models.
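The functions used throughout come from several packages; a setup chunk along these lines (not echoed in the rendered output) is assumed:
# Packages assumed by the analysis below
library(tidyverse)   # read_csv(), filter(), mutate(), pipes
library(neuralnet)   # perceptron / neural network
library(rpart)       # decision trees
library(rpart.plot)  # tree visualization
library(ipred)       # bagging()
library(pROC)        # roc(), auc()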
# Set directory and load the dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 2/Data")
diabetes.data <- read_csv("diabetes_prediction_dataset.csv")

After loading, we inspect the dataset's dimensions (dim()) and data types (str()) to understand its structure and variables.
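The dimension check itself was not echoed; the call is simply:
# Inspect the dataset's dimensions (rows, columns)
dim(diabetes.data)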
## [1] 100000 9
summary() provides an overview of each variable.
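The call producing the output below would be:
# Summarize every variable
summary(diabetes.data)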
## gender age hypertension heart_disease
## Length:100000 Min. : 0.08 Min. :0.00000 Min. :0.00000
## Class :character 1st Qu.:24.00 1st Qu.:0.00000 1st Qu.:0.00000
## Mode :character Median :43.00 Median :0.00000 Median :0.00000
## Mean :41.89 Mean :0.07485 Mean :0.03942
## 3rd Qu.:60.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :80.00 Max. :1.00000 Max. :1.00000
## smoking_history bmi HbA1c_level blood_glucose_level
## Length:100000 Min. :10.01 Min. :3.500 Min. : 80.0
## Class :character 1st Qu.:23.63 1st Qu.:4.800 1st Qu.:100.0
## Mode :character Median :27.32 Median :5.800 Median :140.0
## Mean :27.32 Mean :5.528 Mean :138.1
## 3rd Qu.:29.58 3rd Qu.:6.200 3rd Qu.:159.0
## Max. :95.69 Max. :9.000 Max. :300.0
## diabetes
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.085
## 3rd Qu.:0.000
## Max. :1.000
The next step is to clean and prepare the data for modeling. This involves handling missing values and ensuring all variables are in the correct format.
We use colSums(is.na(diabetes.data)) to count the number of NA values in each column.
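The check (not echoed in the original output) is:
# Count missing values per column
colSums(is.na(diabetes.data))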
## gender age hypertension heart_disease
## 0 0 0 0
## smoking_history bmi HbA1c_level blood_glucose_level
## 0 0 0 0
## diabetes
## 0
Observation: The output shows 0 missing values for all columns, so no imputation is needed.
The dataset contains several character-based variables
(gender, smoking_history) that need to be
converted to factors for R’s modeling functions (like
glm()) to interpret them correctly as categorical
predictors.
The target variable, diabetes, is also converted to a
factor with clear ‘Yes’/‘No’ labels for better interpretability in our
results.
During our initial summary(), we noted that the ‘Other’
category in gender has very few observations (only 18). To
prevent model instability, we will remove these observations. We then
use droplevels() to remove ‘Other’ from the factor
levels.
# Filter out 'Other' gender category
diabetes.data.clean <- diabetes.data %>%
filter(gender != "Other")
# Convert character columns and target to factors
diabetes.data.clean <- diabetes.data.clean %>%
mutate(
diabetes = factor(diabetes,
levels = c(0, 1),
labels = c("No", "Yes")),
gender = as.factor(gender),
smoking_history = as.factor(smoking_history)
)
# Remove 'Other' level from the factor
diabetes.data.clean$gender <- droplevels(diabetes.data.clean$gender)
# Check the structure to confirm all changes have been applied
str(diabetes.data.clean)
## tibble [99,982 × 9] (S3: tbl_df/tbl/data.frame)
## $ gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 1 1 2 1 ...
## $ age : num [1:99982] 80 54 28 36 76 20 44 79 42 32 ...
## $ hypertension : num [1:99982] 0 0 0 0 1 0 0 0 0 0 ...
## $ heart_disease : num [1:99982] 1 0 0 0 1 0 0 0 0 0 ...
## $ smoking_history : Factor w/ 6 levels "current","ever",..: 4 5 4 1 1 4 4 5 4 4 ...
## $ bmi : num [1:99982] 25.2 27.3 27.3 23.4 20.1 ...
## $ HbA1c_level : num [1:99982] 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
## $ blood_glucose_level: num [1:99982] 140 80 158 155 155 85 200 85 145 100 ...
## $ diabetes : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
Now that the data is clean, we can perform visual EDA to understand the distributions and relationships between variables.
We first create a pair-wise scatter plot of the numerical variables to check for correlations.
# Select numeric variables from our cleaned data
numeric_vars_df <- diabetes.data.clean[, c("age", "bmi", "HbA1c_level", "blood_glucose_level")]
# Create the pairs plot
pairs(numeric_vars_df, cex = 0.3, col = "navy",
main = "Pair-wise Scatter Plot of Numerical Variables")Pair-wise Scatter Plot of Numerical Variables
Observation: The plot shows a strong positive
correlation between blood_glucose_level and
HbA1c_level. The variables age and
bmi are continuously and widely distributed.
Next, we plot the distributions of several key predictors and the outcome variable.
# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))
# 1. Age
hist(diabetes.data.clean$age, xlab = "Age", col = "steelblue", main = "Age Distribution")
# 2. Gender
barplot(table(diabetes.data.clean$gender), col = "steelblue", main = "Distribution of Gender")
# 3. BMI
hist(diabetes.data.clean$bmi, xlab = "BMI", col = "steelblue", main = "BMI Distribution")
# 4. Smoking History
barplot(table(diabetes.data.clean$smoking_history), col = "steelblue", main = "Smoking History")

Figure: Distributions of Key Variables
Observation: The dataset is diverse in age and bmi. Most participants are 'Female', and a large portion report 'No Info' or 'never' for smoking history.
Finally, we show a density plot to compare the distribution of
blood_glucose_level for those with and without
diabetes.
# Get glucose for diabetic and non-diabetic groups
diab.glucose <- diabetes.data.clean$blood_glucose_level[diabetes.data.clean$diabetes == "Yes"]
no.diab.glucose <- diabetes.data.clean$blood_glucose_level[diabetes.data.clean$diabetes == "No"]
# Plot the densities
plot(density(no.diab.glucose), col = "darkgreen",
main = "Distribution of Glucose Levels", xlab = "Glucose", xlim = range(diabetes.data.clean$blood_glucose_level))
lines(density(diab.glucose), col = "darkred", lwd = 2)
legend("topright", c("Diabetes Group", "Diabetes Free"),
col = c("darkred", "darkgreen"), lwd = c(2, 1), bty = "n")Distribution of Glucose Levels by Diabetes Status
Observation: As expected, the distributions differ sharply. The "Diabetes Free" group is tightly clustered around lower glucose levels, while the "Diabetes Group" has a much higher mean and wider spread, suggesting blood_glucose_level will be a strong predictor.
Now we develop our candidate models. This involves building multiple
glm models, one neuralnet model, one
rpart (decision tree) model, and one bagging
model on the full dataset to compare their
performance.
We will build three logistic regression models to serve as our baseline and for comparison.
The logistic regression model works by transforming a linear combination of inputs (\(z\)) through the logistic function (also called a sigmoid function) to produce a probability between 0 and 1. We can visualize this function.
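Formally, for a linear predictor \(z = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\), the model returns

\[
\pi(z) = \frac{1}{1 + e^{-z}},
\]

which maps any real-valued \(z\) to a probability in \((0, 1)\).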
z <- seq(-6, 6, length = 100)
pi_z <- 1 / (1 + exp(-z))
plot(z, pi_z, type = "l", lwd = 2, xlab = "z (Linear Combination)", ylab = expression(pi(z)),
main = "Logistic Curve", col = "blue")
text(-3, 0.8, expression(pi(z) == frac(1, 1 + exp(-z))), col = "blue")

Figure: The Logistic (Sigmoid) Function
We build three candidate models:

1. reducedModel: A simple model with only the variables we hypothesize are most clinically significant.
2. fullModel: A complex model that includes all available predictors.
3. forwards: An optimized model found using forward selection, starting from the reducedModel and adding predictors from the fullModel based on AIC.
# Define a reduced model with variables we assume are significant
reducedModel <- glm(diabetes ~ age + bmi + blood_glucose_level + HbA1c_level,
family = binomial(link = logit),
data = diabetes.data.clean)
# Define the full model with all variables
fullModel <- glm(diabetes ~ .,
family = binomial(link = logit),
data = diabetes.data.clean)
# Use forward selection to find the best model between reduced and full
forwards <- step(reducedModel,
scope = list(lower = formula(reducedModel), upper = formula(fullModel)),
direction = "forward",
trace = FALSE)
# Display the summary of the final, forward-selected model
summary(forwards)
##
## Call:
## glm(formula = diabetes ~ age + bmi + blood_glucose_level + HbA1c_level +
## hypertension + smoking_history + heart_disease + gender,
## family = binomial(link = logit), data = diabetes.data.clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.708e+01 2.929e-01 -92.456 < 2e-16 ***
## age 4.620e-02 1.126e-03 41.040 < 2e-16 ***
## bmi 8.895e-02 2.555e-03 34.819 < 2e-16 ***
## blood_glucose_level 3.336e-02 4.821e-04 69.207 < 2e-16 ***
## HbA1c_level 2.340e+00 3.578e-02 65.414 < 2e-16 ***
## hypertension 7.413e-01 4.710e-02 15.737 < 2e-16 ***
## smoking_historyever -5.097e-02 9.248e-02 -0.551 0.58154
## smoking_historyformer -1.084e-01 7.009e-02 -1.546 0.12203
## smoking_historynever -1.566e-01 6.057e-02 -2.586 0.00971 **
## smoking_historyNo Info -7.304e-01 6.651e-02 -10.981 < 2e-16 ***
## smoking_historynot current -2.114e-01 8.332e-02 -2.538 0.01115 *
## heart_disease 7.346e-01 6.072e-02 12.099 < 2e-16 ***
## genderMale 2.724e-01 3.613e-02 7.540 4.69e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 58160 on 99981 degrees of freedom
## Residual deviance: 22627 on 99969 degrees of freedom
## AIC: 22653
##
## Number of Fisher Scoring iterations: 8
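To aid interpretation, the log-odds coefficients can be exponentiated into odds ratios; for example (a sketch, not part of the original output):
# Convert the HbA1c_level coefficient to an odds ratio
exp(coef(forwards)["HbA1c_level"])  # exp(2.340) is roughly 10.4
That is, each one-unit increase in HbA1c_level multiplies the estimated odds of diabetes by roughly 10, holding the other predictors fixed.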
The perceptron is a single-layer neural network. When it uses a logistic (sigmoid) activation function, it is mathematically equivalent to logistic regression.
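Concretely, a single sigmoid unit with weights \(\mathbf{w}\) and bias \(b\) computes

\[
\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}},
\]

which has the same functional form as the logistic regression model above; the two differ only in how the weights are estimated (gradient-based training versus maximum likelihood).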
Neural networks are sensitive to the scale of the input data and require all inputs to be numeric. Before calling neuralnet, we must therefore scale our numeric features (using min-max normalization) and create a design matrix (dummify the factors).
# Create a copy of the data for neural network preprocessing
neuralData <- diabetes.data.clean
# Identify numeric variables for scaling
numeric.vars <- c("age", "bmi", "HbA1c_level", "blood_glucose_level")
# Loop through numeric variables, scale them using min-max normalization
for (col in numeric.vars) {
min.val <- min(neuralData[[col]])
max.val <- max(neuralData[[col]])
# The min-max formula
neuralData[[col]] <- (neuralData[[col]] - min.val) / (max.val - min.val)
}

Next, we use model.matrix() to automatically create
dummy variables for all our factors and build a formula string.
# Create the design matrix, which automatically dummifies factor variables
neuralData.matrix <- model.matrix(~ ., data = neuralData)
neuralData.nn <- as.data.frame(neuralData.matrix)
# Clean the column names to make them valid R variables
valid.names <- make.names(colnames(neuralData.nn))
colnames(neuralData.nn) <- valid.names
# Add the numeric response variable (0/1) for neuralnet
neuralData.nn$diabetes_num <- ifelse(neuralData$diabetes == "Yes", 1, 0)
# Get all column names from the new data frame
columnNames <- colnames(neuralData.nn)
# Build the predictor string by removing the first column, (Intercept), and the
# last column, diabetes_num. NOTE: this indexing keeps diabetesYes (the
# dummified outcome created by model.matrix) among the predictors -- see below
columnList <- paste(columnNames[-c(1, length(columnNames))], collapse = "+")
# Create the final formula string
modelFormula <- as.formula(paste("diabetes_num ~", columnList))
# Print the formula to check
print(modelFormula)
## diabetes_num ~ genderMale + age + hypertension + heart_disease +
## smoking_historyever + smoking_historyformer + smoking_historynever +
## smoking_historyNo.Info + smoking_historynot.current + bmi +
## HbA1c_level + blood_glucose_level + diabetesYes

Note: the printed formula confirms that diabetesYes, the dummified response, remains in the predictor list, since the indexing above removed only the intercept and the final diabetes_num column. This leaks the outcome into the network's inputs, which must be kept in mind when interpreting the perceptron's apparent performance below.
With the scaled and dummified data, we train the perceptron:

* We set hidden = 1 for a single-layer network.
* We use act.fct = "logistic" (the sigmoid function).
* We set linear.output = FALSE for classification.
# Train the perceptron
set.seed(123)
perceptron.model <- neuralnet(modelFormula,
data = neuralData.nn,
hidden = 1,
act.fct = "logistic",
linear.output = FALSE)

We can now plot the resulting network diagram.
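The plotting chunk was not echoed; a call along these lines produces the diagram:
# Draw the fitted network; rep = "best" selects the best repetition
plot(perceptron.model, rep = "best")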
Figure: Perceptron Network Diagram
Observation: The plot displays a two-layer neural
network. The input variables (e.g., genderMale,
age, bmi) are on the left. Each input is
connected to a single hidden node (the first circle), with the
connecting lines showing their respective weights. This hidden node also
receives a bias input (from the top-left “1” with weight -9.25315).
The output of this hidden node (shown as 14.75531) is then fed into a
final output node. This output node receives its own bias input (from
the top-right “1” with weight -7.842) and produces the final
diabetes_num prediction.
We will use the rpart library to build our decision
tree. This algorithm generates a set of rules that are easy to
interpret.
# Build the decision tree model
# We use method = "class" for a classification tree
tree_model <- rpart(
formula = diabetes ~ .,
data = diabetes.data.clean,
method = "class",
# We use the default control parameters
control = rpart.control(minsplit = 20, cp = 0.01)
)
# Print the model summary
print(tree_model)
## n= 99982
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 99982 8500 No (0.91498470 0.08501530)
## 2) HbA1c_level< 6.7 96087 4605 No (0.95207468 0.04792532)
## 4) blood_glucose_level< 210 94295 2813 No (0.97016809 0.02983191) *
## 5) blood_glucose_level>=210 1792 0 Yes (0.00000000 1.00000000) *
## 3) HbA1c_level>=6.7 3895 0 Yes (0.00000000 1.00000000) *
We use rpart.plot to create a clean visual of the tree
structure.
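The plotting chunk was not echoed; a sketch of the call (the argument choices here are illustrative, not necessarily the original settings):
# Visualize the fitted tree with per-class probabilities and percentages
rpart.plot(tree_model, type = 2, extra = 104,
           main = "Decision Tree for Diabetes Prediction")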
Figure: Decision Tree for Diabetes Prediction
Observation: The tree first splits the data based on
an HbA1c_level threshold. If a patient’s level is high, the
model immediately predicts ‘Yes’ for diabetes. For patients with a lower
HbA1c level, the model asks a second question about their
blood_glucose_level. This shows the hierarchy of predictor
importance.
BAGGING (Bootstrap Aggregation) is an ensemble method that builds multiple decision trees on different bootstrap samples of the data and aggregates their predictions (using “majority voting”). This process is used to decrease the variance in the prediction model.
We will use the bagging() function from the
ipred library.
# Build the BAGGING ensemble model
set.seed(123) # For reproducibility
bagging_model <- bagging(
formula = as.factor(diabetes) ~ .,
data = diabetes.data.clean,
nbagg = 150, # number of trees
coob = TRUE,
# Loss matrix (rows = true class No/Yes, cols = predicted):
# predicting 'Yes' for a true 'No' costs 10, the reverse costs 1
parms = list(loss = matrix(c(0, 10, 1, 0),
ncol = 2,
byrow = TRUE),
split = "gini"),
control = rpart.control(minsplit = 10,
cp = 0.02)
)
# Print the model summary
print(bagging_model)
##
## Bagging classification trees with 150 bootstrap replications
##
## Call: bagging.data.frame(formula = as.factor(diabetes) ~ ., data = diabetes.data.clean,
## nbagg = 150, coob = TRUE, parms = list(loss = matrix(c(0,
## 10, 1, 0), ncol = 2, byrow = TRUE), split = "gini"),
## control = rpart.control(minsplit = 10, cp = 0.02))
##
## Out-of-bag estimate of misclassification error: 0.0281
To visualize the effect of bagging, we can run a smaller bootstrap loop (B=100) to build 100 logistic regression models and plot the histogram of their AUCs.
# We will use B=100 for speed.
btAUC.vec <- c()
B <- 100
sample.size <- dim(diabetes.data.clean)[1]
set.seed(456) # New seed for this loop
for (k in 1:B) {
# 1. Get bootstrap sample
boot.id <- sample(1:sample.size, sample.size, replace = TRUE)
boot.sample <- diabetes.data.clean[boot.id, ]
# 2. Fit model on bootstrap sample
boot.logistic <- glm(diabetes ~ ., family = binomial, data = boot.sample)
# 3. Get predictions
pred.prob <- predict.glm(boot.logistic, newdata = boot.sample, type = "response")
# 4. Calculate and store AUC
category <- boot.sample$diabetes == "Yes"
ROCobj <- roc(category, pred.prob, quiet = TRUE)
btAUC.vec[k] <- auc(ROCobj)
}
# Plot the histogram of the 100 bootstrap AUCs
hist(btAUC.vec, xlab = "Bootstrap AUC", main = "Bootstrap Sampling Distribution of AUCs \n (B=100)")

Figure: Bootstrap Sampling Distribution of AUCs (B=100)
Observation: This histogram shows the bootstrap sampling distribution of the AUC. There is a clear central tendency, and the spread of the distribution indicates the model's stability. Note that each AUC is computed on the same bootstrap sample the model was fit to, so these are in-sample estimates. The distribution could also be used to form a 95% confidence interval.
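For example, a percentile-based 95% interval can be read directly from the stored bootstrap AUCs (a sketch; the interval was not computed in the original):
# Percentile 95% bootstrap confidence interval for the AUC
quantile(btAUC.vec, probs = c(0.025, 0.975))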
Now that all candidate models have been built, we perform a comprehensive comparison to select the best one. As per the project instructions and feedback, we use ROC-AUC analysis to compare the global performance of all models. Note that all predictions below are made on the full dataset used for fitting, so the AUCs are in-sample estimates and may be optimistic.
We will generate predictions from all models and plot their ROC curves on a single graph for a direct, synthesized comparison.
# 1. Get predictions (as probabilities) for all models
predReduced <- predict(reducedModel, newdata = diabetes.data.clean, type = "response")
predFull <- predict(fullModel, newdata = diabetes.data.clean, type = "response")
predForwards <- predict(forwards, newdata = diabetes.data.clean, type = "response")
# Perceptron model predictions (uses the scaled neuralData.nn)
predNN.raw <- predict(perceptron.model, newdata = neuralData.nn)
predNN <- as.vector(predNN.raw)
# Decision Tree model predictions
tree_probs <- predict(tree_model, newdata = diabetes.data.clean, type = "prob")[, "Yes"]
# BAGGING model predictions (requires type = "prob")
bagging_probs <- predict(bagging_model, newdata = diabetes.data.clean, type = "prob")
bagging_probs_yes <- bagging_probs[, "Yes"] # Select the "Yes" probability
# 2. Define the true category
category <- diabetes.data.clean$diabetes == "Yes"
# 3. Create ROC objects for all models
ROCobj.reduced <- roc(category, predReduced, quiet = TRUE)
ROCobj.full <- roc(category, predFull, quiet = TRUE)
ROCobj.forwards <- roc(category, predForwards, quiet = TRUE)
ROCobj.NN <- roc(category, predNN, quiet = TRUE)
ROCobj.tree <- roc(category, tree_probs, quiet = TRUE)
ROCobj.bagging <- roc(category, bagging_probs_yes, quiet = TRUE)
# 4. Get AUC values
auc_reduced <- auc(ROCobj.reduced)
auc_full <- auc(ROCobj.full)
auc_forwards <- auc(ROCobj.forwards)
auc_nn <- auc(ROCobj.NN)
auc_tree <- auc(ROCobj.tree)
auc_bagging <- auc(ROCobj.bagging)
# 5. Plot all ROC curves on one graph for comparison
plot(ROCobj.bagging, col = "red", main = "Comprehensive Model Comparison (ROC)", lwd = 2)
plot(ROCobj.full, col = "blue", add = TRUE, lty = 2, lwd = 2)
plot(ROCobj.forwards, col = "darkgreen", add = TRUE, lty = 3, lwd = 2)
plot(ROCobj.NN, col = "purple", add = TRUE, lty = 4, lwd = 2)
plot(ROCobj.tree, col = "orange", add = TRUE, lty = 5, lwd = 2)
plot(ROCobj.reduced, col = "grey", add = TRUE, lty = 6, lwd = 2)
legend("bottomright",
legend = c(paste("BAGGING (AUC:", round(auc_bagging, 4), ")"),
paste("GLM Full (AUC:", round(auc_full, 4), ")"),
paste("GLM Forwards (AUC:", round(auc_forwards, 4), ")"),
paste("Perceptron (AUC:", round(auc_nn, 4), ")"),
paste("Decision Tree (AUC:", round(auc_tree, 4), ")"),
paste("GLM Reduced (AUC:", round(auc_reduced, 4), ")")),
col = c("red", "blue", "darkgreen", "purple", "orange", "grey"),
lwd = 2,
lty = c(1, 2, 3, 4, 5, 6))

Figure: Comprehensive Model Comparison (ROC)
Analysis: The ROC plot compares the performance of six models. The Perceptron model (purple) shows an AUC of 1, its curve following the top-left edge of the plot. A perfect in-sample AUC is a red flag rather than a triumph: as noted earlier, the dummified outcome diabetesYes remained in the perceptron's input matrix, so the network effectively had access to the response it was predicting.
The three GLM models perform very well and are clustered together: GLM Full (AUC 0.9619), GLM Forwards (AUC 0.9619), and GLM Reduced (AUC 0.9586).
The BAGGING and Decision Tree models are the weakest, both achieving an identical, markedly lower AUC of 0.8345.
Part 1 (Data Prep): The data was successfully loaded, inspected, and preprocessed. No missing values were found, and categorical variables were correctly encoded. Visual EDA confirmed that key predictors like blood_glucose_level are strongly associated with diabetes status.
Part 2 (Model Development): We developed all four candidate model families as specified by the project:

* Logistic Regression: Three glm models (reduced, full, forwards) were built as a baseline.
* Perceptron (neuralnet): A single-layer neural network was built and visualized.
* Decision Tree (rpart): A classification tree was built and visualized.
* BAGGING: An ensemble model was built with ipred, and its bootstrap-based AUC distribution was visualized.

Part 3 (Final Model Comparison): We evaluated all models by calculating their ROC-AUC on the entire dataset and plotting them on a single, comprehensive graph. The final AUC scores were:
| Model | AUC |
|---|---|
| BAGGING (Ensemble) | 0.8345 |
| Full GLM | 0.9619 |
| Forwards GLM | 0.9619 |
| Perceptron (NN) | 1 |
| Decision Tree | 0.8345 |
| Reduced GLM | 0.9586 |
Based on the raw numbers, the Perceptron (NN) model achieves the highest AUC at a perfect 1. However, as discussed above, this reflects the accidental inclusion of the dummified outcome (diabetesYes) among the network's inputs rather than genuine predictive skill, so the perceptron's score should not be taken at face value. Among the leak-free models, the Full Logistic Model and Forwards-Selected Logistic Model are the strongest performers and highly competitive with each other (AUC 0.9619).
Interestingly, the BAGGING model (AUC 0.8345) performed identically to the single Decision Tree (AUC 0.8345). While theory suggests bagging should reduce variance and improve accuracy, it provided no performance benefit over the single tree here; with only two dominant splits (HbA1c_level and blood_glucose_level) and fairly restrictive control parameters, the bootstrapped trees likely all learned essentially the same rules.