This report details the development of two types of supervised classification models to predict diabetes status: logistic regression and a single-layer neural network (perceptron). The project is divided into three parts: data preparation, model development, and model comparison.
The first step in any modeling process is to load, understand, and clean the data. This section covers our data preparation and feature engineering.
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 2/Data")
# Load the CSV file
diabetes.data <- read.csv("diabetes_prediction_dataset.csv")
After loading the data, we inspect its dimensions (dim()) and data
types (str()) to understand the dataset’s structure and the
variables.
[1] 100000 9
'data.frame': 100000 obs. of 9 variables:
$ gender : chr "Female" "Female" "Male" "Female" ...
$ age : num 80 54 28 36 76 20 44 79 42 32 ...
$ hypertension : int 0 0 0 0 1 0 0 0 0 0 ...
$ heart_disease : int 1 0 0 0 1 0 0 0 0 0 ...
$ smoking_history : chr "never" "No Info" "never" "current" ...
$ bmi : num 25.2 27.3 27.3 23.4 20.1 ...
$ HbA1c_level : num 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
$ blood_glucose_level: int 140 80 158 155 155 85 200 85 145 100 ...
$ diabetes : int 0 0 0 0 0 0 1 0 0 0 ...
We then call summary() to get an overview of each variable.
gender age hypertension heart_disease
Length:100000 Min. : 0.08 Min. :0.00000 Min. :0.00000
Class :character 1st Qu.:24.00 1st Qu.:0.00000 1st Qu.:0.00000
Mode :character Median :43.00 Median :0.00000 Median :0.00000
Mean :41.89 Mean :0.07485 Mean :0.03942
3rd Qu.:60.00 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :80.00 Max. :1.00000 Max. :1.00000
smoking_history bmi HbA1c_level blood_glucose_level
Length:100000 Min. :10.01 Min. :3.500 Min. : 80.0
Class :character 1st Qu.:23.63 1st Qu.:4.800 1st Qu.:100.0
Mode :character Median :27.32 Median :5.800 Median :140.0
Mean :27.32 Mean :5.528 Mean :138.1
3rd Qu.:29.58 3rd Qu.:6.200 3rd Qu.:159.0
Max. :95.69 Max. :9.000 Max. :300.0
diabetes
Min. :0.000
1st Qu.:0.000
Median :0.000
Mean :0.085
3rd Qu.:0.000
Max. :1.000
The next step is to clean and prepare the data for modeling. This involves handling missing values and ensuring all variables are in the correct format.
To check for missing values, we first use
colSums(is.na()) to count the number of NA
values in each column.
gender age hypertension heart_disease
0 0 0 0
smoking_history bmi HbA1c_level blood_glucose_level
0 0 0 0
diabetes
0
Observation: The output shows 0 missing values for all columns, so no adjustment is needed.
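The call that produced these counts is not shown above. A minimal sketch of the same check on a small made-up data frame (the name toy and its values are illustrative assumptions, not the project data) might look like this:

```r
# Toy data frame with two deliberately missing values (illustrative only)
toy <- data.frame(
  age      = c(80, NA, 28),
  bmi      = c(25.2, 27.3, NA),
  diabetes = c(0, 0, 1)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na.counts <- colSums(is.na(toy))
print(na.counts)
```

A column with a non-zero count would need imputation or row removal before modeling; in our data every count is 0.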
The dataset contains several character-based variables
(gender, smoking_history) that need to be
converted to factors for R’s modeling functions (like
glm()) to interpret them correctly as categorical
predictors.
The target variable, diabetes, is also converted to a
factor with clear ‘Yes’/‘No’ labels for better interpretability in our
results.
During our initial inspection, we noted that the ‘Other’
category in gender has very few observations (only 18 of 100000). To
prevent model instability, we will remove these observations. We then
use droplevels() to remove ‘Other’ from the factor
levels.
# Convert categorical string variables to factors
diabetes.data$gender <- as.factor(diabetes.data$gender)
diabetes.data$smoking_history <- as.factor(diabetes.data$smoking_history)
# Convert the target variable 'diabetes' to a factor
diabetes.data$diabetes <- factor(diabetes.data$diabetes,
levels = c(0, 1),
labels = c("No", "Yes"))
# The 'Other' value in 'gender' has few observations so we remove it for model stability.
diabetes.data <- diabetes.data[diabetes.data$gender != "Other", ]
# Remove 'Other' level from the factor
diabetes.data$gender <- droplevels(diabetes.data$gender)
# Check the structure to confirm all changes have been applied
str(diabetes.data)
'data.frame': 99982 obs. of 9 variables:
$ gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 1 1 2 1 ...
$ age : num 80 54 28 36 76 20 44 79 42 32 ...
$ hypertension : int 0 0 0 0 1 0 0 0 0 0 ...
$ heart_disease : int 1 0 0 0 1 0 0 0 0 0 ...
$ smoking_history : Factor w/ 6 levels "current","ever",..: 4 5 4 1 1 4 4 5 4 4 ...
$ bmi : num 25.2 27.3 27.3 23.4 20.1 ...
$ HbA1c_level : num 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
$ blood_glucose_level: int 140 80 158 155 155 85 200 85 145 100 ...
$ diabetes : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
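The reason for calling droplevels() after filtering can be seen on a small example. The sketch below uses a made-up gender column (toy is an illustrative name, not the project data): a frequency table exposes the rare category, and filtering rows alone leaves the empty level attached to the factor.

```r
# Toy gender column (illustrative values only)
toy <- data.frame(gender = c("Female", "Male", "Other", "Female", "Male"),
                  stringsAsFactors = FALSE)
toy$gender <- as.factor(toy$gender)

# A frequency table shows how many rows fall in each category
counts.before <- table(toy$gender)
print(counts.before)

# Filtering rows removes the observations but keeps 'Other' as an empty level
toy <- toy[toy$gender != "Other", , drop = FALSE]
levels.kept <- levels(toy$gender)

# droplevels() removes levels that no longer occur in the data
toy$gender <- droplevels(toy$gender)
levels.after <- levels(toy$gender)
print(levels.after)
```

Without the droplevels() step, functions that enumerate factor levels would still see the unused category.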
Now we develop and compare models. This involves building multiple
glm models and one neuralnet model on the
full dataset to compare their performance.
We will build three logistic regression models to serve as our baseline and for comparison.
reducedModel: A simple model with only the variables we hypothesize are most clinically significant.
fullModel: A complex model that includes all available predictors.
forwards: An optimized model found using forward selection, starting from the reducedModel and adding predictors from the fullModel based on AIC.
# Define a reduced model with variables we assume are significant
reducedModel <- glm(diabetes ~ age + bmi + blood_glucose_level + HbA1c_level,
family = binomial(link = logit),
data = diabetes.data)
# Define the full model with all variables
fullModel <- glm(diabetes ~ .,
family = binomial(link = logit),
data = diabetes.data)
# Use forward selection to find the best model between reduced and full
# trace = FALSE hides the step-by-step output of the selection process
forwards <- step(reducedModel,
scope = list(lower = formula(reducedModel), upper = formula(fullModel)),
direction = "forward",
trace = FALSE)
# Display the summary of the final, forward-selected model
summary(forwards)
Call:
glm(formula = diabetes ~ age + bmi + blood_glucose_level + HbA1c_level +
hypertension + smoking_history + heart_disease + gender,
family = binomial(link = logit), data = diabetes.data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.708e+01 2.929e-01 -92.456 < 2e-16 ***
age 4.620e-02 1.126e-03 41.040 < 2e-16 ***
bmi 8.895e-02 2.555e-03 34.819 < 2e-16 ***
blood_glucose_level 3.336e-02 4.821e-04 69.207 < 2e-16 ***
HbA1c_level 2.340e+00 3.578e-02 65.414 < 2e-16 ***
hypertension 7.413e-01 4.710e-02 15.737 < 2e-16 ***
smoking_historyever -5.097e-02 9.248e-02 -0.551 0.58154
smoking_historyformer -1.084e-01 7.009e-02 -1.546 0.12203
smoking_historynever -1.566e-01 6.057e-02 -2.586 0.00971 **
smoking_historyNo Info -7.304e-01 6.651e-02 -10.981 < 2e-16 ***
smoking_historynot current -2.114e-01 8.332e-02 -2.538 0.01115 *
heart_disease 7.346e-01 6.072e-02 12.099 < 2e-16 ***
genderMale 2.724e-01 3.613e-02 7.540 4.69e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 58160 on 99981 degrees of freedom
Residual deviance: 22627 on 99969 degrees of freedom
AIC: 22653
Number of Fisher Scoring iterations: 8
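The AIC that step() uses as its selection criterion can be cross-checked by hand. For a logistic regression on a 0/1 response, the AIC equals the residual deviance plus twice the number of estimated coefficients. In the output above there are 13 coefficients (the intercept plus 12 terms), and 22627 + 2 × 13 = 22653, which matches the reported AIC up to rounding. The sketch below shows the same identity on simulated data; the variable names and the simulation itself are illustrative assumptions.

```r
# Simulated binary outcome with two predictors (illustrative only)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x1))
toy <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = toy)

# AIC by hand: residual deviance + 2 * number of estimated coefficients
k <- length(coef(fit))
aic.by.hand <- deviance(fit) + 2 * k
print(c(AIC = AIC(fit), by.hand = aic.by.hand))

# The same arithmetic applied to the summary output of the forwards model
report.aic <- 22627 + 2 * 13
print(report.aic)
```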
Neural networks are sensitive to the scale of input data and require
all inputs to be numeric. Before building our perceptron, we must
preprocess the data, which involves two key steps: manually scaling
numeric features and creating a design matrix (dummifying) for the
neuralnet function.
We will use min-max normalization to scale all
numeric predictors to a range of [0, 1]. This ensures that variables
with large magnitudes (like blood_glucose_level) do not
disproportionately influence the model’s weights compared to variables
with small magnitudes.
# Create a copy of the data for neural network preprocessing
neuralData <- diabetes.data
# Identify numeric variables for scaling
numeric.vars <- c("age", "bmi", "HbA1c_level", "blood_glucose_level")
# Loop through numeric variables, scale them using min-max normalization
for (col in numeric.vars) {
min.val <- min(neuralData[[col]])
max.val <- max(neuralData[[col]])
# The min-max formula
neuralData[[col]] <- (neuralData[[col]] - min.val) / (max.val - min.val)
}
Next, the neuralnet package requires a formula and a
data frame that does not contain factors. We use
model.matrix() to automatically create dummy variables for
all our factors (like genderMale,
smoking_historynever, etc.); the first level of each factor
serves as the reference and gets no column of its own.
This creates a new data frame of only numeric values. We must also
clean the column names (using make.names()) to remove
spaces or special characters (e.g., “No Info” becomes “No.Info”) and
then dynamically build a formula string that includes all these new
dummy predictors.
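To see what the dummy coding and name cleaning do, here is a minimal sketch on a made-up data frame. The name toy, its values, and the explicit level order are assumptions chosen for illustration.

```r
# Toy data with one factor whose level contains a space (illustrative only)
toy <- data.frame(
  smoking_history = factor(c("never", "No Info", "current", "never"),
                           levels = c("current", "never", "No Info")),
  bmi = c(25.2, 27.3, 23.4, 20.1)
)

# model.matrix() adds an intercept column and one dummy per non-reference level
design <- as.data.frame(model.matrix(~ ., data = toy))
raw.names <- colnames(design)

# make.names() turns each name into a syntactically valid R name
clean.names <- make.names(raw.names)
print(rbind(raw.names, clean.names))
```

The reference level ("current" here) gets no column, and the intercept column is renamed as well, which is why it has to be dropped by position or by name when the formula is built.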
# Create the design matrix, which automatically dummifies factor variables
# The response 'diabetes' is left out so that no 'diabetesYes' dummy
# ends up among the predictors (that would leak the target into the model)
predictorCols <- setdiff(names(neuralData), "diabetes")
neuralData.matrix <- model.matrix(~ ., data = neuralData[, predictorCols])
neuralData.nn <- as.data.frame(neuralData.matrix)
# Clean the column names to make them valid R variables
# This fixes errors from factor levels with spaces like "No Info"
valid.names <- make.names(colnames(neuralData.nn))
colnames(neuralData.nn) <- valid.names
# Add the numeric response variable (0/1) for neuralnet
# The neuralnet function requires a numeric target
neuralData.nn$diabetes_num <- ifelse(neuralData$diabetes == "Yes", 1, 0)
# Get all column names from the new data frame
columnNames <- colnames(neuralData.nn)
# Create the list of predictors by removing the first (Intercept)
# and the last (our new response variable 'diabetes_num')
columnList <- paste(columnNames[-c(1, length(columnNames))], collapse = "+")
# Create the final formula string
modelFormula <- as.formula(paste("diabetes_num ~", columnList))
# Print the formula to check
print(modelFormula)
diabetes_num ~ genderMale + age + hypertension + heart_disease +
smoking_historyever + smoking_historyformer + smoking_historynever +
smoking_historyNo.Info + smoking_historynot.current + bmi +
HbA1c_level + blood_glucose_level
With the scaled and dummified data prepared, we can now train the perceptron.
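The training call itself does not appear in this section. A minimal sketch of such a call, assuming the neuralnet package is installed and using synthetic data with hypothetical names (toy, toy.model, toy.pred), could look like the following. For the report's data, the formula and data arguments would presumably be modelFormula and neuralData.nn; the stepmax value is an assumption added to give the optimizer more room to converge.

```r
library(neuralnet)  # assumed to be installed

# Synthetic data: two predictors already on a [0, 1] scale, noisy 0/1 response
set.seed(42)
n <- 200
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- rbinom(n, size = 1, prob = plogis(4 * (toy$x1 + toy$x2 - 1)))

# One hidden node, logistic activation, activation also applied to the output
toy.model <- neuralnet(y ~ x1 + x2,
                       data = toy,
                       hidden = 1,
                       act.fct = "logistic",
                       linear.output = FALSE,
                       stepmax = 1e6)

# Predictions come back as a one-column matrix; flatten to a vector
toy.pred <- as.vector(predict(toy.model, newdata = toy))
print(range(toy.pred))
```

Because linear.output = FALSE, the predictions stay between 0 and 1 and can be read as probabilities.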
The network is configured with three settings:
hidden = 1 to create a network with a single hidden node (the perceptron referred to in this report).
act.fct = "logistic" (the sigmoid function) because this is a binary classification problem, which mirrors our logistic regression.
linear.output = FALSE to ensure the activation function is applied to the output, giving us a probability between 0 and 1.
To determine the best model, we will compare their predictive performance on the entire dataset. The primary metric for comparison will be the Area Under the Curve (AUC) from the Receiver Operating Characteristic (ROC) curve. An AUC of 1.0 represents a perfect classifier, while an AUC of 0.5 represents a model with no discriminatory power (same as random guessing).
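The two reference values of the AUC can be illustrated directly. The AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it from ranks in base R; auc.by.rank is a hypothetical helper written for illustration (the report itself uses roc() from pROC), and the labels and scores are made up.

```r
# Rank-based AUC (hypothetical helper, for illustration only)
auc.by.rank <- function(labels, scores) {
  r     <- rank(scores)        # average ranks, so ties are shared equally
  n.pos <- sum(labels)
  n.neg <- sum(!labels)
  (sum(r[labels]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

labels   <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
perfect  <- c(0.1, 0.2, 0.7, 0.3, 0.8, 0.9)  # every positive outranks every negative
constant <- rep(0.5, 6)                      # the score carries no information

auc.perfect  <- auc.by.rank(labels, perfect)
auc.constant <- auc.by.rank(labels, constant)
print(c(perfect = auc.perfect, constant = auc.constant))
```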
We will generate predictions from all four models and plot their ROC curves on a single graph for direct comparison.
# pROC provides roc(); loaded here in case it is not already attached
library(pROC)
# 1. Get predictions (as probabilities) for all models
predReduced <- predict(reducedModel, newdata = diabetes.data, type = "response")
predFull <- predict(fullModel, newdata = diabetes.data, type = "response")
predForwards <- predict(forwards, newdata = diabetes.data, type = "response")
# Neural network model predictions
predNN.raw <- predict(perceptron.model, newdata = neuralData.nn)
predNN <- as.vector(predNN.raw) # Ensure it's a vector for pROC
# 2. Create ROC objects for all models
category <- diabetes.data$diabetes == "Yes"
ROCobj.reduced <- roc(category, predReduced)
ROCobj.full <- roc(category, predFull)
ROCobj.forwards <- roc(category, predForwards)
ROCobj.NN <- roc(category, predNN)
# 3. Get AUC values from each ROC object
reducedAUC <- ROCobj.reduced$auc
fullAUC <- ROCobj.full$auc
forwardsAUC <- ROCobj.forwards$auc
NNAUC <- ROCobj.NN$auc
# 4. Plot all ROC curves on one graph for comparison
colors <- c("#8B4500", "#00008B", "#8B008B", "#055d03")
plot(ROCobj.reduced, col = colors[1], lwd = 2, main = "ROC Curves of Candidate Models (Full Dataset)")
lines(ROCobj.full, col = colors[2], lwd = 2, lty = 2)
lines(ROCobj.forwards, col = colors[3], lwd = 1)
lines(ROCobj.NN, col = colors[4], lwd = 1)
# Add legend
legend("bottomright", c("reduced", "full", "forwards", "NN"),
col = colors, lwd = c(2, 2, 1, 1), lty = c(1, 2, 1, 1), bty = "n")
# AUC text annotations for clarity
text(0.4, 0.4, paste("AUC.reduced =", round(reducedAUC, 4)), col = colors[1], adj = 0)
text(0.4, 0.35, paste("AUC.full =", round(fullAUC, 4)), col = colors[2], adj = 0)
text(0.4, 0.3, paste("AUC.forwards =", round(forwardsAUC, 4)), col = colors[3], adj = 0)
text(0.4, 0.25, paste("AUC.NN =", round(NNAUC, 4)), col = colors[4], adj = 0)
Data preparation: The gender variable was cleaned by removing the ‘Other’ category.
Logistic regression: We built a reducedModel, a fullModel, and a forwards selection model using glm() and step().
Perceptron (neuralnet): We created a separate, scaled and dummified dataset using model.matrix. We built a neuralnet model with one hidden node (hidden = 1) and a logistic activation function.
Based on this analysis, the Full Logistic Model and
the Forwards-Selected Logistic Model have almost
identical and superior predictive results (AUC 0.962) compared to the
simpler reduced model. Forward selection added every remaining
predictor, so the two models contain the same terms. This suggests that the additional variables in
the full/forwards models (like smoking_history and
gender) provide valuable predictive information.