Part 1: Data Preparation
Load Data
Here we load the diabetes prediction dataset.
library(readr)   # read_csv()
library(dplyr)   # %>%, filter(), mutate()
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 2/Data")
diabetes.data <- read_csv("diabetes_prediction_dataset.csv")
Clean Data
We will remove the ‘Other’ category from gender since it
has few observations. We also convert our target variable
diabetes and categorical predictors (gender,
smoking_history) into factors so the classification models
can interpret them correctly.
# Filter out other gender category
diabetes.data <- diabetes.data %>%
  filter(gender != "Other")
# Convert character columns and target to factors
diabetes.data <- diabetes.data %>%
  mutate(
    diabetes = as.factor(diabetes),
    gender = as.factor(gender),
    smoking_history = as.factor(smoking_history)
  )
# Set factor levels for clarity (0 = No, 1 = Yes)
levels(diabetes.data$diabetes) <- c("No", "Yes")
# Check the structure of the prepared data
summary(diabetes.data)
##     gender           age         hypertension     heart_disease
##  Female:58552   Min.   : 0.08   Min.   :0.00000   Min.   :0.00000
##  Male  :41430   1st Qu.:24.00   1st Qu.:0.00000   1st Qu.:0.00000
##                 Median :43.00   Median :0.00000   Median :0.00000
##                 Mean   :41.89   Mean   :0.07486   Mean   :0.03943
##                 3rd Qu.:60.00   3rd Qu.:0.00000   3rd Qu.:0.00000
##                 Max.   :80.00   Max.   :1.00000   Max.   :1.00000
##     smoking_history       bmi         HbA1c_level    blood_glucose_level
##  current    : 9286   Min.   :10.01   Min.   :3.500   Min.   : 80.0
##  ever       : 4003   1st Qu.:23.63   1st Qu.:4.800   1st Qu.:100.0
##  former     : 9352   Median :27.32   Median :5.800   Median :140.0
##  never      :35092   Mean   :27.32   Mean   :5.528   Mean   :138.1
##  No Info    :35810   3rd Qu.:29.58   3rd Qu.:6.200   3rd Qu.:159.0
##  not current: 6439   Max.   :95.69   Max.   :9.000   Max.   :300.0
##  diabetes
##  No :91482
##  Yes: 8500
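The diabetes counts in the summary above imply a heavily imbalanced target. As a quick check, the sketch below recomputes the class shares from those two printed counts. The object names (class_counts, class_share, baseline_accuracy) are introduced here for illustration only.

```r
# Class counts copied from the summary() output above
class_counts <- c(No = 91482, Yes = 8500)

# Share of each class in the cleaned data
class_share <- class_counts / sum(class_counts)
print(round(class_share, 4))

# A model that always predicts "No" would already reach this accuracy,
# which is why accuracy alone is a weak yardstick for this target
baseline_accuracy <- max(class_share)
print(round(baseline_accuracy, 4))
```

The "Yes" share of roughly 8.5% is the same figure that appears as the root-node class probability of the fitted tree.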
Part 2: Model Development
Build Tree with rpart
We will use the rpart package to build our decision
tree. We fit a classification tree that uses all available
predictors to predict the diabetes outcome.
library(rpart)
# Build the decision tree model
# We use method = "class" for a classification tree
tree_model <- rpart(
  formula = diabetes ~ .,
  data = diabetes.data,
  method = "class"
)
# Print the model summary
print(tree_model)
## n= 99982
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
## 1) root 99982 8500 No (0.91498470 0.08501530)
##   2) HbA1c_level< 6.7 96087 4605 No (0.95207468 0.04792532)
##     4) blood_glucose_level< 210 94295 2813 No (0.97016809 0.02983191) *
##     5) blood_glucose_level>=210 1792 0 Yes (0.00000000 1.00000000) *
##   3) HbA1c_level>=6.7 3895 0 Yes (0.00000000 1.00000000) *
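The printed splits amount to two nested threshold rules. Below is a minimal sketch of those rules as a plain R function, assuming the thresholds exactly as printed (6.7 and 210). The function name predict_by_rules and the three example patients are hypothetical.

```r
# Hand-written version of the splits printed above (illustrative only;
# predict(tree_model, ...) remains the proper way to score real data)
predict_by_rules <- function(hba1c, glucose) {
  ifelse(hba1c >= 6.7, "Yes",            # node 3
         ifelse(glucose >= 210, "Yes",   # node 5
                "No"))                   # node 4
}

# Three made-up patients, one per leaf
example_patients <- data.frame(
  HbA1c_level         = c(7.0, 6.0, 5.5),
  blood_glucose_level = c(140, 220, 100)
)
example_patients$rule_prediction <- predict_by_rules(
  example_patients$HbA1c_level,
  example_patients$blood_glucose_level
)
print(example_patients)
```

Writing the rules out this way makes it easy to see that no other predictor (age, bmi, and so on) influences the default tree's prediction.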
Visualize the Tree
We can use the rpart.plot package to create a clean
visual of the tree structure. This helps in understanding the rules the
model learned.
library(rpart.plot)
# Plot the tree
rpart.plot(tree_model, main = "Decision Tree for Diabetes Prediction")

In the plot, the tree first splits the data on an
HbA1c_level threshold of 6.7. If a patient’s level is 6.7
or higher (the ‘no’ branch of the HbA1c_level < 6.7 test, about 4% of
the data), the model immediately predicts ‘Yes’ for diabetes. For the
96% of patients with a lower HbA1c level, the model asks a second
question: is their blood_glucose_level less than 210? Patients below
that threshold (94% of the total dataset) are predicted ‘No’, and the
remaining 2% are predicted ‘Yes’. Both ‘Yes’ leaves are pure in the
training data, while the ‘No’ leaf still contains 2,813 patients with
diabetes (about 3% of that leaf).
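The percentages quoted above follow directly from the node counts in the printed tree. The sketch below recomputes them and, because only node 4 contains misclassified cases, also derives the tree's in-sample accuracy, sensitivity, and specificity. All object names are introduced here for illustration.

```r
# Node sizes and losses copied from print(tree_model)
n_total      <- 99982
n_diabetic   <- 8500
leaf_n       <- c(node4 = 94295, node5 = 1792, node3 = 3895)
node4_missed <- 2813   # diabetics that fall into the "No" leaf

# Share of all patients in each leaf (the percentages shown in the plot)
print(round(100 * leaf_n / n_total, 1))

# Nodes 3 and 5 are pure "Yes" leaves, so every error is a missed diabetic
accuracy    <- (n_total - node4_missed) / n_total
sensitivity <- (n_diabetic - node4_missed) / n_diabetic
specificity <- 1   # no non-diabetic patient lands in a "Yes" leaf

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

The high accuracy is driven mostly by the large "No" class; the sensitivity shows that the tree flags only about two thirds of the diabetic patients.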
Part 3: Model Comparison (ROC/AUC)
We will evaluate the decision tree using the area under the ROC curve
(AUC). For comparison, we will also build the Full Logistic Model and
plot both ROC curves together.
Logistic Model
This chunk builds the full logistic regression model to use as a
baseline.
# Build the full logistic model
glm_full <- glm(
  formula = diabetes ~ .,
  data = diabetes.data,
  family = "binomial"
)
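For readers less familiar with glm(), a binomial fit models the log-odds of the outcome, and type = "response" returns the inverse logit of that linear predictor. The sketch below shows this on simulated data. The data frame sim, its columns, and the coefficients used to generate it are made up for illustration and are unrelated to the diabetes data.

```r
set.seed(551)

# Simulated stand-in data: one numeric predictor and a binary outcome
sim <- data.frame(x = rnorm(500))
sim$y <- factor(ifelse(runif(500) < plogis(-2 + 1.5 * sim$x), "Yes", "No"),
                levels = c("No", "Yes"))

fit <- glm(y ~ x, data = sim, family = "binomial")

# Linear predictor (log-odds) versus predicted probability
log_odds <- predict(fit, type = "link")
probs    <- predict(fit, type = "response")

# The two scales are connected by the logistic function
print(all.equal(probs, plogis(log_odds)))

# With factor levels c("No", "Yes"), glm() models P(y == "Yes")
print(head(round(probs, 3)))
```

This is why the level order set during data preparation matters: the second level ("Yes") is the event whose probability the model returns.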
Calculate Predictions & AUC
Now we get the predicted probabilities from both models on the data
they were fit to and calculate their respective AUC scores.
# Get probabilities for the GLM model
glm_probs <- predict(glm_full, type = "response")
# Get probabilities for the Decision Tree model
tree_probs <- predict(tree_model, type = "prob")[, "Yes"]
# Calculate ROC curves (roc() and auc() come from the pROC package)
library(pROC)
roc_glm <- roc(diabetes.data$diabetes, glm_probs, quiet = TRUE)
roc_tree <- roc(diabetes.data$diabetes, tree_probs, quiet = TRUE)
# Get AUC values
auc_glm <- auc(roc_glm)
auc_tree <- auc(roc_tree)
print(paste("Full GLM AUC:", round(auc_glm, 4)))
## [1] "Full GLM AUC: 0.9619"
print(paste("Decision Tree AUC:", round(auc_tree, 4)))
## [1] "Decision Tree AUC: 0.8345"
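The tree's AUC can be reproduced by hand from the printed leaves. Nodes 3 and 5 both predict a probability of 1 and node 4 predicts about 0.03, so the tree produces only two distinct scores and its ROC curve has a single interior point. A sketch of that calculation, using counts from the tree output (object names are illustrative):

```r
# Counts taken from print(tree_model)
n_pos <- 8500            # patients with diabetes
n_neg <- 99982 - 8500    # patients without diabetes
tp    <- 1792 + 3895     # diabetics in the two pure "Yes" leaves
fp    <- 0               # non-diabetics in those leaves

tpr <- tp / n_pos        # sensitivity at the only interior threshold
fpr <- fp / n_neg        # 1 - specificity at that threshold

# ROC curve through (0, 0), (fpr, tpr), (1, 1); area by the trapezoid rule
roc_x <- c(0, fpr, 1)
roc_y <- c(0, tpr, 1)
auc_by_hand <- sum(diff(roc_x) * (head(roc_y, -1) + tail(roc_y, -1)) / 2)

print(round(auc_by_hand, 4))   # agrees with the reported 0.8345
```

With so few distinct scores the tree cannot rank patients within a leaf, which limits its AUC regardless of how accurate its class predictions are.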
Plot ROC Curves
Finally, we plot both curves on the same graph to visually compare
their performance.
# Plot the ROC curves
plot(roc_glm, col = "blue", main = "Model Comparison: GLM vs. Decision Tree")
plot(roc_tree, col = "darkgreen", add = TRUE)
legend("bottomright",
       legend = c(paste("GLM Full (AUC:", round(auc_glm, 4), ")"),
                  paste("Decision Tree (AUC:", round(auc_tree, 4), ")")),
       col = c("blue", "darkgreen"),
       lwd = 2)

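The plot compares the predictive performance of two models: the full
logistic regression model and the decision tree. An ROC curve traces
Sensitivity (true positive rate) against 1 - Specificity (false positive
rate) as the classification threshold varies; by default pROC draws the
horizontal axis as Specificity running from 1 to 0, which gives the same
curve. A curve that bends closer to the top-left corner signifies better
performance. This is quantified by the Area Under the Curve (AUC): the
GLM AUC = 0.9619 clearly exceeds the Decision Tree AUC = 0.8345,
indicating that the logistic model ranks diabetic patients above
non-diabetic patients more reliably. Both values are computed on the
same data used to fit the models.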
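AUC also has a direct probabilistic reading: it is the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. Below is a minimal base-R sketch of that definition on made-up scores. The function auc_rank and the toy vectors are hypothetical and not part of the analysis above.

```r
# AUC via the Mann-Whitney rank-sum identity; average ranks handle ties
auc_rank <- function(labels, scores, positive = "Yes") {
  is_pos <- labels == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r      <- rank(scores)   # tied scores receive their average rank
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 positives, 3 negatives
toy_labels <- c("Yes", "Yes", "Yes", "No", "No", "No")
toy_scores <- c(0.9, 0.8, 0.3, 0.4, 0.2, 0.1)

# 8 of the 9 positive-negative pairs are ordered correctly
print(auc_rank(toy_labels, toy_scores))
```

Under this reading, an AUC near 0.96 means the model orders a diabetic patient above a non-diabetic patient in roughly 96 of 100 such pairs.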
Conclusion
Part 1 (Data Prep): The data was loaded and
prepped. The gender variable was cleaned by removing the
‘Other’ category, and diabetes, gender, and
smoking_history were converted to factors.
Part 2 (Model Development): We developed one new
model:
- Decision Tree: We built a classification tree using the
rpart package to predict diabetes based on
all other features. The fitted tree splits on only two variables,
HbA1c_level and blood_glucose_level; each split isolates a
segment of patients that is classified as having diabetes.
Part 3 (Model Comparison): We evaluated the new
model by calculating its ROC-AUC on the entire dataset and
compared it to the Full Logistic Model. The final AUC scores were:
- Full GLM AUC: 0.9619
- Decision Tree AUC: 0.8345
Based on this in-sample comparison, the Full Logistic Model
continues to show stronger predictive performance (higher AUC) than the
default rpart Decision Tree model. The tree model, in turn, is easier
to interpret: it reduces the prediction to two threshold rules.
