1 Part 1: Data Preparation

1.1 Load Data

Here we load the diabetes prediction dataset.

# read_csv() and %>% come from the tidyverse (readr and dplyr)
library(tidyverse)
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 2/Data")
diabetes.data <- read_csv("diabetes_prediction_dataset.csv")

1.2 Clean Data

We will remove the ‘Other’ category from gender since it has few observations. We also convert our target variable diabetes and categorical predictors (gender, smoking_history) into factors so the classification models can interpret them correctly.

# Filter out other gender category
diabetes.data <- diabetes.data %>%
  filter(gender != "Other")

# Convert character columns and target to factors
diabetes.data <- diabetes.data %>%
  mutate(
    diabetes = as.factor(diabetes),
    gender = as.factor(gender),
    smoking_history = as.factor(smoking_history)
  )

# Set factor levels for clarity (0 = No, 1 = Yes)
levels(diabetes.data$diabetes) <- c("No", "Yes")

# Check the structure of the prepared data
summary(diabetes.data)
##     gender           age         hypertension     heart_disease    
##  Female:58552   Min.   : 0.08   Min.   :0.00000   Min.   :0.00000  
##  Male  :41430   1st Qu.:24.00   1st Qu.:0.00000   1st Qu.:0.00000  
##                 Median :43.00   Median :0.00000   Median :0.00000  
##                 Mean   :41.89   Mean   :0.07486   Mean   :0.03943  
##                 3rd Qu.:60.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
##                 Max.   :80.00   Max.   :1.00000   Max.   :1.00000  
##     smoking_history       bmi         HbA1c_level    blood_glucose_level
##  current    : 9286   Min.   :10.01   Min.   :3.500   Min.   : 80.0      
##  ever       : 4003   1st Qu.:23.63   1st Qu.:4.800   1st Qu.:100.0      
##  former     : 9352   Median :27.32   Median :5.800   Median :140.0      
##  never      :35092   Mean   :27.32   Mean   :5.528   Mean   :138.1      
##  No Info    :35810   3rd Qu.:29.58   3rd Qu.:6.200   3rd Qu.:159.0      
##  not current: 6439   Max.   :95.69   Max.   :9.000   Max.   :300.0      
##  diabetes   
##  No :91482  
##  Yes: 8500  
##             
##             
##             
## 
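The relabelling step above relies on the factor levels sorting as 0 and then 1. A slightly more defensive variant, sketched here on a toy vector rather than the real column, names the codes and their labels explicitly in one call. The vector `diabetes_raw` and its values are made up for illustration.

```r
# Toy stand-in for the 0/1 diabetes column (hypothetical values).
diabetes_raw <- c(0, 1, 0, 0, 1)

# Map codes to labels explicitly instead of relying on level order.
diabetes_fct <- factor(diabetes_raw, levels = c(0, 1), labels = c("No", "Yes"))

# Share of each class, useful for spotting class imbalance early.
class_share <- prop.table(table(diabetes_fct))
print(class_share)
```

The same `factor(..., levels = , labels = )` call could be used inside `mutate()` in place of the separate `levels<-` assignment.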

2 Part 2: Model Development

2.1 Build Tree with rpart

We will use the rpart library to build our decision tree. We fit a model using all available predictors to predict the diabetes outcome.

# Build the decision tree model
library(rpart)

# method = "class" requests a classification tree
tree_model <- rpart(
  formula = diabetes ~ .,
  data = diabetes.data,
  method = "class"
)

# Print the model summary
print(tree_model)
## n= 99982 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 99982 8500 No (0.91498470 0.08501530)  
##   2) HbA1c_level< 6.7 96087 4605 No (0.95207468 0.04792532)  
##     4) blood_glucose_level< 210 94295 2813 No (0.97016809 0.02983191) *
##     5) blood_glucose_level>=210 1792    0 Yes (0.00000000 1.00000000) *
##   3) HbA1c_level>=6.7 3895    0 Yes (0.00000000 1.00000000) *
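Because every terminal node reports its size and its misclassified count, the in-sample confusion matrix can be rebuilt from the printout alone. The sketch below does that arithmetic in base R. The variable names are illustrative, and the counts are copied from nodes 3, 4, and 5 above.

```r
# Counts copied from the printed tree (terminal nodes 4, 5, and 3).
leaf_no_n    <- 94295        # node 4: predicted "No"
leaf_no_loss <- 2813         # node 4: actual "Yes" cases predicted "No"
leaf_yes_n   <- 1792 + 3895  # nodes 5 and 3: predicted "Yes", loss 0

tp <- leaf_yes_n                # diabetics predicted "Yes"
fn <- leaf_no_loss              # diabetics predicted "No"
fp <- 0                         # both "Yes" leaves are pure
tn <- leaf_no_n - leaf_no_loss  # non-diabetics predicted "No"

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / (tp + fn + fp + tn)

round(c(sensitivity = sensitivity,
        specificity = specificity,
        accuracy    = accuracy), 4)
```

Read this way, the tree has a sensitivity of about 0.67 with a specificity of 1 and an accuracy of about 0.97 on the training data. The high accuracy mostly reflects the class imbalance, since the tree still misses roughly a third of the diabetic patients.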

2.2 Visualize the Tree

We can use the rpart.plot package to create a clean visual of the tree structure. This helps in understanding the rules the model learned.

# Plot the tree
library(rpart.plot)
rpart.plot(tree_model, main = "Decision Tree for Diabetes Prediction")

The plot shows that the tree first splits on HbA1c_level at a threshold of 6.7. Patients at or above 6.7 (the ‘no’ branch, about 4% of the data) are immediately predicted ‘Yes’ for diabetes. The remaining 96% are split a second time on whether blood_glucose_level is below 210. Those below 210 (94% of the total dataset) are predicted ‘No’, and the remaining 2% are predicted ‘Yes’. Both ‘Yes’ leaves are pure in this data, while the ‘No’ leaf still contains 2,813 patients with diabetes (about 3% of that leaf), so every error the tree makes is a missed case.
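The two splits amount to a single rule, which can be written out as a small function. This is a minimal sketch: `predict_rule` is a hypothetical helper, the thresholds 6.7 and 210 are the rounded values shown in the printed tree (the exact split points stored in the model may differ slightly), and the three patients are made up.

```r
# Hypothetical helper that restates the tree's two splits as one rule.
# Thresholds are read from the printed tree, not extracted from the model.
predict_rule <- function(hba1c, glucose) {
  ifelse(hba1c >= 6.7 | glucose >= 210, "Yes", "No")
}

# Made-up patients: high HbA1c, high glucose, and neither.
rule_preds <- predict_rule(hba1c = c(7.0, 5.5, 5.5),
                           glucose = c(100, 250, 100))
print(rule_preds)
```

Writing the rule this way makes it clear that the tree ignores every predictor other than HbA1c_level and blood_glucose_level.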

3 Part 3: Model Comparison (ROC/AUC)

We will evaluate the new model using ROC-AUC. For comparison, we will also build the Full Logistic Model and plot both ROC curves together.

3.1 Logistic Model

This chunk builds the full logistic regression model to use as a baseline.

# Build the full logistic model
glm_full <- glm(
  formula = diabetes ~ .,
  data = diabetes.data,
  family = "binomial"
)

3.2 Calculate Predictions & AUC

Now we get the probabilities for both models and calculate their respective AUC scores.

# Get probabilities for the GLM model
glm_probs <- predict(glm_full, type = "response")

# Get probabilities for the Decision Tree model
tree_probs <- predict(tree_model, type = "prob")[, "Yes"]

# Calculate ROC curves (roc() and auc() come from the pROC package)
library(pROC)
roc_glm <- roc(diabetes.data$diabetes, glm_probs, quiet = TRUE)
roc_tree <- roc(diabetes.data$diabetes, tree_probs, quiet = TRUE)

# Get AUC values
auc_glm <- auc(roc_glm)
auc_tree <- auc(roc_tree)

print(paste("Full GLM AUC:", round(auc_glm, 4)))
## [1] "Full GLM AUC: 0.9619"
print(paste("Decision Tree AUC:", round(auc_tree, 4)))
## [1] "Decision Tree AUC: 0.8345"
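The tree's AUC can be checked by hand. The fitted tree outputs only two distinct probabilities (about 0.03 in the ‘No’ leaf and 1 in both ‘Yes’ leaves), so its ROC curve has a single interior point, and the area reduces to the average of sensitivity and specificity at that point. The sketch below reproduces the figure from the node counts printed in Section 2.1. It is a sanity check under that reading of the output, not part of the original analysis.

```r
# Counts taken from the printed tree and the summary() output.
n_yes_total <- 8500         # all patients with diabetes
n_yes_found <- 1792 + 3895  # diabetics in the two pure "Yes" leaves

sens <- n_yes_found / n_yes_total  # true positive rate at the single cut
spec <- 1                          # no non-diabetics fall in a "Yes" leaf

# With one interior ROC point, the trapezoidal area is (sens + spec) / 2.
auc_by_hand <- (sens + spec) / 2
round(auc_by_hand, 4)  # 0.8345, matching the value reported by auc()
```

This also explains why the tree trails the logistic model on AUC: with only two distinct scores it cannot rank patients within the large ‘No’ leaf, whereas the GLM assigns each patient its own probability.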

3.3 Plot ROC Curves

Finally, we plot both curves on the same graph to visually compare their performance.

# Plot the ROC curves
plot(roc_glm, col = "blue", main = "Model Comparison: GLM vs. Decision Tree")
plot(roc_tree, col = "darkgreen", add = TRUE)

# paste0() avoids the stray space that paste() leaves before ")"
legend("bottomright",
       legend = c(paste0("GLM Full (AUC: ", round(auc_glm, 4), ")"),
                  paste0("Decision Tree (AUC: ", round(auc_tree, 4), ")")),
       col = c("blue", "darkgreen"),
       lwd = 2)

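The plot compares the predictive performance of the two models: the full logistic regression model and the decision tree. Each curve traces Sensitivity (true positive rate) against Specificity; pROC draws the Specificity axis running from 1 down to 0 by default, which is equivalent to plotting against 1 - Specificity (false positive rate). A curve that bends closer to the top-left corner indicates better performance, and the Area Under the Curve (AUC) summarizes this in a single number. The GLM (AUC = 0.9619) clearly outperforms the decision tree (AUC = 0.8345), meaning it is considerably better at ranking patients with diabetes above those without. Both values are computed on the same data used to fit the models, so they are likely optimistic estimates of performance on new patients.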
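One way to read an AUC value is as the probability that a randomly chosen patient with diabetes receives a higher predicted probability than a randomly chosen patient without it, counting ties as one half. The base R sketch below computes that quantity from ranks on a made-up four-patient example. `auc_rank` is a hypothetical helper written for illustration, not a pROC function.

```r
# Rank-based (Mann-Whitney) AUC: the share of positive/negative pairs
# in which the positive case gets the higher score; ties count as 0.5.
auc_rank <- function(labels, scores) {
  pos <- scores[labels == "Yes"]
  neg <- scores[labels == "No"]
  r <- rank(c(pos, neg))  # average ranks are used for tied scores
  u <- sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2
  u / (length(pos) * length(neg))
}

# Made-up labels and predicted probabilities for four patients.
toy_labels <- c("No", "No", "Yes", "Yes")
toy_scores <- c(0.10, 0.40, 0.35, 0.80)

toy_auc <- auc_rank(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four Yes/No pairs are ordered correctly
```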

4 Conclusion

  • Part 1 (Data Prep): The data was loaded and prepped. The gender variable was cleaned by removing the ‘Other’ category, and diabetes, gender, and smoking_history were converted to factors.

  • Part 2 (Model Development): We developed one new model:

    • Decision Tree: We built a classification tree using the rpart library to predict diabetes from all other features. The fitted tree uses only two variables, HbA1c_level and blood_glucose_level, and each one isolates a segment of patients that the model predicts as having diabetes.
  • Part 3 (Model Comparison): We evaluated the new model by calculating its ROC-AUC on the entire dataset and compared it to the Full Logistic Model. The final AUC scores were:

    • Full GLM AUC: 0.9619
    • Decision Tree AUC: 0.8345

Based on this analysis, the Full Logistic Model continues to show stronger predictive performance (higher AUC) than the default rpart Decision Tree model. The tree, in exchange, is easier to interpret: it reduces the predictions to two threshold rules that can be read directly from the plot.

