Part 1: Data Preparation
Load Data
Here we load the diabetes prediction dataset.
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 2/Data")
diabetes.data <- read_csv("diabetes_prediction_dataset.csv")
Clean Data
We will remove the ‘Other’ category from gender since it
has few observations. We also convert our target variable
diabetes and categorical predictors (gender,
smoking_history) into factors so the classification models
can interpret them correctly.
# Filter out other gender category
diabetes.data <- diabetes.data %>%
filter(gender != "Other")
# Convert character columns and target to factors
diabetes.data <- diabetes.data %>%
mutate(
diabetes = as.factor(diabetes),
gender = as.factor(gender),
smoking_history = as.factor(smoking_history)
)
# Set factor levels for clarity (0 = No, 1 = Yes)
levels(diabetes.data$diabetes) <- c("No", "Yes")
# Check the structure of the prepared data
summary(diabetes.data)
## gender age hypertension heart_disease
## Female:58552 Min. : 0.08 Min. :0.00000 Min. :0.00000
## Male :41430 1st Qu.:24.00 1st Qu.:0.00000 1st Qu.:0.00000
## Median :43.00 Median :0.00000 Median :0.00000
## Mean :41.89 Mean :0.07486 Mean :0.03943
## 3rd Qu.:60.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :80.00 Max. :1.00000 Max. :1.00000
## smoking_history bmi HbA1c_level blood_glucose_level
## current : 9286 Min. :10.01 Min. :3.500 Min. : 80.0
## ever : 4003 1st Qu.:23.63 1st Qu.:4.800 1st Qu.:100.0
## former : 9352 Median :27.32 Median :5.800 Median :140.0
## never :35092 Mean :27.32 Mean :5.528 Mean :138.1
## No Info :35810 3rd Qu.:29.58 3rd Qu.:6.200 3rd Qu.:159.0
## not current: 6439 Max. :95.69 Max. :9.000 Max. :300.0
## diabetes
## No :91482
## Yes: 8500
##
##
##
##
Part 2: Model Development
Build Tree with rpart
We will use the rpart package to build our decision
tree. We fit a model using all available predictors to predict the
diabetes outcome.
# Build the decision tree model
# We use method = "class" for a classification tree
tree_model <- rpart(
formula = diabetes ~ .,
data = diabetes.data,
method = "class"
)
# Print the model summary
print(tree_model)
## n= 99982
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 99982 8500 No (0.91498470 0.08501530)
## 2) HbA1c_level< 6.7 96087 4605 No (0.95207468 0.04792532)
## 4) blood_glucose_level< 210 94295 2813 No (0.97016809 0.02983191) *
## 5) blood_glucose_level>=210 1792 0 Yes (0.00000000 1.00000000) *
## 3) HbA1c_level>=6.7 3895 0 Yes (0.00000000 1.00000000) *
Visualize the Tree
We can use the rpart.plot package to create a clean
visual of the tree structure. This helps in understanding the rules the
model learned.
# Plot the tree
rpart.plot(tree_model, main = "Decision Tree for Diabetes Prediction")

In the plot, the tree first splits the data based on an
HbA1c_level threshold of 6.7. If a patient’s level is 6.7
or higher (the ‘no’ path, accounting for 4% of the data), the model
immediately predicts ‘Yes’ for diabetes. For the 96% of patients with a
lower HbA1c level, the model asks a second question: is
their blood_glucose_level less than 210? This final split
results in a ‘No’ prediction for 94% of the total dataset and a ‘Yes’
prediction for the remaining 2%.
Part 3: Model Comparison (ROC/AUC)
We will evaluate the new model using ROC-AUC. For comparison, we will
also build the Full Logistic Model and plot both ROC curves
together.
Logistic Model
This chunk builds the full logistic regression model to use as a
baseline.
# Build the full logistic model
glm_full <- glm(
formula = diabetes ~ .,
data = diabetes.data,
family = "binomial"
)
Calculate Predictions & AUC
Now we get the probabilities for both models and calculate their
respective AUC scores.
# Get probabilities for the GLM model
glm_probs <- predict(glm_full, type = "response")
# Get probabilities for the Decision Tree model
tree_probs <- predict(tree_model, type = "prob")[, "Yes"]
# Calculate ROC curves
roc_glm <- roc(diabetes.data$diabetes, glm_probs, quiet = TRUE)
roc_tree <- roc(diabetes.data$diabetes, tree_probs, quiet = TRUE)
# Get AUC values
auc_glm <- auc(roc_glm)
auc_tree <- auc(roc_tree)
print(paste("Full GLM AUC:", round(auc_glm, 4)))
## [1] "Full GLM AUC: 0.9619"
print(paste("Decision Tree AUC:", round(auc_tree, 4)))
## [1] "Decision Tree AUC: 0.8345"
Plot ROC Curves
Finally, we plot both curves on the same graph to visually compare
their performance.
# Plot the ROC curves
plot(roc_glm, col = "blue", main = "Model Comparison: GLM vs. Decision Tree")
plot(roc_tree, col = "darkgreen", add = TRUE)
legend("bottomright",
legend = c(paste0("GLM Full (AUC: ", round(auc_glm, 4), ")"),
paste0("Decision Tree (AUC: ", round(auc_tree, 4), ")")),
col = c("blue", "darkgreen"),
lwd = 2)

The plot compares the predictive performance of two models: the full
logistic regression model (GLM) and the decision tree. An ROC curve
relates Sensitivity (true positive rate) to 1 - Specificity (false
positive rate); pROC draws the same curve with Specificity on a
reversed x-axis by default. A curve that bends closer to the top-left
corner signifies better performance. This is quantified by the Area
Under the Curve (AUC): the GLM's AUC of 0.9619 exceeds the Decision
Tree's AUC of 0.8345, indicating that the GLM is better at ranking
diabetic cases above non-diabetic ones.
Conclusion
Part 1 (Data Prep): The data was loaded and
prepped. The gender variable was cleaned by removing the
‘Other’ category, and diabetes, gender, and
smoking_history were converted to factors.
Part 2 (Model Development): We developed one new
model:
- Decision Tree: We built a classification tree using the rpart
package to predict diabetes from all other features. The fitted tree
uses only two variables, HbA1c_level and blood_glucose_level, each of
which isolates a segment of patients that the tree classifies as
diabetic.
Part 3 (Model Comparison): We evaluated the new
model by calculating its ROC-AUC on the entire dataset (the same data
used to fit it, so the scores are in-sample) and compared it to the
Full Logistic Model. The final AUC scores were:
- Full GLM AUC: 0.9619
- Decision Tree AUC: 0.8345
Based on this analysis, the Full Logistic Model
continues to show stronger predictive performance (higher AUC) than the
default rpart Decision Tree model. The
tree model provides a simpler overview of the predictions.
---
title: 'Project Two: Part 2 - Decision Trees'
author: 'Jeff Delva'
date: "October 29, 2025"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 8
    fig_height: 5
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
    highlight: tango
---

```{css, echo = FALSE}
h1.title {
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}
h4.author, h4.date {
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}
h1 {
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}
h2 {
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
h3 {
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

library(readr)
library(dplyr)
library(rpart)      
library(rpart.plot)
library(pROC)       
```

# Part 1: Data Preparation

## Load Data

Here we load the diabetes prediction dataset.

```{r load-data}
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 2/Data")
diabetes.data <- read_csv("diabetes_prediction_dataset.csv")
```

## Clean Data

We will remove the 'Other' category from `gender` since it has few observations. We also convert our target variable `diabetes` and categorical predictors (`gender`, `smoking_history`) into factors so the classification models can interpret them correctly.

```{r clean-data}
# Filter out other gender category
diabetes.data <- diabetes.data %>%
  filter(gender != "Other")

# Convert character columns and target to factors
diabetes.data <- diabetes.data %>%
  mutate(
    diabetes = as.factor(diabetes),
    gender = as.factor(gender),
    smoking_history = as.factor(smoking_history)
  )

# Set factor levels for clarity (0 = No, 1 = Yes)
levels(diabetes.data$diabetes) <- c("No", "Yes")

# Check the structure of the prepared data
summary(diabetes.data)
```
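
As a minimal sketch of what the relabeling step does, the toy vector below (hypothetical values, not taken from the dataset) shows how `as.factor()` orders the levels of a 0/1 variable and how assigning to `levels()` renames them by position. That positional renaming is why `c("No", "Yes")` maps 0 to 'No' and 1 to 'Yes'.

```{r sketch-factor-levels}
# Toy 0/1 outcome (hypothetical values for illustration only)
toy_outcome <- as.factor(c(0, 1, 1, 0, 0))

# as.factor() sorts the unique values, so the levels are "0" then "1"
levels(toy_outcome)

# Renaming is positional: first level -> "No", second level -> "Yes"
levels(toy_outcome) <- c("No", "Yes")
table(toy_outcome)
```

If the levels had been in a different order, the same assignment would silently swap the labels, so checking `levels()` before renaming is a cheap safeguard.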

# Part 2: Model Development

## Build Tree with rpart

We will use the `rpart` package to build our decision tree. We fit a model using all available predictors to predict the `diabetes` outcome.

```{r build-tree}
# Build the decision tree model
# We use method = "class" for a classification tree
tree_model <- rpart(
  formula = diabetes ~ .,
  data = diabetes.data,
  method = "class" 
)

# Print the model summary
print(tree_model)
```
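
To give a feel for why the first split is so effective, the sketch below computes the Gini impurity before and after the root split, using node counts copied from the printed tree (99,982 observations with 8,500 'Yes' at the root; 96,087 with 4,605 'Yes' below the `HbA1c_level` threshold; 3,895 all 'Yes' above it). It assumes `rpart` is using its default Gini criterion, since the call above does not set `parms`; the helper `gini()` is a hypothetical name for illustration.

```{r sketch-gini}
# Gini impurity for a two-class node with positive-class proportion p
gini <- function(p) 2 * p * (1 - p)

# Counts copied from the printed tree
n_root  <- 99982; yes_root  <- 8500   # root node
n_left  <- 96087; yes_left  <- 4605   # HbA1c_level < 6.7
n_right <- 3895;  yes_right <- 3895   # HbA1c_level >= 6.7 (all "Yes")

gini_root  <- gini(yes_root / n_root)
gini_split <- (n_left / n_root) * gini(yes_left / n_left) +
  (n_right / n_root) * gini(yes_right / n_right)

# Impurity before the split, after the split, and the reduction
round(c(before = gini_root, after = gini_split,
        reduction = gini_root - gini_split), 4)
```

The right-hand child is pure, so it contributes zero impurity; all remaining impurity sits in the larger left-hand node.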

## Visualize the Tree

We can use the `rpart.plot` package to create a clean visual of the tree structure. This helps in understanding the rules the model learned.

```{r plot-tree}
# Plot the tree
rpart.plot(tree_model, main = "Decision Tree for Diabetes Prediction")
```

In the plot, the tree first splits the data based on an `HbA1c_level` threshold of 6.7. If a patient's level is 6.7 or higher (the 'no' path, accounting for 4% of the data), the model immediately predicts 'Yes' for diabetes. For the 96% of patients with a lower `HbA1c` level, the model asks a second question: is their `blood_glucose_level` less than 210? This final split results in a 'No' prediction for 94% of the total dataset and a 'Yes' prediction for the remaining 2%.
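
The two splits described above can be written out as an explicit rule. This is only a sketch: `rule_predict()` is a hypothetical helper, the patient values are invented, and the thresholds and node sizes are copied from the printed tree.

```{r sketch-tree-rules}
# The two splits from the printed tree, written as an explicit rule
rule_predict <- function(hba1c, glucose) {
  ifelse(hba1c >= 6.7 | glucose >= 210, "Yes", "No")
}

# Hypothetical patients (values invented for illustration)
rule_predict(hba1c = c(5.5, 7.0, 6.0), glucose = c(140, 130, 240))

# Share of the data in each terminal node, from the printed node sizes
leaf_n <- c(high_hba1c = 3895, high_glucose = 1792, neither = 94295)
round(100 * leaf_n / sum(leaf_n))
```

The rounded shares reproduce the 4%, 2%, and 94% figures quoted in the paragraph above.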


# Part 3: Model Comparison (ROC/AUC)

We will evaluate the new model using ROC-AUC. For comparison, we will also build the Full Logistic Model and plot both ROC curves together.

## Logistic Model

This chunk builds the full logistic regression model to use as a baseline.

```{r glm-model}
# Build the full logistic model
glm_full <- glm(
  formula = diabetes ~ .,
  data = diabetes.data,
  family = "binomial"
)
```

## Calculate Predictions & AUC

Now we get the probabilities for both models and calculate their respective AUC scores.

```{r tree-auc}
# Get probabilities for the GLM model
glm_probs <- predict(glm_full, type = "response")

# Get probabilities for the Decision Tree model
tree_probs <- predict(tree_model, type = "prob")[, "Yes"]

# Calculate ROC curves
roc_glm <- roc(diabetes.data$diabetes, glm_probs, quiet = TRUE)
roc_tree <- roc(diabetes.data$diabetes, tree_probs, quiet = TRUE)

# Get AUC values
auc_glm <- auc(roc_glm)
auc_tree <- auc(roc_tree)

print(paste("Full GLM AUC:", round(auc_glm, 4)))
print(paste("Decision Tree AUC:", round(auc_tree, 4)))
```
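
For intuition about what `auc()` reports: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The base-R sketch below computes that quantity from ranks (the Mann-Whitney form). `auc_by_rank()` is a hypothetical helper, and the labels and scores are toy values rather than model output.

```{r sketch-auc-rank}
auc_by_rank <- function(labels, scores, positive = "Yes") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks, so tied scores count as one half
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_labels <- c("No", "No", "Yes", "No", "Yes", "Yes")
toy_scores <- c(0.10, 0.30, 0.35, 0.60, 0.70, 0.90)
auc_by_rank(toy_labels, toy_scores)
```

In the toy data, 8 of the 9 positive-negative pairs are ordered correctly, so the function returns 8/9.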

## Plot ROC Curves

Finally, we plot both curves on the same graph to visually compare their performance.

```{r plot-roc}
# Plot the ROC curves
plot(roc_glm, col = "blue", main = "Model Comparison: GLM vs. Decision Tree")
plot(roc_tree, col = "darkgreen", add = TRUE)

legend("bottomright",
       legend = c(paste0("GLM Full (AUC: ", round(auc_glm, 4), ")"),
                  paste0("Decision Tree (AUC: ", round(auc_tree, 4), ")")),
       col = c("blue", "darkgreen"),
       lwd = 2)
```

The plot compares the predictive performance of two models: the full logistic regression model (GLM) and the decision tree. An ROC curve relates Sensitivity (true positive rate) to 1 - Specificity (false positive rate); `pROC` draws the same curve with Specificity on a reversed x-axis by default. A curve that bends closer to the top-left corner signifies better performance. This is quantified by the Area Under the Curve (AUC): the GLM's AUC of 0.9619 exceeds the Decision Tree's AUC of 0.8345, indicating that the GLM is better at ranking diabetic cases above non-diabetic ones.
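
One reason the tree's curve looks angular: the printed tree has three terminal nodes but only two distinct predicted probabilities (about 0.03 in the 'No' leaf and 1.00 in both 'Yes' leaves), so its ROC curve has a single interior point. As a rough check, assuming that reading of the printed output, the tree's AUC can be recovered by hand from the node counts:

```{r sketch-tree-auc}
# Counts copied from the printed tree and summary output
yes_total <- 8500                 # all observed "Yes" cases
no_total  <- 91482                # all observed "No" cases
yes_in_yes_leaves <- 1792 + 3895  # both "Yes" leaves are pure (loss = 0)
no_in_yes_leaves  <- 0

sens_tree <- yes_in_yes_leaves / yes_total    # true positive rate
spec_tree <- 1 - no_in_yes_leaves / no_total  # true negative rate

# With one interior ROC point, the trapezoid area is the mean of the two
auc_tree_by_hand <- (sens_tree + spec_tree) / 2
round(auc_tree_by_hand, 4)
```

The result, 0.8345, matches the value reported by `auc()`. The tree flags about 67% of diabetic cases with no false positives, but it cannot rank patients within its large 'No' leaf, which is where the GLM gains its advantage.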

# Conclusion

  * **Part 1 (Data Prep)**: The data was loaded and prepped. The `gender` variable was cleaned by removing the 'Other' category, and `diabetes`, `gender`, and `smoking_history` were converted to factors.

  * **Part 2 (Model Development)**: We developed one new model:

      * **Decision Tree**: We built a classification tree using the `rpart` package to predict `diabetes` from all other features. The fitted tree uses only two variables, `HbA1c_level` and `blood_glucose_level`, each of which isolates a segment of patients that the tree classifies as diabetic.

  * **Part 3 (Model Comparison)**: We evaluated the new model by calculating its ROC-AUC on the *entire dataset* (the same data used to fit it, so the scores are in-sample) and compared it to the Full Logistic Model. The final AUC scores were:

      * **Full GLM AUC**: `r round(auc_glm, 4)`
      * **Decision Tree AUC**: `r round(auc_tree, 4)`

Based on this analysis, the **Full Logistic Model** continues to show stronger predictive performance (higher AUC) than the default `rpart` **Decision Tree** model. The tree model provides a simpler overview of the predictions.
