Introduction

This R Markdown document demonstrates how to create a Receiver Operating Characteristic (ROC) curve to assess the performance of a binary classification model.

The dataset is the Pima Indians Diabetes Dataset, in which diagnostic measurements are used to predict the onset of diabetes.

Load Required Libraries

# Install the packages (only if not installed)
if (!require(pROC)) install.packages("pROC")
## Loading required package: pROC
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
if (!require(ggplot2)) install.packages("ggplot2")
## Loading required package: ggplot2
if (!require(caret)) install.packages("caret")
## Loading required package: caret
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
if (!require(corrplot)) install.packages("corrplot")
if (!require(tidyverse)) install.packages("tidyverse")

# Load the libraries
library(pROC)
library(ggplot2)
library(caret)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the Dataset

# Read the dataset
diabetes_data <- read.csv("diabetes.csv")

# Display first few rows
head(diabetes_data)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Explore the Dataset

# Summary statistics
summary(diabetes_data)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
# Check structure
str(diabetes_data)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...
# Class distribution
table(diabetes_data$Outcome)
## 
##   0   1 
## 500 268
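
The split is about 65% non-diabetic to 35% diabetic, so the classes are moderately imbalanced, which is worth keeping in mind when judging accuracy later. A quick check of the proportions:

prop.table(table(diabetes_data$Outcome))  # roughly 0.651 vs 0.349, matching the summary above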

Exploratory Data Analysis (EDA)

# Check for zero values (potential hidden missing data): zeros in Glucose,
# BloodPressure, SkinThickness, Insulin, and BMI are physiologically
# implausible, while zeros in Pregnancies and Outcome are legitimate
colSums(diabetes_data == 0)
##              Pregnancies                  Glucose            BloodPressure 
##                      111                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                      227                      374                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                      500
# Visualize class distribution
ggplot(diabetes_data, aes(x = factor(Outcome))) +
  geom_bar(fill = c("#00AFBB", "#E7B800")) +
  labs(x = "Diabetes (0 = No, 1 = Yes)", y = "Count",
       title = "Class Distribution") +
  theme_minimal()

# Feature distributions
diabetes_data %>%
  gather(key = "Variable", value = "Value", -Outcome) %>%
  ggplot(aes(x = Value, fill = factor(Outcome))) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  facet_wrap(~Variable, scales = "free") +
  theme_minimal()

# Correlation matrix (excluding the Outcome column)
corr_matrix <- cor(diabetes_data[, -9])
corrplot(corr_matrix, method = "ellipse", type = "upper",
         title = "Correlation Between Features", tl.cex = 0.8,
         mar = c(0, 0, 1, 0))  # extra top margin so the title is not clipped

# Boxplot for outliers
diabetes_data %>%
  gather(key = "Variable", value = "Value", -Outcome) %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "#FF9999") +
  coord_flip() +
  theme_minimal()

Insulin:

  • Exhibits an extremely wide distribution with many high-value outliers.

  • Suggests substantial variance or measurement issues, possibly indicating missing or incorrectly recorded data.

Glucose:

  • A few very low-value outliers (near zero), which are medically unrealistic, since glucose cannot be zero in a living patient.

  • Requires careful consideration or potential data correction.

Blood Pressure:

  • Some low-value outliers (close to zero), indicating measurement anomalies or data-entry errors, as physiological blood pressure cannot be zero in a living individual.

BMI (Body Mass Index):

  • Has a relatively stable central distribution but includes some outliers both at low and high extremes, suggesting potential measurement inconsistencies.

Skin Thickness:

  • Contains several extremely high-value outliers, possibly due to measurement errors or genuinely unusual cases.

Age, Pregnancies, and Diabetes Pedigree Function:

  • Distributions seem reasonably stable with fewer outliers, though some isolated extreme points remain.
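
Since zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI are physiologically implausible, a common remedy is to recode them as NA and impute. A minimal sketch using median imputation (the data frame name diabetes_clean is illustrative; this step is not applied to the model fitted below):

zero_as_na <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

diabetes_clean <- diabetes_data
diabetes_clean[zero_as_na] <- lapply(diabetes_clean[zero_as_na], function(x) {
  x[x == 0] <- NA                          # recode implausible zeros as missing
  x[is.na(x)] <- median(x, na.rm = TRUE)   # simple median imputation
  x
})

summary(diabetes_clean$Insulin)  # zeros now replaced by the column median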

Build a Logistic Regression Model

# Fit logistic regression model
logistic_model <- glm(Outcome ~ ., data = diabetes_data, family = binomial)

# Show model summary
summary(logistic_model)
## 
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = diabetes_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.4046964  0.7166359 -11.728  < 2e-16 ***
## Pregnancies               0.1231823  0.0320776   3.840 0.000123 ***
## Glucose                   0.0351637  0.0037087   9.481  < 2e-16 ***
## BloodPressure            -0.0132955  0.0052336  -2.540 0.011072 *  
## SkinThickness             0.0006190  0.0068994   0.090 0.928515    
## Insulin                  -0.0011917  0.0009012  -1.322 0.186065    
## BMI                       0.0897010  0.0150876   5.945 2.76e-09 ***
## DiabetesPedigreeFunction  0.9451797  0.2991475   3.160 0.001580 ** 
## Age                       0.0148690  0.0093348   1.593 0.111192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 723.45  on 759  degrees of freedom
## AIC: 741.45
## 
## Number of Fisher Scoring iterations: 5
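
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A brief sketch (confint() profiles the likelihood, so it can take a moment):

# Odds ratios with profile-likelihood 95% confidence intervals
exp(cbind(OddsRatio = coef(logistic_model), confint(logistic_model)))
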
# Predicted probabilities (probability of being diabetic)
predicted_probs <- predict(logistic_model, type = "response")

# Actual labels (ground truth)
true_labels <- diabetes_data$Outcome

# Using logistic regression, a common classification algorithm for binary outcomes.
# The model predicts the probability of the positive class (Outcome = 1), which we use for ROC analysis.
# type = "response" gives probabilities rather than class labels (0/1), which is what an ROC curve needs.
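
For comparison, predict() with the default type = "link" returns log-odds; applying the inverse logit (plogis) recovers the same probabilities. A quick sketch to verify:

log_odds <- predict(logistic_model, type = "link")
all.equal(plogis(log_odds), predicted_probs)  # TRUE: plogis() is the inverse logit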

Generate the ROC Curve and Calculate the AUC

# Create ROC object
roc_obj <- roc(true_labels, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Compute AUC
auc_value <- auc(roc_obj)
paste("AUC:", auc_value)
## [1] "AUC: 0.839425373134328"
# Plot ROC curve with traditional axes
plot(roc_obj,
     legacy.axes = TRUE,  # x-axis as FPR (1 - specificity) instead of specificity
     col = "blue",
     lwd = 2,
     main = "ROC Curve - Pima Indians Diabetes Dataset")

# Add random guess line
abline(a = 0, b = 1, lty = 2, col = "gray")

# Show AUC
text(0.6, 0.4, labels = paste("AUC =", round(auc(roc_obj), 3)), col = "red")

The ROC curve illustrates the performance of the classification model across threshold settings: the True Positive Rate (Sensitivity, TP / (TP + FN)) is plotted against the False Positive Rate (1 - Specificity, FP / (FP + TN)). The AUC quantifies the classifier's ability to distinguish the positive class from the negative class:

  • AUC = 1.0 means a perfect classifier.

  • AUC = 0.5 means the classifier performs no better than random guessing.

Higher AUC values indicate better model performance.
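
The AUC also has a rank-based interpretation: it equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case (ties counted as one half). A minimal sketch of this Mann-Whitney computation, which should agree with auc(roc_obj):

pos <- predicted_probs[true_labels == 1]
neg <- predicted_probs[true_labels == 0]
manual_auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
manual_auc  # should match the pROC value of ~0.839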

Plot the ROC Curve with ggplot2

roc_obj <- roc(true_labels, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Extract TPR (Sensitivity) and FPR (1 - Specificity)
roc_data <- data.frame(
  FPR = 1 - roc_obj$specificities,
  TPR = roc_obj$sensitivities
)

# Plot with ggplot2
ggplot(roc_data, aes(x = FPR, y = TPR)) +
  geom_line(color = "red", linewidth = 1.2) +  # linewidth replaces the deprecated size aesthetic
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  ggtitle("ROC Curve - Pima Indians Diabetes Dataset (ggplot2)") +
  xlab("False Positive Rate") +
  ylab("True Positive Rate (Sensitivity)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
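
To mirror the base-R version, the computed AUC can be annotated on the ggplot as well. A sketch using ggplot2's last_plot() helper (the annotation position is arbitrary):

last_plot() +
  annotate("text", x = 0.6, y = 0.4, color = "red",
           label = paste("AUC =", round(auc(roc_obj), 3)))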

Curve Position: The ROC curve sits well above the diagonal (dashed line), indicating the model has good predictive power and performs far better than random guessing.

Area Under the Curve (AUC): The visual impression agrees with the computed AUC of about 0.839, which signifies strong model performance.

Curve Shape: The curve rises steeply on the left side (low False Positive Rate), suggesting the model identifies positive cases effectively with relatively few false alarms at lower thresholds.

# Cross-validation (5-fold) using caret
train_control <- trainControl(method = "cv", number = 5)
cv_model <- train(as.factor(Outcome) ~ .,
                  data = diabetes_data,
                  method = "glm",
                  family = "binomial",
                  trControl = train_control)

# Cross-validation results
print(cv_model)
## Generalized Linear Model 
## 
## 768 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 614, 615, 615, 614, 614 
## Resampling results:
## 
##   Accuracy   Kappa   
##   0.7630252  0.451491
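
caret's default summary reports only Accuracy and Kappa. A sketch of how the same model could instead be cross-validated on AUC (classProbs = TRUE requires valid R names for the class levels, so 0/1 are relabeled; the labels "No"/"Yes" and the seed are illustrative):

# Relabel the outcome so class probabilities can be computed
cv_data <- diabetes_data
cv_data$Outcome <- factor(cv_data$Outcome, levels = c(0, 1),
                          labels = c("No", "Yes"))

roc_control <- trainControl(method = "cv", number = 5,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary)

set.seed(123)  # illustrative seed for reproducible folds
cv_roc_model <- train(Outcome ~ ., data = cv_data,
                      method = "glm", family = "binomial",
                      metric = "ROC", trControl = roc_control)
cv_roc_model$results  # ROC (AUC), Sens, and Spec averaged over folds
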
# Optimal Threshold Selection (Youden's Index)
optimal_threshold <- coords(roc_obj, "best", ret = "threshold")[[1]]
cat("Optimal Threshold (Youden's index):", optimal_threshold, "\n")
## Optimal Threshold (Youden's index): 0.3536825
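
Youden's index picks the threshold that maximizes sensitivity + specificity - 1, i.e. the point on the ROC curve farthest above the diagonal. A quick manual check against the pROC result:

# Recover the same threshold directly from the ROC coordinates
youden_j <- roc_obj$sensitivities + roc_obj$specificities - 1
roc_obj$thresholds[which.max(youden_j)]  # should match coords(..., "best")
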
# Make predictions using optimal threshold
predicted_classes_optimal <- ifelse(predicted_probs >= optimal_threshold, 1, 0)

# Verify lengths match
cat("Length of true labels:", length(true_labels), "\n")
## Length of true labels: 768
cat("Length of predicted labels:", length(predicted_classes_optimal), "\n")
## Length of predicted labels: 768
# Generate confusion matrix
conf_matrix <- confusionMatrix(as.factor(predicted_classes_optimal), 
                               as.factor(true_labels), positive = "1")
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 392  70
##          1 108 198
##                                           
##                Accuracy : 0.7682          
##                  95% CI : (0.7367, 0.7976)
##     No Information Rate : 0.651           
##     P-Value [Acc > NIR] : 1.285e-12       
##                                           
##                   Kappa : 0.5062          
##                                           
##  Mcnemar's Test P-Value : 0.00555         
##                                           
##             Sensitivity : 0.7388          
##             Specificity : 0.7840          
##          Pos Pred Value : 0.6471          
##          Neg Pred Value : 0.8485          
##              Prevalence : 0.3490          
##          Detection Rate : 0.2578          
##    Detection Prevalence : 0.3984          
##       Balanced Accuracy : 0.7614          
##                                           
##        'Positive' Class : 1               
## 

Accuracy: 0.7682

  • Good overall accuracy (approximately 77%), significantly higher than the no-information rate (65.1%), with a highly significant p-value (1.285e-12). The model clearly predicts better than always guessing the majority class.

Sensitivity (Recall or TPR): 0.7388

  • The model correctly identifies 73.9% of actual positive (diabetic) cases, indicating fairly good detection of individuals who actually have diabetes.

Specificity (TNR): 0.7840

  • The model correctly identifies 78.4% of actual negative (non-diabetic) cases, so it is also good at avoiding false positives.

Positive Predictive Value (Precision): 0.6471

  • About 64.7% of cases predicted as positive (diabetic) are actually diabetic. There is room for improvement here.

Negative Predictive Value: 0.8485

  • Very good negative predictive value: 84.9% of predicted non-diabetics are truly non-diabetic.

Balanced Accuracy: 0.7614

  • Balanced accuracy (76.1%), the average of sensitivity and specificity, is good.

Kappa (Cohen's Kappa): 0.5062

  • Moderate agreement (in the 0.41-0.60 band), indicating a fair degree of reliability in the model's predictions beyond random chance.
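
All of these metrics can also be pulled programmatically from the caret object rather than read off the printout. A short sketch:

# Per-class metrics live in $byClass; overall metrics in $overall
round(conf_matrix$byClass[c("Sensitivity", "Specificity", "Pos Pred Value",
                            "Neg Pred Value", "Balanced Accuracy")], 4)
round(conf_matrix$overall[c("Accuracy", "Kappa")], 4)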