This R Markdown document demonstrates how to create a Receiver Operating Characteristic (ROC) curve to assess the performance of a binary classification model.
The dataset is the Pima Indians Diabetes Dataset, in which diagnostic measurements are used to predict the onset of diabetes.
# Install the packages (only if not installed)
if (!require(pROC)) install.packages("pROC")
## Loading required package: pROC
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
if (!require(ggplot2)) install.packages("ggplot2")
## Loading required package: ggplot2
if (!require(caret)) install.packages("caret")
## Loading required package: caret
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
# Load the libraries
library(pROC)
library(ggplot2)
library(caret)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the dataset
diabetes_data <- read.csv("diabetes.csv")
# Display first few rows
head(diabetes_data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
# Summary statistics
summary(diabetes_data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
# Check structure
str(diabetes_data)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
# Class distribution
table(diabetes_data$Outcome)
##
## 0 1
## 500 268
# Check for zero values (potentially missing data)
# Note: the 500 zeros in Outcome are simply the negative class, not missing data
colSums(diabetes_data == 0)
## Pregnancies Glucose BloodPressure
## 111 5 35
## SkinThickness Insulin BMI
## 227 374 11
## DiabetesPedigreeFunction Age Outcome
## 0 0 500
# Visualize class distribution
ggplot(diabetes_data, aes(x = factor(Outcome))) +
geom_bar(fill = c("#00AFBB", "#E7B800")) +
labs(x = "Diabetes (0 = No, 1 = Yes)", y = "Count",
title = "Class Distribution") +
theme_minimal()
# Feature distributions
diabetes_data %>%
gather(key = "Variable", value = "Value", -Outcome) %>%
ggplot(aes(x = Value, fill = factor(Outcome))) +
geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
facet_wrap(~Variable, scales = "free") +
theme_minimal()
# Correlation matrix
corr_matrix <- cor(diabetes_data[, -9])
corrplot(corr_matrix, method = "ellipse", type = "upper",
title = "Correlation Between Features", tl.cex = 0.8,
mar = c(0, 0, 2, 0))  # reserve a top margin so the title is not clipped
# Boxplot for outliers
diabetes_data %>%
gather(key = "Variable", value = "Value", -Outcome) %>%
ggplot(aes(x = Variable, y = Value)) +
geom_boxplot(fill = "#FF9999") +
coord_flip() +
theme_minimal()
Insulin:
Exhibits an extremely wide distribution with many high-value outliers, and 374 of its values are exactly zero. This suggests substantial variance or measurement issues, most likely missing or incorrectly recorded data.
Glucose:
A few very low-value outliers (near zero), which are medically unrealistic, as glucose cannot be zero in a living patient. These require careful consideration or correction; a cleaning sketch follows this list.
Blood Pressure:
Contains 35 zero readings, which are likewise physiologically impossible and point to missing values coded as zero.
BMI (Body Mass Index):
Contains 11 zero values; a BMI of zero is not possible, so these are also likely placeholders for missing measurements.
Skin Thickness:
Contains 227 zeros, so nearly a third of this column appears to be missing rather than truly measured.
Age, Pregnancies, and Diabetes Pedigree Function:
Show right-skewed but plausible distributions. The 111 zeros in Pregnancies are legitimate values, and Age and Diabetes Pedigree Function contain no zeros at all.
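One common follow-up, sketched below but not applied in the rest of this analysis, is to recode the implausible zeros as NA so they can later be imputed or excluded. na_if() and across() come from dplyr, already loaded via the tidyverse.

# Sketch only: treat physiologically impossible zeros as missing values
cleaned_data <- diabetes_data %>%
  mutate(across(c(Glucose, BloodPressure, SkinThickness, Insulin, BMI),
                ~ na_if(., 0)))
colSums(is.na(cleaned_data))  # NA counts now mirror the zero counts above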
# Fit logistic regression model
logistic_model <- glm(Outcome ~ ., data = diabetes_data, family = binomial)
# Show model summary
summary(logistic_model)
##
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = diabetes_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4046964 0.7166359 -11.728 < 2e-16 ***
## Pregnancies 0.1231823 0.0320776 3.840 0.000123 ***
## Glucose 0.0351637 0.0037087 9.481 < 2e-16 ***
## BloodPressure -0.0132955 0.0052336 -2.540 0.011072 *
## SkinThickness 0.0006190 0.0068994 0.090 0.928515
## Insulin -0.0011917 0.0009012 -1.322 0.186065
## BMI 0.0897010 0.0150876 5.945 2.76e-09 ***
## DiabetesPedigreeFunction 0.9451797 0.2991475 3.160 0.001580 **
## Age 0.0148690 0.0093348 1.593 0.111192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 723.45 on 759 degrees of freedom
## AIC: 741.45
##
## Number of Fisher Scoring iterations: 5
# Predicted probabilities (probability of being diabetic)
predicted_probs <- predict(logistic_model, type = "response")
# Actual labels (ground truth)
true_labels <- diabetes_data$Outcome
# Using logistic regression, a common classification algorithm for binary outcomes.
# The model predicts the probability of the positive class (Outcome = 1), which we use for ROC analysis.
# type = "response" gives us probabilities rather than class labels (0/1), as required for plotting an ROC curve.
# Create ROC object
roc_obj <- roc(true_labels, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Compute AUC
auc_value <- auc(roc_obj)
paste("AUC:", auc_value)
## [1] "AUC: 0.839425373134328"
# Plot ROC curve with traditional axes
plot(roc_obj,
legacy.axes = TRUE,
col = "blue",
lwd = 2,
main = "ROC Curve - Pima Indians Diabetes Dataset (Fixed Axes)")
# Add random guess line
abline(a = 0, b = 1, lty = 2, col = "gray")
# Show AUC
text(0.6, 0.4, labels = paste("AUC =", round(auc(roc_obj), 3)), col = "red")
The ROC curve illustrates the performance of the classification model at various threshold settings: the True Positive Rate (Sensitivity) is plotted against the False Positive Rate (1 - Specificity). The AUC quantifies the classifier's ability to distinguish between the positive and negative classes:
- AUC = 1.0 means a perfect classifier.
- AUC = 0.5 means the classifier performs no better than random guessing.
Higher AUC values indicate better model performance.
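To make the construction of the curve concrete, the sketch below computes the single (FPR, TPR) point that a cutoff of 0.5 (an arbitrary choice for illustration) would contribute; sweeping the threshold from 0 to 1 traces out the full curve.

# One point on the ROC curve, computed by hand at threshold 0.5
thr <- 0.5
pred_at_thr <- as.integer(predicted_probs >= thr)
tpr <- sum(pred_at_thr == 1 & true_labels == 1) / sum(true_labels == 1)
fpr <- sum(pred_at_thr == 1 & true_labels == 0) / sum(true_labels == 0)
c(threshold = thr, TPR = tpr, FPR = fpr)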
# Extract TPR (Sensitivity) and FPR (1 - Specificity)
roc_data <- data.frame(
FPR = 1 - roc_obj$specificities,
TPR = roc_obj$sensitivities
)
# Plot with ggplot2
ggplot(roc_data, aes(x = FPR, y = TPR)) +
geom_line(color = "red", linewidth = 1.2) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
ggtitle("Enhanced ROC Curve - Pima Indians Diabetes Dataset (Correct Axes)") +
xlab("False Positive Rate") +
ylab("True Positive Rate (Sensitivity)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
Curve Position: The ROC curve sits clearly above the diagonal (dashed line), indicating the model has good predictive power and performs far better than random guessing.
Area Under the Curve (AUC): Judging visually, the AUC appears high (in the ~0.80-0.85 range), consistent with the computed value of 0.839. This signifies strong model performance.
Curve Shape: The curve rises steeply near the left side (low False Positive Rate), suggesting the model identifies positive cases with few false alarms at lower thresholds.
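That steep initial rise can be quantified with pROC's coords(). The illustrative call below reports the sensitivity achieved when specificity is held at 90%, i.e. at a false positive rate of 10%:

# Sensitivity at a fixed specificity of 0.9 (FPR = 0.1)
coords(roc_obj, x = 0.9, input = "specificity",
       ret = c("threshold", "specificity", "sensitivity"))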
# Cross-validation (5-fold) using caret
train_control <- trainControl(method = "cv", number = 5)
cv_model <- train(as.factor(Outcome) ~ .,
data = diabetes_data,
method = "glm",
family = "binomial",
trControl = train_control)
# Cross-validation results
print(cv_model)
## Generalized Linear Model
##
## 768 samples
## 8 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 614, 615, 615, 614, 614
## Resampling results:
##
## Accuracy Kappa
## 0.7630252 0.451491
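Accuracy and Kappa are caret's default metrics. Since this analysis centers on the ROC curve, the folds could also be scored by AUC. The variant below is a sketch, not part of the original run: classProbs = TRUE requires factor levels that are valid R names, so the outcome is recoded to "No"/"Yes" first.

# Hypothetical CV variant scored by AUC instead of accuracy
cv_data <- diabetes_data
cv_data$Outcome <- factor(cv_data$Outcome, levels = c(0, 1),
                          labels = c("No", "Yes"))  # valid R names for classProbs
roc_control <- trainControl(method = "cv", number = 5,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary)
cv_roc_model <- train(Outcome ~ ., data = cv_data,
                      method = "glm", family = "binomial",
                      metric = "ROC", trControl = roc_control)
cv_roc_model$results  # mean ROC (AUC), Sens, Spec across folds

Note that caret treats the first factor level ("No") as the event of interest, so the reported Sens and Spec refer to the negative class; the AUC itself is unaffected.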
# Optimal Threshold Selection (Youden's Index)
optimal_threshold <- coords(roc_obj, "best", ret = "threshold")[[1]]
cat("Optimal Threshold (Youden's index):", optimal_threshold, "\n")
## Optimal Threshold (Youden's index): 0.3536825
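For transparency, the same threshold can be recovered by hand from the components of the roc object, since Youden's index is simply J = sensitivity + specificity - 1 maximized over all candidate thresholds:

# Manual Youden's index over the thresholds stored in roc_obj
youden_j <- roc_obj$sensitivities + roc_obj$specificities - 1
roc_obj$thresholds[which.max(youden_j)]  # matches coords(roc_obj, "best")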
# Make predictions using optimal threshold
predicted_classes_optimal <- ifelse(predicted_probs >= optimal_threshold, 1, 0)
# Verify lengths match
cat("Length of true labels:", length(true_labels), "\n")
## Length of true labels: 768
cat("Length of predicted labels:", length(predicted_classes_optimal), "\n")
## Length of predicted labels: 768
# Generate confusion matrix
conf_matrix <- confusionMatrix(as.factor(predicted_classes_optimal),
as.factor(true_labels), positive = "1")
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 392 70
## 1 108 198
##
## Accuracy : 0.7682
## 95% CI : (0.7367, 0.7976)
## No Information Rate : 0.651
## P-Value [Acc > NIR] : 1.285e-12
##
## Kappa : 0.5062
##
## Mcnemar's Test P-Value : 0.00555
##
## Sensitivity : 0.7388
## Specificity : 0.7840
## Pos Pred Value : 0.6471
## Neg Pred Value : 0.8485
## Prevalence : 0.3490
## Detection Rate : 0.2578
## Detection Prevalence : 0.3984
## Balanced Accuracy : 0.7614
##
## 'Positive' Class : 1
##
Accuracy: 0.7682
Good overall accuracy (approximately 77%), significantly higher than the no-information rate (65.1%), with a highly significant p-value (1.285e-12). The model clearly outperforms the naive majority-class baseline.
Sensitivity (Recall or TPR): 0.7388
The model correctly identifies 73.9% of actual positive (diabetic) cases, indicating fairly good detection of individuals who actually have diabetes.
Specificity (TNR): 0.7840
The model correctly identifies 78.4% of actual negative (non-diabetic) cases, so it is also reasonably good at avoiding false positives.
Positive Predictive Value (Precision): 0.6471
About 64.7% of cases predicted as positive (diabetic) are actually diabetic; there is room for improvement here.
Negative Predictive Value: 0.8485
Very good negative predictive value: 84.9% of predicted non-diabetics are truly non-diabetic.
Balanced Accuracy: 0.7614
Balanced accuracy (76.1%) is good, as it accounts for both sensitivity and specificity.
Kappa (Cohen's Kappa): 0.5062
Moderate agreement (in the 0.41-0.60 band), indicating a fair degree of reliability in the model's predictions beyond chance.
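As a sanity check, each headline metric can be recomputed directly from the 2x2 table above (TP = 198, FN = 70, FP = 108, TN = 392):

# Recomputing the key metrics from the confusion matrix counts
TP <- 198; FN <- 70; FP <- 108; TN <- 392
TP / (TP + FN)                   # Sensitivity: 198/268 ~ 0.7388
TN / (TN + FP)                   # Specificity: 392/500 ~ 0.7840
TP / (TP + FP)                   # Precision (PPV): 198/306 ~ 0.6471
TN / (TN + FN)                   # NPV: 392/462 ~ 0.8485
(TP + TN) / (TP + TN + FP + FN)  # Accuracy: 590/768 ~ 0.7682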