In this homework, you will apply logistic regression to a real-world dataset: the Pima Indians Diabetes Database. This dataset contains medical records from 768 women of Pima Indian heritage, aged 21 or older, and is used to predict the onset of diabetes (binary outcome: 0 = no diabetes, 1 = diabetes) based on physiological measurements.
The data is publicly available from the UCI Machine Learning Repository and can be imported directly.
Dataset URL: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
Columns (the CSV has no header row, so we assign them manually): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome.
Task Overview: You will load the data, build a logistic regression model to predict diabetes onset using a subset of predictors (Glucose, BMI, Age), interpret the model, evaluate it with a confusion matrix and metrics, and analyze the ROC curve and AUC.
Cleaning the dataset
Don’t change the following code.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data <- read.csv(url, header = FALSE)
colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")
head(data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
data$Outcome <- as.factor(data$Outcome)
colSums(is.na(data))
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
# Handle missing values (replace 0s with NA because 0 makes no sense here)
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA
str(data)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 NA 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 2 ...
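As an optional sanity check (not part of the required cleaning code), re-counting the missing values after the zero-to-NA replacement shows how many impossible zeros were converted in each of the affected columns:
# NAs introduced by the zero-to-NA replacement above
colSums(is.na(data[, c("Glucose", "BloodPressure", "BMI")]))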
Question 1: Create and Interpret a Logistic Regression Model
Fit a logistic regression model to predict Outcome using Glucose, BMI, and Age.
Provide the model summary.
Calculate and interpret R²: 1 - (model$deviance / model$null.deviance). What does it indicate about the model’s explanatory power?
In this problem, the predicted variable is categorical and the predictor variables are continuous, so we cannot use cross-tables to check that each category level has more than 5 observations, as in the class activity. We will directly use the glm() function (Generalized Linear Model). Setting family = binomial tells glm() to perform logistic regression, which is appropriate for a binary outcome like the one here.
## Enter your code here
diabetic <- glm(Outcome ~ Glucose + BMI + Age, data=data, family="binomial")
summary(diabetic)
##
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.032377 0.711037 -12.703 < 2e-16 ***
## Glucose 0.035548 0.003481 10.212 < 2e-16 ***
## BMI 0.089753 0.014377 6.243 4.3e-10 ***
## Age 0.028699 0.007809 3.675 0.000238 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 974.75 on 751 degrees of freedom
## Residual deviance: 724.96 on 748 degrees of freedom
## (16 observations deleted due to missingness)
## AIC: 732.96
##
## Number of Fisher Scoring iterations: 4
What does the intercept represent (log-odds of diabetes when predictors are zero)?
Note: the intercept and the three coefficients are statistically significant (p<<0.05)
The ln(odds) equation is:
ln(odds of diabetes) = -9.032377 + 0.035548(Glucose) + 0.089753(BMI) + 0.028699(Age)
Thus, when all predictors are zero, odds = e^(-9.032377) ≈ 0.00012 (about 1.2 in 10,000), i.e., extremely small baseline odds. This is an extrapolation, since Glucose = 0 or BMI = 0 does not occur in practice.
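As a quick check (not required by the assignment), plugging the first patient’s values (Glucose = 148, BMI = 33.6, Age = 50) into this equation and converting the log-odds to a probability reproduces the fitted probability of about 0.664 that appears in Question 2:
# Worked example for the first patient (Glucose = 148, BMI = 33.6, Age = 50)
log_odds <- -9.032377 + 0.035548 * 148 + 0.089753 * 33.6 + 0.028699 * 50
plogis(log_odds)  # logistic transform of the log-odds; ~0.664, matching diabetic$fitted.values[1]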
For each predictor (Glucose, BMI, Age), does a one-unit increase raise or lower the odds of diabetes? Are they significant (p-value < 0.05)?
Per above, the three predictors are statistically significant.
A one-unit increase in Glucose adds 0.035548 to the ln(odds), i.e., multiplies the odds of diabetes by e^0.035548 ≈ 1.036 (a 3.6% increase).
A one-unit increase in BMI adds 0.089753 to the ln(odds), i.e., multiplies the odds by e^0.089753 ≈ 1.094 (a 9.4% increase).
A one-unit increase in Age adds 0.028699 to the ln(odds), i.e., multiplies the odds by e^0.028699 ≈ 1.029 (a 2.9% increase).
All three predictors therefore increase the odds of diabetes, with per-unit effect sizes ranked BMI > Glucose > Age.
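Equivalently, the odds ratios can be read directly off the fitted model by exponentiating the coefficients; a minimal sketch (the confidence limits come from confint.default(), i.e., Wald normal-approximation intervals):
# Odds ratios with approximate (Wald) 95% confidence limits
exp(cbind(OR = coef(diabetic), confint.default(diabetic)))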
R-Squared calculation
Deviance in logistic regression is a measure of model fit.
Null deviance (974.75): deviance with no predictors.
Residual deviance (724.96): deviance with 3 predictors: Glucose, BMI and Age.
The decrease shows the model explains some of the variation in diabetes outcome.
We use them to compute a pseudo-R-squared and the overall p-value.
r_square <- 1 - (diabetic$deviance/diabetic$null.deviance)
r_square
## [1] 0.25626
Overall p-value
1 - pchisq((diabetic$null.deviance - diabetic$deviance), df=3)
## [1] 0
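The printed 0 is a floating-point artifact: 1 - pchisq() underflows when the chi-square statistic (here about 974.75 - 724.96 ≈ 249.8 on 3 degrees of freedom) is very large. A small sketch of an equivalent call that avoids the subtraction and returns the actual (extremely small) tail probability:
# Same likelihood-ratio test, computed in the upper tail to avoid underflow
chi_sq <- diabetic$null.deviance - diabetic$deviance
pchisq(chi_sq, df = 3, lower.tail = FALSE)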
Conclusion
The model is statistically significant (the likelihood-ratio test p-value is effectively zero), although it explains only about 26% of the deviance in the outcome (pseudo-R² ≈ 0.26).
Question 2: Confusion Matrix and Important Metrics
Predict probabilities using the fitted model.
Create predicted classes with a 0.5 threshold (1 if probability > 0.5, else 0).
Build a confusion matrix (Predicted vs. Actual Outcome).
Calculate and report the metrics:
Accuracy: (TP + TN) / Total
Sensitivity (Recall): TP / (TP + FN)
Specificity: TN / (TN + FP)
Precision: TP / (TP + FP)
Use the following starter code
# Keep only rows with no missing values in Outcome, Glucose, BMI, or Age
data_subset <- data[complete.cases(data[, c("Outcome", "Glucose", "BMI", "Age")]), ]
# Create a numeric version of the outcome (0 = no diabetes, 1 = diabetes). This is required for calculating confusion matrices.
data_subset$Outcome_num <- ifelse(data_subset$Outcome == "1", 1, 0)
# Predicted probabilities
predicted_diabetic <- data.frame(
  prob_diabetic = diabetic$fitted.values,
  trueoutcome = data_subset$Outcome,
  glucose = data_subset$Glucose,
  bmi = data_subset$BMI,
  age = data_subset$Age)
head(predicted_diabetic, 10)
## prob_diabetic trueoutcome glucose bmi age
## 1 0.66360006 1 148 33.6 50
## 2 0.06101402 0 85 26.6 31
## 3 0.61834186 1 183 23.3 32
## 4 0.06043396 0 89 28.1 21
## 5 0.65771328 1 137 43.1 33
## 6 0.14802668 0 116 25.6 30
## 7 0.06116212 1 78 31.0 26
## 8 0.28013239 0 115 35.3 29
## 9 0.90283168 1 197 30.5 53
## 11 0.29185049 0 110 37.6 30
# Predicted classes
# do: if probability>0.5, predicted outcome=1
predicted_diabetic$predictedoutcome <- as.factor(ifelse(predicted_diabetic$prob_diabetic > 0.5, 1, 0))
# rank them from lowest to highest probability (ascending order)
predicted_diabetic <- predicted_diabetic[
order(predicted_diabetic$prob_diabetic, decreasing=FALSE),]
predicted_diabetic$rank <- 1:nrow(predicted_diabetic)
head(predicted_diabetic, 10)
## prob_diabetic trueoutcome glucose bmi age predictedoutcome rank
## 681 0.01422697 0 56 24.2 22 0 1
## 63 0.01490161 0 44 25.0 36 0 2
## 618 0.01550448 0 68 20.1 23 0 3
## 98 0.01718930 0 71 20.4 22 0 4
## 91 0.02040067 0 80 19.1 21 0 5
## 590 0.02132939 0 73 21.1 25 0 6
## 462 0.02176006 0 71 21.8 26 0 7
## 56 0.02252439 0 73 23.0 21 0 8
## 419 0.02475853 0 83 18.2 27 0 9
## 521 0.02523876 0 68 25.0 25 0 10
# following class activity, show the curve (not quite a sigmoid)
ggplot(data=predicted_diabetic, aes(x=rank, y=prob_diabetic)) +
geom_point(aes(color=trueoutcome), alpha=1, shape=4, stroke=2) +
# geom_point(aes(color=predictedoutcome), alpha=1, shape=1, stroke=2)+ #attempt to show both predicted and true
xlab("Index") +
ylab("Predicted probability of diabetes")
The plot shows that some nondiabetic patients (the crosses for trueoutcome = 0) receive predicted probabilities above 0.5 and would therefore be classified as diabetic. The model is not perfect.
Confusion matrix
confusion <- table(
Predicted = predicted_diabetic$predictedoutcome,
Actual = predicted_diabetic$trueoutcome
)
confusion
## Actual
## Predicted 0 1
## 0 429 114
## 1 59 150
429 people were actually nondiabetic, and the model correctly predicted them as nondiabetic. This is a True Negative.
114 people were actually diabetic, but the model mistakenly predicted them as nondiabetic. This is a False Negative.
59 people were actually nondiabetic, but the model mistakenly predicted them as diabetic. This is a False Positive.
150 people were actually diabetic, and the model correctly predicted them as diabetic. This is a True Positive.
# Values:
TN <- 429
FN <- 114
FP <- 59
TP <- 150
# Metrics:
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN) # also called recall or true positive rate
specificity <- TN / (TN + FP) # true negative rate
precision <- TP / (TP + FP) # positive predictive value
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
# Print results
cat("Accuracy:", round(accuracy, 3),
"\nSensitivity:", round(sensitivity, 3),
"\nSpecificity:", round(specificity, 3),
"\nPrecision:", round(precision, 3),
"\nF1 Score:", round(f1_score, 3))
## Accuracy: 0.77
## Sensitivity: 0.568
## Specificity: 0.879
## Precision: 0.718
## F1 Score: 0.634
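The four counts above were typed in by hand from the table; as an alternative sketch, the same metrics can be pulled directly from the confusion table built earlier (rows = Predicted, columns = Actual, with "1" as the positive class), which avoids transcription errors. The helper function below is illustrative, not part of the assignment:
# Illustrative helper: compute the metrics from a Predicted-vs-Actual table
classification_metrics <- function(cm, positive = "1", negative = "0") {
  TP <- cm[positive, positive]
  TN <- cm[negative, negative]
  FP <- cm[positive, negative]
  FN <- cm[negative, positive]
  c(accuracy    = (TP + TN) / sum(cm),
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP),
    precision   = TP / (TP + FP))
}
round(classification_metrics(confusion), 3)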
Interpret: How well does the model perform? Is it better at detecting diabetes (sensitivity) or non-diabetes (specificity)? Why might this matter for medical diagnosis?
The model has modest accuracy (77%), poor sensitivity (57%), and a mediocre F1 score (0.63).
It is much better at identifying nondiabetic individuals (88% specificity) than at detecting diabetics (57% sensitivity).
This asymmetry matters for medical diagnosis: a false negative means a diabetic patient goes undetected and untreated, so low sensitivity is the more costly weakness here. Overall, these are mixed results for a binary classifier on this diabetes data.
Question 3: ROC Curve, AUC, and Interpretation
Plot the ROC curve, use the “data_subset” from Q2.
Calculate AUC.
# Enter your code here
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# ROC curve & AUC on full data
roc_obj <- roc(response = predicted_diabetic$trueoutcome,
predictor = predicted_diabetic$prob_diabetic,
levels = c(0, 1),
direction = "<") # smaller prob = nondiabetic
# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.828
# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
What does AUC indicate (0.5 = random, 1.0 = perfect)?
The AUC of 0.828 indicates the model discriminates well between diabetic and nondiabetic patients: 0.5 would correspond to random guessing and 1.0 to perfect discrimination, so 0.828 is well above chance but short of perfect.
The ROC curve lies above the diagonal, confirming the model performs better than chance. Equivalently, there is about an 83% probability that a randomly chosen diabetic patient receives a higher predicted probability than a randomly chosen nondiabetic patient.
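This ranking interpretation of AUC can be verified directly from the predicted probabilities; a minimal sketch that compares every diabetic/nondiabetic pair (ties count as half):
# Empirical check of AUC as a ranking probability (concordance)
p_case    <- predicted_diabetic$prob_diabetic[predicted_diabetic$trueoutcome == "1"]
p_control <- predicted_diabetic$prob_diabetic[predicted_diabetic$trueoutcome == "0"]
concordant <- outer(p_case, p_control, ">")
ties       <- outer(p_case, p_control, "==")
mean(concordant + 0.5 * ties)  # should agree with the pROC AUC of ~0.828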
For diabetes diagnosis, prioritize sensitivity (catching cases) or specificity (avoiding false positives)? Suggest a threshold and explain.
For diagnosing diabetes, I would prioritize sensitivity (catching cases), accepting lower specificity. For example, I would lower the threshold from 0.5 to 0.4.
let’s try it:
# Predicted classes with lower threshold:
# do: if probability>0.4, predicted outcome=1
predicted_diabetic$predictedoutcome2 <- as.factor(ifelse(predicted_diabetic$prob_diabetic > 0.4, 1, 0))
confusion <- table(
Predicted = predicted_diabetic$predictedoutcome2,
Actual = predicted_diabetic$trueoutcome
)
confusion
## Actual
## Predicted 0 1
## 0 397 94
## 1 91 170
# Values
TN <- 397
FN <- 94
FP <- 91
TP <- 170
# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN) # also called recall or true positive rate
specificity <- TN / (TN + FP) # true negative rate
precision <- TP / (TP + FP) # positive predictive value
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
# Print results
cat("Accuracy:", round(accuracy, 3),
"\nSensitivity:", round(sensitivity, 3),
"\nSpecificity:", round(specificity, 3),
"\nPrecision:", round(precision, 3),
"\nF1 Score:", round(f1_score, 3))
## Accuracy: 0.754
## Sensitivity: 0.644
## Specificity: 0.814
## Precision: 0.651
## F1 Score: 0.648
These results give better sensitivity than the previous ones at threshold = 0.5 (Accuracy: 0.77, Sensitivity: 0.568, Specificity: 0.879, Precision: 0.718, F1 Score: 0.634).
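The 0.4 threshold above was chosen by hand; as an optional sketch, pROC can suggest a threshold that maximizes sensitivity + specificity - 1 (Youden’s J statistic) from the roc_obj fitted earlier:
# Data-driven threshold suggestion (Youden's J) from the existing ROC object
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))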
Conclusion
To predict diabetes with higher sensitivity, it is better to lower the threshold below 0.5. For example, sensitivity increases modestly to 64% with a threshold of 0.4, compared with 57% at a threshold of 0.5. Of course, specificity suffers, dropping to 81% from 88%, but 81% is still an acceptable rate of correctly identifying nondiabetic individuals.
Published at https://rpubs.com/rmiranda/1367320