This short example demonstrates logistic regression
with glm() and adds practical diagnostics:
- Linearity of the logit (Box–Tidwell) - Multicollinearity (VIF) -
Classification quality (ROC/AUC) - Calibration (Hosmer–Lemeshow) -
Influential observations (Cook’s D, leverage)
We’ll predict whether a student is admitted (admit = 1)
based on GRE and GPA.
# Install/load helper packages as needed
pkgs <- c("ggplot2", "car", "pROC", "ResourceSelection", "dplyr")
to_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) install.packages(to_install, quiet = TRUE)
lapply(pkgs, library, character.only = TRUE)
## [[1]]
## [1] "ggplot2" "stats" "graphics" "grDevices" "utils" "datasets"
## [7] "methods" "base"
##
## [[2]]
## [1] "car" "carData" "ggplot2" "stats" "graphics" "grDevices"
## [7] "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "pROC" "car" "carData" "ggplot2" "stats" "graphics"
## [7] "grDevices" "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "ResourceSelection" "pROC" "car"
## [4] "carData" "ggplot2" "stats"
## [7] "graphics" "grDevices" "utils"
## [10] "datasets" "methods" "base"
##
## [[5]]
## [1] "dplyr" "ResourceSelection" "pROC"
## [4] "car" "carData" "ggplot2"
## [7] "stats" "graphics" "grDevices"
## [10] "utils" "datasets" "methods"
## [13] "base"
# Create a small sample dataset (toy example)
data <- data.frame(
admit = c(1,0,1,0,1,0,0,1,0,1),
gre = c(800, 640, 700, 580, 720, 600, 500, 750, 620, 710),
gpa = c(4.0, 3.5, 3.8, 3.2, 3.9, 3.3, 2.9, 3.7, 3.4, 3.9)
)
# Inspect
dplyr::glimpse(data)
## Rows: 10
## Columns: 3
## $ admit <dbl> 1, 0, 1, 0, 1, 0, 0, 1, 0, 1
## $ gre <dbl> 800, 640, 700, 580, 720, 600, 500, 750, 620, 710
## $ gpa <dbl> 4.0, 3.5, 3.8, 3.2, 3.9, 3.3, 2.9, 3.7, 3.4, 3.9
summary(data)
## admit gre gpa
## Min. :0.0 Min. :500.0 Min. :2.900
## 1st Qu.:0.0 1st Qu.:605.0 1st Qu.:3.325
## Median :0.5 Median :670.0 Median :3.600
## Mean :0.5 Mean :662.0 Mean :3.560
## 3rd Qu.:1.0 3rd Qu.:717.5 3rd Qu.:3.875
## Max. :1.0 Max. :800.0 Max. :4.000
# Logistic regression (binomial link = logit)
model <- glm(admit ~ gre + gpa, data = data, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = admit ~ gre + gpa, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.593e+02 1.430e+06 0 1
## gre 2.486e-01 3.030e+03 0 1
## gpa 1.077e+02 7.607e+05 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.3863e+01 on 9 degrees of freedom
## Residual deviance: 2.4490e-10 on 7 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
Interpretation tips
Coefficients are on the log-odds scale.
exp(coef) yields odds ratios.
exp(cbind(OR = coef(model), confint.default(model)))
## OR 2.5 % 97.5 %
## (Intercept) 1.203862e-243 0 Inf
## gre 1.282197e+00 0 Inf
## gpa 5.710109e+46 0 Inf
data$pred_prob <- predict(model, type = "response")
data$pred_class <- ifelse(data$pred_prob >= 0.5, 1, 0)
data
## admit gre gpa pred_prob pred_class
## 1 1 800 4.0 1.000000e+00 1
## 2 0 640 3.5 6.605993e-11 0
## 3 1 700 3.8 1.000000e+00 1
## 4 0 580 3.2 2.220446e-16 0
## 5 1 720 3.9 1.000000e+00 1
## 6 0 600 3.3 2.220446e-16 0
## 7 0 500 2.9 2.220446e-16 0
## 8 1 750 3.7 1.000000e+00 1
## 9 0 620 3.4 2.220446e-16 0
## 10 1 710 3.9 1.000000e+00 1
cm <- table(Predicted = data$pred_class, Actual = data$admit)
cm
## Actual
## Predicted 0 1
## 0 5 0
## 1 0 5
acc <- mean(data$pred_class == data$admit)
paste("Accuracy:", round(acc, 2))
## [1] "Accuracy: 1"
ggplot(data, aes(x = gre, y = pred_prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE) +
labs(title = "Logistic Regression Curve",
x = "GRE Score",
y = "Predicted Probability of Admission")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Assumption: the logit of the outcome is linear in continuous predictors.
# Ensure positive values for log() (true here)
data$gre_log <- log(data$gre)
data$gpa_log <- log(data$gpa)
# Box–Tidwell-style terms
data$gre_bt <- data$gre * data$gre_log
data$gpa_bt <- data$gpa * data$gpa_log
bt_model <- glm(admit ~ gre + gpa + gre_bt + gpa_bt, data = data, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(bt_model)[["coefficients"]]
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7052.231379 41730184.43 1.689959e-04 0.9998652
## gre 53.692729 749015.30 7.168442e-05 0.9999428
## gpa -7737.001534 66174365.54 -1.169184e-04 0.9999067
## gre_bt -7.049733 98891.35 -7.128766e-05 0.9999431
## gpa_bt 3373.805180 28375344.67 1.188992e-04 0.9999051
Interpretation: If gre_bt or gpa_bt is significant, consider transformations (splines, polynomials) for that predictor.
With tiny samples, power is low; treat results as illustrative.
car::vif(model)
## gre gpa
## 3.979745 3.979745
Guideline: VIF < 5 (often < 10) is acceptable; higher values suggest multicollinearity.
roc_obj <- pROC::roc(response = data$admit, predictor = data$pred_prob, quiet = TRUE)
auc_val <- pROC::auc(roc_obj)
auc_val
## Area under the curve: 1
plot(roc_obj, print.auc = TRUE, main = "ROC Curve (Logistic Regression)")
Interpretation: AUC closer to 1.0 indicates better class separation; ~0.5 is no better than chance.
# Group into deciles if possible; with tiny n this is approximate
ResourceSelection::hoslem.test(x = data$admit, y = fitted(model), g = 5)
## Warning in ResourceSelection::hoslem.test(x = data$admit, y = fitted(model), :
## The data did not allow for the requested number of bins.
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: data$admit, fitted(model)
## X-squared = 8.9903e-12, df = 1, p-value = 1
Interpretation: Large p-value → no strong evidence of miscalibration.
This test is unstable with very small samples; use as a teaching example.
# Cook's Distance
cd <- cooks.distance(model)
# Leverage (hat values)
h <- hatvalues(model)
# Dfbetas (per-parameter influence)
dfb <- dfbetas(model)
infl <- data.frame(obs = 1:nrow(data),
cooksD = as.numeric(cd),
leverage = as.numeric(h))
infl[order(-infl$cooksD), ]
## obs cooksD leverage
## 3 3 5.485509e-02 9.999830e-01
## 8 8 5.181969e-02 9.999924e-01
## 2 2 3.892036e-02 9.999762e-01
## 7 7 1.349423e-21 1.823111e-05
## 1 1 7.301988e-22 9.865375e-06
## 4 4 4.936430e-22 6.669422e-06
## 6 6 3.109184e-22 4.200720e-06
## 10 10 2.815120e-22 3.803423e-06
## 5 5 2.369301e-22 3.201094e-06
## 9 9 1.768787e-22 2.389761e-06
par(mfrow = c(1,2))
plot(cd, type = "h", main = "Cook's Distance", ylab = "Cook's D")
abline(h = 4/length(cd), lty = 2) # heuristic cutoff
plot(h, type = "h", main = "Leverage (Hat Values)", ylab = "Leverage")
abline(h = 2*mean(h), lty = 2) # heuristic cutoff
par(mfrow = c(1,1))
Interpretation: Points above dashed lines merit review (data entry errors, unusual cases, missing predictors). Consider robust modeling, transformations, or verifying data quality.
Linearity of logit reasonable? (Box–Tidwell)
Multicollinearity acceptable? (VIF)
Discrimination strong enough? (ROC/AUC)
Calibrated predictions? (Hosmer–Lemeshow)
No overly influential points? (Cook’s D, leverage)