1 1. Introduction

This short example demonstrates logistic regression with glm() and adds practical diagnostics: - Linearity of the logit (Box–Tidwell) - Multicollinearity (VIF) - Classification quality (ROC/AUC) - Calibration (Hosmer–Lemeshow) - Influential observations (Cook’s D, leverage)

We’ll predict whether a student is admitted (admit = 1) based on GRE and GPA.

2 2. Load Packages & Data

# Install/load helper packages as needed
pkgs <- c("ggplot2", "car", "pROC", "ResourceSelection", "dplyr")
to_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) install.packages(to_install, quiet = TRUE)
lapply(pkgs, library, character.only = TRUE)

## [[1]]
## [1] "ggplot2"   "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
##  [1] "car"       "carData"   "ggplot2"   "stats"     "graphics"  "grDevices"
##  [7] "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "pROC"      "car"       "carData"   "ggplot2"   "stats"     "graphics" 
##  [7] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "ResourceSelection" "pROC"              "car"              
##  [4] "carData"           "ggplot2"           "stats"            
##  [7] "graphics"          "grDevices"         "utils"            
## [10] "datasets"          "methods"           "base"             
## 
## [[5]]
##  [1] "dplyr"             "ResourceSelection" "pROC"             
##  [4] "car"               "carData"           "ggplot2"          
##  [7] "stats"             "graphics"          "grDevices"        
## [10] "utils"             "datasets"          "methods"          
## [13] "base"

# Create a small sample dataset (toy example)
data <- data.frame(
  admit = c(1,0,1,0,1,0,0,1,0,1),
  gre   = c(800, 640, 700, 580, 720, 600, 500, 750, 620, 710),
  gpa   = c(4.0, 3.5, 3.8, 3.2, 3.9, 3.3, 2.9, 3.7, 3.4, 3.9)
)

# Inspect
dplyr::glimpse(data)

## Rows: 10
## Columns: 3
## $ admit <dbl> 1, 0, 1, 0, 1, 0, 0, 1, 0, 1
## $ gre   <dbl> 800, 640, 700, 580, 720, 600, 500, 750, 620, 710
## $ gpa   <dbl> 4.0, 3.5, 3.8, 3.2, 3.9, 3.3, 2.9, 3.7, 3.4, 3.9

summary(data)

##      admit          gre             gpa       
##  Min.   :0.0   Min.   :500.0   Min.   :2.900  
##  1st Qu.:0.0   1st Qu.:605.0   1st Qu.:3.325  
##  Median :0.5   Median :670.0   Median :3.600  
##  Mean   :0.5   Mean   :662.0   Mean   :3.560  
##  3rd Qu.:1.0   3rd Qu.:717.5   3rd Qu.:3.875  
##  Max.   :1.0   Max.   :800.0   Max.   :4.000

3 3. Fit Logistic Regression

# Logistic regression (binomial link = logit)
model <- glm(admit ~ gre + gpa, data = data, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model)

## 
## Call:
## glm(formula = admit ~ gre + gpa, family = binomial, data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.593e+02  1.430e+06       0        1
## gre          2.486e-01  3.030e+03       0        1
## gpa          1.077e+02  7.607e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.3863e+01  on 9  degrees of freedom
## Residual deviance: 2.4490e-10  on 7  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25

Interpretation tips

Coefficients are on the log-odds scale.

exp(coef) yields odds ratios.

exp(cbind(OR = coef(model), confint.default(model)))

##                        OR 2.5 % 97.5 %
## (Intercept) 1.203862e-243     0    Inf
## gre          1.282197e+00     0    Inf
## gpa          5.710109e+46     0    Inf

4 4. Predict Probabilities & Classify

data$pred_prob  <- predict(model, type = "response")
data$pred_class <- ifelse(data$pred_prob >= 0.5, 1, 0)
data

##    admit gre gpa    pred_prob pred_class
## 1      1 800 4.0 1.000000e+00          1
## 2      0 640 3.5 6.605993e-11          0
## 3      1 700 3.8 1.000000e+00          1
## 4      0 580 3.2 2.220446e-16          0
## 5      1 720 3.9 1.000000e+00          1
## 6      0 600 3.3 2.220446e-16          0
## 7      0 500 2.9 2.220446e-16          0
## 8      1 750 3.7 1.000000e+00          1
## 9      0 620 3.4 2.220446e-16          0
## 10     1 710 3.9 1.000000e+00          1

cm <- table(Predicted = data$pred_class, Actual = data$admit)
cm

##          Actual
## Predicted 0 1
##         0 5 0
##         1 0 5

acc <- mean(data$pred_class == data$admit)
paste("Accuracy:", round(acc, 2))

## [1] "Accuracy: 1"

5 5. Visualize Logistic Curve (GRE vs Prob)

ggplot(data, aes(x = gre, y = pred_prob)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE) +
  labs(title = "Logistic Regression Curve",
       x = "GRE Score",
       y = "Predicted Probability of Admission")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

6 6. Diagnostics

6.1 6.1 Linearity of the Logit (Box–Tidwell)

Assumption: the logit of the outcome is linear in continuous predictors.

# Ensure positive values for log() (true here)
data$gre_log <- log(data$gre)
data$gpa_log <- log(data$gpa)

# Box–Tidwell-style terms
data$gre_bt <- data$gre * data$gre_log
data$gpa_bt <- data$gpa * data$gpa_log

bt_model <- glm(admit ~ gre + gpa + gre_bt + gpa_bt, data = data, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(bt_model)[["coefficients"]]

##                 Estimate  Std. Error       z value  Pr(>|z|)
## (Intercept)  7052.231379 41730184.43  1.689959e-04 0.9998652
## gre            53.692729   749015.30  7.168442e-05 0.9999428
## gpa         -7737.001534 66174365.54 -1.169184e-04 0.9999067
## gre_bt         -7.049733    98891.35 -7.128766e-05 0.9999431
## gpa_bt       3373.805180 28375344.67  1.188992e-04 0.9999051

Interpretation: If gre_bt or gpa_bt is significant, consider transformations (splines, polynomials) for that predictor.

With tiny samples, power is low; treat results as illustrative.

6.2 6.2 Multicollinearity (VIF)

car::vif(model)

##      gre      gpa 
## 3.979745 3.979745

Guideline: VIF < 5 (often < 10) is acceptable; higher values suggest multicollinearity.

6.3 6.3 Discrimination: ROC Curve & AUC

roc_obj <- pROC::roc(response = data$admit, predictor = data$pred_prob, quiet = TRUE)
auc_val <- pROC::auc(roc_obj)
auc_val

## Area under the curve: 1

plot(roc_obj, print.auc = TRUE, main = "ROC Curve (Logistic Regression)")

Interpretation: AUC closer to 1.0 indicates better class separation; ~0.5 is no better than chance.

6.4 6.4 Calibration: Hosmer–Lemeshow Test

# Group into deciles if possible; with tiny n this is approximate
ResourceSelection::hoslem.test(x = data$admit, y = fitted(model), g = 5)

## Warning in ResourceSelection::hoslem.test(x = data$admit, y = fitted(model), :
## The data did not allow for the requested number of bins.

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  data$admit, fitted(model)
## X-squared = 8.9903e-12, df = 1, p-value = 1

Interpretation: Large p-value → no strong evidence of miscalibration.

This test is unstable with very small samples; use as a teaching example.

6.5 6.5 Influence & Leverage

# Cook's Distance
cd <- cooks.distance(model)

# Leverage (hat values)
h  <- hatvalues(model)

# Dfbetas (per-parameter influence)
dfb <- dfbetas(model)

infl <- data.frame(obs = 1:nrow(data),
                   cooksD = as.numeric(cd),
                   leverage = as.numeric(h))
infl[order(-infl$cooksD), ]

##    obs       cooksD     leverage
## 3    3 5.485509e-02 9.999830e-01
## 8    8 5.181969e-02 9.999924e-01
## 2    2 3.892036e-02 9.999762e-01
## 7    7 1.349423e-21 1.823111e-05
## 1    1 7.301988e-22 9.865375e-06
## 4    4 4.936430e-22 6.669422e-06
## 6    6 3.109184e-22 4.200720e-06
## 10  10 2.815120e-22 3.803423e-06
## 5    5 2.369301e-22 3.201094e-06
## 9    9 1.768787e-22 2.389761e-06

par(mfrow = c(1,2))
plot(cd, type = "h", main = "Cook's Distance", ylab = "Cook's D")
abline(h = 4/length(cd), lty = 2)  # heuristic cutoff

plot(h, type = "h", main = "Leverage (Hat Values)", ylab = "Leverage")
abline(h = 2*mean(h), lty = 2)     # heuristic cutoff

par(mfrow = c(1,1))

Interpretation: Points above dashed lines merit review (data entry errors, unusual cases, missing predictors). Consider robust modeling, transformations, or verifying data quality.

7 7. Quick Checklist

Linearity of logit reasonable? (Box–Tidwell)

Multicollinearity acceptable? (VIF)

Discrimination strong enough? (ROC/AUC)

Calibrated predictions? (Hosmer–Lemeshow)

No overly influential points? (Cook’s D, leverage)

Simple Logistic Regression in R (with Diagnostics)

Irene Tsapara

October 11, 2025