- What logistic regression models: probability of a binary event.
- Fit a model using
height
on the dslabs::heights dataset. - Diagnostics: ROC/AUC and calibration.
- Interactive Plotly logistic curve.
height
on the dslabs::heights dataset.We use dslabs::heights
. Outcome sex
is Male
/Female
; we recode to 1/0 with 1 = Male. Predictor is height
(inches).
data(heights) dat = heights %>% transmute(sex = ifelse(sex == "Male", 1L, 0L), height = height, label = paste0("id=", row_number())) %>% drop_na() summary(dat)
## sex height label ## Min. :0.0000 Min. :50.00 Length:1050 ## 1st Qu.:1.0000 1st Qu.:66.00 Class :character ## Median :1.0000 Median :68.50 Mode :character ## Mean :0.7733 Mean :68.32 ## 3rd Qu.:1.0000 3rd Qu.:71.00 ## Max. :1.0000 Max. :82.68
Predict a yes/no outcome (from height in this case)
Model: Bernoulli GLM with logit link:
\[\Pr(Y_i=1\mid x_i) = \operatorname{logit}^{-1}\!\left(\beta_0 + \beta_1\,\text{height}_i\right).\]
The logistic (S-shaped) curve maps any number to a valid probability in \([0,1]\).
\(\beta_1\) is the per-inch effect: a positive value means higher predicted probability, negative means lower.
Odds = \(p/(1-p)\). Each +1 inch multiplies the odds by \(\exp(\beta_1)\) (the odds ratio). Probability changes are largest near 50% and smaller near 0% & 100%.
Log-likelihood for independent observations:
\[\ell(\beta) = \sum_{i=1}^n\big[ y_i\log p_i + (1-y_i)\log(1-p_i) \big],\qquad p_i = \operatorname{logit}^{-1}(\beta_0 + \beta_1 x_i).\]
Maximum likelihood esitmation via iteratively reweighted least squares
mod = glm(sex ~ height, family = binomial, data = dat) coefs = broom::tidy(mod, conf.int = TRUE, conf.level = 0.95) %>% mutate(odds_ratio = exp(estimate), or_low = exp(conf.low), or_high = exp(conf.high)) coefs
## # A tibble: 2 × 10 ## term estimate std.error statistic p.value conf.low conf.high odds_ratio ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercep… -21.1 1.77 -11.9 8.77e-33 -24.7 -17.8 6.54e-10 ## 2 height 0.333 0.0266 12.5 6.97e-36 0.282 0.386 1.39e+ 0 ## # ℹ 2 more variables: or_low <dbl>, or_high <dbl>
ggplot(dat, aes(x = height, y = sex)) + geom_point(alpha = 0.25, position = position_jitter(height = 0.03)) + geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE) + labs(title = "Probability(Male) vs Height", y = "Male (0/1)", x = "Height (inches)")
Threshold choice for classification trades off sensitivity & specificity, ROC helps visualize it.
# Predicted probabilities dat$prob = predict(mod, type = "response") # ROC with pROC roc_obj = pROC::roc(response = dat$sex, predictor = dat$prob, quiet = TRUE) auc_val = pROC::auc(roc_obj) roc_df = tibble(fpr = 1 - roc_obj$specificities, tpr = roc_obj$sensitivities) ggplot(roc_df, aes(x = fpr, y = tpr)) + geom_line(size = 1) + geom_abline(linetype = "dashed") + coord_equal() + labs(title = paste0("ROC Curve (AUC = ", round(as.numeric(auc_val), 3), ")"), x = "False Positive Rate", y = "True Positive Rate")
hg = seq(min(dat$height), max(dat$height), length.out = 200) grid = tibble(height = hg, prob = predict(mod, newdata = tibble(height = hg), type = "response")) plot_ly() %>% add_markers(data = dat, x = ~height, y = ~sex, type = 'scatter', mode = 'markers', text = ~label, hovertemplate = 'height=%{x}<br>sex=%{y}<br>%{text}', showlegend = FALSE) %>% add_lines(data = grid, x = ~height, y = ~prob, name = 'Pr(Male|height)') %>% layout(xaxis = list(title = 'Height (inches)'), yaxis = list(title = 'Probability Male'))
height
.