Logistic Regression: Predicting Sex from Height

Content

What logistic regression models: probability of a binary event.
Fit a model using height on the dslabs::heights dataset.
Diagnostics: ROC/AUC and calibration.
Interactive Plotly logistic curve.

Data & Variables

We use dslabs::heights. Outcome sex is Male/Female; we recode to 1/0 with 1 = Male. Predictor is height (inches).

data(heights)
dat = heights %>%
  transmute(sex = ifelse(sex == "Male", 1L, 0L),
            height = height,
            label = paste0("id=", row_number())) %>%
  drop_na()
summary(dat)

##       sex             height         label          
##  Min.   :0.0000   Min.   :50.00   Length:1050       
##  1st Qu.:1.0000   1st Qu.:66.00   Class :character  
##  Median :1.0000   Median :68.50   Mode  :character  
##  Mean   :0.7733   Mean   :68.32                     
##  3rd Qu.:1.0000   3rd Qu.:71.00                     
##  Max.   :1.0000   Max.   :82.68

Model

Predict a yes/no outcome (from height in this case)
Model: Bernoulli GLM with logit link:

\[\Pr(Y_i=1\mid x_i) = \operatorname{logit}^{-1}\!\left(\beta_0 + \beta_1\,\text{height}_i\right).\]

The logistic (S-shaped) curve maps any number to a valid probability in \([0,1]\).
\(\beta_1\) is the per-inch effect: a positive value means higher predicted probability, negative means lower.
Odds = \(p/(1-p)\). Each +1 inch multiplies the odds by \(\exp(\beta_1)\) (the odds ratio). Probability changes are largest near 50% and smaller near 0% & 100%.

Likelihood & Estimation

Log-likelihood for independent observations:

\[\ell(\beta) = \sum_{i=1}^n\big[ y_i\log p_i + (1-y_i)\log(1-p_i) \big],\qquad p_i = \operatorname{logit}^{-1}(\beta_0 + \beta_1 x_i).\]

Maximum likelihood esitmation via iteratively reweighted least squares

Fit the Model

mod = glm(sex ~ height, family = binomial, data = dat)
coefs = broom::tidy(mod, conf.int = TRUE, conf.level = 0.95) %>%
  mutate(odds_ratio = exp(estimate),
         or_low = exp(conf.low),
         or_high = exp(conf.high))
coefs

## # A tibble: 2 × 10
##   term       estimate std.error statistic  p.value conf.low conf.high odds_ratio
##   <chr>         <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>      <dbl>
## 1 (Intercep…  -21.1      1.77       -11.9 8.77e-33  -24.7     -17.8     6.54e-10
## 2 height        0.333    0.0266      12.5 6.97e-36    0.282     0.386   1.39e+ 0
## # ℹ 2 more variables: or_low <dbl>, or_high <dbl>

ggplot #1 Sex vs Height

ggplot(dat, aes(x = height, y = sex)) +
  geom_point(alpha = 0.25, position = position_jitter(height = 0.03)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE) +
  labs(title = "Probability(Male) vs Height",
       y = "Male (0/1)", x = "Height (inches)")

ggplot #2 ROC Curve with Area Under the Curve

Threshold choice for classification trades off sensitivity & specificity, ROC helps visualize it.

# Predicted probabilities
dat$prob = predict(mod, type = "response")
# ROC with pROC
roc_obj = pROC::roc(response = dat$sex, predictor = dat$prob, quiet = TRUE)
auc_val = pROC::auc(roc_obj)
roc_df = tibble(fpr = 1 - roc_obj$specificities, tpr = roc_obj$sensitivities)

ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_line(size = 1) +
  geom_abline(linetype = "dashed") +
  coord_equal() +
  labs(title = paste0("ROC Curve (AUC = ", round(as.numeric(auc_val), 3), ")"),
       x = "False Positive Rate", y = "True Positive Rate")

Plotly Interactive Logistic Curve (Height)

hg = seq(min(dat$height), max(dat$height), length.out = 200)
grid = tibble(height = hg,
               prob = predict(mod, newdata = tibble(height = hg), type = "response"))

plot_ly() %>%
  add_markers(data = dat, x = ~height, y = ~sex, type = 'scatter', mode = 'markers',
              text = ~label, hovertemplate = 'height=%{x}<br>sex=%{y}<br>%{text}', showlegend = FALSE) %>%
  add_lines(data = grid, x = ~height, y = ~prob, name = 'Pr(Male|height)') %>%
  layout(xaxis = list(title = 'Height (inches)'), yaxis = list(title = 'Probability Male'))

Assumptions

Linearity of the log-odds in height.
Independent observations.