Notes 6

Modeling Proportions with Binomial Distribution | Part 1: Fitting Logit Link Model

Setup

Loading packages

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(GLMsData)
library(broom)
## Warning: package 'broom' was built under R version 4.3.3

Organization of course

Units in this course:

  • Intro to GLMs (Notes 1-5)

  • Modeling proportions with the Binomial distribution (Notes 6-10)

  • Modeling counts with the Poisson distribution

  • Modeling positive continuous variables with the Gamma distribution

Chapter 9 in the book focuses on modeling proportions with the Binomial distribution.

  • The proportions are obtained as the number of “positive” cases (successes) out of a total number of independent cases.

Example 9.1 - turbines

data(turbines)
turbines
##    Hours Turbines Fissures
## 1    400       39        0
## 2   1000       53        4
## 3   1400       33        2
## 4   1800       73        7
## 5   2200       30        5
## 6   2600       39        9
## 7   3000       42        9
## 8   3400       13        6
## 9   3800       34       22
## 10  4200       40       21
## 11  4600       36       21
  • Response: Proportion of turbines with fissures

  • Explanatory variable: the time in hours the turbines are running

Plot of proportions (Fig 9.1)

turbines <- turbines |> 
  mutate(prop = Fissures / Turbines)
  
ggplot(
    data = turbines,
    mapping = aes(x = Hours, y = prop, size = Turbines)
  ) + 
  geom_point() + 
  labs(
    x = "Hours of Use", 
    y = "Proportion of turbines with fissures", 
    size = "Number of\nTurbines"
  )

Model:

  1. \(y \sim Binomial(\mu, m)\), where

    • \(y\) is the proportion of turbines with fissures

    • \(\mu\) is the proportion parameter

    • \(m\) is the number of turbines

  2. \(\log(\frac{\mu}{1-\mu})=logit(\mu)=\beta_0 + \beta_1 Hours\)

1 is the random component,
2 is the systematic component with logit link function

Fitting the model with R

“Fitting the model” means: calculating the maximum likelihood estimates of the \(\beta\) parameters.

Template for fitting generalized linear models in R:

<fit-object> <- glm(<prop-var> ~ <x>, family=binomial(link=__), weights=<m>, data=___)

In this case:

fit <- glm(prop ~ Hours, family=binomial(link="logit"), weights=Turbines, data=turbines)

Extracting model fit info

Functions for extracting information from fit (Section 6.9).

I prefer using functions from the {broom} package, which all produce datasets:

  • tidy(fit): one row for each \(\beta\) parameter - gives the beta estimates, etc.

  • augment(fit): one row for each observation - gives the fitted responses, etc.

  • glance(fit): one row total, gives overall summaries like the loglikelihood value, etc.

tidy()

tidy(fit)
## # A tibble: 2 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -3.92      0.378       -10.4  3.03e-25
## 2 Hours        0.000999  0.000114      8.75 2.07e-18

We will only focus on the estimate column for now, which gives \(\hat{\beta}_0\) = -3.92 and \(\hat{\beta}_1\) = .000999

Add confidence intervals for betas:

tidy(fit, conf.int=TRUE, conf.level=.95)
## # A tibble: 2 × 7
##   term         estimate std.error statistic  p.value  conf.low conf.high
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
## 1 (Intercept) -3.92      0.378       -10.4  3.03e-25 -4.70      -3.22   
## 2 Hours        0.000999  0.000114      8.75 2.07e-18  0.000783   0.00123

augment()

augment(fit)
## # A tibble: 11 × 9
##      prop Hours `(weights)` .fitted .resid   .hat .sigma .cooksd .std.resid
##     <dbl> <int>       <int>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
##  1 0        400          39  -3.52  -1.51  0.122   0.984 0.0910      -1.61 
##  2 0.0755  1000          53  -2.92   0.760 0.191   1.10  0.0964       0.845
##  3 0.0606  1400          33  -2.52  -0.306 0.124   1.13  0.00712     -0.327
##  4 0.0959  1800          73  -2.12  -0.304 0.271   1.13  0.0228      -0.356
##  5 0.167   2200          30  -1.73   0.233 0.105   1.13  0.00367      0.247
##  6 0.231   2600          39  -1.33   0.316 0.128   1.13  0.00865      0.339
##  7 0.214   3000          42  -0.926 -1.03  0.141   1.07  0.0955      -1.11 
##  8 0.462   3400          13  -0.526  0.664 0.0529  1.11  0.0133       0.682
##  9 0.647   3800          34  -0.126  2.09  0.190   0.784 0.632        2.33 
## 10 0.525   4200          40   0.273 -0.546 0.311   1.11  0.0982      -0.657
## 11 0.583   4600          36   0.673 -0.984 0.363   1.05  0.447       -1.23

The variable .fitted gives the predicted values (not to be confused with .hat), but these don’t look like predicted proportions. Why? These values mostly aren’t between 0 and 1.

What is the default value of .fitted giving us? ____

tidy(fit)$estimate[1] + tidy(fit)$estimate[2] * turbines$Hours
##  [1] -3.5239017 -2.9243593 -2.5246644 -2.1249695 -1.7252746 -1.3255798
##  [7] -0.9258849 -0.5261900 -0.1264951  0.2731998  0.6728947

To get the predicted proportions from augment(), we need to add the type.predict="response" argument:

augment(fit, type.predict="response")
## # A tibble: 11 × 9
##      prop Hours `(weights)` .fitted .resid   .hat .sigma .cooksd .std.resid
##     <dbl> <int>       <int>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
##  1 0        400          39  0.0286 -1.51  0.122   0.984 0.0910      -1.61 
##  2 0.0755  1000          53  0.0510  0.760 0.191   1.10  0.0964       0.845
##  3 0.0606  1400          33  0.0741 -0.306 0.124   1.13  0.00712     -0.327
##  4 0.0959  1800          73  0.107  -0.304 0.271   1.13  0.0228      -0.356
##  5 0.167   2200          30  0.151   0.233 0.105   1.13  0.00367      0.247
##  6 0.231   2600          39  0.210   0.316 0.128   1.13  0.00865      0.339
##  7 0.214   3000          42  0.284  -1.03  0.141   1.07  0.0955      -1.11 
##  8 0.462   3400          13  0.371   0.664 0.0529  1.11  0.0133       0.682
##  9 0.647   3800          34  0.468   2.09  0.190   0.784 0.632        2.33 
## 10 0.525   4200          40  0.568  -0.546 0.311   1.11  0.0982      -0.657
## 11 0.583   4600          36  0.662  -0.984 0.363   1.05  0.447       -1.23

Where do the predicted probabilities \(\hat{\mu}\) come from? We solve the systematic component

\[\log(\frac{\mu}{1-\mu})=\beta_0 + \beta_1 Hours\]

for \(\mu\), giving:

\[\hat{\mu} = \frac{exp(\hat{\beta}_0 + \hat{\beta}_1 Hours)}{1 + exp(\hat{\beta}_0 + \hat{\beta}_1 Hours)}\]

How can we use this formula in R?

exp(tidy(fit)$estimate[1] + tidy(fit)$estimate[2] * turbines$Hours) / (1 + exp(tidy(fit)$estimate[1] + tidy(fit)$estimate[2] * turbines$Hours))
##  [1] 0.02863975 0.05096245 0.07414710 0.10669350 0.15119300 0.20989147
##  [7] 0.28376033 0.37140595 0.46841833 0.56787829 0.66215103

Plot of data with model fit line

Scatterplot with x = Hours, y = proportion with line predicted probabilities

ggplot(
  data = augment(fit, type.predict="response"), 
  mapping = aes(x = Hours)
) + 
  geom_point(
    mapping = aes(y = prop)
  ) + 
  geom_line(mapping = aes(y = .fitted))