Notes 6
Modeling Proportions with the Binomial Distribution | Part 1: Fitting the Logit Link Model
Setup
Loading packages
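The chunk that loads the packages is not shown in the rendered notes; presumably it contains something like the following (the {dplyr} and {GLMsData} lines are assumptions, needed below for mutate() and for the turbines data):
library(ggplot2)   # plots
library(broom)     # tidy(), augment(), glance()
library(dplyr)     # assumed: mutate() used below
library(GLMsData)  # assumed: source of the turbines data
data(turbines)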
Organization of course
Units in this course:

- Intro to GLMs (Notes 1-5)
- Modeling proportions with the Binomial distribution (Notes 6-10)
- Modeling counts with the Poisson distribution
- Modeling positive continuous variables with the Gamma distribution
Chapter 9 in the book focuses on modeling proportions with the Binomial distribution.
- The proportions are obtained as the number of “positive” cases (successes) out of a total number of independent cases.
Example 9.1 - turbines
## Hours Turbines Fissures
## 1 400 39 0
## 2 1000 53 4
## 3 1400 33 2
## 4 1800 73 7
## 5 2200 30 5
## 6 2600 39 9
## 7 3000 42 9
## 8 3400 13 6
## 9 3800 34 22
## 10 4200 40 21
## 11 4600 36 21
Response: Proportion of turbines with fissures
Explanatory variable: the time in hours the turbines are running
Plot of proportions (Fig 9.1)
turbines <- turbines |>
mutate(prop = Fissures / Turbines)
ggplot(
data = turbines,
mapping = aes(x = Hours, y = prop, size = Turbines)
) +
geom_point() +
labs(
x = "Hours of Use",
y = "Proportion of turbines with fissures",
size = "Number of\nTurbines"
)
Model:

1. \(y \sim Binomial(\mu, m)\), where
   - \(y\) is the proportion of turbines with fissures
   - \(\mu\) is the proportion parameter
   - \(m\) is the number of turbines
2. \(\log(\frac{\mu}{1-\mu})=logit(\mu)=\beta_0 + \beta_1 Hours\)

Here, 1 is the random component and 2 is the systematic component with the logit link function.
Fitting the model with R
“Fitting the model” means: calculating the maximum likelihood estimates of the \(\beta\) parameters.
Template for fitting generalized linear models in R:
<fit-object> <- glm(<prop-var> ~ <x>, family=binomial(link=__), weights=<m>, data=___)
In this case:
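A sketch of the filled-in call for this example (the response is the proportion prop, the binomial sample sizes go in weights, and we use the logit link):
fit <- glm(prop ~ Hours,
  family = binomial(link = "logit"),
  weights = Turbines,
  data = turbines
)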
Extracting model fit info
Functions for extracting information from the model fit (Section 6.9). I prefer using functions from the {broom} package, which all produce datasets:

- tidy(fit): one row for each \(\beta\) parameter - gives the beta estimates, etc.
- augment(fit): one row for each observation - gives the fitted responses, etc.
- glance(fit): one row total - gives overall summaries like the loglikelihood value, etc.
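The glance() output is not shown in these notes; as a sketch (assuming the fit object created above), it would be called as:
glance(fit)   # one-row tibble with columns such as null.deviance, logLik, AIC, BIC, deviance, df.residual, nobs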
tidy()
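The table below comes from calling tidy() on the fit:
tidy(fit)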
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.92 0.378 -10.4 3.03e-25
## 2 Hours 0.000999 0.000114 8.75 2.07e-18
We will only focus on the estimate column for now, which gives \(\hat{\beta}_0\) = -3.92 and \(\hat{\beta}_1\) = 0.000999.
Add confidence intervals for betas:
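Presumably this is done by adding the conf.int argument (the default confidence level is 95%):
tidy(fit, conf.int = TRUE)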
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.92 0.378 -10.4 3.03e-25 -4.70 -3.22
## 2 Hours 0.000999 0.000114 8.75 2.07e-18 0.000783 0.00123
augment()
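The table below comes from calling augment() on the fit:
augment(fit)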
## # A tibble: 11 × 9
## prop Hours `(weights)` .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 400 39 -3.52 -1.51 0.122 0.984 0.0910 -1.61
## 2 0.0755 1000 53 -2.92 0.760 0.191 1.10 0.0964 0.845
## 3 0.0606 1400 33 -2.52 -0.306 0.124 1.13 0.00712 -0.327
## 4 0.0959 1800 73 -2.12 -0.304 0.271 1.13 0.0228 -0.356
## 5 0.167 2200 30 -1.73 0.233 0.105 1.13 0.00367 0.247
## 6 0.231 2600 39 -1.33 0.316 0.128 1.13 0.00865 0.339
## 7 0.214 3000 42 -0.926 -1.03 0.141 1.07 0.0955 -1.11
## 8 0.462 3400 13 -0.526 0.664 0.0529 1.11 0.0133 0.682
## 9 0.647 3800 34 -0.126 2.09 0.190 0.784 0.632 2.33
## 10 0.525 4200 40 0.273 -0.546 0.311 1.11 0.0982 -0.657
## 11 0.583 4600 36 0.673 -0.984 0.363 1.05 0.447 -1.23
The variable .fitted gives the predicted values (not to be confused with .hat), but these don't look like predicted proportions. Why? These values mostly aren't between 0 and 1.

What is the default value of .fitted giving us? ____
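The values printed below are the .fitted column on its own, presumably extracted with something like:
augment(fit)$.fitted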
## [1] -3.5239017 -2.9243593 -2.5246644 -2.1249695 -1.7252746 -1.3255798
## [7] -0.9258849 -0.5261900 -0.1264951 0.2731998 0.6728947
To get the predicted proportions from augment(), we need to add the type.predict = "response" argument:
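A sketch of the call that produces the table below:
augment(fit, type.predict = "response")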
## # A tibble: 11 × 9
## prop Hours `(weights)` .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 400 39 0.0286 -1.51 0.122 0.984 0.0910 -1.61
## 2 0.0755 1000 53 0.0510 0.760 0.191 1.10 0.0964 0.845
## 3 0.0606 1400 33 0.0741 -0.306 0.124 1.13 0.00712 -0.327
## 4 0.0959 1800 73 0.107 -0.304 0.271 1.13 0.0228 -0.356
## 5 0.167 2200 30 0.151 0.233 0.105 1.13 0.00367 0.247
## 6 0.231 2600 39 0.210 0.316 0.128 1.13 0.00865 0.339
## 7 0.214 3000 42 0.284 -1.03 0.141 1.07 0.0955 -1.11
## 8 0.462 3400 13 0.371 0.664 0.0529 1.11 0.0133 0.682
## 9 0.647 3800 34 0.468 2.09 0.190 0.784 0.632 2.33
## 10 0.525 4200 40 0.568 -0.546 0.311 1.11 0.0982 -0.657
## 11 0.583 4600 36 0.662 -0.984 0.363 1.05 0.447 -1.23
Where do the predicted probabilities \(\hat{\mu}\) come from? We solve the systematic component

\[\log(\frac{\mu}{1-\mu})=\beta_0 + \beta_1 Hours\]

for \(\mu\): exponentiating both sides gives \(\frac{\mu}{1-\mu} = \exp(\beta_0 + \beta_1 Hours)\), and solving for \(\mu\) (then plugging in the estimates) gives:

\[\hat{\mu} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 Hours)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 Hours)}\]
How can we use this formula in R?
beta <- tidy(fit)$estimate                 # beta[1] is the intercept, beta[2] the Hours slope
eta <- beta[1] + beta[2] * turbines$Hours  # linear predictor on the logit scale
exp(eta) / (1 + exp(eta))                  # inverse logit gives the predicted proportions
## [1] 0.02863975 0.05096245 0.07414710 0.10669350 0.15119300 0.20989147
## [7] 0.28376033 0.37140595 0.46841833 0.56787829 0.66215103
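Equivalently (a sketch, assuming the fit object from above), the same predicted proportions can be obtained directly from the fitted model; plogis() is R's built-in inverse-logit function:
predict(fit, type = "response")                        # predicted proportions at the observed Hours
plogis(coef(fit)[1] + coef(fit)[2] * turbines$Hours)   # same values via the inverse logit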