Notes 8
Modeling Proportions with Binomial Distribution | Part 3: Probit Link, Median Effective Dose
Setup
Loading packages
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'broom' was built under R version 4.3.3
Logit link, probit link (Section 9.3)
Recall the binomial, logit link model for the turbines data:
\(y \sim Binomial(\mu, m)\), where
\(y\) is the proportion of turbines with fissures
\(\mu\) is the proportion parameter
\(m\) is the number of turbines
\(\log(\frac{\mu}{1-\mu})=logit(\mu)=\beta_0 + \beta_1 Hours\)
In the systematic component #2, another common link function to use for the proportion parameter \(\mu\) is \(probit(\mu\)):
- \(probit(\mu) = \beta_0 + \beta_1 Hours\)
The probit function is the same as the qnorm()
for the
quantile function for the standard normal distribution. We provide the
proportion \(\mu\), and the
qnorm()
(or probit) function outputs the standard normal
value that has a probability of \(\mu\)
to the left.
Examples:
## [1] -0.8416212
## [1] 0.2533471
## [1] 1.281552
Pictures:
Why use link functions?
The idea behind link functions for the binomial model is that
- they take the proportion \(\mu\), which must take values between 0 and 1
- and transform it to the range of values for the linear predictor \(\beta_0 + \beta_1 x\), which can take values between negative infinity to positive infinity.
This is true for both the link functions we have discussed for the binomial model: logit and probit. Fig 9.2 (Section 9.3) illustrates both link functions:
tibble(
mu = seq(from = .001, to = .999, length=500),
logit = log(mu / (1 - mu)),
probit = qnorm(mu)
) |>
pivot_longer(cols = c(logit, probit), names_to = "Link function", values_to = "Link(mu)") |>
ggplot(
mapping = aes(x = mu, y = `Link(mu)`, color = `Link function`)
) +
geom_line() +
labs(x = "mu, proportion parameter")
Fitting turbines data with probit link
## Hours Turbines Fissures prop
## 1 400 39 0 0.00000000
## 2 1000 53 4 0.07547170
## 3 1400 33 2 0.06060606
## 4 1800 73 7 0.09589041
## 5 2200 30 5 0.16666667
## 6 2600 39 9 0.23076923
## 7 3000 42 9 0.21428571
## 8 3400 13 6 0.46153846
## 9 3800 34 22 0.64705882
## 10 4200 40 21 0.52500000
## 11 4600 36 21 0.58333333
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.28 0.197 -11.5 9.55e-31
## 2 Hours 0.000578 0.0000626 9.24 2.49e-20
Predicted probabilities
## # A tibble: 11 × 9
## prop Hours `(weights)` .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 400 39 0.0205 -1.27 0.144 0.996 0.0802 -1.37
## 2 0.0755 1000 53 0.0448 0.987 0.220 1.03 0.211 1.12
## 3 0.0606 1400 33 0.0713 -0.245 0.134 1.10 0.00512 -0.263
## 4 0.0959 1800 73 0.108 -0.351 0.272 1.10 0.0305 -0.411
## 5 0.167 2200 30 0.158 0.132 0.0982 1.11 0.00107 0.139
## 6 0.231 2600 39 0.220 0.161 0.116 1.11 0.00196 0.172
## 7 0.214 3000 42 0.294 -1.17 0.130 1.01 0.112 -1.26
## 8 0.462 3400 13 0.378 0.610 0.0503 1.09 0.0106 0.626
## 9 0.647 3800 34 0.469 2.09 0.181 0.748 0.587 2.31
## 10 0.525 4200 40 0.561 -0.456 0.297 1.09 0.0627 -0.544
## 11 0.583 4600 36 0.650 -0.824 0.356 1.05 0.298 -1.03
Compare fits under logit and probit links
First, fitting the logit model and collecting the predicted probabilities under both the logit and probit models:
fit_logit <- glm(prop ~ Hours, family=binomial(link="logit"), weights=Turbines, data=turbines)
fitted_logit <- augment(fit_logit, type.predict = "response") |>
rename(logit = .fitted) |>
select(prop, Hours, logit)
fitted_probit <- augment(fit_probit, type.predict = "response") |>
rename(probit = .fitted) |>
select(prop, Hours, probit)
Then merging these probabilities into a single dataset and plotting (along with the observed proportions):
inner_join(fitted_logit, fitted_probit, by = join_by(Hours, prop)) |>
pivot_longer(cols = c("logit", "probit"), names_to = "link_fcn", values_to = "fitted") |>
ggplot(
mapping = aes(x = Hours, y = prop, color = link_fcn)
) +
geom_point(color = "black") +
geom_line(aes(y = fitted))
We see that the predicted probabilities under the logit and probit link functions are very similar.
Section 9.6 - Median Effective Dose
(Basically, given a probability - solving for x.)
From text: “Binomial GLMs are commonly used to examine the relationship between the dose \(d\) of a drug or poison and the proportion \(y\) of insects (or plants, or animals) that survive. These models are called dose–response models. Associated with these experiments is the concept of the median effective dose, ED50: the dose of poison affecting 50% of the insects.
We use the systematic component of the model to find this dose \(d\):
\[g(\mu) = \beta_0 + \beta_1 d,\]
where \(g()\) is the link function (either logit or probit). After fitting the model, we know the maximum likelihood estimates of the beta parameters, \(\hat{\beta}_0\) and \(\hat{\beta}_1\). We can then use algebra to solve for \(d\), the dose that affects a proportion of \(\mu\) of the insects:
\[ d = \frac{g(\mu) - \hat{\beta}_0}{\hat{\beta}_1} \]
turbines
example
To calculate the run time for which 50% of turbines would be expected to experience fissures:
- Under the logit model: \(g(\mu) = \log\frac{\mu}{1-\mu}\)
mu <- .5
num <- log(mu/(1-mu)) - tidy(fit_logit)$estimate[1] # num: numerator of fraction
num / tidy(fit_logit)$estimate[2]
## [1] 3926.592
- Under the probit model: \(g(\mu) = qnorm(\mu)\)
mu <- .5
num <- qnorm(mu) - tidy(fit_probit)$estimate[1] # num: numerator of fraction
num / tidy(fit_probit)$estimate[2]
## [1] 3935.197
These numbers should seem reasonable based on our graph above.
We could of course calculate the run time for any percent of failure rate, say 20% (as long as it’s reasonable based on the data):
- Under the logit model:
# calculate
mu <- .2
num <- log(mu/(1-mu)) - tidy(fit_logit)$estimate[1] # num: numerator of fraction
num / tidy(fit_logit)$estimate[2]
## [1] 2539.239
- Under the probit model:
# calculate
mu <- .2
num <- qnorm(mu) - tidy(fit_probit)$estimate[1] # num: numerator of fraction
num / tidy(fit_probit)$estimate[2]
## [1] 2479.913