library(faraway)
library(ggplot2)
library(tidyverse)
Suppose that students answer questions on a test. One specific student has apitude \(T\). A particular question has difficulty \(d\). The student will only answer the question correct if \(T > d\). Now lets fix \(d\) and consider \(T\) as a random variable with density \(f\) and distribution of the function \(F\). Then the probability that the student gets the answer wrong is:
\[p = \mathbb{P}(T \leq d) = F(d)\] T is called a Latent Variable. Suppose that the distribution of T is logistic:
\[F(y) = \frac{e^{\frac{y-\mu}{\sigma}}}{1+e^{\frac{y-\mu}{\sigma}}}\] So \[logit(p) = \frac{- \mu}{\sigma} + \frac{d}{\sigma}\]
If we set \(\beta_0 = \frac{- \mu}{\sigma}\) and \(\beta_1 = \frac{d}{\sigma}\), we now have a logistic regression model.
Let’s do a example. We set \(d = 1\) and let \(T\) have mean \(-1\) and \(\sigma =1\)
x <- seq(-6, 4, 0.1)
y <- dlogis(x,location= -1)
plot(x,y, type= 'l', ylab = "Density", xlab = "t")
ii <- (x <= 1)
polygon(x = c(x[ii], 1, -6), y = c(y[ii], 0, 0), col = 'gray')
This is a example of a logistically distributed latent variable.
This is similar to the normal distribution
The shaded area is the probably of getting the answer wrong
We see the idea of latent variables often in bioassays. In bioassays we treat animals, plants, and people with some concentration of a treatment and observe the outcome.
Here is a example: We are interested in the concentration of insecticide required to exterminate some type of pest. Insectes will having varying levels of tolerances (our latent variable) for the toxin and will survive if their tolerance is greater than the dose. In this context, the term tolerance distribution is used for T.
Tolerance Distribution: is the distribution of some sort of “tolerance threshold”
mosquitoes are sprayed with insecticide at various doses, and the response is whether the mosquito dies \((Y=1\) or \(0)\). Each mosquito has a tolerance \(X\) and the probability a mosquito will die is:
\(p_i(x) = F_x(x) = \mathbb{P}(X \leq x) = \int _{-\infty }^{\pi }\: f(t) dt\)$
Big Idea: using any CDF function, we can link “probability of success” to some threshold “\(x\)”
This is precisely what we do in logistic regression, we try to link probability of success to some threshold.
So we can simple choose a link function by inverting a CDF
So far, we have used the Logit link to connect the probability and linear predictors. However, there exist other links functions that are reasonable. In our case use thus far we need:
link function that can bound probability between \([0,1]\)
monotone link function
Note: Obviously, probability of success can go up or down as the linear predictors increase. In this case we can add quadratic or higher order terms rather than adjust the link functions.
Some popular link functions:
(1). Probit
(2). Complementary
(3). Cauchit
To understand the difference in link functions, we must first understand the difference in corresponding CDF functions
data(bliss)
bliss
## dead alive conc
## 1 2 28 0
## 2 8 22 1
## 3 15 15 2
## 4 23 7 3
## 5 27 3 4
logit_model <- glm(cbind(dead, alive)~conc, family =binomial, data=bliss)
probit_model <- glm(cbind(dead, alive)~conc, family =binomial(link= probit), data=bliss)
log_log_model <- glm(cbind(dead, alive)~conc, family =binomial(link= cloglog), data=bliss)
cauchit_model <- glm(cbind(dead, alive)~conc, family =binomial(link= cauchit), data=bliss)
Let’s look at the fitted values:
fitted(logit_model)
## 1 2 3 4 5
## 0.08917177 0.23832314 0.50000000 0.76167686 0.91082823
These are constructed using linear predictor \(\eta\):
coef(logit_model)[1] + coef(logit_model)[2]*bliss$conc
## [1] -2.323790e+00 -1.161895e+00 -4.440892e-16 1.161895e+00 2.323790e+00
Or it is equivalent to:
ilogit(logit_model$lin)
## 1 2 3 4 5
## 0.08917177 0.23832314 0.50000000 0.76167686 0.91082823
Let’s do some comparison between the models:
predval <- sapply(list(logit_model, probit_model, log_log_model, cauchit_model), fitted)
dimnames(predval) <- list(0:4, c('logit','probit','cloglog','cauchit'))
round(predval,3)
## logit probit cloglog cauchit
## 0 0.089 0.084 0.127 0.119
## 1 0.238 0.245 0.250 0.213
## 2 0.500 0.498 0.455 0.506
## 3 0.762 0.752 0.722 0.791
## 4 0.911 0.914 0.933 0.882
Notice that they’re all pretty similar regardless of the links.
dose <- seq(-4, 8, 0.2)
predval <- sapply(list(logit_model, probit_model, log_log_model, cauchit_model), function(m) predict(m, data.frame(conc = dose), type = "response"))
mpv <- data.frame(dose = rep(dose, 4), predval)
colnames(mpv)[-1] <- c("logit_model", "probit_model", "log_log_model", "cauchit_model")
mpv <- gather(mpv, link, probability, -dose)
ggplot(mpv, aes(x = dose, y = probability, linetype = link)) +
geom_line() +
theme_minimal()
While the predicted probabilities don’t differ much in the range of the data, as we extrapolate, the differences become apparent. Look at the regions of dose above 5 and below 0. So they become more apparently different at the tails. In reality the probit and logit actually differ a lot in the tails which is an issue in medical experiments.
Decisions of what link to choose is important. However, it is typically not possible to make this choice just off data. For moderate p’s, it seems they all the propose ones act similarly. We have to be more careful for the tail p’s. We typically choose links based on convenience or physical knowledge
We will start this section with a study. In 1987, a study on infant respiratory disease which shows the proportions of children developing bronchitis or pneumonia in their first year of life by type of feeding and sex.
data(babyfood)
xtabs(disease/(disease + nondisease) ~ sex + food ,babyfood)
## food
## sex Bottle Breast Suppl
## Boy 0.16812227 0.09514170 0.12925170
## Girl 0.12500000 0.06681034 0.12598425
In prospective sampling, the predictors are fixed and then the outcome is observed. This also called a cohort study. So prospective sampling \(\equiv\) cohort study. In the infant respiratory disease example, we would select a sample of newborn girls and boys whose parents had chosen a particular method of feeding and then monitor them for their first year.
In retrospective sampling,the outcome is fixed and then the predictors are observed. This is also called a case-controlled study. So retrospective sampling \(\equiv\) case-controlled. Typically we would find infants coming to a doctor with a respiratory disease in the first year and then record their sex and method of feeding. We would also obtain a sample of respiratory disease free infants and record their information. The method of sampling is important because we require that the method of inclusion is independent of the predictor values. Not every subject in the population have the same change to be included in the sample.
What is our goal in this study? : We want to know what predictors affect the response. In context, what feeding methods effect the probability of getting a cold for an infant in their first year.
So what type of study or sampling to use? : In this case, it makes sense to do prospective study.
Let’s focus on just boys who are breast or bottle fed:
babyfood[c(1,3), ]
## disease nondisease sex food
## 1 77 381 Boy Bottle
## 3 47 447 Boy Breast
Log odds seems to be a sensible way to measure the strength of the association of a predictor with a outcome.
log(47/447)
## [1] -2.252411
log(77/381)
## [1] -1.598994
The difference between the two log odds \(\Delta\): \(\Delta = -1.60 - (-2.25) = 0.65\). This difference represents the increased risk of respiratory disease incurred by bottle feeding relative to breast feeding. This is the log-odds ratio.
-1.60 - -2.25
## [1] 0.65
Interpretation: Log odds ratio is positive, so there is a greater risk of respiratory disease for bottle fed compared to breast-fed babies.
What if this was a retrospective study?
If this was a retrospective, we can compute the log odds of feeding type given respiratory disease status and then find the difference.
\(log(\frac{77}{47})\) log(prob of disease and bottle / prob of disease and breast) \(log(\frac{381}{447})\) log(prob of nondisease and bottle / prob of nondisease and breast)
\(\Delta = log(\frac{77}{47}) - log(\frac{381}{447}) = 0.65\)
Notice, \(= log(\frac{77}{381}) - log(\frac{47}{447}) = 0.65\)
So our retrospective design is as effective as a prospective design for estimating \(\Delta\). This is not possible for other links such as probit so we have to use logit. Logit link provides us a framework of interpretation such as log odds and log ratios.
What are some advantages of retrospective studies?
In the previous example, we would need to wait a year before the outcome for a particular baby is known when using the prospective/cohort approach. For studying some diseases such as cancer or heart disease, we may need to wait a long time for results to be known. In our case-control (retrospective) approach, we start right away (provided we have data)
In retrospective studies, we can study many possible predictors (wheras in cohort studies, the predictors have to be pre-determined, but we lack foresight)
For rare outcomes, retrospective studies are favored, otherwise we will need a very large cohort study guarantee any “positive” responses.
Issues of retrospective studies?
One major concern of the Case control studies: your sample is not representative of the population
E.g.: in a case-control study with 100 heathy people and 100 HIV positive patients. But this does not mean \(50\%\) of human on earth are infected by HIV
Summmary: Using the logit link, we are provided with log-odds interpretation which will allow us to study predictors using log odds. We can use retrospective or prospective sampling. Retrospective sampling or case-control studies have the advantage of being a 1). quickstart, 2). able to study many predictors 3). good for rare outcomes.
What are some advantages of prospective studies?
They are less susceptible to bias in the selection of the model. This is because retrospective sampling relies on historical data which may contain inaccuracies and/or incompleteness
Prospective studies allow for more than one outcome. For example, if we also want to measure the babie’s success in learning to walk, this is problematic in case control studies because more than one outcome would lead to different samples.
Only prospective studies can generate models that predict the probability of an outcome
Let’s compare the two studies
Let \(\pi_0\) be the probability of an individual being included in the study if they do not have the disease
Let \(\pi_1\) be the probability of an individual being included in the study if they do not have the disease
Notice
Now for a given \(X = x\), let \(p(x) = \mathbb{P}(Y = 1 |X=x)\) the true conditional probability of having the disease for any given person in the population with \(X = x\).
Let \(p^*(x)\) be the probability of having the disease for a person with \(X = x\) and is included in the case-control study
Using Baye’s Formula:
\[p^{*}(x) = \frac{\pi_1p(x)}{\pi_1p(x)-\pi_0(1-p(x))}\] Apply the logit transformation on both sides:
\[logit[p^*(x)] = log\frac{\pi_1}{\pi_0} + logit[p(x)]\]
This in the case-control data to fit the multiple logistic regression model, all coefficients \(\beta_1, \beta_2,...,\beta_p\) except for the intercept \(\beta_0\) will be correctly estimated, as if the data were collected prospectively
Summary: The retrospective study is correct to the prospective study to the degree with the intercept \(\beta_0\). Besides the intercept, all the coefficients of the logistic regression will be correctly estimated. This means retrospective cannot give us the absolute effects. So case-control studies can tell us if a particular lifestyle choice would increase the risk of a certain outcome but not an exact probability.
Modeling the studies
\[logit\{\mathbb{P}(Y = 1 | X = x)\} = log \frac{\mathbb{P}(Y = 1 | X = x)}{\mathbb{P}(Y = 0 | X = x)} = \beta_0 + \beta_1 x\]
We have in hand are two distributions: \(X|Y = 0\) (control group) and \(X|Y = 1\) (case group)
Our goal is to estimate \(\mathbb{P}(Y = 1 | X = x)\), by the Baye’s formula:
\[\mathbb{P}(Y = j | X = x) = \frac{\mathbb{P}(Y = j)f(X|Y=j)}{f_X(x))}, j=0,1\] Then we apply the case control data we have:
\[logit(\mathbb{P}(Y = 1 | X = x) = log \frac{\pi_1}{1-\pi_1} + log\frac{f(X|Y=1)}{f(X|Y = 0)}\] where \(\pi_1 = \mathbb{P}(Y = 1)\)
We need to make some assumptions. Assume \(X|Y = 0 \sim N(\mu_0, \sigma^2_0)\) and \(X|Y = 1 \sim N(\mu_1, \sigma^2_1)\).
We know the log-density function for \(N(\mu,\sigma^2\) is:
\[log(f(x;\mu,\sigma^2) = -\frac{n}{2}log2\pi - \frac{n}{2}log\sigma^2 - \frac{1}{2\sigma^2}(x-\mu)^2\]
Therefore using the case control data we have:
\[logit\{\mathbb{P}(Y = 1 | X = x)\} = \beta_0 + (\frac{1}{2\sigma^2_0} + \frac{1}{2\sigma^2_1})x^2 + (\frac{\mu_1}{\sigma^2_1} - \frac{\mu_0}{\sigma^2_0})x \]
Observations:
When \(\sigma^2_1 = \sigma^2_0\),\(\mu_1 = \mu_0\): the null logistic model
When \(\sigma^2_1 = \sigma^2_0\),\(\mu_1 \neq \mu_0\): the linear logistic model
When \(\sigma^2_1 \neq \sigma^2_0\): the quadratic logistic model
We make simple linear assumptions to put some constraint on our data.
Sometimes we wish to predict the outcome for given values of the covariates. For binomial data this will mean estimating the probability of success.
Given covariate \(x_0\), the predicted response on the link scale is \(\hat{\eta} = x_0\hat{\beta}\) with variance given by \(x_0^T(X^TWX)^{-1}x_0\). We can obtain approximate confidence intervals using normal approximation. To get the answer in the probability scale, it will be necessary to transform back using the inverse of the link function.
Let’s do an example:
data(bliss)
head(bliss)
## dead alive conc
## 1 2 28 0
## 2 8 22 1
## 3 15 15 2
## 4 23 7 3
## 5 27 3 4
lmod <- glm(cbind(dead,alive) ~ conc, family=binomial, data=bliss)
lmodsum <- summary(lmod)
Now we want to predict the response at a dose of 2.5:
\(x_0 = [1, 2.5]\)
\(\eta_0 = \sum \beta_ix_0\)
\(ilogit(\eta_0)\)
x0 <- c(1,2.5)
eta_0 <- sum(x0*coef(lmod))
eta_0
## [1] 0.5809475
ilogit(eta_0)
## [1] 0.6412854
We predict a 64% chance of death at this dose.
Equivalently we can do:
predict(lmod, newdata = data.frame(conc=2.5), type = "response")
## 1
## 0.6412854
Let’s compute at \(95\%\) confidence interval (CI) for the probability
#First extract variance matrix
(cm <- lmodsum$cov.unscaled)
## (Intercept) conc
## (Intercept) 0.17463024 -0.06582336
## conc -0.06582336 0.03291168
se <- sqrt(t(x0) %*% cm %*% x0)
ilogit(c(eta_0 - 1.96*se, eta_0 + 1.96*se))
## [1] 0.5342962 0.7358471
More can do this more directly:
predict(lmod, newdata = data.frame(conc=2.5), type = "response", se= T)
## $fit
## 1
## 0.6412854
##
## $se.fit
## 1
## 0.05205758
##
## $residual.scale
## [1] 1
ilogit(c(0.58095 - 1.96*0.2263, 0.58095 + 1.96*0.2263))
## [1] 0.5342966 0.7358478
What if we want to predict dose of -5
x0 <- c(1,-5)
se <- sqrt(t(x0) %*% cm %*% x0)
eta_0 <- sum(x0*coef(lmod))
ilogit(c(eta_0 - 1.96*se, eta_0 + 1.96*se))
## [1] 2.357639e-05 3.643038e-03
When there is a single (continuous) covariate or when other covariates are hold fix, we sometimes wish to estimate the value of \(x\) corresponding to a chosen p. If we want to determine what dose will lea to a probability of success \(p\).
We define ED50 as effective dose as for which there will be a 50% chance of success. For a logit link, we can set \(p = 1/2\) and then solve for \(x\) to find that”
\[\hat{ED50} = \hat{\beta_0/\hat{\beta_1}}\]
To determine the standard error, we use the delta method.
For other levels, the effective dose \(x_p\) is defined as:
\[x_p = \frac{logit(p) - \beta_{0}}{\beta_1}\]