When we do linear regression on a zero/one response variable, we are effectively modeling the probability that the response is one.
orings = fetchData("oring-damage.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/oring-damage.csv
xyplot(damage ~ temp, data = orings)
Sketch in the best-fitting line. But a straight line is too “stiff”: extended far enough in either direction, it must eventually escape the 0 to 1 range that a probability has to stay within.
Remember that the idea of logistic regression is to fit a linear model to the “log odds”, which is just the logarithm of \( p/(1-p) \), a different format for the probability. We haven't talked about how the fitting is done; that's more advanced.
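As a rough sketch of the log-odds idea, the R code below converts probabilities to log odds and back, then fits a logistic model with `glm()`. The helper names `logit` and `inv_logit` and the simulated `temp`/`damage` values are assumptions made up for illustration; they are not the O-ring data.

```r
# Convert between probability and log odds (helper names are illustrative).
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

p <- c(0.01, 0.25, 0.5, 0.75, 0.99)
log_odds <- logit(p)          # any real number; 0 corresponds to p = 0.5
back <- inv_logit(log_odds)   # recovers the original probabilities

# Simulated stand-in for zero/one data (NOT the O-ring data).
set.seed(1)
temp <- runif(200, 50, 85)
damage <- rbinom(200, size = 1, prob = inv_logit(10 - 0.15 * temp))

# A linear model on the log-odds scale.
mod <- glm(damage ~ temp, family = binomial)
coef(mod)

# Converted back to probabilities, the fitted values stay between 0 and 1.
fitted_p <- predict(mod, type = "response")
range(fitted_p)
```

Unlike the straight line, the fitted curve flattens out as it approaches 0 or 1.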
Suppose we have some data on an illness and exposure to a potential toxin:
| | Sick | Healthy |
|---|---|---|
| Exposed | A | B |
| Not Exposed | C | D |
Among the exposed, the risk of being sick is \( A/(A+B) \).
Among the unexposed, the risk of being sick is \( C/(C+D) \).
The risk ratio is the … ratio of the two risks! It tells how much more likely you are to get the sickness if you are exposed.
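A small worked example of the two risks and their ratio, using counts invented purely for illustration (they do not come from any study):

```r
# Hypothetical counts for the 2x2 table, made up for this example.
A <- 30; B <- 970    # exposed: sick, healthy
C <- 10; D <- 990    # not exposed: sick, healthy

risk_exposed   <- A / (A + B)   # 0.03
risk_unexposed <- C / (C + D)   # 0.01
risk_ratio <- risk_exposed / risk_unexposed
risk_ratio
```

With these made-up counts the exposed are about three times as likely to be sick.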
Now suppose that the sickness is fairly rare, say roughly 1 in 1000. It's a huge amount of work to measure the exposure on everybody. Is it necessary, given that almost everybody is healthy?
To avoid this problem, it's common to do a case/control study, where we pick sick people from a clinic and a similar number of healthy controls. (Call up the friends of the sick kids. They will be similar in age, activities, etc.)
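But now the ratio of A to B (and of C to D) is wrong: in the sample, sick and healthy are roughly 1 to 1, whereas in the population they are roughly 1 to 1000. The risks computed from such a table do not estimate the risks in the population.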
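Instead, we calculate the odds and the odds ratio, \( \frac{A/C}{B/D} \). This compares the odds of exposure among the sick, \( A/C \), with the odds of exposure among the healthy, \( B/D \). It equals \( \frac{A/B}{C/D} = \frac{AD}{BC} \), the odds of sickness among the exposed relative to the unexposed, and it does not depend on how many healthy controls we chose to sample.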
Notice that if A is much less than B, and C is much less than D, the odds ratio is essentially the same as the risk ratio.
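To make this concrete, here is a sketch in R using hypothetical population counts (invented for illustration). It compares the risk ratio and the odds ratio in the whole population, and then in an idealized case/control sample that keeps every sick person but only a fraction `f` of the healthy. Expected counts are used rather than random sampling, and the function and variable names are assumptions of this example.

```r
# Hypothetical population counts, invented for illustration.
A <- 300;  B <- 99700     # exposed: sick, healthy
C <- 900;  D <- 899100    # not exposed: sick, healthy

risk_ratio <- function(sick_exp, well_exp, sick_unexp, well_unexp) {
  (sick_exp / (sick_exp + well_exp)) / (sick_unexp / (sick_unexp + well_unexp))
}
odds_ratio <- function(sick_exp, well_exp, sick_unexp, well_unexp) {
  (sick_exp / sick_unexp) / (well_exp / well_unexp)
}

rr_pop <- risk_ratio(A, B, C, D)   # 0.003 / 0.001
or_pop <- odds_ratio(A, B, C, D)   # close to rr_pop because sickness is rare

# Case/control: keep all the sick, but only a fraction f of the healthy,
# chosen here so that the number of controls equals the number of cases.
f <- (A + C) / (B + D)
rr_cc <- risk_ratio(A, f * B, C, f * D)  # distorted by the sampling
or_cc <- odds_ratio(A, f * B, C, f * D)  # unchanged by the sampling

c(rr_pop = rr_pop, or_pop = or_pop, rr_cc = rr_cc, or_cc = or_cc)
```

The sampling fraction cancels out of the odds ratio but not out of the risk ratio, which is why the odds ratio is the quantity reported from case/control studies.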
A study of bicycle helmet use and the influence of state law.
Demonstration of logistic regression and odds ratios. “Effects of state helmet laws on bicycle helmet use by children and adolescents” Injury Prevention 2002, 9:42-46
The coefficients in logistic regression correspond to log-odds ratios.
When they say “adjusted Odds Ratio”, they mean the odds ratio of one of the categorical variable levels relative to the reference level, with the other variables in the model as covariates. Notice that they give a confidence interval on the odds ratio itself. The calculation of this is straightforward given the standard error on the logistic regression coefficient.
Calculate the confidence interval on the odds ratio, e.g.
exp(0.7 + c(-1, 1) * 1.96 * 0.294)
## [1] 1.132 3.583
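The reasoning behind that calculation, written out as a formula. The symbols \( \hat\beta \) (the logistic regression coefficient) and \( \mathrm{SE} \) (its standard error) are notation introduced here; the interval is formed on the log-odds scale and then exponentiated, since the coefficient is a log-odds ratio:

\[
\left[\, e^{\hat\beta - 1.96\,\mathrm{SE}},\; e^{\hat\beta + 1.96\,\mathrm{SE}} \,\right]
= \left[\, e^{0.7 - 1.96 \times 0.294},\; e^{0.7 + 1.96 \times 0.294} \,\right]
\approx [\,1.13,\; 3.58\,]
\]

The point estimate of the odds ratio is \( e^{0.7} \approx 2.01 \). The interval is not symmetric around it, because it is symmetric on the log scale rather than on the odds-ratio scale.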
NOTE They've made
It turns out that the parents' myopia is the predisposing factor. Shortsighted parents leave the lights on at night, and their kids are shortsighted for genetic reasons. A CNN news report is here. And here is the abstract of the rebuttal.
Variable ---> Outcome
   ^             ^
   |             |
   ---> Confounder <--
Draw the causal diagram
Spending ----------> SAT <-------|
| |
Focus on Educ. ---> fraction taking SAT -|
Research in political science shows that higher spending in campaigns is related to a lower vote for the incumbent. Yet it's common sense that higher spending improves things for the candidate; that's why they do it.
Polls <----- Popularity ---> vote outcome
| ^
v |
Spending ---------------------------
So far we've dealt with confounding by including the covariate in the model. But this is too crude an answer.
Work on the Logistic Regression model of Bonds's hitting.