Create an example data set:
foo = data.frame(exposed = c("Y", "N", "Y", "N"), sick = c("Sick", "Sick", "Healthy",
"Healthy"))
foo
## exposed sick
## 1 Y Sick
## 2 N Sick
## 3 Y Healthy
## 4 N Healthy
goo = foo[c(rep(1, 10), rep(2, 20), rep(3, 2000), rep(4, 1000)), ]
tally(~sick | exposed, data = goo)
## exposed
## sick N Y
## Healthy 0.980392 0.995025
## Sick 0.019608 0.004975
## Total 1.000000 1.000000
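The tally gives proportions within each exposure group. Since goo was built by replicating the rows of foo, the raw counts behind those proportions are fixed; you can confirm them with base R's table():

table(goo$sick, goo$exposed)
# rows are sick status, columns exposure: Healthy is 1000 (N) and 2000 (Y), Sick is 20 (N) and 10 (Y)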
glm(sick == "Sick" ~ exposed, data = goo, family = "binomial")
##
## Call: glm(formula = sick == "Sick" ~ exposed, family = "binomial",
## data = goo)
##
## Coefficients:
## (Intercept) exposedY
## -3.91 -1.39
##
## Degrees of Freedom: 3029 Total (i.e. Null); 3028 Residual
## Null Deviance: 337
## Residual Deviance: 323 AIC: 327
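The coefficients are on the log-odds scale, so exponentiating recovers the odds and the odds ratio. A quick check against the tally above:

exp(-3.91)  # about 0.02, the odds of sickness among the unexposed: 0.019608/0.980392
exp(-1.39)  # about 0.25, the odds ratio comparing exposed to unexposed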
When we do linear regression on a Zero/One variable, we are effectively modeling the probability.
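A quick check of this with the goo data: the fitted values from a straight-line fit are exactly the group proportions from the tally.

# Linear probability model: the fitted values are the group risks
lm(sick == "Sick" ~ exposed, data = goo)
# intercept is about 0.0196, the risk among the unexposed; adding the exposedY
# coefficient gives about 0.0050, the risk among the exposed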
orings = fetchData("oring-damage.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/oring-damage.csv
xyplot(damage ~ temp, data = orings, pch = 20, col = rgb(0, 0, 0, 0.3))
Sketch in the best-fitting line. But a line is too “stiff”: follow it far enough and it must eventually escape the 0-1 range that a probability lives in.
Remember that the idea of logistic regression is to fit a linear model to the “log odds”, the logarithm of \( p/(1-p) \), which is just a different format for the probability. We haven't talked about how the fitting is done; that's more advanced.
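For instance, assuming damage in the orings data is coded 0/1 (as the plot suggests), the logistic version of the fit is a one-argument change:

glm(damage ~ temp, data = orings, family = "binomial")
# the temp coefficient is the change in the log odds of damage per degree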
Suppose we have some data on an illness and exposure to a potential toxin
|             | Sick | Healthy |
|---|---|---|
| Exposed     | A | B |
| Not Exposed | C | D |
Among the exposed, the risk of being sick is \( A/(A+B) \).
Among the unexposed, the risk of being sick is \( C/(C+D) \).
The risk ratio is the … ratio of the two risks, \( \frac{A/(A+B)}{C/(C+D)} \). It tells how much more likely you are to get the sickness if you are exposed.
Now suppose that the sickness is fairly rare, say roughly 1 in 1000. It's a huge amount of work to measure the exposure on everybody. Is that necessary, given that almost everybody is healthy?
To avoid this problem, it's common to do a case/control study, where we pick sick people from a clinic and a similar number of healthy controls. (Call up the friends of the sick kids. They will be similar in age, activities, etc.)
But now the ratio of A to B is wrong: it's roughly 1 to 1, whereas in the population it's roughly 1 to 1000, so the risks computed from the sample no longer estimate the population risks.
Instead, we calculate the odds and the odds ratio, \( \frac{A/C}{B/D} \). Unlike the risk ratio, this is unaffected by the case/control sampling: over-sampling the sick scales A and C by the same factor, which cancels in the ratio.
Notice that if A is much less than B, and C is much less than D, then \( A+B \approx B \) and \( C+D \approx D \), so the odds ratio is essentially the same as the risk ratio.
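A check with the goo data, where sickness is rare in both groups:

A = sum(goo$exposed == "Y" & goo$sick == "Sick")     # 10
B = sum(goo$exposed == "Y" & goo$sick == "Healthy")  # 2000
C = sum(goo$exposed == "N" & goo$sick == "Sick")     # 20
D = sum(goo$exposed == "N" & goo$sick == "Healthy")  # 1000
(A/(A + B))/(C/(C + D))  # risk ratio, about 0.254
(A/C)/(B/D)              # odds ratio, 0.25: essentially the same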
A study of bicycle helmet use and the influence of state law gives a demonstration of logistic regression and odds ratios: “Effects of state helmet laws on bicycle helmet use by children and adolescents”, Injury Prevention 2002, 9:42-46.
The coefficients in logistic regression correspond to log-odds ratios.
When they say “adjusted Odds Ratio”, they mean the odds ratio for one level of a categorical variable relative to its reference level, with the other variables in the model included as covariates. Notice that they give a confidence interval on the odds ratio itself. The calculation is straightforward given the standard error on the logistic regression coefficient.
Calculate the confidence interval on the odds ratio by exponentiating the endpoints of the interval on the coefficient. For example, for a coefficient of 0.7 with a standard error of 0.294 (the point estimate itself is exp(0.7) ≈ 2.01):
exp(0.7 + c(-1, 1) * 1.96 * 0.294)
## [1] 1.132 3.583
Aspirin and stroke.
The aspirin simulation gives a (not very detailed) model of the relationship between aspirin consumption and stroke. It has a confounder, sick, which represents how sick the patient is.
You can see the overall structure of the simulation:
fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE
aspirin
## Causal Network with 3 vars: sick, mgPerDay, stroke
## ===============================================
## sick is exogenous
## mgPerDay <== sick
## stroke <== sick & mgPerDay
The simulation itself is a little bit complicated. For what it's worth, you can look inside:
equations(aspirin)
## sick <== sample(c(rep(0,20),1:10,1:3,8:10), nsamps,replace=TRUE)
## mgPerDay <== 5*(sick<2)*sample(c(0,0,0,0:10,0:20),nsamps, replace=TRUE) + (sick>=2)*5*sample(c(0,0,0,0:20,10:20,15:20),nsamps,replace=TRUE)
## stroke <== cut(rnorm(nsamps,mean=0,sd=5) + 2*sick - mgPerDay/10, c(-Inf,5,Inf), labels=c('N','Y'))
Imagine that your research team has constructed this model. You've been asked to design the observational study. Two questions:

- How large a sample do you need?
- Do you need to include sick to get a meaningful result?

Assuming that it's important to include sick as a covariate, here's one trial in a power calculation for a study of size \( n=50 \).
f = run.sim(aspirin, 50)
# p-value on the mgPerDay coefficient: row 2, column 4 of the coefficient table
summary(glm(stroke == "Y" ~ mgPerDay + sick, data = f))$coef[2, 4]
## [1] 0.08187
Generate many trials and find the power.
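For instance, a sketch with 500 trials at \( n=50 \):

pvals = replicate(500, {
  f = run.sim(aspirin, 50)
  summary(glm(stroke == "Y" ~ mgPerDay + sick, data = f))$coef[2, 4]
})
mean(pvals < 0.05)  # fraction of trials detecting the mgPerDay effect: the power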
Then set the sample size as required to create a power of, say, 80%.
How much larger would the sample size need to be to create a power of 95%?
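One way to organize the search (power_at is a hypothetical helper, not part of the software used here):

power_at = function(n, trials = 500) {
  pvals = replicate(trials, {
    f = run.sim(aspirin, n)
    summary(glm(stroke == "Y" ~ mgPerDay + sick, data = f))$coef[2, 4]
  })
  mean(pvals < 0.05)
}
# e.g. sapply(c(50, 100, 200, 400), power_at), then read off where the power reaches the target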
See instructions at http://rpubs.com/dtkaplan/deer-crossing.