fetchData("m155development.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
## [1] TRUE
fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE
Theme for the Day: The Alternative Hypothesis
Suppose that we've just invented a device that makes cars safer, one that we expect to reduce deaths by 10%. That would be a big improvement. We're going to do an experiment to test it.
We already know something about Planet Alt: that the death rate will be lower by 10%. Now we need to do a little bit more work to make a Planet Alt that's relevant to the data we will actually be creating on Planet Earth. Death rates for vehicular accidents are given at the NHTSA web site http://www-fars.nhtsa.dot.gov/Main/index.aspx.
The site indicates that in 2010, there were 15.65 vehicle accident deaths per 100,000 drivers per year. We don't actually know that this rate will be relevant to our experiment.
Imagine that we collect data at \( n=1000 \) sites. (This corresponds to 100 cars at each site, or 100,000 cars in each of the treatment and control groups.) The expected death rate at each site is then \( 100 \times 15.65/100000 = 15.65/1000 \) deaths per year.
ourdata = groupwiseData(treatment = rpois(1000, lambda = 0.9 * 15.65/1000),
                        control = rpois(1000, lambda = 15.65/1000))
summary(lm(val ~ group, data = ourdata))
##
## Call:
## lm(formula = val ~ group, data = ourdata)
##
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.015 -0.015 -0.012 -0.012  0.988 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.01200    0.00365    3.29    0.001 **
## groupcontrol  0.00300    0.00516    0.58    0.561   
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.115 on 1998 degrees of freedom
## Multiple R-squared: 0.000169, Adjusted R-squared: -0.000331
## F-statistic: 0.338 on 1 and 1998 DF, p-value: 0.561
Have everyone in the class do this and see how many of them were able to reject the null hypothesis.
How many sites do we need to use to get 90% power?
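One way to find out is by simulation, using the do()/tally() pattern that appears later in these notes: repeat the whole study many times at a given number of sites and see what fraction of the trials reject the null at the 0.05 level. Here is a minimal sketch (the 500 trials and the starting value of nsites are arbitrary choices):
nsites = 1000  # number of sites in each of the treatment and control groups
s = do(500) * summary(lm(val ~ group,
    data = groupwiseData(treatment = rpois(nsites, lambda = 0.9 * 15.65/1000),
                         control = rpois(nsites, lambda = 15.65/1000))))$coef[2, 4]
tally(~result < 0.05, data = s)  # power = fraction of trials with p < 0.05
Increase nsites until roughly 90% of the trials come out TRUE.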
Another way to think about this problem is to ask how precise our estimate of the death rate is. The standard error of that estimate needs to be substantially smaller than the anticipated reduction in the death rate, which is 10% of 15.65/1000, or about 0.0016 deaths per site.
summary(lm(rpois(1000, lambda = 0.9 * 15.65/1000) ~ 1))
##
## Call:
## lm(formula = rpois(1000, lambda = 0.9 * 15.65/1000) ~ 1)
##
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.014 -0.014 -0.014 -0.014  0.986 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01400    0.00372    3.77  0.00018 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.118 on 999 degrees of freedom
This standard error, about 0.0037, is too large: it's more than twice the 0.0016 reduction we hope to detect. We should try to get it much smaller, perhaps one-sixth as large. Since the standard error shrinks in proportion to \( 1/\sqrt{n} \), that will necessitate collecting \( 6^2 = 36 \) times as much data.
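A quick check of the standard error at that scale, using the same one-sample calculation as above:
# With 36 times as many sites, the standard error should be about 0.0037/6, or roughly 0.0006.
summary(lm(rpois(36000, lambda = 0.9 * 15.65/1000) ~ 1))$coef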
Do the power calculation for that size.
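That amounts to rerunning the power sketch above with nsites = 36000; each trial now fits a model to 72,000 rows, so use fewer trials and expect it to take a while.
nsites = 36000
s = do(200) * summary(lm(val ~ group,
    data = groupwiseData(treatment = rpois(nsites, lambda = 0.9 * 15.65/1000),
                         control = rpois(nsites, lambda = 15.65/1000))))$coef[2, 4]
tally(~result < 0.05, data = s)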
QUESTION: Given the number of sites, is there a practical way to do the experiment?
How could we do the experiment in a less costly way? (Find a high-risk group of cars. Find an endpoint other than deaths, say injuries.)
Here's a bit of information for Pennsylvania.
How big would the study need to be if it studied injuries rather than deaths?
Aspirin and stroke.
The aspirin simulation gives a (not very detailed) model of the relationship between aspirin consumption and stroke. It has a confounder, sick, which represents how sick the patient is.
You can see the overall structure of the simulation:
aspirin
## Causal Network with 3 vars: sick, mgPerDay, stroke
## ===============================================
## sick is exogenous
## mgPerDay <== sick
## stroke <== sick & mgPerDay
The simulation itself is a little bit complicated. For what it's worth, you can look inside:
equations(aspirin)
## sick <== sample(c(rep(0,20),1:10,1:3,8:10), nsamps,replace=TRUE)
## mgPerDay <== 5*(sick<2)*sample(c(0,0,0,0:10,0:20),nsamps, replace=TRUE) + (sick>=2)*5*sample(c(0,0,0,0:20,10:20,15:20),nsamps,replace=TRUE)
## stroke <== cut(rnorm(nsamps,mean=0,sd=5) + 2*sick - mgPerDay/10, c(-Inf,5,Inf), labels=c('N','Y'))
Imagine that your research team has constructed this model. You've been asked to design the observational study. Two questions:
* How large would the study need to be?
* Is it important to include sick as a covariate to get a meaningful result?
Assuming that it's important to include sick as a covariate, here's one trial in a power calculation for a study of size \( n=50 \):
f = run.sim(aspirin, 50)
summary(glm(stroke == "Y" ~ mgPerDay + sick, data = f))$coef[2, 4]  # p-value on the mgPerDay coefficient
## [1] 0.009233
Generate many trials and find the power.
Then set the sample size as required to create a power of, say, 80%.
How much larger would the sample size need to be to create a power of 95%?
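A sketch of the calculation: wrap the single trial from above in do(), tally the rejections at the 0.05 level, and raise the study size (the 50 below) until the power reaches the target.
s = do(500) * summary(glm(stroke == "Y" ~ mgPerDay + sick,
    data = run.sim(aspirin, 50)))$coef[2, 4]
tally(~result < 0.05, data = s)  # power at n = 50; increase n for 80% and then 95%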
The power of vitamin D. Handout.
Paper and book by John Ioannidis
* CCU outcomes
* Retroactive prayer. How could prayer work retroactively? Look at the last paragraph of the paper.
How to avoid this problem? Insist on a p-value of 0.0000001. Repeat the aspirin study to find out how large it would need to be. The sample size is surprisingly small.
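Here's a sketch of that check: the same aspirin power simulation, but tallying rejections at the \( 10^{-7} \) threshold. The study size of 500 below is only a starting guess; adjust it until most of the trials still clear the stricter threshold.
s = do(500) * summary(glm(stroke == "Y" ~ mgPerDay + sick,
    data = run.sim(aspirin, 500)))$coef[2, 4]
tally(~result < 1e-07, data = s)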
What sample size do we need to get 90 percent power? (About 300.)
s = do(100) * summary(lm(systolic ~ D + race, data = run.sim(vitaminD, 30)))$coef[2, 4]  # p-value on the D coefficient
tally(~result < 0.05, data = s)
##
##  TRUE FALSE Total 
##    16    84   100 
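To find the study size that gives the desired power, the same calculation can be swept over several candidate sizes (the values below are just illustrative) and the power estimated at each:
for (n in c(30, 100, 300, 1000)) {
    s = do(100) * summary(lm(systolic ~ D + race, data = run.sim(vitaminD, n)))$coef[2, 4]
    cat("n =", n, " estimated power =", mean(s$result < 0.05), "\n")
}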