Last session we looked at hypothetical causal networks. There were simple rules for including or excluding a covariate in order to open or block a pathway. But those rules presume a hypothesized network:
We need a reliable and authoritative way to establish the existence and strength of a causal link, one that works regardless of whether our hypothesis corresponds to reality.
Consider the mediator network:
fetchData("simulate.r")
## [1] TRUE
cnet.mediator
## Causal Network with 3 vars: B, C, A
## ===============================================
## B is exogenous
## C <== B
## A <== C
We can see what the structure of the network is, but suppose we want to discover the causal structure without being able to see the network. Here's how: impose values on one variable (an intervention), run the system, and see whether another variable responds.
myintervention = rnorm(100)
toA = run.sim(cnet.mediator, 100, A = myintervention)
summary(lm(B ~ A, data = toA))
##
## Call:
## lm(formula = B ~ A, data = toA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2225 -0.5493 0.0122 0.5152 2.1900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.00357 0.08725 0.04 0.97
## A -0.10015 0.09188 -1.09 0.28
##
## Residual standard error: 0.872 on 98 degrees of freedom
## Multiple R-squared: 0.012, Adjusted R-squared: 0.0019
## F-statistic: 1.19 on 1 and 98 DF, p-value: 0.278
toB = run.sim(cnet.mediator, 100, B = myintervention)
summary(lm(A ~ B, data = toB))
##
## Call:
## lm(formula = A ~ B, data = toB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.037 -0.908 0.155 1.027 3.695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0369 0.1532 0.24 0.81
## B -1.0478 0.1613 -6.50 3.4e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 98 degrees of freedom
## Multiple R-squared: 0.301, Adjusted R-squared: 0.294
## F-statistic: 42.2 on 1 and 98 DF, p-value: 3.44e-09
Intervening on A produces no detectable change in B (the coefficient on A is indistinguishable from zero), while intervening on B produces a strong, significant change in A. So the causal flow runs from B to A, not the other way around. For a categorical variable, there are a couple of ways to construct an intervention:
myvals = resample(c("N", "Y"), 100)
myvals2 = rep(c("N", "Y"), 50)
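The first gives a random assignment (resample draws with replacement, so the split between "N" and "Y" varies from run to run); the second gives an exactly balanced, alternating assignment. A quick check:
table(myvals)   # counts of "N" and "Y" vary from run to run
head(myvals2)   # exactly alternating: "N" "Y" "N" "Y" "N" "Y"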
Use this technique to map out the causal connections in these networks: campaign.spending, cnet.common.cause, cnet.witness, cnet.chain, aspirin, jock, heights.
We want to maximize the variance of the intervention without going above a certain limit (as with the dose of aspirin). An efficient way to do this is to make an intervention of the form 0, 1, 0, 1, …, alternating between the two extremes of the allowed range.
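For instance, to alternate the aspirin dose between 0 and 100 mg per day across 500 subjects (a sketch; mydose is just an illustrative name, and run.sim will also recycle a two-element vector for you, as in the experiment later on):
mydose = rep(c(0, 100), length.out = 500)   # 0, 100, 0, 100, ...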
We're going to do an experiment to see whether there is a relationship between posture (standing or sitting) and heart rate.
We've just done an experiment: a procedure in which we impose certain conditions and measure the result. In this case, the condition we imposed was sitting versus standing.
We do experiments in order to measure with some precision the relationships involved. Let's tally up the results of this experiment:
Now we'll analyze the data:
s = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Am13enSalO74dE1COTRrYWVHN3dNR09TZHNLTkJnbHc&single=true&gid=0&output=csv")
Plot the density of heart rate broken down by posture, and also make a box-and-whisker plot (bwplot).
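One way to do this, using the lattice graphics functions loaded with mosaic (HR1050 is the heart-rate variable in the spreadsheet):
densityplot(~HR1050, groups = Posture, data = s, auto.key = TRUE)
bwplot(HR1050 ~ Posture, data = s)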
Now, the model:
coef(summary(lm(HR1050 ~ Posture, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.333 4.089 16.222 3.452e-12
## Posturestanding 8.212 5.514 1.489 1.537e-01
Could we improve the experiment? There are several possible ways, but our focus here is on how to improve the experiment with a better design.
To explore this, here is a program that will generate simulated HR data:
fetchData("hr.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/hr.R
## [1] TRUE
fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE
To run it, you specify the subject ID numbers and the posture of each subject, which must be one of “sitting”, “standing”, or “lying”. Here's an example of running it on 10 people, half of whom are standing and half sitting. NOTE: the expression 1:10 creates the numbers 1, 2, 3, …, 10; that is, 1:10 makes 10 different people for the experiment. The levels of the posture variable are recycled to match.
s1 = heart.rate(who = 1:10, posture = c("sitting", "standing"))
s1
## who posture hr sex weight
## 1 S1 sitting 52 F 75.27
## 2 S2 standing 54 M 79.02
## 3 S3 sitting 53 M 63.18
## 4 S4 standing 62 M 52.30
## 5 S5 sitting 51 F 66.33
## 6 S6 standing 66 M 84.19
## 7 S7 sitting 70 F 69.76
## 8 S8 standing 61 F 66.96
## 9 S9 sitting 50 F 46.71
## 10 S10 standing 64 F 50.95
Can we see the difference between the two different postures?
coef(summary(lm(hr ~ posture, data = s1)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.2 3.008 18.349 8.009e-08
## posturestanding 6.2 4.254 1.457 1.831e-01
No significant difference.
ACTIVITY:
This may seem odd at first, but let's add who as an explanatory variable:
summary(lm(hr ~ posture + who, data = s1))
##
## Call:
## lm(formula = hr ~ posture + who, data = s1)
##
## Residuals:
## ALL 10 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52 NA NA NA
## posturestanding 9 NA NA NA
## whoS10 3 NA NA NA
## whoS2 -7 NA NA NA
## whoS3 1 NA NA NA
## whoS4 1 NA NA NA
## whoS5 -1 NA NA NA
## whoS6 5 NA NA NA
## whoS7 18 NA NA NA
## whoS8 NA NA NA NA
## whoS9 -2 NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 9 and 0 DF, p-value: NA
Why did the p-values disappear? The model has as many estimated coefficients as there are data points (one who coefficient is dropped as redundant with posture), so it fits the 10 observations perfectly, leaving no residual degrees of freedom with which to estimate the standard errors.
Now a slight variation on the experiment:
s2 = heart.rate(who = rep(1:5, each = 2), posture = c("sitting", "standing"))
s2
## who posture hr sex weight
## 1 S1 sitting 51 F 75.27
## 2 S1 standing 55 F 75.27
## 3 S2 sitting 52 M 79.02
## 4 S2 standing 54 M 79.02
## 5 S3 sitting 53 M 63.18
## 6 S3 standing 55 M 63.18
## 7 S4 sitting 60 M 52.30
## 8 S4 standing 62 M 52.30
## 9 S5 sitting 52 F 66.33
## 10 S5 standing 54 F 66.33
Can we see the difference between the two different postures in this new experiment?
coef(summary(lm(hr ~ posture, data = s2)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.6 1.575 34.036 6.067e-10
## posturestanding 2.4 2.227 1.078 3.126e-01
coef(summary(lm(hr ~ posture + who, data = s2)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.180e+01 0.4899 1.057e+02 4.797e-08
## posturestanding 2.400e+00 0.4000 6.000e+00 3.883e-03
## whoS2 -8.119e-15 0.6325 -1.284e-14 1.000e+00
## whoS3 1.000e+00 0.6325 1.581e+00 1.890e-01
## whoS4 8.000e+00 0.6325 1.265e+01 2.249e-04
## whoS5 -9.936e-15 0.6325 -1.571e-14 1.000e+00
Why is the estimate from the second model much better than from the first?
Because the second model includes who, which accounts for a lot of the case-by-case variability. (In the first experiment, including who would have provided a perfect fit, leaving nothing to estimate.) A major source of uncontrolled variation in heart rate is the person-to-person variability. There's nothing we can do about that directly. However, we can try to arrange the experimental treatment to be orthogonal to the uncontrolled variation.
You can measure the angle between the imposed treatment and the uncontrolled variation with R²: a value near 1 means the treatment is aligned with the uncontrolled variation, while a value near 0 means the two are orthogonal.
r.squared(lm(as.numeric(posture) ~ who, data = s1))
## [1] 1
r.squared(lm(as.numeric(posture) ~ who, data = s2))
## [1] 3.747e-31
In the second experiment, we had each person participate twice. Intuitively, you can think of this as each person serving as his or her own control. Mathematically, it's a matter of using the who variable to eat up variance, and ensuring that there is little or no overlap between what's explained by the who variable and what's explained by the posture variable.
This gives us a much better experiment: smaller \( n \) for a given power, better precision in the measurement. This is called a cross-over experiment.
There are many settings in which you have only one opportunity for each experimental unit. For example, suppose that instead of heart rate we were studying survival. How should we arrange posture?
Now it will be impossible to impose a posture that's orthogonal to who. But we can come close: assign the posture randomly.
posture = resample(c("sitting", "standing"), 1000)
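As a check, we can generate a large simulated group with random postures and confirm that the assignment is nearly orthogonal to the other covariates (a sketch; with each person appearing only once, orthogonality to who itself is unattainable):
s3 = heart.rate(who = 1:1000, posture = resample(c("sitting", "standing"), 1000))
r.squared(lm(as.numeric(posture) ~ sex + weight, data = s3))   # should be near zero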
Another strategy is blocking, using covariates such as sex and weight. We don't mean blocking as in football, or as in blocking a path in a hypothetical causal network. The “blocking” here refers to arranging your experimental units in groups; within each group, the units are similar to one another.
For example:
- Blocking by sex: break the subjects into two groups, one for males and one for females, and assign the treatment randomly within each group.
- Blocking by weight: break the subjects into groups of, say, two, pairing the people with the closest weights; then randomly assign one person in each pair to “sitting” and the other to “standing” (sketched in code below).
- Blocking by both sex and weight: divide into males and females, then subdivide each of those groups into small groups according to weight; randomly assign “sitting” and “standing” within each group.
You should be able to document your randomization procedure. It's not really enough to say, “We did it at random.”
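Here is a sketch of the weight-based pairing in R. It assumes a data frame subjects with a weight column and an even number of rows; the names are illustrative, not part of hr.R.
# Sort by weight so that adjacent rows have the closest weights
subjects = subjects[order(subjects$weight), ]
# Adjacent pairs form the blocks
subjects$block = rep(1:(nrow(subjects)/2), each = 2)
# Within each pair, randomly assign one person to each posture
subjects$posture = unlist(lapply(1:max(subjects$block),
    function(b) sample(c("sitting", "standing"))))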
Often, especially in the social sciences, experiments are done to probe causal paths.
The basic idea is that by imposing a treatment, you alter the network. If you assign posture, then all the other factors that might have been connected causally to posture are disconnected — only the downstream elements remain.
When the study is done observationally, the results depend on which covariates you include in the model, so you are looking at the consequences of your assumptions rather than purely at the system under study.
But when you impose a treatment, you have reconfigured the system in a known way. All of the backdoor links that involve an incoming path to the treatment have been blocked.
Let's do an experiment where every other candidate gets $10M to spend, while the others get nothing:
s = run.sim(campaign.spending, 5000, spending = c(0, 10))
coef(summary(lm(vote ~ spending, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.414 0.31824 117.564 0.000e+00
## spending 0.315 0.04501 6.999 2.924e-12
coef(summary(lm(vote ~ spending + polls, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0291 0.216459 4.754 2.046e-06
## spending 0.2240 0.015388 14.554 4.973e-47
## polls 0.7323 0.003767 194.413 0.000e+00
The results don't depend on the covariates; within the margin of error, they are the same for the two models.
Notice that I used 5000 units. One reason for this is that I didn't make the variation in spending very large. Here's a bigger variation with only 1/10th as many experimental units:
s = run.sim(campaign.spending, 500, spending = c(0, 100))
coef(summary(lm(vote ~ spending, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0453 0.99446 36.25 9.203e-142
## spending 0.2495 0.01406 17.74 6.307e-55
coef(summary(lm(vote ~ spending + polls, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2178 0.687078 1.772 7.693e-02
## spending 0.2484 0.004975 49.932 8.418e-196
## polls 0.7301 0.012373 59.011 1.185e-226
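The precision actually improved, even with a tenth as many units. For a straight-line model, the standard error of the slope is
\[ \mathrm{SE}(\hat{\beta}) = \frac{\sigma}{s_x \sqrt{n-1}}, \]
so multiplying the spread \( s_x \) of the treatment by 10 while dividing \( n \) by 10 shrinks the standard error by a factor of \( 10/\sqrt{10} \approx 3.2 \), which matches the drop from 0.045 to 0.014 in the standard error on spending.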
An experiment with aspirin.
obs = run.sim(aspirin, 500)
Construct a model of the risk of stroke, using the indicator stroke == "Y" as the response:
mod = lm(stroke == "Y" ~ mgPerDay, data = obs)
coef(summary(mod))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3060503 0.0353370 8.6609 6.554e-17
## mgPerDay -0.0002174 0.0006233 -0.3488 7.274e-01
mod2 = lm(stroke == "Y" ~ mgPerDay + sick, data = obs)
coef(summary(mod2))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.196663 0.0245781 8.002 8.728e-15
## mgPerDay -0.003238 0.0004443 -7.287 1.252e-12
## sick 0.098743 0.0041377 23.864 1.925e-84
(Have the students do this.)
Create an experimental intervention:
exper = run.sim(aspirin, 500, mgPerDay = c(0, 100))
Now the analysis:
emod = lm(stroke == "Y" ~ mgPerDay, data = exper)
coef(summary(emod))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40400 0.0276437 14.615 1.653e-40
## mgPerDay -0.00236 0.0003909 -6.037 3.076e-09
emod2 = lm(stroke == "Y" ~ mgPerDay + sick, data = exper)
coef(summary(emod2))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.229622 0.0220955 10.392 5.040e-23
## mgPerDay -0.002835 0.0002892 -9.803 7.384e-21
## sick 0.085480 0.0041774 20.462 5.890e-68
The two models give essentially the same estimate for mgPerDay. The second model, which depends on measuring and using the covariate sick, doesn't change the result, but it does eat up variance (in a logistic model, the analogue is deviance), sharpening the estimate.
[Need to write some blocking software that can create an intervention that's orthogonal to a set of covariates]
It would be better to make spending orthogonal to polls. There's no good way to do that in this simulation, except by taking the popularity and polls from one simulation and imposing them when creating the next:
# Get the poll data. This involves running a preliminary simulation:
prelim = run.sim(campaign.spending, 100)
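Then, a sketch of the rest, assuming run.sim will accept both polls and spending as interventions in the follow-up run (as suggested above):
# Sort the units by preliminary poll standing, then alternate the spending
# treatment within adjacent pairs so that spending is nearly orthogonal to polls
ord = order(prelim$polls)
spending.design = numeric(100)
spending.design[ord] = rep(c(0, 10), length.out = 100)
designed = run.sim(campaign.spending, 100, polls = prelim$polls, spending = spending.design)
# Check how close to orthogonal we got
r.squared(lm(spending ~ polls, data = designed))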