Last session we looked at hypothetical causal networks. There were pretty simple rules for including or excluding a covariate in order to open or block a pathway. But:

- There's no guarantee that our hypothesis corresponds to reality.

We need a reliable and authoritative way to establish the existence and strength of a causal link, that works regardless of the correspondence between our hypothesis and reality.

Consider the mediator network:

```
fetchData("simulate.r")
```

```
## [1] TRUE
```

```
cnet.mediator
```

```
## Causal Network with 3 vars: B, C, A
## ===============================================
## B is exogenous
## C <== B
## A <== C
```

We can see what the structure of the network is, but suppose we want to discover the causal structure without being able to see the network. Here's how:

- Pick two variables you're interested in seeing the connection between, say A and B.
- Create an intervention:

```
myintervention = rnorm(100)
```

- Apply the intervention to one of the variables:

```
toA = run.sim(cnet.mediator, 100, A = myintervention)
```

- Look for a correlation between your intervention and the target variable

```
summary(lm(B ~ A, data = toA))
```

```
##
## Call:
## lm(formula = B ~ A, data = toA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2225 -0.5493 0.0122 0.5152 2.1900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.00357 0.08725 0.04 0.97
## A -0.10015 0.09188 -1.09 0.28
##
## Residual standard error: 0.872 on 98 degrees of freedom
## Multiple R-squared: 0.012, Adjusted R-squared: 0.0019
## F-statistic: 1.19 on 1 and 98 DF, p-value: 0.278
```

- Now go the other way:

```
toB = run.sim(cnet.mediator, 100, B = myintervention)
summary(lm(A ~ B, data = toB))
```

```
##
## Call:
## lm(formula = A ~ B, data = toB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.037 -0.908 0.155 1.027 3.695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0369 0.1532 0.24 0.81
## B -1.0478 0.1613 -6.50 3.4e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 98 degrees of freedom
## Multiple R-squared: 0.301, Adjusted R-squared: 0.294
## F-statistic: 42.2 on 1 and 98 DF, p-value: 3.44e-09
```

A couple of ways:

```
myvals = resample(c("N", "Y"), 100)
myvals2 = rep(c("N", "Y"), 50)
```

Use this technique to map out the causal connections on these networks
`campaign.spending`

, `cnet.common.cause`

, `cnet.witness`

, `cnet.chain`

, `aspirin`

, `jock`

,`heights`

We want to maximize the variance but not go above a certain limit (as in the dose of aspirin). An efficient way to do this is to make an intervention that's of the form 0, 1, 0, 1

We're going to do an experiment to see if there is a relationship between posture (standing or sitting) and heart rate.

- Ask every second student to stand up.
- Put your finger against your carotid artery and press gently until you feel your pulse.
- At the signal, start counting heart beats. Count until the next signal (after 15 seconds).
- Your heart rate in beats per second is 4 times the count

We've just done an experiment, a procedure in which we impose certain conditions and measure the result. In this case, the condition we imposed was sitting versus standing.

We do experiments in order to measure with some precision the relationships involved. Let's tally up the results of this experiment:

- Enter the data into this spreadsheet. There is one column for each of the three meeting times for the course. Note that the posture data has been pre-entered, so enter the sitting people first and the standing people after.

Now we'll analyze the data:

```
s = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Am13enSalO74dE1COTRrYWVHN3dNR09TZHNLTkJnbHc&single=true&gid=0&output=csv")
```

```
## Loading required package: RCurl
```

```
## Warning: package 'RCurl' was built under R version 2.15.2
```

```
## Loading required package: bitops
```

Plot out the density broken down by posture and a bwplot.

Now, the model:

```
coef(summary(lm(HR1050 ~ Posture, data = s)))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.333 4.089 16.222 3.452e-12
## Posturestanding 8.212 5.514 1.489 1.537e-01
```

Could we improve the experiment. Several ways:

- Measure HR more precisely.
- Specify posture more precisely.
- Collect more data from more people.

But our focus here is on how to improve the experiment by a better design.

To explore this, here is a program that will generate simulated HR data:

```
fetchData("hr.R")
```

```
## Retrieving from http://www.mosaic-web.org/go/datasets/hr.R
```

```
## [1] TRUE
```

```
fetchData("simulate.r")
```

```
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
```

```
## [1] TRUE
```

To run it, you specify the subject ID numbers and the posture of each patient, which must be one of “sitting”, “standing”, “lying”. Here's an example of running it on 10 people, half of whom are standing and half sitting. **NOTE** the expression `1:10`

is creating the numbers 1, 2, 3, …, 10. That is, `1:10`

makes 10 different people for the experiment. The levels of the posture variable are recycled.

```
s1 = heart.rate(who = 1:10, posture = c("sitting", "standing"))
s1
```

```
## who posture hr sex weight
## 1 S1 sitting 52 F 75.27
## 2 S2 standing 54 M 79.02
## 3 S3 sitting 53 M 63.18
## 4 S4 standing 62 M 52.30
## 5 S5 sitting 51 F 66.33
## 6 S6 standing 66 M 84.19
## 7 S7 sitting 70 F 69.76
## 8 S8 standing 61 F 66.96
## 9 S9 sitting 50 F 46.71
## 10 S10 standing 64 F 50.95
```

Can we see the difference between the two different postures?

```
coef(summary(lm(hr ~ posture, data = s1)))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.2 3.008 18.349 8.009e-08
## posturestanding 6.2 4.254 1.457 1.831e-01
```

No significant difference.

ACTIVITY:

- Run the simulation several times. What's the power for a significance level of 0.05? For a significance level of 0.01?
- Assign different groups of students to run experiments of size \( n=20 \), \( n=30 \), \( n=50 \), and so on. What sample size will give a power of 80% at a significance level of 0.01?

This may seem odd at first, but let's add `who`

as an explanatory variable:

```
summary(lm(hr ~ posture + who, data = s1))
```

```
##
## Call:
## lm(formula = hr ~ posture + who, data = s1)
##
## Residuals:
## ALL 10 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52 NA NA NA
## posturestanding 9 NA NA NA
## whoS10 3 NA NA NA
## whoS2 -7 NA NA NA
## whoS3 1 NA NA NA
## whoS4 1 NA NA NA
## whoS5 -1 NA NA NA
## whoS6 5 NA NA NA
## whoS7 18 NA NA NA
## whoS8 NA NA NA NA
## whoS9 -2 NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 9 and 0 DF, p-value: NA
```

Why did the p-values disappear?

Now a slight variation on the experiment:

```
s2 = heart.rate(who = rep(1:5, each = 2), posture = c("sitting", "standing"))
s2
```

```
## who posture hr sex weight
## 1 S1 sitting 51 F 75.27
## 2 S1 standing 55 F 75.27
## 3 S2 sitting 52 M 79.02
## 4 S2 standing 54 M 79.02
## 5 S3 sitting 53 M 63.18
## 6 S3 standing 55 M 63.18
## 7 S4 sitting 60 M 52.30
## 8 S4 standing 62 M 52.30
## 9 S5 sitting 52 F 66.33
## 10 S5 standing 54 F 66.33
```

Can we see the difference between the two different postures in this new experiment?

```
coef(summary(lm(hr ~ posture, data = s2)))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.6 1.575 34.036 6.067e-10
## posturestanding 2.4 2.227 1.078 3.126e-01
```

```
coef(summary(lm(hr ~ posture + who, data = s2)))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.180e+01 0.4899 1.057e+02 4.797e-08
## posturestanding 2.400e+00 0.4000 6.000e+00 3.883e-03
## whoS2 -8.119e-15 0.6325 -1.284e-14 1.000e+00
## whoS3 1.000e+00 0.6325 1.581e+00 1.890e-01
## whoS4 8.000e+00 0.6325 1.265e+01 2.249e-04
## whoS5 -9.936e-15 0.6325 -1.571e-14 1.000e+00
```

Why is the estimate from the second model much better than from the first?

- We've introduced a new explanatory variable,
`who`

that accounts for a lot of the case-by-case variability. - We couldn't do this in the original experiment, because
`who`

would have provided a perfect fit.

- Replicate! You need to have many subjects. Large \( n \) gives higher precision. Remember, the standard error is proportional to \( 1/\sqrt{n-m} \), where \( m \) is the number of model terms. You need to have some residual in order to know what your error is.
- Impose controlled variation! That's why we had half the people stand up and half sit down. To create variation.
- Minimize uncontrolled variation. Make your experimental conditions as similar as possible (outside of the controlled variation).
- Make your controlled variation perpendicular to your uncontrolled variation.
- When you have multiple variables that you impose, make them mutually orthogonal. Indeed, if you expect to be fitting a model with interaction terms, etc., arrange your intervention so that all the terms are mutually orthogonal. That will give you better estimation properties.

A major source of uncontrolled variation in the heart rate is the person-to-person variability. There's nothing we can do about that directly. However, we can try to arrange the experimental treatment to be orthogonal to the uncontrolled variation.

You can measure the angle between the imposed treatment and the uncontrolled variation with R^{2.}

```
r.squared(lm(as.numeric(posture) ~ who, data = s1))
```

```
## [1] 1
```

```
r.squared(lm(as.numeric(posture) ~ who, data = s2))
```

```
## [1] 3.747e-31
```

In the second experiment, we had each person participate twice. Intuitively, you can think of this as each person serving as his or her own control. Mathematically, it's a matter of using the `who`

variable to eat up variance, and ensuring that there is little or no overlap between what's explained by the `who`

variable and what's explained by the `posture`

variable.

This gives us a much better experiment: smaller \( n \) for a given power, better precision in the measurement. This is called a cross-over experiment.

There are many settings when you only have one opportunity for each experimental unit. For example, suppose instead of heart rate we were studying survival. How should we arrange posture?

Now it will be impossible to impose a posture that's orthogonal to `who`

. But we can come close. Assign the posture randomly. `posture = resample(c("sitting","standing"), 1000)`

- Show that this is approximately orthogonal to both
`sex`

and`weight`

.

We don't mean **blocking** as in football or as in blocking a path in a hypothetical causal network. The “blocking” here refers to arranging your experimental units in groups. Within each group, the units are similar to one another.

- Blocking according to
`sex`

. Break into two groups, one for males and one for females. Assign the treatment randomly within each group. - Blocking according to
`weight`

. Break into groups of, say, two. The people with the closest weights. Then randomly assign one person to “sitting” and one person to “standing” within each group. - Blocking according to both
`sex`

and`weight`

: Divide into males and females. Then subdivide each of those groups into small groups according to weight. Randomly assign “sitting” and “standing” within each group.

You should be able to document your randomization procedure. It's not really enough to say, “We did it at random.”

Often, especially in the social sciences, experiments are done to probe causal paths.

The basic idea is that by imposing a treatment, you alter the network. If you assign posture, then all the other factors that might have been connected causally to posture are disconnected — only the downstream elements remain.

When the study is done observationally, the results depend on what covariates you include in the model. So you are looking at the consequences of your assumptions, rather than purely at the system under study.

But when you impose a treatment, you have reconfigured the system in a known way. All of the backdoor links that involve an incoming paths to the treatment have been blocked.

Let's do an experiment where every other candidate gets $10M to spend, while the others get nothing:

```
s = run.sim(campaign.spending, 5000, spending = c(0, 10))
coef(summary(lm(vote ~ spending, data = s)))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.414 0.31824 117.564 0.000e+00
## spending 0.315 0.04501 6.999 2.924e-12
```

```
coef(summary(lm(vote ~ spending + polls, data = s)))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0291 0.216459 4.754 2.046e-06
## spending 0.2240 0.015388 14.554 4.973e-47
## polls 0.7323 0.003767 194.413 0.000e+00
```

The results don't depend on the covariates; within the margin of error, they are the same for the two models.

Notice that I used 5000 units. One reason for this is that I didn't make the variation in spending very large. Here's a bigger variation with only 1/10th as many experimental units:

```
s = run.sim(campaign.spending, 500, spending = c(0, 100))
coef(summary(lm(vote ~ spending, data = s)))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0453 0.99446 36.25 9.203e-142
## spending 0.2495 0.01406 17.74 6.307e-55
```

```
coef(summary(lm(vote ~ spending + polls, data = s)))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2178 0.687078 1.772 7.693e-02
## spending 0.2484 0.004975 49.932 8.418e-196
## polls 0.7301 0.012373 59.011 1.185e-226
```

An experiment with aspirin.

```
obs = run.sim(aspirin, 500)
```

Construct a logistic equation model of stroke:

```
mod = lm(stroke == "Y" ~ mgPerDay, data = obs)
coef(summary(mod))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3060503 0.0353370 8.6609 6.554e-17
## mgPerDay -0.0002174 0.0006233 -0.3488 7.274e-01
```

```
mod2 = lm(stroke == "Y" ~ mgPerDay + sick, data = obs)
coef(summary(mod2))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.196663 0.0245781 8.002 8.728e-15
## mgPerDay -0.003238 0.0004443 -7.287 1.252e-12
## sick 0.098743 0.0041377 23.864 1.925e-84
```

(Have the students do this.)

Create an experimental intervention:

```
exper = run.sim(aspirin, 500, mgPerDay = c(0, 100))
```

Now the analysis:

```
emod = lm(stroke == "Y" ~ mgPerDay, data = exper)
coef(summary(emod))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40400 0.0276437 14.615 1.653e-40
## mgPerDay -0.00236 0.0003909 -6.037 3.076e-09
```

```
emod2 = lm(stroke == "Y" ~ mgPerDay + sick, data = exper)
coef(summary(emod2))
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.229622 0.0220955 10.392 5.040e-23
## mgPerDay -0.002835 0.0002892 -9.803 7.384e-21
## sick 0.085480 0.0041774 20.462 5.890e-68
```

The two models give the same result. The second model, depending on measuring and using the covariate `sick`

, doesn't change the result, but it does eat up deviance (the logistic equivalent of variance)

[Need to write some blocking software that can create an intervention that's orthogonal to a set of covariates]

It would be better to make spending orthogonal to polls. There's no good way to do that in this simulation, except by setting the popularity and polls from one simulation when creating the next simulation.

```
# Get the poll data. This involve
prelim = run.sim(campaign.spending, 100)
```