\( \newcommand{\ensuremath}[1]{#1} \newcommand{\pathway}[1]{\hbox{#1}} \newcommand{\causes}{\ensuremath{\Rightarrow}} \newcommand{\causedBy}{\ensuremath{\Leftarrow}} \newcommand{\correlatedWith}{\ensuremath{\nLeftrightarrow}} \)
Load the simulation software with the command below, even if you have already done so: some corrections have been made to the software.
fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE
If we want to study the direct link between class size and test scores, we need to block the backdoor pathways.
Following the convention used throughout the book
[Figure: network diagrams for cnet.mediator, cnet.common.cause, and cnet.witness]
We won't deal with these. They raise questions about “when”.
The notation \( A \correlatedWith B \) means that there is a non-causal connection between \( A \) and \( B \). This must be because there is some unobserved variable \( U \) producing the correlation, such as \( A \causedBy U \causes B \).
Note that if either of the arrows went the other way, we'd simply have \( U \) being a causal mediator.
We'll see later that (1) is not a possibility if \( U \) is unobserved and unconnected to anything else.
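As a rough illustration of how an unobserved variable can produce a non-causal connection, the base R sketch below generates \( A \) and \( B \) from a shared \( U \), with no arrow between \( A \) and \( B \). The equations, unit coefficients, sample size, and seed are assumptions chosen for illustration; they are not taken from simulate.r.

```r
# Illustrative sketch (assumed equations, not part of simulate.r):
# U drives both A and B; there is no direct arrow between A and B.
set.seed(101)                # arbitrary seed, for reproducibility only
nsamps <- 1000
U <- rnorm(nsamps)           # the unobserved variable
A <- U + rnorm(nsamps)       # A depends on U only
B <- U + rnorm(nsamps)       # B depends on U only

# A and B are correlated even though neither causes the other.
ab_cor <- cor(A, B)
print(ab_cor)

# With U unobserved, the only model available is A ~ B,
# and its coefficient on B should come out clearly non-zero.
b_coef_u <- coef(lm(A ~ B))["B"]
print(b_coef_u)
```

Under these assumed equations the correlation and the slope are both expected to be near 0.5, which a reader could easily mistake for a causal effect of \( B \) on \( A \).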
Remember, we want to study the causal relationship between \( A \) and \( B \) and are trying to decide whether to include the covariate \( C \).
Run a simulation and fit a model of \( A \) against \( B \). When the coefficient on \( B \) in the model is indistinguishable from zero, the pathway between \( B \) and \( A \) has been blocked.
equations(cnet.mediator)
## B <== rnorm(nsamps)
## C <== B + rnorm(nsamps)
## A <== rnorm(nsamps) - C
s = run.sim(cnet.mediator, 1000)
coef(summary(lm(A ~ B, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006554 0.04483 0.1462 8.838e-01
## B -0.977772 0.04625 -21.1432 2.771e-82
The coefficient on \( B \) is about -0.98 and clearly non-zero, so the above model indicates that the pathway is open.
Including the covariate \( C \) blocks the pathway between \( B \) and \( A \).
coef(summary(lm(A ~ B + C, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006143 0.03196 0.1922 8.476e-01
## B 0.009186 0.04577 0.2007 8.410e-01
## C -0.994755 0.03200 -31.0884 6.776e-149
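For readers without access to simulate.r, the mediator result can be approximated in base R by writing out the three printed equations by hand. This is only a sketch: it assumes `run.sim` simply evaluates the equations in the order shown, and the seed and the names `med`, `open_b`, and `blocked_b` are made up for the example.

```r
# Base R stand-in for run.sim(cnet.mediator, 1000), using the equations
# printed by equations(cnet.mediator). Names and seed are illustrative.
set.seed(202)
nsamps <- 1000
B <- rnorm(nsamps)           # B <== rnorm(nsamps)
C <- B + rnorm(nsamps)       # C <== B + rnorm(nsamps)
A <- rnorm(nsamps) - C       # A <== rnorm(nsamps) - C
med <- data.frame(A, B, C)

# Without C: B reaches A through the mediator, so the pathway is open.
open_b <- coef(lm(A ~ B, data = med))["B"]

# With C: holding the mediator fixed blocks the pathway.
blocked_b <- coef(lm(A ~ B + C, data = med))["B"]

print(c(without_C = unname(open_b), with_C = unname(blocked_b)))
```

The coefficient on `B` should sit near -1 without `C` and near 0 with `C`; the exact values vary with the seed.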
equations(cnet.common.cause)
## C <== rnorm(nsamps)
## B <== rnorm(nsamps)+C
## A <== rnorm(nsamps)-C
s = run.sim(cnet.common.cause, 1000)
coef(summary(lm(A ~ B, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03654 0.03898 0.9373 3.488e-01
## B -0.48613 0.02669 -18.2131 3.258e-64
The coefficient on \( B \) is about -0.49 and clearly non-zero, so the above model indicates that the pathway is open.
Including the covariate \( C \) blocks the pathway between \( B \) and \( A \).
coef(summary(lm(A ~ B + C, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04690 0.03161 1.483 1.383e-01
## B 0.03449 0.03144 1.097 2.730e-01
## C -1.01498 0.04447 -22.826 4.148e-93
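The common-cause case can be sketched the same way in base R, again assuming that the printed equations are all that `run.sim` evaluates. The seed and the names `cc`, `cc_open_b`, and `cc_blocked_b` are hypothetical.

```r
# Base R stand-in for run.sim(cnet.common.cause, 1000), using the
# equations printed by equations(cnet.common.cause).
set.seed(303)
nsamps <- 1000
C <- rnorm(nsamps)           # C <== rnorm(nsamps)
B <- rnorm(nsamps) + C       # B <== rnorm(nsamps) + C
A <- rnorm(nsamps) - C       # A <== rnorm(nsamps) - C
cc <- data.frame(A, B, C)

# Without C: A and B share the cause C, so they are associated.
cc_open_b <- coef(lm(A ~ B, data = cc))["B"]

# With C: adjusting for the common cause blocks the backdoor pathway.
cc_blocked_b <- coef(lm(A ~ B + C, data = cc))["B"]

print(c(without_C = unname(cc_open_b), with_C = unname(cc_blocked_b)))
```

Here the unadjusted slope should be near -0.5 rather than -1, because half of the variance of `B` is noise unrelated to `C`; with `C` in the model it should fall to roughly zero.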
equations(cnet.witness)
## A <== rnorm(nsamps)
## B <== rnorm(nsamps)
## C <== rnorm(nsamps) - A + B
s = run.sim(cnet.witness, 1000)
coef(summary(lm(A ~ B, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01741 0.03224 -0.5399 0.5894
## B -0.01511 0.03251 -0.4646 0.6423
The coefficient on \( B \) is about -0.015 and indistinguishable from zero, so the above model indicates that the pathway is blocked.
Including the covariate \( C \) opens the pathway between \( B \) and \( A \).
coef(summary(lm(A ~ B + C, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03783 0.02216 -1.707 8.805e-02
## B 0.52827 0.02762 19.125 1.066e-69
## C -0.51283 0.01534 -33.434 5.519e-165
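The witness case runs the other way, and a hand-written base R version makes the reversal easy to check. As before, this is a sketch that assumes the printed equations fully describe the network; the seed and the names `wit`, `wit_blocked_b`, and `wit_open_b` are invented for the example.

```r
# Base R stand-in for run.sim(cnet.witness, 1000), using the equations
# printed by equations(cnet.witness).
set.seed(404)
nsamps <- 1000
A <- rnorm(nsamps)           # A <== rnorm(nsamps)
B <- rnorm(nsamps)           # B <== rnorm(nsamps)
C <- rnorm(nsamps) - A + B   # C <== rnorm(nsamps) - A + B
wit <- data.frame(A, B, C)

# Without C: A and B are generated independently, so the pathway is blocked.
wit_blocked_b <- coef(lm(A ~ B, data = wit))["B"]

# With C: conditioning on the witness opens a pathway between A and B.
wit_open_b <- coef(lm(A ~ B + C, data = wit))["B"]

print(c(without_C = unname(wit_blocked_b), with_C = unname(wit_open_b)))
```

The coefficient on `B` should be near 0 without `C` and near 0.5 with `C`, the opposite pattern from the mediator and common-cause networks.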
Imagine that we wanted to study the relationship between athletic ability and SAT scores. The conventional wisdom is that athletes tend to have lower test scores. The jock network implements one hypothesis.
s = run.sim(jock, 1000)
head(s)
## IQ Athletic College Varsity
## 1 95.77 18.99 No No
## 2 92.10 24.54 No No
## 3 70.75 23.94 No No
## 4 111.32 23.82 Yes Yes
## 5 115.24 19.63 Yes No
## 6 108.79 20.18 Yes No
In practice, SAT and athletic ability data on an individual might be available via a college, so let's pull out just the cases that are in college:
college <- subset(s, College == "Yes")
Look at the relationship between IQ and Athletic:
coef(summary(lm(IQ ~ Athletic, data = college)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 152.194 4.3625 34.887 3.930e-108
## Athletic -1.707 0.2009 -8.496 8.981e-16
There is a relationship! Among the college cases, higher athletic ability goes with lower IQ.
Now look at the entire set of data:
coef(summary(lm(IQ ~ Athletic, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.73122 3.3287 29.6609 3.845e-139
## Athletic 0.08128 0.1646 0.4939 6.215e-01
The relationship has disappeared. Why?
ACTIVITY: Look at the structure and equations of the jock network and explain.
Here is the complete set of simple causal pathways among three variables.
Any of the three variables can be in the middle: \( A \), \( B \), or \( C \).