\( \newcommand{\ensuremath}[1]{#1} \newcommand{\pathway}[1]{\hbox{#1}} \newcommand{\causes}{\ensuremath{\Rightarrow}} \newcommand{\causedBy}{\ensuremath{\Leftarrow}} \newcommand{\correlatedWith}{\ensuremath{\nLeftrightarrow}} \)
Draw the causal diagram
Spending ----------> SAT <---------------|
   |                                     |
Focus on Educ. ---> fraction taking SAT -|
Research in political science shows that higher campaign spending is related to a lower vote for the incumbent. Yet it's common sense that higher spending improves things for the candidate; that's why candidates spend.
Polls <----- Popularity ---> vote outcome
  |                              ^
  v                              |
Spending -------------------------
We've done it by including the covariate in the model. But this is too crude an answer.
fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE
campaign.spending
## Causal Network with 4 vars: popularity, polls, spending, vote
## ===============================================
## popularity is exogenous
## polls <== popularity
## spending <== polls
## vote <== popularity & spending
equations(campaign.spending)
## popularity <== runif(nsamps, min=15,max=85)
## polls <== popularity + rnorm(nsamps,sd=3)
## spending <== 100 - polls + rnorm(nsamps,sd=10)
## vote <== 0.75*popularity + 0.25*spending + rnorm(nsamps,sd=5)
You can see from the equations how spending is related to vote: spending increases vote, with a coefficient of 0.25. Let's look at what a statistical model has to say:
d = run.sim(campaign.spending, 435) # number of congressmen
summary(lm(vote ~ spending, data = d))
##
## Call:
## lm(formula = vote ~ spending, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.240 -5.376 0.263 6.100 24.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.3970 0.9792 71.9 <2e-16 ***
## spending -0.4071 0.0183 -22.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.31 on 433 degrees of freedom
## Multiple R-squared: 0.533, Adjusted R-squared: 0.532
## F-statistic: 494 on 1 and 433 DF, p-value: <2e-16
The problem: a backdoor pathway from spending to vote via polls and popularity (spending <== polls <== popularity ==> vote). Block it by including a node on the pathway, here polls, as a covariate.
summary(lm(vote ~ spending + polls, data = d))
##
## Call:
## lm(formula = vote ~ spending + polls, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.85 -3.93 0.02 3.48 19.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3751 2.9254 0.81 0.42
## spending 0.2232 0.0291 7.68 1.1e-13 ***
## polls 0.7409 0.0311 23.84 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.47 on 432 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.797
## F-statistic: 854 on 2 and 432 DF, p-value: <2e-16
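The same comparison can be sketched in base R without the course software. The block below is only a minimal sketch: it hard-codes the four equations printed by equations(campaign.spending), and the helper name simulate_campaign, the data frame name d2, and the seed are assumptions introduced for illustration, not part of simulate.r.

```r
# Minimal base-R sketch of the campaign.spending network.
# simulate_campaign() is a hypothetical helper, not part of simulate.r.
simulate_campaign <- function(nsamps) {
  popularity <- runif(nsamps, min = 15, max = 85)
  polls      <- popularity + rnorm(nsamps, sd = 3)
  spending   <- 100 - polls + rnorm(nsamps, sd = 10)
  vote       <- 0.75 * popularity + 0.25 * spending + rnorm(nsamps, sd = 5)
  data.frame(popularity, polls, spending, vote)
}

set.seed(1)  # arbitrary seed so the run is repeatable
d2 <- simulate_campaign(435)

# Backdoor pathway open: the spending coefficient comes out negative.
naive <- coef(lm(vote ~ spending, data = d2))["spending"]

# Backdoor pathway blocked by polls: the coefficient should land near 0.25.
adjusted <- coef(lm(vote ~ spending + polls, data = d2))["spending"]

print(c(naive = unname(naive), adjusted = unname(adjusted)))
```

Because popularity also lies on the backdoor pathway, replacing polls with popularity in the second model should block the same pathway in this sketch.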
Re-run the fetchData() command below even if you have already done so; some corrections were made to the software.
fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE
If we want to study the direct link between class size and test scores, we need to block the backdoor pathways.
Following the convention used throughout the book
[Diagrams: cnet.mediator, cnet.common.cause, cnet.witness]
We won't deal with these. They raise questions about “when”.
The notation \( A \correlatedWith B \) means that there is a non-causal connection between \( A \) and \( B \). This must be because there is some unobserved variable \( U \) producing the correlation, such as \( A \causedBy U \causes B \).
Note that if either of the arrows went the other way, we'd simply have \( U \) being a causal mediator.
We'll see later that (1) is not a possibility if \( U \) is unobserved and unconnected to anything else.
Remember, we want to study the causal relationship between \( A \) and \( B \) and are trying to decide whether to include the covariate \( C \).
Simulate each network and fit a model of \( A \) against \( B \). When the coefficient on \( B \) is statistically indistinguishable from zero, the pathway between \( B \) and \( A \) has been blocked.
equations(cnet.mediator)
## B <== rnorm(nsamps)
## C <== B + rnorm(nsamps)
## A <== rnorm(nsamps) - C
s = run.sim(cnet.mediator, 1000)
coef(summary(lm(A ~ B, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0598 0.04354 -1.374 1.699e-01
## B -1.0246 0.04270 -23.995 8.242e-101
The above model indicates that the pathway is open.
Including the covariate \( C \) blocks the pathway between \( B \) and \( A \).
coef(summary(lm(A ~ B + C, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05986 0.03275 -1.8275 6.792e-02
## B -0.03939 0.04794 -0.8217 4.114e-01
## C -0.94291 0.03406 -27.6865 1.285e-125
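The two fits can be checked against the mediator equations by substitution. As a sketch, write \( \epsilon_C \) and \( \epsilon_A \) for the two independent rnorm terms (these symbols are introduced here for illustration and do not appear in the software):

\[ C = B + \epsilon_C, \qquad A = \epsilon_A - C \quad\Longrightarrow\quad A = -B + (\epsilon_A - \epsilon_C). \]

Without \( C \) in the model, the expected coefficient on \( B \) is therefore \( -1 \). With \( C \) held fixed, \( A = -C + \epsilon_A \) has no remaining dependence on \( B \), so the expected coefficients are \( 0 \) on \( B \) and \( -1 \) on \( C \), in line with the estimates above.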
equations(cnet.common.cause)
## C <== rnorm(nsamps)
## B <== rnorm(nsamps)+C
## A <== rnorm(nsamps)-C
s = run.sim(cnet.common.cause, 1000)
coef(summary(lm(A ~ B, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02493 0.03843 0.6486 5.168e-01
## B -0.45594 0.02741 -16.6328 5.110e-55
The above model indicates that the pathway is open.
Including the covariate \( C \) blocks the pathway between \( B \) and \( A \).
coef(summary(lm(A ~ B + C, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03604 0.03192 1.1291 2.591e-01
## B 0.01704 0.03187 0.5347 5.930e-01
## C -0.96997 0.04574 -21.2048 1.161e-82
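The size of the coefficient in the model without \( C \) can be worked out from the common-cause equations. As a sketch, write \( \epsilon_A \) and \( \epsilon_B \) for the independent rnorm terms (symbols introduced here for illustration), so that \( B = \epsilon_B + C \) and \( A = \epsilon_A - C \), each term having variance 1:

\[ \frac{\mathrm{Cov}(A, B)}{\mathrm{Var}(B)} = \frac{\mathrm{Cov}(\epsilon_A - C,\ \epsilon_B + C)}{\mathrm{Var}(\epsilon_B + C)} = \frac{-\mathrm{Var}(C)}{\mathrm{Var}(\epsilon_B) + \mathrm{Var}(C)} = \frac{-1}{1 + 1} = -\tfrac{1}{2}. \]

The expected coefficient on \( B \) is \( -1/2 \) even though \( B \) has no causal effect on \( A \). Once \( C \) is included, \( A = \epsilon_A - C \) depends only on \( C \), so the expected coefficients are \( 0 \) on \( B \) and \( -1 \) on \( C \).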
equations(cnet.witness)
## A <== rnorm(nsamps)
## B <== rnorm(nsamps)
## C <== rnorm(nsamps) - A + B
s = run.sim(cnet.witness, 1000)
coef(summary(lm(A ~ B, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03848 0.03166 -1.2155 0.2245
## B -0.01114 0.03189 -0.3493 0.7269
The above model indicates that the pathway is blocked.
Including the covariate \( C \) opens the pathway between \( B \) and \( A \).
coef(summary(lm(A ~ B + C, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01382 0.02229 -0.62 5.354e-01
## B 0.46990 0.02703 17.38 2.488e-59
## C -0.49322 0.01545 -31.92 1.249e-154
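To see where coefficients of roughly 0.5 on \( B \) and -0.5 on \( C \) come from, the following base-R sketch solves the least-squares equations using the covariances implied by the cnet.witness equations, then checks the result by simulation. The variable names, seed, and sample size are assumptions made for illustration; the block does not use simulate.r.

```r
# Population covariances implied by the cnet.witness equations:
#   A, B and the noise term are independent with variance 1,
#   and C = noise - A + B.
Sigma_BC <- matrix(c(1, 1,    # Var(B),    Cov(B, C)
                     1, 3),   # Cov(C, B), Var(C)
                   nrow = 2, byrow = TRUE)
cov_A_BC <- c(0, -1)          # Cov(A, B), Cov(A, C)

# Least-squares coefficients of A on (B, C) solve Sigma %*% beta = cov.
beta <- solve(Sigma_BC, cov_A_BC)
names(beta) <- c("B", "C")
print(beta)

# Check by simulation (seed and sample size are arbitrary choices).
set.seed(2)
n <- 1000
A <- rnorm(n)
B <- rnorm(n)
C <- rnorm(n) - A + B
fit_without <- coef(lm(A ~ B))["B"]
fit_with    <- coef(lm(A ~ B + C))[c("B", "C")]
print(fit_without)
print(fit_with)
```

The nonzero coefficient on \( B \) in the second model does not reflect any causal link between \( A \) and \( B \): it appears only because \( C \), which both of them cause, is held fixed.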
Imagine that we wanted to study the relationship between athletic ability and SAT scores. The conventional wisdom is that athletes tend to have lower test scores. The jock network implements one hypothesis.
s = run.sim(jock, 1000)
head(s)
## IQ Athletic College Varsity
## 1 110.17 15.20 No No
## 2 83.13 20.28 No No
## 3 114.26 16.50 No No
## 4 105.02 20.20 No No
## 5 92.20 21.19 No No
## 6 98.32 19.82 No No
In practice, SAT and athletic ability data on an individual might be available via a college, so let's pull out just the cases that are in college:
college <- subset(s, College == "Yes")
Look at the relationship between IQ and Athletic:
coef(summary(lm(IQ ~ Athletic, data = college)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 154.197 4.8881 31.545 4.643e-96
## Athletic -1.786 0.2214 -8.068 1.872e-14
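There is a relationship: among those in college, each additional unit of athletic ability goes with about 1.8 fewer IQ points.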
Now look at the entire set of data:
coef(summary(lm(IQ ~ Athletic, data = s)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.4079 3.2112 29.71 1.744e-139
## Athletic 0.2234 0.1585 1.41 1.590e-01
The relationship has disappeared. Why?
ACTIVITY: Look at the structure and equations of the jock network and explain.
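One candidate explanation to test while doing the activity is selection on a common effect: if getting into college depends on both IQ and athletic ability, then looking only at college students conditions on a witness. The sketch below is a hypothetical stand-in, not the jock network's equations; the admission rule, the means and standard deviations, and every variable name are assumptions to be compared against equations(jock).

```r
# Hypothetical stand-in for a selection effect; NOT the jock network's equations.
set.seed(3)  # arbitrary seed so the run is repeatable
n <- 5000
IQ       <- rnorm(n, mean = 100, sd = 15)
Athletic <- rnorm(n, mean = 20, sd = 2)   # generated independently of IQ

# Assumed admission rule: either trait, if strong enough, raises the score.
score   <- (IQ - 100) / 15 + (Athletic - 20) / 2 + rnorm(n, sd = 0.5)
College <- ifelse(score > 1, "Yes", "No")
pop     <- data.frame(IQ, Athletic, College)

# Whole population: no systematic relationship between IQ and Athletic.
slope_all <- coef(lm(IQ ~ Athletic, data = pop))["Athletic"]

# College students only: selecting on College induces a negative slope.
slope_college <- coef(lm(IQ ~ Athletic,
                         data = subset(pop, College == "Yes")))["Athletic"]

print(c(all = unname(slope_all), college = unname(slope_college)))
```

In this sketch the negative slope among college students comes entirely from the admission rule, since IQ and Athletic are generated independently.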
Here is the complete set of simple causal pathways among three variables.
The possibilities for who can be in the middle: \( A \) or \( B \) or \( C \)