In the News
Variable ---> Outcome
   ^             ^
   |             |
   ---> Confounder <--
Draw the causal diagram
Spending ----------> SAT <-------|
| |
Focus on Educ. ---> fraction taking SAT -|
Research in political science shows that higher campaign spending is associated with a lower vote for the incumbent. Yet it's common sense that higher spending improves things for the candidate; that's why candidates spend.
Polls <----- Popularity ---> vote outcome
| ^
v |
Spending ---------------------------
So far, we've handled confounding by including the covariate in the model. But this is too crude an answer.
fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE
campaign.spending
## Causal Network with 4 vars: popularity, polls, spending, vote
## ===============================================
## popularity is exogenous
## polls <== popularity
## spending <== polls
## vote <== popularity & spending
equations(campaign.spending)
## popularity <== runif(nsamps, min=15,max=85)
## polls <== popularity + rnorm(nsamps,sd=3)
## spending <== 100 - polls + rnorm(nsamps,sd=10)
## vote <== 0.75*popularity + 0.25*spending + rnorm(nsamps,sd=5)
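The equations printed above can also be run directly in base R, which may help if `simulate.r` is unavailable. This is only a sketch: `simulate_campaign()` and its `seed` argument are hypothetical names introduced here, not part of `simulate.r`, and the draws will not match those produced by `run.sim()`.

```r
# Base-R re-creation of the four structural equations printed above.
# simulate_campaign() is a hypothetical helper, not part of simulate.r.
simulate_campaign <- function(nsamps, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # optional, for reproducibility only
  popularity <- runif(nsamps, min = 15, max = 85)
  polls      <- popularity + rnorm(nsamps, sd = 3)
  spending   <- 100 - polls + rnorm(nsamps, sd = 10)
  vote       <- 0.75 * popularity + 0.25 * spending + rnorm(nsamps, sd = 5)
  data.frame(popularity, polls, spending, vote)
}

sim <- simulate_campaign(435, seed = 1)   # seed value is arbitrary
head(sim)
```

Each variable is generated only from the variables to its right in the network listing, so the order of the assignments follows the direction of the arrows.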
You can see from the equations how spending is related to vote: holding popularity fixed, more spending increases the vote (the coefficient is 0.25). Let's look at what a statistical model has to say.
d = run.sim(campaign.spending, 435) # number of congressmen
summary(lm(vote ~ spending, data = d))
##
## Call:
## lm(formula = vote ~ spending, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.946 -5.937 -0.267 6.062 24.206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.1522 0.9386 68.3 <2e-16 ***
## spending -0.2803 0.0176 -15.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.69 on 433 degrees of freedom
## Multiple R-squared: 0.369, Adjusted R-squared: 0.368
## F-statistic: 254 on 1 and 433 DF, p-value: <2e-16
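The problem: a back-door pathway runs from spending to vote through polls and popularity. Popularity drives the vote directly and, by way of the polls, also drives spending. Block the pathway by including a node on it as a covariate.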
summary(lm(vote ~ spending + polls, data = d))
##
## Call:
## lm(formula = vote ~ spending + polls, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.98 -3.79 0.49 4.13 18.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5693 2.8298 -0.91 0.36
## spending 0.2939 0.0264 11.13 <2e-16 ***
## polls 0.7598 0.0315 24.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.67 on 432 degrees of freedom
## Multiple R-squared: 0.732, Adjusted R-squared: 0.73
## F-statistic: 589 on 2 and 432 DF, p-value: <2e-16
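To see how the choice of covariate changes the spending coefficient, the comparison can be repeated on a larger simulated sample, where sampling noise is smaller. The sketch below is an assumption-laden stand-in: it regenerates data in base R from the equations shown by `equations(campaign.spending)` rather than calling `run.sim()`, and the sample size, seed, and object names (`big`, `fits`, `spending_coef`) are chosen only for illustration.

```r
# Hypothetical base-R stand-in for run.sim(campaign.spending, n);
# the equations are copied from the equations(campaign.spending) output.
set.seed(2024)          # arbitrary seed, only for reproducibility
n <- 10000              # much larger than 435, to shrink sampling noise
popularity <- runif(n, min = 15, max = 85)
polls      <- popularity + rnorm(n, sd = 3)
spending   <- 100 - polls + rnorm(n, sd = 10)
vote       <- 0.75 * popularity + 0.25 * spending + rnorm(n, sd = 5)
big <- data.frame(popularity, polls, spending, vote)

# Fit the same response with three different covariate choices
fits <- list(
  none       = lm(vote ~ spending, data = big),
  polls      = lm(vote ~ spending + polls, data = big),
  popularity = lm(vote ~ spending + popularity, data = big)
)

# Pull out the spending coefficient from each fit
spending_coef <- sapply(fits, function(m) coef(m)[["spending"]])
round(spending_coef, 3)
```

With no covariate the back-door pathway is open and the spending coefficient should come out negative. Either `polls` or `popularity` lies on that pathway, so including either one should bring the coefficient close to the 0.25 used in the simulation.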
Work on the Logistic Regression model of Bonds's hitting.