Monday April 29, 2013 Stats 155 Class Notes

\( \newcommand{\ensuremath}[1]{#1} \newcommand{\pathway}[1]{\hbox{#1}} \newcommand{\causes}{\ensuremath{\Rightarrow}} \newcommand{\causedBy}{\ensuremath{\Leftarrow}} \newcommand{\correlatedWith}{\ensuremath{\nLeftrightarrow}} \)

Confounders as a diagram


SAT versus per-student spending

Draw the causal diagram


Spending  ---------->      SAT    <-------|
      |                                   |
Focus on Educ. ---> fraction taking SAT  -|

Campaign Spending and the back-door network

Research in political science shows that higher spending in campaigns is related to a lower vote for the incumbent. Yet it's common sense that higher spending improves things for the candidate; that's why they do it.

  Polls <-----    Popularity ---> vote outcome
    |                                 ^
    v                                 |
  Spending ---------------------------

How to block a back-door?

We've done it by including the covariate in the model. But this is too crude an answer.

Example: Election Spending

fetchData("simulate.r")

## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r

## [1] TRUE

campaign.spending

## Causal Network with  4  vars:  popularity, polls, spending, vote 
## ===============================================
## popularity is exogenous
## polls <== popularity 
## spending <== polls 
## vote <== popularity & spending

equations(campaign.spending)

## popularity <== runif(nsamps, min=15,max=85) 
## polls <== popularity + rnorm(nsamps,sd=3) 
## spending <== 100 - polls + rnorm(nsamps,sd=10) 
## vote <== 0.75*popularity + 0.25*spending + rnorm(nsamps,sd=5)

You can see from the equations how spending is related to vote: each additional unit of spending raises vote by 0.25. Let's look at what a statistical model has to say:

d = run.sim(campaign.spending, 435)  # number of congressmen
summary(lm(vote ~ spending, data = d))

## 
## Call:
## lm(formula = vote ~ spending, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.240  -5.376   0.263   6.100  24.125 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  70.3970     0.9792    71.9   <2e-16 ***
## spending     -0.4071     0.0183   -22.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 8.31 on 433 degrees of freedom
## Multiple R-squared: 0.533,   Adjusted R-squared: 0.532 
## F-statistic:  494 on 1 and 433 DF,  p-value: <2e-16

The problem: a back-door pathway from spending to vote via popularity (spending <== polls <== popularity ==> vote). Block it by including a node on the pathway, here polls, as a covariate.

summary(lm(vote ~ spending + polls, data = d))

## 
## Call:
## lm(formula = vote ~ spending + polls, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14.85  -3.93   0.02   3.48  19.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3751     2.9254    0.81     0.42    
## spending      0.2232     0.0291    7.68  1.1e-13 ***
## polls         0.7409     0.0311   23.84  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 5.47 on 432 degrees of freedom
## Multiple R-squared: 0.798,   Adjusted R-squared: 0.797 
## F-statistic:  854 on 2 and 432 DF,  p-value: <2e-16
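The same comparison can be sketched in base R without simulate.r, using only the equations printed by equations(campaign.spending). The seed and the names sim, unadjusted, and adjusted are choices made for this sketch, not part of the course software, and the numbers will differ somewhat from the run above.

```r
# Base-R stand-in for run.sim(campaign.spending, 435); it copies the
# equations printed by equations(campaign.spending).
set.seed(155)                 # arbitrary seed, chosen here for reproducibility
nsamps <- 435

popularity <- runif(nsamps, min = 15, max = 85)
polls      <- popularity + rnorm(nsamps, sd = 3)
spending   <- 100 - polls + rnorm(nsamps, sd = 10)
vote       <- 0.75 * popularity + 0.25 * spending + rnorm(nsamps, sd = 5)
sim <- data.frame(popularity, polls, spending, vote)

# Back door open: the spending coefficient also picks up the effect of popularity.
unadjusted <- coef(lm(vote ~ spending, data = sim))["spending"]

# Back door blocked at polls: the coefficient should land near the true 0.25.
adjusted <- coef(lm(vote ~ spending + polls, data = sim))["spending"]

print(c(unadjusted = unname(unadjusted), adjusted = unname(adjusted)))
```

The sign of the spending coefficient flips once polls is in the model, which is the pattern in the two summaries above.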

Software for Simulation

Run the following command even if you have already done so; some corrections were made to the software.

fetchData("simulate.r")

## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r

## [1] TRUE

An example hypothetical causal network

If we want to study the direct link between class size and test scores, we need to block the backdoor pathways.

[Figure: hypothetical causal network for test scores]

Chains in Causation

Following the convention used throughout the book:

\( A \) is a response variable
\( B \) is an explanatory variable
\( C \) is a covariate. We'll also use \( D \), \( E \), etc. as covariates and \( U \) or \( ? \) as unobserved variables.

Three linear pathways:

Causal mediator: \( A \causedBy C \causedBy B \). Simulation: cnet.mediator
Common cause: \( A \causedBy C \causes B \). Simulation: cnet.common.cause
Witness: \( C \) is caused by both \( A \) and \( B \): \( A \causes C \causedBy B \). Simulation: cnet.witness

A circular pathway

A recurrent network (or closed loop of causation): \( A \causes B \causes C \causes A \) or the variations.

We won't deal with these. They raise questions about “when”.

Correlation

The notation \( A \correlatedWith B \) means that there is a non-causal connection between \( A \) and \( B \). This must be because some unobserved variable \( U \) produces the correlation, such as

(1) \( A \causedBy U \causes B \)
(2) \( A \causes U \causedBy B \)

Note that if either of the arrows went the other way, we'd simply have \( U \) being a causal mediator.

We'll see later that (2) is not a possibility if \( U \) is unobserved and unconnected to anything else: a witness pathway is blocked unless the witness is included.
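A minimal base-R sketch of the two configurations, with \( U \) generated but never given to the analysis. The unit coefficients, the seed, and the variable names are assumptions made for illustration.

```r
set.seed(155)   # arbitrary seed
n <- 10000

# Unobserved common cause: A <== U ==> B
U <- rnorm(n)
A <- U + rnorm(n)
B <- U + rnorm(n)
cor_common_cause <- cor(A, B)

# Unobserved witness: A ==> U <== B
A2 <- rnorm(n)
B2 <- rnorm(n)
U2 <- A2 + B2 + rnorm(n)   # generated, but never used in the correlation
cor_witness <- cor(A2, B2)

print(c(common_cause = cor_common_cause, witness = cor_witness))
```

Under these assumptions the common cause leaves a clear correlation between A and B, while the witness leaves essentially none as long as U plays no role in the analysis or in how cases are selected.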

The Rules for Observing or Blocking a Pathway

Remember, we want to study the causal relationship between \( A \) and \( B \) and are trying to decide whether to include the covariate \( C \).

Causal mediator: Including \( C \) blocks the pathway, which is otherwise open.
Common cause: Including \( C \) blocks the pathway, which is otherwise open.
Witness: Including \( C \) opens the pathway, which is otherwise blocked.

Confirming these rules

Simulate data from each network and fit a model of \( A \) against \( B \). When the coefficient on \( B \) is indistinguishable from zero (within its standard error), the pathway between \( B \) and \( A \) has been blocked.

Mediator

equations(cnet.mediator)

## B <== rnorm(nsamps) 
## C <== B + rnorm(nsamps) 
## A <== rnorm(nsamps) - C

s = run.sim(cnet.mediator, 1000)
coef(summary(lm(A ~ B, data = s)))

##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)  -0.0598    0.04354  -1.374  1.699e-01
## B            -1.0246    0.04270 -23.995 8.242e-101

The above model indicates that the pathway is open.

Including the covariate \( C \) blocks the pathway between \( B \) and \( A \).

coef(summary(lm(A ~ B + C, data = s)))

##             Estimate Std. Error  t value   Pr(>|t|)
## (Intercept) -0.05986    0.03275  -1.8275  6.792e-02
## B           -0.03939    0.04794  -0.8217  4.114e-01
## C           -0.94291    0.03406 -27.6865 1.285e-125

Common Cause

equations(cnet.common.cause)

## C <== rnorm(nsamps) 
## B <== rnorm(nsamps)+C 
## A <== rnorm(nsamps)-C

s = run.sim(cnet.common.cause, 1000)
coef(summary(lm(A ~ B, data = s)))

##             Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)  0.02493    0.03843   0.6486 5.168e-01
## B           -0.45594    0.02741 -16.6328 5.110e-55

The above model indicates that the pathway is open.

Including the covariate \( C \) blocks the pathway between \( B \) and \( A \).

coef(summary(lm(A ~ B + C, data = s)))

##             Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)  0.03604    0.03192   1.1291 2.591e-01
## B            0.01704    0.03187   0.5347 5.930e-01
## C           -0.96997    0.04574 -21.2048 1.161e-82

Witness

equations(cnet.witness)

## A <== rnorm(nsamps) 
## B <== rnorm(nsamps) 
## C <== rnorm(nsamps) - A + B

s = run.sim(cnet.witness, 1000)
coef(summary(lm(A ~ B, data = s)))

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03848    0.03166 -1.2155   0.2245
## B           -0.01114    0.03189 -0.3493   0.7269

The above model indicates that the pathway is blocked.

Including the covariate \( C \) opens the pathway between \( B \) and \( A \).

coef(summary(lm(A ~ B + C, data = s)))

##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept) -0.01382    0.02229   -0.62  5.354e-01
## B            0.46990    0.02703   17.38  2.488e-59
## C           -0.49322    0.01545  -31.92 1.249e-154
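The witness estimates can be checked against the population values implied by equations(cnet.witness). This is a short derivation, assuming the default unit standard deviation of rnorm, so that \( A \) and \( B \) are independent with variance 1 and \( C \) is independent noise minus \( A \) plus \( B \). The symbols \( \beta_B \) and \( \beta_C \) for the coefficients in lm(A ~ B + C) are introduced here.

```latex
\[
\operatorname{Var}(B) = 1, \quad \operatorname{Var}(C) = 3, \quad
\operatorname{Cov}(B, C) = 1, \quad \operatorname{Cov}(A, B) = 0, \quad
\operatorname{Cov}(A, C) = -1
\]
\[
\begin{pmatrix} 1 & 1 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} \beta_B \\ \beta_C \end{pmatrix}
=
\begin{pmatrix} 0 \\ -1 \end{pmatrix}
\quad\Longrightarrow\quad
\beta_B = \tfrac{1}{2}, \qquad \beta_C = -\tfrac{1}{2}
\]
```

These values agree with the estimates 0.470 and -0.493 to within sampling error. Without \( C \) in the model, the coefficient on \( B \) is \( \operatorname{Cov}(A, B) / \operatorname{Var}(B) = 0 \), which is the blocked pathway seen in the first witness model.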

The Jock Network

Imagine that we wanted to study the relationship between athletic ability and SAT scores. The conventional wisdom is that athletes tend to have lower test scores. The jock network implements one hypothesis, with IQ standing in for test scores.

s = run.sim(jock, 1000)
head(s)

##       IQ Athletic College Varsity
## 1 110.17    15.20      No      No
## 2  83.13    20.28      No      No
## 3 114.26    16.50      No      No
## 4 105.02    20.20      No      No
## 5  92.20    21.19      No      No
## 6  98.32    19.82      No      No

In practice, SAT and athletic ability data on an individual might be available via a college, so let's pull out just the cases that are in college:

college <- subset(s, College == "Yes")

Look at the relationship between IQ and Athletic:

coef(summary(lm(IQ ~ Athletic, data = college)))

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  154.197     4.8881  31.545 4.643e-96
## Athletic      -1.786     0.2214  -8.068 1.872e-14

There is a relationship!

Now look at the entire set of data:

coef(summary(lm(IQ ~ Athletic, data = s)))

##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)  95.4079     3.2112   29.71 1.744e-139
## Athletic      0.2234     0.1585    1.41  1.590e-01

The relationship has disappeared. Why?

ACTIVITY: Look at the structure and equations of the jock network and explain.
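To experiment without simulate.r, here is a small base-R stand-in. Its equations are assumptions made for this sketch and are not the jock network's actual equations: IQ and Athletic are generated independently, and a hypothetical admission rule lets either one raise the chance of being in college.

```r
set.seed(155)   # arbitrary seed
n <- 5000

IQ       <- rnorm(n, mean = 100, sd = 15)
Athletic <- rnorm(n, mean = 20, sd = 3)     # independent of IQ by construction

# Hypothetical admission rule: either talent raises the chance of college.
admit_score <- (IQ - 100) / 15 + (Athletic - 20) / 3 + rnorm(n, sd = 0.5)
College <- ifelse(admit_score > 1.5, "Yes", "No")
toy <- data.frame(IQ, Athletic, College)

# Slope of IQ on Athletic in everyone, then in the college cases only.
slope_all     <- coef(lm(IQ ~ Athletic, data = toy))["Athletic"]
slope_college <- coef(lm(IQ ~ Athletic,
                         data = subset(toy, College == "Yes")))["Athletic"]

print(c(all = unname(slope_all), college = unname(slope_college)))
```

In this stand-in, College is caused by both IQ and Athletic, so restricting the data to college cases acts like including a covariate. Comparing its structure with the jock network is one way into the activity.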

In-Class Activity

Here is the complete set of simple causal pathways among three variables.

In each case, write down whether the pathway is a mediator, a common cause, or a witness.
We've only studied the pathways with C in the middle. For the others, figure out whether including C as a covariate opens or blocks the pathway between A and B.
Come up with a setting which assigns a meaning to A, B, and C. Example: B is taking nitroglycerin, C is dilation of the arteries, A is getting a headache.

The possibilities for who can be in the middle: \( A \) or \( B \) or \( C \)

\( C \) in the middle

\( B \causedBy C \causes A \)
\( B \causedBy C \causedBy A \)
\( B \causes C \causedBy A \)
\( B \causes C \causes A \)

\( A \) in the middle

\( B \causedBy A \causes C \)
\( B \causedBy A \causedBy C \)
\( B \causes A \causedBy C \)
\( B \causes A \causes C \)

\( B \) in the middle

\( A \causedBy B \causes C \)
\( A \causedBy B \causedBy C \)
\( A \causes B \causedBy C \)
\( A \causes B \causes C \)
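Any configuration in the list can be checked by simulation in base R. This is a sketch for one case, \( B \causes A \causes C \), assuming unit coefficients and unit-variance noise in the style of the cnet examples. The names chain, without_C, and with_C are made up for this sketch.

```r
set.seed(155)   # arbitrary seed
n <- 10000

# Hypothetical instance of B ==> A ==> C with unit coefficients.
B <- rnorm(n)
A <- B + rnorm(n)
C <- A + rnorm(n)
chain <- data.frame(A, B, C)

# Coefficient on B without and with the covariate C.
without_C <- coef(lm(A ~ B, data = chain))["B"]
with_C    <- coef(lm(A ~ B + C, data = chain))["B"]

print(c(without_C = unname(without_C), with_C = unname(with_C)))
```

Rewriting the three assignment lines gives the other configurations. Comparing the two coefficients shows whether including C leaves the link between A and B alone, weakens it, blocks it, or opens it.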