Wednesday May 1, 2013 Stats 155 Class Notes

\( \newcommand{\ensuremath}[1]{#1} \newcommand{\pathway}[1]{\hbox{#1}} \newcommand{\causes}{\ensuremath{\Rightarrow}} \newcommand{\causedBy}{\ensuremath{\Leftarrow}} \newcommand{\correlatedWith}{\ensuremath{\nLeftrightarrow}} \)

Survey processing

Causality

Software for Simulation

Run the command below to fetch the simulation software, even if you have already done so. Some corrections have been made to the software.

fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE

An example hypothetical causal network

If we want to study the direct link between class size and test scores, we need to block the backdoor pathways.

Figure: hypothetical causal network for test scores.

Chains in Causation

Following the convention used throughout the book

Three linear pathways:

A circular pathway

We won't deal with these. They raise questions about “when”.

Correlation

The notation \( A \correlatedWith B \) means that there is a non-causal connection between \( A \) and \( B \). This must be because there is some unobserved variable \( U \) producing the correlation, such as

  1. \( A \causedBy U \causes B \)
  2. \( A \causes U \causedBy B \)

Note that if either of the arrows went the other way, we'd simply have \( U \) being a causal mediator.

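We'll see later that (2) is not a possibility if \( U \) is unobserved and unconnected to anything else: a common effect that is left out of the model does not produce a correlation between \( A \) and \( B \).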
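The difference between the two structures can be checked with a few lines of base R. This is only a sketch under stated assumptions: it does not use the networks in simulate.r, and the variable names, unit coefficients, seed, and sample size are hypothetical choices made for illustration. In both cases \( U \) is generated but then ignored, as if it had never been recorded.

```r
# Sketch only: these variables are hypothetical stand-ins,
# not the networks defined in simulate.r.
set.seed(155)
n <- 10000

# Structure 1: A <== U ==> B (U is a common cause of A and B)
U1 <- rnorm(n)
A1 <- U1 + rnorm(n)
B1 <- U1 + rnorm(n)
cor_common_cause <- cor(A1, B1)

# Structure 2: A ==> U <== B (U is a common effect of A and B)
A2 <- rnorm(n)
B2 <- rnorm(n)
U2 <- A2 + B2 + rnorm(n)  # generated, but never used in the analysis
cor_common_effect <- cor(A2, B2)

# U is left out of both calculations
print(c(common_cause = cor_common_cause, common_effect = cor_common_effect))
```

With these equations the first correlation should come out near 0.5 and the second near zero, so only the common-cause structure can account for an observed correlation when \( U \) is left out.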

The Rules for Observing or Blocking a Pathway

Remember, we want to study the causal relationship between \( A \) and \( B \) and are trying to decide whether to include the covariate \( C \).

Confirming these rules

For each network, run a simulation and fit a model of \( A \) against \( B \). When the coefficient on \( B \) is statistically indistinguishable from zero, the pathway between \( B \) and \( A \) has been blocked.

Mediator

equations(cnet.mediator)
## B <== rnorm(nsamps) 
## C <== B + rnorm(nsamps) 
## A <== rnorm(nsamps) - C
s = run.sim(cnet.mediator, 1000)
coef(summary(lm(A ~ B, data = s)))
##              Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)  0.006554    0.04483   0.1462 8.838e-01
## B           -0.977772    0.04625 -21.1432 2.771e-82

The coefficient on \( B \) is far from zero (about -0.98), so the above model indicates that the pathway is open.

Including the covariate \( C \) blocks the pathway between \( B \) and \( A \).

coef(summary(lm(A ~ B + C, data = s)))
##              Estimate Std. Error  t value   Pr(>|t|)
## (Intercept)  0.006143    0.03196   0.1922  8.476e-01
## B            0.009186    0.04577   0.2007  8.410e-01
## C           -0.994755    0.03200 -31.0884 6.776e-149
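The same check can be run without the course software. The following base-R sketch copies the three mediator equations printed by equations(cnet.mediator); it does not call run.sim(), and the seed and object names (sim, b_without_C, b_with_C) are arbitrary choices, so the numbers will differ slightly from those above.

```r
# Base-R sketch of the mediator equations; seed and names are arbitrary.
set.seed(1)
nsamps <- 1000
B <- rnorm(nsamps)
C <- B + rnorm(nsamps)
A <- rnorm(nsamps) - C
sim <- data.frame(A, B, C)

# Coefficient on B without and with the covariate C
b_without_C <- coef(lm(A ~ B, data = sim))["B"]
b_with_C <- coef(lm(A ~ B + C, data = sim))["B"]
print(c(without_C = unname(b_without_C), with_C = unname(b_with_C)))
```

The first coefficient should land near -1 (pathway open) and the second near zero (pathway blocked by including \( C \)).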

Common Cause

equations(cnet.common.cause)
## C <== rnorm(nsamps) 
## B <== rnorm(nsamps)+C 
## A <== rnorm(nsamps)-C
s = run.sim(cnet.common.cause, 1000)
coef(summary(lm(A ~ B, data = s)))
##             Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)  0.03654    0.03898   0.9373 3.488e-01
## B           -0.48613    0.02669 -18.2131 3.258e-64

The coefficient on \( B \) is far from zero (about -0.49), so the above model indicates that the pathway is open.

Including the covariate \( C \) blocks the pathway between \( B \) and \( A \).

coef(summary(lm(A ~ B + C, data = s)))
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  0.04690    0.03161   1.483 1.383e-01
## B            0.03449    0.03144   1.097 2.730e-01
## C           -1.01498    0.04447 -22.826 4.148e-93
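The slope of about -0.49 in the model without \( C \) is what the common-cause equations predict. As a worked check, write \( \varepsilon_A \) and \( \varepsilon_B \) for the two rnorm(nsamps) noise terms (this notation is introduced here, not taken from the software), so that \( B = \varepsilon_B + C \) and \( A = \varepsilon_A - C \). Assuming \( C \), \( \varepsilon_A \), and \( \varepsilon_B \) are independent with variance 1, which is the rnorm default, the slope of \( A \) on \( B \) is

\[ \frac{\operatorname{Cov}(A, B)}{\operatorname{Var}(B)} = \frac{\operatorname{Cov}(\varepsilon_A - C,\ \varepsilon_B + C)}{\operatorname{Var}(\varepsilon_B) + \operatorname{Var}(C)} = \frac{-\operatorname{Var}(C)}{1 + 1} = -\frac{1}{2} \]

The estimate of -0.486 with standard error 0.027 is consistent with this value. Once \( C \) is in the model, the only remaining variation in \( A \) and \( B \) comes from the independent noise terms, so the coefficient on \( B \) is expected to be zero.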

Witness

equations(cnet.witness)
## A <== rnorm(nsamps) 
## B <== rnorm(nsamps) 
## C <== rnorm(nsamps) - A + B
s = run.sim(cnet.witness, 1000)
coef(summary(lm(A ~ B, data = s)))
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01741    0.03224 -0.5399   0.5894
## B           -0.01511    0.03251 -0.4646   0.6423

The coefficient on \( B \) is indistinguishable from zero (about -0.015, p = 0.64), so the above model indicates that the pathway is blocked.

Including the covariate \( C \) opens the pathway between \( B \) and \( A \).

coef(summary(lm(A ~ B + C, data = s)))
##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept) -0.03783    0.02216  -1.707  8.805e-02
## B            0.52827    0.02762  19.125  1.066e-69
## C           -0.51283    0.01534 -33.434 5.519e-165
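As with the other networks, the witness result can be reproduced in base R. This sketch copies the three equations printed by equations(cnet.witness) and does not call run.sim(); the seed and object names (sim, b_without_C, b_with_C) are arbitrary, so the estimates will differ slightly from those above.

```r
# Base-R sketch of the witness equations; seed and names are arbitrary.
set.seed(2)
nsamps <- 1000
A <- rnorm(nsamps)
B <- rnorm(nsamps)
C <- rnorm(nsamps) - A + B
sim <- data.frame(A, B, C)

# Coefficient on B without and with the covariate C
b_without_C <- coef(lm(A ~ B, data = sim))["B"]
b_with_C <- coef(lm(A ~ B + C, data = sim))["B"]
print(c(without_C = unname(b_without_C), with_C = unname(b_with_C)))
```

Here the pattern is the reverse of the mediator and common-cause cases: the coefficient on \( B \) should be near zero without \( C \) and near 0.5 once \( C \) is included, even though \( A \) and \( B \) are generated independently.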

The Jock Network

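Imagine that we wanted to study the relationship between athletic ability and SAT scores. The conventional wisdom is that athletes tend to have lower test scores. The jock network implements one hypothesis about how that pattern could arise. In the simulated data, the IQ variable plays the role of the test score.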

s = run.sim(jock, 1000)
head(s)
##       IQ Athletic College Varsity
## 1  95.77    18.99      No      No
## 2  92.10    24.54      No      No
## 3  70.75    23.94      No      No
## 4 111.32    23.82     Yes     Yes
## 5 115.24    19.63     Yes      No
## 6 108.79    20.18     Yes      No

In practice, data on an individual's SAT score and athletic ability might be available via a college, so let's pull out just the cases that are in college:

college <- subset(s, College == "Yes")

Look at the relationship between IQ and Athletic:

coef(summary(lm(IQ ~ Athletic, data = college)))
##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)  152.194     4.3625  34.887 3.930e-108
## Athletic      -1.707     0.2009  -8.496  8.981e-16

There is a relationship! Among the college cases, higher athletic ability goes with lower IQ (a slope of about -1.7).

Now look at the entire set of data:

coef(summary(lm(IQ ~ Athletic, data = s)))
##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept) 98.73122     3.3287 29.6609 3.845e-139
## Athletic     0.08128     0.1646  0.4939  6.215e-01

The relationship has disappeared: the coefficient on Athletic is now indistinguishable from zero. Why?

ACTIVITY: Look at the structure and equations of the jock network and explain.

In-Class Activity

Here is the complete set of simple causal pathways among three variables.

The possibilities for who can be in the middle: \( A \) or \( B \) or \( C \)

\( C \) in the middle

\( A \) in the middle

\( B \) in the middle