Chapter 6 - The Haunted DAG & The Causal Terror

Multiple regression is no oracle, but only a golem. It is logical, but the relationships it describes are conditional associations, not causal influences. Therefore additional information, from outside the model, is needed to make sense of it. This chapter presented introductory examples of some common frustrations: multicollinearity, post-treatment bias, and collider bias. Solutions to these frustrations can be organized under a coherent framework in which hypothetical causal relations among variables are analyzed to cope with confounding. In all cases, causal models exist outside the statistical model and can be difficult to test. However, it is possible to reach valid causal inferences in the absence of experiments. This is good news, because we often cannot perform experiments, both for practical and ethical reasons.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

6E1. List three mechanisms by which multiple regression can produce false inferences about causal effects.

# The first one is multicollinearity. This happens when two or more predictors are strongly associated with each other, conditional on the other variables in the model, so the model cannot tell their individual effects apart (a tiny simulation of this is sketched below).
# The second one is post-treatment bias. This happens when we include a predictor that is itself a consequence of the treatment (a post-treatment variable); conditioning on it can block the path from treatment to outcome and mask the true causal effect.
# The third one is collider bias. This happens when two causally unrelated variables both influence a third variable; conditioning on that third variable (the collider) induces a spurious association between the two.
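# A tiny illustration of multicollinearity (purely illustrative, made-up numbers):
# two nearly identical predictors of the same outcome produce wide, unstable
# coefficient estimates, even though each predicts the outcome well on its own.
N  <- 100
x1 <- rnorm(N)
x2 <- rnorm(N, x1, 0.02)   # x2 is almost a copy of x1
y  <- rnorm(N, x1)         # y depends on the shared signal
summary(lm(y ~ x1 + x2))$coefficients   # note the large standard errors
summary(lm(y ~ x1))$coefficients        # x1 alone is estimated precisely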

6E2. For one of the mechanisms in the previous problem, provide an example of your choice, perhaps from your own research.

# I could take education as an example of post-treatment bias. Teachers receive professional training to improve their knowledge of instruction, and we would expect such training to improve students' performance.
# Here, training influences knowledge of instruction, which in turn influences student performance. If we also condition on knowledge of instruction (a post-treatment variable), the estimated effect of training on performance would be masked. A minimal simulation of this idea is sketched below.
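# A minimal simulation of this hypothetical example (variable names and effect
# sizes are made up for illustration): training raises knowledge, and knowledge
# raises performance.
N <- 1000
training    <- rbinom(N, 1, 0.5)            # half of the teachers are trained
knowledge   <- rnorm(N, 1 + 0.8*training)   # training improves knowledge
performance <- rnorm(N, 0.5*knowledge)      # performance depends only on knowledge
coef(lm(performance ~ training))              # recovers the effect of training
coef(lm(performance ~ training + knowledge))  # conditioning on knowledge masks it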

6E3. List the four elemental confounds. Can you explain the conditional dependencies of each?

# 1. The Fork: Z is a common cause of X and Y (X <- Z -> Y). X and Y are associated, but they become independent once we condition on Z.
# 2. The Pipe: X influences Z, which in turn influences Y (X -> Z -> Y). X and Y are associated, but conditioning on Z blocks the path, so X and Y are independent given Z.
# 3. The Collider: both X and Y influence Z (X -> Z <- Y). X and Y are unconditionally independent, but conditioning on Z opens the path, so X is not independent of Y given Z.
# 4. The Descendant: a variable D is influenced by the collider Z. Conditioning on D is like conditioning on Z, only weaker, because D carries only some of the information in Z. (A dagitty check of these implied independencies is sketched below.)
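# A quick dagitty check of these implied (conditional) independencies
# (the DAGs and variable names below are just illustrative):
library(dagitty)
fork       <- dagitty("dag{ X <- Z -> Y }")
pipe       <- dagitty("dag{ X -> Z -> Y }")
collider   <- dagitty("dag{ X -> Z <- Y }")
descendant <- dagitty("dag{ X -> Z <- Y ; Z -> D }")
impliedConditionalIndependencies(fork)        # expected: X _||_ Y | Z
impliedConditionalIndependencies(pipe)        # expected: X _||_ Y | Z
impliedConditionalIndependencies(collider)    # expected: X _||_ Y
impliedConditionalIndependencies(descendant)  # expected: X _||_ Y ; D _||_ X | Z ; D _||_ Y | Z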

6E4. How is a biased sample like conditioning on a collider? Think of the example at the open of the chapter.

# Conditioning on a collider induces an association between two variables that are causally unrelated. A biased sample does the same thing implicitly: the selection variable is a collider, and the sample has already been conditioned on it. In the example at the start of the chapter, trustworthiness and newsworthiness are unrelated in the full pool of proposals, but among funded proposals they are negatively associated: a funded proposal with low trustworthiness must have had high newsworthiness, otherwise it would not have been funded.
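# A small simulation in the spirit of the chapter's grant example (numbers are
# illustrative): fund the top 10% of proposals by the sum of trustworthiness
# and newsworthiness, then compare correlations before and after selection.
N  <- 500
tw <- rnorm(N)   # trustworthiness
nw <- rnorm(N)   # newsworthiness
selected <- (tw + nw) >= quantile(tw + nw, 0.9)
cor(tw, nw)                       # roughly zero in the full pool
cor(tw[selected], nw[selected])   # negative among the funded proposals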

6M1. Modify the DAG on page 186 to include the variable V, an unobserved cause of C and Y: C ← V → Y. Reanalyze the DAG. Draw the DAG. How many paths connect X to Y? Which must be closed? Which variables should you condition on now?

library(dagitty)
dag_6M1 <- dagitty( "dag {
X -> Y;
X <- U <- A -> C -> Y;
U -> B <- C;
C <- V -> Y;
X [exposure]
Y [outcome]
U [unobserved]
V [unobserved]
}")
plot(dag_6M1)

# Besides the direct path X -> Y, there are now 4 backdoor paths from X to Y:
# 1. X <- U <- A -> C -> Y
# 2. X <- U -> B <- C -> Y
# 3. X <- U <- A -> C <- V -> Y
# 4. X <- U -> B <- C <- V -> Y
# Paths 2 and 4 are already closed because B is a collider, and path 3 is already closed because C is a collider on that path. Only path 1 is open and must be closed.
# We cannot condition on U (it is unobserved), and because of V, C is now a collider: conditioning on C would open the backdoor paths through V. Therefore, we should condition on A.
adjustmentSets(dag_6M1, exposure = "X", outcome = "Y")
## { A }
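# As a check, dagitty can also enumerate the paths between X and Y and report
# which of them are open (a sketch; output not shown here):
paths(dag_6M1, from = "X", to = "Y")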

6M2. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model predicting Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?

# Let's first simulate the data
N <- 100
X <- rnorm(N, 0, 1)     # X is exogenous
Z <- rnorm(N, X, 0.5)   # Z depends strongly on X (small noise)
Y <- rnorm(N, Z)        # Y depends only on Z
cor(X, Z)
## [1] 0.8890347
# We can see X and Z are highly correlated.

# Now let's fit the regression model with both predictors
library(rethinking)
m_6M2 <- quap(
  alist(
    Y ~ dnorm(mu, sigma),
    mu <- a + bX*X + bZ*Z,
    c(a, bX, bZ) ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = list(X = X, Y = Y, Z = Z)
)
precis(m_6M2)
##              mean         sd       5.5%     94.5%
## a     -0.05127265 0.09670833 -0.2058312 0.1032859
## bX     0.06559383 0.19636839 -0.2482408 0.3794284
## bZ     0.77881832 0.18491122  0.4832945 1.0743422
## sigma  0.95961406 0.06738313  0.8519228 1.0673053
# With Z included, X adds essentially no information about Y: the posterior for bX is centered near zero, while bZ is estimated precisely. So this does not look like the severe multicollinearity we saw in the legs example.
# The difference is the causal structure. In the legs example, both legs are direct and nearly redundant causes of height, so they carry the same information and the model cannot separate their effects. Here Z is a pipe between X and Y: once we condition on Z, X is independent of Y (X _||_ Y | Z), so the model correctly attributes the association to Z. (For contrast, a model with X alone is sketched below.)
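# For contrast (a sketch using the same simulated data): regressing Y on X alone
# shows a clear association, because the pipe X -> Z -> Y is open when Z is
# omitted from the model.
m_6M2_X <- quap(
  alist(
    Y ~ dnorm(mu, sigma),
    mu <- a + bX*X,
    c(a, bX) ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = list(X = X, Y = Y)
)
precis(m_6M2_X)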