Chapter 6 - The Haunted DAG & The Causal Terror

Multiple regression is no oracle, but only a golem. It is logical, but the relationships it describes are conditional associations, not causal influences. Therefore additional information, from outside the model, is needed to make sense of it. This chapter presented introductory examples of some common frustrations: multicollinearity, post-treatment bias, and collider bias. Solutions to these frustrations can be organized under a coherent framework in which hypothetical causal relations among variables are analyzed to cope with confounding. In all cases, causal models exist outside the statistical model and can be difficult to test. However, it is possible to reach valid causal inferences in the absence of experiments. This is good news, because we often cannot perform experiments, both for practical and ethical reasons.

Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

6E1. List three mechanisms by which multiple regression can produce false inferences about causal effects.

# Collider bias: the selection-distortion effect can happen inside a multiple regression, because adding a predictor that is a common consequence of other variables induces statistical selection within the model.
# Multicollinearity: a very strong association between two or more predictor variables, which makes the posterior for each individual coefficient extremely uncertain.
# Post-treatment bias: mistaken inferences that arise from including a predictor that is a consequence of the treatment; conditioning on it can mask the treatment's real effect (see the sketch below). It is a risk in experimental as well as observational studies.
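# A minimal sketch of post-treatment bias (toy numbers of my own, echoing the
# chapter's plant/fungus example): treatment helps growth only by preventing
# fungus, so conditioning on fungus masks the treatment effect.
set.seed(71)
N <- 500
treatment <- rbinom(N, 1, 0.5)
fungus <- rbinom(N, 1, 0.5 - 0.3 * treatment)   # treatment reduces fungus
growth <- rnorm(N, 5 - 2 * fungus)              # fungus reduces growth
coef(lm(growth ~ treatment))             # treatment looks beneficial
coef(lm(growth ~ treatment + fungus))    # conditioning on fungus masks it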

6E2. For one of the mechanisms in the previous problem, provide an example of your choice, perhaps from your own research.

# Multicollinearity example: suppose we regress blood pressure on age, weight, body surface area (BSA), pulse rate, and stress level. Weight and BSA each predict blood pressure well on their own, but they are also very strongly correlated with each other, since both measure body size. When both are included in the same multiple regression, neither appears strongly associated with blood pressure and both coefficients become highly uncertain, even though body size clearly matters.
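# A toy simulation of that multicollinearity (numbers are my own invention, in
# the spirit of the chapter's legs example): weight and BSA nearly duplicate
# each other, so together their coefficients become very uncertain.
set.seed(6)
N <- 200
weight <- rnorm(N)
bsa <- rnorm(N, weight, 0.1)       # body surface area almost duplicates weight
bp <- rnorm(N, 2 * weight)         # blood pressure driven by body size
round(summary(lm(bp ~ weight + bsa))$coefficients, 2)  # inflated std. errors
round(summary(lm(bp ~ weight))$coefficients, 2)        # alone, weight is precise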

6E3. List the four elemental confounds. Can you explain the conditional dependencies of each?

# Fork: X ← Z → Y. Z is a common cause of X and Y, generating a correlation between them. Conditioning on Z blocks the path, making X and Y independent given Z.
# Pipe: X → Z → Y. The treatment X influences fungus Z, which influences growth Y. Conditioning on Z also blocks the path from X to Y, so X and Y are independent given Z.
# Collider: X → Z ← Y. There is no association between X and Y unless you condition on Z. Conditioning on the collider Z opens the path; once it is open, information flows between X and Y, even though neither has any causal influence on the other.
# Descendant: X → Z ← Y with Z → D. Conditioning on a descendant D partly conditions on its parent Z, weakly producing whatever effect conditioning on Z itself would have.
# The dagitty check below lists the conditional independencies each DAG implies.
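# A hedged check with the dagitty package (assuming it is installed): it lists
# the conditional independencies each elemental DAG implies.
library(dagitty)
impliedConditionalIndependencies(dagitty("dag{ X <- Z -> Y }"))   # fork: X _||_ Y | Z
impliedConditionalIndependencies(dagitty("dag{ X -> Z -> Y }"))   # pipe: X _||_ Y | Z
impliedConditionalIndependencies(dagitty("dag{ X -> Z <- Y }"))   # collider: X _||_ Y (only marginally)
impliedConditionalIndependencies(dagitty("dag{
  X -> Z <- Y
  Z -> D
}"))  # descendant: X _||_ Y still holds, and D is separated from X and Y only by its parent Z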

6E4. How is a biased sample like conditioning on a collider? Think of the example at the open of the chapter.

# X → Z ← Y 
# Here X is happiness, Y is age, and Z is marriage. X and Y both cause Z, so Z is a collider. A biased sample is like conditioning on a collider because selecting only certain cases (for example, only married people) is statistically the same as including the collider as a predictor in a regression. If we condition on marriage (Z) in either way, it induces a statistical association between age and happiness, which can mislead us into thinking that happiness changes with age when in fact it is constant.
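# A minimal simulation sketch (my own toy numbers, not the chapter's
# sim_happiness model): happiness and age are independent, but both increase
# the chance of marriage, so selecting only married people induces an association.
set.seed(505)
N <- 1e4
happiness <- rnorm(N)
age <- runif(N, 18, 65)
p_married <- plogis(happiness + 0.05 * (age - 40))  # both causes feed the collider
married <- rbinom(N, 1, p_married)
cor(happiness, age)                                  # ~ 0 in the full population
cor(happiness[married == 1], age[married == 1])      # negative among the married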

6M1. Modify the DAG on page 186 to include the variable V, an unobserved cause of C and Y: C ← V → Y. Reanalyze the DAG. How many paths connect X to Y? Which must be closed? Which variables should you condition on now?

# With V included, four backdoor paths connect X to Y (in addition to the direct causal path X → Y):
# (1) X ← U ← A → C → Y
# This path is open, because there is no collider within it: just a fork at A and pipes on each side. Information will flow through it, confounding X → Y. It is a backdoor, and we must shut it by conditioning on one of its variables.
# (2) X ← U ← A → C ← V → Y
# With V present, C becomes a collider on this path (A → C ← V), so the path is already closed. But this also changes which variables we may condition on: conditioning on C would open this path, and since V is unobserved we could not re-close it.
# (3) X ← U → B ← C → Y
# (4) X ← U → B ← C ← V → Y
# Both contain the collider U → B ← C and are therefore already closed. That is why adjustmentSets (see the dagitty code below) does not mention B. If we do condition on B, it will open these paths, creating a confound. Our inference about X → Y would then change, but without the DAG we would not know whether that change is helping or misleading us. The fact that including a variable changes the X → Y coefficient does not mean the coefficient is better; you could have just conditioned on a collider.
# Since U and V are unobserved and C is now a collider between A and V, the only valid adjustment set is {A}: conditioning on A closes path (1) without opening anything else.
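# Checking the path analysis with dagitty (assuming the package is installed).
# The DAG is the chapter's, with the new unobserved V added as C <- V -> Y.
library(dagitty)
dag_6m1 <- dagitty("dag{
  U [unobserved]
  V [unobserved]
  X -> Y
  X <- U <- A -> C -> Y
  U -> B <- C
  C <- V -> Y
}")
adjustmentSets(dag_6m1, exposure = "X", outcome = "Y")  # expect { A }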

6M2. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model predicting Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?

n <- 1000
b_xz <- 0.9   # effect of X on Z
b_zy <- 0.7   # effect of Z on Y

set.seed(111)
x <- rnorm(n)
z <- rnorm(n, x * b_xz)   # Z is caused by X, so the two are highly correlated
y <- rnorm(n, z * b_zy)   # Y is caused only by Z

d <- data.frame(x, y, z)
cor(d)
##           x         y         z
## x 1.0000000 0.4629346 0.6889482
## y 0.4629346 1.0000000 0.6792110
## z 0.6889482 0.6792110 1.0000000
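# Fitting the model the question asks for (a minimal lm() sketch; the chapter
# uses quap() from the rethinking package, but least squares makes the same point):
m_xz <- lm(y ~ x + z, data = d)
round(summary(m_xz)$coefficients, 3)
# The z coefficient should land near b_zy = 0.7 and the x coefficient near 0,
# both with modest standard errors.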
# X and Z are strongly correlated, but the model does not suffer the way the legs example did: the coefficient on Z stays near b_zy = 0.7 while the coefficient on X shrinks toward zero, with reasonable standard errors. The reason is the structure of the DAG: X → Z → Y is a pipe, so once we know Z, X provides no additional information about Y, and the model correctly assigns the association to Z. In the legs example, by contrast, the two legs carried the same information about height rather than extra information, so the model could not decide how to divide the association between them.