6E1. List three mechanisms by which multiple regression can produce false inferences about causal effects.
Multicollinearity, Post-treatment bias, Collider bias
6E2. For one of the mechanisms in the previous problem, provide an example of your choice, perhaps from your own research.
Multicollinearity example:
To predict the likelihood of survival from a disease, scientists could list many factors as candidate causal variables, such as age and pre-existing disease. It looks very reasonable that both age and pre-existing disease are strongly related to a patient's survival rate.
However, age and pre-existing disease are not two independent variables: the two factors are correlated with each other. For example, a senior patient is usually more likely to have a pre-existing disease.
Putting both of these variables in a prediction model could introduce a multicollinearity problem: the model cannot cleanly separate their individual associations with survival, so both coefficient estimates become unstable, with inflated uncertainty.
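A minimal sketch of this example in R (all effect sizes here are made up for illustration, and a continuous score stands in for survival):
set.seed(1)
n <- 500
age <- rnorm(n)
preexist <- rnorm(n, mean = 0.9 * age, sd = 0.3)          # strongly tied to age
survival <- rnorm(n, mean = -0.5 * age - 0.5 * preexist)  # both reduce survival
summary(lm(survival ~ age + preexist))  # both coefficients get inflated standard errors
summary(lm(survival ~ age))             # a single predictor is estimated much more tightly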
6E3. List the four elemental confounds. Can you explain the conditional dependencies of each?
Four elemental confounds:
1. The first type of relation is the FORK: X <- Z -> Y. This is the classic confounder. In a fork, some variable Z is a common cause of X and Y, generating a correlation between them. If we condition on Z, then learning X tells us nothing about Y: X and Y are independent, conditional on Z.
2. The second type of relation is the PIPE: X -> Z -> Y. Here Z mediates the influence of X on Y. As in the fork, conditioning on Z blocks the path, so X and Y become independent, conditional on Z.
3. The third type of relation is the COLLIDER: X -> Z <- Y. Unlike the other two types of relations, in a collider there is no association between X and Y unless we condition on Z. Conditioning on Z, the collider variable, opens the path. Once the path is open, information flows between X and Y, even though neither X nor Y has any causal influence on the other.
4. The fourth type of relation is the DESCENDANT: a variable D caused by Z. Conditioning on the descendant D partially conditions on its parent Z, so the effect depends on Z's role: if Z sits in a fork or pipe, conditioning on D partially blocks the path; if Z is a collider, conditioning on D partially opens it.
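A quick simulation (a hypothetical sketch with arbitrary effect sizes) confirms the conditional dependencies of the fork and the collider using plain linear regressions:
set.seed(42)
n <- 1000
# Fork: X <- Z -> Y. X and Y are associated, but independent given Z.
z <- rnorm(n)
x <- rnorm(n, z)
y <- rnorm(n, z)
coef(summary(lm(y ~ x)))["x", ]      # clear association
coef(summary(lm(y ~ x + z)))["x", ]  # near zero once z is conditioned on
# Collider: X -> Z <- Y. X and Y are independent until we condition on Z.
x <- rnorm(n)
y <- rnorm(n)
z <- rnorm(n, x + y)
coef(summary(lm(y ~ x)))["x", ]      # near zero
coef(summary(lm(y ~ x + z)))["x", ]  # conditioning on z opens the path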
6E4. How is a biased sample like conditioning on a collider? Think of the example at the open of the chapter.
As the author of the textbook mentions, consider the example of scientific proposals judged on newsworthiness and trustworthiness. When editors select proposals to fund, they give each proposal a score that weights the newsworthiness and trustworthiness perspectives equally, and then they select the top 10% highest-scoring proposals from the candidate list.
Conditioning on the final score, i.e., looking only at the selected proposals, induces a negative association between newsworthiness and trustworthiness. The reason is that any selected proposal must have high newsworthiness, high trustworthiness, or both. If, for example, a selected proposal has low trustworthiness, then it must have high newsworthiness, and vice versa; otherwise its score would not be high enough for it to be selected. The total score is a common consequence of both traits, so it is a collider, and selecting the sample on it is exactly like conditioning on that collider: the biased sample makes two causally unrelated variables appear associated.
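This is easy to reproduce (a sketch following the simulation at the open of the chapter; the seed and sample size follow the book's version):
set.seed(1914)
n <- 200
nw <- rnorm(n)  # newsworthiness
tw <- rnorm(n)  # trustworthiness, simulated independently of nw
score <- nw + tw
selected <- score >= quantile(score, 0.9)  # keep the top 10%
cor(nw, tw)                      # near zero in the full pool
cor(nw[selected], tw[selected])  # strongly negative among the selected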
6M1. Modify the DAG from the chapter to include the variable V, an unobserved cause of C and Y: C <- V -> Y. Reanalyze the DAG. How many paths connect X to Y? Which must be closed? Which variables should you condition on now?
In addition to the direct path X -> Y and the two backdoor paths in the original DAG (X <- U <- A -> C -> Y and X <- U -> B <- C -> Y), adding V creates two new paths from X to Y:
X <- U <- A -> C <- V -> Y: This path contains a fork at A (U <- A -> C) and a collider at C (A -> C <- V). Because of the collider at C, the path is closed by default; it opens only if we condition on C. So we must not condition on C.
X <- U -> B <- C <- V -> Y: This path contains a collider at B (U -> B <- C) and a fork at V (C <- V -> Y). The collider at B keeps the path closed as long as we do not condition on B.
The only path open by default is therefore the original backdoor path X <- U <- A -> C -> Y. Since U is unobserved, we close it by conditioning on A. Unlike in the original DAG, we can no longer condition on C, because C is now a collider on the paths through V, and conditioning on it would open them. The conditioning set is {A}, as the dagitty check below confirms.
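We can verify this with the dagitty package (a sketch assuming dagitty is installed; the DAG definition mirrors the chapter's code, with V added):
library(dagitty)
dag_6M1 <- dagitty("dag {
  U [unobserved]
  V [unobserved]
  X -> Y
  X <- U <- A -> C -> Y
  U -> B <- C
  C <- V -> Y
}")
adjustmentSets(dag_6M1, exposure = "X", outcome = "Y")  # should return { A }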
6M2. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model predicting Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?
# simulate the pipe X -> Z -> Y so that x and z are strongly correlated
n <- 1000
b_xz <- 0.9  # effect of x on z
b_zy <- 0.7  # effect of z on y
set.seed(100)
x <- rnorm(n)
z <- rnorm(n, x * b_xz)  # z is caused by x
y <- rnorm(n, z * b_zy)  # y is caused by z only
d <- data.frame(x, y, z)
cor(d)
## x y z
## x 1.0000000 0.4562717 0.6924074
## y 0.4562717 1.0000000 0.6351279
## z 0.6924074 0.6351279 1.0000000
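The discussion below quotes posterior ranges from a fitted model; a minimal sketch of that fit using rethinking::quap (assuming the rethinking package is available; the parameter names bx and bz are my own):
library(rethinking)
m_6M2 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + bx * x + bz * z,
    a ~ dnorm(0, 0.5),
    bx ~ dnorm(0, 0.5),
    bz ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d
)
precis(m_6M2)  # bx concentrates near zero; bz is estimated near its true value of 0.7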
Do you observe any multicollinearity?
Jiang Lan: No. Although x and z are highly correlated (r ≈ 0.69 above), the model does not show the multicollinearity pathology from the legs example: both posteriors are narrow and well identified.
Why or why not? What is different from the legs example in the chapter?
Jiang Lan: Even though the simulation assigned b_xz = 0.9 on the x -> z path, the coefficient on x in the model predicting y concentrates near zero (roughly -0.03 to 0.11). This is not an estimation failure: in the pipe X -> Z -> Y, once we condition on z, x carries no additional information about y, because z mediates all of x's influence.
The difference between the pipe DAG and the legs example is that here the coefficient on z is estimated correctly (near its true value of 0.7) and the coefficient on x is correctly driven to zero, whereas in the legs example the two predictors carry essentially the same information about the outcome, so neither coefficient is identified and both posteriors become extremely wide.
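For contrast, re-running the chapter's legs simulation (a sketch following the book's code, with lm as a quick stand-in for the book's quap model) shows what real multicollinearity looks like:
n_leg <- 100
set.seed(909)
height <- rnorm(n_leg, 10, 2)
leg_prop <- runif(n_leg, 0.4, 0.5)  # leg length as a proportion of height
leg_left <- leg_prop * height + rnorm(n_leg, 0, 0.02)
leg_right <- leg_prop * height + rnorm(n_leg, 0, 0.02)
# the two legs carry almost identical information about height,
# so neither coefficient is identified and both standard errors explode
summary(lm(height ~ leg_left + leg_right))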