6E1. List three mechanisms by which multiple regression can produce false inferences about causal effects.

# Post-treatment bias, collider bias, and multicollinearity

6E2. For one of the mechanisms in the previous problem, provide an example of your choice, perhaps from your own research.

# Post-treatment bias example: suppose we want to study the effect of race on salary, where race affects job position and job position in turn affects salary. If we regress salary on race while controlling for job position, we find no relationship between race and salary conditional on job position, even though race does affect salary through position.
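# A minimal sketch of that DAG (group -> position -> salary), with made-up
# variable names and effect sizes; here the group indicator has no direct
# effect on salary, only an indirect one through position.
set.seed(6)
n        <- 1000
group    <- rbinom(n, 1, 0.5)              # stand-in for the race indicator
position <- rnorm(n, mean = 1 * group)     # group affects job position
salary   <- rnorm(n, mean = 2 * position)  # position affects salary

coef(lm(salary ~ group))             # total effect of group is visible (~2)
coef(lm(salary ~ group + position))  # conditioning on position hides it (~0)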

6E3. List the four elemental confounds. Can you explain the conditional dependencies of each?

# Fork: X <- Z -> Y. X and Y are associated, but they become independent once we condition on Z.
# Collider: X -> Z <- Y. X and Y are independent, but conditioning on Z creates a dependency between them.
# Pipe: X -> Z -> Y. Z mediates the association between X and Y; conditioning on Z removes the dependency between X and Y.
# Descendant: X -> Z -> Y with Z -> A. Conditioning on A is like (weakly) conditioning on Z.

6E4. How is a biased sample like conditioning on a collider? Think of the example at the open of the chapter.

# Selection into the sample is a common consequence of several variables, so restricting analysis to the selected cases amounts to conditioning on a collider. In the chapter-opening example, grant proposals are selected on a combination of newsworthiness and trustworthiness; among funded proposals the two become negatively associated even though they are independent in the full pool.
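# A small simulation of that selection effect (the variable names and the 10%
# cutoff are made up here, but the structure follows the chapter's example):
set.seed(1914)
n  <- 500
nw <- rnorm(n)                      # newsworthiness
tw <- rnorm(n)                      # trustworthiness
funded <- (nw + tw) >= quantile(nw + tw, 0.9)   # select top 10% by total score

cor(nw, tw)                         # roughly zero in the full population
cor(nw[funded], tw[funded])         # clearly negative among funded proposals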

6M1. Modify the DAG on page 186 to include the variable V, an unobserved cause of C and Y: C ← V → Y. Reanalyze the DAG. How many paths connect X to Y? Which must be closed? Which variables should you condition on now?

# Adding V gives five paths from X to Y: the direct causal path X -> Y plus four backdoor paths:
# (1) X <- U <- A -> C -> Y: open, so it must be closed.
# (2) X <- U <- A -> C <- V -> Y: closed (C is a collider on this path).
# (3) X <- U -> B <- C -> Y: closed (B is a collider).
# (4) X <- U -> B <- C <- V -> Y: closed (B is a collider).
# Conditioning on C would close path (1) but open path (2), and V is unobserved, so C is no longer usable. Condition on A only.
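# A quick check with dagitty (written in the style of the book's dag_6.1 code):
library(dagitty)
dag_6m1 <- dagitty("dag {
  U [unobserved]
  V [unobserved]
  X -> Y
  X <- U <- A -> C -> Y
  U -> B <- C
  C <- V -> Y
}")
paths(dag_6m1, from = "X", to = "Y")    # should list the five paths and whether each is open
adjustmentSets(dag_6m1, exposure = "X", outcome = "Y")
# Expectation: { A } is the only practical adjustment set, since the alternative
# would require conditioning on C together with the unobserved V.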

6M2. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model predicting Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?

n    <- 100   # number of simulated observations
b_xz <- 0.9   # effect of x on z
b_zy <- 0.1   # effect of z on y

set.seed(900)
x <- rnorm(n)
z <- rnorm(n, x * b_xz)   # X -> Z
y <- rnorm(n, z * b_zy)   # Z -> Y

d <- data.frame(x, y, z)
cor(d)
##            x          y         z
## x 1.00000000 0.01420212 0.6636059
## y 0.01420212 1.00000000 0.1640509
## z 0.66360587 0.16405095 1.0000000
# The correlation between x and z is large (about 0.66 here), but including both in a model for y does not produce the pathological multicollinearity of the legs example: z retains its association with y while the coefficient on x collapses toward zero, because in the pipe X -> Z -> Y, x carries no information about y beyond what z already provides. In the legs example the two predictors contained essentially the same information about the outcome (both legs are nearly proportional to height), so only their sum was identified. The difference is the causal structure relating the predictors to the outcome, not the priors.
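# A quick way to check this (a sketch using flat-prior lm(); the book would use
# quap() with explicit priors, but the qualitative picture should match):
m_xz <- lm(y ~ x + z, data = d)
summary(m_xz)$coefficients
# With n = 100 and this much noise the estimates are themselves uncertain, but
# x and z should separate: x near zero, z near b_zy, with only moderately
# inflated standard errors. Shrinking the noise in z, e.g.
# z <- rnorm(n, x * b_xz, 0.1), would push cor(x, z) closer to the "very large"
# value the exercise asks for.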