6E1. List three mechanisms by which multiple regression can produce false inferences about causal effects.
# Multicollinearity, post-treatment bias, collider bias
6E2. For one of the mechanisms in the previous problem, provide an example of your choice, perhaps from your own research.
#Multicollinearity arises when a model contains several predictors that are strongly correlated with one another. For example, suppose we model cake baking time as Y = A + B*X_1, where Y is the baking time and X_1 is the amount of flour used. We might add a second predictor X_2, the amount of gluten in the flour, giving Y = A + B*X_1 + C*X_2. But the amount of gluten rises almost in lockstep with the amount of flour, so X_1 and X_2 are highly correlated and the coefficient estimates become unreliable: the model can, for instance, assign a large positive effect to flour and a large negative effect to gluten that nearly cancel, which is not realistic.
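# A minimal sketch of the cake example (all variable names and numbers here are
# made up; plain lm() is used instead of the book's quap, just to show the symptom):
set.seed(6)
n_cake    <- 100
flour     <- rnorm(n_cake, mean = 500, sd = 50)           # grams of flour
gluten    <- rnorm(n_cake, mean = 0.12 * flour, sd = 1)   # gluten tracks flour almost exactly
bake_time <- rnorm(n_cake, mean = 30 + 0.02 * flour, sd = 2)
summary(lm(bake_time ~ flour))            # flour alone: precise, sensible estimate
summary(lm(bake_time ~ flour + gluten))   # both: standard errors inflate and signs can flip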
6E3. List the four elemental confounds. Can you explain the conditional dependencies of each?
#The four elemental confounds are the Fork, the Pipe, the Collider, and the Descendant.
# The Fork (X <- Z -> Y): Z is a common cause of X and Y, so X and Y are associated through Z even though neither causes the other. Conditioning on Z blocks the path, making X and Y independent conditional on Z.
# The Pipe (X -> Z -> Y): X influences Z and Z influences Y. Conditioning on Z blocks the path, so the association between X and Y disappears.
# The Collider (X -> Z <- Y): X and Y both influence Z. The path is closed by default, but conditioning on Z opens it, creating an association between X and Y even though there is no causal relationship between them.
# The Descendant: D is a child of Z. Conditioning on D is like partially conditioning on Z itself, so if Z is a collider, conditioning on D partially opens the path between X and Y. A quick simulation of the fork and collider cases follows below.
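# A quick numerical check of the fork and the collider (simulated data; the
# names and coefficients are arbitrary, chosen only to illustrate the logic).
set.seed(11)
n_sim <- 1e4
# Fork: X <- Z -> Y. X and Y are associated, but not once we condition on Z.
z_f <- rnorm(n_sim)
x_f <- rnorm(n_sim, z_f)
y_f <- rnorm(n_sim, z_f)
round(coef(lm(y_f ~ x_f)), 2)        # x coefficient clearly non-zero
round(coef(lm(y_f ~ x_f + z_f)), 2)  # x coefficient near zero once z is included
# Collider: X -> Z <- Y. X and Y are independent, until we condition on Z.
x_c <- rnorm(n_sim)
y_c <- rnorm(n_sim)
z_c <- rnorm(n_sim, x_c + y_c)
round(coef(lm(y_c ~ x_c)), 2)        # near zero: no association
round(coef(lm(y_c ~ x_c + z_c)), 2)  # non-zero: conditioning on z opens the path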
6E4. How is a biased sample like conditioning on a collider? Think of the example at the open of the chapter.
#We can use the switch/light/electricity example: switch -> light <- electricity, so the light is a collider. Conditioning on the state of the light lets information flow between switch and electricity, even though they are independent causes.
#If the light is off and the switch is on, the electricity must be off; and if the light is on, then the electricity is on and the switch is on. A biased sample works the same way: selecting observations according to a collider (for example, studying only funded proposals, as at the open of the chapter) induces a spurious association between the collider's independent causes. A small simulation in that spirit follows.
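# In the spirit of the chapter-opening simulation: fund the top 10% of proposals
# by combined score (conditioning on a collider), and a negative association
# between newsworthiness and trustworthiness appears in the funded subsample.
set.seed(1914)
n_prop   <- 200
nw       <- rnorm(n_prop)                    # newsworthiness
tw       <- rnorm(n_prop)                    # trustworthiness, independent of nw
score    <- nw + tw                          # both feed into the selection score
selected <- score >= quantile(score, 0.9)    # keep only the top 10%
cor(nw, tw)                      # close to zero in the full sample
cor(nw[selected], tw[selected])  # clearly negative among the selected proposals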
6M1. Modify the DAG on page 186 to include the variable V, an unobserved cause of C and Y: C ← V → Y. Reanalyze the DAG. How many paths connect X to Y? Which must be closed? Which variables should you condition on now?
# Besides the two backdoor paths already present in the page-186 DAG, adding V creates two more, so there are four backdoor paths in total (five paths counting the direct causal path X -> Y).
#(1) X <- U <- A -> C <- V -> Y
# Along this path there is a fork at A (U <- A -> C), a fork at V (C <- V -> Y), and a collider at C (A -> C <- V).
# Because C is a collider, this path is closed by default, and it stays closed as long as we do not condition on C.
#(2) X <- U -> B <- C <- V -> Y
# This path has a fork at U (X <- U -> B), a collider at B (U -> B <- C), a pipe at C (V -> C -> B), and a fork at V (C <- V -> Y).
# Because B is a collider, this path is also closed by default; we just need to avoid conditioning on B.
# The only open backdoor path remains X <- U <- A -> C -> Y. In the original DAG we could close it by conditioning on either A or C, but conditioning on C would now open the new paths through V (C is a collider on path 1). So we should condition on A only; U is unobserved and so is not an option.
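# A sanity check with dagitty (loaded along with rethinking), mirroring the DAG
# code from the chapter with V added. If the analysis above is right, the
# minimal adjustment set should be { A }.
dag_6M1 <- dagitty( "dag {
  U [unobserved]
  V [unobserved]
  X -> Y
  X <- U <- A -> C -> Y
  U -> B <- C
  C <- V -> Y
}" )
adjustmentSets( dag_6M1 , exposure="X" , outcome="Y" )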
6M2. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model predicting Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?
library(rethinking)
set.seed(129)
n  <- 1000
b1 <- 0.9   # simulated effect of x on z
b2 <- 0.7   # simulated effect of z on y
set.seed(100)
x <- rnorm(n)          # exogenous cause
z <- rnorm(n, x*b1)    # z depends strongly on x, so cor(x, z) is large
y <- rnorm(n, z*b2)    # y depends only on z (the DAG X -> Z -> Y)
d <- data.frame(x, y, z)
cor(d)
## x y z
## x 1.0000000 0.4562717 0.6924074
## y 0.4562717 1.0000000 0.6351279
## z 0.6924074 0.6351279 1.0000000
# regress y on both x and z together (the b1, b2 here are model parameters,
# distinct from the simulation constants above)
m <- quap(
  alist(
    y ~ dnorm( mu , sigma ),
    mu <- a + b1*x + b2*z,
    a ~ dnorm( 0 , 100 ),
    c(b1,b2) ~ dnorm( 0 , 100 ),
    sigma ~ dexp( 1 )
  ), data=d )
## Caution, model may not have converged.
## Code 1: Maximum iterations reached.
precis(m)
## mean sd 5.5% 94.5%
## a -0.008135237 0.03332760 -0.06139918 0.04512871
## b1 0.041459808 0.04483542 -0.03019585 0.11311547
## b2 0.619099380 0.03398903 0.56477834 0.67342042
## sigma 1.053730409 0.02383294 1.01564077 1.09182005
plot(precis(m))
post <- extract.samples(m)
# joint posterior of the two coefficients, analogous to the bl/br plot in the legs example
plot( b1 ~ b2 , post , col=col.alpha(rangi2,0.1) , pch=16 )
#We do not see the kind of multicollinearity from the legs example. Even though x and z are strongly correlated, the posterior is well behaved: the coefficient on x is pulled toward zero (roughly -0.03 to 0.11) with a narrow interval, and the coefficient on z is estimated precisely. That is the correct inference for this DAG: once we condition on z, x carries no additional information about y, because z mediates all of x's influence (the pipe X -> Z -> Y is blocked by conditioning on Z).
# The legs example is different because both legs are caused by the same underlying height and carry essentially the same information about the outcome. There is then an effectively unlimited number of combinations of bl and br that produce the same predictions, which makes the posterior means unstable and the standard deviations very large. Here z is a consequence of x rather than a copy of it, so the model simply assigns the association with y to z.
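# For contrast, the legs-style structure from the chapter (the simulation below
# follows the book's numbers; it is only here to show what genuine
# multicollinearity looks like next to the pipe result above).
set.seed(909)
n_leg     <- 100
height    <- rnorm(n_leg, 10, 2)
leg_prop  <- runif(n_leg, 0.4, 0.5)
leg_left  <- leg_prop*height + rnorm(n_leg, 0, 0.02)
leg_right <- leg_prop*height + rnorm(n_leg, 0, 0.02)
d_legs <- data.frame(height, leg_left, leg_right)
m_legs <- quap(
  alist(
    height ~ dnorm( mu , sigma ),
    mu <- a + bl*leg_left + br*leg_right,
    a ~ dnorm( 10 , 100 ),
    c(bl,br) ~ dnorm( 2 , 10 ),
    sigma ~ dexp( 1 )
  ), data=d_legs )
precis(m_legs)   # bl and br both get very wide posteriors, unlike b1 and b2 above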