Chapter 6 - The Haunted DAG & The Causal Terror

Multiple regression is no oracle, but only a golem. It is logical, but the relationships it describes are conditional associations, not causal influences. Therefore additional information, from outside the model, is needed to make sense of it. This chapter presented introductory examples of some common frustrations: multicollinearity, post-treatment bias, and collider bias. Solutions to these frustrations can be organized under a coherent framework in which hypothetical causal relations among variables are analyzed to cope with confounding. In all cases, causal models exist outside the statistical model and can be difficult to test. However, it is possible to reach valid causal inferences in the absence of experiments. This is good news, because we often cannot perform experiments, both for practical and ethical reasons.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your R Pubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

6E1. List three mechanisms by which multiple regression can produce false inferences about causal effects.

# 1. Multicollinearity
# 2. Post-treatment bias (conditioning on post-treatment variables)
# 3. Collider bias

6E2. For one of the mechanisms in the previous problem, provide an example of your choice, perhaps from your own research.

# Multicollinearity: a very strong association between two or more predictor variables, which inflates the posterior uncertainty of their coefficients when both are included in the same model.
# Example: years of education and annual income as predictors of some outcome. Higher education tends to go with higher income, so the two predictors are strongly correlated; including both makes it hard to say what either one contributes on its own, even though the model may still predict well. (A minimal simulation follows below.)
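A minimal sketch of this (made-up numbers and a generic outcome, not data from any real study): simulate two strongly correlated predictors, include both in a model, and note how wide the posterior intervals become.

library(rethinking)
set.seed(505)
n <- 500
education <- rnorm(n)                       # standardized years of education
income    <- rnorm(n, education, 0.2)       # income tracks education closely
outcome   <- rnorm(n, education + income)   # hypothetical outcome both predict
m_6E2 <- quap(
  alist(
    outcome ~ dnorm(mu, sigma),
    mu <- a + bE*education + bI*income,
    c(a, bE, bI) ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = list(outcome = outcome, education = education, income = income)
)
precis(m_6E2)   # expect inflated posterior SDs for bE and bI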

6E3. List the four elemental confounds. Can you explain the conditional dependencies of each?

# 1. Fork: X <- Z -> Y. X ⊥ Y | Z: X and Y are associated through their common cause Z, but conditioning on Z makes them independent.
# 2. Pipe: X -> Z -> Y. X ⊥ Y | Z: Z mediates the influence of X on Y, so conditioning on Z blocks the path and makes X and Y independent.
# 3. Collider: X -> Z <- Y. X ⊥ Y, but X and Y are NOT independent given Z: there is no association between X and Y unless we condition on Z; conditioning on Z opens the path and lets information flow between them.
# 4. Descendant: a variable influenced by another variable. Conditioning on a descendant partially conditions on its parent, so what happens depends on the parent's place in the graph (it weakly mimics conditioning on a fork, pipe, or collider).
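As a quick check (a sketch using the dagitty package, which rethinking loads), the implied conditional independencies of the first three elemental DAGs match the descriptions above:

library(dagitty)
fork     <- dagitty("dag{ X <- Z -> Y }")
pipe     <- dagitty("dag{ X -> Z -> Y }")
collider <- dagitty("dag{ X -> Z <- Y }")
impliedConditionalIndependencies(fork)       # X _||_ Y | Z
impliedConditionalIndependencies(pipe)       # X _||_ Y | Z
impliedConditionalIndependencies(collider)   # X _||_ Y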

6E4. How is a biased sample like conditioning on a collider? Think of the example at the open of the chapter.

# The chapter opens with grant proposals scored on newsworthiness and trustworthiness. Reviewers combine the two scores and fund only the top 10% of proposals.
# Selecting on the total score is conditioning on a collider: the total score is a common consequence of newsworthiness and trustworthiness, so within the funded sample the most newsworthy proposals tend to be the least trustworthy (and vice versa), even though the two traits are unrelated in the full pool.
# A biased sample works the same way: the selection process conditions on a shared consequence of the variables of interest, inducing a spurious association inside the sample.
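A small simulation in the spirit of the chapter's opening example (the seed and sample size are arbitrary choices): newsworthiness and trustworthiness are generated independently, yet they end up negatively correlated among the selected proposals.

set.seed(1914)
N <- 200                          # number of grant proposals
p <- 0.1                          # proportion to fund
nw <- rnorm(N)                    # newsworthiness
tw <- rnorm(N)                    # trustworthiness, simulated independently of nw
score    <- nw + tw               # total score used for selection
q        <- quantile(score, 1 - p)
selected <- score >= q
cor(nw[selected], tw[selected])   # negative correlation among funded proposals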

6M1. Modify the DAG on page 186 to include the variable V, an unobserved cause of C and Y: C ← V → Y. Reanalyze the DAG. How many paths connect X to Y? Which must be closed? Which variables should you condition on now?

# There are five paths connecting X to Y:
# 1. X -> Y (the causal path)
# 2. X <- U -> B <- C -> Y
# 3. X <- U -> B <- C <- V -> Y
# 4. X <- U <- A -> C -> Y
# 5. X <- U <- A -> C <- V -> Y
# We want to leave path 1 open and close all the others, since they are non-causal backdoor paths that confound inference.
# Paths 2 and 3 are already closed because B is a collider on them, so we should not condition on B.
# Path 5 is already closed because C is a collider on it.
# Path 4 is open and must be closed. U is unobserved, so the only option is to condition on A.
# In the chapter, where V is absent, conditioning on C was also acceptable. It no longer is: C is now a collider between A and V (and a child of the unobserved V), so conditioning on C would open path 5.
# Conclusion: condition on A only.
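A quick dagitty check of this reasoning (dagitty is loaded with rethinking); the adjustment set should come back as { A }:

library(dagitty)
dag_6M1 <- dagitty("dag {
  U [unobserved]
  V [unobserved]
  X -> Y
  X <- U <- A -> C -> Y
  U -> B <- C
  C <- V -> Y
}")
adjustmentSets(dag_6M1, exposure = "X", outcome = "Y")   # expect { A }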

6M2. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model predicting Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?

N <- 1000
X <- rnorm(N)           # predictor
Z <- rnorm(N, X, 0.1)   # Z is X plus a little noise, so cor(X, Z) is very high
Y <- rnorm(N, Z)        # Y depends only on Z (X -> Z -> Y)
cor(X,Z)
## [1] 0.9947315

The correlation is very high; X and Z are nearly the same variable. Now regress Y on both:

library(rethinking)
m_6M2 <- quap(
  alist(
    Y ~ dnorm( mu , sigma ),
    mu <- a + bX*X + bZ*Z,
    c(a,bX,bZ) ~ dnorm(0,1),
    sigma ~ dexp(1)
  ), data=list(X=X,Y=Y,Z=Z) )
precis( m_6M2 )
##              mean         sd        5.5%      94.5%
## a     0.019572463 0.03171272 -0.03111059 0.07025551
## bX    0.006784403 0.28750350 -0.45270171 0.46627052
## bZ    0.964994192 0.28646915  0.50716116 1.42282723
## sigma 1.002871111 0.02240962  0.96705621 1.03868601

The posterior standard deviations for bX and bZ are much larger than we would expect for such a large sample, so the usual symptom of multicollinearity is present. But the result is interpretable, because this is a case of conditioning on a post-treatment variable: Z mediates all of X's effect on Y, so once Z is in the model, X adds nothing and bX collapses toward zero while bZ stays near its true value. In the legs example, by contrast, the two predictors carry essentially the same information about the outcome, so the model can only identify their sum and neither coefficient makes sense on its own. The lesson is that multicollinearity cannot be interpreted outside a specific causal model; it is not only a property of the data.
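As a supplementary check (a sketch, not required by the question): regressing Y on X alone recovers X's total effect on Y, which flows entirely through Z and is close to 1 in this simulation. That confirms the near-zero bX above reflects conditioning on the mediator, not an absence of causal effect.

m_6M2_Xonly <- quap(
  alist(
    Y ~ dnorm(mu, sigma),
    mu <- a + bX*X,
    c(a, bX) ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = list(X = X, Y = Y)
)
precis(m_6M2_Xonly)   # bX should be close to 1, with a much smaller SD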