Multiple regression is no oracle, but only a golem. It is logical, but the relationships it describes are conditional associations, not causal influences. Therefore additional information, from outside the model, is needed to make sense of it. This chapter presented introductory examples of some common frustrations: multicollinearity, post-treatment bias, and collider bias. Solutions to these frustrations can be organized under a coherent framework in which hypothetical causal relations among variables are analyzed to cope with confounding. In all cases, causal models exist outside the statistical model and can be difficult to test. However, it is possible to reach valid causal inferences in the absence of experiments. This is good news, because we often cannot perform experiments, both for practical and ethical reasons.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity requested. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas.
Each question is worth 5 points.
6-1. Use the Waffle House data, data(WaffleDivorce), to find the total causal influence of number of Waffle Houses on divorce rate. Justify your model or models with a causal graph (draw a DAG).
# Load required packages and read in the data
library(rethinking)
library(dagitty)
data(WaffleDivorce)
str(WaffleDivorce)
## 'data.frame': 50 obs. of 13 variables:
## $ Location : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Loc : Factor w/ 50 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 9 8 10 ...
## $ Population : num 4.78 0.71 6.33 2.92 37.25 ...
## $ MedianAgeMarriage: num 25.3 25.2 25.8 24.3 26.8 25.7 27.6 26.6 29.7 26.4 ...
## $ Marriage : num 20.2 26 20.3 26.4 19.1 23.5 17.1 23.1 17.7 17 ...
## $ Marriage.SE : num 1.27 2.93 0.98 1.7 0.39 1.24 1.06 2.89 2.53 0.58 ...
## $ Divorce : num 12.7 12.5 10.8 13.5 8 11.6 6.7 8.9 6.3 8.5 ...
## $ Divorce.SE : num 0.79 2.05 0.74 1.22 0.24 0.94 0.77 1.39 1.89 0.32 ...
## $ WaffleHouses : int 128 0 18 41 0 11 0 3 0 133 ...
## $ South : int 1 0 0 1 0 0 0 0 0 1 ...
## $ Slaves1860 : int 435080 0 0 111115 0 0 0 1798 0 61745 ...
## $ Population1860 : int 964201 0 0 435450 379994 34277 460147 112216 75080 140424 ...
## $ PropSlaves1860 : num 0.45 0 0 0.26 0 0 0 0.016 0 0.44 ...
# Define DAG paths
# S = whether or not a State in the southern United States
# A = median age at marriage
# M = marriage rate
# W = number of Waffle Houses
# D = divorce rate
dag_6.1 <- dagitty("dag {
A -> D
A -> M -> D
A <- S -> M
S -> W -> D
}")
# Draw Dag
coordinates(dag_6.1) <- list(
x = c(A = 1, S = 1, M = 2, W = 3, D = 3),
y = c(A = 3, S = 1, M = 2, W = 1, D = 3)
)
drawdag(dag_6.1, lwd=2)
# Check on conditioning
adjustmentSets( dag_6.1 , exposure="W" , outcome="D" )
## { A, M }
## { S }
# Both adjustment sets close the back-door paths; I chose to condition on S, the smaller set.
raw_data <- WaffleDivorce
# Scale variables
raw_data$Divorce_Scaled <- scale(raw_data$Divorce)
raw_data$WaffleHouses_Scaled <- scale(raw_data$WaffleHouses)
raw_data$South_Scaled <- scale(raw_data$South)
# Build model
model <- quap(
alist (
Divorce_Scaled ~ dnorm(mu, sigma),
mu <- a + b_SS * South_Scaled + b_WS * WaffleHouses_Scaled,
a ~ dnorm(0, 0.2),
c(b_SS, b_WS) ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=raw_data)
# Result
precis(model, depth=2)
## mean sd 5.5% 94.5%
## a 1.467293e-06 0.10911695 -0.17438849 0.1743914
## b_SS 2.891975e-01 0.16594806 0.02398042 0.5544145
## b_WS 5.241965e-02 0.16588041 -0.21268928 0.3175286
## sigma 9.206707e-01 0.09087302 0.77543811 1.0659034
# This model shows that once we know whether a state is in the South, the number of Waffle Houses contributes no additional information about its divorce rate: the posterior for b_WS is centered near zero, so the total causal influence of Waffle Houses on divorce appears negligible.
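# Quick follow-up sketch: sample the posterior and inspect the implied
# distribution of the Waffle House coefficient directly (extract.samples()
# and dens() are rethinking helpers; the names follow the model above).
post <- extract.samples(model)
dens(post$b_WS, main="Posterior of b_WS")
abline(v=0, lty=2)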
6-2. Build a series of models to test the implied conditional independencies of the causal graph you used in the previous problem. If any of the tests fail, how do you think the graph needs to be amended? Does the graph need more or fewer arrows? Feel free to nominate variables that aren’t in the data.
# Implied conditional independencies
impliedConditionalIndependencies(dag_6.1)
## A _||_ W | S
## D _||_ S | A, M, W
## M _||_ W | S
# Reset data
raw_data <- WaffleDivorce
# Scale variables
raw_data$S <- scale(raw_data$South)
raw_data$A <- scale(raw_data$MedianAgeMarriage)
raw_data$W <- scale(raw_data$WaffleHouses)
raw_data$D <- scale(raw_data$Divorce)
raw_data$M <- scale(raw_data$Marriage)
# Model A _||_ W | S
model_a <- quap(
alist(
A ~ dnorm(mu,sigma),
mu <- a + bS * S + bW * W,
a ~ dnorm(0, 0.2),
c(bS, bW) ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=raw_data
)
# Results
precis(model_a)
## mean sd 5.5% 94.5%
## a -1.920090e-06 0.1113607 -0.1779779 0.1779740170
## bS -2.716977e-01 0.1701329 -0.5436029 0.0002074713
## bW 6.350218e-02 0.1700697 -0.2083021 0.3353064713
## sigma 9.479862e-01 0.0935625 0.7984552 1.0975171181
# The posterior for bW is centered on zero, consistent with A _||_ W | S.
# Model D _||_ S | A, M, W
model_d <- quap(
alist(
D ~ dnorm(mu,sigma),
mu <- a + bA * A + bS * S + bM * M + bW * W,
a ~ dnorm(0, 0.2),
c(bA, bS, bM, bW) ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=raw_data
)
# Results
precis(model_d)
## mean sd 5.5% 94.5%
## a -9.640115e-08 0.09456279 -0.15112970 0.1511295
## bA -5.512418e-01 0.15155087 -0.79344934 -0.3090342
## bS 1.451638e-01 0.14437772 -0.08557969 0.3759073
## bM -3.628367e-02 0.14737935 -0.27182433 0.1992570
## bW 8.659448e-02 0.14050137 -0.13795384 0.3111428
## sigma 7.588385e-01 0.07520228 0.63865077 0.8790263
# The posterior for bS overlaps zero, consistent with D _||_ S | A, M, and W.
# Model M _||_ W | S
model_m <- quap(
alist(
M ~ dnorm(mu,sigma),
mu <- a + bS * S + bW * W,
a ~ dnorm(0, 0.2),
c(bS, bW) ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=raw_data
)
# Results
precis(model_m)
## mean sd 5.5% 94.5%
## a -6.660963e-06 0.11363644 -0.1816196 0.1816063
## bS 1.006221e-01 0.17425350 -0.1778686 0.3791129
## bW -3.886694e-02 0.17424508 -0.3173442 0.2396103
## sigma 9.764592e-01 0.09626225 0.8226135 1.1303049
# The posterior for bW is centered on zero, consistent with M _||_ W | S.
# Overall, all of the implied conditional independencies of the causal graph from the previous problem pass their tests, so the graph needs no amendment: neither more nor fewer arrows.
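# Cross-check (a sketch using dagitty's built-in helper): localTests() runs
# every implied conditional independence at once via partial correlations.
# Estimates near zero, with intervals covering zero, agree with the three
# quap models above. as.numeric() strips the matrix class left by scale().
test_df <- data.frame(
  S = as.numeric(raw_data$S),
  A = as.numeric(raw_data$A),
  M = as.numeric(raw_data$M),
  W = as.numeric(raw_data$W),
  D = as.numeric(raw_data$D)
)
localTests(dag_6.1, data=test_df, type="cis")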
6-3. For the DAG you made in the previous problem, can you write a data generating simulation for it? Can you design one or more statistical models to produce causal estimates? If so, try to calculate interesting counterfactuals. If not, use the simulation to estimate the size of the bias you might expect. Under what conditions would you, for example, infer the opposite of a true causal effect?
# Number of data points to simulate
N_POINTS <- 1e3
# Set seed
set.seed(42)
# Simulate data for S, W, and M (note: this toy data-generating process gives
# W multiplicative noise in S, and lets W affect M directly)
s_sim <- rnorm(N_POINTS, 0, 1)
w_sim <- s_sim * rnorm(N_POINTS, 0, 1) + rnorm(N_POINTS, 0, 1)
m_sim <- s_sim - w_sim + rnorm(N_POINTS, 0, 1)
data_sim <- data.frame(s_sim, w_sim, m_sim)
model_m <- quap(
alist(
m_sim ~ dnorm(mu,sigma),
mu <- a + bS * s_sim + bW * w_sim,
a ~ dnorm(0, 0.2),
c(bS, bW) ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=data_sim
)
# Results
precis(model_m)
## mean sd 5.5% 94.5%
## a -0.02086586 0.03088000 -0.07021806 0.02848634
## bS 1.01063427 0.03112141 0.96089625 1.06037230
## bW -1.02088850 0.02161917 -1.05544012 -0.98633689
## sigma 0.98802608 0.02207668 0.95274328 1.02330889
# The model recovers the effects built into the simulation (bS ≈ 1, bW ≈ −1). Note, however, that this simulation gives W a direct effect on M, an arrow dag_6.1 does not contain; under that DAG, M should be independent of W given S. A simulation faithful to the graph would drop w_sim from the m_sim equation, and bW would then center on zero.
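# A fuller sketch that simulates the entire DAG from 6-1, with hypothetical
# effect sizes chosen for illustration (not estimated from the data) and a
# *zero* true effect of W on D. Because S is a common cause of W and D
# (through A and M), regressing D on W alone picks up a small but reliably
# positive slope even though the true effect is zero; conditioning on S
# removes it. If the true effect of W were small and negative, this
# unadjusted bias could even flip the sign of the estimate. Plain lm() is
# used for brevity; quap() with the priors above would give similar results.
set.seed(42)
N <- 1e4
S <- rbinom(N, 1, 0.3)                     # southern state indicator
A <- rnorm(N, -0.5 * S)                    # South lowers median age at marriage
M <- rnorm(N, 0.5 * S - 0.3 * A)           # marriage rate
W <- rnorm(N, 1.5 * S)                     # more Waffle Houses in the South
D <- rnorm(N, -0.4 * A + 0.2 * M + 0 * W)  # true causal effect of W is zero
full_sim <- data.frame(S, A, M, W, D)
round(coef(lm(D ~ W, data=full_sim)), 2)      # biased: confounded by S
round(coef(lm(D ~ W + S, data=full_sim)), 2)  # adjusted: recovers ~0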
All three problems below are based on the same data. The data in data(foxes) are 116 foxes from 30 different urban groups in England. These foxes are like street gangs. Group size varies from 2 to 8 individuals. Each group maintains its own urban territory. Some territories are larger than others. The area variable encodes this information. Some territories also have more avgfood than others. We want to model the weight of each fox. For the problems below, assume the following DAG:
6-4. Use a model to infer the total causal influence of area on weight. Would increasing the area available to each fox make it heavier (healthier)? You might want to standardize the variables. Regardless, use prior predictive simulation to show that your model’s prior predictions stay within the possible outcome range.
# Read in and check the data
data(foxes)
raw_data <- foxes
str(raw_data)
## 'data.frame': 116 obs. of 5 variables:
## $ group : int 1 1 2 2 3 3 4 4 5 5 ...
## $ avgfood : num 0.37 0.37 0.53 0.53 0.49 0.49 0.45 0.45 0.74 0.74 ...
## $ groupsize: int 2 2 2 2 2 2 2 2 3 3 ...
## $ area : num 1.09 1.09 2.05 2.05 2.12 2.12 1.29 1.29 3.78 3.78 ...
## $ weight : num 5.02 2.84 5.33 6.07 5.85 3.25 4.53 4.09 6.13 5.59 ...
# Scale variables
raw_data$avgfood_scaled <- scale(raw_data$avgfood)
raw_data$groupsize_scaled <- scale(raw_data$groupsize)
raw_data$area_scaled <- scale(raw_data$area)
raw_data$weight_scaled <- scale(raw_data$weight)
# Create dag
dag_6.4 <- dagitty("dag {
Area -> AverageFood -> GroupSize -> Weight
AverageFood -> Weight
}")
# Draw dag
drawdag(dag_6.4, lwd=2)
# Check on conditioning
adjustmentSets(dag_6.4, exposure = "Area", outcome = "Weight")
## {}
# No conditioning is needed, since there are no back-door paths between area and weight.
# Model
model <- quap(
alist(
weight_scaled ~ dnorm(mu,sigma),
mu <- a + bA * area_scaled,
a ~ dnorm(0, 0.2),
bA ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=raw_data
)
# Results
precis(model)
## mean sd 5.5% 94.5%
## a 6.232027e-08 0.08360856 -0.1336226 0.1336227
## bA 1.883312e-02 0.09089568 -0.1264357 0.1641020
## sigma 9.912645e-01 0.06466623 0.8879154 1.0946136
# Mathematical model
# weight_i ~ Normal(μ_i, σ)
# μ_i = α + β_A * area_i
# α ~ Normal(0, 0.2)
# β_A ~ Normal(0, 0.5)
# σ ~ Exponential(1)
# Set seed
set.seed(42)
# Simulate the priors chosen
NUM_POINTS <- 1000
alpha <- rnorm(NUM_POINTS, 0, 0.2)
beta <- rnorm(NUM_POINTS, 0, 0.5)
sigma <- rexp(NUM_POINTS, 1) # prior draws for sigma (unused in the mean lines below)
# Simulate weight from 1000 areas input
plot(NULL,
xlim=c(-2, 2),
ylim=c(-4, 4),
main='Weight of 1,000 Foxes Given Standardized Area',
xlab='Area',
ylab='Simulated Weight')
for (i in 1:1000) curve(alpha[i] + beta[i] * x,
from=-2,
to=2,
add = TRUE,
col=col.alpha("black"))
# Prior predictive simulation shows that the model's prior regression lines stay within the plausible range of the standardized outcome (roughly ±2 SD). Given the empty adjustment set from the DAG, the bivariate model estimates the total causal effect of area, and its posterior (bA centered on zero) suggests that increasing the area available to each fox would not make it heavier.
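# Alternative check (sketch): extract.prior() samples the model's actual
# priors instead of re-typing them by hand; link() then computes the implied
# prior mean lines over the standardized range of area.
prior <- extract.prior(model)
mu_prior <- link(model, post=prior, data=list(area_scaled=c(-2, 2)))
plot(NULL, xlim=c(-2, 2), ylim=c(-4, 4),
     xlab="Area (std)", ylab="Weight (std)",
     main="Prior predictive lines via extract.prior()")
for (i in 1:50) lines(c(-2, 2), mu_prior[i, ], col=col.alpha("black", 0.3))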
6-5. Now infer the causal impact of adding food to a territory. Would this make foxes heavier? Which covariates do you need to adjust for to estimate the total causal influence of food?
# Check on conditioning
adjustmentSets(dag_6.4, exposure = "AverageFood", outcome = "Weight")
## {}
# Model
model <- quap(
alist(
weight_scaled ~ dnorm(mu,sigma),
mu <- a + bA * avgfood_scaled,
a ~ dnorm(0, 0.2),
bA ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=raw_data
)
# Results
precis(model)
## mean sd 5.5% 94.5%
## a -9.994121e-10 0.08360017 -0.1336092 0.1336092
## bA -2.421164e-02 0.09088502 -0.1694634 0.1210402
## sigma 9.911440e-01 0.06465858 0.8878071 1.0944809
# Based on the adjustment set (empty), no covariates need to be adjusted for: the bivariate regression of weight on average food already estimates the total causal influence of food. That estimate is centered on zero, so adding food to a territory would not, in total, make foxes heavier.
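# Counterfactual sketch: with an empty adjustment set, bA is itself the total
# causal effect, so the expected change in standardized weight from an
# intervention that adds one standard deviation of food is just the posterior
# of bA.
post <- extract.samples(model)
quantile(post$bA, probs=c(0.055, 0.5, 0.945))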
6-6. Now infer the causal impact of group size. Which covariates do you need to adjust for? Looking at the posterior distribution of the resulting model, what do you think explains these data? That is, can you explain the estimates for all three problems? How do they go together?
# Check on conditioning
adjustmentSets(dag_6.4, exposure = "GroupSize", outcome = "Weight")
## { AverageFood }
# Model
model <- quap(
alist(
weight_scaled ~ dnorm(mu,sigma),
mu <- a + bS * groupsize_scaled + bA * avgfood_scaled,
a ~ dnorm(0, 0.2),
c(bS, bA) ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=raw_data
)
# Results
precis(model)
## mean sd 5.5% 94.5%
## a -0.0001587063 0.08015132 -0.1282560 0.1279386
## bS -0.5737136856 0.17916711 -0.8600573 -0.2873700
## bA 0.4772036137 0.17915143 0.1908850 0.7635222
## sigma 0.9422295490 0.06178292 0.8434885 1.0409706
# Based on the adjustment set, we need to condition on average food to estimate the causal effect of group size. The posterior for group size (bS) is clearly negative: holding food constant, foxes in larger groups are lighter. At the same time, the posterior for average food (bA) is clearly positive once group size is held constant, even though the previous exercise showed no such relationship. This explains all three problems: food has a positive direct effect on weight and a negative indirect effect through group size (more food supports larger groups), and the two paths nearly cancel, which is why the total effects of area (6-4) and food (6-5) came out near zero. The pattern is a masked relationship (two canceling paths), not post-treatment bias.
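# To see the masking directly, a sketch that refits the food-only model and
# the food + group size model under distinct names so coeftab() can show
# them side by side (bA moves from ~0 to clearly positive once group size
# is included):
m_food_only <- quap(
  alist(
    weight_scaled ~ dnorm(mu, sigma),
    mu <- a + bA * avgfood_scaled,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data=raw_data
)
m_food_size <- quap(
  alist(
    weight_scaled ~ dnorm(mu, sigma),
    mu <- a + bA * avgfood_scaled + bS * groupsize_scaled,
    a ~ dnorm(0, 0.2),
    c(bA, bS) ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data=raw_data
)
coeftab(m_food_only, m_food_size)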
6-7. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model predicting Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?
# Set seed
set.seed(42)
# Generate highly correlated data
N_POINTS <- 1000
x <- rnorm(N_POINTS)
z <- rnorm(N_POINTS, x, 0.1)
y <- rnorm(N_POINTS, z)
# Format as dataframe
df <- data.frame(x=x, y=y, z=z)
# Get correlation
data_corr <- cor(df$x, df$z)
data_corr
## [1] 0.9952075
# The correlation between x and z is about 0.995.
# Create dag
dag_6.7 <- dagitty("dag {
X -> Z -> Y
}")
# Draw dag
drawdag(dag_6.7, lwd=2)
# Model - regress y on x and z
model <- quap(
alist(
y ~ dnorm(mu,sigma),
mu <- a + bX * x + bZ * z,
a ~ dnorm(0, 0.2),
c(bX, bZ) ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=df
)
# Results
precis(model)
## mean sd 5.5% 94.5%
## a -0.003896407 0.03210531 -0.05520690 0.04741408
## bX 0.312823696 0.24270205 -0.07506106 0.70070845
## bZ 0.664202563 0.24130912 0.27854398 1.04986114
## sigma 1.028247737 0.02297729 0.99152559 1.06496988
# Based on the modeling results, the estimates for x and z have reasonably narrow posterior distributions: bZ is clearly positive, while bX's interval overlaps zero. This differs from the legs example, where the posterior standard deviations were larger than the estimated magnitudes and the two coefficients overlapped heavily. So even though we deliberately simulated a very high correlation between x and z (r ≈ 0.995), we do not see the pathological multicollinearity of the legs example.
# The reason is the causal structure. In the legs example, two arrows point into height, so both predictors carry the same information about the outcome and the model cannot apportion the association between them. In the DAG X → Z → Y, Z mediates all of X's influence on Y: once we condition on Z, X contributes nothing further, so the posterior concentrates on bZ while bX overlaps zero. Multicollinearity therefore depends not just on the pairwise correlation between predictors but on the causal model that generated the data.
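# Follow-up sketch: the DAG implies Z screens off X from Y, so the model the
# graph recommends regresses y on z alone; it recovers essentially the same
# slope with a much tighter posterior than the two-predictor model above.
model_z <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + bZ * z,
    a ~ dnorm(0, 0.2),
    bZ ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data=df
)
precis(model_z)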