The Haunted DAG & The Causal Terror

Key Idea: Selection Processes Can Create Correlations

  • Observation: The most newsworthy studies often appear to be the least trustworthy.
  • Hypothesis: This pattern arises not because of a true relationship, but due to the act of selection.

Example: Grant Review and Selection Bias

  • Suppose:
    • 200 grant proposals are evaluated.
    • Two criteria are assessed: Newsworthiness (nw) and Trustworthiness (tw).
    • Both criteria are uncorrelated in the population.
    • The top 10% of combined scores (\(nw + tw\)) are selected for funding.

Question:

  • After selection, are nw and tw still uncorrelated?

Simulation: How Selection Creates Correlation

Results:

Before Selection:

  • No correlation between nw and tw.
  • Regression line is flat.

After Selection:

  • Correlation appears among selected proposals.
  • Selection favors high combined scores (\(nw + tw\)).

Distortion:

  • The selection process creates an artificial correlation.

Why Does This Happen?

This phenomenon, known as Berkson’s Paradox or the Selection-Distortion Effect, occurs because: - Selection conditions on the sum (\(nw + tw\)), introducing dependence between nw and tw.

Broader Implications:

  • Everywhere in Life: Restaurants with bad food in good locations (and vice versa) are examples of selection-distortion.
  • In Regression Models: Adding predictors can induce statistical selection, a form of collider bias. This bias can mislead us into believing relationships exist where they do not.

Key Takeaways:

  • Selection Can Create Illusory Correlations: Simply combining unrelated factors can produce spurious relationships.
  • Regression Models Are Susceptible: Adding predictors can induce collider bias, especially when conditioning on a variable.

This example highlights why understanding causal structures and conditioning effects is critical for responsible regression modeling.

Multicollinearity

Key Idea: Challenges of Multiple Predictors

  • Context: Many potential predictor variables may be available, e.g., 7 predictors in the primate milk data.
  • Problem: Including all predictors may lead to issues like multicollinearity.
  • Multicollinearity:
    • A strong association between two or more predictors (conditional on other variables).
    • Leads to uncertainty in parameter estimates, even if all predictors are strongly associated with the outcome.
    • Not a problem for prediction but can hinder interpretability.

Example: Predicting Height Using Leg Length

Simulation: Height vs. Left and Right Leg Length

# Simulated data
N <- 100  # number of individuals
set.seed(909)
height <- rnorm(N,10,2)  # sim total height of each
leg_prop <- runif(N,0.4,0.5)  # leg as proportion of height
leg_left <- leg_prop*height + rnorm( N , 0 , 0.02 )  # left leg + error
leg_right <- leg_prop*height + rnorm( N , 0 , 0.02 )  # right leg + error

# Combine into data frame
d <- data.frame(height, leg_left, leg_right)

Model: Using Both Legs as Predictors

plot(precis(m6.1))

Observations

Key Question:

  • What is the value of knowing one leg’s length after knowing the other leg’s length?

Result:

  • The posterior distribution for coefficients \(\beta_l\) (left leg) and \(\beta_r\) (right leg) shows:
    • A high correlation between \(\beta_l\) and \(\beta_r\), forming a narrow ridge of plausible values.
    • The model cannot disentangle the contributions of each leg.

Visualizing the Posterior

# Sample from the posterior
post <- extract.samples(m6.1)


# Add transparency and color
rangi2 <- rgb(0.2, 0.4, 0.8, alpha = 1) # Blue color
col_alpha <- function(color, alpha) adjustcolor(color, alpha.f = alpha)

# Set up side-by-side plots
par(mfrow = c(1, 2)) # 1 row, 2 columns

# Left Panel: Scatter plot of bl vs br
plot(post$bl, post$br, col = col_alpha(rangi2, 0.5), pch = 16,
     xlab = "Left leg coefficient (bl)", 
     ylab = "Right leg coefficient (br)",
     main = "Scatter: bl vs br")

# Right Panel: Density plot of bl + br
bl_br_sum <- post$bl + post$br
plot(density(bl_br_sum), col = "blue", lwd = 2,
     main = "Density: bl + br",
     xlab = "Sum of bl and br", 
     ylab = "Density",
     type = "l") # Only line plot

What has happened here is that since both leg variables contain almost exactly the same information, if you insist on including both in a model, then there will be a practically infinite number of combinations of bl and br that produce the same predictions.

One way to think of this phenomenon is that you have approximated this model:

\[y_i \sim \mathcal{N}(\mu_i,\,\sigma)\] \[\mu_i = \alpha + \beta_1 \times \text{Leg Length}_i + \beta_2 \times \text{Leg Length}_i\] \[\mu_i = \alpha + (\beta_1 + \beta_2) \times \text{Leg Length}_i\]

The parameters β1 and β2 cannot be pulled apart, because they never separately influence the mean µ. Only their sum, β1+β2,influences µ. So this means the posterior distribution ends up reporting the very large range of combinations of β1 and β2 that make their sum close to the actual association of x with y.

And the posterior distribution in this simulated example has done exactly that: It has produced a good estimate of the sum of bl and br. But it has done so at the expense of producing a very wide range of plausible values for bl and br separately. This is the essence of multicollinearity. It is not a problem with the model. It is a problem with the question. The model is answering the question you asked, but the question is not what you thought it was.

Simpler Model: One Leg as Predictor

m6.2 <- quap(
  alist(
    height ~ dnorm( mu , sigma ) ,
    mu <- a + bl*leg_left,
    a ~ dnorm( 10 , 100 ) ,
    bl ~ dnorm( 2 , 10 ) ,
    sigma ~ dexp( 1 )
  ) ,
  data=d )

precis(m6.2)

Multicollinearity in Primate Milk Data: Standardizing Variables

Key Points:

  • Multicollinearity: Occurs when predictors in a regression model are highly correlated, impacting the reliability of the model.
  • Standardizing Variables: A common technique to address multicollinearity by transforming variables to have a mean of zero and a standard deviation of one.

Why It Matters:

  • Multicollinearity can obscure the true relationships between variables, leading to misleading conclusions.
  • Standardizing variables helps to mitigate the effects of multicollinearity, making the model more reliable and interpretable.

Example:

  • In the context of primate milk data, standardizing variables can help to better understand the relationships between different predictors and the outcome variable.

Steps to Standardize Variables:

  1. Calculate the Mean and Standard Deviation: For each predictor variable.
  2. Transform the Variables: Subtract the mean and divide by the standard deviation for each value in the predictor variable.
data(milk)
d <- milk
d$K <- standardize( d$kcal.per.g )
d$F <- standardize( d$perc.fat )
d$L <- standardize( d$perc.lactose )

Models: Using One Predictor

#kcal.per.g regressedonperc.fat
 m6.3<-quap(
   alist(
     K~dnorm(mu,sigma),
     mu<-a+bF*F,
     a~dnorm(0,0.2),
     bF~ dnorm(0,0.5),
     sigma~dexp(1)
   ),
   data=d)
 

#kcal.per.g regressedonperc.lactose
 m6.4<-quap(
   alist(
     K~dnorm(mu,sigma),
     mu<-a+bL*L,
     a~dnorm(0,0.2),
     bL~ dnorm(0,0.5),
     sigma~dexp(1)
  ),data=d)
 
 precis(m6.3)
 
 precis(m6.4)

What happens when we fit a model with both predictors?

Model: Using Both Predictors


#kcal.per.g regressedonperc.fatandperc.lactose
 m6.5<-quap(
   alist(
     K~dnorm(mu,sigma),
     mu<-a+bF*F+bL*L,
     a~dnorm(0,0.2),
     bF~ dnorm(0,0.5),
     bL~ dnorm(0,0.5),
     sigma~dexp(1)
  ),data=d)
 
 precis(m6.5)

Diagnosing Multicollinearity

pairs( ~ kcal.per.g + perc.fat + perc.lactose , data=d , col=rangi2 )

Dodgy Procedures for Multicollinearity

  • In some fields, a common but flawed approach to handling multicollinearity involves:
    1. Inspecting pairwise correlations between predictors.
    2. Dropping highly correlated variables before modeling.
  • Why is this a mistake?
    • Pairwise correlations are not the issue; conditional associations within the model are what matter.
    • The right approach depends on the cause of the collinearity, which requires a causal perspective.
  • Example of dodginess:
    • Imagine analyzing human health data and finding that “exercise” and “calorie intake” are highly correlated.
    • Dropping one might ignore important causal relationships—exercise and calorie intake both influence weight but in different ways.

Core Tradeoff in Milk Composition

In the milk example, a natural tradeoff likely explains the collinearity between fat and lactose:

  • Frequent nursing:
    • Milk tends to be watery and low in energy.
    • High in sugar (lactose) to provide quick energy.
  • Infrequent nursing:
    • Milk needs to be more energy-dense.
    • High in fat to sustain the infant for longer periods between feedings.
  • This tradeoff suggests a causal model: nursing frequency drives milk composition. High lactose and high fat are inversely related because they serve different energetic strategies.

# Define the DAG with D as latent
dag <- dagitty( "dag{
  L -> K <- F
  L <- D -> F
  D [unobserved]
}" )

# Define coordinates for visualization
coordinates(dag) <- list(
  x = c(L = 0, K = 1, D = 1, F = 2),
  y = c(L = 0.5, K = 1, D = 0.5, F = 0.5)
)

plot( dag )

circle_points(c("D"), dag)

Key Takeaways

Avoid Pairwise Correlation Tests:

  • It’s conditional associations, not pairwise correlations, that matter.

Understand the Causal Model:

  • Collinearity often arises from underlying causal processes.
  • In the milk example, fat and lactose are linked by nursing frequency, a latent variable.

Post-treatment Bias

Key Idea: Mistakes From Including Variables

  • It’s common to worry about omitting predictor variables (omitted variable bias).
  • Less attention is given to mistakes from including variables (included variable bias):
    • Adding variables without considering causality can ruin randomized experiments.
    • Example: Post-treatment bias occurs when a predictor is influenced by the treatment or outcome.

Example: Plant Growth Experiment

  • Scenario:
    • Goal: Measure the causal effect of soil treatment on plant growth.
    • Variables:
      1. Initial height (\(H_0\)): Pre-treatment height.
      2. Final height (\(H_1\)): Post-treatment height.
      3. Treatment (\(T\)): Soil treatment.
      4. Fungus (\(F\)): Fungus presence.
  • Problem: Including \(F\) (fungus) induces bias because \(T \to F \to H_1\):
    • Treatment influences fungus.
    • Fungus influences growth (\(H_1\)).
    • Controlling for \(F\) blocks the causal path \(T \to H_1\).

Simulated Data

set.seed(1914)
N <- 100
h0 <- rnorm(N, 10, 2)
treatment <- rep(0:1, each = N/2)
fungus <- rbinom(N, size = 1, prob = 0.5 - treatment * 0.4)
h1 <- h0 + rnorm(N, 5 - 3 * fungus)
d <- data.frame(h0 = h0, h1 = h1, treatment = treatment, fungus = fungus)
precis(d)
'data.frame': 100 obs. of 4 variables:

Initial Growth Model Without Predictors

Model Definition

Goal: Model proportional growth (\(p = \frac{H_1}{H_0}\)).

Model:

  • \(h_{1,i} \sim N(\mu_i, \sigma)\)
  • \(\mu_i = h_{0,i} \times p\)
  • \(p \sim \text{LogNormal}(0, 0.25)\): Prior centered around no growth (\(p = 1\)).

Simulation and Fitting


# Simulate prior from model
sim_p <- rlnorm( 1e4 , 0 , 0.25 )

precis( data.frame(sim_p) )
'data.frame': 10000 obs. of 1 variables:
hist(sim_p)

So this prior expects anything from 40% shrinkage up to 50% growth. Let’s fit this model, so you can see how it just measures the average growth in the experiment.

# Fit the model
m6.6 <- quap(
  alist(
    h1 ~ dnorm( h0 * p , sigma ),
    mu <- h0 * p,
    p ~ dlnorm( 0 , 0.25 ),
    sigma ~ dexp( 1 )
  ),
  data = d
)

precis(m6.6)
NA

Output:

Adding Treatment and Fungus

Model Definition

Extend \(p\) to depend on treatment and fungus: - \(h_{1,i} \sim N(\mu_i, \sigma)\) - \(\mu_i = h_{0,i} \times p\) - \(p = a + b_t \times T_i + b_f \times F_i\)

Priors:

  • \(a \sim \text{LogNormal}(0, 0.2)\)
  • \(b_t, b_f \sim N(0, 0.5)\)
  • \(\sigma \sim \text{Exponential}(1)\)

Model Fitting

The proportion of growth p is now a function of the predictor variables. It looks like any other linear model. The priors on the slopes are almost certainly too flat. They place 95% of the prior mass between −1 (100% reduction) and +1 (100% increase) and two-thirds of the prior mass between −0.5 and +0.5.

# Fit the model
 m6.7<-quap(
   alist(
     h1~ dnorm(mu,sigma),
     mu<-h0*p,
     p<- a+bt*treatment+bf*fungus,
     a~dlnorm(0,0.2),
     bt~ dnorm(0,0.5),
     bf~ dnorm(0,0.5),
     sigma~dexp(1)
     
    ), data=d
   
   )

precis(m6.7)

Output:

Why?

  • Fungus is a post-treatment variable. Controlling for it blocks the causal path \(T \rightarrow H_1\).

Blocked by consequence. The problem is that fungus is mostly a consequence of treatment. This is to say that fungus is a post-treatment variable. So when we control for fungus, the model is implicitly answering the question: Once we already know whether or not a plant developed fungus, does soil treatment matter? The answer is “no,” because soil treatment has its effects on growth through reducing fungus. But we actually want to know, based on the design of the experiment, is the impact of treatment on growth. To measure this properly, we should omit the post-treatment variable fungus. Here’s what the inference looks like in that case.

Correct Inference: Excluding Fungus

Adjusted Model

Exclude \(F\) to correctly measure the treatment effect: - \(p = a + b_t \times T_i\)

Model Fitting

# Fit the model
 m6.8<-quap(
   alist(
     h1~ dnorm(mu,sigma),
     mu<-h0*p,
     p<- a+bt*treatment,
     a~dlnorm(0,0.2),
     bt~ dnorm(0,0.5),
     sigma~dexp(1)
     
    ), data=d
   
   )

precis(m6.8)

Output:

Now the impact of treatment is clearly positive, as it should be. It makes sense to control for pre-treatment differences, like the initial height h0, that might mask the causal influence of treatment. But including post-treatment variables can actually mask the treatment itself. This doesn’t mean you don’t want the model that includes both treatment and fungus. The fact that including fungus zeros the coefficient for treatment suggests that the treatment works for exactly the anticipated reasons. It tells us about the mechanism. But a correct inference about the treatment still depends upon omitting the post-treatment variable.

Visualizing With a DAG

Fungus and d-separation. It helps to look at this problem in terms of a DAG. In this case, I’ll show you how to draw it using the dagitty R package, because we are going to use that package now to do some graph analysis.

Causal Diagram

Graph: - \(T \rightarrow F \rightarrow H_1\) - \(H_0 \rightarrow H_1\)

The causal diagram for this scenario is shown below:

# Define the DAG with D as latent
plant_dag <- dagitty( "dag {
    H_0-> H_1
    F-> H_1
    T-> F
 }")


 coordinates( plant_dag ) <- list( x=c(H_0=0,T=2,F=1.5,H_1=1) ,
                                   y=c(H_0=0,T=0,F=0,H_1=0) 
                                 )
 drawdag( plant_dag )

NA

Key Insight:

Key Takeaways

Collider Bias

Key Idea: Selection and Collider Bias

  • Selection processes can create misleading associations in statistical models, known as collider bias.
  • Collider: A variable influenced by two or more other variables (e.g., trustworthiness and newsworthiness both affect selection).

Example: Selection in Grant Proposals

Causal Diagram

  • Trustworthiness (\(T\)) and Newsworthiness (\(N\)) both influence Selection (\(S\)).
  • Selection (\(S\)) is a collider.
grant_dag <- dagitty("dag {
    T -> S
    N -> S
}")
coordinates(grant_dag) <- list(
  x = c(T = 0, N = 2, S = 1),
  y = c(T = 0, N = 0, S = 0)
)
drawdag(grant_dag)

Explanation

Example: Aging, Happiness, and Marriage

Causal Diagram

  • Happiness (\(H\)) and Age (\(A\)) both influence Marriage (\(M\)).
  • \(M\) is a collider.
# Create the DAG mentioned in the paragraph above
happiness_dag <- dagitty( "dag {
    A -> M
    H -> M
 }")

coordinates( happiness_dag ) <- list( x=c(A=0, H=2, M=1) ,
                                   y=c(A=0, H=0, M=0)
                                 )

drawdag( happiness_dag )

Explanation

Simulating the Data

Simulation Design

  • Each year, 20 individuals are born with uniformly distributed happiness.
  • Individuals age and may get married:
    • Happiness does not change with age.
    • Probability of marriage increases with happiness and age.
  • Married individuals remain married. Individuals over 65 are removed (They move to Spain.)
# Simulate data
d <- sim_happiness( seed=1977 , N_years=1000 )

precis(d)
'data.frame': 1300 obs. of 3 variables:

Plotting the Data

# Plot happiness vs. age, colored by marriage status
plot( d$age , d$happiness , col=d$married+1 , pch=16 ,
      xlab="Age" , ylab="Happiness" )
legend( "topleft" , legend=c("Not Married","Married") , pch=16 , col=1:2 )

Observation

Multiple Regression: Conditioning on Marriage

Model Definition

Linear model: - \(\mu_i = \alpha_{MID[i]} + \beta_A \times A_i\) - \(MID[i]\): Marriage status index.

Rescale age (\(A\)) for adults (18–65): - \(A = \frac{\text{age} - 18}{65 - 18}\)

# Rescale age
d2 <- d[ d$age>17 , ] # only adults

d2$A <- ( d2$age- 18 ) / ( 65- 18 )

Model Fitting

# Create marriage status index variable
d2$MID <- d2$married + 1

# Fit the model
m6.9 <- quap(
  alist(
    happiness ~ dnorm( mu , sigma ) ,
    mu <- a[MID] + bA * A ,
    a[MID] ~ dnorm( 0 , 1 ) ,
    bA ~ dnorm( 0 , 2 ) ,
    sigma ~ dexp( 1 )
  ) ,
  data=d2
)

precis(m6.9, depth=2)

Output:

Regression Without Conditioning on Marriage

# Fit the model without marriage status
m6.10 <- quap(
  alist(
    happiness ~ dnorm( mu , sigma ) ,
    mu <- a + bA * A ,
    a ~ dnorm( 0 , 1 ) ,
    bA ~ dnorm( 0 , 2 ) ,
    sigma ~ dexp( 1 )
  ) ,
  data=d2
)

precis(m6.10, depth=1)

Output:

It’s easy to plead with this example. Shouldn’t marriage also influence happiness? What if happiness does change with age? But this misses the point. If you don’t have a causal model, you can’t make inferences from a multiple regression. And the regression itself does not provide the evidence you need to justify a causal model. Instead, you need some science.

Key Takeaways

Take-Home Messages

Key Lessons:

  • Causal Understanding is Crucial:
    • Statistical models cannot replace a thoughtful causal framework.
    • DAGs provide a way to structure and communicate assumptions.
  • Selection and Collider Bias:
    • Conditioning on colliders creates misleading associations.
    • Examples like marriage in the happiness study or fungus in the plant growth example demonstrate this.
  • Avoid Over-Conditioning:
    • Adding variables without a causal perspective can mask or distort true relationships.
  • Multicollinearity:
    • Predictors with shared information can lead to wide posterior distributions.
    • Interpretability suffers, even if prediction is still effective.
  • Think Causally, Not Just Statistically:
    • Regression models answer the questions you ask—ensure your question reflects your research goals.
---
title: "Statistical Rethinking - Chapter 6"
output: html_notebook
---


```{r, echo=False, message=False, warning=TRUE}
# Import the necessary libraries for the analysis.

# Check and install missing packages
required_packages <- c("ggplot2", "dplyr", "rethinking", "dagitty")
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}


# Load necessary libraries
library(rethinking)
library(ggplot2)
library(dagitty)
library(ggplot2)
library(dplyr)

library(htmltools)

```


## The Haunted DAG & The Causal Terror

### Key Idea: Selection Processes Can Create Correlations
- **Observation**: The most newsworthy studies often appear to be the least trustworthy.
- **Hypothesis**: This pattern arises not because of a true relationship, but due to the act of **selection**.

---

### Example: Grant Review and Selection Bias
- Suppose:
  - 200 grant proposals are evaluated.
  - Two criteria are assessed: **Newsworthiness (nw)** and **Trustworthiness (tw)**.
  - Both criteria are **uncorrelated** in the population.
  - The top 10% of combined scores (\(nw + tw\)) are selected for funding.
  
#### Question:
- After selection, are **nw** and **tw** still uncorrelated?

---

### Simulation: How Selection Creates Correlation

```{r, echo=False}
# Simulated data
set.seed(1914)
N <- 200 # Number of grant proposals
p <- 0.1 # Proportion to select

# Uncorrelated newsworthiness and trustworthiness
nw <- rnorm(N)
tw <- rnorm(N)

# Select top 10% of combined scores
s <- nw + tw # Total score
q <- quantile(s, 1 - p) # Top 10% threshold
selected <- ifelse(s >= q, TRUE, FALSE)

# Set up side-by-side plots
par(mfrow = c(1, 2)) # Arrange plots in 1 row and 2 columns

# Left Plot: All points with regression line
plot(nw, tw, col = "black", pch = 19,
     xlab = "Newsworthiness (nw)", 
     ylab = "Trustworthiness (tw)", 
     main = "All Points")
fit_all <- lm(tw ~ nw)
abline(fit_all, col = "red", lwd = 2) # Regression line for all points

# Right Plot: Highlight selected points
plot(nw, tw, col = ifelse(selected, "blue", "black"), 
     pch = ifelse(selected, 19, 1), # Filled for selected, unfilled for others
     xlab = "Newsworthiness (nw)", 
     ylab = "Trustworthiness (tw)", 
     main = "Selected Points")
if (any(selected)) {
  fit_selected <- lm(tw[selected] ~ nw[selected])
  abline(fit_selected, col = "blue", lwd = 2) # Regression line for selected points
}

# Display correlation on the right plot
correlation <- cor(tw[selected], nw[selected])
mtext(paste("Correlation (selected):", round(correlation, 2)), side = 3, line = -1.5, at = max(nw) * 0.9)

```

## Results:

### Before Selection:
- No correlation between nw and tw.
- Regression line is flat.

### After Selection:
- Correlation appears among selected proposals.
- Selection favors high combined scores (\(nw + tw\)).

### Distortion:
- The selection process creates an artificial correlation.

### Why Does This Happen?
This phenomenon, known as **Berkson’s Paradox** or the **Selection-Distortion Effect**, occurs because:
- Selection conditions on the sum (\(nw + tw\)), introducing dependence between **nw** and **tw**.

### Broader Implications:
- **Everywhere in Life**: Restaurants with bad food in good locations (and vice versa) are examples of selection-distortion.
- **In Regression Models**: Adding predictors can induce statistical selection, a form of collider bias. This bias can mislead us into believing relationships exist where they do not.

### Key Takeaways:
- **Selection Can Create Illusory Correlations**: Simply combining unrelated factors can produce spurious relationships.
- **Regression Models Are Susceptible**: Adding predictors can induce collider bias, especially when conditioning on a variable.

This example highlights why understanding causal structures and conditioning effects is critical for responsible regression modeling.



## Multicollinearity

### Key Idea: Challenges of Multiple Predictors
- **Context**: Many potential predictor variables may be available, e.g., 7 predictors in the primate milk data.
- **Problem**: Including all predictors may lead to issues like multicollinearity.
- **Multicollinearity**:
  - A strong association between two or more predictors (conditional on other variables).
  - Leads to uncertainty in parameter estimates, even if all predictors are strongly associated with the outcome.
  - Not a problem for prediction but can hinder interpretability.

---

### Example: Predicting Height Using Leg Length

#### Simulation: Height vs. Left and Right Leg Length
```{r}
# Simulated data
N <- 100  # number of individuals
set.seed(909)
height <- rnorm(N,10,2)  # sim total height of each
leg_prop <- runif(N,0.4,0.5)  # leg as proportion of height
leg_left <- leg_prop*height + rnorm( N , 0 , 0.02 )  # left leg + error
leg_right <- leg_prop*height + rnorm( N , 0 , 0.02 )  # right leg + error

# Combine into data frame
d <- data.frame(height, leg_left, leg_right)
```


**Model: Using Both Legs as Predictors**

```{r, echo=False}
m6.1 <- quap(
  alist(
  height ~ dnorm( mu , sigma ) ,
  mu <- a + bl*leg_left + br*leg_right ,
  a ~ dnorm( 10 , 100 ) ,
  bl ~ dnorm( 2 , 10 ) ,
  br ~ dnorm( 2 , 10 ) ,
  sigma ~ dexp( 1 )
  ),
  data=d
  )
precis(m6.1)
```




```{r}
plot(precis(m6.1))
```
## Observations
- Both legs have almost identical lengths and are strongly associated with height.

### Key Question:
- What is the value of knowing one leg's length after knowing the other leg’s length?

### Result:
- The posterior distribution for coefficients \( \beta_l \) (left leg) and \( \beta_r \) (right leg) shows:
  - A high correlation between \( \beta_l \) and \( \beta_r \), forming a narrow ridge of plausible values.
  - The model cannot disentangle the contributions of each leg.


## Visualizing the Posterior

 
```{r} 
# Sample from the posterior
post <- extract.samples(m6.1)


# Add transparency and color
rangi2 <- rgb(0.2, 0.4, 0.8, alpha = 1) # Blue color
col_alpha <- function(color, alpha) adjustcolor(color, alpha.f = alpha)

# Set up side-by-side plots
par(mfrow = c(1, 2)) # 1 row, 2 columns

# Left Panel: Scatter plot of bl vs br
plot(post$bl, post$br, col = col_alpha(rangi2, 0.5), pch = 16,
     xlab = "Left leg coefficient (bl)", 
     ylab = "Right leg coefficient (br)",
     main = "Scatter: bl vs br")

# Right Panel: Density plot of bl + br
bl_br_sum <- post$bl + post$br
plot(density(bl_br_sum), col = "blue", lwd = 2,
     main = "Density: bl + br",
     xlab = "Sum of bl and br", 
     ylab = "Density",
     type = "l") # Only line plot
```
- The left plot shows the narrow ridge of plausible values for \(\beta_l\) and \(\beta_r\).
- The right plot shows the posterior distribution for their sum (\(\beta_l + \beta_r\)), which the model reliably estimates.

What has happened here is that since both leg variables contain almost exactly the same information, if you insist on including both in a model, then there will be a practically infinite number of combinations of bl and br that produce the same predictions.
 
One way to think of this phenomenon is that you have approximated this model:

$$y_i \sim \mathcal{N}(\mu_i,\,\sigma)$$
$$\mu_i = \alpha + \beta_1 \times \text{Leg Length}_i + \beta_2 \times \text{Leg Length}_i$$
$$\mu_i = \alpha + (\beta_1 + \beta_2) \times \text{Leg Length}_i$$

The parameters β1 and β2 cannot be pulled apart, because they never separately influence the mean µ. Only their sum, β1+β2,influences µ. So this means the posterior distribution ends up reporting the very large range of combinations of β1 and β2 that make their sum close to the actual association of x with y.

And the posterior distribution in this simulated example has done exactly that: It has produced a good estimate of the sum of bl and br. But it has done so at the expense of producing a very wide range of plausible values for bl and br separately. This is the essence of multicollinearity. It is not a problem with the model. It is a problem with the question. The model is answering the question you asked, but the question is not what you thought it was.

## Simpler Model: One Leg as Predictor


```{r}
m6.2 <- quap(
  alist(
    height ~ dnorm( mu , sigma ) ,
    mu <- a + bl*leg_left,
    a ~ dnorm( 10 , 100 ) ,
    bl ~ dnorm( 2 , 10 ) ,
    sigma ~ dexp( 1 )
  ) ,
  data=d )

precis(m6.2)
```

- Using only one leg results in stable posterior estimates, avoiding multicollinearity.


## Multicollinearity in Primate Milk Data: Standardizing Variables

### Key Points:
- **Multicollinearity**: Occurs when predictors in a regression model are highly correlated, impacting the reliability of the model.
- **Standardizing Variables**: A common technique to address multicollinearity by transforming variables to have a mean of zero and a standard deviation of one.

### Why It Matters:
- Multicollinearity can obscure the true relationships between variables, leading to misleading conclusions.
- Standardizing variables helps to mitigate the effects of multicollinearity, making the model more reliable and interpretable.

### Example:
- In the context of primate milk data, standardizing variables can help to better understand the relationships between different predictors and the outcome variable.

### Steps to Standardize Variables:
1. **Calculate the Mean and Standard Deviation**: For each predictor variable.
2. **Transform the Variables**: Subtract the mean and divide by the standard deviation for each value in the predictor variable.

```{r}  
data(milk)
d <- milk
d$K <- standardize( d$kcal.per.g )
d$F <- standardize( d$perc.fat )
d$L <- standardize( d$perc.lactose )
```

## Models: Using One Predictor


```{r}
#kcal.per.g regressedonperc.fat
 m6.3<-quap(
   alist(
     K~dnorm(mu,sigma),
     mu<-a+bF*F,
     a~dnorm(0,0.2),
     bF~ dnorm(0,0.5),
     sigma~dexp(1)
   ),
   data=d)
 

#kcal.per.g regressedonperc.lactose
 m6.4<-quap(
   alist(
     K~dnorm(mu,sigma),
     mu<-a+bL*L,
     a~dnorm(0,0.2),
     bL~ dnorm(0,0.5),
     sigma~dexp(1)
  ),data=d)
 
 precis(m6.3)
 
 precis(m6.4)
```

- Both predictors are strongly associated with energy content in separate models.
- The posterior distributions for bF and bL are essentially mirror images of one another. 
- Given the strong association of each predictor with outcome, we might conclude that both are important. The more fat the more calorie content, and the more lactose the less calorie content. 

What happens when we fit a model with both predictors?

## Model: Using Both Predictors


```{r}

#kcal.per.g regressedonperc.fatandperc.lactose
 m6.5<-quap(
   alist(
     K~dnorm(mu,sigma),
     mu<-a+bF*F+bL*L,
     a~dnorm(0,0.2),
     bF~ dnorm(0,0.5),
     bL~ dnorm(0,0.5),
     sigma~dexp(1)
  ),data=d)
 
 precis(m6.5)
```

- Including both predictors increases the posterior uncertainty for \(\beta_F\) and \(\beta_L\), as the two variables are highly collinear.
- The ridge-like posterior distribution indicates that the model cannot disentangle the effects of fat and lactose on energy content. The easiest way to see this is to use a pairs plot.

## Diagnosing Multicollinearity


```{r}
pairs( ~ kcal.per.g + perc.fat + perc.lactose , data=d , col=rangi2 )

```

- Scatterplots reveal \(\text{perc.fat}\) and \(\text{perc.lactose}\) are highly correlated and nearly redundant.
- The strong association between predictors leads to multicollinearity, hindering the model's interpretability.

### Dodgy Procedures for Multicollinearity

- In some fields, a common but flawed approach to handling multicollinearity involves:
  1. Inspecting **pairwise correlations** between predictors.
  2. Dropping highly correlated variables before modeling.
  
- **Why is this a mistake?**
  - Pairwise correlations are not the issue; **conditional associations** within the model are what matter.
  - The right approach depends on the cause of the collinearity, which requires a **causal perspective**.
  
- **Example of dodginess**:
  - Imagine analyzing human health data and finding that "exercise" and "calorie intake" are highly correlated. 
  - Dropping one might ignore important causal relationships—exercise and calorie intake both influence weight but in different ways.

---

### Core Tradeoff in Milk Composition

In the milk example, a natural tradeoff likely explains the collinearity between fat and lactose:

- **Frequent nursing**:
  - Milk tends to be watery and low in energy.
  - High in sugar (**lactose**) to provide quick energy.

- **Infrequent nursing**:
  - Milk needs to be more energy-dense.
  - High in **fat** to sustain the infant for longer periods between feedings.

- This tradeoff suggests a **causal model**: nursing frequency drives milk composition. High lactose and high fat are inversely related because they serve different energetic strategies.



```{r, echo=FALSE, message=FALSE, warning=FALSE}
circle_points <- function(points_to_circle, g) {
    #few regexs to extract the points and the positions from "g"
    #can surely be optimized, made nicer and more robust but it works for now
    fsplit <- strsplit(g[1], "\\]")[[1]]
    fsplit <- fsplit[-length(fsplit)]
    fsplit <- substr(fsplit, 1, nchar(fsplit)-1)
    fsplit[1] <- substr(fsplit[1], 6, nchar(fsplit))
    vars <- sapply(regmatches(fsplit,
                              regexec("\\\n(.*?)\\s*\\[", fsplit)), "[", 2)
    pos <- sub(".*pos=\\\"", "", fsplit)
    
    #build dataframe with extracted information
    res_df <- data.frame(vars = vars, 
                         posx = sapply(strsplit(pos, ","), "[",1), 
                         posy = sapply(strsplit(pos, ","), "[",2))
    df_to_circle <- res_df[res_df$vars %in% points_to_circle,]
    
    #y-position seems to be inverted and has to be multiplied by -1
    points(c(as.numeric(df_to_circle$posx)), 
           c(as.numeric(df_to_circle$posy) * -1), 
           cex = 4)
}

```

```{r}

# Define the DAG with D as latent
dag <- dagitty( "dag{
  L -> K <- F
  L <- D -> F
  D [unobserved]
}" )

# Define coordinates for visualization
coordinates(dag) <- list(
  x = c(L = 0, K = 1, D = 1, F = 2),
  y = c(L = 0.5, K = 1, D = 0.5, F = 0.5)
)

plot( dag )

circle_points(c("D"), dag)
```

## Key Takeaways
- **Multicollinearity is a problem of interpretation, not prediction**.
- When predictors are highly correlated, the model struggles to allocate effects to each predictor.

### Avoid Pairwise Correlation Tests:
- It’s conditional associations, not pairwise correlations, that matter.

### Understand the Causal Model:
- Collinearity often arises from underlying causal processes.
- In the milk example, fat and lactose are linked by nursing frequency, a latent variable.


## Post-treatment Bias

### Key Idea: Mistakes From Including Variables
- It’s common to worry about omitting predictor variables (**omitted variable bias**).
- Less attention is given to mistakes from **including variables** (**included variable bias**):
  - Adding variables without considering causality can ruin randomized experiments.
  - Example: **Post-treatment bias** occurs when a predictor is influenced by the treatment or outcome.

---

### Example: Plant Growth Experiment
- Scenario:
  - **Goal**: Measure the causal effect of soil treatment on plant growth.
  - Variables:
    1. Initial height (\(H_0\)): Pre-treatment height.
    2. Final height (\(H_1\)): Post-treatment height.
    3. Treatment (\(T\)): Soil treatment.
    4. Fungus (\(F\)): Fungus presence.
- **Problem**: Including \(F\) (fungus) induces bias because \(T \to F \to H_1\):
  - Treatment influences fungus.
  - Fungus influences growth (\(H_1\)).
  - Controlling for \(F\) blocks the causal path \(T \to H_1\).

---

### Simulated Data
```{r}
set.seed(1914)
N <- 100
h0 <- rnorm(N, 10, 2)
treatment <- rep(0:1, each = N/2)
fungus <- rbinom(N, size = 1, prob = 0.5 - treatment * 0.4)
h1 <- h0 + rnorm(N, 5 - 3 * fungus)
d <- data.frame(h0 = h0, h1 = h1, treatment = treatment, fungus = fungus)
precis(d)
```

## Initial Growth Model Without Predictors

### Model Definition
**Goal**: Model proportional growth (\( p = \frac{H_1}{H_0} \)).

### Model:
- \( h_{1,i} \sim N(\mu_i, \sigma) \)
- \( \mu_i = h_{0,i} \times p \)
- \( p \sim \text{LogNormal}(0, 0.25) \): Prior centered around no growth (\( p = 1 \)).

### Simulation and Fitting


```{r}

# Simulate prior from model
sim_p <- rlnorm( 1e4 , 0 , 0.25 )

precis( data.frame(sim_p) )

hist(sim_p)

```
So this prior expects anything from 40% shrinkage up to 50% growth. Let’s fit this model, so you can see how it just measures the average growth in the experiment. 

```{r}
# Fit the model
m6.6 <- quap(
  alist(
    h1 ~ dnorm( h0 * p , sigma ),
    mu <- h0 * p,
    p ~ dlnorm( 0 , 0.25 ),
    sigma ~ dexp( 1 )
  ),
  data = d
)

precis(m6.6)

```
## Output:
- **Posterior mean for \( p \)**: Approximately 1.4, suggesting 40% average growth.
- **Prior**: Allows for 40% shrinkage to 50% growth.

## Adding Treatment and Fungus

### Model Definition
Extend \( p \) to depend on treatment and fungus:
- \( h_{1,i} \sim N(\mu_i, \sigma) \)
- \( \mu_i = h_{0,i} \times p \)
- \( p = a + b_t \times T_i + b_f \times F_i \)

### Priors:
- \( a \sim \text{LogNormal}(0, 0.2) \)
- \( b_t, b_f \sim N(0, 0.5) \)
- \( \sigma \sim \text{Exponential}(1) \)

### Model Fitting

The proportion of growth p is now a function of the predictor variables. It looks like any  other linear model. The priors on the slopes are almost certainly too flat. They place 95% of the prior mass between −1 (100% reduction) and +1 (100% increase) and two-thirds of the prior mass between −0.5 and +0.5.  



```{r}
# Fit the model
 m6.7<-quap(
   alist(
     h1~ dnorm(mu,sigma),
     mu<-h0*p,
     p<- a+bt*treatment+bf*fungus,
     a~dlnorm(0,0.2),
     bt~ dnorm(0,0.5),
     bf~ dnorm(0,0.5),
     sigma~dexp(1)
     
    ), data=d
   
   )

precis(m6.7)
```

## Output:
- **Treatment effect (\( b_t \))**: Close to zero, suggesting no effect.
- **Fungus effect (\( b_f \))**: Negative, indicating reduced growth due to fungus.

### Why?
- Fungus is a post-treatment variable. Controlling for it blocks the causal path \( T \rightarrow H_1 \).



**Blocked by consequence**. The problem is that fungus is mostly a consequence of treatment. This is to say that fungus is a post-treatment variable. So when we control for fungus, the model is implicitly answering the question: Once we already know whether or not a plant developed fungus, does soil treatment matter? The answer is “no,” because soil treatment has its effects on growth through reducing fungus. But we actually want to know, based on the design of the experiment, is the impact of treatment on growth. To measure this properly, we should omit the post-treatment variable fungus. Here’s what the inference looks like in that case.

## Correct Inference: Excluding Fungus

### Adjusted Model
Exclude \( F \) to correctly measure the treatment effect:
- \( p = a + b_t \times T_i \)

### Model Fitting


```{r}
# Fit the model
 m6.8<-quap(
   alist(
     h1~ dnorm(mu,sigma),
     mu<-h0*p,
     p<- a+bt*treatment,
     a~dlnorm(0,0.2),
     bt~ dnorm(0,0.5),
     sigma~dexp(1)
     
    ), data=d
   
   )

precis(m6.8)
```

## Output:
- **Treatment effect (\( b_t \))**: Positive, as expected.
- Excluding fungus reveals the true effect of treatment on growth.


Now the impact of treatment is clearly positive, as it should be. It makes sense to control for pre-treatment differences, like the initial height h0, that might mask the causal influence of treatment. But including post-treatment variables can actually mask the treatment itself. This doesn’t mean you don’t want the model that includes both treatment and fungus. The fact that including *fungus* zeros the coefficient for *treatment* suggests that the treatment works for exactly the anticipated reasons. It tells us about the mechanism. But a correct inference about the treatment still depends upon omitting the post-treatment variable.

## Visualizing With a DAG
Fungus and d-separation. It helps to look at this problem in terms of a DAG. In this case, I’ll show you how to draw it using the dagitty R package, because we are going to use that package now to do some graph analysis.


### Causal Diagram
Graph:
- \( T \rightarrow F \rightarrow H_1 \)
- \( H_0 \rightarrow H_1 \)


The causal diagram for this scenario is shown below:

```{r}
# Define the DAG with D as latent
plant_dag <- dagitty( "dag {
    H_0-> H_1
    F-> H_1
    T-> F
 }")


 coordinates( plant_dag ) <- list( x=c(H_0=0,T=2,F=1.5,H_1=1) ,
                                   y=c(H_0=0,T=0,F=0,H_1=0) 
                                 )
 drawdag( plant_dag )
 
```
## Key Insight:
- Conditioning on \( F \) (fungus) blocks the path \( T \rightarrow H_1 \), making \( T \) and \( H_1 \) independent.

## Key Takeaways
- **Post-treatment bias**:
  - Including variables affected by treatment can mask the treatment effect.
- **DAGs help clarify causality**:
  - Use causal diagrams to identify variables that should or should not be included.
- **Correct inference**:
  - Exclude post-treatment variables to measure causal effects accurately.


## Collider Bias

### Key Idea: Selection and Collider Bias
- Selection processes can create misleading associations in statistical models, known as **collider bias**.
- **Collider**: A variable influenced by two or more other variables (e.g., trustworthiness and newsworthiness both affect selection).

---

### Example: Selection in Grant Proposals

#### Causal Diagram
- Trustworthiness (\(T\)) and Newsworthiness (\(N\)) both influence Selection (\(S\)).
- Selection (\(S\)) is a **collider**.

```{r}
grant_dag <- dagitty("dag {
    T -> S
    N -> S
}")
coordinates(grant_dag) <- list(
  x = c(T = 0, N = 2, S = 1),
  y = c(T = 0, N = 0, S = 0)
)
drawdag(grant_dag)

```

## Explanation
- Conditioning on \( S \) creates a statistical association between \( T \) and \( N \):
  - If a selected proposal has low \( T \), it must have high \( N \) to be selected (and vice versa).
  - This creates a negative association between \( T \) and \( N \), even if they are uncorrelated in the population.


## Example: Aging, Happiness, and Marriage

### Causal Diagram
- Happiness (\( H \)) and Age (\( A \)) both influence Marriage (\( M \)).
- \( M \) is a collider.


```{r}
# Create the DAG mentioned in the paragraph above
happiness_dag <- dagitty( "dag {
    A -> M
    H -> M
 }")

coordinates( happiness_dag ) <- list( x=c(A=0, H=2, M=1) ,
                                   y=c(A=0, H=0, M=0)
                                 )

drawdag( happiness_dag )
```

## Explanation
- Conditioning on \( M \) induces a spurious association between \( A \) and \( H \), even if happiness is constant across ages.
- This **collider bias** arises from conditioning on a variable influenced by two other variables.
- Happiness (H) and age (A) both cause marriage (M). Marriage is therefore a collider. 



## Simulating the Data

### Simulation Design
- Each year, 20 individuals are born with uniformly distributed happiness.
- Individuals age and may get married:
  - Happiness does not change with age.
  - Probability of marriage increases with happiness and age.
- Married individuals remain married. Individuals over 65 are removed (They move to Spain.)

```{r}
# Simulate data
d <- sim_happiness( seed=1977 , N_years=1000 )

precis(d)
```

## Plotting the Data

```{r}
# Plot happiness vs. age, colored by marriage status
plot( d$age , d$happiness , col=d$married+1 , pch=16 ,
      xlab="Age" , ylab="Happiness" )
legend( "topleft" , legend=c("Not Married","Married") , pch=16 , col=1:2 )
```
## Observation
- Happiness is uncorrelated with age.
- Marriage is associated with both age and happiness, creating the appearance of a relationship.

## Multiple Regression: Conditioning on Marriage

### Model Definition
**Linear model**:
- \( \mu_i = \alpha_{MID[i]} + \beta_A \times A_i \)
- \( MID[i] \): Marriage status index.

**Rescale age (\( A \)) for adults (18–65)**:
- \( A = \frac{\text{age} - 18}{65 - 18} \)


```{r}
# Rescale age
d2 <- d[ d$age>17 , ] # only adults

d2$A <- ( d2$age- 18 ) / ( 65- 18 )

```

## Model Fitting

```{r}
# Create marriage status index variable
d2$MID <- d2$married + 1

# Fit the model
m6.9 <- quap(
  alist(
    happiness ~ dnorm( mu , sigma ) ,
    mu <- a[MID] + bA * A ,
    a[MID] ~ dnorm( 0 , 1 ) ,
    bA ~ dnorm( 0 , 2 ) ,
    sigma ~ dexp( 1 )
  ) ,
  data=d2
)

precis(m6.9, depth=2)
```

## Output:
- The model suggests a negative association between age and happiness, which is spurious due to collider bias.
- Conditioning on marriage status induces a statistical relationship between age and happiness.


## Regression Without Conditioning on Marriage

```{r}
# Fit the model without marriage status
m6.10 <- quap(
  alist(
    happiness ~ dnorm( mu , sigma ) ,
    mu <- a + bA * A ,
    a ~ dnorm( 0 , 1 ) ,
    bA ~ dnorm( 0 , 2 ) ,
    sigma ~ dexp( 1 )
  ) ,
  data=d2
)

precis(m6.10, depth=1)
```

## Output:
- Without conditioning on \( M \), the model finds no association between age and happiness, reflecting the true causal model.


It’s easy to plead with this example. Shouldn’t marriage also influence happiness? What if happiness does change with age? But this misses the point. If you don’t have a causal model, you can’t make inferences from a multiple regression. And the regression itself does not provide the evidence you need to justify a causal model. Instead, you need some science.

## Key Takeaways
- **Collider Bias**:
  - Conditioning on a collider creates misleading associations between its causes.
  - In the marriage example, conditioning on \( M \) induces a spurious relationship between \( A \) and \( H \).

- **DAGs Clarify Causal Relationships**:
  - DAGs help identify colliders and prevent incorrect variable selection.

- **Causal Inference Requires a Model**:
  - Without a causal model, multiple regression cannot justify inferences.

- **Practical Advice**:
  - Avoid conditioning on colliders to prevent spurious associations.

## Take-Home Messages

### Key Lessons:
- **Causal Understanding is Crucial**:
  - Statistical models cannot replace a thoughtful causal framework.
  - DAGs provide a way to structure and communicate assumptions.

- **Selection and Collider Bias**:
  - Conditioning on colliders creates misleading associations.
  - Examples like marriage in the happiness study or fungus in the plant growth example demonstrate this.

- **Avoid Over-Conditioning**:
  - Adding variables without a causal perspective can mask or distort true relationships.

- **Multicollinearity**:
  - Predictors with shared information can lead to wide posterior distributions.
  - Interpretability suffers, even if prediction is still effective.

- **Think Causally, Not Just Statistically**:
  - Regression models answer the questions you ask—ensure your question reflects your research goals.

