Multiple Regression

Causal Implications
February 27, 2023

…but first

Categorical predictors

Let’s look at the Howell data again:

data(Howell1)
d <- Howell1
str(d)
'data.frame':   544 obs. of  4 variables:
 $ height: num  152 140 137 157 145 ...
 $ weight: num  47.8 36.5 31.9 53 41.3 ...
 $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
 $ male  : int  1 0 0 1 0 1 0 1 0 1 ...

The male variable is called an indicator variable, a way to encode unordered categories into quantitative models.

Indicator variables

Here’s the corresponding model:

\[ \begin{array}{rl} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), \\ \mu_i\!\!\!\! & = \alpha + \beta_m m_i, \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), \\ \beta_m\!\!\!\! & \sim \mathrm{Normal}(0, 10), \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50), \end{array} \]

where \(m_i=0\) corresponds to females and \(m_i=1\) corresponds to males.

Discuss: Interpret the parameters.

Answer: \(\alpha\) is average height for females, while \(\alpha+\beta_m\) is average height for males. Thus, \(\beta_m\) represents difference in height between males and females.

Indicator variables

Two issues:

  1. How do we assign priors to these parameters?
  2. If average male height is the sum of two parameters, then there will be additional uncertainty for male height.

\[ \begin{array}{rl} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), \\ \mu_i\!\!\!\! & = \alpha + \beta_m m_i, \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(178, 20), \\ \beta_m\!\!\!\! & \sim \mathrm{Normal}(0, 10), \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50), \end{array} \]

where \(m_i=0\) corresponds to females and \(m_i=1\) corresponds to males.

mu_female <- rnorm(1e4, 178, 20)
mu_male <- rnorm(1e4, 178, 20) + rnorm(1e4, 0, 10)
precis( data.frame( mu_female, mu_male ) )
              mean       sd     5.5%    94.5%  histogram
mu_female 177.8613 19.73536 146.1872 209.2209   ▁▁▃▇▇▂▁▁
mu_male   177.8657 22.47167 141.4661 213.7772 ▁▁▁▃▇▇▃▁▁▁

As you can see, the prior for males is wider. Why would we expect the prior for males to be wider??

Index variable

It is much more useful to encode a categorical as an index variable instead.

Definition: An index variable contains integers that correspond to different categories.

For example:

d$sex <- ifelse( d$male == 1, 2, 1 )
str( d$sex )
 num [1:544] 2 1 1 2 1 2 1 2 1 2 ...

Now “1” means female and “2” means male. No order is implied - these are just integer labels.

Index variable

The resulting model now looks like:

\[ \begin{array}{rl} h_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), \\ \mu_i\!\!\!\! & = \alpha_{\textrm{sex[}i\textrm{]}}, \\ \alpha_j\!\!\!\! & \sim \mathrm{Normal}(178, 20), \\ \sigma\!\!\!\! & \sim \mathrm{Uniform}(0, 50) \end{array} \] for \(j\)=1 \(\ldots\) 2.

mod <- quap(
  alist(
    height ~ dnorm( mu, sigma ),
    mu <- a[sex],
    a[sex] ~ dnorm( 178, 20 ),
    sigma ~ dunif( 0, 50 )
  ), data=d
)
precis( mod, depth=2 )
           mean        sd     5.5%     94.5%
a[1]  134.91013 1.6069272 132.3419 137.47831
a[2]  142.57833 1.6974665 139.8655 145.29121
sigma  27.30986 0.8280347  25.9865  28.63322

Index variable

precis( mod, depth=2 )
           mean        sd     5.5%     94.5%
a[1]  134.91013 1.6069272 132.3419 137.47831
a[2]  142.57833 1.6974665 139.8655 145.29121
sigma  27.30986 0.8280347  25.9865  28.63322

The depth=2 argument tells the function to show vector parameters.

What if we are interested in the difference between the heights (especially if we want to “test” if there is a statistically significant difference between the average heights).

Note that you can already compare percentile intervals if you want to compare them. In this case, there is no overlap in their 89% percentile intervals.

Index variable

If you still want to estimate the expected difference between male and female heights, you can sample from the posterior to estimate.

post.samples <- extract.samples( mod )
post.samples$diff <- post.samples$a[, 1] - post.samples$a[, 2]
precis( post.samples, depth=2 )
            mean        sd      5.5%      94.5%       histogram
sigma  27.306955 0.8356903  25.95831  28.628234  ▁▁▁▁▃▅▇▇▃▂▁▁▁▁
a[1]  134.877625 1.6028254 132.29848 137.400127 ▁▁▁▁▁▂▅▇▇▅▂▁▁▁▁
a[2]  142.578252 1.7123852 139.81433 145.315433   ▁▁▁▂▃▇▇▇▃▂▁▁▁
diff   -7.700627 2.3480100 -11.45774  -3.970207      ▁▁▁▃▇▇▃▁▁▁

Index variable

            mean        sd      2.5%      97.5%       histogram
sigma  27.306955 0.8356903  25.67309  28.903975  ▁▁▁▁▃▅▇▇▃▂▁▁▁▁
a[1]  134.877625 1.6028254 131.75345 138.043838 ▁▁▁▁▁▂▅▇▇▅▂▁▁▁▁
a[2]  142.578252 1.7123852 139.18941 145.954967   ▁▁▁▂▃▇▇▇▃▂▁▁▁
diff   -7.700627 2.3480100 -12.38465  -3.109879      ▁▁▁▃▇▇▃▁▁▁

Compare to t-test:

t.test( height ~ male, data=d )

    Welch Two Sample t-test

data:  height by male
t = -3.254, df = 517.65, p-value = 0.001212
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -12.334008  -3.047511
sample estimates:
mean in group 0 mean in group 1 
       134.6303        142.3210 

Index variable

One more neat thing: you can plot!

plot( precis( mod, depth=2, pars="a" ) )

Note:

  • The first argument to the precis command is the model, not the samples.
  • The pars="a" argument tells precis to only return that parameter.

Multiple Regression

Waffle Houses and Divorce?

Is this a spurious correlation?

Likely yes, there is probably some other variable driving the relationship between waffles and divorce.

Some history:

  • Waffle House began in Georgia in 1955 and spread to other southern states.
  • Waffle House remained largely in the South.
  • Southern United States as some of the highest divorce rates in the nation.
  • “Southernness” is called a confounder.

What is multiple regression?

Definition: Multiple regression uses more than one predictor variable to simultaneously model an outcome.

\[ \begin{array}{rl} y_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma), \\ \mu_i\!\!\!\! & = \alpha + \beta_1 x_1 + \ldots + \beta_k x_k \\ \alpha\!\!\!\! & \sim \textrm{Prior distribution}, \\ \beta_j\!\!\!\! & \sim \textrm{Prior distribution}, \\ \sigma\!\!\!\! & \sim \textrm{Prior distribution}, \end{array} \] where \(j=1\ldots k\).

Why multiple regression?

Reasons given for multiple regression include:

  • Statistical “control” for counfounds: A confound is something that misleads us about a causal influence. But confounds are diverse. They can hide important effects just as easily as they can produce false ones.
  • Multiple and complex causation: A phenomenon may arise from multiple simultaneous causes. Since one cause can hide another, they must be measured simultaneously.

TL;DR of Chapters 5 and 6

  • Causal models are important to use to design and interpret regression models.
  • Confounders should be included in a regression model to (1) reveal spurious correlations between a predictor and response, and (2) reveal a correlation between a predictor and response that can be masked by unrevealed correlations with the confounder (Chapter 5).
  • Confounders should NOT be included in a regression model under certain circumstances (post-treatment bias and collider bias; see Chapter 6).

Motivating multiple regression

Aside: Standardizing variables

# load data and copy
data(WaffleDivorce)
d <- WaffleDivorce

# standardize variables
d$D <- standardize( d$Divorce )
d$M <- standardize( d$Marriage )
d$A <- standardize( d$MedianAgeMarriage )

Standardization helps numerical algorithm to converge. \[ x_{\textrm{s}} = \frac{x - \mu_x}{\sigma_x} \] This makes \(\mu_{x_{\textrm{s}}}=0\) and \(\sigma_{x_{\textrm{s}}}=1\).

Divorce vs. Marriage rate

mDvM <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bM*M,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d)

precis( mDvM )
               mean         sd       5.5%     94.5%
a     -1.725707e-06 0.10824159 -0.1729927 0.1729892
bM     3.500785e-01 0.12591984  0.1488342 0.5513227
sigma  9.102080e-01 0.08984828  0.7666130 1.0538029

Divorce vs. Median age @ marriage

mDvA <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bA*A,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d)

precis( mDvA )
               mean         sd       5.5%      94.5%
a      1.415210e-07 0.09737881 -0.1556300  0.1556303
bA    -5.684034e-01 0.10999986 -0.7442045 -0.3926024
sigma  7.883261e-01 0.07801144  0.6636488  0.9130035

Which predictor??

  • Both marriage rate M and median age @ marriage A could provide independent value to predicting D, or
  • they could be redundant (something else is driving them both), or
  • one could eliminate the value of the other (one is completely determining the other).

Which predictor??

The pattern we see, i.e.

  • Marriage rate (M) positively affects Divorce rate (D)
  • Median age @ marrage (A) negatively affects Divorce rate (D)

is indicative of a situation where only one of the predictors (A in this case) has a causal impact on the outcome D, even though both are strongly associated with the outcome.

To see this, we need to incorporate a modeling mechanism to infer causation - DAGs, or Directed Acyclic Graphs.

Getting at causation

DAGs: Directed Acyclic Graphs

  • M = Marriage rate
  • D = Divorce rate
  • A = Median age @ marriage
  • A directly influences D
  • M directly influences D
  • A directly influences M

Getting at causation

DAGs: Directed Acyclic Graphs

  • Graph means it is nodes and connections between nodes.
  • Directed means the connections represent directions of causal influence.
  • Acyclic means causes do not eventually flow back on themselves (no cycles).

Getting at causation: Model #1

Discuss: Discuss how each of the connections could represent a causal influence.

Note: There is a direct causal influence of A on D and an indirect causal influence of A on D through M.

Testable Implications

DAGs have testable associative implications, which are also called conditional independencies.

Conditional independencies…

  • …are statements of which variables should or should not be associated in the data.
  • …are statements about which variables become dis-associated when conditioned on some other set of variables.

Testable Implications

Conditioning on a variable

Conditioning on a variable Z means learning its value and then asking if X adds any additional information about Y.

If learning X gives no more information about Y, then we say X is independent of Y, conditional on Z:

That weird symbol means “independent”. If there is a slash through the symbol, then it would mean “dependent” (or “not independent”).

Model #1 Testable Implications

library(dagitty)
DMA_dag1 <- dagitty('dag { D <- A -> M -> D }')
impliedConditionalIndependencies( DMA_dag1 )

No output! Thus, all variables are associated, no matter what we condition on.

Model #2 Testable Implications

DMA_dag2 <- dagitty('dag { D <- A -> M }')
impliedConditionalIndependencies( DMA_dag2 )
D _||_ M | A

Thus, all variables are associated without conditioning, but D and M should be independent after conditioning on A.

Model #2 Testable Implications

We need a statistical model that conditions on A, so we can test whether that renders D and M independent. That is what multiple regression does for us.

Multiple Regression

In other words, multiple regression answers the descriptive question:

“Is there any additional value in knowning a variable, once I already know all of the other predictor variables?”

Once you fit a multiple regression to predict divorce using both marriage rate and age at marriage, the model addresses the questions:

  • After I already know marriage rate, what additional value is there in also knowing age at marriage?
  • After I already know age at marriage, what additional value is there in also knowing marriage rate?

Multiple Regression: Model Formulation

\[ \begin{array}{rlr} D_{i}\!\!\!\! & \sim \mathrm{Normal}(\mu_i,\sigma) & \textrm{[probability of data]}\\ \mu_i\!\!\!\! & = \alpha + \beta_{M}M_i + \beta_{A}A_i & \textrm{[linear model]} \\ \alpha\!\!\!\! & \sim \mathrm{Normal}(0, 0.2) & \mathrm{[prior \ for \ \alpha]} \\ \beta_{M}\!\!\!\! & \sim \mathrm{Normal}(0, 0.5) & \mathrm{[prior \ for \ \beta_{M}]} \\ \beta_{A}\!\!\!\! & \sim \mathrm{Normal}(0, 0.5) & \mathrm{[prior \ for \ \beta_{A}]} \\ \sigma\!\!\!\! & \sim \mathrm{Exponential}(1) & \mathrm{[prior \ for \ \sigma]} \end{array} \]

Multiple Regression: Compute Posterior

mDvM_A <- quap(
  alist(
    D ~ dnorm( mu, sigma ),
    mu <- a + bM*M + bA*A,
    a ~ dnorm( 0, 0.2 ),
    bM ~ dnorm( 0, 0.5 ),
    bA ~ dnorm( 0, 0.5 ),
    sigma ~ dexp( 1 )
  ), data = d )
precis( mDvM_A )
               mean         sd       5.5%      94.5%
a     -1.755713e-06 0.09707661 -0.1551489  0.1551454
bM    -6.536853e-02 0.15077420 -0.3063348  0.1755978
bA    -6.135048e-01 0.15098472 -0.8548076 -0.3722021
sigma  7.851241e-01 0.07784491  0.6607129  0.9095353

Discuss: Interpret the parameters.

Multiple Regression: Compare models

plot( coeftab(mDvA, mDvM, mDvM_A), par=c("bA", "bM") )

Multiple Regression: Compare models

plot( coeftab(mDvA, mDvM, mDvM_A), par=c("bA", "bM") )

Multiple Regression: Compare models

What the multiple regression tells us is that once we know median age at marriage for a State, there is little or no additional predictive power in also knowing the rate of marriage in that State.

Conclusion: There is a spurious association between M and D.

Masked Relationships

What have we shown so far? Multiple predictor variables are useful for find spurious association.

Another use of multiple predictor variables is to measure multiple direct influences on an outcome, when none of those influences is apparent from individual bivariate relationships.

New data set!

Composition of milk across primate species and its relationship to body mass and brain size

data(milk)
d <- milk
str(d)
'data.frame':   29 obs. of  8 variables:
 $ clade         : Factor w/ 4 levels "Ape","New World Monkey",..: 4 4 4 4 4 2 2 2 2 2 ...
 $ species       : Factor w/ 29 levels "A palliata","Alouatta seniculus",..: 11 8 9 10 16 2 1 6 28 27 ...
 $ kcal.per.g    : num  0.49 0.51 0.46 0.48 0.6 0.47 0.56 0.89 0.91 0.92 ...
 $ perc.fat      : num  16.6 19.3 14.1 14.9 27.3 ...
 $ perc.protein  : num  15.4 16.9 16.9 13.2 19.5 ...
 $ perc.lactose  : num  68 63.8 69 71.9 53.2 ...
 $ mass          : num  1.95 2.09 2.51 1.62 2.19 5.25 5.37 2.51 0.71 0.68 ...
 $ neocortex.perc: num  55.2 NA NA NA NA ...

New data set

'data.frame':   29 obs. of  8 variables:
 $ clade         : Factor w/ 4 levels "Ape","New World Monkey",..: 4 4 4 4 4 2 2 2 2 2 ...
 $ species       : Factor w/ 29 levels "A palliata","Alouatta seniculus",..: 11 8 9 10 16 2 1 6 28 27 ...
 $ kcal.per.g    : num  0.49 0.51 0.46 0.48 0.6 0.47 0.56 0.89 0.91 0.92 ...
 $ perc.fat      : num  16.6 19.3 14.1 14.9 27.3 ...
 $ perc.protein  : num  15.4 16.9 16.9 13.2 19.5 ...
 $ perc.lactose  : num  68 63.8 69 71.9 53.2 ...
 $ mass          : num  1.95 2.09 2.51 1.62 2.19 5.25 5.37 2.51 0.71 0.68 ...
 $ neocortex.perc: num  55.2 NA NA NA NA ...

We’ll consider:

  • kcal.per.g: kilocalories of energy per gram of milk (response)
  • mass: average female body mass, in kilograms (predictor)
  • neocortex.perc: percent of total brain mass that is neocortex mass (predictor)

Hypothesis

Hypothesis: Primates with larger brains produce more energetic milk, so that brains can grow quickly.

We won’t solve this hypothesis, but we will explore the following question:

Question: What extent does energy content of milk (measured in kilocalories) relate to percent of the brain mass that is neocortex?

We’ll end up needing to include female body mass in our model to see the “masking” that hides the relationships between these variables.

Standardize data!

d$K <- standardize( d$kcal.per.g )
d$N <- standardize( d$neocortex.perc )
d$M <- standardize( log(d$mass) )

Comments:

  • Standardized variables will have means of 0 and standard deviations of 1.
  • Taking the log of a measure translates the measure into magnitudes, which are more likely to be related in a linear fashion to other variables.

Prior to priors

First, let’s just run a draft model and see what happens.

Note: Priors are centered at 0 now and have standard deviations of 1. This is because we standardized the variables!

mKvN <- quap(
  alist(
    K ~ dnorm( mu, sigma ),
    mu <- a + bN*N,
    a ~ dnorm( 0, 1 ),
    bN ~ dnorm( 0, 1 ),
    sigma ~ dexp( 1 )
  ), data=d
)

Prior to priors

The model gives the following error in R:

Error in quap(alist(K ~ dnorm(mu, sigma), mu <- a + bN * N, a ~ dnorm(0,  : 
  initial value in 'vmmin' is not finite
The start values for the parameters were invalid. This could be caused by missing values (NA) in the data or by start values outside the parameter constraints. If there are no NA values in the data, try using explicit start values.

Issue: The data has NAs:

d$neocortex.perc
 [1] 55.16    NA    NA    NA    NA 64.54 64.54 67.64    NA 68.85 58.85 61.69
[13] 60.32    NA    NA 69.97    NA 70.41    NA 73.40    NA 67.53    NA 71.26
[25] 72.60    NA 70.24 76.30 75.49

Prior to priors

How to fix? Use complete.cases function to throw out all rows where at least one of the predictors or response is NA.

dcc <- d[ complete.cases(d$K, d$N, d$M) , ]
str(dcc)
'data.frame':   17 obs. of  11 variables:
 $ clade         : Factor w/ 4 levels "Ape","New World Monkey",..: 4 2 2 2 2 2 2 2 3 3 ...
 $ species       : Factor w/ 29 levels "A palliata","Alouatta seniculus",..: 11 2 1 6 27 5 3 4 21 19 ...
 $ kcal.per.g    : num  0.49 0.47 0.56 0.89 0.92 0.8 0.46 0.71 0.68 0.97 ...
 $ perc.fat      : num  16.6 21.2 29.7 53.4 50.6 ...
 $ perc.protein  : num  15.4 23.6 23.5 15.8 22.3 ...
 $ perc.lactose  : num  68 55.2 46.9 30.8 27.1 ...
 $ mass          : num  1.95 5.25 5.37 2.51 0.68 0.12 0.47 0.32 1.55 3.24 ...
 $ neocortex.perc: num  55.2 64.5 64.5 67.6 68.8 ...
 $ K             : num  -0.94 -1.064 -0.506 1.538 1.724 ...
 $ N             : num  -2.0802 -0.5086 -0.5086 0.0107 0.2135 ...
 $ M             : num  -0.456 0.127 0.141 -0.307 -1.076 ...

No more NAs! The complete.cases function returns the indices with no NAs.

Prior to priors

Finally, let’s recreate the model using the subsetted full data dcc

mKvN <- quap(
  alist(
    K ~ dnorm( mu, sigma ),
    mu <- a + bN*N,
    a ~ dnorm( 0, 1 ),
    bN ~ dnorm( 0, 1 ),
    sigma ~ dexp( 1 )
  ), data=dcc
)

Prior predictive simulation

Let’s look at the priors.

prior <- extract.prior( mKvN ) # Extract samples from the priors
xseq <- c(-2, 2) # Look at linear model over 2 standard deviations of N
mu <- link( mKvN, post=prior, data=list(N=xseq) ) # Linear model prediction

Some comments:

  • Note the new function extract.prior. This is a shortcut for randomly sampling from the prior distributions.
str(prior)
List of 3
 $ a    : num [1:1000(1d)] 0.3293 1.5364 0.0357 -0.9557 0.446 ...
 $ bN   : num [1:1000(1d)] 2.663 1.031 -0.719 -0.128 1.549 ...
 $ sigma: num [1:1000(1d)] 0.157 0.539 5.392 6.384 2.725 ...
 - attr(*, "source")= chr "quap prior: 1000 samples from mKvN"

Prior predictive simulation

Let’s look at the priors.

prior <- extract.prior( mKvN ) # Extract samples from the priors
xseq <- c(-2, 2) # Look at linear model over 2 standard deviations of N
mu <- link( mKvN, post=prior, data=list(N=xseq) ) # Linear model prediction

Some comments:

  • Note the new function extract.prior. This is a shortcut for randomly sampling from the prior distributions.
  • The variable xseq spans 2 standard deviations from 0 (standardized), covering 95% of the data for the variables.
  • Note the new argument post to link, which allows you to specify your own parameters samples (default is to extract samples from posterior).

Prior predictive simulation

Let’s plot!

plot( NULL, xlim=xseq, ylim=xseq, xlab="neocortex percent (std)", ylab="kilocal per g (std)" )
for (i in 1:50) lines( xseq, mu[i,], col=col.alpha("black", 0.5) )

That’s cray-cray!

Prior predictive simulation

\[ \begin{array}{c} \textrm{a ~ dnorm(0, 1)} \\ \textrm{bN ~ dnorm(0, 1)} \end{array} \]

Comments:

  • With two standardized variables (predictor and response), when predictor is zero, the expected value of the response should also be zero (\(\alpha\)).
  • Let’s tighten standard deviations for \(\alpha\) and \(\beta\).

Priors: Attempt #2

mKvN <- quap(
  alist(
    K ~ dnorm( mu, sigma ),
    mu <- a + bN*N,
    a ~ dnorm( 0, 0.2 ),
    bN ~ dnorm( 0, 0.5 ),
    sigma ~ dexp( 1 )
  ), data=dcc
)

That’s better!

Understanding the Posterior

precis( mKvN )
            mean        sd       5.5%     94.5%
a     0.03994062 0.1544908 -0.2069654 0.2868467
bN    0.13323840 0.2237468 -0.2243522 0.4908291
sigma 0.99982029 0.1647080  0.7365850 1.2630556

Discuss: Interpret!!

It looks like there is a positive relationship between N and K, but it is not very precise. The 89% percentile interval overlaps zero, so negative relationships are also plausible.

Understanding the Posterior

xseq <- seq( from=min(dcc$N)-0.15, to=max(dcc$N)+0.15, length.out=30 )
mu <- link( mKvN, data=list(N=xseq) )
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI)
plot( K~N, data=dcc )
lines( xseq, mu_mean, lwd=2 )
shade( mu_PI, xseq )

Understanding the Posterior

            mean        sd       5.5%     94.5%
a     0.03994062 0.1544908 -0.2069654 0.2868467
bN    0.13323840 0.2237468 -0.2243522 0.4908291
sigma 0.99982029 0.1647080  0.7365850 1.2630556

Now consider mass and kilocalories

mKvM <- quap(
  alist(
    K ~ dnorm( mu, sigma ),
    mu <- a + bM*M,
    a ~ dnorm( 0, 0.2 ),
    bM ~ dnorm( 0, 0.5 ),
    sigma ~ dexp( 1 )
  ), data=dcc
)
precis( mKvM )
             mean        sd       5.5%      94.5%
a      0.04654165 0.1512800 -0.1952331 0.28831639
bM    -0.28253580 0.1928818 -0.5907982 0.02572662
sigma  0.94927965 0.1570616  0.6982648 1.20029446

Discuss: Interpret!!

Now there there appears to be a negative relationship between mass and kilocalories, but it is again not very precise. The 89% percentile interval overlaps zero, so positive relationships are also plausible (although not very).

Mass and kilocalories

xseq <- seq( from=min(dcc$M)-0.15, to=max(dcc$M)+0.15, length.out=30 )
mu <- link( mKvM, data=list(M=xseq) )
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI)
plot( K~M, data=dcc )
lines( xseq, mu_mean, lwd=2 )
shade( mu_PI, xseq )

Mass and kilocalories

             mean        sd       5.5%      94.5%
a      0.04654165 0.1512800 -0.1952331 0.28831639
bM    -0.28253580 0.1928818 -0.5907982 0.02572662
sigma  0.94927965 0.1570616  0.6982648 1.20029446

Multiple regression

mKvNM <- quap(
  alist(
    K ~ dnorm( mu, sigma ),
    mu <- a + bN*N + bM*M,
    a ~ dnorm( 0, 0.2 ),
    c("bN", "bM") ~ dnorm( 0, 0.5 ),
    sigma ~ dexp( 1 )
  ), data=dcc
)
precis( mKvNM )
             mean        sd       5.5%      94.5%
a      0.06799128 0.1339987 -0.1461645  0.2821471
bN     0.67511452 0.2482988  0.2782850  1.0719440
bM    -0.70298774 0.2207872 -1.0558484 -0.3501271
sigma  0.73801418 0.1324618  0.5263147  0.9497136

Discuss: Interpret!!

Multiple regression

plot( coeftab( mKvN, mKvM, mKvNM ), pars=c("bM", "bN") )

Multiple regression

Summary:

  • Bigger species, like apes, have milk with less energy.
  • But species with more neocortex tend to have richer milk.
  • BUT neocortex percentage and body mass are correlated!!

Neocortex and body mass

plot( dcc$N, dcc$M )
cor( dcc$N, dcc$M)
[1] 0.7503758

Causal models: Which one?

There are at least three causal models consistent with these results.

U is an unobserved variable (hence the circle). Which of these graphs is “right”?

We can’t tell from the data alone! They have the same set of conditional independencies (i.e. none).

Causal models: Which one?

Definition: A set of DAGs with the same conditional independencies is known as a Markov Equivalence Set.

dag <- dagitty( "dag {
  M -> N -> K
  M -> K
}")
equivalentDAGs(dag)
[[1]]
dag {
K
M
N
M -> K
M -> N
N -> K
}

[[2]]
dag {
K
M
N
K -> N
M -> K
M -> N
}

[[3]]
dag {
K
M
N
M -> K
N -> K
N -> M
}

[[4]]
dag {
K
M
N
K -> M
K -> N
M -> N
}

[[5]]
dag {
K
M
N
K -> M
N -> K
N -> M
}

[[6]]
dag {
K
M
N
K -> M
K -> N
N -> M
}

Summary (Chapter 5)

Confounders should be included in a regression model to:

  1. reveal spurious correlations between a predictor and response

Summary (Chapter 5)

Confounders should be included in a regression model to:

  1. reveal a correlation between a predictor and response that can be masked by unrevealed correlations with the confounder

On the road to a general theory

Ultimately, we would like to know when we are trying to estimate the causal influence X on an outcome Y if we need to include (or NOT include) any other predictors in our regressions for correct inference.

On the road to a general theory

Definition: Omitted Variable Biases are mistaken inferences from omitting predictor variables. Spurious associations and masked effects are examples of this.

Definition: Included Variable Biases are mistaken inferences from including predictor variables.

We will explore two examples of included variable bias: post-treatment bias and collider bias.

Example #1: Post-Treatment Bias

Carefully randomized experiments can be ruined just as easily as uncontrolled observational studies.

Setup: You are growing plants in a greenhouse. You want to know difference in growth under different anti-fungal soil treatments, as fungus on plants tends to reduce growth. Plants are initially seeded and sprout, and heights are measured (h0). Then different soil treatments (treatment) are applied. Final measures are the height of the plant (h1) and the presence of fungus (fungus).

Example #1: Post-Treatment Bias

We simulated some data to arrive at the following:

'data.frame':   100 obs. of  4 variables:
 $ h0       : num  9.14 9.11 9.04 10.83 9.16 ...
 $ h1       : num  14.3 15.6 14.4 15.8 11.5 ...
 $ treatment: int  0 0 0 0 0 0 0 0 0 0 ...
 $ fungus   : int  0 0 0 0 1 0 1 0 1 0 ...
              mean        sd      5.5%    94.5%    histogram
h0         9.95978 2.1011623  6.570328 13.07874 ▁▂▂▂▇▃▂▃▁▁▁▁
h1        14.39920 2.6880870 10.618002 17.93369     ▁▁▃▇▇▇▁▁
treatment  0.50000 0.5025189  0.000000  1.00000   ▇▁▁▁▁▁▁▁▁▇
fungus     0.23000 0.4229526  0.000000  1.00000   ▇▁▁▁▁▁▁▁▁▂

This data is based on the following DAG:

Example #1: Post-Treatment Bias

We will model growth (h1) as a function of initial growth (h0), treatment, and fungus.

modGvTF <- quap(
  alist(
    h1 ~ dnorm( mu, sigma ),
    mu <- h0*p,
    p <- a + bt*treatment + bf*fungus,
    a ~ dlnorm( 0, 0.2 ),
    bt ~ dnorm( 0, 0.5 ),
    bf ~ dnorm( 0, 0.5 ),
    sigma ~ dexp( 1 )
  ), data = d
)
  • Don’t worry too much about the difference in model type and priors.
  • Focus on the inferences!

Example #1: Post-Treatment Bias

precis( modGvTF )
              mean         sd        5.5%       94.5%
a      1.481431290 0.02450942  1.44226051  1.52060207
bt     0.002354539 0.02986811 -0.04538047  0.05008954
bf    -0.266765492 0.03654592 -0.32517293 -0.20835806
sigma  1.408732184 0.09860917  1.25113568  1.56632869

Discuss: Interpret!!

Looks like treatment has no effect, but fungus has a negative effect on growth. Given that we know treatment matters, what happened?

Example #1: Post-Treatment Bias

The problem is that fungus is mostly a consequence of treatment, i.e. fungus is called a post-treatment variable. So when we control for fungus (i.e. include it in our regression), we are implicitly asking the following question:

“Once we already know whether or not a plant developed fungus, does soil treatment matter?”

The answer to this question is “no”, because soil treatment has its effects on growth through reducing fungus.

Example #1: Post-Treatment Bias

To fix this, just leave fungus out of your regression:

modGvT <- quap(
  alist(
    h1 ~ dnorm( mu, sigma ),
    mu <- h0*p,
    p <- a + bt*treatment,
    a ~ dlnorm( 0, 0.2 ),
    bt ~ dnorm( 0, 0.5 ),
    sigma ~ dexp( 1 )
  ), data = d
)
precis( modGvT )
            mean         sd       5.5%     94.5%
a     1.38035086 0.02517701 1.34011315 1.4205886
bt    0.08499593 0.03429913 0.03017929 0.1398126
sigma 1.74641729 0.12193304 1.55154473 1.9412898

Discuss: Interpet!!

Looks like treatment indeed does have an effect on growth.

Example #1: Post-Treatment Bias

You can see this by looking at the implied conditional independencies of the DAG:

dag <- dagitty( "dag {
  H_0 -> H_1
  F -> H_1
  T -> F
}" )
coordinates(dag) <- list( 
  x = c(H_0=0, T=2, F=1.5, H_1=1),
  y = c(H_0=0, T=0, F=0, H_1=0) )
drawdag(dag)
impliedConditionalIndependencies( dag )
F _||_ H_0
H_0 _||_ T
H_1 _||_ T | F

Example #2: Collider Bias

Unfortunately, conditioning on a post-treatment variable can also fool you into thinking the treatment does work!

To see this, suppose the plant species isn’t really affected by fungus at all, but unmeasured moisture affects fungus and growth.

Here’s some simulated data for this scenario.

set.seed( 71 )
h0 <- rnorm( N, 10, 2 ) # Initial heights
treatment <- rep( 0:1, each=N/2 ) # Indicator variable for treatment
M <- rbern( N, prob = 0.5 ) # Moisture is Bernoulli distributed with p=0.5 (has moisture or doesn't)
fungus <- rbern( N, prob = 0.5 - treatment*0.4 + 0.4*M )
h1 <- h0 + rnorm( N, 5 + 3*M )
d2 <- data.frame( h0=h0, h1=h1, treatment=treatment, fungus=fungus )

Example #2: Collider Bias

If we include fungus in our regression of growth vs. treatment, we get the following inferences.

            mean         sd        5.5%      94.5%
a     1.55266338 0.03903640  1.49027567 1.61505108
bt    0.01451738 0.04399633 -0.05579725 0.08483201
bf    0.16787286 0.04520178  0.09563168 0.24011404
sigma 2.15653624 0.15019786  1.91649104 2.39658143

Discuss: Interpret!!

So fungus now helps growth???

Example #2: Collider Bias

If we just leave fungus out of the regression, we again get correct inferences.

            mean         sd       5.5%     94.5%
a     1.38035086 0.02517701 1.34011314 1.4205886
bt    0.08499594 0.03429914 0.03017929 0.1398126
sigma 1.74641749 0.12193308 1.55154488 1.9412901

Discuss: Interpret!!

So no weird inferences, and indeed treatment does not affect growth.

This sub-DAG leads to what is called collider bias.

Example #2: Collider Bias

Question: Why do so many restaurants in good locations have bad food?

Answer: Selection-distortion bias (collider bias)!

Let L denote “location rating”, F denote “food quality”, and S denote survivability of the restaurant. Consider the following DAG:

Example #2: Collider Bias

Let L denote “location rating”, F denote “food quality”, and S denote survivability of the restaurant. Consider the following DAG:

S in this case is called a collider. When you condition on a collider, it creates statistical, but not necessarily causal, associations among its causes.

Example #2: Collider bias

Suppose the top 2% of restaurants survive. Of the surviving restaurants, is there a relationship between location and food quality?

If we know that a restaurant has been around for awhile, and it has poor quality food, we automatically know it is in a good location. This is what creates the negative association above.

Example #2: Collider Bias

Let’s look at a different example to see this.

We will consider how age influences happiness. Is age a causal influence of happiness?

In this example, we will control for a plausible confound of happiness (marriage) and see see that it can bias inference about the influence of age.

Example #2: Collider Bias

We will make the following assumptions leading to a causal DAG model:

  • Average happiness of an individual is determined at birth and doesn’t change with age (so we are assuming there is actually no causal connection between age and happiness).
  • Suppose happiness influences marriage, i.e. happier people are more likely to get married.
  • Suppose also that age influences marriage: The more years you are alive, the more likely you will eventually get married.

Example #2: Collider Bias

These assumptions lead to the following DAG:

Marriage is a collider. If we condition on marriage, i.e. include it as a predictor in a regression between happiness and age, then it will induce a statistical association between age and happiness.

Example #2: Collider Bias

Consider the following simulated data that obeys the DAG causal model.

                   mean         sd        5.5%      94.5%  histogram
age        4.150000e+01 13.8606201 20.00000000 63.0000000 ▃▇▇▇▇▇▇▇▇▇
married    4.072917e-01  0.4915861  0.00000000  1.0000000 ▇▁▁▁▁▁▁▁▁▅
happiness -1.000070e-16  1.2145867 -1.78947368  1.7894737   ▇▅▇▅▅▇▅▇
A          5.000000e-01  0.2949068  0.04255319  0.9574468 ▇▇▇▅▇▇▅▇▇▇
mid        1.407292e+00  0.4915861  1.00000000  2.0000000 ▇▁▁▁▁▁▁▁▁▅

Comments:

  • A is transformed age from 0 to 1 (0 = 18 years old; 1 = 65 years old)
  • mid is married id (1 = single; 2 = married)

Example #2: Collider Bias

Let’s include the collider marriage as a predictor in a model:

mHvMA <- quap(
  alist(
    happiness ~ dnorm( mu, sigma ),
    mu <- a[mid] + bA*A,
    a[mid] ~ dnorm( 0, 1 ),
    bA ~ dnorm( 0, 2 ),
    sigma ~ dexp( 1 )
  ), data=d2
)
precis( mHvMA, depth=2 )
            mean         sd       5.5%      94.5%
a[1]  -0.2350877 0.06348986 -0.3365568 -0.1336186
a[2]   1.2585517 0.08495989  1.1227694  1.3943340
bA    -0.7490274 0.11320112 -0.9299447 -0.5681102
sigma  0.9897080 0.02255800  0.9536559  1.0257600

Discuss: Interpret!!

Age negatively affects happiness, which is an expected association when conditioning on a collider.

Example #2: Collider Bias

Here’s the model without conditioning on marriage.

mHvA <- quap(
  alist(
    happiness ~ dnorm( mu, sigma ),
    mu <- a + bA*A,
    a ~ dnorm( 0, 1 ),
    bA ~ dnorm( 0, 2 ),
    sigma ~ dexp( 1 )
  ), data=d2
)
precis( mHvA, depth=2 )
               mean         sd       5.5%     94.5%
a      1.649248e-07 0.07675015 -0.1226614 0.1226617
bA    -2.728620e-07 0.13225976 -0.2113769 0.2113764
sigma  1.213188e+00 0.02766080  1.1689803 1.2573949

Discuss: Interpet!!

There we go - we get that age and happiness are not causally related.

Bringing it all together

Chalk talk!!