This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
5E1. Which of the linear models below are multiple linear regressions? \[\begin{align} \mu_i &= \alpha + \beta x_i \tag{1}\\ \mu_i &= \beta_x x_i + \beta_z z_i \tag{2} \\ \mu_i &= \alpha + \beta(x_i - z_i) \tag{3} \\ \mu_i &= \alpha + \beta_x x_i + \beta_z z_i \tag{4} \\ \end{align}\]
# Models 2 and 4 are multiple linear regressions: each includes more than one predictor variable (x and z), each with its own slope parameter. Models 1 and 3 contain only a single slope, so they are simple (bivariate) regressions.
5E2. Write down a multiple regression to evaluate the claim: Animal diversity is linearly related to latitude, but only after controlling for plant diversity. You just need to write down the model definition.
# The multiple regression model is defined below, where A is animal diversity (the outcome), L is latitude and P is plant diversity:
\[A_i \sim Normal(\mu_i, \sigma)\] \[\mu_i = \alpha + \beta_L L_i + \beta_P P_i \]
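# A minimal quap sketch of this model (added illustration). The diversity data are not given, so the block below simulates standardized placeholder values for A, L and P purely to show the form of the fit; the data frame d_div and all values are hypothetical.
library(rethinking)   # quap() and precis() come from the rethinking package
set.seed(5)
n <- 50
L <- rnorm(n)                    # standardized latitude (placeholder)
P <- rnorm(n)                    # standardized plant diversity (placeholder)
A <- rnorm(n, 0.3*L + 0.5*P)     # standardized animal diversity (placeholder)
d_div <- data.frame(A = A, L = L, P = P)
m5E2 <- quap(
  alist(
    A ~ dnorm(mu, sigma),
    mu <- a + bL*L + bP*P,
    a ~ dnorm(0, 0.2),
    bL ~ dnorm(0, 0.5),
    bP ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d_div
)
precis(m5E2)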
5E3. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.
# The multiple regression models are defined below, where T is time to PhD degree (the outcome), F is amount of funding and S is size of laboratory.
# m3.1 and m3.2 are the bivariate models that evaluate "neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree".
# m3.3 is the joint model that evaluates "but together these variables are both positively associated with time to degree".
# Slope expectation: in the bivariate models m3.1 and m3.2 the slopes should sit near zero, while in m3.3 both slope parameters β_F and β_S should be on the positive side of zero.
# m3.1 definition:
\[T_i \sim Normal(\mu_i, \sigma)\] \[\mu_i = \alpha + \beta_F F_i \]
# m3.2 definition:
\[T_i \sim Normal(\mu_i, \sigma)\] \[\mu_i = \alpha + \beta_S S_i \]
# m3.3 definition:
\[T_i \sim Normal(\mu_i, \sigma)\] \[\mu_i = \alpha + \beta_F F_i + \beta_S S_i \]
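# A hedged simulation sketch of this scenario (added illustration; the PhD data and all variable names below are hypothetical). Funding and lab size are simulated to be negatively correlated while both increase time to degree, which is one way the bivariate slopes can sit near zero while both joint-model slopes come out positive. F_amt and T_phd are used as names to avoid R's built-in F and T.
set.seed(7)
n_phd <- 200
S <- rnorm(n_phd)                     # standardized lab size (placeholder)
F_amt <- rnorm(n_phd, -S, 0.5)        # standardized funding, negatively correlated with lab size (placeholder)
T_phd <- rnorm(n_phd, F_amt + S)      # standardized time to degree (placeholder)
d_phd <- data.frame(T_phd, F_amt, S)
m3.3_sim <- quap(
  alist(
    T_phd ~ dnorm(mu, sigma),
    mu <- a + bF*F_amt + bS*S,
    a ~ dnorm(0, 0.2),
    bF ~ dnorm(0, 0.5),
    bS ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d_phd
)
precis(m3.3_sim)   # both bF and bS should come out on the positive side of zero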
5E4. Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D. Let \(A_i\) be an indicator variable that is 1 where case i is in category A. Also suppose \(B_i\), \(C_i\), and \(D_i\) for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model. \[\begin{align} \mu_i &= \alpha + \beta_A A_i + \beta_B B_i + \beta_D D_i \tag{1} \\ \mu_i &= \alpha + \beta_A A_i + \beta_B B_i + \beta_C C_i + \beta_D D_i \tag{2} \\ \mu_i &= \alpha + \beta_B B_i + \beta_C C_i + \beta_D D_i \tag{3} \\ \mu_i &= \alpha_A A_i + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i \tag{4} \\ \mu_i &= \alpha_A (1 - B_i - C_i - D_i) + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i \tag{5} \\ \end{align}\]
# Models 1, 3, 4, and 5 are inferentially equivalent. Models 1 and 3 each omit one indicator, so the intercept becomes the mean of the omitted category and the slopes are differences from it; models 4 and 5 use one parameter per category (model 5 simply rewrites A_i as 1 - B_i - C_i - D_i). The category means, and therefore each model's posterior, can be computed from the posterior of any of the others. Model 2 adds an intercept on top of all four indicators, which makes its parameters non-identifiable, so it is the odd one out.
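# A quick check (added illustration) that models 1 and 4 are equivalent, using simulated data; everything below is hypothetical, with category C as the baseline absorbed by the intercept in model 1.
set.seed(42)
cat_id <- sample(1:4, 200, replace = TRUE)   # categories A, B, C, D
d_cat <- data.frame(
  A = as.numeric(cat_id == 1),
  B = as.numeric(cat_id == 2),
  C = as.numeric(cat_id == 3),
  D = as.numeric(cat_id == 4),
  y = rnorm(200, mean = c(1, 2, 3, 4)[cat_id])
)
m_e4_1 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + bA*A + bB*B + bD*D,
    a ~ dnorm(0, 5), bA ~ dnorm(0, 5), bB ~ dnorm(0, 5), bD ~ dnorm(0, 5),
    sigma ~ dexp(1)
  ), data = d_cat )
m_e4_4 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- aA*A + aB*B + aC*C + aD*D,
    aA ~ dnorm(0, 5), aB ~ dnorm(0, 5), aC ~ dnorm(0, 5), aD ~ dnorm(0, 5),
    sigma ~ dexp(1)
  ), data = d_cat )
post1 <- extract.samples(m_e4_1)
mean(post1$a + post1$bA)   # mean of category A implied by model 1 ...
precis(m_e4_4)             # ... should be close to the posterior mean of aA here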
5M1. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).
# Simulating a spurious association
# x_real is the truly causal predictor; it influences both the outcome y and the spurious predictor x_spur
set.seed(1000)
N <- 100 # number of cases
x_real <- rnorm(N) # x_real as Gaussian with mean 0 and stddev 1
x_spur <- rnorm(N, x_real) # x_spur as Gaussian with mean=x_real
y <- rnorm(N, x_real) # y as Gaussian with mean=x_real
d <- data.frame(y, x_real,x_spur) # bind all together in data frame
# From the scatterplots below, both x_real and x_spur are correlated with y.
pairs(d)
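# Added check: a bivariate regression of y on x_spur alone. Because x_spur shares the common cause x_real with y, we expect bSpur to be clearly positive here.
m5.1_spur <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + bSpur * x_spur,
    a ~ dnorm(0, 0.2),
    bSpur ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d
)
precis(m5.1_spur)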
m5.1 <- quap(
alist(
y ~ dnorm(mu, sigma),
mu <- a + bReal * x_real + bSpur * x_spur,
a ~ dnorm(0,0.2),
bReal ~ dnorm(0,0.5),
bSpur ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data = d
)
precis(m5.1)
## mean sd 5.5% 94.5%
## a -0.047060334 0.08724422 -0.1864935 0.09237278
## bReal 0.943689714 0.13361848 0.7301416 1.15723786
## bSpur -0.000362611 0.10306299 -0.1650772 0.16435195
## sigma 0.963666811 0.06782660 0.8552668 1.07206682
plot(precis(m5.1))
# However, when both predictors are included in the same regression, the posterior mean of bSpur is close to zero while bReal stays strongly positive: the association between y and x_spur mostly vanishes once x_real is known.
5M2. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.
set.seed(2971)
# Simulating masked association
# M -> K <- N
# M -> N
n <- 100
M <- rnorm( n )
N <- rnorm( n , M )
K <- rnorm( n , N - M )
d_sim <- data.frame(K=K,N=N,M=M)
# pairs(d_sim) would also work; ggpairs() from the GGally package adds the pairwise correlations to the panels
library(GGally)
ggpairs(d_sim)
# From the scatterplot matrix above, the outcome K is associated with both predictors, but in opposite directions: K is positively associated with N and weakly negatively associated with M. The two predictors M and N are also positively correlated with one another, which is what masks the bivariate associations.
m5.2_withM <- quap(
alist(
K ~ dnorm(mu, sigma),
mu <- a + bM * M,
a ~ dnorm(0,0.2),
bM ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data = d_sim
)
precis(m5.2_withM)
## mean sd 5.5% 94.5%
## a 0.01882277 0.10431529 -0.1478932 0.1855388
## bM -0.10126253 0.12599341 -0.3026243 0.1000993
## sigma 1.21140213 0.08489457 1.0757242 1.3470801
m5.2_withN <- quap(
alist(
K ~ dnorm(mu, sigma),
mu <- a + bN * N,
a ~ dnorm(0,0.2),
bN ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data = d_sim
)
precis(m5.2_withN)
## mean sd 5.5% 94.5%
## a 0.05387857 0.09828176 -0.1031947 0.2109518
## bN 0.35877786 0.08871407 0.2169956 0.5005601
## sigma 1.12539813 0.07894275 0.9992324 1.2515639
m5.2_both <- quap(
alist(
K ~ dnorm(mu, sigma),
mu <- a + bM * M + bN * N,
a ~ dnorm(0,0.2),
bM ~ dnorm(0,0.5),
bN ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data = d_sim
)
precis(m5.2_both)
## mean sd 5.5% 94.5%
## a 0.006370239 0.08872878 -0.1354355 0.1481760
## bM -0.729632194 0.13682871 -0.9483109 -0.5109535
## bN 0.720385098 0.10283817 0.5560298 0.8847404
## sigma 0.980453167 0.06909564 0.8700250 1.0908814
plot(coeftab(m5.2_withM,m5.2_withN,m5.2_both), par=c("bM", "bN"))
# Comparing the model with M only, the model with N only, and the model with both: the bivariate slopes are weak (bM near zero, bN moderately positive), but once both predictors are in the model bM becomes strongly negative and bN strongly positive. Because M and N are correlated with each other, each masks the other's association with K until both are included.
5M3. It is sometimes observed that the best predictor of fire risk is the presence of firefighters— States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?
# A high divorce rate returns more people to the pool of singles who are able to marry again, which could raise the marriage rate. One way to evaluate this is to bring in the re-marriage rate as an additional predictor and regress marriage rate on both divorce rate and re-marriage rate. If the association between marriage rate and divorce rate largely vanishes once re-marriage rate is included, that supports the idea that divorce raises the marriage rate mainly through re-marriages.
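# One way to write that model down, where M is marriage rate, D is divorce rate and R is the re-marriage rate (a variable we would need to collect; it is not in the WaffleDivorce data):
\[M_i \sim Normal(\mu_i, \sigma)\] \[\mu_i = \alpha + \beta_D D_i + \beta_R R_i \]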
5M4. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable.
data(WaffleDivorce)
# LDS data found at: https://en.wikipedia.org/wiki/The_Church_of_Jesus_Christ_of_Latter-day_Saints_membership_statistics_(United_States)
LDS_data = c(0.0077,0.0458,0.0600,0.0107,0.0191,0.0261,0.0045,0.0058,0.0045,0.0075,0.0082,0.0530,0.2586,0.0045,0.0068,0.0090,0.0132,0.0080,0.0064,0.0082,0.0072,0.0041,0.0045,0.0059,0.0073,0.0118,0.0473,0.0130,0.0065,0.0038,0.0331,0.0043,0.0085,0.0152,0.0054,0.0124,0.0364,0.0041,0.0040,0.0080,0.0120,0.0077,0.0125,0.6632,0.0074,0.0113,0.0380,0.0096,0.0047,0.1170)
d <- list()
d$A <- standardize( WaffleDivorce$MedianAgeMarriage )
d$D <- standardize( WaffleDivorce$Divorce )
d$M <- standardize( WaffleDivorce$Marriage )
d$LDS <- standardize(LDS_data)
m5.4 <- quap(
alist(
## A -> D, M -> D, LDS -> D
D ~ dnorm( mu , sigma ) ,
mu <- a + bM*M + bA*A + bL*LDS,
a ~ dnorm( 0 , 0.2 ) ,
bM ~ dnorm( 0 , 0.5 ) ,
bA ~ dnorm( 0 , 0.5 ) ,
bL ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data = d )
precis(m5.4)
## mean sd 5.5% 94.5%
## a -1.341369e-06 0.0918967 -0.1468700 0.1468673
## bM 3.761083e-02 0.1474931 -0.1981117 0.2733334
## bA -6.890921e-01 0.1442836 -0.9196851 -0.4584991
## bL -3.155185e-01 0.1196360 -0.5067200 -0.1243170
## sigma 7.316120e-01 0.0726453 0.6155108 0.8477133
plot(precis(m5.4))
# From the results, the posterior mean for marriage rate, bM, is close to zero, and the posterior mean for median age at marriage, bA, changes little from the book's model that uses marriage rate and age at marriage alone. The posterior mean for LDS membership, bL, is reliably negative. This supports the claim that States with a higher percentage of LDS members tend to have lower divorce rates, even after accounting for marriage rate and median age at marriage.
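# Added variant: the raw LDS proportions are strongly right-skewed (most states are below a few percent while Utah is above 60%), so the transformation the question hints at could be a log transform before standardizing. A minimal sketch, reusing the data list d built above:
d$logLDS <- standardize( log(LDS_data) )
m5.4_log <- quap(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM*M + bA*A + bL*logLDS,
    a ~ dnorm( 0 , 0.2 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    bL ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = d )
precis(m5.4_log)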
5M5. One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.
# First, price of gas could lead to less driving and therefore more exercise.
# Variable definitions: O is the obesity rate, G is the price of gasoline, D is the amount of driving and E is the amount of exercise.
\[ O_i \sim Normal (\mu_i, \sigma) \] \[\mu_i = \alpha + \beta_GG_i + \beta_DD_i + \beta_EE_i \]
# Second, price of gas could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals.
# Variable definitions: O is the obesity rate, G is the price of gasoline, D is the amount of driving, F is the frequency of eating out and R is the average spending on restaurant meals.
\[ O_i \sim Normal (\mu_i, \sigma) \] \[\mu_i = \alpha + \beta_GG_i + \beta_DD_i + \beta_FF_i + \beta_RR_i \]
# We can also address both mechanisms in a single model that includes the gasoline price G, an exercise variable E and a restaurant-spending variable R.
# Variable definitions: O is the obesity rate, G is the price of gasoline, E is the amount of exercise and R is the average spending on restaurant meals.
\[ O_i \sim Normal (\mu_i, \sigma) \] \[\mu_i = \alpha + \beta_GG_i + \beta_EE_i + \beta_RR_i \]
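# A quap sketch of this third model (added illustration). No real data are given, so the block below simulates standardized placeholder values for G, E and R that follow the two mechanisms, purely to show how the fit would look; the data frame d_gas and all values are hypothetical.
set.seed(11)
n_gas <- 100
G <- rnorm(n_gas)            # gas price (placeholder, standardized)
E <- rnorm(n_gas, G)         # exercise rises with gas price (placeholder assumption)
R <- rnorm(n_gas, -G)        # restaurant spending falls with gas price (placeholder assumption)
O <- rnorm(n_gas, -E + R)    # obesity falls with exercise and rises with restaurant spending (placeholder)
d_gas <- data.frame(O, G, E, R)
m5M5 <- quap(
  alist(
    O ~ dnorm(mu, sigma),
    mu <- a + bG*G + bE*E + bR*R,
    a ~ dnorm(0, 0.2),
    bG ~ dnorm(0, 0.5),
    bE ~ dnorm(0, 0.5),
    bR ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = d_gas
)
precis(m5M5)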