Chapter 5 - Many Variables and Spurious Waffles

This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.
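
As a small illustration of how such assumptions can be written down in code, here is a sketch using the dagitty package (an assumption for this illustration; the package is not used elsewhere in this assignment) of the fork DAG behind a spurious correlation, where a common cause Z induces an association between X and Y:

library(dagitty)

# a fork: Z is a common cause of both X and Y
dag_fork <- dagitty("dag{ Z -> X; Z -> Y }")

# the DAG implies X and Y are independent once we condition on Z
impliedConditionalIndependencies(dag_fork)  # X _||_ Y | Z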

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes/answers the question or activity requested. Make sure to include plots if the question requests them. Problems are labeled Easy (E), Medium (M), and Hard (H).

Finally, upon completion, name your final output .html file as YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.

Questions

5E1. Which of the linear models below are multiple linear regressions? \[\begin{align} {μ_i = α + βx_i} \tag{1}\\ μ_i = β_xx_i + β_zz_i \tag{2} \\ μ_i = α + β(x_i − z_i) \tag{3} \\ μ_i = α + β_xx_i + β_zz_i \tag{4} \\ \end{align}\]

# Models 2 and 4 are multiple linear regressions: each includes more than one predictor variable, each with its own slope. Model 1 has a single predictor. Model 3 contains two predictor variables but only one slope applied to their difference, so it is effectively a simple regression on the single constructed variable (x_i − z_i).

5E2. Write down a multiple regression to evaluate the claim: Animal diversity is linearly related to latitude, but only after controlling for plant diversity. You just need to write down the model definition.

# A_i ~ Normal(μ_i, σ)
# μ_i = α + β_L L_i + β_P P_i
# A is animal diversity (the outcome), L is latitude, and P is plant diversity. The claim says latitude matters only after controlling for plant diversity, so both L and P enter as predictors.

5E3. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

# T_i ~ Normal(μ_i, σ)
# μ_i = α + β_F F_i + β_S S_i
# T is time to PhD degree, F is amount of funding, and S is laboratory size. With both predictors in the model, both slopes, β_F and β_S, should be on the positive side of zero. A runnable sketch with simulated data follows.
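
# A runnable sketch of this answer (simulated data only; F_fund, S_size, and T_phd are hypothetical standardized variables invented for the illustration), using map() from the rethinking package:
library(rethinking)

n <- 100
F_fund <- rnorm(n)                       # amount of funding
S_size <- rnorm(n, -F_fund, 1)           # lab size, negatively correlated with funding
T_phd <- rnorm(n, F_fund + S_size, 1)    # both raise time to degree
d_phd <- data.frame(T_phd, F_fund, S_size)

m_phd <- map(
  alist(
    T_phd ~ dnorm(μ, σ),
    μ <- α + β_F * F_fund + β_S * S_size,
    α ~ dnorm(0, 5),
    β_F ~ dnorm(0, 5),
    β_S ~ dnorm(0, 5),
    σ ~ dunif(0, 5)
  ),
  data = d_phd
)
precis(m_phd)  # β_F and β_S should both sit on the positive side of zero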

5E4. Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D. Let Ai be an indicator variable that is 1 where case i is in category A. Also suppose Bi, Ci, and Di for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model. \[\begin{align} μ_i = α + β_AA_i + β_BB_i + β_DD_i \tag{1} \\ μ_i = α + β_AA_i + β_BB_i + β_CC_i + β_DD_i \tag{2} \\ μ_i = α + β_BB_i + β_CC_i + β_DD_i \tag{3} \\ μ_i = α_AA_i + α_BB_i + α_CC_i + α_DD_i \tag{4} \\ μ_i = α_A(1 − B_i − C_i − D_i) + α_BB_i + α_CC_i + α_DD_i \tag{5} \\ \end{align}\]

# Models 1, 3, 4, and 5 are inferentially equivalent: each uses four parameters to encode the four category means, so the posterior of any one can be computed from the posterior of any other (for example, model 4's α_A equals model 1's α + β_A, and model 1's α is model 4's α_C). Model 2 is the odd one out: it has an intercept plus indicators for all four categories, five parameters for four means, so the individual parameters are not identifiable (only sums such as α + β_A are).
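
# A small runnable check of what "inferentially equivalent" means (simulated data; the data frame d_cat and model m1 are invented for this sketch): fit model (1), then compute model (4)'s per-category means from its posterior.
n <- 200
cat_id <- sample(1:4, n, replace = TRUE)   # categories A, B, C, D
A <- as.numeric(cat_id == 1)
B <- as.numeric(cat_id == 2)
C <- as.numeric(cat_id == 3)
D <- as.numeric(cat_id == 4)
y <- rnorm(n, mean = c(1, 2, 3, 4)[cat_id], sd = 1)
d_cat <- data.frame(y, A, B, C, D)

m1 <- map(
  alist(
    y ~ dnorm(μ, σ),
    μ <- α + β_A * A + β_B * B + β_D * D,  # model (1): category C is the baseline
    α ~ dnorm(0, 10),
    β_A ~ dnorm(0, 10),
    β_B ~ dnorm(0, 10),
    β_D ~ dnorm(0, 10),
    σ ~ dunif(0, 10)
  ),
  data = d_cat
)

post <- extract.samples(m1)
mean(post$α + post$β_A)  # model (4)'s α_A, the category A mean
mean(post$α + post$β_B)  # α_B
mean(post$α)             # α_C (the baseline category)
mean(post$α + post$β_D)  # α_D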

5M1. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).

# A classic spurious correlation: drownings rise when ice cream sales rise. Both are driven by a common cause, temperature, so the ice cream-drowning association should mostly vanish once temperature is included in the model.

n <- 100
temperature <- rnorm(n, mean = 0, sd = 1)                # the common cause
drownings <- rnorm(n, mean = temperature, sd = 2)        # outcome, driven by temperature
ice_cream_sales <- rnorm(n, mean = temperature, sd = 1)  # predictor, also driven by temperature
d <- data.frame(drownings, ice_cream_sales, temperature)
pairs(d)

# Let's see the association between drownings and ice cream sales
m <- map(
  alist(
    drownings ~ dnorm(μ, σ),
    μ <- α + β * ice_cream_sales,
    α ~ dnorm(0, 5),
    β ~ dnorm(0, 5),
    σ ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##         mean        sd       5.5%     94.5%
## α 0.02489142 0.1898644 -0.2785485 0.3283314
## β 0.61346533 0.1232024  0.4165641 0.8103665
## σ 1.88161057 0.1330496  1.6689716 2.0942496
# Now let's include temperature
m <- map(
  alist(
    drownings ~ dnorm(μ, σ),
    μ <- α + β_i * ice_cream_sales + β_t * temperature,
    α ~ dnorm(0, 5),
    β_i ~ dnorm(0, 5),
    β_t ~ dnorm(0, 5),
    σ ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##           mean        sd        5.5%     94.5%
## α   0.02046669 0.1789656 -0.26555491 0.3064883
## β_i 0.19981719 0.1646513 -0.06332745 0.4629618
## β_t 0.83570566 0.2358071  0.45884045 1.2125709
## σ   1.77341323 0.1253992  1.57300115 1.9738253
# Comparing the two models: temperature still predicts drownings once ice cream sales are known, but ice cream sales add little once temperature is known (the interval for β_i now overlaps zero). Hence the ice cream-drowning correlation is spurious.

5M2. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.

# Rather than a real-life example, I simulate two correlated predictors, x_p and x_n, that push the outcome y in opposite directions (x_p positively, x_n negatively).
n <- 100
rho <- 0.6                                               # correlation between the two predictors
x_p <- rnorm(n, mean = 0, sd = 1)                        # predictor with a positive effect on y
x_n <- rnorm(n, mean = rho * x_p, sd = sqrt(1 - rho^2))  # correlated predictor with a negative effect
y <- rnorm(n, mean = x_p - x_n, sd = 1)                  # outcome depends on both, in opposite directions
d <- data.frame(y, x_p, x_n)
pairs(d)

# Now let's see the bivariate associations
m <- map(
  alist(
    y ~ dnorm(μ, σ),
    μ <- α + β_xp * x_p,
    α ~ dnorm(0, 5),
    β_xp ~ dnorm(0, 5),
    σ ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##             mean        sd       5.5%     94.5%
## α    0.007830104 0.1441017 -0.2224722 0.2381324
## β_xp 0.592288709 0.1471248  0.3571549 0.8274225
## σ    1.441456242 0.1019261  1.2785587 1.6043538
m <- map(
  alist(
    y ~ dnorm(μ, σ),
    μ <- α + β_xn * x_n,
    α ~ dnorm(0, 5),
    β_xn ~ dnorm(0, 5),
    σ ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##             mean        sd       5.5%      94.5%
## α    -0.05001098 0.1461259 -0.2835484  0.1835264
## β_xn -0.55333959 0.1447783 -0.7847233 -0.3219559
## σ     1.45149068 0.1026357  1.2874590  1.6155224
# Now in multivariate regression:
m <- map(
  alist(
    y ~ dnorm(μ, σ),
    μ <- α + β_xp * x_p + β_xn * x_n,
    α ~ dnorm(0, 5),
    β_xp ~ dnorm(0, 5),
    β_xn ~ dnorm(0, 5),
    σ ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##            mean         sd       5.5%       94.5%
## α    -0.1341661 0.10773878 -0.3063535  0.03802123
## β_xp  1.1571043 0.12529188  0.9568636  1.35734488
## β_xn -1.1136377 0.12244127 -1.3093225 -0.91795294
## σ     1.0661404 0.07539949  0.9456374  1.18664332
# Comparing the three models: x_p is positively associated with y and x_n negatively. Both slopes roughly double in magnitude in the multivariate model (about ±1.1 versus about 0.59 and -0.55 alone), which is the signature of a masked relationship.

5M3. It is sometimes observed that the best predictor of fire risk is the presence of firefighters— States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?

# A high divorce rate increases the pool of divorced (now unmarried) people who can marry again, which would raise the marriage rate.
# To evaluate this, regress marriage rate on both divorce rate and a re-marriage rate (the rate of marriages in which at least one partner was previously divorced). If divorce drives marriage only through re-marriage, the divorce-rate slope should shrink toward zero once the re-marriage rate is included. A sketch with simulated data follows.
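
# A sketch of that regression with simulated data (WaffleDivorce has no re-marriage variable, so D, R, and M below are hypothetical standardized stand-ins for divorce rate, re-marriage rate, and marriage rate):
n <- 100
D <- rnorm(n)            # divorce rate
R <- rnorm(n, D, 0.5)    # re-marriage rate rises with the divorce rate
M <- rnorm(n, R, 1)      # marriage rate responds to re-marriage, not to divorce directly
d_rev <- data.frame(M, D, R)

m_rev <- map(
  alist(
    M ~ dnorm(μ, σ),
    μ <- α + β_D * D + β_R * R,
    α ~ dnorm(0, 5),
    β_D ~ dnorm(0, 5),
    β_R ~ dnorm(0, 5),
    σ ~ dunif(0, 5)
  ),
  data = d_rev
)
precis(m_rev)  # under this story, β_R stays away from zero while β_D shrinks toward it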

5M4. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable.

# I used the data from https://www.worldatlas.com/articles/mormon-population-by-state.html to add the percentage of LDS members to the WaffleDivorce data frame. Note that the data frame has only 50 rows, Nevada is missing, and the percentages must be entered in the same order as the states in d$Location.

data(WaffleDivorce)
d <- WaffleDivorce
d$LDS <- c(0.6770, 0.2642, 0.1153, 0.0610, 0.0517, 0.0481, 0.0456, 0.0394, 0.0376, 0.0335, 0.0274, 0.0197, 0.0149, 0.0130, 0.0129, 0.0125, 0.0121, 0.0121, 0.0116, 0.0113, 0.0103, 0.0093, 0.0090, 0.0084, 0.0082, 0.0082, 0.0081, 0.0079, 0.0077, 0.0075, 0.0075, 0.0073, 0.0073, 0.0072, 0.0067, 0.0065, 0.0064, 0.0059, 0.0057, 0.0053, 0.0046, 0.0045, 0.0045, 0.0044, 0.0041, 0.0040, 0.0040, 0.0040, 0.0039, 0.0037)
d$logLDS <- log(d$LDS)
d$logLDS.s <- (d$logLDS - mean(d$logLDS)) / sd(d$logLDS)
simplehist(d$logLDS.s)

# The standardized log-LDS distribution looks reasonable. Now fit the multiple regression model:
rm <- map(
  alist(
    Divorce ~ dnorm(μ, σ),
    μ <- α + β_M * Marriage + β_A * MedianAgeMarriage + β_L * logLDS.s,
    α ~ dnorm(10, 20),
    β_M ~ dnorm(0, 10),
    β_A ~ dnorm(0, 10),
    β_L ~ dnorm(0, 10),
    σ ~ dunif(0, 5)
  ),
  data = d
)
precis(rm)
##            mean         sd       5.5%       94.5%
## α   36.90907918 6.77764490 26.0770936 47.74106476
## β_M -0.06888806 0.07305389 -0.1856423  0.04786616
## β_A -0.99168718 0.21708128 -1.3386250 -0.64474937
## β_L  0.47398845 0.19916650  0.1556819  0.79229498
## σ    1.36071344 0.13670071  1.1422393  1.57918758
# From the model, the slope for marriage rate straddles zero, so marriage rate adds little once the other predictors are known. The slope for median age at marriage is reliably negative: states with an older median age at marriage have lower divorce rates. The slope for standardized log-LDS percentage also excludes zero, so LDS population share predicts divorce rate beyond marriage rate and median age at marriage. Since the question states that heavily LDS states have much lower divorce rates, this slope should come out negative once the LDS percentages are correctly matched to the state order in d$Location.

5M5. One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.

# We need predictors that measure the two mechanisms directly. For the exercise mechanism, use average time spent walking or exercising (E); for the eating-out mechanism, use the average number of restaurant meals (R). With gas price G and obesity rate O as the outcome:
# μ_i = α + β_G G_i + β_E E_i + β_R R_i
# If gas price affects obesity only through these mechanisms, β_G should shrink toward zero once E and R are in the model, with β_E negative and β_R positive. Simpler versions (O on G and E, and O on G and R) isolate each pathway separately. A sketch with simulated data follows.
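
# A sketch of the joint model with simulated stand-ins (G = gas price, E_t = exercise time, R_m = restaurant meals, O = obesity rate; all hypothetical and standardized):
n <- 100
G <- rnorm(n)
E_t <- rnorm(n, G, 1)          # pricier gas -> less driving -> more exercise
R_m <- rnorm(n, -G, 1)         # pricier gas -> less driving -> fewer restaurant meals
O <- rnorm(n, -E_t + R_m, 1)   # obesity falls with exercise, rises with eating out
d_ob <- data.frame(O, G, E_t, R_m)

m_ob <- map(
  alist(
    O ~ dnorm(μ, σ),
    μ <- α + β_G * G + β_E * E_t + β_R * R_m,
    α ~ dnorm(0, 5),
    β_G ~ dnorm(0, 5),
    β_E ~ dnorm(0, 5),
    β_R ~ dnorm(0, 5),
    σ ~ dunif(0, 5)
  ),
  data = d_ob
)
precis(m_ob)  # expect β_E below zero, β_R above zero, and β_G near zero once both mechanisms are included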