Chapter 5 - Many Variables and Spurious Waffles

This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard(H).

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

5E1. Which of the linear models below are multiple linear regressions? \[\begin{align} {μ_i = α + βx_i} \tag{1}\\ μ_i = β_xx_i + β_zz_i \tag{2} \\ μ_i = β_xx_i + β_zz_i \tag{3} \\ μ_i = α + β(x_i − z_i) \tag{4} \\ μ_i = α + β_xx_i + β_zz_i \tag{5} \\ \end{align}\]

# The first model has only one predictor, xi, so is not a multiple regression. The second model has two predictors, xi and zi, so is a multiple regression (despite having no intercept). The third model is tricky because it includes two predictor variables but only one slope term; however, algebra would apply the multiplication of β to both xi and zi, so we can consider this to be a strangely formatted multiple regression in which the slope for zi is constrained to be equal to −1 times the slope for xi. The fourth model has two predictors, xi and zi, so is a multiple regression.

5E2. Write down a multiple regression to evaluate the claim: Animal diversity is linearly related to latitude, but only after controlling for plant diversity. You just need to write down the model definition.

\[\begin{align} μ_i=α+β_AA_i+β_PP_i \end{align}\]

# where A  is animal diversity and P is plant diversity.

5E3. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

\[\begin{align} μ_i=α+β_FF_i+β_SS_i \end{align}\]

# where F is amount of funding and S is size of laboratory.

# Given the question text, both slope parameters, βf and βs, should be positive.

5E4. Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D. Let Ai be an indicator variable that is 1 where case i is in category A. Also suppose Bi, Ci, and Di for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model. \[\begin{align} μ_i = α + β_AA_i + β_BB_i + β_DD_i \tag{1} \\ μ_i = α + β_AA_i + β_BB_i + β_CC_i + β_DD_i \tag{2} \\ μ_i = α + β_BB_i + β_CC_i + β_DD_i \tag{3} \\ μ_i = α_AA_i + α_BB_i + α_CC_i + α_DD_i \tag{4} \\ μ_i = α_A(1 − B_i − C_i − D_i) + α_BB_i + α_CC_i + α_DD_i \tag{5} \\ \end{align}\]

# The first model includes a single intercept (for category C) and slopes for A, B, and D.

# The second model is non-identifiable because it includes a slope for all possible categories.

# The third model includes a single intercept (for category A) and slopes for B, C, and D.

# The fourth model uses the unique index approach to provide a separate intercept for each category (and no slopes).

# The fifth model uses the reparameterized approach on pages 154 and 155 to multiply the intercept for category A times 1 when in category A and times 0 otherwise.

# Models 1, 3, 4, and 5 are inferentially equivalent because they each allow the computation of each other’s posterior distribution.

5M1. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).

N <- 100
income <- rnorm(n = 100, mean = 0, sd = 1)
operas <- rnorm(n = N, mean = income, sd = 2)
scores <- rnorm(n = N, mean = income, sd = 1)
d <- data.frame(scores, operas, income)
pairs(d)

# Let’s first show that there is a (spurious) association between number of operas seen and score.

m <- map(
  alist(
    scores ~ dnorm(mu, sigma),
    mu <- a + bo * operas,
    a ~ dnorm(0, 5),
    bo ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##             mean         sd          5.5%      94.5%
## a     -0.2067430 0.14781850 -0.4429854862 0.02949953
## bo     0.1099296 0.06832681  0.0007301183 0.21912901
## sigma  1.4754794 0.10433200  1.3087367573 1.64222213
# Then let’s show that this association vanishes when family income is added to the model.

m <- map(
  alist(
    scores ~ dnorm(mu, sigma),
    mu <- a + bo * operas + bi * income,
    a ~ dnorm(0, 5),
    bo ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##              mean         sd       5.5%        94.5%
## a     -0.12827594 0.10645835 -0.2984169  0.041865055
## bo    -0.08787048 0.05313322 -0.1727876 -0.002953321
## bi     1.14894585 0.11853596  0.9595025  1.338389212
## sigma  1.05932680 0.07490552  0.9396133  1.179040297

5M2. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.

N <- 100
rho <- 0.6
alcohol <- rnorm(n = N, mean = 0, sd = 1)
illness <- rnorm(n = N, mean = rho * alcohol, sd = sqrt(1 - rho^2))
happiness <- rnorm(n = N, mean = alcohol - illness, sd = 1)
d <- data.frame(happiness, alcohol, illness)
pairs(d)

# Due to the masking relationship, the bivariate associations should be weak/widely variable.

m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * alcohol,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##             mean         sd       5.5%      94.5%
## a     -0.3361169 0.12312893 -0.5329007 -0.1393331
## ba     0.3449071 0.13423812  0.1303687  0.5594456
## sigma  1.2293510 0.08692799  1.0904233  1.3682787
m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + bi * illness,
    a ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##             mean         sd       5.5%     94.5%
## a     -0.3505198 0.12202204 -0.5455346 -0.155505
## bi    -0.3628060 0.12701031 -0.5657930 -0.159819
## sigma  1.2204599 0.08629929  1.0825369  1.358383
# But in the multivariate regression, the masking relationship should be resolved.

m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * alcohol + bi * illness,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##             mean         sd       5.5%      94.5%
## a     -0.2878130 0.10092012 -0.4491029 -0.1265232
## ba     0.9745423 0.14161581  0.7482129  1.2008717
## bi    -0.9496685 0.13496473 -1.1653682 -0.7339688
## sigma  1.0051711 0.07107606  0.8915778  1.1187643
# The slopes for alcohol and illness became much larger in magnitude in the multivariate model. Thus, in this example, because alcohol increases happiness and feelings of illness, and feelings of illness decrease happiness, the bivariate relationships of alcohol and feelings of illness to happiness are masked.

5M3. It is sometimes observed that the best predictor of fire risk is the presence of firefighters— States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?

# We might hypothesize that a high divorce rate causes a higher marriage rate by introducing more unmarried individuals (who have a demonstrated willingness to marry) into the dating pool. This possibility could be evaluated using multiple regression by regressing marriage rate on both divorce rate and re-marriage rate (i.e., the rate of non-first marriages or marriages following divorces). If divorce rate no longer predicts marriage rate even when the re-marriage rate is known, this would support our hypothesis.

5M4. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable.

# I will pull up the percent LDS population by State from Wikipedia and add it as a new column in the WaffleDivorce data frame (minus Nevada as it is apparently missing from WaffleDivorce). Given the positive skew of these percentages, I will log-transform them prior to standardization.

data("WaffleDivorce")
d <- WaffleDivorce
d$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)
d$logLDS <- log(d$LDS)
d$logLDS.s <- (d$logLDS - mean(d$logLDS)) / sd(d$logLDS)
simplehist(d$LDS)

simplehist(d$logLDS)

simplehist(d$logLDS.s)

# The transformed and standardized variables look much better in terms of distribution. Now we can build the requested multiple regression model.

m <- map(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS.s,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##              mean        sd        5.5%      94.5%
## a     35.44754145 6.7753434 24.61923416 46.2758487
## bm     0.05333784 0.0826203 -0.07870536  0.1853810
## ba    -1.02996290 0.2246962 -1.38907090 -0.6708549
## bl    -0.60806326 0.2905800 -1.07246618 -0.1436603
## sigma  1.37877064 0.1383999  1.15758087  1.5999604
# In this model, the slope of marriage was widely variable and its interval includes zero. However, the slopes of both median age at marriage and percentage of LDS population were negative and their intervals did not include zero. Thus, states with older median age at marriage or higher percentages of Mormons had lower divorce rates.

5M5. One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need. \[\begin{align} μ_i=α+β_GG_i+β_EE_i+β_RR_i \end{align}\]

# where G represents the price of gasoline, E represents one exercise-related variable, and R represents one restaurant-related variable. One version of this model might use self-reported frequencies of exercise and eating out, and another version might use more rigorously measured calories burned through exercise and calories ingested from restaurants.