Assignment #4

Questions

5E1. Which of the linear models below are multiple linear regressions? \[\begin{align} {μ_i = α + βx_i} \tag{1}\\ μ_i = β_xx_i + β_zz_i \tag{2} \\ μ_i = β_xx_i + β_zz_i \tag{3} \\ μ_i = α + β(x_i − z_i) \tag{4} \\ μ_i = α + β_xx_i + β_zz_i \tag{5} \\ \end{align}\]

The first model has only one predictor, \(x_i\), so it is a simple (single-predictor) regression.
The second model has two predictors, \(x_i\) and \(z_i\), each with its own slope, so it is a multiple regression.
The third model also has two predictors, \(x_i\) and \(z_i\), each with its own slope, so it is a multiple regression.
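The fourth model involves both \(x_i\) and \(z_i\), but only through their difference, which gets a single slope β. It is therefore a regression on one predictor, \(x_i − z_i\), rather than a multiple regression.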
The fifth model has two predictors, \(x_i\) and \(z_i\), each with its own slope, plus an intercept α, so it is a multiple regression.
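
As a quick check on the difference model, expanding model (4) gives:

\[\begin{align} μ_i &= α + β(x_i − z_i) \\ &= α + βx_i − βz_i \end{align}\]

So it can be read as a two-predictor model whose slopes are constrained to be equal in magnitude and opposite in sign; only one slope, β, is actually estimated.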

5E2. Write down a multiple regression to evaluate the claim: Animal diversity is linearly related to latitude, but only after controlling for plant diversity. You just need to write down the model definition.

I will assume latitude is the outcome and the diversity of plants and animals are the predictors.

\[μ_i = α + β_A A_i + β_P P_i\]

A: animal diversity
P: plant diversity

5E3. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

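\[μ_i = α + β_F F_i + β_S S_i\]

F: the amount of funding
S: size of laboratory

Because the two variables are both positively associated with time to degree once they are in the same model, both slopes should be on the positive side of zero: \(β_F > 0\) and \(β_S > 0\).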
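
One way such a pattern could arise is if funding and laboratory size are negatively correlated with each other. The following is only a minimal sketch under that assumption: the variable names, the correlation `rho`, and all effect sizes are hypothetical, and base R's `lm` is used as a stand-in for a Bayesian fit.

```r
# Hypothetical simulation: funding and lab size are negatively correlated,
# and both truly increase time to degree (all values are assumed).
set.seed(5)
N <- 1000
rho <- -0.9
funding <- rnorm(N)
lab_size <- rnorm(N, mean = rho * funding, sd = sqrt(1 - rho^2))
time_to_degree <- rnorm(N, mean = funding + lab_size, sd = 1)

# Each predictor alone: the slope is small because the other,
# negatively correlated predictor pulls in the opposite direction.
b_funding_alone <- coef(lm(time_to_degree ~ funding))[["funding"]]
b_lab_alone <- coef(lm(time_to_degree ~ lab_size))[["lab_size"]]

# Both predictors together: both slopes are clearly positive.
b_joint <- coef(lm(time_to_degree ~ funding + lab_size))

print(c(funding_alone = b_funding_alone, lab_alone = b_lab_alone))
print(b_joint)
```

In this setup each bivariate slope is close to zero while both slopes in the joint model are near their assumed value of 1.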

5E4. Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D. Let Ai be an indicator variable that is 1 where case i is in category A. Also suppose Bi, Ci, and Di for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model. \[\begin{align} μ_i = α + β_AA_i + β_BB_i + β_DD_i \tag{1} \\ μ_i = α + β_AA_i + β_BB_i + β_CC_i + β_DD_i \tag{2} \\ μ_i = α + β_BB_i + β_CC_i + β_DD_i \tag{3} \\ μ_i = α_AA_i + α_BB_i + α_CC_i + α_DD_i \tag{4} \\ μ_i = α_A(1 − B_i − C_i − D_i) + α_BB_i + α_CC_i + α_DD_i \tag{5} \\ \end{align}\]

The first model includes a single intercept (for category C) and slopes for A, B, and D.
The second model is non-identifiable because it includes a slope for all possible categories (page 156).
The third model includes a single intercept (for category A) and slopes for B, C, and D.
The fourth model uses the unique index approach to provide a separate intercept for each category (and no slopes).
The fifth model uses the reparameterized approach to multiply the intercept for category A times 1 when in category A and times 0 otherwise.

Models 1, 3, 4, and 5 are inferentially equivalent: the posterior distribution of any one of them can be computed from the posterior of any other (e.g., each category's intercept and its difference from every other category).
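
The equivalence between the index form (model 4) and the indicator form (model 3) can be illustrated with a small sketch. The data and category means below are made up, and least-squares estimates from base R's `lm` stand in for posterior distributions; the same arithmetic would be applied to posterior samples.

```r
# Hypothetical data: one categorical predictor with levels A-D (assumed means).
set.seed(4)
g <- factor(rep(c("A", "B", "C", "D"), each = 25))
y <- rnorm(100, mean = c(A = 1, B = 2, C = 0, D = 3)[as.character(g)], sd = 1)

# Model 4: one intercept per category (index approach).
idx <- coef(lm(y ~ 0 + g))

# Model 3: intercept for A plus offsets for B, C and D (indicator approach).
ind <- coef(lm(y ~ g))

# Recover model 3's parameters from model 4's:
# alpha = alpha_A, and each beta is that category's intercept minus alpha_A.
from_idx <- c(idx[["gA"]], idx[c("gB", "gC", "gD")] - idx[["gA"]])
print(cbind(model3 = unname(ind), from_model4 = unname(from_idx)))
```

The two columns agree up to numerical precision, which is what inferential equivalence means in practice here.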

5M1. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).

# simulate data and create a spurious correlation between operas and scores
library(rethinking)
N <- 100
income <- rnorm(n = N, mean = 0, sd = 1)
operas <- rnorm(n = N, mean = income, sd = 2)
scores <- rnorm(n = N, mean = income, sd = 1)
d <- data.frame(scores, operas, income)
pairs(d)

# there is a spurious association between number of operas seen and score.
m <- map(
  alist(
    scores ~ dnorm(mu, sigma),
    mu <- a + bo * operas,
    a ~ dnorm(0, 5),
    bo ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##            mean         sd        5.5%     94.5%
## a     0.1173716 0.12443739 -0.08150343 0.3162465
## bo    0.1429025 0.05460711  0.05562978 0.2301752
## sigma 1.2402675 0.08769995  1.10010604 1.3804290

# the association vanishes when family income is added to the model.
m <- map(
  alist(
    scores ~ dnorm(mu, sigma),
    mu <- a + bo * operas + bi * income,
    a ~ dnorm(0, 5),
    bo ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##              mean         sd       5.5%      94.5%
## a     -0.03211012 0.08906148 -0.1744476 0.11022732
## bo    -0.05422107 0.04324084 -0.1233283 0.01488614
## bi     1.07364882 0.10691183  0.9027831 1.24451458
## sigma  0.87504740 0.06187491  0.7761594 0.97393545

In the second model, income predicts scores even when the number of operas is known (bi ≈ 1.07, with an interval well above zero), whereas operas adds almost nothing once income is known (bo ≈ −0.05, with an interval that includes zero). Therefore, the bivariate association between test scores and operas is spurious.

5M2. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.

# define the 3 variables: alcohol, illness, happiness
N <- 100
rho <- 0.6
alcohol <- rnorm(n = N, mean = 0, sd = 1)
illness <- rnorm(n = N, mean = rho * alcohol, sd = sqrt(1 - rho^2))
happiness <- rnorm(n = N, mean = alcohol - illness, sd = 1)
d <- data.frame(happiness, alcohol, illness)
pairs(d)

# bivariate relationship - a weak relationship
m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * alcohol,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##              mean         sd        5.5%     94.5%
## a     -0.02310495 0.10808486 -0.19584544 0.1496355
## ba     0.24820240 0.11888546  0.05820046 0.4382043
## sigma  1.07165939 0.07577755  0.95055223 1.1927665

m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + bi * illness,
    a ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##             mean         sd       5.5%      94.5%
## a     -0.0516867 0.10406183 -0.2179976  0.1146242
## bi    -0.3256925 0.09989849 -0.4853495 -0.1660354
## sigma  1.0408375 0.07359811  0.9232135  1.1584615

# multiple regression with both predictors:
m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * alcohol + bi * illness,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##              mean         sd       5.5%      94.5%
## a      0.06454818 0.08453495 -0.0705550  0.1996514
## ba     0.95323283 0.12656784  0.7509530  1.1555127
## bi    -0.89066414 0.10949873 -1.0656643 -0.7156640
## sigma  0.83123274 0.05877681  0.7372961  0.9251694

In the multiple regression, the relationship is unmasked: alcohol increases happiness (ba ≈ 0.95) and illness decreases it (bi ≈ −0.89). Both slopes are much larger in magnitude than in the bivariate models (ba ≈ 0.25, bi ≈ −0.33), where each association was partly masked by the other predictor.

5M3. It is sometimes observed that the best predictor of fire risk is the presence of firefighters— States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?

One possibility is that a high divorce rate produces more single individuals, who can later remarry. We can evaluate this with a multiple regression in which the outcome is the marriage rate and the predictors are the divorce rate and the remarriage rate (marriages involving previously divorced individuals). The model shows whether these two variables explain the marriage rate; if the divorce rate adds little once the remarriage rate is known, that is consistent with divorce raising the marriage rate through remarriage.
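
A minimal simulated sketch of this idea follows. The state-level rates are hypothetical standardized values, every effect size is an assumption, and base R's `lm` stands in for a Bayesian fit.

```r
# Hypothetical standardized rates: divorce drives remarriage,
# and remarriage drives the overall marriage rate (assumed effects).
set.seed(53)
N <- 1000
divorce <- rnorm(N)
remarriage <- rnorm(N, mean = 0.8 * divorce, sd = 0.6)
marriage <- rnorm(N, mean = 0.7 * remarriage, sd = 1)

# Divorce alone: positive association with the marriage rate.
b_alone <- coef(lm(marriage ~ divorce))[["divorce"]]

# Adding remarriage: the divorce slope shrinks toward zero because
# its influence runs entirely through remarriage in this simulation.
b_joint <- coef(lm(marriage ~ divorce + remarriage))

print(c(divorce_alone = b_alone))
print(b_joint)
```

With real data, a divorce slope that stays clearly positive after adding the remarriage rate would instead suggest some other pathway.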

5M4. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable.

# LDS population by State data comes from Wikipedia - adding them to WaffleDivorce dataset

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data("WaffleDivorce")
d <- WaffleDivorce
d$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)
d$logLDS <- log(d$LDS)
d$logLDS.s <- (d$logLDS - mean(d$logLDS)) / sd(d$logLDS)

# inspect the distributions - the log transform reduces the right skew
simplehist(d$LDS)

simplehist(d$logLDS)

simplehist(d$logLDS.s)

# build the multiple regression model
m <- map(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS.s,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##             mean         sd        5.5%      94.5%
## a     35.4368831 6.77530885 24.60863100 46.2651353
## bm     0.0534121 0.08261842 -0.07862809  0.1854523
## ba    -1.0296166 0.22469483 -1.38872235 -0.6705109
## bl    -0.6080773 0.29057042 -1.07246493 -0.1436896
## sigma  1.3787254 0.13838904  1.15755296  1.5998978

The slope for marriage rate is highly uncertain and its interval includes zero.
The slopes for both median age at marriage and LDS population are negative, and their intervals do not include zero. States with an older median age at marriage or a higher percentage of LDS population had lower divorce rates.

5M5. One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.

To address these two mechanisms, we first need variables that capture them. For the first mechanism, a variable such as weekly exercise time; for the second, a variable such as the weekly frequency of dining at restaurants.

However, introducing more variables may lead to multicollinearity, especially when the variables are highly correlated.

So, I would propose the following multiple regression model:

\[μ_i = α + β_G G_i + β_E E_i + β_R R_i\]

G: price of gasoline
E: weekly exercise time
R: weekly frequency of dining at restaurants
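
To see how such a model might behave, here is a small simulated sketch. All variables are hypothetical standardized values, the effect sizes are assumptions, and base R's `lm` is used in place of a Bayesian fit.

```r
# Hypothetical mechanisms: higher gas prices raise exercise and lower
# restaurant visits; exercise lowers obesity, restaurant visits raise it.
set.seed(55)
N <- 1000
gas <- rnorm(N)
exercise <- rnorm(N, mean = 0.6 * gas, sd = 0.8)
restaurant <- rnorm(N, mean = -0.6 * gas, sd = 0.8)
obesity <- rnorm(N, mean = -0.5 * exercise + 0.5 * restaurant, sd = 1)

# Gas price alone: negative total association with obesity.
b_total <- coef(lm(obesity ~ gas))[["gas"]]

# Full model: exercise and restaurant carry the effect, and the
# gas slope shrinks toward zero because both pathways are included.
b_full <- coef(lm(obesity ~ gas + exercise + restaurant))

print(c(gas_alone = b_total))
print(b_full)
```

Comparing the gas slope with and without each mechanism variable indicates how much of the association each pathway accounts for.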


Zhengyuan Huang

2020-12-08

Chapter 5 - Many Variables and Spurious Waffles
