Assignment #4

5E1. Which of the linear models below are multiple linear regressions? \[\begin{align} {μ_i = α + βx_i} \tag{1}\\ μ_i = β_xx_i + β_zz_i \tag{2} \\ μ_i = β_xx_i + β_zz_i \tag{3} \\ μ_i = α + β(x_i − z_i) \tag{4} \\ μ_i = α + β_xx_i + β_zz_i \tag{5} \\ \end{align}\]

# Model(2),(3),(4)and(5) are multiple linear regressions.

5E2. Write down a multiple regression to evaluate the claim: Animal diversity is linearly related to latitude, but only after controlling for plant diversity. You just need to write down the model definition.

# μ_i = α + βlLi + βpPi 
# Where L is latitude and P is plant diversity

5E3. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

# μ_i = α + βfFi + βlLi
# Where F is amount of funding and L is size of laboratory

5E4. Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D. Let Ai be an indicator variable that is 1 where case i is in category A. Also suppose Bi, Ci, and Di for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model. \[\begin{align} μ_i = α + β_AA_i + β_BB_i + β_DD_i \tag{1} \\ μ_i = α + β_AA_i + β_BB_i + β_CC_i + β_DD_i \tag{2} \\ μ_i = α + β_BB_i + β_CC_i + β_DD_i \tag{3} \\ μ_i = α_AA_i + α_BB_i + α_CC_i + α_DD_i \tag{4} \\ μ_i = α_A(1 − B_i − C_i − D_i) + α_BB_i + α_CC_i + α_DD_i \tag{5} \\ \end{align}\]

# Linear models 1， 3， 4， 5 are inferentially equivalent ways to include the categorical variable in a regression.

5M1. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).

# This example demonstrates a spurious correlation between Japanese passenger cars sold in the US and suicides by crashing of motor vehicle that vanishes when the oil price is included. (Smaller vehicles, often imported from Japan, became popular with the American public as these vehicles featured better fuel economy ratings).

N <- 100
oil_price <- rnorm(n = N, mean = 0, sd = 1)
suicides <- rnorm(n = N, mean = oil_price, sd = 3)
car_sold <- rnorm(n = N, mean = oil_price, sd = 1)
d <- data.frame(suicides, car_sold, oil_price)
pairs(d)

m <- map(
  alist(
    car_sold ~ dnorm(mu, sigma),
    mu <- a + b1 * suicides,
    a ~ dnorm(0, 5),
    b1 ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##            mean         sd        5.5%      94.5%
## a     0.0452647 0.13575964 -0.17170543 0.26223482
## b1    0.0121712 0.04196596 -0.05489851 0.07924092
## sigma 1.3513325 0.09555338  1.19861971 1.50404524

m2 <- map(
  alist(
    car_sold ~ dnorm(mu, sigma),
    mu <- a + b1 * suicides + b2 * oil_price,
    a ~ dnorm(0, 5),
    b1 ~ dnorm(0, 5),
    b2 ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m2)

##              mean         sd       5.5%       94.5%
## a     -0.04861544 0.09583496 -0.2017782  0.10454733
## b1    -0.07798206 0.03079485 -0.1271982 -0.02876595
## b2     1.07252322 0.10588234  0.9033028  1.24174365
## sigma  0.94927688 0.06712405  0.8419997  1.05655408

5M2. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.

# This example demonstrates a masked relationship between the blood cholesterol and the sodium intake and person’s weight.
N <- 100
rho <- 0.6
weight <- rnorm(n = N, mean = 0, sd = 1)
sodium <- rnorm(n = N, mean = rho * weight, sd = sqrt(1 - rho^2))
cholesterol <- rnorm(n = N, mean = weight - sodium, sd = 1)
d <- data.frame(cholesterol , weight, sodium)
pairs(d)

m1 <- map(
  alist(
    cholesterol ~ dnorm(mu, sigma),
    mu <- a + ba * weight,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m1)

##             mean         sd       5.5%     94.5%
## a     -0.0189785 0.12710489 -0.2221167 0.1841597
## ba     0.5310524 0.11974697  0.3396736 0.7224312
## sigma  1.2700965 0.08980918  1.1265641 1.4136290

m2 <- map(
  alist(
    cholesterol ~ dnorm(mu, sigma),
    mu <- a + bi * sodium,
    a ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m2)

##              mean         sd       5.5%      94.5%
## a     -0.07679212 0.13291640 -0.2892182  0.1356339
## bi    -0.43011012 0.13716267 -0.6493226 -0.2108977
## sigma  1.32575450 0.09374431  1.1759330  1.4755760

m3 <- map(
  alist(
    cholesterol ~ dnorm(mu, sigma),
    mu <- a + ba * weight + bi * sodium,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m3)

##              mean         sd       5.5%       94.5%
## a     -0.07109837 0.09450472 -0.2221352  0.07993843
## ba     1.04157203 0.10531687  0.8732553  1.20988873
## bi    -1.04368251 0.11557614 -1.2283955 -0.85896952
## sigma  0.94243912 0.06664021  0.8359352  1.04894304

5M3. It is sometimes observed that the best predictor of fire risk is the presence of firefighters— States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?

# High divorce rate introduced more signals to the datating pools, which could result in more marriage couples. In a multiple linear regression model, the two predictor variables are the divorce rate and the re-marriage rate (rate of non-first marriages). The response variable is the marriage rate. If the re-marriage rate is given and the divorce rate can not predict the marriage rate anymore, then we can conclude a high divorce rate cause a higher marriage rate.

5M4. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable.

library(rethinking) 
data(WaffleDivorce) 
d <- WaffleDivorce
d$LDS <- c(0.0077, 0.0458, 0.0600, 0.01047, 0.0191, 0.0261, 0.0045, 0.0058, 0.0045, 0.0075, 
           0.0082, 0.0530, 0.2586, 0.0045, 0.0068, 0.0090, 0.0132, 0.0080, 0.0064, 0.0082, 
           0.0072, 0.0041, 0.0045, 0.0059, 0.0073, 0.0118, 0.0473, 0.0130, 0.0065, 0.0038, 
           0.0331, 0.0043, 0.0085, 0.0152, 0.0054, 0.0124, 0.0364, 0.0041, 0.0040, 0.0080, 
           0.0120, 0.0077, 0.0125, 0.6632, 0.0074, 0.0113, 0.0380, 0.0096, 0.0047, 0.1170)
d$logLDS <- log(d$LDS)

d$logLDS.s <- (d$logLDS - mean(d$logLDS)) / sd(d$logLDS)
simplehist(d$LDS)

simplehist(d$logLDS)

simplehist(d$logLDS.s)

m <- map(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS.s,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##              mean        sd        5.5%      94.5%
## a     35.30676941 6.7586080 24.50520842 46.1083304
## bm     0.05650871 0.0829256 -0.07602241  0.1890398
## ba    -1.02701716 0.2237881 -1.38467378 -0.6693605
## bl    -0.61899279 0.2905462 -1.08334176 -0.1546438
## sigma  1.37694036 0.1382028  1.15606561  1.5978151

# From the regression results, the slope of median age marriage and log LDS are all negative, the slope of marriage is positive. The regression results suggests that the older the median age at marriage, the higher percentages of LDS, the lower the divorce rate.

5M5. One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.

#μ_i = α + βGGi + βEEi+βRRi
#Where G is the price of gasolie, E is an exercise variable, and R is a restaurant variable

Assignment #4

Xinrui Zhang

2020-06-14