5E1. Which of the linear models below are multiple linear regressions? \[\begin{align} \mu_i &= \alpha + \beta x_i \tag{1} \\ \mu_i &= \beta_x x_i + \beta_z z_i \tag{2} \\ \mu_i &= \beta_x x_i + \beta_z z_i \tag{3} \\ \mu_i &= \alpha + \beta(x_i - z_i) \tag{4} \\ \mu_i &= \alpha + \beta_x x_i + \beta_z z_i \tag{5} \\ \end{align}\]

# μ_i = β_x x_i + β_z z_i
# μ_i = α + β(x_i − z_i)
# μ_i = α + β_x x_i + β_z z_i
# The first model is a simple (single-predictor) linear regression. Models 2 and 3 are identical, and together with models 4 and 5 they are all multiple linear regressions, since each uses both x and z as predictors. Model 4 counts too, as shown below.
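# To see why model 4 is still a multiple regression, expand the difference: it is just model 5 with the constraint β_z = −β_x.

\[\begin{align} \mu_i = \alpha + \beta(x_i - z_i) = \alpha + \beta x_i + (-\beta) z_i \\ \end{align}\]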

5E2. Write down a multiple regression to evaluate the claim: Animal diversity is linearly related to latitude, but only after controlling for plant diversity. You just need to write down the model definition. \[\begin{align} \mu_i = \alpha + \beta_L L_i + \beta_P P_i \\ \end{align}\]

# Animal diversity is the outcome here, so μ_i is the expected animal diversity for case i. Latitude L is the predictor of interest and plant diversity P is the control variable; "controlling for" P simply means including it alongside L in the same model.
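# Written out in full, with A for animal diversity (the priors below are illustrative assumptions, not part of the exercise):

\[\begin{align} A_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_L L_i + \beta_P P_i \\ \alpha &\sim \mathrm{Normal}(0, 10) \\ \beta_L, \beta_P &\sim \mathrm{Normal}(0, 1) \\ \sigma &\sim \mathrm{Uniform}(0, 10) \\ \end{align}\]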

5E3. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on. \[\begin{align} \mu_i = \alpha + \beta_F F_i + \beta_S S_i \\ \end{align}\]

# Both slope parameters β_F and β_S should be on the positive side of zero: jointly, funding (F) and laboratory size (S) are positively associated with time to degree. For neither to predict well on its own, the two predictors should be negatively correlated with one another, so that each masks the other in a bivariate regression; the simulation sketch below illustrates this.
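# A minimal simulation sketch of this masking pattern (variable names and scales are illustrative assumptions): funding and lab size are negatively correlated, both lengthen time to degree, yet each alone correlates only weakly with the outcome.
set.seed(11)
N <- 500
funding <- rnorm(n = N)
labsize <- rnorm(n = N, mean = -0.7 * funding, sd = sqrt(1 - 0.7^2))  # negatively correlated with funding
years <- rnorm(n = N, mean = funding + labsize, sd = 1)               # both effects positive
round(cor(data.frame(years, funding, labsize)), 2)                    # weak bivariate correlations with years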

5E4. Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D. Let A_i be an indicator variable that is 1 where case i is in category A. Also suppose B_i, C_i, and D_i for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model. \[\begin{align} \mu_i &= \alpha + \beta_A A_i + \beta_B B_i + \beta_D D_i \tag{1} \\ \mu_i &= \alpha + \beta_A A_i + \beta_B B_i + \beta_C C_i + \beta_D D_i \tag{2} \\ \mu_i &= \alpha + \beta_B B_i + \beta_C C_i + \beta_D D_i \tag{3} \\ \mu_i &= \alpha_A A_i + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i \tag{4} \\ \mu_i &= \alpha_A (1 - B_i - C_i - D_i) + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i \tag{5} \\ \end{align}\]

# The first model includes a single intercept (the mean for category C) and slopes for A, B, and D.
# The second model is non-identifiable because it includes both an intercept and a slope for every possible category (page 156).
# The third model includes a single intercept (the mean for category A) and slopes for B, C, and D.
# The fourth model uses the unique index approach to provide a separate intercept for each category (and no slopes).
# The fifth model uses the reparameterized approach on pages 154 and 155: the term (1 − B_i − C_i − D_i) equals 1 when case i is in category A and 0 otherwise, so it is the same as model 4.
# Therefore models 1, 3, 4, and 5 are inferentially equivalent; model 2 is not identifiable.
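# For example, the posterior of model 4 can be computed from the posterior of model 1 (and vice versa), because the category means map onto each other as:

\[\begin{align} \alpha_A = \alpha + \beta_A, \quad \alpha_B = \alpha + \beta_B, \quad \alpha_C = \alpha, \quad \alpha_D = \alpha + \beta_D \\ \end{align}\]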

5M1. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).

# I will simulate a spurious association between ice cream consumption and suicide rate, with income as the common cause of both.


library(rethinking)

set.seed(103)
N <- 100
income <- rnorm(n = N, mean = 1, sd = 1)        # common cause
suicide <- rnorm(n = N, mean = income, sd = 2)  # driven by income
icecream <- rnorm(n = N, mean = income, sd = 1) # also driven by income
d <- data.frame(icecream, suicide, income)
pairs(d)

# Bivariate model: ice cream regressed on suicide alone
m <- map(
  alist(
    icecream ~ dnorm(mu, sigma),
    mu <- a + bo * suicide,
    a ~ dnorm(0, 5),
    bo ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##            mean         sd      5.5%    94.5%
## a     0.8313052 0.16114293 0.5737677 1.088843
## bo    0.1762579 0.07009677 0.0642297 0.288286
## sigma 1.3681757 0.09674388 1.2135603 1.522791
# Multiple regression: add income, the common cause
m <- map(
  alist(
    icecream ~ dnorm(mu, sigma),
    mu <- a + bo * suicide + bi * income,
    a ~ dnorm(0, 5),
    bo ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##              mean         sd        5.5%     94.5%
## a     -0.06137155 0.14520833 -0.29344250 0.1706994
## bo     0.01810285 0.05214602 -0.06523656 0.1014423
## bi     1.07564710 0.10801033  0.90302574 1.2482685
## sigma  0.96939318 0.06854416  0.85984638 1.0789400
# In the bivariate model, suicide appears to predict ice cream consumption (bo is reliably positive). Once income is included, the coefficient on suicide collapses to roughly zero, while income (bi) remains a strong predictor. In other words, suicide rate tells us nothing more about ice cream consumption once income is known, so the bivariate association between them is spurious: both variables are driven by income.
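# A quick way to verify the setup is to look at the pairwise correlations; icecream should correlate with both suicide and income, as the exercise requires:
round(cor(d), 2)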

5M2. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.

# I will simulate a masked relationship between happiness (outcome), health, and milk tea consumption: health and milk tea are positively correlated with each other, health increases happiness, and milk tea decreases it, so in bivariate regressions the two effects hide one another.

set.seed(105)
N <- 1000
rho <- 0.6
milktea <- rnorm(n = N, mean = 10, sd = 1)
health <- rnorm(n = N, mean = rho * milktea, sd = sqrt(1 - rho^2))  # cor(milktea, health) is near rho
happiness <- rnorm(n = N, mean = health - milktea, sd = 1)          # opposite-sign effects on the outcome
d <- data.frame(happiness, milktea, health)
pairs(d)

# Bivariate model: happiness on milk tea alone
m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * milktea,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
# (With this data-generating process the bivariate slope on milk tea, ba, should come out negative but strongly attenuated, near -0.4, because the positive effect of the correlated health variable masks it.)
# Bivariate model: happiness on health alone
m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + bi * health,
    a ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
# (Here the bivariate slope on health, bi, should be positive but similarly attenuated, near +0.4.)
# Multiple regression: both predictors together unmask the two effects
m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * milktea + bi * health,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
# (With both predictors in the model, the slopes should move out toward their true values of -1 for milk tea and +1 for health: each association is revealed once the other, masking variable is controlled for. That pattern is the signature of a masked relationship.)

5M3. It is sometimes observed that the best predictor of fire risk is the presence of firefighters— States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?

# A high divorce rate could cause a higher marriage rate through remarriage: every divorce returns two people to the pool of marriageable singles, and the marriage rate counts remarriages just like first marriages. So States with more divorce may simply have more people available to marry again.
# To evaluate this with multiple regression, set marriage rate as the outcome and divorce rate as the predictor of interest, then add the remarriage rate (or the proportion of marriages that are remarriages) as a second predictor. If the association between divorce rate and marriage rate shrinks toward zero once remarriage is controlled for, the remarriage mechanism explains the reversed causal relationship.
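# In model form, with M = marriage rate, D = divorce rate, and R = remarriage rate (R is a variable we would need to collect; it is not in the WaffleDivorce data):

\[\begin{align} M_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_D D_i + \beta_R R_i \\ \end{align}\]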

5M4. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable.

data("WaffleDivorce")
d <- WaffleDivorce
# proportion of LDS members in each State (same row order as WaffleDivorce)
d$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)
# the raw proportions are strongly right-skewed, so log-transform and then standardize
d$logLDS <- log(d$LDS)
d$logLDS.s <- (d$logLDS - mean(d$logLDS)) / sd(d$logLDS)
simplehist(d$LDS)      # raw proportions: strongly right-skewed

simplehist(d$logLDS)   # log scale reduces the skew

simplehist(d$logLDS.s) # standardized log scale, used in the model

# Divorce rate on marriage rate, median age at marriage, and standardized log-LDS
m <- map(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS.s,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##              mean         sd        5.5%      94.5%
## a     35.43683047 6.77531079 24.60857525 46.2650857
## bm     0.05341314 0.08261845 -0.07862709  0.1854534
## ba    -1.02961541 0.22469488 -1.38872123 -0.6705096
## bl    -0.60807816 0.29057047 -1.07246588 -0.1436904
## sigma  1.37872562 0.13838910  1.15755311  1.5998981
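# The slope on standardized log-LDS (bl) is reliably negative (its 89% interval lies entirely below zero), so States with a larger LDS share have lower divorce rates even after accounting for marriage rate and median age at marriage. Median age at marriage (ba) remains the strongest predictor, while marriage rate (bm) straddles zero.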

5M5. One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.

# Both mechanisms run from gas price to obesity through driving: the first through exercise, the second through restaurant visits. To test them, regress obesity on the price of gas together with the two mediating variables. Let G be the price of gas, E the amount of exercise, and R the number of restaurant visits; the model below includes all three. If the mechanisms are real, the coefficient on G should shrink toward zero once E and R are included, with β_E negative (more exercise, less obesity) and β_R positive (more restaurant meals, more obesity).

\[\begin{align} \mu_i = \alpha + \beta_G G_i + \beta_E E_i + \beta_R R_i \\ \end{align}\]
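# A minimal map() sketch of this model, assuming we had collected a data frame d2 with columns obesity, gas, exercise, and restaurant (these variable names, and the priors, are hypothetical):
m5 <- map(
  alist(
    obesity ~ dnorm(mu, sigma),
    mu <- a + bg * gas + be * exercise + br * restaurant,
    a ~ dnorm(0, 10),
    bg ~ dnorm(0, 10),
    be ~ dnorm(0, 10),
    br ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = d2
)
precis(m5)
# Compare against a gas-price-only model: if bg shrinks once the mediators enter, the two mechanisms are supported.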