Chapter 5 - Many Variables and Spurious Waffles

This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

5-1. Which of the linear models below are multiple linear regressions? \[\begin{align} {μ_i = α + βx_i} \tag{1}\\ μ_i = β_xx_i + β_zz_i \tag{2} \\ μ_i = α + β(x_i − z_i) \tag{3} \\ μ_i = α + β_xx_i + β_zz_i \tag{4} \\ \end{align}\]

# Tags 2, 3 and 4 are multiple linear regressions.

5-2. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

# μi = α + βffi + βssi
# f is the amount of funding and s is the size of laboratory.

5-3. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced). Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.

#As seen below, we use an example of a spurious correlation between the score on a standardized achievement test and study hours put in that vanishes when happiness is included.
N <- 100
set.seed(111)
studyhours <- rnorm(n = N, mean = 6, sd = 1)
scores <- rnorm(n = N, mean = 12 * studyhours, sd = sqrt(1 - 0.3^2))
happiness <- rnorm(n = N, mean = scores-1.5*studyhours, sd = 10)
m1 <- data.frame(studyhours, scores, happiness)
pairs(m1)

# So there is some correlation between study hours and scores.

m1_1 <- lm(scores~happiness, m1)
precis(m1_1)
##                   mean         sd       5.5%      94.5%
## (Intercept) 27.3110058 3.96766494 20.9699109 33.6521007
## happiness    0.6933016 0.06038104  0.5968011  0.7898022
m1_2 <- lm(scores ~ happiness + studyhours, m1)
precis(m1_2)
##                    mean         sd         5.5%       94.5%
## (Intercept)  0.21712387 0.55722437 -0.673428284  1.10767603
## happiness    0.01430498 0.01058093 -0.002605385  0.03121535
## studyhours  11.80762534 0.13850774 11.586263215 12.02898746
#In this model, happiness predicts scores when the study hours are already known but the study hours do not predict scores when happiness is already known, which means the association  between study hours and scores is spurious.

5-4. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another. Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.

#As shown below, we use an example of a masked relationship involving the prediction of happiness ratings from the amount of alcohol one drinks and the amount of time one spends feeling ill.
N <- 100
rho <- 0.6
alcohol <- rnorm(n = N, mean = 0, sd = 1)
illness <- rnorm(n = N, mean = rho * alcohol, sd = sqrt(1 - rho^2))
happiness <- rnorm(n = N, mean = alcohol - illness, sd = 1)
d <- data.frame(happiness, alcohol, illness)
pairs(d)

## Due to the masking relationship, the bivariate associations should be weak (widely variable).


m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * alcohol,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##            mean         sd        5.5%     94.5%
## a     0.1926478 0.12718823 -0.01062351 0.3959192
## ba    0.4676526 0.13334136  0.25454731 0.6807578
## sigma 1.2719203 0.08993815  1.12818174 1.4156588
m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + bi * illness,
    a ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##             mean         sd        5.5%      94.5%
## a      0.1372445 0.12724736 -0.06612138  0.3406103
## bi    -0.4689686 0.12402574 -0.66718574 -0.2707516
## sigma  1.2607611 0.08914897  1.11828379  1.4032383
# But in the multivariate regression, the masking relationship should be resolved.

m <- map(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * alcohol + bi * illness,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##              mean         sd       5.5%      94.5%
## a      0.02972838 0.09319682 -0.1192181  0.1786749
## ba     1.09902428 0.11632878  0.9131084  1.2849401
## bi    -1.05058667 0.10915690 -1.2250405 -0.8761329
## sigma  0.91633423 0.06480317  0.8127662  1.0199022
# From the result, the slopes for alcohol and illness became much larger in the multivariate model as the alcohol increases happiness while feelings of illness decrease happiness, the bi variate relationships of alcohol and feelings of illness to happiness are masked.

5-5. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the ‘quap’ model, the ‘precis’ table and your explanation of the results.

data(WaffleDivorce)
d <- WaffleDivorce
d$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)
d$logLDS <- log(d$LDS)
d$logLDS.s <- (d$logLDS - mean(d$logLDS)) / sd(d$logLDS)
hist(d$LDS)

hist(d$logLDS)

hist(d$logLDS.s)

m <- map(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS.s,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##              mean        sd        5.5%      94.5%
## a     35.44257647 6.7751582 24.61456518 46.2705878
## bm     0.05332568 0.0826167 -0.07871177  0.1853631
## ba    -1.02976930 0.2246905 -1.38886810 -0.6706705
## bl    -0.60793535 0.2905677 -1.07231863 -0.1435521
## sigma  1.37871096 0.1383851  1.15754477  1.5998771
#The results show that the slopes for median age at marriage and percentage of LDS population are negative and, therefore, are negatively related to divorce rates. The marriage rate isn’t a good predictor for the divorce rate and states with an older median age of marriage had lower divorce rates.

5-6. In the divorce example, suppose the DAG is: M → A → D. What are the implied conditional independencies of the graph? Are the data consistent with it? (Hint: use the dagitty package)

library(dagitty)
maddog = dagitty("dag{M -> A -> D}")
impliedConditionalIndependencies(maddog)
## D _||_ M | A
equivalentDAGs(maddog)
## [[1]]
## dag {
## A
## D
## M
## A -> D
## M -> A
## }
## 
## [[2]]
## dag {
## A
## D
## M
## A -> D
## A -> M
## }
## 
## [[3]]
## dag {
## A
## D
## M
## A -> M
## D -> A
## }
#From the result, we can see that DAG implies the divorce rate (D) is independent of marriage rate which is conditional on the median age of marriage.