Assignment #4

Questions

5-1. Which of the linear models below are multiple linear regressions? \[\begin{align} μ_i &= α + β x_i \tag{1} \\ μ_i &= β_x x_i + β_z z_i \tag{2} \\ μ_i &= α + β(x_i − z_i) \tag{3} \\ μ_i &= α + β_x x_i + β_z z_i \tag{4} \end{align}\]

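# Models (2) and (4) are multiple linear regressions: each has two predictors, x and z, with a separate slope for each. Model (1) has a single predictor, and model (3) has a single slope on the difference x − z.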

5-2. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

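# Answer: T_i = time to PhD degree, F_i = amount of funding, S_i = size of laboratory
# T_i ~ Normal(mu_i, sigma)
# (1) mu_i = a + bF * F_i
# (2) mu_i = a + bS * S_i
# (3) mu_i = a + bF * F_i + bS * S_i

# In models (1) and (2), the slopes bF and bS should be close to zero. In model (3), both slopes should be positive (bF > 0 and bS > 0). This pattern is possible when funding and laboratory size are negatively correlated with each other, so that each predictor masks the other in the single-predictor models.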
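The claim in 5-2 can be illustrated with a small simulation. This is only a sketch with made-up numbers: the variable names, the effect sizes of 1, and the correlation of -0.9 between funding and laboratory size are all assumptions chosen to make the pattern visible, and base R `lm` is used instead of `quap` so the block runs without extra packages.

```r
# Hypothetical simulation for 5-2: two predictors that are negatively
# correlated with each other and both positively related to the outcome.
set.seed(5)
n <- 2000
funding  <- rnorm(n)                                            # F (standardized)
lab_size <- rnorm(n, mean = -0.9 * funding, sd = sqrt(1 - 0.9^2))  # S, correlation with F about -0.9
time_to_phd <- rnorm(n, mean = funding + lab_size, sd = 1)      # T depends positively on both

# Single-predictor models (1) and (2): slopes are expected to be near zero
b_f_alone <- coef(lm(time_to_phd ~ funding))["funding"]
b_s_alone <- coef(lm(time_to_phd ~ lab_size))["lab_size"]

# Two-predictor model (3): both slopes are expected to be clearly positive
b_joint <- coef(lm(time_to_phd ~ funding + lab_size))

print(round(c(F_alone = unname(b_f_alone), S_alone = unname(b_s_alone),
              F_joint = unname(b_joint["funding"]),
              S_joint = unname(b_joint["lab_size"])), 2))
```

With these settings the single-predictor slopes should land near zero and the joint slopes near 1, which is the pattern the question describes.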

5-3. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced). Plot the correlation before analysis, designate and run the 'quap' model, show the 'precis' table, plot the results and explain.

library(rethinking)  # provides precis(), quap() and standardize()

N <- 100
set.seed(111)
studyhours <- rnorm(n = N, mean = 6, sd = 1)
scores <- rnorm(n = N, mean = 12 * studyhours, sd = sqrt(1 - 0.3^2))
happiness <- rnorm(n = N, mean = scores-1.5*studyhours, sd = 10)
m1 <- data.frame(studyhours, scores, happiness)
pairs(m1)

# Bivariate model: scores on happiness alone
m1_1 <- lm(scores ~ happiness, data = m1)
precis(m1_1)

##                   mean         sd       5.5%      94.5%
## (Intercept) 27.3110058 3.96766494 20.9699109 33.6521007
## happiness    0.6933016 0.06038104  0.5968011  0.7898022

# Multiple regression: scores on happiness and study hours together
m1_2 <- lm(scores ~ happiness + studyhours, data = m1)
precis(m1_2)

##                    mean         sd         5.5%       94.5%
## (Intercept)  0.21712387 0.55722437 -0.673428284  1.10767603
## happiness    0.01430498 0.01058093 -0.002605385  0.03121535
## studyhours  11.80762534 0.13850774 11.586263215 12.02898746

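# On its own, happiness appears to predict scores (slope about 0.69, interval 0.60 to 0.79). Once study hours is added, the happiness slope shrinks to about 0.01 with an interval that includes zero, while study hours keeps a large slope (about 11.8). Happiness therefore adds almost nothing once study hours is known, which means the bivariate association between happiness and scores is spurious.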
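The question also asks for a plot of the results. One option is a predictor residual plot: remove from happiness the part explained by study hours, then plot scores against what is left. The sketch below regenerates data with the same recipe as above and uses base R only; the names `happiness_resid`, `b_resid` and `b_multi` are introduced here for illustration.

```r
# Predictor residual check for the spurious-correlation example.
set.seed(111)
N <- 100
studyhours <- rnorm(N, mean = 6, sd = 1)
scores     <- rnorm(N, mean = 12 * studyhours, sd = sqrt(1 - 0.3^2))
happiness  <- rnorm(N, mean = scores - 1.5 * studyhours, sd = 10)

# Part of happiness that study hours does not explain
happiness_resid <- resid(lm(happiness ~ studyhours))

# Slope of scores on residual happiness ...
b_resid <- coef(lm(scores ~ happiness_resid))["happiness_resid"]
# ... equals the happiness slope from the multiple regression
b_multi <- coef(lm(scores ~ happiness + studyhours))["happiness"]

plot(happiness_resid, scores,
     xlab = "happiness residual (given study hours)", ylab = "scores")
abline(lm(scores ~ happiness_resid))
```

A nearly flat line in this plot is the graphical counterpart of the near-zero happiness slope in the two-predictor model.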

5-4. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another. Plot the correlation before analysis, designate and run the 'quap' model, show the 'precis' table, plot the results and explain.

N <- 100
rho <- 0.6
alcohol <- rnorm(n = N, mean = 0, sd = 1)
illness <- rnorm(n = N, mean = rho * alcohol, sd = sqrt(1 - rho^2))
happiness <- rnorm(n = N, mean = alcohol - illness, sd = 1)
d <- data.frame(happiness, alcohol, illness)
pairs(d)

# Model 1: happiness on alcohol only
m <- quap(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * alcohol,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##            mean         sd        5.5%     94.5%
## a     0.1926478 0.12718823 -0.01062351 0.3959192
## ba    0.4676526 0.13334136  0.25454731 0.6807578
## sigma 1.2719203 0.08993815  1.12818174 1.4156588

# Model 2: happiness on illness only
m <- quap(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + bi * illness,
    a ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##             mean         sd        5.5%      94.5%
## a      0.1372445 0.12724736 -0.06612138  0.3406103
## bi    -0.4689686 0.12402574 -0.66718574 -0.2707516
## sigma  1.2607611 0.08914897  1.11828379  1.4032383

# Model 3: happiness on alcohol and illness together
m <- quap(
  alist(
    happiness ~ dnorm(mu, sigma),
    mu <- a + ba * alcohol + bi * illness,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)

##              mean         sd       5.5%      94.5%
## a      0.02972838 0.09319682 -0.1192181  0.1786749
## ba     1.09902428 0.11632878  0.9131084  1.2849401
## bi    -1.05058667 0.10915690 -1.2250405 -0.8761329
## sigma  0.91633423 0.06480317  0.8127662  1.0199022

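# Alcohol is positively associated with happiness and illness is negatively associated with it, while the two predictors are positively correlated with each other (rho = 0.6). In the bivariate models the slopes are only about 0.47 and -0.47. With both predictors in the model they grow to about 1.10 and -1.05, close to the simulated effects of 1 and -1. Each predictor partly cancels the other, so the bivariate fits mask the strength of both relationships.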
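The size of the bivariate slopes follows from the simulation recipe. As a rough sketch, write H for happiness, A for alcohol and I for illness, and treat sample quantities as their population values, so that Var(A) = 1, Var(I) = ρ² + (1 − ρ²) = 1 and Cov(A, I) = ρ = 0.6. Since H = A − I + noise, the expected slope in each single-predictor model is:

```latex
\[\begin{align}
\frac{\operatorname{Cov}(H, A)}{\operatorname{Var}(A)} &= \operatorname{Var}(A) - \operatorname{Cov}(I, A) = 1 - ρ = 0.4 \\
\frac{\operatorname{Cov}(H, I)}{\operatorname{Var}(I)} &= \operatorname{Cov}(A, I) - \operatorname{Var}(I) = ρ - 1 = -0.4
\end{align}\]
```

The fitted values of about 0.47 and -0.47 are within sampling noise of these expectations for N = 100.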

5-5. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the 'quap' model, the 'precis' table and your explanation of the results.

data(WaffleDivorce)
d <- WaffleDivorce

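# LDS share of the population by state (as a proportion), in WaffleDivorce row order; Nevada is omitted because it is not in the data set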
d$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)

# standardize variables
d$Marriage.standardized <- (d$Marriage - mean(d$Marriage)) / sd(d$Marriage)
d$MedianAgeMarriage.standardized <- (d$MedianAgeMarriage - mean(d$MedianAgeMarriage)) / sd(d$MedianAgeMarriage)
d$LDS.standardized <- (d$LDS - mean(d$LDS)) / sd(d$LDS)

model <- quap(
  alist(
    Divorce ~ dnorm(mean = mu, sd = sigma),
    mu <- a + b.marriage.rate * Marriage.standardized + b.median.age.at.marriage * MedianAgeMarriage.standardized + b.lds * LDS.standardized,
    a ~ dnorm(mean = 0, sd = 100),
    c(b.marriage.rate, b.median.age.at.marriage, b.lds) ~ dnorm(mean = 0, sd = 10),
    sigma ~ dunif(min = 0, max = 10)
  ),
  data = d
)

precis(model)

##                                  mean        sd       5.5%      94.5%
## a                         9.688020167 0.1893751  9.3853621  9.9906782
## b.marriage.rate           0.009430373 0.2876166 -0.4502366  0.4690973
## b.median.age.at.marriage -1.375256421 0.2801220 -1.8229455 -0.9275673
## b.lds                    -0.624601966 0.2260713 -0.9859076 -0.2632964
## sigma                     1.339086755 0.1338640  1.1251462  1.5530274

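# The slope for percent LDS is negative (about -0.62, with an 89% interval from -0.99 to -0.26 that lies entirely below zero), so states with a larger LDS population are associated with lower divorce rates even after accounting for marriage rate and median age at marriage. Median age at marriage also has a clearly negative slope (about -1.38), while the slope for marriage rate is close to zero.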
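The question suggests considering a transformation of the raw percent LDS variable, because a few states have far larger values than the rest. The sketch below uses the same LDS values as above and compares the skew before and after a log transform. The `skewness` helper is a simple moment-based measure written here for illustration, not a function from any package, and the resulting `lds_log_std` is just one possible predictor to try in place of `LDS.standardized`.

```r
# Same LDS proportions as in the model above
lds <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057,
         0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090,
         0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059,
         0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041,
         0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081,
         0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093,
         0.0046, 0.1161)

# Hypothetical helper: moment-based skewness (0 for a symmetric variable)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Log-transform, then standardize to mean 0 and sd 1
log_lds     <- log(lds)
lds_log_std <- (log_lds - mean(log_lds)) / sd(log_lds)

skew_raw <- skewness(lds)
skew_log <- skewness(log_lds)
print(round(c(raw = skew_raw, log = skew_log), 2))
```

If the log scale is less skewed, the extreme states have less leverage on the estimated slope, which is the usual reason for preferring it.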

5-6. In the divorce example, suppose the DAG is: M → A → D. What are the implied conditional independencies of the graph? Are the data consistent with it? (Hint: use the dagitty package)

library(dagitty)
data(WaffleDivorce)
d <- WaffleDivorce
d$D <- standardize(d$Divorce)
d$M <- standardize(d$Marriage)
d$A <- standardize(d$MedianAgeMarriage)
D <- d$D
M <- d$M
A <- d$A

dag <- dagitty("dag{ M -> A -> D }")
impliedConditionalIndependencies(dag)

## D _||_ M | A

# What are the implied conditional independencies of the graph?
# D _||_ M | A
# D is independent of M, conditional on A

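# Are the data consistent with it?
# Yes. The implied independence D _||_ M | A says that marriage rate should carry little extra information about divorce once median age at marriage is known, so the slope on M should be near zero in a model that also includes A. The model in 5-5 agrees: b.marriage.rate is about 0.009 with an 89% interval from -0.45 to 0.47 (that model also conditions on LDS). This shows the data are consistent with the DAG, not that the DAG is the only one that fits.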
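The logic of the check can be sketched on simulated data. The block below generates data from a chain M → A → D with made-up coefficients (-0.7 and -0.6 are assumptions, not estimates from WaffleDivorce) and uses base R `lm` to compare the slope on M with and without A in the model.

```r
# Simulate from the chain M -> A -> D (hypothetical coefficients)
set.seed(56)
n <- 2000
M <- rnorm(n)
A <- rnorm(n, mean = -0.7 * M, sd = sqrt(1 - 0.7^2))
D <- rnorm(n, mean = -0.6 * A, sd = 0.8)

# Without A, M is associated with D through the chain
b_m_alone <- coef(lm(D ~ M))["M"]

# Conditioning on A should remove that association: D _||_ M | A
b_m_given_a <- coef(lm(D ~ A + M))["M"]

print(round(c(M_alone = unname(b_m_alone), M_given_A = unname(b_m_given_a)), 2))
```

The same comparison on the standardized WaffleDivorce variables is what the answer above relies on: a slope on M that collapses toward zero once A is included.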


Nischal Bondalapati

2022-04-25

Chapter 5 - Many Variables and Spurious Waffles
