This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.
Place each answer inside a code chunk (grey box). Each chunk should contain a text response or code that completes or answers the question or activity. Make sure to include plots when a question requests them.
Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
5-1. Which of the linear models below are multiple linear regressions? \[\begin{align} \mu_i &= \alpha + \beta x_i \tag{1}\\ \mu_i &= \beta_x x_i + \beta_z z_i \tag{2}\\ \mu_i &= \alpha + \beta(x_i - z_i) \tag{3}\\ \mu_i &= \alpha + \beta_x x_i + \beta_z z_i \tag{4} \end{align}\]
# Models 2, 3, and 4 all include more than one predictor variable. Models 2
# and 4 are clear multiple regressions, with a separate slope for each
# predictor. Model 3 also uses two predictors, but with a single slope on
# their difference (x_i - z_i), so it is a constrained special case
# (equivalent to setting beta_x = -beta_z). Model 1 has a single predictor,
# so it is a simple regression.
5-2. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.
# T_i ~ Normal(mu_i, sigma)
# mu_i = a + bf * F_i + bs * S_i
# where T_i is time to PhD degree, F_i is the amount of funding, S_i is the
# size of the laboratory, bf is the slope for funding, and bs is the slope
# for laboratory size.
# Because the claim is that, together, both variables are positively
# associated with time to degree, both slope parameters (bf and bs) should
# be on the positive side of zero. For neither variable to be a good
# predictor on its own, funding and laboratory size would also need to be
# negatively correlated with each other, masking their individual effects.
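# A minimal runnable sketch of this model in quap, using hypothetical
# simulated data (the names funding, size, and time_phd are invented for
# illustration; funding and size are drawn negatively correlated so that
# neither predicts well alone):
library(rethinking)
set.seed(5)
n <- 100
funding <- rnorm(n)                          # standardized funding
size <- rnorm(n, mean = -funding)            # lab size, negatively correlated with funding
time_phd <- rnorm(n, mean = funding + size)  # both effects positive by construction
d <- data.frame(funding, size, time_phd)
m_phd <- quap(
  alist(
    time_phd ~ dnorm(mu, sigma),
    mu <- a + bf * funding + bs * size,
    a ~ dnorm(0, 1),
    bf ~ dnorm(0, 1),
    bs ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ),
  data = d
)
precis(m_phd)  # expect both bf and bs to be positive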
5-3. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced). Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
# Example: hours worked per week, work output (assume higher is better),
# and job satisfaction. Hours drive output, and satisfaction is driven
# almost entirely by output (with only a small direct negative effect of
# hours), so the hours-satisfaction correlation should mostly vanish once
# output enters the model.
library(rethinking)  # quap, precis
library(GGally)      # ggpairs
library(corrplot)

standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

N <- 100
set.seed(1)
workHours <- rnorm(n = N, mean = 8, sd = 1)
workOutcome <- rnorm(n = N, mean = 10 * workHours, sd = 2)
satisfaction <- rnorm(n = N, mean = workOutcome - 0.5 * workHours, sd = 0.5)
data <- data.frame(workHours, workOutcome, satisfaction)
data_standard <- data.frame(apply(data, 2, standardize))
ggpairs(data_standard)
corrplot(cor(data_standard))
# Simple regression: satisfaction on work hours only
model1 <- quap(
  alist(
    satisfaction ~ dnorm(mu, sigma),
    mu <- a + bhours * workHours,
    a ~ dnorm(0, 10),
    bhours ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = data_standard
)
precis(model1)
## mean sd 5.5% 94.5%
## a 6.905001e-08 0.02225000 -0.03555973 0.03555987
## bhours 9.746722e-01 0.02236209 0.93893326 1.01041113
## sigma 2.225005e-01 0.01573212 0.19735757 0.24764352
# Multiple regression: satisfaction on both work outcome and hours
model2 <- quap(
  alist(
    satisfaction ~ dnorm(mu, sigma),
    mu <- a + bworkOutcome * workOutcome + bhours * workHours,
    a ~ dnorm(0, 10),
    bworkOutcome ~ dnorm(0, 10),
    bhours ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = data_standard
)
precis(model2)
## mean sd 5.5% 94.5%
## a 3.558264e-06 0.005864839 -0.009369587 0.009376704
## bworkOutcome 1.033876e+00 0.028251383 0.988724853 1.079027186
## bhours -3.644775e-02 0.028251383 -0.081598913 0.008703420
## sigma 5.864840e-02 0.004142581 0.052027754 0.065269043
# Comment: In the multiple regression of satisfaction on both hours and
# outcome, the coefficient for hours shrinks to roughly zero (-0.04),
# whereas in the single-predictor model it was large (0.97). Once work
# outcome is known, there is little added value in knowing hours: the
# hours-satisfaction correlation is (mostly) spurious, carried through
# outcome.
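# One way to plot the results: rethinking's coeftab collects the estimates
# from both models, and its plot method shows the bhours slope collapsing
# toward zero once workOutcome is included.
plot(coeftab(model1, model2), pars = c("bhours", "bworkOutcome"))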
5-4. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another. Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
# Example: body weight, hours of exercise per week, and age. More exercise
# lowers body weight (negative relationship), greater age raises it
# (positive relationship), and age is inversely related to the amount of
# exercise, so the two predictors are correlated with one another.
N <- 100
set.seed(1)
maximum_exercise <- 70
body_const <- 160
age <- rnorm(n = N, mean = 25, sd = 5)
exercise <- rnorm(n = N, mean = maximum_exercise - 1.5 * age, sd = 3)
bodyweight <- rnorm(n = N, mean = body_const + age - exercise, sd = 0.5)
data <- data.frame(age, exercise, bodyweight)
data_standard <- data.frame(apply(data, 2, standardize))
ggpairs(data_standard)
corrplot(cor(data_standard))
# Simple regression: body weight on age only
model1 <- quap(
  alist(
    bodyweight ~ dnorm(mu, sigma),
    mu <- a + bage * age,
    a ~ dnorm(0, 10),
    bage ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = data_standard
)
precis(model1)
## mean sd 5.5% 94.5%
## a 1.873398e-06 0.02521776 -0.04030097 0.04030472
## bage 9.673383e-01 0.02534480 0.92683246 1.00784422
## sigma 2.521784e-01 0.01783049 0.22368179 0.28067494
# Multiple regression: body weight on both age and exercise
model2 <- quap(
  alist(
    bodyweight ~ dnorm(mu, sigma),
    mu <- a + bage * age + bexercise * exercise,
    a ~ dnorm(0, 10),
    bage ~ dnorm(0, 10),
    bexercise ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = data_standard
)
precis(model2)
## mean sd 5.5% 94.5%
## a 8.819245e-08 0.004424042 -0.007070385 0.007070561
## bage 3.821645e-01 0.011336060 0.364047239 0.400281665
## bexercise -6.361616e-01 0.011336060 -0.654278816 -0.618044390
## sigma 4.424042e-02 0.003123061 0.039249166 0.049231676
# Comment: In the multiple regression of body weight on age and exercise,
# the coefficient for age drops sharply (from 0.97 to 0.38) once exercise
# is included, because age and exercise are strongly negatively correlated
# and pull body weight in opposite directions. The correlation plot
# confirms the positive body weight-age relationship and the negative body
# weight-exercise relationship.
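# Plotting the results with coeftab again: the age slope shrinks, and the
# exercise slope appears only in the multiple regression.
plot(coeftab(model1, model2), pars = c("bage", "bexercise"))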
5-5. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the ‘quap’ model, the ‘precis’ table and your explanation of the results.
# Use the percentage of LDS members per state (compiled from Wikipedia by
# Jeffrey Girard). The raw percentages are strongly right-skewed, so a log
# transformation is applied before standardizing.
data("WaffleDivorce")
data <- WaffleDivorce
data$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)
data$logLDS <- log(data$LDS)
data$logLDS_std <- (data$logLDS - mean(data$logLDS)) / sd(data$logLDS)
hist(data$LDS)
hist(data$logLDS_std)
m <- quap(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS_std,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = data
)
precis(m)
## mean sd 5.5% 94.5%
## a 35.43697207 6.77530618 24.60872422 46.2652199
## bm 0.05341211 0.08261842 -0.07862807 0.1854523
## ba -1.02962005 0.22469474 -1.38872563 -0.6705145
## bl -0.60808155 0.29057042 -1.07246920 -0.1436939
## sigma 1.37872543 0.13838905 1.15755301 1.5998979
# From the precis table, both median age at marriage (ba = -1.03) and
# percent LDS (bl = -0.61) are negatively associated with divorce rate,
# with intervals entirely below zero. Marriage rate (bm = 0.05) is not a
# reliable predictor once the other two are known, since its interval
# straddles zero (5.5% < 0 and 94.5% > 0).
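# Plotting the precis output (precis objects have a plot method in
# rethinking) shows bm straddling zero while ba and bl sit below it:
plot(precis(m))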
5-6. In the divorce example, suppose the DAG is: M → A → D. What are the implied conditional independencies of the graph? Are the data consistent with it? (Hint: use the dagitty package)
library(dagitty)
dag <- dagitty("dag{ M -> A -> D }")
impliedConditionalIndependencies(dag)
## D _||_ M | A
# Comment: The implied conditional independency D _||_ M | A means the
# divorce rate is independent of the marriage rate once we condition on
# median age at marriage. We can check whether the data are consistent
# with this by regressing divorce on both marriage rate and age.
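# A quick sketch to visualize the graph (dagitty's graphLayout assigns
# plotting coordinates automatically):
plot(graphLayout(dag))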
data("WaffleDivorce")
data <- WaffleDivorce
m <- quap(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = data
)
precis(m)
## mean sd 5.5% 94.5%
## a 33.5974936 6.99727425 22.4144979 44.78048927
## bm -0.0286839 0.07516336 -0.1488095 0.09144166
## ba -0.8956346 0.22494670 -1.2551429 -0.53612637
## sigma 1.4409983 0.14460480 1.2098919 1.67210474
# From the result, the coefficient for marriage rate (bm = -0.03) is close
# to zero and its interval straddles zero, while the coefficient for median
# age at marriage (ba = -0.90) is reliably negative. Once we condition on
# age, divorce rate is nearly independent of marriage rate, so the data are
# consistent with the implied conditional independency D _||_ M | A.
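# As an alternative check, dagitty's localTests can test the implied
# conditional independency directly via partial correlations (this assumes
# a data frame with columns named to match the DAG nodes):
dag_data <- data.frame(D = data$Divorce, M = data$Marriage, A = data$MedianAgeMarriage)
localTests(dag, data = dag_data, type = "cis")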