This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.
Place each answer inside a code chunk (grey box). Each chunk should contain a text response or code that completes or answers the question or activity. Make sure to include plots when a question requests them.
Finally, upon completion, name your final output .html file YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link to Canvas. Each question is worth 5 points.
5-1. Which of the linear models below are multiple linear regressions? \[\begin{align} \mu_i &= \alpha + \beta x_i \tag{1}\\ \mu_i &= \beta_x x_i + \beta_z z_i \tag{2}\\ \mu_i &= \alpha + \beta(x_i - z_i) \tag{3}\\ \mu_i &= \alpha + \beta_x x_i + \beta_z z_i \tag{4} \end{align}\]
# Models 2, 3, and 4 all include more than one predictor variable. Models 2
# and 4 are clear multiple regressions, with a separate slope for each
# predictor. Model 3 also uses two predictors, but with a single slope on
# their difference (x_i - z_i), so it is a constrained special case
# (equivalent to setting beta_x = -beta_z). Model 1 has a single predictor,
# so it is a simple regression.
5-2. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.
# T_i ~ Normal(mu_i, sigma)
# mu_i = a + bf * F_i + bs * S_i
# where T_i is time to PhD degree, F_i is the amount of funding, S_i is the
# size of the laboratory, bf is the slope for funding, and bs is the slope
# for laboratory size.
# Because the claim is that, together, both variables are positively
# associated with time to degree, both slope parameters (bf and bs) should
# be on the positive side of zero. For neither variable to be a good
# predictor on its own, funding and laboratory size would also need to be
# negatively correlated with each other, masking their individual effects.
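# A minimal runnable sketch of this model in quap, using hypothetical
# simulated data (the names funding, size, and time_phd are invented for
# illustration; funding and size are drawn negatively correlated so that
# neither predicts well alone):
library(rethinking)
set.seed(5)
n <- 100
funding <- rnorm(n)                          # standardized funding
size <- rnorm(n, mean = -funding)            # lab size, negatively correlated with funding
time_phd <- rnorm(n, mean = funding + size)  # both effects positive by construction
d <- data.frame(funding, size, time_phd)
m_phd <- quap(
  alist(
    time_phd ~ dnorm(mu, sigma),
    mu <- a + bf * funding + bs * size,
    a ~ dnorm(0, 1),
    bf ~ dnorm(0, 1),
    bs ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ),
  data = d
)
precis(m_phd)  # expect both bf and bs to be positive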
5-3. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced). Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
# Example: hours worked per week, work output (assume higher is better),
# and job satisfaction. Hours drive output, and satisfaction is driven
# almost entirely by output (with only a small direct negative effect of
# hours), so the hours-satisfaction correlation should mostly vanish once
# output enters the model.
library(rethinking)  # quap, precis
library(GGally)      # ggpairs
library(corrplot)

standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

N <- 100
set.seed(1)
workHours <- rnorm(n = N, mean = 8, sd = 1)
workOutcome <- rnorm(n = N, mean = 10 * workHours, sd = 2)
satisfaction <- rnorm(n = N, mean = workOutcome - 0.5 * workHours, sd = 0.5)
data <- data.frame(workHours, workOutcome, satisfaction)
data_standard <- data.frame(apply(data, 2, standardize))
ggpairs(data_standard)
corrplot(cor(data_standard))
# Simple regression: satisfaction on work hours only
model1 <- quap(
  alist(
    satisfaction ~ dnorm(mu, sigma),
    mu <- a + bhours * workHours,
    a ~ dnorm(0, 10),
    bhours ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = data_standard
)
precis(model1)
## mean sd 5.5% 94.5%
## a 6.905001e-08 0.02225000 -0.03555973 0.03555987
## bhours 9.746722e-01 0.02236209 0.93893326 1.01041113
## sigma 2.225005e-01 0.01573212 0.19735757 0.24764352
# Multiple regression: satisfaction on both work outcome and hours
model2 <- quap(
  alist(
    satisfaction ~ dnorm(mu, sigma),
    mu <- a + bworkOutcome * workOutcome + bhours * workHours,
    a ~ dnorm(0, 10),
    bworkOutcome ~ dnorm(0, 10),
    bhours ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = data_standard
)
precis(model2)
## mean sd 5.5% 94.5%
## a 3.558264e-06 0.005864839 -0.009369587 0.009376704
## bworkOutcome 1.033876e+00 0.028251383 0.988724853 1.079027186
## bhours -3.644775e-02 0.028251383 -0.081598913 0.008703420
## sigma 5.864840e-02 0.004142581 0.052027754 0.065269043
# Comment: In the multiple regression of satisfaction on both hours and
# outcome, the coefficient for hours shrinks to roughly zero (-0.04),
# whereas in the single-predictor model it was large (0.97). Once work
# outcome is known, there is little added value in knowing hours: the
# hours-satisfaction correlation is (mostly) spurious, carried through
# outcome.
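# One way to plot the results: rethinking's coeftab collects the estimates
# from both models, and its plot method shows the bhours slope collapsing
# toward zero once workOutcome is included.
plot(coeftab(model1, model2), pars = c("bhours", "bworkOutcome"))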
5-4. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another. Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
# Example: body weight, hours of exercise per week, and age. More exercise
# lowers body weight (negative relationship), greater age raises it
# (positive relationship), and age is inversely related to the amount of
# exercise, so the two predictors are correlated with one another.
N <- 100
set.seed(1)
maximum_exercise <- 70
body_const <- 160
age <- rnorm(n = N, mean = 25, sd = 5)
exercise <- rnorm(n = N, mean = maximum_exercise - 1.5 * age, sd = 3)
bodyweight <- rnorm(n = N, mean = body_const + age - exercise, sd = 0.5)
data <- data.frame(age, exercise, bodyweight)
data_standard <- data.frame(apply(data, 2, standardize))
ggpairs(data_standard)
corrplot(cor(data_standard))
# Simple regression: body weight on age only
model1 <- quap(
  alist(
    bodyweight ~ dnorm(mu, sigma),
    mu <- a + bage * age,
    a ~ dnorm(0, 10),
    bage ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = data_standard
)
precis(model1)
## mean sd 5.5% 94.5%
## a 1.873398e-06 0.02521776 -0.04030097 0.04030472
## bage 9.673383e-01 0.02534480 0.92683246 1.00784422
## sigma 2.521784e-01 0.01783049 0.22368179 0.28067494
# Multiple regression: body weight on both age and exercise
model2 <- quap(
  alist(
    bodyweight ~ dnorm(mu, sigma),
    mu <- a + bage * age + bexercise * exercise,
    a ~ dnorm(0, 10),
    bage ~ dnorm(0, 10),
    bexercise ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  ),
  data = data_standard
)
precis(model2)
## mean sd 5.5% 94.5%
## a 8.819245e-08 0.004424042 -0.007070385 0.007070561
## bage 3.821645e-01 0.011336060 0.364047239 0.400281665
## bexercise -6.361616e-01 0.011336060 -0.654278816 -0.618044390
## sigma 4.424042e-02 0.003123061 0.039249166 0.049231676
# Comment: In the multiple regression of body weight on age and exercise,
# the coefficient for age drops sharply (from 0.97 to 0.38) once exercise
# is included, because age and exercise are strongly negatively correlated
# and pull body weight in opposite directions. The correlation plot
# confirms the positive body weight-age relationship and the negative body
# weight-exercise relationship.
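# Plotting the results with coeftab again: the age slope shrinks, and the
# exercise slope appears only in the multiple regression.
plot(coeftab(model1, model2), pars = c("bage", "bexercise"))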
5-5. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the ‘quap’ model, the ‘precis’ table and your explanation of the results.
# Use the percentage of LDS members per state (compiled from Wikipedia by
# Jeffrey Girard). The raw percentages are strongly right-skewed, so a log
# transformation is applied before standardizing.
data("WaffleDivorce")
data <- WaffleDivorce
data$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)
data$logLDS <- log(data$LDS)
data$logLDS_std <- (data$logLDS - mean(data$logLDS)) / sd(data$logLDS)
hist(data$LDS)
hist(data$logLDS_std)
m <- quap(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS_std,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = data
)
precis(m)
## mean sd 5.5% 94.5%
## a 35.43697207 6.77530618 24.60872422 46.2652199
## bm 0.05341211 0.08261842 -0.07862807 0.1854523
## ba -1.02962005 0.22469474 -1.38872563 -0.6705145
## bl -0.60808155 0.29057042 -1.07246920 -0.1436939
## sigma 1.37872543 0.13838905 1.15755301 1.5998979
# From the precis table, both median age at marriage (ba = -1.03) and
# percent LDS (bl = -0.61) are negatively associated with divorce rate,
# with intervals entirely below zero. Marriage rate (bm = 0.05) is not a
# reliable predictor once the other two are known, since its interval
# straddles zero (5.5% < 0 and 94.5% > 0).
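# Plotting the precis output (precis objects have a plot method in
# rethinking) shows bm straddling zero while ba and bl sit below it:
plot(precis(m))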
5-6. In the divorce example, suppose the DAG is: M → A → D. What are the implied conditional independencies of the graph? Are the data consistent with it? (Hint: use the dagitty package)
library(dagitty)
dag <- dagitty("dag{ M -> A -> D }")
impliedConditionalIndependencies(dag)
## D _||_ M | A
# Comment: The implied conditional independency D _||_ M | A means the
# divorce rate is independent of the marriage rate once we condition on
# median age at marriage. We can check whether the data are consistent
# with this by regressing divorce on both marriage rate and age.
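# A quick sketch to visualize the graph (dagitty's graphLayout assigns
# plotting coordinates automatically):
plot(graphLayout(dag))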
data("WaffleDivorce")
data <- WaffleDivorce
m <- quap(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = data
)
precis(m)
## mean sd 5.5% 94.5%
## a 33.5974936 6.99727425 22.4144979 44.78048927
## bm -0.0286839 0.07516336 -0.1488095 0.09144166
## ba -0.8956346 0.22494670 -1.2551429 -0.53612637
## sigma 1.4409983 0.14460480 1.2098919 1.67210474
# From the result, the coefficient for marriage rate (bm = -0.03) is close
# to zero and its interval straddles zero, while the coefficient for median
# age at marriage (ba = -0.90) is reliably negative. Once we condition on
# age, divorce rate is nearly independent of marriage rate, so the data are
# consistent with the implied conditional independency D _||_ M | A.
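# As an alternative check, dagitty's localTests can test the implied
# conditional independency directly via partial correlations (this assumes
# a data frame with columns named to match the DAG nodes):
dag_data <- data.frame(D = data$Divorce, M = data$Marriage, A = data$MedianAgeMarriage)
localTests(dag, data = dag_data, type = "cis")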