This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes or answers the question or activity. Make sure to include plots when a question requests them.
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
5-1. Which of the linear models below are multiple linear regressions?
\[\begin{align}
\mu_i &= \alpha + \beta x_i \tag{1} \\
\mu_i &= \beta_x x_i + \beta_z z_i \tag{2} \\
\mu_i &= \alpha + \beta(x_i - z_i) \tag{3} \\
\mu_i &= \alpha + \beta_x x_i + \beta_z z_i \tag{4}
\end{align}\]
#ANSWER:
#Models (2), μ_i = β_x x_i + β_z z_i, and (4), μ_i = α + β_x x_i + β_z z_i, are multiple linear regressions: each has a separate slope parameter for each of the two predictor variables. Model (1) has a single predictor, and model (3) applies a single slope to the difference of two variables, so neither is a multiple regression.
5-2. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.
#ANSWER:
#T_i ~ Normal(μ_i, σ)
#μ_i = α + β_A A_i + β_S S_i
#T_i = time to PhD degree
#A_i = amount of funding (slope β_A)
#S_i = size of laboratory (slope β_S)
#Because the claim is that, together, both variables are positively associated with time to degree, both slopes β_A and β_S should be on the positive side of zero.
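Not required by the question, but a quick quap sketch on simulated data makes the definition concrete (the simulation, variable names, and priors below are all my own). The claim implies funding and lab size are negatively correlated with each other while both have positive direct effects, which is why neither is a good predictor alone:
library(rethinking)
set.seed(5)
n <- 100
A <- rnorm(n)                      # amount of funding (standardized)
S <- rnorm(n, mean = -0.8 * A)     # lab size, negatively correlated with funding
T_phd <- rnorm(n, mean = A + S)    # time to degree: both direct effects positive
d_phd <- data.frame(T_phd, A, S)
model_phd <- quap(
  alist(
    T_phd ~ dnorm(mu, sigma),
    mu <- a + bA * A + bS * S,
    a ~ dnorm(0, 1),
    bA ~ dnorm(0, 1),
    bS ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d_phd
)
precis(model_phd)   # bA and bS should both come out clearly positive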
5-3. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced). Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
#ANSWER: Simulated example in the spirit of http://www.tylervigen.com/spurious-correlations
#J = Japanese passenger cars sold in the USA
#S = suicides by crashing a motorcycle
#T = a transportation statistic (the outcome)
#In the simulation below, J influences both T and S, so S is correlated with T even though S has no direct effect on T.
library(rethinking)   # quap, precis, coeftab, simplehist
library(GGally)       # ggpairs
N <- 100                               # number of simulated observations
J <- rnorm(n = N, mean = 0, sd = 1)    # common cause
T <- rnorm(n = N, mean = -J, sd = 1)   # outcome, driven negatively by J
S <- rnorm(n = N, mean = J, sd = 2)    # spurious predictor, driven positively by J
d <- data.frame(J, T, S)               # note: T masks R's shorthand for TRUE
ggpairs(d)   # pairwise plots and correlations before modeling
cor(d)
## J T S
## J 1.0000000 -0.6995148 0.4806565
## T -0.6995148 1.0000000 -0.3496911
## S 0.4806565 -0.3496911 1.0000000
#Bivariate model: regress the outcome T on the spurious predictor S alone
model_S <- quap(
  alist(
    T ~ dnorm(mu, sigma),
    mu <- a + bS * S,
    a ~ dnorm(0, 1),
    bS ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_S)
## mean sd 5.5% 94.5%
## a -0.07250435 0.13156699 -0.2827738 0.1377651
## bS -0.22841845 0.06130062 -0.3263887 -0.1304482
## sigma 1.32441506 0.09365051 1.1747435 1.4740867
plot(precis(model_S))
#Multiple regression: T on both S and J
model_SJ <- quap(
  alist(
    T ~ dnorm(mu, sigma),
    mu <- a + bS * S + bJ * J,
    a ~ dnorm(0, 1),
    bS ~ dnorm(0, 1),
    bJ ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_SJ)
## mean sd 5.5% 94.5%
## a -0.09814341 0.10074790 -0.25915801 0.06287119
## bS -0.01437729 0.05325647 -0.09949141 0.07073684
## bJ -0.97517376 0.11572524 -1.16012505 -0.79022247
## sigma 1.01004844 0.07142867 0.89589163 1.12420526
plot(precis(model_SJ))
plot(coeftab(model_SJ, model_S), par = c("bJ", "bS"))
ANSWER: In the bivariate model, bS is reliably negative (mean about -0.23, interval excluding zero). Once J is added, bS collapses to roughly zero (about -0.01, interval spanning zero) while bJ stays strongly negative. The apparent association between S and T was induced entirely by their common influence J, so the S–T correlation is spurious.
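As a check on this story, a minimal dagitty sketch of the assumed graph (dag_spur is my name; dagitty is used again in 5-6) confirms that the graph implies T and S are independent once we condition on J, which is exactly what the multiple regression shows:
library(dagitty)
dag_spur <- dagitty("dag{ J -> T; J -> S }")
impliedConditionalIndependencies(dag_spur)   # implies S _||_ T | J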
5-4. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another. Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
#ANSWER: A good example of this kind of relationship appears in a study that measures the nutritional composition (e.g., potassium level) of soil cores taken at different altitudes and moisture levels.
#Because water flows downhill, lower altitudes (A) tend to be more moist (M), so the two predictors are correlated. As long as they are only moderately correlated, we can still estimate the unique effect of altitude on potassium levels.
#Ref: https://www.theanalysisfactor.com/five-common-relationships-among-three-variables-in-a-statistical-model/
#N = Nutritional composition of soil
#A = Altitude levels
#M = Moisture levels
n_obs <- 100   # number of simulated soil cores (renamed so it does not clash with the outcome N)
rho <- 0.6     # strength of the altitude–outcome association
A <- rnorm(n = n_obs, mean = 0, sd = 1)                       # altitude
N <- rnorm(n = n_obs, mean = rho * A, sd = sqrt(1 - rho^2))   # nutritional outcome, positively tied to A
M <- rnorm(n = n_obs, mean = A - N, sd = 1)                   # moisture: positively tied to A, negatively to N
d <- data.frame(M, A, N)
ggpairs(d)
#Bivariate model: N on altitude alone
model_A <- quap(
  alist(
    N ~ dnorm(mu, sigma),
    mu <- a + bA * A,
    a ~ dnorm(0, 1),
    bA ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_A)
## mean sd 5.5% 94.5%
## a 0.04671882 0.08357324 -0.08684737 0.1802850
## bA 0.56129668 0.07940151 0.43439773 0.6881956
## sigma 0.83651972 0.05915169 0.74198390 0.9310555
plot(precis(model_A))
#Bivariate model: N on moisture alone
model_M <- quap(
  alist(
    N ~ dnorm(mu, sigma),
    mu <- a + bM * M,
    a ~ dnorm(0, 1),
    bM ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_M)
## mean sd 5.5% 94.5%
## a 0.1024543 0.09592617 -0.05085422 0.2557629
## bM -0.2521118 0.06897223 -0.36234270 -0.1418808
## sigma 0.9629470 0.06809759 0.85411390 1.0717801
plot(precis(model_M))
#Multiple regression: N on both altitude and moisture
model_AM <- quap(
  alist(
    N ~ dnorm(mu, sigma),
    mu <- a + bA * A + bM * M,
    a ~ dnorm(0, 1),
    bA ~ dnorm(0, 1),
    bM ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_AM)
## mean sd 5.5% 94.5%
## a 0.05774237 0.06383449 -0.04427747 0.1597622
## bA 0.71242268 0.06315336 0.61149140 0.8133539
## bM -0.40331887 0.04765646 -0.47948310 -0.3271546
## sigma 0.63787509 0.04510551 0.56578777 0.7099624
plot(precis(model_AM))
plot(coeftab(model_AM, model_M, model_A), par = c("bM", "bA"))
ANSWER: In the joint model, altitude is positively associated with the nutritional outcome (bA ≈ 0.71) while moisture is negatively associated with it (bM ≈ -0.40), and both slopes move further from zero than in the bivariate models (bA: 0.56 → 0.71; bM: -0.25 → -0.40). The two predictors are correlated with one another (lower altitudes tend to be more moist) but are associated with the outcome in opposite directions, so each masks part of the other's effect until both appear in the same model. This is a masked relationship.
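A sketch of the generative graph behind the simulation above may help (dag_mask is my name; dagitty was attached in the 5-3 sketch, and drawdag is the rethinking helper — if it is unavailable, plot(graphLayout(dag_mask)) from dagitty does the same job):
dag_mask <- dagitty("dag{ A -> N; A -> M; N -> M }")
drawdag(dag_mask)   # A raises N directly; M is tied positively to A and negatively to N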
5-5. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the ‘quap’ model, the ‘precis’ table and your explanation of the results.
# Percent LDS population for the 50 states in WaffleDivorce, derived from https://www.worldatlas.com/articles/mormon-population-by-state.html
data("WaffleDivorce")
d <- WaffleDivorce
d$LDS <- c(0.677,0.2642,0.1153,0.0610,0.0517,0.0481,0.0456,0.0394,0.0376,0.0335,0.0274,0.0197,0.0149,0.0130,0.0129,0.0125,0.0121,0.0121,0.0116,0.0113,0.0103,0.0093,0.0090,0.0084,0.0082,0.0082,0.0081,0.0079,0.0077,0.0075,0.0075,0.0073,0.0073,0.0072,0.0067,0.0065,0.0064,0.0059,0.0057,0.0053,0.0046,0.0045,0.0045,0.0044,0.0041,0.0040,0.0040,0.0040,0.0039,0.0037)
d$logLDS <- log(d$LDS)
d$logLDS.s <- (d$logLDS - mean(d$logLDS)) / sd(d$logLDS)
simplehist(d$LDS)
simplehist(d$logLDS.s)
#quap model: divorce rate on marriage rate, median age at marriage, and standardized log percent LDS
model <- quap(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS.s,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(model)
## mean sd 5.5% 94.5%
## a 36.92987714 6.7756234 26.1011223 47.75863201
## bm -0.06908789 0.0730323 -0.1858076 0.04763183
## ba -0.99235187 0.2170165 -1.3391862 -0.64551758
## bl 0.47414605 0.1991106 0.1559288 0.79236325
## sigma 1.36033339 0.1366041 1.1420137 1.57865310
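The question does not require a plot, but a posterior predictive check is a natural addition. This sketch uses the standard rethinking link()/PI() machinery (mu_mean and mu_PI are my names):
# Posterior predictive check: model-implied mean divorce rate vs observed
mu <- link(model)                 # posterior samples of mu for each state
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI)
plot(mu_mean ~ d$Divorce, col = rangi2, ylim = range(mu_PI),
     xlab = "Observed divorce rate", ylab = "Predicted divorce rate")
abline(a = 0, b = 1, lty = 2)     # perfect-prediction line
for (i in 1:nrow(d)) lines(rep(d$Divorce[i], 2), mu_PI[, i], col = rangi2)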
ANSWER: The percent LDS population was derived from https://www.worldatlas.com/articles/mormon-population-by-state.html, then log-transformed and standardized because the raw percentages are strongly right-skewed (see the histograms above). In the precis table, the slope for marriage rate (bm) is small and its interval spans zero, so marriage rate adds little once the other predictors are known. Median age at marriage (ba) is reliably negative: states with an older median age at marriage have lower divorce rates. The slope for standardized log LDS (bl) comes out positive with an interval excluding zero in this fit, contrary to the expectation that heavily LDS states have lower divorce rates; this may indicate that the LDS values (entered in descending order) are not matched to the alphabetical state order of the WaffleDivorce rows and is worth checking.
5-6. In the divorce example, suppose the DAG is: M → A → D. What are the implied conditional independencies of the graph? Are the data consistent with it? (Hint: use the dagitty package)
library(dagitty)
dag_MAD <- dagitty("dag{ M -> A -> D}")
impliedConditionalIndependencies(dag_MAD)
## D _||_ M | A
equivalentDAGs(dag_MAD)
## [[1]]
## dag {
## A
## D
## M
## A -> D
## M -> A
## }
##
## [[2]]
## dag {
## A
## D
## M
## A -> D
## A -> M
## }
##
## [[3]]
## dag {
## A
## D
## M
## A -> M
## D -> A
## }
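ANSWER: The only implied conditional independency is D _||_ M | A (and the three DAGs above form the Markov equivalence class, so the data cannot distinguish among them). To check whether the data are consistent with the independency, one can regress divorce on both marriage rate and median age at marriage and see whether the marriage-rate slope concentrates near zero; a sketch using the same quap machinery as above (model_check and the priors are my own choices):
model_check <- quap(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bM * Marriage + bA * MedianAgeMarriage,
    a ~ dnorm(10, 20),
    bM ~ dnorm(0, 10),
    bA ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(model_check)
#If the interval for bM comfortably spans zero while bA does not (as in the
#5-5 fit above, where bm was about -0.07 with an interval crossing zero), the
#data are consistent with D _||_ M | A and therefore with the DAG M -> A -> D.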