This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes or answers the question or activity. Make sure to include plots when a question requests them.
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
5-1. Which of the linear models below are multiple linear regressions?
\[\begin{align}
\mu_i &= \alpha + \beta x_i \tag{1} \\
\mu_i &= \beta_x x_i + \beta_z z_i \tag{2} \\
\mu_i &= \alpha + \beta(x_i - z_i) \tag{3} \\
\mu_i &= \alpha + \beta_x x_i + \beta_z z_i \tag{4}
\end{align}\]
#ANSWER:
#Models (2), μ_i = β_x x_i + β_z z_i, and (4), μ_i = α + β_x x_i + β_z z_i, are multiple linear regressions: each has a separate slope parameter for each of the two predictor variables. Model (1) has a single predictor, and model (3) applies a single slope to the difference of two variables, so neither is a multiple regression.
5-2. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.
#ANSWER:
#T_i ~ Normal(μ_i, σ)
#μ_i = α + β_A A_i + β_S S_i
#T_i = time to PhD degree
#A_i = amount of funding (slope β_A)
#S_i = size of laboratory (slope β_S)
#Because the claim is that, together, both variables are positively associated with time to degree, both slopes β_A and β_S should be on the positive side of zero.
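Not required by the question, but a quick quap sketch on simulated data makes the definition concrete (the simulation, variable names, and priors below are all my own). The claim implies funding and lab size are negatively correlated with each other while both have positive direct effects, which is why neither is a good predictor alone:
library(rethinking)
set.seed(5)
n <- 100
A <- rnorm(n)                      # amount of funding (standardized)
S <- rnorm(n, mean = -0.8 * A)     # lab size, negatively correlated with funding
T_phd <- rnorm(n, mean = A + S)    # time to degree: both direct effects positive
d_phd <- data.frame(T_phd, A, S)
model_phd <- quap(
  alist(
    T_phd ~ dnorm(mu, sigma),
    mu <- a + bA * A + bS * S,
    a ~ dnorm(0, 1),
    bA ~ dnorm(0, 1),
    bS ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d_phd
)
precis(model_phd)   # bA and bS should both come out clearly positive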
5-3. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced). Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
#ANSWER: Simulated example in the spirit of http://www.tylervigen.com/spurious-correlations
#J = Japanese passenger cars sold in the USA
#S = suicides by crashing a motorcycle
#T = a transportation statistic (the outcome)
#In the simulation below, J influences both T and S, so S is correlated with T even though S has no direct effect on T.
library(rethinking)   # quap, precis, coeftab, simplehist
library(GGally)       # ggpairs
N <- 100                               # number of simulated observations
J <- rnorm(n = N, mean = 0, sd = 1)    # common cause
T <- rnorm(n = N, mean = -J, sd = 1)   # outcome, driven negatively by J
S <- rnorm(n = N, mean = J, sd = 2)    # spurious predictor, driven positively by J
d <- data.frame(J, T, S)               # note: T masks R's shorthand for TRUE
ggpairs(d)   # pairwise plots and correlations before modeling
cor(d)
## J T S
## J 1.0000000 -0.6995148 0.4806565
## T -0.6995148 1.0000000 -0.3496911
## S 0.4806565 -0.3496911 1.0000000
#Bivariate model: regress the outcome T on the spurious predictor S alone
model_S <- quap(
  alist(
    T ~ dnorm(mu, sigma),
    mu <- a + bS * S,
    a ~ dnorm(0, 1),
    bS ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_S)
## mean sd 5.5% 94.5%
## a -0.07250435 0.13156699 -0.2827738 0.1377651
## bS -0.22841845 0.06130062 -0.3263887 -0.1304482
## sigma 1.32441506 0.09365051 1.1747435 1.4740867
plot(precis(model_S))
#Multiple regression: T on both S and J
model_SJ <- quap(
  alist(
    T ~ dnorm(mu, sigma),
    mu <- a + bS * S + bJ * J,
    a ~ dnorm(0, 1),
    bS ~ dnorm(0, 1),
    bJ ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_SJ)
## mean sd 5.5% 94.5%
## a -0.09814341 0.10074790 -0.25915801 0.06287119
## bS -0.01437729 0.05325647 -0.09949141 0.07073684
## bJ -0.97517376 0.11572524 -1.16012505 -0.79022247
## sigma 1.01004844 0.07142867 0.89589163 1.12420526
plot(precis(model_SJ))
plot(coeftab(model_SJ, model_S), par = c("bJ", "bS"))
ANSWER: In the bivariate model, bS is reliably negative (mean about -0.23, interval excluding zero). Once J is added, bS collapses to roughly zero (about -0.01, interval spanning zero) while bJ stays strongly negative. The apparent association between S and T was induced entirely by their common influence J, so the S–T correlation is spurious.
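As a check on this story, a minimal dagitty sketch of the assumed graph (dag_spur is my name; dagitty is used again in 5-6) confirms that the graph implies T and S are independent once we condition on J, which is exactly what the multiple regression shows:
library(dagitty)
dag_spur <- dagitty("dag{ J -> T; J -> S }")
impliedConditionalIndependencies(dag_spur)   # implies S _||_ T | J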
5-4. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another. Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
#ANSWER: A good example of this kind of relationship appears in a study that measures the nutritional composition (e.g., potassium level) of soil cores taken at different altitudes and moisture levels.
#Because water flows downhill, lower altitudes (A) tend to be more moist (M), so the two predictors are correlated. As long as they are only moderately correlated, we can still estimate the unique effect of altitude on potassium levels.
#Ref: https://www.theanalysisfactor.com/five-common-relationships-among-three-variables-in-a-statistical-model/
#N = Nutritional composition of soil
#A = Altitude levels
#M = Moisture levels
n_obs <- 100   # number of simulated soil cores (renamed so it does not clash with the outcome N)
rho <- 0.6     # strength of the altitude–outcome association
A <- rnorm(n = n_obs, mean = 0, sd = 1)                       # altitude
N <- rnorm(n = n_obs, mean = rho * A, sd = sqrt(1 - rho^2))   # nutritional outcome, positively tied to A
M <- rnorm(n = n_obs, mean = A - N, sd = 1)                   # moisture: positively tied to A, negatively to N
d <- data.frame(M, A, N)
ggpairs(d)
#Bivariate model: N on altitude alone
model_A <- quap(
  alist(
    N ~ dnorm(mu, sigma),
    mu <- a + bA * A,
    a ~ dnorm(0, 1),
    bA ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_A)
## mean sd 5.5% 94.5%
## a 0.04671882 0.08357324 -0.08684737 0.1802850
## bA 0.56129668 0.07940151 0.43439773 0.6881956
## sigma 0.83651972 0.05915169 0.74198390 0.9310555
plot(precis(model_A))
#Bivariate model: N on moisture alone
model_M <- quap(
  alist(
    N ~ dnorm(mu, sigma),
    mu <- a + bM * M,
    a ~ dnorm(0, 1),
    bM ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_M)
## mean sd 5.5% 94.5%
## a 0.1024543 0.09592617 -0.05085422 0.2557629
## bM -0.2521118 0.06897223 -0.36234270 -0.1418808
## sigma 0.9629470 0.06809759 0.85411390 1.0717801
plot(precis(model_M))
#Multiple regression: N on both altitude and moisture
model_AM <- quap(
  alist(
    N ~ dnorm(mu, sigma),
    mu <- a + bA * A + bM * M,
    a ~ dnorm(0, 1),
    bA ~ dnorm(0, 1),
    bM ~ dnorm(0, 1),
    sigma ~ dunif(0, 2)
  ),
  data = d
)
precis(model_AM)
## mean sd 5.5% 94.5%
## a 0.05774237 0.06383449 -0.04427747 0.1597622
## bA 0.71242268 0.06315336 0.61149140 0.8133539
## bM -0.40331887 0.04765646 -0.47948310 -0.3271546
## sigma 0.63787509 0.04510551 0.56578777 0.7099624
plot(precis(model_AM))
plot(coeftab(model_AM, model_M, model_A), par = c("bM", "bA"))
ANSWER: In the joint model, altitude is positively associated with the nutritional outcome (bA ≈ 0.71) while moisture is negatively associated with it (bM ≈ -0.40), and both slopes move further from zero than in the bivariate models (bA: 0.56 → 0.71; bM: -0.25 → -0.40). The two predictors are correlated with one another (lower altitudes tend to be more moist) but are associated with the outcome in opposite directions, so each masks part of the other's effect until both appear in the same model. This is a masked relationship.
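A sketch of the generative graph behind the simulation above may help (dag_mask is my name; dagitty was attached in the 5-3 sketch, and drawdag is the rethinking helper — if it is unavailable, plot(graphLayout(dag_mask)) from dagitty does the same job):
dag_mask <- dagitty("dag{ A -> N; A -> M; N -> M }")
drawdag(dag_mask)   # A raises N directly; M is tied positively to A and negatively to N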
5-5. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the ‘quap’ model, the ‘precis’ table and your explanation of the results.
# Percent LDS population for the 50 states in WaffleDivorce, derived from https://www.worldatlas.com/articles/mormon-population-by-state.html
data("WaffleDivorce")
d <- WaffleDivorce
d$LDS <- c(0.677,0.2642,0.1153,0.0610,0.0517,0.0481,0.0456,0.0394,0.0376,0.0335,0.0274,0.0197,0.0149,0.0130,0.0129,0.0125,0.0121,0.0121,0.0116,0.0113,0.0103,0.0093,0.0090,0.0084,0.0082,0.0082,0.0081,0.0079,0.0077,0.0075,0.0075,0.0073,0.0073,0.0072,0.0067,0.0065,0.0064,0.0059,0.0057,0.0053,0.0046,0.0045,0.0045,0.0044,0.0041,0.0040,0.0040,0.0040,0.0039,0.0037)
d$logLDS <- log(d$LDS)
d$logLDS.s <- (d$logLDS - mean(d$logLDS)) / sd(d$logLDS)
simplehist(d$LDS)
simplehist(d$logLDS.s)
#quap model: divorce rate on marriage rate, median age at marriage, and standardized log percent LDS
model <- quap(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS.s,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(model)
## mean sd 5.5% 94.5%
## a 36.92987714 6.7756234 26.1011223 47.75863201
## bm -0.06908789 0.0730323 -0.1858076 0.04763183
## ba -0.99235187 0.2170165 -1.3391862 -0.64551758
## bl 0.47414605 0.1991106 0.1559288 0.79236325
## sigma 1.36033339 0.1366041 1.1420137 1.57865310
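The question does not require a plot, but a posterior predictive check is a natural addition. This sketch uses the standard rethinking link()/PI() machinery (mu_mean and mu_PI are my names):
# Posterior predictive check: model-implied mean divorce rate vs observed
mu <- link(model)                 # posterior samples of mu for each state
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI)
plot(mu_mean ~ d$Divorce, col = rangi2, ylim = range(mu_PI),
     xlab = "Observed divorce rate", ylab = "Predicted divorce rate")
abline(a = 0, b = 1, lty = 2)     # perfect-prediction line
for (i in 1:nrow(d)) lines(rep(d$Divorce[i], 2), mu_PI[, i], col = rangi2)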
ANSWER: The percent LDS population was derived from https://www.worldatlas.com/articles/mormon-population-by-state.html, then log-transformed and standardized because the raw percentages are strongly right-skewed (see the histograms above). In the precis table, the slope for marriage rate (bm) is small and its interval spans zero, so marriage rate adds little once the other predictors are known. Median age at marriage (ba) is reliably negative: states with an older median age at marriage have lower divorce rates. The slope for standardized log LDS (bl) comes out positive with an interval excluding zero in this fit, contrary to the expectation that heavily LDS states have lower divorce rates; this may indicate that the LDS values (entered in descending order) are not matched to the alphabetical state order of the WaffleDivorce rows and is worth checking.
5-6. In the divorce example, suppose the DAG is: M → A → D. What are the implied conditional independencies of the graph? Are the data consistent with it? (Hint: use the dagitty package)
library(dagitty)
dag_MAD <- dagitty("dag{ M -> A -> D}")
impliedConditionalIndependencies(dag_MAD)
## D _||_ M | A
equivalentDAGs(dag_MAD)
## [[1]]
## dag {
## A
## D
## M
## A -> D
## M -> A
## }
##
## [[2]]
## dag {
## A
## D
## M
## A -> D
## A -> M
## }
##
## [[3]]
## dag {
## A
## D
## M
## A -> M
## D -> A
## }
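ANSWER: The only implied conditional independency is D _||_ M | A (and the three DAGs above form the Markov equivalence class, so the data cannot distinguish among them). To check whether the data are consistent with the independency, one can regress divorce on both marriage rate and median age at marriage and see whether the marriage-rate slope concentrates near zero; a sketch using the same quap machinery as above (model_check and the priors are my own choices):
model_check <- quap(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bM * Marriage + bA * MedianAgeMarriage,
    a ~ dnorm(10, 20),
    bM ~ dnorm(0, 10),
    bA ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(model_check)
#If the interval for bM comfortably spans zero while bA does not (as in the
#5-5 fit above, where bm was about -0.07 with an interval crossing zero), the
#data are consistent with D _||_ M | A and therefore with the DAG M -> A -> D.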