This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.
Place each answer inside the code chunk (grey box). Each code chunk should contain a text response or code that completes or answers the question or activity. Make sure to include plots when a question requests them.
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html, publish the assignment to your RPubs account, and submit the link on Canvas. Each question is worth 5 points.
5-1. Which of the linear models below are multiple linear regressions? \[\begin{align} \mu_i &= \alpha + \beta x_i \tag{1}\\ \mu_i &= \beta_x x_i + \beta_z z_i \tag{2}\\ \mu_i &= \alpha + \beta(x_i - z_i) \tag{3}\\ \mu_i &= \alpha + \beta_x x_i + \beta_z z_i \tag{4} \end{align}\]
Only model (1) is not a multiple linear regression; the others each involve more than one predictor variable. Model (3) has a single slope parameter, but it expands to \(\mu_i = \alpha + \beta x_i - \beta z_i\), so it is still a regression on two predictors, with the slopes constrained to be equal and opposite.
5-2. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.
Let y1 denote the funding-only model, y2 the lab-size-only model, and y3 the multiple regression with both predictors, where x1 is amount of funding and x2 is size of laboratory.
beta3 and beta4 should both sit on the positive side of zero, since the claim is that funding and lab size are positively associated with time to degree once both are in the model. beta1 and beta2 could fall on either side of zero; because neither variable is a good predictor by itself, both should be close to zero.
\[\begin{align} y_1 &= \alpha + \beta_1 x_1 \\ y_2 &= \alpha + \beta_2 x_2 \\ y_3 &= \alpha + \beta_3 x_1 + \beta_4 x_2 \\ \end{align}\]
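As a sketch of how y3 could be fit with quap (assuming a hypothetical data frame d with standardized columns funding, lab_size, and phd_years; these names are illustrative, not from any real dataset):
flist <- alist(
  phd_years ~ dnorm(mu, sigma),              # time to degree
  mu <- a + b_3 * funding + b_4 * lab_size,  # both predictors together
  c(a, b_3, b_4) ~ dnorm(0, 1),              # weakly regularizing priors
  sigma ~ dexp(1)
)
fit <- quap(flist, data = d)
The claim is supported if the precis intervals for b_3 and b_4 both lie above zero while the single-predictor slopes b_1 and b_2 straddle it.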
5-3. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced). Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
The underlying relationship simply adds x_1 and x_2 together. But x_2 is derived from x_1: it is x_1 scaled down by a factor of ten, plus a little noise. Individually, both predictors show a strong correlation with the outcome y, but in the combined model the association with x_2 is greatly reduced, because x_2 carries almost no information beyond what x_1 already provides.
\[\begin{align} y &= x_1 + x_2 \\ x_2 &= \frac{x_1 + \text{Normal}(0,\, 0.5)}{10} \end{align}\]
library(rethinking)

# outcome is the simple sum of the two predictors
yFunc <- function(x_1, x_2) {
  x_1 + x_2
}
x_1 <- rnorm(n = 200, mean = 0, sd = 3)
x_2 <- (x_1 + rnorm(n = 200, mean = 0, sd = 0.5)) / 10  # noisy, scaled-down copy of x_1
result <- yFunc(x_1, x_2)
data <- data.frame(x_1, x_2, result)
x_min <- min(c(x_1, x_2))
x_max <- max(c(x_1, x_2))
cor(data)
## x_1 x_2 result
## x_1 1.0000000 0.9843615 0.9998667
## x_2 0.9843615 1.0000000 0.9871060
## result 0.9998667 0.9871060 1.0000000
# color points by outcome value so the three scatterplots can be compared
rbPal <- colorRampPalette(c('blue','red'))
data$color <- rbPal(10)[as.numeric(cut(data$result, breaks = 10))]
plot(x_1, result, pch = 20, col = data$color)
plot(x_2, result, pch = 20, col = data$color)
plot(x_1, x_2, pch = 20, col = data$color, xlim = c(x_min, x_max), ylim = c(x_min, x_max))
Only variable x_1:
flist <- alist(
  result ~ dnorm(mu, sigma),   # likelihood for the outcome
  mu <- a + b_1 * x_1,         # linear model with x_1 only
  c(a, b_1) ~ dnorm(0, 1),
  sigma ~ dexp(1)
)
fit_1 <- quap(flist, data = data)
precis(fit_1)
With x_1 alone, the posterior mean of b_1 should land near 1.1, since result = x_1 + x_2 and x_2 is roughly x_1 divided by ten.
Only variable x_2:
flist <- alist(
  result ~ dnorm(mu, sigma),
  mu <- a + b_2 * x_2,         # linear model with x_2 only
  c(a, b_2) ~ dnorm(0, 1),
  sigma ~ dexp(1)
)
fit_2 <- quap(flist, data = data)
precis(fit_2)
With x_2 alone, b_2 should also be clearly positive, and roughly ten times larger than b_1 above, because x_2 is a scaled-down copy of x_1.
Both variables:
flist <- alist(
  result ~ dnorm(mu, sigma),
  mu <- a + b_1 * x_1 + b_2 * x_2,  # both predictors
  c(a, b_1, b_2) ~ dnorm(0, 1),
  sigma ~ dexp(1)
)
fit_3 <- quap(flist, data = data)
precis(fit_3)
Once x_1 is in the model, the association between the outcome and x_2 is greatly reduced: b_2 should fall from around ten in the single-predictor fit to around one, the small direct contribution x_2 makes to the sum. Almost everything x_2 appeared to predict was really carried by x_1, which is the spurious-correlation pattern. (Note that result is a deterministic function of x_1 and x_2 here, so sigma will be estimated near zero.)
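To plot the results of the combined model, one option is a posterior-prediction plot (a sketch using link, PI, and col.alpha from the rethinking package, applied to fit_3 above):
mu <- link(fit_3)                        # posterior samples of mu for each case
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI)                # 89% interval of the mean
plot(mu_mean ~ data$result, pch = 20,
     xlab = "observed result", ylab = "predicted mean")
abline(a = 0, b = 1, lty = 2)            # points on this line are predicted perfectly
for (i in 1:nrow(data)) {
  lines(rep(data$result[i], 2), mu_PI[, i], col = col.alpha("black", 0.3))
}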
5-4. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another. Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.
\[\begin{align} y &= x_1 - x_2 \\ x_2 &= \frac{x_1^2}{5} + \text{Normal}(0,\, 0.5) \end{align}\]
# outcome is the difference of the two predictors
yFunc <- function(x_1, x_2) {
  x_1 - x_2
}
x_1 <- rnorm(n = 200, mean = 0, sd = 3)
x_2 <- x_1 * x_1 / 5 + rnorm(n = 200, mean = 0, sd = 0.5)  # tied to x_1 through its square
result <- yFunc(x_1, x_2)
x_min <- min(c(x_1, x_2))
x_max <- max(c(x_1, x_2))
data <- data.frame(x_1, x_2, result)
cor(data)
## x_1 x_2 result
## x_1 1.00000000 0.07762135 0.6494903
## x_2 0.07762135 1.00000000 -0.7076614
## result 0.64949033 -0.70766143 1.0000000
rbPal <- colorRampPalette(c('blue','red'))
data$color <- rbPal(10)[as.numeric(cut(data$result, breaks = 10))]
plot(x_1, result, pch = 20, col = data$color)
plot(x_2, result, pch = 20, col = data$color)
plot(x_1, x_2, pch = 20, col = data$color, xlim=c(x_min, x_max), ylim=c(x_min, x_max))
Only variable x_1:
flist <- alist(
  result ~ dnorm(mu, sigma),
  mu <- a + b_1 * x_1,          # linear model with x_1 only
  c(a, b_1) ~ dnorm(0, 1),
  sigma ~ dexp(1)
)
fit_1 <- quap(flist, data = data)
precis(fit_1)
With x_1 alone, b_1 should be positive (near one), but sigma stays large because the unmodeled x_2 component of the outcome is left in the residuals.
Only variable x_2:
flist <- alist(
  result ~ dnorm(mu, sigma),
  mu <- a + b_2 * x_2,          # linear model with x_2 only
  c(a, b_2) ~ dnorm(0, 1),
  sigma ~ dexp(1)
)
fit_2 <- quap(flist, data = data)
precis(fit_2)
With x_2 alone, b_2 should be negative (near minus one), again with a wide residual spread.
Both variables:
flist <- alist(
  result ~ dnorm(mu, sigma),
  mu <- a + b_1 * x_1 + b_2 * x_2,  # both predictors
  c(a, b_1, b_2) ~ dnorm(0, 1),
  sigma ~ dexp(1)
)
fit_3 <- quap(flist, data = data)
precis(fit_3)
With both predictors, the two opposite-signed effects are revealed together: b_1 near +1 and b_2 near -1, each estimated much more precisely than before, because the other predictor's contribution is no longer left in the residuals. Individually each predictor shows only a moderate association with the outcome (correlations of about 0.65 and -0.71 above); jointly they account for nearly all of it, which is the masking pattern. Note that x_1 and x_2 are only weakly correlated linearly (0.078) because x_2 depends on the square of x_1; the nonlinear dependence still ties them together.
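One way to plot these results is to compare the slope estimates across the three fits side by side (a sketch using coeftab from the rethinking package, applied to fit_1, fit_2, and fit_3 above):
plot(coeftab(fit_1, fit_2, fit_3), pars = c("b_1", "b_2"))
The intervals for b_1 and b_2 should be visibly tighter in the combined fit, which is the visual signature of a masked relationship.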
5-5. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the ‘quap’ model, the ‘precis’ table and your explanation of the results.
LDS population data can be found at: https://worldpopulationreview.com/state-rankings/mormon-population-by-state
The following vector comes from work by Souhail Gaboute at https://rpubs.com/gbsouhail/a505a4, rather than manual entry.
data(WaffleDivorce)
data = WaffleDivorce
data$ldsPol = c(0.0077,0.0458,0.0600,0.0107,0.0191,0.0261,0.0045,0.0058,0.0045,0.0075,0.0082,0.0530,0.2586,0.0045,0.0068,0.0090,0.0132,0.0080,0.0064,0.0082,0.0072,0.0041,0.0045,0.0059,0.0073,0.0118,0.0473,0.0130,0.0065,0.0038,0.0331,0.0043,0.0085,0.0152,0.0054,0.0124,0.0364,0.0041,0.0040,0.0080,0.0120,0.0077,0.0125,0.6632,0.0074,0.0113,0.0380,0.0096,0.0047,0.1170)
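Before fitting, it is worth checking the distribution of the raw LDS proportions: most states sit below a few percent while Utah is about 66%, so a log transform is a natural choice. A quick check with base R histograms:
hist(data$ldsPol, breaks = 20, main = "raw LDS proportion")
hist(log(data$ldsPol), breaks = 20, main = "log LDS proportion")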
data$D <- standardize(data$Divorce)            # standardized divorce rate
data$M <- standardize(data$Marriage)           # standardized marriage rate
data$A <- standardize(data$MedianAgeMarriage)  # standardized median age at marriage
data$L <- standardize(log(data$ldsPol))        # log first: the raw LDS share is heavily skewed
flist <- alist(
  D ~ dnorm(mu, sigma),
  mu <- a + b_1 * L + b_2 * M + b_3 * A,
  c(a, b_1, b_2, b_3) ~ dnorm(0, 1),
  sigma ~ dexp(1)
)
fit <- quap(flist, data = data)
precis(fit)
The coefficient on median age at marriage (b_3) should be clearly negative and the marriage-rate coefficient (b_2) close to zero, matching the earlier divorce analyses. The LDS coefficient (b_1) should also be negative: even after accounting for marriage rate and age at marriage, states with a larger LDS share have lower divorce rates, which is exactly the residual pattern the question describes.
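A quick visual summary of the posterior (a sketch; this plots the precis means and 89% intervals for the fit above):
plot(precis(fit))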
5-6. In the divorce example, suppose the DAG is: M → A → D. What are the implied conditional independencies of the graph? Are the data consistent with it? (Hint: use the dagitty package)
library(dagitty)

g <- dagitty('dag {
  M [pos="0,0"]
  A [pos="1,0"]
  D [pos="2,0"]
  M -> A -> D
}')
plot(g)
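The implied conditional independencies can be read directly from the graph object:
impliedConditionalIndependencies(g)
For the chain M → A → D this reports exactly one independency, D independent of M conditional on A: once median age at marriage is known, marriage rate tells us nothing more about divorce. The data are consistent with this. In the multiple regression of divorce rate on both A and M (as in 5-5 without the LDS term), the posterior for the marriage-rate slope concentrates near zero while A retains a clearly negative slope, so M adds little once A is in the model.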