Chapter 5 - Many Variables and Spurious Waffles

This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: What is the value of knowing each predictor, once we already know the other predictors? The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

5-1. Which of the linear models below are multiple linear regressions? \[\begin{align} {μ_i = α + βx_i} \tag{1}\\ μ_i = β_xx_i + β_zz_i \tag{2} \\ μ_i = α + β(x_i − z_i) \tag{3} \\ μ_i = α + β_xx_i + β_zz_i \tag{4} \\ \end{align}\]

# model 2 and model 4 are multiple linear regression as the variables x and z are in second order where as the equation 1 and 3 the variables are just of first order

5-2. Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree. Write down the model definition and indicate which side of zero each slope parameter should be on.

#According to the statement, Time to PhD is a function of laboratory size(S) and funding (F). 
#  Time = Normal(µ, σ)
#   µ = a + b1F + b2S ,b1 is the slope associate with Funding and b2 is the slope associated with size. Because of the positive relation with time to PhD degree, both the slopes b1 and b2 have positive signs. 

5-3. Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced). Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.

##taking the example of exam scores and operas seen relation before and after introducing the income levels of the families involved with the data

N <- 60
income <- rnorm(n = 60, mean = 0, sd = 1)
operas <- rnorm(n = N, mean = income, sd = 2)
scores <- rnorm(n = N, mean = income, sd = 1)
d <- data.frame(scores, operas, income)
pairs(d)

##verifying the presence of spurious correlation between test score with operas seen
m <- map(
  alist(
    scores ~ dnorm(mu, sigma),
    mu <- a + bo * operas,
    a ~ dnorm(0, 5),
    bo ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##            mean         sd        5.5%     94.5%
## a     0.1758058 0.16477108 -0.08753017 0.4391418
## bo    0.2160875 0.08095357  0.08670808 0.3454670
## sigma 1.2769955 0.11657761  1.09068192 1.4633090
##correlation after adding family income to the same
m <- map(
  alist(
    scores ~ dnorm(mu, sigma),
    mu <- a + bo * operas + bi * income,
    a ~ dnorm(0, 5),
    bo ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##            mean         sd        5.5%     94.5%
## a     0.1523353 0.12557527 -0.04835820 0.3530289
## bo    0.1231059 0.06325378  0.02201414 0.2241976
## bi    0.9608194 0.14584487  0.72773114 1.1939077
## sigma 0.9726078 0.08878562  0.83071122 1.1145044
##as shown with the model results with and without income as part of the models between test scores and operas and observed that the relation between scores and operas is spurious 

5-4. Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another. Plot the correlation before analysis, designate and run the ‘quap’ model, show the ‘precis’ table, plot the results and explain.

##to investigate the masked relationship situations, using the alcohol drinking data with being happy as well as spending time ill because of the alcohol. The effect of illness from alcohol is being masked by the feeling of happines from alcohol.
N <- 80
rho <- 0.6
alchol <- rnorm(n = N, mean = 0, sd = 1)
ill <- rnorm(n = N, mean = rho * alchol, sd = sqrt(1 - rho^2))
happy <- rnorm(n = N, mean = alchol - ill, sd = 1)
d <- data.frame(happy, alchol, ill)
pairs(d)

m <- map(
  alist(
    happy ~ dnorm(mu, sigma),
    mu <- a + ba * alchol,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##             mean        sd       5.5%      94.5%
## a     -0.1807145 0.1432595 -0.4096708 0.04824193
## ba     0.6118645 0.1379426  0.3914056 0.83232347
## sigma  1.2816704 0.1013247  1.1197339 1.44360688
##The masking situation has affected the correlations and widrespread bivariate relationships

m <- map(
  alist(
    happy ~ dnorm(mu, sigma),
    mu <- a + bi * ill,
    a ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##             mean        sd       5.5%       94.5%
## a     -0.1493778 0.1585474 -0.4027671  0.10401157
## bi    -0.2879655 0.1603488 -0.5442337 -0.03169718
## sigma  1.4027076 0.1108936  1.2254783  1.57993695
##multivariate regression has helped with overcoming the masking issue

m <- map(
  alist(
    happy ~ dnorm(mu, sigma),
    mu <- a + ba * alchol + bi * ill,
    a ~ dnorm(0, 5),
    ba ~ dnorm(0, 5),
    bi ~ dnorm(0, 5),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##              mean         sd       5.5%      94.5%
## a     -0.01864307 0.11335962 -0.1998136  0.1625275
## ba     1.17564371 0.13213967  0.9644590  1.3868284
## bi    -1.02062928 0.14036451 -1.2449589 -0.7962997
## sigma  0.99419774 0.07859806  0.8685828  1.1198126
##the final model with masking, the slopes of the predictor variables has increased due to the masking effect between alcohol drinking with feeling of happiness and feeling of illness which is bivariate in relationship.

5-5. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the ‘quap’ model, the ‘precis’ table and your explanation of the results.

data(WaffleDivorce)
d <- WaffleDivorce
d$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)

d$logLDS <- log(d$LDS)
d$logLDS.s <- (d$logLDS - mean(d$logLDS)) / sd(d$logLDS)
simplehist(d$LDS)

#observing the histogram of the raw data does not look normal and hence applying the log transformation 
simplehist(d$logLDS)

simplehist(d$logLDS.s)

#log transformed data looks closer to normal distribution with some skewness but it is approximate enough for multivariate regression model
m <- map(
  alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- a + bm * Marriage + ba * MedianAgeMarriage + bl * logLDS.s,
    a ~ dnorm(10, 20),
    bm ~ dnorm(0, 10),
    ba ~ dnorm(0, 10),
    bl ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  ),
  data = d
)
precis(m)
##              mean         sd       5.5%      94.5%
## a     35.44332296 6.77458473 24.6162281 46.2704178
## bm     0.05340162 0.08261017 -0.0786254  0.1854286
## ba    -1.02985684 0.22467031 -1.3889234 -0.6707903
## bl    -0.60818321 0.29054007 -1.0725224 -0.1438441
## sigma  1.37858205 0.13835272  1.1574677  1.5996964
#the model with transformed data that slopes for median age and LDS population are negative and hence are negatively related to divorce. Looking at the model parameters, there is no significant association with marriage rate and divorce rate. However, the states with overall old generation of population has lower divorce rates. 

5-6. In the divorce example, suppose the DAG is: M → A → D. What are the implied conditional independencies of the graph? Are the data consistent with it? (Hint: use the dagitty package)

library(dagitty)

maddog = dagitty("dag{M -> A -> D}")
impliedConditionalIndependencies(maddog)
## D _||_ M | A
equivalentDAGs(maddog)
## [[1]]
## dag {
## A
## D
## M
## A -> D
## M -> A
## }
## 
## [[2]]
## dag {
## A
## D
## M
## A -> D
## A -> M
## }
## 
## [[3]]
## dag {
## A
## D
## M
## A -> M
## D -> A
## }
#upon observing the output, it can be concluded that DAG implies the divorce rate (D) and is independent of marriage rate which further is a conditional probability on the median age of marriage.