CemSoylu_ANLY505_2022

title: ‘Assignment #5’ author: “Cem Soylu” date: “2022-07-25” output: html_document

Questions

5-1. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the ‘quap’ model, the ‘precis’ table and your explanation of the results.

library(rethinking)
data("WaffleDivorce")
head(WaffleDivorce)

##     Location Loc Population MedianAgeMarriage Marriage Marriage.SE Divorce
## 1    Alabama  AL       4.78              25.3     20.2        1.27    12.7
## 2     Alaska  AK       0.71              25.2     26.0        2.93    12.5
## 3    Arizona  AZ       6.33              25.8     20.3        0.98    10.8
## 4   Arkansas  AR       2.92              24.3     26.4        1.70    13.5
## 5 California  CA      37.25              26.8     19.1        0.39     8.0
## 6   Colorado  CO       5.03              25.7     23.5        1.24    11.6
##   Divorce.SE WaffleHouses South Slaves1860 Population1860 PropSlaves1860
## 1       0.79          128     1     435080         964201           0.45
## 2       2.05            0     0          0              0           0.00
## 3       0.74           18     0          0              0           0.00
## 4       1.22           41     1     111115         435450           0.26
## 5       0.24            0     0          0         379994           0.00
## 6       0.94           11     0          0          34277           0.00

# The percent LDS population by State is collected from Wikipedia 
# Merge it to the data WaffleDivorce

WaffleDivorce$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)

hist(WaffleDivorce$LDS)

#Since the distribution is positive skew with long right tail, log transformation is thus used before standardization process.

WaffleDivorce$log_LDS <- log(WaffleDivorce$LDS)
WaffleDivorce$log_LDS.s <- (WaffleDivorce$log_LDS - mean(WaffleDivorce$log_LDS)) / sd(WaffleDivorce$log_LDS)
hist(WaffleDivorce$log_LDS.s)

# Build the multiple regression model

m <- alist(
    Divorce ~ dnorm(mu, sigma),
    mu <- b0 + b1*Marriage + b2*MedianAgeMarriage + b3*log_LDS.s,
    b0 ~ dnorm(10, 20),
    b1 ~ dnorm(0, 10),
    b2 ~ dnorm(0, 10),
    b3 ~ dnorm(0, 10),
    sigma ~ dunif(0, 5)
  )

fit <- quap( 
    m , 
    data = WaffleDivorce
)

precis(fit)

##              mean         sd       5.5%      94.5%
## b0    35.44057027 6.77524762 24.6124160 46.2687245
## b1     0.05337704 0.08261801 -0.0786625  0.1854166
## b2    -1.02973113 0.22469297 -1.3888339 -0.6706284
## b3    -0.60804432 0.29057026 -1.0724317 -0.1436569
## sigma  1.37872431 0.13838859  1.1575526  1.5998960

#In this 'quap' multivariate regression model, the slope of marriage has an interval including zero, thus it is not statistically significant and we fail to reject that the slope of marriage could be zero. However, the slopes of both median age at marriage and percentage of LDS population with standardization were negative and their intervals did not include zero. Therefore, these slopes are statistically significant. The results suggest that states with older median age at marriage or higher percentages of LDS population had lower divorce rates.

5-2. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model prediction Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?

quest <- dagitty( "dag {X -> Z
    Z -> Y
}")

drawdag(quest, lwd = 2)

# Simulate data from this DAG so that the correlation between X and Z is very large
set.seed(123)
N = 1000
X = rnorm(N, mean = 0, sd = 1)
Z = 0.95*X + rnorm(N, mean = 0, sd = 0.5)
Y = 0.4*Z + rnorm(N, mean = 0, sd = 1)
cor(X,Z)

## [1] 0.8907608

# Correlation between X and Z is very large, which is 0.89
d <- data.frame(X, Y, Z)
cor(d)

##           X         Y         Z
## X 1.0000000 0.3501571 0.8907608
## Y 0.3501571 1.0000000 0.4089184
## Z 0.8907608 0.4089184 1.0000000

summary(lm(Y ~ X + Z, data =d))

## 
## Call:
## lm(formula = Y ~ X + Z, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8360 -0.6277 -0.0370  0.6538  3.3787 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02093    0.03098  -0.676    0.499    
## X           -0.07375    0.06871  -1.073    0.283    
## Z            0.45501    0.06157   7.390  3.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9788 on 997 degrees of freedom
## Multiple R-squared:  0.1682, Adjusted R-squared:  0.1665 
## F-statistic: 100.8 on 2 and 997 DF,  p-value: < 2.2e-16

# X and Z have a strong correlation, so if we use both X and Z to predict Y,  we could observe the muticollinearity
# In the regression, we can see that X is not statistically significant whereas Z is statistically significant.

legs = quap(
  alist(
    Y ~ dnorm(mu,sigma),
    mu <- a + bX*X + bZ*Z,
    c(a,bX,bZ) ~ dnorm(0,1),
    sigma ~ dexp(1)
  ), data=list(X=X,Y=Y,Z=Z)
)

precis(legs)

##              mean         sd        5.5%      94.5%
## a     -0.02089153 0.03090542 -0.07028436 0.02850130
## bX    -0.07173065 0.06831128 -0.18090527 0.03744398
## bZ     0.45303939 0.06122069  0.35519690 0.55088188
## sigma  0.97684432 0.02182773  0.94195940 1.01172925

# In the legs example: X is not causal to Z. However X is causal to Z in this case. As Z shows conditional association and connects between X and Y, if we include Z in the model, then it is going to block the path, which makes the coefficient for X approaching to zero, and then we observed Y is independent of X or X and Y have no statistical significant relationship. So this case is different from previous example in the chapter.

All three problems below are based on the same data. The data in data(foxes) are 116 foxes from 30 different urban groups in England. These foxes are like street gangs. Group size varies from 2 to 8 individuals. Each group maintains its own urban territory. Some territories are larger than others. The area variable encodes this information. Some territories also have more avgfood than others. We want to model the weight of each fox. For the problems below, assume the following DAG:

5-3. Use a model to infer the total causal influence of area on weight. Would increasing the area available to each fox make it heavier (healthier)? You might want to standardize the variables. Regardless, use prior predictive simulation to show that your model’s prior predictions stay within the possible outcome range.

set.seed(123)

data(foxes)

# Standardize the variables
foxes$avgfood <- (foxes$avgfood-mean(foxes$avgfood))/sd(foxes$avgfood)
foxes$groupsize <- (foxes$groupsize-mean(foxes$groupsize))/sd(foxes$groupsize)
foxes$area <- (foxes$area -mean(foxes$area ))/sd(foxes$area)
foxes$weight <- (foxes$weight-mean(foxes$weight))/sd(foxes$weight)


# Predictive simulation
N <- 1000
a <- rnorm(N, 0, 0.5)
b <- rnorm(N, 0, 0.5)
plot(NULL, xlim=range(foxes$area),ylim=c(-4,4))
xbar<-mean(foxes$area)
for(i in 1:N) curve(a[i] + b[i]*(x- xbar),
                   from=min(foxes$area), to = max(foxes$area), add = TRUE,
                   col = col.alpha("black",0.5)
                   )

# Construct the model
model1 = quap(
  alist(
      weight ~ dnorm(mu, sigma),
      mu <- b0 + b1*area,
      b0 ~ dnorm(0,0.5),
      b1 ~ dnorm(0,0.5),
     sigma ~ dexp(1)
  ), data = foxes
)

precis(model1)

##               mean         sd       5.5%     94.5%
## b0    2.660928e-08 0.09051605 -0.1446621 0.1446622
## b1    1.883367e-02 0.09089583 -0.1264354 0.1641028
## sigma 9.912662e-01 0.06466649  0.8879166 1.0946157

# As the interval of coefficient for area includes zero, there is no statistically significant relationship between area and weight, thus we conclude that there is no causal impact of the area on the weight of foxes.

5-4. Now infer the causal impact of adding food to a territory. Would this make foxes heavier? Which covariates do you need to adjust for to estimate the total causal influence of food?

# Since we would like to see the impact of food on weight, we cannot use groupsize as a variable, we can only use avgfood in this case.

model2 = quap(
  alist(
      weight ~ dnorm(mu, sigma),
      mu <- b0 + bavgfood*avgfood,
      b0 ~ dnorm(0,0.5),
      bavgfood ~ dnorm(0,0.5),
     sigma ~ dexp(1)
  ), data = foxes
)

precis(model2)

##                   mean         sd       5.5%     94.5%
## b0       -1.456813e-07 0.09050381 -0.1446427 0.1446424
## bavgfood -2.421198e-02 0.09088357 -0.1694615 0.1210375
## sigma     9.911276e-01 0.06465592  0.8877950 1.0944603

# As the interval of coefficient for avgfood includes zero, there is no statistically significant relationship between avgfood and weight, thus we conclude that there is no causal impact of the avgfood on the weight of foxes.

5-5. Now infer the causal impact of group size. Which covariates do you need to adjust for? Looking at the posterior distribution of the resulting model, what do you think explains these data? That is, can you explain the estimates for all three problems? How do they go together?

# To infer the causal impact of group size, we need avg food to adjust for, and we need to include both avg food and group size.

model3 = quap(
  alist(
      weight ~ dnorm(mu, sigma),
      mu <- a + bavgfood*avgfood + bgroupsize*groupsize,
      a ~ dnorm(0,0.5),
      bavgfood ~ dnorm(0,0.5),
      bgroupsize ~ dnorm(0,0.5),
     sigma ~ dexp(1)
  ), data = foxes
)

precis(model3)

##                     mean         sd       5.5%      94.5%
## a           4.502590e-07 0.08615812 -0.1376969  0.1376978
## bavgfood    4.772530e-01 0.17912305  0.1909798  0.7635262
## bgroupsize -5.735259e-01 0.17914154 -0.8598287 -0.2872232
## sigma       9.420427e-01 0.06175236  0.8433505  1.0407349

cor(foxes$avgfood, foxes$groupsize)

## [1] 0.9014829

# The slope of avg food is significant and positive whereas the slope of group size is significant and negative.
# Since avg food and group size have a strong positive relationship this explains why the model has no significant causal effect between weight and avg food only as shown in model 2 above.

Week 5 - Elemental Confounds

Questions