title: ‘Assignment #5’ author: “Cem Soylu” date: “2022-07-25” output: html_document
Multiple regression is no oracle, but only a golem. It is logical, but the relationships it describes are conditional associations, not causal influences. Therefore additional information, from outside the model, is needed to make sense of it. This chapter presented introductory examples of some common frustrations: multicollinearity, post-treatment bias, and collider bias. Solutions to these frustrations can be organized under a coherent framework in which hypothetical causal relations among variables are analyzed to cope with confounding. In all cases, causal models exist outside the statistical model and can be difficult to test. However, it is possible to reach valid causal inferences in the absence of experiments. This is good news, because we often cannot perform experiments, both for practical and ethical reasons.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
5-1. In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable. Make sure to include the ‘quap’ model, the ‘precis’ table and your explanation of the results.
library(rethinking)
data("WaffleDivorce")
head(WaffleDivorce)
## Location Loc Population MedianAgeMarriage Marriage Marriage.SE Divorce
## 1 Alabama AL 4.78 25.3 20.2 1.27 12.7
## 2 Alaska AK 0.71 25.2 26.0 2.93 12.5
## 3 Arizona AZ 6.33 25.8 20.3 0.98 10.8
## 4 Arkansas AR 2.92 24.3 26.4 1.70 13.5
## 5 California CA 37.25 26.8 19.1 0.39 8.0
## 6 Colorado CO 5.03 25.7 23.5 1.24 11.6
## Divorce.SE WaffleHouses South Slaves1860 Population1860 PropSlaves1860
## 1 0.79 128 1 435080 964201 0.45
## 2 2.05 0 0 0 0 0.00
## 3 0.74 18 0 0 0 0.00
## 4 1.22 41 1 111115 435450 0.26
## 5 0.24 0 0 0 379994 0.00
## 6 0.94 11 0 0 34277 0.00
# The percent LDS population by State is collected from Wikipedia
# Merge it to the data WaffleDivorce
WaffleDivorce$LDS <- c(0.0077, 0.0453, 0.0610, 0.0104, 0.0194, 0.0270, 0.0044, 0.0057, 0.0041, 0.0075, 0.0082, 0.0520, 0.2623, 0.0045, 0.0067, 0.0090, 0.0130, 0.0079, 0.0064, 0.0082, 0.0072, 0.0040, 0.0045, 0.0059, 0.0073, 0.0116, 0.0480, 0.0130, 0.0065, 0.0037, 0.0333, 0.0041, 0.0084, 0.0149, 0.0053, 0.0122, 0.0372, 0.0040, 0.0039, 0.0081, 0.0122, 0.0076, 0.0125, 0.6739, 0.0074, 0.0113, 0.0390, 0.0093, 0.0046, 0.1161)
hist(WaffleDivorce$LDS)
#Since the distribution is positive skew with long right tail, log transformation is thus used before standardization process.
WaffleDivorce$log_LDS <- log(WaffleDivorce$LDS)
WaffleDivorce$log_LDS.s <- (WaffleDivorce$log_LDS - mean(WaffleDivorce$log_LDS)) / sd(WaffleDivorce$log_LDS)
hist(WaffleDivorce$log_LDS.s)
# Build the multiple regression model
m <- alist(
Divorce ~ dnorm(mu, sigma),
mu <- b0 + b1*Marriage + b2*MedianAgeMarriage + b3*log_LDS.s,
b0 ~ dnorm(10, 20),
b1 ~ dnorm(0, 10),
b2 ~ dnorm(0, 10),
b3 ~ dnorm(0, 10),
sigma ~ dunif(0, 5)
)
fit <- quap(
m ,
data = WaffleDivorce
)
precis(fit)
## mean sd 5.5% 94.5%
## b0 35.44057027 6.77524762 24.6124160 46.2687245
## b1 0.05337704 0.08261801 -0.0786625 0.1854166
## b2 -1.02973113 0.22469297 -1.3888339 -0.6706284
## b3 -0.60804432 0.29057026 -1.0724317 -0.1436569
## sigma 1.37872431 0.13838859 1.1575526 1.5998960
#In this 'quap' multivariate regression model, the slope of marriage has an interval including zero, thus it is not statistically significant and we fail to reject that the slope of marriage could be zero. However, the slopes of both median age at marriage and percentage of LDS population with standardization were negative and their intervals did not include zero. Therefore, these slopes are statistically significant. The results suggest that states with older median age at marriage or higher percentages of LDS population had lower divorce rates.
5-2. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model prediction Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?
quest <- dagitty( "dag {X -> Z
Z -> Y
}")
drawdag(quest, lwd = 2)
# Simulate data from this DAG so that the correlation between X and Z is very large
set.seed(123)
N = 1000
X = rnorm(N, mean = 0, sd = 1)
Z = 0.95*X + rnorm(N, mean = 0, sd = 0.5)
Y = 0.4*Z + rnorm(N, mean = 0, sd = 1)
cor(X,Z)
## [1] 0.8907608
# Correlation between X and Z is very large, which is 0.89
d <- data.frame(X, Y, Z)
cor(d)
## X Y Z
## X 1.0000000 0.3501571 0.8907608
## Y 0.3501571 1.0000000 0.4089184
## Z 0.8907608 0.4089184 1.0000000
summary(lm(Y ~ X + Z, data =d))
##
## Call:
## lm(formula = Y ~ X + Z, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8360 -0.6277 -0.0370 0.6538 3.3787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.02093 0.03098 -0.676 0.499
## X -0.07375 0.06871 -1.073 0.283
## Z 0.45501 0.06157 7.390 3.1e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9788 on 997 degrees of freedom
## Multiple R-squared: 0.1682, Adjusted R-squared: 0.1665
## F-statistic: 100.8 on 2 and 997 DF, p-value: < 2.2e-16
# X and Z have a strong correlation, so if we use both X and Z to predict Y, we could observe the muticollinearity
# In the regression, we can see that X is not statistically significant whereas Z is statistically significant.
legs = quap(
alist(
Y ~ dnorm(mu,sigma),
mu <- a + bX*X + bZ*Z,
c(a,bX,bZ) ~ dnorm(0,1),
sigma ~ dexp(1)
), data=list(X=X,Y=Y,Z=Z)
)
precis(legs)
## mean sd 5.5% 94.5%
## a -0.02089153 0.03090542 -0.07028436 0.02850130
## bX -0.07173065 0.06831128 -0.18090527 0.03744398
## bZ 0.45303939 0.06122069 0.35519690 0.55088188
## sigma 0.97684432 0.02182773 0.94195940 1.01172925
# In the legs example: X is not causal to Z. However X is causal to Z in this case. As Z shows conditional association and connects between X and Y, if we include Z in the model, then it is going to block the path, which makes the coefficient for X approaching to zero, and then we observed Y is independent of X or X and Y have no statistical significant relationship. So this case is different from previous example in the chapter.
All three problems below are based on the same data. The data in data(foxes) are 116 foxes from 30 different urban groups in England. These foxes are like street gangs. Group size varies from 2 to 8 individuals. Each group maintains its own urban territory. Some territories are larger than others. The area variable encodes this information. Some territories also have more avgfood than others. We want to model the weight of each fox. For the problems below, assume the following DAG:
5-3. Use a model to infer the total causal influence of area on weight. Would increasing the area available to each fox make it heavier (healthier)? You might want to standardize the variables. Regardless, use prior predictive simulation to show that your model’s prior predictions stay within the possible outcome range.
set.seed(123)
data(foxes)
# Standardize the variables
foxes$avgfood <- (foxes$avgfood-mean(foxes$avgfood))/sd(foxes$avgfood)
foxes$groupsize <- (foxes$groupsize-mean(foxes$groupsize))/sd(foxes$groupsize)
foxes$area <- (foxes$area -mean(foxes$area ))/sd(foxes$area)
foxes$weight <- (foxes$weight-mean(foxes$weight))/sd(foxes$weight)
# Predictive simulation
N <- 1000
a <- rnorm(N, 0, 0.5)
b <- rnorm(N, 0, 0.5)
plot(NULL, xlim=range(foxes$area),ylim=c(-4,4))
xbar<-mean(foxes$area)
for(i in 1:N) curve(a[i] + b[i]*(x- xbar),
from=min(foxes$area), to = max(foxes$area), add = TRUE,
col = col.alpha("black",0.5)
)
# Construct the model
model1 = quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- b0 + b1*area,
b0 ~ dnorm(0,0.5),
b1 ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data = foxes
)
precis(model1)
## mean sd 5.5% 94.5%
## b0 2.660928e-08 0.09051605 -0.1446621 0.1446622
## b1 1.883367e-02 0.09089583 -0.1264354 0.1641028
## sigma 9.912662e-01 0.06466649 0.8879166 1.0946157
# As the interval of coefficient for area includes zero, there is no statistically significant relationship between area and weight, thus we conclude that there is no causal impact of the area on the weight of foxes.
5-4. Now infer the causal impact of adding food to a territory. Would this make foxes heavier? Which covariates do you need to adjust for to estimate the total causal influence of food?
# Since we would like to see the impact of food on weight, we cannot use groupsize as a variable, we can only use avgfood in this case.
model2 = quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- b0 + bavgfood*avgfood,
b0 ~ dnorm(0,0.5),
bavgfood ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data = foxes
)
precis(model2)
## mean sd 5.5% 94.5%
## b0 -1.456813e-07 0.09050381 -0.1446427 0.1446424
## bavgfood -2.421198e-02 0.09088357 -0.1694615 0.1210375
## sigma 9.911276e-01 0.06465592 0.8877950 1.0944603
# As the interval of coefficient for avgfood includes zero, there is no statistically significant relationship between avgfood and weight, thus we conclude that there is no causal impact of the avgfood on the weight of foxes.
5-5. Now infer the causal impact of group size. Which covariates do you need to adjust for? Looking at the posterior distribution of the resulting model, what do you think explains these data? That is, can you explain the estimates for all three problems? How do they go together?
# To infer the causal impact of group size, we need avg food to adjust for, and we need to include both avg food and group size.
model3 = quap(
alist(
weight ~ dnorm(mu, sigma),
mu <- a + bavgfood*avgfood + bgroupsize*groupsize,
a ~ dnorm(0,0.5),
bavgfood ~ dnorm(0,0.5),
bgroupsize ~ dnorm(0,0.5),
sigma ~ dexp(1)
), data = foxes
)
precis(model3)
## mean sd 5.5% 94.5%
## a 4.502590e-07 0.08615812 -0.1376969 0.1376978
## bavgfood 4.772530e-01 0.17912305 0.1909798 0.7635262
## bgroupsize -5.735259e-01 0.17914154 -0.8598287 -0.2872232
## sigma 9.420427e-01 0.06175236 0.8433505 1.0407349
cor(foxes$avgfood, foxes$groupsize)
## [1] 0.9014829
# The slope of avg food is significant and positive whereas the slope of group size is significant and negative.
# Since avg food and group size have a strong positive relationship this explains why the model has no significant causal effect between weight and avg food only as shown in model 2 above.