Multiple regression is no oracle, but only a golem. It is logical, but the relationships it describes are conditional associations, not causal influences. Therefore additional information, from outside the model, is needed to make sense of it. This chapter presented introductory examples of some common frustrations: multi-collinearity, post-treatment bias, and collider bias. Solutions to these frustrations can be organized under a coherent framework in which hypothetical causal relations among variables are analyzed to cope with confounding. In all cases, causal models exist outside the statistical model and can be difficult to test. However, it is possible to reach valid causal inferences in the absence of experiments. This is good news, because we often cannot perform experiments, both for practical and ethical reasons.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
6-1. Use the Waffle House data, data(WaffleDivorce), to find the total causal influence of number of Waffle Houses on divorce rate. Justify your model or models with a causal graph (draw a DAG).
## Leveraging the Waffle Divorce Data set.
data("WaffleDivorce")
str(WaffleDivorce)
## 'data.frame': 50 obs. of 13 variables:
## $ Location : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Loc : Factor w/ 50 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 9 8 10 ...
## $ Population : num 4.78 0.71 6.33 2.92 37.25 ...
## $ MedianAgeMarriage: num 25.3 25.2 25.8 24.3 26.8 25.7 27.6 26.6 29.7 26.4 ...
## $ Marriage : num 20.2 26 20.3 26.4 19.1 23.5 17.1 23.1 17.7 17 ...
## $ Marriage.SE : num 1.27 2.93 0.98 1.7 0.39 1.24 1.06 2.89 2.53 0.58 ...
## $ Divorce : num 12.7 12.5 10.8 13.5 8 11.6 6.7 8.9 6.3 8.5 ...
## $ Divorce.SE : num 0.79 2.05 0.74 1.22 0.24 0.94 0.77 1.39 1.89 0.32 ...
## $ WaffleHouses : int 128 0 18 41 0 11 0 3 0 133 ...
## $ South : int 1 0 0 1 0 0 0 0 0 1 ...
## $ Slaves1860 : int 435080 0 0 111115 0 0 0 1798 0 61745 ...
## $ Population1860 : int 964201 0 0 435450 379994 34277 460147 112216 75080 140424 ...
## $ PropSlaves1860 : num 0.45 0 0 0.26 0 0 0 0.016 0 0.44 ...
# Standardization of the "Waffle Divorce" data for the determination of the total causal influence of number of Waffle Houses on divorce rate.
Data_Waffle <- WaffleDivorce %>%
as_tibble() %>%
select(divorce_rate = Divorce, married_rate = Marriage, married_age = MedianAgeMarriage,
waffle_houses = WaffleHouses, south = South) %>%
mutate(across(-south, standardize),
south = south + 1)
# Under the consideration and assumption that the median age of the individual's marriage or "married_age" has an influence on the individual's marriage rate as well as divorce rate.
## Determination of DAG modelfor the determination of the total causal influence of number of Waffle Houses on divorce rate.
tribble(
~ name, ~ x, ~y,
"south", 0, 2,
"married_rate", 2, 1,
"married_age", 1, 2,
"waffle_houses", 0, 0,
"divorce_rate", 1, 0,
) %>%
dagify(
divorce_rate ~ married_age + waffle_houses,
waffle_houses ~ south,
married_rate ~ married_age,
married_age ~ south,
coords = .) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_node(colour = "blue", shape = 1, size = 24) +
geom_dag_text(aes(label = name), color = "black", size = 2.5) +
geom_dag_edges(edge_color = "black") +
labs(caption = "Total causal influence of number of Waffle Houses on Divorce rate") +
theme_void()
## Causal graph Model Inference:
# One of the backdoor path has been observed between Waffle_houses & divorce rate i.e. Waffle_houses <- south -> married_age -> divorce_rate, and the total causal influence of number of waffle houses would be the direct influence and backdoor path, however the backdoor path influence can be closed by the process of conditioning on south.
6-2. Build a series of models to test the implied conditional independencies of the causal graph you used in the previous problem. If any of the tests fail, how do you think the graph needs to be amended? Does the graph need more or fewer arrows? Feel free to nominate variables that aren’t in the data.
## Determination of some of the models for the Test validation of the multiple implied conditional independencies of the prior causal graph.
dagitty("dag{
waffle_houses -> divorce_rate <- married_age <- south -> waffle_houses
<- south -> married_age -> married_rate}") %>%
impliedConditionalIndependencies()
## dvr_ _||_ mrrd_r | mrrd_g
## dvr_ _||_ soth | mrr_, wff_
## mrr_ _||_ wff_ | soth
## mrrd_r _||_ soth | mrrd_g
## mrr_ _||_ wff_ | soth
## mrrd_r _||_ wff_ | mrrd_g
## (Model:a): divorce_rate _||_ married_rate | married_age
## (Model :b): divorce_rate _||_ south | married_age, waffle_houses
## (Model :c): married_rate _||_ waffle_houses | south
## (Model :d): married_rate _||_ south | married_age
## (Model :e): married_rate _||_ waffle_houses | married_age
## Determination of the indepence test validation function for the multiple implied conditional independencies validations.
Independence_test_validation <- function(model.input, coeff.input, depth.output = 1){
suppressWarnings(
precis(model.input, depth = depth.output) %>%
as_tibble(rownames = "estimate") %>%
filter(estimate %in% {{coeff.input}}) %>%
mutate(across(where(is.numeric), round, digits = 2)) %>%
knitr::kable()
)
}
## Model a condition validation: divorce_rate _||_ married_rate | married_age.
alist(
married_rate ~ dnorm(mu, sigma),
mu <- a + Bm_age*married_age + Bdivorce_rate*divorce_rate,
a ~ dnorm(0, 0.5),
c(Bm_age, Bdivorce_rate) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = Data_Waffle) %>%
Independence_test_validation(coeff.input = "Bdivorce_rate")
| estimate | mean | sd | 5.5% | 94.5% |
|---|---|---|---|---|
| Bdivorce_rate | -0.06 | 0.12 | -0.25 | 0.13 |
# Inference: Since as per the output, the posterior "Bdivorce_rate" didn't indicate any consistent association with married rate, hence Divorce rate could be considered as independent of the married rate after the conditioning on married age.
## (Model :b): divorce_rate _||_ south | married_rate, waffle_houses
alist(
divorce_rate ~ dnorm(mu, sigma),
mu <- a[south] + Bm_rate*married_rate + Bwaffle_h*waffle_houses,
a[south] ~ dnorm(0, 0.5),
c(Bm_rate, Bwaffle_h) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = Data_Waffle) %>%
Independence_test_validation(depth.output = 2, coeff.input = c("a[1]", "a[2]"))
| estimate | mean | sd | 5.5% | 94.5% |
|---|---|---|---|---|
| a[1] | -0.15 | 0.15 | -0.39 | 0.09 |
| a[2] | 0.34 | 0.25 | -0.06 | 0.74 |
# Inference: South can be considered independent from the divorce rate, since as per the output, the posterior didn't indicate any consistent association with divorced rate, and after conditioning on married rate and waffle houses.
## (Model :c): married_rate _||_ south | married_age
alist(
married_rate ~ dnorm(mu, sigma),
mu <- a[south] + Bm_age*married_age,
a[south] ~ dnorm(0, 0.5),
Bm_age ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = Data_Waffle) %>%
Independence_test_validation(depth.output = 2, coeff.input = c("a[1]", "a[2]"))
| estimate | mean | sd | 5.5% | 94.5% |
|---|---|---|---|---|
| a[1] | 0.06 | 0.11 | -0.12 | 0.23 |
| a[2] | -0.13 | 0.17 | -0.41 | 0.14 |
# Inference: South can be considered independent from the married rate, since as per the output, the posterior did not indicate association with the married rate, and after conditioning on marriage age.
## (Model :d): married_rate _||_ waffle_houses | south
alist(
waffle_houses ~ dnorm(mu, sigma),
mu <- a[south] + Bm_rate*married_rate,
a[south] ~ dnorm(0, 0.5),
Bm_rate ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = Data_Waffle) %>%
Independence_test_validation(coeff.input = "Bm_rate")
| estimate | mean | sd | 5.5% | 94.5% |
|---|---|---|---|---|
| Bm_rate | -0.02 | 0.1 | -0.18 | 0.14 |
#Inference: Married rate can be considered independent from the waffle houses, since as per the output, the posterior "Bm_rate" didn't indicate any consistent association with waffle houses, and after conditioning on south.
## (Model :e): married_rate _||_ waffle_houses | married_age
alist(
waffle_houses ~ dnorm(mu, sigma),
mu <- a + Bm_rate*married_rate + Bm_age*married_age,
a ~ dnorm(0, 0.5),
c(Bm_rate, Bm_age) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = Data_Waffle) %>%
Independence_test_validation(coeff.input = "Bm_rate")
| estimate | mean | sd | 5.5% | 94.5% |
|---|---|---|---|---|
| Bm_rate | -0.09 | 0.18 | -0.38 | 0.2 |
#Inference: Married rate is found to be independent from the waffle houses, since as per the output, the posterior "Bm_rate" didn't indicate any consistent association with waffle houses, and after conditioning on the married age.
6-3. For the DAG you made in the previous problem, can you write a data generating simulation for it? Can you design one or more statistical models to produce causal estimates? If so, try to calculate interesting counterfactuals. If not, use the simulation to estimate the size of the bias you might expect. Under what conditions would you, for example, infer the opposite of a true causal effect?
## Determination of Data generating simulation for the level of Professors acquired with tacit and explicit professional development qualifications and it's level or extent of intervention on the Engineering students performance:
# Here in our case Scenario,
# Ex_Knldge <- Existing of Knowledge of Students
# Dvlpmnt_Qualf <- Development Qualifications Intervention of Professors.
# Eng_Prfmnce <- Performance of Engineering Students.
# Prof_Instrct <- Instructions by Professors.
set.seed(1234)
Students <- 1000
Intelligence_data <-
tibble(
Ex_Knldge = rnorm(Students, mean = 0, sd =2),
Dvlpmnt_Qualf = sample(0L:1L, Students, replace = TRUE),
Prof_Instrct = rnorm(Students, mean = 1 + 3 * Dvlpmnt_Qualf),
Eng_Prfmnce = Ex_Knldge + rnorm(Students, 0.8 * Prof_Instrct)) %>%
mutate(across(where(is.double), standardize))
str(Intelligence_data)
## tibble [1,000 × 4] (S3: tbl_df/tbl/data.frame)
## $ Ex_Knldge : num [1:1000] -1.184 0.305 1.114 -2.325 0.457 ...
## ..- attr(*, "scaled:center")= num -0.0532
## ..- attr(*, "scaled:scale")= num 1.99
## $ Dvlpmnt_Qualf: int [1:1000] 1 0 0 1 1 0 0 0 1 1 ...
## $ Prof_Instrct : num [1:1000] 1.808 -1.566 -1.206 0.52 0.655 ...
## ..- attr(*, "scaled:center")= num 2.49
## ..- attr(*, "scaled:scale")= num 1.82
## $ Eng_Prfmnce : num [1:1000] -0.424 -0.444 -0.338 -1.594 -0.1 ...
## ..- attr(*, "scaled:center")= num 1.97
## ..- attr(*, "scaled:scale")= num 2.63
## Determination of the model for extracting the causal effect of professional development activities on the performance of the students.
Intelligence_Model <-
brm(
Eng_Prfmnce ~ 1 + Dvlpmnt_Qualf, data = Intelligence_data, family = gaussian,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 3000, warmup = 1000, chains = 4, cores = 4, seed = 123,
file = here("Fitted", "Causal"))
as_draws_df(Intelligence_Model) %>%
as_tibble() %>%
select(b_Intercept, b_Dvlpmnt_Qualf, sigma) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name)) +
stat_halfeye(.width = c(0.70, 0.80, 0.90)) +
labs(x = "Parameters Values", y = "Different Parameters")
# Result Inference: It is clearly evident from our model's simulation data plot that there seems to be a major magnitude of causal influence of the Engineering study teaching professor's professional development on the performances of the Engineering students, and with around 0.9 standard deviation of increase in the Engineering student performances, against or in comparison to the Engineering student performances whose Professor's didn't have professional development related qualification.
All three problems below are based on the same data. The data in data(foxes) are 116 foxes from 30 different urban groups in England. These foxes are like street gangs. Group size varies from 2 to 8 individuals. Each group maintains its own urban territory. Some territories are larger than others. The area variable encodes this information. Some territories also have more avgfood than others. We want to model the weight of each fox. For the problems below, assume the following DAG:
6-4. Use a model to infer the total causal influence of area on weight. Would increasing the area available to each fox make it heavier (healthier)? You might want to standardize the variables. Regardless, use prior predictive simulation to show that your model’s prior predictions stay within the possible outcome range.
# Determination of the model to infer the total causal influence of area on weight.
## As per the diagram above, there seems to be two possible below scenario's or "paths" connecting "area" with weight as indicated below:
# Path_1: area -> avgfood -> groupsize -> weight
# Path_2 : area -> avgfood -> weight
# ## Leveraging the foxes Data set for the determination of the model to infer the total causal influence of area on weight.
data("foxes")
str(foxes)
## 'data.frame': 116 obs. of 5 variables:
## $ group : int 1 1 2 2 3 3 4 4 5 5 ...
## $ avgfood : num 0.37 0.37 0.53 0.53 0.49 0.49 0.45 0.45 0.74 0.74 ...
## $ groupsize: int 2 2 2 2 2 2 2 2 3 3 ...
## $ area : num 1.09 1.09 2.05 2.05 2.12 2.12 1.29 1.29 3.78 3.78 ...
## $ weight : num 5.02 2.84 5.33 6.07 5.85 3.25 4.53 4.09 6.13 5.59 ...
summary(foxes)
## group avgfood groupsize area
## Min. : 1.00 Min. :0.3700 Min. :2.000 Min. :1.090
## 1st Qu.:11.75 1st Qu.:0.6600 1st Qu.:3.000 1st Qu.:2.590
## Median :18.00 Median :0.7350 Median :4.000 Median :3.130
## Mean :17.21 Mean :0.7517 Mean :4.345 Mean :3.169
## 3rd Qu.:24.00 3rd Qu.:0.8000 3rd Qu.:5.000 3rd Qu.:3.772
## Max. :30.00 Max. :1.2100 Max. :8.000 Max. :5.070
## weight
## Min. :1.920
## 1st Qu.:3.720
## Median :4.420
## Mean :4.530
## 3rd Qu.:5.375
## Max. :7.550
## Standardization of all variables, and excluding group since that is a dummy variable.
Data_foxes <- foxes %>%
as_tibble() %>%
mutate(across(-group, standardize))
# Model # 1 depiction:
Model_foxes1 <- alist(weight ~ dnorm(mu, sigma),
mu <- a + Barea*area,
a ~ dnorm(0, 0.2),
Barea ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = Data_foxes)
# Model validation to check if our priors are falling outside a realistic range and by the utilization of the predictive simulation approach.
Model_foxes1 %>%
extract.prior() %>%
link(Model_foxes1, post = .,
data = list(area = seq(-2, 2, length.out = 20))) %>%
as_tibble() %>%
pivot_longer(cols = everything(), values_to = "weight") %>%
add_column(
area = rep(seq(-2, 2, length = 20), 1000),
type = rep(as.character(1:1000), each = 20)) %>%
ggplot() +
geom_line(aes(x = area, y = weight, group = type),
alpha = 0.1, colour = "black") +
labs(x = "Standard Area", y = "Standard Weight",
caption = "Prediction of prior simulation for standardized area on weight") +
theme_minimal()
## Findings: Some lines of the Simulation prior Model plot were observed to be out of sd (-2, 2) range.
## Assessment of influence of Area on Weight:
Model_foxes1 %>%
Independence_test_validation(coeff.input = "Barea")
| estimate | mean | sd | 5.5% | 94.5% |
|---|---|---|---|---|
| Barea | 0.02 | 0.09 | -0.13 | 0.16 |
## There appears to be no significant association between the foxes territory area and weight.
6-5. Now infer the causal impact of adding food to a territory. Would this make foxes heavier? Which covariates do you need to adjust for to estimate the total causal influence of food?
# Determination of causal impact of incorporating average food to a territory to our equation.
# Two of open Paths linking to Average food to foxes Weight.
# Path_A. avgfood -> weight
# Path_B. avgfood <- groupsize -> weight
# Determination of Causal impact of incorporating 'average food' to the territory calculation, for the above open path A, and further not considering or accounting for the path B, or to include or conditioning on 'group size', since that would close our path A:
Path_A <-
alist(
weight ~ dnorm(mu, sigma),
mu <- a + Bavgfood*avgfood,
a ~ dnorm(0, 0.2),
Bavgfood ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = Data_foxes)
precis(Path_A)
## mean sd 5.5% 94.5%
## a -2.106047e-06 0.08360014 -0.1336113 0.1336071
## Bavgfood -2.421541e-02 0.09088498 -0.1694672 0.1210363
## sigma 9.911435e-01 0.06465851 0.8878067 1.0944803
# Considering the fact that the 'avgfood' variable is only causally associated with the foxes 'weight' variable, and as per the estimate encountered output of this independent test validation for assessing causal impact of incorporating average food to a territory, there appears to be no significant associationship or relation between these parameters of 'avgfood' and 'weight'.
6-6. Now infer the causal impact of group size. Which covariates do you need to adjust for? Looking at the posterior distribution of the resulting model, what do you think explains these data? That is, can you explain the estimates for all three problems? How do they go together?
# Determination of causal impact of incorporating group size to a territory to our equation.
# Two of open Paths linking to Group size to foxes Weight.
# Path_A1. groupsize -> weight
# Path_B1. groupsize <- avgfood -> weight
# Determination of Causal impact of incorporating 'group size' to the territory calculation, for the above open path A, and further not considering or accounting for the path B, since that is a backdoor path.
Path_A1 <-
alist(
weight ~ dnorm(mu, sigma),
mu <- a + Bavgfood*avgfood + Bgrpsize*groupsize,
a ~ dnorm(0, 0.2),
c(Bavgfood, Bgrpsize) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = Data_foxes)
precis(Path_A1)
## mean sd 5.5% 94.5%
## a 8.134211e-08 0.08013815 -0.1280762 0.1280763
## Bavgfood 4.772542e-01 0.17912338 0.1909804 0.7635279
## Bgrpsize -5.735261e-01 0.17914188 -0.8598294 -0.2872228
## sigma 9.420450e-01 0.06175274 0.8433522 1.0407378
## Estimate Inferences:
# As per the obtained estimate output findings of our scenario test validation, it's certain that there is a negative relationship exists between the 'Group size' with the foxes' "Weight", however, there also seems to be a positive association or relationship between the "Average food" and the foxes "Weight", but further in contrast of our prior finding of estimate where no significant associations or relationship was observed between the "Average food" and foxes "Weight", this outcome or estimate could be assumed to be a result of some masked relationship.
## Estimate Inferences for above different paths scenario's models:
# The obtained estimate outcome can be realistic in the scenario, linking that the availability of more food also leads to an opportunity for more magnitude of foxes to be attracted and drawn towards the areas available with more average food, hence in the process, more number of foxes leading to more distribution of average food, and in a way also impacting the weight of the foxes, or eliminating the causal impact of average food on the weight of the foxes, and further in the process having a negative associations or relationship group size of the foxes and the weight of the foxes.
6-7. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model prediction Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?
# Considering the DAG X -> Z -> Y, and under the assumption of 'Z' to be dependent to a high proportion on 'X', and similarly 'Y' to be also dependent to a high proportion on 'Z', correlation determination between X and Z.
set.seed(1234)
N = 500
X <- rnorm(N, mean = 0, sd = 1)
Z <- rnorm(N, mean = X, sd = 0.5)
Y <- rnorm(N, mean = Z, sd = 1)
Data_X_Z_Y <- data.frame(X, Y, Z)
cor(X,Z)
## [1] 0.913709
# Correlation Finding Inference:
# Correlation between the 'X" and 'Z' variable was observed to be positive direction correlation and very high value of 0.90,hence supporting the argument of existence of multicollinearity for this case.
# Simulation of the DAG X -> Z -> Y
Model_X_Z_Y <-
quap(alist(
Y ~ dnorm(mu, sigma),
mu <- a + B1*X + B2*Z,
a ~ dnorm(1, 50),
B1 ~ dnorm(2, 10),
B2 ~ dnorm(2, 10),
sigma ~ dexp(1)
), data = Data_X_Z_Y)
precis(Model_X_Z_Y)
## mean sd 5.5% 94.5%
## a -40.557931 44.29355 -111.34758 30.23172
## B1 2.269435 9.94262 -13.62079 18.15966
## B2 1.148759 9.92591 -14.71476 17.01228
## sigma 2135.719483 NaN NaN NaN
# Determination of legs:
Lgs <- quap(
alist(
Y ~ dnorm(mu, sigma),
mu <- a + bX*X + bZ*Z,
c(a,bX,bZ) ~ dnorm(0,1),
sigma ~ dexp(1)
), data = list(X=X, Y=Y, Z=Z)
)
precis(Lgs)
## mean sd 5.5% 94.5%
## a 0.02953931 0.04171535 -0.03712987 0.09620849
## bX 0.14690704 0.09842898 -0.01040149 0.30421557
## bZ 0.92899387 0.08666489 0.79048664 1.06750111
## sigma 0.93206320 0.02942591 0.88503492 0.97909149
## Determination of plot for Legs.
plot(precis(Lgs))
post <- extract.samples(Lgs)
plot( Z ~ X, post, col = "red", pch=16)
## One of the element different from the legs example pertinent to the chapter is the depiction of the causal relationship of 'X' variable to the 'Z' variable of the simulation model and considering for this DAG X -> Z -> Y and both Z and X estimates being relatively with narrower posterior and which was observed to be distinct from chapter Leg example, and further in the chapter both Legs to a major proportion predicted the height, however in this example 'Z' is considered as a major predictor of the output.