Chapter 6 - The Haunted DAG & the Causal Terror

Multiple regression is no oracle, but only a golem. It is logical, but the relationships it describes are conditional associations, not causal influences. Therefore additional information, from outside the model, is needed to make sense of it. This chapter presented introductory examples of some common frustrations: multi-collinearity, post-treatment bias, and collider bias. Solutions to these frustrations can be organized under a coherent framework in which hypothetical causal relations among variables are analyzed to cope with confounding. In all cases, causal models exist outside the statistical model and can be difficult to test. However, it is possible to reach valid causal inferences in the absence of experiments. This is good news, because we often cannot perform experiments, both for practical and ethical reasons.

Place each answer inside the code chunk (grey box). The code chunks should contain a text response or a code that completes/answers the question or activity requested. Make sure to include plots if the question requests them.

Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.

Questions

6-1. Use the Waffle House data, data(WaffleDivorce), to find the total causal influence of number of Waffle Houses on divorce rate. Justify your model or models with a causal graph (draw a DAG).

## Leveraging the Waffle Divorce Data set. 

data("WaffleDivorce")

str(WaffleDivorce)
## 'data.frame':    50 obs. of  13 variables:
##  $ Location         : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Loc              : Factor w/ 50 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 9 8 10 ...
##  $ Population       : num  4.78 0.71 6.33 2.92 37.25 ...
##  $ MedianAgeMarriage: num  25.3 25.2 25.8 24.3 26.8 25.7 27.6 26.6 29.7 26.4 ...
##  $ Marriage         : num  20.2 26 20.3 26.4 19.1 23.5 17.1 23.1 17.7 17 ...
##  $ Marriage.SE      : num  1.27 2.93 0.98 1.7 0.39 1.24 1.06 2.89 2.53 0.58 ...
##  $ Divorce          : num  12.7 12.5 10.8 13.5 8 11.6 6.7 8.9 6.3 8.5 ...
##  $ Divorce.SE       : num  0.79 2.05 0.74 1.22 0.24 0.94 0.77 1.39 1.89 0.32 ...
##  $ WaffleHouses     : int  128 0 18 41 0 11 0 3 0 133 ...
##  $ South            : int  1 0 0 1 0 0 0 0 0 1 ...
##  $ Slaves1860       : int  435080 0 0 111115 0 0 0 1798 0 61745 ...
##  $ Population1860   : int  964201 0 0 435450 379994 34277 460147 112216 75080 140424 ...
##  $ PropSlaves1860   : num  0.45 0 0 0.26 0 0 0 0.016 0 0.44 ...
# Standardization of the "Waffle Divorce" data for the determination of the total causal influence of number of Waffle Houses on divorce rate.

Data_Waffle <- WaffleDivorce %>%
  as_tibble() %>%
  select(divorce_rate = Divorce, married_rate = Marriage, married_age = MedianAgeMarriage,
         waffle_houses = WaffleHouses, south = South)  %>%
  mutate(across(-south, standardize), 
         south = south + 1)



# Under the consideration and assumption that the median age of the individual's marriage or "married_age" has an influence on the individual's marriage rate as well as divorce rate.


## Determination of DAG modelfor the determination of the total causal influence of number of Waffle Houses on divorce rate.

tribble(
  ~ name,         ~ x,         ~y,
  "south",          0,          2,
  "married_rate",   2,          1,
  "married_age",    1,          2,
  "waffle_houses",  0,          0,
  "divorce_rate",   1,          0,  
) %>%
  dagify(
    divorce_rate ~ married_age + waffle_houses,
    waffle_houses ~ south,
    married_rate ~ married_age,
    married_age ~ south,
    coords = .) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_node(colour = "blue", shape = 1, size = 24) +
  geom_dag_text(aes(label = name), color = "black", size = 2.5) +
  geom_dag_edges(edge_color = "black") +
  labs(caption = "Total causal influence of number of Waffle Houses on Divorce rate") +
  theme_void()

  ## Causal graph Model Inference: 
  #  One of the backdoor path has been observed between Waffle_houses & divorce rate i.e. Waffle_houses <- south -> married_age -> divorce_rate, and the total causal influence of number of waffle houses would be the direct influence and backdoor path, however the backdoor path influence can be closed by the process of conditioning on south.

6-2. Build a series of models to test the implied conditional independencies of the causal graph you used in the previous problem. If any of the tests fail, how do you think the graph needs to be amended? Does the graph need more or fewer arrows? Feel free to nominate variables that aren’t in the data.

## Determination of some of the models for the Test validation of the multiple implied conditional independencies of the prior causal graph.


dagitty("dag{
        waffle_houses -> divorce_rate <- married_age <- south -> waffle_houses 
        <- south -> married_age -> married_rate}") %>% 
  impliedConditionalIndependencies()
## dvr_ _||_ mrrd_r | mrrd_g
## dvr_ _||_ soth | mrr_, wff_
## mrr_ _||_ wff_ | soth
## mrrd_r _||_ soth | mrrd_g
## mrr_ _||_ wff_ | soth
## mrrd_r _||_ wff_ | mrrd_g
## (Model:a): divorce_rate _||_ married_rate | married_age 

## (Model :b): divorce_rate _||_ south | married_age, waffle_houses  

## (Model :c):  married_rate _||_ waffle_houses | south 

## (Model :d): married_rate _||_ south | married_age 

## (Model :e): married_rate _||_ waffle_houses | married_age 


## Determination of the indepence test validation function for the multiple implied conditional independencies validations.

Independence_test_validation <- function(model.input, coeff.input, depth.output = 1){
  suppressWarnings(
    precis(model.input, depth = depth.output) %>%
    as_tibble(rownames = "estimate") %>%
    filter(estimate %in% {{coeff.input}}) %>%
    mutate(across(where(is.numeric), round, digits = 2)) %>%
    knitr::kable()
  )
}


## Model a condition validation: divorce_rate _||_ married_rate | married_age. 

alist(
  married_rate ~ dnorm(mu, sigma),
  mu <- a + Bm_age*married_age + Bdivorce_rate*divorce_rate,
  a ~ dnorm(0, 0.5),
  c(Bm_age, Bdivorce_rate) ~ dnorm(0, 0.5),
  sigma ~ dexp(1)) %>% 
  quap(data = Data_Waffle) %>% 
  Independence_test_validation(coeff.input = "Bdivorce_rate")
estimate mean sd 5.5% 94.5%
Bdivorce_rate -0.06 0.12 -0.25 0.13
# Inference: Since as per the output, the posterior "Bdivorce_rate" didn't indicate any consistent association with married rate, hence  Divorce rate could be considered as independent of the married rate after the conditioning on married age.



## (Model :b): divorce_rate _||_ south | married_rate, waffle_houses 


alist(
  divorce_rate ~ dnorm(mu, sigma),
  mu <- a[south] + Bm_rate*married_rate + Bwaffle_h*waffle_houses,
  a[south] ~ dnorm(0, 0.5),
  c(Bm_rate, Bwaffle_h) ~ dnorm(0, 0.5),
  sigma ~ dexp(1)) %>% 
  quap(data = Data_Waffle) %>% 
  Independence_test_validation(depth.output = 2, coeff.input = c("a[1]", "a[2]"))
estimate mean sd 5.5% 94.5%
a[1] -0.15 0.15 -0.39 0.09
a[2] 0.34 0.25 -0.06 0.74
# Inference: South can be considered independent from the divorce rate, since as per the output, the posterior didn't indicate any consistent association with divorced rate, and after conditioning on married rate and waffle houses.


## (Model :c): married_rate _||_ south | married_age 


alist(
  married_rate ~ dnorm(mu, sigma),
  mu <- a[south] + Bm_age*married_age,
  a[south] ~ dnorm(0, 0.5),
  Bm_age ~ dnorm(0, 0.5),
  sigma ~ dexp(1)) %>% 
  quap(data = Data_Waffle) %>% 
  Independence_test_validation(depth.output = 2, coeff.input = c("a[1]", "a[2]"))
estimate mean sd 5.5% 94.5%
a[1] 0.06 0.11 -0.12 0.23
a[2] -0.13 0.17 -0.41 0.14
# Inference: South can be considered independent from the married rate, since as per the output, the posterior did not indicate association with the married rate, and after conditioning on marriage age.


## (Model :d):  married_rate _||_ waffle_houses | south 


alist(
  waffle_houses ~ dnorm(mu, sigma),
  mu <- a[south] + Bm_rate*married_rate,
  a[south] ~ dnorm(0, 0.5),
  Bm_rate ~ dnorm(0, 0.5),
  sigma ~ dexp(1)) %>% 
  quap(data = Data_Waffle) %>% 
  Independence_test_validation(coeff.input = "Bm_rate")
estimate mean sd 5.5% 94.5%
Bm_rate -0.02 0.1 -0.18 0.14
#Inference: Married rate can be considered independent from the waffle houses, since as per the output, the posterior "Bm_rate" didn't indicate any consistent association with waffle houses, and after conditioning on south.


## (Model :e): married_rate _||_ waffle_houses | married_age  

alist(
  waffle_houses ~ dnorm(mu, sigma),
  mu <- a + Bm_rate*married_rate + Bm_age*married_age,
  a ~ dnorm(0, 0.5),
  c(Bm_rate, Bm_age) ~ dnorm(0, 0.5),
  sigma ~ dexp(1)) %>% 
  quap(data = Data_Waffle) %>% 
  Independence_test_validation(coeff.input = "Bm_rate")
estimate mean sd 5.5% 94.5%
Bm_rate -0.09 0.18 -0.38 0.2
#Inference: Married rate is found to be independent from the waffle houses, since as per the output, the posterior "Bm_rate" didn't indicate any consistent association with waffle houses, and after conditioning on the married age.

6-3. For the DAG you made in the previous problem, can you write a data generating simulation for it? Can you design one or more statistical models to produce causal estimates? If so, try to calculate interesting counterfactuals. If not, use the simulation to estimate the size of the bias you might expect. Under what conditions would you, for example, infer the opposite of a true causal effect?

## Determination of Data generating simulation for the level of Professors acquired with tacit and explicit professional development qualifications and it's level or extent of intervention on the Engineering students performance:

# Here in our case Scenario,

# Ex_Knldge <- Existing of Knowledge of Students

# Dvlpmnt_Qualf <- Development Qualifications Intervention of Professors.

# Eng_Prfmnce <- Performance of Engineering Students.

# Prof_Instrct <- Instructions by Professors.


set.seed(1234)

Students <- 1000


Intelligence_data <-
  tibble(
    Ex_Knldge = rnorm(Students, mean = 0, sd =2),
    Dvlpmnt_Qualf = sample(0L:1L, Students, replace = TRUE),
    Prof_Instrct = rnorm(Students, mean = 1 + 3 * Dvlpmnt_Qualf),
    Eng_Prfmnce = Ex_Knldge + rnorm(Students, 0.8 * Prof_Instrct)) %>%
  mutate(across(where(is.double), standardize))
  

str(Intelligence_data)
## tibble [1,000 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Ex_Knldge    : num [1:1000] -1.184 0.305 1.114 -2.325 0.457 ...
##   ..- attr(*, "scaled:center")= num -0.0532
##   ..- attr(*, "scaled:scale")= num 1.99
##  $ Dvlpmnt_Qualf: int [1:1000] 1 0 0 1 1 0 0 0 1 1 ...
##  $ Prof_Instrct : num [1:1000] 1.808 -1.566 -1.206 0.52 0.655 ...
##   ..- attr(*, "scaled:center")= num 2.49
##   ..- attr(*, "scaled:scale")= num 1.82
##  $ Eng_Prfmnce  : num [1:1000] -0.424 -0.444 -0.338 -1.594 -0.1 ...
##   ..- attr(*, "scaled:center")= num 1.97
##   ..- attr(*, "scaled:scale")= num 2.63
## Determination of the model for extracting the causal effect of professional development activities on the performance of the students.

Intelligence_Model <-
  brm(
    Eng_Prfmnce ~ 1 + Dvlpmnt_Qualf, data = Intelligence_data, family = gaussian,
    prior = c(prior(normal(0, 0.2), class = Intercept),
              prior(normal(0, 0.5), class = b),
              prior(exponential(1), class = sigma)),
    iter = 3000, warmup = 1000, chains = 4, cores = 4, seed = 123,
    file = here("Fitted", "Causal"))
  

as_draws_df(Intelligence_Model) %>%
  as_tibble() %>%
  select(b_Intercept, b_Dvlpmnt_Qualf, sigma) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, y = name)) + 
  stat_halfeye(.width = c(0.70, 0.80, 0.90)) +
  labs(x = "Parameters Values", y = "Different Parameters")

# Result Inference: It is clearly evident from our model's simulation data plot that there seems to be a major magnitude of causal influence of the Engineering study teaching professor's professional development on the performances of the Engineering students, and with around 0.9 standard deviation of increase in the Engineering student performances, against or in comparison to the Engineering student performances whose Professor's didn't have professional development related qualification.

All three problems below are based on the same data. The data in data(foxes) are 116 foxes from 30 different urban groups in England. These foxes are like street gangs. Group size varies from 2 to 8 individuals. Each group maintains its own urban territory. Some territories are larger than others. The area variable encodes this information. Some territories also have more avgfood than others. We want to model the weight of each fox. For the problems below, assume the following DAG:

6-4. Use a model to infer the total causal influence of area on weight. Would increasing the area available to each fox make it heavier (healthier)? You might want to standardize the variables. Regardless, use prior predictive simulation to show that your model’s prior predictions stay within the possible outcome range.

# Determination of the model to infer the total causal influence of area on weight.

## As per the diagram above, there seems to be two possible below scenario's or "paths" connecting "area" with weight as indicated below:

# Path_1: area -> avgfood -> groupsize -> weight

# Path_2 : area -> avgfood -> weight  


# ## Leveraging the foxes Data set for the determination of the model to infer the total causal influence of area on weight.

data("foxes")

str(foxes)
## 'data.frame':    116 obs. of  5 variables:
##  $ group    : int  1 1 2 2 3 3 4 4 5 5 ...
##  $ avgfood  : num  0.37 0.37 0.53 0.53 0.49 0.49 0.45 0.45 0.74 0.74 ...
##  $ groupsize: int  2 2 2 2 2 2 2 2 3 3 ...
##  $ area     : num  1.09 1.09 2.05 2.05 2.12 2.12 1.29 1.29 3.78 3.78 ...
##  $ weight   : num  5.02 2.84 5.33 6.07 5.85 3.25 4.53 4.09 6.13 5.59 ...
summary(foxes)
##      group          avgfood         groupsize          area      
##  Min.   : 1.00   Min.   :0.3700   Min.   :2.000   Min.   :1.090  
##  1st Qu.:11.75   1st Qu.:0.6600   1st Qu.:3.000   1st Qu.:2.590  
##  Median :18.00   Median :0.7350   Median :4.000   Median :3.130  
##  Mean   :17.21   Mean   :0.7517   Mean   :4.345   Mean   :3.169  
##  3rd Qu.:24.00   3rd Qu.:0.8000   3rd Qu.:5.000   3rd Qu.:3.772  
##  Max.   :30.00   Max.   :1.2100   Max.   :8.000   Max.   :5.070  
##      weight     
##  Min.   :1.920  
##  1st Qu.:3.720  
##  Median :4.420  
##  Mean   :4.530  
##  3rd Qu.:5.375  
##  Max.   :7.550
## Standardization of all variables, and excluding group since that is a dummy variable.

Data_foxes <- foxes %>% 
  as_tibble() %>% 
  mutate(across(-group, standardize))

# Model # 1 depiction:

Model_foxes1 <- alist(weight ~ dnorm(mu, sigma),  
                  mu <- a + Barea*area, 
                  a ~ dnorm(0, 0.2), 
                  Barea ~ dnorm(0, 0.5), 
                  sigma ~ dexp(1)) %>% 
  quap(data = Data_foxes)



# Model validation to check if our priors are falling outside a realistic range and by the utilization of the predictive simulation approach.

Model_foxes1 %>% 
  extract.prior() %>% 
  link(Model_foxes1, post = .,
    data = list(area = seq(-2, 2, length.out = 20))) %>% 
  as_tibble() %>%
  pivot_longer(cols = everything(), values_to = "weight") %>%
  add_column(
    area = rep(seq(-2, 2, length = 20), 1000),
    type = rep(as.character(1:1000), each = 20)) %>%
  ggplot() + 
  geom_line(aes(x = area,  y = weight, group = type), 
            alpha = 0.1, colour = "black") + 
  labs(x = "Standard Area", y = "Standard Weight", 
       caption = "Prediction of prior simulation for standardized area on weight") +
  theme_minimal()

  ## Findings: Some lines of the Simulation prior Model plot were observed to be out of sd (-2, 2) range.


## Assessment of influence of Area on Weight:

Model_foxes1 %>% 
  Independence_test_validation(coeff.input = "Barea")
estimate mean sd 5.5% 94.5%
Barea 0.02 0.09 -0.13 0.16
## There appears to be no significant association between the foxes territory area and weight.

6-5. Now infer the causal impact of adding food to a territory. Would this make foxes heavier? Which covariates do you need to adjust for to estimate the total causal influence of food?

# Determination of causal impact of incorporating average food to a territory to our equation.


# Two of open Paths linking to Average food to foxes Weight.

# Path_A. avgfood -> weight

# Path_B. avgfood <- groupsize -> weight


# Determination of Causal impact of incorporating 'average food' to the territory calculation, for the above open path A, and further not considering or accounting for the path B, or to include or conditioning on 'group size', since that would close our path A:

Path_A <- 
  alist(
  weight ~ dnorm(mu, sigma),
  mu <- a + Bavgfood*avgfood,
  a ~ dnorm(0, 0.2),
  Bavgfood ~ dnorm(0, 0.5),
  sigma ~ dexp(1)) %>% 
  quap(data = Data_foxes)  
  

precis(Path_A)
##                   mean         sd       5.5%     94.5%
## a        -2.106047e-06 0.08360014 -0.1336113 0.1336071
## Bavgfood -2.421541e-02 0.09088498 -0.1694672 0.1210363
## sigma     9.911435e-01 0.06465851  0.8878067 1.0944803
# Considering the fact that the 'avgfood' variable is only causally associated with the foxes 'weight' variable, and as per the estimate encountered output of this independent test validation for assessing causal impact of incorporating average food to a territory, there appears to be no significant associationship or relation between these parameters of 'avgfood' and 'weight'.

6-6. Now infer the causal impact of group size. Which covariates do you need to adjust for? Looking at the posterior distribution of the resulting model, what do you think explains these data? That is, can you explain the estimates for all three problems? How do they go together?

# Determination of causal impact of incorporating group size to a territory to our equation.


# Two of open Paths linking to Group size to foxes Weight.

# Path_A1. groupsize -> weight

# Path_B1. groupsize <- avgfood -> weight

# Determination of Causal impact of incorporating 'group size' to the territory calculation, for the above open path A, and further not considering or accounting for the path B, since that is a backdoor path.

Path_A1 <-
alist(
  weight ~ dnorm(mu, sigma),
  mu <- a + Bavgfood*avgfood + Bgrpsize*groupsize,
  a ~ dnorm(0, 0.2),
  c(Bavgfood, Bgrpsize) ~ dnorm(0, 0.5),
  sigma ~ dexp(1)) %>% 
  quap(data = Data_foxes) 
  

precis(Path_A1)
##                   mean         sd       5.5%      94.5%
## a         8.134211e-08 0.08013815 -0.1280762  0.1280763
## Bavgfood  4.772542e-01 0.17912338  0.1909804  0.7635279
## Bgrpsize -5.735261e-01 0.17914188 -0.8598294 -0.2872228
## sigma     9.420450e-01 0.06175274  0.8433522  1.0407378
## Estimate Inferences: 
#  As per the obtained estimate output findings of our scenario test validation, it's certain that there is a negative relationship exists between the 'Group size' with the foxes' "Weight", however, there also seems to be a positive association or relationship    between the "Average food" and the foxes "Weight", but further in contrast of our prior finding of estimate where no significant    associations or relationship was observed between the "Average food" and foxes "Weight", this outcome or estimate could be assumed to be a result of some masked relationship.  

## Estimate Inferences for above different paths scenario's models:
#  The obtained estimate outcome can be realistic in the scenario, linking that the availability of more food also leads to an opportunity for more magnitude of foxes to be attracted and drawn towards the areas available with more average food, hence in the process, more number of foxes leading to more distribution of average food, and in a way also impacting the weight of the foxes, or eliminating the causal impact of average food on the weight of the foxes, and further in the process having a negative associations or relationship group size of the foxes and the weight of the foxes.

6-7. Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X → Z → Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model prediction Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?

# Considering the DAG  X -> Z -> Y, and under the assumption of 'Z' to be dependent to a high proportion on 'X', and similarly 'Y' to be also dependent to a high proportion on 'Z', correlation determination between X and Z.

set.seed(1234)

N = 500

X <- rnorm(N, mean = 0, sd = 1)

Z <- rnorm(N, mean = X, sd = 0.5)

Y <- rnorm(N, mean = Z, sd = 1)


Data_X_Z_Y <- data.frame(X, Y, Z)

cor(X,Z) 
## [1] 0.913709
# Correlation Finding Inference:
# Correlation between the 'X" and 'Z' variable was observed to be positive direction correlation and very high value of 0.90,hence supporting the argument of existence of multicollinearity for this case.


# Simulation of the DAG X -> Z -> Y


Model_X_Z_Y <- 
  quap(alist(
  Y ~ dnorm(mu, sigma),
  mu <- a + B1*X + B2*Z,
  a ~ dnorm(1, 50),
  B1 ~ dnorm(2, 10),
  B2 ~ dnorm(2, 10),
  sigma ~ dexp(1)
  ), data = Data_X_Z_Y) 
  precis(Model_X_Z_Y)
##              mean       sd       5.5%    94.5%
## a      -40.557931 44.29355 -111.34758 30.23172
## B1       2.269435  9.94262  -13.62079 18.15966
## B2       1.148759  9.92591  -14.71476 17.01228
## sigma 2135.719483      NaN        NaN      NaN
# Determination of legs:

Lgs <- quap(
  alist(
    Y ~ dnorm(mu, sigma),
    mu <- a + bX*X + bZ*Z,
    c(a,bX,bZ) ~ dnorm(0,1),
    sigma ~ dexp(1)
  ), data = list(X=X, Y=Y, Z=Z)
)

precis(Lgs)
##             mean         sd        5.5%      94.5%
## a     0.02953931 0.04171535 -0.03712987 0.09620849
## bX    0.14690704 0.09842898 -0.01040149 0.30421557
## bZ    0.92899387 0.08666489  0.79048664 1.06750111
## sigma 0.93206320 0.02942591  0.88503492 0.97909149
## Determination of plot for Legs.

plot(precis(Lgs))

post <- extract.samples(Lgs)

plot( Z ~ X, post, col = "red", pch=16)

## One of the element different from the legs example pertinent to the chapter is the depiction of the causal relationship of 'X' variable to the 'Z' variable of the simulation model and considering for this DAG X -> Z -> Y and both Z and X estimates being relatively with narrower posterior and which was observed to be distinct from chapter Leg example, and further in the chapter both Legs to a major proportion predicted the height, however in this example 'Z' is considered as a major predictor of the output.