Causal Inference, Graphs and Simulation

Christopher Weber

2024-10-07

Estimation

library(dplyr)
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive %>% 
  mutate(
    support_protest = ifelse(violent >3, 1, 0)
  ) %>%
  dplyr::select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "logit")) 
# Estimate probit
probitModel = glm(support_protest ~ VIOLENT, data = dat, family = binomial(link = "probit")) 


Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.34186    0.04804  -7.115 1.12e-12 ***
VIOLENT     -0.55257    0.07058  -7.829 4.92e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4671.4  on 3599  degrees of freedom
Residual deviance: 4609.4  on 3598  degrees of freedom
AIC: 4613.4

Number of Fisher Scoring iterations: 4


Call:
glm(formula = support_protest ~ VIOLENT, family = binomial(link = "probit"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.21378    0.02992  -7.145 9.01e-13 ***
VIOLENT     -0.33902    0.04316  -7.855 3.99e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4671.4  on 3599  degrees of freedom
Residual deviance: 4609.4  on 3598  degrees of freedom
AIC: 4613.4

Number of Fisher Scoring iterations: 4

Identical Predictions

# Create a dataset
pred.data = expand.grid(VIOLENT = c(0, 1)) %>% as.data.frame()

predictions_logit = predict(logitModel, pred.data,  type = "response") %>% 
  data.frame() %>%
  mutate(
    violent = c(0, 1)
  ) 

predictions_probit = predict(probitModel, pred.data,  type = "response") %>% 
  data.frame() %>%
  mutate(
    violent = c(0, 1)
  ) 


# The difference
cat("The treatment effect for the logit model is:\n",
predictions_logit[1,1] - predictions_logit[2,1], "\n",
"The treatment effect for the probit model is:\n",
predictions_probit[1,1] - predictions_probit[2,1]

)
The treatment effect for the logit model is:
 0.1251605 
 The treatment effect for the probit model is:
 0.1251605

Nonlinear Regression

  • A different approach to the same problem: a binary dependent variable

  • The log likelihood

\[ \log p(D|\theta)=\sum_{n=1}^N \log p(y_n|\theta)=\sum_{n=1}^N[y_n \log\theta+(1-y_n)\log(1-\theta)] \]

  • \(\theta\) is bounded between 0 and 1, but our regression model requires a continuous, unbounded dependent variable

  • First, convert the probability to an odds

\[{{\theta_i}\over{1-\theta_i}}\]

  • This removes the upper bound restriction

Nonlinear Regression

  • Then, the lower bound

  • The transformation is an “odds” that may approach \(\infty\), but we still have the lower bound at zero

  • Take the natural logarithm of the odds, resulting in the log odds, and the logit transformation

\[\eta_i=\log\left[{{\theta_i}\over{1-\theta_i}}\right]\]

\[\eta_i \in (-\infty, \infty)\]
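
A quick numerical illustration (added here, not in the original): probabilities map to odds on \((0, \infty)\) and then to log odds on the whole real line.

# Probabilities -> odds -> log odds
theta = c(0.1, 0.5, 0.9)
odds = theta / (1 - theta)   # bounded below by 0, unbounded above
log(odds)                    # unbounded in both directions; equivalent to qlogis(theta)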

load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive %>% mutate( support_protest = ifelse(violent >3, 1, 0)) %>%
  dplyr::select(support_protest, VIOLENT)
# Estimate logit
logitModel = glm(support_protest ~ VIOLENT, data = dat, 
                 family = binomial(link = "logit")) 
logitModel

Call:  glm(formula = support_protest ~ VIOLENT, family = binomial(link = "logit"), 
    data = dat)

Coefficients:
(Intercept)      VIOLENT  
    -0.3419      -0.5526  

Degrees of Freedom: 3599 Total (i.e. Null);  3598 Residual
Null Deviance:      4671 
Residual Deviance: 4609     AIC: 4613

Simulation

  • It’s often useful to simulate the distribution of the predicted probabilities

  • We do this by first estimating our statistical model

  • Then develop a routine to examine “what happens” when we change the values of our independent variables

  • We can leverage this paradigm for other purposes, such as generating treatment effects or marginal effects, or testing the robustness of our findings

Simulation, \(var(\theta)\)

  • Let’s do a few things here

  • We’ll use the variance-covariance matrix of the parameters to simulate the distribution of the predicted probabilities

  • Here’s how we’ll do it

The Variance Covariance Matrix

  • The Hessian \[H(\theta)={{\partial^2 ln L(\theta) }\over{\partial \theta \partial \theta^T}}\]

  • What does the second derivative tell us?

  • The negative of the expected value of the Hessian is the information matrix

\[I(\theta)=-E(H(\theta))\].

  • Finally, the inverse of the information matrix is the variance-covariance matrix for the parameters

\[V(\theta)=[-E(H(\theta))]^{-1}\].

  • If we have the variance-covariance matrix, we can simulate draws from a multivariate normal distribution

  • Uncertainty is captured by treating the parameters as random, not fixed

\(\texttt{vcov()}\)

library(MASS)
variance = vcov(logitModel)
variance
            (Intercept)      VIOLENT
(Intercept)   0.0023083 -0.002308300
VIOLENT      -0.0023083  0.004981626

Clarification

  • \(\texttt{vcov(logitModel)}\) returns the variance-covariance matrix of the parameters

  • \(\texttt{mvrnorm(1000, coef(logitModel), vcov(logitModel))}\) simulates 1000 draws from a multivariate normal distribution

  • If the linear predictor is \(X\beta\), the predicted probabilities follow by applying the inverse link

  • \(\hat{y}_{i,k} = \sum_j X_{i,j} \beta_{j,k}\), where \(i\) indexes rows of the design matrix and \(k\) indexes simulated coefficient draws

  • And if we’re dealing with a logit or probit, \(plogis(\hat{y}_{i,k})\) or \(pnorm(\hat{y}_{i,k})\)

In Practice

library(MASS)
library(plotly)
library(tidyr)
variance = vcov(logitModel)
simval = mvrnorm(1000, coef(logitModel), variance)
variance  
            (Intercept)      VIOLENT
(Intercept)   0.0023083 -0.002308300
VIOLENT      -0.0023083  0.004981626
head(simval, 10)
      (Intercept)    VIOLENT
 [1,]  -0.4681921 -0.4530758
 [2,]  -0.3486036 -0.6186723
 [3,]  -0.3183026 -0.5883488
 [4,]  -0.3388572 -0.4641085
 [5,]  -0.3856140 -0.5440829
 [6,]  -0.2742406 -0.6434343
 [7,]  -0.3053757 -0.6163832
 [8,]  -0.2926447 -0.6357384
 [9,]  -0.3770073 -0.5208251
[10,]  -0.2383980 -0.6481180

The Design Matrix

\[\begin{bmatrix} y_{1}= & b_0+ b_1 x_{1} \\ y_{2}= & b_0+ b_1 x_{2} \\ y_{3}= & b_0+ b_1 x_{3}\\ y_{4}= & b_0+ b_1 x_{4}\\ \vdots\\ y_{n}= & b_0+ b_1 x_{n}\\ \end{bmatrix}\]

The Design Matrix

  • The design matrix is the matrix of the independent variables

  • The first column is a column of ones, the intercept

  • \(X\beta\) is the matrix multiplication of the design matrix and the coefficients

Post-Estimation

  • \(X\beta\) is the matrix multiplication of the design matrix and the coefficients

  • After estimation, we can generate our own design matrix, which we’ll use to generate predictions

  • For \(X\beta\) to be defined, the design matrix must have the same number of columns as there are coefficients (see the matrix algebra appendix)

  • Rules of matrix multiplication: the number of columns in \(X\) must equal the length of the coefficient vector, as in the sketch below

  • Conformable matrices
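
A minimal sketch of this step (an added illustration), using the support_protest ~ VIOLENT logit estimated earlier:

# Hand-built design matrix: a column of ones plus VIOLENT = 0, 1.
# X is 2 x 2 and coef(logitModel) has length 2, so X %*% beta is conformable.
X = cbind(1, c(0, 1))
colnames(X) = c("(Intercept)", "VIOLENT")
eta = X %*% coef(logitModel)  # the linear predictor, X times beta
plogis(eta)                   # predicted probabilities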

Plotly

  • The probability of support in the “violence” frame
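
The original figure is not reproduced here; as a stand-in, a minimal plotly sketch of the simulated probability of support in the “violence” frame, using the simval draws from above:

# Histogram of simulated probabilities of support in the "violence" frame,
# plogis(intercept + VIOLENT coefficient) for each simulated draw
p_violence = plogis(simval[, "(Intercept)"] + simval[, "VIOLENT"])
plot_ly(x = ~p_violence, type = "histogram")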

ggplot

  • And in ggplot
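
Likewise, the original ggplot figure is not reproduced; a minimal sketch with the same draws:

library(ggplot2)
# Density of the simulated probabilities in the "violence" frame
data.frame(p_violence = plogis(simval[, "(Intercept)"] + simval[, "VIOLENT"])) %>%
  ggplot(aes(x = p_violence)) +
  geom_density(fill = "grey80") +
  labs(x = "Pr(support protest | violence frame)", y = "Density") +
  theme_minimal()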

Violent versus Not-Specified

  • Perhaps we wish to compare the two conditions: a treatment effect?
  (Intercept)    VIOLENT y_violent y_not_specified
1  -0.4681921 -0.4530758 0.4771405       0.4153587
2  -0.3486036 -0.6186723 0.5000003       0.4153587
3  -0.3183026 -0.5883488 0.4958115       0.4153587
4  -0.3388572 -0.4641085 0.4786616       0.4153587
5  -0.3856140 -0.5440829 0.4896979       0.4153587
6  -0.2742406 -0.6434343 0.5034209       0.4153587

Violent versus Not-Specified

Marginal Effects

  • This is a “simple” research design. The “treatment effect” is simply the difference between the treatment and control conditions

  • In the regression, the coefficient on the treatment-condition dummy captures the difference between the two conditions

  • The logic extends to the marginal effect: what is the change in \(y\) for a change in \(x\)?

  • It can often be useful to calculate the marginal effect of a variable – i.e., the change in probabilities for a one unit increase in the variable

Marginal Effects

\(\Delta x_i =x_{i,1}-x_{i,0}\)

  • The corresponding change in \(y\) is then

\(\Delta y=f\left(x_{i,1},\ldots,x_n \right)-f\left(x_{i,0},\ldots,x_n \right)\)

  • And the marginal effect is (if a unit change, then the denominator is 1):

\(\frac{\Delta y}{\Delta x}=\frac{f\left(x_{i,1},\ldots,x_n \right)-f\left(x_{i,0},\ldots,x_n \right)}{x_{i,1}-x_{i,0}}\)

Marginal Effects

  1. Calculate the probability that \(y=1\) when \(x=0\)

  2. Calculate the probability that \(y=1\) when \(x=1\)

  3. Subtract (1) from (2)

Marginal Effects

  • We might simulate a confidence interval as well by modifying the recipe (see the sketch after this list)
  1. Calculate the probability that \(y=1\) when \(x=0\) by drawing \(K\) draws from the multivariate normal

  2. Calculate the probability that \(y=1\) when \(x=1\) by drawing \(K\) draws from the multivariate normal

  3. Subtract (1) from (2), which will now yield a distribution
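
A minimal sketch of this recipe (an added illustration), using the earlier support_protest ~ VIOLENT logit:

# Simulate K draws of the coefficients, convert each draw to predicted
# probabilities at VIOLENT = 0 and VIOLENT = 1, and summarize the differences
K = 1000
draws = MASS::mvrnorm(K, coef(logitModel), vcov(logitModel))    # K x 2 matrix
p_control = plogis(draws[, "(Intercept)"])                      # Pr(y = 1 | VIOLENT = 0)
p_treated = plogis(draws[, "(Intercept)"] + draws[, "VIOLENT"]) # Pr(y = 1 | VIOLENT = 1)
effect = p_treated - p_control                                  # a distribution of effects
quantile(effect, c(0.025, 0.5, 0.975))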

Marginal Effects
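
The chunk that produced the output below is not shown; here is a plausible reconstruction. Later simulation chunks reuse the name logitModel for this interaction model and assume dat retains the moral_individualism and sdo columns, so the reconstruction does the same:

# Not shown above (a reconstruction): re-estimate the logit with moral
# individualism, SDO, and their interaction, reusing the name logitModel.
# Assumes dataActive contains moral_individualism and sdo.
dat = dataActive %>%
  mutate(support_protest = ifelse(violent > 3, 1, 0)) %>%
  dplyr::select(support_protest, moral_individualism, sdo)
logitModel = glm(support_protest ~ moral_individualism + sdo + moral_individualism:sdo,
                 data = dat, family = binomial(link = "logit"))
summary(logitModel)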


Call:
glm(formula = support_protest ~ moral_individualism + sdo + moral_individualism:sdo, 
    family = binomial(link = "logit"), data = dat)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)               0.8839     0.1290   6.852 7.28e-12 ***
moral_individualism      -1.5124     0.2334  -6.481 9.13e-11 ***
sdo                      -5.2082     0.4632 -11.245  < 2e-16 ***
moral_individualism:sdo   4.9714     0.7543   6.591 4.38e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4671.4  on 3599  degrees of freedom
Residual deviance: 4427.8  on 3596  degrees of freedom
AIC: 4435.8

Number of Fisher Scoring iterations: 4

Mediation

  • Moral individualism (e.g., “Any person who is willing to work hard has a good chance of succeeding”; “If people work hard, they almost always get what they want”)

  • Social dominance orientation is a scale developed by Sidanius and Pratto (1999) (e.g., “Some groups of people are simply inferior to other groups”; “It’s probably a good thing that certain groups are at the top and other groups are at the bottom”)

  • Are moral individualists, and those who tolerate (or even prefer) group-based inequality, more likely to support protesting a democratic election?

Mediation

\[ y_i = \beta_0 + \beta_1 x_{MI,i} + \beta_2 x_{SDO,i} + \beta_3 x_{MI,i} x_{SDO,i} + \epsilon_i \]

library(dplyr)
library(ggplot2)
# Extract the simulated parameters ##
variance = vcov(logitModel)
# A 1000 x 4 matrix
simval = mvrnorm(1000, coef(logitModel), variance)
# Create the design matrix
# max.sdo
max.sdo = quantile(dat$sdo, 0.975)
# min.sdo
min.sdo = quantile(dat$sdo, 0.025)
# max ind
max.mi = quantile(dat$moral_individualism, 0.975)
# min ind
min.mi = quantile(dat$moral_individualism, 0.025)

# A 4 x 4 design matrix
design.matrix = expand.grid(intercept = 1, 
                            moral_individualism = c(min.mi, max.mi), 
                            sdo = c(min.sdo, max.sdo)) %>%
                as.data.frame() %>%
                mutate(interaction = moral_individualism * sdo) %>%
  as.matrix()
print(design.matrix)
     intercept moral_individualism sdo interaction
[1,]         1                   0   0           0
[2,]         1                   1   0           0
[3,]         1                   0   1           0
[4,]         1                   1   1           1
print("Keep track of the matrix dimensions. Each column now represents different combinations of the independent variable")
[1] "Keep track of the matrix dimensions. Each column now represents different combinations of the independent variable"
simulations = design.matrix %*% t(simval) %>% t() %>% plogis()  %>% as.data.frame()
names(simulations) = c("min.mi.min.sdo", "max.mi.min.sdo", "min.mi.max.sdo", "max.mi.max.sdo")

head(simulations)
  min.mi.min.sdo max.mi.min.sdo min.mi.max.sdo max.mi.max.sdo
1      0.7169775      0.3480357     0.01351642      0.2968172
2      0.6577495      0.3960893     0.02102164      0.2405612
3      0.7137669      0.3103688     0.01749856      0.3563595
4      0.6993526      0.3561148     0.01833641      0.2526667
5      0.7255975      0.3562103     0.01154439      0.2604679
6      0.6849504      0.3534417     0.01004602      0.3531018

datLong = simulations %>% 
  pivot_longer(everything(), names_to = "combination", values_to = "probability") 
datLong %>% head()
# A tibble: 6 × 2
  combination    probability
  <chr>                <dbl>
1 min.mi.min.sdo      0.717 
2 max.mi.min.sdo      0.348 
3 min.mi.max.sdo      0.0135
4 max.mi.max.sdo      0.297 
5 min.mi.min.sdo      0.658 
6 max.mi.min.sdo      0.396 

The Normal Distribution

  • The linear model

The Posterior Distribution

  • We can also expand our definition of “simulation” to include the posterior distribution

  • The posterior distribution is the distribution of the parameters given the data, \(p(\theta|D)\)

  • The posterior is proportional to the likelihood \(\times\) the prior

\[ y_{i} \sim Normal(\mu_i, \sigma)\\ \mu_i = \beta_0 + \beta_1 (x_i - \bar{x})\\ \beta_0 \sim Normal(0, 10)\\ \beta_1 \sim Normal(0, 10)\\ \sigma \sim Uniform(0, 50) \]

The Posterior Distribution

  • The \(\texttt{brms}\) package
library(brms)

n <- 100
x <- rnorm(n, 0, 1)
y <- 2 + 3 * (x - mean(x)) + rnorm(n, 0, 1)
data <- data.frame(x = x, y = y)

# Define the model formula
# Note: in R formula syntax, "- mean(x)" removes a term rather than centering x;
# center x beforehand (e.g., create x - mean(x) as a new column) if centering is intended
formula <- bf(y ~ x - mean(x))

# Specify the priors
priors <- c(
  prior(normal(0, 10), class = "b"),
  prior(normal(0, 10), class = "Intercept"),
  prior(uniform(0, 50), class = "sigma")
)

# Fit the model
fit <- brm(formula, data = data, 
           prior = priors, chains = 4, 
           iter = 2000, warmup = 1000)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2.7e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.011 seconds (Warm-up)
Chain 1:                0.009 seconds (Sampling)
Chain 1:                0.02 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1e-06 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.01 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.009 seconds (Warm-up)
Chain 2:                0.008 seconds (Sampling)
Chain 2:                0.017 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 1e-06 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.01 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.009 seconds (Warm-up)
Chain 3:                0.009 seconds (Sampling)
Chain 3:                0.018 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 1e-06 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.01 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.01 seconds (Warm-up)
Chain 4:                0.008 seconds (Sampling)
Chain 4:                0.018 seconds (Total)
Chain 4: 
# Summarize the results
summary(fit)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ x - mean(x) 
   Data: data (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     1.61      0.11     1.40     1.81 1.00     3713     2767
x             2.95      0.11     2.73     3.17 1.00     3395     2596

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.06      0.08     0.91     1.23 1.00     4243     2987

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Plot the results
plot(fit)

Real Data

library(brms)
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive 
dat$authoritarianism = scale(dat$authoritarianism)
dat$sdo = scale(dat$sdo)
dat$college = ifelse(dat$educ == "4-year" | dat$educ == "Post-grad", 1, 0)
                       
formula <- bf(sdo ~ authoritarianism + college + authoritarianism:college)

# Specify the priors
priors <- c(
  prior(normal(0, 10), class = "b"),
  prior(normal(0, 10), class = "Intercept"),
  prior(uniform(0, 100), class = "sigma")
)

# Fit the model
fit <- brm(formula, data = dat, prior = priors, chains = 1, iter = 2000, warmup = 1000)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 3.5e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.35 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.087 seconds (Warm-up)
Chain 1:                0.087 seconds (Sampling)
Chain 1:                0.174 seconds (Total)
Chain 1: 
summary(fit)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: sdo ~ authoritarianism + college + authoritarianism:college 
   Data: dat (Number of observations: 3599) 
  Draws: 1 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 1000

Regression Coefficients:
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                    0.04      0.02     0.00     0.07 1.00     1189
authoritarianism             0.26      0.02     0.22     0.30 1.00      845
college                     -0.08      0.04    -0.15    -0.01 1.00     1024
authoritarianism:college     0.11      0.04     0.03     0.18 1.00      832
                         Tail_ESS
Intercept                     682
authoritarianism              643
college                       617
authoritarianism:college      709

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.95      0.01     0.93     0.98 1.00     1182      554

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Plot the results
plot(fit)

library(tidybayes)
library(dplyr)
library(ggplot2)
library(modelr)
# fit <- brm(formula, data = dat, prior = priors, chains = 1, iter = 2000, warmup = 1000)

dat %>% data_grid( authoritarianism = seq_range(authoritarianism, n = 30),
                   college = c(0, 1)) %>%
  add_epred_draws(fit) %>%
  group_by(authoritarianism, college) %>%
  mutate(college = ifelse(college == 1, "College", "Less than College")) %>%
  summarize(
    .value = mean(.epred),
    .lower = quantile(.epred, 0.025),
    .upper = quantile(.epred, 0.975)
  ) %>%
  ggplot(aes(x = authoritarianism, y = .value, color = factor(college))) +
  geom_line() +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.3, color= "grey") +
  facet_wrap(~college) +
  labs(
    x = "Authoritarianism",
    y = "Social Dominance Orientation",
    color = "Education",
    title = "Predicted Social Dominance Orientation \nby Authoritarianism and Education"
  ) + 
  theme_minimal()

Steps to Building a Model

  • Build a scientific model. This is the theoretical model that you believe underlies the data. This is often represented in a DAG

  • Build a statistical model. How might we use a statistical model to create estimands – parameters that we’ll estimate with our data?

  • Estimation, Simulation and Prediction. Here we’ll join the statistical and scientific model, using the data to test associations and causal processes

  • An example

The DAG

Here’s the simple model:

The Confound

  • The Fork

Here’s the simple model:
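
The DAG figure is not reproduced here; a minimal ggdag sketch of the fork implied by the simulation below, in which education is a common cause of wealth and political conservatism:

library(ggdag)
library(ggplot2)
# The fork: education -> wealth, education -> conservatism, wealth -> conservatism
dagify(wealth ~ education,
       conservatism ~ wealth + education) %>%
  ggdag() +
  theme_dag()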

The Confound

# Imagine a population in which 30% have a college or post-graduate degree
education = rbinom(1000, 1, 0.30)
# predict personal wealth
wealth = 1 + 0.8*education + rnorm(1000, 0, 0.5)
# note: rnorm(1000, 0.5) draws noise with mean 0.5 and sd 1
political_conservatism = 1 + 0.3*wealth - 0.5*education + rnorm(1000, 0.5)
# Notice the change
lm(political_conservatism ~ wealth) %>% summary()

Call:
lm(formula = political_conservatism ~ wealth)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3120 -0.6218 -0.0402  0.7117  3.1416 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.63804    0.07117  23.015   <2e-16 ***
wealth       0.08756    0.05215   1.679   0.0934 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.02 on 998 degrees of freedom
Multiple R-squared:  0.002817,  Adjusted R-squared:  0.001818 
F-statistic: 2.819 on 1 and 998 DF,  p-value: 0.09345
lm(political_conservatism ~ wealth + education) %>% summary()

Call:
lm(formula = political_conservatism ~ wealth + education)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4723 -0.6241 -0.0235  0.6901  2.9591 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.48957    0.07181  20.743  < 2e-16 ***
wealth       0.37520    0.06295   5.960 3.49e-09 ***
education   -0.64791    0.08409  -7.705 3.15e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9911 on 997 degrees of freedom
Multiple R-squared:  0.05886,   Adjusted R-squared:  0.05697 
F-statistic: 31.18 on 2 and 997 DF,  p-value: 7.366e-14
#
coef(lm(political_conservatism ~ wealth + education))["wealth"] -
coef(lm(political_conservatism ~ wealth))["wealth"]
   wealth 
0.2876399 

The Confound

  • We might simulate data to examine the consequences of estimation; the data reported above were synthetic

  • If there is a confounder, we should condition on that confounder; otherwise, our results will be “biased”

  • Bias is just the difference between “truth” (how we set up the simulation) and the estimate

  • There are three other types of relationships to consider; here are two we often encounter: The mediator (the chain or pipe) and the collider (the inverted fork)

The Mediator (Pipe or Chain)

  • \(x\) can affect \(y\) in a couple ways

  • One is \(x\rightarrow m \rightarrow y\)

  • Another is \(x \rightarrow y\)

library(ggdag)
dagify(m ~ x, y ~ m + x) %>% ggdag() + theme_dag()

An Introduction to Causal Inference

  • The “treatment effect”

  • Causal explanations involve a statement about how units are influenced by a particular treatment or intervention

  • The \(\textbf{counterfactual}\): How would an individual respond to a treatment, compared to an identical circumstance where that individual did not receive the treatment (Morton and Williams 2010)?

An Introduction to Causal Inference

  • For this reason, the exercise involves inferences about unit effects

\[\delta_i = Y_{i,1}-Y_{i,0}\]

An Introduction to Causal Inference

  • The counterfactual is unobservable

  • The Fundamental Problem of Causal Inference.

  • \(\delta_i\) is not attainable, because the potential outcome is never directly observed!

An Introduction to Causal Inference

  • Instead, we compare groups of individuals – or units – where some degree of control is exercised

\[E(\delta_i)=E(Y_{i,T}-Y_{i,C})=E(Y_{i,T})-E(Y_{i,C})\]

The Do Operator

  • The do operator

  • Imagine two worlds (datasets); one where everyone has “high wealth” and one where everyone has “low wealth”

  • Use these data as the design matrix, then average the predicted score (political conservatism) in each of the two “worlds”

Use these data “worlds,” in conjunction with the statistical model, to create predictions and causal estimates

education <- rbinom(2000, 1, 0.3)
wealth <- 1 + 0.8 * education + rnorm(2000, 0, 0.5)

political_conservatism <- 1 + 0.3 * wealth - 0.5 * education + rnorm(2000, 0, 1)
synthetic_data = data.frame(wealth, political_conservatism, education)
synthetic_data %>% head()
     wealth political_conservatism education
1 0.1295662             -0.5064683         0
2 0.5658781              2.0517348         0
3 0.8155193              1.3239891         0
4 0.2684103              1.2948259         0
5 0.5727909             -0.3639028         0
6 2.2151570              1.3407955         1

my_model = lm(political_conservatism ~ wealth + education, data = synthetic_data)
summary(my_model)

Call:
lm(formula = political_conservatism ~ wealth + education, data = synthetic_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6567 -0.6904 -0.0229  0.6610  3.6047 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.00055    0.05150  19.430  < 2e-16 ***
wealth       0.29115    0.04364   6.671 3.28e-11 ***
education   -0.46209    0.05925  -7.799 9.99e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9902 on 1997 degrees of freedom
Multiple R-squared:  0.03285,   Adjusted R-squared:  0.03188 
F-statistic: 33.92 on 2 and 1997 DF,  p-value: 3.271e-15

data1 = synthetic_data %>% mutate(wealth = 0)
data2 = synthetic_data %>% mutate(wealth = 1)
# The treatment effect is:
predict(my_model, data2)  %>% mean() - predict(my_model, data1) %>% mean()
[1] 0.2911457

The Bootstrap

  • Another means to simulate uncertainty

  • Sample repeatedly from the data

  • Estimate the treatment effect at each trial

  • Summarize the distributions

The Bootstrap

library(dplyr)
library(boot)
# Define a function for bootstrapping
bootstrap_treatment_effect <- function(data, indices) {
  # Sample the data
  sample_data <- data[indices, ]
  my_model = lm(political_conservatism ~ wealth + education, data = sample_data)

  # Create two worlds
  data1 <- sample_data %>% mutate(wealth = 0)
  data2 <- sample_data %>% mutate(wealth = 1)
  
  # Calculate the treatment effect
  treatment_effect <- mean(predict(my_model, data2)) - mean(predict(my_model, data1))
  
  return(treatment_effect)
}

bootstrap_results <- boot(data = synthetic_data, statistic = bootstrap_treatment_effect, R = 100)

mean = mean(bootstrap_results$t)
lower = quantile(bootstrap_results$t, 0.025)
upper = quantile(bootstrap_results$t, 0.975)

cat("The average treatment effect is", mean, "\n")
The average treatment effect is 0.2900896 
cat("The 95% confidence interval is", lower, "to", upper, "\n")
The 95% confidence interval is 0.1987024 to 0.3773294 

  • In \(\texttt{dagitty}\), we can use the \(\texttt{adjustmentSets()}\) function to determine the appropriate adjustment sets for a given model.
library(ggdag)
library(dagitty)
dag <- dagify(
  conservatism ~ moral_individualism + age + college,
  moral_individualism ~ age + college,
  college ~ age + geography,
  labels = c(
    "moral_individualism" = "MI",
    "college" = "Knowledge",
    "age" = "Age",
    "conservatism" = "Conservatism",
    "geography" = "zip"
  )
)
adjustment_sets <- adjustmentSets(dag, exposure = "college", outcome = "conservatism")
adjustment_sets
{ age }
adjustment_sets <- adjustmentSets(dag, exposure = "age", outcome = "conservatism")
adjustment_sets
 {}

Two Causal Estimands

  • Defining causal effects

  • Examine the graph to formulate the appropriate statistic

\[y_{conservatism} = \beta_0 + \beta_1 x_{age} + \epsilon\]

\[y_{conservatism} = \gamma_0 + \gamma_1 x_{age} + \gamma_2 x_{college} + \epsilon\]
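
The estimation chunk is not shown; a plausible reconstruction of the two logits referenced below as model_age and model_education (assuming dat contains conservative, age, and college):

# Not shown above (a reconstruction): the two logits used in what follows
model_age       = glm(conservative ~ age, data = dat, family = binomial("logit"))
model_education = glm(conservative ~ college + age, data = dat, family = binomial("logit"))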


Call:  glm(formula = conservative ~ age, family = binomial("logit"), 
    data = dat)

Coefficients:
(Intercept)          age  
   -2.43048      0.03126  

Degrees of Freedom: 3598 Total (i.e. Null);  3597 Residual
  (1 observation deleted due to missingness)
Null Deviance:      4278 
Residual Deviance: 4046     AIC: 4050

Call:  glm(formula = conservative ~ college + age, family = binomial("logit"), 
    data = dat)

Coefficients:
(Intercept)      college          age  
   -2.39842     -0.35369      0.03288  

Degrees of Freedom: 3598 Total (i.e. Null);  3596 Residual
  (1 observation deleted due to missingness)
Null Deviance:      4278 
Residual Deviance: 4028     AIC: 4034

datYoung <- 
  dat %>%
  # replace all in age with 20
  mutate(age = 20)  

predYoung = predict(model_age, newdata = datYoung, type = "response") 

datOld <-
  dat %>%
  # replace all in age with 80
  mutate(age = 80) 

predOld = predict(model_age,  newdata = datOld, type = "response")

cat("The effect of going from 20 to 80 years old leads to ",
round(
  predOld %>% mean() - 
  predYoung %>% mean(),
  2)*100, "percentage point increase in identifying as conservative\n")
The effect of going from 20 to 80 years old leads to  38 percentage point increase in identifying as conservative

load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")

# the effect of going from no college to college
datCollege <- 
  dat %>%
  # set college to 1 for everyone
  mutate(college = 1)  %>%
  # redundant, but just to be clear
  mutate(age = age)

predCollege = predict(model_education, newdata = datCollege, type = "response") 

datNocollege <- 
  dat %>%
  # set college to 0 for everyone
  mutate(college = 0)  %>%
  # redundant, but just to be clear
  mutate(age = age)

predNone = predict(model_education, newdata = datNocollege, type = "response")

cat("The effect of going from no college to college years old leads to ",
abs(round(
  predCollege %>% mean() - 
  predNone %>% mean(),
  2)*100), "percentage point decrease in identifying as conservative\n")
The effect of going from no college to college years old leads to  4 percentage point decrease in identifying as conservative

Appendix: Matrix Algebra Review

  • Matrix representation of a system of equations
  • Easier to visualize the relationship between the model and the data
  • And tedious calculations are efficiently represented

Appendix: Matrix Algebra Review

\[\begin{bmatrix} y_{1}= & b_0+ b_1 x_{Ideology,1}+b_2 x_{PID,1}\\ y_{2}= & b_0+ b_1 x_{Ideology,2}+b_2 x_{PID,2}\\ y_{3}= & b_0+ b_1 x_{Ideology,3}+b_2 x_{PID,3}\\ y_{4}= & b_0+ b_1 x_{Ideology,4}+b_2 x_{PID,4}\\ \vdots\\ y_{n}= & b_0+ b_1 x_{Ideology,n}+b_2 x_{PID,n}\\ \end{bmatrix}\]

Vectors

  • Single numbers are scalars
  • Vectors are matrices with a single column or row
  • Vectors are denoted with lowercase letters, e.g., \(b\)

Some Operations

  • Two matrices are equal, \(\textbf{A}=\textbf{B}\), if \(a_{ij}=b_{ij}\) \(\forall i,j\)

  • A \(\textbf{symmetric}\) matrix has identical elements above and below the diagonal, \(a_{ij}=a_{ji}\)

  • Squaring a matrix is the same as multiplying it by itself, \(\textbf{A}^2=\textbf{A} \textbf{A}\)

  • A matrix is \(\textbf{idempotent}\) if multiplying it by itself returns the original matrix, \(\textbf{A}^2=\textbf{A} \textbf{A}= \textbf{A}\) (see the sketch below)
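
A quick check in R (an added illustration): the hat matrix \(H = X(X'X)^{-1}X'\) from regression is idempotent.

# H %*% H equals H (up to floating point error), so H is idempotent
X = cbind(1, rnorm(5))
H = X %*% solve(t(X) %*% X) %*% t(X)
all.equal(H %*% H, H)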

Some More Operations

  • An \(\textbf{identity}\) matrix, \(\textbf{I}\), is analogous to multiplying a scalar by 1: \(\textbf{A}\textbf{I}= \textbf{A}\)

  • Addition (and Subtraction): \[ \tiny \begin{bmatrix} a_{11}&a_{12}&\cdots &a_{1n} \\ a_{21}&a_{22}&\cdots &a_{2n} \\ \vdots & \vdots & \ddots & \vdots\\ a_{n1}&a_{n2}&\cdots &a_{nn} \end{bmatrix}+ \begin{bmatrix} b_{11}&b_{12}&\cdots &b_{1n} \\ b_{21}&b_{22}&\cdots &b_{2n} \\ \vdots & \vdots & \ddots & \vdots\\ b_{n1}&b_{n2}&\cdots &b_{nn} \end{bmatrix}= \begin{bmatrix} a_{11}+b_{11}&a_{12}+b_{12}&\cdots &a_{1n}+b_{1n} \\ a_{21}+b_{21}&a_{22}+b_{22}&\cdots &a_{2n}+b_{2n} \\ \vdots & \vdots & \ddots & \vdots\\ a_{n1}+b_{n1}&a_{n2}+b_{n2}&\cdots &a_{nn}+b_{nn}\\ \end{bmatrix} \]

Even More Operations

  • A matrix multiplied by a scalar is every element multiplied by that scalar
  • Matrix multiplication is a bit more complicated (as is matrix inversion - more later)
  • Both turn out to be quite valuable
  • To multiply two matrices, we multiply the elements of the \(i\)th row by the corresponding elements of the \(j\)th column and sum them

\[ \begin{bmatrix} 1&3 \\ 2&4\\ \end{bmatrix} \begin{bmatrix} 3&5 \\ 2&4\\ \end{bmatrix}= \begin{bmatrix} 1*3+3*2&1*5+3*4 \\ 2*3+4*2&2*5+4*4\\ \end{bmatrix} \]
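
The same product, checked in R (an added illustration):

A = matrix(c(1, 2, 3, 4), nrow = 2)  # columns (1, 2) and (3, 4)
B = matrix(c(3, 2, 5, 4), nrow = 2)  # columns (3, 2) and (5, 4)
A %*% B                              # rows: (9, 17) and (14, 26)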