Link to ALL CODE

Link: https://rpubs.com/ffejffej888/causal1

Question 1: “Daughters”

PART (A)

Step 1-1: Data preparation and simple regression model

foo <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSY9jLlufY1GjeMh7D2_g1m6olveHLNCerT2C36MTkcjwQCOlZYf8evLMzGOnc252OgXEEasHqcNIcZ/pub?gid=1976818127&single=true&output=csv")

reg1 <- lm(nowtot ~ hasgirls +Dems +Repubs + Christian + age + srvlng + demvote, foo)

summary(reg1)

## 
## Call:
## lm(formula = nowtot ~ hasgirls + Dems + Repubs + Christian + 
##     age + srvlng + demvote, data = foo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.028 -10.322  -1.517  11.208  69.642 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.6991    18.6306   2.077 0.038390 *  
## hasgirls     -0.4523     1.9036  -0.238 0.812322    
## Dems         -8.1022    17.5861  -0.461 0.645238    
## Repubs      -55.1069    17.6340  -3.125 0.001901 ** 
## Christian   -13.3961     3.7218  -3.599 0.000357 ***
## age           0.1260     0.1117   1.128 0.259938    
## srvlng       -0.2251     0.1355  -1.662 0.097349 .  
## demvote      87.5501     8.4847  10.319  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.19 on 422 degrees of freedom
## Multiple R-squared:  0.7821, Adjusted R-squared:  0.7784 
## F-statistic: 216.3 on 7 and 422 DF,  p-value: < 2.2e-16

The estimated coefficient on hasgirls is -0.4523, with a standard error of 1.9036. The 95% confidence interval therefore runs from about -4.2 to 3.3, which includes zero (p = 0.812). Note that this estimate adjusts for the covariates only through a linear model, so it can still be biased if treated and control legislators differ in ways the linear specification does not capture. To address this, we use matching in the following steps.
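As a check on the interval arithmetic, the following is a minimal sketch that recomputes a 95% confidence interval from the hasgirls estimate and standard error printed in the summary above. The variable names (est, se, df_resid, crit, ci) are illustrative and not part of the original analysis.

```r
# Estimate and standard error for hasgirls, copied from summary(reg1)
est <- -0.4523
se <- 1.9036
df_resid <- 422  # residual degrees of freedom reported by summary(reg1)

# t-based 95% interval: estimate +/- critical value * standard error
crit <- qt(0.975, df = df_resid)
ci <- est + c(-1, 1) * crit * se
print(round(ci, 2))
```

With the fitted model in memory, confint(reg1, "hasgirls") should return the same kind of interval directly.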

Step 1-2: Visualizing the distributions of the covariates

library(ggplot2)
library(gridExtra)

# plot the distribution
plot1 <- ggplot(foo, aes(x = Dems, fill = factor(hasgirls))) +
  geom_density(alpha = 0.5) +
  ggtitle("Distribution of Dems by hasgirls") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"))

plot2 <- ggplot(foo, aes(x = Repubs, fill = factor(hasgirls))) +
  geom_density(alpha = 0.5) +
  ggtitle("Distribution of Repubs by hasgirls") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"))

plot3 <- ggplot(foo, aes(x = Christian, fill = factor(hasgirls))) +
  geom_density(alpha = 0.5) +
  ggtitle("Distribution of Christian by hasgirls") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"))

plot4 <- ggplot(foo, aes(x = age, fill = factor(hasgirls))) +
  geom_density(alpha = 0.5) +
  ggtitle("Distribution of Age by hasgirls") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"))

plot5 <- ggplot(foo, aes(x = srvlng, fill = factor(hasgirls))) +
  geom_density(alpha = 0.5) +
  ggtitle("Distribution of srvlng by hasgirls") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"))

plot6 <- ggplot(foo, aes(x = demvote, fill = factor(hasgirls))) +
  geom_density(alpha = 0.5) +
  ggtitle("Distribution of Demvote by hasgirls") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"))

# layout
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, ncol = 2)
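
Dems, Repubs and Christian are 0/1 indicators, so a density curve mostly shows two bumps at 0 and 1. A table of within-group proportions may be easier to read for such variables. The sketch below uses a small synthetic data frame (toy, generated here purely for illustration) rather than the real data; only the column names mirror the report.

```r
# Synthetic stand-in for the real data; column names mirror the report
set.seed(42)
toy <- data.frame(
  hasgirls = rbinom(100, 1, 0.7),
  Dems = rbinom(100, 1, 0.5)
)

# Share of Dems = 0 and Dems = 1 within each hasgirls group (each row sums to 1)
prop_by_group <- prop.table(
  table(hasgirls = toy$hasgirls, Dems = toy$Dems),
  margin = 1
)
print(round(prop_by_group, 3))
```

Replacing toy with foo would, under the assumption that the columns are coded 0/1, give the group shares for the actual sample.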

Step 1-3: Employing the matching method

library(Matching)

## Warning: package 'Matching' was built under R version 4.3.3

## Loading required package: MASS

## ## 
## ##  Matching (Version 4.10-15, Build Date: 2024-10-14)
## ##  See https://www.jsekhon.com for additional documentation.
## ##  Please cite software as:
## ##   Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## ##   Software with Automated Balance Optimization: The Matching package for R.''
## ##   Journal of Statistical Software, 42(7): 1-52. 
## ##

## Match on the confounders below...
X <- cbind(foo$Dems, foo$Repubs, foo$Christian, foo$age, foo$srvlng, foo$demvote)
Tr  <- foo$hasgirls
Y   <- foo$nowtot


genout <- GenMatch(Tr = Tr, estimand="ATT", X = X, M=3, pop.size=16, max.generations=10, wait.generations=1)

## Loading required namespace: rgenoud

## 
## 
## Mon Dec 16 19:59:57 2024
## Domains:
##  0.000000e+00   <=  X1   <=    1.000000e+03 
##  0.000000e+00   <=  X2   <=    1.000000e+03 
##  0.000000e+00   <=  X3   <=    1.000000e+03 
##  0.000000e+00   <=  X4   <=    1.000000e+03 
##  0.000000e+00   <=  X5   <=    1.000000e+03 
##  0.000000e+00   <=  X6   <=    1.000000e+03 
## 
## Data Type: Floating Point
## Operators (code number, name, population) 
##  (1) Cloning...........................  1
##  (2) Uniform Mutation..................  2
##  (3) Boundary Mutation.................  2
##  (4) Non-Uniform Mutation..............  2
##  (5) Polytope Crossover................  2
##  (6) Simple Crossover..................  2
##  (7) Whole Non-Uniform Mutation........  2
##  (8) Heuristic Crossover...............  2
##  (9) Local-Minimum Crossover...........  0
## 
## SOFT Maximum Number of Generations: 10
## Maximum Nonchanging Generations: 1
## Population size       : 16
## Convergence Tolerance: 1.000000e-03
## 
## Not Using the BFGS Derivative Based Optimizer on the Best Individual Each Generation.
## Not Checking Gradients before Stopping.
## Using Out of Bounds Individuals.
## 
## Maximization Problem.
## GENERATION: 0 (initializing the population)
## Lexical Fit..... 5.240188e-03  4.909477e-02  8.960232e-02  1.964681e-01  1.964681e-01  3.173124e-01  3.173124e-01  3.310551e-01  3.617439e-01  5.639124e-01  5.639124e-01  5.939293e-01  
## #unique......... 16, #Total UniqueCount: 16
## var 1:
## best............ 4.089335e+02
## mean............ 4.771542e+02
## variance........ 9.306529e+04
## var 2:
## best............ 6.233589e+02
## mean............ 3.744562e+02
## variance........ 7.007394e+04
## var 3:
## best............ 3.035372e+02
## mean............ 3.446113e+02
## variance........ 5.371188e+04
## var 4:
## best............ 9.650124e+02
## mean............ 5.201913e+02
## variance........ 1.315297e+05
## var 5:
## best............ 5.548271e+02
## mean............ 5.053687e+02
## variance........ 6.839926e+04
## var 6:
## best............ 2.126889e+02
## mean............ 5.184861e+02
## variance........ 1.012202e+05
## 
## GENERATION: 1
## Lexical Fit..... 2.030087e-02  2.212342e-02  1.545841e-01  1.964681e-01  1.964681e-01  3.395078e-01  4.143426e-01  4.143426e-01  5.241463e-01  5.310622e-01  5.639124e-01  5.639124e-01  
## #unique......... 13, #Total UniqueCount: 29
## var 1:
## best............ 4.089335e+02
## mean............ 4.417691e+02
## variance........ 8.281992e+04
## var 2:
## best............ 6.233589e+02
## mean............ 4.714460e+02
## variance........ 3.541329e+04
## var 3:
## best............ 3.035372e+02
## mean............ 3.289821e+02
## variance........ 2.192679e+04
## var 4:
## best............ 9.650124e+02
## mean............ 9.083194e+02
## variance........ 1.290739e+04
## var 5:
## best............ 2.552607e+02
## mean............ 4.524836e+02
## variance........ 1.316494e+04
## var 6:
## best............ 2.126889e+02
## mean............ 2.783973e+02
## variance........ 2.228998e+04
## 
## GENERATION: 2
## Lexical Fit..... 2.030087e-02  2.212342e-02  1.545841e-01  1.964681e-01  1.964681e-01  3.395078e-01  4.143426e-01  4.143426e-01  5.241463e-01  5.310622e-01  5.639124e-01  5.639124e-01  
## #unique......... 12, #Total UniqueCount: 41
## var 1:
## best............ 4.089335e+02
## mean............ 2.851074e+02
## variance........ 1.371594e+04
## var 2:
## best............ 6.233589e+02
## mean............ 4.156116e+02
## variance........ 3.590132e+04
## var 3:
## best............ 3.035372e+02
## mean............ 4.107470e+02
## variance........ 1.294899e+04
## var 4:
## best............ 9.650124e+02
## mean............ 9.451768e+02
## variance........ 4.937924e+03
## var 5:
## best............ 2.552607e+02
## mean............ 3.290438e+02
## variance........ 6.110964e+03
## var 6:
## best............ 2.126889e+02
## mean............ 2.718720e+02
## variance........ 1.026979e+04
## 
## GENERATION: 3
## Lexical Fit..... 2.212342e-02  2.376619e-02  1.249280e-01  1.262261e-01  1.262261e-01  3.179272e-01  3.458234e-01  3.458234e-01  5.684948e-01  5.687506e-01  1.000000e+00  1.000000e+00  
## #unique......... 13, #Total UniqueCount: 54
## var 1:
## best............ 3.721656e+02
## mean............ 3.867065e+02
## variance........ 8.492127e+03
## var 2:
## best............ 5.762529e+02
## mean............ 5.612998e+02
## variance........ 7.507412e+03
## var 3:
## best............ 8.139170e+02
## mean............ 3.597086e+02
## variance........ 2.103786e+04
## var 4:
## best............ 9.197144e+02
## mean............ 8.330766e+02
## variance........ 6.674020e+04
## var 5:
## best............ 2.413626e+02
## mean............ 2.785779e+02
## variance........ 7.565773e+03
## var 6:
## best............ 2.084056e+02
## mean............ 2.228478e+02
## variance........ 5.515775e+02
## 
## GENERATION: 4
## Lexical Fit..... 2.748641e-02  3.770192e-02  1.541630e-01  1.964681e-01  1.964681e-01  4.143426e-01  4.143426e-01  4.405624e-01  5.639124e-01  5.639124e-01  6.604038e-01  8.314274e-01  
## #unique......... 12, #Total UniqueCount: 66
## var 1:
## best............ 3.811564e+02
## mean............ 4.024830e+02
## variance........ 2.164124e+03
## var 2:
## best............ 6.136957e+02
## mean............ 6.017425e+02
## variance........ 3.291897e+02
## var 3:
## best............ 3.034552e+02
## mean............ 4.319642e+02
## variance........ 4.462998e+04
## var 4:
## best............ 9.608901e+02
## mean............ 9.181486e+02
## variance........ 1.471925e+04
## var 5:
## best............ 2.110031e+02
## mean............ 2.340237e+02
## variance........ 1.798344e+02
## var 6:
## best............ 2.321491e+02
## mean............ 2.340856e+02
## variance........ 4.435305e+03
## 
## GENERATION: 5
## Lexical Fit..... 2.748641e-02  3.770192e-02  1.541630e-01  1.964681e-01  1.964681e-01  4.143426e-01  4.143426e-01  4.405624e-01  5.639124e-01  5.639124e-01  6.604038e-01  8.314274e-01  
## #unique......... 13, #Total UniqueCount: 79
## var 1:
## best............ 3.811564e+02
## mean............ 3.822435e+02
## variance........ 2.813281e+01
## var 2:
## best............ 6.136957e+02
## mean............ 6.105143e+02
## variance........ 1.158756e+03
## var 3:
## best............ 3.034552e+02
## mean............ 4.985424e+02
## variance........ 5.999101e+04
## var 4:
## best............ 9.608901e+02
## mean............ 9.379574e+02
## variance........ 2.944820e+03
## var 5:
## best............ 2.110031e+02
## mean............ 2.219688e+02
## variance........ 1.606180e+02
## var 6:
## best............ 2.321491e+02
## mean............ 2.247934e+02
## variance........ 1.024363e+02
## 
## GENERATION: 6
## Lexical Fit..... 4.780984e-02  1.119439e-01  1.896228e-01  1.964681e-01  1.964681e-01  2.533796e-01  2.902986e-01  3.146097e-01  4.143426e-01  4.143426e-01  5.639124e-01  5.639124e-01  
## #unique......... 12, #Total UniqueCount: 91
## var 1:
## best............ 3.811564e+02
## mean............ 3.807309e+02
## variance........ 7.403831e-01
## var 2:
## best............ 6.136957e+02
## mean............ 6.145495e+02
## variance........ 6.960878e+00
## var 3:
## best............ 3.034552e+02
## mean............ 3.181222e+02
## variance........ 3.573759e+03
## var 4:
## best............ 9.608901e+02
## mean............ 9.005358e+02
## variance........ 2.660876e+04
## var 5:
## best............ 2.110031e+02
## mean............ 2.108401e+02
## variance........ 3.938060e+00
## var 6:
## best............ 7.198967e+00
## mean............ 2.182949e+02
## variance........ 2.973523e+03
## 
## GENERATION: 7
## Lexical Fit..... 4.780984e-02  1.119439e-01  1.896228e-01  1.964681e-01  1.964681e-01  2.533796e-01  2.902986e-01  3.146097e-01  4.143426e-01  4.143426e-01  5.639124e-01  5.639124e-01  
## #unique......... 12, #Total UniqueCount: 103
## var 1:
## best............ 3.811564e+02
## mean............ 3.760019e+02
## variance........ 2.896642e+02
## var 2:
## best............ 6.136957e+02
## mean............ 5.984221e+02
## variance........ 5.161668e+03
## var 3:
## best............ 3.034552e+02
## mean............ 3.800547e+02
## variance........ 7.889893e+03
## var 4:
## best............ 9.608901e+02
## mean............ 9.550376e+02
## variance........ 3.290507e+02
## var 5:
## best............ 2.110031e+02
## mean............ 2.110902e+02
## variance........ 6.586819e-02
## var 6:
## best............ 7.198967e+00
## mean............ 1.369670e+02
## variance........ 1.096155e+04
## 
## GENERATION: 8
## Lexical Fit..... 6.102335e-02  1.119439e-01  1.262261e-01  1.262261e-01  1.896228e-01  2.533796e-01  2.622157e-01  3.458234e-01  3.458234e-01  3.484983e-01  1.000000e+00  1.000000e+00  
## #unique......... 10, #Total UniqueCount: 113
## var 1:
## best............ 3.809839e+02
## mean............ 3.773696e+02
## variance........ 2.073871e+02
## var 2:
## best............ 6.136533e+02
## mean............ 6.136811e+02
## variance........ 4.067119e-04
## var 3:
## best............ 7.640723e+02
## mean............ 3.270495e+02
## variance........ 1.313160e+04
## var 4:
## best............ 9.608939e+02
## mean............ 9.388585e+02
## variance........ 7.261030e+03
## var 5:
## best............ 2.111863e+02
## mean............ 2.200837e+02
## variance........ 1.223702e+03
## var 6:
## best............ 7.208276e+00
## mean............ 6.270837e+01
## variance........ 4.621339e+04
## 
## GENERATION: 9
## Lexical Fit..... 6.102335e-02  1.119439e-01  1.262261e-01  1.262261e-01  1.896228e-01  2.533796e-01  2.622157e-01  3.458234e-01  3.458234e-01  3.484983e-01  1.000000e+00  1.000000e+00  
## #unique......... 10, #Total UniqueCount: 123
## var 1:
## best............ 3.809839e+02
## mean............ 3.876317e+02
## variance........ 6.532461e+02
## var 2:
## best............ 6.136533e+02
## mean............ 6.136642e+02
## variance........ 4.141544e-04
## var 3:
## best............ 7.640723e+02
## mean............ 5.872352e+02
## variance........ 5.522774e+04
## var 4:
## best............ 9.608939e+02
## mean............ 9.522263e+02
## variance........ 1.126474e+03
## var 5:
## best............ 2.111863e+02
## mean............ 2.111255e+02
## variance........ 1.195736e-02
## var 6:
## best............ 7.208276e+00
## mean............ 3.666885e+01
## variance........ 6.121125e+03
## 
## GENERATION: 10
## Lexical Fit..... 6.102335e-02  1.119439e-01  1.262261e-01  1.262261e-01  1.896228e-01  2.533796e-01  2.622157e-01  3.458234e-01  3.458234e-01  3.484983e-01  1.000000e+00  1.000000e+00  
## #unique......... 13, #Total UniqueCount: 136
## var 1:
## best............ 3.809839e+02
## mean............ 3.809951e+02
## variance........ 1.471041e-03
## var 2:
## best............ 6.136533e+02
## mean............ 5.903179e+02
## variance........ 9.839882e+03
## var 3:
## best............ 7.640723e+02
## mean............ 7.256049e+02
## variance........ 1.513418e+04
## var 4:
## best............ 9.608939e+02
## mean............ 9.375856e+02
## variance........ 8.144992e+03
## var 5:
## best............ 2.111863e+02
## mean............ 2.220140e+02
## variance........ 1.759337e+03
## var 6:
## best............ 7.208276e+00
## mean............ 7.208878e+00
## variance........ 1.357114e-05
## 
## 'wait.generations' limit reached.
## No significant improvement in 1 generations.
## 
## Solution Lexical Fitness Value:
## 6.102335e-02  1.119439e-01  1.262261e-01  1.262261e-01  1.896228e-01  2.533796e-01  2.622157e-01  3.458234e-01  3.458234e-01  3.484983e-01  1.000000e+00  1.000000e+00  
## 
## Parameters at the Solution:
## 
##  X[ 1] : 3.809839e+02
##  X[ 2] : 6.136533e+02
##  X[ 3] : 7.640723e+02
##  X[ 4] : 9.608939e+02
##  X[ 5] : 2.111863e+02
##  X[ 6] : 7.208276e+00
## 
## Solution Found Generation 8
## Number of Generations Run 10
## 
## Mon Dec 16 19:59:58 2024
## Total run time : 0 hours 0 minutes and 1 seconds

mout <- Match(Tr=foo$hasgirls, X=X, estimand="ATE", Weight.matrix=genout)
summary(mout)

## 
## Estimate...  0 
## SE.........  0 
## T-stat.....  NaN 
## p.val......  NA 
## 
## Original number of observations..............  430 
## Original number of treated obs...............  312 
## Matched number of observations...............  430 
## Matched number of observations  (unweighted).  434

mb <- MatchBalance(
    hasgirls ~ Dems +Repubs + Christian + age + srvlng + demvote,
    match.out = mout, nboots=500, data = foo)

## 
## ***** (V1) Dems *****
##                        Before Matching        After Matching
## mean treatment........    0.45833            0.47209 
## mean control..........    0.50847            0.47674 
## std mean diff.........    -10.047            -0.9306 
## 
## mean raw eQQ diff.....   0.050847          0.0046083 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  1 
## 
## mean eCDF diff........   0.025071          0.0023041 
## med  eCDF diff........   0.025071          0.0023041 
## max  eCDF diff........   0.050141          0.0046083 
## 
## var ratio (Tr/Co).....    0.98809            0.99905 
## T-test p-value........    0.35571            0.15706 
## 
## 
## ***** (V2) Repubs *****
##                        Before Matching        After Matching
## mean treatment........    0.53846            0.52558 
## mean control..........    0.49153            0.52326 
## std mean diff.........        9.4            0.46518 
## 
## mean raw eQQ diff.....   0.042373          0.0023041 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  1 
## 
## mean eCDF diff........   0.023468          0.0011521 
## med  eCDF diff........   0.023468          0.0011521 
## max  eCDF diff........   0.046936          0.0023041 
## 
## var ratio (Tr/Co).....    0.98911            0.99954 
## T-test p-value........     0.3873            0.31731 
## 
## 
## ***** (V3) Christian *****
##                        Before Matching        After Matching
## mean treatment........     0.9391            0.94186 
## mean control..........    0.94915            0.94186 
## std mean diff.........    -4.1958                  0 
## 
## mean raw eQQ diff.....   0.016949                  0 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  0 
## 
## mean eCDF diff........   0.005025                  0 
## med  eCDF diff........   0.005025                  0 
## max  eCDF diff........    0.01005                  0 
## 
## var ratio (Tr/Co).....     1.1787                  1 
## T-test p-value........    0.68107                  1 
## 
## 
## ***** (V4) age *****
##                        Before Matching        After Matching
## mean treatment........     52.628             51.733 
## mean control..........     49.178             51.621 
## std mean diff.........     38.385             1.2089 
## 
## mean raw eQQ diff.....      3.661            0.47926 
## med  raw eQQ diff.....          4                  0 
## max  raw eQQ diff.....          7                  7 
## 
## mean eCDF diff........   0.075348          0.0090726 
## med  eCDF diff........   0.075538          0.0069124 
## max  eCDF diff........    0.17807           0.032258 
## 
## var ratio (Tr/Co).....    0.71552            0.94853 
## T-test p-value........  0.0020402            0.22856 
## KS Bootstrap p-value.. < 2.22e-16               0.91 
## KS Naive p-value......  0.0087659            0.97764 
## KS Statistic..........    0.17807           0.032258 
## 
## 
## ***** (V5) srvlng *****
##                        Before Matching        After Matching
## mean treatment........     8.5865             8.5326 
## mean control..........     8.7458             8.6326 
## std mean diff.........    -2.1085            -1.3244 
## 
## mean raw eQQ diff.....    0.66949            0.40783 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          5                  5 
## 
## mean eCDF diff........   0.017181           0.010923 
## med  eCDF diff........    0.01445          0.0069124 
## max  eCDF diff........   0.051608           0.032258 
## 
## var ratio (Tr/Co).....    0.77347            0.92055 
## T-test p-value........    0.85956            0.46051 
## KS Bootstrap p-value..      0.768              0.842 
## KS Naive p-value......    0.97653            0.97764 
## KS Statistic..........   0.051608           0.032258 
## 
## 
## ***** (V6) demvote *****
##                        Before Matching        After Matching
## mean treatment........    0.49929            0.49953 
## mean control..........    0.50602            0.50884 
## std mean diff.........    -5.2747            -7.5845 
## 
## mean raw eQQ diff.....   0.011441           0.012488 
## med  raw eQQ diff.....       0.01               0.01 
## max  raw eQQ diff.....       0.08               0.08 
## 
## mean eCDF diff........   0.015928           0.019092 
## med  eCDF diff........   0.010811           0.013825 
## max  eCDF diff........   0.048512           0.052995 
## 
## var ratio (Tr/Co).....     1.1269             1.0179 
## T-test p-value........    0.61103           0.074488 
## KS Bootstrap p-value..      0.928              0.468 
## KS Naive p-value......    0.98776            0.57589 
## KS Statistic..........   0.048512           0.052995 
## 
## 
## Before Matching Minimum p.value: < 2.22e-16 
## Variable Name(s): age  Number(s): 4 
## 
## After Matching Minimum p.value: 0.074488 
## Variable Name(s): demvote  Number(s): 6

Step 1-4: Evaluating the balance after matching

After matching, balance improved on most covariates. For Dems, the treatment and control means were 0.45833 and 0.50847 before matching and 0.47209 and 0.47674 after, so the standardized mean difference shrank from -10.047 to -0.9306. For Christian, both group means became 0.94186 after matching and the t-test p-value rose from 0.68107 to 1. The largest gain is on age, where the standardized mean difference fell from 38.385 to 1.2089 and the KS bootstrap p-value rose from below 2.22e-16 to 0.91. Balance on demvote did not improve: its standardized mean difference moved from -5.2747 to -7.5845, and it has the smallest after-matching p-value (0.074488). Overall, the matched groups are more comparable than the raw groups, which makes a causal reading more plausible, provided there are no unobserved confounders.
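
To make the "std mean diff" and "var ratio" rows concrete, here is a minimal sketch that recomputes the before-matching values for Dems. The 0/1 vectors are an assumption: they are rebuilt from the reported group means and sizes (312 treated and 118 control observations), which imply 143 and 60 Democrats respectively. The formula shown, a mean difference scaled by the treated-group standard deviation, appears consistent with the reported -10.047 but is not taken from the package source.

```r
# Dems indicators rebuilt from the reported means and group sizes
# (assumption: 143 of 312 treated and 60 of 118 controls are Democrats)
dems_treated <- rep(c(1, 0), times = c(143, 169))
dems_control <- rep(c(1, 0), times = c(60, 58))

# Standardized mean difference, scaled by the treated-group standard deviation
smd <- 100 * (mean(dems_treated) - mean(dems_control)) / sd(dems_treated)

# Variance ratio (treated over control)
var_ratio <- var(dems_treated) / var(dems_control)

print(round(c(smd = smd, var_ratio = var_ratio), 4))
```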

Step 1-5: Leveraging matching process to better estimate the treatment effect

After_genmatch  <- Match(Y = Y, Tr=Tr, X=X, M=3)
summary(After_genmatch)

## 
## Estimate...  -0.0013355 
## AI SE......  1.9563 
## T-stat.....  -0.00068264 
## p.val......  0.99946 
## 
## Original number of observations..............  430 
## Original number of treated obs...............  312 
## Matched number of observations...............  312 
## Matched number of observations  (unweighted).  938

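The matched estimate of the treatment effect is -0.0013355, with an Abadie-Imbens standard error of 1.9563. The 95% confidence interval is therefore approximately -3.8357 to 3.8330, which includes zero (p = 0.99946). We conclude that, in these data, having daughters has no statistically detectable effect on nowtot, the voting score used as the outcome. Note that this Match call does not pass Weight.matrix = genout, so it uses the default covariate weighting rather than the GenMatch weights, and it estimates the ATT (the Match default) with M = 3, whereas the balance table in Step 1-3 was computed on an ATE match.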

The standard error from matching (1.9563) is about the same as the regression standard error for hasgirls (1.9036), so matching did not reduce the uncertainty; both approaches give an estimate that is indistinguishable from zero.
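
For comparison, the sketch below shows one way to keep the estimand, M, and the weight matrix consistent across GenMatch, Match and MatchBalance, so that the balance table describes the same matched sample the estimate comes from. It runs on synthetic data (toy, X_toy, gen_toy and m_toy are hypothetical names, and the true effect is zero by construction), uses a deliberately small genetic search, and is skipped if the Matching or rgenoud packages are not installed. It is an illustration under these assumptions, not a rerun of the analysis above.

```r
set.seed(123)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
toy$tr <- rbinom(n, 1, plogis(0.8 * toy$x1))  # treatment depends on x1
toy$y <- 1 + 2 * toy$x1 + rnorm(n)            # outcome; true treatment effect is 0
X_toy <- cbind(toy$x1, toy$x2)

if (requireNamespace("Matching", quietly = TRUE) &&
    requireNamespace("rgenoud", quietly = TRUE)) {
  # Search for covariate weights; same estimand and M as the Match call below
  gen_toy <- Matching::GenMatch(Tr = toy$tr, X = X_toy, estimand = "ATT", M = 1,
                                pop.size = 16, max.generations = 5,
                                wait.generations = 1, print.level = 0)

  # Estimate the effect using the GenMatch weights
  m_toy <- Matching::Match(Y = toy$y, Tr = toy$tr, X = X_toy, estimand = "ATT",
                           M = 1, Weight.matrix = gen_toy)
  summary(m_toy)

  # Check balance on the same matched sample that produced the estimate
  Matching::MatchBalance(tr ~ x1 + x2, data = toy, match.out = m_toy,
                         nboots = 100)
}
```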

PART (B)

Step 1-6: Treatment effect in “higher dose” scenario with matching method

#filter data
treatment_group <- subset(foo, ngirls == 2 & nboys == 0)
control_group <- subset(foo, nboys == 2 & ngirls == 0)

filtered_data <- rbind(treatment_group, control_group)

#head(filtered_data)

## Fit the regression on the filtered two-child sample
foo <- filtered_data
reg2 <- lm(nowtot ~ hasgirls + Dems + Repubs + Christian + age + srvlng + demvote, foo)
summary(reg2)

## Match on the confounders below...

X <- cbind(foo$Dems, foo$Repubs, foo$Christian, foo$age, foo$srvlng, foo$demvote)
Tr  <- foo$hasgirls
Y   <- foo$nowtot

genout <- GenMatch(Tr = Tr, estimand="ATT", X = X, M=2, pop.size=16, max.generations=10, wait.generations=1)

## 
## 
## Mon Dec 16 19:59:58 2024
## Domains:
##  0.000000e+00   <=  X1   <=    1.000000e+03 
##  0.000000e+00   <=  X2   <=    1.000000e+03 
##  0.000000e+00   <=  X3   <=    1.000000e+03 
##  0.000000e+00   <=  X4   <=    1.000000e+03 
##  0.000000e+00   <=  X5   <=    1.000000e+03 
##  0.000000e+00   <=  X6   <=    1.000000e+03 
## 
## Data Type: Floating Point
## Operators (code number, name, population) 
##  (1) Cloning...........................  1
##  (2) Uniform Mutation..................  2
##  (3) Boundary Mutation.................  2
##  (4) Non-Uniform Mutation..............  2
##  (5) Polytope Crossover................  2
##  (6) Simple Crossover..................  2
##  (7) Whole Non-Uniform Mutation........  2
##  (8) Heuristic Crossover...............  2
##  (9) Local-Minimum Crossover...........  0
## 
## SOFT Maximum Number of Generations: 10
## Maximum Nonchanging Generations: 1
## Population size       : 16
## Convergence Tolerance: 1.000000e-03
## 
## Not Using the BFGS Derivative Based Optimizer on the Best Individual Each Generation.
## Not Checking Gradients before Stopping.
## Using Out of Bounds Individuals.
## 
## Maximization Problem.
## GENERATION: 0 (initializing the population)
## Lexical Fit..... 7.836441e-02  7.836441e-02  2.463540e-01  2.753149e-01  2.944328e-01  3.295378e-01  3.861067e-01  6.226498e-01  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  
## #unique......... 16, #Total UniqueCount: 16
## var 1:
## best............ 8.154814e+02
## mean............ 5.546160e+02
## variance........ 7.597385e+04
## var 2:
## best............ 6.177120e+02
## mean............ 4.747410e+02
## variance........ 4.177031e+04
## var 3:
## best............ 9.321378e+02
## mean............ 4.587515e+02
## variance........ 7.731167e+04
## var 4:
## best............ 7.666736e+00
## mean............ 3.968721e+02
## variance........ 9.574306e+04
## var 5:
## best............ 3.048386e+02
## mean............ 4.648337e+02
## variance........ 1.317992e+05
## var 6:
## best............ 7.590850e+02
## mean............ 3.857127e+02
## variance........ 5.057271e+04
## 
## GENERATION: 1
## Lexical Fit..... 7.836441e-02  7.836441e-02  2.463540e-01  2.753149e-01  2.944328e-01  3.295378e-01  3.861067e-01  6.226498e-01  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  
## #unique......... 13, #Total UniqueCount: 29
## var 1:
## best............ 8.154814e+02
## mean............ 7.759390e+02
## variance........ 4.928724e+03
## var 2:
## best............ 6.177120e+02
## mean............ 4.977584e+02
## variance........ 1.058235e+04
## var 3:
## best............ 9.321378e+02
## mean............ 6.283226e+02
## variance........ 9.227468e+04
## var 4:
## best............ 7.666736e+00
## mean............ 1.168220e+02
## variance........ 1.205825e+04
## var 5:
## best............ 3.048386e+02
## mean............ 3.712163e+02
## variance........ 1.130965e+05
## var 6:
## best............ 7.590850e+02
## mean............ 5.619880e+02
## variance........ 6.564955e+04
## 
## GENERATION: 2
## Lexical Fit..... 7.836441e-02  7.836441e-02  2.463540e-01  2.753149e-01  2.944328e-01  3.295378e-01  3.861067e-01  6.226498e-01  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  
## #unique......... 11, #Total UniqueCount: 40
## var 1:
## best............ 8.154814e+02
## mean............ 7.984996e+02
## variance........ 7.521928e+03
## var 2:
## best............ 6.177120e+02
## mean............ 6.237730e+02
## variance........ 7.316444e+02
## var 3:
## best............ 9.321378e+02
## mean............ 9.207833e+02
## variance........ 1.919584e+03
## var 4:
## best............ 7.666736e+00
## mean............ 1.048392e+01
## variance........ 1.270550e+02
## var 5:
## best............ 3.048386e+02
## mean............ 3.218381e+02
## variance........ 5.475228e+03
## var 6:
## best............ 7.590850e+02
## mean............ 6.892854e+02
## variance........ 3.218285e+04
## 
## 'wait.generations' limit reached.
## No significant improvement in 1 generations.
## 
## Solution Lexical Fitness Value:
## 7.836441e-02  7.836441e-02  2.463540e-01  2.753149e-01  2.944328e-01  3.295378e-01  3.861067e-01  6.226498e-01  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  
## 
## Parameters at the Solution:
## 
##  X[ 1] : 8.154814e+02
##  X[ 2] : 6.177120e+02
##  X[ 3] : 9.321378e+02
##  X[ 4] : 7.666736e+00
##  X[ 5] : 3.048386e+02
##  X[ 6] : 7.590850e+02
## 
## Solution Found Generation 1
## Number of Generations Run 2
## 
## Mon Dec 16 19:59:58 2024
## Total run time : 0 hours 0 minutes and 0 seconds

mout <- Match(Tr=foo$hasgirls, X=X, estimand="ATE", Weight.matrix=genout)

summary(mout)

## 
## Estimate...  0 
## SE.........  0 
## T-stat.....  NaN 
## p.val......  NA 
## 
## Original number of observations..............  59 
## Original number of treated obs...............  31 
## Matched number of observations...............  59 
## Matched number of observations  (unweighted).  59

mb <- MatchBalance(
    hasgirls ~ Dems +Repubs + Christian + age + srvlng + demvote,
    match.out = mout, nboots=500, data = foo)

## 
## ***** (V1) Dems *****
##                        Before Matching        After Matching
## mean treatment........    0.64516            0.54237 
## mean control..........    0.42857            0.54237 
## std mean diff.........     44.532                  0 
## 
## mean raw eQQ diff.....    0.21429                  0 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  0 
## 
## mean eCDF diff........    0.10829                  0 
## med  eCDF diff........    0.10829                  0 
## max  eCDF diff........    0.21659                  0 
## 
## var ratio (Tr/Co).....    0.93145                  1 
## T-test p-value........   0.099329                  1 
## 
## 
## ***** (V2) Repubs *****
##                        Before Matching        After Matching
## mean treatment........    0.35484            0.45763 
## mean control..........    0.57143            0.45763 
## std mean diff.........    -44.532                  0 
## 
## mean raw eQQ diff.....    0.21429                  0 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  0 
## 
## mean eCDF diff........    0.10829                  0 
## med  eCDF diff........    0.10829                  0 
## max  eCDF diff........    0.21659                  0 
## 
## var ratio (Tr/Co).....    0.93145                  1 
## T-test p-value........   0.099329                  1 
## 
## 
## ***** (V3) Christian *****
##                        Before Matching        After Matching
## mean treatment........    0.90323            0.94915 
## mean control..........          1                  1 
## std mean diff.........      -32.2            -22.949 
## 
## mean raw eQQ diff.....    0.10714           0.050847 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  1 
## 
## mean eCDF diff........   0.048387           0.025424 
## med  eCDF diff........   0.048387           0.025424 
## max  eCDF diff........   0.096774           0.050847 
## 
## var ratio (Tr/Co).....        Inf                Inf 
## T-test p-value........   0.083087           0.080673 
## 
## 
## ***** (V4) age *****
##                        Before Matching        After Matching
## mean treatment........     48.226             48.424 
## mean control..........     49.857             47.593 
## std mean diff.........    -19.026              10.29 
## 
## mean raw eQQ diff.....     2.4643             1.0678 
## med  raw eQQ diff.....        2.5                  0 
## max  raw eQQ diff.....          5                  8 
## 
## mean eCDF diff........   0.061382           0.027797 
## med  eCDF diff........   0.051843           0.016949 
## max  eCDF diff........    0.13479           0.084746 
## 
## var ratio (Tr/Co).....     1.0028             1.1312 
## T-test p-value........    0.46822             0.4593 
## KS Bootstrap p-value..       0.79              0.928 
## KS Naive p-value......    0.80669            0.90803 
## KS Statistic..........    0.13479           0.084746 
## 
## 
## ***** (V5) srvlng *****
##                        Before Matching        After Matching
## mean treatment........     7.5484             7.8475 
## mean control..........     9.6071             8.4407 
## std mean diff.........    -28.926            -8.1437 
## 
## mean raw eQQ diff.....     2.4286                  1 
## med  raw eQQ diff.....          1                  0 
## max  raw eQQ diff.....         10                  8 
## 
## mean eCDF diff........   0.066172            0.03072 
## med  eCDF diff........    0.05818           0.033898 
## max  eCDF diff........    0.17051           0.084746 
## 
## var ratio (Tr/Co).....    0.60661            0.74509 
## T-test p-value........    0.34249            0.17321 
## KS Bootstrap p-value..      0.476               0.83 
## KS Naive p-value......    0.49336            0.80634 
## KS Statistic..........    0.17051           0.084746 
## 
## 
## ***** (V6) demvote *****
##                        Before Matching        After Matching
## mean treatment........    0.52677            0.51119 
## mean control..........    0.50714            0.51576 
## std mean diff.........     15.554            -3.6245 
## 
## mean raw eQQ diff.....       0.05           0.022542 
## med  raw eQQ diff.....       0.05               0.02 
## max  raw eQQ diff.....       0.12               0.08 
## 
## mean eCDF diff........    0.10108           0.041874 
## med  eCDF diff........   0.066244           0.033898 
## max  eCDF diff........    0.29493            0.10169 
## 
## var ratio (Tr/Co).....    0.88501             1.0516 
## T-test p-value........    0.56612            0.45914 
## KS Bootstrap p-value..       0.13              0.798 
## KS Naive p-value......   0.099395             0.8412 
## KS Statistic..........    0.29493            0.10169 
## 
## 
## Before Matching Minimum p.value: 0.083087 
## Variable Name(s): Christian  Number(s): 3 
## 
## After Matching Minimum p.value: 0.080673 
## Variable Name(s): Christian  Number(s): 3
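
As a rough guide to reading the `std mean diff` and `var ratio` rows, the sketch below reproduces the before-matching values for Dems. It assumes the 0/1 counts implied by the reported means (20 of 31 treated units and 12 of 28 control units coded as Democrats) and assumes the standardized difference is scaled by the treated-group standard deviation. Both are inferences from the printed output, not taken from the original code, and the variable names are hypothetical.

```r
# Hypothetical reconstruction of the Dems indicator from the reported means
# (20/31 = 0.64516 for treated, 12/28 = 0.42857 for control).
dems_treated <- c(rep(1, 20), rep(0, 11))
dems_control <- c(rep(1, 12), rep(0, 16))

# Standardized mean difference, as a percentage of the treated-group SD
# (assumed scaling; sd() uses the n - 1 denominator).
std_mean_diff <- 100 * (mean(dems_treated) - mean(dems_control)) / sd(dems_treated)

# Variance ratio, treated over control.
var_ratio <- var(dems_treated) / var(dems_control)

print(round(std_mean_diff, 3))
print(round(var_ratio, 5))
```

Under these assumptions the two numbers come out close to the reported 44.532 and 0.93145, which suggests the table measures imbalance relative to the spread of the treated group.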

After_genmatch  <- Match(Y = Y, Tr=Tr, X=X, M=2)
summary(After_genmatch)

## 
## Estimate...  15.484 
## AI SE......  5.1617 
## T-stat.....  2.9998 
## p.val......  0.0027017 
## 
## Original number of observations..............  59 
## Original number of treated obs...............  31 
## Matched number of observations...............  31 
## Matched number of observations  (unweighted).  62

Based on the analysis, the estimated treatment effect is 15.484, with a standard error of 5.1617. The 95% confidence interval therefore runs from 5.367068 to 25.60093. Because this interval excludes zero, the data suggest that in the “high dose” scenario, having two girls rather than two boys has a statistically significant positive effect on the outcome variable (nowtot).
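
The interval quoted above can be approximately reproduced from the printed estimate and standard error. This is a minimal sketch assuming a normal approximation; because it starts from the rounded values in the output, the last digits differ slightly from the figures in the text.

```r
# Values copied from summary(After_genmatch)
estimate <- 15.484   # Estimate
se <- 5.1617         # AI SE (Abadie-Imbens standard error)

# 95% confidence interval under a normal approximation
z <- qnorm(0.975)
ci <- c(lower = estimate - z * se, upper = estimate + z * se)
print(ci)
```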

Bonus question

hasgirls and totchi are part of the definition of the treatment group, so they directly reflect which group a unit belongs to (having daughters vs. not having daughters). Matching or balancing on these variables is therefore not appropriate: because they are inherently tied to treatment assignment, doing so would weaken the analysis.

QUESTION 2: “Business Lending in Indonesia”

From our CS class we learnt that randomised controlled trials (RCTs) are the gold standard for making causal inferences. However, policy is typically not conducted in an experimental setting; convenience sampling is the more common scenario. Hence, how we address confounding to make an apples-to-apples comparison is critical in observational research.

This article not only combines different matching methods (e.g. exact, caliper, and genetic matching) but also provides actionable results that tell the survey team which targets to follow up with after the treatment. In previous classes we dealt with static data; this article shows how matching can be used in a dynamic way.

My only concern is whether it is possible to maximize the number of treated units while maintaining good balance. The article mentions that a more efficient algorithm can be achieved by adjusting the settings through repeated trial and error. But could machine learning be applied to this optimization, given that our objective function is clearly defined as a combination of the number of treated units and balance?

QUESTION 3: Sensemakr package in R

Step 3-1: Choosing the data set

Because there is a potential causal inference we can draw from the “high dose” scenario, I will use it as the subset to demonstrate the sensitivity analysis.

foo <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSY9jLlufY1GjeMh7D2_g1m6olveHLNCerT2C36MTkcjwQCOlZYf8evLMzGOnc252OgXEEasHqcNIcZ/pub?gid=1976818127&single=true&output=csv")
# loads package
library(sensemakr)

## Warning: package 'sensemakr' was built under R version 4.3.3

## See details in:

## Carlos Cinelli and Chad Hazlett (2020). Making Sense of Sensitivity: Extending Omitted Variable Bias. Journal of the Royal Statistical Society, Series B (Statistical Methodology).

#filter data
treatment_group <- subset(foo, ngirls == 2 & nboys == 0)
control_group <- subset(foo, nboys == 2 & ngirls == 0)
filtered_data <- rbind(treatment_group, control_group)

Step 3-2: Identifying the benchmark covariate through PCA

Our data has multiple covariates, so I had to decide which one to use as the benchmark for the sensitivity test. I therefore ran PCA on the covariates (computed here on the full data set, foo) to see which of them account for most of the variation.

covariates <- foo[, c("Dems","Repubs","Christian","age","srvlng","demvote")]
pca <- prcomp(covariates, scale. = TRUE)
summary(pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6
## Standard deviation     1.6219 1.2517 0.9600 0.69705 0.6250 0.06680
## Proportion of Variance 0.4385 0.2611 0.1536 0.08098 0.0651 0.00074
## Cumulative Proportion  0.4385 0.6996 0.8532 0.93416 0.9993 1.00000

pca$rotation

##                  PC1         PC2         PC3          PC4         PC5
## Dems      -0.5805602  0.14127956 -0.14803652  0.350220423  0.00896927
## Repubs     0.5822635 -0.14002613  0.12847497 -0.350920742 -0.01644348
## Christian  0.2121854  0.09021739 -0.96910461 -0.064435959 -0.05765627
## age       -0.1487362 -0.68381376 -0.13451859 -0.050724696  0.69971515
## srvlng    -0.1705498 -0.67912390 -0.05768713 -0.007351973 -0.71155291
## demvote   -0.4771653  0.15324183 -0.03150505 -0.864535279 -0.02039802
##                     PC6
## Dems       0.7059322509
## Repubs     0.7081245085
## Christian -0.0140784699
## age        0.0025977429
## srvlng    -0.0037552662
## demvote   -0.0005328732

biplot(pca, main = "PCA Biplot", cex = 0.8)

When picking a benchmark covariate from the plot, I exclude the party variables because they directly indicate political stance. I therefore choose “Christian” as the benchmark covariate for the following sensitivity analysis.
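
To make the PCA summary easier to read, the sketch below recomputes the "Proportion of Variance" and "Cumulative Proportion" rows from the printed standard deviations. The values are copied from the output above rather than recomputed from the data, so small rounding differences are expected.

```r
# Standard deviations of PC1..PC6, copied from summary(pca)
sdev <- c(1.6219, 1.2517, 0.9600, 0.69705, 0.6250, 0.06680)

# Each component's share of total variance is its squared sdev over the sum
prop_var <- sdev^2 / sum(sdev^2)
cum_var <- cumsum(prop_var)

print(round(prop_var, 4))
print(round(cum_var, 4))
```

PC6 carries almost no variance and, in the rotation matrix, loads almost entirely and about equally on Dems and Repubs. That pattern suggests the two party indicators are nearly collinear in this data.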

Step 3-3: Sensitivity analysis coding

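# Restrict the analysis to the "high dose" subset built in Step 3-1
foo <- filtered_data
reg2 <- lm(nowtot ~ hasgirls + Dems + Repubs + Christian + age + srvlng + demvote, foo)
#summary(reg2)

# Benchmark against "Christian"; kd = 1:3 bounds confounders 1x, 2x and 3x as strong
daughters.sensitivity <- sensemakr(model = reg2,
                                treatment = "hasgirls",
                                benchmark_covariates = "Christian",
                                kd = 1:3)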
daughters.sensitivity

## Sensitivity Analysis to Unobserved Confounding
## 
## Model Formula: nowtot ~ hasgirls + Dems + Repubs + Christian + age + srvlng + 
##     demvote
## 
## Null hypothesis: q = 1 and reduce = TRUE 
## 
## Unadjusted Estimates of ' hasgirls ':
##   Coef. estimate: 13.20268 
##   Standard Error: 3.76197 
##   t-value: 3.50951 
## 
## Sensitivity Statistics:
##   Partial R2 of treatment with outcome: 0.1915 
##   Robustness Value, q = 1 : 0.38245 
##   Robustness Value, q = 1 alpha = 0.05 : 0.18552 
## 
## For more information, check summary.

#ovb_minimal_reporting(daughters.sensitivity, format = "latex")

Step 3-4: Interpreting sensitivity analysis

From the output above, the key quantity is the robustness value: unobserved confounders would have to explain at least 38.2% of the residual variance of both the treatment and the outcome to bring the point estimate down to zero, and at least 18.6% of both to make the estimate statistically insignificant at the 5% level.

The partial R-squared of our benchmark covariate “Christian” is 6.4%, which is lower than both thresholds. Thus, an unobserved confounder as strong as “Christian” would not be sufficient to overturn the result.
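
The two robustness values in the output can be reproduced by hand from the printed partial R2 and t-value, using the closed-form expressions in Cinelli and Hazlett (2020). This is only a sketch: the residual degrees of freedom are not printed, so they are inferred here from the relation partial R2 = t^2 / (t^2 + df), and `rv_from_f` is a hypothetical helper name. The package itself may apply refinements beyond this formula.

```r
# Values copied from the sensemakr output
partial_r2 <- 0.1915   # partial R2 of treatment with outcome
t_value <- 3.50951     # t-value of hasgirls

# Assumption: residual df inferred from partial_r2 = t^2 / (t^2 + df)
df <- round(t_value^2 * (1 - partial_r2) / partial_r2)

# Hypothetical helper: robustness value for a given partial Cohen's f
rv_from_f <- function(f) 0.5 * (sqrt(f^4 + 4 * f^2) - f^2)

# RV for q = 1: confounder strength needed to drive the estimate to zero
f <- sqrt(partial_r2 / (1 - partial_r2))
rv_q1 <- rv_from_f(f)

# RV for q = 1, alpha = 0.05: strength needed to lose significance at 5%
f_crit <- qt(0.975, df - 1) / sqrt(df - 1)
rv_q1_alpha <- rv_from_f(max(f - f_crit, 0))

print(c(df = df, rv_q1 = rv_q1, rv_q1_alpha = rv_q1_alpha))
```

Under these assumptions the results land close to the reported 0.38245 and 0.18552.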

Step 3-5: Visualizing the sensitivity analysis results

par(mfrow = c(1, 2))
plot(daughters.sensitivity)
plot(daughters.sensitivity, sensitivity.of = "t-value")

par(mfrow = c(1, 1))

From the figure above, we can see that even an unobserved confounder three times as strong as “Christian” cannot bring the effect size down to 0.

As for the uncertainty that unobserved confounders can contribute, the t-value plot shows that even an unobserved confounder three times as strong as “Christian” is not sufficient to make the estimate statistically insignificant.
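
The contour plots are built from a bias formula that can be evaluated directly. The sketch below uses the omitted-variable-bias expression from Cinelli and Hazlett (2020) to compute the adjusted estimate for a hypothetical confounder. The confounder strengths are illustrative, not estimated from the data; the residual degrees of freedom (52) are an assumption inferred from the printed t-value and partial R2; `adjusted_estimate` is a hypothetical helper; and the confounder is assumed to pull the estimate toward zero.

```r
# Values copied from the sensemakr output
estimate <- 13.20268
se <- 3.76197
df <- 52  # assumption: inferred from the printed t-value and partial R2

# Hypothetical helper: estimate after adjusting for a confounder that explains
# r2dz_x of the residual treatment variance and r2yz_dx of the residual
# outcome variance, assuming the bias reduces the estimate.
adjusted_estimate <- function(r2dz_x, r2yz_dx) {
  bias <- se * sqrt(df) * sqrt(r2yz_dx * r2dz_x / (1 - r2dz_x))
  estimate - bias
}

print(adjusted_estimate(0.10, 0.10))        # illustrative moderate confounder
print(adjusted_estimate(0.38245, 0.38245))  # at the robustness value
print(adjusted_estimate(0.10, 1))           # confounder explains all residual outcome variance
```

With both partial R2 values set to the reported robustness value, the adjusted estimate comes out approximately zero, which is what the robustness value is defined to mean.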

plot(daughters.sensitivity, type = "extreme")

## Warning in rug(x = r2dz.x, col = "red", lwd = 2): some values will be clipped

In our final visualization, we simulate extreme hypothetical scenarios. Even unobserved confounders once or twice as strong as the “Christian” covariate cannot bring the adjusted effect down to zero in those scenarios.

Causal Inf Assignment

Jhong-Fu, Huang

2024-12-15