Practice Data

library(Matching)

## Warning: package 'Matching' was built under R version 4.4.2

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## ## 
## ##  Matching (Version 4.10-15, Build Date: 2024-10-14)
## ##  See https://www.jsekhon.com for additional documentation.
## ##  Please cite software as:
## ##   Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## ##   Software with Automated Balance Optimization: The Matching package for R.''
## ##   Journal of Statistical Software, 42(7): 1-52. 
## ##

library(tidyverse)
library(stargazer)
data(lalonde)

PROBLEM 1. Research question and DAG

i

balance_vars <- c("age", "educ", "black", "hisp", "married", "nodegr", "re74", "re75", "u74", "u75")
# I selected these variables because they are baseline characteristics that should be balanced between the treatment and control groups in a randomized experiment. 


balance_table <- lalonde %>%
  group_by(treat) %>%
  summarize(across(all_of(balance_vars), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))


# Name the list by variable so that .id below carries variable names, not positions
t_test_results <- map(set_names(balance_vars), function(var) {
  t_test <- t.test(lalonde[[var]] ~ lalonde$treat)
  c(mean_control = mean(lalonde[[var]][lalonde$treat == 0], na.rm = TRUE),
    mean_treated = mean(lalonde[[var]][lalonde$treat == 1], na.rm = TRUE),
    sd_control = sd(lalonde[[var]][lalonde$treat == 0], na.rm = TRUE),
    sd_treated = sd(lalonde[[var]][lalonde$treat == 1], na.rm = TRUE),
    t_statistic = unname(t_test$statistic),  # drop the ".t" suffix from the column name
    p_value = t_test$p.value)
})


balance_results <- bind_rows(t_test_results, .id = "Variable")


stargazer(balance_results, type = "text", summary = FALSE)

## 
## ===========================================================================================================================
##    Variable   mean_control       mean_treated       sd_control        sd_treated        t_statistic           p_value      
## ---------------------------------------------------------------------------------------------------------------------------
## 1    age      25.0538461538462   25.8162162162162  7.05774476810838  7.15501927478618  -1.11403614382901   0.265944346988501 
## 2    educ     10.0884615384615   10.3459459459459   1.614324612971   2.01065025640563  -1.44218403616581   0.150169352649502 
## 3   black     0.826923076923077 0.843243243243243  0.379043392864054 0.364557907176729 -0.457777669937133  0.647357377113294 
## 4    hisp     0.107692307692308 0.0594594594594595 0.310589272940481 0.237124370526381  1.85654251255621  0.0640432731660751 
## 5  married    0.153846153846154 0.189189189189189  0.361497068701989 0.392721679149236 -0.966836312132641  0.334247781170195 
## 6   nodegr    0.834615384615385 0.708108108108108  0.372243856316647 0.455866577054302  3.10849808479388  0.00203678014670407
## 7    re74     2107.02681538462       2095.574      5687.90673400048  4886.62252560553  0.0227465981985764  0.981863018448317 
## 8    re75     1266.90924076923   1532.05562972973  3102.98303053682  3219.25108939942  -0.869206081799221  0.385272603045653 
## 9    u74            0.75        0.708108108108108  0.433847828419065 0.455866577054302 0.974689041503303   0.330327780889681 
## 10   u75      0.684615384615385        0.6         0.465565051041008 0.491227389124514  1.82997431239289  0.0680305198189506 
## ---------------------------------------------------------------------------------------------------------------------------
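Because t-statistics depend on sample size, balance is often also summarised with standardized mean differences. The sketch below is a minimal illustration on simulated data; the data frame `toy` and the helper `smd()` are hypothetical names and not part of the original analysis. It divides the difference in group means by the pooled standard deviation.

```r
# Hypothetical helper: standardized mean difference (treated minus control),
# scaled by the square root of the average of the two group variances.
smd <- function(x, treat) {
  m1 <- mean(x[treat == 1]); m0 <- mean(x[treat == 0])
  v1 <- var(x[treat == 1]);  v0 <- var(x[treat == 0])
  (m1 - m0) / sqrt((v1 + v0) / 2)
}

# Simulated data for illustration only (not the lalonde sample).
set.seed(1)
toy <- data.frame(
  treat = rep(c(0, 1), each = 100),
  age   = c(rnorm(100, mean = 25, sd = 7), rnorm(100, mean = 26, sd = 7))
)

smd_age <- smd(toy$age, toy$treat)
print(round(smd_age, 3))
```

Assuming `balance_vars` is defined as above, the same helper could be applied to the real data with `sapply(lalonde[balance_vars], smd, treat = lalonde$treat)`.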

ii

The Average Treatment Effect (ATE) is the expected difference in the outcome variable between treated and control units. In an OLS regression framework, the ATE can be estimated by regressing the outcome variable (re78, real earnings in 1978) on an indicator for treatment (treat) and a set of control variables:

\[ re78_{i} = \alpha + \beta \, treat_{i} + \gamma' X_{i} + \epsilon_{i} \]

treat_i is the binary treatment indicator (1 for participants, 0 for non-participants).

X_i represents control variables (covariates).

β is the coefficient of interest, which represents the ATE.

Under the conditional independence assumption (CIA), which states that treatment assignment is independent of potential outcomes conditional on observed covariates, β provides an unbiased estimate of the ATE, provided the linear model is correctly specified. This means that, after controlling for X_i, differences in earnings between treated and untreated individuals can be attributed to the NSW training program.

If the CIA holds, the coefficient on treat in the regression provides a valid estimate of the ATE. However, if there are unobserved confounders that affect both treatment and the outcome, this estimate may be biased.
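To make the role of the CIA concrete, the following sketch uses simulated data in which a single observed confounder `x` drives both treatment and the outcome. All names and parameter values are assumptions chosen for illustration (the true effect is set to 1000 by construction); this is not the lalonde data.

```r
set.seed(42)
n <- 5000
x     <- rnorm(n)                                  # observed confounder
treat <- rbinom(n, 1, plogis(x))                   # treatment is more likely when x is high
y     <- 1000 * treat + 500 * x + rnorm(n, sd = 300)  # true treatment effect: 1000

naive    <- unname(coef(lm(y ~ treat))["treat"])      # omits the confounder
adjusted <- unname(coef(lm(y ~ treat + x))["treat"])  # conditions on the confounder

print(c(naive = naive, adjusted = adjusted))
```

In this setup the unadjusted coefficient absorbs part of the effect of `x`, while the adjusted coefficient lands close to the value built into the simulation. If `x` were unobserved, no regression on the available data could remove that bias.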

Run the OLS regression to obtain the ATE:

library(Matching)

library(Jmisc)

## Warning: package 'Jmisc' was built under R version 4.4.2

## 
## Attaching package: 'Jmisc'

## The following object is masked from 'package:data.table':
## 
##     shift

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:ggplot2':
## 
##     %+%

data(lalonde)

lalonde_demean <- lalonde
covariates <- c("age", "educ", "black", "hisp", "married", "nodegr", "re74", "re75", "u74", "u75")
lalonde_demean[, covariates] <- apply(lalonde[, covariates], 2, function(x) x - mean(x))


ols_model <- lm(re78 ~ treat + age + educ + black + hisp + married + nodegr + re74 + re75 + u74 + u75, data = lalonde_demean)

summary(ols_model)

## 
## Call:
## lm(formula = re78 ~ treat + age + educ + black + hisp + married + 
##     nodegr + re74 + re75 + u74 + u75, data = lalonde_demean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9612  -4355  -1572   3054  53119 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.606e+03  4.080e+02  11.289  < 2e-16 ***
## treat        1.671e+03  6.411e+02   2.606  0.00948 ** 
## age          5.357e+01  4.581e+01   1.170  0.24284    
## educ         4.008e+02  2.288e+02   1.751  0.08058 .  
## black       -2.037e+03  1.174e+03  -1.736  0.08331 .  
## hisp         4.258e+02  1.565e+03   0.272  0.78562    
## married     -1.463e+02  8.823e+02  -0.166  0.86835    
## nodegr      -1.518e+01  1.006e+03  -0.015  0.98797    
## re74         1.234e-01  8.784e-02   1.405  0.16079    
## re75         1.974e-02  1.503e-01   0.131  0.89554    
## u74          1.380e+03  1.188e+03   1.162  0.24590    
## u75         -1.071e+03  1.025e+03  -1.045  0.29651    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6517 on 433 degrees of freedom
## Multiple R-squared:  0.05822,    Adjusted R-squared:  0.0343 
## F-statistic: 2.433 on 11 and 433 DF,  p-value: 0.005974

Estimated ATE (β): 1.671e+03 (about $1,671)

The standard error of β: 6.411e+02 (about 641)

β is significantly different from zero at the 1% level (t = 2.606, p = 0.0095), which suggests that the NSW program had a statistically significant positive effect on real earnings in 1978.
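The regression above is fitted on demeaned covariates. Centering the covariates leaves every slope, including β, unchanged; it only moves the intercept, which then reads as the predicted outcome for a control unit with average covariates. A minimal sketch on simulated data (variable names and values are illustrative assumptions):

```r
set.seed(7)
n <- 200
d <- data.frame(treat = rbinom(n, 1, 0.4), age = rnorm(n, mean = 25, sd = 7))
d$y <- 4000 + 1500 * d$treat + 50 * d$age + rnorm(n, sd = 1000)

# Centre the covariate only; treatment and outcome are left as they are.
d_demean <- d
d_demean$age <- d$age - mean(d$age)

fit_raw    <- lm(y ~ treat + age, data = d)
fit_demean <- lm(y ~ treat + age, data = d_demean)

# Slopes agree across the two fits; only the intercept differs.
print(rbind(raw = coef(fit_raw), demeaned = coef(fit_demean)))
```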

iii

One-to-One Matching

library(Matching)

data(lalonde)

covariates <- c("age", "educ", "black", "hisp", "married", "nodegr", "re74", "re75", "u74", "u75")

match_one_to_one <- Match(Y = lalonde$re78,  # Outcome variable (earnings in 1978)
                          Tr = lalonde$treat, # Treatment indicator
                          X = lalonde[, covariates], # Matching covariates
                          M = 1,  # One-to-one matching
                          replace = FALSE)  # Without replacement

summary(match_one_to_one)

## 
## Estimate...  1702.7 
## SE.........  724.2 
## T-stat.....  2.3512 
## p.val......  0.018713 
## 
## Original number of observations..............  445 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  185

The NSW training program had a significant positive effect on earnings: treated individuals earned $1,702.7 more on average in 1978 than their matched controls (SE = 724.2, p = 0.019).
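The mechanics of nearest neighbor matching can be seen in a few lines of base R. The sketch below uses made-up numbers and a single covariate, and it matches with replacement, so it illustrates the idea rather than reproducing what Match() does with ten covariates and its scaled distance metric.

```r
# Made-up toy data: treatment indicator, one covariate, and an outcome.
toy <- data.frame(
  treat = c(1, 1, 1, 0, 0, 0, 0),
  x     = c(20, 30, 40, 19, 29, 33, 45),
  y     = c(10, 12, 15,  8,  9, 11, 13)
)
treated  <- toy[toy$treat == 1, ]
controls <- toy[toy$treat == 0, ]

# For each treated unit, find the control with the smallest absolute distance in x.
# Matching is with replacement: the same control may serve several treated units.
nearest <- sapply(treated$x, function(xt) which.min(abs(controls$x - xt)))

# ATT: average difference between each treated outcome and its matched control outcome.
att <- mean(treated$y - controls$y[nearest])
print(att)
```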

Exact Matching

match_exact <- Match(Y = lalonde$re78, 
                     Tr = lalonde$treat, 
                     X = lalonde[, covariates], 
                     M = 1, 
                     exact = TRUE)  

summary(match_exact)

## 
## Estimate...  1306.1 
## AI SE......  339.7 
## T-stat.....  3.8448 
## p.val......  0.00012064 
## 
## Original number of observations..............  445 
## Original number of treated obs...............  185 
## Matched number of observations...............  55 
## Matched number of observations  (unweighted).  120 
## 
## Number of obs dropped by 'exact' or 'caliper'  130

The ATT estimate dropped from 1,702.7 (one-to-one) to 1,306.1 (exact matching). The lower estimate suggests that some of the positive treatment effect observed in one-to-one matching may have been due to imperfect covariate balance. However, exact matching dropped 130 of the 185 treated observations, so this estimate applies only to the 55 treated units that have an exact match.

The standard error (SE) decreased substantially (724.2 → 339.7). This means the estimate is more precise, and the confidence interval is narrower.

Nearest Neighbor Matching (1:2)

match_1to2 <- Match(Y = lalonde$re78, 
                     Tr = lalonde$treat, 
                     X = lalonde[, covariates], 
                     M = 2,  # Two nearest neighbors
                     replace = TRUE)  # Allow replacement

summary(match_1to2)

## 
## Estimate...  1645.5 
## AI SE......  829.11 
## T-stat.....  1.9846 
## p.val......  0.047185 
## 
## Original number of observations..............  445 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  455

One-to-One Matching vs. Nearest Neighbor Matching (1:2): The ATT estimate decreased from 1,702.7 (one-to-one) to 1,645.5 (1:2 matching). The standard error increased (724.2 → 829.11), making the estimate less precise. The two runs are not strictly comparable, though: the one-to-one match was run without replacement and its summary reports "SE", whereas the 1:2 match allows replacement and reports the Abadie-Imbens standard error ("AI SE"). With replacement, the same control can be reused for several treated units, which can raise the variance of the estimate.

Nearest Neighbor Matching (1:2) vs. Exact Matching: Exact matching has a lower ATT estimate (1,306.1 vs. 1,645.5 in 1:2 matching). This suggests that some of the positive effect in nearest neighbor matching may be due to residual imbalance. Exact matching provides better covariate balance and therefore a more conservative estimate. Exact matching has much lower standard errors (829.11 in 1:2 matching vs. 339.7 in exact matching). Exact matching ensures only the most similar control observations are included, which reduces variability.

Nearest Neighbor Matching (1:3)

match_1to3 <- Match(Y = lalonde$re78, 
                     Tr = lalonde$treat, 
                     X = lalonde[, covariates], 
                     M = 3,  
                     replace = TRUE)  

summary(match_1to3)

## 
## Estimate...  1686.9 
## AI SE......  791.37 
## T-stat.....  2.1317 
## p.val......  0.033033 
## 
## Original number of observations..............  445 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  644

1:3 Matching vs. 1:2 Matching: The ATT estimate increased slightly, from 1,645.5 (1:2 matching) to 1,686.9 (1:3 matching). Adding a third control match barely changed the estimated treatment effect, which suggests the additional matches were similar to the first two. The standard error decreased slightly, from 829.11 (1:2) to 791.37 (1:3): averaging over more controls reduces variance, but the gain in precision from a third match is modest.

1:3 Matching vs. Exact Matching: Nearest Neighbor (1:3) matching estimates a higher ATT (1,686.9) compared to exact matching (1,306.1). This suggests that some of the effect observed in nearest neighbor matching may be due to imperfect balance, meaning the NSW program’s true effect might be closer to 1,306.1. Standard errors in 1:3 matching are much higher (791.37 vs. 339.7 in exact matching). Exact matching ensures perfect covariate balance, reducing standard errors significantly. 1:3 matching introduces more control matches, but this does not necessarily improve balance, leading to higher standard errors.

Bias Correction

match_1to2_bias <- Match(Y = lalonde$re78, 
                         Tr = lalonde$treat, 
                         X = lalonde[, covariates], 
                         M = 2,  
                         replace = TRUE,  
                         BiasAdjust = TRUE)  

summary(match_1to2_bias)

## 
## Estimate...  1542.5 
## AI SE......  827.45 
## T-stat.....  1.8642 
## p.val......  0.062296 
## 
## Original number of observations..............  445 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  455

match_1to3_bias <- Match(Y = lalonde$re78, 
                         Tr = lalonde$treat, 
                         X = lalonde[, covariates], 
                         M = 3,  
                         replace = TRUE,  
                         BiasAdjust = TRUE)  

summary(match_1to3_bias)

## 
## Estimate...  1535 
## AI SE......  792.09 
## T-stat.....  1.9379 
## p.val......  0.052632 
## 
## Original number of observations..............  445 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  644

The ATT estimates decreased after applying bias correction: from 1,645.5 to 1,542.5 for 1:2 matching and from 1,686.9 to 1,535 for 1:3 matching. This suggests that the original nearest neighbor matching slightly overestimated the treatment effect because of residual covariate imbalance. Bias correction adjusts for these imbalances using a regression-based correction. Standard errors remained nearly unchanged (829.11 vs. 827.45 and 791.37 vs. 792.09), so bias correction does not materially affect the variability of the estimate. With the smaller point estimates, however, the p-values rise to 0.062 and 0.053, just above the 5% level.
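The regression-based correction can be illustrated by hand. The sketch below is in the spirit of that adjustment but is not a reproduction of the internals of Match(): it fits an outcome regression on the matched controls and shifts each matched control's outcome by the predicted gap due to the remaining covariate difference. The data are made up, and a single covariate with 1:1 matching is assumed.

```r
# Made-up toy data: treatment indicator, one covariate, and an outcome.
toy <- data.frame(
  treat = c(1, 1, 1, 0, 0, 0, 0),
  x     = c(20, 30, 40, 19, 29, 33, 45),
  y     = c(10, 12, 15,  8,  9, 11, 13)
)
treated  <- toy[toy$treat == 1, ]
controls <- toy[toy$treat == 0, ]

# 1:1 nearest neighbor match on x (with replacement).
nearest <- sapply(treated$x, function(xt) which.min(abs(controls$x - xt)))
matched <- controls[nearest, ]

# Outcome regression fitted on the matched controls only.
mu0 <- lm(y ~ x, data = matched)

# Predicted outcome gap caused by x_treated differing from x_matched_control.
adjustment <- predict(mu0, newdata = treated) - predict(mu0, newdata = matched)

att_raw      <- mean(treated$y - matched$y)
att_adjusted <- mean(treated$y - (matched$y + adjustment))
print(c(raw = att_raw, adjusted = att_adjusted))
```

When matches are exact in `x`, the adjustment term is zero and the two estimates coincide; the correction matters only to the extent that matched pairs still differ on the covariates.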

Coarsened Exact Matching

library(cem)

## Warning: package 'cem' was built under R version 4.4.2

## Loading required package: tcltk

## Loading required package: lattice

## 
## How to use CEM? Type vignette("cem")

cem_match <- cem(treatment = "treat", data = lalonde, drop = "re78")

## 
## Using 'treat'='1' as baseline group

print(cem_match)

##            G0  G1
## All       260 185
## Matched   138  96
## Unmatched 122  89

cem_lm <- lm(re78 ~ treat, data = lalonde, weights = cem_match$w)

summary(cem_lm)

## 
## Call:
## lm(formula = re78 ~ treat, data = lalonde, weights = cem_match$w)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
##  -7496  -2172      0      0  30121 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3954.3      493.7   8.009 5.57e-14 ***
## treat         1241.2      770.8   1.610    0.109    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5800 on 232 degrees of freedom
## Multiple R-squared:  0.01105,    Adjusted R-squared:  0.00679 
## F-statistic: 2.593 on 1 and 232 DF,  p-value: 0.1087

cem_result <- Match(Y = lalonde$re78,
                    Tr = lalonde$treat,
                    X = lalonde[, covariates],
                    weights = cem_match$w)

summary(cem_result)

## 
## Estimate...  1026.4 
## AI SE......  NaN 
## T-stat.....  NaN 
## p.val......  NA 
## 
## Original number of observations (weighted)...  234 
## Original number of observations..............  445 
## Original number of treated obs (weighted)....  96 
## Original number of treated obs...............  185 
## Matched number of observations...............  96 
## Matched number of observations  (unweighted).  855

Coarsened Exact Matching (CEM) Results Recap:

Estimated ATT: 1,241.2. Treated individuals in the NSW program earned, on average, $1,241.2 more than matched controls.

Standard Error (SE): 770.8. The standard error is relatively high, and the estimate is not statistically significant at conventional levels (p = 0.109).

Coarsened Exact Matching vs. Exact Matching: ATT Estimate. CEM gives an ATT of 1,241.2, slightly lower than exact matching (1,306.1), so the two point estimates are close. Standard Error. CEM's standard error (770.8) is more than twice that of exact matching (339.7). CEM retains more observations (96 of the 185 treated units are matched, versus 55 under exact matching) and allows more flexibility in matches, but at the cost of more variation within matched groups.
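The logic of coarsened exact matching (coarsen, match exactly on the coarsened values, then reweight controls) can be sketched in base R. The example below uses made-up data, one covariate, and arbitrary cut points; it is a hand-rolled illustration and does not call the cem package.

```r
# Made-up toy data.
toy <- data.frame(
  treat = c(1, 1, 1, 0, 0, 0, 0, 0),
  age   = c(22, 24, 37, 21, 23, 25, 38, 52),
  y     = c(10, 12, 15,  8,  9, 11, 13,  7)
)

# Step 1: coarsen age into bins (cut points are an arbitrary illustrative choice).
toy$stratum <- cut(toy$age, breaks = c(0, 30, 40, 60))

# Step 2: keep only strata that contain both treated and control units.
n_t  <- tapply(toy$treat == 1, toy$stratum, sum)
n_c  <- tapply(toy$treat == 0, toy$stratum, sum)
keep <- names(n_t)[n_t > 0 & n_c > 0]
m    <- toy[toy$stratum %in% keep, ]

# Step 3: weights. Treated units get 1; controls in stratum s get
# (treated in s / controls in s) * (matched controls / matched treated).
s   <- as.character(m$stratum)
m$w <- ifelse(m$treat == 1, 1,
              (n_t[s] / n_c[s]) * (sum(m$treat == 0) / sum(m$treat == 1)))

# Step 4: weighted difference in means via weighted least squares.
att <- unname(coef(lm(y ~ treat, data = m, weights = w))["treat"])
print(att)
```

The weighted regression coefficient equals the average of the within-stratum treated-minus-control differences, weighted by the number of treated units in each stratum.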

iv

Directed Acyclic Graph (DAG)

A DAG (Directed Acyclic Graph) visually represents the causal structure between variables. In this case, we want to model the causal effect of NSW training participation (D) on real earnings (Y), while considering potential confounding variables.

library(dagitty)

## Warning: package 'dagitty' was built under R version 4.4.3

library(ggdag)  # For better visualization with ggplot2

## Warning: package 'ggdag' was built under R version 4.4.3

## 
## Attaching package: 'ggdag'

## The following object is masked from 'package:stats':
## 
##     filter

dag <- dagitty('
dag {
  "Age" -> "D"
  "Educ" -> "D"
  "Black" -> "D"
  "Hisp" -> "D"
  "Married" -> "D"
  "Re74" -> "D"
  "Re75" -> "D"
  "D" -> "Y"
  "Re74" -> "Y"
  "Re75" -> "Y"
}
')


coordinates(dag) <- list(
  x = c(Age = 3, Educ = 2, Black = 3, Hisp = 4, Married = 6, Re74 = 8, Re75 = 9, D = 6, Y = 7),
  y = c(Age = 2, Educ = 3, Black = 4, Hisp = 5, Married = 5, Re74 = 4, Re75 = 2, D = 3, Y = 3)
)

# Plot the DAG
plot(dag)

Confounders X (Re74, Re75) affect both treatment (D) and earnings (Y), meaning we need to control for them to isolate the causal effect of D on Y. The propensity score matching method ensures that we compare treated and control units with similar propensity scores, reducing bias from confounding variables.
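The adjustment set implied by a DAG can also be read off programmatically. The sketch below re-declares the same graph and asks dagitty for the minimal sets of variables that close all back-door paths from D to Y; it assumes the dagitty package is installed and that the graph above is the intended causal model.

```r
library(dagitty)

dag <- dagitty('
dag {
  Age -> D
  Educ -> D
  Black -> D
  Hisp -> D
  Married -> D
  Re74 -> D
  Re75 -> D
  D -> Y
  Re74 -> Y
  Re75 -> Y
}
')

# Minimal sufficient adjustment sets for the total effect of D on Y.
adj <- adjustmentSets(dag, exposure = "D", outcome = "Y")
print(adj)
```

Under this graph only Re74 and Re75 open back-door paths, since the other covariates point into D but not into Y. A graph in which, say, education also affects earnings would yield a larger adjustment set.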

Descriptive graphs.

library(ggplot2)

# Plot relationship between age and real earnings (1974, 1975, 1978)
ggplot(lalonde, aes(x = age, y = re74)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  labs(title = "Age vs. Real Earnings in 1974")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(lalonde, aes(x = age, y = re75)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  labs(title = "Age vs. Real Earnings in 1975")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(lalonde, aes(x = age, y = re78)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  labs(title = "Age vs. Real Earnings in 1978")

## `geom_smooth()` using formula = 'y ~ x'

# Compare age distributions of participants and non-participants
ggplot(lalonde, aes(x = age, fill = as.factor(treat))) + 
  geom_density(alpha = 0.5) + 
  labs(title = "Age Distribution by Treatment Group", fill = "Treatment Status")

Age has a weak positive relationship with earnings in 1974, 1975, and 1978, with many individuals having near-zero earnings, suggesting high unemployment or low wages. The age distributions of NSW participants and non-participants overlap substantially: participants show a peak in their early 20s, while non-participants have a more spread-out distribution, especially at older ages. Mean ages are nevertheless close (25.8 for participants vs. 25.1 for non-participants, p = 0.27 in the balance table). Age may still influence program participation, so it is treated as a potential confounder and controlled for in propensity score matching to ensure a more balanced comparison between treated and control groups.

Propensity score.

ps_model <- glm(treat ~ age + educ + black + hisp + married + nodegr + re74 + re75, 
                family = binomial, data = lalonde)


lalonde$psfit <- ps_model$fitted.values

summary(ps_model)

## 
## Call:
## glm(formula = treat ~ age + educ + black + hisp + married + nodegr + 
##     re74 + re75, family = binomial, data = lalonde)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.178e+00  1.056e+00   1.115  0.26474   
## age          4.698e-03  1.433e-02   0.328  0.74297   
## educ        -7.124e-02  7.173e-02  -0.993  0.32061   
## black       -2.247e-01  3.655e-01  -0.615  0.53874   
## hisp        -8.528e-01  5.066e-01  -1.683  0.09228 . 
## married      1.636e-01  2.769e-01   0.591  0.55463   
## nodegr      -9.035e-01  3.135e-01  -2.882  0.00395 **
## re74        -3.161e-05  2.584e-05  -1.223  0.22122   
## re75         6.161e-05  4.358e-05   1.414  0.15744   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 604.20  on 444  degrees of freedom
## Residual deviance: 587.22  on 436  degrees of freedom
## AIC: 605.22
## 
## Number of Fisher Scoring iterations: 4

summary(lalonde$psfit) # propensity score estimation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1948  0.3593  0.3914  0.4157  0.4755  0.6756

The logistic regression results show that having no degree (nodegr) significantly lowers the likelihood of NSW program participation (coefficient = -0.90, p = 0.00395), consistent with the balance table, where 70.8% of participants versus 83.5% of non-participants have no degree. Hispanic (hisp) status is marginally significant (p = 0.092) and also has a negative coefficient, suggesting lower participation rates. Other variables, including age, education, race (Black), marital status, and past earnings (re74, re75), are not statistically significant predictors. The covariates explain little of participation: the residual deviance (587.22) is only slightly below the null deviance (604.20), and the fitted scores range from 0.19 to 0.68. This low explanatory power is consistent with random assignment, although unobserved factors may also influence program participation. These results motivate propensity score matching to balance observed differences, particularly in degree status, to ensure a fair comparison between treated and control groups.
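As a reminder of what the fitted propensity score is, the sketch below fits a logit on simulated data and shows that the fitted values are simply the inverse-logit of the linear predictor. The data-generating process, in which having no degree lowers the odds of treatment, is an assumption chosen for illustration.

```r
set.seed(3)
n <- 300
toy <- data.frame(
  nodegr = rbinom(n, 1, 0.8),
  age    = round(rnorm(n, mean = 25, sd = 7))
)
# Assumed data-generating process: no degree lowers the odds of treatment.
toy$treat <- rbinom(n, 1, plogis(0.5 - 0.9 * toy$nodegr))

fit <- glm(treat ~ nodegr + age, family = binomial, data = toy)

# Propensity score = inverse-logit of the linear predictor.
ps_manual  <- plogis(predict(fit, type = "link"))
ps_builtin <- predict(fit, type = "response")   # same values as fit$fitted.values

# The sign of a coefficient gives the direction of its effect on P(treat = 1).
print(coef(fit)["nodegr"])
print(summary(ps_builtin))
```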

Matching on the Propensity Score

library(Matching)

# Perform nearest neighbor matching (1:3) using propensity scores
ps_match <- Match(Y = lalonde$re78, Tr = lalonde$treat, X = lalonde$psfit, M = 3, BiasAdjust = TRUE)

# Display ATT estimate
summary(ps_match)

## 
## Estimate...  2389.3 
## AI SE......  781.62 
## T-stat.....  3.0568 
## p.val......  0.0022372 
## 
## Original number of observations..............  445 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  652

Using nearest neighbor matching (1:3) on the propensity score with bias adjustment, the estimated ATT is $2,389.3, meaning NSW program participants earned, on average, $2,389.3 more than matched non-participants in 1978. The standard error (781.62) indicates moderate variability, and the p-value (0.0022) indicates that the effect is statistically significant at the 1% level. The bias adjustment corrects for residual differences in the propensity score that remain after matching. All 185 treated individuals were matched, with 652 matched pairs in total (unweighted, which exceeds 3 × 185 = 555 because ties are included). Note that the Abadie-Imbens standard error reported by Match() treats the propensity score as given; it does not account for the uncertainty introduced by the earlier logit estimation of the propensity score.

Bootstrapping for ATT

library(boot)

## 
## Attaching package: 'boot'

## The following object is masked from 'package:lattice':
## 
##     melanoma

# Define function for bootstrapping
boot_att <- function(data, index) {
  match_res <- Match(Y = data$re78[index], Tr = data$treat[index], X = data$psfit[index], M = 3, BiasAdjust = TRUE)
  return(match_res$est)
}

# Apply bootstrapping
boot_results <- boot(lalonde, boot_att, R = 1000)

# Display the bootstrapped 95% percentile confidence interval
boot.ci(boot_results, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_results, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 899, 3785 )  
## Calculations and Intervals on Original Scale

Using bootstrapping with 1,000 replications, we estimated a 95% percentile confidence interval of (899, 3,785) for the ATT, meaning the effect of the NSW program on earnings is likely between $899 and $3,785. As coded, each replicate resamples observations together with their already fitted propensity scores (psfit), so the interval reflects sampling variability in the matching step but not the uncertainty from estimating the propensity score itself; capturing that would require refitting the logit inside boot_att. Compared to the bias-adjusted ATT estimate (2,389.3 with SE = 781.62 and p = 0.0022), the bootstrapped confidence interval points to the same conclusion, as it does not include zero. This reinforces the finding that the NSW program had a statistically significant positive impact on earnings.
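One way to let a bootstrap interval reflect propensity score estimation is to refit the logit inside every replicate. The sketch below illustrates that on simulated data with a hand-rolled 1:1 nearest neighbor matcher; `att_ps` is a hypothetical helper, not Match(), and the data-generating process (true effect of 2) is an assumption.

```r
set.seed(11)
n <- 400
toy <- data.frame(x = rnorm(n))
toy$treat <- rbinom(n, 1, plogis(0.5 * toy$x))
toy$y <- 2 * toy$treat + toy$x + rnorm(n)   # true effect is 2 by construction

# Hypothetical helper: refit the logit, then match each treated unit to the
# control with the closest fitted propensity score (with replacement).
att_ps <- function(d) {
  ps    <- fitted(glm(treat ~ x, family = binomial, data = d))
  t_idx <- which(d$treat == 1)
  c_idx <- which(d$treat == 0)
  nn    <- sapply(ps[t_idx], function(p) c_idx[which.min(abs(ps[c_idx] - p))])
  mean(d$y[t_idx] - d$y[nn])
}

# Each replicate resamples rows and repeats BOTH steps (logit fit and matching).
boot_att <- replicate(500, att_ps(toy[sample(n, replace = TRUE), ]))
ci <- quantile(boot_att, c(0.025, 0.975))
print(ci)
```

The design choice here is that the propensity model sits inside the resampled function, so its estimation error propagates into the spread of `boot_att`. The same idea could be applied to `boot_att()` in the analysis by calling `glm()` on `data[index, ]` before `Match()`.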

Quality of the Propensity Score Matching

library(kdensity)

## Warning: package 'kdensity' was built under R version 4.4.3

library(ggplot2)

# Plot density of propensity scores for participants and non-participants
ggplot(lalonde, aes(x = psfit, fill = as.factor(treat))) + 
  geom_density(alpha = 0.5) + 
  labs(title = "Density of Propensity Scores by Treatment Status", fill = "Treatment")

# Histogram of propensity scores
hist(lalonde$psfit[lalonde$treat == 1], col = "blue", main = "Propensity Score Distribution", xlab = "Propensity Score")
hist(lalonde$psfit[lalonde$treat == 0], col = "red", add = TRUE)

The density plot and histogram of propensity scores show that the common support condition holds, meaning there is sufficient overlap between treated and control groups. The density plot indicates that while the distributions of propensity scores for the two groups differ, there is substantial overlap in the middle range (around 0.3 to 0.5), allowing for meaningful matching. However, there are some control units with very low scores and treated units with high scores, suggesting that some unmatched treated individuals may need to be excluded to improve balance.

The histogram further confirms this overlap, with the treated group (blue) having a concentration at higher propensity scores compared to the control group (red). While some treated individuals have no close matches at the extremes, the majority of the sample falls within a comparable range. To improve the quality of matching, imposing common support by trimming extreme propensity scores may enhance balance. Given this distribution, propensity score matching should effectively reduce bias, but additional balance checks, such as standardized mean differences, should be performed to confirm covariate balance post-matching.
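Trimming to the region of common support can be sketched as follows. The propensity scores here are made up, and the min/max rule shown is only one of several possible trimming rules.

```r
set.seed(5)
toy <- data.frame(
  treat = rep(c(1, 0), times = c(50, 80)),
  ps    = c(runif(50, 0.30, 0.70), runif(80, 0.15, 0.55))  # made-up scores
)

# Overlap region: from the larger of the two group minima
# to the smaller of the two group maxima.
lo <- max(tapply(toy$ps, toy$treat, min))
hi <- min(tapply(toy$ps, toy$treat, max))

# Keep only units whose score lies inside the overlap region.
trimmed <- toy[toy$ps >= lo & toy$ps <= hi, ]

print(c(lower = lo, upper = hi,
        kept = nrow(trimmed), dropped = nrow(toy) - nrow(trimmed)))
```

Dropping treated units outside the overlap changes the population to which the ATT refers, so the number of dropped units is worth reporting alongside the estimate.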

Another Matching Method

library(Matching)

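# Perform single nearest neighbor (1:1) matching with replacement on the propensity score
# (Match() performs nearest neighbor matching; it does not implement kernel matching)
nn1_match <- Match(Y = lalonde$re78, Tr = lalonde$treat, X = lalonde$psfit, M = 1, replace = TRUE)

# Display ATT estimate
summary(nn1_match)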

## 
## Estimate...  2624.3 
## AI SE......  802.19 
## T-stat.....  3.2714 
## p.val......  0.0010702 
## 
## Original number of observations..............  445 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  344

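This 1:1 match yields a higher ATT estimate ($2,624.3) than the bias-adjusted 1:3 nearest neighbor match ($2,389.3). It uses fewer matched pairs (unweighted: 344 vs. 652) and has a slightly larger standard error (802.19 vs. 781.62), but the effect remains strongly significant (p = 0.0011, t = 3.2714). Using only the closest control keeps matches tight, which limits bias from distant controls, at the cost of discarding information from other nearby controls. As with the other specifications, careful balance checks are still required.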
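For comparison, a kernel-matching estimator in the usual sense builds each treated unit's counterfactual as a kernel-weighted average of the controls whose propensity scores are close. The following is a minimal sketch on made-up numbers, assuming an Epanechnikov kernel and an arbitrary bandwidth `h = 0.1`; it is not what Match() computes.

```r
# Made-up toy data: treatment indicator, propensity score, and outcome.
toy <- data.frame(
  treat = c(1, 1, 0, 0, 0, 0),
  ps    = c(0.40, 0.60, 0.35, 0.45, 0.58, 0.90),
  y     = c(10, 14, 7, 9, 11, 20)
)
treated  <- toy[toy$treat == 1, ]
controls <- toy[toy$treat == 0, ]
h <- 0.1  # bandwidth: an arbitrary illustrative choice

# Epanechnikov kernel: positive weight only when |u| <= 1,
# i.e. for controls within h of the treated unit's score.
epan <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)

# Counterfactual for each treated unit: kernel-weighted mean of control outcomes.
y0_hat <- sapply(treated$ps, function(p) {
  w <- epan((controls$ps - p) / h)
  sum(w * controls$y) / sum(w)
})

att_kernel <- mean(treated$y - y0_hat)
print(att_kernel)
```

A smaller bandwidth behaves more like nearest neighbor matching (less bias, more variance), while a larger one averages over more distant controls. A treated unit with no control inside the bandwidth would produce a division by zero and would need to be dropped or handled separately.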

PS3

Mohan

2025-03-11