GENERAL NOTES

Objectives of Problem Set 3

  • The objective of this problem set is to apply multiple matching methods in R.

Submission process

  • When you are done with the problem set, publish it on RPubs using your temporary account.
  • Copy the RPubs link of your work and submit it on Canvas.
  • For any entirely identical submissions, whoever sent me their RPubs link last is presumed to have copied the others, so timely submissions are important. Own your work. I may randomly ask for your R script and .Rmd files for double-checking purposes.
  • As a standard practice, work in a script file before making your code chunks in the .Rmd file. Your .Rmd file and RPubs submission page MUST show the code used to produce any of the outputs you present in your answers.

Academic integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Practice Data

The primary practice data of this problem set is lalonde, the dataset of the “National Supported Work (NSW) Demonstration”. NSW was a US job training program that provided work experience for up to 1.5 years to socioeconomically disadvantaged individuals, with the goal of improving their labor market outcomes. This famous dataset comes from the experimental sample in Lalonde (AER 1986) and has been used in several other papers and textbooks: Dehejia and Wahba (JASA 1999), Dehejia and Wahba (RESTAT 2002), Mixtape, etc. You can load the package Matching by Sekhon (2011), which contains the lalonde dataset.

The lalonde sample contains 445 observations: 260 individuals not assigned to NSW training participation and 185 individuals assigned to it, which you can check by examining the treatment variable (treat). The outcome variable is real earnings in 1978 (re78); pre-treatment outcomes are real earnings in 1974 and 1975 (re74 and re75); binary indicators for being unemployed in 1974 and 1975 are u74 and u75; and covariates include age in years (age), years of schooling (educ), and binary indicators for Black (black) and Hispanic (hisp) individuals, etc. I suggest you call ?lalonde to open the help file and check the description of the dataset.

# Load packages
library(pacman)
p_load(Matching, Jmisc, lmtest, sandwich, kdensity, haven, boot, 
       cobalt, MatchIt, Zelig, estimatr, cem, tidyverse, 
       lubridate, usmap, gridExtra, stringr, readxl, plot3D,  
       cowplot, reshape2, scales, broom, data.table, ggplot2, stargazer,  
       foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont, 
       kableExtra, snakecase, janitor)

# Load data
data(lalonde)
attach(lalonde)
table(treat)
## treat
##   0   1 
## 429 185
#?lalonde

PROBLEM 1. Matching methods

i. Balance table.

Generate a table of summary statistics (mean and standard deviation) by treatment status with t-tests comparing the means of covariates, pre-treatment outcomes, and outcome for non-participants and participants to the NSW training. You may use packages like stargazer, xtable, kable, or any other package that helps you produce well-formatted balance tables. Interpret your table.
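
Since cobalt is loaded in the preamble, its bal.tab command offers a compact alternative to a hand-built table; a minimal sketch (it reports standardized mean differences rather than t-tests, and the covariate list shown is an assumption):

# Sketch: quick covariate balance check with cobalt::bal.tab
library(cobalt)
bal.tab(treat ~ age + educ + married + nodegree + re74 + re75,
        data = lalonde, s.d.denom = "pooled")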

### i. Balance table 

install.packages("dplyr")      
## Warning: package 'dplyr' is in use and will not be installed
install.packages("stargazer")  
## Warning: package 'stargazer' is in use and will not be installed
library(dplyr)
library(stargazer)

# Count occurrences of each race category
lalonde %>%
  count(race)
##     race   n
## 1  black 243
## 2 hispan  72
## 3  white 299
lalonde <- lalonde %>%
  mutate(
    black = ifelse(race == "black", 1, 0),
    hispanic = ifelse(race == "hispan", 1, 0),
    white = ifelse(race == "white", 1, 0)
  )


# Define the variables for comparison (excluding treat itself)
covariates <- c("age", "educ", "married", "black","hispanic", "white", "nodegree", "re74", "re75", "re78")

# Compute mean and standard deviation by treatment group
summary_table <- lalonde %>%
  group_by(treat) %>%
  summarise(across(all_of(covariates), 
                   list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))

# Print summary table
print(summary_table)
## # A tibble: 2 × 21
##   treat age_mean age_sd educ_mean educ_sd married_mean married_sd black_mean
##   <int>    <dbl>  <dbl>     <dbl>   <dbl>        <dbl>      <dbl>      <dbl>
## 1     0     28.0  10.8       10.2    2.86        0.513      0.500      0.203
## 2     1     25.8   7.16      10.3    2.01        0.189      0.393      0.843
## # ℹ 13 more variables: black_sd <dbl>, hispanic_mean <dbl>, hispanic_sd <dbl>,
## #   white_mean <dbl>, white_sd <dbl>, nodegree_mean <dbl>, nodegree_sd <dbl>,
## #   re74_mean <dbl>, re74_sd <dbl>, re75_mean <dbl>, re75_sd <dbl>,
## #   re78_mean <dbl>, re78_sd <dbl>
install.packages("tidyr") 
## Warning: package 'tidyr' is in use and will not be installed
library(tidyr)            


# Function to compute t-test p-values for each covariate
t_test_results <- sapply(covariates, function(var) {
  t_test <- t.test(lalonde[[var]] ~ lalonde$treat)
  return(t_test$p.value)  # Extract p-value
})

# Combine the summary stats with t-test results
summary_table_ttest <- summary_table %>%
  pivot_longer(cols = -treat, names_to = c("variable", "stat"), names_sep = "_") %>%
  pivot_wider(names_from = "stat", values_from = "value") %>%
  filter(treat == 0) %>%
  select(variable, mean, sd) %>%
  left_join(summary_table %>% filter(treat == 1) %>%
              pivot_longer(cols = -treat, names_to = c("variable", "stat"), names_sep = "_") %>%
              pivot_wider(names_from = "stat", values_from = "value") %>%
              select(variable, mean, sd),
            by = "variable", suffix = c("_control", "_treat")) %>%
  mutate(p_value = unname(t_test_results[variable]))  # Add p-values (unname strips the vector names)

# Print final table
print(summary_table_ttest)
## # A tibble: 10 × 6
##    variable mean_control sd_control mean_treat sd_treat  p_value
##    <chr>           <dbl>      <dbl>      <dbl>    <dbl>    <dbl>
##  1 age            28.0       10.8      25.8       7.16  2.91e- 3
##  2 educ           10.2        2.86     10.3       2.01  5.85e- 1
##  3 married         0.513      0.500     0.189     0.393 1.46e-16
##  4 black           0.203      0.403     0.843     0.365 1.21e-58
##  5 hispanic        0.142      0.350     0.0595    0.237 7.04e- 4
##  6 white           0.655      0.476     0.0973    0.297 5.50e-55
##  7 nodegree        0.597      0.491     0.708     0.456 6.98e- 3
##  8 re74         5619.      6789.     2096.     4887.    1.75e-12
##  9 re75         2466.      3292.     1532.     3219.    1.15e- 3
## 10 re78         6984.      7294.     6349.     7867.    3.49e- 1
stargazer(summary_table_ttest, type = "text",
          title = "Summary Statistics by Treatment Status",
          summary = FALSE, digits = 2)
## 
## Summary Statistics by Treatment Status
## ==================================================================================================================
##    variable   mean_control       sd_control         mean_treat         sd_treat             p_value              
## ------------------------------------------------------------------------------------------------------------------
## 1    age     28.030303030303  10.7866530174644   25.8162162162162  7.15501927478618   0.00291427440260211  
## 2    educ   10.2354312354312  2.85523844087595   10.3459459459459  2.01065025640563    0.584797660188313   
## 3  married  0.512820512820513 0.500419186851279 0.189189189189189  0.392721679149236  1.46127767613592e-16 
## 4   black   0.202797202797203 0.402552148350608 0.843243243243243  0.364557907176729  1.20588956186977e-58 
## 5  hispanic 0.142191142191142 0.349653835243938 0.0594594594594595 0.237124370526381  0.000704207057997086 
## 6   white   0.655011655011655 0.475919486195385 0.0972972972972973 0.297166406396285  5.49530172720531e-55 
## 7  nodegree 0.596736596736597 0.491125522231798 0.708108108108108  0.455866577054302  0.00698222507160265  
## 8    re74   5619.23650638695  6788.75079645538   2095.57368864865  4886.62035315108   1.74783141336639e-12 
## 9    re75   2466.48444312354  3291.99618331065   1532.05531378378   3219.2508699216   0.00115008478712992  
## 10   re78   6984.16974230769   7294.1617908684   6349.14353027027  7867.40221773469    0.349076655556669   
## ------------------------------------------------------------------------------------------------------------------

ii. Average Treatment Effect (ATE) using regression.

Return to one of our first readings for this class: Imbens and Wooldridge (JEL 2009). Revisit your notes for sections 5.1 about “Identification” on page 26 and 5.3 about “Regression Methods” on page 28, where the authors made the point that one could compute the ATE by estimating the OLS regression of the outcome re78 on a constant, the treatment status treat, the set of control variables \(X\), and the set of interactions between treat and the de-meaned transformation of each control variable \(\left(X-\overline{X}\right)\).
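
In compact form, the suggested specification is

\[
Y_i = \alpha + \tau D_i + X_i\beta + D_i\left(X_i-\overline{X}\right)\gamma + \varepsilon_i,
\]

where \(Y_i\) is re78, \(D_i\) is treat, and the coefficient \(\tau\) on treat is the estimated ATE.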

  • Relying upon your knowledge of the potential outcomes framework, the conditional independence assumption, and the overlap condition, explain briefly why the coefficient on the treatment status treat in the above regression is the estimated ATE.

  • Run the OLS regression and show the estimated ATE and its standard error. You may use the command demean from the package Jmisc by Chan (2014) to demean your control variables. Interpret your result. Does participation in the NSW training program increase real earnings in 1978?

install.packages("Jmisc") 
## Warning: package 'Jmisc' is in use and will not be installed
library(Jmisc)
library(dplyr)  

control_vars <- c("age", "educ", "married", "black", "hispanic", "white", 
                  "nodegree", "re74", "re75")


# Create new de-meaned variables without replacing originals
lalonde <- lalonde %>%
  mutate(across(all_of(control_vars), demean, .names = "demean_{.col}"))

# Create interaction terms
lalonde <- lalonde %>%
  mutate(across(starts_with("demean_"), ~ . * treat, .names = "{.col}_treat"))


# Create regression formula dynamically
formula <- as.formula(paste("re78 ~ treat +", 
                            paste(control_vars, collapse = " + "), "+", 
                            paste(paste0("demean_", control_vars, "_treat"), collapse = " + ")))

# Run OLS regression
m1 <- lm(formula, data = lalonde)

# Show summary of regression results
summary(m1)
## 
## Call:
## lm(formula = formula, data = lalonde)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14685  -4703  -1544   3742  53958 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.204e+03  2.829e+03   0.425  0.67066    
## treat                  1.075e+03  1.031e+03   1.043  0.29755    
## age                   -2.027e+01  3.645e+01  -0.556  0.57840    
## educ                   3.225e+02  1.817e+02   1.775  0.07643 .  
## married               -5.269e+01  8.035e+02  -0.066  0.94774    
## black                 -1.168e+03  8.816e+02  -1.324  0.18590    
## hispanic               4.961e+02  1.008e+03   0.492  0.62283    
## white                         NA         NA      NA       NA    
## nodegree               4.804e+02  1.013e+03   0.474  0.63545    
## re74                   3.767e-01  6.558e-02   5.744 1.48e-08 ***
## re75                   3.398e-01  1.219e-01   2.788  0.00547 ** 
## demean_age_treat       1.038e+02  8.355e+01   1.243  0.21448    
## demean_educ_treat      3.015e+02  3.926e+02   0.768  0.44283    
## demean_married_treat   1.085e+03  1.614e+03   0.672  0.50167    
## demean_black_treat     2.753e+01  1.955e+03   0.014  0.98877    
## demean_hispanic_treat -1.918e+02  2.875e+03  -0.067  0.94684    
## demean_white_treat            NA         NA      NA       NA    
## demean_nodegree_treat -7.994e+02  1.886e+03  -0.424  0.67185    
## demean_re74_treat     -3.372e-01  1.551e-01  -2.174  0.03010 *  
## demean_re75_treat     -2.510e-01  2.467e-01  -1.018  0.30928    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6904 on 596 degrees of freedom
## Multiple R-squared:  0.1698, Adjusted R-squared:  0.1461 
## F-statistic: 7.168 on 17 and 596 DF,  p-value: 4.558e-16
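
To report the ATE with heteroskedasticity-robust standard errors, a sketch along these lines can complement the plain summary (lmtest and sandwich are loaded in the preamble; the HC1 estimator is an assumption):

# Sketch: treat coefficient from m1 with HC1-robust standard errors
library(lmtest)
library(sandwich)
coeftest(m1, vcov = vcovHC(m1, type = "HC1"))["treat", ]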

iii. Average Treatment Effect for the Treated (ATT) using covariate matching.

You can practice a couple of covariate matching methods using the Match command from the Matching package. You would have to carefully check all the arguments of the command in the help file. The default estimand is the ATT.

  • One-to-one matching. Using the outcome, the treatment status, and all the control variables \(X\) from part (ii), implement a one-to-one matching and show the estimated ATT and its standard error.
install.packages("Matching")
## Warning: package 'Matching' is in use and will not be installed
library(Matching)

install.packages("dplyr")     
## Warning: package 'dplyr' is in use and will not be installed
library(dplyr)

# a. One-to-one matching

control_vars <- c("age", "educ", "married", "black", "hispanic", "white", 
                  "nodegree", "re74", "re75")

# Extract only the variables needed for matching
X <- lalonde[, control_vars]

# Perform one-to-one matching without replacement (Match's default distance weights covariates by inverse variance)
match_result <- Match(Y = lalonde$re78,     
                      Tr = lalonde$treat,   
                      X = X,                
                      M = 1,               
                      replace = FALSE)     

# Display matching results
summary(match_result)
## 
## Estimate...  861.43 
## SE.........  676.21 
## T-stat.....  1.2739 
## p.val......  0.2027 
## 
## Original number of observations..............  614 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  185
  • Exact matching. Using the outcome, the treatment status, and all the control variables \(X\) from part (ii), implement an exact matching and show the estimated ATT and its standard error. Qualitatively, how does your result change from the one-to-one to the exact matching? Explain the differences.
match_result_2 <- Match(Y = lalonde$re78,    
                      Tr = lalonde$treat,  
                      X = lalonde[, control_vars], 
                      exact = TRUE,        
                      estimand = "ATT",   
                      replace = FALSE)     


summary(match_result_2)
## 
## Estimate...  -1199.8 
## SE.........  2739.1 
## T-stat.....  -0.43801 
## p.val......  0.66138 
## 
## Original number of observations..............  614 
## Original number of treated obs...............  185 
## Matched number of observations...............  10 
## Matched number of observations  (unweighted).  10 
## 
## Number of obs dropped by 'exact' or 'caliper'  175
  • Nearest neighbor matching (1:2). Using the outcome, the treatment status, and all the control variables \(X\) from part (ii), implement a nearest neighbor matching with two matches (1:2 matching) without bias correction and show the estimated ATT and its standard error.
# Perform nearest neighbor matching (1 treated : 2 control)
match_result_3 <- Match(Y = lalonde$re78,     
                      Tr = lalonde$treat,   
                      X = X,                
                      M = 2,                
                      replace = FALSE)      


summary(match_result_3)
## 
## Estimate...  -121.59 
## SE.........  706.67 
## T-stat.....  -0.17206 
## p.val......  0.86339 
## 
## Original number of observations..............  614 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  370
  • Nearest neighbor matching (1:3). Using the outcome, the treatment status, and all the control variables \(X\) from part (ii), implement a nearest neighbor matching with three matches (1:3 matching) without bias correction and show the estimated ATT and its standard error.
# d. Nearest neighbor matching (1:3)

match_result_4 <- Match(Y = lalonde$re78,   
                        Tr = lalonde$treat,  
                        X = X,      
                        M = 3,               
                        estimand = "ATT",    
                        replace = FALSE)     


summary(match_result_4)
## 
## Estimate...  -702.6 
## SE.........  901.49 
## T-stat.....  -0.77938 
## p.val......  0.43576 
## 
## Original number of observations..............  614 
## Original number of treated obs...............  185 
## Matched number of observations...............  143 
## Matched number of observations  (unweighted).  429
  • Bias correction. Matching estimates are biased when a training participant's matches are not fully comparable in covariates. This can be adjusted using a regression-based correction. You may read Abadie & Imbens (JBES 2011) to understand this correction, which can be applied using the argument “BiasAdjust = T” of the Match command. Implement the nearest neighbor matching (1:2) and (1:3), this time with the bias-adjustment option. Qualitatively, how does your result change when you apply the regression-based correction to the nearest neighbor matching? Qualitatively, how does your result change from the one-to-one to the nearest neighbor matching with bias correction?
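
In brief, letting \(\hat{\mu}_0\) denote an estimated regression of the outcome on \(X\) among controls and \(\mathcal{J}_M(i)\) the set of the \(M\) matches for treated unit \(i\), the adjusted estimator replaces each matched outcome \(Y_j\) with \(Y_j + \hat{\mu}_0(X_i) - \hat{\mu}_0(X_j)\):

\[
\widehat{ATT}_{adj} = \frac{1}{N_1}\sum_{i:\,D_i=1}\left[Y_i - \frac{1}{M}\sum_{j\in\mathcal{J}_M(i)}\left(Y_j + \hat{\mu}_0(X_i) - \hat{\mu}_0(X_j)\right)\right].
\]
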
# 1:2 Nearest Neighbor Matching
match_result_5 <- Match(Y = lalonde$re78,   
                        Tr = lalonde$treat,  
                        X = X,      
                        M = 2,               
                        estimand = "ATT",    
                        BiasAdjust = TRUE)   

summary(match_result_5)
## 
## Estimate...  832.39 
## AI SE......  1062.1 
## T-stat.....  0.7837 
## p.val......  0.43322 
## 
## Original number of observations..............  614 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  385
# 1:3 Nearest Neighbor Matching
match_result_6 <- Match(Y = lalonde$re78,   
                        Tr = lalonde$treat,  
                        X = X,      
                        M = 3,               
                        estimand = "ATT",    
                        BiasAdjust = TRUE)   

summary(match_result_6)
## 
## Estimate...  1167.4 
## AI SE......  1009.9 
## T-stat.....  1.1559 
## p.val......  0.2477 
## 
## Original number of observations..............  614 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  583
  • Coarsened exact matching. Implement a coarsened exact matching and show the estimated ATT and its standard error. Qualitatively, how does your result change relative to the exact matching? Explain the differences.
install.packages("cem")  
## Warning: package 'cem' is in use and will not be installed
library(cem)


control_vars <- c("age", "educ", "married", "black", "hispanic", "white", "nodegree", "re74", "re75")

lalonde_subset <- lalonde[, c("treat", control_vars)]  
cem_match <- cem(treatment = "treat", data = lalonde_subset)
## 
## Using 'treat'='1' as baseline group
summary(cem_match)
##                Length Class  Mode     
## call             3    -none- call     
## strata         614    -none- numeric  
## n.strata         1    -none- numeric  
## vars             9    -none- character
## drop             1    -none- character
## breaks           9    -none- list     
## treatment        1    -none- character
## n                1    -none- numeric  
## groups         614    factor numeric  
## g.names          2    -none- character
## n.groups         1    -none- numeric  
## group.idx        2    -none- list     
## group.len        2    -none- numeric  
## mstrata        614    -none- numeric  
## mstrataID       25    -none- numeric  
## matched        614    -none- logical  
## baseline.group   1    -none- character
## tab              6    -none- numeric  
## k2k              1    -none- logical  
## w              614    -none- numeric
att_model <- lm(re78 ~ treat, data = lalonde, weights = cem_match$w)

# Display regression results
summary(att_model)
## 
## Call:
## lm(formula = re78 ~ treat, data = lalonde, weights = cem_match$w)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -11694      0      0      0  28224 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5459.4      682.9   7.995 3.85e-13 ***
## treat          415.4     1000.6   0.415    0.679    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6031 on 144 degrees of freedom
## Multiple R-squared:  0.001196,   Adjusted R-squared:  -0.00574 
## F-statistic: 0.1724 on 1 and 144 DF,  p-value: 0.6786
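
The cem package also provides an att command that estimates the ATT on the matched strata directly; a sketch, assuming the rows of lalonde line up with those of lalonde_subset used in cem():

# Sketch: ATT via cem::att (linear model on the matched, weighted sample)
cem_att <- att(cem_match, re78 ~ treat, data = lalonde)
cem_att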

iv. Average Treatment Effect for the Treated (ATT) using propensity score matching.

  • Directed Acyclic Graph (DAG). Draw a DAG corresponding to the propensity score matching method that you can apply to the causal relationship between NSW training participation \(\left(D\right)\) and real earnings \(\left(Y\right)\). Describe the DAG.

  • Descriptive graphs. What is the shape of the relationship between age and real earnings in 1974, 1975, and 1978? Is there a difference in the distribution of age between NSW non-participants and participants? What is the shape of the relationship between education and real earnings in 1974, 1975, and 1978? Is there a difference in the distribution of years of schooling between NSW non-participants and participants? Discuss the graphs.

### b. Descriptive graphs
install.packages("dplyr")   
## Warning: package 'dplyr' is in use and will not be installed
install.packages("tidyr")   
## Warning: package 'tidyr' is in use and will not be installed
install.packages("ggplot2") 
## Warning: package 'ggplot2' is in use and will not be installed
library(dplyr) 
library(tidyr)  
library(ggplot2) 

### The relationship between age and real earnings in 1974, 1975, and 1978

# Convert the dataset into long format for easier plotting

lalonde_long <- lalonde %>%
  select(age, re74, re75, re78) %>%
  pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings")

# Create the plot
ggplot(lalonde_long, aes(x = age, y = earnings, color = year)) +
  geom_point(alpha = 0.5) +  # Scatter plot
  geom_smooth(method = "loess", se = FALSE) +  # Smoothed trend line
  labs(title = "Relationship Between Age and Real Earnings (1974, 1975, 1978)",
       x = "Age",
       y = "Real Earnings",
       color = "Year") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

### Age and real earnings separately for participants and non-participants

# Step 1: Filter only the treatment group
lalonde_treated <- lalonde %>%
  filter(treat == 1) %>% 
  select(age, re74, re75, re78) %>%
  pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings")

# Step 2: Draw the graph for the treatment group
ggplot(lalonde_treated, aes(x = age, y = earnings, color = year)) +
  geom_point(alpha = 0.5) +  # Scatter plot
  geom_smooth(method = "loess", se = FALSE) +  # Smoothed trend line
  labs(title = "Age vs. Real Earnings (Treatment Group - treat = 1)",
       x = "Age",
       y = "Real Earnings",
       color = "Year") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Step 3: Filter only the control group
lalonde_control <- lalonde %>%
  filter(treat == 0) %>% 
  select(age, re74, re75, re78) %>%
  pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings")

# Step 4: Draw the graph for the control group
ggplot(lalonde_control, aes(x = age, y = earnings, color = year)) +
  geom_point(alpha = 0.5) +  # Scatter plot
  geom_smooth(method = "loess", se = FALSE) +  # Smoothed trend line
  labs(title = "Age vs. Real Earnings (Control Group - treat = 0)",
       x = "Age",
       y = "Real Earnings",
       color = "Year") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
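
The question about the distribution of age itself across the two groups can be answered directly with overlaid densities; a minimal sketch:

# Sketch: distribution of age for participants vs. non-participants
ggplot(lalonde, aes(x = age, fill = factor(treat))) +
  geom_density(alpha = 0.5) +
  labs(title = "Age Distribution by NSW Participation",
       x = "Age", y = "Density", fill = "treat") +
  theme_minimal()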

### The relationship between education and real earnings in 1974, 1975, and 1978

# Transform dataset into long format for plotting
lalonde_long_educ <- lalonde %>%
  select(educ, re74, re75, re78) %>%
  pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings") %>%
  mutate(year = factor(year, labels = c("1974", "1975", "1978")))  # Rename years

# Create the scatter plot with trend lines
ggplot(lalonde_long_educ, aes(x = educ, y = earnings, color = year)) +
  geom_point(alpha = 0.6, size = 3) +  # Scatter plot points
  geom_smooth(method = "lm", se = FALSE) +  # Linear regression line
  labs(title = "Education vs. Real Earnings (1974, 1975, 1978)",
       x = "Years of Education",
       y = "Real Earnings",
       color = "Year") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

### Education and real earnings separately for participants and non-participants

# Transform dataset into long format for easier plotting
lalonde_long_educ <- lalonde %>%
  select(treat, educ, re74, re75, re78) %>%
  pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings") %>%
  mutate(year = factor(year, labels = c("1974", "1975", "1978")))  # Rename years

# Split data into treatment and control groups
lalonde_treated <- lalonde_long_educ %>% filter(treat == 1)
lalonde_control <- lalonde_long_educ %>% filter(treat == 0)

# Plot for Treatment Group
plot_treated <- ggplot(lalonde_treated, aes(x = educ, y = earnings, color = year)) +
  geom_point(alpha = 0.6, size = 3) +  # Scatter plot points
  geom_smooth(method = "lm", se = FALSE) +  # Linear regression line
  labs(title = "Education vs. Real Earnings (Treatment Group - treat = 1)",
       x = "Years of Education",
       y = "Real Earnings",
       color = "Year") +
  theme_minimal()

# Plot for Control Group
plot_control <- ggplot(lalonde_control, aes(x = educ, y = earnings, color = year)) +
  geom_point(alpha = 0.6, size = 3) +  # Scatter plot points
  geom_smooth(method = "lm", se = FALSE) +  # Linear regression line
  labs(title = "Education vs. Real Earnings (Control Group - treat = 0)",
       x = "Years of Education",
       y = "Real Earnings",
       color = "Year") +
  theme_minimal()

# Display plots side by side
library(gridExtra)
grid.arrange(plot_treated, plot_control, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
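
Likewise, the distribution of years of schooling by participation status can be shown directly; a sketch (the one-year binwidth is an assumption):

# Sketch: distribution of years of schooling by treatment status
ggplot(lalonde, aes(x = educ, fill = factor(treat))) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 1) +
  labs(title = "Education Distribution by NSW Participation",
       x = "Years of Education", y = "Count", fill = "treat") +
  theme_minimal()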

  • Propensity score. Using the glm command and the argument “family=binomial”, estimate a logit model of the probability of participation conditional on the full set of observables that you have, with the appropriate polynomial function for age and educ, and store the fitted values of this propensity score as psfit. Present the results. You may use packages like stargazer, texreg, or any other package that helps you produce well-formatted estimation tables.

  • Matching on the propensity score. Using the outcome, the treatment status, and the fitted propensity score psfit \(p(X)\), implement a nearest neighbor matching with three matches (1:3 matching) and bias-adjustment option (“BiasAdjust = T”) and show the estimated ATT and its standard error obtained from the Match command. With the bias-adjustment option, does the standard error take into account the earlier logit estimation of the propensity score?

##### c. Propensity score

library(Matching)

# Step 1: Estimate propensity scores using logistic regression
ps_model <- glm(treat ~ age + educ, data = lalonde, family = binomial())


lalonde$psfit <- ps_model$fitted.values

summary(ps_model)
## 
## Call:
## glm(formula = treat ~ age + educ, family = binomial(), data = lalonde)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.266059   0.466844  -0.570   0.5687  
## age         -0.023963   0.009587  -2.500   0.0124 *
## educ         0.006700   0.034815   0.192   0.8474  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.49  on 613  degrees of freedom
## Residual deviance: 744.70  on 611  degrees of freedom
## AIC: 750.7
## 
## Number of Fisher Scoring iterations: 4
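
The prompt asks for the full set of observables with an appropriate polynomial in age and educ; a sketch of such a richer specification (the quadratic order is an assumption, and the fit is stored separately so psfit from the model above is untouched):

# Sketch: logit with quadratic terms in age and educ plus the remaining
# observables (white is omitted because it is collinear with black and hispanic)
ps_model_full <- glm(treat ~ age + I(age^2) + educ + I(educ^2) + married +
                       black + hispanic + nodegree + re74 + re75,
                     data = lalonde, family = binomial())
summary(ps_model_full)
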
# Step 2: Nearest neighbor matching using the estimated propensity score
match_ps <- Match(Y = lalonde$re78,  
                  Tr = lalonde$treat,  
                  X = lalonde$psfit,  
                  M = 3,  
                  replace = FALSE,  
                  BiasAdjust = TRUE)  
## Warning in Match(Y = lalonde$re78, Tr = lalonde$treat, X = lalonde$psfit, :
## Bias Adjustment can only be estimated when ties==TRUE and replace=TRUE.
## Setting BiasAdjust=FALSE
# Display matching results
summary(match_ps)
## 
## Estimate...  -702.6 
## SE.........  853.82 
## T-stat.....  -0.82289 
## p.val......  0.41057 
## 
## Original number of observations..............  614 
## Original number of treated obs...............  185 
## Matched number of observations...............  143 
## Matched number of observations  (unweighted).  429
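
Per the warning above, Match applies the bias adjustment only when matching with replacement (replace = TRUE, together with the default ties = TRUE); a sketch of a call where BiasAdjust actually takes effect:

# Sketch: matching with replacement so BiasAdjust = TRUE is not ignored
match_ps_adj <- Match(Y = lalonde$re78, Tr = lalonde$treat,
                      X = lalonde$psfit, M = 3,
                      replace = TRUE, BiasAdjust = TRUE)
summary(match_ps_adj)
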
  • Bootstrapping. Using the boot package, re-estimate the ATT with bootstrapped standard errors that take into account the logit estimation of the propensity score. Compare the results with those that apply the bias-adjustment option without bootstrap inference.
library(boot)

# Define a function to estimate ATT using bootstrap
boot_att <- function(data, indices) {
  d <- data[indices, ]  # Resample data
  ps_model <- glm(treat ~ age + educ, data = d, family = binomial())
  d$psfit <- ps_model$fitted.values
  match_ps <- Match(Y = d$re78, Tr = d$treat, X = d$psfit, M = 3, BiasAdjust = TRUE)
  return(match_ps$est)  # Return ATT estimate
}

# Run bootstrap with 1000 replications
set.seed(123)
boot_results <- boot(data = lalonde, statistic = boot_att, R = 1000)

# Get bootstrapped standard error
boot_se <- sd(boot_results$t)
print(paste("Bootstrapped Standard Error:", boot_se))
## [1] "Bootstrapped Standard Error: 853.071770030857"
  • Quality of the propensity score matching.
    • Present the summary statistics of the fitted propensity score psfit for participants and non-participants.
library(dplyr)
library(stargazer)

# Compute summary statistics separately for participants (treat = 1) and non-participants (treat = 0)
summary_ps <- lalonde %>%
  group_by(treat) %>%
  summarise(
    count = n(),
    mean = mean(psfit),
    sd = sd(psfit),
    min = min(psfit),
    max = max(psfit),
    median = median(psfit)
  )


# Round all numerical columns to 2 decimal places before passing to stargazer
summary_ps_rounded <- summary_ps %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))

# Generate summary statistics table with 2 decimal places
stargazer(summary_ps_rounded, type = "text", summary = FALSE,
          title = "Summary Statistics of Propensity Scores (psfit)",
          digits = 2)  
## 
## Summary Statistics of Propensity Scores (psfit)
## ========================================
##   treat count mean  sd  min  max  median
## ----------------------------------------
## 1   0    429  0.3  0.05 0.17 0.36  0.31 
## 2   1    185  0.31 0.04 0.2  0.35  0.31 
## ----------------------------------------
  • Using the kdensity package, plot the density of the fitted propensity score for participants and non-participants. Hint: psfitdens1 = kdensity(psfit[D==1]) and plot(psfitdens1) for participants. Alternatively, you could use ggplot as in the lecture slides: ggplot(lalonde, aes(psfit, fill = as.factor(D))) + stat_density(aes(y = ..density..), position = "identity", color = "black", alpha=0.5)
### b. Plot the density of the fitted propensity score for participants and 
### non-participants

# Load necessary libraries
library(kdensity)
library(ggplot2)

# Compute kernel density for participants (treat = 1)
psfitdens1 <- kdensity(lalonde$psfit[lalonde$treat == 1])

# Compute kernel density for non-participants (treat = 0)
psfitdens0 <- kdensity(lalonde$psfit[lalonde$treat == 0])



# Plot both densities in the same graph
plot(psfitdens1, main = "Density of Propensity Score (Participants vs. Non-Participants)",
     xlab = "Propensity Score", ylab = "Density", col = "blue", lwd = 2)
lines(psfitdens0, col = "red", lwd = 2)  # Add non-participants' density
legend("topleft", legend = c("Participants", "Non-Participants"), 
       col = c("blue", "red"), lwd = 2)

  • Plot the histogram of the fitted propensity score psfit for participants and non-participants. Hint: hist(psfit[D==0]) for non-participants. Alternatively, you may use one of many great-looking options provided by the package cobalt here or there.
### c.  Plot the histogram of the fitted propensity score psfit for participants
### and non-participants

# Create histogram for non-participants first
hist(lalonde$psfit[lalonde$treat == 0], 
     main = "Histogram of Propensity Scores (Participants vs. Non-Participants)", 
     xlab = "Propensity Score", col = rgb(1, 0, 0, 0.5), border = "black", breaks = 20)

# Add histogram for participants on top
hist(lalonde$psfit[lalonde$treat == 1], 
     col = rgb(0, 0, 1, 0.5), border = "black", breaks = 20, add = TRUE)

# Add legend
legend("topleft", legend = c("Non-Participants", "Participants"), 
       fill = c("red", "blue"))

  • Does the overlap condition hold?
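
Beyond the plots, a quick numeric check (a sketch) compares the support of the fitted propensity score across groups; overlap requires the two ranges to share a wide common interval:

# Sketch: range of the fitted propensity score by treatment status
tapply(lalonde$psfit, lalonde$treat, range)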

  • Imposing the common support, run your most reliable propensity score matching estimation using the command Match from the package Matching and the command matchit from the package MatchIt. Compare the estimated results.

### e. Impose the common support

# Identify the minimum and maximum propensity scores in each group
ps_min <- max(min(lalonde$psfit[lalonde$treat == 1]), min(lalonde$psfit[lalonde$treat == 0]))
ps_max <- min(max(lalonde$psfit[lalonde$treat == 1]), max(lalonde$psfit[lalonde$treat == 0]))

# Keep only observations within the common support
lalonde_common_support <- lalonde %>%
  filter(psfit >= ps_min & psfit <= ps_max)

# Check the range after imposing common support
summary(lalonde_common_support$psfit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1995  0.2790  0.3133  0.3030  0.3383  0.3529
# Perform nearest neighbor matching with bias adjustment
match_result <- Match(
  Y = lalonde_common_support$re78,  
  Tr = lalonde_common_support$treat,  
  X = lalonde_common_support$psfit,  
  M = 1,  
  replace = FALSE,  
  BiasAdjust = TRUE  
)
## Warning in Match(Y = lalonde_common_support$re78, Tr =
## lalonde_common_support$treat, : Bias Adjustment can only be estimated when
## ties==TRUE and replace=TRUE.  Setting BiasAdjust=FALSE
# Display results
summary(match_result)
## 
## Estimate...  -709.89 
## SE.........  731.2 
## T-stat.....  -0.97086 
## p.val......  0.33162 
## 
## Original number of observations..............  560 
## Original number of treated obs...............  185 
## Matched number of observations...............  185 
## Matched number of observations  (unweighted).  185
#### Estimation using MatchIt

install.packages("MatchIt")   
## Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.3:
##   cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.3/PACKAGES'
## package 'MatchIt' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'MatchIt'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\ybc5427\AppData\Local\Programs\R\R-4.3.3\library\00LOCK\MatchIt\libs\x64\MatchIt.dll
## to
## C:\Users\ybc5427\AppData\Local\Programs\R\R-4.3.3\library\MatchIt\libs\x64\MatchIt.dll:
## Permission denied
## Warning: restored 'MatchIt'
## 
## The downloaded binary packages are in
##  C:\Users\ybc5427\AppData\Local\Temp\RtmpwjEFQl\downloaded_packages
library(MatchIt)
## 
## Attaching package: 'MatchIt'
## The following object is masked _by_ '.GlobalEnv':
## 
##     lalonde
## The following object is masked from 'package:cobalt':
## 
##     lalonde
# Perform propensity score matching
psm_model <- matchit(
  treat ~ age + educ,  
  data = lalonde,      
  method = "nearest",  # Nearest neighbor matching (1:1)
  distance = "logit",  
  replace = FALSE     
)

# Summary of the matching results
summary(psm_model)
## 
## Call:
## matchit(formula = treat ~ age + educ, data = lalonde, method = "nearest", 
##     distance = "logit", replace = FALSE)
## 
## Summary of Balance for All Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3080        0.2984          0.2734     0.4606    0.0881
## age            25.8162       28.0303         -0.3094     0.4400    0.0813
## educ           10.3459       10.2354          0.0550     0.4959    0.0347
##          eCDF Max
## distance   0.1663
## age        0.1577
## educ       0.1114
## 
## Summary of Balance for Matched Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3080        0.3077          0.0094     0.9963    0.0033
## age            25.8162       25.8649         -0.0068     1.0300    0.0050
## educ           10.3459       10.2865          0.0296     0.5886    0.0253
##          eCDF Max Std. Pair Dist.
## distance   0.0432          0.0146
## age        0.0162          0.0597
## educ       0.1189          0.8146
## 
## Sample Sizes:
##           Control Treated
## All           429     185
## Matched       185     185
## Unmatched     244       0
## Discarded       0       0
# Check balance summary
summary(psm_model, standardize = TRUE)
## 
## Call:
## matchit(formula = treat ~ age + educ, data = lalonde, method = "nearest", 
##     distance = "logit", replace = FALSE)
## 
## Summary of Balance for All Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3080        0.2984          0.2734     0.4606    0.0881
## age            25.8162       28.0303         -0.3094     0.4400    0.0813
## educ           10.3459       10.2354          0.0550     0.4959    0.0347
##          eCDF Max
## distance   0.1663
## age        0.1577
## educ       0.1114
## 
## Summary of Balance for Matched Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3080        0.3077          0.0094     0.9963    0.0033
## age            25.8162       25.8649         -0.0068     1.0300    0.0050
## educ           10.3459       10.2865          0.0296     0.5886    0.0253
##          eCDF Max Std. Pair Dist.
## distance   0.0432          0.0146
## age        0.0162          0.0597
## educ       0.1189          0.8146
## 
## Sample Sizes:
##           Control Treated
## All           429     185
## Matched       185     185
## Unmatched     244       0
## Discarded       0       0
# Extract matched dataset
matched_data <- match.data(psm_model)
att_model <- lm(re78 ~ treat, data = matched_data)
summary(att_model)
## 
## Call:
## lm(formula = re78 ~ treat, data = matched_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7474  -6340  -1959   3631  53959 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7474.4      553.4  13.506   <2e-16 ***
## treat        -1125.2      782.7  -1.438    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7527 on 368 degrees of freedom
## Multiple R-squared:  0.005585,   Adjusted R-squared:  0.002883 
## F-statistic: 2.067 on 1 and 368 DF,  p-value: 0.1514
  • Are there any other quality checks you can perform? Present your results.
#### f. Are there any other quality checks you can perform?
# Balance summary before and after matching
summary(psm_model, standardize = TRUE)
## 
## Call:
## matchit(formula = treat ~ age + educ, data = lalonde, method = "nearest", 
##     distance = "logit", replace = FALSE)
## 
## Summary of Balance for All Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3080        0.2984          0.2734     0.4606    0.0881
## age            25.8162       28.0303         -0.3094     0.4400    0.0813
## educ           10.3459       10.2354          0.0550     0.4959    0.0347
##          eCDF Max
## distance   0.1663
## age        0.1577
## educ       0.1114
## 
## Summary of Balance for Matched Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3080        0.3077          0.0094     0.9963    0.0033
## age            25.8162       25.8649         -0.0068     1.0300    0.0050
## educ           10.3459       10.2865          0.0296     0.5886    0.0253
##          eCDF Max Std. Pair Dist.
## distance   0.0432          0.0146
## age        0.0162          0.0597
## educ       0.1189          0.8146
## 
## Sample Sizes:
##           Control Treated
## All           429     185
## Matched       185     185
## Unmatched     244       0
## Discarded       0       0
library(cobalt)
love.plot(psm_model, thresholds = 0.1)  # Love plot (SMD before & after matching)

  • Another matching method. Implement another matching method studied in class that you have not shown so far. Qualitatively, how do the results differ from the preceding ones?
###### G. Another matching method

### a. Adding inverse propensity score weights to the matching
# Step 1: Estimate propensity scores
ps_model <- glm(treat ~ age + educ, data = lalonde, family = binomial)
lalonde$pscore <- predict(ps_model, type = "response")

# Step 2: Compute inverse propensity score weights
lalonde$weights <- ifelse(lalonde$treat == 1, 
                          1 / lalonde$pscore,     # Treated units
                          1 / (1 - lalonde$pscore))  # Control units

# Step 3: Run Matching with Weights
match_result_9 <- Match(Y = lalonde$re78,   
                        Tr = lalonde$treat,  
                        X = X,      
                        M = 2,               
                        estimand = "ATT",    
                        BiasAdjust = TRUE,    
                        weights = lalonde$weights)  # Use computed weights

summary(match_result_9)
## 
## Estimate...  1050.2 
## AI SE......  977.7 
## T-stat.....  1.0741 
## p.val......  0.28277 
## 
## Original number of observations (weighted)...  1224.381 
## Original number of observations..............  614 
## Original number of treated obs (weighted)....  609.753 
## Original number of treated obs...............  185 
## Matched number of observations...............  609.753 
## Matched number of observations  (unweighted).  385
### b. We can also adopt caliper matching

psm_caliper <- matchit(treat ~ age + educ, 
                       data = lalonde, 
                       method = "nearest", 
                       caliper = 0.1)  # Restricts matches to within 0.1 SD of the distance measure

summary(psm_caliper, standardize = TRUE)
## 
## Call:
## matchit(formula = treat ~ age + educ, data = lalonde, method = "nearest", 
##     caliper = 0.1)
## 
## Summary of Balance for All Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3080        0.2984          0.2734     0.4606    0.0881
## age            25.8162       28.0303         -0.3094     0.4400    0.0813
## educ           10.3459       10.2354          0.0550     0.4959    0.0347
##          eCDF Max
## distance   0.1663
## age        0.1577
## educ       0.1114
## 
## Summary of Balance for Matched Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3080        0.3077          0.0094     0.9963    0.0033
## age            25.8162       25.8649         -0.0068     1.0300    0.0050
## educ           10.3459       10.2865          0.0296     0.5886    0.0253
##          eCDF Max Std. Pair Dist.
## distance   0.0432          0.0146
## age        0.0162          0.0597
## educ       0.1189          0.8146
## 
## Sample Sizes:
##           Control Treated
## All           429     185
## Matched       185     185
## Unmatched     244       0
## Discarded       0       0
# Extract matched dataset, rename weights column
matched_data <- match.data(psm_caliper, weights = "matching_weights")

# Run regression on matched data
att_model <- lm(re78 ~ treat, data = matched_data)

# Show results
summary(att_model)
## 
## Call:
## lm(formula = re78 ~ treat, data = matched_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7474  -6340  -1959   3631  53959 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7474.4      553.4  13.506   <2e-16 ***
## treat        -1125.2      782.7  -1.438    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7527 on 368 degrees of freedom
## Multiple R-squared:  0.005585,   Adjusted R-squared:  0.002883 
## F-statistic: 2.067 on 1 and 368 DF,  p-value: 0.1514
  • Wrapping up for a non-technical audience (not graded). If you want to give a summary presentation to a non-technical audience, you may get inspiration from these slides: PSM for policy-makers.

HAVE FUN AND KEEP FAITH IN THE FUN!