Publish your work on RPubs using your temporary account, copy the RPubs link of your work, and submit it on Canvas. If two RPubs pages are identical, whoever submitted the link last has copied the others. So, timely submissions are important. Own your work: I can randomly ask for your R script and .Rmd files for double-checking purposes. As a standard practice, work in a script file before making your code chunks in the .Rmd file. Your .Rmd file and RPubs submission page MUST show the code used to produce any of the outputs you present in your answers.
Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.
Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.
The primary practice data for this problem set is lalonde, the dataset of the
“National Supported Work (NSW) Demonstration.” NSW
is a job training program in the US that provided work experience for a
period of up to 1.5 years to socioeconomically disadvantaged individuals,
with the goal of increasing their labor market outcomes. This famous
dataset is sourced from the experimental sample in Lalonde (AER 1986) and
has been used in several other papers and textbooks: Dehejia and Wahba
(JASA 1999), Dehejia and Wahba
(RESTAT 2002), the Mixtape,
etc. You can load the package Matching by Sekhon (2011),
which contains the lalonde dataset.
The lalonde sample has a size of 445 observations,
including 260 individuals not assigned to NSW training participation and
185 individuals assigned to it, which you can check by examining the
treatment variable (treat). The outcome variable is real
earnings in 1978 (re78); the pre-treatment outcomes are real
earnings in 1974 and 1975 (re74 and re75);
binary indicators flag being unemployed in 1974 and 1975
(u74 and u75); and the covariates include age in
years (age), years of schooling (educ), binary
indicators for Black (black) and Hispanic
(hisp) individuals, etc. I suggest you call ?lalonde to
open the help file and check the description of the dataset.
# Load packages
library(pacman)
p_load(Matching, Jmisc, lmtest, sandwich, kdensity, haven, boot,
cobalt, MatchIt, Zelig, estimatr, cem, tidyverse,
lubridate, usmap, gridExtra, stringr, readxl, plot3D,
cowplot, reshape2, scales, broom, data.table, ggplot2, stargazer,
foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont,
kableExtra, snakecase, janitor)
# Load data
data(lalonde)
attach(lalonde)
table(treat)
## treat
## 0 1
## 429 185
#?lalonde
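# Note: this tabulation shows 429 controls and 185 treated (614 observations),
# not the 260/185 experimental split described above. The lalonde object loaded
# here appears to be the observational version shipped with cobalt/MatchIt,
# which masks Matching's experimental sample (see the masking message printed
# when MatchIt is attached further below).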
Generate a table of summary statistics (mean and standard deviation)
by treatment status with t-tests comparing the means of covariates,
pre-treatment outcomes, and outcome for non-participants and
participants to the NSW training. You may use packages like
stargazer, xtable, kable, or any
other package that helps you produce well-formatted balance tables.
Interpret your table.
### i. Balance table
install.packages("dplyr")
## Warning: package 'dplyr' is in use and will not be installed
install.packages("stargazer")
## Warning: package 'stargazer' is in use and will not be installed
library(dplyr)
library(stargazer)
# Count occurrences of each race category
lalonde %>%
count(race)
## race n
## 1 black 243
## 2 hispan 72
## 3 white 299
lalonde <- lalonde %>%
mutate(
black = ifelse(race == "black", 1, 0),
hispanic = ifelse(race == "hispan", 1, 0),
white = ifelse(race == "white", 1, 0)
)
# Define the variables for comparison (excluding treat itself)
covariates <- c("age", "educ", "married", "black","hispanic", "white", "nodegree", "re74", "re75", "re78")
# Compute mean and standard deviation by treatment group
summary_table <- lalonde %>%
group_by(treat) %>%
summarise(across(all_of(covariates),
list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))
# Print summary table
print(summary_table)
## # A tibble: 2 × 21
## treat age_mean age_sd educ_mean educ_sd married_mean married_sd black_mean
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 28.0 10.8 10.2 2.86 0.513 0.500 0.203
## 2 1 25.8 7.16 10.3 2.01 0.189 0.393 0.843
## # ℹ 13 more variables: black_sd <dbl>, hispanic_mean <dbl>, hispanic_sd <dbl>,
## # white_mean <dbl>, white_sd <dbl>, nodegree_mean <dbl>, nodegree_sd <dbl>,
## # re74_mean <dbl>, re74_sd <dbl>, re75_mean <dbl>, re75_sd <dbl>,
## # re78_mean <dbl>, re78_sd <dbl>
install.packages("tidyr")
## Warning: package 'tidyr' is in use and will not be installed
library(tidyr)
# Function to compute t-test p-values for each covariate
t_test_results <- sapply(covariates, function(var) {
t_test <- t.test(lalonde[[var]] ~ lalonde$treat)
return(t_test$p.value) # Extract p-value
})
# Combine the summary stats with t-test results
summary_table_ttest <- summary_table %>%
pivot_longer(cols = -treat, names_to = c("variable", "stat"), names_sep = "_") %>%
pivot_wider(names_from = "stat", values_from = "value") %>%
filter(treat == 0) %>%
select(variable, mean, sd) %>%
left_join(summary_table %>% filter(treat == 1) %>%
pivot_longer(cols = -treat, names_to = c("variable", "stat"), names_sep = "_") %>%
pivot_wider(names_from = "stat", values_from = "value") %>%
select(variable, mean, sd),
by = "variable", suffix = c("_control", "_treat")) %>%
mutate(p_value = unname(t_test_results[variable])) # Add p-values (unname so they print as plain numbers)
# Print final table
print(summary_table_ttest)
## # A tibble: 10 × 6
## variable mean_control sd_control mean_treat sd_treat p_value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age 28.0 10.8 25.8 7.16 2.91e- 3
## 2 educ 10.2 2.86 10.3 2.01 5.85e- 1
## 3 married 0.513 0.500 0.189 0.393 1.46e-16
## 4 black 0.203 0.403 0.843 0.365 1.21e-58
## 5 hispanic 0.142 0.350 0.0595 0.237 7.04e- 4
## 6 white 0.655 0.476 0.0973 0.297 5.50e-55
## 7 nodegree 0.597 0.491 0.708 0.456 6.98e- 3
## 8 re74 5619. 6789. 2096. 4887. 1.75e-12
## 9 re75 2466. 3292. 1532. 3219. 1.15e- 3
## 10 re78 6984. 7294. 6349. 7867. 3.49e- 1
stargazer(summary_table_ttest, type = "text",
title = "Summary Statistics by Treatment Status",
summary = FALSE, digits = 2)
## 
## Summary Statistics by Treatment Status
## ===============================================================
##    variable mean_control sd_control mean_treat sd_treat p_value
## ---------------------------------------------------------------
## 1  age         28.03       10.79      25.82      7.16    0.003
## 2  educ        10.24        2.86      10.35      2.01    0.585
## 3  married      0.51        0.50       0.19      0.39    0.000
## 4  black        0.20        0.40       0.84      0.36    0.000
## 5  hispanic     0.14        0.35       0.06      0.24    0.001
## 6  white        0.66        0.48       0.10      0.30    0.000
## 7  nodegree     0.60        0.49       0.71      0.46    0.007
## 8  re74      5,619.24    6,788.75   2,095.57  4,886.62   0.000
## 9  re75      2,466.48    3,292.00   1,532.06  3,219.25   0.001
## 10 re78      6,984.17    7,294.16   6,349.14  7,867.40   0.349
## ---------------------------------------------------------------
Return to one of our first readings for this class: Imbens and Wooldridge (JEL
2009). Revisit your notes for sections 5.1 about “Identification” on
page 26 and 5.3 about “Regression Methods” on page 28, where the authors
made the point that one could compute the ATE by estimating
the OLS regression of the outcome re78 on a constant, the
treatment status treat, the set of control variables \(X\), and the set of interactions between
treat and the de-meaned transformation of each control
variable \(\left(X-\overline{X}\right)\).
Relying upon your knowledge of the potential outcomes framework,
the conditional independence assumption, and the overlap condition,
explain briefly why the coefficient on the treatment status
treat in the above regression is the estimated
ATE.
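Briefly, a sketch of the standard argument (assuming the conditional expectation is linear in this specification): under the conditional independence assumption \(\left(Y(0),Y(1)\right) \perp D \mid X\) and the overlap condition \(0 < \Pr(D=1 \mid X) < 1\), the interacted regression recovers
\[
E\left[Y \mid D, X\right] = \alpha + \tau D + X\beta + D\left(X-\overline{X}\right)\gamma,
\]
so that
\[
E\left[Y \mid D=1, X\right] - E\left[Y \mid D=0, X\right] = \tau + \left(X-\overline{X}\right)\gamma
\]
is the average treatment effect conditional on \(X\). Averaging over the distribution of \(X\) and using \(E\left[X-\overline{X}\right] = 0\) leaves \(\tau\), the coefficient on treat, as the estimated ATE.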
Run the OLS regression and show the estimated ATE
and its standard error. You may use the command demean from
the package Jmisc by Chan (2014) to demean your control
variables. Interpret your result. Does the participation in the NSW
training program increase real earnings in 1978?
install.packages("Jmisc")
## Warning: package 'Jmisc' is in use and will not be installed
library(Jmisc)
library(dplyr)
control_vars <- c("age", "educ", "married", "black", "hispanic", "white",
"nodegree", "re74", "re75")
# Create new de-meaned variables without replacing originals
lalonde <- lalonde %>%
mutate(across(all_of(control_vars), demean, .names = "demean_{.col}"))
# Create interaction terms
lalonde <- lalonde %>%
mutate(across(starts_with("demean_"), ~ . * treat, .names = "{.col}_treat"))
# Create regression formula dynamically
formula <- as.formula(paste("re78 ~ treat +",
paste(control_vars, collapse = " + "), "+",
paste(paste0("demean_", control_vars, "_treat"), collapse = " + ")))
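# Note: black, hispanic, and white sum to one, so lm() drops white and its
# interaction as collinear (the "singularities" flagged in the output below).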
# Run OLS regression
m1 <- lm(formula, data = lalonde)
# Show summary of regression results
summary(m1)
##
## Call:
## lm(formula = formula, data = lalonde)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14685 -4703 -1544 3742 53958
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.204e+03 2.829e+03 0.425 0.67066
## treat 1.075e+03 1.031e+03 1.043 0.29755
## age -2.027e+01 3.645e+01 -0.556 0.57840
## educ 3.225e+02 1.817e+02 1.775 0.07643 .
## married -5.269e+01 8.035e+02 -0.066 0.94774
## black -1.168e+03 8.816e+02 -1.324 0.18590
## hispanic 4.961e+02 1.008e+03 0.492 0.62283
## white NA NA NA NA
## nodegree 4.804e+02 1.013e+03 0.474 0.63545
## re74 3.767e-01 6.558e-02 5.744 1.48e-08 ***
## re75 3.398e-01 1.219e-01 2.788 0.00547 **
## demean_age_treat 1.038e+02 8.355e+01 1.243 0.21448
## demean_educ_treat 3.015e+02 3.926e+02 0.768 0.44283
## demean_married_treat 1.085e+03 1.614e+03 0.672 0.50167
## demean_black_treat 2.753e+01 1.955e+03 0.014 0.98877
## demean_hispanic_treat -1.918e+02 2.875e+03 -0.067 0.94684
## demean_white_treat NA NA NA NA
## demean_nodegree_treat -7.994e+02 1.886e+03 -0.424 0.67185
## demean_re74_treat -3.372e-01 1.551e-01 -2.174 0.03010 *
## demean_re75_treat -2.510e-01 2.467e-01 -1.018 0.30928
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6904 on 596 degrees of freedom
## Multiple R-squared: 0.1698, Adjusted R-squared: 0.1461
## F-statistic: 7.168 on 17 and 596 DF, p-value: 4.558e-16
You can practice a couple of covariate matching methods using the
Match command from the Matching package. You
would have to carefully check all the arguments of the command in the
help file. The default estimand is the ATT.
Implement the one-to-one matching and show the estimated ATT and its standard error.
install.packages("Matching")
## Warning: package 'Matching' is in use and will not be installed
library(Matching)
install.packages("dplyr")
## Warning: package 'dplyr' is in use and will not be installed
library(dplyr)
# a. One to one matching
control_vars <- c("age", "educ", "married", "black", "hispanic", "white",
"nodegree", "re74", "re75")
# Extract only the variables needed for matching
X <- lalonde[, control_vars]
# Perform one-to-one covariate matching (Match's default inverse-variance weighting)
match_result <- Match(Y = lalonde$re78,
Tr = lalonde$treat,
X = X,
M = 1,
replace = FALSE)
# Display matching results
summary(match_result)
##
## Estimate... 861.43
## SE......... 676.21
## T-stat..... 1.2739
## p.val...... 0.2027
##
## Original number of observations.............. 614
## Original number of treated obs............... 185
## Matched number of observations............... 185
## Matched number of observations (unweighted). 185
Implement the exact matching and show the estimated ATT and its standard error. How qualitatively does your result change from the one-to-one to the exact matching? Explain the differences.
match_result_2 <- Match(Y = lalonde$re78,
Tr = lalonde$treat,
X = lalonde[, control_vars],
exact = TRUE,
estimand = "ATT",
replace = FALSE)
summary(match_result_2)
##
## Estimate... -1199.8
## SE......... 2739.1
## T-stat..... -0.43801
## p.val...... 0.66138
##
## Original number of observations.............. 614
## Original number of treated obs............... 185
## Matched number of observations............... 10
## Matched number of observations (unweighted). 10
##
## Number of obs dropped by 'exact' or 'caliper' 175
Implement the nearest neighbor matching (1:2) and show the estimated ATT and its standard error.
# Perform nearest neighbor matching (1 treated : 2 controls)
match_result_3 <- Match(Y = lalonde$re78,
Tr = lalonde$treat,
X = X,
M = 2,
replace = FALSE)
summary(match_result_3)
##
## Estimate... -121.59
## SE......... 706.67
## T-stat..... -0.17206
## p.val...... 0.86339
##
## Original number of observations.............. 614
## Original number of treated obs............... 185
## Matched number of observations............... 185
## Matched number of observations (unweighted). 370
Implement the nearest neighbor matching (1:3) and show the estimated ATT and its standard error.
# d. Nearest neighbor matching (1:3)
match_result_4 <- Match(Y = lalonde$re78,
Tr = lalonde$treat,
X = X,
M = 3,
estimand = "ATT",
replace = FALSE)
summary(match_result_4)
##
## Estimate... -702.6
## SE......... 901.49
## T-stat..... -0.77938
## p.val...... 0.43576
##
## Original number of observations.............. 614
## Original number of treated obs............... 185
## Matched number of observations............... 143
## Matched number of observations (unweighted). 429
Check the bias-adjustment option of the Match command. Implement the nearest neighbor matching (1:2) and (1:3), this time with the bias-adjustment option. How qualitatively does your result change by applying the regression-based correction to the nearest neighbor matching? How qualitatively does your result change from the one-to-one to the nearest neighbor matching with bias correction?
# 1:2 Nearest Neighbor Matching
match_result_5 <- Match(Y = lalonde$re78,
Tr = lalonde$treat,
X = X,
M = 2,
estimand = "ATT",
BiasAdjust = TRUE)
summary(match_result_5)
##
## Estimate... 832.39
## AI SE...... 1062.1
## T-stat..... 0.7837
## p.val...... 0.43322
##
## Original number of observations.............. 614
## Original number of treated obs............... 185
## Matched number of observations............... 185
## Matched number of observations (unweighted). 385
# 1:3 Nearest Neighbor Matching
match_result_6 <- Match(Y = lalonde$re78,
Tr = lalonde$treat,
X = X,
M = 3,
estimand = "ATT",
BiasAdjust = TRUE)
summary(match_result_6)
##
## Estimate... 1167.4
## AI SE...... 1009.9
## T-stat..... 1.1559
## p.val...... 0.2477
##
## Original number of observations.............. 614
## Original number of treated obs............... 185
## Matched number of observations............... 185
## Matched number of observations (unweighted). 583
Implement the coarsened exact matching (CEM) and show the estimated ATT and its standard error. How qualitatively does your result change relative to the exact matching? Explain the differences.
install.packages("cem")
## Warning: package 'cem' is in use and will not be installed
library(cem)
control_vars <- c("age", "educ", "married", "black", "hispanic", "white", "nodegree", "re74", "re75")
lalonde_subset <- lalonde[, c("treat", control_vars)]
cem_match <- cem(treatment = "treat", data = lalonde_subset)
##
## Using 'treat'='1' as baseline group
summary(cem_match)
## Length Class Mode
## call 3 -none- call
## strata 614 -none- numeric
## n.strata 1 -none- numeric
## vars 9 -none- character
## drop 1 -none- character
## breaks 9 -none- list
## treatment 1 -none- character
## n 1 -none- numeric
## groups 614 factor numeric
## g.names 2 -none- character
## n.groups 1 -none- numeric
## group.idx 2 -none- list
## group.len 2 -none- numeric
## mstrata 614 -none- numeric
## mstrataID 25 -none- numeric
## matched 614 -none- logical
## baseline.group 1 -none- character
## tab 6 -none- numeric
## k2k 1 -none- logical
## w 614 -none- numeric
att_model <- lm(re78 ~ treat, data = lalonde, weights = cem_match$w)
# Display regression results
summary(att_model)
##
## Call:
## lm(formula = re78 ~ treat, data = lalonde, weights = cem_match$w)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -11694 0 0 0 28224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5459.4 682.9 7.995 3.85e-13 ***
## treat 415.4 1000.6 0.415 0.679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6031 on 144 degrees of freedom
## Multiple R-squared: 0.001196, Adjusted R-squared: -0.00574
## F-statistic: 0.1724 on 1 and 144 DF, p-value: 0.6786
Directed Acyclic Graph (DAG). Draw a DAG corresponding to the propensity score matching method that you can apply to the causal relationship between NSW training participation \(\left(D\right)\) and real earnings \(\left(Y\right)\). Describe the DAG.
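For instance, a minimal way to draw such a DAG in R (a sketch using the dagitty package, which is not loaded above; the node labels are illustrative):
library(dagitty)
# X: observed covariates; D: NSW participation; Y: real earnings in 1978.
# Conditioning on X (e.g., via the propensity score) blocks the back-door
# path D <- X -> Y, leaving only the causal path D -> Y.
nsw_dag <- dagitty("dag {
  D [exposure]
  Y [outcome]
  X -> D
  X -> Y
  D -> Y
}")
plot(graphLayout(nsw_dag))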
Descriptive graphs. What is the shape of the relationship between age and real earnings in 1974, 1975, and 1978? Is there a difference in the distribution of age between NSW non-participants and participants? What is the shape of the relationship between education and real earnings in 1974, 1975, and 1978? Is there a difference in the distribution of years of schooling between NSW non-participants and participants? Discuss the graphs.
### b. Descriptive graph
install.packages("dplyr")
## Warning: package 'dplyr' is in use and will not be installed
install.packages("tidyr")
## Warning: package 'tidyr' is in use and will not be installed
install.packages("ggplot2")
## Warning: package 'ggplot2' is in use and will not be installed
library(dplyr)
library(tidyr)
library(ggplot2)
### The relationship between age and real earnings in 1974, 1975, and 1978
# Convert the dataset into long format for easier plotting
lalonde_long <- lalonde %>%
select(age, re74, re75, re78) %>%
pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings")
# Create the plot
ggplot(lalonde_long, aes(x = age, y = earnings, color = year)) +
geom_point(alpha = 0.5) + # Scatter plot
geom_smooth(method = "loess", se = FALSE) + # Smoothed trend line
labs(title = "Relationship Between Age and Real Earnings (1974, 1975, 1978)",
x = "Age",
y = "Real Earnings",
color = "Year") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
### The difference in the distribution of age between non-participants and participants
# Step 1: Filter only the treatment group
lalonde_treated <- lalonde %>%
filter(treat == 1) %>%
select(age, re74, re75, re78) %>%
pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings")
# Step 2: Draw the graph for the treatment group
ggplot(lalonde_treated, aes(x = age, y = earnings, color = year)) +
geom_point(alpha = 0.5) + # Scatter plot
geom_smooth(method = "loess", se = FALSE) + # Smoothed trend line
labs(title = "Age vs. Real Earnings (Treatment Group - treat = 1)",
x = "Age",
y = "Real Earnings",
color = "Year") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Step 3: Filter only the control group
lalonde_control <- lalonde %>%
filter(treat == 0) %>%
select(age, re74, re75, re78) %>%
pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings")
# Step 4: Draw the graph for the control group
ggplot(lalonde_control, aes(x = age, y = earnings, color = year)) +
geom_point(alpha = 0.5) + # Scatter plot
geom_smooth(method = "loess", se = FALSE) + # Smoothed trend line
labs(title = "Age vs. Real Earnings (Control Group - treat = 0)",
x = "Age",
y = "Real Earnings",
color = "Year") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
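The prompt also asks directly about the distribution of age by treatment status; one way to compare it (a sketch, not part of the original output) is an overlaid density plot:
# Compare the distribution of age between non-participants and participants
ggplot(lalonde, aes(x = age, fill = as.factor(treat))) +
  geom_density(alpha = 0.5) +
  labs(title = "Age Distribution by Treatment Status",
       x = "Age", y = "Density", fill = "Treat") +
  theme_minimal()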
### The relationship between education and real earnings in 1974, 1975, and 1978
# Transform dataset into long format for plotting
lalonde_long_educ <- lalonde %>%
select(educ, re74, re75, re78) %>%
pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings") %>%
mutate(year = factor(year, labels = c("1974", "1975", "1978"))) # Rename years
# Create the scatter plot with trend lines
ggplot(lalonde_long_educ, aes(x = educ, y = earnings, color = year)) +
geom_point(alpha = 0.6, size = 3) + # Scatter plot points
geom_smooth(method = "lm", se = FALSE) + # Linear regression line
labs(title = "Education vs. Real Earnings (1974, 1975, 1978)",
x = "Years of Education",
y = "Real Earnings",
color = "Year") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
### The difference in the distribution of education between non-participants and participants
# Transform dataset into long format for easier plotting
lalonde_long_educ <- lalonde %>%
select(treat, educ, re74, re75, re78) %>%
pivot_longer(cols = c(re74, re75, re78), names_to = "year", values_to = "earnings") %>%
mutate(year = factor(year, labels = c("1974", "1975", "1978"))) # Rename years
# Split data into treatment and control groups
lalonde_treated <- lalonde_long_educ %>% filter(treat == 1)
lalonde_control <- lalonde_long_educ %>% filter(treat == 0)
# Plot for Treatment Group
plot_treated <- ggplot(lalonde_treated, aes(x = educ, y = earnings, color = year)) +
geom_point(alpha = 0.6, size = 3) + # Scatter plot points
geom_smooth(method = "lm", se = FALSE) + # Linear regression line
labs(title = "Education vs. Real Earnings (Treatment Group - treat = 1)",
x = "Years of Education",
y = "Real Earnings",
color = "Year") +
theme_minimal()
# Plot for Control Group
plot_control <- ggplot(lalonde_control, aes(x = educ, y = earnings, color = year)) +
geom_point(alpha = 0.6, size = 3) + # Scatter plot points
geom_smooth(method = "lm", se = FALSE) + # Linear regression line
labs(title = "Education vs. Real Earnings (Control Group - treat = 0)",
x = "Years of Education",
y = "Real Earnings",
color = "Year") +
theme_minimal()
# Display plots side by side
library(gridExtra)
grid.arrange(plot_treated, plot_control, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
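Similarly, the distribution of years of schooling by treatment status can be compared directly (a sketch, not part of the original output):
# Compare the distribution of years of schooling between the two groups
ggplot(lalonde, aes(x = educ, fill = as.factor(treat))) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 1) +
  labs(title = "Years of Schooling by Treatment Status",
       x = "Years of Education", y = "Count", fill = "Treat") +
  theme_minimal()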
Propensity score. Using the glm
command and the argument “family=binomial”, estimate a logit model of
the probability of participation conditional on the full set of
observables that you have, with the appropriate polynomial function for
age and educ, and store the fitted values of
this propensity score as psfit. Present the results. You
may use packages like stargazer, texreg, or
any other package that helps you produce well-formatted estimation
tables.
Matching on the propensity score. Using the
outcome, the treatment status, and the fitted propensity score
psfit \(p(X)\), implement
a nearest neighbor matching with three matches (1:3 matching) and
bias-adjustment option (“BiasAdjust = T”) and show the estimated
ATT and its standard error obtained from the
Match command. With the bias-adjustment option, does the
standard error take into account the earlier logit estimation of the
propensity score?
##### C. Propensity score
library(Matching)
# Step 1: Estimate propensity scores using logistic regression
ps_model <- glm(treat ~ age + educ, data = lalonde, family = binomial())
lalonde$psfit <- ps_model$fitted.values
summary(ps_model)
##
## Call:
## glm(formula = treat ~ age + educ, family = binomial(), data = lalonde)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.266059 0.466844 -0.570 0.5687
## age -0.023963 0.009587 -2.500 0.0124 *
## educ 0.006700 0.034815 0.192 0.8474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.49 on 613 degrees of freedom
## Residual deviance: 744.70 on 611 degrees of freedom
## AIC: 750.7
##
## Number of Fisher Scoring iterations: 4
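Note that the prompt asks for the full set of observables with an appropriate polynomial in age and educ, whereas the model above uses only linear terms in those two variables. A richer specification (a sketch, not re-estimated here; psfit_full is an illustrative name) would be:
# Logit with quadratic terms in age and educ plus the remaining covariates
# (white is omitted to avoid the dummy-variable trap)
ps_model_full <- glm(treat ~ age + I(age^2) + educ + I(educ^2) +
                       married + black + hispanic + nodegree + re74 + re75,
                     data = lalonde, family = binomial())
lalonde$psfit_full <- ps_model_full$fitted.values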
# Step 2: Nearest neighbor matching using the estimated propensity score
match_ps <- Match(Y = lalonde$re78,
Tr = lalonde$treat,
X = lalonde$psfit,
M = 3,
replace = FALSE,
BiasAdjust = TRUE)
## Warning in Match(Y = lalonde$re78, Tr = lalonde$treat, X = lalonde$psfit, :
## Bias Adjustment can only be estimated when ties==TRUE and replace=TRUE.
## Setting BiasAdjust=FALSE
# Display matching results
summary(match_ps)
##
## Estimate... -702.6
## SE......... 853.82
## T-stat..... -0.82289
## p.val...... 0.41057
##
## Original number of observations.............. 614
## Original number of treated obs............... 185
## Matched number of observations............... 143
## Matched number of observations (unweighted). 429
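As the warning notes, Match() silently disables the bias adjustment when replace = FALSE. To actually apply the regression-based correction, the matching must be done with replacement (a sketch; match_ps_ba is an illustrative name and its output is not shown):
# 1:3 propensity score matching with replacement so that BiasAdjust applies
match_ps_ba <- Match(Y = lalonde$re78,
                     Tr = lalonde$treat,
                     X = lalonde$psfit,
                     M = 3,
                     replace = TRUE,
                     BiasAdjust = TRUE)
summary(match_ps_ba)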
Using the boot package, re-estimate the ATT with bootstrapped standard errors that take into account the logit estimation of the propensity score. Compare the results with those that apply the bias-adjustment option without bootstrap inference.
library(boot)
# Define a function to estimate ATT using bootstrap
boot_att <- function(data, indices) {
d <- data[indices, ] # Resample data
ps_model <- glm(treat ~ age + educ, data = d, family = binomial())
d$psfit <- ps_model$fitted.values
match_ps <- Match(Y = d$re78, Tr = d$treat, X = d$psfit, M = 3, BiasAdjust = TRUE)
return(match_ps$est) # Return ATT estimate
}
# Run bootstrap with 1000 replications
set.seed(123)
boot_results <- boot(data = lalonde, statistic = boot_att, R = 1000)
# Get bootstrapped standard error
boot_se <- sd(boot_results$t)
print(paste("Bootstrapped Standard Error:", boot_se))
## [1] "Bootstrapped Standard Error: 853.071770030857"
Generate summary statistics of the fitted propensity score psfit for participants and non-participants.
library(dplyr)
library(stargazer)
# Compute summary statistics separately for participants (treat = 1) and non-participants (treat = 0)
summary_ps <- lalonde %>%
group_by(treat) %>%
summarise(
count = n(),
mean = mean(psfit),
sd = sd(psfit),
min = min(psfit),
max = max(psfit),
median = median(psfit)
)
# Round all numerical columns to 2 decimal places before passing to stargazer
summary_ps_rounded <- summary_ps %>%
mutate(across(where(is.numeric), ~ round(.x, 2)))
# Generate summary statistics table with 2 decimal places
stargazer(summary_ps_rounded, type = "text", summary = FALSE,
title = "Summary Statistics of Propensity Scores (psfit)",
digits = 2)
##
## Summary Statistics of Propensity Scores (psfit)
## ========================================
## treat count mean sd min max median
## ----------------------------------------
## 1 0 429 0.3 0.05 0.17 0.36 0.31
## 2 1 185 0.31 0.04 0.2 0.35 0.31
## ----------------------------------------
Using the kdensity package, plot the density of the fitted propensity score for participants and non-participants. Hint: psfitdens1 = kdensity(psfit[D==1]) and plot(psfitdens1) for participants. Alternatively, you could use ggplot as in the lecture slides: ggplot(lalonde, aes(psfit, fill = as.factor(D))) + stat_density(aes(y = ..density..), position = "identity", color = "black", alpha = 0.5).
### b. Plot the density of the fitted propensity score for participants and
### non-participants
# Load necessary libraries
library(kdensity)
library(ggplot2)
# Compute kernel density for participants (treat = 1)
psfitdens1 <- kdensity(lalonde$psfit[lalonde$treat == 1])
# Compute kernel density for non-participants (treat = 0)
psfitdens0 <- kdensity(lalonde$psfit[lalonde$treat == 0])
# Plot both densities in the same graph
plot(psfitdens1, main = "Density of Propensity Score (Participants vs. Non-Participants)",
xlab = "Propensity Score", ylab = "Density", col = "blue", lwd = 2)
lines(psfitdens0, col = "red", lwd = 2) # Add non-participants' density
legend("topleft", legend = c("Participants", "Non-Participants"),
col = c("blue", "red"), lwd = 2)
Plot the histogram of the fitted propensity score psfit for participants and non-participants. Hint: hist(psfit[D==0]) for non-participants. Alternatively, you may use one of many great-looking options provided by the package cobalt here or there.
### c. Plot the histogram of the fitted propensity score psfit for participants
### and non-participants
# Create histogram for non-participants first
hist(lalonde$psfit[lalonde$treat == 0],
main = "Histogram of Propensity Scores (Participants vs. Non-Participants)",
xlab = "Propensity Score", col = rgb(1, 0, 0, 0.5), border = "black", breaks = 20)
# Add histogram for participants on top
hist(lalonde$psfit[lalonde$treat == 1],
col = rgb(0, 0, 1, 0.5), border = "black", breaks = 20, add = TRUE)
# Add legend
legend("topleft", legend = c("Non-Participants", "Participants"),
fill = c("red", "blue"))
Does the overlap condition hold?
Imposing the common support, run your most reliable propensity
score matching estimation using the command Match from the
package Matching and the command matchit from
the package MatchIt. Compare the estimated
results.
### e. Impose the common support
# Identify the minimum and maximum propensity scores in each group
ps_min <- max(min(lalonde$psfit[lalonde$treat == 1]), min(lalonde$psfit[lalonde$treat == 0]))
ps_max <- min(max(lalonde$psfit[lalonde$treat == 1]), max(lalonde$psfit[lalonde$treat == 0]))
# Keep only observations within the common support
lalonde_common_support <- lalonde %>%
filter(psfit >= ps_min & psfit <= ps_max)
# Check the range after imposing common support
summary(lalonde_common_support$psfit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1995 0.2790 0.3133 0.3030 0.3383 0.3529
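A quick check of how many observations the common-support restriction drops, by treatment status (illustrative):
# Tabulate observations outside the common support by treatment status
table(outside = lalonde$psfit < ps_min | lalonde$psfit > ps_max,
      treat = lalonde$treat)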
# Perform nearest neighbor matching with bias adjustment
match_result <- Match(
Y = lalonde_common_support$re78,
Tr = lalonde_common_support$treat,
X = lalonde_common_support$psfit,
M = 1,
replace = FALSE,
BiasAdjust = TRUE
)
## Warning in Match(Y = lalonde_common_support$re78, Tr =
## lalonde_common_support$treat, : Bias Adjustment can only be estimated when
## ties==TRUE and replace=TRUE. Setting BiasAdjust=FALSE
# Display results
summary(match_result)
##
## Estimate... -709.89
## SE......... 731.2
## T-stat..... -0.97086
## p.val...... 0.33162
##
## Original number of observations.............. 560
## Original number of treated obs............... 185
## Matched number of observations............... 185
## Matched number of observations (unweighted). 185
#### Estimation using Matchit
install.packages("MatchIt")
library(MatchIt)
##
## Attaching package: 'MatchIt'
## The following object is masked _by_ '.GlobalEnv':
##
## lalonde
## The following object is masked from 'package:cobalt':
##
## lalonde
# Perform propensity score matching
psm_model <- matchit(
treat ~ age + educ,
data = lalonde,
method = "nearest", # Nearest neighbor matching (1:1)
distance = "logit",
replace = FALSE
)
# Summary of the matching results
summary(psm_model)
##
## Call:
## matchit(formula = treat ~ age + educ, data = lalonde, method = "nearest",
## distance = "logit", replace = FALSE)
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3080 0.2984 0.2734 0.4606 0.0881
## age 25.8162 28.0303 -0.3094 0.4400 0.0813
## educ 10.3459 10.2354 0.0550 0.4959 0.0347
## eCDF Max
## distance 0.1663
## age 0.1577
## educ 0.1114
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3080 0.3077 0.0094 0.9963 0.0033
## age 25.8162 25.8649 -0.0068 1.0300 0.0050
## educ 10.3459 10.2865 0.0296 0.5886 0.0253
## eCDF Max Std. Pair Dist.
## distance 0.0432 0.0146
## age 0.0162 0.0597
## educ 0.1189 0.8146
##
## Sample Sizes:
## Control Treated
## All 429 185
## Matched 185 185
## Unmatched 244 0
## Discarded 0 0
# Extract matched dataset
matched_data <- match.data(psm_model)
att_model <- lm(re78 ~ treat, data = matched_data)
summary(att_model)
##
## Call:
## lm(formula = re78 ~ treat, data = matched_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7474 -6340 -1959 3631 53959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7474.4 553.4 13.506 <2e-16 ***
## treat -1125.2 782.7 -1.438 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7527 on 368 degrees of freedom
## Multiple R-squared: 0.005585, Adjusted R-squared: 0.002883
## F-statistic: 2.067 on 1 and 368 DF, p-value: 0.1514
#### f. Are there any quality checks you can perform?
# Balance summary before and after matching
summary(psm_model, standardize = TRUE)
##
## Call:
## matchit(formula = treat ~ age + educ, data = lalonde, method = "nearest",
## distance = "logit", replace = FALSE)
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3080 0.2984 0.2734 0.4606 0.0881
## age 25.8162 28.0303 -0.3094 0.4400 0.0813
## educ 10.3459 10.2354 0.0550 0.4959 0.0347
## eCDF Max
## distance 0.1663
## age 0.1577
## educ 0.1114
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3080 0.3077 0.0094 0.9963 0.0033
## age 25.8162 25.8649 -0.0068 1.0300 0.0050
## educ 10.3459 10.2865 0.0296 0.5886 0.0253
## eCDF Max Std. Pair Dist.
## distance 0.0432 0.0146
## age 0.0162 0.0597
## educ 0.1189 0.8146
##
## Sample Sizes:
## Control Treated
## All 429 185
## Matched 185 185
## Unmatched 244 0
## Discarded 0 0
library(cobalt)
love.plot(psm_model, thresholds = c(m = 0.1)) # Love plot (SMD before & after matching)
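Besides the Love plot, cobalt can also report a numeric balance table (a sketch; output not shown):
# Numeric balance statistics before and after matching
bal.tab(psm_model, un = TRUE, thresholds = c(m = 0.1))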
###### G. Another matching method
### a. We can add inverse probability weights
# Step 1: Estimate propensity scores
ps_model <- glm(treat ~ age + educ, data = lalonde, family = binomial)
lalonde$pscore <- predict(ps_model, type = "response")
# Step 2: Compute inverse probability weights
lalonde$weights <- ifelse(lalonde$treat == 1,
1 / lalonde$pscore, # Treated units
1 / (1 - lalonde$pscore)) # Control units
# Step 3: Run Matching with Weights
match_result_9 <- Match(Y = lalonde$re78,
Tr = lalonde$treat,
X = X,
M = 2,
estimand = "ATT",
BiasAdjust = TRUE,
weights = lalonde$weights) # Use computed weights
summary(match_result_9)
##
## Estimate... 1050.2
## AI SE...... 977.7
## T-stat..... 1.0741
## p.val...... 0.28277
##
## Original number of observations (weighted)... 1224.381
## Original number of observations.............. 614
## Original number of treated obs (weighted).... 609.753
## Original number of treated obs............... 185
## Matched number of observations............... 609.753
## Matched number of observations (unweighted). 385
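Note that the weights computed above are ATE-style inverse probability weights; for the ATT specifically, a common alternative (a sketch, not re-run here; att_weights is an illustrative name) keeps treated units at weight 1 and reweights controls by the odds:
# Hypothetical ATT weights: 1 for treated, p/(1-p) for controls
lalonde$att_weights <- ifelse(lalonde$treat == 1, 1,
                              lalonde$pscore / (1 - lalonde$pscore))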
### b. We can also adopt the Caliper matching
psm_caliper <- matchit(treat ~ age + educ,
data = lalonde,
method = "nearest",
caliper = 0.1) # Restricts matches to be within 0.1 of propensity score
summary(psm_caliper, standardize = TRUE)
##
## Call:
## matchit(formula = treat ~ age + educ, data = lalonde, method = "nearest",
## caliper = 0.1)
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3080 0.2984 0.2734 0.4606 0.0881
## age 25.8162 28.0303 -0.3094 0.4400 0.0813
## educ 10.3459 10.2354 0.0550 0.4959 0.0347
## eCDF Max
## distance 0.1663
## age 0.1577
## educ 0.1114
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3080 0.3077 0.0094 0.9963 0.0033
## age 25.8162 25.8649 -0.0068 1.0300 0.0050
## educ 10.3459 10.2865 0.0296 0.5886 0.0253
## eCDF Max Std. Pair Dist.
## distance 0.0432 0.0146
## age 0.0162 0.0597
## educ 0.1189 0.8146
##
## Sample Sizes:
## Control Treated
## All 429 185
## Matched 185 185
## Unmatched 244 0
## Discarded 0 0
# Extract matched dataset, rename weights column
matched_data <- match.data(psm_caliper, weights = "matching_weights")
# Run regression on matched data
att_model <- lm(re78 ~ treat, data = matched_data)
# Show results
summary(att_model)
##
## Call:
## lm(formula = re78 ~ treat, data = matched_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7474 -6340 -1959 3631 53959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7474.4 553.4 13.506 <2e-16 ***
## treat -1125.2 782.7 -1.438 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7527 on 368 degrees of freedom
## Multiple R-squared: 0.005585, Adjusted R-squared: 0.002883
## F-statistic: 2.067 on 1 and 368 DF, p-value: 0.1514
HAVE FUN AND KEEP FAITH IN THE FUN!