R.
Rpubs using your temporary account.
RPubs link of your work and submit it on Canvas.
RPubs link last has copied the others. So, timely submissions are important. Own your work. I may randomly ask for your R script and .Rmd files for double-checking purposes. As a standard practice, work in a script file before making your code chunks in the .Rmd file. Your .Rmd file and RPubs submission page MUST show the code used to produce any of the outputs you present in your answers.
Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.
Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.
The primary practice data for this problem set is the “National Supported Work (NSW) Demonstration” dataset, lalonde. NSW is a job training program in the US that provides work experience for a period of up to 1.5 years to socioeconomically disadvantaged individuals, with the goal of improving their labor market outcomes. This famous dataset is sourced from the experimental sample in Lalonde (AER 1986) and has been used in several other papers and textbooks: Dehejia and Wahba (JASA 1999), Dehejia and Wahba (RESTAT 2002), Mixtape, etc. You can load the package Matching by Sekhon (2011), which contains the lalonde dataset.
The lalonde sample has 445 observations, including 260 individuals not assigned to NSW training participation and 185 individuals assigned to it, which you can check by examining the treatment variable (treat). The outcome variable is real earnings in 1978 (re78); pre-treatment outcomes are real earnings in 1974 and 1975 (re74 and re75); binary variables for being unemployed in 1974 and 1975 are u74 and u75; and covariates include age in years (age), years of schooling (educ), and binary variables for Black (black) and Hispanic (hisp) individuals, etc. I suggest you call ?lalonde to open the help file and check the description of the dataset.
# Load packages (p_load installs any missing package, then attaches it)
library(pacman)
# Package names are case-sensitive: the matching package is MatchIt
p_load(Matching, Jmisc, lmtest, sandwich, kdensity, haven, boot,
       cobalt, MatchIt, Zelig, estimatr, cem, tidyverse,
       lubridate, usmap, gridExtra, stringr, readxl, plot3D,
       cowplot, reshape2, scales, broom, data.table, ggplot2, stargazer,
       foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont,
       kableExtra, snakecase, janitor)
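# Load data from the Matching package explicitly: cobalt also ships a
# dataset named lalonde (614 observations) that would otherwise mask it
data(lalonde, package = "Matching")
attach(lalonde)
table(treat)
## treat
##   0   1
## 260 185
#?lalonde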
Generate a table of summary statistics (mean and standard deviation) by treatment status, with t-tests comparing the means of the covariates, pre-treatment outcomes, and the outcome between non-participants and participants in the NSW training. You may use packages like stargazer, xtable, knitr (for kable), or any other package that helps you produce well-formatted balance tables. Interpret your table.
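A minimal base-R sketch of such a balance table follows. It runs on simulated toy data rather than lalonde, the helper name balance_table and the toy columns are hypothetical, and the Welch t-test is one reasonable choice rather than a required one.

```r
# Toy data standing in for lalonde (assumption: simulated, not the real sample)
set.seed(123)
n <- 400
toy <- data.frame(
  treat = rbinom(n, 1, 0.4),
  age   = round(runif(n, 18, 55)),
  educ  = sample(6:16, n, replace = TRUE)
)
toy$re78 <- 5000 + 1500 * toy$treat + 100 * toy$educ + rnorm(n, sd = 3000)

# Hypothetical helper: mean and SD by treatment status plus a Welch t-test
balance_table <- function(data, treat, vars) {
  d <- data[[treat]]
  rows <- lapply(vars, function(v) {
    x0 <- data[[v]][d == 0]
    x1 <- data[[v]][d == 1]
    tt <- t.test(x1, x0)  # two-sample t-test allowing unequal variances
    data.frame(
      variable     = v,
      mean_control = mean(x0), sd_control = sd(x0),
      mean_treated = mean(x1), sd_treated = sd(x1),
      difference   = mean(x1) - mean(x0),
      t_stat       = unname(tt$statistic),
      p_value      = tt$p.value
    )
  })
  do.call(rbind, rows)
}

bal <- balance_table(toy, "treat", c("age", "educ", "re78"))
print(bal, digits = 3)
```

With the real data, the same helper could be called on lalonde with the column names listed in the Matching help file, and the resulting data frame passed to a table formatter of your choice.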
Return to one of our first readings for this class: Imbens and Wooldridge (JEL 2009). Revisit your notes on Section 5.1, “Identification” (page 26), and Section 5.3, “Regression Methods” (page 28), where the authors make the point that one can compute the ATE by estimating the OLS regression of the outcome re78 on a constant, the treatment status treat, the set of control variables \(X\), and the set of interactions between treat and the de-meaned transformation of each control variable \(\left(X-\overline{X}\right)\).
Relying upon your knowledge of the potential outcomes framework, the conditional independence assumption, and the overlap condition, explain briefly why the coefficient on the treatment status treat in the above regression is the estimated ATE.
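One way to sketch the argument, assuming the conditional means of the potential outcomes are linear in \(X\) (an assumption added here for illustration, with \(D\) denoting treat and \(Y\) denoting re78): under conditional independence and overlap, \(E[Y(d)\mid X=x]=E[Y\mid D=d,X=x]\) for \(d\in\{0,1\}\), so the regression

\[
E[Y\mid D,X]=\alpha+\tau D+X'\beta+D\left(X-\overline{X}\right)'\gamma
\]

implies the conditional treatment effect

\[
E[Y(1)-Y(0)\mid X=x]=\tau+\left(x-\overline{X}\right)'\gamma .
\]

Averaging over the sample distribution of \(X\) removes the interaction term, because the de-meaned covariates average to zero:

\[
\frac{1}{N}\sum_{i=1}^{N}\left[\tau+\left(X_i-\overline{X}\right)'\gamma\right]=\tau .
\]

Under these assumptions, the coefficient on treat therefore estimates the ATE.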
Run the OLS regression and show the estimated ATE and its standard error. You may use the command demean from the package Jmisc by Chan (2014) to demean your control variables. Interpret your result. Does participation in the NSW training program increase real earnings in 1978?
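The regression can be sketched as follows on simulated data. This is an illustration under stated assumptions, not the expected answer: the data-generating process is made up, the controls are centered by hand in base R instead of with Jmisc's demean, and the standard error is the classical OLS one.

```r
# Simulated data (assumption): the treatment effect is 2 + x1, so it varies with x1
set.seed(42)
n  <- 2000
x1 <- rnorm(n, mean = 1)
x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(0.5 * x1 - 0.5))
y  <- 1 + (2 + x1) * d + 0.5 * x1 - x2 + rnorm(n)
df <- data.frame(y, d, x1, x2)

# De-mean each control variable (base-R substitute for Jmisc::demean)
df$x1_c <- df$x1 - mean(df$x1)
df$x2_c <- df$x2 - mean(df$x2)

# OLS of y on a constant, d, the controls, and d times the de-meaned controls
fit <- lm(y ~ d + x1 + x2 + d:x1_c + d:x2_c, data = df)
ate_hat <- unname(coef(fit)["d"])
ate_se  <- unname(coef(summary(fit))["d", "Std. Error"])

# Cross-check: the sample average of the fitted unit-level effects equals ate_hat
b <- coef(fit)
avg_effect <- mean(b["d"] + b["d:x1_c"] * df$x1_c + b["d:x2_c"] * df$x2_c)

cat("ATE estimate:", ate_hat, " SE:", ate_se, "\n")
```

The cross-check at the end mirrors the identification argument: because the interactions use de-meaned controls, they drop out when the fitted effects are averaged over the sample.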
You can practice a couple of covariate matching methods using the Match command from the Matching package. Carefully check all the arguments of the command in the help file. The default estimand is the ATT.
One-to-one matching. Using the outcome, the treatment status, and all the control variables \(X\) from part (ii), implement a one-to-one matching and show the estimated ATT and its standard error.
Exact matching. Using the outcome, the treatment status, and all the control variables \(X\) from part (ii), implement an exact matching and show the estimated ATT and its standard error. How does your result change qualitatively from the one-to-one to the exact matching? Explain the differences.
Nearest neighbor matching (1:2). Using the outcome, the treatment status, and all the control variables \(X\) from part (ii), implement a nearest neighbor matching with two matches (1:2 matching) without bias correction and show the estimated ATT and its standard error.
Nearest neighbor matching (1:3). Using the outcome, the treatment status, and all the control variables \(X\) from part (ii), implement a nearest neighbor matching with three matches (1:3 matching) without bias correction and show the estimated ATT and its standard error.
Bias correction. Bias arises when the matches found for a training participant are not fully comparable. This can be adjusted using a regression-based correction. You may read Abadie & Imbens (JBES 2011) to understand this correction, which can be applied using the argument “BiasAdjust = T” of the Match command. Implement the nearest neighbor matching (1:2) and (1:3), this time with the bias-adjustment option. How does your result change qualitatively when you apply the regression-based correction to the nearest neighbor matching? How does your result change qualitatively from the one-to-one to the nearest neighbor matching with bias correction?
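The nearest neighbor variants above differ only in the M and BiasAdjust arguments of Match, so they can be run in one loop. The sketch below assumes the Matching package is installed and that the control set is the ten covariates listed in its lalonde help file; the control set from part (ii) may differ.

```r
library(Matching)

# Experimental sample shipped with Matching (445 observations)
data(lalonde, package = "Matching")

Y  <- lalonde$re78
Tr <- lalonde$treat
# Assumed control set; replace with the controls chosen in part (ii)
X  <- as.matrix(lalonde[, c("age", "educ", "black", "hisp", "married",
                            "nodegr", "re74", "re75", "u74", "u75")])

# Nearest neighbor matching with M = 1, 2, 3, without and with bias adjustment
grid <- expand.grid(M = 1:3, BiasAdjust = c(FALSE, TRUE))
grid$est <- NA_real_
grid$se  <- NA_real_
for (i in seq_len(nrow(grid))) {
  m <- Match(Y = Y, Tr = Tr, X = X, estimand = "ATT",
             M = grid$M[i], BiasAdjust = grid$BiasAdjust[i])
  grid$est[i] <- as.numeric(m$est)  # point estimate of the ATT
  grid$se[i]  <- as.numeric(m$se)   # Abadie-Imbens standard error
}
print(grid)
```

The row with M = 1 and BiasAdjust = FALSE corresponds to one-to-one matching. Exact matching uses the same call with the exact argument of Match, and summary() on a single Match object prints the estimate with its standard error and the number of matched observations.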
Coarsened exact matching. Implement a coarsened exact matching and show the estimated ATT and its standard error. How does your result change qualitatively relative to the exact matching? Explain the differences.
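To make the idea concrete, here is a hand-rolled base-R sketch of coarsened exact matching on simulated data. It is not the cem package's implementation: the data, the bin width of 1 for x, and the variable names are assumptions chosen for illustration.

```r
# Simulated data (assumption): true effect of d is 2, x confounds treatment
set.seed(1)
n <- 2000
x <- runif(n, 0, 10)                      # continuous covariate
z <- sample(0:3, n, replace = TRUE)       # discrete covariate
d <- rbinom(n, 1, plogis(-1 + 0.2 * x))
y <- 1 + 2 * d + 0.5 * x + z + rnorm(n)
toy <- data.frame(y, d, x, z)

# Step 1: coarsen x into unit-width bins; keep z as is
toy$x_bin <- cut(toy$x, breaks = 0:10, include.lowest = TRUE)

# Step 2: exact-match on the coarsened covariates (one stratum per cell)
toy$stratum <- interaction(toy$x_bin, toy$z, drop = TRUE)
n_t <- tapply(toy$d == 1, toy$stratum, sum)
n_c <- tapply(toy$d == 0, toy$stratum, sum)

# Step 3: keep only strata containing both treated and control units
keep    <- names(n_t)[n_t > 0 & n_c > 0]
matched <- toy[toy$stratum %in% keep, ]

# Step 4: ATT weights, 1 for treated and (treated / controls) in the stratum
# for controls; the overall scale does not matter for a weighted mean
s <- as.character(matched$stratum)
matched$w <- ifelse(matched$d == 1, 1, n_t[s] / n_c[s])

att_cem <- with(matched,
                weighted.mean(y[d == 1], w[d == 1]) -
                  weighted.mean(y[d == 0], w[d == 0]))

# The same point estimate from a weighted regression, which also gives an SE
fit_cem <- lm(y ~ d, data = matched, weights = w)
cat("CEM ATT:", att_cem, " SE:", coef(summary(fit_cem))["d", "Std. Error"], "\n")
```

Coarser bins keep more units but allow more imbalance within a stratum; finer bins approach exact matching and discard more units.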
Directed Acyclic Graph (DAG). Draw a DAG corresponding with the propensity score matching method that you can apply to the causal relationship between NSW training participation \(\left(D\right)\) and real earnings \(\left(Y\right)\). Describe the DAG.
Descriptive graphs. What is the shape of the relationship between age and real earnings in 1974, 1975, and 1978? Is there a difference in the distribution of age between NSW non-participants and participants? What is the shape of the relationship between education and real earnings in 1974, 1975, and 1978? Is there a difference in the distribution of years of schooling between NSW non-participants and participants? Discuss the graphs.
Propensity score. Using the glm command and the argument “family=binomial”, estimate a logit model of the probability of participation conditional on the full set of observables that you have, with the appropriate polynomial function for age and educ, and store the fitted values of this propensity score as psfit. Present the results. You may use packages like stargazer, texreg, or any other package that helps you produce well-formatted estimation tables.
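A minimal sketch of the logit step, on simulated toy data rather than lalonde. The quadratic terms are one possible choice of polynomial, not necessarily the appropriate one, and the toy columns are assumptions.

```r
# Toy data standing in for lalonde (assumption: simulated)
set.seed(7)
n <- 500
toy <- data.frame(
  age   = round(runif(n, 18, 55)),
  educ  = sample(6:16, n, replace = TRUE),
  black = rbinom(n, 1, 0.5)
)
toy$treat <- rbinom(n, 1, plogis(-0.5 + 0.02 * toy$age -
                                   0.0005 * toy$age^2 + 0.05 * toy$educ))

# Logit of participation with quadratic terms in age and educ
ps_model <- glm(treat ~ age + I(age^2) + educ + I(educ^2) + black,
                family = binomial, data = toy)

# Fitted values are the estimated probabilities of participation
toy$psfit <- fitted(ps_model)
print(summary(ps_model))
summary(toy$psfit)
```

The fitted model object can be passed directly to a table formatter such as stargazer or texreg.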
Matching on the propensity score. Using the outcome, the treatment status, and the fitted propensity score psfit, \(p(X)\), implement a nearest neighbor matching with three matches (1:3 matching) and the bias-adjustment option (“BiasAdjust = T”), and show the estimated ATT and its standard error obtained from the Match command. With the bias-adjustment option, does the standard error take into account the earlier logit estimation of the propensity score?
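The matching step can be sketched as below, assuming the Matching package is installed. The logit specification is an illustrative assumption (quadratics in age and educ plus the remaining covariates from the Matching help file), not the required one.

```r
library(Matching)
data(lalonde, package = "Matching")

# Assumed propensity score specification, for illustration only
ps_model <- glm(treat ~ age + I(age^2) + educ + I(educ^2) + black + hisp +
                  married + nodegr + re74 + re75 + u74 + u75,
                family = binomial, data = lalonde)
psfit <- fitted(ps_model)

# 1:3 nearest neighbor matching on the fitted propensity score,
# with the regression-based bias adjustment
m_ps <- Match(Y = lalonde$re78, Tr = lalonde$treat, X = psfit,
              estimand = "ATT", M = 3, BiasAdjust = TRUE)
summary(m_ps)
```

Match receives psfit as an ordinary covariate vector here; the logit fit happens in a separate step before the call.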
Bootstrapping. Using the boot package, re-estimate the ATT with bootstrapped standard errors that take into account the logit estimation of the propensity score. Compare the results with those that apply the bias-adjustment option without bootstrap inference.
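One way to set this up is to put both steps, the logit and the matching, inside the statistic passed to boot, so that each resample re-estimates the propensity score. The sketch assumes the boot and Matching packages are installed; the helper name att_stat, the logit specification, and the small number of replications (R = 50, kept low so the example runs quickly) are assumptions.

```r
library(boot)
library(Matching)
data(lalonde, package = "Matching")

# Hypothetical statistic: re-fit the logit and re-match within each resample
att_stat <- function(data, idx) {
  d  <- data[idx, ]
  ps <- fitted(glm(treat ~ age + I(age^2) + educ + I(educ^2) + black + hisp +
                     married + nodegr + re74 + re75 + u74 + u75,
                   family = binomial, data = d))
  m  <- Match(Y = d$re78, Tr = d$treat, X = ps,
              estimand = "ATT", M = 3, BiasAdjust = TRUE)
  as.numeric(m$est)
}

set.seed(2024)
boot_out <- boot(data = lalonde, statistic = att_stat, R = 50)

# Bootstrap standard error: standard deviation of the replicated estimates
boot_se <- sd(boot_out$t[, 1])
cat("ATT:", boot_out$t0, " bootstrap SE:", boot_se, "\n")
```

In practice a much larger R would be used; boot_out$t0 is the estimate on the original sample and boot_out$t holds the replicates.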
Quality of the propensity score matching.
psfit for participants and non-participants.
Using the kdensity package, plot the density of the fitted propensity score for participants and non-participants. Hint: psfitdens1 = kdensity(psfit[D==1]) and plot(psfitdens1) for participants. Alternatively, you could use ggplot as in the lecture slides: ggplot(lalonde, aes(psfit, fill = as.factor(D))) + stat_density(aes(y = ..density..), position = "identity", color = "black", alpha = 0.5)
Plot a histogram of psfit for participants and non-participants. Hint: hist(psfit[D==0]) for non-participants. Alternatively, you may use one of the many great-looking options provided by the package cobalt.
Use the command Match from the package Matching and the command matchit from the package MatchIt. Compare the estimated results.
Another matching method. Implement another matching method studied in class that you have not shown so far. How do the results differ qualitatively from the preceding ones?
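For the density comparison of psfit by participation status described in the quality checks, a base-R sketch is shown below. It swaps in stats::density for kdensity or ggplot, and the simulated D, x, and psfit are stand-ins for the real objects.

```r
# Simulated stand-ins (assumption): D is participation, psfit the fitted score
set.seed(11)
n <- 600
x <- rnorm(n)
D <- rbinom(n, 1, plogis(0.8 * x))
psfit <- fitted(glm(D ~ x, family = binomial))

# Kernel density of the fitted propensity score in each group
dens1 <- density(psfit[D == 1], from = 0, to = 1)
dens0 <- density(psfit[D == 0], from = 0, to = 1)

plot(dens0, main = "Fitted propensity score by group", xlab = "psfit",
     ylim = range(dens0$y, dens1$y))
lines(dens1, lty = 2)
legend("topright", legend = c("Non-participants", "Participants"), lty = 1:2)

# Range of scores observed in both groups (common support)
common <- c(max(min(psfit[D == 1]), min(psfit[D == 0])),
            min(max(psfit[D == 1]), max(psfit[D == 0])))
share_in_support <- mean(psfit >= common[1] & psfit <= common[2])
cat("Share of units on common support:", share_in_support, "\n")
```

Large regions where only one group has density indicate weak overlap, which is where matched comparisons are least reliable.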
Wrapping-up for a non-technical audience (not graded). If you want to give a summary presentation to a non-technical audience, you may get inspiration from these slides: PSM for policy-makers.
HAVE FUN AND KEEP FAITH IN THE FUN!