The goal of the following simulation was to come up with better plots than the ones in Appendix B of Armstrong, Jagolinzer and Larcker (2010). My thinking was that a better example would use a setting with no treatment effect to illustrate the bias in regression methods when there are non-linearities in the true relation between \( X \) and \( y \) that are also present in selection for treatment. (In contrast, Appendix B of the paper focuses on non-linear treatment effects, which seems to be a different issue.) However, some aspects of the simulation still require work.
We first generate data. We assume that \( X_3 \) is not observable. But in this case, \( X_3 \) does not affect the probability of treatment, so we have "selection on observables" in a sense, though selecting on \( X_2^2 \) is clearly not the same thing as selecting on \( X_2 \).
N <- 10000
# Covariates are chi-squared(1) draws: non-negative and right-skewed
X1 <- rnorm(N)
X2 <- rnorm(N)^2
X3 <- rnorm(N)^2
# Note that the second column is X2^2, not X2
X <- cbind(X1, X2^2, X3)
Now assign treatment as a function of \( X \), which includes \( X_2^2 \).
# Subtracting 3 gives an average p well below 0.5. This is consistent with
# having more control than treatment cases
b_d <- c(1, 1, 0)  # X3 gets zero weight, so selection is on observables
# X_d is a latent variable driving assignment
X_d <- drop(X %*% b_d) - 3 + rnorm(N, 0, 3)
p_d <- exp(X_d)/(1 + exp(X_d))  # inverse logit (logistic) function
d <- runif(N) <= p_d
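As a quick check (this line is my addition, not part of the original analysis), the realized treatment share should indeed fall below 0.5:
# Share of treated observations
mean(d)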
Now, generate \( y \). Note that \( X_2 \) has a non-linear relation with \( y \), mirroring the relation in selection for treatment.
b <- c(2, 3, 4)
e <- rnorm(N, 0, 0.1)
# Note that d does not enter the model: the true treatment effect is zero
y <- X1 * b[1] + X2^2 * b[2] + X3 * b[3] + e
Now estimate OLS. Note that we estimate a treatment effect (a significant non-zero coefficient on \( d \)), even though \( d \) is absent from the true model.
# Include the true propensity score p_d so that we can plot it below
reg.data <- data.frame(y, d, X1, X2, p_d)
# Estimate OLS
summary(lm(y ~ d + X1 + X2, data = reg.data))
##
## Call:
## lm(formula = y ~ d + X1 + X2, data = reg.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.8 -9.2 1.0 6.8 489.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.516 0.244 -18.5 <2e-16 ***
## dTRUE -7.010 0.379 -18.5 <2e-16 ***
## X1 2.878 0.119 24.3 <2e-16 ***
## X2 20.016 0.124 161.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.5 on 9996 degrees of freedom
## Multiple R-squared: 0.748, Adjusted R-squared: 0.748
## F-statistic: 9.9e+03 on 3 and 9996 DF, p-value: <2e-16
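The spurious treatment effect comes from the misspecified functional form. Because \( X_3 \) is independent of \( d \), \( X_1 \), and \( X_2 \), a regression that includes \( X_2^2 \) should drive the coefficient on \( d \) to roughly zero. The following check is my addition and its output is not part of the original results.
# Regression with the correct functional form for X2; the coefficient on d
# should now be statistically indistinguishable from zero
summary(lm(y ~ d + X1 + I(X2^2), data = reg.data))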
One issue with the data is that we don't have a huge amount of overlap in the propensity-score distributions of the treatment and control groups, which means we won't be able to match all of our treatment cases to controls.
require(ggplot2)
# Plot the distribution of the true propensity score by treatment status
ggplot(reg.data, aes(x = p_d, fill = d)) +
  geom_histogram(position = "identity", alpha = 0.4, binwidth = 0.02)
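To put a rough number on the overlap (again, my addition rather than part of the original analysis), we can compare the distribution of \( p_d \) across the two groups:
# Summary of the true propensity score by treatment status; the treated
# group's distribution should sit well to the right of the controls'
tapply(reg.data$p_d, reg.data$d, summary)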
Now, let's do propensity-score matching. First, we do it using a logit model with \( X_1 \) and \( X_2 \). (Note that the caliper = 0.1 option causes matches that are too far apart to be thrown out. Note also that there are probably several packages that do propensity-score matching in R; I used Google and picked one that looked OK.)
# Estimate the propensity model assuming the covariates enter linearly
glm1 <- glm(d ~ X1 + X2, family = binomial, data = reg.data)
require(Matching)
# One-to-one matching without replacement, estimating the average treatment
# effect on the treated (the 'estimand' option defaults to "ATT")
rr1 <- with(reg.data, Match(Y = y, Tr = d, X = glm1$fitted, replace = FALSE,
                            estimand = "ATT", caliper = 0.1))
summary(rr1)
summary(rr1)
##
## Estimate... 0.20211
## SE......... 0.16359
## T-stat..... 1.2354
## p.val...... 0.21666
##
## Original number of observations.............. 10000
## Original number of treated obs............... 4376
## Matched number of observations............... 2578
## Matched number of observations (unweighted). 2578
##
## Caliper (SDs)........................................ 0.1
## Number of obs dropped by 'exact' or 'caliper' 1798
# Let's check the covariate balance
mb1 <- MatchBalance(d ~ X1 + X2, data = reg.data, match.out = rr1, nboots = 500)
##
## ***** (V1) X1 *****
## Before Matching After Matching
## mean treatment........ 1.4056 1.0241
## mean control.......... 0.70448 0.96661
## std mean diff......... 38.26 4.8845
##
## mean raw eQQ diff..... 0.70153 0.068068
## med raw eQQ diff..... 0.34816 0.043588
## max raw eQQ diff..... 9.2876 1.3586
##
## mean eCDF diff........ 0.11403 0.015178
## med eCDF diff........ 0.1278 0.016292
## max eCDF diff........ 0.1779 0.02948
##
## var ratio (Tr/Co)..... 3.9242 1.0448
## T-test p-value........ < 2.22e-16 0.043337
## KS Bootstrap p-value.. < 2.22e-16 0.214
## KS Naive p-value...... < 2.22e-16 0.21255
## KS Statistic.......... 0.1779 0.02948
##
##
## ***** (V2) X2 *****
## Before Matching After Matching
## mean treatment........ 1.7103 0.66439
## mean control.......... 0.47278 0.68661
## std mean diff......... 64.275 -3.1942
##
## mean raw eQQ diff..... 1.2377 0.04672
## med raw eQQ diff..... 0.83637 0.039775
## max raw eQQ diff..... 13.332 0.74861
##
## mean eCDF diff........ 0.22004 0.022123
## med eCDF diff........ 0.23818 0.018619
## max eCDF diff........ 0.36679 0.048099
##
## var ratio (Tr/Co)..... 12.52 1.1147
## T-test p-value........ < 2.22e-16 0.080579
## KS Bootstrap p-value.. < 2.22e-16 0.006
## KS Naive p-value...... < 2.22e-16 0.0051376
## KS Statistic.......... 0.36679 0.048099
##
##
## Before Matching Minimum p.value: < 2.22e-16
## Variable Name(s): X1 X2 Number(s): 1 2
##
## After Matching Minimum p.value: 0.006
## Variable Name(s): X2 Number(s): 2
Note that, even after matching, \( X_2 \) remains imbalanced: the KS bootstrap p-value is 0.006. The matching estimate of the treatment effect is nonetheless statistically indistinguishable from zero. Now, what happens if we build the known non-linear relation between \( X_2 \) and the latent treatment-assignment variable into the propensity model?
# Estimate the propensity model using the correct (quadratic) functional form
glm2 <- glm(d ~ X1 + I(X2^2), family = binomial, data = reg.data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# One-to-one matching without replacement, estimating the average treatment
# effect on the treated (the 'estimand' option defaults to "ATT")
rr2 <- with(reg.data, Match(Y = y, Tr = d, X = glm2$fitted, replace = FALSE,
                            estimand = "ATT", caliper = 0.1))
summary(rr2)
##
## Estimate... -0.034208
## SE......... 0.16457
## T-stat..... -0.20786
## p.val...... 0.83534
##
## Original number of observations.............. 10000
## Original number of treated obs............... 4376
## Matched number of observations............... 2561
## Matched number of observations (unweighted). 2561
##
## Caliper (SDs)........................................ 0.1
## Number of obs dropped by 'exact' or 'caliper' 1815
# Let's check the covariate balance
mb2 <- MatchBalance(d ~ X1 + I(X2^2), data = reg.data, match.out = rr2, nboots = 500)
##
## ***** (V1) X1 *****
## Before Matching After Matching
## mean treatment........ 1.4056 1.0409
## mean control.......... 0.70448 1.0206
## std mean diff......... 38.26 1.7196
##
## mean raw eQQ diff..... 0.70153 0.027765
## med raw eQQ diff..... 0.34816 0.015904
## max raw eQQ diff..... 9.2876 1.2577
##
## mean eCDF diff........ 0.11403 0.0062568
## med eCDF diff........ 0.1278 0.0058571
## max eCDF diff........ 0.1779 0.019914
##
## var ratio (Tr/Co)..... 3.9242 1.0168
## T-test p-value........ < 2.22e-16 0.44085
## KS Bootstrap p-value.. < 2.22e-16 0.718
## KS Naive p-value...... < 2.22e-16 0.69016
## KS Statistic.......... 0.1779 0.019914
##
##
## ***** (V2) I(X2^2) *****
## Before Matching After Matching
## mean treatment........ 6.631 0.85813
## mean control.......... 0.51955 0.87226
## std mean diff......... 39.434 -1.0218
##
## mean raw eQQ diff..... 6.112 0.039029
## med raw eQQ diff..... 1.1366 0.005551
## max raw eQQ diff..... 259.73 2.2041
##
## mean eCDF diff........ 0.22004 0.0063327
## med eCDF diff........ 0.23818 0.0062476
## max eCDF diff........ 0.36679 0.014447
##
## var ratio (Tr/Co)..... 219.97 0.97514
## T-test p-value........ < 2.22e-16 0.60542
## KS Bootstrap p-value.. < 2.22e-16 0.946
## KS Naive p-value...... < 2.22e-16 0.95203
## KS Statistic.......... 0.36679 0.014447
##
##
## Before Matching Minimum p.value: < 2.22e-16
## Variable Name(s): X1 I(X2^2) Number(s): 1 2
##
## After Matching Minimum p.value: 0.44085
## Variable Name(s): X1 Number(s): 1
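With the correct specification, both covariates are well balanced after matching, and the estimated treatment effect is again indistinguishable from zero. To pull the two ATT estimates from the Match objects for a side-by-side look (a convenience of mine, not part of the original code):
# ATT estimates from the misspecified and correctly specified propensity models
c(linear = rr1$est, quadratic = rr2$est)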