The goal of the following simulation was to come up with better plots than the ones in Appendix B of Armstrong, Jagolinzer and Larcker (2010). My thinking was that a better example would use a setting with no treatment effect to illustrate the bias in regression methods when there are non-linearities in the true relation between \( X \) and \( y \) that are also present in selection for treatment. (In contrast, Appendix B of the paper focuses on non-linear treatment effects, which seems to be a different issue.) However, some aspects of the simulation require some work.
We first generate the data. We assume that \( X_3 \) is not observable. But in this case, \( X_3 \) does not affect the probability of treatment, so we have "selection on observables" in a sense, though \( X_2^2 \) is clearly not the same thing as \( X_2 \).
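Since everything below is simulated, it may help to set a seed up front so a given run is reproducible. (The output shown was generated from one particular draw, so exact numbers will vary with the seed.)
set.seed(42)  # arbitrary seed, purely for reproducibility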
N <- 10000
# Each X is the square of a standard normal, i.e., chi-squared(1),
# so the covariates are skewed and non-negative
X1 <- rnorm(N)^2
X2 <- rnorm(N)^2
X3 <- rnorm(N)^2
X <- cbind(X1, X2^2, X3)  # note that X contains the square of X2, not X2 itself
Now assign treatment as a function of \( X \), which includes \( X_2^2 \).
# Subtracting 3 gives an average p well below 0.5. This is consistent with
# more controls than treatment cases
b_d <- c(1, 1, 0)
# X_d is a latent variable driving assignment
X_d <- (X %*% b_d) - 3 + rnorm(N, 0, 3)
p_d <- exp(X_d)/(1 + exp(X_d))  # inverse logit (logistic) function
d <- runif(N) <= p_d
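A quick sanity check (values will vary by run) that the intercept shift above does what the comment claims:
mean(d)    # realized share of treated observations; should be well below 0.5
mean(p_d)  # average assignment probability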
Now, generate \( y \). Note that \( X_2 \) has a non-linear relation with \( y \), mirroring the relation in the selection for treatment.
b <- c(2, 3, 4)
e <- rnorm(N, 0, 0.1)
# The true model excludes d, so the true treatment effect is zero
y <- X1 * b[1] + X2^2 * b[2] + X3 * b[3] + e
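As a sanity check (not part of the original output), a regression with the correct functional form, and using \( X_3 \), which we can see only because this is a simulation, should recover estimates close to the true coefficients \( b = (2, 3, 4) \):
# Correctly specified model; estimates should be close to c(2, 3, 4)
summary(lm(y ~ X1 + I(X2^2) + X3))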
Now estimate OLS. Note that we estimate a treatment effect (non-zero coefficient on \( d \)), even though \( d \) is absent from the true model.
reg.data <- data.frame(y, d, X1, X2, p_d = as.vector(p_d))  # keep p_d for the overlap plot below
# Estimate OLS
summary(lm(y ~ d + X1 + X2, data = reg.data))
##
## Call:
## lm(formula = y ~ d + X1 + X2, data = reg.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.3 -9.1 1.0 6.8 506.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.341 0.246 -17.6 <2e-16 ***
## dTRUE -6.700 0.380 -17.6 <2e-16 ***
## X1 2.377 0.119 19.9 <2e-16 ***
## X2 19.866 0.129 154.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.5 on 9996 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.73
## F-statistic: 9.02e+03 on 3 and 9996 DF, p-value: <2e-16
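For comparison (this check is not part of the original output), giving OLS the correct functional form for \( X_2 \) should make the spurious treatment effect disappear: conditional on \( X_1 \) and \( X_2^2 \), \( d \) is independent of the omitted \( X_3 \), so the coefficient on \( d \) should be near zero.
# OLS with the correct functional form; the coefficient on d should be
# statistically indistinguishable from zero (values vary by run)
summary(lm(y ~ d + X1 + I(X2^2), data = reg.data))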
One issue with these data is that there is not a huge amount of overlap in the distributions of assignment probabilities for the treatment and control groups, which means we won't be able to match all of our treatment cases to controls.
require(ggplot2)
ggplot(reg.data, aes(x = p_d, fill = d)) +
    geom_histogram(position = "identity", alpha = 0.4, binwidth = 0.02)
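A rough numerical look at the same issue (this uses the p_d column added to reg.data above): treated cases with assignment probabilities above the maximum among controls have no potential match.
# Distribution of assignment probabilities by treatment status
with(reg.data, tapply(p_d, d, summary))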
Now, let's do propensity score matching. First, we do it using a logit model with \( X_1 \) and \( X_2 \). (Note that the caliper=0.1 option throws out matches that are too far apart. Note also that there are probably several packages that do propensity-score matching in R; I used Google and picked one that looked OK.)
# Estimate the propensity model assuming a linear index in X1 and X2
glm1 <- glm(d ~ X1 + X2, family = binomial, data = reg.data)
require(Matching)
# One-to-one matching without replacement, estimating the treatment effect
# on the treated (the 'estimand' option defaults to ATT)
rr1 <- with(reg.data, Match(Y = y, Tr = d, X = glm1$fitted, replace = FALSE,
estimand = "ATT", caliper = 0.1))
summary(rr1)
##
## Estimate... 0.12843
## SE......... 0.15903
## T-stat..... 0.80761
## p.val...... 0.41932
##
## Original number of observations.............. 10000
## Original number of treated obs............... 4321
## Matched number of observations............... 2585
## Matched number of observations (unweighted). 2585
##
## Caliper (SDs)........................................ 0.1
## Number of obs dropped by 'exact' or 'caliper' 1736
# Let's check the covariate balance
mb1 <- MatchBalance(d ~ X1 + X2, data = reg.data, match.out = rr1, nboots = 500)
##
## ***** (V1) X1 *****
## Before Matching After Matching
## mean treatment........ 1.3742 1.0086
## mean control.......... 0.71733 0.94718
## std mean diff......... 35.962 5.1478
##
## mean raw eQQ diff..... 0.65749 0.070013
## med raw eQQ diff..... 0.31068 0.051157
## max raw eQQ diff..... 7.1499 2.54
##
## mean eCDF diff........ 0.10321 0.014315
## med eCDF diff........ 0.11515 0.013153
## max eCDF diff........ 0.16122 0.034429
##
## var ratio (Tr/Co)..... 3.7631 1.0422
## T-test p-value........ < 2.22e-16 0.033731
## KS Bootstrap p-value.. < 2.22e-16 0.092
## KS Naive p-value...... < 2.22e-16 0.093371
## KS Statistic.......... 0.16122 0.034429
##
##
## ***** (V2) X2 *****
## Before Matching After Matching
## mean treatment........ 1.6808 0.66073
## mean control.......... 0.47805 0.68377
## std mean diff......... 64.089 -3.3319
##
## mean raw eQQ diff..... 1.2029 0.058244
## med raw eQQ diff..... 0.80659 0.050943
## max raw eQQ diff..... 13.2 0.81786
##
## mean eCDF diff........ 0.21546 0.026069
## med eCDF diff........ 0.22716 0.030948
## max eCDF diff........ 0.37189 0.051838
##
## var ratio (Tr/Co)..... 12.581 1.1649
## T-test p-value........ < 2.22e-16 0.050135
## KS Bootstrap p-value.. < 2.22e-16 0.004
## KS Naive p-value...... < 2.22e-16 0.0019245
## KS Statistic.......... 0.37189 0.051838
##
##
## Before Matching Minimum p.value: < 2.22e-16
## Variable Name(s): X1 X2 Number(s): 1 2
##
## After Matching Minimum p.value: 0.004
## Variable Name(s): X2 Number(s): 2
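As an aside, the matched pairs themselves can be pulled out of the Match object via its index.treated and index.control components (per the Matching package documentation), e.g., to look at the residual imbalance in \( X_2 \) directly:
# Within-pair differences in X2 for the matched sample
matched_diff <- X2[rr1$index.treated] - X2[rr1$index.control]
summary(matched_diff)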
Now, what happens if we know about the non-linear relation between \( X_2 \) and the latent treatment-assignment variable?
# Estimate the propensity model using the correct (quadratic) functional form
glm2 <- glm(d ~ X1 + I(X2^2), family = binomial, data = reg.data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# One-to-one matching without replacement, estimating the treatment effect
# on the treated (the 'estimand' option defaults to ATT)
rr2 <- with(reg.data, Match(Y = y, Tr = d, X = glm2$fitted, replace = FALSE,
estimand = "ATT", caliper = 0.1))
summary(rr2)
##
## Estimate... -0.070776
## SE......... 0.15411
## T-stat..... -0.45925
## p.val...... 0.64605
##
## Original number of observations.............. 10000
## Original number of treated obs............... 4321
## Matched number of observations............... 2574
## Matched number of observations (unweighted). 2574
##
## Caliper (SDs)........................................ 0.1
## Number of obs dropped by 'exact' or 'caliper' 1747
# Let's check the covariate balance
mb2 <- MatchBalance(d ~ X1 + I(X2^2), data = reg.data, match.out = rr2, nboots = 500)
##
## ***** (V1) X1 *****
## Before Matching After Matching
## mean treatment........ 1.3742 1.0354
## mean control.......... 0.71733 1.0206
## std mean diff......... 35.962 1.2185
##
## mean raw eQQ diff..... 0.65749 0.031419
## med raw eQQ diff..... 0.31068 0.016259
## max raw eQQ diff..... 7.1499 1.7175
##
## mean eCDF diff........ 0.10321 0.0054429
## med eCDF diff........ 0.11515 0.005439
## max eCDF diff........ 0.16122 0.013986
##
## var ratio (Tr/Co)..... 3.7631 1.0244
## T-test p-value........ < 2.22e-16 0.57607
## KS Bootstrap p-value.. < 2.22e-16 0.966
## KS Naive p-value...... < 2.22e-16 0.96282
## KS Statistic.......... 0.16122 0.013986
##
##
## ***** (V2) I(X2^2) *****
## Before Matching After Matching
## mean treatment........ 6.3462 0.83504
## mean control.......... 0.50841 0.84131
## std mean diff......... 38.341 -0.45709
##
## mean raw eQQ diff..... 5.8382 0.023909
## med raw eQQ diff..... 1.1042 0.0044552
## max raw eQQ diff..... 262.29 4.4618
##
## mean eCDF diff........ 0.21546 0.0039347
## med eCDF diff........ 0.22716 0.0034965
## max eCDF diff........ 0.37189 0.013598
##
## var ratio (Tr/Co)..... 233.1 1.0441
## T-test p-value........ < 2.22e-16 0.79395
## KS Bootstrap p-value.. < 2.22e-16 0.946
## KS Naive p-value...... < 2.22e-16 0.97121
## KS Statistic.......... 0.37189 0.013598
##
##
## Before Matching Minimum p.value: < 2.22e-16
## Variable Name(s): X1 I(X2^2) Number(s): 1 2
##
## After Matching Minimum p.value: 0.57607
## Variable Name(s): X1 Number(s): 1
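Finally, putting the two matched estimates side by side (the est component of the Match object): both should be close to the true treatment effect of zero, though only the quadratic specification also achieves good covariate balance.
# Point estimates from the two matching specifications (values vary by run)
c(linear = rr1$est, quadratic = rr2$est)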