The goal of the following simulation was to come up with better plots than the ones in Appendix B of Armstrong, Jagolinzer and Larcker (2010). My thinking was that a better example would use a setting with no treatment effect to illustrate the bias in regression methods when there are non-linearities in the true relation between \( X \) and \( y \) that are also present in selection for treatment. (In contrast, Appendix B of the paper focuses on non-linear treatment effects, which seems to be a different issue.) However, some aspects of the simulation require some work.
We first generate the data. We assume that \( X_3 \) is not observable. But in this case, \( X_3 \) does not affect the probability of treatment, so we have "selection on observables" in a sense, though \( X_2^2 \) is clearly not the same thing as \( X_2 \).
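Since everything below is simulated, it may help to set a seed up front so a given run is reproducible. (The output shown was generated from one particular draw, so exact numbers will vary with the seed.)
set.seed(42)  # arbitrary seed, purely for reproducibility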
N <- 10000
# Each X is the square of a standard normal, i.e., chi-squared(1),
# so the covariates are skewed and non-negative
X1 <- rnorm(N)^2
X2 <- rnorm(N)^2
X3 <- rnorm(N)^2
X <- cbind(X1, X2^2, X3)  # note that X contains the square of X2, not X2 itself
Now assign treatment as a function of \( X \), which includes \( X_2^2 \).
# Subtracting 3 gives an average p well below 0.5. This is consistent with
# more controls than treatment cases
b_d <- c(1, 1, 0)
# X_d is a latent variable driving assignment
X_d <- (X %*% b_d) - 3 + rnorm(N, 0, 3)
p_d <- exp(X_d)/(1 + exp(X_d))  # inverse logit (logistic) function
d <- runif(N) <= p_d
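A quick sanity check (values will vary by run) that the intercept shift above does what the comment claims:
mean(d)    # realized share of treated observations; should be well below 0.5
mean(p_d)  # average assignment probability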
Now, generate \( y \). Note that \( X_2 \) has a non-linear relation with \( y \), mirroring the relation in the selection for treatment.
b <- c(2, 3, 4)
e <- rnorm(N, 0, 0.1)
# The true model excludes d, so the true treatment effect is zero
y <- X1 * b[1] + X2^2 * b[2] + X3 * b[3] + e
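As a sanity check (not part of the original output), a regression with the correct functional form, and using \( X_3 \), which we can see only because this is a simulation, should recover estimates close to the true coefficients \( b = (2, 3, 4) \):
# Correctly specified model; estimates should be close to c(2, 3, 4)
summary(lm(y ~ X1 + I(X2^2) + X3))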
Now estimate OLS. Note that we estimate a treatment effect (non-zero coefficient on \( d \)), even though \( d \) is absent from the true model.
reg.data <- data.frame(y, d, X1, X2, p_d = as.vector(p_d))  # keep p_d for the overlap plot below
# Estimate OLS
summary(lm(y ~ d + X1 + X2, data = reg.data))
##
## Call:
## lm(formula = y ~ d + X1 + X2, data = reg.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.3 -9.1 1.0 6.8 506.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.341 0.246 -17.6 <2e-16 ***
## dTRUE -6.700 0.380 -17.6 <2e-16 ***
## X1 2.377 0.119 19.9 <2e-16 ***
## X2 19.866 0.129 154.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.5 on 9996 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.73
## F-statistic: 9.02e+03 on 3 and 9996 DF, p-value: <2e-16
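For comparison (this check is not part of the original output), giving OLS the correct functional form for \( X_2 \) should make the spurious treatment effect disappear: conditional on \( X_1 \) and \( X_2^2 \), \( d \) is independent of the omitted \( X_3 \), so the coefficient on \( d \) should be near zero.
# OLS with the correct functional form; the coefficient on d should be
# statistically indistinguishable from zero (values vary by run)
summary(lm(y ~ d + X1 + I(X2^2), data = reg.data))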
One issue with these data is that there is not a huge amount of overlap in the distributions of assignment probabilities for the treatment and control groups, which means we won't be able to match all of our treatment cases to controls.
require(ggplot2)
ggplot(reg.data, aes(x = p_d, fill = d)) +
    geom_histogram(position = "identity", alpha = 0.4, binwidth = 0.02)
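A rough numerical look at the same issue (this uses the p_d column added to reg.data above): treated cases with assignment probabilities above the maximum among controls have no potential match.
# Distribution of assignment probabilities by treatment status
with(reg.data, tapply(p_d, d, summary))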
Now, let's do propensity score matching. First, we do it using a logit model with \( X_1 \) and \( X_2 \). (Note that the caliper=0.1 option throws out matches that are too far apart. Note also that there are probably several packages that do propensity-score matching in R; I used Google and picked one that looked OK.)
# Estimate the propensity model assuming a linear index in X1 and X2
glm1 <- glm(d ~ X1 + X2, family = binomial, data = reg.data)
require(Matching)
# One-to-one matching without replacement, estimating the treatment effect
# on the treated (the 'estimand' option defaults to ATT)
rr1 <- with(reg.data, Match(Y = y, Tr = d, X = glm1$fitted, replace = FALSE,
estimand = "ATT", caliper = 0.1))
summary(rr1)
##
## Estimate... 0.12843
## SE......... 0.15903
## T-stat..... 0.80761
## p.val...... 0.41932
##
## Original number of observations.............. 10000
## Original number of treated obs............... 4321
## Matched number of observations............... 2585
## Matched number of observations (unweighted). 2585
##
## Caliper (SDs)........................................ 0.1
## Number of obs dropped by 'exact' or 'caliper' 1736
# Let's check the covariate balance
mb1 <- MatchBalance(d ~ X1 + X2, data = reg.data, match.out = rr1, nboots = 500)
##
## ***** (V1) X1 *****
## Before Matching After Matching
## mean treatment........ 1.3742 1.0086
## mean control.......... 0.71733 0.94718
## std mean diff......... 35.962 5.1478
##
## mean raw eQQ diff..... 0.65749 0.070013
## med raw eQQ diff..... 0.31068 0.051157
## max raw eQQ diff..... 7.1499 2.54
##
## mean eCDF diff........ 0.10321 0.014315
## med eCDF diff........ 0.11515 0.013153
## max eCDF diff........ 0.16122 0.034429
##
## var ratio (Tr/Co)..... 3.7631 1.0422
## T-test p-value........ < 2.22e-16 0.033731
## KS Bootstrap p-value.. < 2.22e-16 0.092
## KS Naive p-value...... < 2.22e-16 0.093371
## KS Statistic.......... 0.16122 0.034429
##
##
## ***** (V2) X2 *****
## Before Matching After Matching
## mean treatment........ 1.6808 0.66073
## mean control.......... 0.47805 0.68377
## std mean diff......... 64.089 -3.3319
##
## mean raw eQQ diff..... 1.2029 0.058244
## med raw eQQ diff..... 0.80659 0.050943
## max raw eQQ diff..... 13.2 0.81786
##
## mean eCDF diff........ 0.21546 0.026069
## med eCDF diff........ 0.22716 0.030948
## max eCDF diff........ 0.37189 0.051838
##
## var ratio (Tr/Co)..... 12.581 1.1649
## T-test p-value........ < 2.22e-16 0.050135
## KS Bootstrap p-value.. < 2.22e-16 0.004
## KS Naive p-value...... < 2.22e-16 0.0019245
## KS Statistic.......... 0.37189 0.051838
##
##
## Before Matching Minimum p.value: < 2.22e-16
## Variable Name(s): X1 X2 Number(s): 1 2
##
## After Matching Minimum p.value: 0.004
## Variable Name(s): X2 Number(s): 2
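As an aside, the matched pairs themselves can be pulled out of the Match object via its index.treated and index.control components (per the Matching package documentation), e.g., to look at the residual imbalance in \( X_2 \) directly:
# Within-pair differences in X2 for the matched sample
matched_diff <- X2[rr1$index.treated] - X2[rr1$index.control]
summary(matched_diff)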
Now, what happens if we know about the non-linear relation between \( X_2 \) and the latent treatment-assignment variable?
# Estimate the propensity model using the correct (quadratic) functional form
glm2 <- glm(d ~ X1 + I(X2^2), family = binomial, data = reg.data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# One-to-one matching without replacement, estimating the treatment effect
# on the treated (the 'estimand' option defaults to ATT)
rr2 <- with(reg.data, Match(Y = y, Tr = d, X = glm2$fitted, replace = FALSE,
estimand = "ATT", caliper = 0.1))
summary(rr2)
##
## Estimate... -0.070776
## SE......... 0.15411
## T-stat..... -0.45925
## p.val...... 0.64605
##
## Original number of observations.............. 10000
## Original number of treated obs............... 4321
## Matched number of observations............... 2574
## Matched number of observations (unweighted). 2574
##
## Caliper (SDs)........................................ 0.1
## Number of obs dropped by 'exact' or 'caliper' 1747
# Let's check the covariate balance
mb2 <- MatchBalance(d ~ X1 + I(X2^2), data = reg.data, match.out = rr2, nboots = 500)
##
## ***** (V1) X1 *****
## Before Matching After Matching
## mean treatment........ 1.3742 1.0354
## mean control.......... 0.71733 1.0206
## std mean diff......... 35.962 1.2185
##
## mean raw eQQ diff..... 0.65749 0.031419
## med raw eQQ diff..... 0.31068 0.016259
## max raw eQQ diff..... 7.1499 1.7175
##
## mean eCDF diff........ 0.10321 0.0054429
## med eCDF diff........ 0.11515 0.005439
## max eCDF diff........ 0.16122 0.013986
##
## var ratio (Tr/Co)..... 3.7631 1.0244
## T-test p-value........ < 2.22e-16 0.57607
## KS Bootstrap p-value.. < 2.22e-16 0.966
## KS Naive p-value...... < 2.22e-16 0.96282
## KS Statistic.......... 0.16122 0.013986
##
##
## ***** (V2) I(X2^2) *****
## Before Matching After Matching
## mean treatment........ 6.3462 0.83504
## mean control.......... 0.50841 0.84131
## std mean diff......... 38.341 -0.45709
##
## mean raw eQQ diff..... 5.8382 0.023909
## med raw eQQ diff..... 1.1042 0.0044552
## max raw eQQ diff..... 262.29 4.4618
##
## mean eCDF diff........ 0.21546 0.0039347
## med eCDF diff........ 0.22716 0.0034965
## max eCDF diff........ 0.37189 0.013598
##
## var ratio (Tr/Co)..... 233.1 1.0441
## T-test p-value........ < 2.22e-16 0.79395
## KS Bootstrap p-value.. < 2.22e-16 0.946
## KS Naive p-value...... < 2.22e-16 0.97121
## KS Statistic.......... 0.37189 0.013598
##
##
## Before Matching Minimum p.value: < 2.22e-16
## Variable Name(s): X1 I(X2^2) Number(s): 1 2
##
## After Matching Minimum p.value: 0.57607
## Variable Name(s): X1 Number(s): 1
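Finally, putting the two matched estimates side by side (the est component of the Match object): both should be close to the true treatment effect of zero, though only the quadratic specification also achieves good covariate balance.
# Point estimates from the two matching specifications (values vary by run)
c(linear = rr1$est, quadratic = rr2$est)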