Illustration of matching.

Outline

The goal of the following simulation was to come up with better plots than the ones in Appendix B of Armstrong, Jagolinzer and Larcker (2010). My thinking was that a better example would use a setting with no treatment effect to illustrate the bias in regression methods when there are non-linearities in the true relation between \( X \) and \( y \) that are also present in selection for treatment. (In contrast, Appendix B of the paper focuses on non-linear treatment effects, which seems to be a different issue.) However, some aspects of the simulation require some work.

Details of the simulation

We first generate data. We assume that \( X_3 \) is not observable. But in this case, \( X_3 \) does not affect probability of treatment, so we have “selection on observables'' in a sense, though \( X_2^2 \) is clearly not the same thing as \( X_2 \).

N <- 10000
X1 <- rnorm(N)^2
X2 <- rnorm(N)^2
X3 <- rnorm(N)^2
X <- cbind(X1, X2^2, X3)

Now assign treatment as a function of \( X \), which includes \( X_2^2 \).

# Subtracting 2 gives a average p well below 0.5 This is consistent with
# more controls than treatment cases
b_d <- c(1, 1, 0)

# X_d is a latent variable driving assignment
X_d <- (X %*% b_d) - 3 + rnorm(N, 0, 3)

p_d <- exp(X_d)/(1 + exp(X_d))  # logit function
d <- runif(N) <= p_d

Now, generate \( y \). Note that \( X_2 \) has a non-linear relation with \( y \) mirroring the relation in the selection for treatement.

b <- c(2, 3, 4)
e <- rnorm(N, 0, 0.1)
y <- X1 * b[1] + X2^2 * b[2] + X3 * b[3] + e

Now estimate OLS. Note that we estimate a treatment effect (non-zero coefficient on \( d \)), even though \( d \) is absent from the true model.

reg.data <- data.frame(y, d, X1, X2)
# Estimate OLS
summary(lm(y ~ d + X1 + X2, data = reg.data))
## 
## Call:
## lm(formula = y ~ d + X1 + X2, data = reg.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -28.8   -9.2    1.0    6.8  489.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.516      0.244   -18.5   <2e-16 ***
## dTRUE         -7.010      0.379   -18.5   <2e-16 ***
## X1             2.878      0.119    24.3   <2e-16 ***
## X2            20.016      0.124   161.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 16.5 on 9996 degrees of freedom
## Multiple R-squared: 0.748,   Adjusted R-squared: 0.748 
## F-statistic: 9.9e+03 on 3 and 9996 DF,  p-value: <2e-16

One issue with the data is that we don't have a huge amount of overlap in the distributions, which means we won't be able to match all of our treatment cases to controls.

require(ggplot2)
ggplot(reg.data, aes(x = p_d, fill = d)) + geom_histogram(position = "identity", 
    alpha = 0.4, binwidth = 0.02)

plot of chunk histogram

Now, let's do propensity score matching. First, we do it using a logit model with \( X_1 \) and \( X_2 \). (Note that the caliper=0.1 option forces matches that are too far apart to be thrown out. Note also, that there are probably several packages that do propensity-score matching in R; I used Google and picked one that looked OK.)

# Estimate the propensity model assuming a linear model
glm1 <- glm(d ~ X1 + X2, family = binomial, data = reg.data)

require(Matching)

# one-to-one matching without replacement Estimating the treatment effect
# on the treated (the 'estimand' option defaults to ATT).
rr1 <- with(reg.data, Match(Y = y, Tr = d, X = glm1$fitted, replace = FALSE, 
    estimand = "ATT", caliper = 0.1))
summary(rr1)
## 
## Estimate...  0.20211 
## SE.........  0.16359 
## T-stat.....  1.2354 
## p.val......  0.21666 
## 
## Original number of observations..............  10000 
## Original number of treated obs...............  4376 
## Matched number of observations...............  2578 
## Matched number of observations  (unweighted).  2578 
## 
## Caliper (SDs)........................................   0.1 
## Number of obs dropped by 'exact' or 'caliper'  1798

# Let's check the covariate balance
mb1 <- MatchBalance(d ~ X1 + X2, data = reg.data, match.out = rr1, nboots = 500)
## 
## ***** (V1) X1 *****
##                        Before Matching        After Matching
## mean treatment........     1.4056             1.0241 
## mean control..........    0.70448            0.96661 
## std mean diff.........      38.26             4.8845 
## 
## mean raw eQQ diff.....    0.70153           0.068068 
## med  raw eQQ diff.....    0.34816           0.043588 
## max  raw eQQ diff.....     9.2876             1.3586 
## 
## mean eCDF diff........    0.11403           0.015178 
## med  eCDF diff........     0.1278           0.016292 
## max  eCDF diff........     0.1779            0.02948 
## 
## var ratio (Tr/Co).....     3.9242             1.0448 
## T-test p-value........ < 2.22e-16           0.043337 
## KS Bootstrap p-value.. < 2.22e-16              0.214 
## KS Naive p-value...... < 2.22e-16            0.21255 
## KS Statistic..........     0.1779            0.02948 
## 
## 
## ***** (V2) X2 *****
##                        Before Matching        After Matching
## mean treatment........     1.7103            0.66439 
## mean control..........    0.47278            0.68661 
## std mean diff.........     64.275            -3.1942 
## 
## mean raw eQQ diff.....     1.2377            0.04672 
## med  raw eQQ diff.....    0.83637           0.039775 
## max  raw eQQ diff.....     13.332            0.74861 
## 
## mean eCDF diff........    0.22004           0.022123 
## med  eCDF diff........    0.23818           0.018619 
## max  eCDF diff........    0.36679           0.048099 
## 
## var ratio (Tr/Co).....      12.52             1.1147 
## T-test p-value........ < 2.22e-16           0.080579 
## KS Bootstrap p-value.. < 2.22e-16              0.006 
## KS Naive p-value...... < 2.22e-16          0.0051376 
## KS Statistic..........    0.36679           0.048099 
## 
## 
## Before Matching Minimum p.value: < 2.22e-16 
## Variable Name(s): X1 X2  Number(s): 1 2 
## 
## After Matching Minimum p.value: 0.006 
## Variable Name(s): X2  Number(s): 2

Now, what happens if we assume we knew about the non-linear relation between \( X_2 \) and the latent treatment-assignment variable?

# Estimate the propensity model assuming a linear model
glm2 <- glm(d ~ X1 + I(X2^2), family = binomial, data = reg.data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# one-to-one matching without replacement Estimating the treatment effect
# on the treated (the 'estimand' option defaults to ATT).
rr2 <- with(reg.data, Match(Y = y, Tr = d, X = glm2$fitted, replace = FALSE, 
    estimand = "ATT", caliper = 0.1))
summary(rr2)
## 
## Estimate...  -0.034208 
## SE.........  0.16457 
## T-stat.....  -0.20786 
## p.val......  0.83534 
## 
## Original number of observations..............  10000 
## Original number of treated obs...............  4376 
## Matched number of observations...............  2561 
## Matched number of observations  (unweighted).  2561 
## 
## Caliper (SDs)........................................   0.1 
## Number of obs dropped by 'exact' or 'caliper'  1815

# Let's check the covariate balance
mb2 <- MatchBalance(d ~ X1 + I(X2^2), data = reg.data, match.out = rr2, nboots = 500)
## 
## ***** (V1) X1 *****
##                        Before Matching        After Matching
## mean treatment........     1.4056             1.0409 
## mean control..........    0.70448             1.0206 
## std mean diff.........      38.26             1.7196 
## 
## mean raw eQQ diff.....    0.70153           0.027765 
## med  raw eQQ diff.....    0.34816           0.015904 
## max  raw eQQ diff.....     9.2876             1.2577 
## 
## mean eCDF diff........    0.11403          0.0062568 
## med  eCDF diff........     0.1278          0.0058571 
## max  eCDF diff........     0.1779           0.019914 
## 
## var ratio (Tr/Co).....     3.9242             1.0168 
## T-test p-value........ < 2.22e-16            0.44085 
## KS Bootstrap p-value.. < 2.22e-16              0.718 
## KS Naive p-value...... < 2.22e-16            0.69016 
## KS Statistic..........     0.1779           0.019914 
## 
## 
## ***** (V2) I(X2^2) *****
##                        Before Matching        After Matching
## mean treatment........      6.631            0.85813 
## mean control..........    0.51955            0.87226 
## std mean diff.........     39.434            -1.0218 
## 
## mean raw eQQ diff.....      6.112           0.039029 
## med  raw eQQ diff.....     1.1366           0.005551 
## max  raw eQQ diff.....     259.73             2.2041 
## 
## mean eCDF diff........    0.22004          0.0063327 
## med  eCDF diff........    0.23818          0.0062476 
## max  eCDF diff........    0.36679           0.014447 
## 
## var ratio (Tr/Co).....     219.97            0.97514 
## T-test p-value........ < 2.22e-16            0.60542 
## KS Bootstrap p-value.. < 2.22e-16              0.946 
## KS Naive p-value...... < 2.22e-16            0.95203 
## KS Statistic..........    0.36679           0.014447 
## 
## 
## Before Matching Minimum p.value: < 2.22e-16 
## Variable Name(s): X1 I(X2^2)  Number(s): 1 2 
## 
## After Matching Minimum p.value: 0.44085 
## Variable Name(s): X1  Number(s): 1