Pre-checks and data cleaning
I use the CGSS (Chinese General Social Survey) 2010 data.
setwd('/Users/Tina/Documents/IPM/psm')
load('data/cgss2010.rdata')
attach(newdata)
Check the summary statistics of the dataset
summary(newdata)
## hincome CCPmember male age
## Min. : 0 Min. :0.0000 Min. :0 Min. : 17.0
## 1st Qu.: 3000 1st Qu.:0.0000 1st Qu.:0 1st Qu.: 36.0
## Median : 10000 Median :0.0000 Median :0 Median : 46.0
## Mean : 19211 Mean :0.1242 Mean :0 Mean : 47.8
## 3rd Qu.: 20000 3rd Qu.:0.0000 3rd Qu.:0 3rd Qu.: 58.0
## Max. :6000000 Max. :1.0000 Max. :0 Max. :2013.0
## NA's :1625 NA's :16
## race edu height weight
## Min. :-3.000 Min. :1.000 Min. :110.0 Min. : 70.0
## 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:158.0 1st Qu.:105.0
## Median : 1.000 Median :2.000 Median :164.0 Median :120.0
## Mean : 1.456 Mean :2.148 Mean :163.9 Mean :121.3
## 3rd Qu.: 1.000 3rd Qu.:3.000 3rd Qu.:170.0 3rd Qu.:135.0
## Max. : 8.000 Max. :5.000 Max. :193.0 Max. :246.0
## NA's :15 NA's :21 NA's :26
## faEdu faCCPmember english mandarin
## Min. :1.000 Min. :0.0000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:2.000
## Median :1.000 Median :0.0000 Median :1.000 Median :3.000
## Mean :1.443 Mean :0.1607 Mean :1.389 Mean :3.074
## 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :5.000 Max. :1.0000 Max. :5.000 Max. :5.000
## NA's :489 NA's :222 NA's :20 NA's :22
The summary statistics show that the dataset contains some missing (NA) values. Also, the minimum of the variable race is -3, which is not a valid category and should be recoded as missing. After recoding, I count how many missing values each row has.
library("car")
## Loading required package: carData
race.recode <- recode(race,"-3 = NA") #recode -3 into NA
newdata$race <- race.recode
table(rowSums(is.na(newdata)))
##
## 0 1 2 3 4 5 6 10
## 9642 1884 207 41 4 1 1 3
The result shows that only 9,642 of the 11,783 observations have no missing values. For the following analysis, I delete every observation with at least one missing value.
data.nomissing <- newdata[complete.cases(newdata), ]
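The recoding and listwise deletion steps can also be sketched in base R, without the car package. The data frame `toy` and its values below are made up for illustration only; they are not taken from the CGSS data.

```r
# Toy data frame standing in for the survey data (all values are invented).
toy <- data.frame(
  race   = c(1, -3, 2, 1, NA),
  height = c(160, 172, NA, 158, 165),
  weight = c(110, 140, 125, NA, NA)
)

# Recode the sentinel value -3 to NA; which() skips rows that are already NA.
toy$race[which(toy$race == -3)] <- NA

# Count the missing values in each row, then tabulate those counts.
na_per_row <- rowSums(is.na(toy))
na_table <- table(na_per_row)
print(na_table)

# Listwise deletion: keep only rows without any missing value.
toy_complete <- toy[complete.cases(toy), ]
print(toy_complete)
```

Listwise deletion is simple, but it assumes the dropped rows do not differ systematically from the retained ones.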
In the following analysis, I use the new dataset (named data.nomissing).
detach(newdata)
attach(data.nomissing)
I also check the type of each variable.
str(data.nomissing)
## 'data.frame': 9642 obs. of 12 variables:
## $ hincome : num 0 0 3000 20000 8500 4200 8500 18000 5000 0 ...
## $ CCPmember : num 0 0 1 0 0 0 0 1 0 0 ...
## $ male : num 0 0 0 0 0 0 0 0 0 0 ...
## $ age : num 39 62 58 47 41 37 29 76 46 39 ...
## $ race : num 1 1 1 1 1 1 1 1 1 1 ...
## $ edu : num 1 1 1 2 3 1 3 3 2 1 ...
## $ height : num 158 145 170 167 175 160 173 170 170 150 ...
## $ weight : num 140 100 112 150 130 110 145 105 155 85 ...
## $ faEdu : num 1 1 1 1 1 1 3 1 2 1 ...
## $ faCCPmember: num 0 0 0 0 0 0 1 0 1 0 ...
## $ english : num 1 1 1 1 1 1 3 1 1 1 ...
## $ mandarin : num 3 2 2 2 2 2 5 1 2 2 ...
Several categorical variables are stored as numeric. I convert the nominal ones to factors and the ordinal ones to ordered factors.
data.nomissing$male <- as.factor(male)
data.nomissing$race <- as.factor(race)
data.nomissing$edu <- as.ordered(edu)
data.nomissing$faEdu <- as.ordered(faEdu)
data.nomissing$faCCPmember <- as.factor(faCCPmember)
data.nomissing$english <- as.ordered(english)
data.nomissing$mandarin <- as.ordered(mandarin)
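The same conversions can be written without relying on attached variables, by addressing the columns through the data frame itself. This is only a sketch on a made-up data frame `toy`; the grouping into `nominal_vars` and `ordinal_vars` and the explicit level set `1:5` are assumptions based on the summary output above.

```r
# Invented stand-in for a few columns of the survey data.
toy <- data.frame(
  male        = c(0, 1, 0, 1),
  edu         = c(1, 3, 2, 5),
  faCCPmember = c(0, 0, 1, 0)
)

# Hypothetical grouping of variables by measurement level.
nominal_vars <- c("male", "faCCPmember")
ordinal_vars <- c("edu")

# Nominal variables become unordered factors.
toy[nominal_vars] <- lapply(toy[nominal_vars], factor)

# Ordinal variables become ordered factors; fixing the levels keeps
# categories that happen to be absent from the sample.
toy[ordinal_vars] <- lapply(
  toy[ordinal_vars],
  function(x) factor(x, levels = 1:5, ordered = TRUE)
)

str(toy)
```

Note that `glm()` uses polynomial contrasts for ordered factors by default, which is why terms such as edu.L and edu.Q appear in model output.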
Treatment variable: CCPmember
I want to get the frequency table of CCPmember first.
count <- table(CCPmember)
perc <- prop.table(count)
CCPmember.table <- cbind(count, perc)
CCPmember.table
## count perc
## 0 8384 0.8695291
## 1 1258 0.1304709
The result indicates that the treatment group (respondents with CCP membership, 13.0% of the sample) is much smaller than the control group (87.0%).
Outcome variable: hincome
I want to know the distribution of the outcome variable, household income.
summary(hincome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3000 10000 19332 20000 6000000
par(mfrow=c(1,2))
hist(hincome, main = "Histogram of household income", xlab = "")
boxplot(hincome, main = "Boxplot of household income")

# Check how many observations have an income equal to 0.
noincome <- (hincome ==0)
count <- table(noincome)
perc <- prop.table(count)
noincome.table <- cbind(count, perc)
noincome.table
## count perc
## FALSE 8542 0.8859158
## TRUE 1100 0.1140842
The summary statistics, the histogram, and the boxplot all show that the distribution of household income is extremely positively skewed, with one outlier reporting an income of 6,000,000. Also, a high share (11.4%) of observations have an income equal to zero. These issues need to be dealt with before estimating the effect of CCP membership on household income.
First, I try a log transformation. To handle the zeros, I add 1 to each value before taking the log.
lnhincome <- log(hincome+1)
data.nomissing <- cbind(data.nomissing,lnhincome)
par(mfrow=c(1,2))
hist(lnhincome, main = "Histogram of log household income", xlab = "")
boxplot(lnhincome, main = "Boxplot of log household income")
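As a small illustration of the log(x + 1) transformation, base R provides `log1p()` and its inverse `expm1()`. The income values below are invented for the example.

```r
# Made-up incomes, including a zero and one very large value.
income <- c(0, 3000, 10000, 20000, 6000000)

# log(x + 1) written out, and the equivalent built-in.
ln_manual  <- log(income + 1)
ln_builtin <- log1p(income)
print(all.equal(ln_manual, ln_builtin))

# A zero income maps to exactly zero on the log scale,
# so the spike at zero is moved, not removed.
print(ln_builtin[1])

# expm1() undoes the transformation.
print(expm1(ln_builtin))
```

The transformation compresses the right tail, but the observations at zero still form a separate mass point.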

The histogram shows that the distribution of log income is much closer to a normal distribution than that of raw income. However, the zeros remain a problem. One explanation is that the income of these observations is not really zero but is another form of missing data. Because this variable measures income rather than earnings or salary, a household (rather than an individual) with no income at all is implausible: it should at least have something like rental income, subsidies, or a pension. Hence, I exclude the observations with zero income from the following analysis.
data.nomissing <- data.nomissing[hincome != 0, ]
detach(data.nomissing)
attach(data.nomissing)
## The following object is masked _by_ .GlobalEnv:
##
## lnhincome
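The masking message above means that a free-standing object in the global environment has the same name as a column of the attached data frame, and the global object is the one that gets found first. The following sketch, on an invented data frame `df`, shows why this matters after filtering rows: the global vector keeps its old length while the column follows the filter.

```r
# Invented data with two zero-income rows.
df <- data.frame(income = c(0, 100, 0, 2500, 40))

# A free-standing vector created before filtering.
ln_income <- log(df$income + 1)

# Store it as a column, then drop the zero-income rows.
df$ln_income <- ln_income
df <- df[df$income != 0, ]

# The global vector still has the pre-filter length ...
print(length(ln_income))
# ... while the column inside the data frame has been filtered.
print(nrow(df))
print(length(df$ln_income))
```

Referring to columns as `df$ln_income`, or passing `data = df` to modelling functions, avoids picking up the stale global vector by accident.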
Estimate propensity score
Estimate propensity score by using logit model:
# The variable male is constant in this dataset (all values are 0), so I do not include it.
ps1 <- glm(CCPmember ~ age + race + edu + height + weight + english +
mandarin + faEdu + faCCPmember, family = binomial, data = data.nomissing)
summary(ps1)
##
## Call:
## glm(formula = CCPmember ~ age + race + edu + height + weight +
## english + mandarin + faEdu + faCCPmember, family = binomial,
## data = data.nomissing)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2200 -0.5060 -0.3148 -0.1830 3.3160
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.358147 35.766410 -0.457 0.64741
## age 0.062913 0.002979 21.122 < 2e-16 ***
## race2 0.689317 0.508335 1.356 0.17509
## race3 -0.376056 0.414123 -0.908 0.36384
## race4 0.050520 0.268938 0.188 0.85099
## race5 1.497205 0.476193 3.144 0.00167 **
## race6 -0.099728 0.408653 -0.244 0.80720
## race7 0.258124 0.421621 0.612 0.54039
## race8 0.340691 0.217983 1.563 0.11807
## edu.L 3.781815 0.220059 17.185 < 2e-16 ***
## edu.Q 0.497228 0.170102 2.923 0.00347 **
## edu.C 0.046574 0.112626 0.414 0.67922
## edu^4 -0.145172 0.075862 -1.914 0.05567 .
## height 0.051731 0.005900 8.768 < 2e-16 ***
## weight 0.002804 0.001927 1.455 0.14576
## english.L -0.525491 0.360648 -1.457 0.14510
## english.Q -0.304533 0.296095 -1.028 0.30372
## english.C 0.172648 0.236303 0.731 0.46501
## english^4 0.319314 0.161395 1.978 0.04788 *
## mandarin.L 0.297112 0.131854 2.253 0.02424 *
## mandarin.Q -0.329774 0.104231 -3.164 0.00156 **
## mandarin.C 0.046905 0.092468 0.507 0.61198
## mandarin^4 0.075792 0.074111 1.023 0.30646
## faEdu.L -8.653729 113.066296 -0.077 0.93899
## faEdu.Q -6.886321 95.558459 -0.072 0.94255
## faEdu.C -4.087590 56.533230 -0.072 0.94236
## faEdu^4 -1.772875 21.367831 -0.083 0.93388
## faCCPmember1 0.351880 0.090010 3.909 9.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6998.2 on 8541 degrees of freedom
## Residual deviance: 5349.5 on 8514 degrees of freedom
## AIC: 5405.5
##
## Number of Fisher Scoring iterations: 12
The following lines append the fitted values (the estimated propensity scores) to the data frame.
pscore <- ps1$fitted.values
data.nomissing <- cbind(data.nomissing,pscore)
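For a logit model, the propensity score is the fitted probability \(e(x) = \Pr(T = 1 \mid X = x) = 1 / (1 + \exp(-x^\top \beta))\). The sketch below uses simulated data (the variables `x` and `treat` and the coefficients are invented) to show that `fitted.values`, `predict(type = "response")`, and the formula applied by hand all give the same numbers.

```r
set.seed(1)

# Simulated covariate and treatment assignment (purely illustrative).
n <- 500
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(-1 + 0.8 * x))
sim <- data.frame(treat = treat, x = x)

# Logit model for the probability of treatment.
fit <- glm(treat ~ x, family = binomial, data = sim)

# Three equivalent ways to obtain the estimated propensity score.
ps_fitted  <- unname(fit$fitted.values)
ps_predict <- unname(predict(fit, type = "response"))
ps_manual  <- unname(plogis(coef(fit)[1] + coef(fit)[2] * sim$x))

print(all.equal(ps_fitted, ps_predict))
print(all.equal(ps_fitted, ps_manual))

# Store the score as a column so that it stays aligned with the rows.
sim$pscore <- ps_fitted
print(summary(sim$pscore))
```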
Estimate ATT
Estimate the Average Treatment Effect on the Treated (ATT):
With 1:1 nearest-neighbor matching
library("Matching")
## Loading required package: MASS
## ##
## ## Matching (Version 4.9-6, Build Date: 2019-04-07)
## ## See http://sekhon.berkeley.edu/matching for additional documentation.
## ## Please cite software as:
## ## Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## ## Software with Automated Balance Optimization: The Matching package for R.''
## ## Journal of Statistical Software, 42(7): 1-52.
## ##
psm1 <- Match(Y = hincome, Tr = CCPmember, X = pscore, estimand = "ATT",
M = 1, replace = TRUE)
summary(psm1)
##
## Estimate... -204.34
## AI SE...... 6013.6
## T-stat..... -0.03398
## p.val...... 0.97289
##
## Original number of observations.............. 8542
## Original number of treated obs............... 1218
## Matched number of observations............... 1218
## Matched number of observations (unweighted). 14791
The point estimate is negative (\(\beta = -204.34\)) but not significantly different from zero (\(p = 0.97289\)).
With 1:5 nearest-neighbor matching
psm2 <- Match(Y = hincome, Tr = CCPmember, X = pscore, estimand = "ATT",
M = 5, replace = TRUE)
summary(psm2)
##
## Estimate... 2808.9
## AI SE...... 4610
## T-stat..... 0.60931
## p.val...... 0.54232
##
## Original number of observations.............. 8542
## Original number of treated obs............... 1218
## Matched number of observations............... 1218
## Matched number of observations (unweighted). 17421
The point estimate is positive (\(\beta = 2808.9\)) but still not significantly different from zero (\(p = 0.54232\)).
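To make the idea behind these estimates concrete, here is a minimal hand-written sketch of nearest-neighbor matching on the propensity score with replacement. The function `att_nn` and the toy vectors are hypothetical and are not part of the Matching package. The sketch does not handle ties and does not compute the Abadie-Imbens standard error, so it will not reproduce the output of `Match()` exactly.

```r
# ATT by M-nearest-neighbor matching on the propensity score, with replacement.
# y: outcome, tr: 0/1 treatment indicator, ps: propensity score.
att_nn <- function(y, tr, ps, M = 1) {
  treated  <- which(tr == 1)
  controls <- which(tr == 0)
  effects <- vapply(treated, function(i) {
    # Distance from treated unit i to every control unit.
    d <- abs(ps[controls] - ps[i])
    # Indices of the M closest controls (ties broken by position).
    nearest <- controls[order(d)[seq_len(M)]]
    # Unit-level effect: own outcome minus the mean matched outcome.
    y[i] - mean(y[nearest])
  }, numeric(1))
  mean(effects)
}

# Invented toy data: three treated units and four controls.
y  <- c(10, 12, 20, 7, 8, 9, 15)
tr <- c(1, 1, 1, 0, 0, 0, 0)
ps <- c(0.30, 0.52, 0.80, 0.28, 0.45, 0.55, 0.90)

print(att_nn(y, tr, ps, M = 1))
print(att_nn(y, tr, ps, M = 2))
```

Because controls are reused, the same control unit can serve as the match for several treated units. As far as I understand, `Match()` by default also keeps all tied controls and weights them, which would explain why the unweighted number of matched observations in the output exceeds the number of treated units times M.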
Check balance and common support
Comparing the means
I check balance for the case of 1:5 nearest-neighbor matching.
library("MatchIt")
match1 <- matchit(CCPmember ~ age + race + edu + height + weight + english +
mandarin + faEdu + faCCPmember, data = data.nomissing,
method = "nearest", ratio=5, replace = TRUE)
summary(match1)
##
## Call:
## matchit(formula = CCPmember ~ age + race + edu + height + weight +
## english + mandarin + faEdu + faCCPmember, data = data.nomissing,
## method = "nearest", ratio = 5, replace = TRUE)
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.3344 0.1107 0.1305 0.2237 0.2195
## age 52.0722 46.9177 14.5738 5.1546 5.0000
## race1 0.9245 0.9035 0.2953 0.0210 0.0000
## race2 0.0057 0.0031 0.0560 0.0026 0.0000
## race3 0.0074 0.0085 0.0916 -0.0011 0.0000
## race4 0.0164 0.0216 0.1453 -0.0052 0.0000
## race5 0.0074 0.0035 0.0595 0.0038 0.0000
## race6 0.0066 0.0112 0.1052 -0.0046 0.0000
## race7 0.0074 0.0089 0.0938 -0.0015 0.0000
## race8 0.0246 0.0397 0.1953 -0.0151 0.0000
## edu.L 0.0096 -0.3085 0.3193 0.3181 0.3162
## edu.Q -0.2166 -0.0077 0.4390 -0.2089 0.0000
## edu.C -0.1581 0.0215 0.4600 -0.1796 0.0000
## edu^4 -0.0762 -0.0312 0.4441 -0.0451 0.0000
## height 167.7767 163.9440 7.8356 3.8327 4.0000
## weight 131.5189 121.5943 22.2328 9.9245 10.0000
## english.L -0.4455 -0.5328 0.2080 0.0872 0.0000
## english.Q 0.1791 0.3397 0.3666 -0.1606 0.0000
## english.C -0.0436 -0.1521 0.3487 0.1085 0.0000
## english^4 0.0400 0.0529 0.2750 -0.0129 0.0000
## mandarin.L 0.1643 -0.0021 0.3916 0.1665 0.3162
## mandarin.Q -0.1420 -0.1247 0.4310 -0.0173 0.0000
## mandarin.C -0.0314 0.0008 0.4315 -0.0322 0.0000
## mandarin^4 0.0984 0.0758 0.5035 0.0226 0.0000
## faEdu.L -0.4577 -0.5092 0.2395 0.0515 0.0000
## faEdu.Q 0.2530 0.3117 0.3907 -0.0587 0.0000
## faEdu.C -0.1607 -0.1638 0.3395 0.0031 0.0000
## faEdu^4 0.0425 0.0686 0.2947 -0.0261 0.0000
## faCCPmember1 0.2537 0.1509 0.3580 0.1028 0.0000
## eQQ Mean eQQ Max
## distance 0.2235 0.4163
## age 5.1667 8.0000
## race1 0.0213 1.0000
## race2 0.0025 1.0000
## race3 0.0016 1.0000
## race4 0.0057 1.0000
## race5 0.0033 1.0000
## race6 0.0049 1.0000
## race7 0.0016 1.0000
## race8 0.0156 1.0000
## edu.L 0.3180 0.6325
## edu.Q 0.2087 0.8018
## edu.C 0.1797 0.6325
## edu^4 0.1393 0.5976
## height 3.8539 19.0000
## weight 9.9639 26.0000
## english.L 0.0867 0.3162
## english.Q 0.1606 0.8018
## english.C 0.1132 0.9487
## english^4 0.1040 0.5976
## mandarin.L 0.1664 0.3162
## mandarin.Q 0.0173 0.8018
## mandarin.C 0.0602 0.3162
## mandarin^4 0.0226 0.5976
## faEdu.L 0.0519 0.3162
## faEdu.Q 0.0586 0.8018
## faEdu.C 0.0296 0.6325
## faEdu^4 0.0407 0.5976
## faCCPmember1 0.1026 1.0000
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.3344 0.3341 0.2257 0.0003 0.1154
## age 52.0722 51.7644 15.6361 0.3079 2.0000
## race1 0.9245 0.9305 0.2543 -0.0061 0.0000
## race2 0.0057 0.0043 0.0652 0.0015 0.0000
## race3 0.0074 0.0064 0.0798 0.0010 0.0000
## race4 0.0164 0.0176 0.1314 -0.0011 0.0000
## race5 0.0074 0.0038 0.0614 0.0036 0.0000
## race6 0.0066 0.0067 0.0818 -0.0002 0.0000
## race7 0.0074 0.0094 0.0963 -0.0020 0.0000
## race8 0.0246 0.0213 0.1446 0.0033 0.0000
## edu.L 0.0096 0.0138 0.3395 -0.0042 0.0000
## edu.Q -0.2166 -0.2261 0.3301 0.0096 0.0000
## edu.C -0.1581 -0.1501 0.4757 -0.0080 0.0000
## edu^4 -0.0762 -0.0685 0.5208 -0.0078 0.0000
## height 167.7767 168.1281 7.7595 -0.3514 1.0000
## weight 131.5189 132.5445 24.6256 -1.0256 4.0000
## english.L -0.4455 -0.4410 0.2695 -0.0045 0.0000
## english.Q 0.1791 0.1794 0.4444 -0.0003 0.0000
## english.C -0.0436 -0.0423 0.4041 -0.0014 0.0000
## english^4 0.0400 0.0442 0.3693 -0.0041 0.0000
## mandarin.L 0.1643 0.1712 0.3494 -0.0069 0.0000
## mandarin.Q -0.1420 -0.1301 0.4340 -0.0118 0.0000
## mandarin.C -0.0314 -0.0236 0.4221 -0.0078 0.0000
## mandarin^4 0.0984 0.0983 0.5052 0.0001 0.0000
## faEdu.L -0.4577 -0.4421 0.3040 -0.0156 0.0000
## faEdu.Q 0.2530 0.2348 0.4238 0.0182 0.0000
## faEdu.C -0.1607 -0.1594 0.3687 -0.0013 0.0000
## faEdu^4 0.0425 0.0379 0.3388 0.0046 0.0000
## faCCPmember1 0.2537 0.2542 0.4355 -0.0005 0.0000
## eQQ Mean eQQ Max
## distance 0.1267 0.2538
## age 2.0082 6.0000
## race1 0.0025 1.0000
## race2 0.0016 1.0000
## race3 0.0033 1.0000
## race4 0.0000 0.0000
## race5 0.0033 1.0000
## race6 0.0025 1.0000
## race7 0.0008 1.0000
## race8 0.0041 1.0000
## edu.L 0.1337 0.3162
## edu.Q 0.0391 0.8018
## edu.C 0.1355 0.6325
## edu^4 0.0569 0.5976
## height 1.3662 4.0000
## weight 4.0599 26.0000
## english.L 0.0382 0.3162
## english.Q 0.0709 0.8018
## english.C 0.0467 0.6325
## english^4 0.0456 0.5976
## mandarin.L 0.0576 0.3162
## mandarin.Q 0.0103 0.8018
## mandarin.C 0.0223 0.3162
## mandarin^4 0.0064 0.5976
## faEdu.L 0.0184 0.3162
## faEdu.Q 0.0160 0.8018
## faEdu.C 0.0091 0.6325
## faEdu^4 0.0177 0.5976
## faCCPmember1 0.0378 1.0000
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 99.8549 47.4178 43.3413 39.0198
## age 94.0270 60.0000 61.1314 25.0000
## race1 71.0665 0.0000 88.4615 0.0000
## race2 43.3078 0.0000 33.3333 0.0000
## race3 8.4500 0.0000 -100.0000 0.0000
## race4 77.6921 0.0000 100.0000 100.0000
## race5 5.9052 0.0000 0.0000 0.0000
## race6 96.4519 0.0000 50.0000 0.0000
## race7 -32.6211 0.0000 50.0000 0.0000
## race8 78.2538 0.0000 73.6842 0.0000
## edu.L 98.6943 100.0000 57.9592 50.0000
## edu.Q 95.4205 100.0000 81.2829 0.0000
## edu.C 95.5180 0.0000 24.5665 0.0000
## edu^4 82.7960 100.0000 59.1549 0.0000
## height 90.8316 75.0000 64.5505 78.9474
## weight 89.6659 60.0000 59.2535 0.0000
## english.L 94.8215 0.0000 55.9880 0.0000
## english.Q 99.8088 0.0000 55.8743 0.0000
## english.C 98.7551 0.0000 58.7156 33.3333
## english^4 68.0573 0.0000 56.1321 0.0000
## mandarin.L 95.8824 100.0000 65.3666 0.0000
## mandarin.Q 31.4235 0.0000 40.5063 0.0000
## mandarin.C 75.6439 0.0000 62.9310 0.0000
## mandarin^4 99.5665 0.0000 71.7391 0.0000
## faEdu.L 69.7285 0.0000 64.5000 0.0000
## faEdu.Q 69.0340 0.0000 72.6592 0.0000
## faEdu.C 58.1658 0.0000 69.2982 0.0000
## faEdu^4 82.3399 0.0000 56.6265 0.0000
## faCCPmember1 99.5209 0.0000 63.2000 0.0000
##
## Sample sizes:
## Control Treated
## All 7324 1218
## Matched 2653 1218
## Unmatched 4671 0
## Discarded 0 0
library("cobalt")
##
## Attaching package: 'cobalt'
## The following object is masked from 'package:MatchIt':
##
## lalonde
bal.tab(match1)
## Call
## matchit(formula = CCPmember ~ age + race + edu + height + weight +
## english + mandarin + faEdu + faCCPmember, data = data.nomissing,
## method = "nearest", ratio = 5, replace = TRUE)
##
## Balance Measures
## Type Diff.Adj
## distance Distance 0.0014
## age Contin. 0.0202
## race_1 Binary -0.0061
## race_2 Binary 0.0015
## race_3 Binary 0.0010
## race_4 Binary -0.0011
## race_5 Binary 0.0036
## race_6 Binary -0.0002
## race_7 Binary -0.0020
## race_8 Binary 0.0033
## edu_1 Binary 0.0094
## edu_2 Binary -0.0026
## edu_3 Binary -0.0107
## edu_4 Binary 0.0049
## edu_5 Binary -0.0010
## height Contin. -0.0498
## weight Contin. -0.0464
## english_1 Binary 0.0026
## english_2 Binary 0.0026
## english_3 Binary -0.0028
## english_4 Binary 0.0015
## english_5 Binary -0.0039
## mandarin_1 Binary 0.0005
## mandarin_2 Binary 0.0003
## mandarin_3 Binary 0.0064
## mandarin_4 Binary 0.0059
## mandarin_5 Binary -0.0131
## faEdu_1 Binary 0.0205
## faEdu_2 Binary -0.0030
## faEdu_3 Binary -0.0064
## faEdu_4 Binary -0.0112
## faEdu_5 Binary 0.0000
## faCCPmember Binary -0.0005
##
## Sample sizes
## Control Treated
## All 7324 1218
## Matched 2653 1218
## Unmatched 4671 0
love.plot(match1, abs = FALSE)

The plot shows that in the unadjusted sample (before matching), the standardized mean differences are large for several covariates (although none exceeds 1.96). In the adjusted sample (after matching), the standardized mean differences are small: every absolute value is below 0.1, a commonly used rule of thumb for adequate balance.
For a more formal check, I run the command MatchBalance.
# I tried nboots = 5000, but my laptop could barely run it, so I use 1000 bootstrap replications.
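The standardized mean difference for a continuous covariate can be sketched as follows. The function `smd_att`, the toy vectors, and the weights are invented for illustration. Standardizing by the treated-group standard deviation is my understanding of the usual convention for the ATT, and binary covariates are typically reported as raw differences in proportions instead.

```r
# Standardized mean difference for the ATT:
# (weighted treated mean - weighted control mean) / SD in the treated group.
smd_att <- function(x, tr, w = rep(1, length(x))) {
  m_t <- weighted.mean(x[tr == 1], w[tr == 1])
  m_c <- weighted.mean(x[tr == 0], w[tr == 0])
  (m_t - m_c) / sd(x[tr == 1])
}

# Invented covariate for three treated and four control units.
x  <- c(50, 54, 58, 40, 44, 52, 56)
tr <- c(1, 1, 1, 0, 0, 0, 0)

# Before matching: every unit has weight 1.
smd_before <- smd_att(x, tr)

# After matching: hypothetical weights, where a weight of 0 marks an
# unmatched control and a weight of 2 a control that was matched twice.
w <- c(1, 1, 1, 0, 1, 2, 2)
smd_after <- smd_att(x, tr, w)

print(c(before = smd_before, after = smd_after))
```

Keeping the denominator fixed at the unweighted treated-group standard deviation makes the before and after values directly comparable.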
balance1 <- MatchBalance(CCPmember ~ age + race + edu + height + weight + english +
mandarin + faEdu + faCCPmember,
match.out = psm1, nboots = 1000, data = data.nomissing)
##
## ***** (V1) age *****
## Before Matching After Matching
## mean treatment........ 52.072 52.072
## mean control.......... 46.918 51.564
## std mean diff......... 33.796 3.3349
##
## mean raw eQQ diff..... 5.1667 1.0741
## med raw eQQ diff..... 5 1
## max raw eQQ diff..... 8 9
##
## mean eCDF diff........ 0.067003 0.013934
## med eCDF diff........ 0.060643 0.0087891
## max eCDF diff........ 0.15073 0.059631
##
## var ratio (Tr/Co)..... 1.0952 0.97383
## T-test p-value........ < 2.22e-16 0.39098
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... < 2.22e-16 < 2.22e-16
## KS Statistic.......... 0.15073 0.059631
##
##
## ***** (V2) race2 *****
## Before Matching After Matching
## mean treatment........ 0.0057471 0.0057471
## mean control.......... 0.0031404 0.0040935
## std mean diff......... 3.4471 2.1867
##
## mean raw eQQ diff..... 0.0024631 0.0029748
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.0013034 0.0014874
## med eCDF diff........ 0.0013034 0.0014874
## max eCDF diff........ 0.0026068 0.0029748
##
## var ratio (Tr/Co)..... 1.8265 1.4016
## T-test p-value........ 0.24962 0.56078
##
##
## ***** (V3) race3 *****
## Before Matching After Matching
## mean treatment........ 0.0073892 0.0073892
## mean control.......... 0.0084653 0.009978
## std mean diff......... -1.2561 -3.0216
##
## mean raw eQQ diff..... 0.001642 0.0025015
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.00053808 0.0012508
## med eCDF diff........ 0.00053808 0.0012508
## max eCDF diff........ 0.0010762 0.0025015
##
## var ratio (Tr/Co)..... 0.87442 0.74248
## T-test p-value........ 0.68787 0.49302
##
##
## ***** (V4) race4 *****
## Before Matching After Matching
## mean treatment........ 0.01642 0.01642
## mean control.......... 0.021573 0.018772
## std mean diff......... -4.0527 -1.8495
##
## mean raw eQQ diff..... 0.0057471 0.0068961
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.0025763 0.003448
## med eCDF diff........ 0.0025763 0.003448
## max eCDF diff........ 0.0051525 0.0068961
##
## var ratio (Tr/Co)..... 0.76569 0.87683
## T-test p-value........ 0.20001 0.65831
##
##
## ***** (V5) race5 *****
## Before Matching After Matching
## mean treatment........ 0.0073892 0.0073892
## mean control.......... 0.00355 0.0028792
## std mean diff......... 4.481 5.2638
##
## mean raw eQQ diff..... 0.0032841 0.0016902
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.0019196 0.00084511
## med eCDF diff........ 0.0019196 0.00084511
## max eCDF diff........ 0.0038392 0.0016902
##
## var ratio (Tr/Co)..... 2.0749 2.5547
## T-test p-value........ 0.13262 0.12026
##
##
## ***** (V6) race6 *****
## Before Matching After Matching
## mean treatment........ 0.0065681 0.0065681
## mean control.......... 0.011196 0.0053493
## std mean diff......... -5.7269 1.5082
##
## mean raw eQQ diff..... 0.0049261 0.00047326
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.002314 0.00023663
## med eCDF diff........ 0.002314 0.00023663
## max eCDF diff........ 0.0046279 0.00047326
##
## var ratio (Tr/Co)..... 0.5898 1.2263
## T-test p-value........ 0.077679 0.69685
##
##
## ***** (V7) race7 *****
## Before Matching After Matching
## mean treatment........ 0.0073892 0.0073892
## mean control.......... 0.0088749 0.012637
## std mean diff......... -1.7341 -6.1248
##
## mean raw eQQ diff..... 0.001642 0.0060172
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.00074288 0.0030086
## med eCDF diff........ 0.00074288 0.0030086
## max eCDF diff........ 0.0014858 0.0060172
##
## var ratio (Tr/Co)..... 0.83441 0.58785
## T-test p-value........ 0.58058 0.19555
##
##
## ***** (V8) race8 *****
## Before Matching After Matching
## mean treatment........ 0.024631 0.024631
## mean control.......... 0.039732 0.021889
## std mean diff......... -9.7394 1.7677
##
## mean raw eQQ diff..... 0.015599 0.0086539
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.0075509 0.004327
## med eCDF diff........ 0.0075509 0.004327
## max eCDF diff........ 0.015102 0.0086539
##
## var ratio (Tr/Co)..... 0.63009 1.1221
## T-test p-value........ 0.0025328 0.65361
##
##
## ***** (V9) edu.L *****
## Before Matching After Matching
## mean treatment........ 0.0096063 0.0096063
## mean control.......... -0.30854 0.016892
## std mean diff......... 92.238 -2.1123
##
## mean raw eQQ diff..... 0.31805 0.0089795
## med raw eQQ diff..... 0.31623 0
## max raw eQQ diff..... 0.63246 0.31623
##
## mean eCDF diff........ 0.20121 0.0056791
## med eCDF diff........ 0.26148 0.0060172
## max eCDF diff........ 0.39828 0.012913
##
## var ratio (Tr/Co)..... 1.167 1.0028
## T-test p-value........ < 2.22e-16 0.46286
## KS Bootstrap p-value.. < 2.22e-16 0.04
## KS Naive p-value...... < 2.22e-16 0.16967
## KS Statistic.......... 0.39828 0.012913
##
##
## ***** (V10) edu.Q *****
## Before Matching After Matching
## mean treatment........ -0.21657 -0.21657
## mean control.......... -0.0076631 -0.21696
## std mean diff......... -62.214 0.11508
##
## mean raw eQQ diff..... 0.20867 0.011709
## med raw eQQ diff..... 3.3307e-16 0
## max raw eQQ diff..... 0.80178 0.80178
##
## mean eCDF diff........ 0.17109 0.006051
## med eCDF diff........ 0.15673 0.0052397
## max eCDF diff........ 0.37091 0.013725
##
## var ratio (Tr/Co)..... 0.58515 1.0091
## T-test p-value........ < 2.22e-16 0.97554
## KS Bootstrap p-value.. < 2.22e-16 0.026
## KS Naive p-value...... < 2.22e-16 0.1233
## KS Statistic.......... 0.37091 0.013725
##
##
## ***** (V11) edu.C *****
## Before Matching After Matching
## mean treatment........ -0.15811 -0.15811
## mean control.......... 0.021459 -0.15879
## std mean diff......... -37.8 0.14197
##
## mean raw eQQ diff..... 0.17966 0.02076
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.63246 0.63246
##
## mean eCDF diff........ 0.11357 0.01313
## med eCDF diff........ 0.10943 0.018119
## max eCDF diff........ 0.29156 0.021567
##
## var ratio (Tr/Co)..... 1.0667 0.99459
## T-test p-value........ < 2.22e-16 0.96873
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... < 2.22e-16 0.0020564
## KS Statistic.......... 0.29156 0.021567
##
##
## ***** (V12) edu^4 *****
## Before Matching After Matching
## mean treatment........ -0.076247 -0.076247
## mean control.......... -0.031186 -0.080147
## std mean diff......... -8.7472 0.75704
##
## mean raw eQQ diff..... 0.13935 0.011354
## med raw eQQ diff..... 9.992e-16 0
## max raw eQQ diff..... 0.59761 0.59761
##
## mean eCDF diff........ 0.12648 0.0055304
## med eCDF diff........ 0.10673 0.003448
## max eCDF diff........ 0.29156 0.016361
##
## var ratio (Tr/Co)..... 1.3458 1.004
## T-test p-value........ 0.0040318 0.84836
## KS Bootstrap p-value.. < 2.22e-16 0.007
## KS Naive p-value...... < 2.22e-16 0.038148
## KS Statistic.......... 0.29156 0.016361
##
##
## ***** (V13) height *****
## Before Matching After Matching
## mean treatment........ 167.78 167.78
## mean control.......... 163.94 168.3
## std mean diff......... 54.349 -7.4255
##
## mean raw eQQ diff..... 3.8539 0.8184
## med raw eQQ diff..... 4 1
## max raw eQQ diff..... 19 4
##
## mean eCDF diff........ 0.068411 0.015671
## med eCDF diff........ 0.029911 0.0093638
## max eCDF diff........ 0.23164 0.062673
##
## var ratio (Tr/Co)..... 0.80998 0.83895
## T-test p-value........ < 2.22e-16 0.065974
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... < 2.22e-16 < 2.22e-16
## KS Statistic.......... 0.23164 0.062673
##
##
## ***** (V14) weight *****
## Before Matching After Matching
## mean treatment........ 131.52 131.52
## mean control.......... 121.59 133.38
## std mean diff......... 44.862 -8.4079
##
## mean raw eQQ diff..... 9.9639 1.0723
## med raw eQQ diff..... 10 0
## max raw eQQ diff..... 26 30
##
## mean eCDF diff........ 0.078543 0.0080914
## med eCDF diff........ 0.054683 0.0076736
## max eCDF diff........ 0.20151 0.029207
##
## var ratio (Tr/Co)..... 0.99007 0.76899
## T-test p-value........ < 2.22e-16 0.044873
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... < 2.22e-16 6.6277e-06
## KS Statistic.......... 0.20151 0.029207
##
##
## ***** (V15) english.L *****
## Before Matching After Matching
## mean treatment........ -0.44552 -0.44552
## mean control.......... -0.53276 -0.43368
## std mean diff......... 33.317 -4.5225
##
## mean raw eQQ diff..... 0.086716 0.010391
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.31623 0.31623
##
## mean eCDF diff........ 0.055174 0.0065716
## med eCDF diff........ 0.010158 0.006558
## max eCDF diff........ 0.17688 0.014874
##
## var ratio (Tr/Co)..... 1.5843 0.86419
## T-test p-value........ < 2.22e-16 0.26526
## KS Bootstrap p-value.. < 2.22e-16 0.003
## KS Naive p-value...... < 2.22e-16 0.075837
## KS Statistic.......... 0.17688 0.014874
##
##
## ***** (V16) english.Q *****
## Before Matching After Matching
## mean treatment........ 0.17905 0.17905
## mean control.......... 0.3397 0.18
## std mean diff......... -36.211 -0.2136
##
## mean raw eQQ diff..... 0.16062 0.0094683
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.80178 0.80178
##
## mean eCDF diff........ 0.084037 0.0042762
## med eCDF diff........ 0.080664 0.0028734
## max eCDF diff........ 0.17482 0.011358
##
## var ratio (Tr/Co)..... 1.4646 0.99311
## T-test p-value........ < 2.22e-16 0.95596
## KS Bootstrap p-value.. < 2.22e-16 0.028
## KS Naive p-value...... < 2.22e-16 0.29575
## KS Statistic.......... 0.17482 0.011358
##
##
## ***** (V17) english.C *****
## Before Matching After Matching
## mean treatment........ -0.043618 -0.043618
## mean control.......... -0.15207 -0.049119
## std mean diff......... 26.72 1.3554
##
## mean raw eQQ diff..... 0.1132 0.01022
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.94868 0.63246
##
## mean eCDF diff........ 0.07183 0.0064634
## med eCDF diff........ 0.090107 0.0069637
## max eCDF diff........ 0.16878 0.011832
##
## var ratio (Tr/Co)..... 1.3548 1.0211
## T-test p-value........ < 2.22e-16 0.72978
## KS Bootstrap p-value.. < 2.22e-16 0.02
## KS Naive p-value...... < 2.22e-16 0.25174
## KS Statistic.......... 0.16878 0.011832
##
##
## ***** (V18) english^4 *****
## Before Matching After Matching
## mean treatment........ 0.040037 0.040037
## mean control.......... 0.05294 0.047825
## std mean diff......... -3.4949 -2.1094
##
## mean raw eQQ diff..... 0.10402 0.0067879
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.59761 0.59761
##
## mean eCDF diff........ 0.052319 0.0038537
## med eCDF diff........ 0.076615 0.0030424
## max eCDF diff........ 0.098205 0.010006
##
## var ratio (Tr/Co)..... 1.8018 0.99889
## T-test p-value........ 0.24338 0.60318
## KS Bootstrap p-value.. < 2.22e-16 0.025
## KS Naive p-value...... 3.571e-09 0.44952
## KS Statistic.......... 0.098205 0.010006
##
##
## ***** (V19) mandarin.L *****
## Before Matching After Matching
## mean treatment........ 0.16434 0.16434
## mean control.......... -0.0021157 0.16801
## std mean diff......... 48.059 -1.0594
##
## mean raw eQQ diff..... 0.16642 0.016591
## med raw eQQ diff..... 0.31623 0
## max raw eQQ diff..... 0.31623 0.31623
##
## mean eCDF diff........ 0.10528 0.010493
## med eCDF diff........ 0.10163 0.0080454
## max eCDF diff........ 0.18083 0.021973
##
## var ratio (Tr/Co)..... 0.78226 0.97584
## T-test p-value........ < 2.22e-16 0.7829
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... < 2.22e-16 0.0015837
## KS Statistic.......... 0.18083 0.021973
##
##
## ***** (V20) mandarin.Q *****
## Before Matching After Matching
## mean treatment........ -0.14197 -0.14197
## mean control.......... -0.12469 -0.13078
## std mean diff......... -4.0292 -2.6096
##
## mean raw eQQ diff..... 0.017335 0.02938
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.80178 0.80178
##
## mean eCDF diff........ 0.032699 0.018728
## med eCDF diff........ 0.019265 0.023088
## max eCDF diff........ 0.092265 0.028734
##
## var ratio (Tr/Co)..... 0.99019 0.98612
## T-test p-value........ 0.19339 0.52265
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... 3.7946e-08 9.9427e-06
## KS Statistic.......... 0.092265 0.028734
##
##
## ***** (V21) mandarin.C *****
## Before Matching After Matching
## mean treatment........ -0.031415 -0.031415
## mean control.......... 0.00077718 -0.0287
## std mean diff......... -7.608 -0.64154
##
## mean raw eQQ diff..... 0.060234 0.0098988
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.31623 0.31623
##
## mean eCDF diff........ 0.03804 0.0062606
## med eCDF diff........ 0.034834 0.0062876
## max eCDF diff........ 0.079205 0.014401
##
## var ratio (Tr/Co)..... 0.96168 0.98881
## T-test p-value........ 0.014322 0.87536
## KS Bootstrap p-value.. < 2.22e-16 0.023
## KS Naive p-value...... 4.0785e-06 0.09308
## KS Statistic.......... 0.079205 0.014401
##
##
## ***** (V22) mandarin^4 *****
## Before Matching After Matching
## mean treatment........ 0.098425 0.098425
## mean control.......... 0.075787 0.089898
## std mean diff......... 4.4409 1.6728
##
## mean raw eQQ diff..... 0.02257 0.017172
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.59761 0.59761
##
## mean eCDF diff........ 0.043743 0.0066121
## med eCDF diff........ 0.02547 0.0030424
## max eCDF diff........ 0.11404 0.023731
##
## var ratio (Tr/Co)..... 1.0249 1.0156
## T-test p-value........ 0.15074 0.68116
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... 3.1947e-12 0.00048261
## KS Statistic.......... 0.11404 0.023731
##
##
## ***** (V23) faEdu.L *****
## Before Matching After Matching
## mean treatment........ -0.45773 -0.45773
## mean control.......... -0.50919 -0.43605
## std mean diff......... 17.628 -7.4251
##
## mean raw eQQ diff..... 0.051926 0.02743
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.31623 0.31623
##
## mean eCDF diff........ 0.032928 0.021685
## med eCDF diff........ 0.041521 0.022717
## max eCDF diff........ 0.068011 0.041309
##
## var ratio (Tr/Co)..... 1.4859 0.87018
## T-test p-value........ 6.6079e-09 0.070234
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... 0.00012745 2.1855e-11
## KS Statistic.......... 0.068011 0.041309
##
##
## ***** (V24) faEdu.Q *****
## Before Matching After Matching
## mean treatment........ 0.253 0.253
## mean control.......... 0.31167 0.23517
## std mean diff......... -14.073 4.2756
##
## mean raw eQQ diff..... 0.058587 0.043023
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.80178 0.80178
##
## mean eCDF diff........ 0.034177 0.0299
## med eCDF diff........ 0.033872 0.039145
## max eCDF diff........ 0.068966 0.041309
##
## var ratio (Tr/Co)..... 1.1388 0.98003
## T-test p-value........ 4.8289e-06 0.28311
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... 9.6957e-05 2.1855e-11
## KS Statistic.......... 0.068966 0.041309
##
##
## ***** (V25) faEdu.C *****
## Before Matching After Matching
## mean treatment........ -0.16071 -0.16071
## mean control.......... -0.16381 -0.16817
## std mean diff......... 0.8541 2.0537
##
## mean raw eQQ diff..... 0.029598 0.013106
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.63246 0.63246
##
## mean eCDF diff........ 0.018953 0.010344
## med eCDF diff........ 0.013856 0.0021297
## max eCDF diff........ 0.042477 0.037117
##
## var ratio (Tr/Co)..... 1.1452 0.96407
## T-test p-value........ 0.78063 0.6175
## KS Bootstrap p-value.. 0.002 < 2.22e-16
## KS Naive p-value...... 0.046169 2.8266e-09
## KS Statistic.......... 0.042477 0.037117
##
##
## ***** (V26) faEdu^4 *****
## Before Matching After Matching
## mean treatment........ 0.04249 0.04249
## mean control.......... 0.068607 0.027254
## std mean diff......... -7.9389 4.6316
##
## mean raw eQQ diff..... 0.040724 0.024687
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 0.59761 0.59761
##
## mean eCDF diff........ 0.024624 0.011375
## med eCDF diff........ 0.012633 0.0042255
## max eCDF diff........ 0.056333 0.03705
##
## var ratio (Tr/Co)..... 1.246 0.95137
## T-test p-value........ 0.0093446 0.25289
## KS Bootstrap p-value.. < 2.22e-16 < 2.22e-16
## KS Naive p-value...... 0.002645 3.0443e-09
## KS Statistic.......... 0.056333 0.03705
##
##
## ***** (V27) faCCPmember1 *****
## Before Matching After Matching
## mean treatment........ 0.25369 0.25369
## mean control.......... 0.15087 0.236
## std mean diff......... 23.62 4.066
##
## mean raw eQQ diff..... 0.10263 0.029477
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.05141 0.014739
## med eCDF diff........ 0.05141 0.014739
## max eCDF diff........ 0.10282 0.029477
##
## var ratio (Tr/Co)..... 1.4789 1.0501
## T-test p-value........ 1.0214e-14 0.29984
##
##
## Before Matching Minimum p.value: < 2.22e-16
## Variable Name(s): age edu.L edu.Q edu.C edu^4 height weight english.L english.Q english.C english^4 mandarin.L mandarin.Q mandarin.C mandarin^4 faEdu.L faEdu.Q faEdu^4 Number(s): 1 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 26
##
## After Matching Minimum p.value: < 2.22e-16
## Variable Name(s): age edu.C height weight mandarin.L mandarin.Q mandarin^4 faEdu.L faEdu.Q faEdu.C faEdu^4 Number(s): 1 11 13 14 19 20 22 23 24 25 26
The result shows that although the t-tests are insignificant for most variables after matching, the KS tests still indicate significant differences between the treatment group and the control group. To locate the problem, I examine the variables one by one.
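To illustrate why the two tests can disagree, here is a minimal sketch on simulated data (not the CGSS sample). The two groups are drawn with the same mean but a different spread, so a t-test, which compares means only, will typically not reject, while the KS test, which compares whole distributions, does. The variable names, sample sizes, and standard deviations are assumptions chosen for illustration.

```r
set.seed(42)

# Simulated covariate (hypothetical): same mean in both groups, different spread
treated <- rnorm(2000, mean = 50, sd = 10)
control <- rnorm(2000, mean = 50, sd = 20)

t_p  <- t.test(treated, control)$p.value   # Welch t-test: compares means only
ks_p <- ks.test(treated, control)$p.value  # KS test: compares whole distributions

print(c(t_test = t_p, ks_test = ks_p))
```

This is why a covariate can pass the t-test in the balance table and still show a clear mismatch in its distribution plot.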
Comparing the distributions
bal.plot(match1, var.name = "age")
bal.plot(match1, var.name = "race")
bal.plot(match1, var.name = "edu")
bal.plot(match1, var.name = "height")
bal.plot(match1, var.name = "weight")
bal.plot(match1, var.name = "faEdu")
bal.plot(match1, var.name = "faCCPmember")
bal.plot(match1, var.name = "english")
bal.plot(match1, var.name = "mandarin")









The figures show that the main problem lies in the continuous variables: age, height, and weight. In addition, for age the two groups do not share a common support.
library("psych")
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
describeBy(age, group = CCPmember)
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 7324 46.92 14.57 46 46.44 14.83 18 96 78 0.29 -0.46
## se
## X1 0.17
## --------------------------------------------------------
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1218 52.07 15.25 52 51.89 17.79 21 90 69 0.1 -0.85
## se
## X1 0.44
The age range is 21 to 90 for the treatment group and 18 to 96 for the control group (presumably because people under 20 cannot be CCP members).
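A minimal sketch of how the overlapping age range could be derived from the data instead of being read off the table. The data frame `toy` and its values are hypothetical; only the group minima and maxima mirror the ranges reported above.

```r
# Toy data (hypothetical values, not the CGSS sample)
toy <- data.frame(
  age       = c(18, 25, 40, 96, 21, 52, 90),
  CCPmember = c(0,  0,  0,  0,  1,  1,  1)
)

# Age values split by treatment status
by_group <- split(toy$age, toy$CCPmember)

# Common support: the interval covered by both groups
lower <- max(min(by_group[["0"]]), min(by_group[["1"]]))
upper <- min(max(by_group[["0"]]), max(by_group[["1"]]))

# Keep only observations inside the overlap
toy_cs <- subset(toy, age >= lower & age <= upper)
print(c(lower = lower, upper = upper))
```

Applied to the real sample, the same logic gives the 21 to 90 window.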
In addition, there is probably an interaction among education, age, and income.
scatterplot(lnhincome ~ age | edu , data = data.nomissing, smooth = FALSE)

The scatter plot shows that for people with higher education levels, income increases with age (the grey and orange lines), whereas for people with lower education levels, income decreases with age (the blue and pink lines).
Necessary adjustment
Therefore, I am going to:
1. Drop observations with age above 90 or below 21.
2. Include age squared and BMI, computed as (weight/2)/(height/100)^2. (Note: according to the questionnaire, weight is recorded in units of 500 g rather than kg, and height is recorded in cm.)
3. Add an interaction term between education and age.
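As a quick check of the unit conversion in step 2, here is a worked example for a hypothetical respondent who reports a weight of 120 (in 500 g units, i.e. 60 kg) and a height of 164 cm. The helper name `bmi_from_survey` is an assumption introduced for illustration only.

```r
# Hypothetical helper: BMI from survey units (weight in 500 g units, height in cm)
bmi_from_survey <- function(weight_500g, height_cm) {
  weight_kg <- weight_500g / 2    # 500 g units -> kg
  height_m  <- height_cm / 100    # cm -> m
  weight_kg / height_m^2
}

bmi_example <- bmi_from_survey(weight_500g = 120, height_cm = 164)
print(round(bmi_example, 2))
```

The value lands in the low 20s, which is consistent with the group means of BMI reported in the balance tables.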
data.nomissing <- subset(data.nomissing, age>=21 & age<=90)
detach(data.nomissing)
attach(data.nomissing)
## The following objects are masked _by_ .GlobalEnv:
##
## lnhincome, pscore
data.nomissing$age.squared <- age^2
data.nomissing$BMI <- (weight/2)/((height/100)^2)
Now I repeat the whole procedure, this time using 1:5 nearest-neighbor matching.
match3 <- matchit(CCPmember ~ age + age.squared + race + edu + age:edu + height + weight +
BMI +english + mandarin + faEdu + faCCPmember, data = data.nomissing,
method = "nearest", ratio=5, replace = TRUE)
summary(match3)
##
## Call:
## matchit(formula = CCPmember ~ age + age.squared + race + edu +
## age:edu + height + weight + BMI + english + mandarin + faEdu +
## faCCPmember, data = data.nomissing, method = "nearest", ratio = 5,
## replace = TRUE)
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.3352 0.1122 0.1316 0.2230 0.2326
## age 52.0722 47.2894 14.2569 4.7829 5.0000
## age.squared 2943.9507 2439.5139 1433.2486 504.4369 485.0000
## race1 0.9245 0.9031 0.2958 0.0213 0.0000
## race2 0.0057 0.0030 0.0551 0.0027 0.0000
## race3 0.0074 0.0083 0.0908 -0.0009 0.0000
## race4 0.0164 0.0216 0.1454 -0.0052 0.0000
## race5 0.0074 0.0036 0.0599 0.0038 0.0000
## race6 0.0066 0.0112 0.1054 -0.0047 0.0000
## race7 0.0074 0.0090 0.0945 -0.0016 0.0000
## race8 0.0246 0.0400 0.1961 -0.0154 0.0000
## edu.L 0.0096 -0.3099 0.3201 0.3195 0.3162
## edu.Q -0.2166 -0.0040 0.4395 -0.2126 0.0000
## edu.C -0.1581 0.0181 0.4596 -0.1763 0.0000
## edu^4 -0.0762 -0.0310 0.4425 -0.0452 0.0000
## height 167.7767 163.9055 7.8070 3.8712 4.0000
## weight 131.5189 121.7589 22.2139 9.7600 10.0000
## BMI 23.3063 22.5816 3.2859 0.7247 0.7840
## english.L -0.4455 -0.5353 0.2057 0.0897 0.0000
## english.Q 0.1791 0.3443 0.3632 -0.1652 0.0000
## english.C -0.0436 -0.1554 0.3461 0.1118 0.0000
## english^4 0.0400 0.0537 0.2719 -0.0136 0.0000
## mandarin.L 0.1643 -0.0050 0.3917 0.1693 0.3162
## mandarin.Q -0.1420 -0.1244 0.4310 -0.0175 0.0000
## mandarin.C -0.0314 0.0010 0.4315 -0.0324 0.0000
## mandarin^4 0.0984 0.0757 0.5034 0.0227 0.0000
## faEdu.L -0.4577 -0.5108 0.2389 0.0531 0.0000
## faEdu.Q 0.2530 0.3154 0.3888 -0.0624 0.0000
## faEdu.C -0.1607 -0.1677 0.3358 0.0070 0.0000
## faEdu^4 0.0425 0.0708 0.2923 -0.0283 0.0000
## faCCPmember1 0.2537 0.1511 0.3581 0.1026 0.0000
## age:edu.L -2.0098 -16.3133 17.1550 14.3035 14.8627
## age:edu.Q -10.4594 1.8436 22.5601 -12.3030 6.4143
## age:edu.C -6.8342 0.2327 21.5258 -7.0670 7.5895
## age:edu^4 -2.6015 -0.5594 20.3800 -2.0421 6.2152
## eQQ Mean eQQ Max
## distance 0.2229 0.3732
## age 4.7808 8.0000
## age.squared 503.5148 973.0000
## race1 0.0213 1.0000
## race2 0.0025 1.0000
## race3 0.0008 1.0000
## race4 0.0057 1.0000
## race5 0.0033 1.0000
## race6 0.0049 1.0000
## race7 0.0016 1.0000
## race8 0.0156 1.0000
## edu.L 0.3191 0.6325
## edu.Q 0.2126 0.8018
## edu.C 0.1760 0.6325
## edu^4 0.1423 0.5976
## height 3.8916 19.0000
## weight 9.8046 26.0000
## BMI 0.7478 5.0821
## english.L 0.0893 0.3162
## english.Q 0.1650 0.8018
## english.C 0.1168 0.9487
## english^4 0.1070 0.5976
## mandarin.L 0.1693 0.3162
## mandarin.Q 0.0173 0.8018
## mandarin.C 0.0613 0.3162
## mandarin^4 0.0226 0.5976
## faEdu.L 0.0535 0.3162
## faEdu.Q 0.0619 0.8018
## faEdu.C 0.0335 0.6325
## faEdu^4 0.0437 0.5976
## faCCPmember1 0.1026 1.0000
## age:edu.L 14.3216 25.2982
## age:edu.Q 12.3197 37.4166
## age:edu.C 8.1944 24.0333
## age:edu^4 9.5542 24.0241
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.3352 0.3348 0.2122 0.0005 0.1221
## age 52.0722 51.5200 15.2625 0.5522 1.0000
## age.squared 2943.9507 2887.1696 1626.3585 56.7811 108.0000
## race1 0.9245 0.9353 0.2460 -0.0108 0.0000
## race2 0.0057 0.0053 0.0723 0.0005 0.0000
## race3 0.0074 0.0049 0.0700 0.0025 0.0000
## race4 0.0164 0.0143 0.1187 0.0021 0.0000
## race5 0.0074 0.0021 0.0462 0.0053 0.0000
## race6 0.0066 0.0066 0.0808 0.0000 0.0000
## race7 0.0074 0.0072 0.0847 0.0002 0.0000
## race8 0.0246 0.0243 0.1540 0.0003 0.0000
## edu.L 0.0096 0.0166 0.3496 -0.0070 0.0000
## edu.Q -0.2166 -0.2073 0.3456 -0.0093 0.0000
## edu.C -0.1581 -0.1707 0.4662 0.0126 0.0000
## edu^4 -0.0762 -0.0677 0.5140 -0.0085 0.0000
## height 167.7767 167.9844 7.5622 -0.2077 2.0000
## weight 131.5189 132.3353 23.0429 -0.8164 4.0000
## BMI 23.3063 23.3842 3.3359 -0.0778 0.3211
## english.L -0.4455 -0.4392 0.2664 -0.0063 0.0000
## english.Q 0.1791 0.1706 0.4476 0.0084 0.0000
## english.C -0.0436 -0.0446 0.4032 0.0010 0.0000
## english^4 0.0400 0.0482 0.3742 -0.0081 0.0000
## mandarin.L 0.1643 0.1692 0.3417 -0.0049 0.0000
## mandarin.Q -0.1420 -0.1461 0.4228 0.0041 0.0000
## mandarin.C -0.0314 -0.0477 0.4290 0.0163 0.0000
## mandarin^4 0.0984 0.0818 0.5116 0.0166 0.0000
## faEdu.L -0.4577 -0.4466 0.2973 -0.0111 0.0000
## faEdu.Q 0.2530 0.2347 0.4237 0.0183 0.0000
## faEdu.C -0.1607 -0.1502 0.3727 -0.0105 0.0000
## faEdu^4 0.0425 0.0376 0.3388 0.0049 0.0000
## faCCPmember1 0.2537 0.2557 0.4363 -0.0020 0.0000
## age:edu.L -2.0098 -1.7677 19.7204 -0.2421 6.3246
## age:edu.Q -10.4594 -9.8750 20.3538 -0.5844 1.3363
## age:edu.C -6.8342 -7.5014 24.2928 0.6672 5.0596
## age:edu^4 -2.6015 -1.9272 27.5066 -0.6743 2.8685
## eQQ Mean eQQ Max
## distance 0.1252 0.2322
## age 1.2660 3.0000
## age.squared 138.3103 393.0000
## race1 0.0057 1.0000
## race2 0.0025 1.0000
## race3 0.0000 0.0000
## race4 0.0033 1.0000
## race5 0.0041 1.0000
## race6 0.0016 1.0000
## race7 0.0008 1.0000
## race8 0.0074 1.0000
## edu.L 0.1485 0.3162
## edu.Q 0.0601 0.8018
## edu.C 0.1205 0.6325
## edu^4 0.0628 0.5976
## height 1.5813 5.0000
## weight 3.8883 26.0000
## BMI 0.3004 3.3302
## english.L 0.0421 0.3162
## english.Q 0.0812 0.8018
## english.C 0.0587 0.6325
## english^4 0.0525 0.5976
## mandarin.L 0.0644 0.3162
## mandarin.Q 0.0118 0.8018
## mandarin.C 0.0275 0.3162
## mandarin^4 0.0137 0.5976
## faEdu.L 0.0244 0.3162
## faEdu.Q 0.0250 0.8018
## faEdu.C 0.0132 0.6325
## faEdu^4 0.0177 0.5976
## faCCPmember1 0.0517 1.0000
## age:edu.L 8.0189 18.0250
## age:edu.Q 4.2650 37.4166
## age:edu.C 6.3290 25.9307
## age:edu^4 4.3500 20.3189
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 99.7893 47.4759 43.8360 37.7886
## age 88.4543 80.0000 73.5188 62.5000
## age.squared 88.7437 77.7320 72.5310 59.6095
## race1 49.2021 0.0000 73.0769 0.0000
## race2 81.7440 0.0000 0.0000 0.0000
## race3 -166.0767 0.0000 100.0000 100.0000
## race4 58.9354 0.0000 42.8571 0.0000
## race5 -38.7859 0.0000 -25.0000 0.0000
## race6 100.0000 0.0000 66.6667 0.0000
## race7 89.8552 0.0000 50.0000 0.0000
## race8 97.8702 0.0000 52.6316 0.0000
## edu.L 97.8224 100.0000 53.4581 50.0000
## edu.Q 95.6233 100.0000 71.7234 0.0000
## edu.C 92.8411 0.0000 31.5634 0.0000
## edu^4 81.1151 100.0000 55.8621 0.0000
## height 94.6343 50.0000 59.3671 73.6842
## weight 91.6350 60.0000 60.3417 0.0000
## BMI 89.2588 59.0428 59.8272 34.4726
## english.L 92.9402 0.0000 52.9070 0.0000
## english.Q 94.9010 0.0000 50.7979 0.0000
## english.C 99.1177 0.0000 49.7778 33.3333
## english^4 40.3107 0.0000 50.9174 0.0000
## mandarin.L 97.1176 100.0000 61.9632 0.0000
## mandarin.Q 76.4589 0.0000 31.6456 0.0000
## mandarin.C 49.8728 0.0000 55.0847 0.0000
## mandarin^4 26.9832 0.0000 39.1304 0.0000
## faEdu.L 79.0815 0.0000 54.3689 0.0000
## faEdu.Q 70.7658 0.0000 59.5745 0.0000
## faEdu.C -50.7656 0.0000 60.4651 0.0000
## faEdu^4 82.6841 0.0000 59.5506 0.0000
## faCCPmember1 98.0803 0.0000 49.6000 0.0000
## age:edu.L 98.3076 57.4468 44.0086 28.7500
## age:edu.Q 95.2501 79.1667 65.3807 0.0000
## age:edu.C 90.5590 33.3333 22.7647 -7.8947
## age:edu^4 66.9804 53.8462 54.4699 15.4229
##
## Sample sizes:
## Control Treated
## All 7216 1218
## Matched 2643 1218
## Unmatched 4573 0
## Discarded 0 0
library(cobalt)
bal.tab(match3)
## Call
## matchit(formula = CCPmember ~ age + age.squared + race + edu +
## age:edu + height + weight + BMI + english + mandarin + faEdu +
## faCCPmember, data = data.nomissing, method = "nearest", ratio = 5,
## replace = TRUE)
##
## Balance Measures
## Type Diff.Adj
## distance Distance 0.0022
## age Contin. 0.0362
## age.squared Contin. 0.0348
## race_1 Binary -0.0108
## race_2 Binary 0.0005
## race_3 Binary 0.0025
## race_4 Binary 0.0021
## race_5 Binary 0.0053
## race_6 Binary 0.0000
## race_7 Binary 0.0002
## race_8 Binary 0.0003
## edu_1 Binary -0.0056
## edu_2 Binary 0.0167
## edu_3 Binary -0.0011
## edu_4 Binary -0.0036
## edu_5 Binary -0.0064
## height Contin. -0.0295
## weight Contin. -0.0369
## BMI Contin. -0.0238
## english_1 Binary 0.0072
## english_2 Binary 0.0043
## english_3 Binary -0.0103
## english_4 Binary -0.0010
## english_5 Binary -0.0002
## mandarin_1 Binary 0.0021
## mandarin_2 Binary 0.0028
## mandarin_3 Binary 0.0097
## mandarin_4 Binary -0.0209
## mandarin_5 Binary 0.0062
## faEdu_1 Binary 0.0207
## faEdu_2 Binary -0.0103
## faEdu_3 Binary -0.0062
## faEdu_4 Binary -0.0041
## faEdu_5 Binary 0.0000
## faCCPmember Binary -0.0020
##
## Sample sizes
## Control Treated
## All 7216 1218
## Matched 2643 1218
## Unmatched 4573 0
love.plot(match3, abs = FALSE)
bal.plot(match3, var.name = "age")
bal.plot(match3, var.name = "age.squared")
bal.plot(match3, var.name = "height")
bal.plot(match3, var.name = "weight")
bal.plot(match3, var.name = "BMI")






Now the figures look much better. However, the age distributions of the two groups are still not similar; the main difference lies in the 45 to 55 age range. Therefore, I add two dummy variables indicating whether an individual is below 45 or above 55.
data.nomissing$below45 <- age < 45
data.nomissing$above55 <- age > 55
match4 <- matchit(CCPmember ~ age + age.squared + below45 + above55 + race + edu + age:edu +
height + weight + BMI +english + mandarin + faEdu + faCCPmember,
data = data.nomissing, method = "nearest", ratio=5, replace = TRUE)
summary(match4)
##
## Call:
## matchit(formula = CCPmember ~ age + age.squared + below45 + above55 +
## race + edu + age:edu + height + weight + BMI + english +
## mandarin + faEdu + faCCPmember, data = data.nomissing, method = "nearest",
## ratio = 5, replace = TRUE)
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.3365 0.1120 0.1323 0.2245 0.2353
## age 52.0722 47.2894 14.2569 4.7829 5.0000
## age.squared 2943.9507 2439.5139 1433.2486 504.4369 485.0000
## below45FALSE 0.6544 0.5443 0.4981 0.1100 0.0000
## below45TRUE 0.3456 0.4557 0.4981 -0.1100 0.0000
## above55TRUE 0.4343 0.2873 0.4525 0.1470 0.0000
## race2 0.0057 0.0030 0.0551 0.0027 0.0000
## race3 0.0074 0.0083 0.0908 -0.0009 0.0000
## race4 0.0164 0.0216 0.1454 -0.0052 0.0000
## race5 0.0074 0.0036 0.0599 0.0038 0.0000
## race6 0.0066 0.0112 0.1054 -0.0047 0.0000
## race7 0.0074 0.0090 0.0945 -0.0016 0.0000
## race8 0.0246 0.0400 0.1961 -0.0154 0.0000
## edu.L 0.0096 -0.3099 0.3201 0.3195 0.3162
## edu.Q -0.2166 -0.0040 0.4395 -0.2126 0.0000
## edu.C -0.1581 0.0181 0.4596 -0.1763 0.0000
## edu^4 -0.0762 -0.0310 0.4425 -0.0452 0.0000
## height 167.7767 163.9055 7.8070 3.8712 4.0000
## weight 131.5189 121.7589 22.2139 9.7600 10.0000
## BMI 23.3063 22.5816 3.2859 0.7247 0.7840
## english.L -0.4455 -0.5353 0.2057 0.0897 0.0000
## english.Q 0.1791 0.3443 0.3632 -0.1652 0.0000
## english.C -0.0436 -0.1554 0.3461 0.1118 0.0000
## english^4 0.0400 0.0537 0.2719 -0.0136 0.0000
## mandarin.L 0.1643 -0.0050 0.3917 0.1693 0.3162
## mandarin.Q -0.1420 -0.1244 0.4310 -0.0175 0.0000
## mandarin.C -0.0314 0.0010 0.4315 -0.0324 0.0000
## mandarin^4 0.0984 0.0757 0.5034 0.0227 0.0000
## faEdu.L -0.4577 -0.5108 0.2389 0.0531 0.0000
## faEdu.Q 0.2530 0.3154 0.3888 -0.0624 0.0000
## faEdu.C -0.1607 -0.1677 0.3358 0.0070 0.0000
## faEdu^4 0.0425 0.0708 0.2923 -0.0283 0.0000
## faCCPmember1 0.2537 0.1511 0.3581 0.1026 0.0000
## age:edu.L -2.0098 -16.3133 17.1550 14.3035 14.8627
## age:edu.Q -10.4594 1.8436 22.5601 -12.3030 6.4143
## age:edu.C -6.8342 0.2327 21.5258 -7.0670 7.5895
## age:edu^4 -2.6015 -0.5594 20.3800 -2.0421 6.2152
## eQQ Mean eQQ Max
## distance 0.2244 0.3748
## age 4.7808 8.0000
## age.squared 503.5148 973.0000
## below45FALSE 0.1100 1.0000
## below45TRUE 0.1100 1.0000
## above55TRUE 0.1470 1.0000
## race2 0.0025 1.0000
## race3 0.0008 1.0000
## race4 0.0057 1.0000
## race5 0.0033 1.0000
## race6 0.0049 1.0000
## race7 0.0016 1.0000
## race8 0.0156 1.0000
## edu.L 0.3191 0.6325
## edu.Q 0.2126 0.8018
## edu.C 0.1760 0.6325
## edu^4 0.1423 0.5976
## height 3.8916 19.0000
## weight 9.8046 26.0000
## BMI 0.7478 5.0821
## english.L 0.0893 0.3162
## english.Q 0.1650 0.8018
## english.C 0.1168 0.9487
## english^4 0.1070 0.5976
## mandarin.L 0.1693 0.3162
## mandarin.Q 0.0173 0.8018
## mandarin.C 0.0613 0.3162
## mandarin^4 0.0226 0.5976
## faEdu.L 0.0535 0.3162
## faEdu.Q 0.0619 0.8018
## faEdu.C 0.0335 0.6325
## faEdu^4 0.0437 0.5976
## faCCPmember1 0.1026 1.0000
## age:edu.L 14.3216 25.2982
## age:edu.Q 12.3197 37.4166
## age:edu.C 8.1944 24.0333
## age:edu^4 9.5542 24.0241
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.3365 0.3362 0.2127 0.0003 0.1322
## age 52.0722 51.8360 15.4260 0.2363 1.0000
## age.squared 2943.9507 2924.8379 1634.7174 19.1128 128.0000
## below45FALSE 0.6544 0.6373 0.4809 0.0171 0.0000
## below45TRUE 0.3456 0.3627 0.4809 -0.0171 0.0000
## above55TRUE 0.4343 0.4342 0.4957 0.0002 0.0000
## race2 0.0057 0.0061 0.0777 -0.0003 0.0000
## race3 0.0074 0.0039 0.0627 0.0034 0.0000
## race4 0.0164 0.0153 0.1227 0.0011 0.0000
## race5 0.0074 0.0039 0.0627 0.0034 0.0000
## race6 0.0066 0.0071 0.0837 -0.0005 0.0000
## race7 0.0074 0.0062 0.0788 0.0011 0.0000
## race8 0.0246 0.0220 0.1467 0.0026 0.0000
## edu.L 0.0096 0.0103 0.3476 -0.0007 0.0000
## edu.Q -0.2166 -0.2115 0.3381 -0.0051 0.0000
## edu.C -0.1581 -0.1546 0.4776 -0.0035 0.0000
## edu^4 -0.0762 -0.0818 0.5114 0.0056 0.0000
## height 167.7767 168.0517 7.6437 -0.2750 2.0000
## weight 131.5189 132.1369 23.1860 -0.6181 4.0000
## BMI 23.3063 23.3247 3.3201 -0.0184 0.3646
## english.L -0.4455 -0.4414 0.2694 -0.0041 0.0000
## english.Q 0.1791 0.1801 0.4448 -0.0011 0.0000
## english.C -0.0436 -0.0453 0.4024 0.0017 0.0000
## english^4 0.0400 0.0468 0.3692 -0.0068 0.0000
## mandarin.L 0.1643 0.1678 0.3474 -0.0035 0.0000
## mandarin.Q -0.1420 -0.1367 0.4252 -0.0052 0.0000
## mandarin.C -0.0314 -0.0399 0.4321 0.0085 0.0000
## mandarin^4 0.0984 0.0737 0.5081 0.0247 0.0000
## faEdu.L -0.4577 -0.4511 0.2990 -0.0066 0.0000
## faEdu.Q 0.2530 0.2481 0.4174 0.0049 0.0000
## faEdu.C -0.1607 -0.1631 0.3663 0.0024 0.0000
## faEdu^4 0.0425 0.0356 0.3309 0.0069 0.0000
## faCCPmember1 0.2537 0.2583 0.4378 -0.0046 0.0000
## age:edu.L -2.0098 -2.0717 19.6979 0.0619 6.3246
## age:edu.Q -10.4594 -10.1322 20.0345 -0.3272 1.6036
## age:edu.C -6.8342 -6.5871 25.2091 -0.2471 5.3759
## age:edu^4 -2.6015 -2.9077 27.5137 0.3061 2.8685
## eQQ Mean eQQ Max
## distance 0.1268 0.2218
## age 1.3998 3.0000
## age.squared 148.7036 393.0000
## below45FALSE 0.0255 1.0000
## below45TRUE 0.0246 1.0000
## above55TRUE 0.0394 1.0000
## race2 0.0016 1.0000
## race3 0.0025 1.0000
## race4 0.0016 1.0000
## race5 0.0033 1.0000
## race6 0.0016 1.0000
## race7 0.0008 1.0000
## race8 0.0057 1.0000
## edu.L 0.1482 0.3162
## edu.Q 0.0507 0.8018
## edu.C 0.1358 0.6325
## edu^4 0.0550 0.5976
## height 1.5567 5.0000
## weight 4.1018 20.0000
## BMI 0.3444 5.0821
## english.L 0.0423 0.3162
## english.Q 0.0821 0.8018
## english.C 0.0582 0.6325
## english^4 0.0535 0.5976
## mandarin.L 0.0659 0.3162
## mandarin.Q 0.0050 0.8018
## mandarin.C 0.0231 0.3162
## mandarin^4 0.0108 0.5976
## faEdu.L 0.0265 0.3162
## faEdu.Q 0.0274 0.8018
## faEdu.C 0.0158 0.6325
## faEdu^4 0.0201 0.5976
## faCCPmember1 0.0484 1.0000
## age:edu.L 7.9431 18.0250
## age:edu.Q 3.8496 37.1493
## age:edu.C 7.0658 26.5631
## age:edu^4 3.9909 19.9603
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 99.8521 43.8139 43.5115 40.8121
## age 95.0597 80.0000 70.7196 62.5000
## age.squared 96.2111 73.6082 70.4669 59.6095
## below45FALSE 84.4761 0.0000 76.8657 0.0000
## below45TRUE 84.4761 0.0000 77.6119 0.0000
## above55TRUE 99.8883 0.0000 73.1844 0.0000
## race2 87.8293 0.0000 33.3333 0.0000
## race3 -272.5074 0.0000 -200.0000 0.0000
## race4 77.8883 0.0000 71.4286 0.0000
## race5 8.9217 0.0000 0.0000 0.0000
## race6 89.4219 0.0000 66.6667 0.0000
## race7 28.9864 0.0000 50.0000 0.0000
## race8 82.9613 0.0000 63.1579 0.0000
## edu.L 99.7725 100.0000 53.5395 50.0000
## edu.Q 97.6052 100.0000 76.1610 0.0000
## edu.C 98.0262 0.0000 22.8614 0.0000
## edu^4 87.6271 100.0000 61.3793 0.0000
## height 92.8952 50.0000 60.0000 73.6842
## weight 93.6674 60.0000 58.1645 23.0769
## BMI 97.4610 53.4955 53.9510 0.0000
## english.L 95.4285 0.0000 52.6163 0.0000
## english.Q 99.3361 0.0000 50.2660 0.0000
## english.C 98.4676 0.0000 50.2222 33.3333
## english^4 50.3788 0.0000 50.0000 0.0000
## mandarin.L 97.9455 100.0000 61.0429 0.0000
## mandarin.Q 70.1980 0.0000 70.8861 0.0000
## mandarin.C 73.7352 0.0000 62.2881 0.0000
## mandarin^4 -8.8771 0.0000 52.1739 0.0000
## faEdu.L 87.4880 0.0000 50.4854 0.0000
## faEdu.Q 92.1292 0.0000 55.6738 0.0000
## faEdu.C 65.6672 0.0000 52.7132 0.0000
## faEdu^4 75.7578 0.0000 53.9326 0.0000
## faCCPmember1 95.5206 0.0000 52.8000 0.0000
## age:edu.L 99.5673 57.4468 44.5379 28.7500
## age:edu.Q 97.3404 75.0000 68.7523 0.7143
## age:edu.C 96.5032 29.1667 13.7729 -10.5263
## age:edu^4 85.0082 53.8462 58.2291 16.9154
##
## Sample sizes:
## Control Treated
## All 7216 1218
## Matched 2660 1218
## Unmatched 4556 0
## Discarded 0 0
bal.tab(match4)
## Call
## matchit(formula = CCPmember ~ age + age.squared + below45 + above55 +
## race + edu + age:edu + height + weight + BMI + english +
## mandarin + faEdu + faCCPmember, data = data.nomissing, method = "nearest",
## ratio = 5, replace = TRUE)
##
## Balance Measures
## Type Diff.Adj
## distance Distance 0.0016
## age Contin. 0.0155
## age.squared Contin. 0.0117
## below45 Binary -0.0171
## above55 Binary 0.0002
## race_1 Binary -0.0110
## race_2 Binary -0.0003
## race_3 Binary 0.0034
## race_4 Binary 0.0011
## race_5 Binary 0.0034
## race_6 Binary -0.0005
## race_7 Binary 0.0011
## race_8 Binary 0.0026
## edu_1 Binary -0.0005
## edu_2 Binary -0.0033
## edu_3 Binary 0.0067
## edu_4 Binary 0.0007
## edu_5 Binary -0.0036
## height Contin. -0.0390
## weight Contin. -0.0279
## BMI Contin. -0.0056
## english_1 Binary 0.0007
## english_2 Binary 0.0059
## english_3 Binary -0.0043
## english_4 Binary 0.0011
## english_5 Binary -0.0034
## mandarin_1 Binary -0.0003
## mandarin_2 Binary -0.0039
## mandarin_3 Binary 0.0205
## mandarin_4 Binary -0.0169
## mandarin_5 Binary 0.0007
## faEdu_1 Binary 0.0069
## faEdu_2 Binary -0.0010
## faEdu_3 Binary 0.0023
## faEdu_4 Binary -0.0082
## faEdu_5 Binary 0.0000
## faCCPmember Binary -0.0046
##
## Sample sizes
## Control Treated
## All 7216 1218
## Matched 2660 1218
## Unmatched 4556 0
bal.plot(match4, var.name = "age")
bal.plot(match4, var.name = "below45")
bal.plot(match4, var.name = "above55")



The balance on age looks much better now.
The causal effect of CCP membership
I then re-estimate the propensity score and run the matching estimator.
ps3 <- glm(CCPmember ~ age + age.squared + below45 + above55 + race + edu + age:edu +
height + weight + BMI + english + mandarin + faEdu + faCCPmember,
family = binomial, data = data.nomissing)
summary(ps3)
##
## Call:
## glm(formula = CCPmember ~ age + age.squared + below45 + above55 +
## race + edu + age:edu + height + weight + BMI + english +
## mandarin + faEdu + faCCPmember, family = binomial, data = data.nomissing)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8380 -0.5222 -0.3046 -0.1421 3.5133
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.631e+01 3.642e+01 -0.997 0.318776
## age 1.114e-01 2.324e-02 4.792 1.65e-06 ***
## age.squared -6.214e-04 1.972e-04 -3.151 0.001626 **
## below45TRUE 3.130e-01 1.456e-01 2.150 0.031539 *
## above55TRUE 4.602e-01 1.359e-01 3.388 0.000705 ***
## race2 7.161e-01 5.133e-01 1.395 0.163031
## race3 -3.870e-01 4.092e-01 -0.946 0.344327
## race4 1.016e-01 2.710e-01 0.375 0.707757
## race5 1.315e+00 4.955e-01 2.654 0.007963 **
## race6 -7.970e-02 4.110e-01 -0.194 0.846255
## race7 2.639e-01 4.194e-01 0.629 0.529227
## race8 4.151e-01 2.197e-01 1.889 0.058875 .
## edu.L 6.906e+00 9.413e-01 7.336 2.19e-13 ***
## edu.Q 6.597e-01 7.721e-01 0.854 0.392858
## edu.C 7.564e-01 5.032e-01 1.503 0.132783
## edu^4 2.104e-02 3.189e-01 0.066 0.947394
## height 1.628e-01 3.147e-02 5.172 2.32e-07 ***
## weight -6.868e-02 1.990e-02 -3.451 0.000559 ***
## BMI 4.045e-01 1.122e-01 3.604 0.000313 ***
## english.L -6.003e-01 3.609e-01 -1.663 0.096279 .
## english.Q -3.161e-01 2.960e-01 -1.068 0.285478
## english.C 1.611e-01 2.360e-01 0.683 0.494862
## english^4 3.047e-01 1.610e-01 1.892 0.058511 .
## mandarin.L 2.552e-01 1.319e-01 1.935 0.053016 .
## mandarin.Q -3.220e-01 1.040e-01 -3.096 0.001960 **
## mandarin.C 4.184e-02 9.234e-02 0.453 0.650474
## mandarin^4 8.687e-02 7.415e-02 1.172 0.241364
## faEdu.L -8.374e+00 1.139e+02 -0.074 0.941407
## faEdu.Q -6.628e+00 9.629e+01 -0.069 0.945121
## faEdu.C -3.973e+00 5.696e+01 -0.070 0.944398
## faEdu^4 -1.724e+00 2.153e+01 -0.080 0.936197
## faCCPmember1 3.493e-01 9.144e-02 3.819 0.000134 ***
## age:edu.L -7.105e-02 2.470e-02 -2.877 0.004020 **
## age:edu.Q -1.907e-02 2.057e-02 -0.927 0.354035
## age:edu.C -1.857e-02 1.287e-02 -1.443 0.148974
## age:edu^4 -5.161e-03 6.878e-03 -0.750 0.453087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6964.8 on 8433 degrees of freedom
## Residual deviance: 5288.5 on 8398 degrees of freedom
## AIC: 5360.5
##
## Number of Fisher Scoring iterations: 12
pscore <- ps3$fitted.values
data.nomissing$pscore <- pscore
psm3 <- Match(Y = hincome, Tr = CCPmember, X = pscore, estimand = "ATT",
M = 5, replace = TRUE)
summary(psm3)
##
## Estimate... -4366.9
## AI SE...... 5777
## T-stat..... -0.75591
## p.val...... 0.4497
##
## Original number of observations.............. 8434
## Original number of treated obs............... 1218
## Matched number of observations............... 1218
## Matched number of observations (unweighted). 16006
The estimated ATT is -4366.9 with a p-value of 0.4497, which is greater than 0.05. Therefore, there is no evidence that the ATT of CCP membership on household income differs from zero. I then replace the outcome variable with log household income.
psm4 <- Match(Y = data.nomissing$lnhincome, Tr = CCPmember, X = pscore, estimand = "ATT",
M = 5, replace = TRUE)
summary(psm4)
##
## Estimate... 0.14272
## AI SE...... 0.036001
## T-stat..... 3.9645
## p.val...... 7.3545e-05
##
## Original number of observations.............. 8434
## Original number of treated obs............... 1218
## Matched number of observations............... 1218
## Matched number of observations (unweighted). 16006
Now the estimate is 0.14272 log points, indicating that being a CCP member raises household income by about 15% (\(e^{0.14272}-1 \approx 0.153\)). The estimate is significantly different from zero (\(p<0.001\)). Compared with the previous result, the difference suggests that CCP membership has a non-linear relationship with household income. I think the estimate based on log household income is more reliable.
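The conversion from a log-point estimate to a percentage change can be sketched as follows. The estimate and standard error are copied from the `summary(psm4)` output above; the normal-approximation interval is an assumption added for illustration and is not part of the original analysis.

```r
# ATT on log household income and its AI standard error, from summary(psm4)
att_log <- 0.14272
att_se  <- 0.036001

# A coefficient b on a log outcome corresponds to a (exp(b) - 1) * 100 percent change
pct_effect <- (exp(att_log) - 1) * 100

# Approximate 95% interval (assumes a normal sampling distribution),
# computed on the log scale and then converted to percent
ci_log <- att_log + c(-1.96, 1.96) * att_se
ci_pct <- (exp(ci_log) - 1) * 100

print(round(pct_effect, 1))
print(round(ci_pct, 1))
```

For small coefficients the log-point value and the percentage change are close, but they diverge as the coefficient grows, so the exact conversion is worth reporting.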
Sensitivity analysis
library(rbounds)
psens(x = psm4, Gamma = 2, GammaInc = 0.1)
##
## Rosenbaum Sensitivity Test for Wilcoxon Signed Rank P-Value
##
## Unconfounded estimate .... 0
##
## Gamma Lower bound Upper bound
## 1.0 0 0.0000
## 1.1 0 0.0000
## 1.2 0 0.0000
## 1.3 0 0.0000
## 1.4 0 0.3885
## 1.5 0 0.9996
## 1.6 0 1.0000
## 1.7 0 1.0000
## 1.8 0 1.0000
## 1.9 0 1.0000
## 2.0 0 1.0000
##
## Note: Gamma is Odds of Differential Assignment To
## Treatment Due to Unobserved Factors
##
hlsens(x = psm4, Gamma = 2, GammaInc = 0.1)
##
## Rosenbaum Sensitivity Test for Hodges-Lehmann Point Estimate
##
## Unconfounded estimate .... 0.2054
##
## Gamma Lower bound Upper bound
## 1.0 0.2053500 0.20535
## 1.1 0.1053500 0.30535
## 1.2 0.0053548 0.40535
## 1.3 0.0053548 0.40535
## 1.4 -0.0946450 0.40535
## 1.5 -0.0946450 0.50535
## 1.6 -0.0946450 0.50535
## 1.7 -0.1946500 0.60535
## 1.8 -0.1946500 0.60535
## 1.9 -0.1946500 0.60535
## 2.0 -0.2946500 0.60535
##
## Note: Gamma is Odds of Differential Assignment To
## Treatment Due to Unobserved Factors
##
The Rosenbaum sensitivity test for the Wilcoxon signed-rank p-value shows that the upper bound exceeds 0.05 once \(\Gamma\) reaches 1.4.
Likewise, the Rosenbaum sensitivity test for the Hodges-Lehmann point estimate shows that the lower bound becomes negative once \(\Gamma\) reaches 1.4.
That is, if some omitted variable made the odds of being in the treatment group (i.e., being a CCP member) 1.4 times higher for one of two otherwise comparable individuals, the earlier conclusion that CCP membership has an effect would no longer hold.
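The critical value of \(\Gamma\) can also be read off programmatically. The sketch below uses the upper-bound p-values transcribed by hand from the `psens()` output above; the 0.05 threshold and the variable names are assumptions for illustration.

```r
# Upper-bound p-values transcribed from the psens() output above
gamma   <- seq(1.0, 2.0, by = 0.1)
p_upper <- c(0, 0, 0, 0, 0.3885, 0.9996, 1, 1, 1, 1, 1)

# Smallest Gamma at which the upper-bound p-value exceeds 0.05
critical_gamma <- gamma[which(p_upper > 0.05)[1]]

# Under Rosenbaum's model, within a matched pair the probability that a
# given unit is the treated one lies between 1/(1+Gamma) and Gamma/(1+Gamma)
prob_bounds <- c(lower = 1 / (1 + critical_gamma),
                 upper = critical_gamma / (1 + critical_gamma))

print(critical_gamma)
print(round(prob_bounds, 3))
```

Read this way, a hidden bias that shifts the within-pair treatment probability only modestly away from one half is already enough to overturn the result.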
Conclusion and Discussion
Main result
The result suggests that, after accounting for age, education, height, weight, father's education and CCP membership, and proficiency in English and Mandarin, CCP membership benefits individuals by increasing their household income. However, this result is somewhat sensitive to unobserved or omitted factors.
Many unobserved characteristics may affect personal income, and collecting and including them is difficult. For example, personal ambition might affect both the probability of joining the CCP and household income, yet it is hard to observe and measure.
Therefore, it remains hard to conclude that CCP membership has a causal effect on income.
Appropriateness of the treatment variable
A good treatment variable must fit the following criteria:
1. The cause must precede the outcome
This dataset is a cross-sectional dataset. We only know that at this certain time what’s the income of the interviewee as well as whether he/she has CCP membership. However, we have no idea about when did he or she joint the CCP and what’s the income before he/she becoming a member of CCP. That is, we can not sure whether the high income happens first (and because he/she is rich or powerful, so he/she can get into the CCP), or the membership occurs first.
2. The cause should associate with the outcome
I run a simple t-test, preceded by a test of homogeneity of variance:
var.test(hincome ~ CCPmember, data=data.nomissing)
##
## F test to compare two variances
##
## data: hincome by CCPmember
## F = 0.54861, num df = 7215, denom df = 1217, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5027726 0.5970442
## sample estimates:
## ratio of variances
## 0.5486141
t.test(hincome ~ CCPmember, data=data.nomissing , var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: hincome by CCPmember
## t = -5.413, df = 1450.7, p-value = 7.241e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24782.05 -11598.36
## sample estimates:
## mean in group 0 mean in group 1
## 19314.18 37504.39
var.test(lnhincome ~ CCPmember, data=data.nomissing)
##
## F test to compare two variances
##
## data: lnhincome by CCPmember
## F = 1.3364, num df = 7215, denom df = 1217, p-value = 1.753e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.224765 1.454413
## sample estimates:
## ratio of variances
## 1.336436
t.test(lnhincome ~ CCPmember, data=data.nomissing , var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: lnhincome by CCPmember
## t = -25.927, df = 1812.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8900196 -0.7648358
## sample estimates:
## mean in group 0 mean in group 1
## 9.149145 9.976573
par(mfrow=c(1,2))
boxplot(hincome ~ CCPmember, data=data.nomissing, main="Household income",
names=c("Control","Treatment"))
boxplot(lnhincome ~ CCPmember, data=data.nomissing, main="Household income (natural log)",
names=c("Control","Treatment"))

The results show that individuals with CCP membership have significantly higher income than individuals without membership, whether the outcome variable is household income or log household income. In this respect, CCP membership meets the requirement.
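To give a rough sense of the size of the gap on the log scale, the difference in mean log income can be back-transformed into a ratio of geometric means. The sketch below only re-uses the two group means printed by the Welch test on lnhincome; the variable names are hypothetical and introduced here for illustration. It describes the raw, unadjusted association, not the matched estimate.

```r
# Group means of log household income, copied from the Welch t-test output
mean_log_control <- 9.149145   # CCPmember == 0
mean_log_treated <- 9.976573   # CCPmember == 1

# Difference in mean logs (treated minus control)
log_diff <- mean_log_treated - mean_log_control

# exp() of a difference in mean logs is the ratio of geometric means
geo_mean_ratio <- exp(log_diff)

print(round(log_diff, 3))
print(round(geo_mean_ratio, 2))
```

Because the raw income distribution is heavily right-skewed, this ratio of geometric means need not equal the ratio of the arithmetic means reported by the first t-test.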
3. Treatment must be manipulable
Of course, CCP membership is manipulable. However, even if CCP membership does affect income, the estimate has limited practical implications. It is unlikely that merely becoming a CCP member has some magic power to enhance a person’s human capital. A more plausible explanation is that holding a membership signals special capability or social capital, so that the member can more easily get a job or a promotion. If everyone could get CCP membership, the value of such membership would decrease. Even if everyone in China became a member, the average salary of the whole population would be unlikely to increase.
CCP membership as a treatment variable differs from a job training program or medical care. Membership in itself has essentially no power to increase income.