Lalonde 3 Ways
Executive Summary: This analysis uses multivariate linear regression, logistic regression, and random forests to assess whether the Lalonde treatment has a positive effect on real income in 1978 (“re78”). The observations are split into two subgroups, with and without a degree, to sharpen the analysis. Across all three approaches, the evidence suggests the Lalonde treatment increases income, with a larger impact on people with a degree.
Import Libraries & Data
library(Matching)
## Loading required package: MASS
## ##
## ## Matching (Version 4.9-2, Build Date: 2015-12-25)
## ## See http://sekhon.berkeley.edu/matching for additional documentation.
## ## Please cite software as:
## ## Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## ## Software with Automated Balance Optimization: The Matching package for R.''
## ## Journal of Statistical Software, 42(7): 1-52.
## ##
library(arm)
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.9-3, built: 2016-11-21)
## Working directory is /Users/boyu/Lalonde 3 Ways
library(dismo)
## Loading required package: raster
## Loading required package: sp
##
## Attaching package: 'raster'
## The following objects are masked from 'package:MASS':
##
## area, select
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
data(lalonde)
subgroup.nodegr <- subset(lalonde, nodegr == 1)
subgroup.yesdegr <- subset(lalonde, nodegr == 0)
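A quick sanity check of the subgroup sizes before modeling:
table(lalonde$nodegr) # 1 = no degree, 0 = degree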
| Part 1 |
First, perform a linear regression using all variables and identify those with low p-values to build the actual model: a lower p-value means higher confidence in rejecting the null hypothesis that the change in “re78” is not associated with the predictor (i.e., no effect).
lm.1 <- lm(re78 ~ age + educ + black + hisp + married + nodegr + re74 + re75 + u74 + u75 + treat, data=lalonde)
summary(lm.1)
##
## Call:
## lm(formula = re78 ~ age + educ + black + hisp + married + nodegr +
## re74 + re75 + u74 + u75 + treat, data = lalonde)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9612 -4355 -1572 3054 53119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.567e+02 3.522e+03 0.073 0.94193
## age 5.357e+01 4.581e+01 1.170 0.24284
## educ 4.008e+02 2.288e+02 1.751 0.08058 .
## black -2.037e+03 1.174e+03 -1.736 0.08331 .
## hisp 4.258e+02 1.565e+03 0.272 0.78562
## married -1.463e+02 8.823e+02 -0.166 0.86835
## nodegr -1.518e+01 1.006e+03 -0.015 0.98797
## re74 1.234e-01 8.784e-02 1.405 0.16079
## re75 1.974e-02 1.503e-01 0.131 0.89554
## u74 1.380e+03 1.188e+03 1.162 0.24590
## u75 -1.071e+03 1.025e+03 -1.045 0.29651
## treat 1.671e+03 6.411e+02 2.606 0.00948 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6517 on 433 degrees of freedom
## Multiple R-squared: 0.05822, Adjusted R-squared: 0.0343
## F-statistic: 2.433 on 11 and 433 DF, p-value: 0.005974
The result shows “treat” is a strong predictor of “re78”. For the purpose of including at least three variables in the linear regression model, we use p < 0.1 as the cut-off for including a variable. The new linear regression therefore uses “educ”, “black”, and “treat” as independent variables to predict the dependent variable “re78”.
lm.2 <- lm(re78 ~ educ + black + treat, data=lalonde)
summary(lm.2)
##
## Call:
## lm(formula = re78 ~ educ + black + treat, data = lalonde)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9298 -4557 -1687 3256 54029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2156.3 1889.1 1.141 0.25431
## educ 416.9 172.9 2.412 0.01629 *
## black -2185.4 829.2 -2.636 0.00869 **
## treat 1722.7 627.4 2.746 0.00628 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6505 on 441 degrees of freedom
## Multiple R-squared: 0.04432, Adjusted R-squared: 0.03782
## F-statistic: 6.817 on 3 and 441 DF, p-value: 0.0001686
The new regression model shows that all three variables are strong predictors of “re78”. Next, we apply this model to the no-degree and yes-degree subgroups.
lm.nodegr.1 <- lm(re78 ~ educ + black + treat, data=subgroup.nodegr)
summary(lm.nodegr.1)
##
## Call:
## lm(formula = re78 ~ educ + black + treat, data = subgroup.nodegr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7024 -4184 -1625 3193 54590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3966.3 2255.6 1.758 0.0796 .
## educ 253.9 223.1 1.138 0.2559
## black -2321.4 924.0 -2.512 0.0125 *
## treat 1280.4 693.7 1.846 0.0658 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6250 on 344 degrees of freedom
## Multiple R-squared: 0.02794, Adjusted R-squared: 0.01947
## F-statistic: 3.296 on 3 and 344 DF, p-value: 0.02069
confint(lm.nodegr.1, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -470.08502 8402.7662
## educ -184.92403 692.6958
## black -4138.72294 -503.9788
## treat -84.10616 2644.8608
lm.yesdegr.1 <- lm(re78 ~ educ + black + treat, data=subgroup.yesdegr)
summary(lm.yesdegr.1)
##
## Call:
## lm(formula = re78 ~ educ + black + treat, data = subgroup.yesdegr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12561 -4322 -2229 3928 25896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18009 12412 -1.451 0.1502
## educ 1987 1017 1.954 0.0538 .
## black -1713 1886 -0.908 0.3663
## treat 2658 1522 1.746 0.0842 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7337 on 93 degrees of freedom
## Multiple R-squared: 0.08704, Adjusted R-squared: 0.05759
## F-statistic: 2.956 on 3 and 93 DF, p-value: 0.0365
confint(lm.yesdegr.1, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -42657.23808 6639.911
## educ -32.80737 4006.376
## black -5458.89617 2033.411
## treat -365.51492 5681.145
According to the linear regression model, the treatment effect for the no-degree subgroup is +1280.4, with a 95% confidence interval from -84.1 to +2644.9; the treatment effect for the yes-degree subgroup is +2658.0, with a 95% confidence interval from -365.5 to +5681.1. Because we are using a linear model without interactions, it assumes a constant treatment effect within each subgroup. The results indicate that the Lalonde treatment has a positive effect in both subgroups, with the point estimate for the yes-degree subgroup roughly twice that of the no-degree subgroup. However, “treat” is a weaker predictor in the yes-degree subgroup, giving it a wider confidence interval. The lower bound of the yes-degree interval falls below the lower bound of the no-degree interval, so there is a non-trivial probability that the treatment effect is actually lower for the yes-degree subgroup, which weakens the conclusion that the treatment is more effective for people with a degree.
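To make the subgroup comparison concrete, a rough normal-approximation check combines the two fitted estimates, assuming the subgroups are independent samples:
# Approximate 95% CI for the difference in treatment effects
# (yes-degree minus no-degree), assuming independent subgroups
b.no <- coef(summary(lm.nodegr.1))["treat", ]
b.yes <- coef(summary(lm.yesdegr.1))["treat", ]
diff.est <- b.yes["Estimate"] - b.no["Estimate"]
diff.se <- sqrt(b.yes["Std. Error"]^2 + b.no["Std. Error"]^2)
diff.est + c(-1.96, 1.96) * diff.se # the interval covers 0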
Cross-Validation: In an attempt to improve the model accuracy for the yes-degree subgroup, we can try cross-validation.
k <- 10
folds <- kfold(subgroup.yesdegr, k) # random fold assignment (dismo)
min.SE.kfold <- Inf
for (i in 1:k)
{
  set.train <- subgroup.yesdegr[which(folds != i), ]
  set.test <- subgroup.yesdegr[which(folds == i), ]
  lm.yesdegr.kfold <- lm(re78 ~ educ + black + treat, data = set.train)
  predict.kfold <- predict(lm.yesdegr.kfold, set.test)
  # RMSE on the held-out fold; note nrow(), not length(), which would
  # count the columns of the data frame rather than its rows
  SE.kfold <- sqrt(sum((predict.kfold - set.test$re78)^2) / nrow(set.test))
  if (SE.kfold < min.SE.kfold) # keep the fold model with the lowest RMSE
  {
    min.SE.kfold <- SE.kfold
    min.lm.yesdegr.kfold <- lm.yesdegr.kfold
  }
}
summary(min.lm.yesdegr.kfold)
##
## Call:
## lm(formula = re78 ~ educ + black + treat, data = set.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12033 -4407 -2516 3759 26309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16412 13160 -1.247 0.2159
## educ 1930 1075 1.795 0.0763 .
## black -2336 2079 -1.124 0.2645
## treat 2073 1677 1.236 0.2200
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7650 on 83 degrees of freedom
## Multiple R-squared: 0.07384, Adjusted R-squared: 0.04037
## F-statistic: 2.206 on 3 and 83 DF, p-value: 0.09349
confint(min.lm.yesdegr.kfold, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -42586.1380 9762.477
## educ -208.8395 4067.855
## black -6470.5956 1799.217
## treat -1262.8732 5408.263
The k-fold validation method with k = 10 sometimes shows an improvement in the p-value, but most of the time it results in a higher one. (Because the folds are assigned at random, the selected model changes from run to run.)
k <- nrow(subgroup.yesdegr) # k = n gives leave-one-out CV (LOOCV)
folds <- kfold(subgroup.yesdegr, k)
min.SE.kfold <- Inf
for (i in 1:k)
{
  set.train <- subgroup.yesdegr[which(folds != i), ]
  set.test <- subgroup.yesdegr[which(folds == i), ]
  lm.yesdegr.kfold <- lm(re78 ~ educ + black + treat, data = set.train)
  predict.kfold <- predict(lm.yesdegr.kfold, set.test)
  SE.kfold <- sqrt(sum((predict.kfold - set.test$re78)^2) / nrow(set.test))
  if (SE.kfold < min.SE.kfold)
  {
    min.SE.kfold <- SE.kfold
    min.lm.yesdegr.kfold <- lm.yesdegr.kfold
  }
}
summary(min.lm.yesdegr.kfold)
##
## Call:
## lm(formula = re78 ~ educ + black + treat, data = set.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12562 -4348 -2273 3938 25895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18012 12490 -1.442 0.1527
## educ 1987 1023 1.942 0.0552 .
## black -1713 1899 -0.902 0.3694
## treat 2659 1539 1.727 0.0875 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7376 on 92 degrees of freedom
## Multiple R-squared: 0.08601, Adjusted R-squared: 0.05621
## F-statistic: 2.886 on 3 and 92 DF, p-value: 0.03986
confint(min.lm.yesdegr.kfold, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -42818.81731 6795.777
## educ -45.03633 4018.988
## black -5485.23931 2058.686
## treat -398.69239 5716.114
LOOCV, the special case of k-fold where k equals the number of observations, produces almost the same result as the linear regression on the entire yes-degree subset.
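For reference, k-fold cross-validation is more conventionally used to estimate out-of-sample error by averaging the fold errors rather than keeping the best-scoring fold; a minimal sketch of that usage:
# Conventional k-fold CV: average the fold RMSEs to estimate
# out-of-sample error instead of retaining the best fold model
k <- 10
folds <- kfold(subgroup.yesdegr, k)
rmse <- numeric(k)
for (i in 1:k)
{
  set.train <- subgroup.yesdegr[folds != i, ]
  set.test <- subgroup.yesdegr[folds == i, ]
  fit <- lm(re78 ~ educ + black + treat, data = set.train)
  rmse[i] <- sqrt(mean((predict(fit, set.test) - set.test$re78)^2))
}
mean(rmse) # estimated out-of-sample RMSE for the yes-degree model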
| Part 2 |
In the next linear regression model, only the interaction term between “treat” and “nodegr” is included.
lm.int <- lm(re78 ~ I(treat*nodegr), data=lalonde) # new name, so Part 1's lm.2 is not overwritten
summary(lm.int)
##
## Call:
## lm(formula = re78 ~ I(treat * nodegr), data = lalonde)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5649 -5155 -1632 2797 54658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5155.3 374.4 13.768 <2e-16 ***
## I(treat * nodegr) 494.2 690.1 0.716 0.474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6635 on 443 degrees of freedom
## Multiple R-squared: 0.001156, Adjusted R-squared: -0.001099
## F-statistic: 0.5127 on 1 and 443 DF, p-value: 0.4743
The result shows that the interaction term alone is not a good predictor. The interaction term is the product of two boolean variables, so it only distinguishes observations that are both treated and without a degree from everyone else, collapsing the remaining three conditions together. Used alone, the interaction therefore loses information. To give the model predictive power, we must include the “treat” and “nodegr” main effects along with the interaction term, as in the model below.
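A quick tabulation makes the information loss concrete:
# Cell counts for the four (treat, nodegr) combinations; the bare product
# treat*nodegr separates only the treated no-degree cell from the rest
xtabs(~ treat + nodegr, data=lalonde)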
lm.3 <- lm(re78 ~ treat + nodegr + I(treat*nodegr), data=lalonde)
summary(lm.3)
##
## Call:
## lm(formula = re78 ~ treat + nodegr + I(treat * nodegr), data = lalonde)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8047 -4495 -1604 2758 54658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4854.5 999.7 4.856 1.67e-06 ***
## treat 3192.0 1339.9 2.382 0.0176 *
## nodegr -359.1 1094.3 -0.328 0.7430
## I(treat * nodegr) -2038.0 1523.7 -1.338 0.1817
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6556 on 441 degrees of freedom
## Multiple R-squared: 0.02931, Adjusted R-squared: 0.02271
## F-statistic: 4.439 on 3 and 441 DF, p-value: 0.004354
confint(lm.3, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 2889.6389 6819.3497
## treat 558.6096 5825.4429
## nodegr -2509.8155 1791.6595
## I(treat * nodegr) -5032.4964 956.5392
To visualize the uncertainty in these estimates, draw simulations of the coefficient vector with arm’s sim() (100 draws by default):
lm.3.sim <- sim(lm.3)
coef(lm.3.sim)
## (Intercept) treat nodegr I(treat * nodegr)
## [1,] 3748.104 5351.2813 736.743236 -5378.07576
## [2,] 3526.192 6398.8687 1053.344193 -5202.31041
## [3,] 6788.821 1453.9186 -2391.660911 -722.87928
## [4,] 3525.700 2907.4501 635.748024 -1039.80685
## [5,] 4465.068 4584.7854 -155.878046 -2706.36922
## [6,] 6959.752 1186.4968 -3019.978770 -436.11642
## [7,] 4649.293 3060.0280 379.533590 -2497.07767
## [8,] 4124.861 2683.3189 -115.905758 -542.27964
## [9,] 3031.728 5757.4293 1600.781735 -4615.63925
## [10,] 4461.122 5262.0412 -369.679559 -4824.72692
## [11,] 5733.171 1541.8889 -736.231449 602.78724
## [12,] 4362.072 3324.3517 313.732214 -3273.58919
## [13,] 2654.806 5819.4192 1627.236758 -5331.11069
## [14,] 5807.255 2899.2927 -355.184350 -3102.58866
## [15,] 4259.919 4068.4599 181.496667 -3445.54466
## [16,] 5767.489 2147.9492 -1523.349639 -22.48511
## [17,] 3543.985 4703.1270 1224.635686 -3727.69726
## [18,] 2981.154 4038.9315 1143.332457 -2595.82182
## [19,] 4566.388 3335.0551 -812.775556 -2326.21611
## [20,] 4743.684 2307.1768 -950.061799 -718.21607
## [21,] 4536.748 4208.9174 -377.972082 -2995.04251
## [22,] 3871.496 4050.4154 106.820004 -1551.43446
## [23,] 4770.045 3151.8609 -1101.804771 -19.58108
## [24,] 5585.021 1866.9594 -1015.451598 167.94056
## [25,] 6148.153 149.5457 -1181.461164 532.43852
## [26,] 5869.604 874.2713 -1486.525995 285.11752
## [27,] 4314.127 3333.7161 113.941696 -2533.46073
## [28,] 5697.671 3039.8609 -639.382055 -2670.23617
## [29,] 6138.416 1162.6184 -1512.905911 -411.84145
## [30,] 4424.260 2767.1121 -835.402692 -659.56251
## [31,] 4380.719 3780.9076 -384.980851 -2498.70446
## [32,] 5310.851 2205.9855 -1802.052384 358.54494
## [33,] 6001.464 2309.3290 -1698.103721 -712.65868
## [34,] 4303.009 3761.3179 447.340651 -2914.55632
## [35,] 4865.485 3319.9895 193.468428 -3097.93835
## [36,] 4326.193 3359.9517 779.826142 -1786.14368
## [37,] 4908.996 3217.1872 -537.457596 -2079.45144
## [38,] 4308.512 5746.6374 384.358360 -5579.36964
## [39,] 4355.833 3099.4406 -488.953079 -422.99023
## [40,] 5255.217 3360.6287 -739.485273 -3226.80314
## [41,] 3592.351 3851.8921 426.106498 -2558.37522
## [42,] 5029.167 4151.4882 -614.109407 -3326.17943
## [43,] 3797.430 6242.4500 832.842252 -5129.76548
## [44,] 4695.429 3818.1941 227.253408 -3570.63073
## [45,] 5012.451 3468.4919 -459.526253 -2212.45105
## [46,] 5377.957 990.1739 -579.104439 424.26957
## [47,] 4901.990 1706.9350 768.550859 -2361.07576
## [48,] 3885.321 4204.0514 1098.405200 -3906.18132
## [49,] 6151.073 1620.8842 -1982.062907 -1016.78457
## [50,] 4243.985 4426.7293 341.425954 -3086.64195
## [51,] 5980.761 1538.9047 -1609.287174 -53.23908
## [52,] 5581.975 1632.3138 -1008.470377 -661.99156
## [53,] 2563.554 4754.8608 1666.005927 -3243.88877
## [54,] 5060.823 1680.2418 -1152.664338 -282.73728
## [55,] 4086.597 5137.0857 702.882288 -5019.03167
## [56,] 4780.719 2941.3381 -386.304526 -2080.79393
## [57,] 5623.833 2223.7513 -967.349438 -1746.87304
## [58,] 4313.205 4154.3348 1153.248538 -3897.91865
## [59,] 2773.006 6325.3883 1781.898794 -4853.22380
## [60,] 3927.060 3041.7025 216.448710 -1792.05206
## [61,] 5984.314 1881.1746 -1620.473602 -2041.70703
## [62,] 6194.295 2817.6214 -2145.177773 -1010.46776
## [63,] 4769.592 2324.6299 -767.580159 -1610.34176
## [64,] 5972.558 503.5306 -1947.046532 966.05409
## [65,] 3770.502 3847.5892 579.570894 -3245.29353
## [66,] 4767.258 2088.3519 -1193.901122 572.51225
## [67,] 2701.364 4262.2475 1540.349117 -2557.28104
## [68,] 6271.948 2304.0385 -1561.255882 -883.09319
## [69,] 4108.274 4427.3066 632.500701 -4140.45200
## [70,] 5077.692 3304.4998 733.179455 -4514.81655
## [71,] 4439.643 3769.0426 -392.326310 -1782.30499
## [72,] 5947.602 1926.5406 -681.745739 -1454.91364
## [73,] 4041.980 3358.3917 -87.331424 -951.18015
## [74,] 5105.997 2468.6939 -658.312688 -1035.86563
## [75,] 4769.497 2848.3915 -555.459562 -2169.95284
## [76,] 4560.943 3037.6989 -5.305163 -1695.77210
## [77,] 5392.912 2438.3841 -1193.626985 -1857.67179
## [78,] 3531.689 4306.3501 1227.556471 -3094.01340
## [79,] 4346.823 2610.1308 217.209873 -1523.73571
## [80,] 5075.959 3588.4620 -1178.611166 -2111.69099
## [81,] 5657.704 2257.9737 -852.305740 -1275.53631
## [82,] 5297.532 3473.7820 -945.218546 -2118.32616
## [83,] 5922.476 2550.8942 -1740.476112 -1660.56175
## [84,] 5555.827 2432.2942 -697.919860 -1082.90934
## [85,] 4294.485 3881.5421 473.602476 -3250.24868
## [86,] 5101.565 823.5417 -685.120947 -149.26603
## [87,] 5664.219 2310.6621 -1014.566629 -1855.35639
## [88,] 4770.982 2930.4110 -287.380299 -1428.74358
## [89,] 5566.798 1226.1572 -1178.475337 735.50551
## [90,] 6492.016 1276.6900 -1939.513767 -503.45854
## [91,] 4025.423 5178.8178 1138.069516 -4338.72397
## [92,] 5560.737 1604.7377 -1032.489193 -231.93135
## [93,] 6775.873 1327.3113 -2235.524424 -718.36305
## [94,] 4537.051 2537.9022 -417.780129 -414.96508
## [95,] 3781.977 6379.4008 608.391311 -5067.90480
## [96,] 4203.278 4678.0730 -20.779065 -2695.17638
## [97,] 4707.904 4523.7602 -161.787273 -3538.76183
## [98,] 4766.788 3026.6078 -237.021037 -2089.60671
## [99,] 5124.884 3225.8283 -771.218905 -2358.44735
## [100,] 5949.855 2299.3612 -1490.959414 -388.86894
# Each gray line is one simulated draw of the treatment effect as a
# function of nodegr, i.e. beta_treat + beta_interaction * nodegr
plot(NA, NA, xlim=c(0, 1), ylim=c(-2000, 6000),
     xlab="nodegr", ylab="treatment effect",
     main="lalonde program treatment effect")
abline(h = 0, lwd=.5, lty=2)
for (i in 1:100) {
  abline(a = coef(lm.3.sim)[i,2], b = coef(lm.3.sim)[i,4],
         lwd = .5, col = "gray")
}
# Overlay the lm.3 point estimate in black
abline(a = lm.3$coefficients[2], b = lm.3$coefficients[4],
       lwd = 1, col = "black")
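Reading the interaction model directly, the implied treatment effect equals the “treat” coefficient plus the interaction coefficient when nodegr = 1; a quick check from the fitted coefficients:
# Implied treatment effects from lm.3: about +3192 for the degree group
# (nodegr = 0) and +3192 - 2038 = +1154 for the no-degree group
b <- coef(lm.3)
c(yesdegr = unname(b["treat"]),
  nodegr = unname(b["treat"] + b["I(treat * nodegr)"]))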
| Part 3 |
Unemployment is defined as having no income, so observations with “re78” equal to 0 are assigned “u78” equal to 1.
lalonde$u78 = 1*(lalonde$re78 == 0) # define u78 = 1 for those earned 0 income in 1978
subgroup.nodegr = subset(lalonde, nodegr == 1)
subgroup.yesdegr = subset(lalonde, nodegr == 0) # update the subsets to include u78
In the logistic regression, we can use all of the remaining variables as predictors for “u78” (“nodegr” is excluded because it is constant within each subgroup).
glm.nodegr <- glm(u78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=subgroup.nodegr, family=binomial)
summary(glm.nodegr)
##
## Call:
## glm(formula = u78 ~ age + educ + black + hisp + married + re74 +
## re75 + u74 + u75 + treat, family = binomial, data = subgroup.nodegr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2812 -0.9314 -0.7011 1.2992 2.0522
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.764e+00 1.201e+00 -2.301 0.0214 *
## age 1.448e-02 1.644e-02 0.881 0.3785
## educ 7.734e-02 8.270e-02 0.935 0.3496
## black 1.008e+00 6.478e-01 1.555 0.1199
## hisp -2.103e-01 8.039e-01 -0.262 0.7936
## married -2.172e-01 3.509e-01 -0.619 0.5360
## re74 -1.603e-05 3.664e-05 -0.437 0.6618
## re75 1.357e-05 5.990e-05 0.227 0.8208
## u74 -1.711e-01 4.600e-01 -0.372 0.7100
## u75 5.360e-01 4.102e-01 1.307 0.1913
## treat -4.510e-01 2.539e-01 -1.776 0.0757 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 435.76 on 347 degrees of freedom
## Residual deviance: 416.50 on 337 degrees of freedom
## AIC: 438.5
##
## Number of Fisher Scoring iterations: 4
confint(glm.nodegr, level=0.95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -5.214207e+00 -4.749612e-01
## age -1.812055e-02 4.664333e-02
## educ -8.160058e-02 2.442997e-01
## black -1.361705e-01 2.492155e+00
## hisp -1.763729e+00 1.489190e+00
## married -9.265025e-01 4.566575e-01
## re74 -9.692201e-05 5.175561e-05
## re75 -1.119409e-04 1.280040e-04
## u74 -1.081807e+00 7.331177e-01
## u75 -2.519523e-01 1.366016e+00
## treat -9.560608e-01 4.148172e-02
glm.yesdegr <- glm(u78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=subgroup.yesdegr, family=binomial)
summary(glm.yesdegr)
##
## Call:
## glm(formula = u78 ~ age + educ + black + hisp + married + re74 +
## re75 + u74 + u75 + treat, family = binomial, data = subgroup.yesdegr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9196 -0.7571 -0.4846 0.6185 1.9956
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.572e+00 5.814e+00 1.302 0.1928
## age -8.391e-02 5.764e-02 -1.456 0.1455
## educ -5.701e-01 4.793e-01 -1.189 0.2343
## black 1.387e+00 8.829e-01 1.571 0.1161
## hisp -1.502e+01 1.727e+03 -0.009 0.9931
## married 7.808e-01 7.351e-01 1.062 0.2882
## re74 8.751e-05 9.275e-05 0.944 0.3454
## re75 -5.160e-04 2.701e-04 -1.910 0.0561 .
## u74 1.536e+00 1.155e+00 1.330 0.1836
## u75 -2.083e+00 8.993e-01 -2.317 0.0205 *
## treat -5.754e-01 5.474e-01 -1.051 0.2932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 112.772 on 96 degrees of freedom
## Residual deviance: 91.099 on 86 degrees of freedom
## AIC: 113.1
##
## Number of Fisher Scoring iterations: 16
confint(glm.yesdegr, level=0.95)
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## (the two warnings above were emitted repeatedly during profiling)
## 2.5 % 97.5 %
## (Intercept) -2.320257e+00 2.124550e+01
## age -2.068435e-01 2.131394e-02
## educ -1.702414e+00 2.363576e-01
## black -1.606293e-01 3.440636e+00
## hisp -5.212624e+02 7.018307e+01
## married -6.904408e-01 2.247866e+00
## re74 -9.570820e-05 2.897849e-04
## re75 -1.164422e-03 -5.711014e-05
## u74 -6.677220e-01 3.920850e+00
## u75 -4.002096e+00 -4.001305e-01
## treat -1.671820e+00 4.966339e-01
The logistic regression produces a usable result, and we can use its fitted values to calculate the predicted treatment effect. Note, however, that the repeated warnings and the extreme “hisp” coefficient indicate quasi-complete separation in the small yes-degree subgroup, so those coefficient estimates should be read with caution.
u78.mean.nodegr.notreat.glm <- mean(subset(glm.nodegr$fitted.values, subgroup.nodegr$treat == 0))
u78.mean.nodegr.yestreat.glm <- mean(subset(glm.nodegr$fitted.values, subgroup.nodegr$treat == 1))
u78.nodegr.treateffect.glm <- u78.mean.nodegr.yestreat.glm - u78.mean.nodegr.notreat.glm
u78.nodegr.treateffect.glm
## [1] -0.09529672
u78.mean.yesdegr.notreat.glm <- mean(subset(glm.yesdegr$fitted.values, subgroup.yesdegr$treat == 0))
u78.mean.yesdegr.yestreat.glm <- mean(subset(glm.yesdegr$fitted.values, subgroup.yesdegr$treat == 1))
u78.yesdegr.treateffect.glm <- u78.mean.yesdegr.yestreat.glm - u78.mean.yesdegr.notreat.glm
u78.yesdegr.treateffect.glm
## [1] -0.1451335
The treatment effect (the decrease in the 1978 unemployment rate) is -0.09529672 for the no-degree subgroup and -0.1451335 for the yes-degree subgroup.
u78.mean.nodegr.notreat <- mean(subset(subgroup.nodegr$u78, subgroup.nodegr$treat == 0))
u78.mean.nodegr.yestreat <- mean(subset(subgroup.nodegr$u78, subgroup.nodegr$treat == 1))
u78.nodegr.treateffect <- u78.mean.nodegr.yestreat - u78.mean.nodegr.notreat
u78.nodegr.treateffect
## [1] -0.09529672
u78.mean.yesdegr.notreat <- mean(subset(subgroup.yesdegr$u78, subgroup.yesdegr$treat == 0))
u78.mean.yesdegr.yestreat <- mean(subset(subgroup.yesdegr$u78, subgroup.yesdegr$treat == 1))
u78.yesdegr.treateffect <- u78.mean.yesdegr.yestreat - u78.mean.yesdegr.notreat
u78.yesdegr.treateffect
## [1] -0.1451335
However, if we calculate the mean difference of “u78” using the actual data, we obtain exactly the same result as with the logistic regression model. This is expected: the maximum-likelihood score equations of a logistic regression that includes an intercept and the “treat” indicator force the average fitted probability to equal the observed unemployment rate within each treatment group. Regardless, the conclusion shows that the Lalonde treatment can decrease the unemployment rate, and it appears more effective for the yes-degree subgroup.
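Since arm is already loaded, one robustness check for the separation warnings above is to refit the yes-degree model with bayesglm, which uses weakly informative priors to stabilize the estimates; a minimal sketch, not part of the original analysis:
# Refit with weakly informative priors (arm::bayesglm) to tame the
# quasi-separation in the small yes-degree subgroup
glm.yesdegr.bayes <- bayesglm(u78 ~ age + educ + black + hisp + married +
                              re74 + re75 + u74 + u75 + treat,
                              data=subgroup.yesdegr, family=binomial)
summary(glm.yesdegr.bayes)$coefficients["treat", ]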
| Part 4 |
To start with, build a random forest using all variables for the entire set.
rf <- randomForest(re78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=lalonde, ntree=2000, importance=T)
varImpPlot(rf)
Variable “treat” is ranked in 8th place, meaning it is not a very significant contributor to the forest’s accuracy. Its positive %IncMSE value (the percent increase in out-of-bag MSE when “treat” is permuted) shows that “treat” does help prediction accuracy.
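The ranking can also be read numerically rather than off the plot, using randomForest’s importance() accessor (type = 1 returns the permutation importance, %IncMSE):
# %IncMSE: percent increase in out-of-bag MSE when a variable is permuted
sort(importance(rf, type=1)[, 1], decreasing=TRUE)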
rf.nodegr <- randomForest(re78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=subgroup.nodegr, ntree=2000, importance=T)
varImpPlot(rf.nodegr)
For the no-degree subset, “treat” is ranked in 9th place with a negative %IncMSE value, meaning that including “treat” actually harms the forest’s accuracy.
rf.yesdegr <- randomForest(re78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=subgroup.yesdegr, ntree=2000, importance=T)
varImpPlot(rf.yesdegr)
In the yes-degree subgroup, “treat” is ranked in 4th place with a positive %IncMSE value, providing evidence that “treat” has a positive impact on real income in 1978 for the yes-degree subgroup.
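Finally, as a cross-check not in the original analysis, a fitted forest can produce a rough treatment-effect estimate by predicting every observation with “treat” forced to 1 and to 0 and averaging the difference:
# Heuristic counterfactual estimate: mean difference between predicting
# everyone as treated versus untreated under the fitted forest
rf.effect <- function(model, data) {
  mean(predict(model, transform(data, treat = 1)) -
       predict(model, transform(data, treat = 0)))
}
rf.effect(rf.nodegr, subgroup.nodegr) # no-degree subgroup
rf.effect(rf.yesdegr, subgroup.yesdegr) # yes-degree subgroup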