Lalonde 3 Ways

Executive Summary: In this analysis, multiple linear regression, logistic regression, and random forests are used to determine whether the Lalonde treatment has a positive effect on real income in 1978. The observations are split into two subgroups, with and without a degree, to produce a sharper analysis. Across all three approaches, it is highly likely that the Lalonde treatment has a positive effect on income, and that it has a bigger impact on people with a degree.

Import Libraries & Data

library(Matching)
## Loading required package: MASS
## ## 
## ##  Matching (Version 4.9-2, Build Date: 2015-12-25)
## ##  See http://sekhon.berkeley.edu/matching for additional documentation.
## ##  Please cite software as:
## ##   Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## ##   Software with Automated Balance Optimization: The Matching package for R.''
## ##   Journal of Statistical Software, 42(7): 1-52. 
## ##
library(arm)
## Loading required package: Matrix
## Loading required package: lme4
## 
## arm (Version 1.9-3, built: 2016-11-21)
## Working directory is /Users/boyu/Lalonde 3 Ways
library(dismo)
## Loading required package: raster
## Loading required package: sp
## 
## Attaching package: 'raster'
## The following objects are masked from 'package:MASS':
## 
##     area, select
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
data(lalonde)
subgroup.nodegr <- subset(lalonde, nodegr == 1)
subgroup.yesdegr <- subset(lalonde, nodegr == 0)
Part 1

First, perform a linear regression using all variables and identify those with low p-values to build the actual linear model, since a lower p-value means higher confidence in rejecting the null hypothesis (that a change in “re78” is not associated with the predictor, i.e., that the predictor has no effect).

lm.1 <- lm(re78 ~ age + educ + black + hisp + married + nodegr + re74 + re75 + u74 + u75 + treat, data=lalonde)
summary(lm.1)
## 
## Call:
## lm(formula = re78 ~ age + educ + black + hisp + married + nodegr + 
##     re74 + re75 + u74 + u75 + treat, data = lalonde)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9612  -4355  -1572   3054  53119 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.567e+02  3.522e+03   0.073  0.94193   
## age          5.357e+01  4.581e+01   1.170  0.24284   
## educ         4.008e+02  2.288e+02   1.751  0.08058 . 
## black       -2.037e+03  1.174e+03  -1.736  0.08331 . 
## hisp         4.258e+02  1.565e+03   0.272  0.78562   
## married     -1.463e+02  8.823e+02  -0.166  0.86835   
## nodegr      -1.518e+01  1.006e+03  -0.015  0.98797   
## re74         1.234e-01  8.784e-02   1.405  0.16079   
## re75         1.974e-02  1.503e-01   0.131  0.89554   
## u74          1.380e+03  1.188e+03   1.162  0.24590   
## u75         -1.071e+03  1.025e+03  -1.045  0.29651   
## treat        1.671e+03  6.411e+02   2.606  0.00948 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6517 on 433 degrees of freedom
## Multiple R-squared:  0.05822,    Adjusted R-squared:  0.0343 
## F-statistic: 2.433 on 11 and 433 DF,  p-value: 0.005974

The result shows that “treat” is a strong predictor of “re78”. To include at least three variables in the linear regression model, we will use p < 0.1 as the cut-off for including a variable. The new linear regression will therefore use “educ”, “black”, and “treat” as independent variables to predict the dependent variable “re78”.

lm.2 <- lm(re78 ~ educ + black + treat, data=lalonde)
summary(lm.2)
## 
## Call:
## lm(formula = re78 ~ educ + black + treat, data = lalonde)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9298  -4557  -1687   3256  54029 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   2156.3     1889.1   1.141  0.25431   
## educ           416.9      172.9   2.412  0.01629 * 
## black        -2185.4      829.2  -2.636  0.00869 **
## treat         1722.7      627.4   2.746  0.00628 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6505 on 441 degrees of freedom
## Multiple R-squared:  0.04432,    Adjusted R-squared:  0.03782 
## F-statistic: 6.817 on 3 and 441 DF,  p-value: 0.0001686

The new regression model shows that all three variables are strong predictors of “re78”. Now, we will apply this linear regression model to predict “re78” in the no-degree and yes-degree subgroups.

lm.nodegr.1 <- lm(re78 ~ educ + black + treat, data=subgroup.nodegr)
summary(lm.nodegr.1)
## 
## Call:
## lm(formula = re78 ~ educ + black + treat, data = subgroup.nodegr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7024  -4184  -1625   3193  54590 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3966.3     2255.6   1.758   0.0796 .
## educ           253.9      223.1   1.138   0.2559  
## black        -2321.4      924.0  -2.512   0.0125 *
## treat         1280.4      693.7   1.846   0.0658 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6250 on 344 degrees of freedom
## Multiple R-squared:  0.02794,    Adjusted R-squared:  0.01947 
## F-statistic: 3.296 on 3 and 344 DF,  p-value: 0.02069
confint(lm.nodegr.1, level=0.95)
##                   2.5 %    97.5 %
## (Intercept)  -470.08502 8402.7662
## educ         -184.92403  692.6958
## black       -4138.72294 -503.9788
## treat         -84.10616 2644.8608
lm.yesdegr.1 <- lm(re78 ~ educ + black + treat, data=subgroup.yesdegr)
summary(lm.yesdegr.1)
## 
## Call:
## lm(formula = re78 ~ educ + black + treat, data = subgroup.yesdegr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12561  -4322  -2229   3928  25896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -18009      12412  -1.451   0.1502  
## educ            1987       1017   1.954   0.0538 .
## black          -1713       1886  -0.908   0.3663  
## treat           2658       1522   1.746   0.0842 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7337 on 93 degrees of freedom
## Multiple R-squared:  0.08704,    Adjusted R-squared:  0.05759 
## F-statistic: 2.956 on 3 and 93 DF,  p-value: 0.0365
confint(lm.yesdegr.1, level=0.95)
##                    2.5 %   97.5 %
## (Intercept) -42657.23808 6639.911
## educ           -32.80737 4006.376
## black        -5458.89617 2033.411
## treat         -365.51492 5681.145

According to the linear regression model, the treatment effect for the no-degree subgroup is +1280.4, with a 95% confidence interval from -84.1 to +2644.9; the treatment effect for the yes-degree subgroup is +2658.0, with a 95% confidence interval from -365.5 to +5681.1. Because we are using a linear regression model, the model itself assumes a constant treatment effect within each subgroup. The result indicates that the Lalonde treatment has a positive treatment effect for both subgroups, with the effect on the yes-degree subgroup roughly twice that on the no-degree subgroup. However, it is worth noting that “treat” is a weaker predictor in the yes-degree subgroup, giving it a wider confidence interval. The lower bound of the 95% confidence interval for the yes-degree subgroup lies below the lower bound of the interval for the no-degree subgroup, meaning there is more than a 2.5% chance that the treatment effect for the yes-degree subgroup is actually lower than that for the no-degree subgroup, which weakens the conclusion that the treatment is more effective for the yes-degree subgroup.
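
To make the subgroup comparison explicit, the two “treat” estimates can be contrasted directly. The sketch below is an assumed addition (not part of the original analysis); it pulls the estimates and standard errors from the two subgroup models above and forms an approximate normal-theory 95% confidence interval for their difference.

# assumed sketch: difference between the yes-degree and no-degree treatment effects
est.nodegr  <- coef(summary(lm.nodegr.1))["treat", c("Estimate", "Std. Error")]
est.yesdegr <- coef(summary(lm.yesdegr.1))["treat", c("Estimate", "Std. Error")]
diff.est <- est.yesdegr["Estimate"] - est.nodegr["Estimate"]               # yes-degree minus no-degree
diff.se  <- sqrt(est.nodegr["Std. Error"]^2 + est.yesdegr["Std. Error"]^2) # SE of the difference (independent subgroups)
diff.est + c(-1, 1) * 1.96 * diff.se                                       # approximate 95% CI for the difference

If this interval includes 0, the data do not clearly distinguish the two subgroup effects, which is consistent with the caveat above.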

Cross-Validation: In an attempt to improve the model accuracy for the yes-degree subgroup, we can try cross-validation.

k <- 10
folds <- kfold(subgroup.yesdegr, k)   # random assignment of observations to k folds
min.SE.kfold <- Inf
for (i in 1:k)
{
  set.train <- subgroup.yesdegr[which(folds!=i),]
  set.test <- subgroup.yesdegr[which(folds==i),]
  lm.yesdegr.kfold <- lm(re78 ~ educ + black + treat, data=set.train)
  predict.kfold <- predict(lm.yesdegr.kfold, set.test)
  SE.kfold = sqrt(sum((predict.kfold-set.test$re78)^2)/nrow(set.test))   # fold RMSE (nrow, not length, counts observations)
  if (SE.kfold < min.SE.kfold)
  {
    min.SE.kfold = SE.kfold
    min.lm.yesdegr.kfold = lm.yesdegr.kfold   # keep the fit with the lowest fold RMSE
  }
}
summary(min.lm.yesdegr.kfold)
## 
## Call:
## lm(formula = re78 ~ educ + black + treat, data = set.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12033  -4407  -2516   3759  26309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -16412      13160  -1.247   0.2159  
## educ            1930       1075   1.795   0.0763 .
## black          -2336       2079  -1.124   0.2645  
## treat           2073       1677   1.236   0.2200  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7650 on 83 degrees of freedom
## Multiple R-squared:  0.07384,    Adjusted R-squared:  0.04037 
## F-statistic: 2.206 on 3 and 83 DF,  p-value: 0.09349
confint(min.lm.yesdegr.kfold, level=0.95)
##                   2.5 %   97.5 %
## (Intercept) -42586.1380 9762.477
## educ          -208.8395 4067.855
## black        -6470.5956 1799.217
## treat        -1262.8732 5408.263

The k-fold cross-validation method with k=10 sometimes shows an improvement in the p-value, but most of the time it results in a higher p-value.
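
As an alternative to keeping only the best fold, the minimal sketch below (an assumed addition, not from the original analysis) averages the fold RMSEs, which is the more common way to summarize k-fold cross-validation; setting a seed makes the fold assignment reproducible.

set.seed(123)                               # assumed seed, for reproducibility only
folds.cv <- kfold(subgroup.yesdegr, 10)
rmse.cv <- sapply(1:10, function(i) {
  set.train <- subgroup.yesdegr[folds.cv != i, ]
  set.test  <- subgroup.yesdegr[folds.cv == i, ]
  fit <- lm(re78 ~ educ + black + treat, data = set.train)
  sqrt(mean((predict(fit, set.test) - set.test$re78)^2))   # fold RMSE
})
mean(rmse.cv)                               # cross-validated RMSE for the model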

k <- nrow(subgroup.yesdegr)   # leave-one-out: one fold per observation
folds <- kfold(subgroup.yesdegr, k)
min.SE.kfold <- Inf
for (i in 1:k)
{
  set.train <- subgroup.yesdegr[which(folds!=i),]
  set.test <- subgroup.yesdegr[which(folds==i),]
  lm.yesdegr.kfold <- lm(re78 ~ educ + black + treat, data=set.train)
  predict.kfold <- predict(lm.yesdegr.kfold, set.test)
  SE.kfold = sqrt(sum((predict.kfold-set.test$re78)^2)/nrow(set.test))   # RMSE on the held-out observation
  if (SE.kfold < min.SE.kfold)
  {
    min.SE.kfold = SE.kfold
    min.lm.yesdegr.kfold = lm.yesdegr.kfold
  }
}
summary(min.lm.yesdegr.kfold)
## 
## Call:
## lm(formula = re78 ~ educ + black + treat, data = set.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12562  -4348  -2273   3938  25895 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -18012      12490  -1.442   0.1527  
## educ            1987       1023   1.942   0.0552 .
## black          -1713       1899  -0.902   0.3694  
## treat           2659       1539   1.727   0.0875 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7376 on 92 degrees of freedom
## Multiple R-squared:  0.08601,    Adjusted R-squared:  0.05621 
## F-statistic: 2.886 on 3 and 92 DF,  p-value: 0.03986
confint(min.lm.yesdegr.kfold, level=0.95)
##                    2.5 %   97.5 %
## (Intercept) -42818.81731 6795.777
## educ           -45.03633 4018.988
## black        -5485.23931 2058.686
## treat         -398.69239 5716.114

Using LOOCV, the special case of k-fold cross-validation where k equals the number of observations, produces almost the same result as the linear regression on the entire yes-degree subset.

Part 2

In the new linear regression model, only the interaction term between “treat” and “nodegr” is included.

lm.2 <- lm(re78 ~ I(treat*nodegr), data=lalonde)
summary(lm.2)
## 
## Call:
## lm(formula = re78 ~ I(treat * nodegr), data = lalonde)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5649  -5155  -1632   2797  54658 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5155.3      374.4  13.768   <2e-16 ***
## I(treat * nodegr)    494.2      690.1   0.716    0.474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6635 on 443 degrees of freedom
## Multiple R-squared:  0.001156,   Adjusted R-squared:  -0.001099 
## F-statistic: 0.5127 on 1 and 443 DF,  p-value: 0.4743

The result shows that using the interaction term alone is not a good predictor. The interaction term is the product of two boolean variables, so it only indicates whether an observation is both treated and without a degree versus any other combination; used alone, it actually loses information. To give the model predictive power, we have to include the “treat” and “nodegr” main effects along with the interaction term.
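
As a quick illustration (an assumed addition, not from the original analysis), the product of the two indicators takes only two values, so the four treat/nodegr combinations collapse into two groups:

# each distinct (treat, nodegr) pair and the value of their product
unique(with(lalonde, data.frame(treat, nodegr, product = treat * nodegr)))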

lm.3 <- lm(re78 ~ treat + nodegr + I(treat*nodegr), data=lalonde)
summary(lm.3)
## 
## Call:
## lm(formula = re78 ~ treat + nodegr + I(treat * nodegr), data = lalonde)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8047  -4495  -1604   2758  54658 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4854.5      999.7   4.856 1.67e-06 ***
## treat               3192.0     1339.9   2.382   0.0176 *  
## nodegr              -359.1     1094.3  -0.328   0.7430    
## I(treat * nodegr)  -2038.0     1523.7  -1.338   0.1817    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6556 on 441 degrees of freedom
## Multiple R-squared:  0.02931,    Adjusted R-squared:  0.02271 
## F-statistic: 4.439 on 3 and 441 DF,  p-value: 0.004354
confint(lm.3, level=0.95)
##                        2.5 %    97.5 %
## (Intercept)        2889.6389 6819.3497
## treat               558.6096 5825.4429
## nodegr            -2509.8155 1791.6595
## I(treat * nodegr) -5032.4964  956.5392
lm.3.sim <- sim (lm.3)
coef(lm.3.sim)
##        (Intercept)     treat       nodegr I(treat * nodegr)
##   [1,]    3748.104 5351.2813   736.743236       -5378.07576
##   [2,]    3526.192 6398.8687  1053.344193       -5202.31041
##   [3,]    6788.821 1453.9186 -2391.660911        -722.87928
##   [4,]    3525.700 2907.4501   635.748024       -1039.80685
##   [5,]    4465.068 4584.7854  -155.878046       -2706.36922
##   [6,]    6959.752 1186.4968 -3019.978770        -436.11642
##   [7,]    4649.293 3060.0280   379.533590       -2497.07767
##   [8,]    4124.861 2683.3189  -115.905758        -542.27964
##   [9,]    3031.728 5757.4293  1600.781735       -4615.63925
##  [10,]    4461.122 5262.0412  -369.679559       -4824.72692
##  [11,]    5733.171 1541.8889  -736.231449         602.78724
##  [12,]    4362.072 3324.3517   313.732214       -3273.58919
##  [13,]    2654.806 5819.4192  1627.236758       -5331.11069
##  [14,]    5807.255 2899.2927  -355.184350       -3102.58866
##  [15,]    4259.919 4068.4599   181.496667       -3445.54466
##  [16,]    5767.489 2147.9492 -1523.349639         -22.48511
##  [17,]    3543.985 4703.1270  1224.635686       -3727.69726
##  [18,]    2981.154 4038.9315  1143.332457       -2595.82182
##  [19,]    4566.388 3335.0551  -812.775556       -2326.21611
##  [20,]    4743.684 2307.1768  -950.061799        -718.21607
##  [21,]    4536.748 4208.9174  -377.972082       -2995.04251
##  [22,]    3871.496 4050.4154   106.820004       -1551.43446
##  [23,]    4770.045 3151.8609 -1101.804771         -19.58108
##  [24,]    5585.021 1866.9594 -1015.451598         167.94056
##  [25,]    6148.153  149.5457 -1181.461164         532.43852
##  [26,]    5869.604  874.2713 -1486.525995         285.11752
##  [27,]    4314.127 3333.7161   113.941696       -2533.46073
##  [28,]    5697.671 3039.8609  -639.382055       -2670.23617
##  [29,]    6138.416 1162.6184 -1512.905911        -411.84145
##  [30,]    4424.260 2767.1121  -835.402692        -659.56251
##  [31,]    4380.719 3780.9076  -384.980851       -2498.70446
##  [32,]    5310.851 2205.9855 -1802.052384         358.54494
##  [33,]    6001.464 2309.3290 -1698.103721        -712.65868
##  [34,]    4303.009 3761.3179   447.340651       -2914.55632
##  [35,]    4865.485 3319.9895   193.468428       -3097.93835
##  [36,]    4326.193 3359.9517   779.826142       -1786.14368
##  [37,]    4908.996 3217.1872  -537.457596       -2079.45144
##  [38,]    4308.512 5746.6374   384.358360       -5579.36964
##  [39,]    4355.833 3099.4406  -488.953079        -422.99023
##  [40,]    5255.217 3360.6287  -739.485273       -3226.80314
##  [41,]    3592.351 3851.8921   426.106498       -2558.37522
##  [42,]    5029.167 4151.4882  -614.109407       -3326.17943
##  [43,]    3797.430 6242.4500   832.842252       -5129.76548
##  [44,]    4695.429 3818.1941   227.253408       -3570.63073
##  [45,]    5012.451 3468.4919  -459.526253       -2212.45105
##  [46,]    5377.957  990.1739  -579.104439         424.26957
##  [47,]    4901.990 1706.9350   768.550859       -2361.07576
##  [48,]    3885.321 4204.0514  1098.405200       -3906.18132
##  [49,]    6151.073 1620.8842 -1982.062907       -1016.78457
##  [50,]    4243.985 4426.7293   341.425954       -3086.64195
##  [51,]    5980.761 1538.9047 -1609.287174         -53.23908
##  [52,]    5581.975 1632.3138 -1008.470377        -661.99156
##  [53,]    2563.554 4754.8608  1666.005927       -3243.88877
##  [54,]    5060.823 1680.2418 -1152.664338        -282.73728
##  [55,]    4086.597 5137.0857   702.882288       -5019.03167
##  [56,]    4780.719 2941.3381  -386.304526       -2080.79393
##  [57,]    5623.833 2223.7513  -967.349438       -1746.87304
##  [58,]    4313.205 4154.3348  1153.248538       -3897.91865
##  [59,]    2773.006 6325.3883  1781.898794       -4853.22380
##  [60,]    3927.060 3041.7025   216.448710       -1792.05206
##  [61,]    5984.314 1881.1746 -1620.473602       -2041.70703
##  [62,]    6194.295 2817.6214 -2145.177773       -1010.46776
##  [63,]    4769.592 2324.6299  -767.580159       -1610.34176
##  [64,]    5972.558  503.5306 -1947.046532         966.05409
##  [65,]    3770.502 3847.5892   579.570894       -3245.29353
##  [66,]    4767.258 2088.3519 -1193.901122         572.51225
##  [67,]    2701.364 4262.2475  1540.349117       -2557.28104
##  [68,]    6271.948 2304.0385 -1561.255882        -883.09319
##  [69,]    4108.274 4427.3066   632.500701       -4140.45200
##  [70,]    5077.692 3304.4998   733.179455       -4514.81655
##  [71,]    4439.643 3769.0426  -392.326310       -1782.30499
##  [72,]    5947.602 1926.5406  -681.745739       -1454.91364
##  [73,]    4041.980 3358.3917   -87.331424        -951.18015
##  [74,]    5105.997 2468.6939  -658.312688       -1035.86563
##  [75,]    4769.497 2848.3915  -555.459562       -2169.95284
##  [76,]    4560.943 3037.6989    -5.305163       -1695.77210
##  [77,]    5392.912 2438.3841 -1193.626985       -1857.67179
##  [78,]    3531.689 4306.3501  1227.556471       -3094.01340
##  [79,]    4346.823 2610.1308   217.209873       -1523.73571
##  [80,]    5075.959 3588.4620 -1178.611166       -2111.69099
##  [81,]    5657.704 2257.9737  -852.305740       -1275.53631
##  [82,]    5297.532 3473.7820  -945.218546       -2118.32616
##  [83,]    5922.476 2550.8942 -1740.476112       -1660.56175
##  [84,]    5555.827 2432.2942  -697.919860       -1082.90934
##  [85,]    4294.485 3881.5421   473.602476       -3250.24868
##  [86,]    5101.565  823.5417  -685.120947        -149.26603
##  [87,]    5664.219 2310.6621 -1014.566629       -1855.35639
##  [88,]    4770.982 2930.4110  -287.380299       -1428.74358
##  [89,]    5566.798 1226.1572 -1178.475337         735.50551
##  [90,]    6492.016 1276.6900 -1939.513767        -503.45854
##  [91,]    4025.423 5178.8178  1138.069516       -4338.72397
##  [92,]    5560.737 1604.7377 -1032.489193        -231.93135
##  [93,]    6775.873 1327.3113 -2235.524424        -718.36305
##  [94,]    4537.051 2537.9022  -417.780129        -414.96508
##  [95,]    3781.977 6379.4008   608.391311       -5067.90480
##  [96,]    4203.278 4678.0730   -20.779065       -2695.17638
##  [97,]    4707.904 4523.7602  -161.787273       -3538.76183
##  [98,]    4766.788 3026.6078  -237.021037       -2089.60671
##  [99,]    5124.884 3225.8283  -771.218905       -2358.44735
## [100,]    5949.855 2299.3612 -1490.959414        -388.86894
plot (NA, NA, xlim=c(0, 1), ylim=c(-2000,6000),
      xlab="nodegr", ylab="treatment effect",
      main="lalonde program treatment effect")
abline (h = 0, lwd=.5, lty=2)
for (i in 1:100) {
  abline (a = coef(lm.3.sim)[i,2], b = coef(lm.3.sim)[i,4],
          lwd = .5, col = "gray")
}
abline (a = lm.3$coefficients[2], b = lm.3$coefficients[4],
        lwd = 1, col = "black")
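
Beyond the plot, the simulation draws can be summarized numerically. The sketch below is an assumed addition: it uses the coefficient draws from lm.3.sim to approximate the treatment-effect distribution in each subgroup (for nodegr = 1 the effect is the “treat” coefficient plus the interaction coefficient).

te.yesdegr <- coef(lm.3.sim)[, "treat"]                                           # effect when nodegr = 0
te.nodegr  <- coef(lm.3.sim)[, "treat"] + coef(lm.3.sim)[, "I(treat * nodegr)"]   # effect when nodegr = 1
quantile(te.yesdegr, c(0.025, 0.5, 0.975))
quantile(te.nodegr,  c(0.025, 0.5, 0.975))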

Part 3

Unemployment is defined as having no income, so observations where the variable “re78” equals 0 will have the feature “u78” set to 1.

lalonde$u78 = 1*(lalonde$re78 == 0)   # define u78 = 1 for those earned 0 income in 1978
subgroup.nodegr = subset(lalonde, nodegr == 1)
subgroup.yesdegr = subset(lalonde, nodegr == 0)   # update the subsets to include u78

In the logistic regression, we can use all the variables as predictors for “u78”.

glm.nodegr <- glm(u78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=subgroup.nodegr, family=binomial)
summary(glm.nodegr)
## 
## Call:
## glm(formula = u78 ~ age + educ + black + hisp + married + re74 + 
##     re75 + u74 + u75 + treat, family = binomial, data = subgroup.nodegr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2812  -0.9314  -0.7011   1.2992   2.0522  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.764e+00  1.201e+00  -2.301   0.0214 *
## age          1.448e-02  1.644e-02   0.881   0.3785  
## educ         7.734e-02  8.270e-02   0.935   0.3496  
## black        1.008e+00  6.478e-01   1.555   0.1199  
## hisp        -2.103e-01  8.039e-01  -0.262   0.7936  
## married     -2.172e-01  3.509e-01  -0.619   0.5360  
## re74        -1.603e-05  3.664e-05  -0.437   0.6618  
## re75         1.357e-05  5.990e-05   0.227   0.8208  
## u74         -1.711e-01  4.600e-01  -0.372   0.7100  
## u75          5.360e-01  4.102e-01   1.307   0.1913  
## treat       -4.510e-01  2.539e-01  -1.776   0.0757 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 435.76  on 347  degrees of freedom
## Residual deviance: 416.50  on 337  degrees of freedom
## AIC: 438.5
## 
## Number of Fisher Scoring iterations: 4
confint(glm.nodegr, level=0.95)
## Waiting for profiling to be done...
##                     2.5 %        97.5 %
## (Intercept) -5.214207e+00 -4.749612e-01
## age         -1.812055e-02  4.664333e-02
## educ        -8.160058e-02  2.442997e-01
## black       -1.361705e-01  2.492155e+00
## hisp        -1.763729e+00  1.489190e+00
## married     -9.265025e-01  4.566575e-01
## re74        -9.692201e-05  5.175561e-05
## re75        -1.119409e-04  1.280040e-04
## u74         -1.081807e+00  7.331177e-01
## u75         -2.519523e-01  1.366016e+00
## treat       -9.560608e-01  4.148172e-02
glm.yesdegr <- glm(u78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=subgroup.yesdegr, family=binomial)
summary(glm.yesdegr)
## 
## Call:
## glm(formula = u78 ~ age + educ + black + hisp + married + re74 + 
##     re75 + u74 + u75 + treat, family = binomial, data = subgroup.yesdegr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9196  -0.7571  -0.4846   0.6185   1.9956  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  7.572e+00  5.814e+00   1.302   0.1928  
## age         -8.391e-02  5.764e-02  -1.456   0.1455  
## educ        -5.701e-01  4.793e-01  -1.189   0.2343  
## black        1.387e+00  8.829e-01   1.571   0.1161  
## hisp        -1.502e+01  1.727e+03  -0.009   0.9931  
## married      7.808e-01  7.351e-01   1.062   0.2882  
## re74         8.751e-05  9.275e-05   0.944   0.3454  
## re75        -5.160e-04  2.701e-04  -1.910   0.0561 .
## u74          1.536e+00  1.155e+00   1.330   0.1836  
## u75         -2.083e+00  8.993e-01  -2.317   0.0205 *
## treat       -5.754e-01  5.474e-01  -1.051   0.2932  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 112.772  on 96  degrees of freedom
## Residual deviance:  91.099  on 86  degrees of freedom
## AIC: 113.1
## 
## Number of Fisher Scoring iterations: 16
confint(glm.yesdegr, level=0.95)
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
##                     2.5 %        97.5 %
## (Intercept) -2.320257e+00  2.124550e+01
## age         -2.068435e-01  2.131394e-02
## educ        -1.702414e+00  2.363576e-01
## black       -1.606293e-01  3.440636e+00
## hisp        -5.212624e+02  7.018307e+01
## married     -6.904408e-01  2.247866e+00
## re74        -9.570820e-05  2.897849e-04
## re75        -1.164422e-03 -5.711014e-05
## u74         -6.677220e-01  3.920850e+00
## u75         -4.002096e+00 -4.001305e-01
## treat       -1.671820e+00  4.966339e-01

The logistic regression produces a reasonable result, and we can use its fitted values to calculate the predicted treatment effect.

u78.mean.nodegr.notreat.glm <- mean(subset(glm.nodegr$fitted.values, subgroup.nodegr$treat == 0))
u78.mean.nodegr.yestreat.glm <- mean(subset(glm.nodegr$fitted.values, subgroup.nodegr$treat == 1))
u78.nodegr.treateffect.glm <- u78.mean.nodegr.yestreat.glm - u78.mean.nodegr.notreat.glm
u78.nodegr.treateffect.glm
## [1] -0.09529672
u78.mean.yesdegr.notreat.glm <- mean(subset(glm.yesdegr$fitted.values, subgroup.yesdegr$treat == 0))
u78.mean.yesdegr.yestreat.glm <- mean(subset(glm.yesdegr$fitted.values, subgroup.yesdegr$treat == 1))
u78.yesdegr.treateffect.glm <- u78.mean.yesdegr.yestreat.glm - u78.mean.yesdegr.notreat.glm
u78.yesdegr.treateffect.glm
## [1] -0.1451335

The treatment effect (the decrease in the unemployment rate in 1978) is -0.09529672 for the no-degree subgroup and -0.1451335 for the yes-degree subgroup.

u78.mean.nodegr.notreat <- mean(subset(subgroup.nodegr$u78, subgroup.nodegr$treat == 0))
u78.mean.nodegr.yestreat <- mean(subset(subgroup.nodegr$u78, subgroup.nodegr$treat == 1))
u78.nodegr.treateffect <- u78.mean.nodegr.yestreat - u78.mean.nodegr.notreat
u78.nodegr.treateffect
## [1] -0.09529672
u78.mean.yesdegr.notreat <- mean(subset(subgroup.yesdegr$u78, subgroup.yesdegr$treat == 0))
u78.mean.yesdegr.yestreat <- mean(subset(subgroup.yesdegr$u78, subgroup.yesdegr$treat == 1))
u78.yesdegr.treateffect <- u78.mean.yesdegr.yestreat - u78.mean.yesdegr.notreat
u78.yesdegr.treateffect
## [1] -0.1451335

However, if we calculate the mean difference in “u78” directly from the actual data, we obtain exactly the same result as with the logistic regression model. This is expected: because “treat” (and an intercept) are included in the model, the maximum-likelihood score equations force the fitted probabilities to average to the observed “u78” rate within each treatment group, so the model-based and raw differences in means coincide. Regardless, the conclusion shows that the Lalonde treatment can decrease the unemployment rate, and that it is more effective for the yes-degree subgroup.
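
A minimal check of this property (an assumed addition, not from the original analysis): the within-group means of the residuals are numerically zero because “treat” is a regressor in each fitted model.

# average of (fitted probability - observed u78) within each treatment arm;
# both values are numerically zero, so model-based and raw mean differences agree
tapply(glm.nodegr$fitted.values - subgroup.nodegr$u78, subgroup.nodegr$treat, mean)
tapply(glm.yesdegr$fitted.values - subgroup.yesdegr$u78, subgroup.yesdegr$treat, mean)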

Part 4

To start with, build a random forest using all variables for the entire set.

rf <- randomForest(re78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=lalonde, ntree=2000, importance=T)
varImpPlot(rf)

Variable “treat” is ranked in 8th place, meaning it is not a very significant contributor to the tree accuracy. However, its positive %IncMSE value (the permutation-based mean decrease in accuracy) shows that “treat” does help increase the tree accuracy.
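
The ranking read off the plot can also be inspected numerically; this short sketch is an assumed addition, not part of the original output.

# permutation importance (%IncMSE) and node-purity importance for each variable, sorted by %IncMSE
imp <- importance(rf)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]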

rf.nodegr <- randomForest(re78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=subgroup.nodegr, ntree=2000, importance=T)
varImpPlot(rf.nodegr)

For the no-degree subset, “treat” is ranked in 9th place with a negative %IncMSE value, meaning that adding “treat” actually harms the tree accuracy.

rf.yesdegr <- randomForest(re78 ~ age + educ + black + hisp + married + re74 + re75 + u74 + u75 + treat, data=subgroup.yesdegr, ntree=2000, importance=T)
varImpPlot(rf.yesdegr)

In the yes-degree subgroup, “treat” is ranked in 4th place and its %IncMSE value is positive, providing evidence that “treat” has a positive impact on real income in 1978 for the yes-degree subgroup.
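
To attach a rough number to this, the sketch below (an assumed addition, not from the original analysis) predicts each yes-degree observation with “treat” set to 1 and to 0 using the fitted forest, and averages the difference.

df.treat1 <- subgroup.yesdegr; df.treat1$treat <- 1   # counterfactual: everyone treated
df.treat0 <- subgroup.yesdegr; df.treat0$treat <- 0   # counterfactual: no one treated
mean(predict(rf.yesdegr, df.treat1) - predict(rf.yesdegr, df.treat0))   # average predicted effect on re78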