Income Group Classification

Preliminary Data Inspection

Variable Description

Categorical Variable Analysis

Race & Native Country

                Var1  Freq
1              White 27799
2              Black  3121
3 Asian-Pac-Islander  1038
4 Amer-Indian-Eskimo   311
5              Other   271

                 Var1  Freq
1       United-States 29150
2              Mexico   643
3                   ?   583
4         Philippines   197
5             Germany   137
6              Canada   121
7         Puerto-Rico   114
8         El-Salvador   106
9               India   100
10               Cuba    95
11            England    90
12            Jamaica    81
13              South    80
14              China    75
15              Italy    73
16 Dominican-Republic    70
17            Vietnam    67
18          Guatemala    64
19              Japan    62
20             Poland    60
21           Columbia    59
22             Taiwan    51
23              Haiti    44
24               Iran    43
25           Portugal    37
26          Nicaragua    34
27               Peru    31
28             France    29
29             Greece    29
30            Ecuador    28
31            Ireland    24
32               Hong    20
33           Cambodia    19
34    Trinadad&Tobago    19
35               Laos    18
36           Thailand    18
37         Yugoslavia    16
 [ reached 'max' / getOption("max.print") -- omitted 5 rows ]
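
The `Var1` and `Freq` column names in the two tables above are what base R produces when a one-way `table()` is converted with `as.data.frame()`. A minimal sketch of how such a sorted frequency table could be built is shown here; the `adult_toy` data frame and its `race` column are hypothetical stand-ins, since the report does not show the actual data object or column names.

```r
# Hypothetical toy data; the real data frame is not shown in this report.
adult_toy <- data.frame(
  race = c("White", "Black", "White", "Other", "White", "Black"),
  stringsAsFactors = FALSE
)

# Count each level; as.data.frame() on a one-way table yields columns Var1 and Freq.
freq <- as.data.frame(table(adult_toy$race), stringsAsFactors = FALSE)

# Sort by descending frequency and renumber the rows.
freq <- freq[order(-freq$Freq), ]
rownames(freq) <- NULL

print(freq)
```

Sorting by frequency makes the skew easy to see: in the tables above, one level dominates each variable (White for race, United-States for native country), and `?` marks missing native-country values.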

Sex

Modeling

Random Forest

Model


Call:
 randomForest(x = traindata[, -15], y = traindata$class, ntree = 300) 
               Type of random forest: classification
                     Number of trees: 300
No. of variables tried at each split: 3

        OOB estimate of  error rate: 13.69%
Confusion matrix:
      <=50K >50K class.error
<=50K 18438 1322  0.06690283
>50K   2241 4032  0.35724534
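
The OOB error rate and the `class.error` column can be recomputed directly from the confusion matrix counts printed above. The sketch below only re-derives those figures in base R; the variable names are illustrative and not taken from the original analysis.

```r
# OOB confusion matrix copied from the output above (rows = true class).
oob <- matrix(c(18438, 1322,
                 2241, 4032),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual    = c("<=50K", ">50K"),
                              predicted = c("<=50K", ">50K")))

# Per-class error: misclassified cases in a row divided by the row total.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all misclassified cases divided by all cases.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 4))
print(round(100 * oob_error, 2))
```

The per-class figures show where the model struggles: roughly 36% of the `>50K` cases are misclassified, against about 7% of the `<=50K` cases.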

Feature Importance

Confusion Matrix & ROC Curve

Confusion Matrix and Statistics

          Reference
Prediction <=50K >50K
     <=50K  4650  577
     >50K    289  991
                                          
               Accuracy : 0.8669          
                 95% CI : (0.8584, 0.8751)
    No Information Rate : 0.759           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6119          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9415          
            Specificity : 0.6320          
         Pos Pred Value : 0.8896          
         Neg Pred Value : 0.7742          
             Prevalence : 0.7590          
         Detection Rate : 0.7146          
   Detection Prevalence : 0.8033          
      Balanced Accuracy : 0.7868          
                                          
       'Positive' Class : <=50K           
                                          

Model Improvement


Call:
 randomForest(x = traindata2[, -15], y = traindata2$class, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 16.85%
Confusion matrix:
      <=50K >50K class.error
<=50K  4927 1346   0.2145704
>50K    768 5505   0.1224295
Confusion Matrix and Statistics

          Reference
Prediction <=50K >50K
     <=50K  1207  187
     >50K    361 1381
                                          
               Accuracy : 0.8253          
                 95% CI : (0.8115, 0.8384)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6505          
                                          
 Mcnemar's Test P-Value : 1.466e-13       
                                          
            Sensitivity : 0.7698          
            Specificity : 0.8807          
         Pos Pred Value : 0.8659          
         Neg Pred Value : 0.7928          
             Prevalence : 0.5000          
         Detection Rate : 0.3849          
   Detection Prevalence : 0.4445          
      Balanced Accuracy : 0.8253          
                                          
       'Positive' Class : <=50K           
                                          

Tune RF

mtry = 3  OOB error = 16.83% 
Searching left ...
mtry = 6    OOB error = 17.87% 
-0.0620559 0.05 
Searching right ...
mtry = 1    OOB error = 16.98% 
-0.009000474 0.05 
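
The pairs of numbers printed after each trial appear to be the relative improvement in OOB error over the current best, followed by the minimum improvement (0.05) required to keep searching. A small sketch of that arithmetic, using the rounded percentages from the trace (so the results only approximate the printed values):

```r
# OOB errors in percent, as printed in the trace above (rounded).
baseline <- 16.83                        # starting point, mtry = 3
trial    <- c(mtry6 = 17.87, mtry1 = 16.98)

# Relative improvement over the baseline; negative means the trial was worse.
improvement <- 1 - trial / baseline

# The search continues in a direction only if the improvement exceeds the threshold.
threshold      <- 0.05
keep_searching <- improvement > threshold

print(round(improvement, 4))
print(keep_searching)
```

Both trials are worse than the starting value, so the search stops in each direction and `mtry = 3` remains the best setting found.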

XGBoost

[1] 0.8383806
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1294  192
         1  315 1336
                                         
               Accuracy : 0.8384         
                 95% CI : (0.825, 0.8511)
    No Information Rate : 0.5129         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.6772         
                                         
 Mcnemar's Test P-Value : 6.02e-08       
                                         
            Sensitivity : 0.8042         
            Specificity : 0.8743         
         Pos Pred Value : 0.8708         
         Neg Pred Value : 0.8092         
             Prevalence : 0.5129         
         Detection Rate : 0.4125         
   Detection Prevalence : 0.4737         
      Balanced Accuracy : 0.8393         
                                         
       'Positive' Class : 0              
                                         
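
The value `0.8383806` printed before the summary equals the accuracy implied by the confusion matrix. The report does not show how the 0/1 predictions were obtained; a common pattern with a binary logistic objective is to threshold predicted probabilities at 0.5, which the sketch below illustrates on hypothetical values (`prob` and `truth` are made up for the example).

```r
# Hypothetical predicted probabilities and true 0/1 labels.
prob  <- c(0.91, 0.12, 0.55, 0.40, 0.73)
truth <- c(1, 0, 0, 0, 1)

# Assumed decision rule: predict class 1 when the probability exceeds 0.5.
pred <- as.integer(prob > 0.5)

# Accuracy is the share of predictions that match the true labels.
toy_accuracy <- mean(pred == truth)

# The same definition applied to the counts in the confusion matrix above.
reported_accuracy <- (1294 + 1336) / (1294 + 192 + 315 + 1336)

print(toy_accuracy)
print(reported_accuracy)
```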

Net Zissou

2019-10-28