This is an analysis of the census data set, a classification problem in which I will determine whether members of the population have an income of at most $50,000 or more than $50,000. This type of classification has many real-world uses in business.

The census data set has 15 variables and 32,561 rows of data to be analyzed.

Data set description

I begin by loading the data and looking at a summary and the structure of the entire data set.

##    Age             Class  Fnlwgt       Work Educ                Marital
## 1  39,        State-gov,  77516, Bachelors,  13,         Never-married,
## 2  50, Self-emp-not-inc,  83311, Bachelors,  13,    Married-civ-spouse,
## 3  38,          Private, 215646,   HS-grad,   9,              Divorced,
## 4  53,          Private, 234721,      11th,   7,    Married-civ-spouse,
## 5  28,          Private, 338409, Bachelors,  13,    Married-civ-spouse,
## 6  37,          Private, 284582,   Masters,  14,    Married-civ-spouse,
##                   Job   Relationship   Race     sex   Gain Loss Hours
## 1       Adm-clerical, Not-in-family, White,   Male,  2174,   0,   40,
## 2    Exec-managerial,       Husband, White,   Male,     0,   0,   13,
## 3  Handlers-cleaners, Not-in-family, White,   Male,     0,   0,   40,
## 4  Handlers-cleaners,       Husband, Black,   Male,     0,   0,   40,
## 5     Prof-specialty,          Wife, Black, Female,     0,   0,   40,
## 6    Exec-managerial,          Wife, White, Female,     0,   0,   40,
##           Country  Over
## 1  United-States, <=50K
## 2  United-States, <=50K
## 3  United-States, <=50K
## 4  United-States, <=50K
## 5           Cuba, <=50K
## 6  United-States, <=50K
##  [ reached getOption("max.print") -- omitted 4 rows ]
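
The trailing commas in the printout above suggest the file was read with whitespace as the separator. A minimal sketch of that loading step, assuming a comma-plus-space record layout and using two inline records instead of a file (the `raw_lines` and `cols` names are mine, not from the original analysis):

```r
# Two raw records in the comma-plus-space layout suggested by the printout
raw_lines <- c(
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
  "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K"
)

# Column labels as they appear in the report
cols <- c("Age", "Class", "Fnlwgt", "Work", "Educ", "Marital", "Job",
          "Relationship", "Race", "sex", "Gain", "Loss", "Hours",
          "Country", "Over")

# The default whitespace separator keeps the trailing commas in each value
census <- read.table(text = raw_lines, col.names = cols,
                     stringsAsFactors = FALSE)

head(census)     # first rows
str(census)      # structure: every column is still character here
summary(census)  # per-column summary
```

With a real file, a path would replace the `text` argument.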

Next I will perform some cleaning and add labels to the data set.

## 'data.frame':    32561 obs. of  15 variables:
##  $ Age         : num  39 50 38 53 28 37 49 52 31 42 ...
##  $ Class       : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ Fnlwgt      : num  77516 83311 215646 234721 338409 ...
##  $ Work        : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ Educ        : Factor w/ 16 levels "1","10","11",..: 5 5 16 14 5 6 12 16 6 5 ...
##  $ Marital     : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ Job         : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ Relationship: Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ Race        : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex         : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ Gain        : num  2174 0 0 0 0 ...
##  $ Loss        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hours       : num  40 13 40 40 40 40 16 45 50 40 ...
##  $ Country     : chr  "United-States" "United-States" "United-States" "United-States" ...
##  $ Over        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
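
The cleaning step itself is not shown. A minimal sketch of what it might look like, assuming the goal is to strip the trailing commas, convert types, and recode the outcome to 0/1 as it appears later in the report (the `strip_comma` helper and the toy rows are hypothetical):

```r
# Toy rows in the pre-cleaning state; the second Over value is changed
# to ">50K" purely for illustration
raw <- data.frame(
  Age   = c("39,", "50,"),
  Class = c("State-gov,", "Self-emp-not-inc,"),
  Over  = c("<=50K", ">50K"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove one trailing comma, if present
strip_comma <- function(x) sub(",$", "", x)

clean <- raw
clean[] <- lapply(raw, strip_comma)              # apply to every column
clean$Age   <- as.numeric(clean$Age)             # numeric predictor
clean$Class <- factor(clean$Class)               # categorical predictor
clean$Over  <- ifelse(clean$Over == ">50K", 1, 0) # 1 = income above $50,000

str(clean)
```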

Next I will look at some of the relationships of numerical predictor variables.

I do not see any linear relationships among these variables.
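
One way to check for linear relationships numerically is a correlation matrix. The sketch below uses only the six rows printed earlier, so it illustrates the call and says nothing reliable about the full data set; the choice of `Age`, `Fnlwgt`, and `Hours` is mine.

```r
# Numeric columns from the six rows shown in the first printout
num <- data.frame(
  Age    = c(39, 50, 38, 53, 28, 37),
  Fnlwgt = c(77516, 83311, 215646, 234721, 338409, 284582),
  Hours  = c(40, 13, 40, 40, 40, 40)
)

# Pairwise Pearson correlations; values near 0 indicate no linear relationship
cors <- cor(num)
round(cors, 2)

# pairs(num) would draw the matching scatterplot matrix
```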

There is class imbalance in this data set. About 76% of the records have an income of $50,000 or less, and about 24% have an income above $50,000.

## 
##         0         1 
## 0.7591904 0.2408096
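
The proportions above can be reproduced from class counts with `table()` and `prop.table()`. The counts 24,720 and 7,841 used here are the ones consistent with those proportions and 32,561 rows; the `Over` vector is rebuilt from them for illustration.

```r
# Outcome vector rebuilt from the class counts (0 = at most $50,000)
Over <- rep(c(0, 1), times = c(24720, 7841))

counts <- table(Over)         # rows per class
props  <- prop.table(counts)  # share of each class
round(props, 7)
```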

Next, I will divide the data set into training and testing sets, with approximately 70% of the rows in the training set and 30% in the testing set.

##       Age            Class  Fnlwgt         Work Educ               Marital
## 1      39        State-gov   77516    Bachelors   13         Never-married
## 2      50 Self-emp-not-inc   83311    Bachelors   13    Married-civ-spouse
## 3      38          Private  215646      HS-grad    9              Divorced
## 4      53          Private  234721         11th    7    Married-civ-spouse
## 5      28          Private  338409    Bachelors   13    Married-civ-spouse
## 6      37          Private  284582      Masters   14    Married-civ-spouse
##                     Job   Relationship               Race    sex  Gain
## 1          Adm-clerical  Not-in-family              White   Male  2174
## 2       Exec-managerial        Husband              White   Male     0
## 3     Handlers-cleaners  Not-in-family              White   Male     0
## 4     Handlers-cleaners        Husband              Black   Male     0
## 5        Prof-specialty           Wife              Black Female     0
## 6       Exec-managerial           Wife              White Female     0
##       Loss Hours                    Country Over
## 1        0    40              United-States    0
## 2        0    13              United-States    0
## 3        0    40              United-States    0
## 4        0    40              United-States    0
## 5        0    40                       Cuba    0
## 6        0    40              United-States    0
##  [ reached getOption("max.print") -- omitted 32555 rows ]
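
A minimal sketch of a 70/30 split in base R. The seed and the use of `sample()` are assumptions; the original may have used a different function.

```r
set.seed(123)  # hypothetical seed, only for reproducibility
n <- 32561     # rows in the census data set

# Draw 70% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), size = round(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)
length(test_idx)

# With a data frame called census (hypothetical name):
# train <- census[train_idx, ]
# test  <- census[test_idx, ]
```

The resulting test size, 9,768 rows, matches the total of the Naive Bayes confusion matrix in this report.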

Linear Discriminant Analysis

## $class
##   [1] 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0
##  [36] 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0
##  [71] 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0
##  [ reached getOption("max.print") -- omitted 32461 entries ]
## Levels: 0 1
## 
## $posterior
##                  0            1
## 1     9.358896e-01 0.0641103722
## 2     4.670460e-01 0.5329540119
## 3     9.849364e-01 0.0150635968
## 4     8.433780e-01 0.1566220142
## 5     3.173452e-01 0.6826547507
## 6     7.926320e-02 0.9207367989
## 7     9.956915e-01 0.0043084851
## 8     5.810364e-01 0.4189635659
## 9     6.467243e-01 0.3532756900
## 10    1.692070e-01 0.8307929808
## 11    2.498483e-01 0.7501516941
## 12    6.351797e-01 0.3648203430
## 13    9.643300e-01 0.0356699723
## 14    9.466553e-01 0.0533446536
## 15    8.000117e-01 0.1999883397
## 16    9.677750e-01 0.0322250411
## 17    9.956370e-01 0.0043629685
## 18    9.853847e-01 0.0146153411
## 19    7.135585e-01 0.2864414859
## 20    7.503214e-01 0.2496786174
## 21    5.045034e-02 0.9495496603
## 22    9.880451e-01 0.0119548842
## 23    9.218838e-01 0.0781161986
## 24    5.193626e-01 0.4806374395
## 25    9.557259e-01 0.0442741422
## 26    3.040670e-01 0.6959329636
## 27    9.839241e-01 0.0160758748
## 28    8.061077e-01 0.1938922955
## 29    8.110178e-01 0.1889822332
## 30    7.710868e-01 0.2289132386
## 31    9.462557e-01 0.0537442525
## 32    9.599845e-01 0.0400154886
## 33    3.977273e-01 0.6022726850
## 34    8.475608e-01 0.1524392209
## 35    9.271195e-01 0.0728804863
## 36    9.863995e-01 0.0136004911
## 37    9.807448e-01 0.0192551583
## 38    6.975766e-01 0.3024233949
## 39    7.843584e-01 0.2156415923
## 40    6.167962e-01 0.3832038467
## 41    9.013629e-01 0.0986371250
## 42    4.153391e-01 0.5846609210
## 43    3.517188e-01 0.6482811649
## 44    9.781729e-01 0.0218271334
## 45    9.901885e-01 0.0098114939
## 46    1.637635e-01 0.8362364529
## 47    8.168624e-01 0.1831375750
## 48    6.785738e-01 0.3214262435
## 49    7.784209e-01 0.2215790678
## 50    9.365264e-01 0.0634735562
##  [ reached getOption("max.print") -- omitted 32511 rows ]
## 
## $x
##                 LD1
## 1     -3.931398e-01
## 2      1.179341e+00
## 3     -1.231345e+00
## 4      1.643823e-01
## 5      1.533758e+00
## 6      2.476499e+00
## 7     -1.937145e+00
## 8      9.227280e-01
## 9      7.675193e-01
## 10     1.995099e+00
## 11     1.720150e+00
## 12     7.955647e-01
## 13    -7.376295e-01
## 14    -5.023006e-01
## 15     3.305316e-01
## 16    -7.964006e-01
## 17    -1.930089e+00
## 18    -1.248487e+00
## 19     5.953066e-01
## 20     4.904353e-01
## 21     2.746283e+00
## 22    -1.362320e+00
## 23    -2.742522e-01
## 24     1.062227e+00
## 25    -6.118187e-01
## 26     1.568421e+00
## 27    -1.194412e+00
## 28     3.089827e-01
## 29     2.912491e-01
## 30     4.266333e-01
## 31    -4.978926e-01
## 32    -6.708407e-01
## 33     1.337510e+00
## 34     1.464842e-01
## 35    -3.162013e-01
## 36    -1.289293e+00
## 37    -1.091721e+00
## 38     6.383213e-01
## 39     3.837055e-01
## 40     8.394657e-01
## 41    -1.312751e-01
## 42     1.296697e+00
## 43     1.447385e+00
## 44    -1.020165e+00
## 45    -1.473987e+00
## 46     2.017029e+00
## 47     2.696729e-01
## 48     6.878281e-01
## 49     4.031374e-01
## 50    -3.991007e-01
## 51     1.092058e+00
## 52    -1.909665e+00
## 53     3.439784e+00
## 54     1.154037e+00
## 55     3.075120e-01
## 56     1.016998e+00
## 57    -3.934403e-01
## 58     2.783530e-01
## 59     5.006873e-01
## 60     2.236454e-01
## 61     1.281130e+00
## 62    -1.855048e+00
## 63     2.710395e-01
## 64     2.533841e+00
## 65    -2.793631e-01
## 66     2.556165e-01
## 67    -1.142827e+00
## 68     8.119478e-01
## 69     1.780746e+00
## 70    -1.021181e+00
## 71    -7.023296e-01
## 72    -2.148727e-01
## 73     1.320488e+00
## 74    -1.186689e+00
## 75     4.805017e-01
## 76    -1.403296e+00
## 77     6.496584e-01
## 78    -1.825998e-01
## 79    -1.785246e+00
## 80    -9.011632e-01
## 81    -1.103677e+00
## 82     1.165530e+00
## 83     5.647897e-01
## 84     9.165315e-01
## 85    -6.538566e-01
## 86    -7.629822e-01
## 87     7.363204e-01
## 88     1.970930e+00
## 89    -1.289132e+00
## 90     1.410059e+00
## 91     1.263679e+00
## 92    -1.006382e+00
## 93    -1.587552e+00
## 94     1.057262e+00
## 95     1.312123e+00
## 96    -1.191790e+00
## 97     3.378668e+00
## 98     8.932633e-01
## 99    -3.023789e-03
## 100   -7.822202e-01
##  [ reached getOption("max.print") -- omitted 32461 rows ]
## [1] 0 1 0 0 1 1
## Levels: 0 1
##    class
##         0     1
##   0 22982  1738
##   1  3401  4440
## [1] 0.8421732
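
The `$class`, `$posterior`, and `$x` components above are what `predict()` returns for a model fitted with `MASS::lda()`. A minimal sketch on toy data (the data, seed, and two-predictor formula are assumptions), followed by the accuracy calculation applied to the table reported above:

```r
library(MASS)  # provides lda()

# Toy data (hypothetical): two numeric predictors and a 0/1 outcome
set.seed(1)
n <- 200
toy <- data.frame(
  Age   = c(rnorm(n, 35, 10), rnorm(n, 45, 10)),
  Hours = c(rnorm(n, 38, 8),  rnorm(n, 45, 8)),
  Over  = factor(rep(c(0, 1), each = n))
)

fit  <- lda(Over ~ Age + Hours, data = toy)
pred <- predict(fit, toy)  # list with $class, $posterior and $x
names(pred)

# Confusion matrix (rows = actual, columns = predicted) and accuracy
tab <- table(toy$Over, pred$class)
sum(diag(tab)) / sum(tab)

# The same accuracy calculation on the table reported above
reported <- matrix(c(22982, 3401, 1738, 4440), nrow = 2,
                   dimnames = list(actual = c("0", "1"),
                                   class  = c("0", "1")))
reported_acc <- sum(diag(reported)) / sum(reported)
reported_acc  # 0.8421732
```

The reported table sums to 32,561, so this accuracy was measured on the full data set, not on the 30% test set.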

Naive Bayes

Here I will run a Naive Bayes model to see how much prediction accuracy I can achieve.

## [1] 0 1 1 0 1 0
## Levels: 0 1
##    pred
##        0    1
##   0 6938  478
##   1 1155 1197
## [1] 0.6603298
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6938 1155
##          1  478 1197
##                                           
##                Accuracy : 0.8328          
##                  95% CI : (0.8253, 0.8402)
##     No Information Rate : 0.7592          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4929          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5089          
##             Specificity : 0.9355          
##          Pos Pred Value : 0.7146          
##          Neg Pred Value : 0.8573          
##              Prevalence : 0.2408          
##          Detection Rate : 0.1225          
##    Detection Prevalence : 0.1715          
##       Balanced Accuracy : 0.7222          
##                                           
##        'Positive' Class : 1               
## 
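
The statistics in the summary above follow directly from the four cells of the confusion matrix. A sketch of the arithmetic in base R, with class 1 as the positive class as in the output (variable names are mine):

```r
# Confusion matrix from the output above (rows = prediction, cols = reference)
cm <- matrix(c(6938, 478, 1155, 1197), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

TP <- cm["1", "1"]  # predicted 1, actually 1
TN <- cm["0", "0"]  # predicted 0, actually 0
FP <- cm["1", "0"]  # predicted 1, actually 0
FN <- cm["0", "1"]  # predicted 0, actually 1

accuracy    <- (TP + TN) / sum(cm)   # 0.8328
sensitivity <- TP / (TP + FN)        # 0.5089
specificity <- TN / (TN + FP)        # 0.9355
ppv         <- TP / (TP + FP)        # 0.7146
npv         <- TN / (TN + FN)        # 0.8573
balanced    <- (sensitivity + specificity) / 2  # 0.7222

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced = balanced), 4)
```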

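I’ll try Naive Bayes on rebalanced training sets. The class counts below show three versions: the minority class oversampled to 17,304 rows per class, the majority class undersampled to 5,489 rows per class, and a mix of both with 11,307 and 11,486 rows.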

## 
##     0     1 
## 17304 17304
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6746  952
##          1  670 1400
##                                           
##                Accuracy : 0.8339          
##                  95% CI : (0.8264, 0.8413)
##     No Information Rate : 0.7592          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5264          
##  Mcnemar's Test P-Value : 3.011e-12       
##                                           
##             Sensitivity : 0.5952          
##             Specificity : 0.9097          
##          Pos Pred Value : 0.6763          
##          Neg Pred Value : 0.8763          
##              Prevalence : 0.2408          
##          Detection Rate : 0.1433          
##    Detection Prevalence : 0.2119          
##       Balanced Accuracy : 0.7524          
##                                           
##        'Positive' Class : 1               
## 
## 
##    0    1 
## 5489 5489
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6747  967
##          1  669 1385
##                                          
##                Accuracy : 0.8325         
##                  95% CI : (0.825, 0.8399)
##     No Information Rate : 0.7592         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5212         
##  Mcnemar's Test P-Value : 2.091e-13      
##                                          
##             Sensitivity : 0.5889         
##             Specificity : 0.9098         
##          Pos Pred Value : 0.6743         
##          Neg Pred Value : 0.8746         
##              Prevalence : 0.2408         
##          Detection Rate : 0.1418         
##    Detection Prevalence : 0.2103         
##       Balanced Accuracy : 0.7493         
##                                          
##        'Positive' Class : 1              
## 
## 
##     0     1 
## 11307 11486
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6747  967
##          1  669 1385
##                                          
##                Accuracy : 0.8325         
##                  95% CI : (0.825, 0.8399)
##     No Information Rate : 0.7592         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5212         
##  Mcnemar's Test P-Value : 2.091e-13      
##                                          
##             Sensitivity : 0.5889         
##             Specificity : 0.9098         
##          Pos Pred Value : 0.6743         
##          Neg Pred Value : 0.8746         
##              Prevalence : 0.2408         
##          Detection Rate : 0.1418         
##    Detection Prevalence : 0.2103         
##       Balanced Accuracy : 0.7493         
##                                          
##        'Positive' Class : 1              
## 

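If I need higher sensitivity when class imbalance exists, I can try resampling. Here oversampling raised sensitivity from 0.5089 to 0.5952 at the cost of specificity, which fell from 0.9355 to 0.9097, while accuracy stayed near 0.83.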
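
Rebalancing of the kind shown in the class counts above can be sketched in base R with `sample()`. This is an illustration under assumptions, not necessarily the tool used for the report; the seed is hypothetical and the label vector is rebuilt from the training counts 17,304 and 5,489.

```r
set.seed(42)  # hypothetical seed

# Training labels rebuilt from the class counts (0 = majority class)
y <- rep(c(0, 1), times = c(17304, 5489))
idx_major <- which(y == 0)
idx_minor <- which(y == 1)

# Oversampling: draw minority rows with replacement up to the majority count
over_idx <- c(idx_major,
              sample(idx_minor, length(idx_major), replace = TRUE))
over_counts <- table(y[over_idx])
over_counts   # 17304 rows in each class

# Undersampling: keep only as many majority rows as there are minority rows
under_idx <- c(sample(idx_major, length(idx_minor)), idx_minor)
under_counts <- table(y[under_idx])
under_counts  # 5489 rows in each class
```

Only the training set should be rebalanced, so that the test set keeps the original class proportions.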

Boosted Trees

Fit the GBM model for boosted trees

##                                                                   var
## Marital.Married.civ.spouse                 Marital.Married.civ.spouse
## Gain                                                             Gain
## Loss                                                             Loss
## Age                                                               Age
## Hours                                                           Hours
## Work.Bachelors                                         Work.Bachelors
## Job.Prof.specialty                                 Job.Prof.specialty
## Job.Exec.managerial                               Job.Exec.managerial
## Work.Masters                                             Work.Masters
## Work.Prof.school                                     Work.Prof.school
## Fnlwgt                                                         Fnlwgt
## Work.Doctorate                                         Work.Doctorate
## Relationship.Wife                                   Relationship.Wife
## sex.Male                                                     sex.Male
## Work.HS.grad                                             Work.HS.grad
## Job.Other.service                                   Job.Other.service
## Class.Federal.gov                                   Class.Federal.gov
## Job.Tech.support                                     Job.Tech.support
## Job.Farming.fishing                               Job.Farming.fishing
## Work.Assoc.voc                                         Work.Assoc.voc
## Job.Sales                                                   Job.Sales
## Class.Self.emp.not.inc                         Class.Self.emp.not.inc
## Work.Some.college                                   Work.Some.college
## Work.7th.8th                                             Work.7th.8th
## Class.Self.emp.inc                                 Class.Self.emp.inc
## Country.United.States                           Country.United.States
## Job.Protective.serv                               Job.Protective.serv
## Class.Private                                           Class.Private
## Job.Machine.op.inspct                           Job.Machine.op.inspct
## Job.Adm.clerical                                     Job.Adm.clerical
## Class.Local.gov                                       Class.Local.gov
## Educ.6                                                         Educ.6
## Relationship.Not.in.family                 Relationship.Not.in.family
## Race.White                                                 Race.White
## Work.5th.6th                                             Work.5th.6th
## Country.Italy                                           Country.Italy
## Job.Handlers.cleaners                           Job.Handlers.cleaners
## Country.Mexico                                         Country.Mexico
## Job.Craft.repair                                     Job.Craft.repair
## Job.Transport.moving                             Job.Transport.moving
## Work.Assoc.acdm                                       Work.Assoc.acdm
## Work.11th                                                   Work.11th
## Marital.Never.married                           Marital.Never.married
## Country.Philippines                               Country.Philippines
## Work.9th                                                     Work.9th
## Class.State.gov                                       Class.State.gov
## Race.Black                                                 Race.Black
## Marital.Widowed                                       Marital.Widowed
## Country.India                                           Country.India
## Race.Asian.Pac.Islander                       Race.Asian.Pac.Islander
##                                         rel.inf
## Marital.Married.civ.spouse         31.352695105
## Gain                               20.989548964
## Loss                                7.530250536
## Age                                 7.371591088
## Hours                               5.758885455
## Work.Bachelors                      3.846186407
## Job.Prof.specialty                  3.664138573
## Job.Exec.managerial                 3.423952700
## Work.Masters                        2.566496904
## Work.Prof.school                    1.648874274
## Fnlwgt                              1.632020736
## Work.Doctorate                      1.018997313
## Relationship.Wife                   0.812450011
## sex.Male                            0.663315592
## Work.HS.grad                        0.615093450
## Job.Other.service                   0.574787015
## Class.Federal.gov                   0.454466428
## Job.Tech.support                    0.444332300
## Job.Farming.fishing                 0.441654981
## Work.Assoc.voc                      0.435697760
## Job.Sales                           0.401616590
## Class.Self.emp.not.inc              0.398157990
## Work.Some.college                   0.353078702
## Work.7th.8th                        0.335089761
## Class.Self.emp.inc                  0.327788479
## Country.United.States               0.236474613
## Job.Protective.serv                 0.230299895
## Class.Private                       0.215108920
## Job.Machine.op.inspct               0.214345963
## Job.Adm.clerical                    0.177762436
## Class.Local.gov                     0.172807250
## Educ.6                              0.153414281
## Relationship.Not.in.family          0.146107127
## Race.White                          0.127840155
## Work.5th.6th                        0.117113417
## Country.Italy                       0.109839650
## Job.Handlers.cleaners               0.105495956
## Country.Mexico                      0.103915226
## Job.Craft.repair                    0.103476398
## Job.Transport.moving                0.085063736
## Work.Assoc.acdm                     0.064073339
## Work.11th                           0.063031327
## Marital.Never.married               0.062014408
## Country.Philippines                 0.058146251
## Work.9th                            0.052408445
## Class.State.gov                     0.044403122
## Race.Black                          0.042046206
## Marital.Widowed                     0.037646390
## Country.India                       0.031697891
## Race.Asian.Pac.Islander             0.023411294
##  [ reached getOption("max.print") -- omitted 64 rows ]
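
A relative-influence table like the one above is what `summary()` returns for a `gbm` fit. A minimal sketch on toy data; the data, seed, and tuning values (`n.trees`, `interaction.depth`, `shrinkage`) are assumptions, not the settings used for the report.

```r
# gbm is a CRAN package, so the sketch only runs when it is installed
if (requireNamespace("gbm", quietly = TRUE)) {
  set.seed(7)  # hypothetical seed
  n <- 300
  toy <- data.frame(
    Age   = c(rnorm(n, 35, 10), rnorm(n, 45, 10)),
    Hours = c(rnorm(n, 38, 8),  rnorm(n, 45, 8)),
    Over  = rep(c(0, 1), each = n)  # bernoulli loss expects numeric 0/1
  )

  fit <- gbm::gbm(Over ~ Age + Hours, data = toy,
                  distribution = "bernoulli",
                  n.trees = 100, interaction.depth = 2, shrinkage = 0.1)

  # Relative influence per predictor (columns var and rel.inf), no plot
  print(summary(fit, plotit = FALSE))

  # Predicted probabilities of class 1
  p <- predict(fit, newdata = toy, n.trees = 100, type = "response")
  print(head(p))
}
```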

First few rows of probability

## [1] 0 0 0 1 0 1
## [1] 0.8639245

Accuracy

Goodness of fit

## [1] 0.8707
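
A boosted model for a 0/1 outcome produces probabilities, which are turned into classes with a cutoff before accuracy is computed. A minimal sketch with made-up probabilities and labels, assuming a cutoff of 0.5:

```r
probs  <- c(0.12, 0.40, 0.08, 0.91, 0.33, 0.77)  # made-up probabilities
actual <- c(0, 0, 0, 1, 1, 1)                    # made-up true labels

# Classify as 1 when the predicted probability exceeds the cutoff
pred <- ifelse(probs > 0.5, 1, 0)
pred

# Accuracy: share of predictions that match the actual class
acc <- mean(pred == actual)
acc
```

Lowering the cutoff trades specificity for sensitivity, which is another way to handle the class imbalance noted earlier.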