This is an analysis of the census data set, a classification problem in which I will determine whether members of the population have an income of $50,000.00 or less, or of more than $50,000.00. This type of analysis has many real-world uses in business.
The census data set has 15 variables, and 32,561 rows of data to be analyzed.
Data set description
- Age: continuous.
- Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- Fnlwgt: continuous.
- Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- Education-num: continuous.
- Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- Sex: Female, Male.
- Capital-gain: continuous.
- Capital-loss: continuous.
- Hours-per-week: continuous.
- Native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
I begin by loading the data and looking at the summary and structure of the entire data set.
## Age Class Fnlwgt Work Educ Marital
## 1 39, State-gov, 77516, Bachelors, 13, Never-married,
## 2 50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse,
## 3 38, Private, 215646, HS-grad, 9, Divorced,
## 4 53, Private, 234721, 11th, 7, Married-civ-spouse,
## 5 28, Private, 338409, Bachelors, 13, Married-civ-spouse,
## 6 37, Private, 284582, Masters, 14, Married-civ-spouse,
## Job Relationship Race sex Gain Loss Hours
## 1 Adm-clerical, Not-in-family, White, Male, 2174, 0, 40,
## 2 Exec-managerial, Husband, White, Male, 0, 0, 13,
## 3 Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40,
## 4 Handlers-cleaners, Husband, Black, Male, 0, 0, 40,
## 5 Prof-specialty, Wife, Black, Female, 0, 0, 40,
## 6 Exec-managerial, Wife, White, Female, 0, 0, 40,
## Country Over
## 1 United-States, <=50K
## 2 United-States, <=50K
## 3 United-States, <=50K
## 4 United-States, <=50K
## 5 Cuba, <=50K
## 6 United-States, <=50K
## [ reached getOption("max.print") -- omitted 4 rows ]
I do not see any linear relationships among these variables.
There is class imbalance in this data set: approximately 76% of the records have an income of $50,000.00 or less, and approximately 24% have an income above $50,000.00.
##
## 0 1
## 0.7591904 0.2408096
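The proportions above can be reproduced by hand. The following is a minimal sketch in plain Python, not the code used for this report; the class counts are taken from the row totals of the LDA confusion matrix shown later, and the variable names are illustrative.

```python
# Class counts taken from the row totals of the LDA confusion matrix
# later in this report (22982 + 1738 and 3401 + 4440).
counts = {0: 22982 + 1738, 1: 3401 + 4440}
total = sum(counts.values())  # should equal the 32,561 rows in the data set

# Share of each income class in the full data set.
proportions = {label: n / total for label, n in counts.items()}
for label, share in proportions.items():
    print(label, round(share, 4))
```

A classifier that always predicted class 0 would already be right about 76% of the time, which is why accuracy alone is a weak yardstick here.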

Next, I will divide the data set into training and testing sets, with approximately 70% in the training set and 30% in the testing set.
## Age Class Fnlwgt Work Educ Marital
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## Job Relationship Race sex Gain
## 1 Adm-clerical Not-in-family White Male 2174
## 2 Exec-managerial Husband White Male 0
## 3 Handlers-cleaners Not-in-family White Male 0
## 4 Handlers-cleaners Husband Black Male 0
## 5 Prof-specialty Wife Black Female 0
## 6 Exec-managerial Wife White Female 0
## Loss Hours Country Over
## 1 0 40 United-States 0
## 2 0 13 United-States 0
## 3 0 40 United-States 0
## 4 0 40 United-States 0
## 5 0 40 Cuba 0
## 6 0 40 United-States 0
## [ reached getOption("max.print") -- omitted 32555 rows ]
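As an illustration of a 70/30 split, here is a minimal sketch in plain Python. It assumes a stratified split (30% of each class goes to the test set), which is consistent with the test-set class counts in the confusion matrices later in this report, although the original splitting code is not shown. The function name `stratified_split`, the seed, and the synthetic labels are hypothetical.

```python
import random

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), sampling test_fraction of each class."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(test_fraction * len(idx))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels with the same class counts as the census data (assumption).
labels = [0] * 24720 + [1] * 7841
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 22793 9768
```

Splitting within each class keeps the roughly 76/24 class ratio the same in both sets.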
Linear Discriminant Analysis
## $class
## [1] 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0
## [36] 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0
## [71] 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0
## [ reached getOption("max.print") -- omitted 32461 entries ]
## Levels: 0 1
##
## $posterior
## 0 1
## 1 9.358896e-01 0.0641103722
## 2 4.670460e-01 0.5329540119
## 3 9.849364e-01 0.0150635968
## 4 8.433780e-01 0.1566220142
## 5 3.173452e-01 0.6826547507
## 6 7.926320e-02 0.9207367989
## 7 9.956915e-01 0.0043084851
## 8 5.810364e-01 0.4189635659
## 9 6.467243e-01 0.3532756900
## 10 1.692070e-01 0.8307929808
## 11 2.498483e-01 0.7501516941
## 12 6.351797e-01 0.3648203430
## 13 9.643300e-01 0.0356699723
## 14 9.466553e-01 0.0533446536
## 15 8.000117e-01 0.1999883397
## 16 9.677750e-01 0.0322250411
## 17 9.956370e-01 0.0043629685
## 18 9.853847e-01 0.0146153411
## 19 7.135585e-01 0.2864414859
## 20 7.503214e-01 0.2496786174
## 21 5.045034e-02 0.9495496603
## 22 9.880451e-01 0.0119548842
## 23 9.218838e-01 0.0781161986
## 24 5.193626e-01 0.4806374395
## 25 9.557259e-01 0.0442741422
## 26 3.040670e-01 0.6959329636
## 27 9.839241e-01 0.0160758748
## 28 8.061077e-01 0.1938922955
## 29 8.110178e-01 0.1889822332
## 30 7.710868e-01 0.2289132386
## 31 9.462557e-01 0.0537442525
## 32 9.599845e-01 0.0400154886
## 33 3.977273e-01 0.6022726850
## 34 8.475608e-01 0.1524392209
## 35 9.271195e-01 0.0728804863
## 36 9.863995e-01 0.0136004911
## 37 9.807448e-01 0.0192551583
## 38 6.975766e-01 0.3024233949
## 39 7.843584e-01 0.2156415923
## 40 6.167962e-01 0.3832038467
## 41 9.013629e-01 0.0986371250
## 42 4.153391e-01 0.5846609210
## 43 3.517188e-01 0.6482811649
## 44 9.781729e-01 0.0218271334
## 45 9.901885e-01 0.0098114939
## 46 1.637635e-01 0.8362364529
## 47 8.168624e-01 0.1831375750
## 48 6.785738e-01 0.3214262435
## 49 7.784209e-01 0.2215790678
## 50 9.365264e-01 0.0634735562
## [ reached getOption("max.print") -- omitted 32511 rows ]
##
## $x
## LD1
## 1 -3.931398e-01
## 2 1.179341e+00
## 3 -1.231345e+00
## 4 1.643823e-01
## 5 1.533758e+00
## 6 2.476499e+00
## 7 -1.937145e+00
## 8 9.227280e-01
## 9 7.675193e-01
## 10 1.995099e+00
## 11 1.720150e+00
## 12 7.955647e-01
## 13 -7.376295e-01
## 14 -5.023006e-01
## 15 3.305316e-01
## 16 -7.964006e-01
## 17 -1.930089e+00
## 18 -1.248487e+00
## 19 5.953066e-01
## 20 4.904353e-01
## 21 2.746283e+00
## 22 -1.362320e+00
## 23 -2.742522e-01
## 24 1.062227e+00
## 25 -6.118187e-01
## 26 1.568421e+00
## 27 -1.194412e+00
## 28 3.089827e-01
## 29 2.912491e-01
## 30 4.266333e-01
## 31 -4.978926e-01
## 32 -6.708407e-01
## 33 1.337510e+00
## 34 1.464842e-01
## 35 -3.162013e-01
## 36 -1.289293e+00
## 37 -1.091721e+00
## 38 6.383213e-01
## 39 3.837055e-01
## 40 8.394657e-01
## 41 -1.312751e-01
## 42 1.296697e+00
## 43 1.447385e+00
## 44 -1.020165e+00
## 45 -1.473987e+00
## 46 2.017029e+00
## 47 2.696729e-01
## 48 6.878281e-01
## 49 4.031374e-01
## 50 -3.991007e-01
## 51 1.092058e+00
## 52 -1.909665e+00
## 53 3.439784e+00
## 54 1.154037e+00
## 55 3.075120e-01
## 56 1.016998e+00
## 57 -3.934403e-01
## 58 2.783530e-01
## 59 5.006873e-01
## 60 2.236454e-01
## 61 1.281130e+00
## 62 -1.855048e+00
## 63 2.710395e-01
## 64 2.533841e+00
## 65 -2.793631e-01
## 66 2.556165e-01
## 67 -1.142827e+00
## 68 8.119478e-01
## 69 1.780746e+00
## 70 -1.021181e+00
## 71 -7.023296e-01
## 72 -2.148727e-01
## 73 1.320488e+00
## 74 -1.186689e+00
## 75 4.805017e-01
## 76 -1.403296e+00
## 77 6.496584e-01
## 78 -1.825998e-01
## 79 -1.785246e+00
## 80 -9.011632e-01
## 81 -1.103677e+00
## 82 1.165530e+00
## 83 5.647897e-01
## 84 9.165315e-01
## 85 -6.538566e-01
## 86 -7.629822e-01
## 87 7.363204e-01
## 88 1.970930e+00
## 89 -1.289132e+00
## 90 1.410059e+00
## 91 1.263679e+00
## 92 -1.006382e+00
## 93 -1.587552e+00
## 94 1.057262e+00
## 95 1.312123e+00
## 96 -1.191790e+00
## 97 3.378668e+00
## 98 8.932633e-01
## 99 -3.023789e-03
## 100 -7.822202e-01
## [ reached getOption("max.print") -- omitted 32461 rows ]
## [1] 0 1 0 0 1 1
## Levels: 0 1
## class
## 0 1
## 0 22982 1738
## 1 3401 4440
## [1] 0.8421732
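The link between the posterior probabilities, the predicted classes, and the accuracy figure can be checked by hand. The sketch below is plain Python, not the code behind this report. It assumes a 0.5 cutoff on the class-1 posterior and assumes that the rows of the table are the actual classes, since the row totals match the class proportions reported earlier.

```python
# Posterior probabilities of class 1 for the first six rows (from $posterior).
posterior_1 = [0.0641103722, 0.5329540119, 0.0150635968,
               0.1566220142, 0.6826547507, 0.9207367989]

# Assign class 1 when its posterior exceeds 0.5 (assumed cutoff).
predicted = [1 if p > 0.5 else 0 for p in posterior_1]
print(predicted)  # [0, 1, 0, 0, 1, 1]

# Confusion matrix: rows are actual classes, columns are predicted classes.
matrix = [[22982, 1738],
          [3401, 4440]]
correct = matrix[0][0] + matrix[1][1]       # diagonal = correct predictions
total = sum(sum(row) for row in matrix)     # 32,561, i.e. the full data set
accuracy = correct / total
print(round(accuracy, 4))
```

Because the matrix totals 32,561 rows, this accuracy is measured on the full data set rather than on the 30% test set.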

Naive Bayes
Here I will fit a Naive Bayes model to see how high a prediction accuracy I can achieve on the test set.
## [1] 0 1 1 0 1 0
## Levels: 0 1
## pred
## 0 1
## 0 6938 478
## 1 1155 1197
## [1] 0.6603298
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6938 1155
## 1 478 1197
##
## Accuracy : 0.8328
## 95% CI : (0.8253, 0.8402)
## No Information Rate : 0.7592
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4929
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5089
## Specificity : 0.9355
## Pos Pred Value : 0.7146
## Neg Pred Value : 0.8573
## Prevalence : 0.2408
## Detection Rate : 0.1225
## Detection Prevalence : 0.1715
## Balanced Accuracy : 0.7222
##
## 'Positive' Class : 1
##
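The statistics in this summary all follow from the four cells of the confusion matrix. The sketch below recomputes them in plain Python as a check; it is not the code used for the report, and the variable names are illustrative. It treats class 1 as the positive class, as the summary states.

```python
# Layout of the summary: rows are predictions, columns are reference classes.
tn, fn = 6938, 1155   # predicted 0: actually 0, actually 1
fp, tp = 478, 1197    # predicted 1: actually 0, actually 1
total = tn + fn + fp + tp

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)          # share of actual 1s that were found
specificity = tn / (tn + fp)          # share of actual 0s that were found
ppv = tp / (tp + fp)                  # positive predictive value
npv = tn / (tn + fn)                  # negative predictive value
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginal totals predict by chance.
expected = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [("Accuracy", accuracy), ("Kappa", kappa),
                    ("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("Pos Pred Value", ppv), ("Neg Pred Value", npv),
                    ("Balanced Accuracy", balanced_accuracy)]:
    print(f"{name}: {value:.4f}")
```

The recomputed accuracy agrees with the 0.8328 in the summary. It differs from the 0.6603298 printed just before the summary, whose calculation is not shown.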
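Next, I'll try Naive Bayes models fit on resampled training data, starting with oversampling. Each block of output below shows the class counts after resampling, followed by the confusion matrix on the test set.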
##
## 0 1
## 17304 17304
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6746 952
## 1 670 1400
##
## Accuracy : 0.8339
## 95% CI : (0.8264, 0.8413)
## No Information Rate : 0.7592
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5264
## Mcnemar's Test P-Value : 3.011e-12
##
## Sensitivity : 0.5952
## Specificity : 0.9097
## Pos Pred Value : 0.6763
## Neg Pred Value : 0.8763
## Prevalence : 0.2408
## Detection Rate : 0.1433
## Detection Prevalence : 0.2119
## Balanced Accuracy : 0.7524
##
## 'Positive' Class : 1
##
##
## 0 1
## 5489 5489
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6747 967
## 1 669 1385
##
## Accuracy : 0.8325
## 95% CI : (0.825, 0.8399)
## No Information Rate : 0.7592
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5212
## Mcnemar's Test P-Value : 2.091e-13
##
## Sensitivity : 0.5889
## Specificity : 0.9098
## Pos Pred Value : 0.6743
## Neg Pred Value : 0.8746
## Prevalence : 0.2408
## Detection Rate : 0.1418
## Detection Prevalence : 0.2103
## Balanced Accuracy : 0.7493
##
## 'Positive' Class : 1
##
##
## 0 1
## 11307 11486
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6747 967
## 1 669 1385
##
## Accuracy : 0.8325
## 95% CI : (0.825, 0.8399)
## No Information Rate : 0.7592
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5212
## Mcnemar's Test P-Value : 2.091e-13
##
## Sensitivity : 0.5889
## Specificity : 0.9098
## Pos Pred Value : 0.6743
## Neg Pred Value : 0.8746
## Prevalence : 0.2408
## Detection Rate : 0.1418
## Detection Prevalence : 0.2103
## Balanced Accuracy : 0.7493
##
## 'Positive' Class : 1
##
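If I need to increase sensitivity or specificity when class imbalance exists, I can try resampling. Here, oversampling raised sensitivity from 0.5089 to 0.5952 and lowered specificity from 0.9355 to 0.9097, while accuracy stayed at about 0.83.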
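Random oversampling can be sketched as follows. This is a minimal plain-Python illustration, not the resampling routine used for this report: it duplicates randomly chosen minority-class rows until both classes are the same size. The function name `oversample_minority`, the seed, and the synthetic labels are hypothetical; the class sizes are chosen to match the counts printed above (17,304 and 5,489).

```python
import random

def oversample_minority(labels, seed=0):
    """Return row indices in which every class is as large as the largest one."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = max(len(idx) for idx in by_class.values())
    resampled = []
    for idx in by_class.values():
        resampled.extend(idx)
        # Draw extra rows with replacement to close the gap to the target size.
        resampled.extend(rng.choices(idx, k=target - len(idx)))
    rng.shuffle(resampled)
    return resampled

# Synthetic training labels with assumed class sizes 17,304 and 5,489.
labels = [0] * 17304 + [1] * 5489
resampled_idx = oversample_minority(labels)
counts = {0: 0, 1: 0}
for i in resampled_idx:
    counts[labels[i]] += 1
print(counts)  # {0: 17304, 1: 17304}
```

Only the training data should be resampled; the test set keeps its original class ratio so the reported metrics remain comparable.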
Boosted Trees
Fit the GBM model for boosted trees

## var
## Marital.Married.civ.spouse Marital.Married.civ.spouse
## Gain Gain
## Loss Loss
## Age Age
## Hours Hours
## Work.Bachelors Work.Bachelors
## Job.Prof.specialty Job.Prof.specialty
## Job.Exec.managerial Job.Exec.managerial
## Work.Masters Work.Masters
## Work.Prof.school Work.Prof.school
## Fnlwgt Fnlwgt
## Work.Doctorate Work.Doctorate
## Relationship.Wife Relationship.Wife
## sex.Male sex.Male
## Work.HS.grad Work.HS.grad
## Job.Other.service Job.Other.service
## Class.Federal.gov Class.Federal.gov
## Job.Tech.support Job.Tech.support
## Job.Farming.fishing Job.Farming.fishing
## Work.Assoc.voc Work.Assoc.voc
## Job.Sales Job.Sales
## Class.Self.emp.not.inc Class.Self.emp.not.inc
## Work.Some.college Work.Some.college
## Work.7th.8th Work.7th.8th
## Class.Self.emp.inc Class.Self.emp.inc
## Country.United.States Country.United.States
## Job.Protective.serv Job.Protective.serv
## Class.Private Class.Private
## Job.Machine.op.inspct Job.Machine.op.inspct
## Job.Adm.clerical Job.Adm.clerical
## Class.Local.gov Class.Local.gov
## Educ.6 Educ.6
## Relationship.Not.in.family Relationship.Not.in.family
## Race.White Race.White
## Work.5th.6th Work.5th.6th
## Country.Italy Country.Italy
## Job.Handlers.cleaners Job.Handlers.cleaners
## Country.Mexico Country.Mexico
## Job.Craft.repair Job.Craft.repair
## Job.Transport.moving Job.Transport.moving
## Work.Assoc.acdm Work.Assoc.acdm
## Work.11th Work.11th
## Marital.Never.married Marital.Never.married
## Country.Philippines Country.Philippines
## Work.9th Work.9th
## Class.State.gov Class.State.gov
## Race.Black Race.Black
## Marital.Widowed Marital.Widowed
## Country.India Country.India
## Race.Asian.Pac.Islander Race.Asian.Pac.Islander
## rel.inf
## Marital.Married.civ.spouse 31.352695105
## Gain 20.989548964
## Loss 7.530250536
## Age 7.371591088
## Hours 5.758885455
## Work.Bachelors 3.846186407
## Job.Prof.specialty 3.664138573
## Job.Exec.managerial 3.423952700
## Work.Masters 2.566496904
## Work.Prof.school 1.648874274
## Fnlwgt 1.632020736
## Work.Doctorate 1.018997313
## Relationship.Wife 0.812450011
## sex.Male 0.663315592
## Work.HS.grad 0.615093450
## Job.Other.service 0.574787015
## Class.Federal.gov 0.454466428
## Job.Tech.support 0.444332300
## Job.Farming.fishing 0.441654981
## Work.Assoc.voc 0.435697760
## Job.Sales 0.401616590
## Class.Self.emp.not.inc 0.398157990
## Work.Some.college 0.353078702
## Work.7th.8th 0.335089761
## Class.Self.emp.inc 0.327788479
## Country.United.States 0.236474613
## Job.Protective.serv 0.230299895
## Class.Private 0.215108920
## Job.Machine.op.inspct 0.214345963
## Job.Adm.clerical 0.177762436
## Class.Local.gov 0.172807250
## Educ.6 0.153414281
## Relationship.Not.in.family 0.146107127
## Race.White 0.127840155
## Work.5th.6th 0.117113417
## Country.Italy 0.109839650
## Job.Handlers.cleaners 0.105495956
## Country.Mexico 0.103915226
## Job.Craft.repair 0.103476398
## Job.Transport.moving 0.085063736
## Work.Assoc.acdm 0.064073339
## Work.11th 0.063031327
## Marital.Never.married 0.062014408
## Country.Philippines 0.058146251
## Work.9th 0.052408445
## Class.State.gov 0.044403122
## Race.Black 0.042046206
## Marital.Widowed 0.037646390
## Country.India 0.031697891
## Race.Asian.Pac.Islander 0.023411294
## [ reached getOption("max.print") -- omitted 64 rows ]
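To gauge how concentrated the influence is, the top entries of the table can be summed. The sketch below is plain Python using the five largest values printed above; it assumes the relative influence values are normalized to sum to 100 across all predictors, which is the default for the GBM summary.

```python
# Five largest relative-influence values, copied from the table above.
rel_inf = {
    "Marital.Married.civ.spouse": 31.352695105,
    "Gain": 20.989548964,
    "Loss": 7.530250536,
    "Age": 7.371591088,
    "Hours": 5.758885455,
}

# Combined share of the top five predictors, in percent (assumes a total of 100).
top5_share = sum(rel_inf.values())
print(f"{top5_share:.1f}")  # 73.0
```

Under that assumption, marital status, capital gain and loss, age, and hours worked account for roughly 73% of the model's total relative influence.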
First few rows of probability
## [1] 0 0 0 1 0 1
## [1] 0.8639245
Accuracy
Goodness of fit
## [1] 0.8707