The data was extracted from 1994 Census bureau database. However, I downloaded from Kaggle by UCI Machine Learning Repository.

The data contains 32561 observations (people) and 15 variables. A high level summary of the data is below.

## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  90 82 66 54 41 34 38 74 68 41 ...
##  $ workclass     : Factor w/ 8 levels "Federal-gov",..: NA 4 NA 4 4 4 4 7 1 4 ...
##  $ fnlwgt        : int  77053 132870 186061 140359 264663 216864 150601 88638 422013 70037 ...
##  $ education     : Factor w/ 16 levels "10th","11th",..: 12 12 16 6 16 12 1 11 12 16 ...
##  $ education.num : int  9 9 10 4 10 9 6 16 9 10 ...
##  $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 7 7 7 1 6 1 6 5 1 5 ...
##  $ occupation    : Factor w/ 14 levels "Adm-clerical",..: NA 4 NA 7 10 8 1 10 10 3 ...
##  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 2 5 5 4 5 5 3 2 5 ...
##  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 3 5 5 5 5 5 5 5 ...
##  $ sex           : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 1 1 2 ...
##  $ capital.gain  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss  : int  4356 4356 4356 3900 3900 3770 3770 3683 3683 3004 ...
##  $ hours.per.week: int  40 18 40 40 40 45 40 20 40 60 ...
##  $ native.country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 39 39 39 39 39 NA ...
##  $ income        : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 1 2 ...

Statistics summary after changing missing values to ‘NA’.

##       age                   workclass         fnlwgt       
##  Min.   :17.00   Private         :22696   Min.   :  12285  
##  1st Qu.:28.00   Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00   Local-gov       : 2093   Median : 178356  
##  Mean   :38.58   State-gov       : 1298   Mean   : 189778  
##  3rd Qu.:48.00   Self-emp-inc    : 1116   3rd Qu.: 237051  
##  Max.   :90.00   (Other)         :  981   Max.   :1484705  
##                  NA's            : 1836                    
##         education     education.num                 marital.status 
##  HS-grad     :10501   Min.   : 1.00   Divorced             : 4443  
##  Some-college: 7291   1st Qu.: 9.00   Married-AF-spouse    :   23  
##  Bachelors   : 5355   Median :10.00   Married-civ-spouse   :14976  
##  Masters     : 1723   Mean   :10.08   Married-spouse-absent:  418  
##  Assoc-voc   : 1382   3rd Qu.:12.00   Never-married        :10683  
##  11th        : 1175   Max.   :16.00   Separated            : 1025  
##  (Other)     : 5134                   Widowed              :  993  
##            occupation            relationship                   race      
##  Prof-specialty : 4140   Husband       :13193   Amer-Indian-Eskimo:  311  
##  Craft-repair   : 4099   Not-in-family : 8305   Asian-Pac-Islander: 1039  
##  Exec-managerial: 4066   Other-relative:  981   Black             : 3124  
##  Adm-clerical   : 3770   Own-child     : 5068   Other             :  271  
##  Sales          : 3650   Unmarried     : 3446   White             :27816  
##  (Other)        :10993   Wife          : 1568                             
##  NA's           : 1843                                                    
##      sex         capital.gain    capital.loss    hours.per.week 
##  Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##  Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median :    0   Median :   0.0   Median :40.00  
##                 Mean   : 1078   Mean   :  87.3   Mean   :40.44  
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##        native.country    income     
##  United-States:29170   <=50K:24720  
##  Mexico       :  643   >50K : 7841  
##  Philippines  :  198                
##  Germany      :  137                
##  Canada       :  121                
##  (Other)      : 1709                
##  NA's         :  583

Data Cleaning Process

Check for ‘NA’ values and look how many unique values there are for each variable.

##            age      workclass         fnlwgt      education  education.num 
##              0           1836              0              0              0 
## marital.status     occupation   relationship           race            sex 
##              0           1843              0              0              0 
##   capital.gain   capital.loss hours.per.week native.country         income 
##              0              0              0            583              0
##            age      workclass         fnlwgt      education  education.num 
##             73              9          21648             16             16 
## marital.status     occupation   relationship           race            sex 
##              7             15              6              5              2 
##   capital.gain   capital.loss hours.per.week native.country         income 
##            119             92             94             42              2

## 
## FALSE  TRUE 
##  2399 30162

Approximate 7%(2399/32561) of the total data has missing value. They are mainly in variables ‘occupation’, ‘workclass’ and ‘native country’. I decided to remove those missing values because I don’t think its a good idea to replace categorical values by imputing.

Explore Numeric Variables With Income Levels

“Age”, “Years of education” and “hours per week” all show significant variations with income level. Therefore, they will be kept for the regression analysis. “Final Weight” does not show any variation with income level, therefore, it will be excluded from the analysis. Its hard to see whether “Capital gain” and “Capital loss” have variation with Income level from the above plot, so I will keep them for now.

Explore Categorical Variables With Income Levels

Most of the data was collected from the United States, so variable “native country” does not have effect in my analysis, I will exclude it from regression model. And all the other categorial variables seem to have reasonable variation, so will be kept.

Convert income level to 0’s and 1’s,“<=50K” will be 0 and “>50K”" will be 1(binary outcome).

Model Fitting

split the data into two chunks: training and testing set.

Fit the model

## 
## Call:
## glm(formula = income ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1023  -0.5221  -0.1872   0.0581   3.3415  
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                         -6.946e+00  4.711e-01 -14.743  < 2e-16
## age                                  2.436e-02  1.888e-03  12.899  < 2e-16
## workclassLocal-gov                  -6.826e-01  1.259e-01  -5.420 5.97e-08
## workclassPrivate                    -4.959e-01  1.049e-01  -4.727 2.28e-06
## workclassSelf-emp-inc               -3.309e-01  1.383e-01  -2.392 0.016734
## workclassSelf-emp-not-inc           -1.004e+00  1.228e-01  -8.181 2.82e-16
## workclassState-gov                  -7.998e-01  1.401e-01  -5.708 1.15e-08
## workclassWithout-pay                -1.306e+01  2.386e+02  -0.055 0.956366
## education11th                       -7.654e-03  2.285e-01  -0.034 0.973272
## education12th                        2.253e-01  3.101e-01   0.727 0.467456
## education1st-4th                    -6.456e-01  5.343e-01  -1.208 0.226955
## education5th-6th                    -7.638e-01  3.970e-01  -1.924 0.054348
## education7th-8th                    -7.297e-01  2.656e-01  -2.748 0.006003
## education9th                        -6.099e-01  3.093e-01  -1.972 0.048662
## educationAssoc-acdm                  1.165e+00  1.941e-01   6.003 1.94e-09
## educationAssoc-voc                   1.088e+00  1.863e-01   5.840 5.23e-09
## educationBachelors                   1.784e+00  1.721e-01  10.366  < 2e-16
## educationDoctorate                   2.773e+00  2.433e-01  11.397  < 2e-16
## educationHS-grad                     6.298e-01  1.671e-01   3.769 0.000164
## educationMasters                     2.116e+00  1.851e-01  11.433  < 2e-16
## educationPreschool                  -2.026e+01  1.422e+02  -0.142 0.886719
## educationProf-school                 2.641e+00  2.266e-01  11.654  < 2e-16
## educationSome-college                9.415e-01  1.698e-01   5.544 2.96e-08
## education.num                               NA         NA      NA       NA
## marital.statusMarried-AF-spouse      3.113e+00  6.846e-01   4.548 5.43e-06
## marital.statusMarried-civ-spouse     2.071e+00  3.025e-01   6.846 7.57e-12
## marital.statusMarried-spouse-absent -9.842e-02  2.684e-01  -0.367 0.713876
## marital.statusNever-married         -4.215e-01  9.933e-02  -4.243 2.20e-05
## marital.statusSeparated              4.981e-02  1.795e-01   0.277 0.781443
## marital.statusWidowed                1.641e-01  1.811e-01   0.906 0.364857
## occupationArmed-Forces              -1.030e+00  1.519e+00  -0.678 0.497620
## occupationCraft-repair               1.312e-01  8.996e-02   1.458 0.144763
## occupationExec-managerial            8.642e-01  8.719e-02   9.912  < 2e-16
## occupationFarming-fishing           -9.219e-01  1.556e-01  -5.927 3.09e-09
## occupationHandlers-cleaners         -6.340e-01  1.624e-01  -3.903 9.50e-05
## occupationMachine-op-inspct         -2.581e-01  1.148e-01  -2.248 0.024571
## occupationOther-service             -7.466e-01  1.302e-01  -5.735 9.74e-09
## occupationPriv-house-serv           -4.065e+00  1.761e+00  -2.309 0.020966
## occupationProf-specialty             5.969e-01  9.185e-02   6.499 8.11e-11
## occupationProtective-serv            6.089e-01  1.401e-01   4.345 1.39e-05
## occupationSales                      3.265e-01  9.278e-02   3.519 0.000434
## occupationTech-support               7.090e-01  1.247e-01   5.684 1.31e-08
## occupationTransport-moving           6.729e-03  1.104e-01   0.061 0.951385
## relationshipNot-in-family            3.903e-01  2.989e-01   1.306 0.191598
## relationshipOther-relative          -4.501e-01  2.736e-01  -1.645 0.100029
## relationshipOwn-child               -7.227e-01  2.971e-01  -2.433 0.014990
## relationshipUnmarried                2.333e-01  3.171e-01   0.736 0.461827
## relationshipWife                     1.344e+00  1.176e-01  11.424  < 2e-16
## raceAsian-Pac-Islander               2.932e-01  2.811e-01   1.043 0.296968
## raceBlack                            4.212e-01  2.667e-01   1.579 0.114232
## raceOther                           -5.576e-01  4.370e-01  -1.276 0.201988
## raceWhite                            4.978e-01  2.549e-01   1.953 0.050844
## sexMale                              8.690e-01  8.981e-02   9.677  < 2e-16
## capital.gain                         3.239e-04  1.083e-05  29.899  < 2e-16
## capital.loss                         6.432e-04  3.870e-05  16.622  < 2e-16
## hours.per.week                       2.840e-02  1.912e-03  14.854  < 2e-16
##                                        
## (Intercept)                         ***
## age                                 ***
## workclassLocal-gov                  ***
## workclassPrivate                    ***
## workclassSelf-emp-inc               *  
## workclassSelf-emp-not-inc           ***
## workclassState-gov                  ***
## workclassWithout-pay                   
## education11th                          
## education12th                          
## education1st-4th                       
## education5th-6th                    .  
## education7th-8th                    ** 
## education9th                        *  
## educationAssoc-acdm                 ***
## educationAssoc-voc                  ***
## educationBachelors                  ***
## educationDoctorate                  ***
## educationHS-grad                    ***
## educationMasters                    ***
## educationPreschool                     
## educationProf-school                ***
## educationSome-college               ***
## education.num                          
## marital.statusMarried-AF-spouse     ***
## marital.statusMarried-civ-spouse    ***
## marital.statusMarried-spouse-absent    
## marital.statusNever-married         ***
## marital.statusSeparated                
## marital.statusWidowed                  
## occupationArmed-Forces                 
## occupationCraft-repair                 
## occupationExec-managerial           ***
## occupationFarming-fishing           ***
## occupationHandlers-cleaners         ***
## occupationMachine-op-inspct         *  
## occupationOther-service             ***
## occupationPriv-house-serv           *  
## occupationProf-specialty            ***
## occupationProtective-serv           ***
## occupationSales                     ***
## occupationTech-support              ***
## occupationTransport-moving             
## relationshipNot-in-family              
## relationshipOther-relative             
## relationshipOwn-child               *  
## relationshipUnmarried                  
## relationshipWife                    ***
## raceAsian-Pac-Islander                 
## raceBlack                              
## raceOther                              
## raceWhite                           .  
## sexMale                             ***
## capital.gain                        ***
## capital.loss                        ***
## hours.per.week                      ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27607  on 23999  degrees of freedom
## Residual deviance: 15626  on 23945  degrees of freedom
## AIC: 15736
## 
## Number of Fisher Scoring iterations: 13

Interpreting the results of the logistic regression model:

  1. “Age”, “Hours per week”, “sex”, “capital gain” and “capital loss” are the most statistically significant variables. Their lowest p-values suggesting a strong association with the probability of wage>50K from the data.
  2. “Workclass”, “education”, “marital status”, “occupation” and “relationship” are all across the table. so cannot be eliminated from the model.
  3. “Race” category is not statistically significant and can be eliminated from the model.

Run the anova() function on the model to analyze the table of deviance.

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: income
## 
## Terms added sequentially (first to last)
## 
## 
##                Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                           23999      27607              
## age             1   1390.0     23998      26217 < 2.2e-16 ***
## workclass       6    357.4     23992      25859 < 2.2e-16 ***
## education      15   3009.1     23977      22850 < 2.2e-16 ***
## education.num   0      0.0     23977      22850              
## marital.status  6   4121.0     23971      18729 < 2.2e-16 ***
## occupation     13    634.8     23958      18094 < 2.2e-16 ***
## relationship    5    167.8     23953      17927 < 2.2e-16 ***
## race            4     19.9     23949      17907 0.0005157 ***
## sex             1    136.3     23948      17770 < 2.2e-16 ***
## capital.gain    1   1625.5     23947      16145 < 2.2e-16 ***
## capital.loss    1    293.2     23946      15852 < 2.2e-16 ***
## hours.per.week  1    225.6     23945      15626 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference between the null deviance and the residual deviance indicates how the model is doing against the null model. The bigger difference, the better. From the above table we can see the drop in deviance when adding each variable one at a time. Adding age, workclass, education, marital status, occupation, relationship, race, sex, capital gain, capital loss and hours per week significantly reduces the residual deviance. education.num seem to have no effect.

Apply model to the test set

## [1] "Accuracy 0.844368711457319"

The 0.84 accuracy on the test set is a very encouraging result.

At last, plot the ROC curve and calculate the AUC (area under the curve). The closer AUC for a model comes to 1, the better predictive ability.

## [1] 0.8868877

The area under the curve corresponds the AUC.

As a last step, I have just learned, use the effects package to compute and plot all variables.

The End

I have been very caucious on removing variables because I don’t want to compromise the data as I may end up removing valid information. As a result, I may have kept variables that I should have removed such as “education.num”.