This is a smaller machine learning project exploring simple use of logisitc regression. The dataset we will look at is the “adult income”" dataset.

The task is to predict whether an individual has an income higher than 50k.

Setup

Loading libraries.

library(ggplot2)
library(dplyr)

The structure of the dataset “adult income”:

adult <- read.csv("adult.csv")
head(adult)
##   age workclass fnlwgt    education educational.num     marital.status
## 1  25   Private 226802         11th               7      Never-married
## 2  38   Private  89814      HS-grad               9 Married-civ-spouse
## 3  28 Local-gov 336951   Assoc-acdm              12 Married-civ-spouse
## 4  44   Private 160323 Some-college              10 Married-civ-spouse
## 5  18         ? 103497 Some-college              10      Never-married
## 6  34   Private 198693         10th               6      Never-married
##          occupation  relationship  race gender capital.gain capital.loss
## 1 Machine-op-inspct     Own-child Black   Male            0            0
## 2   Farming-fishing       Husband White   Male            0            0
## 3   Protective-serv       Husband White   Male            0            0
## 4 Machine-op-inspct       Husband Black   Male         7688            0
## 5                 ?     Own-child White Female            0            0
## 6     Other-service Not-in-family White   Male            0            0
##   hours.per.week native.country income
## 1             40  United-States  <=50K
## 2             50  United-States  <=50K
## 3             40  United-States   >50K
## 4             40  United-States   >50K
## 5             30  United-States  <=50K
## 6             30  United-States  <=50K

Data exploration

Basic data exploration.

str(adult)
## 'data.frame':    48842 obs. of  15 variables:
##  $ age            : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass      : Factor w/ 9 levels "?","Federal-gov",..: 5 5 3 5 1 5 1 7 5 5 ...
##  $ fnlwgt         : int  226802 89814 336951 160323 103497 198693 227026 104626 369667 104996 ...
##  $ education      : Factor w/ 16 levels "10th","11th",..: 2 12 8 16 16 1 12 15 16 6 ...
##  $ educational.num: int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 3 3 5 5 5 3 5 3 ...
##  $ occupation     : Factor w/ 15 levels "?","Adm-clerical",..: 8 6 12 8 1 9 1 11 9 4 ...
##  $ relationship   : Factor w/ 6 levels "Husband","Not-in-family",..: 4 1 1 1 4 2 5 1 5 1 ...
##  $ race           : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 3 5 5 3 5 5 3 5 5 5 ...
##  $ gender         : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 2 2 1 2 ...
##  $ capital.gain   : int  0 0 0 7688 0 0 0 3103 0 0 ...
##  $ capital.loss   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week : int  40 50 40 40 30 30 40 32 40 10 ...
##  $ native.country : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 40 40 40 40 40 40 ...
##  $ income         : Factor w/ 2 levels "<=50K",">50K": 1 1 2 2 1 1 1 2 1 1 ...
summary(adult)
##       age                   workclass         fnlwgt       
##  Min.   :17.00   Private         :33906   Min.   :  12285  
##  1st Qu.:28.00   Self-emp-not-inc: 3862   1st Qu.: 117551  
##  Median :37.00   Local-gov       : 3136   Median : 178145  
##  Mean   :38.64   ?               : 2799   Mean   : 189664  
##  3rd Qu.:48.00   State-gov       : 1981   3rd Qu.: 237642  
##  Max.   :90.00   Self-emp-inc    : 1695   Max.   :1490400  
##                  (Other)         : 1463                    
##         education     educational.num               marital.status 
##  HS-grad     :15784   Min.   : 1.00   Divorced             : 6633  
##  Some-college:10878   1st Qu.: 9.00   Married-AF-spouse    :   37  
##  Bachelors   : 8025   Median :10.00   Married-civ-spouse   :22379  
##  Masters     : 2657   Mean   :10.08   Married-spouse-absent:  628  
##  Assoc-voc   : 2061   3rd Qu.:12.00   Never-married        :16117  
##  11th        : 1812   Max.   :16.00   Separated            : 1530  
##  (Other)     : 7625                   Widowed              : 1518  
##            occupation            relationship                   race      
##  Prof-specialty : 6172   Husband       :19716   Amer-Indian-Eskimo:  470  
##  Craft-repair   : 6112   Not-in-family :12583   Asian-Pac-Islander: 1519  
##  Exec-managerial: 6086   Other-relative: 1506   Black             : 4685  
##  Adm-clerical   : 5611   Own-child     : 7581   Other             :  406  
##  Sales          : 5504   Unmarried     : 5125   White             :41762  
##  Other-service  : 4923   Wife          : 2331                             
##  (Other)        :14434                                                    
##     gender       capital.gain    capital.loss    hours.per.week 
##  Female:16192   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##  Male  :32650   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median :    0   Median :   0.0   Median :40.00  
##                 Mean   : 1079   Mean   :  87.5   Mean   :40.42  
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##        native.country    income     
##  United-States:43832   <=50K:37155  
##  Mexico       :  951   >50K :11687  
##  ?            :  857                
##  Philippines  :  295                
##  Germany      :  206                
##  Puerto-Rico  :  184                
##  (Other)      : 2517

Fixing the data

We have a lot of categorical variables, with too many factors that may be necessary for model building. In this section, we’ll collapse factors for these columns into fewer categories.

Workclass type

table(adult$workclass)
## 
##                ?      Federal-gov        Local-gov     Never-worked 
##             2799             1432             3136               10 
##          Private     Self-emp-inc Self-emp-not-inc        State-gov 
##            33906             1695             3862             1981 
##      Without-pay 
##               21

The type workclass has alot of similar factors, we put them in fewer categories.

levels(adult$workclass)[4] <- "Unemployed"
levels(adult$workclass)[9] <- "Unemployed"
levels(adult$workclass)[3] <- "SL-gov"
levels(adult$workclass)[8] <- "SL-gov"
levels(adult$workclass)[6:7] <- "self-emp"
table(adult$workclass)
## 
##           ? Federal-gov      SL-gov  Unemployed     Private    self-emp 
##        2799        1432        5117          31       33906        5557

Martial type

Again, here we put these in fewer categories.

table(adult$marital)
## 
##              Divorced     Married-AF-spouse    Married-civ-spouse 
##                  6633                    37                 22379 
## Married-spouse-absent         Never-married             Separated 
##                   628                 16117                  1530 
##               Widowed 
##                  1518

The type workclass has alot of similar factors, we put them in fewer categories.

levels(adult$marital)[1] <- "Not-Married"
levels(adult$marital)[6:7] <- "Not-Married"
levels(adult$marital)[2:4] <- "Married"
table(adult$marital)
## 
##   Not-Married       Married Never-married 
##          9681         23044         16117

Native country

The type for native country has very many factors.

levels(adult$native.country)
##  [1] "?"                          "Cambodia"                  
##  [3] "Canada"                     "China"                     
##  [5] "Columbia"                   "Cuba"                      
##  [7] "Dominican-Republic"         "Ecuador"                   
##  [9] "El-Salvador"                "England"                   
## [11] "France"                     "Germany"                   
## [13] "Greece"                     "Guatemala"                 
## [15] "Haiti"                      "Holand-Netherlands"        
## [17] "Honduras"                   "Hong"                      
## [19] "Hungary"                    "India"                     
## [21] "Iran"                       "Ireland"                   
## [23] "Italy"                      "Jamaica"                   
## [25] "Japan"                      "Laos"                      
## [27] "Mexico"                     "Nicaragua"                 
## [29] "Outlying-US(Guam-USVI-etc)" "Peru"                      
## [31] "Philippines"                "Poland"                    
## [33] "Portugal"                   "Puerto-Rico"               
## [35] "Scotland"                   "South"                     
## [37] "Taiwan"                     "Thailand"                  
## [39] "Trinadad&Tobago"            "United-States"             
## [41] "Vietnam"                    "Yugoslavia"

So we will put them in the following categories.

Asia <- c('China','Hong','India','Iran','Cambodia','Japan', 'Laos' ,
          'Philippines' ,'Vietnam' ,'Taiwan', 'Thailand')

North.America <- c('Canada','United-States','Puerto-Rico' )

Europe <- c('England' ,'France', 'Germany' ,'Greece','Holand-Netherlands','Hungary',
            'Ireland','Italy','Poland','Portugal','Scotland','Yugoslavia')

Latin.and.South.America <- c('Columbia','Cuba','Dominican-Republic','Ecuador',
                             'El-Salvador','Guatemala','Haiti','Honduras',
                             'Mexico','Nicaragua','Outlying-US(Guam-USVI-etc)','Peru',
                             'Jamaica','Trinadad&Tobago')
Other <- c('South')

With the following function we do this.

group_country <- function(ctry){
  if (ctry %in% Asia){
    return('Asia')
  }else if (ctry %in% North.America){
    return('North.America')
  }else if (ctry %in% Europe){
    return('Europe')
  }else if (ctry %in% Latin.and.South.America){
    return('Latin.and.South.America')
  }else{
    return('Other')      
  }
}
adult$native.country <- sapply(adult$native.country,group_country)
colnames(adult)[14] <- "region"

This creates the following categories.

table(adult$region)
## 
##                    Asia                  Europe Latin.and.South.America 
##                     981                     780                    1911 
##           North.America                   Other 
##                   44198                     972

Missing data

library(Amelia)

Making the columns factors again.

adult$workclass <- factor(adult$workclass)
adult$region <- factor(adult$region)
adult$marital <- factor(adult$marital)
adult$occupation <- factor(adult$occupation)

Ommitting missing values.

adult <- na.omit(adult)

Exploring the data

Start by looking at a histogram of ages, sorted by income.

ggplot(adult,aes(age)) + geom_histogram(aes(fill=income),color='black',binwidth=1) + theme_bw()

Histogram of hours worked per week:

ggplot(adult,aes(hours.per.week)) + geom_histogram(bins = 30) + theme_bw()

Barplot of region filled by income class:

ggplot(adult,aes(region)) + geom_bar(aes(fill=income),color="black") + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Building the model

Now we build the model we will use to classify the individuals. As we are trying to predict a binary outcome it makes sense to use logistic regression. ### Train-Test split Splitting the dataset in 70% training and 30% testing.

library(caTools)
set.seed(3)

sample <- sample.split(adult$income,SplitRatio=0.7)
train <- subset(adult,sample==T)
test <- subset(adult,sample==F)

The logistic model is created with the following code.

model <- glm(income ~ ., family = binomial(logit), data = train)
summary(model)
## 
## Call:
## glm(formula = income ~ ., family = binomial(logit), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.2318  -0.5238  -0.1960  -0.0131   3.7886  
## 
## Coefficients: (3 not defined because of singularities)
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                         -8.021e+00  4.529e-01 -17.711  < 2e-16
## age                                  2.626e-02  1.649e-03  15.928  < 2e-16
## workclassSL-gov                     -7.488e-01  1.018e-01  -7.354 1.92e-13
## workclassUnemployed                 -1.856e+00  1.080e+00  -1.718 0.085780
## workclassPrivate                    -5.197e-01  9.081e-02  -5.723 1.05e-08
## workclassself-emp                   -8.507e-01  1.011e-01  -8.414  < 2e-16
## fnlwgt                               7.463e-07  1.695e-07   4.404 1.06e-05
## education11th                        9.291e-02  2.172e-01   0.428 0.668780
## education12th                        6.625e-01  2.662e-01   2.489 0.012810
## education1st-4th                    -6.187e-01  5.291e-01  -1.169 0.242271
## education5th-6th                     8.413e-02  2.944e-01   0.286 0.775050
## education7th-8th                    -3.454e-01  2.372e-01  -1.456 0.145386
## education9th                        -1.254e-01  2.599e-01  -0.482 0.629476
## educationAssoc-acdm                  1.493e+00  1.804e-01   8.277  < 2e-16
## educationAssoc-voc                   1.464e+00  1.744e-01   8.394  < 2e-16
## educationBachelors                   2.062e+00  1.629e-01  12.665  < 2e-16
## educationDoctorate                   2.877e+00  2.158e-01  13.332  < 2e-16
## educationHS-grad                     9.376e-01  1.590e-01   5.895 3.75e-09
## educationMasters                     2.428e+00  1.720e-01  14.113  < 2e-16
## educationPreschool                  -1.099e+01  1.020e+02  -0.108 0.914198
## educationProf-school                 2.820e+00  2.054e-01  13.726  < 2e-16
## educationSome-college                1.316e+00  1.611e-01   8.172 3.04e-16
## educational.num                             NA         NA      NA       NA
## marital.statusMarried-AF-spouse      3.020e+00  5.297e-01   5.702 1.18e-08
## marital.statusMarried-civ-spouse     2.356e+00  2.633e-01   8.949  < 2e-16
## marital.statusMarried-spouse-absent  8.824e-02  2.200e-01   0.401 0.688354
## marital.statusNever-married         -3.992e-01  8.593e-02  -4.646 3.39e-06
## marital.statusSeparated             -7.093e-02  1.597e-01  -0.444 0.656871
## marital.statusWidowed                1.188e-01  1.510e-01   0.787 0.431195
## occupationArmed-Forces               5.629e-01  8.625e-01   0.653 0.513968
## occupationCraft-repair               9.386e-02  7.706e-02   1.218 0.223236
## occupationExec-managerial            7.981e-01  7.433e-02  10.737  < 2e-16
## occupationFarming-fishing           -9.485e-01  1.353e-01  -7.010 2.38e-12
## occupationHandlers-cleaners         -6.078e-01  1.354e-01  -4.491 7.10e-06
## occupationMachine-op-inspct         -2.690e-01  9.796e-02  -2.746 0.006034
## occupationOther-service             -8.864e-01  1.154e-01  -7.682 1.57e-14
## occupationPriv-house-serv           -2.230e+00  1.050e+00  -2.123 0.033736
## occupationProf-specialty             5.806e-01  7.811e-02   7.433 1.06e-13
## occupationProtective-serv            5.445e-01  1.215e-01   4.482 7.39e-06
## occupationSales                      2.625e-01  7.967e-02   3.295 0.000983
## occupationTech-support               6.877e-01  1.057e-01   6.508 7.61e-11
## occupationTransport-moving          -1.031e-01  9.708e-02  -1.062 0.288255
## relationshipNot-in-family            6.214e-01  2.605e-01   2.385 0.017057
## relationshipOther-relative          -4.388e-01  2.324e-01  -1.888 0.059061
## relationshipOwn-child               -4.132e-01  2.524e-01  -1.637 0.101559
## relationshipUnmarried                4.349e-01  2.765e-01   1.573 0.115645
## relationshipWife                     1.162e+00  1.008e-01  11.527  < 2e-16
## raceAsian-Pac-Islander               7.685e-01  2.633e-01   2.919 0.003515
## raceBlack                            3.862e-01  2.341e-01   1.649 0.099051
## raceOther                            2.304e-01  3.328e-01   0.692 0.488679
## raceWhite                            5.557e-01  2.237e-01   2.484 0.012994
## genderMale                           7.754e-01  7.662e-02  10.120  < 2e-16
## capital.gain                         3.159e-04  1.032e-05  30.603  < 2e-16
## capital.loss                         6.197e-04  3.666e-05  16.902  < 2e-16
## hours.per.week                       3.006e-02  1.623e-03  18.521  < 2e-16
## regionEurope                         5.885e-01  2.093e-01   2.812 0.004930
## regionLatin.and.South.America       -2.783e-01  2.099e-01  -1.326 0.184956
## regionNorth.America                  3.185e-01  1.688e-01   1.887 0.059196
## regionOther                          4.324e-02  1.912e-01   0.226 0.821045
## maritalMarried                              NA         NA      NA       NA
## maritalNever-married                        NA         NA      NA       NA
##                                        
## (Intercept)                         ***
## age                                 ***
## workclassSL-gov                     ***
## workclassUnemployed                 .  
## workclassPrivate                    ***
## workclassself-emp                   ***
## fnlwgt                              ***
## education11th                          
## education12th                       *  
## education1st-4th                       
## education5th-6th                       
## education7th-8th                       
## education9th                           
## educationAssoc-acdm                 ***
## educationAssoc-voc                  ***
## educationBachelors                  ***
## educationDoctorate                  ***
## educationHS-grad                    ***
## educationMasters                    ***
## educationPreschool                     
## educationProf-school                ***
## educationSome-college               ***
## educational.num                        
## marital.statusMarried-AF-spouse     ***
## marital.statusMarried-civ-spouse    ***
## marital.statusMarried-spouse-absent    
## marital.statusNever-married         ***
## marital.statusSeparated                
## marital.statusWidowed                  
## occupationArmed-Forces                 
## occupationCraft-repair                 
## occupationExec-managerial           ***
## occupationFarming-fishing           ***
## occupationHandlers-cleaners         ***
## occupationMachine-op-inspct         ** 
## occupationOther-service             ***
## occupationPriv-house-serv           *  
## occupationProf-specialty            ***
## occupationProtective-serv           ***
## occupationSales                     ***
## occupationTech-support              ***
## occupationTransport-moving             
## relationshipNot-in-family           *  
## relationshipOther-relative          .  
## relationshipOwn-child                  
## relationshipUnmarried                  
## relationshipWife                    ***
## raceAsian-Pac-Islander              ** 
## raceBlack                           .  
## raceOther                              
## raceWhite                           *  
## genderMale                          ***
## capital.gain                        ***
## capital.loss                        ***
## hours.per.week                      ***
## regionEurope                        ** 
## regionLatin.and.South.America          
## regionNorth.America                 .  
## regionOther                            
## maritalMarried                         
## maritalNever-married                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 36106  on 32222  degrees of freedom
## Residual deviance: 21065  on 32165  degrees of freedom
## AIC: 21181
## 
## Number of Fisher Scoring iterations: 13

Removing features iteratively using step:

new.model <- step(model)

Predictions and checking results.

test$predicted.income = predict(new.model, newdata=test, type="response")
table(test$income, test$predicted.income > 0.5)
##        
##         FALSE TRUE
##   <=50K  9661  722
##   >50K   1340 2087