Adult Dataset Income Prediction using Simple Classification Techniques

Data Set Information:

Extraction was done by Barry Becker from the 1994 Census database. The US Adult Census dataset is a repository of 48,842 entries extracted from the 1994 US Census database. Prediction task is to determine whether a person makes over 50K a year.

Objective:

The main objective of the dataset is to classify people earning <=50k or >50k based on several explanatory factors affecting the income of a person like Age, Occupation, Education, etc.

The methods we intend to use are:

Binary Logistic Regression
Decision Tree
Random Forest

Data cleaning method used in this article is kNN imputation using VIM library

Understanding the Data set:

The Census Income dataset has 48,842 entries. Each entry contains the following information about an individual:

age: The age of an individual

workclass: A general term to represent the employment status of an individual.

          * Private
          * Self.emp.not.inc
          * Self.emp.inc
          * Federal.gov
          * Local.gov
          * State.gov
          * Without.pay
          * Never.worked.

fnlwgt: Final weight. In other words, this is the number of people the census believes the entry represents.

education: The highest level of education achieved by an individual.

           * Bachelors
           * Some.college
           * 11th
           * HS.grad
           * Prof.school
           * Assoc.acdm
           * Assoc.voc
           * 9th
           * 7th.8th
           * 12th
           * Masters
           * 1st.4th
           * 10th
           * Doctorate
           * 5th.6th
           * Preschool.

education.num: The highest level of education achieved in numerical form.

marital.status: Marital status of an individual. Married.civ.spouse corresponds to a civilian spouse while Married.AF.spouse is a spouse in the Armed Forces.

           * Married.civ.spouse
           * Divorced
           * Never.married
           * Separated
           * Widowed
           * Married.spouse.absent
           * Married.AF.spouse

occupation: the general type of occupation of an individual

           * Tech.support
           * Craft.repair
           * Other.service
           * Sales
           * Exec.managerial
           * Prof.specialty
           * Handlers.cleaners
           * Machine.op.inspct
           * Adm.clerical
           * Farming.fishing
           * Transport.moving
           * Priv.house.serv
           * Protective.serv
           * Armed.Forces

relationship: Represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship attribute and is somewhat redundant with marital status. We might not make use of this attribute at all
```
           * Wife
           * Own.child
           * Husband
           * Not.in.family
           * Other.relative
           * Unmarried.
```

race: Descriptions of an individual’s race

           * White
           * Asian.Pac.Islander
           * Amer.Indian.Eskimo
           * Other
           * Black

sex: the biological sex of the individual
```
           * Male
           * Female
```
capital.gain: capital gains for an individual
capital.loss: capital loss for an individual
hours.per.week: the hours an individual has reported to work per week

native.country: country of origin for an individual

           * United.States
           * Cambodia
           * England
           * Puerto.Rico
           * Canada
           * Germany
           * Outlying.US(Guam.USVI.etc) 
           * India 
           * Japan 
           * Greece 
           * South
           * China 
           * Cuba 
           * Iran
           * Honduras 
           * Philippines 
           * Italy
           * Poland
           * Jamaica 
           * Vietnam 
           * Mexico 
           * Portugal
           * Ireland 
           * France 
           * Dominican.Republic
           * Laos
           * Ecuador 
           * Taiwan 
           * Haiti 
           * Columbia
           * Hungary 
           * Guatemala 
           * Nicaragua 
           * Scotland 
           * Thailand 
           * Yugoslavia
           * El.Salvador
           * Trinadad&Tobago
           * Peru
           * Hong 
           * Holand.Netherlands

the label: whether or not an individual makes more than $50,000 annually.
```
           * <=50k
           * >50k
```

Loading the data and performing EDA

adult<-read.csv("C:/Users/Rohit/Desktop/R/Data sets/adult.csv",header=T)
# header = T : if file with header is given then we use this arg (will not include first row for calculations)
#              if this arg is not added it will consider first row for calculations

head(adult) # to check the first 6 observations in the data

##   age workclass fnlwgt    education education.num marital.status
## 1  90         ?  77053      HS-grad             9        Widowed
## 2  82   Private 132870      HS-grad             9        Widowed
## 3  66         ? 186061 Some-college            10        Widowed
## 4  54   Private 140359      7th-8th             4       Divorced
## 5  41   Private 264663 Some-college            10      Separated
## 6  34   Private 216864      HS-grad             9       Divorced
##          occupation  relationship  race    sex capital.gain capital.loss
## 1                 ? Not-in-family White Female            0         4356
## 2   Exec-managerial Not-in-family White Female            0         4356
## 3                 ?     Unmarried Black Female            0         4356
## 4 Machine-op-inspct     Unmarried White Female            0         3900
## 5    Prof-specialty     Own-child White Female            0         3900
## 6     Other-service     Unmarried White Female            0         3770
##   hours.per.week native.country income
## 1             40  United-States  <=50K
## 2             18  United-States  <=50K
## 3             40  United-States  <=50K
## 4             40  United-States  <=50K
## 5             40  United-States  <=50K
## 6             45  United-States  <=50K

EDA of the dependent variable

The original dataset contains a distribution of 24.08% entries labeled with >50k and 75.91% entries labeled with <=50k. The following graphs and statistics pertain to the orignal dataset.

library(ggplot2)
barplot(table(adult$income),main = 'Income Classification',col='blue',ylab ='No. of people')

Let’s check the summary of the data

summary(adult)

##       age                   workclass         fnlwgt       
##  Min.   :17.00   Private         :22696   Min.   :  12285  
##  1st Qu.:28.00   Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00   Local-gov       : 2093   Median : 178356  
##  Mean   :38.58   ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00   State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00   Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)         :  981                    
##         education     education.num                 marital.status 
##  HS-grad     :10501   Min.   : 1.00   Divorced             : 4443  
##  Some-college: 7291   1st Qu.: 9.00   Married-AF-spouse    :   23  
##  Bachelors   : 5355   Median :10.00   Married-civ-spouse   :14976  
##  Masters     : 1723   Mean   :10.08   Married-spouse-absent:  418  
##  Assoc-voc   : 1382   3rd Qu.:12.00   Never-married        :10683  
##  11th        : 1175   Max.   :16.00   Separated            : 1025  
##  (Other)     : 5134                   Widowed              :  993  
##            occupation           relationship                   race      
##  Prof-specialty :4140   Husband       :13193   Amer-Indian-Eskimo:  311  
##  Craft-repair   :4099   Not-in-family : 8305   Asian-Pac-Islander: 1039  
##  Exec-managerial:4066   Other-relative:  981   Black             : 3124  
##  Adm-clerical   :3770   Own-child     : 5068   Other             :  271  
##  Sales          :3650   Unmarried     : 3446   White             :27816  
##  Other-service  :3295   Wife          : 1568                             
##  (Other)        :9541                                                    
##      sex         capital.gain    capital.loss    hours.per.week 
##  Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##  Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median :    0   Median :   0.0   Median :40.00  
##                 Mean   : 1078   Mean   :  87.3   Mean   :40.44  
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##        native.country    income     
##  United-States:29170   <=50K:24720  
##  Mexico       :  643   >50K : 7841  
##  ?            :  583                
##  Philippines  :  198                
##  Germany      :  137                
##  Canada       :  121                
##  (Other)      : 1709

Now as we can see some of the values in coloumns are marked as ‘?’

Let us first convert these to NA while loading the data itself

adult1<-read.csv("C:/Users/Rohit/Desktop/R/Data sets/adult.csv",na.strings = c("?","NA"))
# This will replace '?' with 'NA'

# Let's check summary again
summary(adult1)

##       age                   workclass         fnlwgt       
##  Min.   :17.00   Private         :22696   Min.   :  12285  
##  1st Qu.:28.00   Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00   Local-gov       : 2093   Median : 178356  
##  Mean   :38.58   State-gov       : 1298   Mean   : 189778  
##  3rd Qu.:48.00   Self-emp-inc    : 1116   3rd Qu.: 237051  
##  Max.   :90.00   (Other)         :  981   Max.   :1484705  
##                  NA's            : 1836                    
##         education     education.num                 marital.status 
##  HS-grad     :10501   Min.   : 1.00   Divorced             : 4443  
##  Some-college: 7291   1st Qu.: 9.00   Married-AF-spouse    :   23  
##  Bachelors   : 5355   Median :10.00   Married-civ-spouse   :14976  
##  Masters     : 1723   Mean   :10.08   Married-spouse-absent:  418  
##  Assoc-voc   : 1382   3rd Qu.:12.00   Never-married        :10683  
##  11th        : 1175   Max.   :16.00   Separated            : 1025  
##  (Other)     : 5134                   Widowed              :  993  
##            occupation            relationship                   race      
##  Prof-specialty : 4140   Husband       :13193   Amer-Indian-Eskimo:  311  
##  Craft-repair   : 4099   Not-in-family : 8305   Asian-Pac-Islander: 1039  
##  Exec-managerial: 4066   Other-relative:  981   Black             : 3124  
##  Adm-clerical   : 3770   Own-child     : 5068   Other             :  271  
##  Sales          : 3650   Unmarried     : 3446   White             :27816  
##  (Other)        :10993   Wife          : 1568                             
##  NA's           : 1843                                                    
##      sex         capital.gain    capital.loss    hours.per.week 
##  Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##  Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median :    0   Median :   0.0   Median :40.00  
##                 Mean   : 1078   Mean   :  87.3   Mean   :40.44  
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##        native.country    income     
##  United-States:29170   <=50K:24720  
##  Mexico       :  643   >50K : 7841  
##  Philippines  :  198                
##  Germany      :  137                
##  Canada       :  121                
##  (Other)      : 1709                
##  NA's         :  583

As you can see we have replaced missing values in our data marked as ‘?’ with ‘NA’

Another way to check the number of NA’s in our data coloumn wise

colSums(is.na(adult1))

##            age      workclass         fnlwgt      education  education.num 
##              0           1836              0              0              0 
## marital.status     occupation   relationship           race            sex 
##              0           1843              0              0              0 
##   capital.gain   capital.loss hours.per.week native.country         income 
##              0              0              0            583              0

We can also check the structure of our data

str(adult1)

## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  90 82 66 54 41 34 38 74 68 41 ...
##  $ workclass     : Factor w/ 8 levels "Federal-gov",..: NA 4 NA 4 4 4 4 7 1 4 ...
##  $ fnlwgt        : int  77053 132870 186061 140359 264663 216864 150601 88638 422013 70037 ...
##  $ education     : Factor w/ 16 levels "10th","11th",..: 12 12 16 6 16 12 1 11 12 16 ...
##  $ education.num : int  9 9 10 4 10 9 6 16 9 10 ...
##  $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 7 7 7 1 6 1 6 5 1 5 ...
##  $ occupation    : Factor w/ 14 levels "Adm-clerical",..: NA 4 NA 7 10 8 1 10 10 3 ...
##  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 2 5 5 4 5 5 3 2 5 ...
##  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 3 5 5 5 5 5 5 5 ...
##  $ sex           : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 1 1 2 ...
##  $ capital.gain  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss  : int  4356 4356 4356 3900 3900 3770 3770 3683 3683 3004 ...
##  $ hours.per.week: int  40 18 40 40 40 45 40 20 40 60 ...
##  $ native.country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 39 39 39 39 39 NA ...
##  $ income        : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 1 2 ...

Concluding a few things before applying the various classification algorithms

It is observed that some variables are not self-explanatory.

capital_gain and capital_loss are income from other sources like investments other than salary which have no relevance here.

The continuous variable fnlwgt represents final weight, which is the number of units in the target population that the responding unit represents.

The variable education_num stands for the number of years of education in total, which is a continuous representation of the discrete variable education.

The variable relationship represents the responding members’s role in the family.

For simplicity of this analysis, the following variables are removed education.num, relationship, fnlwgt, capital.gain and capital.loss

adult1$capital.gain<-NULL
adult1$capital.loss<-NULL
adult1$fnlwgt<-NULL
adult1$education.num<-NULL
adult1$relationship<-NULL

kNN imputation to replace NAs

Here library(VIM) is required to impute missing values. kNN imputation is preferred over the conventional method of replacing with mean, median and mode as it is supposed to be more justified. It may occur that a person whose age is missing and earns >50k is alloted a median age which may not be true. kNN inputation will consider all the observations and based on the historical data will assign a better value.

VIM stands for Visualization and imputation of missing values

Note: This may take a while based on sample size

library(VIM)

## Loading required package: colorspace

## Loading required package: grid

## Loading required package: data.table

## VIM is ready to use. 
##  Since version 4.0.0 the GUI is in its own package VIMGUI.
## 
##           Please use the package to use the new (and old) GUI.

## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues

## 
## Attaching package: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep

# as it is observed only the following coloumns have NAs in them, we specifically perform kNN imputation on these 3 variables

adult2<-kNN(adult1,variable = c("workclass","occupation","native.country"),k=sqrt(nrow(adult1)))

colSums(is.na(adult2)) # to verify if NAs removed

##                age          workclass          education 
##                  0                  0                  0 
##     marital.status         occupation               race 
##                  0                  0                  0 
##                sex     hours.per.week     native.country 
##                  0                  0                  0 
##             income      workclass_imp     occupation_imp 
##                  0                  0                  0 
## native.country_imp 
##                  0

This action also creates some dummy variables at the end of the data. You can check this by

head(adult2)

##   age workclass    education marital.status        occupation  race    sex
## 1  90   Private      HS-grad        Widowed      Adm-clerical White Female
## 2  82   Private      HS-grad        Widowed   Exec-managerial White Female
## 3  66   Private Some-college        Widowed      Adm-clerical Black Female
## 4  54   Private      7th-8th       Divorced Machine-op-inspct White Female
## 5  41   Private Some-college      Separated    Prof-specialty White Female
## 6  34   Private      HS-grad       Divorced     Other-service White Female
##   hours.per.week native.country income workclass_imp occupation_imp
## 1             40  United-States  <=50K          TRUE           TRUE
## 2             18  United-States  <=50K         FALSE          FALSE
## 3             40  United-States  <=50K          TRUE           TRUE
## 4             40  United-States  <=50K         FALSE          FALSE
## 5             40  United-States  <=50K         FALSE          FALSE
## 6             45  United-States  <=50K         FALSE          FALSE
##   native.country_imp
## 1              FALSE
## 2              FALSE
## 3              FALSE
## 4              FALSE
## 5              FALSE
## 6              FALSE

So now we create another data set exculding the dummy variables

adult3<-adult2[,1:10]

head(adult3) # to verify if dummy variables removed

##   age workclass    education marital.status        occupation  race    sex
## 1  90   Private      HS-grad        Widowed      Adm-clerical White Female
## 2  82   Private      HS-grad        Widowed   Exec-managerial White Female
## 3  66   Private Some-college        Widowed      Adm-clerical Black Female
## 4  54   Private      7th-8th       Divorced Machine-op-inspct White Female
## 5  41   Private Some-college      Separated    Prof-specialty White Female
## 6  34   Private      HS-grad       Divorced     Other-service White Female
##   hours.per.week native.country income
## 1             40  United-States  <=50K
## 2             18  United-States  <=50K
## 3             40  United-States  <=50K
## 4             40  United-States  <=50K
## 5             40  United-States  <=50K
## 6             45  United-States  <=50K

dim(adult3) # gives the number of variables and coloumns in our dataset

## [1] 32561    10

More EDA

Let’s check income with respect to age

library(ggplot2)
ggplot(adult3) + aes(x=as.numeric(age), group=income, fill=income) + 
  geom_histogram(binwidth=1, color='black')+
  labs(x="Age",y="Count",title = "Income w.r.t Age")

As we notice majority of the people make less than <50k a year. However, we observe people earning >50k are in their mid career. We make this hypothesis based on the age.

Let’s check the same for workclass

barplot(table(adult3$workclass),main = 'Income Classification w.r.t workclass',col='blue',ylab ='No. of people')

We can conclude that people working in private sector earn significantly better than the ones in other classes.

Performing Logistic Regression

Dividing data in Training and Testing Datasets

Let’s put 75% data in training and 25% in testing dataset

library(caret) # classification and regression training

## Loading required package: lattice

index<-createDataPartition(adult3$age,p=0.75,list = F)
# argument 'list=F' is added so that it takes only indexes of the observations and not make a list row wise
train_adult<-adult3[index,]
test_adult<-adult3[-index,]
dim(train_adult)

## [1] 24423    10

dim(test_adult)

## [1] 8138   10

# model implementation
adult_blr<-glm(income~.,data = train_adult,family = "binomial")
# argument (family = "binomial") is necessary as we are creating a model with dichotomous result
summary(adult_blr)

## 
## Call:
## glm(formula = income ~ ., family = "binomial", data = train_adult)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6409  -0.5524  -0.2375  -0.0502   3.6781  
## 
## Coefficients:
##                                            Estimate Std. Error z value
## (Intercept)                              -5.437e+00  7.727e-01  -7.037
## age                                       2.657e-02  1.747e-03  15.211
## workclassLocal-gov                       -6.492e-01  1.228e-01  -5.287
## workclassNever-worked                    -1.240e+01  4.507e+02  -0.028
## workclassPrivate                         -5.293e-01  1.018e-01  -5.198
## workclassSelf-emp-inc                    -1.969e-01  1.338e-01  -1.471
## workclassSelf-emp-not-inc                -9.236e-01  1.198e-01  -7.711
## workclassState-gov                       -8.552e-01  1.371e-01  -6.236
## workclassWithout-pay                     -1.423e+01  3.749e+02  -0.038
## education11th                             3.119e-01  2.304e-01   1.354
## education12th                             6.217e-01  2.859e-01   2.175
## education1st-4th                         -2.614e-01  4.873e-01  -0.537
## education5th-6th                         -3.212e-01  3.719e-01  -0.864
## education7th-8th                         -4.524e-01  2.622e-01  -1.725
## education9th                             -5.837e-01  3.228e-01  -1.808
## educationAssoc-acdm                       1.519e+00  1.944e-01   7.813
## educationAssoc-voc                        1.531e+00  1.874e-01   8.165
## educationBachelors                        2.118e+00  1.751e-01  12.094
## educationDoctorate                        3.160e+00  2.349e-01  13.453
## educationHS-grad                          9.300e-01  1.708e-01   5.445
## educationMasters                          2.528e+00  1.859e-01  13.599
## educationPreschool                       -1.197e+01  2.011e+02  -0.060
## educationProf-school                      3.124e+00  2.183e-01  14.310
## educationSome-college                     1.263e+00  1.734e-01   7.283
## marital.statusMarried-AF-spouse           2.809e+00  5.418e-01   5.184
## marital.statusMarried-civ-spouse          2.037e+00  7.044e-02  28.922
## marital.statusMarried-spouse-absent      -6.109e-02  2.455e-01  -0.249
## marital.statusNever-married              -4.985e-01  8.663e-02  -5.754
## marital.statusSeparated                  -3.342e-01  1.796e-01  -1.861
## marital.statusWidowed                    -3.172e-03  1.547e-01  -0.021
## occupationArmed-Forces                   -1.360e+01  5.012e+02  -0.027
## occupationCraft-repair                   -1.596e-02  8.429e-02  -0.189
## occupationExec-managerial                 7.675e-01  8.090e-02   9.488
## occupationFarming-fishing                -1.008e+00  1.520e-01  -6.632
## occupationHandlers-cleaners              -7.481e-01  1.578e-01  -4.741
## occupationMachine-op-inspct              -3.224e-01  1.111e-01  -2.902
## occupationOther-service                  -8.922e-01  1.256e-01  -7.106
## occupationPriv-house-serv                -2.711e+00  1.175e+00  -2.307
## occupationProf-specialty                  4.967e-01  8.530e-02   5.822
## occupationProtective-serv                 4.257e-01  1.376e-01   3.093
## occupationSales                           2.345e-01  8.642e-02   2.713
## occupationTech-support                    5.366e-01  1.192e-01   4.501
## occupationTransport-moving               -1.447e-01  1.071e-01  -1.351
## raceAsian-Pac-Islander                    3.270e-01  2.800e-01   1.168
## raceBlack                                 2.807e-01  2.384e-01   1.177
## raceOther                                -4.965e-01  3.969e-01  -1.251
## raceWhite                                 3.908e-01  2.260e-01   1.729
## sexMale                                   1.391e-01  5.695e-02   2.442
## hours.per.week                            3.075e-02  1.734e-03  17.729
## native.countryCanada                     -9.297e-01  7.552e-01  -1.231
## native.countryChina                      -1.704e+00  7.940e-01  -2.146
## native.countryColumbia                   -1.529e+01  1.834e+02  -0.083
## native.countryCuba                       -6.111e-01  7.746e-01  -0.789
## native.countryDominican-Republic         -1.929e+00  1.047e+00  -1.842
## native.countryEcuador                    -9.081e-01  9.979e-01  -0.910
## native.countryEl-Salvador                -1.109e+00  8.502e-01  -1.304
## native.countryEngland                    -8.714e-01  7.797e-01  -1.118
## native.countryFrance                     -3.244e-01  8.982e-01  -0.361
## native.countryGermany                    -3.924e-01  7.460e-01  -0.526
## native.countryGreece                     -1.452e+00  9.031e-01  -1.607
## native.countryGuatemala                  -4.279e-01  1.006e+00  -0.425
## native.countryHaiti                      -1.004e+00  1.055e+00  -0.951
## native.countryHoland-Netherlands         -1.251e+01  1.455e+03  -0.009
## native.countryHonduras                   -1.306e+01  4.790e+02  -0.027
## native.countryHong                       -1.431e+00  1.035e+00  -1.382
## native.countryHungary                    -6.942e-01  1.149e+00  -0.604
## native.countryIndia                      -1.077e+00  7.561e-01  -1.425
## native.countryIran                       -7.714e-01  8.215e-01  -0.939
## native.countryIreland                    -1.678e-01  9.493e-01  -0.177
## native.countryItaly                       3.826e-02  7.857e-01   0.049
## native.countryJamaica                    -9.682e-01  8.599e-01  -1.126
## native.countryJapan                      -5.582e-01  8.262e-01  -0.676
## native.countryLaos                       -1.827e+00  1.416e+00  -1.290
## native.countryMexico                     -1.474e+00  7.440e-01  -1.982
## native.countryNicaragua                  -1.663e+00  1.066e+00  -1.559
## native.countryOutlying-US(Guam-USVI-etc) -1.391e+01  4.116e+02  -0.034
## native.countryPeru                       -1.500e+00  1.079e+00  -1.390
## native.countryPhilippines                -6.902e-01  7.226e-01  -0.955
## native.countryPoland                     -9.914e-01  8.347e-01  -1.188
## native.countryPortugal                   -1.378e+00  1.006e+00  -1.369
## native.countryPuerto-Rico                -1.102e+00  8.130e-01  -1.355
## native.countryScotland                   -7.641e-01  1.051e+00  -0.727
## native.countrySouth                      -1.628e+00  7.814e-01  -2.083
## native.countryTaiwan                     -9.425e-01  8.150e-01  -1.156
## native.countryThailand                   -1.209e+00  1.090e+00  -1.109
## native.countryTrinadad&Tobago            -1.525e+00  1.325e+00  -1.151
## native.countryUnited-States              -7.606e-01  7.011e-01  -1.085
## native.countryVietnam                    -2.136e+00  9.154e-01  -2.334
## native.countryYugoslavia                 -4.662e-01  1.031e+00  -0.452
##                                          Pr(>|z|)    
## (Intercept)                              1.97e-12 ***
## age                                       < 2e-16 ***
## workclassLocal-gov                       1.24e-07 ***
## workclassNever-worked                     0.97804    
## workclassPrivate                         2.02e-07 ***
## workclassSelf-emp-inc                     0.14132    
## workclassSelf-emp-not-inc                1.25e-14 ***
## workclassState-gov                       4.50e-10 ***
## workclassWithout-pay                      0.96973    
## education11th                             0.17583    
## education12th                             0.02966 *  
## education1st-4th                          0.59158    
## education5th-6th                          0.38767    
## education7th-8th                          0.08447 .  
## education9th                              0.07060 .  
## educationAssoc-acdm                      5.57e-15 ***
## educationAssoc-voc                       3.21e-16 ***
## educationBachelors                        < 2e-16 ***
## educationDoctorate                        < 2e-16 ***
## educationHS-grad                         5.17e-08 ***
## educationMasters                          < 2e-16 ***
## educationPreschool                        0.95254    
## educationProf-school                      < 2e-16 ***
## educationSome-college                    3.26e-13 ***
## marital.statusMarried-AF-spouse          2.18e-07 ***
## marital.statusMarried-civ-spouse          < 2e-16 ***
## marital.statusMarried-spouse-absent       0.80352    
## marital.statusNever-married              8.70e-09 ***
## marital.statusSeparated                   0.06270 .  
## marital.statusWidowed                     0.98364    
## occupationArmed-Forces                    0.97836    
## occupationCraft-repair                    0.84986    
## occupationExec-managerial                 < 2e-16 ***
## occupationFarming-fishing                3.31e-11 ***
## occupationHandlers-cleaners              2.13e-06 ***
## occupationMachine-op-inspct               0.00370 ** 
## occupationOther-service                  1.20e-12 ***
## occupationPriv-house-serv                 0.02105 *  
## occupationProf-specialty                 5.80e-09 ***
## occupationProtective-serv                 0.00198 ** 
## occupationSales                           0.00666 ** 
## occupationTech-support                   6.75e-06 ***
## occupationTransport-moving                0.17664    
## raceAsian-Pac-Islander                    0.24289    
## raceBlack                                 0.23906    
## raceOther                                 0.21088    
## raceWhite                                 0.08378 .  
## sexMale                                   0.01459 *  
## hours.per.week                            < 2e-16 ***
## native.countryCanada                      0.21827    
## native.countryChina                       0.03187 *  
## native.countryColumbia                    0.93357    
## native.countryCuba                        0.43014    
## native.countryDominican-Republic          0.06544 .  
## native.countryEcuador                     0.36278    
## native.countryEl-Salvador                 0.19216    
## native.countryEngland                     0.26374    
## native.countryFrance                      0.71800    
## native.countryGermany                     0.59889    
## native.countryGreece                      0.10795    
## native.countryGuatemala                   0.67068    
## native.countryHaiti                       0.34138    
## native.countryHoland-Netherlands          0.99314    
## native.countryHonduras                    0.97824    
## native.countryHong                        0.16692    
## native.countryHungary                     0.54568    
## native.countryIndia                       0.15420    
## native.countryIran                        0.34774    
## native.countryIreland                     0.85974    
## native.countryItaly                       0.96117    
## native.countryJamaica                     0.26019    
## native.countryJapan                       0.49928    
## native.countryLaos                        0.19705    
## native.countryMexico                      0.04750 *  
## native.countryNicaragua                   0.11897    
## native.countryOutlying-US(Guam-USVI-etc)  0.97303    
## native.countryPeru                        0.16444    
## native.countryPhilippines                 0.33950    
## native.countryPoland                      0.23491    
## native.countryPortugal                    0.17085    
## native.countryPuerto-Rico                 0.17544    
## native.countryScotland                    0.46740    
## native.countrySouth                       0.03724 *  
## native.countryTaiwan                      0.24754    
## native.countryThailand                    0.26737    
## native.countryTrinadad&Tobago             0.24988    
## native.countryUnited-States               0.27797    
## native.countryVietnam                     0.01962 *  
## native.countryYugoslavia                  0.65106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26789  on 24422  degrees of freedom
## Residual deviance: 17212  on 24334  degrees of freedom
## AIC: 17390
## 
## Number of Fisher Scoring iterations: 14

# to check how well is our model built we need to calculate predicted porobabilities
# also our calculated probabilities need to be classified
# in order to do that we also need to decide the threshold that best classifies our predicted results

train_adult$pred_prob_income<-fitted(adult_blr) 
# this coloumn will have predicted probabilties of being 1
head(train_adult) # run the command to check if the new coloumn is added

##   age   workclass    education marital.status      occupation  race    sex
## 1  90     Private      HS-grad        Widowed    Adm-clerical White Female
## 2  82     Private      HS-grad        Widowed Exec-managerial White Female
## 5  41     Private Some-college      Separated  Prof-specialty White Female
## 6  34     Private      HS-grad       Divorced   Other-service White Female
## 8  74   State-gov    Doctorate  Never-married  Prof-specialty White Female
## 9  68 Federal-gov      HS-grad       Divorced  Prof-specialty White Female
##   hours.per.week native.country income pred_prob_income
## 1             40  United-States  <=50K       0.14331964
## 2             18  United-States  <=50K       0.12903395
## 5             40  United-States  <=50K       0.06966413
## 6             45  United-States  <=50K       0.01778688
## 8             20  United-States   >50K       0.28444111
## 9             40  United-States  <=50K       0.20694077

library(ROCR) # receiver operating charecteristic

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

pred<-prediction(train_adult$pred_prob_income,train_adult$income)
# compares predicted values with actual values in training dataset

perf<-performance(pred,"tpr","fpr")
# stores the measures with respect to which we want to plot the ROC graph

plot(perf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05))

# plots the ROC curve

# we assign that threshold where sensitivity and specificity have almost similar values after observing the ROC graph
train_adult$pred_income<-ifelse(train_adult$pred_prob_income<0.3,0,1) 
# this coloumn will classify probabilities we calculated and classify them as 0 or 1 based on our threshold value (0.3) and store in this coloumn
head(train_adult)

##   age   workclass    education marital.status      occupation  race    sex
## 1  90     Private      HS-grad        Widowed    Adm-clerical White Female
## 2  82     Private      HS-grad        Widowed Exec-managerial White Female
## 5  41     Private Some-college      Separated  Prof-specialty White Female
## 6  34     Private      HS-grad       Divorced   Other-service White Female
## 8  74   State-gov    Doctorate  Never-married  Prof-specialty White Female
## 9  68 Federal-gov      HS-grad       Divorced  Prof-specialty White Female
##   hours.per.week native.country income pred_prob_income pred_income
## 1             40  United-States  <=50K       0.14331964           0
## 2             18  United-States  <=50K       0.12903395           0
## 5             40  United-States  <=50K       0.06966413           0
## 6             45  United-States  <=50K       0.01778688           0
## 8             20  United-States   >50K       0.28444111           0
## 9             40  United-States  <=50K       0.20694077           0

Creating confusion matrix and assesing the results:

table(train_adult$income,train_adult$pred_income)

##        
##             0     1
##   <=50K 15323  3294
##   >50K   1348  4458

dim(train_adult)

## [1] 24423    12

accuracy<-(4506+15299)/24423;accuracy # formula- (TP+TN)/total possibilities

## [1] 0.8109159

sensitivity<-4506/(4506+1349);sensitivity # formula TP/(TP+FN)

## [1] 0.7695986

specificity<-15299/(15299+3269);specificity # formula TN/(TN+FP)

## [1] 0.8239444

Training dataset results:

Accuracy of 81.09% is fairly good. We can conclude model is good and also we observe the values of sensitivity and specificity are almost close.

Checking how well our model is built using test dataset:

test_adult$pred_prob_income<-predict(adult_blr,test_adult,type = "response")
# an extra argument(type = "response") is required while using 'predict' function to generate response as probabilities
# this argument is not required while using 'fitted'

test_adult$pred_income<-ifelse(test_adult$pred_prob_income<0.3,0,1)
# we take the same threshold to classify which we considered while classifying probabilities of training data
head(test_adult)

##    age workclass    education     marital.status        occupation  race
## 3   66   Private Some-college            Widowed      Adm-clerical Black
## 4   54   Private      7th-8th           Divorced Machine-op-inspct White
## 7   38   Private         10th          Separated      Adm-clerical White
## 22  29   Private         11th          Separated             Sales White
## 24  51   Private Some-college Married-civ-spouse  Transport-moving White
## 28  49   Private      5th-6th Married-civ-spouse     Other-service White
##       sex hours.per.week native.country income pred_prob_income
## 3  Female             40  United-States  <=50K       0.09945670
## 4  Female             40  United-States  <=50K       0.01158659
## 7    Male             40  United-States  <=50K       0.01349647
## 22 Female             42  United-States  <=50K       0.01692173
## 24   Male             40  United-States  <=50K       0.38768861
## 28   Male             40         Greece  <=50K       0.02839995
##    pred_income
## 3            0
## 4            0
## 7            0
## 22           0
## 24           1
## 28           0

dim(test_adult)

## [1] 8138   12

table(test_adult$income,test_adult$pred_income)

##        
##            0    1
##   <=50K 5041 1062
##   >50K   477 1558

accuracy<-(1549+5056)/8138;accuracy

## [1] 0.8116245

sensitivity<-1549/(1549+437);sensitivity

## [1] 0.7799597

specificity<-5056/(5056+1096);specificity

## [1] 0.8218466

To check how much of our predicted values lie inside the curve:

auc<-performance(pred,"auc")
auc@y.values

## [[1]]
## [1] 0.8844618

We can conclude that we are getting an accuracy of 81.16% with 88.43% of our predicted values lying under the curve. Also our misclassifcation rate is 18.84%

Decision Tree

We need to remove the extra coloumns we added while performing BLR before implementing Decision tree

train_adult$pred_income<-NULL
train_adult$pred_prob_income<-NULL
test_adult$pred_income<-NULL
test_adult$pred_prob_income<-NULL

We need the following libraries to perform Decision tree

library(rpart)
library(rpart.plot)

rpart stands for Recursive partitioning and regression trees.

rpart is used when both independent and dependent variables are continuous or categorical.

rpart automatically detects whether to perform regression or classification based on dependent variable. There is no need to specify.

Implementing Decision tree

library(rpart)
tree_adult_model<-rpart(income~.,data = train_adult)

test_adult$pred_income<-predict(tree_adult_model,test_adult,type = "class")
# an extra argument (type = "class") is required to directly classify prediction into classes

head(test_adult)

##    age workclass    education     marital.status        occupation  race
## 3   66   Private Some-college            Widowed      Adm-clerical Black
## 4   54   Private      7th-8th           Divorced Machine-op-inspct White
## 7   38   Private         10th          Separated      Adm-clerical White
## 22  29   Private         11th          Separated             Sales White
## 24  51   Private Some-college Married-civ-spouse  Transport-moving White
## 28  49   Private      5th-6th Married-civ-spouse     Other-service White
##       sex hours.per.week native.country income pred_income
## 3  Female             40  United-States  <=50K       <=50K
## 4  Female             40  United-States  <=50K       <=50K
## 7    Male             40  United-States  <=50K       <=50K
## 22 Female             42  United-States  <=50K       <=50K
## 24   Male             40  United-States  <=50K       <=50K
## 28   Male             40         Greece  <=50K       <=50K

table(test_adult$income,test_adult$pred_income)

##        
##         <=50K >50K
##   <=50K  5801  302
##   >50K   1213  822

dim(test_adult)

## [1] 8138   11

accuracy<-(1167+5548)/8138;accuracy

## [1] 0.8251413

We are getting an accuracy of 82.51%

Here is how to plot the decision tree:

library(rpart.plot)
rpart.plot(tree_adult_model,cex = 0.6) # cex argument was just to adjust the resolution

Consider any observation from testing dataset and take the corresponding boolean tests in the graph above to see the predicted class.

Random forest

Here it is not required to split the data into training and testing

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

rf_adult_model<-randomForest(income~.,data = adult3)
rf_adult_model

## 
## Call:
##  randomForest(formula = income ~ ., data = adult3) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.52%
## Confusion matrix:
##       <=50K >50K class.error
## <=50K 22444 2276   0.0920712
## >50K   3104 4737   0.3958679

Here the Out of Bag error (OOB) gives us the miscalssification rate (MCR) of the model. In this case it comes out to be 16.42%, which gives us the accuracy of 83.58%

To check classwise error

plot(rf_adult_model)

Red line represents MCR of class <=50k, green line represents MCR of class >50k and black line represents overall MCR or OOB error. Overall error rate is what we are interested in which seems considerably good.

Conclusion

After performing various classification techniques and taking into account their accuracies, we can conclude all the models had an accuracy ranging from 81% to 84%. Out of which Random forest gave a slightly better accuracy of 83.58%