Data Set Information:
Extraction was done by Barry Becker from the 1994 Census database. The US Adult Census dataset is a repository of 48,842 entries extracted from the 1994 US Census database. Prediction task is to determine whether a person makes over 50K a year.
Objective:
The main objective of the dataset is to classify people earning <=50k or >50k based on several explanatory factors affecting the income of a person like Age, Occupation, Education, etc.
The methods we intend to use are:
Data cleaning method used in this article is kNN imputation using VIM library
Understanding the Data set:
The Census Income dataset has 48,842 entries. Each entry contains the following information about an individual:
age: The age of an individual
workclass: A general term to represent the employment status of an individual.
* Private
* Self.emp.not.inc
* Self.emp.inc
* Federal.gov
* Local.gov
* State.gov
* Without.pay
* Never.worked.
fnlwgt: Final weight. In other words, this is the number of people the census believes the entry represents.
education: The highest level of education achieved by an individual.
* Bachelors
* Some.college
* 11th
* HS.grad
* Prof.school
* Assoc.acdm
* Assoc.voc
* 9th
* 7th.8th
* 12th
* Masters
* 1st.4th
* 10th
* Doctorate
* 5th.6th
* Preschool.
education.num: The highest level of education achieved in numerical form.
marital.status: Marital status of an individual. Married.civ.spouse corresponds to a civilian spouse while Married.AF.spouse is a spouse in the Armed Forces.
* Married.civ.spouse
* Divorced
* Never.married
* Separated
* Widowed
* Married.spouse.absent
* Married.AF.spouse
occupation: the general type of occupation of an individual
* Tech.support
* Craft.repair
* Other.service
* Sales
* Exec.managerial
* Prof.specialty
* Handlers.cleaners
* Machine.op.inspct
* Adm.clerical
* Farming.fishing
* Transport.moving
* Priv.house.serv
* Protective.serv
* Armed.Forces
relationship: Represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship attribute and is somewhat redundant with marital status. We might not make use of this attribute at all
* Wife
* Own.child
* Husband
* Not.in.family
* Other.relative
* Unmarried.
race: Descriptions of an individual’s race
* White
* Asian.Pac.Islander
* Amer.Indian.Eskimo
* Other
* Black
sex: the biological sex of the individual
* Male
* Female
capital.gain: capital gains for an individual
capital.loss: capital loss for an individual
hours.per.week: the hours an individual has reported to work per week
native.country: country of origin for an individual
* United.States
* Cambodia
* England
* Puerto.Rico
* Canada
* Germany
* Outlying.US(Guam.USVI.etc)
* India
* Japan
* Greece
* South
* China
* Cuba
* Iran
* Honduras
* Philippines
* Italy
* Poland
* Jamaica
* Vietnam
* Mexico
* Portugal
* Ireland
* France
* Dominican.Republic
* Laos
* Ecuador
* Taiwan
* Haiti
* Columbia
* Hungary
* Guatemala
* Nicaragua
* Scotland
* Thailand
* Yugoslavia
* El.Salvador
* Trinadad&Tobago
* Peru
* Hong
* Holand.Netherlands
the label: whether or not an individual makes more than $50,000 annually.
* <=50k
* >50k
Loading the data and performing EDA
adult<-read.csv("C:/Users/Rohit/Desktop/R/Data sets/adult.csv",header=T)
# header = T : if file with header is given then we use this arg (will not include first row for calculations)
# if this arg is not added it will consider first row for calculations
head(adult) # to check the first 6 observations in the data
## age workclass fnlwgt education education.num marital.status
## 1 90 ? 77053 HS-grad 9 Widowed
## 2 82 Private 132870 HS-grad 9 Widowed
## 3 66 ? 186061 Some-college 10 Widowed
## 4 54 Private 140359 7th-8th 4 Divorced
## 5 41 Private 264663 Some-college 10 Separated
## 6 34 Private 216864 HS-grad 9 Divorced
## occupation relationship race sex capital.gain capital.loss
## 1 ? Not-in-family White Female 0 4356
## 2 Exec-managerial Not-in-family White Female 0 4356
## 3 ? Unmarried Black Female 0 4356
## 4 Machine-op-inspct Unmarried White Female 0 3900
## 5 Prof-specialty Own-child White Female 0 3900
## 6 Other-service Unmarried White Female 0 3770
## hours.per.week native.country income
## 1 40 United-States <=50K
## 2 18 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 United-States <=50K
## 6 45 United-States <=50K
EDA of the dependent variable
The original dataset contains a distribution of 24.08% entries labeled with >50k and 75.91% entries labeled with <=50k. The following graphs and statistics pertain to the orignal dataset.
library(ggplot2)
barplot(table(adult$income),main = 'Income Classification',col='blue',ylab ='No. of people')
Let’s check the summary of the data
summary(adult)
## age workclass fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education.num marital.status
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty :4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair :4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial:4066 Other-relative: 981 Black : 3124
## Adm-clerical :3770 Own-child : 5068 Other : 271
## Sales :3650 Unmarried : 3446 White :27816
## Other-service :3295 Wife : 1568
## (Other) :9541
## sex capital.gain capital.loss hours.per.week
## Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1078 Mean : 87.3 Mean :40.44
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## native.country income
## United-States:29170 <=50K:24720
## Mexico : 643 >50K : 7841
## ? : 583
## Philippines : 198
## Germany : 137
## Canada : 121
## (Other) : 1709
Now as we can see some of the values in coloumns are marked as ‘?’
Let us first convert these to NA while loading the data itself
adult1<-read.csv("C:/Users/Rohit/Desktop/R/Data sets/adult.csv",na.strings = c("?","NA"))
# This will replace '?' with 'NA'
# Let's check summary again
summary(adult1)
## age workclass fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 State-gov : 1298 Mean : 189778
## 3rd Qu.:48.00 Self-emp-inc : 1116 3rd Qu.: 237051
## Max. :90.00 (Other) : 981 Max. :1484705
## NA's : 1836
## education education.num marital.status
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty : 4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair : 4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial: 4066 Other-relative: 981 Black : 3124
## Adm-clerical : 3770 Own-child : 5068 Other : 271
## Sales : 3650 Unmarried : 3446 White :27816
## (Other) :10993 Wife : 1568
## NA's : 1843
## sex capital.gain capital.loss hours.per.week
## Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1078 Mean : 87.3 Mean :40.44
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## native.country income
## United-States:29170 <=50K:24720
## Mexico : 643 >50K : 7841
## Philippines : 198
## Germany : 137
## Canada : 121
## (Other) : 1709
## NA's : 583
As you can see we have replaced missing values in our data marked as ‘?’ with ‘NA’
Another way to check the number of NA’s in our data coloumn wise
colSums(is.na(adult1))
## age workclass fnlwgt education education.num
## 0 1836 0 0 0
## marital.status occupation relationship race sex
## 0 1843 0 0 0
## capital.gain capital.loss hours.per.week native.country income
## 0 0 0 583 0
We can also check the structure of our data
str(adult1)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 90 82 66 54 41 34 38 74 68 41 ...
## $ workclass : Factor w/ 8 levels "Federal-gov",..: NA 4 NA 4 4 4 4 7 1 4 ...
## $ fnlwgt : int 77053 132870 186061 140359 264663 216864 150601 88638 422013 70037 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 12 12 16 6 16 12 1 11 12 16 ...
## $ education.num : int 9 9 10 4 10 9 6 16 9 10 ...
## $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 7 7 7 1 6 1 6 5 1 5 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: NA 4 NA 7 10 8 1 10 10 3 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 2 5 5 4 5 5 3 2 5 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 3 5 5 5 5 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 1 1 2 ...
## $ capital.gain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ capital.loss : int 4356 4356 4356 3900 3900 3770 3770 3683 3683 3004 ...
## $ hours.per.week: int 40 18 40 40 40 45 40 20 40 60 ...
## $ native.country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 39 39 39 39 39 NA ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 1 2 ...
Concluding a few things before applying the various classification algorithms
It is observed that some variables are not self-explanatory.
capital_gain and capital_loss are income from other sources like investments other than salary which have no relevance here.
The continuous variable fnlwgt represents final weight, which is the number of units in the target population that the responding unit represents.
The variable education_num stands for the number of years of education in total, which is a continuous representation of the discrete variable education.
The variable relationship represents the responding members’s role in the family.
For simplicity of this analysis, the following variables are removed education.num, relationship, fnlwgt, capital.gain and capital.loss
adult1$capital.gain<-NULL
adult1$capital.loss<-NULL
adult1$fnlwgt<-NULL
adult1$education.num<-NULL
adult1$relationship<-NULL
Here library(VIM) is required to impute missing values. kNN imputation is preferred over the conventional method of replacing with mean, median and mode as it is supposed to be more justified. It may occur that a person whose age is missing and earns >50k is alloted a median age which may not be true. kNN inputation will consider all the observations and based on the historical data will assign a better value.
VIM stands for Visualization and imputation of missing values
Note: This may take a while based on sample size
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# as it is observed only the following coloumns have NAs in them, we specifically perform kNN imputation on these 3 variables
adult2<-kNN(adult1,variable = c("workclass","occupation","native.country"),k=sqrt(nrow(adult1)))
colSums(is.na(adult2)) # to verify if NAs removed
## age workclass education
## 0 0 0
## marital.status occupation race
## 0 0 0
## sex hours.per.week native.country
## 0 0 0
## income workclass_imp occupation_imp
## 0 0 0
## native.country_imp
## 0
This action also creates some dummy variables at the end of the data. You can check this by
head(adult2)
## age workclass education marital.status occupation race sex
## 1 90 Private HS-grad Widowed Adm-clerical White Female
## 2 82 Private HS-grad Widowed Exec-managerial White Female
## 3 66 Private Some-college Widowed Adm-clerical Black Female
## 4 54 Private 7th-8th Divorced Machine-op-inspct White Female
## 5 41 Private Some-college Separated Prof-specialty White Female
## 6 34 Private HS-grad Divorced Other-service White Female
## hours.per.week native.country income workclass_imp occupation_imp
## 1 40 United-States <=50K TRUE TRUE
## 2 18 United-States <=50K FALSE FALSE
## 3 40 United-States <=50K TRUE TRUE
## 4 40 United-States <=50K FALSE FALSE
## 5 40 United-States <=50K FALSE FALSE
## 6 45 United-States <=50K FALSE FALSE
## native.country_imp
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
So now we create another data set exculding the dummy variables
adult3<-adult2[,1:10]
head(adult3) # to verify if dummy variables removed
## age workclass education marital.status occupation race sex
## 1 90 Private HS-grad Widowed Adm-clerical White Female
## 2 82 Private HS-grad Widowed Exec-managerial White Female
## 3 66 Private Some-college Widowed Adm-clerical Black Female
## 4 54 Private 7th-8th Divorced Machine-op-inspct White Female
## 5 41 Private Some-college Separated Prof-specialty White Female
## 6 34 Private HS-grad Divorced Other-service White Female
## hours.per.week native.country income
## 1 40 United-States <=50K
## 2 18 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 United-States <=50K
## 6 45 United-States <=50K
dim(adult3) # gives the number of variables and coloumns in our dataset
## [1] 32561 10
Let’s check income with respect to age
library(ggplot2)
ggplot(adult3) + aes(x=as.numeric(age), group=income, fill=income) +
geom_histogram(binwidth=1, color='black')+
labs(x="Age",y="Count",title = "Income w.r.t Age")
As we notice majority of the people make less than <50k a year. However, we observe people earning >50k are in their mid career. We make this hypothesis based on the age.
Let’s check the same for workclass
barplot(table(adult3$workclass),main = 'Income Classification w.r.t workclass',col='blue',ylab ='No. of people')
We can conclude that people working in private sector earn significantly better than the ones in other classes.
Dividing data in Training and Testing Datasets
Let’s put 75% data in training and 25% in testing dataset
library(caret) # classification and regression training
## Loading required package: lattice
index<-createDataPartition(adult3$age,p=0.75,list = F)
# argument 'list=F' is added so that it takes only indexes of the observations and not make a list row wise
train_adult<-adult3[index,]
test_adult<-adult3[-index,]
dim(train_adult)
## [1] 24423 10
dim(test_adult)
## [1] 8138 10
# model implementation
adult_blr<-glm(income~.,data = train_adult,family = "binomial")
# argument (family = "binomial") is necessary as we are creating a model with dichotomous result
summary(adult_blr)
##
## Call:
## glm(formula = income ~ ., family = "binomial", data = train_adult)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6409 -0.5524 -0.2375 -0.0502 3.6781
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -5.437e+00 7.727e-01 -7.037
## age 2.657e-02 1.747e-03 15.211
## workclassLocal-gov -6.492e-01 1.228e-01 -5.287
## workclassNever-worked -1.240e+01 4.507e+02 -0.028
## workclassPrivate -5.293e-01 1.018e-01 -5.198
## workclassSelf-emp-inc -1.969e-01 1.338e-01 -1.471
## workclassSelf-emp-not-inc -9.236e-01 1.198e-01 -7.711
## workclassState-gov -8.552e-01 1.371e-01 -6.236
## workclassWithout-pay -1.423e+01 3.749e+02 -0.038
## education11th 3.119e-01 2.304e-01 1.354
## education12th 6.217e-01 2.859e-01 2.175
## education1st-4th -2.614e-01 4.873e-01 -0.537
## education5th-6th -3.212e-01 3.719e-01 -0.864
## education7th-8th -4.524e-01 2.622e-01 -1.725
## education9th -5.837e-01 3.228e-01 -1.808
## educationAssoc-acdm 1.519e+00 1.944e-01 7.813
## educationAssoc-voc 1.531e+00 1.874e-01 8.165
## educationBachelors 2.118e+00 1.751e-01 12.094
## educationDoctorate 3.160e+00 2.349e-01 13.453
## educationHS-grad 9.300e-01 1.708e-01 5.445
## educationMasters 2.528e+00 1.859e-01 13.599
## educationPreschool -1.197e+01 2.011e+02 -0.060
## educationProf-school 3.124e+00 2.183e-01 14.310
## educationSome-college 1.263e+00 1.734e-01 7.283
## marital.statusMarried-AF-spouse 2.809e+00 5.418e-01 5.184
## marital.statusMarried-civ-spouse 2.037e+00 7.044e-02 28.922
## marital.statusMarried-spouse-absent -6.109e-02 2.455e-01 -0.249
## marital.statusNever-married -4.985e-01 8.663e-02 -5.754
## marital.statusSeparated -3.342e-01 1.796e-01 -1.861
## marital.statusWidowed -3.172e-03 1.547e-01 -0.021
## occupationArmed-Forces -1.360e+01 5.012e+02 -0.027
## occupationCraft-repair -1.596e-02 8.429e-02 -0.189
## occupationExec-managerial 7.675e-01 8.090e-02 9.488
## occupationFarming-fishing -1.008e+00 1.520e-01 -6.632
## occupationHandlers-cleaners -7.481e-01 1.578e-01 -4.741
## occupationMachine-op-inspct -3.224e-01 1.111e-01 -2.902
## occupationOther-service -8.922e-01 1.256e-01 -7.106
## occupationPriv-house-serv -2.711e+00 1.175e+00 -2.307
## occupationProf-specialty 4.967e-01 8.530e-02 5.822
## occupationProtective-serv 4.257e-01 1.376e-01 3.093
## occupationSales 2.345e-01 8.642e-02 2.713
## occupationTech-support 5.366e-01 1.192e-01 4.501
## occupationTransport-moving -1.447e-01 1.071e-01 -1.351
## raceAsian-Pac-Islander 3.270e-01 2.800e-01 1.168
## raceBlack 2.807e-01 2.384e-01 1.177
## raceOther -4.965e-01 3.969e-01 -1.251
## raceWhite 3.908e-01 2.260e-01 1.729
## sexMale 1.391e-01 5.695e-02 2.442
## hours.per.week 3.075e-02 1.734e-03 17.729
## native.countryCanada -9.297e-01 7.552e-01 -1.231
## native.countryChina -1.704e+00 7.940e-01 -2.146
## native.countryColumbia -1.529e+01 1.834e+02 -0.083
## native.countryCuba -6.111e-01 7.746e-01 -0.789
## native.countryDominican-Republic -1.929e+00 1.047e+00 -1.842
## native.countryEcuador -9.081e-01 9.979e-01 -0.910
## native.countryEl-Salvador -1.109e+00 8.502e-01 -1.304
## native.countryEngland -8.714e-01 7.797e-01 -1.118
## native.countryFrance -3.244e-01 8.982e-01 -0.361
## native.countryGermany -3.924e-01 7.460e-01 -0.526
## native.countryGreece -1.452e+00 9.031e-01 -1.607
## native.countryGuatemala -4.279e-01 1.006e+00 -0.425
## native.countryHaiti -1.004e+00 1.055e+00 -0.951
## native.countryHoland-Netherlands -1.251e+01 1.455e+03 -0.009
## native.countryHonduras -1.306e+01 4.790e+02 -0.027
## native.countryHong -1.431e+00 1.035e+00 -1.382
## native.countryHungary -6.942e-01 1.149e+00 -0.604
## native.countryIndia -1.077e+00 7.561e-01 -1.425
## native.countryIran -7.714e-01 8.215e-01 -0.939
## native.countryIreland -1.678e-01 9.493e-01 -0.177
## native.countryItaly 3.826e-02 7.857e-01 0.049
## native.countryJamaica -9.682e-01 8.599e-01 -1.126
## native.countryJapan -5.582e-01 8.262e-01 -0.676
## native.countryLaos -1.827e+00 1.416e+00 -1.290
## native.countryMexico -1.474e+00 7.440e-01 -1.982
## native.countryNicaragua -1.663e+00 1.066e+00 -1.559
## native.countryOutlying-US(Guam-USVI-etc) -1.391e+01 4.116e+02 -0.034
## native.countryPeru -1.500e+00 1.079e+00 -1.390
## native.countryPhilippines -6.902e-01 7.226e-01 -0.955
## native.countryPoland -9.914e-01 8.347e-01 -1.188
## native.countryPortugal -1.378e+00 1.006e+00 -1.369
## native.countryPuerto-Rico -1.102e+00 8.130e-01 -1.355
## native.countryScotland -7.641e-01 1.051e+00 -0.727
## native.countrySouth -1.628e+00 7.814e-01 -2.083
## native.countryTaiwan -9.425e-01 8.150e-01 -1.156
## native.countryThailand -1.209e+00 1.090e+00 -1.109
## native.countryTrinadad&Tobago -1.525e+00 1.325e+00 -1.151
## native.countryUnited-States -7.606e-01 7.011e-01 -1.085
## native.countryVietnam -2.136e+00 9.154e-01 -2.334
## native.countryYugoslavia -4.662e-01 1.031e+00 -0.452
## Pr(>|z|)
## (Intercept) 1.97e-12 ***
## age < 2e-16 ***
## workclassLocal-gov 1.24e-07 ***
## workclassNever-worked 0.97804
## workclassPrivate 2.02e-07 ***
## workclassSelf-emp-inc 0.14132
## workclassSelf-emp-not-inc 1.25e-14 ***
## workclassState-gov 4.50e-10 ***
## workclassWithout-pay 0.96973
## education11th 0.17583
## education12th 0.02966 *
## education1st-4th 0.59158
## education5th-6th 0.38767
## education7th-8th 0.08447 .
## education9th 0.07060 .
## educationAssoc-acdm 5.57e-15 ***
## educationAssoc-voc 3.21e-16 ***
## educationBachelors < 2e-16 ***
## educationDoctorate < 2e-16 ***
## educationHS-grad 5.17e-08 ***
## educationMasters < 2e-16 ***
## educationPreschool 0.95254
## educationProf-school < 2e-16 ***
## educationSome-college 3.26e-13 ***
## marital.statusMarried-AF-spouse 2.18e-07 ***
## marital.statusMarried-civ-spouse < 2e-16 ***
## marital.statusMarried-spouse-absent 0.80352
## marital.statusNever-married 8.70e-09 ***
## marital.statusSeparated 0.06270 .
## marital.statusWidowed 0.98364
## occupationArmed-Forces 0.97836
## occupationCraft-repair 0.84986
## occupationExec-managerial < 2e-16 ***
## occupationFarming-fishing 3.31e-11 ***
## occupationHandlers-cleaners 2.13e-06 ***
## occupationMachine-op-inspct 0.00370 **
## occupationOther-service 1.20e-12 ***
## occupationPriv-house-serv 0.02105 *
## occupationProf-specialty 5.80e-09 ***
## occupationProtective-serv 0.00198 **
## occupationSales 0.00666 **
## occupationTech-support 6.75e-06 ***
## occupationTransport-moving 0.17664
## raceAsian-Pac-Islander 0.24289
## raceBlack 0.23906
## raceOther 0.21088
## raceWhite 0.08378 .
## sexMale 0.01459 *
## hours.per.week < 2e-16 ***
## native.countryCanada 0.21827
## native.countryChina 0.03187 *
## native.countryColumbia 0.93357
## native.countryCuba 0.43014
## native.countryDominican-Republic 0.06544 .
## native.countryEcuador 0.36278
## native.countryEl-Salvador 0.19216
## native.countryEngland 0.26374
## native.countryFrance 0.71800
## native.countryGermany 0.59889
## native.countryGreece 0.10795
## native.countryGuatemala 0.67068
## native.countryHaiti 0.34138
## native.countryHoland-Netherlands 0.99314
## native.countryHonduras 0.97824
## native.countryHong 0.16692
## native.countryHungary 0.54568
## native.countryIndia 0.15420
## native.countryIran 0.34774
## native.countryIreland 0.85974
## native.countryItaly 0.96117
## native.countryJamaica 0.26019
## native.countryJapan 0.49928
## native.countryLaos 0.19705
## native.countryMexico 0.04750 *
## native.countryNicaragua 0.11897
## native.countryOutlying-US(Guam-USVI-etc) 0.97303
## native.countryPeru 0.16444
## native.countryPhilippines 0.33950
## native.countryPoland 0.23491
## native.countryPortugal 0.17085
## native.countryPuerto-Rico 0.17544
## native.countryScotland 0.46740
## native.countrySouth 0.03724 *
## native.countryTaiwan 0.24754
## native.countryThailand 0.26737
## native.countryTrinadad&Tobago 0.24988
## native.countryUnited-States 0.27797
## native.countryVietnam 0.01962 *
## native.countryYugoslavia 0.65106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26789 on 24422 degrees of freedom
## Residual deviance: 17212 on 24334 degrees of freedom
## AIC: 17390
##
## Number of Fisher Scoring iterations: 14
# to check how well is our model built we need to calculate predicted porobabilities
# also our calculated probabilities need to be classified
# in order to do that we also need to decide the threshold that best classifies our predicted results
train_adult$pred_prob_income<-fitted(adult_blr)
# this coloumn will have predicted probabilties of being 1
head(train_adult) # run the command to check if the new coloumn is added
## age workclass education marital.status occupation race sex
## 1 90 Private HS-grad Widowed Adm-clerical White Female
## 2 82 Private HS-grad Widowed Exec-managerial White Female
## 5 41 Private Some-college Separated Prof-specialty White Female
## 6 34 Private HS-grad Divorced Other-service White Female
## 8 74 State-gov Doctorate Never-married Prof-specialty White Female
## 9 68 Federal-gov HS-grad Divorced Prof-specialty White Female
## hours.per.week native.country income pred_prob_income
## 1 40 United-States <=50K 0.14331964
## 2 18 United-States <=50K 0.12903395
## 5 40 United-States <=50K 0.06966413
## 6 45 United-States <=50K 0.01778688
## 8 20 United-States >50K 0.28444111
## 9 40 United-States <=50K 0.20694077
library(ROCR) # receiver operating charecteristic
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred<-prediction(train_adult$pred_prob_income,train_adult$income)
# compares predicted values with actual values in training dataset
perf<-performance(pred,"tpr","fpr")
# stores the measures with respect to which we want to plot the ROC graph
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05))
# plots the ROC curve
# we assign that threshold where sensitivity and specificity have almost similar values after observing the ROC graph
train_adult$pred_income<-ifelse(train_adult$pred_prob_income<0.3,0,1)
# this coloumn will classify probabilities we calculated and classify them as 0 or 1 based on our threshold value (0.3) and store in this coloumn
head(train_adult)
## age workclass education marital.status occupation race sex
## 1 90 Private HS-grad Widowed Adm-clerical White Female
## 2 82 Private HS-grad Widowed Exec-managerial White Female
## 5 41 Private Some-college Separated Prof-specialty White Female
## 6 34 Private HS-grad Divorced Other-service White Female
## 8 74 State-gov Doctorate Never-married Prof-specialty White Female
## 9 68 Federal-gov HS-grad Divorced Prof-specialty White Female
## hours.per.week native.country income pred_prob_income pred_income
## 1 40 United-States <=50K 0.14331964 0
## 2 18 United-States <=50K 0.12903395 0
## 5 40 United-States <=50K 0.06966413 0
## 6 45 United-States <=50K 0.01778688 0
## 8 20 United-States >50K 0.28444111 0
## 9 40 United-States <=50K 0.20694077 0
Creating confusion matrix and assesing the results:
table(train_adult$income,train_adult$pred_income)
##
## 0 1
## <=50K 15323 3294
## >50K 1348 4458
dim(train_adult)
## [1] 24423 12
accuracy<-(4506+15299)/24423;accuracy # formula- (TP+TN)/total possibilities
## [1] 0.8109159
sensitivity<-4506/(4506+1349);sensitivity # formula TP/(TP+FN)
## [1] 0.7695986
specificity<-15299/(15299+3269);specificity # formula TN/(TN+FP)
## [1] 0.8239444
Training dataset results:
Accuracy of 81.09% is fairly good. We can conclude model is good and also we observe the values of sensitivity and specificity are almost close.
Checking how well our model is built using test dataset:
test_adult$pred_prob_income<-predict(adult_blr,test_adult,type = "response")
# an extra argument(type = "response") is required while using 'predict' function to generate response as probabilities
# this argument is not required while using 'fitted'
test_adult$pred_income<-ifelse(test_adult$pred_prob_income<0.3,0,1)
# we take the same threshold to classify which we considered while classifying probabilities of training data
head(test_adult)
## age workclass education marital.status occupation race
## 3 66 Private Some-college Widowed Adm-clerical Black
## 4 54 Private 7th-8th Divorced Machine-op-inspct White
## 7 38 Private 10th Separated Adm-clerical White
## 22 29 Private 11th Separated Sales White
## 24 51 Private Some-college Married-civ-spouse Transport-moving White
## 28 49 Private 5th-6th Married-civ-spouse Other-service White
## sex hours.per.week native.country income pred_prob_income
## 3 Female 40 United-States <=50K 0.09945670
## 4 Female 40 United-States <=50K 0.01158659
## 7 Male 40 United-States <=50K 0.01349647
## 22 Female 42 United-States <=50K 0.01692173
## 24 Male 40 United-States <=50K 0.38768861
## 28 Male 40 Greece <=50K 0.02839995
## pred_income
## 3 0
## 4 0
## 7 0
## 22 0
## 24 1
## 28 0
dim(test_adult)
## [1] 8138 12
table(test_adult$income,test_adult$pred_income)
##
## 0 1
## <=50K 5041 1062
## >50K 477 1558
accuracy<-(1549+5056)/8138;accuracy
## [1] 0.8116245
sensitivity<-1549/(1549+437);sensitivity
## [1] 0.7799597
specificity<-5056/(5056+1096);specificity
## [1] 0.8218466
To check how much of our predicted values lie inside the curve:
auc<-performance(pred,"auc")
auc@y.values
## [[1]]
## [1] 0.8844618
We can conclude that we are getting an accuracy of 81.16% with 88.43% of our predicted values lying under the curve. Also our misclassifcation rate is 18.84%
We need to remove the extra coloumns we added while performing BLR before implementing Decision tree
train_adult$pred_income<-NULL
train_adult$pred_prob_income<-NULL
test_adult$pred_income<-NULL
test_adult$pred_prob_income<-NULL
We need the following libraries to perform Decision tree
rpart stands for Recursive partitioning and regression trees.
rpart is used when both independent and dependent variables are continuous or categorical.
rpart automatically detects whether to perform regression or classification based on dependent variable. There is no need to specify.
Implementing Decision tree
library(rpart)
tree_adult_model<-rpart(income~.,data = train_adult)
test_adult$pred_income<-predict(tree_adult_model,test_adult,type = "class")
# an extra argument (type = "class") is required to directly classify prediction into classes
head(test_adult)
## age workclass education marital.status occupation race
## 3 66 Private Some-college Widowed Adm-clerical Black
## 4 54 Private 7th-8th Divorced Machine-op-inspct White
## 7 38 Private 10th Separated Adm-clerical White
## 22 29 Private 11th Separated Sales White
## 24 51 Private Some-college Married-civ-spouse Transport-moving White
## 28 49 Private 5th-6th Married-civ-spouse Other-service White
## sex hours.per.week native.country income pred_income
## 3 Female 40 United-States <=50K <=50K
## 4 Female 40 United-States <=50K <=50K
## 7 Male 40 United-States <=50K <=50K
## 22 Female 42 United-States <=50K <=50K
## 24 Male 40 United-States <=50K <=50K
## 28 Male 40 Greece <=50K <=50K
table(test_adult$income,test_adult$pred_income)
##
## <=50K >50K
## <=50K 5801 302
## >50K 1213 822
dim(test_adult)
## [1] 8138 11
accuracy<-(1167+5548)/8138;accuracy
## [1] 0.8251413
We are getting an accuracy of 82.51%
Here is how to plot the decision tree:
library(rpart.plot)
rpart.plot(tree_adult_model,cex = 0.6) # cex argument was just to adjust the resolution
Consider any observation from testing dataset and take the corresponding boolean tests in the graph above to see the predicted class.
Here it is not required to split the data into training and testing
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
rf_adult_model<-randomForest(income~.,data = adult3)
rf_adult_model
##
## Call:
## randomForest(formula = income ~ ., data = adult3)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.52%
## Confusion matrix:
## <=50K >50K class.error
## <=50K 22444 2276 0.0920712
## >50K 3104 4737 0.3958679
Here the Out of Bag error (OOB) gives us the miscalssification rate (MCR) of the model. In this case it comes out to be 16.42%, which gives us the accuracy of 83.58%
To check classwise error
plot(rf_adult_model)
Red line represents MCR of class <=50k, green line represents MCR of class >50k and black line represents overall MCR or OOB error. Overall error rate is what we are interested in which seems considerably good.
After performing various classification techniques and taking into account their accuracies, we can conclude all the models had an accuracy ranging from 81% to 84%. Out of which Random forest gave a slightly better accuracy of 83.58%