Suitable Machine Learning Model for Adult Dataset

Dataset Information

Extraction was done by Barry Becker from the 1994 Census database. The US Adult Census dataset is a repository of 48,842 entries extracted from the 1994 US Census database. Prediction task is to determine whether a person makes over 50K a year.Classification has been done to predict whether a person’s yearly income in US falls in the income category of either greater than 50K Dollars or less equal to 50K Dollars category based on a certain set of attributes.

Objective

The main objective of the dataset is to classify people earning <=50k or >50k based on several explanatory factors affecting the income of a person like Age, Occupation, Education, etc.

The methods we intend to use are:

Logistic Regression
Decision Tree
Random Forest

Literature Review

Some of the machine learning models have been made in the past by researchers for predicting income levels.

Bekena [1] implemented the Random Forest Classifier algorithm to predict income levels of individuals.
Haojun Zhu [2] attempted Logistic Regression as the Sta- tistical Modelling Tool and 4 different Machine Learning Techniques, Neural Network, Classification and Regres- sion Tree, Random Forest, and Support Vector Machine for predicting Income Levels.
Chockalingam et. al. [3] explored and analysed the Adult Dataset and used several Machine Learning Models like Logistic Regression, Stepwise Logistic Regression, Naive Bayes, Decision Trees, Extra Trees, k-Nearest Neighbor, SVM, Gradient Boosting and 6 configurations of Acti- vated Neural Network. They also drew a comparative analysis of their predictive performances.
Topiwalla [4] made the usage of complex algorithms like XGBOOST, Random Forest and stacking of models for prediction tasks including Logistic Stack on XGBOOST and SVM Stack on Logistic for scaling up the accuracy.
Lazar [5] implemented Principal Component Analysis (PCA) and Support Vector Machine methods to generate and evaluate income prediction data based on the Current Population Survey provided by the U.S. Census Bureau.
Lemon et. al. [6] attempted to identify the important features in the data that could help to optimize the complexity of different machine learning models used in classification tasks.
Deepajothi et. al. [7] tried to replicate Bayesian Networks, Decision Tree Induction, Lazy Classifier and Rule Based Learning Techniques for the Adult Dataset and presented a comparative analysis of the predictive performances.

#Read the data
adult <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"), header = FALSE,stringsAsFactors=T)
# Apply headers
names(adult) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")
head(adult)

##   age         workclass fnlwgt  education education_num      marital_status
## 1  39         State-gov  77516  Bachelors            13       Never-married
## 2  50  Self-emp-not-inc  83311  Bachelors            13  Married-civ-spouse
## 3  38           Private 215646    HS-grad             9            Divorced
## 4  53           Private 234721       11th             7  Married-civ-spouse
## 5  28           Private 338409  Bachelors            13  Married-civ-spouse
## 6  37           Private 284582    Masters            14  Married-civ-spouse
##           occupation   relationship   race     sex capital_gain capital_loss
## 1       Adm-clerical  Not-in-family  White    Male         2174            0
## 2    Exec-managerial        Husband  White    Male            0            0
## 3  Handlers-cleaners  Not-in-family  White    Male            0            0
## 4  Handlers-cleaners        Husband  Black    Male            0            0
## 5     Prof-specialty           Wife  Black  Female            0            0
## 6    Exec-managerial           Wife  White  Female            0            0
##   hours_per_week native_country income
## 1             40  United-States  <=50K
## 2             13  United-States  <=50K
## 3             40  United-States  <=50K
## 4             40  United-States  <=50K
## 5             40           Cuba  <=50K
## 6             40  United-States  <=50K

str(adult)

## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education     : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education_num : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital_status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation    : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex           : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours_per_week: int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native_country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
##  $ income        : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...

EDA

#Histogram of age by income group
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.1.2

ggplot(adult) + aes(x=as.numeric(age), group=income, fill=income) + 
  geom_histogram(binwidth=1, color='black')+labs(x="Age",y="Count",title = "Income w.r.t Age")

As we notice majority of the people make less than <50k a year. However, we observe people earning >50k are in their mid career. We make this hypothesis based on the age.

#Histogram of age by gender group
ggplot(adult) + aes(x=as.numeric(age), group=sex, fill=sex) + 
  geom_histogram(binwidth=1, color='black')+labs(x="Age",y="Count",title = "Income w.r.t Age")

# Barplot
data <- data.frame(table(adult$income,adult$age))
names(data) <- c('income', 'age')
ggplot(data=adult, aes(x=income,y=age, fill=income, order = as.numeric(income))) + geom_bar(stat="identity")+ggtitle("age vs income")

data <- data.frame(table(adult$income,adult$hours_per_week))
names(data) <- c('income', 'hours_per_week')
ggplot(data=adult, aes(x=income,y=hours_per_week, fill=income, order = as.numeric(income))) + 
                                         geom_bar(stat="identity")+ggtitle("hours_per_week vs income")

# Boxplot
boxplot(adult$age, xlab= "hours_per_week", ylab = "income", main= "income vs working-hours", horizontal = FALSE,col = "pink")

boxplot(adult$age, xlab= "age", ylab = "income", main= "Income vs age", horizontal = FALSE,col="pink")

# DensityPlot
ggplot(data = adult) +
  aes(x = hours_per_week, fill = income) +
  geom_density(alpha=0.3)+ggtitle("Denisty plot")

ggplot(data = adult) +
  aes(x = age, fill = income) +
  geom_density(alpha=0.3)+ggtitle("Denisty plot")

# BubblePlot
ggplot(adult,aes(x=age,y=capital_gain))+geom_point(alpha=0.7)+ggtitle("bubble plot-agevscapital_gain")+theme(plot.title=element_text(hjust=0.5))

summary(adult)

##       age                    workclass         fnlwgt       
##  Min.   :17.00    Private         :22696   Min.   :  12285  
##  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00    Local-gov       : 2093   Median : 178356  
##  Mean   :38.58    ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)          :  981                    
##          education     education_num                  marital_status 
##   HS-grad     :10501   Min.   : 1.00    Divorced             : 4443  
##   Some-college: 7291   1st Qu.: 9.00    Married-AF-spouse    :   23  
##   Bachelors   : 5355   Median :10.00    Married-civ-spouse   :14976  
##   Masters     : 1723   Mean   :10.08    Married-spouse-absent:  418  
##   Assoc-voc   : 1382   3rd Qu.:12.00    Never-married        :10683  
##   11th        : 1175   Max.   :16.00    Separated            : 1025  
##  (Other)      : 5134                    Widowed              :  993  
##             occupation            relationship                    race      
##   Prof-specialty :4140    Husband       :13193    Amer-Indian-Eskimo:  311  
##   Craft-repair   :4099    Not-in-family : 8305    Asian-Pac-Islander: 1039  
##   Exec-managerial:4066    Other-relative:  981    Black             : 3124  
##   Adm-clerical   :3770    Own-child     : 5068    Other             :  271  
##   Sales          :3650    Unmarried     : 3446    White             :27816  
##   Other-service  :3295    Wife          : 1568                              
##  (Other)         :9541                                                      
##       sex         capital_gain    capital_loss    hours_per_week 
##   Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##   Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                  Median :    0   Median :   0.0   Median :40.00  
##                  Mean   : 1078   Mean   :  87.3   Mean   :40.44  
##                  3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                  Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                  
##         native_country     income     
##   United-States:29170    <=50K:24720  
##   Mexico       :  643    >50K : 7841  
##   ?            :  583                 
##   Philippines  :  198                 
##   Germany      :  137                 
##   Canada       :  121                 
##  (Other)       : 1709

library(corrgram)

## Warning: package 'corrgram' was built under R version 4.1.3

corrgram(adult, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Corrgram of adult intercorrelations")

ggplot(adult,aes(x=age,y=income))+geom_violin(fill="lightblue",color="black")+ggtitle(" AGE VS INCOME")

pairs(adult[c(1,3,5,12,13)],pch=21,col="lightblue",main="Scatterplot")

# find elements which has ? in place of NA
null_positions <- adult == " ?"
# replace elements with NA
is.na(adult) <- null_positions
# Now dropping the null values
adult1 <- na.omit(adult)
library(arules)

## Warning: package 'arules' was built under R version 4.1.3

## Loading required package: Matrix

## Warning: package 'Matrix' was built under R version 4.1.3

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.1.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Now dropping the duplicates
adult2<- adult1 %>% distinct()
# Removing factor levels which doesn't occur (i.e., zero frequency)
adult2 <- droplevels(adult2)
# Type of Attributes in the data
names(adult2)

##  [1] "age"            "workclass"      "fnlwgt"         "education"     
##  [5] "education_num"  "marital_status" "occupation"     "relationship"  
##  [9] "race"           "sex"            "capital_gain"   "capital_loss"  
## [13] "hours_per_week" "native_country" "income"

# Number of categorical attributes in the data
info <- sapply(adult2, is.factor)
# Printing levels of each categorical attribute
for(i in 1:ncol(adult2)){
  if(info[i] == TRUE){
    print(colnames(adult2)[i])
    temp <- as.data.frame(table(adult2[ , i]))
    print(temp)
    cat ("\n\n\n")
  }
}

## [1] "workclass"
##                Var1  Freq
## 1       Federal-gov   943
## 2         Local-gov  2067
## 3           Private 22264
## 4      Self-emp-inc  1074
## 5  Self-emp-not-inc  2498
## 6         State-gov  1279
## 7       Without-pay    14
## 
## 
## 
## [1] "education"
##             Var1 Freq
## 1           10th  820
## 2           11th 1048
## 3           12th  377
## 4        1st-4th  149
## 5        5th-6th  287
## 6        7th-8th  556
## 7            9th  455
## 8     Assoc-acdm 1008
## 9      Assoc-voc 1307
## 10     Bachelors 5042
## 11     Doctorate  375
## 12       HS-grad 9834
## 13       Masters 1626
## 14     Preschool   44
## 15   Prof-school  542
## 16  Some-college 6669
## 
## 
## 
## [1] "marital_status"
##                     Var1  Freq
## 1               Divorced  4212
## 2      Married-AF-spouse    21
## 3     Married-civ-spouse 14059
## 4  Married-spouse-absent   370
## 5          Never-married  9711
## 6              Separated   939
## 7                Widowed   827
## 
## 
## 
## [1] "occupation"
##                  Var1 Freq
## 1        Adm-clerical 3719
## 2        Armed-Forces    9
## 3        Craft-repair 4025
## 4     Exec-managerial 3991
## 5     Farming-fishing  987
## 6   Handlers-cleaners 1349
## 7   Machine-op-inspct 1964
## 8       Other-service 3209
## 9     Priv-house-serv  141
## 10     Prof-specialty 4034
## 11    Protective-serv  644
## 12              Sales 3584
## 13       Tech-support  911
## 14   Transport-moving 1572
## 
## 
## 
## [1] "relationship"
##              Var1  Freq
## 1         Husband 12457
## 2   Not-in-family  7714
## 3  Other-relative   889
## 4       Own-child  4462
## 5       Unmarried  3211
## 6            Wife  1406
## 
## 
## 
## [1] "race"
##                  Var1  Freq
## 1  Amer-Indian-Eskimo   286
## 2  Asian-Pac-Islander   894
## 3               Black  2816
## 4               Other   231
## 5               White 25912
## 
## 
## 
## [1] "sex"
##      Var1  Freq
## 1  Female  9773
## 2    Male 20366
## 
## 
## 
## [1] "native_country"
##                           Var1  Freq
## 1                     Cambodia    18
## 2                       Canada   107
## 3                        China    68
## 4                     Columbia    56
## 5                         Cuba    92
## 6           Dominican-Republic    67
## 7                      Ecuador    27
## 8                  El-Salvador   100
## 9                      England    86
## 10                      France    27
## 11                     Germany   128
## 12                      Greece    29
## 13                   Guatemala    61
## 14                       Haiti    42
## 15          Holand-Netherlands     1
## 16                    Honduras    12
## 17                        Hong    19
## 18                     Hungary    13
## 19                       India   100
## 20                        Iran    42
## 21                     Ireland    24
## 22                       Italy    68
## 23                     Jamaica    80
## 24                       Japan    59
## 25                        Laos    17
## 26                      Mexico   606
## 27                   Nicaragua    33
## 28  Outlying-US(Guam-USVI-etc)    14
## 29                        Peru    30
## 30                 Philippines   188
## 31                      Poland    56
## 32                    Portugal    34
## 33                 Puerto-Rico   109
## 34                    Scotland    11
## 35                       South    71
## 36                      Taiwan    42
## 37                    Thailand    17
## 38             Trinadad&Tobago    18
## 39               United-States 27487
## 40                     Vietnam    64
## 41                  Yugoslavia    16
## 
## 
## 
## [1] "income"
##     Var1  Freq
## 1  <=50K 22633
## 2   >50K  7506

It is observed that some variables are not self-explanatory.

capital_gain and capital_loss are income from other sources like investments other than salary which have no relevance here.

The continuous variable fnlwgt represents final weight, which is the number of units in the target population that the responding unit represents.

The variable education_num stands for the number of years of education in total, which is a continuous representation of the discrete variable education.

The variable relationship represents the responding members’s role in the family.

For simplicity of this analysis, the following variables are removed education.num, relationship, fnlwgt, capital.gain and capital.loss

adult2$capital_gain<-NULL
adult2$capital_loss<-NULL
adult2$fnlwgt<-NULL
adult2$education_num<-NULL
adult2$relationship<-NULL
# Summary of all the attributes in the data
summary(adult2)

##       age                    workclass             education   
##  Min.   :17.00    Federal-gov     :  943    HS-grad     :9834  
##  1st Qu.:28.00    Local-gov       : 2067    Some-college:6669  
##  Median :37.00    Private         :22264    Bachelors   :5042  
##  Mean   :38.44    Self-emp-inc    : 1074    Masters     :1626  
##  3rd Qu.:47.00    Self-emp-not-inc: 2498    Assoc-voc   :1307  
##  Max.   :90.00    State-gov       : 1279    11th        :1048  
##                   Without-pay     :   14   (Other)      :4613  
##                 marital_status             occupation  
##   Divorced             : 4212    Prof-specialty :4034  
##   Married-AF-spouse    :   21    Craft-repair   :4025  
##   Married-civ-spouse   :14059    Exec-managerial:3991  
##   Married-spouse-absent:  370    Adm-clerical   :3719  
##   Never-married        : 9711    Sales          :3584  
##   Separated            :  939    Other-service  :3209  
##   Widowed              :  827   (Other)         :7577  
##                   race            sex        hours_per_week 
##   Amer-Indian-Eskimo:  286    Female: 9773   Min.   : 1.00  
##   Asian-Pac-Islander:  894    Male  :20366   1st Qu.:40.00  
##   Black             : 2816                   Median :40.00  
##   Other             :  231                   Mean   :40.93  
##   White             :25912                   3rd Qu.:45.00  
##                                              Max.   :99.00  
##                                                             
##         native_country     income     
##   United-States:27487    <=50K:22633  
##   Mexico       :  606    >50K : 7506  
##   Philippines  :  188                 
##   Germany      :  128                 
##   Puerto-Rico  :  109                 
##   Canada       :  107                 
##  (Other)       : 1514

Logistic Regression

library(caret)

## Warning: package 'caret' was built under R version 4.1.2

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:corrgram':
## 
##     panel.fill

#Lets take 75% of data as training data  and 25% as testing data
size<- round(.75 * dim(adult2)[1])  # training set size
train_set <- adult2[1:size,]
test_set <- adult2[-(1:size),]
#Fitting a model
classifier <- glm(income ~., data = train_set, family = binomial('logit'))
#summary of the model
summary(classifier)

## 
## Call:
## glm(formula = income ~ ., family = binomial("logit"), data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6113  -0.5688  -0.2453  -0.0002   3.5464  
## 
## Coefficients:
##                                             Estimate Std. Error z value
## (Intercept)                               -5.432e+00  7.995e-01  -6.794
## age                                        2.780e-02  1.827e-03  15.216
## workclass Local-gov                       -6.680e-01  1.225e-01  -5.454
## workclass Private                         -4.381e-01  1.019e-01  -4.300
## workclass Self-emp-inc                    -1.611e-01  1.346e-01  -1.196
## workclass Self-emp-not-inc                -8.552e-01  1.197e-01  -7.145
## workclass State-gov                       -8.504e-01  1.373e-01  -6.195
## workclass Without-pay                     -1.355e+01  4.246e+02  -0.032
## education 11th                             4.220e-02  2.253e-01   0.187
## education 12th                             3.406e-01  2.986e-01   1.141
## education 1st-4th                         -3.825e-01  5.283e-01  -0.724
## education 5th-6th                         -6.503e-01  4.204e-01  -1.547
## education 7th-8th                         -6.183e-01  2.641e-01  -2.341
## education 9th                             -5.547e-01  3.070e-01  -1.807
## education Assoc-acdm                       1.252e+00  1.894e-01   6.609
## education Assoc-voc                        1.192e+00  1.814e-01   6.569
## education Bachelors                        1.870e+00  1.677e-01  11.151
## education Doctorate                        2.910e+00  2.361e-01  12.327
## education HS-grad                          6.798e-01  1.628e-01   4.175
## education Masters                          2.299e+00  1.802e-01  12.754
## education Preschool                       -1.209e+01  2.369e+02  -0.051
## education Prof-school                      2.912e+00  2.193e-01  13.275
## education Some-college                     9.710e-01  1.656e-01   5.863
## marital_status Married-AF-spouse           3.092e+00  5.960e-01   5.188
## marital_status Married-civ-spouse          2.111e+00  7.268e-02  29.048
## marital_status Married-spouse-absent      -4.162e-02  2.481e-01  -0.168
## marital_status Never-married              -4.269e-01  8.859e-02  -4.819
## marital_status Separated                  -5.303e-04  1.708e-01  -0.003
## marital_status Widowed                     1.293e-02  1.682e-01   0.077
## occupation Armed-Forces                   -1.363e+01  5.113e+02  -0.027
## occupation Craft-repair                    2.806e-02  8.655e-02   0.324
## occupation Exec-managerial                 8.083e-01  8.268e-02   9.777
## occupation Farming-fishing                -1.030e+00  1.498e-01  -6.876
## occupation Handlers-cleaners              -7.307e-01  1.592e-01  -4.589
## occupation Machine-op-inspct              -3.787e-01  1.129e-01  -3.354
## occupation Other-service                  -8.378e-01  1.261e-01  -6.644
## occupation Priv-house-serv                -1.365e+01  1.168e+02  -0.117
## occupation Prof-specialty                  5.471e-01  8.765e-02   6.241
## occupation Protective-serv                 4.788e-01  1.381e-01   3.468
## occupation Sales                           2.277e-01  8.839e-02   2.576
## occupation Tech-support                    6.021e-01  1.211e-01   4.973
## occupation Transport-moving               -1.320e-01  1.081e-01  -1.222
## race Asian-Pac-Islander                    4.432e-01  3.056e-01   1.450
## race Black                                 2.891e-01  2.530e-01   1.143
## race Other                                -4.065e-01  4.271e-01  -0.952
## race White                                 3.383e-01  2.412e-01   1.403
## sex Male                                   1.840e-01  5.817e-02   3.163
## hours_per_week                             2.846e-02  1.854e-03  15.348
## native_country Canada                     -2.468e-01  7.800e-01  -0.316
## native_country China                      -1.462e+00  8.052e-01  -1.816
## native_country Columbia                   -2.782e+00  1.145e+00  -2.430
## native_country Cuba                       -5.466e-01  8.033e-01  -0.680
## native_country Dominican-Republic         -1.472e+00  1.073e+00  -1.372
## native_country Ecuador                    -1.881e+00  1.159e+00  -1.624
## native_country El-Salvador                -1.211e+00  9.006e-01  -1.345
## native_country England                    -2.409e-01  8.015e-01  -0.301
## native_country France                     -2.681e-01  9.645e-01  -0.278
## native_country Germany                    -1.847e-01  7.748e-01  -0.238
## native_country Greece                     -1.502e+00  9.150e-01  -1.642
## native_country Guatemala                  -1.867e+00  1.317e+00  -1.418
## native_country Haiti                      -1.136e+00  1.029e+00  -1.104
## native_country Holand-Netherlands         -1.223e+01  1.455e+03  -0.008
## native_country Honduras                   -1.047e+00  2.258e+00  -0.464
## native_country Hong                       -6.936e-01  1.046e+00  -0.663
## native_country Hungary                    -1.490e+00  1.324e+00  -1.126
## native_country India                      -1.280e+00  7.773e-01  -1.647
## native_country Iran                       -8.257e-01  8.516e-01  -0.970
## native_country Ireland                     1.665e-01  9.861e-01   0.169
## native_country Italy                       1.934e-01  8.051e-01   0.240
## native_country Jamaica                    -3.499e-01  8.575e-01  -0.408
## native_country Japan                      -3.520e-01  8.275e-01  -0.425
## native_country Laos                       -1.809e+00  1.439e+00  -1.257
## native_country Mexico                     -1.221e+00  7.625e-01  -1.601
## native_country Nicaragua                  -1.048e+00  1.118e+00  -0.938
## native_country Outlying-US(Guam-USVI-etc) -1.407e+01  4.650e+02  -0.030
## native_country Peru                       -1.803e+00  1.335e+00  -1.350
## native_country Philippines                -6.710e-01  7.404e-01  -0.906
## native_country Poland                     -7.929e-01  8.416e-01  -0.942
## native_country Portugal                   -4.437e-01  9.920e-01  -0.447
## native_country Puerto-Rico                -6.912e-01  8.199e-01  -0.843
## native_country Scotland                   -1.272e+00  1.348e+00  -0.943
## native_country South                      -1.570e+00  8.047e-01  -1.951
## native_country Taiwan                     -1.025e+00  8.333e-01  -1.230
## native_country Thailand                   -1.477e+00  1.071e+00  -1.379
## native_country Trinadad&Tobago            -1.493e+00  1.327e+00  -1.125
## native_country United-States              -5.615e-01  7.260e-01  -0.773
## native_country Vietnam                    -2.465e+00  1.049e+00  -2.349
## native_country Yugoslavia                  1.028e-01  1.057e+00   0.097
##                                           Pr(>|z|)    
## (Intercept)                               1.09e-11 ***
## age                                        < 2e-16 ***
## workclass Local-gov                       4.94e-08 ***
## workclass Private                         1.71e-05 ***
## workclass Self-emp-inc                    0.231533    
## workclass Self-emp-not-inc                9.03e-13 ***
## workclass State-gov                       5.83e-10 ***
## workclass Without-pay                     0.974532    
## education 11th                            0.851407    
## education 12th                            0.254008    
## education 1st-4th                         0.469046    
## education 5th-6th                         0.121895    
## education 7th-8th                         0.019242 *  
## education 9th                             0.070827 .  
## education Assoc-acdm                      3.87e-11 ***
## education Assoc-voc                       5.05e-11 ***
## education Bachelors                        < 2e-16 ***
## education Doctorate                        < 2e-16 ***
## education HS-grad                         2.97e-05 ***
## education Masters                          < 2e-16 ***
## education Preschool                       0.959296    
## education Prof-school                      < 2e-16 ***
## education Some-college                    4.56e-09 ***
## marital_status Married-AF-spouse          2.12e-07 ***
## marital_status Married-civ-spouse          < 2e-16 ***
## marital_status Married-spouse-absent      0.866747    
## marital_status Never-married              1.44e-06 ***
## marital_status Separated                  0.997523    
## marital_status Widowed                    0.938713    
## occupation Armed-Forces                   0.978733    
## occupation Craft-repair                   0.745747    
## occupation Exec-managerial                 < 2e-16 ***
## occupation Farming-fishing                6.14e-12 ***
## occupation Handlers-cleaners              4.45e-06 ***
## occupation Machine-op-inspct              0.000795 ***
## occupation Other-service                  3.06e-11 ***
## occupation Priv-house-serv                0.906966    
## occupation Prof-specialty                 4.34e-10 ***
## occupation Protective-serv                0.000525 ***
## occupation Sales                          0.009988 ** 
## occupation Tech-support                   6.58e-07 ***
## occupation Transport-moving               0.221891    
## race Asian-Pac-Islander                   0.146965    
## race Black                                0.253046    
## race Other                                0.341180    
## race White                                0.160697    
## sex Male                                  0.001559 ** 
## hours_per_week                             < 2e-16 ***
## native_country Canada                     0.751651    
## native_country China                      0.069422 .  
## native_country Columbia                   0.015107 *  
## native_country Cuba                       0.496243    
## native_country Dominican-Republic         0.170144    
## native_country Ecuador                    0.104427    
## native_country El-Salvador                0.178781    
## native_country England                    0.763744    
## native_country France                     0.781029    
## native_country Germany                    0.811559    
## native_country Greece                     0.100629    
## native_country Guatemala                  0.156083    
## native_country Haiti                      0.269798    
## native_country Holand-Netherlands         0.993297    
## native_country Honduras                   0.642940    
## native_country Hong                       0.507131    
## native_country Hungary                    0.260160    
## native_country India                      0.099649 .  
## native_country Iran                       0.332282    
## native_country Ireland                    0.865925    
## native_country Italy                      0.810107    
## native_country Jamaica                    0.683241    
## native_country Japan                      0.670534    
## native_country Laos                       0.208670    
## native_country Mexico                     0.109300    
## native_country Nicaragua                  0.348318    
## native_country Outlying-US(Guam-USVI-etc) 0.975867    
## native_country Peru                       0.176979    
## native_country Philippines                0.364796    
## native_country Poland                     0.346119    
## native_country Portugal                   0.654653    
## native_country Puerto-Rico                0.399262    
## native_country Scotland                   0.345684    
## native_country South                      0.051084 .  
## native_country Taiwan                     0.218585    
## native_country Thailand                   0.167822    
## native_country Trinadad&Tobago            0.260410    
## native_country United-States              0.439333    
## native_country Vietnam                    0.018833 *  
## native_country Yugoslavia                 0.922540    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25312  on 22603  degrees of freedom
## Residual deviance: 16343  on 22516  degrees of freedom
## AIC: 16519
## 
## Number of Fisher Scoring iterations: 14

library(ggplot2)
ggplot(test_set) + aes(x=as.numeric(age), group=income, fill=income) + 
  geom_histogram(binwidth=1, color='black')+labs(x="Age",y="Count",title = "Income w.r.t Age")

index <- 1:dim(train_set)[1]
dev_resid <- residuals(classifier)
income <- train_set$income
dff <- data.frame(index, dev_resid, income)

ggplot(dff, aes(x = index, y = dev_resid, color = income)) +
  geom_point() + 
  geom_hline(yintercept = 3, linetype = 'dashed', color = 'blue') +
  geom_hline(yintercept = -3, linetype = 'dashed', color = 'blue')

#predicitng 
prob <- predict(classifier, test_set, type = 'response')
y_pred <- rep('<=50K', length(prob))
y_pred[prob>=.3] <- '>50K'
# confusion matrix 
cm<- table(y_pred, test_set$income)
cm

##        
## y_pred   <=50K  >50K
##   <=50K   4595   390
##   >50K    1035  1515

accuracy=sum(diag(cm)/sum(cm))
accuracy

## [1] 0.8108825

library(ROCR)

## Warning: package 'ROCR' was built under R version 4.1.3

pr <- prediction(prob, test_set$income)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05),main="ROC Graph")

We can conclude that we are getting an accuracy of 81.08% of logistic regression model and our misclassifcation rate is 18.92%

Decision Tree

#We need the following library to perform Decision tree
library(rpart)
tree_adult_model<-rpart(income~.,data = train_set)

test_set$pred_income<-predict(tree_adult_model,test_set,type = "class")
# an extra argument (type = "class") is required to directly classify prediction into classes

head(test_set)

##       age         workclass     education      marital_status       occupation
## 22605  29           Private     Bachelors       Never-married   Prof-specialty
## 22606  44           Private       Masters  Married-civ-spouse  Exec-managerial
## 22607  60  Self-emp-not-inc   Prof-school  Married-civ-spouse            Sales
## 22608  30         Local-gov     Assoc-voc            Divorced  Protective-serv
## 22609  24           Private  Some-college       Never-married            Sales
## 22610  17           Private          10th       Never-married    Other-service
##                      race     sex hours_per_week native_country income
## 22605               Other  Female             50  United-States  <=50K
## 22606               White    Male             45  United-States   >50K
## 22607               White    Male             45  United-States   >50K
## 22608  Amer-Indian-Eskimo  Female             40  United-States  <=50K
## 22609               White  Female             17  United-States  <=50K
## 22610               White  Female             10  United-States  <=50K
##       pred_income
## 22605       <=50K
## 22606        >50K
## 22607        >50K
## 22608       <=50K
## 22609       <=50K
## 22610       <=50K

#confusion matrix
cm1=table(test_set$income,test_set$pred_income)
cm1

##         
##           <=50K  >50K
##    <=50K   5173   457
##    >50K     845  1060

accuracy1=sum(diag(cm1)/sum(cm1))
accuracy1

## [1] 0.8272064

Random Forest

Here it is not required to split the data into training and testing

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.1.3

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

rf_adult_model<-randomForest(income~.,data = adult2)
rf_adult_model

## 
## Call:
##  randomForest(formula = income ~ ., data = adult2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.97%
## Confusion matrix:
##         <=50K  >50K class.error
##  <=50K  20444  2189  0.09671718
##  >50K    2926  4580  0.38982148

Here the Out of Bag error (OOB) gives us the miscalssification rate (MCR) of the model. In this case it comes out to be 16.98%, which gives us the accuracy of 83.02%

plot(rf_adult_model)

Red line represents MCR of class <=50k, green line represents MCR of class >50k and black line represents overall MCR or OOB error. Overall error rate is what we are interested in which seems considerably good.

Comparative Statements

##  Model                               Acuuracy
##  Logistic Regression                  81.08%
##  Decision Tree                        82.72%
##  Random Forest                        83.02%

After performing the classification techniques like Logistic regression,Decision TRee,Random Tree.The classification model that got the highest accuracy is Random Forest compared to the other models like Decision Tree and Logistic Regression.And the missclassifation error is less for Random Forest compared to other models.

Result

After performing various classification techniques and taking into account their accuracies, we can conclude all the models had an accuracy ranging from 81% to 84%. Out of which Random forest gave a slightly better accuracy of 83.58%.So we choose Random Forest as the best model.