Extraction was done by Barry Becker from the 1994 Census database. The US Adult Census dataset is a repository of 48,842 entries extracted from the 1994 US Census database. Prediction task is to determine whether a person makes over 50K a year.Classification has been done to predict whether a person’s yearly income in US falls in the income category of either greater than 50K Dollars or less equal to 50K Dollars category based on a certain set of attributes.
The main objective of the dataset is to classify people earning <=50k or >50k based on several explanatory factors affecting the income of a person like Age, Occupation, Education, etc.
The methods we intend to use are:
Some of the machine learning models have been made in the past by researchers for predicting income levels.
Bekena [1] implemented the Random Forest Classifier algorithm to predict income levels of individuals.
Haojun Zhu [2] attempted Logistic Regression as the Sta- tistical Modelling Tool and 4 different Machine Learning Techniques, Neural Network, Classification and Regres- sion Tree, Random Forest, and Support Vector Machine for predicting Income Levels.
Chockalingam et. al. [3] explored and analysed the Adult Dataset and used several Machine Learning Models like Logistic Regression, Stepwise Logistic Regression, Naive Bayes, Decision Trees, Extra Trees, k-Nearest Neighbor, SVM, Gradient Boosting and 6 configurations of Acti- vated Neural Network. They also drew a comparative analysis of their predictive performances.
Topiwalla [4] made the usage of complex algorithms like XGBOOST, Random Forest and stacking of models for prediction tasks including Logistic Stack on XGBOOST and SVM Stack on Logistic for scaling up the accuracy.
Lazar [5] implemented Principal Component Analysis (PCA) and Support Vector Machine methods to generate and evaluate income prediction data based on the Current Population Survey provided by the U.S. Census Bureau.
Lemon et. al. [6] attempted to identify the important features in the data that could help to optimize the complexity of different machine learning models used in classification tasks.
Deepajothi et. al. [7] tried to replicate Bayesian Networks, Decision Tree Induction, Lazy Classifier and Rule Based Learning Techniques for the Adult Dataset and presented a comparative analysis of the predictive performances.
#Read the data
adult <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"), header = FALSE,stringsAsFactors=T)
# Apply headers
names(adult) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")
head(adult)
## age workclass fnlwgt education education_num marital_status
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hours_per_week native_country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
str(adult)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital_status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours_per_week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native_country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
## $ income : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
#Histogram of age by income group
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
ggplot(adult) + aes(x=as.numeric(age), group=income, fill=income) +
geom_histogram(binwidth=1, color='black')+labs(x="Age",y="Count",title = "Income w.r.t Age")
As we notice majority of the people make less than <50k a year. However, we observe people earning >50k are in their mid career. We make this hypothesis based on the age.
#Histogram of age by gender group
ggplot(adult) + aes(x=as.numeric(age), group=sex, fill=sex) +
geom_histogram(binwidth=1, color='black')+labs(x="Age",y="Count",title = "Income w.r.t Age")
# Barplot
data <- data.frame(table(adult$income,adult$age))
names(data) <- c('income', 'age')
ggplot(data=adult, aes(x=income,y=age, fill=income, order = as.numeric(income))) + geom_bar(stat="identity")+ggtitle("age vs income")
data <- data.frame(table(adult$income,adult$hours_per_week))
names(data) <- c('income', 'hours_per_week')
ggplot(data=adult, aes(x=income,y=hours_per_week, fill=income, order = as.numeric(income))) +
geom_bar(stat="identity")+ggtitle("hours_per_week vs income")
# Boxplot
boxplot(adult$age, xlab= "hours_per_week", ylab = "income", main= "income vs working-hours", horizontal = FALSE,col = "pink")
boxplot(adult$age, xlab= "age", ylab = "income", main= "Income vs age", horizontal = FALSE,col="pink")
# DensityPlot
ggplot(data = adult) +
aes(x = hours_per_week, fill = income) +
geom_density(alpha=0.3)+ggtitle("Denisty plot")
ggplot(data = adult) +
aes(x = age, fill = income) +
geom_density(alpha=0.3)+ggtitle("Denisty plot")
# BubblePlot
ggplot(adult,aes(x=age,y=capital_gain))+geom_point(alpha=0.7)+ggtitle("bubble plot-agevscapital_gain")+theme(plot.title=element_text(hjust=0.5))
summary(adult)
## age workclass fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education_num marital_status
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty :4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair :4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial:4066 Other-relative: 981 Black : 3124
## Adm-clerical :3770 Own-child : 5068 Other : 271
## Sales :3650 Unmarried : 3446 White :27816
## Other-service :3295 Wife : 1568
## (Other) :9541
## sex capital_gain capital_loss hours_per_week
## Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1078 Mean : 87.3 Mean :40.44
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## native_country income
## United-States:29170 <=50K:24720
## Mexico : 643 >50K : 7841
## ? : 583
## Philippines : 198
## Germany : 137
## Canada : 121
## (Other) : 1709
library(corrgram)
## Warning: package 'corrgram' was built under R version 4.1.3
corrgram(adult, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of adult intercorrelations")
ggplot(adult,aes(x=age,y=income))+geom_violin(fill="lightblue",color="black")+ggtitle(" AGE VS INCOME")
pairs(adult[c(1,3,5,12,13)],pch=21,col="lightblue",main="Scatterplot")
# find elements which has ? in place of NA
null_positions <- adult == " ?"
# replace elements with NA
is.na(adult) <- null_positions
# Now dropping the null values
adult1 <- na.omit(adult)
library(arules)
## Warning: package 'arules' was built under R version 4.1.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.1.3
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Now dropping the duplicates
adult2<- adult1 %>% distinct()
# Removing factor levels which doesn't occur (i.e., zero frequency)
adult2 <- droplevels(adult2)
# Type of Attributes in the data
names(adult2)
## [1] "age" "workclass" "fnlwgt" "education"
## [5] "education_num" "marital_status" "occupation" "relationship"
## [9] "race" "sex" "capital_gain" "capital_loss"
## [13] "hours_per_week" "native_country" "income"
# Number of categorical attributes in the data
info <- sapply(adult2, is.factor)
# Printing levels of each categorical attribute
for(i in 1:ncol(adult2)){
if(info[i] == TRUE){
print(colnames(adult2)[i])
temp <- as.data.frame(table(adult2[ , i]))
print(temp)
cat ("\n\n\n")
}
}
## [1] "workclass"
## Var1 Freq
## 1 Federal-gov 943
## 2 Local-gov 2067
## 3 Private 22264
## 4 Self-emp-inc 1074
## 5 Self-emp-not-inc 2498
## 6 State-gov 1279
## 7 Without-pay 14
##
##
##
## [1] "education"
## Var1 Freq
## 1 10th 820
## 2 11th 1048
## 3 12th 377
## 4 1st-4th 149
## 5 5th-6th 287
## 6 7th-8th 556
## 7 9th 455
## 8 Assoc-acdm 1008
## 9 Assoc-voc 1307
## 10 Bachelors 5042
## 11 Doctorate 375
## 12 HS-grad 9834
## 13 Masters 1626
## 14 Preschool 44
## 15 Prof-school 542
## 16 Some-college 6669
##
##
##
## [1] "marital_status"
## Var1 Freq
## 1 Divorced 4212
## 2 Married-AF-spouse 21
## 3 Married-civ-spouse 14059
## 4 Married-spouse-absent 370
## 5 Never-married 9711
## 6 Separated 939
## 7 Widowed 827
##
##
##
## [1] "occupation"
## Var1 Freq
## 1 Adm-clerical 3719
## 2 Armed-Forces 9
## 3 Craft-repair 4025
## 4 Exec-managerial 3991
## 5 Farming-fishing 987
## 6 Handlers-cleaners 1349
## 7 Machine-op-inspct 1964
## 8 Other-service 3209
## 9 Priv-house-serv 141
## 10 Prof-specialty 4034
## 11 Protective-serv 644
## 12 Sales 3584
## 13 Tech-support 911
## 14 Transport-moving 1572
##
##
##
## [1] "relationship"
## Var1 Freq
## 1 Husband 12457
## 2 Not-in-family 7714
## 3 Other-relative 889
## 4 Own-child 4462
## 5 Unmarried 3211
## 6 Wife 1406
##
##
##
## [1] "race"
## Var1 Freq
## 1 Amer-Indian-Eskimo 286
## 2 Asian-Pac-Islander 894
## 3 Black 2816
## 4 Other 231
## 5 White 25912
##
##
##
## [1] "sex"
## Var1 Freq
## 1 Female 9773
## 2 Male 20366
##
##
##
## [1] "native_country"
## Var1 Freq
## 1 Cambodia 18
## 2 Canada 107
## 3 China 68
## 4 Columbia 56
## 5 Cuba 92
## 6 Dominican-Republic 67
## 7 Ecuador 27
## 8 El-Salvador 100
## 9 England 86
## 10 France 27
## 11 Germany 128
## 12 Greece 29
## 13 Guatemala 61
## 14 Haiti 42
## 15 Holand-Netherlands 1
## 16 Honduras 12
## 17 Hong 19
## 18 Hungary 13
## 19 India 100
## 20 Iran 42
## 21 Ireland 24
## 22 Italy 68
## 23 Jamaica 80
## 24 Japan 59
## 25 Laos 17
## 26 Mexico 606
## 27 Nicaragua 33
## 28 Outlying-US(Guam-USVI-etc) 14
## 29 Peru 30
## 30 Philippines 188
## 31 Poland 56
## 32 Portugal 34
## 33 Puerto-Rico 109
## 34 Scotland 11
## 35 South 71
## 36 Taiwan 42
## 37 Thailand 17
## 38 Trinadad&Tobago 18
## 39 United-States 27487
## 40 Vietnam 64
## 41 Yugoslavia 16
##
##
##
## [1] "income"
## Var1 Freq
## 1 <=50K 22633
## 2 >50K 7506
It is observed that some variables are not self-explanatory.
capital_gain and capital_loss are income from other sources like investments other than salary which have no relevance here.
The continuous variable fnlwgt represents final weight, which is the number of units in the target population that the responding unit represents.
The variable education_num stands for the number of years of education in total, which is a continuous representation of the discrete variable education.
The variable relationship represents the responding members’s role in the family.
For simplicity of this analysis, the following variables are removed education.num, relationship, fnlwgt, capital.gain and capital.loss
adult2$capital_gain<-NULL
adult2$capital_loss<-NULL
adult2$fnlwgt<-NULL
adult2$education_num<-NULL
adult2$relationship<-NULL
# Summary of all the attributes in the data
summary(adult2)
## age workclass education
## Min. :17.00 Federal-gov : 943 HS-grad :9834
## 1st Qu.:28.00 Local-gov : 2067 Some-college:6669
## Median :37.00 Private :22264 Bachelors :5042
## Mean :38.44 Self-emp-inc : 1074 Masters :1626
## 3rd Qu.:47.00 Self-emp-not-inc: 2498 Assoc-voc :1307
## Max. :90.00 State-gov : 1279 11th :1048
## Without-pay : 14 (Other) :4613
## marital_status occupation
## Divorced : 4212 Prof-specialty :4034
## Married-AF-spouse : 21 Craft-repair :4025
## Married-civ-spouse :14059 Exec-managerial:3991
## Married-spouse-absent: 370 Adm-clerical :3719
## Never-married : 9711 Sales :3584
## Separated : 939 Other-service :3209
## Widowed : 827 (Other) :7577
## race sex hours_per_week
## Amer-Indian-Eskimo: 286 Female: 9773 Min. : 1.00
## Asian-Pac-Islander: 894 Male :20366 1st Qu.:40.00
## Black : 2816 Median :40.00
## Other : 231 Mean :40.93
## White :25912 3rd Qu.:45.00
## Max. :99.00
##
## native_country income
## United-States:27487 <=50K:22633
## Mexico : 606 >50K : 7506
## Philippines : 188
## Germany : 128
## Puerto-Rico : 109
## Canada : 107
## (Other) : 1514
library(caret)
## Warning: package 'caret' was built under R version 4.1.2
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
#Lets take 75% of data as training data and 25% as testing data
size<- round(.75 * dim(adult2)[1]) # training set size
train_set <- adult2[1:size,]
test_set <- adult2[-(1:size),]
#Fitting a model
classifier <- glm(income ~., data = train_set, family = binomial('logit'))
#summary of the model
summary(classifier)
##
## Call:
## glm(formula = income ~ ., family = binomial("logit"), data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6113 -0.5688 -0.2453 -0.0002 3.5464
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -5.432e+00 7.995e-01 -6.794
## age 2.780e-02 1.827e-03 15.216
## workclass Local-gov -6.680e-01 1.225e-01 -5.454
## workclass Private -4.381e-01 1.019e-01 -4.300
## workclass Self-emp-inc -1.611e-01 1.346e-01 -1.196
## workclass Self-emp-not-inc -8.552e-01 1.197e-01 -7.145
## workclass State-gov -8.504e-01 1.373e-01 -6.195
## workclass Without-pay -1.355e+01 4.246e+02 -0.032
## education 11th 4.220e-02 2.253e-01 0.187
## education 12th 3.406e-01 2.986e-01 1.141
## education 1st-4th -3.825e-01 5.283e-01 -0.724
## education 5th-6th -6.503e-01 4.204e-01 -1.547
## education 7th-8th -6.183e-01 2.641e-01 -2.341
## education 9th -5.547e-01 3.070e-01 -1.807
## education Assoc-acdm 1.252e+00 1.894e-01 6.609
## education Assoc-voc 1.192e+00 1.814e-01 6.569
## education Bachelors 1.870e+00 1.677e-01 11.151
## education Doctorate 2.910e+00 2.361e-01 12.327
## education HS-grad 6.798e-01 1.628e-01 4.175
## education Masters 2.299e+00 1.802e-01 12.754
## education Preschool -1.209e+01 2.369e+02 -0.051
## education Prof-school 2.912e+00 2.193e-01 13.275
## education Some-college 9.710e-01 1.656e-01 5.863
## marital_status Married-AF-spouse 3.092e+00 5.960e-01 5.188
## marital_status Married-civ-spouse 2.111e+00 7.268e-02 29.048
## marital_status Married-spouse-absent -4.162e-02 2.481e-01 -0.168
## marital_status Never-married -4.269e-01 8.859e-02 -4.819
## marital_status Separated -5.303e-04 1.708e-01 -0.003
## marital_status Widowed 1.293e-02 1.682e-01 0.077
## occupation Armed-Forces -1.363e+01 5.113e+02 -0.027
## occupation Craft-repair 2.806e-02 8.655e-02 0.324
## occupation Exec-managerial 8.083e-01 8.268e-02 9.777
## occupation Farming-fishing -1.030e+00 1.498e-01 -6.876
## occupation Handlers-cleaners -7.307e-01 1.592e-01 -4.589
## occupation Machine-op-inspct -3.787e-01 1.129e-01 -3.354
## occupation Other-service -8.378e-01 1.261e-01 -6.644
## occupation Priv-house-serv -1.365e+01 1.168e+02 -0.117
## occupation Prof-specialty 5.471e-01 8.765e-02 6.241
## occupation Protective-serv 4.788e-01 1.381e-01 3.468
## occupation Sales 2.277e-01 8.839e-02 2.576
## occupation Tech-support 6.021e-01 1.211e-01 4.973
## occupation Transport-moving -1.320e-01 1.081e-01 -1.222
## race Asian-Pac-Islander 4.432e-01 3.056e-01 1.450
## race Black 2.891e-01 2.530e-01 1.143
## race Other -4.065e-01 4.271e-01 -0.952
## race White 3.383e-01 2.412e-01 1.403
## sex Male 1.840e-01 5.817e-02 3.163
## hours_per_week 2.846e-02 1.854e-03 15.348
## native_country Canada -2.468e-01 7.800e-01 -0.316
## native_country China -1.462e+00 8.052e-01 -1.816
## native_country Columbia -2.782e+00 1.145e+00 -2.430
## native_country Cuba -5.466e-01 8.033e-01 -0.680
## native_country Dominican-Republic -1.472e+00 1.073e+00 -1.372
## native_country Ecuador -1.881e+00 1.159e+00 -1.624
## native_country El-Salvador -1.211e+00 9.006e-01 -1.345
## native_country England -2.409e-01 8.015e-01 -0.301
## native_country France -2.681e-01 9.645e-01 -0.278
## native_country Germany -1.847e-01 7.748e-01 -0.238
## native_country Greece -1.502e+00 9.150e-01 -1.642
## native_country Guatemala -1.867e+00 1.317e+00 -1.418
## native_country Haiti -1.136e+00 1.029e+00 -1.104
## native_country Holand-Netherlands -1.223e+01 1.455e+03 -0.008
## native_country Honduras -1.047e+00 2.258e+00 -0.464
## native_country Hong -6.936e-01 1.046e+00 -0.663
## native_country Hungary -1.490e+00 1.324e+00 -1.126
## native_country India -1.280e+00 7.773e-01 -1.647
## native_country Iran -8.257e-01 8.516e-01 -0.970
## native_country Ireland 1.665e-01 9.861e-01 0.169
## native_country Italy 1.934e-01 8.051e-01 0.240
## native_country Jamaica -3.499e-01 8.575e-01 -0.408
## native_country Japan -3.520e-01 8.275e-01 -0.425
## native_country Laos -1.809e+00 1.439e+00 -1.257
## native_country Mexico -1.221e+00 7.625e-01 -1.601
## native_country Nicaragua -1.048e+00 1.118e+00 -0.938
## native_country Outlying-US(Guam-USVI-etc) -1.407e+01 4.650e+02 -0.030
## native_country Peru -1.803e+00 1.335e+00 -1.350
## native_country Philippines -6.710e-01 7.404e-01 -0.906
## native_country Poland -7.929e-01 8.416e-01 -0.942
## native_country Portugal -4.437e-01 9.920e-01 -0.447
## native_country Puerto-Rico -6.912e-01 8.199e-01 -0.843
## native_country Scotland -1.272e+00 1.348e+00 -0.943
## native_country South -1.570e+00 8.047e-01 -1.951
## native_country Taiwan -1.025e+00 8.333e-01 -1.230
## native_country Thailand -1.477e+00 1.071e+00 -1.379
## native_country Trinadad&Tobago -1.493e+00 1.327e+00 -1.125
## native_country United-States -5.615e-01 7.260e-01 -0.773
## native_country Vietnam -2.465e+00 1.049e+00 -2.349
## native_country Yugoslavia 1.028e-01 1.057e+00 0.097
## Pr(>|z|)
## (Intercept) 1.09e-11 ***
## age < 2e-16 ***
## workclass Local-gov 4.94e-08 ***
## workclass Private 1.71e-05 ***
## workclass Self-emp-inc 0.231533
## workclass Self-emp-not-inc 9.03e-13 ***
## workclass State-gov 5.83e-10 ***
## workclass Without-pay 0.974532
## education 11th 0.851407
## education 12th 0.254008
## education 1st-4th 0.469046
## education 5th-6th 0.121895
## education 7th-8th 0.019242 *
## education 9th 0.070827 .
## education Assoc-acdm 3.87e-11 ***
## education Assoc-voc 5.05e-11 ***
## education Bachelors < 2e-16 ***
## education Doctorate < 2e-16 ***
## education HS-grad 2.97e-05 ***
## education Masters < 2e-16 ***
## education Preschool 0.959296
## education Prof-school < 2e-16 ***
## education Some-college 4.56e-09 ***
## marital_status Married-AF-spouse 2.12e-07 ***
## marital_status Married-civ-spouse < 2e-16 ***
## marital_status Married-spouse-absent 0.866747
## marital_status Never-married 1.44e-06 ***
## marital_status Separated 0.997523
## marital_status Widowed 0.938713
## occupation Armed-Forces 0.978733
## occupation Craft-repair 0.745747
## occupation Exec-managerial < 2e-16 ***
## occupation Farming-fishing 6.14e-12 ***
## occupation Handlers-cleaners 4.45e-06 ***
## occupation Machine-op-inspct 0.000795 ***
## occupation Other-service 3.06e-11 ***
## occupation Priv-house-serv 0.906966
## occupation Prof-specialty 4.34e-10 ***
## occupation Protective-serv 0.000525 ***
## occupation Sales 0.009988 **
## occupation Tech-support 6.58e-07 ***
## occupation Transport-moving 0.221891
## race Asian-Pac-Islander 0.146965
## race Black 0.253046
## race Other 0.341180
## race White 0.160697
## sex Male 0.001559 **
## hours_per_week < 2e-16 ***
## native_country Canada 0.751651
## native_country China 0.069422 .
## native_country Columbia 0.015107 *
## native_country Cuba 0.496243
## native_country Dominican-Republic 0.170144
## native_country Ecuador 0.104427
## native_country El-Salvador 0.178781
## native_country England 0.763744
## native_country France 0.781029
## native_country Germany 0.811559
## native_country Greece 0.100629
## native_country Guatemala 0.156083
## native_country Haiti 0.269798
## native_country Holand-Netherlands 0.993297
## native_country Honduras 0.642940
## native_country Hong 0.507131
## native_country Hungary 0.260160
## native_country India 0.099649 .
## native_country Iran 0.332282
## native_country Ireland 0.865925
## native_country Italy 0.810107
## native_country Jamaica 0.683241
## native_country Japan 0.670534
## native_country Laos 0.208670
## native_country Mexico 0.109300
## native_country Nicaragua 0.348318
## native_country Outlying-US(Guam-USVI-etc) 0.975867
## native_country Peru 0.176979
## native_country Philippines 0.364796
## native_country Poland 0.346119
## native_country Portugal 0.654653
## native_country Puerto-Rico 0.399262
## native_country Scotland 0.345684
## native_country South 0.051084 .
## native_country Taiwan 0.218585
## native_country Thailand 0.167822
## native_country Trinadad&Tobago 0.260410
## native_country United-States 0.439333
## native_country Vietnam 0.018833 *
## native_country Yugoslavia 0.922540
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25312 on 22603 degrees of freedom
## Residual deviance: 16343 on 22516 degrees of freedom
## AIC: 16519
##
## Number of Fisher Scoring iterations: 14
library(ggplot2)
ggplot(test_set) + aes(x=as.numeric(age), group=income, fill=income) +
geom_histogram(binwidth=1, color='black')+labs(x="Age",y="Count",title = "Income w.r.t Age")
index <- 1:dim(train_set)[1]
dev_resid <- residuals(classifier)
income <- train_set$income
dff <- data.frame(index, dev_resid, income)
ggplot(dff, aes(x = index, y = dev_resid, color = income)) +
geom_point() +
geom_hline(yintercept = 3, linetype = 'dashed', color = 'blue') +
geom_hline(yintercept = -3, linetype = 'dashed', color = 'blue')
#predicitng
prob <- predict(classifier, test_set, type = 'response')
y_pred <- rep('<=50K', length(prob))
y_pred[prob>=.3] <- '>50K'
# confusion matrix
cm<- table(y_pred, test_set$income)
cm
##
## y_pred <=50K >50K
## <=50K 4595 390
## >50K 1035 1515
accuracy=sum(diag(cm)/sum(cm))
accuracy
## [1] 0.8108825
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.1.3
pr <- prediction(prob, test_set$income)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05),main="ROC Graph")
We can conclude that we are getting an accuracy of 81.08% of logistic regression model and our misclassifcation rate is 18.92%
#We need the following library to perform Decision tree
library(rpart)
tree_adult_model<-rpart(income~.,data = train_set)
test_set$pred_income<-predict(tree_adult_model,test_set,type = "class")
# an extra argument (type = "class") is required to directly classify prediction into classes
head(test_set)
## age workclass education marital_status occupation
## 22605 29 Private Bachelors Never-married Prof-specialty
## 22606 44 Private Masters Married-civ-spouse Exec-managerial
## 22607 60 Self-emp-not-inc Prof-school Married-civ-spouse Sales
## 22608 30 Local-gov Assoc-voc Divorced Protective-serv
## 22609 24 Private Some-college Never-married Sales
## 22610 17 Private 10th Never-married Other-service
## race sex hours_per_week native_country income
## 22605 Other Female 50 United-States <=50K
## 22606 White Male 45 United-States >50K
## 22607 White Male 45 United-States >50K
## 22608 Amer-Indian-Eskimo Female 40 United-States <=50K
## 22609 White Female 17 United-States <=50K
## 22610 White Female 10 United-States <=50K
## pred_income
## 22605 <=50K
## 22606 >50K
## 22607 >50K
## 22608 <=50K
## 22609 <=50K
## 22610 <=50K
#confusion matrix
cm1=table(test_set$income,test_set$pred_income)
cm1
##
## <=50K >50K
## <=50K 5173 457
## >50K 845 1060
accuracy1=sum(diag(cm1)/sum(cm1))
accuracy1
## [1] 0.8272064
Here it is not required to split the data into training and testing
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf_adult_model<-randomForest(income~.,data = adult2)
rf_adult_model
##
## Call:
## randomForest(formula = income ~ ., data = adult2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.97%
## Confusion matrix:
## <=50K >50K class.error
## <=50K 20444 2189 0.09671718
## >50K 2926 4580 0.38982148
Here the Out of Bag error (OOB) gives us the miscalssification rate (MCR) of the model. In this case it comes out to be 16.98%, which gives us the accuracy of 83.02%
plot(rf_adult_model)
Red line represents MCR of class <=50k, green line represents MCR of class >50k and black line represents overall MCR or OOB error. Overall error rate is what we are interested in which seems considerably good.
## Model Acuuracy
## Logistic Regression 81.08%
## Decision Tree 82.72%
## Random Forest 83.02%
After performing the classification techniques like Logistic regression,Decision TRee,Random Tree.The classification model that got the highest accuracy is Random Forest compared to the other models like Decision Tree and Logistic Regression.And the missclassifation error is less for Random Forest compared to other models.
After performing various classification techniques and taking into account their accuracies, we can conclude all the models had an accuracy ranging from 81% to 84%. Out of which Random forest gave a slightly better accuracy of 83.58%.So we choose Random Forest as the best model.