The description of each variable is as follows:
1)age: continuous.
2)workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 3)fnlwgt: continuous.Final Weight Determined by Census Org
4)education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5)education-num: continuous.Number of years of education
6)marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7)occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8)relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9)race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10)sex: Female, Male.
11)capital-gain: continuous.
12)capital-loss: continuous.
13)hours-per-week: continuous.
14)native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
15)income : Annual income (<=50k, >50k)
#Exploratory Data Analysis
Income=read.table("/Users/softman/Downloads/Incomedataset.csv" , header=TRUE, sep=",")
head(Income)
## age workclass fnlwgt education educational.num marital.status
## 1 67 Private 366425 Doctorate 16 Divorced
## 2 17 Private 244602 12th 8 Never-married
## 3 31 Private 174201 Bachelors 13 Married-civ-spouse
## 4 58 State-gov 110199 7th-8th 4 Married-civ-spouse
## 5 25 State-gov 149248 Some-college 10 Never-married
## 6 59 State-gov 105363 HS-grad 9 Never-married
## occupation relationship race gender capital.gain capital.loss
## 1 Exec-managerial Not-in-family White Male 99999 0
## 2 Other-service Own-child White Male 0 0
## 3 Exec-managerial Husband White Male 0 0
## 4 Transport-moving Husband White Male 0 0
## 5 Other-service Not-in-family Black Male 0 0
## 6 Adm-clerical Own-child White Male 0 0
## hours.per.week native.country income_.50K
## 1 60 United-States 1
## 2 15 United-States 0
## 3 40 United-States 1
## 4 40 United-States 0
## 5 40 United-States 0
## 6 40 United-States 0
sapply(Income, class)
## age workclass fnlwgt education educational.num
## "integer" "character" "integer" "character" "integer"
## marital.status occupation relationship race gender
## "character" "character" "character" "character" "character"
## capital.gain capital.loss hours.per.week native.country income_.50K
## "integer" "integer" "integer" "character" "integer"
#ncol(Income)
COMMENT
The dataset has 43957 individual cases in total and a total of 15 variables or columns. Following below are the list of variables in the dataset.
The following list shows the data type of each variable. As expected, age, hours.per.week, capital.gain, capital.loss and education.num are integers, all other variables are factors.
colnames(Income)
## [1] "age" "workclass" "fnlwgt" "education"
## [5] "educational.num" "marital.status" "occupation" "relationship"
## [9] "race" "gender" "capital.gain" "capital.loss"
## [13] "hours.per.week" "native.country" "income_.50K"
COMMENT No NA’s in the data
summary(Income)
## age workclass fnlwgt education
## Min. :17.00 Length:43957 Min. : 13492 Length:43957
## 1st Qu.:28.00 Class :character 1st Qu.: 117496 Class :character
## Median :37.00 Mode :character Median : 178100 Mode :character
## Mean :38.62 Mean : 189673
## 3rd Qu.:48.00 3rd Qu.: 237671
## Max. :90.00 Max. :1490400
## educational.num marital.status occupation relationship
## Min. : 1.00 Length:43957 Length:43957 Length:43957
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.07
## 3rd Qu.:12.00
## Max. :16.00
## race gender capital.gain capital.loss
## Length:43957 Length:43957 Min. : 0 Min. : 0.00
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.00
## Mode :character Mode :character Median : 0 Median : 0.00
## Mean : 1094 Mean : 88.25
## 3rd Qu.: 0 3rd Qu.: 0.00
## Max. :99999 Max. :4356.00
## hours.per.week native.country income_.50K
## Min. : 1.00 Length:43957 Min. :0.0000
## 1st Qu.:40.00 Class :character 1st Qu.:0.0000
## Median :40.00 Mode :character Median :0.0000
## Mean :40.41 Mean :0.2393
## 3rd Qu.:45.00 3rd Qu.:0.0000
## Max. :99.00 Max. :1.0000
which(is.na(Income))
## integer(0)
sum(is.na(Income))
## [1] 0
table(complete.cases(Income))
##
## TRUE
## 43957
Incomedata <- na.omit(Income)
head(Incomedata)
## age workclass fnlwgt education educational.num marital.status
## 1 67 Private 366425 Doctorate 16 Divorced
## 2 17 Private 244602 12th 8 Never-married
## 3 31 Private 174201 Bachelors 13 Married-civ-spouse
## 4 58 State-gov 110199 7th-8th 4 Married-civ-spouse
## 5 25 State-gov 149248 Some-college 10 Never-married
## 6 59 State-gov 105363 HS-grad 9 Never-married
## occupation relationship race gender capital.gain capital.loss
## 1 Exec-managerial Not-in-family White Male 99999 0
## 2 Other-service Own-child White Male 0 0
## 3 Exec-managerial Husband White Male 0 0
## 4 Transport-moving Husband White Male 0 0
## 5 Other-service Not-in-family Black Male 0 0
## 6 Adm-clerical Own-child White Male 0 0
## hours.per.week native.country income_.50K
## 1 60 United-States 1
## 2 15 United-States 0
## 3 40 United-States 1
## 4 40 United-States 0
## 5 40 United-States 0
## 6 40 United-States 0
str(Incomedata)
## 'data.frame': 43957 obs. of 15 variables:
## $ age : int 67 17 31 58 25 59 70 35 28 28 ...
## $ workclass : chr "Private" "Private" "Private" "State-gov" ...
## $ fnlwgt : int 366425 244602 174201 110199 149248 105363 216390 361888 74784 118089 ...
## $ education : chr "Doctorate" "12th" "Bachelors" "7th-8th" ...
## $ educational.num: int 16 8 13 4 10 9 5 13 9 9 ...
## $ marital.status : chr "Divorced" "Never-married" "Married-civ-spouse" "Married-civ-spouse" ...
## $ occupation : chr "Exec-managerial" "Other-service" "Exec-managerial" "Transport-moving" ...
## $ relationship : chr "Not-in-family" "Own-child" "Husband" "Husband" ...
## $ race : chr "White" "White" "White" "White" ...
## $ gender : chr "Male" "Male" "Male" "Male" ...
## $ capital.gain : int 99999 0 0 0 0 0 2653 0 0 4386 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week : int 60 15 40 40 40 40 40 60 50 45 ...
## $ native.country : chr "United-States" "United-States" "United-States" "United-States" ...
## $ income_.50K : int 1 0 1 0 0 0 0 0 0 1 ...
COMMENT Here is the list of the numerical variables in the dataset
Numeric_variables= Incomedata[,-c(2,4,6,7,8,9,10,14)]
summary(Numeric_variables)
## age fnlwgt educational.num capital.gain
## Min. :17.00 Min. : 13492 Min. : 1.00 Min. : 0
## 1st Qu.:28.00 1st Qu.: 117496 1st Qu.: 9.00 1st Qu.: 0
## Median :37.00 Median : 178100 Median :10.00 Median : 0
## Mean :38.62 Mean : 189673 Mean :10.07 Mean : 1094
## 3rd Qu.:48.00 3rd Qu.: 237671 3rd Qu.:12.00 3rd Qu.: 0
## Max. :90.00 Max. :1490400 Max. :16.00 Max. :99999
## capital.loss hours.per.week income_.50K
## Min. : 0.00 Min. : 1.00 Min. :0.0000
## 1st Qu.: 0.00 1st Qu.:40.00 1st Qu.:0.0000
## Median : 0.00 Median :40.00 Median :0.0000
## Mean : 88.25 Mean :40.41 Mean :0.2393
## 3rd Qu.: 0.00 3rd Qu.:45.00 3rd Qu.:0.0000
## Max. :4356.00 Max. :99.00 Max. :1.0000
COMMENT Here are the list of the categorical variables in the dataset
categorical_variables=Incomedata[,-c(1,3,5,11,12,13)]
summary(categorical_variables)
## workclass education marital.status occupation
## Length:43957 Length:43957 Length:43957 Length:43957
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## relationship race gender native.country
## Length:43957 Length:43957 Length:43957 Length:43957
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## income_.50K
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2393
## 3rd Qu.:0.0000
## Max. :1.0000
COMMENT
library(ggplot2)
ggplot(Numeric_variables, aes(x=income_.50K, y=age, fill = income_.50K==1)) +
geom_boxplot()
COMMENT
#The Boxplot below shows education with annual income greater than $50k have Bachelor’s Degree, Master’s Degree, and PhD.
ggplot(Numeric_variables, aes(x=income_.50K, y=educational.num, fill = income_.50K==1)) +
geom_boxplot()
COMMENT #The boxplot below shows people who work 40hours per week have an annual income greater than $50k are older.
ggplot(Numeric_variables, aes(x=income_.50K, y=hours.per.week, fill = income_.50K==1)) +
geom_boxplot()
COMMENT
ggplot(categorical_variables,aes(x=occupation,y=income_.50K==1,fill=income_.50K==1))+
geom_bar(stat="identity")+theme(axis.text.x = element_text(angle =45, hjust = 1))+
ggtitle('Proportion patterns of Occupations with income above 50k')+
xlab("Occupation")+
ylab("Number of People")
COMMENT #The bar chart below shows proportion patterns of Professionals with higher educational degrees who are more likely to have an income higher than $50k.
ggplot(categorical_variables,aes(x=education,y=income_.50K==1,fill=income_.50K==1))+
geom_bar(stat="identity")+theme(axis.text.x = element_text(angle =45, hjust = 1))+
ggtitle('Proportion patterns of Professionals with income above 50k')+
xlab("education")+
ylab("Number of People")
COMMENT #People employed by private companies have more people with income above 50k, and Self Employed people have a higher proportion of people with income higher than 50k.
ggplot(categorical_variables,aes(x=workclass,y=income_.50K==1,fill=income_.50K==1))+
geom_bar(stat="identity")+theme(axis.text.x = element_text(angle =45, hjust = 1))+
ggtitle('Proportion patterns of workclass with income above 50k')+
xlab("workclass")+
ylab("Number of People")
COMMENT
Converted my variables into factor level
Incomedata$workclass <- as.factor(Incomedata$workclass)
Incomedata$education <- as.factor(Incomedata$education)
Incomedata$occupation <- as.factor(Incomedata$occupation)
Incomedata$relationship <- as.factor(Incomedata$relationship)
Incomedata$race <- as.factor(Incomedata$race)
Incomedata$native.country <- as.factor(Incomedata$native.country)
Incomedata$gender <- as.factor(Incomedata$gender)
Incomedata$marital.status <- as.factor(Incomedata$marital.status)
#Incomedata$income_.50K <- factor(Incomedata$income_.50K)
COMMENT #Separated data into 80% training and 20% testing data
set.seed(123)
sizee <- floor(0.80 * nrow(Incomedata))
train_ind=sample(seq_len(nrow(Incomedata)), size =sizee)
train.inc=Incomedata[train_ind, ]
test.inc=Incomedata[-train_ind, ]
head(Incomedata)
## age workclass fnlwgt education educational.num marital.status
## 1 67 Private 366425 Doctorate 16 Divorced
## 2 17 Private 244602 12th 8 Never-married
## 3 31 Private 174201 Bachelors 13 Married-civ-spouse
## 4 58 State-gov 110199 7th-8th 4 Married-civ-spouse
## 5 25 State-gov 149248 Some-college 10 Never-married
## 6 59 State-gov 105363 HS-grad 9 Never-married
## occupation relationship race gender capital.gain capital.loss
## 1 Exec-managerial Not-in-family White Male 99999 0
## 2 Other-service Own-child White Male 0 0
## 3 Exec-managerial Husband White Male 0 0
## 4 Transport-moving Husband White Male 0 0
## 5 Other-service Not-in-family Black Male 0 0
## 6 Adm-clerical Own-child White Male 0 0
## hours.per.week native.country income_.50K
## 1 60 United-States 1
## 2 15 United-States 0
## 3 40 United-States 1
## 4 40 United-States 0
## 5 40 United-States 0
## 6 40 United-States 0
Data Modelling # Logistic Regression
str(Incomedata)
## 'data.frame': 43957 obs. of 15 variables:
## $ age : int 67 17 31 58 25 59 70 35 28 28 ...
## $ workclass : Factor w/ 9 levels "","Federal-gov",..: 5 5 5 8 8 8 5 7 5 5 ...
## $ fnlwgt : int 366425 244602 174201 110199 149248 105363 216390 361888 74784 118089 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 11 3 10 6 16 12 7 10 12 12 ...
## $ educational.num: int 16 8 13 4 10 9 5 13 9 9 ...
## $ marital.status : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 1 5 3 3 5 5 3 3 5 3 ...
## $ occupation : Factor w/ 15 levels "","Adm-clerical",..: 5 9 5 15 9 2 8 13 7 5 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 4 1 1 2 4 6 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 3 5 5 5 5 5 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 1 2 2 2 ...
## $ capital.gain : int 99999 0 0 0 0 0 2653 0 0 4386 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week : int 60 15 40 40 40 40 40 60 50 45 ...
## $ native.country : Factor w/ 42 levels "","Cambodia",..: 40 40 40 40 40 40 40 25 40 40 ...
## $ income_.50K : int 1 0 1 0 0 0 0 0 0 1 ...
#Logistic Regression
start_time <- proc.time()
logit_model <- glm(as.factor(income_.50K)~.,
data = train.inc, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
time.log.rad.over <- proc.time() - start_time
time.log.rad.over
## user system elapsed
## 6.460 0.245 6.724
summary(logit_model)
##
## Call:
## glm(formula = as.factor(income_.50K) ~ ., family = binomial,
## data = train.inc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1268 -0.5048 -0.1829 -0.0377 3.7826
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.958e+00 4.280e-01 -20.932 < 2e-16
## age 2.551e-02 1.595e-03 16.000 < 2e-16
## workclassFederal-gov 1.257e+00 1.513e-01 8.312 < 2e-16
## workclassLocal-gov 6.103e-01 1.387e-01 4.400 1.08e-05
## workclassNever-worked -9.157e+00 1.586e+02 -0.058 0.953963
## workclassPrivate 7.668e-01 1.238e-01 6.193 5.92e-10
## workclassSelf-emp-inc 9.444e-01 1.470e-01 6.424 1.32e-10
## workclassSelf-emp-not-inc 2.401e-01 1.347e-01 1.783 0.074620
## workclassState-gov 4.390e-01 1.480e-01 2.967 0.003003
## workclassWithout-pay -5.398e-01 1.088e+00 -0.496 0.619909
## fnlwgt 6.036e-07 1.663e-07 3.629 0.000285
## education11th 3.691e-02 2.026e-01 0.182 0.855435
## education12th 5.761e-01 2.450e-01 2.352 0.018682
## education1st-4th -5.827e-01 4.679e-01 -1.245 0.212957
## education5th-6th -1.591e-01 2.980e-01 -0.534 0.593342
## education7th-8th -4.290e-01 2.175e-01 -1.972 0.048585
## education9th -3.418e-01 2.513e-01 -1.360 0.173840
## educationAssoc-acdm 1.446e+00 1.695e-01 8.528 < 2e-16
## educationAssoc-voc 1.292e+00 1.637e-01 7.892 2.96e-15
## educationBachelors 1.914e+00 1.518e-01 12.611 < 2e-16
## educationDoctorate 2.769e+00 2.043e-01 13.551 < 2e-16
## educationHS-grad 7.348e-01 1.478e-01 4.971 6.65e-07
## educationMasters 2.316e+00 1.616e-01 14.329 < 2e-16
## educationPreschool -4.879e+00 3.760e+00 -1.297 0.194478
## educationProf-school 2.890e+00 1.958e-01 14.758 < 2e-16
## educationSome-college 1.137e+00 1.500e-01 7.579 3.49e-14
## educational.num NA NA NA NA
## marital.statusMarried-AF-spouse 2.591e+00 5.700e-01 4.546 5.47e-06
## marital.statusMarried-civ-spouse 2.391e+00 2.531e-01 9.445 < 2e-16
## marital.statusMarried-spouse-absent 1.377e-01 2.163e-01 0.637 0.524383
## marital.statusNever-married -3.951e-01 8.482e-02 -4.658 3.20e-06
## marital.statusSeparated -8.021e-02 1.592e-01 -0.504 0.614286
## marital.statusWidowed 6.493e-03 1.524e-01 0.043 0.966010
## occupationAdm-clerical -4.405e-02 9.657e-02 -0.456 0.648257
## occupationArmed-Forces 1.107e+00 9.022e-01 1.227 0.219922
## occupationCraft-repair 1.329e-01 8.216e-02 1.618 0.105617
## occupationExec-managerial 7.984e-01 8.451e-02 9.447 < 2e-16
## occupationFarming-fishing -8.996e-01 1.377e-01 -6.535 6.34e-11
## occupationHandlers-cleaners -6.206e-01 1.385e-01 -4.479 7.48e-06
## occupationMachine-op-inspct -2.815e-01 1.033e-01 -2.725 0.006431
## occupationOther-service -7.774e-01 1.207e-01 -6.443 1.17e-10
## occupationPriv-house-serv -1.908e+00 9.947e-01 -1.918 0.055117
## occupationProf-specialty 4.992e-01 9.087e-02 5.494 3.94e-08
## occupationProtective-serv 5.780e-01 1.261e-01 4.584 4.56e-06
## occupationSales 2.913e-01 8.733e-02 3.336 0.000850
## occupationTech-support 5.777e-01 1.151e-01 5.017 5.24e-07
## occupationTransport-moving NA NA NA NA
## relationshipNot-in-family 5.833e-01 2.505e-01 2.329 0.019870
## relationshipOther-relative -5.633e-01 2.409e-01 -2.339 0.019345
## relationshipOwn-child -4.193e-01 2.399e-01 -1.748 0.080518
## relationshipUnmarried 5.172e-01 2.665e-01 1.940 0.052340
## relationshipWife 1.128e+00 9.827e-02 11.477 < 2e-16
## raceAsian-Pac-Islander 7.735e-01 2.643e-01 2.926 0.003430
## raceBlack 3.662e-01 2.296e-01 1.595 0.110666
## raceOther 4.311e-01 3.283e-01 1.313 0.189197
## raceWhite 6.458e-01 2.184e-01 2.957 0.003108
## genderMale 6.794e-01 7.535e-02 9.016 < 2e-16
## capital.gain 3.223e-04 1.007e-05 31.994 < 2e-16
## capital.loss 6.737e-04 3.621e-05 18.605 < 2e-16
## hours.per.week 2.723e-02 1.538e-03 17.704 < 2e-16
## native.countryCambodia 1.096e+00 5.938e-01 1.846 0.064841
## native.countryCanada 7.906e-01 2.827e-01 2.796 0.005172
## native.countryChina -5.967e-01 3.644e-01 -1.637 0.101527
## native.countryColumbia -3.147e+00 1.126e+00 -2.796 0.005176
## native.countryCuba 2.374e-01 3.403e-01 0.698 0.485443
## native.countryDominican-Republic -9.929e-01 6.370e-01 -1.559 0.119084
## native.countryEcuador -5.817e-01 7.310e-01 -0.796 0.426163
## native.countryEl-Salvador -2.959e-01 4.801e-01 -0.616 0.537647
## native.countryEngland 1.954e-01 3.504e-01 0.558 0.577142
## native.countryFrance 5.241e-01 5.334e-01 0.983 0.325836
## native.countryGermany 1.325e-01 2.849e-01 0.465 0.641758
## native.countryGreece -2.482e-01 4.531e-01 -0.548 0.583861
## native.countryGuatemala -1.426e-01 7.580e-01 -0.188 0.850720
## native.countryHaiti 6.250e-01 4.967e-01 1.258 0.208248
## native.countryHoland-Netherlands -9.326e+00 5.354e+02 -0.017 0.986103
## native.countryHonduras 3.492e-01 1.162e+00 0.301 0.763791
## native.countryHong -3.701e-01 6.667e-01 -0.555 0.578810
## native.countryHungary 2.993e-01 7.007e-01 0.427 0.669325
## native.countryIndia -3.584e-02 3.180e-01 -0.113 0.910263
## native.countryIran 4.660e-02 4.570e-01 0.102 0.918769
## native.countryIreland 1.494e+00 5.315e-01 2.812 0.004929
## native.countryItaly 6.612e-01 3.441e-01 1.921 0.054702
## native.countryJamaica 1.245e-01 4.666e-01 0.267 0.789690
## native.countryJapan -7.420e-02 3.984e-01 -0.186 0.852260
## native.countryLaos -3.679e-01 8.924e-01 -0.412 0.680177
## native.countryMexico -6.930e-01 2.573e-01 -2.693 0.007079
## native.countryNicaragua -2.056e-01 6.846e-01 -0.300 0.763993
## native.countryOutlying-US(Guam-USVI-etc) -3.661e-01 1.115e+00 -0.328 0.742634
## native.countryPeru -6.351e-01 6.384e-01 -0.995 0.319827
## native.countryPhilippines 5.155e-01 2.681e-01 1.923 0.054522
## native.countryPoland -3.490e-01 4.376e-01 -0.798 0.425096
## native.countryPortugal 6.761e-01 4.677e-01 1.446 0.148276
## native.countryPuerto-Rico 1.380e-01 3.550e-01 0.389 0.697493
## native.countryScotland -9.325e-01 8.455e-01 -1.103 0.270048
## native.countrySouth -7.857e-01 4.306e-01 -1.825 0.068039
## native.countryTaiwan 1.048e-01 4.670e-01 0.224 0.822393
## native.countryThailand -9.225e-01 7.465e-01 -1.236 0.216563
## native.countryTrinadad&Tobago -1.202e+01 1.003e+02 -0.120 0.904637
## native.countryUnited-States 2.713e-01 1.363e-01 1.991 0.046489
## native.countryVietnam -9.335e-01 6.296e-01 -1.483 0.138161
## native.countryYugoslavia 4.969e-01 7.251e-01 0.685 0.493183
##
## (Intercept) ***
## age ***
## workclassFederal-gov ***
## workclassLocal-gov ***
## workclassNever-worked
## workclassPrivate ***
## workclassSelf-emp-inc ***
## workclassSelf-emp-not-inc .
## workclassState-gov **
## workclassWithout-pay
## fnlwgt ***
## education11th
## education12th *
## education1st-4th
## education5th-6th
## education7th-8th *
## education9th
## educationAssoc-acdm ***
## educationAssoc-voc ***
## educationBachelors ***
## educationDoctorate ***
## educationHS-grad ***
## educationMasters ***
## educationPreschool
## educationProf-school ***
## educationSome-college ***
## educational.num
## marital.statusMarried-AF-spouse ***
## marital.statusMarried-civ-spouse ***
## marital.statusMarried-spouse-absent
## marital.statusNever-married ***
## marital.statusSeparated
## marital.statusWidowed
## occupationAdm-clerical
## occupationArmed-Forces
## occupationCraft-repair
## occupationExec-managerial ***
## occupationFarming-fishing ***
## occupationHandlers-cleaners ***
## occupationMachine-op-inspct **
## occupationOther-service ***
## occupationPriv-house-serv .
## occupationProf-specialty ***
## occupationProtective-serv ***
## occupationSales ***
## occupationTech-support ***
## occupationTransport-moving
## relationshipNot-in-family *
## relationshipOther-relative *
## relationshipOwn-child .
## relationshipUnmarried .
## relationshipWife ***
## raceAsian-Pac-Islander **
## raceBlack
## raceOther
## raceWhite **
## genderMale ***
## capital.gain ***
## capital.loss ***
## hours.per.week ***
## native.countryCambodia .
## native.countryCanada **
## native.countryChina
## native.countryColumbia **
## native.countryCuba
## native.countryDominican-Republic
## native.countryEcuador
## native.countryEl-Salvador
## native.countryEngland
## native.countryFrance
## native.countryGermany
## native.countryGreece
## native.countryGuatemala
## native.countryHaiti
## native.countryHoland-Netherlands
## native.countryHonduras
## native.countryHong
## native.countryHungary
## native.countryIndia
## native.countryIran
## native.countryIreland **
## native.countryItaly .
## native.countryJamaica
## native.countryJapan
## native.countryLaos
## native.countryMexico **
## native.countryNicaragua
## native.countryOutlying-US(Guam-USVI-etc)
## native.countryPeru
## native.countryPhilippines .
## native.countryPoland
## native.countryPortugal
## native.countryPuerto-Rico
## native.countryScotland
## native.countrySouth .
## native.countryTaiwan
## native.countryThailand
## native.countryTrinadad&Tobago
## native.countryUnited-States *
## native.countryVietnam
## native.countryYugoslavia
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38496 on 35164 degrees of freedom
## Residual deviance: 22106 on 35066 degrees of freedom
## AIC: 22304
##
## Number of Fisher Scoring iterations: 12
#The Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is inflated due to multicollinearity in the model
vip(logit_model)
#COMMENT
The values for the income levels are recoded with 1 and 0 with >50 k being coded as 1. The sum of the values in this column will hence give the number of people with income >50 k and thus can be an effective prediction tool.
GLM is the generalized linear model we will be using. Income ~ . means that we want to model income using (~) every available feature (.). Family = binomial() is used to predict a binary outcome below 50k or above 50k.
In order to test the significance of each covariate in the model we will perform likelihood ratio tests with the R function “anova()”. When we run “anova(glm.model.wld, test=”LRT“)”, the function sequentially compares nested models with increasing complexity against the full model by adding one predictor at a time
anova(logit_model, test = "LRT")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: as.factor(income_.50K)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 35164 38496
## age 1 1846.5 35163 36649 < 2.2e-16 ***
## workclass 8 930.4 35155 35719 < 2.2e-16 ***
## fnlwgt 1 2.1 35154 35717 0.1508
## education 15 3994.7 35139 31722 < 2.2e-16 ***
## educational.num 0 0.0 35139 31722
## marital.status 6 5768.2 35133 25954 < 2.2e-16 ***
## occupation 13 805.0 35120 25149 < 2.2e-16 ***
## relationship 5 180.9 35115 24968 < 2.2e-16 ***
## race 4 26.0 35111 24942 3.182e-05 ***
## gender 1 129.2 35110 24813 < 2.2e-16 ***
## capital.gain 1 1908.0 35109 22905 < 2.2e-16 ***
## capital.loss 1 371.1 35108 22534 < 2.2e-16 ***
## hours.per.week 1 318.6 35107 22215 < 2.2e-16 ***
## native.country 41 108.8 35066 22106 4.628e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#As we see from the output above, the p-values indicate that all predictors in the model are significant and should be retained.
#Performance on the training data
logprobt<-predict(logit_model, newdata=train.inc, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
contrasts(factor(train.inc[,"income_.50K"]))
## 1
## 0 0
## 1 1
logpredt=rep(0, nrow(train.inc))
logpredt[logprobt>.5]=1
#confusion matrix
table(actual=train.inc[,"income_.50K"], predicted=logpredt)
## predicted
## actual 0 1
## 0 25031 1807
## 1 3327 5000
#Accuracy train
accuracylogt =1 - mean(train.inc$income_.50K!= logpredt)
accuracylogt
## [1] 0.8540026
#Logistic Train Error
logerrort=mean(train.inc[,"income_.50K"]!=logpredt)
logerrort
## [1] 0.1459974
#confusion matrix Logistic Regression on test data
logprob<-predict(logit_model, newdata=test.inc, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
contrasts(factor(test.inc[,"income_.50K"]))
## 1
## 0 0
## 1 1
logpred=rep(0, nrow(test.inc))
logpred[logprob>=.5]=1
#confusion matrix
table(actual=test.inc[,"income_.50K"], predicted=logpred)
## predicted
## actual 0 1
## 0 6159 442
## 1 869 1322
#we generate the vector of predicted probabilities – “logprob”, and then we use it to create the binary vector “logpred”:
#Logistic Test Error
logerror=mean(test.inc[,"income_.50K"]!=logpred)
logerror
## [1] 0.1491128
#ACCURACY Logistic Regression
#Accuracy
accuracylog =1 - mean(test.inc$income_.50K!= logpred)
accuracylog
## [1] 0.8508872
#Logistic Regression is one of the easiest and most commonly used supervised Machine learning algorithms for categorical classification, and It can interpret model coefficients as indicators of feature importance.
We are interested in predicting the values of the variable “income”. Therefore “income” is our response variable or dependent variable. It assumes only two values - less than 50K a year and more than 50K a year. The problem we consider is a classification.
Since the response variable is binary, we start with a logistic regression model, a type of generalized linear model.
The R function that we use to build the logistic regression model is “glm”, Since we want to fit a logistic regression model, in “glm” we set “family” to “binomial.”
#The logistic output shows the performance of the logistic model on test data with An accuracy of 85.08% and a test error of 14.91%
#The logistic output shows the performance of the logistic model on train data with An accuracy of 85.40% and a error of 14.59%
Data Modelling K-Nearest Neighbor
#k-nearest-neighbor (KNN)
set.seed(123)
train.ink=train.inc
test.ink=test.inc
train.ink = as.data.frame(sapply(train.ink, as.numeric))
test.ink = as.data.frame(sapply(test.ink, as.numeric))
start_time <- proc.time() # computation time
fit.knn<-train(factor(income_.50K)~.,data=train.ink,method='knn',trControl = trainControl(method = "cv"), tuneLength=20)
fit.knn
## k-Nearest Neighbors
##
## 35165 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 31648, 31648, 31649, 31648, 31649, 31649, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7774776 0.2852572
## 7 0.7856675 0.2776841
## 9 0.7916110 0.2763353
## 11 0.7953364 0.2746837
## 13 0.7968720 0.2692428
## 15 0.7984931 0.2653739
## 17 0.7997441 0.2629760
## 19 0.8003697 0.2598157
## 21 0.8009669 0.2588687
## 23 0.8016779 0.2586507
## 25 0.8013367 0.2549548
## 27 0.8018487 0.2554549
## 29 0.8015358 0.2511046
## 31 0.8012230 0.2473828
## 33 0.8008818 0.2445431
## 35 0.8001993 0.2400672
## 37 0.7998296 0.2366952
## 39 0.7995168 0.2334547
## 41 0.7997443 0.2328313
## 43 0.7993746 0.2298781
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 27.
time.knn.rad.over <- proc.time() - start_time
time.knn.rad.over
## user system elapsed
## 282.990 3.327 289.870
plot(fit.knn, print.thres = 0.5, type="S")
predict.knn<-predict(fit.knn,test.ink)
confusionMatrix(predict.knn, (factor(test.ink[,"income_.50K"])))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6513 1718
## 1 88 473
##
## Accuracy : 0.7946
## 95% CI : (0.786, 0.803)
## No Information Rate : 0.7508
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2695
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9867
## Specificity : 0.2159
## Pos Pred Value : 0.7913
## Neg Pred Value : 0.8431
## Prevalence : 0.7508
## Detection Rate : 0.7408
## Detection Prevalence : 0.9362
## Balanced Accuracy : 0.6013
##
## 'Positive' Class : 0
##
error_knn=mean(predict.knn != test.ink$income_.50K)
error_knn
## [1] 0.205414
accuracy_knn= 1-mean(predict.knn != test.ink$income_.50K)
accuracy_knn
## [1] 0.794586
predict.knnt<-predict(fit.knn,train.ink)
#Get the confusion matrix to see accuracy value and other parameter values
confusionMatrix(predict.knnt, (factor(train.ink[,"income_.50K"])))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 26548 6560
## 1 290 1767
##
## Accuracy : 0.8052
## 95% CI : (0.801, 0.8093)
## No Information Rate : 0.7632
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.272
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9892
## Specificity : 0.2122
## Pos Pred Value : 0.8019
## Neg Pred Value : 0.8590
## Prevalence : 0.7632
## Detection Rate : 0.7550
## Detection Prevalence : 0.9415
## Balanced Accuracy : 0.6007
##
## 'Positive' Class : 0
##
error_knnt=mean(predict.knnt != train.ink$income_.50K)
error_knnt
## [1] 0.194796
accuracy_knnt= 1-mean(predict.knnt != train.ink$income_.50K)
accuracy_knnt
## [1] 0.805204
#COMMENT
KNN provides flexibility with classification data and is a distance metric, binary variables will result in values that are on opposite sides of that values spectrum, and thus a point’s nearest neighbors will be more influenced by those on the same side of the spectrum
For some values of the parameters we observe that the optimization algorithm cannot converge, we need to convert some varables into numberic train.ink = as.data.frame(sapply(train.ink, as.numeric)) test.ink = as.data.frame(sapply(test.ink, as.numeric))
We Perform KNN on the training data, using cross-validation of 10 fold in order to give optimum value for k=27 and to avoid overfitting
The prediction accuracy on the test dataset is 79.44% with an error rate of 20.55% and is slightly lower than that of the training, which was 80.52% with an error rate of 19.47%
Data Modelling Classification Tree
train.int=Incomedata[train_ind, ]
test.int=Incomedata[-train_ind, ]
set.seed(123)
start_time <- proc.time()
tree <- rpart(income_.50K ~ .,
data = train.int,
method = "class", parms=list(split="gini"))
time.tree.rad.over <- proc.time() - start_time
time.tree.rad.over
## user system elapsed
## 1.143 0.024 1.240
printcp(tree)
##
## Classification tree:
## rpart(formula = income_.50K ~ ., data = train.int, method = "class",
## parms = list(split = "gini"))
##
## Variables actually used in tree construction:
## [1] capital.gain education relationship
##
## Root node error: 8327/35165 = 0.2368
##
## n= 35165
##
## CP nsplit rel error xerror xstd
## 1 0.125976 0 1.00000 1.00000 0.0095736
## 2 0.060526 2 0.74805 0.74805 0.0085977
## 3 0.036988 3 0.68752 0.68752 0.0083141
## 4 0.010000 4 0.65053 0.65053 0.0081295
plotcp(tree)
rpart.plot(tree, main = "Basic Classification Tree for Income Data")
Treepred=predict(tree, newdata=test.int, type="class")
cm_t=confusionMatrix(
factor(Treepred, levels = 0:1),
factor(test.int$income_.50K, levels = 0:1)
)
cm_t
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6269 1061
## 1 332 1130
##
## Accuracy : 0.8416
## 95% CI : (0.8338, 0.8491)
## No Information Rate : 0.7508
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5237
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9497
## Specificity : 0.5157
## Pos Pred Value : 0.8553
## Neg Pred Value : 0.7729
## Prevalence : 0.7508
## Detection Rate : 0.7130
## Detection Prevalence : 0.8337
## Balanced Accuracy : 0.7327
##
## 'Positive' Class : 0
##
summary(Treepred)
## 0 1
## 7330 1462
#classification error
ce(factor(Treepred, levels = 0:1), factor(test.int$income_.50K, levels = 0:1))
## [1] 0.1584395
#Test Error
tr_error = mean(test.int$income_.50K != Treepred)
#Accuracy
tr_ac=1-tr_error
tr_error
## [1] 0.1584395
tr_ac
## [1] 0.8415605
summary(Treepred)
## 0 1
## 7330 1462
#Training
Treepredt=predict(tree, newdata=train.int, type="class")
cm_tt=confusionMatrix(
factor(Treepredt, levels = 0:1),
factor(train.int$income_.50K, levels = 0:1)
)
cm_tt
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25490 4069
## 1 1348 4258
##
## Accuracy : 0.846
## 95% CI : (0.8421, 0.8497)
## No Information Rate : 0.7632
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5197
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9498
## Specificity : 0.5113
## Pos Pred Value : 0.8623
## Neg Pred Value : 0.7595
## Prevalence : 0.7632
## Detection Rate : 0.7249
## Detection Prevalence : 0.8406
## Balanced Accuracy : 0.7306
##
## 'Positive' Class : 0
##
#classification error
ce(factor(Treepredt, levels = 0:1), factor(train.int$income_.50K, levels = 0:1))
## [1] 0.1540452
#Test Error
tr_errort = mean(train.int$income_.50K != Treepredt)
#Accuracy
tr_act=1-tr_errort
tr_act
## [1] 0.8459548
summary(Treepredt)
## 0 1
## 29559 5606
#Important varables by level 1 to 14 show how important each varables are in term of 1 to 14
importance_varables <- varImp(tree)
rownames(importance_varables)[order(importance_varables$Overall, decreasing=TRUE)]
## [1] "capital.gain" "educational.num" "education" "relationship"
## [5] "marital.status" "occupation" "capital.loss" "hours.per.week"
## [9] "age" "workclass" "fnlwgt" "race"
## [13] "gender" "native.country"
#COMMENT
The Classification Tree Algorithm is represented in a visual format that enables us to quickly understand the data’s information and is the most powerful tool for classification to make a prediction.
Ability to use both numeric and categorical variables as predictors: the data is a mix of a categorical and numeric variable
Algorithm The package used to construct the tree here is Rpart which uses recursive partitioning of the data with the default criterion for each split being Gini Coefficient. We used Gini split because it operates well on the categorical Response variable in terms of “0” or “1”
The decision tree in the above figure classifies individual incomes according to relationship and returning the classification associated with the particular leaf Yes or No
We create the tree using all the variables and rank the variables in terms of importance to figure out the variables used by the decision tree algorithm to predict the income on both train and test.
We test the accuracy on the training and test datasets, and we see that it is 84.59% and 84.15%, respectively. The error is 15.84% on the test data and 15.40% on train data.
The prediction on the test data show 7330 people earn below 50,000$, and 1462 people earn above.
Support Vector Machine
training_set=train.ink
test_set=test.ink
income_df=subset(training_set, select = c(income_.50K, occupation, capital.loss))
plot_margins <- ggplot(data = income_df, aes(x = occupation, y = capital.loss, color = factor(income_.50K))) + geom_point() +
scale_color_manual(values = c("red", "blue"))
#display plot
plot_margins
COMMENT # The plot above shown type of kernels to use, here is not linearly seprable so we need to use radial
training_set=train.ink
test_set=test.ink
start_time <- proc.time()
svm_model <- svm(factor(income_.50K)~.,
data = training_set)
time.svm.rad.over <- proc.time() - start_time
time.svm.rad.over
## user system elapsed
## 61.075 0.928 63.285
predsvm <- predict(svm_model, test_set)
table(predsvm,test_set$income_.50K)
##
## predsvm 0 1
## 0 6213 967
## 1 388 1224
accuracy_svm=accuracy(test_set$income_.50K, predsvm)
accuracy_svm
## [1] 0.8458826
testerror_smv=1-accuracy_svm
testerror_smv
## [1] 0.1541174
summary(predsvm)
## 0 1
## 7180 1612
#plot(svm_model, data=training_set)
predsvmt <- predict(svm_model, training_set)
table(predsvmt,training_set$income_.50K)
##
## predsvmt 0 1
## 0 25476 3623
## 1 1362 4704
accuracy_svmt=accuracy(training_set$income_.50K, predsvmt)
accuracy_svmt
## [1] 0.8582397
testerror_smvt=1-accuracy_svmt
testerror_smvt
## [1] 0.1417603
summary(predsvmt)
## 0 1
## 29099 6066
library(foreign)
plot(svm_model, training_set,occupation~capital.loss)
COMMUNICATION #Support Vector Machines
We fitted a model with the Support Vector Machines (SVM) algorithm, implemented in the R package “e1071”. This is a popular method with reasonable default performance and parameters that allow for further tuning if necessary. it tries to find the perfect margin between the line and the support vectors that separates the classes and thus reduces the risk of error on the data though it works well on unstructured and semi-structured, our data is quite semi-structured this gives reason for using this algorithm
It also allows for fitting both linearly and non-linearly separable data using linear or non-linear kernels, respectively. By default, the “SVM” function in the “e1071” package scales the data and also perform ’radial kernel. We predict two types of models – one on the train set and one on the test set.
For some values of the parameters, we observe that the optimization algorithm cannot converge. We need to convert some variables into numeric as we did for ‘KNN.’
We also fit an SVM model with the default radial kernel, which is used for data that is not linearly separable. We build a model on the training data, and We can’t perform cross-validation and grid search to find the optimal model hyperparameters because of the computational time and type of computer used on this project.
We found out that the optimal parameter (cost=1) gives good performance results and is the best runtime and convergence.
The prediction accuracy on the test dataset is 84.58% and is slightly lower than that for the training, which was 85.82%, with an error rate of 15% on test and 14.17% on train.
X<-c("Logistic","KNN", "TREE", "SVM")
Y<-round(c(logerror,error_knn,tr_error,testerror_smv),2)
X_name <- "MODEL"
Y_name <- "TEST.ERROR.RATE"
df <- data.frame(X,Y)
names(df) <- c(X_name,Y_name)
ggplot(df,aes(x=MODEL,y=TEST.ERROR.RATE,fill=MODEL))+geom_bar(stat = "identity") + geom_text(aes(label=TEST.ERROR.RATE),position=position_dodge(width=0.9), vjust=-0.25)
XX<-c("Logistic","KNN", "TREE", "SVM")
YY<-round(c(accuracylog,accuracy_knn,tr_ac,accuracy_svmt),2)
XX_name <- "MODEL"
YY_name <- "ACCURACY"
df <- data.frame(XX,YY)
names(df) <- c(XX_name,YY_name)
ggplot(df,aes(x=MODEL,y=ACCURACY,fill=MODEL))+geom_bar(stat = "identity") + geom_text(aes(label=ACCURACY),position=position_dodge(width=0.9), vjust=-0.25)
#Goals and objectives of the project Build prediction models to classify if the income level of an individual is above $50,000 a year using R, Proportion Analysis using bar charts, and box plots to understand the important variables and their influence on prediction. The dependent variable in our analysis will be ‘income level.’
#Dataset Description The dataset was downloaded from (www.kaggle.com) The dataset has 43957 individual cases in total and a total of 15 variables. Following is the list of variables in the dataset.
We clean the data because if it contains some missing value and NAs, We change the data type to factor for easy prediction, and some Algorithms does work with characters, numerical and we used all the variables in the data set because they are all significant and that shown using a function called “anova()”
Building four predictive models using the following classification algorithms: Logistic regression
KNN
Decision Tree
Support vector machines
#Summary
We built several classification models to predict whether an individual earns more than 50K a year. We used logistic regression,K-Nearest Neighbor, decision tree, and support vector machine Algorithms. The table above summarises the accuracy, error rate, and computational time on the test dataset for all fitted models. We show results for models on the train also.
Let Y be the random variable “income”. Let also Y=1 or Y=0 respectively ‘income>50K’ and Y =1, Y=0 if income<=50K.(where n=43957 is the number of observations in the data set) follows the binomial distribution. More precisely, Y follows the Bernoulli
In conclusion, the logistic regression model with 85% accuracy after accounting for all performance indicators and the Support Vector Machine with 86% accuracy. The model appears to be the most suitable choice for the considered classification problem. Based on the comparisons, the method that produces the most accuracy is the Support Vector Machine.
The prediction on the test shows 7,180 people earn below 50,000$ and 1,612 people earn above.
Effectively use professional-level technology tools to support our findings in terms of the algorithms in approaching problems and demonstrate the capability to deploy established approaches accurately to analyse and solve problems using a reasonable level of skill in statistical learning mixture modelling and clustering, discriminant analysis and graphical models a quick example in this problem solving we see Support Vector Machine is our best model.
Support vector machines (SVM) are binary classifiers, and it perform non-linear classification, therefore, making them more flexible. E071 is a library that provides an R INTERFACE We first train two types of models train set and test set. We performed Kernel functions. Whose elements are to be classified into a new feature space. In the new space, previously non-linear observations become linearly separable.