R Markdown

Including Plots

The description of each variable is as follows:

1)age: continuous.

2)workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 3)fnlwgt: continuous.Final Weight Determined by Census Org

4)education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

5)education-num: continuous.Number of years of education

6)marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

7)occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

8)relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

9)race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

10)sex: Female, Male.

11)capital-gain: continuous.

12)capital-loss: continuous.

13)hours-per-week: continuous.

14)native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

15)income : Annual income (<=50k, >50k)

#Exploratory Data Analysis

Income=read.table("/Users/softman/Downloads/Incomedataset.csv" , header=TRUE, sep=",")

head(Income)

##   age workclass fnlwgt    education educational.num     marital.status
## 1  67   Private 366425    Doctorate              16           Divorced
## 2  17   Private 244602         12th               8      Never-married
## 3  31   Private 174201    Bachelors              13 Married-civ-spouse
## 4  58 State-gov 110199      7th-8th               4 Married-civ-spouse
## 5  25 State-gov 149248 Some-college              10      Never-married
## 6  59 State-gov 105363      HS-grad               9      Never-married
##         occupation  relationship  race gender capital.gain capital.loss
## 1  Exec-managerial Not-in-family White   Male        99999            0
## 2    Other-service     Own-child White   Male            0            0
## 3  Exec-managerial       Husband White   Male            0            0
## 4 Transport-moving       Husband White   Male            0            0
## 5    Other-service Not-in-family Black   Male            0            0
## 6     Adm-clerical     Own-child White   Male            0            0
##   hours.per.week native.country income_.50K
## 1             60  United-States           1
## 2             15  United-States           0
## 3             40  United-States           1
## 4             40  United-States           0
## 5             40  United-States           0
## 6             40  United-States           0

sapply(Income, class)

##             age       workclass          fnlwgt       education educational.num 
##       "integer"     "character"       "integer"     "character"       "integer" 
##  marital.status      occupation    relationship            race          gender 
##     "character"     "character"     "character"     "character"     "character" 
##    capital.gain    capital.loss  hours.per.week  native.country     income_.50K 
##       "integer"       "integer"       "integer"     "character"       "integer"

#ncol(Income)

COMMENT

The dataset has 43957 individual cases in total and a total of 15 variables or columns. Following below are the list of variables in the dataset.

The following list shows the data type of each variable. As expected, age, hours.per.week, capital.gain, capital.loss and education.num are integers, all other variables are factors.

colnames(Income)

##  [1] "age"             "workclass"       "fnlwgt"          "education"      
##  [5] "educational.num" "marital.status"  "occupation"      "relationship"   
##  [9] "race"            "gender"          "capital.gain"    "capital.loss"   
## [13] "hours.per.week"  "native.country"  "income_.50K"

COMMENT No NA’s in the data

summary(Income)

##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:43957       Min.   :  13492   Length:43957      
##  1st Qu.:28.00   Class :character   1st Qu.: 117496   Class :character  
##  Median :37.00   Mode  :character   Median : 178100   Mode  :character  
##  Mean   :38.62                      Mean   : 189673                     
##  3rd Qu.:48.00                      3rd Qu.: 237671                     
##  Max.   :90.00                      Max.   :1490400                     
##  educational.num marital.status      occupation        relationship      
##  Min.   : 1.00   Length:43957       Length:43957       Length:43957      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.07                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race              gender           capital.gain    capital.loss    
##  Length:43957       Length:43957       Min.   :    0   Min.   :   0.00  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.00  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.00  
##                                        Mean   : 1094   Mean   :  88.25  
##                                        3rd Qu.:    0   3rd Qu.:   0.00  
##                                        Max.   :99999   Max.   :4356.00  
##  hours.per.week  native.country      income_.50K    
##  Min.   : 1.00   Length:43957       Min.   :0.0000  
##  1st Qu.:40.00   Class :character   1st Qu.:0.0000  
##  Median :40.00   Mode  :character   Median :0.0000  
##  Mean   :40.41                      Mean   :0.2393  
##  3rd Qu.:45.00                      3rd Qu.:0.0000  
##  Max.   :99.00                      Max.   :1.0000

which(is.na(Income))

## integer(0)

sum(is.na(Income))

## [1] 0

table(complete.cases(Income))

## 
##  TRUE 
## 43957

Removing the missing data

Incomedata <- na.omit(Income)

head(Incomedata)

##   age workclass fnlwgt    education educational.num     marital.status
## 1  67   Private 366425    Doctorate              16           Divorced
## 2  17   Private 244602         12th               8      Never-married
## 3  31   Private 174201    Bachelors              13 Married-civ-spouse
## 4  58 State-gov 110199      7th-8th               4 Married-civ-spouse
## 5  25 State-gov 149248 Some-college              10      Never-married
## 6  59 State-gov 105363      HS-grad               9      Never-married
##         occupation  relationship  race gender capital.gain capital.loss
## 1  Exec-managerial Not-in-family White   Male        99999            0
## 2    Other-service     Own-child White   Male            0            0
## 3  Exec-managerial       Husband White   Male            0            0
## 4 Transport-moving       Husband White   Male            0            0
## 5    Other-service Not-in-family Black   Male            0            0
## 6     Adm-clerical     Own-child White   Male            0            0
##   hours.per.week native.country income_.50K
## 1             60  United-States           1
## 2             15  United-States           0
## 3             40  United-States           1
## 4             40  United-States           0
## 5             40  United-States           0
## 6             40  United-States           0

str(Incomedata)

## 'data.frame':    43957 obs. of  15 variables:
##  $ age            : int  67 17 31 58 25 59 70 35 28 28 ...
##  $ workclass      : chr  "Private" "Private" "Private" "State-gov" ...
##  $ fnlwgt         : int  366425 244602 174201 110199 149248 105363 216390 361888 74784 118089 ...
##  $ education      : chr  "Doctorate" "12th" "Bachelors" "7th-8th" ...
##  $ educational.num: int  16 8 13 4 10 9 5 13 9 9 ...
##  $ marital.status : chr  "Divorced" "Never-married" "Married-civ-spouse" "Married-civ-spouse" ...
##  $ occupation     : chr  "Exec-managerial" "Other-service" "Exec-managerial" "Transport-moving" ...
##  $ relationship   : chr  "Not-in-family" "Own-child" "Husband" "Husband" ...
##  $ race           : chr  "White" "White" "White" "White" ...
##  $ gender         : chr  "Male" "Male" "Male" "Male" ...
##  $ capital.gain   : int  99999 0 0 0 0 0 2653 0 0 4386 ...
##  $ capital.loss   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week : int  60 15 40 40 40 40 40 60 50 45 ...
##  $ native.country : chr  "United-States" "United-States" "United-States" "United-States" ...
##  $ income_.50K    : int  1 0 1 0 0 0 0 0 0 1 ...

COMMENT Here is the list of the numerical variables in the dataset

Numeric_variables= Incomedata[,-c(2,4,6,7,8,9,10,14)]
summary(Numeric_variables)

##       age            fnlwgt        educational.num  capital.gain  
##  Min.   :17.00   Min.   :  13492   Min.   : 1.00   Min.   :    0  
##  1st Qu.:28.00   1st Qu.: 117496   1st Qu.: 9.00   1st Qu.:    0  
##  Median :37.00   Median : 178100   Median :10.00   Median :    0  
##  Mean   :38.62   Mean   : 189673   Mean   :10.07   Mean   : 1094  
##  3rd Qu.:48.00   3rd Qu.: 237671   3rd Qu.:12.00   3rd Qu.:    0  
##  Max.   :90.00   Max.   :1490400   Max.   :16.00   Max.   :99999  
##   capital.loss     hours.per.week   income_.50K    
##  Min.   :   0.00   Min.   : 1.00   Min.   :0.0000  
##  1st Qu.:   0.00   1st Qu.:40.00   1st Qu.:0.0000  
##  Median :   0.00   Median :40.00   Median :0.0000  
##  Mean   :  88.25   Mean   :40.41   Mean   :0.2393  
##  3rd Qu.:   0.00   3rd Qu.:45.00   3rd Qu.:0.0000  
##  Max.   :4356.00   Max.   :99.00   Max.   :1.0000

COMMENT Here are the list of the categorical variables in the dataset

categorical_variables=Incomedata[,-c(1,3,5,11,12,13)]
summary(categorical_variables)

##   workclass          education         marital.status      occupation       
##  Length:43957       Length:43957       Length:43957       Length:43957      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  relationship           race              gender          native.country    
##  Length:43957       Length:43957       Length:43957       Length:43957      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   income_.50K    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2393  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

COMMENT

The Boxplot below shows people with annual income greater than $50k are older.

library(ggplot2)
ggplot(Numeric_variables, aes(x=income_.50K, y=age, fill = income_.50K==1)) + 
  geom_boxplot()

COMMENT

#The Boxplot below shows education with annual income greater than $50k have Bachelor’s Degree, Master’s Degree, and PhD.

ggplot(Numeric_variables, aes(x=income_.50K, y=educational.num, fill = income_.50K==1)) + 
  geom_boxplot()

COMMENT #The boxplot below shows people who work 40hours per week have an annual income greater than $50k are older.

ggplot(Numeric_variables, aes(x=income_.50K, y=hours.per.week, fill = income_.50K==1)) + 
  geom_boxplot()

COMMENT

We found that people who have occupations like Professor, Executive Managerial, and Tech support usually have Higher Annual Income above $50k

ggplot(categorical_variables,aes(x=occupation,y=income_.50K==1,fill=income_.50K==1))+
  geom_bar(stat="identity")+theme(axis.text.x = element_text(angle =45, hjust = 1))+
  ggtitle('Proportion patterns of Occupations with income above 50k')+
  xlab("Occupation")+
  ylab("Number of People")

COMMENT #The bar chart below shows proportion patterns of Professionals with higher educational degrees who are more likely to have an income higher than $50k.

ggplot(categorical_variables,aes(x=education,y=income_.50K==1,fill=income_.50K==1))+
  geom_bar(stat="identity")+theme(axis.text.x = element_text(angle =45, hjust = 1))+
  ggtitle('Proportion patterns of Professionals with income above 50k')+
  xlab("education")+
  ylab("Number of People")

COMMENT #People employed by private companies have more people with income above 50k, and Self Employed people have a higher proportion of people with income higher than 50k.

ggplot(categorical_variables,aes(x=workclass,y=income_.50K==1,fill=income_.50K==1))+
  geom_bar(stat="identity")+theme(axis.text.x = element_text(angle =45, hjust = 1))+
  ggtitle('Proportion patterns of workclass with income above 50k')+
  xlab("workclass")+
  ylab("Number of People")

COMMENT

Converted my variables into factor level

Incomedata$workclass <- as.factor(Incomedata$workclass)
Incomedata$education <- as.factor(Incomedata$education)
Incomedata$occupation  <- as.factor(Incomedata$occupation)
Incomedata$relationship <- as.factor(Incomedata$relationship)
Incomedata$race <- as.factor(Incomedata$race)
Incomedata$native.country <- as.factor(Incomedata$native.country)
Incomedata$gender <- as.factor(Incomedata$gender)
Incomedata$marital.status <- as.factor(Incomedata$marital.status)


#Incomedata$income_.50K <- factor(Incomedata$income_.50K)

COMMENT #Separated data into 80% training and 20% testing data

set.seed(123)

sizee <- floor(0.80 * nrow(Incomedata))

train_ind=sample(seq_len(nrow(Incomedata)), size =sizee)

train.inc=Incomedata[train_ind, ]
test.inc=Incomedata[-train_ind, ]


head(Incomedata)

##   age workclass fnlwgt    education educational.num     marital.status
## 1  67   Private 366425    Doctorate              16           Divorced
## 2  17   Private 244602         12th               8      Never-married
## 3  31   Private 174201    Bachelors              13 Married-civ-spouse
## 4  58 State-gov 110199      7th-8th               4 Married-civ-spouse
## 5  25 State-gov 149248 Some-college              10      Never-married
## 6  59 State-gov 105363      HS-grad               9      Never-married
##         occupation  relationship  race gender capital.gain capital.loss
## 1  Exec-managerial Not-in-family White   Male        99999            0
## 2    Other-service     Own-child White   Male            0            0
## 3  Exec-managerial       Husband White   Male            0            0
## 4 Transport-moving       Husband White   Male            0            0
## 5    Other-service Not-in-family Black   Male            0            0
## 6     Adm-clerical     Own-child White   Male            0            0
##   hours.per.week native.country income_.50K
## 1             60  United-States           1
## 2             15  United-States           0
## 3             40  United-States           1
## 4             40  United-States           0
## 5             40  United-States           0
## 6             40  United-States           0

Data Modelling # Logistic Regression

str(Incomedata)

## 'data.frame':    43957 obs. of  15 variables:
##  $ age            : int  67 17 31 58 25 59 70 35 28 28 ...
##  $ workclass      : Factor w/ 9 levels "","Federal-gov",..: 5 5 5 8 8 8 5 7 5 5 ...
##  $ fnlwgt         : int  366425 244602 174201 110199 149248 105363 216390 361888 74784 118089 ...
##  $ education      : Factor w/ 16 levels "10th","11th",..: 11 3 10 6 16 12 7 10 12 12 ...
##  $ educational.num: int  16 8 13 4 10 9 5 13 9 9 ...
##  $ marital.status : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 1 5 3 3 5 5 3 3 5 3 ...
##  $ occupation     : Factor w/ 15 levels "","Adm-clerical",..: 5 9 5 15 9 2 8 13 7 5 ...
##  $ relationship   : Factor w/ 6 levels "Husband","Not-in-family",..: 2 4 1 1 2 4 6 1 2 1 ...
##  $ race           : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 3 5 5 5 5 5 ...
##  $ gender         : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 1 2 2 2 ...
##  $ capital.gain   : int  99999 0 0 0 0 0 2653 0 0 4386 ...
##  $ capital.loss   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week : int  60 15 40 40 40 40 40 60 50 45 ...
##  $ native.country : Factor w/ 42 levels "","Cambodia",..: 40 40 40 40 40 40 40 25 40 40 ...
##  $ income_.50K    : int  1 0 1 0 0 0 0 0 0 1 ...

#Logistic Regression

start_time <- proc.time()
logit_model <- glm(as.factor(income_.50K)~., 
                   data = train.inc, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

time.log.rad.over <- proc.time() - start_time

time.log.rad.over

##    user  system elapsed 
##   6.460   0.245   6.724

summary(logit_model)

## 
## Call:
## glm(formula = as.factor(income_.50K) ~ ., family = binomial, 
##     data = train.inc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1268  -0.5048  -0.1829  -0.0377   3.7826  
## 
## Coefficients: (2 not defined because of singularities)
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                              -8.958e+00  4.280e-01 -20.932  < 2e-16
## age                                       2.551e-02  1.595e-03  16.000  < 2e-16
## workclassFederal-gov                      1.257e+00  1.513e-01   8.312  < 2e-16
## workclassLocal-gov                        6.103e-01  1.387e-01   4.400 1.08e-05
## workclassNever-worked                    -9.157e+00  1.586e+02  -0.058 0.953963
## workclassPrivate                          7.668e-01  1.238e-01   6.193 5.92e-10
## workclassSelf-emp-inc                     9.444e-01  1.470e-01   6.424 1.32e-10
## workclassSelf-emp-not-inc                 2.401e-01  1.347e-01   1.783 0.074620
## workclassState-gov                        4.390e-01  1.480e-01   2.967 0.003003
## workclassWithout-pay                     -5.398e-01  1.088e+00  -0.496 0.619909
## fnlwgt                                    6.036e-07  1.663e-07   3.629 0.000285
## education11th                             3.691e-02  2.026e-01   0.182 0.855435
## education12th                             5.761e-01  2.450e-01   2.352 0.018682
## education1st-4th                         -5.827e-01  4.679e-01  -1.245 0.212957
## education5th-6th                         -1.591e-01  2.980e-01  -0.534 0.593342
## education7th-8th                         -4.290e-01  2.175e-01  -1.972 0.048585
## education9th                             -3.418e-01  2.513e-01  -1.360 0.173840
## educationAssoc-acdm                       1.446e+00  1.695e-01   8.528  < 2e-16
## educationAssoc-voc                        1.292e+00  1.637e-01   7.892 2.96e-15
## educationBachelors                        1.914e+00  1.518e-01  12.611  < 2e-16
## educationDoctorate                        2.769e+00  2.043e-01  13.551  < 2e-16
## educationHS-grad                          7.348e-01  1.478e-01   4.971 6.65e-07
## educationMasters                          2.316e+00  1.616e-01  14.329  < 2e-16
## educationPreschool                       -4.879e+00  3.760e+00  -1.297 0.194478
## educationProf-school                      2.890e+00  1.958e-01  14.758  < 2e-16
## educationSome-college                     1.137e+00  1.500e-01   7.579 3.49e-14
## educational.num                                  NA         NA      NA       NA
## marital.statusMarried-AF-spouse           2.591e+00  5.700e-01   4.546 5.47e-06
## marital.statusMarried-civ-spouse          2.391e+00  2.531e-01   9.445  < 2e-16
## marital.statusMarried-spouse-absent       1.377e-01  2.163e-01   0.637 0.524383
## marital.statusNever-married              -3.951e-01  8.482e-02  -4.658 3.20e-06
## marital.statusSeparated                  -8.021e-02  1.592e-01  -0.504 0.614286
## marital.statusWidowed                     6.493e-03  1.524e-01   0.043 0.966010
## occupationAdm-clerical                   -4.405e-02  9.657e-02  -0.456 0.648257
## occupationArmed-Forces                    1.107e+00  9.022e-01   1.227 0.219922
## occupationCraft-repair                    1.329e-01  8.216e-02   1.618 0.105617
## occupationExec-managerial                 7.984e-01  8.451e-02   9.447  < 2e-16
## occupationFarming-fishing                -8.996e-01  1.377e-01  -6.535 6.34e-11
## occupationHandlers-cleaners              -6.206e-01  1.385e-01  -4.479 7.48e-06
## occupationMachine-op-inspct              -2.815e-01  1.033e-01  -2.725 0.006431
## occupationOther-service                  -7.774e-01  1.207e-01  -6.443 1.17e-10
## occupationPriv-house-serv                -1.908e+00  9.947e-01  -1.918 0.055117
## occupationProf-specialty                  4.992e-01  9.087e-02   5.494 3.94e-08
## occupationProtective-serv                 5.780e-01  1.261e-01   4.584 4.56e-06
## occupationSales                           2.913e-01  8.733e-02   3.336 0.000850
## occupationTech-support                    5.777e-01  1.151e-01   5.017 5.24e-07
## occupationTransport-moving                       NA         NA      NA       NA
## relationshipNot-in-family                 5.833e-01  2.505e-01   2.329 0.019870
## relationshipOther-relative               -5.633e-01  2.409e-01  -2.339 0.019345
## relationshipOwn-child                    -4.193e-01  2.399e-01  -1.748 0.080518
## relationshipUnmarried                     5.172e-01  2.665e-01   1.940 0.052340
## relationshipWife                          1.128e+00  9.827e-02  11.477  < 2e-16
## raceAsian-Pac-Islander                    7.735e-01  2.643e-01   2.926 0.003430
## raceBlack                                 3.662e-01  2.296e-01   1.595 0.110666
## raceOther                                 4.311e-01  3.283e-01   1.313 0.189197
## raceWhite                                 6.458e-01  2.184e-01   2.957 0.003108
## genderMale                                6.794e-01  7.535e-02   9.016  < 2e-16
## capital.gain                              3.223e-04  1.007e-05  31.994  < 2e-16
## capital.loss                              6.737e-04  3.621e-05  18.605  < 2e-16
## hours.per.week                            2.723e-02  1.538e-03  17.704  < 2e-16
## native.countryCambodia                    1.096e+00  5.938e-01   1.846 0.064841
## native.countryCanada                      7.906e-01  2.827e-01   2.796 0.005172
## native.countryChina                      -5.967e-01  3.644e-01  -1.637 0.101527
## native.countryColumbia                   -3.147e+00  1.126e+00  -2.796 0.005176
## native.countryCuba                        2.374e-01  3.403e-01   0.698 0.485443
## native.countryDominican-Republic         -9.929e-01  6.370e-01  -1.559 0.119084
## native.countryEcuador                    -5.817e-01  7.310e-01  -0.796 0.426163
## native.countryEl-Salvador                -2.959e-01  4.801e-01  -0.616 0.537647
## native.countryEngland                     1.954e-01  3.504e-01   0.558 0.577142
## native.countryFrance                      5.241e-01  5.334e-01   0.983 0.325836
## native.countryGermany                     1.325e-01  2.849e-01   0.465 0.641758
## native.countryGreece                     -2.482e-01  4.531e-01  -0.548 0.583861
## native.countryGuatemala                  -1.426e-01  7.580e-01  -0.188 0.850720
## native.countryHaiti                       6.250e-01  4.967e-01   1.258 0.208248
## native.countryHoland-Netherlands         -9.326e+00  5.354e+02  -0.017 0.986103
## native.countryHonduras                    3.492e-01  1.162e+00   0.301 0.763791
## native.countryHong                       -3.701e-01  6.667e-01  -0.555 0.578810
## native.countryHungary                     2.993e-01  7.007e-01   0.427 0.669325
## native.countryIndia                      -3.584e-02  3.180e-01  -0.113 0.910263
## native.countryIran                        4.660e-02  4.570e-01   0.102 0.918769
## native.countryIreland                     1.494e+00  5.315e-01   2.812 0.004929
## native.countryItaly                       6.612e-01  3.441e-01   1.921 0.054702
## native.countryJamaica                     1.245e-01  4.666e-01   0.267 0.789690
## native.countryJapan                      -7.420e-02  3.984e-01  -0.186 0.852260
## native.countryLaos                       -3.679e-01  8.924e-01  -0.412 0.680177
## native.countryMexico                     -6.930e-01  2.573e-01  -2.693 0.007079
## native.countryNicaragua                  -2.056e-01  6.846e-01  -0.300 0.763993
## native.countryOutlying-US(Guam-USVI-etc) -3.661e-01  1.115e+00  -0.328 0.742634
## native.countryPeru                       -6.351e-01  6.384e-01  -0.995 0.319827
## native.countryPhilippines                 5.155e-01  2.681e-01   1.923 0.054522
## native.countryPoland                     -3.490e-01  4.376e-01  -0.798 0.425096
## native.countryPortugal                    6.761e-01  4.677e-01   1.446 0.148276
## native.countryPuerto-Rico                 1.380e-01  3.550e-01   0.389 0.697493
## native.countryScotland                   -9.325e-01  8.455e-01  -1.103 0.270048
## native.countrySouth                      -7.857e-01  4.306e-01  -1.825 0.068039
## native.countryTaiwan                      1.048e-01  4.670e-01   0.224 0.822393
## native.countryThailand                   -9.225e-01  7.465e-01  -1.236 0.216563
## native.countryTrinadad&Tobago            -1.202e+01  1.003e+02  -0.120 0.904637
## native.countryUnited-States               2.713e-01  1.363e-01   1.991 0.046489
## native.countryVietnam                    -9.335e-01  6.296e-01  -1.483 0.138161
## native.countryYugoslavia                  4.969e-01  7.251e-01   0.685 0.493183
##                                             
## (Intercept)                              ***
## age                                      ***
## workclassFederal-gov                     ***
## workclassLocal-gov                       ***
## workclassNever-worked                       
## workclassPrivate                         ***
## workclassSelf-emp-inc                    ***
## workclassSelf-emp-not-inc                .  
## workclassState-gov                       ** 
## workclassWithout-pay                        
## fnlwgt                                   ***
## education11th                               
## education12th                            *  
## education1st-4th                            
## education5th-6th                            
## education7th-8th                         *  
## education9th                                
## educationAssoc-acdm                      ***
## educationAssoc-voc                       ***
## educationBachelors                       ***
## educationDoctorate                       ***
## educationHS-grad                         ***
## educationMasters                         ***
## educationPreschool                          
## educationProf-school                     ***
## educationSome-college                    ***
## educational.num                             
## marital.statusMarried-AF-spouse          ***
## marital.statusMarried-civ-spouse         ***
## marital.statusMarried-spouse-absent         
## marital.statusNever-married              ***
## marital.statusSeparated                     
## marital.statusWidowed                       
## occupationAdm-clerical                      
## occupationArmed-Forces                      
## occupationCraft-repair                      
## occupationExec-managerial                ***
## occupationFarming-fishing                ***
## occupationHandlers-cleaners              ***
## occupationMachine-op-inspct              ** 
## occupationOther-service                  ***
## occupationPriv-house-serv                .  
## occupationProf-specialty                 ***
## occupationProtective-serv                ***
## occupationSales                          ***
## occupationTech-support                   ***
## occupationTransport-moving                  
## relationshipNot-in-family                *  
## relationshipOther-relative               *  
## relationshipOwn-child                    .  
## relationshipUnmarried                    .  
## relationshipWife                         ***
## raceAsian-Pac-Islander                   ** 
## raceBlack                                   
## raceOther                                   
## raceWhite                                ** 
## genderMale                               ***
## capital.gain                             ***
## capital.loss                             ***
## hours.per.week                           ***
## native.countryCambodia                   .  
## native.countryCanada                     ** 
## native.countryChina                         
## native.countryColumbia                   ** 
## native.countryCuba                          
## native.countryDominican-Republic            
## native.countryEcuador                       
## native.countryEl-Salvador                   
## native.countryEngland                       
## native.countryFrance                        
## native.countryGermany                       
## native.countryGreece                        
## native.countryGuatemala                     
## native.countryHaiti                         
## native.countryHoland-Netherlands            
## native.countryHonduras                      
## native.countryHong                          
## native.countryHungary                       
## native.countryIndia                         
## native.countryIran                          
## native.countryIreland                    ** 
## native.countryItaly                      .  
## native.countryJamaica                       
## native.countryJapan                         
## native.countryLaos                          
## native.countryMexico                     ** 
## native.countryNicaragua                     
## native.countryOutlying-US(Guam-USVI-etc)    
## native.countryPeru                          
## native.countryPhilippines                .  
## native.countryPoland                        
## native.countryPortugal                      
## native.countryPuerto-Rico                   
## native.countryScotland                      
## native.countrySouth                      .  
## native.countryTaiwan                        
## native.countryThailand                      
## native.countryTrinadad&Tobago               
## native.countryUnited-States              *  
## native.countryVietnam                       
## native.countryYugoslavia                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38496  on 35164  degrees of freedom
## Residual deviance: 22106  on 35066  degrees of freedom
## AIC: 22304
## 
## Number of Fisher Scoring iterations: 12

#The Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is inflated due to multicollinearity in the model  
vip(logit_model)

#COMMENT

Recoding Income Levels

The values for the income levels are recoded with 1 and 0 with >50 k being coded as 1. The sum of the values in this column will hence give the number of people with income >50 k and thus can be an effective prediction tool.

GLM is the generalized linear model we will be using. Income ~ . means that we want to model income using (~) every available feature (.). Family = binomial() is used to predict a binary outcome below 50k or above 50k.

In order to test the significance of each covariate in the model we will perform likelihood ratio tests with the R function “anova()”. When we run “anova(glm.model.wld, test=”LRT“)”, the function sequentially compares nested models with increasing complexity against the full model by adding one predictor at a time

anova(logit_model, test = "LRT")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: as.factor(income_.50K)
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            35164      38496              
## age              1   1846.5     35163      36649 < 2.2e-16 ***
## workclass        8    930.4     35155      35719 < 2.2e-16 ***
## fnlwgt           1      2.1     35154      35717    0.1508    
## education       15   3994.7     35139      31722 < 2.2e-16 ***
## educational.num  0      0.0     35139      31722              
## marital.status   6   5768.2     35133      25954 < 2.2e-16 ***
## occupation      13    805.0     35120      25149 < 2.2e-16 ***
## relationship     5    180.9     35115      24968 < 2.2e-16 ***
## race             4     26.0     35111      24942 3.182e-05 ***
## gender           1    129.2     35110      24813 < 2.2e-16 ***
## capital.gain     1   1908.0     35109      22905 < 2.2e-16 ***
## capital.loss     1    371.1     35108      22534 < 2.2e-16 ***
## hours.per.week   1    318.6     35107      22215 < 2.2e-16 ***
## native.country  41    108.8     35066      22106 4.628e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#As we see from the output above, the p-values indicate that all predictors in the model are significant and should be retained.

#Performance on the training data

logprobt<-predict(logit_model, newdata=train.inc, type="response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

contrasts(factor(train.inc[,"income_.50K"]))

##   1
## 0 0
## 1 1

logpredt=rep(0, nrow(train.inc))
logpredt[logprobt>.5]=1

#confusion matrix
table(actual=train.inc[,"income_.50K"], predicted=logpredt)

##       predicted
## actual     0     1
##      0 25031  1807
##      1  3327  5000

#Accuracy train 
accuracylogt =1 - mean(train.inc$income_.50K!= logpredt)
accuracylogt

## [1] 0.8540026

#Logistic Train Error 
logerrort=mean(train.inc[,"income_.50K"]!=logpredt)

logerrort

## [1] 0.1459974

#confusion matrix Logistic Regression on test data

logprob<-predict(logit_model, newdata=test.inc, type="response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

contrasts(factor(test.inc[,"income_.50K"]))

##   1
## 0 0
## 1 1

logpred=rep(0, nrow(test.inc))
logpred[logprob>=.5]=1

#confusion matrix
table(actual=test.inc[,"income_.50K"], predicted=logpred)

##       predicted
## actual    0    1
##      0 6159  442
##      1  869 1322

#we generate the vector of predicted probabilities – “logprob”, and then we use it to create the binary vector “logpred”:

#Logistic Test Error

logerror=mean(test.inc[,"income_.50K"]!=logpred)

logerror

## [1] 0.1491128

#ACCURACY Logistic Regression

#Accuracy
accuracylog =1 - mean(test.inc$income_.50K!= logpred)
accuracylog

## [1] 0.8508872

#Logistic Regression is one of the easiest and most commonly used supervised Machine learning algorithms for categorical classification, and It can interpret model coefficients as indicators of feature importance.

We are interested in predicting the values of the variable “income”. Therefore “income” is our response variable or dependent variable. It assumes only two values - less than 50K a year and more than 50K a year. The problem we consider is a classification.

Since the response variable is binary, we start with a logistic regression model, a type of generalized linear model.

The R function that we use to build the logistic regression model is “glm”, Since we want to fit a logistic regression model, in “glm” we set “family” to “binomial.”

#The logistic output shows the performance of the logistic model on test data with An accuracy of 85.08% and a test error of 14.91%

#The logistic output shows the performance of the logistic model on train data with An accuracy of 85.40% and a error of 14.59%

Data Modelling K-Nearest Neighbor

#k-nearest-neighbor (KNN)

set.seed(123)

train.ink=train.inc
test.ink=test.inc

train.ink = as.data.frame(sapply(train.ink, as.numeric))
test.ink = as.data.frame(sapply(test.ink, as.numeric))

start_time <- proc.time() # computation time 
fit.knn<-train(factor(income_.50K)~.,data=train.ink,method='knn',trControl = trainControl(method = "cv"), tuneLength=20)
fit.knn

## k-Nearest Neighbors 
## 
## 35165 samples
##    14 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 31648, 31648, 31649, 31648, 31649, 31649, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7774776  0.2852572
##    7  0.7856675  0.2776841
##    9  0.7916110  0.2763353
##   11  0.7953364  0.2746837
##   13  0.7968720  0.2692428
##   15  0.7984931  0.2653739
##   17  0.7997441  0.2629760
##   19  0.8003697  0.2598157
##   21  0.8009669  0.2588687
##   23  0.8016779  0.2586507
##   25  0.8013367  0.2549548
##   27  0.8018487  0.2554549
##   29  0.8015358  0.2511046
##   31  0.8012230  0.2473828
##   33  0.8008818  0.2445431
##   35  0.8001993  0.2400672
##   37  0.7998296  0.2366952
##   39  0.7995168  0.2334547
##   41  0.7997443  0.2328313
##   43  0.7993746  0.2298781
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 27.

time.knn.rad.over <- proc.time() - start_time
time.knn.rad.over

##    user  system elapsed 
## 282.990   3.327 289.870

plot(fit.knn, print.thres = 0.5, type="S")

KNN Predict Performance on the Test data

predict.knn<-predict(fit.knn,test.ink)

confusionMatrix(predict.knn, (factor(test.ink[,"income_.50K"])))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6513 1718
##          1   88  473
##                                         
##                Accuracy : 0.7946        
##                  95% CI : (0.786, 0.803)
##     No Information Rate : 0.7508        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.2695        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.9867        
##             Specificity : 0.2159        
##          Pos Pred Value : 0.7913        
##          Neg Pred Value : 0.8431        
##              Prevalence : 0.7508        
##          Detection Rate : 0.7408        
##    Detection Prevalence : 0.9362        
##       Balanced Accuracy : 0.6013        
##                                         
##        'Positive' Class : 0             
##

error_knn=mean(predict.knn != test.ink$income_.50K)
error_knn

## [1] 0.205414

accuracy_knn= 1-mean(predict.knn != test.ink$income_.50K)
accuracy_knn

## [1] 0.794586

KNN Predict Performance on the Train data

predict.knnt<-predict(fit.knn,train.ink)
#Get the confusion matrix to see accuracy value and other parameter values
confusionMatrix(predict.knnt, (factor(train.ink[,"income_.50K"])))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 26548  6560
##          1   290  1767
##                                          
##                Accuracy : 0.8052         
##                  95% CI : (0.801, 0.8093)
##     No Information Rate : 0.7632         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.272          
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9892         
##             Specificity : 0.2122         
##          Pos Pred Value : 0.8019         
##          Neg Pred Value : 0.8590         
##              Prevalence : 0.7632         
##          Detection Rate : 0.7550         
##    Detection Prevalence : 0.9415         
##       Balanced Accuracy : 0.6007         
##                                          
##        'Positive' Class : 0              
##

error_knnt=mean(predict.knnt != train.ink$income_.50K)
error_knnt

## [1] 0.194796

accuracy_knnt= 1-mean(predict.knnt != train.ink$income_.50K)
accuracy_knnt

## [1] 0.805204

#COMMENT

KNN provides flexibility with classification data and is a distance metric, binary variables will result in values that are on opposite sides of that values spectrum, and thus a point’s nearest neighbors will be more influenced by those on the same side of the spectrum

For some values of the parameters we observe that the optimization algorithm cannot converge, we need to convert some varables into numberic train.ink = as.data.frame(sapply(train.ink, as.numeric)) test.ink = as.data.frame(sapply(test.ink, as.numeric))

We Perform KNN on the training data, using cross-validation of 10 fold in order to give optimum value for k=27 and to avoid overfitting

The prediction accuracy on the test dataset is 79.44% with an error rate of 20.55% and is slightly lower than that of the training, which was 80.52% with an error rate of 19.47%

Part 4: Tree Based Model (15 points)

Data Modelling Classification Tree

train.int=Incomedata[train_ind, ]
test.int=Incomedata[-train_ind, ]

set.seed(123)
start_time <- proc.time()
tree <- rpart(income_.50K ~ .,
              data = train.int,
              method = "class", parms=list(split="gini"))
time.tree.rad.over <- proc.time() - start_time
time.tree.rad.over

##    user  system elapsed 
##   1.143   0.024   1.240

printcp(tree)

## 
## Classification tree:
## rpart(formula = income_.50K ~ ., data = train.int, method = "class", 
##     parms = list(split = "gini"))
## 
## Variables actually used in tree construction:
## [1] capital.gain education    relationship
## 
## Root node error: 8327/35165 = 0.2368
## 
## n= 35165 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.125976      0   1.00000 1.00000 0.0095736
## 2 0.060526      2   0.74805 0.74805 0.0085977
## 3 0.036988      3   0.68752 0.68752 0.0083141
## 4 0.010000      4   0.65053 0.65053 0.0081295

plotcp(tree)

rpart.plot(tree, main = "Basic Classification Tree for Income Data")

Obtain the confusion matrix and compute the testing error rate based on your chosen tree based model.

Treepred=predict(tree, newdata=test.int, type="class")

cm_t=confusionMatrix(
    factor(Treepred, levels = 0:1),
    factor(test.int$income_.50K, levels = 0:1)
)

cm_t

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6269 1061
##          1  332 1130
##                                           
##                Accuracy : 0.8416          
##                  95% CI : (0.8338, 0.8491)
##     No Information Rate : 0.7508          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5237          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9497          
##             Specificity : 0.5157          
##          Pos Pred Value : 0.8553          
##          Neg Pred Value : 0.7729          
##              Prevalence : 0.7508          
##          Detection Rate : 0.7130          
##    Detection Prevalence : 0.8337          
##       Balanced Accuracy : 0.7327          
##                                           
##        'Positive' Class : 0               
##

summary(Treepred)

##    0    1 
## 7330 1462

#classification error
ce(factor(Treepred, levels = 0:1), factor(test.int$income_.50K, levels = 0:1))

## [1] 0.1584395

#Test Error
 tr_error = mean(test.int$income_.50K != Treepred)
 
 #Accuracy
 tr_ac=1-tr_error 
 tr_error

## [1] 0.1584395

 tr_ac

## [1] 0.8415605

summary(Treepred)

##    0    1 
## 7330 1462

#Training

Treepredt=predict(tree, newdata=train.int, type="class")

cm_tt=confusionMatrix(
    factor(Treepredt, levels = 0:1),
    factor(train.int$income_.50K, levels = 0:1)
)

cm_tt

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 25490  4069
##          1  1348  4258
##                                           
##                Accuracy : 0.846           
##                  95% CI : (0.8421, 0.8497)
##     No Information Rate : 0.7632          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5197          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9498          
##             Specificity : 0.5113          
##          Pos Pred Value : 0.8623          
##          Neg Pred Value : 0.7595          
##              Prevalence : 0.7632          
##          Detection Rate : 0.7249          
##    Detection Prevalence : 0.8406          
##       Balanced Accuracy : 0.7306          
##                                           
##        'Positive' Class : 0               
##

#classification error
ce(factor(Treepredt, levels = 0:1), factor(train.int$income_.50K, levels = 0:1))

## [1] 0.1540452

#Test Error
 tr_errort = mean(train.int$income_.50K != Treepredt)
 
#Accuracy 
 tr_act=1-tr_errort 
 
  tr_act

## [1] 0.8459548

  summary(Treepredt)

##     0     1 
## 29559  5606

#Important varables by level 1 to 14 show how important each varables are in term of 1 to 14

importance_varables <- varImp(tree)
rownames(importance_varables)[order(importance_varables$Overall, decreasing=TRUE)]

##  [1] "capital.gain"    "educational.num" "education"       "relationship"   
##  [5] "marital.status"  "occupation"      "capital.loss"    "hours.per.week" 
##  [9] "age"             "workclass"       "fnlwgt"          "race"           
## [13] "gender"          "native.country"

#COMMENT

The Classification Tree Algorithm is represented in a visual format that enables us to quickly understand the data’s information and is the most powerful tool for classification to make a prediction.

Ability to use both numeric and categorical variables as predictors: the data is a mix of a categorical and numeric variable

Algorithm The package used to construct the tree here is Rpart which uses recursive partitioning of the data with the default criterion for each split being Gini Coefficient. We used Gini split because it operates well on the categorical Response variable in terms of “0” or “1”

The decision tree in the above figure classifies individual incomes according to relationship and returning the classification associated with the particular leaf Yes or No

We create the tree using all the variables and rank the variables in terms of importance to figure out the variables used by the decision tree algorithm to predict the income on both train and test.

We test the accuracy on the training and test datasets, and we see that it is 84.59% and 84.15%, respectively. The error is 15.84% on the test data and 15.40% on train data.

The prediction on the test data show 7330 people earn below 50,000$, and 1462 people earn above.

Part 5: SVM (15 points)

Support Vector Machine

training_set=train.ink
test_set=test.ink

income_df=subset(training_set, select = c(income_.50K, occupation, capital.loss))
plot_margins <- ggplot(data = income_df, aes(x = occupation, y = capital.loss, color = factor(income_.50K))) + geom_point() + 
    scale_color_manual(values = c("red", "blue"))
#display plot 
plot_margins

COMMENT # The plot above shown type of kernels to use, here is not linearly seprable so we need to use radial

training_set=train.ink
test_set=test.ink
start_time <- proc.time()
svm_model <- svm(factor(income_.50K)~., 
              data = training_set)
time.svm.rad.over <- proc.time() - start_time
time.svm.rad.over

##    user  system elapsed 
##  61.075   0.928  63.285

predsvm <- predict(svm_model, test_set)

table(predsvm,test_set$income_.50K)

##        
## predsvm    0    1
##       0 6213  967
##       1  388 1224

accuracy_svm=accuracy(test_set$income_.50K, predsvm)
accuracy_svm

## [1] 0.8458826

testerror_smv=1-accuracy_svm

testerror_smv

## [1] 0.1541174

summary(predsvm)

##    0    1 
## 7180 1612

#plot(svm_model, data=training_set)

on training

predsvmt <- predict(svm_model, training_set)

table(predsvmt,training_set$income_.50K)

##         
## predsvmt     0     1
##        0 25476  3623
##        1  1362  4704

accuracy_svmt=accuracy(training_set$income_.50K, predsvmt)
accuracy_svmt

## [1] 0.8582397

testerror_smvt=1-accuracy_svmt

testerror_smvt

## [1] 0.1417603

summary(predsvmt)

##     0     1 
## 29099  6066

library(foreign)

plot(svm_model, training_set,occupation~capital.loss)

COMMUNICATION #Support Vector Machines

We fitted a model with the Support Vector Machines (SVM) algorithm, implemented in the R package “e1071”. This is a popular method with reasonable default performance and parameters that allow for further tuning if necessary. it tries to find the perfect margin between the line and the support vectors that separates the classes and thus reduces the risk of error on the data though it works well on unstructured and semi-structured, our data is quite semi-structured this gives reason for using this algorithm

It also allows for fitting both linearly and non-linearly separable data using linear or non-linear kernels, respectively. By default, the “SVM” function in the “e1071” package scales the data and also perform ’radial kernel. We predict two types of models – one on the train set and one on the test set.

For some values of the parameters, we observe that the optimization algorithm cannot converge. We need to convert some variables into numeric as we did for ‘KNN.’

We also fit an SVM model with the default radial kernel, which is used for data that is not linearly separable. We build a model on the training data, and We can’t perform cross-validation and grid search to find the optimal model hyperparameters because of the computational time and type of computer used on this project.

We found out that the optimal parameter (cost=1) gives good performance results and is the best runtime and convergence.

The prediction accuracy on the test dataset is 84.58% and is slightly lower than that for the training, which was 85.82%, with an error rate of 15% on test and 14.17% on train.

X<-c("Logistic","KNN", "TREE", "SVM")
Y<-round(c(logerror,error_knn,tr_error,testerror_smv),2)



 
X_name <- "MODEL"
Y_name <- "TEST.ERROR.RATE"

df <- data.frame(X,Y)
names(df) <- c(X_name,Y_name)

ggplot(df,aes(x=MODEL,y=TEST.ERROR.RATE,fill=MODEL))+geom_bar(stat = "identity") + geom_text(aes(label=TEST.ERROR.RATE),position=position_dodge(width=0.9), vjust=-0.25)

XX<-c("Logistic","KNN", "TREE", "SVM")
YY<-round(c(accuracylog,accuracy_knn,tr_ac,accuracy_svmt),2)

XX_name <- "MODEL"
YY_name <- "ACCURACY"

df <- data.frame(XX,YY)
names(df) <- c(XX_name,YY_name)

ggplot(df,aes(x=MODEL,y=ACCURACY,fill=MODEL))+geom_bar(stat = "identity") + geom_text(aes(label=ACCURACY),position=position_dodge(width=0.9), vjust=-0.25)

#Goals and objectives of the project Build prediction models to classify if the income level of an individual is above $50,000 a year using R, Proportion Analysis using bar charts, and box plots to understand the important variables and their influence on prediction. The dependent variable in our analysis will be ‘income level.’

#Dataset Description The dataset was downloaded from (www.kaggle.com) The dataset has 43957 individual cases in total and a total of 15 variables. Following is the list of variables in the dataset.

We clean the data because if it contains some missing value and NAs, We change the data type to factor for easy prediction, and some Algorithms does work with characters, numerical and we used all the variables in the data set because they are all significant and that shown using a function called “anova()”

Building four predictive models using the following classification algorithms: Logistic regression

KNN

Decision Tree

Support vector machines

#Summary

We built several classification models to predict whether an individual earns more than 50K a year. We used logistic regression,K-Nearest Neighbor, decision tree, and support vector machine Algorithms. The table above summarises the accuracy, error rate, and computational time on the test dataset for all fitted models. We show results for models on the train also.

Let Y be the random variable “income”. Let also Y=1 or Y=0 respectively ‘income>50K’ and Y =1, Y=0 if income<=50K.(where n=43957 is the number of observations in the data set) follows the binomial distribution. More precisely, Y follows the Bernoulli

In conclusion, the logistic regression model with 85% accuracy after accounting for all performance indicators and the Support Vector Machine with 86% accuracy. The model appears to be the most suitable choice for the considered classification problem. Based on the comparisons, the method that produces the most accuracy is the Support Vector Machine.

The prediction on the test shows 7,180 people earn below 50,000$ and 1,612 people earn above.

Effectively use professional-level technology tools to support our findings in terms of the algorithms in approaching problems and demonstrate the capability to deploy established approaches accurately to analyse and solve problems using a reasonable level of skill in statistical learning mixture modelling and clustering, discriminant analysis and graphical models a quick example in this problem solving we see Support Vector Machine is our best model.
Support vector machines (SVM) are binary classifiers, and it perform non-linear classification, therefore, making them more flexible. E071 is a library that provides an R INTERFACE We first train two types of models train set and test set. We performed Kernel functions. Whose elements are to be classified into a new feature space. In the new space, previously non-linear observations become linearly separable.

PREDICT IF SALARY IS GREATER THAN $50K

BAYO ISHOLA

2/22/2022

R Markdown

Including Plots

Removing the missing data

The Boxplot below shows people with annual income greater than $50k are older.

We found that people who have occupations like Professor, Executive Managerial, and Tech support usually have Higher Annual Income above $50k

Recoding Income Levels

KNN Predict Performance on the Test data

KNN Predict Performance on the Train data

Part 4: Tree Based Model (15 points)

Part 5: SVM (15 points)

on training