Predicting Income Class from Census Data

Naveen Dandiprolu
Thu Apr 16 13:24:36 2015

Problem Statement

Question:

Predict whether a person's income is above or below $50K per year from a set of quantitative and qualitative features.

Data:

1994 US Census data describing 15 attributes for 32,561 individuals. (Source: UCI Machine Learning Repository)
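
As a rough sketch of the "get the raw data" step, the snippet below reads comma-separated records in the layout of the UCI file into a data frame. The column names are assumptions taken from the attribute list in this deck (only `Class` is confirmed by the later `rawdata$Class`), the `read_adult()` helper is hypothetical, and the three rows are made up for illustration. How the author actually handled the `?` placeholders for missing values is not stated.

```r
# Assumed column names, following the attribute slides; "Class" matches rawdata$Class.
adult_cols <- c("Age", "WorkClass", "FinalWeight", "Education", "EducationNum",
                "MaritalStatus", "Occupation", "Relationship", "Race", "Sex",
                "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Class")

# Hypothetical helper: accepts either a file path or text = "..." like read.csv().
read_adult <- function(...) {
  read.csv(..., header = FALSE, col.names = adult_cols,
           strip.white = TRUE,        # fields are separated by ", "
           na.strings = "?",          # "?" marks a missing value
           stringsAsFactors = TRUE)   # qualitative attributes become factors
}

# Made-up rows in the file's layout, for illustration only.
sample_rows <- paste(
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
  "52, Private, 210000, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 45, United-States, >50K",
  "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Black, Male, 0, 0, 60, United-States, >50K",
  sep = "\n")

rawdata <- read_adult(text = sample_rows)
str(rawdata)                   # integer columns and factor columns
print(colSums(is.na(rawdata))) # missing values per attribute
```

With a local copy of the data, `read_adult("adult.data")` would be the equivalent call, assuming the file is saved under that name.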

Challenge:

Fitting a model to a mix of qualitative and quantitative variables is not straightforward, and achieving high accuracy may require more complex methods.
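
One reason mixed variable types complicate modeling is that many methods need qualitative attributes converted to numeric indicator columns first. The sketch below shows that conversion with base R's `model.matrix()` on a made-up three-row data frame. It only illustrates the general idea and is not taken from the author's code.

```r
# Toy data, invented for illustration: one quantitative and one qualitative attribute.
toy <- data.frame(Age = c(25, 40, 31),
                  Sex = factor(c("Male", "Female", "Male")))

# model.matrix() keeps Age as is and expands Sex into a 0/1 indicator column.
# The first factor level ("Female") is the reference and gets no column of its own.
mm <- model.matrix(~ Age + Sex, data = toy)
print(mm)
```

A factor with k levels expands to k - 1 indicator columns, so attributes such as Occupation or Native-Country widen the design matrix considerably.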

Approach

Get and clean the raw data.
Understand the attributes and explore their association.
Come up with logical options to build a model.
Split the data into training and test sets.
Perform any preprocessing if necessary.
Build the models on the training set and validate them on the test set.
Continue to research ways to improve the accuracy.

Attributes

Age
Working Class
Final Weight
Education
Education_Num
Marital Status
Occupation
Relationship

Attributes (contd.)

Race
Sex
Capital Gain
Capital Loss
Hours Per Week
Native-Country
Income Class

Age vs Income

[Plot: Age vs Income]

Inference: Older people tend to earn more than younger ones.
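
An inference like this about a quantitative attribute can be checked numerically by summarizing the attribute within each income class. The sketch below does so on a made-up six-row data frame. The column names `Age` and `Class` are assumptions, and the values are invented for illustration.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  Age   = c(25, 38, 23, 52, 45, 60),
  Class = factor(c("<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K")))

# Median age within each income class.
age_by_class <- tapply(toy$Age, toy$Class, median)
print(age_by_class)

# Full five-number summary per class, the numbers a box plot would draw.
print(tapply(toy$Age, toy$Class, summary))
```

On the real data, `boxplot(Age ~ Class, data = rawdata)` would draw the corresponding picture, and the same pattern applies to Hours Per Week, Capital Gain and Capital Loss.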

Education vs Income

[Plot: Education vs Income]

Inference: The higher the education level, the higher the probability of earning more than $50K per year.

Hours Per Week vs Income

[Plot: Hours Per Week vs Income]

Inference: Most people in the higher income group work more than 40 hours per week.

Final Weight vs Income

[Plot: Final Weight vs Income]

Inference: Final Weight has no apparent effect on income.

Capital Gain vs Income

[Plot: Capital Gain vs Income]

Inference: People with higher income usually have higher capital gains, with very few exceptions.

Capital Loss vs Income

[Plot: Capital Loss vs Income]

Inference: People with high capital losses mostly fall in the low income group.

Sex vs Income

[Plot: Sex vs Income]

Inference: Very few females earn more than $50K per year.
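
For a qualitative attribute, the comparison comes down to the share of each income class within each category. The sketch below computes row-wise proportions with `table()` and `prop.table()` on made-up data. The column names `Sex` and `Class` are assumptions, and the counts are invented for illustration.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  Sex   = factor(c("Female", "Female", "Female", "Female",
                   "Male", "Male", "Male", "Male")),
  Class = factor(c("<=50K", "<=50K", "<=50K", ">50K",
                   "<=50K", "<=50K", ">50K", ">50K")))

counts <- table(toy$Sex, toy$Class)        # rows: Sex, columns: income class
shares <- prop.table(counts, margin = 1)   # margin = 1: each row sums to 1
print(round(shares, 2))
```

The same two calls cover the Working Class, Marital Status, Occupation, Race and Country slides by swapping the first column.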

Working Class vs Income

[Plot: Working Class vs Income]

Inference: The proportions of the two income groups vary across working classes.

Marital Status vs Income

[Plot: Marital Status vs Income]

Inference: Most single people earn less than $50K per year.

Occupation vs Income

[Plot: Occupation vs Income]

Inference: The ratio of the two income groups differs markedly across occupations.

Race vs Income

[Plot: Race vs Income]

Inference: The share of Black individuals in the higher income group is much lower than that of White individuals.

Country vs Income

[Plot: Country vs Income]

Inference: Most of the data comes from Americans, so native country has no significant effect on the model.

Split Data

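Partition the data into training (75%) and test (25%) sets.

library(caret)   # provides createDataPartition()

set.seed(1234)   # make the random partition reproducible
# Sample 75% of the rows, stratified by income class
inTrain <- createDataPartition(rawdata$Class, p = 0.75, list = FALSE)
train <- rawdata[inTrain, ]
test  <- rawdata[-inTrain, ]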
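
For a factor outcome, `createDataPartition()` samples within each class so that the class balance is roughly preserved in both sets. The base-R sketch below imitates that idea on a made-up outcome vector. `stratified_split()` is a hypothetical stand-in written for illustration, not caret's implementation.

```r
# Hypothetical helper: pick a fraction p of the indices within each class of y.
stratified_split <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1234)
# Toy outcome, invented for illustration: 76 low-income and 24 high-income labels.
y <- factor(rep(c("<=50K", ">50K"), times = c(76, 24)))

in_train <- stratified_split(y, p = 0.75)
print(prop.table(table(y[in_train])))    # class shares in the training part
print(prop.table(table(y[-in_train])))   # class shares in the test part
```

Both parts keep the 76/24 balance of the toy outcome, which is the property that makes the test-set accuracy comparable to the training distribution.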

Random Forest Model

Predictions on the held-out test set (8,140 records):

Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   5815   727
     >50K     365  1233

               Accuracy : 0.8658          
                 95% CI : (0.8583, 0.8732)
    No Information Rate : 0.7592          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.6084          
 Mcnemar's Test P-Value : < 2.2e-16       

            Sensitivity : 0.9409          
            Specificity : 0.6291          
         Pos Pred Value : 0.8889          
         Neg Pred Value : 0.7716          
             Prevalence : 0.7592          
         Detection Rate : 0.7144          
   Detection Prevalence : 0.8037          
      Balanced Accuracy : 0.7850          

       'Positive' Class :  <=50K
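
The reported statistics follow directly from the four counts in the confusion matrix. The sketch below recomputes them in base R as a cross-check, treating `<=50K` as the positive class as the output states. The variable names are chosen here for illustration.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(5815, 365, 727, 1233), nrow = 2,
             dimnames = list(Prediction = c("<=50K", ">50K"),
                             Reference  = c("<=50K", ">50K")))

pos <- "<=50K"; neg <- ">50K"
tp <- cm[pos, pos]; fn <- cm[neg, pos]   # reference is <=50K
fp <- cm[pos, neg]; tn <- cm[neg, neg]   # reference is >50K
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)            # share of <=50K cases predicted correctly
specificity <- tn / (tn + fp)            # share of >50K cases predicted correctly
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2
nir         <- max(colSums(cm)) / n      # accuracy of always guessing the majority class

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, NIR = nir, Kappa = cohen_kappa,
              Sensitivity = sensitivity, Specificity = specificity,
              PPV = ppv, NPV = npv, Balanced = balanced), 4))
```

Because the positive class is `<=50K`, the specificity of 0.6291 is the figure that describes the high earners: about 37% of the people who actually earn more than $50K (727 of 1,960) are predicted as low income.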