Predicting Income Class from Census Data

Naveen Dandiprolu
Thu Apr 16 13:24:36 2015

Problem Statement

Question:

Predict whether a person's income is above or below $50K/yr, given a mix of quantitative and qualitative features.

Data:

1994 US Census data describing 15 attributes for 32,561 individuals (source: UCI Machine Learning Repository).

Challenge:

Fitting a single model to a mix of qualitative and quantitative variables is not always straightforward, and achieving high accuracy may call for more complex methods.

Approach

  • Get and clean the raw data.
  • Understand the attributes and explore their association.
  • Come up with logical options to build a model.
  • Split the data into training and test sets.
  • Perform any preprocessing if necessary.
  • Build the models on the training data set and evaluate them on the test set.
  • Continue researching ways to improve the accuracy.

Attributes

  • Age
  • Working Class
  • Final Weight
  • Education
  • Education_Num
  • Marital Status
  • Occupation
  • Relationship

Attributes (contd.)

  • Race
  • Sex
  • Capital Gain
  • Capital Loss
  • Hours Per Week
  • Native-Country
  • Income Class
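
One way the 15 attributes might be loaded into R is sketched below. The column names are assumptions chosen to mirror the attribute list (only `Class` appears later in this deck), and the two records are illustrative rows written in the comma-separated layout of the UCI file, where missing values are coded as `?`. They should not be read as actual census records.

```r
# Hypothetical column names mirroring the attribute list;
# only `Class` is referenced elsewhere in the deck.
cols <- c("Age", "WorkClass", "FinalWeight", "Education", "EducationNum",
          "MaritalStatus", "Occupation", "Relationship", "Race", "Sex",
          "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Class")

# Illustrative records only (the second shows the "?" missing-value code).
sample_lines <- c(
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
  "52, ?, 120000, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 45, United-States, >50K"
)

# In practice `text = sample_lines` would be replaced by the path to the downloaded file.
rawdata <- read.csv(text = sample_lines, header = FALSE, col.names = cols,
                    strip.white = TRUE,       # drop the space after each comma
                    na.strings = "?",         # treat "?" as missing
                    stringsAsFactors = TRUE)  # qualitative attributes become factors

str(rawdata)
colSums(is.na(rawdata))  # number of missing values per attribute
```

Reading the qualitative attributes as factors and flagging `?` as `NA` at load time keeps the later cleaning step to a decision about how to handle the missing rows.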

Age vs Income

[Plot: Age vs Income]

Inference: Older people tend to earn more than younger ones.

Education vs Income

[Plot: Education vs Income]

Inference: The higher the education level, the higher the probability of earning more than $50K/year.
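
The probability in this inference is a row-wise proportion of an education-by-income contingency table. A minimal sketch of that calculation follows, using a small made-up data frame; the values and the column names `Education` and `Class` are assumptions for illustration, not the census data.

```r
# Made-up toy data for illustration only; not taken from the census file.
toy <- data.frame(
  Education = c("HS-grad", "HS-grad", "HS-grad", "HS-grad",
                "Bachelors", "Bachelors", "Masters", "Masters"),
  Class     = c("<=50K", "<=50K", "<=50K", ">50K",
                "<=50K", ">50K", ">50K", ">50K")
)

counts <- table(toy$Education, toy$Class)  # education level x income class
props  <- prop.table(counts, margin = 1)   # normalise within each education level

# Share of each education level in the higher income class
high_share <- props[, ">50K"]
print(round(high_share, 2))
```

Normalising by row (`margin = 1`) rather than over the whole table is what makes the figures comparable across education levels of very different sizes.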

Hours Per Week vs Income

[Plot: Hours Per Week vs Income]

Inference: Most people in the higher income group work more than 40 hours per week.
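
A statement like this can be checked numerically by taking, within each income class, the share of people above the 40-hour threshold. The sketch below assumes hypothetical column names `HoursPerWeek` and `Class` and uses made-up values purely to show the calculation.

```r
# Made-up toy data for illustration only; not taken from the census file.
toy <- data.frame(
  HoursPerWeek = c(40, 35, 40, 20, 50, 45, 60, 40),
  Class        = c("<=50K", "<=50K", "<=50K", "<=50K",
                   ">50K", ">50K", ">50K", ">50K")
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share working more than 40 hours in each class.
over40 <- tapply(toy$HoursPerWeek > 40, toy$Class, mean)
print(over40)
```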

Final Weight vs Income

[Plot: Final Weight vs Income]

Inference: Final Weight shows no apparent impact on income.

Capital Gain vs Income

[Plot: Capital Gain vs Income]

Inference: People with higher income usually have higher capital gains, with very few exceptions.

Capital Loss vs Income

[Plot: Capital Loss vs Income]

Inference: People with high capital losses mostly fall in the low income group.

Sex vs Income

[Plot: Sex vs Income]

Inference: Very few females earn more than $50K/year.

Working Class vs Income

[Plot: Working Class vs Income]

Inference: The proportions of the two income groups vary across working classes.

Marital Status vs Income

[Plot: Marital Status vs Income]

Inference: Most single people earn less than $50K per year.

Occupation vs Income

[Plot: Occupation vs Income]

Inference: The ratio of the two income groups differs markedly across occupations.

Race vs Income

[Plot: Race vs Income]

Inference: The proportion of Black individuals in the higher income group is much lower than that of White individuals.

Country vs Income

[Plot: Country vs Income]

Inference: Most of the data was collected from Americans. Native country has no significant effect on the modeling.

Split Data

Partition the data into training (75%) and test (25%) sets.

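library(caret)  # provides createDataPartition
set.seed(1234)  # make the split reproducible
# Stratified sample of 75% of the row indices, drawn within each level of Class
inTrain <- createDataPartition(rawdata$Class, p = 0.75, list = FALSE)
train <- rawdata[inTrain, ]
test <- rawdata[-inTrain, ]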
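
For a factor outcome, createDataPartition draws its sample within each class, so the training and test sets keep roughly the same class balance. The base-R sketch below imitates that idea on synthetic labels; it illustrates stratified sampling and is not caret's actual implementation. The label vector is made up, with a class balance close to the one reported in the results.

```r
set.seed(1234)

# Synthetic labels only: 1000 cases, 76% "<=50K" and 24% ">50K".
Class <- factor(rep(c("<=50K", ">50K"), times = c(760, 240)))

# Sample 75% of the row indices within each class, then pool them.
idx_by_class <- split(seq_along(Class), Class)
inTrain <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE))

# Class proportions in each partition should match the full data.
train_prop <- prop.table(table(Class[inTrain]))
test_prop  <- prop.table(table(Class[-inTrain]))
print(rbind(train = train_prop, test = test_prop))
```

A plain random 75% sample would give similar proportions on average, but stratifying removes the chance of an unlucky split that under-represents the smaller class.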

Random Forest Model

Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   5815   727
     >50K     365  1233

               Accuracy : 0.8658          
                 95% CI : (0.8583, 0.8732)
    No Information Rate : 0.7592          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.6084          
 Mcnemar's Test P-Value : < 2.2e-16       

            Sensitivity : 0.9409          
            Specificity : 0.6291          
         Pos Pred Value : 0.8889          
         Neg Pred Value : 0.7716          
             Prevalence : 0.7592          
         Detection Rate : 0.7144          
   Detection Prevalence : 0.8037          
      Balanced Accuracy : 0.7850          

       'Positive' Class :  <=50K
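
Several of the statistics above can be reproduced by hand from the four counts in the confusion matrix. The base-R sketch below does so, treating `<=50K` as the positive class as the output states; the variable names are chosen for illustration.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(5815, 365, 727, 1233), nrow = 2,
             dimnames = list(Prediction = c("<=50K", ">50K"),
                             Reference  = c("<=50K", ">50K")))

tp <- cm["<=50K", "<=50K"]  # predicted <=50K, truly <=50K
fn <- cm[">50K",  "<=50K"]  # predicted >50K,  truly <=50K
fp <- cm["<=50K", ">50K"]   # predicted <=50K, truly >50K
tn <- cm[">50K",  ">50K"]   # predicted >50K,  truly >50K

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2
nir         <- max(colSums(cm)) / sum(cm)  # share of the largest reference class

# Cohen's kappa: agreement beyond what the marginal totals alone would give
p_expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa_stat <- (accuracy - p_expected) / (1 - p_expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced,
              nir = nir, kappa = kappa_stat), 4))
```

Because `<=50K` is the positive class, the specificity of 0.6291 is the figure that describes the higher income group: about 63% of the people who actually earn more than $50K are identified as such, against a no-information rate of 0.7592 for always predicting `<=50K`.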