Predicting the Income Level based on Various Factors

Description of the Dataset

The dataset used is US Census data which is an extraction of the 1994 census data which was donated to the UC Irvine’s Machine Learning Repository.
The data contains approximately 32,000 observations with over 15 variables.
The dataset was downloaded from:
http://archive.ics.uci.edu/ml/datasets/Adult.
The dependent variable in our analysis will be income level and who earns above $50,000 a year using SQL queries, Proportion Analysis using bar charts and Simple Decision Tree to understand the important variables and their influence on prediction.

R Packages Used

options(gsubfn.engine = "R")
library(sqldf) #For SQL Functions
library(reshape2) #For plots
library(knitr) # For tables
library(Hmisc)# Description of the data
library(rpart)
library(caret)
library(rpart.plot)

Importing the dataset

data <- read.csv("adult.csv", header=FALSE, sep= ",", strip.white=TRUE,col.names= c("age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation","relationship", "race", "sex", "capitalgain",
 "capitalloss", "hoursperweek", "nativecountry",
"incomelevel"), na.strings= "?", stringsAsFactors = TRUE)
kable(head(data))
age workclass fnlwgt education educationnum maritalstatus occupation relationship race sex capitalgain capitalloss hoursperweek nativecountry incomelevel
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K

Description of the variables in the dataset

The variables and their classifications are as given below:

variabletable<- read.csv("Workbook1.csv")
kable(variabletable)
Variable.Name Description Type Possible.Values
Age Age of the individual Continuous Numeric
Workclass Class of Work Categorical Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt Final Weight Determined by Census Org Continuous Numeric
Education Education of the individual Categorical Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
Education-num Number of years of education Continuous Numeric
Marital-status Marital status of the individual Categorical Married-civilian-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-ArmedForces-spouse
Occupation Occupation of the individual Categorical Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
Relationship Present relationship Categorical Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
Race Race of the individual Categorical White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex Sex of the individual Categorical Female, Male
Capital-gain Capital gain made by the individual Continuous Numeric
Capital-loss Capital loss made by the individual Continuous Numeric
Hours-per-week Average number of hours spent by the individual on work Continuous Numeric
Native-country Native country of origin Categorical United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Income Level Income Level Categorical <= 50 k or >50k

Basic commands to know about the Data

Summary of the data

kable(summary(data))
age workclass fnlwgt education educationnum maritalstatus occupation relationship race sex capitalgain capitalloss hoursperweek nativecountry incomelevel
Min. :17.00 Private :22696 Min. : 12285 HS-grad :10501 Min. : 1.00 Divorced : 4443 Prof-specialty : 4140 Husband :13193 Amer-Indian-Eskimo: 311 Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00 United-States:29170 <=50K:24720
1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827 Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23 Craft-repair : 4099 Not-in-family : 8305 Asian-Pac-Islander: 1039 Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 643 >50K : 7841
Median :37.00 Local-gov : 2093 Median : 178356 Bachelors : 5355 Median :10.00 Married-civ-spouse :14976 Exec-managerial: 4066 Other-relative: 981 Black : 3124 NA Median : 0 Median : 0.0 Median :40.00 Philippines : 198 NA
Mean :38.58 State-gov : 1298 Mean : 189778 Masters : 1723 Mean :10.08 Married-spouse-absent: 418 Adm-clerical : 3770 Own-child : 5068 Other : 271 NA Mean : 1078 Mean : 87.3 Mean :40.44 Germany : 137 NA
3rd Qu.:48.00 Self-emp-inc : 1116 3rd Qu.: 237051 Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683 Sales : 3650 Unmarried : 3446 White :27816 NA 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00 Canada : 121 NA
Max. :90.00 (Other) : 981 Max. :1484705 11th : 1175 Max. :16.00 Separated : 1025 (Other) :10993 Wife : 1568 NA NA Max. :99999 Max. :4356.0 Max. :99.00 (Other) : 1709 NA
NA NA’s : 1836 NA (Other) : 5134 NA Widowed : 993 NA’s : 1843 NA NA NA NA NA NA NA’s : 583 NA

Description of the dataset

describe(data)
data 

 15  Variables      32561  Observations
---------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0       73        1    38.58     15.4       19       22 
     .25      .50      .75      .90      .95 
      28       37       48       58       63 

lowest : 17 18 19 20 21, highest: 85 86 87 88 90 
---------------------------------------------------------------------------
workclass 
       n  missing distinct 
   30725     1836        8 

lowest : Federal-gov      Local-gov        Never-worked     Private          Self-emp-inc    
highest: Private          Self-emp-inc     Self-emp-not-inc State-gov        Without-pay      
                                                             
Value           Federal-gov        Local-gov     Never-worked
Frequency               960             2093                7
Proportion            0.031            0.068            0.000
                                                             
Value               Private     Self-emp-inc Self-emp-not-inc
Frequency             22696             1116             2541
Proportion            0.739            0.036            0.083
                                            
Value             State-gov      Without-pay
Frequency              1298               14
Proportion            0.042            0.000
---------------------------------------------------------------------------
fnlwgt 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0    21648        1   189778   112348    39460    65716 
     .25      .50      .75      .90      .95 
  117827   178356   237051   329054   379682 

lowest :   12285   13769   14878   18827   19214
highest: 1226583 1268339 1366120 1455435 1484705 
---------------------------------------------------------------------------
education 
       n  missing distinct 
   32561        0       16 

lowest : 10th         11th         12th         1st-4th      5th-6th     
highest: HS-grad      Masters      Preschool    Prof-school  Some-college 
                                                              
Value              10th         11th         12th      1st-4th
Frequency           933         1175          433          168
Proportion        0.029        0.036        0.013        0.005
                                                              
Value           5th-6th      7th-8th          9th   Assoc-acdm
Frequency           333          646          514         1067
Proportion        0.010        0.020        0.016        0.033
                                                              
Value         Assoc-voc    Bachelors    Doctorate      HS-grad
Frequency          1382         5355          413        10501
Proportion        0.042        0.164        0.013        0.323
                                                              
Value           Masters    Preschool  Prof-school Some-college
Frequency          1723           51          576         7291
Proportion        0.053        0.002        0.018        0.224
---------------------------------------------------------------------------
educationnum 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0       16     0.95    10.08     2.75        5        7 
     .25      .50      .75      .90      .95 
       9       10       12       13       14 

lowest :  1  2  3  4  5, highest: 12 13 14 15 16 
                                                                      
Value          1     2     3     4     5     6     7     8     9    10
Frequency     51   168   333   646   514   933  1175   433 10501  7291
Proportion 0.002 0.005 0.010 0.020 0.016 0.029 0.036 0.013 0.323 0.224
                                              
Value         11    12    13    14    15    16
Frequency   1382  1067  5355  1723   576   413
Proportion 0.042 0.033 0.164 0.053 0.018 0.013
---------------------------------------------------------------------------
maritalstatus 
       n  missing distinct 
   32561        0        7 

lowest : Divorced              Married-AF-spouse     Married-civ-spouse    Married-spouse-absent Never-married        
highest: Married-civ-spouse    Married-spouse-absent Never-married         Separated             Widowed               
                                                      
Value                   Divorced     Married-AF-spouse
Frequency                   4443                    23
Proportion                 0.136                 0.001
                                                      
Value         Married-civ-spouse Married-spouse-absent
Frequency                  14976                   418
Proportion                 0.460                 0.013
                                                      
Value              Never-married             Separated
Frequency                  10683                  1025
Proportion                 0.328                 0.031
                                
Value                    Widowed
Frequency                    993
Proportion                 0.030
---------------------------------------------------------------------------
occupation 
       n  missing distinct 
   30718     1843       14 

lowest : Adm-clerical     Armed-Forces     Craft-repair     Exec-managerial  Farming-fishing 
highest: Prof-specialty   Protective-serv  Sales            Tech-support     Transport-moving 
                                                                
Value           Adm-clerical      Armed-Forces      Craft-repair
Frequency               3770                 9              4099
Proportion             0.123             0.000             0.133
                                                                
Value        Exec-managerial   Farming-fishing Handlers-cleaners
Frequency               4066               994              1370
Proportion             0.132             0.032             0.045
                                                                
Value      Machine-op-inspct     Other-service   Priv-house-serv
Frequency               2002              3295               149
Proportion             0.065             0.107             0.005
                                                                
Value         Prof-specialty   Protective-serv             Sales
Frequency               4140               649              3650
Proportion             0.135             0.021             0.119
                                              
Value           Tech-support  Transport-moving
Frequency                928              1597
Proportion             0.030             0.052
---------------------------------------------------------------------------
relationship 
       n  missing distinct 
   32561        0        6 

lowest : Husband        Not-in-family  Other-relative Own-child      Unmarried     
highest: Not-in-family  Other-relative Own-child      Unmarried      Wife           
                                                                      
Value             Husband  Not-in-family Other-relative      Own-child
Frequency           13193           8305            981           5068
Proportion          0.405          0.255          0.030          0.156
                                        
Value           Unmarried           Wife
Frequency            3446           1568
Proportion          0.106          0.048
---------------------------------------------------------------------------
race 
       n  missing distinct 
   32561        0        5 

lowest : Amer-Indian-Eskimo Asian-Pac-Islander Black              Other              White             
highest: Amer-Indian-Eskimo Asian-Pac-Islander Black              Other              White              

Amer-Indian-Eskimo (311, 0.010), Asian-Pac-Islander (1039, 0.032), Black
(3124, 0.096), Other (271, 0.008), White (27816, 0.854)
---------------------------------------------------------------------------
sex 
       n  missing distinct 
   32561        0        2 

Female (10771, 0.331), Male (21790, 0.669)
---------------------------------------------------------------------------
capitalgain 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0      119     0.23     1078     2082        0        0 
     .25      .50      .75      .90      .95 
       0        0        0        0     5013 

lowest :     0   114   401   594   914
highest: 25236 27828 34095 41310 99999 
---------------------------------------------------------------------------
capitalloss 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0       92    0.134     87.3    167.3        0        0 
     .25      .50      .75      .90      .95 
       0        0        0        0        0 

lowest :    0  155  213  323  419, highest: 3004 3683 3770 3900 4356 
---------------------------------------------------------------------------
hoursperweek 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0       94    0.897    40.44    12.28       18       24 
     .25      .50      .75      .90      .95 
      40       40       45       55       60 

lowest :  1  2  3  4  5, highest: 95 96 97 98 99 
---------------------------------------------------------------------------
nativecountry 
       n  missing distinct 
   31978      583       41 

lowest : Cambodia        Canada          China           Columbia        Cuba           
highest: Thailand        Trinadad&Tobago United-States   Vietnam         Yugoslavia      
---------------------------------------------------------------------------
incomelevel 
       n  missing distinct 
   32561        0        2 

<=50K (24720, 0.759), >50K (7841, 0.241)
---------------------------------------------------------------------------

Normalization of the data

The dataset is already normalized as there are no repetitive data which can be used to split into multiple tables and there isn’t data redundancy or data integrity improvements available for this particular dataset.

Problems in the Data

Recoded Missing Values

We recoded the data with NA for the string “?” (Which denotes misisng value in the dataset) when we imported the file to make the treatment of missing values easier.

Missing Data

2399 datapoints are found to be missing which is 7.3% of the dataset.

table(complete.cases(data)) 

FALSE  TRUE 
 2399 30162 

We find that about 2066 of the missing observations centres around <=50 k category and since most of the data has <=50 k category as the income level, removing these missing variables will not affect the data integrity.

summary(data[!complete.cases(data),])
      age                   workclass        fnlwgt      
 Min.   :17.00   Private         : 410   Min.   : 12285  
 1st Qu.:22.00   Self-emp-inc    :  42   1st Qu.:121804  
 Median :36.00   Self-emp-not-inc:  42   Median :177906  
 Mean   :40.39   Local-gov       :  26   Mean   :189584  
 3rd Qu.:58.00   State-gov       :  19   3rd Qu.:232668  
 Max.   :90.00   (Other)         :  24   Max.   :981628  
                 NA's            :1836                   
        education    educationnum                 maritalstatus
 HS-grad     :661   Min.   : 1.00   Divorced             :229  
 Some-college:613   1st Qu.: 9.00   Married-AF-spouse    :  2  
 Bachelors   :311   Median :10.00   Married-civ-spouse   :911  
 11th        :127   Mean   : 9.57   Married-spouse-absent: 48  
 10th        :113   3rd Qu.:11.00   Never-married        :957  
 Masters     : 96   Max.   :16.00   Separated            : 86  
 (Other)     :478                   Widowed              :166  
           occupation           relationship                 race     
 Prof-specialty : 102   Husband       :730   Amer-Indian-Eskimo:  25  
 Other-service  :  83   Not-in-family :579   Asian-Pac-Islander: 144  
 Exec-managerial:  74   Other-relative: 92   Black             : 307  
 Craft-repair   :  69   Own-child     :602   Other             :  40  
 Sales          :  66   Unmarried     :234   White             :1883  
 (Other)        : 162   Wife          :162                            
 NA's           :1843                                                 
     sex        capitalgain       capitalloss       hoursperweek  
 Female: 989   Min.   :    0.0   Min.   :   0.00   Min.   : 1.00  
 Male  :1410   1st Qu.:    0.0   1st Qu.:   0.00   1st Qu.:25.00  
               Median :    0.0   Median :   0.00   Median :40.00  
               Mean   :  897.1   Mean   :  73.87   Mean   :34.23  
               3rd Qu.:    0.0   3rd Qu.:   0.00   3rd Qu.:40.00  
               Max.   :99999.0   Max.   :4356.00   Max.   :99.00  
                                                                  
       nativecountry  incomelevel 
 United-States:1666   <=50K:2066  
 Mexico       :  33   >50K : 333  
 Canada       :  14               
 Philippines  :  10               
 Germany      :   9               
 (Other)      :  84               
 NA's         : 583               

Removing the missing data

data <- na.omit(data)

Unncessary Variables

The fnlwgt which is the final weight determined by the Census Organization is of no use in any of the analysis that we are doing henceforth and is removed. The educationnum if a repetitive variable which recodes the categorical variable education as a numeric variable but will be used in the analysis for decision trees, hence is not being removed.

Finaldata <- sqldf("SELECT age
                           ,workclass
                    ,education
                    ,educationnum
                    ,maritalstatus
                    ,occupation
                    ,relationship
                    ,race
                    ,sex
                    ,capitalgain
                    ,capitalloss
                    ,hoursperweek
                    ,nativecountry
                    ,incomelevel
                    FROM data
                    ")
kable(head(Finaldata))
age workclass education educationnum maritalstatus occupation relationship race sex capitalgain capitalloss hoursperweek nativecountry incomelevel
39 State-gov Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
28 Private Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
37 Private Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K

Recoding Income Levels

The values for the income levels are recoded with 1 and 0 with >50 k being coded as 1. The sum of the values in this column will hence give the number of people with income >50 k and thus can be an effective prediction tool.

Finaldata$income<-ifelse(Finaldata$income=='>50K',1,0)
kable(head(Finaldata))
age workclass education educationnum maritalstatus occupation relationship race sex capitalgain capitalloss hoursperweek nativecountry incomelevel income
39 State-gov Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K 0
50 Self-emp-not-inc Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K 0
38 Private HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K 0
53 Private 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K 0
28 Private Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K 0
37 Private Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K 0

SQL Queries and Exploratory Analysis using R

Statistics of other variables based on workclass and SQL Queries

Sample statistics are generated classifying the data by Work Class and calculating the numerical values categorized by the Work Class using GROUPBY, SUM and AVERAGE functions in SQLDF package in R.

WorkclassClassification <- sqldf("SELECT 
                          workclass,
                          count(*) as Numberofrecords,
                         Avg(age) as averageage,
                         Sum(capitalgain) as capitalgain,
                        Sum(capitalloss) as capitalloss,
                        Avg(hoursperweek),
                        sum(income) as above50k
                         FROM Finaldata
                         GROUP by workclass")
kable(WorkclassClassification)
workclass Numberofrecords averageage capitalgain capitalloss Avg(hoursperweek) above50k
Federal-gov 943 42.57794 784879 105363 41.27996 365
Local-gov 2067 41.74311 1714019 228038 41.00242 609
Private 22286 36.79436 19608520 1774623 40.25088 4876
Self-emp-inc 1074 46.02793 5166742 164864 48.80261 600
Self-emp-not-inc 2499 45.01160 4780923 288815 44.43217 714
State-gov 1279 39.36200 875228 103788 39.05395 344
Without-pay 14 47.78571 6830 0 32.71429 0

Statistics based on race

Sample statistics are also generated by classifying the data by Race and calculating the numerical values categorized by the Work Class using GROUPBY, SUM and AVERAGE functions in SQLDF package in R.

RacesClassification <- sqldf("SELECT 
                                 race,
                                 count(*) as Numberofrecords,
                                 Avg(age) as averageage,
                                 Sum(capitalgain) as capitalgain,
                                 Sum(capitalloss) as capitalloss,
                                 Avg(hoursperweek),
                                 sum(income) as above50k
                                 FROM Finaldata
                                 GROUP by race")
kable(RacesClassification)
race Numberofrecords averageage capitalgain capitalloss Avg(hoursperweek) above50k
Amer-Indian-Eskimo 286 36.73776 188843 10629 40.42657 34
Asian-Pac-Islander 895 37.91061 1167416 82057 40.50056 248
Black 2817 37.86084 1703502 158506 38.67448 366
Other 231 33.60173 245011 11618 40.11255 21
White 25933 38.58061 29632369 2402681 41.20410 6839

SQL Queries can also be used to find some basic prediction about the income level and the various categorical variables which contribute towards it.

Marital Status and implication on the income levels

We find that Married with Civilian Spouse to have higher amount of people with above 50k income. We then find the proportions of the people with above 50 k income with respect to the total number of people in each category and find that Married with Armed Forces have a better proportion of people with income above 50k.

MaritalLevel<- sqldf("SELECT 
                    maritalstatus as status
                    , Count (*) as Count
                    , sum(income) as Above50k
                    FROM 
                    Finaldata
                    GROUP BY 
                    status
                    ORDER BY status") 
kable(MaritalLevel)
status Count Above50k
Divorced 4214 452
Married-AF-spouse 21 10
Married-civ-spouse 14065 6399
Married-spouse-absent 370 31
Never-married 9726 470
Separated 939 66
Widowed 827 80
Maritalclass<-melt(MaritalLevel,id.vars = 'status')
ggplot(Maritalclass,aes(x=status,y=value,fill=variable))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle('Proportion of People with income above 50k')+
  xlab("Gender Class")+
  ylab("Number of People")

table<-data.frame(Class=MaritalLevel$status, 
                   Proportion=
                     MaritalLevel$Above50k/MaritalLevel$Count)
kable(table)
Class Proportion
Divorced 0.1072615
Married-AF-spouse 0.4761905
Married-civ-spouse 0.4549591
Married-spouse-absent 0.0837838
Never-married 0.0483241
Separated 0.0702875
Widowed 0.0967352

Gender and implication on the income levels

We use a similar analysis as above across the Gender Variable and find that Males have a better proportion of people with salaries above 50k and more number of people who have salaries greater than 50k thus raising the spark for gender inequality.

GenderLevel<- sqldf("SELECT 
                    sex as gender
                       , Count (*) as Count
                       , sum(income) as Above50k
                       FROM 
                       Finaldata
                       GROUP BY 
                       gender
                       ORDER BY gender") 
Genderclass<-melt(GenderLevel,id.vars = 'gender')
ggplot(Genderclass,aes(x=gender,y=value,fill=variable))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle('Proportion of People with income above 50k')+
  xlab("Gender Class")+
  ylab("Number of People")

table<-data.frame(Class=GenderLevel$gender, 
                   Proportion=
                    GenderLevel$Above50k/GenderLevel$Count)
kable(table)
Class Proportion
Female 0.1136782
Male 0.3138371

Work Class and implication on the income levels

We find that the people employed in private companies have more people with income above 50k and Self Employed people having a higher proportion of peoplw with income greater than 50k.

WorkclassLevel<- sqldf("SELECT 
                    workclass as workclass
                    , Count (*) as Count
                    , sum(income) as Above50k
                    FROM 
                    Finaldata
                    GROUP BY 
                    workclass
                    ORDER BY workclass") 
Workclass<-melt(WorkclassLevel,id.vars = 'workclass')
ggplot(Workclass,aes(x=workclass,y=value,fill=variable))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle('Proportion of People with income above 50k')+
  xlab("Work Class")+
  ylab("Number of People")

table<-data.frame(Class=WorkclassLevel$workclass, 
                  Proportion=
                    WorkclassLevel$Above50k/WorkclassLevel$Count)
kable(table)
Class Proportion
Federal-gov 0.3870626
Local-gov 0.2946299
Private 0.2187921
Self-emp-inc 0.5586592
Self-emp-not-inc 0.2857143
State-gov 0.2689601
Without-pay 0.0000000

Education Level and implication on the income levels

We find that the Bachelors’ Degree has higher number of people with income above 50 k and Prof School Graduates and Doctorates have higher proportion of people with income about 50k.

Education_Level<- sqldf("SELECT 
                       education as 'Education'
                       , Count (*) as 'Count'
                       , sum(income) as 'Above50k'
                       FROM 
                       Finaldata
                       GROUP BY 
                       Education
                       ORDER BY Education") 
EducationLevel<-melt(Education_Level,id.vars = 'Education')
ggplot(EducationLevel,aes(x=Education,y=value,fill=variable))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle('Proportion of People with income above 50k')+
  xlab("Education Class")+
  ylab("Number of People")

table<-data.frame(Class=Education_Level$Education, 
                   Proportion=
                     Education_Level$Above50k/Education_Level$Count)
kable(table)
Class Proportion
10th 0.0719512
11th 0.0562977
12th 0.0769231
1st-4th 0.0397351
5th-6th 0.0416667
7th-8th 0.0628366
9th 0.0549451
Assoc-acdm 0.2539683
Assoc-voc 0.2631982
Bachelors 0.4214909
Doctorate 0.7466667
HS-grad 0.1643293
Masters 0.5642286
Preschool 0.0000000
Prof-school 0.7490775
Some-college 0.2000599

Using R for Statistical Analysis

We use R to create Decision Trees to predict the income levels using all the variables in the dataset.

Methodology Used

Classification methodology was used for this particular data as the response variable income was a binary variable.

Advantages of using Decision Trees for this Analysis

Simplicity of Interpretation: The entire data is represented in a visual format which enables us to understand the data quickly and make a prediction.
Ability to use both numeric and categorical variables as predictors: Our data is a mix of categorical and numeric variables and other commonly used methods such as linear or logistic regressions can use either of these categories but not both in a single algorithm.

Algorithm behind

The package used to construct the tree here is Rpart which uses recursive partitioning of the data with the default criterion being used for each split being Gini Coefficient.

Variables used

We create the tree using all the variables and rank the variables in terms of importance to figure out the variables used by the decision tree algorithm to predict the income.

tree <- rpart(incomelevel ~ .,
              data = data,
              method = "class")
imp <- varImp(tree)
rownames(imp)[order(imp$Overall, decreasing=TRUE)]
 [1] "capitalgain"   "education"     "educationnum"  "relationship" 
 [5] "maritalstatus" "occupation"    "capitalloss"   "age"          
 [9] "hoursperweek"  "workclass"     "fnlwgt"        "race"         
[13] "sex"           "nativecountry"

We find that Capital Gain is one of the key predictors followed by education for the income levels.
The tree logic is as below where only “capital gain, education and relationship” are used as variables by the tree as Education num and education both have the same data.

printcp(tree)

Classification tree:
rpart(formula = incomelevel ~ ., data = data, method = "class")

Variables actually used in tree construction:
[1] capitalgain  education    relationship

Root node error: 7508/30162 = 0.24892

n= 30162 

        CP nsplit rel error  xerror      xstd
1 0.129995      0   1.00000 1.00000 0.0100018
2 0.064198      2   0.74001 0.74001 0.0089670
3 0.037294      3   0.67581 0.67581 0.0086527
4 0.010000      4   0.63852 0.63852 0.0084574
plotcp(tree)

rpart.plot(tree, main = "Basic Classification Tree for Census Data")

Inferences

About 46% of the people are in a relationship called “Husband” or “Wife” which is then further classified based on Education Level where nearly 14% who earn above $50 K have the education of Bachelors, Prof-school, Masters and Doctorate.

The other education levels have income predominantly below $50 k with just 2% having salaries above $50k who also have capital gains greater than $5096

With respect to other relationships, only 1% have income above $50 k and with capital gains greater than $7074.

Tableau for Visualization

We use 4 sheets to visualize different relationships and use Work Class as the common parameter which will filter the entire dashboard.

The first sheet shows the relationship of Education and Number of People Earning > 50 k and separated by Work Class. We find that Bachelors graduates working in Private companies have a higher number of people earning above 50 k.

Impact of Education on Income Levels

Impact of Education on Income Levels

The second sheet shows the relationship of Average hours per week with respect to gender and separated by Work Class and we find that Males typically work more hours per week on Average across all work classes.

Impact of Gender on Income Levels

Impact of Gender on Income Levels

The third sheet shows that the relationship of marital status and income levels separated by Work Class. Majority of the people in Married with Civilian spouse have an income greater than 50 k and majorly in the private sector.

Impact of Marital Status on Income Levels

Impact of Marital Status on Income Levels

The fourth work sheet shows the impact of occupation, capital gain and capital loss on the income levels which has details of work class too. This is a comprehensive visualization across 4 different parameters. We find that Executives at Managerial Level have more people with income greater than 50 k and Professional Speciality has more capital gains.

Impact of Occupation on Income Levels and Capital Gains/Loss

Impact of Occupation on Income Levels and Capital Gains/Loss

Final Dashboard:

Final Dashboard

Final Dashboard

Filtered Dashboard for Private Sector:

Filtered Dashboard

Filtered Dashboard

Challenges Faced

  1. The datafile was initially scraped from a web repository called UCI Machine Learning- part of UC Irvine, which crashed in the last two days left for submission and hence R was unable to establish a connection to the data. So, I figured out the FTP page of the website and downloaded the data and uploaded from the hard disk directly for the project. This is not the ideal way but R markdown supports this by establishing a static html page which will provide all the outputs from the commands at the time of publishing the report.
  2. Problems with decision trees and R packages- The packages used for prettier decision tree modeling was not working in the R studio and has to be fixed with multiple supporting libraries before proceeding.
  3. Learning SQL DF- R and SQL both are relatively new for me and it took me a while to understand the workings of SQL DF commands.

2016-11-27