Predicting the Income Level based on Various Factors

Description of the Dataset
R Packages Used
Importing the dataset
Description of the variables in the dataset
Basic commands to know about the Data
- Summary of the data
- Description of the dataset
Normalization of the data
Problems in the Data
SQL Queries and Exploratory Analysis using R
Using R for Statistical Analysis
Inferences
Tableau for Visualization
Challenges Faced

Description of the Dataset

The dataset used is US Census data which is an extraction of the 1994 census data which was donated to the UC Irvine’s Machine Learning Repository.
The data contains approximately 32,000 observations with over 15 variables.
The dataset was downloaded from:
http://archive.ics.uci.edu/ml/datasets/Adult.
The dependent variable in our analysis will be income level and who earns above $50,000 a year using SQL queries, Proportion Analysis using bar charts and Simple Decision Tree to understand the important variables and their influence on prediction.

R Packages Used

options(gsubfn.engine = "R")
library(sqldf) #For SQL Functions
library(reshape2) #For plots
library(knitr) # For tables
library(Hmisc)# Description of the data
library(rpart)
library(caret)
library(rpart.plot)

Importing the dataset

data <- read.csv("adult.csv", header=FALSE, sep= ",", strip.white=TRUE,col.names= c("age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation","relationship", "race", "sex", "capitalgain",
 "capitalloss", "hoursperweek", "nativecountry",
"incomelevel"), na.strings= "?", stringsAsFactors = TRUE)
kable(head(data))

age	workclass	fnlwgt	education	educationnum	maritalstatus	occupation	relationship	race	sex	capitalgain	hoursperweek	nativecountry	incomelevel
39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K
37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	40	United-States	<=50K

Description of the variables in the dataset

The variables and their classifications are as given below:

variabletable<- read.csv("Workbook1.csv")
kable(variabletable)

Variable.Name	Description	Type	Possible.Values
Age	Age of the individual	Continuous	Numeric
Workclass	Class of Work	Categorical	Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt	Final Weight Determined by Census Org	Continuous	Numeric
Education	Education of the individual	Categorical	Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
Education-num	Number of years of education	Continuous	Numeric
Marital-status	Marital status of the individual	Categorical	Married-civilian-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-ArmedForces-spouse
Occupation	Occupation of the individual	Categorical	Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
Relationship	Present relationship	Categorical	Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
Race	Race of the individual	Categorical	White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex	Sex of the individual	Categorical	Female, Male
Capital-gain	Capital gain made by the individual	Continuous	Numeric
Capital-loss	Capital loss made by the individual	Continuous	Numeric
Hours-per-week	Average number of hours spent by the individual on work	Continuous	Numeric
Native-country	Native country of origin	Categorical	United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Income Level	Income Level	Categorical	<= 50 k or >50k

Basic commands to know about the Data

Summary of the data

kable(summary(data))

age	workclass	fnlwgt	education	educationnum	maritalstatus	occupation	relationship	race	sex	capitalgain	capitalloss	hoursperweek	nativecountry	incomelevel
Min. :17.00	Private :22696	Min. : 12285	HS-grad :10501	Min. : 1.00	Divorced : 4443	Prof-specialty : 4140	Husband :13193	Amer-Indian-Eskimo: 311	Female:10771	Min. : 0	Min. : 0.0	Min. : 1.00	United-States:29170	<=50K:24720
1st Qu.:28.00	Self-emp-not-inc: 2541	1st Qu.: 117827	Some-college: 7291	1st Qu.: 9.00	Married-AF-spouse : 23	Craft-repair : 4099	Not-in-family : 8305	Asian-Pac-Islander: 1039	Male :21790	1st Qu.: 0	1st Qu.: 0.0	1st Qu.:40.00	Mexico : 643	>50K : 7841
Median :37.00	Local-gov : 2093	Median : 178356	Bachelors : 5355	Median :10.00	Married-civ-spouse :14976	Exec-managerial: 4066	Other-relative: 981	Black : 3124	NA	Median : 0	Median : 0.0	Median :40.00	Philippines : 198	NA
Mean :38.58	State-gov : 1298	Mean : 189778	Masters : 1723	Mean :10.08	Married-spouse-absent: 418	Adm-clerical : 3770	Own-child : 5068	Other : 271	NA	Mean : 1078	Mean : 87.3	Mean :40.44	Germany : 137	NA
3rd Qu.:48.00	Self-emp-inc : 1116	3rd Qu.: 237051	Assoc-voc : 1382	3rd Qu.:12.00	Never-married :10683	Sales : 3650	Unmarried : 3446	White :27816	NA	3rd Qu.: 0	3rd Qu.: 0.0	3rd Qu.:45.00	Canada : 121	NA
Max. :90.00	(Other) : 981	Max. :1484705	11th : 1175	Max. :16.00	Separated : 1025	(Other) :10993	Wife : 1568	NA	NA	Max. :99999	Max. :4356.0	Max. :99.00	(Other) : 1709	NA
NA	NA’s : 1836	NA	(Other) : 5134	NA	Widowed : 993	NA’s : 1843	NA	NA	NA	NA	NA	NA	NA’s : 583	NA

Description of the dataset

describe(data)

data 

 15  Variables      32561  Observations
---------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0       73        1    38.58     15.4       19       22 
     .25      .50      .75      .90      .95 
      28       37       48       58       63 

lowest : 17 18 19 20 21, highest: 85 86 87 88 90 
---------------------------------------------------------------------------
workclass 
       n  missing distinct 
   30725     1836        8 

lowest : Federal-gov      Local-gov        Never-worked     Private          Self-emp-inc    
highest: Private          Self-emp-inc     Self-emp-not-inc State-gov        Without-pay      
                                                             
Value           Federal-gov        Local-gov     Never-worked
Frequency               960             2093                7
Proportion            0.031            0.068            0.000
                                                             
Value               Private     Self-emp-inc Self-emp-not-inc
Frequency             22696             1116             2541
Proportion            0.739            0.036            0.083
                                            
Value             State-gov      Without-pay
Frequency              1298               14
Proportion            0.042            0.000
---------------------------------------------------------------------------
fnlwgt 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0    21648        1   189778   112348    39460    65716 
     .25      .50      .75      .90      .95 
  117827   178356   237051   329054   379682 

lowest :   12285   13769   14878   18827   19214
highest: 1226583 1268339 1366120 1455435 1484705 
---------------------------------------------------------------------------
education 
       n  missing distinct 
   32561        0       16 

lowest : 10th         11th         12th         1st-4th      5th-6th     
highest: HS-grad      Masters      Preschool    Prof-school  Some-college 
                                                              
Value              10th         11th         12th      1st-4th
Frequency           933         1175          433          168
Proportion        0.029        0.036        0.013        0.005
                                                              
Value           5th-6th      7th-8th          9th   Assoc-acdm
Frequency           333          646          514         1067
Proportion        0.010        0.020        0.016        0.033
                                                              
Value         Assoc-voc    Bachelors    Doctorate      HS-grad
Frequency          1382         5355          413        10501
Proportion        0.042        0.164        0.013        0.323
                                                              
Value           Masters    Preschool  Prof-school Some-college
Frequency          1723           51          576         7291
Proportion        0.053        0.002        0.018        0.224
---------------------------------------------------------------------------
educationnum 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0       16     0.95    10.08     2.75        5        7 
     .25      .50      .75      .90      .95 
       9       10       12       13       14 

lowest :  1  2  3  4  5, highest: 12 13 14 15 16 
                                                                      
Value          1     2     3     4     5     6     7     8     9    10
Frequency     51   168   333   646   514   933  1175   433 10501  7291
Proportion 0.002 0.005 0.010 0.020 0.016 0.029 0.036 0.013 0.323 0.224
                                              
Value         11    12    13    14    15    16
Frequency   1382  1067  5355  1723   576   413
Proportion 0.042 0.033 0.164 0.053 0.018 0.013
---------------------------------------------------------------------------
maritalstatus 
       n  missing distinct 
   32561        0        7 

lowest : Divorced              Married-AF-spouse     Married-civ-spouse    Married-spouse-absent Never-married        
highest: Married-civ-spouse    Married-spouse-absent Never-married         Separated             Widowed               
                                                      
Value                   Divorced     Married-AF-spouse
Frequency                   4443                    23
Proportion                 0.136                 0.001
                                                      
Value         Married-civ-spouse Married-spouse-absent
Frequency                  14976                   418
Proportion                 0.460                 0.013
                                                      
Value              Never-married             Separated
Frequency                  10683                  1025
Proportion                 0.328                 0.031
                                
Value                    Widowed
Frequency                    993
Proportion                 0.030
---------------------------------------------------------------------------
occupation 
       n  missing distinct 
   30718     1843       14 

lowest : Adm-clerical     Armed-Forces     Craft-repair     Exec-managerial  Farming-fishing 
highest: Prof-specialty   Protective-serv  Sales            Tech-support     Transport-moving 
                                                                
Value           Adm-clerical      Armed-Forces      Craft-repair
Frequency               3770                 9              4099
Proportion             0.123             0.000             0.133
                                                                
Value        Exec-managerial   Farming-fishing Handlers-cleaners
Frequency               4066               994              1370
Proportion             0.132             0.032             0.045
                                                                
Value      Machine-op-inspct     Other-service   Priv-house-serv
Frequency               2002              3295               149
Proportion             0.065             0.107             0.005
                                                                
Value         Prof-specialty   Protective-serv             Sales
Frequency               4140               649              3650
Proportion             0.135             0.021             0.119
                                              
Value           Tech-support  Transport-moving
Frequency                928              1597
Proportion             0.030             0.052
---------------------------------------------------------------------------
relationship 
       n  missing distinct 
   32561        0        6 

lowest : Husband        Not-in-family  Other-relative Own-child      Unmarried     
highest: Not-in-family  Other-relative Own-child      Unmarried      Wife           
                                                                      
Value             Husband  Not-in-family Other-relative      Own-child
Frequency           13193           8305            981           5068
Proportion          0.405          0.255          0.030          0.156
                                        
Value           Unmarried           Wife
Frequency            3446           1568
Proportion          0.106          0.048
---------------------------------------------------------------------------
race 
       n  missing distinct 
   32561        0        5 

lowest : Amer-Indian-Eskimo Asian-Pac-Islander Black              Other              White             
highest: Amer-Indian-Eskimo Asian-Pac-Islander Black              Other              White              

Amer-Indian-Eskimo (311, 0.010), Asian-Pac-Islander (1039, 0.032), Black
(3124, 0.096), Other (271, 0.008), White (27816, 0.854)
---------------------------------------------------------------------------
sex 
       n  missing distinct 
   32561        0        2 

Female (10771, 0.331), Male (21790, 0.669)
---------------------------------------------------------------------------
capitalgain 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0      119     0.23     1078     2082        0        0 
     .25      .50      .75      .90      .95 
       0        0        0        0     5013 

lowest :     0   114   401   594   914
highest: 25236 27828 34095 41310 99999 
---------------------------------------------------------------------------
capitalloss 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0       92    0.134     87.3    167.3        0        0 
     .25      .50      .75      .90      .95 
       0        0        0        0        0 

lowest :    0  155  213  323  419, highest: 3004 3683 3770 3900 4356 
---------------------------------------------------------------------------
hoursperweek 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   32561        0       94    0.897    40.44    12.28       18       24 
     .25      .50      .75      .90      .95 
      40       40       45       55       60 

lowest :  1  2  3  4  5, highest: 95 96 97 98 99 
---------------------------------------------------------------------------
nativecountry 
       n  missing distinct 
   31978      583       41 

lowest : Cambodia        Canada          China           Columbia        Cuba           
highest: Thailand        Trinadad&Tobago United-States   Vietnam         Yugoslavia      
---------------------------------------------------------------------------
incomelevel 
       n  missing distinct 
   32561        0        2 

<=50K (24720, 0.759), >50K (7841, 0.241)
---------------------------------------------------------------------------

Normalization of the data

The dataset is already normalized as there are no repetitive data which can be used to split into multiple tables and there isn’t data redundancy or data integrity improvements available for this particular dataset.

Problems in the Data

Recoded Missing Values

We recoded the data with NA for the string “?” (Which denotes misisng value in the dataset) when we imported the file to make the treatment of missing values easier.

Missing Data

2399 datapoints are found to be missing which is 7.3% of the dataset.

table(complete.cases(data))


FALSE  TRUE 
 2399 30162

We find that about 2066 of the missing observations centres around <=50 k category and since most of the data has <=50 k category as the income level, removing these missing variables will not affect the data integrity.

summary(data[!complete.cases(data),])

      age                   workclass        fnlwgt      
 Min.   :17.00   Private         : 410   Min.   : 12285  
 1st Qu.:22.00   Self-emp-inc    :  42   1st Qu.:121804  
 Median :36.00   Self-emp-not-inc:  42   Median :177906  
 Mean   :40.39   Local-gov       :  26   Mean   :189584  
 3rd Qu.:58.00   State-gov       :  19   3rd Qu.:232668  
 Max.   :90.00   (Other)         :  24   Max.   :981628  
                 NA's            :1836                   
        education    educationnum                 maritalstatus
 HS-grad     :661   Min.   : 1.00   Divorced             :229  
 Some-college:613   1st Qu.: 9.00   Married-AF-spouse    :  2  
 Bachelors   :311   Median :10.00   Married-civ-spouse   :911  
 11th        :127   Mean   : 9.57   Married-spouse-absent: 48  
 10th        :113   3rd Qu.:11.00   Never-married        :957  
 Masters     : 96   Max.   :16.00   Separated            : 86  
 (Other)     :478                   Widowed              :166  
           occupation           relationship                 race     
 Prof-specialty : 102   Husband       :730   Amer-Indian-Eskimo:  25  
 Other-service  :  83   Not-in-family :579   Asian-Pac-Islander: 144  
 Exec-managerial:  74   Other-relative: 92   Black             : 307  
 Craft-repair   :  69   Own-child     :602   Other             :  40  
 Sales          :  66   Unmarried     :234   White             :1883  
 (Other)        : 162   Wife          :162                            
 NA's           :1843                                                 
     sex        capitalgain       capitalloss       hoursperweek  
 Female: 989   Min.   :    0.0   Min.   :   0.00   Min.   : 1.00  
 Male  :1410   1st Qu.:    0.0   1st Qu.:   0.00   1st Qu.:25.00  
               Median :    0.0   Median :   0.00   Median :40.00  
               Mean   :  897.1   Mean   :  73.87   Mean   :34.23  
               3rd Qu.:    0.0   3rd Qu.:   0.00   3rd Qu.:40.00  
               Max.   :99999.0   Max.   :4356.00   Max.   :99.00  
                                                                  
       nativecountry  incomelevel 
 United-States:1666   <=50K:2066  
 Mexico       :  33   >50K : 333  
 Canada       :  14               
 Philippines  :  10               
 Germany      :   9               
 (Other)      :  84               
 NA's         : 583

Removing the missing data

data <- na.omit(data)

Unncessary Variables

The fnlwgt which is the final weight determined by the Census Organization is of no use in any of the analysis that we are doing henceforth and is removed. The educationnum if a repetitive variable which recodes the categorical variable education as a numeric variable but will be used in the analysis for decision trees, hence is not being removed.

Finaldata <- sqldf("SELECT age
                           ,workclass
                    ,education
                    ,educationnum
                    ,maritalstatus
                    ,occupation
                    ,relationship
                    ,race
                    ,sex
                    ,capitalgain
                    ,capitalloss
                    ,hoursperweek
                    ,nativecountry
                    ,incomelevel
                    FROM data
                    ")
kable(head(Finaldata))

age	workclass	education	educationnum	maritalstatus	occupation	relationship	race	sex	capitalgain	hoursperweek	nativecountry	incomelevel
39	State-gov	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
50	Self-emp-not-inc	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
38	Private	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
53	Private	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
28	Private	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K
37	Private	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	40	United-States	<=50K

Recoding Income Levels

The values for the income levels are recoded with 1 and 0 with >50 k being coded as 1. The sum of the values in this column will hence give the number of people with income >50 k and thus can be an effective prediction tool.

Finaldata$income<-ifelse(Finaldata$income=='>50K',1,0)
kable(head(Finaldata))

age	workclass	education	educationnum	maritalstatus	occupation	relationship	race	sex	capitalgain	hoursperweek	nativecountry	incomelevel
39	State-gov	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
50	Self-emp-not-inc	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
38	Private	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
53	Private	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
28	Private	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K
37	Private	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	40	United-States	<=50K

SQL Queries and Exploratory Analysis using R

Statistics of other variables based on workclass and SQL Queries

Sample statistics are generated classifying the data by Work Class and calculating the numerical values categorized by the Work Class using GROUPBY, SUM and AVERAGE functions in SQLDF package in R.

WorkclassClassification <- sqldf("SELECT 
                          workclass,
                          count(*) as Numberofrecords,
                         Avg(age) as averageage,
                         Sum(capitalgain) as capitalgain,
                        Sum(capitalloss) as capitalloss,
                        Avg(hoursperweek),
                        sum(income) as above50k
                         FROM Finaldata
                         GROUP by workclass")
kable(WorkclassClassification)

workclass	Numberofrecords	averageage	capitalgain	capitalloss	Avg(hoursperweek)	above50k
Federal-gov	943	42.57794	784879	105363	41.27996	365
Local-gov	2067	41.74311	1714019	228038	41.00242	609
Private	22286	36.79436	19608520	1774623	40.25088	4876
Self-emp-inc	1074	46.02793	5166742	164864	48.80261	600
Self-emp-not-inc	2499	45.01160	4780923	288815	44.43217	714
State-gov	1279	39.36200	875228	103788	39.05395	344
Without-pay	14	47.78571	6830	0	32.71429	0

Statistics based on race

Sample statistics are also generated by classifying the data by Race and calculating the numerical values categorized by the Work Class using GROUPBY, SUM and AVERAGE functions in SQLDF package in R.

RacesClassification <- sqldf("SELECT 
                                 race,
                                 count(*) as Numberofrecords,
                                 Avg(age) as averageage,
                                 Sum(capitalgain) as capitalgain,
                                 Sum(capitalloss) as capitalloss,
                                 Avg(hoursperweek),
                                 sum(income) as above50k
                                 FROM Finaldata
                                 GROUP by race")
kable(RacesClassification)

race	Numberofrecords	averageage	capitalgain	capitalloss	Avg(hoursperweek)	above50k
Amer-Indian-Eskimo	286	36.73776	188843	10629	40.42657	34
Asian-Pac-Islander	895	37.91061	1167416	82057	40.50056	248
Black	2817	37.86084	1703502	158506	38.67448	366
Other	231	33.60173	245011	11618	40.11255	21
White	25933	38.58061	29632369	2402681	41.20410	6839

SQL Queries can also be used to find some basic prediction about the income level and the various categorical variables which contribute towards it.

Marital Status and implication on the income levels

We find that Married with Civilian Spouse to have higher amount of people with above 50k income. We then find the proportions of the people with above 50 k income with respect to the total number of people in each category and find that Married with Armed Forces have a better proportion of people with income above 50k.

MaritalLevel<- sqldf("SELECT 
                    maritalstatus as status
                    , Count (*) as Count
                    , sum(income) as Above50k
                    FROM 
                    Finaldata
                    GROUP BY 
                    status
                    ORDER BY status") 
kable(MaritalLevel)

status	Count	Above50k
Divorced	4214	452
Married-AF-spouse	21	10
Married-civ-spouse	14065	6399
Married-spouse-absent	370	31
Never-married	9726	470
Separated	939	66
Widowed	827	80

Maritalclass<-melt(MaritalLevel,id.vars = 'status')
ggplot(Maritalclass,aes(x=status,y=value,fill=variable))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle('Proportion of People with income above 50k')+
  xlab("Gender Class")+
  ylab("Number of People")

table<-data.frame(Class=MaritalLevel$status, 
                   Proportion=
                     MaritalLevel$Above50k/MaritalLevel$Count)
kable(table)

Class	Proportion
Divorced	0.1072615
Married-AF-spouse	0.4761905
Married-civ-spouse	0.4549591
Married-spouse-absent	0.0837838
Never-married	0.0483241
Separated	0.0702875
Widowed	0.0967352

Gender and implication on the income levels

We use a similar analysis as above across the Gender Variable and find that Males have a better proportion of people with salaries above 50k and more number of people who have salaries greater than 50k thus raising the spark for gender inequality.

GenderLevel<- sqldf("SELECT 
                    sex as gender
                       , Count (*) as Count
                       , sum(income) as Above50k
                       FROM 
                       Finaldata
                       GROUP BY 
                       gender
                       ORDER BY gender") 
Genderclass<-melt(GenderLevel,id.vars = 'gender')
ggplot(Genderclass,aes(x=gender,y=value,fill=variable))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle('Proportion of People with income above 50k')+
  xlab("Gender Class")+
  ylab("Number of People")

table<-data.frame(Class=GenderLevel$gender, 
                   Proportion=
                    GenderLevel$Above50k/GenderLevel$Count)
kable(table)

Class	Proportion
Female	0.1136782
Male	0.3138371

Work Class and implication on the income levels

We find that the people employed in private companies have more people with income above 50k and Self Employed people having a higher proportion of peoplw with income greater than 50k.

WorkclassLevel<- sqldf("SELECT 
                    workclass as workclass
                    , Count (*) as Count
                    , sum(income) as Above50k
                    FROM 
                    Finaldata
                    GROUP BY 
                    workclass
                    ORDER BY workclass") 
Workclass<-melt(WorkclassLevel,id.vars = 'workclass')
ggplot(Workclass,aes(x=workclass,y=value,fill=variable))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle('Proportion of People with income above 50k')+
  xlab("Work Class")+
  ylab("Number of People")

table<-data.frame(Class=WorkclassLevel$workclass, 
                  Proportion=
                    WorkclassLevel$Above50k/WorkclassLevel$Count)
kable(table)

Class	Proportion
Federal-gov	0.3870626
Local-gov	0.2946299
Private	0.2187921
Self-emp-inc	0.5586592
Self-emp-not-inc	0.2857143
State-gov	0.2689601
Without-pay	0.0000000

Education Level and implication on the income levels

We find that the Bachelors’ Degree has higher number of people with income above 50 k and Prof School Graduates and Doctorates have higher proportion of people with income about 50k.

Education_Level<- sqldf("SELECT 
                       education as 'Education'
                       , Count (*) as 'Count'
                       , sum(income) as 'Above50k'
                       FROM 
                       Finaldata
                       GROUP BY 
                       Education
                       ORDER BY Education") 
EducationLevel<-melt(Education_Level,id.vars = 'Education')
ggplot(EducationLevel,aes(x=Education,y=value,fill=variable))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle('Proportion of People with income above 50k')+
  xlab("Education Class")+
  ylab("Number of People")

table<-data.frame(Class=Education_Level$Education, 
                   Proportion=
                     Education_Level$Above50k/Education_Level$Count)
kable(table)

Class	Proportion
10th	0.0719512
11th	0.0562977
12th	0.0769231
1st-4th	0.0397351
5th-6th	0.0416667
7th-8th	0.0628366
9th	0.0549451
Assoc-acdm	0.2539683
Assoc-voc	0.2631982
Bachelors	0.4214909
Doctorate	0.7466667
HS-grad	0.1643293
Masters	0.5642286
Preschool	0.0000000
Prof-school	0.7490775
Some-college	0.2000599

Using R for Statistical Analysis

We use R to create Decision Trees to predict the income levels using all the variables in the dataset.

Methodology Used

Classification methodology was used for this particular data as the response variable income was a binary variable.

Advantages of using Decision Trees for this Analysis

Simplicity of Interpretation: The entire data is represented in a visual format which enables us to understand the data quickly and make a prediction.
Ability to use both numeric and categorical variables as predictors: Our data is a mix of categorical and numeric variables and other commonly used methods such as linear or logistic regressions can use either of these categories but not both in a single algorithm.

Algorithm behind

The package used to construct the tree here is Rpart which uses recursive partitioning of the data with the default criterion being used for each split being Gini Coefficient.

Variables used

We create the tree using all the variables and rank the variables in terms of importance to figure out the variables used by the decision tree algorithm to predict the income.

tree <- rpart(incomelevel ~ .,
              data = data,
              method = "class")
imp <- varImp(tree)
rownames(imp)[order(imp$Overall, decreasing=TRUE)]

 [1] "capitalgain"   "education"     "educationnum"  "relationship" 
 [5] "maritalstatus" "occupation"    "capitalloss"   "age"          
 [9] "hoursperweek"  "workclass"     "fnlwgt"        "race"         
[13] "sex"           "nativecountry"

We find that Capital Gain is one of the key predictors followed by education for the income levels.
The tree logic is as below where only “capital gain, education and relationship” are used as variables by the tree as Education num and education both have the same data.

printcp(tree)


Classification tree:
rpart(formula = incomelevel ~ ., data = data, method = "class")

Variables actually used in tree construction:
[1] capitalgain  education    relationship

Root node error: 7508/30162 = 0.24892

n= 30162 

        CP nsplit rel error  xerror      xstd
1 0.129995      0   1.00000 1.00000 0.0100018
2 0.064198      2   0.74001 0.74001 0.0089670
3 0.037294      3   0.67581 0.67581 0.0086527
4 0.010000      4   0.63852 0.63852 0.0084574

plotcp(tree)

rpart.plot(tree, main = "Basic Classification Tree for Census Data")

Inferences

About 46% of the people are in a relationship called “Husband” or “Wife” which is then further classified based on Education Level where nearly 14% who earn above $50 K have the education of Bachelors, Prof-school, Masters and Doctorate.

The other education levels have income predominantly below $50 k with just 2% having salaries above $50k who also have capital gains greater than $5096

With respect to other relationships, only 1% have income above $50 k and with capital gains greater than $7074.

Tableau for Visualization

We use 4 sheets to visualize different relationships and use Work Class as the common parameter which will filter the entire dashboard.

The first sheet shows the relationship of Education and Number of People Earning > 50 k and separated by Work Class. We find that Bachelors graduates working in Private companies have a higher number of people earning above 50 k.

Impact of Education on Income Levels

The second sheet shows the relationship of Average hours per week with respect to gender and separated by Work Class and we find that Males typically work more hours per week on Average across all work classes.

Impact of Gender on Income Levels

The third sheet shows that the relationship of marital status and income levels separated by Work Class. Majority of the people in Married with Civilian spouse have an income greater than 50 k and majorly in the private sector.

Impact of Marital Status on Income Levels

The fourth work sheet shows the impact of occupation, capital gain and capital loss on the income levels which has details of work class too. This is a comprehensive visualization across 4 different parameters. We find that Executives at Managerial Level have more people with income greater than 50 k and Professional Speciality has more capital gains.

Impact of Occupation on Income Levels and Capital Gains/Loss

Final Dashboard:

Final Dashboard

Filtered Dashboard for Private Sector:

Filtered Dashboard

Challenges Faced

The datafile was initially scraped from a web repository called UCI Machine Learning- part of UC Irvine, which crashed in the last two days left for submission and hence R was unable to establish a connection to the data. So, I figured out the FTP page of the website and downloaded the data and uploaded from the hard disk directly for the project. This is not the ideal way but R markdown supports this by establishing a static html page which will provide all the outputs from the commands at the time of publishing the report.
Problems with decision trees and R packages- The packages used for prettier decision tree modeling was not working in the R studio and has to be fixed with multiple supporting libraries before proceeding.
Learning SQL DF- R and SQL both are relatively new for me and it took me a while to understand the workings of SQL DF commands.

Predicting the Income Level based on Various Factors

Data Management - Final Project

Predicting the Income Level based on Various Factors

Description of the Dataset

R Packages Used

Importing the dataset

Description of the variables in the dataset

Basic commands to know about the Data

Summary of the data

Description of the dataset

Normalization of the data

Problems in the Data

Recoded Missing Values

Missing Data

Unncessary Variables

Recoding Income Levels

SQL Queries and Exploratory Analysis using R

Statistics of other variables based on workclass and SQL Queries

Statistics based on race

Marital Status and implication on the income levels

Gender and implication on the income levels

Work Class and implication on the income levels

Education Level and implication on the income levels

Using R for Statistical Analysis

Methodology Used

Advantages of using Decision Trees for this Analysis

Algorithm behind

Variables used

Inferences

Tableau for Visualization

Challenges Faced