Predicting the Income Level based on Various Factors
Data Management - Final Project
Data Management - Final Project
Predicting the Income Level based on Various Factors
- Description of the Dataset
- R Packages Used
- Importing the dataset
- Description of the variables in the dataset
- Basic commands to know about the Data
- Normalization of the data
- Problems in the Data
- SQL Queries and Exploratory Analysis using R
- Using R for Statistical Analysis
- Inferences
- Tableau for Visualization
- Challenges Faced
Description of the Dataset
The dataset used is US Census data which is an extraction of the 1994 census data which was donated to the UC Irvine’s Machine Learning Repository.
The data contains approximately 32,000 observations with over 15 variables.
The dataset was downloaded from:
http://archive.ics.uci.edu/ml/datasets/Adult.
The dependent variable in our analysis will be income level and who earns above $50,000 a year using SQL queries, Proportion Analysis using bar charts and Simple Decision Tree to understand the important variables and their influence on prediction.
R Packages Used
options(gsubfn.engine = "R")
library(sqldf) #For SQL Functions
library(reshape2) #For plots
library(knitr) # For tables
library(Hmisc)# Description of the data
library(rpart)
library(caret)
library(rpart.plot)
Importing the dataset
data <- read.csv("adult.csv", header=FALSE, sep= ",", strip.white=TRUE,col.names= c("age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation","relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"), na.strings= "?", stringsAsFactors = TRUE)
kable(head(data))
age | workclass | fnlwgt | education | educationnum | maritalstatus | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | nativecountry | incomelevel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
Description of the variables in the dataset
The variables and their classifications are as given below:
variabletable<- read.csv("Workbook1.csv")
kable(variabletable)
Variable.Name | Description | Type | Possible.Values |
---|---|---|---|
Age | Age of the individual | Continuous | Numeric |
Workclass | Class of Work | Categorical | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
fnlwgt | Final Weight Determined by Census Org | Continuous | Numeric |
Education | Education of the individual | Categorical | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool |
Education-num | Number of years of education | Continuous | Numeric |
Marital-status | Marital status of the individual | Categorical | Married-civilian-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-ArmedForces-spouse |
Occupation | Occupation of the individual | Categorical | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. |
Relationship | Present relationship | Categorical | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
Race | Race of the individual | Categorical | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black |
Sex | Sex of the individual | Categorical | Female, Male |
Capital-gain | Capital gain made by the individual | Continuous | Numeric |
Capital-loss | Capital loss made by the individual | Continuous | Numeric |
Hours-per-week | Average number of hours spent by the individual on work | Continuous | Numeric |
Native-country | Native country of origin | Categorical | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. |
Income Level | Income Level | Categorical | <= 50 k or >50k |
Basic commands to know about the Data
Summary of the data
kable(summary(data))
age | workclass | fnlwgt | education | educationnum | maritalstatus | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | nativecountry | incomelevel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. :17.00 | Private :22696 | Min. : 12285 | HS-grad :10501 | Min. : 1.00 | Divorced : 4443 | Prof-specialty : 4140 | Husband :13193 | Amer-Indian-Eskimo: 311 | Female:10771 | Min. : 0 | Min. : 0.0 | Min. : 1.00 | United-States:29170 | <=50K:24720 | |
1st Qu.:28.00 | Self-emp-not-inc: 2541 | 1st Qu.: 117827 | Some-college: 7291 | 1st Qu.: 9.00 | Married-AF-spouse : 23 | Craft-repair : 4099 | Not-in-family : 8305 | Asian-Pac-Islander: 1039 | Male :21790 | 1st Qu.: 0 | 1st Qu.: 0.0 | 1st Qu.:40.00 | Mexico : 643 | >50K : 7841 | |
Median :37.00 | Local-gov : 2093 | Median : 178356 | Bachelors : 5355 | Median :10.00 | Married-civ-spouse :14976 | Exec-managerial: 4066 | Other-relative: 981 | Black : 3124 | NA | Median : 0 | Median : 0.0 | Median :40.00 | Philippines : 198 | NA | |
Mean :38.58 | State-gov : 1298 | Mean : 189778 | Masters : 1723 | Mean :10.08 | Married-spouse-absent: 418 | Adm-clerical : 3770 | Own-child : 5068 | Other : 271 | NA | Mean : 1078 | Mean : 87.3 | Mean :40.44 | Germany : 137 | NA | |
3rd Qu.:48.00 | Self-emp-inc : 1116 | 3rd Qu.: 237051 | Assoc-voc : 1382 | 3rd Qu.:12.00 | Never-married :10683 | Sales : 3650 | Unmarried : 3446 | White :27816 | NA | 3rd Qu.: 0 | 3rd Qu.: 0.0 | 3rd Qu.:45.00 | Canada : 121 | NA | |
Max. :90.00 | (Other) : 981 | Max. :1484705 | 11th : 1175 | Max. :16.00 | Separated : 1025 | (Other) :10993 | Wife : 1568 | NA | NA | Max. :99999 | Max. :4356.0 | Max. :99.00 | (Other) : 1709 | NA | |
NA | NA’s : 1836 | NA | (Other) : 5134 | NA | Widowed : 993 | NA’s : 1843 | NA | NA | NA | NA | NA | NA | NA’s : 583 | NA |
Description of the dataset
describe(data)
data
15 Variables 32561 Observations
---------------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05 .10
32561 0 73 1 38.58 15.4 19 22
.25 .50 .75 .90 .95
28 37 48 58 63
lowest : 17 18 19 20 21, highest: 85 86 87 88 90
---------------------------------------------------------------------------
workclass
n missing distinct
30725 1836 8
lowest : Federal-gov Local-gov Never-worked Private Self-emp-inc
highest: Private Self-emp-inc Self-emp-not-inc State-gov Without-pay
Value Federal-gov Local-gov Never-worked
Frequency 960 2093 7
Proportion 0.031 0.068 0.000
Value Private Self-emp-inc Self-emp-not-inc
Frequency 22696 1116 2541
Proportion 0.739 0.036 0.083
Value State-gov Without-pay
Frequency 1298 14
Proportion 0.042 0.000
---------------------------------------------------------------------------
fnlwgt
n missing distinct Info Mean Gmd .05 .10
32561 0 21648 1 189778 112348 39460 65716
.25 .50 .75 .90 .95
117827 178356 237051 329054 379682
lowest : 12285 13769 14878 18827 19214
highest: 1226583 1268339 1366120 1455435 1484705
---------------------------------------------------------------------------
education
n missing distinct
32561 0 16
lowest : 10th 11th 12th 1st-4th 5th-6th
highest: HS-grad Masters Preschool Prof-school Some-college
Value 10th 11th 12th 1st-4th
Frequency 933 1175 433 168
Proportion 0.029 0.036 0.013 0.005
Value 5th-6th 7th-8th 9th Assoc-acdm
Frequency 333 646 514 1067
Proportion 0.010 0.020 0.016 0.033
Value Assoc-voc Bachelors Doctorate HS-grad
Frequency 1382 5355 413 10501
Proportion 0.042 0.164 0.013 0.323
Value Masters Preschool Prof-school Some-college
Frequency 1723 51 576 7291
Proportion 0.053 0.002 0.018 0.224
---------------------------------------------------------------------------
educationnum
n missing distinct Info Mean Gmd .05 .10
32561 0 16 0.95 10.08 2.75 5 7
.25 .50 .75 .90 .95
9 10 12 13 14
lowest : 1 2 3 4 5, highest: 12 13 14 15 16
Value 1 2 3 4 5 6 7 8 9 10
Frequency 51 168 333 646 514 933 1175 433 10501 7291
Proportion 0.002 0.005 0.010 0.020 0.016 0.029 0.036 0.013 0.323 0.224
Value 11 12 13 14 15 16
Frequency 1382 1067 5355 1723 576 413
Proportion 0.042 0.033 0.164 0.053 0.018 0.013
---------------------------------------------------------------------------
maritalstatus
n missing distinct
32561 0 7
lowest : Divorced Married-AF-spouse Married-civ-spouse Married-spouse-absent Never-married
highest: Married-civ-spouse Married-spouse-absent Never-married Separated Widowed
Value Divorced Married-AF-spouse
Frequency 4443 23
Proportion 0.136 0.001
Value Married-civ-spouse Married-spouse-absent
Frequency 14976 418
Proportion 0.460 0.013
Value Never-married Separated
Frequency 10683 1025
Proportion 0.328 0.031
Value Widowed
Frequency 993
Proportion 0.030
---------------------------------------------------------------------------
occupation
n missing distinct
30718 1843 14
lowest : Adm-clerical Armed-Forces Craft-repair Exec-managerial Farming-fishing
highest: Prof-specialty Protective-serv Sales Tech-support Transport-moving
Value Adm-clerical Armed-Forces Craft-repair
Frequency 3770 9 4099
Proportion 0.123 0.000 0.133
Value Exec-managerial Farming-fishing Handlers-cleaners
Frequency 4066 994 1370
Proportion 0.132 0.032 0.045
Value Machine-op-inspct Other-service Priv-house-serv
Frequency 2002 3295 149
Proportion 0.065 0.107 0.005
Value Prof-specialty Protective-serv Sales
Frequency 4140 649 3650
Proportion 0.135 0.021 0.119
Value Tech-support Transport-moving
Frequency 928 1597
Proportion 0.030 0.052
---------------------------------------------------------------------------
relationship
n missing distinct
32561 0 6
lowest : Husband Not-in-family Other-relative Own-child Unmarried
highest: Not-in-family Other-relative Own-child Unmarried Wife
Value Husband Not-in-family Other-relative Own-child
Frequency 13193 8305 981 5068
Proportion 0.405 0.255 0.030 0.156
Value Unmarried Wife
Frequency 3446 1568
Proportion 0.106 0.048
---------------------------------------------------------------------------
race
n missing distinct
32561 0 5
lowest : Amer-Indian-Eskimo Asian-Pac-Islander Black Other White
highest: Amer-Indian-Eskimo Asian-Pac-Islander Black Other White
Amer-Indian-Eskimo (311, 0.010), Asian-Pac-Islander (1039, 0.032), Black
(3124, 0.096), Other (271, 0.008), White (27816, 0.854)
---------------------------------------------------------------------------
sex
n missing distinct
32561 0 2
Female (10771, 0.331), Male (21790, 0.669)
---------------------------------------------------------------------------
capitalgain
n missing distinct Info Mean Gmd .05 .10
32561 0 119 0.23 1078 2082 0 0
.25 .50 .75 .90 .95
0 0 0 0 5013
lowest : 0 114 401 594 914
highest: 25236 27828 34095 41310 99999
---------------------------------------------------------------------------
capitalloss
n missing distinct Info Mean Gmd .05 .10
32561 0 92 0.134 87.3 167.3 0 0
.25 .50 .75 .90 .95
0 0 0 0 0
lowest : 0 155 213 323 419, highest: 3004 3683 3770 3900 4356
---------------------------------------------------------------------------
hoursperweek
n missing distinct Info Mean Gmd .05 .10
32561 0 94 0.897 40.44 12.28 18 24
.25 .50 .75 .90 .95
40 40 45 55 60
lowest : 1 2 3 4 5, highest: 95 96 97 98 99
---------------------------------------------------------------------------
nativecountry
n missing distinct
31978 583 41
lowest : Cambodia Canada China Columbia Cuba
highest: Thailand Trinadad&Tobago United-States Vietnam Yugoslavia
---------------------------------------------------------------------------
incomelevel
n missing distinct
32561 0 2
<=50K (24720, 0.759), >50K (7841, 0.241)
---------------------------------------------------------------------------
Normalization of the data
The dataset is already normalized as there are no repetitive data which can be used to split into multiple tables and there isn’t data redundancy or data integrity improvements available for this particular dataset.
Problems in the Data
Recoded Missing Values
We recoded the data with NA for the string “?” (Which denotes misisng value in the dataset) when we imported the file to make the treatment of missing values easier.
Missing Data
2399 datapoints are found to be missing which is 7.3% of the dataset.
table(complete.cases(data))
FALSE TRUE
2399 30162
We find that about 2066 of the missing observations centres around <=50 k category and since most of the data has <=50 k category as the income level, removing these missing variables will not affect the data integrity.
summary(data[!complete.cases(data),])
age workclass fnlwgt
Min. :17.00 Private : 410 Min. : 12285
1st Qu.:22.00 Self-emp-inc : 42 1st Qu.:121804
Median :36.00 Self-emp-not-inc: 42 Median :177906
Mean :40.39 Local-gov : 26 Mean :189584
3rd Qu.:58.00 State-gov : 19 3rd Qu.:232668
Max. :90.00 (Other) : 24 Max. :981628
NA's :1836
education educationnum maritalstatus
HS-grad :661 Min. : 1.00 Divorced :229
Some-college:613 1st Qu.: 9.00 Married-AF-spouse : 2
Bachelors :311 Median :10.00 Married-civ-spouse :911
11th :127 Mean : 9.57 Married-spouse-absent: 48
10th :113 3rd Qu.:11.00 Never-married :957
Masters : 96 Max. :16.00 Separated : 86
(Other) :478 Widowed :166
occupation relationship race
Prof-specialty : 102 Husband :730 Amer-Indian-Eskimo: 25
Other-service : 83 Not-in-family :579 Asian-Pac-Islander: 144
Exec-managerial: 74 Other-relative: 92 Black : 307
Craft-repair : 69 Own-child :602 Other : 40
Sales : 66 Unmarried :234 White :1883
(Other) : 162 Wife :162
NA's :1843
sex capitalgain capitalloss hoursperweek
Female: 989 Min. : 0.0 Min. : 0.00 Min. : 1.00
Male :1410 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:25.00
Median : 0.0 Median : 0.00 Median :40.00
Mean : 897.1 Mean : 73.87 Mean :34.23
3rd Qu.: 0.0 3rd Qu.: 0.00 3rd Qu.:40.00
Max. :99999.0 Max. :4356.00 Max. :99.00
nativecountry incomelevel
United-States:1666 <=50K:2066
Mexico : 33 >50K : 333
Canada : 14
Philippines : 10
Germany : 9
(Other) : 84
NA's : 583
Removing the missing data
data <- na.omit(data)
Unncessary Variables
The fnlwgt which is the final weight determined by the Census Organization is of no use in any of the analysis that we are doing henceforth and is removed. The educationnum if a repetitive variable which recodes the categorical variable education as a numeric variable but will be used in the analysis for decision trees, hence is not being removed.
Finaldata <- sqldf("SELECT age
,workclass
,education
,educationnum
,maritalstatus
,occupation
,relationship
,race
,sex
,capitalgain
,capitalloss
,hoursperweek
,nativecountry
,incomelevel
FROM data
")
kable(head(Finaldata))
age | workclass | education | educationnum | maritalstatus | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | nativecountry | incomelevel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | State-gov | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
53 | Private | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
28 | Private | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
37 | Private | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
Recoding Income Levels
The values for the income levels are recoded with 1 and 0 with >50 k being coded as 1. The sum of the values in this column will hence give the number of people with income >50 k and thus can be an effective prediction tool.
Finaldata$income<-ifelse(Finaldata$income=='>50K',1,0)
kable(head(Finaldata))
age | workclass | education | educationnum | maritalstatus | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | nativecountry | incomelevel | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | State-gov | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 |
50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 |
38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
53 | Private | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
28 | Private | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 0 |
37 | Private | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K | 0 |
SQL Queries and Exploratory Analysis using R
Statistics of other variables based on workclass and SQL Queries
Sample statistics are generated classifying the data by Work Class and calculating the numerical values categorized by the Work Class using GROUPBY, SUM and AVERAGE functions in SQLDF package in R.
WorkclassClassification <- sqldf("SELECT
workclass,
count(*) as Numberofrecords,
Avg(age) as averageage,
Sum(capitalgain) as capitalgain,
Sum(capitalloss) as capitalloss,
Avg(hoursperweek),
sum(income) as above50k
FROM Finaldata
GROUP by workclass")
kable(WorkclassClassification)
workclass | Numberofrecords | averageage | capitalgain | capitalloss | Avg(hoursperweek) | above50k |
---|---|---|---|---|---|---|
Federal-gov | 943 | 42.57794 | 784879 | 105363 | 41.27996 | 365 |
Local-gov | 2067 | 41.74311 | 1714019 | 228038 | 41.00242 | 609 |
Private | 22286 | 36.79436 | 19608520 | 1774623 | 40.25088 | 4876 |
Self-emp-inc | 1074 | 46.02793 | 5166742 | 164864 | 48.80261 | 600 |
Self-emp-not-inc | 2499 | 45.01160 | 4780923 | 288815 | 44.43217 | 714 |
State-gov | 1279 | 39.36200 | 875228 | 103788 | 39.05395 | 344 |
Without-pay | 14 | 47.78571 | 6830 | 0 | 32.71429 | 0 |
Statistics based on race
Sample statistics are also generated by classifying the data by Race and calculating the numerical values categorized by the Work Class using GROUPBY, SUM and AVERAGE functions in SQLDF package in R.
RacesClassification <- sqldf("SELECT
race,
count(*) as Numberofrecords,
Avg(age) as averageage,
Sum(capitalgain) as capitalgain,
Sum(capitalloss) as capitalloss,
Avg(hoursperweek),
sum(income) as above50k
FROM Finaldata
GROUP by race")
kable(RacesClassification)
race | Numberofrecords | averageage | capitalgain | capitalloss | Avg(hoursperweek) | above50k |
---|---|---|---|---|---|---|
Amer-Indian-Eskimo | 286 | 36.73776 | 188843 | 10629 | 40.42657 | 34 |
Asian-Pac-Islander | 895 | 37.91061 | 1167416 | 82057 | 40.50056 | 248 |
Black | 2817 | 37.86084 | 1703502 | 158506 | 38.67448 | 366 |
Other | 231 | 33.60173 | 245011 | 11618 | 40.11255 | 21 |
White | 25933 | 38.58061 | 29632369 | 2402681 | 41.20410 | 6839 |
SQL Queries can also be used to find some basic prediction about the income level and the various categorical variables which contribute towards it.
Marital Status and implication on the income levels
We find that Married with Civilian Spouse to have higher amount of people with above 50k income. We then find the proportions of the people with above 50 k income with respect to the total number of people in each category and find that Married with Armed Forces have a better proportion of people with income above 50k.
MaritalLevel<- sqldf("SELECT
maritalstatus as status
, Count (*) as Count
, sum(income) as Above50k
FROM
Finaldata
GROUP BY
status
ORDER BY status")
kable(MaritalLevel)
status | Count | Above50k |
---|---|---|
Divorced | 4214 | 452 |
Married-AF-spouse | 21 | 10 |
Married-civ-spouse | 14065 | 6399 |
Married-spouse-absent | 370 | 31 |
Never-married | 9726 | 470 |
Separated | 939 | 66 |
Widowed | 827 | 80 |
Maritalclass<-melt(MaritalLevel,id.vars = 'status')
ggplot(Maritalclass,aes(x=status,y=value,fill=variable))+
geom_bar(stat = 'identity')+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
ggtitle('Proportion of People with income above 50k')+
xlab("Gender Class")+
ylab("Number of People")
table<-data.frame(Class=MaritalLevel$status,
Proportion=
MaritalLevel$Above50k/MaritalLevel$Count)
kable(table)
Class | Proportion |
---|---|
Divorced | 0.1072615 |
Married-AF-spouse | 0.4761905 |
Married-civ-spouse | 0.4549591 |
Married-spouse-absent | 0.0837838 |
Never-married | 0.0483241 |
Separated | 0.0702875 |
Widowed | 0.0967352 |
Gender and implication on the income levels
We use a similar analysis as above across the Gender Variable and find that Males have a better proportion of people with salaries above 50k and more number of people who have salaries greater than 50k thus raising the spark for gender inequality.
GenderLevel<- sqldf("SELECT
sex as gender
, Count (*) as Count
, sum(income) as Above50k
FROM
Finaldata
GROUP BY
gender
ORDER BY gender")
Genderclass<-melt(GenderLevel,id.vars = 'gender')
ggplot(Genderclass,aes(x=gender,y=value,fill=variable))+
geom_bar(stat = 'identity')+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
ggtitle('Proportion of People with income above 50k')+
xlab("Gender Class")+
ylab("Number of People")
table<-data.frame(Class=GenderLevel$gender,
Proportion=
GenderLevel$Above50k/GenderLevel$Count)
kable(table)
Class | Proportion |
---|---|
Female | 0.1136782 |
Male | 0.3138371 |
Work Class and implication on the income levels
We find that the people employed in private companies have more people with income above 50k and Self Employed people having a higher proportion of peoplw with income greater than 50k.
WorkclassLevel<- sqldf("SELECT
workclass as workclass
, Count (*) as Count
, sum(income) as Above50k
FROM
Finaldata
GROUP BY
workclass
ORDER BY workclass")
Workclass<-melt(WorkclassLevel,id.vars = 'workclass')
ggplot(Workclass,aes(x=workclass,y=value,fill=variable))+
geom_bar(stat = 'identity')+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
ggtitle('Proportion of People with income above 50k')+
xlab("Work Class")+
ylab("Number of People")
table<-data.frame(Class=WorkclassLevel$workclass,
Proportion=
WorkclassLevel$Above50k/WorkclassLevel$Count)
kable(table)
Class | Proportion |
---|---|
Federal-gov | 0.3870626 |
Local-gov | 0.2946299 |
Private | 0.2187921 |
Self-emp-inc | 0.5586592 |
Self-emp-not-inc | 0.2857143 |
State-gov | 0.2689601 |
Without-pay | 0.0000000 |
Education Level and implication on the income levels
We find that the Bachelors’ Degree has higher number of people with income above 50 k and Prof School Graduates and Doctorates have higher proportion of people with income about 50k.
Education_Level<- sqldf("SELECT
education as 'Education'
, Count (*) as 'Count'
, sum(income) as 'Above50k'
FROM
Finaldata
GROUP BY
Education
ORDER BY Education")
EducationLevel<-melt(Education_Level,id.vars = 'Education')
ggplot(EducationLevel,aes(x=Education,y=value,fill=variable))+
geom_bar(stat = 'identity')+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle('Proportion of People with income above 50k')+
xlab("Education Class")+
ylab("Number of People")
table<-data.frame(Class=Education_Level$Education,
Proportion=
Education_Level$Above50k/Education_Level$Count)
kable(table)
Class | Proportion |
---|---|
10th | 0.0719512 |
11th | 0.0562977 |
12th | 0.0769231 |
1st-4th | 0.0397351 |
5th-6th | 0.0416667 |
7th-8th | 0.0628366 |
9th | 0.0549451 |
Assoc-acdm | 0.2539683 |
Assoc-voc | 0.2631982 |
Bachelors | 0.4214909 |
Doctorate | 0.7466667 |
HS-grad | 0.1643293 |
Masters | 0.5642286 |
Preschool | 0.0000000 |
Prof-school | 0.7490775 |
Some-college | 0.2000599 |
Using R for Statistical Analysis
We use R to create Decision Trees to predict the income levels using all the variables in the dataset.
Methodology Used
Classification methodology was used for this particular data as the response variable income was a binary variable.
Advantages of using Decision Trees for this Analysis
Simplicity of Interpretation: The entire data is represented in a visual format which enables us to understand the data quickly and make a prediction.
Ability to use both numeric and categorical variables as predictors: Our data is a mix of categorical and numeric variables and other commonly used methods such as linear or logistic regressions can use either of these categories but not both in a single algorithm.
Algorithm behind
The package used to construct the tree here is Rpart which uses recursive partitioning of the data with the default criterion being used for each split being Gini Coefficient.
Variables used
We create the tree using all the variables and rank the variables in terms of importance to figure out the variables used by the decision tree algorithm to predict the income.
tree <- rpart(incomelevel ~ .,
data = data,
method = "class")
imp <- varImp(tree)
rownames(imp)[order(imp$Overall, decreasing=TRUE)]
[1] "capitalgain" "education" "educationnum" "relationship"
[5] "maritalstatus" "occupation" "capitalloss" "age"
[9] "hoursperweek" "workclass" "fnlwgt" "race"
[13] "sex" "nativecountry"
We find that Capital Gain is one of the key predictors followed by education for the income levels.
The tree logic is as below where only “capital gain, education and relationship” are used as variables by the tree as Education num and education both have the same data.
printcp(tree)
Classification tree:
rpart(formula = incomelevel ~ ., data = data, method = "class")
Variables actually used in tree construction:
[1] capitalgain education relationship
Root node error: 7508/30162 = 0.24892
n= 30162
CP nsplit rel error xerror xstd
1 0.129995 0 1.00000 1.00000 0.0100018
2 0.064198 2 0.74001 0.74001 0.0089670
3 0.037294 3 0.67581 0.67581 0.0086527
4 0.010000 4 0.63852 0.63852 0.0084574
plotcp(tree)
rpart.plot(tree, main = "Basic Classification Tree for Census Data")
Inferences
About 46% of the people are in a relationship called “Husband” or “Wife” which is then further classified based on Education Level where nearly 14% who earn above $50 K have the education of Bachelors, Prof-school, Masters and Doctorate.
The other education levels have income predominantly below $50 k with just 2% having salaries above $50k who also have capital gains greater than $5096
With respect to other relationships, only 1% have income above $50 k and with capital gains greater than $7074.
Tableau for Visualization
We use 4 sheets to visualize different relationships and use Work Class as the common parameter which will filter the entire dashboard.
The first sheet shows the relationship of Education and Number of People Earning > 50 k and separated by Work Class. We find that Bachelors graduates working in Private companies have a higher number of people earning above 50 k.
Impact of Education on Income Levels
The second sheet shows the relationship of Average hours per week with respect to gender and separated by Work Class and we find that Males typically work more hours per week on Average across all work classes.
Impact of Gender on Income Levels
The third sheet shows that the relationship of marital status and income levels separated by Work Class. Majority of the people in Married with Civilian spouse have an income greater than 50 k and majorly in the private sector.
Impact of Marital Status on Income Levels
The fourth work sheet shows the impact of occupation, capital gain and capital loss on the income levels which has details of work class too. This is a comprehensive visualization across 4 different parameters. We find that Executives at Managerial Level have more people with income greater than 50 k and Professional Speciality has more capital gains.
Impact of Occupation on Income Levels and Capital Gains/Loss
Final Dashboard:
Final Dashboard
Filtered Dashboard for Private Sector:
Filtered Dashboard
Challenges Faced
- The datafile was initially scraped from a web repository called UCI Machine Learning- part of UC Irvine, which crashed in the last two days left for submission and hence R was unable to establish a connection to the data. So, I figured out the FTP page of the website and downloaded the data and uploaded from the hard disk directly for the project. This is not the ideal way but R markdown supports this by establishing a static html page which will provide all the outputs from the commands at the time of publishing the report.
- Problems with decision trees and R packages- The packages used for prettier decision tree modeling was not working in the R studio and has to be fixed with multiple supporting libraries before proceeding.
- Learning SQL DF- R and SQL both are relatively new for me and it took me a while to understand the workings of SQL DF commands.