We will use the US census data to build a model that predicts whether an individual's income in the US is greater than or less than USD 50,000, based on the information available about that individual in the census data; we will also look at the income split at the world level. The data comes from the UCI site archive.
Libraries used for this project (loaded in the sketch after this list):
data.table
dplyr
ggplot2
plotly
gtable
gridExtra
caret
rworldmap
countrycode
kableExtra
gbm
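A minimal sketch that loads all of the above, assuming each package is already installed (any missing one can be added with install.packages()):
pkgs <- c("data.table", "dplyr", "ggplot2", "plotly", "gtable", "gridExtra",
          "caret", "rworldmap", "countrycode", "kableExtra", "gbm")
# Load every package; invisible() suppresses lapply's printed return value
invisible(lapply(pkgs, library, character.only = TRUE))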
Download census.tar.gz from the [UCI site archive](http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/) and have a look at the data and the variable descriptions provided on the site. That will give you an idea of what the variables mean and which ones we may not require and can therefore exclude. This is a crucial task: insight into the data variables lets us identify the response variable and decide which variables to keep for exploratory analysis and, later, for the prediction/classification model.
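A sketch of fetching and unpacking the archive (the file names inside the archive may differ from the renamed .csv files used below, so adjust as needed):
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz"
download.file(url, destfile = "census.tar.gz", mode = "wb")
untar("census.tar.gz")  # extracts the raw train and test files into the working directory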
We read the files using the fread() function from the data.table library, which gives us a faster and more convenient way to read the train and test data sets from file. We then explore the data to see what we need and what can be discarded.
trainFileName = "census-income-train.csv"
testFileName = "census-income-test.csv"
colnames <- c("age","class_of_worker","industry_code","occupation_code","education","wage_per_hour","enrolled_in_edu_inst_lastwk", "marital_status","major_industry_code","major_occupation_code","race","hispanic_origin","sex","member_of_labor_union","reason_for_unemployment","full_parttime_employment_stat","capital_gains","capital_losses","dividend_from_Stocks","tax_filer_status","region_of_previous_residence","state_of_previous_residence","d_household_family_stat","d_household_summary","instance_weight","migration_msa","migration_reg","migration_within_reg","live_1_year_ago","migration_sunbelt","num_person_Worked_employer","family_members_under_18","country_father","country_mother","country_self","citizenship","business_or_self_employed","fill_questionnaire_veteran_admin","veterans_benefits","weeks_worked_in_year","year","income_level")
train <- fread(trainFileName, na.strings = c("", " ", "?", "NA", NA), col.names = colnames)
test <- fread(testFileName, na.strings = c("", " ", "?", "NA", NA), col.names = colnames)
table (complete.cases (train))
##
## FALSE TRUE
## 105061 94462
table (complete.cases (test))
##
## FALSE TRUE
## 52680 47082
Using the summary function on the train and test data sets, we can see that a few columns contain a large number of NA values, which will not help the analysis, so it is better to remove them ("migration_msa", "migration_reg", "migration_within_reg" and "migration_sunbelt").
Similarly, we remove every row that still has an NA value, which leaves the data sets free of missing values.
We then subset again to construct our final data set of 13 variables, check the unique values of the response variable (income_level), and recode "- 50000." to "<=50K" and "50000+." to ">50K", which makes the analysis easier (a sketch of these steps follows). Let's take a look at the severity of the imbalanced classes in our data.
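A minimal sketch of the cleaning and recoding described above (the 13 kept columns follow the summaries below; the names finalTrain/finalTest are assumptions):
dropCols <- c("migration_msa", "migration_reg", "migration_within_reg", "migration_sunbelt")
keepCols <- c("age", "class_of_worker", "occupation_code", "education", "wage_per_hour",
              "marital_status", "race", "sex", "capital_gains", "capital_losses",
              "country_self", "weeks_worked_in_year", "income_level")
# Drop the sparse migration columns first, then remove rows with any remaining NA
finalTrain <- na.omit(train[, setdiff(names(train), dropCols), with = FALSE])[, keepCols, with = FALSE]
finalTest <- na.omit(test[, setdiff(names(test), dropCols), with = FALSE])[, keepCols, with = FALSE]
# Class balance of the raw labels, in percent
round(100 * prop.table(table(finalTrain$income_level)))
# Recode the raw labels into the two income classes
finalTrain$income_level <- ifelse(finalTrain$income_level == "- 50000.", "<=50K", ">50K")
finalTest$income_level <- ifelse(finalTest$income_level == "- 50000.", "<=50K", ">50K")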
## age class_of_worker occupation_code education
## Min. : 0.00 Length:189729 Min. : 0.0 Length:189729
## 1st Qu.:15.00 Class :character 1st Qu.: 0.0 Class :character
## Median :33.00 Mode :character Median : 0.0 Mode :character
## Mean :34.09 Mean :11.4
## 3rd Qu.:49.00 3rd Qu.:26.0
## Max. :90.00 Max. :46.0
## wage_per_hour marital_status race
## Min. : 0.00 Length:189729 Length:189729
## 1st Qu.: 0.00 Class :character Class :character
## Median : 0.00 Mode :character Mode :character
## Mean : 56.13
## 3rd Qu.: 0.00
## Max. :9999.00
## sex capital_gains capital_losses country_self
## Length:189729 Min. : 0.0 Min. : 0.00 Length:189729
## Class :character 1st Qu.: 0.0 1st Qu.: 0.00 Class :character
## Mode :character Median : 0.0 Median : 0.00 Mode :character
## Mean : 422.7 Mean : 36.93
## 3rd Qu.: 0.0 3rd Qu.: 0.00
## Max. :99999.0 Max. :4608.00
## weeks_worked_in_year income_level
## Min. : 0.00 Length:189729
## 1st Qu.: 0.00 Class :character
## Median : 8.00 Mode :character
## Mean :23.29
## 3rd Qu.:52.00
## Max. :52.00
## [1] "<=50K" ">50K"
##
## - 50000. 50000+.
## 94 6
class_of_worker | education | marital_status | race | sex | country_self | income_level |
---|---|---|---|---|---|---|
Not in universe | High school graduate | Widowed | White | Female | United-States | <=50K |
Self-employed-not incorporated | Some college but no degree | Divorced | White | Male | United-States | <=50K |
Not in universe | 10th grade | Never married | Asian or Pacific Islander | Female | Vietnam | <=50K |
Not in universe | Children | Never married | White | Female | United-States | <=50K |
Not in universe | Children | Never married | White | Female | United-States | <=50K |
Private | Some college but no degree | Married-civilian spouse present | Amer Indian Aleut or Eskimo | Female | United-States | <=50K |
age | occupation_code | wage_per_hour | capital_gains | capital_losses | weeks_worked_in_year |
---|---|---|---|---|---|
73 | 0 | 0 | 0 | 0 | 0 |
58 | 34 | 0 | 0 | 0 | 52 |
18 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 0 | 0 | 0 |
48 | 10 | 1200 | 0 | 0 | 52 |
income_level is our response (dependent) variable and the rest are our independent variables. The next step is exploratory analysis of the numeric and categorical variables.
Each variable is explored for distribution, variance, and predictive power. We will first explore the numeric variables and then analyse the categorical ones.
Looking at the summary of the age variable, one can clearly see that age has a wide range and high variability, and its mean and distribution differ markedly between the two income levels, so it should be a good predictor.
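A sketch of that comparison (assuming the cleaned finalTrain data set from above):
# Age density by income level; visibly different shapes suggest age separates the classes
ggplot(finalTrain, aes(x = age, fill = income_level)) + geom_density(alpha = 0.4)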
The occupation_code variable also shows good variability, so it can be a good predictor and we retain it.
Capital gains and capital losses do not show much variance within either income level in the plots below; however, their means differ between the income levels, so these variables can still be used for prediction.
Furthermore, in classification problems we should also plot the numeric variables against the dependent variable. This helps us spot clusters (if any) of the "<=50K" and ">50K" classes. Here we plot wage per hour against age, colored by income level.
We can see clearly from the graph below that for both income levels (<=50K and >50K) most observations fall in the 25-65 age group. The average wage per hour for the <=50K group is below 2000, and mostly below 1000, whereas for the >50K group it lies mostly between 1000 and 3000. The clusters of age against wage per hour are therefore clearly visible.
Next we plot weeks_worked_in_year against age for the income level dependent variable, and can again see clusters forming: in the >50K group the majority are between 25 and 65 years old and work 30+ weeks a year, whereas in the <=50K group most people work fewer than 30 weeks and tend to fall at either end of the age spectrum, i.e. under 25 or over 65.
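A sketch of the two scatter plots (again assuming finalTrain):
# Wage per hour vs age and weeks worked vs age, split by income level
ggplot(finalTrain, aes(x = age, y = wage_per_hour, color = income_level)) + geom_point(alpha = 0.3)
ggplot(finalTrain, aes(x = age, y = weeks_worked_in_year, color = income_level)) + geom_point(alpha = 0.3)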
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.00 33.00 34.09 49.00 90.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 11.4 26.0 46.0
## freqRatio percentUnique zeroVar nzv
## capital_gains 248.7374 0.06904585 FALSE TRUE
## capital_losses 481.9896 0.05850450 FALSE TRUE
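Metrics like the block above come from caret's near-zero-variance check; a sketch (numFinalTrain, the numeric training subset, is the name used in the correlation code further down):
# freqRatio, percentUnique, zeroVar and nzv for the two capital variables
nearZeroVar(numFinalTrain[, c("capital_gains", "capital_losses")], saveMetrics = TRUE)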
## capital_gains capital_losses
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 0.0 Median : 0.00
## Mean : 140.9 Mean : 26.76
## 3rd Qu.: 0.0 3rd Qu.: 0.00
## Max. :99999.0 Max. :4608.00
## capital_gains capital_losses
## Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 0
## Mean : 4746 Mean : 193
## 3rd Qu.: 0 3rd Qu.: 0
## Max. :99999 Max. :3683
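The two summary blocks above compare the capital variables across the income classes; a sketch of how such a split can be produced (assuming income_level is available alongside the numeric columns):
# Capital gains/losses summarized separately for each income class
summary(finalTrain[income_level == "<=50K", c("capital_gains", "capital_losses")])
summary(finalTrain[income_level == ">50K", c("capital_gains", "capital_losses")])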
## Exploring correlation between all numerical variables
The correlation matrix below shows that the numerical (continuous) variables are mostly only weakly correlated and largely independent of each other; the notable exception is occupation_code vs weeks_worked_in_year at roughly 0.66, which is worth keeping in mind.
correlationNumerical = cor (numFinalTrain[, c("age", "occupation_code","wage_per_hour", "capital_gains", "capital_losses", "weeks_worked_in_year")])
diag(correlationNumerical) = 0 # this removes each variable's correlation with itself
correlationNumerical
## age occupation_code wage_per_hour
## age 0.00000000 0.129479084 0.040535605
## occupation_code 0.12947908 0.000000000 0.195854599
## wage_per_hour 0.04053560 0.195854599 0.000000000
## capital_gains 0.05416728 0.003395563 -0.001974882
## capital_losses 0.06367913 0.045558024 0.012366906
## weeks_worked_in_year 0.22097398 0.655996548 0.197455489
## capital_gains capital_losses weeks_worked_in_year
## age 0.054167279 0.06367913 0.22097398
## occupation_code 0.003395563 0.04555802 0.65599655
## wage_per_hour -0.001974882 0.01236691 0.19745549
## capital_gains 0.000000000 -0.01255867 0.08280735
## capital_losses -0.012558671 0.00000000 0.10175318
## weeks_worked_in_year 0.082807351 0.10175318 0.00000000
Looking at the graphs that follow, we can safely assume that education, class of worker, marital status, and country_self are good predictor variables. By contrast, we do not see much variability across income levels for sex and race, so we will not use those variables in our prediction model.
# Education
qplot (income_level, data = catFinalTrain, fill = education) + facet_grid (. ~ education) + theme(axis.text.x=element_text(angle = 80,hjust = 1,size=10))
ggplot(catFinalTrain,aes(x=education,fill=income_level))+geom_bar(position = "dodge", color="black")+scale_fill_brewer(palette = "YlGnBu")+theme(axis.text.x =element_text(angle = 60,hjust = 1,size=10))
### The exploratory analysis of education vs income level (<=50K and >50K) clearly shows that people in the >50K bracket are more educated: at a minimum they have completed schooling, most have some college, and many hold professional degrees, whereas people in the <=50K group tend to have less education.
# Class of Workers
ggplot(catFinalTrain,aes(x=class_of_worker,fill=income_level))+geom_bar(position = "dodge", color="black")+scale_fill_brewer(palette = "YlOrBr")+theme(axis.text.x =element_text(angle = 60,hjust = 1,size=10))
# As we can clearly see, the majority of <=50K records fall under "Not in universe", which dominates the plot (in census data this category means the question did not apply to the respondent, rather than a refusal to answer). We can remove these values and redraw the plot to see what the remaining classes tell us.
classWorkerDF <- catFinalTrain[catFinalTrain$class_of_worker != "Not in universe", c("class_of_worker", "income_level")]
summary(classWorkerDF)
## class_of_worker income_level
## Private :68868 <=50K:84049
## Self-employed-not incorporated: 8005 >50K :10792
## Local government : 7516
## State government : 4070
## Self-employed-incorporated : 3018
## Federal government : 2796
## (Other) : 568
qplot(income_level, data = classWorkerDF, fill = class_of_worker) + facet_grid(. ~ class_of_worker)
### Now we can infer that most people work in the private sector, where those earning <=50K far outnumber those earning >50K, and the same pattern holds across the government categories.
# Marital Status
qplot (income_level, data = catFinalTrain, fill = marital_status) + facet_grid (. ~ marital_status)
# As not much can be inferred from the graph above alone, we use a hypothesis test for inference
H0 : There is no significant impact of the variable MARITAL_STATUS on the INCOME_LEVEL variable.
Ha : There exists a significant impact of the variable MARITAL_STATUS on the INCOME_LEVEL variable.
# Create a proportion table, then use a chi-squared test to calculate the p-value and decide whether to reject the null hypothesis
myTable <- prop.table(table(catFinalTrain$marital_status,catFinalTrain$income_level),1)
kable(myTable) %>%
kable_styling(bootstrap_options = c("striped","condensed","responsive"),full_width = F,position = "left",font_size = 12) %>%
row_spec(0, background ="gray")
marital_status | <=50K | >50K |
---|---|---|
Divorced | 0.9167490 | 0.0832510 |
Married-A F spouse present | 0.9793978 | 0.0206022 |
Married-civilian spouse present | 0.8871358 | 0.1128642 |
Married-spouse absent | 0.9378655 | 0.0621345 |
Never married | 0.9874365 | 0.0125635 |
Separated | 0.9542978 | 0.0457022 |
Widowed | 0.9679992 | 0.0320008 |
chisq.test(myTable)
## Warning in chisq.test(myTable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: myTable
## X-squared = 0.15537, df = 6, p-value = 0.9999
Note that chisq.test() was applied here to the proportion table rather than to the raw counts, which is what triggers the warning and makes the reported p-value (0.9999) uninformative; taken at face value, a p-value greater than the significance level (0.05) would mean we fail to reject the null hypothesis, not that we reject it. Running the test on the raw contingency table instead (see the sketch below) gives a p-value far below 0.05, given the sample size and the spread in the proportions above, so we reject the null hypothesis and conclude that marital_status has a significant impact on income_level.
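A corrected sketch of the test on raw counts:
# Chi-squared test on the raw contingency table, not on the row proportions
rawTable <- table(catFinalTrain$marital_status, catFinalTrain$income_level)
chisq.test(rawTable)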
head(catFinalTrain)
## class_of_worker education
## 1: Not in universe High school graduate
## 2: Self-employed-not incorporated Some college but no degree
## 3: Not in universe 10th grade
## 4: Not in universe Children
## 5: Not in universe Children
## 6: Private Some college but no degree
## marital_status race sex
## 1: Widowed White Female
## 2: Divorced White Male
## 3: Never married Asian or Pacific Islander Female
## 4: Never married White Female
## 5: Never married White Female
## 6: Married-civilian spouse present Amer Indian Aleut or Eskimo Female
## country_self income_level
## 1: United-States <=50K
## 2: United-States <=50K
## 3: Vietnam <=50K
## 4: United-States <=50K
## 5: United-States <=50K
## 6: United-States <=50K
# Race
summary(catFinalTrain$race)
## Amer Indian Aleut or Eskimo Asian or Pacific Islander
## 2214 5174
## Black Other
## 19544 3338
## White
## 159459
qplot (income_level, data = catFinalTrain, fill = race) + facet_grid (. ~ race)
### We can safely infer that the majority of the user base is White, and hence the largest counts at both income levels come from the White race. The other sizeable race category is Black, and most of that group falls in the <=50K income level.
# Sex
ggplot(catFinalTrain,aes(x=sex,fill=income_level))+geom_bar(position = "dodge", color="black")+scale_fill_brewer(palette = "YlOrBr")+theme(axis.text.x =element_text(angle = 60,hjust = 1,size=10))
### From the graph we can safely assume that the income distribution is skewed against females in both income groups (<=50K and >50K).
# Country Self
country_incomeDF <- select(catFinalTrain,6:7) %>%
mutate(iso_code = countrycode(catFinalTrain$country_self, 'country.name', 'iso3c'))
## Warning in countrycode(catFinalTrain$country_self, "country.name", "iso3c"): Some values were not matched unambiguously: Columbia, England, Scotland, Yugoslavia
table(is.na(country_incomeDF))
##
## FALSE TRUE
## 568207 980
theCountries <- catFinalTrain$country_self
### It can be clearly inferred that the US accounts for the vast majority of the country_self values.
countryCodes <- countrycode(unique(theCountries), 'country.name', 'iso3c')
## Warning in countrycode(unique(theCountries), "country.name", "iso3c"): Some values were not matched unambiguously: Columbia, England, Scotland, Yugoslavia
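The unmatched values can be resolved explicitly via countrycode's custom_match argument; a sketch (the ISO assignments for England, Scotland, and Yugoslavia are judgment calls, not from the original analysis):
# Map the four ambiguous names by hand; "Columbia" is the data set's spelling of Colombia
fixes <- c(Columbia = "COL", England = "GBR", Scotland = "GBR", Yugoslavia = "SRB")
countryCodes <- countrycode(unique(theCountries), 'country.name', 'iso3c', custom_match = fixes)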
malDF <- data.frame(country_incomeDF$iso_code,
inc_level = country_incomeDF$income_level)
# malDF is a data.frame with the ISO3 country names plus a variable to
# merge to the map data
malMap <- joinCountryData2Map(malDF, joinCode = "ISO3", nameJoinColumn = "country_incomeDF.iso_code")
## 188749 codes from your data successfully matched countries in the map
## 980 codes from your data failed to match with a country code in the map
## 205 codes from the map weren't represented in your data
# This will join your malDF data.frame to the country map data
mapCountryData(malMap, nameColumnToPlot="inc_level", catMethod = "categorical", missingCountryCol = gray(.8) , colourPalette = c("red", "blue") , mapTitle = "Country Vs Income Level")
### Looking at the world map plotted above, we can clearly see that the US has a significant population on both sides of the income level split, while India and China have significant user bases at the <=50K income level.
We use a boosting algorithm for this classification model, since the census data contains a number of weak predictors that boosting can combine into a stronger learner.
set.seed (32323)
training_Ctrl = trainControl(method = "cv", number = 2)
boost_Fit_mdl = train(income_level ~ age + class_of_worker + occupation_code + education +
                        wage_per_hour + marital_status + capital_gains + capital_losses +
                        country_self + weeks_worked_in_year,
                      trControl = training_Ctrl, method = "gbm", data = finalTrain, verbose = FALSE)
confusionMatrix(finalTrain$income_level, predict (boost_Fit_mdl, finalTrain))
finalTest$predicted = predict(boost_Fit_mdl, finalTest)
confusionMatrix(finalTest$income_level, finalTest$predicted)
# Confusion Matrix
It is very important to understand how the fitted model performs relative to a baseline model (a baseline sketch follows). Comparing the baseline with the boosted prediction model, it is clear that the boosted model performs better.
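A minimal sketch of a majority-class baseline for that comparison (an assumption, since the original baseline model is not shown):
# Baseline accuracy: always predict the majority class ("<=50K")
baselineAccuracy <- max(prop.table(table(finalTest$income_level)))
baselineAccuracy # compare against the Accuracy reported by confusionMatrix above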