we will use the US census data to build a model to predict if the income of any individual in the US is greater than or less than USD 50000 based on the information available about that individual in the census data, and also at world level. UCI site archive
Libaries installed for this Project. data.table
dplyr
ggplot2
plotly
gtable
gridExtra
caret
rworldmap
countrycode
kableExtra
gbm
Download from site the census.tar.gz [UCI site archive](http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/ have a look at the data and also the description for various variables provided on the site .That will give you an idea of what the variables are and what variables we might not require and hence can exlcuded. This is one of the crucial task as having an insight on the data variables and based on that we can find out the response variable and what all variables we can keep for our exploratpory analysis and can be further used in our Model for prediction/classification.
We read the files using Fread function of data.table library which gives us a faster and more convenient way to read the data from files(train & test) data sets . Now we explore the data set to see what data we require and which data can be discarded
Dataset is read and stored as train data frame of 199523 rows and 42 columns.
trainFileName = "census-income-train.csv"
testFileName = "census-income-test.csv"
colnames <- c("age","class_of_worker","industry_code","occupation_code","education","wage_per_hour","enrolled_in_edu_inst_lastwk", "marital_status","major_industry_code","major_occupation_code","race","hispanic_origin","sex","member_of_labor_union","reason_for_unemployment","full_parttime_employment_stat","capital_gains","capital_losses","dividend_from_Stocks","tax_filer_status","region_of_previous_residence","state_of_previous_residence","d_household_family_stat","d_household_summary","instance_weight","migration_msa","migration_reg","migration_within_reg","live_1_year_ago","migration_sunbelt","num_person_Worked_employer","family_members_under_18","country_father","country_mother","country_self","citizenship","business_or_self_employed","fill_questionnaire_veteran_admin","veterans_benefits","weeks_worked_in_year","year","income_level")
train <- fread(trainFileName,na.strings = c(""," ","?","NA",NA) , col.names = colnames)
Dataset is read and stored as test data frame showing details of complete cases:return a logical vector indicating which cases are complete, i.e., have no missing values.
test <- fread(testFileName,na.strings = c(""," ","?","NA",NA) , col.names = colnames)
table (complete.cases (test))
##
## FALSE TRUE
## 52680 47082
Using the summary function of the data sets train & test we can see that few columns are having lot of NA ,which won’t help us in analysis hence it is better to remove them (“migration_msa”,“migration_reg”,“migration_within_reg” & “migration_sunbelt”).
And similary we remove rows for which there are any NA values which makes our data set get rid of any NA values.
train <- train[,c(-26,-27,-28,-30)]
myCleanTrain <- na.omit(train)
test <- test[,c(-26,-27,-28,-30)]
myCleanTest <- na.omit(test)
Subset and construct final dataset of 13 variables and showing summary of Training data set
## age class_of_worker occupation_code education
## Min. : 0.00 Length:189729 Min. : 0.0 Length:189729
## 1st Qu.:15.00 Class :character 1st Qu.: 0.0 Class :character
## Median :33.00 Mode :character Median : 0.0 Mode :character
## Mean :34.09 Mean :11.4
## 3rd Qu.:49.00 3rd Qu.:26.0
## Max. :90.00 Max. :46.0
## wage_per_hour marital_status race
## Min. : 0.00 Length:189729 Length:189729
## 1st Qu.: 0.00 Class :character Class :character
## Median : 0.00 Mode :character Mode :character
## Mean : 56.13
## 3rd Qu.: 0.00
## Max. :9999.00
## sex capital_gains capital_losses country_self
## Length:189729 Min. : 0.0 Min. : 0.00 Length:189729
## Class :character 1st Qu.: 0.0 1st Qu.: 0.00 Class :character
## Mode :character Median : 0.0 Median : 0.00 Mode :character
## Mean : 422.7 Mean : 36.93
## 3rd Qu.: 0.0 3rd Qu.: 0.00
## Max. :99999.0 Max. :4608.00
## weeks_worked_in_year income_level
## Min. : 0.00 Length:189729
## 1st Qu.: 0.00 Class :character
## Median : 8.00 Mode :character
## Mean :23.29
## 3rd Qu.:52.00
## Max. :52.00
Check the unique value for our response variable (income_level) and change the values to “<=50K” for “- 50000.” and simlarly change “>50k” for “50000+.” which makes easier for doing analysis.Let’s take a look at the severity of imbalanced classes in our data.
##
## - 50000. 50000+.
## 94 6
segregating out data into numeric and categorical data frames for both Train data set and Test data set
class_of_worker | education | marital_status | race | sex | country_self | income_level |
---|---|---|---|---|---|---|
Not in universe | High school graduate | Widowed | White | Female | United-States | <=50K |
Self-employed-not incorporated | Some college but no degree | Divorced | White | Male | United-States | <=50K |
Not in universe | 10th grade | Never married | Asian or Pacific Islander | Female | Vietnam | <=50K |
Not in universe | Children | Never married | White | Female | United-States | <=50K |
Not in universe | Children | Never married | White | Female | United-States | <=50K |
Private | Some college but no degree | Married-civilian spouse present | Amer Indian Aleut or Eskimo | Female | United-States | <=50K |
age | occupation_code | wage_per_hour | capital_gains | capital_losses | weeks_worked_in_year |
---|---|---|---|---|---|
73 | 0 | 0 | 0 | 0 | 0 |
58 | 34 | 0 | 0 | 0 | 52 |
18 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 0 | 0 | 0 |
48 | 10 | 1200 | 0 | 0 | 52 |
Income_level is our response/dependent variable and rest of the variables are our independent variables , next step is to do Exploratory Analsyis on numeric variables and categorical variables.
Each of the variables is explored for distribution, variance, and predictability. We will first explore numeric variables and then do analysis on categorical variables.
Doing summary of the Age variable , one can clearly see that age has wide range and variability, Mean and Distributions is quiet different from the income levels and hence is a good predictor factor.
The variable occupation code has good vriability , hence it can be good predictor factor.Thus we sustain it.
Variables(Capital gain and capital losses) don’t show much variance for all income levels from the plots below. However, the means show a difference for the different levels of income. So these variables can be used for prediction.
Furthermore, in classification problems, we should also plot numerical variables with dependent variable. This would help us determine the clusters (if exists) of classes “<=50k” and “>50k”. In this we plot Wage per hour against Age for income level dependent variable.
we can clearly from the below graph, that for both sections of income level (i.e. <=50K & >50k) the age group is between 25-65 yrs and also the average wage per hour for group <=50K is <2000 and even more <1000 whereas for income level >50k the average salary per hour is lies more between 1000 & 3000. Thus we can cleary see the clusters of Age against Wage per hour.
In this we plot weeks_worked_in_year against Age for income level dependent variable, and can clearly see the clusters forming which gives that for Age between 25 & 65 and they work 30+ weeks, we can see the pattern in >50K income that majority of them fall in this cluster. Similary in <=50k income we can see that most of the user base has weeks per year less than 30+ and most of them fall on either of age specturen ie.e. <25 and >65 .
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.00 33.00 34.09 49.00 90.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 11.4 26.0 46.0
## freqRatio percentUnique zeroVar nzv
## capital_gains 248.7374 0.06904585 FALSE TRUE
## capital_losses 481.9896 0.05850450 FALSE TRUE
## capital_gains capital_losses
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 0.0 Median : 0.00
## Mean : 140.9 Mean : 26.76
## 3rd Qu.: 0.0 3rd Qu.: 0.00
## Max. :99999.0 Max. :4608.00
## capital_gains capital_losses
## Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 0
## Mean : 4746 Mean : 193
## 3rd Qu.: 0 3rd Qu.: 0
## Max. :99999 Max. :3683
## Exploring correlation between all numerical variables The below correlation chows that all the numerical(continous) variables are not co-related and are independent of each other.
## age occupation_code wage_per_hour
## age 0.00000000 0.129479084 0.040535605
## occupation_code 0.12947908 0.000000000 0.195854599
## wage_per_hour 0.04053560 0.195854599 0.000000000
## capital_gains 0.05416728 0.003395563 -0.001974882
## capital_losses 0.06367913 0.045558024 0.012366906
## weeks_worked_in_year 0.22097398 0.655996548 0.197455489
## capital_gains capital_losses weeks_worked_in_year
## age 0.054167279 0.06367913 0.22097398
## occupation_code 0.003395563 0.04555802 0.65599655
## wage_per_hour -0.001974882 0.01236691 0.19745549
## capital_gains 0.000000000 -0.01255867 0.08280735
## capital_losses -0.012558671 0.00000000 0.10175318
## weeks_worked_in_year 0.082807351 0.10175318 0.00000000
Looking at all the graphs , we can safely assume that Education / Class of Workers / Marital Status / Country_self are good predictor variables . And similarly don’t see much variability for variable(s) Sex/ Race and hence we won’t be using these variable in our Prediction Model.
## class_of_worker income_level
## Private :68868 <=50K:84049
## Self-employed-not incorporated: 8005 >50K :10792
## Local government : 7516
## State government : 4070
## Self-employed-incorporated : 3018
## Federal government : 2796
## (Other) : 568
H? : There is no significant impact of the variable (MARTIAL_STATUS ) on the INCOME_LEVEL variable.
Ha : There exists a significant impact of the variable (MARTIAL_STATUS) on the INCOME_LEVEL variable.
<=50K | >50K | |
---|---|---|
Divorced | 0.9167490 | 0.0832510 |
Married-A F spouse present | 0.9793978 | 0.0206022 |
Married-civilian spouse present | 0.8871358 | 0.1128642 |
Married-spouse absent | 0.9378655 | 0.0621345 |
Never married | 0.9874365 | 0.0125635 |
Separated | 0.9542978 | 0.0457022 |
Widowed | 0.9679992 | 0.0320008 |
## Warning in chisq.test(myTable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: myTable
## X-squared = 0.15537, df = 6, p-value = 0.9999
As clearly we can see that p-value is greater than significance levl (.05) hence we reject NULL Hyposthesis and can safely assume that Marital_status has significant impact on the Income_level variable.
##
## FALSE TRUE
## 568207 980
## 188749 codes from your data successfully matched countries in the map
## 980 codes from your data failed to match with a country code in the map
## 205 codes from the map weren't represented in your data
To build the prediction model, we will be using all the independent variables except the Sex and race variables that predicts the income level of an individual to be greater than USD 50000 or less than USD 50000 using Census data. Using the Boosting algorithm for this classification modeling, as consensus data has some weak predictors.
Cross Validation (CV) where the training data is partitioned a specific number of times (2 in oiur case) and separate boosted models are built on each. The resulting models are ensembled to arrive at final model. This helps avoid overfitting the model to the training data.
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 176763 1359
## >50K 7681 3926
##
## Accuracy : 0.9524
## 95% CI : (0.9514, 0.9533)
## No Information Rate : 0.9721
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4435
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9584
## Specificity : 0.7429
## Pos Pred Value : 0.9924
## Neg Pred Value : 0.3382
## Prevalence : 0.9721
## Detection Rate : 0.9317
## Detection Prevalence : 0.9388
## Balanced Accuracy : 0.8506
##
## 'Positive' Class : <=50K
##
The confusion matrix above shows an in-sample overall accuracy of ~95%, the sensitivity of ~95% and specificity of ~74%.
This implies that 95% of times, the model has classified the income level correctly, 95% of the times, the income level being less than or equal to USD 50000 in classified correctly and 74% of the times, the income level being greater than USD 50000 is classified correctly.
The created prediction model is applied to the test data to validate the true performance.
The prediction model is applied on the test data. From the confusion matrix below the performance measures are out-of-sample overall accuracy of ~95%, sensitivity of ~95% and specificity of ~74%, which is quite similar to the in-sample performances
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 88356 676
## >50K 3865 1930
##
## Accuracy : 0.9521
## 95% CI : (0.9507, 0.9535)
## No Information Rate : 0.9725
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4382
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9581
## Specificity : 0.7406
## Pos Pred Value : 0.9924
## Neg Pred Value : 0.3330
## Prevalence : 0.9725
## Detection Rate : 0.9318
## Detection Prevalence : 0.9389
## Balanced Accuracy : 0.8493
##
## 'Positive' Class : <=50K
##
It is very important to understand how the built model has performed with respect to a baseline model.Looking at the Baseline model and prediction model using booster algorithm , it is clear that prediction model does perform better than the baseline model.