1 Problem statement

We will use US census data to build a model that predicts whether an individual's income is greater than or less than USD 50,000, based on the information available about that individual in the census data; we also visualize the data at the world level. The data set comes from the UCI site archive.

Libraries installed for this project:
data.table
dplyr
ggplot2
plotly
gtable
gridExtra
caret
rworldmap
countrycode
kableExtra
gbm
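
A one-shot way to install any missing packages and load them all (a sketch; the report does not show its setup code):

libs <- c("data.table", "dplyr", "ggplot2", "plotly", "gtable", "gridExtra",
          "caret", "rworldmap", "countrycode", "kableExtra", "gbm")
# Uncomment to install any packages that are missing:
# install.packages(setdiff(libs, rownames(installed.packages())))
invisible(lapply(libs, library, character.only = TRUE))   # load each package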

2 Data Exploration

Download census.tar.gz from the [UCI site archive](http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/) and have a look at the data, as well as the description of the variables provided on the site. That will give you an idea of what the variables are and which ones we might not require and can therefore exclude. This is a crucial task: insight into the data variables lets us identify the response variable and decide which variables to keep for exploratory analysis and for later use in our model for prediction/classification.

2.1 Read the data

We read the files using the fread function of the data.table library, which gives us a faster and more convenient way to read the train and test data sets. We then explore the data to see what we require and what can be discarded.

The training data set is read and stored as the train data frame, with 199,523 rows and 42 columns.

library(data.table)  # provides fread() for fast file reading

trainFileName = "census-income-train.csv"

testFileName = "census-income-test.csv"


colnames <- c("age","class_of_worker","industry_code","occupation_code","education","wage_per_hour","enrolled_in_edu_inst_lastwk",  "marital_status","major_industry_code","major_occupation_code","race","hispanic_origin","sex","member_of_labor_union","reason_for_unemployment","full_parttime_employment_stat","capital_gains","capital_losses","dividend_from_Stocks","tax_filer_status","region_of_previous_residence","state_of_previous_residence","d_household_family_stat","d_household_summary","instance_weight","migration_msa","migration_reg","migration_within_reg","live_1_year_ago","migration_sunbelt","num_person_Worked_employer","family_members_under_18","country_father","country_mother","country_self","citizenship","business_or_self_employed","fill_questionnaire_veteran_admin","veterans_benefits","weeks_worked_in_year","year","income_level")

train <- fread(trainFileName,na.strings = c(""," ","?","NA",NA) ,  col.names = colnames)

The test data set is read and stored as the test data frame. We then check the complete cases: complete.cases returns a logical vector indicating which rows have no missing values.

test <- fread(testFileName,na.strings = c(""," ","?","NA",NA) , col.names = colnames)
table(complete.cases(test))
## 
## FALSE  TRUE 
## 52680 47082

2.2 Data cleaning

Using the summary function on the train and test data sets, we can see that a few columns have a lot of NAs, which won't help in the analysis; it is therefore better to remove them ("migration_msa", "migration_reg", "migration_within_reg" and "migration_sunbelt"). Similarly, we remove any rows that contain NA values, which rids the data sets of all missing values.

# Drop the four migration columns (indices 26, 27, 28, 30)
train <- train[,c(-26,-27,-28,-30)]
myCleanTrain <- na.omit(train)   # drop rows with any remaining NA
test <- test[,c(-26,-27,-28,-30)]
myCleanTest <- na.omit(test)

Subset and construct the final data set of 13 variables; the summary of the training data set is shown below.
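
The selection code itself is not shown in the report; a minimal sketch, assuming data.table selection by name and our own names finalTrain/finalTest for the result, might be:

# Keep the 13 variables that appear in the summary below (assumed selection)
keepVars <- c("age", "class_of_worker", "occupation_code", "education",
              "wage_per_hour", "marital_status", "race", "sex",
              "capital_gains", "capital_losses", "country_self",
              "weeks_worked_in_year", "income_level")
finalTrain <- myCleanTrain[, ..keepVars]   # data.table: select the listed columns
finalTest  <- myCleanTest[, ..keepVars]
summary(finalTrain)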

##       age        class_of_worker    occupation_code  education        
##  Min.   : 0.00   Length:189729      Min.   : 0.0    Length:189729     
##  1st Qu.:15.00   Class :character   1st Qu.: 0.0    Class :character  
##  Median :33.00   Mode  :character   Median : 0.0    Mode  :character  
##  Mean   :34.09                      Mean   :11.4                      
##  3rd Qu.:49.00                      3rd Qu.:26.0                      
##  Max.   :90.00                      Max.   :46.0                      
##  wage_per_hour     marital_status         race          
##  Min.   :   0.00   Length:189729      Length:189729     
##  1st Qu.:   0.00   Class :character   Class :character  
##  Median :   0.00   Mode  :character   Mode  :character  
##  Mean   :  56.13                                        
##  3rd Qu.:   0.00                                        
##  Max.   :9999.00                                        
##      sex            capital_gains     capital_losses    country_self      
##  Length:189729      Min.   :    0.0   Min.   :   0.00   Length:189729     
##  Class :character   1st Qu.:    0.0   1st Qu.:   0.00   Class :character  
##  Mode  :character   Median :    0.0   Median :   0.00   Mode  :character  
##                     Mean   :  422.7   Mean   :  36.93                     
##                     3rd Qu.:    0.0   3rd Qu.:   0.00                     
##                     Max.   :99999.0   Max.   :4608.00                     
##  weeks_worked_in_year income_level      
##  Min.   : 0.00        Length:189729     
##  1st Qu.: 0.00        Class :character  
##  Median : 8.00        Mode  :character  
##  Mean   :23.29                          
##  3rd Qu.:52.00                          
##  Max.   :52.00

Check the unique values of our response variable (income_level) and recode "- 50000." to "<=50K" and "50000+." to ">50K", which makes the analysis easier. Let's take a look at the severity of the class imbalance in our data (percentages below).

## 
## - 50000.  50000+. 
##       94        6
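
The recoding step itself is not shown in the report; a minimal sketch, assuming the finalTrain/finalTest frames from above and the raw label strings shown, might be:

# Recode the raw income labels to the cleaner "<=50K" / ">50K" form
finalTrain$income_level <- ifelse(finalTrain$income_level == "- 50000.", "<=50K", ">50K")
finalTest$income_level  <- ifelse(finalTest$income_level == "- 50000.", "<=50K", ">50K")
round(prop.table(table(finalTrain$income_level)) * 100)   # class imbalance in percent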

We segregate the data into numeric and categorical data frames for both the train and test data sets (a sketch follows; the first rows of each frame are shown in the tables below).
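
A sketch of the segregation step (the catTrain/numTrain names are our own):

# Split into categorical (character) and numeric frames; column classes assumed
catVars <- names(finalTrain)[sapply(finalTrain, is.character)]
numVars <- names(finalTrain)[sapply(finalTrain, is.numeric)]
catTrain <- finalTrain[, ..catVars]
numTrain <- finalTrain[, ..numVars]
catTest  <- finalTest[, ..catVars]
numTest  <- finalTest[, ..numVars]
head(catTrain)
head(numTrain)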

| class_of_worker | education | marital_status | race | sex | country_self | income_level |
|---|---|---|---|---|---|---|
| Not in universe | High school graduate | Widowed | White | Female | United-States | <=50K |
| Self-employed-not incorporated | Some college but no degree | Divorced | White | Male | United-States | <=50K |
| Not in universe | 10th grade | Never married | Asian or Pacific Islander | Female | Vietnam | <=50K |
| Not in universe | Children | Never married | White | Female | United-States | <=50K |
| Not in universe | Children | Never married | White | Female | United-States | <=50K |
| Private | Some college but no degree | Married-civilian spouse present | Amer Indian Aleut or Eskimo | Female | United-States | <=50K |

| age | occupation_code | wage_per_hour | capital_gains | capital_losses | weeks_worked_in_year |
|---|---|---|---|---|---|
| 73 | 0 | 0 | 0 | 0 | 0 |
| 58 | 34 | 0 | 0 | 0 | 52 |
| 18 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 |
| 48 | 10 | 1200 | 0 | 0 | 52 |

income_level is our response/dependent variable, and the rest of the variables are our independent variables. The next step is exploratory analysis of the numeric and categorical variables.

3 Exploratory Analysis

Each of the variables is explored for distribution, variance, and predictability. We will first explore numeric variables and then do analysis on categorical variables.

3.1 Exploratory Analysis of Numerical variables

3.1.0.1 Age

Looking at a summary of the age variable, one can clearly see that age has a wide range and variability; its mean and distribution differ noticeably between the income levels, and hence it is a good predictor.
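
A quick way to see this (a sketch; the per-class comparison code is not shown in the report):

tapply(finalTrain$age, finalTrain$income_level, summary)   # age distribution per income level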

3.1.0.2 Occupation code

The occupation_code variable has good variability, hence it can be a good predictor. Thus we retain it.

3.1.0.2.1 Capital Gains & Capital Losses

From the plots, capital gains and capital losses do not show much variance within the income levels. However, the means differ markedly between the income levels, so these variables can still be used for prediction (see the near-zero-variance check and the per-income-level summaries further below).

3.1.0.3 Wage per hour

Furthermore, in classification problems we should also plot numerical variables against the dependent variable. This helps us spot clusters (if they exist) of the classes "<=50K" and ">50K". Here we plot wage per hour against age, colored by the income_level dependent variable.
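
The plotting code is not shown; a sketch with ggplot2 (same assumed names as above) might be:

library(ggplot2)
# Scatter plot of wage per hour vs age, coloured by income level
ggplot(finalTrain, aes(x = age, y = wage_per_hour, colour = income_level)) +
  geom_point(alpha = 0.3) +
  labs(x = "Age", y = "Wage per hour", colour = "Income level")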

We can clearly see from the graph that for both income levels (<=50K and >50K) the bulk of the observations fall in the 25-65 age group. The average wage per hour for the <=50K group is below 2000 (and mostly below 1000), whereas for the >50K group it lies mostly between 1000 and 3000. Thus the clusters of age against wage per hour are clearly visible.

3.1.0.4 Weeks worked in year

Here we plot weeks_worked_in_year against age, colored by income level, and can clearly see clusters forming: in the >50K group, the majority are aged between 25 and 65 and work 30+ weeks a year. Similarly, in the <=50K group most individuals work fewer than 30 weeks per year, and many fall at either end of the age spectrum, i.e. under 25 or over 65.
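
A matching sketch of this plot:

# Scatter plot of weeks worked vs age, coloured by income level
ggplot(finalTrain, aes(x = age, y = weeks_worked_in_year, colour = income_level)) +
  geom_point(alpha = 0.3) +
  labs(x = "Age", y = "Weeks worked in year", colour = "Income level")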

Summary of age:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   33.00   34.09   49.00   90.00

Summary of occupation_code:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0    11.4    26.0    46.0

Near-zero-variance check for capital gains and losses:

##                freqRatio percentUnique zeroVar  nzv
## capital_gains   248.7374    0.06904585   FALSE TRUE
## capital_losses  481.9896    0.05850450   FALSE TRUE

Summary for income level <=50K:

##  capital_gains     capital_losses   
##  Min.   :    0.0   Min.   :   0.00  
##  1st Qu.:    0.0   1st Qu.:   0.00  
##  Median :    0.0   Median :   0.00  
##  Mean   :  140.9   Mean   :  26.76  
##  3rd Qu.:    0.0   3rd Qu.:   0.00  
##  Max.   :99999.0   Max.   :4608.00

Summary for income level >50K:

##  capital_gains   capital_losses
##  Min.   :    0   Min.   :   0  
##  1st Qu.:    0   1st Qu.:   0  
##  Median :    0   Median :   0  
##  Mean   : 4746   Mean   : 193  
##  3rd Qu.:    0   3rd Qu.:   0  
##  Max.   :99999   Max.   :3683

Exploring the correlation between all numerical variables: the matrix below shows that the numerical (continuous) variables are mostly weakly correlated with each other; the only notable relationship is between occupation_code and weeks_worked_in_year (0.66).
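
A sketch of how such a matrix can be produced (the printed output has a zeroed diagonal, which we reproduce here):

corMat <- cor(numTrain)   # pairwise Pearson correlations of the numeric variables
diag(corMat) <- 0         # zero the diagonal, as in the output below
corMat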

##                             age occupation_code wage_per_hour
## age                  0.00000000     0.129479084   0.040535605
## occupation_code      0.12947908     0.000000000   0.195854599
## wage_per_hour        0.04053560     0.195854599   0.000000000
## capital_gains        0.05416728     0.003395563  -0.001974882
## capital_losses       0.06367913     0.045558024   0.012366906
## weeks_worked_in_year 0.22097398     0.655996548   0.197455489
##                      capital_gains capital_losses weeks_worked_in_year
## age                    0.054167279     0.06367913           0.22097398
## occupation_code        0.003395563     0.04555802           0.65599655
## wage_per_hour         -0.001974882     0.01236691           0.19745549
## capital_gains          0.000000000    -0.01255867           0.08280735
## capital_losses        -0.012558671     0.00000000           0.10175318
## weeks_worked_in_year   0.082807351     0.10175318           0.00000000

3.2 Exploratory Analysis of Categorical Variables

Looking at all the graphs, we can safely assume that education, class of worker, marital status, and country_self are good predictor variables. By contrast, we do not see much variability across income levels for sex and race, and hence we will not use these variables in our prediction model.
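
The bar charts themselves are not shown; a sketch of one such chart (ggplot2, same assumed names as above) might be:

# Proportion of each income level within each class of worker
ggplot(finalTrain, aes(x = class_of_worker, fill = income_level)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(x = "Class of worker", y = "Proportion", fill = "Income level")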

##                        class_of_worker  income_level 
##  Private                       :68868   <=50K:84049  
##  Self-employed-not incorporated: 8005   >50K :10792  
##  Local government              : 7516                
##  State government              : 4070                
##  Self-employed-incorporated    : 3018                
##  Federal government            : 2796                
##  (Other)                       :  568

H0: There is no significant impact of the variable MARITAL_STATUS on the INCOME_LEVEL variable.

Ha: There exists a significant impact of the variable MARITAL_STATUS on the INCOME_LEVEL variable.

| marital_status | <=50K | >50K |
|---|---|---|
| Divorced | 0.9167490 | 0.0832510 |
| Married-A F spouse present | 0.9793978 | 0.0206022 |
| Married-civilian spouse present | 0.8871358 | 0.1128642 |
| Married-spouse absent | 0.9378655 | 0.0621345 |
| Never married | 0.9874365 | 0.0125635 |
| Separated | 0.9542978 | 0.0457022 |
| Widowed | 0.9679992 | 0.0320008 |
## Warning in chisq.test(myTable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  myTable
## X-squared = 0.15537, df = 6, p-value = 0.9999

Note that the reported p-value (0.9999) is greater than the significance level (0.05), which on its face would mean we fail to reject the null hypothesis. However, the warning above indicates the chi-squared approximation is unreliable here: the test appears to have been run on the table of row proportions rather than on raw counts, so the statistic is not meaningful. The proportions themselves differ markedly across categories (for example, 11.3% of "Married-civilian spouse present" respondents earn >50K versus 1.3% of "Never married"), so we still treat marital_status as having a meaningful association with income_level.
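
For reference, a sketch of the test run on raw counts instead (assumed names from the segregation sketch above):

# Test of independence on the raw contingency table, not row proportions
countsTable <- table(catTrain$marital_status, catTrain$income_level)
chisq.test(countsTable)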

Country-code conversion and map-matching output:

## 
##  FALSE   TRUE 
## 568207    980
## 188749 codes from your data successfully matched countries in the map
## 980 codes from your data failed to match with a country code in the map
## 205 codes from the map weren't represented in your data
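
The output above comes from mapping respondents' countries onto a world map with countrycode and rworldmap. The exact code is not shown in the report, so here is a minimal sketch; the column choice and aggregation are assumptions, and hyphenated names such as "United-States" may need cleaning before conversion:

library(countrycode)
library(rworldmap)
# Convert country names to ISO3 codes (country_self is an assumed choice)
iso3 <- countrycode(myCleanTrain$country_self,
                    origin = "country.name", destination = "iso3c")
table(is.na(iso3))                        # TRUE = names that failed to convert
countDF <- as.data.frame(table(iso3 = iso3))   # respondents per country
mapped  <- joinCountryData2Map(countDF, joinCode = "ISO3", nameJoinColumn = "iso3")
mapCountryData(mapped, nameColumnToPlot = "Freq", mapTitle = "Respondents by country")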

4 Building a Predictive Model

To build the prediction model, we use all the independent variables except sex and race to predict whether an individual's income level is greater than or less than USD 50,000. We use a boosting algorithm for this classification, as the census data contains a number of weak predictors that boosting can combine effectively.

We use cross-validation (CV), in which the training data is partitioned a specific number of times (2 in our case) and a separate boosted model is built on each partition. The resulting models are ensembled to arrive at the final model, which helps avoid overfitting to the training data.
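
The exact training call is not shown in the report; a minimal sketch using caret with the gbm method, under the assumed names from above (tuning values here are caret defaults, not the report's), might be:

library(caret)
library(gbm)
finalTrain$income_level <- as.factor(finalTrain$income_level)   # caret needs a factor outcome
finalTest$income_level  <- as.factor(finalTest$income_level)
fitControl <- trainControl(method = "cv", number = 2)           # 2-fold CV, as described above
gbmFit <- train(income_level ~ . - sex - race,                  # drop sex and race, per Section 3.2
                data = finalTrain, method = "gbm",
                trControl = fitControl, verbose = FALSE)
predTrain <- predict(gbmFit, newdata = finalTrain)
confusionMatrix(predTrain, finalTrain$income_level)             # in-sample performance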

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  <=50K   >50K
##      <=50K 176763   1359
##      >50K    7681   3926
##                                           
##                Accuracy : 0.9524          
##                  95% CI : (0.9514, 0.9533)
##     No Information Rate : 0.9721          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4435          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9584          
##             Specificity : 0.7429          
##          Pos Pred Value : 0.9924          
##          Neg Pred Value : 0.3382          
##              Prevalence : 0.9721          
##          Detection Rate : 0.9317          
##    Detection Prevalence : 0.9388          
##       Balanced Accuracy : 0.8506          
##                                           
##        'Positive' Class : <=50K           
## 

The confusion matrix above shows an in-sample overall accuracy of ~95%, sensitivity of ~96%, and specificity of ~74%.

This means the model classifies the income level correctly about 95% of the time; income levels of at most USD 50,000 are classified correctly about 96% of the time, and income levels above USD 50,000 about 74% of the time. Note that because of the heavy class imbalance the no-information rate (97.2%) exceeds the raw accuracy, so the Kappa (0.44) and balanced accuracy (85%) are the more informative measures here.

4.0.1 Validate the Predictive Model

The created prediction model is applied to the test data to validate its true, out-of-sample performance. From the confusion matrix below, the out-of-sample overall accuracy is ~95%, sensitivity ~96% and specificity ~74%, quite similar to the in-sample performance.
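
The scoring step, as a sketch (same assumed names as in the training sketch above):

# Score the held-out test set with the fitted boosted model
predTest <- predict(gbmFit, newdata = finalTest)
confusionMatrix(predTest, finalTest$income_level)   # out-of-sample performance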

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K  >50K
##      <=50K 88356   676
##      >50K   3865  1930
##                                           
##                Accuracy : 0.9521          
##                  95% CI : (0.9507, 0.9535)
##     No Information Rate : 0.9725          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4382          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9581          
##             Specificity : 0.7406          
##          Pos Pred Value : 0.9924          
##          Neg Pred Value : 0.3330          
##              Prevalence : 0.9725          
##          Detection Rate : 0.9318          
##    Detection Prevalence : 0.9389          
##       Balanced Accuracy : 0.8493          
##                                           
##        'Positive' Class : <=50K           
## 

5 Summary

It is very important to understand how the built model performs with respect to a baseline model. A naive baseline that always predicts "<=50K" would match the no-information rate on accuracy but could never identify a >50K earner; the boosted prediction model, with ~74% specificity and a Kappa of ~0.44, clearly performs better than that baseline.

6 Challenges

  1. Memory was a big challenge, as R loads data into RAM for in-memory calculations. To run the predictive model, the number of GBM boosting iterations had to be reduced.