1 Problem statement

we will use the US census data to build a model to predict if the income of any individual in the US is greater than or less than USD 50000 based on the information available about that individual in the census data, and also at world level. UCI site archive

Libaries installed for this Project. data.table
dplyr
ggplot2
plotly
gtable
gridExtra
caret
rworldmap
countrycode
kableExtra
gbm

2 Data Exploration

Download from site the census.tar.gz [UCI site archive](http://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/ have a look at the data and also the description for various variables provided on the site .That will give you an idea of what the variables are and what variables we might not require and hence can exlcuded. This is one of the crucial task as having an insight on the data variables and based on that we can find out the response variable and what all variables we can keep for our exploratpory analysis and can be further used in our Model for prediction/classification.

We read the files using Fread function of data.table library which gives us a faster and more convenient way to read the data from files(train & test) data sets . Now we explore the data set to see what data we require and which data can be discarded

2.1 Data Cleaning & Manipulation

trainFileName = "census-income-train.csv"

testFileName = "census-income-test.csv"


colnames <- c("age","class_of_worker","industry_code","occupation_code","education","wage_per_hour","enrolled_in_edu_inst_lastwk",  "marital_status","major_industry_code","major_occupation_code","race","hispanic_origin","sex","member_of_labor_union","reason_for_unemployment","full_parttime_employment_stat","capital_gains","capital_losses","dividend_from_Stocks","tax_filer_status","region_of_previous_residence","state_of_previous_residence","d_household_family_stat","d_household_summary","instance_weight","migration_msa","migration_reg","migration_within_reg","live_1_year_ago","migration_sunbelt","num_person_Worked_employer","family_members_under_18","country_father","country_mother","country_self","citizenship","business_or_self_employed","fill_questionnaire_veteran_admin","veterans_benefits","weeks_worked_in_year","year","income_level")

train <- fread(trainFileName,na.strings = c(""," ","?","NA",NA) ,  col.names = colnames)

test <- fread(testFileName,na.strings = c(""," ","?","NA",NA) , col.names = colnames)

table (complete.cases (train))
## 
##  FALSE   TRUE 
## 105061  94462
table (complete.cases (test))
## 
## FALSE  TRUE 
## 52680 47082

Using the summary function of the data sets train & test we can see that few columns are having lot of NA ,which won’t help us in analysis hence it is better to remove them (“migration_msa”,“migration_reg”,“migration_within_reg” & “migration_sunbelt”).
And similary we remove rows for which there are any NA values which makes our data set get rid of any NA values.

Now again subsetting and constructing our final dataset of 13 variables , then checking the unique value for our response variable (income_level) and change the values to “<=50K” for “- 50000.” and simlarly change “>50k” for “50000+.” which makes easier for doing analysis.Let’s take a look at the severity of imbalanced classes in our data.

##       age        class_of_worker    occupation_code  education        
##  Min.   : 0.00   Length:189729      Min.   : 0.0    Length:189729     
##  1st Qu.:15.00   Class :character   1st Qu.: 0.0    Class :character  
##  Median :33.00   Mode  :character   Median : 0.0    Mode  :character  
##  Mean   :34.09                      Mean   :11.4                      
##  3rd Qu.:49.00                      3rd Qu.:26.0                      
##  Max.   :90.00                      Max.   :46.0                      
##  wage_per_hour     marital_status         race          
##  Min.   :   0.00   Length:189729      Length:189729     
##  1st Qu.:   0.00   Class :character   Class :character  
##  Median :   0.00   Mode  :character   Mode  :character  
##  Mean   :  56.13                                        
##  3rd Qu.:   0.00                                        
##  Max.   :9999.00                                        
##      sex            capital_gains     capital_losses    country_self      
##  Length:189729      Min.   :    0.0   Min.   :   0.00   Length:189729     
##  Class :character   1st Qu.:    0.0   1st Qu.:   0.00   Class :character  
##  Mode  :character   Median :    0.0   Median :   0.00   Mode  :character  
##                     Mean   :  422.7   Mean   :  36.93                     
##                     3rd Qu.:    0.0   3rd Qu.:   0.00                     
##                     Max.   :99999.0   Max.   :4608.00                     
##  weeks_worked_in_year income_level      
##  Min.   : 0.00        Length:189729     
##  1st Qu.: 0.00        Class :character  
##  Median : 8.00        Mode  :character  
##  Mean   :23.29                          
##  3rd Qu.:52.00                          
##  Max.   :52.00
## [1] "<=50K" ">50K"
## 
## - 50000.  50000+. 
##       94        6
class_of_worker education marital_status race sex country_self income_level
Not in universe High school graduate Widowed White Female United-States <=50K
Self-employed-not incorporated Some college but no degree Divorced White Male United-States <=50K
Not in universe 10th grade Never married Asian or Pacific Islander Female Vietnam <=50K
Not in universe Children Never married White Female United-States <=50K
Not in universe Children Never married White Female United-States <=50K
Private Some college but no degree Married-civilian spouse present Amer Indian Aleut or Eskimo Female United-States <=50K
age occupation_code wage_per_hour capital_gains capital_losses weeks_worked_in_year
73 0 0 0 0 0
58 34 0 0 0 52
18 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
48 10 1200 0 0 52

income_level is our response/dependent variable and rest of the variables are our independent variables , next step is to do Exploratory Analsyis on numeric variables and categorical variables.

3 Exploratory Analysis:-

Each of the variables is explored for distribution, variance, and predictability. We will first explore numeric variables and then do analysis on categorical variables.

3.1 Exploratory Analysis of Numerical variables

3.1.0.1 Age :-

Doing summary of the Age variable , one can clearly see that age has wide range and variability, Mean and Distributions is quiet different from the income levels and hence is a good predictor factor.

3.1.0.2 Occupation code:-

The variable occupation code has good vriability , hence it can be good predictor factor.Thus we sustain it.

3.1.0.2.1 Capitol Gains & Capitol Losses:-

Variables(Capital gain and capital losses) don’t show much variance for all income levels from the plots below. However, the means show a difference for the different levels of income. So these variables can be used for prediction.

3.1.0.3 Wage per hours

Furthermore, in classification problems, we should also plot numerical variables with dependent variable. This would help us determine the clusters (if exists) of classes “<=50k” and “>50k”. In this we plot Wage per hour against Age for income level dependent variable.

we can clearly from the below graph, that for both sections of income level (i.e. <=50K & >50k) the age group is between 25-65 yrs and also the average wage per hour for group <=50K is <2000 and even more <1000 whereas for income level >50k the average salary per hour is lies more between 1000 & 3000. Thus we can cleary see the clusters of Age against Wage per hour.

3.1.0.4 Weeks worked in year:-

In this we plot weeks_worked_in_year against Age for income level dependent variable, and can clearly see the clusters forming which gives that for Age between 25 & 65 and they work 30+ weeks, we can see the pattern in >50K income that majority of them fall in this cluster. Similary in <=50k income we can see that most of the user base has weeks per year less than 30+ and most of them fall on either of age specturen ie.e. <25 and >65 .

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   33.00   34.09   49.00   90.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0    11.4    26.0    46.0

##                freqRatio percentUnique zeroVar  nzv
## capital_gains   248.7374    0.06904585   FALSE TRUE
## capital_losses  481.9896    0.05850450   FALSE TRUE
##  capital_gains     capital_losses   
##  Min.   :    0.0   Min.   :   0.00  
##  1st Qu.:    0.0   1st Qu.:   0.00  
##  Median :    0.0   Median :   0.00  
##  Mean   :  140.9   Mean   :  26.76  
##  3rd Qu.:    0.0   3rd Qu.:   0.00  
##  Max.   :99999.0   Max.   :4608.00
##  capital_gains   capital_losses
##  Min.   :    0   Min.   :   0  
##  1st Qu.:    0   1st Qu.:   0  
##  Median :    0   Median :   0  
##  Mean   : 4746   Mean   : 193  
##  3rd Qu.:    0   3rd Qu.:   0  
##  Max.   :99999   Max.   :3683

## Exploring correlation between all numerical variables The below correlation chows that all the numerical(continous) variables are not co-related and are independent of each other.

correlationNumerical = cor (numFinalTrain[, c("age", "occupation_code","wage_per_hour", "capital_gains", "capital_losses", "weeks_worked_in_year")])
diag (correlationNumerical) = 0 # ythis removes any correlation with self
correlationNumerical
##                             age occupation_code wage_per_hour
## age                  0.00000000     0.129479084   0.040535605
## occupation_code      0.12947908     0.000000000   0.195854599
## wage_per_hour        0.04053560     0.195854599   0.000000000
## capital_gains        0.05416728     0.003395563  -0.001974882
## capital_losses       0.06367913     0.045558024   0.012366906
## weeks_worked_in_year 0.22097398     0.655996548   0.197455489
##                      capital_gains capital_losses weeks_worked_in_year
## age                    0.054167279     0.06367913           0.22097398
## occupation_code        0.003395563     0.04555802           0.65599655
## wage_per_hour         -0.001974882     0.01236691           0.19745549
## capital_gains          0.000000000    -0.01255867           0.08280735
## capital_losses        -0.012558671     0.00000000           0.10175318
## weeks_worked_in_year   0.082807351     0.10175318           0.00000000

3.2 Explanatory Analysis for Categorical variables

Looking at all the graphs , we can safely assume that Education / Class of Workers / Marital Status / Country_self are good predictor variables . And similarly don’t see much variability for variable(s) Sex/ Race and hence we won’t be using these variable in our Prediction Model.

# Education
qplot (income_level, data = catFinalTrain, fill = education) + facet_grid (. ~ education) + theme(axis.text.x=element_text(angle  = 80,hjust = 1,size=10))

 ggplot(catFinalTrain,aes(x=education,fill=income_level))+geom_bar(position = "dodge",  color="black")+scale_fill_brewer(palette = "YlGnBu")+theme(axis.text.x =element_text(angle  = 60,hjust = 1,size=10))

### The explaoratory analysis for education vs Income levels (<=50k & >50k), clearly gives us a view that people who are bracketed in group >50k (income level) have more education the minimum they have completed thier schooling and have some sort of college done mostly people have professional degree, whereas people in income group <=50K have lesser education .

 
 # Classs of Workers
 
 
 ggplot(catFinalTrain,aes(x=class_of_worker,fill=income_level))+geom_bar(position = "dodge",  color="black")+scale_fill_brewer(palette = "YlOrBr")+theme(axis.text.x =element_text(angle  = 60,hjust = 1,size=10))

 # As we can clearly see that majority of <=50k falls in Not in Universe which seems to be imbalanced data set as we can safely assume either user group was frustrated or not in mood to give any clear answers. So now we can either clear these particular values from the data and then redraw to see if we can use this data to infer something.
 
 classWorkderDF <- catFinalTrain[ catFinalTrain$class_of_worker != "Not in universe", c("class_of_worker","income_level")]
 
 summary(classWorkderDF)
##                        class_of_worker  income_level 
##  Private                       :68868   <=50K:84049  
##  Self-employed-not incorporated: 8005   >50K :10792  
##  Local government              : 7516                
##  State government              : 4070                
##  Self-employed-incorporated    : 3018                
##  Federal government            : 2796                
##  (Other)                       :  568
 qplot (income_level, data = classWorkderDF, fill = class_of_worker) + facet_grid (. ~ class_of_worker)

 ### Now we can infer that most of the people work in Private sector with user earning <=50k is far more than people earning >50k and same goes in all other government jobs also.
 
# Martial Status
 
 qplot (income_level, data = catFinalTrain, fill = marital_status) + facet_grid (. ~ marital_status)

# As nothing much can be infered from the below graph , hence using Null Hypothesis  for inference 

H? : There is no significant impact of the variable (MARTIAL_STATUS ) on the INCOME_LEVEL variable.

Ha : There exists a significant impact of the variable (MARTIAL_STATUS) on the INCOME_LEVEL variable.

  #creating prop table and then using chi-squaretest to calculate the p-value and then deciding which hypothesis to reject
  myTable <- prop.table(table(catFinalTrain$marital_status,catFinalTrain$income_level),1)
  kable(myTable) %>%
  kable_styling(bootstrap_options = c("striped","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
<=50K >50K
Divorced 0.9167490 0.0832510
Married-A F spouse present 0.9793978 0.0206022
Married-civilian spouse present 0.8871358 0.1128642
Married-spouse absent 0.9378655 0.0621345
Never married 0.9874365 0.0125635
Separated 0.9542978 0.0457022
Widowed 0.9679992 0.0320008
  chisq.test(myTable)
## Warning in chisq.test(myTable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  myTable
## X-squared = 0.15537, df = 6, p-value = 0.9999

As clearly we can see that p-value is greater than significance levl (.05) hence we reject NULL Hyposthesis and can safely assume that Marital_status has significant impact on the Income_level variable.

head(catFinalTrain)
##                   class_of_worker                  education
## 1:                Not in universe       High school graduate
## 2: Self-employed-not incorporated Some college but no degree
## 3:                Not in universe                 10th grade
## 4:                Not in universe                   Children
## 5:                Not in universe                   Children
## 6:                        Private Some college but no degree
##                     marital_status                        race    sex
## 1:                         Widowed                       White Female
## 2:                        Divorced                       White   Male
## 3:                   Never married   Asian or Pacific Islander Female
## 4:                   Never married                       White Female
## 5:                   Never married                       White Female
## 6: Married-civilian spouse present Amer Indian Aleut or Eskimo Female
##     country_self income_level
## 1: United-States        <=50K
## 2: United-States        <=50K
## 3:       Vietnam        <=50K
## 4: United-States        <=50K
## 5: United-States        <=50K
## 6: United-States        <=50K
## RACE 
summary(catFinalTrain$race)
## Amer Indian Aleut or Eskimo   Asian or Pacific Islander 
##                        2214                        5174 
##                       Black                       Other 
##                       19544                        3338 
##                       White 
##                      159459
 qplot (income_level, data = catFinalTrain, fill = race) + facet_grid (. ~ race)

 ###We can safely infer that majority of the user base is White and hence the maximum and minimum in both income level are from White race.And the other significant race is Black and they have maximum userbase with income level <=50K.
 
 
#  sex
ggplot(catFinalTrain,aes(x=sex,fill=income_level))+geom_bar(position = "dodge",  color="black")+scale_fill_brewer(palette = "YlOrBr")+theme(axis.text.x =element_text(angle  = 60,hjust = 1,size=10))

### For the graph we can safely assume that the income_level are skewed against Females for both sections of income group(i.e. <=50k & >50k)


#Country Self
country_incomeDF <- select(catFinalTrain,6:7) %>%

                      mutate(iso_code = countrycode(catFinalTrain$country_self, 'country.name', 'iso3c'))
## Warning in countrycode(catFinalTrain$country_self, "country.name", "iso3c"): Some values were not matched unambiguously: Columbia, England, Scotland, Yugoslavia
table(is.na(country_incomeDF))
## 
##  FALSE   TRUE 
## 568207    980
theCountries <- catFinalTrain$country_self
###It can be clearly infered that US has maximum 


countryCodes <- countrycode(unique(theCountries), 'country.name', 'iso3c')
## Warning in countrycode(unique(theCountries), "country.name", "iso3c"): Some values were not matched unambiguously: Columbia, England, Scotland, Yugoslavia
malDF <- data.frame(country_incomeDF$iso_code,
  inc_level = country_incomeDF$income_level)
# malDF is a data.frame with the ISO3 country names plus a variable to
# merge to the map data


malMap <- joinCountryData2Map(malDF, joinCode = "ISO3",  nameJoinColumn = "country_incomeDF.iso_code")
## 188749 codes from your data successfully matched countries in the map
## 980 codes from your data failed to match with a country code in the map
## 205 codes from the map weren't represented in your data
# This will join your malDF data.frame to the country map data

mapCountryData(malMap, nameColumnToPlot="inc_level", catMethod = "categorical",  missingCountryCol = gray(.8) , colourPalette = c("red", "blue") , mapTitle = "Country Vs Income Level")

### Looking at the graph plotted , for world map and countries with less than <=50k & >50K , we can clearly see that US, has both significant  population of both sides of Income level and similarly India & China has significant levels of user base <=50k income level.

4 Building a Model(Prediction Model)

Using the Boosting algorithm for this classification modeling, as consensus data has some weak predictors.

set.seed (32323)
training_Ctrl = trainControl(method = "cv", number = 2)

boost_Fit_mdl = train(income_level ~ age + class_of_worker + occupation_code +education + wage_per_hour+ marital_status + capital_gains + capital_losses +
                      country_self + weeks_worked_in_year , trControl = training_Ctrl, 
                  method = "gbm", data = finalTrain, verbose = FALSE)


confusionMatrix(finalTrain$income_level, predict (boost_Fit_mdl, finalTrain))

finalTest$predicted = predict(boost_Fit_mdl, finalTest)

confusionMatrix(finalTest$income_level, finalTest$predicted)
#confusion Matrix

5 Summary

It is very important to understand how the built model has performed with respect to a baseline model.Looking at the Baseline model and prediction model using booster algorithm , it is clear that prediction model does perform better than the baseline model.

6 Challenges

  1. Memory was big challenge , as R uses RAM to load the data for in memory calculations. Reduced the no of boost