Data Preparation in R

Aim:

This Mini Project involves data preparation of dataset cencus_income.csv in order to make it fit for futher analysis and model building.

Description:

Creating dummy variable for character variables.
Grouping similar category variables and making dummies.
Dealing with flag variables.(for numeric variables)
Converting the target Variable.(Y)

Data Information:

census_income.csv is a csv file containing 32561 obs and 15 variables.It describes the income range of people with their characteristic attributes.The income range of people is >50k and <=50k which is stored in target variable Y. We need to prepare data for the remaining (14) variables which can be further usefull in building models.

Initial setup

loading library dplyr

library(dplyr)

reading file census_income.csv into d

d=read.csv("census_income.csv",stringsAsFactors = F)
View(d)

variables in d are as follows:

names(d)

##  [1] "age"            "workclass"      "fnlwgt"         "education"     
##  [5] "education.num"  "marital.status" "occupation"     "relationship"  
##  [9] "race"           "sex"            "capital.gain"   "capital.loss"  
## [13] "hours.per.week" "native.country" "Y"

structure of dataset is:

glimpse(d)

## Observations: 32,561
## Variables: 15
## $ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
## $ workclass      <chr> " State-gov", " Self-emp-not-inc", " Private", ...
## $ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
## $ education      <chr> " Bachelors", " Bachelors", " HS-grad", " 11th"...
## $ education.num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
## $ marital.status <chr> " Never-married", " Married-civ-spouse", " Divo...
## $ occupation     <chr> " Adm-clerical", " Exec-managerial", " Handlers...
## $ relationship   <chr> " Not-in-family", " Husband", " Not-in-family",...
## $ race           <chr> " White", " White", " White", " Black", " Black...
## $ sex            <chr> " Male", " Male", " Male", " Male", " Female", ...
## $ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
## $ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
## $ native.country <chr> " United-States", " United-States", " United-St...
## $ Y              <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K...

The dataset includes 9 character variables namely workclass, education, marital.status, occupation, relationship, race, sex, native.country and Y(target variable) which needs to be converted accordingly in order to make it fit for modelling.

lets summarise the dataset

summary(d)

##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:32561       Min.   :  12285   Length:32561      
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Class :character  
##  Median :37.00   Mode  :character   Median : 178356   Mode  :character  
##  Mean   :38.58                      Mean   : 189778                     
##  3rd Qu.:48.00                      3rd Qu.: 237051                     
##  Max.   :90.00                      Max.   :1484705                     
##  education.num   marital.status      occupation        relationship      
##  Min.   : 1.00   Length:32561       Length:32561       Length:32561      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.08                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital.gain    capital.loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##  hours.per.week  native.country          Y            
##  Min.   : 1.00   Length:32561       Length:32561      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

finding out the number of distinct caztegories in each 8 character variables.(excludes Y as its the target variable)

for(i in 1:ncol(d)){
  if(class(d[,i])=="character"){
    if(names(d)[i]!="Y"){
      message=paste("Number of categories in ",names(d)[i]," : ")
      num.cat=length(unique(d[,i]))
      print(paste0(message,num.cat))
    }
  }
}

## [1] "Number of categories in  workclass  : 9"
## [1] "Number of categories in  education  : 16"
## [1] "Number of categories in  marital.status  : 7"
## [1] "Number of categories in  occupation  : 15"
## [1] "Number of categories in  relationship  : 6"
## [1] "Number of categories in  race  : 5"
## [1] "Number of categories in  sex  : 2"
## [1] "Number of categories in  native.country  : 42"

1. Creating dummy variable : (for variables having 6 or <6 categories:relationship,race,sex)

creating dummy variable for race:

table(d$race)

## 
##  Amer-Indian-Eskimo  Asian-Pac-Islander               Black 
##                 311                1039                3124 
##               Other               White 
##                 271               27816

lets create 4 dummy variables considering others as base variable.

d=d%>%
  mutate(race_AIE=as.numeric(race==" Amer-Indian-Eskimo"),
         race_API=as.numeric(race==" Asian-Pac-Islander"),
         race_Black=as.numeric(race==" Black"),
         race_White=as.numeric(race==" White")) %>% 
  select(-race)

lets create dummy for sex(2category)

table(d$sex)

## 
##  Female    Male 
##   10771   21790

nchar(d$sex[2]) #:there was space in the string of male and female.

## [1] 5

d=d %>% 
  mutate(sex_Female=as.numeric(sex==" Female"),
         sex_Male=as.numeric(sex==" Male")) %>% 
  select(-sex)

lets create dummy for relationship(6category)

table(d$relationship)

## 
##         Husband   Not-in-family  Other-relative       Own-child 
##           13193            8305             981            5068 
##       Unmarried            Wife 
##            3446            1568

d=d %>% 
  mutate(rel_h=as.numeric(relationship==" Husband"),
         rel_nif=as.numeric(relationship==" Not-in-family"),
         rel_oc=as.numeric(relationship==" Own-child"),
         rel_um=as.numeric(relationship==" Unmarried"),
         rel_w=as.numeric(relationship==" Wife")) %>%
  select(-relationship)

here other relative were taken as base variable and 5 dummies were made.

2. Dummies for variables having similar categories.

When categories present in a variable are too many,we group similar categories and then make dummies.

combining similar category of variable workclass and making dummies:

table(d$workclass)

## 
##                 ?       Federal-gov         Local-gov      Never-worked 
##              1836               960              2093                 7 
##           Private      Self-emp-inc  Self-emp-not-inc         State-gov 
##             22696              1116              2541              1298 
##       Without-pay 
##                14

here no of categories:9 dummy variables:8 but that too much hence we combine similar categories.

round(prop.table(table(d$workclass,d$Y),1),1)

##                    
##                      <=50K  >50K
##    ?                   0.9   0.1
##    Federal-gov         0.6   0.4
##    Local-gov           0.7   0.3
##    Never-worked        1.0   0.0
##    Private             0.8   0.2
##    Self-emp-inc        0.4   0.6
##    Self-emp-not-inc    0.7   0.3
##    State-gov           0.7   0.3
##    Without-pay         1.0   0.0

addmargins(round(prop.table(table(d$workclass,d$Y),1),1),2)

##                    
##                      <=50K  >50K Sum
##    ?                   0.9   0.1 1.0
##    Federal-gov         0.6   0.4 1.0
##    Local-gov           0.7   0.3 1.0
##    Never-worked        1.0   0.0 1.0
##    Private             0.8   0.2 1.0
##    Self-emp-inc        0.4   0.6 1.0
##    Self-emp-not-inc    0.7   0.3 1.0
##    State-gov           0.7   0.3 1.0
##    Without-pay         1.0   0.0 1.0

Here never worked and without pay is taken as base.and we create 5 dummies as follows:

d=d %>%
  mutate(wc_1=as.numeric(workclass==" Self-emp-inc"),
         wc_2=as.numeric(workclass==" Federal-gov"),
         wc_3=as.numeric(workclass %in% c(" Local-gov"," Self-emp-not-inc"," State-gov")),
         wc_4=as.numeric(workclass==" Private"),
         wc_5=as.numeric(workclass==" ?")) %>%
  select(-workclass)

combining similar category of variable education and making dummies:

round(prop.table(table(d$education,d$Y),1),1)

##                
##                  <=50K  >50K
##    10th            0.9   0.1
##    11th            0.9   0.1
##    12th            0.9   0.1
##    1st-4th         1.0   0.0
##    5th-6th         1.0   0.0
##    7th-8th         0.9   0.1
##    9th             0.9   0.1
##    Assoc-acdm      0.8   0.2
##    Assoc-voc       0.7   0.3
##    Bachelors       0.6   0.4
##    Doctorate       0.3   0.7
##    HS-grad         0.8   0.2
##    Masters         0.4   0.6
##    Preschool       1.0   0.0
##    Prof-school     0.3   0.7
##    Some-college    0.8   0.2

here we create 6 dummies by combining similar categories.

d=d %>%
  mutate(edu_1=as.numeric(education %in% c(" 10th"," 11th"," 12th"," 7th-8th"," 9th")),
         edu_2=as.numeric(education %in% c(" 1st-4th"," 5th-6th"," Preschool")),
         edu_3=as.numeric(education %in% c(" Assoc-acdm"," HS-grad"," Some-college")),
         edu_4=as.numeric(education ==" Assoc-voc"),
         edu_5=as.numeric(education==" Bachelors"),
         edu_6=as.numeric(education==" Masters")) %>%
  select(-education)

combining similar category of variable marital status and making dummies:

round(prop.table(table(d$marital.status,d$Y),1),1)

##                         
##                           <=50K  >50K
##    Divorced                 0.9   0.1
##    Married-AF-spouse        0.6   0.4
##    Married-civ-spouse       0.6   0.4
##    Married-spouse-absent    0.9   0.1
##    Never-married            1.0   0.0
##    Separated                0.9   0.1
##    Widowed                  0.9   0.1

d=d %>%
  mutate(ms_1=as.numeric(marital.status==" Never-married"),
         ms_2=as.numeric(marital.status %in% c(" Married-AF-spouse"," Married-civ-spouse"))) %>%
  select(-marital.status)

combining similar category of variable occupation and making dummies:

round(prop.table(table(d$occupation,d$Y),1),1)

##                     
##                       <=50K  >50K
##    ?                    0.9   0.1
##    Adm-clerical         0.9   0.1
##    Armed-Forces         0.9   0.1
##    Craft-repair         0.8   0.2
##    Exec-managerial      0.5   0.5
##    Farming-fishing      0.9   0.1
##    Handlers-cleaners    0.9   0.1
##    Machine-op-inspct    0.9   0.1
##    Other-service        1.0   0.0
##    Priv-house-serv      1.0   0.0
##    Prof-specialty       0.6   0.4
##    Protective-serv      0.7   0.3
##    Sales                0.7   0.3
##    Tech-support         0.7   0.3
##    Transport-moving     0.8   0.2

d=d %>%
  mutate(oc_1=as.numeric(occupation==" Exec-managerial"),
         oc_2=as.numeric(occupation==" Prof-specialty"),
         oc_3=as.numeric(occupation %in% c(" Protective-serv"," Sales"," Tech-support")),
         oc_4=as.numeric(occupation %in% c(" Craft-repair"," Transport-moving")),
         oc_5=as.numeric(occupation %in% c(" Priv-house-serv"," Other-service"))) %>%
  select(-occupation)

combining similar category of variable native.country and making dummies:

k=round(prop.table(table(d$native.country,d$Y),1),1)
k

##                              
##                                <=50K  >50K
##    ?                             0.7   0.3
##    Cambodia                      0.6   0.4
##    Canada                        0.7   0.3
##    China                         0.7   0.3
##    Columbia                      1.0   0.0
##    Cuba                          0.7   0.3
##    Dominican-Republic            1.0   0.0
##    Ecuador                       0.9   0.1
##    El-Salvador                   0.9   0.1
##    England                       0.7   0.3
##    France                        0.6   0.4
##    Germany                       0.7   0.3
##    Greece                        0.7   0.3
##    Guatemala                     1.0   0.0
##    Haiti                         0.9   0.1
##    Holand-Netherlands            1.0   0.0
##    Honduras                      0.9   0.1
##    Hong                          0.7   0.3
##    Hungary                       0.8   0.2
##    India                         0.6   0.4
##    Iran                          0.6   0.4
##    Ireland                       0.8   0.2
##    Italy                         0.7   0.3
##    Jamaica                       0.9   0.1
##    Japan                         0.6   0.4
##    Laos                          0.9   0.1
##    Mexico                        0.9   0.1
##    Nicaragua                     0.9   0.1
##    Outlying-US(Guam-USVI-etc)    1.0   0.0
##    Peru                          0.9   0.1
##    Philippines                   0.7   0.3
##    Poland                        0.8   0.2
##    Portugal                      0.9   0.1
##    Puerto-Rico                   0.9   0.1
##    Scotland                      0.8   0.2
##    South                         0.8   0.2
##    Taiwan                        0.6   0.4
##    Thailand                      0.8   0.2
##    Trinadad&Tobago               0.9   0.1
##    United-States                 0.8   0.2
##    Vietnam                       0.9   0.1
##    Yugoslavia                    0.6   0.4

sort(k[,1]) #sorting makes it easier to combine

##                    Cambodia                      France 
##                         0.6                         0.6 
##                       India                        Iran 
##                         0.6                         0.6 
##                       Japan                      Taiwan 
##                         0.6                         0.6 
##                  Yugoslavia                           ? 
##                         0.6                         0.7 
##                      Canada                       China 
##                         0.7                         0.7 
##                        Cuba                     England 
##                         0.7                         0.7 
##                     Germany                      Greece 
##                         0.7                         0.7 
##                        Hong                       Italy 
##                         0.7                         0.7 
##                 Philippines                     Hungary 
##                         0.7                         0.8 
##                     Ireland                      Poland 
##                         0.8                         0.8 
##                    Scotland                       South 
##                         0.8                         0.8 
##                    Thailand               United-States 
##                         0.8                         0.8 
##                     Ecuador                 El-Salvador 
##                         0.9                         0.9 
##                       Haiti                    Honduras 
##                         0.9                         0.9 
##                     Jamaica                        Laos 
##                         0.9                         0.9 
##                      Mexico                   Nicaragua 
##                         0.9                         0.9 
##                        Peru                    Portugal 
##                         0.9                         0.9 
##                 Puerto-Rico             Trinadad&Tobago 
##                         0.9                         0.9 
##                     Vietnam                    Columbia 
##                         0.9                         1.0 
##          Dominican-Republic                   Guatemala 
##                         1.0                         1.0 
##          Holand-Netherlands  Outlying-US(Guam-USVI-etc) 
##                         1.0                         1.0

d=d %>%
  mutate(nc_1=as.numeric(native.country %in% c(" Cambodia"," France"," India",
                                               " Iran"," Japan"," Taiwan"," Yugoslavia")),
         nc_2=as.numeric(native.country %in% c(" ?"," Canada"," China"," Cuba"," England",
                                               " Germany"," Greece"," Hong"," Italy",
                                               " Philippines")),
         nc_3=as.numeric(native.country %in% c(" Hungary"," Ireland"," Poland"," Scotland",
                                               " South"," Thailand"," United-States")),
         nc_4=as.numeric(native.country %in% c(" Columbia"," Dominican-Republic",
                                               " Guatemala"," Holand-Netherlands",
                                               " Outlying-US(Guam-USVI-etc)"))) %>%
  select(-native.country)

here 4 dummies are made for native.country.

hence we succesfully converted all character variables into dummy.

glimpse(d)

## Observations: 32,561
## Variables: 40
## $ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
## $ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
## $ education.num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
## $ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
## $ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
## $ Y              <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K...
## $ race_AIE       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ race_API       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ race_Black     <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ race_White     <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,...
## $ sex_Female     <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,...
## $ sex_Male       <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,...
## $ rel_h          <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ rel_nif        <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ rel_oc         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ rel_um         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ rel_w          <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_1           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_2           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_3           <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ wc_4           <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,...
## $ wc_5           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_1          <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ edu_2          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_3          <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ edu_4          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ edu_5          <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ edu_6          <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ ms_1           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ms_2           <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ oc_1           <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ oc_2           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ oc_3           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ oc_4           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ oc_5           <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nc_1           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ nc_2           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ nc_3           <dbl> 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,...
## $ nc_4           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

3. Dealing with flag variables.

used for non categorical data values(ie numeric data values)

Dealing with variable capital gain

table(d$capital.gain)

## 
##     0   114   401   594   914   991  1055  1086  1111  1151  1173  1409 
## 29849     6     2    34     8     5    25     4     1     8     3     7 
##  1424  1455  1471  1506  1639  1797  1831  1848  2009  2036  2050  2062 
##     3     1     7    15     1     7     7     6     3     4     5     2 
##  2105  2174  2176  2202  2228  2290  2329  2346  2354  2387  2407  2414 
##     9    48    23    16     5     5     6     6    11     1    19     8 
##  2463  2538  2580  2597  2635  2653  2829  2885  2907  2936  2961  2964 
##    11     1    12    20    11     5    31    24    11     3     3     9 
##  2977  2993  3103  3137  3273  3325  3411  3418  3432  3456  3464  3471 
##     8     2    97    37     6    53    24     5     4     2    23     8 
##  3674  3781  3818  3887  3908  3942  4064  4101  4386  4416  4508  4650 
##    14    12     7     6    32    14    42    20    70    12    12    41 
##  4687  4787  4865  4931  4934  5013  5060  5178  5455  5556  5721  6097 
##     3    23    17     1     7    69     1    97    11     5     3     1 
##  6360  6418  6497  6514  6723  6767  6849  7298  7430  7443  7688  7896 
##     3     9    11     5     2     5    27   246     9     5   284     3 
##  7978  8614  9386  9562 10520 10566 10605 11678 13550 14084 14344 15020 
##     1    55    22     4    43     6    12     2    27    41    26     5 
## 15024 15831 18481 20051 22040 25124 25236 27828 34095 41310 99999 
##   347     6     2    37     1     4    11    34     5     2   159

Here 0 occurs 29k times. Hence lets find its percentage of occurance. This will give % of observations where capital.gain is 0

sum(d$capital.gain==0)/nrow(d)

## [1] 0.9167102

More than 90% values are 0 , lets go ahead create a flag variable for this:

d=d %>%
  mutate(cg_flag0=as.numeric(capital.gain==0))

hence wherever capital gain value is zero,it is mutated as 1,rest values take value 0.

similarly lets do for variable capital loss

table(d$capital.loss)

## 
##     0   155   213   323   419   625   653   810   880   974  1092  1138 
## 31042     1     4     3     3    12     3     2     6     2     7     2 
##  1258  1340  1380  1408  1411  1485  1504  1539  1564  1573  1579  1590 
##     4     7     7    21     1    51    18     1    25     6    20    40 
##  1594  1602  1617  1628  1648  1651  1668  1669  1672  1719  1721  1726 
##     8    47     9    15     2     9     4    24    34    22    18     4 
##  1735  1740  1741  1755  1762  1816  1825  1844  1848  1876  1887  1902 
##     2    42    24     2    14     2     4     1    51    39   159   202 
##  1944  1974  1977  1980  2001  2002  2042  2051  2057  2080  2129  2149 
##     1    18   168    23    24    21     9    21     6     1     3     2 
##  2163  2174  2179  2201  2205  2206  2231  2238  2246  2258  2267  2282 
##     1     7    15     1     9     6     3     2     6    25     3     1 
##  2339  2352  2377  2392  2415  2444  2457  2467  2472  2489  2547  2559 
##    17     2    20     9    49    12     3     1     1     1     4    12 
##  2603  2754  2824  3004  3683  3770  3900  4356 
##     5     2    10     2     2     2     2     3

lets find its percenatge fo ocuurance

sum(d$capital.loss==0)/nrow(d)

## [1] 0.9533491

95% more is capital loss values equal to zero.hence lets flag it.

d=d %>%
  mutate(cl_flag0=as.numeric(capital.loss==0))

4. Converting the target Variable.(Y)

d$Y=as.numeric(d$Y==" >50K")

All values where Y>=50 is made “1” else rest everything is made “0”

glimpse(d)

## Observations: 32,561
## Variables: 42
## $ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
## $ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
## $ education.num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
## $ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
## $ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
## $ Y              <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0,...
## $ race_AIE       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ race_API       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ race_Black     <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ race_White     <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,...
## $ sex_Female     <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,...
## $ sex_Male       <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,...
## $ rel_h          <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ rel_nif        <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ rel_oc         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ rel_um         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ rel_w          <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_1           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_2           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_3           <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ wc_4           <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,...
## $ wc_5           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_1          <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ edu_2          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_3          <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ edu_4          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ edu_5          <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ edu_6          <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ ms_1           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ms_2           <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ oc_1           <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ oc_2           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ oc_3           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ oc_4           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ oc_5           <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nc_1           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ nc_2           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ nc_3           <dbl> 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,...
## $ nc_4           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ cg_flag0       <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,...
## $ cl_flag0       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...