Data Preparation in R

Execute by Neha Raut

Aim

This Mini Project involves data preparation of dataset cencus_income.csv in order to make it fit for futher analysis and model building.

Description:

1.Creating dummy variable for character variables.
2.Grouping similar category variables and making dummies.
3.Dealing with flag variables.(for numeric variables)
4.Converting the target Variable.(Y)

Data Information:

census_income.csv is a csv file containing 32561 obs and 15 variables.It describes the income range of people with their characteristic attributes.The income range of people is >50k and <=50k which is stored in target variable Y. We need to prepare data for the remaining (14) variables which can be further usefull in building models.

loading library dplyr

library(dplyr)

Read Csv file

data=read.csv("census_income.csv",stringsAsFactors = F)
View(data)

Variables in data

names(data)

##  [1] "age"            "workclass"      "fnlwgt"         "education"     
##  [5] "education.num"  "marital.status" "occupation"     "relationship"  
##  [9] "race"           "sex"            "capital.gain"   "capital.loss"  
## [13] "hours.per.week" "native.country" "Y"

structure of data

glimpse(data)

## Observations: 32,561
## Variables: 15
## $ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
## $ workclass      <chr> " State-gov", " Self-emp-not-inc", " Private", ...
## $ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
## $ education      <chr> " Bachelors", " Bachelors", " HS-grad", " 11th"...
## $ education.num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
## $ marital.status <chr> " Never-married", " Married-civ-spouse", " Divo...
## $ occupation     <chr> " Adm-clerical", " Exec-managerial", " Handlers...
## $ relationship   <chr> " Not-in-family", " Husband", " Not-in-family",...
## $ race           <chr> " White", " White", " White", " Black", " Black...
## $ sex            <chr> " Male", " Male", " Male", " Male", " Female", ...
## $ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
## $ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
## $ native.country <chr> " United-States", " United-States", " United-St...
## $ Y              <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K...

Now here, there are 9 character variables such as workclass, education, marital.status, occupation, relationship, race, sex, native.country and Y(target variable) which needs to be converted to numeric to build model

We will exclude Y as its target variable

summary(data)

##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:32561       Min.   :  12285   Length:32561      
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Class :character  
##  Median :37.00   Mode  :character   Median : 178356   Mode  :character  
##  Mean   :38.58                      Mean   : 189778                     
##  3rd Qu.:48.00                      3rd Qu.: 237051                     
##  Max.   :90.00                      Max.   :1484705                     
##  education.num   marital.status      occupation        relationship      
##  Min.   : 1.00   Length:32561       Length:32561       Length:32561      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.08                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital.gain    capital.loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##  hours.per.week  native.country          Y            
##  Min.   : 1.00   Length:32561       Length:32561      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

Distinct categories in remaining 8 character variables

for(i in 1:ncol(data)){
  if(class(data[,i])=="character"){
    if(names(data)[i]!="Y"){
      message=paste("Number of categories in ",names(data)[i]," : ")
      num.cat=length(unique(data[,i]))
      print(paste0(message,num.cat))
    }
  }
}

## [1] "Number of categories in  workclass  : 9"
## [1] "Number of categories in  education  : 16"
## [1] "Number of categories in  marital.status  : 7"
## [1] "Number of categories in  occupation  : 15"
## [1] "Number of categories in  relationship  : 6"
## [1] "Number of categories in  race  : 5"
## [1] "Number of categories in  sex  : 2"
## [1] "Number of categories in  native.country  : 42"

1. Creating dummies for character variables

(for variables having 6 or <6 categories:relationship,race,sex)

1.1 Creating dummy Variable for race

table(data$race)

## 
##  Amer-Indian-Eskimo  Asian-Pac-Islander               Black 
##                 311                1039                3124 
##               Other               White 
##                 271               27816

lets create 4 dummy variables and ignore which one have least value.

data=data%>%
  mutate(race_AIE=as.numeric(race==" Amer-Indian-Eskimo"),
         race_API=as.numeric(race==" Asian-Pac-Islander"),
         race_Black=as.numeric(race==" Black"),
         race_White=as.numeric(race==" White")) %>% 
  select(-race)

1.2 dummy Variable for sex

table(data$sex)

## 
##  Female    Male 
##   10771   21790

nchar(data$sex[2]) #:there was space in the string of male and female.

## [1] 5

data=data %>% 
  mutate(sex_Female=as.numeric(sex==" Female"),
         sex_Male=as.numeric(sex==" Male")) %>% 
  select(-sex)

1.3 dummy Variable for relationship

table(data$relationship)

## 
##         Husband   Not-in-family  Other-relative       Own-child 
##           13193            8305             981            5068 
##       Unmarried            Wife 
##            3446            1568

data=data %>% 
  mutate(rel_h=as.numeric(relationship==" Husband"),
         rel_nif=as.numeric(relationship==" Not-in-family"),
         rel_oc=as.numeric(relationship==" Own-child"),
         rel_um=as.numeric(relationship==" Unmarried"),
         rel_w=as.numeric(relationship==" Wife")) %>%
  select(-relationship)

2. Dummies for variables having similar categories.

When there are too many categories, we will group similer categories together and make a dummy which has too many categories. There is no direct harm in considering to create dummy variables for n-1 forthem.

2.1 combining similar category of variable workclass and making dummies:

table(data$workclass)

## 
##                 ?       Federal-gov         Local-gov      Never-worked 
##              1836               960              2093                 7 
##           Private      Self-emp-inc  Self-emp-not-inc         State-gov 
##             22696              1116              2541              1298 
##       Without-pay 
##                14

round(prop.table(table(data$workclass,data$Y),1),1)

##                    
##                      <=50K  >50K
##    ?                   0.9   0.1
##    Federal-gov         0.6   0.4
##    Local-gov           0.7   0.3
##    Never-worked        1.0   0.0
##    Private             0.8   0.2
##    Self-emp-inc        0.4   0.6
##    Self-emp-not-inc    0.7   0.3
##    State-gov           0.7   0.3
##    Without-pay         1.0   0.0

addmargins(round(prop.table(table(data$workclass,data$Y),1),1),2)

##                    
##                      <=50K  >50K Sum
##    ?                   0.9   0.1 1.0
##    Federal-gov         0.6   0.4 1.0
##    Local-gov           0.7   0.3 1.0
##    Never-worked        1.0   0.0 1.0
##    Private             0.8   0.2 1.0
##    Self-emp-inc        0.4   0.6 1.0
##    Self-emp-not-inc    0.7   0.3 1.0
##    State-gov           0.7   0.3 1.0
##    Without-pay         1.0   0.0 1.0

Here never worked and without pay is taken as base and we create 5 dummies as follows:

data=data %>%
  mutate(wc_1=as.numeric(workclass==" Self-emp-inc"),
         wc_2=as.numeric(workclass==" Federal-gov"),
         wc_3=as.numeric(workclass %in% c(" Local-gov"," Self-emp-not-inc"," State-gov")),
         wc_4=as.numeric(workclass==" Private"),
         wc_5=as.numeric(workclass==" ?")) %>%
  select(-workclass)

2.2 combining similar category of variable education and making dummies:

round(prop.table(table(data$education,data$Y),1),1)

##                
##                  <=50K  >50K
##    10th            0.9   0.1
##    11th            0.9   0.1
##    12th            0.9   0.1
##    1st-4th         1.0   0.0
##    5th-6th         1.0   0.0
##    7th-8th         0.9   0.1
##    9th             0.9   0.1
##    Assoc-acdm      0.8   0.2
##    Assoc-voc       0.7   0.3
##    Bachelors       0.6   0.4
##    Doctorate       0.3   0.7
##    HS-grad         0.8   0.2
##    Masters         0.4   0.6
##    Preschool       1.0   0.0
##    Prof-school     0.3   0.7
##    Some-college    0.8   0.2

here 1st-4th, 5th-6th, Preschool is taken as base and we create 6 dummies by combining similar categories

data=data %>%
  mutate(edu_1=as.numeric(education %in% c(" 10th"," 11th"," 12th"," 7th-8th"," 9th")),
         edu_2=as.numeric(education %in% c(" 1st-4th"," 5th-6th"," Preschool")),
         edu_3=as.numeric(education %in% c(" Assoc-acdm"," HS-grad"," Some-college")),
         edu_4=as.numeric(education ==" Assoc-voc"),
         edu_5=as.numeric(education==" Bachelors"),
         edu_6=as.numeric(education==" Masters")) %>%
  select(-education)

2.3 combining similar category of variable marital status and making dummies:

round(prop.table(table(data$marital.status,data$Y),1),1)

##                         
##                           <=50K  >50K
##    Divorced                 0.9   0.1
##    Married-AF-spouse        0.6   0.4
##    Married-civ-spouse       0.6   0.4
##    Married-spouse-absent    0.9   0.1
##    Never-married            1.0   0.0
##    Separated                0.9   0.1
##    Widowed                  0.9   0.1

data=data %>%
  mutate(ms_1=as.numeric(marital.status==" Never-married"),
         ms_2=as.numeric(marital.status %in% c(" Married-AF-spouse"," Married-civ-spouse"))) %>%
  select(-marital.status)

2.4 combining similar category of variable occupation and making dummies:

round(prop.table(table(data$occupation,data$Y),1),1)

##                     
##                       <=50K  >50K
##    ?                    0.9   0.1
##    Adm-clerical         0.9   0.1
##    Armed-Forces         0.9   0.1
##    Craft-repair         0.8   0.2
##    Exec-managerial      0.5   0.5
##    Farming-fishing      0.9   0.1
##    Handlers-cleaners    0.9   0.1
##    Machine-op-inspct    0.9   0.1
##    Other-service        1.0   0.0
##    Priv-house-serv      1.0   0.0
##    Prof-specialty       0.6   0.4
##    Protective-serv      0.7   0.3
##    Sales                0.7   0.3
##    Tech-support         0.7   0.3
##    Transport-moving     0.8   0.2

data=data %>%
  mutate(oc_1=as.numeric(occupation==" Exec-managerial"),
         oc_2=as.numeric(occupation==" Prof-specialty"),
         oc_3=as.numeric(occupation %in% c(" Protective-serv"," Sales"," Tech-support")),
         oc_4=as.numeric(occupation %in% c(" Craft-repair"," Transport-moving")),
         oc_5=as.numeric(occupation %in% c(" Priv-house-serv"," Other-service"))) %>%
  select(-occupation)

2.5 combining similar category of variable native.country and making dummies:

a=round(prop.table(table(data$native.country,data$Y),1),1)
a

##                              
##                                <=50K  >50K
##    ?                             0.7   0.3
##    Cambodia                      0.6   0.4
##    Canada                        0.7   0.3
##    China                         0.7   0.3
##    Columbia                      1.0   0.0
##    Cuba                          0.7   0.3
##    Dominican-Republic            1.0   0.0
##    Ecuador                       0.9   0.1
##    El-Salvador                   0.9   0.1
##    England                       0.7   0.3
##    France                        0.6   0.4
##    Germany                       0.7   0.3
##    Greece                        0.7   0.3
##    Guatemala                     1.0   0.0
##    Haiti                         0.9   0.1
##    Holand-Netherlands            1.0   0.0
##    Honduras                      0.9   0.1
##    Hong                          0.7   0.3
##    Hungary                       0.8   0.2
##    India                         0.6   0.4
##    Iran                          0.6   0.4
##    Ireland                       0.8   0.2
##    Italy                         0.7   0.3
##    Jamaica                       0.9   0.1
##    Japan                         0.6   0.4
##    Laos                          0.9   0.1
##    Mexico                        0.9   0.1
##    Nicaragua                     0.9   0.1
##    Outlying-US(Guam-USVI-etc)    1.0   0.0
##    Peru                          0.9   0.1
##    Philippines                   0.7   0.3
##    Poland                        0.8   0.2
##    Portugal                      0.9   0.1
##    Puerto-Rico                   0.9   0.1
##    Scotland                      0.8   0.2
##    South                         0.8   0.2
##    Taiwan                        0.6   0.4
##    Thailand                      0.8   0.2
##    Trinadad&Tobago               0.9   0.1
##    United-States                 0.8   0.2
##    Vietnam                       0.9   0.1
##    Yugoslavia                    0.6   0.4

sort(a[,1]) #sorting makes it easier to combine

##                    Cambodia                      France 
##                         0.6                         0.6 
##                       India                        Iran 
##                         0.6                         0.6 
##                       Japan                      Taiwan 
##                         0.6                         0.6 
##                  Yugoslavia                           ? 
##                         0.6                         0.7 
##                      Canada                       China 
##                         0.7                         0.7 
##                        Cuba                     England 
##                         0.7                         0.7 
##                     Germany                      Greece 
##                         0.7                         0.7 
##                        Hong                       Italy 
##                         0.7                         0.7 
##                 Philippines                     Hungary 
##                         0.7                         0.8 
##                     Ireland                      Poland 
##                         0.8                         0.8 
##                    Scotland                       South 
##                         0.8                         0.8 
##                    Thailand               United-States 
##                         0.8                         0.8 
##                     Ecuador                 El-Salvador 
##                         0.9                         0.9 
##                       Haiti                    Honduras 
##                         0.9                         0.9 
##                     Jamaica                        Laos 
##                         0.9                         0.9 
##                      Mexico                   Nicaragua 
##                         0.9                         0.9 
##                        Peru                    Portugal 
##                         0.9                         0.9 
##                 Puerto-Rico             Trinadad&Tobago 
##                         0.9                         0.9 
##                     Vietnam                    Columbia 
##                         0.9                         1.0 
##          Dominican-Republic                   Guatemala 
##                         1.0                         1.0 
##          Holand-Netherlands  Outlying-US(Guam-USVI-etc) 
##                         1.0                         1.0

data=data %>%
  mutate(nc_1=as.numeric(native.country %in% c(" Cambodia"," France"," India",
                                               " Iran"," Japan"," Taiwan"," Yugoslavia")),
         nc_2=as.numeric(native.country %in% c(" ?"," Canada"," China"," Cuba"," England",
                                               " Germany"," Greece"," Hong"," Italy",
                                               " Philippines")),
         nc_3=as.numeric(native.country %in% c(" Hungary"," Ireland"," Poland"," Scotland",
                                               " South"," Thailand"," United-States")),
         nc_4=as.numeric(native.country %in% c(" Columbia"," Dominican-Republic",
                                               " Guatemala"," Holand-Netherlands",
                                               " Outlying-US(Guam-USVI-etc)"))) %>%
  select(-native.country)

Now we are done with converting all character variables into dummy

glimpse(data)

## Observations: 32,561
## Variables: 40
## $ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
## $ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
## $ education.num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
## $ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
## $ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
## $ Y              <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K...
## $ race_AIE       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ race_API       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ race_Black     <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ race_White     <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,...
## $ sex_Female     <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,...
## $ sex_Male       <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,...
## $ rel_h          <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ rel_nif        <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ rel_oc         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ rel_um         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ rel_w          <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_1           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_2           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_3           <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ wc_4           <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,...
## $ wc_5           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_1          <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ edu_2          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_3          <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ edu_4          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ edu_5          <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ edu_6          <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ ms_1           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ms_2           <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ oc_1           <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ oc_2           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ oc_3           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ oc_4           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ oc_5           <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nc_1           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ nc_2           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ nc_3           <dbl> 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,...
## $ nc_4           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

3. Create flag variable

We do this for non categorical data i.e numeric data values

3.1 create flag variable for Capital Gain

table(data$capital.gain)

## 
##     0   114   401   594   914   991  1055  1086  1111  1151  1173  1409 
## 29849     6     2    34     8     5    25     4     1     8     3     7 
##  1424  1455  1471  1506  1639  1797  1831  1848  2009  2036  2050  2062 
##     3     1     7    15     1     7     7     6     3     4     5     2 
##  2105  2174  2176  2202  2228  2290  2329  2346  2354  2387  2407  2414 
##     9    48    23    16     5     5     6     6    11     1    19     8 
##  2463  2538  2580  2597  2635  2653  2829  2885  2907  2936  2961  2964 
##    11     1    12    20    11     5    31    24    11     3     3     9 
##  2977  2993  3103  3137  3273  3325  3411  3418  3432  3456  3464  3471 
##     8     2    97    37     6    53    24     5     4     2    23     8 
##  3674  3781  3818  3887  3908  3942  4064  4101  4386  4416  4508  4650 
##    14    12     7     6    32    14    42    20    70    12    12    41 
##  4687  4787  4865  4931  4934  5013  5060  5178  5455  5556  5721  6097 
##     3    23    17     1     7    69     1    97    11     5     3     1 
##  6360  6418  6497  6514  6723  6767  6849  7298  7430  7443  7688  7896 
##     3     9    11     5     2     5    27   246     9     5   284     3 
##  7978  8614  9386  9562 10520 10566 10605 11678 13550 14084 14344 15020 
##     1    55    22     4    43     6    12     2    27    41    26     5 
## 15024 15831 18481 20051 22040 25124 25236 27828 34095 41310 99999 
##   347     6     2    37     1     4    11    34     5     2   159

Here we can see 0 occures 29849 times. So now lets find percentage of occurance

sum(data$capital.gain==0)/nrow(data)

## [1] 0.9167102

More than 90% values are 0 , lets go ahead create a flag variable for this
hence wherever capital gain value is zero,it is mutated as 1,rest values take value 0.

data=data %>%
mutate(cg_flag0=as.numeric(capital.gain==0))

3.2 similarly lets do for variable capital loss

table(data$capital.loss)

## 
##     0   155   213   323   419   625   653   810   880   974  1092  1138 
## 31042     1     4     3     3    12     3     2     6     2     7     2 
##  1258  1340  1380  1408  1411  1485  1504  1539  1564  1573  1579  1590 
##     4     7     7    21     1    51    18     1    25     6    20    40 
##  1594  1602  1617  1628  1648  1651  1668  1669  1672  1719  1721  1726 
##     8    47     9    15     2     9     4    24    34    22    18     4 
##  1735  1740  1741  1755  1762  1816  1825  1844  1848  1876  1887  1902 
##     2    42    24     2    14     2     4     1    51    39   159   202 
##  1944  1974  1977  1980  2001  2002  2042  2051  2057  2080  2129  2149 
##     1    18   168    23    24    21     9    21     6     1     3     2 
##  2163  2174  2179  2201  2205  2206  2231  2238  2246  2258  2267  2282 
##     1     7    15     1     9     6     3     2     6    25     3     1 
##  2339  2352  2377  2392  2415  2444  2457  2467  2472  2489  2547  2559 
##    17     2    20     9    49    12     3     1     1     1     4    12 
##  2603  2754  2824  3004  3683  3770  3900  4356 
##     5     2    10     2     2     2     2     3

lets find its percenatge fo ocuurance

sum(data$capital.loss==0)/nrow(data)

## [1] 0.9533491

95% more is capital loss values equal to zero.hence lets flag it.

data=data %>%
  mutate(cl_flag0=as.numeric(capital.loss==0))

Converting the target

data$Y=as.numeric(data$Y==" >50K")

All values where Y>=50 is made “1” else rest everything is made “0”

glimpse(data)

## Observations: 32,561
## Variables: 42
## $ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
## $ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
## $ education.num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
## $ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
## $ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
## $ Y              <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0,...
## $ race_AIE       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ race_API       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ race_Black     <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ race_White     <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,...
## $ sex_Female     <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,...
## $ sex_Male       <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,...
## $ rel_h          <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ rel_nif        <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ rel_oc         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ rel_um         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ rel_w          <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_1           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_2           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_3           <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ wc_4           <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,...
## $ wc_5           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_1          <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ edu_2          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_3          <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ edu_4          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ edu_5          <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ edu_6          <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ ms_1           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ms_2           <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ oc_1           <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ oc_2           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ oc_3           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ oc_4           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ oc_5           <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nc_1           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ nc_2           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ nc_3           <dbl> 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,...
## $ nc_4           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ cg_flag0       <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,...
## $ cl_flag0       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...