census_income.csv is a csv file containing 32561 obs and 15 variables.It describes the income range of people with their characteristic attributes.The income range of people is >50k and <=50k which is stored in target variable Y. We need to prepare data for the remaining (14) variables which can be further usefull in building models.
loading library dplyr
library(dplyr)
reading file census_income.csv into d
d=read.csv("census_income.csv",stringsAsFactors = F)
View(d)
variables in d are as follows:
names(d)
## [1] "age" "workclass" "fnlwgt" "education"
## [5] "education.num" "marital.status" "occupation" "relationship"
## [9] "race" "sex" "capital.gain" "capital.loss"
## [13] "hours.per.week" "native.country" "Y"
structure of dataset is:
glimpse(d)
## Observations: 32,561
## Variables: 15
## $ age <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
## $ workclass <chr> " State-gov", " Self-emp-not-inc", " Private", ...
## $ fnlwgt <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
## $ education <chr> " Bachelors", " Bachelors", " HS-grad", " 11th"...
## $ education.num <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
## $ marital.status <chr> " Never-married", " Married-civ-spouse", " Divo...
## $ occupation <chr> " Adm-clerical", " Exec-managerial", " Handlers...
## $ relationship <chr> " Not-in-family", " Husband", " Not-in-family",...
## $ race <chr> " White", " White", " White", " Black", " Black...
## $ sex <chr> " Male", " Male", " Male", " Male", " Female", ...
## $ capital.gain <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
## $ capital.loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
## $ native.country <chr> " United-States", " United-States", " United-St...
## $ Y <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K...
The dataset includes 9 character variables namely workclass, education, marital.status, occupation, relationship, race, sex, native.country and Y(target variable) which needs to be converted accordingly in order to make it fit for modelling.
lets summarise the dataset
summary(d)
## age workclass fnlwgt education
## Min. :17.00 Length:32561 Min. : 12285 Length:32561
## 1st Qu.:28.00 Class :character 1st Qu.: 117827 Class :character
## Median :37.00 Mode :character Median : 178356 Mode :character
## Mean :38.58 Mean : 189778
## 3rd Qu.:48.00 3rd Qu.: 237051
## Max. :90.00 Max. :1484705
## education.num marital.status occupation relationship
## Min. : 1.00 Length:32561 Length:32561 Length:32561
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.08
## 3rd Qu.:12.00
## Max. :16.00
## race sex capital.gain capital.loss
## Length:32561 Length:32561 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1078 Mean : 87.3
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :4356.0
## hours.per.week native.country Y
## Min. : 1.00 Length:32561 Length:32561
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
finding out the number of distinct caztegories in each 8 character variables.(excludes Y as its the target variable)
for(i in 1:ncol(d)){
if(class(d[,i])=="character"){
if(names(d)[i]!="Y"){
message=paste("Number of categories in ",names(d)[i]," : ")
num.cat=length(unique(d[,i]))
print(paste0(message,num.cat))
}
}
}
## [1] "Number of categories in workclass : 9"
## [1] "Number of categories in education : 16"
## [1] "Number of categories in marital.status : 7"
## [1] "Number of categories in occupation : 15"
## [1] "Number of categories in relationship : 6"
## [1] "Number of categories in race : 5"
## [1] "Number of categories in sex : 2"
## [1] "Number of categories in native.country : 42"
table(d$race)
##
## Amer-Indian-Eskimo Asian-Pac-Islander Black
## 311 1039 3124
## Other White
## 271 27816
lets create 4 dummy variables considering others as base variable.
d=d%>%
mutate(race_AIE=as.numeric(race==" Amer-Indian-Eskimo"),
race_API=as.numeric(race==" Asian-Pac-Islander"),
race_Black=as.numeric(race==" Black"),
race_White=as.numeric(race==" White")) %>%
select(-race)
table(d$sex)
##
## Female Male
## 10771 21790
nchar(d$sex[2]) #:there was space in the string of male and female.
## [1] 5
d=d %>%
mutate(sex_Female=as.numeric(sex==" Female"),
sex_Male=as.numeric(sex==" Male")) %>%
select(-sex)
table(d$relationship)
##
## Husband Not-in-family Other-relative Own-child
## 13193 8305 981 5068
## Unmarried Wife
## 3446 1568
d=d %>%
mutate(rel_h=as.numeric(relationship==" Husband"),
rel_nif=as.numeric(relationship==" Not-in-family"),
rel_oc=as.numeric(relationship==" Own-child"),
rel_um=as.numeric(relationship==" Unmarried"),
rel_w=as.numeric(relationship==" Wife")) %>%
select(-relationship)
here other relative were taken as base variable and 5 dummies were made.
table(d$workclass)
##
## ? Federal-gov Local-gov Never-worked
## 1836 960 2093 7
## Private Self-emp-inc Self-emp-not-inc State-gov
## 22696 1116 2541 1298
## Without-pay
## 14
here no of categories:9 dummy variables:8 but that too much hence we combine similar categories.
round(prop.table(table(d$workclass,d$Y),1),1)
##
## <=50K >50K
## ? 0.9 0.1
## Federal-gov 0.6 0.4
## Local-gov 0.7 0.3
## Never-worked 1.0 0.0
## Private 0.8 0.2
## Self-emp-inc 0.4 0.6
## Self-emp-not-inc 0.7 0.3
## State-gov 0.7 0.3
## Without-pay 1.0 0.0
addmargins(round(prop.table(table(d$workclass,d$Y),1),1),2)
##
## <=50K >50K Sum
## ? 0.9 0.1 1.0
## Federal-gov 0.6 0.4 1.0
## Local-gov 0.7 0.3 1.0
## Never-worked 1.0 0.0 1.0
## Private 0.8 0.2 1.0
## Self-emp-inc 0.4 0.6 1.0
## Self-emp-not-inc 0.7 0.3 1.0
## State-gov 0.7 0.3 1.0
## Without-pay 1.0 0.0 1.0
Here never worked and without pay is taken as base.and we create 5 dummies as follows:
d=d %>%
mutate(wc_1=as.numeric(workclass==" Self-emp-inc"),
wc_2=as.numeric(workclass==" Federal-gov"),
wc_3=as.numeric(workclass %in% c(" Local-gov"," Self-emp-not-inc"," State-gov")),
wc_4=as.numeric(workclass==" Private"),
wc_5=as.numeric(workclass==" ?")) %>%
select(-workclass)
round(prop.table(table(d$education,d$Y),1),1)
##
## <=50K >50K
## 10th 0.9 0.1
## 11th 0.9 0.1
## 12th 0.9 0.1
## 1st-4th 1.0 0.0
## 5th-6th 1.0 0.0
## 7th-8th 0.9 0.1
## 9th 0.9 0.1
## Assoc-acdm 0.8 0.2
## Assoc-voc 0.7 0.3
## Bachelors 0.6 0.4
## Doctorate 0.3 0.7
## HS-grad 0.8 0.2
## Masters 0.4 0.6
## Preschool 1.0 0.0
## Prof-school 0.3 0.7
## Some-college 0.8 0.2
here we create 6 dummies by combining similar categories.
d=d %>%
mutate(edu_1=as.numeric(education %in% c(" 10th"," 11th"," 12th"," 7th-8th"," 9th")),
edu_2=as.numeric(education %in% c(" 1st-4th"," 5th-6th"," Preschool")),
edu_3=as.numeric(education %in% c(" Assoc-acdm"," HS-grad"," Some-college")),
edu_4=as.numeric(education ==" Assoc-voc"),
edu_5=as.numeric(education==" Bachelors"),
edu_6=as.numeric(education==" Masters")) %>%
select(-education)
round(prop.table(table(d$marital.status,d$Y),1),1)
##
## <=50K >50K
## Divorced 0.9 0.1
## Married-AF-spouse 0.6 0.4
## Married-civ-spouse 0.6 0.4
## Married-spouse-absent 0.9 0.1
## Never-married 1.0 0.0
## Separated 0.9 0.1
## Widowed 0.9 0.1
d=d %>%
mutate(ms_1=as.numeric(marital.status==" Never-married"),
ms_2=as.numeric(marital.status %in% c(" Married-AF-spouse"," Married-civ-spouse"))) %>%
select(-marital.status)
round(prop.table(table(d$occupation,d$Y),1),1)
##
## <=50K >50K
## ? 0.9 0.1
## Adm-clerical 0.9 0.1
## Armed-Forces 0.9 0.1
## Craft-repair 0.8 0.2
## Exec-managerial 0.5 0.5
## Farming-fishing 0.9 0.1
## Handlers-cleaners 0.9 0.1
## Machine-op-inspct 0.9 0.1
## Other-service 1.0 0.0
## Priv-house-serv 1.0 0.0
## Prof-specialty 0.6 0.4
## Protective-serv 0.7 0.3
## Sales 0.7 0.3
## Tech-support 0.7 0.3
## Transport-moving 0.8 0.2
d=d %>%
mutate(oc_1=as.numeric(occupation==" Exec-managerial"),
oc_2=as.numeric(occupation==" Prof-specialty"),
oc_3=as.numeric(occupation %in% c(" Protective-serv"," Sales"," Tech-support")),
oc_4=as.numeric(occupation %in% c(" Craft-repair"," Transport-moving")),
oc_5=as.numeric(occupation %in% c(" Priv-house-serv"," Other-service"))) %>%
select(-occupation)
k=round(prop.table(table(d$native.country,d$Y),1),1)
k
##
## <=50K >50K
## ? 0.7 0.3
## Cambodia 0.6 0.4
## Canada 0.7 0.3
## China 0.7 0.3
## Columbia 1.0 0.0
## Cuba 0.7 0.3
## Dominican-Republic 1.0 0.0
## Ecuador 0.9 0.1
## El-Salvador 0.9 0.1
## England 0.7 0.3
## France 0.6 0.4
## Germany 0.7 0.3
## Greece 0.7 0.3
## Guatemala 1.0 0.0
## Haiti 0.9 0.1
## Holand-Netherlands 1.0 0.0
## Honduras 0.9 0.1
## Hong 0.7 0.3
## Hungary 0.8 0.2
## India 0.6 0.4
## Iran 0.6 0.4
## Ireland 0.8 0.2
## Italy 0.7 0.3
## Jamaica 0.9 0.1
## Japan 0.6 0.4
## Laos 0.9 0.1
## Mexico 0.9 0.1
## Nicaragua 0.9 0.1
## Outlying-US(Guam-USVI-etc) 1.0 0.0
## Peru 0.9 0.1
## Philippines 0.7 0.3
## Poland 0.8 0.2
## Portugal 0.9 0.1
## Puerto-Rico 0.9 0.1
## Scotland 0.8 0.2
## South 0.8 0.2
## Taiwan 0.6 0.4
## Thailand 0.8 0.2
## Trinadad&Tobago 0.9 0.1
## United-States 0.8 0.2
## Vietnam 0.9 0.1
## Yugoslavia 0.6 0.4
sort(k[,1]) #sorting makes it easier to combine
## Cambodia France
## 0.6 0.6
## India Iran
## 0.6 0.6
## Japan Taiwan
## 0.6 0.6
## Yugoslavia ?
## 0.6 0.7
## Canada China
## 0.7 0.7
## Cuba England
## 0.7 0.7
## Germany Greece
## 0.7 0.7
## Hong Italy
## 0.7 0.7
## Philippines Hungary
## 0.7 0.8
## Ireland Poland
## 0.8 0.8
## Scotland South
## 0.8 0.8
## Thailand United-States
## 0.8 0.8
## Ecuador El-Salvador
## 0.9 0.9
## Haiti Honduras
## 0.9 0.9
## Jamaica Laos
## 0.9 0.9
## Mexico Nicaragua
## 0.9 0.9
## Peru Portugal
## 0.9 0.9
## Puerto-Rico Trinadad&Tobago
## 0.9 0.9
## Vietnam Columbia
## 0.9 1.0
## Dominican-Republic Guatemala
## 1.0 1.0
## Holand-Netherlands Outlying-US(Guam-USVI-etc)
## 1.0 1.0
d=d %>%
mutate(nc_1=as.numeric(native.country %in% c(" Cambodia"," France"," India",
" Iran"," Japan"," Taiwan"," Yugoslavia")),
nc_2=as.numeric(native.country %in% c(" ?"," Canada"," China"," Cuba"," England",
" Germany"," Greece"," Hong"," Italy",
" Philippines")),
nc_3=as.numeric(native.country %in% c(" Hungary"," Ireland"," Poland"," Scotland",
" South"," Thailand"," United-States")),
nc_4=as.numeric(native.country %in% c(" Columbia"," Dominican-Republic",
" Guatemala"," Holand-Netherlands",
" Outlying-US(Guam-USVI-etc)"))) %>%
select(-native.country)
here 4 dummies are made for native.country.
glimpse(d)
## Observations: 32,561
## Variables: 40
## $ age <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
## $ fnlwgt <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
## $ education.num <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
## $ capital.gain <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
## $ capital.loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
## $ Y <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K...
## $ race_AIE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ race_API <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ race_Black <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ race_White <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,...
## $ sex_Female <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,...
## $ sex_Male <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,...
## $ rel_h <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ rel_nif <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ rel_oc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ rel_um <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ rel_w <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_3 <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ wc_4 <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,...
## $ wc_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_1 <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ edu_2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_3 <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ edu_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ edu_5 <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ edu_6 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ ms_1 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ms_2 <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ oc_1 <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ oc_2 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ oc_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ oc_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ oc_5 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nc_1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ nc_2 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ nc_3 <dbl> 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,...
## $ nc_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
table(d$capital.gain)
##
## 0 114 401 594 914 991 1055 1086 1111 1151 1173 1409
## 29849 6 2 34 8 5 25 4 1 8 3 7
## 1424 1455 1471 1506 1639 1797 1831 1848 2009 2036 2050 2062
## 3 1 7 15 1 7 7 6 3 4 5 2
## 2105 2174 2176 2202 2228 2290 2329 2346 2354 2387 2407 2414
## 9 48 23 16 5 5 6 6 11 1 19 8
## 2463 2538 2580 2597 2635 2653 2829 2885 2907 2936 2961 2964
## 11 1 12 20 11 5 31 24 11 3 3 9
## 2977 2993 3103 3137 3273 3325 3411 3418 3432 3456 3464 3471
## 8 2 97 37 6 53 24 5 4 2 23 8
## 3674 3781 3818 3887 3908 3942 4064 4101 4386 4416 4508 4650
## 14 12 7 6 32 14 42 20 70 12 12 41
## 4687 4787 4865 4931 4934 5013 5060 5178 5455 5556 5721 6097
## 3 23 17 1 7 69 1 97 11 5 3 1
## 6360 6418 6497 6514 6723 6767 6849 7298 7430 7443 7688 7896
## 3 9 11 5 2 5 27 246 9 5 284 3
## 7978 8614 9386 9562 10520 10566 10605 11678 13550 14084 14344 15020
## 1 55 22 4 43 6 12 2 27 41 26 5
## 15024 15831 18481 20051 22040 25124 25236 27828 34095 41310 99999
## 347 6 2 37 1 4 11 34 5 2 159
Here 0 occurs 29k times. Hence lets find its percentage of occurance. This will give % of observations where capital.gain is 0
sum(d$capital.gain==0)/nrow(d)
## [1] 0.9167102
More than 90% values are 0 , lets go ahead create a flag variable for this:
d=d %>%
mutate(cg_flag0=as.numeric(capital.gain==0))
hence wherever capital gain value is zero,it is mutated as 1,rest values take value 0.
table(d$capital.loss)
##
## 0 155 213 323 419 625 653 810 880 974 1092 1138
## 31042 1 4 3 3 12 3 2 6 2 7 2
## 1258 1340 1380 1408 1411 1485 1504 1539 1564 1573 1579 1590
## 4 7 7 21 1 51 18 1 25 6 20 40
## 1594 1602 1617 1628 1648 1651 1668 1669 1672 1719 1721 1726
## 8 47 9 15 2 9 4 24 34 22 18 4
## 1735 1740 1741 1755 1762 1816 1825 1844 1848 1876 1887 1902
## 2 42 24 2 14 2 4 1 51 39 159 202
## 1944 1974 1977 1980 2001 2002 2042 2051 2057 2080 2129 2149
## 1 18 168 23 24 21 9 21 6 1 3 2
## 2163 2174 2179 2201 2205 2206 2231 2238 2246 2258 2267 2282
## 1 7 15 1 9 6 3 2 6 25 3 1
## 2339 2352 2377 2392 2415 2444 2457 2467 2472 2489 2547 2559
## 17 2 20 9 49 12 3 1 1 1 4 12
## 2603 2754 2824 3004 3683 3770 3900 4356
## 5 2 10 2 2 2 2 3
lets find its percenatge fo ocuurance
sum(d$capital.loss==0)/nrow(d)
## [1] 0.9533491
95% more is capital loss values equal to zero.hence lets flag it.
d=d %>%
mutate(cl_flag0=as.numeric(capital.loss==0))
d$Y=as.numeric(d$Y==" >50K")
All values where Y>=50 is made “1” else rest everything is made “0”
glimpse(d)
## Observations: 32,561
## Variables: 42
## $ age <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
## $ fnlwgt <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
## $ education.num <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
## $ capital.gain <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
## $ capital.loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
## $ Y <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0,...
## $ race_AIE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ race_API <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ race_Black <dbl> 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ race_White <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,...
## $ sex_Female <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,...
## $ sex_Male <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,...
## $ rel_h <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ rel_nif <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ rel_oc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ rel_um <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ rel_w <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wc_3 <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ wc_4 <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,...
## $ wc_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_1 <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ edu_2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_3 <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ edu_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ edu_5 <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ edu_6 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ ms_1 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ms_2 <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,...
## $ oc_1 <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ oc_2 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ oc_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ oc_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ oc_5 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nc_1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ nc_2 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ nc_3 <dbl> 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,...
## $ nc_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ cg_flag0 <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,...
## $ cl_flag0 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...