In this Kaggle competition, Social Impact for Women in Impoverished Countries WIDS_DataThon_2018 competition, the aim is to predict the gender of each survey respondent based on demographic and behavioural information from a representative sample of survey respondents from India and their usage of traditional and mobile financial services.
This notebook is an exploratory data analysis of the dataset.
knitr::opts_chunk$set(echo = TRUE)
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.3
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
library(readxl)
library(stringr)
Load Datasets
The first step will be to load the test data file and train data file. These have been downloaded to a local Kaggle folder offline as there is an agreement step to the Kaggle data terms and conditions and unzipped using 7-zip.
We will use the data.table R package designed for large datasets.
Overview
The train dataset has 18255 rows and 1235 columns. This is is a high dimensional, multivariate dataset.
The following information is available from the Kaggle discussion forum:
is_female - the target variable you are going to predict. Note: in the data dictionary, Female is 2, male is 1, while in our transformed data, is_female=1 for female and is_female=0 for male. For the rest of the 1000+ column descriptions, please refer to WiDSDataDictionary for details. Please note that data has been processed and some columns were removed to prevent data leakage or to protect privacy.
head(train[,c(1:10)])
Note that the target variable is at position 9 in the train dataset.
Let’s remove train IDs for further analysis on the remaining train dataset.
# Remove ID feature from train as this identifier is not necessarily needed in the EDA.
id <- train$train_id
train <- train %>% dplyr::select(-train_id)
Since this is a survey there might be biases that affect the responses. We do not have any information on the order of the questions. Biases might include stereotyping, desirability to conform to group norms and mindset carry over effects.
1.1 Target Variable
Let’s first take a quick look at the target variable, the is_female label variable as a histogram.
train %>%
ggplot(aes(as.factor(is_female),fill=as.factor(is_female)),guide=FALSE) +
geom_histogram(stat="count") +
scale_y_continuous(name="Count",labels = scales::comma,limits = c(0,10000)) +
xlab("Target") +
ggtitle("Histogram of Target Variable is_female")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
table(train$is_female)
##
## 0 1
## 8450 9805
We can see that a the number of is_female as a proportion of total is 0.54. Therefore the classes are balanced.
1.2 Response Variables
1.2.1 Character Variables
Check if any variables in the train set are characters using dplyr, and subset the ones that do not have a very high proportion of white space.
# Select the subset of character variables and values
ischars <- train %>% select_if(is.character)
head(ischars)
# Create a vector of the count of white space for each column, with a custom function
ischarsws <- sapply(ischars,function(x) table(x =="")["TRUE"])
# Subset the variables where the whitespace count is less than 17,000
ischarsws[ischarsws<17000]
## LN2_RIndLngBEOth.TRUE LN2_WIndLngBEOth.TRUE
## 6914 6911
Of the 96 character variables, this subset of two variables may contribute to the prediction. Let’s take a quick look at the some of the values in the variables.
table(train$LN2_RIndLngBEOth)[1:10]
##
## Aadiwasi Arbi
## 6914 1 1
## Assamese Assamese and Hindi Bengali
## 180 1 891
## Bhojpuri Chattisgari Dehati
## 11 31 1
## English
## 43
table(train$LN2_WIndLngBEOth)[1:10]
##
## Aadiwasi Arbi Assamese
## 6911 4 1 134
## Bengali Bhojpuri Boro Chattisgari
## 1083 6 1 37
## Chhetriya bhasha English
## 1 24
They are not listed in the data dictionary but we can infer that they relate to language questions. Let’s remove the character variables from train except for these two variables.
charstoremove <- ischars %>% select(-LN2_RIndLngBEOth,-LN2_WIndLngBEOth)
# Use dplyr select to remove the column names of the charstoremove
train <- train %>% select(-one_of(names(charstoremove)))
1.2.3 Numeric variables
Check if any variables in the train set are numeric using dplyr, and calculate the sum of NAs for each column using purrr.
# Select the subset of numeric variables and values. Note that is.numeric includes integers
isnum <- train %>% select_if(is.numeric)
# Create a vector of the count of NA for each column using map from purrr package
isnumna <- map(isnum, ~sum(is.na(.)))
We will remove the high proportion of numeric NAs from the train set.
# Use dplyr select to remove the column names vector
train <- train %>% select(-one_of(names(isnumna[isnumna>16000])))
Let’s investigate how many of the numeric variables are in fact binary variables, assuming that binary will take on one of two values and there are no not answered values such as 99.
binary <- function(x) {
ifelse(length(unique(x))>2,FALSE,TRUE)
}
# Create a vector of the count of NA for each column using map from purrr package
isbinaryornot <- map(isnum, ~binary(.))
# Sum is binary or not
sumbinary <- sum(isbinaryornot==TRUE)
1.2.4 Logical variables
Check if any variables in the train set are logicals using dplyr, and check the number of NA values.
# Select the subset of logical variables and values
islog <- train %>% select_if(is.logical)
# Sum the missing values
sum(is.na(islog))-dim(islog)[1]*dim(islog)[2]
## [1] 0
All the logical variables have missing values therefore we can remove these variables from the dataset for the rest of the EDA.
train <- train %>% select(-one_of(names(islog)))
Therefore we saw that the train set originally contained 1235 variables, made up of 1 testid, 96 character, 50 logical and 1088 numeric variables. Of these numeric variables We have at least 219 binary variables.
We can potentially reduce the train set to 485 variables for model training.
1.3 Data Dictionary Comparison
Note there are less variables in the data dictionary ( 1105 ) than in the train set ( 1235 ) so it appears we are missing some variable descriptions. Let’s compare the two datasets.
# Number of columns in data dictionary but not in train dataset. These are likely to be the columns removed to protect privacy as described above.
length(setdiff(dd2$`Column.Name`, names(train)))
## [1] 651
# Columns in train data but not in data dictionary
setdiff(names(train), dd2$`Column.Name`)
## [1] "AA4" "AA7" "AA14"
## [4] "AA15" "DG10b" "DG10c"
## [7] "DG11b" "DG11c" "DL4_96"
## [10] "DL4_99" "DL14" "MT3_1"
## [13] "MT3_2" "MT3_3" "MT12_1"
## [16] "MT12_2" "MT12_3" "MT12_4"
## [19] "MT12_5" "MT12_6" "MT12_7"
## [22] "MT12_8" "MT12_9" "MT12_10"
## [25] "MT12_11" "MT12_12" "MT12_13"
## [28] "MT12_14" "MT12_96" "LN2_RIndLngBEOth"
## [31] "LN2_WIndLngBEOth"
1.4 Variable Summaries
The questions appear to be grouped into Demographic (DL and DG), Mobile (MT), Financial (FF, FB, FL and GN) and Insurance (FB).
1.4.1 Demographic information
# DL0. Who is the main income earner in your household?
table(train$DL0) # all answered. (Top 10 variable importance). This is a a binary answer, no further engineering
##
## 1 2
## 6623 11632
# DL1. In the past 12 months, were you mainly...?
table(train$DL1) # all answered (Top 10 variable importance). This is a candidate for further feature enginnering
##
## 1 2 3 4 5 6 7 8 9 10 96 99
## 4247 791 951 1643 1366 417 5664 1331 375 318 281 871
# DL2. What is your primary job (i.e., the job where you sp
table(train$DL2) # Do they prefer not to say? (Top 10 variable importance). This has many NAS
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2113 1910 74 140 250 216 153 132 175 72 259 157 24 67 58
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 23 12 30 213 424 132 75 102 12 10 57 18 6 294 1200
## 32 96
## 356 234
# DG1. What year were you born?(Top 10 variable importance)
summary(train$DG1) # all answered.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1917 1969 1981 1978 1990 2001
# DG3. What is your marital status? (Top 10 variable importance)
table(train$DG3) # all answered
##
## 1 2 3 4 5 6 7 8 99
## 2906 314 12418 56 103 1165 86 1062 145
# DG4. What is your highest level of education? (Top 10 variable importance)
summary(train$DG4) # all answered
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 5.000 5.503 6.000 99.000
# DG5.4.Do you have any of the following type of official id (Top 10 variable importance)
summary(train$DG4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 5.000 5.503 6.000 99.000
# DG6.How are you related to the household head? (Top 10 variable importance)
summary(train$DG6)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 2.0 3.1 3.0 99.0
# DG8.A.How many adults and children do you have in th (Top 10 variable importance)
summary(train$DG8a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 4.103 4.000 99.000
1.4.2 Mobile phone questions
The questions which begin with MT are related to phone usage.
# MT1.How many people in your household have a mobile phone?
table(train$MT1)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 1523 10134 3995 1185 505 112 26 10 5 1 1 1
## 99
## 757
# MT1A.Who decides on who should have a phone in your household? (Top 10 variable importance)
summary(train$MT1A)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 8.771 3.000 99.000 757
# MT2.Do you personally own a mobile phone?
table(train$MT2)
##
## 1 2
## 8922 9333
# MT3_1? This question is not in the data dictionary?
table(train$MT3_1)
##
## 0 1 2 3 4
## 3135 5511 233 33 10
# MT6.How did you obtain your phone? (Top 10 variable importance)
summary(train$MT6)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 1.00 1.00 4.84 2.00 99.00 9333
1.4.3 Financial Services questions
# FF1. Do you personally have a bank account that is registered
table(train$FF1)
##
## 1 2
## 11496 6759
# FF2. Do you usually make transactions with your bank account yourself or does
summary(train$FF2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 1.000 1.324 1.000 3.000 6759
table(train$FF2)
##
## 1 2 3
## 9027 1211 1258
1.4.4 Financial Planning
# FL2.When you make a plan, how often do you keep it?
summary(train$FL2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 2.000 2.236 3.000 4.000 7397
# FL4.What or who do you depend on the most for financial..
table(train$FL4)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 6337 5864 364 25 10 22 13 2809 5 35 20 9 3 3 12
## 16 17 96 99
## 8 8 1 2707
1.4.5 Location
The location questions appear to begin with AA.
# AA3. Zone
summary(train$AA3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.371 3.000 4.000
# AA4 no description
summary(train$AA4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 21.00 31.00 28.56 34.00 44.00
# AA5. Town Class
summary(train$AA5)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 3.000 3.139 4.000 5.000 12602
# AA14 no description
summary(train$AA14)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 96 949 2902 8031 4609 99999
hist(train$AA14)
# AA15
summary(train$AA15)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24 178 354 352 523 633
The NA for each of AA5 and AA6 add up to the total number of observations. There seems to be a question missing for whether the respondents live in a town or a village. This will be feature engineered.
AA14 question is not in the data dictionary, but it has a high outlier 99,999. These will be removed through imputation.
2.1 EDA Plots
2.1.1 Plot religion versus zone
# Plot religion versus Zone DG3A What is your religion? "1=Christianity\n2=Islam\n3=Sikhism\n4=Hinduism\n5=Buddhism\n6=No religion\n96=Other (Specify)\n99=DK" and AA3. Zone 1=North 2=East 3=West 4=South
train %>% filter(DG3A!=96 &DG3A!=99) %>%
ggplot() +
geom_jitter(aes(DG3A,AA3)) +
xlab("DG3A What is your religion?") +
ylab("AA3. Zone")
2.1.2 Plot working versus main earner
# Plot DL0. Who is the main income earner in your household? v DL1. In the past 12 months, were you mainly...?
train %>% filter(DL1!=96 & DL1!=99) %>%
ggplot() +
geom_jitter(aes(DL0,DL1)) +
xlab("DL0. Who is the main income earner in your household?") +
ylab("DL1. In the past 12 months, were you mainly...?")
2.1.3 Plot household head versus marital status
# Plot DG6.How are you related to the household head? v DG3. What is your marital status?
train %>% filter(DG3!=96 &DG3!=99&DG6!=99) %>%
ggplot() +geom_jitter(aes(DG6,DG3)) +
xlab("DG6.How are you related to the household head?") +
ylab("DG3. What is your marital status?")
2.1.4 Plot phone versus marital status
# Plot MT6.How did you obtain your phone? v DG3. What is your marital status?
train %>% filter(DG3!=96 & DG3!=99 & MT6!=96 & MT6!=99) %>%
ggplot() +
geom_jitter(aes(MT6,DG3)) +
xlab("MT6.How did you obtain your phone?") +
ylab("DG3. What is your marital status?")