WIDS DataThon 2018 EDA

Background

In this Kaggle competition, Social Impact for Women in Impoverished Countries WIDS_DataThon_2018 competition, the aim is to predict the gender of each survey respondent based on demographic and behavioural information from a representative sample of survey respondents from India and their usage of traditional and mobile financial services.

This notebook is an exploratory data analysis of the dataset.

Setup

Load packages

knitr::opts_chunk$set(echo = TRUE)

library(data.table)

## Warning: package 'data.table' was built under R version 3.4.3

library(knitr)

## Warning: package 'knitr' was built under R version 3.4.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.4.3

## Warning: package 'tibble' was built under R version 3.4.3

## Warning: package 'tidyr' was built under R version 3.4.3

library(readxl)
library(stringr)

Load Data

Load Datasets

The first step will be to load the test data file and train data file. These have been downloaded to a local Kaggle folder offline as there is an agreement step to the Kaggle data terms and conditions and unzipped using 7-zip.

We will use the data.table R package designed for large datasets.

Overview

The train dataset has 18255 rows and 1235 columns. This is is a high dimensional, multivariate dataset.

The following information is available from the Kaggle discussion forum:

is_female - the target variable you are going to predict. Note: in the data dictionary, Female is 2, male is 1, while in our transformed data, is_female=1 for female and is_female=0 for male. For the rest of the 1000+ column descriptions, please refer to WiDSDataDictionary for details. Please note that data has been processed and some columns were removed to prevent data leakage or to protect privacy.

head(train[,c(1:10)])

Note that the target variable is at position 9 in the train dataset.

Let’s remove train IDs for further analysis on the remaining train dataset.

# Remove ID feature from train as this identifier is not necessarily needed in the EDA.
id <- train$train_id
train <- train %>% dplyr::select(-train_id)

Since this is a survey there might be biases that affect the responses. We do not have any information on the order of the questions. Biases might include stereotyping, desirability to conform to group norms and mindset carry over effects.

Part 1: Exploratory data analysis

1.1 Target Variable

Let’s first take a quick look at the target variable, the is_female label variable as a histogram.

train %>% 
      ggplot(aes(as.factor(is_female),fill=as.factor(is_female)),guide=FALSE) + 
      geom_histogram(stat="count") + 
      scale_y_continuous(name="Count",labels = scales::comma,limits = c(0,10000)) +
      xlab("Target") +
      ggtitle("Histogram of Target Variable is_female")

## Warning: Ignoring unknown parameters: binwidth, bins, pad

table(train$is_female)

## 
##    0    1 
## 8450 9805

We can see that a the number of is_female as a proportion of total is 0.54. Therefore the classes are balanced.

1.2 Response Variables

1.2.1 Character Variables

Check if any variables in the train set are characters using dplyr, and subset the ones that do not have a very high proportion of white space.

# Select the subset of character variables and values
ischars <- train %>% select_if(is.character)
head(ischars)

# Create a vector of the count of white space for each column, with a custom function
ischarsws <- sapply(ischars,function(x) table(x =="")["TRUE"])
# Subset the variables where the whitespace count is less than 17,000
ischarsws[ischarsws<17000]

## LN2_RIndLngBEOth.TRUE LN2_WIndLngBEOth.TRUE 
##                  6914                  6911

Of the 96 character variables, this subset of two variables may contribute to the prediction. Let’s take a quick look at the some of the values in the variables.

table(train$LN2_RIndLngBEOth)[1:10]

## 
##                              Aadiwasi               Arbi 
##               6914                  1                  1 
##           Assamese Assamese and Hindi            Bengali 
##                180                  1                891 
##           Bhojpuri        Chattisgari             Dehati 
##                 11                 31                  1 
##            English 
##                 43

table(train$LN2_WIndLngBEOth)[1:10]

## 
##                          Aadiwasi             Arbi         Assamese 
##             6911                4                1              134 
##          Bengali         Bhojpuri             Boro      Chattisgari 
##             1083                6                1               37 
## Chhetriya bhasha          English 
##                1               24

They are not listed in the data dictionary but we can infer that they relate to language questions. Let’s remove the character variables from train except for these two variables.

charstoremove <- ischars %>% select(-LN2_RIndLngBEOth,-LN2_WIndLngBEOth)
# Use dplyr select to remove the column names of the charstoremove
train <- train %>% select(-one_of(names(charstoremove)))

1.2.3 Numeric variables

Check if any variables in the train set are numeric using dplyr, and calculate the sum of NAs for each column using purrr.

# Select the subset of numeric variables and values. Note that is.numeric includes integers
isnum <- train %>% select_if(is.numeric)
# Create a vector of the count of NA for each column using map from purrr package
isnumna <- map(isnum, ~sum(is.na(.)))

We will remove the high proportion of numeric NAs from the train set.

# Use dplyr select to remove the column names vector
train <- train %>% select(-one_of(names(isnumna[isnumna>16000])))

Let’s investigate how many of the numeric variables are in fact binary variables, assuming that binary will take on one of two values and there are no not answered values such as 99.

binary <- function(x) {
  ifelse(length(unique(x))>2,FALSE,TRUE)
}
# Create a vector of the count of NA for each column using map from purrr package
isbinaryornot <- map(isnum, ~binary(.))
# Sum is binary or not
sumbinary <- sum(isbinaryornot==TRUE)

1.2.4 Logical variables

Check if any variables in the train set are logicals using dplyr, and check the number of NA values.

# Select the subset of logical variables and values
islog <- train %>% select_if(is.logical)
# Sum the missing values 
sum(is.na(islog))-dim(islog)[1]*dim(islog)[2]

## [1] 0

All the logical variables have missing values therefore we can remove these variables from the dataset for the rest of the EDA.

train <- train %>% select(-one_of(names(islog)))

Therefore we saw that the train set originally contained 1235 variables, made up of 1 testid, 96 character, 50 logical and 1088 numeric variables. Of these numeric variables We have at least 219 binary variables.

We can potentially reduce the train set to 485 variables for model training.

1.3 Data Dictionary Comparison

Note there are less variables in the data dictionary ( 1105 ) than in the train set ( 1235 ) so it appears we are missing some variable descriptions. Let’s compare the two datasets.

# Number of columns in data dictionary but not in train dataset. These are likely to be the columns removed to protect privacy as described above. 
length(setdiff(dd2$`Column.Name`, names(train)))

## [1] 651

# Columns in train data but not in data dictionary
setdiff(names(train), dd2$`Column.Name`)

##  [1] "AA4"              "AA7"              "AA14"            
##  [4] "AA15"             "DG10b"            "DG10c"           
##  [7] "DG11b"            "DG11c"            "DL4_96"          
## [10] "DL4_99"           "DL14"             "MT3_1"           
## [13] "MT3_2"            "MT3_3"            "MT12_1"          
## [16] "MT12_2"           "MT12_3"           "MT12_4"          
## [19] "MT12_5"           "MT12_6"           "MT12_7"          
## [22] "MT12_8"           "MT12_9"           "MT12_10"         
## [25] "MT12_11"          "MT12_12"          "MT12_13"         
## [28] "MT12_14"          "MT12_96"          "LN2_RIndLngBEOth"
## [31] "LN2_WIndLngBEOth"

1.4 Variable Summaries

The questions appear to be grouped into Demographic (DL and DG), Mobile (MT), Financial (FF, FB, FL and GN) and Insurance (FB).

1.4.1 Demographic information

# DL0. Who is the main income earner in your household?
table(train$DL0) # all answered. (Top 10 variable importance). This is a a binary answer, no further engineering

## 
##     1     2 
##  6623 11632

# DL1. In the past 12 months, were you mainly...?
table(train$DL1) # all answered (Top 10 variable importance). This is a candidate for further feature enginnering

## 
##    1    2    3    4    5    6    7    8    9   10   96   99 
## 4247  791  951 1643 1366  417 5664 1331  375  318  281  871

# DL2. What is your primary job (i.e., the job where you sp
table(train$DL2) # Do they prefer not to say? (Top 10 variable importance). This has many NAS

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2113 1910   74  140  250  216  153  132  175   72  259  157   24   67   58 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##   23   12   30  213  424  132   75  102   12   10   57   18    6  294 1200 
##   32   96 
##  356  234

# DG1. What year were you born?(Top 10 variable importance)
summary(train$DG1) # all answered.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1917    1969    1981    1978    1990    2001

# DG3. What is your marital status? (Top 10 variable importance)
table(train$DG3) # all answered

## 
##     1     2     3     4     5     6     7     8    99 
##  2906   314 12418    56   103  1165    86  1062   145

# DG4. What is your highest level of education? (Top 10 variable importance)
summary(train$DG4)  # all answered

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   5.000   5.503   6.000  99.000

# DG5.4.Do you have any of the following type of official id (Top 10 variable importance)
summary(train$DG4)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   5.000   5.503   6.000  99.000

# DG6.How are you related to the household head? (Top 10 variable importance)
summary(train$DG6)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     2.0     3.1     3.0    99.0

# DG8.A.How many adults and children do you have in th (Top 10 variable importance)
summary(train$DG8a)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   4.103   4.000  99.000

1.4.2 Mobile phone questions

The questions which begin with MT are related to phone usage.

# MT1.How many people in your household have a mobile phone?
table(train$MT1)

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
##  1523 10134  3995  1185   505   112    26    10     5     1     1     1 
##    99 
##   757

# MT1A.Who decides on who should have a phone in your household? (Top 10 variable importance)
summary(train$MT1A)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   8.771   3.000  99.000     757

# MT2.Do you personally own a mobile phone?
table(train$MT2)

## 
##    1    2 
## 8922 9333

#  MT3_1? This question is not in the data dictionary?
table(train$MT3_1)

## 
##    0    1    2    3    4 
## 3135 5511  233   33   10

# MT6.How did you obtain your phone? (Top 10 variable importance) 
summary(train$MT6)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    1.00    1.00    4.84    2.00   99.00    9333

1.4.3 Financial Services questions

# FF1. Do you personally have a bank account that is registered
table(train$FF1)

## 
##     1     2 
## 11496  6759

# FF2. Do you usually make transactions with your bank account yourself or does 
summary(train$FF2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.000   1.324   1.000   3.000    6759

table(train$FF2)

## 
##    1    2    3 
## 9027 1211 1258

1.4.4 Financial Planning

# FL2.When you make a plan, how often do you keep it?
summary(train$FL2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.000   2.000   2.236   3.000   4.000    7397

# FL4.What or who do you depend on the most for financial..
table(train$FL4)

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 6337 5864  364   25   10   22   13 2809    5   35   20    9    3    3   12 
##   16   17   96   99 
##    8    8    1 2707

1.4.5 Location

The location questions appear to begin with AA.

# AA3. Zone
summary(train$AA3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.371   3.000   4.000

# AA4 no description
summary(train$AA4)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   21.00   31.00   28.56   34.00   44.00

# AA5. Town Class
summary(train$AA5)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.000   3.000   3.139   4.000   5.000   12602

# AA14 no description
summary(train$AA14)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      96     949    2902    8031    4609   99999

hist(train$AA14)

# AA15
summary(train$AA15)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      24     178     354     352     523     633

The NA for each of AA5 and AA6 add up to the total number of observations. There seems to be a question missing for whether the respondents live in a town or a village. This will be feature engineered.

AA14 question is not in the data dictionary, but it has a high outlier 99,999. These will be removed through imputation.

2.1 EDA Plots

2.1.1 Plot religion versus zone

# Plot religion versus Zone DG3A What is your religion? "1=Christianity\n2=Islam\n3=Sikhism\n4=Hinduism\n5=Buddhism\n6=No religion\n96=Other (Specify)\n99=DK" and AA3. Zone 1=North 2=East 3=West 4=South
train %>% filter(DG3A!=96 &DG3A!=99) %>% 
      ggplot() +
      geom_jitter(aes(DG3A,AA3)) +
      xlab("DG3A What is your religion?") +
      ylab("AA3. Zone")

2.1.2 Plot working versus main earner

# Plot DL0. Who is the main income earner in your household? v DL1. In the past 12 months, were you mainly...?
train %>%  filter(DL1!=96 & DL1!=99) %>%
      ggplot() +
      geom_jitter(aes(DL0,DL1)) +
      xlab("DL0. Who is the main income earner in your household?") +
      ylab("DL1. In the past 12 months, were you mainly...?")

2.1.3 Plot household head versus marital status

# Plot DG6.How are you related to the household head? v DG3. What is your marital status?
train %>% filter(DG3!=96 &DG3!=99&DG6!=99) %>% 
      ggplot() +geom_jitter(aes(DG6,DG3)) + 
      xlab("DG6.How are you related to the household head?") +
      ylab("DG3. What is your marital status?")

2.1.4 Plot phone versus marital status

# Plot MT6.How did you obtain your phone? v DG3. What is your marital status?
train %>% filter(DG3!=96 & DG3!=99 & MT6!=96 & MT6!=99) %>% 
      ggplot() +
      geom_jitter(aes(MT6,DG3)) + 
      xlab("MT6.How did you obtain your phone?") +
      ylab("DG3. What is your marital status?")