Exploratory Data Analysis and Data Munging

This post demonstrates Exploratory Data Analysis (EDA) and Data Munging to deal with missing values in R. I will use Bank Loan Dataset in this post, which is available here.

Data

Following is the description of variables for this dataset;

Variable	Description
Loan_ID	Unique Loan ID
Gender	Male/ Female
Married	Applicant married (Y/N)
Gender	Male/ Female
Dependents	Number of dependents
Education	Applicant Education (Graduate/ Under Graduate)
Self_Employed	Self employed (Y/N)
ApplicantIncome	Applicant income
CoapplicantIncome	Coapplicant income
LoanAmount	Loan amount in thousands
Loan_Amount_Term	Term of loan in months
Credit_History	credit history meets guidelines
Property_Area	Urban/ Semi Urban/ Rural
Loan_Status	Loan approved (Y/N)

EDA

Let’s begin with exploration of this dataset.

loandata<-read.csv('/Users/Tanmay/Desktop/R wd/train_loandata.csv')
head(loandata)

##    Loan_ID Gender Married Dependents    Education Self_Employed
## 1 LP001002   Male      No          0     Graduate            No
## 2 LP001003   Male     Yes          1     Graduate            No
## 3 LP001005   Male     Yes          0     Graduate           Yes
## 4 LP001006   Male     Yes          0 Not Graduate            No
## 5 LP001008   Male      No          0     Graduate            No
## 6 LP001011   Male     Yes          2     Graduate           Yes
##   ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 1            5849                 0         NA              360
## 2            4583              1508        128              360
## 3            3000                 0         66              360
## 4            2583              2358        120              360
## 5            6000                 0        141              360
## 6            5417              4196        267              360
##   Credit_History Property_Area Loan_Status
## 1              1         Urban           Y
## 2              1         Rural           N
## 3              1         Urban           Y
## 4              1         Urban           Y
## 5              1         Urban           Y
## 6              1         Urban           Y

and

str(loandata)

## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
##  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

Let’s check for number of missing values:

n_NA<-sapply(loandata, function(x) sum(is.na(x)))
n_NA[n_NA>0]

##       LoanAmount Loan_Amount_Term   Credit_History 
##               22               14               50

A small tweaking in the 2nd line has filtered only the variables which have NA values. However, a simpler and broader way of looking at the data after you get the feel of variable types is;

summary(loandata)

##      Loan_ID       Gender    Married   Dependents        Education  
##  LP001002:  1         : 13      :  3     : 15     Graduate    :480  
##  LP001003:  1   Female:112   No :213   0 :345     Not Graduate:134  
##  LP001005:  1   Male  :489   Yes:398   1 :102                       
##  LP001006:  1                          2 :101                       
##  LP001008:  1                          3+: 51                       
##  LP001011:  1                                                       
##  (Other) :608                                                       
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##     : 32       Min.   :  150   Min.   :    0     Min.   :  9.0  
##  No :500       1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  Yes: 82       Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50

Note that, although there are no NAs in some variables like ‘Gender’ but they do have blank entries. Also, observe that distribution of ApplicantIncome is right skewed as mean value is greater than median value.

Variable Distribution - Numerical

Now after knowing the basic characteristics of data, lets check the distributions of variables. We will start with ‘ApplicantIncome’:

suppressWarnings(library(ggplot2))
qplot(loandata$ApplicantIncome, bins=50)

This shows that there are some extreme values of income. Lets check the same variable via boxplot:

ggplot(loandata, aes(x='Applicants', y=ApplicantIncome))+geom_boxplot()

There lies extreme outliers in ‘ApplicantIncome’ which shows the existence of income disparity in our society.

ggplot(loandata, aes(x=Education, y=ApplicantIncome))+geom_boxplot()

Above segregation shows that although there is not much difference between the mean income of Graduates and non graduates but there are high number of graduates with very high incomes. Now lets look at LoanAmount:

ggplot(loandata, aes(x='Applicants', y=LoanAmount))+geom_boxplot()

## Warning: Removed 22 rows containing non-finite values (stat_boxplot).

So, both ApplicantIncome and LoanAmount have outliers as well as LoanAmount has NA values and hence this demands deeper understanding.

Variable Distribution - Categorical

Now that we have some understanding of ApplicantIncome and LoanAmount, lets look at the distribution of Categorical variables. One important variable which we have in our dataset is Credit_History which categorises applicants based on adherence of their credit history to guidelines. Also, Loan_Status shows whether loan was approved or not. Lets look at these 2 variables:

table(loandata$Credit_History, loandata$Loan_Status)

##    
##       N   Y
##   0  82   7
##   1  97 378

Great. Without noticing, we have a simple classifier for Loan Approval based on Credit_History variable. This classification gives us a True Positive of 378x100/(378+97)=79.6% and True Negative of 82x100/(82+7)=92.1%.

Now, categorical variables could have blank entries which will not be an NA value but these blanks are also required to be captured. Lets check for the same:

missVal<-sapply(loandata, function(x) sum(x=='' | is.na(x)))
missVal[missVal>0]

##           Gender          Married       Dependents    Self_Employed 
##               13                3               15               32 
##       LoanAmount Loan_Amount_Term   Credit_History 
##               22               14               50

So, all these variables should be dealt appropriately for imputation of missing values.

Filling The Missing Values - Part I

We will fill missing values of LoanAmount. There are numerous ways of filling missing values and we will try 3 ways. Simplest method would be to replace them by mean, but, look at the distribution of LoanAmount:

qplot(loandata$LoanAmount, bins=60)

## Warning: Removed 22 rows containing non-finite values (stat_bin).

The distribution is skewed so mean would not be a great way for imputing missing values. However, lets perform a simple transformation of LoanAmount:

trans_la<- log(loandata$LoanAmount)
qplot(trans_la, bins=60)

## Warning: Removed 22 rows containing non-finite values (stat_bin).

The distribution looks much more normal now and thus roughly centered about its mean. Now, we can impute NA of transformed loan amount with its mean value.

loandata1<-loandata #Generating a copy for imputation using method-I
l<- sapply(trans_la, function(x){
    x[is.na(x)]<-mean(trans_la,na.rm = T)
    x
})
loandata1$LoanAmount<-exp(l)

Part II

Second, we will impute missing values based on other variables. We can use employment mode and education to arrive at suitable values LoanAmount. But first, we have to resolve the missing values of Self_Employed variable. Lets check its distribution:

plot(loandata$Self_Employed)

Most of the applicants are not self employed and hence we can suitably assume the missing ones also to be not self employed. This can be easily done by changing factor levels of blank entries to ‘No’:

loandata2<-loandata #Generating a copy for imputation using method-II
levels(loandata2$Self_Employed)<-c('No', 'No', 'Yes')

Now, lets look at LoanAmount distribution with respect to Education and Self_employed:

loandata2$SelfEmpl_Edu<-interaction(loandata2$Self_Employed, loandata2$Education)
ggplot(loandata2, aes(x=SelfEmpl_Edu, y=LoanAmount))+geom_boxplot()

## Warning: Removed 22 rows containing non-finite values (stat_boxplot).

And now we can replace the NA values by medians of these 4 categories in the same way as before. This is more appropriate as we have better attributes to it now.

med_values<-tapply(loandata2$LoanAmount, loandata2$SelfEmpl_Edu, function(x) median(x, na.rm = T))
med_values

##      No.Graduate     Yes.Graduate  No.Not Graduate Yes.Not Graduate 
##            130.0            157.5            113.0            130.0

Now, we will impute these values in place of NAs in LoanAmount:

l1<-tapply(loandata2$LoanAmount, loandata2$SelfEmpl_Edu, function(x){
     replace(x, which(is.na(x)), median(x, na.rm=T))
})
suppressMessages(library(dplyr))
loandata2$LoanAmount<- combine(l1)

In above code, tapply is used to impute NAs with median values based on levels of factor variable SelfEmpl_Edu. But, tapply returns an array, so, combine function from dplyr package is used to bind that array in 1 vector.

Part III

R has many great packages available for imputing missing values through predictive analysis of variables. One great package is missForest, which builds a random forest model for each variable. Then it uses the model to predict missing values in the variable with the help of observed values. This package could be deployed to impute missing values for all variables simultaneously in a single step but its important to take care of details so as to get proper results. For using this package we will first remove the Loan_ID from our data set as it is an independent variable.

suppressMessages(library(missForest))
loandata3<-select(loandata, -Loan_ID)
imploandata<-missForest(loandata3)

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!

We can check the imputation errors. NRMSE is normalized mean squared error. It is used to represent error derived from imputing continuous values. PFC (proportion of falsely classified) is used to represent error derived from imputing categorical values.

imploandata$OOBerror

##      NRMSE        PFC 
## 0.01110868 0.00000000

#Imputed dataset
head(imploandata$ximp)

##   Gender Married Dependents    Education Self_Employed ApplicantIncome
## 1   Male      No          0     Graduate            No            5849
## 2   Male     Yes          1     Graduate            No            4583
## 3   Male     Yes          0     Graduate           Yes            3000
## 4   Male     Yes          0 Not Graduate            No            2583
## 5   Male      No          0     Graduate            No            6000
## 6   Male     Yes          2     Graduate           Yes            5417
##   CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History
## 1                 0   149.3228              360              1
## 2              1508   128.0000              360              1
## 3                 0    66.0000              360              1
## 4              2358   120.0000              360              1
## 5                 0   141.0000              360              1
## 6              4196   267.0000              360              1
##   Property_Area Loan_Status
## 1         Urban           Y
## 2         Rural           N
## 3         Urban           Y
## 4         Urban           Y
## 5         Urban           Y
## 6         Urban           Y