This post demonstrates Exploratory Data Analysis (EDA) and Data Munging to deal with missing values in R. I will use Bank Loan Dataset in this post, which is available here.
Following is the description of variables for this dataset;
| Variable | Description |
|---|---|
| Loan_ID | Unique Loan ID |
| Gender | Male/ Female |
| Married | Applicant married (Y/N) |
| Gender | Male/ Female |
| Dependents | Number of dependents |
| Education | Applicant Education (Graduate/ Under Graduate) |
| Self_Employed | Self employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan in months |
| Credit_History | credit history meets guidelines |
| Property_Area | Urban/ Semi Urban/ Rural |
| Loan_Status | Loan approved (Y/N) |
Let’s begin with exploration of this dataset.
loandata<-read.csv('/Users/Tanmay/Desktop/R wd/train_loandata.csv')
head(loandata)
## Loan_ID Gender Married Dependents Education Self_Employed
## 1 LP001002 Male No 0 Graduate No
## 2 LP001003 Male Yes 1 Graduate No
## 3 LP001005 Male Yes 0 Graduate Yes
## 4 LP001006 Male Yes 0 Not Graduate No
## 5 LP001008 Male No 0 Graduate No
## 6 LP001011 Male Yes 2 Graduate Yes
## ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 1 5849 0 NA 360
## 2 4583 1508 128 360
## 3 3000 0 66 360
## 4 2583 2358 120 360
## 5 6000 0 141 360
## 6 5417 4196 267 360
## Credit_History Property_Area Loan_Status
## 1 1 Urban Y
## 2 1 Rural N
## 3 1 Urban Y
## 4 1 Urban Y
## 5 1 Urban Y
## 6 1 Urban Y
and
str(loandata)
## 'data.frame': 614 obs. of 13 variables:
## $ Loan_ID : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
## $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
## $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
## $ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
## $ CoapplicantIncome: num 0 1508 0 2358 0 ...
## $ LoanAmount : int NA 128 66 120 141 267 95 158 168 349 ...
## $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
Let’s check for number of missing values:
n_NA<-sapply(loandata, function(x) sum(is.na(x)))
n_NA[n_NA>0]
## LoanAmount Loan_Amount_Term Credit_History
## 22 14 50
A small tweaking in the 2nd line has filtered only the variables which have NA values. However, a simpler and broader way of looking at the data after you get the feel of variable types is;
summary(loandata)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 : 13 : 3 : 15 Graduate :480
## LP001003: 1 Female:112 No :213 0 :345 Not Graduate:134
## LP001005: 1 Male :489 Yes:398 1 :102
## LP001006: 1 2 :101
## LP001008: 1 3+: 51
## LP001011: 1
## (Other) :608
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 32 Min. : 150 Min. : 0 Min. : 9.0
## No :500 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## Yes: 82 Median : 3812 Median : 1188 Median :128.0
## Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
Note that, although there are no NAs in some variables like ‘Gender’ but they do have blank entries. Also, observe that distribution of ApplicantIncome is right skewed as mean value is greater than median value.
Now after knowing the basic characteristics of data, lets check the distributions of variables. We will start with ‘ApplicantIncome’:
suppressWarnings(library(ggplot2))
qplot(loandata$ApplicantIncome, bins=50)
This shows that there are some extreme values of income. Lets check the same variable via boxplot:
ggplot(loandata, aes(x='Applicants', y=ApplicantIncome))+geom_boxplot()
There lies extreme outliers in ‘ApplicantIncome’ which shows the existence of income disparity in our society.
ggplot(loandata, aes(x=Education, y=ApplicantIncome))+geom_boxplot()
Above segregation shows that although there is not much difference between the mean income of Graduates and non graduates but there are high number of graduates with very high incomes. Now lets look at LoanAmount:
ggplot(loandata, aes(x='Applicants', y=LoanAmount))+geom_boxplot()
## Warning: Removed 22 rows containing non-finite values (stat_boxplot).
So, both ApplicantIncome and LoanAmount have outliers as well as LoanAmount has NA values and hence this demands deeper understanding.
Now that we have some understanding of ApplicantIncome and LoanAmount, lets look at the distribution of Categorical variables. One important variable which we have in our dataset is Credit_History which categorises applicants based on adherence of their credit history to guidelines. Also, Loan_Status shows whether loan was approved or not. Lets look at these 2 variables:
table(loandata$Credit_History, loandata$Loan_Status)
##
## N Y
## 0 82 7
## 1 97 378
Great. Without noticing, we have a simple classifier for Loan Approval based on Credit_History variable. This classification gives us a True Positive of 378x100/(378+97)=79.6% and True Negative of 82x100/(82+7)=92.1%.
Now, categorical variables could have blank entries which will not be an NA value but these blanks are also required to be captured. Lets check for the same:
missVal<-sapply(loandata, function(x) sum(x=='' | is.na(x)))
missVal[missVal>0]
## Gender Married Dependents Self_Employed
## 13 3 15 32
## LoanAmount Loan_Amount_Term Credit_History
## 22 14 50
So, all these variables should be dealt appropriately for imputation of missing values.
We will fill missing values of LoanAmount. There are numerous ways of filling missing values and we will try 3 ways. Simplest method would be to replace them by mean, but, look at the distribution of LoanAmount:
qplot(loandata$LoanAmount, bins=60)
## Warning: Removed 22 rows containing non-finite values (stat_bin).
The distribution is skewed so mean would not be a great way for imputing missing values. However, lets perform a simple transformation of LoanAmount:
trans_la<- log(loandata$LoanAmount)
qplot(trans_la, bins=60)
## Warning: Removed 22 rows containing non-finite values (stat_bin).
The distribution looks much more normal now and thus roughly centered about its mean. Now, we can impute NA of transformed loan amount with its mean value.
loandata1<-loandata #Generating a copy for imputation using method-I
l<- sapply(trans_la, function(x){
x[is.na(x)]<-mean(trans_la,na.rm = T)
x
})
loandata1$LoanAmount<-exp(l)
Second, we will impute missing values based on other variables. We can use employment mode and education to arrive at suitable values LoanAmount. But first, we have to resolve the missing values of Self_Employed variable. Lets check its distribution:
plot(loandata$Self_Employed)
Most of the applicants are not self employed and hence we can suitably assume the missing ones also to be not self employed. This can be easily done by changing factor levels of blank entries to ‘No’:
loandata2<-loandata #Generating a copy for imputation using method-II
levels(loandata2$Self_Employed)<-c('No', 'No', 'Yes')
Now, lets look at LoanAmount distribution with respect to Education and Self_employed:
loandata2$SelfEmpl_Edu<-interaction(loandata2$Self_Employed, loandata2$Education)
ggplot(loandata2, aes(x=SelfEmpl_Edu, y=LoanAmount))+geom_boxplot()
## Warning: Removed 22 rows containing non-finite values (stat_boxplot).
And now we can replace the NA values by medians of these 4 categories in the same way as before. This is more appropriate as we have better attributes to it now.
med_values<-tapply(loandata2$LoanAmount, loandata2$SelfEmpl_Edu, function(x) median(x, na.rm = T))
med_values
## No.Graduate Yes.Graduate No.Not Graduate Yes.Not Graduate
## 130.0 157.5 113.0 130.0
Now, we will impute these values in place of NAs in LoanAmount:
l1<-tapply(loandata2$LoanAmount, loandata2$SelfEmpl_Edu, function(x){
replace(x, which(is.na(x)), median(x, na.rm=T))
})
suppressMessages(library(dplyr))
loandata2$LoanAmount<- combine(l1)
In above code, tapply is used to impute NAs with median values based on levels of factor variable SelfEmpl_Edu. But, tapply returns an array, so, combine function from dplyr package is used to bind that array in 1 vector.
R has many great packages available for imputing missing values through predictive analysis of variables. One great package is missForest, which builds a random forest model for each variable. Then it uses the model to predict missing values in the variable with the help of observed values. This package could be deployed to impute missing values for all variables simultaneously in a single step but its important to take care of details so as to get proper results. For using this package we will first remove the Loan_ID from our data set as it is an independent variable.
suppressMessages(library(missForest))
loandata3<-select(loandata, -Loan_ID)
imploandata<-missForest(loandata3)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
We can check the imputation errors. NRMSE is normalized mean squared error. It is used to represent error derived from imputing continuous values. PFC (proportion of falsely classified) is used to represent error derived from imputing categorical values.
imploandata$OOBerror
## NRMSE PFC
## 0.01110868 0.00000000
#Imputed dataset
head(imploandata$ximp)
## Gender Married Dependents Education Self_Employed ApplicantIncome
## 1 Male No 0 Graduate No 5849
## 2 Male Yes 1 Graduate No 4583
## 3 Male Yes 0 Graduate Yes 3000
## 4 Male Yes 0 Not Graduate No 2583
## 5 Male No 0 Graduate No 6000
## 6 Male Yes 2 Graduate Yes 5417
## CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History
## 1 0 149.3228 360 1
## 2 1508 128.0000 360 1
## 3 0 66.0000 360 1
## 4 2358 120.0000 360 1
## 5 0 141.0000 360 1
## 6 4196 267.0000 360 1
## Property_Area Loan_Status
## 1 Urban Y
## 2 Rural N
## 3 Urban Y
## 4 Urban Y
## 5 Urban Y
## 6 Urban Y