Predictive Analytics Model Building-Loan status prediction

Today, businesses have moved from a reactive mode to a proactive one. In every function, such as HR, marketing or operations, businesses try to predict future events and take measures to control them. Predictive analytics is the technique businesses use to make these predictions in their areas of operation. In predictive analytics, data is collected and fed into a mathematical model so that the model can predict the outcome for any given data.

Here I demonstrate a case from the finance industry in which predictive analytics is used to predict an outcome.

Problem Statement: A finance company deals in all kinds of home loans. It has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan. The company wants to automate the loan eligibility process (in real time) based on the customer details provided in the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

Data set name: loan_data_set

Variable - Description
Loan_ID - Unique loan ID
Gender - Male/Female
Married - Applicant married (Y/N)
Dependents - Number of dependents
Education - Applicant education (Graduate/Under Graduate)
Self_Employed - Self-employed (Y/N)
ApplicantIncome - Applicant income
CoapplicantIncome - Coapplicant income
LoanAmount - Loan amount in thousands
Loan_Amount_Term - Term of loan in months
Credit_History - Credit history meets guidelines
Property_Area - Urban/Semi Urban/Rural
Loan_Status - Loan approved (Y/N)

A step-by-step procedure for predicting whether the customer's loan will be approved (Y/N) is provided in the PDF document below.

#1.Loading the Data set:

setwd ("D:/Raviteja/Raviteja Professional/Data Science/my projects/competetions/loan_dataset")
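# stringsAsFactors = TRUE keeps text columns as factors (the default became FALSE in R 4.0.0), which the summaries below rely on
loan <- read.csv("loan_data_set_train.csv", header = TRUE, sep = ",", na.strings = "", stringsAsFactors = TRUE)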

#2. Required packages for statistical computation:
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

#3. Understanding the data set: univariate analysis, bivariate analysis and finding missing values in the data set
summary(loan)

##      Loan_ID       Gender    Married    Dependents        Education  
##  LP001002:  1   Female:112   No  :213   0   :345   Graduate    :480  
##  LP001003:  1   Male  :489   Yes :398   1   :102   Not Graduate:134  
##  LP001005:  1   NA's  : 13   NA's:  3   2   :101                     
##  LP001006:  1                           3+  : 51                     
##  LP001008:  1                           NA's: 15                     
##  LP001011:  1                                                        
##  (Other) :608                                                        
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##  No  :500      Min.   :  150   Min.   :    0     Min.   :  9.0  
##  Yes : 82      1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  NA's: 32      Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50

#Let us see the missing values
sapply(loan, function(x) sum(is.na(x)))

##           Loan_ID            Gender           Married        Dependents 
##                 0                13                 3                15 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                32                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                22                14                50                 0 
##       Loan_Status 
##                 0
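The counts above are easier to compare as a share of rows (for example, 50 of 614 values of Credit_History are missing, about 8%). A minimal sketch of that calculation follows, using a made-up data frame `toy` rather than the real loan data.

```r
# Hypothetical toy data standing in for the loan data frame
toy <- data.frame(
  Gender      = c("Male", NA, "Female", "Male"),
  LoanAmount  = c(120, NA, NA, 66),
  Loan_Status = c("Y", "N", "Y", "Y")
)

# Count of missing values per column (same idea as the sapply call)
na_count <- colSums(is.na(toy))

# Percentage of rows missing in each column
na_pct <- round(100 * colMeans(is.na(toy)), 1)

print(rbind(count = na_count, percent = na_pct))
```

Running the same two lines on `loan` instead of `toy` would give the percentages for the real data.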

#To clean this data/to fill the missing values we need to understand the variables:

table (loan$Credit_History)

## 
##   0   1 
##  89 475

#So, the majority of the values are 1. Let us fill the missing values with 1.

ggplot(aes(x= Loan_Amount_Term ), data= loan)+geom_histogram(binwidth=2)+scale_x_continuous(limits = c(0,500),breaks=seq(0,500,10))

## Warning: Removed 14 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

table(loan$Loan_Amount_Term)

## 
##  12  36  60  84 120 180 240 300 360 480 
##   1   2   2   4   3  44   4  13 512  15

#From the above histogram and the table we can see that most applicants have taken a loan with a term of 360 months. Let us set the missing values to 360.

ggplot(aes(x=LoanAmount), data= loan)+geom_histogram(binwidth = 2)+scale_x_continuous(limits = c(0,700),breaks=seq(0,700,50))

## Warning: Removed 22 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

#From the above histogram of the loan amount variable we can observe a roughly bell-shaped distribution with a long right tail (outliers). Let us calculate the mean value of the variable and use it for the missing values.
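The summary above reports a LoanAmount mean of 146.4 against a median of 128.0, which is what a long right tail does to the mean. The sketch below compares the two statistics as fill values on a small made-up vector; `amount` and its values are hypothetical, and the median is shown only as a common alternative, not as the method used in this analysis.

```r
# Hypothetical loan amounts (in thousands) with one large outlier and two gaps
amount <- c(100, 110, 120, 128, 135, 150, 700, NA, NA)

mean_fill   <- mean(amount, na.rm = TRUE)    # pulled upward by the 700
median_fill <- median(amount, na.rm = TRUE)  # robust to the outlier

# Impute a copy with each statistic to compare the two choices
by_mean   <- ifelse(is.na(amount), mean_fill, amount)
by_median <- ifelse(is.na(amount), median_fill, amount)

print(c(mean = mean_fill, median = median_fill))
```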

ggplot(aes(x=ApplicantIncome), data= loan)+geom_histogram(binwidth = 30)+scale_x_continuous(limits = c(0,85000),breaks=seq(0,85000,2500))

## Warning: Removed 3 rows containing missing values (geom_bar).

#From the above histogram of the applicant income variable we can observe a right-skewed distribution with a long tail (outliers). ApplicantIncome has no missing values (see the counts above), so no imputation is needed for it.

# Let us understand the remaining categorical variables.
table(loan$Gender,loan$Married)

##         
##           No Yes
##   Female  80  31
##   Male   130 357

table(loan$Gender,loan$Dependents)

##         
##            0   1   2  3+
##   Female  80  19   7   3
##   Male   258  82  92  45

table(loan$Married)

## 
##  No Yes 
## 213 398

table(loan$Self_Employed)

## 
##  No Yes 
## 500  82

table(loan$Education)

## 
##     Graduate Not Graduate 
##          480          134

#In each of the above variables one class dominates (for example Male for Gender, Yes for Married, No for Self_Employed). So, for the missing values, use the mode (the most frequent value of the variable).
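Base R has no built-in function for the statistical mode, so a small helper is one way to find it. The sketch below is illustrative only: `stat_mode` is a hypothetical name, and `gender` is a toy factor rather than the real column.

```r
# Hypothetical helper: most frequent non-missing value of a vector
stat_mode <- function(x) {
  counts <- table(x)                 # table() drops NA by default
  names(counts)[which.max(counts)]   # on ties, the first level wins
}

# Toy factor standing in for a column such as Gender
gender <- factor(c("Male", "Female", "Male", NA, "Male", NA))

# Fill the gaps with the most frequent level
gender[is.na(gender)] <- stat_mode(gender)
print(table(gender))
```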

#Trying to understand relationship between variables :
  
ggplot(aes(x=LoanAmount,y= ApplicantIncome), data= loan)+geom_point()+scale_x_continuous(limits = c(0,700),breaks=seq(0,700,50))+geom_smooth(method='lm',color= 'blue')

## Warning: Removed 22 rows containing non-finite values (stat_smooth).

## Warning: Removed 22 rows containing missing values (geom_point).

# From the above graph we can see that applicant income and loan amount are positively related: higher incomes tend to go with larger loans. We should note this pattern and use it whenever required. Depending on the project and the business requirement, we should draw such scatter plots to understand the interdependencies between variables.
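A scatter plot shows the direction of a relationship; a correlation coefficient puts a number on it. The sketch below uses made-up `income` and `amount` vectors (not the loan data) and skips rows with a missing value, as the plot does.

```r
# Hypothetical toy values; the NA mimics a missing LoanAmount
income <- c(2500, 3200, 4100, 5000, 6500, 9000)
amount <- c(90, 110, NA, 140, 160, 250)

# Pearson correlation on rows where both values are present
r <- cor(income, amount, use = "complete.obs")

# Spearman rank correlation is less sensitive to outliers
rho <- cor(income, amount, use = "complete.obs", method = "spearman")

print(c(pearson = r, spearman = rho))
```

On the real data the call would be `cor(loan$ApplicantIncome, loan$LoanAmount, use = "complete.obs")`.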

#Now, let us plot some boxplots to understand the outliers:

ggplot(aes(x=Gender,y=ApplicantIncome), data=subset(loan,!is.na(Gender)))+geom_boxplot()+coord_cartesian(ylim= c(0,81000))

#The points plotted beyond the whiskers of each box are the outliers.
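To list the outliers rather than only see them, the common 1.5 x IQR rule can be applied directly. This is a sketch on a made-up `income` vector, and the rule is a convention rather than something taken from the original analysis.

```r
# Hypothetical incomes with two extreme values
income <- c(2500, 2900, 3200, 3800, 4100, 5000, 5800, 39000, 81000)

q     <- quantile(income, c(0.25, 0.75))
fence <- 1.5 * IQR(income)

# Flag values more than 1.5 * IQR below the first or above the third quartile
is_outlier <- income < q[[1]] - fence | income > q[[2]] + fence
print(income[is_outlier])
```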


#We have tried to understand the relationships between the variables in the loan data. Based on that understanding, let us replace the missing values so that we can build a prediction model.

#4.Replacing missing values with mode,mean and median values wherever they are applicable:

loan$Credit_History[which(is.na(loan$Credit_History))] <- 1
loan$Married[which(is.na(loan$Married))] <- "Yes"
loan$Gender[which(is.na(loan$Gender))] <- "Male"
loan$Self_Employed[which(is.na(loan$Self_Employed))] <- "No"
loan$Loan_Amount_Term[which(is.na(loan$Loan_Amount_Term))] <- 360
loan$LoanAmount[which(is.na(loan$LoanAmount))] <- mean(loan$LoanAmount, na.rm=T)
# Dependents is a factor, so assign its most frequent level "0" as a string
loan$Dependents[which(is.na(loan$Dependents))] <- "0"

#6.Now, there are no missing values in the data set:

summary(loan)

##      Loan_ID       Gender    Married   Dependents        Education  
##  LP001002:  1   Female:112   No :213   0 :360     Graduate    :480  
##  LP001003:  1   Male  :502   Yes:401   1 :102     Not Graduate:134  
##  LP001005:  1                          2 :101                       
##  LP001006:  1                          3+: 51                       
##  LP001008:  1                                                       
##  LP001011:  1                                                       
##  (Other) :608                                                       
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##  No :532       Min.   :  150   Min.   :    0     Min.   :  9.0  
##  Yes: 82       1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.2  
##                Median : 3812   Median : 1188     Median :129.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:164.8  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                                 
##  Loan_Amount_Term Credit_History    Property_Area Loan_Status
##  Min.   : 12.0    Min.   :0.000   Rural    :179   N:192      
##  1st Qu.:360.0    1st Qu.:1.000   Semiurban:233   Y:422      
##  Median :360.0    Median :1.000   Urban    :202              
##  Mean   :342.4    Mean   :0.855                              
##  3rd Qu.:360.0    3rd Qu.:1.000                              
##  Max.   :480.0    Max.   :1.000                              
##

#7. Deleting LoanID column from the data set as this is not useful for the mathematical model/Logistic regression model building.
#The data set is loaded in to a new data frame with name loan1.

loan1= loan[,2:13]
names(loan1)

##  [1] "Gender"            "Married"           "Dependents"       
##  [4] "Education"         "Self_Employed"     "ApplicantIncome"  
##  [7] "CoapplicantIncome" "LoanAmount"        "Loan_Amount_Term" 
## [10] "Credit_History"    "Property_Area"     "Loan_Status"
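Selecting columns by position works here but breaks silently if the column order changes. A minimal sketch of dropping the ID column by name instead, on a hypothetical toy frame:

```r
# Hypothetical toy frame with the same kind of ID column
toy <- data.frame(Loan_ID     = c("LP1", "LP2"),
                  Gender      = c("Male", "Female"),
                  Loan_Status = c("Y", "N"))

# Keep every column except Loan_ID, regardless of its position
toy1 <- toy[, setdiff(names(toy), "Loan_ID"), drop = FALSE]
print(names(toy1))
```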

#8. Building a Logistic Regression model:

model<- glm(formula = Loan_Status~ ., family = binomial(link=logit), data = loan1)

summary(model)

## 
## Call:
## glm(formula = Loan_Status ~ ., family = binomial(link = logit), 
##     data = loan1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3102  -0.3544   0.5361   0.7209   2.5141  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.446e+00  8.485e-01  -2.883 0.003944 ** 
## GenderMale             -2.975e-02  2.990e-01  -0.100 0.920725    
## MarriedYes              5.839e-01  2.527e-01   2.310 0.020866 *  
## Dependents1            -4.706e-01  2.948e-01  -1.596 0.110480    
## Dependents2             2.909e-01  3.428e-01   0.849 0.396073    
## Dependents3+            2.374e-02  4.261e-01   0.056 0.955568    
## EducationNot Graduate  -4.084e-01  2.596e-01  -1.573 0.115701    
## Self_EmployedYes       -2.413e-02  3.171e-01  -0.076 0.939323    
## ApplicantIncome         1.175e-05  2.441e-05   0.482 0.630138    
## CoapplicantIncome      -5.282e-05  3.520e-05  -1.501 0.133450    
## LoanAmount             -1.923e-03  1.600e-03  -1.202 0.229392    
## Loan_Amount_Term       -1.247e-03  1.827e-03  -0.683 0.494687    
## Credit_History          3.938e+00  4.212e-01   9.350  < 2e-16 ***
## Property_AreaSemiurban  9.067e-01  2.697e-01   3.362 0.000773 ***
## Property_AreaUrban      2.212e-01  2.597e-01   0.852 0.394434    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 762.89  on 613  degrees of freedom
## Residual deviance: 557.22  on 599  degrees of freedom
## AIC: 587.22
## 
## Number of Fisher Scoring iterations: 5
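The estimates in the coefficient table are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies three estimates from the table above instead of refitting the model; with the fitted object, `exp(coef(model))` would do the same for every term. It also checks the reported AIC against the residual deviance.

```r
# Estimates copied from the coefficient table above (log-odds scale)
est <- c(MarriedYes             = 0.5839,
         Credit_History         = 3.938,
         Property_AreaSemiurban = 0.9067)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratio <- exp(est)
print(round(odds_ratio, 2))

# For this binary-response model:
# AIC = residual deviance + 2 * number of estimated coefficients (15 here)
aic <- 557.22 + 2 * 15
print(aic)
```

An odds ratio of roughly 51 for Credit_History means that, holding the other variables fixed, the odds of approval are about 51 times higher when the credit history meets the guidelines.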

#9.Testing the accuracy of model against the given values

#Method-A
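# "Given values" means the Loan_Status variable in the data set.

# predict() takes the data through the newdata argument; an argument named data is silently ignored
fitted.results <- predict(model,newdata=loan1,type='response')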

fitted.results <- ifelse(fitted.results > 0.5,'Y','N')

misClasificError <- mean(fitted.results != loan1$Loan_Status)

print(paste('Accuracy',1-misClasificError))

## [1] "Accuracy 0.812703583061889"

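#An accuracy of around 81.3% looks good, but it is measured on the same data the model was fitted on, so it is likely optimistic. For reference, always predicting 'Y' would already score 422/614, about 68.7%.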
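Accuracy computed on the training rows tends to flatter a model. One common check is to hold out part of the data, fit on the rest, and tabulate predictions against actual values. The sketch below does this on synthetic data; `toy`, `credit`, `income`, `status` and the 70/30 split are all assumptions made for illustration, not part of the original analysis.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical synthetic data: approval depends mostly on credit history
n   <- 400
toy <- data.frame(credit = rbinom(n, 1, 0.85),
                  income = rlnorm(n, 8, 0.5))
toy$status <- factor(ifelse(runif(n) < plogis(-2 + 3.5 * toy$credit), "Y", "N"))

# Hold out 30% of the rows for testing
test_idx <- sample(n, size = round(0.3 * n))
train    <- toy[-test_idx, ]
test     <- toy[test_idx, ]

fit <- glm(status ~ credit + income, family = binomial, data = train)

# Predict on unseen rows and tabulate predicted vs. actual
pred     <- ifelse(predict(fit, newdata = test, type = "response") > 0.5, "Y", "N")
conf_mat <- table(predicted = pred, actual = test$status)
print(conf_mat)
print(mean(pred == test$status))  # holdout accuracy
```

The confusion matrix also shows which kind of error dominates, which a single accuracy figure hides.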

# Method-B: the widely used ROC curve

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

# 
prob <- predict(model, newdata=loan1, type="response")
pred <- prediction(prob, loan1$Loan_Status)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)

auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
print(auc)

## [1] 0.7943572

# The AUC for our model is about 0.794. An Area Under the Curve (AUC) of around 0.8 suggests that the model separates the two classes reasonably well. An AUC of 0.5 means the model does no better than random guessing.
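One way to read the AUC: it is the probability that a randomly chosen approved applicant receives a higher predicted score than a randomly chosen rejected one. The sketch below computes it that way in base R on eight made-up scores; `prob` and `label` are hypothetical and unrelated to the fitted model.

```r
# Hypothetical scores and true labels for eight applicants
prob  <- c(0.92, 0.85, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10)
label <- c("Y",  "Y",  "N",  "Y",  "Y",  "N",  "Y",  "N")

pos <- prob[label == "Y"]
neg <- prob[label == "N"]

# Share of (positive, negative) pairs in which the positive scores higher;
# ties count as half
auc_manual <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
print(auc_manual)
```

Here 11 of the 15 pairs are ordered correctly, so the AUC is 11/15.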


#10.Implications from the above logistic regression model :

anova(model,test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Loan_Status
## 
## Terms added sequentially (first to last)
## 
## 
##                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                613     762.89              
## Gender             1    0.197       612     762.69  0.657053    
## Married            1    5.038       611     757.66  0.024798 *  
## Dependents         3    3.198       608     754.46  0.362154    
## Education          1    4.393       607     750.07  0.036096 *  
## Self_Employed      1    0.001       606     750.06  0.976602    
## ApplicantIncome    1    0.099       605     749.97  0.752714    
## CoapplicantIncome  1    3.241       604     746.72  0.071800 .  
## LoanAmount         1    1.078       603     745.65  0.299217    
## Loan_Amount_Term   1    0.517       602     745.13  0.472136    
## Credit_History     1  175.105       601     570.02 < 2.2e-16 ***
## Property_Area      2   12.801       599     557.22  0.001661 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#From the above chi-square test, the p-values for the variables Self_Employed, ApplicantIncome and Gender are high. This means they have little impact on the prediction, and the result would not change much even if we left them out. Note that the terms are added sequentially (first to last), so each p-value depends on the order of the variables.
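Each Pr(>Chi) value in the table is the upper-tail chi-square probability of that row's drop in deviance. The sketch below reproduces three of them, up to rounding, from figures copied out of the table, and applies the same test to the whole model using the null and residual deviances.

```r
# Deviance drops and degrees of freedom copied from the table above
dev_drop <- c(Married = 5.038, Education = 4.393, Property_Area = 12.801)
df       <- c(Married = 1,     Education = 1,     Property_Area = 2)

# Upper-tail chi-square probability matches the Pr(>Chi) column
p_value <- pchisq(dev_drop, df = df, lower.tail = FALSE)
print(signif(p_value, 4))

# Whole model against the intercept-only model
p_overall <- pchisq(762.89 - 557.22, df = 613 - 599, lower.tail = FALSE)
print(p_overall)
```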

#If the accuracy of the model is not good enough, or if the client demands more, we should revisit the univariate and bivariate analysis done in step 3 and check whether there is a strong dependency between variables. If so, we should use that relationship to fit a regression that predicts the missing values.

#We can also use an imputation technique to predict the missing values.
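The regression idea above can be sketched as follows: fit a model on the complete rows, then predict the missing entries from a related variable. Everything here is synthetic and hypothetical (`toy`, `income`, `amount`, the linear form and the seed); it illustrates the technique and is not the imputation used in this analysis.

```r
set.seed(1)  # arbitrary seed for a repeatable sketch

# Hypothetical synthetic data: loan amount rises with applicant income
income <- round(rlnorm(200, meanlog = 8.3, sdlog = 0.5))
amount <- 60 + 0.015 * income + rnorm(200, sd = 20)
amount[sample(200, 15)] <- NA          # knock out 15 values
toy <- data.frame(income, amount)

# Fit on complete rows (lm drops rows with NA by default)
fit <- lm(amount ~ income, data = toy)

# Replace each missing amount with its predicted value
miss <- is.na(toy$amount)
toy$amount[miss] <- predict(fit, newdata = toy[miss, , drop = FALSE])

print(sum(is.na(toy$amount)))
```

Unlike a single mean, this gives each row its own fill value, at the cost of assuming the fitted relationship holds for the rows with gaps.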

#As we can observe, there are outliers in the data. But we cannot blindly delete them, as they may have real business implications. We need to check the reason behind the outliers and then take action accordingly.

#10. Prediction: The procedure below will be used for prediction.
#Any typical data set (like "loan_test" mentioned below) can be fed into the model to predict the outcome.

#result <- predict(model,loan_test,type="response")

#loan_test is a data set with all the details except the loan status. The loan status has to be predicted.

#result1 <- ifelse(result > 0.5,'Y','N')

#11. The prediction can also be done using other machine learning algorithms such as decision trees (CART models), random forests, XGBoost etc. For this data set, I tried these as well and found that logistic regression is accurate.

#With this kind of analytics work, we can save our clients a lot of time and money by helping them predict loan defaulters.

Raviteja

February 15, 2016