Project Title: Loan Prediction Analysis

NAME: Ayushi

EMAIL: ayushikrishnan712@gmail.com

COLLEGE: IIIT Bhubaneswar

1. Introduction

About Company

Dream Housing Finance company deals in all types of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan.

The company wants to automate the loan eligibility process (in real time) based on the customer details provided in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan amount, so that it can specifically target these customers.

2. Overview of the study

Our field study concerns loan prediction analysis for Dream Housing Finance company, using the customer details collected in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. Among the regression models we fitted, the loan amount was significantly associated with applicant income, coapplicant income and the education of the applicant.

3. An empirical field study of Loan prediction

3.1 Overview

The specific objective of this study was to analyse the loan amount to be given to an applicant based on the conditions recorded in the data set. The data set covers a range of situations: some applicants were from rural areas while others were from semi-urban or urban areas, and applicants differed in gender, marital status and educational background.

3.2 Data

Variable Description

Loan_ID - Unique Loan ID

Gender - Male/ Female

Married - Applicant married (Y/N)

Dependents - Number of dependents

Education - Applicant Education (Graduate/ Not Graduate)

Self_Employed - Self employed (Y/N)

ApplicantIncome - Applicant income

CoapplicantIncome - Coapplicant income

LoanAmount - Loan amount in thousands

Loan_Amount_Term - Term of loan in months

Credit_History - Credit history meets guidelines (1 = yes, 0 = no)

Property_Area - Urban/ Semi Urban/ Rural

Loan_Status - Loan approved (Y/N)

3.3 Model

We proposed the following model:

$$\text{LoanAmount} = \beta_0 + \beta_1\,\text{ApplicantIncome} + \beta_2\,\text{CoapplicantIncome} + \beta_3\,\text{Education} + \varepsilon$$

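# Read the loan data; stringsAsFactors = TRUE keeps text columns as factors (needed on R >= 4.0 for the summary below)
model <- read.csv("loan_prediction.csv", stringsAsFactors = TRUE)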
summary(model)
##      Loan_ID       Gender    Married   Dependents        Education  
##  LP001002:  1         : 13      :  3     : 15     Graduate    :480  
##  LP001003:  1   Female:112   No :213   0 :345     Not Graduate:134  
##  LP001005:  1   Male  :489   Yes:398   1 :102                       
##  LP001006:  1                          2 :101                       
##  LP001008:  1                          3+: 51                       
##  LP001011:  1                                                       
##  (Other) :608                                                       
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##     : 32       Min.   :  150   Min.   :    0     Min.   :  9.0  
##  No :500       1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  Yes: 82       Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50
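
The unlabeled factor levels in the summary above (13 in Gender, 3 in Married, 15 in Dependents, 32 in Self_Employed) are empty strings rather than recorded missing values. A minimal sketch, assuming a small hypothetical CSV supplied inline through the `text` argument, of how `na.strings` could turn such blanks into `NA` at load time:

```r
# Hypothetical three-row extract, used only to make the example runnable;
# the real data come from loan_prediction.csv.
csv_text <- "Loan_ID,Gender,Married,LoanAmount
LP1,Male,Yes,128
LP2,,No,66
LP3,Female,,"

# Treat empty fields as missing so they do not become a blank factor level.
toy <- read.csv(text = csv_text,
                na.strings = c("", "NA"),
                stringsAsFactors = TRUE)

levels(toy$Gender)    # only the real categories remain
colSums(is.na(toy))   # number of missing values per column
```

Loading the data this way would change the counts in later tables, because rows with missing categories would no longer form their own group.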

3.4 Results

From the analysis we found that the p-values of applicant income, coapplicant income and education are all less than 0.05 in the fitted regression. This indicates that the loan amount is significantly associated with these variables at the 5% level.
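
As an illustration of how such p-values can be read off a fitted model, here is a minimal sketch on simulated data. The data frame `toy` and its coefficients are hypothetical and are not the loan data; only the `lm` and `coef(summary(...))` calls mirror what the analysis relies on.

```r
set.seed(1)

# Hypothetical data, generated only to make the example runnable.
n <- 200
toy <- data.frame(
  ApplicantIncome   = runif(n, 1000, 10000),
  CoapplicantIncome = runif(n, 0, 5000),
  Education         = factor(sample(c("Graduate", "Not Graduate"), n, replace = TRUE))
)
toy$LoanAmount <- 90 + 0.008 * toy$ApplicantIncome +
  0.007 * toy$CoapplicantIncome -
  15 * (toy$Education == "Not Graduate") + rnorm(n, sd = 20)

fit <- lm(LoanAmount ~ ApplicantIncome + CoapplicantIncome + Education, data = toy)

# Coefficient table: estimate, standard error, t value and p-value per term.
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|t|)"]
p_values < 0.05   # TRUE for terms significant at the 5% level
```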

4. Conclusion

This paper was motivated by the need for research on the factors on which the loan amount and the grant status of a loan depend. We successfully analyzed the data and found that the loan amount depends on applicant income, coapplicant income and education status.


Study of the data set

dim(model)
## [1] 614  13

The data set has 614 rows and 13 columns.

library(psych)
describe(model$LoanAmount)
##    vars   n   mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 592 146.41 85.59    128  133.14 47.44   9 700   691 2.66    10.26
##      se
## X1 3.52
library(psych)
describe(model$ApplicantIncome)
##    vars   n    mean      sd median trimmed     mad min   max range skew
## X1    1 614 5403.46 6109.04 3812.5 4292.06 1822.86 150 81000 80850 6.51
##    kurtosis     se
## X1    59.83 246.54
library(psych)
describe(model$CoapplicantIncome)
##    vars   n    mean      sd median trimmed     mad min   max range skew
## X1    1 614 1621.25 2926.25 1188.5 1154.85 1762.07   0 41667 41667 7.45
##    kurtosis     se
## X1    83.97 118.09
library(psych)
describe(model$Loan_Amount_Term)
##    vars   n mean    sd median trimmed mad min max range  skew kurtosis
## X1    1 600  342 65.12    360  358.38   0  12 480   468 -2.35     6.58
##      se
## X1 2.66

One way contingency tables

mytable<-with(model,table(Gender))
mytable
## Gender
##        Female   Male 
##     13    112    489
mytable<-with(model,table(Married))
mytable
## Married
##      No Yes 
##   3 213 398
mytable<-with(model,table(Education))
mytable
## Education
##     Graduate Not Graduate 
##          480          134
mytable<-with(model,table(Self_Employed))
mytable
## Self_Employed
##      No Yes 
##  32 500  82
mytable<-with(model,table(Credit_History))
mytable
## Credit_History
##   0   1 
##  89 475
mytable<-with(model,table(Property_Area))
mytable
## Property_Area
##     Rural Semiurban     Urban 
##       179       233       202
mytable<-with(model,table(Loan_Status))
mytable
## Loan_Status
##   N   Y 
## 192 422
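
Counts are often easier to compare as shares. A small sketch, using the Loan_Status counts copied from the table above (the vector name `loan_status` is ours):

```r
# Counts copied from the Loan_Status table above.
loan_status <- c(N = 192, Y = 422)

# Convert counts to proportions of the 614 applications.
shares <- prop.table(loan_status)
round(100 * shares, 1)   # percentages
```

On these counts, about 68.7% of the applications were approved.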

Two way contingency tables

mytable<-xtabs(~Loan_Status+Married, data=model) 
mytable
##            Married
## Loan_Status      No Yes
##           N   0  79 113
##           Y   3 134 285
mytable1<- model[which(model$Gender=="Male"|model$Gender=="Female"),]
mytable<-xtabs(~Loan_Status+Gender, data=mytable1) 
mytable
##            Gender
## Loan_Status     Female Male
##           N   0     37  150
##           Y   0     75  339
mytable1<- model[which(model$Education=="Graduate"|model$Education=="Not Graduate"),]
mytable<-xtabs(~Loan_Status+Education, data=mytable1) 
mytable
##            Education
## Loan_Status Graduate Not Graduate
##           N      140           52
##           Y      340           82
mytable1<- model[which(model$Self_Employed=="Yes"|model$Self_Employed=="No"),]
mytable<-xtabs(~Loan_Status+Self_Employed, data=mytable1) 
mytable
##            Self_Employed
## Loan_Status      No Yes
##           N   0 157  26
##           Y   0 343  56
mytable<-xtabs(~Loan_Status+Property_Area, data=model) 
mytable
##            Property_Area
## Loan_Status Rural Semiurban Urban
##           N    69        54    69
##           Y   110       179   133
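
To compare approval across areas of different size, the counts can be turned into column proportions. A minimal sketch, rebuilding the Loan_Status by Property_Area table above by hand (the object names are ours):

```r
# Counts copied from the Loan_Status by Property_Area table above.
tab <- matrix(c(69, 54, 69,
                110, 179, 133),
              nrow = 2, byrow = TRUE,
              dimnames = list(Loan_Status = c("N", "Y"),
                              Property_Area = c("Rural", "Semiurban", "Urban")))

# margin = 2 divides each column by its total, giving shares within each area.
approval_rate <- prop.table(tab, margin = 2)["Y", ]
round(approval_rate, 3)
```

On these counts the approval share is highest in the Semiurban group (179 of 233 applications).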

Histograms

hist(model$ApplicantIncome,xlab="applicant income",main="Frequency distribution of applicant income",col="lightblue",breaks=10)

hist(model$LoanAmount,xlab="loan amount",main="Loan amount frequency distribution",col="lightblue",breaks=10)

hist(model$Loan_Amount_Term,xlab="term for the loan",main="Loan amount term frequency distribution",col="lightblue",breaks=10)

hist(model$CoapplicantIncome,xlab="coapplicant income",main="Frequency distribution of coapplicant income",col="lightblue",xlim=c(0,15000))

hist(model$Credit_History,xlab="credit history (where 1 means credit history meets guidelines and 0 means it does not)",main="Frequency distribution of credit history",col="lightblue")

Box plots of variables

boxplot(model$LoanAmount,xlab="loan amount",main="loan amount taken",horizontal = TRUE)

boxplot(model$ApplicantIncome,xlab="applicant income",main="Plot of distribution of applicant income",horizontal = TRUE)

boxplot(model$CoapplicantIncome,xlab="coapplicant income",main="Plot of distribution of coapplicant income",horizontal = TRUE)

boxplot(model$Loan_Amount_Term,xlab=" term of loan amount",main="Plot showing term for which loan amount is taken",horizontal=TRUE)

boxplot(ApplicantIncome~Loan_Status, data=model,xlab="Loan granted or not?",ylab="Applicant income",main="Loan grant status with respect to applicant income")

plot(model$ApplicantIncome,model$Loan_Status,col="blue",xlab="applicant income",ylab="loan grant status",main="Plot of applicant income and loan grant status")

Here, on the y-axis, 1 represents loan not granted (N) while 2 represents loan granted (Y).

plot(model$CoapplicantIncome,model$Loan_Status,xlab="coapplicant income",ylab="loan grant status",col="blue",main="Plot of coapplicant income and loan grant status")

Here, on the y-axis, 1 represents loan not granted (N) while 2 represents loan granted (Y).

plot(model$ApplicantIncome,model$LoanAmount,col="blue",xlab="applicant income",ylab="loan amount",main="Plot of applicant income and loan amount")

plot(model$ApplicantIncome,model$Loan_Amount_Term,col="blue",xlab="applicant income",ylab="loan amount term",main="Plot of applicant income and loan amount term")

Scatter plots

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(formula=~ApplicantIncome+CoapplicantIncome+LoanAmount, data=model, diagonal="histogram")

scatterplotMatrix(formula=~ApplicantIncome+LoanAmount+Loan_Amount_Term, data=model, diagonal="histogram")

scatterplotMatrix(formula=~ApplicantIncome+CoapplicantIncome+Loan_Amount_Term, data=model, diagonal="histogram")

scatterplotMatrix(formula=~CoapplicantIncome+LoanAmount+Loan_Amount_Term, data=model, diagonal="histogram")

scatterplotMatrix(formula=~ApplicantIncome+CoapplicantIncome+LoanAmount+Loan_Amount_Term, data=model, diagonal="histogram")

Correlation matrix

cor(model[ ,c(7:11)])
##                   ApplicantIncome CoapplicantIncome LoanAmount
## ApplicantIncome         1.0000000        -0.1166046         NA
## CoapplicantIncome      -0.1166046         1.0000000         NA
## LoanAmount                     NA                NA          1
## Loan_Amount_Term               NA                NA         NA
## Credit_History                 NA                NA         NA
##                   Loan_Amount_Term Credit_History
## ApplicantIncome                 NA             NA
## CoapplicantIncome               NA             NA
## LoanAmount                      NA             NA
## Loan_Amount_Term                 1             NA
## Credit_History                  NA              1
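
The NA entries above arise because LoanAmount, Loan_Amount_Term and Credit_History contain missing values, and `cor` returns NA for any pair involving them unless told how to handle missing data. A minimal sketch on a small hypothetical data frame (the values are illustrative, not taken from the loan file) comparing the `use` options:

```r
# Hypothetical values with a few NAs, mimicking the missing entries in
# LoanAmount and Loan_Amount_Term.
toy <- data.frame(
  ApplicantIncome  = c(5849, 4583, 3000, 2583, 6000, 5417),
  LoanAmount       = c(NA, 128, 66, 120, 141, 267),
  Loan_Amount_Term = c(360, 360, NA, 360, 360, 180)
)

cor(toy)                                 # NA wherever a column has missing values
cor(toy, use = "complete.obs")           # only rows with no NA at all
cor(toy, use = "pairwise.complete.obs")  # each pair uses its own complete rows
```

`complete.obs` drops every row with any missing value, while `pairwise.complete.obs` keeps more data at the cost of each correlation being based on a different subset of rows.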

Corrgram

library(corrplot)
## corrplot 0.84 loaded
corrplot(corr=cor(model[,c(7:11)],use="complete.obs"),method="ellipse")

library(corrgram)
corrgram(model, order=TRUE, lower.panel=panel.shade,
  upper.panel=panel.pie, text.panel=panel.txt,
  main="Loan prediction analysis Correlogram")

Assumption 1: Grant status of loan depends on gender of the applicant

mytable<- xtabs(~Gender+Loan_Status, data=model)
mytable
##         Loan_Status
## Gender     N   Y
##            5   8
##   Female  37  75
##   Male   150 339
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 0.5559, df = 2, p-value = 0.7573

Since the p-value (0.7573) is greater than 0.05, we cannot reject the hypothesis that gender and grant status of loan are independent.

Assumption 2: Grant status of loan depends on marital status of the applicant

mytable<- xtabs(~Married+Loan_Status, data=model)
mytable
##        Loan_Status
## Married   N   Y
##           0   3
##     No   79 134
##     Yes 113 285
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 6.2549, df = 2, p-value = 0.04383

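Since the p-value (0.04383) is less than 0.05, we can say that grant status depends on whether the applicant is married. Note that the table includes an unlabeled category with only three applicants, which is why R warns that the chi-squared approximation may be incorrect.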
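
One way to check whether the small unlabeled category drives this result is to repeat the test on the No/Yes rows only. A hedged sketch, with the counts copied by hand from the table above (the matrix name `married_tab` is ours):

```r
# Counts copied from the table above, leaving out the unlabeled row (0, 3).
married_tab <- matrix(c(79, 134,
                        113, 285),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(Married = c("No", "Yes"),
                                      Loan_Status = c("N", "Y")))

# For a 2x2 table chisq.test applies Yates' continuity correction by default.
res <- chisq.test(married_tab)
res$p.value
min(res$expected)   # smallest expected cell count
```

On these counts the p-value stays below 0.05 (roughly 0.03), so the conclusion does not appear to hinge on the three unlabeled applicants.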

Assumption 3: Grant status of loan depends on education of the applicant

mytable<- xtabs(~Education+Loan_Status, data=model)
mytable
##               Loan_Status
## Education        N   Y
##   Graduate     140 340
##   Not Graduate  52  82
chisq.test(mytable)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mytable
## X-squared = 4.0915, df = 1, p-value = 0.0431

Since the p-value (0.0431) is less than 0.05, we can say that grant status of loan depends on education of the applicant.

Assumption 4: Grant status of loan depends on the situation whether the applicant is self employed or not

mytable<- xtabs(~Self_Employed+Loan_Status, data=model)
mytable
##              Loan_Status
## Self_Employed   N   Y
##                 9  23
##           No  157 343
##           Yes  26  56
chisq.test(mytable)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 0.1585, df = 2, p-value = 0.9238

Since the p-value (0.9238) is greater than 0.05, we find no evidence that loan grant status depends on self-employment.

Assumption 5: Grant status of loan is highly related with credit history

mytable<- xtabs(~Credit_History+Loan_Status, data=model)
mytable
##               Loan_Status
## Credit_History   N   Y
##              0  82   7
##              1  97 378
chisq.test(mytable)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mytable
## X-squared = 174.64, df = 1, p-value < 2.2e-16

Since the p-value is less than 0.05, we can say that credit history and loan grant status are highly related.

Assumption 6: Grant status of loan is highly related with property area

mytable<- xtabs(~Gender+Property_Area, data=model)
mytable
##         Property_Area
## Gender   Rural Semiurban Urban
##              4         6     3
##   Female    24        55    33
##   Male     151       172   166
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 8.6475, df = 4, p-value = 0.07054

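Since the p-value (0.07054) is greater than 0.05, gender and property area are not significantly associated. Note, however, that this table crosses Gender with Property_Area rather than Loan_Status with Property_Area, so it does not by itself test Assumption 6.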
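
Assumption 6 is about Loan_Status and Property_Area, whereas the table above crosses Gender with Property_Area. A hedged sketch of the test the assumption calls for, rebuilding the Loan_Status by Property_Area counts from the two-way contingency table earlier in this report (the matrix name `area_tab` is ours):

```r
# Counts copied from the Loan_Status by Property_Area two-way table.
area_tab <- matrix(c(69, 54, 69,
                     110, 179, 133),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Loan_Status = c("N", "Y"),
                                   Property_Area = c("Rural", "Semiurban", "Urban")))

# Pearson chi-squared test of independence (no continuity correction for a 2x3 table).
res <- chisq.test(area_tab)
res$statistic
res$parameter   # degrees of freedom
res$p.value
```

On these counts the statistic works out to roughly 12.3 on 2 degrees of freedom, with a p-value of about 0.002, which would point to an association between property area and loan grant status.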

t-tests

Hypothesis 1: There is no significant difference between the means of income of the applicant and loan amount.

t.test(model$ApplicantIncome,model$LoanAmount)
## 
##  Welch Two Sample t-test
## 
## data:  model$ApplicantIncome and model$LoanAmount
## t = 21.321, df = 613.25, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4772.831 5741.263
## sample estimates:
## mean of x mean of y 
## 5403.4593  146.4122

Since the p-value is less than 0.05, there is a significant difference between the means of applicant income and loan amount. (Note that LoanAmount is recorded in thousands, so the two variables are not on the same scale.)

Hypothesis 2: There is no significant difference between the means of applicant income and coapplicant income.

t.test(model$ApplicantIncome,model$CoapplicantIncome)
## 
##  Welch Two Sample t-test
## 
## data:  model$ApplicantIncome and model$CoapplicantIncome
## t = 13.836, df = 880.23, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3245.690 4318.737
## sample estimates:
## mean of x mean of y 
##  5403.459  1621.246

Since the p-value is less than 0.05, there is a significant difference between the means of applicant income and coapplicant income.
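
Because applicant and coapplicant income are recorded on the same application, a paired test is a possible alternative to the Welch two-sample test used above. A minimal sketch on simulated incomes (the vectors `applicant` and `coapplicant` are hypothetical, not the loan data):

```r
set.seed(42)

# Hypothetical incomes for 50 applications, one pair per application.
n <- 50
applicant   <- rlnorm(n, meanlog = log(4000), sdlog = 0.5)
coapplicant <- rlnorm(n, meanlog = log(1200), sdlog = 0.8)

# Welch test treats the two columns as independent samples.
welch <- t.test(applicant, coapplicant)

# Paired test works on the per-application differences instead.
paired <- t.test(applicant, coapplicant, paired = TRUE)

welch$parameter    # Welch-Satterthwaite degrees of freedom
paired$parameter   # n - 1 degrees of freedom
```

Which test is appropriate depends on the question being asked; this sketch only shows how the two calls differ.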

Hypothesis 3: There is no significant difference between the means of coapplicant income and the loan amount.

t.test(model$CoapplicantIncome,model$LoanAmount)
## 
##  Welch Two Sample t-test
## 
## data:  model$CoapplicantIncome and model$LoanAmount
## t = 12.483, df = 614.09, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1242.814 1706.853
## sample estimates:
## mean of x mean of y 
## 1621.2458  146.4122

Since p-value is less than 0.05, we can reject the null hypothesis and say that there is a significant difference between means of coapplicant income and loan amount.

Different regression models

mytable <- lm(model$LoanAmount~ model$ApplicantIncome+model$CoapplicantIncome)
summary(mytable)
## 
## Call:
## lm(formula = model$LoanAmount ~ model$ApplicantIncome + model$CoapplicantIncome)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -404.14  -27.91   -6.91   21.01  392.75 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8.921e+01  4.081e+00  21.860  < 2e-16 ***
## model$ApplicantIncome   8.332e-03  4.494e-04  18.543  < 2e-16 ***
## model$CoapplicantIncome 7.407e-03  9.333e-04   7.936 1.06e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.9 on 589 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.3911, Adjusted R-squared:  0.389 
## F-statistic: 189.1 on 2 and 589 DF,  p-value: < 2.2e-16
mytable<-lm(model$LoanAmount~model$Gender+model$Married+model$Education+model$Self_Employed)
summary(mytable)
## 
## Call:
## lm(formula = model$LoanAmount ~ model$Gender + model$Married + 
##     model$Education + model$Self_Employed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162.46  -45.82  -13.54   21.11  492.34 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 222.6660    64.9216   3.430 0.000647 ***
## model$GenderFemale          -63.4893    24.6299  -2.578 0.010189 *  
## model$GenderMale            -48.7648    23.3514  -2.088 0.037203 *  
## model$MarriedNo             -22.5029    58.8922  -0.382 0.702524    
## model$MarriedYes             -0.1086    58.6557  -0.002 0.998524    
## model$EducationNot Graduate -35.6084     8.3015  -4.289  2.1e-05 ***
## model$Self_EmployedNo       -14.9012    15.3404  -0.971 0.331765    
## model$Self_EmployedYes       13.6640    17.5716   0.778 0.437110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.65 on 584 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.07845,    Adjusted R-squared:  0.06741 
## F-statistic: 7.103 on 7 and 584 DF,  p-value: 3.696e-08
mytable<- lm(model$LoanAmount~model$Loan_Amount_Term+model$Credit_History+model$Property_Area)
summary(mytable)
## 
## Call:
## lm(formula = model$LoanAmount ~ model$Loan_Amount_Term + model$Credit_History + 
##     model$Property_Area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -129.15  -45.37  -18.91   21.85  563.28 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  149.70622   22.42662   6.675  6.3e-11 ***
## model$Loan_Amount_Term         0.02377    0.05669   0.419   0.6752    
## model$Credit_History          -4.35560   10.27054  -0.424   0.6717    
## model$Property_AreaSemiurban  -8.53292    8.92351  -0.956   0.3394    
## model$Property_AreaUrban     -15.75679    9.43890  -1.669   0.0956 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84.17 on 524 degrees of freedom
##   (85 observations deleted due to missingness)
## Multiple R-squared:  0.006169,   Adjusted R-squared:  -0.001417 
## F-statistic: 0.8132 on 4 and 524 DF,  p-value: 0.5171
mytable <- lm(model$LoanAmount~model$ApplicantIncome+model$CoapplicantIncome+model$Gender+model$Education)
summary(mytable)
## 
## Call:
## lm(formula = model$LoanAmount ~ model$ApplicantIncome + model$CoapplicantIncome + 
##     model$Gender + model$Education)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -389.86  -28.22   -6.18   18.90  390.86 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.161e+02  1.916e+01   6.057 2.49e-09 ***
## model$ApplicantIncome        8.067e-03  4.556e-04  17.706  < 2e-16 ***
## model$CoapplicantIncome      7.075e-03  9.355e-04   7.562 1.54e-13 ***
## model$GenderFemale          -3.164e+01  1.970e+01  -1.606   0.1087    
## model$GenderMale            -1.959e+01  1.885e+01  -1.039   0.2990    
## model$EducationNot Graduate -1.626e+01  6.763e+00  -2.404   0.0165 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.53 on 586 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.4009, Adjusted R-squared:  0.3958 
## F-statistic: 78.44 on 5 and 586 DF,  p-value: < 2.2e-16
mytable <- lm(model$LoanAmount~model$ApplicantIncome+model$CoapplicantIncome+model$Education)
summary(mytable)
## 
## Call:
## lm(formula = model$LoanAmount ~ model$ApplicantIncome + model$CoapplicantIncome + 
##     model$Education)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -396.48  -28.63   -6.60   19.19  391.29 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  9.371e+01  4.499e+00  20.831  < 2e-16 ***
## model$ApplicantIncome        8.182e-03  4.523e-04  18.092  < 2e-16 ***
## model$CoapplicantIncome      7.225e-03  9.331e-04   7.743 4.27e-14 ***
## model$EducationNot Graduate -1.578e+01  6.757e+00  -2.335   0.0199 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.65 on 588 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.3966, Adjusted R-squared:  0.3936 
## F-statistic: 128.9 on 3 and 588 DF,  p-value: < 2.2e-16

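Thus, from the analysis of the above models, we see that applicant income, coapplicant income and the education of the applicant are the predictors that remain significant at the 5% level. Gender is significant only in the model that leaves out income, and marital status, self-employment, loan term, credit history and property area are not significant in the models in which they appear. The final model explains about 40% of the variance in loan amount (multiple R-squared 0.3966).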

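Thus the best-fit model is

$$\text{LoanAmount} = 93.71 + 0.008182\,\text{ApplicantIncome} + 0.007225\,\text{CoapplicantIncome} - 15.78\,\text{Education}_{\text{Not Graduate}}$$

where the last term is an indicator equal to 1 for applicants who are not graduates and 0 otherwise.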
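
To make the fitted equation concrete, here is a small sketch that evaluates it for a given applicant. The function name `predict_loan_amount` and the example incomes are ours; the coefficients are copied from the final regression output above.

```r
# Coefficients copied from the final regression output above.
# The result is in the same unit as LoanAmount (thousands).
predict_loan_amount <- function(applicant_income, coapplicant_income, graduate = TRUE) {
  93.71 +
    0.008182 * applicant_income +
    0.007225 * coapplicant_income -
    15.78 * as.numeric(!graduate)   # indicator is 1 for non-graduates
}

# Hypothetical applicant: income 5000, coapplicant income 1500.
predict_loan_amount(5000, 1500, graduate = TRUE)    # 145.4575
predict_loan_amount(5000, 1500, graduate = FALSE)   # 129.6775
```

The two calls differ by exactly the education coefficient, 15.78, which is how the indicator term should be read.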