About Company
Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan.
The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can target these customers specifically.
Our field study concerns loan prediction analysis for Dream Housing Finance company, using the application details listed above. From the study we found that, among the variables examined, the loan amount is significantly associated with applicant income, coapplicant income and the education of the applicant.
The specific objective of this study was to analyse the loan amount that should be given to an applicant under the different conditions recorded in the data. The data set covers a range of situations: some applicants are from rural areas while others are from urban or semi-urban areas, and applicants differ in gender, marital status and educational background.
Loan_ID - Unique Loan ID
Gender - Male/ Female
Married - Applicant married (Y/N)
Dependents - Number of dependents
Education - Applicant Education (Graduate/ Under Graduate)
Self_Employed - Self employed (Y/N)
ApplicantIncome - Applicant income
CoapplicantIncome - Coapplicant income
LoanAmount - Loan amount in thousands
Loan_Amount_Term - Term of loan in months
Credit_History - credit history meets guidelines
Property_Area - Urban/ Semi Urban/ Rural
Loan_Status - Loan approved (Y/N)
We proposed the following model:
Loan amount = β0 + β1*applicant income + β2*coapplicant income + β3*education + ε
# Read the loan data; text columns become factors, as the summary below expects
model <- read.csv("loan_prediction.csv", stringsAsFactors = TRUE)
summary(model)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 : 13 : 3 : 15 Graduate :480
## LP001003: 1 Female:112 No :213 0 :345 Not Graduate:134
## LP001005: 1 Male :489 Yes:398 1 :102
## LP001006: 1 2 :101
## LP001008: 1 3+: 51
## LP001011: 1
## (Other) :608
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 32 Min. : 150 Min. : 0 Min. : 9.0
## No :500 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## Yes: 82 Median : 3812 Median : 1188 Median :128.0
## Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
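The unnamed categories in the summary (13 in Gender, 3 in Married, 32 in Self_Employed) appear because empty fields are read as a factor level of their own rather than as missing values. A minimal sketch of mapping them to NA at read time, assuming a made-up three-row extract (`toy_csv` and its values are hypothetical, not taken from the data set):

```r
# Hypothetical three-row extract; the empty Gender field in row 2 mimics
# the blank entries seen in the summary above.
toy_csv <- "Loan_ID,Gender,LoanAmount
X1,Male,120
X2,,100
X3,Female,NA"

# na.strings tells read.csv to treat both "" and "NA" as missing values.
toy <- read.csv(text = toy_csv, na.strings = c("", "NA"),
                stringsAsFactors = TRUE)

levels(toy$Gender)     # only the real categories remain
colSums(is.na(toy))    # number of missing values per column
```

Read this way, blank entries are counted under NA's in summary() and no longer form an extra row in frequency tables.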
From the analysis we found that the p-values of applicant income, coapplicant income and education are all less than 0.05. This implies that each of these variables is significantly associated with the loan amount.
This paper was motivated by the need for research on the factors on which the grant status of a loan depends. We successfully analyzed the data and found that the loan amount depends on the applicant income, the coapplicant income and the education status.
dim(model)
## [1] 614 13
The data set has 614 rows and 13 columns.
library(psych)
describe(model$LoanAmount)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 592 146.41 85.59 128 133.14 47.44 9 700 691 2.66 10.26
## se
## X1 3.52
library(psych)
describe(model$ApplicantIncome)
## vars n mean sd median trimmed mad min max range skew
## X1 1 614 5403.46 6109.04 3812.5 4292.06 1822.86 150 81000 80850 6.51
## kurtosis se
## X1 59.83 246.54
library(psych)
describe(model$CoapplicantIncome)
## vars n mean sd median trimmed mad min max range skew
## X1 1 614 1621.25 2926.25 1188.5 1154.85 1762.07 0 41667 41667 7.45
## kurtosis se
## X1 83.97 118.09
library(psych)
describe(model$Loan_Amount_Term)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 600 342 65.12 360 358.38 0 12 480 468 -2.35 6.58
## se
## X1 2.66
mytable<-with(model,table(Gender))
mytable
## Gender
## Female Male
## 13 112 489
mytable<-with(model,table(Married))
mytable
## Married
## No Yes
## 3 213 398
mytable<-with(model,table(Education))
mytable
## Education
## Graduate Not Graduate
## 480 134
mytable<-with(model,table(Self_Employed))
mytable
## Self_Employed
## No Yes
## 32 500 82
mytable<-with(model,table(Credit_History))
mytable
## Credit_History
## 0 1
## 89 475
mytable<-with(model,table(Property_Area))
mytable
## Property_Area
## Rural Semiurban Urban
## 179 233 202
mytable<-with(model,table(Loan_Status))
mytable
## Loan_Status
## N Y
## 192 422
mytable<-xtabs(~Loan_Status+Married, data=model)
mytable
## Married
## Loan_Status No Yes
## N 0 79 113
## Y 3 134 285
mytable1<- model[which(model$Gender=="Male"|model$Gender=="Female"),]
mytable<-xtabs(~Loan_Status+Gender, data=mytable1)
mytable
## Gender
## Loan_Status Female Male
## N 0 37 150
## Y 0 75 339
mytable1<- model[which(model$Education=="Graduate"|model$Education=="Not Graduate"),]
mytable<-xtabs(~Loan_Status+Education, data=mytable1)
mytable
## Education
## Loan_Status Graduate Not Graduate
## N 140 52
## Y 340 82
mytable1<- model[which(model$Self_Employed=="Yes"|model$Self_Employed=="No"),]
mytable<-xtabs(~Loan_Status+Self_Employed, data=mytable1)
mytable
## Self_Employed
## Loan_Status No Yes
## N 0 157 26
## Y 0 343 56
mytable<-xtabs(~Loan_Status+Property_Area, data=model)
mytable
## Property_Area
## Loan_Status Rural Semiurban Urban
## N 69 54 69
## Y 110 179 133
hist(model$ApplicantIncome,xlab="applicant income",main="Frequency distribution of applicant income",col="lightblue",breaks=10)
hist(model$LoanAmount,xlab="loan amount",main="Loan amount frequency distribution",col="lightblue",breaks=10)
hist(model$Loan_Amount_Term,xlab="term for the loan",main="Loan amount term frequency distribution",col="lightblue",breaks=10)
hist(model$CoapplicantIncome,xlab="coapplicant income",main="Frequency distribution of coapplicant income",col="lightblue",xlim=c(0,15000))
hist(model$Credit_History,xlab="credit history (where 1 means credit history meets guidelines and 0 means it does not)",main="Frequency distribution of credit history",col="lightblue")
boxplot(model$LoanAmount,xlab="loan amount",main="loan amount taken",horizontal = TRUE)
boxplot(model$ApplicantIncome,xlab="applicant income",main="Plot of distribution of applicant income",horizontal = TRUE)
boxplot(model$CoapplicantIncome,xlab="coapplicant income",main="Plot of distribution of coapplicant income",horizontal = TRUE)
boxplot(model$Loan_Amount_Term,xlab=" term of loan amount",main="Plot showing term for which loan amount is taken",horizontal=TRUE)
boxplot(ApplicantIncome~Loan_Status, data=model,xlab="Loan granted or not?",ylab="Applicant income",main="Loan grant status with respect to applicant income")
plot(model$ApplicantIncome,model$Loan_Status,col="blue",xlab="applicant income",ylab="loan grant status",main="Plot of applicant income and loan grant status")
Here 1 represents not granted while 2 represents loan granted.
plot(model$CoapplicantIncome,model$Loan_Status,xlab="coapplicant income",ylab="loan grant status",col="blue",main="Plot of coapplicant income and loan grant status")
Here 1 represents not granted while 2 represents granted.
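The values 1 and 2 on the axis come from the way R stores a factor: levels are sorted alphabetically and plotted by their integer codes. A small sketch with a made-up vector (`status` is hypothetical) shows the mapping:

```r
# A factor's levels are sorted alphabetically: "N" first, then "Y".
status <- factor(c("N", "Y", "Y"))

# plot() draws the underlying integer codes, so "N" appears at 1 and "Y" at 2.
codes <- as.integer(status)
codes
```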
plot(model$ApplicantIncome,model$LoanAmount,col="blue",xlab="applicant income",ylab="loan amount",main="Plot of applicant income and loan amount")
plot(model$ApplicantIncome,model$Loan_Amount_Term,col="blue",xlab="applicant income",ylab="loan amount term",main="Plot of applicant income and loan amount term")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(formula=~ApplicantIncome+CoapplicantIncome+LoanAmount, data=model, diagonal="histogram")
scatterplotMatrix(formula=~ApplicantIncome+LoanAmount+Loan_Amount_Term, data=model, diagonal="histogram")
scatterplotMatrix(formula=~ApplicantIncome+CoapplicantIncome+Loan_Amount_Term, data=model, diagonal="histogram")
scatterplotMatrix(formula=~CoapplicantIncome+LoanAmount+Loan_Amount_Term, data=model, diagonal="histogram")
scatterplotMatrix(formula=~ApplicantIncome+CoapplicantIncome+LoanAmount+Loan_Amount_Term, data=model, diagonal="histogram")
cor(model[ ,c(7:11)])
## ApplicantIncome CoapplicantIncome LoanAmount
## ApplicantIncome 1.0000000 -0.1166046 NA
## CoapplicantIncome -0.1166046 1.0000000 NA
## LoanAmount NA NA 1
## Loan_Amount_Term NA NA NA
## Credit_History NA NA NA
## Loan_Amount_Term Credit_History
## ApplicantIncome NA NA
## CoapplicantIncome NA NA
## LoanAmount NA NA
## Loan_Amount_Term 1 NA
## Credit_History NA 1
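The NA entries arise because cor() by default returns NA for any pair of columns in which at least one value is missing. A minimal sketch of the difference, using a made-up data frame (`toy` and its values are hypothetical):

```r
# Made-up values for illustration; NA marks a missing loan amount.
toy <- data.frame(
  income = c(2500, 4000, 6000, 8000, 3000),
  loan   = c(100, NA, 180, 240, 110)
)

cor(toy)  # off-diagonal entries are NA because of the missing value

# Use only the rows where both columns of a pair are observed.
r <- cor(toy, use = "pairwise.complete.obs")
r
```

The `use` argument also accepts "complete.obs", which drops every row containing any missing value before computing all correlations.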
library(corrplot)
## corrplot 0.84 loaded
corrplot(corr=cor(model[,c(7:11)],use="complete.obs"),method="ellipse")
library(corrgram)
corrgram(model, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Loan prediction analysis Correlogram")
Assumption 1: Grant status of loan depends on gender of the applicant
mytable<- xtabs(~Gender+Loan_Status, data=model)
mytable
## Loan_Status
## Gender N Y
## 5 8
## Female 37 75
## Male 150 339
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 0.5559, df = 2, p-value = 0.7573
Since the p-value > 0.05, we cannot reject the hypothesis that gender and grant status of loan are independent.
Assumption 2: Grant status of loan depends on marital status of the applicant
mytable<- xtabs(~Married+Loan_Status, data=model)
mytable
## Loan_Status
## Married N Y
## 0 3
## No 79 134
## Yes 113 285
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 6.2549, df = 2, p-value = 0.04383
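Since the p-value is less than 0.05, we can say that grant status depends on whether the applicant is married, although the warning indicates that the three applicants with no recorded marital status make the chi-squared approximation unreliable.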
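The warning is triggered by the unnamed category, whose three applicants give expected counts below 5. One way to check the conclusion is to rerun the test on the two real categories only. A sketch, with the counts copied from the table above and the blank row left out:

```r
# Counts copied from the Married x Loan_Status table above, blank row omitted.
married <- matrix(c(79, 134,
                    113, 285),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Married = c("No", "Yes"),
                                  Loan_Status = c("N", "Y")))

# For a 2x2 table, chisq.test applies Yates' continuity correction by default.
res <- chisq.test(married)
res$expected  # expected counts; all should be well above 5
res
```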
Assumption 3: Grant status of loan depends on education of the applicant
mytable<- xtabs(~Education+Loan_Status, data=model)
mytable
## Loan_Status
## Education N Y
## Graduate 140 340
## Not Graduate 52 82
chisq.test(mytable)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable
## X-squared = 4.0915, df = 1, p-value = 0.0431
Since the p-value < 0.05, we can say that grant status of loan depends on education of the applicant.
Assumption 4: Grant status of loan depends on the situation whether the applicant is self employed or not
mytable<- xtabs(~Self_Employed+Loan_Status, data=model)
mytable
## Loan_Status
## Self_Employed N Y
## 9 23
## No 157 343
## Yes 26 56
chisq.test(mytable)
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 0.1585, df = 2, p-value = 0.9238
Since the p-value > 0.05, we cannot reject the hypothesis that loan grant status is independent of self-employment.
Assumption 5: Grant status of loan is highly related with credit history
mytable<- xtabs(~Credit_History+Loan_Status, data=model)
mytable
## Loan_Status
## Credit_History N Y
## 0 82 7
## 1 97 378
chisq.test(mytable)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable
## X-squared = 174.64, df = 1, p-value < 2.2e-16
Since the p-value is less than 0.05, we can say that credit history and loan grant status are highly related.
Assumption 6: Grant status of loan is highly related with property area
mytable<- xtabs(~Gender+Property_Area, data=model)
mytable
## Property_Area
## Gender Rural Semiurban Urban
## 4 6 3
## Female 24 55 33
## Male 151 172 166
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 8.6475, df = 4, p-value = 0.07054
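Since the p-value > 0.05, the two variables in this table are not significantly associated. Note, however, that the table cross-tabulates Gender with Property_Area, so the result concerns gender and property area; it does not by itself test whether loan grant status is related to property area.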
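Assumption 6 concerns Loan_Status and Property_Area, so that pair can also be tested directly. A minimal sketch, using the counts from the Loan_Status by Property_Area table produced earlier with xtabs (the object names `area` and `area_test` are introduced here for illustration):

```r
# Counts copied from the Loan_Status x Property_Area table earlier in the document.
area <- matrix(c(69, 54, 69,
                 110, 179, 133),
               nrow = 2, byrow = TRUE,
               dimnames = list(Loan_Status = c("N", "Y"),
                               Property_Area = c("Rural", "Semiurban", "Urban")))

# Pearson's chi-squared test of independence on the 2x3 table.
area_test <- chisq.test(area)
area_test
```

On the full data frame the equivalent call would be chisq.test(xtabs(~Property_Area+Loan_Status, data=model)).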
Hypothesis 1: There is no significant difference between the means of income of the applicant and loan amount.
t.test(model$ApplicantIncome,model$LoanAmount)
##
## Welch Two Sample t-test
##
## data: model$ApplicantIncome and model$LoanAmount
## t = 21.321, df = 613.25, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4772.831 5741.263
## sample estimates:
## mean of x mean of y
## 5403.4593 146.4122
Since the p-value is less than 0.05, there is a significant difference between the means of applicant income and loan amount. Note that the two variables are recorded in different units, since LoanAmount is in thousands.
Hypothesis 2: There is no significant difference between the means of applicant income and coapplicant income.
t.test(model$ApplicantIncome,model$CoapplicantIncome)
##
## Welch Two Sample t-test
##
## data: model$ApplicantIncome and model$CoapplicantIncome
## t = 13.836, df = 880.23, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3245.690 4318.737
## sample estimates:
## mean of x mean of y
## 5403.459 1621.246
Since the p-value is less than 0.05, there is a significant difference between the means of applicant income and coapplicant income.
Hypothesis 3: There is no significant difference between the means of coapplicant income and the loan amount.
t.test(model$CoapplicantIncome,model$LoanAmount)
##
## Welch Two Sample t-test
##
## data: model$CoapplicantIncome and model$LoanAmount
## t = 12.483, df = 614.09, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1242.814 1706.853
## sample estimates:
## mean of x mean of y
## 1621.2458 146.4122
Since p-value is less than 0.05, we can reject the null hypothesis and say that there is a significant difference between means of coapplicant income and loan amount.
mytable <- lm(model$LoanAmount~ model$ApplicantIncome+model$CoapplicantIncome)
summary(mytable)
##
## Call:
## lm(formula = model$LoanAmount ~ model$ApplicantIncome + model$CoapplicantIncome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -404.14 -27.91 -6.91 21.01 392.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.921e+01 4.081e+00 21.860 < 2e-16 ***
## model$ApplicantIncome 8.332e-03 4.494e-04 18.543 < 2e-16 ***
## model$CoapplicantIncome 7.407e-03 9.333e-04 7.936 1.06e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.9 on 589 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.3911, Adjusted R-squared: 0.389
## F-statistic: 189.1 on 2 and 589 DF, p-value: < 2.2e-16
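Writing the formula with model$ prefixes works for fitting, but it can make predict() on new applicants awkward. A sketch of the same kind of fit using the data argument, on simulated data (the data frame `toy`, its generating coefficients and `new_applicant` are all hypothetical):

```r
set.seed(1)  # simulated data, for illustration only

# Hypothetical stand-in for the loan data: 50 applicants.
toy <- data.frame(
  ApplicantIncome   = runif(50, 1000, 10000),
  CoapplicantIncome = runif(50, 0, 4000)
)
toy$LoanAmount <- 90 + 0.008 * toy$ApplicantIncome +
  0.007 * toy$CoapplicantIncome + rnorm(50, sd = 20)

# Passing data = keeps coefficient names short and lets predict() use new rows.
fit <- lm(LoanAmount ~ ApplicantIncome + CoapplicantIncome, data = toy)

new_applicant <- data.frame(ApplicantIncome = 5000, CoapplicantIncome = 1500)
pred <- predict(fit, newdata = new_applicant)
pred
```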
mytable<-lm(model$LoanAmount~model$Gender+model$Married+model$Education+model$Self_Employed)
summary(mytable)
##
## Call:
## lm(formula = model$LoanAmount ~ model$Gender + model$Married +
## model$Education + model$Self_Employed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162.46 -45.82 -13.54 21.11 492.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 222.6660 64.9216 3.430 0.000647 ***
## model$GenderFemale -63.4893 24.6299 -2.578 0.010189 *
## model$GenderMale -48.7648 23.3514 -2.088 0.037203 *
## model$MarriedNo -22.5029 58.8922 -0.382 0.702524
## model$MarriedYes -0.1086 58.6557 -0.002 0.998524
## model$EducationNot Graduate -35.6084 8.3015 -4.289 2.1e-05 ***
## model$Self_EmployedNo -14.9012 15.3404 -0.971 0.331765
## model$Self_EmployedYes 13.6640 17.5716 0.778 0.437110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82.65 on 584 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.07845, Adjusted R-squared: 0.06741
## F-statistic: 7.103 on 7 and 584 DF, p-value: 3.696e-08
mytable<- lm(model$LoanAmount~model$Loan_Amount_Term+model$Credit_History+model$Property_Area)
summary(mytable)
##
## Call:
## lm(formula = model$LoanAmount ~ model$Loan_Amount_Term + model$Credit_History +
## model$Property_Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -129.15 -45.37 -18.91 21.85 563.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 149.70622 22.42662 6.675 6.3e-11 ***
## model$Loan_Amount_Term 0.02377 0.05669 0.419 0.6752
## model$Credit_History -4.35560 10.27054 -0.424 0.6717
## model$Property_AreaSemiurban -8.53292 8.92351 -0.956 0.3394
## model$Property_AreaUrban -15.75679 9.43890 -1.669 0.0956 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.17 on 524 degrees of freedom
## (85 observations deleted due to missingness)
## Multiple R-squared: 0.006169, Adjusted R-squared: -0.001417
## F-statistic: 0.8132 on 4 and 524 DF, p-value: 0.5171
mytable <- lm(model$LoanAmount~model$ApplicantIncome+model$CoapplicantIncome+model$Gender+model$Education)
summary(mytable)
##
## Call:
## lm(formula = model$LoanAmount ~ model$ApplicantIncome + model$CoapplicantIncome +
## model$Gender + model$Education)
##
## Residuals:
## Min 1Q Median 3Q Max
## -389.86 -28.22 -6.18 18.90 390.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.161e+02 1.916e+01 6.057 2.49e-09 ***
## model$ApplicantIncome 8.067e-03 4.556e-04 17.706 < 2e-16 ***
## model$CoapplicantIncome 7.075e-03 9.355e-04 7.562 1.54e-13 ***
## model$GenderFemale -3.164e+01 1.970e+01 -1.606 0.1087
## model$GenderMale -1.959e+01 1.885e+01 -1.039 0.2990
## model$EducationNot Graduate -1.626e+01 6.763e+00 -2.404 0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.53 on 586 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.4009, Adjusted R-squared: 0.3958
## F-statistic: 78.44 on 5 and 586 DF, p-value: < 2.2e-16
mytable <- lm(model$LoanAmount~model$ApplicantIncome+model$CoapplicantIncome+model$Education)
summary(mytable)
##
## Call:
## lm(formula = model$LoanAmount ~ model$ApplicantIncome + model$CoapplicantIncome +
## model$Education)
##
## Residuals:
## Min 1Q Median 3Q Max
## -396.48 -28.63 -6.60 19.19 391.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.371e+01 4.499e+00 20.831 < 2e-16 ***
## model$ApplicantIncome 8.182e-03 4.523e-04 18.092 < 2e-16 ***
## model$CoapplicantIncome 7.225e-03 9.331e-04 7.743 4.27e-14 ***
## model$EducationNot Graduate -1.578e+01 6.757e+00 -2.335 0.0199 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.65 on 588 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.3966, Adjusted R-squared: 0.3936
## F-statistic: 128.9 on 3 and 588 DF, p-value: < 2.2e-16
Thus, from the analysis of the above models, we see that of the variables considered, only applicant income, coapplicant income and the education of the applicant remain significant predictors of the loan amount.
Thus the best-fit model is: Loan amount = 93.71 + 0.008182*applicant income + 0.007225*coapplicant income - 15.78*education (not graduate), where education (not graduate) is 1 for an applicant who is not a graduate and 0 otherwise.
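As a worked example, the fitted equation can be wrapped in a small function. This is only a sketch: the function name `predict_loan` and the example applicant are hypothetical, while the coefficients are copied from the final regression summary. The result is in thousands, like LoanAmount.

```r
# Coefficients copied from the final regression summary above.
predict_loan <- function(applicant_income, coapplicant_income, not_graduate) {
  93.71 +
    0.008182 * applicant_income +
    0.007225 * coapplicant_income -
    15.78 * as.numeric(not_graduate)
}

# Hypothetical applicant: income 5000, coapplicant income 1000.
predict_loan(5000, 1000, not_graduate = FALSE)  # graduate
predict_loan(5000, 1000, not_graduate = TRUE)   # not a graduate
```

The two calls differ by exactly the education coefficient, 15.78, which is the estimated gap in loan amount between graduates and non-graduates at equal incomes.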