Businesses today have moved from a reactive mode to a proactive one. In every function, such as HR, marketing, or operations, they try to predict future events and take measures to control them. Predictive analytics is the technique used to make these predictions: data is collected and fed into a mathematical model, which then predicts the outcome for any new data.
Here I demonstrate a case from the finance industry in which predictive analytics is used to predict an outcome.
Problem Statement: A finance company deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility. The company wants to automate the loan eligibility process in real time, based on the details customers provide in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.
Data set name: loan_data_set
Variable - Description
Loan_ID - Unique loan ID
Gender - Male/Female
Married - Applicant married (Y/N)
Dependents - Number of dependents
Education - Applicant education (Graduate/Not Graduate)
Self_Employed - Self-employed (Y/N)
ApplicantIncome - Applicant income
CoapplicantIncome - Coapplicant income
LoanAmount - Loan amount in thousands
Loan_Amount_Term - Term of loan in months
Credit_History - Credit history meets guidelines
Property_Area - Urban/Semi Urban/Rural
Loan_Status - Loan approved (Y/N)
A step-by-step procedure for predicting whether a loan application will be approved (Loan_Status: Y/N) follows.
#1.Loading the Data set:
setwd ("D:/Raviteja/Raviteja Professional/Data Science/my projects/competetions/loan_dataset")
loan <- read.csv("loan_data_set_train.csv", header = TRUE, sep = ",", na.strings = "", stringsAsFactors = TRUE) # factors are needed to reproduce the summaries below in R >= 4.0
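The `na.strings = ""` argument is what turns blank fields into `NA`, which the missing-value counts later rely on. A minimal sketch on a made-up three-row CSV (the rows are hypothetical and not taken from loan_data_set_train.csv):

```r
# Hypothetical three-row extract; the values are made up for illustration.
csv_text <- "Loan_ID,Gender,LoanAmount
X001,Male,120
X002,,
X003,Female,95"

# na.strings = "" makes empty fields NA in text columns as well;
# stringsAsFactors = TRUE gives factor columns, as older R versions did by default.
toy <- read.csv(text = csv_text, header = TRUE, sep = ",",
                na.strings = "", stringsAsFactors = TRUE)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Without `na.strings = ""`, the blank Gender field would be read as an empty string rather than as a missing value.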
#2. Required packages for statistical computation:
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
#3. Understanding the data set: univariate analysis, bivariate analysis, and finding missing values in the data set
summary(loan)
## Loan_ID       Gender      Married   Dependents  Education
## LP001002:  1  Female:112  No  :213  0   :345    Graduate    :480
## LP001003:  1  Male  :489  Yes :398  1   :102    Not Graduate:134
## LP001005:  1  NA's  : 13  NA's:  3  2   :101
## LP001006:  1                        3+  : 51
## LP001008:  1                        NA's: 15
## LP001011:  1
## (Other) :608
## Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount
## No  :500       Min.   :  150    Min.   :    0      Min.   :  9.0
## Yes : 82       1st Qu.: 2878    1st Qu.:    0      1st Qu.:100.0
## NA's: 32       Median : 3812    Median : 1188      Median :128.0
##                Mean   : 5403    Mean   : 1621      Mean   :146.4
##                3rd Qu.: 5795    3rd Qu.: 2297      3rd Qu.:168.0
##                Max.   :81000    Max.   :41667      Max.   :700.0
##                                                    NA's   :22
## Loan_Amount_Term  Credit_History   Property_Area  Loan_Status
## Min.   : 12       Min.   :0.0000   Rural    :179  N:192
## 1st Qu.:360       1st Qu.:1.0000   Semiurban:233  Y:422
## Median :360       Median :1.0000   Urban    :202
## Mean   :342       Mean   :0.8422
## 3rd Qu.:360       3rd Qu.:1.0000
## Max.   :480       Max.   :1.0000
## NA's   :14        NA's   :50
#Let us see the missing values
sapply(loan, function(x) sum(is.na(x)))
##           Loan_ID            Gender           Married        Dependents
##                 0                13                 3                15
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome
##                 0                32                 0                 0
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area
##                22                14                50                 0
##       Loan_Status
##                 0
#To clean this data/to fill the missing values we need to understand the variables:
table (loan$Credit_History)
##
## 0 1
## 89 475
#So the majority of the values are 1 (475 of 564). Let us fill the missing values with 1.
ggplot(aes(x= Loan_Amount_Term ), data= loan)+geom_histogram(binwidth=2)+scale_x_continuous(limits = c(0,500),breaks=seq(0,500,10))
## Warning: Removed 14 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
table(loan$Loan_Amount_Term)
##
## 12 36 60 84 120 180 240 300 360 480
## 1 2 2 4 3 44 4 13 512 15
#From the above histogram and table we can see that most people have taken a loan for 360 months (512 of the 600 recorded terms). Let us set the missing values to 360.
ggplot(aes(x=LoanAmount), data= loan)+geom_histogram(binwidth = 2)+scale_x_continuous(limits = c(0,700),breaks=seq(0,700,50))
## Warning: Removed 22 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
#The histogram of the loan amount shows a roughly bell-shaped distribution with a long right tail (outliers). Let us calculate the mean of the variable and use it for the missing values.
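Because the loan amount has a long right tail, the mean and the median can differ noticeably (146.4 versus 128.0 in the summary above). A small sketch with made-up amounts, comparing mean imputation (used in this walkthrough) with median imputation (an alternative that is not part of the original analysis):

```r
# Hypothetical right-skewed loan amounts (in thousands) with two missing entries
amount <- c(100, 110, 120, 128, 135, 150, 700, NA, NA)

mean_amount   <- mean(amount, na.rm = TRUE)    # pulled upward by the 700 outlier
median_amount <- median(amount, na.rm = TRUE)  # unaffected by the size of the outlier

# Mean imputation, as done in this walkthrough
amount_mean_filled <- ifelse(is.na(amount), mean_amount, amount)

# Median imputation, a common alternative for skewed variables
amount_median_filled <- ifelse(is.na(amount), median_amount, amount)

print(c(mean = mean_amount, median = median_amount))
```

Which of the two is preferable depends on whether the extreme values are genuine and relevant to the business question.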
ggplot(aes(x=ApplicantIncome), data= loan)+geom_histogram(binwidth = 30)+scale_x_continuous(limits = c(0,85000),breaks=seq(0,85000,2500))
## Warning: Removed 3 rows containing missing values (geom_bar).
#The histogram of applicant income likewise shows most values concentrated at the low end with a long right tail (outliers). ApplicantIncome has no missing values, so no imputation is needed for it.
# Let us understand the remaining categorical variables.
table(loan$Gender,loan$Married)
##
## No Yes
## Female 80 31
## Male 130 357
table(loan$Gender,loan$Dependents)
##
## 0 1 2 3+
## Female 80 19 7 3
## Male 258 82 92 45
table(loan$Married)
##
## No Yes
## 213 398
table(loan$Self_Employed)
##
## No Yes
## 500 82
table(loan$Education)
##
## Graduate Not Graduate
## 480 134
#In all of the above variables one class (for example Male, Yes) dominates. So, for the missing values we use the mode (the most frequent value of the variable).
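Mode imputation can be wrapped in a small helper so that the dominant class does not have to be typed by hand. A minimal sketch; `mode_value` and `impute_mode` are hypothetical helper names and the `gender` vector is made up:

```r
# Return the most frequent non-missing value of a vector, as a character string.
mode_value <- function(x) {
  counts <- table(x)                 # table() drops NA by default
  names(counts)[which.max(counts)]   # first value with the highest count
}

# Replace NA entries with the mode; works for factor and character vectors.
impute_mode <- function(x) {
  x[is.na(x)] <- mode_value(x)
  x
}

# Hypothetical toy column, not taken from the loan data
gender <- factor(c("Male", "Female", NA, "Male", NA, "Male"))
gender_filled <- impute_mode(gender)
print(table(gender_filled, useNA = "ifany"))
```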
#Trying to understand relationship between variables :
ggplot(aes(x=LoanAmount,y= ApplicantIncome), data= loan)+geom_point()+scale_x_continuous(limits = c(0,700),breaks=seq(0,700,50))+geom_smooth(method='lm',color= 'blue')
## Warning: Removed 22 rows containing non-finite values (stat_smooth).
## Warning: Removed 22 rows containing missing values (geom_point).
# From the above graph we can see that applicant income and loan amount are positively related: higher incomes tend to go with larger loans. We should note this pattern and use it when required. Depending on the project and business requirement, we should draw such scatter plots to understand the interdependencies between variables.
#Now, let us plot some boxplots to understand the outliers:
ggplot(aes(x=Gender,y=ApplicantIncome), data=subset(loan,!is.na(Gender)))+geom_boxplot()+coord_cartesian(ylim= c(0,81000))
#We can easily see the outliers lying outside the boxes.
#We have tried to understand the relations between the variables in the loan data; based on this understanding, let us replace the missing values so that we can build a prediction model.
#4. Replacing missing values with the mode or the mean, as applicable:
loan$Credit_History[which(is.na(loan$Credit_History))] <- 1
loan$Married[which(is.na(loan$Married))] <- "Yes"
loan$Gender[which(is.na(loan$Gender))] <- "Male"
loan$Self_Employed[which(is.na(loan$Self_Employed))] <- "No"
loan$Loan_Amount_Term[which(is.na(loan$Loan_Amount_Term))] <- 360
loan$LoanAmount[which(is.na(loan$LoanAmount))] <- mean(loan$LoanAmount, na.rm=T)
loan$Dependents[which(is.na(loan$Dependents))] <- "0"  # Dependents is categorical (0, 1, 2, 3+), so assign the level as a string
#6. Now there are no missing values in the data set:
summary(loan)
## Loan_ID       Gender      Married  Dependents  Education
## LP001002:  1  Female:112  No :213  0 :360      Graduate    :480
## LP001003:  1  Male  :502  Yes:401  1 :102      Not Graduate:134
## LP001005:  1                       2 :101
## LP001006:  1                       3+: 51
## LP001008:  1
## LP001011:  1
## (Other) :608
## Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount
## No :532        Min.   :  150    Min.   :    0      Min.   :  9.0
## Yes: 82        1st Qu.: 2878    1st Qu.:    0      1st Qu.:100.2
##                Median : 3812    Median : 1188      Median :129.0
##                Mean   : 5403    Mean   : 1621      Mean   :146.4
##                3rd Qu.: 5795    3rd Qu.: 2297      3rd Qu.:164.8
##                Max.   :81000    Max.   :41667      Max.   :700.0
##
## Loan_Amount_Term  Credit_History  Property_Area  Loan_Status
## Min.   : 12.0     Min.   :0.000   Rural    :179  N:192
## 1st Qu.:360.0     1st Qu.:1.000   Semiurban:233  Y:422
## Median :360.0     Median :1.000   Urban    :202
## Mean   :342.4     Mean   :0.855
## 3rd Qu.:360.0     3rd Qu.:1.000
## Max.   :480.0     Max.   :1.000
##
#7. Deleting the Loan_ID column from the data set, as it is not useful for building the logistic regression model.
#The remaining columns are stored in a new data frame named loan1.
loan1= loan[,2:13]
names(loan1)
## [1] "Gender" "Married" "Dependents"
## [4] "Education" "Self_Employed" "ApplicantIncome"
## [7] "CoapplicantIncome" "LoanAmount" "Loan_Amount_Term"
## [10] "Credit_History" "Property_Area" "Loan_Status"
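Selecting columns by position (2:13) depends on the column order of the file. A sketch of dropping the identifier by name instead, on a made-up two-row frame (the `toy` data is hypothetical):

```r
# Hypothetical two-row frame standing in for the loan data
toy <- data.frame(Loan_ID = c("X001", "X002"),
                  Gender = c("Male", "Female"),
                  Loan_Status = c("Y", "N"))

# Keep every column except the identifier, selected by name
toy1 <- toy[, setdiff(names(toy), "Loan_ID"), drop = FALSE]
print(names(toy1))
```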
#8. Building a Logistic Regression model:
model<- glm(formula = Loan_Status~ ., family = binomial(link=logit), data = loan1)
summary(model)
##
## Call:
## glm(formula = Loan_Status ~ ., family = binomial(link = logit),
## data = loan1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3102 -0.3544 0.5361 0.7209 2.5141
##
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)            -2.446e+00  8.485e-01  -2.883 0.003944 **
## GenderMale             -2.975e-02  2.990e-01  -0.100 0.920725
## MarriedYes              5.839e-01  2.527e-01   2.310 0.020866 *
## Dependents1            -4.706e-01  2.948e-01  -1.596 0.110480
## Dependents2             2.909e-01  3.428e-01   0.849 0.396073
## Dependents3+            2.374e-02  4.261e-01   0.056 0.955568
## EducationNot Graduate  -4.084e-01  2.596e-01  -1.573 0.115701
## Self_EmployedYes       -2.413e-02  3.171e-01  -0.076 0.939323
## ApplicantIncome         1.175e-05  2.441e-05   0.482 0.630138
## CoapplicantIncome      -5.282e-05  3.520e-05  -1.501 0.133450
## LoanAmount             -1.923e-03  1.600e-03  -1.202 0.229392
## Loan_Amount_Term       -1.247e-03  1.827e-03  -0.683 0.494687
## Credit_History          3.938e+00  4.212e-01   9.350  < 2e-16 ***
## Property_AreaSemiurban  9.067e-01  2.697e-01   3.362 0.000773 ***
## Property_AreaUrban      2.212e-01  2.597e-01   0.852 0.394434
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 762.89 on 613 degrees of freedom
## Residual deviance: 557.22 on 599 degrees of freedom
## AIC: 587.22
##
## Number of Fisher Scoring iterations: 5
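The estimates in the coefficient table are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read. A short sketch using two estimates copied from the output above (the variable names in the snippet are illustrative):

```r
# Estimates copied from the summary(model) output above (log-odds scale)
coef_credit_history <- 3.938
coef_semiurban      <- 0.9067

# Exponentiating a logistic regression coefficient gives an odds ratio
or_credit_history <- exp(coef_credit_history)
or_semiurban      <- exp(coef_semiurban)

print(round(c(Credit_History = or_credit_history,
              Property_AreaSemiurban = or_semiurban), 2))
```

Holding the other variables fixed, a credit history that meets the guidelines multiplies the odds of approval by roughly 51, and a semiurban property by roughly 2.5 relative to a rural one. With the fitted object available, `exp(coef(model))` gives the odds ratios for all terms at once.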
#9. Testing the accuracy of the model against the given values
#Method-A
# "Given values" means the Loan_Status variable in the data set.
fitted.results <- predict(model,newdata=loan1,type='response')  # the argument is 'newdata'; 'data' is silently ignored
fitted.results <- ifelse(fitted.results > 0.5,'Y','N')
misClasificError <- mean(fitted.results != loan1$Loan_Status)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.812703583061889"
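#An accuracy of around 81.3% looks good, but note that it is measured on the same data the model was trained on, and that always predicting 'Y' would already give 422/614, about 68.7%.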
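Accuracy is easier to judge next to a confusion matrix and a majority-class baseline. A minimal sketch with made-up labels for ten applications (the vectors are hypothetical, not model output):

```r
# Hypothetical actual and predicted labels for ten applications
actual    <- factor(c("Y", "Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y"),
                    levels = c("N", "Y"))
predicted <- factor(c("Y", "Y", "Y", "Y", "Y", "N", "Y", "N", "N", "Y"),
                    levels = c("N", "Y"))

# Confusion matrix: rows are predictions, columns are actual outcomes
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Baseline: always predict the majority class
baseline <- max(table(actual)) / length(actual)

print(c(accuracy = accuracy, baseline = baseline))
```

The off-diagonal cells separate the two kinds of error, which matter differently to a lender: approving a loan that should be rejected versus rejecting one that should be approved.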
# Method-B: the widely used ROC curve
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
#
prob <- predict(model, newdata=loan1, type="response")
pred <- prediction(prob, loan1$Loan_Status)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
print(auc)
## [1] 0.7943572
# The AUC for our model is about 0.794. An Area Under the Curve (AUC) of around 0.8 indicates that the model separates approved and rejected applications well. An AUC of 0.5 means the model is no better than random guessing.
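The AUC can also be read as the probability that a randomly chosen approved application gets a higher score than a randomly chosen rejected one. A base-R sketch of that rank-based calculation, independent of ROCR; `auc_rank` is a hypothetical helper and the scores are made up:

```r
# AUC via the rank (Mann-Whitney) formulation: the probability that a randomly
# chosen positive case scores higher than a randomly chosen negative case.
auc_rank <- function(scores, labels, positive = "Y") {
  is_pos <- labels == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r      <- rank(scores)   # average ranks, so ties count as half
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores and labels
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c("Y", "Y", "N", "Y", "Y",  "N", "Y", "N")

auc_toy <- auc_rank(scores, labels)
print(auc_toy)
```

In this toy example 11 of the 15 positive-negative pairs are ordered correctly, so the AUC is 11/15.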
#10.Implications from the above logistic regression model :
anova(model,test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Loan_Status
##
## Terms added sequentially (first to last)
##
##
##                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                                613     762.89
## Gender             1    0.197       612     762.69  0.657053
## Married            1    5.038       611     757.66  0.024798 *
## Dependents         3    3.198       608     754.46  0.362154
## Education          1    4.393       607     750.07  0.036096 *
## Self_Employed      1    0.001       606     750.06  0.976602
## ApplicantIncome    1    0.099       605     749.97  0.752714
## CoapplicantIncome  1    3.241       604     746.72  0.071800 .
## LoanAmount         1    1.078       603     745.65  0.299217
## Loan_Amount_Term   1    0.517       602     745.13  0.472136
## Credit_History     1  175.105       601     570.02 < 2.2e-16 ***
## Property_Area      2   12.801       599     557.22  0.001661 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#In the analysis of deviance table above, the p-values for Self_Employed, ApplicantIncome, and Gender are high. This means these variables add little to the prediction, and the result won't change much even if we leave them out. Note that the terms are added sequentially, so each p-value depends on the order of the variables.
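One way to check that a weak predictor can be left out is to refit without it and compare the two models. A sketch on simulated data; the column names mirror the loan data, but the values, the seed, and the simulated effect sizes are all assumptions made for illustration:

```r
set.seed(7)  # simulated data, not the real loan data
n <- 300
d <- data.frame(
  Credit_History = rbinom(n, 1, 0.85),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE,
                         prob = c(0.2, 0.8)))
)
# Approval depends on credit history only; Gender has no effect by construction
p <- plogis(-2 + 3 * d$Credit_History)
d$Loan_Status <- factor(ifelse(runif(n) < p, "Y", "N"), levels = c("N", "Y"))

full    <- glm(Loan_Status ~ Credit_History + Gender, family = binomial, data = d)
reduced <- update(full, . ~ . - Gender)   # same model without Gender

# Likelihood-ratio test between the nested models, and their AICs
cmp  <- anova(reduced, full, test = "Chisq")
aics <- AIC(reduced, full)
print(cmp)
print(aics)
```

A large p-value in the comparison and a similar or lower AIC for the reduced model both suggest the dropped variable was contributing little.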
# If the accuracy of the model is not good enough, or if the client demands more, we should revisit the univariate and bivariate analysis done in step 3 and check whether there is a strong dependency between variables. If so, we can use that relationship and fit a regression to predict the missing values.
#We can also use imputation techniques to predict the missing values.
#As we observed, there are outliers in the data. We can't blindly delete them, since they may have real business implications. We need to check the reason behind the outliers and then take action accordingly.
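The regression-based imputation mentioned above can be sketched as follows. The data is simulated, and the linear relation between income and loan amount is an assumption made for illustration rather than a result from the loan data:

```r
set.seed(1)  # simulated data for illustration only
n <- 100
income <- round(runif(n, 2000, 10000))
amount <- 40 + 0.015 * income + rnorm(n, sd = 15)
toy <- data.frame(ApplicantIncome = income, LoanAmount = amount)

# Knock out ten loan amounts to mimic missing values
missing_rows <- sample(n, 10)
toy$LoanAmount[missing_rows] <- NA

# Fit on the complete rows (lm drops rows with NA by default) ...
fit <- lm(LoanAmount ~ ApplicantIncome, data = toy)

# ... then fill the gaps with the predicted loan amounts
toy$LoanAmount[missing_rows] <- predict(fit, newdata = toy[missing_rows, ])

print(sum(is.na(toy$LoanAmount)))
```

Compared with filling in a single mean, this keeps the imputed amounts consistent with each applicant's income.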
#10. Prediction: the procedure below is used for prediction.
#Any data set with the same columns (such as "loan_test" mentioned below) can be fed into the model to predict the outcome.
#result <- predict(model, newdata = loan_test, type = "response")
#loan_test is the data set with all the details except the loan status, which has to be predicted.
#result1 <- ifelse(result > 0.5,'Y','N')
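Since the prediction step is only shown as commented-out code, here is a runnable sketch of the same pattern on simulated data. The model `toy_model`, the two-column `loan_test` frame, and the simulated coefficients are all hypothetical stand-ins for the real model and test file:

```r
set.seed(42)  # simulated data, not the real loan data
n <- 200
train <- data.frame(
  Credit_History = rbinom(n, 1, 0.85),
  LoanAmount     = round(runif(n, 50, 300))
)
# Simulate approvals that depend mainly on credit history
p <- plogis(-2 + 3 * train$Credit_History - 0.002 * train$LoanAmount)
train$Loan_Status <- factor(ifelse(runif(n) < p, "Y", "N"), levels = c("N", "Y"))

toy_model <- glm(Loan_Status ~ ., family = binomial(link = "logit"), data = train)

# New applications to score: same columns as the training data, minus Loan_Status
loan_test <- data.frame(Credit_History = c(1, 0), LoanAmount = c(120, 120))

# Predicted approval probabilities, then labels at the 0.5 threshold
result  <- predict(toy_model, newdata = loan_test, type = "response")
result1 <- ifelse(result > 0.5, "Y", "N")
print(data.frame(loan_test, probability = result, prediction = result1))
```

For factor predictors, the new data must use the same levels as the training data, otherwise `predict` stops with an error about new levels.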
#11. The prediction can also be done using other machine learning algorithms such as decision trees (CART models), random forests, XGBoost, etc. For this data set, I tried the others as well and found logistic regression to be the most accurate.
#With this kind of analytics work, we can save our clients a lot of time and money by helping them predict loan defaulters.