library(mlr)
library(caTools)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
set.seed(121)
# Treat blank fields and the literal string "NA" as missing values
loandata = read.csv("train.csv", na.strings = c(""," ","NA"))
summarizeColumns(loandata)
## name type na mean disp median mad
## 1 Loan_ID factor 0 NA 0.9983713 NA NA
## 2 Gender factor 13 NA NA NA NA
## 3 Married factor 3 NA NA NA NA
## 4 Dependents factor 15 NA NA NA NA
## 5 Education factor 0 NA 0.2182410 NA NA
## 6 Self_Employed factor 32 NA NA NA NA
## 7 ApplicantIncome integer 0 5403.4592834 6109.0416734 3812.5 1822.8567
## 8 CoapplicantIncome numeric 0 1621.2457980 2926.2483692 1188.5 1762.0701
## 9 LoanAmount integer 22 146.4121622 85.5873252 128.0 47.4432
## 10 Loan_Amount_Term integer 14 342.0000000 65.1204099 360.0 0.0000
## 11 Credit_History integer 50 0.8421986 0.3648783 1.0 0.0000
## 12 Property_Area factor 0 NA 0.6205212 NA NA
## 13 Loan_Status factor 0 NA 0.3127036 NA NA
## min max nlevs
## 1 1 1 614
## 2 112 489 2
## 3 213 398 2
## 4 51 345 4
## 5 134 480 2
## 6 82 500 2
## 7 150 81000 0
## 8 0 41667 0
## 9 9 700 0
## 10 12 480 0
## 11 0 1 0
## 12 179 233 3
## 13 192 422 2
hist(loandata$ApplicantIncome,breaks = 200)
hist(loandata$CoapplicantIncome,breaks = 200)
hist(loandata$LoanAmount,breaks = 200)
Looking at the plots above, we can see that ApplicantIncome and CoapplicantIncome are highly right-skewed, so we will need to normalize them.
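A quick numeric confirmation of the skew (an illustrative check, not part of the original analysis) is to compare each mean with its median; for a right-skewed variable the mean sits well above the median.
# Mean far above the median indicates strong right skew
c(mean = mean(loandata$ApplicantIncome), median = median(loandata$ApplicantIncome))
c(mean = mean(loandata$CoapplicantIncome), median = median(loandata$CoapplicantIncome))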
boxplot(loandata$ApplicantIncome)
boxplot(loandata$CoapplicantIncome)
boxplot(loandata$LoanAmount)
All these variables have outliers, which will need to be dealt with separately.
We also need to convert Credit_History to a factor and recode the “3+” level of the Dependents variable to “3”.
loandata$Credit_History = as.factor(loandata$Credit_History)
# Recode the "3+" level by name rather than by position, which is more robust
levels(loandata$Dependents)[levels(loandata$Dependents) == "3+"] = "3"
There are missing values in our dataset, so we'll have to deal with them first. We will replace missing numeric values with the mean of the variable (mean imputation) and missing factor values with the mode of the variable (mode imputation). For the imputation we will use the “impute” function from the “mlr” package.
imp = impute(loandata,classes = list(factor = imputeMode(),integer = imputeMean()))
completedata = imp$data
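As a quick sanity check (added here for illustration), we can confirm that the imputation left no missing values:
# Should return 0 now that every NA has been imputed
sum(is.na(completedata))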
To treat the outliers we will use the “capLargeValues” function from the “mlr” package, which replaces values above a chosen threshold with the threshold itself rather than dropping the rows.
cd = capLargeValues(completedata,target = "Loan_Status",cols = c("ApplicantIncome"),threshold = 40000)
cd = capLargeValues(cd,target = "Loan_Status",cols = c("CoapplicantIncome"),threshold = 21000)
cd = capLargeValues(cd,target = "Loan_Status",cols =c("LoanAmount"),threshold = 520)
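We can verify that the caps took effect by checking the new maxima (an extra check for illustration):
# Each maximum should now be at or below its threshold
sapply(cd[c("ApplicantIncome","CoapplicantIncome","LoanAmount")], max)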
cappedData = cd
# Feature engineering: total household income and the income-to-loan ratio
cappedData$TotalIncome = cappedData$ApplicantIncome + cappedData$CoapplicantIncome
cappedData$IncomeLoan = cappedData$TotalIncome/cappedData$LoanAmount
To normalize our dataset we will use the “preProcess” function from the “caret” package; by default it centers and scales every numeric variable.
preproc = preProcess(cappedData)
dataNorm = predict(preproc,cappedData)
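To see the effect of the transformation (a check added for illustration), the normalized columns should now have mean roughly 0 and standard deviation 1:
# Centered and scaled: mean ~ 0, sd ~ 1
round(c(mean = mean(dataNorm$ApplicantIncome), sd = sd(dataNorm$ApplicantIncome)), 3)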
Highly correlated variables carry redundant information and do not improve the accuracy of the model, so one of any two highly correlated variables can safely be dropped. To check the correlations among the numeric variables:
az = split(names(dataNorm), sapply(dataNorm, class))
xs = dataNorm[az$numeric]
cor(xs)
## ApplicantIncome CoapplicantIncome LoanAmount
## ApplicantIncome 1.00000000 -0.14117527 0.58388949
## CoapplicantIncome -0.14117527 1.00000000 0.22253459
## LoanAmount 0.58388949 0.22253459 1.00000000
## Loan_Amount_Term -0.03745779 -0.04086784 0.04108567
## TotalIncome 0.89527867 0.31465345 0.65998236
## IncomeLoan 0.41950702 0.18621087 -0.17732606
## Loan_Amount_Term TotalIncome IncomeLoan
## ApplicantIncome -0.03745779 0.89527867 0.4195070
## CoapplicantIncome -0.04086784 0.31465345 0.1862109
## LoanAmount 0.04108567 0.65998236 -0.1773261
## Loan_Amount_Term 1.00000000 -0.05430597 -0.1033059
## TotalIncome -0.05430597 1.00000000 0.4860247
## IncomeLoan -0.10330594 0.48602473 1.0000000
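As an optional alternative (assuming a cutoff of 0.8, our own choice), caret's findCorrelation function automates this step by suggesting which columns to drop:
# Names of columns whose pairwise correlation exceeds the cutoff
findCorrelation(cor(xs), cutoff = 0.8, names = TRUE)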
Since ApplicantIncome and TotalIncome are highly correlated (r ≈ 0.90), we will drop the TotalIncome variable. We also drop Loan_ID, since a unique identifier has no predictive value.
dataNorm$TotalIncome = NULL
dataNorm$Loan_ID = NULL
We will use the caTools package to split the data into training and testing sets: 70% of the original dataset goes to training and 30% to testing.
set.seed(121)
split = sample.split(dataNorm$Loan_Status,SplitRatio = 0.70)
train = subset(dataNorm, split == TRUE)
test = subset(dataNorm, split == FALSE)
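Since sample.split stratifies on the outcome, the Y/N proportions should be nearly identical in both sets; a quick verification (added for illustration):
# Class proportions of Loan_Status in train vs. test
prop.table(table(train$Loan_Status))
prop.table(table(test$Loan_Status))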
# Model 1: logistic regression on all remaining predictors
logmodel = glm(Loan_Status ~ ., data = train, family = "binomial")
logpreds = predict(logmodel, newdata = test, type = "response")
To create the confusion matrix, we classify an application as “Y” when its predicted probability exceeds a threshold of 0.55:
table(test$Loan_Status,logpreds>0.55)
##
## FALSE TRUE
## N 19 39
## Y 2 125
We get an accuracy of:
((19+125)/nrow(test))*100
## [1] 77.83784
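Since caTools is already loaded, we can also compute a threshold-independent measure, the area under the ROC curve (an extra check, not part of the original analysis):
# AUC of the logistic model's predicted probabilities
colAUC(logpreds, test$Loan_Status)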
set.seed(121)
# Model 2: CART decision tree
tree = rpart(Loan_Status ~ ., data = train, method = "class")
prp(tree)
treepreds = predict(tree,newdata = test , type = "class")
Confusion Matrix
table(test$Loan_Status,treepreds)
## treepreds
## N Y
## N 27 31
## Y 12 115
The accuracy of the tree model is:
((27+115)/nrow(test))*100
## [1] 76.75676
set.seed(121)
# Model 3: random forest, tracking variable importance
RF = randomForest(Loan_Status ~ ., data = train, importance = TRUE)
varImpPlot(RF)
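Beyond the plot, the numeric importance scores can be inspected directly (shown for illustration):
# Mean decrease in accuracy and in Gini impurity per predictor
importance(RF)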
RFpreds = predict(RF,newdata = test,type = "class")
Confusion Matrix
table(test$Loan_Status,RFpreds)
## RFpreds
## N Y
## N 26 32
## Y 13 114
The accuracy of the random forest model is:
((26+114)/nrow(test))*100
## [1] 75.67568
Out of the three models, logistic regression gives us the best accuracy (77.8%, versus 76.8% for the decision tree and 75.7% for the random forest).
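Since accuracy was computed the same way for each model, a small helper function (a convenience added here, not part of the original code) makes the comparison explicit:
# Illustrative helper: overall accuracy as a percentage
accuracy = function(actual, predicted) mean(actual == predicted) * 100
accuracy(test$Loan_Status, ifelse(logpreds > 0.55, "Y", "N"))
accuracy(test$Loan_Status, treepreds)
accuracy(test$Loan_Status, RFpreds)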