We will identify eligible customers in the dataset for loan processing and build a model so that, when similar information arrives in the future, we can estimate the risk, or probability, of a customer turning out good or bad.
We load the data first.
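The actual load step is not shown here; a minimal sketch, assuming the data sits in a CSV file called bankloan.csv in the working directory (both the file name and the format are assumptions, not taken from the original):
# assumed load step - adjust the file name/path to the real data source
bankdata <- read.csv("bankloan.csv")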
head(bankdata)
## age ed employ address income debtinc creddebt othdebt default
## 1 41 3 17 12 176 9.3 11.359392 5.008608 1
## 2 27 1 10 6 31 17.3 1.362202 4.000798 0
## 3 40 1 15 14 55 5.5 0.856075 2.168925 0
## 4 41 1 15 14 120 2.9 2.658720 0.821280 0
## 5 24 2 2 0 28 17.3 1.787436 3.056564 1
## 6 41 2 5 5 25 10.2 0.392700 2.157300 0
We check the data structure, look for missing values, and compute descriptive statistics.
dim(bankdata)
## [1] 850 9
str(bankdata)
## 'data.frame': 850 obs. of 9 variables:
## $ age : int 41 27 40 41 24 41 39 43 24 36 ...
## $ ed : int 3 1 1 1 2 2 1 1 1 1 ...
## $ employ : int 17 10 15 15 2 5 20 12 3 0 ...
## $ address : int 12 6 14 14 0 5 9 11 4 13 ...
## $ income : int 176 31 55 120 28 25 67 38 19 25 ...
## $ debtinc : num 9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
## $ creddebt: num 11.359 1.362 0.856 2.659 1.787 ...
## $ othdebt : num 5.009 4.001 2.169 0.821 3.057 ...
## $ default : int 1 0 0 0 1 0 0 0 1 0 ...
summary(bankdata)
## age ed employ address
## Min. :20.00 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:29.00 1st Qu.:1.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median :34.00 Median :1.000 Median : 7.000 Median : 7.000
## Mean :35.03 Mean :1.711 Mean : 8.566 Mean : 8.372
## 3rd Qu.:41.00 3rd Qu.:2.000 3rd Qu.:13.000 3rd Qu.:12.000
## Max. :56.00 Max. :5.000 Max. :33.000 Max. :34.000
##
## income debtinc creddebt othdebt
## Min. : 13.00 Min. : 0.10 Min. : 0.0117 Min. : 0.04558
## 1st Qu.: 24.00 1st Qu.: 5.10 1st Qu.: 0.3822 1st Qu.: 1.04594
## Median : 35.00 Median : 8.70 Median : 0.8851 Median : 2.00324
## Mean : 46.68 Mean :10.17 Mean : 1.5768 Mean : 3.07879
## 3rd Qu.: 55.75 3rd Qu.:13.80 3rd Qu.: 1.8984 3rd Qu.: 3.90300
## Max. :446.00 Max. :41.30 Max. :20.5613 Max. :35.19750
##
## default
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2614
## 3rd Qu.:1.0000
## Max. :1.0000
## NA's :150
colSums(is.na(bankdata))
## age ed employ address income debtinc creddebt othdebt
## 0 0 0 0 0 0 0 0
## default
## 150
Education is a categorical variable, so we convert it to a factor. Everything looks fine except that we have 150 missing values in our dependent variable, "default". Since it is the response variable, we remove all observations with missing values.
bankdata$ed <- as.factor(bankdata$ed)
bd <- na.omit(bankdata)
colSums(is.na(bd))
## age ed employ address income debtinc creddebt othdebt
## 0 0 0 0 0 0 0 0
## default
## 0
str(bd)
## 'data.frame': 700 obs. of 9 variables:
## $ age : int 41 27 40 41 24 41 39 43 24 36 ...
## $ ed : Factor w/ 5 levels "1","2","3","4",..: 3 1 1 1 2 2 1 1 1 1 ...
## $ employ : int 17 10 15 15 2 5 20 12 3 0 ...
## $ address : int 12 6 14 14 0 5 9 11 4 13 ...
## $ income : int 176 31 55 120 28 25 67 38 19 25 ...
## $ debtinc : num 9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
## $ creddebt: num 11.359 1.362 0.856 2.659 1.787 ...
## $ othdebt : num 5.009 4.001 2.169 0.821 3.057 ...
## $ default : int 1 0 0 0 1 0 0 0 1 0 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:150] 701 702 703 704 705 706 707 708 709 710 ...
## .. ..- attr(*, "names")= chr [1:150] "701" "702" "703" "704" ...
Our data is now ready for analysis. We divide it into two parts, a train and a test dataset: we build the model on the train dataset and validate it on the test dataset.
set.seed(100)
sub <- sample(nrow(bd), floor(nrow(bd) * 0.7))
train <- bd[sub, ]
test <- bd[-sub, ]
head(train)
## age ed employ address income debtinc creddebt othdebt default
## 216 34 3 12 8 47 6.7 1.300537 1.848463 0
## 181 39 1 12 10 46 5.0 0.561200 1.738800 0
## 386 48 1 3 11 27 21.3 1.403244 4.347756 0
## 40 45 1 23 5 50 4.2 0.558600 1.541400 0
## 327 40 1 12 6 89 17.4 2.771994 12.714006 0
## 337 36 2 7 16 43 15.4 2.648800 3.973200 0
head(test)
## age ed employ address income debtinc creddebt othdebt default
## 2 27 1 10 6 31 17.3 1.362202 4.000798 0
## 3 40 1 15 14 55 5.5 0.856075 2.168925 0
## 4 41 1 15 14 120 2.9 2.658720 0.821280 0
## 5 24 2 2 0 28 17.3 1.787436 3.056564 1
## 6 41 2 5 5 25 10.2 0.392700 2.157300 0
## 9 24 1 3 4 19 24.4 1.358348 3.277652 1
We build the logistic regression model on the train dataset.
train.glm <- glm(default ~ ., family = "binomial", data = train)
summary(train.glm)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4442 -0.6298 -0.3068 0.3431 2.9037
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.918492 0.729374 -2.630 0.008530 **
## age 0.034331 0.021135 1.624 0.104292
## ed2 0.206628 0.309971 0.667 0.505025
## ed3 0.662556 0.394590 1.679 0.093133 .
## ed4 0.446669 0.543982 0.821 0.411583
## ed5 2.070851 1.709775 1.211 0.225825
## employ -0.225464 0.036908 -6.109 1.00e-09 ***
## address -0.092204 0.026527 -3.476 0.000509 ***
## income -0.006907 0.008664 -0.797 0.425362
## debtinc 0.101353 0.036178 2.801 0.005087 **
## creddebt 0.604917 0.125237 4.830 1.36e-06 ***
## othdebt -0.040814 0.091468 -0.446 0.655448
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 564.30 on 488 degrees of freedom
## Residual deviance: 388.57 on 477 degrees of freedom
## AIC: 412.57
##
## Number of Fisher Scoring iterations: 6
We also fit a stepwise logistic model so that the variables which are not significant are removed.
train.step <- step(train.glm, direction = "backward", trace = 0)
summary(train.step)
##
## Call:
## glm(formula = default ~ age + employ + address + debtinc + creddebt,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4009 -0.6579 -0.3121 0.3605 2.8091
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.79188 0.61176 -2.929 0.003400 **
## age 0.03301 0.02042 1.617 0.105850
## employ -0.24892 0.03461 -7.192 6.37e-13 ***
## address -0.09207 0.02607 -3.532 0.000412 ***
## debtinc 0.09436 0.02188 4.312 1.62e-05 ***
## creddebt 0.56918 0.10086 5.643 1.67e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 564.30 on 488 degrees of freedom
## Residual deviance: 392.88 on 483 degrees of freedom
## AIC: 404.88
##
## Number of Fisher Scoring iterations: 6
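To interpret the retained coefficients on the odds scale, the estimates can be exponentiated into odds ratios; this is a quick illustrative sketch, not part of the original workflow:
# odds ratios and profile-likelihood confidence intervals for the stepwise model
exp(cbind(OddsRatio = coef(train.step), confint(train.step)))
For example, each additional year with the current employer multiplies the odds of default by roughly exp(-0.249), about 0.78.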
We now validate our model on both the train and test datasets.
Classification table for the train dataset
train.prob <- predict(train.step, type = "response")
train.pred <- ifelse(train.prob > .5, "1", "0")
table(train.pred, train$default)
##
## train.pred 0 1
## 0 333 66
## 1 27 63
mean(train.pred == train$default)
## [1] 0.809816
Classification table for the test dataset
test.prob <- predict(train.step, newdata = test, type = "response")
test.pred <- ifelse(test.prob > .5, "1", "0")
table(test.pred, test$default)
##
## test.pred 0 1
## 0 144 24
## 1 13 30
mean(test.pred == test$default)
## [1] 0.8246445
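Overall accuracy hides the asymmetry between the two classes. Sensitivity and specificity at the 0.5 cutoff can be read off the same confusion table; a small sketch, not in the original analysis:
# class-wise performance on the test data at the 0.5 cutoff
cm <- table(test.pred, test$default)
sens <- cm["1", "1"] / sum(cm[, "1"])   # defaulters correctly flagged: 30/54, about 0.56
spec <- cm["0", "0"] / sum(cm[, "0"])   # non-defaulters correctly passed: 144/157, about 0.92
c(sensitivity = sens, specificity = spec)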
ROC curve for the train dataset. For this we require the ROCR package.
require(ROCR)
train.roc <- prediction(train.prob, train$default)
plot(performance(train.roc, "tpr", "fpr"), col = "red", main = "ROC Curve for train data")
abline(0, 1, lty = 8, col = "blue")
AUC for the train dataset
train.auc <- performance(train.roc, "auc")
slot(train.auc, "y.values")
## [[1]]
## [1] 0.8537898
ROC curve for the test dataset.
test.roc <- prediction(test.prob, test$default)
plot(performance(test.roc, "tpr", "fpr"), col = "red", main = "ROC Curve for test data")
abline(0, 1, lty = 8, col = "blue")
AUC for the test dataset
test.auc <- performance(test.roc, "auc")
slot(test.auc, "y.values")
## [[1]]
## [1] 0.8634112
Kolmogorov-Smirnov statistic for the train dataset
ks.train <- performance(train.roc, "tpr", "fpr")
train.ks <- max(attr(ks.train, "y.values")[[1]] - (attr(ks.train, "x.values")[[1]]))
train.ks
## [1] 0.5549096
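The probability cutoff at which this maximum separation occurs can be recovered from the same ROCR performance object; a short sketch (the cutoff itself is not reported in the original):
# cutoff that maximises TPR - FPR on the train data
cutoffs <- attr(ks.train, "alpha.values")[[1]]
cutoffs[which.max(attr(ks.train, "y.values")[[1]] - attr(ks.train, "x.values")[[1]])]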
Kolmogorov-Smirnov statistic for the test dataset
ks.test <- performance(test.roc, "tpr", "fpr")
test.ks <- max(attr(ks.test, "y.values")[[1]] - (attr(ks.test, "x.values")[[1]]))
test.ks
## [1] 0.6046237
Hosmer-Lemeshow test on the train dataset. For this we require the ResourceSelection package.
require(ResourceSelection)
hl.train <- hoslem.test(train$default, fitted(train.step), g = 10)
hl.train
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: train$default, fitted(train.step)
## X-squared = 5.0527, df = 8, p-value = 0.7519
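The same calibration check can be applied to the held-out data using the test-set probabilities computed earlier; a sketch that was not run in the original:
hl.test <- hoslem.test(test$default, test.prob, g = 10)
hl.test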
Percentage of concordance for the train dataset. For this we need to define and run the following function.
OptimisedConc <- function(model)
{
  # pair every observed default (y = 1) with every non-default (y = 0) and
  # count how often the default receives the higher fitted probability
  Data  <- cbind(model$y, model$fitted.values)
  ones  <- Data[Data[, 1] == 1, ]
  zeros <- Data[Data[, 1] == 0, ]
  conc  <- matrix(0, dim(zeros)[1], dim(ones)[1])
  disc  <- matrix(0, dim(zeros)[1], dim(ones)[1])
  ties  <- matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1])
  {
    for (i in 1:dim(ones)[1])
    {
      if (ones[i, 2] > zeros[j, 2]) {
        conc[j, i] <- 1        # concordant pair
      } else if (ones[i, 2] < zeros[j, 2]) {
        disc[j, i] <- 1        # discordant pair
      } else {
        ties[j, i] <- 1        # tied pair
      }
    }
  }
  Pairs <- dim(zeros)[1] * dim(ones)[1]
  PercentConcordance <- (sum(conc) / Pairs) * 100
  PercentDiscordance <- (sum(disc) / Pairs) * 100
  PercentTied        <- (sum(ties) / Pairs) * 100
  return(list("Percent Concordance" = PercentConcordance,
              "Percent Discordance" = PercentDiscordance,
              "Percent Tied"        = PercentTied,
              "Pairs"               = Pairs))
}
OptimisedConc(train.step)
## $`Percent Concordance`
## [1] 85.37898
##
## $`Percent Discordance`
## [1] 14.62102
##
## $`Percent Tied`
## [1] 0
##
## $Pairs
## [1] 46440
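Note that the percent concordance (85.38) coincides with the train AUC of 0.8538 reported earlier. This is expected: with no tied pairs, the AUC equals the proportion of default/non-default pairs in which the default receives the higher fitted probability, so the same figure can be obtained in one line from the object already created:
# concordance (in percent) implied by the train AUC
slot(train.auc, "y.values")[[1]] * 100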
So we have:
Percentage of correct classification for the train data - 81%
Percentage of correct classification for the test data - 82%
Area under the ROC curve for the train data - 0.85
Area under the ROC curve for the test data - 0.86
Kolmogorov-Smirnov statistic for the train data - 0.55
Kolmogorov-Smirnov statistic for the test data - 0.60
Hosmer-Lemeshow test for the train data - p-value = 0.75, which is > 0.05
Percentage of concordance for the train data - 85.38
All of these checks except the Kolmogorov-Smirnov statistic validate the model as a good one.
Now we attach the predicted probabilities to the test dataset to examine the model's performance at the level of individual observations.
test.final <- data.frame(test, test.prob)
test.final$test.prob <- round(test.final$test.prob, 2)
head(test.final)
## age ed employ address income debtinc creddebt othdebt default test.prob
## 2 27 1 10 6 31 17.3 1.362202 4.000798 0 0.18
## 3 40 1 15 14 55 5.5 0.856075 2.168925 0 0.01
## 4 41 1 15 14 120 2.9 2.658720 0.821280 0 0.02
## 5 24 2 2 0 28 17.3 1.787436 3.056564 1 0.76
## 6 41 2 5 5 25 10.2 0.392700 2.157300 0 0.28
## 9 24 1 3 4 19 24.4 1.358348 3.277652 1 0.72
We divide the test dataset into deciles and do some data manipulation to get the desired table. For this we require the dplyr and ggplot2 packages.
require(dplyr)
require(ggplot2)
t.f <- arrange(test.final, desc(test.prob))
t.f$decile <- with(t.f, cut_number(test.prob, 10, labels = 10:1))
head(t.f)
## age ed employ address income debtinc creddebt othdebt default
## 1 40 3 5 3 220 16.0 8.166400 27.033600 1
## 2 25 4 0 1 18 33.4 2.801592 3.210408 1
## 3 28 2 3 4 41 25.1 5.896743 4.394257 1
## 4 30 3 0 8 65 29.7 3.899610 15.405390 1
## 5 32 2 11 6 75 23.3 7.758900 9.716100 1
## 6 41 3 2 0 15 26.3 1.960665 1.984335 1
## test.prob decile
## 1 0.98 1
## 2 0.98 1
## 3 0.98 1
## 4 0.97 1
## 5 0.93 1
## 6 0.93 1
tail(t.f)
## age ed employ address income debtinc creddebt othdebt default
## 206 46 3 22 4 99 2.3 0.826551 1.450449 0
## 207 45 2 21 26 132 3.0 2.558160 1.401840 0
## 208 47 1 27 8 113 9.2 2.380684 8.015316 0
## 209 48 2 30 14 148 7.2 3.974688 6.681312 0
## 210 53 2 16 31 44 5.6 1.333024 1.130976 0
## 211 48 1 28 25 70 12.0 4.334400 4.065600 0
## test.prob decile
## 206 0 10
## 207 0 10
## 208 0 10
## 209 0 10
## 210 0 10
## 211 0 10
By sorting the probability values from largest to smallest and assigning deciles, we have ranked the observations from the highest to the lowest predicted probability. However, we do not yet know how many in each decile are actually good or bad, so we construct a score table.
test.score <- t.f %>% group_by(decile)
test.score1 <- test.score %>%
summarise_each(funs(sum), default) %>%
arrange(desc(decile))
test.score2 <- t.f %>%
group_by(decile) %>%
summarise(default = n()) %>%
arrange(desc(decile))
ts.table <- left_join(test.score1, test.score2, by = "decile")
ts.table <- rename(ts.table, Good = default.x, CountOfDecile = default.y)
ts.table <- mutate(ts.table, Bad = CountOfDecile - Good)
ts.table <- mutate(ts.table, CuGood = cumsum(Good))
ts.table <- mutate(ts.table, CuBad = cumsum(Bad))
# NOTE: 57 and 154 are hard-coded; based on the counts above, the test set
# actually contains 54 observations with default = 1 (counted as "Good" here)
# and 157 with default = 0, which is why the cumulative percentages in the
# printed table do not end exactly at 1
ts.table <- mutate(ts.table, CuGoodPercent = CuGood/57)
ts.table$CuGoodPercent <- round(ts.table$CuGoodPercent, 2)
ts.table <- mutate(ts.table, CuBadPercent = CuBad/154)
ts.table$CuBadPercent <- round(ts.table$CuBadPercent, 2)
ts.table <- mutate(ts.table, CuBadAvoided = 1 - CuBadPercent)
ts.table$CuBadAvoided <- round(ts.table$CuBadAvoided, 2)
ts.table <- mutate(ts.table, Profit = 100*CuGood - 500*CuBad)
ts.table
## Source: local data frame [10 x 10]
##
## decile Good CountOfDecile Bad CuGood CuBad CuGoodPercent CuBadPercent
## 1 1 16 19 3 16 3 0.28 0.02
## 2 2 13 22 9 29 12 0.51 0.08
## 3 3 8 22 14 37 26 0.65 0.17
## 4 4 7 18 11 44 37 0.77 0.24
## 5 5 6 23 17 50 54 0.88 0.35
## 6 6 0 19 19 50 73 0.88 0.47
## 7 7 1 23 22 51 95 0.89 0.62
## 8 8 3 17 14 54 109 0.95 0.71
## 9 9 0 23 23 54 132 0.95 0.86
## 10 10 0 25 25 54 157 0.95 1.02
## Variables not shown: CuBadAvoided (dbl), Profit (dbl)
The last two variables are not displayed by the default print method, so I will construct a separate data frame for them.
CumBadAvoided <- c(0.99, 0.92, 0.84, 0.74, 0.63, 0.53, 0.40, 0.27, 0.15, 0.00)
Profit <- c(900, -3600, -8700, -15600, -23700, -31200, -41100, -50400, -59800, -71300)
lasttwo <- cbind(CumBadAvoided, Profit)
lasttwo <- as.data.frame(lasttwo)
lasttwo
## CumBadAvoided Profit
## 1 0.99 900
## 2 0.92 -3600
## 3 0.84 -8700
## 4 0.74 -15600
## 5 0.63 -23700
## 6 0.53 -31200
## 7 0.40 -41100
## 8 0.27 -50400
## 9 0.15 -59800
## 10 0.00 -71300
As we can see, the model identifies 16 good observations out of 19 in decile 1, 13 out of 22 in decile 2, and so on; the remainder in each decile are the bad observations. We build variables such as cumulative good, cumulative bad, cumulative good percentage, cumulative bad percentage, cumulative bad avoided and profit to reach a conclusion for the credit score. The Profit variable assumes that we gain 100 for every good loan and lose 500 for every bad loan, and it gives us a cut-off point for how far down the deciles we can take the risk and grant loans while still making a profit. In this case we can accept only the first decile, capturing about 28% of the good loans while avoiding about 99% of the bad loans.
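Two side notes on the dplyr code above, assuming a newer package version than the one used in the original run: the hidden columns are only an artifact of the tbl print method and can be displayed without retyping them, and summarise_each()/funs() have since been deprecated in favour of summarise() with explicit summaries. A sketch of both:
# print every column of the existing table instead of rebuilding the last two by hand
as.data.frame(ts.table)
# the Good / CountOfDecile summary written with current dplyr verbs
t.f %>%
  group_by(decile) %>%
  summarise(Good = sum(default), CountOfDecile = n()) %>%
  arrange(desc(decile))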
We will build the credit score table for the train dataset as well.
train.final <- data.frame(train, train.prob)
train.final$train.prob <- round(train.final$train.prob, 2)
head(train.final)
## age ed employ address income debtinc creddebt othdebt default
## 216 34 3 12 8 47 6.7 1.300537 1.848463 0
## 181 39 1 12 10 46 5.0 0.561200 1.738800 0
## 386 48 1 3 11 27 21.3 1.403244 4.347756 0
## 40 45 1 23 5 50 4.2 0.558600 1.541400 0
## 327 40 1 12 6 89 17.4 2.771994 12.714006 0
## 337 36 2 7 16 43 15.4 2.648800 3.973200 0
## train.prob
## 216 0.05
## 181 0.03
## 386 0.70
## 40 0.00
## 327 0.31
## 337 0.30
Divide the train dataset into ten deciles.
train.f <- arrange(train.final, desc(train.prob))
train.f$decile <- with(train.f, cut_number(train.prob, 10, labels = 10:1))
head(train.f)
## age ed employ address income debtinc creddebt othdebt default
## 1 33 1 14 8 72 41.3 15.01668 14.71932 1
## 2 47 1 29 18 129 25.3 20.56131 12.07569 1
## 3 47 3 16 26 221 17.6 15.79178 23.10422 1
## 4 43 4 18 14 446 6.5 16.03147 12.95853 1
## 5 54 1 25 12 120 26.5 14.59620 17.20380 1
## 6 28 1 0 2 28 33.3 2.28438 7.03962 1
## train.prob decile
## 1 1.00 1
## 2 0.99 1
## 3 0.98 1
## 4 0.97 1
## 5 0.97 1
## 6 0.97 1
tail(train.f)
## age ed employ address income debtinc creddebt othdebt default
## 484 51 2 19 25 159 2.8 1.068480 3.383520 0
## 485 39 2 20 17 101 1.4 0.340774 1.073226 0
## 486 47 1 22 19 81 5.5 1.505790 2.949210 0
## 487 42 2 21 11 121 3.1 1.365364 2.385636 0
## 488 39 1 15 16 39 1.8 0.139698 0.562302 0
## 489 39 1 19 16 45 3.7 0.915750 0.749250 0
## train.prob decile
## 484 0 10
## 485 0 10
## 486 0 10
## 487 0 10
## 488 0 10
## 489 0 10
Score table for the train dataset
train.score <- train.f %>% group_by(decile)
train.score1 <- train.score %>%
summarise_each(funs(sum), default) %>%
arrange(desc(decile))
train.score2 <- train.f %>%
group_by(decile) %>%
summarise(default = n()) %>%
arrange(desc(decile))
train.table <- left_join(train.score1, train.score2, by = "decile")
train.table <- rename(train.table, Good = default.x, CountOfDecile = default.y)
train.table <- mutate(train.table, Bad = CountOfDecile - Good)
train.table <- mutate(train.table, CuGood = cumsum(Good))
train.table <- mutate(train.table, CuBad = cumsum(Bad))
# NOTE: 126 and 363 are hard-coded; based on the counts above, the train set
# actually contains 129 observations with default = 1 and 360 with default = 0
train.table <- mutate(train.table, CuGoodPercent = CuGood/126)
train.table$CuGoodPercent <- round(train.table$CuGoodPercent, 2)
train.table <- mutate(train.table, CuBadPercent = CuBad/363)
train.table$CuBadPercent <- round(train.table$CuBadPercent, 2)
train.table <- mutate(train.table, CuBadAvoided = 1 - CuBadPercent)
train.table <- mutate(train.table, Profit = 100*CuGood - 500*CuBad)
train.table
## Source: local data frame [10 x 10]
##
## decile Good CountOfDecile Bad CuGood CuBad CuGoodPercent CuBadPercent
## 1 1 42 49 7 42 7 0.33 0.02
## 2 2 25 47 22 67 29 0.53 0.08
## 3 3 19 44 25 86 54 0.68 0.15
## 4 4 14 53 39 100 93 0.79 0.26
## 5 5 12 47 35 112 128 0.89 0.35
## 6 6 8 51 43 120 171 0.95 0.47
## 7 7 4 48 44 124 215 0.98 0.59
## 8 8 3 36 33 127 248 1.01 0.68
## 9 9 2 59 57 129 305 1.02 0.84
## 10 10 0 55 55 129 360 1.02 0.99
## Variables not shown: CuBadAvoided (dbl), Profit (dbl)
Similarly, we choose the first decile to capture 33% of the good loans and avoid 98% of the bad loans. Only the first decile yields a positive profit for the bank, so we will select the observations that fall into decile one in both the train and test datasets.
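A sketch of that final selection, using the scored data frames built above (the object name approved is hypothetical):
# keep only the original columns so the train and test pieces can be stacked
approved <- rbind(subset(train.f, decile == 1)[, names(bd)],
                  subset(t.f, decile == 1)[, names(bd)])
dim(approved)   # 49 train + 19 test observations, based on the decile counts above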