We will identify eligible customers in the dataset for loan processing and build a model so that, when we receive similar information in the future, we can estimate the risk, or probability, that a customer will turn out good or bad.

We load the data first.
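
The raw file is not shown in this write-up, so the file name below is only an assumption - replace it with the actual path to your copy of the data.

bankdata <- read.csv("bankloan.csv", header = TRUE)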

head(bankdata)
##   age ed employ address income debtinc  creddebt  othdebt default
## 1  41  3     17      12    176     9.3 11.359392 5.008608       1
## 2  27  1     10       6     31    17.3  1.362202 4.000798       0
## 3  40  1     15      14     55     5.5  0.856075 2.168925       0
## 4  41  1     15      14    120     2.9  2.658720 0.821280       0
## 5  24  2      2       0     28    17.3  1.787436 3.056564       1
## 6  41  2      5       5     25    10.2  0.392700 2.157300       0

We check the data structure, look for missing values and compute descriptive statistics.

dim(bankdata)
## [1] 850   9
str(bankdata)
## 'data.frame':    850 obs. of  9 variables:
##  $ age     : int  41 27 40 41 24 41 39 43 24 36 ...
##  $ ed      : int  3 1 1 1 2 2 1 1 1 1 ...
##  $ employ  : int  17 10 15 15 2 5 20 12 3 0 ...
##  $ address : int  12 6 14 14 0 5 9 11 4 13 ...
##  $ income  : int  176 31 55 120 28 25 67 38 19 25 ...
##  $ debtinc : num  9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
##  $ creddebt: num  11.359 1.362 0.856 2.659 1.787 ...
##  $ othdebt : num  5.009 4.001 2.169 0.821 3.057 ...
##  $ default : int  1 0 0 0 1 0 0 0 1 0 ...
summary(bankdata)
##       age              ed            employ          address      
##  Min.   :20.00   Min.   :1.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:29.00   1st Qu.:1.000   1st Qu.: 3.000   1st Qu.: 3.000  
##  Median :34.00   Median :1.000   Median : 7.000   Median : 7.000  
##  Mean   :35.03   Mean   :1.711   Mean   : 8.566   Mean   : 8.372  
##  3rd Qu.:41.00   3rd Qu.:2.000   3rd Qu.:13.000   3rd Qu.:12.000  
##  Max.   :56.00   Max.   :5.000   Max.   :33.000   Max.   :34.000  
##                                                                   
##      income          debtinc         creddebt          othdebt        
##  Min.   : 13.00   Min.   : 0.10   Min.   : 0.0117   Min.   : 0.04558  
##  1st Qu.: 24.00   1st Qu.: 5.10   1st Qu.: 0.3822   1st Qu.: 1.04594  
##  Median : 35.00   Median : 8.70   Median : 0.8851   Median : 2.00324  
##  Mean   : 46.68   Mean   :10.17   Mean   : 1.5768   Mean   : 3.07879  
##  3rd Qu.: 55.75   3rd Qu.:13.80   3rd Qu.: 1.8984   3rd Qu.: 3.90300  
##  Max.   :446.00   Max.   :41.30   Max.   :20.5613   Max.   :35.19750  
##                                                                       
##     default      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2614  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##  NA's   :150
colSums(is.na(bankdata))
##      age       ed   employ  address   income  debtinc creddebt  othdebt 
##        0        0        0        0        0        0        0        0 
##  default 
##      150

Education is a categorical variable, so we will convert it to a factor. Everything looks fine except that we have 150 missing values in our dependent variable, “default”. Since it is the response variable, we will remove all the observations with missing values.

bankdata$ed <- as.factor(bankdata$ed)
bd <- na.omit(bankdata)
colSums(is.na(bd))
##      age       ed   employ  address   income  debtinc creddebt  othdebt 
##        0        0        0        0        0        0        0        0 
##  default 
##        0
str(bd)
## 'data.frame':    700 obs. of  9 variables:
##  $ age     : int  41 27 40 41 24 41 39 43 24 36 ...
##  $ ed      : Factor w/ 5 levels "1","2","3","4",..: 3 1 1 1 2 2 1 1 1 1 ...
##  $ employ  : int  17 10 15 15 2 5 20 12 3 0 ...
##  $ address : int  12 6 14 14 0 5 9 11 4 13 ...
##  $ income  : int  176 31 55 120 28 25 67 38 19 25 ...
##  $ debtinc : num  9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
##  $ creddebt: num  11.359 1.362 0.856 2.659 1.787 ...
##  $ othdebt : num  5.009 4.001 2.169 0.821 3.057 ...
##  $ default : int  1 0 0 0 1 0 0 0 1 0 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:150] 701 702 703 704 705 706 707 708 709 710 ...
##   .. ..- attr(*, "names")= chr [1:150] "701" "702" "703" "704" ...

Our data is ready for analysis. We divide it into two parts - a train and a test dataset. We build the model on the train dataset and validate it on the test dataset.

set.seed(100)
sub <- sample(nrow(bd), floor(nrow(bd) * 0.7))
train <- bd[sub, ]
test <- bd[-sub, ]
head(train)
##     age ed employ address income debtinc creddebt   othdebt default
## 216  34  3     12       8     47     6.7 1.300537  1.848463       0
## 181  39  1     12      10     46     5.0 0.561200  1.738800       0
## 386  48  1      3      11     27    21.3 1.403244  4.347756       0
## 40   45  1     23       5     50     4.2 0.558600  1.541400       0
## 327  40  1     12       6     89    17.4 2.771994 12.714006       0
## 337  36  2      7      16     43    15.4 2.648800  3.973200       0
head(test)
##   age ed employ address income debtinc creddebt  othdebt default
## 2  27  1     10       6     31    17.3 1.362202 4.000798       0
## 3  40  1     15      14     55     5.5 0.856075 2.168925       0
## 4  41  1     15      14    120     2.9 2.658720 0.821280       0
## 5  24  2      2       0     28    17.3 1.787436 3.056564       1
## 6  41  2      5       5     25    10.2 0.392700 2.157300       0
## 9  24  1      3       4     19    24.4 1.358348 3.277652       1

We build the logistic regression model on the train dataset.

train.glm <- glm(default ~ ., family = "binomial", data = train)
summary(train.glm)
## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4442  -0.6298  -0.3068   0.3431   2.9037  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.918492   0.729374  -2.630 0.008530 ** 
## age          0.034331   0.021135   1.624 0.104292    
## ed2          0.206628   0.309971   0.667 0.505025    
## ed3          0.662556   0.394590   1.679 0.093133 .  
## ed4          0.446669   0.543982   0.821 0.411583    
## ed5          2.070851   1.709775   1.211 0.225825    
## employ      -0.225464   0.036908  -6.109 1.00e-09 ***
## address     -0.092204   0.026527  -3.476 0.000509 ***
## income      -0.006907   0.008664  -0.797 0.425362    
## debtinc      0.101353   0.036178   2.801 0.005087 ** 
## creddebt     0.604917   0.125237   4.830 1.36e-06 ***
## othdebt     -0.040814   0.091468  -0.446 0.655448    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 564.30  on 488  degrees of freedom
## Residual deviance: 388.57  on 477  degrees of freedom
## AIC: 412.57
## 
## Number of Fisher Scoring iterations: 6

We will also fit a stepwise logistic regression model to remove the variables which are not significant.

train.step <- step(train.glm, direction = "backward", trace = 0)
summary(train.step)
## 
## Call:
## glm(formula = default ~ age + employ + address + debtinc + creddebt, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4009  -0.6579  -0.3121   0.3605   2.8091  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.79188    0.61176  -2.929 0.003400 ** 
## age          0.03301    0.02042   1.617 0.105850    
## employ      -0.24892    0.03461  -7.192 6.37e-13 ***
## address     -0.09207    0.02607  -3.532 0.000412 ***
## debtinc      0.09436    0.02188   4.312 1.62e-05 ***
## creddebt     0.56918    0.10086   5.643 1.67e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 564.30  on 488  degrees of freedom
## Residual deviance: 392.88  on 483  degrees of freedom
## AIC: 404.88
## 
## Number of Fisher Scoring iterations: 6
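
As a quick illustration of how the final model would be used on new information, we can score a hypothetical applicant with predict(); the values below are made up purely for the example.

new.applicant <- data.frame(age = 35, employ = 8, address = 6,
                            debtinc = 12.5, creddebt = 1.8)
predict(train.step, newdata = new.applicant, type = "response")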

Now we validate our model on both the train and test datasets.
Classification Table for train dataset

train.prob <- predict(train.step, type = "response")
train.pred <- ifelse(train.prob > .5, "1", "0")
table(train.pred, train$default)
##           
## train.pred   0   1
##          0 333  66
##          1  27  63
mean(train.pred == train$default)
## [1] 0.809816

Classification Table for test dataset

test.prob <- predict(train.step, newdata = test, type = "response")
test.pred <- ifelse(test.prob > .5, "1", "0")
table(test.pred, test$default)
##          
## test.pred   0   1
##         0 144  24
##         1  13  30
mean(test.pred == test$default)
## [1] 0.8246445

ROC Curve for the train dataset. For this we require the ROCR package.

require(ROCR)
train.roc <- prediction(train.prob, train$default)
plot(performance(train.roc, "tpr", "fpr"), col = "red", main = "ROC Curve for train data")
abline(0, 1, lty = 8, col = "blue")

AUC for train dataset

train.auc <- performance(train.roc, "auc")
slot(train.auc, "y.values")
## [[1]]
## [1] 0.8537898

ROC Curve for Test dataset.

test.roc <- prediction(test.prob, test$default)
plot(performance(test.roc, "tpr", "fpr"), col = "red", main = "ROC Curve for test data")
abline(0, 1, lty = 8, col = "blue")

AUC for test dataset

test.auc <- performance(test.roc, "auc")
slot(test.auc, "y.values")
## [[1]]
## [1] 0.8634112

Kolmogorov-Smirnov test for train dataset

ks.train <- performance(train.roc, "tpr", "fpr")
train.ks <- max(attr(ks.train, "y.values")[[1]] - (attr(ks.train, "x.values")[[1]]))
train.ks
## [1] 0.5549096

Kolmogorov-Smirnov test for test dataset

ks.test <- performance(test.roc, "tpr", "fpr")
test.ks <- max(attr(ks.test, "y.values")[[1]] - (attr(ks.test, "x.values")[[1]]))
test.ks
## [1] 0.6046237
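
As a cross-check, essentially the same statistic can be obtained with the two-sample Kolmogorov-Smirnov test from base R, comparing the predicted probabilities of defaulters and non-defaulters; this sketch is not part of the original workflow and should agree with the value above up to ties.

stats::ks.test(test.prob[test$default == 1], test.prob[test$default == 0])$statistic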

Hosmer-Lemeshow test on the train dataset. For this we require the ResourceSelection package.

require(ResourceSelection)
hl.train <- hoslem.test(train$default, fitted(train.step), g = 10)
hl.train
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  train$default, fitted(train.step)
## X-squared = 5.0527, df = 8, p-value = 0.7519

Percentage of Concordance for the train dataset. For this we need to run the following function.

OptimisedConc <- function(model)
{
  # Pair every defaulter (y = 1) with every non-defaulter (y = 0) and compare
  # their fitted probabilities: concordant if the defaulter scores higher,
  # discordant if lower, tied if equal.
  Data  <- cbind(model$y, model$fitted.values)
  ones  <- Data[Data[, 1] == 1, ]
  zeros <- Data[Data[, 1] == 0, ]
  conc <- matrix(0, dim(zeros)[1], dim(ones)[1])
  disc <- matrix(0, dim(zeros)[1], dim(ones)[1])
  ties <- matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1])
  {
    for (i in 1:dim(ones)[1])
    {
      if (ones[i, 2] > zeros[j, 2]) {
        conc[j, i] <- 1
      } else if (ones[i, 2] < zeros[j, 2]) {
        disc[j, i] <- 1
      } else if (ones[i, 2] == zeros[j, 2]) {
        ties[j, i] <- 1
      }
    }
  }
  Pairs <- dim(zeros)[1] * dim(ones)[1]
  PercentConcordance <- (sum(conc) / Pairs) * 100
  PercentDiscordance <- (sum(disc) / Pairs) * 100
  PercentTied <- (sum(ties) / Pairs) * 100
  return(list("Percent Concordance" = PercentConcordance,
              "Percent Discordance" = PercentDiscordance,
              "Percent Tied" = PercentTied,
              "Pairs" = Pairs))
}

OptimisedConc(train.step)
## $`Percent Concordance`
## [1] 85.37898
## 
## $`Percent Discordance`
## [1] 14.62102
## 
## $`Percent Tied`
## [1] 0
## 
## $Pairs
## [1] 46440
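
The nested loops above are slow on larger datasets. The same pairwise comparison can be written in vectorised form with outer(); this is only an alternative sketch and assumes, as above, that the fitted glm object exposes y and fitted.values. Note that the percent concordance agrees with the train AUC reported earlier (0.8537898), since both count the proportion of concordant pairs.

OptimisedConc2 <- function(model) {
  p1 <- model$fitted.values[model$y == 1]  # fitted probabilities of the defaulters
  p0 <- model$fitted.values[model$y == 0]  # fitted probabilities of the non-defaulters
  conc <- outer(p1, p0, ">")               # TRUE where a defaulter outranks a non-defaulter
  ties <- outer(p1, p0, "==")
  Pairs <- length(p1) * length(p0)
  list("Percent Concordance" = 100 * sum(conc) / Pairs,
       "Percent Discordance" = 100 * sum(!conc & !ties) / Pairs,
       "Percent Tied" = 100 * sum(ties) / Pairs,
       "Pairs" = Pairs)
}
OptimisedConc2(train.step)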

So we have:
Percentage of correct classification for train data - 81%
Percentage of correct classification for test data - 82%
Area under curve for train data - 0.85
Area under curve for test data - 0.86
Kolmogorov-Smirnov statistic for train data - 0.55
Kolmogorov-Smirnov statistic for test data - 0.60
Hosmer-Lemeshow test for train data - p-value = 0.75, which is > 0.05
Percentage of Concordance for train data - 85.38

All the tests except the Kolmogorov-Smirnov test validate that the model is good.

Now we will look more closely at the model's predictions on the test dataset.

test.final <- data.frame(test, test.prob)
test.final$test.prob <- round(test.final$test.prob, 2)
head(test.final)
##   age ed employ address income debtinc creddebt  othdebt default test.prob
## 2  27  1     10       6     31    17.3 1.362202 4.000798       0      0.18
## 3  40  1     15      14     55     5.5 0.856075 2.168925       0      0.01
## 4  41  1     15      14    120     2.9 2.658720 0.821280       0      0.02
## 5  24  2      2       0     28    17.3 1.787436 3.056564       1      0.76
## 6  41  2      5       5     25    10.2 0.392700 2.157300       0      0.28
## 9  24  1      3       4     19    24.4 1.358348 3.277652       1      0.72

We divide the test dataset into deciles and do some data manipulation to get the desired table. For this we require the dplyr and ggplot2 packages.

require(dplyr)
require(ggplot2)
t.f <- arrange(test.final, desc(test.prob))
t.f$decile <- with(t.f, cut_number(test.prob, 10, labels = 10:1))
head(t.f)
##   age ed employ address income debtinc creddebt   othdebt default
## 1  40  3      5       3    220    16.0 8.166400 27.033600       1
## 2  25  4      0       1     18    33.4 2.801592  3.210408       1
## 3  28  2      3       4     41    25.1 5.896743  4.394257       1
## 4  30  3      0       8     65    29.7 3.899610 15.405390       1
## 5  32  2     11       6     75    23.3 7.758900  9.716100       1
## 6  41  3      2       0     15    26.3 1.960665  1.984335       1
##   test.prob decile
## 1      0.98      1
## 2      0.98      1
## 3      0.98      1
## 4      0.97      1
## 5      0.93      1
## 6      0.93      1
tail(t.f)
##     age ed employ address income debtinc creddebt  othdebt default
## 206  46  3     22       4     99     2.3 0.826551 1.450449       0
## 207  45  2     21      26    132     3.0 2.558160 1.401840       0
## 208  47  1     27       8    113     9.2 2.380684 8.015316       0
## 209  48  2     30      14    148     7.2 3.974688 6.681312       0
## 210  53  2     16      31     44     5.6 1.333024 1.130976       0
## 211  48  1     28      25     70    12.0 4.334400 4.065600       0
##     test.prob decile
## 206         0     10
## 207         0     10
## 208         0     10
## 209         0     10
## 210         0     10
## 211         0     10

By sorting the probability values from largest to smallest and assigning deciles, we have divided the data into groups from worst to best risk. However, we do not yet know how many observations in each group are actually good or bad, so we will construct a score table.

test.score <- t.f %>% group_by(decile)
test.score1 <- test.score %>%
  summarise_each(funs(sum), default) %>%
  arrange(desc(decile))
test.score2 <- t.f %>%
  group_by(decile) %>%
  summarise(default = n()) %>%
  arrange(desc(decile))
ts.table <- left_join(test.score1, test.score2, by = "decile")
ts.table <- rename(ts.table, Good = default.x, CountOfDecile = default.y)
ts.table <- mutate(ts.table, Bad = CountOfDecile - Good)
ts.table <- mutate(ts.table, CuGood = cumsum(Good))
ts.table <- mutate(ts.table, CuBad = cumsum(Bad))
ts.table <- mutate(ts.table, CuGoodPercent = CuGood/57)
ts.table$CuGoodPercent <- round(ts.table$CuGoodPercent, 2)
ts.table <- mutate(ts.table, CuBadPercent = CuBad/154)
ts.table$CuBadPercent <- round(ts.table$CuBadPercent, 2)
ts.table <- mutate(ts.table, CuBadAvoided = 1 - CuBadPercent)
ts.table$CuBadAvoided <- round(ts.table$CuBadAvoided, 2)
ts.table <- mutate(ts.table, Profit = 100*CuGood - 500*CuBad)
ts.table
## Source: local data frame [10 x 10]
## 
##    decile Good CountOfDecile Bad CuGood CuBad CuGoodPercent CuBadPercent
## 1       1   16            19   3     16     3          0.28         0.02
## 2       2   13            22   9     29    12          0.51         0.08
## 3       3    8            22  14     37    26          0.65         0.17
## 4       4    7            18  11     44    37          0.77         0.24
## 5       5    6            23  17     50    54          0.88         0.35
## 6       6    0            19  19     50    73          0.88         0.47
## 7       7    1            23  22     51    95          0.89         0.62
## 8       8    3            17  14     54   109          0.95         0.71
## 9       9    0            23  23     54   132          0.95         0.86
## 10     10    0            25  25     54   157          0.95         1.02
## Variables not shown: CuBadAvoided (dbl), Profit (dbl)

The last two columns are not displayed in the printout, so I will construct a separate data frame for them.

CumBadAvoided <- c(0.99, 0.92, 0.84, 0.74, 0.63, 0.53, 0.40, 0.27, 0.15, 0.00)
Profit <- c(900, -3600, -8700, -15600, -23700, -31200, -41100, -50400, -59800, -71300)
lasttwo <- cbind(CumBadAvoided, Profit)
lasttwo <- as.data.frame(lasttwo)
lasttwo
##    CumBadAvoided Profit
## 1           0.99    900
## 2           0.92  -3600
## 3           0.84  -8700
## 4           0.74 -15600
## 5           0.63 -23700
## 6           0.53 -31200
## 7           0.40 -41100
## 8           0.27 -50400
## 9           0.15 -59800
## 10          0.00 -71300
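
Alternatively, since the truncation only comes from the way the tbl is printed, the hidden columns can be shown by coercing the table to a plain data frame; a minimal sketch:

as.data.frame(ts.table)[, c("decile", "CuBadAvoided", "Profit")]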

As we can see, the model identifies 16 good observations out of 19 in decile 1, 13 out of 22 in decile 2, and so on; the remainder in each decile are the bad observations. We build variables such as Cumulative Good, Cumulative Bad, Cumulative Good Percentage, Cumulative Bad Percentage, Cumulative Bad Avoided and Profit to draw conclusions for the credit score. The Profit variable assumes we gain 100 for every good loan and lose 500 for every bad one, which gives us a cut-off point for deciding up to what percentage we can take the risk, up to which decile we can grant loans, and where we reach maximum profit. In this case we should take only the first decile, capturing about 28% of the good loans while avoiding about 99% of the bad loans.

We will build the credit score table for the train dataset as well.

train.final <- data.frame(train, train.prob)
train.final$train.prob <- round(train.final$train.prob, 2)
head(train.final)
##     age ed employ address income debtinc creddebt   othdebt default
## 216  34  3     12       8     47     6.7 1.300537  1.848463       0
## 181  39  1     12      10     46     5.0 0.561200  1.738800       0
## 386  48  1      3      11     27    21.3 1.403244  4.347756       0
## 40   45  1     23       5     50     4.2 0.558600  1.541400       0
## 327  40  1     12       6     89    17.4 2.771994 12.714006       0
## 337  36  2      7      16     43    15.4 2.648800  3.973200       0
##     train.prob
## 216       0.05
## 181       0.03
## 386       0.70
## 40        0.00
## 327       0.31
## 337       0.30

We divide the train dataset into ten deciles.

train.f <- arrange(train.final, desc(train.prob))
train.f$decile <- with(train.f, cut_number(train.prob, 10, labels = 10:1))
head(train.f)
##   age ed employ address income debtinc creddebt  othdebt default
## 1  33  1     14       8     72    41.3 15.01668 14.71932       1
## 2  47  1     29      18    129    25.3 20.56131 12.07569       1
## 3  47  3     16      26    221    17.6 15.79178 23.10422       1
## 4  43  4     18      14    446     6.5 16.03147 12.95853       1
## 5  54  1     25      12    120    26.5 14.59620 17.20380       1
## 6  28  1      0       2     28    33.3  2.28438  7.03962       1
##   train.prob decile
## 1       1.00      1
## 2       0.99      1
## 3       0.98      1
## 4       0.97      1
## 5       0.97      1
## 6       0.97      1
tail(train.f)
##     age ed employ address income debtinc creddebt  othdebt default
## 484  51  2     19      25    159     2.8 1.068480 3.383520       0
## 485  39  2     20      17    101     1.4 0.340774 1.073226       0
## 486  47  1     22      19     81     5.5 1.505790 2.949210       0
## 487  42  2     21      11    121     3.1 1.365364 2.385636       0
## 488  39  1     15      16     39     1.8 0.139698 0.562302       0
## 489  39  1     19      16     45     3.7 0.915750 0.749250       0
##     train.prob decile
## 484          0     10
## 485          0     10
## 486          0     10
## 487          0     10
## 488          0     10
## 489          0     10

Score table for train dataset

train.score <- train.f %>% group_by(decile)

train.score1 <- train.score %>% 
  summarise_each(funs(sum), default) %>% 
  arrange(desc(decile))
train.score2 <- train.f %>% 
  group_by(decile) %>% 
  summarise(default = n()) %>% 
  arrange(desc(decile))
train.table <- left_join(train.score1, train.score2, by = "decile")
train.table <- rename(train.table, Good = default.x, CountOfDecile = default.y)
train.table <- mutate(train.table, Bad = CountOfDecile - Good)
train.table <- mutate(train.table, CuGood = cumsum(Good))
train.table <- mutate(train.table, CuBad = cumsum(Bad))
train.table <- mutate(train.table, CuGoodPercent = CuGood/126)
train.table$CuGoodPercent <- round(train.table$CuGoodPercent, 2)
train.table <- mutate(train.table, CuBadPercent = CuBad/363)
train.table$CuBadPercent <- round(train.table$CuBadPercent, 2)
train.table <- mutate(train.table, CuBadAvoided = 1 - CuBadPercent)
train.table <- mutate(train.table, Profit = 100*CuGood - 500*CuBad)
train.table
## Source: local data frame [10 x 10]
## 
##    decile Good CountOfDecile Bad CuGood CuBad CuGoodPercent CuBadPercent
## 1       1   42            49   7     42     7          0.33         0.02
## 2       2   25            47  22     67    29          0.53         0.08
## 3       3   19            44  25     86    54          0.68         0.15
## 4       4   14            53  39    100    93          0.79         0.26
## 5       5   12            47  35    112   128          0.89         0.35
## 6       6    8            51  43    120   171          0.95         0.47
## 7       7    4            48  44    124   215          0.98         0.59
## 8       8    3            36  33    127   248          1.01         0.68
## 9       9    2            59  57    129   305          1.02         0.84
## 10     10    0            55  55    129   360          1.02         0.99
## Variables not shown: CuBadAvoided (dbl), Profit (dbl)

Similarly, we choose the first decile to capture 33% of the good loans and avoid 98% of the bad loans. Only the first decile yields maximum profit to the bank, so we will select the observations which fall into decile one in both the test and train datasets.
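
A minimal sketch of that selection, using the decile columns created above (the counts in the comments come from the score tables):

train.decile1 <- subset(train.f, decile == "1")  # 49 observations in the train data
test.decile1 <- subset(t.f, decile == "1")       # 19 observations in the test data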