Credit Scoring 101

In this post we'll fit some predicitve models in (well know) data bases, and evalute the performance of each model. Disclaimer1: for simplicity the predictive variables are treating without apply any transformation to get a better performance or stability, etc. We'll use two datas to evaluate the performances of the models. Both data have categorical and continous variables and we'll use 50-50 split to have a train and test data. The datas are:

##         sample
## response test  train
##     bad  14.9% 15.1%
##     good 38.1% 31.9%
##         sample
## response test  train
##     bad  12.7% 12.4%
##     good 37.3% 37.6%

The models to compare are logistic, conditional inference trees (party package), single-hidden-layer neural network (nnet pakage) and linear discriminant analysis. To evalute the performance there are some indicators like KS statistic, Area under ROC curve among others. If you are not familiar with this terms check this link. Now, let's go with the results.

German Credit Data

##            SCORE SAMPLE   BR   KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50
## 2 SCORE_Logistic  train 0.32 0.54   0.83   0.25   0.48   0.64   0.74   0.82
## 4    SCORE_CTree  train 0.32 0.40   0.76   0.53   0.53   0.61   0.87   0.87
## 6   SCORE_SLNNET  train 0.32 0.64   0.90   0.30   0.54   0.73   0.82   0.89
## 8      SCORE_LDA  train 0.32 0.54   0.83   0.24   0.47   0.63   0.76   0.82
## 1 SCORE_Logistic   test 0.28 0.46   0.78   0.23   0.43   0.58   0.72   0.82
## 3    SCORE_CTree   test 0.28 0.37   0.74   0.50   0.50   0.57   0.87   0.87
## 5   SCORE_SLNNET   test 0.28 0.42   0.77   0.21   0.40   0.58   0.68   0.79
## 7      SCORE_LDA   test 0.28 0.47   0.79   0.23   0.46   0.57   0.73   0.82
daux <- subset(data1, SAMPLE == "test")
daux_roc <- ldply(str_pattern(names(daux),"SCORE"), function(score){
  perf <- performance(prediction(daux[[score]], daux$GB), "tpr","fpr")
  df <- data.frame(x = unlist(perf@"x.values") , y = unlist(perf@"y.values"))
  df$score <- score
  df
})

ggplot(daux_roc) + geom_line(aes(x,y, color = score), size = 1.2) + 
  scale_color_manual('',values=brewer.pal(length(unique(daux_roc$score)), "RdBu")) +
  geom_path(data=data.frame(x = c(0,1), y = c(0,1)),
            aes(x,y), colour = "gray", size = 1) +
  scale_x_continuous("False Positive Rate (1 - Specificity)",
                     labels = percent_format(), limits = c(0, 1)) +
  scale_y_continuous("True Positive Rate (Sensivity or Recall)",
                     labels = percent_format(), limits = c(0, 1)) +
  theme(legend.position = "top") +
  ggtitle("ROC Curves for German Credit Data (validation)")

plot of chunk roc-curve-1

Now we can plot the distributions of good/bads in each model. We'll transform the data whith melt function and then plot faceting by score.

daux <- subset(data1, SAMPLE == "test",
               select = c("GB", "SCORE_Logistic",
                          "SCORE_CTree","SCORE_SLNNET","SCORE_LDA"))
daux <- melt(daux, id = "GB")

ggplot(daux, aes(x=value, fill = factor(GB))) +
  geom_density(alpha = 0.6, size = .75) +
  facet_wrap(~variable, ncol=2) +
  scale_fill_manual(values = brewer.pal(3, "Dark2")) +
  theme(legend.position = "none",
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank(),
        plot.margin = unit(rep(0.5, 4), "lines"),
        title = element_text(size = 9))

plot of chunk distributions-1

Bankloan Binning data

##            SCORE SAMPLE   BR   KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50
## 2 SCORE_Logistic  train 0.25 0.54   0.84   0.31   0.51   0.67   0.79   0.89
## 4    SCORE_CTree  train 0.25 0.52   0.84   0.34   0.50   0.82   0.82   0.90
## 6   SCORE_SLNNET  train 0.25 0.56   0.86   0.32   0.52   0.68   0.82   0.89
## 8      SCORE_LDA  train 0.25 0.52   0.84   0.31   0.51   0.66   0.78   0.88
## 1 SCORE_Logistic   test 0.25 0.53   0.84   0.31   0.51   0.68   0.78   0.87
## 3    SCORE_CTree   test 0.25 0.45   0.79   0.30   0.51   0.76   0.76   0.84
## 5   SCORE_SLNNET   test 0.25 0.51   0.83   0.30   0.51   0.66   0.77   0.85
## 7      SCORE_LDA   test 0.25 0.50   0.83   0.31   0.51   0.67   0.77   0.85

plot of chunk roc-curve-2

plot of chunk distributions-2

Do you want to comment about the results? If you are interesting in this topic reproduce this example. And if you have questions and/or improvements or want to know more details for the code please comment.

References

  1. Ggplot2
  2. RStudio
  3. Knitr
  4. Guide to credit scoring in R