Credit Scoring 101

In this post we'll fit some predicitve models in (well know) data bases, and evalute the performance of each model. Disclaimer1: for simplicity the predictive variables are treating without apply any transformation to get a better performance or stability, etc. We'll use two datas to evaluate the performances of the models. Both data have categorical and continous variables and we'll use 50-50 split to have a train and test data. The datas are:

German Credit: The German Credit data frame has 1000 rows and 8 columns. This are data for clients of a south german bank, 700 good payers and 300 bad payers. They are used to construct a credit scoring method. This data have 20 predictive variables and 1000 observations and have a bad rate of 30%. So, after the select subsets to fit the models the distributions in the data are:

##         sample
## response test  train
##     bad  14.9% 15.1%
##     good 38.1% 31.9%

Bankloan Binning: This is a hypothetical data file containing financial and demographic information on past customers. Here is the source. This data have 8 predictive variables and 5000 observations and have a bad rate of 25.1%.

##         sample
## response test  train
##     bad  12.7% 12.4%
##     good 37.3% 37.6%

The models to compare are logistic, conditional inference trees (party package), single-hidden-layer neural network (nnet pakage) and linear discriminant analysis. To evalute the performance there are some indicators like KS statistic, Area under ROC curve among others. If you are not familiar with this terms check this link. Now, let's go with the results.

German Credit Data

##            SCORE SAMPLE   BR   KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50
## 2 SCORE_Logistic  train 0.32 0.54   0.83   0.25   0.48   0.64   0.74   0.82
## 4    SCORE_CTree  train 0.32 0.40   0.76   0.53   0.53   0.61   0.87   0.87
## 6   SCORE_SLNNET  train 0.32 0.64   0.90   0.30   0.54   0.73   0.82   0.89
## 8      SCORE_LDA  train 0.32 0.54   0.83   0.24   0.47   0.63   0.76   0.82
## 1 SCORE_Logistic   test 0.28 0.46   0.78   0.23   0.43   0.58   0.72   0.82
## 3    SCORE_CTree   test 0.28 0.37   0.74   0.50   0.50   0.57   0.87   0.87
## 5   SCORE_SLNNET   test 0.28 0.42   0.77   0.21   0.40   0.58   0.68   0.79
## 7      SCORE_LDA   test 0.28 0.47   0.79   0.23   0.46   0.57   0.73   0.82

daux <- subset(data1, SAMPLE == "test")
daux_roc <- ldply(str_pattern(names(daux),"SCORE"), function(score){
  perf <- performance(prediction(daux[[score]], daux$GB), "tpr","fpr")
  df <- data.frame(x = unlist(perf@"x.values") , y = unlist(perf@"y.values"))
  df$score <- score
  df
})

ggplot(daux_roc) + geom_line(aes(x,y, color = score), size = 1.2) + 
  scale_color_manual('',values=brewer.pal(length(unique(daux_roc$score)), "RdBu")) +
  geom_path(data=data.frame(x = c(0,1), y = c(0,1)),
            aes(x,y), colour = "gray", size = 1) +
  scale_x_continuous("False Positive Rate (1 - Specificity)",
                     labels = percent_format(), limits = c(0, 1)) +
  scale_y_continuous("True Positive Rate (Sensivity or Recall)",
                     labels = percent_format(), limits = c(0, 1)) +
  theme(legend.position = "top") +
  ggtitle("ROC Curves for German Credit Data (validation)")

plot of chunk roc-curve-1

Now we can plot the distributions of good/bads in each model. We'll transform the data whith melt function and then plot faceting by score.

daux <- subset(data1, SAMPLE == "test",
               select = c("GB", "SCORE_Logistic",
                          "SCORE_CTree","SCORE_SLNNET","SCORE_LDA"))
daux <- melt(daux, id = "GB")

ggplot(daux, aes(x=value, fill = factor(GB))) +
  geom_density(alpha = 0.6, size = .75) +
  facet_wrap(~variable, ncol=2) +
  scale_fill_manual(values = brewer.pal(3, "Dark2")) +
  theme(legend.position = "none",
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank(),
        plot.margin = unit(rep(0.5, 4), "lines"),
        title = element_text(size = 9))

Bankloan Binning data

##            SCORE SAMPLE   BR   KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50
## 2 SCORE_Logistic  train 0.25 0.54   0.84   0.31   0.51   0.67   0.79   0.89
## 4    SCORE_CTree  train 0.25 0.52   0.84   0.34   0.50   0.82   0.82   0.90
## 6   SCORE_SLNNET  train 0.25 0.56   0.86   0.32   0.52   0.68   0.82   0.89
## 8      SCORE_LDA  train 0.25 0.52   0.84   0.31   0.51   0.66   0.78   0.88
## 1 SCORE_Logistic   test 0.25 0.53   0.84   0.31   0.51   0.68   0.78   0.87
## 3    SCORE_CTree   test 0.25 0.45   0.79   0.30   0.51   0.76   0.76   0.84
## 5   SCORE_SLNNET   test 0.25 0.51   0.83   0.30   0.51   0.66   0.77   0.85
## 7      SCORE_LDA   test 0.25 0.50   0.83   0.31   0.51   0.67   0.77   0.85

plot of chunk roc-curve-2

plot of chunk distributions-2

Do you want to comment about the results? If you are interesting in this topic reproduce this example. And if you have questions and/or improvements or want to know more details for the code please comment.

Credit Scoring 101

German Credit Data

Bankloan Binning data

References