Credit Risk Model

By Carolyn Koay | Last edited on 6 February 2018

The data comes from the Kaggle competition Give Me Some Credit: https://www.kaggle.com/c/GiveMeSomeCredit

Import Libraries

import pandas as pd
import numpy as np
import operator
import matplotlib.pyplot as plt
from copy import deepcopy

Read and Explore Data

df = pd.read_csv('cs-training.csv')

print(df.head(4))
   Unnamed: 0  SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age  \
0           1                 1                              0.766127   45   
1           2                 0                              0.957151   40   
2           3                 0                              0.658180   38   
3           4                 0                              0.233810   30   

   NumberOfTime30-59DaysPastDueNotWorse  DebtRatio  MonthlyIncome  \
0                                     2   0.802982         9120.0   
1                                     0   0.121876         2600.0   
2                                     1   0.085113         3042.0   
3                                     0   0.036050         3300.0   

   NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  \
0                               13                        0   
1                                4                        0   
2                                2                        1   
3                                5                        0   

   NumberRealEstateLoansOrLines  NumberOfTime60-89DaysPastDueNotWorse  \
0                             6                                     0   
1                             0                                     0   
2                             0                                     0   
3                             0                                     0   

   NumberOfDependents  
0                 2.0  
1                 1.0  
2                 0.0  
3                 0.0  

The first column, which is unnamed, is simply a running series of integers starting from 1. Let’s rename this column to ‘ID’.

df.columns = ['ID'] + list(df)[1:]
names = list(df)
summary = df.describe(include = 'all')
print(summary)
                  ID  SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  \
count  150000.000000     150000.000000                         150000.000000   
mean    75000.500000          0.066840                              6.048438   
std     43301.414527          0.249746                            249.755371   
min         1.000000          0.000000                              0.000000   
25%     37500.750000          0.000000                              0.029867   
50%     75000.500000          0.000000                              0.154181   
75%    112500.250000          0.000000                              0.559046   
max    150000.000000          1.000000                          50708.000000   

                 age  NumberOfTime30-59DaysPastDueNotWorse      DebtRatio  \
count  150000.000000                         150000.000000  150000.000000   
mean       52.295207                              0.421033     353.005076   
std        14.771866                              4.192781    2037.818523   
min         0.000000                              0.000000       0.000000   
25%        41.000000                              0.000000       0.175074   
50%        52.000000                              0.000000       0.366508   
75%        63.000000                              0.000000       0.868254   
max       109.000000                             98.000000  329664.000000   

       MonthlyIncome  NumberOfOpenCreditLinesAndLoans  \
count   1.202690e+05                    150000.000000   
mean    6.670221e+03                         8.452760   
std     1.438467e+04                         5.145951   
min     0.000000e+00                         0.000000   
25%     3.400000e+03                         5.000000   
50%     5.400000e+03                         8.000000   
75%     8.249000e+03                        11.000000   
max     3.008750e+06                        58.000000   

       NumberOfTimes90DaysLate  NumberRealEstateLoansOrLines  \
count            150000.000000                 150000.000000   
mean                  0.265973                      1.018240   
std                   4.169304                      1.129771   
min                   0.000000                      0.000000   
25%                   0.000000                      0.000000   
50%                   0.000000                      1.000000   
75%                   0.000000                      2.000000   
max                  98.000000                     54.000000   

       NumberOfTime60-89DaysPastDueNotWorse  NumberOfDependents  
count                         150000.000000       146076.000000  
mean                               0.240387            0.757222  
std                                4.155179            1.115086  
min                                0.000000            0.000000  
25%                                0.000000            0.000000  
50%                                0.000000            0.000000  
75%                                0.000000            1.000000  
max                               98.000000           20.000000  

The summary of the target variable, ‘SeriousDlqin2yrs’, shows that it takes only 2 values: 0 for no default and 1 for default. Its mean of 0.06684 tells us that 6.7% of the records in the data set are defaults. Let’s use the two class proportions as priors.

# Get priors
priors = [1-summary.loc['mean','SeriousDlqin2yrs'], summary.loc['mean','SeriousDlqin2yrs']]
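
The same proportions can be read straight off the target column instead of the summary table. A minimal sketch on a toy Series (the values are invented for illustration and are not the Kaggle data):

```python
import pandas as pd

# Toy target column: 1 = default, 0 = no default (illustrative values only).
toy_target = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0], name='SeriousDlqin2yrs')

# For a 0/1 variable, the mean equals the proportion of 1s.
p_default = toy_target.mean()
priors_toy = [1 - p_default, p_default]

# value_counts(normalize=True) gives the same class proportions.
proportions = toy_target.value_counts(normalize=True).sort_index()
print(priors_toy)
print(proportions.tolist())
```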

Clean Data

Any variable with a count below 150000 has missing values. There are 2 such variables: ‘MonthlyIncome’ (about 19.8% missing) and ‘NumberOfDependents’ (about 2.6% missing). Since neither exceeds 25% of the data, let’s replace the missing values with the median of each variable.

# Replace NA with median values
fill_values = {}
for i in range(2, df.shape[1]):
    fill_values[df.columns[i]] = np.nanmedian(df[df.columns[i]])
df2 = df.fillna(fill_values)
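
To make the 25% check explicit, the share of missing values per column can be computed before imputing. A sketch on a toy frame; the column names are borrowed from the data set, the values are invented, and the 0.25 cut-off simply mirrors the rule of thumb stated above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'MonthlyIncome': [9120.0, np.nan, 3042.0, 3300.0, 5000.0],
    'NumberOfDependents': [2.0, 1.0, np.nan, 0.0, 0.0],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Impute only the columns whose missing share is below the 25% cut-off.
cols_to_fill = missing_share[missing_share < 0.25].index
toy_filled = toy.fillna(toy[cols_to_fill].median())
print(missing_share)
print(toy_filled)
```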

By comparing the quantiles with the max values, we can tell that the data is positively skewed with many large outliers. Let’s replace the outliers with the upper/lower limit of our tolerance, which is set here to +/-4 standard deviations from the mean. The replacement is applied only to variables in which outliers make up less than 1% of the rows.

# Handle outliers
def find_outliers(dt, tol):
    outliers = []
    summ = dt.describe()
    for i in dt: 
        if abs(i-summ['mean'])/summ['std'] > tol : 
            outliers.append(True)
        else:
            outliers.append(False)
    
    return outliers

def replace_outliers(dt, tol):
    new = []
    summ = dt.describe()
    for i in dt: 
        if (i-summ['mean'])/summ['std'] > tol : 
            new.append(round(summ['mean'] + tol*summ['std'], 0))
        elif (i-summ['mean'])/summ['std'] < -tol : 
            new.append(round(summ['mean'] - tol*summ['std'],0))
        else:
            new.append(i)
    return pd.Series(new, name = dt.name)
            
for i in names[2:]: 
    if sum(find_outliers(df2[i], tol = 4)) < 0.01 * df2.shape[0] :
        df2.loc[:,i] = replace_outliers(df2[i], tol = 4).values

summary2 = df2.describe(include = 'all')
print(summary2)
                  ID  SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  \
count  150000.000000     150000.000000                         150000.000000   
mean    75000.500000          0.066840                              1.684505   
std     43301.414527          0.249746                             36.114760   
min         1.000000          0.000000                              0.000000   
25%     37500.750000          0.000000                              0.029867   
50%     75000.500000          0.000000                              0.154181   
75%    112500.250000          0.000000                              0.559046   
max    150000.000000          1.000000                           1005.000000   

                 age  NumberOfTime30-59DaysPastDueNotWorse      DebtRatio  \
count  150000.000000                         150000.000000  150000.000000   
mean       52.295207                              0.275840     332.441926   
std        14.771866                              0.994243    1005.604297   
min         0.000000                              0.000000       0.000000   
25%        41.000000                              0.000000       0.175074   
50%        52.000000                              0.000000       0.366508   
75%        63.000000                              0.000000       0.868254   
max       109.000000                             17.000000    8504.000000   

       MonthlyIncome  NumberOfOpenCreditLinesAndLoans  \
count  150000.000000                    150000.000000   
mean     6274.654940                         8.435847   
std      4732.828449                         5.058473   
min         0.000000                         0.000000   
25%      3903.000000                         5.000000   
50%      5400.000000                         8.000000   
75%      7400.000000                        11.000000   
max     57980.000000                        29.000000   

       NumberOfTimes90DaysLate  NumberRealEstateLoansOrLines  \
count             150000.00000                 150000.000000   
mean                   0.12078                      1.007767   
std                    0.86439                      1.043331   
min                    0.00000                      0.000000   
25%                    0.00000                      0.000000   
50%                    0.00000                      1.000000   
75%                    0.00000                      2.000000   
max                   17.00000                      6.000000   

       NumberOfTime60-89DaysPastDueNotWorse  NumberOfDependents  
count                         150000.000000       150000.000000  
mean                               0.095193            0.734747  
std                                0.788773            1.093439  
min                                0.000000            0.000000  
25%                                0.000000            0.000000  
50%                                0.000000            0.000000  
75%                                0.000000            1.000000  
max                               17.000000            5.000000  
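
The element-wise loops in `replace_outliers` can also be written with vectorized pandas operations. A minimal sketch assuming the same rule (cap values beyond `tol` standard deviations from the mean, with the limits rounded to whole numbers); `cap_outliers` is a hypothetical helper name, not part of the original code:

```python
import pandas as pd

def cap_outliers(series, tol):
    """Clip values lying more than `tol` sample standard deviations from the mean."""
    mean, std = series.mean(), series.std()
    lower = round(mean - tol * std, 0)
    upper = round(mean + tol * std, 0)
    return series.clip(lower=lower, upper=upper)

# Toy example: twenty small values and one extreme value.
s = pd.Series([1.0] * 20 + [1000.0])
capped = cap_outliers(s, tol=4)
print(capped.max())
```

`Series.std()` uses the same sample standard deviation (ddof=1) that `describe()` reports, so the limits should match those of the loop version.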

Partition Data

The data is partitioned to training and test sets with a proportion of 0.8:0.2.

def partition_data(data, x_col, y_col, train_test_prop):
    np.random.seed(2018)
    names = list(data)
    train = np.random.choice([True, False], size =data.shape[0], p = train_test_prop )
    # .as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
    x_train = data[train][[names[i] for i in x_col]].values
    y_train = data[train][names[y_col]].values
    x_test = data[~train][[names[i] for i in x_col]].values
    y_test = data[~train][names[y_col]].values
    return train, x_train, y_train, x_test, y_test

train, x_train, y_train, x_test, y_test = partition_data(df2, x_col = range(2, df2.shape[1]), y_col=1, train_test_prop = [0.8, 0.2])
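
Because each row is assigned independently, the random Boolean mask gives a split that is only approximately 0.8:0.2. An alternative is scikit-learn's `train_test_split`, which returns the requested proportion and can stratify on the target so that both sets keep the same default rate. This is a swapped-in technique, not the method used above, and the toy arrays are invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(2018)
X = rng.rand(1000, 3)                    # toy predictors
y = (rng.rand(1000) < 0.07).astype(int)  # roughly 7% positives

# stratify=y keeps the share of positives nearly equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2018)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```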

Scale Data

Scale the predictors to the range 0 to 1, because ‘MonthlyIncome’ is on a very different scale from the other variables, most of which are counts or ratios. The scaler is fitted on the training set only and then applied to the test set.

from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
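
Min-max scaling maps each column to (x - min) / (max - min), where min and max come from the training set. A small sketch that checks this against `MinMaxScaler` on invented numbers:

```python
import numpy as np
from sklearn import preprocessing

toy_train = np.array([[1000.0, 0.0], [3000.0, 2.0], [5000.0, 4.0]])
toy_test = np.array([[4000.0, 1.0], [6000.0, 5.0]])

toy_scaler = preprocessing.MinMaxScaler().fit(toy_train)

# Manual version using the training minimum and maximum of each column.
col_min, col_max = toy_train.min(axis=0), toy_train.max(axis=0)
manual_test = (toy_test - col_min) / (col_max - col_min)

print(toy_scaler.transform(toy_test))
print(np.allclose(manual_test, toy_scaler.transform(toy_test)))
```

Test values outside the training range land outside [0, 1], as the second toy row does.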

Build Baseline Models

This project looks at 4 types of base classifiers: Logistic Regression, Support Vector Machines, Decision Trees and Gaussian Naive Bayes. We will first establish baseline models with optimal parameters using Grid Search. The parameters are mainly related to regularization to prevent overfitting.

*Note: Gaussian Naive Bayes models do not have any parameters to optimize.

The scoring metric will be ‘roc_auc’, the Area Under the Receiver Operating Characteristic Curve. Higher values are better. Grid Search uses the cross-validation scores to select the best model, but only the test-set scores are used for comparison with the subsequent experiments. Note that the test scores below are computed from the hard 0/1 predictions returned by predict.

from sklearn import metrics
from sklearn import linear_model, tree, svm, naive_bayes
from sklearn import model_selection, calibration

clf_baseline = {}
clf_baseline_val_scores = {}
clf_baseline_test_scores = {}

logit = linear_model.LogisticRegression()
params = {'C': [1, 0.1, 0.01, 0.001]}
gs_logit = model_selection.GridSearchCV(logit, params, scoring = 'roc_auc')
gs_logit.fit(x_train, y_train)
clf_baseline['logit'] = gs_logit.best_estimator_
clf_baseline_val_scores['logit'] = gs_logit.best_score_
y_pred = gs_logit.best_estimator_.predict(x_test)
clf_baseline_test_scores['logit'] = metrics.roc_auc_score(y_test, y_pred)
print('Best params for Logit: \n', gs_logit.best_params_)

svc = svm.LinearSVC()
params = {'C': [1, 0.1, 0.01, 0.001]}
gs_svc = model_selection.GridSearchCV(svc, params, scoring = 'roc_auc')
gs_svc.fit(x_train, y_train)
clf_baseline['svc'] = gs_svc.best_estimator_
clf_baseline_val_scores['svc'] = gs_svc.best_score_
y_pred = gs_svc.best_estimator_.predict(x_test)
clf_baseline_test_scores['svc'] = metrics.roc_auc_score(y_test, y_pred)
print('Best params for SVC: \n', gs_svc.best_params_)

dt = tree.DecisionTreeClassifier()
params = {'criterion': ['gini', 'entropy'],
          'min_samples_leaf': [0.2, 0.1, 0.01, 0.001, 0.0001]}
gs_dt = model_selection.GridSearchCV(dt, params, scoring = 'roc_auc')
gs_dt.fit(x_train, y_train)
clf_baseline['dt'] = gs_dt.best_estimator_
clf_baseline_val_scores['dt'] = gs_dt.best_score_
y_pred = gs_dt.best_estimator_.predict(x_test)
clf_baseline_test_scores['dt'] = metrics.roc_auc_score(y_test, y_pred)
print('Best params for DT: \n', gs_dt.best_params_)

gnb = naive_bayes.GaussianNB()
gnb.fit(x_train, y_train)
y_pred = gnb.predict(x_test)
clf_baseline ['gnb'] = gnb
clf_baseline_test_scores['gnb'] = metrics.roc_auc_score(y_test, y_pred)
Best params for Logit: 
 {'C': 1}
Best params for SVC: 
 {'C': 0.01}
Best params for DT: 
 {'criterion': 'gini', 'min_samples_leaf': 0.01}
print('AUC score on test data for each baseline classifiers:\n')
for k,v in clf_baseline_test_scores.items():
    print(k, ':',v)
AUC score on test data for each baseline classifiers:

logit : 0.53024537827
svc : 0.507953246205
dt : 0.574616673409
gnb : 0.606284717947
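
ROC AUC is usually computed from continuous scores such as `predict_proba` or `decision_function`. When it is computed from the 0/1 labels of `predict`, as above, it reduces to the average of sensitivity and specificity (balanced accuracy). A hedged sketch on synthetic data generated with `make_classification` (not the Kaggle set) showing both calculations side by side:

```python
from sklearn import linear_model, metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 7% positives by construction).
X_syn, y_syn = make_classification(n_samples=5000, n_features=10,
                                   weights=[0.93], random_state=2018)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=2018)

toy_logit = linear_model.LogisticRegression(max_iter=1000).fit(Xs_tr, ys_tr)

# AUC from hard labels: identical to balanced accuracy.
hard_labels = toy_logit.predict(Xs_te)
auc_labels = metrics.roc_auc_score(ys_te, hard_labels)
bal_acc = metrics.balanced_accuracy_score(ys_te, hard_labels)

# AUC from the predicted probability of the positive class.
auc_proba = metrics.roc_auc_score(ys_te, toy_logit.predict_proba(Xs_te)[:, 1])

print(auc_labels, bal_acc, auc_proba)
```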

Experiment to Find Best Strategy for Data with Unbalanced Target Distribution

The experiment employs 3 main strategies to tackle data with unbalanced target distribution.

  1. Undersample the majority data (i.e. target = 0)

  2. Adjust the class weights during model fitting so that errors on the minority data (i.e. target = 1) are penalized more heavily in the cost function

  3. Change the decision threshold to label a prediction as 1 or 0 based on its predicted probability. The default value is 0.5.

Below are the 3 functions to implement the strategies.

## Function to undersample data
def undersample(x_train, y_train, true_false_prop):   
    np.random.seed(2018)
    y_true = y_train[y_train == 1]
    y_false = y_train[y_train == 0]
    x_true = x_train[y_train == 1]
    x_false = x_train[y_train == 0]
    
    sampling_pct = true_false_prop[1]/true_false_prop[0]*y_true.shape[0]/y_false.shape[0]
    sample = np.random.choice([True, False], size = y_false.shape[0], p = [sampling_pct, 1-sampling_pct])
    
    y_false2 = y_false[sample]
    x_false2 = x_false[sample]
    
    x = np.concatenate([x_true, x_false2])
    y = np.concatenate([y_true, y_false2])
    data = np.column_stack((x,y))
    np.random.shuffle(data)
    
    x_train2 = data[:,:-1]
    y_train2 = data[:,-1]
    return x_train2, y_train2
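
For reference, `undersample` keeps each majority row with probability `sampling_pct`, so with `true_false_prop = [1, 1]` the two classes come out approximately, not exactly, equal in size. Below is a sketch of a variant that draws an exact number of majority rows without replacement. `undersample_exact` is a hypothetical name, and this is an alternative rather than the function used in the experiments:

```python
import numpy as np

def undersample_exact(x, y, seed=2018):
    """Keep all positives and an equally sized random subset of negatives."""
    rng = np.random.RandomState(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    # Shuffle so the positives are not all at the front.
    idx = rng.permutation(np.concatenate([pos_idx, keep_neg]))
    return x[idx], y[idx]

# Toy data: 10 positives among 100 rows.
x_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)
x_bal, y_bal = undersample_exact(x_toy, y_toy)
print(x_bal.shape, y_bal.mean())
```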

## Function to set class_weights
def add_class_weights (clf, priors):
    # weights must be a dictionary with the keys representing each class. 
    # if clf is a GaussianNB classifier, the keys must be sorted.    
    
    if isinstance(clf, naive_bayes.GaussianNB):
        clf.set_params(priors = priors)
    else: 
        if priors is None:
            clf.set_params(class_weight = None)
        else: 
            weights = {}
            for i in range(2):
                weights[i] = round(1/priors[i])
            clf.set_params(class_weight = weights)

    return clf
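
With the priors taken from the summary table (about 0.933 for class 0 and 0.067 for class 1), the inverse-prior rule in `add_class_weights` works out as follows:

```python
# Class priors from the summary table: P(no default), P(default).
example_priors = [1 - 0.066840, 0.066840]

# Same rule as add_class_weights: weight = round(1 / prior).
example_weights = {i: round(1 / example_priors[i]) for i in range(2)}
print(example_weights)  # {0: 1, 1: 15}
```

Each default is therefore weighted about 15 times as heavily as a non-default. scikit-learn's `class_weight='balanced'` option applies a similar inverse-frequency idea.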

## Function to find best decision threshold
def find_best_threshold(clf, x_test, y_test, score): 
    
    scoring = {'Accuracy': metrics.accuracy_score,
               'Precision': metrics.precision_score,
               'Recall': metrics.recall_score,
               'F1': metrics.f1_score,
               'AUC': metrics.roc_auc_score}    
    
    y_proba = clf.predict_proba(x_test)
    
    scores = {}
    preds = {}
    for i in np.arange(0.1,1,0.1):
        y_pred = np.zeros(shape = (y_proba.shape[0],1))
        for j in range(0, y_proba.shape[0]):
            if y_proba[j,1]>i:
                y_pred[j] = 1
            else:
                y_pred[j] = 0
                
        scores[i] = scoring[score](y_test,y_pred)
        preds[i] = y_pred
    
    best_threshold = max(scores, key=scores.get)
    best_score = max(scores.values())
    best_pred = preds[best_threshold]
    
    return best_threshold, best_score, best_pred
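
The inner loop over rows can be replaced by a single vectorized comparison. A minimal sketch on invented probabilities, where `toy_proba` stands in for the second column of `predict_proba` and F1 is used as the example score:

```python
import numpy as np
from sklearn import metrics

toy_y = np.array([0, 0, 0, 0, 1, 1])
toy_proba = np.array([0.05, 0.15, 0.35, 0.55, 0.45, 0.95])

toy_scores = {}
for t in np.arange(0.1, 1, 0.1):
    # Label as 1 whenever the predicted probability exceeds the threshold.
    toy_pred = (toy_proba > t).astype(int)
    toy_scores[round(t, 1)] = metrics.f1_score(toy_y, toy_pred)

best_t = max(toy_scores, key=toy_scores.get)
print(best_t, toy_scores[best_t])
```

Rounding the dictionary keys avoids thresholds such as 0.30000000000000004 that `np.arange` produces with floating-point steps.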

The experiments apply every non-empty combination of the 3 strategies (7 combinations) to each of the 4 base models established earlier, giving 28 experiments in total.

## Setup experiment 
x_train2, y_train2 = undersample(x_train, y_train, [1,1])

doe = [[1,0,0],
       [0,1,0],
       [0,0,1],
       [1,1,0],
       [1,0,1],
       [0,1,1],
       [1,1,1]]
clf = pd.DataFrame(doe, columns = ['Undersample', 'Weights', 'Threshold'])
# Repeat the 7-row design once per classifier type (DataFrame.append was removed in pandas 2.0)
clf = pd.concat([clf] * 4, ignore_index = True)
clf_types = ['logit', 'svc', 'dt','gnb']
classifier = []
for i in range(4):
    count = 0
    while count < 7:
        classifier.append(clf_types[i])
        count+=1
clf['Classifier'] = classifier
clf['ID'] = [str(clf['Undersample'][i])+str(clf['Weights'][i])+str(clf['Threshold'][i])+str(clf['Classifier'][i]) for i in range(clf.shape[0])]
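
The same 28-row design can also be generated programmatically. A sketch using `itertools.product`; the row order differs from the hand-written `doe` list, and the variable names are illustrative:

```python
from itertools import product
import pandas as pd

strategies = ['Undersample', 'Weights', 'Threshold']
clf_names = ['logit', 'svc', 'dt', 'gnb']

# All on/off combinations of the 3 strategies except "all off": 2**3 - 1 = 7.
combos = [c for c in product([0, 1], repeat=3) if any(c)]

# One row per (classifier, combination) pair.
rows = [list(c) + [name] for name in clf_names for c in combos]
design = pd.DataFrame(rows, columns=strategies + ['Classifier'])
design['ID'] = [''.join(map(str, r[:3])) + r[3] for r in rows]
print(design.shape)
```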
## Run experiment
models = {}
scores = {}
y_preds = {}
for i in range(clf.shape[0]):    
    
    model = clf_baseline[clf['Classifier'][i]]
          
    if clf['Weights'][i] == 1:
        model = add_class_weights(model,priors)
    else: 
        model = add_class_weights(model,None)
    
    if clf['Undersample'][i] == 1:
        x_tr = x_train2
        y_tr = y_train2
    else:
        x_tr = x_train
        y_tr = y_train
           
    model.fit(x_tr, y_tr)
    models[clf['ID'][i]] = deepcopy(model)
        
    if clf['Threshold'][i] == 1:
        if isinstance(model, svm.LinearSVC):
            model = calibration.CalibratedClassifierCV(model, cv = 'prefit')
            model.fit(x_tr, y_tr)    
        threshold, auc, y_pred = find_best_threshold(model, x_test, y_test, 'AUC')   
    else: 
        threshold = 0.5
        y_pred = model.predict(x_test)
        auc = metrics.roc_auc_score(y_test, y_pred)
        
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    
    scores[clf['ID'][i]] = [threshold, accuracy, precision, recall, f1, auc]
    y_preds[clf['ID'][i]] = y_pred

scores = pd.DataFrame([[key]+value for key,value in scores.items()], columns = ['ID','Threshold','Accuracy', 'Precision', 'Recall', 'F1', 'AUC'])    

Let’s look at the scores.

print(scores)
          ID  Threshold  Accuracy  Precision    Recall        F1       AUC
0   100logit        0.5  0.822416   0.218572  0.618093  0.322943  0.727769
1   010logit        0.5  0.827810   0.227833  0.633252  0.335102  0.737687
2   001logit        0.1  0.886916   0.292058  0.456724  0.356285  0.687642
3   110logit        0.5  0.069191   0.068567  1.000000  0.128334  0.500360
4   101logit        0.5  0.822416   0.218572  0.618093  0.322943  0.727769
5   011logit        0.5  0.827810   0.227833  0.633252  0.335102  0.737687
6   111logit        0.9  0.645870   0.137400  0.789731  0.234075  0.712509
7     100svc        0.5  0.736271   0.155185  0.641076  0.249881  0.692175
8     010svc        0.5  0.753225   0.165661  0.644499  0.263574  0.702861
9     001svc        0.1  0.892980   0.305978  0.443032  0.361966  0.684555
10    110svc        0.5  0.068521   0.068521  1.000000  0.128253  0.500000
11    101svc        0.6  0.833976   0.213018  0.528117  0.303584  0.692296
12    011svc        0.1  0.884168   0.288115  0.469438  0.357076  0.692057
13    111svc        0.5  0.739889   0.161256  0.665526  0.259609  0.705443
14     100dt        0.5  0.794538   0.212790  0.740342  0.330568  0.769434
15     010dt        0.5  0.756475   0.183646  0.741320  0.294369  0.749455
16     001dt        0.1  0.846172   0.258444  0.666015  0.372386  0.762720
17     110dt        0.5  0.188105   0.076958  0.986797  0.142781  0.558075
18     101dt        0.5  0.794538   0.212790  0.740342  0.330568  0.769434
19     011dt        0.5  0.756475   0.183646  0.741320  0.294369  0.749455
20     111dt        0.9  0.683297   0.157496  0.832763  0.264893  0.752532
21    100gnb        0.5  0.923438   0.427273  0.344743  0.381597  0.655375
22    010gnb        0.5  0.930307   0.482125  0.230807  0.312169  0.606285
23    001gnb        0.1  0.925884   0.441404  0.307579  0.362536  0.639473
24    110gnb        0.5  0.926386   0.440625  0.275795  0.339248  0.625020
25    101gnb        0.1  0.896298   0.326044  0.481174  0.388702  0.704004
26    011gnb        0.1  0.925850   0.441094  0.307579  0.362432  0.639455
27    111gnb        0.1  0.924007   0.429026  0.329584  0.372788  0.648659

The best performing models for each classifier type are:

  1. 010logit: 0.737687 (adjust class weights)
  2. 111svc: 0.705443 (employ all 3 strategies)
  3. 100dt: 0.769434 (undersample)
  4. 101gnb: 0.704004 (undersample and adjust decision threshold)

Each of these models has a higher test AUC than the baseline classifier of the same type. Let’s look at the confusion matrix and ROC curve for each of them.

Confusion matrix legend:

[[TN FP]
 [FN TP]]

for i in [1,13,14,25]: 
    print('\nConfusion Matrix for ', clf['ID'][i], ': \n',metrics.confusion_matrix(y_test,y_preds[clf['ID'][i]]))
    print('\nROC Curve for ', clf['ID'][i], ': ')
    roc = metrics.roc_curve(y_test, y_preds[clf['ID'][i]])
    plt.plot(roc[0], roc[1])
    plt.show()
Confusion Matrix for  010logit : 
 [[23411  4389]
 [  750  1295]]

ROC Curve for  010logit : 
[Figure: ROC curve for 010logit]
Confusion Matrix for  111svc : 
 [[20721  7079]
 [  684  1361]]

ROC Curve for  111svc : 
[Figure: ROC curve for 111svc]
Confusion Matrix for  100dt : 
 [[22199  5601]
 [  531  1514]]

ROC Curve for  100dt : 
[Figure: ROC curve for 100dt]
Confusion Matrix for  101gnb : 
 [[25766  2034]
 [ 1061   984]]

ROC Curve for  101gnb : 
[Figure: ROC curve for 101gnb]
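
The values in the scores table follow directly from these confusion matrices. As a check, the counts for 010logit reproduce its row in the table. Because the AUC there was computed from hard labels, it equals the mean of recall and specificity:

```python
# Confusion-matrix counts for 010logit, copied from the output above.
tn, fp, fn, tp = 23411, 4389, 750, 1295

acc = (tp + tn) / (tp + tn + fp + fn)
prec = tp / (tp + fp)
rec = tp / (tp + fn)               # true positive rate
spec = tn / (tn + fp)              # true negative rate
f1_check = 2 * prec * rec / (prec + rec)
auc_from_labels = (rec + spec) / 2

print(round(acc, 6), round(prec, 6), round(rec, 6),
      round(f1_check, 6), round(auc_from_labels, 6))
```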

Interpreting Results

best_logit = models[clf['ID'][1]] 
predictors = [names[i]for i in range(2, df2.shape[1])]
coef = best_logit.coef_.tolist()[0]
coefficients = {}
for i in range(len(predictors)):
    coefficients[predictors[i]] = coef[i]
coefficients['intercept'] = best_logit.intercept_[0]

coefficients = sorted(coefficients.items(), key=operator.itemgetter(1), reverse = True)
print('Coefficients of the Logistic Regression Model:\n')
for i in coefficients:
    print(i[0], ':',i[1])
print('\n')
Coefficients of the Logistic Regression Model:

NumberOfTimes90DaysLate : 17.519201925628142
NumberOfTime60-89DaysPastDueNotWorse : 12.525353809926791
NumberOfTime30-59DaysPastDueNotWorse : 11.864752640143452
intercept : 0.814570230737
NumberRealEstateLoansOrLines : 0.4667846125508872
NumberOfOpenCreditLinesAndLoans : 0.4166401325673546
NumberOfDependents : 0.2968073691781466
RevolvingUtilizationOfUnsecuredLines : 0.24146396276828586
DebtRatio : -0.4334507591257001
MonthlyIncome : -1.9508999080225464
age : -3.2134783517064656
best_svc = models[clf['ID'][13]] 
coef = best_svc.coef_.tolist()[0]
coefficients = {}
for i in range(len(predictors)):
    coefficients[predictors[i]] = coef[i]
coefficients['intercept'] = best_svc.intercept_[0]

coefficients = sorted(coefficients.items(), key=operator.itemgetter(1), reverse = True)
print('Coefficients of the Decision Function of the SVC Model:\n')
for i in coefficients:
    print(i[0], ':',i[1])
print('\n')
Coefficients of the Decision Function of the SVC Model:

NumberOfTimes90DaysLate : 1.2447831962420133
NumberOfTime30-59DaysPastDueNotWorse : 1.1962223881264535
intercept : 1.01244606939
NumberOfTime60-89DaysPastDueNotWorse : 0.9130887765466893
NumberOfOpenCreditLinesAndLoans : 0.0908857580138883
NumberRealEstateLoansOrLines : 0.07534089550199755
RevolvingUtilizationOfUnsecuredLines : 0.06392228822380676
NumberOfDependents : 0.05506412635964687
DebtRatio : -0.06369476341141772
MonthlyIncome : -0.2827283563800224
age : -0.5774558927085341
best_dt = models[clf['ID'][14]]

feature_imp = best_dt.feature_importances_
importance = {}
for i in range(len(predictors)):
    importance[predictors[i]] = feature_imp[i]

importance = sorted(importance.items(), key=operator.itemgetter(1), reverse = True)
print('Feature Importances of the Decision Tree Model:\n')
for i in importance:
    print(i[0], ':', i[1])
print('\n')

from io import StringIO  # sklearn.externals.six was removed from scikit-learn
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(best_dt, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
print('Decision Tree Diagram:')
print('On ipynb, you may zoom in and scroll to observe. \n')
print('Legend for the variables')
for i in range(len(predictors)):
    print('X',i, ':', predictors[i])

Image(graph.create_png())
Feature Importances of the Decision Tree Model:

RevolvingUtilizationOfUnsecuredLines : 0.566193683618
NumberOfTime30-59DaysPastDueNotWorse : 0.18344763672
NumberOfTimes90DaysLate : 0.157824865225
age : 0.0291063244215
NumberOfTime60-89DaysPastDueNotWorse : 0.0225662590286
DebtRatio : 0.0147537610774
NumberOfOpenCreditLinesAndLoans : 0.0128703380843
NumberRealEstateLoansOrLines : 0.00850750339156
MonthlyIncome : 0.00472962843404
NumberOfDependents : 0.0


Decision Tree Diagram:
On ipynb, you may zoom in and scroll to observe. 

Legend for the variables
X 0 : RevolvingUtilizationOfUnsecuredLines
X 1 : age
X 2 : NumberOfTime30-59DaysPastDueNotWorse
X 3 : DebtRatio
X 4 : MonthlyIncome
X 5 : NumberOfOpenCreditLinesAndLoans
X 6 : NumberOfTimes90DaysLate
X 7 : NumberRealEstateLoansOrLines
X 8 : NumberOfTime60-89DaysPastDueNotWorse
X 9 : NumberOfDependents
[Figure: decision tree diagram]
best_gnb = models[clf['ID'][25]]
mean_0 = best_gnb.theta_[0]
mean_1 = best_gnb.theta_[1]
means = pd.DataFrame([mean_0, mean_1],columns = predictors).T
# sigma_ holds the per-class variance of each feature, not the standard deviation
# (the attribute is named var_ in scikit-learn 1.0 and later)
var_0 = best_gnb.sigma_[0]
var_1 = best_gnb.sigma_[1]
variances = pd.DataFrame([var_0, var_1],columns = predictors).T

print('Mean of the Gaussian Distributions of the Predictor Vectors for Target 0 and 1: \n')
print(means)
print('\n')
print('Variance of the Gaussian Distributions of the Predictor Vectors for Target 0 and 1:\n')
print(variances)
Mean of the Gaussian Distributions of the Predictor Vectors for Target 0 and 1: 

                                             0         1
RevolvingUtilizationOfUnsecuredLines  0.001169  0.001754
age                                   0.494791  0.429353
NumberOfTime30-59DaysPastDueNotWorse  0.011991  0.071582
DebtRatio                             0.039674  0.032911
MonthlyIncome                         0.109199  0.094613
NumberOfOpenCreditLinesAndLoans       0.289161  0.271671
NumberOfTimes90DaysLate               0.003592  0.054173
NumberRealEstateLoansOrLines          0.165366  0.158731
NumberOfTime60-89DaysPastDueNotWorse  0.002859  0.038555
NumberOfDependents                    0.142231  0.186543


Variance of the Gaussian Distributions of the Predictor Vectors for Target 0 and 1:

                                             0         1
RevolvingUtilizationOfUnsecuredLines  0.000812  0.001008
age                                   0.019125  0.014640
NumberOfTime30-59DaysPastDueNotWorse  0.001688  0.019655
DebtRatio                             0.014232  0.013463
MonthlyIncome                         0.006914  0.005012
NumberOfOpenCreditLinesAndLoans       0.029199  0.036857
NumberOfTimes90DaysLate               0.000976  0.019469
NumberRealEstateLoansOrLines          0.029189  0.040513
NumberOfTime60-89DaysPastDueNotWorse  0.000732  0.016572
NumberOfDependents                    0.044963  0.057481

Ensemble Models

Now let’s try some ensemble models using decision trees as the base classifier, since the decision tree had the best AUC of the four classifier types.

base = deepcopy(best_dt)

from sklearn import ensemble
bag = ensemble.BaggingClassifier(base, n_estimators = 50)
rf = ensemble.RandomForestClassifier(min_samples_leaf=0.01, n_estimators = 50) 
ada = ensemble.AdaBoostClassifier(base, n_estimators = 50)

ensemble_clf = {'bag': bag,'rf': rf, 'ada': ada}

ensemble_scores = {}

for k, v in ensemble_clf.items():
    v.fit(x_train2, y_train2)
    y_proba = v.predict_proba(x_test)
    threshold, auc, y_pred = find_best_threshold(v, x_test, y_test,'AUC')
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    
    ensemble_scores [k] = [threshold, accuracy, precision, recall, f1, auc]
    print('\nConfusion Matrix for ', k, ': \n',metrics.confusion_matrix(y_test,y_pred))
    print('\nROC Curve for ', k, ': ')
    roc = metrics.roc_curve(y_test, y_pred)
    plt.plot(roc[0], roc[1])
    plt.show()
    
    
ensemble_scores = pd.DataFrame([[key]+value for key,value in ensemble_scores.items()], columns = ['ID','Threshold','Accuracy', 'Precision', 'Recall', 'F1', 'AUC'])    

print(ensemble_scores)
Confusion Matrix for  bag : 
 [[21864  5936]
 [  512  1533]]

ROC Curve for  bag : 
[Figure: ROC curve for bag]
Confusion Matrix for  rf : 
 [[21434  6366]
 [  447  1598]]

ROC Curve for  rf : 
[Figure: ROC curve for rf]
Confusion Matrix for  ada : 
 [[20823  6977]
 [  553  1492]]

ROC Curve for  ada : 
[Figure: ROC curve for ada]
    ID  Threshold  Accuracy  Precision    Recall        F1       AUC
0  bag        0.5  0.783950   0.205248  0.749633  0.322262  0.768054
1   rf        0.5  0.771721   0.200653  0.781418  0.319313  0.776213
2  ada        0.5  0.747696   0.176172  0.729584  0.283812  0.739307

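Only the random forest (AUC 0.776) scored higher than the best single decision tree, 100dt (AUC 0.769); bagging (0.768) was roughly level with it and AdaBoost (0.739) was lower. The best model thus far is the random forest classifier, with an AUC of about 0.78.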