Index

1. Question:
Can we predict the probability of default and derive a default-risk cutoff to provide decision support for profit-to-risk control?

2. Input Data:
The data is provided as a zip file, Loan_Data.zip.

The data set contains 45211 observations with 8 variables:

  1. age (numeric)
  2. job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”,“blue-collar”,“self-employed”,“retired”,“technician”,“services”)
  3. marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
  4. education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
  5. default: has credit in default? (binary: “yes”,“no”)
  6. balance: average yearly balance, in euros (numeric)
  7. housing: has housing loan? (binary: “yes”,“no”)
  8. loan: has personal loan? (binary: “yes”,“no”)

3. Features:
Handle NA values; check frequency ratios for zero variance; center, scale, and Box-Cox transform the skewed numeric predictors; create dummy variables for the factor predictors; remove redundant, highly correlated variables; remove near-zero-variance predictors to form a reduced set, while also keeping a full set that includes all the variables.

4. Performance Analysis:
Build the predictive models using all the observations, because the default outcome (risk vs. non-risk) is extremely imbalanced. Measure model performance with 10-fold cross-validation.

5. Build the models:
* First model: a boosted tree model is very flexible but less interpretable. This type of model tends to have a high likelihood of producing the best results.
* Second model: logistic regression uses a simpler technique to determine the classification boundary. It has no tuning parameters, and its prediction equation is simple and easy to implement.

6. Select the Model:
Compare model performance by AUC (area under the ROC curve), processing time, and ease of interpretation. The logistic regression model (with the reduced set) was chosen as the final model.

7. Calculate the cut-off rate (for default risk):
Use the weighted "closest.topleft" best-threshold selection strategy on the ROC curve; the weight is obtained by calculating the probability cost function.

8. Next:
I am planning to add KNN and random forest models to the analysis. However, the logistic regression results are very close to those of the boosted trees, which indicates that the current model reasonably approximates the performance of a more complex method. A sketch of how these models could be added is shown below.
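A minimal sketch of how these planned models could be fit with the same caret workflow used later in this report (hypothetical: ctrl, Loan_data4, and the ROC metric are defined in the sections below; rf_fit and knn_fit are illustrative names, and method = "rf" additionally requires the randomForest package):

# Planned extensions (sketch, not run in this report): reuse the 10-fold CV control
set.seed(512)
rf_fit  <- train(default~., data = Loan_data4, method = "rf",
                 metric = "ROC", trControl = ctrl)
set.seed(512)
knn_fit <- train(default~., data = Loan_data4, method = "knn",
                 metric = "ROC", trControl = ctrl)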

Input Data

The data file is attached in the folder.

Loan_data <- read.csv("Loan_data.csv", header = TRUE, stringsAsFactors = FALSE)

Features

library(caret); library(ggplot2); library(lattice)
## Loading required package: lattice
## Loading required package: ggplot2
# Count missing values in each column
Find_NA <- apply(Loan_data, 2, function(x) sum(is.na(x))); Find_NA
##       age       job   marital education   default   balance   housing 
##         0         0         0         0         0         0         0 
##      loan 
##         0

Zero NA values!

ZC <- nearZeroVar(Loan_data, saveMetrics = T); ZC
##           freqRatio percentUnique zeroVar   nzv
## age        1.044589   0.170312535   FALSE FALSE
## job        1.028970   0.026542213   FALSE FALSE
## marital    2.127756   0.006635553   FALSE FALSE
## education  1.744380   0.008847404   FALSE FALSE
## default   54.473620   0.004423702   FALSE  TRUE
## balance   18.020513  15.854548672   FALSE FALSE
## housing    1.251432   0.004423702   FALSE FALSE
## loan       5.241165   0.004423702   FALSE FALSE

No variable has zero variance; only the outcome, default, is flagged as near-zero variance (freqRatio ≈ 54) because of its extreme class imbalance.

Plotting the density of the numeric predictors
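The density plots themselves are not reproduced here; a minimal sketch of how they can be drawn with ggplot2 (loaded above) for the two numeric predictors:

# Density of each numeric predictor (both are treated as skewed below)
ggplot(Loan_data, aes(x = age)) + geom_density()
ggplot(Loan_data, aes(x = balance)) + geom_density()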

Transforming skewed predictors

BoxCox: transform the skewed numeric variables toward a more normal distribution; combined with centering and scaling, this can improve the predictions of numerical models.

The marital variable (column 3) is skipped ("divorced" observations may be divorced or widowed).

Loan_data1 <- Loan_data
# Estimate the Box-Cox, centering and scaling transformations (marital column excluded)
Center_data <- preProcess(Loan_data1[,-3], method = c("BoxCox", "center", "scale"))
train_data <- predict(Center_data, Loan_data1[,-3]) 
# Replace only the numeric predictors with their transformed values
Loan_data2 <- Loan_data
Loan_data2$age <- train_data$age
Loan_data2$balance <- train_data$balance
head(Loan_data2)
##          age          job marital education default     balance housing
## 1  1.4966373   management married  tertiary      no  0.25641642     yes
## 2  0.4114911   technician  single secondary      no -0.43788985     yes
## 3 -0.7185480 entrepreneur married secondary      no -0.44675753     yes
## 4  0.6705795  blue-collar married   unknown      no  0.04720492     yes
## 5 -0.7185480      unknown  single   unknown      no -0.44708596      no
## 6 -0.4874176   management married  tertiary      no -0.37154649     yes
##   loan
## 1   no
## 2   no
## 3  yes
## 4   no
## 5   no
## 6   no

Creating dummy variables

Create dummy variables for the factor predictors so they can be used by the models; this yields 45211 observations with 26 variables.

# Expand the factor predictors into dummy (indicator) columns
dummies <- dummyVars(default~., data = Loan_data2)
Loan_data3 <- predict(dummies, newdata=Loan_data2)
Loan_data3 <- data.frame(Loan_data3)
# Re-attach the outcome, which dummyVars drops
Loan_data3$default <- Loan_data2$default

Remove near-zero-variance variables

Many of the binary dummy predictors are very sparse and unbalanced, so they are flagged as near-zero variance and can cause computational issues in some of the models applied. They are removed to form the reduced set, while the full set keeps all of them.

NZC <- nearZeroVar(Loan_data3, saveMetrics = T)
# Drop the columns flagged as near-zero variance to form the reduced set
nzv_T <- which(NZC$nzv == T)
Loan_data4 <- Loan_data3[,-(nzv_T)]
# Re-attach the outcome (default itself is flagged as nzv and would otherwise be dropped)
Loan_data4$default <- Loan_data3$default

Collinearity

Identify and eliminate the highly redundant (collinear) predictors from both data sets.

All predictors

library(subselect)
# Covariance matrix of all predictors (column 26 is the outcome, default)
fullCovMat3 <- cov(Loan_data3[,-26])
# trim.matrix flags the variables responsible for linear dependencies / ill-conditioning
fullResults3 <- trim.matrix(fullCovMat3)
discardName3 <- fullResults3$names.discarded
discardName3
## [1] "jobunknown"       "maritaldivorced"  "educationunknown"
## [4] "housingyes"       "loanyes"

Reduced Predictors

reducedCovMat4 <- cov(Loan_data4[,-19])
reducedResults4 <- trim.matrix(reducedCovMat4)
discardName4 <- reducedResults4$names.discarded
discardName4
## [1] "maritaldivorced" "housingyes"      "loanno"

Eliminate the highly correlated variables from both data sets

Loan_data3 <- Loan_data3[,-(fullResults3$numbers.discarded)]
Loan_data4 <- Loan_data4[,-(reducedResults4$numbers.discarded)]

Model performance assessment
The bar plot shows the unbalanced number of observations: risk vs. non-risk.

The frequency of "no" is 0.982 and of "yes" is 0.018.
Normally we would create separate training and test sets; however, the ability to predict "default" would be compromised by the scarcity of "yes" observations.

Therefore, we will use all the observations to build the predictive models and measure their performance with cross-validation resampling. We will then compare the models through AUC (area under the ROC curve) and processing time to choose the final model.
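The bar plot itself is not reproduced here; a minimal sketch of how the class imbalance can be checked (assuming Loan_data is loaded as above):

# Class frequencies of the outcome (roughly 0.982 "no" vs 0.018 "yes")
round(prop.table(table(Loan_data$default)), 3)
# Bar plot of the imbalance
barplot(table(Loan_data$default), main = "default: no vs yes")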

Parallel Processing

library(doSNOW)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: snow
# Register a 2-worker SOCK cluster so caret trains the CV folds in parallel;
# the handle can be released later with stopCluster(cl)
cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)

Build Predictive Models

Train Control parameters: We will use 10-fold cross validation to evaluate the models.

ctrl <- trainControl(method="cv",                       # 10-fold cross-validation
                     summaryFunction = twoClassSummary, # report ROC, Sens and Spec
                     classProbs=TRUE,                   # class probabilities needed for ROC
                     savePredictions =TRUE)             # keep hold-out predictions for later ROC analysis
library(pROC); library(gbm); library(caret); library(plyr)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## 
## Loading required package: survival
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: splines
## Loading required package: parallel
## 
## Attaching package: 'parallel'
## 
## The following objects are masked from 'package:snow':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, clusterSplit, makeCluster,
##     parApply, parCapply, parLapply, parRapply, parSapply,
##     splitIndices, stopCluster
## 
## Loaded gbm 2.1.1

Model 1: BOOSTED TREE MODEL with all

set.seed(512)
t1 <- Sys.time()
gbm_fit <- train(default~.,
                 data = Loan_data,
                 method = "gbm",
                 metric = "ROC",
                 trControl = ctrl,
                 verbose=FALSE)
t2 <- Sys.time()
tgbm_fit<- difftime(t2,t1)
gbm_fit
## Stochastic Gradient Boosting 
## 
## 45211 samples
##     7 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 40690, 40690, 40690, 40690, 40690, 40691, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens       Spec       
##   1                   50      0.8654989  1.0000000  0.000000000
##   1                  100      0.8667891  0.9999550  0.000000000
##   1                  150      0.8676082  0.9998649  0.000000000
##   2                   50      0.8677824  0.9998649  0.006142728
##   2                  100      0.8682300  0.9997748  0.008611864
##   2                  150      0.8717248  0.9997297  0.012300512
##   3                   50      0.8690963  0.9997072  0.007377296
##   3                  100      0.8713009  0.9995720  0.012285456
##   3                  150      0.8706561  0.9993693  0.017208672
##   ROC SD      Sens SD       Spec SD   
##   0.01911941  0.000000e+00  0.00000000
##   0.01973955  9.496329e-05  0.00000000
##   0.01981358  1.163145e-04  0.00000000
##   0.01979677  1.899293e-04  0.01043302
##   0.02163155  1.839232e-04  0.01303762
##   0.02148203  2.070094e-04  0.01537125
##   0.01928153  3.685987e-04  0.01035553
##   0.02002015  3.432463e-04  0.01295047
##   0.02077119  4.220755e-04  0.01446324
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

Model 2: LOGISTIC REGRESSION with all

set.seed(512)
t3 <- Sys.time()
glm_fit <- train(default~., 
                 data = Loan_data,
                 method = "glm",
                 metric="ROC",
                 trControl = ctrl)
t4 <- Sys.time()
tglm_fit <- difftime(t4,t3); tglm_fit
## Time difference of 8.772502 secs
glm_fit
## Generalized Linear Model 
## 
## 45211 samples
##     7 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 40690, 40690, 40690, 40690, 40690, 40691, ... 
## Resampling results
## 
##   ROC        Sens       Spec        ROC SD     Sens SD       Spec SD   
##   0.8612404  0.9993243  0.02451069  0.0221412  0.0003357384  0.02572601
## 
## 

Model 3: LOGISTIC REGRESSION with full-set

set.seed(512)
t5 <- Sys.time()
glm_fit3 <- train(default~.,
                 data = Loan_data3,
                 method = "glm",
                 metric="ROC",
                 trControl = ctrl)
t6 <- Sys.time()
tglm_fit3 <- difftime(t6,t5); tglm_fit3
## Time difference of 8.147466 secs
glm_fit3
## Generalized Linear Model 
## 
## 45211 samples
##    20 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 40690, 40690, 40690, 40690, 40690, 40691, ... 
## Resampling results
## 
##   ROC        Sens       Spec        ROC SD      Sens SD       Spec SD   
##   0.8611418  0.9993243  0.02451069  0.02193808  0.0003357384  0.02572601
## 
## 

Model 4: LOGISTIC REGRESSION with reduced-set

set.seed(512)
t7 <- Sys.time()
glm_fit4 <- train(default~.,
                  data = Loan_data4,
                  method = "glm",
                  metric="ROC",
                  trControl = ctrl)
t8 <- Sys.time()
tglm_fit4 <- difftime(t8,t7); tglm_fit4
## Time difference of 6.593378 secs
glm_fit4
## Generalized Linear Model 
## 
## 45211 samples
##    15 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 40690, 40690, 40690, 40690, 40690, 40691, ... 
## Resampling results
## 
##   ROC        Sens       Spec        ROC SD      Sens SD       Spec SD   
##   0.8603669  0.9993243  0.02449563  0.02429763  0.0003357384  0.02817112
## 
## 

Measure and select the model

Model performance was compared by AUC (area under the ROC curve) and processing time (cross-validation included).
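The comparison code that produced the timings and AUC values below is not shown; a hedged sketch of one way to collect the cross-validated performance of the four models with caret (resamps is an illustrative name):

# Compare the resampled ROC/Sens/Spec of the four candidate models
resamps <- resamples(list(GBM = gbm_fit, GLM_all = glm_fit,
                          GLM_full = glm_fit3, GLM_reduced = glm_fit4))
summary(resamps)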

Select the model:

Boosted Tree Model

## Time difference of 1.642677 mins
## Area under the curve: 0.8655

Logistic Regression with all

## Time difference of 8.772502 secs
## Area under the curve: 0.8607

Logistic Regression with full-set

## Time difference of 8.147466 secs
## Area under the curve: 0.8606

Logistic Regression with reduced set

## Time difference of 6.593378 secs
## Area under the curve: 0.8599
  • The ROC plot shows that the boosted tree model (gbm) provides the largest AUC, but only slightly higher than the logistic regression models (glm).

  • The logistic regression models show no meaningful difference in AUC, but the reduced set requires less time to compute.

  • The processing time of the logistic regression models is much shorter than that of the boosted tree model.

Below is the summary of the final model, the logistic regression fit on the reduced set:

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2782  -0.2126  -0.1342  -0.0395   8.4904  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -6.19727    0.25114 -24.676  < 2e-16 ***
## age                -0.10134    0.04570  -2.218 0.026571 *  
## jobadmin.          -0.67564    0.15266  -4.426 9.61e-06 ***
## jobblue.collar     -0.19751    0.12048  -1.639 0.101151    
## jobmanagement      -0.07424    0.13520  -0.549 0.582934    
## jobretired         -0.44696    0.23071  -1.937 0.052708 .  
## jobservices        -0.52519    0.15342  -3.423 0.000619 ***
## jobtechnician      -0.47502    0.13018  -3.649 0.000263 ***
## maritalmarried     -0.37102    0.10711  -3.464 0.000533 ***
## maritalsingle      -0.18101    0.12668  -1.429 0.153026    
## educationprimary   -0.08658    0.20726  -0.418 0.676143    
## educationsecondary -0.05191    0.19118  -0.271 0.786008    
## educationtertiary  -0.42330    0.20887  -2.027 0.042698 *  
## balance            -6.87055    0.25530 -26.912  < 2e-16 ***
## housingno           0.43159    0.07766   5.558 2.74e-08 ***
## loanyes             0.72400    0.07901   9.163  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8161.1  on 45210  degrees of freedom
## Residual deviance: 6690.5  on 45195  degrees of freedom
## AIC: 6722.5
## 
## Number of Fisher Scoring iterations: 10

Determine the cutoff rate using a profit-to-risk criterion

We assume the lender issues one-year loans and charges a 20% interest rate.

The goal is to minimize the cost associated with misclassified loans:
* Average requested loan: $4000
* Interest rate: 20%
* False-positive cost: $4000 (the model predicts that a borrower is "risk-free" but he/she defaults)
* False-negative cost: $4000 * 20% = $800 (the model predicts that a borrower will default and the loan is denied, but he/she would not have defaulted, so the interest is forgone)

Probability Cost Function
The probability cost function (PCostF) is the proportion of the total expected cost attributable to the false-positive samples; its inverse, the cost weight (costWeight), is the relative weight given to the false-negative samples when searching for the best threshold. With the quantities defined below, PCostF = (freq * FPCost) / (freq * FNCost + (1 - freq) * FPCost), where freq is the observed frequency of "no".

AveLoan <- 4000       # average requested loan
IR <- 0.2             # annual interest rate
FPCost <- AveLoan     # cost of lending to a borrower who defaults
FNCost <- AveLoan*IR  # cost of denying a borrower who would have repaid (lost interest)

# Observed frequency of "no" (non-default)
freq <- table(Loan_data3$default)[1]/
        (table(Loan_data3$default)[1]+table(Loan_data3$default)[2])
# Probability cost function and the corresponding cost weight
PCostF <-(freq*FPCost)/((freq*FNCost)+((1-freq)*FPCost)) 
costWeight <- 1/PCostF

Get the cutoff by using "closest.topleft"
coords returns the coordinates of the ROC curve at a specified point. Passing "b" (best) returns the threshold, specificity, and sensitivity of the best point, where "best" is determined by the weighted "closest.topleft" method using the cost weight computed above.
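The roc object glm_rpartROC4 used below is not constructed in the code shown; a hedged guess at how it could be built from the saved cross-validation predictions of the reduced-set model (the object name appears in the original, but this construction, including the use of the "no" class probability, is an assumption inferred from the threshold near 1 reported below):

# Hypothetical reconstruction of the ROC object used in the next chunk
glm_rpartROC4 <- roc(response = glm_fit4$pred$obs,
                     predictor = glm_fit4$pred$no,
                     levels = rev(levels(glm_fit4$pred$obs)))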

cutoff <- coords(glm_rpartROC4, #"roc" object from the roc function
        "b",                    # get the best result
         ret=c("threshold", "specificity", "sensitivity"), #Coordinates to return. "accuracy", ...
         best.method="closest.topleft", # if x="best", the method to determine the best threshold
         best.weights=c(costWeight, freq)) # if x="best", the weights to determine the best threshold
cutoff
##   threshold specificity sensitivity 
##   0.9601029   0.5521472   0.9107352
cutoffRisk <- 1- cutoff[1]; cutoffRisk
##  threshold 
## 0.03989709

This logistic regression model recommends a default-risk cutoff of 0.03990 to support decisions on profit-to-risk control.

Confidence interval of threshold
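The interval itself is not shown in this report; a sketch of how it could be bootstrapped with pROC, assuming the same roc object and best-threshold settings (boot.n is reduced here for speed and is an illustrative choice):

# Bootstrap confidence interval for the best threshold (illustrative settings)
set.seed(512)
ci.coords(glm_rpartROC4, x = "best", ret = "threshold",
          best.method = "closest.topleft",
          best.weights = c(costWeight, freq),
          boot.n = 500)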

Predicted Result

# Predicted class probabilities from the final (reduced-set) model
Loan_data5 <- predict(glm_fit4, newdata = Loan_data4, type = "prob")
# Attach the predicted default probability as a "risk" column
Loan_data$risk <- Loan_data5$yes
head(Loan_data,15)
##    age          job  marital education default balance housing loan
## 1   58   management  married  tertiary      no    2143     yes   no
## 2   44   technician   single secondary      no      29     yes   no
## 3   33 entrepreneur  married secondary      no       2     yes  yes
## 4   47  blue-collar  married   unknown      no    1506     yes   no
## 5   33      unknown   single   unknown      no       1      no   no
## 6   35   management  married  tertiary      no     231     yes   no
## 7   28   management   single  tertiary      no     447     yes  yes
## 8   42 entrepreneur divorced  tertiary     yes       2     yes   no
## 9   58      retired  married   primary      no     121     yes   no
## 10  43   technician   single secondary      no     593     yes   no
## 11  41       admin. divorced secondary      no     270     yes   no
## 12  29       admin.   single secondary      no     390     yes   no
## 13  53   technician  married secondary      no       6     yes   no
## 14  58   technician  married   unknown      no      71     yes   no
## 15  57     services  married secondary      no     162     yes   no
##            risk
## 1  0.0001259841
## 2  0.0191078006
## 3  0.0598665576
## 4  0.0007779442
## 5  0.0572081113
## 6  0.0113883863
## 7  0.0189229034
## 8  0.0272708637
## 9  0.0115146048
## 10 0.0054760116
## 11 0.0112766335
## 12 0.0082633356
## 13 0.0155112697
## 14 0.0136373610
## 15 0.0101336829
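As a follow-up sketch, the cutoff derived above can be applied to flag high-risk applicants (the high_risk column is an illustrative addition, not part of the original analysis):

# Flag applicants whose predicted default probability exceeds the risk cutoff
Loan_data$high_risk <- Loan_data$risk > as.numeric(cutoffRisk)
table(Loan_data$high_risk)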