1. Question:
Can we predict the probability of default and determine a default-risk cutoff that provides decision support for profit-to-risk control?
2. Input Data:
The data is provided as a zip file, Loan_Data.zip.
The data set contains 45211 observations of 8 variables.
3. Features:
Handle NA values; check frequency ratios for near-zero variance; center, scale, and Box-Cox transform the skewed numeric predictors; create dummy variables for the factor predictors; remove redundant, highly correlated variables; remove near-zero-variance predictors to form a reduced set, while also keeping a full set that includes all the variables.
4. Performance Analysis:
Build the predictive models using all the observations, because the default outcome (risk vs non-risk) is extremely unbalanced. Measure model performance with 10-fold cross-validation.
5. Build the models:
* First model: a boosted tree model is very flexible but less interpretable. This type of model tends to have a high likelihood of producing the best results.
* Second model: logistic regression uses a simpler technique to determine classification boundaries. It has no tuning parameters, and its prediction equation is simple and easy to implement.
6. Select the Model:
Compare the models' performance by AUC (area under the ROC curve), processing time, and ease of interpretation. The logistic regression model with the reduced set was chosen as the final model.
7. Calculate the cut-off rate (for default risk):
Use the weighted "closest.topleft" best-threshold strategy on the ROC curve; the weights are obtained by calculating the Probability Cost Function.
8. Next:
I am planning to add KNN and random forest analyses to this data set. However, the logistic regression results are already very close to the boosted tree's, which indicates that the current model reasonably approximates the performance of a more complex method; a sketch of the planned additions follows.
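As a sketch of that plan, the same cross-validation setup defined below (ctrl) and the reduced data set (Loan_data4) could be reused with caret's "knn" and "rf" methods; the tuning grid and tuneLength values here are illustrative assumptions, not results from this report.
# Sketch only: KNN and random forest fits reusing the 10-fold CV control object
# defined later in this report. method = "rf" requires the randomForest package.
set.seed(512)
knn_fit <- train(default ~ ., data = Loan_data4,
                 method = "knn", metric = "ROC", trControl = ctrl,
                 preProcess = c("center", "scale"),        # scale everything for the distance metric
                 tuneGrid = data.frame(k = c(5, 15, 25)))  # illustrative neighbourhood sizes
set.seed(512)
rf_fit <- train(default ~ ., data = Loan_data4,
                method = "rf", metric = "ROC", trControl = ctrl,
                tuneLength = 3)                            # try three mtry values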
The data file is attached in the folder.
Loan_data <- read.csv("Loan_data.csv", header = TRUE, stringsAsFactors = TRUE)  # factors are required later by dummyVars() and caret::train()
library(caret); library(ggplot2); library(lattice)
## Loading required package: lattice
## Loading required package: ggplot2
Find_NA <- apply(Loan_data, 2, function(x) sum(is.na(x))); Find_NA
## age job marital education default balance housing
## 0 0 0 0 0 0 0
## loan
## 0
There are zero NA values, so no imputation is needed.
ZC <- nearZeroVar(Loan_data, saveMetrics = T); ZC
## freqRatio percentUnique zeroVar nzv
## age 1.044589 0.170312535 FALSE FALSE
## job 1.028970 0.026542213 FALSE FALSE
## marital 2.127756 0.006635553 FALSE FALSE
## education 1.744380 0.008847404 FALSE FALSE
## default 54.473620 0.004423702 FALSE TRUE
## balance 18.020513 15.854548672 FALSE FALSE
## housing 1.251432 0.004423702 FALSE FALSE
## loan 5.241165 0.004423702 FALSE FALSE
None of the predictors is degenerate; the only variable flagged as near-zero variance is the outcome "default" itself, which reflects the strong class imbalance rather than a problem with a predictor.
Transforming Skewed predictors
"BoxCox": transform the skewed numeric variables toward a more normal distribution. "Centering" and "Scaling": standardize the numeric predictors to improve the numerical behaviour of the models.
+ Skip the marital variable (its "divorced" level also covers widowed observations).
Loan_data1 <- Loan_data
# Estimate Box-Cox, centering and scaling parameters on every column except marital (column 3)
Center_data <- preProcess(Loan_data1[,-3], method = c("BoxCox", "center", "scale"))
train_data <- predict(Center_data, Loan_data1[,-3])
Loan_data2 <- Loan_data
# Replace the two numeric predictors with their transformed versions
Loan_data2$age <- train_data$age
Loan_data2$balance <- train_data$balance
head(Loan_data2)
## age job marital education default balance housing
## 1 1.4966373 management married tertiary no 0.25641642 yes
## 2 0.4114911 technician single secondary no -0.43788985 yes
## 3 -0.7185480 entrepreneur married secondary no -0.44675753 yes
## 4 0.6705795 blue-collar married unknown no 0.04720492 yes
## 5 -0.7185480 unknown single unknown no -0.44708596 no
## 6 -0.4874176 management married tertiary no -0.37154649 yes
## loan
## 1 no
## 2 no
## 3 yes
## 4 no
## 5 no
## 6 no
Creating dummy variables
Create dummy variables for the factor predictors so that they can enter the models; the result has 45211 observations with 26 variables.
dummies <- dummyVars(default~., data = Loan_data2)
Loan_data3 <- predict(dummies, newdata=Loan_data2)
Loan_data3 <- data.frame(Loan_data3)
Loan_data3$default <- Loan_data2$default
Remove near-zero-variance variables
Many of the binary dummy predictors are very sparse and unbalanced. These near-zero-variance predictors can cause computational issues in several of the models applied, so they are removed to form a reduced set.
NZC <- nearZeroVar(Loan_data3, saveMetrics = T)
nzv_T <- which(NZC$nzv == T)
Loan_data4 <- Loan_data3[,-(nzv_T)]
Loan_data4$default <- Loan_data3$default
Loan_data3 (all predictors) includes every variable regardless of its distribution. Loan_data4 (reduced set) is more appropriate for models that are sensitive to sparse, unbalanced predictors.
Collinearity
Find the highly redundant (collinear) predictors in both data sets and then eliminate them.
All predictors
fullCovMat3 <- cov(Loan_data3[,-26])      # covariance matrix of the predictors (column 26 is the outcome)
library(subselect)
fullResults3 <- trim.matrix(fullCovMat3)  # flags variables responsible for (near-)linear dependencies
discardName3 <- fullResults3$names.discarded
discardName3
## [1] "jobunknown" "maritaldivorced" "educationunknown"
## [4] "housingyes" "loanyes"
Reduced Predictors
reducedCovMat4 <- cov(Loan_data4[,-19])
reducedResults4 <- trim.matrix(reducedCovMat4)
discardName4 <- reducedResults4$names.discarded
discardName4
## [1] "maritaldivorced" "housingyes" "loanno"
Eliminate the flagged variables from both data sets:
Loan_data3 <- Loan_data3[,-(fullResults3$numbers.discarded)]
Loan_data4 <- Loan_data4[,-(reducedResults4$numbers.discarded)]
Model performance assessment
The bar plot shows the unbalanced number of observations for risk vs non-risk: the frequency of "no" is 0.982 and of "yes" is 0.018.
Normally we would split the data into training and test sets; however, the ability to predict "default" would be compromised by the scarcity of "yes" observations.
Therefore we use all the observations to build the predictive models and estimate performance with cross-validation resampling. We then compare the models by AUC (area under the ROC curve) and processing time to choose the final model.
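The class frequencies and bar plot referred to above can be reproduced with a short chunk like the following (a sketch; the plot itself is not included in this report):
# Class imbalance of the outcome: counts, proportions, and a simple bar plot
table(Loan_data$default)
round(prop.table(table(Loan_data$default)), 3)   # roughly no: 0.982, yes: 0.018
ggplot(Loan_data, aes(x = default)) +
  geom_bar() +
  labs(title = "Default: non-risk (no) vs risk (yes)", y = "Number of observations")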
Parallel Processing
library(doSNOW)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: snow
registerDoSNOW(makeCluster(2, type = "SOCK"))
Train control parameters: we will use 10-fold cross-validation to evaluate the models; classProbs = TRUE and savePredictions = TRUE keep the hold-out class probabilities for the ROC analysis later.
ctrl <- trainControl(method="cv",
summaryFunction = twoClassSummary,
classProbs=TRUE,
savePredictions =TRUE)
library(pROC); library(gbm); library(caret); library(plyr)
## Loaded gbm 2.1.1
Model 1: BOOSTED TREE MODEL with the original data
set.seed(512)
t1 <- Sys.time()
gbm_fit <- train(default~.,
data = Loan_data,
method = "gbm",
metric = "ROC",
trControl = ctrl,
verbose=FALSE)
t2 <- Sys.time()
tgbm_fit<- difftime(t2,t1)
gbm_fit
## Stochastic Gradient Boosting
##
## 45211 samples
## 7 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 40690, 40690, 40690, 40690, 40690, 40691, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees ROC Sens Spec
## 1 50 0.8654989 1.0000000 0.000000000
## 1 100 0.8667891 0.9999550 0.000000000
## 1 150 0.8676082 0.9998649 0.000000000
## 2 50 0.8677824 0.9998649 0.006142728
## 2 100 0.8682300 0.9997748 0.008611864
## 2 150 0.8717248 0.9997297 0.012300512
## 3 50 0.8690963 0.9997072 0.007377296
## 3 100 0.8713009 0.9995720 0.012285456
## 3 150 0.8706561 0.9993693 0.017208672
## ROC SD Sens SD Spec SD
## 0.01911941 0.000000e+00 0.00000000
## 0.01973955 9.496329e-05 0.00000000
## 0.01981358 1.163145e-04 0.00000000
## 0.01979677 1.899293e-04 0.01043302
## 0.02163155 1.839232e-04 0.01303762
## 0.02148203 2.070094e-04 0.01537125
## 0.01928153 3.685987e-04 0.01035553
## 0.02002015 3.432463e-04 0.01295047
## 0.02077119 4.220755e-04 0.01446324
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
Model 2: LOGISTIC REGRESSION with the original data
set.seed(512)
t3 <- Sys.time()
glm_fit <- train(default~.,
data = Loan_data,
method = "glm",
metric="ROC",
trControl = ctrl)
t4 <- Sys.time()
tglm_fit <- difftime(t4,t3); tglm_fit
## Time difference of 8.772502 secs
glm_fit
## Generalized Linear Model
##
## 45211 samples
## 7 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 40690, 40690, 40690, 40690, 40690, 40691, ...
## Resampling results
##
## ROC Sens Spec ROC SD Sens SD Spec SD
## 0.8612404 0.9993243 0.02451069 0.0221412 0.0003357384 0.02572601
##
##
Model 3: LOGISTIC REGRESSION with the full set (Loan_data3)
set.seed(512)
t5 <- Sys.time()
glm_fit3 <- train(default~.,
data = Loan_data3,
method = "glm",
metric="ROC",
trControl = ctrl)
t6 <- Sys.time()
tglm_fit3 <- difftime(t6,t5); tglm_fit3
## Time difference of 8.147466 secs
glm_fit3
## Generalized Linear Model
##
## 45211 samples
## 20 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 40690, 40690, 40690, 40690, 40690, 40691, ...
## Resampling results
##
## ROC Sens Spec ROC SD Sens SD Spec SD
## 0.8611418 0.9993243 0.02451069 0.02193808 0.0003357384 0.02572601
##
##
Model 4: LOGISTIC REGRESSION with the reduced set (Loan_data4)
set.seed(512)
t7 <- Sys.time()
glm_fit4 <- train(default~.,
data = Loan_data4,
method = "glm",
metric="ROC",
trControl = ctrl)
t8 <- Sys.time()
tglm_fit4 <- difftime(t8,t7); tglm_fit4
## Time difference of 6.593378 secs
glm_fit4
## Generalized Linear Model
##
## 45211 samples
## 15 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 40690, 40690, 40690, 40690, 40690, 40691, ...
## Resampling results
##
## ROC Sens Spec ROC SD Sens SD Spec SD
## 0.8603669 0.9993243 0.02449563 0.02429763 0.0003357384 0.02817112
##
##
Model selection is based on AUC (area under the ROC curve) and processing time, both estimated with cross-validation.
Because the output (risk vs non-risk) is highly unbalanced, accuracy is not the primary goal. The ROC curve gives a quantitative assessment of each model; the model with the largest area under the ROC curve is the most effective.
The ROC curve can also be used to choose a threshold that balances the trade-off between sensitivity and specificity.
Performance was estimated through 10-fold cross-validation.
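Before inspecting the individual AUC values, the cross-validation resamples of the four fits can be collected for a side-by-side comparison. This chunk is a sketch and not part of the original output:
# Sketch: compare the cross-validation ROC distributions of the four models
# (all fits used the same seed, so the resampling folds match).
resamps <- resamples(list(GBM = gbm_fit, GLM_all = glm_fit,
                          GLM_full = glm_fit3, GLM_reduced = glm_fit4))
summary(resamps)                   # ROC, Sens and Spec summaries per model
dotplot(resamps, metric = "ROC")   # visual comparison of the ROC metric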
Boosted Tree Model
## Time difference of 1.642677 mins
## Area under the curve: 0.8655
Logistic Regression with the original data
## Time difference of 8.772502 secs
## Area under the curve: 0.8607
Logistic Regression with the full set
## Time difference of 8.147466 secs
## Area under the curve: 0.8606
Logistic Regression with the reduced set
## Time difference of 6.593378 secs
## Area under the curve: 0.8599
The ROC comparison shows that the boosted tree model (gbm) provides the largest AUC, but only slightly higher than the logistic regression models (glm).
The logistic regression models show no meaningful difference in AUC, yet the reduced set requires the least time to compute, and all of them run in seconds compared with minutes for the boosted tree.
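The ROC objects behind these AUC values, the comparison plot, and the coefficient summary printed below are not shown as code in the report. A plausible reconstruction from the held-out cross-validation predictions is sketched here; except for glm_rpartROC4, which is reused in the cutoff step below, the object names are assumptions:
# Assumed reconstruction: ROC curves from the hold-out CV predictions saved by
# trainControl(savePredictions = TRUE). The probability of the "no" class is used
# as the predictor, consistent with the ~0.96 threshold reported in the cutoff
# step below. For gbm, keep only the rows belonging to the best tuning values.
gbm_pred <- merge(gbm_fit$pred, gbm_fit$bestTune)
gbm_ROC  <- roc(gbm_pred$obs, gbm_pred$no, levels = c("no", "yes"))
glm_ROC  <- roc(glm_fit$pred$obs,  glm_fit$pred$no,  levels = c("no", "yes"))
glm_ROC3 <- roc(glm_fit3$pred$obs, glm_fit3$pred$no, levels = c("no", "yes"))
glm_rpartROC4 <- roc(glm_fit4$pred$obs, glm_fit4$pred$no, levels = c("no", "yes"))

plot(gbm_ROC, col = "black")                   # ROC comparison plot discussed above
plot(glm_rpartROC4, col = "red", add = TRUE)   # the other glm curves can be added the same way

summary(glm_fit4)   # assumed source of the coefficient table printed below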
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2782 -0.2126 -0.1342 -0.0395 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.19727 0.25114 -24.676 < 2e-16 ***
## age -0.10134 0.04570 -2.218 0.026571 *
## jobadmin. -0.67564 0.15266 -4.426 9.61e-06 ***
## jobblue.collar -0.19751 0.12048 -1.639 0.101151
## jobmanagement -0.07424 0.13520 -0.549 0.582934
## jobretired -0.44696 0.23071 -1.937 0.052708 .
## jobservices -0.52519 0.15342 -3.423 0.000619 ***
## jobtechnician -0.47502 0.13018 -3.649 0.000263 ***
## maritalmarried -0.37102 0.10711 -3.464 0.000533 ***
## maritalsingle -0.18101 0.12668 -1.429 0.153026
## educationprimary -0.08658 0.20726 -0.418 0.676143
## educationsecondary -0.05191 0.19118 -0.271 0.786008
## educationtertiary -0.42330 0.20887 -2.027 0.042698 *
## balance -6.87055 0.25530 -26.912 < 2e-16 ***
## housingno 0.43159 0.07766 5.558 2.74e-08 ***
## loanyes 0.72400 0.07901 9.163 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8161.1 on 45210 degrees of freedom
## Residual deviance: 6690.5 on 45195 degrees of freedom
## AIC: 6722.5
##
## Number of Fisher Scoring iterations: 10
We assume that the lender issues loans that last one year and charges a 20% interest rate.
The goal is to minimize the cost associated with misclassified loans:
* Average requested loan: $4000
* Interest rate: 20%
* False-positive cost: $4000 (the model predicts that a borrower is "risk-free" but he/she defaults)
* False-negative cost: $4000 * 20% = $800 (the model predicts that a borrower will default but he/she would not, so the loan is denied and the interest is lost)
Probability Cost Function
The probability cost function (PCostF) is the proportion of the total expected cost attributable to the false-positive samples; the cost weight (costWeight) is the corresponding weight given to the false-negative samples when choosing the threshold.
AveLoan <- 4000
IR <- 0.2
FPCost <- AveLoan       # cost of lending to a borrower who then defaults
FNCost <- AveLoan*IR    # lost interest from denying a loan that would have been repaid
freq <- table(Loan_data3$default)[1]/
  (table(Loan_data3$default)[1]+table(Loan_data3$default)[2])   # frequency of the "no" class
PCostF <- (freq*FPCost)/((freq*FNCost)+((1-freq)*FPCost))
costWeight <- 1/PCostF
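Written out, with freq the proportion of non-defaulters ("no" ≈ 0.982, "yes" ≈ 0.018 as reported earlier), the code above computes approximately:
$$
\text{PCostF} = \frac{freq \cdot FPCost}{freq \cdot FNCost + (1 - freq) \cdot FPCost}
\approx \frac{0.982 \times 4000}{0.982 \times 800 + 0.018 \times 4000} \approx 4.58,
\qquad costWeight = \frac{1}{\text{PCostF}} \approx 0.22
$$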
Get the cutoff using "closest.topleft"
coords() returns the coordinates of the ROC curve at a specified point. Here we ask for the best point ("b") and return its threshold, specificity and sensitivity, using the weighted "closest.topleft" criterion and the cost weights computed above to determine the best threshold.
cutoff <- coords(glm_rpartROC4,                      # "roc" object from the roc() function
                 "b",                                 # return the best point on the curve
                 ret=c("threshold", "specificity", "sensitivity"),  # coordinates to return
                 best.method="closest.topleft",       # strategy used to pick the best threshold
                 best.weights=c(costWeight, freq))    # weights used by the best-threshold search
cutoff
## threshold specificity sensitivity
## 0.9601029 0.5521472 0.9107352
cutoffRisk <- 1- cutoff[1]; cutoffRisk
## threshold
## 0.03989709
This logistic regression model therefore recommends a default-risk cutoff of about 0.0399 to support decisions on profit-to-risk control.
Confidence interval of threshold
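The interval itself is not reproduced here; a bootstrap sketch using pROC's ci.coords (the small boot.n is an assumption to keep run time manageable) could look like this:
# Sketch: bootstrap confidence interval of the best threshold under the same
# weighted "closest.topleft" criterion used above.
set.seed(512)
ci.coords(glm_rpartROC4, x = "best",
          ret = c("threshold", "specificity", "sensitivity"),
          best.method = "closest.topleft",
          best.weights = c(costWeight, freq),
          boot.n = 500)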
Predicted Result
Loan_data5 <- predict(glm_fit4, newdata = Loan_data4, type = "prob")  # predicted class probabilities
Loan_data$risk <- Loan_data5$yes                                      # predicted probability of default ("yes")
head(Loan_data,15)
## age job marital education default balance housing loan
## 1 58 management married tertiary no 2143 yes no
## 2 44 technician single secondary no 29 yes no
## 3 33 entrepreneur married secondary no 2 yes yes
## 4 47 blue-collar married unknown no 1506 yes no
## 5 33 unknown single unknown no 1 no no
## 6 35 management married tertiary no 231 yes no
## 7 28 management single tertiary no 447 yes yes
## 8 42 entrepreneur divorced tertiary yes 2 yes no
## 9 58 retired married primary no 121 yes no
## 10 43 technician single secondary no 593 yes no
## 11 41 admin. divorced secondary no 270 yes no
## 12 29 admin. single secondary no 390 yes no
## 13 53 technician married secondary no 6 yes no
## 14 58 technician married unknown no 71 yes no
## 15 57 services married secondary no 162 yes no
## risk
## 1 0.0001259841
## 2 0.0191078006
## 3 0.0598665576
## 4 0.0007779442
## 5 0.0572081113
## 6 0.0113883863
## 7 0.0189229034
## 8 0.0272708637
## 9 0.0115146048
## 10 0.0054760116
## 11 0.0112766335
## 12 0.0082633356
## 13 0.0155112697
## 14 0.0136373610
## 15 0.0101336829
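Finally, the recommended cutoff can be applied to the predicted risk column. The decision column below is a hypothetical illustration and not part of the original analysis:
# Hypothetical: flag borrowers whose predicted default probability exceeds the
# recommended cutoff (cutoffRisk, about 0.0399) for manual review.
Loan_data$decision <- ifelse(Loan_data$risk > cutoffRisk, "review", "approve")
table(Loan_data$decision)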