1 Giới thiệu

Đánh giá tín dụng đóng vai trò quan trọng đối với lợi nhuận và phát triển bên vững của ngân hàng nói riêng cũng như các tổ chức tài chính nói chung. Hiện nay cách tiếp cận theo phương pháp Machine Learning đã chứng tỏ nhiều ưu thế, sự ưu việt và độ chính xác hơn hẳn các mô hình truyền thống

Mô hình dự báo truyền thống hiện nay được áp dụng trong phần lớn các ngân hàng Việt Nam hiện này là :

Logistic
Probit

Các mô hình này đã ra đời từ những năm 1950s, mà toán học đã không ngừng phát triển để đưa ra những mô hình dự báo mới (Machine Learning) phải kể đến như:

Random Forest
Bagging
AdaBoot
Neural Network
Support Vector Machine
Gradient Boosting
XGBoost

Mỗi phương pháp đều có thuật toán và package riêng để xử lý. Tuy nhiên các packages này lại khá sơ sài trong việc xử lý Tuning Parameter, Cross-Validation hay xử lý tập trung data

Hiện nay, Caret là package tập trung rất nhiều các mô hình Predictive Modeling, package hỗ trợ tất cả các bước trong việc xử lý mô hình Predictive Modeling như là:

Chia data (sử dụng trong việc chia train và test)
Data Pre-Processing
Tuning chỉ số mô hình
Đánh giá Important của biến

1.1 Data Splitting

Việc xây dựng Predictive Modeling theo phương pháp Machine Learning cần có data train vs test. Do vậy, Caret có hỗ trợ xử lý việc chia data thành 2 phần: train và test. (Lưu ý việc setseed trước khi chia data)

set.seed(123) 
indxTrain <- createDataPartition(y = data$Class,p = 8/10,list = FALSE) 
training <- data[indxTrain,] 
testing <- data[-indxTrain,]

1.2 Data Pre-Processing

1.2.1 Near Zero Mean Value

Trong việc xử lý data ta có thể gặp rất nhiều trường hợp unbalance value trong một biến. Như vậy, khi chia tập dữ liệu theo train, test, validation ta có thể gặp trường hợp train chỉ có 1 level, trong khi test có thể có nhiều levels hơn

data(mdrr)
data.frame(table(mdrrDescr$nR11))

##   Var1 Freq
## 1    0  501
## 2    1    4
## 3    2   23

Việc xuất hiện biến với Near Zero Mean Value làm ảnh hưởng đến tính ổn định của mô hình

nzv <- nearZeroVar(mdrrDescr, saveMetrics= TRUE)
nzv[nzv$nzv,][1:10,]

##        freqRatio percentUnique zeroVar  nzv
## nTB     23.00000     0.3787879   FALSE TRUE
## nBR    131.00000     0.3787879   FALSE TRUE
## nI     527.00000     0.3787879   FALSE TRUE
## nR03   527.00000     0.3787879   FALSE TRUE
## nR08   527.00000     0.3787879   FALSE TRUE
## nR11    21.78261     0.5681818   FALSE TRUE
## nR12    57.66667     0.3787879   FALSE TRUE
## D.Dr03 527.00000     0.3787879   FALSE TRUE
## D.Dr07 123.50000     5.8712121   FALSE TRUE
## D.Dr08 527.00000     0.3787879   FALSE TRUE

Chúng ta nên loại bỏ biến với NearZeroMean hoặc biến đổi về dạng Factor

1.2.2 Centering and Scaling

Việc centering and scaling sẽ giúp cho các biến trong tập dữ liệu có chung đơn vị
Ngoài ra các biến sẽ có khoảng giá trị từ 0 đến 1 (giúp cho việc predictiion tốt hơn so với việc các giá trị biến thiên từ 0 đến 1 và từ 100 đến 100000 trong cùng tập dữ liệu)
Các biến sẽ được đưa về dạng như Normal Distribution

1.3 Tuning Parameter

1.3.1 K-fold Cross-Validation

Đây là phương pháp để nâng cao khả năng dự báo cũng như giảm thiểu Overfitting của mô hình.

Phương pháp K-fold CV sử dụng cách thức như sau

B1: Chia data thành k nhóm nhỏ
B2: Chạy mô hình k lần, mỗi lần trên 1 nhóm và test trên k-1 nhóm còn lại
B3: Chỉ số average error của k lần chạy mô hình sẽ là Error của mô hình đó

Advantages:

Với việc tất cả các quan sát sẽ đều được training và test sẽ làm cho variance giảm xuống, k càng tăng sẽ làm variance giảm.
Giảm thiểu Overfitting của mô hình

Disadvantages:

Việc chia k-fold cross-validation sẽ làm cho khối lượng tính toán tăng lên rất lớn

1.3.2 Tuning

Caret hỗ trợ việc Tuning Parameter trên từng loại Predictive Modeling

VD: GBM có n.trees, interaction.depth, shrinkage

1.4 Sup-sampling For Class Imbalances

Trong thực tế chúng ta sẽ luôn gặp bài toán Class Imbalances, để giải quyết vấn đền này Caret đưa ra phương pháp :

Over-sampling: Tăng thêm các biến có Frequency nhỏ
Down-sampling: Giảm các biến có Frequency nhiều hơn
Both: kết hợp cả Up-sampling và Down-sampling
SMOTE: phương pháp tính toán để gia tăng thêm biến có Frequency nhỏ

1.5 Đánh giá Important của biến

Package caret hỗ trợ đánh giá mức hộ quan trọng của biến trong mô hình

set.seed(123) 
indxTrain <- createDataPartition(y = data$Class,p = 8/10,list = FALSE) 
training <- data[indxTrain,] 
testing <- data[-indxTrain,] 
set.seed(999)
Control <- trainControl(method = "repeatedcv", #sử dụng k-fold CV
                        number = 5, #giá trị của k
                        repeats = 5, #số lần chạy mô hình
                        classProbs = TRUE,
                        summaryFunction = multiClassSummary)
rf <- train(Class~.,
            data=training,
            method="rf",
            metric = "ROC",
            preProcess = c("center","scale"),
            trControl=Control)
rf.probs <- predict(rf,testing)
varImp(rf, scale = FALSE)

## rf variable importance
## 
##   only 20 most important variables shown (out of 61)
## 
##                                Overall
## Amount                          10.522
## Age                              9.870
## Duration                         9.778
## CheckingAccountStatus.none       9.277
## CheckingAccountStatus.lt.0       8.263
## InstallmentRatePercentage        4.808
## ResidenceDuration                4.428
## CreditHistory.Critical           3.648
## SavingsAccountBonds.lt.100       3.441
## Housing.Own                      3.076
## NumberExistingCredits            2.859
## OtherInstallmentPlans.None       2.688
## CheckingAccountStatus.0.to.200   2.666
## EmploymentDuration.lt.1          2.652
## Purpose.NewCar                   2.597
## Property.RealEstate              2.581
## OtherInstallmentPlans.Bank       2.566
## CreditHistory.NoCredit.AllPaid   2.539
## Telephone                        2.422
## CreditHistory.ThisBank.AllPaid   2.351

2 Thực hành

Data chúng ta sử dụng là German Credit nói về đánh giá tín dụng trong lĩnh vực ngân hàng với 61 biến độc lập và 1 biến phụ thuộc

Tỉ lệ hồ sơ Good và Bad trong biến Class của là 70:30

round(prop.table(table(data$Class))*100, 2)

## 
##  Bad Good 
##   30   70

Theo cách tiếp cận của Machine Learning, chúng ta sẽ chia data thành 2 tập train và test theo tỉ lệ 80:20

set.seed(123) 
indxTrain <- createDataPartition(y = data$Class,p = 8/10,list = FALSE) 
training <- data[indxTrain,] 
testing <- data[-indxTrain,]

Để cho mô hình dự báo tốt hơn, chúng ta sẽ sử dụng phương pháp Cross-Validation với k =5

Control <- trainControl(method = "repeatedcv", #sử dụng k-fold CV
                        number = 5, #giá trị của k
                        repeats = 5, #số lần chạy mô hình
                        classProbs = TRUE,
                        summaryFunction = multiClassSummary) #đánh giá chất lượng mô hình trên các chỉ số

2.1 Logistic

set.seed(999)
logistic <- train(Class~.,
                  data=training,
                  method="glm",
                  metric = "ROC",
                  family="binomial",
                  trControl=Control)

logistic.probs <- predict(logistic,testing,type="prob")

logistic.ROC <- roc(predictor=logistic.probs$Bad,
                    response=testing$Class,
                    levels=rev(levels(testing$Class)))

2.2 Random Forest

set.seed(999)
rf <- train(Class~.,
            data=training,
            method="rf",
            metric = "ROC",
            preProcess = c("center","scale"),
            trControl=Control)

rf.probs <- predict(rf,testing,type="prob")

rf.ROC <- roc(predictor=rf.probs$Bad,
              response=testing$Class,
              levels=rev(levels(testing$Class)))

2.3 Bagging

set.seed(999)
bg <- train(Class~.,
            data=training,
            method="treebag",
            metric = "ROC",
            preProcess = c("center","scale"),
            trControl=Control)
bg.probs <- predict(bg,testing,type="prob")

bg.ROC <- roc(predictor=bg.probs$Bad,
              response=testing$Class,
              levels=rev(levels(testing$Class)))

2.4 Ada Boosting

set.seed(999)
ada <- train(Class~.,
             data=training,
             method="ada",
             preProcess = c("center","scale"),
             trControl=Control,
             metric = "ROC",
             verbose=FALSE)

save(ada, file = "D:/4. BA - R/4. LECTURE & TRAINING/2. TRAINING/8. COMPARE/Model/ada.Rda")

load(file = "D:/4. BA - R/4. LECTURE & TRAINING/2. TRAINING/8. COMPARE/Model/ada.Rda")

ada.probs <- predict(ada,testing,type="prob")

ada.ROC <- roc(predictor=ada.probs$Bad,
               response=testing$Class,
               levels=rev(levels(testing$Class)))

2.5 Gradient Boosting

set.seed(999)
gbm <- train(Class~.,
             data=training,
             method="gbm",
             preProcess = c("center","scale"),
             trControl=Control,
             metric = "ROC",
             verbose=FALSE)
gbm.probs <- predict(gbm,testing,type="prob")

gbm.ROC <- roc(predictor=gbm.probs$Bad,
               response=testing$Class,
               levels=rev(levels(testing$Class)))

2.6 Đánh giá mô hình

model1 <- as.data.frame(logistic$resample)
model1$MoHinh <- "LOGISTIC"
model1 <- model1[,-11]

model2 <- as.data.frame(rf$resample)
model2$MoHinh <- "RandomForest"
model2 <- model2[,-11]

model3 <- as.data.frame(bg$resample)
model3$MoHinh <- "Bagging"
model3 <- model3[,-11]

model4 <- as.data.frame(gbm$resample)
model4$MoHinh <- "GradientBoosting"
model4 <- model4[,-11]

model5 <- as.data.frame(ada$resample)
model5$MoHinh <- "AdaBoost"
model5 <- model5[,-11]

model.performance <- rbind(model1, model2, model3, model4, model5)

2.6.1 Accuracy trên tập train

model.performance %>% 
  ggplot() +
  geom_boxplot(aes(x = MoHinh, y = Accuracy,fill = MoHinh)) +
  coord_flip() +
  theme_wsj()

model.performance %>% 
  ggplot(aes(Accuracy, fill = MoHinh)) + 
  geom_density(alpha = 0.5) +
  theme_wsj() +
  facet_grid(~ MoHinh)

2.6.2 AUC trên tập train

model.performance %>% 
  ggplot() +
  geom_boxplot(aes(x = MoHinh, y = ROC,fill = MoHinh)) +
  coord_flip() +
  geom_hline(yintercept = 0.8,color="blue") +
  theme_wsj()

model.performance %>% 
  ggplot(aes(ROC, fill = MoHinh)) + 
  geom_density(alpha = 0.5) +
  theme_wsj() +
  facet_grid(~ MoHinh)

2.6.3 AUC trên tập test

roc1 <- as.data.frame(logistic.ROC$auc)
roc1$Mohinh <- "Logistic"
names(roc1)[1] <- "AUC"

roc2 <- as.data.frame(rf.ROC$auc)
roc2$Mohinh <- "RandomForest"
names(roc2)[1] <- "AUC"

roc3 <- as.data.frame(bg.ROC$auc)
roc3$Mohinh <- "Bagging"
names(roc3)[1] <- "AUC"

roc4 <- as.data.frame(gbm.ROC$auc)
roc4$Mohinh <- "GradientBoosting"
names(roc4)[1] <- "AUC"

roc5 <- as.data.frame(ada.ROC$auc)
roc5$Mohinh <- "AdaBoost"
names(roc5)[1] <- "AUC"

roc.performance <- rbind(roc1, roc2, roc3, roc4, roc5)
roc.performance$Mohinh <- roc.performance$Mohinh %>% as.factor()

roc.performance %>% 
  ggplot(aes(x = Mohinh, y = AUC)) +
  geom_point(fill="red", color="darkred", size=3) + 
  theme_wsj()

Như vậy mô hình Gradient Boosting có chỉ số AUC cao nhất

2.6.4 So sánh thực tế lãi

Khi áp dụng vào bài toán kinh tế, giả sử với mỗi hợp đồng được dự báo tốt sẽ vay 1 USD, nếu đó là hợp đồng tốt sẽ mang lại 0.1 USD lợi nhuận còn nếu đó là hợp đồng xấu sẽ lỗ 1 USD. Như vậy ta sẽ so sánh 4 model dự báo vào trong bài toán kinh tế này để xem xét xem mô hình nào mang lại giá trị lợi nhuận cao nhất

#logistic
logistic.table <- seq(0,0.97,0.01)
logistic.table <- logistic.table %>% as.data.frame()
names(logistic.table)[1] <- "cutoff"
logistic.pred <-  predict(logistic, newdata = testing, type = "prob")
for (i in seq(0,0.97,0.01)){
  
  pred      <- factor( ifelse(logistic.pred[, "Good"] >= i, "Good", "Bad") )
  pred      <- relevel(pred, "Good")  
  conf <- confusionMatrix(pred, testing$Class)
  
  confusion.result <- conf$table
  
  profit <- confusion.result[2,2]*0.1- confusion.result[2,1]
  cutoff <- i
  
  df <- cbind(cutoff, profit)
  df <- df %>% as.data.frame()
  
  logistic.table$benefit[logistic.table$cutoff %in% df$cutoff] <- df$profit
}

#random forest
rf.table <- seq(0,0.97,0.01)
rf.table <- rf.table %>% as.data.frame()
names(rf.table)[1] <- "cutoff"
rf.result <- predict(rf,testing,type = "prob")
for (i in seq(0,0.97,0.01)){
  pred      <- factor( ifelse(rf.result[, "Good"] >= i, "Good", "Bad") )
  pred      <- relevel(pred, "Good")  
  conf <- confusionMatrix(pred, testing$Class)
  
  confusion.result <- conf$table
  
  profit <- confusion.result[2,2]*0.1- confusion.result[2,1]
  cutoff <- i
  
  df <- cbind(cutoff, profit)
  df <- df %>% as.data.frame()
  
  rf.table$benefit[rf.table$cutoff %in% df$cutoff] <- df$profit
}
#bagging
bg.table <- seq(0,0.97,0.01)
bg.table <- bg.table %>% as.data.frame()
names(bg.table)[1] <- "cutoff"
bg.result <- predict(bg,testing,type = "prob")
for (i in seq(0,0.97,0.01)){
  pred      <- factor( ifelse(bg.result[, "Good"] >= i, "Good", "Bad") )
  pred      <- relevel(pred, "Good")  
  conf <- confusionMatrix(pred, testing$Class)
  
  confusion.result <- conf$table
  
  profit <- confusion.result[2,2]*0.1- confusion.result[2,1]
  cutoff <- i
  
  df <- cbind(cutoff, profit)
  df <- df %>% as.data.frame()
  
  bg.table$benefit[bg.table$cutoff %in% df$cutoff] <- df$profit
}
#gradient boosting
gbm.table <- seq(0,0.97,0.01)
gbm.table <- gbm.table %>% as.data.frame()
names(gbm.table)[1] <- "cutoff"
gbm.result <- predict(gbm,testing,type = "prob")
for (i in seq(0,0.97,0.01)){
  pred      <- factor( ifelse(gbm.result[, "Good"] >= i, "Good", "Bad") )
  pred      <- relevel(pred, "Good")  
  conf <- confusionMatrix(pred, testing$Class)
  
  confusion.result <- conf$table
  
  profit <- confusion.result[2,2]*0.1- confusion.result[2,1]
  cutoff <- i
  
  df <- cbind(cutoff, profit)
  df <- df %>% as.data.frame()
  
  gbm.table$benefit[gbm.table$cutoff %in% df$cutoff] <- df$profit
}

#adaboost
ada.table <- seq(0,0.97,0.01)
ada.table <- ada.table %>% as.data.frame()
names(ada.table)[1] <- "cutoff"
ada.result <- predict(ada,testing,type = "prob")
for (i in seq(0,0.97,0.01)){
  pred      <- factor( ifelse(ada.result[, "Good"] >= i, "Good", "Bad") )
  pred      <- relevel(pred, "Good")  
  conf <- confusionMatrix(pred, testing$Class)
  
  confusion.result <- conf$table
  
  profit <- confusion.result[2,2]*0.1- confusion.result[2,1]
  cutoff <- i
  
  df <- cbind(cutoff, profit)
  df <- df %>% as.data.frame()
  
  ada.table$benefit[ada.table$cutoff %in% df$cutoff] <- df$profit
}

logistic.table$benefit[is.na(logistic.table$benefit) == T] <- 0
rf.table$benefit[is.na(rf.table$benefit) == T] <- 0
gbm.table$benefit[is.na(gbm.table$benefit) == T] <- 0
ada.table

##    cutoff benefit
## 1    0.00   -46.0
## 2    0.01   -46.0
## 3    0.02   -46.0
## 4    0.03   -46.0
## 5    0.04   -46.0
## 6    0.05   -46.0
## 7    0.06   -46.0
## 8    0.07   -46.0
## 9    0.08   -46.0
## 10   0.09   -46.0
## 11   0.10   -46.0
## 12   0.11   -46.0
## 13   0.12   -46.0
## 14   0.13   -46.0
## 15   0.14   -46.0
## 16   0.15   -46.0
## 17   0.16   -46.0
## 18   0.17   -46.0
## 19   0.18   -45.0
## 20   0.19   -44.0
## 21   0.20   -42.0
## 22   0.21   -42.0
## 23   0.22   -41.0
## 24   0.23   -40.0
## 25   0.24   -39.0
## 26   0.25   -39.0
## 27   0.26   -38.0
## 28   0.27   -38.1
## 29   0.28   -36.1
## 30   0.29   -36.1
## 31   0.30   -36.2
## 32   0.31   -36.3
## 33   0.32   -36.3
## 34   0.33   -35.3
## 35   0.34   -34.3
## 36   0.35   -31.4
## 37   0.36   -31.5
## 38   0.37   -30.5
## 39   0.38   -29.5
## 40   0.39   -27.7
## 41   0.40   -25.7
## 42   0.41   -24.8
## 43   0.42   -23.9
## 44   0.43   -23.0
## 45   0.44   -23.0
## 46   0.45   -22.0
## 47   0.46   -20.0
## 48   0.47   -20.2
## 49   0.48   -20.3
## 50   0.49   -20.5
## 51   0.50   -19.6
## 52   0.51   -18.7
## 53   0.52   -16.1
## 54   0.53   -14.2
## 55   0.54   -11.2
## 56   0.55   -11.3
## 57   0.56    -9.3
## 58   0.57    -9.3
## 59   0.58    -9.3
## 60   0.59    -9.3
## 61   0.60    -9.7
## 62   0.61    -8.7
## 63   0.62    -7.8
## 64   0.63    -7.0
## 65   0.64    -7.3
## 66   0.65    -7.4
## 67   0.66    -7.5
## 68   0.67    -7.7
## 69   0.68    -6.8
## 70   0.69    -5.9
## 71   0.70    -6.0
## 72   0.71    -6.1
## 73   0.72    -4.2
## 74   0.73    -4.3
## 75   0.74    -2.3
## 76   0.75    -2.4
## 77   0.76    -2.5
## 78   0.77    -0.8
## 79   0.78    -0.8
## 80   0.79    -1.0
## 81   0.80    -1.3
## 82   0.81    -1.7
## 83   0.82    -2.0
## 84   0.83    -1.0
## 85   0.84    -0.3
## 86   0.85    -0.4
## 87   0.86    -0.8
## 88   0.87    -0.2
## 89   0.88    -0.7
## 90   0.89    -0.1
## 91   0.90     2.3
## 92   0.91     1.3
## 93   0.92     1.8
## 94   0.93     2.4
## 95   0.94     3.1
## 96   0.95     2.7
## 97   0.96     2.3
## 98   0.97     1.1

logistic.table$Mohinh <- "Logistic"
rf.table$Mohinh <- "RandomForest"
bg.table$Mohinh <- "Bagging"
gbm.table$Mohinh <- "Gradient Boosting"
ada.table$Mohinh <- "AdaBoost"
economic.profit <- rbind(logistic.table, rf.table, bg.table, gbm.table, ada.table)

max.benefit1 <- logistic.table %>% 
  filter(logistic.table$benefit == max(logistic.table$benefit))
max.benefit2 <- rf.table %>% 
  filter(rf.table$benefit == max(rf.table$benefit)) %>% 
  head(1)
max.benefit3 <- bg.table %>% 
  filter(bg.table$benefit == max(bg.table$benefit)) %>% 
  head(1)
max.benefit4 <- gbm.table %>% 
  filter(gbm.table$benefit == max(gbm.table$benefit)) 
max.benefit5 <- ada.table %>% 
  filter(ada.table$benefit == max(ada.table$benefit)) 

max.benefit <- rbind(max.benefit1, max.benefit2, max.benefit3, max.benefit4, max.benefit5)

Boxplot model

economic.profit %>% 
  ggplot() +
  geom_boxplot(aes(x = Mohinh, y = benefit,fill = Mohinh)) +
  geom_text(aes(label = benefit, x = Mohinh, y = benefit), data = max.benefit, check_overlap = TRUE, vjust = 1) +
  geom_point(aes(x = Mohinh, y = benefit), data = max.benefit, col = "red") + 
  coord_flip() +
  labs(x = "Model", y = "Benefit(USD)") +
  theme_wsj()

Như vậy mô hình AdaBoost tuy không phải mô hình dự báo tốt nhất nhưng lại là mô hình mang lại lợi nhuận cao nhất trong tất cả các mô hình

Mô hình Gradient Boosting tuy có chỉ số AUC tốt nhất nhưng lợi nhuận mang lại là kém nhất trong tất cả các mô hình

So sánh các mô hình Predictive Modeling bằng với Package Caret

Nguyen Minh Thanh

November 3, 2017