Home Credit Default Risk

Đây là Project nhỏ của mình - Newbie mới trải qua gần 2 tháng tiếp xúc với R và các kiến thức Machine Learning nên có thiếu sót mong nhận thêm được sự góp ý.
Đề tài mà mình lựa chọn là Phân loại Khách hàng dựa trên dữ liệu cuộc thi Home Credit Default Risk 2018 . Mình sử dụng tệp Application_train.csv. Do cấu hình máy tính không được cao nên mình xin phép chỉ sử dụng 10.000 mẫu trong project này. Output là Biến TARGET

Đầu tiên là tải dữ liệu và load các package cần thiết

Tìm hiểu dữ liệu:

Data có 121 biến và 10000 mẫu.

dim(data)
str(data)

Làm sạch dữ liệu

Tạo hàm & thay thế các biến NA có trong data Ở đây mình sẽ thay thế các giá trị NA bằng giá trị mean của các cột có chứa NA

replace_by_mean <- function(x) {x[is.na(x)] <- mean(x, na.rm = TRUE)
  return(x)}

data <- data %>% mutate_if(is.numeric,replace_by_mean)
na_rate = function(x){return(sum(is.na(x))/length(x))}

Kiểm tra xem tỷ lệ NA trong các column cho thấy 6 biến có tỷ lệ NA trong biến cao nhất đều bằng 0. Dữ liệu đã được loại bỏ NA.

data %>%
  sapply(na_rate) %>%
  round(2) %>%
  desc() %>%
  head()

##         SK_ID_CURR             TARGET NAME_CONTRACT_TYPE        CODE_GENDER 
##                  0                  0                  0                  0 
##       FLAG_OWN_CAR    FLAG_OWN_REALTY 
##                  0                  0

Chia thành 2 tệp Train & Test

set.seed(20200412)
ind <-sample(1:nrow(data),
             size = nrow(data)*0.7,
             replace = FALSE)
train_set <- data[ind,]
test_set <- data[-ind,]

Tiến hành chia Bin cho các biến, đồng thời tính WOE v

bins_var <- woebin(dt=train_set, 
                   y = "TARGET", 
                   positive = "BAD|1")

FALSE [INFO] creating woe binning ... 
FALSE [INFO] Binning on 7000 rows and 115 columns in 00:01:18

bins_var_out <- do.call("bind_rows", bins_var)

Sau đó là chọn các biến có IV lớn nhất làm biến tham gia vào mô hình. rồi chuyển về WOE

iv_var <- bins_var_out %>%
  dplyr::select(variable, total_iv) %>%
  group_by(variable) %>%
  summarise(iv = round(mean(total_iv),2)) %>% 
  arrange(desc(iv))

iv_var %>% 
  kable()

# loai bien co IV <0.02 va IV>0.5
iv_var_selected <- iv_var %>% 
  filter( iv >= 0.02 & iv <= 0.5) %>% 
  head(n=10L) %>%
  pull(1) 

print(iv_var_selected)

Các biến được lựa chọn ra như trên gồm có

EXT_SOURCE_1,2,3: Điểm tín dụng của các Khách hàng thông qua bên thứ 3 chấm điểm.
DAYS_BIRTH: Tuổi của Khách hàng tại thời điểm nộp hồ sơ cấp tín dụng.
DAYS_EMPLOYED Thời gian tính từ khi Khách hàng bắt đầu công việc hiện tại tới khi nộp hồ sơ.
AMT_GOODS_PRICE : Giá của hàng hóa trong trường hợp vay tiêu dùng
AMT_CREDIT: Số tiền Khách hàng vay
NAME_EDUCATION_TYPE: Trình độ văn hóa của Khách hàng
DAYS_ID_PUBLISH: thời gian từ ngày cấp ID tới nay
REGION_POPULATION_RELATIVE: Giá trị đã chuẩn hóa của dân số khu vực khách hàng sinh sống

WOE Transformation

train_set_selected <- train_set %>% 
  dplyr::select("TARGET", iv_var_selected)

## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(iv_var_selected)` instead of `iv_var_selected` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

df_train_woe <- woebin_ply(train_set_selected,bins_var_out)

Chuẩn hóa dữ liệu để đưa toàn bộ về dạng (0:1)

nor_func <- function(x){
  return((x- min(x))/ (max(x)-min(x)))}
  
  df_train_woe <- df_train_woe %>% 
  apply(2, nor_func) %>%
    as.data.frame()

logistics <- glm(TARGET ~ ., family = binomial, data=df_train_woe)

logistics %>% summary()

## 
## Call:
## glm(formula = TARGET ~ ., family = binomial, data = df_train_woe)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4075  -0.4091  -0.2912  -0.2018   3.2917  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -5.9026     0.2401 -24.580  < 2e-16 ***
## AMT_CREDIT_woe                   0.3150     0.2290   1.376 0.168927    
## AMT_GOODS_PRICE_woe              0.3777     0.1989   1.899 0.057602 .  
## NAME_EDUCATION_TYPE_woe          0.6251     0.1370   4.563 5.05e-06 ***
## REGION_POPULATION_RELATIVE_woe   0.5069     0.1514   3.348 0.000814 ***
## DAYS_BIRTH_woe                   0.1222     0.1519   0.805 0.421060    
## DAYS_EMPLOYED_woe                0.4396     0.1730   2.540 0.011074 *  
## DAYS_ID_PUBLISH_woe              0.6239     0.1604   3.889 0.000101 ***
## EXT_SOURCE_1_woe                 0.8012     0.1887   4.246 2.18e-05 ***
## EXT_SOURCE_2_woe                 1.4229     0.1436   9.911  < 2e-16 ***
## EXT_SOURCE_3_woe                 1.9551     0.1675  11.670  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3729.4  on 6999  degrees of freedom
## Residual deviance: 3244.5  on 6989  degrees of freedom
## AIC: 3266.5
## 
## Number of Fisher Scoring iterations: 6

Tính Model Performance

test_set <- test_set %>% 
  dplyr::select("TARGET", iv_var_selected)

bins_var_nor <- bins_var_out %>%
  group_by(variable) %>%
  mutate(woe = (woe-min(woe))/(max(woe) -min(woe))) 

test_woe <- woebin_ply(test_set, bins_var_nor)

Dự đoán tỷ lệ nợ xấu

# Predict default
pred <- predict(logistics, test_woe, "response")
head(pred)

##          1          2          3          4          5          6 
## 0.05573355 0.11396229 0.01600914 0.10168163 0.12370542 0.07279847

Tính confusion matrix với threshold = 0.3, sau đó tính ngưỡng threshold khuyến nghị:

# Choose 0.3  as theshold
pred1 <- ifelse(pred <= 0.3, 0, 1) %>% as.factor
table(pred1)

## pred1
##    0    1 
## 2912   88

# Accuracy rate

sum(pred1 == test_set$TARGET)/nrow(test_set)

## [1] 0.9073333

# Confusion matrix
confusionMatrix(data = pred1, 
                reference = as.factor(test_set$TARGET))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2692  220
##          1   58   30
##                                           
##                Accuracy : 0.9073          
##                  95% CI : (0.8964, 0.9175)
##     No Information Rate : 0.9167          
##     P-Value [Acc > NIR] : 0.9686          
##                                           
##                   Kappa : 0.1402          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9789          
##             Specificity : 0.1200          
##          Pos Pred Value : 0.9245          
##          Neg Pred Value : 0.3409          
##              Prevalence : 0.9167          
##          Detection Rate : 0.8973          
##    Detection Prevalence : 0.9707          
##       Balanced Accuracy : 0.5495          
##                                           
##        'Positive' Class : 0               
##

Từ đó ta có AUC tốt nhất là 0.7489

# AUC, ROC, KS

perf_eva(pred, 
         test_set$TARGET,
         showplot = c("ks", "roc"),
         title = "Model performance")

## [INFO] The threshold of confusion matrix is 0.1485.

## $binomial_metric
## $binomial_metric$`Model performance`
##          MSE      RMSE   LogLoss         R2        KS       AUC      Gini
## 1: 0.0710349 0.2665237 0.2560486 0.07008855 0.3832727 0.7489367 0.4978735
## 
## 
## $confusion_matrix
## $confusion_matrix$`Model performance`
##    label pred_0 pred_1      error
## 1:     0   2476    274 0.09963636
## 2:     1    155     95 0.62000000
## 3: total   2631    369 0.14300000
## 
## 
## $pic
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

Lựa chọn model hiệu quả nhất

Đầu tiên ta thiết lập tổ hợp các biến tham gia mô hình (chọn từ 4 - 6 trong 10 biến được chọn từ bước trên)

variables <- df_train_woe %>% 
  dplyr::select(-TARGET) %>% 
  names()

lapply(4:6, function(x) {combn(variables, x)}) %>% 
  lapply(as.data.frame) -> var_combn

all_combn <-  c()
i=1
for (i in 1:length(var_combn)) {
  df <- var_combn[[i]]
  
  for (j in 1:ncol(data)) {
    combn <- df[,j] %>% 
      as.character() %>% 
      list()
    all_combn <- c(all_combn, combn)
  }
}

all_combn_list <- lapply(all_combn,
                         function(x) {
                           paste("TARGET ~", paste(unlist(x), 
                                                collapse = "+"))
                         })

head(all_combn_list)

## [[1]]
## [1] "TARGET ~ AMT_CREDIT_woe+AMT_GOODS_PRICE_woe+NAME_EDUCATION_TYPE_woe+REGION_POPULATION_RELATIVE_woe"
## 
## [[2]]
## [1] "TARGET ~ AMT_CREDIT_woe+AMT_GOODS_PRICE_woe+NAME_EDUCATION_TYPE_woe+DAYS_BIRTH_woe"
## 
## [[3]]
## [1] "TARGET ~ AMT_CREDIT_woe+AMT_GOODS_PRICE_woe+NAME_EDUCATION_TYPE_woe+DAYS_EMPLOYED_woe"
## 
## [[4]]
## [1] "TARGET ~ AMT_CREDIT_woe+AMT_GOODS_PRICE_woe+NAME_EDUCATION_TYPE_woe+DAYS_ID_PUBLISH_woe"
## 
## [[5]]
## [1] "TARGET ~ AMT_CREDIT_woe+AMT_GOODS_PRICE_woe+NAME_EDUCATION_TYPE_woe+EXT_SOURCE_1_woe"
## 
## [[6]]
## [1] "TARGET ~ AMT_CREDIT_woe+AMT_GOODS_PRICE_woe+NAME_EDUCATION_TYPE_woe+EXT_SOURCE_2_woe"

Chạy Logistics với tất cả các kết quả trong tổ hợp được lựa chọn ban đầu để ra được model cho từng trường hợp

# chay logistic
all_combn_list %>% 
  lapply(function(x) {
    glm(x,
        family = binomial,
        data = df_train_woe)
  }) -> all_models

head(all_models, n=3L)

## [[1]]
## 
## Call:  glm(formula = x, family = binomial, data = df_train_woe)
## 
## Coefficients:
##                    (Intercept)                  AMT_CREDIT_woe  
##                        -4.1644                          0.4241  
##            AMT_GOODS_PRICE_woe         NAME_EDUCATION_TYPE_woe  
##                         0.4606                          0.6917  
## REGION_POPULATION_RELATIVE_woe  
##                         0.7252  
## 
## Degrees of Freedom: 6999 Total (i.e. Null);  6995 Residual
## Null Deviance:       3729 
## Residual Deviance: 3617  AIC: 3627
## 
## [[2]]
## 
## Call:  glm(formula = x, family = binomial, data = df_train_woe)
## 
## Coefficients:
##             (Intercept)           AMT_CREDIT_woe      AMT_GOODS_PRICE_woe  
##                 -4.2487                   0.4078                   0.4289  
## NAME_EDUCATION_TYPE_woe           DAYS_BIRTH_woe  
##                  0.8445                   1.0244  
## 
## Degrees of Freedom: 6999 Total (i.e. Null);  6995 Residual
## Null Deviance:       3729 
## Residual Deviance: 3573  AIC: 3583
## 
## [[3]]
## 
## Call:  glm(formula = x, family = binomial, data = df_train_woe)
## 
## Coefficients:
##             (Intercept)           AMT_CREDIT_woe      AMT_GOODS_PRICE_woe  
##                 -4.4377                   0.4313                   0.4283  
## NAME_EDUCATION_TYPE_woe        DAYS_EMPLOYED_woe  
##                  0.7693                   1.0856  
## 
## Degrees of Freedom: 6999 Total (i.e. Null);  6995 Residual
## Null Deviance:       3729 
## Residual Deviance: 3591  AIC: 3601

Tính Performance của từng model & Lựa chọn ra Best Model thông qua AR, P-Value

# tinh performance
pred_func <- function(x) {
  test_woe$predicted = predict(x,
                               newdata = test_woe,
                               type = "response")
}

all_models %>% 
  lapply(pred_func) -> all_models_pre
View(all_models_pre)
do.call("rbind", all_models_pre) %>% 
  as.data.frame() -> all_models_pre1


# filter ar > 50% & pvalue <= 5%

all.models.pred2 <- lapply(all_models_pre1, 
                           function(x) {
                             x = ifelse(x >= 0.5, 1,0)
                           }) %>% as.data.frame

# AR
ar <- c()
for (i in 1:nrow(all.models.pred2)) {
  ar[i] <- sum(all.models.pred2[i,] == 
                 test_set$TARGET)/length(test_set$TARGET)}

# Max(pvalue)
all_models_sum <- lapply(all_models, summary)
pvalue <- c()

for (i in 1: length(all_models)){
  data_frame <- all_models_sum[[i]]
  pvalue[i] <- max(data_frame[["coefficients"]][,4])
}

# gini = 2*AUC - 1
x=1
gini_score <- function(x) {
  auc <- auc(test_woe$TARGET,
             as.numeric(all_models_pre1[x,]),
             quiet = TRUE) %>% 
    as.numeric()
  gini <- 2*auc -1 
}

# Add pvalue and ar to all_models
results <- data.frame()
gini <- c()
for (i in 1: length(all_models)){
  gini[i] = gini_score(i)}
index <-  1:363
results <- cbind(index, pvalue, gini) %>%
  as.data.frame()

models_filtered <- results %>%
  filter(pvalue <= 0.05, gini >= 0.4)
max_gini_index <- which.max(models_filtered$gini)
models_filtered$index[max_gini_index]

## [1] 346

best.model <- all_models[[346]]
summary(best.model)

## 
## Call:
## glm(formula = x, family = binomial, data = df_train_woe)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3762  -0.4075  -0.2961  -0.2096   3.2410  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -5.4660     0.2153 -25.387  < 2e-16 ***
## AMT_CREDIT_woe            0.6604     0.1520   4.343 1.40e-05 ***
## NAME_EDUCATION_TYPE_woe   0.6312     0.1357   4.651 3.31e-06 ***
## DAYS_EMPLOYED_woe         0.5736     0.1597   3.591 0.000329 ***
## EXT_SOURCE_1_woe          0.8854     0.1773   4.994 5.92e-07 ***
## EXT_SOURCE_2_woe          1.5407     0.1406  10.960  < 2e-16 ***
## EXT_SOURCE_3_woe          1.9895     0.1655  12.021  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3729.4  on 6999  degrees of freedom
## Residual deviance: 3277.2  on 6993  degrees of freedom
## AIC: 3291.2
## 
## Number of Fisher Scoring iterations: 6

Tính điểm cho các tệp Train & Test

# tinh diem cho tap train & test 
infor <- scorecard(bins_var_nor,
                   best.model,
                   points0 = 500,
                   odds = 1/20,
                   pdo = 50)

# tinh diem cho train
train_point <- scorecard_ply(train_set_selected,
                             infor,
                             only_total_score = FALSE,
                             print_step = 0)

train_set_score <- train_set_selected %>% 
  mutate(score = train_point$score)

Biểu diễn phân phối điểm cho các Khách hàng Tốt & Xấu

#Diem trung binh cho KH tot & xau

train_set_score %>% 
  group_by(TARGET) %>% 
  summarise_each(funs(min, max, median, mean, n()), score) %>% 
  mutate_if(is.numeric, function(x) {round(x, 2)}) %>% 
  knitr::kable(caption = "Table 2: Scorecad Points by Group for Test Data")

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

Table 2: Scorecad Points by Group for Test Data
TARGET	min	max	median	mean	n
0	250	678	500	497.32	6475
1	224	662	426	428.91	525

train_set_score %>%
  group_by(TARGET) %>%
  summarise(mean=mean(score)) %>%
  ungroup() -> mean_score



train_set_score %>%
  ggplot(aes(x = score,
             group = TARGET,
             fill = TARGET)) +
  geom_density(alpha = .3) +
  geom_vline(data = mean_score,
             mapping = aes(xintercept = mean))

Home Credit Default Risk

Phạm Ngọc Sơn

5/29/2020

Home Credit Default Risk

Đầu tiên là tải dữ liệu và load các package cần thiết

Tìm hiểu dữ liệu:

Làm sạch dữ liệu

Chia thành 2 tệp Train & Test

Tiến hành chia Bin cho các biến, đồng thời tính WOE v

Sau đó là chọn các biến có IV lớn nhất làm biến tham gia vào mô hình. rồi chuyển về WOE

Các biến được lựa chọn ra như trên gồm có

WOE Transformation

Chuẩn hóa dữ liệu để đưa toàn bộ về dạng (0:1)

Tính Model Performance

Dự đoán tỷ lệ nợ xấu

Tính confusion matrix với threshold = 0.3, sau đó tính ngưỡng threshold khuyến nghị:

Lựa chọn model hiệu quả nhất

Chạy Logistics với tất cả các kết quả trong tổ hợp được lựa chọn ban đầu để ra được model cho từng trường hợp

Tính Performance của từng model & Lựa chọn ra Best Model thông qua AR, P-Value

Tính điểm cho các tệp Train & Test

Biểu diễn phân phối điểm cho các Khách hàng Tốt & Xấu