1. Introduction

신용 평가와 대출 승인 결정은 금융 기관에서 가장 중요한 업무 중 하나이다. 기본적으로 금융기관에서는 한국은행을 비롯한 다양한 곳에서 자금을 조달하여 이를 관리하고 있다가 자금을 필요로 하는 곳에 자금을 빌려주고 이에 상응하는 이자를 받아 수익을 얻는 것으로 볼 수 있다. 근본적으로 많은 금액을 빌려주고 이를 나누어서 자금을 사용한 곳에서 갚아 나가는 구조다.

물론 다수의 고객에게 자금을 빌려주다보니 제때 돈을 갚지 못하거나, 불의의 사고, 실직 등 다양한 이유로 인해서 돈을 갖지 못하는 위험이 발생된다. 이때 기대손실(Expected Loss)을 다음 구성요소를 가지고 정량화한다.

  • 채무 불이행 위험 : Probability of default
  • 채무 불이행 노출 : Exposure at default
  • 채무 불이행에 대한 손실 : Loss given default

기대손실=채무 불이행 위험×채무 불이행 노출×채무 불이행에 대한 손실

따라서 금융기관에서 자금을 빌려주기 전에 다양한 정보를 활용하여 채무 불이행 위험이 적은 고객을 선별하여 가능한 많은 금액을 빌려주어 매출과 수익을 극대화한다.

신용 모델의 이면의 아이디어는 채무불이행될 위험이 높은지를 예측할 수 있는 요소를 식별하는 것이다. 따라서 과거 은행 대출에 대한 대량 데이터, 대출의 채무불이행 여부, 신청자에 대한 정보를 입수해야 한다.

  • Application 정보: 나이, 결혼여부, 소득, 자가/전세 등
  • Behaviour 정보: 현재 은행잔고, 연체금액 등

1.1. Aim

- Classification using Machine Learning

기업 금융과 마찬가지로 소매 금융도 리스크 관리가 중요하다. 2007~2008년의 세계 금융 위기(리먼브라더스 사태)는 대출 업무의 투명성과 엄격함의 중요성을 부각시켰다. 기존 신용(시스템)의 유효성이 제한되면서 신용 평점에 따라 엄격하게 관리하는 것이 필요해졌고, 은행은 대출 시스템을 강화하고 위험 대출을 더 정확하게 찾아내기 위해 신용평점모형(Credit Scoring Model)을 머신 러닝을 활용하고 있다.

많은 나라의 정부기관은 대출 업무를 면밀히 감시하고 있기 때문에 경영진은 대출이 승인되고 거절되는 사유를 설명할 수 있어야만 한다. 이런 정보는 신용 평가가 만족스럽지 않은 이유를 확인하려는 고객들에게도 유용하다. 의사결정트리는 다른 알고리즘 모델에 비해 정확도는 떨어지지만 설명력이 높다. 이 때문에 은행 업계에서 널리 사용되고 있다.

자동화된 신용 평가 모델을 이용해 전화나 웹에서 대출 신청을 즉시 승인할 수 있다. Python은 컴퓨팅에 많은 라이브러리가 구축되어 있는 반면, R은 상대적으로 통계 기반이라 금융 위험 관리 분야에 라이브러리가 많다. 과거의 결과를 고려하여 미래의 성과를 정확하게 예측하기위한 모델을 훈련해보겠다(현실과 밀접한 신용 점수 할당 문제를 기계학습에서 대규모로 적용할 경우 풀어가는 방식을 R로 간단한 대출 승인 모델을 개발하고, 재정적 손실을 야기하는 오류를 최소화하기 위한 모델 성능을 개선할 예정이다).

1.2. Dataset

- 2007-2015의 LendingClub 대출자료

  • database.sqlite (134.64 MB) : sqlite 데이터베이스 파일 형식 대출 데이터 전체
  • LCDataDictionary.xlsx (20.5 KB) : 데이터 사전 (변수 설명)
  • loan.csv (105.01 MB) : csv 파일 형식 대출 데이터 전체

- Data description

idx LoanStatNew Description
0 addr_state The state provided by the borrower in the loan application
1 annual_inc The self-reported annual income provided by the borrower during registration.
2 annual_inc_joint The combined self-reported annual income provided by the co-borrowers during registration
3 application_type Indicates whether the loan is an individual application or a joint application with two co-borrowers
4 collection_recovery_fee post charge off collection fee
5 collections_12_mths_ex_med Number of collections in 12 months excluding medical collections
6 delinq_2yrs The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years
7 desc Loan description provided by the borrower
8 dti A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
9 dti_joint A ratio calculated using the co-borrowers’ total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers’ combined self-reported monthly income
10 earliest_cr_line The month the borrower’s earliest reported credit line was opened
11 emp_length Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and
12 emp_title The job title supplied by the Borrower when applying for the loan.
13 fico_range_high The upper boundary range the borrower’s FICO at loan origination belongs to.
14 fico_range_low The lower boundary range the borrower’s FICO at loan origination belongs to.
15 funded_amnt The total amount committed to that loan at that point in time.
16 funded_amnt_inv The total amount committed by investors for that loan at that point in time.
17 grade LC assigned loan grade
18 home_ownership The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
19 id A unique LC assigned ID for the loan listing.
20 initial_list_status The initial listing status of the loan. Possible values are – W, F
21 inq_last_6mths The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
22 installment The monthly payment owed by the borrower if the loan originates.
23 int_rate Interest Rate on the loan
24 is_inc_v Indicates if income was verified by LC, not verified, or if the income source was verified
25 issue_d The month which the loan was funded
26 last_credit_pull_d The most recent month LC pulled credit for this loan
27 last_fico_range_high The upper boundary range the borrower’s last FICO pulled belongs to.
28 last_fico_range_low The lower boundary range the borrower’s last FICO pulled belongs to.
29 last_pymnt_amnt Last total payment amount received
30 last_pymnt_d Last month payment was received
31 loan_amnt The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
32 loan_status Current status of the loan
33 member_id A unique LC assigned Id for the borrower member.
34 mths_since_last_delinq The number of months since the borrower’s last delinquency.
35 mths_since_last_major_derog Months since most recent 90-day or worse rating
36 mths_since_last_record The number of months since the last public record.
37 next_pymnt_d Next scheduled payment date
38 open_acc The number of open credit lines in the borrower’s credit file.
39 out_prncp Remaining outstanding principal for total amount funded
40 out_prncp_inv Remaining outstanding principal for portion of total amount funded by investors
41 policy_code publicly available policy_code=1 new products not publicly available policy_code=2
42 pub_rec Number of derogatory public records
43 purpose A category provided by the borrower for the loan request.
44 pymnt_plan Indicates if a payment plan has been put in place for the loan
45 recoveries post charge off gross recovery
46 revol_bal Total credit revolving balance
47 revol_util Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
48 sub_grade LC assigned loan subgrade
49 term The number of payments on the loan. Values are in months and can be either 36 or 60.
50 title The loan title provided by the borrower
51 total_acc The total number of credit lines currently in the borrower’s credit file
52 total_pymnt Payments received to date for total amount funded
53 total_pymnt_inv Payments received to date for portion of total amount funded by investors
54 total_rec_int Interest received to date
55 total_rec_late_fee Late fees received to date
56 total_rec_prncp Principal received to date
57 url URL for the LC page with listing data.
58 verified_status_joint Indicates if the co-borrowers’ joint income was verified by LC, not verified, or if the income source was verified
59 zip_code The first 3 numbers of the zip code provided by the borrower in the loan application.
60 open_acc_6m Number of open trades in last 6 months
61 open_il_6m Number of currently active installment trades
62 open_il_12m Number of installment accounts opened in past 12 months
63 open_il_24m Number of installment accounts opened in past 24 months
64 mths_since_rcnt_il Months since most recent installment accounts opened
65 total_bal_il Total current balance of all installment accounts
66 il_util Ratio of total current balance to high credit/credit limit on all install acct
67 open_rv_12m Number of revolving trades opened in past 12 months
68 open_rv_24m Number of revolving trades opened in past 24 months
69 max_bal_bc Maximum current balance owed on all revolving accounts
70 all_util Balance to credit limit on all trades
71 total_rev_hi_lim Total revolving high credit/credit limit
72 inq_fi Number of personal finance inquiries
73 total_cu_tl Number of finance trades
74 inq_last_12m Number of credit inquiries in past 12 months
75 acc_now_delinq The number of accounts on which the borrower is now delinquent.
76 tot_coll_amt Total collection amounts ever owed
77 tot_cur_bal Total current balance of all accounts

2. Collect the data

2.1. Setting

- Import Libraries

- Read the data

3. Exploratory Data Analysis 1

탐색적 데이터 분석(EDA)은 크게 두 가지 작업으로 분류할 수 있다.

1. 기초 통계를 통하여 Raw Data를 분석
2. 시각화를 통해 Raw Data를 분석

3.1. Preview the data

- Data size

dim(df)
## [1] 887379     74

- Data table

#datatable(head(df, 100), style="bootstrap", class="table-condensed", options = list(dom = 'tp', scrollX = TRUE, search=''))
#kable(head(df, 5))

3.2. Basic Statistics

- Data summary

colnames(df)

str(df)
summary(df)

Summary

  • 887,379개의 값이 존재
  • 74개의 변수들은 NA값이 많음
  • 다양한 대출 정보들이 존재

3.3. Unique values

- Variable별 Unique한 값 찾기

#for (name in names(df)){
#  print(sprintf("Unique values for %s:", name))
#  print(unique(df[[name]]))
#  cat('\n')
#  }

3.4. Duplicate rows

- Checking for any duplicate rows

sum(duplicated(df))
## [1] 0

4. Preprocessing

- Cleansing the data

  • 변수가 많으므로 불필요한 열을 제거
  • 먼저 데이터에서 결측치를 체크

4.1. Missing values

- Checking for any missing values

4.1.1. Visualization

options(repr.plot.width=6, repr.plot.height=8)
missing_data <- df %>% summarise_all(funs(sum(is.na(.))/n()))
missing_data <- gather(missing_data, key = "variables", value = "percent_missing") 
ggplot(missing_data, aes(x = reorder(variables, percent_missing), y = percent_missing)) +
  geom_bar(stat = "identity", fill = "red", aes(color = I('white')), size = 0.1)+coord_flip()+ theme_few()

4.1.2. Table

colSums(is.na(df))
##                          id                   member_id 
##                           0                           0 
##                   loan_amnt                 funded_amnt 
##                           0                           0 
##             funded_amnt_inv                        term 
##                           0                           0 
##                    int_rate                 installment 
##                           0                           0 
##                       grade                   sub_grade 
##                           0                           0 
##                   emp_title                  emp_length 
##                       51457                           0 
##              home_ownership                  annual_inc 
##                           0                           4 
##         verification_status                     issue_d 
##                           0                           0 
##                 loan_status                  pymnt_plan 
##                           0                           0 
##                         url                        desc 
##                           0                      761597 
##                     purpose                       title 
##                           0                         151 
##                    zip_code                  addr_state 
##                           0                           0 
##                         dti                 delinq_2yrs 
##                           0                          29 
##            earliest_cr_line              inq_last_6mths 
##                          29                          29 
##      mths_since_last_delinq      mths_since_last_record 
##                      454312                      750326 
##                    open_acc                     pub_rec 
##                          29                          29 
##                   revol_bal                  revol_util 
##                           0                         502 
##                   total_acc         initial_list_status 
##                          29                           0 
##                   out_prncp               out_prncp_inv 
##                           0                           0 
##                 total_pymnt             total_pymnt_inv 
##                           0                           0 
##             total_rec_prncp               total_rec_int 
##                           0                           0 
##          total_rec_late_fee                  recoveries 
##                           0                           0 
##     collection_recovery_fee                last_pymnt_d 
##                           0                       17659 
##             last_pymnt_amnt                next_pymnt_d 
##                           0                      252971 
##          last_credit_pull_d  collections_12_mths_ex_med 
##                          53                         145 
## mths_since_last_major_derog                 policy_code 
##                      665676                           0 
##            application_type            annual_inc_joint 
##                           0                      886868 
##                   dti_joint   verification_status_joint 
##                      886870                      886868 
##              acc_now_delinq                tot_coll_amt 
##                          29                       70276 
##                 tot_cur_bal                 open_acc_6m 
##                       70276                      866007 
##                  open_il_6m                 open_il_12m 
##                      866007                      866007 
##                 open_il_24m          mths_since_rcnt_il 
##                      866007                      866569 
##                total_bal_il                     il_util 
##                      866007                      868762 
##                 open_rv_12m                 open_rv_24m 
##                      866007                      866007 
##                  max_bal_bc                    all_util 
##                      866007                      866007 
##            total_rev_hi_lim                      inq_fi 
##                       70276                      866007 
##                 total_cu_tl                inq_last_12m 
##                      866007                      866007

4.1.3. Summary

  • 총 22개의 변수가 25% 이상의 NA값을 갖는다.
  • 4개의 열에는 25% 이상의 누락된 데이터가 있다.
  • 7개의 열에는 10% 미만의 누락된 데이터가 있다.

4.2. Cleansing the data

- 누락 된 데이터가 10 % 이상인 열 제거

loan_cleaned <- df[, -which(colMeans(is.na(df)) > 0.1)]

이제 49개의 변수가 저장된 loan_cleaned로 데이터 세트로 추가 분석을 시행할 것이다.

4.3. Removing duplicate variables

id, member_id, url 변수는 오직 대출 식별 목적으로 만든 고유한 값이므로 제거해도 된다. 우편 번호 열은 처음 세 자리만 있으면 되고, 정보는 주 변수에서 얻을 수 있다. emp_title도 약 6%의 NA 값을 갖고 있으며 해당 데이터에서 50,000행을 삭제하면 데이터 손실이 크므로 해당 변수를 제거하겠다.

loan_cleaned <- within(loan_cleaned, rm('member_id', 'id', 'url', 'emp_title', "title","zip_code"))
  • 또한 대출을 승인할지 여부를 결정할 때 존재하지 않을 변수를 제거하고 고객 정보 및 고객 인구 통계와 관련된 변수를 보유하겠다.

  • funded_amnt, funded_amnt_inv, issue_d, out_prncp, out_prncp_inv, total_pymnt, total_pymnt_inv, total_rec_int, total_rec_late_fee, total_rec_prncp 등과 같은 변수는 기본적으로 고객 지불 매개 변수이며 의사 결정 중에는 사용할 수 없다.

loan_cleaned <- loan_cleaned[, -c(2,3,13,28,29,30,31,32,33,34,35,36,37,38,40,43)]

4.4. Converting incorrect data types

범주형 변수이지만 문자열로 표시되는 일부 변수가 있으므로 형변환해야 한다.

- Convert String to Factor Variables

loan_cleaned$term <- as.factor(loan_cleaned$term)
loan_cleaned$grade <- as.factor(loan_cleaned$grade)
loan_cleaned$emp_length <- as.factor(loan_cleaned$emp_length)
loan_cleaned$home_ownership <- as.factor(loan_cleaned$home_ownership)
loan_cleaned$verification_status <- as.factor(loan_cleaned$verification_status)
loan_cleaned$loan_status <- as.factor(loan_cleaned$loan_status)
loan_cleaned$application_type <- as.factor(loan_cleaned$application_type)
loan_cleaned$pymnt_plan <- as.factor(loan_cleaned$pymnt_plan)
loan_cleaned$initial_list_status <- as.factor(loan_cleaned$initial_list_status)
loan_cleaned$policy_code <- as.factor(loan_cleaned$policy_code)
  • “policy_code”및 “pymnt_plan”변수는 대부분 모든 행에서 하나의 레벨을 가지므로 이를 제거할 수도 있다.
loan_cleaned <- within(loan_cleaned, rm("policy_code", "pymnt_plan"))

- 날짜형 변환

  • 날짜 열은 Chr 형식이므로 Date 형식으로 변환한다.
loan_cleaned$earliest_cr_line <- parse_date(loan_cleaned$earliest_cr_line,format =  "%b-%Y")
loan_cleaned$last_credit_pull_d <- parse_date(loan_cleaned$last_credit_pull_d,format =  "%b-%Y")

5. Exploratory Data Analysis 2

탐색적 데이터 분석(EDA)은 크게 두 가지 작업으로 분류할 수 있다.

  1. 기초 통계를 통하여 Raw Data를 분석

2. 시각화를 통해 Raw Data를 분석

- Univariate Analysis on Categorical Variables

5.1. Term (대출기간)

대출 상품의 기간은 36개월, 60개월 두 가지가 존재한다. 36개월이 약 70%를 차지한다.

options(repr.plot.width=5, repr.plot.height=3)

term_summary <- loan_cleaned %>% 
  group_by(term) %>%
  summarise(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

kable(term_summary, caption="2x2 Contingency Table on Term.", format="markdown")
term count pct
36 months 621125 0.6999546
60 months 266254 0.3000454
#crude_deporate <- deposit_summary$prob[deposit_summary$deposit=="yes"]
  
ggplot(data = term_summary, aes(x = term, y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  
  geom_text(aes(x = term, y = 0.05, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Term",
        y = "Percent",
        title = "Term Rate of Loan") +
  theme_economist()

5.2. Grade (신용등급)

해당 대출의 신용 위험도에 따라 각 등급(A-E)을 지정한다. B등급과 C등급의 대출금은 전체 대출금의 50% 이상(57%)을 차지한다.

grade_summary <- loan_cleaned %>%
  group_by(grade) %>%
  summarise(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(pct))

kable(grade_summary, caption="2x2 Contingency Table on Grade.", format="markdown")
grade count pct
B 254535 0.2868391
C 245860 0.2770631
A 148202 0.1670109
D 139542 0.1572519
E 70705 0.0796785
F 23046 0.0259709
G 5489 0.0061856
#crude_deporate <- deposit_summary$prob[deposit_summary$deposit=="yes"]
  
ggplot(data = grade_summary, aes(x = reorder(grade, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = grade, y = 0.05, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Grade",
        y = "Percent",
        title = "Grade Rate of Loan")+
  theme_economist()

5.3. Employment Length (재직 기간)

10년 이상 재직한 사람들이 대출을 많이 받는데, 생애주기상 이사나 자녀의 학업을 위해서 자금을 필요로 하는 것 같다.

emp_len_summary <- loan_cleaned %>%
  group_by(emp_length) %>%
  summarise(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

kable(emp_len_summary, caption="2x2 Contingency Table on Employment Length.", format="markdown")
emp_length count pct
10+ years 291569 0.3285732
2 years 78870 0.0888797
< 1 year 70605 0.0795658
3 years 70026 0.0789133
1 year 57095 0.0643412
5 years 55704 0.0627736
4 years 52529 0.0591957
n/a 44825 0.0505139
7 years 44594 0.0502536
8 years 43955 0.0495335
6 years 42950 0.0484010
9 years 34657 0.0390555
#crude_deporate <- deposit_summary$prob[deposit_summary$deposit=="yes"]
  
ggplot(data = emp_len_summary, aes(x = reorder(emp_length, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = emp_length, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Employment Length",
        y = "Percent",
        title = "Employment Length Rate of Loan") +
  theme_economist() + coord_flip() 

5.4. Home Ownership (주택 소유 여부)

월세 및 모기지 주택 소유자가 전체 대출의 약 90%를 차지한다.

home_ownership_summary <- loan_cleaned %>%
  group_by(home_ownership) %>%
  summarise(count = n()) %>%
  mutate(pct = count / sum(count)) %>%
  arrange(desc(pct))
  
kable(home_ownership_summary, caption="2x2 Contingency Table on Home Ownership.", format="markdown")
home_ownership count pct
MORTGAGE 443557 0.4998507
RENT 356117 0.4013133
OWN 87470 0.0985712
OTHER 182 0.0002051
NONE 50 0.0000563
ANY 3 0.0000034
ggplot(data = home_ownership_summary, aes(x = reorder(home_ownership, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = home_ownership, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Home Ownership",
        y = "Percent",
        title = "Home Ownership Rate of Loan") +
  theme_economist() + coord_flip() 

5.5. Verification Status (수입 상태)

실행된 대출의 30% 이상이 수입 상태에 대해 확인되지 않았다. 검증되지 않은 대출은 추후에 부실 대출을 야기할 가능성이 높다.

vs_summary <- loan_cleaned %>%
  group_by(verification_status) %>%
  summarise(count=n()) %>% 
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

kable(vs_summary, caption="2x2 Contingency Table on Verification Status", format="markdown")
verification_status count pct
Source Verified 329558 0.3713836
Verified 291071 0.3280120
Not Verified 266750 0.3006044
ggplot(data = vs_summary, aes(x = reorder(verification_status, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = verification_status, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Verification Status",
        y = "Percent",
        title = "Verification Status Rate of Loan") +
  theme_economist() + coord_flip() 

5.6. Purpose (대출 목적)

대출의 60%는 기존의 부채를 정리하기 위해 사용된다.

purpose_summary <- loan_cleaned %>%
  group_by(purpose) %>%
  summarise(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

kable(purpose_summary, caption="2x2 Contingency Table on Purpose", format="markdown")
purpose count pct
debt_consolidation 524215 0.5907453
credit_card 206182 0.2323494
home_improvement 51829 0.0584068
other 42894 0.0483379
major_purchase 17277 0.0194697
small_business 10377 0.0116940
car 8863 0.0099878
medical 8540 0.0096238
moving 5414 0.0061011
vacation 4736 0.0053371
house 3707 0.0041775
wedding 2347 0.0026449
renewable_energy 575 0.0006480
educational 423 0.0004767
ggplot(data = purpose_summary, aes(x = reorder(purpose, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = purpose, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Purpose",
        y = "Percent",
        title = "Purpose Rate of Loan") +
  theme_economist() + coord_flip() 

5.7. State (주)

대출금의 15% 이상이 캘리포니아 주에서 실행되며 뉴욕과 텍사스는 그 뒤를 잇는다(약 7~8%).

state_summary <- loan_cleaned %>%
  group_by(addr_state) %>%
  summarise(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

ggplot(data = state_summary, aes(x = reorder(addr_state, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  labs( x = "State",
        y = "Percent",
        title = "State Rate of Loan") +
  theme_economist() + coord_flip()

kable(state_summary, caption="2x2 Contingency Table on State.", format="markdown")
addr_state count pct
CA 129517 0.1459545
NY 74086 0.0834886
TX 71138 0.0801664
FL 60935 0.0686685
IL 35476 0.0399784
NJ 33256 0.0374767
PA 31393 0.0353772
OH 29631 0.0333916
GA 29085 0.0327763
VA 26255 0.0295871
NC 24720 0.0278573
MI 22985 0.0259021
MD 21031 0.0237001
MA 20593 0.0232065
AZ 20412 0.0230026
WA 19434 0.0219005
CO 18807 0.0211939
MN 15957 0.0179822
MO 14207 0.0160101
IN 13789 0.0155390
CT 13531 0.0152483
TN 12887 0.0145225
NV 12443 0.0140222
WI 11574 0.0130429
AL 11200 0.0126214
OR 10893 0.0122755
SC 10639 0.0119892
LA 10587 0.0119306
KY 8550 0.0096351
OK 8085 0.0091111
KS 7926 0.0089319
AR 6640 0.0074827
UT 6264 0.0070590
NM 4939 0.0055658
HI 4570 0.0051500
WV 4386 0.0049426
NH 4294 0.0048390
RI 3893 0.0043871
MS 3819 0.0043037
MT 2558 0.0028826
DE 2511 0.0028297
DC 2432 0.0027407
AK 2205 0.0024848
WY 2028 0.0022854
SD 1815 0.0020453
VT 1797 0.0020251
NE 1176 0.0013253
ME 525 0.0005916
ND 479 0.0005398
IA 14 0.0000158
ID 12 0.0000135

6. Exploratory Data Analysis 3

탐색적 데이터 분석(EDA)은 크게 두 가지 작업으로 분류할 수 있다.

  1. 기초 통계를 통하여 Raw Data를 분석

2. 시각화를 통해 Raw Data를 분석

- Segmented Univariate Analysis on Categorical Variables

6.1. Loan Ratio by Loan Status (대출 상태별 대출 비율)

default_summary <- loan_cleaned %>%
  group_by(loan_status) %>%
  summarise(cnt = n()) %>%
  mutate(pct = cnt / sum(cnt)) %>%
  arrange(desc(pct))

#crude_defualt_rate <- default_summary$prob[default_summary$loan_status=="yes"]

kable(default_summary, caption="2x2 Contingency Table on Default.", format="markdown")
loan_status cnt pct
Current 601779 0.6781533
Fully Paid 207723 0.2340860
Charged Off 45248 0.0509906
Late (31-120 days) 11591 0.0130621
Issued 8460 0.0095337
In Grace Period 6253 0.0070466
Late (16-30 days) 2357 0.0026561
Does not meet the credit policy. Status:Fully Paid 1988 0.0022403
Default 1219 0.0013737
Does not meet the credit policy. Status:Charged Off 761 0.0008576
ggplot(default_summary, aes(x=reorder(loan_status, pct), y=pct, colour=loan_status, fill=loan_status)) +
  geom_bar(stat="identity",aes(color = I('black')), size = 0.1)  +
  labs( x = "Loan_Status",
        y = "Percent",
        title = "Loan Ratio by Loan Status") +
  theme_economist() + guides(fill=FALSE) + coord_flip()

ggplot(loan_cleaned, aes(x =term, fill = loan_status)) +
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "Term",
        y = "Percent",
        title = "Loan Ratio by Loan Status") +
  theme_economist()

ggplot(loan_cleaned, aes(x =term, fill = loan_status)) +
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "Term",
        y = "Percent",
        title = "Loan Ratio by Loan Status") + 
  theme_economist()

- Bining

  • 타겟 변수(Loan_status)에는 10 개의 범주가 있다.
  • 부채 상환에 관련된 요인을 알아내는 것이 목표이므로, 정확한 결과를 알 수 있는 ‘Fully Paid’와 ’Charged Off’ 데이터만 고려할 것이다(이진 값을 갖는 새로운 열을 만들어서 fully paid는 1로, charged off는 0으로 변환할 것이다).
bad_indicators <- c("Charged Off",
                    "Default",
                    "Does not meet the credit policy. Status:Charged Off",
                    "In Grace Period", 
                    "Default Receiver", 
                    "Late (16-30 days)",
                    "Late (31-120 days)")

#loan_cleaned$loan_status_yn <- ifelse(loan_cleaned$loan_status %in% bad_indicators, 1, 0)
loan_cleaned$loan_status <- ifelse(loan_cleaned$loan_status %in% bad_indicators, "Bad", "Good")

loan_cleaned <- filter(loan_cleaned, loan_cleaned$loan_status == "Bad" | loan_cleaned$loan_status == "Good")

loan_cleaned <- mutate(loan_cleaned, binary_status=as.numeric(ifelse(loan_cleaned$loan_status == "Bad", 0, 1)))

#loan_cleaned <- filter(loan_cleaned, loan_cleaned$loan_status == "Fully Paid" | loan_cleaned$loan_status == "Charged Off" | loan_cleaned$loan_status == "Default")

#loan_cleaned <- mutate(loan_cleaned, binary_status=as.numeric(ifelse(loan_cleaned$loan_status == "Fully Paid", 1, 0)))

#head(loan_cleaned)
#kable(head(loan_cleaned, 5))

6.2. Default with respect to Term (기간별 채무 불이행 비율)

36개월로 설정한 대출보다 60개월로 설정한 대출의 채무 불이행이 조금 더 높은 비율을 차지한다.

ggplot(loan_cleaned, aes(x = term, fill = loan_status)) +
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "Term",
        y = "Percent",
        title = "Default with respect to Term") + 
  theme_economist()

ggplot(loan_cleaned, aes(x = term, fill = loan_status)) +
  geom_bar(stat='count', position ='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "Term",
        y = "Percent",
        title = "Default with respect to Term") + 
  theme_economist()

6.3. Default with respect to Grade (등급별 채무 불이행 비율)

등급이 내려갈수록 채무 불이행의 위험도가 증가하는 것을 확인할 수 있다.

ggplot(loan_cleaned, aes(x = grade, fill = loan_status)) +
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "Grade",
        y = "Percent",
        title = "Default with respect to Grade") + 
  scale_fill_discrete(name="Loan_Status") + theme_economist()

ggplot(loan_cleaned, aes(x = grade, fill = loan_status)) +
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "Grade",
        y = "Percent",
        title = "Default with respect to Grade") + 
  scale_fill_discrete(name="Loan_Status") + theme_economist()

6.4. Default with respect to Employee Length (재직 기간별 채무 불이행 비율)

재직 기간에 따른 대출 위험도의 유의미한 차이는 발견하지 못했다(거의 유사한 패턴을 보인다).

ggplot(filter(loan_cleaned, emp_length != 'n/a'), aes(x =emp_length, fill = loan_status)) +
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "emp_length",
        y = "Percent",
        title = "Default with respect to Employee Length") + 
  scale_fill_discrete(name="Loan_Status") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.8, hjust=1)) +
  theme_economist()

ggplot(filter(loan_cleaned, emp_length != 'n/a'), aes(x =emp_length, fill = loan_status)) +
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "emp_length",
        y = "Percent",
        title = "Default with respect to Employee Length") + 
  scale_fill_discrete(name="Loan_Status") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.8, hjust=1)) +
  theme_economist()

6.5. Default with respect to Home Ownership (자가 여부에 따른 채무 불이행 비율)

기타 형태의 주거 생활을하는 채무자의 채무 불이행 비율이 높다.

ggplot(loan_cleaned, aes(x =home_ownership, fill = loan_status)) + 
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "home_ownership",
        y = "Percent",
        title = "Default with respect to Home Ownership") +
  scale_fill_discrete(name="Loan_Status") +
  theme_economist()

ggplot(loan_cleaned, aes(x =home_ownership, fill = loan_status)) + 
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "home_ownership",
        y = "Percent",
        title = "Default with respect to Home Ownership") +
  scale_fill_discrete(name="Loan_Status") +
  theme_economist()

6.6. Verification status with respect to default (수입 상태에 따른 채무 불이행 비율)

수입이 확인되지 않은 상태보다 확인된 상태의 채무 불이행 비율이 더 높다.

ggplot(loan_cleaned, aes(x =verification_status, fill = loan_status)) + 
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "Verification_status",
        y = "Percent",
        title = "Verification status with respect to default") +
  theme_economist()

ggplot(loan_cleaned, aes(x =verification_status, fill = loan_status)) + 
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "Verification_status",
        y = "Percent",
        title = "Verification status with respect to default") +
  theme_economist()

6.7. Purpose of Loan with respect to Default (대출 목적에 따른 채무 불이행 비율)

Educational(21%)과 Small Business(17%)의 채무 불이행 비율이 다른 대출 목적에 비해 높다.

purpose_summary_2 <- loan_cleaned %>%
  group_by(purpose) %>%
  summarise(pct = (1-sum(binary_status)/n())) %>%
  arrange(desc(pct))

kable(purpose_summary_2, format="markdown")
purpose pct
educational 0.2080378
small_business 0.1716296
renewable_energy 0.1252174
wedding 0.1231359
house 0.1124899
moving 0.1121167
other 0.0970532
medical 0.0941452
vacation 0.0848818
debt_consolidation 0.0793720
major_purchase 0.0748973
home_improvement 0.0685524
car 0.0672459
credit_card 0.0570127
ggplot(data = purpose_summary_2, aes(x = reorder(purpose, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2, aes(color = I('black')), size = 0.1) +
   labs( x = "Purpose",
        y = "Percent",
        title = "Purpose of Loan with respect to Default") +
  theme_economist() + coord_flip()

6.8. State with respect to default (지역에 따른 채무 불이행 비율)

테네시(Tenesse) 주의 채무 불이행의 비율(20%)이 다른 지역에 비해 상대적으로 높다.

state_summary <- loan_cleaned %>%
  group_by(addr_state) %>%
  summarise(pct = (1-sum(binary_status)/n()))

ggplot(data = state_summary, aes(x = reorder(addr_state, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2, aes(color = I('white')), size = 0.1) +
  labs( x = "States",
        y = "Percent",
        title = "State of Loan with respect to Default") +
  theme_economist() + coord_flip()

6.9. Loan Amount and Loan Status (대출액과 대출 상태 시각화 비교)

대출 금액이 10,000달러 이상일 때부터, 채무불이행이 발생할 수 있다.

ggplot(loan_cleaned, aes(x= loan_amnt)) +
  geom_density(aes(fill = as.factor(loan_status))) +
  xlab("Loan_amount") + theme_economist()

6.10. Interest Rate and Loan Status (이자율에 따른 채무 불이행 비율)

높은 이자율은(즉, 낮은 신용 등급을 가진 사람들은) 채무 불이행의 위험도가 높다.

ggplot(loan_cleaned, aes(x= int_rate, fill = loan_status)) +
geom_histogram(bins = 10, position = "fill", aes(color = I('black')), size = 0.1)+ 
xlab("Interest Rate")+ 
ylab("Percent of default Vs No default")+theme_few()

ggplot(loan_cleaned, aes(x = loan_status, y = int_rate, fill = loan_status)) + geom_boxplot()

6.11. Credit History (신용 활동 기간에 따른 채무 불이행 비율)

신용 활동 기간이 길수록 금리가 줄어들고, 채무 불이행의 위험도가 낮아진다.

loan_cleaned <- mutate(loan_cleaned, credit_history.d = year(earliest_cr_line))
loan_cleaned <- mutate(loan_cleaned, pull.d = year(last_credit_pull_d))

# Credit History Grouping
loan_cleaned <- mutate(loan_cleaned, cr_hist = pull.d-credit_history.d)
loan_cleaned <- mutate(loan_cleaned, cr_hist_grp = cr_hist)
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist <= 5] <- '<5_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 5 & loan_cleaned$cr_hist <= 10] <- '5-10_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 10 & loan_cleaned$cr_hist <= 15] <- '10-15_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 15 & loan_cleaned$cr_hist <= 20] <- '15-20_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 20] <- '>20_yrs'

ggplot(filter(loan_cleaned,cr_hist != "NA") , aes(x= reorder(cr_hist_grp, cr_hist), fill = loan_status)) + 
  geom_bar(position = "fill", aes(color = I('black')), size = 0.1) +
  xlab("Credit_History_Length") +
  ylab("Percent") +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.8, hjust=1)) +
  theme_economist()

6.12. Income Group (소득에 따른 채무 불이행 비율)

소득이 낮을수록 채무 불이행의 위험도는 증가한다.

loan_cleaned <- mutate(loan_cleaned, inc_grp = annual_inc)

loan_cleaned$inc_grp[loan_cleaned$annual_inc <= 50000] <- "verylow_inc"
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 50000 & loan_cleaned$annual_inc <= 80000] <- 'low_inc'
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 80000 & loan_cleaned$annual_inc <= 120000] <- 'middle_inc'
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 120000 & loan_cleaned$annual_inc <= 240000] <- 'high_inc'
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 240000] <- 'veryhigh_inc'

ggplot(loan_cleaned, aes(x= reorder(inc_grp, annual_inc), fill = loan_status, order = loan_status)) +
  geom_bar(position = "fill", aes(color = I('black')), size = 0.1) +
  xlab("Income_groups") +
  ylab("Percent of default Vs No default") +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.8, hjust=1)) +
  theme_economist()

6.13. Loan amount distribution (대출금액 분포)

Desc(df$loan_amnt, main = "Loan amount distribution", plotit = TRUE)
## ------------------------------------------------------------------------- 
## Loan amount distribution
## 
##      length         n       NAs     unique         0s       mean
##     887'379   887'379         0      1'372          0  14'755.26
##                100.0%      0.0%                  0.0%           
##                                                                 
##         .05       .10       .25     median        .75        .90
##    3'600.00  5'000.00  8'000.00  13'000.00  20'000.00  28'000.00
##                                                                 
##       range        sd     vcoef        mad        IQR       skew
##   34'500.00  8'435.46      0.57   8'599.08  12'000.00       0.68
##                                                                 
##      meanCI
##   14'737.71
##   14'772.82
##            
##         .95
##   32'000.00
##            
##        kurt
##       -0.26
##            
## lowest : 500.0 (11), 550.0, 600.0 (6), 700.0 (3), 725.0
## highest: 34'900.0 (14), 34'925.0 (9), 34'950.0 (18), 34'975.0 (31), 35'000.0 (36'368)

6.14. Year of Loan Issue (년도/등급별 대출 현황)

#df$issue_d <- parse_date_time(gsub("^", "01-", df$issue_d), orders = c("d-m-y", "d-B-Y", "m/d/y"))
df$issue_d <- as.Date(gsub("^", "01-", df$issue_d), format="%d-%b-%Y")

df$issue_year <- year(as.yearmon(df$issue_d,"%b-%Y"))

d <- sqldf("
  select
    avg(loan_amnt) as avg_amnt
    , grade
    , issue_year
  from df
  group by grade, issue_year
")

g <- ggplot(d, aes(x = issue_year , y = avg_amnt, color = grade)) +
  geom_line(alpha = 0.4) +
  labs(x = "Year of Loan Issue", y = "Average of Loan Amount issued", title = "Year of Loan Issue") +
  theme_economist()

ggplotly(g, tooltip = c("grade"))
#df$issue_d <- as.Date(gsub("^", "01-", df$issue_d), format="%d-%b-%Y")
#df$issue_year <- year(as.yearmon(df$issue_d,"%b-%Y"))

d <- sqldf("
  select
    grade
    , issue_year
    , count(*) as freq
  from df
  group by 1, 2
")

head(d)
##   grade issue_year  freq
## 1     A       2007    78
## 2     A       2008   318
## 3     A       2009  1203
## 4     A       2010  2830
## 5     A       2011  5754
## 6     A       2012 10901
g <- ggplot(d, aes(x = issue_year , y = freq, color = grade)) +
  geom_line(alpha = 0.4) +
  labs(x = "Year of Loan Issue", y = "Count of Loan issued", title = "Year of Loan Issue") +
  theme_economist()

ggplotly(g, tooltip = c("grade"))

6.15. Loans Amount Distribution by Loan Status (대출 상태별 대출금 분포)

box_status <- ggplot(df, aes(loan_status, loan_amnt))
box_status +
#  geom_boxplot(aes(fill = loan_status)) +
  geom_boxplot(fill = fillColor2) +
  theme(#axis.text.x = element_text(angle=90, hjust = 1),
        axis.text.x=element_blank(),
        legend.position="null") +
  labs(list(
    title = "Loans by Loan Status",
    x = "Loan Status",
    y = "Loan")) +
  theme_economist() +
  guides(fill=FALSE) +
  coord_flip()