신용 평가와 대출 승인 결정은 금융 기관에서 가장 중요한 업무 중 하나이다. 기본적으로 금융기관에서는 한국은행을 비롯한 다양한 곳에서 자금을 조달하여 이를 관리하고 있다가 자금을 필요로 하는 곳에 자금을 빌려주고 이에 상응하는 이자를 받아 수익을 얻는 것으로 볼 수 있다. 근본적으로 많은 금액을 빌려주고 이를 나누어서 자금을 사용한 곳에서 갚아 나가는 구조다.
물론 다수의 고객에게 자금을 빌려주다보니 제때 돈을 갚지 못하거나, 불의의 사고, 실직 등 다양한 이유로 인해서 돈을 갖지 못하는 위험이 발생된다. 이때 기대손실(Expected Loss)을 다음 구성요소를 가지고 정량화한다.
기대손실=채무 불이행 위험×채무 불이행 노출×채무 불이행에 대한 손실
따라서 금융기관에서 자금을 빌려주기 전에 다양한 정보를 활용하여 채무 불이행 위험이 적은 고객을 선별하여 가능한 많은 금액을 빌려주어 매출과 수익을 극대화한다.
신용 모델의 이면의 아이디어는 채무불이행될 위험이 높은지를 예측할 수 있는 요소를 식별하는 것이다. 따라서 과거 은행 대출에 대한 대량 데이터, 대출의 채무불이행 여부, 신청자에 대한 정보를 입수해야 한다.
- Classification using Machine Learning
기업 금융과 마찬가지로 소매 금융도 리스크 관리가 중요하다. 2007~2008년의 세계 금융 위기(리먼브라더스 사태)는 대출 업무의 투명성과 엄격함의 중요성을 부각시켰다. 기존 신용(시스템)의 유효성이 제한되면서 신용 평점에 따라 엄격하게 관리하는 것이 필요해졌고, 은행은 대출 시스템을 강화하고 위험 대출을 더 정확하게 찾아내기 위해 신용평점모형(Credit Scoring Model)을 머신 러닝을 활용하고 있다.
많은 나라의 정부기관은 대출 업무를 면밀히 감시하고 있기 때문에 경영진은 대출이 승인되고 거절되는 사유를 설명할 수 있어야만 한다. 이런 정보는 신용 평가가 만족스럽지 않은 이유를 확인하려는 고객들에게도 유용하다. 의사결정트리는 다른 알고리즘 모델에 비해 정확도는 떨어지지만 설명력이 높다. 이 때문에 은행 업계에서 널리 사용되고 있다.
자동화된 신용 평가 모델을 이용해 전화나 웹에서 대출 신청을 즉시 승인할 수 있다. Python은 컴퓨팅에 많은 라이브러리가 구축되어 있는 반면, R은 상대적으로 통계 기반이라 금융 위험 관리 분야에 라이브러리가 많다. 과거의 결과를 고려하여 미래의 성과를 정확하게 예측하기위한 모델을 훈련해보겠다(현실과 밀접한 신용 점수 할당 문제를 기계학습에서 대규모로 적용할 경우 풀어가는 방식을 R로 간단한 대출 승인 모델을 개발하고, 재정적 손실을 야기하는 오류를 최소화하기 위한 모델 성능을 개선할 예정이다).
- 2007-2015의 LendingClub 대출자료
- Data description
| idx | LoanStatNew | Description |
|---|---|---|
| 0 | addr_state | The state provided by the borrower in the loan application |
| 1 | annual_inc | The self-reported annual income provided by the borrower during registration. |
| 2 | annual_inc_joint | The combined self-reported annual income provided by the co-borrowers during registration |
| 3 | application_type | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
| 4 | collection_recovery_fee | post charge off collection fee |
| 5 | collections_12_mths_ex_med | Number of collections in 12 months excluding medical collections |
| 6 | delinq_2yrs | The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years |
| 7 | desc | Loan description provided by the borrower |
| 8 | dti | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| 9 | dti_joint | A ratio calculated using the co-borrowers’ total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers’ combined self-reported monthly income |
| 10 | earliest_cr_line | The month the borrower’s earliest reported credit line was opened |
| 11 | emp_length | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and |
| 12 | emp_title | The job title supplied by the Borrower when applying for the loan. |
| 13 | fico_range_high | The upper boundary range the borrower’s FICO at loan origination belongs to. |
| 14 | fico_range_low | The lower boundary range the borrower’s FICO at loan origination belongs to. |
| 15 | funded_amnt | The total amount committed to that loan at that point in time. |
| 16 | funded_amnt_inv | The total amount committed by investors for that loan at that point in time. |
| 17 | grade | LC assigned loan grade |
| 18 | home_ownership | The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER. |
| 19 | id | A unique LC assigned ID for the loan listing. |
| 20 | initial_list_status | The initial listing status of the loan. Possible values are – W, F |
| 21 | inq_last_6mths | The number of inquiries in past 6 months (excluding auto and mortgage inquiries) |
| 22 | installment | The monthly payment owed by the borrower if the loan originates. |
| 23 | int_rate | Interest Rate on the loan |
| 24 | is_inc_v | Indicates if income was verified by LC, not verified, or if the income source was verified |
| 25 | issue_d | The month which the loan was funded |
| 26 | last_credit_pull_d | The most recent month LC pulled credit for this loan |
| 27 | last_fico_range_high | The upper boundary range the borrower’s last FICO pulled belongs to. |
| 28 | last_fico_range_low | The lower boundary range the borrower’s last FICO pulled belongs to. |
| 29 | last_pymnt_amnt | Last total payment amount received |
| 30 | last_pymnt_d | Last month payment was received |
| 31 | loan_amnt | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. |
| 32 | loan_status | Current status of the loan |
| 33 | member_id | A unique LC assigned Id for the borrower member. |
| 34 | mths_since_last_delinq | The number of months since the borrower’s last delinquency. |
| 35 | mths_since_last_major_derog | Months since most recent 90-day or worse rating |
| 36 | mths_since_last_record | The number of months since the last public record. |
| 37 | next_pymnt_d | Next scheduled payment date |
| 38 | open_acc | The number of open credit lines in the borrower’s credit file. |
| 39 | out_prncp | Remaining outstanding principal for total amount funded |
| 40 | out_prncp_inv | Remaining outstanding principal for portion of total amount funded by investors |
| 41 | policy_code | publicly available policy_code=1 new products not publicly available policy_code=2 |
| 42 | pub_rec | Number of derogatory public records |
| 43 | purpose | A category provided by the borrower for the loan request. |
| 44 | pymnt_plan | Indicates if a payment plan has been put in place for the loan |
| 45 | recoveries | post charge off gross recovery |
| 46 | revol_bal | Total credit revolving balance |
| 47 | revol_util | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. |
| 48 | sub_grade | LC assigned loan subgrade |
| 49 | term | The number of payments on the loan. Values are in months and can be either 36 or 60. |
| 50 | title | The loan title provided by the borrower |
| 51 | total_acc | The total number of credit lines currently in the borrower’s credit file |
| 52 | total_pymnt | Payments received to date for total amount funded |
| 53 | total_pymnt_inv | Payments received to date for portion of total amount funded by investors |
| 54 | total_rec_int | Interest received to date |
| 55 | total_rec_late_fee | Late fees received to date |
| 56 | total_rec_prncp | Principal received to date |
| 57 | url | URL for the LC page with listing data. |
| 58 | verified_status_joint | Indicates if the co-borrowers’ joint income was verified by LC, not verified, or if the income source was verified |
| 59 | zip_code | The first 3 numbers of the zip code provided by the borrower in the loan application. |
| 60 | open_acc_6m | Number of open trades in last 6 months |
| 61 | open_il_6m | Number of currently active installment trades |
| 62 | open_il_12m | Number of installment accounts opened in past 12 months |
| 63 | open_il_24m | Number of installment accounts opened in past 24 months |
| 64 | mths_since_rcnt_il | Months since most recent installment accounts opened |
| 65 | total_bal_il | Total current balance of all installment accounts |
| 66 | il_util | Ratio of total current balance to high credit/credit limit on all install acct |
| 67 | open_rv_12m | Number of revolving trades opened in past 12 months |
| 68 | open_rv_24m | Number of revolving trades opened in past 24 months |
| 69 | max_bal_bc | Maximum current balance owed on all revolving accounts |
| 70 | all_util | Balance to credit limit on all trades |
| 71 | total_rev_hi_lim Total revolving high credit/credit limit | |
| 72 | inq_fi | Number of personal finance inquiries |
| 73 | total_cu_tl | Number of finance trades |
| 74 | inq_last_12m | Number of credit inquiries in past 12 months |
| 75 | acc_now_delinq | The number of accounts on which the borrower is now delinquent. |
| 76 | tot_coll_amt | Total collection amounts ever owed |
| 77 | tot_cur_bal | Total current balance of all accounts |
- Import Libraries
- Read the data
탐색적 데이터 분석(EDA)은 크게 두 가지 작업으로 분류할 수 있다.
1. 기초 통계를 통하여 Raw Data를 분석
2. 시각화를 통해 Raw Data를 분석
- Data size
dim(df)
## [1] 887379 74
- Data table
#datatable(head(df, 100), style="bootstrap", class="table-condensed", options = list(dom = 'tp', scrollX = TRUE, search=''))
#kable(head(df, 5))
- Data summary
colnames(df)
str(df)
summary(df)
- Variable별 Unique한 값 찾기
#for (name in names(df)){
# print(sprintf("Unique values for %s:", name))
# print(unique(df[[name]]))
# cat('\n')
# }
- Checking for any duplicate rows
sum(duplicated(df))
## [1] 0
- Cleansing the data
- Checking for any missing values
options(repr.plot.width=6, repr.plot.height=8)
missing_data <- df %>% summarise_all(funs(sum(is.na(.))/n()))
missing_data <- gather(missing_data, key = "variables", value = "percent_missing")
ggplot(missing_data, aes(x = reorder(variables, percent_missing), y = percent_missing)) +
geom_bar(stat = "identity", fill = "red", aes(color = I('white')), size = 0.1)+coord_flip()+ theme_few()
colSums(is.na(df))
## id member_id
## 0 0
## loan_amnt funded_amnt
## 0 0
## funded_amnt_inv term
## 0 0
## int_rate installment
## 0 0
## grade sub_grade
## 0 0
## emp_title emp_length
## 51457 0
## home_ownership annual_inc
## 0 4
## verification_status issue_d
## 0 0
## loan_status pymnt_plan
## 0 0
## url desc
## 0 761597
## purpose title
## 0 151
## zip_code addr_state
## 0 0
## dti delinq_2yrs
## 0 29
## earliest_cr_line inq_last_6mths
## 29 29
## mths_since_last_delinq mths_since_last_record
## 454312 750326
## open_acc pub_rec
## 29 29
## revol_bal revol_util
## 0 502
## total_acc initial_list_status
## 29 0
## out_prncp out_prncp_inv
## 0 0
## total_pymnt total_pymnt_inv
## 0 0
## total_rec_prncp total_rec_int
## 0 0
## total_rec_late_fee recoveries
## 0 0
## collection_recovery_fee last_pymnt_d
## 0 17659
## last_pymnt_amnt next_pymnt_d
## 0 252971
## last_credit_pull_d collections_12_mths_ex_med
## 53 145
## mths_since_last_major_derog policy_code
## 665676 0
## application_type annual_inc_joint
## 0 886868
## dti_joint verification_status_joint
## 886870 886868
## acc_now_delinq tot_coll_amt
## 29 70276
## tot_cur_bal open_acc_6m
## 70276 866007
## open_il_6m open_il_12m
## 866007 866007
## open_il_24m mths_since_rcnt_il
## 866007 866569
## total_bal_il il_util
## 866007 868762
## open_rv_12m open_rv_24m
## 866007 866007
## max_bal_bc all_util
## 866007 866007
## total_rev_hi_lim inq_fi
## 70276 866007
## total_cu_tl inq_last_12m
## 866007 866007
- 누락 된 데이터가 10 % 이상인 열 제거
loan_cleaned <- df[, -which(colMeans(is.na(df)) > 0.1)]
이제 49개의 변수가 저장된 loan_cleaned로 데이터 세트로 추가 분석을 시행할 것이다.
id, member_id, url 변수는 오직 대출 식별 목적으로 만든 고유한 값이므로 제거해도 된다. 우편 번호 열은 처음 세 자리만 있으면 되고, 정보는 주 변수에서 얻을 수 있다. emp_title도 약 6%의 NA 값을 갖고 있으며 해당 데이터에서 50,000행을 삭제하면 데이터 손실이 크므로 해당 변수를 제거하겠다.
loan_cleaned <- within(loan_cleaned, rm('member_id', 'id', 'url', 'emp_title', "title","zip_code"))
또한 대출을 승인할지 여부를 결정할 때 존재하지 않을 변수를 제거하고 고객 정보 및 고객 인구 통계와 관련된 변수를 보유하겠다.
funded_amnt, funded_amnt_inv, issue_d, out_prncp, out_prncp_inv, total_pymnt, total_pymnt_inv, total_rec_int, total_rec_late_fee, total_rec_prncp 등과 같은 변수는 기본적으로 고객 지불 매개 변수이며 의사 결정 중에는 사용할 수 없다.
loan_cleaned <- loan_cleaned[, -c(2,3,13,28,29,30,31,32,33,34,35,36,37,38,40,43)]
범주형 변수이지만 문자열로 표시되는 일부 변수가 있으므로 형변환해야 한다.
- Convert String to Factor Variables
loan_cleaned$term <- as.factor(loan_cleaned$term)
loan_cleaned$grade <- as.factor(loan_cleaned$grade)
loan_cleaned$emp_length <- as.factor(loan_cleaned$emp_length)
loan_cleaned$home_ownership <- as.factor(loan_cleaned$home_ownership)
loan_cleaned$verification_status <- as.factor(loan_cleaned$verification_status)
loan_cleaned$loan_status <- as.factor(loan_cleaned$loan_status)
loan_cleaned$application_type <- as.factor(loan_cleaned$application_type)
loan_cleaned$pymnt_plan <- as.factor(loan_cleaned$pymnt_plan)
loan_cleaned$initial_list_status <- as.factor(loan_cleaned$initial_list_status)
loan_cleaned$policy_code <- as.factor(loan_cleaned$policy_code)
loan_cleaned <- within(loan_cleaned, rm("policy_code", "pymnt_plan"))
- 날짜형 변환
loan_cleaned$earliest_cr_line <- parse_date(loan_cleaned$earliest_cr_line,format = "%b-%Y")
loan_cleaned$last_credit_pull_d <- parse_date(loan_cleaned$last_credit_pull_d,format = "%b-%Y")
탐색적 데이터 분석(EDA)은 크게 두 가지 작업으로 분류할 수 있다.
2. 시각화를 통해 Raw Data를 분석
- Univariate Analysis on Categorical Variables
대출 상품의 기간은 36개월, 60개월 두 가지가 존재한다. 36개월이 약 70%를 차지한다.
options(repr.plot.width=5, repr.plot.height=3)
term_summary <- loan_cleaned %>%
group_by(term) %>%
summarise(count=n()) %>%
mutate(pct=count/sum(count)) %>%
arrange(desc(pct))
kable(term_summary, caption="2x2 Contingency Table on Term.", format="markdown")
| term | count | pct |
|---|---|---|
| 36 months | 621125 | 0.6999546 |
| 60 months | 266254 | 0.3000454 |
#crude_deporate <- deposit_summary$prob[deposit_summary$deposit=="yes"]
ggplot(data = term_summary, aes(x = term, y = pct)) +
geom_bar(stat = "identity", fill = fillColor2) +
geom_text(aes(x = term, y = 0.05, label = paste0("(",round(pct,2)*100, "%)",sep="")),
hjust=0.5, vjust=.5, size = 4, colour = 'black',
fontface = 'bold') +
labs( x = "Term",
y = "Percent",
title = "Term Rate of Loan") +
theme_economist()
해당 대출의 신용 위험도에 따라 각 등급(A-E)을 지정한다. B등급과 C등급의 대출금은 전체 대출금의 50% 이상(57%)을 차지한다.
grade_summary <- loan_cleaned %>%
group_by(grade) %>%
summarise(count = n()) %>%
mutate(pct = count/sum(count)) %>%
arrange(desc(pct))
kable(grade_summary, caption="2x2 Contingency Table on Grade.", format="markdown")
| grade | count | pct |
|---|---|---|
| B | 254535 | 0.2868391 |
| C | 245860 | 0.2770631 |
| A | 148202 | 0.1670109 |
| D | 139542 | 0.1572519 |
| E | 70705 | 0.0796785 |
| F | 23046 | 0.0259709 |
| G | 5489 | 0.0061856 |
#crude_deporate <- deposit_summary$prob[deposit_summary$deposit=="yes"]
ggplot(data = grade_summary, aes(x = reorder(grade, pct), y = pct)) +
geom_bar(stat = "identity", fill = fillColor2) +
geom_text(aes(x = grade, y = 0.05, label = paste0("(",round(pct,2)*100, "%)",sep="")),
hjust=0.5, vjust=.5, size = 4, colour = 'black',
fontface = 'bold') +
labs( x = "Grade",
y = "Percent",
title = "Grade Rate of Loan")+
theme_economist()
10년 이상 재직한 사람들이 대출을 많이 받는데, 생애주기상 이사나 자녀의 학업을 위해서 자금을 필요로 하는 것 같다.
emp_len_summary <- loan_cleaned %>%
group_by(emp_length) %>%
summarise(count=n()) %>%
mutate(pct=count/sum(count)) %>%
arrange(desc(pct))
kable(emp_len_summary, caption="2x2 Contingency Table on Employment Length.", format="markdown")
| emp_length | count | pct |
|---|---|---|
| 10+ years | 291569 | 0.3285732 |
| 2 years | 78870 | 0.0888797 |
| < 1 year | 70605 | 0.0795658 |
| 3 years | 70026 | 0.0789133 |
| 1 year | 57095 | 0.0643412 |
| 5 years | 55704 | 0.0627736 |
| 4 years | 52529 | 0.0591957 |
| n/a | 44825 | 0.0505139 |
| 7 years | 44594 | 0.0502536 |
| 8 years | 43955 | 0.0495335 |
| 6 years | 42950 | 0.0484010 |
| 9 years | 34657 | 0.0390555 |
#crude_deporate <- deposit_summary$prob[deposit_summary$deposit=="yes"]
ggplot(data = emp_len_summary, aes(x = reorder(emp_length, pct), y = pct)) +
geom_bar(stat = "identity", fill = fillColor2) +
geom_text(aes(x = emp_length, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
hjust=0.5, vjust=.5, size = 4, colour = 'black',
fontface = 'bold') +
labs( x = "Employment Length",
y = "Percent",
title = "Employment Length Rate of Loan") +
theme_economist() + coord_flip()
월세 및 모기지 주택 소유자가 전체 대출의 약 90%를 차지한다.
home_ownership_summary <- loan_cleaned %>%
group_by(home_ownership) %>%
summarise(count = n()) %>%
mutate(pct = count / sum(count)) %>%
arrange(desc(pct))
kable(home_ownership_summary, caption="2x2 Contingency Table on Home Ownership.", format="markdown")
| home_ownership | count | pct |
|---|---|---|
| MORTGAGE | 443557 | 0.4998507 |
| RENT | 356117 | 0.4013133 |
| OWN | 87470 | 0.0985712 |
| OTHER | 182 | 0.0002051 |
| NONE | 50 | 0.0000563 |
| ANY | 3 | 0.0000034 |
ggplot(data = home_ownership_summary, aes(x = reorder(home_ownership, pct), y = pct)) +
geom_bar(stat = "identity", fill = fillColor2) +
geom_text(aes(x = home_ownership, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
hjust=0.5, vjust=.5, size = 4, colour = 'black',
fontface = 'bold') +
labs( x = "Home Ownership",
y = "Percent",
title = "Home Ownership Rate of Loan") +
theme_economist() + coord_flip()
실행된 대출의 30% 이상이 수입 상태에 대해 확인되지 않았다. 검증되지 않은 대출은 추후에 부실 대출을 야기할 가능성이 높다.
vs_summary <- loan_cleaned %>%
group_by(verification_status) %>%
summarise(count=n()) %>%
mutate(pct=count/sum(count)) %>%
arrange(desc(pct))
kable(vs_summary, caption="2x2 Contingency Table on Verification Status", format="markdown")
| verification_status | count | pct |
|---|---|---|
| Source Verified | 329558 | 0.3713836 |
| Verified | 291071 | 0.3280120 |
| Not Verified | 266750 | 0.3006044 |
ggplot(data = vs_summary, aes(x = reorder(verification_status, pct), y = pct)) +
geom_bar(stat = "identity", fill = fillColor2) +
geom_text(aes(x = verification_status, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
hjust=0.5, vjust=.5, size = 4, colour = 'black',
fontface = 'bold') +
labs( x = "Verification Status",
y = "Percent",
title = "Verification Status Rate of Loan") +
theme_economist() + coord_flip()
대출의 60%는 기존의 부채를 정리하기 위해 사용된다.
purpose_summary <- loan_cleaned %>%
group_by(purpose) %>%
summarise(count=n()) %>%
mutate(pct=count/sum(count)) %>%
arrange(desc(pct))
kable(purpose_summary, caption="2x2 Contingency Table on Purpose", format="markdown")
| purpose | count | pct |
|---|---|---|
| debt_consolidation | 524215 | 0.5907453 |
| credit_card | 206182 | 0.2323494 |
| home_improvement | 51829 | 0.0584068 |
| other | 42894 | 0.0483379 |
| major_purchase | 17277 | 0.0194697 |
| small_business | 10377 | 0.0116940 |
| car | 8863 | 0.0099878 |
| medical | 8540 | 0.0096238 |
| moving | 5414 | 0.0061011 |
| vacation | 4736 | 0.0053371 |
| house | 3707 | 0.0041775 |
| wedding | 2347 | 0.0026449 |
| renewable_energy | 575 | 0.0006480 |
| educational | 423 | 0.0004767 |
ggplot(data = purpose_summary, aes(x = reorder(purpose, pct), y = pct)) +
geom_bar(stat = "identity", fill = fillColor2) +
geom_text(aes(x = purpose, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
hjust=0.5, vjust=.5, size = 4, colour = 'black',
fontface = 'bold') +
labs( x = "Purpose",
y = "Percent",
title = "Purpose Rate of Loan") +
theme_economist() + coord_flip()
대출금의 15% 이상이 캘리포니아 주에서 실행되며 뉴욕과 텍사스는 그 뒤를 잇는다(약 7~8%).
state_summary <- loan_cleaned %>%
group_by(addr_state) %>%
summarise(count=n()) %>%
mutate(pct=count/sum(count)) %>%
arrange(desc(pct))
ggplot(data = state_summary, aes(x = reorder(addr_state, pct), y = pct)) +
geom_bar(stat = "identity", fill = fillColor2) +
labs( x = "State",
y = "Percent",
title = "State Rate of Loan") +
theme_economist() + coord_flip()
kable(state_summary, caption="2x2 Contingency Table on State.", format="markdown")
| addr_state | count | pct |
|---|---|---|
| CA | 129517 | 0.1459545 |
| NY | 74086 | 0.0834886 |
| TX | 71138 | 0.0801664 |
| FL | 60935 | 0.0686685 |
| IL | 35476 | 0.0399784 |
| NJ | 33256 | 0.0374767 |
| PA | 31393 | 0.0353772 |
| OH | 29631 | 0.0333916 |
| GA | 29085 | 0.0327763 |
| VA | 26255 | 0.0295871 |
| NC | 24720 | 0.0278573 |
| MI | 22985 | 0.0259021 |
| MD | 21031 | 0.0237001 |
| MA | 20593 | 0.0232065 |
| AZ | 20412 | 0.0230026 |
| WA | 19434 | 0.0219005 |
| CO | 18807 | 0.0211939 |
| MN | 15957 | 0.0179822 |
| MO | 14207 | 0.0160101 |
| IN | 13789 | 0.0155390 |
| CT | 13531 | 0.0152483 |
| TN | 12887 | 0.0145225 |
| NV | 12443 | 0.0140222 |
| WI | 11574 | 0.0130429 |
| AL | 11200 | 0.0126214 |
| OR | 10893 | 0.0122755 |
| SC | 10639 | 0.0119892 |
| LA | 10587 | 0.0119306 |
| KY | 8550 | 0.0096351 |
| OK | 8085 | 0.0091111 |
| KS | 7926 | 0.0089319 |
| AR | 6640 | 0.0074827 |
| UT | 6264 | 0.0070590 |
| NM | 4939 | 0.0055658 |
| HI | 4570 | 0.0051500 |
| WV | 4386 | 0.0049426 |
| NH | 4294 | 0.0048390 |
| RI | 3893 | 0.0043871 |
| MS | 3819 | 0.0043037 |
| MT | 2558 | 0.0028826 |
| DE | 2511 | 0.0028297 |
| DC | 2432 | 0.0027407 |
| AK | 2205 | 0.0024848 |
| WY | 2028 | 0.0022854 |
| SD | 1815 | 0.0020453 |
| VT | 1797 | 0.0020251 |
| NE | 1176 | 0.0013253 |
| ME | 525 | 0.0005916 |
| ND | 479 | 0.0005398 |
| IA | 14 | 0.0000158 |
| ID | 12 | 0.0000135 |
탐색적 데이터 분석(EDA)은 크게 두 가지 작업으로 분류할 수 있다.
2. 시각화를 통해 Raw Data를 분석
- Segmented Univariate Analysis on Categorical Variables
default_summary <- loan_cleaned %>%
group_by(loan_status) %>%
summarise(cnt = n()) %>%
mutate(pct = cnt / sum(cnt)) %>%
arrange(desc(pct))
#crude_defualt_rate <- default_summary$prob[default_summary$loan_status=="yes"]
kable(default_summary, caption="2x2 Contingency Table on Default.", format="markdown")
| loan_status | cnt | pct |
|---|---|---|
| Current | 601779 | 0.6781533 |
| Fully Paid | 207723 | 0.2340860 |
| Charged Off | 45248 | 0.0509906 |
| Late (31-120 days) | 11591 | 0.0130621 |
| Issued | 8460 | 0.0095337 |
| In Grace Period | 6253 | 0.0070466 |
| Late (16-30 days) | 2357 | 0.0026561 |
| Does not meet the credit policy. Status:Fully Paid | 1988 | 0.0022403 |
| Default | 1219 | 0.0013737 |
| Does not meet the credit policy. Status:Charged Off | 761 | 0.0008576 |
ggplot(default_summary, aes(x=reorder(loan_status, pct), y=pct, colour=loan_status, fill=loan_status)) +
geom_bar(stat="identity",aes(color = I('black')), size = 0.1) +
labs( x = "Loan_Status",
y = "Percent",
title = "Loan Ratio by Loan Status") +
theme_economist() + guides(fill=FALSE) + coord_flip()
ggplot(loan_cleaned, aes(x =term, fill = loan_status)) +
geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
labs( x = "Term",
y = "Percent",
title = "Loan Ratio by Loan Status") +
theme_economist()
ggplot(loan_cleaned, aes(x =term, fill = loan_status)) +
geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
labs( x = "Term",
y = "Percent",
title = "Loan Ratio by Loan Status") +
theme_economist()
- Bining
bad_indicators <- c("Charged Off",
"Default",
"Does not meet the credit policy. Status:Charged Off",
"In Grace Period",
"Default Receiver",
"Late (16-30 days)",
"Late (31-120 days)")
#loan_cleaned$loan_status_yn <- ifelse(loan_cleaned$loan_status %in% bad_indicators, 1, 0)
loan_cleaned$loan_status <- ifelse(loan_cleaned$loan_status %in% bad_indicators, "Bad", "Good")
loan_cleaned <- filter(loan_cleaned, loan_cleaned$loan_status == "Bad" | loan_cleaned$loan_status == "Good")
loan_cleaned <- mutate(loan_cleaned, binary_status=as.numeric(ifelse(loan_cleaned$loan_status == "Bad", 0, 1)))
#loan_cleaned <- filter(loan_cleaned, loan_cleaned$loan_status == "Fully Paid" | loan_cleaned$loan_status == "Charged Off" | loan_cleaned$loan_status == "Default")
#loan_cleaned <- mutate(loan_cleaned, binary_status=as.numeric(ifelse(loan_cleaned$loan_status == "Fully Paid", 1, 0)))
#head(loan_cleaned)
#kable(head(loan_cleaned, 5))
36개월로 설정한 대출보다 60개월로 설정한 대출의 채무 불이행이 조금 더 높은 비율을 차지한다.
ggplot(loan_cleaned, aes(x = term, fill = loan_status)) +
geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
labs( x = "Term",
y = "Percent",
title = "Default with respect to Term") +
theme_economist()
ggplot(loan_cleaned, aes(x = term, fill = loan_status)) +
geom_bar(stat='count', position ='stack', aes(color = I('black')), size = 0.1) +
labs( x = "Term",
y = "Percent",
title = "Default with respect to Term") +
theme_economist()
등급이 내려갈수록 채무 불이행의 위험도가 증가하는 것을 확인할 수 있다.
ggplot(loan_cleaned, aes(x = grade, fill = loan_status)) +
geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
labs( x = "Grade",
y = "Percent",
title = "Default with respect to Grade") +
scale_fill_discrete(name="Loan_Status") + theme_economist()
ggplot(loan_cleaned, aes(x = grade, fill = loan_status)) +
geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
labs( x = "Grade",
y = "Percent",
title = "Default with respect to Grade") +
scale_fill_discrete(name="Loan_Status") + theme_economist()
재직 기간에 따른 대출 위험도의 유의미한 차이는 발견하지 못했다(거의 유사한 패턴을 보인다).
ggplot(filter(loan_cleaned, emp_length != 'n/a'), aes(x =emp_length, fill = loan_status)) +
geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
labs( x = "emp_length",
y = "Percent",
title = "Default with respect to Employee Length") +
scale_fill_discrete(name="Loan_Status") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.8, hjust=1)) +
theme_economist()
ggplot(filter(loan_cleaned, emp_length != 'n/a'), aes(x =emp_length, fill = loan_status)) +
geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
labs( x = "emp_length",
y = "Percent",
title = "Default with respect to Employee Length") +
scale_fill_discrete(name="Loan_Status") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.8, hjust=1)) +
theme_economist()
기타 형태의 주거 생활을하는 채무자의 채무 불이행 비율이 높다.
ggplot(loan_cleaned, aes(x =home_ownership, fill = loan_status)) +
geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
labs( x = "home_ownership",
y = "Percent",
title = "Default with respect to Home Ownership") +
scale_fill_discrete(name="Loan_Status") +
theme_economist()
ggplot(loan_cleaned, aes(x =home_ownership, fill = loan_status)) +
geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
labs( x = "home_ownership",
y = "Percent",
title = "Default with respect to Home Ownership") +
scale_fill_discrete(name="Loan_Status") +
theme_economist()
수입이 확인되지 않은 상태보다 확인된 상태의 채무 불이행 비율이 더 높다.
ggplot(loan_cleaned, aes(x =verification_status, fill = loan_status)) +
geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
labs( x = "Verification_status",
y = "Percent",
title = "Verification status with respect to default") +
theme_economist()
ggplot(loan_cleaned, aes(x =verification_status, fill = loan_status)) +
geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
labs( x = "Verification_status",
y = "Percent",
title = "Verification status with respect to default") +
theme_economist()
Educational(21%)과 Small Business(17%)의 채무 불이행 비율이 다른 대출 목적에 비해 높다.
purpose_summary_2 <- loan_cleaned %>%
group_by(purpose) %>%
summarise(pct = (1-sum(binary_status)/n())) %>%
arrange(desc(pct))
kable(purpose_summary_2, format="markdown")
| purpose | pct |
|---|---|
| educational | 0.2080378 |
| small_business | 0.1716296 |
| renewable_energy | 0.1252174 |
| wedding | 0.1231359 |
| house | 0.1124899 |
| moving | 0.1121167 |
| other | 0.0970532 |
| medical | 0.0941452 |
| vacation | 0.0848818 |
| debt_consolidation | 0.0793720 |
| major_purchase | 0.0748973 |
| home_improvement | 0.0685524 |
| car | 0.0672459 |
| credit_card | 0.0570127 |
ggplot(data = purpose_summary_2, aes(x = reorder(purpose, pct), y = pct)) +
geom_bar(stat = "identity", fill = fillColor2, aes(color = I('black')), size = 0.1) +
labs( x = "Purpose",
y = "Percent",
title = "Purpose of Loan with respect to Default") +
theme_economist() + coord_flip()
테네시(Tenesse) 주의 채무 불이행의 비율(20%)이 다른 지역에 비해 상대적으로 높다.
state_summary <- loan_cleaned %>%
group_by(addr_state) %>%
summarise(pct = (1-sum(binary_status)/n()))
ggplot(data = state_summary, aes(x = reorder(addr_state, pct), y = pct)) +
geom_bar(stat = "identity", fill = fillColor2, aes(color = I('white')), size = 0.1) +
labs( x = "States",
y = "Percent",
title = "State of Loan with respect to Default") +
theme_economist() + coord_flip()
대출 금액이 10,000달러 이상일 때부터, 채무불이행이 발생할 수 있다.
ggplot(loan_cleaned, aes(x= loan_amnt)) +
geom_density(aes(fill = as.factor(loan_status))) +
xlab("Loan_amount") + theme_economist()
높은 이자율은(즉, 낮은 신용 등급을 가진 사람들은) 채무 불이행의 위험도가 높다.
ggplot(loan_cleaned, aes(x= int_rate, fill = loan_status)) +
geom_histogram(bins = 10, position = "fill", aes(color = I('black')), size = 0.1)+
xlab("Interest Rate")+
ylab("Percent of default Vs No default")+theme_few()
ggplot(loan_cleaned, aes(x = loan_status, y = int_rate, fill = loan_status)) + geom_boxplot()
신용 활동 기간이 길수록 금리가 줄어들고, 채무 불이행의 위험도가 낮아진다.
loan_cleaned <- mutate(loan_cleaned, credit_history.d = year(earliest_cr_line))
loan_cleaned <- mutate(loan_cleaned, pull.d = year(last_credit_pull_d))
# Credit History Grouping
loan_cleaned <- mutate(loan_cleaned, cr_hist = pull.d-credit_history.d)
loan_cleaned <- mutate(loan_cleaned, cr_hist_grp = cr_hist)
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist <= 5] <- '<5_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 5 & loan_cleaned$cr_hist <= 10] <- '5-10_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 10 & loan_cleaned$cr_hist <= 15] <- '10-15_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 15 & loan_cleaned$cr_hist <= 20] <- '15-20_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 20] <- '>20_yrs'
ggplot(filter(loan_cleaned,cr_hist != "NA") , aes(x= reorder(cr_hist_grp, cr_hist), fill = loan_status)) +
geom_bar(position = "fill", aes(color = I('black')), size = 0.1) +
xlab("Credit_History_Length") +
ylab("Percent") +
theme(axis.text.x = element_text(angle = 20, vjust = 0.8, hjust=1)) +
theme_economist()
소득이 낮을수록 채무 불이행의 위험도는 증가한다.
loan_cleaned <- mutate(loan_cleaned, inc_grp = annual_inc)
loan_cleaned$inc_grp[loan_cleaned$annual_inc <= 50000] <- "verylow_inc"
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 50000 & loan_cleaned$annual_inc <= 80000] <- 'low_inc'
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 80000 & loan_cleaned$annual_inc <= 120000] <- 'middle_inc'
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 120000 & loan_cleaned$annual_inc <= 240000] <- 'high_inc'
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 240000] <- 'veryhigh_inc'
ggplot(loan_cleaned, aes(x= reorder(inc_grp, annual_inc), fill = loan_status, order = loan_status)) +
geom_bar(position = "fill", aes(color = I('black')), size = 0.1) +
xlab("Income_groups") +
ylab("Percent of default Vs No default") +
theme(axis.text.x = element_text(angle = 20, vjust = 0.8, hjust=1)) +
theme_economist()
Desc(df$loan_amnt, main = "Loan amount distribution", plotit = TRUE)
## -------------------------------------------------------------------------
## Loan amount distribution
##
## length n NAs unique 0s mean
## 887'379 887'379 0 1'372 0 14'755.26
## 100.0% 0.0% 0.0%
##
## .05 .10 .25 median .75 .90
## 3'600.00 5'000.00 8'000.00 13'000.00 20'000.00 28'000.00
##
## range sd vcoef mad IQR skew
## 34'500.00 8'435.46 0.57 8'599.08 12'000.00 0.68
##
## meanCI
## 14'737.71
## 14'772.82
##
## .95
## 32'000.00
##
## kurt
## -0.26
##
## lowest : 500.0 (11), 550.0, 600.0 (6), 700.0 (3), 725.0
## highest: 34'900.0 (14), 34'925.0 (9), 34'950.0 (18), 34'975.0 (31), 35'000.0 (36'368)
#df$issue_d <- parse_date_time(gsub("^", "01-", df$issue_d), orders = c("d-m-y", "d-B-Y", "m/d/y"))
df$issue_d <- as.Date(gsub("^", "01-", df$issue_d), format="%d-%b-%Y")
df$issue_year <- year(as.yearmon(df$issue_d,"%b-%Y"))
d <- sqldf("
select
avg(loan_amnt) as avg_amnt
, grade
, issue_year
from df
group by grade, issue_year
")
g <- ggplot(d, aes(x = issue_year , y = avg_amnt, color = grade)) +
geom_line(alpha = 0.4) +
labs(x = "Year of Loan Issue", y = "Average of Loan Amount issued", title = "Year of Loan Issue") +
theme_economist()
ggplotly(g, tooltip = c("grade"))
#df$issue_d <- as.Date(gsub("^", "01-", df$issue_d), format="%d-%b-%Y")
#df$issue_year <- year(as.yearmon(df$issue_d,"%b-%Y"))
d <- sqldf("
select
grade
, issue_year
, count(*) as freq
from df
group by 1, 2
")
head(d)
## grade issue_year freq
## 1 A 2007 78
## 2 A 2008 318
## 3 A 2009 1203
## 4 A 2010 2830
## 5 A 2011 5754
## 6 A 2012 10901
g <- ggplot(d, aes(x = issue_year , y = freq, color = grade)) +
geom_line(alpha = 0.4) +
labs(x = "Year of Loan Issue", y = "Count of Loan issued", title = "Year of Loan Issue") +
theme_economist()
ggplotly(g, tooltip = c("grade"))
box_status <- ggplot(df, aes(loan_status, loan_amnt))
box_status +
# geom_boxplot(aes(fill = loan_status)) +
geom_boxplot(fill = fillColor2) +
theme(#axis.text.x = element_text(angle=90, hjust = 1),
axis.text.x=element_blank(),
legend.position="null") +
labs(list(
title = "Loans by Loan Status",
x = "Loan Status",
y = "Loan")) +
theme_economist() +
guides(fill=FALSE) +
coord_flip()