1. Introduction

신용 평가와 대출 승인 결정은 금융 기관에서 가장 중요한 업무 중 하나이다. 기본적으로 금융기관에서는 한국은행을 비롯한 다양한 곳에서 자금을 조달하여 이를 관리하고 있다가 자금을 필요로 하는 곳에 자금을 빌려주고 이에 상응하는 이자를 받아 수익을 얻는 것으로 볼 수 있다. 근본적으로 많은 금액을 빌려주고 이를 나누어서 자금을 사용한 곳에서 갚아 나가는 구조다.

물론 다수의 고객에게 자금을 빌려주다보니 제때 돈을 갚지 못하거나, 불의의 사고, 실직 등 다양한 이유로 인해서 돈을 갖지 못하는 위험이 발생된다. 이때 기대손실(Expected Loss)을 다음 구성요소를 가지고 정량화한다.

채무 불이행 위험 : Probability of default
채무 불이행 노출 : Exposure at default
채무 불이행에 대한 손실 : Loss given default

기대손실=채무 불이행 위험×채무 불이행 노출×채무 불이행에 대한 손실

따라서 금융기관에서 자금을 빌려주기 전에 다양한 정보를 활용하여 채무 불이행 위험이 적은 고객을 선별하여 가능한 많은 금액을 빌려주어 매출과 수익을 극대화한다.

신용 모델의 이면의 아이디어는 채무불이행될 위험이 높은지를 예측할 수 있는 요소를 식별하는 것이다. 따라서 과거 은행 대출에 대한 대량 데이터, 대출의 채무불이행 여부, 신청자에 대한 정보를 입수해야 한다.

Application 정보: 나이, 결혼여부, 소득, 자가/전세 등
Behaviour 정보: 현재 은행잔고, 연체금액 등

1.1. Aim

- Classification using Machine Learning

기업 금융과 마찬가지로 소매 금융도 리스크 관리가 중요하다. 2007~2008년의 세계 금융 위기(리먼브라더스 사태)는 대출 업무의 투명성과 엄격함의 중요성을 부각시켰다. 기존 신용(시스템)의 유효성이 제한되면서 신용 평점에 따라 엄격하게 관리하는 것이 필요해졌고, 은행은 대출 시스템을 강화하고 위험 대출을 더 정확하게 찾아내기 위해 신용평점모형(Credit Scoring Model)을 머신 러닝을 활용하고 있다.

많은 나라의 정부기관은 대출 업무를 면밀히 감시하고 있기 때문에 경영진은 대출이 승인되고 거절되는 사유를 설명할 수 있어야만 한다. 이런 정보는 신용 평가가 만족스럽지 않은 이유를 확인하려는 고객들에게도 유용하다. 의사결정트리는 다른 알고리즘 모델에 비해 정확도는 떨어지지만 설명력이 높다. 이 때문에 은행 업계에서 널리 사용되고 있다.

자동화된 신용 평가 모델을 이용해 전화나 웹에서 대출 신청을 즉시 승인할 수 있다. Python은 컴퓨팅에 많은 라이브러리가 구축되어 있는 반면, R은 상대적으로 통계 기반이라 금융 위험 관리 분야에 라이브러리가 많다. 과거의 결과를 고려하여 미래의 성과를 정확하게 예측하기위한 모델을 훈련해보겠다(현실과 밀접한 신용 점수 할당 문제를 기계학습에서 대규모로 적용할 경우 풀어가는 방식을 R로 간단한 대출 승인 모델을 개발하고, 재정적 손실을 야기하는 오류를 최소화하기 위한 모델 성능을 개선할 예정이다).

1.2. Dataset

- 2007-2015의 LendingClub 대출자료

database.sqlite (134.64 MB) : sqlite 데이터베이스 파일 형식 대출 데이터 전체
LCDataDictionary.xlsx (20.5 KB) : 데이터 사전 (변수 설명)
loan.csv (105.01 MB) : csv 파일 형식 대출 데이터 전체

- Data description

idx	LoanStatNew	Description
0	addr_state	The state provided by the borrower in the loan application
1	annual_inc	The self-reported annual income provided by the borrower during registration.
2	annual_inc_joint	The combined self-reported annual income provided by the co-borrowers during registration
3	application_type	Indicates whether the loan is an individual application or a joint application with two co-borrowers
4	collection_recovery_fee	post charge off collection fee
5	collections_12_mths_ex_med	Number of collections in 12 months excluding medical collections
6	delinq_2yrs	The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years
7	desc	Loan description provided by the borrower
8	dti	A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
9	dti_joint	A ratio calculated using the co-borrowers’ total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers’ combined self-reported monthly income
10	earliest_cr_line	The month the borrower’s earliest reported credit line was opened
11	emp_length	Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and
12	emp_title	The job title supplied by the Borrower when applying for the loan.
13	fico_range_high	The upper boundary range the borrower’s FICO at loan origination belongs to.
14	fico_range_low	The lower boundary range the borrower’s FICO at loan origination belongs to.
15	funded_amnt	The total amount committed to that loan at that point in time.
16	funded_amnt_inv	The total amount committed by investors for that loan at that point in time.
17	grade	LC assigned loan grade
18	home_ownership	The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
19	id	A unique LC assigned ID for the loan listing.
20	initial_list_status	The initial listing status of the loan. Possible values are – W, F
21	inq_last_6mths	The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
22	installment	The monthly payment owed by the borrower if the loan originates.
23	int_rate	Interest Rate on the loan
24	is_inc_v	Indicates if income was verified by LC, not verified, or if the income source was verified
25	issue_d	The month which the loan was funded
26	last_credit_pull_d	The most recent month LC pulled credit for this loan
27	last_fico_range_high	The upper boundary range the borrower’s last FICO pulled belongs to.
28	last_fico_range_low	The lower boundary range the borrower’s last FICO pulled belongs to.
29	last_pymnt_amnt	Last total payment amount received
30	last_pymnt_d	Last month payment was received
31	loan_amnt	The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
32	loan_status	Current status of the loan
33	member_id	A unique LC assigned Id for the borrower member.
34	mths_since_last_delinq	The number of months since the borrower’s last delinquency.
35	mths_since_last_major_derog	Months since most recent 90-day or worse rating
36	mths_since_last_record	The number of months since the last public record.
37	next_pymnt_d	Next scheduled payment date
38	open_acc	The number of open credit lines in the borrower’s credit file.
39	out_prncp	Remaining outstanding principal for total amount funded
40	out_prncp_inv	Remaining outstanding principal for portion of total amount funded by investors
41	policy_code	publicly available policy_code=1 new products not publicly available policy_code=2
42	pub_rec	Number of derogatory public records
43	purpose	A category provided by the borrower for the loan request.
44	pymnt_plan	Indicates if a payment plan has been put in place for the loan
45	recoveries	post charge off gross recovery
46	revol_bal	Total credit revolving balance
47	revol_util	Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
48	sub_grade	LC assigned loan subgrade
49	term	The number of payments on the loan. Values are in months and can be either 36 or 60.
50	title	The loan title provided by the borrower
51	total_acc	The total number of credit lines currently in the borrower’s credit file
52	total_pymnt	Payments received to date for total amount funded
53	total_pymnt_inv	Payments received to date for portion of total amount funded by investors
54	total_rec_int	Interest received to date
55	total_rec_late_fee	Late fees received to date
56	total_rec_prncp	Principal received to date
57	url	URL for the LC page with listing data.
58	verified_status_joint	Indicates if the co-borrowers’ joint income was verified by LC, not verified, or if the income source was verified
59	zip_code	The first 3 numbers of the zip code provided by the borrower in the loan application.
60	open_acc_6m	Number of open trades in last 6 months
61	open_il_6m	Number of currently active installment trades
62	open_il_12m	Number of installment accounts opened in past 12 months
63	open_il_24m	Number of installment accounts opened in past 24 months
64	mths_since_rcnt_il	Months since most recent installment accounts opened
65	total_bal_il	Total current balance of all installment accounts
66	il_util	Ratio of total current balance to high credit/credit limit on all install acct
67	open_rv_12m	Number of revolving trades opened in past 12 months
68	open_rv_24m	Number of revolving trades opened in past 24 months
69	max_bal_bc	Maximum current balance owed on all revolving accounts
70	all_util	Balance to credit limit on all trades
71	total_rev_hi_lim Total revolving high credit/credit limit
72	inq_fi	Number of personal finance inquiries
73	total_cu_tl	Number of finance trades
74	inq_last_12m	Number of credit inquiries in past 12 months
75	acc_now_delinq	The number of accounts on which the borrower is now delinquent.
76	tot_coll_amt	Total collection amounts ever owed
77	tot_cur_bal	Total current balance of all accounts

2. Collect the data

2.1. Setting

- Import Libraries

- Read the data

3. Exploratory Data Analysis 1

탐색적 데이터 분석(EDA)은 크게 두 가지 작업으로 분류할 수 있다.

1. 기초 통계를 통하여 Raw Data를 분석
2. 시각화를 통해 Raw Data를 분석

3.1. Preview the data

- Data size

dim(df)

## [1] 887379     74

- Data table

#datatable(head(df, 100), style="bootstrap", class="table-condensed", options = list(dom = 'tp', scrollX = TRUE, search=''))
#kable(head(df, 5))

3.2. Basic Statistics

- Data summary

colnames(df)

str(df)
summary(df)

Summary

887,379개의 값이 존재
74개의 변수들은 NA값이 많음
다양한 대출 정보들이 존재

3.3. Unique values

- Variable별 Unique한 값 찾기

#for (name in names(df)){
#  print(sprintf("Unique values for %s:", name))
#  print(unique(df[[name]]))
#  cat('\n')
#  }

3.4. Duplicate rows

- Checking for any duplicate rows

sum(duplicated(df))

## [1] 0

4. Preprocessing

- Cleansing the data

변수가 많으므로 불필요한 열을 제거
먼저 데이터에서 결측치를 체크

4.1. Missing values

- Checking for any missing values

4.1.1. Visualization

options(repr.plot.width=6, repr.plot.height=8)
missing_data <- df %>% summarise_all(funs(sum(is.na(.))/n()))
missing_data <- gather(missing_data, key = "variables", value = "percent_missing") 
ggplot(missing_data, aes(x = reorder(variables, percent_missing), y = percent_missing)) +
  geom_bar(stat = "identity", fill = "red", aes(color = I('white')), size = 0.1)+coord_flip()+ theme_few()

4.1.2. Table

colSums(is.na(df))

##                          id                   member_id 
##                           0                           0 
##                   loan_amnt                 funded_amnt 
##                           0                           0 
##             funded_amnt_inv                        term 
##                           0                           0 
##                    int_rate                 installment 
##                           0                           0 
##                       grade                   sub_grade 
##                           0                           0 
##                   emp_title                  emp_length 
##                       51457                           0 
##              home_ownership                  annual_inc 
##                           0                           4 
##         verification_status                     issue_d 
##                           0                           0 
##                 loan_status                  pymnt_plan 
##                           0                           0 
##                         url                        desc 
##                           0                      761597 
##                     purpose                       title 
##                           0                         151 
##                    zip_code                  addr_state 
##                           0                           0 
##                         dti                 delinq_2yrs 
##                           0                          29 
##            earliest_cr_line              inq_last_6mths 
##                          29                          29 
##      mths_since_last_delinq      mths_since_last_record 
##                      454312                      750326 
##                    open_acc                     pub_rec 
##                          29                          29 
##                   revol_bal                  revol_util 
##                           0                         502 
##                   total_acc         initial_list_status 
##                          29                           0 
##                   out_prncp               out_prncp_inv 
##                           0                           0 
##                 total_pymnt             total_pymnt_inv 
##                           0                           0 
##             total_rec_prncp               total_rec_int 
##                           0                           0 
##          total_rec_late_fee                  recoveries 
##                           0                           0 
##     collection_recovery_fee                last_pymnt_d 
##                           0                       17659 
##             last_pymnt_amnt                next_pymnt_d 
##                           0                      252971 
##          last_credit_pull_d  collections_12_mths_ex_med 
##                          53                         145 
## mths_since_last_major_derog                 policy_code 
##                      665676                           0 
##            application_type            annual_inc_joint 
##                           0                      886868 
##                   dti_joint   verification_status_joint 
##                      886870                      886868 
##              acc_now_delinq                tot_coll_amt 
##                          29                       70276 
##                 tot_cur_bal                 open_acc_6m 
##                       70276                      866007 
##                  open_il_6m                 open_il_12m 
##                      866007                      866007 
##                 open_il_24m          mths_since_rcnt_il 
##                      866007                      866569 
##                total_bal_il                     il_util 
##                      866007                      868762 
##                 open_rv_12m                 open_rv_24m 
##                      866007                      866007 
##                  max_bal_bc                    all_util 
##                      866007                      866007 
##            total_rev_hi_lim                      inq_fi 
##                       70276                      866007 
##                 total_cu_tl                inq_last_12m 
##                      866007                      866007

4.1.3. Summary

총 22개의 변수가 25% 이상의 NA값을 갖는다.
4개의 열에는 25% 이상의 누락된 데이터가 있다.
7개의 열에는 10% 미만의 누락된 데이터가 있다.

4.2. Cleansing the data

- 누락 된 데이터가 10 % 이상인 열 제거

loan_cleaned <- df[, -which(colMeans(is.na(df)) > 0.1)]

이제 49개의 변수가 저장된 loan_cleaned로 데이터 세트로 추가 분석을 시행할 것이다.

4.3. Removing duplicate variables

id, member_id, url 변수는 오직 대출 식별 목적으로 만든 고유한 값이므로 제거해도 된다. 우편 번호 열은 처음 세 자리만 있으면 되고, 정보는 주 변수에서 얻을 수 있다. emp_title도 약 6%의 NA 값을 갖고 있으며 해당 데이터에서 50,000행을 삭제하면 데이터 손실이 크므로 해당 변수를 제거하겠다.

loan_cleaned <- within(loan_cleaned, rm('member_id', 'id', 'url', 'emp_title', "title","zip_code"))

또한 대출을 승인할지 여부를 결정할 때 존재하지 않을 변수를 제거하고 고객 정보 및 고객 인구 통계와 관련된 변수를 보유하겠다.
funded_amnt, funded_amnt_inv, issue_d, out_prncp, out_prncp_inv, total_pymnt, total_pymnt_inv, total_rec_int, total_rec_late_fee, total_rec_prncp 등과 같은 변수는 기본적으로 고객 지불 매개 변수이며 의사 결정 중에는 사용할 수 없다.

loan_cleaned <- loan_cleaned[, -c(2,3,13,28,29,30,31,32,33,34,35,36,37,38,40,43)]

4.4. Converting incorrect data types

범주형 변수이지만 문자열로 표시되는 일부 변수가 있으므로 형변환해야 한다.

- Convert String to Factor Variables

loan_cleaned$term <- as.factor(loan_cleaned$term)
loan_cleaned$grade <- as.factor(loan_cleaned$grade)
loan_cleaned$emp_length <- as.factor(loan_cleaned$emp_length)
loan_cleaned$home_ownership <- as.factor(loan_cleaned$home_ownership)
loan_cleaned$verification_status <- as.factor(loan_cleaned$verification_status)
loan_cleaned$loan_status <- as.factor(loan_cleaned$loan_status)
loan_cleaned$application_type <- as.factor(loan_cleaned$application_type)
loan_cleaned$pymnt_plan <- as.factor(loan_cleaned$pymnt_plan)
loan_cleaned$initial_list_status <- as.factor(loan_cleaned$initial_list_status)
loan_cleaned$policy_code <- as.factor(loan_cleaned$policy_code)

“policy_code”및 “pymnt_plan”변수는 대부분 모든 행에서 하나의 레벨을 가지므로 이를 제거할 수도 있다.

loan_cleaned <- within(loan_cleaned, rm("policy_code", "pymnt_plan"))

- 날짜형 변환

날짜 열은 Chr 형식이므로 Date 형식으로 변환한다.

loan_cleaned$earliest_cr_line <- parse_date(loan_cleaned$earliest_cr_line,format =  "%b-%Y")
loan_cleaned$last_credit_pull_d <- parse_date(loan_cleaned$last_credit_pull_d,format =  "%b-%Y")

5. Exploratory Data Analysis 2

탐색적 데이터 분석(EDA)은 크게 두 가지 작업으로 분류할 수 있다.

기초 통계를 통하여 Raw Data를 분석

2. 시각화를 통해 Raw Data를 분석

- Univariate Analysis on Categorical Variables

5.1. Term (대출기간)

대출 상품의 기간은 36개월, 60개월 두 가지가 존재한다. 36개월이 약 70%를 차지한다.

options(repr.plot.width=5, repr.plot.height=3)

term_summary <- loan_cleaned %>% 
  group_by(term) %>%
  summarise(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

kable(term_summary, caption="2x2 Contingency Table on Term.", format="markdown")

term	count	pct
36 months	621125	0.6999546
60 months	266254	0.3000454

#crude_deporate <- deposit_summary$prob[deposit_summary$deposit=="yes"]
  
ggplot(data = term_summary, aes(x = term, y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  
  geom_text(aes(x = term, y = 0.05, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Term",
        y = "Percent",
        title = "Term Rate of Loan") +
  theme_economist()

5.2. Grade (신용등급)

해당 대출의 신용 위험도에 따라 각 등급(A-E)을 지정한다. B등급과 C등급의 대출금은 전체 대출금의 50% 이상(57%)을 차지한다.

grade_summary <- loan_cleaned %>%
  group_by(grade) %>%
  summarise(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(pct))

kable(grade_summary, caption="2x2 Contingency Table on Grade.", format="markdown")

grade	count	pct
B	254535	0.2868391
C	245860	0.2770631
A	148202	0.1670109
D	139542	0.1572519
E	70705	0.0796785
F	23046	0.0259709
G	5489	0.0061856

#crude_deporate <- deposit_summary$prob[deposit_summary$deposit=="yes"]
  
ggplot(data = grade_summary, aes(x = reorder(grade, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = grade, y = 0.05, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Grade",
        y = "Percent",
        title = "Grade Rate of Loan")+
  theme_economist()

5.3. Employment Length (재직 기간)

10년 이상 재직한 사람들이 대출을 많이 받는데, 생애주기상 이사나 자녀의 학업을 위해서 자금을 필요로 하는 것 같다.

emp_len_summary <- loan_cleaned %>%
  group_by(emp_length) %>%
  summarise(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

kable(emp_len_summary, caption="2x2 Contingency Table on Employment Length.", format="markdown")

emp_length	count	pct
10+ years	291569	0.3285732
2 years	78870	0.0888797
< 1 year	70605	0.0795658
3 years	70026	0.0789133
1 year	57095	0.0643412
5 years	55704	0.0627736
4 years	52529	0.0591957
n/a	44825	0.0505139
7 years	44594	0.0502536
8 years	43955	0.0495335
6 years	42950	0.0484010
9 years	34657	0.0390555

#crude_deporate <- deposit_summary$prob[deposit_summary$deposit=="yes"]
  
ggplot(data = emp_len_summary, aes(x = reorder(emp_length, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = emp_length, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Employment Length",
        y = "Percent",
        title = "Employment Length Rate of Loan") +
  theme_economist() + coord_flip()

5.4. Home Ownership (주택 소유 여부)

월세 및 모기지 주택 소유자가 전체 대출의 약 90%를 차지한다.

home_ownership_summary <- loan_cleaned %>%
  group_by(home_ownership) %>%
  summarise(count = n()) %>%
  mutate(pct = count / sum(count)) %>%
  arrange(desc(pct))
  
kable(home_ownership_summary, caption="2x2 Contingency Table on Home Ownership.", format="markdown")

home_ownership	count	pct
MORTGAGE	443557	0.4998507
RENT	356117	0.4013133
OWN	87470	0.0985712
OTHER	182	0.0002051
NONE	50	0.0000563
ANY	3	0.0000034

ggplot(data = home_ownership_summary, aes(x = reorder(home_ownership, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = home_ownership, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Home Ownership",
        y = "Percent",
        title = "Home Ownership Rate of Loan") +
  theme_economist() + coord_flip()

5.5. Verification Status (수입 상태)

실행된 대출의 30% 이상이 수입 상태에 대해 확인되지 않았다. 검증되지 않은 대출은 추후에 부실 대출을 야기할 가능성이 높다.

vs_summary <- loan_cleaned %>%
  group_by(verification_status) %>%
  summarise(count=n()) %>% 
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

kable(vs_summary, caption="2x2 Contingency Table on Verification Status", format="markdown")

verification_status	count	pct
Source Verified	329558	0.3713836
Verified	291071	0.3280120
Not Verified	266750	0.3006044

ggplot(data = vs_summary, aes(x = reorder(verification_status, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = verification_status, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Verification Status",
        y = "Percent",
        title = "Verification Status Rate of Loan") +
  theme_economist() + coord_flip()

5.6. Purpose (대출 목적)

대출의 60%는 기존의 부채를 정리하기 위해 사용된다.

purpose_summary <- loan_cleaned %>%
  group_by(purpose) %>%
  summarise(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

kable(purpose_summary, caption="2x2 Contingency Table on Purpose", format="markdown")

purpose	count	pct
debt_consolidation	524215	0.5907453
credit_card	206182	0.2323494
home_improvement	51829	0.0584068
other	42894	0.0483379
major_purchase	17277	0.0194697
small_business	10377	0.0116940
car	8863	0.0099878
medical	8540	0.0096238
moving	5414	0.0061011
vacation	4736	0.0053371
house	3707	0.0041775
wedding	2347	0.0026449
renewable_energy	575	0.0006480
educational	423	0.0004767

ggplot(data = purpose_summary, aes(x = reorder(purpose, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  geom_text(aes(x = purpose, y = 0.01, label = paste0("(",round(pct,2)*100, "%)",sep="")),
            hjust=0.5, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs( x = "Purpose",
        y = "Percent",
        title = "Purpose Rate of Loan") +
  theme_economist() + coord_flip()

5.7. State (주)

대출금의 15% 이상이 캘리포니아 주에서 실행되며 뉴욕과 텍사스는 그 뒤를 잇는다(약 7~8%).

state_summary <- loan_cleaned %>%
  group_by(addr_state) %>%
  summarise(count=n()) %>%
  mutate(pct=count/sum(count)) %>%
  arrange(desc(pct))

ggplot(data = state_summary, aes(x = reorder(addr_state, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2) +
  labs( x = "State",
        y = "Percent",
        title = "State Rate of Loan") +
  theme_economist() + coord_flip()

kable(state_summary, caption="2x2 Contingency Table on State.", format="markdown")

addr_state	count	pct
CA	129517	0.1459545
NY	74086	0.0834886
TX	71138	0.0801664
FL	60935	0.0686685
IL	35476	0.0399784
NJ	33256	0.0374767
PA	31393	0.0353772
OH	29631	0.0333916
GA	29085	0.0327763
VA	26255	0.0295871
NC	24720	0.0278573
MI	22985	0.0259021
MD	21031	0.0237001
MA	20593	0.0232065
AZ	20412	0.0230026
WA	19434	0.0219005
CO	18807	0.0211939
MN	15957	0.0179822
MO	14207	0.0160101
IN	13789	0.0155390
CT	13531	0.0152483
TN	12887	0.0145225
NV	12443	0.0140222
WI	11574	0.0130429
AL	11200	0.0126214
OR	10893	0.0122755
SC	10639	0.0119892
LA	10587	0.0119306
KY	8550	0.0096351
OK	8085	0.0091111
KS	7926	0.0089319
AR	6640	0.0074827
UT	6264	0.0070590
NM	4939	0.0055658
HI	4570	0.0051500
WV	4386	0.0049426
NH	4294	0.0048390
RI	3893	0.0043871
MS	3819	0.0043037
MT	2558	0.0028826
DE	2511	0.0028297
DC	2432	0.0027407
AK	2205	0.0024848
WY	2028	0.0022854
SD	1815	0.0020453
VT	1797	0.0020251
NE	1176	0.0013253
ME	525	0.0005916
ND	479	0.0005398
IA	14	0.0000158
ID	12	0.0000135

6. Exploratory Data Analysis 3

탐색적 데이터 분석(EDA)은 크게 두 가지 작업으로 분류할 수 있다.

기초 통계를 통하여 Raw Data를 분석

2. 시각화를 통해 Raw Data를 분석

- Segmented Univariate Analysis on Categorical Variables

6.1. Loan Ratio by Loan Status (대출 상태별 대출 비율)

default_summary <- loan_cleaned %>%
  group_by(loan_status) %>%
  summarise(cnt = n()) %>%
  mutate(pct = cnt / sum(cnt)) %>%
  arrange(desc(pct))

#crude_defualt_rate <- default_summary$prob[default_summary$loan_status=="yes"]

kable(default_summary, caption="2x2 Contingency Table on Default.", format="markdown")

loan_status	cnt	pct
Current	601779	0.6781533
Fully Paid	207723	0.2340860
Charged Off	45248	0.0509906
Late (31-120 days)	11591	0.0130621
Issued	8460	0.0095337
In Grace Period	6253	0.0070466
Late (16-30 days)	2357	0.0026561
Does not meet the credit policy. Status:Fully Paid	1988	0.0022403
Default	1219	0.0013737
Does not meet the credit policy. Status:Charged Off	761	0.0008576

ggplot(default_summary, aes(x=reorder(loan_status, pct), y=pct, colour=loan_status, fill=loan_status)) +
  geom_bar(stat="identity",aes(color = I('black')), size = 0.1)  +
  labs( x = "Loan_Status",
        y = "Percent",
        title = "Loan Ratio by Loan Status") +
  theme_economist() + guides(fill=FALSE) + coord_flip()

ggplot(loan_cleaned, aes(x =term, fill = loan_status)) +
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "Term",
        y = "Percent",
        title = "Loan Ratio by Loan Status") +
  theme_economist()

ggplot(loan_cleaned, aes(x =term, fill = loan_status)) +
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "Term",
        y = "Percent",
        title = "Loan Ratio by Loan Status") + 
  theme_economist()

- Bining

타겟 변수(Loan_status)에는 10 개의 범주가 있다.
부채 상환에 관련된 요인을 알아내는 것이 목표이므로, 정확한 결과를 알 수 있는 ‘Fully Paid’와 ’Charged Off’ 데이터만 고려할 것이다(이진 값을 갖는 새로운 열을 만들어서 fully paid는 1로, charged off는 0으로 변환할 것이다).

bad_indicators <- c("Charged Off",
                    "Default",
                    "Does not meet the credit policy. Status:Charged Off",
                    "In Grace Period", 
                    "Default Receiver", 
                    "Late (16-30 days)",
                    "Late (31-120 days)")

#loan_cleaned$loan_status_yn <- ifelse(loan_cleaned$loan_status %in% bad_indicators, 1, 0)
loan_cleaned$loan_status <- ifelse(loan_cleaned$loan_status %in% bad_indicators, "Bad", "Good")

loan_cleaned <- filter(loan_cleaned, loan_cleaned$loan_status == "Bad" | loan_cleaned$loan_status == "Good")

loan_cleaned <- mutate(loan_cleaned, binary_status=as.numeric(ifelse(loan_cleaned$loan_status == "Bad", 0, 1)))

#loan_cleaned <- filter(loan_cleaned, loan_cleaned$loan_status == "Fully Paid" | loan_cleaned$loan_status == "Charged Off" | loan_cleaned$loan_status == "Default")

#loan_cleaned <- mutate(loan_cleaned, binary_status=as.numeric(ifelse(loan_cleaned$loan_status == "Fully Paid", 1, 0)))

#head(loan_cleaned)
#kable(head(loan_cleaned, 5))

6.2. Default with respect to Term (기간별 채무 불이행 비율)

36개월로 설정한 대출보다 60개월로 설정한 대출의 채무 불이행이 조금 더 높은 비율을 차지한다.

ggplot(loan_cleaned, aes(x = term, fill = loan_status)) +
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "Term",
        y = "Percent",
        title = "Default with respect to Term") + 
  theme_economist()

ggplot(loan_cleaned, aes(x = term, fill = loan_status)) +
  geom_bar(stat='count', position ='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "Term",
        y = "Percent",
        title = "Default with respect to Term") + 
  theme_economist()

6.3. Default with respect to Grade (등급별 채무 불이행 비율)

등급이 내려갈수록 채무 불이행의 위험도가 증가하는 것을 확인할 수 있다.

ggplot(loan_cleaned, aes(x = grade, fill = loan_status)) +
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "Grade",
        y = "Percent",
        title = "Default with respect to Grade") + 
  scale_fill_discrete(name="Loan_Status") + theme_economist()

ggplot(loan_cleaned, aes(x = grade, fill = loan_status)) +
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "Grade",
        y = "Percent",
        title = "Default with respect to Grade") + 
  scale_fill_discrete(name="Loan_Status") + theme_economist()

6.4. Default with respect to Employee Length (재직 기간별 채무 불이행 비율)

재직 기간에 따른 대출 위험도의 유의미한 차이는 발견하지 못했다(거의 유사한 패턴을 보인다).

ggplot(filter(loan_cleaned, emp_length != 'n/a'), aes(x =emp_length, fill = loan_status)) +
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "emp_length",
        y = "Percent",
        title = "Default with respect to Employee Length") + 
  scale_fill_discrete(name="Loan_Status") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.8, hjust=1)) +
  theme_economist()

ggplot(filter(loan_cleaned, emp_length != 'n/a'), aes(x =emp_length, fill = loan_status)) +
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "emp_length",
        y = "Percent",
        title = "Default with respect to Employee Length") + 
  scale_fill_discrete(name="Loan_Status") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.8, hjust=1)) +
  theme_economist()

6.5. Default with respect to Home Ownership (자가 여부에 따른 채무 불이행 비율)

기타 형태의 주거 생활을하는 채무자의 채무 불이행 비율이 높다.

ggplot(loan_cleaned, aes(x =home_ownership, fill = loan_status)) + 
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "home_ownership",
        y = "Percent",
        title = "Default with respect to Home Ownership") +
  scale_fill_discrete(name="Loan_Status") +
  theme_economist()

ggplot(loan_cleaned, aes(x =home_ownership, fill = loan_status)) + 
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "home_ownership",
        y = "Percent",
        title = "Default with respect to Home Ownership") +
  scale_fill_discrete(name="Loan_Status") +
  theme_economist()

6.6. Verification status with respect to default (수입 상태에 따른 채무 불이행 비율)

수입이 확인되지 않은 상태보다 확인된 상태의 채무 불이행 비율이 더 높다.

ggplot(loan_cleaned, aes(x =verification_status, fill = loan_status)) + 
  geom_bar(stat='count', position='fill', aes(color = I('black')), size = 0.1) +
  labs( x = "Verification_status",
        y = "Percent",
        title = "Verification status with respect to default") +
  theme_economist()

ggplot(loan_cleaned, aes(x =verification_status, fill = loan_status)) + 
  geom_bar(stat='count', position='stack', aes(color = I('black')), size = 0.1) +
  labs( x = "Verification_status",
        y = "Percent",
        title = "Verification status with respect to default") +
  theme_economist()

6.7. Purpose of Loan with respect to Default (대출 목적에 따른 채무 불이행 비율)

Educational(21%)과 Small Business(17%)의 채무 불이행 비율이 다른 대출 목적에 비해 높다.

purpose_summary_2 <- loan_cleaned %>%
  group_by(purpose) %>%
  summarise(pct = (1-sum(binary_status)/n())) %>%
  arrange(desc(pct))

kable(purpose_summary_2, format="markdown")

purpose	pct
educational	0.2080378
small_business	0.1716296
renewable_energy	0.1252174
wedding	0.1231359
house	0.1124899
moving	0.1121167
other	0.0970532
medical	0.0941452
vacation	0.0848818
debt_consolidation	0.0793720
major_purchase	0.0748973
home_improvement	0.0685524
car	0.0672459
credit_card	0.0570127

ggplot(data = purpose_summary_2, aes(x = reorder(purpose, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2, aes(color = I('black')), size = 0.1) +
   labs( x = "Purpose",
        y = "Percent",
        title = "Purpose of Loan with respect to Default") +
  theme_economist() + coord_flip()

6.8. State with respect to default (지역에 따른 채무 불이행 비율)

테네시(Tenesse) 주의 채무 불이행의 비율(20%)이 다른 지역에 비해 상대적으로 높다.

state_summary <- loan_cleaned %>%
  group_by(addr_state) %>%
  summarise(pct = (1-sum(binary_status)/n()))

ggplot(data = state_summary, aes(x = reorder(addr_state, pct), y = pct)) +
  geom_bar(stat = "identity", fill = fillColor2, aes(color = I('white')), size = 0.1) +
  labs( x = "States",
        y = "Percent",
        title = "State of Loan with respect to Default") +
  theme_economist() + coord_flip()

6.9. Loan Amount and Loan Status (대출액과 대출 상태 시각화 비교)

대출 금액이 10,000달러 이상일 때부터, 채무불이행이 발생할 수 있다.

ggplot(loan_cleaned, aes(x= loan_amnt)) +
  geom_density(aes(fill = as.factor(loan_status))) +
  xlab("Loan_amount") + theme_economist()

6.10. Interest Rate and Loan Status (이자율에 따른 채무 불이행 비율)

높은 이자율은(즉, 낮은 신용 등급을 가진 사람들은) 채무 불이행의 위험도가 높다.

ggplot(loan_cleaned, aes(x= int_rate, fill = loan_status)) +
geom_histogram(bins = 10, position = "fill", aes(color = I('black')), size = 0.1)+ 
xlab("Interest Rate")+ 
ylab("Percent of default Vs No default")+theme_few()

ggplot(loan_cleaned, aes(x = loan_status, y = int_rate, fill = loan_status)) + geom_boxplot()

6.11. Credit History (신용 활동 기간에 따른 채무 불이행 비율)

신용 활동 기간이 길수록 금리가 줄어들고, 채무 불이행의 위험도가 낮아진다.

loan_cleaned <- mutate(loan_cleaned, credit_history.d = year(earliest_cr_line))
loan_cleaned <- mutate(loan_cleaned, pull.d = year(last_credit_pull_d))

# Credit History Grouping
loan_cleaned <- mutate(loan_cleaned, cr_hist = pull.d-credit_history.d)
loan_cleaned <- mutate(loan_cleaned, cr_hist_grp = cr_hist)
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist <= 5] <- '<5_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 5 & loan_cleaned$cr_hist <= 10] <- '5-10_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 10 & loan_cleaned$cr_hist <= 15] <- '10-15_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 15 & loan_cleaned$cr_hist <= 20] <- '15-20_yrs'
loan_cleaned$cr_hist_grp[loan_cleaned$cr_hist > 20] <- '>20_yrs'

ggplot(filter(loan_cleaned,cr_hist != "NA") , aes(x= reorder(cr_hist_grp, cr_hist), fill = loan_status)) + 
  geom_bar(position = "fill", aes(color = I('black')), size = 0.1) +
  xlab("Credit_History_Length") +
  ylab("Percent") +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.8, hjust=1)) +
  theme_economist()

6.12. Income Group (소득에 따른 채무 불이행 비율)

소득이 낮을수록 채무 불이행의 위험도는 증가한다.

loan_cleaned <- mutate(loan_cleaned, inc_grp = annual_inc)

loan_cleaned$inc_grp[loan_cleaned$annual_inc <= 50000] <- "verylow_inc"
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 50000 & loan_cleaned$annual_inc <= 80000] <- 'low_inc'
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 80000 & loan_cleaned$annual_inc <= 120000] <- 'middle_inc'
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 120000 & loan_cleaned$annual_inc <= 240000] <- 'high_inc'
loan_cleaned$inc_grp[loan_cleaned$annual_inc > 240000] <- 'veryhigh_inc'

ggplot(loan_cleaned, aes(x= reorder(inc_grp, annual_inc), fill = loan_status, order = loan_status)) +
  geom_bar(position = "fill", aes(color = I('black')), size = 0.1) +
  xlab("Income_groups") +
  ylab("Percent of default Vs No default") +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.8, hjust=1)) +
  theme_economist()

6.13. Loan amount distribution (대출금액 분포)

Desc(df$loan_amnt, main = "Loan amount distribution", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Loan amount distribution
## 
##      length         n       NAs     unique         0s       mean
##     887'379   887'379         0      1'372          0  14'755.26
##                100.0%      0.0%                  0.0%           
##                                                                 
##         .05       .10       .25     median        .75        .90
##    3'600.00  5'000.00  8'000.00  13'000.00  20'000.00  28'000.00
##                                                                 
##       range        sd     vcoef        mad        IQR       skew
##   34'500.00  8'435.46      0.57   8'599.08  12'000.00       0.68
##                                                                 
##      meanCI
##   14'737.71
##   14'772.82
##            
##         .95
##   32'000.00
##            
##        kurt
##       -0.26
##            
## lowest : 500.0 (11), 550.0, 600.0 (6), 700.0 (3), 725.0
## highest: 34'900.0 (14), 34'925.0 (9), 34'950.0 (18), 34'975.0 (31), 35'000.0 (36'368)

6.14. Year of Loan Issue (년도/등급별 대출 현황)

#df$issue_d <- parse_date_time(gsub("^", "01-", df$issue_d), orders = c("d-m-y", "d-B-Y", "m/d/y"))
df$issue_d <- as.Date(gsub("^", "01-", df$issue_d), format="%d-%b-%Y")

df$issue_year <- year(as.yearmon(df$issue_d,"%b-%Y"))

d <- sqldf("
  select
    avg(loan_amnt) as avg_amnt
    , grade
    , issue_year
  from df
  group by grade, issue_year
")

g <- ggplot(d, aes(x = issue_year , y = avg_amnt, color = grade)) +
  geom_line(alpha = 0.4) +
  labs(x = "Year of Loan Issue", y = "Average of Loan Amount issued", title = "Year of Loan Issue") +
  theme_economist()

ggplotly(g, tooltip = c("grade"))

#df$issue_d <- as.Date(gsub("^", "01-", df$issue_d), format="%d-%b-%Y")
#df$issue_year <- year(as.yearmon(df$issue_d,"%b-%Y"))

d <- sqldf("
  select
    grade
    , issue_year
    , count(*) as freq
  from df
  group by 1, 2
")

head(d)

##   grade issue_year  freq
## 1     A       2007    78
## 2     A       2008   318
## 3     A       2009  1203
## 4     A       2010  2830
## 5     A       2011  5754
## 6     A       2012 10901

g <- ggplot(d, aes(x = issue_year , y = freq, color = grade)) +
  geom_line(alpha = 0.4) +
  labs(x = "Year of Loan Issue", y = "Count of Loan issued", title = "Year of Loan Issue") +
  theme_economist()

ggplotly(g, tooltip = c("grade"))

6.15. Loans Amount Distribution by Loan Status (대출 상태별 대출금 분포)

box_status <- ggplot(df, aes(loan_status, loan_amnt))
box_status +
#  geom_boxplot(aes(fill = loan_status)) +
  geom_boxplot(fill = fillColor2) +
  theme(#axis.text.x = element_text(angle=90, hjust = 1),
        axis.text.x=element_blank(),
        legend.position="null") +
  labs(list(
    title = "Loans by Loan Status",
    x = "Loan Status",
    y = "Loan")) +
  theme_economist() +
  guides(fill=FALSE) +
  coord_flip()

credit-risk-modelling-eda-classification

Andrew Jang

10/29/2018

1. Introduction

1.1. Aim

1.2. Dataset

2. Collect the data

2.1. Setting

3. Exploratory Data Analysis 1

3.1. Preview the data

3.2. Basic Statistics

Summary

3.3. Unique values

3.4. Duplicate rows

4. Preprocessing

4.1. Missing values

4.1.1. Visualization

4.1.2. Table

4.1.3. Summary

4.2. Cleansing the data

4.3. Removing duplicate variables

4.4. Converting incorrect data types

5. Exploratory Data Analysis 2

5.1. Term (대출기간)

5.2. Grade (신용등급)

5.3. Employment Length (재직 기간)

5.4. Home Ownership (주택 소유 여부)

5.5. Verification Status (수입 상태)

5.6. Purpose (대출 목적)

5.7. State (주)

6. Exploratory Data Analysis 3

6.1. Loan Ratio by Loan Status (대출 상태별 대출 비율)

6.2. Default with respect to Term (기간별 채무 불이행 비율)

6.3. Default with respect to Grade (등급별 채무 불이행 비율)

6.4. Default with respect to Employee Length (재직 기간별 채무 불이행 비율)

6.5. Default with respect to Home Ownership (자가 여부에 따른 채무 불이행 비율)

6.6. Verification status with respect to default (수입 상태에 따른 채무 불이행 비율)

6.7. Purpose of Loan with respect to Default (대출 목적에 따른 채무 불이행 비율)

6.8. State with respect to default (지역에 따른 채무 불이행 비율)

6.9. Loan Amount and Loan Status (대출액과 대출 상태 시각화 비교)

6.10. Interest Rate and Loan Status (이자율에 따른 채무 불이행 비율)

6.11. Credit History (신용 활동 기간에 따른 채무 불이행 비율)

6.12. Income Group (소득에 따른 채무 불이행 비율)

6.13. Loan amount distribution (대출금액 분포)

6.14. Year of Loan Issue (년도/등급별 대출 현황)

6.15. Loans Amount Distribution by Loan Status (대출 상태별 대출금 분포)