Data Mining with R - Capstone Assignment 7
1. Introduction - Project Goals
Our capstone proposal is on Peer-to-Peer (P2P) lending. Also known as crowd-lending, P2P is the practice of unrelated individuals or companies lending money through online services that match lenders with borrowers, bypassing the need for traditional financial institutions. Today, P2P platforms are among the fastest growing segment in the financial services space. Morgan Stanley in a report in 2015 predicted that such marketplace lending would command USD150 billion to USD490 billion globally by 2020. Some of the well-known marketplaces in the U.S. and Europe are Lending Club Corporation, Prosper Marketplace and Funding Circle.
While the P2P lending marketspace has grown exponentially, the space for recommendation engines capable of recommending the best choices for both the investors and the borrowers is very fragmented. While most lending platforms have their own risk categorization tools, metrics and recommendations, the market for external third-party tools that can evaluate available choices across multiple platforms and suggest the right investment and borrowing choices, is still in its infancy. Our project, P2P Advisor, aims to build a recommendation tool that can uniquely fill this gap in today’s P2P market, enabling both the investors and the borrowers, to have a single platform, to evaluate multiple options and to make the right choices depending on their respective investment risk preference, time horizon and credit levels.
2. Defining the Business Problem and Approach
We believe an agile development approach is most suited for our project. For our first sprint, we have prioritized the building of a propensity model to determine the probability of default of the loans. Following the CRISP data mining framework, we have divided the tasks into several broad areas as follows:
- Data Understanding
- Data Pre-processing
- Unsupervised Learning
- Supervised Learning
- Model Performance Evaluation and Selection
- Visualization
For the purpose of this assignment, we will work with a subset of the data to build a few prototype models using techniques such as PCA, k-means, SVM and random forest. The goal is to demonstrate the data mining techniques and knowledge we have acquired from the Data Mining with R course with Professor Luis Torgo. This assignment also marks the first sprint of our capstone project.
3. Data Understanding and Pre-Processing
As a first step, we will need to gain a thorough understanding of the data. The data comes from Lending Club, one of which is a dataset on loan statistics and the other on rejection statistics. Let us go ahead to import and take a first look at the loan statistics data. The dataset covers the period from April to June 2017.
dataLC <- read_csv("dataLC_2017Q2_clean.csv")
dim(dataLC)
## [1] 105451 116
datatable(dataLC[1:100,]) # Show first 100 listings
There are 105,451 cases and 116 features. Thankfully, Lending Club provides a data dictionary to help us make sense of the features.
dataDictionary <- read_excel("LCDataDictionary.xlsx")
datatable(dataDictionary)
We take a look at the summary statistics too.
summary(dataLC)
## id loan_amnt term int_rate
## Min. : 687119 Min. : 1000 Length:105451 Length:105451
## 1st Qu.:106356644 1st Qu.: 7000 Class :character Class :character
## Median :108441242 Median :12000 Mode :character Mode :character
## Mean :108109021 Mean :14589
## 3rd Qu.:110343424 3rd Qu.:20000
## Max. :112441324 Max. :40000
##
## installment grade sub_grade
## Min. : 30.12 Length:105451 Length:105451
## 1st Qu.: 232.64 Class :character Class :character
## Median : 361.38 Mode :character Mode :character
## Mean : 436.80
## 3rd Qu.: 584.37
## Max. :1719.83
##
## emp_title emp_length home_ownership
## Length:105451 Length:105451 Length:105451
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## annual_inc verification_status issue_d
## Min. : 0 Length:105451 Length:105451
## 1st Qu.: 48000 Class :character Class :character
## Median : 68000 Mode :character Mode :character
## Mean : 80452
## 3rd Qu.: 97000
## Max. :8900000
##
## loan_status purpose zip_code
## Length:105451 Length:105451 Length:105451
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## addr_state dti delinq_2yrs earliest_cr_line
## Length:105451 Min. : 0.00 Min. : 0.000 Length:105451
## Class :character 1st Qu.: 12.24 1st Qu.: 0.000 Class :character
## Mode :character Median : 18.13 Median : 0.000 Mode :character
## Mean : 26.43 Mean : 0.356
## 3rd Qu.: 24.60 3rd Qu.: 0.000
## Max. :29550.00 Max. :42.000
##
## fico_range_low fico_range_high inq_last_6mths mths_since_last_delinq
## Min. :660 Min. :664 Min. :0.0000 Min. : 0.00
## 1st Qu.:670 1st Qu.:674 1st Qu.:0.0000 1st Qu.: 15.00
## Median :690 Median :694 Median :0.0000 Median : 30.00
## Mean :698 Mean :702 Mean :0.5095 Mean : 33.29
## 3rd Qu.:715 3rd Qu.:719 3rd Qu.:1.0000 3rd Qu.: 49.00
## Max. :845 Max. :850 Max. :5.0000 Max. :161.00
## NA's :51119
## mths_since_last_record open_acc pub_rec
## Min. : 0.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 53.00 1st Qu.: 8.00 1st Qu.: 0.0000
## Median : 73.00 Median :11.00 Median : 0.0000
## Mean : 69.49 Mean :11.85 Mean : 0.2571
## 3rd Qu.: 88.00 3rd Qu.:15.00 3rd Qu.: 0.0000
## Max. :119.00 Max. :88.00 Max. :22.0000
## NA's :85128
## revol_bal revol_util total_acc initial_list_status
## Min. : 0 Length:105451 Min. : 2.00 Length:105451
## 1st Qu.: 5887 Class :character 1st Qu.: 15.00 Class :character
## Median : 11170 Mode :character Median : 22.00 Mode :character
## Mean : 16596 Mean : 23.76
## 3rd Qu.: 19934 3rd Qu.: 30.00
## Max. :1039013 Max. :146.00
##
## out_prncp total_pymnt total_rec_prncp total_rec_int
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.: 6475 1st Qu.: 294.5 1st Qu.: 185.5 1st Qu.: 72.74
## Median :11599 Median : 591.4 Median : 366.9 Median : 172.40
## Mean :13761 Mean : 1097.4 Mean : 827.6 Mean : 269.74
## 3rd Qu.:19463 3rd Qu.: 1086.2 3rd Qu.: 685.1 3rd Qu.: 363.41
## Max. :40000 Max. :41722.3 Max. :40000.0 Max. :3120.01
##
## total_rec_late_fee last_pymnt_d last_pymnt_amnt
## Min. : 0.00000 Length:105451 Min. : 0.0
## 1st Qu.: 0.00000 Class :character 1st Qu.: 196.9
## Median : 0.00000 Mode :character Median : 345.1
## Mean : 0.01326 Mean : 717.5
## 3rd Qu.: 0.00000 3rd Qu.: 588.9
## Max. :47.00000 Max. :40937.8
## NA's :89
## next_pymnt_d last_credit_pull_d last_fico_range_high
## Length:105451 Length:105451 Min. : 0.0
## Class :character Class :character 1st Qu.:679.0
## Mode :character Mode :character Median :704.0
## Mean :708.7
## 3rd Qu.:734.0
## Max. :850.0
##
## last_fico_range_low collections_12_mths_ex_med
## Min. : 0.0 Min. :0.00000
## 1st Qu.:675.0 1st Qu.:0.00000
## Median :700.0 Median :0.00000
## Mean :704.7 Mean :0.02319
## 3rd Qu.:730.0 3rd Qu.:0.00000
## Max. :845.0 Max. :4.00000
##
## mths_since_last_major_derog application_type annual_inc_joint
## Min. : 0.00 Length:105451 Min. : 15850
## 1st Qu.: 27.00 Class :character 1st Qu.: 80000
## Median : 44.00 Mode :character Median :105500
## Mean : 44.36 Mean :114779
## 3rd Qu.: 62.00 3rd Qu.:138900
## Max. :161.00 Max. :760000
## NA's :76297 NA's :98638
## dti_joint verification_status_joint acc_now_delinq
## Min. : 0.13 Length:105451 Min. :0.000000
## 1st Qu.:13.69 Class :character 1st Qu.:0.000000
## Median :18.63 Mode :character Median :0.000000
## Mean :18.73 Mean :0.005348
## 3rd Qu.:23.74 3rd Qu.:0.000000
## Max. :34.99 Max. :3.000000
## NA's :98638
## tot_coll_amt tot_cur_bal open_acc_6m open_il_6m
## Min. : 0.0 Min. : 0 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.0 1st Qu.: 29967 1st Qu.: 0.0000 1st Qu.: 1.000
## Median : 0.0 Median : 83660 Median : 1.0000 Median : 2.000
## Mean : 238.7 Mean : 147641 Mean : 0.8698 Mean : 2.882
## 3rd Qu.: 0.0 3rd Qu.: 221372 3rd Qu.: 1.0000 3rd Qu.: 3.000
## Max. :262740.0 Max. :3566850 Max. :16.0000 Max. :55.000
##
## open_il_12m open_il_24m mths_since_rcnt_il total_bal_il
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 0
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9353
## Median :0.0000 Median : 1.000 Median : 14.00 Median : 23700
## Mean :0.6143 Mean : 1.545 Mean : 21.28 Mean : 36200
## 3rd Qu.:1.0000 3rd Qu.: 2.000 3rd Qu.: 24.00 3rd Qu.: 46773
## Max. :9.0000 Max. :14.000 Max. :411.00 Max. :1267131
## NA's :2652
## il_util open_rv_12m open_rv_24m max_bal_bc
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0
## 1st Qu.: 54.00 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 2382
## Median : 71.00 Median : 1.00 Median : 2.000 Median : 4466
## Mean : 68.73 Mean : 1.32 Mean : 2.814 Mean : 5805
## 3rd Qu.: 85.00 3rd Qu.: 2.00 3rd Qu.: 4.000 3rd Qu.: 7542
## Max. :464.00 Max. :20.00 Max. :45.000 Max. :298260
## NA's :13380
## all_util total_rev_hi_lim inq_fi total_cu_tl
## Min. : 0.00 Min. : 0 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 44.00 1st Qu.: 14800 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 59.00 Median : 25900 Median : 0.0000 Median : 0.000
## Mean : 58.14 Mean : 35098 Mean : 0.9692 Mean : 1.456
## 3rd Qu.: 73.00 3rd Qu.: 43800 3rd Qu.: 1.0000 3rd Qu.: 2.000
## Max. :185.00 Max. :1252500 Max. :27.0000 Max. :71.000
## NA's :13
## inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy
## Min. : 0.000 Min. : 0.000 Min. : 0 Min. : 0
## 1st Qu.: 0.000 1st Qu.: 2.000 1st Qu.: 3130 1st Qu.: 1982
## Median : 1.000 Median : 4.000 Median : 7574 Median : 5908
## Mean : 2.047 Mean : 4.638 Mean : 13804 Mean : 11906
## 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 19316 3rd Qu.: 14903
## Max. :40.000 Max. :45.000 Max. :445856 Max. :357289
## NA's :2 NA's :1143
## bc_util chargeoff_within_12_mths delinq_amnt
## Min. : 0.00 Min. :0.000000 Min. : 0.0
## 1st Qu.: 33.70 1st Qu.:0.000000 1st Qu.: 0.0
## Median : 58.00 Median :0.000000 Median : 0.0
## Mean : 56.51 Mean :0.009189 Mean : 16.6
## 3rd Qu.: 81.40 3rd Qu.:0.000000 3rd Qu.: 0.0
## Max. :166.40 Max. :4.000000 Max. :65000.0
## NA's :1191
## mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op
## Min. : 0.0 Min. : 5.0 Min. : 0.00
## 1st Qu.: 96.0 1st Qu.:117.0 1st Qu.: 4.00
## Median :130.0 Median :162.0 Median : 9.00
## Mean :126.1 Mean :180.6 Mean : 14.01
## 3rd Qu.:154.0 3rd Qu.:232.0 3rd Qu.: 17.00
## Max. :808.0 Max. :999.0 Max. :345.00
## NA's :2652
## mo_sin_rcnt_tl mort_acc mths_since_recent_bc
## Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 3.000 1st Qu.: 0.00 1st Qu.: 6.00
## Median : 6.000 Median : 1.00 Median : 13.00
## Mean : 8.359 Mean : 1.43 Mean : 24.17
## 3rd Qu.: 11.000 3rd Qu.: 2.00 3rd Qu.: 29.00
## Max. :195.000 Max. :23.00 Max. :555.00
## NA's :1083
## mths_since_recent_bc_dlq mths_since_recent_inq
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 19.00 1st Qu.: 2.000
## Median : 34.00 Median : 6.000
## Mean : 37.05 Mean : 7.105
## 3rd Qu.: 52.00 3rd Qu.:11.000
## Max. :161.00 Max. :25.000
## NA's :79994 NA's :11949
## mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl
## Min. : 0 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 16 1st Qu.: 0.0000 1st Qu.: 2.000
## Median : 30 Median : 0.0000 Median : 3.000
## Mean : 34 Mean : 0.5513 Mean : 3.687
## 3rd Qu.: 48 3rd Qu.: 0.0000 3rd Qu.: 5.000
## Max. :161 Max. :42.0000 Max. :31.000
## NA's :68375
## num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.: 4.000
## Median : 5.000 Median : 4.000 Median : 6.000 Median : 7.000
## Mean : 5.618 Mean : 4.813 Mean : 7.301 Mean : 8.614
## 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.: 9.000 3rd Qu.: 11.000
## Max. :63.000 Max. :43.000 Max. :66.000 Max. :104.000
##
## num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats
## Min. : 0.000 Min. : 2.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 5.000 1st Qu.: 8.0 1st Qu.: 3.000 1st Qu.: 8.0
## Median : 7.000 Median : 12.0 Median : 5.000 Median :11.0
## Mean : 8.258 Mean : 13.5 Mean : 5.534 Mean :11.8
## 3rd Qu.:10.000 3rd Qu.: 17.0 3rd Qu.: 7.000 3rd Qu.:15.0
## Max. :65.000 Max. :113.0 Max. :40.000 Max. :87.0
##
## num_tl_120dpd_2m num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m
## Min. :0.000 Min. :0.000000 Min. : 0.00000 Min. : 0.000
## 1st Qu.:0.000 1st Qu.:0.000000 1st Qu.: 0.00000 1st Qu.: 1.000
## Median :0.000 Median :0.000000 Median : 0.00000 Median : 2.000
## Mean :0.001 Mean :0.003452 Mean : 0.09338 Mean : 2.073
## 3rd Qu.:0.000 3rd Qu.:0.000000 3rd Qu.: 0.00000 3rd Qu.: 3.000
## Max. :2.000 Max. :2.000000 Max. :42.00000 Max. :22.000
## NA's :5278
## pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens
## Min. : 0.00 Min. : 0.00 Min. :0.0000 Min. : 0.00000
## 1st Qu.: 90.40 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 0.00000
## Median : 97.40 Median : 33.30 Median :0.0000 Median : 0.00000
## Mean : 93.48 Mean : 40.56 Mean :0.1411 Mean : 0.07516
## 3rd Qu.:100.00 3rd Qu.: 66.70 3rd Qu.:0.0000 3rd Qu.: 0.00000
## Max. :100.00 Max. :100.00 Max. :7.0000 Max. :22.00000
## NA's :1146
## tot_hi_cred_lim total_bal_ex_mort total_bc_limit
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 52948 1st Qu.: 21602 1st Qu.: 8500
## Median : 119740 Median : 38987 Median : 16800
## Mean : 185072 Mean : 52964 Mean : 23660
## 3rd Qu.: 268068 3rd Qu.: 66820 3rd Qu.: 30900
## Max. :4460960 Max. :1310848 Max. :407900
##
## total_il_high_credit_limit revol_bal_joint sec_app_fico_range_low
## Min. : 0 Min. : 0 Min. :540.0
## 1st Qu.: 16833 1st Qu.: 14052 1st Qu.:645.0
## Median : 35000 Median : 24042 Median :665.0
## Mean : 46440 Mean : 29151 Mean :667.4
## 3rd Qu.: 62100 3rd Qu.: 38259 3rd Qu.:690.0
## Max. :1349413 Max. :219950 Max. :830.0
## NA's :98645 NA's :98645
## sec_app_fico_range_high sec_app_earliest_cr_line sec_app_inq_last_6mths
## Min. :544.0 Length:105451 Min. :0.00
## 1st Qu.:649.0 Class :character 1st Qu.:0.00
## Median :669.0 Mode :character Median :0.00
## Mean :671.4 Mean :0.62
## 3rd Qu.:694.0 3rd Qu.:1.00
## Max. :834.0 Max. :6.00
## NA's :98645 NA's :98645
## sec_app_mort_acc sec_app_open_acc sec_app_revol_util sec_app_open_il_6m
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 43.80 1st Qu.: 1.00
## Median : 1.00 Median :10.00 Median : 62.90 Median : 2.00
## Mean : 1.68 Mean :11.29 Mean : 60.53 Mean : 2.89
## 3rd Qu.: 3.00 3rd Qu.:14.00 3rd Qu.: 79.30 3rd Qu.: 3.00
## Max. :13.00 Max. :57.00 Max. :129.70 Max. :39.00
## NA's :98645 NA's :98645 NA's :98722 NA's :98645
## sec_app_num_rev_accts sec_app_chargeoff_within_12_mths
## Min. : 0.00 Min. :0.00
## 1st Qu.: 7.00 1st Qu.:0.00
## Median :11.00 Median :0.00
## Mean :12.41 Mean :0.03
## 3rd Qu.:16.00 3rd Qu.:0.00
## Max. :84.00 Max. :6.00
## NA's :98645 NA's :98645
## sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 19.00
## Median : 0.00 Median : 39.00
## Mean : 0.07 Mean : 38.61
## 3rd Qu.: 0.00 3rd Qu.: 57.00
## Max. :10.00 Max. :136.00
## NA's :98645 NA's :103117
There are quite a lot of missing values. Let’s check how many of the columns have missing values and remove those with more than 20% missing data.
colSums(is.na(dataLC))
## id loan_amnt
## 0 0
## term int_rate
## 0 0
## installment grade
## 0 0
## sub_grade emp_title
## 0 6782
## emp_length home_ownership
## 0 0
## annual_inc verification_status
## 0 0
## issue_d loan_status
## 0 0
## purpose zip_code
## 0 0
## addr_state dti
## 0 0
## delinq_2yrs earliest_cr_line
## 0 0
## fico_range_low fico_range_high
## 0 0
## inq_last_6mths mths_since_last_delinq
## 0 51119
## mths_since_last_record open_acc
## 85128 0
## pub_rec revol_bal
## 0 0
## revol_util total_acc
## 75 0
## initial_list_status out_prncp
## 0 0
## total_pymnt total_rec_prncp
## 0 0
## total_rec_int total_rec_late_fee
## 0 89
## last_pymnt_d last_pymnt_amnt
## 8718 0
## next_pymnt_d last_credit_pull_d
## 2585 3
## last_fico_range_high last_fico_range_low
## 0 0
## collections_12_mths_ex_med mths_since_last_major_derog
## 0 76297
## application_type annual_inc_joint
## 0 98638
## dti_joint verification_status_joint
## 98638 98638
## acc_now_delinq tot_coll_amt
## 0 0
## tot_cur_bal open_acc_6m
## 0 0
## open_il_6m open_il_12m
## 0 0
## open_il_24m mths_since_rcnt_il
## 0 2652
## total_bal_il il_util
## 0 13380
## open_rv_12m open_rv_24m
## 0 0
## max_bal_bc all_util
## 0 13
## total_rev_hi_lim inq_fi
## 0 0
## total_cu_tl inq_last_12m
## 0 0
## acc_open_past_24mths avg_cur_bal
## 0 2
## bc_open_to_buy bc_util
## 1143 1191
## chargeoff_within_12_mths delinq_amnt
## 0 0
## mo_sin_old_il_acct mo_sin_old_rev_tl_op
## 2652 0
## mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl
## 0 0
## mort_acc mths_since_recent_bc
## 0 1083
## mths_since_recent_bc_dlq mths_since_recent_inq
## 79994 11949
## mths_since_recent_revol_delinq num_accts_ever_120_pd
## 68375 0
## num_actv_bc_tl num_actv_rev_tl
## 0 0
## num_bc_sats num_bc_tl
## 0 0
## num_il_tl num_op_rev_tl
## 0 0
## num_rev_accts num_rev_tl_bal_gt_0
## 0 0
## num_sats num_tl_120dpd_2m
## 0 5278
## num_tl_30dpd num_tl_90g_dpd_24m
## 0 0
## num_tl_op_past_12m pct_tl_nvr_dlq
## 0 0
## percent_bc_gt_75 pub_rec_bankruptcies
## 1146 0
## tax_liens tot_hi_cred_lim
## 0 0
## total_bal_ex_mort total_bc_limit
## 0 0
## total_il_high_credit_limit revol_bal_joint
## 0 98645
## sec_app_fico_range_low sec_app_fico_range_high
## 98645 98645
## sec_app_earliest_cr_line sec_app_inq_last_6mths
## 98645 98645
## sec_app_mort_acc sec_app_open_acc
## 98645 98645
## sec_app_revol_util sec_app_open_il_6m
## 98722 98645
## sec_app_num_rev_accts sec_app_chargeoff_within_12_mths
## 98645 98645
## sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog
## 98645 103117
dataLC1 <- dataLC[, colSums(is.na(dataLC)) <= 0.2*nrow(dataLC)] # remove columns with more than 20% missing values
dim(dataLC1)
## [1] 105451 95
colnames(dataLC1)
## [1] "id" "loan_amnt"
## [3] "term" "int_rate"
## [5] "installment" "grade"
## [7] "sub_grade" "emp_title"
## [9] "emp_length" "home_ownership"
## [11] "annual_inc" "verification_status"
## [13] "issue_d" "loan_status"
## [15] "purpose" "zip_code"
## [17] "addr_state" "dti"
## [19] "delinq_2yrs" "earliest_cr_line"
## [21] "fico_range_low" "fico_range_high"
## [23] "inq_last_6mths" "open_acc"
## [25] "pub_rec" "revol_bal"
## [27] "revol_util" "total_acc"
## [29] "initial_list_status" "out_prncp"
## [31] "total_pymnt" "total_rec_prncp"
## [33] "total_rec_int" "total_rec_late_fee"
## [35] "last_pymnt_d" "last_pymnt_amnt"
## [37] "next_pymnt_d" "last_credit_pull_d"
## [39] "last_fico_range_high" "last_fico_range_low"
## [41] "collections_12_mths_ex_med" "application_type"
## [43] "acc_now_delinq" "tot_coll_amt"
## [45] "tot_cur_bal" "open_acc_6m"
## [47] "open_il_6m" "open_il_12m"
## [49] "open_il_24m" "mths_since_rcnt_il"
## [51] "total_bal_il" "il_util"
## [53] "open_rv_12m" "open_rv_24m"
## [55] "max_bal_bc" "all_util"
## [57] "total_rev_hi_lim" "inq_fi"
## [59] "total_cu_tl" "inq_last_12m"
## [61] "acc_open_past_24mths" "avg_cur_bal"
## [63] "bc_open_to_buy" "bc_util"
## [65] "chargeoff_within_12_mths" "delinq_amnt"
## [67] "mo_sin_old_il_acct" "mo_sin_old_rev_tl_op"
## [69] "mo_sin_rcnt_rev_tl_op" "mo_sin_rcnt_tl"
## [71] "mort_acc" "mths_since_recent_bc"
## [73] "mths_since_recent_inq" "num_accts_ever_120_pd"
## [75] "num_actv_bc_tl" "num_actv_rev_tl"
## [77] "num_bc_sats" "num_bc_tl"
## [79] "num_il_tl" "num_op_rev_tl"
## [81] "num_rev_accts" "num_rev_tl_bal_gt_0"
## [83] "num_sats" "num_tl_120dpd_2m"
## [85] "num_tl_30dpd" "num_tl_90g_dpd_24m"
## [87] "num_tl_op_past_12m" "pct_tl_nvr_dlq"
## [89] "percent_bc_gt_75" "pub_rec_bankruptcies"
## [91] "tax_liens" "tot_hi_cred_lim"
## [93] "total_bal_ex_mort" "total_bc_limit"
## [95] "total_il_high_credit_limit"
We have removed 21 columns and are left with 95 feature variables.
Let us also check if any of the transactions have more than 20% of missing data and go ahead to remove those. We do not find any of such transactions.
dataLC2 <- dataLC1[rowSums(is.na(dataLC1)) <= 0.2*ncol(dataLC1), ] # remove rows with more than 20% missing values
dim(dataLC2)
## [1] 105451 95
Feature Understanding
Let us now examine the features. We start by taking a closer look at the loan amount, which is the listed amount of the loan applied for by the borrower.
Desc(dataLC2$loan_amnt, main = "Loan Amount", plotit = TRUE)
## -------------------------------------------------------------------------
## Loan Amount
##
## length n NAs unique 0s mean
## 105'451 105'451 0 1'455 0 14'589.07
## 100.0% 0.0% 0.0%
##
## .05 .10 .25 median .75 .90
## 3'000.00 4'200.00 7'000.00 12'000.00 20'000.00 30'000.00
##
## range sd vcoef mad IQR skew
## 39'000.00 9'478.83 0.65 8'895.60 13'000.00 0.83
##
## meanCI
## 14'531.86
## 14'646.28
##
## .95
## 35'000.00
##
## kurt
## -0.12
##
## lowest : 1'000 (592), 1'025 (3), 1'050 (5), 1'075 (2), 1'100 (11)
## highest: 39'775, 39'800, 39'825, 39'975, 40'000 (1'593)
The maximum loan amount is $40,000 and the minimum is $1,000. The median is $12,000. The distribution is not normal, has multiple peaks, and there are some outliers at the right tail.
Let’s take a look at the term variable, which is the number of payments on the loan. We see that there are only two types of loan terms, either 36 months or 60 months. Most of the loans (73%) are shorter, with a term of 36 months.
Desc(dataLC2$term, main = "Term", plotit = TRUE)
## -------------------------------------------------------------------------
## Term
##
## length n NAs unique
## 105'451 105'451 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 36 months 77'105 73.1% 72.9% 73.4%
## 60 months 28'346 26.9% 26.6% 27.1%
##
## ' 95%-CI Wilson
Next, we take a look at the interest rate on the loan. We first remove the “%” and convert it into a numeric.
dataLC2$int_rate = as.numeric(gsub("\\%", "", dataLC2$int_rate))
Desc(dataLC2$int_rate, main = "Interest Rate", plotit = TRUE)
## -------------------------------------------------------------------------
## Interest Rate
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 65 0 13.29 13.26
## 100.0% 0.0% 0.0% 13.32
##
## .05 .10 .25 median .75 .90 .95
## 6.99 7.35 9.93 12.62 15.99 19.03 23.99
##
## range sd vcoef mad IQR skew kurt
## 25.67 5.16 0.39 3.99 6.06 1.11 1.60
##
## lowest : 5.32 (4'365), 6.0 (4), 6.99 (1'187), 7.07 (1'925), 7.21 (1'719)
## highest: 30.79 (203), 30.84 (178), 30.89 (134), 30.94 (123), 30.99 (147)
ggplot(dataLC2, aes(int_rate, fill = grade)) +
geom_density() +
facet_grid(grade ~ .)
The range of interest rate on the loans is quite wide, from 5.32% to 30.99%. The median interest rate is 12.62%. Notably, there are only 65 unique values. Again, we see that the distribution is not normal, has multiple peaks, and there are some outliers at the right tail. The lower the loan grade, the higher the interest rate.
Next, we examine the installment amount, which is the monthly payment owed by the borrower if the loan originates.
Desc(dataLC2$installment, main = "Installment", plotit = TRUE)
## -------------------------------------------------------------------------
## Installment
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 16'277 0 436.80 435.11
## 100.0% 0.0% 0.0% 438.49
##
## .05 .10 .25 median .75 .90 .95
## 98.92 140.67 232.64 361.38 584.37 839.23 1'024.16
##
## range sd vcoef mad IQR skew kurt
## 1'689.71 280.18 0.64 240.36 351.73 1.07 0.76
##
## lowest : 30.12 (13), 30.88 (4), 30.91 (6), 30.98 (11), 30.99 (9)
## highest: 1'566.80 (2), 1'569.11 (2), 1'587.23 (2), 1'691.28, 1'719.83
The range of installments is wide, from $30.12 to $1,719.83 per month. The mean is $436.80. Obviously, one would expect the installments to be highly correlated to the loan amount, as we can see from the simple correlation plot below.
cor(data.frame(dataLC2$installment, dataLC2$loan_amnt)) %>%
corrplot(method="number", tl.cex=0.5)
Next, we take a look at the Lending Club-assigned grade for the loans. There are 7 different grades from A to G. Most of the loans are graded C (35%). By and large, the loans are graded highly by Lending Club, with nearly 82% graded C and above. Less than 1% of the loans are graded G. Notably, the loan amounts for the high-grade loans tend to be smaller than the poorly-graded loans.
Desc(dataLC2$grade, main = "Grade", plotit = TRUE)
## -------------------------------------------------------------------------
## Grade
##
## length n NAs unique levels dupes
## 105'451 105'451 0 7 7 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 C 36'880 35.0% 36'880 35.0%
## 2 B 31'301 29.7% 68'181 64.7%
## 3 A 17'898 17.0% 86'079 81.6%
## 4 D 12'015 11.4% 98'094 93.0%
## 5 E 4'847 4.6% 102'941 97.6%
## 6 F 1'725 1.6% 104'666 99.3%
## 7 G 785 0.7% 105'451 100.0%
Desc(loan_amnt ~ grade, dataLC2, digits=1, plotit=TRUE)
## -------------------------------------------------------------------------
## loan_amnt ~ grade
##
## Summary:
## n pairs: 105'451, valid: 105'451 (100.0%), missings: 0 (0.0%), groups: 7
##
##
## A B C D E F
## mean 13'528.0 13'313.8 15'132.6 15'820.5 17'077.6 18'988.9
## median 10'000.0 10'000.0 13'000.0 14'400.0 15'400.0 18'000.0
## sd 9'410.2 9'152.6 9'511.9 9'510.6 9'612.3 9'159.1
## IQR 11'600.0 12'000.0 12'000.0 13'487.5 14'000.0 13'100.0
## n 17'898 31'301 36'880 12'015 4'847 1'725
## np 17.0% 29.7% 35.0% 11.4% 4.6% 1.6%
## NAs 0 0 0 0 0 0
## 0s 0 0 0 0 0 0
##
## G
## mean 20'214.1
## median 19'475.0
## sd 8'997.5
## IQR 14'600.0
## n 785
## np 0.7%
## NAs 0
## 0s 0
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 2560.6, df = 6, p-value < 2.2e-16
We see a similar pattern for sub-grades. There are five sub-grades - 1 to 5 - for every grade. With the notable exception of sub-grade D2 and D4, the loan amounts for the high-grade loans tend to be smaller.
Desc(dataLC2$sub_grade, main = "Sub-Grade", plotit = TRUE)
## -------------------------------------------------------------------------
## Sub-Grade
##
## length n NAs unique levels dupes
## 105'451 105'451 0 35 35 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 C1 8'088 7.7% 8'088 7.7%
## 2 C5 7'985 7.6% 16'073 15.2%
## 3 B5 7'511 7.1% 23'584 22.4%
## 4 C4 7'208 6.8% 30'792 29.2%
## 5 C3 6'973 6.6% 37'765 35.8%
## 6 C2 6'626 6.3% 44'391 42.1%
## 7 B1 6'416 6.1% 50'807 48.2%
## 8 B3 6'012 5.7% 56'819 53.9%
## 9 B4 5'821 5.5% 62'640 59.4%
## 10 B2 5'541 5.3% 68'181 64.7%
## 11 A5 4'690 4.4% 72'871 69.1%
## 12 A1 4'365 4.1% 77'236 73.2%
## ... etc.
## [list output truncated]
ggplot(dataLC2,aes(x=sub_grade, y=loan_amnt)) +
geom_boxplot() +
xlab("Sub-grade") +
ylab("Loan Amount")
Let’s see whether the interest rates are dependent on the loan grade.
Desc(int_rate ~ grade, dataLC2, digits = 1, main = "Interest rate by Grade", plotit = TRUE)
## -------------------------------------------------------------------------
## Interest rate by Grade
##
## Summary:
## n pairs: 105'451, valid: 105'451 (100.0%), missings: 0 (0.0%), groups: 7
##
##
## A B C D E F G
## mean 7.0 10.6 14.3 18.6 24.7 29.7 30.9
## median 7.2 10.5 14.1 18.1 24.9 29.7 30.9
## sd 1.0 1.1 1.2 1.4 1.3 0.8 0.1
## IQR 1.0 1.5 1.6 1.9 2.4 1.8 0.2
## n 17'898 31'301 36'880 12'015 4'847 1'725 785
## np 17.0% 29.7% 35.0% 11.4% 4.6% 1.6% 0.7%
## NAs 0 0 0 0 0 0 0
## 0s 0 0 0 0 0 0 0
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 97582, df = 6, p-value < 2.2e-16
Unsurprisingly, the higher the grade, the higher the interest rates.
Let’s take a look at the employment title, which is the job title supplied by the Borrower when applying for the loan.
Desc(dataLC2$emp_title, main = "Employment Title", plotit = TRUE)
## -------------------------------------------------------------------------
## Employment Title
##
## length n NAs unique levels dupes
## 105'451 98'669 6'782 35'885 35'885 y
## 93.6% 6.4%
##
## level freq perc cumfreq cumperc
## 1 Teacher 2'158 2.2% 2'158 2.2%
## 2 Manager 2'015 2.0% 4'173 4.2%
## 3 Owner 1'310 1.3% 5'483 5.6%
## 4 Supervisor 962 1.0% 6'445 6.5%
## 5 Driver 949 1.0% 7'394 7.5%
## 6 Registered Nurse 906 0.9% 8'300 8.4%
## 7 RN 824 0.8% 9'124 9.2%
## 8 Sales 800 0.8% 9'924 10.1%
## 9 Project Manager 619 0.6% 10'543 10.7%
## 10 Director 569 0.6% 11'112 11.3%
## 11 Office Manager 538 0.5% 11'650 11.8%
## 12 General Manager 517 0.5% 12'167 12.3%
## ... etc.
## [list output truncated]
There are 35,885 different titles, which makes it hard to make sense of it. Teacher was the most commonly supplied title, albeit only 2.2% of the Borrowers. There is also quite bit of missing data, 6,782 to be precise (or 6.4%). It is also not clear what some of the titles mean, for example, RN. We can probably drop this feature.
Employment length is a more useful feature, notwithstanding the considerable number of missing data (6,697). There are 11 levels, from < 1 year, 1 year, through to 10+ years (note: possible values are between 0 and 10 where 0 means less than one year). Notably, Borrowers’ employment length falls into 2 big clusters: the most frequent is mid-career professionals with 10+ years of experience (33.6%), the other is young people with under 3 years of working experience (33.1%).
Desc(dataLC2$emp_length, main = "Employment Length", plotit = TRUE)
## -------------------------------------------------------------------------
## Employment Length
##
## length n NAs unique levels dupes
## 105'451 105'451 0 12 12 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 10+ years 35'438 33.6% 35'438 33.6%
## 2 2 years 9'914 9.4% 45'352 43.0%
## 3 < 1 year 9'542 9.0% 54'894 52.1%
## 4 3 years 8'495 8.1% 63'389 60.1%
## 5 1 year 7'034 6.7% 70'423 66.8%
## 6 n/a 6'697 6.4% 77'120 73.1%
## 7 4 years 6'453 6.1% 83'573 79.3%
## 8 5 years 6'382 6.1% 89'955 85.3%
## 9 6 years 4'526 4.3% 94'481 89.6%
## 10 7 years 3'847 3.6% 98'328 93.2%
## 11 9 years 3'828 3.6% 102'156 96.9%
## 12 8 years 3'295 3.1% 105'451 100.0%
Let us take a look at the type of home ownership status provided by the Borrower during registration (or obtained from their credit report). Most Borrowers either have existing mortgages (49.8%). or are renting their home (38.9%).
Desc(dataLC2$home_ownership, main = "Home Ownership", plotit = TRUE)
## -------------------------------------------------------------------------
## Home Ownership
##
## length n NAs unique levels dupes
## 105'451 105'451 0 5 5 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 MORTGAGE 52'502 49.8% 52'502 49.8%
## 2 RENT 41'069 38.9% 93'571 88.7%
## 3 OWN 11'873 11.3% 105'444 100.0%
## 4 ANY 5 0.0% 105'449 100.0%
## 5 NONE 2 0.0% 105'451 100.0%
Next, we examine the Borrowers’ self-reported annual income. The distribution is highly postively skewed, from zero income to an income of $8.9 million (which appears to be an outlier). The median income is $68,000.
Desc(dataLC2$annual_inc, main = "Annual Income", plotit = TRUE)
## -------------------------------------------------------------------------
## Annual Income
##
## length n NAs unique 0s mean
## 105'451 105'451 0 9'112 74 80'451.97
## 100.0% 0.0% 0.1%
##
## .05 .10 .25 median .75 .90
## 28'000.00 35'000.00 48'000.00 68'000.00 97'000.00 137'000.00
##
## range sd vcoef mad IQR skew
## 8'900'000.00 68'220.64 0.85 34'099.80 49'000.00 33.59
##
## meanCI
## 80'040.21
## 80'863.73
##
## .95
## 170'000.00
##
## kurt
## 3'375.46
##
## lowest : 0.0 (74), 1.0, 50.0, 250.0, 487.0
## highest: 3'650'010.0, 3'800'000.0, 4'000'040.0, 5'616'890.0, 8'900'000.0
Is there a relationship between Annual Income and Loan Amount? The scatterplot belows suggests there isn’t but the outliers are apparent.
ggplot(dataLC2,aes(x=annual_inc, y=loan_amnt)) +
geom_point() +
xlab("Annual Income") +
ylab("Loan Amount")
Verification status tells us whethere the income was verified by Lending Club, not verified, or if the income source was verified. Most of the income has had its source verified (38.5%) although 35.7% were not verified.
Desc(dataLC2$verification_status, main = "Verification Status", plotit = TRUE)
## -------------------------------------------------------------------------
## Verification Status
##
## length n NAs unique levels dupes
## 105'451 105'451 0 3 3 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 Source Verified 40'622 38.5% 40'622 38.5%
## 2 Not Verified 37'599 35.7% 78'221 74.2%
## 3 Verified 27'230 25.8% 105'451 100.0%
Issue date shows the month in which the loan was funded. For Q2 2017, most of the loans (36.1%) were issued in Jun. April was notably slower, with only 28.1% of loans issued.
Desc(dataLC2$issue_d, main = "Issue Date", plotit = TRUE)
## -------------------------------------------------------------------------
## Issue Date
##
## length n NAs unique levels dupes
## 105'451 105'451 0 3 3 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 Jun-17 38'087 36.1% 38'087 36.1%
## 2 May-17 37'681 35.7% 75'768 71.9%
## 3 Apr-17 29'683 28.1% 105'451 100.0%
Let’s take a look at the loan status - our target variable. There are 7 levels, with most loans being current (95.6%). What is interesting to us are the loans that are either late or charged off. These form only 0.4% of the loans. There doesn’t seem to be a discernible pattern between loan status and loan amount.
Desc(dataLC2$loan_status, main = "Loan Status", plotit = TRUE)
## -------------------------------------------------------------------------
## Loan Status
##
## length n NAs unique levels dupes
## 105'451 105'451 0 7 7 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 Current 100'863 95.6% 100'863 95.6%
## 2 Fully Paid 2'580 2.4% 103'443 98.1%
## 3 Issued 867 0.8% 104'310 98.9%
## 4 In Grace Period 757 0.7% 105'067 99.6%
## 5 Late (31-120 days) 196 0.2% 105'263 99.8%
## 6 Late (16-30 days) 183 0.2% 105'446 100.0%
## 7 Charged Off 5 0.0% 105'451 100.0%
ggplot(dataLC2, aes(loan_status, loan_amnt)) +
geom_boxplot(aes(fill = loan_status)) +
theme(axis.text.x = element_blank()) +
labs(list(
title = "Loan amount by status",
x = "Status",
y = "Amount"))
The purpose variable is a category provided by the borrower for the loan request. There are 13 categories. Over half of the loans (55.5%) were for debt consolidation. Other significant categories were credit card and home improvement.
Desc(dataLC2$purpose, main = "Purpose", plotit = TRUE)
## -------------------------------------------------------------------------
## Purpose
##
## length n NAs unique levels dupes
## 105'451 105'451 0 13 13 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 debt_consolidation 58'557 55.5% 58'557 55.5%
## 2 credit_card 21'261 20.2% 79'818 75.7%
## 3 home_improvement 9'222 8.7% 89'040 84.4%
## 4 other 7'140 6.8% 96'180 91.2%
## 5 major_purchase 2'616 2.5% 98'796 93.7%
## 6 medical 1'648 1.6% 100'444 95.3%
## 7 car 1'334 1.3% 101'778 96.5%
## 8 vacation 1'170 1.1% 102'948 97.6%
## 9 small_business 1'034 1.0% 103'982 98.6%
## 10 moving 945 0.9% 104'927 99.5%
## 11 house 453 0.4% 105'380 99.9%
## 12 renewable_energy 70 0.1% 105'450 100.0%
## ... etc.
## [list output truncated]
Zip code and Address State informs us of the Borrower’s place of residence. 13% of the Borrowers live in California. Lending Club’s reach is impressive; all but one of the states had Borrowers who used the platform.
Desc(dataLC2$addr_state, main = "Address State", plotit = TRUE)
## -------------------------------------------------------------------------
## Address State
##
## length n NAs unique levels dupes
## 105'451 105'451 0 49 49 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 CA 13'751 13.0% 13'751 13.0%
## 2 TX 8'802 8.3% 22'553 21.4%
## 3 NY 8'743 8.3% 31'296 29.7%
## 4 FL 7'772 7.4% 39'068 37.0%
## 5 IL 4'378 4.2% 43'446 41.2%
## 6 NJ 3'901 3.7% 47'347 44.9%
## 7 PA 3'735 3.5% 51'082 48.4%
## 8 OH 3'604 3.4% 54'686 51.9%
## 9 GA 3'383 3.2% 58'069 55.1%
## 10 VA 2'982 2.8% 61'051 57.9%
## 11 NC 2'924 2.8% 63'975 60.7%
## 12 MI 2'789 2.6% 66'764 63.3%
## ... etc.
## [list output truncated]
Let’s now take a look at some financial ratios. We begin with the Borrowers’ debt-to-income (DTI) ratio. DTI is calculated using the Borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested Lending Club loan, divided by the borrower’s self-reported monthly income. This will be an important feature affecting a Borrower’s creditworthiness.
Desc(dataLC2$dti, main = "Debt-to-Income Ratio", plotit = TRUE)
## -------------------------------------------------------------------------
## Debt-to-Income Ratio
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 5'047 35 26.43 24.73
## 100.0% 0.0% 0.0% 28.13
##
## .05 .10 .25 median .75 .90 .95
## 5.32 7.68 12.24 18.13 24.60 30.05 32.88
##
## range sd vcoef mad IQR skew kurt
## 29'550.00 282.16 10.68 9.12 12.36 42.35 2'244.95
##
## lowest : 0.0 (35), 0.02, 0.03, 0.06 (2), 0.07
## highest: 1'680.0, 2'933.74, 5'395.0, 9'999.0 (75), 29'550.0
The distribution is extremely skewed. Some Borrowers’ reported no debt, while one reported a DTI ratio of 29,550%. This is not surprising given that some of the Borrowers reported little or no income.
Delinquency shows the number of 30+ days past-due incidences of delinquency in the Borrower’s credit file for the past 2 years. Most Borrowers (79%) have not had a delinquency, but one of the Borrowers has had 42 delinquencies over the last w years!
Desc(dataLC2$delinq_2yrs, main = "Delinquency (Last 2 Years)", plotit = TRUE)
## -------------------------------------------------------------------------
## Delinquency (Last 2 Years)
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 24 83'373 0.36 0.35
## 100.0% 0.0% 79.1% 0.36
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 0.00 1.00 2.00
##
## range sd vcoef mad IQR skew kurt
## 42.00 0.96 2.70 0.00 0.00 6.07 82.85
##
## lowest : 0 (83'373), 1 (14'393), 2 (4'363), 3 (1'659), 4 (690)
## highest: 19, 20 (3), 21, 25, 42
Earliest Credit Line shows the month the borrower’s earliest reported credit line was opened. The earliest was back in February 1934! The latest was dated March 2014. Lines were generally opened fairly evenly across time. 2003-2206 seems to be a period when many Borrowers opened their credit lines.
Desc(dataLC2$earliest_cr_line, main = "Earliest Credit Line", plotit = TRUE)
## -------------------------------------------------------------------------
## Earliest Credit Line
##
## length n NAs unique levels dupes
## 105'451 105'451 0 627 627 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 Sep-04 892 0.8% 892 0.8%
## 2 Aug-06 836 0.8% 1'728 1.6%
## 3 Sep-05 819 0.8% 2'547 2.4%
## 4 Aug-04 810 0.8% 3'357 3.2%
## 5 Aug-05 787 0.7% 4'144 3.9%
## 6 Oct-03 755 0.7% 4'899 4.6%
## 7 Sep-03 751 0.7% 5'650 5.4%
## 8 Aug-03 748 0.7% 6'398 6.1%
## 9 Oct-05 715 0.7% 7'113 6.7%
## 10 Sep-06 704 0.7% 7'817 7.4%
## 11 Oct-04 697 0.7% 8'514 8.1%
## 12 Nov-04 681 0.6% 9'195 8.7%
## ... etc.
## [list output truncated]
FICO low and FICO high show the lower and upper boundary range the borrower’s FICO at loan origination belongs to. Each FICO bucket is 4 points so only one of the variable is useful. Let us create a new feature showing the average FICO score of the Borrower at loan origination instead.
dataLC2 <- mutate(dataLC2, fico_ave = (fico_range_low + fico_range_high)/2)
Desc(dataLC2$fico_ave, main = "Average FICO Score", plotit = TRUE)
## -------------------------------------------------------------------------
## Average FICO Score
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 38 0 699.98 699.78
## 100.0% 0.0% 0.0% 700.18
##
## .05 .10 .25 median .75 .90 .95
## 662.00 667.00 672.00 692.00 717.00 747.00 767.00
##
## range sd vcoef mad IQR skew kurt
## 185.50 32.95 0.05 29.65 45.00 1.20 1.33
##
## lowest : 662.0 (9'367), 667.0 (8'703), 672.0 (8'740), 677.0 (7'739), 682.0 (7'733)
## highest: 827.0 (101), 832.0 (67), 837.0 (36), 842.0 (28), 847.5 (16)
FICO scores of the Borrowers range from 662 to 847.5 (note: maximum is 850, minimum is 300). The median score is 692. The distribution is positively skewed, with 13 outliers beyond the upper IQR.
Inquiries (last 6 months) shows the number of inquiries in the past 6 months (excluding auto and mortgage inquiries). Most Borrowers have not made an inquiry (63.9%) but there is a small fraction who have made up to 5 inquires (0.2%)
Desc(dataLC2$inq_last_6mths, main = "Inquires (Last 6 months)", plotit = TRUE)
## -------------------------------------------------------------------------
## Inquires (Last 6 months)
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 6 67'430 0.51 0.50
## 100.0% 0.0% 63.9% 0.51
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 1.00 2.00 2.00
##
## range sd vcoef mad IQR skew kurt
## 5.00 0.81 1.59 0.00 1.00 1.89 4.23
##
##
## level freq perc cumfreq cumperc
## 1 0 67'430 63.9% 67'430 63.9%
## 2 1 26'572 25.2% 94'002 89.1%
## 3 2 8'336 7.9% 102'338 97.0%
## 4 3 2'197 2.1% 104'535 99.1%
## 5 4 687 0.7% 105'222 99.8%
## 6 5 229 0.2% 105'451 100.0%
Open accounts is the number of open credit lines in the Borrower’s credit file. The range is wide, wih Borrowers having no open credit lines to some having as many as 88 credit lines opened. The median Borrower has 11 open credit lines.
Desc(dataLC2$open_acc, main = "Open Accounts", plotit = TRUE)
## -------------------------------------------------------------------------
## Open Accounts
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 65 1 11.85 11.81
## 100.0% 0.0% 0.0% 11.88
##
## .05 .10 .25 median .75 .90 .95
## 5.00 6.00 8.00 11.00 15.00 19.00 23.00
##
## range sd vcoef mad IQR skew kurt
## 88.00 5.78 0.49 4.45 7.00 1.34 3.64
##
## lowest : 0, 1 (41), 2 (347), 3 (1'381), 4 (2'985)
## highest: 65, 68, 80, 86, 88
Public records shows the number of derogatory public records. Most Borrowers have none (80.7%), but there are others who have as many as 22 such records.
Desc(dataLC2$pub_rec, main = "Public Records", plotit = TRUE)
## -------------------------------------------------------------------------
## Public Records
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 16 85'128 0.26 0.25
## 100.0% 0.0% 80.7% 0.26
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 0.00 1.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 22.00 0.66 2.56 0.00 0.00 4.99 50.59
##
## lowest : 0 (85'128), 1 (16'342), 2 (2'531), 3 (827), 4 (305)
## highest: 11 (4), 12, 14 (2), 15, 22
Let’s take a look at the Borrowers’ total credit revolving balance. The distribution is highly skewed; some Borrowers have no revolving credit, some have over $1 million. The median Borrower’s revolving credit balance is $11,170.
Desc(dataLC2$revol_bal, main = "Revolving Credit Balance", plotit = TRUE)
## -------------------------------------------------------------------------
## Revolving Credit Balance
##
## length n NAs unique 0s mean
## 105'451 105'451 0 37'373 437 16'596.03
## 100.0% 0.0% 0.4%
##
## .05 .10 .25 median .75 .90
## 1'735.00 2'978.00 5'887.00 11'170.00 19'933.50 33'362.00
##
## range sd vcoef mad IQR skew
## 1'039'013.00 22'965.94 1.38 9'298.87 14'046.50 9.35
##
## meanCI
## 16'457.42
## 16'734.65
##
## .95
## 45'508.00
##
## kurt
## 179.93
##
## lowest : 0 (437), 1 (5), 2, 3 (5), 4 (5)
## highest: 693'478, 725'708, 742'434, 798'264, 1'039'013
What is the revolving utilization? We first remove the % and convert it into numeric. The distributon is fairly normal, albeit with a right tail. The utlization rate was between 0% and 138.9%, with a median utilization of 48.5%.
dataLC2$revol_util = as.numeric(gsub("\\%", "", dataLC2$revol_util))
Desc(dataLC2$revol_util, main = "Revolving Credit Utilization", plotit = TRUE)
## -------------------------------------------------------------------------
## Revolving Credit Utilization
##
## length n NAs unique 0s mean meanCI
## 105'451 105'376 75 1'076 468 49.02 48.87
## 99.9% 0.1% 0.4% 49.16
##
## .05 .10 .25 median .75 .90 .95
## 9.80 16.50 30.40 48.50 67.40 82.50 89.90
##
## range sd vcoef mad IQR skew kurt
## 138.90 24.26 0.49 27.43 37.00 0.06 -0.81
##
## lowest : 0.0 (468), 0.1 (81), 0.2 (62), 0.3 (61), 0.4 (37)
## highest: 123.4, 125.0, 125.2, 127.0, 138.9
** Total account** is the total number of credit lines currently in the borrower’s credit file. The distribution is positively skewed, with the number of lines ranging from 2 to 146. The median Borrower has 22 credit lines on file. Recall that the median Borrower has 11 open credit lines.
Desc(dataLC2$total_acc, main = "Total Account", plotit = TRUE)
## -------------------------------------------------------------------------
## Total Account
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 108 0 23.76 23.69
## 100.0% 0.0% 0.0% 23.83
##
## .05 .10 .25 median .75 .90 .95
## 8.00 10.00 15.00 22.00 30.00 39.00 46.00
##
## range sd vcoef mad IQR skew kurt
## 144.00 11.83 0.50 10.38 15.00 1.07 2.10
##
## lowest : 2 (48), 3 (188), 4 (423), 5 (700), 6 (1'122)
## highest: 112 (2), 115, 121, 122 (2), 146
Initial list status indicates the initial listing status of the loan. Possible values are W, F. W stands for whole loans, that is, available to investors to be purchased in their entirety (Borrowers benefit from getting ‘instant funding’). Lending Club provides a randomized subset of loans by grade available to purchase as a whole loan for a brief period of time (12 hours). The rest are available for fractional purchase. 75.1% of the loans are whole loans.
Desc(dataLC2$initial_list_status, main = "Initial Listing Status", plotit = TRUE)
## -------------------------------------------------------------------------
## Initial Listing Status
##
## length n NAs unique
## 105'451 105'451 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## w 79'488 75.4% 75.1% 75.6%
## f 25'963 24.6% 24.4% 24.9%
##
## ' 95%-CI Wilson
** Outstanding principal** is the remaining outstanding principal for total amount funded. This should be highly correlated to Loan amount since the loans have only been issued. Indeed, the correlation is 96%. This variable will be more useful when we have a longer data series.
Desc(dataLC2$out_prncp, main = "Outstanding Principal", plotit = TRUE)
## -------------------------------------------------------------------------
## Outstanding Principal
##
## length n NAs unique 0s mean
## 105'451 105'451 0 28'979 2'601 13'761.025
## 100.0% 0.0% 2.5%
##
## .05 .10 .25 median .75 .90
## 1'959.915 3'507.540 6'474.650 11'599.130 19'463.160 28'800.000
##
## range sd vcoef mad IQR skew
## 40'000.000 9'377.066 0.681 8'868.424 12'988.510 0.803
##
## meanCI
## 13'704.428
## 13'817.622
##
## .95
## 33'461.910
##
## kurt
## -0.123
##
## lowest : 0.0 (2'601), 0.04, 0.87, 1.1, 1.47
## highest: 39'800.0, 39'825.0, 39'904.600, 39'990.0, 40'000.0 (166)
cor(dataLC2$out_prncp, dataLC2$loan_amnt)
## [1] 0.9634905
Total payments is the payments received to date for the total amount funded. Unsurprisingly, there are quite a lot of zeros (8,718) since the loans have only been issued. Again, this variable will be more useful when we have a longer data series. The median payment to date was $591.40. The distribution is skewed to the right.
Desc(dataLC2$total_pymnt, main = "Total Payment to Date", plotit = TRUE)
## -------------------------------------------------------------------------
## Total Payment to Date
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 39'800 8'718 1'097.406 1'081.719
## 100.0% 0.0% 8.3% 1'113.092
##
## .05 .10 .25 median .75 .90 .95
## 0.000 92.640 294.475 591.400 1'086.215 1'910.800 2'605.880
##
## range sd vcoef mad IQR skew kurt
## 41'722.293 2'598.943 2.368 532.075 791.740 8.490 88.209
##
## lowest : 0.0 (8'718), 0.11, 0.67, 0.7, 1.31
## highest: 40'978.344, 41'002.436, 41'146.883, 41'380.811, 41'722.293
The next two features, ** Total principal received** and Total interest received are linearly decomposed from Total payments so they are not hence particulary. We can go ahead to drop them later.
Total received late fees shows that the late fees received to date. It is also not particularly interesting - 99.8% are zeros - so we can also drop it later.
Desc(dataLC2$total_rec_late_fee, main = "Total Late Fee Received to Date", plotit = TRUE)
## -------------------------------------------------------------------------
## Total Late Fee Received to Date
##
## length n NAs unique 0s mean meanCI
## 105'451 105'362 89 7 105'272 0.01 0.01
## 99.9% 0.1% 99.8% 0.02
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
## range sd vcoef mad IQR skew kurt
## 47.00 0.47 35.20 0.00 0.00 40.09 2'036.09
##
##
## level freq perc cumfreq cumperc
## 1 0 105'272 99.9% 105'272 99.9%
## 2 9 1 0.0% 105'273 99.9%
## 3 15 85 0.1% 105'358 100.0%
## 4 17 1 0.0% 105'359 100.0%
## 5 19 1 0.0% 105'360 100.0%
## 6 30 1 0.0% 105'361 100.0%
## 7 47 1 0.0% 105'362 100.0%
Last payment date shows the last month payment was received. Most (93.2%) were received in July 2017. This feature is not particulary interesting as well. Similarly, Next payment date is not interesting.
Desc(dataLC2$last_pymnt_d, main = "Last Payment Date", plotit = TRUE)
## -------------------------------------------------------------------------
## Last Payment Date
##
## length n NAs unique levels dupes
## 105'451 96'733 8'718 4 4 y
## 91.7% 8.3%
##
## level freq perc cumfreq cumperc
## 1 Jul-17 90'137 93.2% 90'137 93.2%
## 2 Jun-17 5'684 5.9% 95'821 99.1%
## 3 May-17 765 0.8% 96'586 99.8%
## 4 Apr-17 147 0.2% 96'733 100.0%
Last payment amount is the last total payment amount received. This is highly correlated to total payment (94.7%).
Desc(dataLC2$last_pymnt_amnt, main = "Last Payment Amount", plotit = TRUE)
## -------------------------------------------------------------------------
## Last Payment Amount
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 22'729 8'718 717.551 702.778
## 100.0% 0.0% 8.3% 732.324
##
## .05 .10 .25 median .75 .90 .95
## 0.000 65.000 196.940 345.150 588.905 895.730 1'154.000
##
## range sd vcoef mad IQR skew kurt
## 40'937.830 2'447.647 3.411 267.654 391.965 9.661 108.281
##
## lowest : 0.0 (8'718), 0.11, 0.67, 0.7, 1.75 (2)
## highest: 40'604.210, 40'625.110, 40'659.610, 40'758.250, 40'937.830
Last credit pulled is the most recent month Lending Club pulled credit for this loan. Most (95.5%) loans were refreshed for July 2017. This feature does not seem particulary interesting.
Desc(dataLC2$last_credit_pull_d, main = "Last Credit Pulled", plotit = TRUE)
## -------------------------------------------------------------------------
## Last Credit Pulled
##
## length n NAs unique levels dupes
## 105'451 105'448 3 5 5 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 Jul-17 100'680 95.5% 100'680 95.5%
## 2 May-17 1'644 1.6% 102'324 97.0%
## 3 Jun-17 1'600 1.5% 103'924 98.6%
## 4 Apr-17 1'207 1.1% 105'131 99.7%
## 5 Mar-17 317 0.3% 105'448 100.0%
Last FICO low and Last FICO high show the lower and upper boundary range the borrower’s last FICO pulled. Again, each FICO bucket is 4 points so we create a new feature showing the average FICO score pulled instead.
dataLC2 <- mutate(dataLC2, fico_ave_last = (last_fico_range_low + last_fico_range_high)/2)
Desc(dataLC2$fico_ave_last, main = "Average Last FICO Score Pulled", plotit = TRUE)
## -------------------------------------------------------------------------
## Average Last FICO Score Pulled
##
## length n NAs unique 0s mean meanCI
## 105'451 105'451 0 71 3 706.74 706.50
## 100.0% 0.0% 0.0% 706.98
##
## .05 .10 .25 median .75 .90 .95
## 652.00 662.00 677.00 702.00 732.00 762.00 782.00
##
## range sd vcoef mad IQR skew kurt
## 847.50 39.51 0.06 37.06 55.00 0.32 3.36
##
## lowest : 0.0 (3), 249.5, 502.0, 507.0, 512.0
## highest: 827.0 (141), 832.0 (91), 837.0 (36), 842.0 (19), 847.5 (5)
FICO scores last pulled range from 652 to 847.5. The median score is 702. There has been an increase in FICO scores since the loans were orginated. One would expect the change in FICO scores to be an important predictor of loan default.
4. Unsupervised Learning
Now that we have a much better understanding of the dataset, let us go ahead to build some unsupervised learning models, employing clustering techniques to uncover latent structures with the data to further enhance our understanding of the data. The segmentation of customer groups will allow us to better train the models in the next stage for each specific segment.
k-means Clustering
We start with k-means clustering. Let us first create a dataframe consisting of all numeric variables (including integers). We exclude fico_range_low and fico_range_high since we had created a new feature for the average FICO score. We also drop total_rec_prncp, total_rec_int,total_rec_late_fee for the reasons mentioned above.
numeric_cols <- sapply(dataLC2,is.numeric)
dataLC_numeric <- dataLC2[, numeric_cols]
dataLC_numeric <- subset(dataLC_numeric, select=-c(fico_range_low,
fico_range_low,
last_fico_range_low,
last_fico_range_high,
total_rec_prncp,
total_rec_int,
total_rec_late_fee))
str(dataLC_numeric)
## Classes 'tbl_df', 'tbl' and 'data.frame': 105451 obs. of 73 variables:
## $ id : int 112432991 111872988 112439006 111735948 112156817 112390231 112441324 112210041 112119128 112435993 ...
## $ loan_amnt : int 3000 10000 20400 6000 10000 20000 3600 6000 25000 2300 ...
## $ int_rate : num 7.97 9.44 9.44 7.35 7.21 ...
## $ installment : num 94 320 653 186 310 ...
## $ annual_inc : num 34000 52000 44000 74000 145000 ...
## $ dti : num 12.71 30.91 27.06 19.02 6.58 ...
## $ delinq_2yrs : int 0 0 0 0 0 2 0 0 0 0 ...
## $ fico_range_high : int 709 779 764 739 719 704 804 684 709 704 ...
## $ inq_last_6mths : int 1 0 1 0 2 0 0 0 0 1 ...
## $ open_acc : int 10 16 15 15 12 8 12 10 13 4 ...
## $ pub_rec : int 0 0 0 0 0 0 0 1 0 0 ...
## $ revol_bal : int 3443 9983 8769 5433 18633 20589 12060 9332 26947 3911 ...
## $ revol_util : num 31 24.3 18.7 14.8 32.7 81.7 11.7 59.1 73.2 55.1 ...
## $ total_acc : int 17 24 19 35 59 14 30 28 18 6 ...
## $ out_prncp : num 3000 10000 20400 6000 10000 ...
## $ total_pymnt : num 0 0 0 0 0 ...
## $ last_pymnt_amnt : num 0 0 0 0 0 ...
## $ collections_12_mths_ex_med: int 0 0 0 0 0 0 0 0 0 0 ...
## $ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tot_coll_amt : int 0 0 0 0 0 0 0 275 0 0 ...
## $ tot_cur_bal : int 45876 148768 62335 297932 278599 37895 115079 107517 26947 3911 ...
## $ open_acc_6m : int 1 3 1 1 4 1 0 0 0 0 ...
## $ open_il_6m : int 3 5 10 7 1 2 1 1 0 0 ...
## $ open_il_12m : int 0 2 1 2 0 0 0 0 0 0 ...
## $ open_il_24m : int 0 3 1 3 0 0 1 1 0 0 ...
## $ mths_since_rcnt_il : int 34 3 6 6 27 27 16 19 38 NA ...
## $ total_bal_il : int 42433 30221 53566 57864 5537 17306 2875 19983 0 0 ...
## $ il_util : int 94 67 95 88 33 41 58 NA NA NA ...
## $ open_rv_12m : int 2 3 1 1 6 1 0 1 1 1 ...
## $ open_rv_24m : int 4 6 2 5 8 3 2 1 3 2 ...
## $ max_bal_bc : int 1955 7015 4240 3201 5428 19372 3486 3990 12643 2315 ...
## $ all_util : int 81 47 60 62 33 56 14 59 73 55 ...
## $ total_rev_hi_lim : int 11100 41400 46900 36800 56900 25200 102900 15800 36800 7100 ...
## $ inq_fi : int 1 0 1 4 0 0 0 0 0 1 ...
## $ total_cu_tl : int 1 2 1 2 0 0 1 0 0 0 ...
## $ inq_last_12m : int 1 0 1 4 4 0 0 0 0 2 ...
## $ acc_open_past_24mths : int 4 10 3 9 9 3 3 2 3 2 ...
## $ avg_cur_bal : int 4588 9298 4156 19862 23217 5414 9590 11946 2073 978 ...
## $ bc_open_to_buy : int 5057 17348 35631 1927 29857 2403 81840 1845 6505 440 ...
## $ bc_util : num 40.5 30.6 19.8 73.6 35.2 89.5 11.9 82.4 73.8 87.1 ...
## $ chargeoff_within_12_mths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_amnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mo_sin_old_il_acct : int 93 53 125 81 139 160 113 86 175 NA ...
## $ mo_sin_old_rev_tl_op : int 119 288 125 160 286 298 296 303 234 381 ...
## $ mo_sin_rcnt_rev_tl_op : int 6 3 11 9 0 6 18 9 10 8 ...
## $ mo_sin_rcnt_tl : int 6 3 6 6 0 6 16 9 10 8 ...
## $ mort_acc : int 0 2 0 1 2 0 7 4 0 0 ...
## $ mths_since_recent_bc : int 6 3 11 15 5 6 23 39 21 15 ...
## $ mths_since_recent_inq : int 6 18 6 6 0 NA 18 NA 21 0 ...
## $ num_accts_ever_120_pd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_actv_bc_tl : int 2 3 3 2 5 2 5 3 6 2 ...
## $ num_actv_rev_tl : int 2 5 3 3 8 3 6 4 10 3 ...
## $ num_bc_sats : int 3 5 3 2 5 3 7 3 7 2 ...
## $ num_bc_tl : int 8 6 5 12 20 5 12 8 8 2 ...
## $ num_il_tl : int 4 6 12 11 4 3 5 6 2 0 ...
## $ num_op_rev_tl : int 7 10 5 7 10 6 9 8 13 4 ...
## $ num_rev_accts : int 13 16 7 23 53 11 17 18 16 6 ...
## $ num_rev_tl_bal_gt_0 : int 2 5 3 3 8 3 6 4 10 3 ...
## $ num_sats : int 10 16 15 15 12 8 12 9 13 4 ...
## $ num_tl_120dpd_2m : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_tl_30dpd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_tl_90g_dpd_24m : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_tl_op_past_12m : int 2 5 2 3 7 1 0 1 1 1 ...
## $ pct_tl_nvr_dlq : num 100 100 100 100 98.3 84.6 100 95 100 100 ...
## $ percent_bc_gt_75 : num 33.3 20 0 0 40 33.3 0 66.7 57.1 100 ...
## $ pub_rec_bankruptcies : int 0 0 0 0 0 0 0 1 0 0 ...
## $ tax_liens : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tot_hi_cred_lim : int 56396 198528 103358 342624 330503 67717 223900 150967 36800 7100 ...
## $ total_bal_ex_mort : int 45876 40204 62335 63297 24170 37895 14935 29315 26947 3911 ...
## $ total_bc_limit : int 8500 25000 44400 7300 46100 22900 92900 10500 24800 3400 ...
## $ total_il_high_credit_limit: int 45296 44928 56458 65959 16613 42517 5000 39869 0 0 ...
## $ fico_ave : num 707 777 762 737 717 702 802 682 707 702 ...
## $ fico_ave_last : num 747 752 762 732 712 687 797 682 697 757 ...
There are several ways to deal with the missing unknown values. We could simply remove the cases with unknowns but that could introduce some bias. Let’s fill the columns with missing values using their mean instead.
dataLC_numeric$revol_util<-ifelse(is.na(dataLC_numeric$revol_util),mean(dataLC_numeric$revol_util,na.rm =TRUE),dataLC_numeric$revol_util)
dataLC_numeric$mths_since_rcnt_il<-ifelse(is.na(dataLC_numeric$mths_since_rcnt_il),mean(dataLC_numeric$mths_since_rcnt_il,na.rm =TRUE),dataLC_numeric$mths_since_rcnt_il)
dataLC_numeric$il_util<-ifelse(is.na(dataLC_numeric$il_util),mean(dataLC_numeric$il_util,na.rm =TRUE),dataLC_numeric$il_util)
dataLC_numeric$all_util<-ifelse(is.na(dataLC_numeric$all_util),mean(dataLC_numeric$all_util,na.rm =TRUE),dataLC_numeric$all_util)
dataLC_numeric$avg_cur_bal<-ifelse(is.na(dataLC_numeric$avg_cur_bal),mean(dataLC_numeric$avg_cur_bal,na.rm =TRUE),dataLC_numeric$avg_cur_bal)
dataLC_numeric$bc_open_to_buy<-ifelse(is.na(dataLC_numeric$bc_open_to_buy),mean(dataLC_numeric$bc_open_to_buy,na.rm =TRUE),dataLC_numeric$bc_open_to_buy)
dataLC_numeric$bc_util<-ifelse(is.na(dataLC_numeric$bc_util),mean(dataLC_numeric$bc_util,na.rm =TRUE),dataLC_numeric$bc_util)
dataLC_numeric$mo_sin_old_il_acct<-ifelse(is.na(dataLC_numeric$mo_sin_old_il_acct),mean(dataLC_numeric$mo_sin_old_il_acct,na.rm =TRUE),dataLC_numeric$mo_sin_old_il_acct)
dataLC_numeric$mths_since_recent_bc<-ifelse(is.na(dataLC_numeric$mths_since_recent_bc),mean(dataLC_numeric$mths_since_recent_bc,na.rm =TRUE),dataLC_numeric$mths_since_recent_bc)
dataLC_numeric$mths_since_recent_inq<-ifelse(is.na(dataLC_numeric$mths_since_recent_inq),mean(dataLC_numeric$mths_since_recent_inq,na.rm =TRUE),dataLC_numeric$mths_since_recent_inq)
dataLC_numeric$num_tl_120dpd_2m<-ifelse(is.na(dataLC_numeric$num_tl_120dpd_2m),mean(dataLC_numeric$num_tl_120dpd_2m,na.rm =TRUE),dataLC_numeric$num_tl_120dpd_2m)
dataLC_numeric$percent_bc_gt_75<-ifelse(is.na(dataLC_numeric$percent_bc_gt_75),mean(dataLC_numeric$percent_bc_gt_75,na.rm =TRUE),dataLC_numeric$percent_bc_gt_75)
colSums(is.na(dataLC_numeric))
## id loan_amnt
## 0 0
## int_rate installment
## 0 0
## annual_inc dti
## 0 0
## delinq_2yrs fico_range_high
## 0 0
## inq_last_6mths open_acc
## 0 0
## pub_rec revol_bal
## 0 0
## revol_util total_acc
## 0 0
## out_prncp total_pymnt
## 0 0
## last_pymnt_amnt collections_12_mths_ex_med
## 0 0
## acc_now_delinq tot_coll_amt
## 0 0
## tot_cur_bal open_acc_6m
## 0 0
## open_il_6m open_il_12m
## 0 0
## open_il_24m mths_since_rcnt_il
## 0 0
## total_bal_il il_util
## 0 0
## open_rv_12m open_rv_24m
## 0 0
## max_bal_bc all_util
## 0 0
## total_rev_hi_lim inq_fi
## 0 0
## total_cu_tl inq_last_12m
## 0 0
## acc_open_past_24mths avg_cur_bal
## 0 0
## bc_open_to_buy bc_util
## 0 0
## chargeoff_within_12_mths delinq_amnt
## 0 0
## mo_sin_old_il_acct mo_sin_old_rev_tl_op
## 0 0
## mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl
## 0 0
## mort_acc mths_since_recent_bc
## 0 0
## mths_since_recent_inq num_accts_ever_120_pd
## 0 0
## num_actv_bc_tl num_actv_rev_tl
## 0 0
## num_bc_sats num_bc_tl
## 0 0
## num_il_tl num_op_rev_tl
## 0 0
## num_rev_accts num_rev_tl_bal_gt_0
## 0 0
## num_sats num_tl_120dpd_2m
## 0 0
## num_tl_30dpd num_tl_90g_dpd_24m
## 0 0
## num_tl_op_past_12m pct_tl_nvr_dlq
## 0 0
## percent_bc_gt_75 pub_rec_bankruptcies
## 0 0
## tax_liens tot_hi_cred_lim
## 0 0
## total_bal_ex_mort total_bc_limit
## 0 0
## total_il_high_credit_limit fico_ave
## 0 0
## fico_ave_last
## 0
cor(dataLC_numeric)%>%
corrplot(order = "hclust", addrect = 4, method="square", tl.cex=1)
Principal Component Analysis
out<-PCA(dataLC_numeric,scale.unit = TRUE, graph=TRUE)
set.seed(1)
autoplot(kmeans(dataLC_numeric, 3), data = dataLC_numeric, frame = TRUE, frame.type = "norm") + theme_minimal()
5. Supervised Learning
As noted earlier, there are several statuses which indicate that loans are not performing well, namely loans that have been charged off and late loans. We group these non-performing loans together and assign them a value of 1. The other performing loans are assigned a value of 1.
# Non-performing Loans
npl <- c("Charged Off", "Late (16-30 days)", "Late (31-120 days)")
dataLC2$npl <- ifelse(dataLC2$loan_status %in% npl, 1,
ifelse(dataLC2$loan_status=="", NA, 0))