Data Mining with R - Capstone Assignment 7

1. Introduction - Project Goals
2. Defining the Business Problem and Approach
3. Data Understanding and Pre-Processing
4. Unsupervised Learning
5. Supervised Learning

1. Introduction - Project Goals

Our capstone proposal is on Peer-to-Peer (P2P) lending. Also known as crowd-lending, P2P is the practice of unrelated individuals or companies lending money through online services that match lenders with borrowers, bypassing the need for traditional financial institutions. Today, P2P platforms are among the fastest growing segment in the financial services space. Morgan Stanley in a report in 2015 predicted that such marketplace lending would command USD150 billion to USD490 billion globally by 2020. Some of the well-known marketplaces in the U.S. and Europe are Lending Club Corporation, Prosper Marketplace and Funding Circle.

While the P2P lending marketspace has grown exponentially, the space for recommendation engines capable of recommending the best choices for both the investors and the borrowers is very fragmented. While most lending platforms have their own risk categorization tools, metrics and recommendations, the market for external third-party tools that can evaluate available choices across multiple platforms and suggest the right investment and borrowing choices, is still in its infancy. Our project, P2P Advisor, aims to build a recommendation tool that can uniquely fill this gap in today’s P2P market, enabling both the investors and the borrowers, to have a single platform, to evaluate multiple options and to make the right choices depending on their respective investment risk preference, time horizon and credit levels.

2. Defining the Business Problem and Approach

We believe an agile development approach is most suited for our project. For our first sprint, we have prioritized the building of a propensity model to determine the probability of default of the loans. Following the CRISP data mining framework, we have divided the tasks into several broad areas as follows:

Data Understanding
Data Pre-processing
Unsupervised Learning
Supervised Learning
Model Performance Evaluation and Selection
Visualization

For the purpose of this assignment, we will work with a subset of the data to build a few prototype models using techniques such as PCA, k-means, SVM and random forest. The goal is to demonstrate the data mining techniques and knowledge we have acquired from the Data Mining with R course with Professor Luis Torgo. This assignment also marks the first sprint of our capstone project.

3. Data Understanding and Pre-Processing

As a first step, we will need to gain a thorough understanding of the data. The data comes from Lending Club, one of which is a dataset on loan statistics and the other on rejection statistics. Let us go ahead to import and take a first look at the loan statistics data. The dataset covers the period from April to June 2017.

dataLC <- read_csv("dataLC_2017Q2_clean.csv")

dim(dataLC)

## [1] 105451    116

datatable(dataLC[1:100,]) # Show first 100 listings

There are 105,451 cases and 116 features. Thankfully, Lending Club provides a data dictionary to help us make sense of the features.

dataDictionary <- read_excel("LCDataDictionary.xlsx")
datatable(dataDictionary)

We take a look at the summary statistics too.

summary(dataLC)

##        id              loan_amnt         term             int_rate        
##  Min.   :   687119   Min.   : 1000   Length:105451      Length:105451     
##  1st Qu.:106356644   1st Qu.: 7000   Class :character   Class :character  
##  Median :108441242   Median :12000   Mode  :character   Mode  :character  
##  Mean   :108109021   Mean   :14589                                        
##  3rd Qu.:110343424   3rd Qu.:20000                                        
##  Max.   :112441324   Max.   :40000                                        
##                                                                           
##   installment         grade            sub_grade        
##  Min.   :  30.12   Length:105451      Length:105451     
##  1st Qu.: 232.64   Class :character   Class :character  
##  Median : 361.38   Mode  :character   Mode  :character  
##  Mean   : 436.80                                        
##  3rd Qu.: 584.37                                        
##  Max.   :1719.83                                        
##                                                         
##   emp_title          emp_length        home_ownership    
##  Length:105451      Length:105451      Length:105451     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    annual_inc      verification_status   issue_d         
##  Min.   :      0   Length:105451       Length:105451     
##  1st Qu.:  48000   Class :character    Class :character  
##  Median :  68000   Mode  :character    Mode  :character  
##  Mean   :  80452                                         
##  3rd Qu.:  97000                                         
##  Max.   :8900000                                         
##                                                          
##  loan_status          purpose            zip_code        
##  Length:105451      Length:105451      Length:105451     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   addr_state             dti            delinq_2yrs     earliest_cr_line  
##  Length:105451      Min.   :    0.00   Min.   : 0.000   Length:105451     
##  Class :character   1st Qu.:   12.24   1st Qu.: 0.000   Class :character  
##  Mode  :character   Median :   18.13   Median : 0.000   Mode  :character  
##                     Mean   :   26.43   Mean   : 0.356                     
##                     3rd Qu.:   24.60   3rd Qu.: 0.000                     
##                     Max.   :29550.00   Max.   :42.000                     
##                                                                           
##  fico_range_low fico_range_high inq_last_6mths   mths_since_last_delinq
##  Min.   :660    Min.   :664     Min.   :0.0000   Min.   :  0.00        
##  1st Qu.:670    1st Qu.:674     1st Qu.:0.0000   1st Qu.: 15.00        
##  Median :690    Median :694     Median :0.0000   Median : 30.00        
##  Mean   :698    Mean   :702     Mean   :0.5095   Mean   : 33.29        
##  3rd Qu.:715    3rd Qu.:719     3rd Qu.:1.0000   3rd Qu.: 49.00        
##  Max.   :845    Max.   :850     Max.   :5.0000   Max.   :161.00        
##                                                  NA's   :51119         
##  mths_since_last_record    open_acc        pub_rec       
##  Min.   :  0.00         Min.   : 0.00   Min.   : 0.0000  
##  1st Qu.: 53.00         1st Qu.: 8.00   1st Qu.: 0.0000  
##  Median : 73.00         Median :11.00   Median : 0.0000  
##  Mean   : 69.49         Mean   :11.85   Mean   : 0.2571  
##  3rd Qu.: 88.00         3rd Qu.:15.00   3rd Qu.: 0.0000  
##  Max.   :119.00         Max.   :88.00   Max.   :22.0000  
##  NA's   :85128                                           
##    revol_bal        revol_util          total_acc      initial_list_status
##  Min.   :      0   Length:105451      Min.   :  2.00   Length:105451      
##  1st Qu.:   5887   Class :character   1st Qu.: 15.00   Class :character   
##  Median :  11170   Mode  :character   Median : 22.00   Mode  :character   
##  Mean   :  16596                      Mean   : 23.76                      
##  3rd Qu.:  19934                      3rd Qu.: 30.00                      
##  Max.   :1039013                      Max.   :146.00                      
##                                                                           
##    out_prncp      total_pymnt      total_rec_prncp   total_rec_int    
##  Min.   :    0   Min.   :    0.0   Min.   :    0.0   Min.   :   0.00  
##  1st Qu.: 6475   1st Qu.:  294.5   1st Qu.:  185.5   1st Qu.:  72.74  
##  Median :11599   Median :  591.4   Median :  366.9   Median : 172.40  
##  Mean   :13761   Mean   : 1097.4   Mean   :  827.6   Mean   : 269.74  
##  3rd Qu.:19463   3rd Qu.: 1086.2   3rd Qu.:  685.1   3rd Qu.: 363.41  
##  Max.   :40000   Max.   :41722.3   Max.   :40000.0   Max.   :3120.01  
##                                                                       
##  total_rec_late_fee last_pymnt_d       last_pymnt_amnt  
##  Min.   : 0.00000   Length:105451      Min.   :    0.0  
##  1st Qu.: 0.00000   Class :character   1st Qu.:  196.9  
##  Median : 0.00000   Mode  :character   Median :  345.1  
##  Mean   : 0.01326                      Mean   :  717.5  
##  3rd Qu.: 0.00000                      3rd Qu.:  588.9  
##  Max.   :47.00000                      Max.   :40937.8  
##  NA's   :89                                             
##  next_pymnt_d       last_credit_pull_d last_fico_range_high
##  Length:105451      Length:105451      Min.   :  0.0       
##  Class :character   Class :character   1st Qu.:679.0       
##  Mode  :character   Mode  :character   Median :704.0       
##                                        Mean   :708.7       
##                                        3rd Qu.:734.0       
##                                        Max.   :850.0       
##                                                            
##  last_fico_range_low collections_12_mths_ex_med
##  Min.   :  0.0       Min.   :0.00000           
##  1st Qu.:675.0       1st Qu.:0.00000           
##  Median :700.0       Median :0.00000           
##  Mean   :704.7       Mean   :0.02319           
##  3rd Qu.:730.0       3rd Qu.:0.00000           
##  Max.   :845.0       Max.   :4.00000           
##                                                
##  mths_since_last_major_derog application_type   annual_inc_joint
##  Min.   :  0.00              Length:105451      Min.   : 15850  
##  1st Qu.: 27.00              Class :character   1st Qu.: 80000  
##  Median : 44.00              Mode  :character   Median :105500  
##  Mean   : 44.36                                 Mean   :114779  
##  3rd Qu.: 62.00                                 3rd Qu.:138900  
##  Max.   :161.00                                 Max.   :760000  
##  NA's   :76297                                  NA's   :98638   
##    dti_joint     verification_status_joint acc_now_delinq    
##  Min.   : 0.13   Length:105451             Min.   :0.000000  
##  1st Qu.:13.69   Class :character          1st Qu.:0.000000  
##  Median :18.63   Mode  :character          Median :0.000000  
##  Mean   :18.73                             Mean   :0.005348  
##  3rd Qu.:23.74                             3rd Qu.:0.000000  
##  Max.   :34.99                             Max.   :3.000000  
##  NA's   :98638                                               
##   tot_coll_amt       tot_cur_bal       open_acc_6m        open_il_6m    
##  Min.   :     0.0   Min.   :      0   Min.   : 0.0000   Min.   : 0.000  
##  1st Qu.:     0.0   1st Qu.:  29967   1st Qu.: 0.0000   1st Qu.: 1.000  
##  Median :     0.0   Median :  83660   Median : 1.0000   Median : 2.000  
##  Mean   :   238.7   Mean   : 147641   Mean   : 0.8698   Mean   : 2.882  
##  3rd Qu.:     0.0   3rd Qu.: 221372   3rd Qu.: 1.0000   3rd Qu.: 3.000  
##  Max.   :262740.0   Max.   :3566850   Max.   :16.0000   Max.   :55.000  
##                                                                         
##   open_il_12m      open_il_24m     mths_since_rcnt_il  total_bal_il    
##  Min.   :0.0000   Min.   : 0.000   Min.   :  0.00     Min.   :      0  
##  1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:  8.00     1st Qu.:   9353  
##  Median :0.0000   Median : 1.000   Median : 14.00     Median :  23700  
##  Mean   :0.6143   Mean   : 1.545   Mean   : 21.28     Mean   :  36200  
##  3rd Qu.:1.0000   3rd Qu.: 2.000   3rd Qu.: 24.00     3rd Qu.:  46773  
##  Max.   :9.0000   Max.   :14.000   Max.   :411.00     Max.   :1267131  
##                                    NA's   :2652                        
##     il_util        open_rv_12m     open_rv_24m       max_bal_bc    
##  Min.   :  0.00   Min.   : 0.00   Min.   : 0.000   Min.   :     0  
##  1st Qu.: 54.00   1st Qu.: 0.00   1st Qu.: 1.000   1st Qu.:  2382  
##  Median : 71.00   Median : 1.00   Median : 2.000   Median :  4466  
##  Mean   : 68.73   Mean   : 1.32   Mean   : 2.814   Mean   :  5805  
##  3rd Qu.: 85.00   3rd Qu.: 2.00   3rd Qu.: 4.000   3rd Qu.:  7542  
##  Max.   :464.00   Max.   :20.00   Max.   :45.000   Max.   :298260  
##  NA's   :13380                                                     
##     all_util      total_rev_hi_lim      inq_fi         total_cu_tl    
##  Min.   :  0.00   Min.   :      0   Min.   : 0.0000   Min.   : 0.000  
##  1st Qu.: 44.00   1st Qu.:  14800   1st Qu.: 0.0000   1st Qu.: 0.000  
##  Median : 59.00   Median :  25900   Median : 0.0000   Median : 0.000  
##  Mean   : 58.14   Mean   :  35098   Mean   : 0.9692   Mean   : 1.456  
##  3rd Qu.: 73.00   3rd Qu.:  43800   3rd Qu.: 1.0000   3rd Qu.: 2.000  
##  Max.   :185.00   Max.   :1252500   Max.   :27.0000   Max.   :71.000  
##  NA's   :13                                                           
##   inq_last_12m    acc_open_past_24mths  avg_cur_bal     bc_open_to_buy  
##  Min.   : 0.000   Min.   : 0.000       Min.   :     0   Min.   :     0  
##  1st Qu.: 0.000   1st Qu.: 2.000       1st Qu.:  3130   1st Qu.:  1982  
##  Median : 1.000   Median : 4.000       Median :  7574   Median :  5908  
##  Mean   : 2.047   Mean   : 4.638       Mean   : 13804   Mean   : 11906  
##  3rd Qu.: 3.000   3rd Qu.: 6.000       3rd Qu.: 19316   3rd Qu.: 14903  
##  Max.   :40.000   Max.   :45.000       Max.   :445856   Max.   :357289  
##                                        NA's   :2        NA's   :1143    
##     bc_util       chargeoff_within_12_mths  delinq_amnt     
##  Min.   :  0.00   Min.   :0.000000         Min.   :    0.0  
##  1st Qu.: 33.70   1st Qu.:0.000000         1st Qu.:    0.0  
##  Median : 58.00   Median :0.000000         Median :    0.0  
##  Mean   : 56.51   Mean   :0.009189         Mean   :   16.6  
##  3rd Qu.: 81.40   3rd Qu.:0.000000         3rd Qu.:    0.0  
##  Max.   :166.40   Max.   :4.000000         Max.   :65000.0  
##  NA's   :1191                                               
##  mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op
##  Min.   :  0.0      Min.   :  5.0        Min.   :  0.00       
##  1st Qu.: 96.0      1st Qu.:117.0        1st Qu.:  4.00       
##  Median :130.0      Median :162.0        Median :  9.00       
##  Mean   :126.1      Mean   :180.6        Mean   : 14.01       
##  3rd Qu.:154.0      3rd Qu.:232.0        3rd Qu.: 17.00       
##  Max.   :808.0      Max.   :999.0        Max.   :345.00       
##  NA's   :2652                                                 
##  mo_sin_rcnt_tl       mort_acc     mths_since_recent_bc
##  Min.   :  0.000   Min.   : 0.00   Min.   :  0.00      
##  1st Qu.:  3.000   1st Qu.: 0.00   1st Qu.:  6.00      
##  Median :  6.000   Median : 1.00   Median : 13.00      
##  Mean   :  8.359   Mean   : 1.43   Mean   : 24.17      
##  3rd Qu.: 11.000   3rd Qu.: 2.00   3rd Qu.: 29.00      
##  Max.   :195.000   Max.   :23.00   Max.   :555.00      
##                                    NA's   :1083        
##  mths_since_recent_bc_dlq mths_since_recent_inq
##  Min.   :  0.00           Min.   : 0.000       
##  1st Qu.: 19.00           1st Qu.: 2.000       
##  Median : 34.00           Median : 6.000       
##  Mean   : 37.05           Mean   : 7.105       
##  3rd Qu.: 52.00           3rd Qu.:11.000       
##  Max.   :161.00           Max.   :25.000       
##  NA's   :79994            NA's   :11949        
##  mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl  
##  Min.   :  0                    Min.   : 0.0000       Min.   : 0.000  
##  1st Qu.: 16                    1st Qu.: 0.0000       1st Qu.: 2.000  
##  Median : 30                    Median : 0.0000       Median : 3.000  
##  Mean   : 34                    Mean   : 0.5513       Mean   : 3.687  
##  3rd Qu.: 48                    3rd Qu.: 0.0000       3rd Qu.: 5.000  
##  Max.   :161                    Max.   :42.0000       Max.   :31.000  
##  NA's   :68375                                                        
##  num_actv_rev_tl   num_bc_sats       num_bc_tl        num_il_tl      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :  0.000  
##  1st Qu.: 3.000   1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.:  4.000  
##  Median : 5.000   Median : 4.000   Median : 6.000   Median :  7.000  
##  Mean   : 5.618   Mean   : 4.813   Mean   : 7.301   Mean   :  8.614  
##  3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.: 9.000   3rd Qu.: 11.000  
##  Max.   :63.000   Max.   :43.000   Max.   :66.000   Max.   :104.000  
##                                                                      
##  num_op_rev_tl    num_rev_accts   num_rev_tl_bal_gt_0    num_sats   
##  Min.   : 0.000   Min.   :  2.0   Min.   : 0.000      Min.   : 0.0  
##  1st Qu.: 5.000   1st Qu.:  8.0   1st Qu.: 3.000      1st Qu.: 8.0  
##  Median : 7.000   Median : 12.0   Median : 5.000      Median :11.0  
##  Mean   : 8.258   Mean   : 13.5   Mean   : 5.534      Mean   :11.8  
##  3rd Qu.:10.000   3rd Qu.: 17.0   3rd Qu.: 7.000      3rd Qu.:15.0  
##  Max.   :65.000   Max.   :113.0   Max.   :40.000      Max.   :87.0  
##                                                                     
##  num_tl_120dpd_2m  num_tl_30dpd      num_tl_90g_dpd_24m num_tl_op_past_12m
##  Min.   :0.000    Min.   :0.000000   Min.   : 0.00000   Min.   : 0.000    
##  1st Qu.:0.000    1st Qu.:0.000000   1st Qu.: 0.00000   1st Qu.: 1.000    
##  Median :0.000    Median :0.000000   Median : 0.00000   Median : 2.000    
##  Mean   :0.001    Mean   :0.003452   Mean   : 0.09338   Mean   : 2.073    
##  3rd Qu.:0.000    3rd Qu.:0.000000   3rd Qu.: 0.00000   3rd Qu.: 3.000    
##  Max.   :2.000    Max.   :2.000000   Max.   :42.00000   Max.   :22.000    
##  NA's   :5278                                                             
##  pct_tl_nvr_dlq   percent_bc_gt_75 pub_rec_bankruptcies   tax_liens       
##  Min.   :  0.00   Min.   :  0.00   Min.   :0.0000       Min.   : 0.00000  
##  1st Qu.: 90.40   1st Qu.:  0.00   1st Qu.:0.0000       1st Qu.: 0.00000  
##  Median : 97.40   Median : 33.30   Median :0.0000       Median : 0.00000  
##  Mean   : 93.48   Mean   : 40.56   Mean   :0.1411       Mean   : 0.07516  
##  3rd Qu.:100.00   3rd Qu.: 66.70   3rd Qu.:0.0000       3rd Qu.: 0.00000  
##  Max.   :100.00   Max.   :100.00   Max.   :7.0000       Max.   :22.00000  
##                   NA's   :1146                                            
##  tot_hi_cred_lim   total_bal_ex_mort total_bc_limit  
##  Min.   :      0   Min.   :      0   Min.   :     0  
##  1st Qu.:  52948   1st Qu.:  21602   1st Qu.:  8500  
##  Median : 119740   Median :  38987   Median : 16800  
##  Mean   : 185072   Mean   :  52964   Mean   : 23660  
##  3rd Qu.: 268068   3rd Qu.:  66820   3rd Qu.: 30900  
##  Max.   :4460960   Max.   :1310848   Max.   :407900  
##                                                      
##  total_il_high_credit_limit revol_bal_joint  sec_app_fico_range_low
##  Min.   :      0            Min.   :     0   Min.   :540.0         
##  1st Qu.:  16833            1st Qu.: 14052   1st Qu.:645.0         
##  Median :  35000            Median : 24042   Median :665.0         
##  Mean   :  46440            Mean   : 29151   Mean   :667.4         
##  3rd Qu.:  62100            3rd Qu.: 38259   3rd Qu.:690.0         
##  Max.   :1349413            Max.   :219950   Max.   :830.0         
##                             NA's   :98645    NA's   :98645         
##  sec_app_fico_range_high sec_app_earliest_cr_line sec_app_inq_last_6mths
##  Min.   :544.0           Length:105451            Min.   :0.00          
##  1st Qu.:649.0           Class :character         1st Qu.:0.00          
##  Median :669.0           Mode  :character         Median :0.00          
##  Mean   :671.4                                    Mean   :0.62          
##  3rd Qu.:694.0                                    3rd Qu.:1.00          
##  Max.   :834.0                                    Max.   :6.00          
##  NA's   :98645                                    NA's   :98645         
##  sec_app_mort_acc sec_app_open_acc sec_app_revol_util sec_app_open_il_6m
##  Min.   : 0.00    Min.   : 0.00    Min.   :  0.00     Min.   : 0.00     
##  1st Qu.: 0.00    1st Qu.: 7.00    1st Qu.: 43.80     1st Qu.: 1.00     
##  Median : 1.00    Median :10.00    Median : 62.90     Median : 2.00     
##  Mean   : 1.68    Mean   :11.29    Mean   : 60.53     Mean   : 2.89     
##  3rd Qu.: 3.00    3rd Qu.:14.00    3rd Qu.: 79.30     3rd Qu.: 3.00     
##  Max.   :13.00    Max.   :57.00    Max.   :129.70     Max.   :39.00     
##  NA's   :98645    NA's   :98645    NA's   :98722      NA's   :98645     
##  sec_app_num_rev_accts sec_app_chargeoff_within_12_mths
##  Min.   : 0.00         Min.   :0.00                    
##  1st Qu.: 7.00         1st Qu.:0.00                    
##  Median :11.00         Median :0.00                    
##  Mean   :12.41         Mean   :0.03                    
##  3rd Qu.:16.00         3rd Qu.:0.00                    
##  Max.   :84.00         Max.   :6.00                    
##  NA's   :98645         NA's   :98645                   
##  sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog
##  Min.   : 0.00                      Min.   :  0.00                     
##  1st Qu.: 0.00                      1st Qu.: 19.00                     
##  Median : 0.00                      Median : 39.00                     
##  Mean   : 0.07                      Mean   : 38.61                     
##  3rd Qu.: 0.00                      3rd Qu.: 57.00                     
##  Max.   :10.00                      Max.   :136.00                     
##  NA's   :98645                      NA's   :103117

There are quite a lot of missing values. Let’s check how many of the columns have missing values and remove those with more than 20% missing data.

colSums(is.na(dataLC))

##                                  id                           loan_amnt 
##                                   0                                   0 
##                                term                            int_rate 
##                                   0                                   0 
##                         installment                               grade 
##                                   0                                   0 
##                           sub_grade                           emp_title 
##                                   0                                6782 
##                          emp_length                      home_ownership 
##                                   0                                   0 
##                          annual_inc                 verification_status 
##                                   0                                   0 
##                             issue_d                         loan_status 
##                                   0                                   0 
##                             purpose                            zip_code 
##                                   0                                   0 
##                          addr_state                                 dti 
##                                   0                                   0 
##                         delinq_2yrs                    earliest_cr_line 
##                                   0                                   0 
##                      fico_range_low                     fico_range_high 
##                                   0                                   0 
##                      inq_last_6mths              mths_since_last_delinq 
##                                   0                               51119 
##              mths_since_last_record                            open_acc 
##                               85128                                   0 
##                             pub_rec                           revol_bal 
##                                   0                                   0 
##                          revol_util                           total_acc 
##                                  75                                   0 
##                 initial_list_status                           out_prncp 
##                                   0                                   0 
##                         total_pymnt                     total_rec_prncp 
##                                   0                                   0 
##                       total_rec_int                  total_rec_late_fee 
##                                   0                                  89 
##                        last_pymnt_d                     last_pymnt_amnt 
##                                8718                                   0 
##                        next_pymnt_d                  last_credit_pull_d 
##                                2585                                   3 
##                last_fico_range_high                 last_fico_range_low 
##                                   0                                   0 
##          collections_12_mths_ex_med         mths_since_last_major_derog 
##                                   0                               76297 
##                    application_type                    annual_inc_joint 
##                                   0                               98638 
##                           dti_joint           verification_status_joint 
##                               98638                               98638 
##                      acc_now_delinq                        tot_coll_amt 
##                                   0                                   0 
##                         tot_cur_bal                         open_acc_6m 
##                                   0                                   0 
##                          open_il_6m                         open_il_12m 
##                                   0                                   0 
##                         open_il_24m                  mths_since_rcnt_il 
##                                   0                                2652 
##                        total_bal_il                             il_util 
##                                   0                               13380 
##                         open_rv_12m                         open_rv_24m 
##                                   0                                   0 
##                          max_bal_bc                            all_util 
##                                   0                                  13 
##                    total_rev_hi_lim                              inq_fi 
##                                   0                                   0 
##                         total_cu_tl                        inq_last_12m 
##                                   0                                   0 
##                acc_open_past_24mths                         avg_cur_bal 
##                                   0                                   2 
##                      bc_open_to_buy                             bc_util 
##                                1143                                1191 
##            chargeoff_within_12_mths                         delinq_amnt 
##                                   0                                   0 
##                  mo_sin_old_il_acct                mo_sin_old_rev_tl_op 
##                                2652                                   0 
##               mo_sin_rcnt_rev_tl_op                      mo_sin_rcnt_tl 
##                                   0                                   0 
##                            mort_acc                mths_since_recent_bc 
##                                   0                                1083 
##            mths_since_recent_bc_dlq               mths_since_recent_inq 
##                               79994                               11949 
##      mths_since_recent_revol_delinq               num_accts_ever_120_pd 
##                               68375                                   0 
##                      num_actv_bc_tl                     num_actv_rev_tl 
##                                   0                                   0 
##                         num_bc_sats                           num_bc_tl 
##                                   0                                   0 
##                           num_il_tl                       num_op_rev_tl 
##                                   0                                   0 
##                       num_rev_accts                 num_rev_tl_bal_gt_0 
##                                   0                                   0 
##                            num_sats                    num_tl_120dpd_2m 
##                                   0                                5278 
##                        num_tl_30dpd                  num_tl_90g_dpd_24m 
##                                   0                                   0 
##                  num_tl_op_past_12m                      pct_tl_nvr_dlq 
##                                   0                                   0 
##                    percent_bc_gt_75                pub_rec_bankruptcies 
##                                1146                                   0 
##                           tax_liens                     tot_hi_cred_lim 
##                                   0                                   0 
##                   total_bal_ex_mort                      total_bc_limit 
##                                   0                                   0 
##          total_il_high_credit_limit                     revol_bal_joint 
##                                   0                               98645 
##              sec_app_fico_range_low             sec_app_fico_range_high 
##                               98645                               98645 
##            sec_app_earliest_cr_line              sec_app_inq_last_6mths 
##                               98645                               98645 
##                    sec_app_mort_acc                    sec_app_open_acc 
##                               98645                               98645 
##                  sec_app_revol_util                  sec_app_open_il_6m 
##                               98722                               98645 
##               sec_app_num_rev_accts    sec_app_chargeoff_within_12_mths 
##                               98645                               98645 
##  sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog 
##                               98645                              103117

dataLC1 <- dataLC[, colSums(is.na(dataLC)) <= 0.2*nrow(dataLC)] # remove columns with more than 20% missing values
dim(dataLC1)

## [1] 105451     95

colnames(dataLC1)

##  [1] "id"                         "loan_amnt"                 
##  [3] "term"                       "int_rate"                  
##  [5] "installment"                "grade"                     
##  [7] "sub_grade"                  "emp_title"                 
##  [9] "emp_length"                 "home_ownership"            
## [11] "annual_inc"                 "verification_status"       
## [13] "issue_d"                    "loan_status"               
## [15] "purpose"                    "zip_code"                  
## [17] "addr_state"                 "dti"                       
## [19] "delinq_2yrs"                "earliest_cr_line"          
## [21] "fico_range_low"             "fico_range_high"           
## [23] "inq_last_6mths"             "open_acc"                  
## [25] "pub_rec"                    "revol_bal"                 
## [27] "revol_util"                 "total_acc"                 
## [29] "initial_list_status"        "out_prncp"                 
## [31] "total_pymnt"                "total_rec_prncp"           
## [33] "total_rec_int"              "total_rec_late_fee"        
## [35] "last_pymnt_d"               "last_pymnt_amnt"           
## [37] "next_pymnt_d"               "last_credit_pull_d"        
## [39] "last_fico_range_high"       "last_fico_range_low"       
## [41] "collections_12_mths_ex_med" "application_type"          
## [43] "acc_now_delinq"             "tot_coll_amt"              
## [45] "tot_cur_bal"                "open_acc_6m"               
## [47] "open_il_6m"                 "open_il_12m"               
## [49] "open_il_24m"                "mths_since_rcnt_il"        
## [51] "total_bal_il"               "il_util"                   
## [53] "open_rv_12m"                "open_rv_24m"               
## [55] "max_bal_bc"                 "all_util"                  
## [57] "total_rev_hi_lim"           "inq_fi"                    
## [59] "total_cu_tl"                "inq_last_12m"              
## [61] "acc_open_past_24mths"       "avg_cur_bal"               
## [63] "bc_open_to_buy"             "bc_util"                   
## [65] "chargeoff_within_12_mths"   "delinq_amnt"               
## [67] "mo_sin_old_il_acct"         "mo_sin_old_rev_tl_op"      
## [69] "mo_sin_rcnt_rev_tl_op"      "mo_sin_rcnt_tl"            
## [71] "mort_acc"                   "mths_since_recent_bc"      
## [73] "mths_since_recent_inq"      "num_accts_ever_120_pd"     
## [75] "num_actv_bc_tl"             "num_actv_rev_tl"           
## [77] "num_bc_sats"                "num_bc_tl"                 
## [79] "num_il_tl"                  "num_op_rev_tl"             
## [81] "num_rev_accts"              "num_rev_tl_bal_gt_0"       
## [83] "num_sats"                   "num_tl_120dpd_2m"          
## [85] "num_tl_30dpd"               "num_tl_90g_dpd_24m"        
## [87] "num_tl_op_past_12m"         "pct_tl_nvr_dlq"            
## [89] "percent_bc_gt_75"           "pub_rec_bankruptcies"      
## [91] "tax_liens"                  "tot_hi_cred_lim"           
## [93] "total_bal_ex_mort"          "total_bc_limit"            
## [95] "total_il_high_credit_limit"

We have removed 21 columns and are left with 95 feature variables.

Let us also check if any of the transactions have more than 20% of missing data and go ahead to remove those. We do not find any of such transactions.

dataLC2 <- dataLC1[rowSums(is.na(dataLC1)) <= 0.2*ncol(dataLC1), ] # remove rows with more than 20% missing values
dim(dataLC2)

## [1] 105451     95

Feature Understanding

Let us now examine the features. We start by taking a closer look at the loan amount, which is the listed amount of the loan applied for by the borrower.

Desc(dataLC2$loan_amnt, main = "Loan Amount", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Loan Amount
## 
##      length         n       NAs     unique         0s       mean
##     105'451   105'451         0      1'455          0  14'589.07
##                100.0%      0.0%                  0.0%           
##                                                                 
##         .05       .10       .25     median        .75        .90
##    3'000.00  4'200.00  7'000.00  12'000.00  20'000.00  30'000.00
##                                                                 
##       range        sd     vcoef        mad        IQR       skew
##   39'000.00  9'478.83      0.65   8'895.60  13'000.00       0.83
##                                                                 
##      meanCI
##   14'531.86
##   14'646.28
##            
##         .95
##   35'000.00
##            
##        kurt
##       -0.12
##            
## lowest : 1'000 (592), 1'025 (3), 1'050 (5), 1'075 (2), 1'100 (11)
## highest: 39'775, 39'800, 39'825, 39'975, 40'000 (1'593)

The maximum loan amount is $40,000 and the minimum is $1,000. The median is $12,000. The distribution is not normal, has multiple peaks, and there are some outliers at the right tail.

Let’s take a look at the term variable, which is the number of payments on the loan. We see that there are only two types of loan terms, either 36 months or 60 months. Most of the loans (73%) are shorter, with a term of 36 months.

Desc(dataLC2$term, main = "Term", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Term
## 
##    length       n     NAs  unique
##   105'451 105'451       0       2
##            100.0%    0.0%        
## 
##              freq   perc  lci.95  uci.95'
## 36 months  77'105  73.1%   72.9%   73.4%
## 60 months  28'346  26.9%   26.6%   27.1%
## 
## ' 95%-CI Wilson

Next, we take a look at the interest rate on the loan. We first remove the “%” and convert it into a numeric.

dataLC2$int_rate = as.numeric(gsub("\\%", "", dataLC2$int_rate))
Desc(dataLC2$int_rate, main = "Interest Rate", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Interest Rate
## 
##    length        n    NAs  unique     0s   mean  meanCI
##   105'451  105'451      0      65      0  13.29   13.26
##             100.0%   0.0%           0.0%          13.32
##                                                        
##       .05      .10    .25  median    .75    .90     .95
##      6.99     7.35   9.93   12.62  15.99  19.03   23.99
##                                                        
##     range       sd  vcoef     mad    IQR   skew    kurt
##     25.67     5.16   0.39    3.99   6.06   1.11    1.60
##                                                        
## lowest : 5.32 (4'365), 6.0 (4), 6.99 (1'187), 7.07 (1'925), 7.21 (1'719)
## highest: 30.79 (203), 30.84 (178), 30.89 (134), 30.94 (123), 30.99 (147)

ggplot(dataLC2, aes(int_rate, fill = grade)) + 
  geom_density() + 
  facet_grid(grade ~ .)

The range of interest rate on the loans is quite wide, from 5.32% to 30.99%. The median interest rate is 12.62%. Notably, there are only 65 unique values. Again, we see that the distribution is not normal, has multiple peaks, and there are some outliers at the right tail. The lower the loan grade, the higher the interest rate.

Next, we examine the installment amount, which is the monthly payment owed by the borrower if the loan originates.

Desc(dataLC2$installment, main = "Installment", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Installment
## 
##     length        n     NAs  unique      0s    mean    meanCI
##    105'451  105'451       0  16'277       0  436.80    435.11
##              100.0%    0.0%            0.0%            438.49
##                                                              
##        .05      .10     .25  median     .75     .90       .95
##      98.92   140.67  232.64  361.38  584.37  839.23  1'024.16
##                                                              
##      range       sd   vcoef     mad     IQR    skew      kurt
##   1'689.71   280.18    0.64  240.36  351.73    1.07      0.76
##                                                              
## lowest : 30.12 (13), 30.88 (4), 30.91 (6), 30.98 (11), 30.99 (9)
## highest: 1'566.80 (2), 1'569.11 (2), 1'587.23 (2), 1'691.28, 1'719.83

The range of installments is wide, from $30.12 to $1,719.83 per month. The mean is $436.80. Obviously, one would expect the installments to be highly correlated to the loan amount, as we can see from the simple correlation plot below.

cor(data.frame(dataLC2$installment, dataLC2$loan_amnt)) %>% 
  corrplot(method="number", tl.cex=0.5)

Next, we take a look at the Lending Club-assigned grade for the loans. There are 7 different grades from A to G. Most of the loans are graded C (35%). By and large, the loans are graded highly by Lending Club, with nearly 82% graded C and above. Less than 1% of the loans are graded G. Notably, the loan amounts for the high-grade loans tend to be smaller than the poorly-graded loans.

Desc(dataLC2$grade, main = "Grade", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Grade
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0       7       7       y
##            100.0%    0.0%                        
## 
##    level    freq   perc  cumfreq  cumperc
## 1      C  36'880  35.0%   36'880    35.0%
## 2      B  31'301  29.7%   68'181    64.7%
## 3      A  17'898  17.0%   86'079    81.6%
## 4      D  12'015  11.4%   98'094    93.0%
## 5      E   4'847   4.6%  102'941    97.6%
## 6      F   1'725   1.6%  104'666    99.3%
## 7      G     785   0.7%  105'451   100.0%

Desc(loan_amnt ~ grade, dataLC2, digits=1, plotit=TRUE)

## ------------------------------------------------------------------------- 
## loan_amnt ~ grade
## 
## Summary: 
## n pairs: 105'451, valid: 105'451 (100.0%), missings: 0 (0.0%), groups: 7
## 
##                                                                   
##                A         B         C         D         E         F
## mean    13'528.0  13'313.8  15'132.6  15'820.5  17'077.6  18'988.9
## median  10'000.0  10'000.0  13'000.0  14'400.0  15'400.0  18'000.0
## sd       9'410.2   9'152.6   9'511.9   9'510.6   9'612.3   9'159.1
## IQR     11'600.0  12'000.0  12'000.0  13'487.5  14'000.0  13'100.0
## n         17'898    31'301    36'880    12'015     4'847     1'725
## np         17.0%     29.7%     35.0%     11.4%      4.6%      1.6%
## NAs            0         0         0         0         0         0
## 0s             0         0         0         0         0         0
##                 
##                G
## mean    20'214.1
## median  19'475.0
## sd       8'997.5
## IQR     14'600.0
## n            785
## np          0.7%
## NAs            0
## 0s             0
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 2560.6, df = 6, p-value < 2.2e-16

We see a similar pattern for sub-grades. There are five sub-grades - 1 to 5 - for every grade. With the notable exception of sub-grade D2 and D4, the loan amounts for the high-grade loans tend to be smaller.

Desc(dataLC2$sub_grade, main = "Sub-Grade", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Sub-Grade
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0      35      35       y
##            100.0%    0.0%                        
## 
##     level   freq  perc  cumfreq  cumperc
## 1      C1  8'088  7.7%    8'088     7.7%
## 2      C5  7'985  7.6%   16'073    15.2%
## 3      B5  7'511  7.1%   23'584    22.4%
## 4      C4  7'208  6.8%   30'792    29.2%
## 5      C3  6'973  6.6%   37'765    35.8%
## 6      C2  6'626  6.3%   44'391    42.1%
## 7      B1  6'416  6.1%   50'807    48.2%
## 8      B3  6'012  5.7%   56'819    53.9%
## 9      B4  5'821  5.5%   62'640    59.4%
## 10     B2  5'541  5.3%   68'181    64.7%
## 11     A5  4'690  4.4%   72'871    69.1%
## 12     A1  4'365  4.1%   77'236    73.2%
## ... etc.
##  [list output truncated]

ggplot(dataLC2,aes(x=sub_grade, y=loan_amnt)) + 
  geom_boxplot() + 
  xlab("Sub-grade") + 
  ylab("Loan Amount")

Let’s see whether the interest rates are dependent on the loan grade.

Desc(int_rate ~ grade, dataLC2, digits = 1, main = "Interest rate by Grade", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Interest rate by Grade
## 
## Summary: 
## n pairs: 105'451, valid: 105'451 (100.0%), missings: 0 (0.0%), groups: 7
## 
##                                                               
##              A       B       C       D       E       F       G
## mean       7.0    10.6    14.3    18.6    24.7    29.7    30.9
## median     7.2    10.5    14.1    18.1    24.9    29.7    30.9
## sd         1.0     1.1     1.2     1.4     1.3     0.8     0.1
## IQR        1.0     1.5     1.6     1.9     2.4     1.8     0.2
## n       17'898  31'301  36'880  12'015   4'847   1'725     785
## np       17.0%   29.7%   35.0%   11.4%    4.6%    1.6%    0.7%
## NAs          0       0       0       0       0       0       0
## 0s           0       0       0       0       0       0       0
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 97582, df = 6, p-value < 2.2e-16

Unsurprisingly, the higher the grade, the higher the interest rates.

Let’s take a look at the employment title, which is the job title supplied by the Borrower when applying for the loan.

Desc(dataLC2$emp_title, main = "Employment Title", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Employment Title
## 
##    length       n     NAs  unique  levels   dupes
##   105'451  98'669   6'782  35'885  35'885       y
##             93.6%    6.4%                        
## 
##                level   freq  perc  cumfreq  cumperc
## 1            Teacher  2'158  2.2%    2'158     2.2%
## 2            Manager  2'015  2.0%    4'173     4.2%
## 3              Owner  1'310  1.3%    5'483     5.6%
## 4         Supervisor    962  1.0%    6'445     6.5%
## 5             Driver    949  1.0%    7'394     7.5%
## 6   Registered Nurse    906  0.9%    8'300     8.4%
## 7                 RN    824  0.8%    9'124     9.2%
## 8              Sales    800  0.8%    9'924    10.1%
## 9    Project Manager    619  0.6%   10'543    10.7%
## 10          Director    569  0.6%   11'112    11.3%
## 11    Office Manager    538  0.5%   11'650    11.8%
## 12   General Manager    517  0.5%   12'167    12.3%
## ... etc.
##  [list output truncated]

There are 35,885 different titles, which makes it hard to make sense of it. Teacher was the most commonly supplied title, albeit only 2.2% of the Borrowers. There is also quite bit of missing data, 6,782 to be precise (or 6.4%). It is also not clear what some of the titles mean, for example, RN. We can probably drop this feature.

Employment length is a more useful feature, notwithstanding the considerable number of missing data (6,697). There are 11 levels, from < 1 year, 1 year, through to 10+ years (note: possible values are between 0 and 10 where 0 means less than one year). Notably, Borrowers’ employment length falls into 2 big clusters: the most frequent is mid-career professionals with 10+ years of experience (33.6%), the other is young people with under 3 years of working experience (33.1%).

Desc(dataLC2$emp_length, main = "Employment Length", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Employment Length
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0      12      12       y
##            100.0%    0.0%                        
## 
##         level    freq   perc  cumfreq  cumperc
## 1   10+ years  35'438  33.6%   35'438    33.6%
## 2     2 years   9'914   9.4%   45'352    43.0%
## 3    < 1 year   9'542   9.0%   54'894    52.1%
## 4     3 years   8'495   8.1%   63'389    60.1%
## 5      1 year   7'034   6.7%   70'423    66.8%
## 6         n/a   6'697   6.4%   77'120    73.1%
## 7     4 years   6'453   6.1%   83'573    79.3%
## 8     5 years   6'382   6.1%   89'955    85.3%
## 9     6 years   4'526   4.3%   94'481    89.6%
## 10    7 years   3'847   3.6%   98'328    93.2%
## 11    9 years   3'828   3.6%  102'156    96.9%
## 12    8 years   3'295   3.1%  105'451   100.0%

Let us take a look at the type of home ownership status provided by the Borrower during registration (or obtained from their credit report). Most Borrowers either have existing mortgages (49.8%). or are renting their home (38.9%).

Desc(dataLC2$home_ownership, main = "Home Ownership", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Home Ownership
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0       5       5       y
##            100.0%    0.0%                        
## 
##       level    freq   perc  cumfreq  cumperc
## 1  MORTGAGE  52'502  49.8%   52'502    49.8%
## 2      RENT  41'069  38.9%   93'571    88.7%
## 3       OWN  11'873  11.3%  105'444   100.0%
## 4       ANY       5   0.0%  105'449   100.0%
## 5      NONE       2   0.0%  105'451   100.0%

Next, we examine the Borrowers’ self-reported annual income. The distribution is highly postively skewed, from zero income to an income of $8.9 million (which appears to be an outlier). The median income is $68,000.

Desc(dataLC2$annual_inc, main = "Annual Income", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Annual Income
## 
##         length          n        NAs     unique         0s        mean
##        105'451    105'451          0      9'112         74   80'451.97
##                    100.0%       0.0%                  0.1%            
##                                                                       
##            .05        .10        .25     median        .75         .90
##      28'000.00  35'000.00  48'000.00  68'000.00  97'000.00  137'000.00
##                                                                       
##          range         sd      vcoef        mad        IQR        skew
##   8'900'000.00  68'220.64       0.85  34'099.80  49'000.00       33.59
##                                                                       
##       meanCI
##    80'040.21
##    80'863.73
##             
##          .95
##   170'000.00
##             
##         kurt
##     3'375.46
##             
## lowest : 0.0 (74), 1.0, 50.0, 250.0, 487.0
## highest: 3'650'010.0, 3'800'000.0, 4'000'040.0, 5'616'890.0, 8'900'000.0

Is there a relationship between Annual Income and Loan Amount? The scatterplot belows suggests there isn’t but the outliers are apparent.

ggplot(dataLC2,aes(x=annual_inc, y=loan_amnt)) + 
  geom_point() + 
  xlab("Annual Income") + 
  ylab("Loan Amount")

Verification status tells us whethere the income was verified by Lending Club, not verified, or if the income source was verified. Most of the income has had its source verified (38.5%) although 35.7% were not verified.

Desc(dataLC2$verification_status, main = "Verification Status", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Verification Status
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0       3       3       y
##            100.0%    0.0%                        
## 
##              level    freq   perc  cumfreq  cumperc
## 1  Source Verified  40'622  38.5%   40'622    38.5%
## 2     Not Verified  37'599  35.7%   78'221    74.2%
## 3         Verified  27'230  25.8%  105'451   100.0%

Issue date shows the month in which the loan was funded. For Q2 2017, most of the loans (36.1%) were issued in Jun. April was notably slower, with only 28.1% of loans issued.

Desc(dataLC2$issue_d, main = "Issue Date", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Issue Date
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0       3       3       y
##            100.0%    0.0%                        
## 
##     level    freq   perc  cumfreq  cumperc
## 1  Jun-17  38'087  36.1%   38'087    36.1%
## 2  May-17  37'681  35.7%   75'768    71.9%
## 3  Apr-17  29'683  28.1%  105'451   100.0%

Let’s take a look at the loan status - our target variable. There are 7 levels, with most loans being current (95.6%). What is interesting to us are the loans that are either late or charged off. These form only 0.4% of the loans. There doesn’t seem to be a discernible pattern between loan status and loan amount.

Desc(dataLC2$loan_status, main = "Loan Status", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Loan Status
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0       7       7       y
##            100.0%    0.0%                        
## 
##                 level     freq   perc  cumfreq  cumperc
## 1             Current  100'863  95.6%  100'863    95.6%
## 2          Fully Paid    2'580   2.4%  103'443    98.1%
## 3              Issued      867   0.8%  104'310    98.9%
## 4     In Grace Period      757   0.7%  105'067    99.6%
## 5  Late (31-120 days)      196   0.2%  105'263    99.8%
## 6   Late (16-30 days)      183   0.2%  105'446   100.0%
## 7         Charged Off        5   0.0%  105'451   100.0%

ggplot(dataLC2, aes(loan_status, loan_amnt)) + 
  geom_boxplot(aes(fill = loan_status)) +
  theme(axis.text.x = element_blank()) +
  labs(list(
    title = "Loan amount by status",
    x = "Status",
    y = "Amount"))

The purpose variable is a category provided by the borrower for the loan request. There are 13 categories. Over half of the loans (55.5%) were for debt consolidation. Other significant categories were credit card and home improvement.

Desc(dataLC2$purpose, main = "Purpose", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Purpose
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0      13      13       y
##            100.0%    0.0%                        
## 
##                  level    freq   perc  cumfreq  cumperc
## 1   debt_consolidation  58'557  55.5%   58'557    55.5%
## 2          credit_card  21'261  20.2%   79'818    75.7%
## 3     home_improvement   9'222   8.7%   89'040    84.4%
## 4                other   7'140   6.8%   96'180    91.2%
## 5       major_purchase   2'616   2.5%   98'796    93.7%
## 6              medical   1'648   1.6%  100'444    95.3%
## 7                  car   1'334   1.3%  101'778    96.5%
## 8             vacation   1'170   1.1%  102'948    97.6%
## 9       small_business   1'034   1.0%  103'982    98.6%
## 10              moving     945   0.9%  104'927    99.5%
## 11               house     453   0.4%  105'380    99.9%
## 12    renewable_energy      70   0.1%  105'450   100.0%
## ... etc.
##  [list output truncated]

Zip code and Address State informs us of the Borrower’s place of residence. 13% of the Borrowers live in California. Lending Club’s reach is impressive; all but one of the states had Borrowers who used the platform.

Desc(dataLC2$addr_state, main = "Address State", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Address State
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0      49      49       y
##            100.0%    0.0%                        
## 
##     level    freq   perc  cumfreq  cumperc
## 1      CA  13'751  13.0%   13'751    13.0%
## 2      TX   8'802   8.3%   22'553    21.4%
## 3      NY   8'743   8.3%   31'296    29.7%
## 4      FL   7'772   7.4%   39'068    37.0%
## 5      IL   4'378   4.2%   43'446    41.2%
## 6      NJ   3'901   3.7%   47'347    44.9%
## 7      PA   3'735   3.5%   51'082    48.4%
## 8      OH   3'604   3.4%   54'686    51.9%
## 9      GA   3'383   3.2%   58'069    55.1%
## 10     VA   2'982   2.8%   61'051    57.9%
## 11     NC   2'924   2.8%   63'975    60.7%
## 12     MI   2'789   2.6%   66'764    63.3%
## ... etc.
##  [list output truncated]

Let’s now take a look at some financial ratios. We begin with the Borrowers’ debt-to-income (DTI) ratio. DTI is calculated using the Borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested Lending Club loan, divided by the borrower’s self-reported monthly income. This will be an important feature affecting a Borrower’s creditworthiness.

Desc(dataLC2$dti, main = "Debt-to-Income Ratio", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Debt-to-Income Ratio
## 
##      length        n    NAs  unique     0s   mean    meanCI
##     105'451  105'451      0   5'047     35  26.43     24.73
##               100.0%   0.0%           0.0%            28.13
##                                                            
##         .05      .10    .25  median    .75    .90       .95
##        5.32     7.68  12.24   18.13  24.60  30.05     32.88
##                                                            
##       range       sd  vcoef     mad    IQR   skew      kurt
##   29'550.00   282.16  10.68    9.12  12.36  42.35  2'244.95
##                                                            
## lowest : 0.0 (35), 0.02, 0.03, 0.06 (2), 0.07
## highest: 1'680.0, 2'933.74, 5'395.0, 9'999.0 (75), 29'550.0

The distribution is extremely skewed. Some Borrowers’ reported no debt, while one reported a DTI ratio of 29,550%. This is not surprising given that some of the Borrowers reported little or no income.

Delinquency shows the number of 30+ days past-due incidences of delinquency in the Borrower’s credit file for the past 2 years. Most Borrowers (79%) have not had a delinquency, but one of the Borrowers has had 42 delinquencies over the last w years!

Desc(dataLC2$delinq_2yrs, main = "Delinquency (Last 2 Years)", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Delinquency (Last 2 Years)
## 
##    length        n    NAs  unique      0s  mean  meanCI
##   105'451  105'451      0      24  83'373  0.36    0.35
##             100.0%   0.0%           79.1%          0.36
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    0.00  1.00    2.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##     42.00     0.96   2.70    0.00    0.00  6.07   82.85
##                                                        
## lowest : 0 (83'373), 1 (14'393), 2 (4'363), 3 (1'659), 4 (690)
## highest: 19, 20 (3), 21, 25, 42

Earliest Credit Line shows the month the borrower’s earliest reported credit line was opened. The earliest was back in February 1934! The latest was dated March 2014. Lines were generally opened fairly evenly across time. 2003-2206 seems to be a period when many Borrowers opened their credit lines.

Desc(dataLC2$earliest_cr_line, main = "Earliest Credit Line", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Earliest Credit Line
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'451       0     627     627       y
##            100.0%    0.0%                        
## 
##      level  freq  perc  cumfreq  cumperc
## 1   Sep-04   892  0.8%      892     0.8%
## 2   Aug-06   836  0.8%    1'728     1.6%
## 3   Sep-05   819  0.8%    2'547     2.4%
## 4   Aug-04   810  0.8%    3'357     3.2%
## 5   Aug-05   787  0.7%    4'144     3.9%
## 6   Oct-03   755  0.7%    4'899     4.6%
## 7   Sep-03   751  0.7%    5'650     5.4%
## 8   Aug-03   748  0.7%    6'398     6.1%
## 9   Oct-05   715  0.7%    7'113     6.7%
## 10  Sep-06   704  0.7%    7'817     7.4%
## 11  Oct-04   697  0.7%    8'514     8.1%
## 12  Nov-04   681  0.6%    9'195     8.7%
## ... etc.
##  [list output truncated]

FICO low and FICO high show the lower and upper boundary range the borrower’s FICO at loan origination belongs to. Each FICO bucket is 4 points so only one of the variable is useful. Let us create a new feature showing the average FICO score of the Borrower at loan origination instead.

dataLC2 <- mutate(dataLC2, fico_ave = (fico_range_low + fico_range_high)/2)
Desc(dataLC2$fico_ave, main = "Average FICO Score", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Average FICO Score
## 
##    length        n     NAs  unique      0s    mean  meanCI
##   105'451  105'451       0      38       0  699.98  699.78
##             100.0%    0.0%            0.0%          700.18
##                                                           
##       .05      .10     .25  median     .75     .90     .95
##    662.00   667.00  672.00  692.00  717.00  747.00  767.00
##                                                           
##     range       sd   vcoef     mad     IQR    skew    kurt
##    185.50    32.95    0.05   29.65   45.00    1.20    1.33
##                                                           
## lowest : 662.0 (9'367), 667.0 (8'703), 672.0 (8'740), 677.0 (7'739), 682.0 (7'733)
## highest: 827.0 (101), 832.0 (67), 837.0 (36), 842.0 (28), 847.5 (16)

FICO scores of the Borrowers range from 662 to 847.5 (note: maximum is 850, minimum is 300). The median score is 692. The distribution is positively skewed, with 13 outliers beyond the upper IQR.

Inquiries (last 6 months) shows the number of inquiries in the past 6 months (excluding auto and mortgage inquiries). Most Borrowers have not made an inquiry (63.9%) but there is a small fraction who have made up to 5 inquires (0.2%)

Desc(dataLC2$inq_last_6mths, main = "Inquires (Last 6 months)", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Inquires (Last 6 months)
## 
##    length        n    NAs  unique      0s  mean  meanCI
##   105'451  105'451      0       6  67'430  0.51    0.50
##             100.0%   0.0%           63.9%          0.51
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    1.00  2.00    2.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##      5.00     0.81   1.59    0.00    1.00  1.89    4.23
##                                                        
## 
##    level    freq   perc  cumfreq  cumperc
## 1      0  67'430  63.9%   67'430    63.9%
## 2      1  26'572  25.2%   94'002    89.1%
## 3      2   8'336   7.9%  102'338    97.0%
## 4      3   2'197   2.1%  104'535    99.1%
## 5      4     687   0.7%  105'222    99.8%
## 6      5     229   0.2%  105'451   100.0%

Open accounts is the number of open credit lines in the Borrower’s credit file. The range is wide, wih Borrowers having no open credit lines to some having as many as 88 credit lines opened. The median Borrower has 11 open credit lines.

Desc(dataLC2$open_acc, main = "Open Accounts", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Open Accounts
## 
##    length        n    NAs  unique     0s   mean  meanCI
##   105'451  105'451      0      65      1  11.85   11.81
##             100.0%   0.0%           0.0%          11.88
##                                                        
##       .05      .10    .25  median    .75    .90     .95
##      5.00     6.00   8.00   11.00  15.00  19.00   23.00
##                                                        
##     range       sd  vcoef     mad    IQR   skew    kurt
##     88.00     5.78   0.49    4.45   7.00   1.34    3.64
##                                                        
## lowest : 0, 1 (41), 2 (347), 3 (1'381), 4 (2'985)
## highest: 65, 68, 80, 86, 88

Public records shows the number of derogatory public records. Most Borrowers have none (80.7%), but there are others who have as many as 22 such records.

Desc(dataLC2$pub_rec, main = "Public Records", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Public Records
## 
##    length        n    NAs  unique      0s  mean  meanCI
##   105'451  105'451      0      16  85'128  0.26    0.25
##             100.0%   0.0%           80.7%          0.26
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    0.00  1.00    1.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##     22.00     0.66   2.56    0.00    0.00  4.99   50.59
##                                                        
## lowest : 0 (85'128), 1 (16'342), 2 (2'531), 3 (827), 4 (305)
## highest: 11 (4), 12, 14 (2), 15, 22

Let’s take a look at the Borrowers’ total credit revolving balance. The distribution is highly skewed; some Borrowers have no revolving credit, some have over $1 million. The median Borrower’s revolving credit balance is $11,170.

Desc(dataLC2$revol_bal, main = "Revolving Credit Balance", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Revolving Credit Balance
## 
##         length          n       NAs     unique         0s       mean
##        105'451    105'451         0     37'373        437  16'596.03
##                    100.0%      0.0%                  0.4%           
##                                                                     
##            .05        .10       .25     median        .75        .90
##       1'735.00   2'978.00  5'887.00  11'170.00  19'933.50  33'362.00
##                                                                     
##          range         sd     vcoef        mad        IQR       skew
##   1'039'013.00  22'965.94      1.38   9'298.87  14'046.50       9.35
##                                                                     
##      meanCI
##   16'457.42
##   16'734.65
##            
##         .95
##   45'508.00
##            
##        kurt
##      179.93
##            
## lowest : 0 (437), 1 (5), 2, 3 (5), 4 (5)
## highest: 693'478, 725'708, 742'434, 798'264, 1'039'013

What is the revolving utilization? We first remove the % and convert it into numeric. The distributon is fairly normal, albeit with a right tail. The utlization rate was between 0% and 138.9%, with a median utilization of 48.5%.

dataLC2$revol_util = as.numeric(gsub("\\%", "", dataLC2$revol_util))
Desc(dataLC2$revol_util, main = "Revolving Credit Utilization", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Revolving Credit Utilization
## 
##    length        n    NAs  unique     0s   mean  meanCI
##   105'451  105'376     75   1'076    468  49.02   48.87
##              99.9%   0.1%           0.4%          49.16
##                                                        
##       .05      .10    .25  median    .75    .90     .95
##      9.80    16.50  30.40   48.50  67.40  82.50   89.90
##                                                        
##     range       sd  vcoef     mad    IQR   skew    kurt
##    138.90    24.26   0.49   27.43  37.00   0.06   -0.81
##                                                        
## lowest : 0.0 (468), 0.1 (81), 0.2 (62), 0.3 (61), 0.4 (37)
## highest: 123.4, 125.0, 125.2, 127.0, 138.9

** Total account** is the total number of credit lines currently in the borrower’s credit file. The distribution is positively skewed, with the number of lines ranging from 2 to 146. The median Borrower has 22 credit lines on file. Recall that the median Borrower has 11 open credit lines.

Desc(dataLC2$total_acc, main = "Total Account", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Total Account
## 
##    length        n    NAs  unique     0s   mean  meanCI
##   105'451  105'451      0     108      0  23.76   23.69
##             100.0%   0.0%           0.0%          23.83
##                                                        
##       .05      .10    .25  median    .75    .90     .95
##      8.00    10.00  15.00   22.00  30.00  39.00   46.00
##                                                        
##     range       sd  vcoef     mad    IQR   skew    kurt
##    144.00    11.83   0.50   10.38  15.00   1.07    2.10
##                                                        
## lowest : 2 (48), 3 (188), 4 (423), 5 (700), 6 (1'122)
## highest: 112 (2), 115, 121, 122 (2), 146

Initial list status indicates the initial listing status of the loan. Possible values are W, F. W stands for whole loans, that is, available to investors to be purchased in their entirety (Borrowers benefit from getting ‘instant funding’). Lending Club provides a randomized subset of loans by grade available to purchase as a whole loan for a brief period of time (12 hours). The rest are available for fractional purchase. 75.1% of the loans are whole loans.

Desc(dataLC2$initial_list_status, main = "Initial Listing Status", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Initial Listing Status
## 
##    length       n     NAs  unique
##   105'451 105'451       0       2
##            100.0%    0.0%        
## 
##      freq   perc  lci.95  uci.95'
## w  79'488  75.4%   75.1%   75.6%
## f  25'963  24.6%   24.4%   24.9%
## 
## ' 95%-CI Wilson

** Outstanding principal** is the remaining outstanding principal for total amount funded. This should be highly correlated to Loan amount since the loans have only been issued. Indeed, the correlation is 96%. This variable will be more useful when we have a longer data series.

Desc(dataLC2$out_prncp, main = "Outstanding Principal", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Outstanding Principal
## 
##       length          n        NAs      unique          0s        mean
##      105'451    105'451          0      28'979       2'601  13'761.025
##                  100.0%       0.0%                    2.5%            
##                                                                       
##          .05        .10        .25      median         .75         .90
##    1'959.915  3'507.540  6'474.650  11'599.130  19'463.160  28'800.000
##                                                                       
##        range         sd      vcoef         mad         IQR        skew
##   40'000.000  9'377.066      0.681   8'868.424  12'988.510       0.803
##                                                                       
##       meanCI
##   13'704.428
##   13'817.622
##             
##          .95
##   33'461.910
##             
##         kurt
##       -0.123
##             
## lowest : 0.0 (2'601), 0.04, 0.87, 1.1, 1.47
## highest: 39'800.0, 39'825.0, 39'904.600, 39'990.0, 40'000.0 (166)

cor(dataLC2$out_prncp, dataLC2$loan_amnt)

## [1] 0.9634905

Total payments is the payments received to date for the total amount funded. Unsurprisingly, there are quite a lot of zeros (8,718) since the loans have only been issued. Again, this variable will be more useful when we have a longer data series. The median payment to date was $591.40. The distribution is skewed to the right.

Desc(dataLC2$total_pymnt, main = "Total Payment to Date", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Total Payment to Date
## 
##       length          n      NAs   unique         0s       mean     meanCI
##      105'451    105'451        0   39'800      8'718  1'097.406  1'081.719
##                  100.0%     0.0%                8.3%             1'113.092
##                                                                           
##          .05        .10      .25   median        .75        .90        .95
##        0.000     92.640  294.475  591.400  1'086.215  1'910.800  2'605.880
##                                                                           
##        range         sd    vcoef      mad        IQR       skew       kurt
##   41'722.293  2'598.943    2.368  532.075    791.740      8.490     88.209
##                                                                           
## lowest : 0.0 (8'718), 0.11, 0.67, 0.7, 1.31
## highest: 40'978.344, 41'002.436, 41'146.883, 41'380.811, 41'722.293

The next two features, ** Total principal received** and Total interest received are linearly decomposed from Total payments so they are not hence particulary. We can go ahead to drop them later.

Total received late fees shows that the late fees received to date. It is also not particularly interesting - 99.8% are zeros - so we can also drop it later.

Desc(dataLC2$total_rec_late_fee, main = "Total Late Fee Received to Date", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Total Late Fee Received to Date
## 
##    length        n    NAs  unique       0s   mean    meanCI
##   105'451  105'362     89       7  105'272   0.01      0.01
##              99.9%   0.1%            99.8%             0.02
##                                                            
##       .05      .10    .25  median      .75    .90       .95
##      0.00     0.00   0.00    0.00     0.00   0.00      0.00
##                                                            
##     range       sd  vcoef     mad      IQR   skew      kurt
##     47.00     0.47  35.20    0.00     0.00  40.09  2'036.09
##                                                            
## 
##    level     freq   perc  cumfreq  cumperc
## 1      0  105'272  99.9%  105'272    99.9%
## 2      9        1   0.0%  105'273    99.9%
## 3     15       85   0.1%  105'358   100.0%
## 4     17        1   0.0%  105'359   100.0%
## 5     19        1   0.0%  105'360   100.0%
## 6     30        1   0.0%  105'361   100.0%
## 7     47        1   0.0%  105'362   100.0%

Last payment date shows the last month payment was received. Most (93.2%) were received in July 2017. This feature is not particulary interesting as well. Similarly, Next payment date is not interesting.

Desc(dataLC2$last_pymnt_d, main = "Last Payment Date", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Last Payment Date
## 
##    length       n     NAs  unique  levels   dupes
##   105'451  96'733   8'718       4       4       y
##             91.7%    8.3%                        
## 
##     level    freq   perc  cumfreq  cumperc
## 1  Jul-17  90'137  93.2%   90'137    93.2%
## 2  Jun-17   5'684   5.9%   95'821    99.1%
## 3  May-17     765   0.8%   96'586    99.8%
## 4  Apr-17     147   0.2%   96'733   100.0%

Last payment amount is the last total payment amount received. This is highly correlated to total payment (94.7%).

Desc(dataLC2$last_pymnt_amnt, main = "Last Payment Amount", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Last Payment Amount
## 
##       length          n      NAs   unique       0s     mean     meanCI
##      105'451    105'451        0   22'729    8'718  717.551    702.778
##                  100.0%     0.0%              8.3%             732.324
##                                                                       
##          .05        .10      .25   median      .75      .90        .95
##        0.000     65.000  196.940  345.150  588.905  895.730  1'154.000
##                                                                       
##        range         sd    vcoef      mad      IQR     skew       kurt
##   40'937.830  2'447.647    3.411  267.654  391.965    9.661    108.281
##                                                                       
## lowest : 0.0 (8'718), 0.11, 0.67, 0.7, 1.75 (2)
## highest: 40'604.210, 40'625.110, 40'659.610, 40'758.250, 40'937.830

Last credit pulled is the most recent month Lending Club pulled credit for this loan. Most (95.5%) loans were refreshed for July 2017. This feature does not seem particulary interesting.

Desc(dataLC2$last_credit_pull_d, main = "Last Credit Pulled", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Last Credit Pulled
## 
##    length       n     NAs  unique  levels   dupes
##   105'451 105'448       3       5       5       y
##            100.0%    0.0%                        
## 
##     level     freq   perc  cumfreq  cumperc
## 1  Jul-17  100'680  95.5%  100'680    95.5%
## 2  May-17    1'644   1.6%  102'324    97.0%
## 3  Jun-17    1'600   1.5%  103'924    98.6%
## 4  Apr-17    1'207   1.1%  105'131    99.7%
## 5  Mar-17      317   0.3%  105'448   100.0%

Last FICO low and Last FICO high show the lower and upper boundary range the borrower’s last FICO pulled. Again, each FICO bucket is 4 points so we create a new feature showing the average FICO score pulled instead.

dataLC2 <- mutate(dataLC2, fico_ave_last = (last_fico_range_low + last_fico_range_high)/2)
Desc(dataLC2$fico_ave_last, main = "Average Last FICO Score Pulled", plotit = TRUE)

## ------------------------------------------------------------------------- 
## Average Last FICO Score Pulled
## 
##    length        n     NAs  unique      0s    mean  meanCI
##   105'451  105'451       0      71       3  706.74  706.50
##             100.0%    0.0%            0.0%          706.98
##                                                           
##       .05      .10     .25  median     .75     .90     .95
##    652.00   662.00  677.00  702.00  732.00  762.00  782.00
##                                                           
##     range       sd   vcoef     mad     IQR    skew    kurt
##    847.50    39.51    0.06   37.06   55.00    0.32    3.36
##                                                           
## lowest : 0.0 (3), 249.5, 502.0, 507.0, 512.0
## highest: 827.0 (141), 832.0 (91), 837.0 (36), 842.0 (19), 847.5 (5)

FICO scores last pulled range from 652 to 847.5. The median score is 702. There has been an increase in FICO scores since the loans were orginated. One would expect the change in FICO scores to be an important predictor of loan default.

4. Unsupervised Learning

Now that we have a much better understanding of the dataset, let us go ahead to build some unsupervised learning models, employing clustering techniques to uncover latent structures with the data to further enhance our understanding of the data. The segmentation of customer groups will allow us to better train the models in the next stage for each specific segment.

k-means Clustering

We start with k-means clustering. Let us first create a dataframe consisting of all numeric variables (including integers). We exclude fico_range_low and fico_range_high since we had created a new feature for the average FICO score. We also drop total_rec_prncp, total_rec_int,total_rec_late_fee for the reasons mentioned above.

numeric_cols <- sapply(dataLC2,is.numeric)
dataLC_numeric <- dataLC2[, numeric_cols]
dataLC_numeric <- subset(dataLC_numeric, select=-c(fico_range_low, 
                                                   fico_range_low, 
                                                   last_fico_range_low,
                                                   last_fico_range_high,
                                                   total_rec_prncp, 
                                                   total_rec_int,
                                                   total_rec_late_fee))
str(dataLC_numeric)

## Classes 'tbl_df', 'tbl' and 'data.frame':    105451 obs. of  73 variables:
##  $ id                        : int  112432991 111872988 112439006 111735948 112156817 112390231 112441324 112210041 112119128 112435993 ...
##  $ loan_amnt                 : int  3000 10000 20400 6000 10000 20000 3600 6000 25000 2300 ...
##  $ int_rate                  : num  7.97 9.44 9.44 7.35 7.21 ...
##  $ installment               : num  94 320 653 186 310 ...
##  $ annual_inc                : num  34000 52000 44000 74000 145000 ...
##  $ dti                       : num  12.71 30.91 27.06 19.02 6.58 ...
##  $ delinq_2yrs               : int  0 0 0 0 0 2 0 0 0 0 ...
##  $ fico_range_high           : int  709 779 764 739 719 704 804 684 709 704 ...
##  $ inq_last_6mths            : int  1 0 1 0 2 0 0 0 0 1 ...
##  $ open_acc                  : int  10 16 15 15 12 8 12 10 13 4 ...
##  $ pub_rec                   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ revol_bal                 : int  3443 9983 8769 5433 18633 20589 12060 9332 26947 3911 ...
##  $ revol_util                : num  31 24.3 18.7 14.8 32.7 81.7 11.7 59.1 73.2 55.1 ...
##  $ total_acc                 : int  17 24 19 35 59 14 30 28 18 6 ...
##  $ out_prncp                 : num  3000 10000 20400 6000 10000 ...
##  $ total_pymnt               : num  0 0 0 0 0 ...
##  $ last_pymnt_amnt           : num  0 0 0 0 0 ...
##  $ collections_12_mths_ex_med: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ acc_now_delinq            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tot_coll_amt              : int  0 0 0 0 0 0 0 275 0 0 ...
##  $ tot_cur_bal               : int  45876 148768 62335 297932 278599 37895 115079 107517 26947 3911 ...
##  $ open_acc_6m               : int  1 3 1 1 4 1 0 0 0 0 ...
##  $ open_il_6m                : int  3 5 10 7 1 2 1 1 0 0 ...
##  $ open_il_12m               : int  0 2 1 2 0 0 0 0 0 0 ...
##  $ open_il_24m               : int  0 3 1 3 0 0 1 1 0 0 ...
##  $ mths_since_rcnt_il        : int  34 3 6 6 27 27 16 19 38 NA ...
##  $ total_bal_il              : int  42433 30221 53566 57864 5537 17306 2875 19983 0 0 ...
##  $ il_util                   : int  94 67 95 88 33 41 58 NA NA NA ...
##  $ open_rv_12m               : int  2 3 1 1 6 1 0 1 1 1 ...
##  $ open_rv_24m               : int  4 6 2 5 8 3 2 1 3 2 ...
##  $ max_bal_bc                : int  1955 7015 4240 3201 5428 19372 3486 3990 12643 2315 ...
##  $ all_util                  : int  81 47 60 62 33 56 14 59 73 55 ...
##  $ total_rev_hi_lim          : int  11100 41400 46900 36800 56900 25200 102900 15800 36800 7100 ...
##  $ inq_fi                    : int  1 0 1 4 0 0 0 0 0 1 ...
##  $ total_cu_tl               : int  1 2 1 2 0 0 1 0 0 0 ...
##  $ inq_last_12m              : int  1 0 1 4 4 0 0 0 0 2 ...
##  $ acc_open_past_24mths      : int  4 10 3 9 9 3 3 2 3 2 ...
##  $ avg_cur_bal               : int  4588 9298 4156 19862 23217 5414 9590 11946 2073 978 ...
##  $ bc_open_to_buy            : int  5057 17348 35631 1927 29857 2403 81840 1845 6505 440 ...
##  $ bc_util                   : num  40.5 30.6 19.8 73.6 35.2 89.5 11.9 82.4 73.8 87.1 ...
##  $ chargeoff_within_12_mths  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delinq_amnt               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mo_sin_old_il_acct        : int  93 53 125 81 139 160 113 86 175 NA ...
##  $ mo_sin_old_rev_tl_op      : int  119 288 125 160 286 298 296 303 234 381 ...
##  $ mo_sin_rcnt_rev_tl_op     : int  6 3 11 9 0 6 18 9 10 8 ...
##  $ mo_sin_rcnt_tl            : int  6 3 6 6 0 6 16 9 10 8 ...
##  $ mort_acc                  : int  0 2 0 1 2 0 7 4 0 0 ...
##  $ mths_since_recent_bc      : int  6 3 11 15 5 6 23 39 21 15 ...
##  $ mths_since_recent_inq     : int  6 18 6 6 0 NA 18 NA 21 0 ...
##  $ num_accts_ever_120_pd     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_actv_bc_tl            : int  2 3 3 2 5 2 5 3 6 2 ...
##  $ num_actv_rev_tl           : int  2 5 3 3 8 3 6 4 10 3 ...
##  $ num_bc_sats               : int  3 5 3 2 5 3 7 3 7 2 ...
##  $ num_bc_tl                 : int  8 6 5 12 20 5 12 8 8 2 ...
##  $ num_il_tl                 : int  4 6 12 11 4 3 5 6 2 0 ...
##  $ num_op_rev_tl             : int  7 10 5 7 10 6 9 8 13 4 ...
##  $ num_rev_accts             : int  13 16 7 23 53 11 17 18 16 6 ...
##  $ num_rev_tl_bal_gt_0       : int  2 5 3 3 8 3 6 4 10 3 ...
##  $ num_sats                  : int  10 16 15 15 12 8 12 9 13 4 ...
##  $ num_tl_120dpd_2m          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_tl_30dpd              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_tl_90g_dpd_24m        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_tl_op_past_12m        : int  2 5 2 3 7 1 0 1 1 1 ...
##  $ pct_tl_nvr_dlq            : num  100 100 100 100 98.3 84.6 100 95 100 100 ...
##  $ percent_bc_gt_75          : num  33.3 20 0 0 40 33.3 0 66.7 57.1 100 ...
##  $ pub_rec_bankruptcies      : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ tax_liens                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tot_hi_cred_lim           : int  56396 198528 103358 342624 330503 67717 223900 150967 36800 7100 ...
##  $ total_bal_ex_mort         : int  45876 40204 62335 63297 24170 37895 14935 29315 26947 3911 ...
##  $ total_bc_limit            : int  8500 25000 44400 7300 46100 22900 92900 10500 24800 3400 ...
##  $ total_il_high_credit_limit: int  45296 44928 56458 65959 16613 42517 5000 39869 0 0 ...
##  $ fico_ave                  : num  707 777 762 737 717 702 802 682 707 702 ...
##  $ fico_ave_last             : num  747 752 762 732 712 687 797 682 697 757 ...

There are several ways to deal with the missing unknown values. We could simply remove the cases with unknowns but that could introduce some bias. Let’s fill the columns with missing values using their mean instead.

dataLC_numeric$revol_util<-ifelse(is.na(dataLC_numeric$revol_util),mean(dataLC_numeric$revol_util,na.rm =TRUE),dataLC_numeric$revol_util)

dataLC_numeric$mths_since_rcnt_il<-ifelse(is.na(dataLC_numeric$mths_since_rcnt_il),mean(dataLC_numeric$mths_since_rcnt_il,na.rm =TRUE),dataLC_numeric$mths_since_rcnt_il)

dataLC_numeric$il_util<-ifelse(is.na(dataLC_numeric$il_util),mean(dataLC_numeric$il_util,na.rm =TRUE),dataLC_numeric$il_util)

dataLC_numeric$all_util<-ifelse(is.na(dataLC_numeric$all_util),mean(dataLC_numeric$all_util,na.rm =TRUE),dataLC_numeric$all_util)

dataLC_numeric$avg_cur_bal<-ifelse(is.na(dataLC_numeric$avg_cur_bal),mean(dataLC_numeric$avg_cur_bal,na.rm =TRUE),dataLC_numeric$avg_cur_bal)

dataLC_numeric$bc_open_to_buy<-ifelse(is.na(dataLC_numeric$bc_open_to_buy),mean(dataLC_numeric$bc_open_to_buy,na.rm =TRUE),dataLC_numeric$bc_open_to_buy)

dataLC_numeric$bc_util<-ifelse(is.na(dataLC_numeric$bc_util),mean(dataLC_numeric$bc_util,na.rm =TRUE),dataLC_numeric$bc_util)

dataLC_numeric$mo_sin_old_il_acct<-ifelse(is.na(dataLC_numeric$mo_sin_old_il_acct),mean(dataLC_numeric$mo_sin_old_il_acct,na.rm =TRUE),dataLC_numeric$mo_sin_old_il_acct)

dataLC_numeric$mths_since_recent_bc<-ifelse(is.na(dataLC_numeric$mths_since_recent_bc),mean(dataLC_numeric$mths_since_recent_bc,na.rm =TRUE),dataLC_numeric$mths_since_recent_bc)

dataLC_numeric$mths_since_recent_inq<-ifelse(is.na(dataLC_numeric$mths_since_recent_inq),mean(dataLC_numeric$mths_since_recent_inq,na.rm =TRUE),dataLC_numeric$mths_since_recent_inq)

dataLC_numeric$num_tl_120dpd_2m<-ifelse(is.na(dataLC_numeric$num_tl_120dpd_2m),mean(dataLC_numeric$num_tl_120dpd_2m,na.rm =TRUE),dataLC_numeric$num_tl_120dpd_2m)

dataLC_numeric$percent_bc_gt_75<-ifelse(is.na(dataLC_numeric$percent_bc_gt_75),mean(dataLC_numeric$percent_bc_gt_75,na.rm =TRUE),dataLC_numeric$percent_bc_gt_75)

colSums(is.na(dataLC_numeric))

##                         id                  loan_amnt 
##                          0                          0 
##                   int_rate                installment 
##                          0                          0 
##                 annual_inc                        dti 
##                          0                          0 
##                delinq_2yrs            fico_range_high 
##                          0                          0 
##             inq_last_6mths                   open_acc 
##                          0                          0 
##                    pub_rec                  revol_bal 
##                          0                          0 
##                 revol_util                  total_acc 
##                          0                          0 
##                  out_prncp                total_pymnt 
##                          0                          0 
##            last_pymnt_amnt collections_12_mths_ex_med 
##                          0                          0 
##             acc_now_delinq               tot_coll_amt 
##                          0                          0 
##                tot_cur_bal                open_acc_6m 
##                          0                          0 
##                 open_il_6m                open_il_12m 
##                          0                          0 
##                open_il_24m         mths_since_rcnt_il 
##                          0                          0 
##               total_bal_il                    il_util 
##                          0                          0 
##                open_rv_12m                open_rv_24m 
##                          0                          0 
##                 max_bal_bc                   all_util 
##                          0                          0 
##           total_rev_hi_lim                     inq_fi 
##                          0                          0 
##                total_cu_tl               inq_last_12m 
##                          0                          0 
##       acc_open_past_24mths                avg_cur_bal 
##                          0                          0 
##             bc_open_to_buy                    bc_util 
##                          0                          0 
##   chargeoff_within_12_mths                delinq_amnt 
##                          0                          0 
##         mo_sin_old_il_acct       mo_sin_old_rev_tl_op 
##                          0                          0 
##      mo_sin_rcnt_rev_tl_op             mo_sin_rcnt_tl 
##                          0                          0 
##                   mort_acc       mths_since_recent_bc 
##                          0                          0 
##      mths_since_recent_inq      num_accts_ever_120_pd 
##                          0                          0 
##             num_actv_bc_tl            num_actv_rev_tl 
##                          0                          0 
##                num_bc_sats                  num_bc_tl 
##                          0                          0 
##                  num_il_tl              num_op_rev_tl 
##                          0                          0 
##              num_rev_accts        num_rev_tl_bal_gt_0 
##                          0                          0 
##                   num_sats           num_tl_120dpd_2m 
##                          0                          0 
##               num_tl_30dpd         num_tl_90g_dpd_24m 
##                          0                          0 
##         num_tl_op_past_12m             pct_tl_nvr_dlq 
##                          0                          0 
##           percent_bc_gt_75       pub_rec_bankruptcies 
##                          0                          0 
##                  tax_liens            tot_hi_cred_lim 
##                          0                          0 
##          total_bal_ex_mort             total_bc_limit 
##                          0                          0 
## total_il_high_credit_limit                   fico_ave 
##                          0                          0 
##              fico_ave_last 
##                          0

cor(dataLC_numeric)%>%
  corrplot(order = "hclust", addrect = 4, method="square", tl.cex=1)

Principal Component Analysis

out<-PCA(dataLC_numeric,scale.unit = TRUE, graph=TRUE)

set.seed(1)
autoplot(kmeans(dataLC_numeric, 3), data = dataLC_numeric, frame = TRUE, frame.type = "norm") + theme_minimal()

5. Supervised Learning

As noted earlier, there are several statuses which indicate that loans are not performing well, namely loans that have been charged off and late loans. We group these non-performing loans together and assign them a value of 1. The other performing loans are assigned a value of 1.

# Non-performing Loans
npl <- c("Charged Off", "Late (16-30 days)", "Late (31-120 days)")

dataLC2$npl <- ifelse(dataLC2$loan_status %in% npl, 1, 
                      ifelse(dataLC2$loan_status=="", NA, 0))