Introduction

Lending Club is the world’s largest peer-to-peer lending company, offering a platform for borrowers and lenders to work directly with one another, eliminating the need for a financial intermediary like a bank. Removing the middle-man generally allows both borrowers and lenders to benefit from better interest rates than they otherwise would, which makes peer-to-peer lending an attractive proposition.

In this first post, I’ll cover some of the background on Lending Club, talk about getting and cleaning the loan data, and perform some exploratory analysis on the available variables and outcomes. In subsequent section, I’ll work on developing a predictive model for determining the loan default probabilities. Before investing, it is always important to fully understand the risks, and this post does not constitute investment advice in either Lending Club or in Lending Club notes.

Useful Functions

It was apparent by looking at variable columns there were several variable suitable for combining. I wrote several functions to reduce number of variable into small groups.

Reduce LoanStatus to three groups:

  • Performing
  • NonPerforming

Reduce Occupation to five groups:

  • HiglySkilled
  • Skilled
  • UnSkilled
  • SemiSkilled
  • Student

Reduce LoanStatus to three groups:

  • Employed
  • Unemployed
  • Other

Reduce the loan purpose to five groups:

  • Debt Consolidation Loan
  • Home Improvement Loan
  • Business Loan
  • Studen Loan
  • Personal Loan

Counting Missing Values

There are two types of missing data:

  • MCAR: missing completely at random. This is the desirable scenario in case of missing data.
  • MNAR: missing not at random. Missing not at random data is a more serious issue and in this case it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?
##   na_count_percent                       variable_name
## 1         85.12160       LoanFirstDefaultedCycleNumber
## 2         83.38731         ScorexChangeAtTimeOfListing
## 3         80.61648                   TotalProsperLoans
## 4         80.61648          TotalProsperPaymentsBilled
## 5         80.61648               OnTimeProsperPayments
## 6         80.61648 ProsperPaymentsLessThanOneMonthLate

Imputing Missing Data

Assuming data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If missing data for a certain feature or sample is more than 5% then you probably should leave that feature or sample out. I therefore checked for features (columns) and samples (rows) where more than 5% of the data is missing using a simple above function.

Missing Values Imputation:

R; MICE package

I used “mice” R package for the imputation of missing values. The mice package in R, helps you imputing missing values with plausible data values. These plausible values are drawn from a distribution specifically designed for each missing data point. The mice () function takes care of the imputing process.

Please note I did imputation using amazon server as this was not possible on local machine.

Feature Engineering

For each loan in the file, Lending Club provides an indication of the current loan status. Because many of the loan statuses represent similar outcomes, I’ve mapped them from prosper loan data’s 7 down to only 2, simplifying the problem of classifying loan outcomes without much loss of information useful for investment decisions. My two outcomes “Performing” and “NonPerforming” seek to separate those loans likely to pay in full from those likely to default.

Reduce LoanStatus to three groups:

  • Performing
  • NonPerforming
## 
## NonPerforming    Performing 
##         19082         94855

Reduce LoanStatus to five groups:

  • HiglySkilled
  • Skilled
  • UnSkilled
  • SemiSkilled
  • Student
## 
## HiglySkilled    SemiSkilled       Skilled       Student    UnSkilled  
##         34917         23816         12911         10291         32002

Reduce LoanStatus to three groups:

  • Employed
  • Unemployed
  • Other
## 
##   Employed      Other Unemployed 
##      74544      37763       1630

Reduce the loan purpose to five groups:

Debt Consolidation Loan Home Improvement Loan Business Loan Studen Loan *Personal Loan

## 
##           Business Loan Debt Consolidation Loan   Home Improvement Loan 
##                    7189                   58308                    7433 
##           Personal Loan            Student Loan 
##                   40251                     756

Feature Selection

Often, we have features that are highly correlated and thus provide redundant information. By eliminating highly correlated features we can avoid a predictive bias for the information contained in these features. This also shows us, that when we want to make statements about the importance of specific features, we need to keep in mind that just because they are suitable to predicting an outcome they are not necessarily causal - they could simply be correlated with causal factors.

Correlations between all features are calculated and visualised with the copilot package. Here, we have more variation between the features: some are highly correlated, while others seem to be very distinct. 13 are flagged for removal. I am then removing all features with a correlation higher correlation.

## Compare row 11  and column  1 with corr  0.995 
##   Means:  0.192 vs 0.094 so flagging column 11 
## Compare row 1  and column  10 with corr  0.939 
##   Means:  0.173 vs 0.09 so flagging column 1 
## Compare row 12  and column  13 with corr  0.932 
##   Means:  0.177 vs 0.086 so flagging column 12 
## Compare row 34  and column  33 with corr  0.997 
##   Means:  0.167 vs 0.082 so flagging column 34 
## Compare row 3  and column  4 with corr  0.999 
##   Means:  0.135 vs 0.079 so flagging column 3 
## Compare row 4  and column  35 with corr  0.989 
##   Means:  0.113 vs 0.077 so flagging column 4 
## Compare row 14  and column  15 with corr  0.977 
##   Means:  0.128 vs 0.074 so flagging column 14 
## Compare row 17  and column  16 with corr  0.863 
##   Means:  0.111 vs 0.072 so flagging column 17 
## Compare row 6  and column  36 with corr  0.82 
##   Means:  0.13 vs 0.069 so flagging column 6 
## Compare row 36  and column  40 with corr  0.96 
##   Means:  0.11 vs 0.067 so flagging column 36 
## Compare row 27  and column  30 with corr  0.744 
##   Means:  0.098 vs 0.065 so flagging column 27 
## Compare row 19  and column  20 with corr  0.993 
##   Means:  0.103 vs 0.062 so flagging column 19 
## Compare row 23  and column  24 with corr  0.718 
##   Means:  0.057 vs 0.061 so flagging column 24 
## All correlations <= 0.7
##  [1] "LoanNumber"                 "ListingNumber"             
##  [3] "LoanOriginalAmount"         "CreditScoreRangeUpper"     
##  [5] "BorrowerRate"               "LenderYield"               
##  [7] "LP_CustomerPayments"        "LP_ServiceFees"            
##  [9] "OpenRevolvingAccounts"      "OpenCreditLines"           
## [11] "TotalInquiries"             "LP_GrossPrincipalLoss"     
## [13] "InvestmentFromFriendsCount"

Finally, I had a cleaned set of variables to work with.

Structure Of Data

This dataset has 113937 observations and 81 variables. There are some numeric variables (e.g., RevolvingCredit, BorrowerRate, etc.). Once I had data munging/wrangling done. We have modified and added features to our remaining data set. These features will be key in our analysis. However, I chose following variables by intuition for detailed exploratory analysis.

## 'data.frame':    113937 obs. of  51 variables:
##  $ ListingKey                        : Factor w/ 112726 levels "00003546482094282EF90E5",..: 30894 14003 867 4470 25872 18559 5248 59136 5761 62658 ...
##  $ ListingCreationDate               : Factor w/ 2712 levels "01/01/2007","01/01/2008",..: 2698 2005 805 2661 1398 1717 1044 1100 1524 1972 ...
##  $ Term                              : int  36 60 36 60 12 60 36 36 60 36 ...
##  $ LoanStatus                        : Factor w/ 2 levels "NonPerforming",..: 2 2 2 2 1 2 2 2 2 2 ...
##  $ ListingCategory..numeric.         : int  1 1 3 1 7 7 1 1 1 1 ...
##  $ BorrowerState                     : Factor w/ 51 levels "AK","AL","AR",..: 51 51 51 51 51 51 51 51 51 51 ...
##  $ Occupation                        : Factor w/ 5 levels "HighlySkilled",..: 1 3 1 5 1 1 3 2 1 5 ...
##  $ EmploymentStatus                  : Factor w/ 3 levels "Employed","Other",..: 1 1 2 1 1 1 1 2 1 2 ...
##  $ IsBorrowerHomeowner               : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ CurrentlyInGroup                  : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ DateCreditPulled                  : Factor w/ 2705 levels "01/01/2007","01/01/2008",..: 2691 2001 799 2654 1390 1712 1038 1093 1518 1790 ...
##  $ OpenRevolvingMonthlyPayment       : int  1232 318 143 1051 178 335 683 560 290 371 ...
##  $ IncomeRange                       : Factor w/ 8 levels "$0","$1-24,999",..: 5 5 5 3 2 5 4 6 6 4 ...
##  $ IncomeVerifiable                  : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ StatedMonthlyIncome               : num  5417 5333 5000 8333 175 ...
##  $ LoanKey                           : Factor w/ 112675 levels "00003683605746079487FF7",..: 97280 10138 93949 28137 91082 10430 24392 16768 58367 31791 ...
##  $ LoanCurrentDaysDelinquent         : int  0 0 0 0 616 0 0 0 0 0 ...
##  $ LoanMonthsSinceOrigination        : int  6 8 51 2 31 12 7 47 14 72 ...
##  $ LoanOriginationDate               : Factor w/ 1873 levels "01/02/2007","01/02/2008",..: 598 268 1411 299 1862 1548 840 1122 1660 188 ...
##  $ LoanOriginationQuarter            : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 24 24 29 8 22 7 24 13 7 3 ...
##  $ MemberKey                         : Factor w/ 90569 levels "00003397697413387CAF966",..: 75791 62791 17509 37031 89209 71084 10014 23309 85398 83249 ...
##  $ MonthlyLoanPayment                : num  467.2 397.6 30.9 118.7 239.7 ...
##  $ LP_CustomerPrincipalPayments      : num  1882.8 1734.2 1000 72.8 513 ...
##  $ LP_InterestandFees                : num  453.2 1493.7 60.1 164.5 176 ...
##  $ LP_CollectionFees                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss               : num  0 0 0 0 1987 ...
##  $ LP_NonPrincipalRecoverypayments   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                         : int  280 1 47 1 49 75 1 158 189 58 ...
##  $ DelinquenciesLast7Years           : int  0 0 0 0 0 13 0 0 0 0 ...
##  $ TotalCreditLinespast7years        : int  34 17 10 45 15 36 22 30 29 32 ...
##  $ InquiriesLast6Months              : int  0 0 0 1 3 0 0 0 0 0 ...
##  $ CurrentDelinquencies              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years          : int  0 1 0 0 0 0 0 1 1 1 ...
##  $ CreditScoreRangeLower             : int  760 720 800 640 740 760 720 680 720 620 ...
##  $ BorrowerAPR                       : num  0.0893 0.1936 0.0734 0.2813 0.3584 ...
##  $ DebtToIncomeRatio                 : num  0.38 0.24 0.19 0.24 0.02 0.33 0.41 0.14 0.18 0.43 ...
##  $ EmploymentStatusDuration          : int  156 306 160 186 59 70 275 255 57 129 ...
##  $ AmountDelinquent                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CurrentCreditLines                : int  24 6 6 27 6 16 9 8 7 10 ...
##  $ PublicRecordsLast12Months         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance            : int  8116 396 3574 33311 196711 7997 13271 9489 4027 13557 ...
##  $ BankcardUtilization               : num  0.08 0.39 0.74 0.6 0.87 0.69 0.51 0.97 0.19 0.32 ...
##  $ AvailableBankcardCredit           : int  83884 606 1033 17767 20667 1356 7019 261 17073 18862 ...
##  $ TotalTrades                       : int  16 8 14 16 21 13 23 14 8 18 ...
##  $ TradesNeverDelinquent..percentage.: num  1 0.55 0.8 1 1 0.92 0.64 0.85 1 0.77 ...
##  $ TradesOpenedLast6Months           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanPurpose                       : Factor w/ 21 levels "Auto","Baby & Adoption",..: 14 9 14 13 9 6 6 9 15 15 ...
##  $ LoanPurpose_1                     : chr  "Personal Loan" "Home Improvement Loan" "Personal Loan" "Personal Loan" ...

Choosen Variables

  • IncomeRange
  • IncomeVerifiable
  • AmountDelinquent
  • DelinquenciesLast7Years
  • DebtToIncomeRatio
  • RevolvingCreditBalance
  • Term
  • BorrowerAPR
  • InquiriesLast6Months
  • ListingCategory..numeric.
  • LoanStatus
  • EmploymentStatus
  • Occupation
  • IsBorrowerHomeowner
  • BorrowerState

Feature Viusalisation I

Univariate Plots

Income Range Of Borrowers

The first thing I would like to explore is the income range of the borrowers. We could see that most of the borrowers have income ranging from $25,000 to $75,000.

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

The income of most borrowers ranges from $25,000 - $74,999. To my surprise, there are 806 borrowers with zero income successfully got the loan. Although it’s not clearly stated in the dataset, my guess is here ‘the income’ means the income from employment rather than from all sources. Maybe exploring more information about those borrowers would make it clearer that why they can get the loan.

Income Verifiable

Next I check how many of the incomes are verifiable.

## 
##     No    Yes 
##   8669 105268

We can see that about 92.4% of incomes are verifiable. So we could trust the income range.

Amount Delinquent

Next, I want to investigate the delinquent amount of the borrower.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    983.8      0.0 463900.0

We can see that the delinquent amount is heavily right skewed with 97440 entries of the data equal to 0. So let’s make another plot.

## [1] 96171    53

We can see that except for the borrowers with no delinquent amount, most other borrowers have less than $500 delinquent amount.

Delinquencies Last 7 Years

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.177   3.000  99.000

The number of delinquencies in the past 2 years indicates the number of times a borrower has been behind on payments. We can see there is a long right-tailed distribution.

Debt To Income Ratio

Another interesting variable is DebtToIncomeRatio, which shows the debt to income ratio when the borrower’s profile was pulled.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.140   0.220   0.275   0.310  10.010

I see a huge gap between the 3rd quartile and the maximum and a simple calculation tells me they are 31 times away. After limiting the ratio to less or equal to 1, we include more than 99% of the data but get a much finer graphic. The distribution has a bell shape and is right-skewed. And if we call the function quantile (loan$DebtToIncomeRatio, 0.95, na.rm = T), it tells us 95% of the borrowers keep their DTI ≤ 1. Of course, besides paying the debt people get life to live and it needs money. We could see from the above histogram that most of the borrowers keep the ratio well below 50%.

Revolving Credit Balance

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3105    8499   17560   19440 1436000
##      99% 
## 150862.1

After limiting the ratio to less or equal to 150862, I included more than 99% of the data to get a much finer graphic. The data is right-skewed.

Term

The length of a loan is also an important factor for both the borrower and investor. So let’s take a look at this variable.

## 
##    12    36    60 
##  1614 87778 24545

We could see from this barplot that the majority of the loans has the length of 36 months. About 20% of loans have length of 60 months. Only 1.4% of the loans have length of 12 months.

Borrower APR

Since borrowers’ interest rate is an important factor, let’s take a look at this variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00653 0.15630 0.20980 0.21880 0.28370 0.51230

We could see that the rate approximately follows a normal distribution with mean around 0.21.

Loan Purpose

Next I would like to know the purpose of the loans. Therefore I investigate the listing category variable.

## 
## Debt Consolidation   Home Improvement           Business 
##              58308               7433               7189 
##      Personal Loan        Student Use               Auto 
##               2395                756               2572 
##    Baby & Adoption               Boat Cosmetic Procedure 
##                199                 85                 91 
##    Engagement Ring        Green Loans Household Expenses 
##                217                 59               1996 
##    Large Purchases     Medical/Dental         Motorcycle 
##                876               1522                304 
##                 RV              Taxes           Vacation 
##                 52                885                768 
##      Wedding Loans              Other      Not Available 
##                771              10494              16965

From the above figure, we could tell that most loans on Prosper are for Debt Consolidation. Excluding the Not Available and Other purposes, most of the loans go into Home Improvement, Business, Auto, and Personal Loan. Categories like RV (recreational vehicle), Boat, Vacation only account a small portion of the loan. Therefore, I conclude that most the people borrow money for necessary needs while not for entrainment.

Credit Score Range Upper

Since we also have borrower’s credit score from credit rating angecies, let’s investigate it.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   699.0   704.6   739.0   899.0     591

Surprisingly, the lower (not shown here) and upper range of the borrower’s credit score resemble each other. It appears to me that in order to get a Prosper loan, you’d better have a credit score over 620. We also note that there are some outliers near 0. We can see that the Lower Range of Credit Score in this dataset is from 0 to 880, and the Upper Range of Credit Score in this dataset is from 19 to 899. Now it becomes clear that most of the borrower has credit score around 670 to 720, which falls into the category of good.

Loan Status

Now, let’s look at the loan performance. As mentioned before because many of the loan statuses represent similar outcomes, I’ve mapped them from prosper loan data’s 7 down to only 2, simplifying the problem of classifying loan outcomes without much loss of information useful for investment decisions

## 
##    Performing NonPerforming 
##         94855         19082

As can be inferred from chart above, the stack of counts for “Performing” is loans are much higher than the ones under “NonPerforming”. Thus fortunately for the LendingClub investors, most of them were able to receive their funds with pre-allocated interest rate.

## 
## HighlySkilled   SemiSkilled       Skilled       Student     UnSkilled 
##         34917         22693         15151          9174         32002

We can total loans do vary among different professions. Where people with exective trypes job have more loan than studen or unskilled people.

Employment Status

## 
##   Employed      Other Unemployed 
##      74544      37763       1630

Evidently there are more people with loan with employement than being unemployed.

Employment Status Duration

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      26      67      96     137     755

I was curious about when does a person get a loan in terms of the length of their working status. Using months as the unit of time, there’s a very clear descending trend of the number of people who borrow loans with the increase of the length of employment. My guess is the longer one works, the more one earns and therefore more savings. With more saving in the bank, people are less likely to borrow loans.

Is Borrower Homeowner

##   Yes    No 
## 56459 57478

As can be seen from the graph above, was not much difference for people with or without home. Though, number of people with loan owning a home are slighlty higher.

Borrower State

Now, let’s look at the borrower’s geographical informaiton. I first counted the number of borrowers in each state, then colored the map according to this number.

As we can see from this map, California, Texas, Georgia, Florida, and New York have the most borrowers. These are the most populous and economically prosperous areas as well. The number of borrowers of different states varies a lot and California, Texas, New York and Florida are the top 4 on the list of state with big borrower market.

Univariate Analysis

  • For each loan in the file, prosper loan provides an indication of the current loan status. Because many of the loan statuses represent similar outcomes, I simplified the problem of classifying loan outcomes without much loss of information useful for investment decisions. My two outcomes “Performing” and “NonPerforming” seek to separate those loans likely to pay in full from those likely to default. I was mainly interested in LoanPerformance and Debit, since as a lender, the most important thing one cares about is how much will he earn; as a borrower, one cares about how much interests he will pay for. I want to know which variables or factors have impact on these two variables.

  • Intuitively, I have found variables like Amount Delinquent, Term, RevolvingCreditBalance, DebtToIncomeRatio, EmploymentStatusDuration, IsBorrowerHomeowner, BorrowerState, Occupation, IncomeRange CreditScoreRangeUpper, InquiriesLast6Months, EmploymentStatus, BorrowerAPR more intriguing.

  • Furthermore, to test performance of the loan, I selected and extracted features from the raw data, including loan status, occupation, employment status, loan purpose etc.

  • Reduce LoanStatus to three groups:
    • Performing
    • NonPerforming
  • Reduce LoanStatus to five groups:
    • HiglySkilled
    • Skilled
    • UnSkilled
    • SemiSkilled
    • Student
  • Reduce LoanStatus to three groups:
    • Employed
    • Unemployed
    • Other
  • Reduce the loan purpose to five groups:
    • Debt Consolidation Loan
    • Home Improvement Loan
    • Business Loan
    • Studen Loan
    • Personal Loan
  • The AmountDelinquent and the number of Recommendations are heavily right skewed. Since the majority of these two variables are 0, it would not be appropriate to conduct log transformation. Thus, I didn’t perform any operations. But I tuned some parameters when performing data visualisation. Furthermore, missing values were imputed and for exploratory analysis as well as for different model algorithms to be used predict loan performance.

Feature Viusalisation II

Bivariate Plots

Correlations Between Feeatures

Correlations between all features are calculated and visualised with the corrplot package. Here, we have more variation between the features: some are highly correlated, while others seem to be very distinct. 13 are flagged for removal (see output above).I am then removing all features with a correlation higher correlation.

Loan Status Vs Is Borrower Homeowner

##   IsBorrowerHomeowner Performing NonPerforming
## 1                 Yes      81.45         18.55
## 2                  No      85.02         14.98

As can be seen from the graph above and table, there is no relationship between the Home Ownership and Performing loans. However, of those with having house default the least. In performing loans home ownerships does not make difference that much, but they are still notable. However, the differences for nonforming loans has prominent effects.

LoanStatus Vs Debt To Income Ratio

Debt to income ratio indicates the ratio between a borrower’s monthly debt payment and monthly income. This was originally formatted as a continuous numerical variable, but I bucketed it into 5% increments to better visualise the effect on loan performance.

##   LoanPerformance DebitToIcnomePercentage Counts
## 1      Performing                    0-5%  83.34
## 2   NonPerforming                    0-5%  16.66
## 3      Performing                   5-10%  54.84
## 4   NonPerforming                   5-10%  45.16
## 5      Performing                  10-15%  56.25
## 6   NonPerforming                  10-15%  43.75

As we might expect, there is a steady increase in the percentage of non-performing loans as DTI increases, reflecting the constraints that increased debt put onto borrower ability to repay:

Loan Status Vs Revolving Credit Balance

We can see below that the percentage of non-performing loans steadily increases with revolving credit balance. Borrowers with high utilisation rates are more likely to have high fixed credit card payments which might affect their ability to repay their loans.

##    LoanPerformance RevolvingCreditBalance Counts
## 1       Performing                  0-20%  80.17
## 2    NonPerforming                  0-20%  19.83
## 3       Performing                 20-30%  83.59
## 4    NonPerforming                 20-30%  16.41
## 5       Performing                 30-40%  82.03
## 6    NonPerforming                 30-40%  17.97
## 7       Performing                 40-50%  85.96
## 8    NonPerforming                 40-50%  14.04
## 9       Performing                   80%+  82.05
## 10   NonPerforming                   80%+  17.95
## 11      Performing                 60-70%  86.67
## 12   NonPerforming                 60-70%  13.33
## 13      Performing                 70-80%  84.06
## 14   NonPerforming                 70-80%  15.94
## 15      Performing                   80%+  84.17
## 16   NonPerforming                   80%+  15.83

We can see that the percentage of non-performing loans steadily increases with credit revolving rate. Borrowers with high credit revolving rate are more likely to have high fixed credit card payments which might affect their ability to repay their loans. Also, a high utilization rate often reflects a lack of other financing options, with borrowers turning to peer-to-peer lending as a last resort. This is in contrast to those borrowers with low utilization rates, who may be using peer-to-peer lending opportunistically to pursue lower interest payments.

Loan Status Vs Loan Purpose

Loan purpose refers to the borrower’s stated reason for taking out the loan.

##           LoanPurpose NonPerforming Performing
## 1  Debt Consolidation         10.00      90.00
## 2    Home Improvement         13.72      86.28
## 3            Business         22.10      77.90
## 4       Personal Loan         32.36      67.64
## 5         Student Use         25.79      74.21
## 6                Auto         14.70      85.30
## 7     Baby & Adoption          9.55      90.45
## 8                Boat          5.88      94.12
## 9  Cosmetic Procedure         14.29      85.71
## 10    Engagement Ring          5.99      94.01
## 11        Green Loans         20.34      79.66
## 12 Household Expenses         14.93      85.07
## 13    Large Purchases          9.47      90.53
## 14     Medical/Dental         13.40      86.60
## 15         Motorcycle          5.26      94.74
## 16                 RV          3.85      96.15
## 17              Taxes          9.38      90.62
## 18           Vacation          9.64      90.36
## 19      Wedding Loans          8.82      91.18
## 20              Other         16.23      83.77
## 21      Not Available         39.49      60.51

We see that credit card and debt consolidation tend to have better performance, along with home improvement, cars, and other major purchases. Luxury spending on vacations and weddings and unexpected medical and moving expenses generally have worse performance.

##               LoanPurpose NonPerforming Performing
## 1           Business Loan         22.10      77.90
## 2 Debt Consolidation Loan         10.00      90.00
## 3   Home Improvement Loan         13.72      86.28
## 4           Personal Loan         25.95      74.05
## 5            Student Loan         25.79      74.21

Furthermore, It was quite notable that small business loans perform very poorly, perhaps reflecting the fact that those borrowers unable to get bank financing for their small business may have poor credit or business plans that aren’t fully developed.

Loan Status Vs Inquiries Last 6 Months

Number of inquiries refers to the number of times a borrower’s credit report is accessed by financial institutions, which generally happens when the borrower is seeking a loan or credit line.

##                
##                     0     1     2     3    4+
##   Performing    89.57 85.14 80.55 76.10 61.10
##   NonPerforming 10.43 14.86 19.45 23.90 38.90

More inquiries leads to higher rates of nonperformance, perhaps indicating that increased borrower desperation to access credit might highlight poor financial health. Interestingly, we see an increase in loan performance in the 4+ inquiries bucket. These high levels of inquiries may reflect financially savvy borrowers shopping around for mortgage loans or credit cards.

Loan Status Vs Income Range

As we might expect, the higher a borrower’s annual income the more likely they are to be able to repay their loans. Below I’ve broken the income data into quintiles, and we can see that those in the top 20% of annual incomes ($95000 +) are approximately 6% more likely to be performing borrowers than those in the bottom 20% (less than $42000).

##                
##                    $0 $1-24,999 $25,000-49,999 $50,000-74,999
##   Performing    61.35     75.03          80.92          86.83
##   NonPerforming 38.65     24.97          19.08          13.17
##                
##                 $75,000-99,999 $100,000+
##   Performing             89.14     90.80
##   NonPerforming          10.86      9.20

As the income of the borrower decrease, we see increasing rates of nonperforming loans. Noting that the above trends, this may indicate particularly people with higher incomes are who are maximising their possible borrowing mainly due to their ability to payback loans on time.

Loan Status Vs Income Verifiable

Prosper loan data categorises income verification into two statuses: not verified, and verified. Verified income means that Lending Club independently verified both the source and size of reported income, and not verified means there was no independent verification of the reported values.

##       LoanStatus IncomeVerifiable Counts order
## 1:    Performing               No  80.06     1
## 2: NonPerforming               No  19.94     2
## 3:    Performing              Yes  83.52     2
## 4: NonPerforming              Yes  16.48     1

##   IncomeVerifiable Performing NonPerforming
## 1               No      80.06         19.94
## 2              Yes      83.52         16.48

Interestingly, we see that as income verification increases, the loan performance actually worsens. During the mortgage crisis, non-verified “no-doc” loans were among the worst performing, so the reversal here is interesting. This likely reflects the fact that Lending Club only verifies those borrowers who seem to be of worse credit quality, so there may be confounding variables present here.

Loan Staus Vs Public Records Last 10Years

Public records generally refer to bankruptcies, so we would expect those with more public records to show worse performance.

##    PublicRecordsLast10Years Performing NonPerforming
## 1                         0      84.36         15.64
## 2                         1      81.85         18.15
## 3                         2      72.82         27.18
## 4                         3      64.89         35.11
## 5                         4      64.66         35.34
## 6                         5      63.40         36.60
## 7                         6      49.32         50.68
## 8                         7      67.39         32.61
## 9                         8      70.97         29.03
## 10                        9      26.67         73.33
## 11                      10+      57.50         42.50

Actually, performance increases as we move from 0 to 1 to 2 public records. This possibly indicates stricter lending standards from Lending Club on those borrowers with public records: We could see that among different income ranges, there are no big differences. All of income levels have small delinquent amount.

Bivariate Analysis

A large number of variables also provided strong indications of expected performance. Among the most telling are debt-to-income ratio, RevolvingCreditBalance, IncomeVerifiable, loan purpose, BorrowerState, inquiries in the past 6 months, and number of total accounts. Verified income status and number of public records show results opposite from what we would expect. This is likely due to increased standards on borrowers with poorer credit history, so all else equal we see outperformance in these loans.

In performing loans home ownerships does not make difference that much, but they are still notable. However, the differences for nonperforming loans has prominent effects. As we might expect, there was a steady increase in the percentage of non-performing loans as DTI increases, reflecting the constraints that increased debt put onto borrower ability to repay. It was also evident that borrowers with high credit revolving rate were more likely to have high fixed credit revolving payments which might affect their ability to repay their loans. We also saw that credit card and debt consolidation tend to have better performance, along with home improvement, cars, and other major purchases. Luxury spending on vacations and weddings and unexpected medical and moving expenses generally have worse performance.

Feature Viusalisation III

Mutlivariate Plots

The dataset covered an extensive amount of information on the borrower’s side that was originally available to lenders when they made investment choices. By further segmenting the loan dataset into finished cases and current outstanding loans, this project broke down the composition of the default cases and examines the correlation among indicators. In the end, the goal is to provide investors and borrowers, as well as LendingClub, additional insights regarding investment opportunities and contingent loan collection advice.

Debt To IncomeRatio, Loan Purpose, LoanStatus

The chart shows debit to income ratio with default and charged off status as a function of loan purposes and loan status. What stands out is the higher than normal defaults for loans for student loans purpose. Loans for business, home improvement, personal purposes have proven to have higher defaults also. Notice how the red and blue points seem to be random, and not really related to either the x and y variables. There may be some weak relationships between these variables.

## 
##  Welch Two Sample t-test
## 
## data:  df[df$LoanStatus == "NonPerforming", "DebtToIncomeRatio"] and df[df$LoanStatus == "Performing", "DebtToIncomeRatio"]
## t = 11.042, df = 21270, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05833272 0.08351287
## sample estimates:
## mean of x mean of y 
## 0.3340694 0.2631466

The means are substantially different for DebtToIncomeRatio between Performing and NonPerforming loans. Therfore, the higher the creditDebtToIncomeRatio the more likely the loan is to perform.

Is Borrower Homeowner, Debt To Income Ratio, LoanStatus

The chart shows debit to income ratio with default and charged off status as a function of home ownership and loan status. The chart seems to be validating impression by lenders that there is slight but noticeable difference for debit to income ratio for people with and without home. It was quite evident by looking at the plot points denser between 0 and 0.50 dti. As for loan purpose, I can’t really draw a conclusion here either. Notice how the red and blue points seem to be random, and not really related to either the x and y variables. There may be some weak relationships between these variables, but nothing really jumps out at me when I look at the graph.

## 
##  Welch Two Sample t-test
## 
## data:  df[df$LoanStatus == "NonPerforming", "IsBorrowerHomeowner"] and df[df$LoanStatus == "Performing", "IsBorrowerHomeowner"]
## t = 16.192, df = 27383, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05623159 0.07171995
## sample estimates:
## mean of x mean of y 
## 0.5487894 0.4848137

Non-performing loans have a significantly lower average Prosper Score than performing loans.

Term, Debt To Income Ratio, Loan Status

The chart shows debit to income ratio with default and charged off status as a function of loan term and loan status. The loan status of defaults and charged off for loans for 12 months. The chart seems to be validating impression by lenders that debt 36 months is the best followed by 60 months’ term. The interaction between different variables is not clearly visible.

Income Verifiable, Debt To Income Ratio, Loan Status

It appears that slight difference between verified and unverified income. However, there appears to be very little/no correlation between IncomeVerifiable and DebtToIncomeRatio.

## 
##  Welch Two Sample t-test
## 
## data:  df[df$LoanStatus == "NonPerforming", "IncomeVerifiable"] and df[df$LoanStatus == "Performing", "IncomeVerifiable"]
## t = 7.7756, df = 25779, p-value = 7.786e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01304727 0.02184203
## sample estimates:
##  mean of x  mean of y 
## 0.09060895 0.07316430

Now let’s look at IncomeVerifiable vs. Loan Performance. Do income that worse up front actually perform worse in reality? How predictive are certain “knowns” at the start of the process?. The means are substantially different for IncomeVerifiable between Performing and Non-Performing loans. Therfore, the higher the credit score, the more likely the loan is to perform.

Debt To Income Ratio, Borrower APR, Loan Status

We can see by looking at the graph that BorrowerAPR are denser near that blue line (the mean). We can also see looking at the sheer amount of points that there are more loans with lower delinquencies than there are with higher delinquencies. It is evident from graph, BorrowerAPR is more spread out at the lower debit to income ratios. There is hardly or very weak correlation between these two variables.

Delinquencies Last 7Years, Revolving Credit Balance, Loan Status

We can see by looking at the graph that revolvingcreditbalance are denser near that blue line (the mean). We can also see looking at the sheer amount of points that there are more loans with lower delinquencies than there are with higher delinquencies. It’s hard to really draw a conclusion from the data, however. It may seem, for instance, that RevolvingCreditBalance is more spread out at lower delinquencies. But it’s hard to tell, because there are just more points that have lower delinquencies, so there will be more “extreme” points at that end even if the proportion of extreme points is the same. As for loan status, I can’t really draw a conclusion there either. Notice how the red and blue points seem to be random, and not really related to either the x and y variables. There may be some weak relationships between these variables, but nothing really jumps out at me when I look at the graph. Sometimes, if you can’t see a relationship by looking at the graph, it just may not exist.

Building a Model

Now it’s time to build a model to classify people into two groups: Performing or NonPerforming loans.

Logistic Regression

Logistic Regression is a type of classification model. In classification models, we attempt to predict the outcome of categorical dependent variables, using one or more independent variables. The independent variables can be either categorical or numerical. Logistic regression is based on the logistic function, which always takes values between 0 and 1. Replacing the dependent variable of the logistic function with a linear combination of dependent variables we intend to use for regression, we arrive at the formula for logistic regression.

Building a Logistic Regression Model

Now it is time to build our model! Let’s begin by doing a final “sub-setting” of our data by removing the features we won’t be using and making sure that the features are of the correct data type.

## [1] 113937     17
## 'data.frame':    113937 obs. of  17 variables:
##  $ IncomeRange             : Factor w/ 8 levels "$0","$1-24,999",..: 5 5 5 3 2 5 4 6 6 4 ...
##  $ IncomeVerifiable        : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 1 2 2 2 2 2 ...
##  $ AmountDelinquent        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DelinquenciesLast7Years : int  0 0 0 0 0 13 0 0 0 0 ...
##  $ DebtToIncomeRatio       : num  0.38 0.24 0.19 0.24 0.02 0.33 0.41 0.14 0.18 0.43 ...
##  $ RevolvingCreditBalance  : int  8116 396 3574 33311 196711 7997 13271 9489 4027 13557 ...
##  $ Term                    : Factor w/ 3 levels "12","36","60": 2 3 2 3 1 3 2 2 3 2 ...
##  $ BorrowerAPR             : num  0.0893 0.1936 0.0734 0.2813 0.3584 ...
##  $ LoanPurpose             : Factor w/ 5 levels "Business Loan",..: 4 3 4 4 3 2 2 3 4 4 ...
##  $ CreditScoreRangeUpper   : int  779 739 819 659 759 779 739 699 739 639 ...
##  $ InquiriesLast6Months    : int  0 0 0 1 3 0 0 0 0 0 ...
##  $ LoanStatus              : Factor w/ 2 levels "NonPerforming",..: 2 2 2 2 1 2 2 2 2 2 ...
##  $ Occupation              : Factor w/ 5 levels "HighlySkilled ",..: 1 3 1 5 1 1 3 2 1 5 ...
##  $ EmploymentStatus        : Factor w/ 3 levels "Employed","Other",..: 1 1 2 1 1 1 1 2 1 2 ...
##  $ EmploymentStatusDuration: int  156 306 160 186 59 70 275 255 57 129 ...
##  $ IsBorrowerHomeowner     : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 1 2 2 2 2 2 ...
##  $ BorrowerState           : Factor w/ 51 levels "AK","AL","AR",..: 51 51 51 51 51 51 51 51 51 51 ...

Train Test Split

Split the data into a train and test set using the caTools library as done in previous lectures. Reference previous solutions notebooks if you need a refresher.

Use all the features to train a glm() model on the training data set, pass the argument family=binomial(logit) into the glm function.

## 
## Call:
## glm(formula = LoanStatus ~ ., family = binomial(logit), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0021   0.2673   0.3991   0.5796   3.6106  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                         1.756e+00  3.586e-01   4.897 9.74e-07
## IncomeRange$1-24,999                5.914e-01  1.196e-01   4.947 7.55e-07
## IncomeRange$100,000+                1.090e+00  1.197e-01   9.112  < 2e-16
## IncomeRange$25,000-49,999           7.289e-01  1.159e-01   6.290 3.18e-10
## IncomeRange$50,000-74,999           9.735e-01  1.167e-01   8.344  < 2e-16
## IncomeRange$75,000-99,999           1.024e+00  1.192e-01   8.592  < 2e-16
## IncomeRangeNot displayed            3.546e-01  1.197e-01   2.963 0.003046
## IncomeRangeNot employed             1.224e+00  1.762e-01   6.946 3.76e-12
## IncomeVerifiableTRUE                3.119e-01  4.202e-02   7.424 1.14e-13
## AmountDelinquent                   -2.094e-06  1.303e-06  -1.608 0.107884
## DelinquenciesLast7Years            -1.302e-04  9.573e-04  -0.136 0.891842
## DebtToIncomeRatio                  -8.011e-02  1.517e-02  -5.282 1.28e-07
## RevolvingCreditBalance             -5.294e-07  3.112e-07  -1.701 0.088937
## Term36                             -7.780e-01  1.316e-01  -5.914 3.34e-09
## Term60                             -5.115e-01  1.341e-01  -3.813 0.000137
## BorrowerAPR                        -5.885e+00  1.501e-01 -39.201  < 2e-16
## LoanPurposeDebt Consolidation Loan -1.464e-02  4.427e-02  -0.331 0.740899
## LoanPurposeHome Improvement Loan   -6.935e-02  5.801e-02  -1.196 0.231836
## LoanPurposePersonal Loan            4.290e-03  4.535e-02   0.095 0.924631
## LoanPurposeStudent Loan            -5.528e-02  1.349e-01  -0.410 0.681892
## CreditScoreRangeUpper               3.379e-03  1.940e-04  17.413  < 2e-16
## InquiriesLast6Months               -1.314e-01  4.228e-03 -31.087  < 2e-16
## OccupationSemiSkilled              -2.943e-01  3.187e-02  -9.235  < 2e-16
## OccupationSkilled                  -1.757e-01  3.669e-02  -4.787 1.69e-06
## OccupationStudent                   1.843e-01  4.391e-02   4.197 2.70e-05
## OccupationUnSkilled                -1.805e-01  3.013e-02  -5.989 2.11e-09
## EmploymentStatusOther              -9.913e-01  2.572e-02 -38.539  < 2e-16
## EmploymentStatusUnemployed         -9.948e-01  1.003e-01  -9.915  < 2e-16
## EmploymentStatusDuration            9.115e-04  1.221e-04   7.467 8.21e-14
## IsBorrowerHomeownerTRUE            -1.844e-01  2.338e-02  -7.884 3.16e-15
## BorrowerStateAL                    -8.993e-01  2.802e-01  -3.210 0.001329
## BorrowerStateAR                    -3.274e-01  2.980e-01  -1.099 0.271854
## BorrowerStateAZ                    -6.432e-01  2.807e-01  -2.291 0.021957
## BorrowerStateCA                    -5.462e-01  2.717e-01  -2.010 0.044406
## BorrowerStateCO                    -3.773e-01  2.821e-01  -1.338 0.181057
## BorrowerStateCT                    -5.838e-01  2.863e-01  -2.039 0.041448
## BorrowerStateDC                    -4.108e-01  3.434e-01  -1.196 0.231548
## BorrowerStateDE                    -7.420e-01  3.354e-01  -2.212 0.026940
## BorrowerStateFL                    -7.765e-01  2.731e-01  -2.843 0.004465
## BorrowerStateGA                    -9.129e-01  2.737e-01  -3.336 0.000850
## BorrowerStateHI                    -4.486e-01  3.301e-01  -1.359 0.174188
## BorrowerStateIA                    -9.538e-01  3.268e-01  -2.919 0.003513
## BorrowerStateID                    -6.103e-01  2.997e-01  -2.036 0.041704
## BorrowerStateIL                    -7.829e-01  2.735e-01  -2.862 0.004204
## BorrowerStateIN                    -6.179e-01  2.808e-01  -2.200 0.027791
## BorrowerStateKS                    -7.381e-01  2.898e-01  -2.547 0.010878
## BorrowerStateKY                    -5.927e-01  2.946e-01  -2.012 0.044229
## BorrowerStateLA                    -7.305e-01  2.915e-01  -2.506 0.012196
## BorrowerStateMA                    -5.107e-01  2.823e-01  -1.809 0.070481
## BorrowerStateMD                    -9.333e-01  2.774e-01  -3.364 0.000768
## BorrowerStateME                    -1.016e+00  3.560e-01  -2.854 0.004314
## BorrowerStateMI                    -6.506e-01  2.757e-01  -2.360 0.018267
## BorrowerStateMN                    -6.259e-01  2.798e-01  -2.237 0.025257
## BorrowerStateMO                    -8.788e-01  2.769e-01  -3.174 0.001505
## BorrowerStateMS                    -6.462e-01  2.962e-01  -2.182 0.029122
## BorrowerStateMT                    -6.203e-01  3.196e-01  -1.941 0.052263
## BorrowerStateNC                    -7.413e-01  2.766e-01  -2.680 0.007368
## BorrowerStateND                    -5.965e-02  4.390e-01  -0.136 0.891910
## BorrowerStateNE                    -2.207e-02  3.163e-01  -0.070 0.944363
## BorrowerStateNH                    -7.585e-01  3.074e-01  -2.467 0.013614
## BorrowerStateNJ                    -5.418e-01  2.784e-01  -1.946 0.051668
## BorrowerStateNM                    -4.203e-01  3.100e-01  -1.356 0.175106
## BorrowerStateNV                    -5.157e-01  2.915e-01  -1.769 0.076823
## BorrowerStateNY                    -4.250e-01  2.741e-01  -1.551 0.120960
## BorrowerStateOH                    -7.277e-01  2.752e-01  -2.645 0.008179
## BorrowerStateOK                    -5.306e-01  2.918e-01  -1.819 0.068972
## BorrowerStateOR                    -6.281e-01  2.810e-01  -2.236 0.025380
## BorrowerStatePA                    -4.272e-01  2.795e-01  -1.529 0.126358
## BorrowerStateRI                    -8.382e-01  3.199e-01  -2.620 0.008790
## BorrowerStateSC                    -5.993e-01  2.907e-01  -2.062 0.039211
## BorrowerStateSD                    -8.314e-01  3.564e-01  -2.333 0.019671
## BorrowerStateTN                    -7.230e-01  2.831e-01  -2.554 0.010658
## BorrowerStateTX                    -5.692e-01  2.735e-01  -2.081 0.037445
## BorrowerStateUT                    -4.439e-01  2.916e-01  -1.522 0.127888
## BorrowerStateVA                    -6.723e-01  2.773e-01  -2.425 0.015329
## BorrowerStateVT                    -5.751e-01  3.578e-01  -1.607 0.107971
## BorrowerStateWA                    -7.633e-01  2.764e-01  -2.762 0.005751
## BorrowerStateWI                    -6.438e-01  2.831e-01  -2.274 0.022954
## BorrowerStateWV                    -5.897e-01  3.196e-01  -1.845 0.064968
## BorrowerStateWY                    -2.126e-01  3.979e-01  -0.534 0.593202
##                                       
## (Intercept)                        ***
## IncomeRange$1-24,999               ***
## IncomeRange$100,000+               ***
## IncomeRange$25,000-49,999          ***
## IncomeRange$50,000-74,999          ***
## IncomeRange$75,000-99,999          ***
## IncomeRangeNot displayed           ** 
## IncomeRangeNot employed            ***
## IncomeVerifiableTRUE               ***
## AmountDelinquent                      
## DelinquenciesLast7Years               
## DebtToIncomeRatio                  ***
## RevolvingCreditBalance             .  
## Term36                             ***
## Term60                             ***
## BorrowerAPR                        ***
## LoanPurposeDebt Consolidation Loan    
## LoanPurposeHome Improvement Loan      
## LoanPurposePersonal Loan              
## LoanPurposeStudent Loan               
## CreditScoreRangeUpper              ***
## InquiriesLast6Months               ***
## OccupationSemiSkilled              ***
## OccupationSkilled                  ***
## OccupationStudent                  ***
## OccupationUnSkilled                ***
## EmploymentStatusOther              ***
## EmploymentStatusUnemployed         ***
## EmploymentStatusDuration           ***
## IsBorrowerHomeownerTRUE            ***
## BorrowerStateAL                    ** 
## BorrowerStateAR                       
## BorrowerStateAZ                    *  
## BorrowerStateCA                    *  
## BorrowerStateCO                       
## BorrowerStateCT                    *  
## BorrowerStateDC                       
## BorrowerStateDE                    *  
## BorrowerStateFL                    ** 
## BorrowerStateGA                    ***
## BorrowerStateHI                       
## BorrowerStateIA                    ** 
## BorrowerStateID                    *  
## BorrowerStateIL                    ** 
## BorrowerStateIN                    *  
## BorrowerStateKS                    *  
## BorrowerStateKY                    *  
## BorrowerStateLA                    *  
## BorrowerStateMA                    .  
## BorrowerStateMD                    ***
## BorrowerStateME                    ** 
## BorrowerStateMI                    *  
## BorrowerStateMN                    *  
## BorrowerStateMO                    ** 
## BorrowerStateMS                    *  
## BorrowerStateMT                    .  
## BorrowerStateNC                    ** 
## BorrowerStateND                       
## BorrowerStateNE                       
## BorrowerStateNH                    *  
## BorrowerStateNJ                    .  
## BorrowerStateNM                       
## BorrowerStateNV                    .  
## BorrowerStateNY                       
## BorrowerStateOH                    ** 
## BorrowerStateOK                    .  
## BorrowerStateOR                    *  
## BorrowerStatePA                       
## BorrowerStateRI                    ** 
## BorrowerStateSC                    *  
## BorrowerStateSD                    *  
## BorrowerStateTN                    *  
## BorrowerStateTX                    *  
## BorrowerStateUT                       
## BorrowerStateVA                    *  
## BorrowerStateVT                       
## BorrowerStateWA                    ** 
## BorrowerStateWI                    *  
## BorrowerStateWV                    .  
## BorrowerStateWY                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 72076  on 79754  degrees of freedom
## Residual deviance: 59853  on 79675  degrees of freedom
## AIC: 60013
## 
## Number of Fisher Scoring iterations: 5

Then, I used step () function to create a new model. This was basically to have an idea which of the features are most inmportant to predict loan performance.

## Start:  AIC=60013.18
## LoanStatus ~ IncomeRange + IncomeVerifiable + AmountDelinquent + 
##     DelinquenciesLast7Years + DebtToIncomeRatio + RevolvingCreditBalance + 
##     Term + BorrowerAPR + LoanPurpose + CreditScoreRangeUpper + 
##     InquiriesLast6Months + Occupation + EmploymentStatus + EmploymentStatusDuration + 
##     IsBorrowerHomeowner + BorrowerState
## 
##                            Df Deviance   AIC
## - LoanPurpose               4    59856 60008
## - DelinquenciesLast7Years   1    59853 60011
## <none>                           59853 60013
## - AmountDelinquent          1    59856 60014
## - RevolvingCreditBalance    1    59856 60014
## - DebtToIncomeRatio         1    59880 60038
## - IncomeVerifiable          1    59906 60064
## - EmploymentStatusDuration  1    59910 60068
## - IsBorrowerHomeowner       1    59915 60073
## - Term                      2    59957 60113
## - BorrowerState            50    60085 60145
## - Occupation                4    60021 60173
## - CreditScoreRangeUpper     1    60176 60334
## - IncomeRange               7    60230 60376
## - InquiriesLast6Months      1    60923 61081
## - EmploymentStatus          2    61356 61512
## - BorrowerAPR               1    61406 61564
## 
## Step:  AIC=60008.27
## LoanStatus ~ IncomeRange + IncomeVerifiable + AmountDelinquent + 
##     DelinquenciesLast7Years + DebtToIncomeRatio + RevolvingCreditBalance + 
##     Term + BorrowerAPR + CreditScoreRangeUpper + InquiriesLast6Months + 
##     Occupation + EmploymentStatus + EmploymentStatusDuration + 
##     IsBorrowerHomeowner + BorrowerState
## 
##                            Df Deviance   AIC
## - DelinquenciesLast7Years   1    59856 60006
## <none>                           59856 60008
## - AmountDelinquent          1    59859 60009
## - RevolvingCreditBalance    1    59859 60009
## - DebtToIncomeRatio         1    59883 60033
## - IncomeVerifiable          1    59909 60059
## - EmploymentStatusDuration  1    59914 60064
## - IsBorrowerHomeowner       1    59918 60068
## - Term                      2    59960 60108
## - BorrowerState            50    60088 60140
## - Occupation                4    60024 60168
## - CreditScoreRangeUpper     1    60178 60328
## - IncomeRange               7    60233 60371
## - InquiriesLast6Months      1    60927 61077
## - EmploymentStatus          2    61359 61507
## - BorrowerAPR               1    61408 61558
## 
## Step:  AIC=60006.28
## LoanStatus ~ IncomeRange + IncomeVerifiable + AmountDelinquent + 
##     DebtToIncomeRatio + RevolvingCreditBalance + Term + BorrowerAPR + 
##     CreditScoreRangeUpper + InquiriesLast6Months + Occupation + 
##     EmploymentStatus + EmploymentStatusDuration + IsBorrowerHomeowner + 
##     BorrowerState
## 
##                            Df Deviance   AIC
## <none>                           59856 60006
## - AmountDelinquent          1    59859 60007
## - RevolvingCreditBalance    1    59859 60007
## - DebtToIncomeRatio         1    59883 60031
## - IncomeVerifiable          1    59909 60057
## - EmploymentStatusDuration  1    59914 60062
## - IsBorrowerHomeowner       1    59918 60066
## - Term                      2    59960 60106
## - BorrowerState            50    60088 60138
## - Occupation                4    60024 60166
## - CreditScoreRangeUpper     1    60189 60337
## - IncomeRange               7    60233 60369
## - InquiriesLast6Months      1    60927 61075
## - EmploymentStatus          2    61359 61505
## - BorrowerAPR               1    61414 61562
## 
## Call:
## glm(formula = LoanStatus ~ IncomeRange + IncomeVerifiable + AmountDelinquent + 
##     DebtToIncomeRatio + RevolvingCreditBalance + Term + BorrowerAPR + 
##     CreditScoreRangeUpper + InquiriesLast6Months + Occupation + 
##     EmploymentStatus + EmploymentStatusDuration + IsBorrowerHomeowner + 
##     BorrowerState, family = binomial(logit), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9967   0.2674   0.3989   0.5796   3.6156  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.745e+00  3.558e-01   4.906 9.28e-07 ***
## IncomeRange$1-24,999        5.907e-01  1.195e-01   4.941 7.77e-07 ***
## IncomeRange$100,000+        1.090e+00  1.196e-01   9.109  < 2e-16 ***
## IncomeRange$25,000-49,999   7.284e-01  1.159e-01   6.286 3.26e-10 ***
## IncomeRange$50,000-74,999   9.728e-01  1.167e-01   8.339  < 2e-16 ***
## IncomeRange$75,000-99,999   1.023e+00  1.192e-01   8.587  < 2e-16 ***
## IncomeRangeNot displayed    3.540e-01  1.197e-01   2.958 0.003094 ** 
## IncomeRangeNot employed     1.225e+00  1.762e-01   6.954 3.56e-12 ***
## IncomeVerifiableTRUE        3.122e-01  4.201e-02   7.432 1.07e-13 ***
## AmountDelinquent           -2.136e-06  1.275e-06  -1.676 0.093746 .  
## DebtToIncomeRatio          -7.991e-02  1.516e-02  -5.271 1.36e-07 ***
## RevolvingCreditBalance     -5.324e-07  3.112e-07  -1.711 0.087071 .  
## Term36                     -7.774e-01  1.315e-01  -5.910 3.43e-09 ***
## Term60                     -5.106e-01  1.341e-01  -3.807 0.000141 ***
## BorrowerAPR                -5.884e+00  1.498e-01 -39.268  < 2e-16 ***
## CreditScoreRangeUpper       3.382e-03  1.913e-04  17.676  < 2e-16 ***
## InquiriesLast6Months       -1.315e-01  4.228e-03 -31.096  < 2e-16 ***
## OccupationSemiSkilled      -2.943e-01  3.186e-02  -9.236  < 2e-16 ***
## OccupationSkilled          -1.756e-01  3.669e-02  -4.787 1.69e-06 ***
## OccupationStudent           1.846e-01  4.391e-02   4.203 2.64e-05 ***
## OccupationUnSkilled        -1.801e-01  3.013e-02  -5.977 2.27e-09 ***
## EmploymentStatusOther      -9.912e-01  2.572e-02 -38.535  < 2e-16 ***
## EmploymentStatusUnemployed -9.963e-01  1.003e-01  -9.931  < 2e-16 ***
## EmploymentStatusDuration    9.114e-04  1.220e-04   7.469 8.09e-14 ***
## IsBorrowerHomeownerTRUE    -1.842e-01  2.338e-02  -7.878 3.32e-15 ***
## BorrowerStateAL            -9.021e-01  2.803e-01  -3.218 0.001289 ** 
## BorrowerStateAR            -3.317e-01  2.980e-01  -1.113 0.265781    
## BorrowerStateAZ            -6.467e-01  2.808e-01  -2.303 0.021293 *  
## BorrowerStateCA            -5.497e-01  2.718e-01  -2.022 0.043127 *  
## BorrowerStateCO            -3.808e-01  2.822e-01  -1.349 0.177268    
## BorrowerStateCT            -5.885e-01  2.864e-01  -2.055 0.039897 *  
## BorrowerStateDC            -4.142e-01  3.435e-01  -1.206 0.227872    
## BorrowerStateDE            -7.457e-01  3.355e-01  -2.223 0.026224 *  
## BorrowerStateFL            -7.800e-01  2.732e-01  -2.855 0.004303 ** 
## BorrowerStateGA            -9.165e-01  2.738e-01  -3.348 0.000814 ***
## BorrowerStateHI            -4.503e-01  3.303e-01  -1.364 0.172720    
## BorrowerStateIA            -9.577e-01  3.268e-01  -2.930 0.003388 ** 
## BorrowerStateID            -6.133e-01  2.998e-01  -2.046 0.040788 *  
## BorrowerStateIL            -7.871e-01  2.736e-01  -2.877 0.004017 ** 
## BorrowerStateIN            -6.217e-01  2.809e-01  -2.213 0.026900 *  
## BorrowerStateKS            -7.412e-01  2.899e-01  -2.557 0.010566 *  
## BorrowerStateKY            -5.964e-01  2.947e-01  -2.024 0.042981 *  
## BorrowerStateLA            -7.340e-01  2.916e-01  -2.518 0.011815 *  
## BorrowerStateMA            -5.145e-01  2.824e-01  -1.822 0.068472 .  
## BorrowerStateMD            -9.372e-01  2.775e-01  -3.377 0.000732 ***
## BorrowerStateME            -1.020e+00  3.561e-01  -2.863 0.004193 ** 
## BorrowerStateMI            -6.542e-01  2.757e-01  -2.373 0.017660 *  
## BorrowerStateMN            -6.298e-01  2.798e-01  -2.251 0.024407 *  
## BorrowerStateMO            -8.821e-01  2.770e-01  -3.185 0.001449 ** 
## BorrowerStateMS            -6.493e-01  2.963e-01  -2.192 0.028410 *  
## BorrowerStateMT            -6.235e-01  3.196e-01  -1.951 0.051108 .  
## BorrowerStateNC            -7.443e-01  2.767e-01  -2.690 0.007150 ** 
## BorrowerStateND            -7.171e-02  4.387e-01  -0.163 0.870166    
## BorrowerStateNE            -2.663e-02  3.163e-01  -0.084 0.932903    
## BorrowerStateNH            -7.622e-01  3.075e-01  -2.479 0.013187 *  
## BorrowerStateNJ            -5.454e-01  2.785e-01  -1.958 0.050184 .  
## BorrowerStateNM            -4.227e-01  3.101e-01  -1.363 0.172758    
## BorrowerStateNV            -5.196e-01  2.916e-01  -1.782 0.074750 .  
## BorrowerStateNY            -4.284e-01  2.742e-01  -1.562 0.118209    
## BorrowerStateOH            -7.319e-01  2.753e-01  -2.659 0.007840 ** 
## BorrowerStateOK            -5.340e-01  2.919e-01  -1.829 0.067326 .  
## BorrowerStateOR            -6.317e-01  2.811e-01  -2.248 0.024607 *  
## BorrowerStatePA            -4.312e-01  2.795e-01  -1.543 0.122925    
## BorrowerStateRI            -8.428e-01  3.200e-01  -2.634 0.008435 ** 
## BorrowerStateSC            -6.043e-01  2.907e-01  -2.078 0.037672 *  
## BorrowerStateSD            -8.341e-01  3.565e-01  -2.340 0.019298 *  
## BorrowerStateTN            -7.270e-01  2.832e-01  -2.567 0.010261 *  
## BorrowerStateTX            -5.728e-01  2.736e-01  -2.093 0.036338 *  
## BorrowerStateUT            -4.471e-01  2.917e-01  -1.533 0.125296    
## BorrowerStateVA            -6.760e-01  2.774e-01  -2.437 0.014810 *  
## BorrowerStateVT            -5.772e-01  3.578e-01  -1.613 0.106722    
## BorrowerStateWA            -7.671e-01  2.765e-01  -2.774 0.005530 ** 
## BorrowerStateWI            -6.482e-01  2.832e-01  -2.289 0.022072 *  
## BorrowerStateWV            -5.923e-01  3.196e-01  -1.853 0.063888 .  
## BorrowerStateWY            -2.170e-01  3.979e-01  -0.545 0.585512    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 72076  on 79754  degrees of freedom
## Residual deviance: 59856  on 79680  degrees of freedom
## AIC: 60006
## 
## Number of Fisher Scoring iterations: 5

One should have noticed that the step () function kept all the features used previously! While I used the AIC criteria to compare models, there are other criteria we could have used.

Confusion Matrix : Let’s continue on and see how well our model performed against the test set. Create a confusion matrix using the predict function with type=‘response’ as an argument inside of that function.

##                
##                 FALSE  TRUE
##   NonPerforming  1083  4642
##   Performing      751 27706

Model Accuracy: Our model apparently did good job in predicting loan performace with an accuracy of 85%.

## Model Accuracy is :: 84 %

Multivariate Analysis

With multivariate analysis, we tried to show default and charged off status as a function of different variables and loan status. What stood out was the higher than normal defaults for loans were for student loans purpose. The data seemed to be validating impression by lenders that there was slight but noticeable difference for debit to income ratio for people with and without home. It was also observed that that loan performance for 36 months is the best followed by 60 months’ term. The interaction between different variables is not clearly visible. We did not notice any, presence of if any/little correlation between IncomeVerifiable and DebtToIncomeRatio.

However, multivariate Analysis did not show much of interaction between different variables as one can see in above plots. In most of cases the data was more spread out. But it was hard to tell, because there are just more points that have lower values, or so there will be more “extreme” points at that end even if the proportion of extreme points is the same. As for loan status, I can’t really draw a conclusion there either. Notice in above plots how the red and blue points seem to be random, and not really related to either the x and y variables. There may be some weak relationships between these variables, but nothing really jumps out at me when I look at the graph.

Nonetheless, I also developed a predictive model for projecting the probability of default for newly issued loans using logistic regression. I’ve gotten a good understanding of the available borrower data, and we’ve seen which variables give the best indications of future loan performance. I have found following features interesting bases on AIC criteria to compare models, there are other criteria we could have used. I also tried to use RandomForest to select for important features. However, because of size of the data. I could not accomplish that.

  • AmountDelinquent
  • RevolvingCreditBalance
  • DebtToIncomeRatio
  • IncomeVerifiable
  • EmploymentStatusDuration
  • IsBorrowerHomeowner
  • Term
  • BorrowerState
  • Occupation
  • CreditScoreRangeUpper
  • IncomeRange
  • InquiriesLast6Months
  • EmploymentStatus
  • BorrowerAPR

Final Plots and Summary

Plot One

We can see from the chart above that rates of majority of loans performed pretty well. This show algorithm used to determine a borrower’s level of risk, and then set the interest rates according to the level of risk works pretty well. It can be inferred from chart above, the stack of counts for “Performing” is loans are much higher than the ones under “NonPerforming”. Thus, fortunately for the LendingClub investors, most of them were able to receive their funds with pre-allocated interest rate.

Plot Two

Debt to income ratio indicates the ratio between a borrowers monthly debt payment and monthly income. This was originally formatted as a continuous numerical variable, but I bucketed it into 5% increments to better visualise the effect on loan performance. As we might expect, there is a steady increase in the percentage of non-performing loans as DTI increases, reflecting the constraints that increased debt put onto borrower ability to repay:

Plot Three

The chart shows debit to income ratio with default and charged off status as a function of loan purposes and loan status. What stands out is the higher than normal defaults for loans for student loans purpose. Loans for business, home improvement, personal purposes have proven to have higher defaults also.Notice how the red and blue points seem to be random, and not really related to either the x and y variables. There may be some weak relationships between these variables, but nothing really jumps out at me when I look at the graph. Sometimes, if you can’t see a relationship by looking at the graph, it just may not exist.


Reflection

The Prosper loan data set has 113,937 transaction record with 81 variables. I explored 17 out of these 81 variables. To begin with I looked at missing data. I removed variables where we had more than 8 % missing data. I noticed data with missing values ranging from 1 to 85 in some cases. I wanted to fill in missing age data instead of just dropping the missing age data rows. I used “mice” R package for the imputation of missing values. The mice package in R, helps you imputing missing values with plausible data values. These plausible values are drawn from a distribution specifically designed for each missing data point. I extracted data the feature engineered some fields. I also eliminated a few indicative data fields that are repetitive or too granular to be analysed, and make some formatting changes to get the data ready for analysis. Finally, I mapped the loan statuses to the binary “Performing” and “NonPerforming” classifiers. Then I carried out exploratory data analysis using various plots to check how these variable are distributed and to get the feeling of the data. To further understand this dataset, I also visualised the interaction between two variables. A large number of the other variables provided strong indications of expected performance. Among the most telling are debt-to-income ratio, BorrowerState, BorrowerAPR, home ownership status, loan purpose, income range, inquiries in the past 6 months, and term. Furthermore, verified income status and number of public records show results opposite from what we would expect. This is likely due to increased standards on borrowers with poorer credit history, so all else equal we see outperformance in these loans.

This dataset posed a couple of difficulties to begin with, like unequal distribution of data points across variables and missing data. This makes the exploratory analysis and modeling inherently prone to flaws. However, real life data isn’t perfect either, so I went ahead and tested the success anyway. By accounting for uncertain classification with low predictions probability, the validation data could be classified accurately. However, for a more accurate model, these few cases don’t give enough information to reliably predict the outcome. More cases, more information (i.e. more features) and fewer missing data would improve the modeling outcome. By exploratory analysis and using simple algorithm like logistics regression I was able to short list the variables which could have predictive power for loan performance. I reduced number of categorical variables to small number of classes. This enabled me to have manageable and meaning features to explain prosper loan data. The data seize was colossal to run locally. It was big hindrance for me to have selection of feature importance as well running high end models like randomForest etc. Personally, I think randomForest modelling could be vital tool in predicting loan performance. As it would not only enable me to have prediction of loan performance more accurately but also to rank the features important in disentangling underlying patterns and trends in our data set.

This analysis succeeded in predicting loan performance with accuracy of 84 % as well as deciphering the underlying correlative relationship between features of interest. However, this example is only applicable for this project. There are 81 features in the dataset, and I only selected 17 of these from prosper loan dataset. In order to be able to draw more general conclusions about loan performance outcome, other features and additional information, for example all parameters, demographic information, etc. would be necessary to make more reliable predictions. All in all, this dataset served as a nice example of the possibilities (and pitfalls) of exploratory and machine learning applications and showcases a basic workflow for building prediction models with R.