Nowadays, loans are considered as a complex tool which serves both borrowers and investors. Loans can be divided into a large number of areas and components which determine its terms. Sometimes, due to their complexity, these terms are not clear to the borrower or even to the investor.
As a result, the main questions that usually arise during loan creation are:
This report was made in order to get insights which might answer such questions or provide direction of research regarding loan uptake.
This report shows data partial exploration of the Loan dataset obtained from kaggle website and comes from the Prosper p2p lending platform.
This exploration will focus on 9 of the 81 variables presented in this dataset:
As a preliminary step the measures of central tendancy and dispersion are shown:
## 'data.frame': 113937 obs. of 9 variables:
## $ L.Months.Since.Originated: int 78 0 86 16 6 3 11 10 3 3 ...
## $ L.Orig.Amount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ L.Orig.Quarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ Loan.Status : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ Prosper.Rating : Factor w/ 7 levels "A","AA","B","C",..: NA 1 NA 1 5 3 6 4 2 2 ...
## $ Prosper.Score : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ Borrower.Rate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ Borrower.State : Factor w/ 51 levels "AK","AL","AR",..: 6 6 11 11 24 33 17 5 15 15 ...
## $ Stated.Monthly.Income : num 3083 6125 2083 2875 9583 ...
## L.Months.Since.Originated L.Orig.Amount L.Orig.Quarter
## Min. : 0.00000 Min. : 1000.000 Q4 2013:14450
## 1st Qu.: 6.00000 1st Qu.: 4000.000 Q1 2014:12172
## Median : 21.00000 Median : 6500.000 Q3 2013: 9180
## Mean : 31.89688 Mean : 8337.014 Q2 2013: 7099
## 3rd Qu.: 65.00000 3rd Qu.:12000.000 Q3 2012: 5632
## Max. :100.00000 Max. :35000.000 Q2 2012: 5061
## (Other):60343
## Loan.Status Prosper.Rating Prosper.Score
## Current :56576 C :18345 Min. : 1.000000
## Completed :38074 B :15581 1st Qu.: 4.000000
## Chargedoff :11992 A :14551 Median : 6.000000
## Defaulted : 5018 D :14274 Mean : 5.950067
## Past Due (1-15 days) : 806 E : 9795 3rd Qu.: 8.000000
## Past Due (31-60 days): 363 (Other):12307 Max. :11.000000
## (Other) : 1108 NA's :29084 NA's :29084
## Borrower.Rate Borrower.State Stated.Monthly.Income
## Min. :0.0000000 CA :14717 Min. : 0.000
## 1st Qu.:0.1340000 TX : 6842 1st Qu.: 3200.333
## Median :0.1840000 NY : 6729 Median : 4666.667
## Mean :0.1927641 FL : 6720 Mean : 5608.026
## 3rd Qu.:0.2500000 IL : 5921 3rd Qu.: 6825.000
## Max. :0.4975000 (Other):67493 Max. :1750002.917
## NA's : 5515
Preliminary examination shows that the dataset consists 113,937 loans.
Moreover, average amount of loans is 8337$ and most of the loans are still active (not completed).
It is also worth noting that most of the loans originated from California state.
Furthermore, loan quarter variable shows also year information in addition to the quarter number. Hence in order to better handle this variable, it will be splitted into two separated columns: Loan.Quarter and Loan.Year.
First step in data exploration is to analyze the frequency of each variable in order to find different directions of inquiry.
According to the histogram below, duration of loans is most concentrated between 0-10 months.Also, this variable is somewhat skewed to the right. In order to better present such distribution it is common to transform such skewed data by log transformation (will not be applied here).
The calculations of skewness (0.75) and kurtosis (2.06) supports this finding.
To illustrate this data a boxplot was drawn:
This boxplot shows that most loans are of size between 4000 and 12000 (lower and upper quartiles respectively). Also shown is the oulier points which are determined as the values above 1.5*interquartile range (IQR). These outlier values represent a group of loans which are unusual in their quantity and are much higher than the average amount.
After dividing the original variable into two different variables it is easier to identify patterns of each factor:
Quarter information shows that most loans are taken during the last quarter of the year. Given the year information, this statement can be misleading since a given year consists more or less observations than other years.
This dataset contains information about the status of each observation (loan). The figure below shows that there are more than 1000 “current” status observations than its following “completed” status.
The Prosper Rating is expressed as a letter grade that indicates the expected level of risk associated with a loan listing. Each letter grade corresponds to an estimated average annualized loss rate range. Rating of AA is considered to be a loan with lowest risk.
Prosper score on the other hand is a measure based on historical Prosper data and ranges from 1 to 11, with 11 being the best (lowest risk).
First impression is that both variables show somewhat normal distribution regarding the grade of rating and score of the loan - middle grading, logically, are the most abundant measure.
Borrower Rate, as the interest rate at which money may be borrowed, is a crucial factor determining the total amount of money to be returned to the investor (usually the bank).
Borrower State and Monthly Income may somewhat indicate the economic situation of the residents of each state.
According to the plotted graphs we can see that borrower rate is mainly set between 0.1-0.2. The calculated average value is 0.19.
Examination of the state data graph shows that California state contribute the largest amount of loans. The three next on the list are the states of Florida, New-York, and Texas.
Since monthly income data are very spread, The histogram above shows only the bulk concentration of this variable (can be transformed by log transformation - was not applied here). It is clear that most borrowers have monthly income which ranges between 0-10000 dollars. The maximum recorded income is 1750003 dollars. Since the unusual monthly income values might be a result of typing error (or other errors), it will be wiser to take into account only the bulk values (between 0 and 25,000 dollars).
The following list is a summary of the findings from the descriptive and univariate plots sections:
The objective of this study is to get insights regarding the nature of the loan market and by that, extrapolate those insights regarding the future or develop comprehensive models in future analysis.
In this section variables will be compared in order to find usefull correlations which may be used as a basis for construction of a model. To get a clue regarding directions of investigation, a correlation matrix is constructed (only numeric quantitative variables) :
## row column cor
## 1 Prosper.Score Borrower.Rate -0.64973616600
## 2 L.Orig.Amount Borrower.Rate -0.32895994186
## 3 L.Months.Since.Originated L.Orig.Amount -0.30905756354
## 4 L.Orig.Amount Prosper.Score 0.26629334688
## 5 L.Orig.Amount Stated.Monthly.Income 0.20125946403
## 6 L.Months.Since.Originated Prosper.Score 0.17761489749
## 7 Borrower.Rate Stated.Monthly.Income -0.08898179978
## 8 Prosper.Score Stated.Monthly.Income 0.08375665545
## 9 L.Months.Since.Originated Stated.Monthly.Income -0.08014726639
## 10 L.Months.Since.Originated Borrower.Rate 0.02041956410
## p
## 1 0.000000000e+00
## 2 0.000000000e+00
## 3 0.000000000e+00
## 4 0.000000000e+00
## 5 0.000000000e+00
## 6 0.000000000e+00
## 7 0.000000000e+00
## 8 0.000000000e+00
## 9 0.000000000e+00
## 10 5.456080032e-12
According to the output - highest correlation that was detected is between Borrower Rate ~ Prosper Score with value of -0.65 (negative correlation). Other two negative correlations are Loan Amount ~ Borrower Rate and Loan duration ~ Loan Amount. Additional positive correlations worth mentioning are Loan Amount ~ Prosper Score and Loan Amount ~ Stated Monthly Income. To summarize, there appear to be no strong correlation (correlation coefficient) between the quantitative variables.
The next figures will attempt to show combined categorical data:
Combined categorical data show that loans are NOT created mosly at the forth quarter of each year (as mentioned earlier as a setback). Also prosper rating of the loans are divided similarly in each loan status (modified- Past due include all dues, Defaulted include also Chargedoff, and Current or Completed include all non-penalty loans).
An attempt to examine some quantitative and categorical variables together will show several insights:
##
## 2005 2006 2007
## Current or Completed 1.000000000000 0.606332543176 0.605235602094
## Defaulted 0.000000000000 0.393667456824 0.394764397906
## Past Due 0.000000000000 0.000000000000 0.000000000000
##
## 2008 2009 2010
## Current or Completed 0.670360110803 0.848070346849 0.832802547771
## Defaulted 0.329639889197 0.151929653151 0.165782024062
## Past Due 0.000000000000 0.000000000000 0.001415428167
##
## 2011 2012 2013
## Current or Completed 0.782686141788 0.821306193423 0.964740136847
## Defaulted 0.191040256502 0.134097069503 0.009841316058
## Past Due 0.026273601710 0.044596737074 0.025418547096
##
## 2014
## Current or Completed 0.998439040421
## Defaulted 0.000000000000
## Past Due 0.001560959579
Exploration of the quantitative variables showed that Borrower Rate, in relation to the other variables, is correlating best with Prosper Score. this correlation value is approximately -0.65 which indicate a negative moderate correlation. From general point of view it can be said that one variable is determined by the other. Such relation, in light of the correlation value, might be also affected by other variable/s.
This plot is the number of new loans originated each year and the final disposition of those loans. The early year of Prosper were marked by very low lending standards. Coupled with the global financial crisis, these early loans had very high default rates and many investors had losses. Only after 2011, the proportion of the defaulted loans was decreasing suggesting that lessons were learned after the global financial crisis.
In addition, every year since the global crisis (2009) is characterized in greater quantity of loans. In view of this, 2014 is expected to contain more loans than its predecessors (and also the years afterwards, assuming no emerge of another economic crisis).
It is unclear why the Duration of the loans can be divided into two groups as mentioned above (no loans of duration around ~60 months). This artifact might be a result of recording procedure or a method regarding the insertion of the loan duration into the data-base. Another observation is that the smallest amounts of loans are prolonging more time than expected/usual. Also, while excluding Current/Completed loans it is clear that “Past Due” loans exist only in the lower duration group. This absence may indicate a delicate approch of the lenders about their return payments (long-planned loans will change status directly from current to defaulted, giving the benefit of the doubt in the middle cases).
The Prosper Loan dataset consists 113,937 loans filled with different data. In order to get meaningful insights about this dataset it is very important to understand each field of data specifically and the whole domain generally. Choosing the suitable variables which will give insights once compared to other variables seems to be a very crucial step in the data analysis. The extraction and preparation of the data is also an important step, on which data will be ready to be analyzed. In this case, the choice of variables is deficient, leading to the lack of meaningful insights.
However, there are several arguments that can be raised considering the data analyzed: