The dataset contains individual loan data provided by the P2P lending company Prosper in 2014. It has 81 variables and 113,937 loans. Variables include each borrower’s income characteristics, each borrower’s delinquency history, each loan’s information (e.g. amount, interest rate, term), etc.
## [1] 113937 81
Take a look at the Loan Status counts for the defaulted and non-defaulted groups:
## Defaulted Not Default
## 17026 96906
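The counts above translate directly into an overall default rate. The short sketch below does that arithmetic in Python; it is only an illustration and not the R code that produced the output. Note that the two counts sum to 113,932, five fewer than the 113,937 rows reported earlier; the source does not say why.

```python
# Counts copied from the Loan Status table above.
defaulted = 17026
not_defaulted = 96906

total = defaulted + not_defaulted
default_rate = defaulted / total

print(f"loans with a status in the table: {total}")
print(f"share of defaulted loans: {default_rate:.1%}")
```

The classes are imbalanced: roughly one loan in seven is in the defaulted group, which is worth keeping in mind when reading the proportions later on.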
I am going to explore 10 independent variables against the dependent variable (Loan Status) in the bivariate plot section.
Those ten variables are:
Credit Record Characteristic Variables: InquiriesLast6Months, CreditScoreRangeAvg, PublicRecordsLast10Years, CurrentDelinquencies, AvailableBankcardCredit
Income Characteristic Variables: StatedMonthlyIncome, EmploymentStatusDuration, DebtToIncomeRatio, ListingCategory, IncomeVerifiable
I am going to explore the relationship between each independent variable and Loan Status, using descriptive analysis, hypothesis testing, and visualization.
This section explores the variable StatedMonthlyIncome. I used descriptive statistics and a two-way frequency table to look at the distribution of StatedMonthlyIncome in the defaulted and non-defaulted groups.
## [1] "descriptive statistics for Defaulted Group"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2500 3750 4450 5417 208300
## [1] "descriptive statistics for Not Defaulted Group"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3333 4917 5812 7083 1750000
To take the analysis further, I grouped Stated Monthly Income into four classes (the Upper Class, the Upper Middle Class, the Lower Middle Class, and the Poverty Class), based on census data from 2015 and the definitions on Investopedia. http://www.investopedia.com/financial-edge/0912/which-income-class-are-you.aspx
The Upper Class has an annual income above $250,000.
The Upper Middle Class has an annual income from $100,000 to $250,000.
The Lower Middle Class has an annual income from $35,000 to $100,000.
The Poverty Class has an annual income below $35,000.
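The class definitions above can be sketched as a small lookup function. This is a Python illustration, not the author's R code. It assumes that annual income is 12 times StatedMonthlyIncome and that a value exactly on the $100,000 boundary goes to the higher class; the function name `income_class` is hypothetical.

```python
# Annual income thresholds in USD, taken from the class definitions above.
UPPER_ABOVE = 250_000
UPPER_MIDDLE_FROM = 100_000
LOWER_MIDDLE_FROM = 35_000


def income_class(stated_monthly_income: float) -> str:
    """Map a monthly income to one of the four income classes.

    Assumption: annual income = 12 * monthly income.
    """
    annual = 12 * stated_monthly_income
    if annual > UPPER_ABOVE:
        return "Upper Class"
    if annual >= UPPER_MIDDLE_FROM:  # boundary handling is an assumption
        return "Upper Middle Class"
    if annual >= LOWER_MIDDLE_FROM:
        return "Lower Middle Class"
    return "Poverty Class"


for monthly in (2500, 3750, 10000, 25000):
    print(monthly, "->", income_class(monthly))
```

With these thresholds, the defaulted group's first quartile of 2500 per month ($30,000 a year) falls in the Poverty Class.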
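The table below shows the proportion of each income class within each Loan Status group (each row sums to 1). Every class makes up a smaller share of the defaulted group than of the non-defaulted group, except the Poverty Class, whose share is about twice as large among defaulted loans (0.33 vs. 0.17).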
##
## Upper class Upper Middle Class Lower Middle Class
## Defaulted 0.00 0.08 0.59
## Not Default 0.01 0.15 0.67
##
## Poverty Class Sum
## Defaulted 0.33 1.00
## Not Default 0.17 1.00
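The table below shows the proportion of each Loan Status within each income class (each column sums to 1). Non-defaulted loans are the majority in every class, but their share falls from 0.91 and 0.92 in the two upper classes to 0.87 in the Lower Middle Class and 0.75 in the Poverty Class, while the defaulted share rises from 0.09 and 0.08 to 0.13 and 0.25.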
##
## Upper class Upper Middle Class Lower Middle Class
## Defaulted 0.09 0.08 0.13
## Not Default 0.91 0.92 0.87
## Sum 1.00 1.00 1.00
##
## Poverty Class
## Defaulted 0.25
## Not Default 0.75
## Sum 1.00
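The two tables above normalize the same cross-tabulation in two directions: once within each Loan Status group and once within each income class. The Python sketch below shows both normalizations. The counts are hypothetical and chosen only for illustration (the source reports proportions, not raw counts), and the function names are made up for this example.

```python
# Hypothetical counts, NOT taken from the Prosper data.
counts = {
    "Defaulted":   {"Upper": 10, "Upper Middle": 90,  "Lower Middle": 600,  "Poverty": 300},
    "Not Default": {"Upper": 90, "Upper Middle": 910, "Lower Middle": 5400, "Poverty": 2600},
}


def proportions_within_status(table):
    """Divide each cell by its row total, so each Loan Status row sums to 1."""
    return {
        status: {cls: n / sum(cells.values()) for cls, n in cells.items()}
        for status, cells in table.items()
    }


def proportions_within_class(table):
    """Divide each cell by its column total, so each income class sums to 1."""
    class_totals = {}
    for cells in table.values():
        for cls, n in cells.items():
            class_totals[cls] = class_totals.get(cls, 0) + n
    return {
        status: {cls: n / class_totals[cls] for cls, n in cells.items()}
        for status, cells in table.items()
    }


by_status = proportions_within_status(counts)
by_class = proportions_within_class(counts)
print(by_status["Defaulted"])
print(by_class["Defaulted"])
```

The first view answers "what does the income mix of defaulted loans look like?", and the second answers "how likely is default within a given income class?".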
The test result shows a statistically significant relationship between StatedMonthlyIncome (grouped by income class) and Loan Status.
##
## Pearson's Chi-squared test
##
## data: StatedMonthlyIncome
## X-squared = 2480.7, df = 3, p-value < 2.2e-16
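For readers who want to see what the test statistic measures, the sketch below computes Pearson's chi-squared statistic and its degrees of freedom from a contingency table in plain Python. The counts are hypothetical, so the result will not match the 2480.7 reported above, and the helper name `chi_squared` is made up for this example. It does not compute a p-value, which would need the chi-squared distribution from a statistics library.

```python
def chi_squared(observed):
    """Return (statistic, degrees_of_freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical 2 x 4 table: rows are Loan Status, columns are income classes.
table = [
    [10, 90, 600, 300],    # Defaulted
    [90, 910, 5400, 2600], # Not Default
]
stat, dof = chi_squared(table)
print(f"X-squared = {stat:.2f}, df = {dof}")
```

A 2 x 4 table gives (2 - 1) * (4 - 1) = 3 degrees of freedom, which matches the `df = 3` in the output above.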
Since the distribution of StatedMonthlyIncome is extremely positively skewed, I excluded outliers, identified using the interquartile range, before further visualization. As shown in the interleaved histograms and density plots, the non-defaulted group has a positively skewed distribution and a higher mean. In the density plot, both groups peak at a value of 3000, which corresponds to an annual income of $36,000.
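The text does not state the exact outlier rule, so the sketch below assumes the common 1.5 x IQR fences (values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are dropped). It is a Python illustration with made-up sample values, not the author's R code, and the function name `drop_iqr_outliers` is hypothetical. Quartile conventions differ between tools, so the cut points may not match R's exactly.

```python
import statistics


def drop_iqr_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Illustrative monthly incomes with one extreme value.
incomes = [2500, 3000, 3333, 3750, 4917, 5417, 7083, 208300]
trimmed = drop_iqr_outliers(incomes)
print(trimmed)
```

Trimming this way only affects what is plotted; the descriptive statistics and the chi-squared test reported earlier use the full data.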
The boxplot below shows the same result as the histogram and density plot above: the non-defaulted group has a more spread-out distribution and is centered at a higher income.