Below is the attempt to analyse the lending club loan data available for one of the lending clubs in US. It gives visualised view of loan data to derive the relations and valuable insights with inferences.
Setting up the environment with required packages to be included
## Warning: package 'ggplot2' was built under R version 3.4.2
## Warning: package 'ggthemes' was built under R version 3.4.2
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'gridExtra' was built under R version 3.4.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Warning: package 'readr' was built under R version 3.4.2
## Warning: package 'lattice' was built under R version 3.4.2
## Warning: package 'vcd' was built under R version 3.4.2
## Loading required package: grid
Specifying options as part of environment set up
Setting up the working directory where the source file exist
Reading the input file for loan data
## Parsed with column specification:
## cols(
## .default = col_character(),
## id = col_integer(),
## member_id = col_integer(),
## loan_amnt = col_double(),
## funded_amnt = col_double(),
## funded_amnt_inv = col_double(),
## int_rate = col_double(),
## installment = col_double(),
## annual_inc = col_double(),
## dti = col_double(),
## delinq_2yrs = col_double(),
## inq_last_6mths = col_double(),
## mths_since_last_delinq = col_double(),
## mths_since_last_record = col_double(),
## open_acc = col_double(),
## pub_rec = col_double(),
## revol_bal = col_double(),
## revol_util = col_double(),
## total_acc = col_double(),
## out_prncp = col_double(),
## out_prncp_inv = col_double()
## # ... with 11 more columns
## )
## See spec(...) for full column specifications.
Convert data types to factor from char to use these columns for analysis
## Warning: package 'bindrcpp' was built under R version 3.4.2
Derive year and month fields
Derive the loan grant amount field from loan and funded amount.
## [1] 0
## [1] 21150
## 0% 25% 50% 75% 100%
## 0 0 0 0 21150
## [1] 13.38698
Splitting Loan grant amount & loan verification status into different levels
Visual Charts/Plots for Insights:
histogram for loan amount Inference: Maximum number of loans taken is for 10k-12k amount and is ~ 1000000 Minimum number of loans taken is for 27-30k amount and is ~ 5000-8000
historgram for interest rate
Inference: Maximum people have interest rate between 10 & 20, with maximum loans given around 12-14
histogram for total payments received till today
Inference: This is left skewed histogram with maximum payments received today < 8000
Histogram Plot for Loan Amount against Term Inference: Loans< 10,000 have term as 36 months, most of loans >20,000 have 60 months term
Histogram Plot for Loan Amount against purpose Inference: Most loans have been taken for credit card and debt_consolidation purpose
Let us see using histogram plot what is the loan amount for credit card purpose Inference: For credit card maximum loan count is between 10,000 & 11,000
Let us see using histogram plot what is the loan amount for credit card purpose
Inference: For Debt_consolidation, maximum loan count is between 10,000 & 11,000
Histograms & Density plots for loan amount, interest rate & total payments against grade
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Bar plot for loan amount based on grade type
Inference: Grade B has maximum loan amount followed by C
Bar plot for loan amount based on purpose
Inference: Maximum loan have been taken for debt_consolidation (~550000)
bar plot to calcualte the total interest rate according to states
Inference: Current Outstanding loan~600000, charged off ~ 45000, Fully paid ~ 210000
Inference: 36 months term has more loans than 60 months
Let us now visualize and observe inference from state/region wise loans
Inference: CA has maximum number of loans currently active followed by NY. Also CA has highest number of loans followed by NY and TX
Now let us see state wise loan deliquency
Inference: CA state has maximum no of delinquents followed by NY state
Inference: CA has maximum no of defaulters followed by NY and TX
Inference: Maximum defaulters are for debt consolidation across regions. CA has maximum no of defaulters
Inference: debit_consolidation and small business have almost same and highest mean loan amount followed by credit card and improvement house
Inference: Joint loans have higher mean value than individual loans
Inference: Those with Mortgage type of home ownership have highest loan amount followed by Rent
Inference: There are around 270000 records which are not verified
Inference: Not verified loans have some records which dont meet the credit policy
Inference: 2015 has maximum loans with Oct being the month having maximum loans issued
Interest Rate Vs Loan amount
Interest rate vs loan amount vs term
Inference: For both terms (36 & 60 months), higher amount loans have interest rate between 10-15%, for 60 month term lower loans have higher interest rate
Inference: Home_ownership not defined (Any) has less term and more loan amount. There is no sharp relationship between these three entities
Box plot for interest rate according to purpose
Box plot for total_pymt based on emp_length
Box plot for interest rate according to grade
Evaluating relationships between two variables
Employee Length & Purpose:
Inference: longer employee length people are having highest no of loans and highest no of loans in grace period
Grade & Purpose:
Inference: Grade F employees have negligable loans for credit cards, and all grades have negligable loans for wedding and vacation purpose. Only Grade C have taken loan for renewable energy. Only two grades have taken loan for house
## [1] 500
## [1] 35000
## 0% 25% 50% 75% 100%
## 500 8000 13000 20000 35000
## [1] 0
## [1] 57777.58
## # A tibble: 6 x 79
## id member_id loan_amnt funded_amnt funded_amnt_inv term
## <int> <int> <dbl> <dbl> <dbl> <fctr>
## 1 1077501 1296599 5000 5000 4975 36 months
## 2 1077430 1314167 2500 2500 2500 60 months
## 3 1077175 1313524 2400 2400 2400 36 months
## 4 1076863 1277178 10000 10000 10000 36 months
## 5 1075358 1311748 3000 3000 3000 60 months
## 6 1075269 1311441 5000 5000 5000 36 months
## # ... with 73 more variables: int_rate <dbl>, installment <dbl>,
## # grade <fctr>, sub_grade <fctr>, emp_title <fctr>, emp_length <fctr>,
## # home_ownership <fctr>, annual_inc <dbl>, verification_status <fctr>,
## # issue_d <fctr>, loan_status <fctr>, pymnt_plan <fctr>, url <fctr>,
## # desc <fctr>, purpose <fctr>, title <fctr>, zip_code <fctr>,
## # addr_state <fctr>, dti <dbl>, delinq_2yrs <dbl>,
## # earliest_cr_line <fctr>, inq_last_6mths <dbl>,
## # mths_since_last_delinq <dbl>, mths_since_last_record <dbl>,
## # open_acc <dbl>, pub_rec <dbl>, revol_bal <dbl>, revol_util <dbl>,
## # total_acc <dbl>, initial_list_status <fctr>, out_prncp <dbl>,
## # out_prncp_inv <dbl>, total_pymnt <dbl>, total_pymnt_inv <dbl>,
## # total_rec_prncp <dbl>, total_rec_int <dbl>, total_rec_late_fee <dbl>,
## # recoveries <dbl>, collection_recovery_fee <dbl>, last_pymnt_d <fctr>,
## # last_pymnt_amnt <dbl>, next_pymnt_d <fctr>, last_credit_pull_d <fctr>,
## # collections_12_mths_ex_med <dbl>, mths_since_last_major_derog <fctr>,
## # policy_code <dbl>, application_type <fctr>, annual_inc_joint <fctr>,
## # dti_joint <fctr>, verification_status_joint <fctr>,
## # acc_now_delinq <dbl>, tot_coll_amt <fctr>, tot_cur_bal <fctr>,
## # open_acc_6m <fctr>, open_il_6m <fctr>, open_il_12m <fctr>,
## # open_il_24m <fctr>, mths_since_rcnt_il <fctr>, total_bal_il <fctr>,
## # il_util <fctr>, open_rv_12m <fctr>, open_rv_24m <fctr>,
## # max_bal_bc <fctr>, all_util <fctr>, total_rev_hi_lim <fctr>,
## # inq_fi <fctr>, total_cu_tl <fctr>, inq_last_12m <fctr>, month <chr>,
## # year <chr>, loanamtgrant <dbl>, loanamtgrantrange <fctr>,
## # loanamnt_class <fctr>
## 0% 25% 50% 75% 100%
## 0.000 1914.590 4894.999 10616.814 57777.580
##
## car credit_card debt_consolidation educational
## 36 months 6954 150410 351848 406
## 60 months 1909 55772 172367 17
##
## home_improvement house major_purchase medical moving other
## 36 months 35832 2615 13034 6960 4662 34099
## 60 months 15997 1092 4243 1580 752 8795
##
## renewable_energy small_business vacation wedding
## 36 months 473 7553 4329 1950
## 60 months 102 2824 407 397
Finding Correlation between variables
Correlation between loan_amnt, int_rate and total_rec_int
## loan_amnt int_rate total_rec_int
## loan_amnt 1.0000000 0.1450231 0.5339157
## int_rate 0.1450231 1.0000000 0.4456788
## total_rec_int 0.5339157 0.4456788 1.0000000
Inference: There exist relation between total recurring interest and loan amount, total recurring interest and interest rate. However relatin between interest rate and loan amount is less.
Correlation between loan_amnt, int_rate and total_rec_late_fee
## loan_amnt int_rate total_rec_late_fee
## loan_amnt 1.00000000 0.14502310 0.03139469
## int_rate 0.14502310 1.00000000 0.05715012
## total_rec_late_fee 0.03139469 0.05715012 1.00000000