Below is the attempt to analyse the lending club loan data available for one of the lending clubs in US. It gives visualised view of loan data to derive the relations and valuable insights with inferences.

Setting up the environment with required packages to be included

## Warning: package 'ggplot2' was built under R version 3.4.2
## Warning: package 'ggthemes' was built under R version 3.4.2
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: package 'gridExtra' was built under R version 3.4.2
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## Warning: package 'readr' was built under R version 3.4.2
## Warning: package 'lattice' was built under R version 3.4.2
## Warning: package 'vcd' was built under R version 3.4.2
## Loading required package: grid

Specifying options as part of environment set up

Setting up the working directory where the source file exist

Reading the input file for loan data

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   id = col_integer(),
##   member_id = col_integer(),
##   loan_amnt = col_double(),
##   funded_amnt = col_double(),
##   funded_amnt_inv = col_double(),
##   int_rate = col_double(),
##   installment = col_double(),
##   annual_inc = col_double(),
##   dti = col_double(),
##   delinq_2yrs = col_double(),
##   inq_last_6mths = col_double(),
##   mths_since_last_delinq = col_double(),
##   mths_since_last_record = col_double(),
##   open_acc = col_double(),
##   pub_rec = col_double(),
##   revol_bal = col_double(),
##   revol_util = col_double(),
##   total_acc = col_double(),
##   out_prncp = col_double(),
##   out_prncp_inv = col_double()
##   # ... with 11 more columns
## )
## See spec(...) for full column specifications.

Convert data types to factor from char to use these columns for analysis

## Warning: package 'bindrcpp' was built under R version 3.4.2

Derive year and month fields

Derive the loan grant amount field from loan and funded amount.

## [1] 0
## [1] 21150
##    0%   25%   50%   75%  100% 
##     0     0     0     0 21150
## [1] 13.38698

Splitting Loan grant amount & loan verification status into different levels

Visual Charts/Plots for Insights:

histogram for loan amount Inference: Maximum number of loans taken is for 10k-12k amount and is ~ 1000000 Minimum number of loans taken is for 27-30k amount and is ~ 5000-8000

historgram for interest rate

Inference: Maximum people have interest rate between 10 & 20, with maximum loans given around 12-14

histogram for total payments received till today

Inference: This is left skewed histogram with maximum payments received today < 8000

Histogram Plot for Loan Amount against Term Inference: Loans< 10,000 have term as 36 months, most of loans >20,000 have 60 months term

Histogram Plot for Loan Amount against purpose Inference: Most loans have been taken for credit card and debt_consolidation purpose

Let us see using histogram plot what is the loan amount for credit card purpose Inference: For credit card maximum loan count is between 10,000 & 11,000

Let us see using histogram plot what is the loan amount for credit card purpose

Inference: For Debt_consolidation, maximum loan count is between 10,000 & 11,000

Histograms & Density plots for loan amount, interest rate & total payments against grade

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Bar plot for loan amount based on grade type

Inference: Grade B has maximum loan amount followed by C

Bar plot for loan amount based on purpose

Inference: Maximum loan have been taken for debt_consolidation (~550000)

bar plot to calcualte the total interest rate according to states

Inference: Current Outstanding loan~600000, charged off ~ 45000, Fully paid ~ 210000

Inference: 36 months term has more loans than 60 months

Let us now visualize and observe inference from state/region wise loans

Inference: CA has maximum number of loans currently active followed by NY. Also CA has highest number of loans followed by NY and TX

Now let us see state wise loan deliquency

Inference: CA state has maximum no of delinquents followed by NY state

Inference: CA has maximum no of defaulters followed by NY and TX

Inference: Maximum defaulters are for debt consolidation across regions. CA has maximum no of defaulters

Inference: debit_consolidation and small business have almost same and highest mean loan amount followed by credit card and improvement house

Inference: Joint loans have higher mean value than individual loans

Inference: Those with Mortgage type of home ownership have highest loan amount followed by Rent

Inference: There are around 270000 records which are not verified

Inference: Not verified loans have some records which dont meet the credit policy

Inference: 2015 has maximum loans with Oct being the month having maximum loans issued

Interest Rate Vs Loan amount

Interest rate vs loan amount vs term

Inference: For both terms (36 & 60 months), higher amount loans have interest rate between 10-15%, for 60 month term lower loans have higher interest rate

Inference: Home_ownership not defined (Any) has less term and more loan amount. There is no sharp relationship between these three entities

Box plot for interest rate according to purpose

Box plot for total_pymt based on emp_length

Box plot for interest rate according to grade

Evaluating relationships between two variables

Employee Length & Purpose:

Inference: longer employee length people are having highest no of loans and highest no of loans in grace period

Grade & Purpose:

Inference: Grade F employees have negligable loans for credit cards, and all grades have negligable loans for wedding and vacation purpose. Only Grade C have taken loan for renewable energy. Only two grades have taken loan for house

## [1] 500
## [1] 35000
##    0%   25%   50%   75%  100% 
##   500  8000 13000 20000 35000
## [1] 0
## [1] 57777.58
## # A tibble: 6 x 79
##        id member_id loan_amnt funded_amnt funded_amnt_inv      term
##     <int>     <int>     <dbl>       <dbl>           <dbl>    <fctr>
## 1 1077501   1296599      5000        5000            4975 36 months
## 2 1077430   1314167      2500        2500            2500 60 months
## 3 1077175   1313524      2400        2400            2400 36 months
## 4 1076863   1277178     10000       10000           10000 36 months
## 5 1075358   1311748      3000        3000            3000 60 months
## 6 1075269   1311441      5000        5000            5000 36 months
## # ... with 73 more variables: int_rate <dbl>, installment <dbl>,
## #   grade <fctr>, sub_grade <fctr>, emp_title <fctr>, emp_length <fctr>,
## #   home_ownership <fctr>, annual_inc <dbl>, verification_status <fctr>,
## #   issue_d <fctr>, loan_status <fctr>, pymnt_plan <fctr>, url <fctr>,
## #   desc <fctr>, purpose <fctr>, title <fctr>, zip_code <fctr>,
## #   addr_state <fctr>, dti <dbl>, delinq_2yrs <dbl>,
## #   earliest_cr_line <fctr>, inq_last_6mths <dbl>,
## #   mths_since_last_delinq <dbl>, mths_since_last_record <dbl>,
## #   open_acc <dbl>, pub_rec <dbl>, revol_bal <dbl>, revol_util <dbl>,
## #   total_acc <dbl>, initial_list_status <fctr>, out_prncp <dbl>,
## #   out_prncp_inv <dbl>, total_pymnt <dbl>, total_pymnt_inv <dbl>,
## #   total_rec_prncp <dbl>, total_rec_int <dbl>, total_rec_late_fee <dbl>,
## #   recoveries <dbl>, collection_recovery_fee <dbl>, last_pymnt_d <fctr>,
## #   last_pymnt_amnt <dbl>, next_pymnt_d <fctr>, last_credit_pull_d <fctr>,
## #   collections_12_mths_ex_med <dbl>, mths_since_last_major_derog <fctr>,
## #   policy_code <dbl>, application_type <fctr>, annual_inc_joint <fctr>,
## #   dti_joint <fctr>, verification_status_joint <fctr>,
## #   acc_now_delinq <dbl>, tot_coll_amt <fctr>, tot_cur_bal <fctr>,
## #   open_acc_6m <fctr>, open_il_6m <fctr>, open_il_12m <fctr>,
## #   open_il_24m <fctr>, mths_since_rcnt_il <fctr>, total_bal_il <fctr>,
## #   il_util <fctr>, open_rv_12m <fctr>, open_rv_24m <fctr>,
## #   max_bal_bc <fctr>, all_util <fctr>, total_rev_hi_lim <fctr>,
## #   inq_fi <fctr>, total_cu_tl <fctr>, inq_last_12m <fctr>, month <chr>,
## #   year <chr>, loanamtgrant <dbl>, loanamtgrantrange <fctr>,
## #   loanamnt_class <fctr>
##        0%       25%       50%       75%      100% 
##     0.000  1914.590  4894.999 10616.814 57777.580

##            
##                car credit_card debt_consolidation educational
##   36 months   6954      150410             351848         406
##   60 months   1909       55772             172367          17
##            
##             home_improvement  house major_purchase medical moving  other
##   36 months            35832   2615          13034    6960   4662  34099
##   60 months            15997   1092           4243    1580    752   8795
##            
##             renewable_energy small_business vacation wedding
##   36 months              473           7553     4329    1950
##   60 months              102           2824      407     397

Finding Correlation between variables

Correlation between loan_amnt, int_rate and total_rec_int

##               loan_amnt  int_rate total_rec_int
## loan_amnt     1.0000000 0.1450231     0.5339157
## int_rate      0.1450231 1.0000000     0.4456788
## total_rec_int 0.5339157 0.4456788     1.0000000

Inference: There exist relation between total recurring interest and loan amount, total recurring interest and interest rate. However relatin between interest rate and loan amount is less.

Correlation between loan_amnt, int_rate and total_rec_late_fee

##                     loan_amnt   int_rate total_rec_late_fee
## loan_amnt          1.00000000 0.14502310         0.03139469
## int_rate           0.14502310 1.00000000         0.05715012
## total_rec_late_fee 0.03139469 0.05715012         1.00000000