Lending Club Loan Data by Rahman Rahimi

========================================================

This dataset contains complete loan data for all loans issued through 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.

# Load the Data
loan <- read.csv("loan.csv")
head(loan)

tble <- tbl_df(loan)

Univariate Plots Section

In this section, we perform some preliminary exploration of our dataset. we run some summaries of the data and create univariate plots to understand the structure of the individual variables in our dataset.

glimpse(tble)

## Observations: 887,379
## Variables: 74
## $ id                          <int> 1077501, 1077430, 1077175, 1076863...
## $ member_id                   <int> 1296599, 1314167, 1313524, 1277178...
## $ loan_amnt                   <dbl> 5000, 2500, 2400, 10000, 3000, 500...
## $ funded_amnt                 <dbl> 5000, 2500, 2400, 10000, 3000, 500...
## $ funded_amnt_inv             <dbl> 4975.00, 2500.00, 2400.00, 10000.0...
## $ term                        <fct>  36 months,  60 months,  36 months...
## $ int_rate                    <dbl> 10.65, 15.27, 15.96, 13.49, 12.69,...
## $ installment                 <dbl> 162.87, 59.83, 84.33, 339.31, 67.7...
## $ grade                       <fct> B, C, C, C, B, A, C, E, F, B, C, B...
## $ sub_grade                   <fct> B2, C4, C5, C1, B5, A4, C5, E1, F2...
## $ emp_title                   <fct> , Ryder, , AIR RESOURCES BOARD, Un...
## $ emp_length                  <fct> 10+ years, < 1 year, 10+ years, 10...
## $ home_ownership              <fct> RENT, RENT, RENT, RENT, RENT, RENT...
## $ annual_inc                  <dbl> 24000.00, 30000.00, 12252.00, 4920...
## $ verification_status         <fct> Verified, Source Verified, Not Ver...
## $ issue_d                     <fct> Dec-2011, Dec-2011, Dec-2011, Dec-...
## $ loan_status                 <fct> Fully Paid, Charged Off, Fully Pai...
## $ pymnt_plan                  <fct> n, n, n, n, n, n, n, n, n, n, n, n...
## $ url                         <fct> https://www.lendingclub.com/browse...
## $ desc                        <fct>   Borrower added on 12/22/11 > I n...
## $ purpose                     <fct> credit_card, car, small_business, ...
## $ title                       <fct> Computer, bike, real estate busine...
## $ zip_code                    <fct> 860xx, 309xx, 606xx, 917xx, 972xx,...
## $ addr_state                  <fct> AZ, GA, IL, CA, OR, AZ, NC, CA, CA...
## $ dti                         <dbl> 27.65, 1.00, 8.72, 20.00, 17.94, 1...
## $ delinq_2yrs                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ earliest_cr_line            <fct> Jan-1985, Apr-1999, Nov-2001, Feb-...
## $ inq_last_6mths              <dbl> 1, 5, 2, 1, 0, 3, 1, 2, 2, 0, 2, 0...
## $ mths_since_last_delinq      <dbl> NA, NA, NA, 35, 38, NA, NA, NA, NA...
## $ mths_since_last_record      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_acc                    <dbl> 3, 3, 2, 10, 15, 9, 7, 4, 11, 2, 1...
## $ pub_rec                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ revol_bal                   <dbl> 13648, 1687, 2956, 5598, 27783, 79...
## $ revol_util                  <dbl> 83.70, 9.40, 98.50, 21.00, 53.90, ...
## $ total_acc                   <dbl> 9, 4, 10, 37, 38, 12, 11, 4, 13, 3...
## $ initial_list_status         <fct> f, f, f, f, f, f, f, f, f, f, f, f...
## $ out_prncp                   <dbl> 0.00, 0.00, 0.00, 0.00, 766.90, 0....
## $ out_prncp_inv               <dbl> 0.00, 0.00, 0.00, 0.00, 766.90, 0....
## $ total_pymnt                 <dbl> 5861.071, 1008.710, 3003.654, 1222...
## $ total_pymnt_inv             <dbl> 5831.78, 1008.71, 3003.65, 12226.3...
## $ total_rec_prncp             <dbl> 5000.00, 456.46, 2400.00, 10000.00...
## $ total_rec_int               <dbl> 861.07, 435.17, 603.65, 2209.33, 1...
## $ total_rec_late_fee          <dbl> 0.00, 0.00, 0.00, 16.97, 0.00, 0.0...
## $ recoveries                  <dbl> 0.00, 117.08, 0.00, 0.00, 0.00, 0....
## $ collection_recovery_fee     <dbl> 0.0000, 1.1100, 0.0000, 0.0000, 0....
## $ last_pymnt_d                <fct> Jan-2015, Apr-2013, Jun-2014, Jan-...
## $ last_pymnt_amnt             <dbl> 171.62, 119.66, 649.91, 357.48, 67...
## $ next_pymnt_d                <fct> , , , , Feb-2016, , Feb-2016, , , ...
## $ last_credit_pull_d          <fct> Jan-2016, Sep-2013, Jan-2016, Jan-...
## $ collections_12_mths_ex_med  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ mths_since_last_major_derog <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ policy_code                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ application_type            <fct> INDIVIDUAL, INDIVIDUAL, INDIVIDUAL...
## $ annual_inc_joint            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ dti_joint                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ verification_status_joint   <fct> , , , , , , , , , , , , , , , , , ...
## $ acc_now_delinq              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tot_coll_amt                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tot_cur_bal                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_acc_6m                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_il_6m                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_il_12m                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_il_24m                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mths_since_rcnt_il          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ total_bal_il                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ il_util                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_rv_12m                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_rv_24m                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ max_bal_bc                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ all_util                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ total_rev_hi_lim            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ inq_fi                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ total_cu_tl                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ inq_last_12m                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...

df <- tble[, colSums(is.na(tble)) < length(tble)/2]
head(df)

## # A tibble: 6 x 50
##        id member_id loan_amnt funded_amnt funded_amnt_inv term    int_rate
##     <int>     <int>     <dbl>       <dbl>           <dbl> <fct>      <dbl>
## 1 1077501   1296599      5000        5000            4975 " 36 m~    10.6 
## 2 1077430   1314167      2500        2500            2500 " 60 m~    15.3 
## 3 1077175   1313524      2400        2400            2400 " 36 m~    16.0 
## 4 1076863   1277178     10000       10000           10000 " 36 m~    13.5 
## 5 1075358   1311748      3000        3000            3000 " 60 m~    12.7 
## 6 1075269   1311441      5000        5000            5000 " 36 m~     7.90
## # ... with 43 more variables: installment <dbl>, grade <fct>,
## #   sub_grade <fct>, emp_title <fct>, emp_length <fct>,
## #   home_ownership <fct>, annual_inc <dbl>, verification_status <fct>,
## #   issue_d <fct>, loan_status <fct>, pymnt_plan <fct>, url <fct>,
## #   desc <fct>, purpose <fct>, title <fct>, zip_code <fct>,
## #   addr_state <fct>, dti <dbl>, delinq_2yrs <dbl>,
## #   earliest_cr_line <fct>, inq_last_6mths <dbl>, open_acc <dbl>,
## #   pub_rec <dbl>, revol_bal <dbl>, total_acc <dbl>,
## #   initial_list_status <fct>, out_prncp <dbl>, out_prncp_inv <dbl>,
## #   total_pymnt <dbl>, total_pymnt_inv <dbl>, total_rec_prncp <dbl>,
## #   total_rec_int <dbl>, total_rec_late_fee <dbl>, recoveries <dbl>,
## #   collection_recovery_fee <dbl>, last_pymnt_d <fct>,
## #   last_pymnt_amnt <dbl>, next_pymnt_d <fct>, last_credit_pull_d <fct>,
## #   policy_code <dbl>, application_type <fct>,
## #   verification_status_joint <fct>, acc_now_delinq <dbl>

We first remove the columns that more than 50% of values are NaNs. As you can see from the table above, we have 50 columns remaining. Let’s focus on these columns over the course of our EDA.

factor_vars <- df[, sapply(df, is.factor)]
head(factor_vars)

numerical_vars <- df[, sapply(df, is.numeric)]
head(numerical_vars)

Now, I have devide the entire dataset into 2 seperate ones, one for numeric columns and one for factor columns. The tables above show a head of each of these 2 newly created dataframes.

p1 <- ggplot(data = df, aes(loan_amnt)) + geom_histogram(binwidth = 1000)
p2 <- ggplot(data = df, aes(loan_amnt)) + geom_density(fill = "gray")
grid.arrange(p1, p2, ncol = 2)

Above, we can see the distribution of Loan Amount using 2 different plots.

summary(df$loan_amnt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     500    8000   13000   14755   20000   35000

As we can see from the table above, the loan amount varies between as little as $500 up to maximum of $35000. The mean of the loan amount is $14755 and as the boxplot suggests, the majority of the loans are somewhere between $8000 - $20000. There are a large number of loans of $35000 as well.

loan_stat_df <- subset(df, !is.na(df$loan_status)) %>% group_by(loan_status) %>% 
    summarise(Number = n())
loan_stat_df

## # A tibble: 10 x 2
##    loan_status                                         Number
##    <fct>                                                <int>
##  1 Charged Off                                          45248
##  2 Current                                             601779
##  3 Default                                               1219
##  4 Does not meet the credit policy. Status:Charged Off    761
##  5 Does not meet the credit policy. Status:Fully Paid    1988
##  6 Fully Paid                                          207723
##  7 In Grace Period                                       6253
##  8 Issued                                                8460
##  9 Late (16-30 days)                                     2357
## 10 Late (31-120 days)                                   11591

Let us have a look at the Loans’ Status. About 600 thousand loans are still current. About 200 thousand are fully paid and about 45 thousand loans are charged off.

dataset.new <- filter(df, !loan_status %in% c("Does not meet the credit policy. Status:Charged Off", 
    "Does not meet the credit policy. Status:Fully Paid", "Default"))
ggplot(data = dataset.new, aes(loan_status)) + geom_bar(position = "dodge") + 
    labs(x = "Loan Status", title = "Distribution of Loan Status") + theme(axis.text.x = element_text(angle = 90, 
    hjust = 1))

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

A number of columns with Factor type and a the rest are continous or discrete form of numerical values. Among the numeric fields, the Loan Amount, Annual Income, Interest Rate are of particular interest.Of all the categorical fields (Factors), Home ownership, Loan Status, Loan Grade, Term, Verification Status are interesting.

Did you create any new variables from existing variables in the dataset?

I have also added a ratio of Loan Amount / Annual Income at the end of the DataFrame, This might give us some insight on what ratio is the most likely or most common range.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

As we mentioned before, there are a number of columns with so many NaNs that could not be helpfu, so we decided to remove them.

Bivariate and Multivariate Plots Section

Based on what you saw in the univariate plots, what relationships between variables might be interesting to look at in this section?
I was particularly interested in the relationship between loan amount and the following fields:
Home Ownership
Employment Length
Loan Grade
Loan Status
Purpose
Interest Rate
Annula Income
Income-to-Loan-ratio
Verification Status Below, let us take a deeper look at these relationships:

p1 <- ggplot(data = df, aes(loan_amnt, color = grade)) + geom_histogram(binwidth = 1000) + 
    facet_grid(grade ~ .)
p2 <- ggplot(data = df, aes(loan_amnt, color = grade, fill = grade)) + geom_density(binwidth = 1000) + 
    facet_grid(grade ~ .)

## Warning: Ignoring unknown parameters: binwidth

grid.arrange(p1, p2, ncol = 2)

First, we look at the distribution of Loan Amount by Loan Grade. As you can see majority of the loans are in Grades A, B, and C. The interesting observation here was that as we move from grade A to Grade G, the loan amount distribution becomes more homogeneous.

l1 <- ggplot(data = subset(df, !is.na(int_rate)), aes(int_rate, color = grade)) + 
    geom_histogram(bins = 50) + facet_grid(grade ~ .) + labs(title = "Interest Rate by Loan Grade")
l2 <- ggplot(data = subset(df, !is.na(int_rate)), aes(int_rate, color = grade)) + 
    geom_density(fill = "SlateGray") + facet_grid(grade ~ .) + labs(title = "Interest Rate by Loan Grade")
grid.arrange(l1, l2, ncol = 2)

Now, the distribution of interest rates by Grade is also presented above, as you can see, the lowest interest rates apply to Grade A. So the more costly loans are in Grade G, F, E.

I am so curious to know to whome these good loans (Grade A) are granted and what kind of obvious difference could be detected between good grades and bad grades loan holders.

ggplot(data = subset(df, !is.na(annual_inc)), aes(x = annual_inc, y = id)) + 
    geom_point(alpha = 1/10, position = position_jitter(h = 0)) + labs(title = "Scatter Plot of Annual income by id")

The Scatter plot below shows the self reported annual income of all applicants. Obviously there are so many outliers that could be due to inputting error or some other reasons.

finding outliers

annual_inc.na <- subset(df, !is.na(annual_inc))

q3 <- quantile(annual_inc.na$annual_inc, 0.75)
q1 <- quantile(annual_inc.na$annual_inc, 0.25)
iqr.annual_inc <- (q3 - q1)
range <- (q3 + 1.5 * iqr.annual_inc)


annual_inc_new <- annual_inc.na %>% filter(annual_inc < range)

We calculated the IQR (Interquartile Range) and decided to remove the outliers from the analysis.we use the common measure of 1.5 X IQR beyond 3rd quartile as the outliers.

summary(annual_inc_new$annual_inc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   45000   62000   67502   85000  157471

now we have removed the outliers and show the summary of the new data set above:

l1 <- ggplot(data = annual_inc_new, aes(annual_inc)) + geom_histogram(binwidth = 5000)
l2 <- ggplot(data = annual_inc_new, aes(annual_inc)) + geom_density(fill = "gray")
l3 <- ggplot(data = annual_inc_new, aes(y = annual_inc, x = "id")) + geom_violin(scale = "area")
l4 <- ggplot(data = annual_inc_new, aes(y = annual_inc, x = "id")) + geom_boxplot(scale = "area")

## Warning: Ignoring unknown parameters: scale

grid.arrange(l1, l2, l3, l4, nrow = 2, ncol = 2)

Having removed the outliers, now we drew the same plots for Annual Income to get a better picture of the income of the customers.

Creating New Variables

we are going to create a new variable called Income_to_loan_ratio below:

annual_inc_new$Income_to_loan_ratio <- annual_inc_new$annual_inc/annual_inc_new$loan_amnt

ggplot(data = annual_inc_new, aes(Income_to_loan_ratio)) + geom_density(fill = "gray") + 
    coord_cartesian(xlim = c(0, 20)) + labs(title = "Distribution (Density Plot) of the ratio of Annual Income to Loan Amount")

so the majority of people have a self_reported annula income of approximately 5 times the amount of loans they have been granted.

ggplot(data = subset(df, !home_ownership %in% c("ANY", "NONE", "OTHER")), aes(y = home_ownership, 
    purpose)) + geom_count(color = "Navy") + theme(axis.text.x = element_text(angle = 90, 
    hjust = 1))

Let us have a look at the Home Ownership and Purpose of loans.

The plot above, shows the number of loans grouped by the Home Ownership and the purpose of the loan. Debt Consolidation is by far the most common purpose of loans. But the porportion is different among different types of home ownership. For example, the vast majority of Mortgage holders need loans for Debt Consolidation and credit cards. People who own their homes either do not need or do not apply for a loan through this website. As you can see, it is very uncommon for the home owners to get loans to pay their credit cards debt compared to the ones who either rent or paying mortgage.

ggplot(data = annual_inc_new, aes(y = annual_inc, x = id, fill = grade, color = grade)) + 
    geom_boxplot(alpha = 1/3) + facet_grid(. ~ grade) + labs(title = "Distribution (Box Plot) of Annual income by Loan Grade") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Box plots of Annual Income by Loan Grade are shown above.Looking at the median line of the box plots, we see a downtrend from Grade A to D and then a slight uptrend towards G.The Grade D has the narrowest range among all other grades.

Home Ownership

df %>% filter(!is.na(home_ownership)) %>% filter(!home_ownership %in% c("OTHER", 
    "NONE", "ANY")) %>% group_by(home_ownership) %>% summarise(mean_amnt_loan = mean(loan_amnt), 
    mean_annual_inc = mean(annual_inc), n = n()) %>% ungroup() %>% arrange(desc(n))

## # A tibble: 3 x 4
##   home_ownership mean_amnt_loan mean_annual_inc      n
##   <fct>                   <dbl>           <dbl>  <int>
## 1 MORTGAGE                16375           85106 443557
## 2 RENT                    12871           63982 356117
## 3 OWN                     14223           68913  87470

Table above shows the average amount of loans and average annual income by different Home Ownership status. More than 443 thousand loans were granted to those with a mortgaged home, but only 87 thosand to those who own their home. A clear difference between the annual inmcome of those with different home ownership is also shown.

ggplot(data = subset(df, (!is.na(home_ownership) & (!home_ownership %in% c("ANY", 
    "NONE", "OTHER")))), aes(home_ownership, fill = grade, color = grade)) + 
    geom_bar() + labs(title = "Distribution of Home Ownership by Loan Grade")

Here I brought the plot of Home Ownership by Loan Grade as a stacked bar chart. The majority of loans are granted to those with Mortgage and grade B and C are the most common ones.

purpose of loan

df %>% filter(!is.na(purpose)) %>% group_by(purpose) %>% summarise(mean_annual_inc = mean(annual_inc), 
    mean_amnt_loan = mean(loan_amnt), n = n()) %>% ungroup() %>% arrange(desc(n))

## # A tibble: 14 x 4
##    purpose            mean_annual_inc mean_amnt_loan      n
##    <fct>                        <dbl>          <dbl>  <int>
##  1 debt_consolidation           73891          15428 524215
##  2 credit_card                  75098          15339 206182
##  3 home_improvement             90271          14321  51829
##  4 other                           NA           9889  42894
##  5 major_purchase               76700          11600  17277
##  6 small_business               89886          15420  10377
##  7 car                          67879           8882   8863
##  8 medical                      72163           9019   8540
##  9 moving                       68157           7892   5414
## 10 vacation                     66874           6290   4736
## 11 house                        80448          14798   3707
## 12 wedding                      69455          10469   2347
## 13 renewable_energy             75086           9948    575
## 14 educational                  51264           6616    423

The table above, depicted the Average amount of loan and average annual income based on the purpose of the loan. As we can see more than 520 thousand loans were granted for debt consolidation and more than 200 thousand for credit card. For those with highest average annual income ($90271/year), home improvment is the most common purpose, while those who have got loans for vacation have the lowest average annual income!

ggplot(data = subset(df, !is.na(purpose)), aes(purpose, fill = loan_status, 
    color = loan_status)) + geom_bar() + labs(title = "Distribution of Loan purpose") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now let us have a look at what have happened to the loans with different purposes. The plot above, shows the distribution of of loan purposes by loan status.

df %>% filter(!is.na(verification_status)) %>% group_by(verification_status) %>% 
    summarise(mean_annual_inc = mean(annual_inc), mean_amnt_loan = mean(loan_amnt), 
        n = n()) %>% ungroup() %>% arrange(verification_status)

## # A tibble: 3 x 4
##   verification_status mean_annual_inc mean_amnt_loan      n
##   <fct>                         <dbl>          <dbl>  <int>
## 1 Not Verified                     NA          11176 266750
## 2 Source Verified               78916          15500 329558
## 3 Verified                      78191          17193 291071

I am curious to know what is the average annual income and average loan amount granted based on the verification status. As we can see for those whose loans could not be verified, the average amount of loan is the lowest and if it was verified they got the highest amount of loan on average which is exactly what we can expect.

ggplot(data = subset(df, !is.na(verification_status)), aes(verification_status, 
    fill = loan_status, color = loan_status)) + geom_bar(position = "fill") + 
    labs(title = "Distribution of verification status by Loan Status") + scale_y_continuous(labels = percent_format())

Based on the plot above, a very interesting observation is that the highest percentage of loans with “Charged Off” Status is from those whose verification status is “Not Verified” and interestingly, those hwo could verify their status, have the highest percentage of “Charged off” loans.

df %>% filter(!is.na(grade)) %>% group_by(grade) %>% summarise(mean_annual_inc = mean(annual_inc), 
    mean_amnt_loan = mean(loan_amnt), n = n()) %>% ungroup() %>% arrange(grade)

## # A tibble: 7 x 4
##   grade mean_annual_inc mean_amnt_loan      n
##   <fct>           <dbl>          <dbl>  <int>
## 1 A                  NA          14039 148202
## 2 B               75487          13643 254535
## 3 C               71537          14466 245860
## 4 D               69169          15460 139542
## 5 E               72119          17998  70705
## 6 F               73801          19139  23046
## 7 G               79905          20539   5489

Let us see what are the average amount of loans granted and the average annual income based on the loan grades. So the highest average amount of loans granted is on Grade G with about 5400 cases which is of course a very low number compared to say Grade B with more than 250 thousand cases. For grade B with the highest number of loans granted, the average amount of loan is $13642.

ggplot(data = subset(df, !is.na(grade)), aes(grade, fill = loan_status, color = loan_status)) + 
    geom_bar(position = "fill") + labs(title = "Distribution of Loan Grade by Loan Status") + 
    scale_y_continuous(labels = percent_format())

another interesting plot is shown above, this distribution plot gives the percentage of loan status based on their grades. So, as we expect, grade G has the highest percentage of loans with “Charged off” status and it shows a steady downtrend as we move towards the better grades (A and B). Moreover, the same trend is evident for the late payments. For example the highest percentage of late payments are among the loans with grade G.

df %>% filter(!is.na(loan_status)) %>% group_by(loan_status) %>% summarise(mean_annual_inc = mean(annual_inc), 
    mean_amnt_loan = mean(loan_amnt), n = n()) %>% ungroup() %>% arrange(loan_status)

## # A tibble: 10 x 4
##    loan_status                       mean_annual_inc mean_amnt_loan      n
##    <fct>                                       <dbl>          <dbl>  <int>
##  1 Charged Off                                 65085          14556  45248
##  2 Current                                     76153          15242 601779
##  3 Default                                     66754          15194   1219
##  4 Does not meet the credit policy.~           69526           9527    761
##  5 Does not meet the credit policy.~              NA           8853   1988
##  6 Fully Paid                                  74162          13346 207723
##  7 In Grace Period                             76132          16006   6253
##  8 Issued                                      78318          15299   8460
##  9 Late (16-30 days)                           73279          15671   2357
## 10 Late (31-120 days)                          70021          15569  11591

Creating Bins for Interest Rates, Loan Amounts, and Annual Income

df <- transform(df, int_group = cut(int_rate, breaks = c(0, 7.5, 10, 12.5, 15, 
    20), labels = c("0-7.5", "7.5-10", "10-12.5", "12.5-15", "15 and above")))

df <- transform(df, loan_amnt_group = cut(loan_amnt, breaks = c(0, 5000, 10000, 
    15000, 20000, 25000, 40000), labels = c("0-5000", "5000-10000", "10000-15000", 
    "15000-20000", "20000-25000", "25000 and above")))

df <- transform(df, ann_income_group = cut(annual_inc, breaks = c(0, 25000, 
    50000, 75000, 1e+05, 1e+06), labels = c("0-25000", "25000-50000", "50000-75000", 
    "75000-100000", "100000 and above")))

We just binned the Interest Rate, Loan Amount, and Annual income columns and converted them into Factor values.

Correlation Analysis Using corrplot

new_list <- numerical_vars[c("loan_amnt", "int_rate", "funded_amnt", "installment")]
M <- cor(new_list)
corrplot(M, method = "color")

corrplot(M, method = "number", col = "black")

Above, we have provided the correlation plots for 4 numeric columns of: ‘loan_amnt’, ‘int_rate’, ‘funded_amnt’, ‘installment’.

converting Issue_d column to date Format:

df$dates <- dmy(paste0("01-", df$issue_d))
df_amount_grade <- df %>% group_by(dates, grade) %>% summarise(Loan_Amount = sum(loan_amnt), 
    n = n())

ggplot(df_amount_grade, aes(x = dates, y = Loan_Amount)) + geom_area(aes(fill = grade)) + 
    xlab("Issue Date") + labs(title = "Total $ Amount of loans grouped by Grade")

let us investigate how the total amount of loans has changed over the time. The plot above shows the total dollar amount of loans granted from 2008 untill 2016.There has been some spikes and some very sharp declines that might be due to some technical problem or lack of data or perhaps some seasonal trend exisiting in the market.

ggplot(data = subset(df, !is.na(addr_state)), aes(x = addr_state, y = loan_amnt, 
    fill = addr_state)) + stat_summary(fun.y = "sum", geom = "bar") + coord_flip() + 
    labs(y = "Total Loan Amount", x = "State", title = "Bar Chart of loan amount for each state")

The amount of loans granted by state is presented above. California is by far the first state in terms of the total amount of loans. ——

Bivariate and Multivariate Analysis

The relationships I have observed

I think, most of the finding align with what we could expect before running doing any analysis, such as the fact that the higher the income, the easier it gets to get a loan, or the cheapest it gets for you to secure a loan. However, there was one interesting finding here, my observation showed that the highest percentage of loans with “Charged Off” Status is from those whose verification status is “Not Verified” and interestingly, those hwo could verify their status, have the highest percentage of “Charged off” loans.

Final Plots and Summary

Plot One

ggplot(df_amount_grade, aes(x = dates, y = Loan_Amount)) + geom_area(aes(fill = grade)) + 
    xlab("Issue Date") + labs(title = "Total $ Amount of loans grouped by Grade")

Description One

The Total amount of loans has grown significantly since 2010, there was a period of stagnation from 2008 to 2010 before it took some momentum. Grade B and C are the most common loan grades and have become more so after 2015. There are a few unusual spikes folowed by sharp declines in late 2014 that need a more comprehensive analysis to figure out the reason behind it. All in all the growth looks impressive and there does not seem to be any sign of weakness.

Plot Two

ggplot(data = subset(df, !is.na(int_rate)), aes(int_rate, color = grade)) + 
    geom_histogram(bins = 50) + facet_grid(grade ~ .) + labs(title = "Interest Rate by Loan Grade")

Description Two

The plot shows the disctribution of Interest Rate by Loan Grade. It is kind of obvious that there are differences between grades in terms of interest rates. The grades A-C have significantly lower interest rates.

Plot Three

ggplot(data = subset(df, !is.na(verification_status)), aes(verification_status, 
    fill = loan_status, color = loan_status)) + geom_bar(position = "fill") + 
    labs(title = "Distribution of verification status by Loan Status") + scale_y_continuous(labels = percent_format())

Description Three

As I previously mentioned, this plot shows a rather interesting observation. The “Verified” category has shown a higher percentage of “Charged off” loans compared to the other 2 categories. What we would have expected before any deep analysis, could be the verified loans have a lower risk of default, but it does not seem to be the case. Another important fact from this chart that I would like to point out is that the verified category has higher percentage of “LATE”s as well which is also counter-intuitive.

Reflection

For this project I have used the Lending Club Loan Data. We downloaded the the dataset from Kaggle.com at this link: (https://www.kaggle.com/wendykan/lending-club-loan-data). This dataset contains complete loan data for all loans issued through 2007-2015,including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file is a matrix of about 890 thousand observations and 75 variables. I worked on this dataset and wrangled and partialy cleaned the data, then using great R functionalities and GGplot library I created some univariate analyses, some multi- and bi-variate analyses of the existing variables. Some of the findings were aligned with our initial expectations and some were a little strange. I believe there are some very interesting insights that we could draw from the dataset by working extensively on the latest payments, installments, and also fees. The other very interesting topic could be analysing the description of the loan applications and see what are the most successful terms we could use to secure a loan. Another aspect that could be focused on is the job titles and investigate whether there is any significant pattern in the dataset.

=======

EDA_Project

Rahman Rahimi

04 April 2018