| Lending Club Loan Data by Rahman Rahimi |
| ======================================================== |
This dataset contains complete loan data for all loans issued through 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.
# Load the Data
loan <- read.csv("loan.csv")
head(loan)
tble <- tbl_df(loan)
In this section, we perform some preliminary exploration of our dataset. we run some summaries of the data and create univariate plots to understand the structure of the individual variables in our dataset.
glimpse(tble)
## Observations: 887,379
## Variables: 74
## $ id <int> 1077501, 1077430, 1077175, 1076863...
## $ member_id <int> 1296599, 1314167, 1313524, 1277178...
## $ loan_amnt <dbl> 5000, 2500, 2400, 10000, 3000, 500...
## $ funded_amnt <dbl> 5000, 2500, 2400, 10000, 3000, 500...
## $ funded_amnt_inv <dbl> 4975.00, 2500.00, 2400.00, 10000.0...
## $ term <fct> 36 months, 60 months, 36 months...
## $ int_rate <dbl> 10.65, 15.27, 15.96, 13.49, 12.69,...
## $ installment <dbl> 162.87, 59.83, 84.33, 339.31, 67.7...
## $ grade <fct> B, C, C, C, B, A, C, E, F, B, C, B...
## $ sub_grade <fct> B2, C4, C5, C1, B5, A4, C5, E1, F2...
## $ emp_title <fct> , Ryder, , AIR RESOURCES BOARD, Un...
## $ emp_length <fct> 10+ years, < 1 year, 10+ years, 10...
## $ home_ownership <fct> RENT, RENT, RENT, RENT, RENT, RENT...
## $ annual_inc <dbl> 24000.00, 30000.00, 12252.00, 4920...
## $ verification_status <fct> Verified, Source Verified, Not Ver...
## $ issue_d <fct> Dec-2011, Dec-2011, Dec-2011, Dec-...
## $ loan_status <fct> Fully Paid, Charged Off, Fully Pai...
## $ pymnt_plan <fct> n, n, n, n, n, n, n, n, n, n, n, n...
## $ url <fct> https://www.lendingclub.com/browse...
## $ desc <fct> Borrower added on 12/22/11 > I n...
## $ purpose <fct> credit_card, car, small_business, ...
## $ title <fct> Computer, bike, real estate busine...
## $ zip_code <fct> 860xx, 309xx, 606xx, 917xx, 972xx,...
## $ addr_state <fct> AZ, GA, IL, CA, OR, AZ, NC, CA, CA...
## $ dti <dbl> 27.65, 1.00, 8.72, 20.00, 17.94, 1...
## $ delinq_2yrs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ earliest_cr_line <fct> Jan-1985, Apr-1999, Nov-2001, Feb-...
## $ inq_last_6mths <dbl> 1, 5, 2, 1, 0, 3, 1, 2, 2, 0, 2, 0...
## $ mths_since_last_delinq <dbl> NA, NA, NA, 35, 38, NA, NA, NA, NA...
## $ mths_since_last_record <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_acc <dbl> 3, 3, 2, 10, 15, 9, 7, 4, 11, 2, 1...
## $ pub_rec <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ revol_bal <dbl> 13648, 1687, 2956, 5598, 27783, 79...
## $ revol_util <dbl> 83.70, 9.40, 98.50, 21.00, 53.90, ...
## $ total_acc <dbl> 9, 4, 10, 37, 38, 12, 11, 4, 13, 3...
## $ initial_list_status <fct> f, f, f, f, f, f, f, f, f, f, f, f...
## $ out_prncp <dbl> 0.00, 0.00, 0.00, 0.00, 766.90, 0....
## $ out_prncp_inv <dbl> 0.00, 0.00, 0.00, 0.00, 766.90, 0....
## $ total_pymnt <dbl> 5861.071, 1008.710, 3003.654, 1222...
## $ total_pymnt_inv <dbl> 5831.78, 1008.71, 3003.65, 12226.3...
## $ total_rec_prncp <dbl> 5000.00, 456.46, 2400.00, 10000.00...
## $ total_rec_int <dbl> 861.07, 435.17, 603.65, 2209.33, 1...
## $ total_rec_late_fee <dbl> 0.00, 0.00, 0.00, 16.97, 0.00, 0.0...
## $ recoveries <dbl> 0.00, 117.08, 0.00, 0.00, 0.00, 0....
## $ collection_recovery_fee <dbl> 0.0000, 1.1100, 0.0000, 0.0000, 0....
## $ last_pymnt_d <fct> Jan-2015, Apr-2013, Jun-2014, Jan-...
## $ last_pymnt_amnt <dbl> 171.62, 119.66, 649.91, 357.48, 67...
## $ next_pymnt_d <fct> , , , , Feb-2016, , Feb-2016, , , ...
## $ last_credit_pull_d <fct> Jan-2016, Sep-2013, Jan-2016, Jan-...
## $ collections_12_mths_ex_med <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ mths_since_last_major_derog <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ policy_code <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ application_type <fct> INDIVIDUAL, INDIVIDUAL, INDIVIDUAL...
## $ annual_inc_joint <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ dti_joint <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ verification_status_joint <fct> , , , , , , , , , , , , , , , , , ...
## $ acc_now_delinq <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tot_coll_amt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tot_cur_bal <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_acc_6m <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_il_6m <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_il_12m <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_il_24m <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mths_since_rcnt_il <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ total_bal_il <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ il_util <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_rv_12m <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ open_rv_24m <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ max_bal_bc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ all_util <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ total_rev_hi_lim <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ inq_fi <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ total_cu_tl <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ inq_last_12m <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
df <- tble[, colSums(is.na(tble)) < length(tble)/2]
head(df)
## # A tibble: 6 x 50
## id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate
## <int> <int> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 1077501 1296599 5000 5000 4975 " 36 m~ 10.6
## 2 1077430 1314167 2500 2500 2500 " 60 m~ 15.3
## 3 1077175 1313524 2400 2400 2400 " 36 m~ 16.0
## 4 1076863 1277178 10000 10000 10000 " 36 m~ 13.5
## 5 1075358 1311748 3000 3000 3000 " 60 m~ 12.7
## 6 1075269 1311441 5000 5000 5000 " 36 m~ 7.90
## # ... with 43 more variables: installment <dbl>, grade <fct>,
## # sub_grade <fct>, emp_title <fct>, emp_length <fct>,
## # home_ownership <fct>, annual_inc <dbl>, verification_status <fct>,
## # issue_d <fct>, loan_status <fct>, pymnt_plan <fct>, url <fct>,
## # desc <fct>, purpose <fct>, title <fct>, zip_code <fct>,
## # addr_state <fct>, dti <dbl>, delinq_2yrs <dbl>,
## # earliest_cr_line <fct>, inq_last_6mths <dbl>, open_acc <dbl>,
## # pub_rec <dbl>, revol_bal <dbl>, total_acc <dbl>,
## # initial_list_status <fct>, out_prncp <dbl>, out_prncp_inv <dbl>,
## # total_pymnt <dbl>, total_pymnt_inv <dbl>, total_rec_prncp <dbl>,
## # total_rec_int <dbl>, total_rec_late_fee <dbl>, recoveries <dbl>,
## # collection_recovery_fee <dbl>, last_pymnt_d <fct>,
## # last_pymnt_amnt <dbl>, next_pymnt_d <fct>, last_credit_pull_d <fct>,
## # policy_code <dbl>, application_type <fct>,
## # verification_status_joint <fct>, acc_now_delinq <dbl>
We first remove the columns that more than 50% of values are NaNs. As you can see from the table above, we have 50 columns remaining. Let’s focus on these columns over the course of our EDA.
factor_vars <- df[, sapply(df, is.factor)]
head(factor_vars)
numerical_vars <- df[, sapply(df, is.numeric)]
head(numerical_vars)
Now, I have devide the entire dataset into 2 seperate ones, one for numeric columns and one for factor columns. The tables above show a head of each of these 2 newly created dataframes.
p1 <- ggplot(data = df, aes(loan_amnt)) + geom_histogram(binwidth = 1000)
p2 <- ggplot(data = df, aes(loan_amnt)) + geom_density(fill = "gray")
grid.arrange(p1, p2, ncol = 2)
Above, we can see the distribution of Loan Amount using 2 different plots.
summary(df$loan_amnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 500 8000 13000 14755 20000 35000
As we can see from the table above, the loan amount varies between as little as $500 up to maximum of $35000. The mean of the loan amount is $14755 and as the boxplot suggests, the majority of the loans are somewhere between $8000 - $20000. There are a large number of loans of $35000 as well.
loan_stat_df <- subset(df, !is.na(df$loan_status)) %>% group_by(loan_status) %>%
summarise(Number = n())
loan_stat_df
## # A tibble: 10 x 2
## loan_status Number
## <fct> <int>
## 1 Charged Off 45248
## 2 Current 601779
## 3 Default 1219
## 4 Does not meet the credit policy. Status:Charged Off 761
## 5 Does not meet the credit policy. Status:Fully Paid 1988
## 6 Fully Paid 207723
## 7 In Grace Period 6253
## 8 Issued 8460
## 9 Late (16-30 days) 2357
## 10 Late (31-120 days) 11591
Let us have a look at the Loans’ Status. About 600 thousand loans are still current. About 200 thousand are fully paid and about 45 thousand loans are charged off.
dataset.new <- filter(df, !loan_status %in% c("Does not meet the credit policy. Status:Charged Off",
"Does not meet the credit policy. Status:Fully Paid", "Default"))
ggplot(data = dataset.new, aes(loan_status)) + geom_bar(position = "dodge") +
labs(x = "Loan Status", title = "Distribution of Loan Status") + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
A number of columns with Factor type and a the rest are continous or discrete form of numerical values. Among the numeric fields, the Loan Amount, Annual Income, Interest Rate are of particular interest.Of all the categorical fields (Factors), Home ownership, Loan Status, Loan Grade, Term, Verification Status are interesting.
I have also added a ratio of Loan Amount / Annual Income at the end of the DataFrame, This might give us some insight on what ratio is the most likely or most common range.
As we mentioned before, there are a number of columns with so many NaNs that could not be helpfu, so we decided to remove them.
Based on what you saw in the univariate plots, what relationships between variables might be interesting to look at in this section?
loan amount and the following fields:Verification Status Below, let us take a deeper look at these relationships:
p1 <- ggplot(data = df, aes(loan_amnt, color = grade)) + geom_histogram(binwidth = 1000) +
facet_grid(grade ~ .)
p2 <- ggplot(data = df, aes(loan_amnt, color = grade, fill = grade)) + geom_density(binwidth = 1000) +
facet_grid(grade ~ .)
## Warning: Ignoring unknown parameters: binwidth
grid.arrange(p1, p2, ncol = 2)
First, we look at the distribution of Loan Amount by Loan Grade. As you can see majority of the loans are in Grades A, B, and C. The interesting observation here was that as we move from grade A to Grade G, the loan amount distribution becomes more homogeneous.
l1 <- ggplot(data = subset(df, !is.na(int_rate)), aes(int_rate, color = grade)) +
geom_histogram(bins = 50) + facet_grid(grade ~ .) + labs(title = "Interest Rate by Loan Grade")
l2 <- ggplot(data = subset(df, !is.na(int_rate)), aes(int_rate, color = grade)) +
geom_density(fill = "SlateGray") + facet_grid(grade ~ .) + labs(title = "Interest Rate by Loan Grade")
grid.arrange(l1, l2, ncol = 2)
Now, the distribution of interest rates by Grade is also presented above, as you can see, the lowest interest rates apply to Grade A. So the more costly loans are in Grade G, F, E.
I am so curious to know to whome these good loans (Grade A) are granted and what kind of obvious difference could be detected between good grades and bad grades loan holders.
ggplot(data = subset(df, !is.na(annual_inc)), aes(x = annual_inc, y = id)) +
geom_point(alpha = 1/10, position = position_jitter(h = 0)) + labs(title = "Scatter Plot of Annual income by id")
The Scatter plot below shows the
self reported annual income of all applicants. Obviously there are so many outliers that could be due to inputting error or some other reasons.
annual_inc.na <- subset(df, !is.na(annual_inc))
q3 <- quantile(annual_inc.na$annual_inc, 0.75)
q1 <- quantile(annual_inc.na$annual_inc, 0.25)
iqr.annual_inc <- (q3 - q1)
range <- (q3 + 1.5 * iqr.annual_inc)
annual_inc_new <- annual_inc.na %>% filter(annual_inc < range)
We calculated the IQR (Interquartile Range) and decided to remove the outliers from the analysis.we use the common measure of 1.5 X IQR beyond 3rd quartile as the outliers.
summary(annual_inc_new$annual_inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 45000 62000 67502 85000 157471
now we have removed the outliers and show the summary of the new data set above:
l1 <- ggplot(data = annual_inc_new, aes(annual_inc)) + geom_histogram(binwidth = 5000)
l2 <- ggplot(data = annual_inc_new, aes(annual_inc)) + geom_density(fill = "gray")
l3 <- ggplot(data = annual_inc_new, aes(y = annual_inc, x = "id")) + geom_violin(scale = "area")
l4 <- ggplot(data = annual_inc_new, aes(y = annual_inc, x = "id")) + geom_boxplot(scale = "area")
## Warning: Ignoring unknown parameters: scale
grid.arrange(l1, l2, l3, l4, nrow = 2, ncol = 2)
Having removed the outliers, now we drew the same plots for Annual Income to get a better picture of the income of the customers.
we are going to create a new variable called Income_to_loan_ratio below:
annual_inc_new$Income_to_loan_ratio <- annual_inc_new$annual_inc/annual_inc_new$loan_amnt
ggplot(data = annual_inc_new, aes(Income_to_loan_ratio)) + geom_density(fill = "gray") +
coord_cartesian(xlim = c(0, 20)) + labs(title = "Distribution (Density Plot) of the ratio of Annual Income to Loan Amount")
so the majority of people have a self_reported annula income of approximately 5 times the amount of loans they have been granted.
ggplot(data = subset(df, !home_ownership %in% c("ANY", "NONE", "OTHER")), aes(y = home_ownership,
purpose)) + geom_count(color = "Navy") + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
Let us have a look at the Home Ownership and Purpose of loans.
The plot above, shows the number of loans grouped by the Home Ownership and the purpose of the loan. Debt Consolidation is by far the most common purpose of loans. But the porportion is different among different types of home ownership. For example, the vast majority of Mortgage holders need loans for Debt Consolidation and credit cards. People who own their homes either do not need or do not apply for a loan through this website. As you can see, it is very uncommon for the home owners to get loans to pay their credit cards debt compared to the ones who either rent or paying mortgage.
ggplot(data = annual_inc_new, aes(y = annual_inc, x = id, fill = grade, color = grade)) +
geom_boxplot(alpha = 1/3) + facet_grid(. ~ grade) + labs(title = "Distribution (Box Plot) of Annual income by Loan Grade") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Box plots of Annual Income by Loan Grade are shown above.Looking at the median line of the box plots, we see a downtrend from Grade A to D and then a slight uptrend towards G.The Grade D has the narrowest range among all other grades.
df %>% filter(!is.na(home_ownership)) %>% filter(!home_ownership %in% c("OTHER",
"NONE", "ANY")) %>% group_by(home_ownership) %>% summarise(mean_amnt_loan = mean(loan_amnt),
mean_annual_inc = mean(annual_inc), n = n()) %>% ungroup() %>% arrange(desc(n))
## # A tibble: 3 x 4
## home_ownership mean_amnt_loan mean_annual_inc n
## <fct> <dbl> <dbl> <int>
## 1 MORTGAGE 16375 85106 443557
## 2 RENT 12871 63982 356117
## 3 OWN 14223 68913 87470
Table above shows the average amount of loans and average annual income by different Home Ownership status. More than 443 thousand loans were granted to those with a mortgaged home, but only 87 thosand to those who own their home. A clear difference between the annual inmcome of those with different home ownership is also shown.
ggplot(data = subset(df, (!is.na(home_ownership) & (!home_ownership %in% c("ANY",
"NONE", "OTHER")))), aes(home_ownership, fill = grade, color = grade)) +
geom_bar() + labs(title = "Distribution of Home Ownership by Loan Grade")
Here I brought the plot of Home Ownership by Loan Grade as a stacked bar chart. The majority of loans are granted to those with Mortgage and grade B and C are the most common ones.
df %>% filter(!is.na(purpose)) %>% group_by(purpose) %>% summarise(mean_annual_inc = mean(annual_inc),
mean_amnt_loan = mean(loan_amnt), n = n()) %>% ungroup() %>% arrange(desc(n))
## # A tibble: 14 x 4
## purpose mean_annual_inc mean_amnt_loan n
## <fct> <dbl> <dbl> <int>
## 1 debt_consolidation 73891 15428 524215
## 2 credit_card 75098 15339 206182
## 3 home_improvement 90271 14321 51829
## 4 other NA 9889 42894
## 5 major_purchase 76700 11600 17277
## 6 small_business 89886 15420 10377
## 7 car 67879 8882 8863
## 8 medical 72163 9019 8540
## 9 moving 68157 7892 5414
## 10 vacation 66874 6290 4736
## 11 house 80448 14798 3707
## 12 wedding 69455 10469 2347
## 13 renewable_energy 75086 9948 575
## 14 educational 51264 6616 423
The table above, depicted the Average amount of loan and average annual income based on the purpose of the loan. As we can see more than 520 thousand loans were granted for debt consolidation and more than 200 thousand for credit card. For those with highest average annual income ($90271/year), home improvment is the most common purpose, while those who have got loans for vacation have the lowest average annual income!
ggplot(data = subset(df, !is.na(purpose)), aes(purpose, fill = loan_status,
color = loan_status)) + geom_bar() + labs(title = "Distribution of Loan purpose") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Now let us have a look at what have happened to the loans with different purposes. The plot above, shows the distribution of of loan purposes by loan status.
df %>% filter(!is.na(verification_status)) %>% group_by(verification_status) %>%
summarise(mean_annual_inc = mean(annual_inc), mean_amnt_loan = mean(loan_amnt),
n = n()) %>% ungroup() %>% arrange(verification_status)
## # A tibble: 3 x 4
## verification_status mean_annual_inc mean_amnt_loan n
## <fct> <dbl> <dbl> <int>
## 1 Not Verified NA 11176 266750
## 2 Source Verified 78916 15500 329558
## 3 Verified 78191 17193 291071
I am curious to know what is the average annual income and average loan amount granted based on the verification status. As we can see for those whose loans could not be verified, the average amount of loan is the lowest and if it was verified they got the highest amount of loan on average which is exactly what we can expect.
ggplot(data = subset(df, !is.na(verification_status)), aes(verification_status,
fill = loan_status, color = loan_status)) + geom_bar(position = "fill") +
labs(title = "Distribution of verification status by Loan Status") + scale_y_continuous(labels = percent_format())
Based on the plot above, a very interesting observation is that the highest percentage of loans with “Charged Off” Status is from those whose verification status is “Not Verified” and interestingly, those hwo could verify their status, have the highest percentage of “Charged off” loans.
df %>% filter(!is.na(grade)) %>% group_by(grade) %>% summarise(mean_annual_inc = mean(annual_inc),
mean_amnt_loan = mean(loan_amnt), n = n()) %>% ungroup() %>% arrange(grade)
## # A tibble: 7 x 4
## grade mean_annual_inc mean_amnt_loan n
## <fct> <dbl> <dbl> <int>
## 1 A NA 14039 148202
## 2 B 75487 13643 254535
## 3 C 71537 14466 245860
## 4 D 69169 15460 139542
## 5 E 72119 17998 70705
## 6 F 73801 19139 23046
## 7 G 79905 20539 5489
Let us see what are the average amount of loans granted and the average annual income based on the loan grades. So the highest average amount of loans granted is on Grade G with about 5400 cases which is of course a very low number compared to say Grade B with more than 250 thousand cases. For grade B with the highest number of loans granted, the average amount of loan is $13642.
ggplot(data = subset(df, !is.na(grade)), aes(grade, fill = loan_status, color = loan_status)) +
geom_bar(position = "fill") + labs(title = "Distribution of Loan Grade by Loan Status") +
scale_y_continuous(labels = percent_format())
another interesting plot is shown above, this distribution plot gives the percentage of loan status based on their grades. So, as we expect, grade G has the highest percentage of loans with “Charged off” status and it shows a steady downtrend as we move towards the better grades (A and B). Moreover, the same trend is evident for the late payments. For example the highest percentage of late payments are among the loans with grade G.
df %>% filter(!is.na(loan_status)) %>% group_by(loan_status) %>% summarise(mean_annual_inc = mean(annual_inc),
mean_amnt_loan = mean(loan_amnt), n = n()) %>% ungroup() %>% arrange(loan_status)
## # A tibble: 10 x 4
## loan_status mean_annual_inc mean_amnt_loan n
## <fct> <dbl> <dbl> <int>
## 1 Charged Off 65085 14556 45248
## 2 Current 76153 15242 601779
## 3 Default 66754 15194 1219
## 4 Does not meet the credit policy.~ 69526 9527 761
## 5 Does not meet the credit policy.~ NA 8853 1988
## 6 Fully Paid 74162 13346 207723
## 7 In Grace Period 76132 16006 6253
## 8 Issued 78318 15299 8460
## 9 Late (16-30 days) 73279 15671 2357
## 10 Late (31-120 days) 70021 15569 11591
df <- transform(df, int_group = cut(int_rate, breaks = c(0, 7.5, 10, 12.5, 15,
20), labels = c("0-7.5", "7.5-10", "10-12.5", "12.5-15", "15 and above")))
df <- transform(df, loan_amnt_group = cut(loan_amnt, breaks = c(0, 5000, 10000,
15000, 20000, 25000, 40000), labels = c("0-5000", "5000-10000", "10000-15000",
"15000-20000", "20000-25000", "25000 and above")))
df <- transform(df, ann_income_group = cut(annual_inc, breaks = c(0, 25000,
50000, 75000, 1e+05, 1e+06), labels = c("0-25000", "25000-50000", "50000-75000",
"75000-100000", "100000 and above")))
We just binned the Interest Rate, Loan Amount, and Annual income columns and converted them into Factor values.
new_list <- numerical_vars[c("loan_amnt", "int_rate", "funded_amnt", "installment")]
M <- cor(new_list)
corrplot(M, method = "color")
corrplot(M, method = "number", col = "black")
Above, we have provided the correlation plots for 4 numeric columns of: ‘loan_amnt’, ‘int_rate’, ‘funded_amnt’, ‘installment’.
converting Issue_d column to date Format:
df$dates <- dmy(paste0("01-", df$issue_d))
df_amount_grade <- df %>% group_by(dates, grade) %>% summarise(Loan_Amount = sum(loan_amnt),
n = n())
ggplot(df_amount_grade, aes(x = dates, y = Loan_Amount)) + geom_area(aes(fill = grade)) +
xlab("Issue Date") + labs(title = "Total $ Amount of loans grouped by Grade")
let us investigate how the total amount of loans has changed over the time. The plot above shows the total dollar amount of loans granted from 2008 untill 2016.There has been some spikes and some very sharp declines that might be due to some technical problem or lack of data or perhaps some seasonal trend exisiting in the market.
ggplot(data = subset(df, !is.na(addr_state)), aes(x = addr_state, y = loan_amnt,
fill = addr_state)) + stat_summary(fun.y = "sum", geom = "bar") + coord_flip() +
labs(y = "Total Loan Amount", x = "State", title = "Bar Chart of loan amount for each state")
The amount of loans granted by state is presented above. California is by far the first state in terms of the total amount of loans. ——
I think, most of the finding align with what we could expect before running doing any analysis, such as the fact that the higher the income, the easier it gets to get a loan, or the cheapest it gets for you to secure a loan. However, there was one interesting finding here, my observation showed that the highest percentage of loans with “Charged Off” Status is from those whose verification status is “Not Verified” and interestingly, those hwo could verify their status, have the highest percentage of “Charged off” loans.
ggplot(df_amount_grade, aes(x = dates, y = Loan_Amount)) + geom_area(aes(fill = grade)) +
xlab("Issue Date") + labs(title = "Total $ Amount of loans grouped by Grade")
The Total amount of loans has grown significantly since 2010, there was a period of stagnation from 2008 to 2010 before it took some momentum. Grade B and C are the most common loan grades and have become more so after 2015. There are a few unusual spikes folowed by sharp declines in late 2014 that need a more comprehensive analysis to figure out the reason behind it. All in all the growth looks impressive and there does not seem to be any sign of weakness.
ggplot(data = subset(df, !is.na(int_rate)), aes(int_rate, color = grade)) +
geom_histogram(bins = 50) + facet_grid(grade ~ .) + labs(title = "Interest Rate by Loan Grade")
The plot shows the disctribution of Interest Rate by Loan Grade. It is kind of obvious that there are differences between grades in terms of interest rates. The grades A-C have significantly lower interest rates.
ggplot(data = subset(df, !is.na(verification_status)), aes(verification_status,
fill = loan_status, color = loan_status)) + geom_bar(position = "fill") +
labs(title = "Distribution of verification status by Loan Status") + scale_y_continuous(labels = percent_format())
As I previously mentioned, this plot shows a rather interesting observation. The “Verified” category has shown a higher percentage of “Charged off” loans compared to the other 2 categories. What we would have expected before any deep analysis, could be the verified loans have a lower risk of default, but it does not seem to be the case. Another important fact from this chart that I would like to point out is that the verified category has higher percentage of “LATE”s as well which is also counter-intuitive.
For this project I have used the Lending Club Loan Data. We downloaded the the dataset from Kaggle.com at this link: (https://www.kaggle.com/wendykan/lending-club-loan-data). This dataset contains complete loan data for all loans issued through 2007-2015,including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file is a matrix of about 890 thousand observations and 75 variables. I worked on this dataset and wrangled and partialy cleaned the data, then using great R functionalities and GGplot library I created some univariate analyses, some multi- and bi-variate analyses of the existing variables. Some of the findings were aligned with our initial expectations and some were a little strange. I believe there are some very interesting insights that we could draw from the dataset by working extensively on the latest payments, installments, and also fees. The other very interesting topic could be analysing the description of the loan applications and see what are the most successful terms we could use to secure a loan. Another aspect that could be focused on is the job titles and investigate whether there is any significant pattern in the dataset.
=======