DATA 607 02 [15961] : Final Project Presentation

Debabrata Kabiraj

May 13, 2019

Install Pre-requisite R libraries

## Loading Packages

Part 1 - Introduction

Mission

Some months back, I was trying to apply for a lending club loan. Interest rate was different me and my colleague. That made me to think more about how lending rates are decided and what factors affect one’s ability to get best loan rate.

Background

Lending Club (LC), a San Francisco-based fintech company, works to facilitate peer-to-peer loans through their online lending platform. Started in 2007, their website allows individuals to publicly post loan applications, which other users can then browse and choose to fund. Company estimates place aggregate loan totals at over $15.98 billion through December 2015, making Lending Club the largest online loan platform in the world.

Simply put, LC is a US peer-to-peer lending company. Where investors provide funding and borrowers return back the payments. Lending club selects and approves the borrowers using many parameters. It is a sort of EBay for loans.

Although mired in scandal (CEO Renaud Laplanche stepped down amidst dropping loan volumes and well-vetted accusations of shady accounting practices), Lending Club still provides an excellent machine learning case study. Luckily, their website makes its historical records publicly available, leading to the interesting question:

Project Goal

This project is to predict the interest rate with various predictor variables. By performing this analysis we will know below information.

  1. What parameters will impact my interest rate? ie., Is loan interest % predictive of FICO credit score alone?

  2. Is loan funded amount are equal for different purpose of loan request? So the person can get loan in that particular loan type.

  3. It is always mentioned that living state plays a important role in interest rate. This hypothesis will be validated.

  4. There is a myth that home ownership will impact FICO scores. It will be validated via this dataset.

  5. Did lending club receive equal number of loans in each month

Part 2 - Data

When we register as a lending club user, you will get access to the borrowers data from lending loan website Lending Club.

This dataset has borrowers details (personal info will be removed) It has the funded amount, interest rate, fico credit score and about 150 variables. Also the row count is around 115K for Q1 2019.

For current analysis, I have taken a simple random sample of 10000 rows. These data are transformed and modified to perform analysis on data.

Data collection

Describe the method of data collection.

When we register as a lending club borrower/invester, you will get access to the borrowers data from lending loan website Lending Club.

LC also provides loan rejection dataset. But for current analysis, we have used only the borrowers dataset since rejection dataset does not have enough relevant data for analysis.

Download the Loan Statistics from Lending Club WebPage via LendingClub’s API.

Data can also be accessed from LendingClub package provied in R via LendingClub: A Lending Club API Wrapper

source_url <- "https://api.lendingclub.com/api/investor/v1/loans/listing"
raw_data <- httr::GET(source_url, add_headers(Authorization="Vkqakl1lAygRyXRwlKCOyHWG4DE")))
raw_data_content <- httr::content(raw_data, as = "text", encoding = "UTF-8")
write.csv(raw_data_content,"LoanStats_securev1_2019Q1.csv")
  1. Select the required columns
  2. Convert the relevant columns to numeric or integer or date (using lubridate)
  3. Remove unwanted columns and change the NA data(if any)
  4. Calculate the total payment by each customer adding the interest rate
# load data of 2019 Q1
set.seed(7340)

loans_summary_full <- read.csv ("LoanStats_securev1_2019Q1.csv",
                                header=TRUE, skip=1, sep=",", stringsAsFactors=FALSE, skipNul = TRUE) %>%  
                      select(c(pub_rec_bankruptcies, loan_amnt, funded_amnt, funded_amnt_inv, term, int_rate, emp_length, home_ownership, annual_inc, dti, addr_state, 
                               fico_range_low, fico_range_high, installment, issue_d, loan_status, purpose, loan_amnt, grade)) %>%   
                      mutate(fico_score=(fico_range_low+fico_range_high)/2) %>%     
                      mutate(term1=as.numeric(str_trim(str_replace(term,"months","")))) %>%       
                      select(-fico_range_low,-fico_range_high) %>%         
                      mutate(emp_length=str_replace(emp_length,c("n/a"),"NA")) %>%           
                      mutate(int_rate=(as.numeric(str_replace(as.character(int_rate),"%",""))))  %>%
                      mutate(loan_amnt=(as.integer(str_replace(as.character(loan_amnt),"%",""))))  %>%
                      mutate(issue_d=(dmy(paste("01-", issue_d, sep =""))))  %>%  
                      mutate(total_pymnt=(funded_amnt+(funded_amnt*int_rate*(term1/12))/100)) %>% 
                      filter(!is.na(int_rate))

DT::datatable(loans_summary_full, options = list(pagelength=5))
loans_summary <- sample_n(tbl=loans_summary_full, size=10000, replace = FALSE)

For this project, in some cases, have used the comple dataset and for some analysis I have taken a simple random sample of 10000 rows.

Variables

It has the funded amount, interest rate, fico credit score and about 150 variables. Also there are around 115K observations for Q1 2019.

But for this current analysis, below are the variables used.

LoanStatNew Description
loan_amnt The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
funded_amnt The total amount committed to that loan at that point in time.
int_rate Interest Rate on the loan
emp_length Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
home_ownership The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
annual_inc The self-reported annual income provided by the borrower during registration.
verification_status Indicates if income was verified by LC, not verified, or if the income source was verified
addr_state The state provided by the borrower in the loan application
issue_d The month which the loan was funded
loan_status Current status of the loan
purpose A category provided by the borrower for the loan request.
total_pymnt_inv Payments received to date for portion of total amount funded by investors
fico_range_high The upper boundary range the borrower’s FICO at loan origination belongs to.
fico_range_low The lower boundary range the borrower’s FICO at loan origination belongs to.
term The number of payments on the loan. Values are in months and can be either 36 or 60.

Part 3 - Exploratory data analysis

Below are some exploratory data analysis charts to understand more about the data.

Explanatory

Below table summarizes the question, response and explanatory variable. It also shows whether it is Numerical or Categorical.

Question Response Variable Explanatory Variable
1. What parameters will impact my interest rate? ie., Is loan interest % predictive of FICO credit score alone? Loan Amount, Loan Term, Interest rate % FICO Credit score, home ownership, Purpose
2. Is loan funded interest rate % are equal for different purpose of loan request? Interest rate % Purpose
3. Does different states have equal interest rate? Interest rate % States
4. Does home ownership really impact FICO scores or just by chance? FICO scores Home Ownership
5. Did lending club receive equal number of loans in each month? Loan count -

Charts

loandatamean <- group_by(loans_summary,purpose) %>% 
                summarise(mean(funded_amnt))

loandatasd <- group_by(loans_summary,purpose) %>% 
              summarise(sd(funded_amnt))

#Histogram of funded amount
ggplot(loans_summary,aes(x=funded_amnt)) + 
  geom_histogram()

#Histogram of Interest rate %
ggplot(loans_summary,aes(x=int_rate)) + 
  geom_histogram() +
  ggtitle("Histogram of Interest rate %")

#FICO score vs interest rate
ggplotly(ggplot(loans_summary,aes(x=fico_score,y=int_rate)) + 
           geom_point(aes(color=purpose),size=1,alpha=0.8) + 
           geom_smooth(method = "lm") +
           ggtitle("FICO score vs Interest Rate"))
#Histogram of Interest rate %
ggplot(loans_summary,aes(x=int_rate)) + 
  geom_histogram(aes(fill=home_ownership),bins=10) + 
  ggtitle("Interest Rate Histogram")

#funded amount are equal for different purpose
ggplot(loans_summary,aes(x=purpose,y=int_rate)) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Purpose vs Funded amount")

#Funded amount vs Interest rate 
ggplot(loans_summary,aes(x=funded_amnt,y=int_rate)) + 
  geom_point(aes(color=term1)) + 
  geom_smooth(method="lm") + 
  ggtitle("Funded amount vs Interest Rate")

#FICO Scores vs Interest rate
ggplot(loans_summary,aes(x=fico_score,y=int_rate)) + 
  geom_point(aes(color=term1)) + 
  geom_smooth(method="lm") + 
  ggtitle("FICO Score vs Interest Rate")

#Home ownership vs FICO scores
ggplot(loans_summary,aes(x=fico_score)) + 
  geom_histogram() + 
  facet_grid(.~home_ownership) + 
  ggtitle("Histogram of Home ownership vs FICO scores")

# State vs  Interest rate % 
state_interestrate <- loans_summary %>% 
                      select(addr_state,int_rate) %>%  
                      group_by(addr_state) %>% 
                      summarise(avg=mean(int_rate))

#state_interestrate

a1 <- list(autotick = TRUE, title = "State")
b1 <- list(title = "Interest rate %")
plot_ly(x = ~state_interestrate$addr_state, y = ~state_interestrate$avg, type = "area") %>% 
  layout(xaxis=a1, yaxis=b1, title="State vs Average Interest Rate")
#remove(a1,b1,state_interestrate)
#Loan Request vs Loan Issued Month
plot_ly(x = ~loans_summary$issue_d) %>% 
  layout(title="Loan Request vs Loan Issued Month")
num_vars <- loans_summary %>% 
              sapply(is.numeric) %>% 
              which() %>% 
              names()

corrplot::corrplot(cor(loans_summary[, num_vars], use = "complete.obs"), 
                   method = "pie", type = "upper", title = "Loan Summary")

# Pie chart of the distribution of Loans across the different status:
sumAmnt(loans_summary, loan_status) %>% 
  merge(sumPerSatus(loans_summary, loan_status)) %>%
  plot_ly(type = "pie", 
          labels = loan_status, 
          values = total_issued, 
          hole = 0.5,
          marker = list(colors = brewer.pal(7, "Pastel2"),
                        line = list(width = 1, color = "rgb(52, 110, 165)")),
          sort = F,
          direction = "counterclockwise",
          rotation = 90,
          textinfo = "label+percent",
          textfont = list(size = 14),
          text = paste("Default rates: ", charged),
          opacity = 1,
          textposition = "outside") %>%
  layout(title = 'LOAN ISSUED GROUPED BY STATUS<br>(Hover for breakdown)',
         height = 500, width = 1400, autosize = T, 
         legend = list(font = list(size = 16), x = 1, y = 1, traceorder = "normal"))

Time Series

#Interest Rate vs Grade
loans_summary_full %>%
  select(int_rate, grade, issue_d) %>%
  group_by(grade, issue_d) %>%
  summarise(int_rate_mean = mean(int_rate, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, int_rate_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~ grade) +
  ggtitle("Histogram of Interest Rate vs Grade")

Interpreting the plot:

#Loan Amount vs Grade
loans_summary_full %>%
  select(loan_amnt, grade, issue_d) %>%
  group_by(grade, issue_d) %>%
  summarise(loan_amnt_mean = mean(loan_amnt, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, loan_amnt_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~ grade) +
  ggtitle("Histogram of Loan Amount vs Grade")

Deriving a few points from the plot:

Geolocation Plots

#meta_loans <- funModeling::df_status(loans_summary, print_results = FALSE)
#geo_vars <- c("zip_code", "addr_state")
#meta_loans %>% select(variable, p_zeros, p_na, type, unique) %>% filter_(~ variable %in% geo_vars) %>% knitr::kable()

interest_rate_state <- as.data.frame(loans_summary) %>%
                        select(int_rate, addr_state) %>%
                        group_by(addr_state) %>%
                        summarise(mean_int_rate = mean(int_rate, na.rm = TRUE)) %>%
                        rename(state = addr_state, value = mean_int_rate) %>%
                        #mutate(region = tolower(region)) %>%
                        select(state, value)
#colnames(interest_rate_state)
#choroplethr::state_choropleth(df = interest_rate_state, title = "Default rate by State")

plot_usmap(data = interest_rate_state, values = "value", lines = "red") + 
  scale_fill_continuous(name = "Interest Rates (1Q2019)", label = scales::comma) + 
  theme(legend.position = "right") + 
  labs(title = "Distribution of Interest Rates across US states")

loan_amount_state <- as.data.frame(loans_summary_full) %>%
                        select(addr_state) %>%
                        group_by(addr_state) %>%
                        summarise(count_loans = n()) %>%
                        rename(id = addr_state, value = count_loans) %>%
                        mutate(id = tolower(openintro::abbr2state(id))) %>%
                        select(id, value)

# Put the counts into buckets
loan_amount_state$bucket <- NA
loan_amount_state$bucket[loan_amount_state$value < 500] <- "Under 500"
loan_amount_state$bucket[loan_amount_state$value >= 500   & loan_amount_state$value < 1000] <- "500-1k"
loan_amount_state$bucket[loan_amount_state$value >= 1000  & loan_amount_state$value < 5000] <- "1k-5k"
loan_amount_state$bucket[loan_amount_state$value >= 5000  & loan_amount_state$value < 10000] <- "5k-10k"
loan_amount_state$bucket[loan_amount_state$value >= 10000 & loan_amount_state$value < 20000] <- "10k-20k"
loan_amount_state$bucket[loan_amount_state$value >= 20000] <- "20k+"
loan_amount_state$bucket <- factor(loan_amount_state$bucket,
                                   levels = c("Under 500",
                                              "500-1k",
                                              "1k-5k",
                                              "5k-10k",
                                              "10k-20k",
                                              "20k+"))

# Create data frame with longitude and latitude from "fiftystater" package
us <- fiftystater::fifty_states

#kable(head(us,10))
#kable(head(loan_amount_state,10))

# Merge map and count data
us <- merge(us, loan_amount_state, by = "id")

#kable(head(us,10))

# Create a map
ggplot(us) + 
  geom_map(aes(map_id = id, fill = bucket), col = "grey20", size = .2, map = us) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map() +
  scale_fill_brewer("", palette = "PuRd") +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  ggtitle("Distribution of Loan Amounts across US States") +
  theme(legend.text = element_text(family = "Gill Sans MT"),
        legend.position = "bottom",
        panel.background = element_blank(),
        plot.title = element_text(family = "Gill Sans MT"))

Derive a few points from the plot:

Statistics

Exploratory data analysis suggests below statistics.

Statistic Variable Value
Population Mean Interest rate 12.7205228
Population SD Interest rate 4.8548739
Sample Statistics Mean Interest rate 12.709096
Sample Statistics SD Interest rate 4.8590779

Sample size = 10000

summary(loans_summary)
##  pub_rec_bankruptcies   loan_amnt      funded_amnt    funded_amnt_inv
##  Min.   :0.0000       Min.   : 1000   Min.   : 1000   Min.   :  750  
##  1st Qu.:0.0000       1st Qu.: 8894   1st Qu.: 8894   1st Qu.: 8894  
##  Median :0.0000       Median :15000   Median :15000   Median :15000  
##  Mean   :0.1154       Mean   :16740   Mean   :16740   Mean   :16737  
##  3rd Qu.:0.0000       3rd Qu.:24000   3rd Qu.:24000   3rd Qu.:24000  
##  Max.   :4.0000       Max.   :40000   Max.   :40000   Max.   :40000  
##                                                                      
##      term              int_rate      emp_length        home_ownership    
##  Length:10000       Min.   : 6.46   Length:10000       Length:10000      
##  Class :character   1st Qu.: 8.19   Class :character   Class :character  
##  Mode  :character   Median :11.80   Mode  :character   Mode  :character  
##                     Mean   :12.71                                        
##                     3rd Qu.:15.57                                        
##                     Max.   :30.84                                        
##                                                                          
##    annual_inc           dti          addr_state         installment     
##  Min.   :      0   Min.   :  0.00   Length:10000       Min.   :  30.89  
##  1st Qu.:  48000   1st Qu.: 12.35   Class :character   1st Qu.: 263.09  
##  Median :  70000   Median : 18.66   Mode  :character   Median : 397.94  
##  Mean   :  84543   Mean   : 20.74                      Mean   : 481.50  
##  3rd Qu.: 100000   3rd Qu.: 25.96                      3rd Qu.: 652.58  
##  Max.   :6286908   Max.   :999.00                      Max.   :1618.24  
##                    NA's   :29                                           
##     issue_d           loan_status          purpose         
##  Min.   :2019-01-01   Length:10000       Length:10000      
##  1st Qu.:2019-01-01   Class :character   Class :character  
##  Median :2019-02-01   Mode  :character   Mode  :character  
##  Mean   :2019-01-28                                        
##  3rd Qu.:2019-03-01                                        
##  Max.   :2019-03-01                                        
##                                                            
##     grade             fico_score        term1       total_pymnt   
##  Length:10000       Min.   :662.0   Min.   :36.0   Min.   : 1211  
##  Class :character   1st Qu.:682.0   1st Qu.:36.0   1st Qu.:11938  
##  Mode  :character   Median :702.0   Median :36.0   Median :20841  
##                     Mean   :709.4   Mean   :43.8   Mean   :25196  
##                     3rd Qu.:732.0   3rd Qu.:60.0   3rd Qu.:36002  
##                     Max.   :847.5   Max.   :60.0   Max.   :87108  
## 

Confidence interval of the Interest Rate

Point estimate from the sample with the confidence interval is shown below

inference(y=loans_summary$int_rate,est="mean",null=0,alternative="twosided",type="ci",conflevel=0.95,method="theoretical")
## Single mean 
## Summary statistics:
## mean = 12.7091 ;  sd = 4.8591 ;  n = 10000 
## Standard error = 0.0486 
## 95 % Confidence interval = ( 12.6139 , 12.8043 )

For this test lest validate the total sample size required.

s=sd(loans_summary_full$int_rate)
n = ((pnorm(0.025)*s)/0.03)^2 

If the margin of error to be 3%, we need to get the samples of around 6810.9383735.

Part 4 - Inference

1. Type of loans issues with relation to Interest Rate

Looking at grade which seems to be a rating classification scheme that Lending Club uses to assign loans into risk buckets similar to other popular rating schemes like S&P or Moodys. We know that grades have values of A, B, C, D, E, F, G where A represents the highest quality loan and G the lowest.

give_count <- stat_summary(fun.data = function(x) return(c(y = median(x)*1.06,
                                                         label = length(x))),
               geom = "text")
give_mean <- stat_summary(fun.y = mean, colour = "darkgreen", geom = "point", 
                          shape = 18, size = 3, show.legend = FALSE)

loans_summary %>%
  ggplot(aes(grade, loan_amnt)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ loan_status) +
  labs(title="Loan Amount by Grade for each loan status", x = "Grade", y = "Loan Amount \n")

As per the plot we can infer following;

loans_summary %>%
  ggplot(aes(grade, int_rate)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  labs(title="Interest Rate by Grade", x = "Grade", y = "Interest Rate \n") +
  facet_wrap(~ term)

Above plot clearly indicates below:

2. Is loan interest % predictive of credit score and income?

To perform this statement, we will use the linear model to validate it. Here we are performing forward elimination technique to get the maximum adjusted R-squared vale.

loans_lm <- lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc + emp_length + issue_d + pub_rec_bankruptcies + dti,loans_summary ) # 0.3628
summary(loans_lm)
## 
## Call:
## lm(formula = int_rate ~ fico_score + home_ownership + purpose + 
##     term1 + loan_amnt + annual_inc + emp_length + issue_d + pub_rec_bankruptcies + 
##     dti, data = loans_summary)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4759  -2.8083  -0.6065   2.1065  17.9751 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.741e+01  3.020e+01   0.576  0.56431    
## fico_score                -5.672e-02  1.223e-03 -46.381  < 2e-16 ***
## home_ownershipMORTGAGE     8.311e-02  5.860e-01   0.142  0.88723    
## home_ownershipOWN          6.171e-01  5.954e-01   1.036  0.30003    
## home_ownershipRENT         6.458e-01  5.862e-01   1.102  0.27059    
## purposecredit_card        -3.662e+00  4.485e-01  -8.166 3.57e-16 ***
## purposedebt_consolidation -2.051e+00  4.454e-01  -4.604 4.19e-06 ***
## purposehome_improvement   -1.268e+00  4.728e-01  -2.682  0.00732 ** 
## purposehouse              -6.850e-02  6.482e-01  -0.106  0.91584    
## purposemajor_purchase     -3.884e-01  5.571e-01  -0.697  0.48570    
## purposemedical            -8.314e-01  6.072e-01  -1.369  0.17096    
## purposemoving              6.631e-01  7.400e-01   0.896  0.37026    
## purposeother               1.137e-02  4.776e-01   0.024  0.98101    
## purposerenewable_energy   -3.471e-01  1.520e+00  -0.228  0.81930    
## purposesmall_business      6.588e-01  6.149e-01   1.071  0.28404    
## purposevacation            5.548e-01  6.775e-01   0.819  0.41288    
## term1                      1.364e-01  4.025e-03  33.879  < 2e-16 ***
## loan_amnt                 -3.458e-05  4.576e-06  -7.557 4.48e-14 ***
## annual_inc                -1.194e-06  4.412e-07  -2.706  0.00683 ** 
## emp_length1 year           2.211e-03  1.898e-01   0.012  0.99071    
## emp_length10+ years       -2.867e-01  1.376e-01  -2.083  0.03723 *  
## emp_length2 years         -3.444e-01  1.793e-01  -1.920  0.05484 .  
## emp_length3 years         -7.649e-02  1.907e-01  -0.401  0.68830    
## emp_length4 years         -9.185e-02  2.024e-01  -0.454  0.65002    
## emp_length5 years         -1.018e-01  2.055e-01  -0.496  0.62016    
## emp_length6 years          2.333e-03  2.377e-01   0.010  0.99217    
## emp_length7 years         -1.249e-01  2.503e-01  -0.499  0.61794    
## emp_length8 years         -7.755e-02  2.602e-01  -0.298  0.76573    
## emp_length9 years         -3.546e-01  2.945e-01  -1.204  0.22861    
## emp_lengthNA               1.142e-01  1.790e-01   0.638  0.52371    
## issue_d                    1.778e-03  1.683e-03   1.056  0.29094    
## pub_rec_bankruptcies      -3.787e-01  1.298e-01  -2.918  0.00353 ** 
## dti                        2.008e-02  1.823e-03  11.010  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.109 on 9938 degrees of freedom
##   (29 observations deleted due to missingness)
## Multiple R-squared:  0.2876, Adjusted R-squared:  0.2853 
## F-statistic: 125.3 on 32 and 9938 DF,  p-value: < 2.2e-16
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1,loans_summary)) #0.3458
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt,loans_summary)) # #0.3452
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc ,loans_summary )) #0.3591
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc + addr_state,loans_summary )) #0.3594
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc ,loans_summary )) #0.3591
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc  + emp_length,loans_summary )) #0.3615

After mulitple iterations, we have come to the conclusion that below are the variables that affect interest rate %

#positions <- c("AK","AL","AR","AZ", "CA","CO","CT","DC","DE","FL","GA","HI","IA","ID","IL","IN","KS","KY","LA","MA","MD","ME","MI","MN","MO","MS","MT","NC","ND","NE","NH","NJ","NM","NV","NY","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VA","VT","WA","WI","WV","WY")

#ggplot(state_interestrate, aes(x = addr_state, y = avg)) + geom_violin(scale = "width") + scale_x_discrete(limits = positions) + scale_y_continuous(labels = percent) + ggtitle("Distribution of Average Interest Rate aggregated by State") + theme(plot.title = element_text(size = 11), axis.text.x = element_text(angle = 45, hjust = 1)) 

funded_amnt <-
  #loans_summary %>%
  loans_summary_full %>%
  transmute(loan_amnt = loan_amnt, value = funded_amnt, 
              variable = "funded_amnt")

funded_amnt_inv <-
  #loans_summary %>%
  loans_summary_full %>%
  transmute(loan_amnt = loan_amnt, value = funded_amnt_inv, 
              variable = "funded_amnt_inv")

plot_data <- rbind(funded_amnt, funded_amnt_inv)
# remove unnecessary data using regex
#ls()
rm(list = ls()[grep("^funded", ls())])

plot_data %>%
  ggplot(aes(x = loan_amnt, y = value)) +
  facet_wrap(~ variable, scales = "free_x", ncol = 3) +
  geom_point()

loans_summary %>%
  ggplot(aes(x = annual_inc, y = loan_amnt)) +
  geom_point()

We are going to validate the conditions for least squared line.

From the below chart, it shows that there is a downward relationship between FICO credit score and interest rate. But the linear model is not very strong due to large number of variability.

#fico score vs interest rate
ggplot(loans_summary,aes(x=fico_score,y=int_rate)) + 
  geom_point(size=1,alpha=0.8) + 
  geom_smooth(method = "lm")  + 
  ggtitle("FICO SCORE vs Interest rate")

Let’s check the residuals normality with histogram and qqplot.

df_residuals <- augment(loans_lm)

#Historgram plot of residuals
ggplot(df_residuals,aes(df_residuals$.resid)) + 
  geom_histogram() + 
  ggtitle("Residual Histogram")

#QQ norm plot of the residuals
qqnorm(loans_lm$residuals)
qqline(loans_lm$residuals) 

The plots show that residuals are slightly left skewed. But the residuals are normal.

df_residuals <- filter(df_residuals, .fitted >-17)

ggplot(df_residuals, aes(x=.fitted, y=.resid)) + 
  geom_point(size=1,alpha=0.8) + 
  geom_smooth(method = "lm") + 
  ggtitle("FICO SCORE vs Residuals") 

ggplot(df_residuals, aes(x=.fitted, y=abs(.resid))) + 
  geom_point(size=1,alpha=0.8) + 
  ggtitle("FICO SCORE vs Residuals in Absolute terms")

Above plot shows that there is a constant variability in the chart.

3. Different purpose of loan request

3. Is loan funded amount are equal for different purpose of loan request?

Let’s validate if the purpose of loan interest rate varies or not.

inference(y=loans_summary$int_rate,x=loans_summary$purpose,est="mean",null = 0,alternative = "greater",type="ht",method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## 
## Summary statistics:
## n_car = 87, mean_car = 14.4262, sd_car = 5.1598
## n_credit_card = 2768, mean_credit_card = 11.4543, sd_credit_card = 4.2753
## n_debt_consolidation = 5508, mean_debt_consolidation = 12.969, sd_debt_consolidation = 4.8833
## n_home_improvement = 602, mean_home_improvement = 13.0386, sd_home_improvement = 4.8425
## n_house = 75, mean_house = 13.9613, sd_house = 5.5346
## n_major_purchase = 146, mean_major_purchase = 13.6175, sd_major_purchase = 5.0955
## n_medical = 97, mean_medical = 13.7255, sd_medical = 4.5394
## n_moving = 49, mean_moving = 15.05, sd_moving = 5.8391
## n_other = 503, mean_other = 14.4195, sd_other = 5.4904
## n_renewable_energy = 8, mean_renewable_energy = 14.885, sd_renewable_energy = 3.8127
## n_small_business = 93, mean_small_business = 15.1254, sd_small_business = 5.9583
## n_vacation = 64, mean_vacation = 15.0791, sd_vacation = 5.0107
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##             Df Sum Sq Mean Sq F value    Pr(>F)
## x           11   8071  733.68  32.139 < 2.2e-16
## Residuals 9988 228012   22.83                  
## 
## Pairwise tests: t tests with pooled SD 
##                       car credit_card debt_consolidation home_improvement
## credit_card        0.0000          NA                 NA               NA
## debt_consolidation 0.0048      0.0000                 NA               NA
## home_improvement   0.0114      0.0000             0.7341               NA
## house              0.5369      0.0000             0.0740           0.1148
## major_purchase     0.2114      0.0000             0.1056           0.1891
## medical            0.3206      0.0000             0.1222           0.1889
## moving             0.4648      0.0000             0.0024           0.0046
## other              0.9904      0.0000             0.0000           0.0000
## renewable_energy   0.7949      0.0426             0.2571           0.2776
## small_business     0.3266      0.0000             0.0000           0.0001
## vacation           0.4067      0.0000             0.0004           0.0012
##                     house major_purchase medical moving  other
## credit_card            NA             NA      NA     NA     NA
## debt_consolidation     NA             NA      NA     NA     NA
## home_improvement       NA             NA      NA     NA     NA
## house                  NA             NA      NA     NA     NA
## major_purchase     0.6124             NA      NA     NA     NA
## medical            0.7482         0.8630      NA     NA     NA
## moving             0.2148         0.0694  0.1137     NA     NA
## other              0.4385         0.0742  0.1903 0.3779     NA
## renewable_energy   0.6032         0.4650  0.5094 0.9278 0.7846
## small_business     0.1165         0.0174  0.0435 0.9288 0.1906
## vacation           0.1693         0.0413  0.0786 0.9744 0.2983
##                    renewable_energy small_business
## credit_card                      NA             NA
## debt_consolidation               NA             NA
## home_improvement                 NA             NA
## house                            NA             NA
## major_purchase                   NA             NA
## medical                          NA             NA
## moving                           NA             NA
## other                            NA             NA
## renewable_energy                 NA             NA
## small_business               0.8914             NA
## vacation                     0.9138         0.9524

Above output shows that interest rate varies for each purpose of loan.

Our Model states that if the purpose of loan is vacation then the interest rate will be higher. Let’s compare with other purpose house

#Compare Credit card loan and debt_consolidation
creditcard_other <- filter(loans_summary,purpose ==c("vacation","house"))
inference(y=creditcard_other$int_rate,x=creditcard_other$purpose,est="mean",null = 0,alternative = "greater",type="ht",order=c("house","vacation"),
          method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_house = 37, mean_house = 13.2668, sd_house = 5.0985
## n_vacation = 33, mean_vacation = 14.9564, sd_vacation = 5.3667
## Observed difference between means (house-vacation) = -1.6896
## 
## H0: mu_house - mu_vacation = 0 
## HA: mu_house - mu_vacation > 0 
## Standard error = 1.255 
## Test statistic: Z =  -1.346 
## p-value =  0.9109

From above we can conclude that the interest rate % is higher for house.

4. State vs interest rate and loan amounts

4. It is always mentioned that living state plays a important role in interest rate. This hypothesis will be validated.

5. Homeownership vs FICO score

5. Does home ownership really impact FICO scores or just by chance?

We want to understand does the home ownership really impact FICO scores. Below we are performing ANOVA between the three variables (RENT VS MORTGAGE VS OWN)

inference(y=loans_summary$fico_score,x=loans_summary$home_ownership,est="mean",null = 0,alternative = "greater",type="ht",method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## 
## Summary statistics:
## n_ANY = 50, mean_ANY = 719.6, sd_ANY = 34.1103
## n_MORTGAGE = 5017, mean_MORTGAGE = 712.2562, sd_MORTGAGE = 35.8328
## n_OWN = 1141, mean_OWN = 711.9522, sd_OWN = 36.6887
## n_RENT = 3792, mean_RENT = 704.7492, sd_RENT = 32.4501
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##             Df   Sum Sq Mean Sq F value    Pr(>F)
## x            3   135582   45194  37.572 < 2.2e-16
## Residuals 9996 12023965    1203                  
## 
## Pairwise tests: t tests with pooled SD 
##             ANY MORTGAGE OWN
## MORTGAGE 0.1363       NA  NA
## OWN      0.1270   0.7893  NA
## RENT     0.0026   0.0000   0

From above we can conclude that FICO scores are not impacted by home ownership.

6. Loan issue date

6. Did lending club receive equal number of loans in each month?

#Loan request vs month
loans_count<- loans_summary %>% 
              #mutate(loan_month = match(substr(loans_summary$issue_d,1,3),month.abb)) %>%
              mutate(loan_month = format(issue_d,"%B")) %>% 
              group_by(loan_month) %>% 
              count(loan_month) %>% 
              rename(observed_loans=n) %>% 
              filter(!is.na(loan_month))

loans_count <- loans_count %>% 
                mutate(percent= 33.333333) %>% 
                mutate(expected_loans=sum(observed_loans)*(percent/100))

chisq.test(x=loans_count$observed_loans, p=loans_count$percent/100)
## 
##  Chi-squared test for given probabilities
## 
## data:  loans_count$observed_loans
## X-squared = 96.501, df = 2, p-value < 2.2e-16
loans_count
## # A tibble: 3 x 4
## # Groups:   loan_month [3]
##   loan_month observed_loans percent expected_loans
##   <chr>               <int>   <dbl>          <dbl>
## 1 February             3085    33.3          1028.
## 2 January              3796    33.3          1265.
## 3 March                3119    33.3          1040.

From the model and output of chi-test, we may hypothesize that if the loans applied on January may receive higher approved loan amount than other months in Q1.

Amount Funded based on FICO score

#Loan Amount vs. Funded Amount
ggplot(loans_summary, aes(loan_amnt, funded_amnt)) + 
  geom_point(aes(colour = fico_score)) +
  labs(title = "Loan Amount vs. Funded Amount") +
  geom_smooth()

As we can see in the plot above,

#Installment vs. Funded Amount
ggplot(loans_summary, aes(installment, funded_amnt)) + 
  geom_point(aes(colour = fico_score)) +
  labs(title = "Installment vs. Funded Amount") +
  geom_smooth()

ggplot(loans_summary, aes(annual_inc, funded_amnt)) + 
  geom_point(aes(colour = fico_score)) +
  labs(title = "Annual Income vs. Funded Amount") +
  geom_smooth()

ggplot(loans_summary, aes(purpose, loan_amnt, fill = loan_status)) + 
  geom_boxplot() + 
  labs(title = "Loan Amount vs Purpose for Loan Status") + 
  theme(axis.text.x=element_text(size=8, angle = 90))

Part 5 - Conclusion

Dataset from Lending club is an interesting dataset. It is often very difficult to get the insights of interest rate from Bank. This analysis provides interesting information about the interest rate which we get from Lending club for each person.

The interest rate which we receive depends on the various factors like FICO score, Homeownership, Purpose of loan, Term length of loan, loan amount requested, Annual income, Employee length, Issue month, Previous bankrupcies and Debt to income ratio.

If a person is wanting to get a good interest rate then he need to focus on above factors before applying for a lending club loan.

From the research question, I learnt to perform various tests and add visualizations to understand the pattern of Lending club borrowers data along-with the interest rates. It is not a very straight forward method which gets from FICO scores. Also lot of financial terms while dealing with loans.

Future analysis can be performed by adding more variables to the model. Which provided loans are at risk or at the status of Charged-off. We can also calculate the return interest rate for an investor.

These analysis can also be used if we want to invest in lending club and be a successful investor.