Part 2 - Data

When we register as a lending club user, you will get access to the borrowers data from lending loan website Lending Club.

This dataset has borrowers details (personal info will be removed) It has the funded amount, interest rate, fico credit score and about 150 variables. Also the row count is around 115K for Q1 2019.

For current analysis, I have taken a simple random sample of 10000 rows. These data are transformed and modified to perform analysis on data.

Data collection

Describe the method of data collection.

When we register as a lending club borrower/invester, you will get access to the borrowers data from lending loan website Lending Club.

LC also provides loan rejection dataset. But for current analysis, we have used only the borrowers dataset since rejection dataset does not have enough relevant data for analysis.

Download the Loan Statistics from Lending Club WebPage via LendingClub’s API.

Data can also be accessed from LendingClub package provied in R via LendingClub: A Lending Club API Wrapper

source_url <- "https://api.lendingclub.com/api/investor/v1/loans/listing"
raw_data <- httr::GET(source_url, add_headers(Authorization="Vkqakl1lAygRyXRwlKCOyHWG4DE")))
raw_data_content <- httr::content(raw_data, as = "text", encoding = "UTF-8")
write.csv(raw_data_content,"LoanStats_securev1_2019Q1.csv")

Select the required columns
Convert the relevant columns to numeric or integer or date (using lubridate)
Remove unwanted columns and change the NA data(if any)
Calculate the total payment by each customer adding the interest rate

# load data of 2019 Q1
set.seed(7340)

loans_summary_full <- read.csv ("LoanStats_securev1_2019Q1.csv",
                                header=TRUE, skip=1, sep=",", stringsAsFactors=FALSE, skipNul = TRUE) %>%  
                      select(c(pub_rec_bankruptcies, loan_amnt, funded_amnt, funded_amnt_inv, term, int_rate, emp_length, home_ownership, annual_inc, dti, addr_state, 
                               fico_range_low, fico_range_high, installment, issue_d, loan_status, purpose, loan_amnt, grade)) %>%   
                      mutate(fico_score=(fico_range_low+fico_range_high)/2) %>%     
                      mutate(term1=as.numeric(str_trim(str_replace(term,"months","")))) %>%       
                      select(-fico_range_low,-fico_range_high) %>%         
                      mutate(emp_length=str_replace(emp_length,c("n/a"),"NA")) %>%           
                      mutate(int_rate=(as.numeric(str_replace(as.character(int_rate),"%",""))))  %>%
                      mutate(loan_amnt=(as.integer(str_replace(as.character(loan_amnt),"%",""))))  %>%
                      mutate(issue_d=(dmy(paste("01-", issue_d, sep =""))))  %>%  
                      mutate(total_pymnt=(funded_amnt+(funded_amnt*int_rate*(term1/12))/100)) %>% 
                      filter(!is.na(int_rate))

DT::datatable(loans_summary_full, options = list(pagelength=5))

loans_summary <- sample_n(tbl=loans_summary_full, size=10000, replace = FALSE)

For this project, in some cases, have used the comple dataset and for some analysis I have taken a simple random sample of 10000 rows.

Variables

It has the funded amount, interest rate, fico credit score and about 150 variables. Also there are around 115K observations for Q1 2019.

But for this current analysis, below are the variables used.

LoanStatNew	Description
loan_amnt	The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
funded_amnt	The total amount committed to that loan at that point in time.
int_rate	Interest Rate on the loan
emp_length	Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
home_ownership	The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
annual_inc	The self-reported annual income provided by the borrower during registration.
verification_status	Indicates if income was verified by LC, not verified, or if the income source was verified
addr_state	The state provided by the borrower in the loan application
issue_d	The month which the loan was funded
loan_status	Current status of the loan
purpose	A category provided by the borrower for the loan request.
total_pymnt_inv	Payments received to date for portion of total amount funded by investors
fico_range_high	The upper boundary range the borrower’s FICO at loan origination belongs to.
fico_range_low	The lower boundary range the borrower’s FICO at loan origination belongs to.
term	The number of payments on the loan. Values are in months and can be either 36 or 60.

Part 3 - Exploratory data analysis

Below are some exploratory data analysis charts to understand more about the data.

Explanatory

Below table summarizes the question, response and explanatory variable. It also shows whether it is Numerical or Categorical.

Question	Response Variable	Explanatory Variable
1. What parameters will impact my interest rate? ie., Is loan interest % predictive of FICO credit score alone?	Loan Amount, Loan Term, Interest rate %	FICO Credit score, home ownership, Purpose
2. Is loan funded interest rate % are equal for different purpose of loan request?	Interest rate %	Purpose
3. Does different states have equal interest rate?	Interest rate %	States
4. Does home ownership really impact FICO scores or just by chance?	FICO scores	Home Ownership
5. Did lending club receive equal number of loans in each month?	Loan count	-

Charts

loandatamean <- group_by(loans_summary,purpose) %>% 
                summarise(mean(funded_amnt))

loandatasd <- group_by(loans_summary,purpose) %>% 
              summarise(sd(funded_amnt))

#Histogram of funded amount
ggplot(loans_summary,aes(x=funded_amnt)) + 
  geom_histogram()

#Histogram of Interest rate %
ggplot(loans_summary,aes(x=int_rate)) + 
  geom_histogram() +
  ggtitle("Histogram of Interest rate %")

#FICO score vs interest rate
ggplotly(ggplot(loans_summary,aes(x=fico_score,y=int_rate)) + 
           geom_point(aes(color=purpose),size=1,alpha=0.8) + 
           geom_smooth(method = "lm") +
           ggtitle("FICO score vs Interest Rate"))

#Histogram of Interest rate %
ggplot(loans_summary,aes(x=int_rate)) + 
  geom_histogram(aes(fill=home_ownership),bins=10) + 
  ggtitle("Interest Rate Histogram")

#funded amount are equal for different purpose
ggplot(loans_summary,aes(x=purpose,y=int_rate)) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Purpose vs Funded amount")

#Funded amount vs Interest rate 
ggplot(loans_summary,aes(x=funded_amnt,y=int_rate)) + 
  geom_point(aes(color=term1)) + 
  geom_smooth(method="lm") + 
  ggtitle("Funded amount vs Interest Rate")

#FICO Scores vs Interest rate
ggplot(loans_summary,aes(x=fico_score,y=int_rate)) + 
  geom_point(aes(color=term1)) + 
  geom_smooth(method="lm") + 
  ggtitle("FICO Score vs Interest Rate")

#Home ownership vs FICO scores
ggplot(loans_summary,aes(x=fico_score)) + 
  geom_histogram() + 
  facet_grid(.~home_ownership) + 
  ggtitle("Histogram of Home ownership vs FICO scores")

# State vs  Interest rate % 
state_interestrate <- loans_summary %>% 
                      select(addr_state,int_rate) %>%  
                      group_by(addr_state) %>% 
                      summarise(avg=mean(int_rate))

#state_interestrate

a1 <- list(autotick = TRUE, title = "State")
b1 <- list(title = "Interest rate %")
plot_ly(x = ~state_interestrate$addr_state, y = ~state_interestrate$avg, type = "area") %>% 
  layout(xaxis=a1, yaxis=b1, title="State vs Average Interest Rate")

#remove(a1,b1,state_interestrate)

#Loan Request vs Loan Issued Month
plot_ly(x = ~loans_summary$issue_d) %>% 
  layout(title="Loan Request vs Loan Issued Month")

num_vars <- loans_summary %>% 
              sapply(is.numeric) %>% 
              which() %>% 
              names()

corrplot::corrplot(cor(loans_summary[, num_vars], use = "complete.obs"), 
                   method = "pie", type = "upper", title = "Loan Summary")

# Pie chart of the distribution of Loans across the different status:
sumAmnt(loans_summary, loan_status) %>% 
  merge(sumPerSatus(loans_summary, loan_status)) %>%
  plot_ly(type = "pie", 
          labels = loan_status, 
          values = total_issued, 
          hole = 0.5,
          marker = list(colors = brewer.pal(7, "Pastel2"),
                        line = list(width = 1, color = "rgb(52, 110, 165)")),
          sort = F,
          direction = "counterclockwise",
          rotation = 90,
          textinfo = "label+percent",
          textfont = list(size = 14),
          text = paste("Default rates: ", charged),
          opacity = 1,
          textposition = "outside") %>%
  layout(title = 'LOAN ISSUED GROUPED BY STATUS<br>(Hover for breakdown)',
         height = 500, width = 1400, autosize = T, 
         legend = list(font = list(size = 16), x = 1, y = 1, traceorder = "normal"))

Time Series

#Interest Rate vs Grade
loans_summary_full %>%
  select(int_rate, grade, issue_d) %>%
  group_by(grade, issue_d) %>%
  summarise(int_rate_mean = mean(int_rate, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, int_rate_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~ grade) +
  ggtitle("Histogram of Interest Rate vs Grade")

Interpreting the plot:

The mean interest rate is falling or relatively constant for high-rated clients
The mean interest rate is increasing significantly for low-rated clients

#Loan Amount vs Grade
loans_summary_full %>%
  select(loan_amnt, grade, issue_d) %>%
  group_by(grade, issue_d) %>%
  summarise(loan_amnt_mean = mean(loan_amnt, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, loan_amnt_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~ grade) +
  ggtitle("Histogram of Loan Amount vs Grade")

Deriving a few points from the plot:

The mean loan amount is increasing for all grades
While high-rated clients have some mean loan amount volatility, it is much higher for low-rated clients

Geolocation Plots

#meta_loans <- funModeling::df_status(loans_summary, print_results = FALSE)
#geo_vars <- c("zip_code", "addr_state")
#meta_loans %>% select(variable, p_zeros, p_na, type, unique) %>% filter_(~ variable %in% geo_vars) %>% knitr::kable()

interest_rate_state <- as.data.frame(loans_summary) %>%
                        select(int_rate, addr_state) %>%
                        group_by(addr_state) %>%
                        summarise(mean_int_rate = mean(int_rate, na.rm = TRUE)) %>%
                        rename(state = addr_state, value = mean_int_rate) %>%
                        #mutate(region = tolower(region)) %>%
                        select(state, value)
#colnames(interest_rate_state)
#choroplethr::state_choropleth(df = interest_rate_state, title = "Default rate by State")

plot_usmap(data = interest_rate_state, values = "value", lines = "red") + 
  scale_fill_continuous(name = "Interest Rates (1Q2019)", label = scales::comma) + 
  theme(legend.position = "right") + 
  labs(title = "Distribution of Interest Rates across US states")

loan_amount_state <- as.data.frame(loans_summary_full) %>%
                        select(addr_state) %>%
                        group_by(addr_state) %>%
                        summarise(count_loans = n()) %>%
                        rename(id = addr_state, value = count_loans) %>%
                        mutate(id = tolower(openintro::abbr2state(id))) %>%
                        select(id, value)

# Put the counts into buckets
loan_amount_state$bucket <- NA
loan_amount_state$bucket[loan_amount_state$value < 500] <- "Under 500"
loan_amount_state$bucket[loan_amount_state$value >= 500   & loan_amount_state$value < 1000] <- "500-1k"
loan_amount_state$bucket[loan_amount_state$value >= 1000  & loan_amount_state$value < 5000] <- "1k-5k"
loan_amount_state$bucket[loan_amount_state$value >= 5000  & loan_amount_state$value < 10000] <- "5k-10k"
loan_amount_state$bucket[loan_amount_state$value >= 10000 & loan_amount_state$value < 20000] <- "10k-20k"
loan_amount_state$bucket[loan_amount_state$value >= 20000] <- "20k+"
loan_amount_state$bucket <- factor(loan_amount_state$bucket,
                                   levels = c("Under 500",
                                              "500-1k",
                                              "1k-5k",
                                              "5k-10k",
                                              "10k-20k",
                                              "20k+"))

# Create data frame with longitude and latitude from "fiftystater" package
us <- fiftystater::fifty_states

#kable(head(us,10))
#kable(head(loan_amount_state,10))

# Merge map and count data
us <- merge(us, loan_amount_state, by = "id")

#kable(head(us,10))

# Create a map
ggplot(us) + 
  geom_map(aes(map_id = id, fill = bucket), col = "grey20", size = .2, map = us) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map() +
  scale_fill_brewer("", palette = "PuRd") +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  ggtitle("Distribution of Loan Amounts across US States") +
  theme(legend.text = element_text(family = "Gill Sans MT"),
        legend.position = "bottom",
        panel.background = element_blank(),
        plot.title = element_text(family = "Gill Sans MT"))

Derive a few points from the plot:

Default rate varies across states
We are only looking at default rate but not defaulted exposure (there may be states with many defaults but the defaulted amount is small)
California has the most loan volume

Statistics

Exploratory data analysis suggests below statistics.

Statistic	Variable	Value
Population	Mean Interest rate	12.7205228
Population	SD Interest rate	4.8548739
Sample Statistics	Mean Interest rate	12.709096
Sample Statistics	SD Interest rate	4.8590779

Sample size = 10000

summary(loans_summary)

##  pub_rec_bankruptcies   loan_amnt      funded_amnt    funded_amnt_inv
##  Min.   :0.0000       Min.   : 1000   Min.   : 1000   Min.   :  750  
##  1st Qu.:0.0000       1st Qu.: 8894   1st Qu.: 8894   1st Qu.: 8894  
##  Median :0.0000       Median :15000   Median :15000   Median :15000  
##  Mean   :0.1154       Mean   :16740   Mean   :16740   Mean   :16737  
##  3rd Qu.:0.0000       3rd Qu.:24000   3rd Qu.:24000   3rd Qu.:24000  
##  Max.   :4.0000       Max.   :40000   Max.   :40000   Max.   :40000  
##                                                                      
##      term              int_rate      emp_length        home_ownership    
##  Length:10000       Min.   : 6.46   Length:10000       Length:10000      
##  Class :character   1st Qu.: 8.19   Class :character   Class :character  
##  Mode  :character   Median :11.80   Mode  :character   Mode  :character  
##                     Mean   :12.71                                        
##                     3rd Qu.:15.57                                        
##                     Max.   :30.84                                        
##                                                                          
##    annual_inc           dti          addr_state         installment     
##  Min.   :      0   Min.   :  0.00   Length:10000       Min.   :  30.89  
##  1st Qu.:  48000   1st Qu.: 12.35   Class :character   1st Qu.: 263.09  
##  Median :  70000   Median : 18.66   Mode  :character   Median : 397.94  
##  Mean   :  84543   Mean   : 20.74                      Mean   : 481.50  
##  3rd Qu.: 100000   3rd Qu.: 25.96                      3rd Qu.: 652.58  
##  Max.   :6286908   Max.   :999.00                      Max.   :1618.24  
##                    NA's   :29                                           
##     issue_d           loan_status          purpose         
##  Min.   :2019-01-01   Length:10000       Length:10000      
##  1st Qu.:2019-01-01   Class :character   Class :character  
##  Median :2019-02-01   Mode  :character   Mode  :character  
##  Mean   :2019-01-28                                        
##  3rd Qu.:2019-03-01                                        
##  Max.   :2019-03-01                                        
##                                                            
##     grade             fico_score        term1       total_pymnt   
##  Length:10000       Min.   :662.0   Min.   :36.0   Min.   : 1211  
##  Class :character   1st Qu.:682.0   1st Qu.:36.0   1st Qu.:11938  
##  Mode  :character   Median :702.0   Median :36.0   Median :20841  
##                     Mean   :709.4   Mean   :43.8   Mean   :25196  
##                     3rd Qu.:732.0   3rd Qu.:60.0   3rd Qu.:36002  
##                     Max.   :847.5   Max.   :60.0   Max.   :87108  
##

Confidence interval of the Interest Rate

Point estimate from the sample with the confidence interval is shown below

inference(y=loans_summary$int_rate,est="mean",null=0,alternative="twosided",type="ci",conflevel=0.95,method="theoretical")

## Single mean 
## Summary statistics:

## mean = 12.7091 ;  sd = 4.8591 ;  n = 10000 
## Standard error = 0.0486 
## 95 % Confidence interval = ( 12.6139 , 12.8043 )

For this test lest validate the total sample size required.

s=sd(loans_summary_full$int_rate)
n = ((pnorm(0.025)*s)/0.03)^2

If the margin of error to be 3%, we need to get the samples of around 6810.9383735.

Some states have higher interest rates than another.
Those who have home morgage has higher interest rate than others.
Interest rate increases with funded amount.

Part 4 - Inference

1. Type of loans issues with relation to Interest Rate

Looking at grade which seems to be a rating classification scheme that Lending Club uses to assign loans into risk buckets similar to other popular rating schemes like S&P or Moodys. We know that grades have values of A, B, C, D, E, F, G where A represents the highest quality loan and G the lowest.

give_count <- stat_summary(fun.data = function(x) return(c(y = median(x)*1.06,
                                                         label = length(x))),
               geom = "text")
give_mean <- stat_summary(fun.y = mean, colour = "darkgreen", geom = "point", 
                          shape = 18, size = 3, show.legend = FALSE)

loans_summary %>%
  ggplot(aes(grade, loan_amnt)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ loan_status) +
  labs(title="Loan Amount by Grade for each loan status", x = "Grade", y = "Loan Amount \n")

As per the plot we can infer following;

lower quality loans tend to have a higher loan amount
there are virtually no outliers except for grade B

loans_summary %>%
  ggplot(aes(grade, int_rate)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  labs(title="Interest Rate by Grade", x = "Grade", y = "Interest Rate \n") +
  facet_wrap(~ term)

Above plot clearly indicates below:

Interest rate increases with grade worsening
The 3-year term has a much higher number of high-rated borrowers while the 5-year term has a larger number in the low-rating grades

2. Is loan interest % predictive of credit score and income?

To perform this statement, we will use the linear model to validate it. Here we are performing forward elimination technique to get the maximum adjusted R-squared vale.

loans_lm <- lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc + emp_length + issue_d + pub_rec_bankruptcies + dti,loans_summary ) # 0.3628
summary(loans_lm)

## 
## Call:
## lm(formula = int_rate ~ fico_score + home_ownership + purpose + 
##     term1 + loan_amnt + annual_inc + emp_length + issue_d + pub_rec_bankruptcies + 
##     dti, data = loans_summary)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4759  -2.8083  -0.6065   2.1065  17.9751 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.741e+01  3.020e+01   0.576  0.56431    
## fico_score                -5.672e-02  1.223e-03 -46.381  < 2e-16 ***
## home_ownershipMORTGAGE     8.311e-02  5.860e-01   0.142  0.88723    
## home_ownershipOWN          6.171e-01  5.954e-01   1.036  0.30003    
## home_ownershipRENT         6.458e-01  5.862e-01   1.102  0.27059    
## purposecredit_card        -3.662e+00  4.485e-01  -8.166 3.57e-16 ***
## purposedebt_consolidation -2.051e+00  4.454e-01  -4.604 4.19e-06 ***
## purposehome_improvement   -1.268e+00  4.728e-01  -2.682  0.00732 ** 
## purposehouse              -6.850e-02  6.482e-01  -0.106  0.91584    
## purposemajor_purchase     -3.884e-01  5.571e-01  -0.697  0.48570    
## purposemedical            -8.314e-01  6.072e-01  -1.369  0.17096    
## purposemoving              6.631e-01  7.400e-01   0.896  0.37026    
## purposeother               1.137e-02  4.776e-01   0.024  0.98101    
## purposerenewable_energy   -3.471e-01  1.520e+00  -0.228  0.81930    
## purposesmall_business      6.588e-01  6.149e-01   1.071  0.28404    
## purposevacation            5.548e-01  6.775e-01   0.819  0.41288    
## term1                      1.364e-01  4.025e-03  33.879  < 2e-16 ***
## loan_amnt                 -3.458e-05  4.576e-06  -7.557 4.48e-14 ***
## annual_inc                -1.194e-06  4.412e-07  -2.706  0.00683 ** 
## emp_length1 year           2.211e-03  1.898e-01   0.012  0.99071    
## emp_length10+ years       -2.867e-01  1.376e-01  -2.083  0.03723 *  
## emp_length2 years         -3.444e-01  1.793e-01  -1.920  0.05484 .  
## emp_length3 years         -7.649e-02  1.907e-01  -0.401  0.68830    
## emp_length4 years         -9.185e-02  2.024e-01  -0.454  0.65002    
## emp_length5 years         -1.018e-01  2.055e-01  -0.496  0.62016    
## emp_length6 years          2.333e-03  2.377e-01   0.010  0.99217    
## emp_length7 years         -1.249e-01  2.503e-01  -0.499  0.61794    
## emp_length8 years         -7.755e-02  2.602e-01  -0.298  0.76573    
## emp_length9 years         -3.546e-01  2.945e-01  -1.204  0.22861    
## emp_lengthNA               1.142e-01  1.790e-01   0.638  0.52371    
## issue_d                    1.778e-03  1.683e-03   1.056  0.29094    
## pub_rec_bankruptcies      -3.787e-01  1.298e-01  -2.918  0.00353 ** 
## dti                        2.008e-02  1.823e-03  11.010  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.109 on 9938 degrees of freedom
##   (29 observations deleted due to missingness)
## Multiple R-squared:  0.2876, Adjusted R-squared:  0.2853 
## F-statistic: 125.3 on 32 and 9938 DF,  p-value: < 2.2e-16

#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1,loans_summary)) #0.3458
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt,loans_summary)) # #0.3452
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc ,loans_summary )) #0.3591
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc + addr_state,loans_summary )) #0.3594
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc ,loans_summary )) #0.3591
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc  + emp_length,loans_summary )) #0.3615

After mulitple iterations, we have come to the conclusion that below are the variables that affect interest rate %

fico_score
home_ownership
purpose
term1
loan_amnt
annual_inc
emp_length
issue_d
pub_rec_bankruptcies
dti (debt to income ration)

#positions <- c("AK","AL","AR","AZ", "CA","CO","CT","DC","DE","FL","GA","HI","IA","ID","IL","IN","KS","KY","LA","MA","MD","ME","MI","MN","MO","MS","MT","NC","ND","NE","NH","NJ","NM","NV","NY","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VA","VT","WA","WI","WV","WY")

#ggplot(state_interestrate, aes(x = addr_state, y = avg)) + geom_violin(scale = "width") + scale_x_discrete(limits = positions) + scale_y_continuous(labels = percent) + ggtitle("Distribution of Average Interest Rate aggregated by State") + theme(plot.title = element_text(size = 11), axis.text.x = element_text(angle = 45, hjust = 1)) 

funded_amnt <-
  #loans_summary %>%
  loans_summary_full %>%
  transmute(loan_amnt = loan_amnt, value = funded_amnt, 
              variable = "funded_amnt")

funded_amnt_inv <-
  #loans_summary %>%
  loans_summary_full %>%
  transmute(loan_amnt = loan_amnt, value = funded_amnt_inv, 
              variable = "funded_amnt_inv")

plot_data <- rbind(funded_amnt, funded_amnt_inv)
# remove unnecessary data using regex
#ls()
rm(list = ls()[grep("^funded", ls())])

plot_data %>%
  ggplot(aes(x = loan_amnt, y = value)) +
  facet_wrap(~ variable, scales = "free_x", ncol = 3) +
  geom_point()

there are instances when funded amount is smaller loan amount
there seems to be a number of loans where investment is smaller funded amount i.e. not the full loan is invested in

loans_summary %>%
  ggplot(aes(x = annual_inc, y = loan_amnt)) +
  geom_point()

We are going to validate the conditions for least squared line.

From the below chart, it shows that there is a downward relationship between FICO credit score and interest rate. But the linear model is not very strong due to large number of variability.

#fico score vs interest rate
ggplot(loans_summary,aes(x=fico_score,y=int_rate)) + 
  geom_point(size=1,alpha=0.8) + 
  geom_smooth(method = "lm")  + 
  ggtitle("FICO SCORE vs Interest rate")

Let’s check the residuals normality with histogram and qqplot.

df_residuals <- augment(loans_lm)

#Historgram plot of residuals
ggplot(df_residuals,aes(df_residuals$.resid)) + 
  geom_histogram() + 
  ggtitle("Residual Histogram")

#QQ norm plot of the residuals
qqnorm(loans_lm$residuals)
qqline(loans_lm$residuals)

The plots show that residuals are slightly left skewed. But the residuals are normal.

df_residuals <- filter(df_residuals, .fitted >-17)

ggplot(df_residuals, aes(x=.fitted, y=.resid)) + 
  geom_point(size=1,alpha=0.8) + 
  geom_smooth(method = "lm") + 
  ggtitle("FICO SCORE vs Residuals") 

ggplot(df_residuals, aes(x=.fitted, y=abs(.resid))) + 
  geom_point(size=1,alpha=0.8) + 
  ggtitle("FICO SCORE vs Residuals in Absolute terms")

Above plot shows that there is a constant variability in the chart.

3. Different purpose of loan request

3. Is loan funded amount are equal for different purpose of loan request?

Let’s validate if the purpose of loan interest rate varies or not.

inference(y=loans_summary$int_rate,x=loans_summary$purpose,est="mean",null = 0,alternative = "greater",type="ht",method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## 
## Summary statistics:
## n_car = 87, mean_car = 14.4262, sd_car = 5.1598
## n_credit_card = 2768, mean_credit_card = 11.4543, sd_credit_card = 4.2753
## n_debt_consolidation = 5508, mean_debt_consolidation = 12.969, sd_debt_consolidation = 4.8833
## n_home_improvement = 602, mean_home_improvement = 13.0386, sd_home_improvement = 4.8425
## n_house = 75, mean_house = 13.9613, sd_house = 5.5346
## n_major_purchase = 146, mean_major_purchase = 13.6175, sd_major_purchase = 5.0955
## n_medical = 97, mean_medical = 13.7255, sd_medical = 4.5394
## n_moving = 49, mean_moving = 15.05, sd_moving = 5.8391
## n_other = 503, mean_other = 14.4195, sd_other = 5.4904
## n_renewable_energy = 8, mean_renewable_energy = 14.885, sd_renewable_energy = 3.8127
## n_small_business = 93, mean_small_business = 15.1254, sd_small_business = 5.9583
## n_vacation = 64, mean_vacation = 15.0791, sd_vacation = 5.0107

## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##             Df Sum Sq Mean Sq F value    Pr(>F)
## x           11   8071  733.68  32.139 < 2.2e-16
## Residuals 9988 228012   22.83                  
## 
## Pairwise tests: t tests with pooled SD 
##                       car credit_card debt_consolidation home_improvement
## credit_card        0.0000          NA                 NA               NA
## debt_consolidation 0.0048      0.0000                 NA               NA
## home_improvement   0.0114      0.0000             0.7341               NA
## house              0.5369      0.0000             0.0740           0.1148
## major_purchase     0.2114      0.0000             0.1056           0.1891
## medical            0.3206      0.0000             0.1222           0.1889
## moving             0.4648      0.0000             0.0024           0.0046
## other              0.9904      0.0000             0.0000           0.0000
## renewable_energy   0.7949      0.0426             0.2571           0.2776
## small_business     0.3266      0.0000             0.0000           0.0001
## vacation           0.4067      0.0000             0.0004           0.0012
##                     house major_purchase medical moving  other
## credit_card            NA             NA      NA     NA     NA
## debt_consolidation     NA             NA      NA     NA     NA
## home_improvement       NA             NA      NA     NA     NA
## house                  NA             NA      NA     NA     NA
## major_purchase     0.6124             NA      NA     NA     NA
## medical            0.7482         0.8630      NA     NA     NA
## moving             0.2148         0.0694  0.1137     NA     NA
## other              0.4385         0.0742  0.1903 0.3779     NA
## renewable_energy   0.6032         0.4650  0.5094 0.9278 0.7846
## small_business     0.1165         0.0174  0.0435 0.9288 0.1906
## vacation           0.1693         0.0413  0.0786 0.9744 0.2983
##                    renewable_energy small_business
## credit_card                      NA             NA
## debt_consolidation               NA             NA
## home_improvement                 NA             NA
## house                            NA             NA
## major_purchase                   NA             NA
## medical                          NA             NA
## moving                           NA             NA
## other                            NA             NA
## renewable_energy                 NA             NA
## small_business               0.8914             NA
## vacation                     0.9138         0.9524

Above output shows that interest rate varies for each purpose of loan.

Our Model states that if the purpose of loan is vacation then the interest rate will be higher. Let’s compare with other purpose house

#Compare Credit card loan and debt_consolidation
creditcard_other <- filter(loans_summary,purpose ==c("vacation","house"))
inference(y=creditcard_other$int_rate,x=creditcard_other$purpose,est="mean",null = 0,alternative = "greater",type="ht",order=c("house","vacation"),
          method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_house = 37, mean_house = 13.2668, sd_house = 5.0985
## n_vacation = 33, mean_vacation = 14.9564, sd_vacation = 5.3667

## Observed difference between means (house-vacation) = -1.6896
## 
## H0: mu_house - mu_vacation = 0 
## HA: mu_house - mu_vacation > 0 
## Standard error = 1.255 
## Test statistic: Z =  -1.346 
## p-value =  0.9109

From above we can conclude that the interest rate % is higher for house.

4. State vs interest rate and loan amounts

4. It is always mentioned that living state plays a important role in interest rate. This hypothesis will be validated.

As per the Geolocation Plots under “Distribution of Interest Rates across US States”, interest rate is almost same for all states.
As per the Geolocation Plots under “Distribution of Loan Amounts across US States”, California has highest amount of loans followed by New York, Florida and Texas.

5. Homeownership vs FICO score

5. Does home ownership really impact FICO scores or just by chance?

We want to understand does the home ownership really impact FICO scores. Below we are performing ANOVA between the three variables (RENT VS MORTGAGE VS OWN)

inference(y=loans_summary$fico_score,x=loans_summary$home_ownership,est="mean",null = 0,alternative = "greater",type="ht",method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## 
## Summary statistics:
## n_ANY = 50, mean_ANY = 719.6, sd_ANY = 34.1103
## n_MORTGAGE = 5017, mean_MORTGAGE = 712.2562, sd_MORTGAGE = 35.8328
## n_OWN = 1141, mean_OWN = 711.9522, sd_OWN = 36.6887
## n_RENT = 3792, mean_RENT = 704.7492, sd_RENT = 32.4501

## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##             Df   Sum Sq Mean Sq F value    Pr(>F)
## x            3   135582   45194  37.572 < 2.2e-16
## Residuals 9996 12023965    1203                  
## 
## Pairwise tests: t tests with pooled SD 
##             ANY MORTGAGE OWN
## MORTGAGE 0.1363       NA  NA
## OWN      0.1270   0.7893  NA
## RENT     0.0026   0.0000   0

From above we can conclude that FICO scores are not impacted by home ownership.

6. Loan issue date

6. Did lending club receive equal number of loans in each month?

#Loan request vs month
loans_count<- loans_summary %>% 
              #mutate(loan_month = match(substr(loans_summary$issue_d,1,3),month.abb)) %>%
              mutate(loan_month = format(issue_d,"%B")) %>% 
              group_by(loan_month) %>% 
              count(loan_month) %>% 
              rename(observed_loans=n) %>% 
              filter(!is.na(loan_month))

loans_count <- loans_count %>% 
                mutate(percent= 33.333333) %>% 
                mutate(expected_loans=sum(observed_loans)*(percent/100))

chisq.test(x=loans_count$observed_loans, p=loans_count$percent/100)

## 
##  Chi-squared test for given probabilities
## 
## data:  loans_count$observed_loans
## X-squared = 96.501, df = 2, p-value < 2.2e-16

loans_count

## # A tibble: 3 x 4
## # Groups:   loan_month [3]
##   loan_month observed_loans percent expected_loans
##   <chr>               <int>   <dbl>          <dbl>
## 1 February             3085    33.3          1028.
## 2 January              3796    33.3          1265.
## 3 March                3119    33.3          1040.

From the model and output of chi-test, we may hypothesize that if the loans applied on January may receive higher approved loan amount than other months in Q1.

Amount Funded based on FICO score

#Loan Amount vs. Funded Amount
ggplot(loans_summary, aes(loan_amnt, funded_amnt)) + 
  geom_point(aes(colour = fico_score)) +
  labs(title = "Loan Amount vs. Funded Amount") +
  geom_smooth()

As we can see in the plot above,

Most of the time they approved the loan amount applied by the borrowers.
Although, in few cases they reduced the amount of loan and there could be multiple reasons for that.
Mostly they approved loans for the people with higher borrower scores but in many cases they approved loans for the people with lower borrower scores.

#Installment vs. Funded Amount
ggplot(loans_summary, aes(installment, funded_amnt)) + 
  geom_point(aes(colour = fico_score)) +
  labs(title = "Installment vs. Funded Amount") +
  geom_smooth()

installments for lower loans are lower and installments for higher loans are higher.
There is a range of installments for the same amount of loan - some people paid higher installments and some people paid lower installments with the same amount of loan. Their borrower score could be one of the reasons for that.
Most of the higher installments are from the people with lower borrower scores but still we don’t see any huge difference between the installments of people with lower borrower score vs people with higher borrower score.

ggplot(loans_summary, aes(annual_inc, funded_amnt)) + 
  geom_point(aes(colour = fico_score)) +
  labs(title = "Annual Income vs. Funded Amount") +
  geom_smooth()

It looks like there are some really high income borrowers still applying for loans and borrowing money!

ggplot(loans_summary, aes(purpose, loan_amnt, fill = loan_status)) + 
  geom_boxplot() + 
  labs(title = "Loan Amount vs Purpose for Loan Status") + 
  theme(axis.text.x=element_text(size=8, angle = 90))

Most stated reason for loan was small business followed by credit card and debt consolidation.
More people are with the status fully paid for the house loan than charged off.
House loans are generally applied by a family because the conclusion might indicate that chances of a house loan of being fully paid is higher than any other loans.
In general, small business loans are higher and riskier loans.

DATA 607 02 [15961] : Final Project Presentation

Install Pre-requisite R libraries

## Loading Packages

Part 1 - Introduction

Mission

Background

Project Goal