A few months ago I applied for a Lending Club loan and noticed that the interest rate offered to me differed from the one offered to my colleague. That made me curious about Lending Club (LC). LC is a US peer-to-peer lending company: investors provide the funding and borrowers repay it with interest. Lending Club selects and approves borrowers using many parameters. It is a sort of eBay for loans.
In this project, I try to predict the interest rate from various predictor variables. The analysis addresses the following questions.
What parameters affect my interest rate? That is, can the loan interest rate be predicted from the FICO credit score alone?
Is the loan interest rate equal across different loan purposes? If not, a borrower may prefer to apply under a particular loan type.
It is often said that the state of residence plays an important role in the interest rate. This hypothesis will be tested.
There is a belief that home ownership affects FICO scores. It will be tested with this dataset.
Did Lending Club receive an equal number of loans in each month?
Registered Lending Club users get access to the borrowers' data from the Lending Club website (https://www.lendingclub.com/info/download-data.action).
This dataset has borrower details (personal information is removed). It includes the funded amount, interest rate, FICO credit score and about 115 variables in total. The row count is around 130K for Q1 2016.
For the current analysis, I have taken a simple random sample of 7000 rows. The data are transformed and cleaned before analysis.
Describe the method of data collection.
Registered Lending Club borrowers and investors get access to the borrowers' data from the Lending Club website (https://www.lendingclub.com/info/download-data.action).
LC also provides a dataset of rejected loans, but the current analysis uses only the borrowers dataset.
This dataset comes from an observational study: consumers requested loans from LC. Each row contains a borrower's information (personal information is removed by LC) and the current loan status.
For this project, some analyses use the complete dataset and others use a simple random sample of 7000 rows.
It includes the funded amount, interest rate, FICO credit score and about 115 variables in total. There are around 140K observations for Q1 2016.
For the current analysis, the variables used are listed below.
| LoanStatNew | Description |
|---|---|
| loan_amnt | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. |
| funded_amnt | The total amount committed to that loan at that point in time. |
| int_rate | Interest Rate on the loan |
| emp_length | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. |
| home_ownership | The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER |
| annual_inc | The self-reported annual income provided by the borrower during registration. |
| verification_status | Indicates if income was verified by LC, not verified, or if the income source was verified |
| addr_state | The state provided by the borrower in the loan application |
| issue_d | The month which the loan was funded |
| loan_status | Current status of the loan |
| purpose | A category provided by the borrower for the loan request. |
| total_pymnt_inv | Payments received to date for portion of total amount funded by investors |
| fico_range_high | The upper boundary range the borrower’s FICO at loan origination belongs to. |
| fico_range_low | The lower boundary range the borrower’s FICO at loan origination belongs to. |
| term | The number of payments on the loan. Values are in months and can be either 36 or 60. |
This is an observational study. We will arrive at conclusions by performing the tests below on the variables listed above.
If this information is not available for a borrower, this study will not be applicable.
Also, this analysis applies only to Lending Club and other peer-to-peer lending companies. It may not be applicable to bank interest rates.
As this is an observational study we cannot derive any causal connections between the variables.
# Load the 2016 Q1 data
library(dplyr)    # select, mutate, filter, sample_n, %>%
library(stringr)  # str_trim, str_replace
set.seed(7340)
# 1. Select the required columns
# 2. Convert the column term to numeric
# 3. Remove unwanted columns and recode missing employment length ("n/a") as "NA"
# 4. Calculate the total payment by each customer, adding simple interest over the term
loans_summary_full <- read.csv("C:/Users/paperspace/Google Drive/CUNY/Assignment_Repositories/606 - Project/LoanStats_securev1_2016Q1.csv", header = TRUE, skip = 1, sep = ",", stringsAsFactors = FALSE, skipNul = TRUE) %>%
  select(pub_rec_bankruptcies, loan_amnt, funded_amnt, term, int_rate, emp_length, home_ownership, annual_inc, dti, addr_state, fico_range_low, fico_range_high, issue_d, loan_status, purpose) %>%
  mutate(fico_score = (fico_range_low + fico_range_high) / 2,
         term1 = as.numeric(str_trim(str_replace(term, "months", ""))),
         emp_length = str_replace(emp_length, "n/a", "NA"),
         int_rate = as.numeric(str_replace(as.character(int_rate), "%", "")),
         total_pymnt = funded_amnt + (funded_amnt * int_rate * (term1 / 12)) / 100) %>%
  select(-term, -fico_range_low, -fico_range_high) %>%
  filter(!is.na(int_rate))
loans_summary <- sample_n(loans_summary_full, 7000, replace = FALSE)
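As a reading aid, the `total_pymnt` column computed above corresponds to a simple-interest calculation. Written out, with \(r\) the annual interest rate in percent and \(t\) the term in months (symbols introduced here for illustration):

\[ \text{total\_pymnt} = \text{funded\_amnt}\left(1 + \frac{r}{100}\cdot\frac{t}{12}\right) \]

For example, with hypothetical numbers, a loan of 10,000 at 12% for 36 months gives 10,000 × (1 + 0.12 × 3) = 13,600. This is only an approximation: on an amortized loan the balance declines with each payment, so the interest actually paid would be lower.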
Below are some exploratory data analysis charts to understand more about the data.
The table below summarizes each question with its response and explanatory variables. It also shows whether each variable is numerical or categorical.
| Question | Response Variable | Explanatory Variable |
|---|---|---|
| 1. What parameters will impact my interest rate? That is, can the loan interest % be predicted from the FICO credit score alone? | Interest rate % (Numerical) | FICO Credit score (Numerical), home ownership (Categorical), Purpose (Categorical) |
| 2. Is the loan interest rate % equal for different purposes of loan request? | Interest rate % (Numerical) | Purpose (Categorical) |
| 3. Do different states have equal interest rates? | Interest rate % (Numerical) | States (Categorical) |
| 4. Does home ownership really impact FICO scores, or is the difference just due to chance? | FICO scores (Numerical) | Home Ownership (Categorical) |
| 5. Did Lending Club receive an equal number of loans in each month? | Loan count (Numerical) | - |
#Histogram of Interest rate %
ggplot(loans_summary,aes(x=int_rate)) + geom_histogram(aes(fill=home_ownership),bins=10) + ggtitle("Interest Rate Histogram")
#Interest rate by loan purpose
ggplot(loans_summary,aes(x=purpose,y=int_rate)) + geom_boxplot() + ggtitle("Purpose vs Interest Rate")
#Funded amount vs Interest rate
ggplot(loans_summary,aes(x=funded_amnt,y=int_rate)) + geom_point(aes(color=term1)) + geom_smooth(method="lm")+ ggtitle("Funded amount vs Interest Rate")
#FICO Scores vs Interest rate
ggplot(loans_summary,aes(x=fico_score,y=int_rate)) + geom_point(aes(color=term1)) + geom_smooth(method="lm") + ggtitle("FICO Score vs Interest Rate")
#Home ownership vs FICO scores
ggplot(loans_summary,aes(x=fico_score)) + geom_histogram() + facet_grid(.~home_ownership) + ggtitle("Histogram of FICO score")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# State vs Interest rate %
state_interestrate <- loans_summary %>% select(addr_state,int_rate) %>% group_by(addr_state) %>% summarise(avg =mean(int_rate))
a1 <- list(autotick = TRUE, title = "State")
b1 <- list(title = "Interest rate %")
plot_ly(x = ~state_interestrate$addr_state, y = ~state_interestrate$avg, type = "area") %>% layout(xaxis=a1,yaxis=b1,title="State vs Average interest rate")
remove(a1,b1,state_interestrate)
Exploratory data analysis gives the following statistics.
| Statistic | Variable | Value |
|---|---|---|
| Population | Mean Interest rate | 12.4776311 |
| Population | SD Interest rate | 4.8290031 |
| Sample Statistics | Mean Interest rate | 12.5321657 |
| Sample Statistics | SD Interest rate | 4.8246797 |
Point estimate from the sample with the confidence interval is shown below
inference(y=loans_summary$int_rate,est="mean",null=0,alternative="twosided",type="ci",conflevel=0.95,method="theoretical")
## Single mean
## Summary statistics:
## mean = 12.5322 ; sd = 4.8247 ; n = 7000
## Standard error = 0.0577
## 95 % Confidence interval = ( 12.4191 , 12.6452 )
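The interval reported above can be reproduced by hand from the summary statistics. A short worked version, assuming the normal critical value \(z^{*} \approx 1.96\) for 95% confidence:

\[ \bar{x} \pm z^{*}\frac{s}{\sqrt{n}} = 12.5322 \pm 1.96 \times \frac{4.8247}{\sqrt{7000}} \approx 12.5322 \pm 0.1130 \]

This gives roughly (12.419, 12.645), which agrees with the output. The term \(s/\sqrt{n} \approx 0.0577\) is the standard error shown there.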
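For this estimate, let’s validate the total sample size required.
s=sd(loans_summary_full$int_rate)
# qnorm(0.975) gives the 95% critical value z* (about 1.96)
n = ((qnorm(0.975)*s)/0.03)^2
For a margin of error of 0.03 percentage points at 95% confidence, and the population SD of about 4.829 reported above, we need a sample of around 99,500 observations.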
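For reference, the sample size calculation above rests on the usual margin-of-error formula for a mean. Solving \(ME = z^{*} s / \sqrt{n}\) for \(n\) gives:

\[ n = \left(\frac{z^{*}\, s}{ME}\right)^{2} \]

Here \(s\) is the standard deviation of the interest rate, \(ME\) is the desired margin of error in the same units as the interest rate, and \(z^{*}\) is the critical value for the chosen confidence level (about 1.96 for 95%, which is `qnorm(0.975)` in R). Because \(n\) grows with the square of \(1/ME\), halving the margin of error requires four times as many observations.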
To answer the first question, we fit a linear model. Here we use the forward selection technique to get the maximum adjusted R-squared value.
loans_lm <- lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc + emp_length + issue_d + pub_rec_bankruptcies + dti,loans_summary ) # 0.3628
summary(loans_lm)
##
## Call:
## lm(formula = int_rate ~ fico_score + home_ownership + purpose +
## term1 + loan_amnt + annual_inc + emp_length + issue_d + pub_rec_bankruptcies +
## dti, data = loans_summary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2082 -2.7615 -0.6272 2.1442 25.0656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.724e+01 1.211e+00 39.006 < 2e-16 ***
## fico_score -5.995e-02 1.541e-03 -38.902 < 2e-16 ***
## home_ownershipOWN 5.643e-01 1.498e-01 3.768 0.000166 ***
## home_ownershipRENT 8.014e-01 1.048e-01 7.648 2.32e-14 ***
## purposecredit_card -1.667e+00 4.784e-01 -3.485 0.000494 ***
## purposedebt_consolidation -2.176e-01 4.729e-01 -0.460 0.645473
## purposehome_improvement 3.371e-01 5.072e-01 0.665 0.506323
## purposehouse 2.565e+00 9.159e-01 2.801 0.005115 **
## purposemajor_purchase -1.665e-01 5.645e-01 -0.295 0.768012
## purposemedical 1.043e+00 6.194e-01 1.684 0.092214 .
## purposemoving 1.548e+00 8.182e-01 1.892 0.058515 .
## purposeother 1.681e+00 5.050e-01 3.328 0.000878 ***
## purposerenewable_energy 5.802e+00 2.276e+00 2.549 0.010818 *
## purposesmall_business 3.500e+00 6.647e-01 5.266 1.44e-07 ***
## purposevacation 3.978e-01 7.951e-01 0.500 0.616889
## term1 1.651e-01 4.622e-03 35.718 < 2e-16 ***
## loan_amnt 5.996e-05 6.296e-06 9.524 < 2e-16 ***
## annual_inc -8.864e-06 7.649e-07 -11.589 < 2e-16 ***
## emp_length1 year -1.198e-01 2.409e-01 -0.497 0.619093
## emp_length10+ years -3.223e-01 1.821e-01 -1.769 0.076858 .
## emp_length2 years -1.660e-01 2.261e-01 -0.734 0.462934
## emp_length3 years -2.837e-01 2.274e-01 -1.247 0.212274
## emp_length4 years -7.933e-02 2.520e-01 -0.315 0.752909
## emp_length5 years -2.565e-01 2.494e-01 -1.028 0.303847
## emp_length6 years -1.662e-01 2.778e-01 -0.598 0.549676
## emp_length7 years -5.150e-01 3.055e-01 -1.686 0.091846 .
## emp_length8 years -3.066e-01 2.652e-01 -1.156 0.247703
## emp_length9 years -1.969e-01 2.953e-01 -0.667 0.504825
## emp_lengthNA 8.164e-01 2.454e-01 3.327 0.000883 ***
## issue_dJan-2016 -3.414e-01 1.260e-01 -2.710 0.006750 **
## issue_dMar-2016 9.206e-02 1.092e-01 0.843 0.399448
## pub_rec_bankruptcies -2.038e-01 1.209e-01 -1.685 0.091983 .
## dti 3.165e-04 3.862e-04 0.819 0.412613
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.851 on 6967 degrees of freedom
## Multiple R-squared: 0.3657, Adjusted R-squared: 0.3628
## F-statistic: 125.5 on 32 and 6967 DF, p-value: < 2.2e-16
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1,loans_summary)) #0.3458
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt,loans_summary)) # #0.3452
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc ,loans_summary )) #0.3591
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc + addr_state,loans_summary )) #0.3594
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc ,loans_summary )) #0.3591
#summary(lm(int_rate ~ fico_score + home_ownership + purpose + term1 + loan_amnt + annual_inc + emp_length,loans_summary )) #0.3615
After multiple iterations, we have come to the conclusion that the variables below are associated with the interest rate %.
fico_score
home_ownership
purpose
term1
loan_amnt
annual_inc
emp_length
issue_d
pub_rec_bankruptcies
dti (debt-to-income ratio)
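The forward selection described above was carried out by hand, one `lm` call at a time. As a rough sketch of how the same idea could be automated, the function below adds, at each step, the predictor that most improves adjusted R-squared and stops when nothing improves it. The function name `forward_select` is hypothetical, and the example runs on R's built-in `mtcars` data rather than the loan data, so it only illustrates the technique.

```r
# Forward selection by adjusted R-squared (illustrative sketch on built-in data)
forward_select <- function(data, response, candidates) {
  selected <- character(0)
  best_adj <- -Inf
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Adjusted R-squared of the current model plus each remaining candidate
    scores <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    })
    # Stop when no candidate improves the adjusted R-squared
    if (max(scores) <= best_adj) break
    best_adj <- max(scores)
    selected <- c(selected, names(which.max(scores)))
  }
  list(variables = selected, adj_r_squared = best_adj)
}

result <- forward_select(mtcars, "mpg", c("wt", "hp", "qsec", "drat", "cyl"))
print(result)
```

With the loan sample, the call would take `loans_summary`, `"int_rate"` and the candidate column names instead.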
We are going to validate the conditions for the least squares line.
The chart below shows a downward relationship between FICO credit score and interest rate. But the linear relationship is not very strong because of the large variability.
The correlation between the two variables is around -0.386.
#fico score vs interest rate
ggplot(loans_summary,aes(x=fico_score,y=int_rate)) + geom_point(size=1,alpha=0.8) + geom_smooth(method = "lm") + ggtitle("FICO SCORE vs Interest rate")
cor(loans_summary$fico_score,loans_summary$int_rate)
## [1] -0.3858804
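To put this correlation in context: in a simple linear regression with a single predictor, the R-squared equals the square of the correlation coefficient. A quick worked calculation:

\[ R^{2} = r^{2} = (-0.3859)^{2} \approx 0.149 \]

So FICO score on its own would explain roughly 15% of the variability in interest rate, compared with an adjusted R-squared of 0.3628 for the full model.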
df_residuals <- augment(loans_lm)
Let’s check the normality of the residuals with a histogram and a Q-Q plot.
#Histogram plot of residuals
ggplot(df_residuals,aes(x=.resid)) + geom_histogram() + ggtitle("Residual Histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#QQ norm plot of the residuals
qqnorm(loans_lm$residuals)
qqline(loans_lm$residuals)
The plots show that the residuals are slightly right skewed (they range from -9.2 to 25.1 with a median of -0.63), but they are close to normal.
df_residuals <- filter(df_residuals,.fitted >-17)
ggplot(df_residuals,aes(x=.fitted,y=.resid)) + geom_point(size=1,alpha=0.8) + geom_smooth(method = "lm") + ggtitle("Fitted values vs Residuals")
ggplot(df_residuals,aes(x=.fitted,y=abs(.resid))) + geom_point(size=1,alpha=0.8) + ggtitle("Fitted values vs Absolute residuals")
The plots above show roughly constant variability of the residuals. The model explains 36.28% of the variability in interest rate %.
2. Is the interest rate equal for different purposes of loan request?
Let’s test whether the interest rate varies with the purpose of the loan.
#Hypothesis test between purpose
inference(y=loans_summary$int_rate,x=loans_summary$purpose,est="mean",null = 0,alternative = "greater",type="ht",method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
##
## Summary statistics:
## n_car = 68, mean_car = 12.3943, sd_car = 4.7805
## n_credit_card = 1661, mean_credit_card = 11.0786, sd_credit_card = 4.1975
## n_debt_consolidation = 4051, mean_debt_consolidation = 12.9649, sd_debt_consolidation = 4.8732
## n_home_improvement = 410, mean_home_improvement = 12.0454, sd_home_improvement = 5.1758
## n_house = 24, mean_house = 15.2312, sd_house = 5.629
## n_major_purchase = 150, mean_major_purchase = 11.8111, sd_major_purchase = 4.8367
## n_medical = 90, mean_medical = 13.2814, sd_medical = 4.7381
## n_moving = 33, mean_moving = 13.7994, sd_moving = 4.44
## n_other = 407, mean_other = 14.0046, sd_other = 4.8329
## n_renewable_energy = 3, mean_renewable_energy = 16.5433, sd_renewable_energy = 5.0373
## n_small_business = 67, mean_small_business = 15.8237, sd_small_business = 5.2839
## n_vacation = 36, mean_vacation = 11.77, sd_vacation = 3.3051
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 11 6401 581.87 25.978 < 2.2e-16
## Residuals 6988 156519 22.40
##
## Pairwise tests: t tests with pooled SD
## car credit_card debt_consolidation home_improvement
## credit_card 0.0247 NA NA NA
## debt_consolidation 0.3241 0.0000 NA NA
## home_improvement 0.5734 0.0002 0.0002 NA
## house 0.0116 0.0000 0.0194 0.0014
## major_purchase 0.3994 0.0695 0.0034 0.6040
## medical 0.2434 0.0000 0.5303 0.0249
## moving 0.1617 0.0011 0.3131 0.0406
## other 0.0094 0.0000 0.0000 0.0000
## renewable_energy 0.1373 0.0457 0.1905 0.1010
## small_business 0.0000 0.0000 0.0000 0.0000
## vacation 0.5222 0.3858 0.1315 0.7378
## house major_purchase medical moving other
## credit_card NA NA NA NA NA
## debt_consolidation NA NA NA NA NA
## home_improvement NA NA NA NA NA
## house NA NA NA NA NA
## major_purchase 0.0010 NA NA NA NA
## medical 0.0730 0.0198 NA NA NA
## moving 0.2595 0.0289 0.5907 NA NA
## other 0.2173 0.0000 0.1896 0.8107 NA
## renewable_energy 0.6508 0.0864 0.2403 0.3364 0.3546
## small_business 0.5987 0.0000 0.0009 0.0443 0.0036
## vacation 0.0055 0.9626 0.1054 0.0752 0.0066
## renewable_energy small_business
## credit_card NA NA
## debt_consolidation NA NA
## home_improvement NA NA
## house NA NA
## major_purchase NA NA
## medical NA NA
## moving NA NA
## other NA NA
## renewable_energy NA NA
## small_business 0.7967 NA
## vacation 0.0933 0
Above output shows that interest rate varies for each purpose of loan.
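One caution when reading the pairwise table: with 12 purposes there are many comparisons, so some small p-values are expected by chance alone. If the pairwise p-values above are unadjusted (an assumption, since the output does not say), a Bonferroni correction is a common safeguard. With \(k = 12\) groups:

\[ K = \frac{k(k-1)}{2} = \frac{12 \times 11}{2} = 66, \qquad \alpha^{*} = \frac{\alpha}{K} = \frac{0.05}{66} \approx 0.00076 \]

Under this stricter threshold, only pairs with a p-value below about 0.00076 would be treated as significantly different.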
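Our model suggests that the interest rate is higher when the purpose of the loan is house than when it is vacation. Let’s compare these two purposes directly.
#Compare vacation loans and house loans
# Note: == against a length-2 vector recycles it row by row, so only part of the house and vacation rows are kept (13 per group in the output below); %in% would keep all of them
creditcard_other <- filter(loans_summary,purpose ==c("vacation","house"))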
inference(y=creditcard_other$int_rate,x=creditcard_other$purpose,est="mean",null = 0,alternative = "greater",type="ht",order=c("house","vacation"),
method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_house = 13, mean_house = 16.3769, sd_house = 5.6391
## n_vacation = 13, mean_vacation = 12.5723, sd_vacation = 4.3593
## Observed difference between means (house-vacation) = 3.8046
##
## H0: mu_house - mu_vacation = 0
## HA: mu_house - mu_vacation > 0
## Standard error = 1.977
## Test statistic: T = 1.925
## Degrees of freedom: 12
## p-value = 0.0392
In the output, the p-value (0.0392) is below the 5% significance level. So we reject the null hypothesis and conclude that the mean interest rate is higher for house loans than for vacation loans.
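A note on the subsetting step used for this comparison: comparing a column with `==` against a vector of two values recycles that vector along the rows, which keeps only some of the matching rows. This is why the groups above have 13 rows each, while the ANOVA summary lists 24 house and 36 vacation loans. The toy example below, with a made-up `purpose` vector, shows the difference between `==` and `%in%`.

```r
# Made-up purposes, for illustration only
purpose <- c("house", "vacation", "house", "vacation", "car", "house")

# == recycles c("vacation", "house") along the vector: odd positions are
# compared with "vacation" and even positions with "house"
recycled <- purpose == c("vacation", "house")

# %in% tests each element for membership in the set
matched <- purpose %in% c("vacation", "house")

print(sum(recycled))  # 1
print(sum(matched))   # 5
```

Using `%in%` inside `filter()` would keep every house and vacation loan in the sample, and the test statistics would then change accordingly.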
3. It is often said that the state of residence plays an important role in the interest rate. This hypothesis will be tested.
Null hypothesis \(H0\): The mean interest rate is the same for all states.
Alternative hypothesis \(HA\): The mean interest rate is different for at least one state.
inference(y=loans_summary$int_rate,x=loans_summary$addr_state,est="mean",null = 0,alternative = "greater",type="ht",method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
##
## Summary statistics:
## n_AK = 34, mean_AK = 12.0197, sd_AK = 4.747
## n_AL = 95, mean_AL = 12.6826, sd_AL = 5.1936
## n_AR = 47, mean_AR = 11.9881, sd_AR = 3.9117
## n_AZ = 134, mean_AZ = 12.4653, sd_AZ = 4.6191
## n_CA = 965, mean_CA = 12.1806, sd_CA = 4.9903
## n_CO = 142, mean_CO = 12.8249, sd_CO = 5.0707
## n_CT = 104, mean_CT = 12.8195, sd_CT = 4.9946
## n_DC = 22, mean_DC = 10.9991, sd_DC = 3.7911
## n_DE = 26, mean_DE = 13.3712, sd_DE = 4.2513
## n_FL = 498, mean_FL = 12.5023, sd_FL = 4.7869
## n_GA = 216, mean_GA = 12.4691, sd_GA = 4.8793
## n_HI = 39, mean_HI = 12.7928, sd_HI = 5.4609
## n_ID = 26, mean_ID = 14.5304, sd_ID = 5.0331
## n_IL = 262, mean_IL = 12.7392, sd_IL = 4.8206
## n_IN = 124, mean_IN = 12.2962, sd_IN = 4.8372
## n_KS = 64, mean_KS = 12.6522, sd_KS = 4.6001
## n_KY = 64, mean_KY = 13.0127, sd_KY = 4.4486
## n_LA = 92, mean_LA = 11.9349, sd_LA = 3.8755
## n_MA = 149, mean_MA = 12.6195, sd_MA = 4.7527
## n_MD = 152, mean_MD = 12.7582, sd_MD = 4.8841
## n_ME = 29, mean_ME = 12.8183, sd_ME = 5.3032
## n_MI = 203, mean_MI = 12.2267, sd_MI = 4.6473
## n_MN = 115, mean_MN = 12.3316, sd_MN = 4.3078
## n_MO = 115, mean_MO = 12.8459, sd_MO = 4.9702
## n_MS = 43, mean_MS = 13.2798, sd_MS = 5.4124
## n_MT = 19, mean_MT = 12.1589, sd_MT = 5.1096
## n_NC = 209, mean_NC = 12.6026, sd_NC = 4.7808
## n_ND = 21, mean_ND = 14.031, sd_ND = 4.9149
## n_NE = 29, mean_NE = 12.7214, sd_NE = 4.4401
## n_NH = 35, mean_NH = 12.3209, sd_NH = 5.9575
## n_NJ = 259, mean_NJ = 12.5314, sd_NJ = 5.0718
## n_NM = 49, mean_NM = 12.031, sd_NM = 3.825
## n_NV = 97, mean_NV = 12.1941, sd_NV = 4.7798
## n_NY = 542, mean_NY = 12.5022, sd_NY = 4.5085
## n_OH = 249, mean_OH = 12.0657, sd_OH = 4.399
## n_OK = 82, mean_OK = 12.7751, sd_OK = 4.5988
## n_OR = 88, mean_OR = 11.9642, sd_OR = 4.529
## n_PA = 223, mean_PA = 13.3944, sd_PA = 5.1073
## n_RI = 24, mean_RI = 13.6021, sd_RI = 5.4027
## n_SC = 74, mean_SC = 13.0695, sd_SC = 5.173
## n_SD = 22, mean_SD = 12.93, sd_SD = 5.238
## n_TN = 121, mean_TN = 13.1315, sd_TN = 5.5413
## n_TX = 543, mean_TX = 12.6216, sd_TX = 4.7932
## n_UT = 47, mean_UT = 12.087, sd_UT = 4.662
## n_VA = 205, mean_VA = 12.732, sd_VA = 5.0258
## n_VT = 9, mean_VT = 14.6511, sd_VT = 6.1342
## n_WA = 148, mean_WA = 12.7499, sd_WA = 4.7428
## n_WI = 109, mean_WI = 12.1703, sd_WI = 4.8583
## n_WV = 25, mean_WV = 13.0212, sd_WV = 5.4999
## n_WY = 11, mean_WY = 11.4091, sd_WY = 5.0787
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 49 1010 20.611 0.8847 0.7001
## Residuals 6950 161910 23.296
The high p-value (0.7001) means that we fail to reject the null hypothesis. The data do not provide evidence that the mean interest rate differs between states.
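The F value in the ANOVA table can be checked directly from the two mean squares it reports. With \(MSG\) the mean square between states and \(MSE\) the residual mean square:

\[ F = \frac{MSG}{MSE} = \frac{20.611}{23.296} \approx 0.885, \qquad df = (49,\ 6950) \]

An F value below 1 means the variation between state means is smaller than the typical variation within a state, which is consistent with the large p-value.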
4. Does home ownership really impact FICO scores or just by chance?
We want to understand whether home ownership is really related to FICO scores. Below we perform an ANOVA across the three home ownership groups (RENT vs MORTGAGE vs OWN).
Null hypothesis \(H0\): Mean FICO scores are the same across home ownership groups. Scores do not depend on home ownership.
Alternative hypothesis \(HA\): The mean FICO score is different for at least one home ownership group.
#Hypothesis test
inference(y=loans_summary$fico_score,x=loans_summary$home_ownership,est="mean",null = 0,alternative = "greater",type="ht",method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Summary statistics:
## n_MORTGAGE = 3470, mean_MORTGAGE = 700.4973, sd_MORTGAGE = 32.4716
## n_OWN = 838, mean_OWN = 698.9153, sd_OWN = 32.4277
## n_RENT = 2692, mean_RENT = 693.2351, sd_RENT = 28.3526
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 2 81815 40907 42.714 < 2.2e-16
## Residuals 6997 6701103 958
##
## Pairwise tests: t tests with pooled SD
## MORTGAGE OWN
## OWN 0.1842 NA
## RENT 0.0000 0
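In the output, the p-value is low. So we reject the null hypothesis and conclude that mean FICO scores differ by home ownership. The pairwise tests show that renters differ from both mortgage holders and owners, while mortgage holders and owners do not differ significantly (p = 0.1842). As this is an observational study, this is an association and not a causal effect.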
5. Did lending club receive equal number of loans in each month?
Our model also suggests that the loan interest rate depends on the issue month. So let’s check whether the number of loans is the same in each month.
\(H0\): There is no inconsistency between the observed and expected counts. Observed counts follow the same distribution as expected counts.
\(H1\): There is an inconsistency between the observed and expected counts. Observed counts do not follow the same distribution as expected counts. Some months differ from others.
#Loan request vs month
loans_count<-
loans_summary %>% mutate(loan_month = match(substr(loans_summary$issue_d,1,3),month.abb)) %>% group_by(loan_month) %>% count(loan_month) %>% rename(observed_loans =n) %>% filter(!is.na(loan_month))
loans_count <- loans_count %>% mutate(percent= 33.333333) %>% mutate(expected_loans=sum(observed_loans)*(percent/100))
chisq.test(x=loans_count$observed_loans, p=loans_count$percent/100)
##
## Chi-squared test for given probabilities
##
## data: loans_count$observed_loans
## X-squared = 526.21, df = 2, p-value < 2.2e-16
loans_count
## # A tibble: 3 × 4
## loan_month observed_loans percent expected_loans
## <int> <int> <dbl> <dbl>
## 1 1 1738 33.33333 2333.333
## 2 2 2041 33.33333 2333.333
## 3 3 3221 33.33333 2333.333
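The p-value is less than the 5% significance level, so we reject the null hypothesis: Lending Club did not issue an equal number of loans in each month, and March had the most. Separately, the regression model above suggests that loans issued in January carry a slightly lower interest rate than loans issued in February (coefficient -0.34), holding the other variables fixed.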
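The chi-square statistic reported above can be reproduced by hand from the three monthly counts. The sketch below uses only base R and the observed counts from the table; the variable names are chosen here for illustration.

```r
# Observed loan counts for January, February and March (from the table above)
observed <- c(1738, 2041, 3221)

# Under H0 each month receives one third of all loans
expected <- rep(sum(observed) / 3, 3)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(round(chi_sq, 2))  # 526.21
print(df)                # 2

# Cross-check against the built-in test (equal probabilities are the default)
builtin <- chisq.test(observed)
```

March contributes the largest share of the statistic, because its observed count (3221) is furthest from the expected 2333.33.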
The Lending Club dataset is an interesting one. It is often very difficult to get insight into how a bank sets interest rates. This analysis provides useful information about the interest rate that Lending Club offers each borrower.
The interest rate a borrower receives is associated with factors such as FICO score, home ownership, purpose of the loan, term length of the loan, loan amount requested, annual income, employment length, issue month, previous bankruptcies and debt-to-income ratio.
A person who wants a good interest rate should pay attention to the above factors before applying for a Lending Club loan.
From the research questions, I learned that we need to perform various tests and add visualizations to understand the patterns in Lending Club borrower data and to explain the interest rate. The rate does not follow in a straightforward way from FICO scores alone. There are also many financial terms to learn when dealing with loans.
Future analysis can add more variables to the model, for example to identify which issued loans are at risk or have the status Charged Off. We can also calculate the rate of return for an investor.
This analysis can also be useful to anyone who wants to invest in Lending Club and be a successful investor.