Introduction
This data set contains loan data from https://www.kaggle.com/itssuru/loan-data. It includes fields such as FICO score, interest rate, installment, and loan purpose, among others.
This analysis applies several exploratory data analysis methods to the loan data to understand how FICO score is correlated with the other factors in the data set.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df <- read.csv("loan_data.csv")
Linear Regression
Here we perform regression analysis. Regression analysis is the process of building a model that uses one or more independent variables to describe a dependent variable.
tail(df)
## credit.policy purpose int.rate installment log.annual.inc dti
## 9573 0 debt_consolidation 0.1565 69.98 10.11047 7.02
## 9574 0 all_other 0.1461 344.76 12.18075 10.39
## 9575 0 all_other 0.1253 257.70 11.14186 0.21
## 9576 0 debt_consolidation 0.1071 97.81 10.59663 13.09
## 9577 0 home_improvement 0.1600 351.58 10.81978 19.18
## 9578 0 debt_consolidation 0.1392 853.43 11.26446 16.28
## fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs
## 9573 662 8190.042 2999 39.5 6 0
## 9574 672 10474.000 215372 82.1 2 0
## 9575 722 4380.000 184 1.1 5 0
## 9576 687 3450.042 10036 82.9 8 0
## 9577 692 1800.000 0 3.2 5 0
## 9578 732 4740.000 37879 57.0 6 0
## pub.rec not.fully.paid
## 9573 0 1
## 9574 0 1
## 9575 0 1
## 9576 0 1
## 9577 0 1
## 9578 0 1
The graph above plots FICO score against interest rate for our data set, along with the fitted regression line. It looks like a reasonable linear fit because the points cluster around the straight line.
linear <- lm(fico ~ int.rate, df)
linear
##
## Call:
## lm(formula = fico ~ int.rate, data = df)
##
## Coefficients:
## (Intercept) int.rate
## 834.8 -1011.0
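Printing the model shows only the coefficients: an intercept of 834.8 and a slope of -1011.0, so a higher interest rate goes with a lower FICO score. The model summary, from summary(linear), reports an adjusted R-squared of .51, which means the model explains only 51% of the variance in FICO score. The p-value is lower than .05, which shows that the model is statistically significant.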
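As a rough sketch of how the printed coefficients turn into a prediction, the snippet below plugs an interest rate into the fitted line by hand. The coefficient values are copied from the output above; the example rate of 0.12 and the helper name predict_fico are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the printed model above (rounded as shown there)
intercept <- 834.8
slope <- -1011.0

# Hypothetical helper: predicted FICO score for a given interest rate
predict_fico <- function(int_rate) {
  intercept + slope * int_rate
}

# Example rate chosen for illustration only
predict_fico(0.12)

# A one-percentage-point rise in the rate (0.01) shifts the prediction by
# slope * 0.01, i.e. about 10 FICO points downward
slope * 0.01
```

With the fitted object available, predict(linear, data.frame(int.rate = 0.12)) should give a similar value, differing only by the rounding of the printed coefficients.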
plot(linear)
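Calling plot() on the model produces the standard diagnostic plots for the regression above: residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage.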
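The diagnostic plots can also be drawn together on one page. The sketch below assumes simulated data standing in for the loan data (the sim data frame and its parameters are made up for illustration), so the shapes it produces are not those of the real data set.

```r
# Simulated stand-in data (an assumption; not the loan data set)
set.seed(1)
sim <- data.frame(int.rate = runif(200, 0.06, 0.22))
sim$fico <- 835 - 1010 * sim$int.rate + rnorm(200, sd = 25)

fit <- lm(fico ~ int.rate, data = sim)

# Arrange the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)  # restore the previous plotting layout

# The residuals behind those plots can also be inspected directly
summary(residuals(fit))
```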
Regression With No Intercept
linearNoIntercept <- lm(fico ~ 0 + int.rate, df)
summary(linearNoIntercept)
##
## Call:
## lm(formula = fico ~ 0 + int.rate, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -524.94 -86.21 31.40 157.54 431.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## int.rate 5484.94 14.69 373.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 180.5 on 9577 degrees of freedom
## Multiple R-squared: 0.9357, Adjusted R-squared: 0.9357
## F-statistic: 1.394e+05 on 1 and 9577 DF, p-value: < 2.2e-16
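This no-intercept model reports an adjusted R-squared of .93, but that figure is not comparable with the earlier one. When the intercept is removed, R computes R-squared against zero instead of against the mean FICO score, which inflates it. The residual standard error of 180.5 FICO points is far larger than the roughly 25 points of the models with an intercept fitted below, and the int.rate coefficient has flipped to a positive sign. So despite the statistically significant p-value, this is a poor model for this data.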
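The effect of dropping the intercept can be reproduced on simulated data. This is a minimal sketch under assumed parameters (the sim data frame, its slope, and its noise level are invented for illustration); it compares R-squared and residual standard error for the same data fitted with and without an intercept.

```r
# Simulated stand-in data (an assumption; not the loan data set)
set.seed(42)
n <- 500
sim <- data.frame(int.rate = runif(n, 0.06, 0.22))
sim$fico <- 835 - 1010 * sim$int.rate + rnorm(n, sd = 40)

with_int <- lm(fico ~ int.rate, data = sim)
no_int <- lm(fico ~ 0 + int.rate, data = sim)

# R-squared as reported by summary(): not comparable across the two fits,
# because the no-intercept version is measured against zero
r2_with <- summary(with_int)$r.squared
r2_none <- summary(no_int)$r.squared

# Residual standard error is on the same scale (FICO points) for both fits
rse_with <- summary(with_int)$sigma
rse_none <- summary(no_int)$sigma

c(r2_with = r2_with, r2_none = r2_none)
c(rse_with = rse_with, rse_none = rse_none)
```

In this setup the no-intercept fit reports the higher R-squared while also having the much larger residual error, which is why residual standard error is the safer number to compare.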
Multiple Regression
Next we fit several multiple regression models to see which one describes FICO score best.
multiple <- lm(fico ~ int.rate + purpose, df)
summary(multiple)
##
## Call:
## lm(formula = fico ~ int.rate + purpose, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126.132 -16.662 -1.858 14.502 137.354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 836.7889 1.2797 653.889 < 2e-16 ***
## int.rate -1040.0303 9.9789 -104.223 < 2e-16 ***
## purposecredit_card -2.6308 0.8932 -2.946 0.003232 **
## purposedebt_consolidation -1.2549 0.6741 -1.862 0.062669 .
## purposeeducational -3.7989 1.4775 -2.571 0.010148 *
## purposehome_improvement 10.2274 1.1477 8.911 < 2e-16 ***
## purposemajor_purchase 4.4557 1.3317 3.346 0.000823 ***
## purposesmall_business 26.8145 1.1743 22.833 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.54 on 9570 degrees of freedom
## Multiple R-squared: 0.5478, Adjusted R-squared: 0.5475
## F-statistic: 1656 on 7 and 9570 DF, p-value: < 2.2e-16
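This model explains only about 55% of the variance (adjusted R-squared 0.5475). The sign of each coefficient estimate shows the direction of the relationship: a positive estimate means the predictor is associated with a higher FICO score, and a negative estimate with a lower one. The purpose coefficients are measured relative to the baseline category, all_other. We probably shouldn’t use this model unless no better model is available.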
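The baseline category matters when reading the purpose coefficients. The sketch below uses simulated data with three made-up purpose levels (an assumption for illustration) to show that the first factor level gets no coefficient of its own, and that relevel() changes which level the others are compared against.

```r
# Simulated stand-in data (an assumption; not the loan data set)
set.seed(7)
purposes <- c("all_other", "credit_card", "small_business")
sim <- data.frame(
  int.rate = runif(300, 0.06, 0.22),
  purpose = factor(sample(purposes, 300, replace = TRUE))
)
# Assumed effect: small_business loans sit 25 points above the others
sim$fico <- 835 - 1010 * sim$int.rate +
  25 * (sim$purpose == "small_business") + rnorm(300, sd = 25)

fit <- lm(fico ~ int.rate + purpose, data = sim)
names(coef(fit))  # the first level, all_other, is absorbed into the intercept

# Changing the reference level changes what each coefficient is compared to
sim$purpose2 <- relevel(sim$purpose, ref = "credit_card")
fit2 <- lm(fico ~ int.rate + purpose2, data = sim)
names(coef(fit2))
```

Both fits give the same fitted values; only the interpretation of the individual purpose coefficients changes.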
multiple2 <- lm(fico ~ int.rate:purpose, df)
summary(multiple2)
##
## Call:
## lm(formula = fico ~ int.rate:purpose, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128.281 -16.595 -1.861 14.418 125.659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 839.542 1.241 676.37 <2e-16 ***
## int.rate:purposeall_other -1065.091 10.998 -96.84 <2e-16 ***
## int.rate:purposecredit_card -1084.823 11.529 -94.10 <2e-16 ***
## int.rate:purposedebt_consolidation -1071.007 9.949 -107.65 <2e-16 ***
## int.rate:purposeeducational -1093.994 14.904 -73.40 <2e-16 ***
## int.rate:purposehome_improvement -986.752 13.090 -75.38 <2e-16 ***
## int.rate:purposemajor_purchase -1031.325 14.586 -70.71 <2e-16 ***
## int.rate:purposesmall_business -853.114 11.188 -76.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.4 on 9570 degrees of freedom
## Multiple R-squared: 0.5529, Adjusted R-squared: 0.5525
## F-statistic: 1690 on 7 and 9570 DF, p-value: < 2.2e-16
This model explains about 55% of the variance (adjusted R-squared 0.5525), only slightly more than the previous model. Here each loan purpose gets its own interest-rate slope, and every one of them is negative: a higher interest rate is associated with a lower FICO score for all purposes. We probably shouldn’t use this model unless no better model is available.
multiple3 <- lm(fico ~ int.rate*purpose, df)
summary(multiple3)
##
## Call:
## lm(formula = fico ~ int.rate * purpose, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.009 -16.359 -1.921 14.349 114.176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 842.925 2.312 364.593 < 2e-16 ***
## int.rate -1092.579 19.285 -56.655 < 2e-16 ***
## purposecredit_card -3.875 4.191 -0.925 0.3552
## purposedebt_consolidation -5.725 3.119 -1.836 0.0664 .
## purposeeducational -5.124 6.731 -0.761 0.4465
## purposehome_improvement 21.051 5.078 4.145 3.42e-05 ***
## purposemajor_purchase 9.624 5.671 1.697 0.0898 .
## purposesmall_business -39.469 5.161 -7.647 2.25e-14 ***
## int.rate:purposecredit_card 11.690 34.476 0.339 0.7346
## int.rate:purposedebt_consolidation 39.387 25.202 1.563 0.1181
## int.rate:purposeeducational 12.425 54.969 0.226 0.8212
## int.rate:purposehome_improvement -91.782 42.177 -2.176 0.0296 *
## int.rate:purposemajor_purchase -46.460 48.138 -0.965 0.3345
## int.rate:purposesmall_business 488.123 37.874 12.888 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.28 on 9564 degrees of freedom
## Multiple R-squared: 0.5575, Adjusted R-squared: 0.5569
## F-statistic: 926.9 on 13 and 9564 DF, p-value: < 2.2e-16
This model explains about 56% of the variance (adjusted R-squared 0.5569), which is not a meaningful improvement over the previous one. Most of the interaction terms are not statistically significant; only home_improvement and small_business have p-values below .05. As before, a positive estimate indicates a positive relationship with FICO score and a negative estimate a negative one. We probably shouldn’t use this model unless no better model is available.
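The difference between the `:` and `*` formulas used above can be seen in the columns each one generates. This is a small sketch on simulated data with two made-up purpose levels (an assumption for illustration); it also shows one way to test whether the interaction terms add anything over the additive model.

```r
# Simulated stand-in data (an assumption; not the loan data set)
set.seed(3)
sim <- data.frame(
  int.rate = runif(200, 0.06, 0.22),
  purpose = factor(sample(c("all_other", "credit_card"), 200, replace = TRUE))
)
sim$fico <- 835 - 1010 * sim$int.rate + rnorm(200, sd = 25)

# a:b includes only the interaction terms; a*b expands to a + b + a:b
cols_colon <- colnames(model.matrix(fico ~ int.rate:purpose, data = sim))
cols_star <- colnames(model.matrix(fico ~ int.rate * purpose, data = sim))
cols_colon
cols_star

# The additive model is nested in the full interaction model, so an F-test
# can check whether the interaction terms improve the fit
additive <- lm(fico ~ int.rate + purpose, data = sim)
full <- lm(fico ~ int.rate * purpose, data = sim)
anova(additive, full)
```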
multiple4 <- lm(fico ~.,df)
summary(multiple4)
##
## Call:
## lm(formula = fico ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92.150 -13.538 -1.333 12.021 110.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.067e+02 4.642e+00 173.797 < 2e-16 ***
## credit.policy 1.033e+01 6.732e-01 15.342 < 2e-16 ***
## purposecredit_card -3.196e+00 7.336e-01 -4.357 1.33e-05 ***
## purposedebt_consolidation -3.854e+00 5.632e-01 -6.842 8.29e-12 ***
## purposeeducational -2.694e+00 1.188e+00 -2.267 0.023423 *
## purposehome_improvement 1.942e+00 9.318e-01 2.084 0.037190 *
## purposemajor_purchase 1.825e+00 1.068e+00 1.708 0.087750 .
## purposesmall_business 1.298e+01 9.645e-01 13.459 < 2e-16 ***
## int.rate -8.244e+02 1.039e+01 -79.326 < 2e-16 ***
## installment 4.194e-02 1.243e-03 33.746 < 2e-16 ***
## log.annual.inc -5.595e-01 4.266e-01 -1.312 0.189718
## dti -1.497e-01 3.363e-02 -4.453 8.57e-06 ***
## days.with.cr.line 2.196e-03 9.304e-05 23.599 < 2e-16 ***
## revol.bal 2.430e-05 7.262e-06 3.346 0.000824 ***
## revol.util -3.242e-01 9.016e-03 -35.960 < 2e-16 ***
## inq.last.6mths 2.701e-02 1.152e-01 0.234 0.814603
## delinq.2yrs -9.529e+00 3.969e-01 -24.009 < 2e-16 ***
## pub.rec -9.429e+00 8.115e-01 -11.620 < 2e-16 ***
## not.fully.paid -3.276e+00 5.876e-01 -5.575 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.45 on 9559 degrees of freedom
## Multiple R-squared: 0.7104, Adjusted R-squared: 0.7099
## F-statistic: 1303 on 18 and 9559 DF, p-value: < 2.2e-16
This model, which uses every other column as a predictor, explains about 71% of the variance (adjusted R-squared 0.7099), far more than the previous models. Most predictors are statistically significant; log.annual.inc and inq.last.6mths are not. As before, a positive estimate indicates a positive relationship with FICO score and a negative estimate a negative one. This is the best of the models fitted so far.
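Comparing several fitted models side by side is easier in a small table. The sketch below assumes simulated data with two predictors (the sim data frame and its coefficients are invented for illustration) and collects adjusted R-squared and AIC for each model; lower AIC and higher adjusted R-squared both point to the better fit.

```r
# Simulated stand-in data (an assumption; not the loan data set)
set.seed(11)
n <- 400
sim <- data.frame(
  int.rate = runif(n, 0.06, 0.22),
  revol.util = runif(n, 0, 100)
)
sim$fico <- 835 - 1010 * sim$int.rate - 0.3 * sim$revol.util +
  rnorm(n, sd = 20)

models <- list(
  simple = lm(fico ~ int.rate, data = sim),
  all = lm(fico ~ ., data = sim)  # every other column as a predictor
)

# One row per model, one column per fit statistic
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic = sapply(models, AIC)
)
comparison
```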
ggplot(df, aes(x = int.rate, y = fico, color = purpose)) +
  geom_point() +
  geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
We then plotted the data with ggplot to view the model's conclusions visually, with one fitted line per loan purpose.
library(ggplot2)
ggplot(df, aes(x = fico, y = int.rate)) +
  geom_jitter() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
The fitted line slopes downward: borrowers with higher FICO scores tend to get lower interest rates.
# The line must come from a model that matches the plotted axes
plot(df$fico, df$days.with.cr.line)
abline(lm(days.with.cr.line ~ fico, df))
We then looked at the correlation between FICO score and the days.with.cr.line column of our data set.
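A correlation can also be quantified rather than read off a scatter plot. The following sketch uses simulated data with an assumed positive link between the two columns (the sim data frame and its parameters are invented for illustration) and relates the correlation coefficient to the R-squared of the matching single-predictor regression.

```r
# Simulated stand-in data (an assumption; not the loan data set)
set.seed(5)
n <- 300
sim <- data.frame(fico = round(rnorm(n, mean = 710, sd = 38)))
# Assumed positive link between score and length of credit history
sim$days.with.cr.line <- 4500 + 15 * (sim$fico - 710) + rnorm(n, sd = 2400)

# Pearson correlation between the two columns
r <- cor(sim$fico, sim$days.with.cr.line)
r

# For a single-predictor regression, R-squared equals the squared correlation
fit <- lm(days.with.cr.line ~ fico, data = sim)
all.equal(summary(fit)$r.squared, r^2)

# cor.test() adds a confidence interval and a p-value for the correlation
cor.test(sim$fico, sim$days.with.cr.line)
```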
plot(df$int.rate, df$fico)
# Drawing the same line twice only overpaints it, so one call is enough
abline(linear, col = "Red")
When we plot the same model with the fico and interest rate, we get the plot above.
Conclusion
To conclude, through linear regression I learned to make predictions and to visualize my data to see correlations among the variables in my data set. It is a useful tool for analyzing a business and making informed decisions. I also learned how to determine whether a result is statistically significant. Our simple linear regression gave a reasonable linear fit (adjusted R-squared of .51), and I learned that FICO score is a very important factor in getting loans and in the interest rate offered. The higher the FICO score, the easier it is to get a loan, and a high score reflects a healthy financial history. Some ways to lower our interest rates and increase our FICO score are paying balances in full at the end of the month and making sure all bills are paid, with nothing overdue or underpaid.