Introduction

This analysis uses loan data from https://www.kaggle.com/itssuru/loan-data. The data set includes FICO score, interest rate, installment, loan purpose and several other variables.
The analysis applies exploratory data analysis methods to understand how FICO score is correlated with the other variables in the data set.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df <- read.csv("loan_data.csv")
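
Before modelling, a few quick checks help confirm the data loaded as expected. The sketch below is only an illustration: it uses a small simulated data frame called loans (a hypothetical stand-in, not the real loan_data.csv) that borrows three column names from the data set, and it shows how one might inspect column types, count missing values, and list the correlation of each numeric column with fico.

```r
# Hypothetical stand-in for the loan data: simulated values, real column names
set.seed(1)
n <- 200
loans <- data.frame(
  int.rate   = runif(n, 0.06, 0.22),
  revol.util = runif(n, 0, 100),
  purpose    = sample(c("all_other", "credit_card", "small_business"), n, replace = TRUE)
)
loans$fico <- round(835 - 1000 * loans$int.rate + rnorm(n, sd = 25))

str(loans)             # column types and a preview of the values
colSums(is.na(loans))  # number of missing values per column

# Correlation of every numeric column with fico
numeric_cols <- sapply(loans, is.numeric)
fico_cor <- cor(loans[, numeric_cols])[, "fico"]
round(fico_cor, 2)
```

The same three calls should work on df once the real file has been read.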

Linear Regression

Here we perform regression analysis: building a model that uses one or more independent variables to describe a dependent variable. In this section the dependent variable is fico and the independent variable is int.rate.
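
To make the idea concrete, here is a minimal sketch on simulated data, where the true line is known in advance. The variables x and y and the numbers 800 and -1000 are assumptions chosen for illustration; they are not estimates from the loan data.

```r
set.seed(42)
x <- runif(500, 0.06, 0.22)                # independent variable (simulated)
y <- 800 - 1000 * x + rnorm(500, sd = 20)  # dependent variable around a known line

fit <- lm(y ~ x)  # estimate intercept and slope by least squares
coef(fit)         # estimates should land near the true values 800 and -1000
```

Because the data were generated from a known line, the fitted coefficients can be checked against it, which is not possible with real data.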

tail(df)
##      credit.policy            purpose int.rate installment log.annual.inc   dti
## 9573             0 debt_consolidation   0.1565       69.98       10.11047  7.02
## 9574             0          all_other   0.1461      344.76       12.18075 10.39
## 9575             0          all_other   0.1253      257.70       11.14186  0.21
## 9576             0 debt_consolidation   0.1071       97.81       10.59663 13.09
## 9577             0   home_improvement   0.1600      351.58       10.81978 19.18
## 9578             0 debt_consolidation   0.1392      853.43       11.26446 16.28
##      fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs
## 9573  662          8190.042      2999       39.5              6           0
## 9574  672         10474.000    215372       82.1              2           0
## 9575  722          4380.000       184        1.1              5           0
## 9576  687          3450.042     10036       82.9              8           0
## 9577  692          1800.000         0        3.2              5           0
## 9578  732          4740.000     37879       57.0              6           0
##      pub.rec not.fully.paid
## 9573       0              1
## 9574       0              1
## 9575       0              1
## 9576       0              1
## 9577       0              1
## 9578       0              1

The output above shows the last six rows of the data set, including the fico and int.rate columns used in our first regression model. In a scatter plot of the two, a good linear fit shows up as points clustering closely around the straight line.

linear <- lm(fico ~ int.rate, df) 
linear
## 
## Call:
## lm(formula = fico ~ int.rate, data = df)
## 
## Coefficients:
## (Intercept)     int.rate  
##       834.8      -1011.0

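The output above shows the fitted coefficients: an intercept of 834.8 and a slope of -1011.0, so an interest rate one percentage point (0.01) higher is associated with a FICO score about 10 points lower. The model summary gives an adjusted R-squared of .51, which means the model explains only about 51% of the variance in fico. The p-value is below .05, so the relationship is statistically significant.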
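
Printing an lm object shows only the coefficients; the adjusted R-squared and p-values come from summary(). The sketch below shows one way to pull those numbers out programmatically. It runs on a simulated data frame d (hypothetical values, with column names borrowed from the data set), so the figures it prints are not those of the loan data.

```r
set.seed(7)
d <- data.frame(int.rate = runif(300, 0.06, 0.22))
d$fico <- 835 - 1000 * d$int.rate + rnorm(300, sd = 45)

fit <- lm(fico ~ int.rate, data = d)
s <- summary(fit)

s$adj.r.squared  # adjusted R-squared
coef(s)          # estimates, standard errors, t values and p-values

# p-value of the slope, compared with the usual .05 threshold
p_slope <- coef(s)["int.rate", "Pr(>|t|)"]
p_slope < 0.05
```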

plot(linear)

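Calling plot() on the model produces the standard diagnostic plots for a linear regression: residuals vs fitted values, normal Q-Q, scale-location, and residuals vs leverage.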
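
By default the diagnostic plots appear one at a time. A common way to see all four on one page is a 2 x 2 layout, sketched below. The example uses the built-in mtcars data as a stand-in so that it runs without the loan file.

```r
# When run as a script, draw to a null device instead of creating Rplots.pdf
if (!interactive()) pdf(NULL)

fit <- lm(mpg ~ wt, data = mtcars)  # mtcars ships with R; stand-in for the loan model

old_par <- par(mfrow = c(2, 2))     # 2 x 2 grid of plots
plot(fit)                           # the four default diagnostic plots
par(old_par)                        # restore the previous layout
```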

Regression With No Intercept

linearNoIntercept <- lm(fico ~ 0 + int.rate, df)

summary(linearNoIntercept)
## 
## Call:
## lm(formula = fico ~ 0 + int.rate, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -524.94  -86.21   31.40  157.54  431.51 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## int.rate  5484.94      14.69   373.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 180.5 on 9577 degrees of freedom
## Multiple R-squared:  0.9357, Adjusted R-squared:  0.9357 
## F-statistic: 1.394e+05 on 1 and 9577 DF,  p-value: < 2.2e-16

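This model reports an adjusted R-squared of .93, but that figure is not comparable with the other models. Without an intercept, R computes R-squared relative to zero rather than relative to the mean of fico, which inflates it. The residual standard error of 180.5 (against about 25 for the models with an intercept below) and the positive slope of 5484.94 show that forcing the line through the origin fits the data much worse, even though the p-value is still statistically significant.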
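
How the reported R-squared behaves when the intercept is dropped can be checked on simulated data. In the sketch below, x and y are hypothetical and generated from a line with a large intercept. The code compares the R-squared that summary() reports for each model with an R-squared for the no-intercept fit measured against the mean of y, and also compares the residual standard errors.

```r
set.seed(3)
d <- data.frame(x = runif(400, 0.06, 0.22))
d$y <- 835 - 1000 * d$x + rnorm(400, sd = 45)

with_int <- lm(y ~ x, data = d)
no_int   <- lm(y ~ 0 + x, data = d)

# R-squared as reported by summary(): the two models use different baselines
r2_with <- summary(with_int)$r.squared
r2_no   <- summary(no_int)$r.squared

# R-squared of the no-intercept fit measured against the mean of y instead
rss_no     <- sum(residuals(no_int)^2)
tss_mean   <- sum((d$y - mean(d$y))^2)
r2_no_mean <- 1 - rss_no / tss_mean

c(with_int = r2_with, no_int_reported = r2_no, no_int_vs_mean = r2_no_mean)
c(sigma_with = summary(with_int)$sigma, sigma_no = summary(no_int)$sigma)
```

The residual standard error is on the same scale for both models, so it is the safer number to compare.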

Multiple Regression

Next we fit several multiple regression models to see which works best.

multiple <- lm(fico ~ int.rate + purpose, df)
summary(multiple)
## 
## Call:
## lm(formula = fico ~ int.rate + purpose, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -126.132  -16.662   -1.858   14.502  137.354 
## 
## Coefficients:
##                             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 836.7889     1.2797  653.889  < 2e-16 ***
## int.rate                  -1040.0303     9.9789 -104.223  < 2e-16 ***
## purposecredit_card           -2.6308     0.8932   -2.946 0.003232 ** 
## purposedebt_consolidation    -1.2549     0.6741   -1.862 0.062669 .  
## purposeeducational           -3.7989     1.4775   -2.571 0.010148 *  
## purposehome_improvement      10.2274     1.1477    8.911  < 2e-16 ***
## purposemajor_purchase         4.4557     1.3317    3.346 0.000823 ***
## purposesmall_business        26.8145     1.1743   22.833  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.54 on 9570 degrees of freedom
## Multiple R-squared:  0.5478, Adjusted R-squared:  0.5475 
## F-statistic:  1656 on 7 and 9570 DF,  p-value: < 2.2e-16

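This model explains only about 55% of the variance (adjusted R-squared 0.5475). The sign of each coefficient estimate gives the direction of the relationship: a positive estimate means fico tends to be higher, a negative one that it tends to be lower. The purpose coefficients are measured relative to the baseline level all_other, which is why it has no row of its own. We probably shouldn't use this model unless no better model is available.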
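
With a categorical predictor, R leaves out one level (by default the first in alphabetical order) and reports every other level relative to it. The sketch below, on a simulated data frame d with hypothetical columns x, g and y, shows which level is dropped and how relevel() changes the baseline without changing the fit.

```r
set.seed(11)
d <- data.frame(
  x = runif(300, 0.06, 0.22),
  g = factor(sample(c("all_other", "credit_card", "small_business"), 300, replace = TRUE))
)
d$y <- 835 - 1000 * d$x + ifelse(d$g == "small_business", 25, 0) + rnorm(300, sd = 25)

fit_a <- lm(y ~ x + g, data = d)
names(coef(fit_a))  # no coefficient for the baseline level "all_other"

# Choose a different baseline level and refit
d$g2 <- relevel(d$g, ref = "small_business")
fit_b <- lm(y ~ x + g2, data = d)
names(coef(fit_b))  # now "small_business" is the baseline

# Same model, different parameterisation: fitted values agree
all.equal(fitted(fit_a), fitted(fit_b))
```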

multiple2 <- lm(fico ~ int.rate:purpose, df)
summary(multiple2)
## 
## Call:
## lm(formula = fico ~ int.rate:purpose, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.281  -16.595   -1.861   14.418  125.659 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          839.542      1.241  676.37   <2e-16 ***
## int.rate:purposeall_other          -1065.091     10.998  -96.84   <2e-16 ***
## int.rate:purposecredit_card        -1084.823     11.529  -94.10   <2e-16 ***
## int.rate:purposedebt_consolidation -1071.007      9.949 -107.65   <2e-16 ***
## int.rate:purposeeducational        -1093.994     14.904  -73.40   <2e-16 ***
## int.rate:purposehome_improvement    -986.752     13.090  -75.38   <2e-16 ***
## int.rate:purposemajor_purchase     -1031.325     14.586  -70.71   <2e-16 ***
## int.rate:purposesmall_business      -853.114     11.188  -76.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.4 on 9570 degrees of freedom
## Multiple R-squared:  0.5529, Adjusted R-squared:  0.5525 
## F-statistic:  1690 on 7 and 9570 DF,  p-value: < 2.2e-16

This model explains about 55% of the variance (adjusted R-squared 0.5525), slightly more than the previous model. The int.rate:purpose term fits a separate int.rate slope for each purpose with a shared intercept. Every slope is negative, so a higher interest rate goes with a lower fico for every loan purpose. We probably shouldn't use this model unless no better model is available.
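
The models in this section differ only in the formula operator that joins int.rate and purpose. One way to see what each operator does is to look at the columns of the design matrix it generates. The sketch below uses a tiny made-up data frame d with a numeric x and a three-level factor g (both hypothetical).

```r
d <- data.frame(
  x = c(0.08, 0.10, 0.12, 0.14, 0.16, 0.18),
  g = factor(c("a", "b", "c", "a", "b", "c"))
)

# x + g : one common slope, plus an intercept shift per non-baseline level
cols_plus  <- colnames(model.matrix(~ x + g, d))
# x:g   : one common intercept, plus a separate slope per level
cols_colon <- colnames(model.matrix(~ x:g, d))
# x * g : shorthand for x + g + x:g
cols_star  <- colnames(model.matrix(~ x * g, d))

cols_plus
cols_colon
cols_star
identical(cols_star, colnames(model.matrix(~ x + g + x:g, d)))
```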

multiple3 <- lm(fico ~ int.rate*purpose, df)
summary(multiple3)
## 
## Call:
## lm(formula = fico ~ int.rate * purpose, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.009  -16.359   -1.921   14.349  114.176 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          842.925      2.312 364.593  < 2e-16 ***
## int.rate                           -1092.579     19.285 -56.655  < 2e-16 ***
## purposecredit_card                    -3.875      4.191  -0.925   0.3552    
## purposedebt_consolidation             -5.725      3.119  -1.836   0.0664 .  
## purposeeducational                    -5.124      6.731  -0.761   0.4465    
## purposehome_improvement               21.051      5.078   4.145 3.42e-05 ***
## purposemajor_purchase                  9.624      5.671   1.697   0.0898 .  
## purposesmall_business                -39.469      5.161  -7.647 2.25e-14 ***
## int.rate:purposecredit_card           11.690     34.476   0.339   0.7346    
## int.rate:purposedebt_consolidation    39.387     25.202   1.563   0.1181    
## int.rate:purposeeducational           12.425     54.969   0.226   0.8212    
## int.rate:purposehome_improvement     -91.782     42.177  -2.176   0.0296 *  
## int.rate:purposemajor_purchase       -46.460     48.138  -0.965   0.3345    
## int.rate:purposesmall_business       488.123     37.874  12.888  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.28 on 9564 degrees of freedom
## Multiple R-squared:  0.5575, Adjusted R-squared:  0.5569 
## F-statistic: 926.9 on 13 and 9564 DF,  p-value: < 2.2e-16

This model explains about 56% of the variance (adjusted R-squared 0.5569), so there was no meaningful improvement over the previous one. Most of the interaction terms are not statistically significant; only the slopes for small_business and home_improvement differ significantly from the baseline slope. As before, the sign of each coefficient estimate gives the direction of the relationship. We probably shouldn't use this model unless no better model is available.
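
Whether the extra interaction terms are worth keeping can also be tested formally. When one model is nested inside another, anova() compares them with an F-test. The sketch below is an illustration on simulated data (hypothetical columns x, g and y, where one group is deliberately given a different slope), not a result for the loan data.

```r
set.seed(5)
d <- data.frame(
  x = runif(600, 0.06, 0.22),
  g = factor(sample(c("all_other", "credit_card", "small_business"), 600, replace = TRUE))
)
# One group gets a flatter slope, so an interaction is really present
slope <- ifelse(d$g == "small_business", -600, -1000)
d$y <- 835 + slope * d$x + rnorm(600, sd = 25)

additive         <- lm(y ~ x + g, data = d)  # common slope
with_interaction <- lm(y ~ x * g, data = d)  # slope varies by group

# F-test: does letting the slope vary by group reduce the residual error?
cmp <- anova(additive, with_interaction)
cmp
```

A small p-value in the second row suggests the interaction improves the fit beyond what chance would explain. On the loan data the equivalent call would be anova(multiple, multiple3).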

multiple4 <- lm(fico ~ ., df)
summary(multiple4)
## 
## Call:
## lm(formula = fico ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -92.150 -13.538  -1.333  12.021 110.099 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.067e+02  4.642e+00 173.797  < 2e-16 ***
## credit.policy              1.033e+01  6.732e-01  15.342  < 2e-16 ***
## purposecredit_card        -3.196e+00  7.336e-01  -4.357 1.33e-05 ***
## purposedebt_consolidation -3.854e+00  5.632e-01  -6.842 8.29e-12 ***
## purposeeducational        -2.694e+00  1.188e+00  -2.267 0.023423 *  
## purposehome_improvement    1.942e+00  9.318e-01   2.084 0.037190 *  
## purposemajor_purchase      1.825e+00  1.068e+00   1.708 0.087750 .  
## purposesmall_business      1.298e+01  9.645e-01  13.459  < 2e-16 ***
## int.rate                  -8.244e+02  1.039e+01 -79.326  < 2e-16 ***
## installment                4.194e-02  1.243e-03  33.746  < 2e-16 ***
## log.annual.inc            -5.595e-01  4.266e-01  -1.312 0.189718    
## dti                       -1.497e-01  3.363e-02  -4.453 8.57e-06 ***
## days.with.cr.line          2.196e-03  9.304e-05  23.599  < 2e-16 ***
## revol.bal                  2.430e-05  7.262e-06   3.346 0.000824 ***
## revol.util                -3.242e-01  9.016e-03 -35.960  < 2e-16 ***
## inq.last.6mths             2.701e-02  1.152e-01   0.234 0.814603    
## delinq.2yrs               -9.529e+00  3.969e-01 -24.009  < 2e-16 ***
## pub.rec                   -9.429e+00  8.115e-01 -11.620  < 2e-16 ***
## not.fully.paid            -3.276e+00  5.876e-01  -5.575 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.45 on 9559 degrees of freedom
## Multiple R-squared:  0.7104, Adjusted R-squared:  0.7099 
## F-statistic:  1303 on 18 and 9559 DF,  p-value: < 2.2e-16

This model explains 71% of the variance (adjusted R-squared 0.7099), a lot more than the previous models. As before, the sign of each coefficient estimate gives the direction of the relationship with fico. Two predictors, log.annual.inc and inq.last.6mths, are not statistically significant. This is the best of the models fitted here.
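
Rather than reading each summary separately, the models can be put side by side in one table. The sketch below collects adjusted R-squared, residual standard error and AIC for a list of models. It runs on a simulated data frame d (hypothetical values, with column names borrowed from the data set), and the model names rate_only and all_cols are made up for the example.

```r
set.seed(9)
n <- 500
d <- data.frame(int.rate = runif(n, 0.06, 0.22), revol.util = runif(n, 0, 100))
d$fico <- 835 - 1000 * d$int.rate - 0.3 * d$revol.util + rnorm(n, sd = 20)

models <- list(
  rate_only = lm(fico ~ int.rate, data = d),
  all_cols  = lm(fico ~ ., data = d)
)

# One row per model: higher adj_r2 and lower sigma / aic indicate a better fit
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  sigma  = sapply(models, function(m) summary(m)$sigma),
  aic    = sapply(models, AIC)
)
comparison
```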

ggplot(df, aes(x = int.rate, y = fico, color = purpose)) +
  geom_point() +
  geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

This ggplot shows the relationship visually: fico against int.rate, colored by purpose, with a separate fitted line for each purpose.

library(ggplot2)
ggplot(df, aes(x = fico, y = int.rate)) +
  geom_jitter() +
  geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

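This plot puts fico on the x-axis and int.rate on the y-axis, with jitter to reduce overplotting. The fitted line slopes downward, consistent with the negative int.rate coefficient in the regression models.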

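# Scatter plot of days with a credit line against FICO score
plot(df$fico, df$days.with.cr.line)
# Fit a line for this pair of variables (the earlier `linear` model is fico ~ int.rate)
abline(lm(days.with.cr.line ~ fico, df))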

Here we look at the correlation between FICO score and days.with.cr.line, the number of days the borrower has had a credit line.
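
A scatter plot shows the direction of a relationship; the correlation coefficient puts a number on its strength. The sketch below uses cor() and cor.test() on two simulated vectors, fico and days, which are hypothetical stand-ins for df$fico and df$days.with.cr.line.

```r
set.seed(21)
fico <- round(rnorm(300, mean = 710, sd = 38))                # simulated scores
days <- 4500 + 25 * (fico - 710) + rnorm(300, sd = 2300)      # simulated days with credit line

r  <- cor(fico, days)       # Pearson correlation, between -1 and 1
ct <- cor.test(fico, days)  # test of the null hypothesis that the correlation is 0

r
ct$p.value
```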

plot(df$int.rate, df$fico)
# Overlay the fitted line from the fico ~ int.rate model
abline(linear, col = "red")

Plotting fico against int.rate and adding the fitted line from the linear model gives the plot above.

Conclusion

To conclude, through linear regression I learned to make predictions and to visualize my data to see correlations among the variables in my data set. It is a great tool for analyzing a business and making informed decisions. I also learned how to determine whether a result is statistically significant. The simple linear regression explained about half of the variance in FICO score, and the model with all predictors about 71%. I learned that FICO score is a very important factor in getting loans and in the interest rate offered: the higher the FICO score, the easier it is to get a loan, and a high score reflects a healthy financial history. Some ways to increase our FICO score, and so lower our interest rates, are paying balances in full at the end of the month and making sure all bills are paid and nothing is overdue or underpaid.