Introduction

This final project has three main goals:

  1. Understand relationships inherent in an archival data set
  2. Create an automated interest rate assessment tool
  3. Identify variables most commonly used to assess an interest rate

Approach

I plotted each independent variable against interest rate to check for linearity and homoscedasticity problems in the data. The plots also revealed outliers, which I cleaned up as part of the analysis.
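A minimal sketch of this kind of screening plot is shown here. The `toy` data frame and its values are invented stand-ins, not the project data; only the column names echo the model. The lowess line is one common way to spot curvature and is an assumption about how the plots were read.

```r
# Hypothetical stand-in data; the real analysis would use the training set.
set.seed(1)
toy <- data.frame(
  Interest_Rate  = runif(200, 0.05, 0.25),
  Annual_Income  = rlnorm(200, meanlog = 11, sdlog = 0.5),
  Debt_To_Income = runif(200, 0, 35)
)
predictors <- setdiff(names(toy), "Interest_Rate")

pdf(NULL)  # null graphics device so the sketch also runs without a display
for (p in predictors) {
  # One scatterplot per predictor against the response
  plot(toy[[p]], toy$Interest_Rate, xlab = p, ylab = "Interest_Rate")
  # Smoothed trend: a bend suggests non-linearity, a funnel shape in the
  # points around it suggests non-constant variance
  lines(lowess(toy[[p]], toy$Interest_Rate), col = "red")
}
invisible(dev.off())
```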

I used the interquartile range (IQR) to identify outliers. Annual income was the first independent variable I cleaned: I replaced its outliers and NA values with the average value from the dataset. I also updated the outliers in public record count and converted that variable to numeric. I converted month from its original character format to a numeric value, and I converted the delinquency variable to a factor. For text mining, I flagged records containing the words debt or credit, RENT, verified, and own/mortgage, and used those flags as terms in the regression equation.
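The clean-up steps above could look roughly like the following sketch. The 1.5 x IQR fences, the choice to average only the non-outlier values, the helper name `replace_outliers_with_mean`, and the `purpose` text column are all assumptions made for illustration; the RMD file is the reference for what was actually done.

```r
# Replace NAs and values outside the IQR fences with the mean of the
# remaining values. k = 1.5 is the usual convention (assumed here).
replace_outliers_with_mean <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  bad <- is.na(x) | x < q[1] - fence | x > q[2] + fence
  x[bad] <- mean(x[!bad])
  x
}

# Invented income values with one NA and one extreme value
income <- c(40000, 52000, 61000, 58000, NA, 45000, 900000)
clean_income <- replace_outliers_with_mean(income)

# Text flag: TRUE when a free-text field mentions "debt" or "credit"
purpose <- c("Debt consolidation", "credit card refinance", "Home improvement")
debtcredit <- grepl("debt|credit", purpose, ignore.case = TRUE)
```

Averaging only the non-outlier values keeps a single extreme income from pulling the replacement value upward; averaging the whole column is the other reading of the text.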

## 
## Call:
## lm(formula = Interest_Rate ~ Month + term + debtcredit + Public_Record_Count + 
##     Annual_Income + Length_Employed + Loan_Amount_Requested + 
##     Debt_To_Income + Revolving_Utilization + verified + Rent + 
##     Number_Delinqueny_2yrs + Number_Open_Accounts + Revolving_Balance + 
##     Inquiries_Last_6Mo, data = Train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.153781 -0.023119 -0.002774  0.020721  0.164909 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.262e-02  1.560e-03   8.088 6.30e-16 ***
## Month                    -3.120e-04  5.965e-05  -5.231 1.70e-07 ***
## term                      1.649e-03  1.994e-05  82.709  < 2e-16 ***
## debtcreditTRUE           -1.396e-02  5.369e-04 -25.995  < 2e-16 ***
## Public_Record_Count       8.383e-03  4.146e-04  20.222  < 2e-16 ***
## Annual_Income            -1.169e-07  1.172e-08  -9.975  < 2e-16 ***
## Length_Employed           1.513e-04  5.572e-05   2.716 0.006621 ** 
## Loan_Amount_Requested     4.812e-07  2.967e-08  16.217  < 2e-16 ***
## Debt_To_Income            4.685e-04  2.852e-05  16.425  < 2e-16 ***
## Revolving_Utilization     7.335e-02  9.613e-04  76.298  < 2e-16 ***
## verifiedTRUE              6.325e-03  4.776e-04  13.243  < 2e-16 ***
## RentTRUE                  7.856e-03  4.084e-04  19.235  < 2e-16 ***
## Number_Delinqueny_2yrs1   1.101e-02  5.844e-04  18.835  < 2e-16 ***
## Number_Delinqueny_2yrs2   1.182e-02  9.994e-04  11.826  < 2e-16 ***
## Number_Delinqueny_2yrs3   1.548e-02  1.633e-03   9.484  < 2e-16 ***
## Number_Delinqueny_2yrs4   1.829e-02  2.391e-03   7.649 2.09e-14 ***
## Number_Delinqueny_2yrs5   1.121e-02  3.298e-03   3.398 0.000679 ***
## Number_Delinqueny_2yrs6   1.364e-02  4.777e-03   2.856 0.004296 ** 
## Number_Delinqueny_2yrs7   1.796e-03  6.304e-03   0.285 0.775657    
## Number_Delinqueny_2yrs8   1.642e-02  9.070e-03   1.811 0.070178 .  
## Number_Delinqueny_2yrs9   2.700e-02  1.463e-02   1.846 0.064936 .  
## Number_Delinqueny_2yrs10  1.896e-02  1.158e-02   1.638 0.101483    
## Number_Delinqueny_2yrs11 -2.161e-02  2.313e-02  -0.934 0.350065    
## Number_Delinqueny_2yrs12 -1.226e-02  1.888e-02  -0.649 0.516291    
## Number_Delinqueny_2yrs13  2.447e-02  1.636e-02   1.496 0.134640    
## Number_Open_Accounts      6.690e-04  4.543e-05  14.726  < 2e-16 ***
## Revolving_Balance        -7.043e-07  2.429e-08 -28.992  < 2e-16 ***
## Inquiries_Last_6Mo1       1.319e-02  4.662e-04  28.288  < 2e-16 ***
## Inquiries_Last_6Mo2       2.031e-02  6.278e-04  32.349  < 2e-16 ***
## Inquiries_Last_6Mo3       2.767e-02  8.761e-04  31.587  < 2e-16 ***
## Inquiries_Last_6Mo4       3.015e-02  1.652e-03  18.253  < 2e-16 ***
## Inquiries_Last_6Mo5       3.553e-02  2.791e-03  12.731  < 2e-16 ***
## Inquiries_Last_6Mo6       3.384e-02  4.105e-03   8.244  < 2e-16 ***
## Inquiries_Last_6Mo7       1.495e-02  1.463e-02   1.022 0.306859    
## Inquiries_Last_6Mo8       1.906e-02  3.270e-02   0.583 0.559984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03269 on 27397 degrees of freedom
##   (1568 observations deleted due to missingness)
## Multiple R-squared:  0.4603, Adjusted R-squared:  0.4597 
## F-statistic: 687.3 on 34 and 27397 DF,  p-value: < 2.2e-16

Regression Equation & Summary

The regression equation for the model is:

  Interest Rate = 0.01262
    - 0.000312 * (test$Month)
    + 0.001649 * (test$term)
    - 0.01396 * (test$debtcredit)
    + 0.008383 * (test$Public_Record_Count)
    - 0.0000001169 * (test$Annual_Income)
    + 0.0001513 * (test$Length_Employed)
    + 0.0000004812 * (test$Loan_Amount_Requested)
    + 0.0004685 * (test$Debt_To_Income)
    + 0.07335 * (test$Revolving_Utilization)
    + 0.006325 * (test$verified)
    + 0.007856 * (test$Rent)
    + 0.01101 * (test$NumberDel1)
    + 0.01182 * (test$NumberDel2)
    + 0.01548 * (test$NumberDel3)
    + 0.01829 * (test$NumberDel4)
    + 0.01121 * (test$NumberDel5)
    + 0.01364 * (test$NumberDel6)
    + 0.000669 * (test$Number_Open_Accounts)
    - 0.0000007043 * (test$Revolving_Balance)
    + 0.01319 * (test$NumInq1)
    + 0.02031 * (test$NumInq2)
    + 0.02767 * (test$NumInq3)
    + 0.03015 * (test$NumInq4)
    + 0.03553 * (test$NumInq5)
    + 0.03384 * (test$NumInq6)
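As a worked example, the equation can be evaluated for a single borrower using the coefficients from the summary output. The borrower below is entirely hypothetical, and the units (term in months, revolving utilization as a fraction, interest rate as a fraction) are assumptions rather than facts stated in the data description.

```r
# Coefficients copied from the lm() summary above
intercept <- 1.262e-02
coefs <- c(
  Month = -3.120e-04, term = 1.649e-03, debtcredit = -1.396e-02,
  Public_Record_Count = 8.383e-03, Annual_Income = -1.169e-07,
  Length_Employed = 1.513e-04, Loan_Amount_Requested = 4.812e-07,
  Debt_To_Income = 4.685e-04, Revolving_Utilization = 7.335e-02,
  verified = 6.325e-03, Rent = 7.856e-03,
  NumberDel1 = 1.101e-02, NumberDel2 = 1.182e-02, NumberDel3 = 1.548e-02,
  NumberDel4 = 1.829e-02, NumberDel5 = 1.121e-02, NumberDel6 = 1.364e-02,
  Number_Open_Accounts = 6.690e-04, Revolving_Balance = -7.043e-07,
  NumInq1 = 1.319e-02, NumInq2 = 2.031e-02, NumInq3 = 2.767e-02,
  NumInq4 = 3.015e-02, NumInq5 = 3.553e-02, NumInq6 = 3.384e-02
)

# Hypothetical borrower: every term starts at 0, then a few are filled in.
# Dummy variables (debtcredit, verified, NumInq1, ...) are 1 when they apply.
borrower <- setNames(numeric(length(coefs)), names(coefs))
borrower[c("Month", "term", "debtcredit", "Annual_Income", "Length_Employed",
           "Loan_Amount_Requested", "Debt_To_Income", "Revolving_Utilization",
           "verified", "Number_Open_Accounts", "Revolving_Balance", "NumInq1")] <-
  c(6, 36, 1, 60000, 5, 10000, 15, 0.5, 1, 10, 10000, 1)

# Intercept plus the sum of coefficient * value over all terms
predicted_rate <- intercept + sum(coefs * borrower)
print(round(predicted_rate, 4))
```

For these made-up inputs the prediction comes out near 0.118, that is, roughly an 11.8% rate if the response is stored as a fraction.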

This regression equation explains about 46% of the variance in interest rate (multiple R-squared of 0.4603). Judged by t value, the three strongest factors are term of loan, Revolving Utilization, and inquiries in the last six months.
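To make the R-squared figure concrete, the sketch below recomputes it by hand as one minus the ratio of residual to total sum of squares. It uses R's built-in `mtcars` data purely as a stand-in, since the project data is not reproduced here.

```r
# Stand-in model on a built-in dataset (not the loan data)
fit <- lm(mpg ~ wt, data = mtcars)

ss_res <- sum(residuals(fit)^2)                    # variation left unexplained
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation in response
r2_by_hand <- 1 - ss_res / ss_tot

# Matches the "Multiple R-squared" line that summary() prints
r2_reported <- summary(fit)$r.squared
```

An R-squared of 0.46 therefore means the residual sum of squares is about 54% of the total, so more than half of the variation in interest rate is still unexplained by the model.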

The main frustration I encountered was deciding which variables to treat as factors and which to leave as numeric. Some variables also showed collinearity and did little to improve the regression, such as earliest credit year, months since delinquency, and collections that exclude medical. For the factor variables, I included only the statistically significant levels in the regression equation. It took a lot of trial and error to find the best mix of variables and the best way to present them in the equation. Overall, the independent variables in the regression equation proved to be the best fit after much analysis.
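The factor-versus-numeric choice can be illustrated with a small simulated example. Everything here (the `d` data frame, the effect sizes, the noise level) is invented; the point is only that a numeric count gets a single slope while a factor gets one coefficient per non-baseline level, and that the two fits can be compared with a nested F test.

```r
set.seed(3)
# Simulated count predictor whose effect is not evenly spaced across levels
d <- data.frame(inquiries = sample(0:3, 200, replace = TRUE))
d$rate <- 0.10 + c(0, 0.013, 0.020, 0.028)[d$inquiries + 1] +
  rnorm(200, sd = 0.01)

fit_numeric <- lm(rate ~ inquiries, data = d)          # intercept + one slope
fit_factor  <- lm(rate ~ factor(inquiries), data = d)  # intercept + one term per level above 0

n_numeric <- length(coef(fit_numeric))
n_factor  <- length(coef(fit_factor))

# The numeric model is nested in the factor model, so an F test shows
# whether the extra per-level terms are worth their degrees of freedom
comparison <- anova(fit_numeric, fit_factor)
```

The factor form is more flexible but spends a coefficient on every level, which is why sparse high levels tend to come out non-significant.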

Test Data

The test data was cleaned up in the same way as the training set. Average values were used where NAs were present. The clean-up of the test data can be seen in the RMD file. The csv file submitted with this project contains the predicted interest rates.
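A minimal sketch of that scoring step follows, on simulated data. The column names `x` and `y`, the use of the test column's own mean for imputation, and the temporary output path are assumptions for illustration; the actual steps are in the RMD file.

```r
set.seed(7)
# Simulated training data and a simple stand-in model
train <- data.frame(x = rnorm(50, mean = 10, sd = 2))
train$y <- 0.5 * train$x + rnorm(50, sd = 0.1)
fit <- lm(y ~ x, data = train)

# Test data with a missing value
test <- data.frame(x = c(9, NA, 12))

# Fill NAs with the column average (here the test column's own mean;
# using the training mean instead is another reasonable choice)
test$x[is.na(test$x)] <- mean(test$x, na.rm = TRUE)

# Score the cleaned test set and write the predictions to a csv file
test$predicted <- predict(fit, newdata = test)
out_file <- tempfile(fileext = ".csv")  # hypothetical output location
write.csv(test, out_file, row.names = FALSE)
```

Imputing before calling `predict()` matters because rows with NA predictors would otherwise yield NA predictions.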