This R-markdown analyzes data for interest rates given for 29,000 loan applicants with the goal of creating the best regression model to assign interest rates to an additional set of data containing 5000 more data points. This analysis will detail the approach taken to achieve the regression equation, and then the predicted interest rates based on the Test Data.
The following packages were required for this analysis:
| PACKAGE | Description |
|---|---|
| readr | Allows the imporation of .csv files |
| Skimr | Grants ability to generate summary statistics |
| Tidyverse | Loads the tidyverse packagess |
| knitr | RMarkdown documents |
| rmdformats | RMarkdown themes |
| CAR | Companion to Regression Package |
| Dplyr | Dplyr package |
| DT | Allows for datatables |
To access the data used in this case, download the .csv file of the data as follows:
# load the csv of the training data into r from by downloading the data from
# the following link:
# https://www.dropbox.com/s/5s71qme0yr3vio2/Test_data.xlsx?dl=1
# Place the data into your working folder and read the .csv in as follows:
Train_raw <-
read_csv("Train_data.csv")
# Load in the csv of the Test data into r by downloading the data from the
# following link:
# https://www.dropbox.com/s/5s71qme0yr3vio2/Test_data.xlsx?dl=1
# # Place the data into your working folder and read the .csv in as follows:
Test_Raw <-
read_csv("Test_data.csv")
The downloaded data is cleaned using the following processes:
The loan tern data is imported as a character variable. This analysis converts the character data into a new variable called “Loan_Term” to reflect a 36 month loan term or a 60 month loan term.
The average income is calculated for the entire training data set. The average income value is then used to replace the missing income data to make for a better regression correlation. This ensures that the missing data is not excluded from the analysis.
The Length_Employed variable is modified to replace the missing employment data with the number “0”. This ensures that the missing data is not excluded from the analysis.
The Home_Owner variable is modified to reduce the number of types of ownership to four: Mortgage, Own, Rent and Other. By doing this, it combines the NONE and N/A designations into one.
Lastly, the loan description is searched to find particular strings and assigns a new “loan_type” variable based on one of the following: • debt_consolidation • auto • flexline • education • other
Before understanding which actions to take to make a regression model, a baseline model is created making the assumption that the interest rate of a load is based on the amount requested for the loan. A plot of the baseline model is as follows:
From the plot, it can be seen that the data does not follow linear behavior, and would thus fail a linearity assumption. Additionally, the baseline model has the following statistics:
##
## Call:
## lm(formula = Interest_Rate ~ Loan_Amount_Requested, data = Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.094189 -0.032190 -0.000887 0.027601 0.136905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.185e-01 5.227e-04 226.80 <2e-16 ***
## Loan_Amount_Requested 1.114e-06 2.973e-08 37.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04343 on 28998 degrees of freedom
## Multiple R-squared: 0.04618, Adjusted R-squared: 0.04615
## F-statistic: 1404 on 1 and 28998 DF, p-value: < 2.2e-16
As shown, the baseline model has an r-squared value of 0.04618, which is an extremely low regression, however, the Loan_Amount_Requested Variable remains a significant predictor of the interest rate. Therefore, it is appropriate to include additional variables/transformations to the regression equation to more accurately predict the data.
The next step is to look into outliers in the data. Since the output of the data is the interest rate, it makes logical sense to understand whether there are outliers in the interest rate and eliminate those from our regression model.
The following is a box plot of the interest rates:
From the boxplot, we can see that the outliers in the interest rate begin after 0.255, so the data will be filtered to remove interest rates above this value.
After cleaning and filtering the data, it is time to build the actual regression model. Through some trial and error, the following variables are used in the regression model.
• Loan Term • Length Employed • Home Owner • Annual Income • Loan Amount Requested • Debt to Income Ratio • Number of Delinquencies in last 2 years • Inquiries in last 6 months • Revolving Balance • Revolving Utilization • Number of Open Accounts
The regression model is as follows:
B0 + B1* Loan_Term + B2* Length_Employed + B3* Home_Owner + B4* Annual_Income + B5* Annual_Income^2 + B6* Annual_Income^3 + B7* Loan_Amount_Requested + B8* Loan_Amount_Requested^2 + B9* Debt_To_income + B10* Number_Delinquency_2yrs + B11* Number_Delinqueny_2yrs^2 + B12* Number_Delinqueny_2yrs^3 + B13 Inquiries_Last_6Mo + B14* Inquiries_Last_6Mo^2 + B15* Revolving_Balance + B16* Revolving_Utilization + B17* Number_Open_Accounts + B18* loan_type
This regression equation has the following output:
##
## Call:
## lm(formula = Interest_Rate ~ as.factor(Loan_Term) + Length_Employed +
## as.factor(Home_Owner) + poly(Annual_Income, 3, raw = TRUE) +
## poly(Loan_Amount_Requested, 2, raw = TRUE) + Debt_To_Income +
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE) + poly(Inquiries_Last_6Mo,
## 2, raw = TRUE) + Revolving_Balance + Revolving_Utilization +
## Number_Open_Accounts + as.factor(loan_type), data = Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.133351 -0.023098 -0.002645 0.020264 0.274924
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 8.516e-02 1.303e-03 65.359
## as.factor(Loan_Term)60 4.149e-02 4.660e-04 89.025
## Length_Employed 8.453e-05 5.124e-05 1.650
## as.factor(Home_Owner)OTHER 4.345e-03 5.578e-04 7.788
## as.factor(Home_Owner)OWN 7.636e-03 7.477e-04 10.213
## as.factor(Home_Owner)RENT 1.035e-02 4.717e-04 21.942
## poly(Annual_Income, 3, raw = TRUE)1 -2.865e-07 1.520e-08 -18.841
## poly(Annual_Income, 3, raw = TRUE)2 7.620e-13 6.371e-14 11.961
## poly(Annual_Income, 3, raw = TRUE)3 -4.669e-19 5.680e-20 -8.220
## poly(Loan_Amount_Requested, 2, raw = TRUE)1 -2.369e-06 9.087e-08 -26.071
## poly(Loan_Amount_Requested, 2, raw = TRUE)2 7.592e-11 2.390e-12 31.767
## Debt_To_Income 2.231e-04 2.867e-05 7.780
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)1 1.249e-02 6.256e-04 19.966
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)2 -2.497e-03 2.434e-04 -10.258
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)3 1.321e-04 1.907e-05 6.928
## poly(Inquiries_Last_6Mo, 2, raw = TRUE)1 1.441e-02 4.133e-04 34.873
## poly(Inquiries_Last_6Mo, 2, raw = TRUE)2 -1.483e-03 1.150e-04 -12.896
## Revolving_Balance -1.239e-07 8.570e-09 -14.452
## Revolving_Utilization 6.468e-02 8.781e-04 73.656
## Number_Open_Accounts 5.126e-04 4.361e-05 11.755
## as.factor(loan_type)debt_consolidate 8.973e-03 6.824e-04 13.148
## as.factor(loan_type)education 6.772e-03 7.901e-03 0.857
## as.factor(loan_type)flexline 1.280e-02 1.133e-03 11.296
## as.factor(loan_type)other 8.932e-03 5.331e-04 16.754
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## as.factor(Loan_Term)60 < 2e-16 ***
## Length_Employed 0.099 .
## as.factor(Home_Owner)OTHER 7.01e-15 ***
## as.factor(Home_Owner)OWN < 2e-16 ***
## as.factor(Home_Owner)RENT < 2e-16 ***
## poly(Annual_Income, 3, raw = TRUE)1 < 2e-16 ***
## poly(Annual_Income, 3, raw = TRUE)2 < 2e-16 ***
## poly(Annual_Income, 3, raw = TRUE)3 < 2e-16 ***
## poly(Loan_Amount_Requested, 2, raw = TRUE)1 < 2e-16 ***
## poly(Loan_Amount_Requested, 2, raw = TRUE)2 < 2e-16 ***
## Debt_To_Income 7.52e-15 ***
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)1 < 2e-16 ***
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)2 < 2e-16 ***
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)3 4.37e-12 ***
## poly(Inquiries_Last_6Mo, 2, raw = TRUE)1 < 2e-16 ***
## poly(Inquiries_Last_6Mo, 2, raw = TRUE)2 < 2e-16 ***
## Revolving_Balance < 2e-16 ***
## Revolving_Utilization < 2e-16 ***
## Number_Open_Accounts < 2e-16 ***
## as.factor(loan_type)debt_consolidate < 2e-16 ***
## as.factor(loan_type)education 0.391
## as.factor(loan_type)flexline < 2e-16 ***
## as.factor(loan_type)other < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0325 on 28810 degrees of freedom
## Multiple R-squared: 0.4451, Adjusted R-squared: 0.4447
## F-statistic: 1005 on 23 and 28810 DF, p-value: < 2.2e-16
The summary of the new regression model shows an r-squared of 0.4451 that is almost 10x greater than the baseline model.
The plots above show that some outliers still remain, and thus will have an impact on the regression accuracy. Eliminating these will certainly improve the regression, but these data points will remain in the model.
Now that a regression model exists, the model can be checked for multicollinearity. The general guidelines are as follows:
VIF < 5 = Good, 5 < VIF < 10 = Possible Problem, VIF > 10 = Problem Very Likely
The results are as follows:
## GVIF Df GVIF^(1/(2*Df))
## as.factor(Loan_Term) 1.229030 1 1.108616
## Length_Employed 1.031235 1 1.015497
## as.factor(Home_Owner) 1.087698 3 1.014109
## poly(Annual_Income, 3, raw = TRUE) 1.694605 3 1.091888
## poly(Loan_Amount_Requested, 2, raw = TRUE) 1.593979 2 1.123623
## Debt_To_Income 1.351705 1 1.162629
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE) 1.019626 3 1.003245
## poly(Inquiries_Last_6Mo, 2, raw = TRUE) 1.033534 2 1.008280
## Revolving_Balance 1.320977 1 1.149338
## Revolving_Utilization 1.189592 1 1.090684
## Number_Open_Accounts 1.312409 1 1.145604
## as.factor(loan_type) 1.034626 4 1.004264
The output shows that the GVIF is less than 2 for all variables, therefore the model has very little if any multicollinearity among the variables.
Before settling on a model, it is prudent to understand whether the model can be improved by removing heteroskedasticity by applying a logarithm to the model. The output is as follows:
##
## Call:
## lm(formula = log(Interest_Rate) ~ Loan_Term + poly(Length_Employed,
## 2, raw = TRUE) + poly(Annual_Income, 3, raw = TRUE) + poly(Loan_Amount_Requested,
## 2, raw = TRUE) + Debt_To_Income + poly(Number_Delinqueny_2yrs,
## 3, raw = TRUE) + poly(Inquiries_Last_6Mo, 2, raw = TRUE) +
## Revolving_Balance + Revolving_Utilization + Number_Open_Accounts +
## as.factor(loan_type), data = Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20568 -0.17462 0.01088 0.17521 2.36427
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -2.868e+00 1.112e-02 -257.972
## Loan_Term 1.285e-02 1.544e-04 83.230
## poly(Length_Employed, 2, raw = TRUE)1 1.630e-03 1.814e-03 0.898
## poly(Length_Employed, 2, raw = TRUE)2 -1.606e-04 1.617e-04 -0.993
## poly(Annual_Income, 3, raw = TRUE)1 -2.740e-06 1.202e-07 -22.797
## poly(Annual_Income, 3, raw = TRUE)2 7.463e-12 5.048e-13 14.783
## poly(Annual_Income, 3, raw = TRUE)3 -4.716e-18 4.507e-19 -10.465
## poly(Loan_Amount_Requested, 2, raw = TRUE)1 -1.798e-05 7.228e-07 -24.874
## poly(Loan_Amount_Requested, 2, raw = TRUE)2 5.703e-10 1.902e-11 29.987
## Debt_To_Income 1.902e-03 2.281e-04 8.337
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)1 1.045e-01 4.978e-03 20.999
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)2 -2.084e-02 1.937e-03 -10.759
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)3 1.106e-03 1.517e-04 7.290
## poly(Inquiries_Last_6Mo, 2, raw = TRUE)1 1.098e-01 3.287e-03 33.412
## poly(Inquiries_Last_6Mo, 2, raw = TRUE)2 -1.151e-02 9.150e-04 -12.577
## Revolving_Balance -1.055e-06 6.809e-08 -15.486
## Revolving_Utilization 5.526e-01 6.978e-03 79.190
## Number_Open_Accounts 3.864e-03 3.464e-04 11.156
## as.factor(loan_type)debt_consolidate 7.382e-02 5.429e-03 13.596
## as.factor(loan_type)education 2.562e-02 6.287e-02 0.408
## as.factor(loan_type)flexline 8.998e-02 8.989e-03 10.010
## as.factor(loan_type)other 6.848e-02 4.242e-03 16.144
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Loan_Term < 2e-16 ***
## poly(Length_Employed, 2, raw = TRUE)1 0.369
## poly(Length_Employed, 2, raw = TRUE)2 0.321
## poly(Annual_Income, 3, raw = TRUE)1 < 2e-16 ***
## poly(Annual_Income, 3, raw = TRUE)2 < 2e-16 ***
## poly(Annual_Income, 3, raw = TRUE)3 < 2e-16 ***
## poly(Loan_Amount_Requested, 2, raw = TRUE)1 < 2e-16 ***
## poly(Loan_Amount_Requested, 2, raw = TRUE)2 < 2e-16 ***
## Debt_To_Income < 2e-16 ***
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)1 < 2e-16 ***
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)2 < 2e-16 ***
## poly(Number_Delinqueny_2yrs, 3, raw = TRUE)3 3.17e-13 ***
## poly(Inquiries_Last_6Mo, 2, raw = TRUE)1 < 2e-16 ***
## poly(Inquiries_Last_6Mo, 2, raw = TRUE)2 < 2e-16 ***
## Revolving_Balance < 2e-16 ***
## Revolving_Utilization < 2e-16 ***
## Number_Open_Accounts < 2e-16 ***
## as.factor(loan_type)debt_consolidate < 2e-16 ***
## as.factor(loan_type)education 0.684
## as.factor(loan_type)flexline < 2e-16 ***
## as.factor(loan_type)other < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2586 on 28812 degrees of freedom
## Multiple R-squared: 0.436, Adjusted R-squared: 0.4356
## F-statistic: 1061 on 21 and 28812 DF, p-value: < 2.2e-16
This output shows an r-squared value of 0.436, which is less than the r-squared value from the non-log model. Therefore, the non-log model is used to make any future predictions.
To generate the prediction, the Test Data must first be cleaned similar to the training data. The same steps are taken to clean the test data to make it consistent.
Once the data has been cleaned to match the Training data format, a prediction can be made. The interest rate predictions are exported into a .csv file.
A snapshot of the predicted interest rate output data is as follows: