Regression Prediction Competition
Introduction
The goal of this project is to build a predictive regression model for assigning interest rates to loans.
The data used for this project includes information on 34,000 loan applications. The data is broken down into two parts - a “training” data set of 29,000 observations with provided interest rates and a “test” data set of 5,000 observations for which we need to build a model to predict the interest rates.
Required Packages
The packages required for this markdown are:
| Packages | Summary |
|---|---|
| tidyverse | The tidyverse collection of packages |
| pander | Outputting summary tables |
| fastDummies | Used to create dummy variables |
| readr | Easy way to read rectangular data - i.e. csv files |
| skimr | Simple to use summary function |
| stargazer | Making pretty summary statistics tables |
| corrplot | Various correlation plots |
| PerformanceAnalytics | Building awesome graphics for analytics |
| knitr | RMarkdown documents |
| rmdformats | RMarkdown themes |
| scales | Methods for breaks and labels for axes in visualizations |
| fmsb | Variance Inflation Factors (VIF) |
| Metrics | RMSE for prediction vs. actual values |
| car | Used to calculate vif |
| lm.beta | Used to check standardized coefficients |
The Process
First, we organized and cleaned the data. Then, we analyzed the variables to determine which ones were significant. Using our significant variables, we then built a regression model we could use to estimate interest rates in the training data. After building the model, we checked the residuals, VIF, ANOVA, and ran various plots. These were done in order to check for linearity, multicollinearity, normality, and homoscedasticity. For more detail, see the relevant section below.
Cleaning the Data
First, we looked at the training set and began cleaning the data. Without properly cleaned and organized data, it would be impossible to discern the most effective model for predicting interest rates in the test data.
We started with checking the class of all the variables and ensured they were properly assigned to numerical, factor, and character classes. After identifying which variables would be used as factors, we moved on to making dummy variables. This introduced a fourth class of data - logical data.
Next, we replaced missing values in the data set. For the most part, character data with NA or missing values was handled by replacing these with “unknown.” Additionally, we consolidated missing values by replacing synonyms for NA such as “none” and “other” with “unknown” as well.
Missing or NA numerical data was handled on a case by case basis. Some variables were handled by replacing missing values with 0. For others, 0 would introduce the wrong assumptions into our data. For example, a 0 value in delinquency might indicate that a customer is currently delinquent, instead of conveying that the customer has NEVER been delinquent. In cases like these, we replaced NA values with an insane “dummy value” - 9999. We then made a dummy variable indicating whether the customer has been delinquent or not using 1’s and 0’s, where 9999 would translate to a value of 1 (Never Delinquent). Finally, missing values in the Annual Income variable was handled by replacing the missing values with an average of all the other income values. We then made a dummy variable indicating if Annual Income used our calculation for average annual income.
Finally, we had to find a way to handle outliers. To do this, we first had to define what we considered an outlier. We decided that, for our purposes, an “outlier” would be defined as any value more than 3 standard deviations from the mean. Then, we simply filtered out these values in the relevant columns.
Descriptions Variable
While the loan description variable can give a lot of insight into what the loan was being used for, we decided not to perform text analysis on this variable. After consulting with the professor, it was determined that text analysis was only likely to improve our model around 0.002% (in terms of RMSE). Thus, we made the decision to utilize our resources in other ways to find a better fitting model. Though the text analysis certainly would have been fun and interesting, it did not feel like the best use of our time for this project.
Analytical Methods
After skimming through the training data set, we removed State and Public_Record_Count as they appeared insignificant.
We also excluded two factor levels from Purpose_Of_Loan - credit card and educational purposes - as the p-values for both revealed them to be insignificant.
After further investigated the models via crPlots() and plot() to check linearity, we initially found that Months_Since_Delinquency and Months_Since_Record had linearity assumptions issues. We went about correcting this issues by making them a second-degree polynomial. After correcting that, Months_Since_Last_Derogatory’s plot also faced the same issue, so we went about fixing it the exact same way. As a result, the corrections fixed the linearity issues and improved the fit of the model (as shown in the “Investigating the Model” section).
By doing this, the RMSE was improved by approximately 0.01% and the Adjust R-squared increased to 0.50.
Regression Model
Generalized Regression Equation
Interest_Rate = b0 + b1(Loan_Amount_Requested) + b2(Initial_Loan_Term_Months) + b3(Home_Owner) + b4(Annual_Income) + b5(Income_Verified) + b6(Purpose_Of_Loan) + b7(Debt_To_Income) + b8(Number_Delinqueny_2yrs) + b9(Inquiries_Last_6Mo) + b10(Months_Since_Deliquency^2) + b11(Months_Since_Record^2) + b12(Number_Open_Accounts) + b13(Revolving_Balance) + b14(Revolving_Utilization) + b15(Total_Accounts) + b16(Collections_12Mo_Exclude_Med) + b17(Months_Since_Last_Derogatory^2)
Best Descriptive Regression Equation
Interest_Rate = (9.283e-02) + (6.058e-07)(Loan_Amount_Requested) + (3.896e-02)(Initial_Loan_Term_Months) + (7.263e-03)(Home_Owner_OWN) + (8.731e-03)(Home_Owner_RENT) + (4.164e-03)(Home_Owner_Unknown) + (-1.089e-07)(Annual_Income) + (9.732e-03)(Income_Verified_income) + (2.174e-03)(Income_Verified_income_source) + (8.259e-03)(Purpose_Of_Loan_debt_consolidation) + (1.156e-02)(Purpose_Of_Loan_home_improvement) + (2.413e-02)(Purpose_Of_Loan_house) + (9.920e-03)(Purpose_Of_Loan_major_purchase) + (3.173e-02)(Purpose_Of_Loan_medical) + (4.238e-02)(Purpose_Of_Loan_moving) + (3.099e-02)(Purpose_Of_Loan_other) + (1.786e-02)(Purpose_Of_Loan_renewable_energy) + (3.265e-02)(Purpose_Of_Loan_small_business) + (3.202e-02)(Purpose_Of_Loan_vacation) + (1.802e-02)(Purpose_Of_Loan_wedding) + (3.739e-04)(Debt_To_Income) + (1.305e-03)(Number_Delinqueny_2yrs) + (9.247e-03)(Inquiries_Last_6Mo) + (-1.766e-04)(Months_Since_Deliquency) + (1.749e-08)(Months_Since_Deliquency^2) + (-1.372e-04)(Months_Since_Record) + (1.347e-08)(Months_Since_Record^2) + (1.318e-03)(Number_Open_Accounts) + (-3.748e-07)(Revolving_Balance) + (6.963e-02)(Revolving_Utilization) + (-4.494e-04)(Total_Accounts) + (6.639e-03)(Collections_12Mo_Exclude_Med) + (7.228e-05)(Months_Since_Last_Derogatory) + (-7.249e-09)(Months_Since_Last_Derogatory^2)
Investigating the Model
Below are plots and graphs showing the “fit” of the regression model, as well as the RMSE (Root Mean Squared Error) of the model.
##
## Call:
## lm(formula = Interest_Rate ~ Loan_Amount_Requested + Initial_Loan_Term_Months +
## `Home_Owner?` + Annual_Income + Income_Verified + Purpose_Of_Loan +
## Debt_To_Income + Number_Delinqueny_2yrs + Inquiries_Last_6Mo +
## poly(Months_Since_Deliquency, 2, raw = TRUE) + poly(Months_Since_Record,
## 2, raw = TRUE) + Number_Open_Accounts + Revolving_Balance +
## Revolving_Utilization + Total_Accounts + Collections_12Mo_Exclude_Med +
## poly(Months_Since_Last_Derogatory, 2, raw = TRUE), data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13564 -0.02219 -0.00318 0.01941 0.17359
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 9.013e-02 2.721e-03
## Loan_Amount_Requested 4.334e-07 3.022e-08
## Initial_Loan_Term_Months60 months 3.964e-02 4.653e-04
## `Home_Owner?`OWN 7.346e-03 7.421e-04
## `Home_Owner?`RENT 8.802e-03 4.686e-04
## `Home_Owner?`Unknown 3.958e-03 5.532e-04
## Annual_Income -1.048e-07 7.105e-09
## Income_VerifiedVERIFIED - income 9.985e-03 5.110e-04
## Income_VerifiedVERIFIED - income source 1.836e-03 5.145e-04
## Purpose_Of_Loandebt_consolidation 1.012e-02 1.881e-03
## Purpose_Of_Loanhome_improvement 1.285e-02 2.010e-03
## Purpose_Of_Loanhouse 2.548e-02 3.437e-03
## Purpose_Of_Loanmajor_purchase 1.029e-02 2.307e-03
## Purpose_Of_Loanmedical 3.250e-02 2.671e-03
## Purpose_Of_Loanmoving 4.287e-02 3.265e-03
## Purpose_Of_Loanother 4.092e-03 1.897e-03
## Purpose_Of_Loanrenewable_energy 1.819e-02 6.746e-03
## Purpose_Of_Loansmall_business 3.408e-02 2.445e-03
## Purpose_Of_Loanvacation 3.224e-02 3.110e-03
## Purpose_Of_Loanwedding 1.826e-02 3.625e-03
## Debt_To_Income 3.366e-04 2.878e-05
## Number_Delinqueny_2yrs 1.160e-03 2.957e-04
## Inquiries_Last_6Mo 9.416e-03 1.851e-04
## poly(Months_Since_Deliquency, 2, raw = TRUE)1 -1.715e-04 1.597e-05
## poly(Months_Since_Deliquency, 2, raw = TRUE)2 1.699e-08 1.591e-09
## poly(Months_Since_Record, 2, raw = TRUE)1 -1.302e-04 1.806e-05
## poly(Months_Since_Record, 2, raw = TRUE)2 1.279e-08 1.792e-09
## Number_Open_Accounts 1.279e-03 5.538e-05
## Revolving_Balance -3.939e-07 2.047e-08
## Revolving_Utilization 6.897e-02 9.276e-04
## Total_Accounts -4.268e-04 2.347e-05
## Collections_12Mo_Exclude_Med 6.857e-03 1.885e-03
## poly(Months_Since_Last_Derogatory, 2, raw = TRUE)1 8.589e-05 2.040e-05
## poly(Months_Since_Last_Derogatory, 2, raw = TRUE)2 -8.594e-09 2.031e-09
## t value Pr(>|t|)
## (Intercept) 33.120 < 2e-16 ***
## Loan_Amount_Requested 14.344 < 2e-16 ***
## Initial_Loan_Term_Months60 months 85.201 < 2e-16 ***
## `Home_Owner?`OWN 9.899 < 2e-16 ***
## `Home_Owner?`RENT 18.782 < 2e-16 ***
## `Home_Owner?`Unknown 7.156 8.52e-13 ***
## Annual_Income -14.752 < 2e-16 ***
## Income_VerifiedVERIFIED - income 19.539 < 2e-16 ***
## Income_VerifiedVERIFIED - income source 3.569 0.000359 ***
## Purpose_Of_Loandebt_consolidation 5.378 7.59e-08 ***
## Purpose_Of_Loanhome_improvement 6.395 1.63e-10 ***
## Purpose_Of_Loanhouse 7.413 1.27e-13 ***
## Purpose_Of_Loanmajor_purchase 4.463 8.11e-06 ***
## Purpose_Of_Loanmedical 12.168 < 2e-16 ***
## Purpose_Of_Loanmoving 13.129 < 2e-16 ***
## Purpose_Of_Loanother 2.157 0.031025 *
## Purpose_Of_Loanrenewable_energy 2.696 0.007012 **
## Purpose_Of_Loansmall_business 13.941 < 2e-16 ***
## Purpose_Of_Loanvacation 10.367 < 2e-16 ***
## Purpose_Of_Loanwedding 5.037 4.77e-07 ***
## Debt_To_Income 11.694 < 2e-16 ***
## Number_Delinqueny_2yrs 3.922 8.80e-05 ***
## Inquiries_Last_6Mo 50.879 < 2e-16 ***
## poly(Months_Since_Deliquency, 2, raw = TRUE)1 -10.740 < 2e-16 ***
## poly(Months_Since_Deliquency, 2, raw = TRUE)2 10.680 < 2e-16 ***
## poly(Months_Since_Record, 2, raw = TRUE)1 -7.211 5.68e-13 ***
## poly(Months_Since_Record, 2, raw = TRUE)2 7.137 9.75e-13 ***
## Number_Open_Accounts 23.088 < 2e-16 ***
## Revolving_Balance -19.245 < 2e-16 ***
## Revolving_Utilization 74.355 < 2e-16 ***
## Total_Accounts -18.184 < 2e-16 ***
## Collections_12Mo_Exclude_Med 3.637 0.000276 ***
## poly(Months_Since_Last_Derogatory, 2, raw = TRUE)1 4.211 2.56e-05 ***
## poly(Months_Since_Last_Derogatory, 2, raw = TRUE)2 -4.231 2.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03178 on 28124 degrees of freedom
## Multiple R-squared: 0.4883, Adjusted R-squared: 0.4877
## F-statistic: 813.4 on 33 and 28124 DF, p-value: < 2.2e-16
RMSE
## [1] 0.03175754
Most Significant 3 Variables
Put simply, an “interest rate” is the monetary charge for the privilege of borrowing money. The rate is expressed as a percentage of the loan principal. Interest rates are a function of the borrower’s risk to the lender. While the lender will always charge interest in order to make money on the transaction, high risk borrowers are charged more interest in order to pay the lender back for “accepting” the risk of the loan.
That being said, we ran lm.beta on our regression in order to look at the standardized coefficients. This allowed us to determine which 3 variables were the most significant in determining interest rates. As seen in the lm.beta output below, We found the most significant variables to be (in order of most to least significant): Initial_Loan_Term_Months(60Months), Revolving_Utilization, and Inquiries_Last_6Mo.
1) Initial_Loan_Term_Months (60 months) - The number of payments on the loan. Values are in months and can be either 36 or 60. A longer loan term indicates higher risk to the lender, as their money will be tied up longer. This gives the borrower a higher chance of defaulting on the loan. Due to this, the significance of this variable in the determination of interest rates makes sense.
2) Revolving_Utilization - Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. If a borrower is constantly utilizing 100% of their available credit, this indicates to the lender that the borrower may be living beyond their means and utilizing credit cards to pay for everyday bills and expenses. This can be indicative of a borrower who isn’t as responsible with money and extended credit, and thus might indicate a person who is more likely to default on their loan. Due to this, revolving utilization being significant in the determination of interest rates is not surprising.
3) Inquiries_Last_6Mo - The number of inquiries by creditors during the past 6 months. Much like a person who utilizes a high percentage of their available credit, lenders also don’t like to see borrowers who often apply for new credit. This also can be indicative of someone who is not as responsible with their spending and may be prone to getting into financial trouble. Due to this, lenders like to know how often a borrower has recently applied for new credit and might base their interest rate decisions on the answer to that question. A person who is often trying to obtain more credit accounts might be seen as a higher risk to default due to overextending themselves or spending more on credit than they can afford. Due to this, the inquiries metric being significant to interest rate predictions is unsurprising.
Future Improvement
We believe there are ways our model can be improved. Instead of excluding State and Zip code variable, to improve the model we could look at these states individually and include states that appear to be significant to the model. For example, looking at a quick regression reveals that both DC and VT appear to have significance. The same could be done with zip code. We dropped the two x’s on the end of the zip code provided to us, and classified Zip_3 as a factor variable. Looking at a regression between zip and interest rate reveals there are numerous significant zips as well. With more time, we would add these back into the model in an attempt to get a better fit.