My plan is to take a bit of a two-track approach and work on two aspects of the project at the same time. On one side, my goal is to explore the data and then prepare it into a final form, which will be a .csv file of cleaned and pruned data. On the other side, I am hoping to do the actual model building and testing separately in a different file. The starting point of the modeling script will be the tidy .csv produced in data preparation script.

The were a few reasons I chose this approach. First, it simply felt a bit less intimidating and daunting to split things up into two separate, smaller projects. Second, it felt like it would be easier to experiment with modeling packages if all of that was separated. Lastly, if I am able to do this as I hope, it will be allow me to be able to robustly test the inclusion of different variables in the full model without having to make a lot of changes.

Data import and cleaning

Our first step in any analysis is importing and exploring the data. Data used for this project is provided by the professor. It includes information on 34,000 loan applications. The training data includes 29,000 observations that have interest rates provided for the loans and the test data contains the remaining 5,000 observations that are missing these values.

One of our first issues to deal with is missing information. We have NA values in the length of employment, annual income, months since delinquency, months since record, and months since derogatory. For most of these, we assume the NA is equivalent to zero. It makes sense to change many of these NAs to zeros because it is a positive if you have not had any missed payments at all and we want tot take something like that into account for prediciting the interest rate. For other values, like whether an employer or income is listed, we want to note the absence by making a dummy variable showing whether there is anything listed in these categories.

I spent some time trying to figure out what to do with the missing values for annual income. My gut feeling was that these individuals would who did not report any income would end up receiving higher interest rates than those who reported an income. I tried to make a simple naive model predicting the mean income of those who did not report an income based upon the interest rates they received. To say this was a waste of time would be fair. My model predicted that the average income for these individuals would be $79,950 which is a total of six dollars less than the average income. Moral of the story, in this case it looks like we can just use the mean as a dummy value for those who did not list any income.

While this does not make intuitive sense, the data bears it out. A quick sanity check is to look at the average interest rate of those with no income listed 0.1358076 compared to those who did list an income 0.1356019. They are virtually identical. We will use the $79,950 number for annual income for those who did not list.

I attempted some text analysis of the loan descriptions provided by applicants to see if any specific mentions might help improve the model. For example, I used string detect to see if the purpose of the loan was for a boat (as opposed to a more practical vehicle), whether the applicant self reported that they had good, great, or high credit, and whether the applicant mentioned student debt as opposed to non-student loan debt.

If I had some extra time, I would make a dummy variable that counted the number of misspelled words in each loan description. This would have been fun to analyze and there are some r packages that have the ability to do this but since time is not unlimited, this is something I am putting on my list as a potential avenue to explore time permitting.

I also built a very simple model based upon my guesses as to what might be most important. I figure this might be helpful as a sanity check as I get further down the road.

In analyzing my very simple model, the things that jump out are that the length of the loan and debt to income ratio really leap out as obviously important. Somewhat surprisingly, annual income is not as strong a predictor as I would have thought (though there are obvious multicollinearity issues with debt to income).

Exploratory Data Analysis

The first thing I want to look at is the all-important interest rate variable and how it is distributed since this will be the focus of all of the analysis to follow.

One of the first things I want to explore before getting into correlations is the idea that maybe annual income is not a very important predictor.

Annual income does seem to have some predictive value for 36-month loans but not for 60-month loans. This is confirmation that the term of the loan is going to be an important predictor that must be included in any model. I do want to get rid of some of the income outliers to see if it changes anything and look at a cleaner visualization without the individual values.

We can see that there is a huge difference in interest rates at all income levels depending on the initial loan terms. We can confirm this with a quick boxplot.

My first real attempt to start narrowing down which predictors I want to use in my model will require looking at some correlation plots, for which I’ll use the corrplot package. Due to running this so many times, I am going to take smaller samples of the data to keep things moving along.

It was overwhelming trying to look at too many predictors at one time, so I am cycling through small groups of predictors to explore their predictive value and correlations with each other. Above is an example of one group. Loan term is consistently at the top of the list in terms of correlation with interest rate. There ends up being some fairly high correlations between a number of other predictors. For example, the higher the loan amount requested, the more likely it will be a 60-month loan term.

The model side

While we have training and test data, we do not know the actual interest rate for our test data. Thus, I am going to split the training data into a smaller training set (80%) and a separate testing set (20%) that actually has known interest rates to test potential fits against.

By doing all of the preparation separately, it makes most of the process on the modeling side very straight forward. Our feature engineering steps are all basically cut and paste from tidy model vignettes since we are making the more difficult choices in our data preparation script concurrently.

We can use the parsnip package, which is part of tidy models, to specify our linear regression model. Time permitting, we may even be able to try more than one method by setting different engines. Then, we can create and execute the workflow.

Lastly, we can use the testing set we created to obtain predictions and performance metrics for our model. As we make changes on the other side to the .csv, we can see how those impact these performance metrics as we try to tune the model as best as we possibly can.

I decided to try to see if I could build a model by splitting the training data into a smaller training set and a testing set that included the true interest rate to test how the model would do.

The model used was a linear regression model with the lm engine.

With the model in place, I can look at some metrics.

## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0323
## 2 rsq     standard      0.465

The r-squared of 0.465 was the highest I could come up with after testing different combinations of predictors. This was better than the the very simple model I made at the beginning that I built to use for comparison. I think this means I am on the right track but the rmse of .0323 is not as low as I would have expected.

The most important predictors in my model are, in order: