Setup
# load packages
library(tidyverse)
library(ggthemes)
library(ggmap)
library(tidymodels)
library(readr)
theme_set(theme_minimal())
tidymodels_prefer()
# read in data
loans <- read_csv("loans_full_schema.csv")
Description and Issues
This data set contains loans made through the platform Lending Club. Each row represents a loan that was successfully made from one individual to another. The columns contain various information regarding the loan applicant and the loan itself. The only minor issue I see with the data set is the abundance of NA values due to the specificity of many of the columns.
Visualizations and Observations
loans %>%
group_by(loan_purpose) %>%
mutate(loan_purpose_count = n()) %>%
select(loan_purpose, loan_purpose_count) %>%
unique() %>%
ggplot(aes(x = fct_reorder(loan_purpose, loan_purpose_count), y = loan_purpose_count, fill = loan_purpose, label = loan_purpose_count)) +
geom_col() +
geom_text(nudge_y = 150) +
labs(x = "Loan Purpose", y = "", title = "Accepted Loans by Loan Purpose") +
theme(legend.position = "none", axis.text.x = element_text(size = 4.5))

Bar chart of accepted loans by loan purpose. There are the most loans for debt consolidation followed by credit card and the least for renewable energy.
loans %>%
filter(annual_income > 10) %>%
ggplot(aes(x = annual_income, y = interest_rate)) +
geom_point(color = "darkblue") +
scale_x_log10() +
labs(x = "Annual Income (log adjusted)", y = "Interest Rate", title = "Interest Rate vs. Annual Income")

Plot of interest rate vs. annual income. There appears to be no obvious correlation between income and interest rate, except that the distribution of incomes seems to be tighter for higher interest rate loans.
loans %>%
ggplot(aes(x = interest_rate, color = homeownership)) +
geom_density() +
labs(x = "Interest Rate", y = "", title = "Distribution of Loan Interest Rates based on Home Ownership")

Density plot of loan interest rates based on home ownership. Generally, interest rates are similar for all types of home ownership, however there are slightly more low interest rate loans for applicants who mortgage or own their house and slightly more high interest loans for applicants who rent their house.
loans %>%
ggplot(aes(x = grade, y = interest_rate, fill = grade)) +
geom_boxplot() +
labs(x = "Loan Grade", y = "Interest Rate", title = "Loan Interest Rates by Grade") +
theme(legend.position = "none")

Box plot of loan interest rates by grade. The lower the interest rate on the loan, the better the grade the loan receives.
states_map <- map_data("state")
loans %>%
mutate(state_name = state.name[match(state, state.abb)]) %>%
mutate(state_name = str_to_lower(state_name)) %>%
filter(is.na(state_name)==FALSE) %>%
group_by(state_name) %>%
mutate(state_count = n()) %>%
select(state_name, state_count) %>%
unique() %>%
ggplot() +
geom_map(map = states_map, aes(map_id = state_name, fill = state_count)) +
expand_limits(x = states_map$long, y = states_map$lat) +
theme_map() +
labs(title = "Number of Loans by State") +
theme(legend.title = element_blank())

Map of the number of loans by state. California appears to have the most loans, followed by Texas, New York, and Florida.
Modeling Interest Rate
Linear Regression with Cross-Validation
set.seed(1)
# linear regression model spec
lm_spec <-
linear_reg() %>%
set_engine(engine = 'lm') %>%
set_mode('regression')
# cross-validation folds
loans_cv <- vfold_cv(loans, v = 10)
# workflow
mod1_wf <- workflow() %>%
add_formula(interest_rate ~ annual_income + debt_to_income + delinq_2y) %>%
add_model(lm_spec)
# cross-validate with workflow
mod1_cv <- fit_resamples(mod1_wf, resamples = loans_cv, metrics = metric_set(rmse, rsq, mae))
mod1_cv %>% collect_metrics()
GAMs
set.seed(1)
# gam model spec
gam_spec <- gen_additive_mod() %>%
set_engine(engine = 'mgcv') %>%
set_mode('regression')
# fitting gam model
gam_mod <- fit(gam_spec, interest_rate ~ s(annual_income) + s(debt_to_income) + s(delinq_2y), data = loans)
par(mfrow=c(2,2))
gam_mod %>% pluck('fit') %>%
mgcv::gam.check()

##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 3.993287e-05 .
## The Hessian was positive definite.
## Model rank = 28 / 28
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(annual_income) 9.00 5.83 0.99 0.23
## s(debt_to_income) 9.00 7.18 0.99 0.18
## s(delinq_2y) 9.00 5.08 1.00 0.51
gam_mod %>% pluck('fit') %>%
plot(all.terms = TRUE, pages = 1)

Analysis
Overall, neither model is very accurate at predicting interest rate. The linear regression model with cross-validation has a very low R-squared value and the MAE and RMSE and quite high. For the GAMs model, our histogram of residuals is not symmetric and there is no correlation in the response vs. fitted values graph. Additionally, we see our estimated non-linear functions for each variable have quite a bit of error as each variable increases.
Our linear regression model assumes a linear relationship between the predictor and outcome variables. While GAMs no longer assumes a linear relationship, the relationship between predictor variables is still assumed to be additive.
There are many ways to enhance and improve upon both these models. Using a different set of predictor variables for our models could show more predicting power. Additionally, doing cross-validation with the GAMs model. With more time, I would explore using different model types like LASSO or non-parametric models like KNN with the previous enhancements as well. Furthermore, my approach to choosing the variables for the model were solely based off of my own intuition on what I thought would predict interest rate, however, with more time, I could analyze variable importance using LASSO or subset selection to make a more refined model.
LS0tCnRpdGxlOiAiQ2FzZSBTdHVkeSAjMSIKYXV0aG9yOiAiSmFzb24gV2hpdGVsYXciCmRhdGU6ICI0LzEwLzIwMjIiCm91dHB1dDogCiAgaHRtbF9kb2N1bWVudDoKICAgIGtlZXBfbWQ6IFRSVUUgICAgCiAgICB0b2M6IFRSVUUKICAgIHRvY19mbG9hdDogVFJVRQogICAgZGZfcHJpbnQ6IHBhZ2VkCiAgICBjb2RlX2Rvd25sb2FkOiBUUlVFCi0tLQoKIyMgU2V0dXAKCmBgYHtyIHNldHVwLCBpbmNsdWRlPUZBTFNFfQojIHIgc2V0dXAKa25pdHI6Om9wdHNfY2h1bmskc2V0KGVjaG8gPSBUUlVFLCBlcnJvcj1UUlVFLCBtZXNzYWdlPUZBTFNFLCB3YXJuaW5nPUZBTFNFKQpgYGAKCmBgYHtyfQojIGxvYWQgcGFja2FnZXMKbGlicmFyeSh0aWR5dmVyc2UpCmxpYnJhcnkoZ2d0aGVtZXMpCmxpYnJhcnkoZ2dtYXApCmxpYnJhcnkodGlkeW1vZGVscykKbGlicmFyeShyZWFkcikKdGhlbWVfc2V0KHRoZW1lX21pbmltYWwoKSkKdGlkeW1vZGVsc19wcmVmZXIoKQpgYGAKCmBgYHtyfQojIHJlYWQgaW4gZGF0YQpsb2FucyA8LSByZWFkX2NzdigibG9hbnNfZnVsbF9zY2hlbWEuY3N2IikKYGBgCgojIyBEZXNjcmlwdGlvbiBhbmQgSXNzdWVzCgpUaGlzIGRhdGEgc2V0IGNvbnRhaW5zIGxvYW5zIG1hZGUgdGhyb3VnaCB0aGUgcGxhdGZvcm0gTGVuZGluZyBDbHViLiBFYWNoIHJvdyByZXByZXNlbnRzIGEgbG9hbiB0aGF0IHdhcyBzdWNjZXNzZnVsbHkgbWFkZSBmcm9tIG9uZSBpbmRpdmlkdWFsIHRvIGFub3RoZXIuIFRoZSBjb2x1bW5zIGNvbnRhaW4gdmFyaW91cyBpbmZvcm1hdGlvbiByZWdhcmRpbmcgdGhlIGxvYW4gYXBwbGljYW50IGFuZCB0aGUgbG9hbiBpdHNlbGYuIFRoZSBvbmx5IG1pbm9yIGlzc3VlIEkgc2VlIHdpdGggdGhlIGRhdGEgc2V0IGlzIHRoZSBhYnVuZGFuY2Ugb2YgTkEgdmFsdWVzIGR1ZSB0byB0aGUgc3BlY2lmaWNpdHkgb2YgbWFueSBvZiB0aGUgY29sdW1ucy4KCiMjIFZpc3VhbGl6YXRpb25zIGFuZCBPYnNlcnZhdGlvbnMKCmBgYHtyfQpsb2FucyAlPiUKICBncm91cF9ieShsb2FuX3B1cnBvc2UpICU+JQogIG11dGF0ZShsb2FuX3B1cnBvc2VfY291bnQgPSBuKCkpICU+JQogIHNlbGVjdChsb2FuX3B1cnBvc2UsIGxvYW5fcHVycG9zZV9jb3VudCkgJT4lCiAgdW5pcXVlKCkgJT4lCiAgZ2dwbG90KGFlcyh4ID0gZmN0X3Jlb3JkZXIobG9hbl9wdXJwb3NlLCBsb2FuX3B1cnBvc2VfY291bnQpLCB5ID0gbG9hbl9wdXJwb3NlX2NvdW50LCBmaWxsID0gbG9hbl9wdXJwb3NlLCBsYWJlbCA9IGxvYW5fcHVycG9zZV9jb3VudCkpICsKICBnZW9tX2NvbCgpICsKICBnZW9tX3RleHQobnVkZ2VfeSA9IDE1MCkgKwogIGxhYnMoeCA9ICJMb2FuIFB1cnBvc2UiLCB5ID0gIiIsIHRpdGxlID0gIkFjY2VwdGVkIExvYW5zIGJ5IExvYW4gUHVycG9zZSIpICsKICB0aGVtZShsZWdlbmQucG9zaXRpb24gPSAibm9uZSIsIGF4aXMudGV4dC54ID0gZWxlbWVudF90ZXh0KHNpemUgPSA0LjUpKQpgYGAKCkJhciBjaGFydCBvZiBhY2NlcHRlZCBsb2FucyBieSBsb2FuIHB1cnBvc2UuIFRoZXJlIGFyZSB0aGUgbW9zdCBsb2FucyBmb3IgZGVidCBjb25zb2xpZGF0aW9uIGZvbGxvd2VkIGJ5IGNyZWRpdCBjYXJkIGFuZCB0aGUgbGVhc3QgZm9yIHJlbmV3YWJsZSBlbmVyZ3kuCgpgYGB7cn0KbG9hbnMgJT4lCiAgZmlsdGVyKGFubnVhbF9pbmNvbWUgPiAxMCkgJT4lCiAgZ2dwbG90KGFlcyh4ID0gYW5udWFsX2luY29tZSwgeSA9IGludGVyZXN0X3JhdGUpKSArCiAgZ2VvbV9wb2ludChjb2xvciA9ICJkYXJrYmx1ZSIpICsKICBzY2FsZV94X2xvZzEwKCkgKwogIGxhYnMoeCA9ICJBbm51YWwgSW5jb21lIChsb2cgYWRqdXN0ZWQpIiwgeSA9ICJJbnRlcmVzdCBSYXRlIiwgdGl0bGUgPSAiSW50ZXJlc3QgUmF0ZSB2cy4gQW5udWFsIEluY29tZSIpCmBgYAoKUGxvdCBvZiBpbnRlcmVzdCByYXRlIHZzLiBhbm51YWwgaW5jb21lLiBUaGVyZSBhcHBlYXJzIHRvIGJlIG5vIG9idmlvdXMgY29ycmVsYXRpb24gYmV0d2VlbiBpbmNvbWUgYW5kIGludGVyZXN0IHJhdGUsIGV4Y2VwdCB0aGF0IHRoZSBkaXN0cmlidXRpb24gb2YgaW5jb21lcyBzZWVtcyB0byBiZSB0aWdodGVyIGZvciBoaWdoZXIgaW50ZXJlc3QgcmF0ZSBsb2Fucy4KCmBgYHtyfQpsb2FucyAlPiUKICBnZ3Bsb3QoYWVzKHggPSBpbnRlcmVzdF9yYXRlLCBjb2xvciA9IGhvbWVvd25lcnNoaXApKSArCiAgZ2VvbV9kZW5zaXR5KCkgKwogIGxhYnMoeCA9ICJJbnRlcmVzdCBSYXRlIiwgeSA9ICIiLCB0aXRsZSA9ICJEaXN0cmlidXRpb24gb2YgTG9hbiBJbnRlcmVzdCBSYXRlcyBiYXNlZCBvbiBIb21lIE93bmVyc2hpcCIpCmBgYAoKRGVuc2l0eSBwbG90IG9mIGxvYW4gaW50ZXJlc3QgcmF0ZXMgYmFzZWQgb24gaG9tZSBvd25lcnNoaXAuIEdlbmVyYWxseSwgaW50ZXJlc3QgcmF0ZXMgYXJlIHNpbWlsYXIgZm9yIGFsbCB0eXBlcyBvZiBob21lIG93bmVyc2hpcCwgaG93ZXZlciB0aGVyZSBhcmUgc2xpZ2h0bHkgbW9yZSBsb3cgaW50ZXJlc3QgcmF0ZSBsb2FucyBmb3IgYXBwbGljYW50cyB3aG8gbW9ydGdhZ2Ugb3Igb3duIHRoZWlyIGhvdXNlIGFuZCBzbGlnaHRseSBtb3JlIGhpZ2ggaW50ZXJlc3QgbG9hbnMgZm9yIGFwcGxpY2FudHMgd2hvIHJlbnQgdGhlaXIgaG91c2UuCgpgYGB7cn0KbG9hbnMgJT4lCiAgZ2dwbG90KGFlcyh4ID0gZ3JhZGUsIHkgPSBpbnRlcmVzdF9yYXRlLCBmaWxsID0gZ3JhZGUpKSArCiAgZ2VvbV9ib3hwbG90KCkgKwogIGxhYnMoeCA9ICJMb2FuIEdyYWRlIiwgeSA9ICJJbnRlcmVzdCBSYXRlIiwgdGl0bGUgPSAiTG9hbiBJbnRlcmVzdCBSYXRlcyBieSBHcmFkZSIpICsKICB0aGVtZShsZWdlbmQucG9zaXRpb24gPSAibm9uZSIpCmBgYAoKQm94IHBsb3Qgb2YgbG9hbiBpbnRlcmVzdCByYXRlcyBieSBncmFkZS4gVGhlIGxvd2VyIHRoZSBpbnRlcmVzdCByYXRlIG9uIHRoZSBsb2FuLCB0aGUgYmV0dGVyIHRoZSBncmFkZSB0aGUgbG9hbiByZWNlaXZlcy4KCmBgYHtyfQpzdGF0ZXNfbWFwIDwtIG1hcF9kYXRhKCJzdGF0ZSIpCgpsb2FucyAlPiUKICBtdXRhdGUoc3RhdGVfbmFtZSA9IHN0YXRlLm5hbWVbbWF0Y2goc3RhdGUsIHN0YXRlLmFiYildKSAlPiUKICBtdXRhdGUoc3RhdGVfbmFtZSA9IHN0cl90b19sb3dlcihzdGF0ZV9uYW1lKSkgJT4lCiAgZmlsdGVyKGlzLm5hKHN0YXRlX25hbWUpPT1GQUxTRSkgJT4lCiAgZ3JvdXBfYnkoc3RhdGVfbmFtZSkgJT4lCiAgbXV0YXRlKHN0YXRlX2NvdW50ID0gbigpKSAlPiUKICBzZWxlY3Qoc3RhdGVfbmFtZSwgc3RhdGVfY291bnQpICU+JQogIHVuaXF1ZSgpICU+JQogIGdncGxvdCgpICsKICBnZW9tX21hcChtYXAgPSBzdGF0ZXNfbWFwLCBhZXMobWFwX2lkID0gc3RhdGVfbmFtZSwgZmlsbCA9IHN0YXRlX2NvdW50KSkgKwogIGV4cGFuZF9saW1pdHMoeCA9IHN0YXRlc19tYXAkbG9uZywgeSA9IHN0YXRlc19tYXAkbGF0KSArIAogIHRoZW1lX21hcCgpICsKICBsYWJzKHRpdGxlID0gIk51bWJlciBvZiBMb2FucyBieSBTdGF0ZSIpICsKICB0aGVtZShsZWdlbmQudGl0bGUgPSBlbGVtZW50X2JsYW5rKCkpCmBgYAoKTWFwIG9mIHRoZSBudW1iZXIgb2YgbG9hbnMgYnkgc3RhdGUuIENhbGlmb3JuaWEgYXBwZWFycyB0byBoYXZlIHRoZSBtb3N0IGxvYW5zLCBmb2xsb3dlZCBieSBUZXhhcywgTmV3IFlvcmssIGFuZCBGbG9yaWRhLgoKIyMgTW9kZWxpbmcgSW50ZXJlc3QgUmF0ZQoKIyMjIExpbmVhciBSZWdyZXNzaW9uIHdpdGggQ3Jvc3MtVmFsaWRhdGlvbgoKYGBge3J9CnNldC5zZWVkKDEpCgojIGxpbmVhciByZWdyZXNzaW9uIG1vZGVsIHNwZWMKbG1fc3BlYyA8LQogICAgbGluZWFyX3JlZygpICU+JSAKICAgIHNldF9lbmdpbmUoZW5naW5lID0gJ2xtJykgJT4lIAogICAgc2V0X21vZGUoJ3JlZ3Jlc3Npb24nKQoKIyBjcm9zcy12YWxpZGF0aW9uIGZvbGRzCmxvYW5zX2N2IDwtIHZmb2xkX2N2KGxvYW5zLCB2ID0gMTApCmBgYAoKYGBge3J9CiMgd29ya2Zsb3cKbW9kMV93ZiA8LSB3b3JrZmxvdygpICU+JQogIGFkZF9mb3JtdWxhKGludGVyZXN0X3JhdGUgfiBhbm51YWxfaW5jb21lICsgZGVidF90b19pbmNvbWUgKyBkZWxpbnFfMnkpICU+JQogIGFkZF9tb2RlbChsbV9zcGVjKQoKIyBjcm9zcy12YWxpZGF0ZSB3aXRoIHdvcmtmbG93Cm1vZDFfY3YgPC0gZml0X3Jlc2FtcGxlcyhtb2QxX3dmLCByZXNhbXBsZXMgPSBsb2Fuc19jdiwgbWV0cmljcyA9IG1ldHJpY19zZXQocm1zZSwgcnNxLCBtYWUpKQpgYGAKCmBgYHtyfQptb2QxX2N2ICU+JSBjb2xsZWN0X21ldHJpY3MoKQpgYGAKCiMjIyBHQU1zCgpgYGB7cn0Kc2V0LnNlZWQoMSkKCiMgZ2FtIG1vZGVsIHNwZWMKZ2FtX3NwZWMgPC0gZ2VuX2FkZGl0aXZlX21vZCgpICU+JQogIHNldF9lbmdpbmUoZW5naW5lID0gJ21nY3YnKSAlPiUKICBzZXRfbW9kZSgncmVncmVzc2lvbicpIAoKIyBmaXR0aW5nIGdhbSBtb2RlbApnYW1fbW9kIDwtIGZpdChnYW1fc3BlYywgaW50ZXJlc3RfcmF0ZSB+IHMoYW5udWFsX2luY29tZSkgKyBzKGRlYnRfdG9faW5jb21lKSArIHMoZGVsaW5xXzJ5KSwgZGF0YSA9IGxvYW5zKQpgYGAKCmBgYHtyfQpwYXIobWZyb3c9YygyLDIpKQpnYW1fbW9kICU+JSBwbHVjaygnZml0JykgJT4lIAogIG1nY3Y6OmdhbS5jaGVjaygpIApgYGAKCmBgYHtyfQpnYW1fbW9kICU+JSBwbHVjaygnZml0JykgJT4lIAogIHBsb3QoYWxsLnRlcm1zID0gVFJVRSwgcGFnZXMgPSAxKQpgYGAKCiMjIEFuYWx5c2lzCgpPdmVyYWxsLCBuZWl0aGVyIG1vZGVsIGlzIHZlcnkgYWNjdXJhdGUgYXQgcHJlZGljdGluZyBpbnRlcmVzdCByYXRlLiBUaGUgbGluZWFyIHJlZ3Jlc3Npb24gbW9kZWwgd2l0aCBjcm9zcy12YWxpZGF0aW9uIGhhcyBhIHZlcnkgbG93IFItc3F1YXJlZCB2YWx1ZSBhbmQgdGhlIE1BRSBhbmQgUk1TRSBhbmQgcXVpdGUgaGlnaC4gRm9yIHRoZSBHQU1zIG1vZGVsLCBvdXIgaGlzdG9ncmFtIG9mIHJlc2lkdWFscyBpcyBub3Qgc3ltbWV0cmljIGFuZCB0aGVyZSBpcyBubyBjb3JyZWxhdGlvbiBpbiB0aGUgcmVzcG9uc2UgdnMuIGZpdHRlZCB2YWx1ZXMgZ3JhcGguIEFkZGl0aW9uYWxseSwgd2Ugc2VlIG91ciBlc3RpbWF0ZWQgbm9uLWxpbmVhciBmdW5jdGlvbnMgZm9yIGVhY2ggdmFyaWFibGUgaGF2ZSBxdWl0ZSBhIGJpdCBvZiBlcnJvciBhcyBlYWNoIHZhcmlhYmxlIGluY3JlYXNlcy4KCk91ciBsaW5lYXIgcmVncmVzc2lvbiBtb2RlbCBhc3N1bWVzIGEgbGluZWFyIHJlbGF0aW9uc2hpcCBiZXR3ZWVuIHRoZSBwcmVkaWN0b3IgYW5kIG91dGNvbWUgdmFyaWFibGVzLiBXaGlsZSBHQU1zIG5vIGxvbmdlciBhc3N1bWVzIGEgbGluZWFyIHJlbGF0aW9uc2hpcCwgdGhlIHJlbGF0aW9uc2hpcCBiZXR3ZWVuIHByZWRpY3RvciB2YXJpYWJsZXMgaXMgc3RpbGwgYXNzdW1lZCB0byBiZSBhZGRpdGl2ZS4KClRoZXJlIGFyZSBtYW55IHdheXMgdG8gZW5oYW5jZSBhbmQgaW1wcm92ZSB1cG9uIGJvdGggdGhlc2UgbW9kZWxzLiBVc2luZyBhIGRpZmZlcmVudCBzZXQgb2YgcHJlZGljdG9yIHZhcmlhYmxlcyBmb3Igb3VyIG1vZGVscyBjb3VsZCBzaG93IG1vcmUgcHJlZGljdGluZyBwb3dlci4gQWRkaXRpb25hbGx5LCBkb2luZyBjcm9zcy12YWxpZGF0aW9uIHdpdGggdGhlIEdBTXMgbW9kZWwuIFdpdGggbW9yZSB0aW1lLCBJIHdvdWxkIGV4cGxvcmUgdXNpbmcgZGlmZmVyZW50IG1vZGVsIHR5cGVzIGxpa2UgTEFTU08gb3Igbm9uLXBhcmFtZXRyaWMgbW9kZWxzIGxpa2UgS05OIHdpdGggdGhlIHByZXZpb3VzIGVuaGFuY2VtZW50cyBhcyB3ZWxsLiBGdXJ0aGVybW9yZSwgbXkgYXBwcm9hY2ggdG8gY2hvb3NpbmcgdGhlIHZhcmlhYmxlcyBmb3IgdGhlIG1vZGVsIHdlcmUgc29sZWx5IGJhc2VkIG9mZiBvZiBteSBvd24gaW50dWl0aW9uIG9uIHdoYXQgSSB0aG91Z2h0IHdvdWxkIHByZWRpY3QgaW50ZXJlc3QgcmF0ZSwgaG93ZXZlciwgd2l0aCBtb3JlIHRpbWUsIEkgY291bGQgYW5hbHl6ZSB2YXJpYWJsZSBpbXBvcnRhbmNlIHVzaW5nIExBU1NPIG9yIHN1YnNldCBzZWxlY3Rpb24gdG8gbWFrZSBhIG1vcmUgcmVmaW5lZCBtb2RlbC4K