## Executive Summary
#For our Big Data Project, we chose to focus on accurately predicting and analyzing the possibility of loan default, and on the factors that impact interest rates, in Lending Club's datasets. Lending Club is a peer-to-peer lending site where an individual can apply for a personal loan that may be selected for funding by a single person or a group of funders. Lending Club requires relevant financial data from the applicant, such as loan amount, income, and employment status. It then grades the loan request and sets the interest rate that will be charged by assessing risk from the provided information, and puts the loan “up for auction” to see whether any funders would like to fulfill the loan at the specified amount and percentage rate. The first step in the group process was to identify an appropriate dataset that we could use to run a regression model and understand the variables that lead to loan default.
#We searched various internet forums, including Kaggle, for datasets relevant to our goal of analyzing the possibility of loan default. We ultimately found a dataset on Kaggle, originally from Lending Club, that provided the relevant variables to test in a predictive model. Most importantly, this dataset had a field that indicated the status of the loan; this variable is what we used for our binomial classification problem. For our regression problem (where the output is a numerical scalar value), we examined what factors impact the interest rate of the loan. We ultimately decided on a set of 8 variables that would most likely have the largest impact on both of the studied outcomes: loan amount, interest rate, loan grade, employment length, home ownership, annual income, loan term, and loan status.
#Our findings matched logical assumptions for both models.
#For loan outcome, the following columns showed positive coefficients:
# 1. Loan Amount
# 2. Interest Rate
# 3. Employment length being n/a
# 4. Term length being 60 months instead of 36 months
# 5. Grades A-D impact positively, but less with each drop in grade
#For loan outcome, the following columns showed negative coefficients:
# 1. Annual Income
# 2. Being employed is significantly better than being n/a
# 3. Grades E-G impact negatively
#The probability of a loan defaulting moves inversely with these factors.
#A model built on these coefficients predicted loan outcome with roughly 80% accuracy.
#For interest rate, our findings were:
# 1. The higher the loan amount, the higher the interest rate will be.
# 2. The lower the grade of the loan, the higher the interest rate will be.
# 3. The longer one has been employed, the lower the interest rate will be.
# 4. Home ownership type was not found to have any statistical significance in determining the interest rate.
# 5. The higher one's annual income, the lower the interest rate will be.
# 6. A longer term (60 months) results in a higher interest rate.
# 7. Loans that defaulted typically had higher interest rates.
#In general, investing money in a loan on Lending Club is similar to investing in the stock market: higher interest comes with higher risk. But you can use positive features, like employment and annual income, to select high-interest loans that have a lower risk of defaulting.
## First, we will load the required libraries for our analysis:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggthemes)
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(DT)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
## dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
## Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/modules/R_X11.so
## Reason: image not found
## Could not load tcltk. Will use slower R code instead.
## Loading required package: RSQLite
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(DescTools)
##
## Attaching package: 'DescTools'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## Ian - Pathname for storage "/Users/ian/Downloads/loan.csv"
## Kevin - Pathname for storage "/Users/Spartatoe/Documents/MBA/Fall 2019/GSBAS 576/Group Project/loan.csv"
## Spencer - Pathname for storage: "/Users/spencermorgan/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - Spencer’s MacBook Air/Big Data/loan.csv"
## Load in the dataset, set the blank spaces to NA, and examine the dimensions and names of all 74 columns
loan = read_csv("/Users/ian/Downloads/loan.csv" , na = "")
## Parsed with column specification:
## cols(
## .default = col_double(),
## term = col_character(),
## grade = col_character(),
## sub_grade = col_character(),
## emp_title = col_character(),
## emp_length = col_character(),
## home_ownership = col_character(),
## verification_status = col_character(),
## issue_d = col_character(),
## loan_status = col_character(),
## pymnt_plan = col_character(),
## url = col_character(),
## desc = col_character(),
## purpose = col_character(),
## title = col_character(),
## zip_code = col_character(),
## addr_state = col_character(),
## earliest_cr_line = col_character(),
## initial_list_status = col_logical(),
## last_pymnt_d = col_character(),
## next_pymnt_d = col_character()
## # ... with 23 more columns
## )
## See spec(...) for full column specifications.
## Warning: 3400965 parsing failures.
## row col expected actual file
## 42536 initial_list_status 1/0/T/F/TRUE/FALSE w '/Users/ian/Downloads/loan.csv'
## 42536 tot_coll_amt 1/0/T/F/TRUE/FALSE 0.0 '/Users/ian/Downloads/loan.csv'
## 42536 tot_cur_bal 1/0/T/F/TRUE/FALSE 114834.0 '/Users/ian/Downloads/loan.csv'
## 42536 total_rev_hi_lim 1/0/T/F/TRUE/FALSE 59900.0 '/Users/ian/Downloads/loan.csv'
## 42537 tot_coll_amt 1/0/T/F/TRUE/FALSE 0.0 '/Users/ian/Downloads/loan.csv'
## ..... ................... .................. ........ ...............................
## See problems(...) for more details.
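## The parsing failures above come from read_csv guessing column types from only the first 1,000 rows: initial_list_status gets guessed as logical and then fails once character values appear at row 42536. A sketch of a fix, not evaluated here (guess_max = 100000 is an arbitrary illustrative choice):
loan <- read_csv("/Users/ian/Downloads/loan.csv" , na = "",
                 guess_max = 100000) # guess column types from far more rows than the default 1000
## An explicit spec also works: col_types = cols(initial_list_status = col_character(), .default = col_guess())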
dim(loan)
## [1] 887379 74
colnames(loan)
## [1] "id" "member_id"
## [3] "loan_amnt" "funded_amnt"
## [5] "funded_amnt_inv" "term"
## [7] "int_rate" "installment"
## [9] "grade" "sub_grade"
## [11] "emp_title" "emp_length"
## [13] "home_ownership" "annual_inc"
## [15] "verification_status" "issue_d"
## [17] "loan_status" "pymnt_plan"
## [19] "url" "desc"
## [21] "purpose" "title"
## [23] "zip_code" "addr_state"
## [25] "dti" "delinq_2yrs"
## [27] "earliest_cr_line" "inq_last_6mths"
## [29] "mths_since_last_delinq" "mths_since_last_record"
## [31] "open_acc" "pub_rec"
## [33] "revol_bal" "revol_util"
## [35] "total_acc" "initial_list_status"
## [37] "out_prncp" "out_prncp_inv"
## [39] "total_pymnt" "total_pymnt_inv"
## [41] "total_rec_prncp" "total_rec_int"
## [43] "total_rec_late_fee" "recoveries"
## [45] "collection_recovery_fee" "last_pymnt_d"
## [47] "last_pymnt_amnt" "next_pymnt_d"
## [49] "last_credit_pull_d" "collections_12_mths_ex_med"
## [51] "mths_since_last_major_derog" "policy_code"
## [53] "application_type" "annual_inc_joint"
## [55] "dti_joint" "verification_status_joint"
## [57] "acc_now_delinq" "tot_coll_amt"
## [59] "tot_cur_bal" "open_acc_6m"
## [61] "open_il_6m" "open_il_12m"
## [63] "open_il_24m" "mths_since_rcnt_il"
## [65] "total_bal_il" "il_util"
## [67] "open_rv_12m" "open_rv_24m"
## [69] "max_bal_bc" "all_util"
## [71] "total_rev_hi_lim" "inq_fi"
## [73] "total_cu_tl" "inq_last_12m"
### This graph shows the distribution of loan amounts.
### Looking at the graph below, a majority of the loans are between $5,000 and $20,000
ggplot(data=loan, aes(loan_amnt))+geom_histogram(bins = 40,color="blue",fill="purple")
#See this graph for more detail on the breakdown. It is interesting to note the spike at the maximum loan amount
Desc(loan$loan_amnt, main = "Loan amount distribution", plotit = TRUE)
## ------------------------------------------------------------------------------
## Loan amount distribution
##
## length n NAs unique 0s mean meanCI
## 887'379 887'379 0 1'372 0 14'755.26 14'737.71
## 100.0% 0.0% 0.0% 14'772.82
##
## .05 .10 .25 median .75 .90 .95
## 3'600.00 5'000.00 8'000.00 13'000.00 20'000.00 28'000.00 32'000.00
##
## range sd vcoef mad IQR skew kurt
## 34'500.00 8'435.46 0.57 8'599.08 12'000.00 0.68 -0.26
##
## lowest : 500.0 (11), 550.0, 600.0 (6), 700.0 (3), 725.0
## highest: 34'900.0 (14), 34'925.0 (9), 34'950.0 (18), 34'975.0 (31), 35'000.0 (36'368)
##
## heap(?): remarkable frequency (7.0%) for the mode(s) (= 10000)
# Take a look at the reasons why people take out loans on Lending Club. Debt consolidation is by far the most applied-for and most funded purpose
Desc(loan$purpose, main = "Loan purposes", plotit = TRUE)
## ------------------------------------------------------------------------------
## Loan purposes
##
## length n NAs unique levels dupes
## 887'379 887'379 0 14 14 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 debt_consolidation 524'215 59.1% 524'215 59.1%
## 2 credit_card 206'182 23.2% 730'397 82.3%
## 3 home_improvement 51'829 5.8% 782'226 88.2%
## 4 other 42'894 4.8% 825'120 93.0%
## 5 major_purchase 17'277 1.9% 842'397 94.9%
## 6 small_business 10'377 1.2% 852'774 96.1%
## 7 car 8'863 1.0% 861'637 97.1%
## 8 medical 8'540 1.0% 870'177 98.1%
## 9 moving 5'414 0.6% 875'591 98.7%
## 10 vacation 4'736 0.5% 880'327 99.2%
## 11 house 3'707 0.4% 884'034 99.6%
## 12 wedding 2'347 0.3% 886'381 99.9%
## ... etc.
## [list output truncated]
### This graph indicates that grades B and C tend to have the highest total loan amounts.
### This could be indicative of riskier loans
ggplot(loan, aes(x=grade, y=loan_amnt, fill=grade)) +
stat_summary(fun.y="sum", geom="bar") +
labs(y ="Total Loan Amount",title="Total loan amount based on loan grade")
### This graph indicates that as the grade decreases, the interest rate increases, which makes sense as these loans are most likely riskier.
ggplot(data=loan, aes(grade,int_rate,fill=grade))+geom_boxplot(outlier.color = "blue")+labs(title="Box plot of Interest rate")
### This graph shows a distribution of loan amount by grade.
### For the most part, the loan amounts are consistent across grades.
### However, it does appear that some of the lower grades have more loans of $30,000+
ggplot(data=loan,aes(loan_amnt, fill=grade))+
geom_density(alpha=0.25) +
facet_grid(grade ~ .)
### Graph showing distribution of interest rates depending on grade of the loan
### Hoping to find correlation between the two
hist1=sqldf("select count(*) as freq,grade,int_rate from loan group by grade,int_rate")
p <- ggplot(data = hist1, aes(x = int_rate, y = freq)) +
geom_smooth(aes(colour = grade, fill = grade)) + facet_wrap(~ grade) + labs(x = "Interest Rate", y="Frequency")
ggplotly(p)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
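## The same frequency table can be built without SQL using dplyr, which is already loaded (an equivalent sketch, not evaluated here):
hist1_dplyr <- loan %>%
  count(grade , int_rate) %>%  # one row per grade/interest-rate combination
  rename(freq = n)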
#Confirming the state IDs made it in on import
unique(loan$addr_state)
## [1] "AZ" "GA" "IL" "CA" "OR" "NC" "TX" "VA" "MO" "CT" "UT" "FL" "NY" "PA" "MN"
## [16] "NJ" "KY" "OH" "SC" "RI" "LA" "MA" "WA" "WI" "AL" "CO" "KS" "NV" "AK" "MD"
## [31] "WV" "VT" "MI" "DC" "SD" "NH" "AR" "NM" "MT" "HI" "WY" "OK" "DE" "MS" "TN"
## [46] "IA" "NE" "ID" "IN" "ME" "ND"
### This is a heat map showing the states within the US that have the highest and lowest total loan amounts
suppressPackageStartupMessages(library(maps))
loan$region <- loan$addr_state
loan$region <- as.factor(loan$region)
levels(loan$region)<- c("alaska", "alabama","arkansas", "arizona", "california","colorado","connecticut","district of columbia","delaware","florida","georgia","hawaii","iowa","idaho","illinois","indiana","kansas","kentucky","louisiana","massachusetts","maryland","maine","michigan","minnesota","missouri","mississippi","montana","north carolina","north dakota","nebraska","new hampshire","new jersey","new mexico","nevada","new york","ohio","oklahoma","oregon","pennsylvania","rhode island","south carolina","south dakota","tennessee","texas","utah","virginia","vermont","washington","wisconsin","west virginia","wyoming")
all_states <- map_data("state")
state_by_loan <-loan %>% group_by(region) %>%
summarise(value = sum(loan_amnt, na.rm=TRUE))
state_by_loan$region <- as.character(state_by_loan$region)
Total <- merge(all_states, state_by_loan, by="region")
p <- ggplot()
p <- p + geom_polygon(data=Total, aes(x=long, y=lat, group = group, fill=value), colour="white") +
  scale_fill_continuous(low = "skyblue", high = "darkblue", guide="colorbar")
P1 <- p + theme_bw() + labs(fill = "Gradient of loan amount"
,title = "Heat Map of loan amount in all states", x="", y="")
P1 + scale_y_continuous(breaks=c()) + scale_x_continuous(breaks=c()) + theme(panel.border = element_blank())
### This graph shows a distribution of the different loan status types by state
ggplot(data = loan, aes(x=addr_state, fill=loan_status)) +
labs(x="State", y="Total Loan issued") +
geom_bar() +
coord_flip()
##Bar graph showing a count of the loan status variable, giving an idea of the distribution of the different loan statuses
## In this case, a large majority of the data indicates that the loans are current.
## A much smaller percentage of the loans are defaulted or late
loan %>%
count(loan_status) %>%
ggplot(aes(x = reorder(loan_status , desc(n)) , y = n , fill = n)) +
geom_col() +
coord_flip() +
labs(x = 'Loan Status' , y = 'Count')
# Start to clean data up a little more to make additional graphs
loan = loan %>%
select(loan_status , loan_amnt , int_rate , grade , emp_length , home_ownership ,
annual_inc , term)
loan
## # A tibble: 887,379 x 8
## loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 Fully Paid 5000 10.6 B 10+ years RENT 24000
## 2 Charged Off 2500 15.3 C < 1 year RENT 30000
## 3 Fully Paid 2400 16.0 C 10+ years RENT 12252
## 4 Fully Paid 10000 13.5 C 10+ years RENT 49200
## 5 Current 3000 12.7 B 1 year RENT 80000
## 6 Fully Paid 5000 7.9 A 3 years RENT 36000
## 7 Current 7000 16.0 C 8 years RENT 47004
## 8 Fully Paid 3000 18.6 E 9 years RENT 48000
## 9 Charged Off 5600 21.3 F 4 years OWN 40000
## 10 Charged Off 5375 12.7 B < 1 year RENT 15000
## # … with 887,369 more rows, and 1 more variable: term <chr>
#Find the loan status options to choose what counts as an acceptable loan outcome
unique(loan$loan_status)
## [1] "Fully Paid"
## [2] "Charged Off"
## [3] "Current"
## [4] "Default"
## [5] "Late (31-120 days)"
## [6] "In Grace Period"
## [7] "Late (16-30 days)"
## [8] "Does not meet the credit policy. Status:Fully Paid"
## [9] "Does not meet the credit policy. Status:Charged Off"
## [10] "Issued"
## We want to convert this variable to binary (1 for default, 0 for non-default), but we have 10 different levels. Loans with status Current, Late, or In Grace Period need to be removed. Therefore, we create a new variable called loan_outcome where:
## loan_outcome -> 1 if loan_status = ‘Charged Off’ or ‘Default’
## loan_outcome -> 0 if loan_status = ‘Fully Paid’
loan = loan %>%
mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') ,
1,
ifelse(loan_status == 'Fully Paid' , 0 , 'No info')
))
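## A sketch of an equivalent recoding with dplyr's case_when, which reads more directly than nested ifelse calls; it would leave the dropped statuses as NA rather than the string 'No info' (not evaluated here, shown for comparison only):
loan_alt = loan %>%
  mutate(loan_outcome = case_when(
    loan_status %in% c('Charged Off' , 'Default') ~ '1' ,
    loan_status == 'Fully Paid' ~ '0' ,
    TRUE ~ NA_character_  # everything else (Current, Late, etc.) becomes NA
  ))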
barplot(table(loan$loan_outcome) , col = 'lightblue')
head(loan)
## # A tibble: 6 x 9
## loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 Fully Paid 5000 10.6 B 10+ years RENT 24000
## 2 Charged Off 2500 15.3 C < 1 year RENT 30000
## 3 Fully Paid 2400 16.0 C 10+ years RENT 12252
## 4 Fully Paid 10000 13.5 C 10+ years RENT 49200
## 5 Current 3000 12.7 B 1 year RENT 80000
## 6 Fully Paid 5000 7.9 A 3 years RENT 36000
## # … with 2 more variables: term <chr>, loan_outcome <chr>
# We will create a new dataset which contains only rows with 0 or 1 in loan_outcome feature for better modelling.
# Create the new dataset by filtering 0's and 1's in the loan_outcome column and remove loan_status column for the modelling
loan2 = loan %>%
select(-loan_status) %>%
filter(loan_outcome %in% c(0 , 1))
dim(loan2) # Our new dataset contains 254,190 rows & 8 columns
## [1] 254190 8
#Testing the binomial Y column (loan_outcome) to confirm that only 0 and 1 remain
unique(loan2$loan_outcome)
## [1] "0" "1"
# We assume that grade is a strong predictor of the volume of non-performing loans. But how many loans did not perform, grouped by grade?
table(loan2$grade , factor(loan2$loan_outcome , c(0 , 1) , c('Fully Paid' , 'Default')))
##
## Fully Paid Default
## A 39679 2664
## B 66546 9717
## C 52678 13002
## D 30020 10798
## E 12928 6459
## F 4726 3013
## G 1146 814
#let's graph this so we can see volume by grade and default
ggplot(loan2 , aes(x = grade , y = ..count.. , fill = factor(loan_outcome , c(1 , 0) , c('Default' , 'Fully Paid')))) +
geom_bar() +
theme(legend.title = element_blank())
#Modelling Process:
## Let's convert all non-numeric variables to factors in a new dataset
loan3<-loan2
loan3$home_ownership<-factor(loan3$home_ownership)
loan3$grade<-factor(loan3$grade)
loan3$term<-factor(loan3$term)
loan3$emp_length<-factor(loan3$emp_length)
loan3$loan_outcome<-factor(loan3$loan_outcome)
set.seed(123) #locks seed for random partitioning
#creates a vector of rows to randomly sample from the raw data
inTrain <- createDataPartition(y=loan3$loan_outcome, p=.70, list = FALSE)
#stores these rows in the training set
Training<-loan3[inTrain,]
#stores all rows not in the training set in the test/validation set
Testing<-loan3[-inTrain,]
##LOGISTIC REGRESSION## (the outcome needs to be a factor)
M_LOG<-glm(loan_outcome ~ ., data = Training, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(M_LOG)
##
## Call:
## glm(formula = loan_outcome ~ ., family = "binomial", data = Training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4548 -0.6764 -0.5186 -0.3373 4.6614
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.009e+01 4.395e+01 -0.230 0.818477
## loan_amnt 1.499e-05 9.881e-07 15.167 < 2e-16 ***
## int_rate 1.373e-01 4.765e-03 28.817 < 2e-16 ***
## gradeB 1.315e-01 3.357e-02 3.918 8.91e-05 ***
## gradeC 1.652e-01 4.332e-02 3.814 0.000136 ***
## gradeD 8.502e-02 5.540e-02 1.535 0.124853
## gradeE -4.992e-02 6.870e-02 -0.727 0.467468
## gradeF -2.886e-01 8.504e-02 -3.394 0.000688 ***
## gradeG -4.290e-01 1.039e-01 -4.128 3.67e-05 ***
## emp_length1 year -1.456e-02 3.303e-02 -0.441 0.659277
## emp_length10+ years -6.388e-02 2.532e-02 -2.523 0.011649 *
## emp_length2 years -6.764e-02 3.052e-02 -2.216 0.026665 *
## emp_length3 years -4.748e-02 3.157e-02 -1.504 0.132607
## emp_length4 years -4.054e-02 3.358e-02 -1.207 0.227327
## emp_length5 years -3.975e-02 3.263e-02 -1.218 0.223059
## emp_length6 years 1.545e-02 3.427e-02 0.451 0.652034
## emp_length7 years 5.799e-03 3.459e-02 0.168 0.866869
## emp_length8 years 1.327e-02 3.666e-02 0.362 0.717327
## emp_length9 years 3.873e-04 3.933e-02 0.010 0.992143
## emp_lengthn/a 4.010e-01 3.630e-02 11.048 < 2e-16 ***
## home_ownershipMORTGAGE 6.597e+00 4.395e+01 0.150 0.880688
## home_ownershipNONE 6.573e+00 4.396e+01 0.150 0.881128
## home_ownershipOTHER 7.148e+00 4.395e+01 0.163 0.870815
## home_ownershipOWN 6.666e+00 4.395e+01 0.152 0.879457
## home_ownershipRENT 6.802e+00 4.395e+01 0.155 0.877014
## annual_inc -6.931e-06 2.212e-07 -31.342 < 2e-16 ***
## term60 months 3.316e-01 1.710e-02 19.391 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 169257 on 177933 degrees of freedom
## Residual deviance: 156390 on 177907 degrees of freedom
## AIC: 156444
##
## Number of Fisher Scoring iterations: 7
##The regression output above summarizes the effect the variables have on determining whether someone will or will not default on their loan.
##The larger the loan amount and the interest rate, the more likely one is to default on the loan.
##The lower the grade on the loan, the more likely the loan will default.
##Only employment lengths of 10+ years and n/a are statistically significant. Someone employed for 10+ years is less likely to default, and someone whose employment is n/a is more likely to default.
##Home ownership does not have any statistical significance in determining loan default.
##The higher someone's annual income is, the less likely they are to default.
##A loan term of 60 months makes a loan default more likely.
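##To see how these coefficients combine, we can score a single hypothetical application. All field values below are made up for illustration, not drawn from the dataset (not evaluated here):
new_app <- data.frame(loan_amnt = 15000 , int_rate = 14.5 , grade = 'C' ,
                      emp_length = '10+ years' , home_ownership = 'RENT' ,
                      annual_inc = 55000 , term = '60 months')
predict(M_LOG, newdata = new_app, type = 'response') # predicted probability of default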
exp(cbind(M_LOG$coefficients, confint.default(M_LOG)))
## 2.5 % 97.5 %
## (Intercept) 4.159048e-05 1.604327e-42 1.078189e+33
## loan_amnt 1.000015e+00 1.000013e+00 1.000017e+00
## int_rate 1.147191e+00 1.136527e+00 1.157955e+00
## gradeB 1.140564e+00 1.067945e+00 1.218121e+00
## gradeC 1.179675e+00 1.083651e+00 1.284208e+00
## gradeD 1.088741e+00 9.767174e-01 1.213612e+00
## gradeE 9.513072e-01 8.314626e-01 1.088426e+00
## gradeF 7.492747e-01 6.342481e-01 8.851626e-01
## gradeG 6.511515e-01 5.311411e-01 7.982781e-01
## emp_length1 year 9.855442e-01 9.237722e-01 1.051447e+00
## emp_length10+ years 9.381208e-01 8.926990e-01 9.858538e-01
## emp_length2 years 9.345927e-01 8.803263e-01 9.922043e-01
## emp_length3 years 9.536263e-01 8.964020e-01 1.014504e+00
## emp_length4 years 9.602671e-01 8.990963e-01 1.025600e+00
## emp_length5 years 9.610273e-01 9.014976e-01 1.024488e+00
## emp_length6 years 1.015572e+00 9.496063e-01 1.086119e+00
## emp_length7 years 1.005816e+00 9.398822e-01 1.076375e+00
## emp_length8 years 1.013360e+00 9.431068e-01 1.088846e+00
## emp_length9 years 1.000387e+00 9.261759e-01 1.080545e+00
## emp_lengthn/a 1.493370e+00 1.390812e+00 1.603491e+00
## home_ownershipMORTGAGE 7.331453e+02 2.828207e-35 1.900505e+40
## home_ownershipNONE 7.156649e+02 2.747405e-35 1.864218e+40
## home_ownershipOTHER 1.271638e+03 4.898626e-35 3.301053e+40
## home_ownershipOWN 7.851949e+02 3.028973e-35 2.035446e+40
## home_ownershipRENT 8.997597e+02 3.470948e-35 2.332410e+40
## annual_inc 9.999931e-01 9.999926e-01 9.999935e-01
## term60 months 1.393258e+00 1.347328e+00 1.440754e+00
##The table above shows the odds ratios (the exponentiated coefficients) and their 95% confidence intervals for each variable.
confusionMatrix(table(predict(M_LOG, Testing, type="response") >= 0.5,
Testing$loan_outcome == 1))
## Confusion Matrix and Statistics
##
##
## FALSE TRUE
## FALSE 62129 13765
## TRUE 187 175
##
## Accuracy : 0.817
## 95% CI : (0.8143, 0.8198)
## No Information Rate : 0.8172
## P-Value [Acc > NIR] : 0.547
##
## Kappa : 0.0154
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99700
## Specificity : 0.01255
## Pos Pred Value : 0.81863
## Neg Pred Value : 0.48343
## Prevalence : 0.81719
## Detection Rate : 0.81474
## Detection Prevalence : 0.99525
## Balanced Accuracy : 0.50478
##
## 'Positive' Class : FALSE
##
##The confusion matrix above shows how well our model predicted whether someone will or will not default on their loan. We were able to predict with 81.7% overall accuracy, though the low specificity shows the model catches very few of the actual defaults.
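##Because accuracy alone is misleading when roughly 82% of loans are fully paid, the pROC package loaded earlier could score the model across all thresholds instead of just 0.5. A sketch, not evaluated in the original run:
pred_probs <- predict(M_LOG, Testing, type = 'response')
roc_glm <- roc(response = Testing$loan_outcome , predictor = pred_probs)
auc(roc_glm)  # area under the ROC curve: 0.5 = random ranking, 1 = perfect
plot(roc_glm) # sensitivity vs. specificity trade-off across thresholds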
##CART## #rpart package implementation
train_control <- trainControl(method="cv", number=10, savePredictions = TRUE)
#sets the cross validation parameters to be fed to train()
M_CART <- train(loan_outcome ~., data = Training, trControl=train_control, tuneLength=10, method = "rpart")
#increasing tuneLength increases the number of candidate complexity parameter values evaluated
##the "cv", number = 10 refers to 10-fold cross validation on the training data
print(M_CART)
## CART
##
## 177934 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 160140, 160141, 160140, 160140, 160141, 160142, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0001076029 0.8170333 0.049708345
## 0.0001229748 0.8171850 0.047762354
## 0.0001414210 0.8173536 0.045588162
## 0.0001537184 0.8173986 0.042588010
## 0.0001767762 0.8176121 0.038570760
## 0.0002152058 0.8177302 0.031770406
## 0.0002459495 0.8178313 0.031324347
## 0.0002561974 0.8177864 0.031568566
## 0.0005841301 0.8171457 0.009225183
## 0.0006264027 0.8172468 0.006993798
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0002459495.
##Find the optimal complexity parameter. The optimal cp is 0.0002459495
plot(M_CART) #produces plot of cross-validation results
M_CART$bestTune #returns optimal complexity parameter
## cp
## 7 0.0002459495
confusionMatrix(predict(M_CART, Testing), Testing$loan_outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 61994 13643
## 1 322 297
##
## Accuracy : 0.8169
## 95% CI : (0.8141, 0.8196)
## No Information Rate : 0.8172
## P-Value [Acc > NIR] : 0.5948
##
## Kappa : 0.0257
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99483
## Specificity : 0.02131
## Pos Pred Value : 0.81963
## Neg Pred Value : 0.47981
## Prevalence : 0.81719
## Detection Rate : 0.81297
## Detection Prevalence : 0.99188
## Balanced Accuracy : 0.50807
##
## 'Positive' Class : 0
##
#Not as accurate as the glm, so we keep the glm model
##RANDOM FOREST## # The code below is not run because it took over 24 hours and did not improve accuracy
#caret package implementation with 10-fold cross validation
#train_control <- trainControl(method="cv", number=10, savePredictions = TRUE)
#RF1 <- train(loan_outcome ~ ., method="rf", trControl=train_control, preProcess=c("center", "scale"), tuneLength=2, data=Training)
#print(RF1)
#confusionMatrix(predict(RF1, Testing), Testing$loan_outcome) #run model predictions on test data
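##If we wanted to sanity-check a random forest without the 24-hour run time, one option is to fit on a small random subsample first. An illustrative sketch, not part of the original analysis (the 10,000-row and 100-tree values are arbitrary):
set.seed(123)
rf_sample <- Training[sample(nrow(Training), 10000), ] # small random subsample of the training set
RF_small <- randomForest(loan_outcome ~ ., data = rf_sample, ntree = 100)
RF_small # prints the out-of-bag error rate and confusion matrix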
#Removing the home ownership column and the employment length rows whose subsets were not significant
# Select only the columns mentioned above.
loan4 = loan3 %>%
select(loan_amnt , int_rate , grade , emp_length ,annual_inc , term , loan_outcome)
loan4
## # A tibble: 254,190 x 7
## loan_amnt int_rate grade emp_length annual_inc term loan_outcome
## <dbl> <dbl> <fct> <fct> <dbl> <fct> <fct>
## 1 5000 10.6 B 10+ years 24000 36 months 0
## 2 2500 15.3 C < 1 year 30000 60 months 1
## 3 2400 16.0 C 10+ years 12252 36 months 0
## 4 10000 13.5 C 10+ years 49200 36 months 0
## 5 5000 7.9 A 3 years 36000 36 months 0
## 6 3000 18.6 E 9 years 48000 36 months 0
## 7 5600 21.3 F 4 years 40000 60 months 1
## 8 5375 12.7 B < 1 year 15000 60 months 1
## 9 6500 14.6 C 5 years 72000 60 months 0
## 10 12000 12.7 B 10+ years 75000 36 months 0
## # … with 254,180 more rows
####Removed the employment length data (1-9 years) since that data was not statistically significant in determining whether or not one would default on the loan.
loan4<-subset(loan4, emp_length!='1 year' & emp_length!='2 years' & emp_length!='3 years' & emp_length!='4 years' & emp_length!='5 years' & emp_length!='6 years' & emp_length!='7 years' &emp_length!='8 years' & emp_length!='9 years' )
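##One caveat with subset(): the emp_length factor still carries the now-empty 1-9 year levels, which can clutter downstream model matrices. An optional cleanup step, not run in the original analysis:
loan4 <- droplevels(loan4) # drop unused factor levels left over from the subset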
set.seed(123) #locks seed for random partitioning
#creates a vector of rows to randomly sample from the raw data
inTrain <- createDataPartition(y=loan4$loan_outcome, p=.70, list = FALSE)
#stores these rows in the training set
Training<-loan4[inTrain,]
#stores all rows not in the training set in the test/validation set
Testing<-loan4[-inTrain,]
##LOGISTIC REGRESSION## (the outcome needs to be a factor)
M_LOG2<-glm(loan_outcome ~ ., data = Training, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##The summary below shows the effect the variables have on determining whether someone will or will not default on their loan, using the cleaned-up data (home ownership removed and employment length reduced to its significant levels).
##The larger the loan amount and the interest rate, the more likely one is to default on the loan.
##The lower the grade on the loan, the more likely the loan will default. Only grades B, C, and F are statistically significant.
##Someone employed for 10+ years is less likely to default, and someone whose employment is n/a is more likely to default.
##The higher someone's annual income is, the less likely they are to default.
##A loan term of 60 months makes a loan default more likely.
summary(M_LOG2)
##
## Call:
## glm(formula = loan_outcome ~ ., family = "binomial", data = Training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4624 -0.6767 -0.5226 -0.3443 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.218e+00 7.121e-02 -45.184 < 2e-16 ***
## loan_amnt 1.451e-05 1.459e-06 9.947 < 2e-16 ***
## int_rate 1.304e-01 7.338e-03 17.768 < 2e-16 ***
## gradeB 1.605e-01 5.095e-02 3.150 0.00163 **
## gradeC 1.469e-01 6.625e-02 2.217 0.02660 *
## gradeD 1.234e-01 8.506e-02 1.451 0.14673
## gradeE 7.898e-03 1.053e-01 0.075 0.94019
## gradeF -2.693e-01 1.308e-01 -2.059 0.03948 *
## gradeG -2.747e-01 1.588e-01 -1.730 0.08370 .
## emp_length10+ years -1.551e-01 2.511e-02 -6.176 6.56e-10 ***
## emp_lengthn/a 3.230e-01 3.587e-02 9.007 < 2e-16 ***
## annual_inc -7.146e-06 3.359e-07 -21.276 < 2e-16 ***
## term60 months 3.234e-01 2.560e-02 12.630 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 72605 on 75740 degrees of freedom
## Residual deviance: 67153 on 75728 degrees of freedom
## AIC: 67179
##
## Number of Fisher Scoring iterations: 6
##The table below shows the odds ratios (the exponentiated coefficients) for the reduced model.
exp(cbind(OR = M_LOG2$coefficients))
##                             OR
## (Intercept)         0.04005005
## loan_amnt           1.00001451
## int_rate            1.13926165
## gradeB              1.17410922
## gradeC              1.15824334
## gradeD              1.13137832
## gradeE              1.00792880
## gradeF              0.76388537
## gradeG              0.75979347
## emp_length10+ years 0.85635999
## emp_lengthn/a       1.38130179
## annual_inc          0.99999285
## term60 months       1.38177930
confusionMatrix(table(predict(M_LOG2, Testing, type="response") >= 0.5,
Testing$loan_outcome == 1))
## Confusion Matrix and Statistics
##
##
## FALSE TRUE
## FALSE 26371 5954
## TRUE 74 59
##
## Accuracy : 0.8143
## 95% CI : (0.81, 0.8185)
## No Information Rate : 0.8147
## P-Value [Acc > NIR] : 0.5882
##
## Kappa : 0.0113
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.997202
## Specificity : 0.009812
## Pos Pred Value : 0.815808
## Neg Pred Value : 0.443609
## Prevalence : 0.814745
## Detection Rate : 0.812465
## Detection Prevalence : 0.995902
## Balanced Accuracy : 0.503507
##
## 'Positive' Class : FALSE
##
#Found to be not as accurate - use the previous regression and keep the data for further testing
##Testing against Interest Rate
#rpart package implementation
set.seed(123) #locks seed for random partitioning
#creates a vector of rows to randomly sample from the raw data
inTrain <- createDataPartition(y=loan3$int_rate, p=.70, list = FALSE)
#stores these rows in the training set
Training<-loan3[inTrain,]
#stores all rows not in the training set in the test/validation set
Testing<-loan3[-inTrain,]
##LINEAR REGRESSION## (interest rate, the response, is numeric)
M_LOG3<-lm(int_rate ~ ., data = Training)
##We created a formula that makes interest rate the Y variable. This model looks at which variables help predict the interest rate on the loan.
##The summary below shows the results of the model fit above.
##The higher the loan amount, the higher the interest rate will be.
##The lower the grade of the loan, the higher the interest rate will be.
##The longer one has been employed, the lower the interest rate will be.
##Home ownership type does not have any statistical significance in determining the interest rate.
##The higher one's annual income, the lower the interest rate will be.
##A longer term (60 months) results in a higher interest rate.
##Loans that defaulted typically had higher interest rates
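##To see what the fitted model implies, we can again score one hypothetical application (made-up values; loan_outcome must be supplied because it is a predictor in this model). Not evaluated here:
new_app2 <- data.frame(loan_amnt = 15000 , grade = 'C' , emp_length = '10+ years' ,
                       home_ownership = 'RENT' , annual_inc = 55000 ,
                       term = '60 months' , loan_outcome = '0')
predict(M_LOG3, newdata = new_app2, interval = 'prediction') # predicted interest rate with a 95% prediction interval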
summary(M_LOG3)
##
## Call:
## lm(formula = int_rate ~ ., data = Training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4829 -0.8972 0.0793 0.9284 7.4380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.490e+00 1.315e+00 5.696 1.23e-08 ***
## loan_amnt 1.282e-05 4.514e-07 28.403 < 2e-16 ***
## gradeB 4.008e+00 9.580e-03 418.313 < 2e-16 ***
## gradeC 7.042e+00 1.007e-02 699.046 < 2e-16 ***
## gradeD 9.960e+00 1.139e-02 874.231 < 2e-16 ***
## gradeE 1.265e+01 1.478e-02 856.133 < 2e-16 ***
## gradeF 1.570e+01 2.076e-02 756.237 < 2e-16 ***
## gradeG 1.731e+01 3.781e-02 457.920 < 2e-16 ***
## emp_length1 year 3.928e-02 1.628e-02 2.414 0.015801 *
## emp_length10+ years 1.309e-01 1.244e-02 10.524 < 2e-16 ***
## emp_length2 years 5.731e-02 1.491e-02 3.842 0.000122 ***
## emp_length3 years 4.028e-02 1.545e-02 2.608 0.009118 **
## emp_length4 years 2.893e-02 1.643e-02 1.761 0.078200 .
## emp_length5 years 1.375e-01 1.597e-02 8.606 < 2e-16 ***
## emp_length6 years 1.699e-01 1.695e-02 10.023 < 2e-16 ***
## emp_length7 years 1.874e-01 1.718e-02 10.905 < 2e-16 ***
## emp_length8 years 1.303e-01 1.812e-02 7.192 6.41e-13 ***
## emp_length9 years 1.107e-01 1.941e-02 5.703 1.18e-08 ***
## emp_lengthn/a 1.274e-01 1.929e-02 6.600 4.12e-11 ***
## home_ownershipMORTGAGE -2.124e-01 1.315e+00 -0.162 0.871675
## home_ownershipNONE 4.318e-01 1.334e+00 0.324 0.746215
## home_ownershipOTHER -3.326e-01 1.322e+00 -0.252 0.801385
## home_ownershipOWN -1.883e-01 1.315e+00 -0.143 0.886140
## home_ownershipRENT -1.376e-01 1.315e+00 -0.105 0.916673
## annual_inc -8.310e-07 5.956e-08 -13.952 < 2e-16 ***
## term60 months 3.427e-02 9.143e-03 3.748 0.000178 ***
## loan_outcome1 2.570e-01 8.325e-03 30.872 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.315 on 177907 degrees of freedom
## Multiple R-squared: 0.9107, Adjusted R-squared: 0.9107
## F-statistic: 6.978e+04 on 26 and 177907 DF, p-value: < 2.2e-16
##The table below shows the exponentiated coefficients and their 95% confidence intervals.
exp(cbind(M_LOG3$coefficients, confint.default(M_LOG3)))
## 2.5 % 97.5 %
## (Intercept) 1.790862e+03 1.360638e+02 2.357120e+04
## loan_amnt 1.000013e+00 1.000012e+00 1.000014e+00
## gradeB 5.501614e+01 5.399272e+01 5.605895e+01
## gradeC 1.143491e+03 1.121136e+03 1.166292e+03
## gradeD 2.116657e+04 2.069916e+04 2.164454e+04
## gradeE 3.121587e+05 3.032474e+05 3.213319e+05
## gradeF 6.577595e+06 6.315337e+06 6.850744e+06
## gradeG 3.306317e+07 3.070157e+07 3.560643e+07
## emp_length1 year 1.040062e+00 1.007409e+00 1.073773e+00
## emp_length10+ years 1.139854e+00 1.112402e+00 1.167982e+00
## emp_length2 years 1.058980e+00 1.028473e+00 1.090391e+00
## emp_length3 years 1.041107e+00 1.010056e+00 1.073112e+00
## emp_length4 years 1.029357e+00 9.967406e-01 1.063041e+00
## emp_length5 years 1.147358e+00 1.111996e+00 1.183845e+00
## emp_length6 years 1.185177e+00 1.146449e+00 1.225214e+00
## emp_length7 years 1.206089e+00 1.166146e+00 1.247399e+00
## emp_length8 years 1.139213e+00 1.099459e+00 1.180405e+00
## emp_length9 years 1.117031e+00 1.075342e+00 1.160337e+00
## emp_lengthn/a 1.135814e+00 1.093663e+00 1.179590e+00
## home_ownershipMORTGAGE 8.086393e-01 6.144543e-02 1.064192e+01
## home_ownershipNONE 1.539974e+00 1.126978e-01 2.104318e+01
## home_ownershipOTHER 7.170663e-01 5.372401e-02 9.570844e+00
## home_ownershipOWN 8.283755e-01 6.294102e-02 1.090237e+01
## home_ownershipRENT 8.714695e-01 6.621998e-02 1.146873e+01
## annual_inc 9.999992e-01 9.999991e-01 9.999993e-01
## term60 months 1.034864e+00 1.016484e+00 1.053577e+00
## loan_outcome1 1.293052e+00 1.272125e+00 1.314322e+00
#Calculate in- and out-of-sample error
RMSE_IN<-sqrt(sum(M_LOG3$residuals^2)/length(M_LOG3$residuals))
RMSE_OUT<-sqrt(sum((predict(M_LOG3, Testing)-Testing$int_rate)^2)/nrow(Testing)) #divide by the number of test-set rows
##In-sample error
RMSE_IN
## [1] 1.314754
##Out-of-sample error. As expected, the model performs slightly better in sample
RMSE_OUT
## [1] 1.315151
##CART## #rpart package implementation
set.seed(123) #locks seed for random partitioning
#creates a vector of rows to randomly sample from the raw data
inTrain <- createDataPartition(y=loan3$int_rate, p=.70, list = FALSE)
#stores these rows in the training set
Training<-loan3[inTrain,]
#stores all rows not in the training set in the test/validation set
Testing<-loan3[-inTrain,]
train_control <- trainControl(method="cv", number=10, savePredictions = TRUE)
#sets the cross validation parameters to be fed to train()
M_CART2 <- train(int_rate ~., data = Training, trControl=train_control, tuneLength=10, method = "rpart")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
#increasing tuneLength increases the number of candidate complexity parameter values evaluated
##the "cv", number = 10 refers to 10-fold cross validation on the training data
#Warning note: "There were missing values in resampled performance measures."
##This finds the optimal complexity parameter for our model.
print(M_CART2)
## CART
##
## 177934 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 160140, 160141, 160141, 160141, 160141, 160140, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.0003506705 1.320604 0.9098941 1.058498
## 0.0101330433 1.373009 0.9025210 1.065246
## 0.0142811690 1.562458 0.8728190 1.129173
## 0.0219238664 1.730192 0.8452135 1.219438
## 0.0532072735 1.926081 0.8075706 1.354130
## 0.0573583562 2.237852 0.7405999 1.533855
## 0.0702092465 2.633747 0.6380623 1.878925
## 0.1173965943 3.271816 0.4404943 2.483735
## 0.1513530351 3.748131 0.2720469 2.967636
## 0.1977939980 4.177829 0.1919575 3.360700
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.0003506705.
plot(M_CART2) #produces plot of cross-validation results
M_CART2$bestTune #returns optimal complexity parameter
## cp
## 1 0.0003506705
##Again calculate the errors for the CART model. caret's train object has no $residuals element, so we use the cross-validated RMSE in place of the in-sample error
RMSE_CV<-min(M_CART2$results$RMSE) #cross-validated RMSE of the selected model
RMSE_OUT<-sqrt(sum((predict(M_CART2, Testing)-Testing$int_rate)^2)/nrow(Testing)) #divide by the number of test-set rows
RMSE_CV
## [1] 1.320604
RMSE_OUT
## [1] 1.321596
##As we can see, both errors are slightly worse in the second model
The original dataset from Lending Club had 74 different column variables. We determined that this was too many variables on which to base the regression model, so we evaluated them and narrowed the set down to 8: loan status, loan amount, interest rate, loan grade, employment length, home ownership, annual income, and loan term. Once we had cleaned the data, we went about setting our binomial y variable (loan outcome), which we based on the loan status (see the R markdown above for the specific functions and actions taken to clean the data). Originally, when developing our y variable, we had the reject side of the binomial include the loan status type Charged Off as well as the various late statuses. We then ran our original regression and confusion matrix to determine significance and accuracy; in this original run-through, our model ended with an accuracy of 78%. In the process we also determined that home ownership was not statistically significant, and additional run-throughs removing it had little impact on the model. We then went back and restructured our y variable: in the second model we set the binomial to Accept if Fully Paid and Reject if Charged Off, removing the late payments from the binomial entirely. After rerunning the regression with the restructured y variable, we saw an increase in accuracy to 81% and a decrease in the complexity of our model. Each team member then ran various iterations of the model, removing different subsets of the x variables to see if we could increase the accuracy. Ultimately, these different regressions had minimal impact on the accuracy and complexity of the model.
#Examining this dataset for loan outcome and interest rate variability was a good exercise in learning the basics of programming and running analysis in R. We were able to find resources on the internet to help with solving code issues and interpreting our data. We had lasting issues that we could not resolve, like how to evaluate our numerical scalar regression with a confusion-matrix-style measure; because of that, we do not know the accuracy our interest rate results were able to attain.
#The project was a learning experience and showed how powerful this tool is for cleaning, analyzing, and drawing conclusions from very large datasets that the tools in Excel would not be able to handle.