One industry in which machine learning is having a significant impact is financial lending. In this data analysis project, I will use data from LendingClub, a well-known peer-to-peer lending platform based in San Francisco, California, to build statistical models that use debtor attributes to predict the grades given to loan applications by the Lending Club.
Loading in the data needed for this analysis; clearing the first row and the empty rows in the bottom
library(readr)
getwd()
## [1] "C:/Users/fzy20/Downloads"
set.seed(1)
#data loading
loan_data = read_csv(skip = 1, "C:/Users/fzy20/Downloads/LoanStats3c.csv")
loan_data = loan_data[!is.na(loan_data$grade),]
#row structuring
loan_data = loan_data[0:-2,]
Instead of predicting the absolute loan amount I decided to make this a classification problem to predict the loan grades in a binary variable. Using the training data, I created a binary variable called highgrade that takes the value 1 when a loan has received an “A” or “B” grade and 0 otherwise
Furthermore I want to answer a couple of questions:
What is the proportion of loans in the training data that receive either an “A” or “B” grade?
determining whether the differences in the proportion of loans that are highgrade is significantly different depending upon:
Creating the highgrade variable:
#variable generation
loan_data$highgrade <- as.integer(loan_data$grade %in% c("A","B"))
Checking the proportion of high grade loans; there about 42% of invididual who received a high grade loan
# Load
library(wesanderson)
par(lwd = 4)
freq = as.data.frame(table(loan_data$highgrade))$Freq
lbls = c("lower grade", "high grade")
pct <- round(freq/sum(freq)*100)
pct_lbls <- paste(pct,"%",sep="") # ad % to labels
pie(freq, labels = pct_lbls, main = "Pie Chart highgrade", col = wes_palette(n=length(freq), name="GrandBudapest1"), border = "white")
legend("topright", lbls, cex = 0.8,
fill = wes_palette(n=length(freq), name="GrandBudapest1"))
The exact proportion is shown below:
table(loan_data$highgrade)[2][[1]]/nrow(loan_data)
## [1] 0.4160898
Next, we will use the t.test to examine three things: 1. level fo loans depending on the income level of the debtor - There is a significant difference between the proportion of individual receiving high grade loans for individuals above median income level and below income level; in fact the difference sum up to about 9 percent on average in favor of those with a higher income
t.test(loan_data[loan_data$annual_inc > median(loan_data$annual_inc),]$highgrade,
loan_data[loan_data$annual_inc <= median(loan_data$annual_inc),]$highgrade)
##
## Welch Two Sample t-test
##
## data: loan_data[loan_data$annual_inc > median(loan_data$annual_inc), and loan_data[loan_data$annual_inc <= median(loan_data$annual_inc), ]$highgrade and ]$highgrade
## t = 45.046, df = 231178, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.08739910 0.09535065
## sample estimates:
## mean of x mean of y
## 0.4642396 0.3728647
t.test(loan_data[loan_data$loan_amnt > median(loan_data$loan_amnt),]$highgrade,
loan_data[loan_data$loan_amnt <= median(loan_data$loan_amnt),]$highgrade)
##
## Welch Two Sample t-test
##
## data: loan_data[loan_data$loan_amnt > median(loan_data$loan_amnt), and loan_data[loan_data$loan_amnt <= median(loan_data$loan_amnt), ]$highgrade and ]$highgrade
## t = -34.087, df = 235539, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07302278 -0.06508180
## sample estimates:
## mean of x mean of y
## 0.3814803 0.4505326
t.test(loan_data[loan_data$home_ownership %in% c("RENT"),]$highgrade,
loan_data[!(loan_data$home_ownership %in% c("RENT")),]$highgrade)
##
## Welch Two Sample t-test
##
## data: loan_data[loan_data$home_ownership %in% c("RENT"), ]$highgrade and loan_data[!(loan_data$home_ownership %in% c("RENT")), ]$highgrade
## t = -14.684, df = 199441, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03450148 -0.02637568
## sample estimates:
## mean of x mean of y
## 0.3976242 0.4280627
These all point to the fact that loans are more in favor of those of a better social class
Proceeding to build a logistic regression model to predict the kind of loans a person will get; most predictors seem quite significant except for the home ownership predictor.
#eliminating unwanted data
loan_data <-loan_data[!(loan_data$home_ownership == ""),]
#model building
log_model <- glm("highgrade ~ annual_inc + home_ownership + loan_amnt + verification_status + purpose", data = loan_data,family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#printing data summary
summary(log_model)
##
## Call:
## glm(formula = "highgrade ~ annual_inc + home_ownership + loan_amnt + verification_status + purpose",
## family = binomial, data = loan_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.9499 -0.7030 1.1244 2.6029
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.188e+00 2.666e+01 0.307 0.7588
## annual_inc 8.547e-06 1.216e-07 70.262 < 2e-16
## home_ownershipMORTGAGE -8.055e+00 2.666e+01 -0.302 0.7626
## home_ownershipOWN -8.071e+00 2.666e+01 -0.303 0.7621
## home_ownershipRENT -8.180e+00 2.666e+01 -0.307 0.7590
## loan_amnt -3.895e-05 6.762e-07 -57.601 < 2e-16
## verification_statusSource Verified -6.533e-01 1.090e-02 -59.925 < 2e-16
## verification_statusVerified -9.497e-01 1.245e-02 -76.260 < 2e-16
## purposecredit_card 8.271e-01 4.978e-02 16.616 < 2e-16
## purposedebt_consolidation -8.010e-02 4.925e-02 -1.626 0.1039
## purposehome_improvement -3.269e-01 5.256e-02 -6.218 5.02e-10
## purposehouse -2.032e+00 1.385e-01 -14.673 < 2e-16
## purposemajor_purchase -1.265e-01 5.963e-02 -2.121 0.0339
## purposemedical -1.177e+00 7.063e-02 -16.659 < 2e-16
## purposemoving -2.159e+00 1.037e-01 -20.814 < 2e-16
## purposeother -1.173e+00 5.481e-02 -21.394 < 2e-16
## purposerenewable_energy -2.306e+00 3.299e-01 -6.990 2.74e-12
## purposesmall_business -1.844e+00 8.677e-02 -21.252 < 2e-16
## purposevacation -1.294e+00 8.797e-02 -14.712 < 2e-16
## purposewedding -4.688e-01 7.629e-01 -0.614 0.5389
##
## (Intercept)
## annual_inc ***
## home_ownershipMORTGAGE
## home_ownershipOWN
## home_ownershipRENT
## loan_amnt ***
## verification_statusSource Verified ***
## verification_statusVerified ***
## purposecredit_card ***
## purposedebt_consolidation
## purposehome_improvement ***
## purposehouse ***
## purposemajor_purchase *
## purposemedical ***
## purposemoving ***
## purposeother ***
## purposerenewable_energy ***
## purposesmall_business ***
## purposevacation ***
## purposewedding
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 319981 on 235626 degrees of freedom
## Residual deviance: 290584 on 235607 degrees of freedom
## AIC: 290624
##
## Number of Fisher Scoring iterations: 6
#prediciting result
loan_data$pred_result <- predict(log_model,loan_data,type= "response")
Choosing the best threshold for the top rating; as seen from the histogram most of these model have an accruracy of about 0.58 percent. hmmm..
func_1 <- function(num, loan_data){
pred <- as.integer(loan_data$pred_result >= num)
return(sum(pred==loan_data$highgrade)/nrow(loan_data))
}
num <- 1:100
num <- c(num/100)
result = sapply(num,function(i) func_1(i,loan_data))
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
hist(result, breaks = 20,col = wes_palette(n=1, name="GrandBudapest1"),labels = TRUE)
Now i will pick the top threshold(0.49) and make the final prediction
threshold = num[result == max(result)]
loan_data$pred_final <- as.integer(loan_data$pred_result >= threshold)
Let us check the predictive rating of our model; 0.66, not bad! For a frame of reference, the model that assigns 0 to all rows has an accuracy of: 0.58 For another frame of reference, the model that assigns random 0 and 1 to all rows has an accuracy of: 0.50
result = data.frame("form"= c("logistic regression","all zeroes","random numbers"), "predictive_value"= c(sum(loan_data$pred_final==loan_data$highgrade)/nrow(loan_data), sum(rep(0,235627)==loan_data$highgrade)/nrow(loan_data),sum(sample(c(1,0),235627,replace =TRUE)==loan_data$highgrade)/nrow(loan_data)))
ggplot(result,aes(x = form, y = predictive_value)) + geom_col(fill = wes_palette(n=3, name="GrandBudapest1")) + geom_text(aes(label = predictive_value),hjust = 1.02,size = 5, color = "white") + theme_minimal() + coord_flip()
To make another comparison, I will procedd to building a decision tree classification model to predict the kind of loans a person will get
library(rpart)
library(rattle) # Fancy tree plot
library(rpart.plot) # Enhanced tree plots
#model construction
fit = rpart("highgrade ~ annual_inc + home_ownership + loan_amnt + verification_status + purpose", data = loan_data,method = "class")
Find below the graphical representation of the model; the model returned seems quite simple; in fact only three predictors are used: purpose; verfiication status and annual income
fancyRpartPlot(fit,cex = 0.7,palettes = "OrRd")
Now we shall predict our dataset using our classifcation model and check for its accuracy; the accuracy returned 0.65, slightly lower than our logistic model
loan_data$pred_final_rf <- predict(fit,loan_data,type="class")
result = data.frame("form"= c("decision tree","all zeroes","random numbers"), "predictive_value"= c(sum(loan_data$pred_final_rf==loan_data$highgrade)/nrow(loan_data), sum(rep(0,235627)==loan_data$highgrade)/nrow(loan_data),sum(sample(c(1,0),235627,replace =TRUE)==loan_data$highgrade)/nrow(loan_data)))
ggplot(result,aes(x = form, y = predictive_value)) + geom_col(fill = wes_palette(n=3, name="GrandBudapest1")) + geom_text(aes(label = predictive_value),hjust = 1.02,size = 5, color = "white") + theme_minimal() + coord_flip()
Load the test data and clear the bottom two rows
loan_test <- read.csv("C:/Users/fzy20/Downloads/LoanStats3d.csv", skip = 1)
loan_test = loan_test[0:-2,]
Predict the value of the loan test data using the model we constructed earlier
#creating variable
loan_test$highgrade <- as.integer(loan_test$grade %in% c("A","B"))
#remove educational level
loan_test <- loan_test[!(loan_test$purpose %in% c("educational")),]
#remove unecessary variable
loan_test <-loan_test[!(loan_test$purpose == ""),]
#variable correction
loan_test$purpose <- as.factor(as.character(loan_test$purpose))
#predictions
loan_test$pred_result <- predict(log_model,loan_test,type= "response")
loan_test$pred_final <- as.integer(loan_test$pred_result >= threshold)
loan_test$pred_final_rf <- predict(fit,loan_test,type="class")
evaluating results of the models with benchmarks; it is quite interesting that both the decision tree and the logistic regression actually have similar performance, with logistic regression slightly higher
result = data.frame("form"= c("logistic regression","decision tree","all zeroes","random numbers"), "predictive_value"= c(sum(loan_test$pred_final==loan_test$highgrade)/nrow(loan_test),sum(loan_test$pred_final_rf==loan_test$highgrade)/nrow(loan_test), sum(rep(0,421092)==loan_test$highgrade)/nrow(loan_test),sum(sample(c(1,0),421092,replace = TRUE)==loan_test$highgrade)/nrow(loan_test)))
ggplot(result,aes(x = form, y = predictive_value)) + geom_col(fill = wes_palette(n=4, name="GrandBudapest1")) + geom_text(aes(label = predictive_value),hjust = 1.02,size = 5, color = "white") + theme_minimal() + coord_flip()
Instead of stopping at the pure accuracy I also decided to test the recall and precision of our model. Recall(True Positive/True Positive + False Negative) and Power/Precision(True Positive/True Positive + False Positive) are both important metrics to really determine how good a model is. Specifically recall in this case refers to how how many individual who actually received high grade loans were classified as so, while precision refers to how many of those that we classified as high grade loan receivers actually received high grade loans.
Evaluating power of logistic regression model
sum(loan_test[loan_test$pred_final == 1,]$highgrade == 1)/sum(loan_test$pred_final == 1)
## [1] 0.6552654
Evaluating recall of logistic regression model
sum(loan_test[loan_test$pred_final == 1,]$highgrade == 1)/sum(loan_test$highgrade == 1)
## [1] 0.4798917
Evaluating power of the classification tree
sum(loan_test[loan_test$pred_final_rf == 1,]$highgrade == 1)/sum(loan_test$pred_final_rf == 1)
## [1] 0.6113031
Evaluating recall of the classification tree
sum(loan_test[loan_test$pred_final_rf == 1,]$highgrade == 1)/sum(loan_test$highgrade == 1)
## [1] 0.4993008