ABSTRACT :
In this project we have addressed the problem of classifying highly unbalanced data using supervised machine learning algorithms. Unbalanced data is ubiquitous: it arises in a wide range of fields including, but not limited to, business, bioinformatics, engineering and banking. We have focused here on credit risk, which is the risk of default on a debt that may arise from a borrower failing to make required payments. In the first instance the risk is borne by the lender and includes lost principal and interest, disruption to cash flows, and increased collection costs. In a typical credit risk problem, loan default is a rare event, and that is why the data are so unbalanced.
Here, we have tried to solve the problem of unbalanced data with the help of Logistic Regression and Classification and Regression Trees (CART). We have used techniques such as prior probabilities, a loss matrix and case weighting to deal with the unbalanced data.
ABOUT THE DATA :
The data set is from a bank where data about customers was collected on a yearly basis for analysis. In our data set, we have 8 variables.
loan_status : A categorical variable which is 0 if the customer is a non-defaulter and 1 if the customer is a defaulter.
loan_amnt : A numerical variable giving the amount of the last loan taken by the customer.
int_rate : A numerical variable giving the interest rate of the last loan taken by the customer.
grade : A categorical variable indicating how important the customer is to the bank, with A being the most important and G the least important.
emp_length : A numerical variable giving the number of years the customer has worked at his/her current workplace.
home_ownership : A categorical variable describing the type of home ownership of the customer.
annual_inc : A numerical variable giving the annual income of the customer.
age : A numerical variable giving the age of the customer.
EXPLORATORY DATA ANALYSIS :
Let us now load the necessary libraries for performing data visualization and data pre-processing.
# Data Preprocessing and Data Visualization
library(readxl, warn.conflicts = FALSE) # loading the excel package
library(gmodels,warn.conflicts = FALSE)
library(ggplot2,warn.conflicts = FALSE)
loan_data <- read_excel("loan_data.xlsx") # loan_data is our data set
str(loan_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 12499 obs. of 8 variables:
## $ loan_status : num 0 0 0 0 0 0 1 0 1 0 ...
## $ loan_amnt : num 5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
## $ int_rate : chr "10.65" " NA" "13.49" " NA" ...
## $ grade : chr " B " " C " " C " " A " ...
## $ emp_length : num 10 25 13 3 9 11 0 3 3 0 ...
## $ home_ownership : chr " RENT " " RENT " " RENT " " RENT " ...
## $ annual_inc : num 24000 12252 49200 36000 48000 ...
## $ age : num 33 31 24 39 24 28 22 22 28 22 ...
histogram <- hist(loan_data$loan_amnt, breaks = 200, xlab = "Loan Amount",
main = "Histogram of the Loan Amount",col = "red")
As expected, the histogram is right skewed. Now let's look at our response variable. We are trying to predict loan_status, and about 90% of the observations are 0 while only about 10% are 1. Loan default is indeed a rare phenomenon.
loan_data$` loan_status ` <- factor(loan_data$` loan_status `)
CrossTable(loan_data$` loan_status `)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 12499
##
##
## | 0 | 1 |
## |-----------|-----------|
## | 11148 | 1351 |
## | 0.892 | 0.108 |
## |-----------|-----------|
##
##
##
##
CrossTable(loan_data$` home_ownership `)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 12499
##
##
## | OWN | RENT | MORTGAGE |
## |------------------|------------------|------------------|
## | 951 | 6190 | 5358 |
## | 0.076 | 0.495 | 0.429 |
## |------------------|------------------|------------------|
##
##
##
##
Cross <- CrossTable(loan_data$` home_ownership `,loan_data$` loan_status `,
prop.r = TRUE,prop.c = FALSE,prop.t = FALSE,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 12499
##
##
## | loan_data$` loan_status `
## loan_data$` home_ownership ` | 0 | 1 | Row Total |
## -----------------------------|-----------|-----------|-----------|
## OWN | 847 | 104 | 951 |
## | 0.891 | 0.109 | 0.076 |
## -----------------------------|-----------|-----------|-----------|
## RENT | 5419 | 771 | 6190 |
## | 0.875 | 0.125 | 0.495 |
## -----------------------------|-----------|-----------|-----------|
## MORTGAGE | 4882 | 476 | 5358 |
## | 0.911 | 0.089 | 0.429 |
## -----------------------------|-----------|-----------|-----------|
## Column Total | 11148 | 1351 | 12499 |
## -----------------------------|-----------|-----------|-----------|
##
##
Cross1 <- CrossTable(loan_data$grade,loan_data$` loan_status `,
prop.r = TRUE,prop.c = FALSE,prop.t = FALSE,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 12499
##
##
## | loan_data$` loan_status `
## loan_data$grade | 0 | 1 | Row Total |
## ----------------|-----------|-----------|-----------|
## A | 4603 | 315 | 4918 |
## | 0.936 | 0.064 | 0.393 |
## ----------------|-----------|-----------|-----------|
## B | 3757 | 458 | 4215 |
## | 0.891 | 0.109 | 0.337 |
## ----------------|-----------|-----------|-----------|
## C | 1656 | 313 | 1969 |
## | 0.841 | 0.159 | 0.158 |
## ----------------|-----------|-----------|-----------|
## D | 923 | 208 | 1131 |
## | 0.816 | 0.184 | 0.090 |
## ----------------|-----------|-----------|-----------|
## E | 175 | 45 | 220 |
## | 0.795 | 0.205 | 0.018 |
## ----------------|-----------|-----------|-----------|
## F | 29 | 9 | 38 |
## | 0.763 | 0.237 | 0.003 |
## ----------------|-----------|-----------|-----------|
## G | 5 | 3 | 8 |
## | 0.625 | 0.375 | 0.001 |
## ----------------|-----------|-----------|-----------|
## Column Total | 11148 | 1351 | 12499 |
## ----------------|-----------|-----------|-----------|
##
##
plot(Cross$prop.tbl,col = c("bisque2","azure3"),main = "Proportions of Default/Non-Default wrt Ownership"
,xlab = "Home ownership of the customer",ylab = "1 - Default 0 - Non-default")
There is not much to discuss here, as no definitive pattern emerges. Let's see if we can find any association between loan defaults and grades.
plot(Cross1$prop.tbl,col = c("cornflowerblue","chartreuse3"),main = "Proportions of Default/Non-Default wrt Grade"
,xlab = "Grade of the customer",ylab = "1 - Default 0 - Non-default")
There definitely is a pattern here: the proportion of loan defaults increases as we move away from grade A. That is logical, as the riskier customers (usually grades E, F and G) are more likely to default on their loans.
Let's see if we have any missing data. In fact, we do: there are a lot of missing values in the employment length and interest rate variables. One way to handle this is to put the values into bins and convert these variables into categorical ones, with the missing values as their own category.
summary(loan_data$`emp_length `) # finding the number of missing data in emp_length
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.00 5.00 6.69 9.00 61.00 496
summary(loan_data$int_rate) # int_rate was read in as character, so summary() only reports its length and class
## Length Class Mode
## 12499 character character
loan_data$int_rate <- as.numeric(loan_data$int_rate) # int_rate was read in as character (including " NA" strings); convert to numeric so the comparisons and is.na() below behave correctly
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate)) # making an empty column for the categories
loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"
loan_data$ir_cat <- as.factor(loan_data$ir_cat) # converting into categories
plot(loan_data$ir_cat,xlab = "Interest rate categories",ylab = "Count") # visualizing the category counts
loan_data$emp_cat <- rep(NA, length(loan_data$emp_length)) # making a null column
loan_data$emp_cat[which(loan_data$emp_length <= 15)] <- "0-15"
loan_data$emp_cat[which(loan_data$emp_length > 15 & loan_data$emp_length <= 30)] <- "15-30"
loan_data$emp_cat[which(loan_data$emp_length > 30 & loan_data$emp_length <= 45)] <- "30-45"
loan_data$emp_cat[which(loan_data$emp_length > 45)] <- "45+"
loan_data$emp_cat[which(is.na(loan_data$emp_length))] <- "Missing"
loan_data$emp_cat <- as.factor(loan_data$emp_cat)
plot(loan_data$emp_cat,xlab = "Employee length categories",ylab = "Count")
We see how our interest rate and employment length variables have now been grouped into categories instead of numerical values.
loan_data$int_rate <- NULL
loan_data$`emp_length ` <- NULL
new <- ggplot(loan_data, aes(x = factor(` loan_status `), fill = grade))
val = c("lightcyan2", "lightcyan3","lightblue","lightcyan4","cadetblue3","aquamarine","darkblue")
lab = c("A","B","C","D","E","F","G")
new + geom_bar(position = "dodge",alpha = 0.75) +
scale_x_discrete("Loan Status") +
scale_y_continuous("Count") +
scale_fill_manual("Grades",values = val,labels = lab)
ggplot(loan_data, aes(x = factor(` loan_status `), fill = grade)) + geom_bar(position = "fill")
ggplot(loan_data,aes(x = annual_inc,y = ir_cat)) + geom_point() +
facet_grid(grade ~ .)
The counts of grades with respect to loan status give us little insight, but the plot of annual income against interest rate category shows that quite a few people with interest rates above 13.5% have good annual incomes.
To plot an income distribution across grades, we will need to scale down our annual income variable temporarily.
# scale down loan_data$annual_inc first
color <- scale_fill_brewer("Loan Grades",palette = "Reds")
loan_new <- loan_data
loan_new$annual_inc <- loan_new$annual_inc/10000
ggplot(loan_new,aes(x = annual_inc,fill = factor(grade))) +
geom_histogram(binwidth = 1) + color + facet_grid(grade ~. )
We also see that customers with a higher grade tend to have higher annual incomes.
Logistic Regression And Model Performance :
In statistics, logistic regression (or logit regression, or the logit model) is a regression model in which the dependent variable is categorical. Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution.
Thus it treats the same set of problems as probit regression using similar techniques, with the latter using a cumulative normal distribution curve instead. Equivalently, in the latent-variable interpretations of these two methods, the first assumes a standard logistic distribution of errors and the second a standard normal distribution of errors.
Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. The key differences between these two models can be seen in the following two features of logistic regression.
First, the conditional distribution is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0, 1) through the logistic distribution function, because logistic regression predicts the probability of particular outcomes.
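To make the link function concrete, here is a minimal sketch (our illustration, not part of the original analysis) of the logistic function that glm(family = "binomial") uses to turn the linear predictor into a probability in (0, 1); the helper name logistic is ours.
# Minimal sketch of the logistic (sigmoid) link used by glm(family = "binomial").
# The linear predictor eta = b0 + b1*x1 + ... is mapped to a probability in (0, 1).
logistic <- function(eta) 1 / (1 + exp(-eta))
eta <- seq(-6, 6, by = 0.1)
plot(eta, logistic(eta), type = "l",
     xlab = "Linear predictor", ylab = "P(loan_status = 1)",
     main = "Logistic link function")
# R's built-in plogis() is the same function; predict(model, type = "response")
# simply applies it to the fitted linear predictor.
all.equal(logistic(2), plogis(2))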
# Logistic Regression
set.seed(123)
library(ROCR, warn.conflicts = FALSE)
library(pROC, warn.conflicts = FALSE)
index_train <- sample(1:nrow(loan_data), 2 / 3* nrow(loan_data)) # making a random sample of training set.
training_set <- loan_data[c(index_train), ] # storing the random sample in a training set.
test_set <- loan_data[-index_train, ]
# Let's start playing around with logistic regression.
log_reg_small <- glm(` loan_status ` ~ ir_cat + grade + loan_amnt + annual_inc,
family = "binomial", data = training_set)
summary(log_reg_small) # examining which variables are important in our model; loan amount is not significant.
##
## Call:
## glm(formula = ` loan_status ` ~ ir_cat + grade + loan_amnt +
## annual_inc, family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1193 -0.5435 -0.4248 -0.3412 3.6661
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.106e+00 1.428e-01 -14.744 < 2e-16 ***
## ir_cat11-13.5 1.117e-01 1.092e-01 1.023 0.30639
## ir_cat13.5+ 1.611e-01 1.099e-01 1.466 0.14254
## grade B 6.857e-01 1.163e-01 5.897 3.70e-09 ***
## grade C 1.006e+00 1.133e-01 8.879 < 2e-16 ***
## grade D 1.261e+00 1.170e-01 10.773 < 2e-16 ***
## grade E 1.564e+00 2.114e-01 7.398 1.38e-13 ***
## grade F 1.439e+00 4.795e-01 3.001 0.00269 **
## grade G 2.272e+00 1.256e+00 1.808 0.07054 .
## loan_amnt -8.941e-06 6.488e-06 -1.378 0.16819
## annual_inc -1.024e-05 1.310e-06 -7.819 5.34e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5892.4 on 8331 degrees of freedom
## Residual deviance: 5612.2 on 8321 degrees of freedom
## AIC: 5634.2
##
## Number of Fisher Scoring iterations: 5
Looking at the summary of our model, we see that grade and annual income are highly significant, while loan amount and the interest rate categories are not.
Time to make some predictions with our model and test it on various parameters.
predictions_small <- predict(log_reg_small, newdata = test_set,
type = "response")
range(predictions_small) # getting the range of the probabilities calculated above.
## [1] 4.185654e-09 4.871028e-01
pred_cutoff_15_small <- ifelse(predictions_small > 0.15,1,0)
conf_15_small <- table(test_set$` loan_status `,pred_cutoff_15_small)
conf_15_small
## pred_cutoff_15_small
## 0 1
## 0 2931 830
## 1 253 153
accuracy_15_small <- sum(diag(conf_15_small))/sum(conf_15_small)
accuracy_15_small
## [1] 0.7401008
pred_cutoff_20_small <- ifelse(predictions_small > 0.20,1,0)
conf_20_small <- table(test_set$` loan_status `,pred_cutoff_20_small)
conf_20_small
## pred_cutoff_20_small
## 0 1
## 0 3468 293
## 1 335 71
accuracy_20_small <- sum(diag(conf_20_small))/sum(conf_20_small)
accuracy_20_small
## [1] 0.8492921
We see that with cutoff values of 0.15 and 0.20 we get accuracies of 74% and 85% respectively. Now let us plot the ROC curve and calculate the AUC (area under the curve) for our model.
pred_train_small <- predict(log_reg_small,type = "response")
predROC_small <- prediction(pred_train_small, training_set$ ` loan_status `)
perfROC_small <- performance(predROC_small,"tpr","fpr")
plot(perfROC_small, colorize = TRUE,
print.cutoffs.at = seq(0,1,0.1),text.adj = c(-0.2,1.7))
AUC_small <- as.numeric(performance(predROC_small , "auc")@y.values)
AUC_small
## [1] 0.665261
We get an AUC of 0.665. That's a fairly decent AUC considering that we used only 4 predictors in our logistic model. Can we do better by using all the variables? Let's find out.
log_reg_full <- glm(` loan_status ` ~., family = "binomial", data = training_set)
summary(log_reg_full)
##
## Call:
## glm(formula = ` loan_status ` ~ ., family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2283 -0.5401 -0.4227 -0.3360 3.6112
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.265e+00 2.491e-01 -9.090 < 2e-16
## loan_amnt -8.416e-06 6.488e-06 -1.297 0.19460
## grade B 6.851e-01 1.170e-01 5.856 4.73e-09
## grade C 1.003e+00 1.142e-01 8.784 < 2e-16
## grade D 1.266e+00 1.183e-01 10.705 < 2e-16
## grade E 1.582e+00 2.124e-01 7.449 9.44e-14
## grade F 1.439e+00 4.809e-01 2.993 0.00276
## grade G 2.340e+00 1.261e+00 1.855 0.06359
## ` home_ownership ` RENT 1.816e-01 1.411e-01 1.287 0.19821
## ` home_ownership ` MORTGAGE 9.191e-02 1.472e-01 0.624 0.53234
## annual_inc -9.550e-06 1.367e-06 -6.988 2.78e-12
## ` age` -2.206e-03 5.907e-03 -0.373 0.70886
## ir_cat11-13.5 1.088e-01 1.094e-01 0.994 0.31999
## ir_cat13.5+ 1.452e-01 1.101e-01 1.320 0.18700
## emp_cat15-30 1.589e-01 1.265e-01 1.257 0.20891
## emp_cat30-45 1.917e-01 3.610e-01 0.531 0.59537
## emp_cat45+ 1.700e+00 6.337e-01 2.683 0.00729
## emp_catMissing 5.956e-01 1.508e-01 3.949 7.85e-05
##
## (Intercept) ***
## loan_amnt
## grade B ***
## grade C ***
## grade D ***
## grade E ***
## grade F **
## grade G .
## ` home_ownership ` RENT
## ` home_ownership ` MORTGAGE
## annual_inc ***
## ` age`
## ir_cat11-13.5
## ir_cat13.5+
## emp_cat15-30
## emp_cat30-45
## emp_cat45+ **
## emp_catMissing ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5892.4 on 8331 degrees of freedom
## Residual deviance: 5589.9 on 8314 degrees of freedom
## AIC: 5625.9
##
## Number of Fisher Scoring iterations: 5
The summary shows that, apart from the employment-length categories, none of the additional predictors are significant. Let's go straight to making predictions with this model.
predictions_full <- predict(log_reg_full, newdata = test_set, type = "response")
range(predictions_full)
## [1] 1.255721e-08 5.282657e-01
pred_cutoff_15 <- ifelse(predictions_full > 0.15,1,0)
confmat_15_full <- table(test_set$` loan_status `,pred_cutoff_15)
confmat_15_full
## pred_cutoff_15
## 0 1
## 0 2964 797
## 1 248 158
accuracy_15 <- sum(diag(confmat_15_full))/sum(confmat_15_full)
accuracy_15
## [1] 0.7492201
pred_cutoff_20 <- ifelse(predictions_full > 0.20,1,0)
confmat_20_full <- table(test_set$` loan_status `,pred_cutoff_20)
confmat_20_full
## pred_cutoff_20
## 0 1
## 0 3467 294
## 1 337 69
accuracy_20 <- sum(diag(confmat_20_full))/sum(confmat_20_full)
accuracy_20
## [1] 0.8485721
Even with all the predictor variables we are not able to increase the accuracy of our model. What we have done, however, is increase the model's complexity and make it harder to interpret. We should stick to our first model with 4 predictor variables, as it is much easier to interpret and gives more or less the same results.
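As an optional, more formal comparison (our addition, not part of the original analysis), the two models are nested and were fit on the same training set, so a likelihood-ratio test and an AIC comparison can back up this choice:
# Likelihood-ratio test comparing the 4-predictor model with the full model.
anova(log_reg_small, log_reg_full, test = "Chisq")
# The models can also be compared on AIC; a lower AIC favours that model.
AIC(log_reg_small, log_reg_full)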
Let’s now see how the model performs with the ROC Curve.
pred_train_full <- predict(log_reg_full,type = "response")
predROC_full <- prediction(pred_train_full, training_set$ ` loan_status `)
perfROC_full <- performance(predROC_full,"tpr","fpr")
plot(perfROC_full, colorize = TRUE,
print.cutoffs.at = seq(0,1,0.1),text.adj = c(-0.2,1.7))
AUC_full <- as.numeric(performance(predROC_full , "auc")@y.values)
AUC_full
## [1] 0.6708368
The AUC for this model comes out to be 0.6708, which is not a significant gain over our previous model. Let's move on to CART (Classification and Regression Trees) to see if we can make better predictions on this highly unbalanced data with decision trees and the various techniques that come with them.
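One complementary check before the trees (our addition, not in the original analysis): the AUCs above were computed on the training set, but the pROC package loaded earlier can also compute them on the test set, where the models will actually be used. A sketch, with no output shown:
# Test-set AUC for both logistic models using pROC.
# predictions_small and predictions_full were computed on test_set above.
roc_small <- roc(test_set$` loan_status `, predictions_small)
roc_full <- roc(test_set$` loan_status `, predictions_full)
auc(roc_small)
auc(roc_full)
plot(roc_small, col = "blue")
lines(roc_full, col = "red")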
Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function.
. Sensitivity (also called the true positive rate, or recall in some fields) measures the proportion of positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
. Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
A small helper that reads both measures off the confusion matrices built above is sketched next.
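This helper is our own addition, not part of the original analysis; it assumes a confusion matrix laid out like the ones above, with actual classes in the rows and predicted classes in the columns.
# Sensitivity and specificity from a confusion matrix (rows = actual, columns = predicted).
class_metrics <- function(confmat) {
  sensitivity <- confmat["1", "1"] / sum(confmat["1", ]) # true positive rate
  specificity <- confmat["0", "0"] / sum(confmat["0", ]) # true negative rate
  c(sensitivity = sensitivity, specificity = specificity)
}
class_metrics(confmat_15_full) # e.g. the full logistic model at cutoff 0.15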
CLASSIFICATION AND REGRESSION TREES :
What is CART?
Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item’s target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees.
In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables.
PRIOR PROBABILITIES :
In this technique, we explicitly set the prior probabilities of default and non-default, so that the CART model gives more weight to the rare default class when deciding how to classify an observation.
# Decision Trees
library(rpart, warn.conflicts = FALSE) # making decision trees.
library(rpart.plot, warn.conflicts = FALSE) # for the prp tree-plotting function.
library(rattle, warn.conflicts = FALSE)
library(RColorBrewer, warn.conflicts = FALSE) # for great decision tree plots
# Changing the prior probabilities of loan default and non-loan default
tree_prior <- rpart(` loan_status ` ~., method = "class",
data = training_set,
parms = list(prior = c(0.7,0.3)), # changing the probabilities of non loan default to 0.7
control = rpart.control(cp = 0.001)) # and loan default to 0.3
prp(tree_prior)
printcp(tree_prior) # cp and xerror values.
##
## Classification tree:
## rpart(formula = ` loan_status ` ~ ., data = training_set, method = "class",
## parms = list(prior = c(0.7, 0.3)), control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] age home_ownership annual_inc emp_cat
## [5] grade ir_cat loan_amnt
##
## Root node error: 2499.6/8332 = 0.3
##
## n= 8332
##
## CP nsplit rel error xerror xstd
## 1 0.0035384 0 1.00000 1.00000 0.030630
## 2 0.0028667 14 0.93780 0.98684 0.029582
## 3 0.0028168 16 0.93207 1.00394 0.029190
## 4 0.0024641 21 0.91668 1.00358 0.029068
## 5 0.0022270 22 0.91422 1.00355 0.028987
## 6 0.0022111 26 0.90466 1.00647 0.028923
## 7 0.0020217 27 0.90245 1.01122 0.028970
## 8 0.0020005 28 0.90043 1.01301 0.028959
## 9 0.0019899 32 0.89222 1.01351 0.028905
## 10 0.0018904 33 0.89023 1.01288 0.028904
## 11 0.0017846 39 0.87769 1.01716 0.028814
## 12 0.0017532 40 0.87590 1.02147 0.028806
## 13 0.0017098 42 0.87240 1.01873 0.028775
## 14 0.0016820 49 0.85955 1.02031 0.028777
## 15 0.0016743 52 0.85314 1.02072 0.028750
## 16 0.0015953 59 0.84079 1.02135 0.028751
## 17 0.0015953 61 0.83760 1.01914 0.028748
## 18 0.0015400 62 0.83601 1.02263 0.028794
## 19 0.0015006 64 0.83293 1.02504 0.028783
## 20 0.0013582 65 0.83143 1.03586 0.028728
## 21 0.0013267 67 0.82871 1.04101 0.028706
## 22 0.0012213 69 0.82606 1.04897 0.028606
## 23 0.0012057 72 0.82239 1.05262 0.028513
## 24 0.0011815 75 0.81877 1.05135 0.028512
## 25 0.0011529 82 0.81021 1.05660 0.028463
## 26 0.0010976 83 0.80906 1.05563 0.028420
## 27 0.0010951 85 0.80687 1.05837 0.028409
## 28 0.0010503 89 0.80230 1.06015 0.028397
## 29 0.0010503 91 0.80020 1.06759 0.028308
## 30 0.0010109 93 0.79810 1.06557 0.028251
## 31 0.0010000 95 0.79608 1.07114 0.028243
plotcp(tree_prior) # cp vs xerror plot.
tree_min <- tree_prior$cptable[which.min(tree_prior$cptable[,"xerror"]),"CP"] # cp value with the lowest cross-validated error (xerror).
ptree_prior <- prune(tree_prior, cp = tree_min) # pruning the tree at that cp for better generalization.
prp(ptree_prior)
pred_prior <- predict(ptree_prior, newdata = test_set, type = "class") # making predictions.
confmat_prior <- table(test_set$` loan_status `,pred_prior) # making the confusion matrix.
confmat_prior
## pred_prior
## 0 1
## 0 3612 149
## 1 362 44
acc_prior <- sum(diag(confmat_prior)) / sum(confmat_prior)
acc_prior
## [1] 0.8773698
prp(ptree_prior)
We see a brilliant accuracy of about 88% with this model, a straight 3% jump over our previous best model.
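As a side note (our addition), the sensitivity and specificity measures introduced earlier can also be read off this confusion matrix:
# Sensitivity and specificity of the prior-probability tree, read from confmat_prior
# (rows = actual class, columns = predicted class).
sens_prior <- confmat_prior["1", "1"] / sum(confmat_prior["1", ]) # defaults correctly caught
spec_prior <- confmat_prior["0", "0"] / sum(confmat_prior["0", ]) # non-defaults correctly caught
c(sensitivity = sens_prior, specificity = spec_prior)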
LOSS MATRIX :
In this technique, we will penalize our model every time it misclassifies a default as a non-default or a non-default as a default, with the two mistakes assigned different costs. This is a very strict approach, and much of the time it will not give the desired results.
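To make the convention explicit (our illustration; in rpart, entry [i, j] of the loss matrix is the cost of predicting class j for an observation whose true class is i, with zeros on the diagonal), the loss matrix passed below looks like this:
# The loss matrix used below, with illustrative dimnames added. Rows are the actual
# classes (0 = non-default, 1 = default) and columns the predicted classes.
loss <- matrix(c(0, 10, 1, 0), ncol = 2,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
loss
# loss["1", "0"] is 10: predicting non-default for a true default costs ten times
# as much as the opposite mistake, loss["0", "1"], which costs 1.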
# Including a loss matrix
tree_loss_matrix <- rpart(` loan_status ` ~., method = "class",
data = training_set,
parms = list(loss = matrix(c(0,10,1,0),ncol = 2)),
control = rpart.control(cp = 0.001))
# penalizing misclassifying a default as a non-default 10 times more than the opposite mistake.
printcp(tree_loss_matrix)
##
## Classification tree:
## rpart(formula = ` loan_status ` ~ ., data = training_set, method = "class",
## parms = list(loss = matrix(c(0, 10, 1, 0), ncol = 2)), control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] age home_ownership annual_inc emp_cat
## [5] grade ir_cat loan_amnt
##
## Root node error: 7387/8332 = 0.88658
##
## n= 8332
##
## CP nsplit rel error xerror xstd
## 1 0.1196697 0 1.00000 10.0000 0.039184
## 2 0.0227427 1 0.88033 5.9488 0.061404
## 3 0.0088669 2 0.85759 7.0895 0.059514
## 4 0.0085285 4 0.83985 5.8297 0.061463
## 5 0.0077614 5 0.83133 5.4239 0.061423
## 6 0.0057534 8 0.80804 5.1902 0.061252
## 7 0.0042642 10 0.79653 4.9295 0.060936
## 8 0.0039258 12 0.78801 4.9825 0.061005
## 9 0.0035197 14 0.78015 5.0414 0.061098
## 10 0.0027075 19 0.75904 5.1650 0.061210
## 11 0.0023013 20 0.75633 5.0964 0.061138
## 12 0.0021660 22 0.75173 4.9831 0.060987
## 13 0.0020306 23 0.74956 5.0009 0.061007
## 14 0.0019855 27 0.74049 4.9713 0.060967
## 15 0.0019629 30 0.73453 4.9713 0.060967
## 16 0.0018952 32 0.73061 4.9392 0.060920
## 17 0.0018275 33 0.72871 4.9319 0.060902
## 18 0.0017598 35 0.72506 4.9433 0.060912
## 19 0.0016245 39 0.71802 4.8843 0.060822
## 20 0.0015794 40 0.71639 4.7696 0.060620
## 21 0.0015568 43 0.71166 4.8052 0.060676
## 22 0.0015342 47 0.70543 4.8093 0.060683
## 23 0.0014891 51 0.69798 4.8270 0.060710
## 24 0.0014214 60 0.68201 4.7739 0.060611
## 25 0.0013537 62 0.67917 4.7229 0.060503
## 26 0.0012860 63 0.67781 4.6763 0.060408
## 27 0.0012184 68 0.66955 4.6377 0.060326
## 28 0.0011913 93 0.63165 4.6659 0.060384
## 29 0.0011797 118 0.58603 4.5882 0.060219
## 30 0.0011732 126 0.57574 4.5604 0.060154
## 31 0.0011507 129 0.57222 4.5940 0.060227
## 32 0.0010830 136 0.56369 4.5722 0.060169
## 33 0.0010379 143 0.55544 4.5983 0.060220
## 34 0.0010153 150 0.54691 4.4903 0.059967
## 35 0.0010000 154 0.54285 4.4903 0.059967
plotcp(tree_loss_matrix)
ptree_loss_matrix <- prune(tree_loss_matrix, cp = 0.0012788)
prp(ptree_loss_matrix)
Even after pruning we see a huge tree which is very hard to interpret, oversized, and almost certainly overfitted. Overfitting is a major problem in predictive modelling: the model fits the training set too closely and therefore fails to do well on the test set or on unseen observations.
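A quick way to check this suspicion (our addition, not in the original analysis) is to compute the tree's accuracy on the training set and compare it with the test-set accuracy computed just below; a large gap would point to overfitting.
# Training-set accuracy of the pruned loss-matrix tree, for comparison with the
# test-set accuracy computed below.
pred_train_lm <- predict(ptree_loss_matrix, newdata = training_set, type = "class")
mean(pred_train_lm == training_set$` loan_status `) # training accuracy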
pred_loss_matrix <- predict(ptree_loss_matrix,newdata = test_set, type = "class")
confmat_loss_matrix <- table(test_set$` loan_status `,pred_loss_matrix)
confmat_loss_matrix
## pred_loss_matrix
## 0 1
## 0 1923 1838
## 1 127 279
acc_loss_matrix <- sum(diag(confmat_loss_matrix)) / sum(confmat_loss_matrix)
acc_loss_matrix
## [1] 0.5284377
The accuracy comes out to be only 53%. That looks pathetic, although note that this model catches far more of the actual defaults (279 of 406) than the previous ones. Maybe we penalized our model too heavily for misclassifying defaults? Perhaps weighting the default cases can help us build a better model.
WEIGHTING :
In this technique, we give the rare default cases a larger weight in the training set, so that the tree-growing algorithm pays more attention to them when classifying.
# including case weights for the training set
case_weights <- ifelse(training_set$` loan_status ` == 0,1,3)
tree_weights <- rpart(` loan_status ` ~ ., method = "class",
data = training_set,
control = rpart.control(cp = 0.001,minsplit = 5,minbucket = 2),
weights = case_weights)
plotcp(tree_weights)
ptree_weights <- prune(tree_weights, cp = 0.00183101)
prp(ptree_weights,extra = 1)
pred_weights <- predict(ptree_weights, newdata = test_set,type = "class")
confmat_weights <- table(test_set$` loan_status `,pred_weights)
confmat_weights
## pred_weights
## 0 1
## 0 3533 228
## 1 355 51
acc_weights <- sum(diag(confmat_weights)) / sum(confmat_weights)
acc_weights
## [1] 0.8600912
We get a good accuracy of 86%, which is decent enough. Weighting is one of the most useful techniques for classifying unbalanced data.
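As a closing convenience (our addition), the accuracies computed throughout this project can be collected into one small comparison table; the object names are the ones created above.
# Collecting the accuracies computed above into a single comparison table.
data.frame(
  model = c("logistic, 4 predictors (cutoff 0.20)",
            "logistic, all predictors (cutoff 0.20)",
            "CART with prior probabilities",
            "CART with loss matrix",
            "CART with case weights"),
  accuracy = c(accuracy_20_small, accuracy_20, acc_prior, acc_loss_matrix, acc_weights)
)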