ABSTRACT :

In this project we have addressed the problem of classifying highly unbalanced data using supervised machine learning algorithms. Unbalanced data is ubiquitous and arises in a wide range of fields, including but not limited to business, bioinformatics, engineering and banking. We have focused here on Credit Risk, which is defined as the risk of default on a debt that may arise from a borrower failing to make required payments. In the first instance, the risk is borne by the lender and includes lost principal and interest, disruption to cash flows, and increased collection costs. In a typical Credit Risk problem, loan default is a rare event, which is why the data are so unbalanced.

Here, we have tried to solve the problem of unbalanced data with the help of Logistic Regression and Classification and Regression Trees (CART). We have used techniques such as prior probabilities, a loss matrix and case weighting to deal with the unbalanced data.

ABOUT THE DATA :

The data set is from a bank where data about customers was collected on a yearly basis for analysis. In our data set, we have 8 variables.

loan_status : A Categorical variable which is 0 if the customer is a non-defaulter and 1 if the customer is a defaulter.

loan_amnt : A Numerical variable which tells us the amount of the last loan taken by the customer.

int_rate : A Numerical variable which tells us the interest rate on the last loan taken by the customer.

grade : A Categorical variable which tells us how important the customer is to the bank, with A being the most important and G the least important.

emp_length : A Numerical variable which tells us how long the customer has been working at his/her current workplace.

home_ownership : A Categorical variable which tells us about the type of house owned by the customer.

annual_inc : A Numerical variable which tells us about the Annual income of the customer.

age : A Numerical variable which tells us about the age of the customer.

EXPLORATORY DATA ANALYSIS :

Let us now load the necessary libraries for performing data visualization and data pre-processing.

# Data Preprocessing and Data Visualization

library(readxl, warn.conflicts = FALSE) # for reading the Excel file

library(gmodels, warn.conflicts = FALSE) # for the CrossTable() function

library(ggplot2, warn.conflicts = FALSE) # for plotting

loan_data <- read_excel("loan_data.xlsx") # loan_data is our data set

str(loan_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    12499 obs. of  8 variables:
##  $  loan_status    : num  0 0 0 0 0 0 1 0 1 0 ...
##  $ loan_amnt       : num  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ int_rate        : chr  "10.65" "      NA" "13.49" "      NA" ...
##  $ grade           : chr  "    B  " "    C  " "    C  " "    A  " ...
##  $ emp_length      : num  10 25 13 3 9 11 0 3 3 0 ...
##  $  home_ownership : chr  "           RENT " "           RENT " "           RENT " "           RENT " ...
##  $ annual_inc      : num  24000 12252 49200 36000 48000 ...
##  $  age            : num  33 31 24 39 24 28 22 22 28 22 ...
histogram <- hist(loan_data$loan_amnt, breaks = 200, xlab = "Loan Amount", 
               main = "Histogram of the Loan Amount",col = "red")

As expected, the histogram is right skewed. Now let’s look at our response variable. We are trying to predict loan_status, and we see that about 89% of the observations are 0 and only about 11% are 1. Loan default is indeed a rare phenomenon.

loan_data$` loan_status ` <- factor(loan_data$` loan_status `)

CrossTable(loan_data$` loan_status `)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12499 
## 
##  
##           |         0 |         1 | 
##           |-----------|-----------|
##           |     11148 |      1351 | 
##           |     0.892 |     0.108 | 
##           |-----------|-----------|
## 
## 
## 
## 
CrossTable(loan_data$` home_ownership `)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12499 
## 
##  
##                  |             OWN  |            RENT  |        MORTGAGE  | 
##                  |------------------|------------------|------------------|
##                  |              951 |             6190 |             5358 | 
##                  |            0.076 |            0.495 |            0.429 | 
##                  |------------------|------------------|------------------|
## 
## 
## 
## 
Cross <- CrossTable(loan_data$` home_ownership `,loan_data$` loan_status `,
                    prop.r = TRUE,prop.c = FALSE,prop.t = FALSE,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12499 
## 
##  
##                              | loan_data$` loan_status ` 
## loan_data$` home_ownership ` |         0 |         1 | Row Total | 
## -----------------------------|-----------|-----------|-----------|
##                         OWN  |       847 |       104 |       951 | 
##                              |     0.891 |     0.109 |     0.076 | 
## -----------------------------|-----------|-----------|-----------|
##                        RENT  |      5419 |       771 |      6190 | 
##                              |     0.875 |     0.125 |     0.495 | 
## -----------------------------|-----------|-----------|-----------|
##                    MORTGAGE  |      4882 |       476 |      5358 | 
##                              |     0.911 |     0.089 |     0.429 | 
## -----------------------------|-----------|-----------|-----------|
##                 Column Total |     11148 |      1351 |     12499 | 
## -----------------------------|-----------|-----------|-----------|
## 
## 
Cross1 <- CrossTable(loan_data$grade,loan_data$` loan_status `,
                     prop.r = TRUE,prop.c = FALSE,prop.t = FALSE,prop.chisq = FALSE) 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12499 
## 
##  
##                 | loan_data$` loan_status ` 
## loan_data$grade |         0 |         1 | Row Total | 
## ----------------|-----------|-----------|-----------|
##             A   |      4603 |       315 |      4918 | 
##                 |     0.936 |     0.064 |     0.393 | 
## ----------------|-----------|-----------|-----------|
##             B   |      3757 |       458 |      4215 | 
##                 |     0.891 |     0.109 |     0.337 | 
## ----------------|-----------|-----------|-----------|
##             C   |      1656 |       313 |      1969 | 
##                 |     0.841 |     0.159 |     0.158 | 
## ----------------|-----------|-----------|-----------|
##             D   |       923 |       208 |      1131 | 
##                 |     0.816 |     0.184 |     0.090 | 
## ----------------|-----------|-----------|-----------|
##             E   |       175 |        45 |       220 | 
##                 |     0.795 |     0.205 |     0.018 | 
## ----------------|-----------|-----------|-----------|
##             F   |        29 |         9 |        38 | 
##                 |     0.763 |     0.237 |     0.003 | 
## ----------------|-----------|-----------|-----------|
##             G   |         5 |         3 |         8 | 
##                 |     0.625 |     0.375 |     0.001 | 
## ----------------|-----------|-----------|-----------|
##    Column Total |     11148 |      1351 |     12499 | 
## ----------------|-----------|-----------|-----------|
## 
## 
plot(Cross$prop.tbl,col = c("bisque2","azure3"),main = "Proportions of Default/Non-Default wrt Ownership"
     ,xlab = "Home ownership of the customer",ylab = "1 - Default    0 - Non-default")

There’s not much to discuss here, as we observe no definitive pattern. Let’s see if we can find any association between loan defaults and grade.

plot(Cross1$prop.tbl,col = c("cornflowerblue","chartreuse3"),main = "Proportions of Default/Non-Default wrt Grade"
     ,xlab = "Grade of the customer",ylab = "1 - Default    0 - Non-default")

There definitely is a pattern here: the proportion of loan defaults increases as we move away from grade A. That is logical, as riskier customers (usually grades E, F and G) are more likely to default on their loans.

Let’s see if we have any missing data. In fact, we do: there are a lot of missing values in our employment length and interest rate variables (note that the interest rate column was even read in as character because of its "NA" text entries). One way to handle this is to put the values into bins and convert the variables into categorical ones, with a separate category for missing values.

summary(loan_data$`emp_length `) # finding the number of missing data in emp_length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.00    5.00    6.69    9.00   61.00     496
summary(loan_data$int_rate) # int_rate was read in as character, so summary() cannot count its missing values
##    Length     Class      Mode 
##     12499 character character
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate)) # making a null column

loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"

loan_data$ir_cat <- as.factor(loan_data$ir_cat) # converting into categories

plot(loan_data$ir_cat,xlab = "Interest rate categories",ylab = "Count") # visualizing the category counts
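Note that because int_rate is stored as character, the comparisons above are lexicographic rather than numeric. The results in the rest of this report use the binning exactly as coded above; a cleaner alternative (shown only as a sketch, kept in a separate variable so nothing downstream changes) would be to coerce the column to numeric first and bin it with cut():

int_rate_num <- as.numeric(loan_data$int_rate) # the padded "NA" text entries become real NAs (a coercion warning may appear)

table(cut(int_rate_num, breaks = c(0, 8, 11, 13.5, Inf),
          labels = c("0-8", "8-11", "11-13.5", "13.5+")), useNA = "ifany") # counts per bin, missing values included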

loan_data$emp_cat <- rep(NA, length(loan_data$emp_length)) # making a null column

loan_data$emp_cat[which(loan_data$emp_length <= 15)] <- "0-15"
loan_data$emp_cat[which(loan_data$emp_length > 15 & loan_data$emp_length <= 30)] <- "15-30"
loan_data$emp_cat[which(loan_data$emp_length > 30 & loan_data$emp_length <= 45)] <- "30-45"
loan_data$emp_cat[which(loan_data$emp_length > 45)] <- "45+"
loan_data$emp_cat[which(is.na(loan_data$emp_length))] <- "Missing"

loan_data$emp_cat <- as.factor(loan_data$emp_cat)

plot(loan_data$emp_cat,xlab = "Employee length categories",ylab = "Count")

We see that our interest rate and employment length variables have now been grouped into categories instead of numerical values.

loan_data$int_rate <- NULL

loan_data$`emp_length ` <- NULL

new <- ggplot(loan_data, aes(x = factor(` loan_status `), fill = grade))

val = c("lightcyan2", "lightcyan3","lightblue","lightcyan4","cadetblue3","aquamarine","darkblue")
lab = c("A","B","C","D","E","F","G")
new + geom_bar(position = "dodge",alpha = 0.75) + 
      scale_x_discrete("Loan Status") +
      scale_y_continuous("Count") +
      scale_fill_manual("Grades",values = val,labels = lab)

ggplot(loan_data, aes(x = factor(` loan_status `), fill = grade)) + geom_bar(position = "fill")

ggplot(loan_data,aes(x = annual_inc,y = ir_cat)) + geom_point() +
  facet_grid(grade ~ .)

The counts of grades with respect to loan status give us little insight, but the plot of annual income against interest rate category tells us that quite a few people with good annual incomes are paying interest rates of 13.5+.

To plot an income distribution across grades, we will need to scale down our annual income variable temporarily.

# scale down loan_data$annual_inc first



color <- scale_fill_brewer("Loan Grades",palette = "Reds")

loan_new <- loan_data

loan_new$annual_inc <- loan_new$annual_inc/10000

ggplot(loan_new,aes(x = annual_inc,fill = factor(grade))) + 
  geom_histogram(binwidth = 1) + color + facet_grid(grade ~. )

We see that customers with better grades tend to have higher annual incomes.

LOGISTIC REGRESSION AND MODEL PERFORMANCE :

In statistics, logistic regression (logit regression, or the logit model) is a regression model where the dependent variable (DV) is categorical. Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution.

Thus, it treats the same set of problems as probit regression, using similar techniques; the latter uses a cumulative normal distribution curve instead. Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard logistic distribution of errors and the second a standard normal distribution of errors.

Logistic regression can be seen as a special case of the generalized linear model and thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between dependent and independent variables) from those of linear regression. In particular the key differences of these two models can be seen in the following two features of logistic regression.

First, the conditional distribution is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0, 1) through the logistic distribution function, because logistic regression predicts the probability of particular outcomes.
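As a small illustration (an addition to the report, using an arbitrary range of values rather than any estimated coefficients), the logistic function maps any value of the linear predictor into a probability between 0 and 1:

# Sketch: the logistic link. The linear predictor eta is squashed into (0, 1)
# by the logistic CDF; the values of eta here are arbitrary.
eta <- seq(-6, 6, by = 0.1)
p <- 1 / (1 + exp(-eta)) # identical to plogis(eta)
plot(eta, p, type = "l", xlab = "Linear predictor", ylab = "P(loan_status = 1)")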

# Logistic Regression
set.seed(123)

library(ROCR, warn.conflicts = FALSE)
library(pROC, warn.conflicts = FALSE)

index_train <- sample(1:nrow(loan_data), 2 / 3* nrow(loan_data)) # making a random sample of training set.

training_set <- loan_data[c(index_train), ] # storing the random sample in a training set.

test_set <- loan_data[-index_train, ]

# Let's start playing around with logistic regression.


log_reg_small <- glm(` loan_status ` ~ ir_cat + grade + loan_amnt + annual_inc,
                     family = "binomial", data = training_set)

summary(log_reg_small) # examining which variables are important in our model; loan amount is not significant.
## 
## Call:
## glm(formula = ` loan_status ` ~ ir_cat + grade + loan_amnt + 
##     annual_inc, family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1193  -0.5435  -0.4248  -0.3412   3.6661  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.106e+00  1.428e-01 -14.744  < 2e-16 ***
## ir_cat11-13.5  1.117e-01  1.092e-01   1.023  0.30639    
## ir_cat13.5+    1.611e-01  1.099e-01   1.466  0.14254    
## grade    B     6.857e-01  1.163e-01   5.897 3.70e-09 ***
## grade    C     1.006e+00  1.133e-01   8.879  < 2e-16 ***
## grade    D     1.261e+00  1.170e-01  10.773  < 2e-16 ***
## grade    E     1.564e+00  2.114e-01   7.398 1.38e-13 ***
## grade    F     1.439e+00  4.795e-01   3.001  0.00269 ** 
## grade    G     2.272e+00  1.256e+00   1.808  0.07054 .  
## loan_amnt     -8.941e-06  6.488e-06  -1.378  0.16819    
## annual_inc    -1.024e-05  1.310e-06  -7.819 5.34e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5892.4  on 8331  degrees of freedom
## Residual deviance: 5612.2  on 8321  degrees of freedom
## AIC: 5634.2
## 
## Number of Fisher Scoring iterations: 5

Looking at the summary of our model, we see that grade and annual income are highly significant, while the interest rate categories and loan amount are not significant at the usual levels.

Time to make some predictions with our model and evaluate its performance.

predictions_small <- predict(log_reg_small, newdata = test_set,
                              type = "response")                

range(predictions_small) # getting the range of the probabilities calculated above.
## [1] 4.185654e-09 4.871028e-01
pred_cutoff_15_small <- ifelse(predictions_small > 0.15,1,0)

conf_15_small <- table(test_set$` loan_status `,pred_cutoff_15_small)
conf_15_small
##    pred_cutoff_15_small
##        0    1
##   0 2931  830
##   1  253  153
accuracy_15_small <- sum(diag(conf_15_small))/sum(conf_15_small)
accuracy_15_small
## [1] 0.7401008
pred_cutoff_20_small <- ifelse(predictions_small > 0.20,1,0)

conf_20_small <- table(test_set$` loan_status `,pred_cutoff_20_small)
conf_20_small
##    pred_cutoff_20_small
##        0    1
##   0 3468  293
##   1  335   71
accuracy_20_small <- sum(diag(conf_20_small))/sum(conf_20_small)
accuracy_20_small
## [1] 0.8492921

We see that with cutoff values of 0.15 and 0.20, we get accuracies of about 74% and 85% respectively; raising the cutoff flags fewer customers as defaulters and, on such unbalanced data, pushes the accuracy up. A quick way to see how accuracy varies across a range of cutoffs is sketched below. After that, let us plot the ROC curve and calculate the AUC (area under the curve) for our model.
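Below is a small sketch (an addition, not part of the original analysis) that computes the test-set accuracy of the small model over a grid of cutoffs, reusing the probabilities in predictions_small:

# Sketch: accuracy of the small model at several cutoffs.
actual <- as.numeric(as.character(test_set$` loan_status `)) # 0/1 as numbers
cutoffs <- seq(0.10, 0.30, by = 0.05)
sapply(cutoffs, function(c) mean(ifelse(predictions_small > c, 1, 0) == actual))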

pred_train_small <- predict(log_reg_small,type = "response")

predROC_small <- prediction(pred_train_small, training_set$ ` loan_status `)

perfROC_small <- performance(predROC_small,"tpr","fpr")

plot(perfROC_small, colorize = TRUE,
     print.cutoffs.at = seq(0,1,0.1),text.adj = c(-0.2,1.7))

AUC_small <- as.numeric(performance(predROC_small , "auc")@y.values)
AUC_small
## [1] 0.665261

We get an AUC of 0.665261. That’s a fairly decent AUC considering that we used only 4 predictors in our logistic model. Can we improve performance by using all the variables? Let’s find out.

log_reg_full <- glm(` loan_status ` ~., family = "binomial", data = training_set)

summary(log_reg_full)
## 
## Call:
## glm(formula = ` loan_status ` ~ ., family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2283  -0.5401  -0.4227  -0.3360   3.6112  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                        -2.265e+00  2.491e-01  -9.090  < 2e-16
## loan_amnt                          -8.416e-06  6.488e-06  -1.297  0.19460
## grade    B                          6.851e-01  1.170e-01   5.856 4.73e-09
## grade    C                          1.003e+00  1.142e-01   8.784  < 2e-16
## grade    D                          1.266e+00  1.183e-01  10.705  < 2e-16
## grade    E                          1.582e+00  2.124e-01   7.449 9.44e-14
## grade    F                          1.439e+00  4.809e-01   2.993  0.00276
## grade    G                          2.340e+00  1.261e+00   1.855  0.06359
## ` home_ownership `           RENT   1.816e-01  1.411e-01   1.287  0.19821
## ` home_ownership `       MORTGAGE   9.191e-02  1.472e-01   0.624  0.53234
## annual_inc                         -9.550e-06  1.367e-06  -6.988 2.78e-12
## ` age`                             -2.206e-03  5.907e-03  -0.373  0.70886
## ir_cat11-13.5                       1.088e-01  1.094e-01   0.994  0.31999
## ir_cat13.5+                         1.452e-01  1.101e-01   1.320  0.18700
## emp_cat15-30                        1.589e-01  1.265e-01   1.257  0.20891
## emp_cat30-45                        1.917e-01  3.610e-01   0.531  0.59537
## emp_cat45+                          1.700e+00  6.337e-01   2.683  0.00729
## emp_catMissing                      5.956e-01  1.508e-01   3.949 7.85e-05
##                                       
## (Intercept)                        ***
## loan_amnt                             
## grade    B                         ***
## grade    C                         ***
## grade    D                         ***
## grade    E                         ***
## grade    F                         ** 
## grade    G                         .  
## ` home_ownership `           RENT     
## ` home_ownership `       MORTGAGE     
## annual_inc                         ***
## ` age`                                
## ir_cat11-13.5                         
## ir_cat13.5+                           
## emp_cat15-30                          
## emp_cat30-45                          
## emp_cat45+                         ** 
## emp_catMissing                     ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5892.4  on 8331  degrees of freedom
## Residual deviance: 5589.9  on 8314  degrees of freedom
## AIC: 5625.9
## 
## Number of Fisher Scoring iterations: 5

The summary shows that most of the additional variables are not significant. Let’s go straight to making predictions with this model.

predictions_full <- predict(log_reg_full, newdata = test_set, type = "response")

range(predictions_full)
## [1] 1.255721e-08 5.282657e-01
pred_cutoff_15 <- ifelse(predictions_full > 0.15,1,0)

confmat_15_full <- table(test_set$` loan_status `,pred_cutoff_15)
confmat_15_full
##    pred_cutoff_15
##        0    1
##   0 2964  797
##   1  248  158
accuracy_15 <- sum(diag(confmat_15_full))/sum(confmat_15_full)
accuracy_15
## [1] 0.7492201
pred_cutoff_20 <- ifelse(predictions_full > 0.20,1,0)

confmat_20_full <- table(test_set$` loan_status `,pred_cutoff_20)
confmat_20_full
##    pred_cutoff_20
##        0    1
##   0 3467  294
##   1  337   69
accuracy_20 <- sum(diag(confmat_20_full))/sum(confmat_20_full)
accuracy_20
## [1] 0.8485721

Even with all the predictor variables we are not able to increase the accuracy of our model. What we have done, however, is increase the model complexity and make it harder to interpret. We should stick to our first model with 4 predictor variables, as it is much easier to interpret and gives more or less the same result. One way to check this formally is sketched below.
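As a quick check (an addition, not part of the original analysis), a likelihood-ratio test and an AIC comparison between the two nested models can confirm that the extra predictors add little:

# Sketch: comparing the nested logistic models fitted above.
anova(log_reg_small, log_reg_full, test = "Chisq") # likelihood-ratio test
AIC(log_reg_small, log_reg_full)                   # lower AIC is better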

Let’s now see how the model performs with the ROC Curve.

pred_train_full <- predict(log_reg_full,type = "response")

predROC_full <- prediction(pred_train_full, training_set$ ` loan_status `)

perfROC_full <- performance(predROC_full,"tpr","fpr")

plot(perfROC_full, colorize = TRUE,
     print.cutoffs.at = seq(0,1,0.1),text.adj = c(-0.2,1.7))

AUC_full <- as.numeric(performance(predROC_full , "auc")@y.values)
AUC_full
## [1] 0.6708368

The AUC for this model comes out to be 0.6708368, which is not a significant gain over our previous model. Let’s move on to CART (Classification and Regression Trees) to see if we can make better predictions on this highly unbalanced data with decision trees and the various techniques that come with them.

Sensitivity and specificity are statistical measures of the performance of a binary classification test; both are computed in the short sketch after the definitions below.

Sensitivity : (also called the true positive rate, or recall in some fields) measures the proportion of positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).

Specificity : (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
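As a concrete illustration (an addition using the confusion matrix conf_15_small computed earlier, not part of the original analysis), both measures can be read straight off the table:

# Sketch: sensitivity and specificity from conf_15_small, whose rows are the
# true classes (0 = non-default, 1 = default) and whose columns are predictions.
TP <- conf_15_small["1", "1"]; FN <- conf_15_small["1", "0"]
TN <- conf_15_small["0", "0"]; FP <- conf_15_small["0", "1"]
TP / (TP + FN) # sensitivity: share of actual defaulters correctly flagged
TN / (TN + FP) # specificity: share of non-defaulters correctly passed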

CLASSIFICATION AND REGRESSION TREES :

What is CART?

Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item’s target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees.

In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables.
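As a tiny numerical illustration (an addition to the report), classification trees such as those grown by rpart typically choose splits that reduce node impurity, by default measured with the Gini index:

# Sketch: Gini impurity of a node with class proportions p; a pure node has
# impurity 0 and a perfectly mixed two-class node has impurity 0.5.
gini <- function(p) 1 - sum(p^2)
gini(c(0.89, 0.11)) # a node resembling our overall default rate: fairly pure
gini(c(0.50, 0.50)) # a perfectly mixed node: maximal impurity for two classes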

PRIOR PROBABILITIES :

In this technique, we explicitly set the prior probabilities of non-default and default, so that the CART algorithm treats defaults as far more common than their raw frequency in the training set suggests and is therefore more willing to classify an observation as a default.
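For reference (an addition, not part of the original script), the empirical class proportions in the training set, which are what rpart would use as priors by default, are roughly 0.89 and 0.11; below we override them with 0.7 and 0.3:

# Sketch: the class proportions rpart would use as priors if we did not override them.
prop.table(table(training_set$` loan_status `))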

# Decision Trees

library(rpart, warn.conflicts = FALSE) # making decision trees.
library(rpart.plot, warn.conflicts = FALSE) # for the prp() tree-plotting function
library(rattle, warn.conflicts = FALSE)
library(RColorBrewer, warn.conflicts = FALSE) # for great decision tree plots
 
# Changing the prior probabilities of loan default and non-loan default

tree_prior <- rpart(` loan_status ` ~., method = "class",
                    data = training_set,
                    parms = list(prior = c(0.7,0.3)), # changing the probabilities of non loan default to 0.7
                    control = rpart.control(cp = 0.001)) # and loan default to 0.3

prp(tree_prior)

printcp(tree_prior) # cp and xerror values.
## 
## Classification tree:
## rpart(formula = ` loan_status ` ~ ., data = training_set, method = "class", 
##     parms = list(prior = c(0.7, 0.3)), control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1]  age              home_ownership  annual_inc       emp_cat         
## [5] grade            ir_cat           loan_amnt       
## 
## Root node error: 2499.6/8332 = 0.3
## 
## n= 8332 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0035384      0   1.00000 1.00000 0.030630
## 2  0.0028667     14   0.93780 0.98684 0.029582
## 3  0.0028168     16   0.93207 1.00394 0.029190
## 4  0.0024641     21   0.91668 1.00358 0.029068
## 5  0.0022270     22   0.91422 1.00355 0.028987
## 6  0.0022111     26   0.90466 1.00647 0.028923
## 7  0.0020217     27   0.90245 1.01122 0.028970
## 8  0.0020005     28   0.90043 1.01301 0.028959
## 9  0.0019899     32   0.89222 1.01351 0.028905
## 10 0.0018904     33   0.89023 1.01288 0.028904
## 11 0.0017846     39   0.87769 1.01716 0.028814
## 12 0.0017532     40   0.87590 1.02147 0.028806
## 13 0.0017098     42   0.87240 1.01873 0.028775
## 14 0.0016820     49   0.85955 1.02031 0.028777
## 15 0.0016743     52   0.85314 1.02072 0.028750
## 16 0.0015953     59   0.84079 1.02135 0.028751
## 17 0.0015953     61   0.83760 1.01914 0.028748
## 18 0.0015400     62   0.83601 1.02263 0.028794
## 19 0.0015006     64   0.83293 1.02504 0.028783
## 20 0.0013582     65   0.83143 1.03586 0.028728
## 21 0.0013267     67   0.82871 1.04101 0.028706
## 22 0.0012213     69   0.82606 1.04897 0.028606
## 23 0.0012057     72   0.82239 1.05262 0.028513
## 24 0.0011815     75   0.81877 1.05135 0.028512
## 25 0.0011529     82   0.81021 1.05660 0.028463
## 26 0.0010976     83   0.80906 1.05563 0.028420
## 27 0.0010951     85   0.80687 1.05837 0.028409
## 28 0.0010503     89   0.80230 1.06015 0.028397
## 29 0.0010503     91   0.80020 1.06759 0.028308
## 30 0.0010109     93   0.79810 1.06557 0.028251
## 31 0.0010000     95   0.79608 1.07114 0.028243
plotcp(tree_prior) # cp vs xerror plot.

tree_min <- tree_prior$cptable[which.min(tree_prior$cptable[,"xerror"]),"CP"]

# pruning the tree for increased model performance.

ptree_prior <- prune(tree_prior, cp = tree_min) # pruning the tree.

prp(ptree_prior)

pred_prior <- predict(ptree_prior, newdata = test_set, type = "class") # making predictions.

confmat_prior <- table(test_set$` loan_status `,pred_prior) # making the confusion matrix.
confmat_prior
##    pred_prior
##        0    1
##   0 3612  149
##   1  362   44
acc_prior <- sum(diag(confmat_prior)) / sum(confmat_prior)
acc_prior
## [1] 0.8773698
prp(ptree_prior)

We see a solid 88% accuracy with this model, a straight 3 percentage point jump from our previous best model.

LOSS MATRIX :

In this technique, we will penalize our model every time it classifies a default as a non-default or a non-default as a default, with the two mistakes carrying different costs. This is a very strict model and most of the time we won’t get the desired results.
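To make the penalty structure explicit (an addition, not part of the original script), rpart’s loss matrix has rows for the observed class and columns for the predicted class, in the order of the factor levels (0, 1):

# Sketch: the loss matrix passed to rpart below. loss["1", "0"] = 10 means that
# predicting non-default (0) for a true default (1) is ten times as costly as
# the reverse mistake.
loss <- matrix(c(0, 10, 1, 0), ncol = 2,
               dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))
loss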

# Including a loss matrix

tree_loss_matrix <- rpart(` loan_status ` ~., method = "class",
                          data = training_set,
                          parms = list(loss = matrix(c(0,10,1,0),ncol = 2)),
                          control = rpart.control(cp = 0.001))  

# penalizing classifying a default as a non-default 10 times more than the reverse.

printcp(tree_loss_matrix)
## 
## Classification tree:
## rpart(formula = ` loan_status ` ~ ., data = training_set, method = "class", 
##     parms = list(loss = matrix(c(0, 10, 1, 0), ncol = 2)), control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1]  age              home_ownership  annual_inc       emp_cat         
## [5] grade            ir_cat           loan_amnt       
## 
## Root node error: 7387/8332 = 0.88658
## 
## n= 8332 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.1196697      0   1.00000 10.0000 0.039184
## 2  0.0227427      1   0.88033  5.9488 0.061404
## 3  0.0088669      2   0.85759  7.0895 0.059514
## 4  0.0085285      4   0.83985  5.8297 0.061463
## 5  0.0077614      5   0.83133  5.4239 0.061423
## 6  0.0057534      8   0.80804  5.1902 0.061252
## 7  0.0042642     10   0.79653  4.9295 0.060936
## 8  0.0039258     12   0.78801  4.9825 0.061005
## 9  0.0035197     14   0.78015  5.0414 0.061098
## 10 0.0027075     19   0.75904  5.1650 0.061210
## 11 0.0023013     20   0.75633  5.0964 0.061138
## 12 0.0021660     22   0.75173  4.9831 0.060987
## 13 0.0020306     23   0.74956  5.0009 0.061007
## 14 0.0019855     27   0.74049  4.9713 0.060967
## 15 0.0019629     30   0.73453  4.9713 0.060967
## 16 0.0018952     32   0.73061  4.9392 0.060920
## 17 0.0018275     33   0.72871  4.9319 0.060902
## 18 0.0017598     35   0.72506  4.9433 0.060912
## 19 0.0016245     39   0.71802  4.8843 0.060822
## 20 0.0015794     40   0.71639  4.7696 0.060620
## 21 0.0015568     43   0.71166  4.8052 0.060676
## 22 0.0015342     47   0.70543  4.8093 0.060683
## 23 0.0014891     51   0.69798  4.8270 0.060710
## 24 0.0014214     60   0.68201  4.7739 0.060611
## 25 0.0013537     62   0.67917  4.7229 0.060503
## 26 0.0012860     63   0.67781  4.6763 0.060408
## 27 0.0012184     68   0.66955  4.6377 0.060326
## 28 0.0011913     93   0.63165  4.6659 0.060384
## 29 0.0011797    118   0.58603  4.5882 0.060219
## 30 0.0011732    126   0.57574  4.5604 0.060154
## 31 0.0011507    129   0.57222  4.5940 0.060227
## 32 0.0010830    136   0.56369  4.5722 0.060169
## 33 0.0010379    143   0.55544  4.5983 0.060220
## 34 0.0010153    150   0.54691  4.4903 0.059967
## 35 0.0010000    154   0.54285  4.4903 0.059967
plotcp(tree_loss_matrix)

ptree_loss_matrix <- prune(tree_loss_matrix, cp = 0.0012788)

prp(tree_loss_matrix)

We see a huge tree which is very hard to interpret, is oversized and is almost certainly overfitted. Overfitting is a major problem in predictive modeling: the model fits the training set too closely and therefore fails to generalize to the test set or to unseen observations.
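One quick way to check for overfitting (an addition, not part of the original analysis) is to compare accuracy on the training set with accuracy on the test set; a large gap suggests the tree has memorized the training data:

# Sketch: training-set accuracy of the pruned loss-matrix tree, to compare with
# the test-set accuracy computed below.
pred_train_loss <- predict(ptree_loss_matrix, newdata = training_set, type = "class")
mean(pred_train_loss == training_set$` loan_status `)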

pred_loss_matrix <- predict(ptree_loss_matrix,newdata = test_set, type = "class")

confmat_loss_matrix <- table(test_set$` loan_status `,pred_loss_matrix)
confmat_loss_matrix
##    pred_loss_matrix
##        0    1
##   0 1923 1838
##   1  127  279
acc_loss_matrix <- sum(diag(confmat_loss_matrix)) / sum(confmat_loss_matrix)
acc_loss_matrix
## [1] 0.5284377

The accuracy comes out to be about 53%. That’s pathetic! Note, though, that this model catches far more of the actual defaulters (279 of 406, versus 44 of 406 for the prior-probability tree) at the cost of many false alarms. Maybe we penalized our model too heavily for misclassifying defaults? Maybe weighting the loan default cases can help us build a better model.

WEIGHTING :

In this technique, we give the default cases a larger case weight, so that the tree-growing algorithm pays more attention to them when deciding on splits.

# including case weights for the training set

case_weights <- ifelse(training_set$` loan_status ` == 0,1,3)

tree_weights <- rpart(` loan_status ` ~ ., method = "class", 
                      data = training_set, 
                      control = rpart.control(cp = 0.001,minsplit = 5,minbucket = 2),
                      weights = case_weights)

plotcp(tree_weights)

ptree_weights <- prune(tree_weights, cp = 0.00183101)

prp(ptree_weights,extra = 1)

pred_weights <- predict(ptree_weights, newdata = test_set,type = "class")

confmat_weights <- table(test_set$` loan_status `,pred_weights)
confmat_weights
##    pred_weights
##        0    1
##   0 3533  228
##   1  355   51
acc_weights <- sum(diag(confmat_weights)) / sum(confmat_weights)
acc_weights
## [1] 0.8600912

We get a good accuracy of 86%, which is decent. Case weighting is one of the simplest and most commonly used techniques for classifying unbalanced data.
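Finally, as a compact wrap-up (an addition, not part of the original analysis), the test-set accuracies of the three tree-based approaches computed above can be collected side by side:

# Sketch: side-by-side test-set accuracies of the three CART variants above.
data.frame(model = c("prior probabilities", "loss matrix", "case weights"),
           accuracy = c(acc_prior, acc_loss_matrix, acc_weights))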