Logistic Regression Model - Credit Risk Dataset

We will be learning logistic regression using the Credit Risk dataset. The goal is to correctly classify people who have defaulted, based on the dataset’s parameters. We will use a confusion matrix to check the performance of the logistic regression model. You can find the datasets “Credit_Risk_Train_data.csv” & “Credit_Risk_Validate_data.csv” at https://github.com/anup-jana/R-Machine-Learning/tree/master/R%20Scripts/Datasets. So, let’s begin!

# Load necessary libraries for our work
options(warn=-1) # Suppress Warnings
library(ggplot2) # for some amazing looking graphs
library(MASS) # Library for the Box-Cox transform
library(corrplot) # Plotting nice correlation matrix
library(cowplot) # arranging plots into a grid
library(dplyr) # Library for data manipulation (select, etc.)

Let’s load the dataset and look at the characteristics of the data and its variables.

# Load the CSV file, converting empty strings to NA so missing values are captured automatically
cr_org=read.csv("Credit_Risk_Train_data.csv", na.strings=c("", "NA"))

str(cr_org) # Check data structure
## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Married          : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 2 2 2 ...
##  $ Dependents       : Factor w/ 4 levels "0","1","2","3+": 1 2 1 1 1 3 1 4 3 2 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
summary(cr_org) # Summary of Dataset
##      Loan_ID       Gender    Married    Dependents        Education  
##  LP001002:  1   Female:112   No  :213   0   :345   Graduate    :480  
##  LP001003:  1   Male  :489   Yes :398   1   :102   Not Graduate:134  
##  LP001005:  1   NA's  : 13   NA's:  3   2   :101                     
##  LP001006:  1                           3+  : 51                     
##  LP001008:  1                           NA's: 15                     
##  LP001011:  1                                                        
##  (Other) :608                                                        
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##  No  :500      Min.   :  150   Min.   :    0     Min.   :  9.0  
##  Yes : 82      1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  NA's: 32      Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50

Let’s check some statistics like missing and empty values in the dataset.

cr_org=cr_org[-1] # Removing Loan_ID as it has no logical correlation

# Let's check for any missing values in the data
colSums(is.na(cr_org))
##            Gender           Married        Dependents         Education 
##                13                 3                15                 0 
##     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
##                32                 0                 0                22 
##  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
##                14                50                 0                 0
# Checking for empty values
colSums(cr_org=='')
##            Gender           Married        Dependents         Education 
##                NA                NA                NA                 0 
##     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
##                NA                 0                 0                NA 
##  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
##                NA                NA                 0                 0

Visualizing Credit Risk Dataset

We can observe that there are a few NA values in several feature variables. Let’s visualize the distribution of the numeric variables with histograms.

hist_loan_amt <- ggplot(cr_org, aes(x = LoanAmount)) + geom_histogram(binwidth = 10) +
  geom_vline(xintercept = mean(cr_org$LoanAmount, na.rm = TRUE), color = "indianred") +      # mean (red)
  geom_vline(xintercept = median(cr_org$LoanAmount, na.rm = TRUE), color = "cornflowerblue") # median (blue)
hist_app_income <- ggplot(cr_org, aes(x = ApplicantIncome)) + geom_histogram(binwidth = 1000) +
  geom_vline(xintercept = mean(cr_org$ApplicantIncome), color = "indianred") +
  geom_vline(xintercept = median(cr_org$ApplicantIncome), color = "cornflowerblue")
hist_COapp_income <- ggplot(cr_org, aes(x = CoapplicantIncome)) + geom_histogram(binwidth = 1000) +
  geom_vline(xintercept = mean(cr_org$CoapplicantIncome), color = "indianred") +
  geom_vline(xintercept = median(cr_org$CoapplicantIncome), color = "cornflowerblue")
plot_grid(hist_loan_amt, hist_app_income, hist_COapp_income, labels = "AUTO")

# Scatterplot of Loan Status variable against LoanAmount
ggplot(data = cr_org, aes(x = LoanAmount, y = Loan_Status, col = Property_Area)) + geom_point()

Handling Null, Outlier and Categorical Variables

Let’s create a function to compute the mode of an attribute, which we will use to replace NA values in categorical attributes. NA values in numerical attributes will be replaced with the mean of the variable.

# Get Mode value of character variables - This will be used to replace NA values
Mode = function(x){
  ta = table(x)
  tam = max(ta)
  if(all(ta == tam))
    mod = NA
  else if(is.numeric(x))
    mod = as.numeric(names(ta))[ta == tam]
  else
    mod = names(ta)[ta == tam]
  return(mod)
}

cr_wrk = cr_org # creating a new dataset copy of original credit risk before other manipulations

# For character variables use mode value of the attribute
cr_wrk$Gender[is.na(cr_wrk$Gender)] = Mode(cr_wrk$Gender)
cr_wrk$Married[is.na(cr_wrk$Married)] = Mode(cr_wrk$Married)
cr_wrk$Dependents[is.na(cr_wrk$Dependents)] = Mode(cr_wrk$Dependents)
cr_wrk$Credit_History[is.na(cr_wrk$Credit_History)] = Mode(cr_wrk$Credit_History)

#  For numeric variables use mean value of the attribute
cr_wrk$LoanAmount[is.na(cr_wrk$LoanAmount)] <- mean(cr_wrk$LoanAmount, na.rm = T)
cr_wrk$Loan_Amount_Term[is.na(cr_wrk$Loan_Amount_Term)] <- mean(cr_wrk$Loan_Amount_Term, na.rm = T)

summary(cr_wrk) # Check summary of dataset after replacing NA values
##     Gender    Married   Dependents        Education   Self_Employed
##  Female:112   No :213   0 :360     Graduate    :480   No  :500     
##  Male  :502   Yes:401   1 :102     Not Graduate:134   Yes : 82     
##                         2 :101                        NA's: 32     
##                         3+: 51                                     
##                                                                    
##                                                                    
##  ApplicantIncome CoapplicantIncome   LoanAmount    Loan_Amount_Term
##  Min.   :  150   Min.   :    0     Min.   :  9.0   Min.   : 12     
##  1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.2   1st Qu.:360     
##  Median : 3812   Median : 1188     Median :129.0   Median :360     
##  Mean   : 5403   Mean   : 1621     Mean   :146.4   Mean   :342     
##  3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:164.8   3rd Qu.:360     
##  Max.   :81000   Max.   :41667     Max.   :700.0   Max.   :480     
##  Credit_History    Property_Area Loan_Status
##  Min.   :0.000   Rural    :179   N:192      
##  1st Qu.:1.000   Semiurban:233   Y:422      
##  Median :1.000   Urban    :202              
##  Mean   :0.855                              
##  3rd Qu.:1.000                              
##  Max.   :1.000

Let’s handle outliers for LoanAmount, ApplicantIncome & CoapplicantIncome by capping values that fall more than 3 standard deviations from the mean.

# Replace outlier with lower and upper cutoff value
out_std = function(x){
  m = mean(x)
  s = sd(x)
  lc = m-3*s
  uc= m+3*s
  n = sum(x>uc |  x<lc )
  val = list(num=n,lower_cutoff=lc,upper_cutoff=uc)
  return(val)
}

# Treatment of outlier for CoapplicantIncome
lc=out_std(cr_wrk$CoapplicantIncome)$lower_cutoff
uc=out_std(cr_wrk$CoapplicantIncome)$upper_cutoff
cr_wrk$CoapplicantIncome[cr_wrk$CoapplicantIncome>uc]=uc
cr_wrk$CoapplicantIncome[cr_wrk$CoapplicantIncome<lc]=lc

# Treatment of outlier for LoanAmount
cr_wrk$LoanAmount=as.numeric(cr_wrk$LoanAmount)
lc=out_std(cr_wrk$LoanAmount)$lower_cutoff
uc=out_std(cr_wrk$LoanAmount)$upper_cutoff
cr_wrk$LoanAmount[cr_wrk$LoanAmount>uc]=uc
cr_wrk$LoanAmount[cr_wrk$LoanAmount<lc]=lc

# Treatment of outlier for ApplicantIncome
lc=out_std(cr_wrk$ApplicantIncome)$lower_cutoff
uc=out_std(cr_wrk$ApplicantIncome)$upper_cutoff
cr_wrk$ApplicantIncome[cr_wrk$ApplicantIncome>uc]=uc
cr_wrk$ApplicantIncome[cr_wrk$ApplicantIncome<lc]=lc

summary(cr_wrk) # Check summary of dataset after outliers treatment
##     Gender    Married   Dependents        Education   Self_Employed
##  Female:112   No :213   0 :360     Graduate    :480   No  :500     
##  Male  :502   Yes:401   1 :102     Not Graduate:134   Yes : 82     
##                         2 :101                        NA's: 32     
##                         3+: 51                                     
##                                                                    
##                                                                    
##  ApplicantIncome CoapplicantIncome   LoanAmount    Loan_Amount_Term
##  Min.   :  150   Min.   :    0     Min.   :  9.0   Min.   : 12     
##  1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.2   1st Qu.:360     
##  Median : 3812   Median : 1188     Median :129.0   Median :360     
##  Mean   : 5109   Mean   : 1498     Mean   :143.5   Mean   :342     
##  3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:164.8   3rd Qu.:360     
##  Max.   :23731   Max.   :10400     Max.   :398.5   Max.   :480     
##  Credit_History    Property_Area Loan_Status
##  Min.   :0.000   Rural    :179   N:192      
##  1st Qu.:1.000   Semiurban:233   Y:422      
##  Median :1.000   Urban    :202              
##  Mean   :0.855                              
##  3rd Qu.:1.000                              
##  Max.   :1.000

Now, let’s handle the categorical attributes, since our regression model needs numeric inputs. We will create dummy variables for the categorical attributes to use in the regression model. If an attribute has only 2 unique values, create a single 1/0 dummy variable. If an attribute has more than 2 unique values, create a 1/0 dummy variable for each value.

table(cr_wrk$Property_Area) # use this command to check the number of unique values 
## 
##     Rural Semiurban     Urban 
##       179       233       202
# 2 Unique values treatment
cr_wrk$Dummy_Gender=ifelse(cr_wrk$Gender=="Male",1,0)
cr_wrk$Dummy_Married=ifelse(cr_wrk$Married=="Yes",1,0)
cr_wrk$Dummy_Education=ifelse(cr_wrk$Education=="Graduate",1,0)
cr_wrk$Dummy_Self_employed=ifelse(cr_wrk$Self_Employed=="Yes",1,0)

# More than 2 unique values treatment
cr_wrk$Dummy_Urban=ifelse(cr_wrk$Property_Area=="Urban",1,0)
cr_wrk$Dummy_Rural=ifelse(cr_wrk$Property_Area=="Rural",1,0)
cr_wrk$Dummy_Semiurban=ifelse(cr_wrk$Property_Area=="Semiurban",1,0)

cr_wrk$Dummy_Dep=as.numeric(substr(as.character(cr_wrk$Dependents),1,1)) # Take the first character, so "3+" becomes 3

cr_wrk$Loan_Status=ifelse(cr_wrk$Loan_Status=="Y",1,0)  # target response variable
# Check the transformed dataset before we go into logistic regression model

Correlation Matrix on transformed Credit Risk Dataset

Before we go into building the logistic regression model, let’s identify the variables that are highly correlated.

# Get numeric variables for correlation function
numeric <- cr_wrk[sapply(cr_wrk, is.numeric)]
descrCor <- cor(numeric)
corrplot(descrCor)

Building Logistic Regression Model

Before we train the model, let’s create a dataset containing only the dummy variables and the numeric amount variables for our regression model. We also keep a copy without the Loan_Status variable, since we are going to predict Loan_Status and then compare the predictions with the original values to check the model’s performance.

# Take dummy and amount variables for regression model
cr_df_train=select(cr_wrk,-Gender,-Married,-Education,-Self_Employed,-Dependents,-Property_Area) 

# Remove Loan Status as we are going to predict value of this attribute 
train_data = select(cr_df_train,-Loan_Status)

# Let's train the model and check the summary of the model
model1=glm(Loan_Status~., data=cr_df_train, family=binomial("logit"))
summary(model1)
## 
## Call:
## glm(formula = Loan_Status ~ ., family = binomial("logit"), data = cr_df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2798  -0.3845   0.5359   0.7111   2.5237  
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.138e+00  8.263e-01  -2.587  0.00967 ** 
## ApplicantIncome      1.847e-05  3.992e-05   0.463  0.64360    
## CoapplicantIncome    1.089e-05  6.790e-05   0.160  0.87254    
## LoanAmount          -3.653e-03  2.255e-03  -1.620  0.10520    
## Loan_Amount_Term    -6.819e-04  1.838e-03  -0.371  0.71070    
## Credit_History       3.871e+00  4.182e-01   9.254  < 2e-16 ***
## Dummy_Gender        -1.593e-02  3.113e-01  -0.051  0.95920    
## Dummy_Married        5.894e-01  2.597e-01   2.269  0.02324 *  
## Dummy_Education      4.817e-01  2.672e-01   1.803  0.07144 .  
## Dummy_Self_employed -2.586e-02  3.203e-01  -0.081  0.93565    
## Dummy_Urban         -6.689e-01  2.752e-01  -2.431  0.01507 *  
## Dummy_Rural         -8.407e-01  2.745e-01  -3.063  0.00219 ** 
## Dummy_Semiurban             NA         NA      NA       NA    
## Dummy_Dep            5.551e-02  1.221e-01   0.455  0.64931    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 724.71  on 581  degrees of freedom
## Residual deviance: 527.99  on 569  degrees of freedom
##   (32 observations deleted due to missingness)
## AIC: 553.99
## 
## Number of Fisher Scoring iterations: 5

Accuracy Test through Confusion Matrix, AIC, Null Deviance & ROC Curve

  1. AIC (Akaike Information Criteria) - AIC is the analogue of adjusted R² in logistic regression. It is a measure of fit that penalizes the model for the number of coefficients. Golden Rule - Lower the value, better the model.

  2. Null Deviance and Residual Deviance - Null deviance indicates how well the response is predicted by a model with nothing but an intercept, while residual deviance indicates how well it is predicted after adding the independent variables. Golden Rule - Lower the value, better the model.

  3. Confusion Matrix - It is a tabular representation of actual vs. predicted values, which helps us measure the accuracy of the model and check for overfitting when train and test results are compared. Golden Rule - Higher the accuracy, better the model. Now, let’s consider the confusion matrix in more detail. The terms true positive (TP), true negative (TN), false positive (FP), and false negative (FN) refer to the four cells of the confusion matrix.

Sensitivity = True Positive Rate = TP / (TP + FN); Specificity = True Negative Rate = TN / (TN + FP); Prevalence = (TP + FN) / Total Observations; Accuracy = (TP + TN) / Total Observations. (A short R sketch after this list shows how these metrics, along with AIC and the deviances, can be computed from the fitted model and its predictions.)

  4. ROC Curve - The Receiver Operating Characteristic (ROC) curve summarizes the model’s performance by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity). For plotting ROC, it is advisable to assume p > 0.5 since we are more concerned about the success rate; ROC summarizes the predictive power for all possible values of p > 0.5. The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, is a commonly used performance metric for the ROC curve. Golden Rule - Higher the area under the curve, better the prediction power of the model.
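As a quick sketch (assuming the model1 object fitted above; the helper name conf_metrics is purely illustrative), the AIC and both deviances can be read straight off the fitted glm object, and the confusion-matrix metrics defined above can be wrapped in a small helper function.

# AIC and deviances straight from the fitted glm object
AIC(model1)            # Akaike Information Criteria
model1$null.deviance   # deviance of the intercept-only model
model1$deviance        # residual deviance after adding the predictors

# Illustrative helper computing the confusion-matrix metrics from 0/1 predicted and actual vectors
conf_metrics = function(predicted, actual){
  cf = table(predicted = predicted, actual = actual)
  TN = cf[1,1]; TP = cf[2,2]; FP = cf[2,1]; FN = cf[1,2]
  total = TN + TP + FP + FN
  list(accuracy    = (TP + TN) / total,   # overall correct classification rate
       precision   = TP / (TP + FP),      # positive predictive value
       sensitivity = TP / (TP + FN),      # true positive rate
       specificity = TN / (TN + FP),      # true negative rate
       error       = (FP + FN) / total)   # misclassification rate
}

Once the predictions are computed in the next section, conf_metrics(fitted.results1, cr_df_train$Loan_Status) should reproduce the same numbers derived step by step below.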

Computing Accuracy and ROC Curve of the Model on Train Dataset

Let’s observe how the model built above has performed, using parameters like accuracy, precision, sensitivity, specificity and error. This can be achieved by building a confusion matrix from the predicted and actual values.

# Validation of our model using training dataset
fitted.results1 = predict(model1, newdata=train_data, type='response')

# If the predicted probability is at least 0.5, classify as 1, else 0
fitted.results1 = ifelse(fitted.results1 >=0.5,1,0)

# Evaluate predictions on the training dataset through Confusion Matrix
cf1 = table(predicted = fitted.results1, actual = cr_df_train$Loan_Status)
cf1 # Check Confusion Matrix
##          actual
## predicted   0   1
##         0  81   7
##         1 102 392
TN = cf1[1,1] # True Negative - Actual & Predicted is 0/N
TP = cf1[2,2] # True Positive - Actual & Predicted is 1/Y
FP = cf1[2,1] # False Positive - Actual is 0/N but Predicted is 1/Y
FN = cf1[1,2] # False Negative - Actual is 1/Y but Predicted is 0/N
TO = TN+TP+FP+FN # Total Observations

accuracy = (TP+TN)/TO # Accuracy of the model
accuracy # 81.27%
## [1] 0.8127148
precision = TP/(TP+FP) # Precision
precision # 79.35%
## [1] 0.7935223
sensitivity = TP/(TP+FN) # True Positive Rate
sensitivity # 98.24%
## [1] 0.9824561
specificity = TN/(TN+FP) # True Negative Rate
specificity # 44.26%
## [1] 0.442623
error = (FP+FN)/TO # Error Rate
error # 18.72%
## [1] 0.1872852

Inference - We see that the model is doing far better on sensitivity than on specificity. We will come back to these concepts later, but before that, let us check the ROC curve (receiver operating characteristic curve) and its Area Under the Curve (AUC).

library(pROC) # For checking ROC Curve of the model
library(ROCR)
# Option 1 
roccurve=roc(fitted.results1, cr_df_train$Loan_Status)
plot(roccurve, print.auc = TRUE) # 85.70%

auc(roccurve) # 85.70% 
## Area under the curve: 0.857
# Option 2
ROCRpred <- prediction(fitted.results1, cr_df_train$Loan_Status)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))

Predict Loan Status of Test Dataset

Now, it is time to predict the Loan_Status value of the test dataset. Let’s load the test dataset and, remember, we need to apply the same transformations to the test dataset as we did to the train dataset before passing it to the prediction model. Here our earlier functions come in handy, as we can reuse them. You could also wrap the whole transformation in a single function and reuse it for any dataset you want to pass to the regression model, as sketched below.
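As a hedged sketch of that idea (the function name prep_credit_data is hypothetical), the NA treatment and dummy-variable creation could be bundled into one reusable function and applied to any dataset with the same column names:

# Hypothetical reusable preparation function bundling the NA treatment and
# dummy-variable creation used above (sketch only)
prep_credit_data = function(df){
  # NA treatment: mean for numeric attributes, mode for categorical attributes
  df$LoanAmount[is.na(df$LoanAmount)] = mean(df$LoanAmount, na.rm = TRUE)
  df$Loan_Amount_Term[is.na(df$Loan_Amount_Term)] = mean(df$Loan_Amount_Term, na.rm = TRUE)
  df$Gender[is.na(df$Gender)] = Mode(df$Gender)
  df$Married[is.na(df$Married)] = Mode(df$Married)
  df$Dependents[is.na(df$Dependents)] = Mode(df$Dependents)
  df$Credit_History[is.na(df$Credit_History)] = Mode(df$Credit_History)
  # Dummy variables for categorical attributes
  df$Dummy_Gender        = ifelse(df$Gender == "Male", 1, 0)
  df$Dummy_Married       = ifelse(df$Married == "Yes", 1, 0)
  df$Dummy_Education     = ifelse(df$Education == "Graduate", 1, 0)
  df$Dummy_Self_employed = ifelse(df$Self_Employed == "Yes", 1, 0)
  df$Dummy_Urban         = ifelse(df$Property_Area == "Urban", 1, 0)
  df$Dummy_Rural         = ifelse(df$Property_Area == "Rural", 1, 0)
  df$Dummy_Semiurban     = ifelse(df$Property_Area == "Semiurban", 1, 0)
  df$Dummy_Dep           = as.numeric(substr(as.character(df$Dependents), 1, 1))
  return(df)
}
# e.g. cr_df_test = prep_credit_data(cr_df_test) would replace the manual steps below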

# Load the test/validation dataset
cr_df_test=read.csv("Credit_Risk_Validate_data.csv", na.strings=c("", "NA"))

# Prepare data for logistic regression model same as training dataset like NA treatment, dummy variables, etc.
cr_df_test=cr_df_test[-1] # Removing Loan_ID as it has no logical correlation

# Null values treatment
cr_df_test$LoanAmount[is.na(cr_df_test$LoanAmount)] <- mean(cr_df_test$LoanAmount, na.rm = T)
cr_df_test$Loan_Amount_Term[is.na(cr_df_test$Loan_Amount_Term)] <- mean(cr_df_test$Loan_Amount_Term, na.rm = T)
cr_df_test$Gender[is.na(cr_df_test$Gender)] = Mode(cr_df_test$Gender)
cr_df_test$Married[is.na(cr_df_test$Married)] = Mode(cr_df_test$Married)
cr_df_test$Dependents[is.na(cr_df_test$Dependents)] = Mode(cr_df_test$Dependents)
cr_df_test$Credit_History[is.na(cr_df_test$Credit_History)] = Mode(cr_df_test$Credit_History)

# Dummy variables creation for categorical attributes
cr_df_test$Dummy_Gender=ifelse(cr_df_test$Gender=="Male",1,0)
cr_df_test$Dummy_Married=ifelse(cr_df_test$Married=="Yes",1,0)
cr_df_test$Dummy_Education=ifelse(cr_df_test$Education=="Graduate",1,0)
cr_df_test$Dummy_Self_employed=ifelse(cr_df_test$Self_Employed=="Yes",1,0)
cr_df_test$Dummy_Urban=ifelse(cr_df_test$Property_Area=="Urban",1,0)
cr_df_test$Dummy_Rural=ifelse(cr_df_test$Property_Area=="Rural",1,0)
cr_df_test$Dummy_Semiurban=ifelse(cr_df_test$Property_Area=="Semiurban",1,0)
cr_df_test$Dummy_Dep=as.numeric(substr(as.character(cr_df_test$Dependents),1,1)) # take the first character

cr_df_test$outcome=ifelse(cr_df_test$outcome=="Y",1,0) # target response variable

# Remove the original categorical variables and the outcome column before prediction
test_data=select(cr_df_test,-Gender,-Married,-Education,-Self_Employed,-Dependents,-Property_Area,-outcome)

# Validation of our model using validation dataset
fitted.results2 = predict(model1, newdata=test_data, type='response')

# If the predicted probability is at least 0.5, classify as 1, else 0
fitted.results2 = ifelse(fitted.results2 >=0.5,1,0)

Computing Accuracy and ROC Curve of the Model on Test Dataset

Let’s check the model’s accuracy on test dataset parameters like accuracy, precision, sensitivity, specificity, error by building Confusion Matrix.

# Evaluate predictions on the test dataset through a Confusion Matrix
cf2 = table(predicted = fitted.results2, actual = cr_df_test$outcome)
cf2 # Check Confusion Matrix
##          actual
## predicted   0   1
##         0  49   1
##         1  16 278
TN = cf2[1,1] # True Negative - Actual & Predicted is 0/N
TP = cf2[2,2] # True Positive - Actual & Predicted is 1/Y
FP = cf2[2,1] # False Positive - Actual is 0/N but Predicted is 1/Y
FN = cf2[1,2] # False Negative - Actual is 1/Y but Predicted is 0/N
TO = TN+TP+FP+FN # Total Observations

accuracy = (TP+TN)/TO # Accuracy of the model
accuracy # 95.05%
## [1] 0.9505814
sensitivity = TP/(TP+FN) # True Positive Rate
sensitivity # 99.64%
## [1] 0.9964158
specificity = TN/(TN+FP) # True Negative Rate
specificity # 75.38%
## [1] 0.7538462
error = (FP+FN)/TO # Error Rate
error # 4.94%
## [1] 0.0494186
# ROC curve (receiver operating characteristic curve)
roccurve=roc(fitted.results2, cr_df_test$outcome)
plot(roccurve, print.auc = TRUE) # 96.30%

auc(roccurve) # 96.28%
## Area under the curve: 0.9628

That’s it for now. There is certainly much more we could cover, such as combinations of predictors that might improve the model’s accuracy, how to handle imbalanced classification, and how to transform or select predictor variables to boost accuracy. We will cover these in another post; until then, Happy Learning!