Reading & Explaining the Data

Please review the code below and execute it in your own RMD file.

Printing the Column Names of the Data

##  [1] "Agmt.No"        "ContractStatus" "StartDate"      "AGE"           
##  [5] "NOOFDEPE"       "MTHINCTH"       "SALDATFR"       "TENORYR"       
##  [9] "DWNPMFR"        "PROFBUS"        "QUALHSC"        "QUAL_PG"       
## [13] "SEXCODE"        "FULLPDC"        "FRICODE"        "WASHCODE"      
## [17] "Region"         "Branch"         "DefaulterFlag"  "DefaulterType" 
## [21] "DATASET"

List of Data Columns

DEFAULT

1. Defaulter Flag

  • 1: Customer has delayed paying at least once

  • 0: Otherwise

DEMOGRAPHIC VARIABLES

1. Gender

  • SEXCODE = 1 (Male)

  • SEXCODE = 0 (Female)

2. Age

3. Education

  • QUALHSC

  • QUAL_PG

4. Income

  • Monthly Income in Thousands (MTHINCTH)

  • Owns a Fridge (FRICODE)

  • Owns a Washing Machine (WASHCODE)

5. Profession

  • PROFBUS = 1 (BUSINESS)
  • PROFBUS = 0 (PROFESSIONAL)

6. No.of Dependents

  • NOOFDEPE

7. Region

Structure of the Dataset

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ QUALHSC       : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ QUAL_PG       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SEXCODE       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FULLPDC       : int  1 1 1 1 1 0 0 1 1 1 ...
##  $ FRICODE       : int  0 1 1 1 1 0 0 0 0 0 ...
##  $ WASHCODE      : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ Region        : chr  "AP2" "AP2" "AP2" "AP2" ...
##  $ Branch        : chr  "Vizag" "Vizag" "Vizag" "Vizag" ...
##  $ DefaulterFlag : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DefaulterType : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...

Convert catgorical variables to factor

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ QUALHSC       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ QUAL_PG       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SEXCODE       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ FULLPDC       : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 2 ...
##  $ FRICODE       : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 1 1 ...
##  $ WASHCODE      : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
##  $ Region        : Factor w/ 8 levels "AP1","AP2","Chennai",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Branch        : Factor w/ 14 levels "Bangalore","Chennai",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ DefaulterFlag : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DefaulterType : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...

Section 1: Decision Tree

We made the Decision Tree using Gini on training dataset, the decision tree as shown below.

## Loading required package: lattice
## Loading required package: ggplot2
## CART 
## 
## 21679 samples
##    14 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 19511, 19511, 19512, 19511, 19511, 19511, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01024164  0.7292763  0.2129099
##   0.01152184  0.7269239  0.2189725
##   0.01408225  0.7186214  0.1326211
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01024164.
## Loading required package: rpart

Using Decision Tree, We made Confusion Matrix, shown below, assuming threshold probability of 50%.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   434  297
##        Yes 1649 4847
##                                           
##                Accuracy : 0.7307          
##                  95% CI : (0.7203, 0.7409)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 0.0001805       
##                                           
##                   Kappa : 0.1867          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9423          
##             Specificity : 0.2084          
##          Pos Pred Value : 0.7462          
##          Neg Pred Value : 0.5937          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6707          
##    Detection Prevalence : 0.8989          
##       Balanced Accuracy : 0.5753          
##                                           
##        'Positive' Class : Yes             
## 

Section 2: Random Forest

Que 2. Using Random Forest model , Write R Code to generate following Confusion Matrix, shown below, assuming threshold probability of 50%.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   715  558
##        Yes 1368 4586
##                                           
##                Accuracy : 0.7335          
##                  95% CI : (0.7231, 0.7437)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 2.118e-05       
##                                           
##                   Kappa : 0.2655          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8915          
##             Specificity : 0.3433          
##          Pos Pred Value : 0.7702          
##          Neg Pred Value : 0.5617          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6346          
##    Detection Prevalence : 0.8239          
##       Balanced Accuracy : 0.6174          
##                                           
##        'Positive' Class : Yes             
## 

Que 6- Which Machine Learning technique (Decision Tree or Random Forest) is better based on a) Accuracy, b) Sensitivity, c) Specificity, d) AUC? Please explain your reasoning in two or three sentences.

Section 3: Bagging

Que 8. Using Bagging model , Write R Code to generate following Confusion Matrix, shown below, assuming threshold probability of 50%.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   744  706
##        Yes 1339 4438
##                                           
##                Accuracy : 0.717           
##                  95% CI : (0.7065, 0.7274)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 0.1651          
##                                           
##                   Kappa : 0.2418          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8628          
##             Specificity : 0.3572          
##          Pos Pred Value : 0.7682          
##          Neg Pred Value : 0.5131          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6141          
##    Detection Prevalence : 0.7994          
##       Balanced Accuracy : 0.6100          
##                                           
##        'Positive' Class : Yes             
## 

Que 11- Which Machine Learning technique (Bagging or Random Forest) is better based on a) Accuracy, b) Sensitivity, c) Specificity, d) AUC? Please explain your reasoning in two or three sentences.

Que 13- Which Model is doing better (Decision Tree / Random Forest / Bagging)? Please rank order the models, based on 1) Accuracy, 2) Sensitivity, 3) Specificity, 4) AUC. Please explain your reasoning in four or five sentences.