Reading & Explaining the Data

Please review the code below and execute it in your own R Markdown (.Rmd) file.

Printing the Column Names of the Data

##  [1] "Agmt.No"        "ContractStatus" "StartDate"      "AGE"           
##  [5] "NOOFDEPE"       "MTHINCTH"       "SALDATFR"       "TENORYR"       
##  [9] "DWNPMFR"        "PROFBUS"        "QUALHSC"        "QUAL_PG"       
## [13] "SEXCODE"        "FULLPDC"        "FRICODE"        "WASHCODE"      
## [17] "Region"         "Branch"         "DefaulterFlag"  "DefaulterType" 
## [21] "DATASET"

List of Data Columns

DEFAULT

1. Defaulter Flag

  • 1: Customer has delayed paying at least once

  • 0: Otherwise

DEMOGRAPHIC VARIABLES

1. Gender

  • SEXCODE = 1 (Male)

  • SEXCODE = 0 (Female)

2. Age

3. Education

  • QUALHSC

  • QUAL_PG

4. Income

  • Monthly Income in Thousands (MTHINCTH)

  • Owns a Fridge (FRICODE)

  • Owns a Washing Machine (WASHCODE)

5. Profession

  • PROFBUS = 1 (BUSINESS)

  • PROFBUS = 0 (PROFESSIONAL)

6. No. of Dependents

  • NOOFDEPE

7. Region

Structure of the Dataset

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ QUALHSC       : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ QUAL_PG       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SEXCODE       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FULLPDC       : int  1 1 1 1 1 0 0 1 1 1 ...
##  $ FRICODE       : int  0 1 1 1 1 0 0 0 0 0 ...
##  $ WASHCODE      : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ Region        : chr  "AP2" "AP2" "AP2" "AP2" ...
##  $ Branch        : chr  "Vizag" "Vizag" "Vizag" "Vizag" ...
##  $ DefaulterFlag : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DefaulterType : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...

Convert categorical variables to factors

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ QUALHSC       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ QUAL_PG       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SEXCODE       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ FULLPDC       : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 2 ...
##  $ FRICODE       : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 1 1 ...
##  $ WASHCODE      : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
##  $ Region        : Factor w/ 8 levels "AP1","AP2","Chennai",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Branch        : Factor w/ 14 levels "Bangalore","Chennai",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ DefaulterFlag : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DefaulterType : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...
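
Comparing the two listings above, the columns PROFBUS through DefaulterType changed from int or chr to Factor. A minimal sketch of one way to do such a conversion is given here. The data frame name `loan_data` is hypothetical, the rows are made up for illustration, and only a few of the converted columns are included.

```r
# Toy stand-in for the real data; values are illustrative only.
loan_data <- data.frame(
  AGE           = c(26L, 28L, 32L, 31L),
  PROFBUS       = c(0L, 0L, 1L, 0L),
  SEXCODE       = c(1L, 1L, 0L, 1L),
  Region        = c("AP2", "AP2", "AP1", "Chennai"),
  DefaulterFlag = c(0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Columns that hold categories rather than quantities.
cat_cols <- c("PROFBUS", "SEXCODE", "Region", "DefaulterFlag")

# lapply() returns a list of factors, assigned back column by column.
loan_data[cat_cols] <- lapply(loan_data[cat_cols], factor)

str(loan_data)
```

Numeric columns such as AGE are left untouched, which matches the second listing, where AGE, NOOFDEPE and the other measurements keep their original types.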

Section 1: Decision Tree

We fit the decision tree on the training dataset using the Gini index as the split criterion. The fitted model is summarized below.

## Loading required package: lattice
## Loading required package: ggplot2
## CART 
## 
## 21679 samples
##    14 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 19511, 19511, 19512, 19511, 19511, 19511, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01024164  0.7292763  0.2129099
##   0.01152184  0.7269239  0.2189725
##   0.01408225  0.7186214  0.1326211
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01024164.
## Loading required package: rpart
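
The output above comes from a cross-validated CART fit. As a rough illustration of the underlying model, here is a sketch that calls `rpart` directly with the Gini split criterion and the cp value reported above. It skips the 10-fold cross-validation, and the data are synthetic: column names mirror the dataset, but the values and the relationship to the outcome are made up.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(42)
n <- 500
# Synthetic training data; values are made up for illustration.
train_toy <- data.frame(
  AGE      = sample(21:60, n, replace = TRUE),
  MTHINCTH = round(runif(n, 2, 30), 2),
  FULLPDC  = factor(sample(0:1, n, replace = TRUE))
)
# Assumed (not real) dependence on FULLPDC so the tree has a signal to find.
p_flag <- ifelse(train_toy$FULLPDC == "1", 0.85, 0.35)
train_toy$DefaulterFlag <- factor(rbinom(n, 1, p_flag), levels = c(0, 1))

tree_fit <- rpart(
  DefaulterFlag ~ .,
  data    = train_toy,
  method  = "class",                        # classification tree
  parms   = list(split = "gini"),           # Gini impurity as the split criterion
  control = rpart.control(cp = 0.01024164)  # cp value reported in the output above
)

# Class probabilities; columns are named after the factor levels "0" and "1".
tree_prob <- predict(tree_fit, newdata = train_toy, type = "prob")
print(tree_fit)
```

The `cp` (complexity parameter) controls pruning: a split is kept only if it improves the fit by at least that fraction, so larger values give smaller trees.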

Using the decision tree, we built the confusion matrix shown below, assuming a threshold probability of 50%.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   434  297
##        Yes 1649 4847
##                                           
##                Accuracy : 0.7307          
##                  95% CI : (0.7203, 0.7409)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 0.0001805       
##                                           
##                   Kappa : 0.1867          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9423          
##             Specificity : 0.2084          
##          Pos Pred Value : 0.7462          
##          Neg Pred Value : 0.5937          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6707          
##    Detection Prevalence : 0.8989          
##       Balanced Accuracy : 0.5753          
##                                           
##        'Positive' Class : Yes             
## 
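
The statistics in the output above can be reproduced by hand from the four counts in the matrix. The sketch below copies those counts and recomputes sensitivity, specificity, accuracy and balanced accuracy, treating "Yes" as the positive class as the output states. Variable names are illustrative.

```r
# Counts copied from the decision tree confusion matrix above.
# Rows are predictions, columns are reference (actual) classes.
cm <- matrix(
  c(434, 1649,   # Reference = No : predicted No, predicted Yes
    297, 4847),  # Reference = Yes: predicted No, predicted Yes
  nrow = 2,
  dimnames = list(Prediction = c("No", "Yes"), Reference = c("No", "Yes"))
)

tp <- cm["Yes", "Yes"]; fn <- cm["No", "Yes"]
tn <- cm["No", "No"];   fp <- cm["Yes", "No"]

sensitivity <- tp / (tp + fn)         # share of actual Yes predicted as Yes
specificity <- tn / (tn + fp)         # share of actual No predicted as No
accuracy    <- (tp + tn) / sum(cm)    # share of all cases predicted correctly
balanced    <- (sensitivity + specificity) / 2

round(c(Sensitivity = sensitivity, Specificity = specificity,
        Accuracy = accuracy, Balanced = balanced), 4)
```

The rounded results are 0.9423, 0.2084, 0.7307 and 0.5753, which agree with the reported statistics.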

Section 2: Random Forest

Que 2. Using a Random Forest model, write R code to generate the confusion matrix shown below, assuming a threshold probability of 50%.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   710  552
##        Yes 1373 4592
##                                           
##                Accuracy : 0.7336          
##                  95% CI : (0.7233, 0.7438)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 1.89e-05        
##                                           
##                   Kappa : 0.2646          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8927          
##             Specificity : 0.3409          
##          Pos Pred Value : 0.7698          
##          Neg Pred Value : 0.5626          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6354          
##    Detection Prevalence : 0.8254          
##       Balanced Accuracy : 0.6168          
##                                           
##        'Positive' Class : Yes             
## 
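
The "threshold probability of 50%" step is the same for every model: predicted probabilities of the positive class are turned into class labels, which are then cross-tabulated against the actual labels. A small sketch with hypothetical probabilities and labels (not taken from the dataset):

```r
# Hypothetical predicted probabilities of class "Yes" and the true labels.
prob_yes <- c(0.91, 0.48, 0.66, 0.30, 0.75, 0.52)
actual   <- factor(c("Yes", "No", "Yes", "No", "No", "Yes"),
                   levels = c("No", "Yes"))

# Apply the 50% threshold: predict "Yes" when the probability exceeds 0.5.
predicted <- factor(ifelse(prob_yes > 0.5, "Yes", "No"),
                    levels = c("No", "Yes"))

# Same layout as the matrices in this document: predictions in rows.
conf_tab <- table(Prediction = predicted, Reference = actual)
conf_tab
```

Fixing the factor levels on both vectors keeps the table 2 x 2 even if one class is never predicted.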

Que 3- Which Machine Learning technique (Decision Tree or Random Forest) is better based on Sensitivity? Explain your reasoning.

Que 4- Which Machine Learning technique (Decision Tree or Random Forest) is better based on Specificity? Explain your reasoning.

Que 7- Which Machine Learning technique (Decision Tree or Random Forest) is better based on AUC? Explain your reasoning.
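
Unlike sensitivity and specificity, AUC cannot be read off a confusion matrix, because it depends on the predicted probabilities rather than on one threshold. A minimal base-R sketch of the rank-based calculation follows. The function name `auc_rank` and the scores are hypothetical and only illustrate the formula.

```r
# AUC via the rank-sum (Mann-Whitney) identity:
# the probability that a random positive scores higher than a random negative.
auc_rank <- function(score, is_positive) {
  r     <- rank(score)               # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores for illustration only.
score  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)
is_pos <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc_rank(score, is_pos)
```

In this toy example 8 of the 9 positive-negative pairs are ordered correctly, so the result is 8/9. A value of 0.5 corresponds to random ranking and 1 to perfect separation.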

Section 3: Bagging

Que 9. Using a Bagging model, write R code to generate the confusion matrix shown below, assuming a threshold probability of 50%.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   744  706
##        Yes 1339 4438
##                                           
##                Accuracy : 0.717           
##                  95% CI : (0.7065, 0.7274)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 0.1651          
##                                           
##                   Kappa : 0.2418          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8628          
##             Specificity : 0.3572          
##          Pos Pred Value : 0.7682          
##          Neg Pred Value : 0.5131          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6141          
##    Detection Prevalence : 0.7994          
##       Balanced Accuracy : 0.6100          
##                                           
##        'Positive' Class : Yes             
## 
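
Bagging (bootstrap aggregating) fits the same learner on many bootstrap resamples and averages the predictions. The source does not show which package produced the matrix above, so the sketch below is only an illustration of the idea, built by hand from `rpart` trees on synthetic data. Column names mirror the dataset, while the values and the assumed link to the outcome are made up.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(7)
n <- 400
# Synthetic data; values are made up for illustration.
toy <- data.frame(
  AGE     = sample(21:60, n, replace = TRUE),
  DWNPMFR = round(runif(n, 0.1, 0.7), 2)
)
# Assumed (not real) relationship between down payment and the outcome.
toy$DefaulterFlag <- factor(rbinom(n, 1, plogis(2 - 5 * toy$DWNPMFR)),
                            levels = c(0, 1))

n_trees <- 25
# Fit one tree per bootstrap resample (rows drawn with replacement).
trees <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(n, n, replace = TRUE)
  rpart(DefaulterFlag ~ ., data = toy[idx, ], method = "class")
})

# Average the per-tree probability of class "1", then apply the 50% threshold.
prob_mat  <- sapply(trees, function(fit) {
  predict(fit, newdata = toy, type = "prob")[, "1"]
})
bag_prob  <- rowMeans(prob_mat)
bag_class <- factor(ifelse(bag_prob > 0.5, 1, 0), levels = c(0, 1))

table(Prediction = bag_class, Reference = toy$DefaulterFlag)
```

A random forest differs from this scheme by also restricting each split to a random subset of the predictors, which is why the two models are compared separately in the questions that follow.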

Que 10- Which Machine Learning technique (Bagging or Random Forest) is better based on Sensitivity? Explain your reasoning.

Que 11- Which Machine Learning technique (Bagging or Random Forest) is better based on Specificity? Explain your reasoning.

Que 14- Which Machine Learning technique (Bagging or Random Forest) is better based on AUC? Explain your reasoning.

Que 16- On the basis of ROC and AUC, which model (Decision Tree, Random Forest or Bagging) does the best job? Please explain in detail.

Que 17- Based on the overall analysis, which ML technique is better and why? Please explain in detail.