Reading & Explaining the Data

Please review the code below and execute it in your own RMD file.

Printing the Column Names of the Data

##  [1] "Agmt.No"        "ContractStatus" "StartDate"      "AGE"           
##  [5] "NOOFDEPE"       "MTHINCTH"       "SALDATFR"       "TENORYR"       
##  [9] "DWNPMFR"        "PROFBUS"        "QUALHSC"        "QUAL_PG"       
## [13] "SEXCODE"        "FULLPDC"        "FRICODE"        "WASHCODE"      
## [17] "Region"         "Branch"         "DefaulterFlag"  "DefaulterType" 
## [21] "DATASET"

List of Data Columns

DEFAULT

1. Defaulter Flag

  • 1: Customer has delayed paying at least once

  • 0: Otherwise

DEMOGRAPHIC VARIABLES

1. Gender

  • SEXCODE = 1 (Male)

  • SEXCODE = 0 (Female)

2. Age

3. Education

  • QUALHSC

  • QUAL_PG

4. Income

  • Monthly Income in Thousands (MTHINCTH)

  • Owns a Fridge (FRICODE)

  • Owns a Washing Machine (WASHCODE)

5. Profession

  • PROFBUS = 1 (BUSINESS)
  • PROFBUS = 0 (PROFESSIONAL)

6. No.of Dependents

  • NOOFDEPE

7. Region

Structure of the Dataset

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ QUALHSC       : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ QUAL_PG       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SEXCODE       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FULLPDC       : int  1 1 1 1 1 0 0 1 1 1 ...
##  $ FRICODE       : int  0 1 1 1 1 0 0 0 0 0 ...
##  $ WASHCODE      : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ Region        : chr  "AP2" "AP2" "AP2" "AP2" ...
##  $ Branch        : chr  "Vizag" "Vizag" "Vizag" "Vizag" ...
##  $ DefaulterFlag : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DefaulterType : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...

Convert catgorical variables to factor

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ QUALHSC       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ QUAL_PG       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SEXCODE       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ FULLPDC       : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 2 ...
##  $ FRICODE       : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 1 1 ...
##  $ WASHCODE      : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
##  $ Region        : Factor w/ 8 levels "AP1","AP2","Chennai",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Branch        : Factor w/ 14 levels "Bangalore","Chennai",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ DefaulterFlag : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DefaulterType : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...

Assignment: Part 2 (Logistic Regression)

We Ran Logistics Regression Model using glm() on the training data, the output is as follows.

## 
## Call:
## glm(formula = DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + NOOFDEPE + 
##     SALDATFR + TENORYR + DWNPMFR + PROFBUS + QUALHSC + QUAL_PG + 
##     SEXCODE + FULLPDC + FRICODE + WASHCODE + Region, family = binomial(), 
##     data = trainingSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7470  -1.0215   0.5716   0.7801   2.0874  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    2.2291732  0.1890284  11.793  < 2e-16 ***
## AGE           -0.0144968  0.0016680  -8.691  < 2e-16 ***
## NOOFDEPE       0.0566860  0.0107923   5.252 1.50e-07 ***
## MTHINCTH      -0.0004025  0.0035613  -0.113 0.910023    
## SALDATFR      -0.3833870  0.0420223  -9.123  < 2e-16 ***
## TENORYR        0.7727065  0.0456475  16.928  < 2e-16 ***
## DWNPMFR       -1.3074501  0.1274734 -10.257  < 2e-16 ***
## PROFBUS1       0.1966576  0.0487903   4.031 5.56e-05 ***
## QUALHSC1       0.1853120  0.0401652   4.614 3.95e-06 ***
## QUAL_PG1      -0.2990904  0.0787907  -3.796 0.000147 ***
## SEXCODE1       0.2339445  0.0600322   3.897 9.74e-05 ***
## FULLPDC1      -1.2365885  0.0368674 -33.541  < 2e-16 ***
## FRICODE1      -0.1761473  0.0377247  -4.669 3.02e-06 ***
## WASHCODE1     -0.2644245  0.0476814  -5.546 2.93e-08 ***
## RegionAP2     -0.5788864  0.1796029  -3.223 0.001268 ** 
## RegionChennai -1.4136987  0.1408192 -10.039  < 2e-16 ***
## RegionKA1     -0.6529787  0.1411987  -4.625 3.75e-06 ***
## RegionKE2     -0.5753874  0.1450753  -3.966 7.30e-05 ***
## RegionTN1     -0.8084619  0.1362745  -5.933 2.98e-09 ***
## RegionTN2     -0.6142186  0.1458691  -4.211 2.55e-05 ***
## RegionVellore -0.6570233  0.1595604  -4.118 3.83e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26040  on 21678  degrees of freedom
## Residual deviance: 23025  on 21658  degrees of freedom
## AIC: 23067
## 
## Number of Fisher Scoring iterations: 4

Que 1- Write R code for building the Logistics Regression Model with Interaction using glm() on the training data, the expected output should be as follows.

## 
## Call:
## glm(formula = DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + NOOFDEPE + 
##     SALDATFR + TENORYR + DWNPMFR + PROFBUS + QUALHSC + QUAL_PG + 
##     SEXCODE + FULLPDC + FRICODE + WASHCODE + Region + TENORYR * 
##     FULLPDC, family = binomial(), data = trainingSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5410  -1.0043   0.5721   0.7620   2.1539  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       2.4938627  0.1945079  12.821  < 2e-16 ***
## AGE              -0.0144308  0.0016723  -8.629  < 2e-16 ***
## NOOFDEPE          0.0569949  0.0108145   5.270 1.36e-07 ***
## MTHINCTH          0.0006478  0.0035839   0.181 0.856559    
## SALDATFR         -0.3900738  0.0422177  -9.240  < 2e-16 ***
## TENORYR           0.5723502  0.0557250  10.271  < 2e-16 ***
## DWNPMFR          -1.3175535  0.1275771 -10.328  < 2e-16 ***
## PROFBUS1          0.2100458  0.0491541   4.273 1.93e-05 ***
## QUALHSC1          0.1835106  0.0402820   4.556 5.22e-06 ***
## QUAL_PG1         -0.2972520  0.0789155  -3.767 0.000165 ***
## SEXCODE1          0.2352736  0.0600159   3.920 8.85e-05 ***
## FULLPDC1         -1.7550203  0.0952821 -18.419  < 2e-16 ***
## FRICODE1         -0.1719774  0.0378111  -4.548 5.41e-06 ***
## WASHCODE1        -0.2585977  0.0477647  -5.414 6.16e-08 ***
## RegionAP2        -0.5817435  0.1797634  -3.236 0.001211 ** 
## RegionChennai    -1.4436197  0.1414897 -10.203  < 2e-16 ***
## RegionKA1        -0.6604584  0.1417572  -4.659 3.18e-06 ***
## RegionKE2        -0.5973660  0.1458215  -4.097 4.19e-05 ***
## RegionTN1        -0.8355176  0.1369630  -6.100 1.06e-09 ***
## RegionTN2        -0.6171896  0.1464390  -4.215 2.50e-05 ***
## RegionVellore    -0.7037441  0.1604099  -4.387 1.15e-05 ***
## TENORYR:FULLPDC1  0.4227544  0.0714356   5.918 3.26e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26040  on 21678  degrees of freedom
## Residual deviance: 22990  on 21657  degrees of freedom
## AIC: 23034
## 
## Number of Fisher Scoring iterations: 4

Defaulters based on whether they gave post-dated cheques FULLPDC

##              FULLPDC
## DefaulterFlag     0     1
##             0 38.55 61.45
##             1 69.96 30.04

Consider a consumer having the following characteristics:

  • AGE = mean(AGE),

  • Male,

  • Education = UG,

  • MTHINCTH = mean(MTHINCTH)

  • NoOfDepe = mean(NOOFDEPE),

  • Owns a Fridge,

  • Owns a Washing Machine

  • Working Professional,

  • SALDATFR = mean(SALDATFR),

  • Lives in TN1

  • Down Payment = mean(DWNPMFR) %

Que 2- Using Model2, Calculate probability of defaulting on the auto finance loan for consumers paying using post dated check, when the loan tenure is equal to 1 year?

##        1 
## 0.420266

Que 3- Using Model2, Calculate probability of defaulting on the auto finance loan for consumers paying using post dated check, when the loan tenure is equal to 2 year?

##         1 
## 0.6622693

Que 4- Using Model2, Calculate the change in probability of defaulting on the auto finance loan for consumers paying using post dated check, when the loan tenure increases from 1 year to 2 year duration?

##         1 
## 0.2420032

Que 7- Using Model2, Calculate the change in probability of defaulting on the auto finance loan for consumers who do not pay using posted a cheque, when the loan tenure increases from 1 year to 2 year duration?

##          1 
## 0.09648523

Que 8- What problem would you face if you try to solve Que 4 & Que 7 using Model1 (without interaction), please explain in detail.

Que 9- Write R code to predict the probabilities of default of the consumers in the testing data using Model1.

Que 10- Write R Code to generate the Confusion Matrix, shown below, assuming threshold probability of 50%, using Model1.

## Loading required package: lattice
## Loading required package: ggplot2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   635  417
##        Yes 1448 4727
##                                          
##                Accuracy : 0.7419         
##                  95% CI : (0.7317, 0.752)
##     No Information Rate : 0.7118         
##     P-Value [Acc > NIR] : 5.682e-09      
##                                          
##                   Kappa : 0.2624         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9189         
##             Specificity : 0.3048         
##          Pos Pred Value : 0.7655         
##          Neg Pred Value : 0.6036         
##              Prevalence : 0.7118         
##          Detection Rate : 0.6541         
##    Detection Prevalence : 0.8544         
##       Balanced Accuracy : 0.6119         
##                                          
##        'Positive' Class : Yes            
## 

Que 12- Write R code to to calculate area under curve (AUC) Model1

## [1] 0.7263962

Que 13- Write R Code to plot the Decision Tree, shown below, using traning dataset.

## CART 
## 
## 21679 samples
##    14 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 19511, 19511, 19512, 19511, 19511, 19511, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01024164  0.7292763  0.2129099
##   0.01152184  0.7269239  0.2189725
##   0.01408225  0.7186214  0.1326211
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01024164.
## Loading required package: rpart

Que 14- Write R code to predict the probabilities of default of the consumers in the testing data using Decision Tree.

Que 15- Write R Code to generate the Confusion Matrix, shown below, assuming threshold probability of 50%, using Decision Tree

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   434  297
##        Yes 1649 4847
##                                           
##                Accuracy : 0.7307          
##                  95% CI : (0.7203, 0.7409)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 0.0001805       
##                                           
##                   Kappa : 0.1867          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9423          
##             Specificity : 0.2084          
##          Pos Pred Value : 0.7462          
##          Neg Pred Value : 0.5937          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6707          
##    Detection Prevalence : 0.8989          
##       Balanced Accuracy : 0.5753          
##                                           
##        'Positive' Class : Yes             
## 

Que 17- Write R code to to calculate area under curve (AUC) Using Decision Tree

## [1] 0.6829464

Que 18- On the basis of ROC & AUC which Model is doing better job (Logistic Regression or Decision Tree), please explain in detail.