Reading & Explaining the Data

Please review the code below and run it in your own R Markdown (.Rmd) file.

Printing the Column Names of the Data

##  [1] "Agmt.No"        "ContractStatus" "StartDate"      "AGE"           
##  [5] "NOOFDEPE"       "MTHINCTH"       "SALDATFR"       "TENORYR"       
##  [9] "DWNPMFR"        "PROFBUS"        "QUALHSC"        "QUAL_PG"       
## [13] "SEXCODE"        "FULLPDC"        "FRICODE"        "WASHCODE"      
## [17] "Region"         "Branch"         "DefaulterFlag"  "DefaulterType" 
## [21] "DATASET"
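
A listing like the one above is what names() prints for a data frame. A minimal sketch, assuming a small toy data frame called loans (a hypothetical name; the source does not show the object or file name):

```r
# Toy stand-in for the loan data. `loans` is a hypothetical object name and
# only three of the 21 columns are reproduced here.
loans <- data.frame(
  Agmt.No       = c("AP18100057", "AP18100140"),
  AGE           = c(26L, 28L),
  DefaulterFlag = c(0L, 0L),
  stringsAsFactors = FALSE
)

# names() returns the column names as a character vector.
col_names <- names(loans)
print(col_names)
```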

List of Data Columns

DEFAULT

1. Defaulter Flag

  • 1: Customer has delayed paying at least once

  • 0: Otherwise

DEMOGRAPHIC VARIABLES

1. Gender

  • SEXCODE = 1 (Male)

  • SEXCODE = 0 (Female)

2. Age

3. Education

  • QUALHSC

  • QUAL_PG

4. Income

  • Monthly Income in Thousands (MTHINCTH)

  • Owns a Fridge (FRICODE)

  • Owns a Washing Machine (WASHCODE)

5. Profession

  • PROFBUS = 1 (BUSINESS)
  • PROFBUS = 0 (PROFESSIONAL)

6. Number of Dependents

  • NOOFDEPE

7. Region

Structure of the Dataset

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ QUALHSC       : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ QUAL_PG       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SEXCODE       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FULLPDC       : int  1 1 1 1 1 0 0 1 1 1 ...
##  $ FRICODE       : int  0 1 1 1 1 0 0 0 0 0 ...
##  $ WASHCODE      : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ Region        : chr  "AP2" "AP2" "AP2" "AP2" ...
##  $ Branch        : chr  "Vizag" "Vizag" "Vizag" "Vizag" ...
##  $ DefaulterFlag : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DefaulterType : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...

Convert Categorical Variables to Factors

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ QUALHSC       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ QUAL_PG       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SEXCODE       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ FULLPDC       : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 2 ...
##  $ FRICODE       : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 1 1 ...
##  $ WASHCODE      : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
##  $ Region        : Factor w/ 8 levels "AP1","AP2","Chennai",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Branch        : Factor w/ 14 levels "Bangalore","Chennai",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ DefaulterFlag : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DefaulterType : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...
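
The change from int/chr to Factor in the structure above can be reproduced along these lines. This is a sketch on a toy data frame (loans is a hypothetical name), not necessarily the code used to produce the output:

```r
# Toy stand-in for the loan data (hypothetical object name `loans`).
loans <- data.frame(
  AGE           = c(26L, 28L, 32L, 41L),
  PROFBUS       = c(0L, 0L, 1L, 0L),
  SEXCODE       = c(1L, 1L, 0L, 1L),
  Region        = c("AP2", "AP2", "TN1", "KA1"),
  DefaulterFlag = c(0L, 1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical. In the full data the list would also cover
# QUALHSC, QUAL_PG, FULLPDC, FRICODE, WASHCODE, Branch and DefaulterType.
cat_cols <- c("PROFBUS", "SEXCODE", "Region", "DefaulterFlag")

# lapply() applies factor() to each selected column; assigning back into
# loans[cat_cols] replaces those columns in place.
loans[cat_cols] <- lapply(loans[cat_cols], factor)

# Numeric columns such as AGE are left untouched.
str(loans)
```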

Descriptive Statistics

##                     n     mean      sd   median   min      max
## Agmt.No*        28906 14453.50 8344.59 14453.50  1.00 28906.00
## ContractStatus* 28906     1.33    0.77     1.00  1.00     4.00
## StartDate*      28906   827.85  552.47   812.00  1.00  1814.00
## AGE             28906    36.44    9.82    35.00 18.00    70.00
## NOOFDEPE        28906     2.85    1.61     3.00  0.00    10.00
## MTHINCTH        28906     8.94    4.81     8.00  0.10    39.50
## SALDATFR        28906     0.44    0.46     0.17  0.03     1.03
## TENORYR         28906     1.28    0.52     1.00  0.17     4.00
## DWNPMFR         28906     0.38    0.16     0.38  0.02     0.88
## PROFBUS*        28906     1.15    0.36     1.00  1.00     2.00
## QUALHSC*        28906     1.23    0.42     1.00  1.00     2.00
## QUAL_PG*        28906     1.04    0.20     1.00  1.00     2.00
## SEXCODE*        28906     1.92    0.27     2.00  1.00     2.00
## FULLPDC*        28906     1.39    0.49     1.00  1.00     2.00
## FRICODE*        28906     1.42    0.49     1.00  1.00     2.00
## WASHCODE*       28906     1.19    0.39     1.00  1.00     2.00
## Region*         28906     5.33    1.51     6.00  1.00     8.00
## Branch*         28906     5.93    3.47     6.00  1.00    14.00
## DefaulterFlag*  28906     1.71    0.45     2.00  1.00     2.00
## DefaulterType*  28906     1.85    0.63     2.00  1.00     3.00
## DATASET*        28906     2.52    0.50     3.00  1.00     3.00

Assignment: Part 2 (Logistic Regression)

Que 1- Write R code to build the logistic regression model using glm(). The expected output is as follows.

## 
## Call:
## glm(formula = DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + NOOFDEPE + 
##     SALDATFR + TENORYR + DWNPMFR + PROFBUS + QUALHSC + QUAL_PG + 
##     SEXCODE + FULLPDC + FRICODE + WASHCODE + Region, family = binomial(), 
##     data = trainingSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7470  -1.0215   0.5716   0.7801   2.0874  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    2.2291732  0.1890284  11.793  < 2e-16 ***
## AGE           -0.0144968  0.0016680  -8.691  < 2e-16 ***
## NOOFDEPE       0.0566860  0.0107923   5.252 1.50e-07 ***
## MTHINCTH      -0.0004025  0.0035613  -0.113 0.910023    
## SALDATFR      -0.3833870  0.0420223  -9.123  < 2e-16 ***
## TENORYR        0.7727065  0.0456475  16.928  < 2e-16 ***
## DWNPMFR       -1.3074501  0.1274734 -10.257  < 2e-16 ***
## PROFBUS1       0.1966576  0.0487903   4.031 5.56e-05 ***
## QUALHSC1       0.1853120  0.0401652   4.614 3.95e-06 ***
## QUAL_PG1      -0.2990904  0.0787907  -3.796 0.000147 ***
## SEXCODE1       0.2339445  0.0600322   3.897 9.74e-05 ***
## FULLPDC1      -1.2365885  0.0368674 -33.541  < 2e-16 ***
## FRICODE1      -0.1761473  0.0377247  -4.669 3.02e-06 ***
## WASHCODE1     -0.2644245  0.0476814  -5.546 2.93e-08 ***
## RegionAP2     -0.5788864  0.1796029  -3.223 0.001268 ** 
## RegionChennai -1.4136987  0.1408192 -10.039  < 2e-16 ***
## RegionKA1     -0.6529787  0.1411987  -4.625 3.75e-06 ***
## RegionKE2     -0.5753874  0.1450753  -3.966 7.30e-05 ***
## RegionTN1     -0.8084619  0.1362745  -5.933 2.98e-09 ***
## RegionTN2     -0.6142186  0.1458691  -4.211 2.55e-05 ***
## RegionVellore -0.6570233  0.1595604  -4.118 3.83e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26040  on 21678  degrees of freedom
## Residual deviance: 23025  on 21658  degrees of freedom
## AIC: 23067
## 
## Number of Fisher Scoring iterations: 4
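
The general shape of such a call can be sketched on simulated data. The data frame below is made up and uses only three predictors, so its coefficients will not match the output above; only the glm() pattern carries over.

```r
set.seed(1)  # arbitrary seed, used only for the simulated data

# Simulated stand-in for trainingSet; all values are made up.
n <- 500
trainingSet <- data.frame(
  AGE     = sample(18:70, n, replace = TRUE),
  TENORYR = sample(c(1, 1.5, 2), n, replace = TRUE),
  FULLPDC = factor(sample(0:1, n, replace = TRUE))
)

# Simulate the outcome from a known logistic relationship.
log_odds <- 1 - 0.02 * trainingSet$AGE + 0.8 * trainingSet$TENORYR -
  1.2 * (trainingSet$FULLPDC == "1")
trainingSet$DefaulterFlag <- factor(rbinom(n, 1, plogis(log_odds)))

# family = binomial() requests logistic regression. With a two-level factor
# response, glm() models the probability of the second level ("1").
fit <- glm(DefaulterFlag ~ AGE + TENORYR + FULLPDC,
           family = binomial(), data = trainingSet)
summary(fit)
```

Note that NOOFDEPE appears twice in the formula shown in the Call above; R drops repeated terms, which is why the coefficient table lists it once.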

Que 2- Write R code to build the logistic regression model using the caret package. The expected output is as follows.

Use set.seed(766).

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7470  -1.0215   0.5716   0.7801   2.0874  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    2.2291732  0.1890284  11.793  < 2e-16 ***
## AGE           -0.0144968  0.0016680  -8.691  < 2e-16 ***
## NOOFDEPE       0.0566860  0.0107923   5.252 1.50e-07 ***
## MTHINCTH      -0.0004025  0.0035613  -0.113 0.910023    
## SALDATFR      -0.3833870  0.0420223  -9.123  < 2e-16 ***
## TENORYR        0.7727065  0.0456475  16.928  < 2e-16 ***
## DWNPMFR       -1.3074501  0.1274734 -10.257  < 2e-16 ***
## PROFBUS1       0.1966576  0.0487903   4.031 5.56e-05 ***
## QUALHSC1       0.1853120  0.0401652   4.614 3.95e-06 ***
## QUAL_PG1      -0.2990904  0.0787907  -3.796 0.000147 ***
## SEXCODE1       0.2339445  0.0600322   3.897 9.74e-05 ***
## FULLPDC1      -1.2365885  0.0368674 -33.541  < 2e-16 ***
## FRICODE1      -0.1761473  0.0377247  -4.669 3.02e-06 ***
## WASHCODE1     -0.2644245  0.0476814  -5.546 2.93e-08 ***
## RegionAP2     -0.5788864  0.1796029  -3.223 0.001268 ** 
## RegionChennai -1.4136987  0.1408192 -10.039  < 2e-16 ***
## RegionKA1     -0.6529787  0.1411987  -4.625 3.75e-06 ***
## RegionKE2     -0.5753874  0.1450753  -3.966 7.30e-05 ***
## RegionTN1     -0.8084619  0.1362745  -5.933 2.98e-09 ***
## RegionTN2     -0.6142186  0.1458691  -4.211 2.55e-05 ***
## RegionVellore -0.6570233  0.1595604  -4.118 3.83e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26040  on 21678  degrees of freedom
## Residual deviance: 23025  on 21658  degrees of freedom
## AIC: 23067
## 
## Number of Fisher Scoring iterations: 4
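
A possible caret version, sketched on simulated data. The 75/25 split via createDataPartition() is an assumption: the source does not show how the split was made, although 21,679 training rows out of 28,906 is consistent with roughly 75%. The data and the two-predictor formula are made up.

```r
library(caret)

set.seed(766)  # seed given in the assignment

# Simulated stand-in for the loan data; all values are made up.
n <- 400
loans <- data.frame(
  AGE     = sample(18:70, n, replace = TRUE),
  TENORYR = sample(c(1, 1.5, 2), n, replace = TRUE)
)
loans$DefaulterFlag <- factor(
  rbinom(n, 1, plogis(1 - 0.02 * loans$AGE + 0.8 * loans$TENORYR))
)

# Assumed 75/25 stratified split (not confirmed by the source).
idx <- createDataPartition(loans$DefaulterFlag, p = 0.75, list = FALSE)
trainingSet <- loans[idx, ]
testSet     <- loans[-idx, ]

# method = "glm" with family = binomial() fits the same model as glm();
# trainControl(method = "none") skips resampling.
caret_fit <- train(DefaulterFlag ~ AGE + TENORYR, data = trainingSet,
                   method = "glm", family = binomial(),
                   trControl = trainControl(method = "none"))

# summary() on a train object summarises the underlying glm fit.
summary(caret_fit)
```

Because caret fits the model through its own wrapper, the summary reports "Call: NULL", as in the output above, while the coefficients are the same as those from a direct glm() call on the same data.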

Que 3- List the variable(s) that are not statistically significant in the logistic regression model at the 5% level of significance.

Que 4- Explain the impact of AGE on the Probability of Default in one or two sentences.

Que 5- Explain the impact of Gender (SEXCODE) on the Probability of Default in one or two sentences.
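
One common way to read logistic regression coefficients is to exponentiate them into odds ratios. The sketch below uses the AGE and SEXCODE1 estimates copied from the output above; it is one possible approach, not the only acceptable answer.

```r
# Coefficients copied from the regression output above.
coefs <- c(AGE = -0.0144968, SEXCODE1 = 0.2339445)

# exp(coefficient) is the multiplicative change in the odds of
# DefaulterFlag = 1 for a one-unit increase in the predictor,
# holding the other predictors fixed.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))
```

An odds ratio below 1 (AGE, about 0.986) means each additional year of age slightly lowers the odds of default; an odds ratio above 1 (SEXCODE1, about 1.26) means males have higher odds of default than females, other things being equal.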

Que 6- Estimate the probability of default for a consumer with the following characteristics:

  • Age = mean(AGE)

  • Male

  • Education = UG

  • MTHINCTH = mean(MTHINCTH)

  • NOOFDEPE = mean(NOOFDEPE)

  • Owns a Fridge

  • Owns a Washing Machine

  • Working Professional

  • SALDATFR = mean(SALDATFR)

  • Lives in TN1

  • Tenure = mean(TENORYR)

  • Down Payment = mean(DWNPMFR)

  • Did not submit FULLPDC
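
Presumably the task is to compute this consumer's predicted probability of default. A worked sketch under several assumptions: the means are taken from the full-sample descriptive statistics table (the question may intend training-set means), "UG" is read as QUALHSC = 0 and QUAL_PG = 0, and "Working Professional" as PROFBUS = 0.

```r
# Coefficients copied from the regression output above.
b <- c(Intercept = 2.2291732, AGE = -0.0144968, NOOFDEPE = 0.0566860,
       MTHINCTH = -0.0004025, SALDATFR = -0.3833870, TENORYR = 0.7727065,
       DWNPMFR = -1.3074501, PROFBUS1 = 0.1966576, QUALHSC1 = 0.1853120,
       QUAL_PG1 = -0.2990904, SEXCODE1 = 0.2339445, FULLPDC1 = -1.2365885,
       FRICODE1 = -0.1761473, WASHCODE1 = -0.2644245, RegionTN1 = -0.8084619)

# Consumer profile. Means come from the descriptive statistics table
# (assumption: full-sample means). QUALHSC1 = QUAL_PG1 = 0 is an assumed
# reading of "UG"; PROFBUS1 = 0 is "professional"; FULLPDC1 = 0 is
# "did not submit FULLPDC".
x <- c(Intercept = 1, AGE = 36.44, NOOFDEPE = 2.85, MTHINCTH = 8.94,
       SALDATFR = 0.44, TENORYR = 1.28, DWNPMFR = 0.38, PROFBUS1 = 0,
       QUALHSC1 = 0, QUAL_PG1 = 0, SEXCODE1 = 1, FULLPDC1 = 0,
       FRICODE1 = 1, WASHCODE1 = 1, RegionTN1 = 1)

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds <- sum(b * x[names(b)])
prob <- plogis(log_odds)
print(round(c(log_odds = log_odds, prob = prob), 4))
```

Under these assumptions the log-odds come to about 1.17, which corresponds to a predicted probability of roughly 0.76. With the fitted model object available, predict() on a one-row data frame with type = "response" would give the same kind of result.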

Que 7- Write R code to predict the probabilities using the test dataset.
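
A minimal sketch of the prediction step, using simulated data and a two-predictor model as stand-ins (the object names trainingSet, testSet and testProb are illustrative, and the values are made up):

```r
set.seed(1)  # arbitrary seed, used only for the simulated data

# Simulated stand-in data; all values are made up.
n <- 300
dat <- data.frame(
  AGE     = sample(18:70, n, replace = TRUE),
  TENORYR = sample(c(1, 1.5, 2), n, replace = TRUE)
)
dat$DefaulterFlag <- factor(
  rbinom(n, 1, plogis(1 - 0.02 * dat$AGE + 0.8 * dat$TENORYR))
)
trainingSet <- dat[1:225, ]
testSet     <- dat[226:300, ]

fit <- glm(DefaulterFlag ~ AGE + TENORYR,
           family = binomial(), data = trainingSet)

# type = "response" returns probabilities of DefaulterFlag = 1;
# the default (type = "link") would return log-odds instead.
testProb <- predict(fit, newdata = testSet, type = "response")
head(testProb)
```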

Que 8- Write R code to produce the following confusion matrix, assuming a probability threshold of 50%.

##           Actual
## Prediction   No  Yes
##        No   635  417
##        Yes 1448 4727
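
The thresholding and cross-tabulation can be sketched on a handful of hypothetical values. The probabilities and outcomes below are made up, and treating a probability of exactly 0.5 as "Yes" is a choice, not something the source specifies.

```r
# Hypothetical predicted probabilities and actual outcomes for eight customers.
testProb <- c(0.91, 0.35, 0.62, 0.48, 0.77, 0.50, 0.12, 0.83)
actual   <- factor(c("Yes", "No", "No", "Yes", "Yes", "Yes", "No", "Yes"),
                   levels = c("No", "Yes"))

# Classify as "Yes" when the predicted probability is at least 0.5.
predicted <- factor(ifelse(testProb >= 0.5, "Yes", "No"),
                    levels = c("No", "Yes"))

# Rows are predictions, columns are actual outcomes, as in the matrix above.
confMat <- table(Prediction = predicted, Actual = actual)
print(confMat)
```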

Que 9- Calculate the values of three machine learning metrics (Accuracy, Sensitivity, Specificity), and write R code to generate them.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   635  417
##        Yes 1448 4727
##                                          
##                Accuracy : 0.7419         
##                  95% CI : (0.7317, 0.752)
##     No Information Rate : 0.7118         
##     P-Value [Acc > NIR] : 5.682e-09      
##                                          
##                   Kappa : 0.2624         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9189         
##             Specificity : 0.3048         
##          Pos Pred Value : 0.7655         
##          Neg Pred Value : 0.6036         
##              Prevalence : 0.7118         
##          Detection Rate : 0.6541         
##    Detection Prevalence : 0.8544         
##       Balanced Accuracy : 0.6119         
##                                          
##        'Positive' Class : Yes            
## 
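
The three metrics can be checked by hand from the four cell counts. The sketch below rebuilds the confusion matrix shown above in base R and treats "Yes" as the positive class, as the report does. The report itself has the format printed by caret's confusionMatrix(), which would be the package route.

```r
# Confusion matrix from the output above; matrix() fills column by column,
# so the first column holds the counts for Actual = "No".
confMat <- matrix(c(635, 1448, 417, 4727), nrow = 2,
                  dimnames = list(Prediction = c("No", "Yes"),
                                  Actual     = c("No", "Yes")))

# Accuracy: share of all cases on the diagonal.
accuracy <- sum(diag(confMat)) / sum(confMat)

# Sensitivity: share of actual "Yes" cases predicted as "Yes".
sensitivity <- confMat["Yes", "Yes"] / sum(confMat[, "Yes"])

# Specificity: share of actual "No" cases predicted as "No".
specificity <- confMat["No", "No"] / sum(confMat[, "No"])

print(round(c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity), 4))
```

These work out to 5362/7227, 4727/5144 and 635/2083, matching the 0.7419, 0.9189 and 0.3048 in the report.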

Que 10- Write R code to calculate the area under the curve (AUC), and write its implications in one or two sentences.

## [1] 0.7263962
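
AUC can be computed without extra packages from the rank-sum (Mann-Whitney) identity. The sketch below uses eight hypothetical probabilities and outcomes; packages such as pROC offer the same calculation, and the source does not say which method produced the value above.

```r
# Hypothetical predicted probabilities and actual outcomes (1 = defaulter).
testProb <- c(0.91, 0.35, 0.62, 0.48, 0.77, 0.50, 0.12, 0.83)
actual   <- c(1, 0, 0, 1, 1, 1, 0, 1)

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)

# rank() assigns average ranks to ties, which gives tied pairs half credit.
r <- rank(testProb)

# AUC = share of (defaulter, non-defaulter) pairs in which the defaulter
# receives the higher predicted probability.
auc <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```

Read this way, the reported AUC of about 0.73 means that a randomly chosen defaulter receives a higher predicted probability than a randomly chosen non-defaulter roughly 73% of the time; 0.5 would be no better than chance and 1 would be perfect ranking.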