MicrosoftML (Microsoft Machine Learning Server 9.4.7), the MicrosoftML package

LTFS is one of India’s most respected and leading NBFCs providing vehicle finance.
Financial institutions incur significant losses due to defaults on vehicle loans. This has led to a tightening of vehicle loan underwriting and increased vehicle loan rejection rates, and it has raised the need among these institutions for a better credit risk scoring model. This warrants a study to estimate the determinants of vehicle loan default.
A financial institution has hired you to accurately predict the probability of a loanee/borrower defaulting on a vehicle loan in the first EMI (Equated Monthly Instalment) on the due date. The following information regarding the loan and loanee is provided in the datasets:
Loanee Information (demographic data like age, income, identity proof etc.)
Loan Information (disbursal details, amount, EMI, loan-to-value ratio etc.)
Bureau data & history (bureau score, number of active accounts, the status of other loans, credit history etc.)
Doing so will ensure that clients capable of repayment are not rejected, and important determinants can be identified which can further be used for minimising the default rates.
The Data Set contains:
train.csv contains the training data with details on each loan as described in the last section.
data_dictionary.csv contains a brief description of each variable provided in the training and test sets.
test.csv contains details of all customers and loans for which the participants are to submit a probability of default.
sample_submission.csv contains the submission format for the predictions against the test set. A single csv needs to be submitted as a solution.
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
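Locally, this metric can be checked, for example, with the pROC package (a sketch only; `observed` and `predicted` below are placeholder vectors of the target and the predicted probability):
library('pROC') # Display and Analyze ROC Curves
pROC::auc(response = observed, predictor = predicted) # area under the ROC curve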
Although various tools are actively used for preprocessing and for interpreting modeling results, including tidymodels, Microsoft ML Server 9.4.7 (shipped with Microsoft SQL Server 2019; August 2019 update) was the main toolset for building the classification models.
R Markdown does not support combining Rmd files: per the R Markdown website, R Markdown requires a single Rmd file and does not currently support embedding one Rmd file within another Rmd document.
library('tidyverse') # An opinionated collection of R Packages designed for Data Science.
library('magrittr') # A Forward-Pipe Operator for R
# install.packages("https://cran.r-project.org/src/contrib/data.table_1.11.8.tar.gz", repos = NULL, type = "source") # Install package ‘data.table’ version 1.11.8
library('data.table') # Fast Extension of `data.frame`
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
LoadingData <- 'VLDP'
NameMySQLCode <- 'VLDP.sql'
# Selected Features for Set 1
SelectedFeatures <- c('disbursed_amount', 'asset_cost', 'ltv', 'branch_id', 'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Employment_Type', 'State_ID', 'Employee_code_ID', 'Aadhar_flag', 'PAN_flag', 'Driving_flag', 'Passport_flag', 'PERFORM_CNS_SCORE', 'PERFORM_CNS_SCORE_DESCRIPTION', 'PRI_NO_OF_ACCTS', 'PRI_ACTIVE_ACCTS', 'PRI_OVERDUE_ACCTS', 'PRI_SANCTIONED_AMOUNT', 'PRI_DISBURSED_AMOUNT', 'SEC_CURRENT_BALANCE', 'SEC_SANCTIONED_AMOUNT', 'PRIMARY_INSTAL_AMT', 'NEW_ACCTS_IN_LAST_SIX_MONTHS', 'DELINQUENT_ACCTS_IN_LAST_SIX_M', 'NO_OF_INQUIRIES', 'Age', 'YearsSinceDisbursment', 'AverageLoanTenure', 'TimeSinceFirstLoan')
ModelFilename = 'VLDP_model.rds'
# Check whether Microsoft R Server (RevoScaleR) is available
isMicrosoftRServer <- 'RevoScaleR' %in% rownames(installed.packages())
if( isMicrosoftRServer ) {
writeLines(paste('Microsoft R Server (Machine Learning)', getNamespaceVersion('RevoScaleR') ))
} else writeLines(paste('Common', version$version.string ))
## Microsoft R Server (Machine Learning) 9.4.7
# Import the file with data for application scoring
DT1 <- data.table::fread(unzip(zipfile = 'VehicleLoanDefaultPrediction.zip', files = 'train.csv'))
data.table::setnames(DT1, old = c('loan_default'), new = c('GB_flag'))
DT2 <- data.table::fread(unzip(zipfile = 'VehicleLoanDefaultPrediction.zip', files = 'test.csv'))
# Make one data.table from a list of many, filling missing columns and matching by column names
DT <- list(DT1, DT2) %>%
data.table::rbindlist(., use.names = TRUE, fill = TRUE)
# Import the file with the data dictionary for application scoring
DD <- readxl::read_excel(unzip(zipfile = 'VehicleLoanDefaultPrediction.zip', files = 'Data Dictionary.xlsx')) %>% dplyr::mutate(`Variable Name` = stringr::str_replace(string = `Variable Name`, pattern = 'loan_default',
replacement = 'GB_flag'))
## New names:
## * `` -> `..3`
# Randomly select 90% of the rows of the original training data for model training
# and split the data into two sets: training (inTrain) and testing (inTest)
seed <- 2019
set.seed(seed)
inTrain1 <- rep(FALSE, nrow(DT1))
inTrain1[sample(nrow(DT1), 9/10 * nrow(DT1))] <- TRUE
inTrain = c( inTrain1, rep( FALSE, times = dim(DT2)[1] ) )
inTest = c( !inTrain1, rep( FALSE, times = dim(DT2)[1] ) )
inProblem = c( rep( FALSE, times = dim(DT1)[1] ), rep( TRUE, times = dim(DT2)[1] ) )
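# Sanity check: every row of the stacked table DT falls into exactly one of the three
# masks: training (inTrain), testing (inTest) or the unlabeled submission set (inProblem)
stopifnot(length(inTrain) == nrow(DT), all(inTrain + inTest + inProblem == 1L))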
# Adding Attributes by Variables into DT from Data Dictionary
attr(DT, "variable.labels") <- tibble(`Variable.Name` = colnames(DT)) %>%
dplyr::left_join(DD, by = c('Variable.Name' = 'Variable Name')) %>%
dplyr::select(Description) %>%
pull
# gsub() replaces '.' (dot) with '_' (underscore) in column names
names(DT) %<>% gsub('[.]', '_', .)
# DT$PRI_CURRENT_BALANCE <- DT$PRI_CURRENT_BALANCE + 6678296
# DT$PRI_CURRENT_BALANCE <- log(DT$PRI_CURRENT_BALANCE)
# Rename all variables whose names are longer than 31 characters
oldnames = c('DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS')
newnames = c('DELINQUENT_ACCTS_IN_LAST_SIX_M')
data.table::setnames(DT, old = oldnames, new = newnames)
# Parse the date of birth (two-digit years above 40 become 19xx, otherwise 20xx) and convert it into 'Age' at '2019-01-01'
Finish_Date <- lubridate::parse_date_time2('2019-01-01', orders = '%Y-%m-%d', tz = 'Asia/Dhaka')
DT[, Date_of_Birth := paste0(substr( DT$Date_of_Birth, 1, 6 ),
ifelse(substr( DT$Date_of_Birth, 7, 8 ) %>% as.integer > 40, '19', '20'),
substr( DT$Date_of_Birth, 7, 8 )) %>%
as.Date(., '%d-%m-%Y') ]
DT[, Age := difftime(time1 = Finish_Date, time2 = Date_of_Birth, units = 'days') %>%
as.integer(.) / 365.24 ]
# Parse the disbursal date and convert it into 'YearsSinceDisbursment'
DT[, DisbursalDate := paste0(substr( DT$DisbursalDate, 1, 6 ),
ifelse(substr( DT$DisbursalDate, 7, 8 ) %>% as.integer > 40, '19', '20'),
substr( DT$DisbursalDate, 7, 8 )) %>%
as.Date( ., '%d-%m-%Y') ]
DT[, YearsSinceDisbursment := difftime(time1 = Finish_Date, time2 = DisbursalDate, units = 'days') %>%
as.integer(.) / 365.24 ]
# Convert the average account age (AVERAGE_ACCT_AGE, e.g. '1yrs 6mon') into 'AverageLoanTenure' in years
DT[, AverageLoanTenure :=
stringr::str_split_fixed(DT$AVERAGE_ACCT_AGE, ' ', 2) %>%
data.frame(., stringsAsFactors = FALSE) %>%
setNames(c('Years', 'Months')) %>%
dplyr::mutate(Years = readr::parse_number(Years), Months = readr::parse_number(Months) / 12 ) %>%
dplyr::mutate(Len = Years + Months) %>%
dplyr::select(Len) %>% pull ]
# Convert the credit history length (CREDIT_HISTORY_LENGTH) into 'TimeSinceFirstLoan' in years
DT[, TimeSinceFirstLoan :=
stringr::str_split_fixed(DT$CREDIT_HISTORY_LENGTH, ' ', 2) %>%
data.frame(., stringsAsFactors = FALSE) %>%
setNames(c('Years', 'Months')) %>%
dplyr::mutate(Years = readr::parse_number(Years), Months = readr::parse_number(Months) / 12 ) %>%
dplyr::mutate(Len = Years + Months) %>%
dplyr::select(Len) %>% pull ]
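# e.g. readr::parse_number() extracts 1 from '1yrs' and 6 from '6mon', so '1yrs 6mon' becomes 1 + 6/12 = 1.5 years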
DT[, ':=' (
YearsOnLoan = difftime(time1 = DisbursalDate, time2 = Date_of_Birth, units = 'days') %>%
as.integer(.) / 365.24,
DisAsDiff = asset_cost - disbursed_amount,
DisAsShare = asset_cost / disbursed_amount,
# DiffLTV = (asset_cost - disbursed_amount - ltv),
Qrt = lubridate::quarter(DisbursalDate) %>% as.factor(.), # weekdays
Day = format(DisbursalDate, '%d') %>% as.integer,
OutstandingNow = disbursed_amount + PRI_CURRENT_BALANCE,
DisbursedTotal = PRI_DISBURSED_AMOUNT + disbursed_amount,
ShareOverdue = DELINQUENT_ACCTS_IN_LAST_SIX_M - NEW_ACCTS_IN_LAST_SIX_MONTHS,
# OutstandingNow2Dsbrsd = OutstandingNow / DisbursedTotal
SEC_OverdueShareSec = SEC_OVERDUE_ACCTS / SEC_NO_OF_ACCTS,
PRI_OverdueShare = PRI_OVERDUE_ACCTS / PRI_NO_OF_ACCTS
) ]
library('Hmisc') # Harrell Miscellaneous
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:rpart':
##
## solder
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
DT <- Hmisc::upData(DT, labels = c(Age = 'Age in Years',
YearsSinceDisbursment = 'Years Since Disbursement',
AverageLoanTenure = 'Average loan tenure in Years (AVERAGE_ACCT_AGE)',
TimeSinceFirstLoan = 'Years since First Loan (CREDIT_HISTORY_LENGTH)',
YearsOnLoan = 'Years On Loan',
DisAsDiff = 'Difference of asset_cost from disbursed_amount',
DisAsShare = 'Ratio of asset_cost to disbursed_amount',
# DiffLTV = 'Difference of asset_cost from disbursed_amount and ltv',
Qrt = 'Quarter of DisbursalDate',
Day = 'Day of month of DisbursalDate',
OutstandingNow = 'Sum of disbursed_amount and PRI_CURRENT_BALANCE',
DisbursedTotal = 'Sum of PRI_DISBURSED_AMOUNT and disbursed_amount',
ShareOverdue = 'Difference of DELINQUENT_ACCTS_IN_LAST_SIX_M from NEW_ACCTS_IN_LAST_SIX_MONTHS',
SEC_OverdueShareSec = 'Ratio of SEC_OVERDUE_ACCTS to SEC_NO_OF_ACCTS',
PRI_OverdueShare = 'Ratio of PRI_OVERDUE_ACCTS to PRI_NO_OF_ACCTS'))
## Input object size: 96804992 bytes; 55 variables 345546 observations
## New object size: 96812608 bytes; 55 variables 345546 observations
# Hmisc::contents(DT)
# Convert *some* column classes in data.table
factor_cols <- c('branch_id', 'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Employment_Type', 'State_ID', 'Employee_code_ID', 'MobileNo_Avl_Flag', 'Aadhar_flag', 'PAN_flag', 'VoterID_flag', 'Driving_flag', 'Passport_flag', 'PERFORM_CNS_SCORE_DESCRIPTION')
DT[, (factor_cols) := lapply(.SD, factor), .SDcols = factor_cols]
delcols <- c('UniqueID', 'Date_of_Birth', 'DisbursalDate', 'PRI_CURRENT_BALANCE', 'AVERAGE_ACCT_AGE', 'CREDIT_HISTORY_LENGTH') # , grep('_flag', names(DT), value = TRUE), grep('SEC_', names(DT), value = TRUE)
# Remove the outcome variable 'GB_flag' from the 'delcols' vector due to the '_flag' pattern
delcols <- delcols[! delcols %in% c('GB_flag')]
if ( isMicrosoftRServer ) {
# for Microsoft Machine Learning Server 9.3
# Get a subset rows and columns from the data frame
DF1 <- RevoScaleR::rxDataStep(inData = DT, maxRowsByCols = 2e9,
varsToDrop = delcols
# varsToKeep = c('x', 'w', 'z'),
# rowSelection = z > 0
)
out <- rxGetInfo(DF1, getVarInfo = TRUE)
} else {
# for CRAN R version
delColumns <- delcols
DF1 <- copy(DT) %>%
.[, (delColumns) := NULL] %>%
data.table::setDF(.)
}
## Rows Read: 345546, Total Rows Processed: 345546, Total Chunk Time: 0.746 seconds
I use the embed package, which contains extra steps for the recipes package for embedding predictors into one or more numeric columns. All of its preprocessing methods are supervised.
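The helper EncodingMultiLevelFactors() used below is defined elsewhere in the project. A minimal sketch of a plausible implementation is given here, assuming it wraps embed::step_lencode_mixed() (a partial-pooling likelihood encoder whose lme4 optimizer produces traces like those printed below); the argument names X (data), Y (name of the factor column) and Z (logical training mask) follow the call below.
# Sketch only; the project's actual implementation may differ
EncodingMultiLevelFactors <- function(X, Y, Z) {
  # Fit the encoder on the training rows only, then apply it to all rows
  train <- X[Z, c('GB_flag', Y)]
  train$GB_flag <- factor(train$GB_flag) # binary outcome as a factor for the encoder
  rec <- recipes::recipe(stats::as.formula(paste('GB_flag ~', Y)), data = train) %>%
    embed::step_lencode_mixed(!!rlang::sym(Y), outcome = dplyr::vars(GB_flag)) %>%
    recipes::prep(training = train)
  # Replace the factor column with its numeric (log-odds) encoding
  X[[Y]] <- recipes::bake(rec, new_data = X[, Y, drop = FALSE])[[Y]]
  X
}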
# Encoding of Multi-Level Categorical Variables (Factors) into Numeric Predictors
for (var in c('branch_id', 'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Employment_Type', 'State_ID', 'Employee_code_ID', 'PERFORM_CNS_SCORE_DESCRIPTION')) {
print(var)
DF1 <- EncodingMultiLevelFactors(X = DF1, Y = var, Z = inTrain)
}
## [1] "branch_id"
## start par. = 0.1123205 fn = 221335.9
## At return
## eval: 19 fn: 221335.56 par: 0.106959
## [1] "supplier_id"
## start par. = 0.2075279 fn = 220404.6
## At return
## eval: 19 fn: 220349.07 par: 0.172799
## [1] "manufacturer_id"
## start par. = 0.04441674 fn = 223266.8
## At return
## eval: 18 fn: 223266.33 par: 0.0533252
## [1] "Current_pincode_ID"
## start par. = 0.245132 fn = 221670.7
## At return
## eval: 16 fn: 221350.34 par: 0.166798
## [1] "Employment_Type"
## start par. = 0.02972829 fn = 216219.5
## At return
## eval: 28 fn: 216219.22 par: 0.0422908
## [1] "State_ID"
## start par. = 0.08355545 fn = 222283.1
## At return
## eval: 18 fn: 222283. par: 0.0889640
## [1] "Employee_code_ID"
## start par. = 0.2213827 fn = 220258.2
## At return
## eval: 19 fn: 220127.69 par: 0.175705
## [1] "PERFORM_CNS_SCORE_DESCRIPTION"
## start par. = 0.098095 fn = 221735
## At return
## eval: 17 fn: 221733.42 par: 0.119980
# DF1 <- DF1[, c('GB_flag', SelectedFeatures)]
DF0 <- DF1[inTrain, ] # Training set used to fit the discretization (binning) of variables
Max_Vars <- 32
Max_Levels <- 48
remove(DT1, DT2, DD, inTrain1, delcols, factor_cols, oldnames, newnames)
Inspect the data frame using automated data exploration and inspection packages: DataExplorer and summarytools.
# Inspect the data frame
library('DataExplorer') # Automate Data Exploration and Treatment
library('summarytools') # Tools to Quickly and Neatly Summarize Data
##
## Attaching package: 'summarytools'
## The following objects are masked from 'package:Hmisc':
##
## label, label<-
## The following object is masked from 'package:tibble':
##
## view
dfName <- 'DT'
dfCols <- colnames(DT) %>% length()
data.frame(Variables = sapply(DT, class) %>% unlist) %>%
dplyr::filter(Variables != 'labelled') %>%
ggplot2::ggplot(aes(x = Variables, fill = Variables)) +
ggplot2::geom_bar( stat = 'count') +
ggplot2::geom_text(stat='count', aes(label = ..count..), vjust = 2, size = 5, color = 'white') +
ggplot2::theme(legend.position = 'none') +
ggplot2::labs(title = paste(dfName, 'Column Types'), subtitle = sprintf('Data.frame has %g columns.', dfCols),
x = NULL, y = 'Number of columns') # DataExplorer::profile_missing(DT)
# Data frame Summary by Variable
print(summarytools::dfSummary(DT, graph.magnif = 0.75), method = 'render')

| No | Variable | Label | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|---|
| 1 | UniqueID [integer] | mean (sd) : 593106.04 (101481.56) min < med < max : 417428 < 592918.5 < 769909 IQR (CV) : 175115.5 (0.17) | 345546 distinct values | 345546 (100%) | 0 (0%) | ||
| 2 | disbursed_amount [integer] | mean (sd) : 54916.38 (13045.96) min < med < max : 11613 < 54303 < 990572 IQR (CV) : 13302 (0.24) | 29271 distinct values | 345546 (100%) | 0 (0%) | ||
| 3 | asset_cost [integer] | mean (sd) : 76294.84 (18738.64) min < med < max : 37000 < 71541 < 1628992 IQR (CV) : 13323 (0.25) | 53158 distinct values | 345546 (100%) | 0 (0%) | ||
| 4 | ltv [numeric] | mean (sd) : 74.93 (11.32) min < med < max : 10.03 < 77.14 < 95 IQR (CV) : 14.41 (0.15) | 6819 distinct values | 345546 (100%) | 0 (0%) | ||
| 5 | branch_id [factor] | 1. 1 2. 2 3. 3 4. 5 5. 7 6. 8 7. 9 8. 10 9. 11 10. 13 [ 72 others ] | 8337 (2.4%) 20527 (5.9%) 14881 (4.3%) 10276 (3.0%) 4323 (1.3%) 4472 (1.3%) 3891 (1.1%) 5685 (1.6%) 5873 (1.7%) 4170 (1.2%) 263111 (76.1%) | 345546 (100%) | 0 (0%) | ||
| 6 | supplier_id [factor] | 1. 10524 2. 12311 3. 12312 4. 12374 5. 12441 6. 12456 7. 12500 8. 12534 9. 12539 10. 12797 [ 3079 others ] | 7 (0.0%) 6 (0.0%) 57 (0.0%) 146 (0.0%) 62 (0.0%) 101 (0.0%) 79 (0.0%) 73 (0.0%) 11 (0.0%) 87 (0.0%) 344917 (99.8%) | 345546 (100%) | 0 (0%) | ||
| 7 | manufacturer_id [factor] | 1. 45 2. 48 3. 49 4. 51 5. 67 6. 86 7. 120 8. 145 9. 152 10. 153 [ 2 others ] | 87053 (25.2%) 22964 (6.6%) 14812 (4.3%) 40927 (11.8%) 3364 (1.0%) 161203 (46.7%) 14049 (4.1%) 1138 (0.3%) 9 (0.0%) 25 (0.0%) 2 (0.0%) | 345546 (100%) | 0 (0%) | ||
| 8 | Current_pincode_ID [factor] | 1. 1 2. 2 3. 3 4. 4 5. 5 6. 6 7. 7 8. 8 9. 9 10. 10 [ 7086 others ] | 44 (0.0%) 118 (0.0%) 87 (0.0%) 153 (0.0%) 331 (0.1%) 162 (0.0%) 152 (0.0%) 74 (0.0%) 56 (0.0%) 8 (0.0%) 344361 (99.7%) | 345546 (100%) | 0 (0%) | ||
| 9 | Date_of_Birth [Date] | min : 1949-09-15 med : 1986-01-01 max : 2000-11-29 range : 51y 2m 14d | 15888 distinct val. | 345546 (100%) | 0 (0%) | ||
| 10 | Employment_Type [factor] | 1. 2. Salaried 3. Self employed | 11104 (3.2%) 147013 (42.5%) 187429 (54.2%) | 345546 (100%) | 0 (0%) | ||
| 11 | DisbursalDate [Date] | min : 2018-08-01 med : 2018-10-20 max : 2018-11-30 range : 3m 29d | 111 distinct val. | 345546 (100%) | 0 (0%) | ||
| 12 | State_ID [factor] | 1. 1 2. 2 3. 3 4. 4 5. 5 6. 6 7. 7 8. 8 9. 9 10. 10 [ 12 others ] | 14351 (4.2%) 7258 (2.1%) 47868 (13.9%) 70438 (20.4%) 14304 (4.1%) 48903 (14.2%) 10628 (3.1%) 20047 (5.8%) 21459 (6.2%) 5564 (1.6%) 84726 (24.5%) | 345546 (100%) | 0 (0%) | ||
| 13 | Employee_code_ID [factor] | 1. 1 2. 3 3. 4 4. 5 5. 7 6. 9 7. 10 8. 11 9. 12 10. 15 [ 3388 others ] | 106 (0.0%) 192 (0.1%) 96 (0.0%) 133 (0.0%) 221 (0.1%) 77 (0.0%) 44 (0.0%) 111 (0.0%) 162 (0.0%) 146 (0.0%) 344258 (99.6%) | 345546 (100%) | 0 (0%) | ||
| 14 | MobileNo_Avl_Flag [factor] | 1. 1 | 345546 (100.0%) | 345546 (100%) | 0 (0%) | ||
| 15 | Aadhar_flag [factor] | 1. 0 2. 1 | 51883 (15.0%) 293663 (85.0%) | 345546 (100%) | 0 (0%) | ||
| 16 | PAN_flag [factor] | 1. 0 2. 1 | 306392 (88.7%) 39154 (11.3%) | 345546 (100%) | 0 (0%) | ||
| 17 | VoterID_flag [factor] | 1. 0 2. 1 | 298155 (86.3%) 47391 (13.7%) | 345546 (100%) | 0 (0%) | ||
| 18 | Driving_flag [factor] | 1. 0 2. 1 | 338249 (97.9%) 7297 (2.1%) | 345546 (100%) | 0 (0%) | ||
| 19 | Passport_flag [factor] | 1. 0 2. 1 | 344835 (99.8%) 711 (0.2%) | 345546 (100%) | 0 (0%) | ||
| 20 | PERFORM_CNS_SCORE [integer] | mean (sd) : 289.03 (338.84) min < med < max : 0 < 0 < 890 IQR (CV) : 679 (1.17) | 574 distinct values | 345546 (100%) | 0 (0%) | ||
| 21 | PERFORM_CNS_SCORE_DESCRIPTION [factor] | 1. A-Very Low Risk 2. B-Very Low Risk 3. C-Very Low Risk 4. D-Very Low Risk 5. E-Low Risk 6. F-Low Risk 7. G-Low Risk 8. H-Medium Risk 9. I-Medium Risk 10. J-High Risk [ 10 others ] | 21683 (6.3%) 13696 (4.0%) 23870 (6.9%) 16472 (4.8%) 8393 (2.4%) 12176 (3.5%) 5795 (1.7%) 10142 (2.9%) 8260 (2.4%) 5526 (1.6%) 219533 (63.5%) | 345546 (100%) | 0 (0%) | ||
| 22 | PRI_NO_OF_ACCTS [integer] | mean (sd) : 2.37 (5.01) min < med < max : 0 < 0 < 453 IQR (CV) : 3 (2.11) | 114 distinct values | 345546 (100%) | 0 (0%) | ||
| 23 | PRI_ACTIVE_ACCTS [integer] | mean (sd) : 1 (1.88) min < med < max : 0 < 0 < 144 IQR (CV) : 1 (1.87) | 42 distinct values | 345546 (100%) | 0 (0%) | ||
| 24 | PRI_OVERDUE_ACCTS [integer] | mean (sd) : 0.16 (0.54) min < med < max : 0 < 0 < 25 IQR (CV) : 0 (3.5) | 23 distinct values | 345546 (100%) | 0 (0%) | ||
| 25 | PRI_CURRENT_BALANCE [integer] | mean (sd) : 160270.2 (925345.66) min < med < max : -6678296 < 0 < 96524920 IQR (CV) : 31364.5 (5.77) | 97465 distinct values | 345546 (100%) | 0 (0%) | ||
| 26 | PRI_SANCTIONED_AMOUNT [integer] | mean (sd) : 209650.86 (2043865.78) min < med < max : -481500 < 0 < 1e+09 IQR (CV) : 59416.75 (9.75) | 60681 distinct values | 345546 (100%) | 0 (0%) | ||
| 27 | PRI_DISBURSED_AMOUNT [integer] | mean (sd) : 209560.79 (2047482.84) min < med < max : 0 < 0 < 1e+09 IQR (CV) : 57645.75 (9.77) | 65673 distinct values | 345546 (100%) | 0 (0%) | ||
| 28 | SEC_NO_OF_ACCTS [integer] | mean (sd) : 0.05 (0.56) min < med < max : 0 < 0 < 57 IQR (CV) : 0 (11.82) | 40 distinct values | 345546 (100%) | 0 (0%) | ||
| 29 | SEC_ACTIVE_ACCTS [integer] | mean (sd) : 0.02 (0.28) min < med < max : 0 < 0 < 36 IQR (CV) : 0 (12.47) | 23 distinct values | 345546 (100%) | 0 (0%) | ||
| 30 | SEC_OVERDUE_ACCTS [integer] | mean (sd) : 0.01 (0.1) min < med < max : 0 < 0 < 8 IQR (CV) : 0 (16.97) | 0 : 343928 (99.5%) 1 : 1358 (0.4%) 2 : 166 (0.0%) 3 : 54 (0.0%) 4 : 22 (0.0%) 5 : 8 (0.0%) 6 : 6 (0.0%) 7 : 2 (0.0%) 8 : 2 (0.0%) | 345546 (100%) | 0 (0%) | ||
| 31 | SEC_CURRENT_BALANCE [integer] | mean (sd) : 4565.3 (161202.6) min < med < max : -574647 < 0 < 36032852 IQR (CV) : 0 (35.31) | 3947 distinct values | 345546 (100%) | 0 (0%) | ||
| 32 | SEC_SANCTIONED_AMOUNT [integer] | mean (sd) : 6133.3 (189342.66) min < med < max : 0 < 0 < 57945000 IQR (CV) : 0 (30.87) | 2631 distinct values | 345546 (100%) | 0 (0%) | ||
| 33 | SEC_DISBURSED_AMOUNT [integer] | mean (sd) : 6038.72 (188911.37) min < med < max : 0 < 0 < 57945000 IQR (CV) : 0 (31.28) | 3031 distinct values | 345546 (100%) | 0 (0%) | ||
| 34 | PRIMARY_INSTAL_AMT [integer] | mean (sd) : 12497.73 (199754.5) min < med < max : 0 < 0 < 85262329 IQR (CV) : 1946 (15.98) | 34330 distinct values | 345546 (100%) | 0 (0%) | ||
| 35 | SEC_INSTAL_AMT [integer] | mean (sd) : 272.74 (16261.26) min < med < max : 0 < 0 < 5390000 IQR (CV) : 0 (59.62) | 2295 distinct values | 345546 (100%) | 0 (0%) | ||
| 36 | NEW_ACCTS_IN_LAST_SIX_MONTHS [integer] | mean (sd) : 0.36 (0.92) min < med < max : 0 < 0 < 35 IQR (CV) : 0 (2.56) | 26 distinct values | 345546 (100%) | 0 (0%) | ||
| 37 | DELINQUENT_ACCTS_IN_LAST_SIX_M [integer] | mean (sd) : 0.1 (0.38) min < med < max : 0 < 0 < 20 IQR (CV) : 0 (4.01) | 16 distinct values | 345546 (100%) | 0 (0%) | ||
| 38 | AVERAGE_ACCT_AGE [character] | 1. 0yrs 0mon 2. 0yrs 6mon 3. 0yrs 7mon 4. 0yrs 11mon 5. 0yrs 10mon 6. 1yrs 0mon 7. 0yrs 9mon 8. 0yrs 8mon 9. 1yrs 1mon 10. 0yrs 5mon [ 190 others ] | 177481 (51.4%) 9325 (2.7%) 8167 (2.4%) 7665 (2.2%) 7587 (2.2%) 7447 (2.2%) 7353 (2.1%) 7224 (2.1%) 6680 (1.9%) 6458 (1.9%) 100159 (29.0%) | 345546 (100%) | 0 (0%) | ||
| 39 | CREDIT_HISTORY_LENGTH [character] | 1. 0yrs 0mon 2. 0yrs 6mon 3. 2yrs 1mon 4. 0yrs 7mon 5. 2yrs 0mon 6. 1yrs 0mon 7. 1yrs 1mon 8. 0yrs 11mon 9. 0yrs 8mon 10. 0yrs 9mon [ 297 others ] | 177178 (51.3%) 7456 (2.2%) 6932 (2.0%) 6243 (1.8%) 5762 (1.7%) 5153 (1.5%) 4542 (1.3%) 3925 (1.1%) 3753 (1.1%) 3572 (1.0%) 121030 (35.0%) | 345546 (100%) | 0 (0%) | ||
| 40 | NO_OF_INQUIRIES [integer] | mean (sd) : 0.21 (0.72) min < med < max : 0 < 0 < 36 IQR (CV) : 0 (3.37) | 26 distinct values | 345546 (100%) | 0 (0%) | ||
| 41 | GB_flag [integer] | mean (sd) : 0.22 (0.41) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (1.9) | 0 : 182543 (78.3%) 1 : 50611 (21.7%) | 233154 (67.47%) | 112392 (32.53%) | ||
| 42 | Age [labelled, numeric] | Age in Years | mean (sd) : 34.81 (9.86) min < med < max : 18.09 < 33 < 69.29 IQR (CV) : 15.28 (0.28) | 15888 distinct values | 345546 (100%) | 0 (0%) | |
| 43 | YearsSinceDisbursment [labelled, numeric] | Years Since Disbursement | mean (sd) : 0.22 (0.1) min < med < max : 0.08 < 0.2 < 0.42 IQR (CV) : 0.16 (0.43) | 111 distinct values | 345546 (100%) | 0 (0%) | |
| 44 | AverageLoanTenure [labelled, numeric] | Average loan tenure in Years (AVERAGE_ACCT_AGE) | mean (sd) : 0.74 (1.26) min < med < max : 0 < 0 < 30.75 IQR (CV) : 1.08 (1.69) | 200 distinct values | 345546 (100%) | 0 (0%) | |
| 45 | TimeSinceFirstLoan [labelled, numeric] | Years since First Loan (CREDIT_HISTORY_LENGTH) | mean (sd) : 1.33 (2.35) min < med < max : 0 < 0 < 39 IQR (CV) : 1.92 (1.76) | 307 distinct values | 345546 (100%) | 0 (0%) | |
| 46 | YearsOnLoan [labelled, numeric] | Years On Loan | mean (sd) : 34.59 (9.86) min < med < max : 18 < 32.81 < 69.13 IQR (CV) : 15.23 (0.28) | 16252 distinct values | 345546 (100%) | 0 (0%) | |
| 47 | DisAsDiff [labelled, integer] | Difference of asset_cost from disbursed_amount | mean (sd) : 21378.46 (12308.93) min < med < max : 3917 < 17901 < 638420 IQR (CV) : 13187 (0.58) | 48676 distinct values | 345546 (100%) | 0 (0%) | |
| 48 | DisAsShare [labelled, numeric] | Ratio of asset_cost to disbursed_amount | mean (sd) : 1.42 (0.31) min < med < max : 1.07 < 1.34 < 10.57 IQR (CV) : 0.26 (0.22) | 299901 distinct values | 345546 (100%) | 0 (0%) | |
| 49 | Qrt [labelled, factor] | Quarter of DisbursalDate | 1. 3 2. 4 | 134790 (39.0%) 210756 (61.0%) | 345546 (100%) | 0 (0%) | |
| 50 | Day [labelled, integer] | Day of month of DisbursalDate | mean (sd) : 19.3 (7.86) min < med < max : 1 < 20 < 31 IQR (CV) : 13 (0.41) | 31 distinct values | 345546 (100%) | 0 (0%) | |
| 51 | OutstandingNow [labelled, integer] | Sum of disbursed_amount and PRI_CURRENT_BALANCE | mean (sd) : 215186.57 (925615.45) min < med < max : -6608979 < 60379.5 < 96583433 IQR (CV) : 40649.75 (4.3) | 118719 distinct values | 345546 (100%) | 0 (0%) | |
| 52 | DisbursedTotal [labelled, integer] | Sum of PRI_DISBURSED_AMOUNT and disbursed_amount | mean (sd) : 264477.16 (2047611.56) min < med < max : 11613 < 62639.5 < 1000047773 IQR (CV) : 63776.5 (7.74) | 121005 distinct values | 345546 (100%) | 0 (0%) | |
| 53 | ShareOverdue [labelled, integer] | Difference of DELINQUENT_ACCTS_IN_LAST_SIX_M from NEW_ACCTS_IN_LAST_SIX_MONTHS | mean (sd) : -0.26 (0.93) min < med < max : -30 < 0 < 17 IQR (CV) : 0 (-3.52) | 38 distinct values | 345546 (100%) | 0 (0%) | |
| 54 | SEC_OverdueShareSec [labelled, numeric] | Ratio of SEC_OVERDUE_ACCTS to SEC_NO_OF_ACCTS | mean (sd) : 0.16 (0.33) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (2.11) | 66 distinct values | 7108 (2.06%) | 338438 (97.94%) | |
| 55 | PRI_OverdueShare [labelled, numeric] | Ratio of PRI_OVERDUE_ACCTS to PRI_NO_OF_ACCTS | mean (sd) : 0.09 (0.23) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (2.49) | 324 distinct values | 170703 (49.4%) | 174843 (50.6%) |
Generated by summarytools 0.8.8 (R version 3.5.2)
2019-08-31
The function smbinning::smbinning() (Optimal Binning for Scoring Modeling) categorizes a numeric characteristic into bins for later use in scoring modeling. This process, also known as supervised discretization, utilizes Recursive Partitioning (rpart) to categorize the numeric characteristic.
WenSui Liu has developed two different algorithms for monotonic binning of numeric variables. While the first tends to generate bins with equal densities, the second defines finer bins based on isotonic regression.
In the code snippet below, a third approach is illustrated, whose purpose is to generate bins with roughly equal-sized bad counts. Once again, for the reporting layer, WenSui Liu leveraged the flexible smbinning::smbinning.custom() function with a small tweak.
The levels of factor variables should be joined into groups manually by the special function smbinning::smbinning.factor.custom(); a hypothetical example of both custom calls follows.
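For illustration, custom calls with hand-picked cutpoints and level groups might look like this (the cut values, the factor name and its levels are examples only, not fitted to these data):
# Numeric variable: bin `ltv` at hand-picked cutpoints (illustrative values)
ltv.custom <- smbinning::smbinning.custom(df = DF0, y = 'GB_flag', x = 'ltv', cuts = c(60, 75, 85))
ltv.custom$ivtable # binning table with counts, WoE and IV per bin
# Factor variable: levels joined manually into groups (quoted level sets; placeholders)
# smbinning::smbinning.factor.custom(DF0, x = 'SomeFactor', y = 'GB_flag', c("'A'", "'B','C'"))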
# Binning and Fine Classing Factor and Numeric (Scale) Variables
# https://support.sas.com/documentation/cdl/en/emcsgs/66008/PDF/default/emcsgs.pdf - SAS Enterprise Miner 12.1
# install.packages("https://cran.microsoft.com/src/contrib/smbinning_0.9.tar.gz", repos = NULL, type = "source")
library('smbinning') # Scoring Modeling and Optimal Binning for GLM Model from Herman Jopia (Chile)
## Loading required package: sqldf
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: partykit
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
# install.packages("https://cran.r-project.org/src/contrib/woeBinning_0.1.6.tar.gz", repos = NULL, type = "source")
library('woeBinning') # Supervised Weight of Evidence Binning of Numeric Variables and Factors
library('openxlsx') # Read, Write and Edit Microsoft XLSX Files
isobin <- function(data, y, x) { # Second Variant - Finer Monotonic Binning Based on Isotonic Regression
# WenSui Liu leads a team of quantitative analysts developing operational risk models for an American bank.
# https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression/
d1 <- data[c(y, x)]
d2 <- d1[!is.na(d1[x]), ]
c <- cor(d2[, 2], d2[, 1], method = 'spearman', use = 'complete.obs')
reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1])
k <- knots(as.stepfun(reg))
sm1 <- smbinning::smbinning.custom(d1, y, x, k)
c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint)
c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, ' '))))
c3 <- c2[!is.na(c2)]
return(smbinning::smbinning.custom(d1, y, x, c3[-length(c3)]))
}
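# Usage (as below): e.g. isobin(DF0, 'GB_flag', 'ltv') returns a smbinning-style list with $ivtable, $iv and $bands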
tree_bin <- function(data, y, x) {
# Tree-based WOE binning via Thilo Eichenberg's `woeBinning` package: returns numeric
# cutpoints for numeric variables, or quoted groups of levels for factor variables
binning <- woeBinning::woe.tree.binning(df = data, target.var = y, pred.var = x,
min.perc.total = 0.05, min.perc.class = 0, stop.limit = 0.01,
abbrev.fact.levels = 200, event.class = 1)
if (class(binning) == 'list') {
z <- c()
if ( class(data[, x]) == 'factor') {
for (variable in binning[[2]]$Group.2 %>% levels()) {
df <- binning[[2]]
fac_vec <- df[binning[[2]]$Group.2 == variable, 'Group.1'] %>%
as.character
chr_vec <- paste0('\'', paste0(fac_vec, sep = "\'", collapse = ', \''))
z <- c(z, chr_vec)
}
} else {
df <- binning[[2]]
z <- df$cutpoints.final %>%
.[c(-1, -nrow(df))] %>%
as.vector()
} # End of 'Numeric' class
} else { # binning failed: return an empty cutpoint vector
z <- c()
}
return( z )
}
# Exploratory Data Analysis (EDA)
if (Max_Vars >= ncol(DF0) )
smbinning.eda(DF0)$eda
# Flip the outcome flag for smbinning(), which counts y = 1 as 'Good' (CntGood)
DF0 %<>% # dplyr::mutate_if(is.double, round) %>%
# dplyr::mutate_if(is.double, as.integer) %>%
dplyr::mutate( GB_flag = ifelse(GB_flag == 1, 0, 1) ) # Flag for Optimal Binning
# Convert integer features with fewer than 5 unique values into factors for smbinning()
DF0 %<>% dplyr::mutate_at( dplyr::select_if(., ~ is.integer(.) & unique(.) %>% length(.) < 5 ) %>%
dplyr::select( -dplyr::one_of('GB_flag') ) %>% colnames,
as.factor )
# Create MS Excel File for Output
openxlsx::addWorksheet(wb <- openxlsx::createWorkbook(), sheetName = 'IV Table',
gridLines = FALSE, tabColour = 'olivedrab')
openxlsx::addWorksheet(wb, sheetName = 'Scorecard', gridLines = FALSE, tabColour = 'brown')
NamesOfVariables <- DF0 %>%
dplyr::select( -dplyr::one_of('GB_flag', dplyr::select_if(., ~ is.factor(.) & nlevels(.) == 1) %>%
colnames, # Levels > 1
# At least 5 different values for Numeric variables
dplyr::select_if(., ~ is.numeric(.) & unique(.) %>% length() < 5) %>% colnames)) %>%
colnames
binning.df <- cbind(Variable = NamesOfVariables,
`IV-Finish` = rep(NA_real_, times = length(NamesOfVariables)),
data.frame(matrix(data = rep(NA_real_, length(NamesOfVariables) * 8),
nrow = length(NamesOfVariables), ncol = 8))) %>%
setNames(c('Variable', 'IV', 'IV-RPart', 'N-RPart', 'IV-Decile', 'N-Decile',
'IV-Iso', 'N-Iso', 'IV-Tree', 'N-Tree'))
binning.df$Method <- rep(x = '', times = length(NamesOfVariables))
TotalBinning.sql <- ''
for (i in 1:length(NamesOfVariables)) {
val <- NamesOfVariables[i]
writeLines(paste(i, '-', val))
if (DF0[, val] %>% is.factor) { # Generate a binning table for all the categories of a given factor variable (A factor variable with at least 2 different values. Labels with commas are not allowed)
result.smb <- switch(val,
`education` = smbinning.factor.custom(DF0, x = val, y = 'GB_flag',
c("'Высшее'", # 'Высшее'
"'Начальное','Средне-специальное'", # 'Начальное','Средне-специальное'
"'Среднее'")), # 'Среднее'
if (levels(DF0[, val]) %>% length() > Max_Levels) { # Multi levels `factor` variables
chr_vec <- tree_bin(DF0, 'GB_flag', val) # Combine levels of factor variable by Badrate into some bins
smbinning.factor.custom(DF0, x = val, y = 'GB_flag', chr_vec)
}
else {
smbinning.factor(DF0, x = val, y = 'GB_flag', maxcat = levels(DF0[, val]) %>% length() + 1)
}
)
if (class(result.smb) == 'list') {
binning.df[i, 'IV'] <- result.smb$iv
binning.df[i, 'IV-RPart'] <- result.smb$iv
binning.df[i, 'N-RPart'] <- result.smb$ivtable %>% nrow - 1 # with Missing Values
binning.df[i, 'Method'] <- ifelse(length(result.smb$groups) == 0, 'IV-By Levels', 'IV-By Groups')[1]
# IV Table Supplement
result.smb$ivtable <- result.smb$ivtable %>%
dplyr::mutate(G_Dis = CntGood / table(DF0$GB_flag)[2],
B_Dis = CntBad / table(DF0$GB_flag)[1],
`G/B Index` = ifelse(G_Dis > B_Dis, G_Dis / B_Dis, B_Dis / G_Dis),
`0=Good, 1=Bad` = ifelse(G_Dis > B_Dis, 0, 1),
Bin = c(1:(nrow(result.smb$ivtable) - 1), NA),
# Min = (c(NA, result.smb$cuts, NA, NA)),
# Max = (c(result.smb$cuts, NA, NA, NA))
Min = rep(NA, times = result.smb$ivtable %>% nrow),
Max = rep(NA, times = result.smb$ivtable %>% nrow)
)
}
} else
{ # Numeric Class
# Optimal Binning for Scoring Modeling from package `smbinning`
# This process, also known as supervised discretization, utilizes Recursive Partitioning to categorize the numeric characteristic. The especific algorithm is Conditional Inference Trees which initially excludes missing values (NA) to compute the cutpoints, adding them back later in the process for the calculation of the Information Value.
result1.smb <- smbinning(DF0, 'GB_flag', val)
if (class(result1.smb) == 'list') {
binning.df[i, 'IV-RPart'] <- result1.smb$iv
binning.df[i, 'N-RPart'] <- result1.smb$bands %>% length
}
# Custom Binning Based by cutpoints using percentiles (10% each)
if (length(NamesOfVariables) <= Max_Vars || class(result1.smb) != 'list') {
cbs1cuts <- as.vector(quantile(DF0[, val], probs=seq(0, 1, 0.1), na.rm=TRUE)) # Quantiles by 10%
cbs1cuts <- cbs1cuts[2:(length(cbs1cuts) - 1)] # Remove first (min) and last (max) values
result2.smb <- smbinning.custom(df=DF0, y = 'GB_flag', x = val, cuts = cbs1cuts)
binning.df[i, 'IV-Decile'] <- result2.smb$iv
binning.df[i, 'N-Decile'] <- result2.smb$bands %>% length
} else {
binning.df[i, 'IV-Decile'] <- 0
binning.df[i, 'N-Decile'] <- ncol(DF0) + 1
}
if (length(NamesOfVariables) <= Max_Vars ) { # & !isMicrosoftRServer
# Finer Monotonic Binning Based on Isotonic Regression - does not work with Microsoft R Server 9.3.0
result3.smb <- isobin(DF0, 'GB_flag', val)
binning.df[i, 'IV-Iso'] <- result3.smb$iv
binning.df[i, 'N-Iso'] <- result3.smb$bands %>% length
# Generates a supervised tree-like segmentation of numeric variables with respect to a binary target outcome
# result4.smb <- tree_chimergebin(DF0, 'GB_flag', val)
cbs1cuts <- tree_bin(DF0, 'GB_flag', val) # Binning via Tree-Like Segmentation
result4.smb <- smbinning.custom(df = DF0, x = val, y = 'GB_flag', cuts = cbs1cuts)
if (class(result4.smb) == 'list') {
binning.df[i, 'IV-Tree'] <- result4.smb$iv
binning.df[i, 'N-Tree'] <- result4.smb$bands %>% length
} else { # 'Not Meaningful (IV<0.1)' or 'Uniques values < 5' case
binning.df[i, 'IV-Tree'] <- 0
binning.df[i, 'N-Tree'] <- ncol(DF0)
}
} else {
binning.df[i, 'IV-Iso'] <- 0
binning.df[i, 'N-Iso'] <- ncol(DF0)
binning.df[i, 'IV-Tree'] <- 0
binning.df[i, 'N-Tree'] <- ncol(DF0)
}
# Selection of the Optimal Binning Method
if (if_else(is.na(binning.df[i, 'IV-RPart']) == TRUE, 0, binning.df[i, 'IV-RPart'] * 1.1) >
binning.df[i, 'IV-Decile']) {
binning.df[i, 'Method'] <- 'IV-RPart'
} else
{
if ( ( (binning.df[i, 'IV-Iso'] > binning.df[i, 'IV-Decile']) &
(binning.df[i, 'N-Iso'] / 1.1 < binning.df[i, 'N-Decile']) ) |
( (binning.df[i, 'IV-Iso'] * 1.1 > binning.df[i, 'IV-Decile']) &
(binning.df[i, 'N-Iso'] * 2 < binning.df[i, 'N-Decile']) ) ) {
binning.df[i, 'Method'] <- 'IV-Iso'
} else {
if (binning.df[i, 'IV-Decile'] >= binning.df[i, 'IV-Iso']) {
binning.df[i, 'Method'] <- 'IV-Decile'
} else {
binning.df[i, 'Method'] <- 'IV-Iso' }
}
} # End Else If
type <- binning.df[i, 'Method']
result.smb <-
switch(type,
`IV-RPart` = result1.smb,
`IV-Decile` = result2.smb,
`IV-Iso` = result3.smb,
`IV-Tree` = result4.smb
)
binning.df[i, 'IV'] <- result.smb$iv
# IV Table Supplement
result.smb$ivtable <- result.smb$ivtable %>%
mutate(G_Dis = CntGood / table(DF0$GB_flag)[2],
B_Dis = CntBad / table(DF0$GB_flag)[1],
`G/B Index` = if_else(G_Dis > B_Dis, G_Dis / B_Dis, B_Dis / G_Dis),
`0=Good, 1=Bad` = if_else(G_Dis > B_Dis, 0, 1),
Bin = c(1:(nrow(result.smb$ivtable) - 1), NA),
Min = c(NA, result.smb$cuts, NA, NA),
Max = c(result.smb$cuts, NA, NA, NA)
)
} # End else for numeric class
# Prepare MySQL-code for Binning and Fine Classing
# Sys.setloc <- Sys.setlocale(locale = 'Russian') # set locale to `Russian`
binning.sql <- capture.output(smbinning.sql(result.smb)) %>%
gsub("then '0", "then '", .) %>%
gsub('TableName', 'DF', .) %>%
stringr::str_replace('NewCharName', paste0(val, '_fct'))
val_fct <- paste0(' \'', val, '_fct', '\'', ' from data.frame \'DF0\'')
binning.sql <- c('select *,', paste0(' /* Inserting the new factor variable', val_fct, ' */'),
binning.sql[-c(1:3)] , paste0(' \'', val, '_fct', '\'', ' from \'DF1\''))
# Truncate overly long gradation names in a complex set of levels
theBest <- TRUE
for (j in 1:(nrow(result.smb[['ivtable']]) - 2)) {
if (str_length(binning.sql[j + 3]) > 999) {
gradation_str <- str_split( string = binning.sql[j + 3], pattern = sprintf('%s: %s ', j, val) ) %>%
unlist
binning.sql[j + 3] <-
paste0(gradation_str[1], ifelse( theBest, sprintf('%s: %s the Best\'', j, val),
sprintf('%s: %s the Worst\'', j, val) ) )
theBest <- FALSE
} else {
# print('empty')
}
}
# Appending binning.sql into TotalBinning.sql
if (i == 1) {
binning.sql[length(binning.sql)] = paste0(' \'', val, '_fct\',')
TotalBinning.sql <- binning.sql
} else {
if (i == length(NamesOfVariables)) {
TotalBinning.sql <- c(TotalBinning.sql, '', binning.sql[-1] )
} else {
TotalBinning.sql <- c(TotalBinning.sql, '', binning.sql[-c(1, length(binning.sql))],
paste0(' \'', val, '_fct\','))
}
} # End if (i == 1)
# TotalBinning.sql <- c(TotalBinning.sql, '', binning.sql)
# Preparing a Data.Frame with IV Table for Export into MS Excel
addWorksheet(wb, val, gridLines = FALSE, tabColour = ifelse(result.smb$iv >= 0.05, 'chartreuse',
ifelse(result.smb$iv >= 0.03, 'khaki', 'white')))
result.smb$ivtable[ is.na( result.smb$ivtable ) ] <- NA # Dealing with NaN's in data frames
N <- 3:(nrow(result.smb$ivtable) + 2)
result.smb$ivtable %>%
dplyr::select(Cutpoint, Bin, Min, Max, CntRec, CntGood, CntBad, G_Dis, B_Dis, Share = PctRec,
BadRate, WoE, IV,`G/B Index`, `0=Good, 1=Bad`) %>%
writeDataTable(wb, sheet = val, x = ., tableStyle = 'TableStyleMedium2', startCol = 'A',
startRow = 2, tableName = val, firstColumn = TRUE, lastColumn = FALSE, bandedRows = TRUE)
# Set Columns widths
setColWidths(wb, sheet = val, cols = 1:4, widths = c(32, 7, 10, 10))
# # Set Row heights
# setRowHeights(wb, sheet = 1, rows = 1, heights = 45)
# Set Styles & Conditional Formattings in Columns
addStyle(wb, sheet = val, style = createStyle(wrapText = TRUE, halign = 'center', valign = 'center'),
cols = 1:ncol(result.smb$ivtable), rows = 2)
addStyle(wb, sheet = val, cols = 1, rows = 1, style = createStyle(fontSize = 16, textDecoration = 'bold'))
addStyle(wb, sheet = val, cols = 1:ncol(result.smb$ivtable), rows = (nrow(result.smb$ivtable) + 2),
style = createStyle(textDecoration = 'bold'))
addStyle(wb, sheet = val, cols = 5:7, rows = N, style = createStyle(numFmt = 'COMMA'), gridExpand = TRUE)
addStyle(wb, sheet = val, cols = 8:10, rows = N, style = createStyle(numFmt = '0%'), gridExpand = TRUE)
addStyle(wb, sheet = val, cols = 11, rows = N, style = createStyle(numFmt = paste0('0', options()$OutDec, '0%'), textDecoration = 'bold'))
conditionalFormatting(wb, sheet = val, cols = 13, rows = 3:(nrow(result.smb$ivtable) + 1), type = 'databar',
border = FALSE, style = c('red', 'chartreuse'))
addStyle(wb, sheet = val, cols = 4, rows = N, style = createStyle(border = 'right', borderColour = '#4F81BC'))
addStyle(wb, sheet = val, cols = 7, rows = N, style = createStyle(numFmt = 'COMMA', border = 'right',
borderColour = '#4F81BC'))
addStyle(wb, sheet = val, cols = 10, rows = N, style = createStyle(numFmt = '0%', border = 'right',
borderColour = '#4F81BC'))
conditionalFormatting(wb, sheet = val, cols = 15, rows = 3:(nrow(result.smb$ivtable) + 1), rule ='$O3=0',
style = createStyle(fontColour = 'red', halign = 'center', valign = 'center', textDecoration = 'bold'))
conditionalFormatting(wb, sheet = val, cols = 15, rows = 3:(nrow(result.smb$ivtable) + 1), rule ='$O3>0',
style = createStyle(fontColour = 'black', halign = 'center', valign = 'center', textDecoration = 'bold'))
writeData(wb, sheet = val, val, startCol = 'A', startRow = 1)
writeData(wb, sheet = val, data.frame(binning.sql), colNames = FALSE, rowNames = FALSE,
startCol = 'A', startRow = nrow(result.smb$ivtable) + 4)
writeFormula(wb, sheet = val, startCol = 'A',
startRow = nrow(result.smb$ivtable) + 5 + length(binning.sql),
x = makeHyperlinkString(sheet = 'IV Table', row = i + 2, col = 1,
text = 'Link to IV Table'))
} # End next i
## 1 - disbursed_amount
## 2 - asset_cost
## 3 - ltv
## 4 - branch_id
## 5 - supplier_id
## 6 - manufacturer_id
## 7 - Current_pincode_ID
## 8 - Employment_Type
## 9 - State_ID
## 10 - Employee_code_ID
## 11 - Aadhar_flag
## 12 - PAN_flag
## 13 - VoterID_flag
## 14 - Driving_flag
## 15 - Passport_flag
## 16 - PERFORM_CNS_SCORE
## 17 - PERFORM_CNS_SCORE_DESCRIPTION
## 18 - PRI_NO_OF_ACCTS
## 19 - PRI_ACTIVE_ACCTS
## 20 - PRI_OVERDUE_ACCTS
## 21 - PRI_SANCTIONED_AMOUNT
## 22 - PRI_DISBURSED_AMOUNT
## 23 - SEC_NO_OF_ACCTS
## 24 - SEC_ACTIVE_ACCTS
## 25 - SEC_OVERDUE_ACCTS
## 26 - SEC_CURRENT_BALANCE
## 27 - SEC_SANCTIONED_AMOUNT
## 28 - SEC_DISBURSED_AMOUNT
## 29 - PRIMARY_INSTAL_AMT
## 30 - SEC_INSTAL_AMT
## 31 - NEW_ACCTS_IN_LAST_SIX_MONTHS
## 32 - DELINQUENT_ACCTS_IN_LAST_SIX_M
## 33 - NO_OF_INQUIRIES
## 34 - Age
## 35 - YearsSinceDisbursment
## 36 - AverageLoanTenure
## 37 - TimeSinceFirstLoan
## 38 - YearsOnLoan
## 39 - DisAsDiff
## 40 - DisAsShare
## 41 - Qrt
## 42 - Day
## 43 - OutstandingNow
## 44 - DisbursedTotal
## 45 - ShareOverdue
## 46 - SEC_OverdueShareSec
## 47 - PRI_OverdueShare
# Restore the original outcome coding (1 = default) after Optimal Binning
DF0 %<>% dplyr::mutate( GB_flag = ifelse(GB_flag == 1, 0, 1) )
# Write MySQL code for Coarse Classing Selected Variables
write_lines(x = TotalBinning.sql, path = NameMySQLCode, na = "NA", append = FALSE)
N <- 3:(nrow(binning.df) + 2)
# writeDataTable(wb, sheet = 'IV Table', x = binning.df, tableStyle = 'TableStyleMedium4', startCol = 'A',
# startRow = 2, tableName = 'IVTable', firstColumn = FALSE, lastColumn = TRUE, bandedRows = TRUE)
# Set Columns widths
setColWidths(wb, sheet = 'IV Table', cols = 1:2, widths = c(15, 12))
setColWidths(wb, sheet = 'IV Table', cols = 11, widths = c(12))
conditionalFormatting(wb, sheet = 'IV Table', cols = 2, rows = N, type = 'databar',
border = FALSE, style = c('red', 'royalblue'))
addStyle(wb, sheet = 'IV Table', cols = 2, rows = N,
style = createStyle(border = 'right', borderColour = '#9CB95C'))
addStyle(wb, sheet = 'IV Table', cols = 10, rows = N,
style = createStyle(border = 'right', borderColour = '#9CB95C'))
writeData(wb, sheet = 'IV Table', 'IV Table', startCol = 'A', startRow = 1)
addStyle(wb, sheet = 'IV Table', cols = 1, rows = 1,
style = createStyle(fontSize = 16, textDecoration = 'bold'))
for (i in 1:nrow(binning.df)) {
## Internal - Text to display
val = binning.df[i, 'Variable']
writeFormula(wb, sheet = 'IV Table', startCol = 'A', startRow = i + 2,
x = makeHyperlinkString(sheet = val, row = 1, col = 1, text = val))
}
# # Open MS Excel
# openXL(wb)
remove(NamesOfVariables, result1.smb, result2.smb, result3.smb, result4.smb, result.smb, chr_vec, # binning.df,
j, TotalBinning.sql, theBest, binning.sql, val, val_fct, i, type, cbs1cuts, N)
Create 10-20 bins/groups for a continuous independent variable and then calculate the WOE and IV of the variable.
Combine adjacent categories with similar WOE scores.
Rules related to WOE (the formulas are given below):
Each category (bin) should have at least 5% of the observations.
Each category (bin) should be non-zero for both non-events and events.
The WOE should be distinct for each category; similar groups should be aggregated.
The WOE should be monotonic, i.e. either increasing or decreasing with the groupings.
Missing values are binned separately.
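For reference, the weight of evidence of bin \(i\) and the total Information Value over all bins are computed as:
\[ \displaystyle WoE_i = \ln\left(\frac{Good_i / Good_{Total}}{Bad_i / Bad_{Total}}\right), \hspace{.5 in} IV = \sum_i \left(\frac{Good_i}{Good_{Total}} - \frac{Bad_i}{Bad_{Total}}\right) × WoE_i \]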
# use library('sqlrutils') from Microsoft Corporation
library('sqldf') # Manipulate R Data Frames Using SQL
# Apply the generated SQL code for coarse classing via sqldf
DF <- sqldf::sqldf(x = read_file( NameMySQLCode ), method = 'name__class') %>% # PRIORITY_DAYS - 'difftime'
dplyr::select(c(one_of('GB_flag'), ends_with('_fct'))) %>%
dplyr::mutate_all(as.factor)
remove(DF0, DF1)
#
# SelectedFeatures <- c('ltv_fct', 'branch_id_fct', 'supplier_id_fct', 'Current_pincode_ID_fct', 'State_ID_fct', 'Employee_code_ID_fct', 'Passport_flag_fct', 'PERFORM_CNS_SCORE_DESCRIPTION_fct', 'PRI_OVERDUE_ACCTS_fct', 'PRI_DISBURSED_AMOUNT_fct', 'SEC_CURRENT_BALANCE_fct', 'NO_OF_INQUIRIES_fct', 'Age_fct', 'YearsSinceDisbursment_fct', 'TimeSinceFirstLoan_fct')
#
# DF <- DF[, c('GB_flag', SelectedFeatures)]
Let's form an array of independent factors (predictors) and an outcome, a dependent binary factor.
It is a good idea to use a validation hold-out set. This is a sample of the data that we hold back from our analysis and modeling. We use it right at the end of our project to confirm the accuracy of our final model. It is a smoke test that we can use to see if we messed up and to give us confidence in our estimates of accuracy on unseen data.
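The chunk that defines the predictor set X and the outcome Y is not shown in this extract; a plausible reconstruction that matches the counts printed below, assuming GB_flag = 1 marks a default ('Bad'), is:
# Hypothetical reconstruction of the missing chunk
Y <- factor(DF$GB_flag, levels = c(1, 0), labels = c('Bad', 'Good'))
X <- dplyr::select(DF, -GB_flag)
table(Y[inTrain])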
##
## Bad Good
## 45557 164281
# Weights of cases to resolve the class-imbalance problem: each class gets the weight
# n_total / (2 * n_class), so the weights sum to length(weight.cases)
# (here: Good = 209838 / (2 * 164281) ≈ 0.64, Bad = 209838 / (2 * 45557) ≈ 2.30)
weight.cases <- as.numeric( Y[inTrain] )
for(val in unique(weight.cases)) {weight.cases[weight.cases==val]=
1/sum(weight.cases==val)*length(weight.cases)/2} # normalized to sum to length(samples)
# # To list all columns with missing values you might write:
# is.na(X) %>% colSums()
#
# # To list all rows with missing values you might write:
# X[!complete.cases(X), ]

\[ \displaystyle \large n = \frac{Z_{\alpha/2}^2 × p × (1\ - \ p)}{\beta^2} \hspace{.5 in} [1]\]

where \(Z_{\alpha/2}\) is the critical value of the Normal distribution at \(\alpha/2\) (e.g. for a confidence level of 98%, \(\alpha\) (Type I Error or False Positive) is 0.02, so \(\alpha/2\) = 0.01 and the critical value \(Z_{\alpha/2}\) is 2.326),
\(\beta\) is the margin of error (set to 0.05 here),
p is the sample proportion,
and n is the minimal (required) sample size.
# Population Proportion – Minimal Sample Size
# Source: Daniel WW. Biostatistics: A Foundation for Analysis in the Health Sciences. 7th edition. New York: John Wiley & Sons. (1999)
# https://select-statistics.co.uk/calculators/sample-size-calculator-population-proportion/
yName <- 'GB_flag' # Name of Outcome
yTarget <- 1 # Name of Outcome's Positive Level
writeLines(sprintf('Sampling into two unequal sample sizes with Fraction (%0.2f%%) of Training & Test Sets. \n',
length(Y[inTrain]) / length(Y) * 100))
## Sampling into two unequal sample sizes with Fraction (60.73%) of Training & Test Sets.
writeLines(sprintf('The full dataset has %g observations.
Therefore the training dataset might be %g obs. and the test dataset %g obs. \n',
length(Y), round(length(Y) * 0.70, -4), length(Y) - round(length(Y) * 0.70, -4)))
## The full dataset has 345546 observations.
## Therefore the training dataset might be 240000 obs. and the test dataset 105546 obs.
alpha = 0.05 # Type I Error or False Positive - α probability (significance level) for a two-tailed test
beta = 0.10 # Type II Error or False Negative - β probability level
BadRate <- table(Y[inTrain]) %>% .['Bad'] / length(Y[inTrain])
# MSS <- (qnorm(1 - alpha/2)^2 * BadRate * (1 - BadRate)) / (beta)^2 # BadRate *
# *********************************************************************************************************** #
#
# Power Analysis for two Proportions (different sample sizes) of The Binomial Distributions
#
# *********************************************************************************************************** #
# Fisher's exact test is a statistical significance test used for Contingency Tables (Binomial Distribution)
matrix(c(table(Y[inTrain]), table(Y[inTest])),
nrow = 2,
dimnames = list(Subjects = c('Bad', 'Good'),
Samples = c('inTrain', 'inTest'))) %T>%
print() %>%
stats::fisher.test(., alternative = 'two.sided', conf.level = 1 - alpha)
## Samples
## Subjects inTrain inTest
## Bad 45557 5054
## Good 164281 18262
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.9067
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9695789 1.0356963
## sample estimates:
## odds ratio
## 1.00203
# # Null Hypothesis Significance Testing
# pB <-
# ShowNullHypothesisSignificanceTesting(pA = BadRate,
# pT = table(Y[inTest]) %>% .['Bad'] / length(Y[inTest]),
# nA = length(Y[inTrain]),
# nB = length(Y[inTest]),
# alpha = alpha,
# beta = beta)
# Example of using `uniroot` function - https://rpubs.com/chidungkt/408264
frac <- length(Y[inTrain]) / (length(Y[inTrain]) + length(Y[inTest]))
# Define a helper: power and sample size for a two-sample binomial test
func1 <- function(x) { sizes <- Hmisc::bsamsize(p1 = BadRate, p2 = BadRate + x,
fraction = frac, alpha = alpha, power = (1 - beta))
return(sizes[2]) } # B (Control) Sample Size
# Find the smallest detectable shift in bad rate (pThreshold) for which the required
# control sample size is about 98% of the actual test set size
pThreshold <- stats::uniroot( function(x) func1(x) - (length(Y[inTest]) * 0.98), c(0, BadRate) )$root
# Estimating Sample size(s) for Two unequivalent Binomial Sample
library('pwr') # Basic Functions for Power Analysis
( estimate <- pwr::pwr.2p2n.test( h = pwr::ES.h(p1 = BadRate, # Treatment Probability of Success
p2 = BadRate + pThreshold ) # Hypothesized control probability of success
, n1 = length(Y[inTrain]) # Sample Size for Training or A (Treatment) Set
, n2 = NULL # Sample Size for Test or B (Control) Set
, sig.level = alpha # Type I Error or False Positive
, power = (1 - beta) # Power of Test (1 - Type II Error)
, alternative = 'two.sided') )
##
## difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.02259551
## n1 = 209838
## n2 = 22818.26
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: different sample sizes
MSS <- round(estimate$n2 + 0.5) # Minimal Sample Size for Test (Control) Set
pwr::plot.power.htest(estimate) +
ggplot2::annotate(geom = 'text', x = 0, y = 0, size = 3.5, hjust = 'left',
label = sprintf('Badrate in { %0.2f%%, %0.2f%% }',
(BadRate - pThreshold) * 100, (BadRate + pThreshold) * 100))
# MSS <- Hmisc::bsamsize(p1 = table(Y[inTrain]) %>% .['Bad'] / length(Y[inTrain]),
# p2 = table(Y[inTrain]) %>% .['Bad'] / length(Y[inTrain]) - 0.005,
# fraction = 0.7, alpha = alpha, power = (1 - beta)) %>%
# round(.[2] + 0.5)
writeLines(sprintf('With Treatment Probability of Success (Bad Rate on Training dataset = %.2f%%) and Bias = %0.2f%%, Significance level (1 - alpha = %.1f%%) & Power (1 - beta = %.1f%%) Minimal Sample Size of the Binomial distribution should be %g obs., but Test set is %g obs. \n',
BadRate * 100, pThreshold * 100, (1 - alpha) * 100, (1 - beta) * 100, MSS, length(Y[inTest]) ))
## With Treatment Probability of Success (Bad Rate on Training dataset = 21.71%) and Bias = 0.94%, Significance level (1 - alpha = 95.0%) & Power (1 - beta = 90.0%) Minimal Sample Size of the Binomial distribution should be 22819 obs., but Test set is 23316 obs.
# Weights of cases to resolve Class Imbalances Problem
weight.cases <- DT[inTrain, ] %>%
dplyr::pull(!!yName) %>%
as.double()
for(val in unique(weight.cases)) {weight.cases[weight.cases==val]=
1/sum(weight.cases==val)*length(weight.cases)/2} # normalized to sum to length(samples)
remove(alpha, beta, BadRate, MSS, estimate, pThreshold, frac)
Let's look at visualizations of individual attributes. It is often useful to look at your data using multiple different visualizations in order to spark ideas. Let's look at histograms of each attribute to get a sense of the data distributions.
# Density plot for each Features vs `Y`
ShowUnimodalVisualizations(X[inTrain, ], Y[inTrain], isMicrosoftRServer)
## There are no any Numeric feature in `X` data.frame.
## There are no any Factor or Character feature in `X` data.frame.
Important Points
Information value increases as the number of bins/groups for an independent variable increases. Be careful when there are more than 20 bins, as some bins may contain very few events or non-events.
Information value should not be used as a feature selection method when you are building a classification model other than binary logistic regression (for example, random forest or SVM), as it is designed for the binary logistic regression model only.
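A common rule of thumb (Siddiqi, 2006) for interpreting IV: below 0.02 the variable is not useful for prediction, 0.02 to 0.1 indicates a weak predictor, 0.1 to 0.3 a medium predictor, and above 0.3 a strong predictor.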
Let’s look at some visualizations of the interactions between outcome and predictors. The best place to start is a scatter plot matrix.
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: caret
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
## The following object is masked from 'package:tidyr':
##
## smiths
## Var1 Var2 Spearman
## 1 Y Employee_code_ID_fct.2..Employee_code_ID....153 0.0743
## 2 Y DisAsDiff_fct.5..DisAsDiff...19822 0.0679
## 3 Y Current_pincode_ID_fct.3..Current_pincode_ID....174 0.0673
## 4 Y DisAsShare_fct.7..DisAsShare...1.7649 0.0655
## 5 Y Employee_code_ID_fct.3..Employee_code_ID....170 0.0645
## 6 Y PERFORM_CNS_SCORE_fct.6..PERFORM_CNS_SCORE....824 0.0644
## 7 Y supplier_id_fct.3..supplier_id....165 0.0573
## Var1 Var2 Spearman
## 1 Y Current_pincode_ID_fct.12..Current_pincode_ID...291 -0.1252
## 2 Y Employee_code_ID_fct.15..Employee_code_ID...319 -0.1147
## 3 Y supplier_id_fct.13..supplier_id...314 -0.1057
## 4 Y OutstandingNow_fct.3..OutstandingNow....171384 -0.0907
## 5 Y Current_pincode_ID_fct.11..Current_pincode_ID....291 -0.0839
## 6 Y Employee_code_ID_fct.14..Employee_code_ID....319 -0.0823
## 7 Y supplier_id_fct.12..supplier_id....314 -0.0809
This helps point out the skew in many distributions, so pronounced that some data points look like outliers (e.g. beyond the whiskers of the plots).
Next comes feature selection (removing correlated attributes that reduce the quality of classification); note that transformations (Box-Cox or Yeo-Johnson) cannot be applied to factors.
#
# Filtering predictors - Remove Redundant Variables
#
library('caret') # Classification and Regression Training
# # 1a. Imputing Missing Value, Centering, Scaling and Transformation, for example 'YeoJohnson'
# X <- cbind(X %>% select_if(., is.numeric) %>%
# predict(preProcess( ., method = c( 'bagImpute' ) ), newdata = . ) #,
#
# # 1b. Convert factor (nominal and ordered) variables into a full set of dummy integer variables without linear dependencies induced between these predictors
# X %>% select_if(., is.factor) %>%
# data.frame(predict(dummyVars(' ~ .', data = ., fullRank = TRUE), newdata = .)) %>%
# select_if(., is.numeric)
# )
# 2. Dropping zero variance predictors: the cutoff for the ratio of the first most common value to the second value. See https://www.mql5.com/ru/articles/2029
nzv <- caret::nearZeroVar( X[inTrain, ], freqCut = 999/1 , saveMetrics= TRUE )
nzv[nzv$nzv,]
## [1] freqRatio percentUnique zeroVar nzv
## <0 rows> (or 0-length row.names)
zv_cols = caret::nearZeroVar( X[inTrain, ], freqCut = 999/1, saveMetrics = FALSE )
print( sprintf('Dropping %d zero variance predictors from %d (fraction=%10.6f)',
length(zv_cols), dim(X[inTrain, ])[2], length(zv_cols)/dim(X[inTrain, ])[2]) )
## [1] "Dropping 0 zero variance predictors from 47 (fraction=  0.000000)"
## integer(0)
# class(X) <- 'data.frame'
if ( length(zv_cols) != 0 ) {
X <- X[, -zv_cols]
}
# # 3. Centering, Scaling and Transformation or Principal Components # , 'pca'
# X[is.na(X)] <- -1 # replace the <NA>s with zeros in number columns
# X <- X %>% predict(preProcess( X[inTrain, ], method = c( 'medianImpute', 'center', 'scale', 'YeoJohnson', 'pca' ), thresh = 0.99 ), newdata = . ) # %>% # Define all logical fields into numeric type
# # spatialSign %>% data.frame
# 4. Remove NUMERIC variables with high correlation (> .80) to others (multicollinearity)
cor.matrix <- cor( sapply( X[inTrain, ], function(x)
{ as.numeric(x) } ) )
cor.high <- caret::findCorrelation(cor.matrix, cutoff = 0.80, verbose = TRUE, names = FALSE, exact = TRUE)
## Compare row 18 and column 19 with corr 0.869
## Means: 0.239 vs 0.104 so flagging column 18
## Compare row 37 and column 36 with corr 0.843
## Means: 0.211 vs 0.098 so flagging column 37
## Compare row 44 and column 43 with corr 0.871
## Means: 0.206 vs 0.094 so flagging column 44
## Compare row 36 and column 47 with corr 0.846
## Means: 0.186 vs 0.088 so flagging column 36
## Compare row 47 and column 16 with corr 0.883
## Means: 0.162 vs 0.084 so flagging column 47
## Compare row 21 and column 22 with corr 0.99
## Means: 0.132 vs 0.081 so flagging column 21
## Compare row 31 and column 45 with corr 0.949
## Means: 0.134 vs 0.079 so flagging column 31
## Compare row 23 and column 24 with corr 0.803
## Means: 0.143 vs 0.075 so flagging column 23
## Compare row 24 and column 27 with corr 0.989
## Means: 0.129 vs 0.072 so flagging column 24
## Compare row 27 and column 28 with corr 0.995
## Means: 0.106 vs 0.069 so flagging column 27
## Compare row 28 and column 26 with corr 0.933
## Means: 0.081 vs 0.068 so flagging column 28
## Compare row 40 and column 39 with corr 0.812
## Means: 0.102 vs 0.067 so flagging column 40
## Compare row 34 and column 38 with corr 0.996
## Means: 0.082 vs 0.065 so flagging column 34
## Compare row 11 and column 13 with corr 0.869
## Means: 0.093 vs 0.064 so flagging column 11
## Compare row 35 and column 41 with corr 0.824
## Means: 0.064 vs 0.063 so flagging column 35
## All correlations <= 0.8
high.cor.remove <- row.names(cor.matrix)[cor.high]
print( sprintf('Dropping %d predictors due to high correlation to others (multicollinearity) %d (fraction=%10.6f)',
length(high.cor.remove), dim(X[inTrain, ])[2], length(high.cor.remove)/dim(X[inTrain, ])[2]) )
## [1] "Dropping 15 predictors due to high correlation to others (multicollinearity) 47 (fraction= 0.319149)"
## [1] "PRI_NO_OF_ACCTS_fct" "TimeSinceFirstLoan_fct"
## [3] "DisbursedTotal_fct" "AverageLoanTenure_fct"
## [5] "PRI_OverdueShare_fct" "PRI_SANCTIONED_AMOUNT_fct"
## [7] "NEW_ACCTS_IN_LAST_SIX_MONTHS_fct" "SEC_NO_OF_ACCTS_fct"
## [9] "SEC_ACTIVE_ACCTS_fct" "SEC_SANCTIONED_AMOUNT_fct"
## [11] "SEC_DISBURSED_AMOUNT_fct" "DisAsShare_fct"
## [13] "Age_fct" "Aadhar_flag_fct"
## [15] "YearsSinceDisbursment_fct"
if (length(high.cor.remove) != 0) {
X <- X[, -cor.high]
}
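A toy sketch (hypothetical data) of how findCorrelation() behaves: x2 is nearly a copy of x1, so at cutoff = 0.80 one of the pair is flagged for removal while the independent x3 is kept:
set.seed(2)
x1 <- rnorm(500); x2 <- x1 + rnorm(500, sd = 0.1); x3 <- rnorm(500)
caret::findCorrelation(cor(cbind(x1, x2, x3)), cutoff = 0.80) # flags one of x1/x2 (their correlation is ~0.99)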
# Clean up helper objects that are no longer needed
remove(nzv, zv_cols, cor.matrix, cor.high, high.cor.remove)
# Remaining Variables
print( sprintf('Remaining %d Variables', dim(X)[2]) )
## [1] "Remaining 32 Variables"
## [1] "disbursed_amount_fct"
## [2] "asset_cost_fct"
## [3] "ltv_fct"
## [4] "branch_id_fct"
## [5] "supplier_id_fct"
## [6] "manufacturer_id_fct"
## [7] "Current_pincode_ID_fct"
## [8] "Employment_Type_fct"
## [9] "State_ID_fct"
## [10] "Employee_code_ID_fct"
## [11] "PAN_flag_fct"
## [12] "VoterID_flag_fct"
## [13] "Driving_flag_fct"
## [14] "Passport_flag_fct"
## [15] "PERFORM_CNS_SCORE_fct"
## [16] "PERFORM_CNS_SCORE_DESCRIPTION_fct"
## [17] "PRI_ACTIVE_ACCTS_fct"
## [18] "PRI_OVERDUE_ACCTS_fct"
## [19] "PRI_DISBURSED_AMOUNT_fct"
## [20] "SEC_OVERDUE_ACCTS_fct"
## [21] "SEC_CURRENT_BALANCE_fct"
## [22] "PRIMARY_INSTAL_AMT_fct"
## [23] "SEC_INSTAL_AMT_fct"
## [24] "DELINQUENT_ACCTS_IN_LAST_SIX_M_fct"
## [25] "NO_OF_INQUIRIES_fct"
## [26] "YearsOnLoan_fct"
## [27] "DisAsDiff_fct"
## [28] "Qrt_fct"
## [29] "Day_fct"
## [30] "OutstandingNow_fct"
## [31] "ShareOverdue_fct"
## [32] "SEC_OverdueShareSec_fct"
Several methods for computing feature importance are built into FSelectorRcpp's function information_gain(). I recommend this fast yet effective entropy-based implementation, written in C++, from the FSelectorRcpp package.
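For reference, the 'infogain' score of an attribute X is its mutual information with the outcome, i.e. the reduction in the entropy of Y once X is known:
\[ \displaystyle \large IG(Y, X) = H(Y) - H(Y \mid X), \qquad H(Y) = -\sum_{i} p_i \log p_i \]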
if (ncol(X) <= Max_Vars) {
library('FSelectorRcpp') # 'Rcpp' Implementation of 'FSelector' Entropy-Based Feature Selection Algorithms with a Sparse Matrix Support
library('parallel') # Support for Parallel computation in R
ncore <- parallel::detectCores()
(cl = parallel::makeCluster(ncore))
# Entropy-Based Feature Selection Algorithms with a Sparse Matrix Support
Entropy_Based_Features <-
FSelectorRcpp::information_gain( # Calculate the score for each attribute
formula = as.formula('Y ~ .'), # that is on the right side of the formula.
data = cbind(Y = Y, X)[inTrain, ], # Attributes must exist in the passed data.
type = 'infogain', # Choose the type of a score to be calculated.
threads = ncore # Set number of threads in a parallel backend.
) %>%
dplyr::arrange(-importance) %>% # Sort by entropy-based importance, descending.
dplyr::slice(1:Max_Vars) # Keep the first `Max_Vars` features.
knitr::kable(Entropy_Based_Features, caption = sprintf('Selection of the %g Important Variables vs. `%s` outcome', Max_Vars, yName))
# utils::writeClipboard(Entropy_Based_Features[['attributes']])
X <- X[, Entropy_Based_Features$attributes]
if(!is.null(cl)) {
parallel::stopCluster(cl)
cl = NULL
}
remove(Entropy_Based_Features)
}
The logit in logistic regression is a special case of a link function in a Generalized Linear Model (GLM): it is the canonical link function for the Bernoulli distribution.
The logistic model is usually represented as:
\[ \displaystyle \large \pi(Y)=\frac{\exp(\beta_0+\beta_1X)}{1+\exp(\beta_0+\beta_1X)} \hspace{.5 in} [2]\]
or, taking the log-odds, in the familiar linear-model form:
\[ \displaystyle \large \ln\left(\frac{\pi(Y)}{1-\pi(Y)}\right)=\beta_0+\beta_1X \hspace{.5 in} [3]\]
Inverting the logit yields the probability directly, in vector notation:
\[ \displaystyle \large \Pr(Y=1 \mid X) = [1 + e^{-X'\beta}]^{-1} \hspace{.5 in} [4]\]
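Before fitting with RevoScaleR, a minimal base-R analogue (simulated data) makes the link explicit: glm() with family = binomial uses the logit by default, and plogis() is exactly the inverse logit of equation [4]:
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 + 1.2 * x)) # Pr(Y = 1 | x) via the inverse logit
fit <- glm(y ~ x, family = binomial(link = 'logit')) # the canonical link
coef(fit) # estimates close to beta0 = 0.5 and beta1 = 1.2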
# Train Logistic Regression Model with limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization
# Microsoft Machine Learning Example - https://docs.microsoft.com/en-us/machine-learning-server/r/sample-solutions#loan-credit-risk
writeLines('\n Logit-Model on Training Set')
##
## Logit-Model on Training Set
# The named list containing components upper and lower, both formulae, defining the range of models to be examined in the stepwise search
scope <- list(
lower = ~ supplier_id_fct,
upper = as.formula( paste( '~', paste(names(X), collapse = ' + ') ) ) )
## rxLogit / variableSelection
varsel <-
RevoScaleR::rxStepControl(method = 'stepwise'
, keepStepCoefs = TRUE # flag specifying whether or not to keep the model coefficients at each step
, scope = scope
, stepCriterion = 'SigLevel' ) # significance level, the traditional stepwise approach in SAS
# # Ranking Features by Importance using rx (Random Forest)
# rf_model <-
# RevoScaleR::rxDForest( formula = paste0('Y ~ ', paste(colnames(X), collapse = '+'))
# , seed = seed
# , importance = TRUE
# , pweights = 'Pweights'
# , data = cbind(Y = Y[inTrain], X[inTrain, ], Pweights = weight.cases) )
# View(rf_model$importance )
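The fit below passes probability weights via pweights = 'Pweights'. The weight.cases vector is constructed earlier in the document; purely as a hedged illustration (an assumption, not the author's actual construction), inverse class-frequency weights for an imbalanced 0/1 outcome could look like this:
# Hypothetical sketch only - `weight.cases` may be defined differently above
tab <- table(Y[inTrain]) # class counts on the training split
weight.cases.sketch <- as.numeric( 1 / tab[as.character(Y[inTrain])] ) # rarer class gets larger weight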
system.time(
rxLogitFit <- # Create a Logistic Regression Model
RevoScaleR::rxLogit(
formula = paste('Y ~ ', paste(names(X), collapse = '+'))
# , variableSelection = varsel
, pweights = 'Pweights'
, data = cbind(Y = Y[inTrain], X[inTrain, ], Pweights = weight.cases)
, reportProgress = 1 # the number of processed rows is printed and updated
, verbose = 1)
)
## **** Computing starting values:
## Rows Processed: 209838
## Rows Processed: 209838
## **** Scoring iteration #3:
## Deviance: 201058.0735
## Rows Processed: 209838
## **** Scoring iteration #4:
## Deviance: 199961.8653
## Rows Processed: 209838
## **** Scoring iteration #5:
## Deviance: 199949.6311
## Rows Processed: 209838
## **** Scoring iteration #6:
## Deviance: 199949.6240
## Rows Processed: 209838
##
##
## Logistic Regression Results for: Y~Employee_code_ID_fct+Current_pincode_ID_fct+supplier_id_fct+branch_id_fct+ltv_fct+PERFORM_CNS_SCORE_fct+disbursed_amount_fct+OutstandingNow_fct+PERFORM_CNS_SCORE_DESCRIPTION_fct+State_ID_fct+DisAsDiff_fct+PRI_DISBURSED_AMOUNT_fct+PRI_OVERDUE_ACCTS_fct+manufacturer_id_fct+VoterID_flag_fct+ShareOverdue_fct+PRI_ACTIVE_ACCTS_fct+Day_fct+NO_OF_INQUIRIES_fct+DELINQUENT_ACCTS_IN_LAST_SIX_M_fct+Qrt_fct+YearsOnLoan_fct+Employment_Type_fct+PRIMARY_INSTAL_AMT_fct+asset_cost_fct+SEC_OverdueShareSec_fct+SEC_CURRENT_BALANCE_fct+Passport_flag_fct+Driving_flag_fct+SEC_INSTAL_AMT_fct+PAN_flag_fct+SEC_OVERDUE_ACCTS_fct
## ************************************************************************************************************************
## Dependent Variable: Y
## Total independent variables: 163 (Including number dropped: 33)
## Number of valid observations: 209838
## -2*LogLikelihood: 199950 (Residual Deviance on 209708 degrees of freedom)
## Row Coeffs. Value Std. Error t Value Pr(>|t|)
## [ 1,.] (Intercept) 0.7391 0.2000 3.6958 0.0002
## [ 2,.] Employee_code_ID_fct=1: Employee_code_ID <= 134 0.7657 0.0427 17.9335 0.0000
## [ 3,.] Employee_code_ID_fct=10: Employee_code_ID <= 242 -0.0770 0.0303 -2.5387 0.0111
## [ 4,.] Employee_code_ID_fct=11: Employee_code_ID <= 254 -0.1015 0.0296 -3.4293 0.0006
## [ 5,.] Employee_code_ID_fct=12: Employee_code_ID <= 270 -0.1436 0.0297 -4.8288 0.0000
## [ 6,.] Employee_code_ID_fct=13: Employee_code_ID <= 289 -0.2340 0.0293 -7.9857 0.0000
## [ 7,.] Employee_code_ID_fct=14: Employee_code_ID <= 319 -0.2577 0.0286 -9.0069 0.0000
## [ 8,.] Employee_code_ID_fct=15: Employee_code_ID > 319 -0.4823 0.0320 -15.0504 0.0000
## [ 9,.] Employee_code_ID_fct=2: Employee_code_ID <= 153 0.5028 0.0339 14.8256 0.0000
## [ 10,.] Employee_code_ID_fct=3: Employee_code_ID <= 170 0.4005 0.0298 13.4511 0.0000
## [ 11,.] Employee_code_ID_fct=4: Employee_code_ID <= 179 0.2856 0.0337 8.4796 0.0000
## [ 12,.] Employee_code_ID_fct=5: Employee_code_ID <= 188 0.2518 0.0318 7.9085 0.0000
## [ 13,.] Employee_code_ID_fct=6: Employee_code_ID <= 199 0.1600 0.0297 5.3933 0.0000
## [ 14,.] Employee_code_ID_fct=7: Employee_code_ID <= 211 0.0937 0.0291 3.2206 0.0013
## [ 15,.] Employee_code_ID_fct=8: Employee_code_ID <= 221 0.0284 0.0298 0.9526 0.3408
## [ 16,.] Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped Dropped Dropped 0.0000
## [ 17,.] Current_pincode_ID_fct=1: Current_pincode_ID <= 143 0.8572 0.0388 22.0715 0.0000
## [ 18,.] Current_pincode_ID_fct=10: Current_pincode_ID <= 257 -0.1809 0.0245 -7.3807 0.0000
## [ 19,.] Current_pincode_ID_fct=11: Current_pincode_ID <= 291 -0.2728 0.0236 -11.5684 0.0000
## [ 20,.] Current_pincode_ID_fct=12: Current_pincode_ID > 291 -0.3917 0.0259 -15.1180 0.0000
## [ 21,.] Current_pincode_ID_fct=2: Current_pincode_ID <= 158 0.7512 0.0370 20.3062 0.0000
## [ 22,.] Current_pincode_ID_fct=3: Current_pincode_ID <= 174 0.5712 0.0278 20.5606 0.0000
## [ 23,.] Current_pincode_ID_fct=4: Current_pincode_ID <= 188 0.4876 0.0269 18.1216 0.0000
## [ 24,.] Current_pincode_ID_fct=5: Current_pincode_ID <= 201 0.3985 0.0244 16.3081 0.0000
## [ 25,.] Current_pincode_ID_fct=6: Current_pincode_ID <= 212 0.3153 0.0253 12.4573 0.0000
## [ 26,.] Current_pincode_ID_fct=7: Current_pincode_ID <= 219 0.1621 0.0295 5.4903 0.0000
## [ 27,.] Current_pincode_ID_fct=8: Current_pincode_ID <= 225 0.1075 0.0296 3.6285 0.0003
## [ 28,.] Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped Dropped Dropped 0.0000
## [ 29,.] supplier_id_fct=1: supplier_id <= 133 0.3345 0.0425 7.8740 0.0000
## [ 30,.] supplier_id_fct=10: supplier_id <= 253 -0.0792 0.0270 -2.9305 0.0034
## [ 31,.] supplier_id_fct=11: supplier_id <= 275 -0.0973 0.0248 -3.9244 0.0001
## [ 32,.] supplier_id_fct=12: supplier_id <= 314 -0.1325 0.0259 -5.1236 0.0000
## [ 33,.] supplier_id_fct=13: supplier_id > 314 -0.2318 0.0306 -7.5686 0.0000
## [ 34,.] supplier_id_fct=2: supplier_id <= 149 0.3415 0.0373 9.1609 0.0000
## [ 35,.] supplier_id_fct=3: supplier_id <= 165 0.2308 0.0311 7.4297 0.0000
## [ 36,.] supplier_id_fct=4: supplier_id <= 178 0.1855 0.0288 6.4352 0.0000
## [ 37,.] supplier_id_fct=5: supplier_id <= 196 0.1510 0.0256 5.9035 0.0000
## [ 38,.] supplier_id_fct=6: supplier_id <= 206 0.1296 0.0287 4.5106 0.0000
## [ 39,.] supplier_id_fct=7: supplier_id <= 214 0.1039 0.0304 3.4171 0.0006
## [ 40,.] supplier_id_fct=8: supplier_id <= 225 0.0651 0.0260 2.5088 0.0121
## [ 41,.] supplier_id_fct=9: supplier_id <= 240 Dropped Dropped Dropped 0.0000
## [ 42,.] branch_id_fct=1: branch_id <= 153 -0.4503 0.0410 -10.9721 0.0000
## [ 43,.] branch_id_fct=10: branch_id <= 284 0.0550 0.0304 1.8089 0.0705
## [ 44,.] branch_id_fct=11: branch_id > 284 0.0527 0.0414 1.2718 0.2034
## [ 45,.] branch_id_fct=2: branch_id <= 174 -0.4923 0.0384 -12.8176 0.0000
## [ 46,.] branch_id_fct=3: branch_id <= 184 -0.3828 0.0317 -12.0901 0.0000
## [ 47,.] branch_id_fct=4: branch_id <= 198 -0.3148 0.0281 -11.2160 0.0000
## [ 48,.] branch_id_fct=5: branch_id <= 214 -0.2090 0.0319 -6.5462 0.0000
## [ 49,.] branch_id_fct=6: branch_id <= 222 -0.1907 0.0332 -5.7400 0.0000
## [ 50,.] branch_id_fct=7: branch_id <= 233 -0.0964 0.0316 -3.0471 0.0023
## [ 51,.] branch_id_fct=8: branch_id <= 261 -0.2644 0.0302 -8.7606 0.0000
## [ 52,.] branch_id_fct=9: branch_id <= 276 Dropped Dropped Dropped 0.0000
## [ 53,.] ltv_fct=1: ltv <= 55.63 0.6741 0.0483 13.9558 0.0000
## [ 54,.] ltv_fct=10: ltv <= 84.57 -0.1191 0.0296 -4.0282 0.0001
## [ 55,.] ltv_fct=11: ltv <= 85 -0.2560 0.0299 -8.5656 0.0000
## [ 56,.] ltv_fct=12: ltv <= 87.8 -0.1109 0.0328 -3.3790 0.0007
## [ 57,.] ltv_fct=13: ltv <= 89.3 -0.2617 0.0335 -7.8184 0.0000
## [ 58,.] ltv_fct=14: ltv > 89.3 -0.3100 0.0333 -9.3079 0.0000
## [ 59,.] ltv_fct=2: ltv <= 62.22 0.5654 0.0424 13.3198 0.0000
## [ 60,.] ltv_fct=3: ltv <= 68.34 0.3978 0.0373 10.6611 0.0000
## [ 61,.] ltv_fct=4: ltv <= 72.9301 0.2696 0.0334 8.0815 0.0000
## [ 62,.] ltv_fct=5: ltv <= 74.31 0.1395 0.0346 4.0289 0.0001
## [ 63,.] ltv_fct=6: ltv <= 75 0.0904 0.0336 2.6889 0.0072
## [ 64,.] ltv_fct=7: ltv <= 77.39 0.1953 0.0306 6.3806 0.0000
## [ 65,.] ltv_fct=8: ltv <= 78.92 0.0634 0.0287 2.2089 0.0272
## [ 66,.] ltv_fct=9: ltv <= 83.34 Dropped Dropped Dropped 0.0000
## [ 67,.] PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 -0.1592 0.0722 -2.2036 0.0276
## [ 68,.] PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 -0.0610 0.0589 -1.0345 0.3009
## [ 69,.] PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 -0.2494 0.0714 -3.4913 0.0005
## [ 70,.] PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 -0.1389 0.0618 -2.2491 0.0245
## [ 71,.] PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 -0.0788 0.0409 -1.9274 0.0539
## [ 72,.] PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 0.1750 0.0449 3.8987 0.0001
## [ 73,.] PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped Dropped Dropped 0.0000
## [ 74,.] disbursed_amount_fct=1: disbursed_amount <= 39134 0.0393 0.0477 0.8234 0.4103
## [ 75,.] disbursed_amount_fct=2: disbursed_amount <= 43615 0.0841 0.0422 1.9918 0.0464
## [ 76,.] disbursed_amount_fct=3: disbursed_amount <= 48555 0.0864 0.0309 2.7959 0.0052
## [ 77,.] disbursed_amount_fct=4: disbursed_amount <= 51908 0.0959 0.0248 3.8738 0.0001
## [ 78,.] disbursed_amount_fct=5: disbursed_amount <= 55400 0.0522 0.0191 2.7370 0.0062
## [ 79,.] disbursed_amount_fct=6: disbursed_amount > 55400 Dropped Dropped Dropped 0.0000
## [ 80,.] OutstandingNow_fct=1: OutstandingNow <= 44402 -0.0596 0.0582 -1.0234 0.3061
## [ 81,.] OutstandingNow_fct=2: OutstandingNow <= 50314 -0.1223 0.0545 -2.2443 0.0248
## [ 82,.] OutstandingNow_fct=3: OutstandingNow <= 171384 -0.1834 0.0485 -3.7823 0.0002
## [ 83,.] OutstandingNow_fct=4: OutstandingNow <= 324324 -0.0857 0.0429 -1.9988 0.0456
## [ 84,.] OutstandingNow_fct=5: OutstandingNow <= 746271 -0.1138 0.0390 -2.9159 0.0035
## [ 85,.] OutstandingNow_fct=6: OutstandingNow > 746271 Dropped Dropped Dropped 0.0000
## [ 86,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 0.3537 0.0632 5.5963 0.0000
## [ 87,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 0.3311 0.0616 5.3773 0.0000
## [ 88,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 0.2451 0.0500 4.9071 0.0000
## [ 89,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 0.2198 0.0622 3.5352 0.0004
## [ 90,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.0049 0.0358 0.1370 0.8910
## [ 91,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped Dropped Dropped 0.0000
## [ 92,.] State_ID_fct=1: State_ID <= 183 -0.0597 0.0494 -1.2078 0.2271
## [ 93,.] State_ID_fct=2: State_ID <= 188 -0.0293 0.0464 -0.6329 0.5268
## [ 94,.] State_ID_fct=3: State_ID <= 206 -0.0022 0.0432 -0.0499 0.9602
## [ 95,.] State_ID_fct=4: State_ID <= 214 0.0796 0.0425 1.8719 0.0612
## [ 96,.] State_ID_fct=5: State_ID <= 220 0.1009 0.0497 2.0304 0.0423
## [ 97,.] State_ID_fct=6: State_ID <= 229 0.0639 0.0512 1.2496 0.2115
## [ 98,.] State_ID_fct=7: State_ID <= 272 0.0847 0.0445 1.9050 0.0568
## [ 99,.] State_ID_fct=8: State_ID > 272 Dropped Dropped Dropped 0.0000
## [ 100,.] DisAsDiff_fct=1: DisAsDiff <= 13554 -0.1302 0.0396 -3.2854 0.0010
## [ 101,.] DisAsDiff_fct=2: DisAsDiff <= 15670 -0.0822 0.0342 -2.4020 0.0163
## [ 102,.] DisAsDiff_fct=3: DisAsDiff <= 16661 0.0062 0.0359 0.1722 0.8633
## [ 103,.] DisAsDiff_fct=4: DisAsDiff <= 19822 -0.0175 0.0254 -0.6869 0.4921
## [ 104,.] DisAsDiff_fct=5: DisAsDiff > 19822 Dropped Dropped Dropped 0.0000
## [ 105,.] PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 -0.2560 0.0387 -6.6095 0.0000
## [ 106,.] PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped Dropped Dropped 0.0000
## [ 107,.] PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 0.1444 0.0287 5.0372 0.0000
## [ 108,.] PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped Dropped Dropped 0.0000
## [ 109,.] manufacturer_id_fct=1: manufacturer_id <= 210 0.1232 0.0239 5.1589 0.0000
## [ 110,.] manufacturer_id_fct=2: manufacturer_id <= 221 -0.0035 0.0278 -0.1266 0.8993
## [ 111,.] manufacturer_id_fct=3: manufacturer_id <= 228 0.0977 0.0261 3.7512 0.0002
## [ 112,.] manufacturer_id_fct=4: manufacturer_id > 228 Dropped Dropped Dropped 0.0000
## [ 113,.] VoterID_flag_fct=1: VoterID_flag = 0 0.0935 0.0195 4.8004 0.0000
## [ 114,.] VoterID_flag_fct=2: VoterID_flag = 1 Dropped Dropped Dropped 0.0000
## [ 115,.] ShareOverdue_fct=1: ShareOverdue <= -2 0.0923 0.0318 2.8997 0.0037
## [ 116,.] ShareOverdue_fct=2: ShareOverdue <= -1 0.1288 0.0233 5.5417 0.0000
## [ 117,.] ShareOverdue_fct=3: ShareOverdue > -1 Dropped Dropped Dropped 0.0000
## [ 118,.] PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 -0.2271 0.0443 -5.1314 0.0000
## [ 119,.] PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 -0.2429 0.0334 -7.2669 0.0000
## [ 120,.] PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 -0.2096 0.0288 -7.2732 0.0000
## [ 121,.] PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped Dropped Dropped 0.0000
## [ 122,.] Day_fct=1: Day <= 28 0.2494 0.0213 11.7055 0.0000
## [ 123,.] Day_fct=2: Day <= 30 0.1283 0.0260 4.9301 0.0000
## [ 124,.] Day_fct=3: Day > 30 Dropped Dropped Dropped 0.0000
## [ 125,.] NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 0.2732 0.0170 16.0551 0.0000
## [ 126,.] NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped Dropped Dropped 0.0000
## [ 127,.] DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 0.2820 0.0244 11.5364 0.0000
## [ 128,.] DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped Dropped Dropped 0.0000
## [ 129,.] Qrt_fct=1: Qrt = 3 0.2191 0.0118 18.5775 0.0000
## [ 130,.] Qrt_fct=2: Qrt = 4 Dropped Dropped Dropped 0.0000
## [ 131,.] YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 -0.3841 0.0306 -12.5451 0.0000
## [ 132,.] YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 -0.2784 0.0266 -10.4683 0.0000
## [ 133,.] YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 -0.1799 0.0259 -6.9531 0.0000
## [ 134,.] YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 -0.0930 0.0265 -3.5125 0.0004
## [ 135,.] YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped Dropped Dropped 0.0000
## [ 136,.] Employment_Type_fct=1: Employment_Type = 203 0.1524 0.0123 12.4431 0.0000
## [ 137,.] Employment_Type_fct=2: Employment_Type = 215 0.2166 0.0347 6.2474 0.0000
## [ 138,.] Employment_Type_fct=3: Employment_Type = 227 Dropped Dropped Dropped 0.0000
## [ 139,.] PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 -0.0180 0.0325 -0.5519 0.5810
## [ 140,.] PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 -0.2560 0.0375 -6.8204 0.0000
## [ 141,.] PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 -0.1442 0.0387 -3.7226 0.0002
## [ 142,.] PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 -0.1437 0.0325 -4.4164 0.0000
## [ 143,.] PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped Dropped Dropped 0.0000
## [ 144,.] asset_cost_fct=1: asset_cost <= 60098 0.0726 0.0457 1.5873 0.1124
## [ 145,.] asset_cost_fct=2: asset_cost <= 70561 0.1494 0.0296 5.0467 0.0000
## [ 146,.] asset_cost_fct=3: asset_cost <= 85738 0.1776 0.0224 7.9199 0.0000
## [ 147,.] asset_cost_fct=4: asset_cost > 85738 Dropped Dropped Dropped 0.0000
## [ 148,.] SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 -0.0599 0.0932 -0.6423 0.5206
## [ 149,.] SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null -0.0752 0.1024 -0.7339 0.4630
## [ 150,.] SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 0.2864 0.2454 1.1672 0.2431
## [ 151,.] SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped Dropped Dropped 0.0000
## [ 152,.] SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 -0.0239 0.0779 -0.3071 0.7588
## [ 153,.] SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped Dropped Dropped 0.0000
## [ 154,.] Passport_flag_fct=1: Passport_flag = 0 -0.1880 0.1371 -1.3710 0.1704
## [ 155,.] Passport_flag_fct=2: Passport_flag = 1 Dropped Dropped Dropped 0.0000
## [ 156,.] Driving_flag_fct=1: Driving_flag = 0 0.0159 0.0384 0.4145 0.6785
## [ 157,.] Driving_flag_fct=2: Driving_flag = 1 Dropped Dropped Dropped 0.0000
## [ 158,.] SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.0609 0.0778 0.7835 0.4334
## [ 159,.] SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped Dropped Dropped 0.0000
## [ 160,.] PAN_flag_fct=1: PAN_flag = 0 -0.0460 0.0230 -1.9976 0.0458
## [ 161,.] PAN_flag_fct=2: PAN_flag = 1 Dropped Dropped Dropped 0.0000
## [ 162,.] SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped Dropped Dropped 0.0000
## [ 163,.] SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped Dropped Dropped 0.0000
## Condition number of final VC matrix: 1997.7639
## ************************************************************************************************************************
##
## user system elapsed
## 0.14 0.01 3.95
## Call:
## RevoScaleR::rxLogit(formula = paste("Y ~ ", paste(names(X), collapse = "+")),
## data = cbind(Y = Y[inTrain], X[inTrain, ], Pweights = weight.cases),
## pweights = "Pweights", reportProgress = 1, verbose = 1)
##
## Logistic Regression Results for: Y ~
## Employee_code_ID_fct+Current_pincode_ID_fct+supplier_id_fct+branch_id_fct+ltv_fct+PERFORM_CNS_SCORE_fct+disbursed_amount_fct+OutstandingNow_fct+PERFORM_CNS_SCORE_DESCRIPTION_fct+State_ID_fct+DisAsDiff_fct+PRI_DISBURSED_AMOUNT_fct+PRI_OVERDUE_ACCTS_fct+manufacturer_id_fct+VoterID_flag_fct+ShareOverdue_fct+PRI_ACTIVE_ACCTS_fct+Day_fct+NO_OF_INQUIRIES_fct+DELINQUENT_ACCTS_IN_LAST_SIX_M_fct+Qrt_fct+YearsOnLoan_fct+Employment_Type_fct+PRIMARY_INSTAL_AMT_fct+asset_cost_fct+SEC_OverdueShareSec_fct+SEC_CURRENT_BALANCE_fct+Passport_flag_fct+Driving_flag_fct+SEC_INSTAL_AMT_fct+PAN_flag_fct+SEC_OVERDUE_ACCTS_fct
## Data: cbind(Y = Y[inTrain], X[inTrain, ], Pweights = weight.cases)
## Dependent variable(s): Y
## Total independent variables: 163 (Including number dropped: 33)
## Number of valid observations: 209838
## Number of missing observations: 0
## -2*LogLikelihood: 199949.624 (Residual deviance on 209708 degrees of freedom)
##
## Coefficients:
## Estimate
## (Intercept) 0.739128
## Employee_code_ID_fct=1: Employee_code_ID <= 134 0.765665
## Employee_code_ID_fct=10: Employee_code_ID <= 242 -0.077044
## Employee_code_ID_fct=11: Employee_code_ID <= 254 -0.101471
## Employee_code_ID_fct=12: Employee_code_ID <= 270 -0.143575
## Employee_code_ID_fct=13: Employee_code_ID <= 289 -0.234039
## Employee_code_ID_fct=14: Employee_code_ID <= 319 -0.257740
## Employee_code_ID_fct=15: Employee_code_ID > 319 -0.482348
## Employee_code_ID_fct=2: Employee_code_ID <= 153 0.502839
## Employee_code_ID_fct=3: Employee_code_ID <= 170 0.400538
## Employee_code_ID_fct=4: Employee_code_ID <= 179 0.285620
## Employee_code_ID_fct=5: Employee_code_ID <= 188 0.251824
## Employee_code_ID_fct=6: Employee_code_ID <= 199 0.160031
## Employee_code_ID_fct=7: Employee_code_ID <= 211 0.093653
## Employee_code_ID_fct=8: Employee_code_ID <= 221 0.028377
## Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 0.857166
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 -0.180906
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 -0.272803
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 -0.391739
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 0.751176
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 0.571246
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 0.487634
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 0.398476
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 0.315290
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 0.162143
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 0.107461
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped
## supplier_id_fct=1: supplier_id <= 133 0.334523
## supplier_id_fct=10: supplier_id <= 253 -0.079193
## supplier_id_fct=11: supplier_id <= 275 -0.097272
## supplier_id_fct=12: supplier_id <= 314 -0.132506
## supplier_id_fct=13: supplier_id > 314 -0.231812
## supplier_id_fct=2: supplier_id <= 149 0.341500
## supplier_id_fct=3: supplier_id <= 165 0.230836
## supplier_id_fct=4: supplier_id <= 178 0.185501
## supplier_id_fct=5: supplier_id <= 196 0.150963
## supplier_id_fct=6: supplier_id <= 206 0.129583
## supplier_id_fct=7: supplier_id <= 214 0.103929
## supplier_id_fct=8: supplier_id <= 225 0.065137
## supplier_id_fct=9: supplier_id <= 240 Dropped
## branch_id_fct=1: branch_id <= 153 -0.450331
## branch_id_fct=10: branch_id <= 284 0.055049
## branch_id_fct=11: branch_id > 284 0.052659
## branch_id_fct=2: branch_id <= 174 -0.492323
## branch_id_fct=3: branch_id <= 184 -0.382755
## branch_id_fct=4: branch_id <= 198 -0.314757
## branch_id_fct=5: branch_id <= 214 -0.208958
## branch_id_fct=6: branch_id <= 222 -0.190663
## branch_id_fct=7: branch_id <= 233 -0.096355
## branch_id_fct=8: branch_id <= 261 -0.264450
## branch_id_fct=9: branch_id <= 276 Dropped
## ltv_fct=1: ltv <= 55.63 0.674092
## ltv_fct=10: ltv <= 84.57 -0.119055
## ltv_fct=11: ltv <= 85 -0.256004
## ltv_fct=12: ltv <= 87.8 -0.110868
## ltv_fct=13: ltv <= 89.3 -0.261671
## ltv_fct=14: ltv > 89.3 -0.310022
## ltv_fct=2: ltv <= 62.22 0.565406
## ltv_fct=3: ltv <= 68.34 0.397794
## ltv_fct=4: ltv <= 72.9301 0.269590
## ltv_fct=5: ltv <= 74.31 0.139500
## ltv_fct=6: ltv <= 75 0.090372
## ltv_fct=7: ltv <= 77.39 0.195296
## ltv_fct=8: ltv <= 78.92 0.063365
## ltv_fct=9: ltv <= 83.34 Dropped
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 -0.159161
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 -0.060971
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 -0.249354
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 -0.138944
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 -0.078764
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 0.175027
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped
## disbursed_amount_fct=1: disbursed_amount <= 39134 0.039263
## disbursed_amount_fct=2: disbursed_amount <= 43615 0.084088
## disbursed_amount_fct=3: disbursed_amount <= 48555 0.086379
## disbursed_amount_fct=4: disbursed_amount <= 51908 0.095899
## disbursed_amount_fct=5: disbursed_amount <= 55400 0.052241
## disbursed_amount_fct=6: disbursed_amount > 55400 Dropped
## OutstandingNow_fct=1: OutstandingNow <= 44402 -0.059569
## OutstandingNow_fct=2: OutstandingNow <= 50314 -0.122291
## OutstandingNow_fct=3: OutstandingNow <= 171384 -0.183387
## OutstandingNow_fct=4: OutstandingNow <= 324324 -0.085655
## OutstandingNow_fct=5: OutstandingNow <= 746271 -0.113832
## OutstandingNow_fct=6: OutstandingNow > 746271 Dropped
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 0.353738
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 0.331067
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 0.245147
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 0.219787
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.004902
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped
## State_ID_fct=1: State_ID <= 183 -0.059697
## State_ID_fct=2: State_ID <= 188 -0.029348
## State_ID_fct=3: State_ID <= 206 -0.002158
## State_ID_fct=4: State_ID <= 214 0.079577
## State_ID_fct=5: State_ID <= 220 0.100859
## State_ID_fct=6: State_ID <= 229 0.063927
## State_ID_fct=7: State_ID <= 272 0.084702
## State_ID_fct=8: State_ID > 272 Dropped
## DisAsDiff_fct=1: DisAsDiff <= 13554 -0.130242
## DisAsDiff_fct=2: DisAsDiff <= 15670 -0.082229
## DisAsDiff_fct=3: DisAsDiff <= 16661 0.006175
## DisAsDiff_fct=4: DisAsDiff <= 19822 -0.017471
## DisAsDiff_fct=5: DisAsDiff > 19822 Dropped
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 -0.255990
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 0.144360
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped
## manufacturer_id_fct=1: manufacturer_id <= 210 0.123239
## manufacturer_id_fct=2: manufacturer_id <= 221 -0.003519
## manufacturer_id_fct=3: manufacturer_id <= 228 0.097721
## manufacturer_id_fct=4: manufacturer_id > 228 Dropped
## VoterID_flag_fct=1: VoterID_flag = 0 0.093489
## VoterID_flag_fct=2: VoterID_flag = 1 Dropped
## ShareOverdue_fct=1: ShareOverdue <= -2 0.092295
## ShareOverdue_fct=2: ShareOverdue <= -1 0.128849
## ShareOverdue_fct=3: ShareOverdue > -1 Dropped
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 -0.227117
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 -0.242919
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 -0.209621
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped
## Day_fct=1: Day <= 28 0.249400
## Day_fct=2: Day <= 30 0.128288
## Day_fct=3: Day > 30 Dropped
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 0.273240
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 0.281950
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped
## Qrt_fct=1: Qrt = 3 0.219055
## Qrt_fct=2: Qrt = 4 Dropped
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 -0.384106
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 -0.278357
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 -0.179941
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 -0.092981
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped
## Employment_Type_fct=1: Employment_Type = 203 0.152449
## Employment_Type_fct=2: Employment_Type = 215 0.216629
## Employment_Type_fct=3: Employment_Type = 227 Dropped
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 -0.017958
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 -0.256020
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 -0.144240
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 -0.143675
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped
## asset_cost_fct=1: asset_cost <= 60098 0.072598
## asset_cost_fct=2: asset_cost <= 70561 0.149356
## asset_cost_fct=3: asset_cost <= 85738 0.177618
## asset_cost_fct=4: asset_cost > 85738 Dropped
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 -0.059896
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null -0.075173
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 0.286437
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 -0.023913
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped
## Passport_flag_fct=1: Passport_flag = 0 -0.187963
## Passport_flag_fct=2: Passport_flag = 1 Dropped
## Driving_flag_fct=1: Driving_flag = 0 0.015928
## Driving_flag_fct=2: Driving_flag = 1 Dropped
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.060927
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped
## PAN_flag_fct=1: PAN_flag = 0 -0.045962
## PAN_flag_fct=2: PAN_flag = 1 Dropped
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped
## Std. Error
## (Intercept) 0.199992
## Employee_code_ID_fct=1: Employee_code_ID <= 134 0.042695
## Employee_code_ID_fct=10: Employee_code_ID <= 242 0.030348
## Employee_code_ID_fct=11: Employee_code_ID <= 254 0.029589
## Employee_code_ID_fct=12: Employee_code_ID <= 270 0.029733
## Employee_code_ID_fct=13: Employee_code_ID <= 289 0.029307
## Employee_code_ID_fct=14: Employee_code_ID <= 319 0.028616
## Employee_code_ID_fct=15: Employee_code_ID > 319 0.032049
## Employee_code_ID_fct=2: Employee_code_ID <= 153 0.033917
## Employee_code_ID_fct=3: Employee_code_ID <= 170 0.029777
## Employee_code_ID_fct=4: Employee_code_ID <= 179 0.033683
## Employee_code_ID_fct=5: Employee_code_ID <= 188 0.031842
## Employee_code_ID_fct=6: Employee_code_ID <= 199 0.029672
## Employee_code_ID_fct=7: Employee_code_ID <= 211 0.029080
## Employee_code_ID_fct=8: Employee_code_ID <= 221 0.029788
## Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 0.038836
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 0.024511
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 0.023582
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 0.025912
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 0.036992
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 0.027784
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 0.026909
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 0.024434
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 0.025310
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 0.029533
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 0.029616
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped
## supplier_id_fct=1: supplier_id <= 133 0.042484
## supplier_id_fct=10: supplier_id <= 253 0.027024
## supplier_id_fct=11: supplier_id <= 275 0.024787
## supplier_id_fct=12: supplier_id <= 314 0.025862
## supplier_id_fct=13: supplier_id > 314 0.030628
## supplier_id_fct=2: supplier_id <= 149 0.037278
## supplier_id_fct=3: supplier_id <= 165 0.031069
## supplier_id_fct=4: supplier_id <= 178 0.028826
## supplier_id_fct=5: supplier_id <= 196 0.025572
## supplier_id_fct=6: supplier_id <= 206 0.028729
## supplier_id_fct=7: supplier_id <= 214 0.030414
## supplier_id_fct=8: supplier_id <= 225 0.025963
## supplier_id_fct=9: supplier_id <= 240 Dropped
## branch_id_fct=1: branch_id <= 153 0.041043
## branch_id_fct=10: branch_id <= 284 0.030432
## branch_id_fct=11: branch_id > 284 0.041404
## branch_id_fct=2: branch_id <= 174 0.038410
## branch_id_fct=3: branch_id <= 184 0.031658
## branch_id_fct=4: branch_id <= 198 0.028063
## branch_id_fct=5: branch_id <= 214 0.031921
## branch_id_fct=6: branch_id <= 222 0.033216
## branch_id_fct=7: branch_id <= 233 0.031622
## branch_id_fct=8: branch_id <= 261 0.030186
## branch_id_fct=9: branch_id <= 276 Dropped
## ltv_fct=1: ltv <= 55.63 0.048302
## ltv_fct=10: ltv <= 84.57 0.029555
## ltv_fct=11: ltv <= 85 0.029887
## ltv_fct=12: ltv <= 87.8 0.032811
## ltv_fct=13: ltv <= 89.3 0.033469
## ltv_fct=14: ltv > 89.3 0.033307
## ltv_fct=2: ltv <= 62.22 0.042449
## ltv_fct=3: ltv <= 68.34 0.037313
## ltv_fct=4: ltv <= 72.9301 0.033359
## ltv_fct=5: ltv <= 74.31 0.034625
## ltv_fct=6: ltv <= 75 0.033609
## ltv_fct=7: ltv <= 77.39 0.030608
## ltv_fct=8: ltv <= 78.92 0.028686
## ltv_fct=9: ltv <= 83.34 Dropped
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 0.072227
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 0.058937
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 0.071422
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 0.061776
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 0.040865
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 0.044894
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped
## disbursed_amount_fct=1: disbursed_amount <= 39134 0.047683
## disbursed_amount_fct=2: disbursed_amount <= 43615 0.042217
## disbursed_amount_fct=3: disbursed_amount <= 48555 0.030895
## disbursed_amount_fct=4: disbursed_amount <= 51908 0.024756
## disbursed_amount_fct=5: disbursed_amount <= 55400 0.019087
## disbursed_amount_fct=6: disbursed_amount > 55400 Dropped
## OutstandingNow_fct=1: OutstandingNow <= 44402 0.058208
## OutstandingNow_fct=2: OutstandingNow <= 50314 0.054488
## OutstandingNow_fct=3: OutstandingNow <= 171384 0.048486
## OutstandingNow_fct=4: OutstandingNow <= 324324 0.042853
## OutstandingNow_fct=5: OutstandingNow <= 746271 0.039039
## OutstandingNow_fct=6: OutstandingNow > 746271 Dropped
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 0.063209
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 0.061567
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 0.049958
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 0.062171
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.035776
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped
## State_ID_fct=1: State_ID <= 183 0.049425
## State_ID_fct=2: State_ID <= 188 0.046370
## State_ID_fct=3: State_ID <= 206 0.043220
## State_ID_fct=4: State_ID <= 214 0.042510
## State_ID_fct=5: State_ID <= 220 0.049674
## State_ID_fct=6: State_ID <= 229 0.051160
## State_ID_fct=7: State_ID <= 272 0.044464
## State_ID_fct=8: State_ID > 272 Dropped
## DisAsDiff_fct=1: DisAsDiff <= 13554 0.039643
## DisAsDiff_fct=2: DisAsDiff <= 15670 0.034234
## DisAsDiff_fct=3: DisAsDiff <= 16661 0.035854
## DisAsDiff_fct=4: DisAsDiff <= 19822 0.025433
## DisAsDiff_fct=5: DisAsDiff > 19822 Dropped
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 0.038731
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 0.028659
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped
## manufacturer_id_fct=1: manufacturer_id <= 210 0.023889
## manufacturer_id_fct=2: manufacturer_id <= 221 0.027798
## manufacturer_id_fct=3: manufacturer_id <= 228 0.026050
## manufacturer_id_fct=4: manufacturer_id > 228 Dropped
## VoterID_flag_fct=1: VoterID_flag = 0 0.019475
## VoterID_flag_fct=2: VoterID_flag = 1 Dropped
## ShareOverdue_fct=1: ShareOverdue <= -2 0.031829
## ShareOverdue_fct=2: ShareOverdue <= -1 0.023251
## ShareOverdue_fct=3: ShareOverdue > -1 Dropped
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 0.044260
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 0.033428
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 0.028821
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped
## Day_fct=1: Day <= 28 0.021306
## Day_fct=2: Day <= 30 0.026022
## Day_fct=3: Day > 30 Dropped
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 0.017019
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 0.024440
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped
## Qrt_fct=1: Qrt = 3 0.011791
## Qrt_fct=2: Qrt = 4 Dropped
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 0.030618
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 0.026590
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 0.025879
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 0.026471
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped
## Employment_Type_fct=1: Employment_Type = 203 0.012252
## Employment_Type_fct=2: Employment_Type = 215 0.034675
## Employment_Type_fct=3: Employment_Type = 227 Dropped
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 0.032539
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 0.037537
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 0.038747
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 0.032532
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped
## asset_cost_fct=1: asset_cost <= 60098 0.045737
## asset_cost_fct=2: asset_cost <= 70561 0.029595
## asset_cost_fct=3: asset_cost <= 85738 0.022427
## asset_cost_fct=4: asset_cost > 85738 Dropped
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 0.093246
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null 0.102431
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 0.245402
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 0.077865
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped
## Passport_flag_fct=1: Passport_flag = 0 0.137099
## Passport_flag_fct=2: Passport_flag = 1 Dropped
## Driving_flag_fct=1: Driving_flag = 0 0.038427
## Driving_flag_fct=2: Driving_flag = 1 Dropped
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.077767
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped
## PAN_flag_fct=1: PAN_flag = 0 0.023009
## PAN_flag_fct=2: PAN_flag = 1 Dropped
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped
## z value
## (Intercept) 3.696
## Employee_code_ID_fct=1: Employee_code_ID <= 134 17.934
## Employee_code_ID_fct=10: Employee_code_ID <= 242 -2.539
## Employee_code_ID_fct=11: Employee_code_ID <= 254 -3.429
## Employee_code_ID_fct=12: Employee_code_ID <= 270 -4.829
## Employee_code_ID_fct=13: Employee_code_ID <= 289 -7.986
## Employee_code_ID_fct=14: Employee_code_ID <= 319 -9.007
## Employee_code_ID_fct=15: Employee_code_ID > 319 -15.050
## Employee_code_ID_fct=2: Employee_code_ID <= 153 14.826
## Employee_code_ID_fct=3: Employee_code_ID <= 170 13.451
## Employee_code_ID_fct=4: Employee_code_ID <= 179 8.480
## Employee_code_ID_fct=5: Employee_code_ID <= 188 7.909
## Employee_code_ID_fct=6: Employee_code_ID <= 199 5.393
## Employee_code_ID_fct=7: Employee_code_ID <= 211 3.221
## Employee_code_ID_fct=8: Employee_code_ID <= 221 0.953
## Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 22.072
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 -7.381
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 -11.568
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 -15.118
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 20.306
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 20.561
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 18.122
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 16.308
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 12.457
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 5.490
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 3.629
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped
## supplier_id_fct=1: supplier_id <= 133 7.874
## supplier_id_fct=10: supplier_id <= 253 -2.930
## supplier_id_fct=11: supplier_id <= 275 -3.924
## supplier_id_fct=12: supplier_id <= 314 -5.124
## supplier_id_fct=13: supplier_id > 314 -7.569
## supplier_id_fct=2: supplier_id <= 149 9.161
## supplier_id_fct=3: supplier_id <= 165 7.430
## supplier_id_fct=4: supplier_id <= 178 6.435
## supplier_id_fct=5: supplier_id <= 196 5.903
## supplier_id_fct=6: supplier_id <= 206 4.511
## supplier_id_fct=7: supplier_id <= 214 3.417
## supplier_id_fct=8: supplier_id <= 225 2.509
## supplier_id_fct=9: supplier_id <= 240 Dropped
## branch_id_fct=1: branch_id <= 153 -10.972
## branch_id_fct=10: branch_id <= 284 1.809
## branch_id_fct=11: branch_id > 284 1.272
## branch_id_fct=2: branch_id <= 174 -12.818
## branch_id_fct=3: branch_id <= 184 -12.090
## branch_id_fct=4: branch_id <= 198 -11.216
## branch_id_fct=5: branch_id <= 214 -6.546
## branch_id_fct=6: branch_id <= 222 -5.740
## branch_id_fct=7: branch_id <= 233 -3.047
## branch_id_fct=8: branch_id <= 261 -8.761
## branch_id_fct=9: branch_id <= 276 Dropped
## ltv_fct=1: ltv <= 55.63 13.956
## ltv_fct=10: ltv <= 84.57 -4.028
## ltv_fct=11: ltv <= 85 -8.566
## ltv_fct=12: ltv <= 87.8 -3.379
## ltv_fct=13: ltv <= 89.3 -7.818
## ltv_fct=14: ltv > 89.3 -9.308
## ltv_fct=2: ltv <= 62.22 13.320
## ltv_fct=3: ltv <= 68.34 10.661
## ltv_fct=4: ltv <= 72.9301 8.081
## ltv_fct=5: ltv <= 74.31 4.029
## ltv_fct=6: ltv <= 75 2.689
## ltv_fct=7: ltv <= 77.39 6.381
## ltv_fct=8: ltv <= 78.92 2.209
## ltv_fct=9: ltv <= 83.34 Dropped
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 -2.204
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 -1.035
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 -3.491
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 -2.249
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 -1.927
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 3.899
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped
## disbursed_amount_fct=1: disbursed_amount <= 39134 0.823
## disbursed_amount_fct=2: disbursed_amount <= 43615 1.992
## disbursed_amount_fct=3: disbursed_amount <= 48555 2.796
## disbursed_amount_fct=4: disbursed_amount <= 51908 3.874
## disbursed_amount_fct=5: disbursed_amount <= 55400 2.737
## disbursed_amount_fct=6: disbursed_amount > 55400 Dropped
## OutstandingNow_fct=1: OutstandingNow <= 44402 -1.023
## OutstandingNow_fct=2: OutstandingNow <= 50314 -2.244
## OutstandingNow_fct=3: OutstandingNow <= 171384 -3.782
## OutstandingNow_fct=4: OutstandingNow <= 324324 -1.999
## OutstandingNow_fct=5: OutstandingNow <= 746271 -2.916
## OutstandingNow_fct=6: OutstandingNow > 746271 Dropped
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 5.596
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 5.377
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 4.907
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 3.535
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.137
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped
## State_ID_fct=1: State_ID <= 183 -1.208
## State_ID_fct=2: State_ID <= 188 -0.633
## State_ID_fct=3: State_ID <= 206 -0.050
## State_ID_fct=4: State_ID <= 214 1.872
## State_ID_fct=5: State_ID <= 220 2.030
## State_ID_fct=6: State_ID <= 229 1.250
## State_ID_fct=7: State_ID <= 272 1.905
## State_ID_fct=8: State_ID > 272 Dropped
## DisAsDiff_fct=1: DisAsDiff <= 13554 -3.285
## DisAsDiff_fct=2: DisAsDiff <= 15670 -2.402
## DisAsDiff_fct=3: DisAsDiff <= 16661 0.172
## DisAsDiff_fct=4: DisAsDiff <= 19822 -0.687
## DisAsDiff_fct=5: DisAsDiff > 19822 Dropped
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 -6.609
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 5.037
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped
## manufacturer_id_fct=1: manufacturer_id <= 210 5.159
## manufacturer_id_fct=2: manufacturer_id <= 221 -0.127
## manufacturer_id_fct=3: manufacturer_id <= 228 3.751
## manufacturer_id_fct=4: manufacturer_id > 228 Dropped
## VoterID_flag_fct=1: VoterID_flag = 0 4.800
## VoterID_flag_fct=2: VoterID_flag = 1 Dropped
## ShareOverdue_fct=1: ShareOverdue <= -2 2.900
## ShareOverdue_fct=2: ShareOverdue <= -1 5.542
## ShareOverdue_fct=3: ShareOverdue > -1 Dropped
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 -5.131
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 -7.267
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 -7.273
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped
## Day_fct=1: Day <= 28 11.705
## Day_fct=2: Day <= 30 4.930
## Day_fct=3: Day > 30 Dropped
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 16.055
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 11.536
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped
## Qrt_fct=1: Qrt = 3 18.577
## Qrt_fct=2: Qrt = 4 Dropped
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 -12.545
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 -10.468
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 -6.953
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 -3.513
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped
## Employment_Type_fct=1: Employment_Type = 203 12.443
## Employment_Type_fct=2: Employment_Type = 215 6.247
## Employment_Type_fct=3: Employment_Type = 227 Dropped
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 -0.552
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 -6.820
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 -3.723
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 -4.416
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped
## asset_cost_fct=1: asset_cost <= 60098 1.587
## asset_cost_fct=2: asset_cost <= 70561 5.047
## asset_cost_fct=3: asset_cost <= 85738 7.920
## asset_cost_fct=4: asset_cost > 85738 Dropped
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 -0.642
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null -0.734
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 1.167
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 -0.307
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped
## Passport_flag_fct=1: Passport_flag = 0 -1.371
## Passport_flag_fct=2: Passport_flag = 1 Dropped
## Driving_flag_fct=1: Driving_flag = 0 0.414
## Driving_flag_fct=2: Driving_flag = 1 Dropped
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.783
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped
## PAN_flag_fct=1: PAN_flag = 0 -1.998
## PAN_flag_fct=2: PAN_flag = 1 Dropped
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped
## Pr(>|z|)
## (Intercept) 0.000219
## Employee_code_ID_fct=1: Employee_code_ID <= 134 0.000000000000000222
## Employee_code_ID_fct=10: Employee_code_ID <= 242 0.011127
## Employee_code_ID_fct=11: Employee_code_ID <= 254 0.000605
## Employee_code_ID_fct=12: Employee_code_ID <= 270 0.000001373400840832
## Employee_code_ID_fct=13: Employee_code_ID <= 289 0.000000000000000222
## Employee_code_ID_fct=14: Employee_code_ID <= 319 0.000000000000000222
## Employee_code_ID_fct=15: Employee_code_ID > 319 0.000000000000000222
## Employee_code_ID_fct=2: Employee_code_ID <= 153 0.000000000000000222
## Employee_code_ID_fct=3: Employee_code_ID <= 170 0.000000000000000222
## Employee_code_ID_fct=4: Employee_code_ID <= 179 0.000000000000000222
## Employee_code_ID_fct=5: Employee_code_ID <= 188 0.000000000000000222
## Employee_code_ID_fct=6: Employee_code_ID <= 199 0.000000069185031482
## Employee_code_ID_fct=7: Employee_code_ID <= 211 0.001279
## Employee_code_ID_fct=8: Employee_code_ID <= 221 0.340780
## Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 0.000000000000000222
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 0.000000000000000222
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 0.000000000000000222
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 0.000000000000000222
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 0.000000000000000222
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 0.000000000000000222
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 0.000000000000000222
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 0.000000000000000222
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 0.000000000000000222
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 0.000000040128663947
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 0.000285
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped
## supplier_id_fct=1: supplier_id <= 133 0.000000000000000222
## supplier_id_fct=10: supplier_id <= 253 0.003385
## supplier_id_fct=11: supplier_id <= 275 0.000086946671635557
## supplier_id_fct=12: supplier_id <= 314 0.000000299739931098
## supplier_id_fct=13: supplier_id > 314 0.000000000000000222
## supplier_id_fct=2: supplier_id <= 149 0.000000000000000222
## supplier_id_fct=3: supplier_id <= 165 0.000000000000000222
## supplier_id_fct=4: supplier_id <= 178 0.000000000123338895
## supplier_id_fct=5: supplier_id <= 196 0.000000003559513795
## supplier_id_fct=6: supplier_id <= 206 0.000006464945373930
## supplier_id_fct=7: supplier_id <= 214 0.000633
## supplier_id_fct=8: supplier_id <= 225 0.012114
## supplier_id_fct=9: supplier_id <= 240 Dropped
## branch_id_fct=1: branch_id <= 153 0.000000000000000222
## branch_id_fct=10: branch_id <= 284 0.070464
## branch_id_fct=11: branch_id > 284 0.203439
## branch_id_fct=2: branch_id <= 174 0.000000000000000222
## branch_id_fct=3: branch_id <= 184 0.000000000000000222
## branch_id_fct=4: branch_id <= 198 0.000000000000000222
## branch_id_fct=5: branch_id <= 214 0.000000000059036109
## branch_id_fct=6: branch_id <= 222 0.000000009467683748
## branch_id_fct=7: branch_id <= 233 0.002310
## branch_id_fct=8: branch_id <= 261 0.000000000000000222
## branch_id_fct=9: branch_id <= 276 Dropped
## ltv_fct=1: ltv <= 55.63 0.000000000000000222
## ltv_fct=10: ltv <= 84.57 0.000056194942485766
## ltv_fct=11: ltv <= 85 0.000000000000000222
## ltv_fct=12: ltv <= 87.8 0.000728
## ltv_fct=13: ltv <= 89.3 0.000000000000000222
## ltv_fct=14: ltv > 89.3 0.000000000000000222
## ltv_fct=2: ltv <= 62.22 0.000000000000000222
## ltv_fct=3: ltv <= 68.34 0.000000000000000222
## ltv_fct=4: ltv <= 72.9301 0.000000000000000222
## ltv_fct=5: ltv <= 74.31 0.000056050266953989
## ltv_fct=6: ltv <= 75 0.007169
## ltv_fct=7: ltv <= 77.39 0.000000000176378467
## ltv_fct=8: ltv <= 78.92 0.027181
## ltv_fct=9: ltv <= 83.34 Dropped
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 0.027551
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 0.300902
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 0.000481
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 0.024504
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 0.053931
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 0.000096707880559599
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped
## disbursed_amount_fct=1: disbursed_amount <= 39134 0.410269
## disbursed_amount_fct=2: disbursed_amount <= 43615 0.046391
## disbursed_amount_fct=3: disbursed_amount <= 48555 0.005176
## disbursed_amount_fct=4: disbursed_amount <= 51908 0.000107
## disbursed_amount_fct=5: disbursed_amount <= 55400 0.006200
## disbursed_amount_fct=6: disbursed_amount > 55400 Dropped
## OutstandingNow_fct=1: OutstandingNow <= 44402 0.306128
## OutstandingNow_fct=2: OutstandingNow <= 50314 0.024810
## OutstandingNow_fct=3: OutstandingNow <= 171384 0.000155
## OutstandingNow_fct=4: OutstandingNow <= 324324 0.045628
## OutstandingNow_fct=5: OutstandingNow <= 746271 0.003547
## OutstandingNow_fct=6: OutstandingNow > 746271 Dropped
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 0.000000021892954560
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 0.000000075610862016
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 0.000000924302139271
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 0.000407
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.891012
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped
## State_ID_fct=1: State_ID <= 183 0.227109
## State_ID_fct=2: State_ID <= 188 0.526790
## State_ID_fct=3: State_ID <= 206 0.960185
## State_ID_fct=4: State_ID <= 214 0.061214
## State_ID_fct=5: State_ID <= 220 0.042315
## State_ID_fct=6: State_ID <= 229 0.211464
## State_ID_fct=7: State_ID <= 272 0.056786
## State_ID_fct=8: State_ID > 272 Dropped
## DisAsDiff_fct=1: DisAsDiff <= 13554 0.001018
## DisAsDiff_fct=2: DisAsDiff <= 15670 0.016306
## DisAsDiff_fct=3: DisAsDiff <= 16661 0.863261
## DisAsDiff_fct=4: DisAsDiff <= 19822 0.492117
## DisAsDiff_fct=5: DisAsDiff > 19822 Dropped
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 0.000000000038574033
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 0.000000472350173641
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped
## manufacturer_id_fct=1: manufacturer_id <= 210 0.000000248386704094
## manufacturer_id_fct=2: manufacturer_id <= 221 0.899275
## manufacturer_id_fct=3: manufacturer_id <= 228 0.000176
## manufacturer_id_fct=4: manufacturer_id > 228 Dropped
## VoterID_flag_fct=1: VoterID_flag = 0 0.000001583545196970
## VoterID_flag_fct=2: VoterID_flag = 1 Dropped
## ShareOverdue_fct=1: ShareOverdue <= -2 0.003735
## ShareOverdue_fct=2: ShareOverdue <= -1 0.000000029958245662
## ShareOverdue_fct=3: ShareOverdue > -1 Dropped
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 0.000000287604146276
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 0.000000000000000222
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 0.000000000000000222
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped
## Day_fct=1: Day <= 28 0.000000000000000222
## Day_fct=2: Day <= 30 0.000000822043672688
## Day_fct=3: Day > 30 Dropped
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 0.000000000000000222
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 0.000000000000000222
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped
## Qrt_fct=1: Qrt = 3 0.000000000000000222
## Qrt_fct=2: Qrt = 4 Dropped
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 0.000000000000000222
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 0.000000000000000222
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 0.000000000003574030
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 0.000444
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped
## Employment_Type_fct=1: Employment_Type = 203 0.000000000000000222
## Employment_Type_fct=2: Employment_Type = 215 0.000000000417368806
## Employment_Type_fct=3: Employment_Type = 227 Dropped
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 0.581034
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 0.000000000009078738
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 0.000197
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 0.000010036155548399
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped
## asset_cost_fct=1: asset_cost <= 60098 0.112444
## asset_cost_fct=2: asset_cost <= 70561 0.000000449390024304
## asset_cost_fct=3: asset_cost <= 85738 0.000000000000000222
## asset_cost_fct=4: asset_cost > 85738 Dropped
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 0.520648
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null 0.463016
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 0.243122
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 0.758758
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped
## Passport_flag_fct=1: Passport_flag = 0 0.170376
## Passport_flag_fct=2: Passport_flag = 1 Dropped
## Driving_flag_fct=1: Driving_flag = 0 0.678510
## Driving_flag_fct=2: Driving_flag = 1 Dropped
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.433359
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped
## PAN_flag_fct=1: PAN_flag = 0 0.045763
## PAN_flag_fct=2: PAN_flag = 1 Dropped
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped
##
## (Intercept) ***
## Employee_code_ID_fct=1: Employee_code_ID <= 134 ***
## Employee_code_ID_fct=10: Employee_code_ID <= 242 *
## Employee_code_ID_fct=11: Employee_code_ID <= 254 ***
## Employee_code_ID_fct=12: Employee_code_ID <= 270 ***
## Employee_code_ID_fct=13: Employee_code_ID <= 289 ***
## Employee_code_ID_fct=14: Employee_code_ID <= 319 ***
## Employee_code_ID_fct=15: Employee_code_ID > 319 ***
## Employee_code_ID_fct=2: Employee_code_ID <= 153 ***
## Employee_code_ID_fct=3: Employee_code_ID <= 170 ***
## Employee_code_ID_fct=4: Employee_code_ID <= 179 ***
## Employee_code_ID_fct=5: Employee_code_ID <= 188 ***
## Employee_code_ID_fct=6: Employee_code_ID <= 199 ***
## Employee_code_ID_fct=7: Employee_code_ID <= 211 **
## Employee_code_ID_fct=8: Employee_code_ID <= 221
## Employee_code_ID_fct=9: Employee_code_ID <= 233
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 ***
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 ***
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 ***
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 ***
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 ***
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 ***
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 ***
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 ***
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 ***
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 ***
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 ***
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238
## supplier_id_fct=1: supplier_id <= 133 ***
## supplier_id_fct=10: supplier_id <= 253 **
## supplier_id_fct=11: supplier_id <= 275 ***
## supplier_id_fct=12: supplier_id <= 314 ***
## supplier_id_fct=13: supplier_id > 314 ***
## supplier_id_fct=2: supplier_id <= 149 ***
## supplier_id_fct=3: supplier_id <= 165 ***
## supplier_id_fct=4: supplier_id <= 178 ***
## supplier_id_fct=5: supplier_id <= 196 ***
## supplier_id_fct=6: supplier_id <= 206 ***
## supplier_id_fct=7: supplier_id <= 214 ***
## supplier_id_fct=8: supplier_id <= 225 *
## supplier_id_fct=9: supplier_id <= 240
## branch_id_fct=1: branch_id <= 153 ***
## branch_id_fct=10: branch_id <= 284 .
## branch_id_fct=11: branch_id > 284
## branch_id_fct=2: branch_id <= 174 ***
## branch_id_fct=3: branch_id <= 184 ***
## branch_id_fct=4: branch_id <= 198 ***
## branch_id_fct=5: branch_id <= 214 ***
## branch_id_fct=6: branch_id <= 222 ***
## branch_id_fct=7: branch_id <= 233 **
## branch_id_fct=8: branch_id <= 261 ***
## branch_id_fct=9: branch_id <= 276
## ltv_fct=1: ltv <= 55.63 ***
## ltv_fct=10: ltv <= 84.57 ***
## ltv_fct=11: ltv <= 85 ***
## ltv_fct=12: ltv <= 87.8 ***
## ltv_fct=13: ltv <= 89.3 ***
## ltv_fct=14: ltv > 89.3 ***
## ltv_fct=2: ltv <= 62.22 ***
## ltv_fct=3: ltv <= 68.34 ***
## ltv_fct=4: ltv <= 72.9301 ***
## ltv_fct=5: ltv <= 74.31 ***
## ltv_fct=6: ltv <= 75 **
## ltv_fct=7: ltv <= 77.39 ***
## ltv_fct=8: ltv <= 78.92 *
## ltv_fct=9: ltv <= 83.34
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 *
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 ***
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 *
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 .
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 ***
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824
## disbursed_amount_fct=1: disbursed_amount <= 39134
## disbursed_amount_fct=2: disbursed_amount <= 43615 *
## disbursed_amount_fct=3: disbursed_amount <= 48555 **
## disbursed_amount_fct=4: disbursed_amount <= 51908 ***
## disbursed_amount_fct=5: disbursed_amount <= 55400 **
## disbursed_amount_fct=6: disbursed_amount > 55400
## OutstandingNow_fct=1: OutstandingNow <= 44402
## OutstandingNow_fct=2: OutstandingNow <= 50314 *
## OutstandingNow_fct=3: OutstandingNow <= 171384 ***
## OutstandingNow_fct=4: OutstandingNow <= 324324 *
## OutstandingNow_fct=5: OutstandingNow <= 746271 **
## OutstandingNow_fct=6: OutstandingNow > 746271
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 ***
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 ***
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 ***
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 ***
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256
## State_ID_fct=1: State_ID <= 183
## State_ID_fct=2: State_ID <= 188
## State_ID_fct=3: State_ID <= 206
## State_ID_fct=4: State_ID <= 214 .
## State_ID_fct=5: State_ID <= 220 *
## State_ID_fct=6: State_ID <= 229
## State_ID_fct=7: State_ID <= 272 .
## State_ID_fct=8: State_ID > 272
## DisAsDiff_fct=1: DisAsDiff <= 13554 **
## DisAsDiff_fct=2: DisAsDiff <= 15670 *
## DisAsDiff_fct=3: DisAsDiff <= 16661
## DisAsDiff_fct=4: DisAsDiff <= 19822
## DisAsDiff_fct=5: DisAsDiff > 19822
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 ***
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 ***
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0
## manufacturer_id_fct=1: manufacturer_id <= 210 ***
## manufacturer_id_fct=2: manufacturer_id <= 221
## manufacturer_id_fct=3: manufacturer_id <= 228 ***
## manufacturer_id_fct=4: manufacturer_id > 228
## VoterID_flag_fct=1: VoterID_flag = 0 ***
## VoterID_flag_fct=2: VoterID_flag = 1
## ShareOverdue_fct=1: ShareOverdue <= -2 **
## ShareOverdue_fct=2: ShareOverdue <= -1 ***
## ShareOverdue_fct=3: ShareOverdue > -1
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 ***
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 ***
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 ***
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3
## Day_fct=1: Day <= 28 ***
## Day_fct=2: Day <= 30 ***
## Day_fct=3: Day > 30
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 ***
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 ***
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0
## Qrt_fct=1: Qrt = 3 ***
## Qrt_fct=2: Qrt = 4
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 ***
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 ***
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 ***
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 ***
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208
## Employment_Type_fct=1: Employment_Type = 203 ***
## Employment_Type_fct=2: Employment_Type = 215 ***
## Employment_Type_fct=3: Employment_Type = 227
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 ***
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 ***
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 ***
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326
## asset_cost_fct=1: asset_cost <= 60098
## asset_cost_fct=2: asset_cost <= 70561 ***
## asset_cost_fct=3: asset_cost <= 85738 ***
## asset_cost_fct=4: asset_cost > 85738
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0
## Passport_flag_fct=1: Passport_flag = 0
## Passport_flag_fct=2: Passport_flag = 1
## Driving_flag_fct=1: Driving_flag = 0
## Driving_flag_fct=2: Driving_flag = 1
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0
## PAN_flag_fct=1: PAN_flag = 0 *
## PAN_flag_fct=2: PAN_flag = 1
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Condition number of final variance-covariance matrix: 1997.764
## Number of iterations: 6
# Estimate a Classification Result on Testing Set
Pred <-
RevoScaleR::rxPredict(modelObject = rxLogitFit
, data = X[inTest, ] #, sqlFraudDS
# , outData = sqlServerOutDS3
, predVarNames = 'logitProbs'
, type = 'response'
, writeModelVars = FALSE
# , extraVarsToWrite = 'SUBS_KEY'
, overwrite = TRUE )## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.019 seconds
Probs <- Pred[, 'logitProbs']
# PrClass <- factor(ifelse( RevoScaleR::rxImport(sqlServerOutDS3)[, 'predLogitProbs'] < 0.5, 0, 1))
PrClass <- ifelse( Pred[, 'logitProbs'] < 0.5, 0, 1) %>%
factor
levels(PrClass) <- c('Bad', 'Good')
ObClass <- Y[inTest]
writeLines('\n Estimate a Classification Result on Testing Set \n')##
## Estimate a Classification Result on Testing Set
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 284 338
## Good 4770 17924
##
## Accuracy : 0.7809
## 95% CI : (0.7756, 0.7862)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 0.8069
##
## Kappa : 0.0552
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.05619
## Specificity : 0.98149
## Pos Pred Value : 0.45659
## Neg Pred Value : 0.78981
## Precision : 0.45659
## Recall : 0.05619
## F1 : 0.10007
## Prevalence : 0.21676
## Detection Rate : 0.01218
## Detection Prevalence : 0.02668
## Balanced Accuracy : 0.51884
##
## 'Positive' Class : Bad
##
writeLines(paste0('\n Estimate a Gini Coefficient on Testing Set = ',
formattable::percent( hmeasure::HMeasure(true.class = ObClass %>% as.integer() - 1 ,
scores = Probs)[['metrics']] %>% .[1, 'Gini'], 2 )))##
## Estimate a Gini Coefficient on Testing Set = 31.79%
# Compute the ROC data using 100 threshold breaks
rxRocObject <-
RevoScaleR::rxRocCurve(actualVarName = 'ObClass'
, predVarNames = c('Probs')
, numBreaks = 100 # length(Probs)
, data = data.frame(ObClass = ObClass %>% as.integer() - 1, Probs)
, title = 'ROC Curve for Logit Model')
# Testing New Data with the Generalized Linear Model
# data.frame(..._fct = '4') %>%
# mutate_if(is.character, as.factor) %>%
# predict(glmFit, newdata = ., type = 'response')
# openxlsx::addWorksheet(wb0 <- openxlsx::createWorkbook(), sheetName = 'Example', gridLines = FALSE)
# openxlsx::writeData(wb0, sheet = 1, x = data.frame(X[inTest, ], ObClass, PrClass), withFilter = TRUE); openxlsx::openXL(wb0)
remove(varsel, scope)
ROC - A Receiver Operating Characteristic curve (ROC curve; closely related to the Lorenz curve) is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
One of the statistics commonly used in credit scoring, as well as countless other disciplines, is the KS statistic. This was developed by two renowned Soviet mathematicians, A.N. Kolmogorov (1903-1987) and N.V. Smirnov (1900-1966).
The K-S statistic of interest is the point where the difference between the two cumulative distribution functions is greatest. The treatment differs depending upon whether one or two samples were used to generate the values.
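As a minimal sketch (assuming the Probs and ObClass objects created in the chunks above), the two-sample K-S statistic on the classifier scores can be computed directly:
# Two-sample K-S statistic on the model scores, split by observed class
scoresGood <- Probs[ObClass == 'Good']
scoresBad <- Probs[ObClass == 'Bad']
# Maximum vertical distance between the two empirical CDFs
grid <- sort(unique(Probs))
max(abs(ecdf(scoresGood)(grid) - ecdf(scoresBad)(grid)))
# stats::ks.test() reports the same statistic (ties only trigger a warning)
suppressWarnings(ks.test(scoresGood, scoresBad))$statistic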
The credit score is a numeric expression measuring creditworthiness. Commercial Banks usually utilize it as a method to support the decision-making about credit applications.
If a reliable odds estimate already exists, whether because the statistical technique provides it directly or because some other algorithm supplied it, scaling can be done using Equation [5] below.
\[ \displaystyle \large c' = \frac{S \times \ln(D \times G) \ - \ (S + I) \times \ln(D)}{\ln(G)} \hspace{.5 in} [5] \\ \displaystyle \large i' = \frac{I}{\ln(G)} \hspace{0.3 in} s' = c' + \ln(D_{Orig}) \times i'\]
where \(S\) is the reference score, \(D\) is the required Good/Bad odds at that score, \(I\) is the score increment, \(G\) the required odds increment, and \(D_{Orig}\) the odds provided by the model. An example for a reference odds of 16 to 1 at a score of 700, with odds doubling every 50 points, is provided below. The scaled score equating to 128 to 1 is then 850, calculated as:
\[ \displaystyle \large c' = \frac{700 \times \ln(16 \times 2) \ - \ (700 + 50) \times \ln(16)}{\ln(2)} = 500 \hspace{.5 in} [6] \\ \displaystyle \large i' = \frac{50}{\ln(2)} = 72.13475 \hspace{0.1 in} s' = 500 + \ln(128) \times 72.13475 = 850\]
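The same arithmetic in R, as a quick check of Equations [5] and [6]:
# Reference point: score S = 700 at odds D = 16:1, I = 50 points per odds doubling (G = 2)
S <- 700; D <- 16; I <- 50; G <- 2
cPrime <- (S * log(D * G) - (S + I) * log(D)) / log(G) # 500
iPrime <- I / log(G) # 72.13475
sPrime <- cPrime + log(128) * iPrime # 850, the scaled score at odds 128:1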
A further method of validation is to compare the divergence statistic for the scores of the ‘Good’ and ‘Bad’ classes. The Kullback-Leibler divergence, or relative entropy, can be calculated using the formula:
\[ \displaystyle \large Divergence = \frac{(mean_G \ - \ mean_B)^2}{0.5 × (var_G \ + \ var_B)} \hspace{.5 in} [7] \]
where \(mean_G\), \(mean_B\), \(var_G\), and \(var_B\) are the means and variances of the scored Good and Bad populations respectively.
A large divergence value indicates good separation between the two classes.
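A minimal sketch of Equation [7] in R, assuming the same Probs and ObClass objects as above (the ScoresCurveShow() helper used below reports this statistic):
# Divergence between the score distributions of the Good and Bad classes
divergenceStat <- function(scores, classes) {
  g <- scores[classes == 'Good']
  b <- scores[classes == 'Bad']
  (mean(g) - mean(b))^2 / (0.5 * (var(g) + var(b)))
}
divergenceStat(Probs, ObClass)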
# Generate an ROC Curve for the Best Model
ROCCurveShow(Preds = Probs, # `Good` Class Probabilities - numeric vector
Obsers = ObClass, # Observed Classes (Reference) - factor with levels Bad/Good
NameOfModel = 'Initial GLM') # Name Of The Model##
## Initial GLM - Estimate an Area Under the ROC Curve (AUC) on Testing Set = 65.90%
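This AUC is consistent with the Gini coefficient reported earlier, since Gini = 2 * AUC - 1:
# With AUC = 0.6590, Gini = 2 * 0.6590 - 1 = 0.318, matching the 31.79% above
2 * 0.6590 - 1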
# Generate a KS Curve for the Best Model
KSCurveShow(Preds = Probs, # `Good` Class Probabilities - numeric vector
Obsers = ObClass, # Observed Classes (Reference) - factor with levels Bad/Good
NameOfModel = 'Initial GLM') # Name Of The Model##
## Initial GLM - Estimate a Kolmogorov-Smirnov Statistic on Testing Set = 0.2331
# Generate a Distribution's Curve of Scores by Good & Bad Class for the Best Model
ScoresCurveShow(Preds = Probs, # `Good` Class Probabilities - numeric vector
Obsers = ObClass, # Observed Classes (Reference) - factor with levels Bad/Good
NameOfModel = 'Initial GLM') # Name Of The Model##
## Initial GLM - Estimate a Kullback-Leibler’s Divergence Statistic on Testing Set = 0.0842
MicrosoftML package
John Mount has long wondered about the applicability of heterogeneous statistical methods to various classification problems. First, which classification methods are the most accurate in general, that is, which identify the correct class most of the time. Second, which classifiers behave most like each other, in terms of the class probabilities that they assign to each of the target classes. Answers to these questions can be found on the company’s website, Win-Vector LLC.
The rxLogisticRegression() algorithm is used to predict the value of a categorical dependent variable from its relationship to one or more independent variables assumed to have a logistic distribution. The rxLogisticRegression learner automatically adjusts the weights to select the variables that are most useful for making predictions (L1 and L2 regularization). As the training log below confirms, optimization is carried out with the memory-efficient L-BFGS method.
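For reference, a standard textbook formulation of the penalized log-likelihood objective that such a learner minimizes (shown here for illustration, not quoted from the MicrosoftML documentation) is:
\[ \displaystyle \large \min_{w} \ \frac{1}{n} \sum_{i=1}^{n} \ln \left( 1 + e^{-y_i w^{T} x_i} \right) \ + \ \lambda_1 \lVert w \rVert_1 \ + \ \frac{\lambda_2}{2} \lVert w \rVert_2^2 \]
where \(\lambda_1\) and \(\lambda_2\) correspond to the l1Weight and l2Weight arguments used below.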
# Train a Generalized Linear Model regularized by the L1 (Lasso) and L2 (Ridge) penalties
start_time <- Sys.time()
library('MicrosoftML') # Microsoft Machine Learning for R
# Setup parallel processing - 2 times faster
# library('doParallel'); cl <- makeCluster(detectCores()); registerDoParallel(cl)
writeLines('\n\nGeneralized Linear Model regularized by the L1 and L2 penalties ...\n')##
##
## Generalized Linear Model regularized by the L1 and L2 penalties ...
set.seed(seed)
# tuneLength <- 10
# tuneGrid = data.frame(lasso = seq(from = .1, to = 1, length.out = tuneLength),
# ridge = seq(from = .1, to = 1, length.out = tuneLength))
# Optimization.df <- data.frame(matrix(nrow = tuneLength, ncol = tuneLength))
# system.time(
# for (i in 1:tuneLength ) { # The L1 (Lasso) regularization
#
# for (j in 1:tuneLength ) { # The L2 (Ridge) regularization
# rxLogisticRegressionFit <-
# MicrosoftML::rxLogisticRegression(
# formula = Y ~ .
# , data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
# , type = 'binary'
# , l2Weight = tuneGrid[j, 'ridge'] # The L2 (Ridge) regularization weight
# , l1Weight = tuneGrid[i, 'lasso'] # The L1 (Lasso) regularization weight
# , normalize = 'no' # no normalization is performed
# , reportProgress = 0
# , verbose = 4 )
# Optimization.df[i, j] <- summary(rxLogisticRegressionFit)$summary$AIC
#
# }
# }
# )
#
# optRegularizations <- which(Optimization.df == min(Optimization.df), arr.ind = TRUE)
rxLogisticRegressionFit <-
MicrosoftML::rxLogisticRegression( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, type = 'binary'
# , l2Weight = 1 # tuneGrid[optRegularizations[2], 'ridge'] # The Ridge regularization weight
# , l1Weight = 1 # tuneGrid[optRegularizations[1], 'lasso'] # The Lasso regularization weight
, trainThreads = parallel::detectCores() # The number of threads to use in model
, normalize = 'no' # no normalization is performed
, reportProgress = 0
, verbose = 0 )## Not adding a normalizer.
## Beginning processing data.
## Rows Read: 328562, Read Time: 0.001, Transform Time: 0
## Beginning processing data.
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
## Beginning optimization
## num vars: 163
## improvement criterion: Mean Improvement
## L1 regularization selected 163 of 163 weights.
## Not training a calibrator because it is not needed.
## Elapsed time: 00:00:02.3866356
## Elapsed time: 00:00:00.1384473
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict( rxLogisticRegressionFit
, data = cbind(Y = Y[inTest], ObClass = Y[inTest] %>% as.integer() - 1, X[inTest, ])
, suffix = '.rxLogisticRegression'
, extraVarsToWrite = names(cbind(Y = Y[inTest], ObClass = Y[inTest] %>% as.integer() - 1, X[inTest, ]))
, outData = tempfile(fileext = '.xdf'))## Beginning processing data.
## Rows Read: 23316, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:00.2378449
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxLogisticRegression')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.025 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 3039 6817
## Good 2015 11445
##
## Accuracy : 0.6212
## 95% CI : (0.6149, 0.6274)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1697
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.6013
## Specificity : 0.6267
## Pos Pred Value : 0.3083
## Neg Pred Value : 0.8503
## Precision : 0.3083
## Recall : 0.6013
## F1 : 0.4076
## Prevalence : 0.2168
## Detection Rate : 0.1303
## Detection Prevalence : 0.4227
## Balanced Accuracy : 0.6140
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Generalized Linear Model
writeLines('Measure running time of `Generalized Linear Model` code = ')## Measure running time of `Generalized Linear Model` code =
## Time difference of 3.74208 secs
The rxFastTrees() algorithm is a high-performing, state-of-the-art, scalable boosted decision tree that implements FastRank, an efficient implementation of the MART gradient boosting algorithm. MART learns an ensemble of regression trees, where each regression tree is a decision tree with scalar values in its leaves. For binary classification, the output is converted to a probability by using some form of calibration.
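As a small illustration of such a calibration, here is the logistic (sigmoid) map from a raw score to a probability; the same transform is applied to Score.rxEnsemble.Good further below, and Platt-style scaling would additionally fit the slope a and offset b on held-out data:
# Logistic (sigmoid) map from a raw margin/score to a probability in (0, 1)
sigmoid <- function(score, a = 1, b = 0) 1 / (1 + exp(-(a * score + b)))
sigmoid(c(-2, 0, 2)) # 0.1192, 0.5000, 0.8808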
# Train FastTrees Model as an efficient implementation of the MART Gradient Boosting Algorithm (GTB)
start_time <- Sys.time()
writeLines('\n\nFast Trees is a Gradient Tree Boosting Algorithm (GTB) ...\n')##
##
## Fast Trees is a Gradient Tree Boosting Algorithm (GTB) ...
system.time(
rxFastTreesFit <-
MicrosoftML::rxFastTrees( formula = Y ~ .
, data = cbind(Y = Y[inTrain], X[inTrain, ]) # No up-sampling here; unbalancedSets = TRUE below handles the class imbalance
, type = 'binary'
, numTrees = 100
, numLeaves = 20
, learningRate = 0.3 # Determines the size of the step taken in the direction of gradient in each step
, minSplit = 10
, exampleFraction = 0.7 # The fraction of randomly chosen instances to use for each tree
, featureFraction = 1 # The fraction of randomly chosen features to use for each tree
, splitFraction = 1 # The fraction of randomly chosen features to use on each split
# , numBins = 255
, firstUsePenalty = 0 # The feature first use penalty coefficient
, gainConfLevel = 0 # Tree fitting gain confidence requirement (should be in the range [0,1))
, unbalancedSets = TRUE # derivatives optimized for unbalanced sets are used
, randomSeed = seed
, reportProgress = 0
, verbose = 0 )
)## user system elapsed
## 0.14 0.00 2.67
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict( rxFastTreesFit, fitTestScores, suffix = '.rxFastTrees'
, extraVarsToWrite = names(fitTestScores)
, outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 23316, Read Time: 0.004, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:00.3513291
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(
data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxFastTrees')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.031 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 3019 6681
## Good 2035 11581
##
## Accuracy : 0.6262
## 95% CI : (0.6199, 0.6324)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1737
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5973
## Specificity : 0.6342
## Pos Pred Value : 0.3112
## Neg Pred Value : 0.8505
## Precision : 0.3112
## Recall : 0.5973
## F1 : 0.4092
## Prevalence : 0.2168
## Detection Rate : 0.1295
## Detection Prevalence : 0.4160
## Balanced Accuracy : 0.6158
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Gradient Tree Boosting Model
writeLines('Measure running time of `Gradient Tree Boosting Model` code = ')## Measure running time of `Gradient Tree Boosting Model` code =
## Time difference of 3.490278 secs
Decision trees are non-parametric models that perform a sequence of simple tests on inputs. The rxFastForest() algorithm is a random forest that provides a learning method for classification, constructing an ensemble of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. Random decision forests correct for the overfitting to the training set to which single decision trees are prone. The rxFastForest learner automatically builds a set of trees whose combined predictions are better than the predictions of any one of the trees.
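As a toy illustration of "the mode of the classes" (the votes here are hypothetical, not taken from the fitted forest):
# Majority vote across five hypothetical trees; rxFastForest aggregates internally
treeVotes <- c('Bad', 'Good', 'Good', 'Bad', 'Good')
names(which.max(table(treeVotes))) # 'Good'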
# Train FastForest Model as an efficient implementation of the Random Forest (RF)
start_time <- Sys.time()
writeLines('\n\nFast Forest is a Fast Random Forest (RF) ...\n')##
##
## Fast Forest is a Fast Random Forest (RF) ...
system.time(
rxFastForestFit <-
MicrosoftML::rxFastForest( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, type = 'binary'
# , numTrees = 200
# , numLeaves = 30
, randomSeed = seed
, reportProgress = 1
, verbose = 0 )
)## Beginning processing data.
## Rows Read: 328562
## Beginning processing data.
## Beginning processing data.
## Rows Read: 328562
## Beginning processing data.
## user system elapsed
## 0.60 0.01 7.50
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict( rxFastForestFit, fitTestScores, suffix = '.rxFastForest'
, extraVarsToWrite = names(fitTestScores)
, outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 23316, Read Time: 0.003, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:00.8439608
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxFastForest')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.037 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 3035 7349
## Good 2019 10913
##
## Accuracy : 0.5982
## 95% CI : (0.5919, 0.6045)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1434
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.6005
## Specificity : 0.5976
## Pos Pred Value : 0.2923
## Neg Pred Value : 0.8439
## Precision : 0.2923
## Recall : 0.6005
## F1 : 0.3932
## Prevalence : 0.2168
## Detection Rate : 0.1302
## Detection Prevalence : 0.4454
## Balanced Accuracy : 0.5990
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Random Forest Model (Fast Forest)
writeLines('Measure running time of `Random Forest Model (Fast Forest)` code = ')## Measure running time of `Random Forest Model (Fast Forest)` code =
## Time difference of 8.736882 secs
The rxNeuralNet() algorithm supports a user-defined multilayer network topology with GPU acceleration. A neural network is a class of prediction models inspired by the human brain. It can be represented as a weighted directed graph in which each node is called a neuron. The neural network algorithm tries to learn the optimal weights on the edges from the training data. Any class of statistical models can be considered a neural network if it uses adaptive weights and can approximate non-linear functions of its inputs. Neural network regression is especially suited to problems where a more traditional regression model cannot fit a solution. The network structure is defined using the Net# language; example topologies are available in the Azure Gallery.
For GPU acceleration, it is recommended to use a miniBatchSize greater than one. If you want to use GPU acceleration, additional manual setup steps are required:
Download and install NVidia CUDA Toolkit 6.5 (CUDA Toolkit).
Download and install NVidia cuDNN v2 Library (cudnn Library).
Find the libs directory of the MicrosoftML package by calling system.file('mxLibs/x64', package = 'MicrosoftML').
Copy cublas64_65.dll, cudart64_65.dll and cusparse64_65.dll from the CUDA Toolkit 6.5 into the libs directory of the MicrosoftML package.
Copy cudnn64_65.dll from the cuDNN v2 Library into the libs directory of the MicrosoftML package.
# Train Artificial Neural Networks Model
start_time <- Sys.time()
writeLines('\n\nArtificial Neural Networks (ANN) ...\n')##
##
## Artificial Neural Networks (ANN) ...
# Azure Net# definition of the structure of the neural network
netDefinition <- ('
// Define constants.
const { T = true; F = false; }
// Input layer definition.
input Data [33];
// First convolutional layer definition.
hidden C1 [ 7, 9]
from Data convolve {
InputShape = [33];
KernelShape = [17];
Stride = [2];
MapCount = 7;
}
// Second normalize layer definition.
hidden N1 [ 7, 9]
from C1 response norm {
InputShape = [7, 9];
KernelShape = [1, 1];
Alpha = 0.0001;
Beta = 0.75;
Offset = 1;
AvgOverFullKernel = true;
}
// Third fully connected layer definition.
hidden H3 [100]
from N1 all;
// Output layer definition.
output Result auto softmax from H3 all;
')
# netDefinition <- '
# input Data [33];
#
# hidden H1 [100]
# from Data all;
#
# hidden H2 [100]
# from H1 all;
#
# output Result [1] softmax
# from H2 all;
# '
# The main issue was factorizing the data correctly.
system.time(
rxNeuralNetFit <-
MicrosoftML::rxNeuralNet( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, type = 'binary'
# , numHiddenNodes = 100
# , numIterations = 50
# , optimizer = adaDeltaSgd(decay = .99, conditioningConst = 1e-05)
#, netDefinition = netDefinition
, acceleration = 'gpu'
, initWtsDiameter = 0.005
, normalize = 'warn'
, randomSeed = seed
# , reportProgress = 1
# , verbose = 0
)
)## Not adding a normalizer.
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Failed to initialize CUDA runtime. Possible reasons:
## 1. The machine does not have CUDA-capable card. Supported devices have compute capability 2.0 and higher.
## 2. Outdated graphics drivers. Please install the latest drivers from http://www.nvidia.com/Drivers .
## 3. CUDA runtime DLLs are missing, please see the GPU acceleration help for the installation instructions.
## CUDA not supported, switched to SSE math.
## Using: SSE Math
##
## ***** Net definition *****
## input Data [162];
## hidden H [100] sigmoid { // Depth 1
## from Data all;
## }
## output Result [1] sigmoid { // Depth 0
## from H all;
## }
## ***** End net definition *****
## Input count: 162
## Output count: 1
## Output Function: Sigmoid
## Loss Function: LogLoss
## PreTrainer: NoPreTrainer
## ___________________________________________________________________
## Starting training...
## Learning rate: 0.001000
## Momentum: 0.000000
## InitWtsDiameter: 0.005000
## ___________________________________________________________________
## Initializing 1 Hidden Layers, 16401 Weights...
## Estimated Pre-training MeanError = 0.693613
## Iter:1/100, MeanErr=0.639431(-7.81%%), 3798.21M WeightUpdates/sec
## Iter:2/100, MeanErr=0.602368(-5.80%%), 3949.60M WeightUpdates/sec
## Iter:3/100, MeanErr=0.584635(-2.94%%), 3914.17M WeightUpdates/sec
## Iter:4/100, MeanErr=0.527511(-9.77%%), 3915.96M WeightUpdates/sec
## Iter:5/100, MeanErr=0.593839(12.57%%), 3827.49M WeightUpdates/sec
## Iter:6/100, MeanErr=0.554664(-6.60%%), 3856.07M WeightUpdates/sec
## Iter:7/100, MeanErr=0.562969(1.50%%), 3903.81M WeightUpdates/sec
## Iter:8/100, MeanErr=0.548285(-2.61%%), 3942.92M WeightUpdates/sec
## Iter:9/100, MeanErr=0.563428(2.76%%), 3897.05M WeightUpdates/sec
## Iter:10/100, MeanErr=0.586190(4.04%%), 3941.05M WeightUpdates/sec
## Iter:11/100, MeanErr=0.597030(1.85%%), 3853.81M WeightUpdates/sec
## Iter:12/100, MeanErr=0.608863(1.98%%), 3888.06M WeightUpdates/sec
## Iter:13/100, MeanErr=0.584287(-4.04%%), 3521.77M WeightUpdates/sec
## Iter:14/100, MeanErr=0.590426(1.05%%), 3879.91M WeightUpdates/sec
## Iter:15/100, MeanErr=0.520848(-11.78%%), 3869.92M WeightUpdates/sec
## Iter:16/100, MeanErr=0.582126(11.77%%), 3852.34M WeightUpdates/sec
## Iter:17/100, MeanErr=0.583295(0.20%%), 3921.90M WeightUpdates/sec
## Iter:18/100, MeanErr=0.603145(3.40%%), 3871.87M WeightUpdates/sec
## Iter:19/100, MeanErr=0.588263(-2.47%%), 3891.76M WeightUpdates/sec
## Iter:20/100, MeanErr=0.588528(0.05%%), 3984.09M WeightUpdates/sec
## Iter:21/100, MeanErr=0.561759(-4.55%%), 3952.93M WeightUpdates/sec
## Iter:22/100, MeanErr=0.597273(6.32%%), 3904.70M WeightUpdates/sec
## Iter:23/100, MeanErr=0.580513(-2.81%%), 3923.32M WeightUpdates/sec
## Iter:24/100, MeanErr=0.608079(4.75%%), 3931.57M WeightUpdates/sec
## Iter:25/100, MeanErr=0.575513(-5.36%%), 3940.99M WeightUpdates/sec
## Iter:26/100, MeanErr=0.601079(4.44%%), 3948.39M WeightUpdates/sec
## Iter:27/100, MeanErr=0.574797(-4.37%%), 3920.24M WeightUpdates/sec
## Iter:28/100, MeanErr=0.552832(-3.82%%), 3934.60M WeightUpdates/sec
## Iter:29/100, MeanErr=0.590020(6.73%%), 3982.74M WeightUpdates/sec
## Iter:30/100, MeanErr=0.574746(-2.59%%), 3846.70M WeightUpdates/sec
## Iter:31/100, MeanErr=0.582986(1.43%%), 3966.72M WeightUpdates/sec
## Iter:32/100, MeanErr=0.554354(-4.91%%), 3847.14M WeightUpdates/sec
## Iter:33/100, MeanErr=0.571911(3.17%%), 4055.53M WeightUpdates/sec
## Iter:34/100, MeanErr=0.607195(6.17%%), 3849.39M WeightUpdates/sec
## Iter:35/100, MeanErr=0.479481(-21.03%%), 3860.51M WeightUpdates/sec
## Iter:36/100, MeanErr=0.559952(16.78%%), 3806.84M WeightUpdates/sec
## Iter:37/100, MeanErr=0.572845(2.30%%), 3474.53M WeightUpdates/sec
## Iter:38/100, MeanErr=0.566092(-1.18%%), 3993.22M WeightUpdates/sec
## Iter:39/100, MeanErr=0.590559(4.32%%), 3884.40M WeightUpdates/sec
## Iter:40/100, MeanErr=0.562207(-4.80%%), 3865.32M WeightUpdates/sec
## Iter:41/100, MeanErr=0.570505(1.48%%), 3854.38M WeightUpdates/sec
## Iter:42/100, MeanErr=0.582996(2.19%%), 3185.89M WeightUpdates/sec
## Iter:43/100, MeanErr=0.589425(1.10%%), 3415.26M WeightUpdates/sec
## Iter:44/100, MeanErr=0.543417(-7.81%%), 2318.03M WeightUpdates/sec
## Iter:45/100, MeanErr=0.583254(7.33%%), 2706.60M WeightUpdates/sec
## Iter:46/100, MeanErr=0.570077(-2.26%%), 2547.02M WeightUpdates/sec
## Iter:47/100, MeanErr=0.593370(4.09%%), 3272.58M WeightUpdates/sec
## Iter:48/100, MeanErr=0.578531(-2.50%%), 3389.04M WeightUpdates/sec
## Iter:49/100, MeanErr=0.560148(-3.18%%), 3481.86M WeightUpdates/sec
## Iter:50/100, MeanErr=0.594706(6.17%%), 3710.83M WeightUpdates/sec
## Iter:51/100, MeanErr=0.576700(-3.03%%), 2830.77M WeightUpdates/sec
## Iter:52/100, MeanErr=0.591160(2.51%%), 3083.10M WeightUpdates/sec
## Iter:53/100, MeanErr=0.559937(-5.28%%), 3995.15M WeightUpdates/sec
## Iter:54/100, MeanErr=0.516699(-7.72%%), 4226.92M WeightUpdates/sec
## Iter:55/100, MeanErr=0.568916(10.11%%), 4151.69M WeightUpdates/sec
## Iter:56/100, MeanErr=0.551433(-3.07%%), 4325.79M WeightUpdates/sec
## Iter:57/100, MeanErr=0.601805(9.13%%), 4337.89M WeightUpdates/sec
## Iter:58/100, MeanErr=0.591378(-1.73%%), 4311.02M WeightUpdates/sec
## Iter:59/100, MeanErr=0.579823(-1.95%%), 4320.18M WeightUpdates/sec
## Iter:60/100, MeanErr=0.556642(-4.00%%), 4323.00M WeightUpdates/sec
## Iter:61/100, MeanErr=0.573425(3.01%%), 4319.77M WeightUpdates/sec
## Iter:62/100, MeanErr=0.571964(-0.25%%), 4263.00M WeightUpdates/sec
## Iter:63/100, MeanErr=0.568972(-0.52%%), 4258.27M WeightUpdates/sec
## Iter:64/100, MeanErr=0.569381(0.07%%), 4267.84M WeightUpdates/sec
## Iter:65/100, MeanErr=0.510309(-10.37%%), 4246.29M WeightUpdates/sec
## Iter:66/100, MeanErr=0.581686(13.99%%), 4215.08M WeightUpdates/sec
## Iter:67/100, MeanErr=0.570046(-2.00%%), 4180.59M WeightUpdates/sec
## Iter:68/100, MeanErr=0.588904(3.31%%), 3617.76M WeightUpdates/sec
## Iter:69/100, MeanErr=0.587524(-0.23%%), 4163.22M WeightUpdates/sec
## Iter:70/100, MeanErr=0.565621(-3.73%%), 4315.60M WeightUpdates/sec
## Iter:71/100, MeanErr=0.551080(-2.57%%), 4358.70M WeightUpdates/sec
## Iter:72/100, MeanErr=0.553144(0.37%%), 4288.95M WeightUpdates/sec
## Iter:73/100, MeanErr=0.546142(-1.27%%), 4341.31M WeightUpdates/sec
## Iter:74/100, MeanErr=0.599115(9.70%%), 4394.05M WeightUpdates/sec
## Iter:75/100, MeanErr=0.584863(-2.38%%), 4328.23M WeightUpdates/sec
## Iter:76/100, MeanErr=0.577997(-1.17%%), 4244.54M WeightUpdates/sec
## Iter:77/100, MeanErr=0.560597(-3.01%%), 4393.77M WeightUpdates/sec
## Iter:78/100, MeanErr=0.560554(-0.01%%), 4285.19M WeightUpdates/sec
## Iter:79/100, MeanErr=0.566539(1.07%%), 4323.93M WeightUpdates/sec
## Iter:80/100, MeanErr=0.583388(2.97%%), 4316.30M WeightUpdates/sec
## Iter:81/100, MeanErr=0.574882(-1.46%%), 4303.92M WeightUpdates/sec
## Iter:82/100, MeanErr=0.597803(3.99%%), 4363.38M WeightUpdates/sec
## Iter:83/100, MeanErr=0.573010(-4.15%%), 4319.29M WeightUpdates/sec
## Iter:84/100, MeanErr=0.576507(0.61%%), 4263.72M WeightUpdates/sec
## Iter:85/100, MeanErr=0.581926(0.94%%), 4375.43M WeightUpdates/sec
## Iter:86/100, MeanErr=0.550412(-5.42%%), 4324.91M WeightUpdates/sec
## Iter:87/100, MeanErr=0.524140(-4.77%%), 4318.94M WeightUpdates/sec
## Iter:88/100, MeanErr=0.584658(11.55%%), 4339.30M WeightUpdates/sec
## Iter:89/100, MeanErr=0.585000(0.06%%), 4297.34M WeightUpdates/sec
## Iter:90/100, MeanErr=0.562040(-3.92%%), 4313.00M WeightUpdates/sec
## Iter:91/100, MeanErr=0.553648(-1.49%%), 4360.05M WeightUpdates/sec
## Iter:92/100, MeanErr=0.598780(8.15%%), 4227.71M WeightUpdates/sec
## Iter:93/100, MeanErr=0.595341(-0.57%%), 4207.91M WeightUpdates/sec
## Iter:94/100, MeanErr=0.589134(-1.04%%), 4279.77M WeightUpdates/sec
## Iter:95/100, MeanErr=0.571897(-2.93%%), 4230.29M WeightUpdates/sec
## Iter:96/100, MeanErr=0.590322(3.22%%), 4332.06M WeightUpdates/sec
## Iter:97/100, MeanErr=0.568671(-3.67%%), 4195.30M WeightUpdates/sec
## Iter:98/100, MeanErr=0.528197(-7.12%%), 4155.70M WeightUpdates/sec
## Iter:99/100, MeanErr=0.573761(8.63%%), 4197.96M WeightUpdates/sec
## Iter:100/100, MeanErr=0.590485(2.91%%), 4186.22M WeightUpdates/sec
## Done!
## Estimated Post-training MeanError = 0.620647
## ___________________________________________________________________
## Not training a calibrator because it is not needed.
## Elapsed time: 00:02:18.7486743
## user system elapsed
## 0.61 0.01 139.44
trained_model <- data.frame(payload = as.raw(serialize(rxNeuralNetFit, connection = NULL)))
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict(rxNeuralNetFit, fitTestScores, suffix = '.rxNeuralNet',
extraVarsToWrite = names(fitTestScores),
outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 23316, Read Time: 0.004, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:00.2187188
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxNeuralNet')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.042 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 2882 6370
## Good 2172 11892
##
## Accuracy : 0.6336
## 95% CI : (0.6274, 0.6398)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1703
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5702
## Specificity : 0.6512
## Pos Pred Value : 0.3115
## Neg Pred Value : 0.8456
## Precision : 0.3115
## Recall : 0.5702
## F1 : 0.4029
## Prevalence : 0.2168
## Detection Rate : 0.1236
## Detection Prevalence : 0.3968
## Balanced Accuracy : 0.6107
##
## 'Positive' Class : Bad
##
remove (trained_model, netDefinition)
# Measure running time of R code for Artificial Neural Networks (ANN)
writeLines('Measure running time of `Artificial Neural Networks (ANN)` code = ')## Measure running time of `Artificial Neural Networks (ANN)` code =
## Time difference of 2.335371 mins
The rxOneClassSvm() algorithm trains a one-class support vector machine: this type of SVM is one-class because the training set contains only examples from the target class. It infers what properties are normal for the objects in the target class and from these properties predicts which examples are unlike the normal examples. This is useful for anomaly detection because the scarcity of training examples is the defining characteristic of anomalies: typically there are very few examples of network intrusion, fraud, or other types of anomalous behavior.
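A hedged sketch of how this learner could be fitted here (the arguments below are assumptions, since the actual run kept rxOneClassSvm commented out, as the scoring helper further below shows):
# Sketch only: train the one-class SVM on the normal ('Good') class;
# the formula carries no response term in one-class training
rxOneClassSvmFit <-
MicrosoftML::rxOneClassSvm( formula = ~ .
, data = X[inTrain, ][Y[inTrain] == 'Good', ]
, reportProgress = 0
, verbose = 0 )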
rxNaiveBayes() is RevoScaleR's Parallel External Memory Algorithm (PEMA) for Naive Bayes classifiers.
# Train a Naive Bayes Classifier
start_time <- Sys.time()
writeLines('\n\nNaive Bayes Classifier (NB) ...\n')##
##
## Naive Bayes Classifier (NB) ...
system.time(
rxNaiveBayesFit <-
RevoScaleR::rxNaiveBayes( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, reportProgress = 1 # the number of processed rows is printed and updated
, verbose = 0)
)##
## Rows Processed: 328562
## user system elapsed
## 0.62 0.03 0.83
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict( rxNaiveBayesFit, data = fitTestScores, type = 'prob'
, predVarNames = c('Probability.rxNaiveBayes.Bad', 'Probability.rxNaiveBayes.Good')
, writeModelVars = TRUE
, extraVarsToWrite = names(fitTestScores)
, outData = tempfile(fileext = '.xdf')
) ## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.694 seconds
# Use transformFunc to add PredictedLabel.rxNaiveBayes
ScoringResults.rxNaiveBayesFunc <- function(dataList) {
dataList$PredictedLabel.rxNaiveBayes <-
factor( x = ifelse(dataList$Probability.rxNaiveBayes.Good < 0.5, 'Bad', 'Good') )
return (dataList)
}
# Add PredictedLabel.rxNaiveBayes & Probability.rxEnsemble.Good into List Model's Scores
fitTestScores <-
RevoScaleR::rxDataStep(
inData = fitTestScores
, maxRowsByCols = 2e9
, transformFunc = ScoringResults.rxNaiveBayesFunc
, varsToDrop = c('Probability.rxNaiveBayes.Bad')
, overwrite = TRUE
)## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.323 seconds
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(one_of('PredictedLabel.rxNaiveBayes')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.050 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 3014 6803
## Good 2040 11459
##
## Accuracy : 0.6207
## 95% CI : (0.6145, 0.627)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1669
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5964
## Specificity : 0.6275
## Pos Pred Value : 0.3070
## Neg Pred Value : 0.8489
## Precision : 0.3070
## Recall : 0.5964
## F1 : 0.4054
## Prevalence : 0.2168
## Detection Rate : 0.1293
## Detection Prevalence : 0.4210
## Balanced Accuracy : 0.6119
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Naive Bayes (NB)
writeLines('Measure running time of `Naive Bayes Model (NB)` code = ')## Measure running time of `Naive Bayes Model (NB)` code =
## Time difference of 2.459047 secs
Next, we build an ensemble of fast tree models by using the function rxEnsemble().
# Train an Ensemble of Some Models
start_time <- Sys.time()
writeLines('\n\nEnsemble of Some Models ...\n')##
##
## Ensemble of Some Models ...
system.time(
rxEnsembleFit <-
MicrosoftML::rxEnsemble( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, type = 'binary'
, randomSeed = seed
, replace = TRUE # logical value specifying whether the sampling of observations should be done with or without replacement
, trainers=list( fastTrees(randomSeed = seed)
# , neuralNet(optimizer = sgd(), randomSeed = seed)
# , neuralNet(optimizer = adaDeltaSgd(), randomSeed = seed)
, fastTrees(numTrees = 300, randomSeed = seed)
, fastTrees(numTrees = 300, randomSeed = seed, learningRate = 0.3)
)
, combineMethod = c('median', 'average', 'vote')[1] # 'median' is selected here; 'vote' would compute (pos - neg) / the total number of models, where 'pos' is the number of positive outputs and 'neg' is the number of negative outputs
# , reportProgress = 1
# , verbose = 0
)
)## Not adding a normalizer.
## Making per-feature arrays
## Changing data from row-wise to column-wise
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Processed 327991 instances
## Binning and forming Feature objects
## Reserved memory for tree learner: 48568 bytes
## Starting to train ...
## Not training a calibrator because it is not needed.
## Elapsed time: 00:00:04.1492102
## Not adding a normalizer.
## Making per-feature arrays
## Changing data from row-wise to column-wise
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Processed 327962 instances
## Binning and forming Feature objects
## Reserved memory for tree learner: 48568 bytes
## Starting to train ...
## Not training a calibrator because it is not needed.
## Elapsed time: 00:00:06.5958746
## Not adding a normalizer.
## Making per-feature arrays
## Changing data from row-wise to column-wise
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Processed 328662 instances
## Binning and forming Feature objects
## Reserved memory for tree learner: 48568 bytes
## Starting to train ...
## Not training a calibrator because it is not needed.
## Elapsed time: 00:00:06.2534772
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Elapsed time: 00:00:06.4938534
## Beginning processing data.
## user system elapsed
## 1.26 0.31 25.25
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict(rxEnsembleFit, fitTestScores, suffix = '.rxEnsemble',
extraVarsToWrite = names(fitTestScores), # [!grepl('Bad_prob', names(fitTestScores))],
outData = tempfile(fileext = '.xdf'))## Beginning processing data.
## Rows Read: 23316, Read Time: 0.001, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:01.6385659
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxEnsemble')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.057 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 2995 6620
## Good 2059 11642
##
## Accuracy : 0.6278
## 95% CI : (0.6215, 0.634)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1735
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5926
## Specificity : 0.6375
## Pos Pred Value : 0.3115
## Neg Pred Value : 0.8497
## Precision : 0.3115
## Recall : 0.5926
## F1 : 0.4083
## Prevalence : 0.2168
## Detection Rate : 0.1285
## Detection Prevalence : 0.4124
## Balanced Accuracy : 0.6150
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Ensemble Model
writeLines('Measure running time of `Ensemble Model` code = ')## Measure running time of `Ensemble Model` code =
## Time difference of 27.30104 secs
After constructing the set of classifiers, we test them on the hold-out testing set (inTest), which did not participate in fitting the models, and compare the accuracy there with the corresponding results on the training set (inTrain).
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
#
# TRAINING SET
#
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
# Compute the fit models' ROC curves, AUC and Gini coefficients on Training Set
fitTrainScores <-
RevoScaleR::rxPredict( rxLogisticRegressionFit
, data = cbind(Y = Y[inTrain], ObClass = Y[inTrain] %>% as.integer()-1, X[inTrain, ])
, suffix = '.rxLogisticRegression'
, extraVarsToWrite = names(cbind(Y = Y[inTrain], ObClass = Y[inTrain] %>% as.integer()-1, X[inTrain, ]))
, outData = tempfile(fileext = '.xdf'))## Beginning processing data.
## Rows Read: 209838, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:01.1230519
## Finished writing 209838 rows.
## Writing completed.
fitTrainScores <-
RevoScaleR::rxPredict( rxFastTreesFit, fitTrainScores, suffix = '.rxFastTrees'
, extraVarsToWrite = names(fitTrainScores)
, outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 209838, Read Time: 0.012, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:02.7085950
## Finished writing 209838 rows.
## Writing completed.
fitTrainScores <-
RevoScaleR::rxPredict( rxFastForestFit, fitTrainScores, suffix = '.rxFastForest'
, extraVarsToWrite = names(fitTrainScores)
, outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 209838, Read Time: 0.014, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:07.3217277
## Finished writing 209838 rows.
## Writing completed.
fitTrainScores <-
RevoScaleR::rxPredict(rxNeuralNetFit, fitTrainScores, suffix = '.rxNeuralNet',
extraVarsToWrite = names(fitTrainScores),
outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 209838, Read Time: 0.015, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:01.5667492
## Finished writing 209838 rows.
## Writing completed.
# fitTrainScores <-
# RevoScaleR::rxPredict(rxOneClassSvmFit, fitTrainScores, suffix = '.rxOneClassSvm',
# extraVarsToWrite = names(fitTrainScores),
# outData = tempfile(fileext = '.xdf'))
fitTrainScores <-
RevoScaleR::rxPredict( rxNaiveBayesFit, data = fitTrainScores, type = 'prob'
, predVarNames = c('Probability.rxNaiveBayes.Bad', 'Probability.rxNaiveBayes.Good')
, writeModelVars = TRUE
, extraVarsToWrite = names(fitTrainScores)
, outData = tempfile(fileext = '.xdf'))## Rows Read: 209838, Total Rows Processed: 209838, Total Chunk Time: 2.125 seconds
fitTrainScores <-
RevoScaleR::rxPredict(rxEnsembleFit, fitTrainScores, suffix = '.rxEnsemble',
extraVarsToWrite = names(fitTrainScores),
outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 209838, Read Time: 0.098, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:13.3835712
## Finished writing 209838 rows.
## Writing completed.
# Use transformFunc to add Probability.rxOneClassSvm & Probability.rxEnsemble.Good
ScoringResults.rxSomeModelsFunc <- function(dataList) {
# dataList$PredictedLabel.rxOneClassSvm <-
# factor( x = ifelse(dataList$Score.rxOneClassSvm > 0, 'Bad', 'Good') )
# dataList$Probability.rxOneClassSvm <-
# exp( dataList$Score.rxOneClassSvm ) / (1 + exp( dataList$Score.rxOneClassSvm ))
dataList$PredictedLabel.rxNaiveBayes <-
factor( x = ifelse(dataList$Probability.rxNaiveBayes.Good < 0.5, 'Bad', 'Good') )
dataList$Probability.rxEnsemble.Good <-
exp( dataList$Score.rxEnsemble.Good ) / (1 + exp( dataList$Score.rxEnsemble.Good ))
return (dataList)
}
# Add PredictedLabel.rxNaiveBayes & Probability.rxEnsemble.Good into List Model's Scores
fitTrainScores <-
RevoScaleR::rxDataStep(
inData = fitTrainScores
, outFile = tempfile(fileext = '.xdf')
, transformFunc = ScoringResults.rxSomeModelsFunc
, varsToDrop = c('Probability.rxNaiveBayes.Bad')
, overwrite = TRUE
)## Rows Read: 209838, Total Rows Processed: 209838, Total Chunk Time: 1.197 seconds
# Delete Any Scores and Data Variables into List Model's Scores
fitTrainScores <-
RevoScaleR::rxDataStep(
inData = fitTrainScores
, outFile = tempfile(fileext = '.xdf')
, varsToKeep = c( 'ObClass', grep('Probability.', names(fitTrainScores), value = TRUE),
grep('PredictedLabel.', names(fitTrainScores), value = TRUE) )
, overwrite = TRUE
)## Rows Read: 209838, Total Rows Processed: 209838, Total Chunk Time: 0.260 seconds
# names(fitTrainScores)
# Compute the fit models' ROC curves on Training Set (by Percentiles)
fitRocTrain <-
RevoScaleR::rxRoc(
actualVarName = 'ObClass'
, predVarNames = grep('Probability.', names(fitTrainScores), value = TRUE)
, data = fitTrainScores
, numBreaks = 100 # length(Y[inTrain])
)
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
#
# TESTING SET
#
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
# Add PredictedLabel.rxNaiveBayes & Probability.rxEnsemble.Good into List Model's Scores
fitTestScores <-
RevoScaleR::rxDataStep(
inData = fitTestScores
, outFile = tempfile(fileext = '.xdf')
, transformFunc = ScoringResults.rxSomeModelsFunc
# , varsToDrop = c('Probability.rxNaiveBayes.Bad')
, overwrite = TRUE
)## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.348 seconds
# Delete Any Scores and Data Variables into List Model's Scores
fitTestScores <-
RevoScaleR::rxDataStep(
inData = fitTestScores
, outFile = tempfile(fileext = '.xdf')
, varsToKeep = c( 'ObClass', grep('Probability.', names(fitTestScores), value = TRUE),
grep('PredictedLabel.', names(fitTestScores), value = TRUE) )
, overwrite = TRUE
)## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.030 seconds
# names(fitTestScores)
# Compute the fit models' ROC curves on Testing Set
fitRocTest <-
RevoScaleR::rxRoc(
actualVarName = 'ObClass'
, predVarNames = grep('Probability.', names(fitTestScores), value = TRUE)
, data = fitTestScores
, numBreaks = 100 # length(Y[inTest])
)
Compute and plot an ROC curve using actual and predicted values from a binary classifier system.
# Create a named list of the fit models.
fitList <-
list( rxLogisticRegression = rxLogisticRegressionFit
, rxFastTrees = rxFastTreesFit
, rxFastForest = rxFastForestFit
, rxNeuralNet = rxNeuralNetFit
# , rxOneClassSvm = rxOneClassSvmFit
, rxNaiveBayes = rxNaiveBayesFit
, rxEnsemble = rxEnsembleFit
)
algolist <- c( 'rxLogisticRegression'
, 'rxFastTrees'
, 'rxFastForest'
, 'rxNeuralNet'
# , 'rxOneClassSvm'
, 'rxNaiveBayes'
, 'rxEnsemble'
) %>%
.[order(.)]
# Create a named list of models' Predictions on Training Data.
predList <- RevoScaleR::rxImport(inData = fitTrainScores) %>%
dplyr::select(starts_with('PredictedLabel.')) %>%
as.list()## Rows Read: 209838, Total Rows Processed: 209838, Total Chunk Time: 0.278 seconds
names(predList) <- algolist
# Confusion Matrix evaluation results
cm_metrics <-lapply(predList
, caret::confusionMatrix
, reference = Y[inTrain]
, positive = 'Bad'
, mode = 'everything'
)
# Kappa
kap_metrics <-
lapply(cm_metrics, `[[`, 'overall') %>%
lapply(`[`, 2) %>%
unlist() %>%
as.vector()
# Sensitivity (Recall)
rec_metrics <-
lapply(cm_metrics, `[[`, 'byClass') %>%
lapply(`[`, 1) %>%
unlist() %>%
as.vector()
# Specificity
spe_metrics <-
lapply(cm_metrics, `[[`, 'byClass') %>%
lapply(`[`, 2) %>%
unlist() %>%
as.vector()
# Compute the fit models' AUCs.
fitAuc <- RevoScaleR::rxAuc(fitRocTrain)
names(fitAuc) <- substring(names(fitAuc), nchar('Probability.') + 1)
# coerce to data frame
result.df <- # Gini Sens Spec Kappa
data.frame( Name = algolist
, Gini1 = (fitAuc - 0.5) * 2
, Sens1 = formattable::digits(rec_metrics, 4)
, Spec1 = formattable::digits(spe_metrics, 4)
, Kappa1 = formattable::digits(kap_metrics, 4)
)
# Create a named list of models' Prediction Labels on Testing Data.
predList <- RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.')) %>%
select(sort(names(.))) %>%
setNames(algolist) %>%
  as.list()
## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.028 seconds
# Create a named list of models' Probabilities on Testing Data.
preds <- RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('Probability.')) %>%
select(sort(names(.))) %>%
  setNames(algolist)
## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.029 seconds
# Confusion Matrix evaluation results
cm_metrics <- lapply(predList
, caret::confusionMatrix
, reference = Y[inTest]
, positive = 'Bad'
, mode = 'everything'
)
# Kappa
kap_metrics <-
lapply(cm_metrics, `[[`, 'overall') %>%
lapply(`[`, 2) %>%
unlist() %>%
as.vector()
# Sensitivity (Recall)
rec_metrics <-
lapply(cm_metrics, `[[`, 'byClass') %>%
lapply(`[`, 1) %>%
unlist() %>%
as.vector()
# Specificity
spe_metrics <-
lapply(cm_metrics, `[[`, 'byClass') %>%
lapply(`[`, 2) %>%
unlist() %>%
as.vector()
# Compute the fitted models' AUCs on the Testing Set.
fitAuc <- RevoScaleR::rxAuc(fitRocTest)
names(fitAuc) <- substring(names(fitAuc), nchar('Probability.') + 1)
# Plot the ROC curves and report their AUCs.
plot( fitRocTest
    , title = 'ROC Curve for Models on Testing Set' )
result.df %<>%
dplyr::mutate(
Gini2 = formattable::digits((fitAuc - 0.5) * 2, 4)
, Sens2 = formattable::digits(rec_metrics, 4)
, Spec2 = formattable::digits(spe_metrics, 4)
, Kappa2 = formattable::digits(kap_metrics, 4)
, Note = '(empty)'
) # %>%
# as_tibble %>%
# remove_rownames %>%
# dplyr::arrange(-Gini2) %T>%
  # print
Finally, we evaluate and compare the models built above from several angles.
# # Preallocate data types
# Name <- character() # Name of Method
# Gini1 <- numeric() # Gini's Coefficient of Training
# Sens1 <- numeric() # Sensitivity of Training
# Spec1 <- numeric() # Specificity of Training
# Kappa1 <- numeric() # Cohen's Kappa of Training
# Gini2 <- numeric() # Gini's Coefficient of Testing
# Sens2 <- numeric() # Sensitivity of Testing
# Spec2 <- numeric() # Specificity of Testing
# Kappa2 <- numeric() # Cohen's Kappa of Testing
# Note <- character() # Long Model Name
library('kableExtra') # Construct Complex Table with 'kable' and Pipe Syntax
library('formattable') # Create 'Formattable' Data Structures
##
## Attaching package: 'formattable'
## The following object is masked from 'package:plotly':
##
## style
# Classification results on TRAINING and TESTING Sets from `Microsoft Machine Learning` models
result.df %>%
mutate(Name = cell_spec(Name, 'html',
color = ifelse(Gini2 >= arrange(result.df, desc(Gini2))[3, 'Gini2'] %>%
as.numeric, 'red', 'black')),
Gini1, Sens1, Spec1, Kappa1,
Gini2 = proportion_bar('lightgreen')(Gini2),
Sens2 = cell_spec(Sens2, 'html',
color = ifelse(Sens2 >= arrange(result.df, desc(Sens2))[1, 'Sens2'] %>%
as.numeric, 'brown', 'black')),
Spec2 = cell_spec(Spec2, 'html',
color = ifelse(Spec2 >= arrange(result.df, desc(Spec2))[1, 'Spec2'] %>%
as.numeric, 'darkviolet', 'black')),
Kappa2,
Note = Note) %>%
knitr::kable(format = 'html', digits = 4, longtable = TRUE, booktabs = TRUE, escape = F,
col.names = c('Methods', rep(c('Gini', 'Sens', 'Spec', 'Kappa'), times = 2), 'Notes'),
caption = 'Classification results on TRAINING and TESTING Sets from `Microsoft Machine Learning` models') %>%
kableExtra::kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive',
full_width = FALSE)) %>%
kableExtra::column_spec(9, width = '3cm') %>%
  kableExtra::add_header_above(c(' ', 'Training Set' = 4, 'Testing Set' = 4, ' ')) # %>%

| Methods | Gini (Train) | Sens (Train) | Spec (Train) | Kappa (Train) | Gini (Test) | Sens (Test) | Spec (Test) | Kappa (Test) | Notes |
|---|---|---|---|---|---|---|---|---|---|
| rxEnsemble | 0.4973 | 0.6656 | 0.6374 | 0.2241 | 0.3214 | 0.5926 | 0.6375 | 0.1735 | (empty) |
| rxFastForest | 0.3615 | 0.6897 | 0.6506 | 0.2534 | 0.2751 | 0.6005 | 0.5976 | 0.1434 | (empty) |
| rxFastTrees | 0.4641 | 0.6604 | 0.6079 | 0.1933 | 0.3229 | 0.5973 | 0.6342 | 0.1737 | (empty) |
| rxLogisticRegression | 0.4145 | 0.6383 | 0.6659 | 0.2332 | 0.3175 | 0.6013 | 0.6267 | 0.1697 | (empty) |
| rxNaiveBayes | 0.3701 | 0.7040 | 0.6634 | 0.2759 | 0.3063 | 0.5964 | 0.6275 | 0.1669 | (empty) |
| rxNeuralNet | 0.4192 | 0.6361 | 0.6331 | 0.1999 | 0.3062 | 0.5702 | 0.6512 | 0.1703 | (empty) |
# kableExtra::group_rows(index = c('Generalized Linear Models' = 2,
# 'Decision Trees & Random Forests' = 4))
# NameOfTheBestModel = 'rxLogisticRegression'
NameOfTheBestModel <- arrange(result.df, desc(Gini2))[1, 'Name'] %>% as.character
NameOfTheSecondModel <- arrange(result.df, desc(Gini2))[2, 'Name'] %>% as.character
NameOfTheThirdModel <- arrange(result.df, desc(Gini2))[3, 'Name'] %>% as.character
# Save the best model (a `MicrosoftML` fit) for future use
TheBestModel <- get(paste0(NameOfTheBestModel, 'Fit'))
# saveRDS(TheBestModel, file = 'TelCo0_model.RData')
The table shows that, comparing the predictions on the test dataset with the observed classes, the top three models by quality were:
• the classifier trained by rxFastTrees - quality by ‘Gini’ 0.3229
• the classifier trained by rxEnsemble - quality by ‘Gini’ 0.3214
• the classifier trained by rxLogisticRegression - quality by ‘Gini’ 0.3175.
We therefore select the rxFastTrees classifier as the most suitable model on this test dataset.
Machine Learning models are widely used and have many applications in classification and regression tasks. Thanks to increasing computational power, new data sources and new methods, ML models keep growing in complexity. Models created with techniques like boosting, bagging or neural networks are true black boxes: it is hard to trace the link between input variables and model outcomes. They are used because of their high performance, but their lack of interpretability remains one of their weakest points.
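The interpretability tools loaded below (iml, DALEX, breakDown) address exactly this weakness. The feature-importance calculations that follow are based on permutation importance; a minimal sketch of the idea (with an illustrative placeholder `loss_fn`, not the iml implementation):
# Permutation feature importance: permute one column at a time and measure
# how much the loss worsens; `loss_fn(X, y)` is an illustrative placeholder
# that scores the fitted model's predictions on X against y.
permutation_importance <- function(X, y, loss_fn, seed = 42) {
  set.seed(seed)
  base_loss <- loss_fn(X, y)
  sapply(names(X), function(v) {
    X_perm <- X
    X_perm[[v]] <- sample(X_perm[[v]]) # break the link between feature v and y
    loss_fn(X_perm, y) - base_loss     # importance = increase in loss
  })
}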
# See https://github.com/pbiecek/DALEX, https://github.com/MI2DataLab/modelDown
# See www.r-bloggers.com/dalex-and-h2o-machine-learning-model-interpretability-and-feature-explanation/
library('iml') # Interpretable Machine Learning
library('DALEX') # Descriptive mAchine Learning EXplanations
library('breakDown') # Model Agnostic Explainers for Individual Predictions
Prob_fun <- function(object, newdata){ # returns the probability of the 'Good' class
# predict(object, newdata=newdata, type = 'prob')[, 'Good']
rxPredict(modelObject = object, data = newdata, blocksPerRead = 200000,
reportProgress = 0, verbose = 0) %>%
dplyr::select(starts_with('Probability.')) %>%
pull
}
loss_gini <- function(observed, predicted) {
hmeasure::HMeasure(true.class = observed, scores = predicted)[['metrics']] %>% .[1, 'Gini']
}
Predict.Fun <- function(model, newdata){ # same as Prob_fun above; kept for iml::Predictor below
rxPredict(modelObject = model, data = newdata, blocksPerRead = 200000,
reportProgress = 0, verbose = 0) %>%
dplyr::select(starts_with('Probability.')) %>%
pull
}
loss_Gini <- function(actual, predicted) {
hmeasure::HMeasure(true.class = actual, scores = predicted)[['metrics']] %>% .[1, 'Gini']
}
Scorecard <- 'Scorecard'
# Create Model Explainer from 'DALEX' package
explainer_classif_1 <-
DALEX::explain(TheBestModel, label = NameOfTheBestModel,
data = X[inTest, ], y = Y[inTest] %>% as.integer()-1, predict_function = Prob_fun)
explainer_classif_2 <-
DALEX::explain(TheBestModel, label = NameOfTheSecondModel,
data = X[inTest, ], y = Y[inTest] %>% as.integer()-1, predict_function = Prob_fun)
explainer_classif_3 <-
DALEX::explain(get(paste0(NameOfTheThirdModel, 'Fit')), label = NameOfTheThirdModel,
data = X[inTest, ], y = Y[inTest] %>% as.integer()-1, predict_function = Prob_fun)
if (!exists('wb')) {
# Create MS Excel File for Output
openxlsx::addWorksheet(wb <- openxlsx::createWorkbook(), sheetName = 'IV Table',
gridLines = FALSE, tabColour = 'olivedrab')
openxlsx::addWorksheet(wb, sheetName = 'Scorecard', gridLines = FALSE, tabColour = 'brown')
}
if (ncol(X) <= Max_Vars) {
# [Interpretable Machine Learning: Feature Importance](https://christophm.github.io/interpretable-ml-book)
system.time({
### Setup parallel processing - 4 times faster
library('doParallel'); cl <- makeCluster(detectCores()); registerDoParallel(cl)
model1 <- iml::Predictor$new(model = TheBestModel, data= X[inTest, ],
y = Y[inTest] %>% as.integer() - 1, class = 'Good', predict.fun = Predict.Fun, type = 'prob' )
# Feature Importance
imp1 <- iml::FeatureImp$new(model1, loss = loss_Gini, compare = 'difference', parallel = TRUE)
importance.df <- imp1$results %>%
dplyr::mutate(Variable = gsub('_fct', '', feature) %>% as.factor,
Importance = -importance) %>%
dplyr::select(Variable, Importance)
p1 <- imp1$plot() + ggplot2::ggtitle(paste(NameOfTheBestModel, 'by Gini coefficient'))
print(p1)
# model2 <- iml::Predictor$new(model = get(paste0(NameOfTheSecondModel, 'Fit')), data= X[inTest, ],
# y = Y[inTest] %>% as.integer()-1, class = 'Good', predict.fun = Predict.Fun, type = 'prob' )
# # Feature Importance
# imp2 <- iml::FeatureImp$new(model2, loss = loss_Gini, compare = 'difference', parallel = TRUE)
# imp2$plot() + ggplot2::ggtitle(paste(NameOfTheSecondModel, 'by Gini coefficient'))
#
# model3 <- iml::Predictor$new(model = get(paste0(NameOfTheThirdModel, 'Fit')), data= X[inTest, ],
# y = Y[inTest] %>% as.integer()-1, class = 'Good', predict.fun = Predict.Fun, type = 'prob' )
# # Feature Importance
# imp3 <- iml::FeatureImp$new(model3, loss = loss_Gini, compare = 'difference', parallel = TRUE)
# imp3$plot() + ggplot2::ggtitle(paste(NameOfTheThirdModel, 'by Gini coefficient'))
TheImportancePredictor <- imp1$results[nrow(imp1$results), 'feature']
# Remember to stop the cluster in the end again
stopCluster(cl)
})
# # Choice of The Importance Predictor - www.machinelearningplus.com/machine-learning/feature-selection/
# TheImportancePredictor <-
# imp1$results %>% arrange(importance) %>% dplyr::select(feature) %>% pull %>% .[1]
#
# # Effect of features on the model predictions
# ale = iml::FeatureEffect$new(model, feature = TheImportancePredictor, method = 'ale')
# ale$plot()
} else {
system.time({
# # Model Performance from 'DALEX' package
# plot(DALEX::model_performance(explainer_classif_1)
# , DALEX::model_performance(explainer_classif_2)
# , DALEX::model_performance(explainer_classif_3)
# , geom = 'boxplot')
# Variable importance
    importance.df <- DALEX::variable_importance(explainer_classif_1, type = 'raw', loss_function = loss_gini) %>%
      dplyr::mutate(Importance = -(dropout_loss - .[variable == '_full_model_', 'dropout_loss'])) %>%
dplyr::filter(!str_detect(variable, 'full_model|baseline')) %>%
dplyr::mutate(Variable = gsub('_fct', '', variable)) %>%
dplyr::select(Variable, Importance)
p1 <- DALEX::variable_importance(explainer_classif_1, type = 'raw', loss_function = loss_gini) %>%
plot() + labs(title = 'Variable Importance', caption = 'by Gini Coefficient') + theme_grey()
print(p1)
# DALEX::variable_importance(explainer_classif_2, type = 'raw', loss_function = loss_gini) %>%
# plot() + labs(subtitle = 'Variable Importance', caption = 'by Gini Coefficient')
# DALEX::variable_importance(explainer_classif_3, type = 'raw', loss_function = loss_gini) %>%
# plot() + labs(subtitle = 'Variable Importance', caption = 'by Gini Coefficient')
TheImportancePredictor <-
DALEX::variable_importance(explainer_classif_1, type = 'raw')[2, 'variable'] %>%
as.character
})
} # End if ncol(X) <= Max_Vars
##    user  system elapsed
##    0.81    0.39  116.79
# Output Chart: EXplanations of Model by Variable Importance
openxlsx::insertPlot(wb, sheet = 'IV Table', xy = c(1, nrow(binning.df)+5), width = 10 * (1 + sqrt(5)) / 2,
height = 10, units = 'cm')
if (TheBestModel$Description != 'LogisticRegression')
openxlsx::insertPlot(wb, sheet = Scorecard, xy = c(1,2), width = 10*(1 + sqrt(5))/2, height = 10, units = 'cm')
# # compute Partial Dependence Plots for a given variable --> uses the pdp package
# plot(DALEX::variable_response(explainer_classif_1, variable = TheImportancePredictor, type = 'factor')) +
# ggtitle('Marginal Response for a Single Variable by Gini Coefficient')
# plot(DALEX::variable_response(explainer_classif_2, variable = TheImportancePredictor, type = 'factor')) +
# ggtitle('Marginal Response for a Single Variable by Gini Coefficient')
# plot(DALEX::variable_response(explainer_classif_3, variable = TheImportancePredictor, type = 'factor')) +
# ggtitle('Marginal Response for a Single Variable by Gini Coefficient')
if (ncol(X) <= Max_Vars) {
# Explanations for a Single Prediction for Observations
# True Good Observation
new_obj <- data.frame(Probs = preds[[NameOfTheBestModel]], Obs = Y[inTest],
Equal = ifelse(preds[[NameOfTheBestModel]] < 0.5, 0, 1) == Y[inTest] %>% as.integer() - 1) %>%
rowid_to_column %>% dplyr::arrange(-Probs) %>%
dplyr::filter(Probs <= 1, Obs == 'Good', Equal == TRUE) %>%
dplyr::select(rowid) %>% pull %>% .[1]
  # Recover the names of the selected scale variables
pdp1 <- DALEX::prediction_breakdown(explainer_classif_1, observation = X[inTest, ][new_obj, ])
pdp1$variable <- pdp1$variable_value
pdp1[1, 'variable'] <- '(Intercept)'; pdp1[nrow(pdp1), 'variable'] <- 'final_prognosis'
# pdp2 <- DALEX::prediction_breakdown(explainer_classif_2, observation = X[inTest, ][new_obj, ])
# pdp2$variable <- pdp2$variable_value
# pdp2[1, 'variable'] <- '(Intercept)'; pdp2[nrow(pdp2), 'variable'] <- 'final_prognosis'
#
# pdp3 <- DALEX::prediction_breakdown(explainer_classif_3, observation = X[inTest, ][new_obj, ])
# pdp3$variable <- pdp3$variable_value
# pdp3[1, 'variable'] <- '(Intercept)'; pdp3[nrow(pdp3), 'variable'] <- 'final_prognosis'
p2 <- plot(pdp1, vcolors = c('-1' = 'tomato3', '0' = '#f5f5f5', '1' = 'palegreen3', 'X' = '#00BFC4')) +
# theme(strip.background = element_rect(fill = 'gray45')) +
# theme(strip.text = element_text(colour = 'white')) +
theme_grey() + theme(legend.position = 'none', panel.border = element_blank()) +
labs(title = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
# plot(pdp2) +
# labs(subtitle = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
# plot(pdp3) +
# labs(subtitle = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
print(p2)
  # Output Chart: Explanations of the Model by Predictors of a True Good Observation
openxlsx::insertPlot(wb, sheet = 'IV Table', xy = c(7, nrow(binning.df) + 5), width = 10 * (1 + sqrt(5)) / 2,
height = 10, units = 'cm')
if (TheBestModel$Description != 'LogisticRegression')
openxlsx::insertPlot(wb, sheet = Scorecard, xy=c(1, 21), width = 10*(1 + sqrt(5))/2, height=10, units = 'cm')
# True Bad Observation
  # Choose a different level of the importance predictor for the True Bad observation than for the True Good one
AnotherLevel4TrueBad <- table(X[inTest, TheImportancePredictor], Y[inTest])[-which.max(table(X[inTest,
TheImportancePredictor], Y[inTest])[, 'Good']), 'Bad'] %>%
which.max %>% names
if (is.null(AnotherLevel4TrueBad)) {
AnotherLevel4TrueBad <- table(X[inTest, TheImportancePredictor], Y[inTest]) %>%
as.data.frame(stringsAsFactors = FALSE) %>%
setNames(c('Levels', 'Classes', 'Freq')) %>%
dplyr::filter(Classes == 'Bad', Freq > 0) %>%
dplyr::select(Levels) %>% pull
}
new_obj <- data.frame(Probs = preds[[NameOfTheBestModel]], Obs = Y[inTest],
Equal = ifelse(preds[[NameOfTheBestModel]] < 0.5, 0, 1) == Y[inTest] %>% as.integer() - 1,
TheImportancePredictor = X[inTest, TheImportancePredictor]) %>%
rowid_to_column %>% dplyr::arrange(-Probs) %>%
dplyr::filter(Probs < 0.4, Obs == 'Bad', Equal == TRUE,
TheImportancePredictor == AnotherLevel4TrueBad) %>%
dplyr::select(rowid) %>% pull %>% .[1]
pdp1 <- DALEX::prediction_breakdown(explainer_classif_1, observation = X[inTest, ][new_obj, ])
pdp1$variable <- pdp1$variable_value
pdp1[1, 'variable'] <- '(Intercept)'; pdp1[nrow(pdp1), 'variable'] <- 'final_prognosis'
# pdp2 <- DALEX::prediction_breakdown(explainer_classif_2, observation = X[inTest, ][new_obj, ])
# pdp2$variable <- pdp2$variable_value
# pdp2[1, 'variable'] <- '(Intercept)'; pdp2[nrow(pdp2), 'variable'] <- 'final_prognosis'
#
# pdp3 <- DALEX::prediction_breakdown(explainer_classif_3, observation = X[inTest, ][new_obj, ])
# pdp3$variable <- pdp3$variable_value
# pdp3[1, 'variable'] <- '(Intercept)'; pdp3[nrow(pdp3), 'variable'] <- 'final_prognosis'
p3 <- plot(pdp1, vcolors = c('-1' = 'tomato3', '0' = '#f5f5f5', '1' = 'palegreen3', 'X' = '#F8766D')) +
# theme(strip.background = element_rect(fill = 'gray45')) +
# theme(strip.text = element_text(colour = 'white')) +
theme_grey() + theme(legend.position = 'none', panel.border = element_blank()) +
labs(title = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
# plot(pdp2) +
# labs(subtitle = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
# plot(pdp3) +
# labs(subtitle = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
print(p3)
  # Output Chart: Explanations of the Model by Predictors of a True Bad Observation
openxlsx::insertPlot(wb, sheet = 'IV Table', xy = c(15, nrow(binning.df) + 5), width = 10 * (1+sqrt(5)) / 2,
height = 10, units = 'cm')
if (TheBestModel$Description != 'LogisticRegression')
    openxlsx::insertPlot(wb, sheet = Scorecard, xy = c(1, 40), width = 10*(1 + sqrt(5))/2, height = 10, units = 'cm')
}
remove(model1, model2, model3, # imp1,
       imp2, imp3, explainer_classif_1, explainer_classif_2, explainer_classif_3, p, new_obj, pdp1, pdp2, pdp3)
breakDown is a model-agnostic tool for decomposing predictions from black boxes. The Break Down Table shows the contribution of every variable to a final prediction; the Break Down Plot presents these contributions in a concise graphical way. The package works for binary classifiers and for general regression models.
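For an additive model such as a logit, these contributions are simply the per-variable terms of the linear predictor (on the link scale); a minimal illustration with hypothetical coefficients (breakDown/DALEX generalise this step-wise decomposition to arbitrary black boxes):
# Break-down of a linear predictor into per-variable contributions
breakdown_linear <- function(coefs, x) { # coefs: named vector, first element is the intercept
  contrib <- coefs[-1] * x
  c('(Intercept)' = unname(coefs[1]), contrib,
    final_prognosis = unname(coefs[1] + sum(contrib)))
}
breakdown_linear(c(`(Intercept)` = -1.2, a = 0.8, b = -0.5),
                 c(a = 1, b = 2)) #=> (Intercept) -1.2, a 0.8, b -1.0, final_prognosis -1.4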
# # Create MS Excel File for Output
# openxlsx::addWorksheet(wb <- openxlsx::createWorkbook(), sheetName = 'Scorecard', gridLines = FALSE,
# tabColour = 'brown')
# preds - Probabilities of `Good` Class
# Generate an ROC Curve for the Best Model
ROCCurveShow(Preds = preds[NameOfTheBestModel], # `Good` Class Probabilities - numeric vector
Obsers = Y[inTest], # Observed Classes (Reference) - logical vector
NameOfModel = NameOfTheBestModel, # Name Of The Model
             wb = wb) # Workbook from `openxlsx` package
##
## rxFastTrees - Estimate a Area Under the ROC Curve (AUC) on Testing Set = 66.15%
# Generate a KS Curve for the Best Model
KSCurveShow(Preds = preds[NameOfTheBestModel], # `Good` Class Probabilities - numeric vector
Obsers = Y[inTest], # Observed Classes (Reference) - logical vector
NameOfModel = NameOfTheBestModel, # Name Of The Model
            wb = wb) # Workbook from `openxlsx` package
##
## rxFastTrees - Estimate a Kolmogorov-Smirnov Statistic on Testing Set = 0.2372
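The Kolmogorov-Smirnov statistic reported above is the maximum vertical gap between the cumulative score distributions of the Good and Bad classes; a minimal sketch (illustrative, not the KSCurveShow implementation):
# KS = max |F_Good(s) - F_Bad(s)| over score thresholds s
ks_stat <- function(scores, classes) {
  s <- sort(unique(scores))
  max(abs(ecdf(scores[classes == 'Good'])(s) - ecdf(scores[classes == 'Bad'])(s)))
}
# e.g. ks_stat(preds[[NameOfTheBestModel]], Y[inTest])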
# Generate a Distribution's Curve of Scores by Good & Bad Class for the Best Model as GLM
ScoresCurveShow(Preds = preds[NameOfTheBestModel], # `Good` Class Probabilities - numeric vector
Obsers = Y[inTest], # Observed Classes (Reference) - logical vector
NameOfModel = NameOfTheBestModel, # Name Of The Model
                wb = wb) # Workbook from `openxlsx` package
##
## rxFastTrees - Estimate a Kullback-Leibler’s Divergence Statistic on Testing Set = 0.0866
Finally, let’s build a Scorecard on the Test dataset; a full scorecard table is produced only when the selected model is a Logit-Model.
I = 50   # Score increment (Points to Double the Odds, PDO)
c_ = 500 # Offset of scores (margin between the Good & Bad classes)
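# Sanity check of the points-to-double-the-odds scaling (standard scorecard
# formula: Score = c_ + (I / log(2)) * log(odds)): odds of 1:1, 2:1 and 4:1
# map to 500, 550 and 600 points respectively.
# round(c_ + (I / log(2)) * log(c(1, 2, 4))) #=> 500 550 600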
if (TheBestModel$Description == 'LogisticRegression') {
# Create Scorecard for GLM
Attributes <- vector()
Levels <- vector()
Predictors <- vector()
Totals <- vector()
ObBads <- vector()
ObGoods <- vector()
PrGoods <- vector()
PrClass <- rxImport(inData= fitTestScores, varsToKeep = 'PredictedLabel.rxLogisticRegression') %>%
pull
Chi_squared_test <- vector()
for (i in 1:length(rxLogisticRegressionFit$params$formulaVars %>% .[-1])) {
Attributes0 <-
rxLogisticRegressionFit$coefficients %>% names() %>%
.[ grep(pattern = rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i],
x = rxLogisticRegressionFit$coefficients %>% names()) ] %>%
sort %>%
dplyr::as_tibble() %>%
tidyr::separate(value, into = c('empty', 'Attributes'),
sep=paste0(rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i], '.')) %>%
pull(Attributes)
Attributes <- c(Attributes, Attributes0)
Levels0 <- rxLogisticRegressionFit$coefficients %>% names() %>%
.[ grep(pattern = rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i],
x = rxLogisticRegressionFit$coefficients %>% names()) ] %>% sort
Levels <- c(Levels, Levels0)
Predictors <- c(Predictors, rep(x = rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i],
times = length(Levels0)))
Totals <- c(Totals, table(X[inTest, rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ]) %>% as.vector)
ObBads <- c(ObBads, table(X[inTest, rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ], Y[inTest]) %>%
.[, 'Bad'] %>% as.vector)
ObGoods <- c(ObGoods, table(X[inTest, rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ], Y[inTest]) %>%
.[, 'Good'] %>% as.vector)
PrGoods <- c(PrGoods, table(X[inTest, rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ], PrClass) %>%
.[, 'Good'] %>% as.vector)
# Chi squared test (contingency table) for each row in a table by each Predictors
Chi_squared_test <- c( Chi_squared_test, apply(table(X[inTest,
rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ], Y[inTest]), 1,
function(x) chisq.test(matrix(x, ncol = levels(Y) %>% length))$p.value) )
}
# Part of Intercept's Score / Number of Features
CommonScore = (c_ + rxLogisticRegressionFit$coefficients[1] * I / log(2)) / length(rxLogisticRegressionFit$params$formulaVars %>% .[-1])
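  # Note: splitting the intercept's points equally across the predictors keeps each
  # attribute's score self-contained, while an applicant's total score still equals
  # c_ + (I / log(2)) * (intercept + sum of the applicable coefficients).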
# Create a Data.Frame by Predictors with Coefficients & Scores
Scorecard.df <-
# dplyr::left_join( # faster join as.integer, then as.factor and finally as.character
# x = data.frame(Predictors, Attributes, Levels, Totals, ObBads, ObGoods, PrGoods, Chi_squared_test,
# stringsAsFactors = FALSE),
# y = data.frame(Names = rxLogisticRegressionFit$coefficients %>% attr('names'),
# Coefficients = rxLogisticRegressionFit$coefficients,
# stringsAsFactors = FALSE),
# by = c('Levels' = 'Names')
# ) %>%
dplyr::left_join( # faster join as.integer, then as.factor and finally as.character
x = data.frame(Predictors, Attributes, Levels,
stringsAsFactors = FALSE),
y = data.frame(Levels = paste0( attr(Chi_squared_test, 'names') %>%
str_extract_all(., boundary('word')) %>% transpose() %>% .[2] %>% unlist,
'_fct.', attr(Chi_squared_test, 'names') ),
Totals, ObBads, ObGoods, PrGoods, Chi_squared_test,
stringsAsFactors = FALSE),
by = c('Levels' = 'Levels') ) %>%
# Attaching tables with different lengths due to the small gradations of some predictors
dplyr::left_join( # faster join as.integer, then as.factor and finally as.character
x = .,
y = data.frame(Names = rxLogisticRegressionFit$coefficients %>% attr('names'),
Coefficients = rxLogisticRegressionFit$coefficients,
stringsAsFactors = FALSE),
by = c('Levels' = 'Names') ) %>%
tidyr::replace_na(list(Coefficients = 0)) %>%
dplyr::mutate(
Total = Totals, Bad = ObBads, Good = ObGoods,
`Share of Total` = Totals / length(Y[inTest]),
`Chi Squared` = Chi_squared_test,
`Pred Good` = PrGoods,
`Sensitivity by Levels` = PrGoods / Totals,
Scores = round(Coefficients * I / log(2) + CommonScore, 0)) %>%
dplyr::select(-Levels, -Chi_squared_test, -Totals, -ObBads, -ObGoods, -PrGoods)
ListOfPredictors <- vector()
for (i in 1:length(rxLogisticRegressionFit$params$formulaVars %>% .[-1])) {
ListOfPredictors <- c( ListOfPredictors, length(rxLogisticRegressionFit$coefficients %>% names() %>%
      .[ grep(pattern = rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i],
x = rxLogisticRegressionFit$coefficients %>% names()) ]) )
attr(ListOfPredictors, 'names')[i] <- rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i]
}
  header.df <- data.frame(
    A = c('Scorecard by Logit-Model and Result of Classification on TESTING Set', 'Logit Model'),
    B = c(NA, NA), C = c(NA, NA),
    D = c(NA, 'Observed Distribution (Reference)'), E = c(NA, NA), F = c(NA, NA), G = c(NA, NA), H = c(NA, NA),
    I = c(NA, 'Prediction'), stringsAsFactors = FALSE)
ListOfHeaders <- c(2, 5, 3)
attr(ListOfHeaders, 'names') <- c(header.df[2, 'A'], header.df[2, 'D'], header.df[2, 'I'])
# Show a Data.Frame by Predictors with Coefficients & Scores
Scorecard.df %>%
mutate( # Predictors = cell_spec(Predictors, bold = TRUE),
Attributes = Attributes,
Coefficients = cell_spec(Coefficients, 'html', color=ifelse(Scores >= arrange(., desc(Scores))[nrow(.)*.25,
'Scores']%>% as.numeric, 'darkgreen', 'black')),
Total, Bad, Good,
`Share of Total` = formattable::percent(`Share of Total`, digits = 1),
`Chi Squared` = cell_spec(formattable(`Chi Squared`, format = "f", digits = 4), 'html',
color = ifelse(`Chi Squared` >= 0.05, 'orangered', 'darkgray')),
`Pred Good`,
`Sensitivity by Levels` = proportion_bar('khaki')(round(`Sensitivity by Levels`, 2)),
Scores = proportion_bar('chartreuse')(Scores) ) %>% dplyr::select(-Predictors) %>%
knitr::kable(format = 'html', digits = 4, longtable = TRUE, booktabs = TRUE, escape = F,
# col.names = c('Levels of Predictors', 'Coefficients', 'Total', 'Bad', 'Good', 'Share of Total',
# 'Chi²', 'Predicted Good', 'Sensitivity by Levels', 'Scores'),
caption = header.df[1, 'A']) %>%
kableExtra::kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive',
full_width = FALSE)) %>%
# kableExtra::column_spec(10, width = '5cm') %>%
kableExtra::add_header_above(ListOfHeaders) # %>%
#kableExtra::group_rows(index = ListOfPredictors)
# Export a Data.Frame with Coefficients & Scores into MS Excel
N <- 4:(nrow(Scorecard.df) + 3)
writeDataTable(wb, sheet = Scorecard, x = Scorecard.df, tableStyle = 'TableStyleMedium2', startCol = 'A',
startRow = 3, tableName = 'Scorecard', firstColumn = TRUE, lastColumn = TRUE, bandedRows = TRUE)
# Set Columns widths
setColWidths(wb, sheet=Scorecard, cols=1:ncol(Scorecard.df), widths = c(14, 15, 11, 7, 7, 7, 8, 9, 7, 11, 10))
mergeCells(wb, sheet = Scorecard, cols = 1:3, rows = 2)
mergeCells(wb, sheet = Scorecard, cols = 4:8, rows = 2)
mergeCells(wb, sheet = Scorecard, cols = 9:11, rows = 2)
# # Set Row heights
# setRowHeights(wb, sheet = Scorecard, rows = 1, heights = 45)
# Set Styles & Conditional Formattings in Columns
addStyle(wb, sheet = Scorecard, style = createStyle(wrapText = TRUE, halign = 'center', valign = 'center'),
cols = 1:ncol(Scorecard.df), rows = 3)
conditionalFormatting(wb, sheet = Scorecard, cols = 3, rows = N, type = "between", # Coefficients
rule = c(quantile(Scorecard.df$Coefficients)['75%'], max(Scorecard.df$Coefficients)),
style = createStyle(fontColour = 'darkgreen'))
addStyle(wb, sheet=Scorecard, cols=3, rows=N, style = createStyle(border = 'right', borderColour = '#4F81BC'))
addStyle(wb, sheet = Scorecard, createStyle(numFmt = 'comma'), cols = 4:6, rows = N, gridExpand = TRUE)
addStyle(wb, sheet = Scorecard, cols = 7, rows = N, style = createStyle(numFmt = '0%'))
addStyle(wb, sheet = Scorecard, cols = 8, rows = N, style = createStyle(border = 'right',
borderColour = '#4F81BC', fontColour = 'darkgrey', numFmt = paste0('0', options()$OutDec, '0000')))
conditionalFormatting(wb, sheet = Scorecard, cols = 8, rows = N, type = "between", rule = c(0.05, 1)) # Chi²
addStyle(wb, sheet = Scorecard, cols = 9, rows = N, style = createStyle(numFmt = 'COMMA'))
addStyle(wb, sheet = Scorecard, cols = 10, rows = N,
style = createStyle(numFmt = paste0('0', options()$OutDec, '0000')))
conditionalFormatting(wb, sheet = Scorecard, cols = ncol(Scorecard.df) - 1, rows = N,
style = c('red', 'khaki'), type = 'databar')
conditionalFormatting(wb, sheet = Scorecard, cols = ncol(Scorecard.df), rows = N,
style = c('red', 'chartreuse'), type = 'databar')
writeData(wb, sheet = Scorecard, header.df, colNames = FALSE, rowNames = FALSE, startCol = 'A', startRow = 1)
addStyle(wb, sheet = Scorecard, cols=1, rows = 1, style = createStyle(fontSize = 16, textDecoration = 'bold'))
addStyle(wb, sheet = Scorecard, cols = 1:ncol(Scorecard.df), rows = 2, style = createStyle(wrapText = TRUE,
halign = 'center', valign = 'center', fontColour = 'white', fgFill = '#4F81BC', textDecoration = 'bold'))
addStyle(wb, sheet = Scorecard, cols=3, rows=3, style = createStyle(border = 'right', borderColour = 'white'))
addStyle(wb, sheet = Scorecard, cols=8, rows=3, style = createStyle(border = 'right', borderColour = 'white'))
remove(Attributes, Attributes0, Levels, Levels0, Predictors, Totals, ObBads, ObGoods, PrClass, PrGoods,
Chi_squared_test, ListOfPredictors, ListOfHeaders, header.df, N)
} else { # Not Logit-Models
writeData(wb, sheet = Scorecard, data.frame(A = c(paste0('Scorecard by Model `', NameOfTheBestModel ,
'` and Result of Classification on TESTING Set'))),
colNames = FALSE, rowNames = FALSE, startCol = 'A', startRow = 1)
addStyle(wb, sheet = Scorecard, cols=1, rows = 1, style = createStyle(fontSize = 16, textDecoration = 'bold'))
} # End if == LogisticRegression
openxlsx::renameWorksheet(wb, Scorecard, 'MLS')
openxlsx::writeFormula(wb, sheet = 'IV Table', x = makeHyperlinkString(sheet = 'MLS', row = 1, col = 1,
text = 'Scorecard: MLS'), startCol = 'A', startRow = nrow(binning.df) + 4)
openxlsx::addStyle(wb,sheet = 'IV Table', cols = 1, rows = nrow(binning.df) + 4,
style = createStyle(fontColour = 'brown', textDecoration = 'bold'))
# Supplement Variable Table with Importance Feature
binning.df %>%
# dplyr::mutate_if(is.factor, as.character) %>%
dplyr::left_join(importance.df, by = c('Variable' = 'Variable')) %>%
openxlsx::writeDataTable(wb, sheet = 'IV Table', x = ., tableStyle = 'TableStyleMedium4', startCol = 'A',
      startRow = 2, tableName = 'IVTable', firstColumn = FALSE, lastColumn = TRUE, bandedRows = TRUE)
## Warning: Column `Variable` joining factors with different levels, coercing
## to character vector
openxlsx::writeComment(wb, sheet = 'IV Table', xy = c(ncol(binning.df) + 1, 2),
comment = openxlsx::createComment(comment = 'Importance Feature by Gini coefficient (MLS)',
height = .6))
# Recovering First Column of Names with Hyperlinks
for (i in 1:nrow(binning.df)) {
## Internal - Text to display
val = binning.df[i, 'Variable']
writeFormula(wb, sheet = 'IV Table', startCol = 'A', startRow = i + 2,
x = makeHyperlinkString(sheet = val, row = 1, col = 1, text = val))
}
# Set Columns widths
openxlsx::setColWidths(wb, sheet = 'IV Table', cols = 1:2, widths = c(32, 12))
openxlsx::setColWidths(wb, sheet = 'IV Table', cols = ncol(binning.df):(ncol(binning.df)+1), widths = c(12, 13))
N <- 3:(nrow(binning.df) + 2)
openxlsx::conditionalFormatting(wb, sheet = 'IV Table', cols = 2, rows = N, type = 'databar',
border = FALSE, style = c('red', 'royalblue'))
openxlsx::conditionalFormatting(wb, sheet = 'IV Table', type = 'databar', cols = ncol(binning.df) + 1,
rows =3:(nrow(binning.df)+2), border = FALSE, style = c('tomato3', 'palegreen3'))
openxlsx::addStyle(wb, sheet = 'IV Table', cols = 2, rows = N,
style = openxlsx::createStyle(border = 'right', borderColour = '#9CB95C'))
openxlsx::addStyle(wb, sheet = 'IV Table', cols = 10, rows = N,
style = openxlsx::createStyle(border = 'right', borderColour = '#9CB95C'))
openxlsx::addStyle(wb, sheet = 'IV Table', cols = ncol(binning.df) + 1, rows = 3:(nrow(binning.df)+2),
style = openxlsx::createStyle(numFmt = paste0('0', options()$OutDec, '0000')))
openxlsx::writeFormula(wb, sheet = 'IV Table', x = paste0('=T("Table of Variables (', LoadingData, ')")'),
startCol = 'A', startRow = 1)
openxlsx::addStyle(wb, sheet = 'IV Table', cols = 1, rows = 1,
style = openxlsx::createStyle(fontSize = 16, textDecoration = 'bold'))
# Open MS Excel
openxlsx::openXL(wb)
remove(p1, p2, p3) # , wb)
Let’s examine the characteristics of the scorecard formed from the selected predictors.
# https://rpubs.com/arifulmondal/216381
Probs <- preds[NameOfTheBestModel] %>% pull
smbinning::smbinning.metrics(
  dataset = cbind(X[inTest, ], Y = as.integer(Y[inTest]) - 1,
                  scores = round((c_ + (I / log(2)) * log(Probs / (1 - Probs))) + 0, 0)),
prediction = 'scores', actualclass = 'Y',
cutoff = c_,
report = 1, plot = 'none' # plot = 'auc' - Plot AUC
)
##
## Overall Performance Metrics
## --------------------------------------------------
## KS : 0.2362 (Unpredictive)
## AUC : 0.6615 (Poor)
##
## Classification Matrix
## --------------------------------------------------
## Cutoff (>=) : 500 (User Defined)
## True Positives (TP) : 11626
## False Positives (FP) : 2049
## False Negatives (FN) : 6636
## True Negatives (TN) : 3005
## Total Positives (P) : 18262
## Total Negatives (N) : 5054
##
## Business/Performance Metrics
## --------------------------------------------------
## %Records>=Cutoff : 0.5865
## Good Rate : 0.8502 (Vs 0.7832 Overall)
## Bad Rate : 0.1498 (Vs 0.2168 Overall)
## Accuracy (ACC) : 0.6275
## Sensitivity (TPR) : 0.6366
## False Neg. Rate (FNR) : 0.3634
## False Pos. Rate (FPR) : 0.4054
## Specificity (TNR) : 0.5946
## Precision (PPV) : 0.8502
## False Discovery Rate : 0.1498
## False Omision Rate : 0.6883
## Inv. Precision (NPV) : 0.3117
##
## Note: 0 rows deleted due to missing data.
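The printed classification metrics can be cross-checked directly from the confusion-matrix counts above:
# Cross-check of the smbinning.metrics output from its own counts
TP <- 11626; FP <- 2049; FN <- 6636; TN <- 3005
c(Sensitivity = TP / (TP + FN),                  # 0.6366
  Specificity = TN / (TN + FP),                  # 0.5946
  Precision   = TP / (TP + FP),                  # 0.8502
  Accuracy    = (TP + TN) / (TP + FP + FN + TN)) # 0.6275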
if (class(TheBestModel) == 'rxNaiveBayes') { # Model of `RevoScaleR` package; class() returns a vector here, hence the warning below (inherits() would be cleaner)
Probs <-
RevoScaleR::rxPredict(modelObject = TheBestModel
, data = X
# , outData = sqlServerOutDS3
, predVarNames = c('Bad_Probs', 'Good_Probs')
, type = 'prob'
, writeModelVars = FALSE
# , extraVarsToWrite = 'SUBS_KEY'
, overwrite = TRUE )['Good_Probs'] %>%
pull
} else { # Microsoft R Machine Learning model from `MicrosoftML` package
Probs <-
RevoScaleR::rxPredict( TheBestModel
, data = cbind(Y = Y, ObClass = Y %>% as.integer() - 1, X)
, suffix = '.MicrosoftML'
# , extraVarsToWrite = names(cbind(Y = Y[inTrain], ObClass = Y[inTrain] %>% as.integer()-1, X[inTrain, ]))
, outData = tempfile(fileext = '.xdf')) %>%
RevoScaleR::rxDataStep(varsToKeep = c('Probability.MicrosoftML.Good')) %>%
pull
}## Warning in if (class(TheBestModel) == "rxNaiveBayes") {: the condition has
## length > 1 and only the first element will be used
## Beginning processing data.
## Rows Read: 345546, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:04.1849337
## Finished writing 345546 rows.
## Writing completed.
## Rows Read: 345546, Total Rows Processed: 345546, Total Chunk Time: 0.049 seconds
if (class(DT) %>% length == 1) setDT(DT)
Output.tbl <-
cbind(DT[, c('UniqueID', gsub('_fct', '', names(X))), with = FALSE], X, #
data.frame(
`Probability of Default` = 1 - Probs
, Scores = round((c_ + (I / log(2)) * log(Probs / (1 - Probs))) + 0, 0)
, Prediction = ifelse(Probs < 0.5, 'Bad', 'Good')
, Reference = Y)
)
# Create DataTable in MS Excel
openxlsx::addWorksheet(wb2 <- openxlsx::createWorkbook(), sheetName = 'Data', gridLines = FALSE)
openxlsx::writeDataTable(wb2, sheet = 'Data', x = Output.tbl, tableStyle = 'TableStyleMedium6',
tableName = 'Data', firstColumn = TRUE, lastColumn = TRUE, bandedRows = TRUE)
# # Writing Comments into cells
# for (i in 1:ncol(Output.tbl))
# openxlsx::writeComment(wb2, sheet = 'Data', xy = c(i, 1),
# comment = openxlsx::createComment(comment = attr(Output.tbl, 'variable.labels')[i],
# visible = FALSE, width = 2, height = 10, style = createStyle(fontSize = 8)))
# Set Columns widths
openxlsx::setColWidths(wb2, sheet = 'Data', cols = 1:2, widths = c(12, 10))
openxlsx::freezePane(wb2, 'Data', firstCol = TRUE) # shortcut to firstActiveCol = 2
openxlsx::openXL(wb2)
Solution of a Classification Problem
fitProblemScores <-
RevoScaleR::rxPredict( TheBestModel
, data = cbind(Y = Y[inProblem], ObClass = Y[inProblem] %>% as.integer()-1, X[inProblem, ])
, suffix = '.Problem'
# , extraVarsToWrite = names(cbind(Y = Y[inProblem], ObClass = Y[inProblem] %>% as.integer()-1, X[inProblem, ]))
        , outData = tempfile(fileext = '.xdf'))
## Beginning processing data.
## Rows Read: 112392, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:01.4410516
## Finished writing 112392 rows.
## Writing completed.
openxlsx::addWorksheet(wb2 <- openxlsx::createWorkbook(), sheetName = "Ensembles", gridLines = FALSE)
RevoScaleR::rxImport(inData = fitProblemScores) %>%
dplyr::mutate(loan_default = ifelse(PredictedLabel.Problem == 'Bad', 1, 0)) %>%
dplyr::select(loan_default) %>%
openxlsx::writeDataTable(wb2, sheet = "Ensembles", x = ., colNames = TRUE,
                           tableStyle = "TableStyleMedium4", withFilter = TRUE)
## Rows Read: 112392, Total Rows Processed: 112392, Total Chunk Time: 0.033 seconds
The best results on the Public Leaderboard of the LTFS Data Science FinHack are just above 0.6731 AUC, i.e. Gini 0.3462. However, even the algorithms implemented in R or Python could not solve this problem properly.
Clearly, the classification problem has not been fully solved. It was not possible to find or construct predictors with enough separating power to resolve the binary default class of vehicle loans: none of the many diverse classification models achieved a quality above AUC 0.70 (Gini 0.40) on the test dataset.
Although the algorithms of Microsoft Machine Learning (Microsoft ML Server 9.4.7) run quite fast, they are not yet able to solve this classification problem either. Perhaps this is because the proportion of defaulting (bad) Indian car-loan borrowers is very large, exceeding 20%.
You can execute this R Markdown file as a job: create an R file containing the code
rmarkdown::render(input = '1._MLS_.rmd', output_format = c('html_document'))
and then run that R file from the ~/Projects/ subdirectory.
# The End of Session
# if(!is.null(cl)) {
# parallel::stopCluster(cl)
# cl = NULL
# }
devtools::session_info()
## - Session info ----------------------------------------------------------
## setting value
## version R version 3.5.2 (2018-12-20)
## os Windows 10 x64
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate Russian_Russia.1251
## ctype Russian_Russia.1251
## tz Asia/Dhaka
## date 2019-08-31
##
## - Packages --------------------------------------------------------------
## ! package * version date lib
## acepack 1.4.1 2016-10-29 [1]
## agricolae 1.3-0 2019-01-07 [1]
## ALEPlot 1.1 2018-05-24 [1]
## AlgDesign 1.1-7.3 2014-10-15 [1]
## assertthat 0.2.0 2017-04-11 [1]
## backports 1.1.3 2018-12-14 [1]
## base64enc 0.1-3 2015-07-28 [1]
## bayesplot 1.6.0 2018-08-02 [1]
## bindr 0.1.1 2018-03-13 [1]
## bindrcpp * 0.2.2 2018-03-29 [1]
## bit 1.1-14 2018-05-29 [1]
## bit64 0.9-7 2017-05-08 [1]
## bitops 1.0-6 2013-08-17 [1]
## blob 1.1.1 2018-03-25 [1]
## boot 1.3-20 2017-08-06 [1]
## breakDown * 0.2.0 2019-08-15 [1]
## broom 0.5.1 2018-12-05 [1]
## callr 3.1.1 2018-12-21 [1]
## caret * 6.0-81 2018-11-20 [1]
## caTools 1.17.1.1 2018-07-20 [1]
## cellranger 1.1.0 2016-07-27 [1]
## checkmate 1.9.1 2019-01-15 [1]
## chron 2.3-53 2018-09-09 [1]
## class 7.3-14 2015-08-30 [1]
## cli 1.0.1 2018-09-25 [1]
## cluster 2.0.7-1 2018-04-13 [1]
## coda 0.19-2 2018-10-08 [1]
## codetools 0.2-15 2016-10-05 [1]
## colorspace 1.4-0 2019-01-13 [1]
## colourpicker 1.0 2017-09-27 [1]
## combinat 0.0-8 2012-10-29 [1]
## CompatibilityAPI 1.1.0 2019-01-10 [1]
## crayon 1.3.4 2017-09-16 [1]
## crosstalk 1.0.0 2016-12-21 [1]
## curl 3.3 2019-01-10 [1]
## DALEX * 0.2.6 2019-01-07 [1]
## data.table * 1.12.0 2019-01-13 [1]
## DataExplorer * 0.8.0 2019-08-24 [1]
## DBI 1.0.0 2018-05-02 [1]
## deldir 0.1-16 2019-01-04 [1]
## desc 1.2.0 2018-05-01 [1]
## DescTools 0.99.27 2019-01-19 [1]
## devtools 2.0.1 2018-10-26 [1]
## digest 0.6.18 2018-10-10 [1]
## doParallel * 1.0.14 2019-04-11 [1]
## dplyr * 0.7.8 2018-11-10 [1]
## DT 0.5 2018-11-05 [1]
## dygraphs 1.1.1.6 2018-07-11 [1]
## e1071 1.7-0.1 2019-01-21 [1]
## embed 0.0.2 2018-11-19 [1]
## evaluate 0.12 2018-10-09 [1]
## expm 0.999-3 2018-09-22 [1]
## factorMerger 0.4.0 2019-08-15 [1]
## forcats * 0.3.0 2018-02-19 [1]
## foreach * 1.5.1 2019-04-11 [1]
## foreign 0.8-71 2018-07-20 [1]
## formattable * 0.2.0.1 2016-08-05 [1]
## Formula * 1.2-3 2018-05-03 [1]
## fs 1.2.6 2018-08-23 [1]
## FSelectorRcpp * 0.3.0 2018-11-12 [1]
## gdata 2.18.0 2017-06-06 [1]
## generics 0.0.2 2018-11-29 [1]
## ggmosaic 0.2.0 2018-09-12 [1]
## ggplot2 * 3.1.0 2018-10-25 [1]
## ggpubr 0.2 2018-11-15 [1]
## ggridges 0.5.1 2018-09-27 [1]
## glmnet 2.0-16 2018-04-02 [1]
## glue 1.3.0 2018-07-17 [1]
## gmodels 2.18.1 2018-06-25 [1]
## gower 0.1.2 2017-02-23 [1]
## gplots * 3.0.1.1 2019-01-27 [1]
## gridExtra 2.3 2017-09-09 [1]
## gsubfn * 0.7 2018-03-16 [1]
## gtable 0.2.0 2016-02-26 [1]
## gtools 3.8.1 2018-06-26 [1]
## haven 2.0.0 2018-11-22 [1]
## highr 0.7 2018-06-09 [1]
## hmeasure 1.0-1 2019-01-02 [1]
## Hmisc * 4.2-0 2019-01-26 [1]
## hms 0.4.2 2018-03-10 [1]
## htmlTable 1.13.1 2019-01-07 [1]
## htmltools 0.3.6 2017-04-28 [1]
## htmlwidgets 1.3 2018-09-30 [1]
## httpuv 1.4.5.1 2018-12-18 [1]
## httr 1.4.0 2018-12-11 [1]
## igraph 1.2.2 2018-07-27 [1]
## iml * 0.8.1 2019-01-02 [1]
## Information 0.0.9 2016-04-09 [1]
## inline 0.3.15 2018-05-18 [1]
## inum 1.0-0 2017-12-12 [1]
## ipred 0.9-8 2018-11-05 [1]
## iterators * 1.0.11 2019-04-11 [1]
## jsonlite 1.6 2018-12-07 [1]
## kableExtra * 1.0.1 2019-01-22 [1]
## keras 2.2.4 2018-11-22 [1]
## KernSmooth 2.23-15 2015-06-29 [1]
## klaR 0.6-14 2018-03-19 [1]
## knitr 1.21 2018-12-10 [1]
## labeling 0.3 2014-08-23 [1]
## later 0.7.5 2018-09-18 [1]
## lattice * 0.20-38 2018-11-04 [1]
## latticeExtra 0.6-28 2016-02-09 [1]
## lava 1.6.4 2018-11-25 [1]
## lazyeval 0.2.1 2017-10-29 [1]
## LearnBayes 2.15.1 2018-03-18 [1]
## libcoin * 1.0-2 2018-12-13 [1]
## lme4 1.1-19 2018-11-10 [1]
## loo 2.0.0 2018-04-11 [1]
## lubridate 1.7.4 2018-04-11 [1]
## magrittr * 1.5 2014-11-22 [1]
## manipulate 1.0.1 2014-12-24 [1]
## markdown 0.9 2018-12-07 [1]
## MASS 7.3-51.1 2018-11-01 [1]
## Matrix 1.2-15 2018-11-01 [1]
## matrixStats 0.54.0 2018-07-23 [1]
## memoise 1.1.0 2017-04-21 [1]
## Metrics 0.1.4 2018-07-09 [1]
## MicrosoftML * 9.4.7 2019-05-07 [1]
## mime 0.6 2018-10-05 [1]
## miniUI 0.1.1.1 2018-05-18 [1]
## minqa 1.2.4 2014-10-09 [1]
## ModelMetrics 1.2.2 2018-11-03 [1]
## modelr 0.1.2 2018-05-11 [1]
## D mrsdeploy * 1.1.3 2019-05-15 [1]
## munsell 0.5.0 2018-06-12 [1]
## mvtnorm * 1.0-8 2018-05-31 [1]
## networkD3 0.4 2017-03-18 [1]
## nlme 3.1-137 2018-04-07 [1]
## nloptr 1.2.1 2018-10-03 [1]
## nnet 7.3-12 2016-02-02 [1]
## openxlsx * 4.1.0 2018-05-26 [1]
## pander 0.6.3 2018-11-06 [1]
## partykit * 1.2-3 2019-01-31 [1]
## pdp 0.7.0 2018-08-27 [1]
## pillar 1.3.1 2018-12-15 [1]
## pkgbuild 1.0.2 2018-10-16 [1]
## pkgconfig 2.0.2 2018-08-16 [1]
## pkgload 1.0.2 2018-10-29 [1]
## plotly * 4.8.0 2018-07-20 [1]
## plyr * 1.8.4 2016-06-08 [1]
## prettyunits 1.0.2 2015-07-13 [1]
## processx 3.2.1 2018-12-05 [1]
## prodlim 2018.04.18 2018-04-18 [1]
## productplots 0.1.1 2016-07-02 [1]
## promises 1.0.1 2018-04-13 [1]
## proto * 1.0.0 2016-10-29 [1]
## proxy 0.4-22 2018-04-08 [1]
## pryr 0.1.4 2018-02-18 [1]
## ps 1.3.0 2018-12-21 [1]
## purrr * 0.3.0 2019-01-27 [1]
## pwr * 1.2-2 2018-03-03 [1]
## questionr 0.7.0 2018-11-26 [1]
## R6 2.3.0 2018-10-04 [1]
## rapportools 1.0 2014-01-07 [1]
## RColorBrewer 1.1-2 2014-12-07 [1]
## Rcpp 1.0.0 2018-11-07 [1]
## RCurl 1.95-4.11 2018-07-15 [1]
## readr * 1.3.1 2018-12-21 [1]
## readxl 1.2.0 2018-12-19 [1]
## recipes 0.1.4 2018-11-19 [1]
## remotes 2.0.2 2018-10-30 [1]
## reshape2 * 1.4.3 2017-12-11 [1]
## reticulate 1.10 2018-08-05 [1]
## RevoMods * 11.0.1 2019-04-11 [1]
## RevoScaleR * 9.4.7 2019-05-21 [1]
## RevoUtils * 11.0.2 2019-04-11 [1]
## RevoUtilsMath * 11.0.0 2019-04-24 [1]
## rlang 0.3.1 2019-01-08 [1]
## rmarkdown 1.11 2018-12-08 [1]
## ROCR * 1.0-7 2015-03-26 [1]
## rpart * 4.1-13 2018-02-23 [1]
## rprojroot 1.3-2 2018-01-03 [1]
## rsconnect 0.8.13 2019-01-10 [1]
## RSQLite * 2.1.1 2018-05-06 [1]
## rstan 2.18.2 2018-11-07 [1]
## rstanarm 2.18.2 2018-11-10 [1]
## rstantools 1.5.1 2018-08-22 [1]
## rstudioapi 0.9.0 2019-01-09 [1]
## rvest 0.3.2 2016-06-17 [1]
## scales 1.0.0 2018-08-09 [1]
## sessioninfo 1.1.1 2018-11-05 [1]
## shiny 1.2.0 2018-11-02 [1]
## shinyjs 1.0 2018-01-08 [1]
## shinystan 2.5.0 2018-05-01 [1]
## shinythemes 1.1.2 2018-11-06 [1]
## smbinning * 0.8 2019-01-07 [1]
## sp 1.3-1 2018-06-05 [1]
## spData 0.3.0 2019-01-07 [1]
## spdep 0.8-1 2018-11-21 [1]
## sqldf * 0.4-11 2017-06-28 [1]
## StanHeaders 2.18.1 2019-01-28 [1]
## stringi 1.2.4 2018-07-20 [1]
## stringr * 1.3.1 2018-05-10 [1]
## summarytools * 0.8.8 2018-10-07 [1]
## survival * 2.43-3 2018-11-26 [1]
## tensorflow 1.10 2018-11-19 [1]
## testthat 2.0.1 2018-10-13 [1]
## tfruns 1.4 2018-08-25 [1]
## threejs 0.3.1 2017-08-13 [1]
## tibble * 2.0.1 2019-01-12 [1]
## tidyr * 0.8.2 2018-10-28 [1]
## tidyselect 0.2.5 2018-10-11 [1]
## tidyverse * 1.2.1 2017-11-14 [1]
## timeDate 3043.102 2018-02-21 [1]
## usethis 1.4.0 2018-08-14 [1]
## viridisLite 0.3.0 2018-02-01 [1]
## webshot 0.5.1 2018-09-28 [1]
## whisker 0.3-2 2013-04-28 [1]
## withr 2.1.2 2018-03-15 [1]
## woeBinning * 0.1.6 2018-07-28 [1]
## xfun 0.4 2018-10-23 [1]
## xml2 1.2.0 2018-01-24 [1]
## xtable 1.8-3 2018-08-29 [1]
## xts 0.11-2 2018-11-05 [1]
## yaImpute 1.0-31 2019-01-09 [1]
## yaml 2.2.0 2018-07-25 [1]
## zeallot 0.1.0 2018-01-28 [1]
## zip 1.0.0 2017-04-25 [1]
## zoo 1.8-4 2018-09-19 [1]
## source
##   CRAN (R 3.5.2) for all packages, except:
##   Github (pbiecek/breakDown@ba9a0d9)       breakDown
##   Github (boxuancui/DataExplorer@8a71951)  DataExplorer
##   Github (MI2DataLab/factorMerger@c49e37f) factorMerger
##   local                                    CompatibilityAPI, doParallel, foreach,
##                                            iterators, MicrosoftML, mrsdeploy,
##                                            RevoMods, RevoScaleR, RevoUtils,
##                                            RevoUtilsMath
##
## [1] C:/R/MLS/R_SERVER/library
##
## D -- DLL MD5 mismatch, broken installation.