MicrosoftML (Microsoft Machine Learning Server 9.4.7), the MicrosoftML package

LTFS is one of India’s most respected and leading NBFCs providing vehicle finance.
Financial institutions incur significant losses due to defaults on vehicle loans. This has led to a tightening of vehicle loan underwriting and increased vehicle loan rejection rates, and it has raised the need among these institutions for a better credit risk scoring model. This warrants a study to estimate the determinants of vehicle loan default.
A financial institution has hired you to accurately predict the probability of a loanee/borrower defaulting on a vehicle loan in the first EMI (Equated Monthly Instalment) on the due date. The following information regarding the loan and loanee is provided in the datasets:
Loanee Information (demographic data like age, income, identity proof etc.)
Loan Information (disbursal details, amount, EMI, loan-to-value ratio etc.)
Bureau data & history (bureau score, number of active accounts, the status of other loans, credit history etc.)
Doing so will ensure that clients capable of repayment are not rejected, and important determinants can be identified which can further be used for minimising the default rates.
The Data Set contains:
train.csv contains the training data with details on each loan as described in the last section.
data_dictionary.csv contains a brief description of each variable provided in the training and test sets.
test.csv contains details of all customers and loans for which the participants are to submit a probability of default.
sample_submission.csv contains the submission format for the predictions against the test set. A single csv needs to be submitted as a solution.
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
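Locally, this metric can be checked, for example, with the pROC package (a sketch only; `observed` and `predicted` below are placeholder vectors of the target and the predicted probability):
library('pROC') # Display and Analyze ROC Curves
pROC::auc(response = observed, predictor = predicted) # area under the ROC curve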
Although various tools are actively used for preprocessing and for interpreting modeling results, including tidymodels, Microsoft ML Server 9.4.7 (shipped with Microsoft SQL Server 2019; August 2019 update) was the main toolset for building the classification models.
R Markdown does not support combining Rmd files: per the R Markdown website, R Markdown requires a single Rmd file and does not currently support embedding one Rmd file within another Rmd document.
library('tidyverse') # An opinionated collection of R Packages designed for Data Science.
library('magrittr') # A Forward-Pipe Operator for R
# install.packages("https://cran.r-project.org/src/contrib/data.table_1.11.8.tar.gz", repos = NULL, type = "source") # Install package ‘data.table’ version 1.11.8
library('data.table') # Fast Extension of `data.frame`
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
LoadingData <- 'VLDP'
NameMySQLCode <- 'VLDP.sql'
# Selected Features for Set 1
SelectedFeatures <- c('disbursed_amount', 'asset_cost', 'ltv', 'branch_id', 'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Employment_Type', 'State_ID', 'Employee_code_ID', 'Aadhar_flag', 'PAN_flag', 'Driving_flag', 'Passport_flag', 'PERFORM_CNS_SCORE', 'PERFORM_CNS_SCORE_DESCRIPTION', 'PRI_NO_OF_ACCTS', 'PRI_ACTIVE_ACCTS', 'PRI_OVERDUE_ACCTS', 'PRI_SANCTIONED_AMOUNT', 'PRI_DISBURSED_AMOUNT', 'SEC_CURRENT_BALANCE', 'SEC_SANCTIONED_AMOUNT', 'PRIMARY_INSTAL_AMT', 'NEW_ACCTS_IN_LAST_SIX_MONTHS', 'DELINQUENT_ACCTS_IN_LAST_SIX_M', 'NO_OF_INQUIRIES', 'Age', 'YearsSinceDisbursment', 'AverageLoanTenure', 'TimeSinceFirstLoan')
ModelFilename = 'VLDP_model.rds'
# Check whether Microsoft R Server (RevoScaleR) is available
isMicrosoftRServer <- 'RevoScaleR' %in% rownames(installed.packages())
if( isMicrosoftRServer ) {
writeLines(paste('Microsoft R Server (Machine Learning)', getNamespaceVersion('RevoScaleR') ))
} else writeLines(paste('Common', version$version.string ))
## Microsoft R Server (Machine Learning) 9.4.7
# Import the file with data for application scoring
DT1 <- data.table::fread(unzip(zipfile = 'VehicleLoanDefaultPrediction.zip', files = 'train.csv'))
data.table::setnames(DT1, old = c('loan_default'), new = c('GB_flag'))
DT2 <- data.table::fread(unzip(zipfile = 'VehicleLoanDefaultPrediction.zip', files = 'test.csv'))
# Make one data.table from a list of many, filling missing columns and matching by column names
DT <- list(DT1, DT2) %>%
data.table::rbindlist(., use.names = TRUE, fill = TRUE)
# Import the file with the data dictionary for application scoring
DD <- readxl::read_excel(unzip(zipfile = 'VehicleLoanDefaultPrediction.zip', files = 'Data Dictionary.xlsx')) %>% dplyr::mutate(`Variable Name` = stringr::str_replace(string = `Variable Name`, pattern = 'loan_default',
replacement = 'GB_flag'))
## New names:
## * `` -> `..3`
# Randomly select 90% of the rows of the original training data for model training
# and split the data into two sets: training (inTrain) and testing (inTest)
seed <- 2019
set.seed(seed)
inTrain1 <- rep(FALSE, nrow(DT1))
inTrain1[sample(nrow(DT1), 9/10 * nrow(DT1))] <- TRUE
inTrain = c( inTrain1, rep( FALSE, times = dim(DT2)[1] ) )
inTest = c( !inTrain1, rep( FALSE, times = dim(DT2)[1] ) )
inProblem = c( rep( FALSE, times = dim(DT1)[1] ), rep( TRUE, times = dim(DT2)[1] ) )
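# Sanity check: every row of the stacked table DT falls into exactly one of the three
# masks: training (inTrain), testing (inTest) or the unlabeled submission set (inProblem)
stopifnot(length(inTrain) == nrow(DT), all(inTrain + inTest + inProblem == 1L))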
# Adding Attributes by Variables into DT from Data Dictionary
attr(DT, "variable.labels") <- tibble(`Variable.Name` = colnames(DT)) %>%
dplyr::left_join(DD, by = c('Variable.Name' = 'Variable Name')) %>%
dplyr::select(Description) %>%
pull
# gsub() replaces '.' (dot) with '_' (underscore) in column names
names(DT) %<>% gsub('[.]', '_', .)
# DT$PRI_CURRENT_BALANCE <- DT$PRI_CURRENT_BALANCE + 6678296
# DT$PRI_CURRENT_BALANCE <- log(DT$PRI_CURRENT_BALANCE)
# Rename all variables whose names are longer than 31 characters
oldnames = c('DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS')
newnames = c('DELINQUENT_ACCTS_IN_LAST_SIX_M')
data.table::setnames(DT, old = oldnames, new = newnames)
# Parse the date of birth (two-digit years above 40 become 19xx, otherwise 20xx) and convert it into 'Age' at '2019-01-01'
Finish_Date <- lubridate::parse_date_time2('2019-01-01', orders = '%Y-%m-%d', tz = 'Asia/Dhaka')
DT[, Date_of_Birth := paste0(substr( DT$Date_of_Birth, 1, 6 ),
ifelse(substr( DT$Date_of_Birth, 7, 8 ) %>% as.integer > 40, '19', '20'),
substr( DT$Date_of_Birth, 7, 8 )) %>%
as.Date(., '%d-%m-%Y') ]
DT[, Age := difftime(time1 = Finish_Date, time2 = Date_of_Birth, units = 'days') %>%
as.integer(.) / 365.24 ]
# Parse the disbursal date and convert it into 'YearsSinceDisbursment'
DT[, DisbursalDate := paste0(substr( DT$DisbursalDate, 1, 6 ),
ifelse(substr( DT$DisbursalDate, 7, 8 ) %>% as.integer > 40, '19', '20'),
substr( DT$DisbursalDate, 7, 8 )) %>%
as.Date( ., '%d-%m-%Y') ]
DT[, YearsSinceDisbursment := difftime(time1 = Finish_Date, time2 = DisbursalDate, units = 'days') %>%
as.integer(.) / 365.24 ]
# Convert the average account age (AVERAGE_ACCT_AGE, e.g. '1yrs 6mon') into 'AverageLoanTenure' in years
DT[, AverageLoanTenure :=
stringr::str_split_fixed(DT$AVERAGE_ACCT_AGE, ' ', 2) %>%
data.frame(., stringsAsFactors = FALSE) %>%
setNames(c('Years', 'Months')) %>%
dplyr::mutate(Years = readr::parse_number(Years), Months = readr::parse_number(Months) / 12 ) %>%
dplyr::mutate(Len = Years + Months) %>%
dplyr::select(Len) %>% pull ]
# Convert the credit history length (CREDIT_HISTORY_LENGTH) into 'TimeSinceFirstLoan' in years
DT[, TimeSinceFirstLoan :=
stringr::str_split_fixed(DT$CREDIT_HISTORY_LENGTH, ' ', 2) %>%
data.frame(., stringsAsFactors = FALSE) %>%
setNames(c('Years', 'Months')) %>%
dplyr::mutate(Years = readr::parse_number(Years), Months = readr::parse_number(Months) / 12 ) %>%
dplyr::mutate(Len = Years + Months) %>%
dplyr::select(Len) %>% pull ]
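# e.g. readr::parse_number() extracts 1 from '1yrs' and 6 from '6mon', so '1yrs 6mon' becomes 1 + 6/12 = 1.5 years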
DT[, ':=' (
YearsOnLoan = difftime(time1 = DisbursalDate, time2 = Date_of_Birth, units = 'days') %>%
as.integer(.) / 365.24,
DisAsDiff = asset_cost - disbursed_amount,
DisAsShare = asset_cost / disbursed_amount,
# DiffLTV = (asset_cost - disbursed_amount - ltv),
Qrt = lubridate::quarter(DisbursalDate) %>% as.factor(.), # weekdays
Day = format(DisbursalDate, '%d') %>% as.integer,
OutstandingNow = disbursed_amount + PRI_CURRENT_BALANCE,
DisbursedTotal = PRI_DISBURSED_AMOUNT + disbursed_amount,
ShareOverdue = DELINQUENT_ACCTS_IN_LAST_SIX_M - NEW_ACCTS_IN_LAST_SIX_MONTHS,
# OutstandingNow2Dsbrsd = OutstandingNow / DisbursedTotal
SEC_OverdueShareSec = SEC_OVERDUE_ACCTS / SEC_NO_OF_ACCTS,
PRI_OverdueShare = PRI_OVERDUE_ACCTS / PRI_NO_OF_ACCTS
) ]
library('Hmisc') # Harrell Miscellaneous
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:rpart':
##
## solder
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
DT <- Hmisc::upData(DT, labels = c(Age = 'Age in Years',
YearsSinceDisbursment = 'Years Since Disbursement',
AverageLoanTenure = 'Average loan tenure in Years (AVERAGE_ACCT_AGE)',
TimeSinceFirstLoan = 'Years since First Loan (CREDIT_HISTORY_LENGTH)',
YearsOnLoan = 'Years On Loan',
DisAsDiff = 'Difference of asset_cost from disbursed_amount',
DisAsShare = 'Ratio of asset_cost to disbursed_amount',
# DiffLTV = 'Difference of asset_cost from disbursed_amount and ltv',
Qrt = 'Quarter of DisbursalDate',
Day = 'Day of month of DisbursalDate',
OutstandingNow = 'Sum of disbursed_amount and PRI_CURRENT_BALANCE',
DisbursedTotal = 'Sum of PRI_DISBURSED_AMOUNT and disbursed_amount',
ShareOverdue = 'Difference of DELINQUENT_ACCTS_IN_LAST_SIX_M from NEW_ACCTS_IN_LAST_SIX_MONTHS',
SEC_OverdueShareSec = 'Ratio of SEC_OVERDUE_ACCTS to SEC_NO_OF_ACCTS',
PRI_OverdueShare = 'Ratio of PRI_OVERDUE_ACCTS to PRI_NO_OF_ACCTS'))
## Input object size: 96804992 bytes; 55 variables 345546 observations
## New object size: 96812608 bytes; 55 variables 345546 observations
# Hmisc::contents(DT)
# Convert *some* column classes in data.table
factor_cols <- c('branch_id', 'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Employment_Type', 'State_ID', 'Employee_code_ID', 'MobileNo_Avl_Flag', 'Aadhar_flag', 'PAN_flag', 'VoterID_flag', 'Driving_flag', 'Passport_flag', 'PERFORM_CNS_SCORE_DESCRIPTION')
DT[, (factor_cols) := lapply(.SD, factor), .SDcols = factor_cols]
delcols <- c('UniqueID', 'Date_of_Birth', 'DisbursalDate', 'PRI_CURRENT_BALANCE', 'AVERAGE_ACCT_AGE', 'CREDIT_HISTORY_LENGTH') # , grep('_flag', names(DT), value = TRUE), grep('SEC_', names(DT), value = TRUE)
# Remove the outcome variable 'GB_flag' from the 'delcols' vector due to the '_flag' pattern
delcols <- delcols[! delcols %in% c('GB_flag')]
if ( isMicrosoftRServer ) {
# for Microsoft Machine Learning Server 9.3
# Get a subset rows and columns from the data frame
DF1 <- RevoScaleR::rxDataStep(inData = DT, maxRowsByCols = 2e9,
varsToDrop = delcols
# varsToKeep = c('x', 'w', 'z'),
# rowSelection = z > 0
)
out <- rxGetInfo(DF1, getVarInfo = TRUE)
} else {
# for CRAN R version
delColumns <- delcols
DF1 <- copy(DT) %>%
.[, (delColumns) := NULL] %>%
data.table::setDF(.)
}
## Rows Read: 345546, Total Rows Processed: 345546, Total Chunk Time: 0.746 seconds
I use the embed package, which contains extra steps for the recipes package for embedding predictors into one or more numeric columns. All of its preprocessing methods are supervised.
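The helper EncodingMultiLevelFactors() used below is defined elsewhere in the project. A minimal sketch of a plausible implementation is given here, assuming it wraps embed::step_lencode_mixed() (a partial-pooling likelihood encoder whose lme4 optimizer produces traces like those printed below); the argument names X (data), Y (name of the factor column) and Z (logical training mask) follow the call below.
# Sketch only; the project's actual implementation may differ
EncodingMultiLevelFactors <- function(X, Y, Z) {
  # Fit the encoder on the training rows only, then apply it to all rows
  train <- X[Z, c('GB_flag', Y)]
  train$GB_flag <- factor(train$GB_flag) # binary outcome as a factor for the encoder
  rec <- recipes::recipe(stats::as.formula(paste('GB_flag ~', Y)), data = train) %>%
    embed::step_lencode_mixed(!!rlang::sym(Y), outcome = dplyr::vars(GB_flag)) %>%
    recipes::prep(training = train)
  # Replace the factor column with its numeric (log-odds) encoding
  X[[Y]] <- recipes::bake(rec, new_data = X[, Y, drop = FALSE])[[Y]]
  X
}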
# Encoding of Multi-Level Categorical Variables (Factors) into Numeric Predictors
for (var in c('branch_id', 'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Employment_Type', 'State_ID', 'Employee_code_ID', 'PERFORM_CNS_SCORE_DESCRIPTION')) {
print(var)
DF1 <- EncodingMultiLevelFactors(X = DF1, Y = var, Z = inTrain)
}
## [1] "branch_id"
## start par. = 0.1123205 fn = 221335.9
## At return
## eval: 19 fn: 221335.56 par: 0.106959
## [1] "supplier_id"
## start par. = 0.2075279 fn = 220404.6
## At return
## eval: 19 fn: 220349.07 par: 0.172799
## [1] "manufacturer_id"
## start par. = 0.04441674 fn = 223266.8
## At return
## eval: 18 fn: 223266.33 par: 0.0533252
## [1] "Current_pincode_ID"
## start par. = 0.245132 fn = 221670.7
## At return
## eval: 16 fn: 221350.34 par: 0.166798
## [1] "Employment_Type"
## start par. = 0.02972829 fn = 216219.5
## At return
## eval: 28 fn: 216219.22 par: 0.0422908
## [1] "State_ID"
## start par. = 0.08355545 fn = 222283.1
## At return
## eval: 18 fn: 222283. par: 0.0889640
## [1] "Employee_code_ID"
## start par. = 0.2213827 fn = 220258.2
## At return
## eval: 19 fn: 220127.69 par: 0.175705
## [1] "PERFORM_CNS_SCORE_DESCRIPTION"
## start par. = 0.098095 fn = 221735
## At return
## eval: 17 fn: 221733.42 par: 0.119980
# DF1 <- DF1[, c('GB_flag', SelectedFeatures)]
DF0 <- DF1[inTrain, ] # Training set used to fit the discretization (binning) of variables
Max_Vars <- 32
Max_Levels <- 48
remove(DT1, DT2, DD, inTrain1, delcols, factor_cols, oldnames, newnames)
Inspect the data frame using automated data exploration and inspection packages: DataExplorer and summarytools.
# Inspect the data frame
library('DataExplorer') # Automate Data Exploration and Treatment
library('summarytools') # Tools to Quickly and Neatly Summarize Data
##
## Attaching package: 'summarytools'
## The following objects are masked from 'package:Hmisc':
##
## label, label<-
## The following object is masked from 'package:tibble':
##
## view
dfName <- 'DT'
dfCols <- colnames(DT) %>% length()
data.frame(Variables = sapply(DT, class) %>% unlist) %>%
dplyr::filter(Variables != 'labelled') %>%
ggplot2::ggplot(aes(x = Variables, fill = Variables)) +
ggplot2::geom_bar( stat = 'count') +
ggplot2::geom_text(stat='count', aes(label = ..count..), vjust = 2, size = 5, color = 'white') +
ggplot2::theme(legend.position = 'none') +
ggplot2::labs(title = paste(dfName, 'Column Types'), subtitle = sprintf('Data.frame has %g columns.', dfCols),
x = NULL, y = 'Number of columns') # DataExplorer::profile_missing(DT)
# Data frame Summary by Variable
print(summarytools::dfSummary(DT, graph.magnif = 0.75), method = 'render')

| No | Variable | Label | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|---|
| 1 | UniqueID [integer] | mean (sd) : 593106.04 (101481.56) min < med < max : 417428 < 592918.5 < 769909 IQR (CV) : 175115.5 (0.17) | 345546 distinct values | 345546 (100%) | 0 (0%) | ||
| 2 | disbursed_amount [integer] | mean (sd) : 54916.38 (13045.96) min < med < max : 11613 < 54303 < 990572 IQR (CV) : 13302 (0.24) | 29271 distinct values | 345546 (100%) | 0 (0%) | ||
| 3 | asset_cost [integer] | mean (sd) : 76294.84 (18738.64) min < med < max : 37000 < 71541 < 1628992 IQR (CV) : 13323 (0.25) | 53158 distinct values | 345546 (100%) | 0 (0%) | ||
| 4 | ltv [numeric] | mean (sd) : 74.93 (11.32) min < med < max : 10.03 < 77.14 < 95 IQR (CV) : 14.41 (0.15) | 6819 distinct values | 345546 (100%) | 0 (0%) | ||
| 5 | branch_id [factor] | 1. 1 2. 2 3. 3 4. 5 5. 7 6. 8 7. 9 8. 10 9. 11 10. 13 [ 72 others ] | 8337 (2.4%) 20527 (5.9%) 14881 (4.3%) 10276 (3.0%) 4323 (1.3%) 4472 (1.3%) 3891 (1.1%) 5685 (1.6%) 5873 (1.7%) 4170 (1.2%) 263111 (76.1%) | 345546 (100%) | 0 (0%) | ||
| 6 | supplier_id [factor] | 1. 10524 2. 12311 3. 12312 4. 12374 5. 12441 6. 12456 7. 12500 8. 12534 9. 12539 10. 12797 [ 3079 others ] | 7 (0.0%) 6 (0.0%) 57 (0.0%) 146 (0.0%) 62 (0.0%) 101 (0.0%) 79 (0.0%) 73 (0.0%) 11 (0.0%) 87 (0.0%) 344917 (99.8%) | 345546 (100%) | 0 (0%) | ||
| 7 | manufacturer_id [factor] | 1. 45 2. 48 3. 49 4. 51 5. 67 6. 86 7. 120 8. 145 9. 152 10. 153 [ 2 others ] | 87053 (25.2%) 22964 (6.6%) 14812 (4.3%) 40927 (11.8%) 3364 (1.0%) 161203 (46.7%) 14049 (4.1%) 1138 (0.3%) 9 (0.0%) 25 (0.0%) 2 (0.0%) | 345546 (100%) | 0 (0%) | ||
| 8 | Current_pincode_ID [factor] | 1. 1 2. 2 3. 3 4. 4 5. 5 6. 6 7. 7 8. 8 9. 9 10. 10 [ 7086 others ] | 44 (0.0%) 118 (0.0%) 87 (0.0%) 153 (0.0%) 331 (0.1%) 162 (0.0%) 152 (0.0%) 74 (0.0%) 56 (0.0%) 8 (0.0%) 344361 (99.7%) | 345546 (100%) | 0 (0%) | ||
| 9 | Date_of_Birth [Date] | min : 1949-09-15 med : 1986-01-01 max : 2000-11-29 range : 51y 2m 14d | 15888 distinct val. | 345546 (100%) | 0 (0%) | ||
| 10 | Employment_Type [factor] | 1. 2. Salaried 3. Self employed | 11104 (3.2%) 147013 (42.5%) 187429 (54.2%) | 345546 (100%) | 0 (0%) | ||
| 11 | DisbursalDate [Date] | min : 2018-08-01 med : 2018-10-20 max : 2018-11-30 range : 3m 29d | 111 distinct val. | 345546 (100%) | 0 (0%) | ||
| 12 | State_ID [factor] | 1. 1 2. 2 3. 3 4. 4 5. 5 6. 6 7. 7 8. 8 9. 9 10. 10 [ 12 others ] | 14351 (4.2%) 7258 (2.1%) 47868 (13.9%) 70438 (20.4%) 14304 (4.1%) 48903 (14.2%) 10628 (3.1%) 20047 (5.8%) 21459 (6.2%) 5564 (1.6%) 84726 (24.5%) | 345546 (100%) | 0 (0%) | ||
| 13 | Employee_code_ID [factor] | 1. 1 2. 3 3. 4 4. 5 5. 7 6. 9 7. 10 8. 11 9. 12 10. 15 [ 3388 others ] | 106 (0.0%) 192 (0.1%) 96 (0.0%) 133 (0.0%) 221 (0.1%) 77 (0.0%) 44 (0.0%) 111 (0.0%) 162 (0.0%) 146 (0.0%) 344258 (99.6%) | 345546 (100%) | 0 (0%) | ||
| 14 | MobileNo_Avl_Flag [factor] | 1. 1 | 345546 (100.0%) | 345546 (100%) | 0 (0%) | ||
| 15 | Aadhar_flag [factor] | 1. 0 2. 1 | 51883 (15.0%) 293663 (85.0%) | 345546 (100%) | 0 (0%) | ||
| 16 | PAN_flag [factor] | 1. 0 2. 1 | 306392 (88.7%) 39154 (11.3%) | 345546 (100%) | 0 (0%) | ||
| 17 | VoterID_flag [factor] | 1. 0 2. 1 | 298155 (86.3%) 47391 (13.7%) | 345546 (100%) | 0 (0%) | ||
| 18 | Driving_flag [factor] | 1. 0 2. 1 | 338249 (97.9%) 7297 (2.1%) | 345546 (100%) | 0 (0%) | ||
| 19 | Passport_flag [factor] | 1. 0 2. 1 | 344835 (99.8%) 711 (0.2%) | 345546 (100%) | 0 (0%) | ||
| 20 | PERFORM_CNS_SCORE [integer] | mean (sd) : 289.03 (338.84) min < med < max : 0 < 0 < 890 IQR (CV) : 679 (1.17) | 574 distinct values | 345546 (100%) | 0 (0%) | ||
| 21 | PERFORM_CNS_SCORE_DESCRIPTION [factor] | 1. A-Very Low Risk 2. B-Very Low Risk 3. C-Very Low Risk 4. D-Very Low Risk 5. E-Low Risk 6. F-Low Risk 7. G-Low Risk 8. H-Medium Risk 9. I-Medium Risk 10. J-High Risk [ 10 others ] | 21683 (6.3%) 13696 (4.0%) 23870 (6.9%) 16472 (4.8%) 8393 (2.4%) 12176 (3.5%) 5795 (1.7%) 10142 (2.9%) 8260 (2.4%) 5526 (1.6%) 219533 (63.5%) | 345546 (100%) | 0 (0%) | ||
| 22 | PRI_NO_OF_ACCTS [integer] | mean (sd) : 2.37 (5.01) min < med < max : 0 < 0 < 453 IQR (CV) : 3 (2.11) | 114 distinct values | 345546 (100%) | 0 (0%) | ||
| 23 | PRI_ACTIVE_ACCTS [integer] | mean (sd) : 1 (1.88) min < med < max : 0 < 0 < 144 IQR (CV) : 1 (1.87) | 42 distinct values | 345546 (100%) | 0 (0%) | ||
| 24 | PRI_OVERDUE_ACCTS [integer] | mean (sd) : 0.16 (0.54) min < med < max : 0 < 0 < 25 IQR (CV) : 0 (3.5) | 23 distinct values | 345546 (100%) | 0 (0%) | ||
| 25 | PRI_CURRENT_BALANCE [integer] | mean (sd) : 160270.2 (925345.66) min < med < max : -6678296 < 0 < 96524920 IQR (CV) : 31364.5 (5.77) | 97465 distinct values | 345546 (100%) | 0 (0%) | ||
| 26 | PRI_SANCTIONED_AMOUNT [integer] | mean (sd) : 209650.86 (2043865.78) min < med < max : -481500 < 0 < 1e+09 IQR (CV) : 59416.75 (9.75) | 60681 distinct values | 345546 (100%) | 0 (0%) | ||
| 27 | PRI_DISBURSED_AMOUNT [integer] | mean (sd) : 209560.79 (2047482.84) min < med < max : 0 < 0 < 1e+09 IQR (CV) : 57645.75 (9.77) | 65673 distinct values | 345546 (100%) | 0 (0%) | ||
| 28 | SEC_NO_OF_ACCTS [integer] | mean (sd) : 0.05 (0.56) min < med < max : 0 < 0 < 57 IQR (CV) : 0 (11.82) | 40 distinct values | 345546 (100%) | 0 (0%) | ||
| 29 | SEC_ACTIVE_ACCTS [integer] | mean (sd) : 0.02 (0.28) min < med < max : 0 < 0 < 36 IQR (CV) : 0 (12.47) | 23 distinct values | 345546 (100%) | 0 (0%) | ||
| 30 | SEC_OVERDUE_ACCTS [integer] | mean (sd) : 0.01 (0.1) min < med < max : 0 < 0 < 8 IQR (CV) : 0 (16.97) | 0 : 343928 (99.5%) 1 : 1358 (0.4%) 2 : 166 (0.0%) 3 : 54 (0.0%) 4 : 22 (0.0%) 5 : 8 (0.0%) 6 : 6 (0.0%) 7 : 2 (0.0%) 8 : 2 (0.0%) | 345546 (100%) | 0 (0%) | ||
| 31 | SEC_CURRENT_BALANCE [integer] | mean (sd) : 4565.3 (161202.6) min < med < max : -574647 < 0 < 36032852 IQR (CV) : 0 (35.31) | 3947 distinct values | 345546 (100%) | 0 (0%) | ||
| 32 | SEC_SANCTIONED_AMOUNT [integer] | mean (sd) : 6133.3 (189342.66) min < med < max : 0 < 0 < 57945000 IQR (CV) : 0 (30.87) | 2631 distinct values | 345546 (100%) | 0 (0%) | ||
| 33 | SEC_DISBURSED_AMOUNT [integer] | mean (sd) : 6038.72 (188911.37) min < med < max : 0 < 0 < 57945000 IQR (CV) : 0 (31.28) | 3031 distinct values | 345546 (100%) | 0 (0%) | ||
| 34 | PRIMARY_INSTAL_AMT [integer] | mean (sd) : 12497.73 (199754.5) min < med < max : 0 < 0 < 85262329 IQR (CV) : 1946 (15.98) | 34330 distinct values | 345546 (100%) | 0 (0%) | ||
| 35 | SEC_INSTAL_AMT [integer] | mean (sd) : 272.74 (16261.26) min < med < max : 0 < 0 < 5390000 IQR (CV) : 0 (59.62) | 2295 distinct values | 345546 (100%) | 0 (0%) | ||
| 36 | NEW_ACCTS_IN_LAST_SIX_MONTHS [integer] | mean (sd) : 0.36 (0.92) min < med < max : 0 < 0 < 35 IQR (CV) : 0 (2.56) | 26 distinct values | 345546 (100%) | 0 (0%) | ||
| 37 | DELINQUENT_ACCTS_IN_LAST_SIX_M [integer] | mean (sd) : 0.1 (0.38) min < med < max : 0 < 0 < 20 IQR (CV) : 0 (4.01) | 16 distinct values | 345546 (100%) | 0 (0%) | ||
| 38 | AVERAGE_ACCT_AGE [character] | 1. 0yrs 0mon 2. 0yrs 6mon 3. 0yrs 7mon 4. 0yrs 11mon 5. 0yrs 10mon 6. 1yrs 0mon 7. 0yrs 9mon 8. 0yrs 8mon 9. 1yrs 1mon 10. 0yrs 5mon [ 190 others ] | 177481 (51.4%) 9325 (2.7%) 8167 (2.4%) 7665 (2.2%) 7587 (2.2%) 7447 (2.2%) 7353 (2.1%) 7224 (2.1%) 6680 (1.9%) 6458 (1.9%) 100159 (29.0%) | 345546 (100%) | 0 (0%) | ||
| 39 | CREDIT_HISTORY_LENGTH [character] | 1. 0yrs 0mon 2. 0yrs 6mon 3. 2yrs 1mon 4. 0yrs 7mon 5. 2yrs 0mon 6. 1yrs 0mon 7. 1yrs 1mon 8. 0yrs 11mon 9. 0yrs 8mon 10. 0yrs 9mon [ 297 others ] | 177178 (51.3%) 7456 (2.2%) 6932 (2.0%) 6243 (1.8%) 5762 (1.7%) 5153 (1.5%) 4542 (1.3%) 3925 (1.1%) 3753 (1.1%) 3572 (1.0%) 121030 (35.0%) | 345546 (100%) | 0 (0%) | ||
| 40 | NO_OF_INQUIRIES [integer] | mean (sd) : 0.21 (0.72) min < med < max : 0 < 0 < 36 IQR (CV) : 0 (3.37) | 26 distinct values | 345546 (100%) | 0 (0%) | ||
| 41 | GB_flag [integer] | mean (sd) : 0.22 (0.41) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (1.9) | 0 : 182543 (78.3%) 1 : 50611 (21.7%) | 233154 (67.47%) | 112392 (32.53%) | ||
| 42 | Age [labelled, numeric] | Age in Years | mean (sd) : 34.81 (9.86) min < med < max : 18.09 < 33 < 69.29 IQR (CV) : 15.28 (0.28) | 15888 distinct values | 345546 (100%) | 0 (0%) | |
| 43 | YearsSinceDisbursment [labelled, numeric] | Years Since Disbursement | mean (sd) : 0.22 (0.1) min < med < max : 0.08 < 0.2 < 0.42 IQR (CV) : 0.16 (0.43) | 111 distinct values | 345546 (100%) | 0 (0%) | |
| 44 | AverageLoanTenure [labelled, numeric] | Average loan tenure in Years (AVERAGE_ACCT_AGE) | mean (sd) : 0.74 (1.26) min < med < max : 0 < 0 < 30.75 IQR (CV) : 1.08 (1.69) | 200 distinct values | 345546 (100%) | 0 (0%) | |
| 45 | TimeSinceFirstLoan [labelled, numeric] | Years since First Loan (CREDIT_HISTORY_LENGTH) | mean (sd) : 1.33 (2.35) min < med < max : 0 < 0 < 39 IQR (CV) : 1.92 (1.76) | 307 distinct values | 345546 (100%) | 0 (0%) | |
| 46 | YearsOnLoan [labelled, numeric] | Years On Loan | mean (sd) : 34.59 (9.86) min < med < max : 18 < 32.81 < 69.13 IQR (CV) : 15.23 (0.28) | 16252 distinct values | 345546 (100%) | 0 (0%) | |
| 47 | DisAsDiff [labelled, integer] | Difference of asset_cost from disbursed_amount | mean (sd) : 21378.46 (12308.93) min < med < max : 3917 < 17901 < 638420 IQR (CV) : 13187 (0.58) | 48676 distinct values | 345546 (100%) | 0 (0%) | |
| 48 | DisAsShare [labelled, numeric] | Ratio of asset_cost to disbursed_amount | mean (sd) : 1.42 (0.31) min < med < max : 1.07 < 1.34 < 10.57 IQR (CV) : 0.26 (0.22) | 299901 distinct values | 345546 (100%) | 0 (0%) | |
| 49 | Qrt [labelled, factor] | Quarter of DisbursalDate | 1. 3 2. 4 | 134790 (39.0%) 210756 (61.0%) | 345546 (100%) | 0 (0%) | |
| 50 | Day [labelled, integer] | Day of month of DisbursalDate | mean (sd) : 19.3 (7.86) min < med < max : 1 < 20 < 31 IQR (CV) : 13 (0.41) | 31 distinct values | 345546 (100%) | 0 (0%) | |
| 51 | OutstandingNow [labelled, integer] | Sum of disbursed_amount and PRI_CURRENT_BALANCE | mean (sd) : 215186.57 (925615.45) min < med < max : -6608979 < 60379.5 < 96583433 IQR (CV) : 40649.75 (4.3) | 118719 distinct values | 345546 (100%) | 0 (0%) | |
| 52 | DisbursedTotal [labelled, integer] | Sum of PRI_DISBURSED_AMOUNT and disbursed_amount | mean (sd) : 264477.16 (2047611.56) min < med < max : 11613 < 62639.5 < 1000047773 IQR (CV) : 63776.5 (7.74) | 121005 distinct values | 345546 (100%) | 0 (0%) | |
| 53 | ShareOverdue [labelled, integer] | Difference of DELINQUENT_ACCTS_IN_LAST_SIX_M from NEW_ACCTS_IN_LAST_SIX_MONTHS | mean (sd) : -0.26 (0.93) min < med < max : -30 < 0 < 17 IQR (CV) : 0 (-3.52) | 38 distinct values | 345546 (100%) | 0 (0%) | |
| 54 | SEC_OverdueShareSec [labelled, numeric] | Ratio of SEC_OVERDUE_ACCTS to SEC_NO_OF_ACCTS | mean (sd) : 0.16 (0.33) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (2.11) | 66 distinct values | 7108 (2.06%) | 338438 (97.94%) | |
| 55 | PRI_OverdueShare [labelled, numeric] | Ratio of PRI_OVERDUE_ACCTS to PRI_NO_OF_ACCTS | mean (sd) : 0.09 (0.23) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (2.49) | 324 distinct values | 170703 (49.4%) | 174843 (50.6%) |
Generated by summarytools 0.8.8 (R version 3.5.2)
2019-08-31
The function smbinning::smbinning() (Optimal Binning for Scoring Modeling) categorizes a numeric characteristic into bins for later use in scoring modeling. This process, also known as supervised discretization, utilizes Recursive Partitioning (rpart) to categorize the numeric characteristic.
WenSui Liu has developed two different algorithms for monotonic binning of numeric variables. While the first tends to generate bins with equal densities, the second defines finer bins based on isotonic regression.
In the code snippet below, a third approach is illustrated, whose purpose is to generate bins with roughly equal-sized bad counts. Once again, for the reporting layer, WenSui Liu leveraged the flexible smbinning::smbinning.custom() function with a small tweak.
The levels of factor variables should be joined into groups manually by the special function smbinning::smbinning.factor.custom(); a hypothetical example of both custom calls follows.
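For illustration, custom calls with hand-picked cutpoints and level groups might look like this (the cut values, the factor name and its levels are examples only, not fitted to these data):
# Numeric variable: bin `ltv` at hand-picked cutpoints (illustrative values)
ltv.custom <- smbinning::smbinning.custom(df = DF0, y = 'GB_flag', x = 'ltv', cuts = c(60, 75, 85))
ltv.custom$ivtable # binning table with counts, WoE and IV per bin
# Factor variable: levels joined manually into groups (quoted level sets; placeholders)
# smbinning::smbinning.factor.custom(DF0, x = 'SomeFactor', y = 'GB_flag', c("'A'", "'B','C'"))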
# Binning and Fine Classing Factor and Numeric (Scale) Variables
# https://support.sas.com/documentation/cdl/en/emcsgs/66008/PDF/default/emcsgs.pdf - SAS Enterprise Miner 12.1
# install.packages("https://cran.microsoft.com/src/contrib/smbinning_0.9.tar.gz", repos = NULL, type = "source")
library('smbinning') # Scoring Modeling and Optimal Binning for GLM Model from Herman Jopia (Chile)
## Loading required package: sqldf
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: partykit
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
# install.packages("https://cran.r-project.org/src/contrib/woeBinning_0.1.6.tar.gz", repos = NULL, type = "source")
library('woeBinning') # Supervised Weight of Evidence Binning of Numeric Variables and Factors
library('openxlsx') # Read, Write and Edit Microsoft XLSX Files
isobin <- function(data, y, x) { # Second Variant - Finer Monotonic Binning Based on Isotonic Regression
# WenSui Liu leads a team of quantitative analysts developing operational risk models for an American bank.
# https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression/
d1 <- data[c(y, x)]
d2 <- d1[!is.na(d1[x]), ]
c <- cor(d2[, 2], d2[, 1], method = 'spearman', use = 'complete.obs')
reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1])
k <- knots(as.stepfun(reg))
sm1 <- smbinning::smbinning.custom(d1, y, x, k)
c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint)
c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, ' '))))
c3 <- c2[!is.na(c2)]
return(smbinning::smbinning.custom(d1, y, x, c3[-length(c3)]))
}
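# Usage (as below): e.g. isobin(DF0, 'GB_flag', 'ltv') returns a smbinning-style list with $ivtable, $iv and $bands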
tree_bin <- function(data, y, x) {
# Tree-based WOE binning via Thilo Eichenberg's `woeBinning` package: returns numeric
# cutpoints for numeric variables, or quoted groups of levels for factor variables
binning <- woeBinning::woe.tree.binning(df = data, target.var = y, pred.var = x,
min.perc.total = 0.05, min.perc.class = 0, stop.limit = 0.01,
abbrev.fact.levels = 200, event.class = 1)
if (class(binning) == 'list') {
z <- c()
if ( class(data[, x]) == 'factor') {
for (variable in binning[[2]]$Group.2 %>% levels()) {
df <- binning[[2]]
fac_vec <- df[binning[[2]]$Group.2 == variable, 'Group.1'] %>%
as.character
chr_vec <- paste0('\'', paste0(fac_vec, sep = "\'", collapse = ', \''))
z <- c(z, chr_vec)
}
} else {
df <- binning[[2]]
z <- df$cutpoints.final %>%
.[c(-1, -nrow(df))] %>%
as.vector()
} # End of 'Numeric' class
} else { # binning failed: return an empty cutpoint vector
z <- c()
}
return( z )
}
# Exploratory Data Analysis (EDA)
if (Max_Vars >= ncol(DF0) )
smbinning.eda(DF0)$eda
# Flip the outcome flag for smbinning(), which counts y = 1 as 'Good' (CntGood)
DF0 %<>% # dplyr::mutate_if(is.double, round) %>%
# dplyr::mutate_if(is.double, as.integer) %>%
dplyr::mutate( GB_flag = ifelse(GB_flag == 1, 0, 1) ) # Flag for Optimal Binning
# Convert integer features with fewer than 5 unique values into factors for smbinning()
DF0 %<>% dplyr::mutate_at( dplyr::select_if(., ~ is.integer(.) & unique(.) %>% length(.) < 5 ) %>%
dplyr::select( -dplyr::one_of('GB_flag') ) %>% colnames,
as.factor )
# Create MS Excel File for Output
openxlsx::addWorksheet(wb <- openxlsx::createWorkbook(), sheetName = 'IV Table',
gridLines = FALSE, tabColour = 'olivedrab')
openxlsx::addWorksheet(wb, sheetName = 'Scorecard', gridLines = FALSE, tabColour = 'brown')
NamesOfVariables <- DF0 %>%
dplyr::select( -dplyr::one_of('GB_flag', dplyr::select_if(., ~ is.factor(.) & nlevels(.) == 1) %>%
colnames, # Levels > 1
# At least 5 different values for Numeric variables
dplyr::select_if(., ~ is.numeric(.) & unique(.) %>% length() < 5) %>% colnames)) %>%
colnames
binning.df <- cbind(Variable = NamesOfVariables,
`IV-Finish` = rep(NA_real_, times = length(NamesOfVariables)),
data.frame(matrix(data = rep(NA_real_, length(NamesOfVariables) * 8),
nrow = length(NamesOfVariables), ncol = 8))) %>%
setNames(c('Variable', 'IV', 'IV-RPart', 'N-RPart', 'IV-Decile', 'N-Decile',
'IV-Iso', 'N-Iso', 'IV-Tree', 'N-Tree'))
binning.df$Method <- rep(x = '', times = length(NamesOfVariables))
TotalBinning.sql <- ''
for (i in 1:length(NamesOfVariables)) {
val <- NamesOfVariables[i]
writeLines(paste(i, '-', val))
if (DF0[, val] %>% is.factor) { # Generate a binning table for all the categories of a given factor variable (A factor variable with at least 2 different values. Labels with commas are not allowed)
result.smb <- switch(val,
`education` = smbinning.factor.custom(DF0, x = val, y = 'GB_flag',
c("'Высшее'", # 'Высшее'
"'Начальное','Средне-специальное'", # 'Начальное','Средне-специальное'
"'Среднее'")), # 'Среднее'
if (levels(DF0[, val]) %>% length() > Max_Levels) { # Multi levels `factor` variables
chr_vec <- tree_bin(DF0, 'GB_flag', val) # Combine levels of factor variable by Badrate into some bins
smbinning.factor.custom(DF0, x = val, y = 'GB_flag', chr_vec)
}
else {
smbinning.factor(DF0, x = val, y = 'GB_flag', maxcat = levels(DF0[, val]) %>% length() + 1)
}
)
if (class(result.smb) == 'list') {
binning.df[i, 'IV'] <- result.smb$iv
binning.df[i, 'IV-RPart'] <- result.smb$iv
binning.df[i, 'N-RPart'] <- result.smb$ivtable %>% nrow - 1 # with Missing Values
binning.df[i, 'Method'] <- ifelse(length(result.smb$groups) == 0, 'IV-By Levels', 'IV-By Groups')[1]
# IV Table Supplement
result.smb$ivtable <- result.smb$ivtable %>%
dplyr::mutate(G_Dis = CntGood / table(DF0$GB_flag)[2],
B_Dis = CntBad / table(DF0$GB_flag)[1],
`G/B Index` = ifelse(G_Dis > B_Dis, G_Dis / B_Dis, B_Dis / G_Dis),
`0=Good, 1=Bad` = ifelse(G_Dis > B_Dis, 0, 1),
Bin = c(1:(nrow(result.smb$ivtable) - 1), NA),
# Min = (c(NA, result.smb$cuts, NA, NA)),
# Max = (c(result.smb$cuts, NA, NA, NA))
Min = rep(NA, times = result.smb$ivtable %>% nrow),
Max = rep(NA, times = result.smb$ivtable %>% nrow)
)
}
} else
{ # Numeric Class
# Optimal Binning for Scoring Modeling from package `smbinning`
# This process, also known as supervised discretization, utilizes Recursive Partitioning to categorize the numeric characteristic. The especific algorithm is Conditional Inference Trees which initially excludes missing values (NA) to compute the cutpoints, adding them back later in the process for the calculation of the Information Value.
result1.smb <- smbinning(DF0, 'GB_flag', val)
if (class(result1.smb) == 'list') {
binning.df[i, 'IV-RPart'] <- result1.smb$iv
binning.df[i, 'N-RPart'] <- result1.smb$bands %>% length
}
# Custom Binning Based by cutpoints using percentiles (10% each)
if (length(NamesOfVariables) <= Max_Vars || class(result1.smb) != 'list') {
cbs1cuts <- as.vector(quantile(DF0[, val], probs=seq(0, 1, 0.1), na.rm=TRUE)) # Quantiles by 10%
cbs1cuts <- cbs1cuts[2:(length(cbs1cuts) - 1)] # Remove first (min) and last (max) values
result2.smb <- smbinning.custom(df=DF0, y = 'GB_flag', x = val, cuts = cbs1cuts)
binning.df[i, 'IV-Decile'] <- result2.smb$iv
binning.df[i, 'N-Decile'] <- result2.smb$bands %>% length
} else {
binning.df[i, 'IV-Decile'] <- 0
binning.df[i, 'N-Decile'] <- ncol(DF0) + 1
}
if (length(NamesOfVariables) <= Max_Vars ) { # & !isMicrosoftRServer
# Finer Monotonic Binning Based on Isotonic Regression - does not work with Microsoft R Server 9.3.0
result3.smb <- isobin(DF0, 'GB_flag', val)
binning.df[i, 'IV-Iso'] <- result3.smb$iv
binning.df[i, 'N-Iso'] <- result3.smb$bands %>% length
# Generates a supervised tree-like segmentation of numeric variables with respect to a binary target outcome
# result4.smb <- tree_chimergebin(DF0, 'GB_flag', val)
cbs1cuts <- tree_bin(DF0, 'GB_flag', val) # Binning via Tree-Like Segmentation
result4.smb <- smbinning.custom(df = DF0, x = val, y = 'GB_flag', cuts = cbs1cuts)
if (class(result4.smb) == 'list') {
binning.df[i, 'IV-Tree'] <- result4.smb$iv
binning.df[i, 'N-Tree'] <- result4.smb$bands %>% length
} else { # 'Not Meaningful (IV<0.1)' or 'Uniques values < 5' case
binning.df[i, 'IV-Tree'] <- 0
binning.df[i, 'N-Tree'] <- ncol(DF0)
}
} else {
binning.df[i, 'IV-Iso'] <- 0
binning.df[i, 'N-Iso'] <- ncol(DF0)
binning.df[i, 'IV-Tree'] <- 0
binning.df[i, 'N-Tree'] <- ncol(DF0)
}
# Selection of the Optimal Binning Method
if (if_else(is.na(binning.df[i, 'IV-RPart']) == TRUE, 0, binning.df[i, 'IV-RPart'] * 1.1) >
binning.df[i, 'IV-Decile']) {
binning.df[i, 'Method'] <- 'IV-RPart'
} else
{
if ( ( (binning.df[i, 'IV-Iso'] > binning.df[i, 'IV-Decile']) &
(binning.df[i, 'N-Iso'] / 1.1 < binning.df[i, 'N-Decile']) ) |
( (binning.df[i, 'IV-Iso'] * 1.1 > binning.df[i, 'IV-Decile']) &
(binning.df[i, 'N-Iso'] * 2 < binning.df[i, 'N-Decile']) ) ) {
binning.df[i, 'Method'] <- 'IV-Iso'
} else {
if (binning.df[i, 'IV-Decile'] >= binning.df[i, 'IV-Iso']) {
binning.df[i, 'Method'] <- 'IV-Decile'
} else {
binning.df[i, 'Method'] <- 'IV-Iso' }
}
} # End Else If
type <- binning.df[i, 'Method']
result.smb <-
switch(type,
`IV-RPart` = result1.smb,
`IV-Decile` = result2.smb,
`IV-Iso` = result3.smb,
`IV-Tree` = result4.smb
)
binning.df[i, 'IV'] <- result.smb$iv
# IV Table Supplement
result.smb$ivtable <- result.smb$ivtable %>%
mutate(G_Dis = CntGood / table(DF0$GB_flag)[2],
B_Dis = CntBad / table(DF0$GB_flag)[1],
`G/B Index` = if_else(G_Dis > B_Dis, G_Dis / B_Dis, B_Dis / G_Dis),
`0=Good, 1=Bad` = if_else(G_Dis > B_Dis, 0, 1),
Bin = c(1:(nrow(result.smb$ivtable) - 1), NA),
Min = c(NA, result.smb$cuts, NA, NA),
Max = c(result.smb$cuts, NA, NA, NA)
)
} # End else for numeric class
# Prepare MySQL-code for Binning and Fine Classing
# Sys.setloc <- Sys.setlocale(locale = 'Russian') # set locale to `Russian`
binning.sql <- capture.output(smbinning.sql(result.smb)) %>%
gsub("then '0", "then '", .) %>%
gsub('TableName', 'DF', .) %>%
stringr::str_replace('NewCharName', paste0(val, '_fct'))
val_fct <- paste0(' \'', val, '_fct', '\'', ' from data.frame \'DF0\'')
binning.sql <- c('select *,', paste0(' /* Inserting the new factor variable', val_fct, ' */'),
binning.sql[-c(1:3)] , paste0(' \'', val, '_fct', '\'', ' from \'DF1\''))
# Truncate overly long gradation names in a complex set of levels
theBest <- TRUE
for (j in 1:(nrow(result.smb[['ivtable']]) - 2)) {
if (str_length(binning.sql[j + 3]) > 999) {
gradation_str <- str_split( string = binning.sql[j + 3], pattern = sprintf('%s: %s ', j, val) ) %>%
unlist
binning.sql[j + 3] <-
paste0(gradation_str[1], ifelse( theBest, sprintf('%s: %s the Best\'', j, val),
sprintf('%s: %s the Worst\'', j, val) ) )
theBest <- FALSE
} else {
# print('empty')
}
}
# Appending binning.sql into TotalBinning.sql
if (i == 1) {
binning.sql[length(binning.sql)] = paste0(' \'', val, '_fct\',')
TotalBinning.sql <- binning.sql
} else {
if (i == length(NamesOfVariables)) {
TotalBinning.sql <- c(TotalBinning.sql, '', binning.sql[-1] )
} else {
TotalBinning.sql <- c(TotalBinning.sql, '', binning.sql[-c(1, length(binning.sql))],
paste0(' \'', val, '_fct\','))
}
} # End if (i == 1)
# TotalBinning.sql <- c(TotalBinning.sql, '', binning.sql)
# Preparing a Data.Frame with IV Table for Export into MS Excel
addWorksheet(wb, val, gridLines = FALSE, tabColour = ifelse(result.smb$iv >= 0.05, 'chartreuse',
ifelse(result.smb$iv >= 0.03, 'khaki', 'white')))
result.smb$ivtable[ is.na( result.smb$ivtable ) ] <- NA # Dealing with NaN's in data frames
N <- 3:(nrow(result.smb$ivtable) + 2)
result.smb$ivtable %>%
dplyr::select(Cutpoint, Bin, Min, Max, CntRec, CntGood, CntBad, G_Dis, B_Dis, Share = PctRec,
BadRate, WoE, IV,`G/B Index`, `0=Good, 1=Bad`) %>%
writeDataTable(wb, sheet = val, x = ., tableStyle = 'TableStyleMedium2', startCol = 'A',
startRow = 2, tableName = val, firstColumn = TRUE, lastColumn = FALSE, bandedRows = TRUE)
# Set Columns widths
setColWidths(wb, sheet = val, cols = 1:4, widths = c(32, 7, 10, 10))
# # Set Row heights
# setRowHeights(wb, sheet = 1, rows = 1, heights = 45)
# Set Styles & Conditional Formattings in Columns
addStyle(wb, sheet = val, style = createStyle(wrapText = TRUE, halign = 'center', valign = 'center'),
cols = 1:ncol(result.smb$ivtable), rows = 2)
addStyle(wb, sheet = val, cols = 1, rows = 1, style = createStyle(fontSize = 16, textDecoration = 'bold'))
addStyle(wb, sheet = val, cols = 1:ncol(result.smb$ivtable), rows = (nrow(result.smb$ivtable) + 2),
style = createStyle(textDecoration = 'bold'))
addStyle(wb, sheet = val, cols = 5:7, rows = N, style = createStyle(numFmt = 'COMMA'), gridExpand = TRUE)
addStyle(wb, sheet = val, cols = 8:10, rows = N, style = createStyle(numFmt = '0%'), gridExpand = TRUE)
addStyle(wb, sheet = val, cols = 11, rows = N, style = createStyle(numFmt = paste0('0', options()$OutDec, '0%'), textDecoration = 'bold'))
conditionalFormatting(wb, sheet = val, cols = 13, rows = 3:(nrow(result.smb$ivtable) + 1), type = 'databar',
border = FALSE, style = c('red', 'chartreuse'))
addStyle(wb, sheet = val, cols = 4, rows = N, style = createStyle(border = 'right', borderColour = '#4F81BC'))
addStyle(wb, sheet = val, cols = 7, rows = N, style = createStyle(numFmt = 'COMMA', border = 'right',
borderColour = '#4F81BC'))
addStyle(wb, sheet = val, cols = 10, rows = N, style = createStyle(numFmt = '0%', border = 'right',
borderColour = '#4F81BC'))
conditionalFormatting(wb, sheet = val, cols = 15, rows = 3:(nrow(result.smb$ivtable) + 1), rule ='$O3=0',
style = createStyle(fontColour = 'red', halign = 'center', valign = 'center', textDecoration = 'bold'))
conditionalFormatting(wb, sheet = val, cols = 15, rows = 3:(nrow(result.smb$ivtable) + 1), rule ='$O3>0',
style = createStyle(fontColour = 'black', halign = 'center', valign = 'center', textDecoration = 'bold'))
writeData(wb, sheet = val, val, startCol = 'A', startRow = 1)
writeData(wb, sheet = val, data.frame(binning.sql), colNames = FALSE, rowNames = FALSE,
startCol = 'A', startRow = nrow(result.smb$ivtable) + 4)
writeFormula(wb, sheet = val, startCol = 'A',
startRow = nrow(result.smb$ivtable) + 5 + length(binning.sql),
x = makeHyperlinkString(sheet = 'IV Table', row = i + 2, col = 1,
text = 'Link to IV Table'))
} # End next i
## 1 - disbursed_amount
## 2 - asset_cost
## 3 - ltv
## 4 - branch_id
## 5 - supplier_id
## 6 - manufacturer_id
## 7 - Current_pincode_ID
## 8 - Employment_Type
## 9 - State_ID
## 10 - Employee_code_ID
## 11 - Aadhar_flag
## 12 - PAN_flag
## 13 - VoterID_flag
## 14 - Driving_flag
## 15 - Passport_flag
## 16 - PERFORM_CNS_SCORE
## 17 - PERFORM_CNS_SCORE_DESCRIPTION
## 18 - PRI_NO_OF_ACCTS
## 19 - PRI_ACTIVE_ACCTS
## 20 - PRI_OVERDUE_ACCTS
## 21 - PRI_SANCTIONED_AMOUNT
## 22 - PRI_DISBURSED_AMOUNT
## 23 - SEC_NO_OF_ACCTS
## 24 - SEC_ACTIVE_ACCTS
## 25 - SEC_OVERDUE_ACCTS
## 26 - SEC_CURRENT_BALANCE
## 27 - SEC_SANCTIONED_AMOUNT
## 28 - SEC_DISBURSED_AMOUNT
## 29 - PRIMARY_INSTAL_AMT
## 30 - SEC_INSTAL_AMT
## 31 - NEW_ACCTS_IN_LAST_SIX_MONTHS
## 32 - DELINQUENT_ACCTS_IN_LAST_SIX_M
## 33 - NO_OF_INQUIRIES
## 34 - Age
## 35 - YearsSinceDisbursment
## 36 - AverageLoanTenure
## 37 - TimeSinceFirstLoan
## 38 - YearsOnLoan
## 39 - DisAsDiff
## 40 - DisAsShare
## 41 - Qrt
## 42 - Day
## 43 - OutstandingNow
## 44 - DisbursedTotal
## 45 - ShareOverdue
## 46 - SEC_OverdueShareSec
## 47 - PRI_OverdueShare
# Restore the original outcome coding (1 = default) after Optimal Binning
DF0 %<>% dplyr::mutate( GB_flag = ifelse(GB_flag == 1, 0, 1) )
# Write MySQL code for Coarse Classing Selected Variables
write_lines(x = TotalBinning.sql, path = NameMySQLCode, na = "NA", append = FALSE)
N <- 3:(nrow(binning.df) + 2)
# writeDataTable(wb, sheet = 'IV Table', x = binning.df, tableStyle = 'TableStyleMedium4', startCol = 'A',
# startRow = 2, tableName = 'IVTable', firstColumn = FALSE, lastColumn = TRUE, bandedRows = TRUE)
# Set Columns widths
setColWidths(wb, sheet = 'IV Table', cols = 1:2, widths = c(15, 12))
setColWidths(wb, sheet = 'IV Table', cols = 11, widths = c(12))
conditionalFormatting(wb, sheet = 'IV Table', cols = 2, rows = N, type = 'databar',
border = FALSE, style = c('red', 'royalblue'))
addStyle(wb, sheet = 'IV Table', cols = 2, rows = N,
style = createStyle(border = 'right', borderColour = '#9CB95C'))
addStyle(wb, sheet = 'IV Table', cols = 10, rows = N,
style = createStyle(border = 'right', borderColour = '#9CB95C'))
writeData(wb, sheet = 'IV Table', 'IV Table', startCol = 'A', startRow = 1)
addStyle(wb, sheet = 'IV Table', cols = 1, rows = 1,
style = createStyle(fontSize = 16, textDecoration = 'bold'))
for (i in 1:nrow(binning.df)) {
## Internal - Text to display
val = binning.df[i, 'Variable']
writeFormula(wb, sheet = 'IV Table', startCol = 'A', startRow = i + 2,
x = makeHyperlinkString(sheet = val, row = 1, col = 1, text = val))
}
# # Open MS Excel
# openXL(wb)
remove(NamesOfVariables, result1.smb, result2.smb, result3.smb, result4.smb, result.smb, chr_vec, # binning.df,
j, TotalBinning.sql, theBest, binning.sql, val, val_fct, i, type, cbs1cuts, N)
Create 10-20 bins/groups for a continuous independent variable and then calculate the WOE and IV of the variable.
Combine adjacent categories with similar WOE scores.
Rules related to WOE (the formulas are given below):
Each category (bin) should have at least 5% of the observations.
Each category (bin) should be non-zero for both non-events and events.
The WOE should be distinct for each category; similar groups should be aggregated.
The WOE should be monotonic, i.e. either increasing or decreasing with the groupings.
Missing values are binned separately.
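For reference, the weight of evidence of bin \(i\) and the total Information Value over all bins are computed as:
\[ \displaystyle WoE_i = \ln\left(\frac{Good_i / Good_{Total}}{Bad_i / Bad_{Total}}\right), \hspace{.5 in} IV = \sum_i \left(\frac{Good_i}{Good_{Total}} - \frac{Bad_i}{Bad_{Total}}\right) × WoE_i \]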
# use library('sqlrutils') from Microsoft Corporation
library('sqldf') # Manipulate R Data Frames Using SQL
# Apply the generated SQL code for coarse classing via sqldf
DF <- sqldf::sqldf(x = read_file( NameMySQLCode ), method = 'name__class') %>% # PRIORITY_DAYS - 'difftime'
dplyr::select(c(one_of('GB_flag'), ends_with('_fct'))) %>%
dplyr::mutate_all(as.factor)
remove(DF0, DF1)
#
# SelectedFeatures <- c('ltv_fct', 'branch_id_fct', 'supplier_id_fct', 'Current_pincode_ID_fct', 'State_ID_fct', 'Employee_code_ID_fct', 'Passport_flag_fct', 'PERFORM_CNS_SCORE_DESCRIPTION_fct', 'PRI_OVERDUE_ACCTS_fct', 'PRI_DISBURSED_AMOUNT_fct', 'SEC_CURRENT_BALANCE_fct', 'NO_OF_INQUIRIES_fct', 'Age_fct', 'YearsSinceDisbursment_fct', 'TimeSinceFirstLoan_fct')
#
# DF <- DF[, c('GB_flag', SelectedFeatures)]
Let's form an array of independent factors (predictors) and an outcome, a dependent binary factor.
It is a good idea to use a validation hold-out set. This is a sample of the data that we hold back from our analysis and modeling. We use it right at the end of our project to confirm the accuracy of our final model. It is a smoke test that we can use to see if we messed up and to give us confidence in our estimates of accuracy on unseen data.
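The chunk that defines the predictor set X and the outcome Y is not shown in this extract; a plausible reconstruction that matches the counts printed below, assuming GB_flag = 1 marks a default ('Bad'), is:
# Hypothetical reconstruction of the missing chunk
Y <- factor(DF$GB_flag, levels = c(1, 0), labels = c('Bad', 'Good'))
X <- dplyr::select(DF, -GB_flag)
table(Y[inTrain])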
##
## Bad Good
## 45557 164281
# Weights of cases to resolve the class-imbalance problem: each class gets the weight
# n_total / (2 * n_class), so the weights sum to length(weight.cases)
# (here: Good = 209838 / (2 * 164281) ≈ 0.64, Bad = 209838 / (2 * 45557) ≈ 2.30)
weight.cases <- as.numeric( Y[inTrain] )
for(val in unique(weight.cases)) {weight.cases[weight.cases==val]=
1/sum(weight.cases==val)*length(weight.cases)/2} # normalized to sum to length(samples)
# # To list all columns with missing values you might write:
# is.na(X) %>% colSums()
#
# # To list all rows with missing values you might write:
# X[!complete.cases(X), ]

\[ \displaystyle \large n = \frac{Z_{\alpha/2}^2 × p × (1\ - \ p)}{\beta^2} \hspace{.5 in} [1]\]

where \(Z_{\alpha/2}\) is the critical value of the Normal distribution at \(\alpha/2\) (e.g. for a confidence level of 98%, \(\alpha\) (Type I Error or False Positive) is 0.02, so \(\alpha/2\) = 0.01 and the critical value \(Z_{\alpha/2}\) is 2.326),
\(\beta\) is the margin of error (set to 0.05 here),
p is the sample proportion,
and n is the minimal (required) sample size.
# Population Proportion – Minimal Sample Size
# Source: Daniel WW. Biostatistics: A Foundation for Analysis in the Health Sciences. 7th edition. New York: John Wiley & Sons. (1999)
# https://select-statistics.co.uk/calculators/sample-size-calculator-population-proportion/
yName <- 'GB_flag' # Name of Outcome
yTarget <- 1 # Name of Outcome's Positive Level
writeLines(sprintf('Sampling into two unequal sample sizes with Fraction (%0.2f%%) of Training & Test Sets. \n',
length(Y[inTrain]) / length(Y) * 100))
## Sampling into two unequal sample sizes with Fraction (60.73%) of Training & Test Sets.
writeLines(sprintf('The full dataset has %g observations.
Therefore the training dataset might be %g obs. and the test dataset %g obs. \n',
length(Y), round(length(Y) * 0.70, -4), length(Y) - round(length(Y) * 0.70, -4)))
## The full dataset has 345546 observations.
## Therefore the training dataset might be 240000 obs. and the test dataset 105546 obs.
alpha = 0.05 # Type I Error or False Positive - α probability (significance level) for a two-tailed test
beta = 0.10 # Type II Error or False Negative - β probability level
BadRate <- table(Y[inTrain]) %>% .['Bad'] / length(Y[inTrain])
# MSS <- (qnorm(1 - alpha/2)^2 * BadRate * (1 - BadRate)) / (beta)^2 # BadRate *
# *********************************************************************************************************** #
#
# Power Analysis for two Proportions (different sample sizes) of The Binomial Distributions
#
# *********************************************************************************************************** #
# Fisher's exact test is a statistical significance test used for Contingency Tables (Binomial Distribution)
matrix(c(table(Y[inTrain]), table(Y[inTest])),
nrow = 2,
dimnames = list(Subjects = c('Bad', 'Good'),
Samples = c('inTrain', 'inTest'))) %T>%
print() %>%
stats::fisher.test(., alternative = 'two.sided', conf.level = 1 - alpha)
## Samples
## Subjects inTrain inTest
## Bad 45557 5054
## Good 164281 18262
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.9067
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9695789 1.0356963
## sample estimates:
## odds ratio
## 1.00203
# # Null Hypothesis Significance Testing
# pB <-
# ShowNullHypothesisSignificanceTesting(pA = BadRate,
# pT = table(Y[inTest]) %>% .['Bad'] / length(Y[inTest]),
# nA = length(Y[inTrain]),
# nB = length(Y[inTest]),
# alpha = alpha,
# beta = beta)
# Example of using `uniroot` function - https://rpubs.com/chidungkt/408264
frac <- length(Y[inTrain]) / (length(Y[inTrain]) + length(Y[inTest]))
# Define a helper: power and sample size for a two-sample binomial test
func1 <- function(x) { sizes <- Hmisc::bsamsize(p1 = BadRate, p2 = BadRate + x,
fraction = frac, alpha = alpha, power = (1 - beta))
return(sizes[2]) } # B (Control) Sample Size
# Find the smallest detectable shift in bad rate (pThreshold) for which the required
# control sample size is about 98% of the actual test set size
pThreshold <- stats::uniroot( function(x) func1(x) - (length(Y[inTest]) * 0.98), c(0, BadRate) )$root
# Estimating Sample size(s) for Two unequivalent Binomial Sample
library('pwr') # Basic Functions for Power Analysis
( estimate <- pwr::pwr.2p2n.test( h = pwr::ES.h(p1 = BadRate, # Treatment Probability of Success
p2 = BadRate + pThreshold ) # Hypothesized control probability of success
, n1 = length(Y[inTrain]) # Sample Size for Training or A (Treatment) Set
, n2 = NULL # Sample Size for Test or B (Control) Set
, sig.level = alpha # Type I Error or False Positive
, power = (1 - beta) # Power of Test (1 - Type II Error)
, alternative = 'two.sided') )
##
## difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.02259551
## n1 = 209838
## n2 = 22818.26
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: different sample sizes
MSS <- round(estimate$n2 + 0.5) # Minimal Sample Size for Test (Control) Set
pwr::plot.power.htest(estimate) +
ggplot2::annotate(geom = 'text', x = 0, y = 0, size = 3.5, hjust = 'left',
label = sprintf('Badrate in { %0.2f%%, %0.2f%% }',
(BadRate - pThreshold) * 100, (BadRate + pThreshold) * 100))
# MSS <- Hmisc::bsamsize(p1 = table(Y[inTrain]) %>% .['Bad'] / length(Y[inTrain]),
# p2 = table(Y[inTrain]) %>% .['Bad'] / length(Y[inTrain]) - 0.005,
# fraction = 0.7, alpha = alpha, power = (1 - beta)) %>%
# round(.[2] + 0.5)
writeLines(sprintf('With Treatment Probability of Success (Bad Rate on Training dataset = %.2f%%) and Bias = %0.2f%%, Significance level (1 - alpha = %.1f%%) & Power (1 - beta = %.1f%%) Minimal Sample Size of the Binomial distribution should be %g obs., but Test set is %g obs. \n',
BadRate * 100, pThreshold * 100, (1 - alpha) * 100, (1 - beta) * 100, MSS, length(Y[inTest]) ))
## With Treatment Probability of Success (Bad Rate on Training dataset = 21.71%) and Bias = 0.94%, Significance level (1 - alpha = 95.0%) & Power (1 - beta = 90.0%) Minimal Sample Size of the Binomial distribution should be 22819 obs., but Test set is 23316 obs.
# Weights of cases to resolve Class Imbalances Problem
weight.cases <- DT[inTrain, ] %>%
dplyr::pull(!!yName) %>%
as.double()
for(val in unique(weight.cases)) {weight.cases[weight.cases==val]=
1/sum(weight.cases==val)*length(weight.cases)/2} # normalized to sum to length(samples)
remove(alpha, beta, BadRate, MSS, estimate, pThreshold, frac)
Let's look at visualizations of individual attributes. It is often useful to look at your data using multiple different visualizations in order to spark ideas. Let's look at histograms of each attribute to get a sense of the data distributions.
# Density plot for each Features vs `Y`
ShowUnimodalVisualizations(X[inTrain, ], Y[inTrain], isMicrosoftRServer)
## There are no any Numeric feature in `X` data.frame.
## There are no any Factor or Character feature in `X` data.frame.
Important Points
Information value increases as the number of bins/groups for an independent variable increases. Be careful when there are more than 20 bins, as some bins may contain very few events or non-events.
Information value should not be used as a feature selection method when you are building a classification model other than binary logistic regression (for example, random forest or SVM), as it is designed for the binary logistic regression model only.
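A common rule of thumb (Siddiqi, 2006) for interpreting IV: below 0.02 the variable is not useful for prediction, 0.02 to 0.1 indicates a weak predictor, 0.1 to 0.3 a medium predictor, and above 0.3 a strong predictor.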
Let’s look at some visualizations of the interactions between outcome and predictors. The best place to start is a scatter plot matrix.
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: caret
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
## The following object is masked from 'package:tidyr':
##
## smiths
## Var1 Var2 Spearman
## 1 Y Employee_code_ID_fct.2..Employee_code_ID....153 0.0743
## 2 Y DisAsDiff_fct.5..DisAsDiff...19822 0.0679
## 3 Y Current_pincode_ID_fct.3..Current_pincode_ID....174 0.0673
## 4 Y DisAsShare_fct.7..DisAsShare...1.7649 0.0655
## 5 Y Employee_code_ID_fct.3..Employee_code_ID....170 0.0645
## 6 Y PERFORM_CNS_SCORE_fct.6..PERFORM_CNS_SCORE....824 0.0644
## 7 Y supplier_id_fct.3..supplier_id....165 0.0573
## Var1 Var2 Spearman
## 1 Y Current_pincode_ID_fct.12..Current_pincode_ID...291 -0.1252
## 2 Y Employee_code_ID_fct.15..Employee_code_ID...319 -0.1147
## 3 Y supplier_id_fct.13..supplier_id...314 -0.1057
## 4 Y OutstandingNow_fct.3..OutstandingNow....171384 -0.0907
## 5 Y Current_pincode_ID_fct.11..Current_pincode_ID....291 -0.0839
## 6 Y Employee_code_ID_fct.14..Employee_code_ID....319 -0.0823
## 7 Y supplier_id_fct.12..supplier_id....314 -0.0809
This helps point out the skew in many distributions, so pronounced that some data points look like outliers (e.g. beyond the whiskers of the plots).
Next comes feature selection (removing correlated attributes that reduce the quality of classification); note that transformations (Box-Cox or Yeo-Johnson) cannot be applied to factors.
#
# Filtering predictors - Remove Redundant Variables
#
library('caret') # Classification and Regression Training
# # 1a. Imputing Missing Value, Centering, Scaling and Transformation, for example 'YeoJohnson'
# X <- cbind(X %>% select_if(., is.numeric) %>%
# predict(preProcess( ., method = c( 'bagImpute' ) ), newdata = . ) #,
#
# # 1b. Convert factor (nominal and ordered) variables into a full set of dummy integer variables without linear dependencies induced between these predictors
# X %>% select_if(., is.factor) %>%
# data.frame(predict(dummyVars(' ~ .', data = ., fullRank = TRUE), newdata = .)) %>%
# select_if(., is.numeric)
# )
# 2. Dropping zero variance predictors: the cutoff for the ratio of the first most common value to the second value. See https://www.mql5.com/ru/articles/2029
nzv <- caret::nearZeroVar( X[inTrain, ], freqCut = 999/1 , saveMetrics= TRUE )
nzv[nzv$nzv,]
## [1] freqRatio percentUnique zeroVar nzv
## <0 rows> (or 0-length row.names)
zv_cols = caret::nearZeroVar( X[inTrain, ], freqCut = 999/1, saveMetrics = FALSE )
print( sprintf('Dropping %d zero variance predictors from %d (fraction=%10.6f)',
length(zv_cols), dim(X[inTrain, ])[2], length(zv_cols)/dim(X[inTrain, ])[2]) )
## [1] "Dropping 0 zero variance predictors from 47 (fraction=  0.000000)"
## integer(0)
# class(X) <- 'data.frame'
if ( length(zv_cols) != 0 ) {
X <- X[, -zv_cols]
}
# # 3. Centering, Scaling and Transformation or Principal Components # , 'pca'
# X[is.na(X)] <- -1 # replace the <NA>s with zeros in number columns
# X <- X %>% predict(preProcess( X[inTrain, ], method = c( 'medianImpute', 'center', 'scale', 'YeoJohnson', 'pca' ), thresh = 0.99 ), newdata = . ) # %>% # Define all logical fields into numeric type
# # spatialSign %>% data.frame
# 4. Remove NUMERIC variables with high correlation (> .80) to others (multicollinearity)
cor.matrix <- cor( sapply( X[inTrain, ], function(x)
{ as.numeric(x) } ) )
cor.high <- caret::findCorrelation(cor.matrix, cutoff = 0.80, verbose = TRUE, names = FALSE, exact = TRUE)
## Compare row 18 and column 19 with corr 0.869
## Means: 0.239 vs 0.104 so flagging column 18
## Compare row 37 and column 36 with corr 0.843
## Means: 0.211 vs 0.098 so flagging column 37
## Compare row 44 and column 43 with corr 0.871
## Means: 0.206 vs 0.094 so flagging column 44
## Compare row 36 and column 47 with corr 0.846
## Means: 0.186 vs 0.088 so flagging column 36
## Compare row 47 and column 16 with corr 0.883
## Means: 0.162 vs 0.084 so flagging column 47
## Compare row 21 and column 22 with corr 0.99
## Means: 0.132 vs 0.081 so flagging column 21
## Compare row 31 and column 45 with corr 0.949
## Means: 0.134 vs 0.079 so flagging column 31
## Compare row 23 and column 24 with corr 0.803
## Means: 0.143 vs 0.075 so flagging column 23
## Compare row 24 and column 27 with corr 0.989
## Means: 0.129 vs 0.072 so flagging column 24
## Compare row 27 and column 28 with corr 0.995
## Means: 0.106 vs 0.069 so flagging column 27
## Compare row 28 and column 26 with corr 0.933
## Means: 0.081 vs 0.068 so flagging column 28
## Compare row 40 and column 39 with corr 0.812
## Means: 0.102 vs 0.067 so flagging column 40
## Compare row 34 and column 38 with corr 0.996
## Means: 0.082 vs 0.065 so flagging column 34
## Compare row 11 and column 13 with corr 0.869
## Means: 0.093 vs 0.064 so flagging column 11
## Compare row 35 and column 41 with corr 0.824
## Means: 0.064 vs 0.063 so flagging column 35
## All correlations <= 0.8
high.cor.remove <- row.names(cor.matrix)[cor.high]
print( sprintf('Dropping %d predictors due to high correlation to others (multicollinearity) %d (fraction=%10.6f)',
length(high.cor.remove), dim(X[inTrain, ])[2], length(high.cor.remove)/dim(X[inTrain, ])[2]) )
## [1] "Dropping 15 predictors due to high correlation to others (multicollinearity) 47 (fraction= 0.319149)"
## [1] "PRI_NO_OF_ACCTS_fct" "TimeSinceFirstLoan_fct"
## [3] "DisbursedTotal_fct" "AverageLoanTenure_fct"
## [5] "PRI_OverdueShare_fct" "PRI_SANCTIONED_AMOUNT_fct"
## [7] "NEW_ACCTS_IN_LAST_SIX_MONTHS_fct" "SEC_NO_OF_ACCTS_fct"
## [9] "SEC_ACTIVE_ACCTS_fct" "SEC_SANCTIONED_AMOUNT_fct"
## [11] "SEC_DISBURSED_AMOUNT_fct" "DisAsShare_fct"
## [13] "Age_fct" "Aadhar_flag_fct"
## [15] "YearsSinceDisbursment_fct"
if (length(high.cor.remove) != 0) {
X <- X[, -cor.high]
}
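A toy sketch (hypothetical data) of how findCorrelation() behaves: x2 is nearly a copy of x1, so at cutoff = 0.80 one of the pair is flagged for removal while the independent x3 is kept:
set.seed(2)
x1 <- rnorm(500); x2 <- x1 + rnorm(500, sd = 0.1); x3 <- rnorm(500)
caret::findCorrelation(cor(cbind(x1, x2, x3)), cutoff = 0.80) # flags one of x1/x2 (their correlation is ~0.99)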
# Clean up helper objects that are no longer needed
remove(nzv, zv_cols, cor.matrix, cor.high, high.cor.remove)
# Remaining Variables
print( sprintf('Remaining %d Variables', dim(X)[2]) )
## [1] "Remaining 32 Variables"
## [1] "disbursed_amount_fct"
## [2] "asset_cost_fct"
## [3] "ltv_fct"
## [4] "branch_id_fct"
## [5] "supplier_id_fct"
## [6] "manufacturer_id_fct"
## [7] "Current_pincode_ID_fct"
## [8] "Employment_Type_fct"
## [9] "State_ID_fct"
## [10] "Employee_code_ID_fct"
## [11] "PAN_flag_fct"
## [12] "VoterID_flag_fct"
## [13] "Driving_flag_fct"
## [14] "Passport_flag_fct"
## [15] "PERFORM_CNS_SCORE_fct"
## [16] "PERFORM_CNS_SCORE_DESCRIPTION_fct"
## [17] "PRI_ACTIVE_ACCTS_fct"
## [18] "PRI_OVERDUE_ACCTS_fct"
## [19] "PRI_DISBURSED_AMOUNT_fct"
## [20] "SEC_OVERDUE_ACCTS_fct"
## [21] "SEC_CURRENT_BALANCE_fct"
## [22] "PRIMARY_INSTAL_AMT_fct"
## [23] "SEC_INSTAL_AMT_fct"
## [24] "DELINQUENT_ACCTS_IN_LAST_SIX_M_fct"
## [25] "NO_OF_INQUIRIES_fct"
## [26] "YearsOnLoan_fct"
## [27] "DisAsDiff_fct"
## [28] "Qrt_fct"
## [29] "Day_fct"
## [30] "OutstandingNow_fct"
## [31] "ShareOverdue_fct"
## [32] "SEC_OverdueShareSec_fct"
Several methods for computing feature importance are built into FSelectorRcpp's function information_gain(). I recommend this fast yet effective entropy-based implementation, written in C++, from the FSelectorRcpp package.
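For reference, the 'infogain' score of an attribute X is its mutual information with the outcome, i.e. the reduction in the entropy of Y once X is known:
\[ \displaystyle \large IG(Y, X) = H(Y) - H(Y \mid X), \qquad H(Y) = -\sum_{i} p_i \log p_i \]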
if (ncol(X) <= Max_Vars) {
library('FSelectorRcpp') # 'Rcpp' Implementation of 'FSelector' Entropy-Based Feature Selection Algorithms with a Sparse Matrix Support
library('parallel') # Support for Parallel computation in R
ncore <- parallel::detectCores()
(cl = parallel::makeCluster(ncore))
# Entropy-Based Feature Selection Algorithms with a Sparse Matrix Support
Entropy_Based_Features <-
FSelectorRcpp::information_gain( # Calculate the score for each attribute
formula = as.formula('Y ~ .'), # that is on the right side of the formula.
data = cbind(Y = Y, X)[inTrain, ], # Attributes must exist in the passed data.
type = 'infogain', # Choose the type of a score to be calculated.
threads = ncore # Set number of threads in a parallel backend.
) %>%
dplyr::arrange(-importance) %>% # Sort by entropy-based importance, descending.
dplyr::slice(1:Max_Vars) # Keep the first `Max_Vars` features.
knitr::kable(Entropy_Based_Features, caption = sprintf('Selection of the %g Important Variables vs. `%s` outcome', Max_Vars, yName))
# utils::writeClipboard(Entropy_Based_Features[['attributes']])
X <- X[, Entropy_Based_Features$attributes]
if(!is.null(cl)) {
parallel::stopCluster(cl)
cl = NULL
}
remove(Entropy_Based_Features)
}
The logit in logistic regression is a special case of a link function in a Generalized Linear Model (GLM): it is the canonical link function for the Bernoulli distribution.
The logistic model is usually represented as:
\[ \displaystyle \large \pi(Y)=\frac{\exp(\beta_0+\beta_1X)}{1+\exp(\beta_0+\beta_1X)} \hspace{.5 in} [2]\]
or, taking the log-odds, in the familiar linear-model form:
\[ \displaystyle \large \ln\left(\frac{\pi(Y)}{1-\pi(Y)}\right)=\beta_0+\beta_1X \hspace{.5 in} [3]\]
Inverting the logit yields the probability directly, in vector notation:
\[ \displaystyle \large \Pr(Y=1 \mid X) = [1 + e^{-X'\beta}]^{-1} \hspace{.5 in} [4]\]
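Before fitting with RevoScaleR, a minimal base-R analogue (simulated data) makes the link explicit: glm() with family = binomial uses the logit by default, and plogis() is exactly the inverse logit of equation [4]:
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 + 1.2 * x)) # Pr(Y = 1 | x) via the inverse logit
fit <- glm(y ~ x, family = binomial(link = 'logit')) # the canonical link
coef(fit) # estimates close to beta0 = 0.5 and beta1 = 1.2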
# Train Logistic Regression Model with limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization
# Microsoft Machine Learning Example - https://docs.microsoft.com/en-us/machine-learning-server/r/sample-solutions#loan-credit-risk
writeLines('\n Logit-Model on Training Set')
##
## Logit-Model on Training Set
# The named list containing components upper and lower, both formulae, defining the range of models to be examined in the stepwise search
scope <- list(
lower = ~ supplier_id_fct,
upper = as.formula( paste( '~', paste(names(X), collapse = ' + ') ) ) )
## rxLogit / variableSelection
varsel <-
RevoScaleR::rxStepControl(method = 'stepwise'
, keepStepCoefs = TRUE # flag specifying whether or not to keep the model coefficients at each step
, scope = scope
, stepCriterion = 'SigLevel' ) # significance level, the traditional stepwise approach in SAS
# # Ranking Features by Importance using rx (Random Forest)
# rf_model <-
# RevoScaleR::rxDForest( formula = paste0('Y ~ ', paste(colnames(X), collapse = '+'))
# , seed = seed
# , importance = TRUE
# , pweights = 'Pweights'
# , data = cbind(Y = Y[inTrain], X[inTrain, ], Pweights = weight.cases) )
# View(rf_model$importance )
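The fit below passes probability weights via pweights = 'Pweights'. The weight.cases vector is constructed earlier in the document; purely as a hedged illustration (an assumption, not the author's actual construction), inverse class-frequency weights for an imbalanced 0/1 outcome could look like this:
# Hypothetical sketch only - `weight.cases` may be defined differently above
tab <- table(Y[inTrain]) # class counts on the training split
weight.cases.sketch <- as.numeric( 1 / tab[as.character(Y[inTrain])] ) # rarer class gets larger weight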
system.time(
rxLogitFit <- # Create a Logistic Regression Model
RevoScaleR::rxLogit(
formula = paste('Y ~ ', paste(names(X), collapse = '+'))
# , variableSelection = varsel
, pweights = 'Pweights'
, data = cbind(Y = Y[inTrain], X[inTrain, ], Pweights = weight.cases)
, reportProgress = 1 # the number of processed rows is printed and updated
, verbose = 1)
)
## **** Computing starting values:
## Rows Processed: 209838
## Rows Processed: 209838
## **** Scoring iteration #3:
## Deviance: 201058.0735
## Rows Processed: 209838
## **** Scoring iteration #4:
## Deviance: 199961.8653
## Rows Processed: 209838
## **** Scoring iteration #5:
## Deviance: 199949.6311
## Rows Processed: 209838
## **** Scoring iteration #6:
## Deviance: 199949.6240
## Rows Processed: 209838
##
##
## Logistic Regression Results for: Y~Employee_code_ID_fct+Current_pincode_ID_fct+supplier_id_fct+branch_id_fct+ltv_fct+PERFORM_CNS_SCORE_fct+disbursed_amount_fct+OutstandingNow_fct+PERFORM_CNS_SCORE_DESCRIPTION_fct+State_ID_fct+DisAsDiff_fct+PRI_DISBURSED_AMOUNT_fct+PRI_OVERDUE_ACCTS_fct+manufacturer_id_fct+VoterID_flag_fct+ShareOverdue_fct+PRI_ACTIVE_ACCTS_fct+Day_fct+NO_OF_INQUIRIES_fct+DELINQUENT_ACCTS_IN_LAST_SIX_M_fct+Qrt_fct+YearsOnLoan_fct+Employment_Type_fct+PRIMARY_INSTAL_AMT_fct+asset_cost_fct+SEC_OverdueShareSec_fct+SEC_CURRENT_BALANCE_fct+Passport_flag_fct+Driving_flag_fct+SEC_INSTAL_AMT_fct+PAN_flag_fct+SEC_OVERDUE_ACCTS_fct
## ************************************************************************************************************************
## Dependent Variable: Y
## Total independent variables: 163 (Including number dropped: 33)
## Number of valid observations: 209838
## -2*LogLikelihood: 199950 (Residual Deviance on 209708 degrees of freedom)
## Row Coeffs. Value Std. Error t Value Pr(>|t|)
## [ 1,.] (Intercept) 0.7391 0.2000 3.6958 0.0002
## [ 2,.] Employee_code_ID_fct=1: Employee_code_ID <= 134 0.7657 0.0427 17.9335 0.0000
## [ 3,.] Employee_code_ID_fct=10: Employee_code_ID <= 242 -0.0770 0.0303 -2.5387 0.0111
## [ 4,.] Employee_code_ID_fct=11: Employee_code_ID <= 254 -0.1015 0.0296 -3.4293 0.0006
## [ 5,.] Employee_code_ID_fct=12: Employee_code_ID <= 270 -0.1436 0.0297 -4.8288 0.0000
## [ 6,.] Employee_code_ID_fct=13: Employee_code_ID <= 289 -0.2340 0.0293 -7.9857 0.0000
## [ 7,.] Employee_code_ID_fct=14: Employee_code_ID <= 319 -0.2577 0.0286 -9.0069 0.0000
## [ 8,.] Employee_code_ID_fct=15: Employee_code_ID > 319 -0.4823 0.0320 -15.0504 0.0000
## [ 9,.] Employee_code_ID_fct=2: Employee_code_ID <= 153 0.5028 0.0339 14.8256 0.0000
## [ 10,.] Employee_code_ID_fct=3: Employee_code_ID <= 170 0.4005 0.0298 13.4511 0.0000
## [ 11,.] Employee_code_ID_fct=4: Employee_code_ID <= 179 0.2856 0.0337 8.4796 0.0000
## [ 12,.] Employee_code_ID_fct=5: Employee_code_ID <= 188 0.2518 0.0318 7.9085 0.0000
## [ 13,.] Employee_code_ID_fct=6: Employee_code_ID <= 199 0.1600 0.0297 5.3933 0.0000
## [ 14,.] Employee_code_ID_fct=7: Employee_code_ID <= 211 0.0937 0.0291 3.2206 0.0013
## [ 15,.] Employee_code_ID_fct=8: Employee_code_ID <= 221 0.0284 0.0298 0.9526 0.3408
## [ 16,.] Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped Dropped Dropped 0.0000
## [ 17,.] Current_pincode_ID_fct=1: Current_pincode_ID <= 143 0.8572 0.0388 22.0715 0.0000
## [ 18,.] Current_pincode_ID_fct=10: Current_pincode_ID <= 257 -0.1809 0.0245 -7.3807 0.0000
## [ 19,.] Current_pincode_ID_fct=11: Current_pincode_ID <= 291 -0.2728 0.0236 -11.5684 0.0000
## [ 20,.] Current_pincode_ID_fct=12: Current_pincode_ID > 291 -0.3917 0.0259 -15.1180 0.0000
## [ 21,.] Current_pincode_ID_fct=2: Current_pincode_ID <= 158 0.7512 0.0370 20.3062 0.0000
## [ 22,.] Current_pincode_ID_fct=3: Current_pincode_ID <= 174 0.5712 0.0278 20.5606 0.0000
## [ 23,.] Current_pincode_ID_fct=4: Current_pincode_ID <= 188 0.4876 0.0269 18.1216 0.0000
## [ 24,.] Current_pincode_ID_fct=5: Current_pincode_ID <= 201 0.3985 0.0244 16.3081 0.0000
## [ 25,.] Current_pincode_ID_fct=6: Current_pincode_ID <= 212 0.3153 0.0253 12.4573 0.0000
## [ 26,.] Current_pincode_ID_fct=7: Current_pincode_ID <= 219 0.1621 0.0295 5.4903 0.0000
## [ 27,.] Current_pincode_ID_fct=8: Current_pincode_ID <= 225 0.1075 0.0296 3.6285 0.0003
## [ 28,.] Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped Dropped Dropped 0.0000
## [ 29,.] supplier_id_fct=1: supplier_id <= 133 0.3345 0.0425 7.8740 0.0000
## [ 30,.] supplier_id_fct=10: supplier_id <= 253 -0.0792 0.0270 -2.9305 0.0034
## [ 31,.] supplier_id_fct=11: supplier_id <= 275 -0.0973 0.0248 -3.9244 0.0001
## [ 32,.] supplier_id_fct=12: supplier_id <= 314 -0.1325 0.0259 -5.1236 0.0000
## [ 33,.] supplier_id_fct=13: supplier_id > 314 -0.2318 0.0306 -7.5686 0.0000
## [ 34,.] supplier_id_fct=2: supplier_id <= 149 0.3415 0.0373 9.1609 0.0000
## [ 35,.] supplier_id_fct=3: supplier_id <= 165 0.2308 0.0311 7.4297 0.0000
## [ 36,.] supplier_id_fct=4: supplier_id <= 178 0.1855 0.0288 6.4352 0.0000
## [ 37,.] supplier_id_fct=5: supplier_id <= 196 0.1510 0.0256 5.9035 0.0000
## [ 38,.] supplier_id_fct=6: supplier_id <= 206 0.1296 0.0287 4.5106 0.0000
## [ 39,.] supplier_id_fct=7: supplier_id <= 214 0.1039 0.0304 3.4171 0.0006
## [ 40,.] supplier_id_fct=8: supplier_id <= 225 0.0651 0.0260 2.5088 0.0121
## [ 41,.] supplier_id_fct=9: supplier_id <= 240 Dropped Dropped Dropped 0.0000
## [ 42,.] branch_id_fct=1: branch_id <= 153 -0.4503 0.0410 -10.9721 0.0000
## [ 43,.] branch_id_fct=10: branch_id <= 284 0.0550 0.0304 1.8089 0.0705
## [ 44,.] branch_id_fct=11: branch_id > 284 0.0527 0.0414 1.2718 0.2034
## [ 45,.] branch_id_fct=2: branch_id <= 174 -0.4923 0.0384 -12.8176 0.0000
## [ 46,.] branch_id_fct=3: branch_id <= 184 -0.3828 0.0317 -12.0901 0.0000
## [ 47,.] branch_id_fct=4: branch_id <= 198 -0.3148 0.0281 -11.2160 0.0000
## [ 48,.] branch_id_fct=5: branch_id <= 214 -0.2090 0.0319 -6.5462 0.0000
## [ 49,.] branch_id_fct=6: branch_id <= 222 -0.1907 0.0332 -5.7400 0.0000
## [ 50,.] branch_id_fct=7: branch_id <= 233 -0.0964 0.0316 -3.0471 0.0023
## [ 51,.] branch_id_fct=8: branch_id <= 261 -0.2644 0.0302 -8.7606 0.0000
## [ 52,.] branch_id_fct=9: branch_id <= 276 Dropped Dropped Dropped 0.0000
## [ 53,.] ltv_fct=1: ltv <= 55.63 0.6741 0.0483 13.9558 0.0000
## [ 54,.] ltv_fct=10: ltv <= 84.57 -0.1191 0.0296 -4.0282 0.0001
## [ 55,.] ltv_fct=11: ltv <= 85 -0.2560 0.0299 -8.5656 0.0000
## [ 56,.] ltv_fct=12: ltv <= 87.8 -0.1109 0.0328 -3.3790 0.0007
## [ 57,.] ltv_fct=13: ltv <= 89.3 -0.2617 0.0335 -7.8184 0.0000
## [ 58,.] ltv_fct=14: ltv > 89.3 -0.3100 0.0333 -9.3079 0.0000
## [ 59,.] ltv_fct=2: ltv <= 62.22 0.5654 0.0424 13.3198 0.0000
## [ 60,.] ltv_fct=3: ltv <= 68.34 0.3978 0.0373 10.6611 0.0000
## [ 61,.] ltv_fct=4: ltv <= 72.9301 0.2696 0.0334 8.0815 0.0000
## [ 62,.] ltv_fct=5: ltv <= 74.31 0.1395 0.0346 4.0289 0.0001
## [ 63,.] ltv_fct=6: ltv <= 75 0.0904 0.0336 2.6889 0.0072
## [ 64,.] ltv_fct=7: ltv <= 77.39 0.1953 0.0306 6.3806 0.0000
## [ 65,.] ltv_fct=8: ltv <= 78.92 0.0634 0.0287 2.2089 0.0272
## [ 66,.] ltv_fct=9: ltv <= 83.34 Dropped Dropped Dropped 0.0000
## [ 67,.] PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 -0.1592 0.0722 -2.2036 0.0276
## [ 68,.] PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 -0.0610 0.0589 -1.0345 0.3009
## [ 69,.] PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 -0.2494 0.0714 -3.4913 0.0005
## [ 70,.] PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 -0.1389 0.0618 -2.2491 0.0245
## [ 71,.] PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 -0.0788 0.0409 -1.9274 0.0539
## [ 72,.] PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 0.1750 0.0449 3.8987 0.0001
## [ 73,.] PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped Dropped Dropped 0.0000
## [ 74,.] disbursed_amount_fct=1: disbursed_amount <= 39134 0.0393 0.0477 0.8234 0.4103
## [ 75,.] disbursed_amount_fct=2: disbursed_amount <= 43615 0.0841 0.0422 1.9918 0.0464
## [ 76,.] disbursed_amount_fct=3: disbursed_amount <= 48555 0.0864 0.0309 2.7959 0.0052
## [ 77,.] disbursed_amount_fct=4: disbursed_amount <= 51908 0.0959 0.0248 3.8738 0.0001
## [ 78,.] disbursed_amount_fct=5: disbursed_amount <= 55400 0.0522 0.0191 2.7370 0.0062
## [ 79,.] disbursed_amount_fct=6: disbursed_amount > 55400 Dropped Dropped Dropped 0.0000
## [ 80,.] OutstandingNow_fct=1: OutstandingNow <= 44402 -0.0596 0.0582 -1.0234 0.3061
## [ 81,.] OutstandingNow_fct=2: OutstandingNow <= 50314 -0.1223 0.0545 -2.2443 0.0248
## [ 82,.] OutstandingNow_fct=3: OutstandingNow <= 171384 -0.1834 0.0485 -3.7823 0.0002
## [ 83,.] OutstandingNow_fct=4: OutstandingNow <= 324324 -0.0857 0.0429 -1.9988 0.0456
## [ 84,.] OutstandingNow_fct=5: OutstandingNow <= 746271 -0.1138 0.0390 -2.9159 0.0035
## [ 85,.] OutstandingNow_fct=6: OutstandingNow > 746271 Dropped Dropped Dropped 0.0000
## [ 86,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 0.3537 0.0632 5.5963 0.0000
## [ 87,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 0.3311 0.0616 5.3773 0.0000
## [ 88,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 0.2451 0.0500 4.9071 0.0000
## [ 89,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 0.2198 0.0622 3.5352 0.0004
## [ 90,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.0049 0.0358 0.1370 0.8910
## [ 91,.] PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped Dropped Dropped 0.0000
## [ 92,.] State_ID_fct=1: State_ID <= 183 -0.0597 0.0494 -1.2078 0.2271
## [ 93,.] State_ID_fct=2: State_ID <= 188 -0.0293 0.0464 -0.6329 0.5268
## [ 94,.] State_ID_fct=3: State_ID <= 206 -0.0022 0.0432 -0.0499 0.9602
## [ 95,.] State_ID_fct=4: State_ID <= 214 0.0796 0.0425 1.8719 0.0612
## [ 96,.] State_ID_fct=5: State_ID <= 220 0.1009 0.0497 2.0304 0.0423
## [ 97,.] State_ID_fct=6: State_ID <= 229 0.0639 0.0512 1.2496 0.2115
## [ 98,.] State_ID_fct=7: State_ID <= 272 0.0847 0.0445 1.9050 0.0568
## [ 99,.] State_ID_fct=8: State_ID > 272 Dropped Dropped Dropped 0.0000
## [ 100,.] DisAsDiff_fct=1: DisAsDiff <= 13554 -0.1302 0.0396 -3.2854 0.0010
## [ 101,.] DisAsDiff_fct=2: DisAsDiff <= 15670 -0.0822 0.0342 -2.4020 0.0163
## [ 102,.] DisAsDiff_fct=3: DisAsDiff <= 16661 0.0062 0.0359 0.1722 0.8633
## [ 103,.] DisAsDiff_fct=4: DisAsDiff <= 19822 -0.0175 0.0254 -0.6869 0.4921
## [ 104,.] DisAsDiff_fct=5: DisAsDiff > 19822 Dropped Dropped Dropped 0.0000
## [ 105,.] PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 -0.2560 0.0387 -6.6095 0.0000
## [ 106,.] PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped Dropped Dropped 0.0000
## [ 107,.] PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 0.1444 0.0287 5.0372 0.0000
## [ 108,.] PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped Dropped Dropped 0.0000
## [ 109,.] manufacturer_id_fct=1: manufacturer_id <= 210 0.1232 0.0239 5.1589 0.0000
## [ 110,.] manufacturer_id_fct=2: manufacturer_id <= 221 -0.0035 0.0278 -0.1266 0.8993
## [ 111,.] manufacturer_id_fct=3: manufacturer_id <= 228 0.0977 0.0261 3.7512 0.0002
## [ 112,.] manufacturer_id_fct=4: manufacturer_id > 228 Dropped Dropped Dropped 0.0000
## [ 113,.] VoterID_flag_fct=1: VoterID_flag = 0 0.0935 0.0195 4.8004 0.0000
## [ 114,.] VoterID_flag_fct=2: VoterID_flag = 1 Dropped Dropped Dropped 0.0000
## [ 115,.] ShareOverdue_fct=1: ShareOverdue <= -2 0.0923 0.0318 2.8997 0.0037
## [ 116,.] ShareOverdue_fct=2: ShareOverdue <= -1 0.1288 0.0233 5.5417 0.0000
## [ 117,.] ShareOverdue_fct=3: ShareOverdue > -1 Dropped Dropped Dropped 0.0000
## [ 118,.] PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 -0.2271 0.0443 -5.1314 0.0000
## [ 119,.] PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 -0.2429 0.0334 -7.2669 0.0000
## [ 120,.] PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 -0.2096 0.0288 -7.2732 0.0000
## [ 121,.] PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped Dropped Dropped 0.0000
## [ 122,.] Day_fct=1: Day <= 28 0.2494 0.0213 11.7055 0.0000
## [ 123,.] Day_fct=2: Day <= 30 0.1283 0.0260 4.9301 0.0000
## [ 124,.] Day_fct=3: Day > 30 Dropped Dropped Dropped 0.0000
## [ 125,.] NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 0.2732 0.0170 16.0551 0.0000
## [ 126,.] NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped Dropped Dropped 0.0000
## [ 127,.] DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 0.2820 0.0244 11.5364 0.0000
## [ 128,.] DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped Dropped Dropped 0.0000
## [ 129,.] Qrt_fct=1: Qrt = 3 0.2191 0.0118 18.5775 0.0000
## [ 130,.] Qrt_fct=2: Qrt = 4 Dropped Dropped Dropped 0.0000
## [ 131,.] YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 -0.3841 0.0306 -12.5451 0.0000
## [ 132,.] YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 -0.2784 0.0266 -10.4683 0.0000
## [ 133,.] YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 -0.1799 0.0259 -6.9531 0.0000
## [ 134,.] YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 -0.0930 0.0265 -3.5125 0.0004
## [ 135,.] YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped Dropped Dropped 0.0000
## [ 136,.] Employment_Type_fct=1: Employment_Type = 203 0.1524 0.0123 12.4431 0.0000
## [ 137,.] Employment_Type_fct=2: Employment_Type = 215 0.2166 0.0347 6.2474 0.0000
## [ 138,.] Employment_Type_fct=3: Employment_Type = 227 Dropped Dropped Dropped 0.0000
## [ 139,.] PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 -0.0180 0.0325 -0.5519 0.5810
## [ 140,.] PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 -0.2560 0.0375 -6.8204 0.0000
## [ 141,.] PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 -0.1442 0.0387 -3.7226 0.0002
## [ 142,.] PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 -0.1437 0.0325 -4.4164 0.0000
## [ 143,.] PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped Dropped Dropped 0.0000
## [ 144,.] asset_cost_fct=1: asset_cost <= 60098 0.0726 0.0457 1.5873 0.1124
## [ 145,.] asset_cost_fct=2: asset_cost <= 70561 0.1494 0.0296 5.0467 0.0000
## [ 146,.] asset_cost_fct=3: asset_cost <= 85738 0.1776 0.0224 7.9199 0.0000
## [ 147,.] asset_cost_fct=4: asset_cost > 85738 Dropped Dropped Dropped 0.0000
## [ 148,.] SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 -0.0599 0.0932 -0.6423 0.5206
## [ 149,.] SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null -0.0752 0.1024 -0.7339 0.4630
## [ 150,.] SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 0.2864 0.2454 1.1672 0.2431
## [ 151,.] SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped Dropped Dropped 0.0000
## [ 152,.] SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 -0.0239 0.0779 -0.3071 0.7588
## [ 153,.] SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped Dropped Dropped 0.0000
## [ 154,.] Passport_flag_fct=1: Passport_flag = 0 -0.1880 0.1371 -1.3710 0.1704
## [ 155,.] Passport_flag_fct=2: Passport_flag = 1 Dropped Dropped Dropped 0.0000
## [ 156,.] Driving_flag_fct=1: Driving_flag = 0 0.0159 0.0384 0.4145 0.6785
## [ 157,.] Driving_flag_fct=2: Driving_flag = 1 Dropped Dropped Dropped 0.0000
## [ 158,.] SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.0609 0.0778 0.7835 0.4334
## [ 159,.] SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped Dropped Dropped 0.0000
## [ 160,.] PAN_flag_fct=1: PAN_flag = 0 -0.0460 0.0230 -1.9976 0.0458
## [ 161,.] PAN_flag_fct=2: PAN_flag = 1 Dropped Dropped Dropped 0.0000
## [ 162,.] SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped Dropped Dropped 0.0000
## [ 163,.] SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped Dropped Dropped 0.0000
## Condition number of final VC matrix: 1997.7639
## ************************************************************************************************************************
##
## user system elapsed
## 0.14 0.01 3.95
## Call:
## RevoScaleR::rxLogit(formula = paste("Y ~ ", paste(names(X), collapse = "+")),
## data = cbind(Y = Y[inTrain], X[inTrain, ], Pweights = weight.cases),
## pweights = "Pweights", reportProgress = 1, verbose = 1)
##
## Logistic Regression Results for: Y ~
## Employee_code_ID_fct+Current_pincode_ID_fct+supplier_id_fct+branch_id_fct+ltv_fct+PERFORM_CNS_SCORE_fct+disbursed_amount_fct+OutstandingNow_fct+PERFORM_CNS_SCORE_DESCRIPTION_fct+State_ID_fct+DisAsDiff_fct+PRI_DISBURSED_AMOUNT_fct+PRI_OVERDUE_ACCTS_fct+manufacturer_id_fct+VoterID_flag_fct+ShareOverdue_fct+PRI_ACTIVE_ACCTS_fct+Day_fct+NO_OF_INQUIRIES_fct+DELINQUENT_ACCTS_IN_LAST_SIX_M_fct+Qrt_fct+YearsOnLoan_fct+Employment_Type_fct+PRIMARY_INSTAL_AMT_fct+asset_cost_fct+SEC_OverdueShareSec_fct+SEC_CURRENT_BALANCE_fct+Passport_flag_fct+Driving_flag_fct+SEC_INSTAL_AMT_fct+PAN_flag_fct+SEC_OVERDUE_ACCTS_fct
## Data: cbind(Y = Y[inTrain], X[inTrain, ], Pweights = weight.cases)
## Dependent variable(s): Y
## Total independent variables: 163 (Including number dropped: 33)
## Number of valid observations: 209838
## Number of missing observations: 0
## -2*LogLikelihood: 199949.624 (Residual deviance on 209708 degrees of freedom)
##
## Coefficients:
## Estimate
## (Intercept) 0.739128
## Employee_code_ID_fct=1: Employee_code_ID <= 134 0.765665
## Employee_code_ID_fct=10: Employee_code_ID <= 242 -0.077044
## Employee_code_ID_fct=11: Employee_code_ID <= 254 -0.101471
## Employee_code_ID_fct=12: Employee_code_ID <= 270 -0.143575
## Employee_code_ID_fct=13: Employee_code_ID <= 289 -0.234039
## Employee_code_ID_fct=14: Employee_code_ID <= 319 -0.257740
## Employee_code_ID_fct=15: Employee_code_ID > 319 -0.482348
## Employee_code_ID_fct=2: Employee_code_ID <= 153 0.502839
## Employee_code_ID_fct=3: Employee_code_ID <= 170 0.400538
## Employee_code_ID_fct=4: Employee_code_ID <= 179 0.285620
## Employee_code_ID_fct=5: Employee_code_ID <= 188 0.251824
## Employee_code_ID_fct=6: Employee_code_ID <= 199 0.160031
## Employee_code_ID_fct=7: Employee_code_ID <= 211 0.093653
## Employee_code_ID_fct=8: Employee_code_ID <= 221 0.028377
## Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 0.857166
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 -0.180906
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 -0.272803
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 -0.391739
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 0.751176
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 0.571246
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 0.487634
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 0.398476
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 0.315290
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 0.162143
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 0.107461
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped
## supplier_id_fct=1: supplier_id <= 133 0.334523
## supplier_id_fct=10: supplier_id <= 253 -0.079193
## supplier_id_fct=11: supplier_id <= 275 -0.097272
## supplier_id_fct=12: supplier_id <= 314 -0.132506
## supplier_id_fct=13: supplier_id > 314 -0.231812
## supplier_id_fct=2: supplier_id <= 149 0.341500
## supplier_id_fct=3: supplier_id <= 165 0.230836
## supplier_id_fct=4: supplier_id <= 178 0.185501
## supplier_id_fct=5: supplier_id <= 196 0.150963
## supplier_id_fct=6: supplier_id <= 206 0.129583
## supplier_id_fct=7: supplier_id <= 214 0.103929
## supplier_id_fct=8: supplier_id <= 225 0.065137
## supplier_id_fct=9: supplier_id <= 240 Dropped
## branch_id_fct=1: branch_id <= 153 -0.450331
## branch_id_fct=10: branch_id <= 284 0.055049
## branch_id_fct=11: branch_id > 284 0.052659
## branch_id_fct=2: branch_id <= 174 -0.492323
## branch_id_fct=3: branch_id <= 184 -0.382755
## branch_id_fct=4: branch_id <= 198 -0.314757
## branch_id_fct=5: branch_id <= 214 -0.208958
## branch_id_fct=6: branch_id <= 222 -0.190663
## branch_id_fct=7: branch_id <= 233 -0.096355
## branch_id_fct=8: branch_id <= 261 -0.264450
## branch_id_fct=9: branch_id <= 276 Dropped
## ltv_fct=1: ltv <= 55.63 0.674092
## ltv_fct=10: ltv <= 84.57 -0.119055
## ltv_fct=11: ltv <= 85 -0.256004
## ltv_fct=12: ltv <= 87.8 -0.110868
## ltv_fct=13: ltv <= 89.3 -0.261671
## ltv_fct=14: ltv > 89.3 -0.310022
## ltv_fct=2: ltv <= 62.22 0.565406
## ltv_fct=3: ltv <= 68.34 0.397794
## ltv_fct=4: ltv <= 72.9301 0.269590
## ltv_fct=5: ltv <= 74.31 0.139500
## ltv_fct=6: ltv <= 75 0.090372
## ltv_fct=7: ltv <= 77.39 0.195296
## ltv_fct=8: ltv <= 78.92 0.063365
## ltv_fct=9: ltv <= 83.34 Dropped
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 -0.159161
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 -0.060971
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 -0.249354
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 -0.138944
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 -0.078764
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 0.175027
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped
## disbursed_amount_fct=1: disbursed_amount <= 39134 0.039263
## disbursed_amount_fct=2: disbursed_amount <= 43615 0.084088
## disbursed_amount_fct=3: disbursed_amount <= 48555 0.086379
## disbursed_amount_fct=4: disbursed_amount <= 51908 0.095899
## disbursed_amount_fct=5: disbursed_amount <= 55400 0.052241
## disbursed_amount_fct=6: disbursed_amount > 55400 Dropped
## OutstandingNow_fct=1: OutstandingNow <= 44402 -0.059569
## OutstandingNow_fct=2: OutstandingNow <= 50314 -0.122291
## OutstandingNow_fct=3: OutstandingNow <= 171384 -0.183387
## OutstandingNow_fct=4: OutstandingNow <= 324324 -0.085655
## OutstandingNow_fct=5: OutstandingNow <= 746271 -0.113832
## OutstandingNow_fct=6: OutstandingNow > 746271 Dropped
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 0.353738
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 0.331067
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 0.245147
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 0.219787
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.004902
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped
## State_ID_fct=1: State_ID <= 183 -0.059697
## State_ID_fct=2: State_ID <= 188 -0.029348
## State_ID_fct=3: State_ID <= 206 -0.002158
## State_ID_fct=4: State_ID <= 214 0.079577
## State_ID_fct=5: State_ID <= 220 0.100859
## State_ID_fct=6: State_ID <= 229 0.063927
## State_ID_fct=7: State_ID <= 272 0.084702
## State_ID_fct=8: State_ID > 272 Dropped
## DisAsDiff_fct=1: DisAsDiff <= 13554 -0.130242
## DisAsDiff_fct=2: DisAsDiff <= 15670 -0.082229
## DisAsDiff_fct=3: DisAsDiff <= 16661 0.006175
## DisAsDiff_fct=4: DisAsDiff <= 19822 -0.017471
## DisAsDiff_fct=5: DisAsDiff > 19822 Dropped
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 -0.255990
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 0.144360
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped
## manufacturer_id_fct=1: manufacturer_id <= 210 0.123239
## manufacturer_id_fct=2: manufacturer_id <= 221 -0.003519
## manufacturer_id_fct=3: manufacturer_id <= 228 0.097721
## manufacturer_id_fct=4: manufacturer_id > 228 Dropped
## VoterID_flag_fct=1: VoterID_flag = 0 0.093489
## VoterID_flag_fct=2: VoterID_flag = 1 Dropped
## ShareOverdue_fct=1: ShareOverdue <= -2 0.092295
## ShareOverdue_fct=2: ShareOverdue <= -1 0.128849
## ShareOverdue_fct=3: ShareOverdue > -1 Dropped
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 -0.227117
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 -0.242919
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 -0.209621
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped
## Day_fct=1: Day <= 28 0.249400
## Day_fct=2: Day <= 30 0.128288
## Day_fct=3: Day > 30 Dropped
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 0.273240
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 0.281950
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped
## Qrt_fct=1: Qrt = 3 0.219055
## Qrt_fct=2: Qrt = 4 Dropped
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 -0.384106
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 -0.278357
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 -0.179941
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 -0.092981
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped
## Employment_Type_fct=1: Employment_Type = 203 0.152449
## Employment_Type_fct=2: Employment_Type = 215 0.216629
## Employment_Type_fct=3: Employment_Type = 227 Dropped
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 -0.017958
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 -0.256020
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 -0.144240
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 -0.143675
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped
## asset_cost_fct=1: asset_cost <= 60098 0.072598
## asset_cost_fct=2: asset_cost <= 70561 0.149356
## asset_cost_fct=3: asset_cost <= 85738 0.177618
## asset_cost_fct=4: asset_cost > 85738 Dropped
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 -0.059896
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null -0.075173
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 0.286437
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 -0.023913
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped
## Passport_flag_fct=1: Passport_flag = 0 -0.187963
## Passport_flag_fct=2: Passport_flag = 1 Dropped
## Driving_flag_fct=1: Driving_flag = 0 0.015928
## Driving_flag_fct=2: Driving_flag = 1 Dropped
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.060927
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped
## PAN_flag_fct=1: PAN_flag = 0 -0.045962
## PAN_flag_fct=2: PAN_flag = 1 Dropped
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped
## Std. Error
## (Intercept) 0.199992
## Employee_code_ID_fct=1: Employee_code_ID <= 134 0.042695
## Employee_code_ID_fct=10: Employee_code_ID <= 242 0.030348
## Employee_code_ID_fct=11: Employee_code_ID <= 254 0.029589
## Employee_code_ID_fct=12: Employee_code_ID <= 270 0.029733
## Employee_code_ID_fct=13: Employee_code_ID <= 289 0.029307
## Employee_code_ID_fct=14: Employee_code_ID <= 319 0.028616
## Employee_code_ID_fct=15: Employee_code_ID > 319 0.032049
## Employee_code_ID_fct=2: Employee_code_ID <= 153 0.033917
## Employee_code_ID_fct=3: Employee_code_ID <= 170 0.029777
## Employee_code_ID_fct=4: Employee_code_ID <= 179 0.033683
## Employee_code_ID_fct=5: Employee_code_ID <= 188 0.031842
## Employee_code_ID_fct=6: Employee_code_ID <= 199 0.029672
## Employee_code_ID_fct=7: Employee_code_ID <= 211 0.029080
## Employee_code_ID_fct=8: Employee_code_ID <= 221 0.029788
## Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 0.038836
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 0.024511
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 0.023582
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 0.025912
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 0.036992
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 0.027784
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 0.026909
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 0.024434
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 0.025310
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 0.029533
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 0.029616
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped
## supplier_id_fct=1: supplier_id <= 133 0.042484
## supplier_id_fct=10: supplier_id <= 253 0.027024
## supplier_id_fct=11: supplier_id <= 275 0.024787
## supplier_id_fct=12: supplier_id <= 314 0.025862
## supplier_id_fct=13: supplier_id > 314 0.030628
## supplier_id_fct=2: supplier_id <= 149 0.037278
## supplier_id_fct=3: supplier_id <= 165 0.031069
## supplier_id_fct=4: supplier_id <= 178 0.028826
## supplier_id_fct=5: supplier_id <= 196 0.025572
## supplier_id_fct=6: supplier_id <= 206 0.028729
## supplier_id_fct=7: supplier_id <= 214 0.030414
## supplier_id_fct=8: supplier_id <= 225 0.025963
## supplier_id_fct=9: supplier_id <= 240 Dropped
## branch_id_fct=1: branch_id <= 153 0.041043
## branch_id_fct=10: branch_id <= 284 0.030432
## branch_id_fct=11: branch_id > 284 0.041404
## branch_id_fct=2: branch_id <= 174 0.038410
## branch_id_fct=3: branch_id <= 184 0.031658
## branch_id_fct=4: branch_id <= 198 0.028063
## branch_id_fct=5: branch_id <= 214 0.031921
## branch_id_fct=6: branch_id <= 222 0.033216
## branch_id_fct=7: branch_id <= 233 0.031622
## branch_id_fct=8: branch_id <= 261 0.030186
## branch_id_fct=9: branch_id <= 276 Dropped
## ltv_fct=1: ltv <= 55.63 0.048302
## ltv_fct=10: ltv <= 84.57 0.029555
## ltv_fct=11: ltv <= 85 0.029887
## ltv_fct=12: ltv <= 87.8 0.032811
## ltv_fct=13: ltv <= 89.3 0.033469
## ltv_fct=14: ltv > 89.3 0.033307
## ltv_fct=2: ltv <= 62.22 0.042449
## ltv_fct=3: ltv <= 68.34 0.037313
## ltv_fct=4: ltv <= 72.9301 0.033359
## ltv_fct=5: ltv <= 74.31 0.034625
## ltv_fct=6: ltv <= 75 0.033609
## ltv_fct=7: ltv <= 77.39 0.030608
## ltv_fct=8: ltv <= 78.92 0.028686
## ltv_fct=9: ltv <= 83.34 Dropped
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 0.072227
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 0.058937
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 0.071422
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 0.061776
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 0.040865
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 0.044894
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped
## disbursed_amount_fct=1: disbursed_amount <= 39134 0.047683
## disbursed_amount_fct=2: disbursed_amount <= 43615 0.042217
## disbursed_amount_fct=3: disbursed_amount <= 48555 0.030895
## disbursed_amount_fct=4: disbursed_amount <= 51908 0.024756
## disbursed_amount_fct=5: disbursed_amount <= 55400 0.019087
## disbursed_amount_fct=6: disbursed_amount > 55400 Dropped
## OutstandingNow_fct=1: OutstandingNow <= 44402 0.058208
## OutstandingNow_fct=2: OutstandingNow <= 50314 0.054488
## OutstandingNow_fct=3: OutstandingNow <= 171384 0.048486
## OutstandingNow_fct=4: OutstandingNow <= 324324 0.042853
## OutstandingNow_fct=5: OutstandingNow <= 746271 0.039039
## OutstandingNow_fct=6: OutstandingNow > 746271 Dropped
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 0.063209
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 0.061567
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 0.049958
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 0.062171
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.035776
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped
## State_ID_fct=1: State_ID <= 183 0.049425
## State_ID_fct=2: State_ID <= 188 0.046370
## State_ID_fct=3: State_ID <= 206 0.043220
## State_ID_fct=4: State_ID <= 214 0.042510
## State_ID_fct=5: State_ID <= 220 0.049674
## State_ID_fct=6: State_ID <= 229 0.051160
## State_ID_fct=7: State_ID <= 272 0.044464
## State_ID_fct=8: State_ID > 272 Dropped
## DisAsDiff_fct=1: DisAsDiff <= 13554 0.039643
## DisAsDiff_fct=2: DisAsDiff <= 15670 0.034234
## DisAsDiff_fct=3: DisAsDiff <= 16661 0.035854
## DisAsDiff_fct=4: DisAsDiff <= 19822 0.025433
## DisAsDiff_fct=5: DisAsDiff > 19822 Dropped
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 0.038731
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 0.028659
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped
## manufacturer_id_fct=1: manufacturer_id <= 210 0.023889
## manufacturer_id_fct=2: manufacturer_id <= 221 0.027798
## manufacturer_id_fct=3: manufacturer_id <= 228 0.026050
## manufacturer_id_fct=4: manufacturer_id > 228 Dropped
## VoterID_flag_fct=1: VoterID_flag = 0 0.019475
## VoterID_flag_fct=2: VoterID_flag = 1 Dropped
## ShareOverdue_fct=1: ShareOverdue <= -2 0.031829
## ShareOverdue_fct=2: ShareOverdue <= -1 0.023251
## ShareOverdue_fct=3: ShareOverdue > -1 Dropped
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 0.044260
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 0.033428
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 0.028821
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped
## Day_fct=1: Day <= 28 0.021306
## Day_fct=2: Day <= 30 0.026022
## Day_fct=3: Day > 30 Dropped
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 0.017019
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 0.024440
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped
## Qrt_fct=1: Qrt = 3 0.011791
## Qrt_fct=2: Qrt = 4 Dropped
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 0.030618
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 0.026590
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 0.025879
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 0.026471
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped
## Employment_Type_fct=1: Employment_Type = 203 0.012252
## Employment_Type_fct=2: Employment_Type = 215 0.034675
## Employment_Type_fct=3: Employment_Type = 227 Dropped
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 0.032539
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 0.037537
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 0.038747
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 0.032532
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped
## asset_cost_fct=1: asset_cost <= 60098 0.045737
## asset_cost_fct=2: asset_cost <= 70561 0.029595
## asset_cost_fct=3: asset_cost <= 85738 0.022427
## asset_cost_fct=4: asset_cost > 85738 Dropped
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 0.093246
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null 0.102431
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 0.245402
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 0.077865
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped
## Passport_flag_fct=1: Passport_flag = 0 0.137099
## Passport_flag_fct=2: Passport_flag = 1 Dropped
## Driving_flag_fct=1: Driving_flag = 0 0.038427
## Driving_flag_fct=2: Driving_flag = 1 Dropped
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.077767
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped
## PAN_flag_fct=1: PAN_flag = 0 0.023009
## PAN_flag_fct=2: PAN_flag = 1 Dropped
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped
## z value
## (Intercept) 3.696
## Employee_code_ID_fct=1: Employee_code_ID <= 134 17.934
## Employee_code_ID_fct=10: Employee_code_ID <= 242 -2.539
## Employee_code_ID_fct=11: Employee_code_ID <= 254 -3.429
## Employee_code_ID_fct=12: Employee_code_ID <= 270 -4.829
## Employee_code_ID_fct=13: Employee_code_ID <= 289 -7.986
## Employee_code_ID_fct=14: Employee_code_ID <= 319 -9.007
## Employee_code_ID_fct=15: Employee_code_ID > 319 -15.050
## Employee_code_ID_fct=2: Employee_code_ID <= 153 14.826
## Employee_code_ID_fct=3: Employee_code_ID <= 170 13.451
## Employee_code_ID_fct=4: Employee_code_ID <= 179 8.480
## Employee_code_ID_fct=5: Employee_code_ID <= 188 7.909
## Employee_code_ID_fct=6: Employee_code_ID <= 199 5.393
## Employee_code_ID_fct=7: Employee_code_ID <= 211 3.221
## Employee_code_ID_fct=8: Employee_code_ID <= 221 0.953
## Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 22.072
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 -7.381
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 -11.568
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 -15.118
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 20.306
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 20.561
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 18.122
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 16.308
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 12.457
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 5.490
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 3.629
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped
## supplier_id_fct=1: supplier_id <= 133 7.874
## supplier_id_fct=10: supplier_id <= 253 -2.930
## supplier_id_fct=11: supplier_id <= 275 -3.924
## supplier_id_fct=12: supplier_id <= 314 -5.124
## supplier_id_fct=13: supplier_id > 314 -7.569
## supplier_id_fct=2: supplier_id <= 149 9.161
## supplier_id_fct=3: supplier_id <= 165 7.430
## supplier_id_fct=4: supplier_id <= 178 6.435
## supplier_id_fct=5: supplier_id <= 196 5.903
## supplier_id_fct=6: supplier_id <= 206 4.511
## supplier_id_fct=7: supplier_id <= 214 3.417
## supplier_id_fct=8: supplier_id <= 225 2.509
## supplier_id_fct=9: supplier_id <= 240 Dropped
## branch_id_fct=1: branch_id <= 153 -10.972
## branch_id_fct=10: branch_id <= 284 1.809
## branch_id_fct=11: branch_id > 284 1.272
## branch_id_fct=2: branch_id <= 174 -12.818
## branch_id_fct=3: branch_id <= 184 -12.090
## branch_id_fct=4: branch_id <= 198 -11.216
## branch_id_fct=5: branch_id <= 214 -6.546
## branch_id_fct=6: branch_id <= 222 -5.740
## branch_id_fct=7: branch_id <= 233 -3.047
## branch_id_fct=8: branch_id <= 261 -8.761
## branch_id_fct=9: branch_id <= 276 Dropped
## ltv_fct=1: ltv <= 55.63 13.956
## ltv_fct=10: ltv <= 84.57 -4.028
## ltv_fct=11: ltv <= 85 -8.566
## ltv_fct=12: ltv <= 87.8 -3.379
## ltv_fct=13: ltv <= 89.3 -7.818
## ltv_fct=14: ltv > 89.3 -9.308
## ltv_fct=2: ltv <= 62.22 13.320
## ltv_fct=3: ltv <= 68.34 10.661
## ltv_fct=4: ltv <= 72.9301 8.081
## ltv_fct=5: ltv <= 74.31 4.029
## ltv_fct=6: ltv <= 75 2.689
## ltv_fct=7: ltv <= 77.39 6.381
## ltv_fct=8: ltv <= 78.92 2.209
## ltv_fct=9: ltv <= 83.34 Dropped
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 -2.204
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 -1.035
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 -3.491
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 -2.249
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 -1.927
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 3.899
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped
## disbursed_amount_fct=1: disbursed_amount <= 39134 0.823
## disbursed_amount_fct=2: disbursed_amount <= 43615 1.992
## disbursed_amount_fct=3: disbursed_amount <= 48555 2.796
## disbursed_amount_fct=4: disbursed_amount <= 51908 3.874
## disbursed_amount_fct=5: disbursed_amount <= 55400 2.737
## disbursed_amount_fct=6: disbursed_amount > 55400 Dropped
## OutstandingNow_fct=1: OutstandingNow <= 44402 -1.023
## OutstandingNow_fct=2: OutstandingNow <= 50314 -2.244
## OutstandingNow_fct=3: OutstandingNow <= 171384 -3.782
## OutstandingNow_fct=4: OutstandingNow <= 324324 -1.999
## OutstandingNow_fct=5: OutstandingNow <= 746271 -2.916
## OutstandingNow_fct=6: OutstandingNow > 746271 Dropped
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 5.596
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 5.377
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 4.907
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 3.535
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.137
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped
## State_ID_fct=1: State_ID <= 183 -1.208
## State_ID_fct=2: State_ID <= 188 -0.633
## State_ID_fct=3: State_ID <= 206 -0.050
## State_ID_fct=4: State_ID <= 214 1.872
## State_ID_fct=5: State_ID <= 220 2.030
## State_ID_fct=6: State_ID <= 229 1.250
## State_ID_fct=7: State_ID <= 272 1.905
## State_ID_fct=8: State_ID > 272 Dropped
## DisAsDiff_fct=1: DisAsDiff <= 13554 -3.285
## DisAsDiff_fct=2: DisAsDiff <= 15670 -2.402
## DisAsDiff_fct=3: DisAsDiff <= 16661 0.172
## DisAsDiff_fct=4: DisAsDiff <= 19822 -0.687
## DisAsDiff_fct=5: DisAsDiff > 19822 Dropped
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 -6.609
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 5.037
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped
## manufacturer_id_fct=1: manufacturer_id <= 210 5.159
## manufacturer_id_fct=2: manufacturer_id <= 221 -0.127
## manufacturer_id_fct=3: manufacturer_id <= 228 3.751
## manufacturer_id_fct=4: manufacturer_id > 228 Dropped
## VoterID_flag_fct=1: VoterID_flag = 0 4.800
## VoterID_flag_fct=2: VoterID_flag = 1 Dropped
## ShareOverdue_fct=1: ShareOverdue <= -2 2.900
## ShareOverdue_fct=2: ShareOverdue <= -1 5.542
## ShareOverdue_fct=3: ShareOverdue > -1 Dropped
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 -5.131
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 -7.267
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 -7.273
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped
## Day_fct=1: Day <= 28 11.705
## Day_fct=2: Day <= 30 4.930
## Day_fct=3: Day > 30 Dropped
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 16.055
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 11.536
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped
## Qrt_fct=1: Qrt = 3 18.577
## Qrt_fct=2: Qrt = 4 Dropped
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 -12.545
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 -10.468
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 -6.953
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 -3.513
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped
## Employment_Type_fct=1: Employment_Type = 203 12.443
## Employment_Type_fct=2: Employment_Type = 215 6.247
## Employment_Type_fct=3: Employment_Type = 227 Dropped
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 -0.552
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 -6.820
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 -3.723
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 -4.416
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped
## asset_cost_fct=1: asset_cost <= 60098 1.587
## asset_cost_fct=2: asset_cost <= 70561 5.047
## asset_cost_fct=3: asset_cost <= 85738 7.920
## asset_cost_fct=4: asset_cost > 85738 Dropped
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 -0.642
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null -0.734
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 1.167
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 -0.307
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped
## Passport_flag_fct=1: Passport_flag = 0 -1.371
## Passport_flag_fct=2: Passport_flag = 1 Dropped
## Driving_flag_fct=1: Driving_flag = 0 0.414
## Driving_flag_fct=2: Driving_flag = 1 Dropped
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.783
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped
## PAN_flag_fct=1: PAN_flag = 0 -1.998
## PAN_flag_fct=2: PAN_flag = 1 Dropped
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped
## Pr(>|z|)
## (Intercept) 0.000219
## Employee_code_ID_fct=1: Employee_code_ID <= 134 0.000000000000000222
## Employee_code_ID_fct=10: Employee_code_ID <= 242 0.011127
## Employee_code_ID_fct=11: Employee_code_ID <= 254 0.000605
## Employee_code_ID_fct=12: Employee_code_ID <= 270 0.000001373400840832
## Employee_code_ID_fct=13: Employee_code_ID <= 289 0.000000000000000222
## Employee_code_ID_fct=14: Employee_code_ID <= 319 0.000000000000000222
## Employee_code_ID_fct=15: Employee_code_ID > 319 0.000000000000000222
## Employee_code_ID_fct=2: Employee_code_ID <= 153 0.000000000000000222
## Employee_code_ID_fct=3: Employee_code_ID <= 170 0.000000000000000222
## Employee_code_ID_fct=4: Employee_code_ID <= 179 0.000000000000000222
## Employee_code_ID_fct=5: Employee_code_ID <= 188 0.000000000000000222
## Employee_code_ID_fct=6: Employee_code_ID <= 199 0.000000069185031482
## Employee_code_ID_fct=7: Employee_code_ID <= 211 0.001279
## Employee_code_ID_fct=8: Employee_code_ID <= 221 0.340780
## Employee_code_ID_fct=9: Employee_code_ID <= 233 Dropped
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 0.000000000000000222
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 0.000000000000000222
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 0.000000000000000222
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 0.000000000000000222
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 0.000000000000000222
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 0.000000000000000222
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 0.000000000000000222
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 0.000000000000000222
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 0.000000000000000222
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 0.000000040128663947
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 0.000285
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238 Dropped
## supplier_id_fct=1: supplier_id <= 133 0.000000000000000222
## supplier_id_fct=10: supplier_id <= 253 0.003385
## supplier_id_fct=11: supplier_id <= 275 0.000086946671635557
## supplier_id_fct=12: supplier_id <= 314 0.000000299739931098
## supplier_id_fct=13: supplier_id > 314 0.000000000000000222
## supplier_id_fct=2: supplier_id <= 149 0.000000000000000222
## supplier_id_fct=3: supplier_id <= 165 0.000000000000000222
## supplier_id_fct=4: supplier_id <= 178 0.000000000123338895
## supplier_id_fct=5: supplier_id <= 196 0.000000003559513795
## supplier_id_fct=6: supplier_id <= 206 0.000006464945373930
## supplier_id_fct=7: supplier_id <= 214 0.000633
## supplier_id_fct=8: supplier_id <= 225 0.012114
## supplier_id_fct=9: supplier_id <= 240 Dropped
## branch_id_fct=1: branch_id <= 153 0.000000000000000222
## branch_id_fct=10: branch_id <= 284 0.070464
## branch_id_fct=11: branch_id > 284 0.203439
## branch_id_fct=2: branch_id <= 174 0.000000000000000222
## branch_id_fct=3: branch_id <= 184 0.000000000000000222
## branch_id_fct=4: branch_id <= 198 0.000000000000000222
## branch_id_fct=5: branch_id <= 214 0.000000000059036109
## branch_id_fct=6: branch_id <= 222 0.000000009467683748
## branch_id_fct=7: branch_id <= 233 0.002310
## branch_id_fct=8: branch_id <= 261 0.000000000000000222
## branch_id_fct=9: branch_id <= 276 Dropped
## ltv_fct=1: ltv <= 55.63 0.000000000000000222
## ltv_fct=10: ltv <= 84.57 0.000056194942485766
## ltv_fct=11: ltv <= 85 0.000000000000000222
## ltv_fct=12: ltv <= 87.8 0.000728
## ltv_fct=13: ltv <= 89.3 0.000000000000000222
## ltv_fct=14: ltv > 89.3 0.000000000000000222
## ltv_fct=2: ltv <= 62.22 0.000000000000000222
## ltv_fct=3: ltv <= 68.34 0.000000000000000222
## ltv_fct=4: ltv <= 72.9301 0.000000000000000222
## ltv_fct=5: ltv <= 74.31 0.000056050266953989
## ltv_fct=6: ltv <= 75 0.007169
## ltv_fct=7: ltv <= 77.39 0.000000000176378467
## ltv_fct=8: ltv <= 78.92 0.027181
## ltv_fct=9: ltv <= 83.34 Dropped
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 0.027551
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18 0.300902
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 0.000481
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 0.024504
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 0.053931
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 0.000096707880559599
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824 Dropped
## disbursed_amount_fct=1: disbursed_amount <= 39134 0.410269
## disbursed_amount_fct=2: disbursed_amount <= 43615 0.046391
## disbursed_amount_fct=3: disbursed_amount <= 48555 0.005176
## disbursed_amount_fct=4: disbursed_amount <= 51908 0.000107
## disbursed_amount_fct=5: disbursed_amount <= 55400 0.006200
## disbursed_amount_fct=6: disbursed_amount > 55400 Dropped
## OutstandingNow_fct=1: OutstandingNow <= 44402 0.306128
## OutstandingNow_fct=2: OutstandingNow <= 50314 0.024810
## OutstandingNow_fct=3: OutstandingNow <= 171384 0.000155
## OutstandingNow_fct=4: OutstandingNow <= 324324 0.045628
## OutstandingNow_fct=5: OutstandingNow <= 746271 0.003547
## OutstandingNow_fct=6: OutstandingNow > 746271 Dropped
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 0.000000021892954560
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 0.000000075610862016
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 0.000000924302139271
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 0.000407
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256 0.891012
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256 Dropped
## State_ID_fct=1: State_ID <= 183 0.227109
## State_ID_fct=2: State_ID <= 188 0.526790
## State_ID_fct=3: State_ID <= 206 0.960185
## State_ID_fct=4: State_ID <= 214 0.061214
## State_ID_fct=5: State_ID <= 220 0.042315
## State_ID_fct=6: State_ID <= 229 0.211464
## State_ID_fct=7: State_ID <= 272 0.056786
## State_ID_fct=8: State_ID > 272 Dropped
## DisAsDiff_fct=1: DisAsDiff <= 13554 0.001018
## DisAsDiff_fct=2: DisAsDiff <= 15670 0.016306
## DisAsDiff_fct=3: DisAsDiff <= 16661 0.863261
## DisAsDiff_fct=4: DisAsDiff <= 19822 0.492117
## DisAsDiff_fct=5: DisAsDiff > 19822 Dropped
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 0.000000000038574033
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581 Dropped
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 0.000000472350173641
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0 Dropped
## manufacturer_id_fct=1: manufacturer_id <= 210 0.000000248386704094
## manufacturer_id_fct=2: manufacturer_id <= 221 0.899275
## manufacturer_id_fct=3: manufacturer_id <= 228 0.000176
## manufacturer_id_fct=4: manufacturer_id > 228 Dropped
## VoterID_flag_fct=1: VoterID_flag = 0 0.000001583545196970
## VoterID_flag_fct=2: VoterID_flag = 1 Dropped
## ShareOverdue_fct=1: ShareOverdue <= -2 0.003735
## ShareOverdue_fct=2: ShareOverdue <= -1 0.000000029958245662
## ShareOverdue_fct=3: ShareOverdue > -1 Dropped
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 0.000000287604146276
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 0.000000000000000222
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 0.000000000000000222
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3 Dropped
## Day_fct=1: Day <= 28 0.000000000000000222
## Day_fct=2: Day <= 30 0.000000822043672688
## Day_fct=3: Day > 30 Dropped
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 0.000000000000000222
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0 Dropped
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 0.000000000000000222
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0 Dropped
## Qrt_fct=1: Qrt = 3 0.000000000000000222
## Qrt_fct=2: Qrt = 4 Dropped
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 0.000000000000000222
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 0.000000000000000222
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 0.000000000003574030
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 0.000444
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208 Dropped
## Employment_Type_fct=1: Employment_Type = 203 0.000000000000000222
## Employment_Type_fct=2: Employment_Type = 215 0.000000000417368806
## Employment_Type_fct=3: Employment_Type = 227 Dropped
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564 0.581034
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 0.000000000009078738
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 0.000197
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 0.000010036155548399
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326 Dropped
## asset_cost_fct=1: asset_cost <= 60098 0.112444
## asset_cost_fct=2: asset_cost <= 70561 0.000000449390024304
## asset_cost_fct=3: asset_cost <= 85738 0.000000000000000222
## asset_cost_fct=4: asset_cost > 85738 Dropped
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0 0.520648
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null 0.463016
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2 0.243122
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1 Dropped
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0 0.758758
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0 Dropped
## Passport_flag_fct=1: Passport_flag = 0 0.170376
## Passport_flag_fct=2: Passport_flag = 1 Dropped
## Driving_flag_fct=1: Driving_flag = 0 0.678510
## Driving_flag_fct=2: Driving_flag = 1 Dropped
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0 0.433359
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0 Dropped
## PAN_flag_fct=1: PAN_flag = 0 0.045763
## PAN_flag_fct=2: PAN_flag = 1 Dropped
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0 Dropped
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0 Dropped
##
## (Intercept) ***
## Employee_code_ID_fct=1: Employee_code_ID <= 134 ***
## Employee_code_ID_fct=10: Employee_code_ID <= 242 *
## Employee_code_ID_fct=11: Employee_code_ID <= 254 ***
## Employee_code_ID_fct=12: Employee_code_ID <= 270 ***
## Employee_code_ID_fct=13: Employee_code_ID <= 289 ***
## Employee_code_ID_fct=14: Employee_code_ID <= 319 ***
## Employee_code_ID_fct=15: Employee_code_ID > 319 ***
## Employee_code_ID_fct=2: Employee_code_ID <= 153 ***
## Employee_code_ID_fct=3: Employee_code_ID <= 170 ***
## Employee_code_ID_fct=4: Employee_code_ID <= 179 ***
## Employee_code_ID_fct=5: Employee_code_ID <= 188 ***
## Employee_code_ID_fct=6: Employee_code_ID <= 199 ***
## Employee_code_ID_fct=7: Employee_code_ID <= 211 **
## Employee_code_ID_fct=8: Employee_code_ID <= 221
## Employee_code_ID_fct=9: Employee_code_ID <= 233
## Current_pincode_ID_fct=1: Current_pincode_ID <= 143 ***
## Current_pincode_ID_fct=10: Current_pincode_ID <= 257 ***
## Current_pincode_ID_fct=11: Current_pincode_ID <= 291 ***
## Current_pincode_ID_fct=12: Current_pincode_ID > 291 ***
## Current_pincode_ID_fct=2: Current_pincode_ID <= 158 ***
## Current_pincode_ID_fct=3: Current_pincode_ID <= 174 ***
## Current_pincode_ID_fct=4: Current_pincode_ID <= 188 ***
## Current_pincode_ID_fct=5: Current_pincode_ID <= 201 ***
## Current_pincode_ID_fct=6: Current_pincode_ID <= 212 ***
## Current_pincode_ID_fct=7: Current_pincode_ID <= 219 ***
## Current_pincode_ID_fct=8: Current_pincode_ID <= 225 ***
## Current_pincode_ID_fct=9: Current_pincode_ID <= 238
## supplier_id_fct=1: supplier_id <= 133 ***
## supplier_id_fct=10: supplier_id <= 253 **
## supplier_id_fct=11: supplier_id <= 275 ***
## supplier_id_fct=12: supplier_id <= 314 ***
## supplier_id_fct=13: supplier_id > 314 ***
## supplier_id_fct=2: supplier_id <= 149 ***
## supplier_id_fct=3: supplier_id <= 165 ***
## supplier_id_fct=4: supplier_id <= 178 ***
## supplier_id_fct=5: supplier_id <= 196 ***
## supplier_id_fct=6: supplier_id <= 206 ***
## supplier_id_fct=7: supplier_id <= 214 ***
## supplier_id_fct=8: supplier_id <= 225 *
## supplier_id_fct=9: supplier_id <= 240
## branch_id_fct=1: branch_id <= 153 ***
## branch_id_fct=10: branch_id <= 284 .
## branch_id_fct=11: branch_id > 284
## branch_id_fct=2: branch_id <= 174 ***
## branch_id_fct=3: branch_id <= 184 ***
## branch_id_fct=4: branch_id <= 198 ***
## branch_id_fct=5: branch_id <= 214 ***
## branch_id_fct=6: branch_id <= 222 ***
## branch_id_fct=7: branch_id <= 233 **
## branch_id_fct=8: branch_id <= 261 ***
## branch_id_fct=9: branch_id <= 276
## ltv_fct=1: ltv <= 55.63 ***
## ltv_fct=10: ltv <= 84.57 ***
## ltv_fct=11: ltv <= 85 ***
## ltv_fct=12: ltv <= 87.8 ***
## ltv_fct=13: ltv <= 89.3 ***
## ltv_fct=14: ltv > 89.3 ***
## ltv_fct=2: ltv <= 62.22 ***
## ltv_fct=3: ltv <= 68.34 ***
## ltv_fct=4: ltv <= 72.9301 ***
## ltv_fct=5: ltv <= 74.31 ***
## ltv_fct=6: ltv <= 75 **
## ltv_fct=7: ltv <= 77.39 ***
## ltv_fct=8: ltv <= 78.92 *
## ltv_fct=9: ltv <= 83.34
## PERFORM_CNS_SCORE_fct=1: PERFORM_CNS_SCORE <= 0 *
## PERFORM_CNS_SCORE_fct=2: PERFORM_CNS_SCORE <= 18
## PERFORM_CNS_SCORE_fct=3: PERFORM_CNS_SCORE <= 441 ***
## PERFORM_CNS_SCORE_fct=4: PERFORM_CNS_SCORE <= 643 *
## PERFORM_CNS_SCORE_fct=5: PERFORM_CNS_SCORE <= 738 .
## PERFORM_CNS_SCORE_fct=6: PERFORM_CNS_SCORE <= 824 ***
## PERFORM_CNS_SCORE_fct=7: PERFORM_CNS_SCORE > 824
## disbursed_amount_fct=1: disbursed_amount <= 39134
## disbursed_amount_fct=2: disbursed_amount <= 43615 *
## disbursed_amount_fct=3: disbursed_amount <= 48555 **
## disbursed_amount_fct=4: disbursed_amount <= 51908 ***
## disbursed_amount_fct=5: disbursed_amount <= 55400 **
## disbursed_amount_fct=6: disbursed_amount > 55400
## OutstandingNow_fct=1: OutstandingNow <= 44402
## OutstandingNow_fct=2: OutstandingNow <= 50314 *
## OutstandingNow_fct=3: OutstandingNow <= 171384 ***
## OutstandingNow_fct=4: OutstandingNow <= 324324 *
## OutstandingNow_fct=5: OutstandingNow <= 746271 **
## OutstandingNow_fct=6: OutstandingNow > 746271
## PERFORM_CNS_SCORE_DESCRIPTION_fct=1: PERFORM_CNS_SCORE_DESCRIPTION <= 150 ***
## PERFORM_CNS_SCORE_DESCRIPTION_fct=2: PERFORM_CNS_SCORE_DESCRIPTION <= 172 ***
## PERFORM_CNS_SCORE_DESCRIPTION_fct=3: PERFORM_CNS_SCORE_DESCRIPTION <= 205 ***
## PERFORM_CNS_SCORE_DESCRIPTION_fct=4: PERFORM_CNS_SCORE_DESCRIPTION <= 231 ***
## PERFORM_CNS_SCORE_DESCRIPTION_fct=5: PERFORM_CNS_SCORE_DESCRIPTION <= 256
## PERFORM_CNS_SCORE_DESCRIPTION_fct=6: PERFORM_CNS_SCORE_DESCRIPTION > 256
## State_ID_fct=1: State_ID <= 183
## State_ID_fct=2: State_ID <= 188
## State_ID_fct=3: State_ID <= 206
## State_ID_fct=4: State_ID <= 214 .
## State_ID_fct=5: State_ID <= 220 *
## State_ID_fct=6: State_ID <= 229
## State_ID_fct=7: State_ID <= 272 .
## State_ID_fct=8: State_ID > 272
## DisAsDiff_fct=1: DisAsDiff <= 13554 **
## DisAsDiff_fct=2: DisAsDiff <= 15670 *
## DisAsDiff_fct=3: DisAsDiff <= 16661
## DisAsDiff_fct=4: DisAsDiff <= 19822
## DisAsDiff_fct=5: DisAsDiff > 19822
## PRI_DISBURSED_AMOUNT_fct=1: PRI_DISBURSED_AMOUNT <= 218581 ***
## PRI_DISBURSED_AMOUNT_fct=2: PRI_DISBURSED_AMOUNT > 218581
## PRI_OVERDUE_ACCTS_fct=1: PRI_OVERDUE_ACCTS <= 0 ***
## PRI_OVERDUE_ACCTS_fct=2: PRI_OVERDUE_ACCTS > 0
## manufacturer_id_fct=1: manufacturer_id <= 210 ***
## manufacturer_id_fct=2: manufacturer_id <= 221
## manufacturer_id_fct=3: manufacturer_id <= 228 ***
## manufacturer_id_fct=4: manufacturer_id > 228
## VoterID_flag_fct=1: VoterID_flag = 0 ***
## VoterID_flag_fct=2: VoterID_flag = 1
## ShareOverdue_fct=1: ShareOverdue <= -2 **
## ShareOverdue_fct=2: ShareOverdue <= -1 ***
## ShareOverdue_fct=3: ShareOverdue > -1
## PRI_ACTIVE_ACCTS_fct=1: PRI_ACTIVE_ACCTS <= 0 ***
## PRI_ACTIVE_ACCTS_fct=2: PRI_ACTIVE_ACCTS <= 1 ***
## PRI_ACTIVE_ACCTS_fct=3: PRI_ACTIVE_ACCTS <= 3 ***
## PRI_ACTIVE_ACCTS_fct=4: PRI_ACTIVE_ACCTS > 3
## Day_fct=1: Day <= 28 ***
## Day_fct=2: Day <= 30 ***
## Day_fct=3: Day > 30
## NO_OF_INQUIRIES_fct=1: NO_OF_INQUIRIES <= 0 ***
## NO_OF_INQUIRIES_fct=2: NO_OF_INQUIRIES > 0
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=1: DELINQUENT_ACCTS_IN_LAST_SIX_M <= 0 ***
## DELINQUENT_ACCTS_IN_LAST_SIX_M_fct=2: DELINQUENT_ACCTS_IN_LAST_SIX_M > 0
## Qrt_fct=1: Qrt = 3 ***
## Qrt_fct=2: Qrt = 4
## YearsOnLoan_fct=1: YearsOnLoan <= 22.8918 ***
## YearsOnLoan_fct=2: YearsOnLoan <= 28.8496 ***
## YearsOnLoan_fct=3: YearsOnLoan <= 38.8321 ***
## YearsOnLoan_fct=4: YearsOnLoan <= 51.8208 ***
## YearsOnLoan_fct=5: YearsOnLoan > 51.8208
## Employment_Type_fct=1: Employment_Type = 203 ***
## Employment_Type_fct=2: Employment_Type = 215 ***
## Employment_Type_fct=3: Employment_Type = 227
## PRIMARY_INSTAL_AMT_fct=1: PRIMARY_INSTAL_AMT <= 1564
## PRIMARY_INSTAL_AMT_fct=2: PRIMARY_INSTAL_AMT <= 2832 ***
## PRIMARY_INSTAL_AMT_fct=3: PRIMARY_INSTAL_AMT <= 5033 ***
## PRIMARY_INSTAL_AMT_fct=4: PRIMARY_INSTAL_AMT <= 25326 ***
## PRIMARY_INSTAL_AMT_fct=5: PRIMARY_INSTAL_AMT > 25326
## asset_cost_fct=1: asset_cost <= 60098
## asset_cost_fct=2: asset_cost <= 70561 ***
## asset_cost_fct=3: asset_cost <= 85738 ***
## asset_cost_fct=4: asset_cost > 85738
## SEC_OverdueShareSec_fct=1: SEC_OverdueShareSec <= 0
## SEC_OverdueShareSec_fct=11: SEC_OverdueShareSec Is Null
## SEC_OverdueShareSec_fct=8: SEC_OverdueShareSec <= 0.2
## SEC_OverdueShareSec_fct=9: SEC_OverdueShareSec <= 1
## SEC_CURRENT_BALANCE_fct=1: SEC_CURRENT_BALANCE <= 0
## SEC_CURRENT_BALANCE_fct=10: SEC_CURRENT_BALANCE > 0
## Passport_flag_fct=1: Passport_flag = 0
## Passport_flag_fct=2: Passport_flag = 1
## Driving_flag_fct=1: Driving_flag = 0
## Driving_flag_fct=2: Driving_flag = 1
## SEC_INSTAL_AMT_fct=1: SEC_INSTAL_AMT <= 0
## SEC_INSTAL_AMT_fct=10: SEC_INSTAL_AMT > 0
## PAN_flag_fct=1: PAN_flag = 0 *
## PAN_flag_fct=2: PAN_flag = 1
## SEC_OVERDUE_ACCTS_fct=1: SEC_OVERDUE_ACCTS <= 0
## SEC_OVERDUE_ACCTS_fct=10: SEC_OVERDUE_ACCTS > 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Condition number of final variance-covariance matrix: 1997.764
## Number of iterations: 6
# Estimate a Classification Result on Testing Set
Pred <-
RevoScaleR::rxPredict(modelObject = rxLogitFit
, data = X[inTest, ] #, sqlFraudDS
# , outData = sqlServerOutDS3
, predVarNames = 'logitProbs'
, type = 'response'
, writeModelVars = FALSE
# , extraVarsToWrite = 'SUBS_KEY'
, overwrite = TRUE )## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.019 seconds
Probs <- Pred[, 'logitProbs']
# PrClass <- factor(ifelse( RevoScaleR::rxImport(sqlServerOutDS3)[, 'predLogitProbs'] < 0.5, 0, 1))
PrClass <- ifelse( Pred[, 'logitProbs'] < 0.5, 0, 1) %>%
factor
levels(PrClass) <- c('Bad', 'Good')
ObClass <- Y[inTest]
writeLines('\n Estimate a Classification Result on Testing Set \n')##
## Estimate a Classification Result on Testing Set
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 284 338
## Good 4770 17924
##
## Accuracy : 0.7809
## 95% CI : (0.7756, 0.7862)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 0.8069
##
## Kappa : 0.0552
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.05619
## Specificity : 0.98149
## Pos Pred Value : 0.45659
## Neg Pred Value : 0.78981
## Precision : 0.45659
## Recall : 0.05619
## F1 : 0.10007
## Prevalence : 0.21676
## Detection Rate : 0.01218
## Detection Prevalence : 0.02668
## Balanced Accuracy : 0.51884
##
## 'Positive' Class : Bad
##
writeLines(paste0('\n Estimate a Gini Coefficient on Testing Set = ',
formattable::percent( hmeasure::HMeasure(true.class = ObClass %>% as.integer() - 1 ,
scores = Probs)[['metrics']] %>% .[1, 'Gini'], 2 )))##
## Estimate a Gini Coefficient on Testing Set = 31.79%
# Compute the ROC data using 100 threshold breaks
rxRocObject <-
RevoScaleR::rxRocCurve(actualVarName = 'ObClass'
, predVarNames = c('Probs')
, numBreaks = 100 # length(Probs)
, data = data.frame(ObClass = ObClass %>% as.integer() - 1, Probs)
, title = 'ROC Curve for Logit Model')
# Testing New Data with the Generalized Linear Model
# data.frame(..._fct = '4') %>%
# mutate_if(is.character, as.factor) %>%
# predict(glmFit, newdata = ., type = 'response')
# openxlsx::addWorksheet(wb0 <- openxlsx::createWorkbook(), sheetName = 'Example', gridLines = FALSE)
# openxlsx::writeData(wb0, sheet = 1, x = data.frame(X[inTest, ], ObClass, PrClass), withFilter = TRUE); openxlsx::openXL(wb0)
remove(varsel, scope)
ROC - A Receiver Operating Characteristic curve (ROC curve; closely related to the Lorenz curve) is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
One of the statistics commonly used in credit scoring, as well as countless other disciplines, is the KS statistic. This was developed by two renowned Soviet mathematicians, A.N. Kolmogorov (1903-1987) and N.V. Smirnov (1900-1966).
The K-S statistic of interest is the point where the difference between the two cumulative distribution functions is greatest. The treatment differs depending upon whether one or two samples were used to generate the values.
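As a minimal sketch (assuming the Probs and ObClass objects created in the chunks above), the two-sample K-S statistic on the classifier scores can be computed directly:
# Two-sample K-S statistic on the model scores, split by observed class
scoresGood <- Probs[ObClass == 'Good']
scoresBad <- Probs[ObClass == 'Bad']
# Maximum vertical distance between the two empirical CDFs
grid <- sort(unique(Probs))
max(abs(ecdf(scoresGood)(grid) - ecdf(scoresBad)(grid)))
# stats::ks.test() reports the same statistic (ties only trigger a warning)
suppressWarnings(ks.test(scoresGood, scoresBad))$statistic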
The credit score is a numeric expression measuring creditworthiness. Commercial Banks usually utilize it as a method to support the decision-making about credit applications.
If a reliable odds estimate already exists, whether because the statistical technique provides it directly or because some other algorithm supplied it, scaling can be done using Equation [5] below.
\[ \displaystyle \large c' = \frac{S \times \ln(D \times G) \ - \ (S + I) \times \ln(D)}{\ln(G)} \hspace{.5 in} [5] \\ \displaystyle \large i' = \frac{I}{\ln(G)} \hspace{0.3 in} s' = c' + \ln(D_{Orig}) \times i'\]
where \(S\) is the reference score, \(D\) is the required Good/Bad odds at that score, \(I\) is the score increment, \(G\) the required odds increment, and \(D_{Orig}\) the odds provided by the model. An example for a reference odds of 16 to 1 at a score of 700, with odds doubling every 50 points, is provided below. The scaled score equating to 128 to 1 is then 850, calculated as:
\[ \displaystyle \large c' = \frac{700 \times \ln(16 \times 2) \ - \ (700 + 50) \times \ln(16)}{\ln(2)} = 500 \hspace{.5 in} [6] \\ \displaystyle \large i' = \frac{50}{\ln(2)} = 72.13475 \hspace{0.1 in} s' = 500 + \ln(128) \times 72.13475 = 850\]
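The same arithmetic in R, as a quick check of Equations [5] and [6]:
# Reference point: score S = 700 at odds D = 16:1, I = 50 points per odds doubling (G = 2)
S <- 700; D <- 16; I <- 50; G <- 2
cPrime <- (S * log(D * G) - (S + I) * log(D)) / log(G) # 500
iPrime <- I / log(G) # 72.13475
sPrime <- cPrime + log(128) * iPrime # 850, the scaled score at odds 128:1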
A further method of validation is to compare the divergence statistic for the scores of the ‘Good’ and ‘Bad’ classes. The Kullback-Leibler divergence, or relative entropy, can be calculated using the formula:
\[ \displaystyle \large Divergence = \frac{(mean_G \ - \ mean_B)^2}{0.5 × (var_G \ + \ var_B)} \hspace{.5 in} [7] \]
where \(mean_G\), \(mean_B\), \(var_G\), and \(var_B\) are the means and variances of the scored Good and Bad populations respectively.
A large divergence value indicates good separation between the two classes.
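A minimal sketch of Equation [7] in R, assuming the same Probs and ObClass objects as above (the ScoresCurveShow() helper used below reports this statistic):
# Divergence between the score distributions of the Good and Bad classes
divergenceStat <- function(scores, classes) {
  g <- scores[classes == 'Good']
  b <- scores[classes == 'Bad']
  (mean(g) - mean(b))^2 / (0.5 * (var(g) + var(b)))
}
divergenceStat(Probs, ObClass)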
# Generate an ROC Curve for the Best Model
ROCCurveShow(Preds = Probs, # `Good` Class Probabilities - numeric vector
Obsers = ObClass, # Observed Classes (Reference) - factor with levels Bad/Good
NameOfModel = 'Initial GLM') # Name Of The Model##
## Initial GLM - Estimate an Area Under the ROC Curve (AUC) on Testing Set = 65.90%
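This AUC is consistent with the Gini coefficient reported earlier, since Gini = 2 * AUC - 1:
# With AUC = 0.6590, Gini = 2 * 0.6590 - 1 = 0.318, matching the 31.79% above
2 * 0.6590 - 1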
# Generate a KS Curve for the Best Model
KSCurveShow(Preds = Probs, # `Good` Class Probabilities - numeric vector
Obsers = ObClass, # Observed Classes (Reference) - factor with levels Bad/Good
NameOfModel = 'Initial GLM') # Name Of The Model##
## Initial GLM - Estimate a Kolmogorov-Smirnov Statistic on Testing Set = 0.2331
# Generate a Distribution's Curve of Scores by Good & Bad Class for the Best Model
ScoresCurveShow(Preds = Probs, # `Good` Class Probabilities - numeric vector
Obsers = ObClass, # Observed Classes (Reference) - factor with levels Bad/Good
NameOfModel = 'Initial GLM') # Name Of The Model##
## Initial GLM - Estimate a Kullback-Leibler’s Divergence Statistic on Testing Set = 0.0842
MicrosoftML package
John Mount has long wondered about the applicability of heterogeneous statistical methods to various classification problems. First, which classification methods are the most accurate in general, that is, which identify the correct class most of the time. Second, which classifiers behave most like each other, in terms of the class probabilities that they assign to each of the target classes. Answers to these questions can be found on the company’s website, Win-Vector LLC.
The rxLogisticRegression() algorithm is used to predict the value of a categorical dependent variable from its relationship to one or more independent variables assumed to have a logistic distribution. The rxLogisticRegression learner automatically adjusts the weights to select the variables that are most useful for making predictions (L1 and L2 regularization). As the training log below confirms, optimization is carried out with the memory-efficient L-BFGS method.
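For reference, a standard textbook formulation of the penalized log-likelihood objective that such a learner minimizes (shown here for illustration, not quoted from the MicrosoftML documentation) is:
\[ \displaystyle \large \min_{w} \ \frac{1}{n} \sum_{i=1}^{n} \ln \left( 1 + e^{-y_i w^{T} x_i} \right) \ + \ \lambda_1 \lVert w \rVert_1 \ + \ \frac{\lambda_2}{2} \lVert w \rVert_2^2 \]
where \(\lambda_1\) and \(\lambda_2\) correspond to the l1Weight and l2Weight arguments used below.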
# Train a Generalized Linear Model regularized by the L1 (Lasso) and L2 (Ridge) penalties
start_time <- Sys.time()
library('MicrosoftML') # Microsoft Machine Learning for R
# Setup parallel processing - 2 times faster
# library('doParallel'); cl <- makeCluster(detectCores()); registerDoParallel(cl)
writeLines('\n\nGeneralized Linear Model regularized by the L1 and L2 penalties ...\n')##
##
## Generalized Linear Model regularized by the L1 and L2 penalties ...
set.seed(seed)
# tuneLength <- 10
# tuneGrid = data.frame(lasso = seq(from = .1, to = 1, length.out = tuneLength),
# ridge = seq(from = .1, to = 1, length.out = tuneLength))
# Optimization.df <- data.frame(matrix(nrow = tuneLength, ncol = tuneLength))
# system.time(
# for (i in 1:tuneLength ) { # The L1 (Lasso) regularization
#
# for (j in 1:tuneLength ) { # The L2 (Ridge) regularization
# rxLogisticRegressionFit <-
# MicrosoftML::rxLogisticRegression(
# formula = Y ~ .
# , data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
# , type = 'binary'
# , l2Weight = tuneGrid[j, 'ridge'] # The L2 (Ridge) regularization weight
# , l1Weight = tuneGrid[i, 'lasso'] # The L1 (Lasso) regularization weight
# , normalize = 'no' # no normalization is performed
# , reportProgress = 0
# , verbose = 4 )
# Optimization.df[i, j] <- summary(rxLogisticRegressionFit)$summary$AIC
#
# }
# }
# )
#
# optRegularizations <- which(Optimization.df == min(Optimization.df), arr.ind = TRUE)
rxLogisticRegressionFit <-
MicrosoftML::rxLogisticRegression( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, type = 'binary'
# , l2Weight = 1 # tuneGrid[optRegularizations[2], 'ridge'] # The Ridge regularization weight
# , l1Weight = 1 # tuneGrid[optRegularizations[1], 'lasso'] # The Lasso regularization weight
, trainThreads = parallel::detectCores() # The number of threads to use in model
, normalize = 'no' # no normalization is performed
, reportProgress = 0
, verbose = 0 )## Not adding a normalizer.
## Beginning processing data.
## Rows Read: 328562, Read Time: 0.001, Transform Time: 0
## Beginning processing data.
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
## Beginning optimization
## num vars: 163
## improvement criterion: Mean Improvement
## L1 regularization selected 163 of 163 weights.
## Not training a calibrator because it is not needed.
## Elapsed time: 00:00:02.3866356
## Elapsed time: 00:00:00.1384473
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict( rxLogisticRegressionFit
, data = cbind(Y = Y[inTest], ObClass = Y[inTest] %>% as.integer() - 1, X[inTest, ])
, suffix = '.rxLogisticRegression'
, extraVarsToWrite = names(cbind(Y = Y[inTest], ObClass = Y[inTest] %>% as.integer() - 1, X[inTest, ]))
, outData = tempfile(fileext = '.xdf'))## Beginning processing data.
## Rows Read: 23316, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:00.2378449
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxLogisticRegression')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.025 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 3039 6817
## Good 2015 11445
##
## Accuracy : 0.6212
## 95% CI : (0.6149, 0.6274)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1697
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.6013
## Specificity : 0.6267
## Pos Pred Value : 0.3083
## Neg Pred Value : 0.8503
## Precision : 0.3083
## Recall : 0.6013
## F1 : 0.4076
## Prevalence : 0.2168
## Detection Rate : 0.1303
## Detection Prevalence : 0.4227
## Balanced Accuracy : 0.6140
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Generalized Linear Model
writeLines('Measure running time of `Generalized Linear Model` code = ')## Measure running time of `Generalized Linear Model` code =
## Time difference of 3.74208 secs
The rxFastTrees() algorithm is a high-performing, state-of-the-art, scalable boosted decision tree that implements FastRank, an efficient implementation of the MART gradient boosting algorithm. MART learns an ensemble of regression trees, where each regression tree is a decision tree with scalar values in its leaves. For binary classification, the output is converted to a probability by using some form of calibration.
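As a small illustration of such a calibration, here is the logistic (sigmoid) map from a raw score to a probability; the same transform is applied to Score.rxEnsemble.Good further below, and Platt-style scaling would additionally fit the slope a and offset b on held-out data:
# Logistic (sigmoid) map from a raw margin/score to a probability in (0, 1)
sigmoid <- function(score, a = 1, b = 0) 1 / (1 + exp(-(a * score + b)))
sigmoid(c(-2, 0, 2)) # 0.1192, 0.5000, 0.8808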
# Train FastTrees Model as an efficient implementation of the MART Gradient Boosting Algorithm (GTB)
start_time <- Sys.time()
writeLines('\n\nFast Trees is a Gradient Tree Boosting Algorithm (GTB) ...\n')##
##
## Fast Trees is a Gradient Tree Boosting Algorithm (GTB) ...
system.time(
rxFastTreesFit <-
MicrosoftML::rxFastTrees( formula = Y ~ .
, data = cbind(Y = Y[inTrain], X[inTrain, ]) # No up-sampling here; unbalancedSets = TRUE below handles the class imbalance
, type = 'binary'
, numTrees = 100
, numLeaves = 20
, learningRate = 0.3 # Determines the size of the step taken in the direction of gradient in each step
, minSplit = 10
, exampleFraction = 0.7 # The fraction of randomly chosen instances to use for each tree
, featureFraction = 1 # The fraction of randomly chosen features to use for each tree
, splitFraction = 1 # The fraction of randomly chosen features to use on each split
# , numBins = 255
, firstUsePenalty = 0 # The feature first use penalty coefficient
, gainConfLevel = 0 # Tree fitting gain confidence requirement (should be in the range [0,1))
, unbalancedSets = TRUE # derivatives optimized for unbalanced sets are used
, randomSeed = seed
, reportProgress = 0
, verbose = 0 )
)## user system elapsed
## 0.14 0.00 2.67
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict( rxFastTreesFit, fitTestScores, suffix = '.rxFastTrees'
, extraVarsToWrite = names(fitTestScores)
, outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 23316, Read Time: 0.004, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:00.3513291
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(
data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxFastTrees')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.031 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 3019 6681
## Good 2035 11581
##
## Accuracy : 0.6262
## 95% CI : (0.6199, 0.6324)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1737
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5973
## Specificity : 0.6342
## Pos Pred Value : 0.3112
## Neg Pred Value : 0.8505
## Precision : 0.3112
## Recall : 0.5973
## F1 : 0.4092
## Prevalence : 0.2168
## Detection Rate : 0.1295
## Detection Prevalence : 0.4160
## Balanced Accuracy : 0.6158
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Gradient Tree Boosting Model
writeLines('Measure running time of `Gradient Tree Boosting Model` code = ')## Measure running time of `Gradient Tree Boosting Model` code =
## Time difference of 3.490278 secs
Decision trees are non-parametric models that perform a sequence of simple tests on inputs. The rxFastForest() algorithm is a random forest that provides a learning method for classification, constructing an ensemble of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. Random decision forests correct for the overfitting to the training set to which single decision trees are prone. The rxFastForest learner automatically builds a set of trees whose combined predictions are better than the predictions of any one of the trees.
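As a toy illustration of "the mode of the classes" (the votes here are hypothetical, not taken from the fitted forest):
# Majority vote across five hypothetical trees; rxFastForest aggregates internally
treeVotes <- c('Bad', 'Good', 'Good', 'Bad', 'Good')
names(which.max(table(treeVotes))) # 'Good'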
# Train FastForest Model as an efficient implementation of the Random Forest (RF)
start_time <- Sys.time()
writeLines('\n\nFast Forest is a Fast Random Forest (RF) ...\n')##
##
## Fast Forest is a Fast Random Forest (RF) ...
system.time(
rxFastForestFit <-
MicrosoftML::rxFastForest( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, type = 'binary'
# , numTrees = 200
# , numLeaves = 30
, randomSeed = seed
, reportProgress = 1
, verbose = 0 )
)## Beginning processing data.
## Rows Read: 328562
## Beginning processing data.
## Beginning processing data.
## Rows Read: 328562
## Beginning processing data.
## user system elapsed
## 0.60 0.01 7.50
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict( rxFastForestFit, fitTestScores, suffix = '.rxFastForest'
, extraVarsToWrite = names(fitTestScores)
, outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 23316, Read Time: 0.003, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:00.8439608
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxFastForest')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.037 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 3035 7349
## Good 2019 10913
##
## Accuracy : 0.5982
## 95% CI : (0.5919, 0.6045)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1434
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.6005
## Specificity : 0.5976
## Pos Pred Value : 0.2923
## Neg Pred Value : 0.8439
## Precision : 0.2923
## Recall : 0.6005
## F1 : 0.3932
## Prevalence : 0.2168
## Detection Rate : 0.1302
## Detection Prevalence : 0.4454
## Balanced Accuracy : 0.5990
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Random Forest Model (Fast Forest)
writeLines('Measure running time of `Random Forest Model (Fast Forest)` code = ')## Measure running time of `Random Forest Model (Fast Forest)` code =
## Time difference of 8.736882 secs
The rxNeuralNet() algorithm supports a user-defined multilayer network topology with GPU acceleration. A neural network is a class of prediction models inspired by the human brain. It can be represented as a weighted directed graph in which each node is called a neuron. The neural network algorithm tries to learn the optimal weights on the edges from the training data. Any class of statistical models can be considered a neural network if it uses adaptive weights and can approximate non-linear functions of its inputs. Neural network regression is especially suited to problems where a more traditional regression model cannot fit a solution. The network structure is defined using the Net# language; example topologies are available in the Azure Gallery.
For GPU acceleration, it is recommended to use a miniBatchSize greater than one. If you want to use GPU acceleration, additional manual setup steps are required:
Download and install NVidia CUDA Toolkit 6.5 (CUDA Toolkit).
Download and install NVidia cuDNN v2 Library (cudnn Library).
Find the libs directory of the MicrosoftML package by calling system.file('mxLibs/x64', package = 'MicrosoftML').
Copy cublas64_65.dll, cudart64_65.dll and cusparse64_65.dll from the CUDA Toolkit 6.5 into the libs directory of the MicrosoftML package.
Copy cudnn64_65.dll from the cuDNN v2 Library into the libs directory of the MicrosoftML package.
# Train Artificial Neural Networks Model
start_time <- Sys.time()
writeLines('\n\nArtificial Neural Networks (ANN) ...\n')##
##
## Artificial Neural Networks (ANN) ...
# Azure Net# definition of the structure of the neural network
netDefinition <- ('
// Define constants.
const { T = true; F = false; }
// Input layer definition.
input Data [33];
// First convolutional layer definition.
hidden C1 [ 7, 9]
from Data convolve {
InputShape = [33];
KernelShape = [17];
Stride = [2];
MapCount = 7;
}
// Second normalize layer definition.
hidden N1 [ 7, 9]
from C1 response norm {
InputShape = [7, 9];
KernelShape = [1, 1];
Alpha = 0.0001;
Beta = 0.75;
Offset = 1;
AvgOverFullKernel = true;
}
// Third fully connected layer definition.
hidden H3 [100]
from N1 all;
// Output layer definition.
output Result auto softmax from H3 all;
')
# netDefinition <- '
# input Data [33];
#
# hidden H1 [100]
# from Data all;
#
# hidden H2 [100]
# from H1 all;
#
# output Result [1] softmax
# from H2 all;
# '
# The main issue was factorizing the data correctly.
system.time(
rxNeuralNetFit <-
MicrosoftML::rxNeuralNet( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, type = 'binary'
# , numHiddenNodes = 100
# , numIterations = 50
# , optimizer = adaDeltaSgd(decay = .99, conditioningConst = 1e-05)
#, netDefinition = netDefinition
, acceleration = 'gpu'
, initWtsDiameter = 0.005
, normalize = 'warn'
, randomSeed = seed
# , reportProgress = 1
# , verbose = 0
)
)## Not adding a normalizer.
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Failed to initialize CUDA runtime. Possible reasons:
## 1. The machine does not have CUDA-capable card. Supported devices have compute capability 2.0 and higher.
## 2. Outdated graphics drivers. Please install the latest drivers from http://www.nvidia.com/Drivers .
## 3. CUDA runtime DLLs are missing, please see the GPU acceleration help for the installation instructions.
## CUDA not supported, switched to SSE math.
## Using: SSE Math
##
## ***** Net definition *****
## input Data [162];
## hidden H [100] sigmoid { // Depth 1
## from Data all;
## }
## output Result [1] sigmoid { // Depth 0
## from H all;
## }
## ***** End net definition *****
## Input count: 162
## Output count: 1
## Output Function: Sigmoid
## Loss Function: LogLoss
## PreTrainer: NoPreTrainer
## ___________________________________________________________________
## Starting training...
## Learning rate: 0.001000
## Momentum: 0.000000
## InitWtsDiameter: 0.005000
## ___________________________________________________________________
## Initializing 1 Hidden Layers, 16401 Weights...
## Estimated Pre-training MeanError = 0.693613
## Iter:1/100, MeanErr=0.639431(-7.81%%), 3798.21M WeightUpdates/sec
## Iter:2/100, MeanErr=0.602368(-5.80%%), 3949.60M WeightUpdates/sec
## Iter:3/100, MeanErr=0.584635(-2.94%%), 3914.17M WeightUpdates/sec
## Iter:4/100, MeanErr=0.527511(-9.77%%), 3915.96M WeightUpdates/sec
## Iter:5/100, MeanErr=0.593839(12.57%%), 3827.49M WeightUpdates/sec
## Iter:6/100, MeanErr=0.554664(-6.60%%), 3856.07M WeightUpdates/sec
## Iter:7/100, MeanErr=0.562969(1.50%%), 3903.81M WeightUpdates/sec
## Iter:8/100, MeanErr=0.548285(-2.61%%), 3942.92M WeightUpdates/sec
## Iter:9/100, MeanErr=0.563428(2.76%%), 3897.05M WeightUpdates/sec
## Iter:10/100, MeanErr=0.586190(4.04%%), 3941.05M WeightUpdates/sec
## Iter:11/100, MeanErr=0.597030(1.85%%), 3853.81M WeightUpdates/sec
## Iter:12/100, MeanErr=0.608863(1.98%%), 3888.06M WeightUpdates/sec
## Iter:13/100, MeanErr=0.584287(-4.04%%), 3521.77M WeightUpdates/sec
## Iter:14/100, MeanErr=0.590426(1.05%%), 3879.91M WeightUpdates/sec
## Iter:15/100, MeanErr=0.520848(-11.78%%), 3869.92M WeightUpdates/sec
## Iter:16/100, MeanErr=0.582126(11.77%%), 3852.34M WeightUpdates/sec
## Iter:17/100, MeanErr=0.583295(0.20%%), 3921.90M WeightUpdates/sec
## Iter:18/100, MeanErr=0.603145(3.40%%), 3871.87M WeightUpdates/sec
## Iter:19/100, MeanErr=0.588263(-2.47%%), 3891.76M WeightUpdates/sec
## Iter:20/100, MeanErr=0.588528(0.05%%), 3984.09M WeightUpdates/sec
## Iter:21/100, MeanErr=0.561759(-4.55%%), 3952.93M WeightUpdates/sec
## Iter:22/100, MeanErr=0.597273(6.32%%), 3904.70M WeightUpdates/sec
## Iter:23/100, MeanErr=0.580513(-2.81%%), 3923.32M WeightUpdates/sec
## Iter:24/100, MeanErr=0.608079(4.75%%), 3931.57M WeightUpdates/sec
## Iter:25/100, MeanErr=0.575513(-5.36%%), 3940.99M WeightUpdates/sec
## Iter:26/100, MeanErr=0.601079(4.44%%), 3948.39M WeightUpdates/sec
## Iter:27/100, MeanErr=0.574797(-4.37%%), 3920.24M WeightUpdates/sec
## Iter:28/100, MeanErr=0.552832(-3.82%%), 3934.60M WeightUpdates/sec
## Iter:29/100, MeanErr=0.590020(6.73%%), 3982.74M WeightUpdates/sec
## Iter:30/100, MeanErr=0.574746(-2.59%%), 3846.70M WeightUpdates/sec
## Iter:31/100, MeanErr=0.582986(1.43%%), 3966.72M WeightUpdates/sec
## Iter:32/100, MeanErr=0.554354(-4.91%%), 3847.14M WeightUpdates/sec
## Iter:33/100, MeanErr=0.571911(3.17%%), 4055.53M WeightUpdates/sec
## Iter:34/100, MeanErr=0.607195(6.17%%), 3849.39M WeightUpdates/sec
## Iter:35/100, MeanErr=0.479481(-21.03%%), 3860.51M WeightUpdates/sec
## Iter:36/100, MeanErr=0.559952(16.78%%), 3806.84M WeightUpdates/sec
## Iter:37/100, MeanErr=0.572845(2.30%%), 3474.53M WeightUpdates/sec
## Iter:38/100, MeanErr=0.566092(-1.18%%), 3993.22M WeightUpdates/sec
## Iter:39/100, MeanErr=0.590559(4.32%%), 3884.40M WeightUpdates/sec
## Iter:40/100, MeanErr=0.562207(-4.80%%), 3865.32M WeightUpdates/sec
## Iter:41/100, MeanErr=0.570505(1.48%%), 3854.38M WeightUpdates/sec
## Iter:42/100, MeanErr=0.582996(2.19%%), 3185.89M WeightUpdates/sec
## Iter:43/100, MeanErr=0.589425(1.10%%), 3415.26M WeightUpdates/sec
## Iter:44/100, MeanErr=0.543417(-7.81%%), 2318.03M WeightUpdates/sec
## Iter:45/100, MeanErr=0.583254(7.33%%), 2706.60M WeightUpdates/sec
## Iter:46/100, MeanErr=0.570077(-2.26%%), 2547.02M WeightUpdates/sec
## Iter:47/100, MeanErr=0.593370(4.09%%), 3272.58M WeightUpdates/sec
## Iter:48/100, MeanErr=0.578531(-2.50%%), 3389.04M WeightUpdates/sec
## Iter:49/100, MeanErr=0.560148(-3.18%%), 3481.86M WeightUpdates/sec
## Iter:50/100, MeanErr=0.594706(6.17%%), 3710.83M WeightUpdates/sec
## Iter:51/100, MeanErr=0.576700(-3.03%%), 2830.77M WeightUpdates/sec
## Iter:52/100, MeanErr=0.591160(2.51%%), 3083.10M WeightUpdates/sec
## Iter:53/100, MeanErr=0.559937(-5.28%%), 3995.15M WeightUpdates/sec
## Iter:54/100, MeanErr=0.516699(-7.72%%), 4226.92M WeightUpdates/sec
## Iter:55/100, MeanErr=0.568916(10.11%%), 4151.69M WeightUpdates/sec
## Iter:56/100, MeanErr=0.551433(-3.07%%), 4325.79M WeightUpdates/sec
## Iter:57/100, MeanErr=0.601805(9.13%%), 4337.89M WeightUpdates/sec
## Iter:58/100, MeanErr=0.591378(-1.73%%), 4311.02M WeightUpdates/sec
## Iter:59/100, MeanErr=0.579823(-1.95%%), 4320.18M WeightUpdates/sec
## Iter:60/100, MeanErr=0.556642(-4.00%%), 4323.00M WeightUpdates/sec
## Iter:61/100, MeanErr=0.573425(3.01%%), 4319.77M WeightUpdates/sec
## Iter:62/100, MeanErr=0.571964(-0.25%%), 4263.00M WeightUpdates/sec
## Iter:63/100, MeanErr=0.568972(-0.52%%), 4258.27M WeightUpdates/sec
## Iter:64/100, MeanErr=0.569381(0.07%%), 4267.84M WeightUpdates/sec
## Iter:65/100, MeanErr=0.510309(-10.37%%), 4246.29M WeightUpdates/sec
## Iter:66/100, MeanErr=0.581686(13.99%%), 4215.08M WeightUpdates/sec
## Iter:67/100, MeanErr=0.570046(-2.00%%), 4180.59M WeightUpdates/sec
## Iter:68/100, MeanErr=0.588904(3.31%%), 3617.76M WeightUpdates/sec
## Iter:69/100, MeanErr=0.587524(-0.23%%), 4163.22M WeightUpdates/sec
## Iter:70/100, MeanErr=0.565621(-3.73%%), 4315.60M WeightUpdates/sec
## Iter:71/100, MeanErr=0.551080(-2.57%%), 4358.70M WeightUpdates/sec
## Iter:72/100, MeanErr=0.553144(0.37%%), 4288.95M WeightUpdates/sec
## Iter:73/100, MeanErr=0.546142(-1.27%%), 4341.31M WeightUpdates/sec
## Iter:74/100, MeanErr=0.599115(9.70%%), 4394.05M WeightUpdates/sec
## Iter:75/100, MeanErr=0.584863(-2.38%%), 4328.23M WeightUpdates/sec
## Iter:76/100, MeanErr=0.577997(-1.17%%), 4244.54M WeightUpdates/sec
## Iter:77/100, MeanErr=0.560597(-3.01%%), 4393.77M WeightUpdates/sec
## Iter:78/100, MeanErr=0.560554(-0.01%%), 4285.19M WeightUpdates/sec
## Iter:79/100, MeanErr=0.566539(1.07%%), 4323.93M WeightUpdates/sec
## Iter:80/100, MeanErr=0.583388(2.97%%), 4316.30M WeightUpdates/sec
## Iter:81/100, MeanErr=0.574882(-1.46%%), 4303.92M WeightUpdates/sec
## Iter:82/100, MeanErr=0.597803(3.99%%), 4363.38M WeightUpdates/sec
## Iter:83/100, MeanErr=0.573010(-4.15%%), 4319.29M WeightUpdates/sec
## Iter:84/100, MeanErr=0.576507(0.61%%), 4263.72M WeightUpdates/sec
## Iter:85/100, MeanErr=0.581926(0.94%%), 4375.43M WeightUpdates/sec
## Iter:86/100, MeanErr=0.550412(-5.42%%), 4324.91M WeightUpdates/sec
## Iter:87/100, MeanErr=0.524140(-4.77%%), 4318.94M WeightUpdates/sec
## Iter:88/100, MeanErr=0.584658(11.55%%), 4339.30M WeightUpdates/sec
## Iter:89/100, MeanErr=0.585000(0.06%%), 4297.34M WeightUpdates/sec
## Iter:90/100, MeanErr=0.562040(-3.92%%), 4313.00M WeightUpdates/sec
## Iter:91/100, MeanErr=0.553648(-1.49%%), 4360.05M WeightUpdates/sec
## Iter:92/100, MeanErr=0.598780(8.15%%), 4227.71M WeightUpdates/sec
## Iter:93/100, MeanErr=0.595341(-0.57%%), 4207.91M WeightUpdates/sec
## Iter:94/100, MeanErr=0.589134(-1.04%%), 4279.77M WeightUpdates/sec
## Iter:95/100, MeanErr=0.571897(-2.93%%), 4230.29M WeightUpdates/sec
## Iter:96/100, MeanErr=0.590322(3.22%%), 4332.06M WeightUpdates/sec
## Iter:97/100, MeanErr=0.568671(-3.67%%), 4195.30M WeightUpdates/sec
## Iter:98/100, MeanErr=0.528197(-7.12%%), 4155.70M WeightUpdates/sec
## Iter:99/100, MeanErr=0.573761(8.63%%), 4197.96M WeightUpdates/sec
## Iter:100/100, MeanErr=0.590485(2.91%%), 4186.22M WeightUpdates/sec
## Done!
## Estimated Post-training MeanError = 0.620647
## ___________________________________________________________________
## Not training a calibrator because it is not needed.
## Elapsed time: 00:02:18.7486743
## user system elapsed
## 0.61 0.01 139.44
trained_model <- data.frame(payload = as.raw(serialize(rxNeuralNetFit, connection = NULL)))
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict(rxNeuralNetFit, fitTestScores, suffix = '.rxNeuralNet',
extraVarsToWrite = names(fitTestScores),
outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 23316, Read Time: 0.004, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:00.2187188
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxNeuralNet')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.042 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 2882 6370
## Good 2172 11892
##
## Accuracy : 0.6336
## 95% CI : (0.6274, 0.6398)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1703
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5702
## Specificity : 0.6512
## Pos Pred Value : 0.3115
## Neg Pred Value : 0.8456
## Precision : 0.3115
## Recall : 0.5702
## F1 : 0.4029
## Prevalence : 0.2168
## Detection Rate : 0.1236
## Detection Prevalence : 0.3968
## Balanced Accuracy : 0.6107
##
## 'Positive' Class : Bad
##
remove (trained_model, netDefinition)
# Measure running time of R code for Artificial Neural Networks (ANN)
writeLines('Measure running time of `Artificial Neural Networks (ANN)` code = ')## Measure running time of `Artificial Neural Networks (ANN)` code =
## Time difference of 2.335371 mins
The rxOneClassSvm() algorithm trains a one-class support vector machine: this type of SVM is one-class because the training set contains only examples from the target class. It infers what properties are normal for the objects in the target class and from these properties predicts which examples are unlike the normal examples. This is useful for anomaly detection because the scarcity of training examples is the defining characteristic of anomalies: typically there are very few examples of network intrusion, fraud, or other types of anomalous behavior.
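A hedged sketch of how this learner could be fitted here (the arguments below are assumptions, since the actual run kept rxOneClassSvm commented out, as the scoring helper further below shows):
# Sketch only: train the one-class SVM on the normal ('Good') class;
# the formula carries no response term in one-class training
rxOneClassSvmFit <-
MicrosoftML::rxOneClassSvm( formula = ~ .
, data = X[inTrain, ][Y[inTrain] == 'Good', ]
, reportProgress = 0
, verbose = 0 )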
rxNaiveBayes() is RevoScaleR's Parallel External Memory Algorithm (PEMA) for Naive Bayes classifiers.
# Train a Naive Bayes Classifier
start_time <- Sys.time()
writeLines('\n\nNaive Bayes Classifier (NB) ...\n')##
##
## Naive Bayes Classifier (NB) ...
system.time(
rxNaiveBayesFit <-
RevoScaleR::rxNaiveBayes( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, reportProgress = 1 # the number of processed rows is printed and updated
, verbose = 0)
)##
## Rows Processed: 328562
## user system elapsed
## 0.62 0.03 0.83
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict( rxNaiveBayesFit, data = fitTestScores, type = 'prob'
, predVarNames = c('Probability.rxNaiveBayes.Bad', 'Probability.rxNaiveBayes.Good')
, writeModelVars = TRUE
, extraVarsToWrite = names(fitTestScores)
, outData = tempfile(fileext = '.xdf')
) ## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.694 seconds
# Use transformFunc to add PredictedLabel.rxNaiveBayes
ScoringResults.rxNaiveBayesFunc <- function(dataList) {
dataList$PredictedLabel.rxNaiveBayes <-
factor( x = ifelse(dataList$Probability.rxNaiveBayes.Good < 0.5, 'Bad', 'Good') )
return (dataList)
}
# Add PredictedLabel.rxNaiveBayes & Probability.rxEnsemble.Good into List Model's Scores
fitTestScores <-
RevoScaleR::rxDataStep(
inData = fitTestScores
, maxRowsByCols = 2e9
, transformFunc = ScoringResults.rxNaiveBayesFunc
, varsToDrop = c('Probability.rxNaiveBayes.Bad')
, overwrite = TRUE
)## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.323 seconds
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(one_of('PredictedLabel.rxNaiveBayes')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.050 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 3014 6803
## Good 2040 11459
##
## Accuracy : 0.6207
## 95% CI : (0.6145, 0.627)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1669
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5964
## Specificity : 0.6275
## Pos Pred Value : 0.3070
## Neg Pred Value : 0.8489
## Precision : 0.3070
## Recall : 0.5964
## F1 : 0.4054
## Prevalence : 0.2168
## Detection Rate : 0.1293
## Detection Prevalence : 0.4210
## Balanced Accuracy : 0.6119
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Naive Bayes (NB)
writeLines('Measure running time of `Naive Bayes Model (NB)` code = ')## Measure running time of `Naive Bayes Model (NB)` code =
## Time difference of 2.459047 secs
Next, we build an ensemble of fast tree models by using the function rxEnsemble().
# Train an Ensemble of Some Models
start_time <- Sys.time()
writeLines('\n\nEnsemble of Some Models ...\n')##
##
## Ensemble of Some Models ...
system.time(
rxEnsembleFit <-
MicrosoftML::rxEnsemble( formula = Y ~ .
, data = caret::upSample(X[inTrain, ], Y[inTrain], yname = 'Y') # Up-Sampling Imbalanced Data
, type = 'binary'
, randomSeed = seed
, replace = TRUE # logical value specifying whether the sampling of observations should be done with or without replacement
, trainers=list( fastTrees(randomSeed = seed)
# , neuralNet(optimizer = sgd(), randomSeed = seed)
# , neuralNet(optimizer = adaDeltaSgd(), randomSeed = seed)
, fastTrees(numTrees = 300, randomSeed = seed)
, fastTrees(numTrees = 300, randomSeed = seed, learningRate = 0.3)
)
, combineMethod = c('median', 'average', 'vote')[1] # 'median' is selected here; 'vote' would compute (pos - neg) / the total number of models, where 'pos' is the number of positive outputs and 'neg' is the number of negative outputs
# , reportProgress = 1
# , verbose = 0
)
)## Not adding a normalizer.
## Making per-feature arrays
## Changing data from row-wise to column-wise
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Processed 327991 instances
## Binning and forming Feature objects
## Reserved memory for tree learner: 48568 bytes
## Starting to train ...
## Not training a calibrator because it is not needed.
## Elapsed time: 00:00:04.1492102
## Not adding a normalizer.
## Making per-feature arrays
## Changing data from row-wise to column-wise
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Processed 327962 instances
## Binning and forming Feature objects
## Reserved memory for tree learner: 48568 bytes
## Starting to train ...
## Not training a calibrator because it is not needed.
## Elapsed time: 00:00:06.5958746
## Not adding a normalizer.
## Making per-feature arrays
## Changing data from row-wise to column-wise
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Processed 328662 instances
## Binning and forming Feature objects
## Reserved memory for tree learner: 48568 bytes
## Starting to train ...
## Not training a calibrator because it is not needed.
## Elapsed time: 00:00:06.2534772
## Beginning processing data.
## Rows Read: 328562, Read Time: 0, Transform Time: 0
## Elapsed time: 00:00:06.4938534
## Beginning processing data.
## user system elapsed
## 1.26 0.31 25.25
# Create the PredictedLabel, Score, and Probability, and save them in the new table defined in data source
fitTestScores <-
RevoScaleR::rxPredict(rxEnsembleFit, fitTestScores, suffix = '.rxEnsemble',
extraVarsToWrite = names(fitTestScores), # [!grepl('Bad_prob', names(fitTestScores))],
outData = tempfile(fileext = '.xdf'))## Beginning processing data.
## Rows Read: 23316, Read Time: 0.001, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:01.6385659
## Finished writing 23316 rows.
## Writing completed.
##
## Estimate a Classification Result on Testing Set
caret::confusionMatrix(data = RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.rxEnsemble')) %>% pull
, reference = Y[inTest], positive = 'Bad', mode = 'everything')## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.057 seconds
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 2995 6620
## Good 2059 11642
##
## Accuracy : 0.6278
## 95% CI : (0.6215, 0.634)
## No Information Rate : 0.7832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1735
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.5926
## Specificity : 0.6375
## Pos Pred Value : 0.3115
## Neg Pred Value : 0.8497
## Precision : 0.3115
## Recall : 0.5926
## F1 : 0.4083
## Prevalence : 0.2168
## Detection Rate : 0.1285
## Detection Prevalence : 0.4124
## Balanced Accuracy : 0.6150
##
## 'Positive' Class : Bad
##
# Measure running time of R code for Ensemble Model
writeLines('Measure running time of `Ensemble Model` code = ')## Measure running time of `Ensemble Model` code =
## Time difference of 27.30104 secs
After constructing the set of classifiers, we test them on the hold-out testing set (inTest), which did not participate in fitting the models, and compare the accuracy there with the corresponding results on the training set (inTrain).
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
#
# TRAINING SET
#
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
# Compute the fit models' ROC curves, AUC and Gini coefficients on Training Set
fitTrainScores <-
RevoScaleR::rxPredict( rxLogisticRegressionFit
, data = cbind(Y = Y[inTrain], ObClass = Y[inTrain] %>% as.integer()-1, X[inTrain, ])
, suffix = '.rxLogisticRegression'
, extraVarsToWrite = names(cbind(Y = Y[inTrain], ObClass = Y[inTrain] %>% as.integer()-1, X[inTrain, ]))
, outData = tempfile(fileext = '.xdf'))## Beginning processing data.
## Rows Read: 209838, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:01.1230519
## Finished writing 209838 rows.
## Writing completed.
fitTrainScores <-
RevoScaleR::rxPredict( rxFastTreesFit, fitTrainScores, suffix = '.rxFastTrees'
, extraVarsToWrite = names(fitTrainScores)
, outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 209838, Read Time: 0.012, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:02.7085950
## Finished writing 209838 rows.
## Writing completed.
fitTrainScores <-
RevoScaleR::rxPredict( rxFastForestFit, fitTrainScores, suffix = '.rxFastForest'
, extraVarsToWrite = names(fitTrainScores)
, outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 209838, Read Time: 0.014, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:07.3217277
## Finished writing 209838 rows.
## Writing completed.
fitTrainScores <-
RevoScaleR::rxPredict(rxNeuralNetFit, fitTrainScores, suffix = '.rxNeuralNet',
extraVarsToWrite = names(fitTrainScores),
outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 209838, Read Time: 0.015, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:01.5667492
## Finished writing 209838 rows.
## Writing completed.
# fitTrainScores <-
# RevoScaleR::rxPredict(rxOneClassSvmFit, fitTrainScores, suffix = '.rxOneClassSvm',
# extraVarsToWrite = names(fitTrainScores),
# outData = tempfile(fileext = '.xdf'))
fitTrainScores <-
RevoScaleR::rxPredict( rxNaiveBayesFit, data = fitTrainScores, type = 'prob'
, predVarNames = c('Probability.rxNaiveBayes.Bad', 'Probability.rxNaiveBayes.Good')
, writeModelVars = TRUE
, extraVarsToWrite = names(fitTrainScores)
, outData = tempfile(fileext = '.xdf'))## Rows Read: 209838, Total Rows Processed: 209838, Total Chunk Time: 2.125 seconds
fitTrainScores <-
RevoScaleR::rxPredict(rxEnsembleFit, fitTrainScores, suffix = '.rxEnsemble',
extraVarsToWrite = names(fitTrainScores),
outData = tempfile(fileext = '.xdf'))## Beginning read for block: 1
## Rows Read: 209838, Read Time: 0.098, Transform Time: 0
## Beginning read for block: 2
## No rows remaining. Finished reading data set.
## Elapsed time: 00:00:13.3835712
## Finished writing 209838 rows.
## Writing completed.
# Use transformFunc to add Probability.rxOneClassSvm & Probability.rxEnsemble.Good
ScoringResults.rxSomeModelsFunc <- function(dataList) {
# dataList$PredictedLabel.rxOneClassSvm <-
# factor( x = ifelse(dataList$Score.rxOneClassSvm > 0, 'Bad', 'Good') )
# dataList$Probability.rxOneClassSvm <-
# exp( dataList$Score.rxOneClassSvm ) / (1 + exp( dataList$Score.rxOneClassSvm ))
dataList$PredictedLabel.rxNaiveBayes <-
factor( x = ifelse(dataList$Probability.rxNaiveBayes.Good < 0.5, 'Bad', 'Good') )
dataList$Probability.rxEnsemble.Good <-
exp( dataList$Score.rxEnsemble.Good ) / (1 + exp( dataList$Score.rxEnsemble.Good ))
return (dataList)
}
# Add PredictedLabel.rxNaiveBayes & Probability.rxEnsemble.Good into List Model's Scores
fitTrainScores <-
RevoScaleR::rxDataStep(
inData = fitTrainScores
, outFile = tempfile(fileext = '.xdf')
, transformFunc = ScoringResults.rxSomeModelsFunc
, varsToDrop = c('Probability.rxNaiveBayes.Bad')
, overwrite = TRUE
)## Rows Read: 209838, Total Rows Processed: 209838, Total Chunk Time: 1.197 seconds
# Delete Any Scores and Data Variables into List Model's Scores
fitTrainScores <-
RevoScaleR::rxDataStep(
inData = fitTrainScores
, outFile = tempfile(fileext = '.xdf')
, varsToKeep = c( 'ObClass', grep('Probability.', names(fitTrainScores), value = TRUE),
grep('PredictedLabel.', names(fitTrainScores), value = TRUE) )
, overwrite = TRUE
)## Rows Read: 209838, Total Rows Processed: 209838, Total Chunk Time: 0.260 seconds
# names(fitTrainScores)
# Compute the fit models' ROC curves on Training Set (by Percentiles)
fitRocTrain <-
RevoScaleR::rxRoc(
actualVarName = 'ObClass'
, predVarNames = grep('Probability.', names(fitTrainScores), value = TRUE)
, data = fitTrainScores
, numBreaks = 100 # length(Y[inTrain])
)
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
#
# TESTING SET
#
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
# Add PredictedLabel.rxNaiveBayes & Probability.rxEnsemble.Good into List Model's Scores
fitTestScores <-
RevoScaleR::rxDataStep(
inData = fitTestScores
, outFile = tempfile(fileext = '.xdf')
, transformFunc = ScoringResults.rxSomeModelsFunc
# , varsToDrop = c('Probability.rxNaiveBayes.Bad')
, overwrite = TRUE
)## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.348 seconds
# Delete Any Scores and Data Variables into List Model's Scores
fitTestScores <-
RevoScaleR::rxDataStep(
inData = fitTestScores
, outFile = tempfile(fileext = '.xdf')
, varsToKeep = c( 'ObClass', grep('Probability.', names(fitTestScores), value = TRUE),
grep('PredictedLabel.', names(fitTestScores), value = TRUE) )
, overwrite = TRUE
)## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.030 seconds
# names(fitTestScores)
# Compute the fit models' ROC curves on Testing Set
fitRocTest <-
RevoScaleR::rxRoc(
actualVarName = 'ObClass'
, predVarNames = grep('Probability.', names(fitTestScores), value = TRUE)
, data = fitTestScores
, numBreaks = 100 # length(Y[inTest])
)
Compute and plot an ROC curve using actual and predicted values from a binary classifier system.
# Create a named list of the fit models.
fitList <-
list( rxLogisticRegression = rxLogisticRegressionFit
, rxFastTrees = rxFastTreesFit
, rxFastForest = rxFastForestFit
, rxNeuralNet = rxNeuralNetFit
# , rxOneClassSvm = rxOneClassSvmFit
, rxNaiveBayes = rxNaiveBayesFit
, rxEnsemble = rxEnsembleFit
)
algolist <- c( 'rxLogisticRegression'
, 'rxFastTrees'
, 'rxFastForest'
, 'rxNeuralNet'
# , 'rxOneClassSvm'
, 'rxNaiveBayes'
, 'rxEnsemble'
) %>%
.[order(.)]
# Create a named list of models' Predictions on Training Data.
predList <- RevoScaleR::rxImport(inData = fitTrainScores) %>%
dplyr::select(starts_with('PredictedLabel.')) %>%
as.list()## Rows Read: 209838, Total Rows Processed: 209838, Total Chunk Time: 0.278 seconds
names(predList) <- algolist
# Confusion Matrix evaluation results
cm_metrics <-lapply(predList
, caret::confusionMatrix
, reference = Y[inTrain]
, positive = 'Bad'
, mode = 'everything'
)
# Kappa
kap_metrics <-
lapply(cm_metrics, `[[`, 'overall') %>%
lapply(`[`, 2) %>%
unlist() %>%
as.vector()
# Sensitivity (Recall)
rec_metrics <-
lapply(cm_metrics, `[[`, 'byClass') %>%
lapply(`[`, 1) %>%
unlist() %>%
as.vector()
# Specificity
spe_metrics <-
lapply(cm_metrics, `[[`, 'byClass') %>%
lapply(`[`, 2) %>%
unlist() %>%
as.vector()
# Compute the fit models' AUCs.
fitAuc <- RevoScaleR::rxAuc(fitRocTrain)
names(fitAuc) <- substring(names(fitAuc), nchar('Probability.') + 1)
# coerce to data frame
result.df <- # Gini Sens Spec Kappa
data.frame( Name = algolist
, Gini1 = (fitAuc - 0.5) * 2
, Sens1 = formattable::digits(rec_metrics, 4)
, Spec1 = formattable::digits(spe_metrics, 4)
, Kappa1 = formattable::digits(kap_metrics, 4)
)
# Create a named list of models' Prediction Labels on Testing Data.
predList <- RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('PredictedLabel.')) %>%
select(sort(names(.))) %>%
setNames(algolist) %>%
  as.list()
## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.028 seconds
# Create a named list of models' Probabilities on Testing Data.
preds <- RevoScaleR::rxImport(inData = fitTestScores) %>%
dplyr::select(starts_with('Probability.')) %>%
select(sort(names(.))) %>%
  setNames(algolist)
## Rows Read: 23316, Total Rows Processed: 23316, Total Chunk Time: 0.029 seconds
# Confusion Matrix evaluation results
cm_metrics <- lapply(predList
, caret::confusionMatrix
, reference = Y[inTest]
, positive = 'Bad'
, mode = 'everything'
)
# Kappa
kap_metrics <-
lapply(cm_metrics, `[[`, 'overall') %>%
lapply(`[`, 2) %>%
unlist() %>%
as.vector()
# Sensitivity (Recall)
rec_metrics <-
lapply(cm_metrics, `[[`, 'byClass') %>%
lapply(`[`, 1) %>%
unlist() %>%
as.vector()
# Specificity
spe_metrics <-
lapply(cm_metrics, `[[`, 'byClass') %>%
lapply(`[`, 2) %>%
unlist() %>%
as.vector()
# Compute the fitted models' AUCs on the Testing Set.
fitAuc <- RevoScaleR::rxAuc(fitRocTest)
names(fitAuc) <- substring(names(fitAuc), nchar('Probability.') + 1)
# Plot the ROC curves and report their AUCs.
plot( fitRocTest
    , title = 'ROC Curve for Models on Testing Set' )
result.df %<>%
dplyr::mutate(
Gini2 = formattable::digits((fitAuc - 0.5) * 2, 4)
, Sens2 = formattable::digits(rec_metrics, 4)
, Spec2 = formattable::digits(spe_metrics, 4)
, Kappa2 = formattable::digits(kap_metrics, 4)
, Note = '(empty)'
) # %>%
# as_tibble %>%
# remove_rownames %>%
# dplyr::arrange(-Gini2) %T>%
  # print
Finally, we evaluate and compare the models built above from several angles.
# # Preallocate data types
# Name <- character() # Name of Method
# Gini1 <- numeric() # Gini's Coefficient of Training
# Sens1 <- numeric() # Sensitivity of Training
# Spec1 <- numeric() # Specificity of Training
# Kappa1 <- numeric() # Cohen's Kappa of Training
# Gini2 <- numeric() # Gini's Coefficient of Testing
# Sens2 <- numeric() # Sensitivity of Testing
# Spec2 <- numeric() # Specificity of Testing
# Kappa2 <- numeric() # Cohen's Kappa of Testing
# Note <- character() # Long Model Name
library('kableExtra') # Construct Complex Table with 'kable' and Pipe Syntax
library('formattable') # Create 'Formattable' Data Structures
##
## Attaching package: 'formattable'
## The following object is masked from 'package:plotly':
##
## style
# Classification results on TRAINING and TESTING Sets from `Microsoft Machine Learning` models
result.df %>%
mutate(Name = cell_spec(Name, 'html',
color = ifelse(Gini2 >= arrange(result.df, desc(Gini2))[3, 'Gini2'] %>%
as.numeric, 'red', 'black')),
Gini1, Sens1, Spec1, Kappa1,
Gini2 = proportion_bar('lightgreen')(Gini2),
Sens2 = cell_spec(Sens2, 'html',
color = ifelse(Sens2 >= arrange(result.df, desc(Sens2))[1, 'Sens2'] %>%
as.numeric, 'brown', 'black')),
Spec2 = cell_spec(Spec2, 'html',
color = ifelse(Spec2 >= arrange(result.df, desc(Spec2))[1, 'Spec2'] %>%
as.numeric, 'darkviolet', 'black')),
Kappa2,
Note = Note) %>%
knitr::kable(format = 'html', digits = 4, longtable = TRUE, booktabs = TRUE, escape = F,
col.names = c('Methods', rep(c('Gini', 'Sens', 'Spec', 'Kappa'), times = 2), 'Notes'),
caption = 'Classification results on TRAINING and TESTING Sets from `Microsoft Machine Learning` models') %>%
kableExtra::kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive',
full_width = FALSE)) %>%
kableExtra::column_spec(9, width = '3cm') %>%
  kableExtra::add_header_above(c(' ', 'Training Set' = 4, 'Testing Set' = 4, ' ')) # %>%

| Methods | Gini (Train) | Sens (Train) | Spec (Train) | Kappa (Train) | Gini (Test) | Sens (Test) | Spec (Test) | Kappa (Test) | Notes |
|---|---|---|---|---|---|---|---|---|---|
| rxEnsemble | 0.4973 | 0.6656 | 0.6374 | 0.2241 | 0.3214 | 0.5926 | 0.6375 | 0.1735 | (empty) |
| rxFastForest | 0.3615 | 0.6897 | 0.6506 | 0.2534 | 0.2751 | 0.6005 | 0.5976 | 0.1434 | (empty) |
| rxFastTrees | 0.4641 | 0.6604 | 0.6079 | 0.1933 | 0.3229 | 0.5973 | 0.6342 | 0.1737 | (empty) |
| rxLogisticRegression | 0.4145 | 0.6383 | 0.6659 | 0.2332 | 0.3175 | 0.6013 | 0.6267 | 0.1697 | (empty) |
| rxNaiveBayes | 0.3701 | 0.7040 | 0.6634 | 0.2759 | 0.3063 | 0.5964 | 0.6275 | 0.1669 | (empty) |
| rxNeuralNet | 0.4192 | 0.6361 | 0.6331 | 0.1999 | 0.3062 | 0.5702 | 0.6512 | 0.1703 | (empty) |
# kableExtra::group_rows(index = c('Generalized Linear Models' = 2,
# 'Decision Trees & Random Forests' = 4))
# NameOfTheBestModel = 'rxLogisticRegression'
NameOfTheBestModel <- arrange(result.df, desc(Gini2))[1, 'Name'] %>% as.character
NameOfTheSecondModel <- arrange(result.df, desc(Gini2))[2, 'Name'] %>% as.character
NameOfTheThirdModel <- arrange(result.df, desc(Gini2))[3, 'Name'] %>% as.character
# Save the best model (a `MicrosoftML` fit) for future use
TheBestModel <- get(paste0(NameOfTheBestModel, 'Fit'))
# saveRDS(TheBestModel, file = 'TelCo0_model.RData')
The table shows that, comparing the predictions on the test dataset with the observed classes, the top three models by quality were:
• the classifier trained by rxFastTrees - quality by ‘Gini’ 0.3229
• the classifier trained by rxEnsemble - quality by ‘Gini’ 0.3214
• the classifier trained by rxLogisticRegression - quality by ‘Gini’ 0.3175.
We therefore select the rxFastTrees classifier as the most suitable model on this test dataset.
Machine Learning models are widely used and have many applications in classification and regression tasks. Thanks to increasing computational power, new data sources and new methods, ML models keep growing in complexity. Models created with techniques like boosting, bagging or neural networks are true black boxes: it is hard to trace the link between input variables and model outcomes. They are used because of their high performance, but their lack of interpretability remains one of their weakest points.
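The interpretability tools loaded below (iml, DALEX, breakDown) address exactly this weakness. The feature-importance calculations that follow are based on permutation importance; a minimal sketch of the idea (with an illustrative placeholder `loss_fn`, not the iml implementation):
# Permutation feature importance: permute one column at a time and measure
# how much the loss worsens; `loss_fn(X, y)` is an illustrative placeholder
# that scores the fitted model's predictions on X against y.
permutation_importance <- function(X, y, loss_fn, seed = 42) {
  set.seed(seed)
  base_loss <- loss_fn(X, y)
  sapply(names(X), function(v) {
    X_perm <- X
    X_perm[[v]] <- sample(X_perm[[v]]) # break the link between feature v and y
    loss_fn(X_perm, y) - base_loss     # importance = increase in loss
  })
}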
# See https://github.com/pbiecek/DALEX, https://github.com/MI2DataLab/modelDown
# See www.r-bloggers.com/dalex-and-h2o-machine-learning-model-interpretability-and-feature-explanation/
library('iml') # Interpretable Machine Learning
library('DALEX') # Descriptive mAchine Learning EXplanations
library('breakDown') # Model Agnostic Explainers for Individual Predictions
Prob_fun <- function(object, newdata){ # returns the probability of the 'Good' class
# predict(object, newdata=newdata, type = 'prob')[, 'Good']
rxPredict(modelObject = object, data = newdata, blocksPerRead = 200000,
reportProgress = 0, verbose = 0) %>%
dplyr::select(starts_with('Probability.')) %>%
pull
}
loss_gini <- function(observed, predicted) {
hmeasure::HMeasure(true.class = observed, scores = predicted)[['metrics']] %>% .[1, 'Gini']
}
Predict.Fun <- function(model, newdata){ # same as Prob_fun above; kept for iml::Predictor below
rxPredict(modelObject = model, data = newdata, blocksPerRead = 200000,
reportProgress = 0, verbose = 0) %>%
dplyr::select(starts_with('Probability.')) %>%
pull
}
loss_Gini <- function(actual, predicted) {
hmeasure::HMeasure(true.class = actual, scores = predicted)[['metrics']] %>% .[1, 'Gini']
}
Scorecard <- 'Scorecard'
# Create Model Explainer from 'DALEX' package
explainer_classif_1 <-
DALEX::explain(TheBestModel, label = NameOfTheBestModel,
data = X[inTest, ], y = Y[inTest] %>% as.integer()-1, predict_function = Prob_fun)
explainer_classif_2 <-
DALEX::explain(TheBestModel, label = NameOfTheSecondModel,
data = X[inTest, ], y = Y[inTest] %>% as.integer()-1, predict_function = Prob_fun)
explainer_classif_3 <-
DALEX::explain(get(paste0(NameOfTheThirdModel, 'Fit')), label = NameOfTheThirdModel,
data = X[inTest, ], y = Y[inTest] %>% as.integer()-1, predict_function = Prob_fun)
if (!exists('wb')) {
# Create MS Excel File for Output
openxlsx::addWorksheet(wb <- openxlsx::createWorkbook(), sheetName = 'IV Table',
gridLines = FALSE, tabColour = 'olivedrab')
openxlsx::addWorksheet(wb, sheetName = 'Scorecard', gridLines = FALSE, tabColour = 'brown')
}
if (ncol(X) <= Max_Vars) {
# [Interpretable Machine Learning: Feature Importance](https://christophm.github.io/interpretable-ml-book)
system.time({
### Setup parallel processing - 4 times faster
library('doParallel'); cl <- makeCluster(detectCores()); registerDoParallel(cl)
model1 <- iml::Predictor$new(model = TheBestModel, data= X[inTest, ],
y = Y[inTest] %>% as.integer() - 1, class = 'Good', predict.fun = Predict.Fun, type = 'prob' )
# Feature Importance
imp1 <- iml::FeatureImp$new(model1, loss = loss_Gini, compare = 'difference', parallel = TRUE)
importance.df <- imp1$results %>%
dplyr::mutate(Variable = gsub('_fct', '', feature) %>% as.factor,
Importance = -importance) %>%
dplyr::select(Variable, Importance)
p1 <- imp1$plot() + ggplot2::ggtitle(paste(NameOfTheBestModel, 'by Gini coefficient'))
print(p1)
# model2 <- iml::Predictor$new(model = get(paste0(NameOfTheSecondModel, 'Fit')), data= X[inTest, ],
# y = Y[inTest] %>% as.integer()-1, class = 'Good', predict.fun = Predict.Fun, type = 'prob' )
# # Feature Importance
# imp2 <- iml::FeatureImp$new(model2, loss = loss_Gini, compare = 'difference', parallel = TRUE)
# imp2$plot() + ggplot2::ggtitle(paste(NameOfTheSecondModel, 'by Gini coefficient'))
#
# model3 <- iml::Predictor$new(model = get(paste0(NameOfTheThirdModel, 'Fit')), data= X[inTest, ],
# y = Y[inTest] %>% as.integer()-1, class = 'Good', predict.fun = Predict.Fun, type = 'prob' )
# # Feature Importance
# imp3 <- iml::FeatureImp$new(model3, loss = loss_Gini, compare = 'difference', parallel = TRUE)
# imp3$plot() + ggplot2::ggtitle(paste(NameOfTheThirdModel, 'by Gini coefficient'))
TheImportancePredictor <- imp1$results[nrow(imp1$results), 'feature']
# Remember to stop the cluster in the end again
stopCluster(cl)
})
# # Choice of The Importance Predictor - www.machinelearningplus.com/machine-learning/feature-selection/
# TheImportancePredictor <-
# imp1$results %>% arrange(importance) %>% dplyr::select(feature) %>% pull %>% .[1]
#
# # Effect of features on the model predictions
# ale = iml::FeatureEffect$new(model, feature = TheImportancePredictor, method = 'ale')
# ale$plot()
} else {
system.time({
# # Model Performance from 'DALEX' package
# plot(DALEX::model_performance(explainer_classif_1)
# , DALEX::model_performance(explainer_classif_2)
# , DALEX::model_performance(explainer_classif_3)
# , geom = 'boxplot')
# Variable importance
    importance.df <- DALEX::variable_importance(explainer_classif_1, type = 'raw', loss_function = loss_gini) %>%
      dplyr::mutate(Importance = -(dropout_loss - .[variable == '_full_model_', 'dropout_loss'])) %>%
dplyr::filter(!str_detect(variable, 'full_model|baseline')) %>%
dplyr::mutate(Variable = gsub('_fct', '', variable)) %>%
dplyr::select(Variable, Importance)
p1 <- DALEX::variable_importance(explainer_classif_1, type = 'raw', loss_function = loss_gini) %>%
plot() + labs(title = 'Variable Importance', caption = 'by Gini Coefficient') + theme_grey()
print(p1)
# DALEX::variable_importance(explainer_classif_2, type = 'raw', loss_function = loss_gini) %>%
# plot() + labs(subtitle = 'Variable Importance', caption = 'by Gini Coefficient')
# DALEX::variable_importance(explainer_classif_3, type = 'raw', loss_function = loss_gini) %>%
# plot() + labs(subtitle = 'Variable Importance', caption = 'by Gini Coefficient')
TheImportancePredictor <-
DALEX::variable_importance(explainer_classif_1, type = 'raw')[2, 'variable'] %>%
as.character
})
} # End if ncol(X) <= Max_Vars
##    user  system elapsed
##    0.81    0.39  116.79
# Output Chart: EXplanations of Model by Variable Importance
openxlsx::insertPlot(wb, sheet = 'IV Table', xy = c(1, nrow(binning.df)+5), width = 10 * (1 + sqrt(5)) / 2,
height = 10, units = 'cm')
if (TheBestModel$Description != 'LogisticRegression')
openxlsx::insertPlot(wb, sheet = Scorecard, xy = c(1,2), width = 10*(1 + sqrt(5))/2, height = 10, units = 'cm')
# # compute Partial Dependence Plots for a given variable --> uses the pdp package
# plot(DALEX::variable_response(explainer_classif_1, variable = TheImportancePredictor, type = 'factor')) +
# ggtitle('Marginal Response for a Single Variable by Gini Coefficient')
# plot(DALEX::variable_response(explainer_classif_2, variable = TheImportancePredictor, type = 'factor')) +
# ggtitle('Marginal Response for a Single Variable by Gini Coefficient')
# plot(DALEX::variable_response(explainer_classif_3, variable = TheImportancePredictor, type = 'factor')) +
# ggtitle('Marginal Response for a Single Variable by Gini Coefficient')
if (ncol(X) <= Max_Vars) {
# Explanations for a Single Prediction for Observations
# True Good Observation
new_obj <- data.frame(Probs = preds[[NameOfTheBestModel]], Obs = Y[inTest],
Equal = ifelse(preds[[NameOfTheBestModel]] < 0.5, 0, 1) == Y[inTest] %>% as.integer() - 1) %>%
rowid_to_column %>% dplyr::arrange(-Probs) %>%
dplyr::filter(Probs <= 1, Obs == 'Good', Equal == TRUE) %>%
dplyr::select(rowid) %>% pull %>% .[1]
  # Recover the names of the selected scale variables
pdp1 <- DALEX::prediction_breakdown(explainer_classif_1, observation = X[inTest, ][new_obj, ])
pdp1$variable <- pdp1$variable_value
pdp1[1, 'variable'] <- '(Intercept)'; pdp1[nrow(pdp1), 'variable'] <- 'final_prognosis'
# pdp2 <- DALEX::prediction_breakdown(explainer_classif_2, observation = X[inTest, ][new_obj, ])
# pdp2$variable <- pdp2$variable_value
# pdp2[1, 'variable'] <- '(Intercept)'; pdp2[nrow(pdp2), 'variable'] <- 'final_prognosis'
#
# pdp3 <- DALEX::prediction_breakdown(explainer_classif_3, observation = X[inTest, ][new_obj, ])
# pdp3$variable <- pdp3$variable_value
# pdp3[1, 'variable'] <- '(Intercept)'; pdp3[nrow(pdp3), 'variable'] <- 'final_prognosis'
p2 <- plot(pdp1, vcolors = c('-1' = 'tomato3', '0' = '#f5f5f5', '1' = 'palegreen3', 'X' = '#00BFC4')) +
# theme(strip.background = element_rect(fill = 'gray45')) +
# theme(strip.text = element_text(colour = 'white')) +
theme_grey() + theme(legend.position = 'none', panel.border = element_blank()) +
labs(title = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
# plot(pdp2) +
# labs(subtitle = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
# plot(pdp3) +
# labs(subtitle = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
print(p2)
  # Output Chart: Explanations of the Model by Predictors of a True Good Observation
openxlsx::insertPlot(wb, sheet = 'IV Table', xy = c(7, nrow(binning.df) + 5), width = 10 * (1 + sqrt(5)) / 2,
height = 10, units = 'cm')
if (TheBestModel$Description != 'LogisticRegression')
openxlsx::insertPlot(wb, sheet = Scorecard, xy=c(1, 21), width = 10*(1 + sqrt(5))/2, height=10, units = 'cm')
# True Bad Observation
  # Choose a different level of the importance predictor for the True Bad observation than for the True Good one
AnotherLevel4TrueBad <- table(X[inTest, TheImportancePredictor], Y[inTest])[-which.max(table(X[inTest,
TheImportancePredictor], Y[inTest])[, 'Good']), 'Bad'] %>%
which.max %>% names
if (is.null(AnotherLevel4TrueBad)) {
AnotherLevel4TrueBad <- table(X[inTest, TheImportancePredictor], Y[inTest]) %>%
as.data.frame(stringsAsFactors = FALSE) %>%
setNames(c('Levels', 'Classes', 'Freq')) %>%
dplyr::filter(Classes == 'Bad', Freq > 0) %>%
dplyr::select(Levels) %>% pull
}
new_obj <- data.frame(Probs = preds[[NameOfTheBestModel]], Obs = Y[inTest],
Equal = ifelse(preds[[NameOfTheBestModel]] < 0.5, 0, 1) == Y[inTest] %>% as.integer() - 1,
TheImportancePredictor = X[inTest, TheImportancePredictor]) %>%
rowid_to_column %>% dplyr::arrange(-Probs) %>%
dplyr::filter(Probs < 0.4, Obs == 'Bad', Equal == TRUE,
TheImportancePredictor == AnotherLevel4TrueBad) %>%
dplyr::select(rowid) %>% pull %>% .[1]
pdp1 <- DALEX::prediction_breakdown(explainer_classif_1, observation = X[inTest, ][new_obj, ])
pdp1$variable <- pdp1$variable_value
pdp1[1, 'variable'] <- '(Intercept)'; pdp1[nrow(pdp1), 'variable'] <- 'final_prognosis'
# pdp2 <- DALEX::prediction_breakdown(explainer_classif_2, observation = X[inTest, ][new_obj, ])
# pdp2$variable <- pdp2$variable_value
# pdp2[1, 'variable'] <- '(Intercept)'; pdp2[nrow(pdp2), 'variable'] <- 'final_prognosis'
#
# pdp3 <- DALEX::prediction_breakdown(explainer_classif_3, observation = X[inTest, ][new_obj, ])
# pdp3$variable <- pdp3$variable_value
# pdp3[1, 'variable'] <- '(Intercept)'; pdp3[nrow(pdp3), 'variable'] <- 'final_prognosis'
p3 <- plot(pdp1, vcolors = c('-1' = 'tomato3', '0' = '#f5f5f5', '1' = 'palegreen3', 'X' = '#F8766D')) +
# theme(strip.background = element_rect(fill = 'gray45')) +
# theme(strip.text = element_text(colour = 'white')) +
theme_grey() + theme(legend.position = 'none', panel.border = element_blank()) +
labs(title = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
# plot(pdp2) +
# labs(subtitle = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
# plot(pdp3) +
# labs(subtitle = paste0('Reference (', new_obj, ' observation) = True ', Y[inTest][ new_obj ]))
print(p3)
  # Output Chart: Explanations of the Model by Predictors of a True Bad Observation
openxlsx::insertPlot(wb, sheet = 'IV Table', xy = c(15, nrow(binning.df) + 5), width = 10 * (1+sqrt(5)) / 2,
height = 10, units = 'cm')
if (TheBestModel$Description != 'LogisticRegression')
    openxlsx::insertPlot(wb, sheet = Scorecard, xy = c(1, 40), width = 10*(1 + sqrt(5))/2, height = 10, units = 'cm')
}
remove(model1, model2, model3, # imp1,
       imp2, imp3, explainer_classif_1, explainer_classif_2, explainer_classif_3, p, new_obj, pdp1, pdp2, pdp3)
breakDown is a model-agnostic tool for decomposing predictions from black boxes. The Break Down Table shows the contribution of every variable to a final prediction; the Break Down Plot presents these contributions in a concise graphical way. The package works for binary classifiers and for general regression models.
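For an additive model such as a logit, these contributions are simply the per-variable terms of the linear predictor (on the link scale); a minimal illustration with hypothetical coefficients (breakDown/DALEX generalise this step-wise decomposition to arbitrary black boxes):
# Break-down of a linear predictor into per-variable contributions
breakdown_linear <- function(coefs, x) { # coefs: named vector, first element is the intercept
  contrib <- coefs[-1] * x
  c('(Intercept)' = unname(coefs[1]), contrib,
    final_prognosis = unname(coefs[1] + sum(contrib)))
}
breakdown_linear(c(`(Intercept)` = -1.2, a = 0.8, b = -0.5),
                 c(a = 1, b = 2)) #=> (Intercept) -1.2, a 0.8, b -1.0, final_prognosis -1.4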
# # Create MS Excel File for Output
# openxlsx::addWorksheet(wb <- openxlsx::createWorkbook(), sheetName = 'Scorecard', gridLines = FALSE,
# tabColour = 'brown')
# preds - Probabilities of `Good` Class
# Generate an ROC Curve for the Best Model
ROCCurveShow(Preds = preds[NameOfTheBestModel], # `Good` Class Probabilities - numeric vector
Obsers = Y[inTest], # Observed Classes (Reference) - logical vector
NameOfModel = NameOfTheBestModel, # Name Of The Model
             wb = wb) # Workbook from `openxlsx` package
##
## rxFastTrees - Estimate a Area Under the ROC Curve (AUC) on Testing Set = 66.15%
# Generate a KS Curve for the Best Model
KSCurveShow(Preds = preds[NameOfTheBestModel], # `Good` Class Probabilities - numeric vector
Obsers = Y[inTest], # Observed Classes (Reference) - logical vector
NameOfModel = NameOfTheBestModel, # Name Of The Model
            wb = wb) # Workbook from `openxlsx` package
##
## rxFastTrees - Estimate a Kolmogorov-Smirnov Statistic on Testing Set = 0.2372
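The Kolmogorov-Smirnov statistic reported above is the maximum vertical gap between the cumulative score distributions of the Good and Bad classes; a minimal sketch (illustrative, not the KSCurveShow implementation):
# KS = max |F_Good(s) - F_Bad(s)| over score thresholds s
ks_stat <- function(scores, classes) {
  s <- sort(unique(scores))
  max(abs(ecdf(scores[classes == 'Good'])(s) - ecdf(scores[classes == 'Bad'])(s)))
}
# e.g. ks_stat(preds[[NameOfTheBestModel]], Y[inTest])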
# Generate a Distribution's Curve of Scores by Good & Bad Class for the Best Model as GLM
ScoresCurveShow(Preds = preds[NameOfTheBestModel], # `Good` Class Probabilities - numeric vector
Obsers = Y[inTest], # Observed Classes (Reference) - logical vector
NameOfModel = NameOfTheBestModel, # Name Of The Model
                wb = wb) # Workbook from `openxlsx` package
##
## rxFastTrees - Estimate a Kullback-Leibler’s Divergence Statistic on Testing Set = 0.0866
Finally, let’s build a Scorecard on the Test dataset; a full scorecard table is produced only when the selected model is a Logit-Model.
I = 50   # Score increment (Points to Double the Odds, PDO)
c_ = 500 # Offset of scores (margin between the Good & Bad classes)
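# Sanity check of the points-to-double-the-odds scaling (standard scorecard
# formula: Score = c_ + (I / log(2)) * log(odds)): odds of 1:1, 2:1 and 4:1
# map to 500, 550 and 600 points respectively.
# round(c_ + (I / log(2)) * log(c(1, 2, 4))) #=> 500 550 600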
if (TheBestModel$Description == 'LogisticRegression') {
# Create Scorecard for GLM
Attributes <- vector()
Levels <- vector()
Predictors <- vector()
Totals <- vector()
ObBads <- vector()
ObGoods <- vector()
PrGoods <- vector()
PrClass <- rxImport(inData= fitTestScores, varsToKeep = 'PredictedLabel.rxLogisticRegression') %>%
pull
Chi_squared_test <- vector()
for (i in 1:length(rxLogisticRegressionFit$params$formulaVars %>% .[-1])) {
Attributes0 <-
rxLogisticRegressionFit$coefficients %>% names() %>%
.[ grep(pattern = rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i],
x = rxLogisticRegressionFit$coefficients %>% names()) ] %>%
sort %>%
dplyr::as_tibble() %>%
tidyr::separate(value, into = c('empty', 'Attributes'),
sep=paste0(rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i], '.')) %>%
pull(Attributes)
Attributes <- c(Attributes, Attributes0)
Levels0 <- rxLogisticRegressionFit$coefficients %>% names() %>%
.[ grep(pattern = rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i],
x = rxLogisticRegressionFit$coefficients %>% names()) ] %>% sort
Levels <- c(Levels, Levels0)
Predictors <- c(Predictors, rep(x = rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i],
times = length(Levels0)))
Totals <- c(Totals, table(X[inTest, rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ]) %>% as.vector)
ObBads <- c(ObBads, table(X[inTest, rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ], Y[inTest]) %>%
.[, 'Bad'] %>% as.vector)
ObGoods <- c(ObGoods, table(X[inTest, rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ], Y[inTest]) %>%
.[, 'Good'] %>% as.vector)
PrGoods <- c(PrGoods, table(X[inTest, rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ], PrClass) %>%
.[, 'Good'] %>% as.vector)
# Chi squared test (contingency table) for each row in a table by each Predictors
Chi_squared_test <- c( Chi_squared_test, apply(table(X[inTest,
rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i] ], Y[inTest]), 1,
function(x) chisq.test(matrix(x, ncol = levels(Y) %>% length))$p.value) )
}
# Part of Intercept's Score / Number of Features
CommonScore = (c_ + rxLogisticRegressionFit$coefficients[1] * I / log(2)) / length(rxLogisticRegressionFit$params$formulaVars %>% .[-1])
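  # Note: splitting the intercept's points equally across the predictors keeps each
  # attribute's score self-contained, while an applicant's total score still equals
  # c_ + (I / log(2)) * (intercept + sum of the applicable coefficients).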
# Create a Data.Frame by Predictors with Coefficients & Scores
Scorecard.df <-
# dplyr::left_join( # faster join as.integer, then as.factor and finally as.character
# x = data.frame(Predictors, Attributes, Levels, Totals, ObBads, ObGoods, PrGoods, Chi_squared_test,
# stringsAsFactors = FALSE),
# y = data.frame(Names = rxLogisticRegressionFit$coefficients %>% attr('names'),
# Coefficients = rxLogisticRegressionFit$coefficients,
# stringsAsFactors = FALSE),
# by = c('Levels' = 'Names')
# ) %>%
dplyr::left_join( # faster join as.integer, then as.factor and finally as.character
x = data.frame(Predictors, Attributes, Levels,
stringsAsFactors = FALSE),
y = data.frame(Levels = paste0( attr(Chi_squared_test, 'names') %>%
str_extract_all(., boundary('word')) %>% transpose() %>% .[2] %>% unlist,
'_fct.', attr(Chi_squared_test, 'names') ),
Totals, ObBads, ObGoods, PrGoods, Chi_squared_test,
stringsAsFactors = FALSE),
by = c('Levels' = 'Levels') ) %>%
# Attaching tables with different lengths due to the small gradations of some predictors
dplyr::left_join( # faster join as.integer, then as.factor and finally as.character
x = .,
y = data.frame(Names = rxLogisticRegressionFit$coefficients %>% attr('names'),
Coefficients = rxLogisticRegressionFit$coefficients,
stringsAsFactors = FALSE),
by = c('Levels' = 'Names') ) %>%
tidyr::replace_na(list(Coefficients = 0)) %>%
dplyr::mutate(
Total = Totals, Bad = ObBads, Good = ObGoods,
`Share of Total` = Totals / length(Y[inTest]),
`Chi Squared` = Chi_squared_test,
`Pred Good` = PrGoods,
`Sensitivity by Levels` = PrGoods / Totals,
Scores = round(Coefficients * I / log(2) + CommonScore, 0)) %>%
dplyr::select(-Levels, -Chi_squared_test, -Totals, -ObBads, -ObGoods, -PrGoods)
ListOfPredictors <- vector()
for (i in 1:length(rxLogisticRegressionFit$params$formulaVars %>% .[-1])) {
ListOfPredictors <- c( ListOfPredictors, length(rxLogisticRegressionFit$coefficients %>% names() %>%
      .[ grep(pattern = rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i],
x = rxLogisticRegressionFit$coefficients %>% names()) ]) )
attr(ListOfPredictors, 'names')[i] <- rxLogisticRegressionFit$params$formulaVars %>% .[-1] %>% .[i]
}
  header.df <- data.frame(
    A = c('Scorecard by Logit-Model and Result of Classification on TESTING Set', 'Logit Model'),
    B = c(NA, NA), C = c(NA, NA),
    D = c(NA, 'Observed Distribution (Reference)'), E = c(NA, NA), F = c(NA, NA), G = c(NA, NA), H = c(NA, NA),
    I = c(NA, 'Prediction'), stringsAsFactors = FALSE)
ListOfHeaders <- c(2, 5, 3)
attr(ListOfHeaders, 'names') <- c(header.df[2, 'A'], header.df[2, 'D'], header.df[2, 'I'])
# Show a Data.Frame by Predictors with Coefficients & Scores
Scorecard.df %>%
mutate( # Predictors = cell_spec(Predictors, bold = TRUE),
Attributes = Attributes,
Coefficients = cell_spec(Coefficients, 'html', color=ifelse(Scores >= arrange(., desc(Scores))[nrow(.)*.25,
'Scores']%>% as.numeric, 'darkgreen', 'black')),
Total, Bad, Good,
`Share of Total` = formattable::percent(`Share of Total`, digits = 1),
`Chi Squared` = cell_spec(formattable(`Chi Squared`, format = "f", digits = 4), 'html',
color = ifelse(`Chi Squared` >= 0.05, 'orangered', 'darkgray')),
`Pred Good`,
`Sensitivity by Levels` = proportion_bar('khaki')(round(`Sensitivity by Levels`, 2)),
Scores = proportion_bar('chartreuse')(Scores) ) %>% dplyr::select(-Predictors) %>%
knitr::kable(format = 'html', digits = 4, longtable = TRUE, booktabs = TRUE, escape = F,
# col.names = c('Levels of Predictors', 'Coefficients', 'Total', 'Bad', 'Good', 'Share of Total',
# 'Chi²', 'Predicted Good', 'Sensitivity by Levels', 'Scores'),
caption = header.df[1, 'A']) %>%
kableExtra::kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive',
full_width = FALSE)) %>%
# kableExtra::column_spec(10, width = '5cm') %>%
kableExtra::add_header_above(ListOfHeaders) # %>%
#kableExtra::group_rows(index = ListOfPredictors)
# Export a Data.Frame with Coefficients & Scores into MS Excel
N <- 4:(nrow(Scorecard.df) + 3)
writeDataTable(wb, sheet = Scorecard, x = Scorecard.df, tableStyle = 'TableStyleMedium2', startCol = 'A',
startRow = 3, tableName = 'Scorecard', firstColumn = TRUE, lastColumn = TRUE, bandedRows = TRUE)
# Set Columns widths
setColWidths(wb, sheet=Scorecard, cols=1:ncol(Scorecard.df), widths = c(14, 15, 11, 7, 7, 7, 8, 9, 7, 11, 10))
mergeCells(wb, sheet = Scorecard, cols = 1:3, rows = 2)
mergeCells(wb, sheet = Scorecard, cols = 4:8, rows = 2)
mergeCells(wb, sheet = Scorecard, cols = 9:11, rows = 2)
# # Set Row heights
# setRowHeights(wb, sheet = Scorecard, rows = 1, heights = 45)
# Set Styles & Conditional Formattings in Columns
addStyle(wb, sheet = Scorecard, style = createStyle(wrapText = TRUE, halign = 'center', valign = 'center'),
cols = 1:ncol(Scorecard.df), rows = 3)
conditionalFormatting(wb, sheet = Scorecard, cols = 3, rows = N, type = "between", # Coefficients
rule = c(quantile(Scorecard.df$Coefficients)['75%'], max(Scorecard.df$Coefficients)),
style = createStyle(fontColour = 'darkgreen'))
addStyle(wb, sheet=Scorecard, cols=3, rows=N, style = createStyle(border = 'right', borderColour = '#4F81BC'))
addStyle(wb, sheet = Scorecard, createStyle(numFmt = 'comma'), cols = 4:6, rows = N, gridExpand = TRUE)
addStyle(wb, sheet = Scorecard, cols = 7, rows = N, style = createStyle(numFmt = '0%'))
addStyle(wb, sheet = Scorecard, cols = 8, rows = N, style = createStyle(border = 'right',
borderColour = '#4F81BC', fontColour = 'darkgrey', numFmt = paste0('0', options()$OutDec, '0000')))
conditionalFormatting(wb, sheet = Scorecard, cols = 8, rows = N, type = "between", rule = c(0.05, 1)) # Chi²
addStyle(wb, sheet = Scorecard, cols = 9, rows = N, style = createStyle(numFmt = 'COMMA'))
addStyle(wb, sheet = Scorecard, cols = 10, rows = N,
style = createStyle(numFmt = paste0('0', options()$OutDec, '0000')))
conditionalFormatting(wb, sheet = Scorecard, cols = ncol(Scorecard.df) - 1, rows = N,
style = c('red', 'khaki'), type = 'databar')
conditionalFormatting(wb, sheet = Scorecard, cols = ncol(Scorecard.df), rows = N,
style = c('red', 'chartreuse'), type = 'databar')
writeData(wb, sheet = Scorecard, header.df, colNames = FALSE, rowNames = FALSE, startCol = 'A', startRow = 1)
addStyle(wb, sheet = Scorecard, cols=1, rows = 1, style = createStyle(fontSize = 16, textDecoration = 'bold'))
addStyle(wb, sheet = Scorecard, cols = 1:ncol(Scorecard.df), rows = 2, style = createStyle(wrapText = TRUE,
halign = 'center', valign = 'center', fontColour = 'white', fgFill = '#4F81BC', textDecoration = 'bold'))
addStyle(wb, sheet = Scorecard, cols=3, rows=3, style = createStyle(border = 'right', borderColour = 'white'))
addStyle(wb, sheet = Scorecard, cols=8, rows=3, style = createStyle(border = 'right', borderColour = 'white'))
remove(Attributes, Attributes0, Levels, Levels0, Predictors, Totals, ObBads, ObGoods, PrClass, PrGoods,
Chi_squared_test, ListOfPredictors, ListOfHeaders, header.df, N)
} else { # Not Logit-Models
writeData(wb, sheet = Scorecard, data.frame(A = c(paste0('Scorecard by Model `', NameOfTheBestModel ,
'` and Result of Classification on TESTING Set'))),
colNames = FALSE, rowNames = FALSE, startCol = 'A', startRow = 1)
addStyle(wb, sheet = Scorecard, cols=1, rows = 1, style = createStyle(fontSize = 16, textDecoration = 'bold'))
} # End if == LogisticRegression
openxlsx::renameWorksheet(wb, Scorecard, 'MLS')
openxlsx::writeFormula(wb, sheet = 'IV Table', x = makeHyperlinkString(sheet = 'MLS', row = 1, col = 1,
text = 'Scorecard: MLS'), startCol = 'A', startRow = nrow(binning.df) + 4)
openxlsx::addStyle(wb,sheet = 'IV Table', cols = 1, rows = nrow(binning.df) + 4,
style = createStyle(fontColour = 'brown', textDecoration = 'bold'))
# Supplement Variable Table with Importance Feature
binning.df %>%
# dplyr::mutate_if(is.factor, as.character) %>%
dplyr::left_join(importance.df, by = c('Variable' = 'Variable')) %>%
openxlsx::writeDataTable(wb, sheet = 'IV Table', x = ., tableStyle = 'TableStyleMedium4', startCol = 'A',
      startRow = 2, tableName = 'IVTable', firstColumn = FALSE, lastColumn = TRUE, bandedRows = TRUE)
## Warning: Column `Variable` joining factors with different levels, coercing
## to character vector
openxlsx::writeComment(wb, sheet = 'IV Table', xy = c(ncol(binning.df) + 1, 2),
comment = openxlsx::createComment(comment = 'Importance Feature by Gini coefficient (MLS)',
height = .6))
# Recovering First Column of Names with Hyperlinks
for (i in 1:nrow(binning.df)) {
## Internal - Text to display
val = binning.df[i, 'Variable']
writeFormula(wb, sheet = 'IV Table', startCol = 'A', startRow = i + 2,
x = makeHyperlinkString(sheet = val, row = 1, col = 1, text = val))
}
# Set Columns widths
openxlsx::setColWidths(wb, sheet = 'IV Table', cols = 1:2, widths = c(32, 12))
openxlsx::setColWidths(wb, sheet = 'IV Table', cols = ncol(binning.df):(ncol(binning.df)+1), widths = c(12, 13))
N <- 3:(nrow(binning.df) + 2)
openxlsx::conditionalFormatting(wb, sheet = 'IV Table', cols = 2, rows = N, type = 'databar',
border = FALSE, style = c('red', 'royalblue'))
openxlsx::conditionalFormatting(wb, sheet = 'IV Table', type = 'databar', cols = ncol(binning.df) + 1,
rows =3:(nrow(binning.df)+2), border = FALSE, style = c('tomato3', 'palegreen3'))
openxlsx::addStyle(wb, sheet = 'IV Table', cols = 2, rows = N,
style = openxlsx::createStyle(border = 'right', borderColour = '#9CB95C'))
openxlsx::addStyle(wb, sheet = 'IV Table', cols = 10, rows = N,
style = openxlsx::createStyle(border = 'right', borderColour = '#9CB95C'))
openxlsx::addStyle(wb, sheet = 'IV Table', cols = ncol(binning.df) + 1, rows = 3:(nrow(binning.df)+2),
style = openxlsx::createStyle(numFmt = paste0('0', options()$OutDec, '0000')))
openxlsx::writeFormula(wb, sheet = 'IV Table', x = paste0('=T("Table of Variables (', LoadingData, ')")'),
startCol = 'A', startRow = 1)
openxlsx::addStyle(wb, sheet = 'IV Table', cols = 1, rows = 1,
style = openxlsx::createStyle(fontSize = 16, textDecoration = 'bold'))
# Open MS Excel
openxlsx::openXL(wb)
remove(p1, p2, p3) # , wb)
Let’s examine the characteristics of the scorecard formed from the selected predictors.
# https://rpubs.com/arifulmondal/216381
Probs <- preds[NameOfTheBestModel] %>% pull
smbinning::smbinning.metrics(
  dataset = cbind(X[inTest, ], Y = as.integer(Y[inTest]) - 1,
                  scores = round((c_ + (I / log(2)) * log(Probs / (1 - Probs))) + 0, 0)),
prediction = 'scores', actualclass = 'Y',
cutoff = c_,
report = 1, plot = 'none' # plot = 'auc' - Plot AUC
)
##
## Overall Performance Metrics
## --------------------------------------------------
## KS : 0.2362 (Unpredictive)
## AUC : 0.6615 (Poor)
##
## Classification Matrix
## --------------------------------------------------
## Cutoff (>=) : 500 (User Defined)
## True Positives (TP) : 11626
## False Positives (FP) : 2049
## False Negatives (FN) : 6636
## True Negatives (TN) : 3005
## Total Positives (P) : 18262
## Total Negatives (N) : 5054
##
## Business/Performance Metrics
## --------------------------------------------------
## %Records>=Cutoff : 0.5865
## Good Rate : 0.8502 (Vs 0.7832 Overall)
## Bad Rate : 0.1498 (Vs 0.2168 Overall)
## Accuracy (ACC) : 0.6275
## Sensitivity (TPR) : 0.6366
## False Neg. Rate (FNR) : 0.3634
## False Pos. Rate (FPR) : 0.4054
## Specificity (TNR) : 0.5946
## Precision (PPV) : 0.8502
## False Discovery Rate : 0.1498
## False Omision Rate : 0.6883
## Inv. Precision (NPV) : 0.3117
##
## Note: 0 rows deleted due to missing data.
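The printed classification metrics can be cross-checked directly from the confusion-matrix counts above:
# Cross-check of the smbinning.metrics output from its own counts
TP <- 11626; FP <- 2049; FN <- 6636; TN <- 3005
c(Sensitivity = TP / (TP + FN),                  # 0.6366
  Specificity = TN / (TN + FP),                  # 0.5946
  Precision   = TP / (TP + FP),                  # 0.8502
  Accuracy    = (TP + TN) / (TP + FP + FN + TN)) # 0.6275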
if (class(TheBestModel) == 'rxNaiveBayes') { # Model of `RevoScaleR` package; class() returns a vector here, hence the warning below (inherits() would be cleaner)
Probs <-
RevoScaleR::rxPredict(modelObject = TheBestModel
, data = X
# , outData = sqlServerOutDS3
, predVarNames = c('Bad_Probs', 'Good_Probs')
, type = 'prob'
, writeModelVars = FALSE
# , extraVarsToWrite = 'SUBS_KEY'
, overwrite = TRUE )['Good_Probs'] %>%
pull
} else { # Microsoft R Machine Learning model from `MicrosoftML` package
Probs <-
RevoScaleR::rxPredict( TheBestModel
, data = cbind(Y = Y, ObClass = Y %>% as.integer() - 1, X)
, suffix = '.MicrosoftML'
# , extraVarsToWrite = names(cbind(Y = Y[inTrain], ObClass = Y[inTrain] %>% as.integer()-1, X[inTrain, ]))
, outData = tempfile(fileext = '.xdf')) %>%
RevoScaleR::rxDataStep(varsToKeep = c('Probability.MicrosoftML.Good')) %>%
pull
}## Warning in if (class(TheBestModel) == "rxNaiveBayes") {: the condition has
## length > 1 and only the first element will be used
## Beginning processing data.
## Rows Read: 345546, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:04.1849337
## Finished writing 345546 rows.
## Writing completed.
## Rows Read: 345546, Total Rows Processed: 345546, Total Chunk Time: 0.049 seconds
if (class(DT) %>% length == 1) setDT(DT)
Output.tbl <-
cbind(DT[, c('UniqueID', gsub('_fct', '', names(X))), with = FALSE], X, #
data.frame(
`Probability of Default` = 1 - Probs
, Scores = round((c_ + (I / log(2)) * log(Probs / (1 - Probs))) + 0, 0)
, Prediction = ifelse(Probs < 0.5, 'Bad', 'Good')
, Reference = Y)
)
# Create DataTable in MS Excel
openxlsx::addWorksheet(wb2 <- openxlsx::createWorkbook(), sheetName = 'Data', gridLines = FALSE)
openxlsx::writeDataTable(wb2, sheet = 'Data', x = Output.tbl, tableStyle = 'TableStyleMedium6',
tableName = 'Data', firstColumn = TRUE, lastColumn = TRUE, bandedRows = TRUE)
# # Writing Comments into cells
# for (i in 1:ncol(Output.tbl))
# openxlsx::writeComment(wb2, sheet = 'Data', xy = c(i, 1),
# comment = openxlsx::createComment(comment = attr(Output.tbl, 'variable.labels')[i],
# visible = FALSE, width = 2, height = 10, style = createStyle(fontSize = 8)))
# Set Columns widths
openxlsx::setColWidths(wb2, sheet = 'Data', cols = 1:2, widths = c(12, 10))
openxlsx::freezePane(wb2, 'Data', firstCol = TRUE) # shortcut to firstActiveCol = 2
openxlsx::openXL(wb2)
Solution of a Classification Problem
fitProblemScores <-
RevoScaleR::rxPredict( TheBestModel
, data = cbind(Y = Y[inProblem], ObClass = Y[inProblem] %>% as.integer()-1, X[inProblem, ])
, suffix = '.Problem'
# , extraVarsToWrite = names(cbind(Y = Y[inProblem], ObClass = Y[inProblem] %>% as.integer()-1, X[inProblem, ]))
        , outData = tempfile(fileext = '.xdf'))
## Beginning processing data.
## Rows Read: 112392, Read Time: 0, Transform Time: 0
## Beginning processing data.
## Elapsed time: 00:00:01.4410516
## Finished writing 112392 rows.
## Writing completed.
openxlsx::addWorksheet(wb2 <- openxlsx::createWorkbook(), sheetName = "Ensembles", gridLines = FALSE)
RevoScaleR::rxImport(inData = fitProblemScores) %>%
dplyr::mutate(loan_default = ifelse(PredictedLabel.Problem == 'Bad', 1, 0)) %>%
dplyr::select(loan_default) %>%
openxlsx::writeDataTable(wb2, sheet = "Ensembles", x = ., colNames = TRUE,
                           tableStyle = "TableStyleMedium4", withFilter = TRUE)
## Rows Read: 112392, Total Rows Processed: 112392, Total Chunk Time: 0.033 seconds
The best results on the Public Leaderboard of the LTFS Data Science FinHack are just above 0.6731 AUC, i.e. Gini 0.3462. However, even the algorithms implemented in R or Python could not solve this problem properly.
Clearly, the classification problem has not been fully solved. It was not possible to find or construct predictors with enough separating power to resolve the binary default class of vehicle loans: none of the many diverse classification models achieved a quality above AUC 0.70 (Gini 0.40) on the test dataset.
Although the algorithms of Microsoft Machine Learning (Microsoft ML Server 9.4.7) run quite fast, they are not yet able to solve this classification problem either. Perhaps this is because the proportion of defaulting (bad) Indian car-loan borrowers is very large, exceeding 20%.
You can execute this R Markdown file as a job: create an R file containing the code
rmarkdown::render(input = '1._MLS_.rmd', output_format = c('html_document'))
and then run that R file from the ~/Projects/ subdirectory.
# The End of Session
# if(!is.null(cl)) {
# parallel::stopCluster(cl)
# cl = NULL
# }
devtools::session_info()
## - Session info ----------------------------------------------------------
## setting value
## version R version 3.5.2 (2018-12-20)
## os Windows 10 x64
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate Russian_Russia.1251
## ctype Russian_Russia.1251
## tz Asia/Dhaka
## date 2019-08-31
##
## - Packages --------------------------------------------------------------
## ! package * version date lib
## acepack 1.4.1 2016-10-29 [1]
## agricolae 1.3-0 2019-01-07 [1]
## ALEPlot 1.1 2018-05-24 [1]
## AlgDesign 1.1-7.3 2014-10-15 [1]
## assertthat 0.2.0 2017-04-11 [1]
## backports 1.1.3 2018-12-14 [1]
## base64enc 0.1-3 2015-07-28 [1]
## bayesplot 1.6.0 2018-08-02 [1]
## bindr 0.1.1 2018-03-13 [1]
## bindrcpp * 0.2.2 2018-03-29 [1]
## bit 1.1-14 2018-05-29 [1]
## bit64 0.9-7 2017-05-08 [1]
## bitops 1.0-6 2013-08-17 [1]
## blob 1.1.1 2018-03-25 [1]
## boot 1.3-20 2017-08-06 [1]
## breakDown * 0.2.0 2019-08-15 [1]
## broom 0.5.1 2018-12-05 [1]
## callr 3.1.1 2018-12-21 [1]
## caret * 6.0-81 2018-11-20 [1]
## caTools 1.17.1.1 2018-07-20 [1]
## cellranger 1.1.0 2016-07-27 [1]
## checkmate 1.9.1 2019-01-15 [1]
## chron 2.3-53 2018-09-09 [1]
## class 7.3-14 2015-08-30 [1]
## cli 1.0.1 2018-09-25 [1]
## cluster 2.0.7-1 2018-04-13 [1]
## coda 0.19-2 2018-10-08 [1]
## codetools 0.2-15 2016-10-05 [1]
## colorspace 1.4-0 2019-01-13 [1]
## colourpicker 1.0 2017-09-27 [1]
## combinat 0.0-8 2012-10-29 [1]
## CompatibilityAPI 1.1.0 2019-01-10 [1]
## crayon 1.3.4 2017-09-16 [1]
## crosstalk 1.0.0 2016-12-21 [1]
## curl 3.3 2019-01-10 [1]
## DALEX * 0.2.6 2019-01-07 [1]
## data.table * 1.12.0 2019-01-13 [1]
## DataExplorer * 0.8.0 2019-08-24 [1]
## DBI 1.0.0 2018-05-02 [1]
## deldir 0.1-16 2019-01-04 [1]
## desc 1.2.0 2018-05-01 [1]
## DescTools 0.99.27 2019-01-19 [1]
## devtools 2.0.1 2018-10-26 [1]
## digest 0.6.18 2018-10-10 [1]
## doParallel * 1.0.14 2019-04-11 [1]
## dplyr * 0.7.8 2018-11-10 [1]
## DT 0.5 2018-11-05 [1]
## dygraphs 1.1.1.6 2018-07-11 [1]
## e1071 1.7-0.1 2019-01-21 [1]
## embed 0.0.2 2018-11-19 [1]
## evaluate 0.12 2018-10-09 [1]
## expm 0.999-3 2018-09-22 [1]
## factorMerger 0.4.0 2019-08-15 [1]
## forcats * 0.3.0 2018-02-19 [1]
## foreach * 1.5.1 2019-04-11 [1]
## foreign 0.8-71 2018-07-20 [1]
## formattable * 0.2.0.1 2016-08-05 [1]
## Formula * 1.2-3 2018-05-03 [1]
## fs 1.2.6 2018-08-23 [1]
## FSelectorRcpp * 0.3.0 2018-11-12 [1]
## gdata 2.18.0 2017-06-06 [1]
## generics 0.0.2 2018-11-29 [1]
## ggmosaic 0.2.0 2018-09-12 [1]
## ggplot2 * 3.1.0 2018-10-25 [1]
## ggpubr 0.2 2018-11-15 [1]
## ggridges 0.5.1 2018-09-27 [1]
## glmnet 2.0-16 2018-04-02 [1]
## glue 1.3.0 2018-07-17 [1]
## gmodels 2.18.1 2018-06-25 [1]
## gower 0.1.2 2017-02-23 [1]
## gplots * 3.0.1.1 2019-01-27 [1]
## gridExtra 2.3 2017-09-09 [1]
## gsubfn * 0.7 2018-03-16 [1]
## gtable 0.2.0 2016-02-26 [1]
## gtools 3.8.1 2018-06-26 [1]
## haven 2.0.0 2018-11-22 [1]
## highr 0.7 2018-06-09 [1]
## hmeasure 1.0-1 2019-01-02 [1]
## Hmisc * 4.2-0 2019-01-26 [1]
## hms 0.4.2 2018-03-10 [1]
## htmlTable 1.13.1 2019-01-07 [1]
## htmltools 0.3.6 2017-04-28 [1]
## htmlwidgets 1.3 2018-09-30 [1]
## httpuv 1.4.5.1 2018-12-18 [1]
## httr 1.4.0 2018-12-11 [1]
## igraph 1.2.2 2018-07-27 [1]
## iml * 0.8.1 2019-01-02 [1]
## Information 0.0.9 2016-04-09 [1]
## inline 0.3.15 2018-05-18 [1]
## inum 1.0-0 2017-12-12 [1]
## ipred 0.9-8 2018-11-05 [1]
## iterators * 1.0.11 2019-04-11 [1]
## jsonlite 1.6 2018-12-07 [1]
## kableExtra * 1.0.1 2019-01-22 [1]
## keras 2.2.4 2018-11-22 [1]
## KernSmooth 2.23-15 2015-06-29 [1]
## klaR 0.6-14 2018-03-19 [1]
## knitr 1.21 2018-12-10 [1]
## labeling 0.3 2014-08-23 [1]
## later 0.7.5 2018-09-18 [1]
## lattice * 0.20-38 2018-11-04 [1]
## latticeExtra 0.6-28 2016-02-09 [1]
## lava 1.6.4 2018-11-25 [1]
## lazyeval 0.2.1 2017-10-29 [1]
## LearnBayes 2.15.1 2018-03-18 [1]
## libcoin * 1.0-2 2018-12-13 [1]
## lme4 1.1-19 2018-11-10 [1]
## loo 2.0.0 2018-04-11 [1]
## lubridate 1.7.4 2018-04-11 [1]
## magrittr * 1.5 2014-11-22 [1]
## manipulate 1.0.1 2014-12-24 [1]
## markdown 0.9 2018-12-07 [1]
## MASS 7.3-51.1 2018-11-01 [1]
## Matrix 1.2-15 2018-11-01 [1]
## matrixStats 0.54.0 2018-07-23 [1]
## memoise 1.1.0 2017-04-21 [1]
## Metrics 0.1.4 2018-07-09 [1]
## MicrosoftML * 9.4.7 2019-05-07 [1]
## mime 0.6 2018-10-05 [1]
## miniUI 0.1.1.1 2018-05-18 [1]
## minqa 1.2.4 2014-10-09 [1]
## ModelMetrics 1.2.2 2018-11-03 [1]
## modelr 0.1.2 2018-05-11 [1]
## D mrsdeploy * 1.1.3 2019-05-15 [1]
## munsell 0.5.0 2018-06-12 [1]
## mvtnorm * 1.0-8 2018-05-31 [1]
## networkD3 0.4 2017-03-18 [1]
## nlme 3.1-137 2018-04-07 [1]
## nloptr 1.2.1 2018-10-03 [1]
## nnet 7.3-12 2016-02-02 [1]
## openxlsx * 4.1.0 2018-05-26 [1]
## pander 0.6.3 2018-11-06 [1]
## partykit * 1.2-3 2019-01-31 [1]
## pdp 0.7.0 2018-08-27 [1]
## pillar 1.3.1 2018-12-15 [1]
## pkgbuild 1.0.2 2018-10-16 [1]
## pkgconfig 2.0.2 2018-08-16 [1]
## pkgload 1.0.2 2018-10-29 [1]
## plotly * 4.8.0 2018-07-20 [1]
## plyr * 1.8.4 2016-06-08 [1]
## prettyunits 1.0.2 2015-07-13 [1]
## processx 3.2.1 2018-12-05 [1]
## prodlim 2018.04.18 2018-04-18 [1]
## productplots 0.1.1 2016-07-02 [1]
## promises 1.0.1 2018-04-13 [1]
## proto * 1.0.0 2016-10-29 [1]
## proxy 0.4-22 2018-04-08 [1]
## pryr 0.1.4 2018-02-18 [1]
## ps 1.3.0 2018-12-21 [1]
## purrr * 0.3.0 2019-01-27 [1]
## pwr * 1.2-2 2018-03-03 [1]
## questionr 0.7.0 2018-11-26 [1]
## R6 2.3.0 2018-10-04 [1]
## rapportools 1.0 2014-01-07 [1]
## RColorBrewer 1.1-2 2014-12-07 [1]
## Rcpp 1.0.0 2018-11-07 [1]
## RCurl 1.95-4.11 2018-07-15 [1]
## readr * 1.3.1 2018-12-21 [1]
## readxl 1.2.0 2018-12-19 [1]
## recipes 0.1.4 2018-11-19 [1]
## remotes 2.0.2 2018-10-30 [1]
## reshape2 * 1.4.3 2017-12-11 [1]
## reticulate 1.10 2018-08-05 [1]
## RevoMods * 11.0.1 2019-04-11 [1]
## RevoScaleR * 9.4.7 2019-05-21 [1]
## RevoUtils * 11.0.2 2019-04-11 [1]
## RevoUtilsMath * 11.0.0 2019-04-24 [1]
## rlang 0.3.1 2019-01-08 [1]
## rmarkdown 1.11 2018-12-08 [1]
## ROCR * 1.0-7 2015-03-26 [1]
## rpart * 4.1-13 2018-02-23 [1]
## rprojroot 1.3-2 2018-01-03 [1]
## rsconnect 0.8.13 2019-01-10 [1]
## RSQLite * 2.1.1 2018-05-06 [1]
## rstan 2.18.2 2018-11-07 [1]
## rstanarm 2.18.2 2018-11-10 [1]
## rstantools 1.5.1 2018-08-22 [1]
## rstudioapi 0.9.0 2019-01-09 [1]
## rvest 0.3.2 2016-06-17 [1]
## scales 1.0.0 2018-08-09 [1]
## sessioninfo 1.1.1 2018-11-05 [1]
## shiny 1.2.0 2018-11-02 [1]
## shinyjs 1.0 2018-01-08 [1]
## shinystan 2.5.0 2018-05-01 [1]
## shinythemes 1.1.2 2018-11-06 [1]
## smbinning * 0.8 2019-01-07 [1]
## sp 1.3-1 2018-06-05 [1]
## spData 0.3.0 2019-01-07 [1]
## spdep 0.8-1 2018-11-21 [1]
## sqldf * 0.4-11 2017-06-28 [1]
## StanHeaders 2.18.1 2019-01-28 [1]
## stringi 1.2.4 2018-07-20 [1]
## stringr * 1.3.1 2018-05-10 [1]
## summarytools * 0.8.8 2018-10-07 [1]
## survival * 2.43-3 2018-11-26 [1]
## tensorflow 1.10 2018-11-19 [1]
## testthat 2.0.1 2018-10-13 [1]
## tfruns 1.4 2018-08-25 [1]
## threejs 0.3.1 2017-08-13 [1]
## tibble * 2.0.1 2019-01-12 [1]
## tidyr * 0.8.2 2018-10-28 [1]
## tidyselect 0.2.5 2018-10-11 [1]
## tidyverse * 1.2.1 2017-11-14 [1]
## timeDate 3043.102 2018-02-21 [1]
## usethis 1.4.0 2018-08-14 [1]
## viridisLite 0.3.0 2018-02-01 [1]
## webshot 0.5.1 2018-09-28 [1]
## whisker 0.3-2 2013-04-28 [1]
## withr 2.1.2 2018-03-15 [1]
## woeBinning * 0.1.6 2018-07-28 [1]
## xfun 0.4 2018-10-23 [1]
## xml2 1.2.0 2018-01-24 [1]
## xtable 1.8-3 2018-08-29 [1]
## xts 0.11-2 2018-11-05 [1]
## yaImpute 1.0-31 2019-01-09 [1]
## yaml 2.2.0 2018-07-25 [1]
## zeallot 0.1.0 2018-01-28 [1]
## zip 1.0.0 2017-04-25 [1]
## zoo 1.8-4 2018-09-19 [1]
## source
##   CRAN (R 3.5.2) for all packages, except:
##   Github (pbiecek/breakDown@ba9a0d9)       breakDown
##   Github (boxuancui/DataExplorer@8a71951)  DataExplorer
##   Github (MI2DataLab/factorMerger@c49e37f) factorMerger
##   local                                    CompatibilityAPI, doParallel, foreach,
##                                            iterators, MicrosoftML, mrsdeploy,
##                                            RevoMods, RevoScaleR, RevoUtils,
##                                            RevoUtilsMath
##
## [1] C:/R/MLS/R_SERVER/library
##
## D -- DLL MD5 mismatch, broken installation.