This is an attempt to showcase some worked out examples of Machine Learning (ML) use German Credit Data. Though we have selected credit scoring problem as a case study in this article, the same process will be applicable for wide range of classification or regression problems “Response modeling”, “Risk Management”, “Attrition/Churn management”, “Cross-Sell/Up-Sell”, “usage Patterns”, “Net Present Value”, “Life Time Value”, “Predictive Maintenance and condition based monitoring”, “Warranty”, “Reliability”, “Failure Prediction”, “Image/Video Processing”, “Crime”, “Medical Experiments”, “Hidden pattern recognition” . for Banking, Insurance, Finance, Telecom, Manufacturing, “Law Firms and Criminal Investigation”, “Surveillance”, “Catalogue”, “Travel Transport and Hospitality”, “Healthcare”, “Utilities”, “Publishing”, “Education” and any industry you may come across.
The basic difference of traditional modeling and machine learning is that “in traditional modeling we intend to set up a modeling framework and try to establish relationships while in machine learning we allow the model to learn from the data by understanding the hidden patterns”. Hence the first one requires analyst to have solid understanding of statistical techniques and business knowledge while the later one is more complex in nature and computational intensive, hence requires higher computation power of the systems and analyst needs to be tech savvy.
Kindly note that while traditional techniques perform well on small to large amount of data, machine learning will certainly learn better on high-dimensional and complex data such as Big Data set up.
If you want to do more experiments and not sure where to get a problem definition or data to machine learning, you may explore the on-line machine learning repository here.
If you are looking for answers of some technical queries about R
you may post your question here on stackoverflow and of course do not forget to ask your best friends on the web Google and/or Bing.
I have used following machine learning techniques in this article:
Setting up working directory using setwd()
where your project files are located.
setwd("C:/creditscoring")
#setwd("C:\\creditscoring")
I have used below list of R packages in this paper. Kindly note that there many alternative packages available for same techniques. To install R
packages type install.packages("packagename")
on your R
console.
library(DT) # For Data Tables
library(lattice) # The lattice add-on of Trellis graphics for R
library(knitr) # For Dynamic Report Generation in R
library(gplots) # Various R Programming Tools for Plotting Data
library(ggplot2) # An Implementation of the Grammar of Graphics
library(ClustOfVar) # Clustering of variables
library(ape) # Analyses of Phylogenetics and Evolution (as.phylo)
library(Information) # Data Exploration with Information Theory (Weight-of-Evidence and Information Value)
library(ROCR) # Model Performance and ROC curve
library(caret) # Classification and Regression Training - for any machine learning algorithms
library(rpart) # Recursive partitioning for classification, regression and survival trees
library(rpart.utils) # Tools for parsing and manipulating rpart objects, including generating machine readable rules
library(rpart.plot) # Plot 'rpart' Models: An Enhanced Version of 'plot.rpart'
library(randomForest)# Leo Breiman and Cutler's Random Forests for Classification and Regression
library(party) # A computational toolbox for recursive partitioning - Conditional inference Trees
library(bnlearn) # Bayesian Network Structure Learning, Parameter Learning and Inference
library(DAAG) # Data Analysis and Graphics Data and Functions
library(vcd) # Visualizing Categorical Data
library(kernlab) # Support Vector Machine
# Following libraries we have load for model 8 and model 9
#library(neuralnet) # Neural Network
#library(lars) # For Least Angle Regression, Lasso and Forward Stagewise
#library(glmnet) # Lasso and Elastic-Net Regularized Generalized Linear Models
Let’s create some of our own functions for data analysis
# Function 1: Create function to calculate percent distribution for factors
pct <- function(x){
tbl <- table(x)
tbl_pct <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(tbl_pct) <- c('Count','Percentage')
kable(tbl_pct)
}
# pct(cdata$good_bad_21)
# Function 2: Own function to calculate IV, WOE and Eefficiency
gbpct <- function(x, y=cdata$good_bad_21){
mt <- as.matrix(table(as.factor(x), as.factor(y))) # x -> independent variable(vector), y->dependent variable(vector)
Total <- mt[,1] + mt[,2] # Total observations
Total_Pct <- round(Total/sum(mt)*100, 2) # Total PCT
Bad_pct <- round((mt[,1]/sum(mt[,1]))*100, 2) # PCT of BAd or event or response
Good_pct <- round((mt[,2]/sum(mt[,2]))*100, 2) # PCT of Good or non-event
Bad_Rate <- round((mt[,1]/(mt[,1]+mt[,2]))*100, 2) # Bad rate or response rate
grp_score <- round((Good_pct/(Good_pct + Bad_pct))*10, 2) # score for each group
WOE <- round(log(Good_pct/Bad_pct)*10, 2) # Weight of Evidence for each group
g_b_comp <- ifelse(mt[,1] == mt[,2], 0, 1)
IV <- ifelse(g_b_comp == 0, 0, (Good_pct - Bad_pct)*(WOE/10)) # Information value for each group
Efficiency <- abs(Good_pct - Bad_pct)/2 # Efficiency for each group
otb<-as.data.frame(cbind(mt, Good_pct, Bad_pct, Total,
Total_Pct, Bad_Rate, grp_score,
WOE, IV, Efficiency ))
otb$Names <- rownames(otb)
rownames(otb) <- NULL
otb[,c(12,2,1,3:11)] # return IV table
}
# Function 3: Normalize using Range
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
I have used German Credit Data
for this experiement. You may read more about the data and download from here and can be read in R
using read.table()
function.
cdata<-read.table("data.txt", h=F, sep="")
# Update column Names
colnames(cdata) <- c("chk_ac_status_1",
"duration_month_2", "credit_history_3", "purpose_4",
"credit_amount_5","savings_ac_bond_6","p_employment_since_7",
"instalment_pct_8", "personal_status_9","other_debtors_or_grantors_10",
"present_residence_since_11","property_type_12","age_in_yrs_13",
"other_instalment_type_14", "housing_type_15",
"number_cards_this_bank_16","job_17","no_people_liable_for_mntnance_18",
"telephone_19", "foreign_worker_20",
"good_bad_21")
# Read a numeric copy: Numeric data for Neural network & Lasso
cdatanum<-read.table("german.data-numeric.txt", h=F, sep="")
cdatanum <- as.data.frame(sapply(cdatanum, as.numeric ))
Variable names of the data. You may create your own variable names.
kable(as.data.frame(colnames(cdata)))
chk_ac_status_1
duration_month_2
credit_history_3
purpose_4
credit_amount_5
savings_ac_bond_6
p_employment_since_7
instalment_pct_8
personal_status_9
other_debtors_or_grantors_10
present_residence_since_11
property_type_12
age_in_yrs_13
other_instalment_type_14
housing_type_15
number_cards_this_bank_16
job_17
no_people_liable_for_mntnance_18 telephone_19
foreign_worker_20
good_bad_21
Understand the structure of the data using str()
function.
str(cdata)
## 'data.frame': 1000 obs. of 21 variables:
## $ chk_ac_status_1 : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ duration_month_2 : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history_3 : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ purpose_4 : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ credit_amount_5 : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_ac_bond_6 : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ p_employment_since_7 : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ instalment_pct_8 : int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_status_9 : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ other_debtors_or_grantors_10 : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ present_residence_since_11 : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property_type_12 : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age_in_yrs_13 : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_instalment_type_14 : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ housing_type_15 : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ number_cards_this_bank_16 : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job_17 : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ no_people_liable_for_mntnance_18: int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone_19 : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker_20 : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ good_bad_21 : int 1 2 1 1 2 1 1 1 1 2 ...
Print summary of all the variables using summary()
function.
summary(cdata)
## chk_ac_status_1 duration_month_2 credit_history_3 purpose_4 credit_amount_5 savings_ac_bond_6 p_employment_since_7 instalment_pct_8 personal_status_9 other_debtors_or_grantors_10
## A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250 A61:603 A71: 62 Min. :1.000 A91: 50 A101:907
## A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366 A62:103 A72:172 1st Qu.:2.000 A92:310 A102: 41
## A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320 A63: 63 A73:339 Median :3.000 A93:548 A103: 52
## A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271 A64: 48 A74:174 Mean :2.973 A94: 92
## 3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972 A65:183 A75:253 3rd Qu.:4.000
## Max. :72.0 A46 : 50 Max. :18424 Max. :4.000
## (Other): 55
## present_residence_since_11 property_type_12 age_in_yrs_13 other_instalment_type_14 housing_type_15 number_cards_this_bank_16 job_17 no_people_liable_for_mntnance_18 telephone_19
## Min. :1.000 A121:282 Min. :19.00 A141:139 A151:179 Min. :1.000 A171: 22 Min. :1.000 A191:596
## 1st Qu.:2.000 A122:232 1st Qu.:27.00 A142: 47 A152:713 1st Qu.:1.000 A172:200 1st Qu.:1.000 A192:404
## Median :3.000 A123:332 Median :33.00 A143:814 A153:108 Median :1.000 A173:630 Median :1.000
## Mean :2.845 A124:154 Mean :35.55 Mean :1.407 A174:148 Mean :1.155
## 3rd Qu.:4.000 3rd Qu.:42.00 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :4.000 Max. :75.00 Max. :4.000 Max. :2.000
##
## foreign_worker_20 good_bad_21
## A201:963 Min. :1.0
## A202: 37 1st Qu.:1.0
## Median :1.0
## Mean :1.3
## 3rd Qu.:2.0
## Max. :2.0
##
We can print some observations using head()
for first few rows or tail()
for last few rows. I have used kable()
function to represent the data into nice table format.
Here I have used datatable()
from DT
package to interactively visualize the entire dataset or part of the dataset.
# library(knitr) # required for kable() function
# kable(head(cdata, 5))
# kable(tail(cdata, 5))
# library(DT) # Data Table
DT::datatable(cdata[1:100,]) # First 100 observations
Before applying modeling techniques, it is important to understand the attributes and distributions of the variables. This will help you to choose right variables, select right transformation options or grouping (fine classing and coarse classing).
Fine Classing: Create as many as subgroups (bins, typically 10/20) for independent variables and calculate Weight of Evidence (WOE)
and Information Value(IV)
of the variable based on WOE's
and IV's
of subgroups respectively.
Coarse Classing: Combine adjacent categories with equal or similar WOE's
.
To know more click here.
Note: SAS
users may have a look here
STATISTICA
users may want look here
We may need to convert data types of certain variables based on their properties as below.
cdata$duration_month_2 <- as.numeric(cdata$duration_month_2)
cdata$credit_amount_5 <- as.numeric(cdata$credit_amount_5 )
cdata$instalment_pct_8 <- as.numeric(cdata$instalment_pct_8)
cdata$present_residence_since_11 <- as.numeric(cdata$present_residence_since_11)
cdata$age_in_yrs_13 <- as.numeric(cdata$age_in_yrs_13)
cdata$number_cards_this_bank_16 <- as.numeric(cdata$number_cards_this_bank_16)
cdata$no_people_liable_for_mntnance_18 <- as.numeric(cdata$no_people_liable_for_mntnance_18)
In this data data good_bad_21
is our response/target variable where 1
as good/non-event
and 2
as bad/event
. Kindly note that there could be many situations where we have to create response
variables based on business objectives and analysis goal as the response variable may not be readily available with the data.
cdata$good_bad_21<-as.factor(ifelse(cdata$good_bad_21 == 1, "Good", "Bad"))
pct(cdata$good_bad_21)
Count | Percentage | |
---|---|---|
Bad | 300 | 30 |
Good | 700 | 70 |
op<-par(mfrow=c(1,2), new=TRUE)
plot(as.numeric(cdata$good_bad_21), ylab="Good-Bad", xlab="n", main="Good ~ Bad")
hist(as.numeric(cdata$good_bad_21), breaks=2,
xlab="Good(1) and Bad(2)", col="blue")
par(op)
Weight of Evidence(WOE): WoE shows predictive power of an independent variable in relation to dependent variable. It evolved with credit scoring to magnify separation power between a good customer
and a bad customer
, hence it is one of the measures of separation between two classes(good/bad, yes/no, 0/1, A/B, response/no-response). It is defined as:
\[WOE = ln(\frac{\textrm{Distribution of Non-Events(Good)}}{\textrm{Distribution of Events(Bad)}})\]
It is computed from the basic odds ratio:
(Distribution of Good Credit Outcomes) / (Distribution of Bad Credit Outcomes)
Information Value(IV):
IV helps to select variables by using their order of importance w.r.to information value after grouping.
\[IV =\sum(\textrm{%Non-Events - %Events})* WOE\]
Efficiency:
\[ Efficiency = Abs(\textrm{%Non-Events - %Events})/2\]
To know more about WoE & Information Value with R woe
Package click here.
# Attribute 1: (qualitative)
#-----------------------------------------------------------
# Checking account status
# Status of existing checking account
# A11 : ... < 0 DM
# A12 : 0 <= ... < 200 DM
# A13 : ... >= 200 DM /
# salary assignments for at least 1 year
# A14 : no checking account
A1 <- gbpct(cdata$chk_ac_status_1)
op1<-par(mfrow=c(1,2), new=TRUE)
plot(cdata$chk_ac_status_1, cdata$good_bad_21,
ylab="Good-Bad", xlab="category",
main="Checking Account Status ~ Good-Bad ")
barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
main="Score:Checking Account Status",
xlab="Category",
ylab="WOE"
)
par(op1)
kable(A1, caption = 'Checking Account Status ~ Good-Bad')
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A11 | 139 | 135 | 19.86 | 45.00 | 274 | 27.4 | 49.27 | 3.06 | -8.18 | 20.56452 | 12.570 |
A12 | 164 | 105 | 23.43 | 35.00 | 269 | 26.9 | 39.03 | 4.01 | -4.01 | 4.63957 | 5.785 |
A13 | 49 | 14 | 7.00 | 4.67 | 63 | 6.3 | 22.22 | 6.00 | 4.05 | 0.94365 | 1.165 |
A14 | 348 | 46 | 49.71 | 15.33 | 394 | 39.4 | 11.68 | 7.64 | 11.76 | 40.43088 | 17.190 |
Information Value is 66.58 and Efficiency is 36.71 .
# Attribute 2: (numerical)
#-----------------------------------------------------------
# Loan Duration (Tenure) in Month
summary(cdata$duration_month_2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 18.0 20.9 24.0 72.0
op2<-par(mfrow=c(1,2))
boxplot(cdata$duration_month_2, ylab="Loan Duration(Month)", main="Boxplot:Loan Duration")
plot(cdata$duration_month_2, cdata$good_bad_21,
ylab="Good-Bad", xlab="Loan Duration(Month)",
main="Loan Duration ~ Good-Bad ")
plot(as.factor(cdata$duration_month_2), cdata$good_bad_21,
ylab="Good-Bad", xlab="Loan Duration(Month)",
main="Loan Duration(Before Groupping)")
# Create some groups from contious variables
cdata$duration_month_2 <-as.factor(ifelse(cdata$duration_month_2<=6,'00-06',
ifelse(cdata$duration_month_2<=12,'06-12',
ifelse(cdata$duration_month_2<=24,'12-24',
ifelse(cdata$duration_month_2<=30,'24-30',
ifelse(cdata$duration_month_2<=36,'30-36',
ifelse(cdata$duration_month_2<=42,'36-42','42+')))))))
plot(cdata$duration_month_2, cdata$good_bad_21,
main="Loan Duration(after grouping) ",
xlab="Loan Duration (Month)",
ylab="Good-Bad")
par(op2)
A2<-gbpct(cdata$duration_month_2)
barplot(A2$WOE, col="brown", names.arg=c(A2$Levels),
main="Loan Duration",
xlab="Duration(Months)",
ylab="WOE"
)
kable(A2, caption = 'Loan Duration ~ Good-Bad')
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
00-06 | 73 | 9 | 10.43 | 3.00 | 82 | 8.2 | 10.98 | 7.77 | 12.46 | 9.25778 | 3.715 |
06-12 | 210 | 67 | 30.00 | 22.33 | 277 | 27.7 | 24.19 | 5.73 | 2.95 | 2.26265 | 3.835 |
12-24 | 289 | 122 | 41.29 | 40.67 | 411 | 41.1 | 29.68 | 5.04 | 0.15 | 0.00930 | 0.310 |
24-30 | 38 | 19 | 5.43 | 6.33 | 57 | 5.7 | 33.33 | 4.62 | -1.53 | 0.13770 | 0.450 |
30-36 | 48 | 38 | 6.86 | 12.67 | 86 | 8.6 | 44.19 | 3.51 | -6.14 | 3.56734 | 2.905 |
36-42 | 12 | 5 | 1.71 | 1.67 | 17 | 1.7 | 29.41 | 5.06 | 0.24 | 0.00096 | 0.020 |
42+ | 30 | 40 | 4.29 | 13.33 | 70 | 7.0 | 57.14 | 2.43 | -11.34 | 10.25136 | 4.520 |
Information Value is 25.49 and Efficiency is 15.75 .
# Attribute 3: (qualitative)
#-----------------------------------------------------------
# Credit History
# A30 : no credits taken/
# all credits paid back duly
# A31 : all credits at this bank paid back duly
# A32 : existing credits paid back duly till now
# A33 : delay in paying off in the past
# A34 : critical account/
# other credits existing (not at this bank)
# Combine few groups together based on WOE and bad rates
cdata$credit_history_3<-as.factor(ifelse(cdata$credit_history_3 == "A30", "01.A30",
ifelse(cdata$credit_history_3 == "A31","02.A31",
ifelse(cdata$credit_history_3 == "A32","03.A32.A33",
ifelse(cdata$credit_history_3 == "A33","03.A32.A33",
"04.A34")))))
op3<-par(mfrow=c(1,2))
plot(cdata$credit_history_3, cdata$good_bad_21,
main = "Credit History ~ Good-Bad",
xlab = "Credit History",
ylab = "Good-Bad")
plot(cdata$credit_history_3, cdata$good_bad_21,
main = "Credit History(After Groupping) ~ Good-Bad ",
xlab = "Credit History",
ylab = "Good-Bad")
par(op3)
A3<-gbpct(cdata$credit_history_3)
barplot(A3$WOE, col="brown", names.arg=c(A3$Levels),
main="Credit History",
xlab="Credit History",
ylab="WOE"
)
kable(A3, caption = "Credit History~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
01.A30 | 15 | 25 | 2.14 | 8.33 | 40 | 4.0 | 62.50 | 2.04 | -13.59 | 8.41221 | 3.095 |
02.A31 | 21 | 28 | 3.00 | 9.33 | 49 | 4.9 | 57.14 | 2.43 | -11.35 | 7.18455 | 3.165 |
03.A32.A33 | 421 | 197 | 60.14 | 65.67 | 618 | 61.8 | 31.88 | 4.78 | -0.88 | 0.48664 | 2.765 |
04.A34 | 243 | 50 | 34.71 | 16.67 | 293 | 29.3 | 17.06 | 6.76 | 7.33 | 13.22332 | 9.020 |
Information Value is 29.31 and Efficiency is 18.05 .
# Attribute 4: (qualitative)
#-----------------------------------------------------------
# Purpose of the loan
#
# A40 : car (new)
# A41 : car (used)
# A42 : furniture/equipment
# A43 : radio/television
# A44 : domestic appliances
# A45 : repairs
# A46 : education
# A47 : (vacation - does not exist?)
# A48 : retraining
# A49 : business
# A410 : others
A4<-gbpct(cdata$purpose_4)
op4<-par(mfrow=c(1,2))
plot(cdata$purpose_4, cdata$good_bad_21,
main="Purpose of Loan~ Good-Bad ",
xlab="Purpose",
ylab="Good-Bad")
barplot(A4$WOE, col="brown", names.arg=c(A4$Levels),
main="Purpose of Loan",
xlab="Category",
ylab="WOE")
par(op4)
kable(A4, caption = "Purpose of Loan~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A40 | 145 | 89 | 20.71 | 29.67 | 234 | 23.4 | 38.03 | 4.11 | -3.60 | 3.22560 | 4.480 |
A41 | 86 | 17 | 12.29 | 5.67 | 103 | 10.3 | 16.50 | 6.84 | 7.74 | 5.12388 | 3.310 |
A410 | 7 | 5 | 1.00 | 1.67 | 12 | 1.2 | 41.67 | 3.75 | -5.13 | 0.34371 | 0.335 |
A42 | 123 | 58 | 17.57 | 19.33 | 181 | 18.1 | 32.04 | 4.76 | -0.95 | 0.16720 | 0.880 |
A43 | 218 | 62 | 31.14 | 20.67 | 280 | 28.0 | 22.14 | 6.01 | 4.10 | 4.29270 | 5.235 |
A44 | 8 | 4 | 1.14 | 1.33 | 12 | 1.2 | 33.33 | 4.62 | -1.54 | 0.02926 | 0.095 |
A45 | 14 | 8 | 2.00 | 2.67 | 22 | 2.2 | 36.36 | 4.28 | -2.89 | 0.19363 | 0.335 |
A46 | 28 | 22 | 4.00 | 7.33 | 50 | 5.0 | 44.00 | 3.53 | -6.06 | 2.01798 | 1.665 |
A48 | 8 | 1 | 1.14 | 0.33 | 9 | 0.9 | 11.11 | 7.76 | 12.40 | 1.00440 | 0.405 |
A49 | 63 | 34 | 9.00 | 11.33 | 97 | 9.7 | 35.05 | 4.43 | -2.30 | 0.53590 | 1.165 |
Information Value is 16.93 and Efficiency is 17.9 .
# Attribute 5: (numerical)
#-----------------------------------------------------------
# Credit (Loan) Amount
cdata$credit_amount_5 <- as.double(cdata$credit_amount_5)
summary(cdata$credit_amount_5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18420
boxplot(cdata$credit_amount_5)
# Create groups based on their distribution
cdata$credit_amount_5<-as.factor(ifelse(cdata$credit_amount_5<=1400,'0-1400',
ifelse(cdata$credit_amount_5<=2500,'1400-2500',
ifelse(cdata$credit_amount_5<=3500,'2500-3500',
ifelse(cdata$credit_amount_5<=4500,'3500-4500',
ifelse(cdata$credit_amount_5<=5500,'4500-5500','5500+'))))))
A5<-gbpct(cdata$credit_amount_5)
plot(cdata$credit_amount_5, cdata$good_bad_21,
main="Credit Ammount (After Grouping) ~ Good-Bad",
xlab="Amount",
ylab="Good-Bad")
barplot(A5$WOE, col="brown", names.arg=c(A5$Levels),
main="Credit Ammount",
xlab="Amount",
ylab="WOE")
kable(A5, caption = "Credit Ammount ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
0-1400 | 185 | 82 | 26.43 | 27.33 | 267 | 26.7 | 30.71 | 4.92 | -0.33 | 0.02970 | 0.450 |
1400-2500 | 205 | 65 | 29.29 | 21.67 | 270 | 27.0 | 24.07 | 5.75 | 3.01 | 2.29362 | 3.810 |
2500-3500 | 111 | 38 | 15.86 | 12.67 | 149 | 14.9 | 25.50 | 5.56 | 2.25 | 0.71775 | 1.595 |
3500-4500 | 72 | 26 | 10.29 | 8.67 | 98 | 9.8 | 26.53 | 5.43 | 1.71 | 0.27702 | 0.810 |
4500-5500 | 31 | 17 | 4.43 | 5.67 | 48 | 4.8 | 35.42 | 4.39 | -2.47 | 0.30628 | 0.620 |
5500+ | 96 | 72 | 13.71 | 24.00 | 168 | 16.8 | 42.86 | 3.64 | -5.60 | 5.76240 | 5.145 |
Information Value is 9.39 and Efficiency is 12.43 .
# Attibute 6: (qualitative)
#-----------------------------------------------------------
# Savings account/bonds
# A61 : ... < 100 DM
# A62 : 100 <= ... < 500 DM
# A63 : 500 <= ... < 1000 DM
# A64 : .. >= 1000 DM
# A65 : unknown/ no savings account
A6<-gbpct(cdata$savings_ac_bond_6)
plot(cdata$savings_ac_bond_6, cdata$good_bad_21,
main="Savings account/bonds ~ Good-Bad",
xlab="Savings/Bonds",
ylab="Good-Bad")
barplot(A6$WOE, col="brown", names.arg=c(A6$Levels),
main="Savings account/bonds",
xlab="Category",
ylab="WOE")
kable(A6, caption = "Savings account/bonds ~ Good-Bad" )
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A61 | 386 | 217 | 55.14 | 72.33 | 603 | 60.3 | 35.99 | 4.33 | -2.71 | 4.65849 | 8.595 |
A62 | 69 | 34 | 9.86 | 11.33 | 103 | 10.3 | 33.01 | 4.65 | -1.39 | 0.20433 | 0.735 |
A63 | 52 | 11 | 7.43 | 3.67 | 63 | 6.3 | 17.46 | 6.69 | 7.05 | 2.65080 | 1.880 |
A64 | 42 | 6 | 6.00 | 2.00 | 48 | 4.8 | 12.50 | 7.50 | 10.99 | 4.39600 | 2.000 |
A65 | 151 | 32 | 21.57 | 10.67 | 183 | 18.3 | 17.49 | 6.69 | 7.04 | 7.67360 | 5.450 |
Information Value is 19.58 and Efficiency is 18.66 .
# Attribute 7: (qualitative)
#-----------------------------------------------------------
# Present employment since
# A71 : unemployed
# A72 : ... < 1 year
# A73 : 1 <= ... < 4 years
# A74 : 4 <= ... < 7 years
# A75 : .. >= 7 years
A7<-gbpct(cdata$p_employment_since_7)
op7<-par(mfrow=c(1,2))
plot(cdata$p_employment_since_7, cdata$good_bad_21,
main="Present employment since ~ Good-Bad",
xlab="Employment in Years",
ylab="Good-Bad")
barplot(A7$WOE, col="brown", names.arg=c(A7$Levels),
main="Present employment",
xlab="Category",
ylab="WOE")
par(op7)
kable(A7, caption ="Present employment since ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A71 | 39 | 23 | 5.57 | 7.67 | 62 | 6.2 | 37.10 | 4.21 | -3.20 | 0.67200 | 1.050 |
A72 | 102 | 70 | 14.57 | 23.33 | 172 | 17.2 | 40.70 | 3.84 | -4.71 | 4.12596 | 4.380 |
A73 | 235 | 104 | 33.57 | 34.67 | 339 | 33.9 | 30.68 | 4.92 | -0.32 | 0.03520 | 0.550 |
A74 | 135 | 39 | 19.29 | 13.00 | 174 | 17.4 | 22.41 | 5.97 | 3.95 | 2.48455 | 3.145 |
A75 | 189 | 64 | 27.00 | 21.33 | 253 | 25.3 | 25.30 | 5.59 | 2.36 | 1.33812 | 2.835 |
Information Value is 8.66 and Efficiency is 11.96 .
# Attribute 8: (numerical)
#-----------------------------------------------------------
# instalment rate in percentage of disposable income
summary(cdata$instalment_pct_8)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.973 4.000 4.000
op8<-par(mfrow=c(1,2))
boxplot(cdata$instalment_pct_8)
histogram(cdata$instalment_pct_8,
main = "instalment rate in percentage of disposable income",
xlab = "instalment percent",
ylab = "Percent Population")
par(op8)
A8<-gbpct(cdata$instalment_pct_8)
op8_1<-par(mfrow=c(1,2))
plot(as.factor(cdata$instalment_pct_8), cdata$good_bad_21,
main="instalment rate in percentage of disposable income ~ Good-Bad",
xlab="Percent",
ylab="Good-Bad")
barplot(A8$WOE, col="brown", names.arg=c(A8$Levels),
main="instalment rate",
xlab="Percent",
ylab="WOE")
par(op8_1)
kable(A8, caption = "instalment rate in percentage of disposable income ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 102 | 34 | 14.57 | 11.33 | 136 | 13.6 | 25.00 | 5.63 | 2.52 | 0.81648 | 1.620 |
2 | 169 | 62 | 24.14 | 20.67 | 231 | 23.1 | 26.84 | 5.39 | 1.55 | 0.53785 | 1.735 |
3 | 112 | 45 | 16.00 | 15.00 | 157 | 15.7 | 28.66 | 5.16 | 0.65 | 0.06500 | 0.500 |
4 | 317 | 159 | 45.29 | 53.00 | 476 | 47.6 | 33.40 | 4.61 | -1.57 | 1.21047 | 3.855 |
Information Value is 2.63 and Efficiency is 7.71 .
# Attribute 9: (qualitative)
#-----------------------------------------------------------
# Personal status and sex - you may not use for some country due to regulations
# A91 : male : divorced/separated
# A92 : female : divorced/separated/married
# A93 : male : single
# A94 : male : married/widowed
# A95 : female : single
A9<-gbpct(cdata$personal_status_9)
op9<-par(mfrow=c(1,2))
plot(cdata$personal_status_9, cdata$good_bad_21,
main=" Personal status",
xlab=" Personal status",
ylab="Good-Bad")
barplot(A9$WOE, col="brown", names.arg=c(A9$Levels),
main="Personal status",
xlab="Category",
ylab="WOE")
par(op9)
kable(A9, caption = "Personal status ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A91 | 30 | 20 | 4.29 | 6.67 | 50 | 5.0 | 40.00 | 3.91 | -4.41 | 1.04958 | 1.19 |
A92 | 201 | 109 | 28.71 | 36.33 | 310 | 31.0 | 35.16 | 4.41 | -2.35 | 1.79070 | 3.81 |
A93 | 402 | 146 | 57.43 | 48.67 | 548 | 54.8 | 26.64 | 5.41 | 1.66 | 1.45416 | 4.38 |
A94 | 67 | 25 | 9.57 | 8.33 | 92 | 9.2 | 27.17 | 5.35 | 1.39 | 0.17236 | 0.62 |
Information Value is 4.47 and Efficiency is 10 .
# Attribute 10: (qualitative)
#-----------------------------------------------------------
# Other debtors / guarantors
# A101 : none
# A102 : co-applicant
# A103 : guarantor
A10<-gbpct(cdata$other_debtors_or_grantors_10)
op10<-par(mfrow=c(1,2))
plot(cdata$other_debtors_or_grantors_10, cdata$good_bad_21,
main="Other debtors / guarantors",
xlab="Category",
ylab="Good-Bad")
barplot(A10$WOE, col="brown", names.arg=c(A10$Levels),
main="Other debtors / guarantors",
xlab="Category",
ylab="WOE")
par(op10)
kable(A10, caption = "Other debtors / guarantors ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A101 | 635 | 272 | 90.71 | 90.67 | 907 | 90.7 | 29.99 | 5.00 | 0.00 | 0.00000 | 0.020 |
A102 | 23 | 18 | 3.29 | 6.00 | 41 | 4.1 | 43.90 | 3.54 | -6.01 | 1.62871 | 1.355 |
A103 | 42 | 10 | 6.00 | 3.33 | 52 | 5.2 | 19.23 | 6.43 | 5.89 | 1.57263 | 1.335 |
Information Value is 3.2 and Efficiency is 2.71 .
# Attribute 11: (numerical)
#-----------------------------------------------------------
# Present residence since
summary(cdata$present_residence_since_11)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.845 4.000 4.000
A11<-gbpct(cdata$present_residence_since_11)
op11<-par(mfrow=c(1,2))
histogram(cdata$present_residence_since_11,
main="Present Residence~ Good-Bad",
xlab="Present residence Since",
ylab="Percent Population")
barplot(A11$WOE, col="brown", names.arg=c(A11$Levels),
main="Present Residence",
xlab="Category",
ylab="WOE")
par(op11)
kable(A11, caption = "Present Residence~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 94 | 36 | 13.43 | 12.00 | 130 | 13.0 | 27.69 | 5.28 | 1.13 | 0.16159 | 0.715 |
2 | 211 | 97 | 30.14 | 32.33 | 308 | 30.8 | 31.49 | 4.82 | -0.70 | 0.15330 | 1.095 |
3 | 106 | 43 | 15.14 | 14.33 | 149 | 14.9 | 28.86 | 5.14 | 0.55 | 0.04455 | 0.405 |
4 | 289 | 124 | 41.29 | 41.33 | 413 | 41.3 | 30.02 | 5.00 | -0.01 | 0.00004 | 0.020 |
Information Value is 0.36 and Efficiency is 2.23 .
# Attribute 12: (qualitative)
#-----------------------------------------------------------
# Property
# A121 : real estate
# A122 : if not A121 : building society savings agreement/
# life insurance
# A123 : if not A121/A122 : car or other, not in attribute 6
# A124 : unknown / no property
A12 <- gbpct(cdata$property_type_12)
op12 <- par(mfrow = c(1,2))
plot(cdata$property_type_12, cdata$good_bad_21,
main = "Property Type",
xlab="Type",
ylab="Good-Bad")
barplot(A12$WOE, col="brown", names.arg=c(A12$Levels),
main="Property Type",
xlab="Category",
ylab="WOE")
par(op12)
kable(A12, caption = "Property Type")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A121 | 222 | 60 | 31.71 | 20.00 | 282 | 28.2 | 21.28 | 6.13 | 4.61 | 5.39831 | 5.855 |
A122 | 161 | 71 | 23.00 | 23.67 | 232 | 23.2 | 30.60 | 4.93 | -0.29 | 0.01943 | 0.335 |
A123 | 230 | 102 | 32.86 | 34.00 | 332 | 33.2 | 30.72 | 4.91 | -0.34 | 0.03876 | 0.570 |
A124 | 87 | 67 | 12.43 | 22.33 | 154 | 15.4 | 43.51 | 3.58 | -5.86 | 5.80140 | 4.950 |
Information Value is 11.26 and Efficiency is 11.71 .
# Attribute 13: (numerical)
#-----------------------------------------------------------
# Age in Years
summary(cdata$age_in_yrs_13)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 27.00 33.00 35.55 42.00 75.00
op13 <- par(mfrow = c(1,2))
boxplot(cdata$age_in_yrs_13)
plot(as.factor(cdata$age_in_yrs_13), cdata$good_bad_21,
main = "Age",
xlab = "Age in Years",
ylab = "Good-Bad")
par(op13)
# Group AGE - Coarse Classing (after some iterations in fine classing stage)
cdata$age_in_yrs_13 <- as.factor(ifelse(cdata$age_in_yrs_13<=25, '0-25',
ifelse(cdata$age_in_yrs_13<=30, '25-30',
ifelse(cdata$age_in_yrs_13<=35, '30-35',
ifelse(cdata$age_in_yrs_13<=40, '35-40',
ifelse(cdata$age_in_yrs_13<=45, '40-45',
ifelse(cdata$age_in_yrs_13<=50, '45-50',
ifelse(cdata$age_in_yrs_13<=60, '50-60',
'60+'))))))))
A13<-gbpct(cdata$age_in_yrs_13)
op13_1<-par(mfrow=c(1,2))
plot(as.factor(cdata$age_in_yrs_13), cdata$good_bad_21,
main="Age (After Grouping)",
xlab="Other instalment plans",
ylab="Good-Bad")
barplot(A13$WOE, col="brown", names.arg=c(A13$Levels),
main="Age",
xlab="Category",
ylab="WOE")
par(op13_1)
kable(A13, caption = "Age (After Grouping) ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
0-25 | 110 | 80 | 15.71 | 26.67 | 190 | 19.0 | 42.11 | 3.71 | -5.29 | 5.79784 | 5.480 |
25-30 | 153 | 68 | 21.86 | 22.67 | 221 | 22.1 | 30.77 | 4.91 | -0.36 | 0.02916 | 0.405 |
30-35 | 127 | 50 | 18.14 | 16.67 | 177 | 17.7 | 28.25 | 5.21 | 0.85 | 0.12495 | 0.735 |
35-40 | 108 | 30 | 15.43 | 10.00 | 138 | 13.8 | 21.74 | 6.07 | 4.34 | 2.35662 | 2.715 |
40-45 | 63 | 25 | 9.00 | 8.33 | 88 | 8.8 | 28.41 | 5.19 | 0.77 | 0.05159 | 0.335 |
45-50 | 57 | 16 | 8.14 | 5.33 | 73 | 7.3 | 21.92 | 6.04 | 4.23 | 1.18863 | 1.405 |
50-60 | 47 | 21 | 6.71 | 7.00 | 68 | 6.8 | 30.88 | 4.89 | -0.42 | 0.01218 | 0.145 |
60+ | 35 | 10 | 5.00 | 3.33 | 45 | 4.5 | 22.22 | 6.00 | 4.06 | 0.67802 | 0.835 |
Information Value is 10.24 and Efficiency is 12.06 .
# Attribute 14: (qualitative)
#-----------------------------------------------------------
# Other instalment plans
# A141 : bank
# A142 : stores
# A143 : none
A14<-gbpct(cdata$other_instalment_type_14)
op14<-par(mfrow=c(1,2))
plot(cdata$other_instalment_type_14, cdata$good_bad_21,
main="Other instalment plans ~ Good-Bad",
xlab="Other instalment plans",
ylab="Good-Bad")
barplot(A14$WOE, col="brown", names.arg=c(A14$Levels),
main="Other instalment plans",
xlab="Category",
ylab="WOE")
par(op14)
kable(A14, caption = "Other instalment plans ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A141 | 82 | 57 | 11.71 | 19.00 | 139 | 13.9 | 41.01 | 3.81 | -4.84 | 3.52836 | 3.645 |
A142 | 28 | 19 | 4.00 | 6.33 | 47 | 4.7 | 40.43 | 3.87 | -4.59 | 1.06947 | 1.165 |
A143 | 590 | 224 | 84.29 | 74.67 | 814 | 81.4 | 27.52 | 5.30 | 1.21 | 1.16402 | 4.810 |
Information Value is 5.76 and Efficiency is 9.62 .
# Attribute 15: (qualitative)
#-----------------------------------------------------------
# Housing
# A151 : rent
# A152 : own
# A153 : for free
A15<-gbpct(cdata$housing_type_15)
op15<-par(mfrow=c(1,2))
plot(cdata$housing_type_15, cdata$good_bad_21,
main="Home Ownership Type",
xlab="Type",
ylab="Good-Bad")
barplot(A15$WOE, col="brown", names.arg=c(A15$Levels),
main="Home Ownership Type",
xlab="Type",
ylab="WOE")
par(op15)
kable(A15, caption = "Home Ownership Type ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A151 | 109 | 70 | 15.57 | 23.33 | 179 | 17.9 | 39.11 | 4.00 | -4.04 | 3.13504 | 3.880 |
A152 | 527 | 186 | 75.29 | 62.00 | 713 | 71.3 | 26.09 | 5.48 | 1.94 | 2.57826 | 6.645 |
A153 | 64 | 44 | 9.14 | 14.67 | 108 | 10.8 | 40.74 | 3.84 | -4.73 | 2.61569 | 2.765 |
Information Value is 8.33 and Efficiency is 13.29 .
# Attribute 16: (numerical)
#-----------------------------------------------------------
# Number of existing credits at this bank
summary(cdata$number_cards_this_bank_16)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.407 2.000 4.000
A16<-gbpct(cdata$number_cards_this_bank_16)
op16<-par(mfrow=c(1,2))
plot(as.factor(cdata$number_cards_this_bank_16), cdata$good_bad_21,
main="Number of credits at this bank",
xlab="Number of Cards",
ylab="Good-Bad")
barplot(A16$WOE, col="brown", names.arg=c(A16$Levels),
main="Number of credits at this bank",
xlab="Number of Cards",
ylab="WOE")
par(op16)
kable(A16, caption = "Number of credits at this bank ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 433 | 200 | 61.86 | 66.67 | 633 | 63.3 | 31.60 | 4.81 | -0.75 | 0.36075 | 2.405 |
2 | 241 | 92 | 34.43 | 30.67 | 333 | 33.3 | 27.63 | 5.29 | 1.16 | 0.43616 | 1.880 |
3 | 22 | 6 | 3.14 | 2.00 | 28 | 2.8 | 21.43 | 6.11 | 4.51 | 0.51414 | 0.570 |
4 | 4 | 2 | 0.57 | 0.67 | 6 | 0.6 | 33.33 | 4.60 | -1.62 | 0.01620 | 0.050 |
Information Value is 1.33 and Efficiency is 4.91 .
# Attribute 17: (qualitative)
#-----------------------------------------------------------
# Job
# A171 : unemployed/ unskilled - non-resident
# A172 : unskilled - resident
# A173 : skilled employee / official
# A174 : management/ self-employed/
# highly qualified employee/ officer
A17<-gbpct(cdata$job_17)
op17<-par(mfrow=c(1,2))
plot(cdata$job_17, cdata$good_bad_21,
main="Employment Type",
xlab="Job",
ylab="Good-Bad")
barplot(A17$WOE, col="brown", names.arg=c(A17$Levels),
main="Employment Type",
xlab="Job",
ylab="WOE")
par(op17)
kable(A17, caption = "Employment Type ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A171 | 15 | 7 | 2.14 | 2.33 | 22 | 2.2 | 31.82 | 4.79 | -0.85 | 0.01615 | 0.095 |
A172 | 144 | 56 | 20.57 | 18.67 | 200 | 20.0 | 28.00 | 5.24 | 0.97 | 0.18430 | 0.950 |
A173 | 444 | 186 | 63.43 | 62.00 | 630 | 63.0 | 29.52 | 5.06 | 0.23 | 0.03289 | 0.715 |
A174 | 97 | 51 | 13.86 | 17.00 | 148 | 14.8 | 34.46 | 4.49 | -2.04 | 0.64056 | 1.570 |
Information Value is 0.87 and Efficiency is 3.33 .
# Attribute 18: (numerical)
#-----------------------------------------------------------
# Number of people being liable to provide maintenance for
summary(cdata$no_people_liable_for_mntnance_18)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.155 1.000 2.000
A18<-gbpct(cdata$no_people_liable_for_mntnance_18)
op18<-par(mfrow = c(1,2))
plot(as.factor(cdata$no_people_liable_for_mntnance_18), cdata$good_bad_21,
main = "Number of people being liable",
xlab = "Number of People",
ylab = "Good-Bad")
barplot(A18$WOE, col = "brown", names.arg=c(A18$Levels),
main = " Number of people being liable",
xlab = "Number of People",
ylab = "WOE")
par(op18)
kable(A18, caption = "Number of people being liable ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 591 | 254 | 84.43 | 84.67 | 845 | 84.5 | 30.06 | 4.99 | -0.03 | 0.00072 | 0.12 |
2 | 109 | 46 | 15.57 | 15.33 | 155 | 15.5 | 29.68 | 5.04 | 0.16 | 0.00384 | 0.12 |
Information Value is 0 and Efficiency is 0.24 .
# Attribute 19: (qualitative)
#-----------------------------------------------------------
# Telephone
# A191 : none
# A192 : yes, registered under the customers name
A19<-gbpct(cdata$telephone_19)
op19<-par(mfrow=c(1,2))
plot(cdata$telephone_19, cdata$good_bad_21,
main="Telephone",
xlab="Telephone(Yes/No)",
ylab="Good-Bad")
barplot(A19$WOE, col="brown", names.arg=c(A19$Levels),
main="Telephone",
xlab="Telephone(Yeas/No)",
ylab="WOE")
par(op19)
kable(A19, caption = "Telephone ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A191 | 409 | 187 | 58.43 | 62.33 | 596 | 59.6 | 31.38 | 4.84 | -0.65 | 0.2535 | 1.95 |
A192 | 291 | 113 | 41.57 | 37.67 | 404 | 40.4 | 27.97 | 5.25 | 0.99 | 0.3861 | 1.95 |
Information Value is 0.64 and Efficiency is 3.9 .
# Attribute 20: (qualitative)
#-----------------------------------------------------------
# foreign worker
# A201 : yes
# A202 : no
A20<-gbpct(cdata$foreign_worker_20)
op20<-par(mfrow=c(1,2))
plot(cdata$foreign_worker_20, cdata$good_bad_21,
main="Foreign Worker",
xlab="Foreign Worker(Yes/No)",
ylab="Good-Bad")
barplot(A20$WOE, col="brown", names.arg=c(A20$Levels),
main="Foreign Worker",
xlab="Foreign Worker(Yes/No)",
ylab="WOE")
par(op20)
kable(A20, caption = "Foreign Worker ~ Good-Bad")
Names | Good | Bad | Good_pct | Bad_pct | Total | Total_Pct | Bad_Rate | grp_score | WOE | IV | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
A201 | 667 | 296 | 95.29 | 98.67 | 963 | 96.3 | 30.74 | 4.91 | -0.35 | 0.1183 | 1.69 |
A202 | 33 | 4 | 4.71 | 1.33 | 37 | 3.7 | 10.81 | 7.80 | 12.65 | 4.2757 | 1.69 |
Information Value is 4.39 and Efficiency is 3.38 .
# require library(Information)
cdata$good_bad_21<-as.numeric(ifelse(cdata$good_bad_21 == "Good", 0, 1))
IV <- Information::create_infotables(data=cdata, NULL, y="good_bad_21", 10)
IV$Summary$IV <- round(IV$Summary$IV*100,2)
IV$Tables
## $chk_ac_status_1
## chk_ac_status_1 N Percent WOE IV
## 1 A11 274 0.274 0.8180987 0.2056934
## 2 A12 269 0.269 0.4013918 0.2521402
## 3 A13 63 0.063 -0.4054651 0.2616010
## 4 A14 394 0.394 -1.1762632 0.6660115
##
## $duration_month_2
## duration_month_2 N Percent WOE IV
## 1 00-06 82 0.082 -1.24593700 0.09255532
## 2 06-12 277 0.277 -0.29511705 0.11518096
## 3 12-24 411 0.411 -0.01510778 0.11527449
## 4 24-30 57 0.057 0.15415068 0.11666918
## 5 30-36 86 0.086 0.61368301 0.15232124
## 6 36-42 17 0.017 -0.02817088 0.15233466
## 7 42+ 70 0.070 1.13497993 0.25502332
##
## $credit_history_3
## credit_history_3 N Percent WOE IV
## 1 01.A30 40 0.040 1.35812348 0.08407431
## 2 02.A31 49 0.049 1.13497993 0.15595637
## 3 03.A32.A33 618 0.618 0.08786876 0.16081008
## 4 04.A34 293 0.293 -0.73374058 0.29323278
##
## $purpose_4
## purpose_4 N Percent WOE IV
## 1 A40 234 0.234 0.35920049 0.03215700
## 2 A41 103 0.103 -0.77383609 0.08337758
## 3 A410 12 0.012 0.51082562 0.08678308
## 4 A42 181 0.181 0.09555652 0.08846669
## 5 A43 280 0.280 -0.41006282 0.13142566
## 6 A44 12 0.012 0.15415068 0.13171928
## 7 A45 22 0.022 0.28768207 0.13363716
## 8 A46 50 0.050 0.60613580 0.15384168
## 9 A48 9 0.009 -1.23214368 0.16381618
## 10 A49 97 0.097 0.23052366 0.16919507
##
## $credit_amount_5
## credit_amount_5 N Percent WOE IV
## 1 0-1400 267 0.267 0.03366128 0.0003045545
## 2 1400-2500 270 0.270 -0.30132485 0.0232626382
## 3 2500-3500 149 0.149 -0.22464618 0.0304299211
## 4 3500-4500 98 0.098 -0.17127172 0.0332028918
## 5 4500-5500 48 0.048 0.24652400 0.0362550937
## 6 5500+ 168 0.168 0.55961579 0.0938155748
##
## $savings_ac_bond_6
## savings_ac_bond_6 N Percent WOE IV
## 1 A61 603 0.603 0.2713578 0.04664771
## 2 A62 103 0.103 0.1395519 0.04870776
## 3 A63 63 0.063 -0.7060506 0.07526871
## 4 A64 48 0.048 -1.0986123 0.11921320
## 5 A65 183 0.183 -0.7042461 0.19600956
##
## $p_employment_since_7
## p_employment_since_7 N Percent WOE IV
## 1 A71 62 0.062 0.31923043 0.006688638
## 2 A72 172 0.172 0.47082029 0.047941463
## 3 A73 339 0.339 0.03210325 0.048293070
## 4 A74 174 0.174 -0.39441527 0.073084887
## 5 A75 253 0.253 -0.23556607 0.086433631
##
## $instalment_pct_8
## instalment_pct_8 N Percent WOE IV
## 1 [1,1] 136 0.136 -0.25131443 0.008137801
## 2 [2,2] 231 0.231 -0.15546647 0.013542111
## 3 [3,3] 157 0.157 -0.06453852 0.014187496
## 4 [4,4] 476 0.476 0.15730029 0.026322090
##
## $personal_status_9
## personal_status_9 N Percent WOE IV
## 1 A91 50 0.050 0.4418328 0.01051983
## 2 A92 310 0.310 0.2353408 0.02845056
## 3 A93 548 0.548 -0.1655476 0.04295568
## 4 A94 92 0.092 -0.1385189 0.04467068
##
## $other_debtors_or_grantors_10
## other_debtors_or_grantors_10 N Percent WOE IV
## 1 A101 907 0.907 -0.0005250722 2.500344e-07
## 2 A102 41 0.041 0.6021754024 1.634501e-02
## 3 A103 52 0.052 -0.5877866649 3.201932e-02
##
## $present_residence_since_11
## present_residence_since_11 N Percent WOE IV
## 1 [1,1] 130 0.130 -0.112477983 0.001606828
## 2 [2,2] 308 0.308 0.070150705 0.003143463
## 3 [3,3] 149 0.149 -0.054941118 0.003588224
## 4 [4,4] 413 0.413 0.001152738 0.003588773
##
## $property_type_12
## property_type_12 N Percent WOE IV
## 1 A121 282 0.282 -0.46103496 0.05400695
## 2 A122 232 0.232 0.02857337 0.05419744
## 3 A123 332 0.332 0.03419136 0.05458820
## 4 A124 154 0.154 0.58608236 0.11263826
##
## $age_in_yrs_13
## age_in_yrs_13 N Percent WOE IV
## 1 0-25 190 0.190 0.52884413 0.05792102
## 2 25-30 221 0.221 0.03636764 0.05821543
## 3 30-35 177 0.177 -0.08486622 0.05946822
## 4 35-40 138 0.138 -0.43363599 0.08300845
## 5 40-45 88 0.088 -0.07696104 0.08352153
## 6 45-50 73 0.073 -0.42316469 0.09541044
## 7 50-60 68 0.068 0.04167270 0.09552951
## 8 60+ 45 0.045 -0.40546511 0.10228726
##
## $other_instalment_type_14
## other_instalment_type_14 N Percent WOE IV
## 1 A141 139 0.139 0.4836299 0.03523589
## 2 A142 47 0.047 0.4595323 0.04595831
## 3 A143 814 0.814 -0.1211786 0.05761454
##
## $housing_type_15
## housing_type_15 N Percent WOE IV
## 1 A151 179 0.179 0.4044452 0.03139265
## 2 A152 713 0.713 -0.1941560 0.05718767
## 3 A153 108 0.108 0.4726044 0.08329343
##
## $number_cards_this_bank_16
## number_cards_this_bank_16 N Percent WOE IV
## 1 [1,1] 633 0.633 0.0748775 0.003601251
## 2 [2,4] 367 0.367 -0.1347806 0.010083557
##
## $job_17
## job_17 N Percent WOE IV
## 1 A171 22 0.022 0.08515781 0.0001622053
## 2 A172 200 0.200 -0.09716375 0.0020129434
## 3 A173 630 0.630 -0.02278003 0.0023383724
## 4 A174 148 0.148 0.20441251 0.0087627657
##
## $no_people_liable_for_mntnance_18
## no_people_liable_for_mntnance_18 N Percent WOE IV
## 1 [1,1] 845 0.845 0.00281611 6.705024e-06
## 2 [2,2] 155 0.155 -0.01540863 4.339223e-05
##
## $telephone_19
## telephone_19 N Percent WOE IV
## 1 A191 596 0.596 0.06469132 0.002526042
## 2 A192 404 0.404 -0.09863759 0.006377605
##
## $foreign_worker_20
## foreign_worker_20 N Percent WOE IV
## 1 A201 963 0.963 0.03486727 0.001178846
## 2 A202 37 0.037 -1.26291534 0.043877412
kable(IV$Summary)
Variable | IV | |
---|---|---|
1 | chk_ac_status_1 | 66.60 |
3 | credit_history_3 | 29.32 |
2 | duration_month_2 | 25.50 |
6 | savings_ac_bond_6 | 19.60 |
4 | purpose_4 | 16.92 |
12 | property_type_12 | 11.26 |
13 | age_in_yrs_13 | 10.23 |
5 | credit_amount_5 | 9.38 |
7 | p_employment_since_7 | 8.64 |
15 | housing_type_15 | 8.33 |
14 | other_instalment_type_14 | 5.76 |
9 | personal_status_9 | 4.47 |
20 | foreign_worker_20 | 4.39 |
10 | other_debtors_or_grantors_10 | 3.20 |
8 | instalment_pct_8 | 2.63 |
16 | number_cards_this_bank_16 | 1.01 |
17 | job_17 | 0.88 |
19 | telephone_19 | 0.64 |
11 | present_residence_since_11 | 0.36 |
18 | no_people_liable_for_mntnance_18 | 0.00 |
cdata$good_bad_21<-as.factor(ifelse(cdata$good_bad_21 == 0, "Good", "Bad"))
I. Following variables do not have prediction power - very very weak predictor (IV< 2%), hence we shall exclude them from modeling
Position | Variable | IV |
---|---|---|
16 | number_cards_this_bank_16 | 1.01 |
17 | job_17 | 0.88 |
19 | telephone_19 | 0.64 |
11 | present_residence_since_11 | 0.36 |
18 | no_people_liable_for_mntnance_18 | 0.00 |
Position | Variable | IV |
---|---|---|
7 | p_employment_since_7 | 8.64 |
15 | housing_type_15 | 8.33 |
14 | other_instalment_type_14 | 5.76 |
9 | personal_status_9 | 4.47 |
20 | foreign_worker_20 | 4.39 |
10 | other_debtors_or_grantors_10 | 3.20 |
8 | instalment_pct_8 | 2.63 |
Position | Variable | IV |
---|---|---|
3 | credit_history_3 | 29.32 |
2 | duration_month_2 | 27.79 |
6 | savings_ac_bond_6 | 19.60 |
4 | purpose_4 | 16.92 |
13 | age_in_yrs_13 | 12.12 |
12 | property_type_12 | 11.26 |
5 | credit_amount_5 | 11.18 |
There is no strong predictor with IV between 30% to 50%
chk_ac_status_1 has a very high prediction power (IV > 50%), it could be suspicious and require further investigation
Position | Variable | IV |
---|---|---|
1 | chk_ac_status_1 | 66.60 |
var_list_1 <- IV$Summary[IV$Summary$IV > 2,] # 15 variables
cdata_reduced_1 <- cdata[, c(var_list_1$Variable,"good_bad_21")] #16 variables
Clustering of variables is as a way to arrange variables into homogeneous clusters, i.e., groups of variables which are strongly related to each other and thus bring the same information.
When we have large number of variables, this should be done well before univariate analysis. This can also be done using Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA) or Factor Analysis.
To perform clustering on variables we need ClusterOfVar
package. For more details on using R ClusterOfVar see or click here.
# Step 1: Subset quantitative and qualitative variables X.quanti and X.quali
factors <- sapply(cdata_reduced_1, is.factor)
#subset Qualitative variables
vars_quali <- cdata_reduced_1[,factors]
#vars_quali$good_bad_21<-vars_quali$good_bad_21[drop=TRUE] # remove empty factors
str(vars_quali)
## 'data.frame': 1000 obs. of 15 variables:
## $ chk_ac_status_1 : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ credit_history_3 : Factor w/ 4 levels "01.A30","02.A31",..: 4 3 4 3 3 3 3 3 3 4 ...
## $ duration_month_2 : Factor w/ 7 levels "00-06","06-12",..: 1 7 2 6 3 5 3 5 2 4 ...
## $ savings_ac_bond_6 : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ purpose_4 : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ property_type_12 : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age_in_yrs_13 : Factor w/ 8 levels "0-25","25-30",..: 8 1 6 5 7 3 7 3 8 2 ...
## $ credit_amount_5 : Factor w/ 6 levels "0-1400","1400-2500",..: 1 6 2 6 5 6 3 6 3 5 ...
## $ p_employment_since_7 : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ housing_type_15 : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ other_instalment_type_14 : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ personal_status_9 : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ foreign_worker_20 : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ other_debtors_or_grantors_10: Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ good_bad_21 : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
#subset Quantitative variables
vars_quanti <- cdata_reduced_1[,!factors]
str(vars_quanti)
## num [1:1000] 4 2 2 2 3 2 3 2 2 4 ...
#Step 2: Hierarchical Clustering of Variables
# requires library(ClustOfVar)
# Need help type ?hclustvar on R console
tree <- hclustvar(X.quanti=vars_quanti,X.quali=vars_quali[,-c(12)])
plot(tree, main="variable clustering")
rect.hclust(tree, k=10, border = 1:10)
summary(tree)
## Length Class Mode
## call 3 -none- call
## rec 14 -none- list
## merge 28 -none- numeric
## height 14 -none- numeric
## order 15 -none- numeric
## labels 15 -none- character
## clusmat 225 -none- numeric
## X.quanti 1 data.frame list
## X.quali 14 data.frame list
# Phylogenetic trees
# require library("ape")
plot(as.phylo(tree), type = "fan",
tip.color = hsv(runif(15, 0.65, 0.95), 1, 1, 0.7),
edge.color = hsv(runif(10, 0.65, 0.75), 1, 1, 0.7),
edge.width = runif(20, 0.5, 3), use.edge.length = TRUE, col = "gray80")
summary.phylo(as.phylo(tree))
##
## Phylogenetic tree: as.phylo(tree)
##
## Number of tips: 15
## Number of nodes: 14
## Branch lengths:
## mean: 0.2484595
## variance: 0.01806587
## distribution summary:
## Min. 1st Qu. Median 3rd Qu. Max.
## 0.01508 0.12620 0.24880 0.35860 0.47440
## No root edge.
## First ten tip labels: X.quanti
## chk_ac_status_1
## credit_history_3
## duration_month_2
## savings_ac_bond_6
## purpose_4
## property_type_12
## age_in_yrs_13
## credit_amount_5
## p_employment_since_7
## No node labels.
stab<-stability(tree,B=50) # Bootstrap 50 times
# plot(stab,main="Stability of the partitions")
boxplot(stab$matCR)
part<-cutreevar(tree,10)
print(part)
##
## Call:
## cutreevar(obj = tree, k = 10)
##
##
##
## name description
## "$var" "list of variables in each cluster"
## "$sim" "similarity matrix in each cluster"
## "$cluster" "cluster memberships"
## "$wss" "within-cluster sum of squares"
## "$E" "gain in cohesion (in %)"
## "$size" "size of each cluster"
## "$scores" "score of each cluster"
summary(part)
##
## Call:
## cutreevar(obj = tree, k = 10)
##
##
##
## Data:
## number of observations: 1000
## number of variables: 15
## number of numerical variables: 1
## number of categorical variables: 14
## number of clusters: 10
##
## Cluster 1 :
## squared loading
## X.quanti 1
##
##
## Cluster 2 :
## squared loading
## chk_ac_status_1 0.68
## good_bad_21 0.68
##
##
## Cluster 3 :
## squared loading
## credit_history_3 0.64
## other_instalment_type_14 0.64
##
##
## Cluster 4 :
## squared loading
## duration_month_2 0.82
## credit_amount_5 0.82
##
##
## Cluster 5 :
## squared loading
## savings_ac_bond_6 1
##
##
## Cluster 6 :
## squared loading
## purpose_4 1
##
##
## Cluster 7 :
## squared loading
## property_type_12 0.89
## housing_type_15 0.89
##
##
## Cluster 8 :
## squared loading
## age_in_yrs_13 0.72
## p_employment_since_7 0.72
##
##
## Cluster 9 :
## squared loading
## foreign_worker_20 1
##
##
## Cluster 10 :
## squared loading
## other_debtors_or_grantors_10 1
##
##
## Gain in cohesion (in %): 79.58
# head(part$scores)
We may also cross check the outcomes of hierarchical clustering using K-means variable clustering:
# kmeans for variable clustering
# requires library(ClustOfVar)
# Need help type ?kmeansvar on R console
kfit<-kmeansvar(X.quanti = vars_quanti, X.quali = vars_quali[,-c(12)], init=10,
iter.max = 150, nstart = 1, matsim = TRUE)
summary(kfit)
##
## Call:
## kmeansvar(X.quanti = vars_quanti, X.quali = vars_quali[, -c(12)], init = 10, iter.max = 150, nstart = 1, matsim = TRUE)
##
## number of iterations: 1
##
## Data:
## number of observations: 1000
## number of variables: 15
## number of numerical variables: 1
## number of categorical variables: 14
## number of clusters: 10
##
## Cluster 1 :
## squared loading
## X.quanti 1
##
##
## Cluster 2 :
## squared loading
## credit_history_3 0.64
## other_instalment_type_14 0.64
##
##
## Cluster 3 :
## squared loading
## duration_month_2 1
##
##
## Cluster 4 :
## squared loading
## purpose_4 1
##
##
## Cluster 5 :
## squared loading
## property_type_12 0.89
## housing_type_15 0.89
##
##
## Cluster 6 :
## squared loading
## age_in_yrs_13 0.72
## p_employment_since_7 0.72
##
##
## Cluster 7 :
## squared loading
## credit_amount_5 1
##
##
## Cluster 8 :
## squared loading
## foreign_worker_20 1
##
##
## Cluster 9 :
## squared loading
## other_debtors_or_grantors_10 1
##
##
## Cluster 10 :
## squared loading
## chk_ac_status_1 0.60
## savings_ac_bond_6 0.39
## good_bad_21 0.54
##
##
## Gain in cohesion (in %): 75.87
plot(cbind(vars_quanti, vars_quali), as.factor(kfit$cluster))
kfit$E
## [1] 75.87314
We will model first ten top labels from the varclus:
duration_month_2
age_in_yrs_13
credit_amount_5
instalment_pct_8
chk_ac_status_1
credit_history_3
savings_ac_bond_6
purpose_4
property_type_12
p_employment_since_7
Keep only important variables from variable of cluster analysis.
keep<- c(1:8,12,13,21)
cdata_reduced_2 <- cdata[,keep]
str(cdata_reduced_2)
## 'data.frame': 1000 obs. of 11 variables:
## $ chk_ac_status_1 : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ duration_month_2 : Factor w/ 7 levels "00-06","06-12",..: 1 7 2 6 3 5 3 5 2 4 ...
## $ credit_history_3 : Factor w/ 4 levels "01.A30","02.A31",..: 4 3 4 3 3 3 3 3 3 4 ...
## $ purpose_4 : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ credit_amount_5 : Factor w/ 6 levels "0-1400","1400-2500",..: 1 6 2 6 5 6 3 6 3 5 ...
## $ savings_ac_bond_6 : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ p_employment_since_7: Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ instalment_pct_8 : num 4 2 2 2 3 2 3 2 2 4 ...
## $ property_type_12 : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age_in_yrs_13 : Factor w/ 8 levels "0-25","25-30",..: 8 1 6 5 7 3 7 3 8 2 ...
## $ good_bad_21 : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
We may split the data (given population) into random samples with 50-50, 60-40 or 70-30 ratios for Training (Development Sample on which model will be developed or trained) and Test (validation/holdout sample on which model will be tested) based on population size. In this exercise we will split the sample into 70-30. You may perform this step even before Univariate analysis.
To know more about random sampling click here.
Simple random sampling is the most basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample (Yale).
div_part <- sort(sample(nrow(cdata_reduced_2), nrow(cdata_reduced_2)*.6))
#select training sample
train<-cdata_reduced_2[div_part,] # 70% here
pct(train$good_bad_21)
Count | Percentage | |
---|---|---|
Bad | 173 | 28.83 |
Good | 427 | 71.17 |
# put remaining into test sample
test<-cdata_reduced_2[-div_part,] # rest of the 30% data goes here
pct(test$good_bad_21)
Count | Percentage | |
---|---|---|
Bad | 127 | 31.75 |
Good | 273 | 68.25 |
Stratified random sampling is a method of sampling that involves the division of a population into smaller groups known as strata. In stratified random sampling, the strata are formed based on members’ shared attributes or characteristics. See more here.
In our case we will use good_bad
as strata and partition data into 70%-30% as train
and test
sets. You may also split them into 60-40 or 50-50 ratio.
# Required library(caret)
# considering good_bad variable as strata
pct(cdata_reduced_2$good_bad_21)
Count | Percentage | |
---|---|---|
Bad | 300 | 30 |
Good | 700 | 70 |
div_part_1 <- createDataPartition(y = cdata_reduced_2$good_bad_21, p = 0.7, list = F)
# Training Sample
train_1 <- cdata_reduced_2[div_part_1,] # 70% here
pct(train_1$good_bad_21)
Count | Percentage | |
---|---|---|
Bad | 210 | 30 |
Good | 490 | 70 |
# Test Sample
test_1 <- cdata_reduced_2[-div_part_1,] # rest of the 30% data goes here
pct(test_1$good_bad_21)
Count | Percentage | |
---|---|---|
Bad | 90 | 30 |
Good | 210 | 70 |
save(train_1, file="train_1.RData")
save(test_1, file="test_1.RData")
# For neural network we would need contious data
# Sampling for Neural Network - It can be used for other modeling as well
div_part_2 <- createDataPartition(y = cdatanum[,25], p = 0.7, list = F)
# Training Sample for Neural Network
train_num <- cdatanum[div_part_2,] # 70% here
# Test Sample for Neural Network
test_num <- cdatanum[-div_part_2,] # rest of the 30% data goes here
# Save for the future
save(train_num, file="train_num.RData")
save(test_num, file="test_num.RData")
Clearly stratified sampling is more accurate than simple random sampling.
The most important thing in developing model is to select right modeling algorithm(s). Here I have discussed several machine learning techniques. You may choose to use one of them or combination of few techniques to get best result.
In statistics, logistic regression, or logit regression, or logit model is a regression model where the dependent variable (DV) is categorical. This article covers the case of binary dependent variables—that is, where it can take only two values, such as pass/fail, win/lose, alive/dead or healthy/sick. Cases with more than two categories are referred to as multinomial logistic regression, or, if the multiple categories are ordered, as ordinal logistic regression.
Logistic regression was developed by statistician David Cox in 1958. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). As such it is not a classification method. It could be called a qualitative response/discrete choice model in the terminology of economics.
Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Thus, it treats the same set of problems as probit regression using similar techniques, with the latter using a cumulative normal distribution curve instead. Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard logistic distribution of errors and the second a standard normal distribution of errors. [Source: wiki.
Form of the Standard Logistic Function:
Read more here.
# library(stats)
# Model: Stepwise Logistic Regression Model
m1 <- glm(good_bad_21~.,data=train_1,family=binomial())
m1 <- step(m1)
## Start: AIC=716.83
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + p_employment_since_7 +
## instalment_pct_8 + property_type_12 + age_in_yrs_13
##
## Df Deviance AIC
## - age_in_yrs_13 7 635.41 713.41
## - p_employment_since_7 4 630.37 714.37
## - property_type_12 3 629.19 715.19
## - instalment_pct_8 1 625.58 715.58
## <none> 624.83 716.83
## - savings_ac_bond_6 4 635.65 719.65
## - credit_amount_5 5 644.16 726.16
## - purpose_4 9 655.61 729.61
## - duration_month_2 6 651.38 731.38
## - credit_history_3 3 646.74 732.74
## - chk_ac_status_1 3 666.72 752.72
##
## Step: AIC=713.41
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + p_employment_since_7 +
## instalment_pct_8 + property_type_12
##
## Df Deviance AIC
## - p_employment_since_7 4 640.58 710.58
## - property_type_12 3 639.37 711.37
## - instalment_pct_8 1 635.87 711.87
## <none> 635.41 713.41
## - savings_ac_bond_6 4 646.85 716.85
## - credit_amount_5 5 656.27 724.27
## - purpose_4 9 666.23 726.23
## - duration_month_2 6 661.74 727.74
## - credit_history_3 3 657.33 729.33
## - chk_ac_status_1 3 681.56 753.56
##
## Step: AIC=710.58
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + instalment_pct_8 +
## property_type_12
##
## Df Deviance AIC
## - property_type_12 3 644.92 708.92
## - instalment_pct_8 1 640.99 708.99
## <none> 640.58 710.58
## - savings_ac_bond_6 4 652.85 714.85
## - credit_amount_5 5 661.70 721.70
## - purpose_4 9 671.86 723.86
## - duration_month_2 6 666.65 724.65
## - credit_history_3 3 663.15 727.15
## - chk_ac_status_1 3 688.70 752.70
##
## Step: AIC=708.92
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + instalment_pct_8
##
## Df Deviance AIC
## - instalment_pct_8 1 645.73 707.73
## <none> 644.92 708.92
## - savings_ac_bond_6 4 657.28 713.28
## - credit_amount_5 5 665.60 719.60
## - purpose_4 9 678.69 724.69
## - credit_history_3 3 667.89 725.89
## - duration_month_2 6 674.55 726.55
## - chk_ac_status_1 3 695.03 753.03
##
## Step: AIC=707.73
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6
##
## Df Deviance AIC
## <none> 645.73 707.73
## - savings_ac_bond_6 4 657.75 711.75
## - credit_amount_5 5 667.93 719.93
## - purpose_4 9 679.25 723.25
## - credit_history_3 3 668.61 724.61
## - duration_month_2 6 681.68 731.68
## - chk_ac_status_1 3 695.89 751.89
summary(m1)
##
## Call:
## glm(formula = good_bad_21 ~ chk_ac_status_1 + duration_month_2 +
## credit_history_3 + purpose_4 + credit_amount_5 + savings_ac_bond_6,
## family = binomial(), data = train_1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6281 -0.7105 0.4113 0.7373 2.1901
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.37285 0.73245 -0.509 0.610724
## chk_ac_status_1A12 0.61522 0.25355 2.426 0.015251 *
## chk_ac_status_1A13 1.11106 0.42766 2.598 0.009377 **
## chk_ac_status_1A14 1.75686 0.26604 6.604 4.01e-11 ***
## duration_month_206-12 -0.97096 0.53395 -1.818 0.068995 .
## duration_month_212-24 -1.83253 0.53948 -3.397 0.000682 ***
## duration_month_224-30 -2.21101 0.67052 -3.297 0.000976 ***
## duration_month_230-36 -2.65730 0.63165 -4.207 2.59e-05 ***
## duration_month_236-42 -1.35801 1.13146 -1.200 0.230049
## duration_month_242+ -2.98232 0.67195 -4.438 9.07e-06 ***
## credit_history_302.A31 -0.36956 0.66620 -0.555 0.579082
## credit_history_303.A32.A33 0.84786 0.51708 1.640 0.101066
## credit_history_304.A34 1.59326 0.54645 2.916 0.003549 **
## purpose_4A41 1.35955 0.43010 3.161 0.001572 **
## purpose_4A410 1.56310 0.88989 1.757 0.079001 .
## purpose_4A42 0.37335 0.30268 1.234 0.217385
## purpose_4A43 0.88349 0.28162 3.137 0.001706 **
## purpose_4A44 1.07479 0.94320 1.140 0.254486
## purpose_4A45 0.71002 0.62297 1.140 0.254401
## purpose_4A46 -1.11693 0.48143 -2.320 0.020340 *
## purpose_4A48 1.84760 1.22416 1.509 0.131227
## purpose_4A49 0.62860 0.38827 1.619 0.105449
## credit_amount_51400-2500 0.78631 0.29464 2.669 0.007615 **
## credit_amount_52500-3500 0.82826 0.35504 2.333 0.019657 *
## credit_amount_53500-4500 1.56938 0.42714 3.674 0.000239 ***
## credit_amount_54500-5500 -0.09244 0.50037 -0.185 0.853437
## credit_amount_55500+ 0.22986 0.40318 0.570 0.568595
## savings_ac_bond_6A62 -0.02907 0.32452 -0.090 0.928620
## savings_ac_bond_6A63 0.76163 0.50526 1.507 0.131707
## savings_ac_bond_6A64 1.17121 0.60814 1.926 0.054118 .
## savings_ac_bond_6A65 0.74516 0.29773 2.503 0.012322 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 855.21 on 699 degrees of freedom
## Residual deviance: 645.73 on 669 degrees of freedom
## AIC: 707.73
##
## Number of Fisher Scoring iterations: 5
# List of significant variables and features with p-value <0.01
significant.variables <- summary(m1)$coeff[-1,4] < 0.01
names(significant.variables)[significant.variables == TRUE]
## [1] "chk_ac_status_1A13" "chk_ac_status_1A14" "duration_month_212-24" "duration_month_224-30" "duration_month_230-36" "duration_month_242+" "credit_history_304.A34"
## [8] "purpose_4A41" "purpose_4A43" "credit_amount_51400-2500" "credit_amount_53500-4500"
prob <- predict(m1, type = "response")
res <- residuals(m1, type = "deviance")
#Plot Residuals
plot(predict(m1), res,
xlab="Fitted values", ylab = "Residuals",
ylim = max(abs(res)) * c(-1,1))
## CIs using profiled log-likelihood
confint(m1)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.79807571 1.0962272
## chk_ac_status_1A12 0.12072884 1.1161421
## chk_ac_status_1A13 0.30456266 1.9941972
## chk_ac_status_1A14 1.24433267 2.2892300
## duration_month_206-12 -2.10434506 0.0130801
## duration_month_212-24 -2.98107366 -0.8449855
## duration_month_224-30 -3.58423854 -0.9361673
## duration_month_230-36 -3.96803668 -1.4743125
## duration_month_236-42 -3.49771769 1.0467428
## duration_month_242+ -4.36755546 -1.7179212
## credit_history_302.A31 -1.67919881 0.9448202
## credit_history_303.A32.A33 -0.15343713 1.8884731
## credit_history_304.A34 0.53675638 2.6919733
## purpose_4A41 0.53824541 2.2299898
## purpose_4A410 -0.14709305 3.4235200
## purpose_4A42 -0.21686909 0.9718579
## purpose_4A43 0.33589028 1.4418147
## purpose_4A44 -0.64788682 3.1860703
## purpose_4A45 -0.47222122 2.0079758
## purpose_4A46 -2.07575977 -0.1798538
## purpose_4A48 -0.24480547 4.9531715
## purpose_4A49 -0.12057851 1.4057208
## credit_amount_51400-2500 0.21426402 1.3715382
## credit_amount_52500-3500 0.14221605 1.5368036
## credit_amount_53500-4500 0.75546458 2.4353802
## credit_amount_54500-5500 -1.07023616 0.9002018
## credit_amount_55500+ -0.55771383 1.0259590
## savings_ac_bond_6A62 -0.65809959 0.6181156
## savings_ac_bond_6A63 -0.16678959 1.8349083
## savings_ac_bond_6A64 0.08043327 2.5140282
## savings_ac_bond_6A65 0.17439632 1.3449956
## CIs using standard errors
confint.default(m1)
## 2.5 % 97.5 %
## (Intercept) -1.8084294 1.06273401
## chk_ac_status_1A12 0.1182588 1.11217605
## chk_ac_status_1A13 0.2728588 1.94927005
## chk_ac_status_1A14 1.2354259 2.27829396
## duration_month_206-12 -2.0174699 0.07555885
## duration_month_212-24 -2.8898860 -0.77517698
## duration_month_224-30 -3.5252181 -0.89681190
## duration_month_230-36 -3.8953166 -1.41928648
## duration_month_236-42 -3.5756340 0.85960587
## duration_month_242+ -4.2993283 -1.66531560
## credit_history_302.A31 -1.6752798 0.93616640
## credit_history_303.A32.A33 -0.1655986 1.86132316
## credit_history_304.A34 0.5222425 2.66428406
## purpose_4A41 0.5165559 2.20253479
## purpose_4A410 -0.1810516 3.30725684
## purpose_4A42 -0.2198793 0.96658649
## purpose_4A43 0.3315314 1.43544756
## purpose_4A44 -0.7738393 2.92342633
## purpose_4A45 -0.5109859 1.93102065
## purpose_4A46 -2.0605252 -0.17334218
## purpose_4A48 -0.5517077 4.24689981
## purpose_4A49 -0.1323886 1.38959649
## credit_amount_51400-2500 0.2088219 1.36379745
## credit_amount_52500-3500 0.1323847 1.52412831
## credit_amount_53500-4500 0.7321981 2.40656246
## credit_amount_54500-5500 -1.0731387 0.88826844
## credit_amount_55500+ -0.5603558 1.02007839
## savings_ac_bond_6A62 -0.6651243 0.60698222
## savings_ac_bond_6A63 -0.2286595 1.75191474
## savings_ac_bond_6A64 -0.0207208 2.36314490
## savings_ac_bond_6A65 0.1616141 1.32869874
#
## odds ratios and 95% CI
exp(cbind(OR = coef(m1), confint(m1)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.68877012 0.16561728 2.9928534
## chk_ac_status_1A12 1.85005879 1.12831892 3.0530530
## chk_ac_status_1A13 3.03759002 1.35603182 7.3463031
## chk_ac_status_1A14 5.79421449 3.47061796 9.8673374
## duration_month_206-12 0.37872099 0.12192550 1.0131660
## duration_month_212-24 0.16000800 0.05073833 0.4295636
## duration_month_224-30 0.10958936 0.02775780 0.3921279
## duration_month_230-36 0.07013723 0.01891052 0.2289361
## duration_month_236-42 0.25717100 0.03026638 2.8483584
## duration_month_242+ 0.05067503 0.01268220 0.1794388
## credit_history_302.A31 0.69104059 0.18652336 2.5723508
## credit_history_303.A32.A33 2.33465065 0.85775469 6.6092695
## credit_history_304.A34 4.91977740 1.71044980 14.7607748
## purpose_4A41 3.89442218 1.71299861 9.2997716
## purpose_4A410 4.77360899 0.86321365 30.6772102
## purpose_4A42 1.45259789 0.80503535 2.6428501
## purpose_4A43 2.41932716 1.39918550 4.2283619
## purpose_4A44 2.92938800 0.52315012 24.1931689
## purpose_4A45 2.03402657 0.62361554 7.4482257
## purpose_4A46 0.32728180 0.12546107 0.8353923
## purpose_4A48 6.34454929 0.78285681 141.6234077
## purpose_4A49 1.87499117 0.88640749 4.0784653
## credit_amount_51400-2500 2.19528014 1.23894972 3.9414088
## credit_amount_52500-3500 2.28932386 1.15282569 4.6497044
## credit_amount_53500-4500 4.80367027 2.12860019 11.4201598
## credit_amount_54500-5500 0.91170834 0.34292752 2.4600996
## credit_amount_55500+ 1.25842542 0.57251644 2.7897696
## savings_ac_bond_6A62 0.97134746 0.51783450 1.8554284
## savings_ac_bond_6A63 2.14175933 0.84637768 6.2645594
## savings_ac_bond_6A64 3.22590023 1.08375653 12.3545969
## savings_ac_bond_6A65 2.10677096 1.19052730 3.8381696
#score test data set
test_1$m1_score <- predict(m1,type='response',test_1)
m1_pred <- prediction(test_1$m1_score, test_1$good_bad_21)
m1_perf <- performance(m1_pred,"tpr","fpr")
#ROC
plot(m1_perf, lwd=2, colorize=TRUE, main="ROC m1: Logistic Regression Performance")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m1_perf_precision <- performance(m1_pred, measure = "prec", x.measure = "rec")
plot(m1_perf_precision, main="m1 Logistic:Precision/recall curve")
# Plot accuracy as function of threshold
m1_perf_acc <- performance(m1_pred, measure = "acc")
plot(m1_perf_acc, main="m1 Logistic:Accuracy as function of threshold")
#KS, Gini & AUC m1
m1_KS <- round(max(attr(m1_perf,'y.values')[[1]]-attr(m1_perf,'x.values')[[1]])*100, 2)
m1_AUROC <- round(performance(m1_pred, measure = "auc")@y.values[[1]]*100, 2)
m1_Gini <- (2*m1_AUROC - 100)
cat("AUROC: ",m1_AUROC,"\tKS: ", m1_KS, "\tGini:", m1_Gini, "\n")
## AUROC: 76.87 KS: 43.17 Gini: 53.74
#auc = performance(m1_pred, "auc")
#auc = unlist(auc@y.values)
#auc
# Cross Validatio
#load Data Analysis And Graphics Package for R (DAAG)
#library(DAAG)
#calculate accuracy over 100 random folds of data for simple logit
#m1_h <- CVbinary(obj=m1, rand=NULL, nfolds=100, print.details=TRUE)
#Model
m1_1 <- glm(good_bad_21~chk_ac_status_1+duration_month_2
+savings_ac_bond_6+instalment_pct_8,
data=train_1,family=binomial())
step(m1_1)
## Start: AIC=748.21
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + savings_ac_bond_6 +
## instalment_pct_8
##
## Df Deviance AIC
## <none> 718.21 748.21
## - instalment_pct_8 1 720.25 748.25
## - savings_ac_bond_6 4 728.88 750.88
## - duration_month_2 6 754.87 772.87
## - chk_ac_status_1 3 786.78 810.78
##
## Call: glm(formula = good_bad_21 ~ chk_ac_status_1 + duration_month_2 +
## savings_ac_bond_6 + instalment_pct_8, family = binomial(),
## data = train_1)
##
## Coefficients:
## (Intercept) chk_ac_status_1A12 chk_ac_status_1A13 chk_ac_status_1A14 duration_month_206-12 duration_month_212-24 duration_month_224-30 duration_month_230-36
## 1.2957 0.5688 1.1188 1.8628 -0.5570 -1.0829 -1.2651 -1.7626
## duration_month_236-42 duration_month_242+ savings_ac_bond_6A62 savings_ac_bond_6A63 savings_ac_bond_6A64 savings_ac_bond_6A65 instalment_pct_8
## -0.6044 -2.3254 -0.1323 0.6309 0.9385 0.6462 -0.1213
##
## Degrees of Freedom: 699 Total (i.e. Null); 685 Residual
## Null Deviance: 855.2
## Residual Deviance: 718.2 AIC: 748.2
#Model Summary
summary(m1_1)
##
## Call:
## glm(formula = good_bad_21 ~ chk_ac_status_1 + duration_month_2 +
## savings_ac_bond_6 + instalment_pct_8, family = binomial(),
## data = train_1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4835 -0.9080 0.4770 0.8046 1.8513
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.29566 0.51625 2.510 0.01208 *
## chk_ac_status_1A12 0.56878 0.22922 2.481 0.01309 *
## chk_ac_status_1A13 1.11882 0.39959 2.800 0.00511 **
## chk_ac_status_1A14 1.86276 0.24510 7.600 2.96e-14 ***
## duration_month_206-12 -0.55700 0.49183 -1.132 0.25743
## duration_month_212-24 -1.08286 0.47473 -2.281 0.02255 *
## duration_month_224-30 -1.26508 0.58169 -2.175 0.02964 *
## duration_month_230-36 -1.76258 0.53831 -3.274 0.00106 **
## duration_month_236-42 -0.60442 0.97264 -0.621 0.53433
## duration_month_242+ -2.32541 0.55847 -4.164 3.13e-05 ***
## savings_ac_bond_6A62 -0.13232 0.30444 -0.435 0.66384
## savings_ac_bond_6A63 0.63095 0.45452 1.388 0.16509
## savings_ac_bond_6A64 0.93851 0.57335 1.637 0.10166
## savings_ac_bond_6A65 0.64617 0.27238 2.372 0.01768 *
## instalment_pct_8 -0.12130 0.08527 -1.422 0.15489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 855.21 on 699 degrees of freedom
## Residual deviance: 718.21 on 685 degrees of freedom
## AIC: 748.21
##
## Number of Fisher Scoring iterations: 5
# Model Scoring
test_1$m1_1_score <- predict(m1_1,type='response',test_1)
m1_1_pred <- prediction(test_1$m1_1_score,test_1$good_bad_21)
m1_1_perf <- performance(m1_1_pred,"tpr","fpr")
#Model Performance plot
plot(m1_1_perf, lwd=2, colorize=TRUE,main = " ROC m1_1: Logistic Regression with selected variables")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# lines(x=c(0.5, 0), y=c(0, 0.5), col="orange", lwd=1, lty=4)
# lines(x=c(1, 0.5), y=c(1, 0.5), col="blue", lwd=1, lty=4)
# Plot precision/recall curve
m1_1_perf_precision <- performance(m1_1_pred, measure = "prec", x.measure = "rec")
plot(m1_1_perf_precision, main="m1_1 Logistic:Precision/recall curve")
# Plot accuracy as function of threshold
m1_1_perf_acc <- performance(m1_1_pred, measure = "acc")
plot(m1_1_perf_acc, main="m1_1 Logistic:Accuracy as function of threshold")
#KS & AUC m1_1
m1_1_AUROC <- round(performance(m1_1_pred, measure = "auc")@y.values[[1]]*100, 2)
m1_1_KS <- round(max(attr(m1_1_perf,'y.values')[[1]]-attr(m1_1_perf,'x.values')[[1]])*100, 2)
m1_1_Gini <- (2*m1_1_AUROC - 100)
cat("AUROC: ",m1_1_AUROC,"\tKS: ", m1_1_KS, "\tGini:", m1_1_Gini, "\n")
## AUROC: 77.27 KS: 46.19 Gini: 54.54
# 100 folds Cross Validation
# Requires library(DAAG)
# Ref: https://cran.r-project.org/web/packages/DAAG/DAAG.pdf
#m1_1_h <- CVbinary(obj=m1_1, rand=NULL, nfolds=100, print.details=TRUE)
Recursive partitioning is a statistical method for multivariable analysis. Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables.Source:wiki.
See more here.
See more here.
# Requires library(rpart)
m2 <- rpart(good_bad_21~.,data=train_1)
# Print tree detail
printcp(m2)
##
## Classification tree:
## rpart(formula = good_bad_21 ~ ., data = train_1)
##
## Variables actually used in tree construction:
## [1] age_in_yrs_13 chk_ac_status_1 credit_amount_5 credit_history_3 duration_month_2 property_type_12 purpose_4
##
## Root node error: 210/700 = 0.3
##
## n= 700
##
## CP nsplit rel error xerror xstd
## 1 0.049206 0 1.00000 1.00000 0.057735
## 2 0.038095 3 0.85238 0.97619 0.057334
## 3 0.030952 4 0.81429 1.00000 0.057735
## 4 0.016667 6 0.75238 0.97143 0.057252
## 5 0.010000 9 0.70000 0.97143 0.057252
# Tree plot
plot(m2, main="Tree:Recursive Partitioning");text(m2);
# Better version of plot
prp(m2,type=2,extra=1, main="Tree:Recursive Partitioning")
# score test data
test_1$m2_score <- predict(m2,type='prob',test_1)
m2_pred <- prediction(test_1$m2_score[,2],test_1$good_bad_21)
m2_perf <- performance(m2_pred,"tpr","fpr")
# MOdel performance plot
plot(m2_perf, lwd=2, colorize=TRUE, main="ROC m2: Traditional Recursive Partitioning")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m2_perf_precision <- performance(m2_pred, measure = "prec", x.measure = "rec")
plot(m2_perf_precision, main="m2 Recursive Partitioning:Precision/recall curve")
# Plot accuracy as function of threshold
m2_perf_acc <- performance(m2_pred, measure = "acc")
plot(m2_perf_acc, main="m2 Recursive Partitioning:Accuracy as function of threshold")
# KS & AUC m1
m2_AUROC <- round(performance(m2_pred, measure = "auc")@y.values[[1]]*100, 2)
m2_KS <- round(max(attr(m2_perf,'y.values')[[1]]-attr(m2_perf,'x.values')[[1]])*100, 2)
m2_Gini <- (2*m2_AUROC - 100)
cat("AUROC: ",m2_AUROC,"\tKS: ", m2_KS, "\tGini:", m2_Gini, "\n")
## AUROC: 73.65 KS: 41.9 Gini: 47.3
Read more here.
# Requires library(rpart)
# Fit Model
# m2_1 <- rpart(good_bad_21~.,data=train_1,parms=list(prior=c(.9,.1)),cp=.0002) #- build model using 90%-10% priors
# m2_1 <- rpart(good_bad_21~.,data=train_1,parms=list(prior=c(.8,.2)),cp=.0002) #- build model using 80%-20% priors
m2_1 <- rpart(good_bad_21~.,data=train_1,parms=list(prior=c(.7,.3)),cp=.0002) #- build model using 70%-30% priors
# m2_1 <- rpart(good_bad_21~.,data=train_1,parms=list(prior=c(.75,.25)),cp=.0002) #- build model using 75%-25% priors
# m2_1 <- rpart(good_bad_21~.,data=train_1,parms=list(prior=c(.6,.4)),cp=.0002) #- build model using 60%-40% priors
# Print tree detail
printcp(m2_1)
##
## Classification tree:
## rpart(formula = good_bad_21 ~ ., data = train_1, parms = list(prior = c(0.7,
## 0.3)), cp = 2e-04)
##
## Variables actually used in tree construction:
## [1] age_in_yrs_13 chk_ac_status_1 credit_amount_5 credit_history_3 duration_month_2 instalment_pct_8 p_employment_since_7 property_type_12 purpose_4
## [10] savings_ac_bond_6
##
## Root node error: 210/700 = 0.3
##
## n= 700
##
## CP nsplit rel error xerror xstd
## 1 0.08526077 0 1.00000 1.00000 0.024744
## 2 0.02448980 2 0.82948 0.88685 0.048986
## 3 0.01640212 4 0.78050 0.94059 0.059819
## 4 0.01439909 7 0.73129 0.99093 0.064396
## 5 0.01330310 11 0.66984 1.01995 0.068019
## 6 0.01201814 18 0.56349 1.01179 0.068071
## 7 0.01043084 20 0.53946 1.00249 0.068864
## 8 0.00419501 22 0.51859 1.06054 0.075481
## 9 0.00331066 24 0.51020 1.10181 0.078473
## 10 0.00317460 29 0.49365 1.09864 0.079135
## 11 0.00215420 30 0.49048 1.13197 0.080787
## 12 0.00056689 32 0.48617 1.20159 0.084510
## 13 0.00020000 34 0.48503 1.20862 0.085046
# plot trees
plot(m2_1, main="m2_1-Recursive Partitioning - Using Bayesian N 70%-30%");text(m2_1);
prp(m2_1,type=2,extra=1, main="m2_1-Recursive Partitioning - Using Bayesian N 70%-30%")
test_1$m2_1_score <- predict(m2_1,type='prob',test_1)
m2_1_pred <- prediction(test_1$m2_1_score[,2],test_1$good_bad_21)
m2_1_perf <- performance(m2_1_pred,"tpr","fpr")
# MOdel performance plot
plot(m2_1_perf, colorize=TRUE, main="ROC m2_1: Recursive Partitioning - Using Bayesian N 70%-30%")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m2_1_perf_precision <- performance(m2_1_pred, measure = "prec", x.measure = "rec")
plot(m2_1_perf_precision, main="m2_1 Recursive Partitioning:Precision/recall curve")
# Plot accuracy as function of threshold
m2_1_perf_acc <- performance(m2_1_pred, measure = "acc")
plot(m2_1_perf_acc, main="m2_1 Recursive Partitioning:Accuracy as function of threshold")
# KS & AUC m2_1
m2_1_AUROC <- round(performance(m2_1_pred, measure = "auc")@y.values[[1]]*100, 2)
m2_1_KS <- round(max(attr(m2_1_perf,'y.values')[[1]]-attr(m2_1_perf,'x.values')[[1]])*100, 2)
m2_1_Gini <- (2*m2_1_AUROC - 100)
cat("AUROC: ",m2_1_AUROC,"\tKS: ", m2_1_KS, "\tGini:", m2_1_Gini, "\n")
## AUROC: 68.22 KS: 33.17 Gini: 36.44
#prints complexity and out of sample error
printcp(m2)
##
## Classification tree:
## rpart(formula = good_bad_21 ~ ., data = train_1)
##
## Variables actually used in tree construction:
## [1] age_in_yrs_13 chk_ac_status_1 credit_amount_5 credit_history_3 duration_month_2 property_type_12 purpose_4
##
## Root node error: 210/700 = 0.3
##
## n= 700
##
## CP nsplit rel error xerror xstd
## 1 0.049206 0 1.00000 1.00000 0.057735
## 2 0.038095 3 0.85238 0.97619 0.057334
## 3 0.030952 4 0.81429 1.00000 0.057735
## 4 0.016667 6 0.75238 0.97143 0.057252
## 5 0.010000 9 0.70000 0.97143 0.057252
#plots complexity vs. error
plotcp(m2)
#prints complexity and out of sample error
printcp(m2_1)
##
## Classification tree:
## rpart(formula = good_bad_21 ~ ., data = train_1, parms = list(prior = c(0.7,
## 0.3)), cp = 2e-04)
##
## Variables actually used in tree construction:
## [1] age_in_yrs_13 chk_ac_status_1 credit_amount_5 credit_history_3 duration_month_2 instalment_pct_8 p_employment_since_7 property_type_12 purpose_4
## [10] savings_ac_bond_6
##
## Root node error: 210/700 = 0.3
##
## n= 700
##
## CP nsplit rel error xerror xstd
## 1 0.08526077 0 1.00000 1.00000 0.024744
## 2 0.02448980 2 0.82948 0.88685 0.048986
## 3 0.01640212 4 0.78050 0.94059 0.059819
## 4 0.01439909 7 0.73129 0.99093 0.064396
## 5 0.01330310 11 0.66984 1.01995 0.068019
## 6 0.01201814 18 0.56349 1.01179 0.068071
## 7 0.01043084 20 0.53946 1.00249 0.068864
## 8 0.00419501 22 0.51859 1.06054 0.075481
## 9 0.00331066 24 0.51020 1.10181 0.078473
## 10 0.00317460 29 0.49365 1.09864 0.079135
## 11 0.00215420 30 0.49048 1.13197 0.080787
## 12 0.00056689 32 0.48617 1.20159 0.084510
## 13 0.00020000 34 0.48503 1.20862 0.085046
#plots complexity vs. error
plotcp(m2_1)
# ROC Comparision
plot(m2_perf, colorize=TRUE, lwd=2,lty=1, main='Recursive partitioning:Model Performance Comparision (m2 ROC)')
plot(m2_1_perf, col='blue',lty=2, add=TRUE); # simple tree
legend(0.4,0.4,
c("m2: Traditional", "m2_1: Bayesian -70%-30%"),
col=c('orange', 'blue'),
lwd=3);
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
#print rules for all classes
#rpart.lists(m2)
#rpart.rules(m2)
#rpart.lists(m2_1)
#rpart.rules.table(m2_1)
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of over-fitting to their training set.
The first algorithm for random decision forests was created by Tin Kam Ho using the random subspace method,[2] which, in Ho’s formulation, is a way to implement the “stochastic discrimination” approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler, and “Random Forests” is their trademark. The extension combines Breiman’s “bagging” idea and random selection of features, introduced first by Ho[1] and later independently by Amit and Geman in order to construct a collection of decision trees with controlled variance. Source:Wiki
A Brief Tour of the Trees and Forests is here.
# Requires library(randomForest)
m3 <- randomForest(good_bad_21 ~ ., data = train_1)
m3_fitForest <- predict(m3, newdata = test_1, type="prob")[,2]
m3_pred <- prediction( m3_fitForest, test_1$good_bad_21)
m3_perf <- performance(m3_pred, "tpr", "fpr")
#plot variable importance
varImpPlot(m3, main="Random Forest: Variable Importance")
# Model Performance plot
plot(m3_perf,colorize=TRUE, lwd=2, main = "m3 ROC: Random Forest", col = "blue")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m3_perf_precision <- performance(m3_pred, measure = "prec", x.measure = "rec")
plot(m3_perf_precision, main="m3 Random Forests:Precision/recall curve")
# Plot accuracy as function of threshold
m3_perf_acc <- performance(m3_pred, measure = "acc")
plot(m3_perf_acc, main="m3 Random Forests:Accuracy as function of threshold")
#KS & AUC m3
m3_AUROC <- round(performance(m3_pred, measure = "auc")@y.values[[1]]*100, 2)
m3_KS <- round(max(attr(m3_perf,'y.values')[[1]] - attr(m3_perf,'x.values')[[1]])*100, 2)
m3_Gini <- (2*m3_AUROC - 100)
cat("AUROC: ",m3_AUROC,"\tKS: ", m3_KS, "\tGini:", m3_Gini, "\n")
## AUROC: 75.77 KS: 41.27 Gini: 51.54
# requires library(party)
set.seed(123456742)
m3_1 <- cforest(good_bad_21~., control = cforest_unbiased(mtry = 2, ntree = 50), data = train_1)
#plot(m3_1)
# Variable Importance
kable(as.data.frame(varimp(m3_1)))
varimp(m3_1) | |
---|---|
chk_ac_status_1 | 0.0468482 |
duration_month_2 | 0.0103502 |
credit_history_3 | 0.0126848 |
purpose_4 | 0.0042802 |
credit_amount_5 | 0.0053696 |
savings_ac_bond_6 | 0.0065370 |
p_employment_since_7 | -0.0024903 |
instalment_pct_8 | -0.0006226 |
property_type_12 | 0.0136187 |
age_in_yrs_13 | 0.0020233 |
# Model Summary
summary(m3_1)
## Length Class Mode
## 1 RandomForest S4
# Model Performance
m3_1_fitForest <- predict(m3, newdata = test_1, type = "prob")[,2]
m3_1_pred <- prediction(m3_1_fitForest, test_1$good_bad_21)
m3_1_perf <- performance(m3_1_pred, "tpr", "fpr")
# Model Performance Plot
plot(m3_1_perf, colorize=TRUE, lwd=2, main = " m3_1 ROC: Conditional Random Forests")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m3_1_perf_precision <- performance(m3_1_pred, measure = "prec", x.measure = "rec")
plot(m3_1_perf_precision, main="m3_1 Conditional Random Forests:Precision/recall curve")
# Plot accuracy as function of threshold
m3_1_perf_acc <- performance(m3_1_pred, measure = "acc")
plot(m3_1_perf_acc, main="m3_1 Conditional Random Forests:Accuracy as function of threshold")
# KS & AUC m3_1
m3_1_AUROC <- round(performance(m3_1_pred, measure = "auc")@y.values[[1]]*100, 2)
m3_1_KS <- round(max(attr(m3_perf,'y.values')[[1]] - attr(m3_perf,'x.values')[[1]])*100, 2)
m3_1_Gini <- (2*m3_1_AUROC - 100)
cat("AUROC: ",m3_1_AUROC,"\tKS: ", m3_1_KS, "\tGini:", m3_1_Gini, "\n")
## AUROC: 75.77 KS: 41.27 Gini: 51.54
# You need to perform some iterations - apply your data science mind and business sense
#model based on trial and error based on random forest variable importance
#m3_2<-glm(good_bad_21~.+credit_history_3:p_employment_since_7+ credit_history_3:instalment_pct_8
# +chk_ac_status_1:p_employment_since_7 +chk_ac_status_1:purpose_4
# + duration_month_2:credit_amount_5, data=train_1,family=binomial())
m3_2 <- glm(good_bad_21~.+credit_history_3:p_employment_since_7
+ credit_history_3:age_in_yrs_13
+ chk_ac_status_1:p_employment_since_7
+ chk_ac_status_1:savings_ac_bond_6
+ duration_month_2:purpose_4, data=train_1,family=binomial())
m3_2 <- step(m3_2)
## Start: AIC=774.53
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + p_employment_since_7 +
## instalment_pct_8 + property_type_12 + age_in_yrs_13 + credit_history_3:p_employment_since_7 +
## credit_history_3:age_in_yrs_13 + chk_ac_status_1:p_employment_since_7 +
## chk_ac_status_1:savings_ac_bond_6 + duration_month_2:purpose_4
##
## Df Deviance AIC
## - duration_month_2:purpose_4 40 533.69 735.69
## - chk_ac_status_1:savings_ac_bond_6 12 501.60 759.60
## - credit_history_3:p_employment_since_7 11 504.65 764.65
## - chk_ac_status_1:p_employment_since_7 11 505.31 765.31
## - instalment_pct_8 1 492.61 772.61
## - property_type_12 3 497.86 773.86
## <none> 492.53 774.53
## - credit_history_3:age_in_yrs_13 21 539.67 779.67
## - credit_amount_5 5 512.29 784.29
##
## Step: AIC=735.69
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + p_employment_since_7 +
## instalment_pct_8 + property_type_12 + age_in_yrs_13 + credit_history_3:p_employment_since_7 +
## credit_history_3:age_in_yrs_13 + chk_ac_status_1:p_employment_since_7 +
## chk_ac_status_1:savings_ac_bond_6
##
## Df Deviance AIC
## - chk_ac_status_1:savings_ac_bond_6 12 547.32 725.32
## - credit_history_3:p_employment_since_7 11 546.08 726.08
## - chk_ac_status_1:p_employment_since_7 11 548.44 728.44
## - property_type_12 3 537.74 733.74
## - instalment_pct_8 1 533.81 733.81
## <none> 533.69 735.69
## - credit_history_3:age_in_yrs_13 21 579.22 739.22
## - credit_amount_5 5 553.91 745.91
## - duration_month_2 6 566.57 756.57
## - purpose_4 9 575.45 759.45
##
## Step: AIC=725.32
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + p_employment_since_7 +
## instalment_pct_8 + property_type_12 + age_in_yrs_13 + credit_history_3:p_employment_since_7 +
## credit_history_3:age_in_yrs_13 + chk_ac_status_1:p_employment_since_7
##
## Df Deviance AIC
## - credit_history_3:p_employment_since_7 11 559.65 715.65
## - chk_ac_status_1:p_employment_since_7 11 565.62 721.62
## - instalment_pct_8 1 547.44 723.44
## - property_type_12 3 553.30 725.30
## <none> 547.32 725.32
## - savings_ac_bond_6 4 558.08 728.08
## - credit_history_3:age_in_yrs_13 21 594.48 730.48
## - credit_amount_5 5 572.69 740.69
## - duration_month_2 6 580.07 746.07
## - purpose_4 9 587.44 747.44
##
## Step: AIC=715.65
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + p_employment_since_7 +
## instalment_pct_8 + property_type_12 + age_in_yrs_13 + credit_history_3:age_in_yrs_13 +
## chk_ac_status_1:p_employment_since_7
##
## Df Deviance AIC
## - instalment_pct_8 1 559.92 713.92
## - chk_ac_status_1:p_employment_since_7 11 581.18 715.18
## <none> 559.65 715.65
## - property_type_12 3 565.81 715.81
## - credit_history_3:age_in_yrs_13 21 604.25 718.25
## - savings_ac_bond_6 4 571.19 719.19
## - credit_amount_5 5 582.36 728.36
## - duration_month_2 6 588.14 732.14
## - purpose_4 9 595.79 733.79
##
## Step: AIC=713.92
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + p_employment_since_7 +
## property_type_12 + age_in_yrs_13 + credit_history_3:age_in_yrs_13 +
## chk_ac_status_1:p_employment_since_7
##
## Df Deviance AIC
## - chk_ac_status_1:p_employment_since_7 11 581.47 713.47
## <none> 559.92 713.92
## - property_type_12 3 566.35 714.35
## - credit_history_3:age_in_yrs_13 21 604.93 716.93
## - savings_ac_bond_6 4 571.42 717.42
## - credit_amount_5 5 584.23 728.23
## - purpose_4 9 595.85 731.85
## - duration_month_2 6 592.98 734.98
##
## Step: AIC=713.47
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + p_employment_since_7 +
## property_type_12 + age_in_yrs_13 + credit_history_3:age_in_yrs_13
##
## Df Deviance AIC
## - p_employment_since_7 4 586.59 710.59
## - property_type_12 3 585.93 711.93
## <none> 581.47 713.47
## - credit_history_3:age_in_yrs_13 21 625.58 715.58
## - savings_ac_bond_6 4 591.65 715.65
## - purpose_4 9 612.67 726.67
## - credit_amount_5 5 608.31 730.31
## - duration_month_2 6 615.79 735.79
## - chk_ac_status_1 3 630.18 756.18
##
## Step: AIC=710.59
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + property_type_12 +
## age_in_yrs_13 + credit_history_3:age_in_yrs_13
##
## Df Deviance AIC
## - property_type_12 3 591.35 709.35
## <none> 586.59 710.59
## - credit_history_3:age_in_yrs_13 21 631.02 713.02
## - savings_ac_bond_6 4 597.69 713.69
## - purpose_4 9 618.62 724.62
## - credit_amount_5 5 614.57 728.57
## - duration_month_2 6 620.12 732.12
## - chk_ac_status_1 3 637.28 755.28
##
## Step: AIC=709.35
## good_bad_21 ~ chk_ac_status_1 + duration_month_2 + credit_history_3 +
## purpose_4 + credit_amount_5 + savings_ac_bond_6 + age_in_yrs_13 +
## credit_history_3:age_in_yrs_13
##
## Df Deviance AIC
## <none> 591.35 709.35
## - credit_history_3:age_in_yrs_13 21 636.01 712.01
## - savings_ac_bond_6 4 602.36 712.36
## - purpose_4 9 625.87 725.87
## - credit_amount_5 5 618.51 726.51
## - duration_month_2 6 630.14 736.14
## - chk_ac_status_1 3 645.14 757.14
summary(m3_2)
##
## Call:
## glm(formula = good_bad_21 ~ chk_ac_status_1 + duration_month_2 +
## credit_history_3 + purpose_4 + credit_amount_5 + savings_ac_bond_6 +
## age_in_yrs_13 + credit_history_3:age_in_yrs_13, family = binomial(),
## data = train_1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5225 -0.6444 0.3464 0.6802 2.2532
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.16421 1.63298 0.713 0.475884
## chk_ac_status_1A12 0.81188 0.27478 2.955 0.003130 **
## chk_ac_status_1A13 1.02106 0.46117 2.214 0.026824 *
## chk_ac_status_1A14 1.98310 0.28842 6.876 6.17e-12 ***
## duration_month_206-12 -1.02830 0.56381 -1.824 0.068176 .
## duration_month_212-24 -1.99584 0.57011 -3.501 0.000464 ***
## duration_month_224-30 -2.52706 0.72045 -3.508 0.000452 ***
## duration_month_230-36 -3.07912 0.67897 -4.535 5.76e-06 ***
## duration_month_236-42 -2.14294 1.23872 -1.730 0.083636 .
## duration_month_242+ -3.30410 0.72573 -4.553 5.29e-06 ***
## credit_history_302.A31 -1.85753 1.82787 -1.016 0.309521
## credit_history_303.A32.A33 -1.15296 1.56553 -0.736 0.461446
## credit_history_304.A34 -0.29265 1.62193 -0.180 0.856815
## purpose_4A41 1.66956 0.47700 3.500 0.000465 ***
## purpose_4A410 1.85418 1.07520 1.725 0.084616 .
## purpose_4A42 0.36047 0.32731 1.101 0.270759
## purpose_4A43 0.99046 0.30369 3.261 0.001108 **
## purpose_4A44 1.21184 1.00993 1.200 0.230170
## purpose_4A45 0.98809 0.65870 1.500 0.133599
## purpose_4A46 -0.93357 0.50506 -1.848 0.064541 .
## purpose_4A48 1.26810 1.35752 0.934 0.350238
## purpose_4A49 0.78033 0.42274 1.846 0.064912 .
## credit_amount_51400-2500 0.98508 0.32250 3.055 0.002254 **
## credit_amount_52500-3500 1.01988 0.38602 2.642 0.008242 **
## credit_amount_53500-4500 1.93977 0.46900 4.136 3.53e-05 ***
## credit_amount_54500-5500 -0.14717 0.54439 -0.270 0.786900
## credit_amount_55500+ 0.45874 0.44187 1.038 0.299184
## savings_ac_bond_6A62 -0.03207 0.33887 -0.095 0.924608
## savings_ac_bond_6A63 0.85223 0.53822 1.583 0.113329
## savings_ac_bond_6A64 1.13327 0.62663 1.809 0.070525 .
## savings_ac_bond_6A65 0.74127 0.31734 2.336 0.019497 *
## age_in_yrs_1325-30 -2.21367 1.70604 -1.298 0.194443
## age_in_yrs_1330-35 16.92632 1457.68232 0.012 0.990735
## age_in_yrs_1335-40 -3.83185 2.02413 -1.893 0.058347 .
## age_in_yrs_1340-45 -18.25417 3956.18067 -0.005 0.996319
## age_in_yrs_1345-50 -17.70198 3956.18063 -0.004 0.996430
## age_in_yrs_1350-60 -16.95986 2687.55439 -0.006 0.994965
## age_in_yrs_1360+ -18.22845 3956.18067 -0.005 0.996324
## credit_history_302.A31:age_in_yrs_1325-30 1.99898 2.18502 0.915 0.360268
## credit_history_303.A32.A33:age_in_yrs_1325-30 3.02085 1.74688 1.729 0.083759 .
## credit_history_304.A34:age_in_yrs_1325-30 2.60706 1.87080 1.394 0.163453
## credit_history_302.A31:age_in_yrs_1330-35 -16.40804 1457.68294 -0.011 0.991019
## credit_history_303.A32.A33:age_in_yrs_1330-35 -16.82940 1457.68237 -0.012 0.990788
## credit_history_304.A34:age_in_yrs_1330-35 -17.91500 1457.68255 -0.012 0.990194
## credit_history_302.A31:age_in_yrs_1335-40 3.36118 2.56712 1.309 0.190427
## credit_history_303.A32.A33:age_in_yrs_1335-40 4.81919 2.08250 2.314 0.020660 *
## credit_history_304.A34:age_in_yrs_1335-40 4.33364 2.14650 2.019 0.043495 *
## credit_history_302.A31:age_in_yrs_1340-45 19.31480 3956.18107 0.005 0.996105
## credit_history_303.A32.A33:age_in_yrs_1340-45 18.30082 3956.18070 0.005 0.996309
## credit_history_304.A34:age_in_yrs_1340-45 17.67326 3956.18076 0.004 0.996436
## credit_history_302.A31:age_in_yrs_1345-50 34.21934 5594.88421 0.006 0.995120
## credit_history_303.A32.A33:age_in_yrs_1345-50 18.06356 3956.18067 0.005 0.996357
## credit_history_304.A34:age_in_yrs_1345-50 19.45024 3956.18081 0.005 0.996077
## credit_history_302.A31:age_in_yrs_1350-60 -0.64077 3163.55755 0.000 0.999838
## credit_history_303.A32.A33:age_in_yrs_1350-60 17.27943 2687.55447 0.006 0.994870
## credit_history_304.A34:age_in_yrs_1350-60 17.09051 2687.55457 0.006 0.994926
## credit_history_302.A31:age_in_yrs_1360+ 0.04974 5594.88426 0.000 0.999993
## credit_history_303.A32.A33:age_in_yrs_1360+ 17.20068 3956.18072 0.004 0.996531
## credit_history_304.A34:age_in_yrs_1360+ 35.00363 4084.83951 0.009 0.993163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 855.21 on 699 degrees of freedom
## Residual deviance: 591.35 on 641 degrees of freedom
## AIC: 709.35
##
## Number of Fisher Scoring iterations: 16
# Model Performance
test_1$m3_2_score <- predict(m3_2,type='response',test_1)
m3_2_pred <- prediction(test_1$m3_2_score,test_1$good_bad_21)
m3_2_perf <- performance(m3_2_pred,"tpr","fpr")
# ROC Plot
plot(m3_2_perf, colorize=TRUE, lwd=2, main="ROC m3_2:Improve Logistic Results using Random Forest")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m3_2_perf_precision <- performance(m3_2_pred, measure = "prec", x.measure = "rec")
plot(m3_2_perf_precision, main="m3_2 Improve Logistic:Precision/recall curve")
# Plot accuracy as function of threshold
m3_2_perf_acc <- performance(m3_2_pred, measure = "acc")
plot(m3_2_perf_acc, main="m3_2 Improve Logistic:Accuracy as function of threshold")
#KS & AUC m3_2
m3_2_AUROC <- round(performance(m3_2_pred, measure = "auc")@y.values[[1]]*100, 2)
m3_2_KS <- round(max(attr(m3_2_perf,'y.values')[[1]]-attr(m3_2_perf,'x.values')[[1]])*100, 2)
m3_2_Gini <- (2*m3_2_AUROC - 100)
cat("AUROC: ",m3_2_AUROC,"\tKS: ", m3_2_KS, "\tGini:", m3_2_Gini, "\n")
## AUROC: 69.12 KS: 34.13 Gini: 38.24
Recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.
#library(party)
m4 <- ctree(good_bad_21~.,data=train_1)
plot(m4, main="m4: Conditional inference Tree",col="blue");
resultdfr <- as.data.frame(do.call("rbind", treeresponse(m4, newdata = test_1)))
test_1$m4_score <- resultdfr[,2]
m4_pred <- prediction(test_1$m4_score,test_1$good_bad_21)
m4_perf <- performance(m4_pred,"tpr","fpr")
# Model Performance
plot(m4_perf, colorize=TRUE, lwd=2, main="ROC m4: Conditional inference Tree")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m4_perf_precision <- performance(m4_pred, measure = "prec", x.measure = "rec")
plot(m4_perf_precision, main="m4 CIT:Plot precision/recall curve")
# Plot accuracy as function of threshold
m4_perf_acc <- performance(m4_pred, measure = "acc")
plot(m4_perf_acc, main="m4 CIT:Plot accuracy as function of threshold")
#KS & AUC m4
m4_AUROC <- round(performance(m4_pred, measure = "auc")@y.values[[1]]*100, 2)
m4_KS <- round(max(attr(m4_perf,'y.values')[[1]]-attr(m4_perf,'x.values')[[1]])*100, 2)
m4_Gini <- (2*m4_AUROC - 100)
cat("AUROC: ",m4_AUROC,"\tKS: ", m4_KS, "\tGini:", m4_Gini, "\n")
## AUROC: 71.63 KS: 40.48 Gini: 43.26
#randomForest (randomForest) and cforest (party) have same results
A Bayesian network, Bayes network, belief network, Bayesian model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). It is in general computational intensive and expensive. Source:Wiki.
A simple Bayesian network with conditional probability tables.
Read more here.
#library(bnlearn)
train_2<-train_1
train_2$duration_month_2 <- as.factor(train_2$duration_month_2)
train_2$credit_amount_5 <- as.factor(train_2$credit_amount_5)
train_2$instalment_pct_8 <- as.factor(train_2$instalment_pct_8)
train_2$age_in_yrs_13 <- as.factor(train_2$age_in_yrs_13)
bn.gs <- gs(train_2)
bn.gs
##
## Bayesian network learned via Constraint-based methods
##
## model:
## [partially directed graph]
## nodes: 11
## arcs: 4
## undirected arcs: 2
## directed arcs: 2
## average markov blanket size: 0.91
## average neighbourhood size: 0.73
## average branching factor: 0.18
##
## learning algorithm: Grow-Shrink
## conditional independence test: Mutual Information (disc.)
## alpha threshold: 0.05
## tests used in the learning procedure: 164
## optimized: TRUE
bn2 <- iamb(train_2)
bn2
##
## Bayesian network learned via Constraint-based methods
##
## model:
## [partially directed graph]
## nodes: 11
## arcs: 8
## undirected arcs: 4
## directed arcs: 4
## average markov blanket size: 1.82
## average neighbourhood size: 1.45
## average branching factor: 0.36
##
## learning algorithm: IAMB
## conditional independence test: Mutual Information (disc.)
## alpha threshold: 0.05
## tests used in the learning procedure: 208
## optimized: TRUE
bn3 <- fast.iamb(train_2)
bn3
##
## Bayesian network learned via Constraint-based methods
##
## model:
## [undirected graph]
## nodes: 11
## arcs: 6
## undirected arcs: 6
## directed arcs: 0
## average markov blanket size: 1.09
## average neighbourhood size: 1.09
## average branching factor: 0.00
##
## learning algorithm: Fast-IAMB
## conditional independence test: Mutual Information (disc.)
## alpha threshold: 0.05
## tests used in the learning procedure: 113
## optimized: TRUE
bn4 <- inter.iamb(train_2)
bn4
##
## Bayesian network learned via Constraint-based methods
##
## model:
## [partially directed graph]
## nodes: 11
## arcs: 8
## undirected arcs: 6
## directed arcs: 2
## average markov blanket size: 1.64
## average neighbourhood size: 1.45
## average branching factor: 0.18
##
## learning algorithm: Inter-IAMB
## conditional independence test: Mutual Information (disc.)
## alpha threshold: 0.05
## tests used in the learning procedure: 212
## optimized: TRUE
compare(bn.gs, bn2)
## $tp
## [1] 2
##
## $fp
## [1] 6
##
## $fn
## [1] 2
compare(bn.gs, bn3)
## $tp
## [1] 2
##
## $fp
## [1] 4
##
## $fn
## [1] 2
compare(bn.gs, bn4)
## $tp
## [1] 1
##
## $fp
## [1] 7
##
## $fn
## [1] 3
#On the other hand hill-climbing results in a completely directed network, which diers from
#the previous one because the arc between A and B is directed (A ! B instead of A B).
bn.hc <- hc(train_2, score = "aic")
bn.hc
##
## Bayesian network learned via Score-based methods
##
## model:
## [chk_ac_status_1][savings_ac_bond_6|chk_ac_status_1][good_bad_21|chk_ac_status_1][duration_month_2|good_bad_21][credit_history_3|good_bad_21][credit_amount_5|duration_month_2]
## [purpose_4|credit_amount_5][instalment_pct_8|credit_amount_5][property_type_12|credit_amount_5:good_bad_21][p_employment_since_7|property_type_12][age_in_yrs_13|p_employment_since_7]
## nodes: 11
## arcs: 11
## undirected arcs: 0
## directed arcs: 11
## average markov blanket size: 2.18
## average neighbourhood size: 2.00
## average branching factor: 1.00
##
## learning algorithm: Hill-Climbing
## score: AIC (disc.)
## penalization coefficient: 1
## tests used in the learning procedure: 165
## optimized: TRUE
compare(bn.hc, bn.gs)
## $tp
## [1] 0
##
## $fp
## [1] 4
##
## $fn
## [1] 11
opm5<-par(mfrow = c(1,2))
plot(bn.gs, main = "Constraint-based algorithms")
plot(bn.hc, main = "Hill-Climbing")
par(opm5)
modelstring(bn.hc)
## [1] "[chk_ac_status_1][savings_ac_bond_6|chk_ac_status_1][good_bad_21|chk_ac_status_1][duration_month_2|good_bad_21][credit_history_3|good_bad_21][credit_amount_5|duration_month_2][purpose_4|credit_amount_5][instalment_pct_8|credit_amount_5][property_type_12|credit_amount_5:good_bad_21][p_employment_since_7|property_type_12][age_in_yrs_13|p_employment_since_7]"
res2 = hc(train_2)
fitted2 = bn.fit(res2, train_2)
fitted2
##
## Bayesian network parameters
##
## Parameters of node chk_ac_status_1 (multinomial distribution)
##
## Conditional probability table:
## A11 A12 A13 A14
## 0.27571429 0.26428571 0.06857143 0.39142857
##
## Parameters of node duration_month_2 (multinomial distribution)
##
## Conditional probability table:
##
## good_bad_21
## duration_month_2 Bad Good
## 00-06 0.02857143 0.09387755
## 06-12 0.19523810 0.30000000
## 12-24 0.42857143 0.43877551
## 24-30 0.06666667 0.04897959
## 30-36 0.12857143 0.06326531
## 36-42 0.00952381 0.01224490
## 42+ 0.14285714 0.04285714
##
## Parameters of node credit_history_3 (multinomial distribution)
##
## Conditional probability table:
##
## good_bad_21
## credit_history_3 Bad Good
## 01.A30 0.07142857 0.02040816
## 02.A31 0.09047619 0.02448980
## 03.A32.A33 0.67142857 0.63469388
## 04.A34 0.16666667 0.32040816
##
## Parameters of node purpose_4 (multinomial distribution)
##
## Conditional probability table:
## A40 A41 A410 A42 A43 A44 A45 A46 A48 A49
## 0.24000000 0.09857143 0.01142857 0.18142857 0.27857143 0.01142857 0.02428571 0.04428571 0.01142857 0.09857143
##
## Parameters of node credit_amount_5 (multinomial distribution)
##
## Conditional probability table:
##
## duration_month_2
## credit_amount_5 00-06 06-12 12-24 24-30 30-36 36-42 42+
## 0-1400 0.57692308 0.51063830 0.18688525 0.00000000 0.00000000 0.00000000 0.00000000
## 1400-2500 0.26923077 0.29787234 0.35409836 0.15789474 0.20689655 0.00000000 0.01960784
## 2500-3500 0.05769231 0.10638298 0.19672131 0.26315789 0.13793103 0.25000000 0.05882353
## 3500-4500 0.03846154 0.04787234 0.11475410 0.15789474 0.22413793 0.00000000 0.11764706
## 4500-5500 0.01923077 0.01595745 0.04918033 0.10526316 0.03448276 0.25000000 0.11764706
## 5500+ 0.03846154 0.02127660 0.09836066 0.31578947 0.39655172 0.50000000 0.68627451
##
## Parameters of node savings_ac_bond_6 (multinomial distribution)
##
## Conditional probability table:
## A61 A62 A63 A64 A65
## 0.60857143 0.10285714 0.06142857 0.04714286 0.18000000
##
## Parameters of node p_employment_since_7 (multinomial distribution)
##
## Conditional probability table:
## A71 A72 A73 A74 A75
## 0.07285714 0.18285714 0.33571429 0.16428571 0.24428571
##
## Parameters of node instalment_pct_8 (multinomial distribution)
##
## Conditional probability table:
## 1 2 3 4
## 0.1314286 0.2328571 0.1557143 0.4800000
##
## Parameters of node property_type_12 (multinomial distribution)
##
## Conditional probability table:
##
## credit_amount_5
## property_type_12 0-1400 1400-2500 2500-3500 3500-4500 4500-5500 5500+
## A121 0.45355191 0.31472081 0.30188679 0.21126761 0.06060606 0.10000000
## A122 0.25136612 0.19796954 0.23584906 0.25352113 0.36363636 0.16363636
## A123 0.22404372 0.38071066 0.33962264 0.42253521 0.30303030 0.37272727
## A124 0.07103825 0.10659898 0.12264151 0.11267606 0.27272727 0.36363636
##
## Parameters of node age_in_yrs_13 (multinomial distribution)
##
## Conditional probability table:
## 0-25 25-30 30-35 35-40 40-45 45-50 50-60 60+
## 0.19857143 0.22428571 0.16000000 0.14142857 0.09285714 0.06857143 0.07142857 0.04285714
##
## Parameters of node good_bad_21 (multinomial distribution)
##
## Conditional probability table:
##
## chk_ac_status_1
## good_bad_21 A11 A12 A13 A14
## Bad 0.5077720 0.3621622 0.2083333 0.1277372
## Good 0.4922280 0.6378378 0.7916667 0.8722628
# library(gRain)
Ref: https://cran.r-project.org/web/packages/party/vignettes/MOB.pdf.
# model-based recursive partitioning- Need to perform manual iterations to improve the performances
#library(party)
# iter 1
m6 <- mob(good_bad_21~chk_ac_status_1 |
duration_month_2
+credit_history_3
+purpose_4
+credit_amount_5
+savings_ac_bond_6
+p_employment_since_7
+instalment_pct_8
+property_type_12
+age_in_yrs_13,
data=train_1,
model=glinearModel,family=binomial())
# iter 2
# m6<-mob(good_bad_21~ credit_history_3 +chk_ac_status_1 + savings_ac_bond_6 + purpose_4 |
# +duration_month_2
# +instalment_pct_8
# +credit_amount_5
# +p_employment_since_7
# +property_type_12
# +age_in_yrs_13,
# data=train_1,
# model=glinearModel,family=binomial())
# iter 3
# m6<-mob(good_bad_21~ chk_ac_status_1 + purpose_4|
# credit_history_3
# +duration_month_2
# +instalment_pct_8
# +credit_amount_5
# +savings_ac_bond_6
# +p_employment_since_7
# +property_type_12
# +age_in_yrs_13,
# data=train_1,
# model=glinearModel,family=binomial())
#library(vcd) #the `vcd' package is required for CD plots
plot(m6, main="m6: Model based Tree with GLM")
# Scoring
test_1$m6_score <- predict(m6, newdata = test_1, type =c("response"))
# Predict
m6_pred <- prediction(test_1$m6_score,test_1$good_bad_21)
m6_perf <- performance(m6_pred,"tpr","fpr")
plot(m6_perf, colorize=TRUE, lwd=2, main="ROC m6: Model based Tree with GLM")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m6_perf_precision <- performance(m6_pred, measure = "prec", x.measure = "rec")
plot(m6_perf_precision, main="MOB:Plot precision/recall curve")
# Plot accuracy as function of threshold
m6_perf_acc <- performance(m6_pred, measure = "acc")
plot(m6_perf_acc, main="MOB:Plot accuracy as function of threshold")
# Model Performance
#KS & AUC m6
m6_AUROC <- round(performance(m6_pred, measure = "auc")@y.values[[1]]*100, 2)
m6_KS <- round(max(attr(m6_perf,'y.values')[[1]]-attr(m6_perf,'x.values')[[1]])*100, 2)
m6_Gini <- (2*m6_AUROC - 100)
cat("AUROC: ",m6_AUROC,"\tKS: ", m6_KS, "\tGini:", m6_Gini, "\n")
## AUROC: 71.45 KS: 36.48 Gini: 42.9
A Support Vector Machine (SVM) performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. SVM models are closely related to neural networks Source.
Read more here.
#library(kernlab) #for SVM
# Basic Model
m7_1 <- ksvm(good_bad_21 ~ ., data = train_1, kernel = "vanilladot")
## Setting default kernel parameters
m7_1_pred <- predict(m7_1, test_1[,1:10], type="response")
head(m7_1_pred)
## [1] Good Bad Good Good Bad Bad
## Levels: Bad Good
# Model accuracy:
table(m7_1_pred, test_1$good_bad_21)
##
## m7_1_pred Bad Good
## Bad 39 28
## Good 51 182
#agreement
m7_1_accuracy <- (m7_1_pred == test_1$good_bad_21)
pct(m7_1_accuracy)
Count | Percentage | |
---|---|---|
FALSE | 79 | 26.33 |
TRUE | 221 | 73.67 |
# Compute at the prediction scores
m7_1_score <- predict(m7_1,test_1, type="decision")
m7_1_pred <- prediction(m7_1_score, test_1$good_bad_21)
# Plot ROC curve
m7_1_perf <- performance(m7_1_pred, measure = "tpr", x.measure = "fpr")
plot(m7_1_perf, colorize=TRUE, lwd=2, main="m7_1 SVM:Plot ROC curve - Vanilladot")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m7_1_perf_precision <- performance(m7_1_pred, measure = "prec", x.measure = "rec")
plot(m7_1_perf_precision, main="m7_1 SVM:Plot precision/recall curve")
# Plot accuracy as function of threshold
m7_1_perf_acc <- performance(m7_1_pred, measure = "acc")
plot(m7_1_perf_acc, main="m7_1 SVM:Plot accuracy as function of threshold")
# Model Performance
#KS & AUC m7_1
m7_1_AUROC <- round(performance(m7_1_pred, measure = "auc")@y.values[[1]]*100, 2)
m7_1_KS <- round(max(attr(m7_1_perf,'y.values')[[1]]-attr(m7_1_perf,'x.values')[[1]])*100, 2)
m7_1_Gini <- (2*m7_1_AUROC - 100)
cat("AUROC: ",m7_1_AUROC,"\tKS: ", m7_1_KS, "\tGini:", m7_1_Gini, "\n")
## AUROC: 74.46 KS: 39.21 Gini: 48.92
# Requires library(kernlab)
# Model Improvement with Gaussian RBF kernel
m7_2 <- ksvm(good_bad_21 ~ ., data = train_1, kernel = "rbfdot")
m7_2_pred <- predict(m7_2, test_1[,1:10], type="response")
head(m7_2_pred)
## [1] Good Good Good Good Good Good
## Levels: Bad Good
# Model accuracy:
table(m7_2_pred, test_1$good_bad_21)
##
## m7_2_pred Bad Good
## Bad 30 11
## Good 60 199
#agreement
m7_2_accuracy <- (m7_2_pred == test_1$good_bad_21)
pct(m7_2_accuracy)
Count | Percentage | |
---|---|---|
FALSE | 71 | 23.67 |
TRUE | 229 | 76.33 |
# Compute at the prediction scores
m7_2_score <- predict(m7_2,test_1, type="decision")
m7_2_pred <- prediction(m7_2_score, test_1$good_bad_21)
# Plot ROC curve
m7_2_perf <- performance(m7_2_pred, measure = "tpr", x.measure = "fpr")
plot(m7_2_perf, colorize=TRUE, lwd=2, main="SVM:Plot ROC curve - RBF", col="blue")
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);
lines(x=c(1, 0), y=c(0, 1), col="green", lwd=1, lty=4)
# Plot precision/recall curve
m7_2_perf_precision <- performance(m7_2_pred, measure = "prec", x.measure = "rec")
plot(m7_2_perf_precision, main="m7_2 SVM:Plot precision/recall curve")
# Plot accuracy as function of threshold
m7_2_perf_acc <- performance(m7_2_pred, measure = "acc")
plot(m7_2_perf_acc, main="m7_2 SVM:Plot accuracy as function of threshold")
# Model Performance
#KS &AUC m7_2
m7_2_AUROC <- round(performance(m7_2_pred, measure = "auc")@y.values[[1]]*100, 2)
m7_2_KS <- round(max(attr(m7_2_perf,'y.values')[[1]]-attr(m7_2_perf,'x.values')[[1]])*100, 2)
m7_2_Gini <- (2*m7_2_AUROC - 100)
cat("AUROC: ",m7_2_AUROC,"\tKS: ", m7_2_KS, "\tGini:", m7_2_Gini, "\n")
## AUROC: 72.91 KS: 37.46 Gini: 45.82
#Your results may differ from those shown here due to randomness in the ksvm RBF kernel. If you'd like them to match exactly, use set.seed(12345) prior to running the ksvm() function.
# ROC Comparision
plot(m7_1_perf, col='blue', lty=1, main='SVM:Model Performance Comparision (m7 ROC)')
plot(m7_2_perf, col='green',lty=2, add=TRUE); # simple tree
legend(0.5,0.4,
c("m7_1: SVM vanilladot", "m7_2: SVM RBF kernel"),
col=c('blue', 'green'),
lwd=3);
lines(x=c(0, 1), y=c(0, 1), col="red", lwd=1, lty=3);# random line
# Precision Comparision
plot(m7_1_perf_precision, col='blue', lty=1, main='SVM:Model Performance Comparision (m7 precision/recall)')
plot(m7_2_perf_precision, col='green',lty=2, add=TRUE); # simple tree
legend(0.2,0.85,c("m7_1: SVM vanilladot", "m7_2: SVM RBF kernel"),
col=c('blue', 'green'),lwd=3);
# Plot accuracy as function of threshold
plot(m7_1_perf_acc, col='blue', lty=1, main='SVM:Model accuracy as function of threshold (m7)')
plot(m7_2_perf_acc, col='green',lty=2, add=TRUE); # simple tree
legend(-1,0.5,c("m7_1: SVM vanilladot", "m7_2: SVM RBF kernel"),
col=c('blue', 'green'),lwd=3);
Read more here
Normally doesn’t fit well with low number of observationsFor Neural network we need numeric data to model.
library(nnet)
library(NeuralNetTools)
library(e1071)
# Normalize
train_num$V25 <- as.factor(train_num$V25)
test_num$V25 <- as.factor(test_num$V25)
train_num_norm <- as.data.frame(lapply(train_num[,1:24], normalize ))
test_num_norm <- as.data.frame(lapply(test_num[,1:24], normalize ))
train_num_norm$V25 <- as.factor(ifelse(train_num$V25 == 1, 1, 0))
test_num_norm$V25 <- as.factor(ifelse(test_num$V25 == 1, 1, 0))
# train_num_norm <- as.data.frame(lapply(train_num[,1:24], scale )) # use scale if normal
# test_num_norm <- as.data.frame(lapply(test_num[,1:24], scale )) # use scale if normal
# build the neural network (NN) formula
a <- colnames(train_num[,1:24])
mformula <- as.formula(paste('V25 ~ ' , paste(a,collapse='+')))
set.seed(1234567890)
train_nn <- train_num_norm
test_nn <- test_num_norm
# Modeling
m8 <- nnet(V25~., data=train_nn,size=20,maxit=10000,decay=.001, linout=F, trace = F)
table(test_nn$V25,predict(m8,newdata=test_nn, type="class"))
##
## 0 1
## 0 51 45
## 1 47 157
m8_pred <- prediction(predict(m8, newdata=test_nn, type="raw"),test_nn$V25)
m8_perf <- performance(m8_pred,"tpr","fpr")
# Model Performance
#ROC
plot(m8_perf,lwd=2, colerize=TRUE, main="m8:ROC - Neural Network")
abline(a=0,b=1)
# Residula plots
plot(m8$residuals)
#KS, Gini & AUC m8
m8_AUROC <- round(performance(m8_pred, measure = "auc")@y.values[[1]]*100, 2)
m8_KS <- round(max(attr(m8_perf,'y.values')[[1]] - attr(m8_perf,'x.values')[[1]])*100, 2)
m8_Gini <- (2*m8_AUROC - 100)
cat("AUROC: ",m8_AUROC,"\tKS: ", m8_KS, "\tGini:", m8_Gini, "\n")
## AUROC: 74.68 KS: 43.63 Gini: 49.36
# get the weights and structure in the right format
wts <- neuralweights(m8)
struct <- wts$struct
wts <- unlist(wts$wts)
plotnet(m8, struct=struct)
mat_in <- train_nn[,1:24]
grps <- apply(mat_in, 2, quantile, seq(0, 1, by = 0.2))
pred_val <- as.data.frame(pred_sens(mat_in, m8, 'V1', 100, grps, 'V25'))
head(pred_val, 10)
## X1.V25 X1.x_vars X2.V25 X2.x_vars X3.V25 X3.x_vars X4.V25 X4.x_vars X5.V25 X5.x_vars X6.V25 X6.x_vars
## 1 0.002297193 0.00000000 0.9741025 0.00000000 0.4190216 0.00000000 0.01161606 0.00000000 0.9947616 0.00000000 9.160677e-05 0.00000000
## 2 0.002559073 0.01010101 0.9775689 0.01010101 0.4209706 0.01010101 0.01183499 0.01010101 0.9966693 0.01010101 8.620528e-05 0.01010101
## 3 0.002850712 0.02020202 0.9805372 0.02020202 0.4233656 0.02020202 0.01215656 0.02020202 0.9978792 0.02020202 8.093745e-05 0.02020202
## 4 0.003174258 0.03030303 0.9830828 0.03030303 0.4262346 0.03030303 0.01258852 0.03030303 0.9986463 0.03030303 7.583231e-05 0.03030303
## 5 0.003531720 0.04040404 0.9852694 0.04040404 0.4296101 0.04040404 0.01314112 0.04040404 0.9991330 0.04040404 7.091367e-05 0.04040404
## 6 0.003924903 0.05050505 0.9871508 0.05050505 0.4335292 0.05050505 0.01382749 0.05050505 0.9994423 0.05050505 6.620039e-05 0.05050505
## 7 0.004355331 0.06060606 0.9887728 0.06060606 0.4380327 0.06060606 0.01466404 0.06060606 0.9996395 0.06060606 6.170668e-05 0.06060606
## 8 0.004824171 0.07070707 0.9901738 0.07070707 0.4431650 0.07070707 0.01567102 0.07070707 0.9997657 0.07070707 5.744250e-05 0.07070707
## 9 0.005332161 0.08080808 0.9913864 0.08080808 0.4489727 0.08080808 0.01687311 0.08080808 0.9998467 0.08080808 5.341395e-05 0.08080808
## 10 0.005879545 0.09090909 0.9924381 0.09090909 0.4555042 0.09090909 0.01830025 0.09090909 0.9998990 0.09090909 4.962374e-05 0.09090909
#------------------------------------------------------------------------------
In statistics and machine learning, lasso (least absolute shrinkage and selection operator) (also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. Source:wiki
We can also explore Ridge Regression or “non-Negative Garrote”.
Read more here.
library(lars)
## Loaded lars 1.2
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
mat1 <- model.matrix(V25 ~ . , data = train_num ) # convert to numeric matrix
mat2 <- model.matrix(V25 ~ . , data = test_num ) # convert to numeric matrix
#library(lars)
# Model using lars package
m9_1<- lars(x= mat1, y= as.numeric(train_num$V25), trace = T)
## LASSO sequence
## LARS Step 0 : 1 Variables with Variance < eps; dropped for good
## Computing X'X .....
## LARS Step 1 : Variable 2 added
## LARS Step 2 : Variable 3 added
## LARS Step 3 : Variable 4 added
## LARS Step 4 : Variable 6 added
## LARS Step 5 : Variable 5 added
## LARS Step 6 : Variable 17 added
## LARS Step 7 : Variable 18 added
## LARS Step 8 : Variable 11 added
## LARS Step 9 : Variable 20 added
## LARS Step 10 : Variable 16 added
## LARS Step 11 : Variable 12 added
## LARS Step 12 : Variable 10 added
## LARS Step 13 : Variable 21 added
## LARS Step 14 : Variable 7 added
## LARS Step 15 : Variable 8 added
## LARS Step 16 : Variable 23 added
## LARS Step 17 : Variable 13 added
## LARS Step 18 : Variable 19 added
## LARS Step 19 : Variable 24 added
## LARS Step 20 : Variable 15 added
## LARS Step 21 : Variable 22 added
## LARS Step 22 : Variable 14 added
## LARS Step 23 : Variable 9 added
## LARS Step 24 : Variable 25 added
## Computing residuals, RSS etc .....
# library(glmnet)
#lm9_2<- glmnet(x= mat1, y= train_num$V25, family="binomial")
# Summary
summary(m9_1)
## LARS/LASSO
## Call: lars(x = mat1, y = as.numeric(train_num$V25), trace = T)
## Df Rss Cp
## 0 1 144.55 196.361
## 1 2 133.83 132.051
## 2 3 132.68 126.906
## 3 4 124.07 75.656
## 4 5 122.59 68.488
## 5 6 122.06 67.227
## 6 7 121.04 62.897
## 7 8 120.98 64.510
## 8 9 120.56 63.940
## 9 10 119.41 58.813
## 10 11 118.82 57.183
## 11 12 118.63 57.969
## 12 13 118.20 57.359
## 13 14 117.76 56.633
## 14 15 116.52 50.944
## 15 16 113.72 35.642
## 16 17 112.42 29.576
## 17 18 112.20 30.226
## 18 19 110.04 18.835
## 19 20 109.86 19.720
## 20 21 109.41 18.965
## 21 22 109.24 19.910
## 22 23 109.14 21.290
## 23 24 109.10 23.013
## 24 25 109.09 25.000
# Plots
plot(m9_1, plottype="coefficients")
plot(m9_1, plottype="Cp")
# Predict
m9_pred <- predict.lars(m9_1, mat2, type="fit")
las1.mspe <- mean( (test_num$V25 - m9_pred$fit)^2 )
cat("Error:", round(las1.mspe,2)*100, "%\n")
## Error: NA %
### extract the coefficient vector with L1 norm=4.1
coef4.1 <- predict(m9_1, s=4.1, type="coef", mode="norm")
coef4.1
## $s
## [1] 4.1
##
## $fraction
## [1] 0.2021086
##
## $mode
## [1] "norm"
##
## $coefficients
## (Intercept) V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 0.000000000 -0.074266012 0.002953664 -0.024986779 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## V15 V16 V17 V18 V19 V20 V21 V22 V23 V24
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
# By using cross-validation with the lasso, a good (hopefully near-optimal) value for
# the "fraction" can be determined.
cvlas <- cv.lars(mat1,y= as.numeric(train_num$V25), type="lasso")
# frac <- cvlas$fraction[which.min(cvlas$cv)]
# frac
# las.coef <- predict.lars(m9_1, type="coefficients", mode="fraction", s=frac)
# las.coef
# List of model supported by `caret` package
# names(getModelInfo())
#-----------------------------------------------------------------------------
# Alternative method - using glmnet package
# library(glmnet)
# For help type >?cv.glmnet
# Ref: http://www.moseslab.csb.utoronto.ca/alan/glmnet_presentation.pdf
m9_2<-cv.glmnet(mat1,as.numeric(train_num$V25), alpha=1, family="binomial", type.measure = 'auc')
summary(m9_2)
## Length Class Mode
## lambda 61 -none- numeric
## cvm 61 -none- numeric
## cvsd 61 -none- numeric
## cvup 61 -none- numeric
## cvlo 61 -none- numeric
## nzero 61 -none- numeric
## name 1 -none- character
## glmnet.fit 13 lognet list
## lambda.min 1 -none- numeric
## lambda.1se 1 -none- numeric
#m9_2_pred <- predict(m9_2, mat2)
#cor(m9_2_pred, as.numeric(test_num$V25))
#plot(m9_2_pred, test_num$V25, xlab="Predicted", ylab="Observed", main="Lasso-Compare Predicted to Observed")
#mydata<-as.data.frame(cbind(m9_2_pred -.5, test_num$V25))
#ggplot(mydata, aes(m9_2_pred,as.numeric(test_num$V25)))+geom_point()
# Apply model to testing dataset
test_num$lasso.prob <- predict(m9_2,type="response", newx =mat2, s = 'lambda.min')
m9_2_pred <- prediction(test_num$lasso.prob,test_num$V25)
# calculate probabilities for TPR/FPR for predictions
m9_2_perf <- performance(m9_2_pred,"tpr","fpr")
#performance(m9_2_pred,"auc") # shows calculated AUC for model
plot(m9_2_perf,colorize=T, col="blue", main="m9-ROC: Lasso Regression") # plot ROC curve
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
# Plot precision/recall curve
m9_2_perf_precision <- performance(m9_2_pred, measure = "prec", x.measure = "rec")
plot(m9_2_perf_precision, main="m9_2 LASSO:Precision/recall curve")
# Plot accuracy as function of threshold
m9_2_perf_acc <- performance(m9_2_pred, measure = "acc")
plot(m9_2_perf_acc, main="m3_2 Lasso:Accuracy as function of threshold")
#KS & AUC m1
m9_KS <- round(max(attr(m9_2_perf,'y.values')[[1]]-attr(m9_2_perf,'x.values')[[1]])*100, 2)
m9_AUROC <- round(performance(m9_2_pred, measure = "auc")@y.values[[1]]*100, 2)
m9_Gini <- (2*m9_AUROC - 100)
cat("AUROC: ",m9_AUROC,"\tKS: ", m9_KS, "\tGini:", m9_Gini, "\n")
## AUROC: 82.04 KS: 52.7 Gini: 64.08
To select best models out of set of experimental models we may compare the ROCs, Area Uder the Receiver Operating Characteristic Curve (AUROC/AUC), KS, Gini etc.
The ROC curve is a fundamental tool for diagnostic test evaluation. In a ROC curve the true positive rate (Sensitivity) is plotted in function of the false positive rate (100-Specificity) for different cut-off points of a parameter. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups (diseased/normal).
In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity. Source:wiki
ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.
Read more here.
#Compare ROC Performance of Models
plot(m1_perf, col='blue', lty=1, main='ROCs: Model Performance Comparision') # logistic regression
plot(m2_perf, col='gold',lty=2, add=TRUE); # simple tree
plot(m2_1_perf, col='dark orange',lty=3, add=TRUE); #tree with 90/10 prior
plot(m3_perf, col='green',add=TRUE,lty=4); # random forest
plot(m4_perf, col='dark gray',add=TRUE,lty=5); # Conditional Inference Tree
plot(m3_2_perf, col='dark green',add=TRUE,lty=6); # Improved logistic regression using random forest
plot(m7_2_perf, col='black',add=TRUE,lty=7); # Support Vector Machine (SVM)
plot(m8_perf, col='red',add=TRUE,lty=8); # Neural Network
plot(m9_2_perf, col='brown',add=TRUE,lty=9); # Neural Network
legend(0.6,0.5,
c('m1:logistic reg','m2:Recursive Partitioning','m2_1:Recursive Partitioning - Bayesian',
'm3:random forest', "m4:condtn inference tree", "m3_2:Improved Logistic", "m7_2:SVM",
"m8:Neural Net", "m9:Lasso"),
col=c('blue','gold', 'orange','green', 'dark gray', 'dark green', "black", "red","brown"),
lwd=3);
lines(c(0,1),c(0,1),col = "gray", lty = 4 ) # random line
This is the most common measure of accuracy for machine learning techniques for binary classification problems. It is best to use as AUROC or AUROCC than using AUC as ‘AUC’ can be area under any curve and we are interested in only “Area under ROC curve”.
The best possible test (100% sensitive and 100% specific) would have an area under the curve of 1.0 (100%) since the entire graph would fall beneath the curve. While it is possible that the “perfectly bad test” would have an area of 0.0, in reality we would just redefine what is positive and negative and it would become a perfect test.
AUROCC Interpretation:
AUROCC | Interpretation | Remarks |
---|---|---|
1.0 (100%) | Perfect model | Possiblility of overfit |
0.9-0.99 (90-99%) | Excellent Model | Possiblility of overfit |
0.8-0.89 (80-89%) | Very Good Model | Happy |
0.7-0.79 (70-79%) | Fair Model | Very common |
0.51-0.69(51-69%) | Poor Model | Need to revisit Modeling approach |
<0.5(50%) | Worthless Model | May need more data or worth look for unsupervised learning |
Read more on AUROCC here.
The Kolmogorov-Smirnov test (KS-test) tries to determine if two classes differ significantly. In our case we are trying to measure whether the model is able to classify “good” class from “bad” class very well. If they are completely separated (i.e. AUROCC=100%) then the value of KS will be 1 (100%) and if they are same then the value of KS will be 0 (0%). This means higher the value is better classification power. In simple term KS = Maximum distance between distributions of two classes
.
The Kolmogorov-Smirnov test (Chakravart, Laha, and Roy, 1967) is used to decide if a sample comes from a population with a specific distribution.
Read more on KS here
Gini is another commonly used measure to evaluate goodness of fit for binary classification problems. Which has a straight relationship with AUROCC (Gini = 2*AUROCC - 1).
Read more on Gini here
# Performance Table
models <- c('m1:Logistic regression', 'm1:Logistic regression - selected vars',
'm2:Recursive partitioning','m2_1:Recursive partitioning-Bayesian 70/30% prior',
'm3:Random forest', "m3_2: Improved Logistic",
"m4:Conditional inference tree", "m6:Model Based Tree",
"m7_1:SVM (vanilla)", "m7_2:SVM (RBF)", "m8:Neural net",
"m9: Lasso")
# AUCs
models_AUC <- c(m1_AUROC, m1_1_AUROC, m2_AUROC, m2_1_AUROC,
m3_AUROC,m3_2_AUROC, m4_AUROC, m6_AUROC,
m7_1_AUROC, m7_2_AUROC, m8_AUROC, m9_AUROC)
# KS
models_KS <- c(m1_KS, m1_1_KS, m2_KS, m2_1_KS,
m3_KS,m3_2_KS,m4_KS, m6_KS,
m7_1_KS, m7_2_KS, m8_KS, m9_KS )
# Gini
models_Gini <- c(m1_Gini, m1_1_Gini, m2_Gini, m2_1_Gini,
m3_Gini, m3_2_Gini, m4_Gini, m6_Gini,
m7_1_Gini, m7_2_Gini, m8_Gini, m9_Gini )
# Combine AUC and KS
model_performance_metric <- as.data.frame(cbind(models, models_AUC, models_KS, models_Gini))
# Colnames
colnames(model_performance_metric) <- c("Model", "AUC", "KS", "Gini")
# Display Performance Reports
kable(model_performance_metric, caption ="Comparision of Model Performances")
Model | AUC | KS | Gini |
---|---|---|---|
m1:Logistic regression | 76.87 | 43.17 | 53.74 |
m1:Logistic regression - selected vars | 77.27 | 46.19 | 54.54 |
m2:Recursive partitioning | 73.65 | 41.9 | 47.3 |
m2_1:Recursive partitioning-Bayesian 70/30% prior | 68.22 | 33.17 | 36.44 |
m3:Random forest | 75.77 | 41.27 | 51.54 |
m3_2: Improved Logistic | 69.12 | 34.13 | 38.24 |
m4:Conditional inference tree | 71.63 | 40.48 | 43.26 |
m6:Model Based Tree | 71.45 | 36.48 | 42.9 |
m7_1:SVM (vanilla) | 74.46 | 39.21 | 48.92 |
m7_2:SVM (RBF) | 72.91 | 37.46 | 45.82 |
m8:Neural net | 74.68 | 43.63 | 49.36 |
m9: Lasso | 82.04 | 52.7 | 64.08 |
From the comparative study we found that the Support vector machine
, random forest
and lasso
performing better than other methods with given set-up. Depending on the problem definitions, data availability and given set-up one of the machine learning techniques would be best fit.
library(lattice)
The lattice add-on of Trellis graphics for R. Ref:https://cran.r-project.org/web/packages/lattice/lattice.pdf
library(knitr)
For Dynamic Report Generation in R. Ref: https://cran.r-project.org/web/packages/knitr/knitr.pdf
library(gplots)
Various R Programming Tools for Plotting Data. Ref:https://cran.r-project.org/web/packages/gplots/gplots.pdf
library(ggplot2)
An Implementation of the Grammar of Graphics. Ref: https://cran.rstudio.com/web/packages/ggplot2/ggplot2.pdf
library(ClustOfVar)
Clustering of variables. Ref: https://cran.r-project.org/web/packages/ClustOfVar/ClustOfVar.pdf
library(ape)
Analyses of Phylogenetics and Evolution (as.phylo). Ref: https://cran.r-project.org/web/packages/ape/ape.pdf
library(Information)
Data Exploration with Information Theory (Weight-of-Evidence and Information Value). Ref: https://cran.r-project.org/web/packages/Information/Information.pdf
library(ROCR)
Visualizing the Performance of Scoring Classifiers. Ref: https://cran.r-project.org/web/packages/ROCR/ROCR.pdf
library(caret)
Classification and Regression Training - for any machine learning algorithms. Ref: ftp://cran.r-project.org/pub/R/web/packages/caret/caret.pdf
library(rpart)
Recursive partitioning for classification, regression and survival trees. Ref: https://cran.r-project.org/web/packages/rpart/rpart.pdf
library(rpart.utils)
Tools for parsing and manipulating rpart objects, including generating machine readable rules. Ref: https://cran.r-project.org/web/packages/rpart.utils/rpart.utils.pdf
library(rpart.plot)
Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’. Ref: https://cran.r-project.org/web/packages/knitr/knitr.pdf
library(randomForest)
Leo Breiman and Cutler’s Random Forests for Classification and Regression. Ref: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
library(party)
A computational toolbox for recursive partitioning - Conditional inference Trees. Ref: https://cran.r-project.org/web/packages/party/party.pdf
library(bnlearn)
Bayesian Network Structure Learning, Parameter Learning and Inference. Ref: https://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf
library(DAAG)
Data Analysis and Graphics Data and Functions. Ref: https://cran.r-project.org/web/packages/DAAG/DAAG.pdf
library(vcd)
Visualizing Categorical Data. Ref: https://cran.r-project.org/web/packages/vcd/vcd.pdf
library(neuralnet)
Neural Network implementation. Ref: https://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf
library(kernlab)
Kernel-Based Machine Learning Lab. Ref: https://cran.r-project.org/web/packages/kernlab/kernlab.pdf
library(glmnet)
Lasso and Elastic-Net Regularized Generalized Linear Models. Ref: https://cran.r-project.org/web/packages/glmnet/glmnet.pdf
library(lars)
Least Angle Regression, Lasso and Forward Stagewise. Ref: ftp://cran.r-project.org/pub/R/web/packages/lars/lars.pdf
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
Note that the echo = FALSE
parameter was added to the code chunk to prevent printing of the R code that generated the plot.