I’ve used a data set from Kaggle: the Home Equity data set (HMEQ) It contains loan performance information for 5.960 recent home equity loans. The target variable (BAD) is a binary variable indicating whether an applicant eventually defaulted and for each applicant are reported 12 input variables. The task is to predict clients who default on their loans.
Below the variables description
-BAD: 1 = client defaulted on loan 0 = loan repaid
-LOAN: Amount of the loan request
-MORTDUE: Amount due on existing mortgage
-VALUE: Value of current property
-REASON: DebtCon = debt consolidation; HomeImp = home improvement
-JOB: Six occupational categories
-YOJ: Years at present job
-DEROG: Number of major derogatory reports
-DELINQ: Number of delinquent credit lines
-CLAGE: Age of oldest trade line in months
-NINQ: Number of recent credit lines
-CLNO: Number of credit lines
-DEBTINC: Debt-to-income ratio
suppressWarnings({library(ggplot2)})
suppressWarnings({library(tidyverse)})
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## v purrr 0.3.3
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
suppressWarnings({library(caret)})
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
suppressWarnings({library(corrplot)})
## corrplot 0.84 loaded
suppressWarnings({library(gridExtra)})
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
suppressWarnings({library(MLmetrics)})
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
suppressWarnings({library(data.table)})
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
suppressWarnings({library(h2o)})
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:data.table':
##
## hour, month, week, year
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o.init(nthreads = -1, #Number of threads -1 means use all cores on your machine
max_mem_size = "32G")
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\user\AppData\Local\Temp\Rtmpa437WQ/h2o_user_started_from_r.out
## C:\Users\user\AppData\Local\Temp\Rtmpa437WQ/h2o_user_started_from_r.err
##
##
## Starting H2O JVM and connecting: Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 4 seconds 32 milliseconds
## H2O cluster timezone: Europe/Rome
## H2O data parsing timezone: UTC
## H2O cluster version: 3.28.0.2
## H2O cluster version age: 11 days
## H2O cluster name: H2O_started_from_R_user_rsf111
## H2O cluster total nodes: 1
## H2O cluster total memory: 32.00 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.6.1 (2019-07-05)
path <- "C:/Users/user/Documents/eRUM2020/hmeq.csv"
df = read.csv(path)
# Dimensions of data set
dim(df)
## [1] 5960 13
# List types for each attribute
sapply(df, class)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG
## "integer" "integer" "numeric" "numeric" "factor" "factor" "numeric" "integer"
## DELINQ CLAGE NINQ CLNO DEBTINC
## "integer" "numeric" "integer" "integer" "numeric"
# Take a peek at the first rows of the data set
head(df,5)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO
## 1 1 1100 25860 39025 HomeImp Other 10.5 0 0 94.36667 1 9
## 2 1 1300 70053 68400 HomeImp Other 7.0 0 2 121.83333 0 14
## 3 1 1500 13500 16700 HomeImp Other 4.0 0 0 149.46667 1 10
## 4 1 1500 NA NA NA NA NA NA NA NA
## 5 0 1700 97800 112000 HomeImp Office 3.0 0 0 93.33333 0 14
## DEBTINC
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
# Summarize attribute distributions
summary(df)
## BAD LOAN MORTDUE VALUE
## Min. :0.0000 Min. : 1100 Min. : 2063 Min. : 8000
## 1st Qu.:0.0000 1st Qu.:11100 1st Qu.: 46276 1st Qu.: 66076
## Median :0.0000 Median :16300 Median : 65019 Median : 89236
## Mean :0.1995 Mean :18608 Mean : 73761 Mean :101776
## 3rd Qu.:0.0000 3rd Qu.:23300 3rd Qu.: 91488 3rd Qu.:119824
## Max. :1.0000 Max. :89900 Max. :399550 Max. :855909
## NA's :518 NA's :112
## REASON JOB YOJ DEROG
## : 252 : 279 Min. : 0.000 Min. : 0.0000
## DebtCon:3928 Mgr : 767 1st Qu.: 3.000 1st Qu.: 0.0000
## HomeImp:1780 Office : 948 Median : 7.000 Median : 0.0000
## Other :2388 Mean : 8.922 Mean : 0.2546
## ProfExe:1276 3rd Qu.:13.000 3rd Qu.: 0.0000
## Sales : 109 Max. :41.000 Max. :10.0000
## Self : 193 NA's :515 NA's :708
## DELINQ CLAGE NINQ CLNO
## Min. : 0.0000 Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 0.0000 1st Qu.: 115.1 1st Qu.: 0.000 1st Qu.:15.0
## Median : 0.0000 Median : 173.5 Median : 1.000 Median :20.0
## Mean : 0.4494 Mean : 179.8 Mean : 1.186 Mean :21.3
## 3rd Qu.: 0.0000 3rd Qu.: 231.6 3rd Qu.: 2.000 3rd Qu.:26.0
## Max. :15.0000 Max. :1168.2 Max. :17.000 Max. :71.0
## NA's :580 NA's :308 NA's :510 NA's :222
## DEBTINC
## Min. : 0.5245
## 1st Qu.: 29.1400
## Median : 34.8183
## Mean : 33.7799
## 3rd Qu.: 39.0031
## Max. :203.3121
## NA's :1267
# Summarize data structure
str(df)
## 'data.frame': 5960 obs. of 13 variables:
## $ BAD : int 1 1 1 1 0 1 1 1 1 1 ...
## $ LOAN : int 1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
## $ MORTDUE: num 25860 70053 13500 NA 97800 ...
## $ VALUE : num 39025 68400 16700 NA 112000 ...
## $ REASON : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
## $ JOB : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
## $ YOJ : num 10.5 7 4 NA 3 9 5 11 3 16 ...
## $ DEROG : int 0 0 0 NA 0 0 3 0 0 0 ...
## $ DELINQ : int 0 2 0 NA 0 0 2 0 2 0 ...
## $ CLAGE : num 94.4 121.8 149.5 NA 93.3 ...
## $ NINQ : int 1 0 1 NA 0 1 1 0 1 0 ...
## $ CLNO : int 9 14 10 NA 14 8 17 8 12 13 ...
## $ DEBTINC: num NA NA NA NA NA ...
BAD <- df$BAD <- as.factor(df$BAD)
df$LOAN <- as.numeric(df$LOAN)
df$DEROG <- as.factor(df$DEROG)
df$DELINQ <- as.factor(df$DELINQ)
df$NINQ <- as.factor(df$NINQ)
df$CLNO <- as.factor(df$CLNO)
df$JOB[df$JOB == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$JOB == "", value = structure(c(4L, 4L, :
## invalid factor level, NA generated
df$REASON[df$REASON == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$REASON == "", value = structure(c(3L, :
## invalid factor level, NA generated
My approach: I’ve filled up missing values with the median for numerical variables and the most common level for the categorical ones. Also, I’ve created boolean features with 1 (true-missing value) or 0 (false-actual value) for each one with missing values. “Pawel Grabinski”
mi_summary <- function(data_frame){
mi_summary<-c()
for (col in colnames(data_frame)){
mi_summary <- c(mi_summary,mean(is.na(data_frame[,col])*100))
}
mi_summary_new <- mi_summary[mi_summary>0]
mi_summary_cols <- colnames(data_frame)[mi_summary>0]
mi_summary <- data.frame('col_name' = mi_summary_cols, 'perc_missing' = mi_summary_new)
mi_summary <- mi_summary[order(mi_summary[,2], decreasing = TRUE), ]
mi_summary[,2] <- round(mi_summary[,2],6)
rownames(mi_summary) <- NULL
return(mi_summary)
}
missing_summary <- mi_summary(df)
missing_summary
## col_name perc_missing
## 1 DEBTINC 21.258389
## 2 DEROG 11.879195
## 3 DELINQ 9.731544
## 4 MORTDUE 8.691275
## 5 YOJ 8.640940
## 6 NINQ 8.557047
## 7 CLAGE 5.167785
## 8 JOB 4.681208
## 9 REASON 4.228188
## 10 CLNO 3.724832
## 11 VALUE 1.879195
df <- df %>%
mutate(DEBTINC_NA = ifelse(is.na(DEBTINC),1,0)) %>%
mutate(DEROG_NA = ifelse(is.na(DEROG),1,0)) %>%
mutate(DELINQ_NA = ifelse(is.na(DELINQ),1,0)) %>%
mutate(MORTDUE_NA = ifelse(is.na(MORTDUE),1,0)) %>%
mutate(YOJ_NA = ifelse(is.na(YOJ),1,0)) %>%
mutate(NINQ_NA = ifelse(is.na(NINQ),1,0)) %>%
mutate(CLAGE_NA = ifelse(is.na(CLAGE),1,0)) %>%
mutate(CLNO_NA = ifelse(is.na(CLNO),1,0)) %>%
mutate(VALUE_NA = ifelse(is.na(VALUE),1,0)) %>%
mutate(JOB_NA = ifelse(is.na(JOB),1,0)) %>%
mutate(REASON_NA = ifelse(is.na(REASON),1,0))
for (col in missing_summary$col_name){
if (class(df[,col]) == 'factor'){
unique_levels <- unique(df[,col])
df[is.na(df[,col]), col] <- unique_levels[which.max(tabulate(match(df[,col], unique_levels)))]
} else {
df[is.na(df[,col]),col] <- median(as.numeric(df[,col]), na.rm = TRUE)
}
}
pMiss <- function(x){sum(is.na(x))/length(x)*100}
pMiss <- apply(df,2,pMiss)
pMiss <- pMiss[pMiss > 0]
pMiss <- pMiss[order(pMiss, decreasing=T)]
pMiss
## named numeric(0)
df$DEROG_NA <- as.factor(df$DEROG_NA)
df$DEBTINC_NA <- as.factor(df$DEBTINC_NA)
df$DELINQ_NA <- as.factor(df$DELINQ_NA)
df$MORTDUE_NA <- as.factor(df$MORTDUE_NA)
df$YOJ_NA <- as.factor(df$YOJ_NA)
df$NINQ_NA <- as.factor(df$NINQ_NA)
df$CLAGE_NA <- as.factor(df$CLAGE_NA)
df$CLNO_NA <- as.factor(df$CLNO_NA)
df$VALUE_NA <- as.factor(df$VALUE_NA)
df$JOB_NA <- as.factor(df$JOB_NA)
df$REASON_NA <- as.factor(df$REASON_NA)
df$JOB <- factor(df$JOB, labels=c('Mgr','Office','Other','ProfExe','Sales','Self'))
df$REASON <- factor(df$REASON, labels=c('DebtCon','HomeImp'))
I’ve grouped the data set into numerical, categorical and boolean variables for the Exploratory Data Analysis
cat <- df[,sapply(df, is.factor)] %>%
select_if(~nlevels(.) <=15 ) %>%
select(-BAD)
bol <- df[,c('DEBTINC_NA','DEROG_NA','DELINQ_NA','MORTDUE_NA','YOJ_NA','NINQ_NA','CLAGE_NA','CLNO_NA','VALUE_NA','JOB_NA','REASON_NA')]
num <- df[,sapply(df, is.numeric)]
The target variable is grouped into two classes: “Loan defaulted (1)” and “Loan repaid (0)”. Looking at the barplot, it’s quite unbalanced.
cbind(freq=table(df$BAD), percentage=prop.table(table(df$BAD))*100)
## freq percentage
## 0 4771 80.05034
## 1 1189 19.94966
ggplot(df, aes(BAD, fill=BAD)) + geom_bar() +
scale_fill_brewer(palette = "Set1") +
ggtitle("Distribution of Target variable")
I’ve grouped all categorical features into a new subset: I’ve done a graphical analysis using barplots and I’ve counted the frequency for each class. For a bivariate analysis I’ve used a Chi-Square Test to evaluate the relationship between the target variable and each categorical feature. This number tells how much difference exists between observed counts and the counts would be expect if there were no relationship at all in the population.
cat <- cat[,c('DELINQ','REASON','JOB','DEROG')]
for(i in 1:length(cat)) {
counts <- table(cat[,i])
name <- names(cat)[i]
barplot(counts, main=name, col=c("blue","red","green","orange","purple"))
}
The Chi-Square Test is used as feature selection testing the null hypothesis of independence between target variable and categorical features. The goal is to test that two classifications are independent or not. Two classifications are independent if the distribution of one, with respect to a classification principle, is not influenced by the other one. If the null hypothesis is not rejected the two classifications are independent (P-Value>0.05) and the feature can be dropped.
par(mfrow=c(2,2))
for(i in 1:length(cat)){
freq=table(cat[,i])
percentage=prop.table(table(cat[,i]))*100
freq_cat_outcome=table(BAD,cat[,i])
name <- names(cat)[i]
cat(sep="\n")
cat(paste("Distribution of", name), sep="\n")
print(cbind(freq,percentage))
cat(sep="\n")
cat(paste("Distribution by Target variable and", name), sep="\n")
print(freq_cat_outcome)
cat(sep="\n")
cat(paste("Chi-squared test by Target variable and", name), sep="\n")
suppressWarnings({print(chisq.test(table(BAD,cat[,i])))})
}
##
## Distribution of DELINQ
## freq percentage
## 0 4759 79.84899329
## 1 654 10.97315436
## 2 250 4.19463087
## 3 129 2.16442953
## 4 78 1.30872483
## 5 38 0.63758389
## 6 27 0.45302013
## 7 13 0.21812081
## 8 5 0.08389262
## 10 2 0.03355705
## 11 2 0.03355705
## 12 1 0.01677852
## 13 1 0.01677852
## 15 1 0.01677852
##
## Distribution by Target variable and DELINQ
##
## BAD 0 1 2 3 4 5 6 7 8 10 11 12 13 15
## 0 4104 432 138 58 32 7 0 0 0 0 0 0 0 0
## 1 655 222 112 71 46 31 27 13 5 2 2 1 1 1
##
## Chi-squared test by Target variable and DELINQ
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 763.8, df = 13, p-value < 2.2e-16
##
##
## Distribution of REASON
## freq percentage
## DebtCon 4180 70.13423
## HomeImp 1780 29.86577
##
## Distribution by Target variable and REASON
##
## BAD DebtCon HomeImp
## 0 3387 1384
## 1 793 396
##
## Chi-squared test by Target variable and REASON
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(BAD, cat[, i])
## X-squared = 8.1852, df = 1, p-value = 0.004223
##
##
## Distribution of JOB
## freq percentage
## Mgr 767 12.869128
## Office 948 15.906040
## Other 2667 44.748322
## ProfExe 1276 21.409396
## Sales 109 1.828859
## Self 193 3.238255
##
## Distribution by Target variable and JOB
##
## BAD Mgr Office Other ProfExe Sales Self
## 0 588 823 2090 1064 71 135
## 1 179 125 577 212 38 58
##
## Chi-squared test by Target variable and JOB
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 73.815, df = 5, p-value = 1.644e-14
##
##
## Distribution of DEROG
## freq percentage
## 0 5235 87.83557047
## 1 435 7.29865772
## 2 160 2.68456376
## 3 58 0.97315436
## 4 23 0.38590604
## 5 15 0.25167785
## 6 15 0.25167785
## 7 8 0.13422819
## 8 6 0.10067114
## 9 3 0.05033557
## 10 2 0.03355705
##
## Distribution by Target variable and DEROG
##
## BAD 0 1 2 3 4 5 6 7 8 9 10
## 0 4394 266 78 15 5 8 5 0 0 0 0
## 1 841 169 82 43 18 7 10 8 6 3 2
##
## Chi-squared test by Target variable and DEROG
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 503.99, df = 10, p-value < 2.2e-16
pl1 <- cat %>%
ggplot(aes(x=BAD, y=DELINQ, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and DELINQ")
pl2 <- cat %>%
ggplot(aes(x=BAD, y=REASON, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and REASON")
pl3 <- cat %>%
ggplot(aes(x=BAD, y=JOB, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and JOB")
pl4 <- cat %>%
ggplot(aes(x=BAD, y=DEROG, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and DEROG")
par(mfrow=c(2,2))
grid.arrange(pl1,pl2,pl3,pl4, ncol=2)
I’ve transformed categorical features into numerical variables with one-hot encoding methodology to afford a better understanding of variables by machine learning models.
dmy <- dummyVars("~.", data = cat,fullRank = F)
cat_num <- data.frame(predict(dmy, newdata = cat))
I’ve grouped all numerical features into a new subset: I’ve done a graphical analysis using histograms, density plots and boxplots; also I’ve performed metrics such as standard deviation, skewness and kurtosis for each variable. For a bivariate analysis I’ve used the Anova Test to evaluate the relationship between the target variable and each numerical feature. This test assesses whether the average of more than two groups are statistically different.
par(mfrow=c(2,3))
for(i in 1:length(num)) {
hist(num[,i], main=names(num)[i], col='blue')
}
par(mfrow=c(2,3))
for(i in 1:length(num)) {
boxplot(num[,i], main=names(num)[i], col='orange')
}
par(mfrow=c(2,3))
for(i in 1:length(num)){
plot(density(num[,i]), main=names(num)[i], col='red')
}
for(i in 1:length(num)){
name <- names(num)[i]
cat(paste("Distribution of", name), sep="\n")
#cat(names(num)[i],sep = "\n")
print(summary(num[,i]))
cat(sep="\n")
stand.deviation = sd(num[,i])
variance = var(num[,i])
skewness = mean((num[,i] - mean(num[,i]))^3/sd(num[,i])^3)
kurtosis = mean((num[,i] - mean(num[,i]))^4/sd(num[,i])^4) - 3
outlier_values <- sum(table(boxplot.stats(num[,i])$out))
cat(paste("Statistical analysis of", name), sep="\n")
print(cbind(stand.deviation, variance, skewness, kurtosis, outlier_values))
cat(sep="\n")
cat(paste("anova_test between BAD and", name),sep = "\n")
print(summary(aov(as.numeric(BAD)~num[,i], data=num)))
cat(sep="\n")
}
## Distribution of LOAN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1100 11100 16300 18608 23300 89900
##
## Statistical analysis of LOAN
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 11207.48 125607617 2.022762 6.922438 256
##
## anova_test between BAD and LOAN
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 5.4 5.368 33.79 6.45e-09 ***
## Residuals 5958 946.4 0.159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of MORTDUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2063 48139 65019 73001 88200 399550
##
## Statistical analysis of MORTDUE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 42552.73 1810734556 1.941153 7.440693 308
##
## anova_test between BAD and MORTDUE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 2.0 2.0303 12.74 0.000361 ***
## Residuals 5958 949.8 0.1594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of VALUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8000 66490 89236 101540 119005 855909
##
## Statistical analysis of VALUE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 56869.44 3234132829 3.088963 24.85644 347
##
## anova_test between BAD and VALUE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 1.3 1.2675 7.945 0.00484 **
## Residuals 5958 950.5 0.1595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of YOJ
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 7.000 8.756 12.000 41.000
##
## Statistical analysis of YOJ
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 7.259424 52.69923 1.092072 0.744703 211
##
## anova_test between BAD and YOJ
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 2.8 2.7710 17.4 3.08e-05 ***
## Residuals 5958 949.0 0.1593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of CLAGE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 117.4 173.5 179.4 227.1 1168.2
##
## Statistical analysis of CLAGE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 83.5747 6984.73 1.389903 8.180556 66
##
## anova_test between BAD and CLAGE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 26.1 26.106 168 <2e-16 ***
## Residuals 5958 925.7 0.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of DEBTINC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5245 30.7632 34.8183 34.0007 37.9499 203.3122
##
## Statistical analysis of DEBTINC
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 7.644528 58.43881 3.111598 64.0733 245
##
## anova_test between BAD and DEBTINC
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 22.7 22.733 145.8 <2e-16 ***
## Residuals 5958 929.1 0.156
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pl5 <- num %>%
ggplot(aes(x=BAD, y=LOAN, fill=BAD)) + geom_boxplot()
pl6 <- num %>%
ggplot(aes(x=BAD, y=MORTDUE, fill=BAD)) + geom_boxplot()
pl7 <- num %>%
ggplot(aes(x=BAD, y=VALUE, fill=BAD)) + geom_boxplot()
pl8 <- num %>%
ggplot(aes(x=BAD, y=YOJ, fill=BAD)) + geom_boxplot()
pl9 <- num %>%
ggplot(aes(x=BAD, y=CLAGE, fill=BAD)) + geom_boxplot()
pl10 <- num %>%
ggplot(aes(x=BAD, y=DEBTINC, fill=BAD)) + geom_boxplot()
par(mfrow=c(2,3))
grid.arrange(pl5,pl6,pl7,pl8,pl9,pl10, ncol=2)
Outliers are those observations that lie outside 1.5 times the Inter Quartile Range (difference between 75th and 25th quartiles). If they are not detected and corrected in an appropriate way they can distort the prediction. There are several ways to handle outliers, I’ve decided to cap them replacing those observations outside the lower limit with the value of the 5th quartile and those that lie above the upper limit, with the value of the 95th quartile.
# Before
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN")
# Managing outliers
qnt <- quantile(num$LOAN, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$LOAN, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$LOAN, na.rm = T)
num$LOAN[num$LOAN < (qnt[1] - H)] <- caps[1]
num$LOAN[num$LOAN >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN after handled outliers")
# Before
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE")
# Managing outliers
qnt <- quantile(num$MORTDUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$MORTDUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$MORTDUE, na.rm = T)
num$MORTDUE[num$MORTDUE < (qnt[1] - H)] <- caps[1]
num$MORTDUE[num$MORTDUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE after handled outliers")
# Before
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE")
# Managing outliers
qnt <- quantile(num$VALUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$VALUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$VALUE, na.rm = T)
num$VALUE[num$VALUE < (qnt[1] - H)] <- caps[1]
num$VALUE[num$VALUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE after handled outliers")
# Before
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ")
# Managing outliers
qnt <- quantile(num$YOJ, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$YOJ, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$YOJ, na.rm = T)
num$YOJ[num$YOJ < (qnt[1] - H)] <- caps[1]
num$YOJ[num$YOJ >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ after handled outliers")
# Before
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE")
# Managing outliers
qnt <- quantile(num$CLAGE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$CLAGE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$CLAGE, na.rm = T)
num$CLAGE[num$CLAGE < (qnt[1] - H)] <- caps[1]
num$CLAGE[num$CLAGE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE after handled outliers")
# Before
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC")
# Managing outliers
qnt <- quantile(num$DEBTINC, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$DEBTINC, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$DEBTINC, na.rm = T)
num$DEBTINC[num$DEBTINC < (qnt[1] - H)] <- caps[1]
num$DEBTINC[num$DEBTINC >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC after handled outliers")
The goal of this step is to delete predictors that can have single unique values or handful of unique values that occur with very low frequencies. These “zero-variance predictors” may cause problems to fit the algorithms (excluding tree-based models)
data <- cbind(categorical,num)
nzv <- nearZeroVar(data, saveMetrics= TRUE)
nzv[nzv$nzv,][1:15,]
## freqRatio percentUnique zeroVar nzv
## DELINQ.2 22.84000 0.03355705 FALSE TRUE
## DELINQ.3 45.20155 0.03355705 FALSE TRUE
## DELINQ.4 75.41026 0.03355705 FALSE TRUE
## DELINQ.5 155.84211 0.03355705 FALSE TRUE
## DELINQ.6 219.74074 0.03355705 FALSE TRUE
## DELINQ.7 457.46154 0.03355705 FALSE TRUE
## DELINQ.8 1191.00000 0.03355705 FALSE TRUE
## DELINQ.10 2979.00000 0.03355705 FALSE TRUE
## DELINQ.11 2979.00000 0.03355705 FALSE TRUE
## DELINQ.12 5959.00000 0.03355705 FALSE TRUE
## DELINQ.13 5959.00000 0.03355705 FALSE TRUE
## DELINQ.15 5959.00000 0.03355705 FALSE TRUE
## JOB.Sales 53.67890 0.03355705 FALSE TRUE
## JOB.Self 29.88083 0.03355705 FALSE TRUE
## DEROG.2 36.25000 0.03355705 FALSE TRUE
nzv <- nearZeroVar(data)
data_new <- data[, -nzv]
Another feature selection approach is to observe correlation between variables, I apply it on all data set. There are some models such as linear regression where related features can deteriorate the performance (multicollinearity). Though some ensemble models are not sensitive at this topic, “Ensembles of tree-based models”, I prefer to remove them anyway because I don’t know which model to use in advance.
par(mfrow=c(1,1))
cor <- cor(data_new,use="complete.obs",method = "spearman")
corrplot(cor, type="lower", tl.col = "black", diag=FALSE, method="number", mar = c(0, 0, 2, 0), title="Correlation")
summary(cor[upper.tri(cor)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.000000 -0.038116 0.005731 -0.014157 0.043268 0.792421
To analyze the performance of a model is a good manner to split the data set into the training set and the test set. The training set is a sample of data used to fit the model, the test set is a sample of data used to provide an unbiased evaluation of the model applied on data never seen before.
# calculate the pre-process parameters from the data set
set.seed(2019)
preprocessParams <- preProcess(df_new, method=c("center", "scale"))
# Transform the data set using the parameters
transformed <- predict(preprocessParams, df_new)
# Manage levels on the target variable
y <- as.factor(df$BAD)
transformed <- cbind.data.frame(transformed,y)
levels(transformed$y) <- make.names(levels(factor(transformed$y)))
str(transformed)
## 'data.frame': 5960 obs. of 14 variables:
## $ DELINQ.0 : num 0.502 -1.99 0.502 0.502 0.502 ...
## $ DELINQ.1 : num -0.351 -0.351 -0.351 -0.351 -0.351 ...
## $ REASON.HomeImp: num 1.532 1.532 1.532 -0.653 1.532 ...
## $ JOB.Mgr : num -0.384 -0.384 -0.384 -0.384 -0.384 ...
## $ JOB.Office : num -0.435 -0.435 -0.435 -0.435 2.299 ...
## $ JOB.Other : num 1.11 1.11 1.11 1.11 -0.9 ...
## $ JOB.ProfExe : num -0.522 -0.522 -0.522 -0.522 -0.522 ...
## $ DEROG.1 : num -0.281 -0.281 -0.281 -0.281 -0.281 ...
## $ LOAN : num -1.86 -1.84 -1.81 -1.81 -1.79 ...
## $ VALUE : num -1.322 -0.669 -1.818 -0.206 0.3 ...
## $ YOJ : num 0.279 -0.233 -0.672 -0.233 -0.819 ...
## $ CLAGE : num -1.091 -0.73 -0.367 -0.052 -1.104 ...
## $ DEBTINC : num 0.143 0.143 0.143 0.143 0.143 ...
## $ y : Factor w/ 2 levels "X0","X1": 2 2 2 2 1 2 2 2 2 2 ...
In this analysis I’ve used the same four baseline models applying sampling methods by H2O framework: Logistic Regression as the easiest model and a benchmark, two ensemble models (Random Forest and Gradient Boosting Machine) and a Neural Network.
transformed_h2o <- as.h2o(transformed)
##
|
| | 0%
|
|======================================================================| 100%
response_col <- transformed_h2o[,14]
h2o.levels(response_col)
## [1] "X0" "X1"
response <- 'y'
# Get columns of predictors
predictors <- which(!colnames(transformed_h2o) %in% c(response))
splits <- h2o.splitFrame(data = transformed_h2o,
ratios = 0.8, #partition data into 80%, 20% chunks
seed = 1)
itrain <- splits[[1]]
itest <- splits[[2]]
fit.glm <- h2o.glm(x = predictors, y = response, training_frame = itrain,
validation_frame = itest, balance_classes = TRUE, family = "binomial", seed = 12345)
##
|
| | 0%
|
|======================================================================| 100%
# Evaluation performance
fit.glm
## Model Details:
## ==============
##
## H2OBinomialModel: glm
## Model ID: GLM_model_R_1580512297754_1
## GLM Model: summary
## family link regularization
## 1 binomial logit Elastic Net (alpha = 0.5, lambda = 2.508E-4 )
## number_of_predictors_total number_of_active_predictors number_of_iterations
## 1 13 13 4
## training_frame
## 1 RTMP_sid_8b18_4
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept -1.664925 -1.663053
## 2 DELINQ.0 -0.887165 -0.885173
## 3 DELINQ.1 -0.312985 -0.307732
## 4 REASON.HomeImp 0.083323 0.083333
## 5 JOB.Mgr -0.220402 -0.219713
## 6 JOB.Office -0.495108 -0.497275
## 7 JOB.Other -0.298622 -0.298754
## 8 JOB.ProfExe -0.293356 -0.292313
## 9 DEROG.1 0.156282 0.156471
## 10 LOAN -0.126946 -0.126756
## 11 VALUE -0.049972 -0.049778
## 12 YOJ -0.076104 -0.076685
## 13 CLAGE -0.515819 -0.515964
## 14 DEBTINC 0.348879 0.346992
##
## H2OBinomialMetrics: glm
## ** Reported on training data. **
##
## MSE: 0.1318866
## RMSE: 0.3631619
## LogLoss: 0.4179426
## Mean Per-Class Error: 0.3042381
## AUC: 0.779947
## AUCPR: 0.4865624
## Gini: 0.5598941
## R^2: 0.1798153
## Residual Deviance: 3998.039
## AIC: 4026.039
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## X0 X1 Error Rate
## X0 3241 579 0.151571 =579/3820
## X1 440 523 0.456906 =440/963
## Totals 3681 1102 0.213046 =1019/4783
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.266897 0.506538 207
## 2 max f2 0.120973 0.641257 310
## 3 max f0point5 0.363619 0.519205 157
## 4 max accuracy 0.469527 0.819778 118
## 5 max precision 0.920567 1.000000 0
## 6 max recall 0.019352 1.000000 396
## 7 max specificity 0.920567 1.000000 0
## 8 max absolute_mcc 0.278180 0.376616 200
## 9 max min_per_class_accuracy 0.187625 0.706283 259
## 10 max mean_per_class_accuracy 0.179857 0.708884 264
## 11 max tns 0.920567 3820.000000 0
## 12 max fns 0.920567 962.000000 0
## 13 max fps 0.010097 3820.000000 399
## 14 max tps 0.019352 963.000000 396
## 15 max tnr 0.920567 1.000000 0
## 16 max fnr 0.920567 0.998962 0
## 17 max fpr 0.010097 1.000000 399
## 18 max tpr 0.019352 1.000000 396
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: glm
## ** Reported on validation data. **
##
## MSE: 0.1248547
## RMSE: 0.3533478
## LogLoss: 0.3984473
## Mean Per-Class Error: 0.2963555
## AUC: 0.7979956
## AUCPR: 0.4822863
## Gini: 0.5959912
## R^2: 0.1952356
## Residual Deviance: 937.945
## AIC: 965.945
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## X0 X1 Error Rate
## X0 846 105 0.110410 =105/951
## X1 109 117 0.482301 =109/226
## Totals 955 222 0.181818 =214/1177
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.298294 0.522321 147
## 2 max f2 0.159037 0.641205 247
## 3 max f0point5 0.357076 0.543237 122
## 4 max accuracy 0.448213 0.834325 86
## 5 max precision 0.904054 1.000000 0
## 6 max recall 0.036115 1.000000 382
## 7 max specificity 0.904054 1.000000 0
## 8 max absolute_mcc 0.298294 0.410080 147
## 9 max min_per_class_accuracy 0.193036 0.723449 218
## 10 max mean_per_class_accuracy 0.186638 0.730886 222
## 11 max tns 0.904054 951.000000 0
## 12 max fns 0.904054 225.000000 0
## 13 max fps 0.011599 951.000000 399
## 14 max tps 0.036115 226.000000 382
## 15 max tnr 0.904054 1.000000 0
## 16 max fnr 0.904054 0.995575 0
## 17 max fpr 0.011599 1.000000 399
## 18 max tpr 0.036115 1.000000 382
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
fit.rf <- h2o.randomForest(x = predictors, y = response, training_frame = itrain,
validation_frame = itest, balance_classes = TRUE, seed = 12345)
##
|
| | 0%
|
|====================== | 32%
|
|===================================================== | 76%
|
|======================================================================| 100%
# Evaluation performance
fit.rf
## Model Details:
## ==============
##
## H2OBinomialModel: drf
## Model ID: DRF_model_R_1580512297754_4
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 50 50 415061 20
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 20 20.00000 545 710 655.32000
##
##
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 0.02057868
## RMSE: 0.1434527
## LogLoss: 0.1015732
## Mean Per-Class Error: 0.01046997
## AUC: 0.9981022
## AUCPR: 0.6466011
## Gini: 0.9962045
## R^2: 0.9176853
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## X0 X1 Error Rate
## X0 3752 68 0.017801 =68/3820
## X1 12 3811 0.003139 =12/3823
## Totals 3764 3879 0.010467 =80/7643
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.454961 0.989613 235
## 2 max f2 0.423489 0.995259 242
## 3 max f0point5 0.559141 0.988921 209
## 4 max accuracy 0.509753 0.989533 222
## 5 max precision 0.983557 0.998684 11
## 6 max recall 0.404016 1.000000 246
## 7 max specificity 1.000000 0.999476 0
## 8 max absolute_mcc 0.454961 0.979171 235
## 9 max min_per_class_accuracy 0.521732 0.987958 218
## 10 max mean_per_class_accuracy 0.509753 0.989532 222
## 11 max tns 1.000000 3818.000000 0
## 12 max fns 1.000000 2481.000000 0
## 13 max fps 0.000001 3820.000000 399
## 14 max tps 0.404016 3823.000000 246
## 15 max tnr 1.000000 0.999476 0
## 16 max fnr 1.000000 0.648967 0
## 17 max fpr 0.000001 1.000000 399
## 18 max tpr 0.404016 1.000000 246
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: drf
## ** Reported on validation data. **
##
## MSE: 0.08684995
## RMSE: 0.2947031
## LogLoss: 0.26435
## Mean Per-Class Error: 0.1272926
## AUC: 0.9519044
## AUCPR: 0.8188366
## Gini: 0.9038088
## R^2: 0.4401992
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## X0 X1 Error Rate
## X0 873 78 0.082019 =78/951
## X1 39 187 0.172566 =39/226
## Totals 912 265 0.099405 =117/1177
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.165189 0.761711 203
## 2 max f2 0.059243 0.830131 282
## 3 max f0point5 0.290356 0.792812 139
## 4 max accuracy 0.290356 0.909941 139
## 5 max precision 1.000000 1.000000 0
## 6 max recall 0.026814 1.000000 336
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 0.165189 0.702946 203
## 9 max min_per_class_accuracy 0.106160 0.870662 236
## 10 max mean_per_class_accuracy 0.086885 0.886044 253
## 11 max tns 1.000000 951.000000 0
## 12 max fns 1.000000 224.000000 0
## 13 max fps 0.000005 951.000000 399
## 14 max tps 0.026814 226.000000 336
## 15 max tnr 1.000000 1.000000 0
## 16 max fnr 1.000000 0.991150 0
## 17 max fpr 0.000005 1.000000 399
## 18 max tpr 0.026814 1.000000 336
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
fit.nn <- h2o.deeplearning(x = predictors, y = response, training_frame = itrain,
validation_frame = itest, balance_classes = TRUE, seed = 12345)
##
|
| | 0%
|
|====================== | 32%
|
|=================================================================== | 96%
|
|======================================================================| 100%
# Evaluation performance
fit.nn
## Model Details:
## ==============
##
## H2OBinomialModel: deeplearning
## Model ID: DeepLearning_model_R_1580512297754_106
## Status of Neuron Layers: predicting y, 2-class classification, bernoulli distribution, CrossEntropy loss, 43.402 weights/biases, 518,7 KB, 76.430 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms momentum
## 1 1 13 Input 0.00 % NA NA NA NA NA
## 2 2 200 Rectifier 0.00 % 0.000000 0.000000 0.003801 0.001774 0.000000
## 3 3 200 Rectifier 0.00 % 0.000000 0.000000 0.076288 0.103208 0.000000
## 4 4 2 Softmax NA 0.000000 0.000000 0.003362 0.001877 0.000000
## mean_weight weight_rms mean_bias bias_rms
## 1 NA NA NA NA
## 2 -0.003673 0.128088 0.236593 0.097845
## 3 -0.030226 0.099299 0.880866 0.156410
## 4 0.006480 0.370943 0.000550 0.094676
##
##
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
##
## MSE: 0.1491253
## RMSE: 0.3861674
## LogLoss: 0.4734618
## Mean Per-Class Error: 0.1362267
## AUC: 0.9452189
## AUCPR: 0.9352998
## Gini: 0.8904378
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## X0 X1 Error Rate
## X0 3070 750 0.196335 =750/3820
## X1 291 3532 0.076118 =291/3823
## Totals 3361 4282 0.136203 =1041/7643
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.069172 0.871561 338
## 2 max f2 0.022442 0.918023 371
## 3 max f0point5 0.306695 0.886163 245
## 4 max accuracy 0.109100 0.867984 318
## 5 max precision 0.999604 1.000000 0
## 6 max recall 0.001587 1.000000 396
## 7 max specificity 0.999604 1.000000 0
## 8 max absolute_mcc 0.109100 0.736165 318
## 9 max min_per_class_accuracy 0.117126 0.865183 313
## 10 max mean_per_class_accuracy 0.109100 0.867979 318
## 11 max tns 0.999604 3820.000000 0
## 12 max fns 0.999604 3787.000000 0
## 13 max fps 0.000146 3820.000000 399
## 14 max tps 0.001587 3823.000000 396
## 15 max tnr 0.999604 1.000000 0
## 16 max fnr 0.999604 0.990583 0
## 17 max fpr 0.000146 1.000000 399
## 18 max tpr 0.001587 1.000000 396
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
##
## MSE: 0.09111745
## RMSE: 0.3018567
## LogLoss: 0.3269739
## Mean Per-Class Error: 0.1933363
## AUC: 0.8908904
## AUCPR: 0.7215676
## Gini: 0.7817807
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## X0 X1 Error Rate
## X0 861 90 0.094637 =90/951
## X1 66 160 0.292035 =66/226
## Totals 927 250 0.132540 =156/1177
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.244101 0.672269 201
## 2 max f2 0.083636 0.720966 283
## 3 max f0point5 0.639524 0.736544 91
## 4 max accuracy 0.595380 0.884452 106
## 5 max precision 0.999569 1.000000 0
## 6 max recall 0.000205 1.000000 398
## 7 max specificity 0.999569 1.000000 0
## 8 max absolute_mcc 0.346157 0.597573 167
## 9 max min_per_class_accuracy 0.090171 0.800885 279
## 10 max mean_per_class_accuracy 0.202298 0.813212 221
## 11 max tns 0.999569 951.000000 0
## 12 max fns 0.999569 225.000000 0
## 13 max fps 0.000069 951.000000 399
## 14 max tps 0.000205 226.000000 398
## 15 max tnr 0.999569 1.000000 0
## 16 max fnr 0.999569 0.995575 0
## 17 max fpr 0.000069 1.000000 399
## 18 max tpr 0.000205 1.000000 398
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
fit.gbm <- h2o.gbm(x = predictors, y = response, training_frame = itrain,
validation_frame = itest, balance_classes = TRUE, seed = 12345)
##
|
| | 0%
|
|===================================================== | 76%
|
|======================================================================| 100%
# Evaluation performance
fit.gbm
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_model_R_1580512297754_111
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 50 50 20719 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 20 32 28.28000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.1271649
## RMSE: 0.3566019
## LogLoss: 0.3933718
## Mean Per-Class Error: 0.09291098
## AUC: 0.9633733
## AUCPR: 0.9592815
## Gini: 0.9267467
## R^2: 0.4913404
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## X0 X1 Error Rate
## X0 3314 506 0.132461 =506/3820
## X1 204 3619 0.053361 =204/3823
## Totals 3518 4125 0.092895 =710/7643
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.133319 0.910669 301
## 2 max f2 0.072990 0.938093 335
## 3 max f0point5 0.217954 0.910984 267
## 4 max accuracy 0.191579 0.908282 277
## 5 max precision 0.927522 1.000000 0
## 6 max recall 0.022366 1.000000 382
## 7 max specificity 0.927522 1.000000 0
## 8 max absolute_mcc 0.133319 0.816758 301
## 9 max min_per_class_accuracy 0.195194 0.906095 275
## 10 max mean_per_class_accuracy 0.191579 0.908281 277
## 11 max tns 0.927522 3820.000000 0
## 12 max fns 0.927522 3819.000000 0
## 13 max fps 0.006498 3820.000000 399
## 14 max tps 0.022366 3823.000000 382
## 15 max tnr 0.927522 1.000000 0
## 16 max fnr 0.927522 0.998954 0
## 17 max fpr 0.006498 1.000000 399
## 18 max tpr 0.022366 1.000000 382
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.07695371
## RMSE: 0.2774053
## LogLoss: 0.2533059
## Mean Per-Class Error: 0.1430865
## AUC: 0.9378205
## AUCPR: 0.7933898
## Gini: 0.8756409
## R^2: 0.5039865
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## X0 X1 Error Rate
## X0 864 87 0.091483 =87/951
## X1 44 182 0.194690 =44/226
## Totals 908 269 0.111300 =131/1177
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.273834 0.735354 193
## 2 max f2 0.108992 0.813323 254
## 3 max f0point5 0.513181 0.762527 119
## 4 max accuracy 0.513181 0.898895 119
## 5 max precision 0.891477 1.000000 0
## 6 max recall 0.028563 1.000000 353
## 7 max specificity 0.891477 1.000000 0
## 8 max absolute_mcc 0.273834 0.669605 193
## 9 max min_per_class_accuracy 0.151069 0.860147 229
## 10 max mean_per_class_accuracy 0.129747 0.873931 239
## 11 max tns 0.891477 951.000000 0
## 12 max fns 0.891477 225.000000 0
## 13 max fps 0.006635 951.000000 399
## 14 max tps 0.028563 226.000000 353
## 15 max tnr 0.891477 1.000000 0
## 16 max fnr 0.891477 0.995575 0
## 17 max fpr 0.006635 1.000000 399
## 18 max tpr 0.028563 1.000000 353
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Results are explained by the confusion matrix and the F1-score as better evaluation metric for imbalanced data set.
# Back to R workspace
trainy_true <- as.vector(itrain$y)
testy_true <- as.vector(itest$y)
train.glm <- as.data.table((h2o.predict(fit.glm, itrain)))
##
|
| | 0%
|
|======================================================================| 100%
test.glm <- as.data.table((h2o.predict(fit.glm, itest)))
##
|
| | 0%
|
|======================================================================| 100%
train_prediction.glm <- as.vector(train.glm[[1]])
prediction.glm <- as.vector(test.glm[[1]])
F1_train <- F1_Score(train_prediction.glm, trainy_true)
F1_test <- F1_Score(prediction.glm, testy_true)
cat(paste('F1_train_glm:',F1_train,'F1_test_glm:', F1_test), sep='\n')
## F1_train_glm: 0.875391032325339 F1_test_glm: 0.887840670859539
train.rf <- as.data.table((h2o.predict(fit.rf, itrain)))
##
|
| | 0%
|
|======================================================================| 100%
test.rf <- as.data.table((h2o.predict(fit.rf, itest)))
##
|
| | 0%
|
|======================================================================| 100%
train_prediction.rf <- as.vector(train.rf[[1]])
prediction.rf <- as.vector(test.rf[[1]])
F1_train <- F1_Score(train_prediction.rf, trainy_true)
F1_test <- F1_Score(prediction.rf, testy_true)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.998820600183462 F1_test_rf: 0.936695278969957
train.nn <- as.data.table((h2o.predict(fit.nn, itrain)))
##
|
| | 0%
|
|======================================================================| 100%
test.nn <- as.data.table((h2o.predict(fit.nn, itest)))
##
|
| | 0%
|
|======================================================================| 100%
train_prediction.nn <- as.vector(train.nn[[1]])
prediction.nn <- as.vector(test.nn[[1]])
F1_train <- F1_Score(train_prediction.nn, trainy_true)
F1_test <- F1_Score(prediction.nn, testy_true)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.937467157120336 F1_test_rf: 0.917021276595745
train.gbm <- as.data.table((h2o.predict(fit.gbm, itrain)))
##
|
| | 0%
|
|======================================================================| 100%
test.gbm <- as.data.table((h2o.predict(fit.gbm, itest)))
##
|
| | 0%
|
|======================================================================| 100%
train_prediction.gbm <- as.vector(train.gbm[[1]])
prediction.gbm <- as.vector(test.gbm[[1]])
F1_train <- F1_Score(train_prediction.gbm, trainy_true)
F1_test <- F1_Score(prediction.gbm, testy_true)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.942883770469977 F1_test_rf: 0.929032258064516
par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, testy_true)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, testy_true)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nn, testy_true)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, testy_true)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GBM Confusion Matrix")
h2o.shutdown(prompt = TRUE)
## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?
## [1] TRUE
Robert Tibshirani, Trevor Hastie, Daniela Witten, Gareth James, “An Introduction to Statistical Learning: With Applications in R”, 2013
Max Kuhn, Kjell Johnson, “Applied Predictive Modeling”, 2013
https://machinelearningmastery.com/what-is-imbalanced-classification/