How to face a majority class greater than a minority class in a classification predictive modeling: sampling methods with H2O

Challenge

I’ve used a data set from Kaggle: the Home Equity data set (HMEQ) It contains loan performance information for 5.960 recent home equity loans. The target variable (BAD) is a binary variable indicating whether an applicant eventually defaulted and for each applicant are reported 12 input variables. The task is to predict clients who default on their loans.

Below the variables description

-BAD: 1 = client defaulted on loan 0 = loan repaid

-LOAN: Amount of the loan request

-MORTDUE: Amount due on existing mortgage

-VALUE: Value of current property

-REASON: DebtCon = debt consolidation; HomeImp = home improvement

-JOB: Six occupational categories

-YOJ: Years at present job

-DEROG: Number of major derogatory reports

-DELINQ: Number of delinquent credit lines

-CLAGE: Age of oldest trade line in months

-NINQ: Number of recent credit lines

-CLNO: Number of credit lines

-DEBTINC: Debt-to-income ratio

Prepare Workspace

suppressWarnings({library(ggplot2)})
suppressWarnings({library(tidyverse)})

## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## v purrr   0.3.3

## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

suppressWarnings({library(caret)})

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

suppressWarnings({library(corrplot)})

## corrplot 0.84 loaded

suppressWarnings({library(gridExtra)})

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

suppressWarnings({library(MLmetrics)})

## 
## Attaching package: 'MLmetrics'

## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE

## The following object is masked from 'package:base':
## 
##     Recall

suppressWarnings({library(data.table)})

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

suppressWarnings({library(h2o)})

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:data.table':
## 
##     hour, month, week, year

## The following objects are masked from 'package:stats':
## 
##     cor, sd, var

## The following objects are masked from 'package:base':
## 
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

h2o.init(nthreads = -1, #Number of threads -1 means use all cores on your machine
         max_mem_size = "32G")

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\user\AppData\Local\Temp\Rtmpa437WQ/h2o_user_started_from_r.out
##     C:\Users\user\AppData\Local\Temp\Rtmpa437WQ/h2o_user_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         4 seconds 32 milliseconds 
##     H2O cluster timezone:       Europe/Rome 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.28.0.2 
##     H2O cluster version age:    11 days  
##     H2O cluster name:           H2O_started_from_R_user_rsf111 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   32.00 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.1 (2019-07-05)

Upload Dataset

path <- "C:/Users/user/Documents/eRUM2020/hmeq.csv"
df = read.csv(path)

Summarize Data set

# Dimensions of data set
dim(df)

## [1] 5960   13

# List types for each attribute
sapply(df, class)

##       BAD      LOAN   MORTDUE     VALUE    REASON       JOB       YOJ     DEROG 
## "integer" "integer" "numeric" "numeric"  "factor"  "factor" "numeric" "integer" 
##    DELINQ     CLAGE      NINQ      CLNO   DEBTINC 
## "integer" "numeric" "integer" "integer" "numeric"

# Take a peek at the first rows of the data set
head(df,5)

##   BAD LOAN MORTDUE  VALUE  REASON    JOB  YOJ DEROG DELINQ     CLAGE NINQ CLNO
## 1   1 1100   25860  39025 HomeImp  Other 10.5     0      0  94.36667    1    9
## 2   1 1300   70053  68400 HomeImp  Other  7.0     0      2 121.83333    0   14
## 3   1 1500   13500  16700 HomeImp  Other  4.0     0      0 149.46667    1   10
## 4   1 1500      NA     NA                  NA    NA     NA        NA   NA   NA
## 5   0 1700   97800 112000 HomeImp Office  3.0     0      0  93.33333    0   14
##   DEBTINC
## 1      NA
## 2      NA
## 3      NA
## 4      NA
## 5      NA

# Summarize attribute distributions
summary(df)

##       BAD              LOAN          MORTDUE           VALUE       
##  Min.   :0.0000   Min.   : 1100   Min.   :  2063   Min.   :  8000  
##  1st Qu.:0.0000   1st Qu.:11100   1st Qu.: 46276   1st Qu.: 66076  
##  Median :0.0000   Median :16300   Median : 65019   Median : 89236  
##  Mean   :0.1995   Mean   :18608   Mean   : 73761   Mean   :101776  
##  3rd Qu.:0.0000   3rd Qu.:23300   3rd Qu.: 91488   3rd Qu.:119824  
##  Max.   :1.0000   Max.   :89900   Max.   :399550   Max.   :855909  
##                                   NA's   :518      NA's   :112     
##      REASON          JOB            YOJ             DEROG        
##         : 252          : 279   Min.   : 0.000   Min.   : 0.0000  
##  DebtCon:3928   Mgr    : 767   1st Qu.: 3.000   1st Qu.: 0.0000  
##  HomeImp:1780   Office : 948   Median : 7.000   Median : 0.0000  
##                 Other  :2388   Mean   : 8.922   Mean   : 0.2546  
##                 ProfExe:1276   3rd Qu.:13.000   3rd Qu.: 0.0000  
##                 Sales  : 109   Max.   :41.000   Max.   :10.0000  
##                 Self   : 193   NA's   :515      NA's   :708      
##      DELINQ            CLAGE             NINQ             CLNO     
##  Min.   : 0.0000   Min.   :   0.0   Min.   : 0.000   Min.   : 0.0  
##  1st Qu.: 0.0000   1st Qu.: 115.1   1st Qu.: 0.000   1st Qu.:15.0  
##  Median : 0.0000   Median : 173.5   Median : 1.000   Median :20.0  
##  Mean   : 0.4494   Mean   : 179.8   Mean   : 1.186   Mean   :21.3  
##  3rd Qu.: 0.0000   3rd Qu.: 231.6   3rd Qu.: 2.000   3rd Qu.:26.0  
##  Max.   :15.0000   Max.   :1168.2   Max.   :17.000   Max.   :71.0  
##  NA's   :580       NA's   :308      NA's   :510      NA's   :222   
##     DEBTINC        
##  Min.   :  0.5245  
##  1st Qu.: 29.1400  
##  Median : 34.8183  
##  Mean   : 33.7799  
##  3rd Qu.: 39.0031  
##  Max.   :203.3121  
##  NA's   :1267

# Summarize data structure
str(df)

## 'data.frame':    5960 obs. of  13 variables:
##  $ BAD    : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ LOAN   : int  1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
##  $ MORTDUE: num  25860 70053 13500 NA 97800 ...
##  $ VALUE  : num  39025 68400 16700 NA 112000 ...
##  $ REASON : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
##  $ JOB    : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
##  $ YOJ    : num  10.5 7 4 NA 3 9 5 11 3 16 ...
##  $ DEROG  : int  0 0 0 NA 0 0 3 0 0 0 ...
##  $ DELINQ : int  0 2 0 NA 0 0 2 0 2 0 ...
##  $ CLAGE  : num  94.4 121.8 149.5 NA 93.3 ...
##  $ NINQ   : int  1 0 1 NA 0 1 1 0 1 0 ...
##  $ CLNO   : int  9 14 10 NA 14 8 17 8 12 13 ...
##  $ DEBTINC: num  NA NA NA NA NA ...

Formatting Features and managing some levels

BAD <- df$BAD <- as.factor(df$BAD)
df$LOAN <- as.numeric(df$LOAN)
df$DEROG <- as.factor(df$DEROG)
df$DELINQ <- as.factor(df$DELINQ)
df$NINQ <- as.factor(df$NINQ)
df$CLNO <- as.factor(df$CLNO)
df$JOB[df$JOB == ""] <- "NA"

## Warning in `[<-.factor`(`*tmp*`, df$JOB == "", value = structure(c(4L, 4L, :
## invalid factor level, NA generated

df$REASON[df$REASON == ""] <- "NA"

## Warning in `[<-.factor`(`*tmp*`, df$REASON == "", value = structure(c(3L, :
## invalid factor level, NA generated

Handling missing values

My approach: I’ve filled up missing values with the median for numerical variables and the most common level for the categorical ones. Also, I’ve created boolean features with 1 (true-missing value) or 0 (false-actual value) for each one with missing values. “Pawel Grabinski”

mi_summary <- function(data_frame){
  mi_summary<-c()
  for (col in colnames(data_frame)){
    mi_summary <- c(mi_summary,mean(is.na(data_frame[,col])*100))
  }
  mi_summary_new <- mi_summary[mi_summary>0]
  mi_summary_cols <- colnames(data_frame)[mi_summary>0]
  mi_summary <- data.frame('col_name' = mi_summary_cols, 'perc_missing' = mi_summary_new)
  mi_summary <- mi_summary[order(mi_summary[,2], decreasing = TRUE), ]
  mi_summary[,2] <- round(mi_summary[,2],6)
  rownames(mi_summary) <- NULL
  return(mi_summary)
}
missing_summary <- mi_summary(df)
missing_summary

##    col_name perc_missing
## 1   DEBTINC    21.258389
## 2     DEROG    11.879195
## 3    DELINQ     9.731544
## 4   MORTDUE     8.691275
## 5       YOJ     8.640940
## 6      NINQ     8.557047
## 7     CLAGE     5.167785
## 8       JOB     4.681208
## 9    REASON     4.228188
## 10     CLNO     3.724832
## 11    VALUE     1.879195

Input boolean variables for features with NA’s

df <- df %>%
  mutate(DEBTINC_NA = ifelse(is.na(DEBTINC),1,0)) %>%
  mutate(DEROG_NA = ifelse(is.na(DEROG),1,0)) %>%
  mutate(DELINQ_NA = ifelse(is.na(DELINQ),1,0)) %>%
  mutate(MORTDUE_NA = ifelse(is.na(MORTDUE),1,0)) %>%
  mutate(YOJ_NA = ifelse(is.na(YOJ),1,0)) %>%
  mutate(NINQ_NA = ifelse(is.na(NINQ),1,0)) %>%
  mutate(CLAGE_NA = ifelse(is.na(CLAGE),1,0)) %>%
  mutate(CLNO_NA = ifelse(is.na(CLNO),1,0)) %>%
  mutate(VALUE_NA = ifelse(is.na(VALUE),1,0)) %>%
  mutate(JOB_NA = ifelse(is.na(JOB),1,0)) %>%
  mutate(REASON_NA = ifelse(is.na(REASON),1,0))

Input missing values with median for numerical columns and with the most common level for categorical columns

for (col in missing_summary$col_name){
  if (class(df[,col]) == 'factor'){
    unique_levels <- unique(df[,col])
    df[is.na(df[,col]), col] <- unique_levels[which.max(tabulate(match(df[,col], unique_levels)))]
  } else {
    df[is.na(df[,col]),col] <- median(as.numeric(df[,col]), na.rm = TRUE)
  }
}

Check results

pMiss <- function(x){sum(is.na(x))/length(x)*100}
pMiss <- apply(df,2,pMiss)
pMiss <- pMiss[pMiss > 0]
pMiss <- pMiss[order(pMiss, decreasing=T)]
pMiss

## named numeric(0)

Formatting new features and managing some levels

df$DEROG_NA <- as.factor(df$DEROG_NA)
df$DEBTINC_NA <- as.factor(df$DEBTINC_NA)
df$DELINQ_NA <- as.factor(df$DELINQ_NA)
df$MORTDUE_NA <- as.factor(df$MORTDUE_NA)
df$YOJ_NA <- as.factor(df$YOJ_NA)
df$NINQ_NA <- as.factor(df$NINQ_NA)
df$CLAGE_NA <- as.factor(df$CLAGE_NA)
df$CLNO_NA <- as.factor(df$CLNO_NA)
df$VALUE_NA <- as.factor(df$VALUE_NA)
df$JOB_NA <- as.factor(df$JOB_NA)
df$REASON_NA <- as.factor(df$REASON_NA)
df$JOB <- factor(df$JOB, labels=c('Mgr','Office','Other','ProfExe','Sales','Self'))
df$REASON <- factor(df$REASON, labels=c('DebtCon','HomeImp'))

Split data set into categorical, boolean and numerical variables

I’ve grouped the data set into numerical, categorical and boolean variables for the Exploratory Data Analysis

cat <- df[,sapply(df, is.factor)] %>%
  select_if(~nlevels(.) <=15 ) %>%
  select(-BAD)
bol <- df[,c('DEBTINC_NA','DEROG_NA','DELINQ_NA','MORTDUE_NA','YOJ_NA','NINQ_NA','CLAGE_NA','CLNO_NA','VALUE_NA','JOB_NA','REASON_NA')]
num <- df[,sapply(df, is.numeric)]

Summarize the class distribution of the target variable

The target variable is grouped into two classes: “Loan defaulted (1)” and “Loan repaid (0)”. Looking at the barplot, it’s quite unbalanced.

Visualize data

cbind(freq=table(df$BAD), percentage=prop.table(table(df$BAD))*100)

##   freq percentage
## 0 4771   80.05034
## 1 1189   19.94966

ggplot(df, aes(BAD, fill=BAD)) + geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Distribution of Target variable")

Analysis for categorical features (barplot, univariate analysis, bivariate analysis)

I’ve grouped all categorical features into a new subset: I’ve done a graphical analysis using barplots and I’ve counted the frequency for each class. For a bivariate analysis I’ve used a Chi-Square Test to evaluate the relationship between the target variable and each categorical feature. This number tells how much difference exists between observed counts and the counts would be expect if there were no relationship at all in the population.

Univariate Analysis

cat <- cat[,c('DELINQ','REASON','JOB','DEROG')]
for(i in 1:length(cat)) {
  counts <- table(cat[,i])
  name <- names(cat)[i]
  barplot(counts, main=name, col=c("blue","red","green","orange","purple"))
}

Bivariate Analysis with Feature Selection Analysis

The Chi-Square Test is used as feature selection testing the null hypothesis of independence between target variable and categorical features. The goal is to test that two classifications are independent or not. Two classifications are independent if the distribution of one, with respect to a classification principle, is not influenced by the other one. If the null hypothesis is not rejected the two classifications are independent (P-Value>0.05) and the feature can be dropped.

par(mfrow=c(2,2))
for(i in 1:length(cat)){
  freq=table(cat[,i])
  percentage=prop.table(table(cat[,i]))*100
  freq_cat_outcome=table(BAD,cat[,i])
  name <- names(cat)[i]
  cat(sep="\n")
  cat(paste("Distribution of", name), sep="\n")
  print(cbind(freq,percentage))
  cat(sep="\n")
  cat(paste("Distribution by Target variable and", name), sep="\n")
  print(freq_cat_outcome)
  cat(sep="\n")
  cat(paste("Chi-squared test by Target variable and", name), sep="\n")
  suppressWarnings({print(chisq.test(table(BAD,cat[,i])))})
}

## 
## Distribution of DELINQ
##    freq  percentage
## 0  4759 79.84899329
## 1   654 10.97315436
## 2   250  4.19463087
## 3   129  2.16442953
## 4    78  1.30872483
## 5    38  0.63758389
## 6    27  0.45302013
## 7    13  0.21812081
## 8     5  0.08389262
## 10    2  0.03355705
## 11    2  0.03355705
## 12    1  0.01677852
## 13    1  0.01677852
## 15    1  0.01677852
## 
## Distribution by Target variable and DELINQ
##    
## BAD    0    1    2    3    4    5    6    7    8   10   11   12   13   15
##   0 4104  432  138   58   32    7    0    0    0    0    0    0    0    0
##   1  655  222  112   71   46   31   27   13    5    2    2    1    1    1
## 
## Chi-squared test by Target variable and DELINQ
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 763.8, df = 13, p-value < 2.2e-16
## 
## 
## Distribution of REASON
##         freq percentage
## DebtCon 4180   70.13423
## HomeImp 1780   29.86577
## 
## Distribution by Target variable and REASON
##    
## BAD DebtCon HomeImp
##   0    3387    1384
##   1     793     396
## 
## Chi-squared test by Target variable and REASON
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(BAD, cat[, i])
## X-squared = 8.1852, df = 1, p-value = 0.004223
## 
## 
## Distribution of JOB
##         freq percentage
## Mgr      767  12.869128
## Office   948  15.906040
## Other   2667  44.748322
## ProfExe 1276  21.409396
## Sales    109   1.828859
## Self     193   3.238255
## 
## Distribution by Target variable and JOB
##    
## BAD  Mgr Office Other ProfExe Sales Self
##   0  588    823  2090    1064    71  135
##   1  179    125   577     212    38   58
## 
## Chi-squared test by Target variable and JOB
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 73.815, df = 5, p-value = 1.644e-14
## 
## 
## Distribution of DEROG
##    freq  percentage
## 0  5235 87.83557047
## 1   435  7.29865772
## 2   160  2.68456376
## 3    58  0.97315436
## 4    23  0.38590604
## 5    15  0.25167785
## 6    15  0.25167785
## 7     8  0.13422819
## 8     6  0.10067114
## 9     3  0.05033557
## 10    2  0.03355705
## 
## Distribution by Target variable and DEROG
##    
## BAD    0    1    2    3    4    5    6    7    8    9   10
##   0 4394  266   78   15    5    8    5    0    0    0    0
##   1  841  169   82   43   18    7   10    8    6    3    2
## 
## Chi-squared test by Target variable and DEROG
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 503.99, df = 10, p-value < 2.2e-16

Visualization of Bivariate Analysis

pl1 <- cat %>%
  ggplot(aes(x=BAD, y=DELINQ, fill=BAD)) + 
  geom_bar(stat='identity') + 
  ggtitle("Distribution by BAD and DELINQ")
pl2 <- cat %>%
  ggplot(aes(x=BAD, y=REASON, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and REASON")
pl3 <- cat %>%
  ggplot(aes(x=BAD, y=JOB, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and JOB")
pl4 <- cat %>%
  ggplot(aes(x=BAD, y=DEROG, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and DEROG")
par(mfrow=c(2,2))
grid.arrange(pl1,pl2,pl3,pl4, ncol=2)

One-hot encoding on categorical features

I’ve transformed categorical features into numerical variables with one-hot encoding methodology to afford a better understanding of variables by machine learning models.

dmy <- dummyVars("~.", data = cat,fullRank = F)
cat_num <- data.frame(predict(dmy, newdata = cat))

Remove correlated levels from boolean features

drop_cols <- c('DEBTINC_NA.0','DEROG_NA.0','DELINQ_NA.0','MORTDUE_NA.0','YOJ_NA.0','NINQ_NA.0','CLAGE_NA.0','CLNO_NA.0','VALUE_NA.0','JOB_NA.0','REASON_NA.0')
categorical <- cat_num[,!colnames(cat_num) %in% drop_cols]

Analysis for numerical features (univariate analysis, bivariate analysis)

I’ve grouped all numerical features into a new subset: I’ve done a graphical analysis using histograms, density plots and boxplots; also I’ve performed metrics such as standard deviation, skewness and kurtosis for each variable. For a bivariate analysis I’ve used the Anova Test to evaluate the relationship between the target variable and each numerical feature. This test assesses whether the average of more than two groups are statistically different.

Univariate Analysis, histograms

par(mfrow=c(2,3))
for(i in 1:length(num)) {
  hist(num[,i], main=names(num)[i], col='blue')
}

Univariate Analysis, boxplots

par(mfrow=c(2,3))
for(i in 1:length(num)) {
  boxplot(num[,i], main=names(num)[i], col='orange')
}

Univariate Analysis, densityplots

par(mfrow=c(2,3))
for(i in 1:length(num)){
  plot(density(num[,i]), main=names(num)[i], col='red')
}

Bivariate Analysis

for(i in 1:length(num)){
  name <- names(num)[i]
  cat(paste("Distribution of", name), sep="\n")
  #cat(names(num)[i],sep = "\n")
  print(summary(num[,i]))
  cat(sep="\n")
  stand.deviation = sd(num[,i])
  variance = var(num[,i])
  skewness = mean((num[,i] - mean(num[,i]))^3/sd(num[,i])^3)
  kurtosis = mean((num[,i] - mean(num[,i]))^4/sd(num[,i])^4) - 3
  outlier_values <- sum(table(boxplot.stats(num[,i])$out))
  cat(paste("Statistical analysis of", name), sep="\n")
  print(cbind(stand.deviation, variance, skewness, kurtosis, outlier_values))
  cat(sep="\n")
  cat(paste("anova_test between BAD and", name),sep = "\n")
  print(summary(aov(as.numeric(BAD)~num[,i], data=num)))
  cat(sep="\n")
}

## Distribution of LOAN
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1100   11100   16300   18608   23300   89900 
## 
## Statistical analysis of LOAN
##      stand.deviation  variance skewness kurtosis outlier_values
## [1,]        11207.48 125607617 2.022762 6.922438            256
## 
## anova_test between BAD and LOAN
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    5.4   5.368   33.79 6.45e-09 ***
## Residuals   5958  946.4   0.159                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of MORTDUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2063   48139   65019   73001   88200  399550 
## 
## Statistical analysis of MORTDUE
##      stand.deviation   variance skewness kurtosis outlier_values
## [1,]        42552.73 1810734556 1.941153 7.440693            308
## 
## anova_test between BAD and MORTDUE
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    2.0  2.0303   12.74 0.000361 ***
## Residuals   5958  949.8  0.1594                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of VALUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8000   66490   89236  101540  119005  855909 
## 
## Statistical analysis of VALUE
##      stand.deviation   variance skewness kurtosis outlier_values
## [1,]        56869.44 3234132829 3.088963 24.85644            347
## 
## anova_test between BAD and VALUE
##               Df Sum Sq Mean Sq F value  Pr(>F)   
## num[, i]       1    1.3  1.2675   7.945 0.00484 **
## Residuals   5958  950.5  0.1595                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of YOJ
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   7.000   8.756  12.000  41.000 
## 
## Statistical analysis of YOJ
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]        7.259424 52.69923 1.092072 0.744703            211
## 
## anova_test between BAD and YOJ
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    2.8  2.7710    17.4 3.08e-05 ***
## Residuals   5958  949.0  0.1593                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of CLAGE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   117.4   173.5   179.4   227.1  1168.2 
## 
## Statistical analysis of CLAGE
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]         83.5747  6984.73 1.389903 8.180556             66
## 
## anova_test between BAD and CLAGE
##               Df Sum Sq Mean Sq F value Pr(>F)    
## num[, i]       1   26.1  26.106     168 <2e-16 ***
## Residuals   5958  925.7   0.155                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of DEBTINC
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.5245  30.7632  34.8183  34.0007  37.9499 203.3122 
## 
## Statistical analysis of DEBTINC
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]        7.644528 58.43881 3.111598  64.0733            245
## 
## anova_test between BAD and DEBTINC
##               Df Sum Sq Mean Sq F value Pr(>F)    
## num[, i]       1   22.7  22.733   145.8 <2e-16 ***
## Residuals   5958  929.1   0.156                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visualization of Bivariate Analysis

pl5 <- num %>%
  ggplot(aes(x=BAD, y=LOAN, fill=BAD)) + geom_boxplot() 
pl6 <- num %>%
  ggplot(aes(x=BAD, y=MORTDUE, fill=BAD)) + geom_boxplot() 
pl7 <- num %>%
  ggplot(aes(x=BAD, y=VALUE, fill=BAD)) + geom_boxplot() 
pl8 <- num %>%
  ggplot(aes(x=BAD, y=YOJ, fill=BAD)) + geom_boxplot() 
pl9 <- num %>%
  ggplot(aes(x=BAD, y=CLAGE, fill=BAD)) + geom_boxplot() 
pl10 <- num %>%
  ggplot(aes(x=BAD, y=DEBTINC, fill=BAD)) + geom_boxplot() 
par(mfrow=c(2,3))
grid.arrange(pl5,pl6,pl7,pl8,pl9,pl10, ncol=2)

Handling outliers

Outliers are those observations that lie outside 1.5 times the Inter Quartile Range (difference between 75th and 25th quartiles). If they are not detected and corrected in an appropriate way they can distort the prediction. There are several ways to handle outliers, I’ve decided to cap them replacing those observations outside the lower limit with the value of the 5th quartile and those that lie above the upper limit, with the value of the 95th quartile.

# Before
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN")

# Managing outliers
qnt <- quantile(num$LOAN, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$LOAN, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$LOAN, na.rm = T)
num$LOAN[num$LOAN < (qnt[1] - H)]  <- caps[1]
num$LOAN[num$LOAN >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN after handled outliers")

# Before
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE")

# Managing outliers
qnt <- quantile(num$MORTDUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$MORTDUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$MORTDUE, na.rm = T)
num$MORTDUE[num$MORTDUE < (qnt[1] - H)]  <- caps[1]
num$MORTDUE[num$MORTDUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3)  + ggtitle("MORTDUE after handled outliers")

# Before
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE")

# Managing outliers
qnt <- quantile(num$VALUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$VALUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$VALUE, na.rm = T)
num$VALUE[num$VALUE < (qnt[1] - H)]  <- caps[1]
num$VALUE[num$VALUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE after handled outliers")

# Before
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ")

# Managing outliers
qnt <- quantile(num$YOJ, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$YOJ, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$YOJ, na.rm = T)
num$YOJ[num$YOJ < (qnt[1] - H)]  <- caps[1]
num$YOJ[num$YOJ >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ after handled outliers")

# Before
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE")

# Managing outliers
qnt <- quantile(num$CLAGE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$CLAGE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$CLAGE, na.rm = T)
num$CLAGE[num$CLAGE < (qnt[1] - H)]  <- caps[1]
num$CLAGE[num$CLAGE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE after handled outliers")

# Before
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC")

# Managing outliers
qnt <- quantile(num$DEBTINC, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$DEBTINC, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$DEBTINC, na.rm = T)
num$DEBTINC[num$DEBTINC < (qnt[1] - H)]  <- caps[1]
num$DEBTINC[num$DEBTINC >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC after handled outliers")

Delete Zero-and Near Zero-Variance Predictors

The goal of this step is to delete predictors that can have single unique values or handful of unique values that occur with very low frequencies. These “zero-variance predictors” may cause problems to fit the algorithms (excluding tree-based models)

data <- cbind(categorical,num) 
nzv <- nearZeroVar(data, saveMetrics= TRUE)
nzv[nzv$nzv,][1:15,]

##            freqRatio percentUnique zeroVar  nzv
## DELINQ.2    22.84000    0.03355705   FALSE TRUE
## DELINQ.3    45.20155    0.03355705   FALSE TRUE
## DELINQ.4    75.41026    0.03355705   FALSE TRUE
## DELINQ.5   155.84211    0.03355705   FALSE TRUE
## DELINQ.6   219.74074    0.03355705   FALSE TRUE
## DELINQ.7   457.46154    0.03355705   FALSE TRUE
## DELINQ.8  1191.00000    0.03355705   FALSE TRUE
## DELINQ.10 2979.00000    0.03355705   FALSE TRUE
## DELINQ.11 2979.00000    0.03355705   FALSE TRUE
## DELINQ.12 5959.00000    0.03355705   FALSE TRUE
## DELINQ.13 5959.00000    0.03355705   FALSE TRUE
## DELINQ.15 5959.00000    0.03355705   FALSE TRUE
## JOB.Sales   53.67890    0.03355705   FALSE TRUE
## JOB.Self    29.88083    0.03355705   FALSE TRUE
## DEROG.2     36.25000    0.03355705   FALSE TRUE

nzv <- nearZeroVar(data)
data_new <- data[, -nzv]

Correlation

Another feature selection approach is to observe correlation between variables, I apply it on all data set. There are some models such as linear regression where related features can deteriorate the performance (multicollinearity). Though some ensemble models are not sensitive at this topic, “Ensembles of tree-based models”, I prefer to remove them anyway because I don’t know which model to use in advance.

Visualization

par(mfrow=c(1,1))
cor <- cor(data_new,use="complete.obs",method = "spearman")
corrplot(cor, type="lower", tl.col = "black", diag=FALSE, method="number", mar = c(0, 0, 2, 0), title="Correlation")

summary(cor[upper.tri(cor)])

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.000000 -0.038116  0.005731 -0.014157  0.043268  0.792421

Delete correlated features

tmp <- cor(data_new)
tmp[upper.tri(tmp)] <- 0
diag(tmp) <- 0
df_new <- data_new[,!apply(tmp,2,function(x) any(abs(x) > 0.75))]
cor <- cor(df_new,use="complete.obs",method = "spearman")
summary(cor[upper.tri(cor)])

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.6988623 -0.0421582  0.0004314 -0.0234455  0.0307002  0.3406328

Pre-processing

To analyze the performance of a model is a good manner to split the data set into the training set and the test set. The training set is a sample of data used to fit the model, the test set is a sample of data used to provide an unbiased evaluation of the model applied on data never seen before.

# calculate the pre-process parameters from the data set
set.seed(2019)
preprocessParams <- preProcess(df_new, method=c("center", "scale"))
# Transform the data set using the parameters
transformed <- predict(preprocessParams, df_new)
# Manage levels on the target variable
y <- as.factor(df$BAD)
transformed <- cbind.data.frame(transformed,y)
levels(transformed$y) <- make.names(levels(factor(transformed$y)))
str(transformed)

## 'data.frame':    5960 obs. of  14 variables:
##  $ DELINQ.0      : num  0.502 -1.99 0.502 0.502 0.502 ...
##  $ DELINQ.1      : num  -0.351 -0.351 -0.351 -0.351 -0.351 ...
##  $ REASON.HomeImp: num  1.532 1.532 1.532 -0.653 1.532 ...
##  $ JOB.Mgr       : num  -0.384 -0.384 -0.384 -0.384 -0.384 ...
##  $ JOB.Office    : num  -0.435 -0.435 -0.435 -0.435 2.299 ...
##  $ JOB.Other     : num  1.11 1.11 1.11 1.11 -0.9 ...
##  $ JOB.ProfExe   : num  -0.522 -0.522 -0.522 -0.522 -0.522 ...
##  $ DEROG.1       : num  -0.281 -0.281 -0.281 -0.281 -0.281 ...
##  $ LOAN          : num  -1.86 -1.84 -1.81 -1.81 -1.79 ...
##  $ VALUE         : num  -1.322 -0.669 -1.818 -0.206 0.3 ...
##  $ YOJ           : num  0.279 -0.233 -0.672 -0.233 -0.819 ...
##  $ CLAGE         : num  -1.091 -0.73 -0.367 -0.052 -1.104 ...
##  $ DEBTINC       : num  0.143 0.143 0.143 0.143 0.143 ...
##  $ y             : Factor w/ 2 levels "X0","X1": 2 2 2 2 1 2 2 2 2 2 ...

Sampling: evaluating models by H2O

In this analysis I’ve used the same four baseline models applying sampling methods by H2O framework: Logistic Regression as the easiest model and a benchmark, two ensemble models (Random Forest and Gradient Boosting Machine) and a Neural Network.

Transform data to h2o data frame

transformed_h2o <- as.h2o(transformed)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

response_col <- transformed_h2o[,14]
h2o.levels(response_col)

## [1] "X0" "X1"

response <- 'y'
# Get columns of predictors
predictors <- which(!colnames(transformed_h2o) %in% c(response))

Partition the data into training and test sets

splits <- h2o.splitFrame(data = transformed_h2o, 
                         ratios = 0.8,  #partition data into 80%, 20% chunks
                         seed = 1)  

itrain <- splits[[1]]
itest <- splits[[2]]

GLM:

fit.glm <- h2o.glm(x = predictors, y = response, training_frame = itrain,
                   validation_frame = itest, balance_classes = TRUE, family = "binomial", seed = 12345)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

# Evaluation performance
fit.glm

## Model Details:
## ==============
## 
## H2OBinomialModel: glm
## Model ID:  GLM_model_R_1580512297754_1 
## GLM Model: summary
##     family  link                                regularization
## 1 binomial logit Elastic Net (alpha = 0.5, lambda = 2.508E-4 )
##   number_of_predictors_total number_of_active_predictors number_of_iterations
## 1                         13                          13                    4
##    training_frame
## 1 RTMP_sid_8b18_4
## 
## Coefficients: glm coefficients
##             names coefficients standardized_coefficients
## 1       Intercept    -1.664925                 -1.663053
## 2        DELINQ.0    -0.887165                 -0.885173
## 3        DELINQ.1    -0.312985                 -0.307732
## 4  REASON.HomeImp     0.083323                  0.083333
## 5         JOB.Mgr    -0.220402                 -0.219713
## 6      JOB.Office    -0.495108                 -0.497275
## 7       JOB.Other    -0.298622                 -0.298754
## 8     JOB.ProfExe    -0.293356                 -0.292313
## 9         DEROG.1     0.156282                  0.156471
## 10           LOAN    -0.126946                 -0.126756
## 11          VALUE    -0.049972                 -0.049778
## 12            YOJ    -0.076104                 -0.076685
## 13          CLAGE    -0.515819                 -0.515964
## 14        DEBTINC     0.348879                  0.346992
## 
## H2OBinomialMetrics: glm
## ** Reported on training data. **
## 
## MSE:  0.1318866
## RMSE:  0.3631619
## LogLoss:  0.4179426
## Mean Per-Class Error:  0.3042381
## AUC:  0.779947
## AUCPR:  0.4865624
## Gini:  0.5598941
## R^2:  0.1798153
## Residual Deviance:  3998.039
## AIC:  4026.039
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          X0   X1    Error        Rate
## X0     3241  579 0.151571   =579/3820
## X1      440  523 0.456906    =440/963
## Totals 3681 1102 0.213046  =1019/4783
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.266897    0.506538 207
## 2                       max f2  0.120973    0.641257 310
## 3                 max f0point5  0.363619    0.519205 157
## 4                 max accuracy  0.469527    0.819778 118
## 5                max precision  0.920567    1.000000   0
## 6                   max recall  0.019352    1.000000 396
## 7              max specificity  0.920567    1.000000   0
## 8             max absolute_mcc  0.278180    0.376616 200
## 9   max min_per_class_accuracy  0.187625    0.706283 259
## 10 max mean_per_class_accuracy  0.179857    0.708884 264
## 11                     max tns  0.920567 3820.000000   0
## 12                     max fns  0.920567  962.000000   0
## 13                     max fps  0.010097 3820.000000 399
## 14                     max tps  0.019352  963.000000 396
## 15                     max tnr  0.920567    1.000000   0
## 16                     max fnr  0.920567    0.998962   0
## 17                     max fpr  0.010097    1.000000 399
## 18                     max tpr  0.019352    1.000000 396
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: glm
## ** Reported on validation data. **
## 
## MSE:  0.1248547
## RMSE:  0.3533478
## LogLoss:  0.3984473
## Mean Per-Class Error:  0.2963555
## AUC:  0.7979956
## AUCPR:  0.4822863
## Gini:  0.5959912
## R^2:  0.1952356
## Residual Deviance:  937.945
## AIC:  965.945
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         X0  X1    Error       Rate
## X0     846 105 0.110410   =105/951
## X1     109 117 0.482301   =109/226
## Totals 955 222 0.181818  =214/1177
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.298294   0.522321 147
## 2                       max f2  0.159037   0.641205 247
## 3                 max f0point5  0.357076   0.543237 122
## 4                 max accuracy  0.448213   0.834325  86
## 5                max precision  0.904054   1.000000   0
## 6                   max recall  0.036115   1.000000 382
## 7              max specificity  0.904054   1.000000   0
## 8             max absolute_mcc  0.298294   0.410080 147
## 9   max min_per_class_accuracy  0.193036   0.723449 218
## 10 max mean_per_class_accuracy  0.186638   0.730886 222
## 11                     max tns  0.904054 951.000000   0
## 12                     max fns  0.904054 225.000000   0
## 13                     max fps  0.011599 951.000000 399
## 14                     max tps  0.036115 226.000000 382
## 15                     max tnr  0.904054   1.000000   0
## 16                     max fnr  0.904054   0.995575   0
## 17                     max fpr  0.011599   1.000000 399
## 18                     max tpr  0.036115   1.000000 382
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

RANDOM FOREST:

fit.rf <- h2o.randomForest(x = predictors, y = response, training_frame = itrain,
                   validation_frame = itest, balance_classes = TRUE, seed = 12345)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================================| 100%

# Evaluation performance
fit.rf

## Model Details:
## ==============
## 
## H2OBinomialModel: drf
## Model ID:  DRF_model_R_1580512297754_4 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              50                       50              415061        20
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1        20   20.00000        545        710   655.32000
## 
## 
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## MSE:  0.02057868
## RMSE:  0.1434527
## LogLoss:  0.1015732
## Mean Per-Class Error:  0.01046997
## AUC:  0.9981022
## AUCPR:  0.6466011
## Gini:  0.9962045
## R^2:  0.9176853
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          X0   X1    Error      Rate
## X0     3752   68 0.017801  =68/3820
## X1       12 3811 0.003139  =12/3823
## Totals 3764 3879 0.010467  =80/7643
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.454961    0.989613 235
## 2                       max f2  0.423489    0.995259 242
## 3                 max f0point5  0.559141    0.988921 209
## 4                 max accuracy  0.509753    0.989533 222
## 5                max precision  0.983557    0.998684  11
## 6                   max recall  0.404016    1.000000 246
## 7              max specificity  1.000000    0.999476   0
## 8             max absolute_mcc  0.454961    0.979171 235
## 9   max min_per_class_accuracy  0.521732    0.987958 218
## 10 max mean_per_class_accuracy  0.509753    0.989532 222
## 11                     max tns  1.000000 3818.000000   0
## 12                     max fns  1.000000 2481.000000   0
## 13                     max fps  0.000001 3820.000000 399
## 14                     max tps  0.404016 3823.000000 246
## 15                     max tnr  1.000000    0.999476   0
## 16                     max fnr  1.000000    0.648967   0
## 17                     max fpr  0.000001    1.000000 399
## 18                     max tpr  0.404016    1.000000 246
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: drf
## ** Reported on validation data. **
## 
## MSE:  0.08684995
## RMSE:  0.2947031
## LogLoss:  0.26435
## Mean Per-Class Error:  0.1272926
## AUC:  0.9519044
## AUCPR:  0.8188366
## Gini:  0.9038088
## R^2:  0.4401992
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         X0  X1    Error       Rate
## X0     873  78 0.082019    =78/951
## X1      39 187 0.172566    =39/226
## Totals 912 265 0.099405  =117/1177
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.165189   0.761711 203
## 2                       max f2  0.059243   0.830131 282
## 3                 max f0point5  0.290356   0.792812 139
## 4                 max accuracy  0.290356   0.909941 139
## 5                max precision  1.000000   1.000000   0
## 6                   max recall  0.026814   1.000000 336
## 7              max specificity  1.000000   1.000000   0
## 8             max absolute_mcc  0.165189   0.702946 203
## 9   max min_per_class_accuracy  0.106160   0.870662 236
## 10 max mean_per_class_accuracy  0.086885   0.886044 253
## 11                     max tns  1.000000 951.000000   0
## 12                     max fns  1.000000 224.000000   0
## 13                     max fps  0.000005 951.000000 399
## 14                     max tps  0.026814 226.000000 336
## 15                     max tnr  1.000000   1.000000   0
## 16                     max fnr  1.000000   0.991150   0
## 17                     max fpr  0.000005   1.000000 399
## 18                     max tpr  0.026814   1.000000 336
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

DEEP LEARNING:

fit.nn <- h2o.deeplearning(x = predictors, y = response, training_frame = itrain,
                           validation_frame = itest, balance_classes = TRUE, seed = 12345)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%

# Evaluation performance
fit.nn

## Model Details:
## ==============
## 
## H2OBinomialModel: deeplearning
## Model ID:  DeepLearning_model_R_1580512297754_106 
## Status of Neuron Layers: predicting y, 2-class classification, bernoulli distribution, CrossEntropy loss, 43.402 weights/biases, 518,7 KB, 76.430 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms momentum
## 1     1    13     Input  0.00 %       NA       NA        NA       NA       NA
## 2     2   200 Rectifier  0.00 % 0.000000 0.000000  0.003801 0.001774 0.000000
## 3     3   200 Rectifier  0.00 % 0.000000 0.000000  0.076288 0.103208 0.000000
## 4     4     2   Softmax      NA 0.000000 0.000000  0.003362 0.001877 0.000000
##   mean_weight weight_rms mean_bias bias_rms
## 1          NA         NA        NA       NA
## 2   -0.003673   0.128088  0.236593 0.097845
## 3   -0.030226   0.099299  0.880866 0.156410
## 4    0.006480   0.370943  0.000550 0.094676
## 
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
## 
## MSE:  0.1491253
## RMSE:  0.3861674
## LogLoss:  0.4734618
## Mean Per-Class Error:  0.1362267
## AUC:  0.9452189
## AUCPR:  0.9352998
## Gini:  0.8904378
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          X0   X1    Error        Rate
## X0     3070  750 0.196335   =750/3820
## X1      291 3532 0.076118   =291/3823
## Totals 3361 4282 0.136203  =1041/7643
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.069172    0.871561 338
## 2                       max f2  0.022442    0.918023 371
## 3                 max f0point5  0.306695    0.886163 245
## 4                 max accuracy  0.109100    0.867984 318
## 5                max precision  0.999604    1.000000   0
## 6                   max recall  0.001587    1.000000 396
## 7              max specificity  0.999604    1.000000   0
## 8             max absolute_mcc  0.109100    0.736165 318
## 9   max min_per_class_accuracy  0.117126    0.865183 313
## 10 max mean_per_class_accuracy  0.109100    0.867979 318
## 11                     max tns  0.999604 3820.000000   0
## 12                     max fns  0.999604 3787.000000   0
## 13                     max fps  0.000146 3820.000000 399
## 14                     max tps  0.001587 3823.000000 396
## 15                     max tnr  0.999604    1.000000   0
## 16                     max fnr  0.999604    0.990583   0
## 17                     max fpr  0.000146    1.000000 399
## 18                     max tpr  0.001587    1.000000 396
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
## 
## MSE:  0.09111745
## RMSE:  0.3018567
## LogLoss:  0.3269739
## Mean Per-Class Error:  0.1933363
## AUC:  0.8908904
## AUCPR:  0.7215676
## Gini:  0.7817807
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         X0  X1    Error       Rate
## X0     861  90 0.094637    =90/951
## X1      66 160 0.292035    =66/226
## Totals 927 250 0.132540  =156/1177
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.244101   0.672269 201
## 2                       max f2  0.083636   0.720966 283
## 3                 max f0point5  0.639524   0.736544  91
## 4                 max accuracy  0.595380   0.884452 106
## 5                max precision  0.999569   1.000000   0
## 6                   max recall  0.000205   1.000000 398
## 7              max specificity  0.999569   1.000000   0
## 8             max absolute_mcc  0.346157   0.597573 167
## 9   max min_per_class_accuracy  0.090171   0.800885 279
## 10 max mean_per_class_accuracy  0.202298   0.813212 221
## 11                     max tns  0.999569 951.000000   0
## 12                     max fns  0.999569 225.000000   0
## 13                     max fps  0.000069 951.000000 399
## 14                     max tps  0.000205 226.000000 398
## 15                     max tnr  0.999569   1.000000   0
## 16                     max fnr  0.999569   0.995575   0
## 17                     max fpr  0.000069   1.000000 399
## 18                     max tpr  0.000205   1.000000 398
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

GBM:

fit.gbm <- h2o.gbm(x = predictors, y = response, training_frame = itrain,
                           validation_frame = itest, balance_classes = TRUE, seed = 12345)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================================| 100%

# Evaluation performance
fit.gbm

## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  GBM_model_R_1580512297754_111 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              50                       50               20719         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000         20         32    28.28000
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.1271649
## RMSE:  0.3566019
## LogLoss:  0.3933718
## Mean Per-Class Error:  0.09291098
## AUC:  0.9633733
## AUCPR:  0.9592815
## Gini:  0.9267467
## R^2:  0.4913404
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          X0   X1    Error       Rate
## X0     3314  506 0.132461  =506/3820
## X1      204 3619 0.053361  =204/3823
## Totals 3518 4125 0.092895  =710/7643
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.133319    0.910669 301
## 2                       max f2  0.072990    0.938093 335
## 3                 max f0point5  0.217954    0.910984 267
## 4                 max accuracy  0.191579    0.908282 277
## 5                max precision  0.927522    1.000000   0
## 6                   max recall  0.022366    1.000000 382
## 7              max specificity  0.927522    1.000000   0
## 8             max absolute_mcc  0.133319    0.816758 301
## 9   max min_per_class_accuracy  0.195194    0.906095 275
## 10 max mean_per_class_accuracy  0.191579    0.908281 277
## 11                     max tns  0.927522 3820.000000   0
## 12                     max fns  0.927522 3819.000000   0
## 13                     max fps  0.006498 3820.000000 399
## 14                     max tps  0.022366 3823.000000 382
## 15                     max tnr  0.927522    1.000000   0
## 16                     max fnr  0.927522    0.998954   0
## 17                     max fpr  0.006498    1.000000 399
## 18                     max tpr  0.022366    1.000000 382
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.07695371
## RMSE:  0.2774053
## LogLoss:  0.2533059
## Mean Per-Class Error:  0.1430865
## AUC:  0.9378205
## AUCPR:  0.7933898
## Gini:  0.8756409
## R^2:  0.5039865
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         X0  X1    Error       Rate
## X0     864  87 0.091483    =87/951
## X1      44 182 0.194690    =44/226
## Totals 908 269 0.111300  =131/1177
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.273834   0.735354 193
## 2                       max f2  0.108992   0.813323 254
## 3                 max f0point5  0.513181   0.762527 119
## 4                 max accuracy  0.513181   0.898895 119
## 5                max precision  0.891477   1.000000   0
## 6                   max recall  0.028563   1.000000 353
## 7              max specificity  0.891477   1.000000   0
## 8             max absolute_mcc  0.273834   0.669605 193
## 9   max min_per_class_accuracy  0.151069   0.860147 229
## 10 max mean_per_class_accuracy  0.129747   0.873931 239
## 11                     max tns  0.891477 951.000000   0
## 12                     max fns  0.891477 225.000000   0
## 13                     max fps  0.006635 951.000000 399
## 14                     max tps  0.028563 226.000000 353
## 15                     max tnr  0.891477   1.000000   0
## 16                     max fnr  0.891477   0.995575   0
## 17                     max fpr  0.006635   1.000000 399
## 18                     max tpr  0.028563   1.000000 353
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Visualization of results

Results are explained by the confusion matrix and the F1-score as better evaluation metric for imbalanced data set.

# Back to R workspace
trainy_true <- as.vector(itrain$y)
testy_true <- as.vector(itest$y)

GLM

train.glm <- as.data.table((h2o.predict(fit.glm, itrain)))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

test.glm <- as.data.table((h2o.predict(fit.glm, itest)))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

train_prediction.glm <- as.vector(train.glm[[1]])
prediction.glm <- as.vector(test.glm[[1]])
F1_train <- F1_Score(train_prediction.glm, trainy_true)
F1_test <- F1_Score(prediction.glm, testy_true)
cat(paste('F1_train_glm:',F1_train,'F1_test_glm:', F1_test), sep='\n')

## F1_train_glm: 0.875391032325339 F1_test_glm: 0.887840670859539

RANDOM FOREST

train.rf <- as.data.table((h2o.predict(fit.rf, itrain)))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

test.rf <- as.data.table((h2o.predict(fit.rf, itest)))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

train_prediction.rf <- as.vector(train.rf[[1]])
prediction.rf <- as.vector(test.rf[[1]])
F1_train <- F1_Score(train_prediction.rf, trainy_true)
F1_test <- F1_Score(prediction.rf, testy_true)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')

## F1_train_rf: 0.998820600183462 F1_test_rf: 0.936695278969957

DEEP LEARNING

train.nn <- as.data.table((h2o.predict(fit.nn, itrain)))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

test.nn <- as.data.table((h2o.predict(fit.nn, itest)))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

train_prediction.nn <- as.vector(train.nn[[1]])
prediction.nn <- as.vector(test.nn[[1]])
F1_train <- F1_Score(train_prediction.nn, trainy_true)
F1_test <- F1_Score(prediction.nn, testy_true)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')

## F1_train_rf: 0.937467157120336 F1_test_rf: 0.917021276595745

GBM

train.gbm <- as.data.table((h2o.predict(fit.gbm, itrain)))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

test.gbm <- as.data.table((h2o.predict(fit.gbm, itest)))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

train_prediction.gbm <- as.vector(train.gbm[[1]])
prediction.gbm <- as.vector(test.gbm[[1]])
F1_train <- F1_Score(train_prediction.gbm, trainy_true)
F1_test <- F1_Score(prediction.gbm, testy_true)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')

## F1_train_rf: 0.942883770469977 F1_test_rf: 0.929032258064516

Confusion Matrix Plots

par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, testy_true)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, testy_true)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nn, testy_true)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, testy_true)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GBM Confusion Matrix")

h2o.shutdown(prompt = TRUE)

## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?

## [1] TRUE

References:

Article

Github repository

Robert Tibshirani, Trevor Hastie, Daniela Witten, Gareth James, “An Introduction to Statistical Learning: With Applications in R”, 2013

Max Kuhn, Kjell Johnson, “Applied Predictive Modeling”, 2013

https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/

https://machinelearningmastery.com/what-is-imbalanced-classification/

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

How to face a majority class greater than a minority class in a classification predictive modeling: sampling methods with H2O

Claudio G. Giancaterino

28/1/2020

Challenge

Prepare Workspace

Upload Dataset

Summarize Data set

Formatting Features and managing some levels

Handling missing values

Input boolean variables for features with NA’s

Input missing values with median for numerical columns and with the most common level for categorical columns

Check results

Formatting new features and managing some levels

Split data set into categorical, boolean and numerical variables

Summarize the class distribution of the target variable

Visualize data

Analysis for categorical features (barplot, univariate analysis, bivariate analysis)

Univariate Analysis

Bivariate Analysis with Feature Selection Analysis

Visualization of Bivariate Analysis

One-hot encoding on categorical features

Remove correlated levels from boolean features

Analysis for numerical features (univariate analysis, bivariate analysis)

Univariate Analysis, histograms

Univariate Analysis, boxplots

Univariate Analysis, densityplots

Bivariate Analysis

Visualization of Bivariate Analysis

Handling outliers

Delete Zero-and Near Zero-Variance Predictors

Correlation

Visualization

Delete correlated features

Pre-processing

Sampling: evaluating models by H2O

Transform data to h2o data frame

Partition the data into training and test sets

GLM:

RANDOM FOREST:

DEEP LEARNING:

GBM:

Visualization of results

GLM

RANDOM FOREST

DEEP LEARNING

GBM

Confusion Matrix Plots

References: