Challenge

I’ve used a data set from Kaggle: the Home Equity data set (HMEQ). It contains loan performance information for 5,960 recent home equity loans. The target variable (BAD) is a binary variable indicating whether an applicant eventually defaulted, and 12 input variables are reported for each applicant. The task is to predict which clients will default on their loans.

Below is a description of the variables:

-BAD: 1 = client defaulted on loan; 0 = loan repaid

-LOAN: Amount of the loan request

-MORTDUE: Amount due on existing mortgage

-VALUE: Value of current property

-REASON: DebtCon = debt consolidation; HomeImp = home improvement

-JOB: Six occupational categories

-YOJ: Years at present job

-DEROG: Number of major derogatory reports

-DELINQ: Number of delinquent credit lines

-CLAGE: Age of oldest trade line in months

-NINQ: Number of recent credit inquiries

-CLNO: Number of credit lines

-DEBTINC: Debt-to-income ratio

Prepare Workspace

suppressWarnings({library(ggplot2)})
suppressWarnings({library(tidyverse)})
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## v purrr   0.3.3
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
suppressWarnings({library(caret)})
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
suppressWarnings({library(corrplot)})
## corrplot 0.84 loaded
suppressWarnings({library(gridExtra)})
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
suppressWarnings({library(MLmetrics)})
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall

Load Data Set

path <- "C:/Users/user/Documents/eRUM2020/hmeq.csv"
df <- read.csv(path)

Summarize Data set

# Dimensions of data set
dim(df)
## [1] 5960   13
# List types for each attribute
sapply(df, class)
##       BAD      LOAN   MORTDUE     VALUE    REASON       JOB       YOJ     DEROG 
## "integer" "integer" "numeric" "numeric"  "factor"  "factor" "numeric" "integer" 
##    DELINQ     CLAGE      NINQ      CLNO   DEBTINC 
## "integer" "numeric" "integer" "integer" "numeric"
# Take a peek at the first rows of the data set
head(df,5)
##   BAD LOAN MORTDUE  VALUE  REASON    JOB  YOJ DEROG DELINQ     CLAGE NINQ CLNO
## 1   1 1100   25860  39025 HomeImp  Other 10.5     0      0  94.36667    1    9
## 2   1 1300   70053  68400 HomeImp  Other  7.0     0      2 121.83333    0   14
## 3   1 1500   13500  16700 HomeImp  Other  4.0     0      0 149.46667    1   10
## 4   1 1500      NA     NA                  NA    NA     NA        NA   NA   NA
## 5   0 1700   97800 112000 HomeImp Office  3.0     0      0  93.33333    0   14
##   DEBTINC
## 1      NA
## 2      NA
## 3      NA
## 4      NA
## 5      NA
# Summarize attribute distributions
summary(df)
##       BAD              LOAN          MORTDUE           VALUE       
##  Min.   :0.0000   Min.   : 1100   Min.   :  2063   Min.   :  8000  
##  1st Qu.:0.0000   1st Qu.:11100   1st Qu.: 46276   1st Qu.: 66076  
##  Median :0.0000   Median :16300   Median : 65019   Median : 89236  
##  Mean   :0.1995   Mean   :18608   Mean   : 73761   Mean   :101776  
##  3rd Qu.:0.0000   3rd Qu.:23300   3rd Qu.: 91488   3rd Qu.:119824  
##  Max.   :1.0000   Max.   :89900   Max.   :399550   Max.   :855909  
##                                   NA's   :518      NA's   :112     
##      REASON          JOB            YOJ             DEROG        
##         : 252          : 279   Min.   : 0.000   Min.   : 0.0000  
##  DebtCon:3928   Mgr    : 767   1st Qu.: 3.000   1st Qu.: 0.0000  
##  HomeImp:1780   Office : 948   Median : 7.000   Median : 0.0000  
##                 Other  :2388   Mean   : 8.922   Mean   : 0.2546  
##                 ProfExe:1276   3rd Qu.:13.000   3rd Qu.: 0.0000  
##                 Sales  : 109   Max.   :41.000   Max.   :10.0000  
##                 Self   : 193   NA's   :515      NA's   :708      
##      DELINQ            CLAGE             NINQ             CLNO     
##  Min.   : 0.0000   Min.   :   0.0   Min.   : 0.000   Min.   : 0.0  
##  1st Qu.: 0.0000   1st Qu.: 115.1   1st Qu.: 0.000   1st Qu.:15.0  
##  Median : 0.0000   Median : 173.5   Median : 1.000   Median :20.0  
##  Mean   : 0.4494   Mean   : 179.8   Mean   : 1.186   Mean   :21.3  
##  3rd Qu.: 0.0000   3rd Qu.: 231.6   3rd Qu.: 2.000   3rd Qu.:26.0  
##  Max.   :15.0000   Max.   :1168.2   Max.   :17.000   Max.   :71.0  
##  NA's   :580       NA's   :308      NA's   :510      NA's   :222   
##     DEBTINC        
##  Min.   :  0.5245  
##  1st Qu.: 29.1400  
##  Median : 34.8183  
##  Mean   : 33.7799  
##  3rd Qu.: 39.0031  
##  Max.   :203.3121  
##  NA's   :1267
# Summarize data structure
str(df)
## 'data.frame':    5960 obs. of  13 variables:
##  $ BAD    : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ LOAN   : int  1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
##  $ MORTDUE: num  25860 70053 13500 NA 97800 ...
##  $ VALUE  : num  39025 68400 16700 NA 112000 ...
##  $ REASON : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
##  $ JOB    : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
##  $ YOJ    : num  10.5 7 4 NA 3 9 5 11 3 16 ...
##  $ DEROG  : int  0 0 0 NA 0 0 3 0 0 0 ...
##  $ DELINQ : int  0 2 0 NA 0 0 2 0 2 0 ...
##  $ CLAGE  : num  94.4 121.8 149.5 NA 93.3 ...
##  $ NINQ   : int  1 0 1 NA 0 1 1 0 1 0 ...
##  $ CLNO   : int  9 14 10 NA 14 8 17 8 12 13 ...
##  $ DEBTINC: num  NA NA NA NA NA ...

Formatting Features and managing some levels

BAD <- df$BAD <- as.factor(df$BAD)
df$LOAN <- as.numeric(df$LOAN)
df$DEROG <- as.factor(df$DEROG)
df$DELINQ <- as.factor(df$DELINQ)
df$NINQ <- as.factor(df$NINQ)
df$CLNO <- as.factor(df$CLNO)
df$JOB[df$JOB == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$JOB == "", value = structure(c(4L, 4L, :
## invalid factor level, NA generated
df$REASON[df$REASON == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$REASON == "", value = structure(c(3L, :
## invalid factor level, NA generated

Handling missing values

My approach: I’ve filled in missing values with the median for numerical variables and with the most common level for categorical ones. In addition, for each variable with missing values I’ve created a Boolean feature equal to 1 (the value was missing) or 0 (an actual value was present), following an approach suggested by Pawel Grabinski.

mi_summary <- function(data_frame){
  # Compute the percentage of missing values for each column
  mi_summary <- c()
  for (col in colnames(data_frame)){
    mi_summary <- c(mi_summary, mean(is.na(data_frame[,col])) * 100)
  }
  # Keep only the columns that actually contain missing values
  mi_summary_new <- mi_summary[mi_summary > 0]
  mi_summary_cols <- colnames(data_frame)[mi_summary > 0]
  # Return them sorted by percentage of missing values, highest first
  mi_summary <- data.frame('col_name' = mi_summary_cols, 'perc_missing' = mi_summary_new)
  mi_summary <- mi_summary[order(mi_summary[,2], decreasing = TRUE), ]
  mi_summary[,2] <- round(mi_summary[,2], 6)
  rownames(mi_summary) <- NULL
  return(mi_summary)
}
missing_summary <- mi_summary(df)
missing_summary
##    col_name perc_missing
## 1   DEBTINC    21.258389
## 2     DEROG    11.879195
## 3    DELINQ     9.731544
## 4   MORTDUE     8.691275
## 5       YOJ     8.640940
## 6      NINQ     8.557047
## 7     CLAGE     5.167785
## 8       JOB     4.681208
## 9    REASON     4.228188
## 10     CLNO     3.724832
## 11    VALUE     1.879195

Create Boolean indicator variables for features with NA’s

df <- df %>%
  mutate(DEBTINC_NA = ifelse(is.na(DEBTINC),1,0)) %>%
  mutate(DEROG_NA = ifelse(is.na(DEROG),1,0)) %>%
  mutate(DELINQ_NA = ifelse(is.na(DELINQ),1,0)) %>%
  mutate(MORTDUE_NA = ifelse(is.na(MORTDUE),1,0)) %>%
  mutate(YOJ_NA = ifelse(is.na(YOJ),1,0)) %>%
  mutate(NINQ_NA = ifelse(is.na(NINQ),1,0)) %>%
  mutate(CLAGE_NA = ifelse(is.na(CLAGE),1,0)) %>%
  mutate(CLNO_NA = ifelse(is.na(CLNO),1,0)) %>%
  mutate(VALUE_NA = ifelse(is.na(VALUE),1,0)) %>%
  mutate(JOB_NA = ifelse(is.na(JOB),1,0)) %>%
  mutate(REASON_NA = ifelse(is.na(REASON),1,0))
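
The same indicators can also be built in one short loop instead of one mutate() call per column. A minimal sketch, assuming missing_summary$col_name (computed above) lists exactly the columns that contain missing values:

# Build the *_NA indicator columns in a loop (equivalent to the mutate() chain above)
na_cols <- as.character(missing_summary$col_name)
for (col in na_cols) {
  df[[paste0(col, "_NA")]] <- ifelse(is.na(df[[col]]), 1, 0)
}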

Impute missing values with the median for numerical columns and with the most common level for categorical columns

for (col in missing_summary$col_name){
  if (class(df[,col]) == 'factor'){
    # Categorical column: impute with the most common level
    unique_levels <- unique(df[,col])
    df[is.na(df[,col]), col] <- unique_levels[which.max(tabulate(match(df[,col], unique_levels)))]
  } else {
    # Numerical column: impute with the median
    df[is.na(df[,col]),col] <- median(as.numeric(df[,col]), na.rm = TRUE)
  }
}

Check results

pMiss <- function(x){sum(is.na(x))/length(x)*100}
pMiss <- apply(df,2,pMiss)
pMiss <- pMiss[pMiss > 0]
pMiss <- pMiss[order(pMiss, decreasing=T)]
pMiss
## named numeric(0)

Formatting new features and managing some levels

df$DEROG_NA <- as.factor(df$DEROG_NA)
df$DEBTINC_NA <- as.factor(df$DEBTINC_NA)
df$DELINQ_NA <- as.factor(df$DELINQ_NA)
df$MORTDUE_NA <- as.factor(df$MORTDUE_NA)
df$YOJ_NA <- as.factor(df$YOJ_NA)
df$NINQ_NA <- as.factor(df$NINQ_NA)
df$CLAGE_NA <- as.factor(df$CLAGE_NA)
df$CLNO_NA <- as.factor(df$CLNO_NA)
df$VALUE_NA <- as.factor(df$VALUE_NA)
df$JOB_NA <- as.factor(df$JOB_NA)
df$REASON_NA <- as.factor(df$REASON_NA)
df$JOB <- factor(df$JOB, labels=c('Mgr','Office','Other','ProfExe','Sales','Self'))
df$REASON <- factor(df$REASON, labels=c('DebtCon','HomeImp'))

Split data set into categorical, boolean and numerical variables

I’ve grouped the variables into numerical, categorical and Boolean subsets for the exploratory data analysis.

cat <- df[,sapply(df, is.factor)] %>%
  select_if(~nlevels(.) <=15 ) %>%
  select(-BAD)
bol <- df[,c('DEBTINC_NA','DEROG_NA','DELINQ_NA','MORTDUE_NA','YOJ_NA','NINQ_NA','CLAGE_NA','CLNO_NA','VALUE_NA','JOB_NA','REASON_NA')]
num <- df[,sapply(df, is.numeric)]

Summarize the class distribution of the target variable

The target variable is grouped into two classes: “Loan defaulted (1)” and “Loan repaid (0)”. As the frequency table and barplot show, it is quite imbalanced: about 80% of the loans were repaid and 20% defaulted.

Visualize data
cbind(freq=table(df$BAD), percentage=prop.table(table(df$BAD))*100)
##   freq percentage
## 0 4771   80.05034
## 1 1189   19.94966
ggplot(df, aes(BAD, fill=BAD)) + geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Distribution of Target variable")

Analysis for categorical features (barplot, univariate analysis, bivariate analysis)

I’ve grouped all categorical features into a new subset: I’ve done a graphical analysis using barplots and counted the frequency of each class. For the bivariate analysis I’ve used a Chi-Square test to evaluate the relationship between the target variable and each categorical feature. The test statistic measures how much the observed counts differ from the counts we would expect if there were no relationship at all in the population.

Univariate Analysis
cat <- cat[,c('DELINQ','REASON','JOB','DEROG')]
for(i in 1:length(cat)) {
  counts <- table(cat[,i])
  name <- names(cat)[i]
  barplot(counts, main=name, col=c("blue","red","green","orange","purple"))
}

Bivariate Analysis with Feature Selection Analysis

The Chi-Square test is used for feature selection, testing the null hypothesis of independence between the target variable and each categorical feature. The goal is to test whether the two classifications are independent: two classifications are independent if the distribution of one, with respect to a classification principle, is not influenced by the other. If the null hypothesis is not rejected (p-value > 0.05), the two classifications are considered independent and the feature can be dropped.
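
Before looping through each feature below, the chi-square p-values can also be gathered into a single sorted table. A minimal sketch, using the cat subset and the BAD factor defined above:

# Collect the chi-square p-value for every categorical feature in one pass
chisq_pvalues <- sapply(cat, function(x) {
  suppressWarnings(chisq.test(table(BAD, x))$p.value)
})
sort(chisq_pvalues)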

par(mfrow=c(2,2))
for(i in 1:length(cat)){
  freq=table(cat[,i])
  percentage=prop.table(table(cat[,i]))*100
  freq_cat_outcome=table(BAD,cat[,i])
  name <- names(cat)[i]
  cat(sep="\n")
  cat(paste("Distribution of", name), sep="\n")
  print(cbind(freq,percentage))
  cat(sep="\n")
  cat(paste("Distribution by Target variable and", name), sep="\n")
  print(freq_cat_outcome)
  cat(sep="\n")
  cat(paste("Chi-squared test by Target variable and", name), sep="\n")
  suppressWarnings({print(chisq.test(table(BAD,cat[,i])))})
}
## 
## Distribution of DELINQ
##    freq  percentage
## 0  4759 79.84899329
## 1   654 10.97315436
## 2   250  4.19463087
## 3   129  2.16442953
## 4    78  1.30872483
## 5    38  0.63758389
## 6    27  0.45302013
## 7    13  0.21812081
## 8     5  0.08389262
## 10    2  0.03355705
## 11    2  0.03355705
## 12    1  0.01677852
## 13    1  0.01677852
## 15    1  0.01677852
## 
## Distribution by Target variable and DELINQ
##    
## BAD    0    1    2    3    4    5    6    7    8   10   11   12   13   15
##   0 4104  432  138   58   32    7    0    0    0    0    0    0    0    0
##   1  655  222  112   71   46   31   27   13    5    2    2    1    1    1
## 
## Chi-squared test by Target variable and DELINQ
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 763.8, df = 13, p-value < 2.2e-16
## 
## 
## Distribution of REASON
##         freq percentage
## DebtCon 4180   70.13423
## HomeImp 1780   29.86577
## 
## Distribution by Target variable and REASON
##    
## BAD DebtCon HomeImp
##   0    3387    1384
##   1     793     396
## 
## Chi-squared test by Target variable and REASON
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(BAD, cat[, i])
## X-squared = 8.1852, df = 1, p-value = 0.004223
## 
## 
## Distribution of JOB
##         freq percentage
## Mgr      767  12.869128
## Office   948  15.906040
## Other   2667  44.748322
## ProfExe 1276  21.409396
## Sales    109   1.828859
## Self     193   3.238255
## 
## Distribution by Target variable and JOB
##    
## BAD  Mgr Office Other ProfExe Sales Self
##   0  588    823  2090    1064    71  135
##   1  179    125   577     212    38   58
## 
## Chi-squared test by Target variable and JOB
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 73.815, df = 5, p-value = 1.644e-14
## 
## 
## Distribution of DEROG
##    freq  percentage
## 0  5235 87.83557047
## 1   435  7.29865772
## 2   160  2.68456376
## 3    58  0.97315436
## 4    23  0.38590604
## 5    15  0.25167785
## 6    15  0.25167785
## 7     8  0.13422819
## 8     6  0.10067114
## 9     3  0.05033557
## 10    2  0.03355705
## 
## Distribution by Target variable and DEROG
##    
## BAD    0    1    2    3    4    5    6    7    8    9   10
##   0 4394  266   78   15    5    8    5    0    0    0    0
##   1  841  169   82   43   18    7   10    8    6    3    2
## 
## Chi-squared test by Target variable and DEROG
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 503.99, df = 10, p-value < 2.2e-16
Visualization of Bivariate Analysis
pl1 <- cat %>%
  ggplot(aes(x=BAD, y=DELINQ, fill=BAD)) + 
  geom_bar(stat='identity') + 
  ggtitle("Distribution by BAD and DELINQ")
pl2 <- cat %>%
  ggplot(aes(x=BAD, y=REASON, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and REASON")
pl3 <- cat %>%
  ggplot(aes(x=BAD, y=JOB, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and JOB")
pl4 <- cat %>%
  ggplot(aes(x=BAD, y=DEROG, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and DEROG")
par(mfrow=c(2,2))
grid.arrange(pl1,pl2,pl3,pl4, ncol=2)

One-hot encoding on categorical features

I’ve transformed the categorical features into numerical variables using one-hot encoding, so that the machine learning models can make better use of them.

dmy <- dummyVars("~.", data = cat,fullRank = F)
cat_num <- data.frame(predict(dmy, newdata = cat))

Remove correlated levels from boolean features

drop_cols <- c('DEBTINC_NA.0','DEROG_NA.0','DELINQ_NA.0','MORTDUE_NA.0','YOJ_NA.0','NINQ_NA.0','CLAGE_NA.0','CLNO_NA.0','VALUE_NA.0','JOB_NA.0','REASON_NA.0')
categorical <- cat_num[,!colnames(cat_num) %in% drop_cols]
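
An alternative sketch: dummyVars() can drop a reference level for each factor itself via fullRank = TRUE, instead of removing redundant columns by hand (note that this drops the first level of every factor, so the resulting column set may differ slightly from the manual drop list above):

# Let dummyVars produce a full-rank parameterization (one dummy column fewer per factor)
dmy_full <- dummyVars("~.", data = cat, fullRank = TRUE)
cat_full <- data.frame(predict(dmy_full, newdata = cat))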

Analysis for numerical features (univariate analysis, bivariate analysis)

I’ve grouped all numerical features into a new subset and done a graphical analysis using histograms, density plots and boxplots; I’ve also computed metrics such as standard deviation, skewness and kurtosis for each variable. For the bivariate analysis I’ve used an ANOVA test to evaluate the relationship between the target variable and each numerical feature. This test assesses whether the means of two or more groups are statistically different.
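
In the same spirit as the chi-square table above, the ANOVA p-values used further below can be collected in one pass. A minimal sketch, using the same formula as the per-feature loop (as.numeric(BAD) ~ feature):

# summary(aov(...))[[1]] is the ANOVA table; its "Pr(>F)" column holds the p-value of the feature term
anova_pvalues <- sapply(num, function(x) {
  summary(aov(as.numeric(BAD) ~ x))[[1]][["Pr(>F)"]][1]
})
sort(anova_pvalues)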

Univariate Analysis, histograms
par(mfrow=c(2,3))
for(i in 1:length(num)) {
  hist(num[,i], main=names(num)[i], col='blue')
}

Univariate Analysis, boxplots
par(mfrow=c(2,3))
for(i in 1:length(num)) {
  boxplot(num[,i], main=names(num)[i], col='orange')
}

Univariate Analysis, densityplots
par(mfrow=c(2,3))
for(i in 1:length(num)){
  plot(density(num[,i]), main=names(num)[i], col='red')
}

Bivariate Analysis
for(i in 1:length(num)){
  name <- names(num)[i]
  cat(paste("Distribution of", name), sep="\n")
  #cat(names(num)[i],sep = "\n")
  print(summary(num[,i]))
  cat(sep="\n")
  stand.deviation = sd(num[,i])
  variance = var(num[,i])
  skewness = mean((num[,i] - mean(num[,i]))^3/sd(num[,i])^3)
  kurtosis = mean((num[,i] - mean(num[,i]))^4/sd(num[,i])^4) - 3
  outlier_values <- sum(table(boxplot.stats(num[,i])$out))
  cat(paste("Statistical analysis of", name), sep="\n")
  print(cbind(stand.deviation, variance, skewness, kurtosis, outlier_values))
  cat(sep="\n")
  cat(paste("anova_test between BAD and", name),sep = "\n")
  print(summary(aov(as.numeric(BAD)~num[,i], data=num)))
  cat(sep="\n")
}
## Distribution of LOAN
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1100   11100   16300   18608   23300   89900 
## 
## Statistical analysis of LOAN
##      stand.deviation  variance skewness kurtosis outlier_values
## [1,]        11207.48 125607617 2.022762 6.922438            256
## 
## anova_test between BAD and LOAN
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    5.4   5.368   33.79 6.45e-09 ***
## Residuals   5958  946.4   0.159                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of MORTDUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2063   48139   65019   73001   88200  399550 
## 
## Statistical analysis of MORTDUE
##      stand.deviation   variance skewness kurtosis outlier_values
## [1,]        42552.73 1810734556 1.941153 7.440693            308
## 
## anova_test between BAD and MORTDUE
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    2.0  2.0303   12.74 0.000361 ***
## Residuals   5958  949.8  0.1594                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of VALUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8000   66490   89236  101540  119005  855909 
## 
## Statistical analysis of VALUE
##      stand.deviation   variance skewness kurtosis outlier_values
## [1,]        56869.44 3234132829 3.088963 24.85644            347
## 
## anova_test between BAD and VALUE
##               Df Sum Sq Mean Sq F value  Pr(>F)   
## num[, i]       1    1.3  1.2675   7.945 0.00484 **
## Residuals   5958  950.5  0.1595                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of YOJ
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   7.000   8.756  12.000  41.000 
## 
## Statistical analysis of YOJ
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]        7.259424 52.69923 1.092072 0.744703            211
## 
## anova_test between BAD and YOJ
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    2.8  2.7710    17.4 3.08e-05 ***
## Residuals   5958  949.0  0.1593                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of CLAGE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   117.4   173.5   179.4   227.1  1168.2 
## 
## Statistical analysis of CLAGE
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]         83.5747  6984.73 1.389903 8.180556             66
## 
## anova_test between BAD and CLAGE
##               Df Sum Sq Mean Sq F value Pr(>F)    
## num[, i]       1   26.1  26.106     168 <2e-16 ***
## Residuals   5958  925.7   0.155                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of DEBTINC
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.5245  30.7632  34.8183  34.0007  37.9499 203.3122 
## 
## Statistical analysis of DEBTINC
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]        7.644528 58.43881 3.111598  64.0733            245
## 
## anova_test between BAD and DEBTINC
##               Df Sum Sq Mean Sq F value Pr(>F)    
## num[, i]       1   22.7  22.733   145.8 <2e-16 ***
## Residuals   5958  929.1   0.156                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Visualization of Bivariate Analysis
pl5 <- num %>%
  ggplot(aes(x=BAD, y=LOAN, fill=BAD)) + geom_boxplot() 
pl6 <- num %>%
  ggplot(aes(x=BAD, y=MORTDUE, fill=BAD)) + geom_boxplot() 
pl7 <- num %>%
  ggplot(aes(x=BAD, y=VALUE, fill=BAD)) + geom_boxplot() 
pl8 <- num %>%
  ggplot(aes(x=BAD, y=YOJ, fill=BAD)) + geom_boxplot() 
pl9 <- num %>%
  ggplot(aes(x=BAD, y=CLAGE, fill=BAD)) + geom_boxplot() 
pl10 <- num %>%
  ggplot(aes(x=BAD, y=DEBTINC, fill=BAD)) + geom_boxplot() 
par(mfrow=c(2,3))
grid.arrange(pl5,pl6,pl7,pl8,pl9,pl10, ncol=2)

Handling outliers

Outliers are observations that lie more than 1.5 times the interquartile range (the difference between the 75th and 25th percentiles) beyond the quartiles. If they are not detected and handled appropriately, they can distort the predictions. There are several ways to handle outliers; I’ve decided to cap them, replacing observations below the lower limit with the 5th percentile and those above the upper limit with the 95th percentile.
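
Since exactly the same capping rule is applied to every variable below, it could be wrapped in a small helper function. A minimal sketch (cap_outliers is a hypothetical name, not part of the original code):

# Cap values lying more than 1.5*IQR beyond the quartiles at the 5th / 95th percentiles
cap_outliers <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  x
}
# e.g. num$LOAN <- cap_outliers(num$LOAN)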

# Before
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN")

# Managing outliers
qnt <- quantile(num$LOAN, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$LOAN, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$LOAN, na.rm = T)
num$LOAN[num$LOAN < (qnt[1] - H)]  <- caps[1]
num$LOAN[num$LOAN >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN after handled outliers")

# Before
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE")

# Managing outliers
qnt <- quantile(num$MORTDUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$MORTDUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$MORTDUE, na.rm = T)
num$MORTDUE[num$MORTDUE < (qnt[1] - H)]  <- caps[1]
num$MORTDUE[num$MORTDUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3)  + ggtitle("MORTDUE after handled outliers")

# Before
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE")

# Managing outliers
qnt <- quantile(num$VALUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$VALUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$VALUE, na.rm = T)
num$VALUE[num$VALUE < (qnt[1] - H)]  <- caps[1]
num$VALUE[num$VALUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE after handled outliers")

# Before
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ")

# Managing outliers
qnt <- quantile(num$YOJ, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$YOJ, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$YOJ, na.rm = T)
num$YOJ[num$YOJ < (qnt[1] - H)]  <- caps[1]
num$YOJ[num$YOJ >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ after handled outliers")

# Before
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE")

# Managing outliers
qnt <- quantile(num$CLAGE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$CLAGE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$CLAGE, na.rm = T)
num$CLAGE[num$CLAGE < (qnt[1] - H)]  <- caps[1]
num$CLAGE[num$CLAGE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE after handled outliers")

# Before
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC")

# Managing outliers
qnt <- quantile(num$DEBTINC, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$DEBTINC, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$DEBTINC, na.rm = T)
num$DEBTINC[num$DEBTINC < (qnt[1] - H)]  <- caps[1]
num$DEBTINC[num$DEBTINC >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC after handled outliers")

Delete Zero- and Near Zero-Variance Predictors

The goal of this step is to delete predictors that have a single unique value or only a handful of unique values occurring with very low frequency. These “zero-variance” and “near zero-variance” predictors may cause problems when fitting the algorithms (tree-based models excluded).

data <- cbind(categorical,num) 
nzv <- nearZeroVar(data, saveMetrics= TRUE)
nzv[nzv$nzv,][1:15,]
##            freqRatio percentUnique zeroVar  nzv
## DELINQ.2    22.84000    0.03355705   FALSE TRUE
## DELINQ.3    45.20155    0.03355705   FALSE TRUE
## DELINQ.4    75.41026    0.03355705   FALSE TRUE
## DELINQ.5   155.84211    0.03355705   FALSE TRUE
## DELINQ.6   219.74074    0.03355705   FALSE TRUE
## DELINQ.7   457.46154    0.03355705   FALSE TRUE
## DELINQ.8  1191.00000    0.03355705   FALSE TRUE
## DELINQ.10 2979.00000    0.03355705   FALSE TRUE
## DELINQ.11 2979.00000    0.03355705   FALSE TRUE
## DELINQ.12 5959.00000    0.03355705   FALSE TRUE
## DELINQ.13 5959.00000    0.03355705   FALSE TRUE
## DELINQ.15 5959.00000    0.03355705   FALSE TRUE
## JOB.Sales   53.67890    0.03355705   FALSE TRUE
## JOB.Self    29.88083    0.03355705   FALSE TRUE
## DEROG.2     36.25000    0.03355705   FALSE TRUE
nzv <- nearZeroVar(data)
data_new <- data[, -nzv]
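
For intuition, the two screening statistics reported by nearZeroVar() can be reproduced by hand for a single column. A minimal sketch, using the DELINQ.2 dummy as an example:

# freqRatio: frequency of the most common value over the second most common value
# percentUnique: percentage of distinct values among all observations
x <- data$DELINQ.2
tab <- sort(table(x), decreasing = TRUE)
c(freqRatio = as.numeric(tab[1] / tab[2]),
  percentUnique = 100 * length(unique(x)) / length(x))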

Correlation

Another feature selection approach is to look at the correlation between variables, which I apply to the whole data set. In some models, such as linear regression, correlated features can deteriorate performance (multicollinearity). Although some models, such as ensembles of tree-based models, are not sensitive to this issue, I prefer to remove highly correlated features anyway because I don’t know in advance which model I will use.

Visualization
par(mfrow=c(1,1))
cor <- cor(data_new,use="complete.obs",method = "spearman")
corrplot(cor, type="lower", tl.col = "black", diag=FALSE, method="number", mar = c(0, 0, 2, 0), title="Correlation") 

summary(cor[upper.tri(cor)])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.000000 -0.038116  0.005731 -0.014157  0.043268  0.792421
Delete correlated features
tmp <- cor(data_new)
tmp[upper.tri(tmp)] <- 0
diag(tmp) <- 0
df_new <- data_new[,!apply(tmp,2,function(x) any(abs(x) > 0.75))]
cor <- cor(df_new,use="complete.obs",method = "spearman")
summary(cor[upper.tri(cor)])
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.6988623 -0.0421582  0.0004314 -0.0234455  0.0307002  0.3406328
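
caret also ships a helper for this step: findCorrelation() returns the columns to drop so that no remaining pair has an absolute correlation above a cutoff. A sketch with the same 0.75 threshold (the exact set of dropped columns may differ from the manual approach above; df_new_alt is a hypothetical name):

# Let caret pick the columns to remove
high_cor <- findCorrelation(cor(data_new), cutoff = 0.75)
df_new_alt <- if (length(high_cor) > 0) data_new[, -high_cor] else data_new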

Pre-processing

To assess the performance of a model, it is good practice to split the data set into a training set and a test set. The training set is the sample of data used to fit the model; the test set is the sample of data used to provide an unbiased evaluation of the model on data it has never seen before.

# calculate the pre-process parameters from the data set
set.seed(2019)
preprocessParams <- preProcess(df_new, method=c("center", "scale"))
# Transform the data set using the parameters
transformed <- predict(preprocessParams, df_new)
# Manage levels on the target variable
y <- as.factor(df$BAD)
transformed <- cbind.data.frame(transformed,y)
levels(transformed$y) <- make.names(levels(factor(transformed$y)))
str(transformed)
## 'data.frame':    5960 obs. of  14 variables:
##  $ DELINQ.0      : num  0.502 -1.99 0.502 0.502 0.502 ...
##  $ DELINQ.1      : num  -0.351 -0.351 -0.351 -0.351 -0.351 ...
##  $ REASON.HomeImp: num  1.532 1.532 1.532 -0.653 1.532 ...
##  $ JOB.Mgr       : num  -0.384 -0.384 -0.384 -0.384 -0.384 ...
##  $ JOB.Office    : num  -0.435 -0.435 -0.435 -0.435 2.299 ...
##  $ JOB.Other     : num  1.11 1.11 1.11 1.11 -0.9 ...
##  $ JOB.ProfExe   : num  -0.522 -0.522 -0.522 -0.522 -0.522 ...
##  $ DEROG.1       : num  -0.281 -0.281 -0.281 -0.281 -0.281 ...
##  $ LOAN          : num  -1.86 -1.84 -1.81 -1.81 -1.79 ...
##  $ VALUE         : num  -1.322 -0.669 -1.818 -0.206 0.3 ...
##  $ YOJ           : num  0.279 -0.233 -0.672 -0.233 -0.819 ...
##  $ CLAGE         : num  -1.091 -0.73 -0.367 -0.052 -1.104 ...
##  $ DEBTINC       : num  0.143 0.143 0.143 0.143 0.143 ...
##  $ y             : Factor w/ 2 levels "X0","X1": 2 2 2 2 1 2 2 2 2 2 ...

Split data set

# Draw a random, stratified sample including p percent of the data
set.seed(12345)
test_index <- createDataPartition(transformed$y, p=0.80, list=FALSE)
# select 20% of the data for test
itest <- transformed[-test_index,]
# use the remaining 80% of the data to train the models
itrain <- transformed[test_index,]
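
A quick sanity check (a sketch): createDataPartition draws a stratified sample, so the roughly 80/20 class split should be preserved in both subsets.

# Class proportions in the training and test sets
rbind(train = prop.table(table(itrain$y)),
      test  = prop.table(table(itest$y)))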

Baseline Models: evaluating models by Caret

The analysis covers four baseline models: Logistic Regression, as the simplest model and a benchmark; two ensemble models (Random Forest and Gradient Boosting Machine); and a Neural Network. Each model is evaluated with 5-fold cross-validation using caret’s precision-recall summary function (prSummary), so the AUC reported below is the area under the precision-recall curve.

control <- trainControl(method="cv", number=5,classProbs = TRUE, summaryFunction = prSummary)
metric <- 'AUC'
GLM
set.seed(2019)
fit.glm <- train(y~., data=itrain, method="glm", family=binomial(link='logit'), metric=metric, trControl=control)
print(fit.glm)
## Generalized Linear Model 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3816, 3815, 3815, 3815, 3815 
## Resampling results:
## 
##   AUC        Precision  Recall     F        
##   0.9233094  0.8392055  0.9554614  0.8935387
plot(varImp(fit.glm),15, main = 'GLM feature selection')

RANDOM FOREST
set.seed(2019)
fit.rf <- train(y~., data=itrain, method="rf", metric=metric, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3816, 3815, 3815, 3815, 3815 
## Resampling results across tuning parameters:
## 
##   mtry  AUC        Precision  Recall     F        
##    2    0.9414138  0.8733717  0.9861150  0.9263060
##    7    0.9156241  0.9162927  0.9481257  0.9318733
##   13    0.8359977  0.9169983  0.9423621  0.9294455
## 
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(fit.rf),15, main='Random Forest feature selection')

NNET
set.seed(2019)
fit.nnet <- train(y~., data=itrain, method="nnet", metric=metric, trControl=control)
## # weights:  16
## initial  value 2284.294996 
## iter  10 value 1721.049223
## iter  20 value 1620.022847
## iter  30 value 1602.908584
## iter  40 value 1600.702019
## iter  50 value 1600.556801
## iter  60 value 1592.288750
## iter  70 value 1571.448943
## iter  80 value 1553.318945
## iter  90 value 1553.094547
## iter  90 value 1553.094540
## iter 100 value 1553.091712
## final  value 1553.091712 
## stopped after 100 iterations
## # weights:  46
## initial  value 2818.831525 
## iter  10 value 1590.252570
## iter  20 value 1525.321325
## iter  30 value 1499.001051
## iter  40 value 1491.491814
## iter  50 value 1483.236544
## iter  60 value 1474.932138
## iter  70 value 1456.831485
## iter  80 value 1453.734344
## iter  90 value 1452.774889
## iter 100 value 1452.712267
## final  value 1452.712267 
## stopped after 100 iterations
## # weights:  76
## initial  value 3952.580922 
## iter  10 value 1577.566284
## iter  20 value 1513.195773
## iter  30 value 1430.582005
## iter  40 value 1365.272471
## iter  50 value 1329.775858
## iter  60 value 1314.849925
## iter  70 value 1298.915217
## iter  80 value 1239.686171
## iter  90 value 1204.170844
## iter 100 value 1181.741450
## final  value 1181.741450 
## stopped after 100 iterations
## # weights:  16
## initial  value 2473.072639 
## iter  10 value 1918.276194
## iter  20 value 1761.639765
## iter  30 value 1734.113113
## iter  40 value 1579.018271
## iter  50 value 1567.888055
## iter  60 value 1554.836894
## iter  70 value 1553.938962
## iter  80 value 1553.934065
## final  value 1553.933368 
## converged
## # weights:  46
## initial  value 2956.232405 
## iter  10 value 1634.867450
## iter  20 value 1587.433297
## iter  30 value 1545.267535
## iter  40 value 1512.622039
## iter  50 value 1507.295601
## iter  60 value 1504.924409
## iter  70 value 1503.215932
## iter  80 value 1502.197944
## final  value 1502.189817 
## converged
## # weights:  76
## initial  value 3033.421221 
## iter  10 value 1564.297036
## iter  20 value 1497.232011
## iter  30 value 1465.606553
## iter  40 value 1424.770433
## iter  50 value 1385.582874
## iter  60 value 1355.919119
## iter  70 value 1305.897171
## iter  80 value 1280.678677
## iter  90 value 1261.422066
## iter 100 value 1247.401252
## final  value 1247.401252 
## stopped after 100 iterations
## # weights:  16
## initial  value 2971.515708 
## iter  10 value 1640.066836
## iter  20 value 1609.074215
## iter  30 value 1604.591971
## iter  40 value 1572.975998
## iter  50 value 1552.358276
## iter  60 value 1551.796628
## final  value 1551.795675 
## converged
## # weights:  46
## initial  value 1960.479908 
## iter  10 value 1589.707701
## iter  20 value 1532.236702
## iter  30 value 1429.893317
## iter  40 value 1405.758560
## iter  50 value 1377.639785
## iter  60 value 1357.079469
## iter  70 value 1346.556622
## iter  80 value 1340.703464
## iter  90 value 1334.678225
## iter 100 value 1329.649220
## final  value 1329.649220 
## stopped after 100 iterations
## # weights:  76
## initial  value 2799.376384 
## iter  10 value 1539.903671
## iter  20 value 1502.404931
## iter  30 value 1461.603966
## iter  40 value 1436.230406
## iter  50 value 1415.189140
## iter  60 value 1409.884568
## iter  70 value 1401.879097
## iter  80 value 1393.992459
## iter  90 value 1389.840390
## iter 100 value 1388.525407
## final  value 1388.525407 
## stopped after 100 iterations
## # weights:  16
## initial  value 2312.255110 
## iter  10 value 1636.279352
## iter  20 value 1616.842978
## iter  30 value 1586.561869
## iter  40 value 1577.460505
## iter  50 value 1574.476667
## iter  60 value 1574.401395
## iter  70 value 1574.398800
## iter  70 value 1574.398795
## iter  70 value 1574.398795
## final  value 1574.398795 
## converged
## # weights:  46
## initial  value 3150.189663 
## iter  10 value 1687.562518
## iter  20 value 1632.100646
## iter  30 value 1589.927484
## iter  40 value 1564.245600
## iter  50 value 1511.824174
## iter  60 value 1494.117482
## iter  70 value 1484.623953
## iter  80 value 1467.910193
## iter  90 value 1457.199592
## iter 100 value 1455.991366
## final  value 1455.991366 
## stopped after 100 iterations
## # weights:  76
## initial  value 2591.228412 
## iter  10 value 1562.457979
## iter  20 value 1507.237090
## iter  30 value 1479.076826
## iter  40 value 1453.553286
## iter  50 value 1432.164670
## iter  60 value 1415.268606
## iter  70 value 1402.796240
## iter  80 value 1380.973344
## iter  90 value 1364.223423
## iter 100 value 1357.564287
## final  value 1357.564287 
## stopped after 100 iterations
## # weights:  16
## initial  value 3049.855195 
## iter  10 value 1850.318163
## iter  20 value 1655.062081
## iter  30 value 1613.877675
## iter  40 value 1590.344782
## iter  50 value 1585.407726
## iter  60 value 1578.539913
## iter  70 value 1576.939678
## final  value 1576.844709 
## converged
## # weights:  46
## initial  value 2611.804912 
## iter  10 value 1600.759648
## iter  20 value 1568.198599
## iter  30 value 1543.004275
## iter  40 value 1529.937853
## iter  50 value 1526.619157
## iter  60 value 1524.742414
## iter  70 value 1523.878132
## iter  80 value 1523.290274
## iter  90 value 1523.074739
## iter 100 value 1523.051828
## final  value 1523.051828 
## stopped after 100 iterations
## # weights:  76
## initial  value 2343.390708 
## iter  10 value 1542.620964
## iter  20 value 1490.819226
## iter  30 value 1463.338517
## iter  40 value 1453.887083
## iter  50 value 1450.121000
## iter  60 value 1448.528349
## iter  70 value 1446.599006
## iter  80 value 1443.200027
## iter  90 value 1430.716703
## iter 100 value 1402.865663
## final  value 1402.865663 
## stopped after 100 iterations
## # weights:  16
## initial  value 2911.253764 
## iter  10 value 1712.538487
## iter  20 value 1648.263995
## iter  30 value 1620.932300
## iter  40 value 1618.301916
## iter  50 value 1617.915729
## final  value 1617.903297 
## converged
## # weights:  46
## initial  value 2372.891569 
## iter  10 value 1616.196110
## iter  20 value 1553.329321
## iter  30 value 1488.498374
## iter  40 value 1481.042835
## iter  50 value 1478.518628
## iter  60 value 1476.038274
## iter  70 value 1471.464241
## iter  80 value 1464.026155
## iter  90 value 1461.944285
## iter 100 value 1461.736278
## final  value 1461.736278 
## stopped after 100 iterations
## # weights:  76
## initial  value 2858.630833 
## iter  10 value 1619.540846
## iter  20 value 1517.889635
## iter  30 value 1479.302967
## iter  40 value 1463.706624
## iter  50 value 1448.222186
## iter  60 value 1443.501900
## iter  70 value 1440.805375
## iter  80 value 1430.507944
## iter  90 value 1417.950012
## iter 100 value 1410.043305
## final  value 1410.043305 
## stopped after 100 iterations
## # weights:  16
## initial  value 3483.012769 
## iter  10 value 1795.174119
## iter  20 value 1754.846888
## iter  30 value 1740.117473
## iter  40 value 1667.721601
## iter  50 value 1626.215558
## iter  60 value 1618.384787
## iter  70 value 1615.807417
## iter  80 value 1607.904710
## iter  90 value 1598.985548
## iter 100 value 1596.779369
## final  value 1596.779369 
## stopped after 100 iterations
## # weights:  46
## initial  value 3858.076785 
## iter  10 value 1637.655772
## iter  20 value 1592.606861
## iter  30 value 1538.087188
## iter  40 value 1504.191870
## iter  50 value 1498.004302
## iter  60 value 1491.407491
## iter  70 value 1480.875612
## iter  80 value 1477.288379
## iter  90 value 1474.300187
## iter 100 value 1472.541508
## final  value 1472.541508 
## stopped after 100 iterations
## # weights:  76
## initial  value 2020.469489 
## iter  10 value 1530.271001
## iter  20 value 1461.603448
## iter  30 value 1427.417722
## iter  40 value 1401.689954
## iter  50 value 1376.388963
## iter  60 value 1356.044606
## iter  70 value 1341.811193
## iter  80 value 1336.713004
## iter  90 value 1335.714349
## iter 100 value 1335.577885
## final  value 1335.577885 
## stopped after 100 iterations
## # weights:  16
## initial  value 2847.158953 
## iter  10 value 1730.435735
## iter  20 value 1624.724754
## iter  30 value 1606.369572
## iter  40 value 1600.686870
## iter  50 value 1576.495659
## iter  60 value 1560.544337
## iter  70 value 1557.373055
## final  value 1557.366261 
## converged
## # weights:  46
## initial  value 2534.597359 
## iter  10 value 1625.729707
## iter  20 value 1595.860878
## iter  30 value 1544.781302
## iter  40 value 1527.026281
## iter  50 value 1515.116089
## iter  60 value 1492.543756
## iter  70 value 1459.256203
## iter  80 value 1381.411833
## iter  90 value 1340.749561
## iter 100 value 1327.231598
## final  value 1327.231598 
## stopped after 100 iterations
## # weights:  76
## initial  value 1952.303349 
## iter  10 value 1536.434727
## iter  20 value 1506.900269
## iter  30 value 1475.950104
## iter  40 value 1451.561350
## iter  50 value 1444.238361
## iter  60 value 1424.300107
## iter  70 value 1418.790643
## iter  80 value 1417.114480
## iter  90 value 1414.524473
## iter 100 value 1411.602054
## final  value 1411.602054 
## stopped after 100 iterations
## # weights:  16
## initial  value 2009.737336 
## iter  10 value 1617.587790
## iter  20 value 1585.821364
## iter  30 value 1557.625817
## iter  40 value 1556.410237
## iter  50 value 1556.265660
## final  value 1556.200277 
## converged
## # weights:  46
## initial  value 3090.009733 
## iter  10 value 1601.234271
## iter  20 value 1538.274140
## iter  30 value 1524.210274
## iter  40 value 1510.195848
## iter  50 value 1499.183929
## iter  60 value 1491.268579
## iter  70 value 1489.653698
## iter  80 value 1488.933069
## iter  90 value 1486.702630
## iter 100 value 1484.575071
## final  value 1484.575071 
## stopped after 100 iterations
## # weights:  76
## initial  value 2752.362933 
## iter  10 value 1559.306680
## iter  20 value 1504.636400
## iter  30 value 1419.939232
## iter  40 value 1359.834842
## iter  50 value 1323.809083
## iter  60 value 1309.379393
## iter  70 value 1290.448627
## iter  80 value 1279.640886
## iter  90 value 1273.679966
## iter 100 value 1271.099052
## final  value 1271.099052 
## stopped after 100 iterations
## # weights:  16
## initial  value 3596.278028 
## iter  10 value 1762.091732
## iter  20 value 1677.886429
## iter  30 value 1664.307997
## iter  40 value 1663.187479
## iter  50 value 1661.338501
## iter  60 value 1658.951978
## final  value 1658.951368 
## converged
## # weights:  46
## initial  value 2535.157025 
## iter  10 value 1651.129545
## iter  20 value 1570.249551
## iter  30 value 1473.009382
## iter  40 value 1383.954010
## iter  50 value 1345.337844
## iter  60 value 1316.524023
## iter  70 value 1287.286336
## iter  80 value 1277.501588
## iter  90 value 1272.036548
## iter 100 value 1269.813654
## final  value 1269.813654 
## stopped after 100 iterations
## # weights:  76
## initial  value 2247.882109 
## iter  10 value 1583.896505
## iter  20 value 1470.317215
## iter  30 value 1392.280462
## iter  40 value 1352.022127
## iter  50 value 1329.794092
## iter  60 value 1307.343927
## iter  70 value 1291.650230
## iter  80 value 1280.001493
## iter  90 value 1269.783514
## iter 100 value 1251.939811
## final  value 1251.939811 
## stopped after 100 iterations
## # weights:  16
## initial  value 2160.738548 
## iter  10 value 1620.076817
## iter  20 value 1605.644701
## iter  30 value 1601.902919
## final  value 1601.900944 
## converged
## # weights:  46
## initial  value 3018.781404 
## iter  10 value 1612.539307
## iter  20 value 1582.071127
## iter  30 value 1514.211819
## iter  40 value 1477.462755
## iter  50 value 1428.795651
## iter  60 value 1395.871652
## iter  70 value 1386.496893
## iter  80 value 1383.746693
## iter  90 value 1382.454275
## iter 100 value 1380.877412
## final  value 1380.877412 
## stopped after 100 iterations
## # weights:  76
## initial  value 2208.653489 
## iter  10 value 1591.665398
## iter  20 value 1521.775407
## iter  30 value 1475.223594
## iter  40 value 1429.229290
## iter  50 value 1398.961357
## iter  60 value 1390.931271
## iter  70 value 1383.006191
## iter  80 value 1376.430114
## iter  90 value 1369.463202
## iter 100 value 1347.839951
## final  value 1347.839951 
## stopped after 100 iterations
## # weights:  16
## initial  value 2309.335909 
## iter  10 value 1738.025888
## iter  20 value 1638.643059
## iter  30 value 1606.388114
## iter  40 value 1600.557069
## iter  50 value 1600.155422
## iter  60 value 1600.037108
## iter  60 value 1600.037103
## iter  60 value 1600.037103
## final  value 1600.037103 
## converged
## # weights:  46
## initial  value 1888.304107 
## iter  10 value 1624.279708
## iter  20 value 1579.243594
## iter  30 value 1548.216591
## iter  40 value 1534.856188
## iter  50 value 1522.580232
## iter  60 value 1518.869513
## iter  70 value 1518.534368
## iter  80 value 1518.397617
## iter  90 value 1518.297716
## iter 100 value 1518.283296
## final  value 1518.283296 
## stopped after 100 iterations
## # weights:  76
## initial  value 2255.212589 
## iter  10 value 1585.234781
## iter  20 value 1517.624503
## iter  30 value 1461.612161
## iter  40 value 1415.317692
## iter  50 value 1400.973739
## iter  60 value 1385.488945
## iter  70 value 1371.295418
## iter  80 value 1366.500710
## iter  90 value 1364.128751
## iter 100 value 1362.073626
## final  value 1362.073626 
## stopped after 100 iterations
## # weights:  16
## initial  value 2641.297722 
## iter  10 value 1674.304827
## iter  20 value 1618.981065
## iter  30 value 1613.177789
## iter  40 value 1612.460166
## iter  50 value 1612.454164
## iter  50 value 1612.454150
## final  value 1612.454105 
## converged
## # weights:  46
## initial  value 2834.372292 
## iter  10 value 1586.634456
## iter  20 value 1547.455623
## iter  30 value 1531.987168
## iter  40 value 1511.583440
## iter  50 value 1496.177632
## iter  60 value 1469.245114
## iter  70 value 1453.055732
## iter  80 value 1446.785349
## iter  90 value 1439.840511
## iter 100 value 1436.866680
## final  value 1436.866680 
## stopped after 100 iterations
## # weights:  76
## initial  value 2475.660518 
## iter  10 value 1653.663448
## iter  20 value 1548.566385
## iter  30 value 1522.692367
## iter  40 value 1479.285059
## iter  50 value 1452.788250
## iter  60 value 1435.654626
## iter  70 value 1428.265158
## iter  80 value 1413.682233
## iter  90 value 1403.890576
## iter 100 value 1392.675684
## final  value 1392.675684 
## stopped after 100 iterations
## # weights:  16
## initial  value 3005.123206 
## iter  10 value 1696.448643
## iter  20 value 1638.730034
## iter  30 value 1598.610379
## iter  40 value 1577.326834
## iter  50 value 1569.802573
## iter  60 value 1568.398769
## final  value 1568.014262 
## converged
## # weights:  46
## initial  value 2856.923628 
## iter  10 value 1577.343614
## iter  20 value 1559.586309
## iter  30 value 1528.390061
## iter  40 value 1515.950574
## iter  50 value 1512.071958
## iter  60 value 1507.538777
## iter  70 value 1505.746279
## iter  80 value 1505.271045
## iter  90 value 1505.163573
## final  value 1505.157680 
## converged
## # weights:  76
## initial  value 3122.038015 
## iter  10 value 1596.162011
## iter  20 value 1490.458483
## iter  30 value 1414.812652
## iter  40 value 1387.091241
## iter  50 value 1358.900517
## iter  60 value 1332.263721
## iter  70 value 1316.989372
## iter  80 value 1284.333950
## iter  90 value 1266.394468
## iter 100 value 1254.548130
## final  value 1254.548130 
## stopped after 100 iterations
## # weights:  16
## initial  value 4246.993275 
## iter  10 value 1826.662962
## iter  20 value 1793.629776
## iter  30 value 1787.704528
## iter  40 value 1781.871304
## iter  50 value 1777.251555
## iter  60 value 1776.716576
## iter  70 value 1776.407916
## iter  80 value 1776.131487
## iter  90 value 1776.050243
## iter 100 value 1775.969880
## final  value 1775.969880 
## stopped after 100 iterations
## # weights:  46
## initial  value 2641.449583 
## iter  10 value 1573.765672
## iter  20 value 1478.232029
## iter  30 value 1448.754031
## iter  40 value 1433.118806
## iter  50 value 1424.690185
## iter  60 value 1419.647567
## iter  70 value 1417.897942
## iter  80 value 1416.154858
## iter  90 value 1400.866231
## iter 100 value 1383.827422
## final  value 1383.827422 
## stopped after 100 iterations
## # weights:  76
## initial  value 2395.317763 
## iter  10 value 1551.572596
## iter  20 value 1483.421506
## iter  30 value 1447.212508
## iter  40 value 1421.131328
## iter  50 value 1409.428599
## iter  60 value 1396.169587
## iter  70 value 1388.768959
## iter  80 value 1383.896325
## iter  90 value 1381.476090
## iter 100 value 1381.208120
## final  value 1381.208120 
## stopped after 100 iterations
## # weights:  76
## initial  value 2835.861807 
## iter  10 value 1948.117893
## iter  20 value 1910.792929
## iter  30 value 1870.880567
## iter  40 value 1850.434183
## iter  50 value 1838.960776
## iter  60 value 1830.483410
## iter  70 value 1825.908303
## iter  80 value 1822.201250
## iter  90 value 1819.433589
## iter 100 value 1817.642137
## final  value 1817.642137 
## stopped after 100 iterations
print(fit.nnet)
## Neural Network 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3816, 3815, 3815, 3815, 3815 
## Resampling results across tuning parameters:
## 
##   size  decay  AUC        Precision  Recall     F        
##   1     0e+00  0.9191526  0.8177037  0.9756387  0.8889979
##   1     1e-04  0.8862537  0.8278028  0.9661974  0.8909677
##   1     1e-01  0.9228062  0.8466382  0.9428849  0.8921627
##   3     0e+00  0.8529599  0.8656062  0.9161796  0.8894465
##   3     1e-04  0.8908519  0.8731755  0.8999331  0.8860357
##   3     1e-01  0.9341468  0.8547087  0.9507435  0.9001177
##   5     0e+00  0.9211971  0.8617435  0.9449843  0.9013797
##   5     1e-04  0.9181517  0.8620943  0.9300584  0.8943309
##   5     1e-01  0.9407425  0.8746971  0.9381695  0.9052206
## 
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(varImp(fit.nnet),15, main = 'Neural Network feature selection')

GBM
set.seed(2019)
fit.gbm <- train(y~., data=itrain, method="gbm", metric=metric, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3816, 3815, 3815, 3815, 3815 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  AUC        Precision  Recall     F        
##   1                   50      0.7931374  0.8569476  0.9819228  0.9151634
##   1                  100      0.9550171  0.8778632  0.9693457  0.9213127
##   1                  150      0.9613312  0.8894362  0.9622731  0.9243772
##   2                   50      0.9450629  0.8838504  0.9685596  0.9242512
##   2                  100      0.9654723  0.9021633  0.9578188  0.9291050
##   2                  150      0.9667118  0.9067379  0.9567703  0.9310413
##   3                   50      0.9637935  0.8971377  0.9604396  0.9276370
##   3                  100      0.9681581  0.9092756  0.9549354  0.9315009
##   3                  150      0.9698005  0.9136212  0.9549357  0.9337780
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
par(mar = c(4, 11, 1, 1))
summary(fit.gbm, cBars=15, las=2, plotit=T, main = 'GBM feature selection')

##                           var     rel.inf
## DEBTINC               DEBTINC 56.36730306
## DELINQ.0             DELINQ.0 13.26448221
## CLAGE                   CLAGE  9.15695905
## VALUE                   VALUE  8.81423375
## LOAN                     LOAN  5.11793040
## YOJ                       YOJ  3.16235560
## DELINQ.1             DELINQ.1  1.17456145
## DEROG.1               DEROG.1  1.05232313
## JOB.Office         JOB.Office  0.97548147
## REASON.HomeImp REASON.HomeImp  0.38930802
## JOB.Other           JOB.Other  0.33100892
## JOB.Mgr               JOB.Mgr  0.13148194
## JOB.ProfExe       JOB.ProfExe  0.06257101

Comparison of algorithms

A summary table and a graphical plot comparing the cross-validated performance of the models on the training set.

results <- resamples(list(glm=fit.glm, rf=fit.rf, nnet=fit.nnet, gbm=fit.gbm))
cat(paste('Results'), sep='\n')
## Results
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: glm, rf, nnet, gbm 
## Number of resamples: 5 
## 
## AUC 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.9091625 0.9157063 0.9174697 0.9233094 0.9323515 0.9418573    0
## rf   0.9365244 0.9390366 0.9434109 0.9414138 0.9438351 0.9442619    0
## nnet 0.9059294 0.9443217 0.9448658 0.9407425 0.9497152 0.9588806    0
## gbm  0.9638430 0.9654963 0.9673594 0.9698005 0.9709633 0.9813404    0
## 
## F 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.8888889 0.8890251 0.8919903 0.8935387 0.8931439 0.9046455    0
## rf   0.9205176 0.9206349 0.9258344 0.9263060 0.9299754 0.9345679    0
## nnet 0.8883184 0.8984772 0.9096692 0.9052206 0.9131793 0.9164589    0
## gbm  0.9264516 0.9276105 0.9310563 0.9337780 0.9382239 0.9455477    0
## 
## Precision 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.8305085 0.8352535 0.8360277 0.8392055 0.8456140 0.8486239    0
## rf   0.8617143 0.8686047 0.8751445 0.8733717 0.8770492 0.8843458    0
## nnet 0.8704403 0.8719212 0.8733572 0.8746971 0.8739596 0.8838072    0
## gbm  0.8997555 0.9072682 0.9123253 0.9136212 0.9227848 0.9259724    0
## 
## Recall 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.9463351 0.9488860 0.9501966 0.9554614 0.9633028 0.9685864    0
## rf   0.9790301 0.9803665 0.9882045 0.9861150 0.9908377 0.9921363    0
## nnet 0.9069463 0.9267016 0.9370904 0.9381695 0.9568063 0.9633028    0
## gbm  0.9410223 0.9488860 0.9541885 0.9549357 0.9646134 0.9659686    0
par(mar = c(4, 11, 1, 1))
dotplot(results, main = 'AUC results from algorithms')
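
caret can also test whether the cross-validated differences between models are statistically meaningful. A minimal sketch using the resamples object above:

# Pairwise differences between the models' resampled metrics, with simple significance tests
differences <- diff(results)
summary(differences)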

Predictions

par(mfrow=c(2,2))
set.seed(2019)
prediction.glm <-predict(fit.glm,newdata=itest,type="raw")
set.seed(2019)
prediction.rf <-predict(fit.rf,newdata=itest,type="raw")
set.seed(2019)
prediction.nnet <-predict(fit.nnet,newdata=itest,type="raw")
set.seed(2019)
prediction.gbm <-predict(fit.gbm,newdata=itest,type="raw")

Visualization of results

Results are reported with the confusion matrix and the F1-score, which is a better evaluation metric for an imbalanced data set.
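
The F1 values reported below treat X0 (loan repaid) as the positive class, consistent with the confusion matrices. As a reference, a sketch computing the GLM test-set F1-score by hand from the confusion table:

# F1 = 2 * precision * recall / (precision + recall), with X0 as the positive class
cm <- table(Predicted = prediction.glm, Actual = itest$y)
precision <- cm["X0", "X0"] / sum(cm["X0", ])
recall    <- cm["X0", "X0"] / sum(cm[, "X0"])
2 * precision * recall / (precision + recall)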

cat(paste('Confusion Matrix GLM Model'), sep='\n')
## Confusion Matrix GLM Model
confusionMatrix(prediction.glm, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 909 172
##         X1  45  65
##                                           
##                Accuracy : 0.8178          
##                  95% CI : (0.7947, 0.8393)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 0.07737         
##                                           
##                   Kappa : 0.2844          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9528          
##             Specificity : 0.2743          
##          Pos Pred Value : 0.8409          
##          Neg Pred Value : 0.5909          
##              Prevalence : 0.8010          
##          Detection Rate : 0.7632          
##    Detection Prevalence : 0.9076          
##       Balanced Accuracy : 0.6135          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- F1_Score(itrain$y,predict(fit.glm,newdata=itrain,type="raw"))
F1_test <- F1_Score(itest$y, prediction.glm)
cat(paste('F1_train_glm:',F1_train, 'F1_test_glm:', F1_test), sep='\n')
## F1_train_glm: 0.893731635651322 F1_test_glm: 0.893366093366094
cat(paste('Confusion Matrix Random Forest Model'), sep='\n')
## Confusion Matrix Random Forest Model
confusionMatrix(prediction.rf, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 938 137
##         X1  16 100
##                                         
##                Accuracy : 0.8715        
##                  95% CI : (0.8512, 0.89)
##     No Information Rate : 0.801         
##     P-Value [Acc > NIR] : 9.097e-11     
##                                         
##                   Kappa : 0.5014        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.9832        
##             Specificity : 0.4219        
##          Pos Pred Value : 0.8726        
##          Neg Pred Value : 0.8621        
##              Prevalence : 0.8010        
##          Detection Rate : 0.7876        
##    Detection Prevalence : 0.9026        
##       Balanced Accuracy : 0.7026        
##                                         
##        'Positive' Class : X0            
## 
F1_train <- F1_Score(itrain$y,predict(fit.rf,newdata=itrain,type="raw"))
F1_test <- F1_Score(itest$y, prediction.rf)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.955605718585403 F1_test_rf: 0.924593395761459
cat(paste('Confusion Matrix Neural Network Model'), sep='\n')
## Confusion Matrix Neural Network Model
confusionMatrix(prediction.nnet, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 888 146
##         X1  66  91
##                                           
##                Accuracy : 0.822           
##                  95% CI : (0.7991, 0.8433)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 0.03634         
##                                           
##                   Kappa : 0.3605          
##                                           
##  Mcnemar's Test P-Value : 5.771e-08       
##                                           
##             Sensitivity : 0.9308          
##             Specificity : 0.3840          
##          Pos Pred Value : 0.8588          
##          Neg Pred Value : 0.5796          
##              Prevalence : 0.8010          
##          Detection Rate : 0.7456          
##    Detection Prevalence : 0.8682          
##       Balanced Accuracy : 0.6574          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- F1_Score(itrain$y,predict(fit.nnet,newdata=itrain,type="raw"))
F1_test <- F1_Score(itest$y, prediction.nnet)
cat(paste('F1_train_nnet:',F1_train,'F1_test_nnet:', F1_test), sep='\n')
## F1_train_nnet: 0.903363991432531 F1_test_nnet: 0.893360160965795
cat(paste('Confusion Matrix GBM Model'), sep='\n')
## Confusion Matrix GBM Model
confusionMatrix(prediction.gbm, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 911  92
##         X1  43 145
##                                           
##                Accuracy : 0.8866          
##                  95% CI : (0.8673, 0.9041)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 1.910e-15       
##                                           
##                   Kappa : 0.6145          
##                                           
##  Mcnemar's Test P-Value : 3.609e-05       
##                                           
##             Sensitivity : 0.9549          
##             Specificity : 0.6118          
##          Pos Pred Value : 0.9083          
##          Neg Pred Value : 0.7713          
##              Prevalence : 0.8010          
##          Detection Rate : 0.7649          
##    Detection Prevalence : 0.8421          
##       Balanced Accuracy : 0.7834          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- F1_Score(itrain$y,predict(fit.gbm,newdata=itrain,type="raw"))
F1_test <- F1_Score(itest$y, prediction.gbm)
cat(paste('F1_train_gbm:',F1_train,'F1_test_gbm:', F1_test), sep='\n')
## F1_train_gbm: 0.940754958413308 F1_test_gbm: 0.931016862544711
Confusion Matrix Plots
par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, itest$y)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, itest$y)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nnet, itest$y)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, itest$y)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GBM Confusion Matrix")

References:

Article

Github repository

Robert Tibshirani, Trevor Hastie, Daniela Witten, Gareth James, “An Introduction to Statistical Learning: With Applications in R”, 2013

Max Kuhn, Kjell Johnson, “Applied Predictive Modeling”, 2013

https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/

https://machinelearningmastery.com/what-is-imbalanced-classification/

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/