How to face a majority class greater than a minority class in a classification predictive modeling: Cost Sensitive models with Caret

Challenge

I’ve used a data set from Kaggle: the Home Equity data set (HMEQ) It contains loan performance information for 5.960 recent home equity loans. The target variable (BAD) is a binary variable indicating whether an applicant eventually defaulted and for each applicant are reported 12 input variables. The task is to predict clients who default on their loans.

Below the variables description

-BAD: 1 = client defaulted on loan 0 = loan repaid

-LOAN: Amount of the loan request

-MORTDUE: Amount due on existing mortgage

-VALUE: Value of current property

-REASON: DebtCon = debt consolidation; HomeImp = home improvement

-JOB: Six occupational categories

-YOJ: Years at present job

-DEROG: Number of major derogatory reports

-DELINQ: Number of delinquent credit lines

-CLAGE: Age of oldest trade line in months

-NINQ: Number of recent credit lines

-CLNO: Number of credit lines

-DEBTINC: Debt-to-income ratio

Prepare Workspace

suppressWarnings({library(ggplot2)})
suppressWarnings({library(tidyverse)})

## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## v purrr   0.3.3

## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

suppressWarnings({library(caret)})

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

suppressWarnings({library(DMwR)})

## Loading required package: grid

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

suppressWarnings({library(ROSE)})

## Loaded ROSE 0.0-3

suppressWarnings({library(corrplot)})

## corrplot 0.84 loaded

suppressWarnings({library(gridExtra)})

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

suppressWarnings({library(MLmetrics)})

## 
## Attaching package: 'MLmetrics'

## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE

## The following object is masked from 'package:base':
## 
##     Recall

Upload Dataset

path <- "C:/Users/user/Documents/eRUM2020/hmeq.csv"
df = read.csv(path)

Summarize Data set

# Dimensions of data set
dim(df)

## [1] 5960   13

# List types for each attribute
sapply(df, class)

##       BAD      LOAN   MORTDUE     VALUE    REASON       JOB       YOJ     DEROG 
## "integer" "integer" "numeric" "numeric"  "factor"  "factor" "numeric" "integer" 
##    DELINQ     CLAGE      NINQ      CLNO   DEBTINC 
## "integer" "numeric" "integer" "integer" "numeric"

# Take a peek at the first rows of the data set
head(df,5)

##   BAD LOAN MORTDUE  VALUE  REASON    JOB  YOJ DEROG DELINQ     CLAGE NINQ CLNO
## 1   1 1100   25860  39025 HomeImp  Other 10.5     0      0  94.36667    1    9
## 2   1 1300   70053  68400 HomeImp  Other  7.0     0      2 121.83333    0   14
## 3   1 1500   13500  16700 HomeImp  Other  4.0     0      0 149.46667    1   10
## 4   1 1500      NA     NA                  NA    NA     NA        NA   NA   NA
## 5   0 1700   97800 112000 HomeImp Office  3.0     0      0  93.33333    0   14
##   DEBTINC
## 1      NA
## 2      NA
## 3      NA
## 4      NA
## 5      NA

# Summarize attribute distributions
summary(df)

##       BAD              LOAN          MORTDUE           VALUE       
##  Min.   :0.0000   Min.   : 1100   Min.   :  2063   Min.   :  8000  
##  1st Qu.:0.0000   1st Qu.:11100   1st Qu.: 46276   1st Qu.: 66076  
##  Median :0.0000   Median :16300   Median : 65019   Median : 89236  
##  Mean   :0.1995   Mean   :18608   Mean   : 73761   Mean   :101776  
##  3rd Qu.:0.0000   3rd Qu.:23300   3rd Qu.: 91488   3rd Qu.:119824  
##  Max.   :1.0000   Max.   :89900   Max.   :399550   Max.   :855909  
##                                   NA's   :518      NA's   :112     
##      REASON          JOB            YOJ             DEROG        
##         : 252          : 279   Min.   : 0.000   Min.   : 0.0000  
##  DebtCon:3928   Mgr    : 767   1st Qu.: 3.000   1st Qu.: 0.0000  
##  HomeImp:1780   Office : 948   Median : 7.000   Median : 0.0000  
##                 Other  :2388   Mean   : 8.922   Mean   : 0.2546  
##                 ProfExe:1276   3rd Qu.:13.000   3rd Qu.: 0.0000  
##                 Sales  : 109   Max.   :41.000   Max.   :10.0000  
##                 Self   : 193   NA's   :515      NA's   :708      
##      DELINQ            CLAGE             NINQ             CLNO     
##  Min.   : 0.0000   Min.   :   0.0   Min.   : 0.000   Min.   : 0.0  
##  1st Qu.: 0.0000   1st Qu.: 115.1   1st Qu.: 0.000   1st Qu.:15.0  
##  Median : 0.0000   Median : 173.5   Median : 1.000   Median :20.0  
##  Mean   : 0.4494   Mean   : 179.8   Mean   : 1.186   Mean   :21.3  
##  3rd Qu.: 0.0000   3rd Qu.: 231.6   3rd Qu.: 2.000   3rd Qu.:26.0  
##  Max.   :15.0000   Max.   :1168.2   Max.   :17.000   Max.   :71.0  
##  NA's   :580       NA's   :308      NA's   :510      NA's   :222   
##     DEBTINC        
##  Min.   :  0.5245  
##  1st Qu.: 29.1400  
##  Median : 34.8183  
##  Mean   : 33.7799  
##  3rd Qu.: 39.0031  
##  Max.   :203.3121  
##  NA's   :1267

# Summarize data structure
str(df)

## 'data.frame':    5960 obs. of  13 variables:
##  $ BAD    : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ LOAN   : int  1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
##  $ MORTDUE: num  25860 70053 13500 NA 97800 ...
##  $ VALUE  : num  39025 68400 16700 NA 112000 ...
##  $ REASON : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
##  $ JOB    : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
##  $ YOJ    : num  10.5 7 4 NA 3 9 5 11 3 16 ...
##  $ DEROG  : int  0 0 0 NA 0 0 3 0 0 0 ...
##  $ DELINQ : int  0 2 0 NA 0 0 2 0 2 0 ...
##  $ CLAGE  : num  94.4 121.8 149.5 NA 93.3 ...
##  $ NINQ   : int  1 0 1 NA 0 1 1 0 1 0 ...
##  $ CLNO   : int  9 14 10 NA 14 8 17 8 12 13 ...
##  $ DEBTINC: num  NA NA NA NA NA ...

Formatting Features and managing some levels

BAD <- df$BAD <- as.factor(df$BAD)
df$LOAN <- as.numeric(df$LOAN)
df$DEROG <- as.factor(df$DEROG)
df$DELINQ <- as.factor(df$DELINQ)
df$NINQ <- as.factor(df$NINQ)
df$CLNO <- as.factor(df$CLNO)
df$JOB[df$JOB == ""] <- "NA"

## Warning in `[<-.factor`(`*tmp*`, df$JOB == "", value = structure(c(4L, 4L, :
## invalid factor level, NA generated

df$REASON[df$REASON == ""] <- "NA"

## Warning in `[<-.factor`(`*tmp*`, df$REASON == "", value = structure(c(3L, :
## invalid factor level, NA generated

Handling missing values

My approach: I’ve filled up missing values with the median for numerical variables and the most common level for the categorical ones. Also, I’ve created boolean features with 1 (true-missing value) or 0 (false-actual value) for each one with missing values. “Pawel Grabinski”

mi_summary <- function(data_frame){
  mi_summary<-c()
  for (col in colnames(data_frame)){
    mi_summary <- c(mi_summary,mean(is.na(data_frame[,col])*100))
  }
  mi_summary_new <- mi_summary[mi_summary>0]
  mi_summary_cols <- colnames(data_frame)[mi_summary>0]
  mi_summary <- data.frame('col_name' = mi_summary_cols, 'perc_missing' = mi_summary_new)
  mi_summary <- mi_summary[order(mi_summary[,2], decreasing = TRUE), ]
  mi_summary[,2] <- round(mi_summary[,2],6)
  rownames(mi_summary) <- NULL
  return(mi_summary)
}
missing_summary <- mi_summary(df)
missing_summary

##    col_name perc_missing
## 1   DEBTINC    21.258389
## 2     DEROG    11.879195
## 3    DELINQ     9.731544
## 4   MORTDUE     8.691275
## 5       YOJ     8.640940
## 6      NINQ     8.557047
## 7     CLAGE     5.167785
## 8       JOB     4.681208
## 9    REASON     4.228188
## 10     CLNO     3.724832
## 11    VALUE     1.879195

Input boolean variables for features with NA’s

df <- df %>%
  mutate(DEBTINC_NA = ifelse(is.na(DEBTINC),1,0)) %>%
  mutate(DEROG_NA = ifelse(is.na(DEROG),1,0)) %>%
  mutate(DELINQ_NA = ifelse(is.na(DELINQ),1,0)) %>%
  mutate(MORTDUE_NA = ifelse(is.na(MORTDUE),1,0)) %>%
  mutate(YOJ_NA = ifelse(is.na(YOJ),1,0)) %>%
  mutate(NINQ_NA = ifelse(is.na(NINQ),1,0)) %>%
  mutate(CLAGE_NA = ifelse(is.na(CLAGE),1,0)) %>%
  mutate(CLNO_NA = ifelse(is.na(CLNO),1,0)) %>%
  mutate(VALUE_NA = ifelse(is.na(VALUE),1,0)) %>%
  mutate(JOB_NA = ifelse(is.na(JOB),1,0)) %>%
  mutate(REASON_NA = ifelse(is.na(REASON),1,0))

Input missing values with median for numerical columns and with the most common level for categorical columns

for (col in missing_summary$col_name){
  if (class(df[,col]) == 'factor'){
    unique_levels <- unique(df[,col])
    df[is.na(df[,col]), col] <- unique_levels[which.max(tabulate(match(df[,col], unique_levels)))]
  } else {
    df[is.na(df[,col]),col] <- median(as.numeric(df[,col]), na.rm = TRUE)
  }
}

Check results

pMiss <- function(x){sum(is.na(x))/length(x)*100}
pMiss <- apply(df,2,pMiss)
pMiss <- pMiss[pMiss > 0]
pMiss <- pMiss[order(pMiss, decreasing=T)]
pMiss

## named numeric(0)

Formatting new features and managing some levels

df$DEROG_NA <- as.factor(df$DEROG_NA)
df$DEBTINC_NA <- as.factor(df$DEBTINC_NA)
df$DELINQ_NA <- as.factor(df$DELINQ_NA)
df$MORTDUE_NA <- as.factor(df$MORTDUE_NA)
df$YOJ_NA <- as.factor(df$YOJ_NA)
df$NINQ_NA <- as.factor(df$NINQ_NA)
df$CLAGE_NA <- as.factor(df$CLAGE_NA)
df$CLNO_NA <- as.factor(df$CLNO_NA)
df$VALUE_NA <- as.factor(df$VALUE_NA)
df$JOB_NA <- as.factor(df$JOB_NA)
df$REASON_NA <- as.factor(df$REASON_NA)
df$JOB <- factor(df$JOB, labels=c('Mgr','Office','Other','ProfExe','Sales','Self'))
df$REASON <- factor(df$REASON, labels=c('DebtCon','HomeImp'))

Split data set into categorical, boolean and numerical variables

I’ve grouped the data set into numerical, categorical and boolean variables for the Exploratory Data Analysis

cat <- df[,sapply(df, is.factor)] %>%
  select_if(~nlevels(.) <=15 ) %>%
  select(-BAD)
bol <- df[,c('DEBTINC_NA','DEROG_NA','DELINQ_NA','MORTDUE_NA','YOJ_NA','NINQ_NA','CLAGE_NA','CLNO_NA','VALUE_NA','JOB_NA','REASON_NA')]
num <- df[,sapply(df, is.numeric)]

Summarize the class distribution of the target variable

The target variable is grouped into two classes: “Loan defaulted (1)” and “Loan repaid (0)”. Looking at the barplot, it’s quite unbalanced.

Visualize data

cbind(freq=table(df$BAD), percentage=prop.table(table(df$BAD))*100)

##   freq percentage
## 0 4771   80.05034
## 1 1189   19.94966

ggplot(df, aes(BAD, fill=BAD)) + geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Distribution of Target variable")

Analysis for categorical features (barplot, univariate analysis, bivariate analysis)

I’ve grouped all categorical features into a new subset: I’ve done a graphical analysis using barplots and I’ve counted the frequency for each class. For a bivariate analysis I’ve used a Chi-Square Test to evaluate the relationship between the target variable and each categorical feature. This number tells how much difference exists between observed counts and the counts would be expect if there were no relationship at all in the population.

Univariate Analysis

cat <- cat[,c('DELINQ','REASON','JOB','DEROG')]
for(i in 1:length(cat)) {
  counts <- table(cat[,i])
  name <- names(cat)[i]
  barplot(counts, main=name, col=c("blue","red","green","orange","purple"))
}

Bivariate Analysis with Feature Selection Analysis

The Chi-Square Test is used as feature selection testing the null hypothesis of independence between target variable and categorical features. The goal is to test that two classifications are independent or not. Two classifications are independent if the distribution of one, with respect to a classification principle, is not influenced by the other one. If the null hypothesis is not rejected the two classifications are independent (P-Value>0.05) and the feature can be dropped.

par(mfrow=c(2,2))
for(i in 1:length(cat)){
  freq=table(cat[,i])
  percentage=prop.table(table(cat[,i]))*100
  freq_cat_outcome=table(BAD,cat[,i])
  name <- names(cat)[i]
  cat(sep="\n")
  cat(paste("Distribution of", name), sep="\n")
  print(cbind(freq,percentage))
  cat(sep="\n")
  cat(paste("Distribution by Target variable and", name), sep="\n")
  print(freq_cat_outcome)
  cat(sep="\n")
  cat(paste("Chi-squared test by Target variable and", name), sep="\n")
  suppressWarnings({print(chisq.test(table(BAD,cat[,i])))})
}

## 
## Distribution of DELINQ
##    freq  percentage
## 0  4759 79.84899329
## 1   654 10.97315436
## 2   250  4.19463087
## 3   129  2.16442953
## 4    78  1.30872483
## 5    38  0.63758389
## 6    27  0.45302013
## 7    13  0.21812081
## 8     5  0.08389262
## 10    2  0.03355705
## 11    2  0.03355705
## 12    1  0.01677852
## 13    1  0.01677852
## 15    1  0.01677852
## 
## Distribution by Target variable and DELINQ
##    
## BAD    0    1    2    3    4    5    6    7    8   10   11   12   13   15
##   0 4104  432  138   58   32    7    0    0    0    0    0    0    0    0
##   1  655  222  112   71   46   31   27   13    5    2    2    1    1    1
## 
## Chi-squared test by Target variable and DELINQ
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 763.8, df = 13, p-value < 2.2e-16
## 
## 
## Distribution of REASON
##         freq percentage
## DebtCon 4180   70.13423
## HomeImp 1780   29.86577
## 
## Distribution by Target variable and REASON
##    
## BAD DebtCon HomeImp
##   0    3387    1384
##   1     793     396
## 
## Chi-squared test by Target variable and REASON
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(BAD, cat[, i])
## X-squared = 8.1852, df = 1, p-value = 0.004223
## 
## 
## Distribution of JOB
##         freq percentage
## Mgr      767  12.869128
## Office   948  15.906040
## Other   2667  44.748322
## ProfExe 1276  21.409396
## Sales    109   1.828859
## Self     193   3.238255
## 
## Distribution by Target variable and JOB
##    
## BAD  Mgr Office Other ProfExe Sales Self
##   0  588    823  2090    1064    71  135
##   1  179    125   577     212    38   58
## 
## Chi-squared test by Target variable and JOB
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 73.815, df = 5, p-value = 1.644e-14
## 
## 
## Distribution of DEROG
##    freq  percentage
## 0  5235 87.83557047
## 1   435  7.29865772
## 2   160  2.68456376
## 3    58  0.97315436
## 4    23  0.38590604
## 5    15  0.25167785
## 6    15  0.25167785
## 7     8  0.13422819
## 8     6  0.10067114
## 9     3  0.05033557
## 10    2  0.03355705
## 
## Distribution by Target variable and DEROG
##    
## BAD    0    1    2    3    4    5    6    7    8    9   10
##   0 4394  266   78   15    5    8    5    0    0    0    0
##   1  841  169   82   43   18    7   10    8    6    3    2
## 
## Chi-squared test by Target variable and DEROG
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 503.99, df = 10, p-value < 2.2e-16

Visualization of Bivariate Analysis

pl1 <- cat %>%
  ggplot(aes(x=BAD, y=DELINQ, fill=BAD)) + 
  geom_bar(stat='identity') + 
  ggtitle("Distribution by BAD and DELINQ")
pl2 <- cat %>%
  ggplot(aes(x=BAD, y=REASON, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and REASON")
pl3 <- cat %>%
  ggplot(aes(x=BAD, y=JOB, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and JOB")
pl4 <- cat %>%
  ggplot(aes(x=BAD, y=DEROG, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and DEROG")
par(mfrow=c(2,2))
grid.arrange(pl1,pl2,pl3,pl4, ncol=2)

One-hot encoding on categorical features

I’ve transformed categorical features into numerical variables with one-hot encoding methodology to afford a better understanding of variables by machine learning models.

dmy <- dummyVars("~.", data = cat,fullRank = F)
cat_num <- data.frame(predict(dmy, newdata = cat))

Remove correlated levels from boolean features

drop_cols <- c('DEBTINC_NA.0','DEROG_NA.0','DELINQ_NA.0','MORTDUE_NA.0','YOJ_NA.0','NINQ_NA.0','CLAGE_NA.0','CLNO_NA.0','VALUE_NA.0','JOB_NA.0','REASON_NA.0')
categorical <- cat_num[,!colnames(cat_num) %in% drop_cols]

Analysis for numerical features (univariate analysis, bivariate analysis)

I’ve grouped all numerical features into a new subset: I’ve done a graphical analysis using histograms, density plots and boxplots; also I’ve performed metrics such as standard deviation, skewness and kurtosis for each variable. For a bivariate analysis I’ve used the Anova Test to evaluate the relationship between the target variable and each numerical feature. This test assesses whether the average of more than two groups are statistically different.

Univariate Analysis, histograms

par(mfrow=c(2,3))
for(i in 1:length(num)) {
  hist(num[,i], main=names(num)[i], col='blue')
}

Univariate Analysis, boxplots

par(mfrow=c(2,3))
for(i in 1:length(num)) {
  boxplot(num[,i], main=names(num)[i], col='orange')
}

Univariate Analysis, densityplots

par(mfrow=c(2,3))
for(i in 1:length(num)){
  plot(density(num[,i]), main=names(num)[i], col='red')
}

Bivariate Analysis

for(i in 1:length(num)){
  name <- names(num)[i]
  cat(paste("Distribution of", name), sep="\n")
  #cat(names(num)[i],sep = "\n")
  print(summary(num[,i]))
  cat(sep="\n")
  stand.deviation = sd(num[,i])
  variance = var(num[,i])
  skewness = mean((num[,i] - mean(num[,i]))^3/sd(num[,i])^3)
  kurtosis = mean((num[,i] - mean(num[,i]))^4/sd(num[,i])^4) - 3
  outlier_values <- sum(table(boxplot.stats(num[,i])$out))
  cat(paste("Statistical analysis of", name), sep="\n")
  print(cbind(stand.deviation, variance, skewness, kurtosis, outlier_values))
  cat(sep="\n")
  cat(paste("anova_test between BAD and", name),sep = "\n")
  print(summary(aov(as.numeric(BAD)~num[,i], data=num)))
  cat(sep="\n")
}

## Distribution of LOAN
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1100   11100   16300   18608   23300   89900 
## 
## Statistical analysis of LOAN
##      stand.deviation  variance skewness kurtosis outlier_values
## [1,]        11207.48 125607617 2.022762 6.922438            256
## 
## anova_test between BAD and LOAN
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    5.4   5.368   33.79 6.45e-09 ***
## Residuals   5958  946.4   0.159                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of MORTDUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2063   48139   65019   73001   88200  399550 
## 
## Statistical analysis of MORTDUE
##      stand.deviation   variance skewness kurtosis outlier_values
## [1,]        42552.73 1810734556 1.941153 7.440693            308
## 
## anova_test between BAD and MORTDUE
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    2.0  2.0303   12.74 0.000361 ***
## Residuals   5958  949.8  0.1594                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of VALUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8000   66490   89236  101540  119005  855909 
## 
## Statistical analysis of VALUE
##      stand.deviation   variance skewness kurtosis outlier_values
## [1,]        56869.44 3234132829 3.088963 24.85644            347
## 
## anova_test between BAD and VALUE
##               Df Sum Sq Mean Sq F value  Pr(>F)   
## num[, i]       1    1.3  1.2675   7.945 0.00484 **
## Residuals   5958  950.5  0.1595                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of YOJ
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   7.000   8.756  12.000  41.000 
## 
## Statistical analysis of YOJ
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]        7.259424 52.69923 1.092072 0.744703            211
## 
## anova_test between BAD and YOJ
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    2.8  2.7710    17.4 3.08e-05 ***
## Residuals   5958  949.0  0.1593                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of CLAGE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   117.4   173.5   179.4   227.1  1168.2 
## 
## Statistical analysis of CLAGE
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]         83.5747  6984.73 1.389903 8.180556             66
## 
## anova_test between BAD and CLAGE
##               Df Sum Sq Mean Sq F value Pr(>F)    
## num[, i]       1   26.1  26.106     168 <2e-16 ***
## Residuals   5958  925.7   0.155                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of DEBTINC
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.5245  30.7632  34.8183  34.0007  37.9499 203.3122 
## 
## Statistical analysis of DEBTINC
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]        7.644528 58.43881 3.111598  64.0733            245
## 
## anova_test between BAD and DEBTINC
##               Df Sum Sq Mean Sq F value Pr(>F)    
## num[, i]       1   22.7  22.733   145.8 <2e-16 ***
## Residuals   5958  929.1   0.156                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visualization of Bivariate Analysis

pl5 <- num %>%
  ggplot(aes(x=BAD, y=LOAN, fill=BAD)) + geom_boxplot() 
pl6 <- num %>%
  ggplot(aes(x=BAD, y=MORTDUE, fill=BAD)) + geom_boxplot() 
pl7 <- num %>%
  ggplot(aes(x=BAD, y=VALUE, fill=BAD)) + geom_boxplot() 
pl8 <- num %>%
  ggplot(aes(x=BAD, y=YOJ, fill=BAD)) + geom_boxplot() 
pl9 <- num %>%
  ggplot(aes(x=BAD, y=CLAGE, fill=BAD)) + geom_boxplot() 
pl10 <- num %>%
  ggplot(aes(x=BAD, y=DEBTINC, fill=BAD)) + geom_boxplot() 
par(mfrow=c(2,3))
grid.arrange(pl5,pl6,pl7,pl8,pl9,pl10, ncol=2)

Handling outliers

Outliers are those observations that lie outside 1.5 times the Inter Quartile Range (difference between 75th and 25th quartiles). If they are not detected and corrected in an appropriate way they can distort the prediction. There are several ways to handle outliers, I’ve decided to cap them replacing those observations outside the lower limit with the value of the 5th quartile and those that lie above the upper limit, with the value of the 95th quartile.

# Before
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN")

# Managing outliers
qnt <- quantile(num$LOAN, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$LOAN, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$LOAN, na.rm = T)
num$LOAN[num$LOAN < (qnt[1] - H)]  <- caps[1]
num$LOAN[num$LOAN >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN after handled outliers")

# Before
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE")

# Managing outliers
qnt <- quantile(num$MORTDUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$MORTDUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$MORTDUE, na.rm = T)
num$MORTDUE[num$MORTDUE < (qnt[1] - H)]  <- caps[1]
num$MORTDUE[num$MORTDUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3)  + ggtitle("MORTDUE after handled outliers")

# Before
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE")

# Managing outliers
qnt <- quantile(num$VALUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$VALUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$VALUE, na.rm = T)
num$VALUE[num$VALUE < (qnt[1] - H)]  <- caps[1]
num$VALUE[num$VALUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE after handled outliers")

# Before
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ")

# Managing outliers
qnt <- quantile(num$YOJ, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$YOJ, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$YOJ, na.rm = T)
num$YOJ[num$YOJ < (qnt[1] - H)]  <- caps[1]
num$YOJ[num$YOJ >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ after handled outliers")

# Before
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE")

# Managing outliers
qnt <- quantile(num$CLAGE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$CLAGE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$CLAGE, na.rm = T)
num$CLAGE[num$CLAGE < (qnt[1] - H)]  <- caps[1]
num$CLAGE[num$CLAGE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE after handled outliers")

# Before
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC")

# Managing outliers
qnt <- quantile(num$DEBTINC, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$DEBTINC, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$DEBTINC, na.rm = T)
num$DEBTINC[num$DEBTINC < (qnt[1] - H)]  <- caps[1]
num$DEBTINC[num$DEBTINC >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC after handled outliers")

Delete Zero-and Near Zero-Variance Predictors

The goal of this step is to delete predictors that can have single unique values or handful of unique values that occur with very low frequencies. These “zero-variance predictors” may cause problems to fit the algorithms (excluding tree-based models)

data <- cbind(categorical,num) 
nzv <- nearZeroVar(data, saveMetrics= TRUE)
nzv[nzv$nzv,][1:15,]

##            freqRatio percentUnique zeroVar  nzv
## DELINQ.2    22.84000    0.03355705   FALSE TRUE
## DELINQ.3    45.20155    0.03355705   FALSE TRUE
## DELINQ.4    75.41026    0.03355705   FALSE TRUE
## DELINQ.5   155.84211    0.03355705   FALSE TRUE
## DELINQ.6   219.74074    0.03355705   FALSE TRUE
## DELINQ.7   457.46154    0.03355705   FALSE TRUE
## DELINQ.8  1191.00000    0.03355705   FALSE TRUE
## DELINQ.10 2979.00000    0.03355705   FALSE TRUE
## DELINQ.11 2979.00000    0.03355705   FALSE TRUE
## DELINQ.12 5959.00000    0.03355705   FALSE TRUE
## DELINQ.13 5959.00000    0.03355705   FALSE TRUE
## DELINQ.15 5959.00000    0.03355705   FALSE TRUE
## JOB.Sales   53.67890    0.03355705   FALSE TRUE
## JOB.Self    29.88083    0.03355705   FALSE TRUE
## DEROG.2     36.25000    0.03355705   FALSE TRUE

nzv <- nearZeroVar(data)
data_new <- data[, -nzv]

Correlation

Another feature selection approach is to observe correlation between variables, I apply it on all data set. There are some models such as linear regression where related features can deteriorate the performance (multicollinearity). Though some ensemble models are not sensitive at this topic, “Ensembles of tree-based models”, I prefer to remove them anyway because I don’t know which model to use in advance.

Visualization

par(mfrow=c(1,1))
cor <- cor(data_new,use="complete.obs",method = "spearman")
corrplot(cor, type="lower", tl.col = "black", diag=FALSE, method="number", mar = c(0, 0, 2, 0), title="Correlation")

summary(cor[upper.tri(cor)])

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.000000 -0.038116  0.005731 -0.014157  0.043268  0.792421

Delete correlated features

tmp <- cor(data_new)
tmp[upper.tri(tmp)] <- 0
diag(tmp) <- 0
df_new <- data_new[,!apply(tmp,2,function(x) any(abs(x) > 0.75))]
cor <- cor(df_new,use="complete.obs",method = "spearman")
summary(cor[upper.tri(cor)])

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.6988623 -0.0421582  0.0004314 -0.0234455  0.0307002  0.3406328

Pre-processing

To analyze the performance of a model is a good manner to split the data set into the training set and the test set. The training set is a sample of data used to fit the model, the test set is a sample of data used to provide an unbiased evaluation of the model applied on data never seen before.

# calculate the pre-process parameters from the data set
set.seed(2019)
preprocessParams <- preProcess(df_new, method=c("center", "scale"))
# Transform the data set using the parameters
transformed <- predict(preprocessParams, df_new)
# Manage levels on the target variable
y <- as.factor(df$BAD)
transformed <- cbind.data.frame(transformed,y)
levels(transformed$y) <- make.names(levels(factor(transformed$y)))
str(transformed)

## 'data.frame':    5960 obs. of  14 variables:
##  $ DELINQ.0      : num  0.502 -1.99 0.502 0.502 0.502 ...
##  $ DELINQ.1      : num  -0.351 -0.351 -0.351 -0.351 -0.351 ...
##  $ REASON.HomeImp: num  1.532 1.532 1.532 -0.653 1.532 ...
##  $ JOB.Mgr       : num  -0.384 -0.384 -0.384 -0.384 -0.384 ...
##  $ JOB.Office    : num  -0.435 -0.435 -0.435 -0.435 2.299 ...
##  $ JOB.Other     : num  1.11 1.11 1.11 1.11 -0.9 ...
##  $ JOB.ProfExe   : num  -0.522 -0.522 -0.522 -0.522 -0.522 ...
##  $ DEROG.1       : num  -0.281 -0.281 -0.281 -0.281 -0.281 ...
##  $ LOAN          : num  -1.86 -1.84 -1.81 -1.81 -1.79 ...
##  $ VALUE         : num  -1.322 -0.669 -1.818 -0.206 0.3 ...
##  $ YOJ           : num  0.279 -0.233 -0.672 -0.233 -0.819 ...
##  $ CLAGE         : num  -1.091 -0.73 -0.367 -0.052 -1.104 ...
##  $ DEBTINC       : num  0.143 0.143 0.143 0.143 0.143 ...
##  $ y             : Factor w/ 2 levels "X0","X1": 2 2 2 2 1 2 2 2 2 2 ...

Split data set

# Draw a random, stratified sample including p percent of the data
set.seed(12345)
test_index <- createDataPartition(transformed$y, p=0.80, list=FALSE)
# select 20% of the data for test
itest <- transformed[-test_index,]
# use the remaining 80% of data to training the models
itrain <- transformed[test_index,]

Cost-sensitive learning: evaluating models by Caret

In this analysis I’ve used some baseline cost sensitive learning models: Reguralized Random Forest and Penalized Multinomial Regression.

# Stratified cross-validation
folds <- 5
set.seed(2019)
cvIndex <- createFolds(factor(itrain$y), folds, returnTrain = T)
control <- trainControl(index = cvIndex,method="repeatedcv", number=folds,classProbs = TRUE, 
summaryFunction=prSummary)
metric <- "AUC"

REGULARIZED RANDOM FOREST

set.seed(2019)
fit.rrf <- train(y~., data=itrain, method="RRF", metric=metric, trControl=control)

## Registered S3 method overwritten by 'RRF':
##   method      from        
##   plot.margin randomForest

print(fit.rrf)

## Regularized Random Forest 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Resampling results across tuning parameters:
## 
##   mtry  coefReg  coefImp  AUC        Precision  Recall     F        
##    2    0.010    0.0      0.8688082  0.8948432  0.9418337  0.9173708
##    2    0.010    0.5      0.8704264  0.8892433  0.9434174  0.9152755
##    2    0.010    1.0      0.8346641  0.9105366  0.9379087  0.9240001
##    2    0.505    0.0      0.8672459  0.8992080  0.9491738  0.9233087
##    2    0.505    0.5      0.8503685  0.9097663  0.9347680  0.9220873
##    2    0.505    1.0      0.8294250  0.9053201  0.9365991  0.9206890
##    2    1.000    0.0      0.8318242  0.9178406  0.9423634  0.9299322
##    2    1.000    0.5      0.8305954  0.9178146  0.9392204  0.9283873
##    2    1.000    1.0      0.8680398  0.9078320  0.9386944  0.9229780
##    7    0.010    0.0      0.8445569  0.8989145  0.9499571  0.9234684
##    7    0.010    0.5      0.8875775  0.8911118  0.9434147  0.9161364
##    7    0.010    1.0      0.8396133  0.9051242  0.9339775  0.9192906
##    7    0.505    0.0      0.8318548  0.9029518  0.9394849  0.9207807
##    7    0.505    0.5      0.8572132  0.9040103  0.9321417  0.9178319
##    7    0.505    1.0      0.8610492  0.9090686  0.9371223  0.9228666
##    7    1.000    0.0      0.8430865  0.9181814  0.9407921  0.9293387
##    7    1.000    0.5      0.8528008  0.9161716  0.9392197  0.9275442
##    7    1.000    1.0      0.8344937  0.9055855  0.9344997  0.9197968
##   13    0.010    0.0      0.7931388  0.9147255  0.9441952  0.9292164
##   13    0.010    0.5      0.8508829  0.9036179  0.9373841  0.9201755
##   13    0.010    1.0      0.8616062  0.9021009  0.9337144  0.9176069
##   13    0.505    0.0      0.7947710  0.9101297  0.9365984  0.9231694
##   13    0.505    0.5      0.8616669  0.9020509  0.9331915  0.9173339
##   13    0.505    1.0      0.8678545  0.9022286  0.9321417  0.9168933
##   13    1.000    0.0      0.8364031  0.9175508  0.9415781  0.9293938
##   13    1.000    0.5      0.8554741  0.9134485  0.9373876  0.9252577
##   13    1.000    1.0      0.8617296  0.9016886  0.9318802  0.9165126
## 
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 7, coefReg = 0.01 and coefImp
##  = 0.5.

plot(varImp(fit.rrf),15, main='Regularized Random Forest feature selection')

Penalized Multinomial Regression

set.seed(2019)
fit.pmr <- train(y~., data=itrain, method="multinom", metric=metric, trControl=control)

## # weights:  15 (14 variable)
## initial  value 2643.663347 
## iter  10 value 1643.449429
## iter  20 value 1596.132959
## final  value 1596.129966 
## converged
## # weights:  15 (14 variable)
## initial  value 2643.663347 
## iter  10 value 1644.055053
## iter  20 value 1596.581893
## final  value 1596.579823 
## converged
## # weights:  15 (14 variable)
## initial  value 2643.663347 
## iter  10 value 1643.450035
## iter  20 value 1596.133410
## final  value 1596.130416 
## converged
## # weights:  15 (14 variable)
## initial  value 2645.049641 
## iter  10 value 1616.608929
## iter  20 value 1576.917418
## final  value 1576.917398 
## converged
## # weights:  15 (14 variable)
## initial  value 2645.049641 
## iter  10 value 1617.228086
## iter  20 value 1577.395100
## final  value 1577.395081 
## converged
## # weights:  15 (14 variable)
## initial  value 2645.049641 
## iter  10 value 1616.609548
## iter  20 value 1576.917896
## final  value 1576.917876 
## converged
## # weights:  15 (14 variable)
## initial  value 2645.049641 
## iter  10 value 1631.992725
## iter  20 value 1586.523644
## final  value 1586.523531 
## converged
## # weights:  15 (14 variable)
## initial  value 2645.049641 
## iter  10 value 1632.618173
## iter  20 value 1587.003794
## final  value 1587.003681 
## converged
## # weights:  15 (14 variable)
## initial  value 2645.049641 
## iter  10 value 1631.993351
## iter  20 value 1586.524125
## final  value 1586.524012 
## converged
## # weights:  15 (14 variable)
## initial  value 2645.049641 
## iter  10 value 1623.606406
## iter  20 value 1571.510591
## final  value 1571.510475 
## converged
## # weights:  15 (14 variable)
## initial  value 2645.049641 
## iter  10 value 1624.245690
## iter  20 value 1571.999053
## final  value 1571.998938 
## converged
## # weights:  15 (14 variable)
## initial  value 2645.049641 
## iter  10 value 1623.607046
## iter  20 value 1571.511080
## final  value 1571.510964 
## converged
## # weights:  15 (14 variable)
## initial  value 2643.663347 
## iter  10 value 1621.673938
## iter  20 value 1571.776738
## final  value 1571.776667 
## converged
## # weights:  15 (14 variable)
## initial  value 2643.663347 
## iter  10 value 1622.341658
## iter  20 value 1572.270680
## final  value 1572.270608 
## converged
## # weights:  15 (14 variable)
## initial  value 2643.663347 
## iter  10 value 1621.674606
## iter  20 value 1571.777233
## final  value 1571.777161 
## converged
## # weights:  15 (14 variable)
## initial  value 3305.618904 
## iter  10 value 2168.329691
## iter  20 value 1977.784792
## final  value 1977.781680 
## converged

print(fit.pmr)

## Penalized Multinomial Regression 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Resampling results across tuning parameters:
## 
##   decay  AUC        Precision  Recall     F        
##   0e+00  0.9232465  0.8383432  0.9557259  0.8931242
##   1e-04  0.9232465  0.8383432  0.9557259  0.8931242
##   1e-01  0.9232486  0.8383432  0.9557259  0.8931242
## 
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.

plot(varImp(fit.pmr),15, main = 'Penalized Multinomial Regression feature selection')

Comparison of algorithms

A graphical plot to show performances of the models from the training set.

results <- resamples(list(rrf=fit.rrf, pmr=fit.pmr))
cat(paste('Results'), sep='\n')

## Results

summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: rrf, pmr 
## Number of resamples: 5 
## 
## AUC 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rrf 0.8521807 0.8792198 0.8965403 0.8875775 0.9019223 0.9080243    0
## pmr 0.9155026 0.9189913 0.9228955 0.9232486 0.9234657 0.9353882    0
## 
## F 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rrf 0.9078707 0.9113111 0.9187097 0.9161364 0.9206963 0.9220945    0
## pmr 0.8893012 0.8906634 0.8929664 0.8931242 0.8941035 0.8985864    0
## 
## Precision 
##          Min.  1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rrf 0.8493151 0.895202 0.9002494 0.8911118 0.9047014 0.9060914    0
## pmr 0.8264739 0.837156 0.8391204 0.8383432 0.8419204 0.8470452    0
## 
## Recall 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rrf 0.9280105 0.9331586 0.9357798 0.9434147 0.9450262 0.9750983    0
## pmr 0.9423329 0.9489529 0.9567497 0.9557259 0.9568063 0.9737877    0

par(mar = c(4, 11, 1, 1))
dotplot(results, main = 'AUC results from algorithms')

Predictions

par(mfrow=c(2,2))
set.seed(2019)
prediction.rrf <-predict(fit.rrf,newdata=itest,type="raw")
set.seed(2019)
prediction.pmr <-predict(fit.pmr,newdata=itest,type="raw")

Visualization of results

Results are explained by the confusion matrix and the F1-score as better evaluation metric for imbalanced data set.

cat(paste('Confusion Matrix Regularized Random Forest Model'), sep='\n')

## Confusion Matrix Regularized Random Forest Model

confusionMatrix(prediction.rrf, itest$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 922 155
##         X1  32  82
##                                           
##                Accuracy : 0.843           
##                  95% CI : (0.8211, 0.8632)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 0.0001124       
##                                           
##                   Kappa : 0.3881          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9665          
##             Specificity : 0.3460          
##          Pos Pred Value : 0.8561          
##          Neg Pred Value : 0.7193          
##              Prevalence : 0.8010          
##          Detection Rate : 0.7741          
##    Detection Prevalence : 0.9043          
##       Balanced Accuracy : 0.6562          
##                                           
##        'Positive' Class : X0              
##

F1_train <- fit.rrf$results[[5]][1]
F1_test <- F1_Score(itest$y, prediction.rrf)
cat(paste('F1_train_rrf:',F1_train,'F1_test_rrf:', F1_test), sep='\n')

## F1_train_rrf: 0.894843227894302 F1_test_rrf: 0.907927129492861

cat(paste('Confusion Matrix Penalized Multinomial Regression Model'), sep='\n')

## Confusion Matrix Penalized Multinomial Regression Model

confusionMatrix(prediction.pmr, itest$y)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 909 172
##         X1  45  65
##                                           
##                Accuracy : 0.8178          
##                  95% CI : (0.7947, 0.8393)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 0.07737         
##                                           
##                   Kappa : 0.2844          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9528          
##             Specificity : 0.2743          
##          Pos Pred Value : 0.8409          
##          Neg Pred Value : 0.5909          
##              Prevalence : 0.8010          
##          Detection Rate : 0.7632          
##    Detection Prevalence : 0.9076          
##       Balanced Accuracy : 0.6135          
##                                           
##        'Positive' Class : X0              
##

F1_train <- fit.pmr$results[[6]][1]
F1_test <- F1_Score(itest$y, prediction.pmr)
cat(paste('F1_train_pmr:',F1_train,'F1_test_pmr:', F1_test), sep='\n')

## F1_train_pmr: 0.00751785594104046 F1_test_pmr: 0.893366093366094

Confusion Matrix Plots

par(mfrow=c(2,2))
ctable.rrf <- table(prediction.rrf, itest$y)
fourfoldplot(ctable.rrf, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "RRF Confusion Matrix")
ctable.pmr <- table(prediction.pmr, itest$y)
fourfoldplot(ctable.pmr, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "PMR Confusion Matrix")

References:

Article

Github repository

Robert Tibshirani, Trevor Hastie, Daniela Witten, Gareth James, “An Introduction to Statistical Learning: With Applications in R”, 2013

Max Kuhn, Kjell Johnson, “Applied Predictive Modeling”, 2013

https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/

https://machinelearningmastery.com/what-is-imbalanced-classification/

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

How to face a majority class greater than a minority class in a classification predictive modeling: Cost Sensitive models with Caret

Claudio G. Giancaterino

30/1/2020

Challenge

Prepare Workspace

Upload Dataset

Summarize Data set

Formatting Features and managing some levels

Handling missing values

Input boolean variables for features with NA’s

Input missing values with median for numerical columns and with the most common level for categorical columns

Check results

Formatting new features and managing some levels

Split data set into categorical, boolean and numerical variables

Summarize the class distribution of the target variable

Visualize data

Analysis for categorical features (barplot, univariate analysis, bivariate analysis)

Univariate Analysis

Bivariate Analysis with Feature Selection Analysis

Visualization of Bivariate Analysis

One-hot encoding on categorical features

Remove correlated levels from boolean features

Analysis for numerical features (univariate analysis, bivariate analysis)

Univariate Analysis, histograms

Univariate Analysis, boxplots

Univariate Analysis, densityplots

Bivariate Analysis

Visualization of Bivariate Analysis

Handling outliers

Delete Zero-and Near Zero-Variance Predictors

Correlation

Visualization

Delete correlated features

Pre-processing

Split data set

Cost-sensitive learning: evaluating models by Caret

REGULARIZED RANDOM FOREST

Penalized Multinomial Regression

Comparison of algorithms

Predictions

Visualization of results

Confusion Matrix Plots

References: