Challenge

I’ve used a data set from Kaggle: the Home Equity data set (HMEQ). It contains loan performance information for 5,960 recent home equity loans. The target variable (BAD) is a binary variable indicating whether an applicant eventually defaulted, and 12 input variables are reported for each applicant. The task is to predict which clients will default on their loans.

Below is a description of the variables:

- BAD: 1 = client defaulted on loan; 0 = loan repaid

- LOAN: Amount of the loan request

- MORTDUE: Amount due on the existing mortgage

- VALUE: Value of current property

- REASON: DebtCon = debt consolidation; HomeImp = home improvement

- JOB: Six occupational categories

- YOJ: Years at present job

- DEROG: Number of major derogatory reports

- DELINQ: Number of delinquent credit lines

- CLAGE: Age of oldest trade line in months

- NINQ: Number of recent credit inquiries

- CLNO: Number of credit lines

- DEBTINC: Debt-to-income ratio

Prepare Workspace

suppressWarnings({library(ggplot2)})
suppressWarnings({library(tidyverse)})
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## v purrr   0.3.3
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
suppressWarnings({library(caret)})
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
suppressWarnings({library(DMwR)})
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
suppressWarnings({library(ROSE)})
## Loaded ROSE 0.0-3
suppressWarnings({library(corrplot)})
## corrplot 0.84 loaded
suppressWarnings({library(gridExtra)})
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
suppressWarnings({library(MLmetrics)})
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
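
Note that suppressWarnings() only silences warnings, not package startup messages, which is why the attach notes above still appear. A quieter alternative (a minimal sketch):

# suppressPackageStartupMessages() hides the attach and conflict notes as well
suppressPackageStartupMessages({
  library(tidyverse)
  library(caret)
})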

Load Dataset

path <- "C:/Users/user/Documents/eRUM2020/hmeq.csv"
df <- read.csv(path)
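
As an aside, the empty strings in JOB and REASON could be declared as NA at read time, which would avoid the factor-level workaround used later; a sketch on the same file:

# Treat empty strings as NA while reading (factors as in pre-4.0 R defaults)
df_alt <- read.csv(path, na.strings = c("", "NA"), stringsAsFactors = TRUE)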

Summarize Data set

# Dimensions of data set
dim(df)
## [1] 5960   13
# List types for each attribute
sapply(df, class)
##       BAD      LOAN   MORTDUE     VALUE    REASON       JOB       YOJ     DEROG 
## "integer" "integer" "numeric" "numeric"  "factor"  "factor" "numeric" "integer" 
##    DELINQ     CLAGE      NINQ      CLNO   DEBTINC 
## "integer" "numeric" "integer" "integer" "numeric"
# Take a peek at the first rows of the data set
head(df,5)
##   BAD LOAN MORTDUE  VALUE  REASON    JOB  YOJ DEROG DELINQ     CLAGE NINQ CLNO
## 1   1 1100   25860  39025 HomeImp  Other 10.5     0      0  94.36667    1    9
## 2   1 1300   70053  68400 HomeImp  Other  7.0     0      2 121.83333    0   14
## 3   1 1500   13500  16700 HomeImp  Other  4.0     0      0 149.46667    1   10
## 4   1 1500      NA     NA                  NA    NA     NA        NA   NA   NA
## 5   0 1700   97800 112000 HomeImp Office  3.0     0      0  93.33333    0   14
##   DEBTINC
## 1      NA
## 2      NA
## 3      NA
## 4      NA
## 5      NA
# Summarize attribute distributions
summary(df)
##       BAD              LOAN          MORTDUE           VALUE       
##  Min.   :0.0000   Min.   : 1100   Min.   :  2063   Min.   :  8000  
##  1st Qu.:0.0000   1st Qu.:11100   1st Qu.: 46276   1st Qu.: 66076  
##  Median :0.0000   Median :16300   Median : 65019   Median : 89236  
##  Mean   :0.1995   Mean   :18608   Mean   : 73761   Mean   :101776  
##  3rd Qu.:0.0000   3rd Qu.:23300   3rd Qu.: 91488   3rd Qu.:119824  
##  Max.   :1.0000   Max.   :89900   Max.   :399550   Max.   :855909  
##                                   NA's   :518      NA's   :112     
##      REASON          JOB            YOJ             DEROG        
##         : 252          : 279   Min.   : 0.000   Min.   : 0.0000  
##  DebtCon:3928   Mgr    : 767   1st Qu.: 3.000   1st Qu.: 0.0000  
##  HomeImp:1780   Office : 948   Median : 7.000   Median : 0.0000  
##                 Other  :2388   Mean   : 8.922   Mean   : 0.2546  
##                 ProfExe:1276   3rd Qu.:13.000   3rd Qu.: 0.0000  
##                 Sales  : 109   Max.   :41.000   Max.   :10.0000  
##                 Self   : 193   NA's   :515      NA's   :708      
##      DELINQ            CLAGE             NINQ             CLNO     
##  Min.   : 0.0000   Min.   :   0.0   Min.   : 0.000   Min.   : 0.0  
##  1st Qu.: 0.0000   1st Qu.: 115.1   1st Qu.: 0.000   1st Qu.:15.0  
##  Median : 0.0000   Median : 173.5   Median : 1.000   Median :20.0  
##  Mean   : 0.4494   Mean   : 179.8   Mean   : 1.186   Mean   :21.3  
##  3rd Qu.: 0.0000   3rd Qu.: 231.6   3rd Qu.: 2.000   3rd Qu.:26.0  
##  Max.   :15.0000   Max.   :1168.2   Max.   :17.000   Max.   :71.0  
##  NA's   :580       NA's   :308      NA's   :510      NA's   :222   
##     DEBTINC        
##  Min.   :  0.5245  
##  1st Qu.: 29.1400  
##  Median : 34.8183  
##  Mean   : 33.7799  
##  3rd Qu.: 39.0031  
##  Max.   :203.3121  
##  NA's   :1267
# Summarize data structure
str(df)
## 'data.frame':    5960 obs. of  13 variables:
##  $ BAD    : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ LOAN   : int  1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
##  $ MORTDUE: num  25860 70053 13500 NA 97800 ...
##  $ VALUE  : num  39025 68400 16700 NA 112000 ...
##  $ REASON : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
##  $ JOB    : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
##  $ YOJ    : num  10.5 7 4 NA 3 9 5 11 3 16 ...
##  $ DEROG  : int  0 0 0 NA 0 0 3 0 0 0 ...
##  $ DELINQ : int  0 2 0 NA 0 0 2 0 2 0 ...
##  $ CLAGE  : num  94.4 121.8 149.5 NA 93.3 ...
##  $ NINQ   : int  1 0 1 NA 0 1 1 0 1 0 ...
##  $ CLNO   : int  9 14 10 NA 14 8 17 8 12 13 ...
##  $ DEBTINC: num  NA NA NA NA NA ...

Formatting Features and managing some levels

# Convert the target and the count-like variables to factors
BAD <- df$BAD <- as.factor(df$BAD)
df$LOAN <- as.numeric(df$LOAN)
df$DEROG <- as.factor(df$DEROG)
df$DELINQ <- as.factor(df$DELINQ)
df$NINQ <- as.factor(df$NINQ)
df$CLNO <- as.factor(df$CLNO)
# "NA" is not an existing factor level, so these assignments coerce the empty
# strings to real NA values (hence the warnings below)
df$JOB[df$JOB == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$JOB == "", value = structure(c(4L, 4L, :
## invalid factor level, NA generated
df$REASON[df$REASON == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$REASON == "", value = structure(c(3L, :
## invalid factor level, NA generated

Handling missing values

My approach: I’ve filled in missing values with the median for numerical variables and with the most common level for the categorical ones. I’ve also created boolean features, set to 1 (value was missing) or 0 (actual value), for each variable with missing values (an approach suggested by Pawel Grabinski).

mi_summary <- function(data_frame){
  # percentage of missing values per column
  perc_missing <- sapply(data_frame, function(x) mean(is.na(x)) * 100)
  keep <- perc_missing > 0
  out <- data.frame(col_name = colnames(data_frame)[keep],
                    perc_missing = round(perc_missing[keep], 6))
  out <- out[order(out$perc_missing, decreasing = TRUE), ]
  rownames(out) <- NULL
  return(out)
}
missing_summary <- mi_summary(df)
missing_summary
##    col_name perc_missing
## 1   DEBTINC    21.258389
## 2     DEROG    11.879195
## 3    DELINQ     9.731544
## 4   MORTDUE     8.691275
## 5       YOJ     8.640940
## 6      NINQ     8.557047
## 7     CLAGE     5.167785
## 8       JOB     4.681208
## 9    REASON     4.228188
## 10     CLNO     3.724832
## 11    VALUE     1.879195

Create boolean indicator variables for features with NA’s

df <- df %>%
  mutate(DEBTINC_NA = ifelse(is.na(DEBTINC),1,0)) %>%
  mutate(DEROG_NA = ifelse(is.na(DEROG),1,0)) %>%
  mutate(DELINQ_NA = ifelse(is.na(DELINQ),1,0)) %>%
  mutate(MORTDUE_NA = ifelse(is.na(MORTDUE),1,0)) %>%
  mutate(YOJ_NA = ifelse(is.na(YOJ),1,0)) %>%
  mutate(NINQ_NA = ifelse(is.na(NINQ),1,0)) %>%
  mutate(CLAGE_NA = ifelse(is.na(CLAGE),1,0)) %>%
  mutate(CLNO_NA = ifelse(is.na(CLNO),1,0)) %>%
  mutate(VALUE_NA = ifelse(is.na(VALUE),1,0)) %>%
  mutate(JOB_NA = ifelse(is.na(JOB),1,0)) %>%
  mutate(REASON_NA = ifelse(is.na(REASON),1,0))
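
With dplyr >= 1.0.0 (the session above runs 0.8.3) the same indicators can be built in a single step with across(); a sketch:

# One mutate(across()) call instead of eleven mutate() calls (requires dplyr >= 1.0.0)
na_cols <- c('DEBTINC','DEROG','DELINQ','MORTDUE','YOJ','NINQ','CLAGE','CLNO','VALUE','JOB','REASON')
df <- df %>%
  mutate(across(all_of(na_cols), ~ as.integer(is.na(.x)), .names = "{.col}_NA"))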

Impute missing values with the median for numerical columns and with the most common level for categorical columns

for (col in missing_summary$col_name){
  if (class(df[,col]) == 'factor'){
    # categorical: impute with the most common level (the mode of the factor)
    unique_levels <- unique(df[,col])
    df[is.na(df[,col]), col] <- unique_levels[which.max(tabulate(match(df[,col], unique_levels)))]
  } else {
    # numerical: impute with the median
    df[is.na(df[,col]),col] <- median(as.numeric(df[,col]), na.rm = TRUE)
  }
}
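
The mode expression above is compact; unpacked on a toy factor it works like this:

# Illustrative only: finding the most frequent level of a factor
x  <- factor(c('a', 'b', 'b', NA))
lv <- unique(x)                        # candidate values (may include NA)
tabulate(match(x, lv))                 # counts per candidate: 1, 2, 1
lv[which.max(tabulate(match(x, lv)))]  # most frequent level: "b"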

Check results

pMiss <- function(x){sum(is.na(x))/length(x)*100}
pMiss <- apply(df,2,pMiss)
pMiss <- pMiss[pMiss > 0]
pMiss <- pMiss[order(pMiss, decreasing=T)]
pMiss
## named numeric(0)

Formatting new features and managing some levels

# Convert all the NA indicator columns to factors in one pass
na_flag_cols <- c('DEBTINC_NA','DEROG_NA','DELINQ_NA','MORTDUE_NA','YOJ_NA','NINQ_NA','CLAGE_NA','CLNO_NA','VALUE_NA','JOB_NA','REASON_NA')
df[na_flag_cols] <- lapply(df[na_flag_cols], as.factor)
# Re-create JOB and REASON to drop the empty-string level left over from the imputation
df$JOB <- factor(df$JOB, labels=c('Mgr','Office','Other','ProfExe','Sales','Self'))
df$REASON <- factor(df$REASON, labels=c('DebtCon','HomeImp'))

Split data set into categorical, boolean and numerical variables

I’ve grouped the data set into numerical, categorical and boolean variables for the exploratory data analysis.

cat <- df[,sapply(df, is.factor)] %>%
  select_if(~nlevels(.) <=15 ) %>%
  select(-BAD)
bol <- df[,c('DEBTINC_NA','DEROG_NA','DELINQ_NA','MORTDUE_NA','YOJ_NA','NINQ_NA','CLAGE_NA','CLNO_NA','VALUE_NA','JOB_NA','REASON_NA')]
num <- df[,sapply(df, is.numeric)]

Summarize the class distribution of the target variable

The target variable is grouped into two classes: “Loan defaulted (1)” and “Loan repaid (0)”. Looking at the barplot, it’s quite imbalanced.

Visualize data
cbind(freq=table(df$BAD), percentage=prop.table(table(df$BAD))*100)
##   freq percentage
## 0 4771   80.05034
## 1 1189   19.94966
ggplot(df, aes(BAD, fill=BAD)) + geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Distribution of Target variable")

Analysis for categorical features (barplot, univariate analysis, bivariate analysis)

I’ve grouped all categorical features into a new subset: I’ve done a graphical analysis using barplots and counted the frequency of each class. For the bivariate analysis I’ve used a Chi-Square test to evaluate the relationship between the target variable and each categorical feature. The statistic measures how much the observed counts differ from the counts we would expect if there were no relationship at all in the population.

Univariate Analysis
cat <- cat[,c('DELINQ','REASON','JOB','DEROG')]
for(i in 1:length(cat)) {
  counts <- table(cat[,i])
  name <- names(cat)[i]
  barplot(counts, main=name, col=c("blue","red","green","orange","purple"))
}

Bivariate Analysis with Feature Selection Analysis

The Chi-Square test is used for feature selection, testing the null hypothesis of independence between the target variable and each categorical feature. The goal is to test whether two classifications are independent. Two classifications are independent if the distribution of one is not influenced by the other. If the null hypothesis is not rejected (p-value > 0.05), the two classifications are considered independent and the feature can be dropped.
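
As a toy illustration of this decision rule (hypothetical counts):

# Chi-square test of independence on a 2x2 contingency table
tab <- matrix(c(30, 70, 10, 90), nrow = 2, byrow = TRUE)
chisq.test(tab)  # small p-value -> reject independence, so keep the feature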

par(mfrow=c(2,2))
for(i in 1:length(cat)){
  freq=table(cat[,i])
  percentage=prop.table(table(cat[,i]))*100
  freq_cat_outcome=table(BAD,cat[,i])
  name <- names(cat)[i]
  cat(sep="\n")
  cat(paste("Distribution of", name), sep="\n")
  print(cbind(freq,percentage))
  cat(sep="\n")
  cat(paste("Distribution by Target variable and", name), sep="\n")
  print(freq_cat_outcome)
  cat(sep="\n")
  cat(paste("Chi-squared test by Target variable and", name), sep="\n")
  suppressWarnings({print(chisq.test(table(BAD,cat[,i])))})
}
## 
## Distribution of DELINQ
##    freq  percentage
## 0  4759 79.84899329
## 1   654 10.97315436
## 2   250  4.19463087
## 3   129  2.16442953
## 4    78  1.30872483
## 5    38  0.63758389
## 6    27  0.45302013
## 7    13  0.21812081
## 8     5  0.08389262
## 10    2  0.03355705
## 11    2  0.03355705
## 12    1  0.01677852
## 13    1  0.01677852
## 15    1  0.01677852
## 
## Distribution by Target variable and DELINQ
##    
## BAD    0    1    2    3    4    5    6    7    8   10   11   12   13   15
##   0 4104  432  138   58   32    7    0    0    0    0    0    0    0    0
##   1  655  222  112   71   46   31   27   13    5    2    2    1    1    1
## 
## Chi-squared test by Target variable and DELINQ
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 763.8, df = 13, p-value < 2.2e-16
## 
## 
## Distribution of REASON
##         freq percentage
## DebtCon 4180   70.13423
## HomeImp 1780   29.86577
## 
## Distribution by Target variable and REASON
##    
## BAD DebtCon HomeImp
##   0    3387    1384
##   1     793     396
## 
## Chi-squared test by Target variable and REASON
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(BAD, cat[, i])
## X-squared = 8.1852, df = 1, p-value = 0.004223
## 
## 
## Distribution of JOB
##         freq percentage
## Mgr      767  12.869128
## Office   948  15.906040
## Other   2667  44.748322
## ProfExe 1276  21.409396
## Sales    109   1.828859
## Self     193   3.238255
## 
## Distribution by Target variable and JOB
##    
## BAD  Mgr Office Other ProfExe Sales Self
##   0  588    823  2090    1064    71  135
##   1  179    125   577     212    38   58
## 
## Chi-squared test by Target variable and JOB
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 73.815, df = 5, p-value = 1.644e-14
## 
## 
## Distribution of DEROG
##    freq  percentage
## 0  5235 87.83557047
## 1   435  7.29865772
## 2   160  2.68456376
## 3    58  0.97315436
## 4    23  0.38590604
## 5    15  0.25167785
## 6    15  0.25167785
## 7     8  0.13422819
## 8     6  0.10067114
## 9     3  0.05033557
## 10    2  0.03355705
## 
## Distribution by Target variable and DEROG
##    
## BAD    0    1    2    3    4    5    6    7    8    9   10
##   0 4394  266   78   15    5    8    5    0    0    0    0
##   1  841  169   82   43   18    7   10    8    6    3    2
## 
## Chi-squared test by Target variable and DEROG
## 
##  Pearson's Chi-squared test
## 
## data:  table(BAD, cat[, i])
## X-squared = 503.99, df = 10, p-value < 2.2e-16
Visualization of Bivariate Analysis
# BAD is not a column of `cat`; ggplot finds the global factor created earlier
pl1 <- cat %>%
  ggplot(aes(x=BAD, y=DELINQ, fill=BAD)) + 
  geom_bar(stat='identity') + 
  ggtitle("Distribution by BAD and DELINQ")
pl2 <- cat %>%
  ggplot(aes(x=BAD, y=REASON, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and REASON")
pl3 <- cat %>%
  ggplot(aes(x=BAD, y=JOB, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and JOB")
pl4 <- cat %>%
  ggplot(aes(x=BAD, y=DEROG, fill=BAD)) + 
  geom_bar(stat='identity') +
  ggtitle("Distribution by BAD and DEROG")
# grid.arrange handles the layout; par(mfrow) has no effect on grid graphics
grid.arrange(pl1,pl2,pl3,pl4, ncol=2)

One-hot encoding on categorical features

I’ve transformed the categorical features into numerical variables using one-hot encoding, so that they can be handled by machine learning models.

dmy <- dummyVars("~.", data = cat,fullRank = F)
cat_num <- data.frame(predict(dmy, newdata = cat))

Remove correlated levels from boolean features

drop_cols <- c('DEBTINC_NA.0','DEROG_NA.0','DELINQ_NA.0','MORTDUE_NA.0','YOJ_NA.0','NINQ_NA.0','CLAGE_NA.0','CLNO_NA.0','VALUE_NA.0','JOB_NA.0','REASON_NA.0')
categorical <- cat_num[,!colnames(cat_num) %in% drop_cols]
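
Alternatively, dummyVars() can drop one dummy per factor directly with fullRank = TRUE, which removes the perfectly collinear column without a manual drop list; a sketch:

# fullRank = TRUE keeps k-1 dummies for a k-level factor
dmy2 <- dummyVars(~ ., data = cat, fullRank = TRUE)
cat_num2 <- data.frame(predict(dmy2, newdata = cat))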

Analysis for numerical features (univariate analysis, bivariate analysis)

I’ve grouped all numerical features into a new subset: I’ve done a graphical analysis using histograms, density plots and boxplots; I’ve also computed metrics such as standard deviation, skewness and kurtosis for each variable. For the bivariate analysis I’ve used an ANOVA test to evaluate the relationship between the target variable and each numerical feature. This test assesses whether the means of two or more groups are statistically different.

Univariate Analysis, histograms
par(mfrow=c(2,3))
for(i in 1:length(num)) {
  hist(num[,i], main=names(num)[i], col='blue')
}

Univariate Analysis, boxplots
par(mfrow=c(2,3))
for(i in 1:length(num)) {
  boxplot(num[,i], main=names(num)[i], col='orange')
}

Univariate Analysis, density plots
par(mfrow=c(2,3))
for(i in 1:length(num)){
  plot(density(num[,i]), main=names(num)[i], col='red')
}

Bivariate Analysis
for(i in 1:length(num)){
  name <- names(num)[i]
  cat(paste("Distribution of", name), sep="\n")
  #cat(names(num)[i],sep = "\n")
  print(summary(num[,i]))
  cat(sep="\n")
  stand.deviation = sd(num[,i])
  variance = var(num[,i])
  # sample skewness (third standardized moment) and excess kurtosis (fourth minus 3)
  skewness = mean((num[,i] - mean(num[,i]))^3/sd(num[,i])^3)
  kurtosis = mean((num[,i] - mean(num[,i]))^4/sd(num[,i])^4) - 3
  # count of points flagged by the boxplot rule (outside 1.5*IQR)
  outlier_values <- sum(table(boxplot.stats(num[,i])$out))
  cat(paste("Statistical analysis of", name), sep="\n")
  print(cbind(stand.deviation, variance, skewness, kurtosis, outlier_values))
  cat(sep="\n")
  cat(paste("anova_test between BAD and", name),sep = "\n")
  print(summary(aov(as.numeric(BAD)~num[,i], data=num)))
  cat(sep="\n")
}
## Distribution of LOAN
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1100   11100   16300   18608   23300   89900 
## 
## Statistical analysis of LOAN
##      stand.deviation  variance skewness kurtosis outlier_values
## [1,]        11207.48 125607617 2.022762 6.922438            256
## 
## anova_test between BAD and LOAN
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    5.4   5.368   33.79 6.45e-09 ***
## Residuals   5958  946.4   0.159                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of MORTDUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2063   48139   65019   73001   88200  399550 
## 
## Statistical analysis of MORTDUE
##      stand.deviation   variance skewness kurtosis outlier_values
## [1,]        42552.73 1810734556 1.941153 7.440693            308
## 
## anova_test between BAD and MORTDUE
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    2.0  2.0303   12.74 0.000361 ***
## Residuals   5958  949.8  0.1594                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of VALUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8000   66490   89236  101540  119005  855909 
## 
## Statistical analysis of VALUE
##      stand.deviation   variance skewness kurtosis outlier_values
## [1,]        56869.44 3234132829 3.088963 24.85644            347
## 
## anova_test between BAD and VALUE
##               Df Sum Sq Mean Sq F value  Pr(>F)   
## num[, i]       1    1.3  1.2675   7.945 0.00484 **
## Residuals   5958  950.5  0.1595                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of YOJ
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   7.000   8.756  12.000  41.000 
## 
## Statistical analysis of YOJ
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]        7.259424 52.69923 1.092072 0.744703            211
## 
## anova_test between BAD and YOJ
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## num[, i]       1    2.8  2.7710    17.4 3.08e-05 ***
## Residuals   5958  949.0  0.1593                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of CLAGE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   117.4   173.5   179.4   227.1  1168.2 
## 
## Statistical analysis of CLAGE
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]         83.5747  6984.73 1.389903 8.180556             66
## 
## anova_test between BAD and CLAGE
##               Df Sum Sq Mean Sq F value Pr(>F)    
## num[, i]       1   26.1  26.106     168 <2e-16 ***
## Residuals   5958  925.7   0.155                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution of DEBTINC
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.5245  30.7632  34.8183  34.0007  37.9499 203.3122 
## 
## Statistical analysis of DEBTINC
##      stand.deviation variance skewness kurtosis outlier_values
## [1,]        7.644528 58.43881 3.111598  64.0733            245
## 
## anova_test between BAD and DEBTINC
##               Df Sum Sq Mean Sq F value Pr(>F)    
## num[, i]       1   22.7  22.733   145.8 <2e-16 ***
## Residuals   5958  929.1   0.156                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Visualization of Bivariate Analysis
pl5 <- num %>%
  ggplot(aes(x=BAD, y=LOAN, fill=BAD)) + geom_boxplot() 
pl6 <- num %>%
  ggplot(aes(x=BAD, y=MORTDUE, fill=BAD)) + geom_boxplot() 
pl7 <- num %>%
  ggplot(aes(x=BAD, y=VALUE, fill=BAD)) + geom_boxplot() 
pl8 <- num %>%
  ggplot(aes(x=BAD, y=YOJ, fill=BAD)) + geom_boxplot() 
pl9 <- num %>%
  ggplot(aes(x=BAD, y=CLAGE, fill=BAD)) + geom_boxplot() 
pl10 <- num %>%
  ggplot(aes(x=BAD, y=DEBTINC, fill=BAD)) + geom_boxplot() 
# grid.arrange handles the layout; par(mfrow) has no effect on grid graphics
grid.arrange(pl5,pl6,pl7,pl8,pl9,pl10, ncol=2)

Handling outliers

Outliers are observations that lie more than 1.5 times the interquartile range (the difference between the 75th and 25th percentiles) beyond the quartiles. If they are not detected and handled appropriately, they can distort the predictions. There are several ways to handle outliers; I’ve decided to cap them, replacing observations below the lower limit with the value of the 5th percentile and those above the upper limit with the value of the 95th percentile.

# Before
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN")

# Managing outliers
qnt <- quantile(num$LOAN, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$LOAN, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$LOAN, na.rm = T)
num$LOAN[num$LOAN < (qnt[1] - H)]  <- caps[1]
num$LOAN[num$LOAN >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN after handling outliers")

# Before
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE")

# Managing outliers
qnt <- quantile(num$MORTDUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$MORTDUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$MORTDUE, na.rm = T)
num$MORTDUE[num$MORTDUE < (qnt[1] - H)]  <- caps[1]
num$MORTDUE[num$MORTDUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3)  + ggtitle("MORTDUE after handling outliers")

# Before
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE")

# Managing outliers
qnt <- quantile(num$VALUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$VALUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$VALUE, na.rm = T)
num$VALUE[num$VALUE < (qnt[1] - H)]  <- caps[1]
num$VALUE[num$VALUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE after handling outliers")

# Before
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ")

# Managing outliers
qnt <- quantile(num$YOJ, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$YOJ, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$YOJ, na.rm = T)
num$YOJ[num$YOJ < (qnt[1] - H)]  <- caps[1]
num$YOJ[num$YOJ >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ after handling outliers")

# Before
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE")

# Managing outliers
qnt <- quantile(num$CLAGE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$CLAGE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$CLAGE, na.rm = T)
num$CLAGE[num$CLAGE < (qnt[1] - H)]  <- caps[1]
num$CLAGE[num$CLAGE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE after handling outliers")

# Before
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC")

# Managing outliers
qnt <- quantile(num$DEBTINC, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$DEBTINC, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$DEBTINC, na.rm = T)
num$DEBTINC[num$DEBTINC < (qnt[1] - H)]  <- caps[1]
num$DEBTINC[num$DEBTINC >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC after handling outliers")
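
The capping logic is identical for every variable, so it could be factored into a small helper; a sketch:

# Cap values lying beyond 1.5*IQR from the quartiles at the 5th/95th percentiles
cap_outliers <- function(x, probs = c(.05, .95)) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = probs, na.rm = TRUE)
  H    <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  x
}
# e.g. num$LOAN <- cap_outliers(num$LOAN)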

Delete Zero- and Near-Zero-Variance Predictors

The goal of this step is to delete predictors that have a single unique value, or a handful of unique values occurring with very low frequency. These “zero-variance predictors” may cause problems when fitting algorithms (excluding tree-based models).

data <- cbind(categorical,num) 
nzv <- nearZeroVar(data, saveMetrics= TRUE)
nzv[nzv$nzv,][1:15,]
##            freqRatio percentUnique zeroVar  nzv
## DELINQ.2    22.84000    0.03355705   FALSE TRUE
## DELINQ.3    45.20155    0.03355705   FALSE TRUE
## DELINQ.4    75.41026    0.03355705   FALSE TRUE
## DELINQ.5   155.84211    0.03355705   FALSE TRUE
## DELINQ.6   219.74074    0.03355705   FALSE TRUE
## DELINQ.7   457.46154    0.03355705   FALSE TRUE
## DELINQ.8  1191.00000    0.03355705   FALSE TRUE
## DELINQ.10 2979.00000    0.03355705   FALSE TRUE
## DELINQ.11 2979.00000    0.03355705   FALSE TRUE
## DELINQ.12 5959.00000    0.03355705   FALSE TRUE
## DELINQ.13 5959.00000    0.03355705   FALSE TRUE
## DELINQ.15 5959.00000    0.03355705   FALSE TRUE
## JOB.Sales   53.67890    0.03355705   FALSE TRUE
## JOB.Self    29.88083    0.03355705   FALSE TRUE
## DEROG.2     36.25000    0.03355705   FALSE TRUE
nzv <- nearZeroVar(data)
data_new <- data[, -nzv]

Correlation

Another feature selection approach is to examine the correlation between variables; I apply it to the whole data set. In some models, such as linear regression, correlated features can degrade performance (multicollinearity). Although some models, such as ensembles of tree-based models, are not sensitive to this issue, I prefer to remove correlated features anyway because I don’t know in advance which model will be used.

Visualization
par(mfrow=c(1,1))
cor <- cor(data_new,use="complete.obs",method = "spearman")
corrplot(cor, type="lower", tl.col = "black", diag=FALSE, method="number", mar = c(0, 0, 2, 0), title="Correlation") 

summary(cor[upper.tri(cor)])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.000000 -0.038116  0.005731 -0.014157  0.043268  0.792421
Delete correlated features
tmp <- cor(data_new)
tmp[upper.tri(tmp)] <- 0
diag(tmp) <- 0
df_new <- data_new[,!apply(tmp,2,function(x) any(abs(x) > 0.75))]
cor <- cor(df_new,use="complete.obs",method = "spearman")
summary(cor[upper.tri(cor)])
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.6988623 -0.0421582  0.0004314 -0.0234455  0.0307002  0.3406328
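
caret also provides findCorrelation(), which implements a similar cutoff-based filter and suggests which columns to remove; a sketch (its choices may differ slightly from the manual loop above):

# Columns whose pairwise correlation exceeds the cutoff
high_cor <- findCorrelation(cor(data_new), cutoff = 0.75)
df_new_alt <- data_new[, -high_cor]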

Pre-processing

To analyze the performance of a model, it is good practice to split the data set into a training set and a test set. The training set is the sample of data used to fit the model; the test set is a sample of data used to provide an unbiased evaluation of the model on data it has never seen before.

# calculate the pre-process parameters from the data set
set.seed(2019)
preprocessParams <- preProcess(df_new, method=c("center", "scale"))
# Transform the data set using the parameters
transformed <- predict(preprocessParams, df_new)
# Manage levels on the target variable
y <- as.factor(df$BAD)
transformed <- cbind.data.frame(transformed,y)
levels(transformed$y) <- make.names(levels(factor(transformed$y)))
str(transformed)
## 'data.frame':    5960 obs. of  14 variables:
##  $ DELINQ.0      : num  0.502 -1.99 0.502 0.502 0.502 ...
##  $ DELINQ.1      : num  -0.351 -0.351 -0.351 -0.351 -0.351 ...
##  $ REASON.HomeImp: num  1.532 1.532 1.532 -0.653 1.532 ...
##  $ JOB.Mgr       : num  -0.384 -0.384 -0.384 -0.384 -0.384 ...
##  $ JOB.Office    : num  -0.435 -0.435 -0.435 -0.435 2.299 ...
##  $ JOB.Other     : num  1.11 1.11 1.11 1.11 -0.9 ...
##  $ JOB.ProfExe   : num  -0.522 -0.522 -0.522 -0.522 -0.522 ...
##  $ DEROG.1       : num  -0.281 -0.281 -0.281 -0.281 -0.281 ...
##  $ LOAN          : num  -1.86 -1.84 -1.81 -1.81 -1.79 ...
##  $ VALUE         : num  -1.322 -0.669 -1.818 -0.206 0.3 ...
##  $ YOJ           : num  0.279 -0.233 -0.672 -0.233 -0.819 ...
##  $ CLAGE         : num  -1.091 -0.73 -0.367 -0.052 -1.104 ...
##  $ DEBTINC       : num  0.143 0.143 0.143 0.143 0.143 ...
##  $ y             : Factor w/ 2 levels "X0","X1": 2 2 2 2 1 2 2 2 2 2 ...

Split data set

# Draw a random, stratified sample containing 80% of the data (the training index)
set.seed(12345)
test_index <- createDataPartition(transformed$y, p=0.80, list=FALSE)
# select the remaining 20% of the data for the test set
itest <- transformed[-test_index,]
# use the 80% sample to train the models
itrain <- transformed[test_index,]
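
One caveat: preProcess() above was fitted on the full data set before splitting, so the test rows influence the centering and scaling parameters. A leakage-free variant fits the parameters on the training rows only; a sketch with hypothetical raw splits:

# itrain_raw / itest_raw: hypothetical unscaled training and test splits
pp <- preProcess(itrain_raw, method = c("center", "scale"))
train_scaled <- predict(pp, itrain_raw)
test_scaled  <- predict(pp, itest_raw)  # reuses the training-set parameters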

Sampling Methods: evaluating models with caret

In this analysis I’ve applied sampling methods to the same four baseline models: Logistic Regression, as the simplest model and a benchmark; two ensemble models (Random Forest and Gradient Boosting Machine); and a Neural Network.

Downsampling

Randomly subsample the majority class so that it matches the size of the minority class.
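
What sampling = "down" does inside each resample can be reproduced with caret::downSample(); a sketch on the training set:

# Down-sample the majority class to the size of the minority class
down_train <- downSample(x = itrain[, setdiff(names(itrain), "y")], y = itrain$y)
table(down_train$Class)  # both classes now have equal counts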

# Stratified cross-validation
folds <- 5
set.seed(2019)
cvIndex <- createFolds(factor(itrain$y), folds, returnTrain = T)
# prSummary reports the area under the precision-recall curve (AUC), Precision, Recall and F
control <- trainControl(index = cvIndex, method="repeatedcv", number=folds, classProbs = TRUE, summaryFunction = prSummary, sampling = "down")
metric <- "AUC"
GLM
set.seed(2019)
fit.glm <- train(y~., data=itrain, method="glm", family=binomial(link='logit'), metric=metric, trControl=control)
print(fit.glm)
## Generalized Linear Model 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Addtional sampling using down-sampling
## 
## Resampling results:
## 
##   AUC        Precision  Recall     F        
##   0.9222213  0.8986736  0.7377612  0.8101311
plot(varImp(fit.glm),15, main = 'GLM feature selection')

RANDOM FOREST
set.seed(2019)
fit.rf <- train(y~., data=itrain, method="rf", metric=metric, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  AUC        Precision  Recall     F        
##    2    0.9717582  0.9494766  0.8564296  0.9004898
##    7    0.9673899  0.9526566  0.8477840  0.8971426
##   13    0.9622294  0.9496174  0.8538162  0.8991397
## 
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(fit.rf),15, main='Random Forest feature selection')

NNET
set.seed(2019)
fit.nnet <- train(y~., data=itrain, method="nnet", metric=metric, trControl=control)
## (nnet optimization trace omitted; pass trace = FALSE to train() to silence it)
print(fit.nnet)
## Neural Network 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   size  decay  AUC        Precision  Recall     F        
##   1     0e+00  0.7852862  0.9025181  0.7238673  0.8033016
##   1     1e-04  0.9220693  0.9078548  0.7104987  0.7968063
##   1     1e-01  0.9230165  0.9006473  0.7338252  0.8084184
##   3     0e+00  0.8615307  0.9024859  0.7005266  0.7879552
##   3     1e-04  0.9177405  0.9040929  0.7354127  0.8103297
##   3     1e-01  0.9270808  0.9162541  0.7382775  0.8175502
##   5     0e+00  0.9067854  0.9042002  0.7149414  0.7980137
##   5     1e-04  0.9184409  0.9038094  0.7057900  0.7916020
##   5     1e-01  0.9392531  0.9181286  0.7608054  0.8319055
## 
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(varImp(fit.nnet),15, main = 'Neural Network feature selection')

GBM
set.seed(2019)
fit.gbm <- train(y~., data=itrain, method="gbm", metric=metric, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  AUC        Precision  Recall     F        
##   1                   50      0.8995544  0.9366174  0.8414951  0.8863661
##   1                  100      0.9578286  0.9421950  0.8359949  0.8858860
##   1                  150      0.9618027  0.9447139  0.8359946  0.8869571
##   2                   50      0.9599822  0.9407903  0.8532844  0.8948068
##   2                  100      0.9636186  0.9423435  0.8519762  0.8948579
##   2                  150      0.9648336  0.9425322  0.8504035  0.8940648
##   3                   50      0.9627674  0.9461036  0.8404483  0.8901064
##   3                  100      0.9660428  0.9463940  0.8454259  0.8930295
##   3                  150      0.9685070  0.9492507  0.8414933  0.8920759
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
par(mar = c(4, 11, 1, 1))
summary(fit.gbm, cBars=15, las=2, plotit=T, main = 'GBM feature selection')

##                           var     rel.inf
## DEBTINC               DEBTINC 59.55084456
## CLAGE                   CLAGE 11.71658042
## DELINQ.0             DELINQ.0  9.23981342
## VALUE                   VALUE  7.38265155
## YOJ                       YOJ  4.62293401
## LOAN                     LOAN  4.09278121
## JOB.Office         JOB.Office  1.16219326
## DELINQ.1             DELINQ.1  0.73893268
## DEROG.1               DEROG.1  0.70718645
## REASON.HomeImp REASON.HomeImp  0.45258375
## JOB.Other           JOB.Other  0.11948720
## JOB.Mgr               JOB.Mgr  0.11710766
## JOB.ProfExe       JOB.ProfExe  0.09690382

Comparison of algorithms

A graphical comparison of the models’ performance on the training set.

results <- resamples(list(glm=fit.glm, rf=fit.rf, nnet=fit.nnet, gbm=fit.gbm))
cat(paste('Results'), sep='\n')
## Results
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: glm, rf, nnet, gbm 
## Number of resamples: 5 
## 
## AUC 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.9153513 0.9180248 0.9218465 0.9222213 0.9221887 0.9336949    0
## rf   0.9666777 0.9695952 0.9707874 0.9717582 0.9723774 0.9793531    0
## nnet 0.9195431 0.9324396 0.9465394 0.9392531 0.9480358 0.9497077    0
## gbm  0.9636049 0.9657601 0.9657630 0.9685070 0.9727419 0.9746651    0
## 
## F 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.7988423 0.8014599 0.8063814 0.8101311 0.8219564 0.8220157    0
## rf   0.8859527 0.8965996 0.9047293 0.9004898 0.9055824 0.9095853    0
## nnet 0.8189091 0.8225352 0.8312817 0.8319055 0.8354978 0.8513037    0
## gbm  0.8868715 0.8898246 0.8929068 0.8920759 0.8951724 0.8956044    0
## 
## Precision 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.8875380 0.8932039 0.9025974 0.8986736 0.9040881 0.9059406    0
## rf   0.9437037 0.9462518 0.9482759 0.9494766 0.9542097 0.9549419    0
## nnet 0.8888889 0.9214403 0.9221374 0.9181286 0.9288026 0.9293740    0
## gbm  0.9421965 0.9460641 0.9491779 0.9492507 0.9511111 0.9577039    0
## 
## Recall 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.7185864 0.7225131 0.7287025 0.7377612 0.7536042 0.7653997    0
## rf   0.8348624 0.8455497 0.8610747 0.8564296 0.8650066 0.8756545    0
## nnet 0.7369110 0.7522936 0.7588467 0.7608054 0.7653997 0.7905759    0
## gbm  0.8309305 0.8322412 0.8414155 0.8414933 0.8494764 0.8534031    0
dotplot(results, main = 'AUC results from algorithms')

Predictions

set.seed(2019)
prediction.glm <-predict(fit.glm,newdata=itest,type="raw")
set.seed(2019)
prediction.rf <-predict(fit.rf,newdata=itest,type="raw")
set.seed(2019)
prediction.nnet <-predict(fit.nnet,newdata=itest,type="raw")
set.seed(2019)
prediction.gbm <-predict(fit.gbm,newdata=itest,type="raw")

Visualization of results

Results are presented with the confusion matrix and the F1-score, which is a more appropriate evaluation metric for an imbalanced data set.

cat(paste('Confusion Matrix GLM Model'), sep='\n')
## Confusion Matrix GLM Model
confusionMatrix(prediction.glm, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 708  74
##         X1 246 163
##                                           
##                Accuracy : 0.7313          
##                  95% CI : (0.7052, 0.7563)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3378          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7421          
##             Specificity : 0.6878          
##          Pos Pred Value : 0.9054          
##          Neg Pred Value : 0.3985          
##              Prevalence : 0.8010          
##          Detection Rate : 0.5945          
##    Detection Prevalence : 0.6566          
##       Balanced Accuracy : 0.7150          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- fit.glm$results[5]
F1_test <- F1_Score(itest$y, prediction.glm)
cat(paste('F1_train_glm:',F1_train, 'F1_test_glm:', F1_test), sep='\n')
## F1_train_glm: 0.810131128341594 F1_test_glm: 0.815668202764977
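
As a sanity check, the GLM test F1 can be reproduced from the confusion matrix above (positive class X0):

# F1 is the harmonic mean of precision and recall
precision <- 708 / (708 + 74)    # TP / (TP + FP)
recall    <- 708 / (708 + 246)   # TP / (TP + FN)
2 * precision * recall / (precision + recall)  # ~0.8157, matching F1_test_glm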
cat(paste('Confusion Matrix Random Forest Model'), sep='\n')
## Confusion Matrix Random Forest Model
confusionMatrix(prediction.rf, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 823  51
##         X1 131 186
##                                           
##                Accuracy : 0.8472          
##                  95% CI : (0.8255, 0.8672)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 2.291e-05       
##                                           
##                   Kappa : 0.5746          
##                                           
##  Mcnemar's Test P-Value : 4.745e-09       
##                                           
##             Sensitivity : 0.8627          
##             Specificity : 0.7848          
##          Pos Pred Value : 0.9416          
##          Neg Pred Value : 0.5868          
##              Prevalence : 0.8010          
##          Detection Rate : 0.6910          
##    Detection Prevalence : 0.7338          
##       Balanced Accuracy : 0.8237          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- fit.rf$results[[5]][1]
F1_test <- F1_Score(itest$y, prediction.rf)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.900489847090052 F1_test_rf: 0.900437636761488
cat(paste('Confusion Matrix Neural Network Model'), sep='\n')
## Confusion Matrix Neural Network Model
confusionMatrix(prediction.nnet, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 739  61
##         X1 215 176
##                                          
##                Accuracy : 0.7683         
##                  95% CI : (0.7432, 0.792)
##     No Information Rate : 0.801          
##     P-Value [Acc > NIR] : 0.9976         
##                                          
##                   Kappa : 0.4157         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7746         
##             Specificity : 0.7426         
##          Pos Pred Value : 0.9238         
##          Neg Pred Value : 0.4501         
##              Prevalence : 0.8010         
##          Detection Rate : 0.6205         
##    Detection Prevalence : 0.6717         
##       Balanced Accuracy : 0.7586         
##                                          
##        'Positive' Class : X0             
## 
F1_train <- fit.nnet$results[[6]][1]  # F column, first tuning row (not necessarily the selected size/decay)
F1_test <- F1_Score(itest$y, prediction.nnet)
cat(paste('F1_train_nnet:',F1_train,'F1_test_nnet:', F1_test), sep='\n')
## F1_train_nnet: 0.803301647507338 F1_test_nnet: 0.842645381984036
cat(paste('Confusion Matrix GBM Model'), sep='\n')
## Confusion Matrix GBM Model
confusionMatrix(prediction.gbm, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 800  48
##         X1 154 189
##                                           
##                Accuracy : 0.8304          
##                  95% CI : (0.8079, 0.8513)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 0.005457        
##                                           
##                   Kappa : 0.5445          
##                                           
##  Mcnemar's Test P-Value : 1.493e-13       
##                                           
##             Sensitivity : 0.8386          
##             Specificity : 0.7975          
##          Pos Pred Value : 0.9434          
##          Neg Pred Value : 0.5510          
##              Prevalence : 0.8010          
##          Detection Rate : 0.6717          
##    Detection Prevalence : 0.7120          
##       Balanced Accuracy : 0.8180          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- fit.gbm$results[[8]][1]  # F column, first tuning row (not necessarily the selected tune)
F1_test <- F1_Score(itest$y, prediction.gbm)
cat(paste('F1_train_gbm:',F1_train,'F1_test_gbm:', F1_test), sep='\n')
## F1_train_gbm: 0.886366083205398 F1_test_gbm: 0.887902330743618
Confusion Matrix Plots
par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, itest$y)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, itest$y)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nnet, itest$y)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, itest$y)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GBM Confusion Matrix")

Oversampling

Randomly sample the minority class with replacement until it reaches the same size as the majority class.

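With sampling = "up" in trainControl() below, caret performs this up-sampling inside each resampling fold, which avoids leaking duplicated minority cases across folds. For illustration only, a standalone equivalent on the full training set would be caret::upSample() (a sketch, not part of the pipeline; up_train is a hypothetical name):

set.seed(2019)
up_train <- upSample(x = itrain[, names(itrain) != "y"],  # predictors only
                     y = itrain$y, yname = "y")
table(up_train$y)  # both classes now at the majority-class size
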
# Stratified cross-validation
folds <- 5
set.seed(2019)
cvIndex <- createFolds(factor(itrain$y), folds, returnTrain = T)
control <- trainControl(index = cvIndex, method = "repeatedcv", number = folds,
                        classProbs = TRUE, summaryFunction = prSummary, sampling = "up")
metric <- "AUC"
GLM
set.seed(2019)
fit.glm <- train(y~., data=itrain, method="glm", family=binomial(link='logit'), metric=metric, trControl=control)
print(fit.glm)
## Generalized Linear Model 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Addtional sampling using up-sampling
## 
## Resampling results:
## 
##   AUC        Precision  Recall     F        
##   0.9229779  0.8984912  0.7427367  0.8130053
plot(varImp(fit.glm),15, main = 'GLM feature selection')

RANDOM FOREST
set.seed(2019)
fit.rf <- train(y~., data=itrain, method="rf", metric=metric, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Addtional sampling using up-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  AUC        Precision  Recall     F        
##    2    0.9753659  0.9367482  0.9221896  0.9293680
##    7    0.9144147  0.9312547  0.9326700  0.9319351
##   13    0.8260580  0.9234212  0.9221923  0.9227965
## 
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(fit.rf),15, main='Random Forest feature selection')

NNET
set.seed(2019)
fit.nnet <- train(y~., data=itrain, method="nnet", metric=metric, trControl=control, trace=FALSE)  # trace=FALSE suppresses nnet's per-iteration log (as verbose=F does for gbm)
print(fit.nnet)
## Neural Network 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Addtional sampling using up-sampling
## 
## Resampling results across tuning parameters:
## 
##   size  decay  AUC        Precision  Recall     F        
##   1     0e+00  0.9206352  0.9043590  0.7089386  0.7946450
##   1     1e-04  0.9221209  0.9047536  0.7047391  0.7921428
##   1     1e-01  0.9238685  0.9046901  0.7225597  0.8033415
##   3     0e+00  0.8435589  0.9020172  0.7319602  0.8066338
##   3     1e-04  0.9081168  0.9155727  0.7361826  0.8157205
##   3     1e-01  0.9338175  0.9069256  0.7456204  0.8177587
##   5     0e+00  0.9200589  0.8980998  0.7820205  0.8355041
##   5     1e-04  0.9292254  0.9050240  0.7532323  0.8211169
##   5     1e-01  0.9408704  0.9215037  0.7566135  0.8308441
## 
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(varImp(fit.nnet),15, main = 'Neural Network feature selection')

GBM
set.seed(2019)
fit.gbm <- train(y~., data=itrain, method="gbm", metric=metric, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 4769 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Addtional sampling using up-sampling
## 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  AUC        Precision  Recall     F        
##   1                   50      0.7887613  0.9325682  0.8585231  0.8937265
##   1                  100      0.9560085  0.9411022  0.8451648  0.8904113
##   1                  150      0.9625541  0.9410865  0.8407125  0.8879457
##   2                   50      0.9590954  0.9421104  0.8556411  0.8966535
##   2                  100      0.9657656  0.9456085  0.8593085  0.9002933
##   2                  150      0.9672073  0.9441038  0.8663834  0.9035156
##   3                   50      0.9635314  0.9445827  0.8519755  0.8958138
##   3                  100      0.9669553  0.9456678  0.8606222  0.9010865
##   3                  150      0.9689793  0.9464588  0.8697924  0.9064571
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
par(mar = c(4, 11, 1, 1))
summary(fit.gbm, cBars=15, las=2, plotit=T, main = 'GBM feature selection')

##                           var     rel.inf
## DEBTINC               DEBTINC 61.69994826
## DELINQ.0             DELINQ.0 13.97062800
## CLAGE                   CLAGE  9.29015675
## VALUE                   VALUE  6.19009121
## LOAN                     LOAN  2.94490907
## YOJ                       YOJ  2.55295300
## DELINQ.1             DELINQ.1  1.42949366
## JOB.Office         JOB.Office  0.81009594
## DEROG.1               DEROG.1  0.60202923
## REASON.HomeImp REASON.HomeImp  0.27637804
## JOB.ProfExe       JOB.ProfExe  0.18741340
## JOB.Mgr               JOB.Mgr  0.04590344
## JOB.Other           JOB.Other  0.00000000

Comparison of algorithms

A graphical comparison of the models' cross-validated performance on the training set.

results <- resamples(list(glm=fit.glm, rf=fit.rf, nnet=fit.nnet, gbm=fit.gbm))
cat(paste('Results'), sep='\n')
## Results
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: glm, rf, nnet, gbm 
## Number of resamples: 5 
## 
## AUC 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.9153892 0.9186178 0.9231522 0.9229779 0.9235566 0.9341736    0
## rf   0.9657198 0.9730883 0.9737972 0.9753659 0.9788954 0.9853288    0
## nnet 0.9184813 0.9407977 0.9456926 0.9408704 0.9488647 0.9505155    0
## gbm  0.9631971 0.9667793 0.9676233 0.9689793 0.9702821 0.9770149    0
## 
## F 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.8028986 0.8052516 0.8071685 0.8130053 0.8209169 0.8287911    0
## rf   0.9227696 0.9263158 0.9274984 0.9293680 0.9315615 0.9386948    0
## nnet 0.8098694 0.8195051 0.8376437 0.8308441 0.8431655 0.8440367    0
## gbm  0.8955017 0.9038855 0.9051491 0.9064571 0.9083503 0.9193989    0
## 
## Precision 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.8877246 0.8922345 0.8978930 0.8984912 0.9052133 0.9093904    0
## rf   0.9256845 0.9299868 0.9377537 0.9367482 0.9442971 0.9460189    0
## nnet 0.9073171 0.9157734 0.9229508 0.9215037 0.9268680 0.9346093    0
## gbm  0.9382022 0.9422535 0.9431010 0.9464588 0.9486804 0.9600571    0
## 
## Recall 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.7225131 0.7260813 0.7369110 0.7427367 0.7509830 0.7771953    0
## rf   0.9082569 0.9175393 0.9226737 0.9221896 0.9293194 0.9331586    0
## nnet 0.7313237 0.7369110 0.7640891 0.7566135 0.7680210 0.7827225    0
## gbm  0.8479685 0.8678010 0.8743455 0.8697924 0.8768021 0.8820446    0
par(mar = c(4, 11, 1, 1))
dotplot(results, main = 'AUC results from algorithms')

Predictions

par(mfrow=c(2,2))
set.seed(2019)
prediction.glm <- predict(fit.glm, newdata = itest, type = "raw")
set.seed(2019)
prediction.rf <- predict(fit.rf, newdata = itest, type = "raw")
set.seed(2019)
prediction.nnet <- predict(fit.nnet, newdata = itest, type = "raw")
set.seed(2019)
prediction.gbm <- predict(fit.gbm, newdata = itest, type = "raw")

Visualization of results

Results are summarized with the confusion matrix and with the F1-score, which is a more informative evaluation metric than accuracy for an imbalanced data set.

cat(paste('Confusion Matrix GLM Model'), sep='\n')
## Confusion Matrix GLM Model
confusionMatrix(prediction.glm, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 718  77
##         X1 236 160
##                                          
##                Accuracy : 0.7372         
##                  95% CI : (0.7112, 0.762)
##     No Information Rate : 0.801          
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3416         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7526         
##             Specificity : 0.6751         
##          Pos Pred Value : 0.9031         
##          Neg Pred Value : 0.4040         
##              Prevalence : 0.8010         
##          Detection Rate : 0.6029         
##    Detection Prevalence : 0.6675         
##       Balanced Accuracy : 0.7139         
##                                          
##        'Positive' Class : X0             
## 
F1_train <- fit.glm$results[5]
F1_test <- F1_Score(itest$y, prediction.glm)
cat(paste('F1_train_glm:',F1_train, 'F1_test_glm:', F1_test), sep='\n')
## F1_train_glm: 0.813005322258826 F1_test_glm: 0.8210405946255
cat(paste('Confusion Matrix Random Forest Model'), sep='\n')
## Confusion Matrix Random Forest Model
confusionMatrix(prediction.rf, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 878  70
##         X1  76 167
##                                           
##                Accuracy : 0.8774          
##                  95% CI : (0.8574, 0.8955)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 1.859e-12       
##                                           
##                   Kappa : 0.6191          
##                                           
##  Mcnemar's Test P-Value : 0.679           
##                                           
##             Sensitivity : 0.9203          
##             Specificity : 0.7046          
##          Pos Pred Value : 0.9262          
##          Neg Pred Value : 0.6872          
##              Prevalence : 0.8010          
##          Detection Rate : 0.7372          
##    Detection Prevalence : 0.7960          
##       Balanced Accuracy : 0.8125          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- fit.rf$results[[5]][1]
F1_test <- F1_Score(itest$y, prediction.rf)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.929368010236147 F1_test_rf: 0.923238696109359
cat(paste('Confusion Matrix Neural Network Model'), sep='\n')
## Confusion Matrix Neural Network Model
confusionMatrix(prediction.nnet, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 712  68
##         X1 242 169
##                                           
##                Accuracy : 0.7397          
##                  95% CI : (0.7138, 0.7644)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3601          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7463          
##             Specificity : 0.7131          
##          Pos Pred Value : 0.9128          
##          Neg Pred Value : 0.4112          
##              Prevalence : 0.8010          
##          Detection Rate : 0.5978          
##    Detection Prevalence : 0.6549          
##       Balanced Accuracy : 0.7297          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- fit.nnet$results[[6]][1]  # again: first tuning row, not the selected model
F1_test <- F1_Score(itest$y, prediction.nnet)
cat(paste('F1_train_nnet:',F1_train,'F1_test_nnet:', F1_test), sep='\n')
## F1_train_nnet: 0.794644959784861 F1_test_nnet: 0.821222606689735
cat(paste('Confusion Matrix GBM Model'), sep='\n')
## Confusion Matrix GBM Model
confusionMatrix(prediction.gbm, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 812  50
##         X1 142 187
##                                           
##                Accuracy : 0.8388          
##                  95% CI : (0.8167, 0.8592)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 0.0004739       
##                                           
##                   Kappa : 0.5587          
##                                           
##  Mcnemar's Test P-Value : 5.122e-11       
##                                           
##             Sensitivity : 0.8512          
##             Specificity : 0.7890          
##          Pos Pred Value : 0.9420          
##          Neg Pred Value : 0.5684          
##              Prevalence : 0.8010          
##          Detection Rate : 0.6818          
##    Detection Prevalence : 0.7238          
##       Balanced Accuracy : 0.8201          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- fit.gbm$results[[8]][1]  # again: first tuning row, not the selected tune
F1_test <- F1_Score(itest$y, prediction.gbm)
cat(paste('F1_train_gbm:',F1_train,'F1_test_gbm:', F1_test), sep='\n')
## F1_train_gbm: 0.893726484265782 F1_test_gbm: 0.894273127753304
Confusion Matrix Plots
par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, itest$y)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, itest$y)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nnet, itest$y)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, itest$y)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GBM Confusion Matrix")

SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method that creates synthetic examples of the minority class, by interpolating between existing minority cases, instead of simply duplicating them.

set.seed(2019)
smote_train <- SMOTE(y ~ ., data = itrain)
table(smote_train$y)
## 
##   X0   X1 
## 3808 2856
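These counts follow from DMwR's defaults, perc.over = 200, k = 5 and perc.under = 200: each of the 952 minority cases generates two synthetic neighbours (952 + 2*952 = 2856 X1), and the majority class is down-sampled to twice the number of synthetic cases (2*1904 = 3808 X0). A sketch of the same call with those defaults spelled out (the parameter values are DMwR's documented defaults, not tuned here):

set.seed(2019)
smote_train <- SMOTE(y ~ ., data = itrain,
                     perc.over = 200,    # 200% -> two synthetic cases per minority case
                     k = 5,              # nearest neighbours used for interpolation
                     perc.under = 200)   # keep 2x the synthetic count from the majority class
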
# Stratified cross-validation
folds <- 5
set.seed(2019)
# NOTE: these folds index itrain (4769 rows), while the models below train on
# smote_train (6664 rows); to stratify the folds on the SMOTE data itself,
# build them from factor(smote_train$y) instead.
cvIndex <- createFolds(factor(itrain$y), folds, returnTrain = T)
control <- trainControl(index = cvIndex, method = "repeatedcv", number = folds,
                        classProbs = TRUE, summaryFunction = prSummary)
metric <- "AUC"
GLM
set.seed(2019)
fit.glm <- train(y~., data=smote_train, method="glm", family=binomial(link='logit'), metric=metric, trControl=control)
print(fit.glm)
## Generalized Linear Model 
## 
## 6664 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Resampling results:
## 
##   AUC        Precision  Recall     F        
##   0.6088746  0.3247685  0.9569427  0.4849409
plot(varImp(fit.glm),15, main = 'GLM feature selection')

RANDOM FOREST
set.seed(2019)
fit.rf <- train(y~., data=smote_train, method="rf", metric=metric, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 6664 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Resampling results across tuning parameters:
## 
##   mtry  AUC        Precision  Recall     F        
##    2    0.8267628  0.3964015  0.9934463  0.5666820
##    7    0.7882037  0.5019825  0.9727242  0.6621885
##   13    0.6846804  0.4935740  0.9692963  0.6540501
## 
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(fit.rf),15, main='Random Forest feature selection')

NNET
set.seed(2019)
fit.nnet <- train(y~., data=smote_train, method="nnet", metric=metric, trControl=control, trace=FALSE)  # trace=FALSE suppresses nnet's per-iteration log
print(fit.nnet)
## Neural Network 
## 
## 6664 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Resampling results across tuning parameters:
## 
##   size  decay  AUC        Precision  Recall     F        
##   1     0e+00  0.5466821  0.2810342  0.9884363  0.4365438
##   1     1e-04  0.5344953  0.3234498  0.9541768  0.4820788
##   1     1e-01  0.6044792  0.3392303  0.9437910  0.4990655
##   3     0e+00  0.5674257  0.3757548  0.9348374  0.5346354
##   3     1e-04  0.6200640  0.3935681  0.9264093  0.5513988
##   3     1e-01  0.6547411  0.3720097  0.9438047  0.5335487
##   5     0e+00  0.6083395  0.3795557  0.9414122  0.5406185
##   5     1e-04  0.6491697  0.3744020  0.9461800  0.5359312
##   5     1e-01  0.6946661  0.3863132  0.9469730  0.5485364
## 
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(varImp(fit.nnet),15, main = 'Neural Network feature selection')

GBM
set.seed(2019)
fit.gbm <- train(y~., data=smote_train, method="gbm", metric=metric, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 6664 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  AUC        Precision  Recall     F        
##   1                   50      0.4966608  0.3284126  0.9813633  0.4921296
##   1                  100      0.6504328  0.3566349  0.9716581  0.5217598
##   1                  150      0.6744467  0.3718810  0.9617036  0.5363294
##   2                   50      0.6635953  0.3654392  0.9650958  0.5300896
##   2                  100      0.7006521  0.3864208  0.9569708  0.5504646
##   2                  150      0.7089790  0.3970976  0.9577706  0.5614109
##   3                   50      0.6799393  0.3794861  0.9569561  0.5433311
##   3                  100      0.7121102  0.3997628  0.9580287  0.5640839
##   3                  150      0.7328222  0.4092089  0.9595908  0.5737339
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
par(mar = c(4, 11, 1, 1))
summary(fit.gbm, cBars=15, las=2, plotit=T, main = 'GBM feature selection')

##                           var    rel.inf
## DEBTINC               DEBTINC 45.9330043
## DELINQ.0             DELINQ.0 17.4782942
## YOJ                       YOJ 13.7109264
## CLAGE                   CLAGE 10.8536652
## LOAN                     LOAN  4.2013317
## VALUE                   VALUE  3.8245245
## JOB.Office         JOB.Office  1.2721423
## DELINQ.1             DELINQ.1  1.2421182
## DEROG.1               DEROG.1  1.1449823
## REASON.HomeImp REASON.HomeImp  0.3390109
## JOB.Mgr               JOB.Mgr  0.0000000
## JOB.Other           JOB.Other  0.0000000
## JOB.ProfExe       JOB.ProfExe  0.0000000

Comparison of algorithms

A graphical comparison of the models' cross-validated performance on the training set.

results <- resamples(list(glm=fit.glm, rf=fit.rf, nnet=fit.nnet, gbm=fit.gbm))
cat(paste('Results'), sep='\n')
## Results
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: glm, rf, nnet, gbm 
## Number of resamples: 5 
## 
## AUC 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.5976324 0.5998371 0.6008851 0.6088746 0.6147725 0.6312458    0
## rf   0.8116890 0.8128272 0.8293679 0.8267628 0.8341390 0.8457911    0
## nnet 0.6263446 0.6862141 0.7140258 0.6946661 0.7209892 0.7257566    0
## gbm  0.7230957 0.7267140 0.7313329 0.7328222 0.7369240 0.7460444    0
## 
## F 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.4793115 0.4843646 0.4850895 0.4849409 0.4877076 0.4882313    0
## rf   0.5611193 0.5667042 0.5668047 0.5666820 0.5686495 0.5701323    0
## nnet 0.5254613 0.5344070 0.5507246 0.5485364 0.5630153 0.5690738    0
## gbm  0.5702028 0.5728840 0.5734320 0.5737339 0.5734541 0.5786963    0
## 
## Precision 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.3196468 0.3239875 0.3244681 0.3247685 0.3276786 0.3280615    0
## rf   0.3915725 0.3961073 0.3968421 0.3964015 0.3976975 0.3997879    0
## nnet 0.3653155 0.3705584 0.3869239 0.3863132 0.4010067 0.4077615    0
## gbm  0.4070156 0.4074693 0.4077562 0.4092089 0.4089888 0.4148148    0
## 
## Recall 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.9532468 0.9540079 0.9576720 0.9569427 0.9591568 0.9606299    0
## rf   0.9896104 0.9908016 0.9934124 0.9934463 0.9960317 0.9973753    0
## nnet 0.9356110 0.9415584 0.9446640 0.9469730 0.9550265 0.9580052    0
## gbm  0.9493506 0.9566360 0.9591568 0.9595908 0.9658793 0.9669312    0
# dotplot() is a lattice function, so no base-graphics par() margin set-up
# is needed; metric = 'AUC' restricts the plot to the metric named in the title
dotplot(results, metric = 'AUC', main = 'AUC results from the algorithms')
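
Because all four models were resampled on the same folds, caret can also test whether the observed differences are statistically meaningful. A minimal sketch:

# Paired differences across the shared folds; the printed summary shows
# difference estimates in the upper triangle and (Bonferroni-adjusted)
# p-values in the lower triangle
differences <- diff(results)
summary(differences)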

Predictions

# Class predictions on the hold-out test set (the implicit 0.5 probability
# cut-off); predict() is deterministic for these fitted models, so no seeds
# or plotting set-up are needed
prediction.glm  <- predict(fit.glm,  newdata=itest, type="raw")
prediction.rf   <- predict(fit.rf,   newdata=itest, type="raw")
prediction.nnet <- predict(fit.nnet, newdata=itest, type="raw")
prediction.gbm  <- predict(fit.gbm,  newdata=itest, type="raw")
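
type="raw" returns hard class labels only. For a threshold-free view of test performance, class probabilities and a test-set ROC curve can be computed as well; a sketch, assuming the pROC package (an extra dependency, not loaded above) is installed:

# Predicted probability of the default class X1 on the test set
prob.gbm <- predict(fit.gbm, newdata=itest, type="prob")[, "X1"]
library(pROC)
# levels = c(control, case)
roc.gbm <- roc(response=itest$y, predictor=prob.gbm, levels=c("X0", "X1"))
auc(roc.gbm)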

Visualization of results

Results are evaluated with the confusion matrix and the F1-score, which is a more informative metric than accuracy for an imbalanced data set.
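
As a reminder, F1 is the harmonic mean of precision and recall, F1 = 2PR/(P + R). A small helper of my own (not part of the original script) computes it directly from a prediction table:

# F1 from a table with predictions in rows and the reference in columns
f1_from_table <- function(ct, positive) {
  tp <- ct[positive, positive]
  precision <- tp/sum(ct[positive, ])  # TP / predicted positives
  recall <- tp/sum(ct[, positive])     # TP / actual positives
  2*precision*recall/(precision + recall)
}
# e.g. f1_from_table(table(prediction.glm, itest$y), positive="X1")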

cat(paste('Confusion Matrix GLM Model'), sep='\n')
## Confusion Matrix GLM Model
confusionMatrix(prediction.glm, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 778  99
##         X1 176 138
##                                           
##                Accuracy : 0.7691          
##                  95% CI : (0.7441, 0.7928)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 0.997           
##                                           
##                   Kappa : 0.3545          
##                                           
##  Mcnemar's Test P-Value : 4.584e-06       
##                                           
##             Sensitivity : 0.8155          
##             Specificity : 0.5823          
##          Pos Pred Value : 0.8871          
##          Neg Pred Value : 0.4395          
##              Prevalence : 0.8010          
##          Detection Rate : 0.6532          
##    Detection Prevalence : 0.7364          
##       Balanced Accuracy : 0.6989          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- fit.glm$results[5]
F1_test <- F1_Score(itest$y, prediction.glm)
cat(paste('F1_train_glm:',F1_train, 'F1_test_glm:', F1_test), sep='\n')
## F1_train_glm: 0.484940906613912 F1_test_glm: 0.849808847624249
cat(paste('Confusion Matrix Random Forest Model'), sep='\n')
## Confusion Matrix Random Forest Model
confusionMatrix(prediction.rf, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 872  75
##         X1  82 162
##                                           
##                Accuracy : 0.8682          
##                  95% CI : (0.8476, 0.8869)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 7.14e-10        
##                                           
##                   Kappa : 0.591           
##                                           
##  Mcnemar's Test P-Value : 0.632           
##                                           
##             Sensitivity : 0.9140          
##             Specificity : 0.6835          
##          Pos Pred Value : 0.9208          
##          Neg Pred Value : 0.6639          
##              Prevalence : 0.8010          
##          Detection Rate : 0.7322          
##    Detection Prevalence : 0.7951          
##       Balanced Accuracy : 0.7988          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- fit.rf$results[[5]][1]
F1_test <- F1_Score(itest$y, prediction.rf)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.566681997839924 F1_test_rf: 0.917411888479748
cat(paste('Confusion Matrix Neural Network Model'), sep='\n')
## Confusion Matrix Neural Network Model
confusionMatrix(prediction.nnet, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 770  65
##         X1 184 172
##                                           
##                Accuracy : 0.7909          
##                  95% CI : (0.7667, 0.8137)
##     No Information Rate : 0.801           
##     P-Value [Acc > NIR] : 0.8182          
##                                           
##                   Kappa : 0.4483          
##                                           
##  Mcnemar's Test P-Value : 7.549e-14       
##                                           
##             Sensitivity : 0.8071          
##             Specificity : 0.7257          
##          Pos Pred Value : 0.9222          
##          Neg Pred Value : 0.4831          
##              Prevalence : 0.8010          
##          Detection Rate : 0.6465          
##    Detection Prevalence : 0.7011          
##       Balanced Accuracy : 0.7664          
##                                           
##        'Positive' Class : X0              
## 
F1_train <- fit.nnet$results[[6]][1]
F1_test <- F1_Score(itest$y, prediction.nnet)
cat(paste('F1_train_nnet:',F1_train,'F1_test_nnet:', F1_test), sep='\n')
## F1_train_nnet: 0.436543766362327 F1_test_nnet: 0.860816098378983
cat(paste('Confusion Matrix GBM Model'), sep='\n')
## Confusion Matrix GBM Model
confusionMatrix(prediction.gbm, itest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  X0  X1
##         X0 827  59
##         X1 127 178
##                                          
##                Accuracy : 0.8438         
##                  95% CI : (0.8219, 0.864)
##     No Information Rate : 0.801          
##     P-Value [Acc > NIR] : 8.275e-05      
##                                          
##                   Kappa : 0.5578         
##                                          
##  Mcnemar's Test P-Value : 8.984e-07      
##                                          
##             Sensitivity : 0.8669         
##             Specificity : 0.7511         
##          Pos Pred Value : 0.9334         
##          Neg Pred Value : 0.5836         
##              Prevalence : 0.8010         
##          Detection Rate : 0.6944         
##    Detection Prevalence : 0.7439         
##       Balanced Accuracy : 0.8090         
##                                          
##        'Positive' Class : X0             
## 
F1_train <- fit.gbm$results[[8]][1]
F1_test <- F1_Score(itest$y, prediction.gbm)
cat(paste('F1_train_gbm:',F1_train,'F1_test_gbm:', F1_test), sep='\n')
## F1_train_gbm: 0.492129600499759 F1_test_gbm: 0.898913043478261
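
Two caveats on the F1 numbers above. First, indexing fit$results by position returns the first row of the tuning grid, not the row of the selected model: F1_train_nnet (0.4365) is the F of the size = 1, decay = 0 configuration and F1_train_gbm (0.4921) belongs to n.trees = 50, interaction.depth = 1, while the chosen models scored 0.5485 and 0.5737 in resampling. Second, MLmetrics::F1_Score() defaults to the first factor level (X0, the repaid loans) as the positive class, which largely explains why the test values sit so far above the training ones. A sketch of both fixes:

# F of the selected tuning configuration, matched by name rather than position
best_f <- function(fit) merge(fit$bestTune, fit$results)$F
cat(paste('F1_train_gbm (best tune):', best_f(fit.gbm)), sep='\n')
# Test F1 for the defaulters (the minority class of interest)
F1_Score(itest$y, prediction.gbm, positive="X1")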
Confusion Matrix Plots

par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, itest$y)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, itest$y)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nnet, itest$y)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, itest$y)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "GBM Confusion Matrix")

References:

Article

Github repository

Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, "An Introduction to Statistical Learning: with Applications in R", Springer, 2013

Max Kuhn, Kjell Johnson, "Applied Predictive Modeling", Springer, 2013

https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/

https://machinelearningmastery.com/what-is-imbalanced-classification/

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/