I’ve used a data set from Kaggle: the Home Equity data set (HMEQ). It contains loan performance information for 5,960 recent home equity loans. The target variable (BAD) is a binary variable indicating whether an applicant eventually defaulted, and 12 input variables are reported for each applicant. The task is to predict which clients will default on their loans.
Below is a description of the variables:
-BAD: 1 = client defaulted on loan; 0 = loan repaid
-LOAN: Amount of the loan request
-MORTDUE: Amount due on existing mortgage
-VALUE: Value of current property
-REASON: DebtCon = debt consolidation; HomeImp = home improvement
-JOB: Six occupational categories
-YOJ: Years at present job
-DEROG: Number of major derogatory reports
-DELINQ: Number of delinquent credit lines
-CLAGE: Age of oldest trade line in months
-NINQ: Number of recent credit lines
-CLNO: Number of credit lines
-DEBTINC: Debt-to-income ratio
suppressWarnings({library(ggplot2)})
suppressWarnings({library(tidyverse)})
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## v purrr 0.3.3
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
suppressWarnings({library(caret)})
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
suppressWarnings({library(DMwR)})
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
suppressWarnings({library(ROSE)})
## Loaded ROSE 0.0-3
suppressWarnings({library(corrplot)})
## corrplot 0.84 loaded
suppressWarnings({library(gridExtra)})
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
suppressWarnings({library(MLmetrics)})
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
path <- "C:/Users/user/Documents/eRUM2020/hmeq.csv"
df <- read.csv(path)
# Dimensions of data set
dim(df)
## [1] 5960 13
# List types for each attribute
sapply(df, class)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG
## "integer" "integer" "numeric" "numeric" "factor" "factor" "numeric" "integer"
## DELINQ CLAGE NINQ CLNO DEBTINC
## "integer" "numeric" "integer" "integer" "numeric"
# Take a peek at the first rows of the data set
head(df,5)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO
## 1 1 1100 25860 39025 HomeImp Other 10.5 0 0 94.36667 1 9
## 2 1 1300 70053 68400 HomeImp Other 7.0 0 2 121.83333 0 14
## 3 1 1500 13500 16700 HomeImp Other 4.0 0 0 149.46667 1 10
## 4 1 1500 NA NA NA NA NA NA NA NA
## 5 0 1700 97800 112000 HomeImp Office 3.0 0 0 93.33333 0 14
## DEBTINC
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
# Summarize attribute distributions
summary(df)
## BAD LOAN MORTDUE VALUE
## Min. :0.0000 Min. : 1100 Min. : 2063 Min. : 8000
## 1st Qu.:0.0000 1st Qu.:11100 1st Qu.: 46276 1st Qu.: 66076
## Median :0.0000 Median :16300 Median : 65019 Median : 89236
## Mean :0.1995 Mean :18608 Mean : 73761 Mean :101776
## 3rd Qu.:0.0000 3rd Qu.:23300 3rd Qu.: 91488 3rd Qu.:119824
## Max. :1.0000 Max. :89900 Max. :399550 Max. :855909
## NA's :518 NA's :112
## REASON JOB YOJ DEROG
## : 252 : 279 Min. : 0.000 Min. : 0.0000
## DebtCon:3928 Mgr : 767 1st Qu.: 3.000 1st Qu.: 0.0000
## HomeImp:1780 Office : 948 Median : 7.000 Median : 0.0000
## Other :2388 Mean : 8.922 Mean : 0.2546
## ProfExe:1276 3rd Qu.:13.000 3rd Qu.: 0.0000
## Sales : 109 Max. :41.000 Max. :10.0000
## Self : 193 NA's :515 NA's :708
## DELINQ CLAGE NINQ CLNO
## Min. : 0.0000 Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 0.0000 1st Qu.: 115.1 1st Qu.: 0.000 1st Qu.:15.0
## Median : 0.0000 Median : 173.5 Median : 1.000 Median :20.0
## Mean : 0.4494 Mean : 179.8 Mean : 1.186 Mean :21.3
## 3rd Qu.: 0.0000 3rd Qu.: 231.6 3rd Qu.: 2.000 3rd Qu.:26.0
## Max. :15.0000 Max. :1168.2 Max. :17.000 Max. :71.0
## NA's :580 NA's :308 NA's :510 NA's :222
## DEBTINC
## Min. : 0.5245
## 1st Qu.: 29.1400
## Median : 34.8183
## Mean : 33.7799
## 3rd Qu.: 39.0031
## Max. :203.3121
## NA's :1267
# Summarize data structure
str(df)
## 'data.frame': 5960 obs. of 13 variables:
## $ BAD : int 1 1 1 1 0 1 1 1 1 1 ...
## $ LOAN : int 1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
## $ MORTDUE: num 25860 70053 13500 NA 97800 ...
## $ VALUE : num 39025 68400 16700 NA 112000 ...
## $ REASON : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
## $ JOB : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
## $ YOJ : num 10.5 7 4 NA 3 9 5 11 3 16 ...
## $ DEROG : int 0 0 0 NA 0 0 3 0 0 0 ...
## $ DELINQ : int 0 2 0 NA 0 0 2 0 2 0 ...
## $ CLAGE : num 94.4 121.8 149.5 NA 93.3 ...
## $ NINQ : int 1 0 1 NA 0 1 1 0 1 0 ...
## $ CLNO : int 9 14 10 NA 14 8 17 8 12 13 ...
## $ DEBTINC: num NA NA NA NA NA ...
# Keep a standalone copy of the target (BAD); it is used later in plots and tests
BAD <- df$BAD <- as.factor(df$BAD)
df$LOAN <- as.numeric(df$LOAN)
df$DEROG <- as.factor(df$DEROG)
df$DELINQ <- as.factor(df$DELINQ)
df$NINQ <- as.factor(df$NINQ)
df$CLNO <- as.factor(df$CLNO)
df$JOB[df$JOB == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$JOB == "", value = structure(c(4L, 4L, :
## invalid factor level, NA generated
df$REASON[df$REASON == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$REASON == "", value = structure(c(3L, :
## invalid factor level, NA generated
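The warnings above are expected: assigning a string that is not an existing factor level produces a true NA, which is what we want here so that these records are handled by the imputation step below. A minimal illustration (not part of the analysis):
f <- factor(c("a", "", "b"))
f[f == ""] <- "NA" # "NA" is not a level of f, so a real NA is generated (with a warning)
is.na(f) # FALSE TRUE FALSE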
My approach: I’ve filled missing values with the median for numerical variables and with the most common level for categorical ones. I’ve also created boolean indicator features, equal to 1 when the original value was missing and 0 otherwise, for each variable with missing values (an approach described by Pawel Grabinski).
mi_summary <- function(data_frame){
  # percentage of missing values per column
  perc_missing <- sapply(data_frame, function(x) mean(is.na(x)) * 100)
  keep <- perc_missing > 0
  out <- data.frame(col_name = names(perc_missing)[keep],
    perc_missing = round(perc_missing[keep], 6))
  out <- out[order(out$perc_missing, decreasing = TRUE), ]
  rownames(out) <- NULL
  return(out)
}
missing_summary <- mi_summary(df)
missing_summary
## col_name perc_missing
## 1 DEBTINC 21.258389
## 2 DEROG 11.879195
## 3 DELINQ 9.731544
## 4 MORTDUE 8.691275
## 5 YOJ 8.640940
## 6 NINQ 8.557047
## 7 CLAGE 5.167785
## 8 JOB 4.681208
## 9 REASON 4.228188
## 10 CLNO 3.724832
## 11 VALUE 1.879195
df <- df %>%
mutate(DEBTINC_NA = ifelse(is.na(DEBTINC),1,0)) %>%
mutate(DEROG_NA = ifelse(is.na(DEROG),1,0)) %>%
mutate(DELINQ_NA = ifelse(is.na(DELINQ),1,0)) %>%
mutate(MORTDUE_NA = ifelse(is.na(MORTDUE),1,0)) %>%
mutate(YOJ_NA = ifelse(is.na(YOJ),1,0)) %>%
mutate(NINQ_NA = ifelse(is.na(NINQ),1,0)) %>%
mutate(CLAGE_NA = ifelse(is.na(CLAGE),1,0)) %>%
mutate(CLNO_NA = ifelse(is.na(CLNO),1,0)) %>%
mutate(VALUE_NA = ifelse(is.na(VALUE),1,0)) %>%
mutate(JOB_NA = ifelse(is.na(JOB),1,0)) %>%
mutate(REASON_NA = ifelse(is.na(REASON),1,0))
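As an aside, with a more recent dplyr (>= 1.0.0; the version attached above is 0.8.3) the eleven mutate() calls could be collapsed into a single across() call. A sketch, not the code used here:
# one indicator column per variable with missing values
na_cols <- as.character(missing_summary$col_name)
df_compact <- df %>%
  mutate(across(all_of(na_cols), ~ ifelse(is.na(.x), 1, 0), .names = "{.col}_NA"))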
for (col in missing_summary$col_name){
  if (class(df[,col]) == 'factor'){
    # impute categorical variables with the mode (the most frequent level)
    unique_levels <- unique(df[,col])
    df[is.na(df[,col]), col] <- unique_levels[which.max(tabulate(match(df[,col], unique_levels)))]
  } else {
    # impute numerical variables with the median
    df[is.na(df[,col]),col] <- median(as.numeric(df[,col]), na.rm = TRUE)
  }
}
# Verify that no missing values remain
pMiss <- function(x){ sum(is.na(x)) / length(x) * 100 }
pMiss_values <- apply(df, 2, pMiss)
pMiss_values <- pMiss_values[pMiss_values > 0]
pMiss_values <- pMiss_values[order(pMiss_values, decreasing = TRUE)]
pMiss_values
## named numeric(0)
df$DEROG_NA <- as.factor(df$DEROG_NA)
df$DEBTINC_NA <- as.factor(df$DEBTINC_NA)
df$DELINQ_NA <- as.factor(df$DELINQ_NA)
df$MORTDUE_NA <- as.factor(df$MORTDUE_NA)
df$YOJ_NA <- as.factor(df$YOJ_NA)
df$NINQ_NA <- as.factor(df$NINQ_NA)
df$CLAGE_NA <- as.factor(df$CLAGE_NA)
df$CLNO_NA <- as.factor(df$CLNO_NA)
df$VALUE_NA <- as.factor(df$VALUE_NA)
df$JOB_NA <- as.factor(df$JOB_NA)
df$REASON_NA <- as.factor(df$REASON_NA)
df$JOB <- factor(df$JOB, labels=c('Mgr','Office','Other','ProfExe','Sales','Self'))
df$REASON <- factor(df$REASON, labels=c('DebtCon','HomeImp'))
I’ve grouped the variables into numerical, categorical and boolean subsets for the exploratory data analysis.
cat <- df[,sapply(df, is.factor)] %>%
select_if(~nlevels(.) <=15 ) %>%
select(-BAD)
bol <- df[,c('DEBTINC_NA','DEROG_NA','DELINQ_NA','MORTDUE_NA','YOJ_NA','NINQ_NA','CLAGE_NA','CLNO_NA','VALUE_NA','JOB_NA','REASON_NA')]
num <- df[,sapply(df, is.numeric)]
The target variable is grouped into two classes: “Loan defaulted (1)” and “Loan repaid (0)”. Looking at the barplot, it’s quite imbalanced.
cbind(freq=table(df$BAD), percentage=prop.table(table(df$BAD))*100)
## freq percentage
## 0 4771 80.05034
## 1 1189 19.94966
ggplot(df, aes(BAD, fill=BAD)) + geom_bar() +
scale_fill_brewer(palette = "Set1") +
ggtitle("Distribution of Target variable")
I’ve grouped all categorical features into a new subset: I’ve done a graphical analysis using barplots and counted the frequency of each class. For the bivariate analysis I’ve used a Chi-Square test to evaluate the relationship between the target variable and each categorical feature. The test statistic measures how much the observed counts differ from the counts we would expect if there were no relationship at all in the population.
cat <- cat[,c('DELINQ','REASON','JOB','DEROG')]
for(i in 1:length(cat)) {
counts <- table(cat[,i])
name <- names(cat)[i]
barplot(counts, main=name, col=c("blue","red","green","orange","purple"))
}
The Chi-Square test is used here for feature selection, testing the null hypothesis of independence between the target variable and each categorical feature. The goal is to test whether two classifications are independent, i.e. whether the distribution of one is influenced by the other. If the null hypothesis is not rejected (p-value > 0.05), the two classifications are considered independent and the feature can be dropped.
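A compact way to apply this decision rule programmatically (a sketch, not part of the original analysis):
# p-value of the independence test for every categorical feature
chisq_pvals <- sapply(names(cat), function(v)
  suppressWarnings(chisq.test(table(BAD, cat[, v]))$p.value))
data.frame(feature = names(cat), p_value = round(chisq_pvals, 4),
  droppable = chisq_pvals > 0.05)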
par(mfrow=c(2,2))
for(i in 1:length(cat)){
freq=table(cat[,i])
percentage=prop.table(table(cat[,i]))*100
freq_cat_outcome=table(BAD,cat[,i])
name <- names(cat)[i]
cat(sep="\n")
cat(paste("Distribution of", name), sep="\n")
print(cbind(freq,percentage))
cat(sep="\n")
cat(paste("Distribution by Target variable and", name), sep="\n")
print(freq_cat_outcome)
cat(sep="\n")
cat(paste("Chi-squared test by Target variable and", name), sep="\n")
suppressWarnings({print(chisq.test(table(BAD,cat[,i])))})
}
##
## Distribution of DELINQ
## freq percentage
## 0 4759 79.84899329
## 1 654 10.97315436
## 2 250 4.19463087
## 3 129 2.16442953
## 4 78 1.30872483
## 5 38 0.63758389
## 6 27 0.45302013
## 7 13 0.21812081
## 8 5 0.08389262
## 10 2 0.03355705
## 11 2 0.03355705
## 12 1 0.01677852
## 13 1 0.01677852
## 15 1 0.01677852
##
## Distribution by Target variable and DELINQ
##
## BAD 0 1 2 3 4 5 6 7 8 10 11 12 13 15
## 0 4104 432 138 58 32 7 0 0 0 0 0 0 0 0
## 1 655 222 112 71 46 31 27 13 5 2 2 1 1 1
##
## Chi-squared test by Target variable and DELINQ
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 763.8, df = 13, p-value < 2.2e-16
##
##
## Distribution of REASON
## freq percentage
## DebtCon 4180 70.13423
## HomeImp 1780 29.86577
##
## Distribution by Target variable and REASON
##
## BAD DebtCon HomeImp
## 0 3387 1384
## 1 793 396
##
## Chi-squared test by Target variable and REASON
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(BAD, cat[, i])
## X-squared = 8.1852, df = 1, p-value = 0.004223
##
##
## Distribution of JOB
## freq percentage
## Mgr 767 12.869128
## Office 948 15.906040
## Other 2667 44.748322
## ProfExe 1276 21.409396
## Sales 109 1.828859
## Self 193 3.238255
##
## Distribution by Target variable and JOB
##
## BAD Mgr Office Other ProfExe Sales Self
## 0 588 823 2090 1064 71 135
## 1 179 125 577 212 38 58
##
## Chi-squared test by Target variable and JOB
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 73.815, df = 5, p-value = 1.644e-14
##
##
## Distribution of DEROG
## freq percentage
## 0 5235 87.83557047
## 1 435 7.29865772
## 2 160 2.68456376
## 3 58 0.97315436
## 4 23 0.38590604
## 5 15 0.25167785
## 6 15 0.25167785
## 7 8 0.13422819
## 8 6 0.10067114
## 9 3 0.05033557
## 10 2 0.03355705
##
## Distribution by Target variable and DEROG
##
## BAD 0 1 2 3 4 5 6 7 8 9 10
## 0 4394 266 78 15 5 8 5 0 0 0 0
## 1 841 169 82 43 18 7 10 8 6 3 2
##
## Chi-squared test by Target variable and DEROG
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 503.99, df = 10, p-value < 2.2e-16
pl1 <- cat %>%
ggplot(aes(x=BAD, y=DELINQ, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and DELINQ")
pl2 <- cat %>%
ggplot(aes(x=BAD, y=REASON, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and REASON")
pl3 <- cat %>%
ggplot(aes(x=BAD, y=JOB, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and JOB")
pl4 <- cat %>%
ggplot(aes(x=BAD, y=DEROG, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and DEROG")
par(mfrow=c(2,2))
grid.arrange(pl1,pl2,pl3,pl4, ncol=2)
I’ve transformed the categorical features into numerical variables with one-hot encoding, so that they can be handled by a wider range of machine learning models.
dmy <- dummyVars("~.", data = cat,fullRank = F)
cat_num <- data.frame(predict(dmy, newdata = cat))
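A note on the design choice: fullRank = F keeps a dummy column for every factor level. For strictly linear models one would usually drop a reference level per factor to avoid perfectly collinear dummies; a sketch of that variant:
# fullRank = TRUE drops one level per factor (the reference level)
dmy_fr <- dummyVars("~.", data = cat, fullRank = TRUE)
cat_num_fr <- data.frame(predict(dmy_fr, newdata = cat))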
I’ve grouped all numerical features into a new subset: I’ve done a graphical analysis using histograms, density plots and boxplots; I’ve also computed metrics such as standard deviation, skewness and kurtosis for each variable. For the bivariate analysis I’ve used a one-way ANOVA test to evaluate the relationship between the target variable and each numerical feature. This test assesses whether the means of two or more groups are statistically different.
par(mfrow=c(2,3))
for(i in 1:length(num)) {
hist(num[,i], main=names(num)[i], col='blue')
}
par(mfrow=c(2,3))
for(i in 1:length(num)) {
boxplot(num[,i], main=names(num)[i], col='orange')
}
par(mfrow=c(2,3))
for(i in 1:length(num)){
plot(density(num[,i]), main=names(num)[i], col='red')
}
for(i in 1:length(num)){
name <- names(num)[i]
cat(paste("Distribution of", name), sep="\n")
#cat(names(num)[i],sep = "\n")
print(summary(num[,i]))
cat(sep="\n")
stand.deviation = sd(num[,i])
variance = var(num[,i])
# sample skewness: third standardized moment
skewness = mean((num[,i] - mean(num[,i]))^3/sd(num[,i])^3)
# excess kurtosis: fourth standardized moment minus 3
kurtosis = mean((num[,i] - mean(num[,i]))^4/sd(num[,i])^4) - 3
# number of points flagged as outliers by the boxplot rule
outlier_values <- sum(table(boxplot.stats(num[,i])$out))
cat(paste("Statistical analysis of", name), sep="\n")
print(cbind(stand.deviation, variance, skewness, kurtosis, outlier_values))
cat(sep="\n")
cat(paste("anova_test between BAD and", name),sep = "\n")
print(summary(aov(as.numeric(BAD)~num[,i], data=num)))
cat(sep="\n")
}
## Distribution of LOAN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1100 11100 16300 18608 23300 89900
##
## Statistical analysis of LOAN
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 11207.48 125607617 2.022762 6.922438 256
##
## anova_test between BAD and LOAN
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 5.4 5.368 33.79 6.45e-09 ***
## Residuals 5958 946.4 0.159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of MORTDUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2063 48139 65019 73001 88200 399550
##
## Statistical analysis of MORTDUE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 42552.73 1810734556 1.941153 7.440693 308
##
## anova_test between BAD and MORTDUE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 2.0 2.0303 12.74 0.000361 ***
## Residuals 5958 949.8 0.1594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of VALUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8000 66490 89236 101540 119005 855909
##
## Statistical analysis of VALUE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 56869.44 3234132829 3.088963 24.85644 347
##
## anova_test between BAD and VALUE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 1.3 1.2675 7.945 0.00484 **
## Residuals 5958 950.5 0.1595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of YOJ
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 7.000 8.756 12.000 41.000
##
## Statistical analysis of YOJ
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 7.259424 52.69923 1.092072 0.744703 211
##
## anova_test between BAD and YOJ
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 2.8 2.7710 17.4 3.08e-05 ***
## Residuals 5958 949.0 0.1593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of CLAGE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 117.4 173.5 179.4 227.1 1168.2
##
## Statistical analysis of CLAGE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 83.5747 6984.73 1.389903 8.180556 66
##
## anova_test between BAD and CLAGE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 26.1 26.106 168 <2e-16 ***
## Residuals 5958 925.7 0.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of DEBTINC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5245 30.7632 34.8183 34.0007 37.9499 203.3122
##
## Statistical analysis of DEBTINC
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 7.644528 58.43881 3.111598 64.0733 245
##
## anova_test between BAD and DEBTINC
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 22.7 22.733 145.8 <2e-16 ***
## Residuals 5958 929.1 0.156
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pl5 <- num %>%
ggplot(aes(x=BAD, y=LOAN, fill=BAD)) + geom_boxplot()
pl6 <- num %>%
ggplot(aes(x=BAD, y=MORTDUE, fill=BAD)) + geom_boxplot()
pl7 <- num %>%
ggplot(aes(x=BAD, y=VALUE, fill=BAD)) + geom_boxplot()
pl8 <- num %>%
ggplot(aes(x=BAD, y=YOJ, fill=BAD)) + geom_boxplot()
pl9 <- num %>%
ggplot(aes(x=BAD, y=CLAGE, fill=BAD)) + geom_boxplot()
pl10 <- num %>%
ggplot(aes(x=BAD, y=DEBTINC, fill=BAD)) + geom_boxplot()
par(mfrow=c(2,3))
grid.arrange(pl5,pl6,pl7,pl8,pl9,pl10, ncol=2)
Outliers are observations that lie more than 1.5 times the interquartile range (the difference between the 75th and 25th percentiles) below the first or above the third quartile. If they are not detected and handled appropriately, they can distort the predictions. There are several ways to handle outliers; I’ve decided to cap them, replacing observations below the lower limit with the value of the 5th percentile and those above the upper limit with the value of the 95th percentile.
# Before
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN")
# Managing outliers
qnt <- quantile(num$LOAN, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$LOAN, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$LOAN, na.rm = T)
num$LOAN[num$LOAN < (qnt[1] - H)] <- caps[1]
num$LOAN[num$LOAN >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN after handled outliers")
# Before
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE")
# Managing outliers
qnt <- quantile(num$MORTDUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$MORTDUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$MORTDUE, na.rm = T)
num$MORTDUE[num$MORTDUE < (qnt[1] - H)] <- caps[1]
num$MORTDUE[num$MORTDUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE after handled outliers")
# Before
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE")
# Managing outliers
qnt <- quantile(num$VALUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$VALUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$VALUE, na.rm = T)
num$VALUE[num$VALUE < (qnt[1] - H)] <- caps[1]
num$VALUE[num$VALUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE after handled outliers")
# Before
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ")
# Managing outliers
qnt <- quantile(num$YOJ, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$YOJ, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$YOJ, na.rm = T)
num$YOJ[num$YOJ < (qnt[1] - H)] <- caps[1]
num$YOJ[num$YOJ >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ after handled outliers")
# Before
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE")
# Managing outliers
qnt <- quantile(num$CLAGE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$CLAGE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$CLAGE, na.rm = T)
num$CLAGE[num$CLAGE < (qnt[1] - H)] <- caps[1]
num$CLAGE[num$CLAGE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE after handled outliers")
# Before
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC")
# Managing outliers
qnt <- quantile(num$DEBTINC, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$DEBTINC, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$DEBTINC, na.rm = T)
num$DEBTINC[num$DEBTINC < (qnt[1] - H)] <- caps[1]
num$DEBTINC[num$DEBTINC >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC after handled outliers")
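The same capping rule is applied six times above; a compact alternative is to wrap it in a helper function (a sketch equivalent to the code above):
cap_outliers <- function(x) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = TRUE) # quartiles
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE) # capping values
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1] # floor low outliers at the 5th percentile
  x[x > (qnt[2] + H)] <- caps[2] # cap high outliers at the 95th percentile
  x
}
# e.g. num[] <- lapply(num, cap_outliers)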
The goal of this step is to delete predictors that have a single unique value, or a handful of unique values occurring with very low frequencies. These “near-zero-variance predictors” may cause problems when fitting the algorithms (excluding tree-based models).
# combine the one-hot encoded categorical features with the numerical ones
data <- cbind(cat_num, num)
nzv <- nearZeroVar(data, saveMetrics= TRUE)
nzv[nzv$nzv,][1:15,]
## freqRatio percentUnique zeroVar nzv
## DELINQ.2 22.84000 0.03355705 FALSE TRUE
## DELINQ.3 45.20155 0.03355705 FALSE TRUE
## DELINQ.4 75.41026 0.03355705 FALSE TRUE
## DELINQ.5 155.84211 0.03355705 FALSE TRUE
## DELINQ.6 219.74074 0.03355705 FALSE TRUE
## DELINQ.7 457.46154 0.03355705 FALSE TRUE
## DELINQ.8 1191.00000 0.03355705 FALSE TRUE
## DELINQ.10 2979.00000 0.03355705 FALSE TRUE
## DELINQ.11 2979.00000 0.03355705 FALSE TRUE
## DELINQ.12 5959.00000 0.03355705 FALSE TRUE
## DELINQ.13 5959.00000 0.03355705 FALSE TRUE
## DELINQ.15 5959.00000 0.03355705 FALSE TRUE
## JOB.Sales 53.67890 0.03355705 FALSE TRUE
## JOB.Self 29.88083 0.03355705 FALSE TRUE
## DEROG.2 36.25000 0.03355705 FALSE TRUE
nzv <- nearZeroVar(data)
data_new <- data[, -nzv]
Another feature selection approach is to look at the correlation between variables; I apply it to the whole data set. In some models, such as linear regression, correlated features can deteriorate performance (multicollinearity). Although some models, notably ensembles of tree-based models, are not sensitive to this issue, I prefer to remove correlated features anyway because I don’t know in advance which model will be used.
par(mfrow=c(1,1))
cor_mat <- cor(data_new, use = "complete.obs", method = "spearman")
corrplot(cor_mat, type="lower", tl.col = "black", diag=FALSE, method="number", mar = c(0, 0, 2, 0), title="Correlation")
summary(cor_mat[upper.tri(cor_mat)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.000000 -0.038116 0.005731 -0.014157 0.043268 0.792421
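The code that actually drops the correlated columns is not shown; the df_new object used below was presumably built along these lines (a sketch using caret::findCorrelation, with an assumed cutoff of 0.75):
# columns to remove so that no pairwise correlation exceeds the cutoff;
# at least one pair exceeds it here (max Spearman correlation ~0.79)
highlyCor <- findCorrelation(cor_mat, cutoff = 0.75)
df_new <- data_new[, -highlyCor]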
To analyze the performance of a model, it is good practice to split the data set into a training set and a test set. The training set is the sample of data used to fit the model; the test set is the sample used to provide an unbiased evaluation of the model on data it has never seen before.
# calculate the pre-process parameters from the data set
set.seed(2019)
preprocessParams <- preProcess(df_new, method=c("center", "scale"))
# Transform the data set using the parameters
transformed <- predict(preprocessParams, df_new)
# Manage levels on the target variable: caret with classProbs = TRUE requires
# factor levels that are valid R names, hence X0/X1
y <- as.factor(df$BAD)
transformed <- cbind.data.frame(transformed,y)
levels(transformed$y) <- make.names(levels(factor(transformed$y)))
str(transformed)
## 'data.frame': 5960 obs. of 14 variables:
## $ DELINQ.0 : num 0.502 -1.99 0.502 0.502 0.502 ...
## $ DELINQ.1 : num -0.351 -0.351 -0.351 -0.351 -0.351 ...
## $ REASON.HomeImp: num 1.532 1.532 1.532 -0.653 1.532 ...
## $ JOB.Mgr : num -0.384 -0.384 -0.384 -0.384 -0.384 ...
## $ JOB.Office : num -0.435 -0.435 -0.435 -0.435 2.299 ...
## $ JOB.Other : num 1.11 1.11 1.11 1.11 -0.9 ...
## $ JOB.ProfExe : num -0.522 -0.522 -0.522 -0.522 -0.522 ...
## $ DEROG.1 : num -0.281 -0.281 -0.281 -0.281 -0.281 ...
## $ LOAN : num -1.86 -1.84 -1.81 -1.81 -1.79 ...
## $ VALUE : num -1.322 -0.669 -1.818 -0.206 0.3 ...
## $ YOJ : num 0.279 -0.233 -0.672 -0.233 -0.819 ...
## $ CLAGE : num -1.091 -0.73 -0.367 -0.052 -1.104 ...
## $ DEBTINC : num 0.143 0.143 0.143 0.143 0.143 ...
## $ y : Factor w/ 2 levels "X0","X1": 2 2 2 2 1 2 2 2 2 2 ...
# Draw a random, stratified sample including p percent of the data
set.seed(12345)
train_index <- createDataPartition(transformed$y, p=0.80, list=FALSE)
# hold out 20% of the data for testing
itest <- transformed[-train_index,]
# use the remaining 80% of the data to train the models
itrain <- transformed[train_index,]
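A quick check (output omitted) that the stratified split preserves the class proportions:
prop.table(table(itrain$y))
prop.table(table(itest$y))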
In this analysis I’ve used the same four baseline models with each sampling method: Logistic Regression, as the simplest model and a benchmark; two ensemble models (Random Forest and Gradient Boosting Machine); and a Neural Network.
Down-sampling: randomly subset the majority class so that it matches the size of the minority class.
# Stratified cross-validation
folds <- 5
set.seed(2019)
cvIndex <- createFolds(factor(itrain$y), folds, returnTrain = T)
control <- trainControl(index = cvIndex, method = "repeatedcv", number = folds, classProbs = TRUE, summaryFunction = prSummary, sampling = "down")
metric <- "AUC"
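Here caret performs the down-sampling inside each resampling fold via sampling = "down". To see its effect on the class counts directly, one could use caret::downSample (illustrative only, not part of the pipeline):
down_train <- downSample(x = itrain[, setdiff(names(itrain), "y")], y = itrain$y)
table(down_train$Class) # both classes now have the minority-class count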
set.seed(2019)
fit.glm <- train(y~., data=itrain, method="glm", family=binomial(link='logit'), metric=metric, trControl=control)
print(fit.glm)
## Generalized Linear Model
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Addtional sampling using down-sampling
##
## Resampling results:
##
## AUC Precision Recall F
## 0.9222213 0.8986736 0.7377612 0.8101311
plot(varImp(fit.glm),15, main = 'GLM feature selection')
set.seed(2019)
fit.rf <- train(y~., data=itrain, method="rf", metric=metric, trControl=control)
print(fit.rf)
## Random Forest
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry AUC Precision Recall F
## 2 0.9717582 0.9494766 0.8564296 0.9004898
## 7 0.9673899 0.9526566 0.8477840 0.8971426
## 13 0.9622294 0.9496174 0.8538162 0.8991397
##
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(fit.rf),15, main='Random Forest feature selection')
set.seed(2019)
fit.nnet <- train(y~., data=itrain, method="nnet", metric=metric, trControl=control, trace=FALSE) # trace=FALSE silences nnet's per-iteration log
print(fit.nnet)
## Neural Network
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## size decay AUC Precision Recall F
## 1 0e+00 0.7852862 0.9025181 0.7238673 0.8033016
## 1 1e-04 0.9220693 0.9078548 0.7104987 0.7968063
## 1 1e-01 0.9230165 0.9006473 0.7338252 0.8084184
## 3 0e+00 0.8615307 0.9024859 0.7005266 0.7879552
## 3 1e-04 0.9177405 0.9040929 0.7354127 0.8103297
## 3 1e-01 0.9270808 0.9162541 0.7382775 0.8175502
## 5 0e+00 0.9067854 0.9042002 0.7149414 0.7980137
## 5 1e-04 0.9184409 0.9038094 0.7057900 0.7916020
## 5 1e-01 0.9392531 0.9181286 0.7608054 0.8319055
##
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(varImp(fit.nnet),15, main = 'Neural Network feature selection')
set.seed(2019)
fit.gbm <- train(y~., data=itrain, method="gbm", metric=metric, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## interaction.depth n.trees AUC Precision Recall F
## 1 50 0.8995544 0.9366174 0.8414951 0.8863661
## 1 100 0.9578286 0.9421950 0.8359949 0.8858860
## 1 150 0.9618027 0.9447139 0.8359946 0.8869571
## 2 50 0.9599822 0.9407903 0.8532844 0.8948068
## 2 100 0.9636186 0.9423435 0.8519762 0.8948579
## 2 150 0.9648336 0.9425322 0.8504035 0.8940648
## 3 50 0.9627674 0.9461036 0.8404483 0.8901064
## 3 100 0.9660428 0.9463940 0.8454259 0.8930295
## 3 150 0.9685070 0.9492507 0.8414933 0.8920759
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
par(mar = c(4, 11, 1, 1))
summary(fit.gbm, cBars=15, las=2, plotit=T, main = 'GBM feature selection')
## var rel.inf
## DEBTINC DEBTINC 59.55084456
## CLAGE CLAGE 11.71658042
## DELINQ.0 DELINQ.0 9.23981342
## VALUE VALUE 7.38265155
## YOJ YOJ 4.62293401
## LOAN LOAN 4.09278121
## JOB.Office JOB.Office 1.16219326
## DELINQ.1 DELINQ.1 0.73893268
## DEROG.1 DEROG.1 0.70718645
## REASON.HomeImp REASON.HomeImp 0.45258375
## JOB.Other JOB.Other 0.11948720
## JOB.Mgr JOB.Mgr 0.11710766
## JOB.ProfExe JOB.ProfExe 0.09690382
A graphical comparison of the models’ performance on the training resamples.
results <- resamples(list(glm=fit.glm, rf=fit.rf, nnet=fit.nnet, gbm=fit.gbm))
cat(paste('Results'), sep='\n')
## Results
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: glm, rf, nnet, gbm
## Number of resamples: 5
##
## AUC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.9153513 0.9180248 0.9218465 0.9222213 0.9221887 0.9336949 0
## rf 0.9666777 0.9695952 0.9707874 0.9717582 0.9723774 0.9793531 0
## nnet 0.9195431 0.9324396 0.9465394 0.9392531 0.9480358 0.9497077 0
## gbm 0.9636049 0.9657601 0.9657630 0.9685070 0.9727419 0.9746651 0
##
## F
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.7988423 0.8014599 0.8063814 0.8101311 0.8219564 0.8220157 0
## rf 0.8859527 0.8965996 0.9047293 0.9004898 0.9055824 0.9095853 0
## nnet 0.8189091 0.8225352 0.8312817 0.8319055 0.8354978 0.8513037 0
## gbm 0.8868715 0.8898246 0.8929068 0.8920759 0.8951724 0.8956044 0
##
## Precision
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.8875380 0.8932039 0.9025974 0.8986736 0.9040881 0.9059406 0
## rf 0.9437037 0.9462518 0.9482759 0.9494766 0.9542097 0.9549419 0
## nnet 0.8888889 0.9214403 0.9221374 0.9181286 0.9288026 0.9293740 0
## gbm 0.9421965 0.9460641 0.9491779 0.9492507 0.9511111 0.9577039 0
##
## Recall
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.7185864 0.7225131 0.7287025 0.7377612 0.7536042 0.7653997 0
## rf 0.8348624 0.8455497 0.8610747 0.8564296 0.8650066 0.8756545 0
## nnet 0.7369110 0.7522936 0.7588467 0.7608054 0.7653997 0.7905759 0
## gbm 0.8309305 0.8322412 0.8414155 0.8414933 0.8494764 0.8534031 0
par(mar = c(4, 11, 1, 1))
dotplot(results, main = 'AUC results from algorithms')
par(mfrow=c(2,2))
set.seed(2019)
prediction.glm <-predict(fit.glm,newdata=itest,type="raw")
set.seed(2019)
prediction.rf <-predict(fit.rf,newdata=itest,type="raw")
set.seed(2019)
prediction.nnet <-predict(fit.nnet,newdata=itest,type="raw")
set.seed(2019)
prediction.gbm <-predict(fit.gbm,newdata=itest,type="raw")
Results are reported through the confusion matrix and the F1-score, which is a better evaluation metric for an imbalanced data set.
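For reference, the F1-score is the harmonic mean of precision and recall; a one-line definition:
# F1 = 2 * precision * recall / (precision + recall)
f1 <- function(precision, recall) 2 * precision * recall / (precision + recall)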
cat(paste('Confusion Matrix GLM Model'), sep='\n')
## Confusion Matrix GLM Model
confusionMatrix(prediction.glm, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 708 74
## X1 246 163
##
## Accuracy : 0.7313
## 95% CI : (0.7052, 0.7563)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3378
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7421
## Specificity : 0.6878
## Pos Pred Value : 0.9054
## Neg Pred Value : 0.3985
## Prevalence : 0.8010
## Detection Rate : 0.5945
## Detection Prevalence : 0.6566
## Balanced Accuracy : 0.7150
##
## 'Positive' Class : X0
##
F1_train <- fit.glm$results[5]
F1_test <- F1_Score(itest$y, prediction.glm)
cat(paste('F1_train_glm:',F1_train, 'F1_test_glm:', F1_test), sep='\n')
## F1_train_glm: 0.810131128341594 F1_test_glm: 0.815668202764977
cat(paste('Confusion Matrix Random Forest Model'), sep='\n')
## Confusion Matrix Random Forest Model
confusionMatrix(prediction.rf, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 823 51
## X1 131 186
##
## Accuracy : 0.8472
## 95% CI : (0.8255, 0.8672)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 2.291e-05
##
## Kappa : 0.5746
##
## Mcnemar's Test P-Value : 4.745e-09
##
## Sensitivity : 0.8627
## Specificity : 0.7848
## Pos Pred Value : 0.9416
## Neg Pred Value : 0.5868
## Prevalence : 0.8010
## Detection Rate : 0.6910
## Detection Prevalence : 0.7338
## Balanced Accuracy : 0.8237
##
## 'Positive' Class : X0
##
F1_train <- fit.rf$results[[5]][1]
F1_test <- F1_Score(itest$y, prediction.rf)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.900489847090052 F1_test_rf: 0.900437636761488
cat(paste('Confusion Matrix Neural Network Model'), sep='\n')
## Confusion Matrix Neural Network Model
confusionMatrix(prediction.nnet, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 739 61
## X1 215 176
##
## Accuracy : 0.7683
## 95% CI : (0.7432, 0.792)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 0.9976
##
## Kappa : 0.4157
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7746
## Specificity : 0.7426
## Pos Pred Value : 0.9238
## Neg Pred Value : 0.4501
## Prevalence : 0.8010
## Detection Rate : 0.6205
## Detection Prevalence : 0.6717
## Balanced Accuracy : 0.7586
##
## 'Positive' Class : X0
##
F1_train <- fit.nnet$results[[6]][1]
F1_test <- F1_Score(itest$y, prediction.nnet)
cat(paste('F1_train_nnet:',F1_train,'F1_test_nnet:', F1_test), sep='\n')
## F1_train_nnet: 0.803301647507338 F1_test_nnet: 0.842645381984036
cat(paste('Confusion Matrix GBM Model'), sep='\n')
## Confusion Matrix GBM Model
confusionMatrix(prediction.gbm, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 800 48
## X1 154 189
##
## Accuracy : 0.8304
## 95% CI : (0.8079, 0.8513)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 0.005457
##
## Kappa : 0.5445
##
## Mcnemar's Test P-Value : 1.493e-13
##
## Sensitivity : 0.8386
## Specificity : 0.7975
## Pos Pred Value : 0.9434
## Neg Pred Value : 0.5510
## Prevalence : 0.8010
## Detection Rate : 0.6717
## Detection Prevalence : 0.7120
## Balanced Accuracy : 0.8180
##
## 'Positive' Class : X0
##
F1_train <- fit.gbm$results[[8]][1]
F1_test <- F1_Score(itest$y, prediction.gbm)
cat(paste('F1_train_gbm:',F1_train,'F1_test_gbm:', F1_test), sep='\n')
## F1_train_gbm: 0.886366083205398 F1_test_gbm: 0.887902330743618
par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, itest$y)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, itest$y)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nnet, itest$y)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, itest$y)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GBM Confusion Matrix")
Up-sampling: randomly sample (with replacement) the minority class so that it matches the size of the majority class.
# Stratified cross-validation
folds <- 5
set.seed(2019)
cvIndex <- createFolds(factor(itrain$y), folds, returnTrain = T)
control <- trainControl(index = cvIndex, method = "repeatedcv", number = folds, classProbs = TRUE, summaryFunction = prSummary, sampling = "up")
metric <- "AUC"
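Analogously, caret::upSample shows what sampling = "up" does to the class counts (illustrative only, not part of the pipeline):
up_train <- upSample(x = itrain[, setdiff(names(itrain), "y")], y = itrain$y)
table(up_train$Class) # both classes now have the majority-class count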
set.seed(2019)
fit.glm <- train(y~., data=itrain, method="glm", family=binomial(link='logit'), metric=metric, trControl=control)
print(fit.glm)
## Generalized Linear Model
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Addtional sampling using up-sampling
##
## Resampling results:
##
## AUC Precision Recall F
## 0.9229779 0.8984912 0.7427367 0.8130053
plot(varImp(fit.glm),15, main = 'GLM feature selection')
set.seed(2019)
fit.rf <- train(y~., data=itrain, method="rf", metric=metric, trControl=control)
print(fit.rf)
## Random Forest
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Addtional sampling using up-sampling
##
## Resampling results across tuning parameters:
##
## mtry AUC Precision Recall F
## 2 0.9753659 0.9367482 0.9221896 0.9293680
## 7 0.9144147 0.9312547 0.9326700 0.9319351
## 13 0.8260580 0.9234212 0.9221923 0.9227965
##
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(fit.rf),15, main='Random Forest feature selection')
set.seed(2019)
fit.nnet <- train(y~., data=itrain, method="nnet", metric=metric, trControl=control, trace=FALSE) # trace=FALSE silences nnet's per-iteration log
print(fit.nnet)
## Neural Network
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Addtional sampling using up-sampling
##
## Resampling results across tuning parameters:
##
## size decay AUC Precision Recall F
## 1 0e+00 0.9206352 0.9043590 0.7089386 0.7946450
## 1 1e-04 0.9221209 0.9047536 0.7047391 0.7921428
## 1 1e-01 0.9238685 0.9046901 0.7225597 0.8033415
## 3 0e+00 0.8435589 0.9020172 0.7319602 0.8066338
## 3 1e-04 0.9081168 0.9155727 0.7361826 0.8157205
## 3 1e-01 0.9338175 0.9069256 0.7456204 0.8177587
## 5 0e+00 0.9200589 0.8980998 0.7820205 0.8355041
## 5 1e-04 0.9292254 0.9050240 0.7532323 0.8211169
## 5 1e-01 0.9408704 0.9215037 0.7566135 0.8308441
##
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(varImp(fit.nnet),15, main = 'Neural Network feature selection')
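One caveat on the network above: nnet is sensitive to the scale of its inputs, so standardizing the predictors inside train() is a common refinement. A sketch under that assumption (fit.nnet.scaled is a hypothetical name, not run here):
# Standardize predictors before fitting the network
set.seed(2019)
fit.nnet.scaled <- train(y~., data=itrain, method="nnet", metric=metric,
                         trControl=control, preProcess=c("center","scale"),
                         trace=FALSE)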
set.seed(2019)
fit.gbm <- train(y~., data=itrain, method="gbm", metric=metric, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Addtional sampling using up-sampling
##
## Resampling results across tuning parameters:
##
## interaction.depth n.trees AUC Precision Recall F
## 1 50 0.7887613 0.9325682 0.8585231 0.8937265
## 1 100 0.9560085 0.9411022 0.8451648 0.8904113
## 1 150 0.9625541 0.9410865 0.8407125 0.8879457
## 2 50 0.9590954 0.9421104 0.8556411 0.8966535
## 2 100 0.9657656 0.9456085 0.8593085 0.9002933
## 2 150 0.9672073 0.9441038 0.8663834 0.9035156
## 3 50 0.9635314 0.9445827 0.8519755 0.8958138
## 3 100 0.9669553 0.9456678 0.8606222 0.9010865
## 3 150 0.9689793 0.9464588 0.8697924 0.9064571
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
par(mar = c(4, 11, 1, 1))
summary(fit.gbm, cBars=15, las=2, plotit=T, main = 'GBM feature selection')
## var rel.inf
## DEBTINC DEBTINC 61.69994826
## DELINQ.0 DELINQ.0 13.97062800
## CLAGE CLAGE 9.29015675
## VALUE VALUE 6.19009121
## LOAN LOAN 2.94490907
## YOJ YOJ 2.55295300
## DELINQ.1 DELINQ.1 1.42949366
## JOB.Office JOB.Office 0.81009594
## DEROG.1 DEROG.1 0.60202923
## REASON.HomeImp REASON.HomeImp 0.27637804
## JOB.ProfExe JOB.ProfExe 0.18741340
## JOB.Mgr JOB.Mgr 0.04590344
## JOB.Other JOB.Other 0.00000000
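The gbm grid above is also caret's default, with shrinkage and n.minobsinnode held constant. A wider, hypothetical grid could be supplied through tuneGrid; lower shrinkage generally needs more trees:
# Hypothetical wider gbm grid (not run in this document)
gbmGrid <- expand.grid(interaction.depth=c(2,3,5),
                       n.trees=c(150,300,600),
                       shrinkage=c(0.1,0.01),
                       n.minobsinnode=10)
# fit.gbm.grid <- train(y~., data=itrain, method="gbm", metric=metric,
#                       trControl=control, tuneGrid=gbmGrid, verbose=F)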
A graphical comparison of the models' resampling performance on the training set.
results <- resamples(list(glm=fit.glm, rf=fit.rf, nnet=fit.nnet, gbm=fit.gbm))
cat(paste('Results'), sep='\n')
## Results
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: glm, rf, nnet, gbm
## Number of resamples: 5
##
## AUC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.9153892 0.9186178 0.9231522 0.9229779 0.9235566 0.9341736 0
## rf 0.9657198 0.9730883 0.9737972 0.9753659 0.9788954 0.9853288 0
## nnet 0.9184813 0.9407977 0.9456926 0.9408704 0.9488647 0.9505155 0
## gbm 0.9631971 0.9667793 0.9676233 0.9689793 0.9702821 0.9770149 0
##
## F
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.8028986 0.8052516 0.8071685 0.8130053 0.8209169 0.8287911 0
## rf 0.9227696 0.9263158 0.9274984 0.9293680 0.9315615 0.9386948 0
## nnet 0.8098694 0.8195051 0.8376437 0.8308441 0.8431655 0.8440367 0
## gbm 0.8955017 0.9038855 0.9051491 0.9064571 0.9083503 0.9193989 0
##
## Precision
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.8877246 0.8922345 0.8978930 0.8984912 0.9052133 0.9093904 0
## rf 0.9256845 0.9299868 0.9377537 0.9367482 0.9442971 0.9460189 0
## nnet 0.9073171 0.9157734 0.9229508 0.9215037 0.9268680 0.9346093 0
## gbm 0.9382022 0.9422535 0.9431010 0.9464588 0.9486804 0.9600571 0
##
## Recall
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.7225131 0.7260813 0.7369110 0.7427367 0.7509830 0.7771953 0
## rf 0.9082569 0.9175393 0.9226737 0.9221896 0.9293194 0.9331586 0
## nnet 0.7313237 0.7369110 0.7640891 0.7566135 0.7680210 0.7827225 0
## gbm 0.8479685 0.8678010 0.8743455 0.8697924 0.8768021 0.8820446 0
par(mar = c(4, 11, 1, 1))
dotplot(results, main = 'AUC results from algorithms')
par(mfrow=c(2,2))
set.seed(2019)
prediction.glm <-predict(fit.glm,newdata=itest,type="raw")
set.seed(2019)
prediction.rf <-predict(fit.rf,newdata=itest,type="raw")
set.seed(2019)
prediction.nnet <-predict(fit.nnet,newdata=itest,type="raw")
set.seed(2019)
prediction.gbm <-predict(fit.gbm,newdata=itest,type="raw")
Results are assessed with the confusion matrix and with the F1-score, which is a more informative metric than raw accuracy on an imbalanced data set. Note that caret reports X0 (loan repaid) as the positive class, so the sensitivity and precision figures below refer to the majority class.
cat(paste('Confusion Matrix GLM Model'), sep='\n')
## Confusion Matrix GLM Model
confusionMatrix(prediction.glm, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 718 77
## X1 236 160
##
## Accuracy : 0.7372
## 95% CI : (0.7112, 0.762)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3416
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7526
## Specificity : 0.6751
## Pos Pred Value : 0.9031
## Neg Pred Value : 0.4040
## Prevalence : 0.8010
## Detection Rate : 0.6029
## Detection Prevalence : 0.6675
## Balanced Accuracy : 0.7139
##
## 'Positive' Class : X0
##
F1_train <- fit.glm$results[5]
F1_test <- F1_Score(itest$y, prediction.glm)
cat(paste('F1_train_glm:',F1_train, 'F1_test_glm:', F1_test), sep='\n')
## F1_train_glm: 0.813005322258826 F1_test_glm: 0.8210405946255
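As a sanity check, the test F1 can be recomputed by hand from the confusion matrix above. With X0 as the positive class, TP = 718, FP = 77 and FN = 236:
# F1 = 2TP / (2TP + FP + FN)
2*718 / (2*718 + 77 + 236)
## [1] 0.8210406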
cat(paste('Confusion Matrix Random Forest Model'), sep='\n')
## Confusion Matrix Random Forest Model
confusionMatrix(prediction.rf, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 878 70
## X1 76 167
##
## Accuracy : 0.8774
## 95% CI : (0.8574, 0.8955)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 1.859e-12
##
## Kappa : 0.6191
##
## Mcnemar's Test P-Value : 0.679
##
## Sensitivity : 0.9203
## Specificity : 0.7046
## Pos Pred Value : 0.9262
## Neg Pred Value : 0.6872
## Prevalence : 0.8010
## Detection Rate : 0.7372
## Detection Prevalence : 0.7960
## Balanced Accuracy : 0.8125
##
## 'Positive' Class : X0
##
F1_train <- fit.rf$results[[5]][1]
F1_test <- F1_Score(itest$y, prediction.rf)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.929368010236147 F1_test_rf: 0.923238696109359
cat(paste('Confusion Matrix Neural Network Model'), sep='\n')
## Confusion Matrix Neural Network Model
confusionMatrix(prediction.nnet, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 712 68
## X1 242 169
##
## Accuracy : 0.7397
## 95% CI : (0.7138, 0.7644)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3601
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7463
## Specificity : 0.7131
## Pos Pred Value : 0.9128
## Neg Pred Value : 0.4112
## Prevalence : 0.8010
## Detection Rate : 0.5978
## Detection Prevalence : 0.6549
## Balanced Accuracy : 0.7297
##
## 'Positive' Class : X0
##
F1_train <- fit.nnet$results[[6]][1] # NB: first row of the tuning grid (size = 1, decay = 0), not the selected model
F1_test <- F1_Score(itest$y, prediction.nnet)
cat(paste('F1_train_nnet:',F1_train,'F1_test_nnet:', F1_test), sep='\n')
## F1_train_nnet: 0.794644959784861 F1_test_nnet: 0.821222606689735
cat(paste('Confusion Matrix GBM Model'), sep='\n')
## Confusion Matrix GBM Model
confusionMatrix(prediction.gbm, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 812 50
## X1 142 187
##
## Accuracy : 0.8388
## 95% CI : (0.8167, 0.8592)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 0.0004739
##
## Kappa : 0.5587
##
## Mcnemar's Test P-Value : 5.122e-11
##
## Sensitivity : 0.8512
## Specificity : 0.7890
## Pos Pred Value : 0.9420
## Neg Pred Value : 0.5684
## Prevalence : 0.8010
## Detection Rate : 0.6818
## Detection Prevalence : 0.7238
## Balanced Accuracy : 0.8201
##
## 'Positive' Class : X0
##
F1_train <- fit.gbm$results[[8]][1] # NB: first row of the grid (n.trees = 50, interaction.depth = 1), not the selected model
F1_test <- F1_Score(itest$y, prediction.gbm)
cat(paste('F1_train_gbm:',F1_train,'F1_test_gbm:', F1_test), sep='\n')
## F1_train_gbm: 0.893726484265782 F1_test_gbm: 0.894273127753304
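One caution on the F1_train values printed in this section: they are taken by position from the first row of each model's results table, which for nnet and gbm is not the selected model (the chosen nnet had a resampled F of 0.8308 and the chosen gbm 0.9065, as shown in the tuning output above). A small helper that merges on the bestTune columns avoids this; best_F is a name introduced here for illustration:
best_F <- function(fit) {
  # join the winning tuning row onto the full results table
  merge(fit$bestTune, fit$results)$F
}
best_F(fit.nnet)  # F of the selected size/decay combination
best_F(fit.gbm)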
par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, itest$y)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, itest$y)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nnet, itest$y)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, itest$y)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GBM Confusion Matrix")
SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method that creates synthetic examples of the minority class, interpolating between existing minority cases instead of simply duplicating them.
set.seed(2019)
smote_train <- SMOTE(y ~ ., data = itrain)
table(smote_train$y)
##
## X0 X1
## 3808 2856
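With DMwR's default settings (perc.over = 200, perc.under = 200) and the 952 defaulters in itrain, SMOTE generates 2 x 952 = 1904 synthetic minority cases (952 + 1904 = 2856 in total) and keeps 2 x 1904 = 3808 randomly sampled majority cases, which is exactly the table above.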
# Stratified cross-validation
# NB: these fold indices are built on itrain, yet the models below are
# trained on smote_train (6664 rows). caret holds out every row not listed
# in an index -- including all the SMOTE-generated rows -- which is why the
# sample sizes printed below are unchanged. Rebuilding the folds on
# smote_train$y would give a cleaner comparison.
folds <- 5
set.seed(2019)
cvIndex <- createFolds(factor(itrain$y), folds, returnTrain = TRUE)
control <- trainControl(index = cvIndex, method = "repeatedcv", number = folds, classProbs = TRUE, summaryFunction = prSummary)
metric <- "AUC"
set.seed(2019)
fit.glm <- train(y~., data=smote_train, method="glm", family=binomial(link='logit'), metric=metric, trControl=control)
print(fit.glm)
## Generalized Linear Model
##
## 6664 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Resampling results:
##
## AUC Precision Recall F
## 0.6088746 0.3247685 0.9569427 0.4849409
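These resampling figures (AUC 0.61, precision 0.32) look far worse than the up-sampled run. Part of this is an artifact of the reused fold indices flagged above: every held-out set now also contains the synthetic and undersampled rows, so these estimates are not directly comparable to the earlier ones.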
plot(varImp(fit.glm),15, main = 'GLM feature selection')
set.seed(2019)
fit.rf <- train(y~., data=smote_train, method="rf", metric=metric, trControl=control)
print(fit.rf)
## Random Forest
##
## 6664 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Resampling results across tuning parameters:
##
## mtry AUC Precision Recall F
## 2 0.8267628 0.3964015 0.9934463 0.5666820
## 7 0.7882037 0.5019825 0.9727242 0.6621885
## 13 0.6846804 0.4935740 0.9692963 0.6540501
##
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(fit.rf),15, main='Random Forest feature selection')
set.seed(2019)
fit.nnet <- train(y~., data=smote_train, method="nnet", metric=metric, trControl=control, trace=FALSE)  # trace = FALSE again suppresses nnet's iteration log
print(fit.nnet)
## Neural Network
##
## 6664 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Resampling results across tuning parameters:
##
## size decay AUC Precision Recall F
## 1 0e+00 0.5466821 0.2810342 0.9884363 0.4365438
## 1 1e-04 0.5344953 0.3234498 0.9541768 0.4820788
## 1 1e-01 0.6044792 0.3392303 0.9437910 0.4990655
## 3 0e+00 0.5674257 0.3757548 0.9348374 0.5346354
## 3 1e-04 0.6200640 0.3935681 0.9264093 0.5513988
## 3 1e-01 0.6547411 0.3720097 0.9438047 0.5335487
## 5 0e+00 0.6083395 0.3795557 0.9414122 0.5406185
## 5 1e-04 0.6491697 0.3744020 0.9461800 0.5359312
## 5 1e-01 0.6946661 0.3863132 0.9469730 0.5485364
##
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(varImp(fit.nnet),15, main = 'Neural Network feature selection')
set.seed(2019)
fit.gbm <- train(y~., data=smote_train, method="gbm", metric=metric, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting
##
## 6664 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Resampling results across tuning parameters:
##
## interaction.depth n.trees AUC Precision Recall F
## 1 50 0.4966608 0.3284126 0.9813633 0.4921296
## 1 100 0.6504328 0.3566349 0.9716581 0.5217598
## 1 150 0.6744467 0.3718810 0.9617036 0.5363294
## 2 50 0.6635953 0.3654392 0.9650958 0.5300896
## 2 100 0.7006521 0.3864208 0.9569708 0.5504646
## 2 150 0.7089790 0.3970976 0.9577706 0.5614109
## 3 50 0.6799393 0.3794861 0.9569561 0.5433311
## 3 100 0.7121102 0.3997628 0.9580287 0.5640839
## 3 150 0.7328222 0.4092089 0.9595908 0.5737339
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
par(mar = c(4, 11, 1, 1))
summary(fit.gbm, cBars=15, las=2, plotit=T, main = 'GBM feature selection')
## var rel.inf
## DEBTINC DEBTINC 45.9330043
## DELINQ.0 DELINQ.0 17.4782942
## YOJ YOJ 13.7109264
## CLAGE CLAGE 10.8536652
## LOAN LOAN 4.2013317
## VALUE VALUE 3.8245245
## JOB.Office JOB.Office 1.2721423
## DELINQ.1 DELINQ.1 1.2421182
## DEROG.1 DEROG.1 1.1449823
## REASON.HomeImp REASON.HomeImp 0.3390109
## JOB.Mgr JOB.Mgr 0.0000000
## JOB.Other JOB.Other 0.0000000
## JOB.ProfExe JOB.ProfExe 0.0000000
Again, a graphical comparison of the models' resampling performance, now on the SMOTE training set.
results <- resamples(list(glm=fit.glm, rf=fit.rf, nnet=fit.nnet, gbm=fit.gbm))
cat(paste('Results'), sep='\n')
## Results
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: glm, rf, nnet, gbm
## Number of resamples: 5
##
## AUC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.5976324 0.5998371 0.6008851 0.6088746 0.6147725 0.6312458 0
## rf 0.8116890 0.8128272 0.8293679 0.8267628 0.8341390 0.8457911 0
## nnet 0.6263446 0.6862141 0.7140258 0.6946661 0.7209892 0.7257566 0
## gbm 0.7230957 0.7267140 0.7313329 0.7328222 0.7369240 0.7460444 0
##
## F
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.4793115 0.4843646 0.4850895 0.4849409 0.4877076 0.4882313 0
## rf 0.5611193 0.5667042 0.5668047 0.5666820 0.5686495 0.5701323 0
## nnet 0.5254613 0.5344070 0.5507246 0.5485364 0.5630153 0.5690738 0
## gbm 0.5702028 0.5728840 0.5734320 0.5737339 0.5734541 0.5786963 0
##
## Precision
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.3196468 0.3239875 0.3244681 0.3247685 0.3276786 0.3280615 0
## rf 0.3915725 0.3961073 0.3968421 0.3964015 0.3976975 0.3997879 0
## nnet 0.3653155 0.3705584 0.3869239 0.3863132 0.4010067 0.4077615 0
## gbm 0.4070156 0.4074693 0.4077562 0.4092089 0.4089888 0.4148148 0
##
## Recall
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.9532468 0.9540079 0.9576720 0.9569427 0.9591568 0.9606299 0
## rf 0.9896104 0.9908016 0.9934124 0.9934463 0.9960317 0.9973753 0
## nnet 0.9356110 0.9415584 0.9446640 0.9469730 0.9550265 0.9580052 0
## gbm 0.9493506 0.9566360 0.9591568 0.9595908 0.9658793 0.9669312 0
par(mar = c(4, 11, 1, 1))
dotplot(results, main = 'AUC results from algorithms')
par(mfrow=c(3,2))
set.seed(2019)
prediction.glm<-predict(fit.glm,newdata=itest,type="raw")
set.seed(2019)
prediction.rf<-predict(fit.rf,newdata=itest,type="raw")
set.seed(2019)
prediction.nnet<-predict(fit.nnet,newdata=itest,type="raw")
set.seed(2019)
prediction.gbm<-predict(fit.gbm,newdata=itest,type="raw")
As before, results are assessed with the confusion matrix and the F1-score on the untouched test set.
cat(paste('Confusion Matrix GLM Model'), sep='\n')
## Confusion Matrix GLM Model
confusionMatrix(prediction.glm, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 778 99
## X1 176 138
##
## Accuracy : 0.7691
## 95% CI : (0.7441, 0.7928)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 0.997
##
## Kappa : 0.3545
##
## Mcnemar's Test P-Value : 4.584e-06
##
## Sensitivity : 0.8155
## Specificity : 0.5823
## Pos Pred Value : 0.8871
## Neg Pred Value : 0.4395
## Prevalence : 0.8010
## Detection Rate : 0.6532
## Detection Prevalence : 0.7364
## Balanced Accuracy : 0.6989
##
## 'Positive' Class : X0
##
F1_train <- fit.glm$results[5]
F1_test <- F1_Score(itest$y, prediction.glm)
cat(paste('F1_train_glm:',F1_train, 'F1_test_glm:', F1_test), sep='\n')
## F1_train_glm: 0.484940906613912 F1_test_glm: 0.849808847624249
cat(paste('Confusion Matrix Random Forest Model'), sep='\n')
## Confusion Matrix Random Forest Model
confusionMatrix(prediction.rf, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 872 75
## X1 82 162
##
## Accuracy : 0.8682
## 95% CI : (0.8476, 0.8869)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 7.14e-10
##
## Kappa : 0.591
##
## Mcnemar's Test P-Value : 0.632
##
## Sensitivity : 0.9140
## Specificity : 0.6835
## Pos Pred Value : 0.9208
## Neg Pred Value : 0.6639
## Prevalence : 0.8010
## Detection Rate : 0.7322
## Detection Prevalence : 0.7951
## Balanced Accuracy : 0.7988
##
## 'Positive' Class : X0
##
F1_train <- fit.rf$results[[5]][1]
F1_test <- F1_Score(itest$y, prediction.rf)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.566681997839924 F1_test_rf: 0.917411888479748
cat(paste('Confusion Matrix Neural Network Model'), sep='\n')
## Confusion Matrix Neural Network Model
confusionMatrix(prediction.nnet, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 770 65
## X1 184 172
##
## Accuracy : 0.7909
## 95% CI : (0.7667, 0.8137)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 0.8182
##
## Kappa : 0.4483
##
## Mcnemar's Test P-Value : 7.549e-14
##
## Sensitivity : 0.8071
## Specificity : 0.7257
## Pos Pred Value : 0.9222
## Neg Pred Value : 0.4831
## Prevalence : 0.8010
## Detection Rate : 0.6465
## Detection Prevalence : 0.7011
## Balanced Accuracy : 0.7664
##
## 'Positive' Class : X0
##
F1_train <- fit.nnet$results[[6]][1]
F1_test <- F1_Score(itest$y, prediction.nnet)
cat(paste('F1_train_nnet:',F1_train,'F1_test_nnet:', F1_test), sep='\n')
## F1_train_nnet: 0.436543766362327 F1_test_nnet: 0.860816098378983
cat(paste('Confusion Matrix GBM Model'), sep='\n')
## Confusion Matrix GBM Model
confusionMatrix(prediction.gbm, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 827 59
## X1 127 178
##
## Accuracy : 0.8438
## 95% CI : (0.8219, 0.864)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 8.275e-05
##
## Kappa : 0.5578
##
## Mcnemar's Test P-Value : 8.984e-07
##
## Sensitivity : 0.8669
## Specificity : 0.7511
## Pos Pred Value : 0.9334
## Neg Pred Value : 0.5836
## Prevalence : 0.8010
## Detection Rate : 0.6944
## Detection Prevalence : 0.7439
## Balanced Accuracy : 0.8090
##
## 'Positive' Class : X0
##
F1_train <- fit.gbm$results[[8]][1]
F1_test <- F1_Score(itest$y, prediction.gbm)
cat(paste('F1_train_gbm:',F1_train,'F1_test_gbm:', F1_test), sep='\n')
## F1_train_gbm: 0.492129600499759 F1_test_gbm: 0.898913043478261
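As in the first experiment, the F1_train values above come from the first row of each tuning grid rather than from the selected model; the best_F() helper sketched earlier would report the winning rows instead (0.5485 for nnet, 0.5737 for gbm).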
par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, itest$y)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, itest$y)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nnet, itest$y)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, itest$y)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GBM Confusion Matrix")
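On the held-out test set the two balancing strategies end up close: up-sampling edges out SMOTE for random forest (test F1 0.923 vs 0.917), SMOTE helps the neural network (0.861 vs 0.821), gbm is essentially unchanged (0.894 vs 0.899), and random forest remains the strongest model in both experiments.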