I’ve used a data set from Kaggle: the Home Equity (HMEQ) data set. It contains loan performance information for 5,960 recent home equity loans. The target variable (BAD) is binary, indicating whether an applicant eventually defaulted, and 12 input variables are reported for each applicant. The task is to predict which clients will default on their loans.
Below is a description of the variables:
-BAD: 1 = client defaulted on loan; 0 = loan repaid
-LOAN: Amount of the loan request
-MORTDUE: Amount due on existing mortgage
-VALUE: Value of current property
-REASON: DebtCon = debt consolidation; HomeImp = home improvement
-JOB: Six occupational categories
-YOJ: Years at present job
-DEROG: Number of major derogatory reports
-DELINQ: Number of delinquent credit lines
-CLAGE: Age of oldest trade line in months
-NINQ: Number of recent credit inquiries
-CLNO: Number of credit lines
-DEBTINC: Debt-to-income ratio
suppressWarnings({library(ggplot2)})
suppressWarnings({library(tidyverse)})
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## v purrr 0.3.3
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
suppressWarnings({library(caret)})
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
suppressWarnings({library(corrplot)})
## corrplot 0.84 loaded
suppressWarnings({library(gridExtra)})
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
suppressWarnings({library(MLmetrics)})
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
path <- "C:/Users/user/Documents/eRUM2020/hmeq.csv"
df = read.csv(path)
# Dimensions of data set
dim(df)
## [1] 5960 13
# List types for each attribute
sapply(df, class)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG
## "integer" "integer" "numeric" "numeric" "factor" "factor" "numeric" "integer"
## DELINQ CLAGE NINQ CLNO DEBTINC
## "integer" "numeric" "integer" "integer" "numeric"
# Take a peek at the first rows of the data set
head(df,5)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO
## 1 1 1100 25860 39025 HomeImp Other 10.5 0 0 94.36667 1 9
## 2 1 1300 70053 68400 HomeImp Other 7.0 0 2 121.83333 0 14
## 3 1 1500 13500 16700 HomeImp Other 4.0 0 0 149.46667 1 10
## 4 1 1500 NA NA NA NA NA NA NA NA
## 5 0 1700 97800 112000 HomeImp Office 3.0 0 0 93.33333 0 14
## DEBTINC
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
# Summarize attribute distributions
summary(df)
## BAD LOAN MORTDUE VALUE
## Min. :0.0000 Min. : 1100 Min. : 2063 Min. : 8000
## 1st Qu.:0.0000 1st Qu.:11100 1st Qu.: 46276 1st Qu.: 66076
## Median :0.0000 Median :16300 Median : 65019 Median : 89236
## Mean :0.1995 Mean :18608 Mean : 73761 Mean :101776
## 3rd Qu.:0.0000 3rd Qu.:23300 3rd Qu.: 91488 3rd Qu.:119824
## Max. :1.0000 Max. :89900 Max. :399550 Max. :855909
## NA's :518 NA's :112
## REASON JOB YOJ DEROG
## : 252 : 279 Min. : 0.000 Min. : 0.0000
## DebtCon:3928 Mgr : 767 1st Qu.: 3.000 1st Qu.: 0.0000
## HomeImp:1780 Office : 948 Median : 7.000 Median : 0.0000
## Other :2388 Mean : 8.922 Mean : 0.2546
## ProfExe:1276 3rd Qu.:13.000 3rd Qu.: 0.0000
## Sales : 109 Max. :41.000 Max. :10.0000
## Self : 193 NA's :515 NA's :708
## DELINQ CLAGE NINQ CLNO
## Min. : 0.0000 Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 0.0000 1st Qu.: 115.1 1st Qu.: 0.000 1st Qu.:15.0
## Median : 0.0000 Median : 173.5 Median : 1.000 Median :20.0
## Mean : 0.4494 Mean : 179.8 Mean : 1.186 Mean :21.3
## 3rd Qu.: 0.0000 3rd Qu.: 231.6 3rd Qu.: 2.000 3rd Qu.:26.0
## Max. :15.0000 Max. :1168.2 Max. :17.000 Max. :71.0
## NA's :580 NA's :308 NA's :510 NA's :222
## DEBTINC
## Min. : 0.5245
## 1st Qu.: 29.1400
## Median : 34.8183
## Mean : 33.7799
## 3rd Qu.: 39.0031
## Max. :203.3121
## NA's :1267
# Summarize data structure
str(df)
## 'data.frame': 5960 obs. of 13 variables:
## $ BAD : int 1 1 1 1 0 1 1 1 1 1 ...
## $ LOAN : int 1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
## $ MORTDUE: num 25860 70053 13500 NA 97800 ...
## $ VALUE : num 39025 68400 16700 NA 112000 ...
## $ REASON : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
## $ JOB : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
## $ YOJ : num 10.5 7 4 NA 3 9 5 11 3 16 ...
## $ DEROG : int 0 0 0 NA 0 0 3 0 0 0 ...
## $ DELINQ : int 0 2 0 NA 0 0 2 0 2 0 ...
## $ CLAGE : num 94.4 121.8 149.5 NA 93.3 ...
## $ NINQ : int 1 0 1 NA 0 1 1 0 1 0 ...
## $ CLNO : int 9 14 10 NA 14 8 17 8 12 13 ...
## $ DEBTINC: num NA NA NA NA NA ...
# Keep a standalone copy of the target variable; it is reused later in plots and tests
BAD <- df$BAD <- as.factor(df$BAD)
df$LOAN <- as.numeric(df$LOAN)
df$DEROG <- as.factor(df$DEROG)
df$DELINQ <- as.factor(df$DELINQ)
df$NINQ <- as.factor(df$NINQ)
df$CLNO <- as.factor(df$CLNO)
# Blank strings in JOB and REASON are treated as missing: "NA" is not an existing
# factor level, so the assignments generate NA values (hence the warnings below)
df$JOB[df$JOB == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$JOB == "", value = structure(c(4L, 4L, :
## invalid factor level, NA generated
df$REASON[df$REASON == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$REASON == "", value = structure(c(3L, :
## invalid factor level, NA generated
My approach: I’ve filled missing values with the median for numerical variables and with the most common level for categorical ones. In addition, for every variable with missing values I’ve created a Boolean indicator feature equal to 1 when the value was missing and 0 otherwise (an approach described by Pawel Grabinski).
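For reference, the numeric part of this recipe can be written more compactly with dplyr::across() (available from dplyr 1.0; the session above loads dplyr 0.8.3, so this is only a sketch and the explicit per-column code below is what was actually run):
# Sketch only: missing-value indicators plus median imputation for the numeric
# columns with NAs (the categorical columns are handled separately below)
num_na_cols <- c("DEBTINC", "MORTDUE", "YOJ", "CLAGE", "VALUE")
df_sketch <- df %>%
  mutate(across(all_of(num_na_cols), ~ as.integer(is.na(.x)), .names = "{.col}_NA")) %>%
  mutate(across(all_of(num_na_cols), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))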
# Percentage of missing values per column, for columns with at least one NA
mi_summary <- function(data_frame){
  mi_summary <- c()
  for (col in colnames(data_frame)){
    mi_summary <- c(mi_summary, mean(is.na(data_frame[, col])) * 100)
  }
  mi_summary_new  <- mi_summary[mi_summary > 0]
  mi_summary_cols <- colnames(data_frame)[mi_summary > 0]
  mi_summary <- data.frame('col_name' = mi_summary_cols, 'perc_missing' = mi_summary_new)
  mi_summary <- mi_summary[order(mi_summary[, 2], decreasing = TRUE), ]
  mi_summary[, 2] <- round(mi_summary[, 2], 6)
  rownames(mi_summary) <- NULL
  return(mi_summary)
}
missing_summary <- mi_summary(df)
missing_summary
## col_name perc_missing
## 1 DEBTINC 21.258389
## 2 DEROG 11.879195
## 3 DELINQ 9.731544
## 4 MORTDUE 8.691275
## 5 YOJ 8.640940
## 6 NINQ 8.557047
## 7 CLAGE 5.167785
## 8 JOB 4.681208
## 9 REASON 4.228188
## 10 CLNO 3.724832
## 11 VALUE 1.879195
df <- df %>%
mutate(DEBTINC_NA = ifelse(is.na(DEBTINC),1,0)) %>%
mutate(DEROG_NA = ifelse(is.na(DEROG),1,0)) %>%
mutate(DELINQ_NA = ifelse(is.na(DELINQ),1,0)) %>%
mutate(MORTDUE_NA = ifelse(is.na(MORTDUE),1,0)) %>%
mutate(YOJ_NA = ifelse(is.na(YOJ),1,0)) %>%
mutate(NINQ_NA = ifelse(is.na(NINQ),1,0)) %>%
mutate(CLAGE_NA = ifelse(is.na(CLAGE),1,0)) %>%
mutate(CLNO_NA = ifelse(is.na(CLNO),1,0)) %>%
mutate(VALUE_NA = ifelse(is.na(VALUE),1,0)) %>%
mutate(JOB_NA = ifelse(is.na(JOB),1,0)) %>%
mutate(REASON_NA = ifelse(is.na(REASON),1,0))
# Impute: most common level for factor columns, median for numerical columns
for (col in missing_summary$col_name){
  if (class(df[,col]) == 'factor'){
    unique_levels <- unique(df[,col])
    df[is.na(df[,col]), col] <- unique_levels[which.max(tabulate(match(df[,col], unique_levels)))]
  } else {
    df[is.na(df[,col]), col] <- median(as.numeric(df[,col]), na.rm = TRUE)
  }
}
pMiss <- function(x){sum(is.na(x))/length(x)*100}
pMiss <- apply(df,2,pMiss)
pMiss <- pMiss[pMiss > 0]
pMiss <- pMiss[order(pMiss, decreasing=T)]
pMiss
## named numeric(0)
df$DEROG_NA <- as.factor(df$DEROG_NA)
df$DEBTINC_NA <- as.factor(df$DEBTINC_NA)
df$DELINQ_NA <- as.factor(df$DELINQ_NA)
df$MORTDUE_NA <- as.factor(df$MORTDUE_NA)
df$YOJ_NA <- as.factor(df$YOJ_NA)
df$NINQ_NA <- as.factor(df$NINQ_NA)
df$CLAGE_NA <- as.factor(df$CLAGE_NA)
df$CLNO_NA <- as.factor(df$CLNO_NA)
df$VALUE_NA <- as.factor(df$VALUE_NA)
df$JOB_NA <- as.factor(df$JOB_NA)
df$REASON_NA <- as.factor(df$REASON_NA)
df$JOB <- factor(df$JOB, labels=c('Mgr','Office','Other','ProfExe','Sales','Self'))
df$REASON <- factor(df$REASON, labels=c('DebtCon','HomeImp'))
I’ve grouped the data set into numerical, categorical and Boolean variables for the exploratory data analysis.
cat <- df[,sapply(df, is.factor)] %>%
select_if(~nlevels(.) <=15 ) %>%
select(-BAD)
bol <- df[,c('DEBTINC_NA','DEROG_NA','DELINQ_NA','MORTDUE_NA','YOJ_NA','NINQ_NA','CLAGE_NA','CLNO_NA','VALUE_NA','JOB_NA','REASON_NA')]
num <- df[,sapply(df, is.numeric)]
The target variable has two classes: “Loan defaulted (1)” and “Loan repaid (0)”. Looking at the bar plot, the classes are quite imbalanced.
cbind(freq=table(df$BAD), percentage=prop.table(table(df$BAD))*100)
## freq percentage
## 0 4771 80.05034
## 1 1189 19.94966
ggplot(df, aes(BAD, fill=BAD)) + geom_bar() +
scale_fill_brewer(palette = "Set1") +
ggtitle("Distribution of Target variable")
I’ve grouped all categorical features into a new subset. I’ve done a graphical analysis using bar plots and counted the frequency of each class. For a bivariate analysis I’ve used a Chi-Square test to evaluate the relationship between the target variable and each categorical feature: the statistic measures how much the observed counts differ from the counts we would expect if there were no relationship at all in the population.
cat <- cat[,c('DELINQ','REASON','JOB','DEROG')]
for(i in 1:length(cat)) {
counts <- table(cat[,i])
name <- names(cat)[i]
barplot(counts, main=name, col=c("blue","red","green","orange","purple"))
}
The Chi-Square test is used here as a feature selection tool, testing the null hypothesis of independence between the target variable and each categorical feature. Two classifications are independent if the distribution of one is not influenced by the other. If the null hypothesis is not rejected (p-value > 0.05), the two classifications are considered independent and the feature can be dropped.
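As a small illustration of what the statistic measures (not part of the original analysis), the expected counts under independence can be computed from the marginal totals; here for REASON:
# Observed counts vs. counts expected under independence (illustration only)
obs <- table(df$BAD, df$REASON)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
chi_sq <- sum((obs - expected)^2 / expected)
chi_sq  # equals chisq.test(obs, correct = FALSE)$statistic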
par(mfrow=c(2,2))
for(i in 1:length(cat)){
freq=table(cat[,i])
percentage=prop.table(table(cat[,i]))*100
freq_cat_outcome=table(BAD,cat[,i])
name <- names(cat)[i]
cat(sep="\n")
cat(paste("Distribution of", name), sep="\n")
print(cbind(freq,percentage))
cat(sep="\n")
cat(paste("Distribution by Target variable and", name), sep="\n")
print(freq_cat_outcome)
cat(sep="\n")
cat(paste("Chi-squared test by Target variable and", name), sep="\n")
suppressWarnings({print(chisq.test(table(BAD,cat[,i])))})
}
##
## Distribution of DELINQ
## freq percentage
## 0 4759 79.84899329
## 1 654 10.97315436
## 2 250 4.19463087
## 3 129 2.16442953
## 4 78 1.30872483
## 5 38 0.63758389
## 6 27 0.45302013
## 7 13 0.21812081
## 8 5 0.08389262
## 10 2 0.03355705
## 11 2 0.03355705
## 12 1 0.01677852
## 13 1 0.01677852
## 15 1 0.01677852
##
## Distribution by Target variable and DELINQ
##
## BAD 0 1 2 3 4 5 6 7 8 10 11 12 13 15
## 0 4104 432 138 58 32 7 0 0 0 0 0 0 0 0
## 1 655 222 112 71 46 31 27 13 5 2 2 1 1 1
##
## Chi-squared test by Target variable and DELINQ
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 763.8, df = 13, p-value < 2.2e-16
##
##
## Distribution of REASON
## freq percentage
## DebtCon 4180 70.13423
## HomeImp 1780 29.86577
##
## Distribution by Target variable and REASON
##
## BAD DebtCon HomeImp
## 0 3387 1384
## 1 793 396
##
## Chi-squared test by Target variable and REASON
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(BAD, cat[, i])
## X-squared = 8.1852, df = 1, p-value = 0.004223
##
##
## Distribution of JOB
## freq percentage
## Mgr 767 12.869128
## Office 948 15.906040
## Other 2667 44.748322
## ProfExe 1276 21.409396
## Sales 109 1.828859
## Self 193 3.238255
##
## Distribution by Target variable and JOB
##
## BAD Mgr Office Other ProfExe Sales Self
## 0 588 823 2090 1064 71 135
## 1 179 125 577 212 38 58
##
## Chi-squared test by Target variable and JOB
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 73.815, df = 5, p-value = 1.644e-14
##
##
## Distribution of DEROG
## freq percentage
## 0 5235 87.83557047
## 1 435 7.29865772
## 2 160 2.68456376
## 3 58 0.97315436
## 4 23 0.38590604
## 5 15 0.25167785
## 6 15 0.25167785
## 7 8 0.13422819
## 8 6 0.10067114
## 9 3 0.05033557
## 10 2 0.03355705
##
## Distribution by Target variable and DEROG
##
## BAD 0 1 2 3 4 5 6 7 8 9 10
## 0 4394 266 78 15 5 8 5 0 0 0 0
## 1 841 169 82 43 18 7 10 8 6 3 2
##
## Chi-squared test by Target variable and DEROG
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 503.99, df = 10, p-value < 2.2e-16
pl1 <- cat %>%
ggplot(aes(x=BAD, y=DELINQ, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and DELINQ")
pl2 <- cat %>%
ggplot(aes(x=BAD, y=REASON, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and REASON")
pl3 <- cat %>%
ggplot(aes(x=BAD, y=JOB, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and JOB")
pl4 <- cat %>%
ggplot(aes(x=BAD, y=DEROG, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and DEROG")
par(mfrow=c(2,2))
grid.arrange(pl1,pl2,pl3,pl4, ncol=2)
I’ve transformed the categorical features into numerical variables with one-hot encoding, so that the machine learning models can handle them directly.
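As a toy illustration (hypothetical data, not part of the data set), this is what caret::dummyVars, used just below, produces for a two-level factor when fullRank = FALSE: one 0/1 column per level.
toy <- data.frame(REASON = factor(c("DebtCon", "HomeImp", "DebtCon")))
predict(dummyVars(~ ., data = toy, fullRank = FALSE), newdata = toy)
##   REASON.DebtCon REASON.HomeImp
## 1              1              0
## 2              0              1
## 3              1              0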
dmy <- dummyVars("~.", data = cat,fullRank = F)
cat_num <- data.frame(predict(dmy, newdata = cat))
I’ve grouped all numerical features into a new subset. I’ve done a graphical analysis using histograms, density plots and box plots, and I’ve computed statistics such as standard deviation, skewness and kurtosis for each variable. For a bivariate analysis I’ve used an ANOVA test to evaluate the relationship between the target variable and each numerical feature: it assesses whether the group means differ more than would be expected by chance.
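With a binary target the one-way ANOVA compares just two group means, so its F statistic equals the square of the pooled two-sample t statistic. A quick illustration on DEBTINC (not part of the original output):
tmp <- data.frame(DEBTINC = num$DEBTINC, BAD = BAD)
summary(aov(DEBTINC ~ BAD, data = tmp))[[1]][1, "F value"]
t.test(DEBTINC ~ BAD, data = tmp, var.equal = TRUE)$statistic^2  # should match the F value above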
par(mfrow=c(2,3))
for(i in 1:length(num)) {
hist(num[,i], main=names(num)[i], col='blue')
}
par(mfrow=c(2,3))
for(i in 1:length(num)) {
boxplot(num[,i], main=names(num)[i], col='orange')
}
par(mfrow=c(2,3))
for(i in 1:length(num)){
plot(density(num[,i]), main=names(num)[i], col='red')
}
for(i in 1:length(num)){
name <- names(num)[i]
cat(paste("Distribution of", name), sep="\n")
#cat(names(num)[i],sep = "\n")
print(summary(num[,i]))
cat(sep="\n")
stand.deviation = sd(num[,i])
variance = var(num[,i])
skewness = mean((num[,i] - mean(num[,i]))^3/sd(num[,i])^3)
kurtosis = mean((num[,i] - mean(num[,i]))^4/sd(num[,i])^4) - 3
outlier_values <- sum(table(boxplot.stats(num[,i])$out))
cat(paste("Statistical analysis of", name), sep="\n")
print(cbind(stand.deviation, variance, skewness, kurtosis, outlier_values))
cat(sep="\n")
cat(paste("anova_test between BAD and", name),sep = "\n")
print(summary(aov(as.numeric(BAD)~num[,i], data=num)))
cat(sep="\n")
}
## Distribution of LOAN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1100 11100 16300 18608 23300 89900
##
## Statistical analysis of LOAN
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 11207.48 125607617 2.022762 6.922438 256
##
## anova_test between BAD and LOAN
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 5.4 5.368 33.79 6.45e-09 ***
## Residuals 5958 946.4 0.159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of MORTDUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2063 48139 65019 73001 88200 399550
##
## Statistical analysis of MORTDUE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 42552.73 1810734556 1.941153 7.440693 308
##
## anova_test between BAD and MORTDUE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 2.0 2.0303 12.74 0.000361 ***
## Residuals 5958 949.8 0.1594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of VALUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8000 66490 89236 101540 119005 855909
##
## Statistical analysis of VALUE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 56869.44 3234132829 3.088963 24.85644 347
##
## anova_test between BAD and VALUE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 1.3 1.2675 7.945 0.00484 **
## Residuals 5958 950.5 0.1595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of YOJ
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 7.000 8.756 12.000 41.000
##
## Statistical analysis of YOJ
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 7.259424 52.69923 1.092072 0.744703 211
##
## anova_test between BAD and YOJ
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 2.8 2.7710 17.4 3.08e-05 ***
## Residuals 5958 949.0 0.1593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of CLAGE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 117.4 173.5 179.4 227.1 1168.2
##
## Statistical analysis of CLAGE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 83.5747 6984.73 1.389903 8.180556 66
##
## anova_test between BAD and CLAGE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 26.1 26.106 168 <2e-16 ***
## Residuals 5958 925.7 0.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of DEBTINC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5245 30.7632 34.8183 34.0007 37.9499 203.3122
##
## Statistical analysis of DEBTINC
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 7.644528 58.43881 3.111598 64.0733 245
##
## anova_test between BAD and DEBTINC
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 22.7 22.733 145.8 <2e-16 ***
## Residuals 5958 929.1 0.156
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pl5 <- num %>%
ggplot(aes(x=BAD, y=LOAN, fill=BAD)) + geom_boxplot()
pl6 <- num %>%
ggplot(aes(x=BAD, y=MORTDUE, fill=BAD)) + geom_boxplot()
pl7 <- num %>%
ggplot(aes(x=BAD, y=VALUE, fill=BAD)) + geom_boxplot()
pl8 <- num %>%
ggplot(aes(x=BAD, y=YOJ, fill=BAD)) + geom_boxplot()
pl9 <- num %>%
ggplot(aes(x=BAD, y=CLAGE, fill=BAD)) + geom_boxplot()
pl10 <- num %>%
ggplot(aes(x=BAD, y=DEBTINC, fill=BAD)) + geom_boxplot()
par(mfrow=c(2,3))
grid.arrange(pl5,pl6,pl7,pl8,pl9,pl10, ncol=2)
Outliers are observations that lie more than 1.5 times the interquartile range (the difference between the 75th and 25th percentiles) beyond the quartiles. If they are not detected and handled appropriately they can distort the predictions. There are several ways to handle outliers; I’ve decided to cap them, replacing observations below the lower limit with the 5th percentile and those above the upper limit with the 95th percentile.
# Before
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN")
# Managing outliers
qnt <- quantile(num$LOAN, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$LOAN, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$LOAN, na.rm = T)
num$LOAN[num$LOAN < (qnt[1] - H)] <- caps[1]
num$LOAN[num$LOAN >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN after handled outliers")
# Before
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE")
# Managing outliers
qnt <- quantile(num$MORTDUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$MORTDUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$MORTDUE, na.rm = T)
num$MORTDUE[num$MORTDUE < (qnt[1] - H)] <- caps[1]
num$MORTDUE[num$MORTDUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE after handled outliers")
# Before
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE")
# Managing outliers
qnt <- quantile(num$VALUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$VALUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$VALUE, na.rm = T)
num$VALUE[num$VALUE < (qnt[1] - H)] <- caps[1]
num$VALUE[num$VALUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE after handled outliers")
# Before
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ")
# Managing outliers
qnt <- quantile(num$YOJ, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$YOJ, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$YOJ, na.rm = T)
num$YOJ[num$YOJ < (qnt[1] - H)] <- caps[1]
num$YOJ[num$YOJ >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ after handled outliers")
# Before
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE")
# Managing outliers
qnt <- quantile(num$CLAGE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$CLAGE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$CLAGE, na.rm = T)
num$CLAGE[num$CLAGE < (qnt[1] - H)] <- caps[1]
num$CLAGE[num$CLAGE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE after handled outliers")
# Before
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC")
# Managing outliers
qnt <- quantile(num$DEBTINC, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$DEBTINC, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$DEBTINC, na.rm = T)
num$DEBTINC[num$DEBTINC < (qnt[1] - H)] <- caps[1]
num$DEBTINC[num$DEBTINC >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC after handled outliers")
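The six before/after blocks above apply the same capping rule to each variable; for reference, the rule could be factored into a small helper function (a sketch, not part of the original code):
# Cap values below Q1 - 1.5*IQR at the 5th percentile and values above
# Q3 + 1.5*IQR at the 95th percentile
cap_outliers <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  h    <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - h)] <- caps[1]
  x[x > (qnt[2] + h)] <- caps[2]
  x
}
# e.g. num$LOAN <- cap_outliers(num$LOAN)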
The goal of this step is to remove predictors that have a single unique value or only a handful of unique values occurring with very low frequencies. These “near-zero-variance predictors” may cause problems when fitting the algorithms (tree-based models excepted). In the table below, freqRatio is the ratio of the frequency of the most common value to that of the second most common value, and percentUnique is the number of unique values as a percentage of the sample size.
data <- cbind(cat_num, num)
nzv <- nearZeroVar(data, saveMetrics= TRUE)
nzv[nzv$nzv,][1:15,]
## freqRatio percentUnique zeroVar nzv
## DELINQ.2 22.84000 0.03355705 FALSE TRUE
## DELINQ.3 45.20155 0.03355705 FALSE TRUE
## DELINQ.4 75.41026 0.03355705 FALSE TRUE
## DELINQ.5 155.84211 0.03355705 FALSE TRUE
## DELINQ.6 219.74074 0.03355705 FALSE TRUE
## DELINQ.7 457.46154 0.03355705 FALSE TRUE
## DELINQ.8 1191.00000 0.03355705 FALSE TRUE
## DELINQ.10 2979.00000 0.03355705 FALSE TRUE
## DELINQ.11 2979.00000 0.03355705 FALSE TRUE
## DELINQ.12 5959.00000 0.03355705 FALSE TRUE
## DELINQ.13 5959.00000 0.03355705 FALSE TRUE
## DELINQ.15 5959.00000 0.03355705 FALSE TRUE
## JOB.Sales 53.67890 0.03355705 FALSE TRUE
## JOB.Self 29.88083 0.03355705 FALSE TRUE
## DEROG.2 36.25000 0.03355705 FALSE TRUE
nzv <- nearZeroVar(data)
data_new <- data[, -nzv]
Another feature selection approach is to look at the correlation between variables; I apply it to the whole data set. In some models, such as linear regression, correlated features can degrade performance (multicollinearity). Although some models, such as ensembles of tree-based models, are not sensitive to this issue, I prefer to remove correlated features anyway because I don’t know in advance which model I will use.
par(mfrow=c(1,1))
cor <- cor(data_new,use="complete.obs",method = "spearman")
corrplot(cor, type="lower", tl.col = "black", diag=FALSE, method="number", mar = c(0, 0, 2, 0), title="Correlation")
summary(cor[upper.tri(cor)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.000000 -0.038116 0.005731 -0.014157 0.043268 0.792421
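The next step uses an object called df_new that is not created in the code shown above; presumably the highly correlated columns were dropped from data_new at this point. A minimal sketch of such a step, assuming caret::findCorrelation with an arbitrary 0.75 cutoff (the exact rule originally used is not shown):
# Assumed step (not in the code shown above): drop highly correlated columns
# from data_new before centering and scaling; the 0.75 cutoff is a guess
high_cor <- findCorrelation(cor(data_new, method = "spearman"), cutoff = 0.75)
df_new <- if (length(high_cor) > 0) data_new[, -high_cor] else data_new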
To analyze the performance of a model, it is good practice to split the data set into a training set and a test set. The training set is the sample of data used to fit the model; the test set is the sample used to provide an unbiased evaluation of the model on data it has never seen before.
# calculate the pre-process parameters from the data set
set.seed(2019)
preprocessParams <- preProcess(df_new, method=c("center", "scale"))
# Transform the data set using the parameters
transformed <- predict(preprocessParams, df_new)
# Manage levels on the target variable
y <- as.factor(df$BAD)
transformed <- cbind.data.frame(transformed,y)
levels(transformed$y) <- make.names(levels(factor(transformed$y)))
str(transformed)
## 'data.frame': 5960 obs. of 14 variables:
## $ DELINQ.0 : num 0.502 -1.99 0.502 0.502 0.502 ...
## $ DELINQ.1 : num -0.351 -0.351 -0.351 -0.351 -0.351 ...
## $ REASON.HomeImp: num 1.532 1.532 1.532 -0.653 1.532 ...
## $ JOB.Mgr : num -0.384 -0.384 -0.384 -0.384 -0.384 ...
## $ JOB.Office : num -0.435 -0.435 -0.435 -0.435 2.299 ...
## $ JOB.Other : num 1.11 1.11 1.11 1.11 -0.9 ...
## $ JOB.ProfExe : num -0.522 -0.522 -0.522 -0.522 -0.522 ...
## $ DEROG.1 : num -0.281 -0.281 -0.281 -0.281 -0.281 ...
## $ LOAN : num -1.86 -1.84 -1.81 -1.81 -1.79 ...
## $ VALUE : num -1.322 -0.669 -1.818 -0.206 0.3 ...
## $ YOJ : num 0.279 -0.233 -0.672 -0.233 -0.819 ...
## $ CLAGE : num -1.091 -0.73 -0.367 -0.052 -1.104 ...
## $ DEBTINC : num 0.143 0.143 0.143 0.143 0.143 ...
## $ y : Factor w/ 2 levels "X0","X1": 2 2 2 2 1 2 2 2 2 2 ...
# Draw a random, stratified sample including p percent of the data
set.seed(12345)
train_index <- createDataPartition(transformed$y, p=0.80, list=FALSE)
# Select the remaining 20% of the data for the test set
itest <- transformed[-train_index,]
# Use the 80% sample to train the models
itrain <- transformed[train_index,]
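A quick check (not shown in the original output) that the stratified split preserves the roughly 80/20 class balance in both subsets:
prop.table(table(itrain$y))
prop.table(table(itest$y))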
The analysis compares four baseline models: Logistic Regression as the simplest model and a benchmark, two ensemble models (Random Forest and Gradient Boosting Machine) and a Neural Network.
# With prSummary, the AUC reported below is the area under the precision-recall curve
control <- trainControl(method="cv", number=5, classProbs = TRUE, summaryFunction = prSummary)
metric <- 'AUC'
set.seed(2019)
fit.glm <- train(y~., data=itrain, method="glm", family=binomial(link='logit'), metric=metric, trControl=control)
print(fit.glm)
## Generalized Linear Model
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3816, 3815, 3815, 3815, 3815
## Resampling results:
##
## AUC Precision Recall F
## 0.9233094 0.8392055 0.9554614 0.8935387
plot(varImp(fit.glm),15, main = 'GLM feature selection')
set.seed(2019)
fit.rf <- train(y~., data=itrain, method="rf", metric=metric, trControl=control)
print(fit.rf)
## Random Forest
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3816, 3815, 3815, 3815, 3815
## Resampling results across tuning parameters:
##
## mtry AUC Precision Recall F
## 2 0.9414138 0.8733717 0.9861150 0.9263060
## 7 0.9156241 0.9162927 0.9481257 0.9318733
## 13 0.8359977 0.9169983 0.9423621 0.9294455
##
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(fit.rf),15, main='Random Forest feature selection')
set.seed(2019)
fit.nnet <- train(y~., data=itrain, method="nnet", metric=metric, trControl=control)
## (nnet training iteration trace omitted; it can be suppressed by passing trace = FALSE to train())
print(fit.nnet)
## Neural Network
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3816, 3815, 3815, 3815, 3815
## Resampling results across tuning parameters:
##
## size decay AUC Precision Recall F
## 1 0e+00 0.9191526 0.8177037 0.9756387 0.8889979
## 1 1e-04 0.8862537 0.8278028 0.9661974 0.8909677
## 1 1e-01 0.9228062 0.8466382 0.9428849 0.8921627
## 3 0e+00 0.8529599 0.8656062 0.9161796 0.8894465
## 3 1e-04 0.8908519 0.8731755 0.8999331 0.8860357
## 3 1e-01 0.9341468 0.8547087 0.9507435 0.9001177
## 5 0e+00 0.9211971 0.8617435 0.9449843 0.9013797
## 5 1e-04 0.9181517 0.8620943 0.9300584 0.8943309
## 5 1e-01 0.9407425 0.8746971 0.9381695 0.9052206
##
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
plot(varImp(fit.nnet),15, main = 'Neural Network feature selection')
set.seed(2019)
fit.gbm <- train(y~., data=itrain, method="gbm", metric=metric, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3816, 3815, 3815, 3815, 3815
## Resampling results across tuning parameters:
##
## interaction.depth n.trees AUC Precision Recall F
## 1 50 0.7931374 0.8569476 0.9819228 0.9151634
## 1 100 0.9550171 0.8778632 0.9693457 0.9213127
## 1 150 0.9613312 0.8894362 0.9622731 0.9243772
## 2 50 0.9450629 0.8838504 0.9685596 0.9242512
## 2 100 0.9654723 0.9021633 0.9578188 0.9291050
## 2 150 0.9667118 0.9067379 0.9567703 0.9310413
## 3 50 0.9637935 0.8971377 0.9604396 0.9276370
## 3 100 0.9681581 0.9092756 0.9549354 0.9315009
## 3 150 0.9698005 0.9136212 0.9549357 0.9337780
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
par(mar = c(4, 11, 1, 1))
summary(fit.gbm, cBars=15, las=2, plotit=T, main = 'GBM feature selection')
## var rel.inf
## DEBTINC DEBTINC 56.36730306
## DELINQ.0 DELINQ.0 13.26448221
## CLAGE CLAGE 9.15695905
## VALUE VALUE 8.81423375
## LOAN LOAN 5.11793040
## YOJ YOJ 3.16235560
## DELINQ.1 DELINQ.1 1.17456145
## DEROG.1 DEROG.1 1.05232313
## JOB.Office JOB.Office 0.97548147
## REASON.HomeImp REASON.HomeImp 0.38930802
## JOB.Other JOB.Other 0.33100892
## JOB.Mgr JOB.Mgr 0.13148194
## JOB.ProfExe JOB.ProfExe 0.06257101
A graphical comparison of the models’ performance on the cross-validation resamples from the training set.
results <- resamples(list(glm=fit.glm, rf=fit.rf, nnet=fit.nnet, gbm=fit.gbm))
cat(paste('Results'), sep='\n')
## Results
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: glm, rf, nnet, gbm
## Number of resamples: 5
##
## AUC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.9091625 0.9157063 0.9174697 0.9233094 0.9323515 0.9418573 0
## rf 0.9365244 0.9390366 0.9434109 0.9414138 0.9438351 0.9442619 0
## nnet 0.9059294 0.9443217 0.9448658 0.9407425 0.9497152 0.9588806 0
## gbm 0.9638430 0.9654963 0.9673594 0.9698005 0.9709633 0.9813404 0
##
## F
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.8888889 0.8890251 0.8919903 0.8935387 0.8931439 0.9046455 0
## rf 0.9205176 0.9206349 0.9258344 0.9263060 0.9299754 0.9345679 0
## nnet 0.8883184 0.8984772 0.9096692 0.9052206 0.9131793 0.9164589 0
## gbm 0.9264516 0.9276105 0.9310563 0.9337780 0.9382239 0.9455477 0
##
## Precision
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.8305085 0.8352535 0.8360277 0.8392055 0.8456140 0.8486239 0
## rf 0.8617143 0.8686047 0.8751445 0.8733717 0.8770492 0.8843458 0
## nnet 0.8704403 0.8719212 0.8733572 0.8746971 0.8739596 0.8838072 0
## gbm 0.8997555 0.9072682 0.9123253 0.9136212 0.9227848 0.9259724 0
##
## Recall
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.9463351 0.9488860 0.9501966 0.9554614 0.9633028 0.9685864 0
## rf 0.9790301 0.9803665 0.9882045 0.9861150 0.9908377 0.9921363 0
## nnet 0.9069463 0.9267016 0.9370904 0.9381695 0.9568063 0.9633028 0
## gbm 0.9410223 0.9488860 0.9541885 0.9549357 0.9646134 0.9659686 0
par(mar = c(4, 11, 1, 1))
dotplot(results, main = 'AUC results from algorithms')
par(mfrow=c(2,2))
set.seed(2019)
prediction.glm <-predict(fit.glm,newdata=itest,type="raw")
set.seed(2019)
prediction.rf <-predict(fit.rf,newdata=itest,type="raw")
set.seed(2019)
prediction.nnet <-predict(fit.nnet,newdata=itest,type="raw")
set.seed(2019)
prediction.gbm <-predict(fit.gbm,newdata=itest,type="raw")
Results on the test set are summarized by the confusion matrix and by the F1-score, which is a more informative evaluation metric than accuracy for an imbalanced data set.
cat(paste('Confusion Matrix GLM Model'), sep='\n')
## Confusion Matrix GLM Model
confusionMatrix(prediction.glm, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 909 172
## X1 45 65
##
## Accuracy : 0.8178
## 95% CI : (0.7947, 0.8393)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 0.07737
##
## Kappa : 0.2844
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9528
## Specificity : 0.2743
## Pos Pred Value : 0.8409
## Neg Pred Value : 0.5909
## Prevalence : 0.8010
## Detection Rate : 0.7632
## Detection Prevalence : 0.9076
## Balanced Accuracy : 0.6135
##
## 'Positive' Class : X0
##
F1_train <- F1_Score(itrain$y,predict(fit.glm,newdata=itrain,type="raw"))
F1_test <- F1_Score(itest$y, prediction.glm)
cat(paste('F1_train_glm:',F1_train, 'F1_test_glm:', F1_test), sep='\n')
## F1_train_glm: 0.893731635651322 F1_test_glm: 0.893366093366094
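To connect this F1 value to the confusion matrix above: MLmetrics::F1_Score appears to use the first factor level (X0, the majority class) as the positive class here, so the value can be reproduced directly from the matrix:
precision <- 909 / (909 + 172)  # Pos Pred Value
recall    <- 909 / (909 + 45)   # Sensitivity
2 * precision * recall / (precision + recall)  # ~0.8934, matching F1_test_glm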
cat(paste('Confusion Matrix Random Forest Model'), sep='\n')
## Confusion Matrix Random Forest Model
confusionMatrix(prediction.rf, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 938 137
## X1 16 100
##
## Accuracy : 0.8715
## 95% CI : (0.8512, 0.89)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 9.097e-11
##
## Kappa : 0.5014
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9832
## Specificity : 0.4219
## Pos Pred Value : 0.8726
## Neg Pred Value : 0.8621
## Prevalence : 0.8010
## Detection Rate : 0.7876
## Detection Prevalence : 0.9026
## Balanced Accuracy : 0.7026
##
## 'Positive' Class : X0
##
F1_train <- F1_Score(itrain$y,predict(fit.rf,newdata=itrain,type="raw"))
F1_test <- F1_Score(itest$y, prediction.rf)
cat(paste('F1_train_rf:',F1_train,'F1_test_rf:', F1_test), sep='\n')
## F1_train_rf: 0.955605718585403 F1_test_rf: 0.924593395761459
cat(paste('Confusion Matrix Neural Network Model'), sep='\n')
## Confusion Matrix Neural Network Model
confusionMatrix(prediction.nnet, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 888 146
## X1 66 91
##
## Accuracy : 0.822
## 95% CI : (0.7991, 0.8433)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 0.03634
##
## Kappa : 0.3605
##
## Mcnemar's Test P-Value : 5.771e-08
##
## Sensitivity : 0.9308
## Specificity : 0.3840
## Pos Pred Value : 0.8588
## Neg Pred Value : 0.5796
## Prevalence : 0.8010
## Detection Rate : 0.7456
## Detection Prevalence : 0.8682
## Balanced Accuracy : 0.6574
##
## 'Positive' Class : X0
##
F1_train <- F1_Score(itrain$y,predict(fit.nnet,newdata=itrain,type="raw"))
F1_test <- F1_Score(itest$y, prediction.nnet)
cat(paste('F1_train_nnet:',F1_train,'F1_test_nnet:', F1_test), sep='\n')
## F1_train_nnet: 0.903363991432531 F1_test_nnet: 0.893360160965795
cat(paste('Confusion Matrix GBM Model'), sep='\n')
## Confusion Matrix GBM Model
confusionMatrix(prediction.gbm, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 911 92
## X1 43 145
##
## Accuracy : 0.8866
## 95% CI : (0.8673, 0.9041)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 1.910e-15
##
## Kappa : 0.6145
##
## Mcnemar's Test P-Value : 3.609e-05
##
## Sensitivity : 0.9549
## Specificity : 0.6118
## Pos Pred Value : 0.9083
## Neg Pred Value : 0.7713
## Prevalence : 0.8010
## Detection Rate : 0.7649
## Detection Prevalence : 0.8421
## Balanced Accuracy : 0.7834
##
## 'Positive' Class : X0
##
F1_train <- F1_Score(itrain$y,predict(fit.gbm,newdata=itrain,type="raw"))
F1_test <- F1_Score(itest$y, prediction.gbm)
cat(paste('F1_train_gbm:',F1_train,'F1_test_gbm:', F1_test), sep='\n')
## F1_train_gbm: 0.940754958413308 F1_test_gbm: 0.931016862544711
par(mfrow=c(2,2))
ctable.glm <- table(prediction.glm, itest$y)
fourfoldplot(ctable.glm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GLM Confusion Matrix")
ctable.rf <- table(prediction.rf, itest$y)
fourfoldplot(ctable.rf, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "RF Confusion Matrix")
ctable.nnet <- table(prediction.nnet, itest$y)
fourfoldplot(ctable.nnet, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "NNET Confusion Matrix")
ctable.gbm <- table(prediction.gbm, itest$y)
fourfoldplot(ctable.gbm, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "GBM Confusion Matrix")