I’ve used a data set from Kaggle: the Home Equity data set (HMEQ) It contains loan performance information for 5.960 recent home equity loans. The target variable (BAD) is a binary variable indicating whether an applicant eventually defaulted and for each applicant are reported 12 input variables. The task is to predict clients who default on their loans.
Below the variables description
-BAD: 1 = client defaulted on loan 0 = loan repaid
-LOAN: Amount of the loan request
-MORTDUE: Amount due on existing mortgage
-VALUE: Value of current property
-REASON: DebtCon = debt consolidation; HomeImp = home improvement
-JOB: Six occupational categories
-YOJ: Years at present job
-DEROG: Number of major derogatory reports
-DELINQ: Number of delinquent credit lines
-CLAGE: Age of oldest trade line in months
-NINQ: Number of recent credit lines
-CLNO: Number of credit lines
-DEBTINC: Debt-to-income ratio
suppressWarnings({library(ggplot2)})
suppressWarnings({library(tidyverse)})
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## v purrr 0.3.3
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
suppressWarnings({library(caret)})
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
suppressWarnings({library(DMwR)})
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
suppressWarnings({library(ROSE)})
## Loaded ROSE 0.0-3
suppressWarnings({library(corrplot)})
## corrplot 0.84 loaded
suppressWarnings({library(gridExtra)})
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
suppressWarnings({library(MLmetrics)})
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
path <- "C:/Users/user/Documents/eRUM2020/hmeq.csv"
df = read.csv(path)
# Dimensions of data set
dim(df)
## [1] 5960 13
# List types for each attribute
sapply(df, class)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG
## "integer" "integer" "numeric" "numeric" "factor" "factor" "numeric" "integer"
## DELINQ CLAGE NINQ CLNO DEBTINC
## "integer" "numeric" "integer" "integer" "numeric"
# Take a peek at the first rows of the data set
head(df,5)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO
## 1 1 1100 25860 39025 HomeImp Other 10.5 0 0 94.36667 1 9
## 2 1 1300 70053 68400 HomeImp Other 7.0 0 2 121.83333 0 14
## 3 1 1500 13500 16700 HomeImp Other 4.0 0 0 149.46667 1 10
## 4 1 1500 NA NA NA NA NA NA NA NA
## 5 0 1700 97800 112000 HomeImp Office 3.0 0 0 93.33333 0 14
## DEBTINC
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
# Summarize attribute distributions
summary(df)
## BAD LOAN MORTDUE VALUE
## Min. :0.0000 Min. : 1100 Min. : 2063 Min. : 8000
## 1st Qu.:0.0000 1st Qu.:11100 1st Qu.: 46276 1st Qu.: 66076
## Median :0.0000 Median :16300 Median : 65019 Median : 89236
## Mean :0.1995 Mean :18608 Mean : 73761 Mean :101776
## 3rd Qu.:0.0000 3rd Qu.:23300 3rd Qu.: 91488 3rd Qu.:119824
## Max. :1.0000 Max. :89900 Max. :399550 Max. :855909
## NA's :518 NA's :112
## REASON JOB YOJ DEROG
## : 252 : 279 Min. : 0.000 Min. : 0.0000
## DebtCon:3928 Mgr : 767 1st Qu.: 3.000 1st Qu.: 0.0000
## HomeImp:1780 Office : 948 Median : 7.000 Median : 0.0000
## Other :2388 Mean : 8.922 Mean : 0.2546
## ProfExe:1276 3rd Qu.:13.000 3rd Qu.: 0.0000
## Sales : 109 Max. :41.000 Max. :10.0000
## Self : 193 NA's :515 NA's :708
## DELINQ CLAGE NINQ CLNO
## Min. : 0.0000 Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 0.0000 1st Qu.: 115.1 1st Qu.: 0.000 1st Qu.:15.0
## Median : 0.0000 Median : 173.5 Median : 1.000 Median :20.0
## Mean : 0.4494 Mean : 179.8 Mean : 1.186 Mean :21.3
## 3rd Qu.: 0.0000 3rd Qu.: 231.6 3rd Qu.: 2.000 3rd Qu.:26.0
## Max. :15.0000 Max. :1168.2 Max. :17.000 Max. :71.0
## NA's :580 NA's :308 NA's :510 NA's :222
## DEBTINC
## Min. : 0.5245
## 1st Qu.: 29.1400
## Median : 34.8183
## Mean : 33.7799
## 3rd Qu.: 39.0031
## Max. :203.3121
## NA's :1267
# Summarize data structure
str(df)
## 'data.frame': 5960 obs. of 13 variables:
## $ BAD : int 1 1 1 1 0 1 1 1 1 1 ...
## $ LOAN : int 1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
## $ MORTDUE: num 25860 70053 13500 NA 97800 ...
## $ VALUE : num 39025 68400 16700 NA 112000 ...
## $ REASON : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
## $ JOB : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
## $ YOJ : num 10.5 7 4 NA 3 9 5 11 3 16 ...
## $ DEROG : int 0 0 0 NA 0 0 3 0 0 0 ...
## $ DELINQ : int 0 2 0 NA 0 0 2 0 2 0 ...
## $ CLAGE : num 94.4 121.8 149.5 NA 93.3 ...
## $ NINQ : int 1 0 1 NA 0 1 1 0 1 0 ...
## $ CLNO : int 9 14 10 NA 14 8 17 8 12 13 ...
## $ DEBTINC: num NA NA NA NA NA ...
BAD <- df$BAD <- as.factor(df$BAD)
df$LOAN <- as.numeric(df$LOAN)
df$DEROG <- as.factor(df$DEROG)
df$DELINQ <- as.factor(df$DELINQ)
df$NINQ <- as.factor(df$NINQ)
df$CLNO <- as.factor(df$CLNO)
df$JOB[df$JOB == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$JOB == "", value = structure(c(4L, 4L, :
## invalid factor level, NA generated
df$REASON[df$REASON == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, df$REASON == "", value = structure(c(3L, :
## invalid factor level, NA generated
My approach: I’ve filled up missing values with the median for numerical variables and the most common level for the categorical ones. Also, I’ve created boolean features with 1 (true-missing value) or 0 (false-actual value) for each one with missing values. “Pawel Grabinski”
mi_summary <- function(data_frame){
mi_summary<-c()
for (col in colnames(data_frame)){
mi_summary <- c(mi_summary,mean(is.na(data_frame[,col])*100))
}
mi_summary_new <- mi_summary[mi_summary>0]
mi_summary_cols <- colnames(data_frame)[mi_summary>0]
mi_summary <- data.frame('col_name' = mi_summary_cols, 'perc_missing' = mi_summary_new)
mi_summary <- mi_summary[order(mi_summary[,2], decreasing = TRUE), ]
mi_summary[,2] <- round(mi_summary[,2],6)
rownames(mi_summary) <- NULL
return(mi_summary)
}
missing_summary <- mi_summary(df)
missing_summary
## col_name perc_missing
## 1 DEBTINC 21.258389
## 2 DEROG 11.879195
## 3 DELINQ 9.731544
## 4 MORTDUE 8.691275
## 5 YOJ 8.640940
## 6 NINQ 8.557047
## 7 CLAGE 5.167785
## 8 JOB 4.681208
## 9 REASON 4.228188
## 10 CLNO 3.724832
## 11 VALUE 1.879195
df <- df %>%
mutate(DEBTINC_NA = ifelse(is.na(DEBTINC),1,0)) %>%
mutate(DEROG_NA = ifelse(is.na(DEROG),1,0)) %>%
mutate(DELINQ_NA = ifelse(is.na(DELINQ),1,0)) %>%
mutate(MORTDUE_NA = ifelse(is.na(MORTDUE),1,0)) %>%
mutate(YOJ_NA = ifelse(is.na(YOJ),1,0)) %>%
mutate(NINQ_NA = ifelse(is.na(NINQ),1,0)) %>%
mutate(CLAGE_NA = ifelse(is.na(CLAGE),1,0)) %>%
mutate(CLNO_NA = ifelse(is.na(CLNO),1,0)) %>%
mutate(VALUE_NA = ifelse(is.na(VALUE),1,0)) %>%
mutate(JOB_NA = ifelse(is.na(JOB),1,0)) %>%
mutate(REASON_NA = ifelse(is.na(REASON),1,0))
for (col in missing_summary$col_name){
if (class(df[,col]) == 'factor'){
unique_levels <- unique(df[,col])
df[is.na(df[,col]), col] <- unique_levels[which.max(tabulate(match(df[,col], unique_levels)))]
} else {
df[is.na(df[,col]),col] <- median(as.numeric(df[,col]), na.rm = TRUE)
}
}
pMiss <- function(x){sum(is.na(x))/length(x)*100}
pMiss <- apply(df,2,pMiss)
pMiss <- pMiss[pMiss > 0]
pMiss <- pMiss[order(pMiss, decreasing=T)]
pMiss
## named numeric(0)
df$DEROG_NA <- as.factor(df$DEROG_NA)
df$DEBTINC_NA <- as.factor(df$DEBTINC_NA)
df$DELINQ_NA <- as.factor(df$DELINQ_NA)
df$MORTDUE_NA <- as.factor(df$MORTDUE_NA)
df$YOJ_NA <- as.factor(df$YOJ_NA)
df$NINQ_NA <- as.factor(df$NINQ_NA)
df$CLAGE_NA <- as.factor(df$CLAGE_NA)
df$CLNO_NA <- as.factor(df$CLNO_NA)
df$VALUE_NA <- as.factor(df$VALUE_NA)
df$JOB_NA <- as.factor(df$JOB_NA)
df$REASON_NA <- as.factor(df$REASON_NA)
df$JOB <- factor(df$JOB, labels=c('Mgr','Office','Other','ProfExe','Sales','Self'))
df$REASON <- factor(df$REASON, labels=c('DebtCon','HomeImp'))
I’ve grouped the data set into numerical, categorical and boolean variables for the Exploratory Data Analysis
cat <- df[,sapply(df, is.factor)] %>%
select_if(~nlevels(.) <=15 ) %>%
select(-BAD)
bol <- df[,c('DEBTINC_NA','DEROG_NA','DELINQ_NA','MORTDUE_NA','YOJ_NA','NINQ_NA','CLAGE_NA','CLNO_NA','VALUE_NA','JOB_NA','REASON_NA')]
num <- df[,sapply(df, is.numeric)]
The target variable is grouped into two classes: “Loan defaulted (1)” and “Loan repaid (0)”. Looking at the barplot, it’s quite unbalanced.
cbind(freq=table(df$BAD), percentage=prop.table(table(df$BAD))*100)
## freq percentage
## 0 4771 80.05034
## 1 1189 19.94966
ggplot(df, aes(BAD, fill=BAD)) + geom_bar() +
scale_fill_brewer(palette = "Set1") +
ggtitle("Distribution of Target variable")
I’ve grouped all categorical features into a new subset: I’ve done a graphical analysis using barplots and I’ve counted the frequency for each class. For a bivariate analysis I’ve used a Chi-Square Test to evaluate the relationship between the target variable and each categorical feature. This number tells how much difference exists between observed counts and the counts would be expect if there were no relationship at all in the population.
cat <- cat[,c('DELINQ','REASON','JOB','DEROG')]
for(i in 1:length(cat)) {
counts <- table(cat[,i])
name <- names(cat)[i]
barplot(counts, main=name, col=c("blue","red","green","orange","purple"))
}
The Chi-Square Test is used as feature selection testing the null hypothesis of independence between target variable and categorical features. The goal is to test that two classifications are independent or not. Two classifications are independent if the distribution of one, with respect to a classification principle, is not influenced by the other one. If the null hypothesis is not rejected the two classifications are independent (P-Value>0.05) and the feature can be dropped.
par(mfrow=c(2,2))
for(i in 1:length(cat)){
freq=table(cat[,i])
percentage=prop.table(table(cat[,i]))*100
freq_cat_outcome=table(BAD,cat[,i])
name <- names(cat)[i]
cat(sep="\n")
cat(paste("Distribution of", name), sep="\n")
print(cbind(freq,percentage))
cat(sep="\n")
cat(paste("Distribution by Target variable and", name), sep="\n")
print(freq_cat_outcome)
cat(sep="\n")
cat(paste("Chi-squared test by Target variable and", name), sep="\n")
suppressWarnings({print(chisq.test(table(BAD,cat[,i])))})
}
##
## Distribution of DELINQ
## freq percentage
## 0 4759 79.84899329
## 1 654 10.97315436
## 2 250 4.19463087
## 3 129 2.16442953
## 4 78 1.30872483
## 5 38 0.63758389
## 6 27 0.45302013
## 7 13 0.21812081
## 8 5 0.08389262
## 10 2 0.03355705
## 11 2 0.03355705
## 12 1 0.01677852
## 13 1 0.01677852
## 15 1 0.01677852
##
## Distribution by Target variable and DELINQ
##
## BAD 0 1 2 3 4 5 6 7 8 10 11 12 13 15
## 0 4104 432 138 58 32 7 0 0 0 0 0 0 0 0
## 1 655 222 112 71 46 31 27 13 5 2 2 1 1 1
##
## Chi-squared test by Target variable and DELINQ
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 763.8, df = 13, p-value < 2.2e-16
##
##
## Distribution of REASON
## freq percentage
## DebtCon 4180 70.13423
## HomeImp 1780 29.86577
##
## Distribution by Target variable and REASON
##
## BAD DebtCon HomeImp
## 0 3387 1384
## 1 793 396
##
## Chi-squared test by Target variable and REASON
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(BAD, cat[, i])
## X-squared = 8.1852, df = 1, p-value = 0.004223
##
##
## Distribution of JOB
## freq percentage
## Mgr 767 12.869128
## Office 948 15.906040
## Other 2667 44.748322
## ProfExe 1276 21.409396
## Sales 109 1.828859
## Self 193 3.238255
##
## Distribution by Target variable and JOB
##
## BAD Mgr Office Other ProfExe Sales Self
## 0 588 823 2090 1064 71 135
## 1 179 125 577 212 38 58
##
## Chi-squared test by Target variable and JOB
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 73.815, df = 5, p-value = 1.644e-14
##
##
## Distribution of DEROG
## freq percentage
## 0 5235 87.83557047
## 1 435 7.29865772
## 2 160 2.68456376
## 3 58 0.97315436
## 4 23 0.38590604
## 5 15 0.25167785
## 6 15 0.25167785
## 7 8 0.13422819
## 8 6 0.10067114
## 9 3 0.05033557
## 10 2 0.03355705
##
## Distribution by Target variable and DEROG
##
## BAD 0 1 2 3 4 5 6 7 8 9 10
## 0 4394 266 78 15 5 8 5 0 0 0 0
## 1 841 169 82 43 18 7 10 8 6 3 2
##
## Chi-squared test by Target variable and DEROG
##
## Pearson's Chi-squared test
##
## data: table(BAD, cat[, i])
## X-squared = 503.99, df = 10, p-value < 2.2e-16
pl1 <- cat %>%
ggplot(aes(x=BAD, y=DELINQ, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and DELINQ")
pl2 <- cat %>%
ggplot(aes(x=BAD, y=REASON, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and REASON")
pl3 <- cat %>%
ggplot(aes(x=BAD, y=JOB, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and JOB")
pl4 <- cat %>%
ggplot(aes(x=BAD, y=DEROG, fill=BAD)) +
geom_bar(stat='identity') +
ggtitle("Distribution by BAD and DEROG")
par(mfrow=c(2,2))
grid.arrange(pl1,pl2,pl3,pl4, ncol=2)
I’ve transformed categorical features into numerical variables with one-hot encoding methodology to afford a better understanding of variables by machine learning models.
dmy <- dummyVars("~.", data = cat,fullRank = F)
cat_num <- data.frame(predict(dmy, newdata = cat))
I’ve grouped all numerical features into a new subset: I’ve done a graphical analysis using histograms, density plots and boxplots; also I’ve performed metrics such as standard deviation, skewness and kurtosis for each variable. For a bivariate analysis I’ve used the Anova Test to evaluate the relationship between the target variable and each numerical feature. This test assesses whether the average of more than two groups are statistically different.
par(mfrow=c(2,3))
for(i in 1:length(num)) {
hist(num[,i], main=names(num)[i], col='blue')
}
par(mfrow=c(2,3))
for(i in 1:length(num)) {
boxplot(num[,i], main=names(num)[i], col='orange')
}
par(mfrow=c(2,3))
for(i in 1:length(num)){
plot(density(num[,i]), main=names(num)[i], col='red')
}
for(i in 1:length(num)){
name <- names(num)[i]
cat(paste("Distribution of", name), sep="\n")
#cat(names(num)[i],sep = "\n")
print(summary(num[,i]))
cat(sep="\n")
stand.deviation = sd(num[,i])
variance = var(num[,i])
skewness = mean((num[,i] - mean(num[,i]))^3/sd(num[,i])^3)
kurtosis = mean((num[,i] - mean(num[,i]))^4/sd(num[,i])^4) - 3
outlier_values <- sum(table(boxplot.stats(num[,i])$out))
cat(paste("Statistical analysis of", name), sep="\n")
print(cbind(stand.deviation, variance, skewness, kurtosis, outlier_values))
cat(sep="\n")
cat(paste("anova_test between BAD and", name),sep = "\n")
print(summary(aov(as.numeric(BAD)~num[,i], data=num)))
cat(sep="\n")
}
## Distribution of LOAN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1100 11100 16300 18608 23300 89900
##
## Statistical analysis of LOAN
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 11207.48 125607617 2.022762 6.922438 256
##
## anova_test between BAD and LOAN
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 5.4 5.368 33.79 6.45e-09 ***
## Residuals 5958 946.4 0.159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of MORTDUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2063 48139 65019 73001 88200 399550
##
## Statistical analysis of MORTDUE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 42552.73 1810734556 1.941153 7.440693 308
##
## anova_test between BAD and MORTDUE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 2.0 2.0303 12.74 0.000361 ***
## Residuals 5958 949.8 0.1594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of VALUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8000 66490 89236 101540 119005 855909
##
## Statistical analysis of VALUE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 56869.44 3234132829 3.088963 24.85644 347
##
## anova_test between BAD and VALUE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 1.3 1.2675 7.945 0.00484 **
## Residuals 5958 950.5 0.1595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of YOJ
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 7.000 8.756 12.000 41.000
##
## Statistical analysis of YOJ
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 7.259424 52.69923 1.092072 0.744703 211
##
## anova_test between BAD and YOJ
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 2.8 2.7710 17.4 3.08e-05 ***
## Residuals 5958 949.0 0.1593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of CLAGE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 117.4 173.5 179.4 227.1 1168.2
##
## Statistical analysis of CLAGE
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 83.5747 6984.73 1.389903 8.180556 66
##
## anova_test between BAD and CLAGE
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 26.1 26.106 168 <2e-16 ***
## Residuals 5958 925.7 0.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Distribution of DEBTINC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5245 30.7632 34.8183 34.0007 37.9499 203.3122
##
## Statistical analysis of DEBTINC
## stand.deviation variance skewness kurtosis outlier_values
## [1,] 7.644528 58.43881 3.111598 64.0733 245
##
## anova_test between BAD and DEBTINC
## Df Sum Sq Mean Sq F value Pr(>F)
## num[, i] 1 22.7 22.733 145.8 <2e-16 ***
## Residuals 5958 929.1 0.156
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pl5 <- num %>%
ggplot(aes(x=BAD, y=LOAN, fill=BAD)) + geom_boxplot()
pl6 <- num %>%
ggplot(aes(x=BAD, y=MORTDUE, fill=BAD)) + geom_boxplot()
pl7 <- num %>%
ggplot(aes(x=BAD, y=VALUE, fill=BAD)) + geom_boxplot()
pl8 <- num %>%
ggplot(aes(x=BAD, y=YOJ, fill=BAD)) + geom_boxplot()
pl9 <- num %>%
ggplot(aes(x=BAD, y=CLAGE, fill=BAD)) + geom_boxplot()
pl10 <- num %>%
ggplot(aes(x=BAD, y=DEBTINC, fill=BAD)) + geom_boxplot()
par(mfrow=c(2,3))
grid.arrange(pl5,pl6,pl7,pl8,pl9,pl10, ncol=2)
Outliers are those observations that lie outside 1.5 times the Inter Quartile Range (difference between 75th and 25th quartiles). If they are not detected and corrected in an appropriate way they can distort the prediction. There are several ways to handle outliers, I’ve decided to cap them replacing those observations outside the lower limit with the value of the 5th quartile and those that lie above the upper limit, with the value of the 95th quartile.
# Before
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN")
# Managing outliers
qnt <- quantile(num$LOAN, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$LOAN, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$LOAN, na.rm = T)
num$LOAN[num$LOAN < (qnt[1] - H)] <- caps[1]
num$LOAN[num$LOAN >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = LOAN, fill = BAD)) + geom_density(alpha = .3) + ggtitle("LOAN after handled outliers")
# Before
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE")
# Managing outliers
qnt <- quantile(num$MORTDUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$MORTDUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$MORTDUE, na.rm = T)
num$MORTDUE[num$MORTDUE < (qnt[1] - H)] <- caps[1]
num$MORTDUE[num$MORTDUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = MORTDUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("MORTDUE after handled outliers")
# Before
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE")
# Managing outliers
qnt <- quantile(num$VALUE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$VALUE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$VALUE, na.rm = T)
num$VALUE[num$VALUE < (qnt[1] - H)] <- caps[1]
num$VALUE[num$VALUE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = VALUE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("VALUE after handled outliers")
# Before
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ")
# Managing outliers
qnt <- quantile(num$YOJ, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$YOJ, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$YOJ, na.rm = T)
num$YOJ[num$YOJ < (qnt[1] - H)] <- caps[1]
num$YOJ[num$YOJ >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = YOJ, fill = BAD)) + geom_density(alpha = .3) + ggtitle("YOJ after handled outliers")
# Before
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE")
# Managing outliers
qnt <- quantile(num$CLAGE, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$CLAGE, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$CLAGE, na.rm = T)
num$CLAGE[num$CLAGE < (qnt[1] - H)] <- caps[1]
num$CLAGE[num$CLAGE >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = CLAGE, fill = BAD)) + geom_density(alpha = .3) + ggtitle("CLAGE after handled outliers")
# Before
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC")
# Managing outliers
qnt <- quantile(num$DEBTINC, probs=c(.25, .75), na.rm = T)
caps <- quantile(num$DEBTINC, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(num$DEBTINC, na.rm = T)
num$DEBTINC[num$DEBTINC < (qnt[1] - H)] <- caps[1]
num$DEBTINC[num$DEBTINC >(qnt[2] + H)] <- caps[2]
# After
ggplot(num, aes(x = DEBTINC, fill = BAD)) + geom_density(alpha = .3) + ggtitle("DEBTINC after handled outliers")
The goal of this step is to delete predictors that can have single unique values or handful of unique values that occur with very low frequencies. These “zero-variance predictors” may cause problems to fit the algorithms (excluding tree-based models)
data <- cbind(categorical,num)
nzv <- nearZeroVar(data, saveMetrics= TRUE)
nzv[nzv$nzv,][1:15,]
## freqRatio percentUnique zeroVar nzv
## DELINQ.2 22.84000 0.03355705 FALSE TRUE
## DELINQ.3 45.20155 0.03355705 FALSE TRUE
## DELINQ.4 75.41026 0.03355705 FALSE TRUE
## DELINQ.5 155.84211 0.03355705 FALSE TRUE
## DELINQ.6 219.74074 0.03355705 FALSE TRUE
## DELINQ.7 457.46154 0.03355705 FALSE TRUE
## DELINQ.8 1191.00000 0.03355705 FALSE TRUE
## DELINQ.10 2979.00000 0.03355705 FALSE TRUE
## DELINQ.11 2979.00000 0.03355705 FALSE TRUE
## DELINQ.12 5959.00000 0.03355705 FALSE TRUE
## DELINQ.13 5959.00000 0.03355705 FALSE TRUE
## DELINQ.15 5959.00000 0.03355705 FALSE TRUE
## JOB.Sales 53.67890 0.03355705 FALSE TRUE
## JOB.Self 29.88083 0.03355705 FALSE TRUE
## DEROG.2 36.25000 0.03355705 FALSE TRUE
nzv <- nearZeroVar(data)
data_new <- data[, -nzv]
Another feature selection approach is to observe correlation between variables, I apply it on all data set. There are some models such as linear regression where related features can deteriorate the performance (multicollinearity). Though some ensemble models are not sensitive at this topic, “Ensembles of tree-based models”, I prefer to remove them anyway because I don’t know which model to use in advance.
par(mfrow=c(1,1))
cor <- cor(data_new,use="complete.obs",method = "spearman")
corrplot(cor, type="lower", tl.col = "black", diag=FALSE, method="number", mar = c(0, 0, 2, 0), title="Correlation")
summary(cor[upper.tri(cor)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.000000 -0.038116 0.005731 -0.014157 0.043268 0.792421
To analyze the performance of a model is a good manner to split the data set into the training set and the test set. The training set is a sample of data used to fit the model, the test set is a sample of data used to provide an unbiased evaluation of the model applied on data never seen before.
# calculate the pre-process parameters from the data set
set.seed(2019)
preprocessParams <- preProcess(df_new, method=c("center", "scale"))
# Transform the data set using the parameters
transformed <- predict(preprocessParams, df_new)
# Manage levels on the target variable
y <- as.factor(df$BAD)
transformed <- cbind.data.frame(transformed,y)
levels(transformed$y) <- make.names(levels(factor(transformed$y)))
str(transformed)
## 'data.frame': 5960 obs. of 14 variables:
## $ DELINQ.0 : num 0.502 -1.99 0.502 0.502 0.502 ...
## $ DELINQ.1 : num -0.351 -0.351 -0.351 -0.351 -0.351 ...
## $ REASON.HomeImp: num 1.532 1.532 1.532 -0.653 1.532 ...
## $ JOB.Mgr : num -0.384 -0.384 -0.384 -0.384 -0.384 ...
## $ JOB.Office : num -0.435 -0.435 -0.435 -0.435 2.299 ...
## $ JOB.Other : num 1.11 1.11 1.11 1.11 -0.9 ...
## $ JOB.ProfExe : num -0.522 -0.522 -0.522 -0.522 -0.522 ...
## $ DEROG.1 : num -0.281 -0.281 -0.281 -0.281 -0.281 ...
## $ LOAN : num -1.86 -1.84 -1.81 -1.81 -1.79 ...
## $ VALUE : num -1.322 -0.669 -1.818 -0.206 0.3 ...
## $ YOJ : num 0.279 -0.233 -0.672 -0.233 -0.819 ...
## $ CLAGE : num -1.091 -0.73 -0.367 -0.052 -1.104 ...
## $ DEBTINC : num 0.143 0.143 0.143 0.143 0.143 ...
## $ y : Factor w/ 2 levels "X0","X1": 2 2 2 2 1 2 2 2 2 2 ...
# Draw a random, stratified sample including p percent of the data
set.seed(12345)
test_index <- createDataPartition(transformed$y, p=0.80, list=FALSE)
# select 20% of the data for test
itest <- transformed[-test_index,]
# use the remaining 80% of data to training the models
itrain <- transformed[test_index,]
In this analysis I’ve used some baseline cost sensitive learning models: Reguralized Random Forest and Penalized Multinomial Regression.
# Stratified cross-validation
folds <- 5
set.seed(2019)
cvIndex <- createFolds(factor(itrain$y), folds, returnTrain = T)
control <- trainControl(index = cvIndex,method="repeatedcv", number=folds,classProbs = TRUE,
summaryFunction=prSummary)
metric <- "AUC"
set.seed(2019)
fit.rrf <- train(y~., data=itrain, method="RRF", metric=metric, trControl=control)
## Registered S3 method overwritten by 'RRF':
## method from
## plot.margin randomForest
print(fit.rrf)
## Regularized Random Forest
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Resampling results across tuning parameters:
##
## mtry coefReg coefImp AUC Precision Recall F
## 2 0.010 0.0 0.8688082 0.8948432 0.9418337 0.9173708
## 2 0.010 0.5 0.8704264 0.8892433 0.9434174 0.9152755
## 2 0.010 1.0 0.8346641 0.9105366 0.9379087 0.9240001
## 2 0.505 0.0 0.8672459 0.8992080 0.9491738 0.9233087
## 2 0.505 0.5 0.8503685 0.9097663 0.9347680 0.9220873
## 2 0.505 1.0 0.8294250 0.9053201 0.9365991 0.9206890
## 2 1.000 0.0 0.8318242 0.9178406 0.9423634 0.9299322
## 2 1.000 0.5 0.8305954 0.9178146 0.9392204 0.9283873
## 2 1.000 1.0 0.8680398 0.9078320 0.9386944 0.9229780
## 7 0.010 0.0 0.8445569 0.8989145 0.9499571 0.9234684
## 7 0.010 0.5 0.8875775 0.8911118 0.9434147 0.9161364
## 7 0.010 1.0 0.8396133 0.9051242 0.9339775 0.9192906
## 7 0.505 0.0 0.8318548 0.9029518 0.9394849 0.9207807
## 7 0.505 0.5 0.8572132 0.9040103 0.9321417 0.9178319
## 7 0.505 1.0 0.8610492 0.9090686 0.9371223 0.9228666
## 7 1.000 0.0 0.8430865 0.9181814 0.9407921 0.9293387
## 7 1.000 0.5 0.8528008 0.9161716 0.9392197 0.9275442
## 7 1.000 1.0 0.8344937 0.9055855 0.9344997 0.9197968
## 13 0.010 0.0 0.7931388 0.9147255 0.9441952 0.9292164
## 13 0.010 0.5 0.8508829 0.9036179 0.9373841 0.9201755
## 13 0.010 1.0 0.8616062 0.9021009 0.9337144 0.9176069
## 13 0.505 0.0 0.7947710 0.9101297 0.9365984 0.9231694
## 13 0.505 0.5 0.8616669 0.9020509 0.9331915 0.9173339
## 13 0.505 1.0 0.8678545 0.9022286 0.9321417 0.9168933
## 13 1.000 0.0 0.8364031 0.9175508 0.9415781 0.9293938
## 13 1.000 0.5 0.8554741 0.9134485 0.9373876 0.9252577
## 13 1.000 1.0 0.8617296 0.9016886 0.9318802 0.9165126
##
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 7, coefReg = 0.01 and coefImp
## = 0.5.
plot(varImp(fit.rrf),15, main='Regularized Random Forest feature selection')
set.seed(2019)
fit.pmr <- train(y~., data=itrain, method="multinom", metric=metric, trControl=control)
## # weights: 15 (14 variable)
## initial value 2643.663347
## iter 10 value 1643.449429
## iter 20 value 1596.132959
## final value 1596.129966
## converged
## # weights: 15 (14 variable)
## initial value 2643.663347
## iter 10 value 1644.055053
## iter 20 value 1596.581893
## final value 1596.579823
## converged
## # weights: 15 (14 variable)
## initial value 2643.663347
## iter 10 value 1643.450035
## iter 20 value 1596.133410
## final value 1596.130416
## converged
## # weights: 15 (14 variable)
## initial value 2645.049641
## iter 10 value 1616.608929
## iter 20 value 1576.917418
## final value 1576.917398
## converged
## # weights: 15 (14 variable)
## initial value 2645.049641
## iter 10 value 1617.228086
## iter 20 value 1577.395100
## final value 1577.395081
## converged
## # weights: 15 (14 variable)
## initial value 2645.049641
## iter 10 value 1616.609548
## iter 20 value 1576.917896
## final value 1576.917876
## converged
## # weights: 15 (14 variable)
## initial value 2645.049641
## iter 10 value 1631.992725
## iter 20 value 1586.523644
## final value 1586.523531
## converged
## # weights: 15 (14 variable)
## initial value 2645.049641
## iter 10 value 1632.618173
## iter 20 value 1587.003794
## final value 1587.003681
## converged
## # weights: 15 (14 variable)
## initial value 2645.049641
## iter 10 value 1631.993351
## iter 20 value 1586.524125
## final value 1586.524012
## converged
## # weights: 15 (14 variable)
## initial value 2645.049641
## iter 10 value 1623.606406
## iter 20 value 1571.510591
## final value 1571.510475
## converged
## # weights: 15 (14 variable)
## initial value 2645.049641
## iter 10 value 1624.245690
## iter 20 value 1571.999053
## final value 1571.998938
## converged
## # weights: 15 (14 variable)
## initial value 2645.049641
## iter 10 value 1623.607046
## iter 20 value 1571.511080
## final value 1571.510964
## converged
## # weights: 15 (14 variable)
## initial value 2643.663347
## iter 10 value 1621.673938
## iter 20 value 1571.776738
## final value 1571.776667
## converged
## # weights: 15 (14 variable)
## initial value 2643.663347
## iter 10 value 1622.341658
## iter 20 value 1572.270680
## final value 1572.270608
## converged
## # weights: 15 (14 variable)
## initial value 2643.663347
## iter 10 value 1621.674606
## iter 20 value 1571.777233
## final value 1571.777161
## converged
## # weights: 15 (14 variable)
## initial value 3305.618904
## iter 10 value 2168.329691
## iter 20 value 1977.784792
## final value 1977.781680
## converged
print(fit.pmr)
## Penalized Multinomial Regression
##
## 4769 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3814, 3816, 3816, 3816, 3814
## Resampling results across tuning parameters:
##
## decay AUC Precision Recall F
## 0e+00 0.9232465 0.8383432 0.9557259 0.8931242
## 1e-04 0.9232465 0.8383432 0.9557259 0.8931242
## 1e-01 0.9232486 0.8383432 0.9557259 0.8931242
##
## AUC was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
plot(varImp(fit.pmr),15, main = 'Penalized Multinomial Regression feature selection')
A graphical plot to show performances of the models from the training set.
results <- resamples(list(rrf=fit.rrf, pmr=fit.pmr))
cat(paste('Results'), sep='\n')
## Results
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: rrf, pmr
## Number of resamples: 5
##
## AUC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rrf 0.8521807 0.8792198 0.8965403 0.8875775 0.9019223 0.9080243 0
## pmr 0.9155026 0.9189913 0.9228955 0.9232486 0.9234657 0.9353882 0
##
## F
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rrf 0.9078707 0.9113111 0.9187097 0.9161364 0.9206963 0.9220945 0
## pmr 0.8893012 0.8906634 0.8929664 0.8931242 0.8941035 0.8985864 0
##
## Precision
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rrf 0.8493151 0.895202 0.9002494 0.8911118 0.9047014 0.9060914 0
## pmr 0.8264739 0.837156 0.8391204 0.8383432 0.8419204 0.8470452 0
##
## Recall
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rrf 0.9280105 0.9331586 0.9357798 0.9434147 0.9450262 0.9750983 0
## pmr 0.9423329 0.9489529 0.9567497 0.9557259 0.9568063 0.9737877 0
par(mar = c(4, 11, 1, 1))
dotplot(results, main = 'AUC results from algorithms')
par(mfrow=c(2,2))
set.seed(2019)
prediction.rrf <-predict(fit.rrf,newdata=itest,type="raw")
set.seed(2019)
prediction.pmr <-predict(fit.pmr,newdata=itest,type="raw")
Results are explained by the confusion matrix and the F1-score as better evaluation metric for imbalanced data set.
cat(paste('Confusion Matrix Regularized Random Forest Model'), sep='\n')
## Confusion Matrix Regularized Random Forest Model
confusionMatrix(prediction.rrf, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 922 155
## X1 32 82
##
## Accuracy : 0.843
## 95% CI : (0.8211, 0.8632)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 0.0001124
##
## Kappa : 0.3881
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9665
## Specificity : 0.3460
## Pos Pred Value : 0.8561
## Neg Pred Value : 0.7193
## Prevalence : 0.8010
## Detection Rate : 0.7741
## Detection Prevalence : 0.9043
## Balanced Accuracy : 0.6562
##
## 'Positive' Class : X0
##
F1_train <- fit.rrf$results[[5]][1]
F1_test <- F1_Score(itest$y, prediction.rrf)
cat(paste('F1_train_rrf:',F1_train,'F1_test_rrf:', F1_test), sep='\n')
## F1_train_rrf: 0.894843227894302 F1_test_rrf: 0.907927129492861
cat(paste('Confusion Matrix Penalized Multinomial Regression Model'), sep='\n')
## Confusion Matrix Penalized Multinomial Regression Model
confusionMatrix(prediction.pmr, itest$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 909 172
## X1 45 65
##
## Accuracy : 0.8178
## 95% CI : (0.7947, 0.8393)
## No Information Rate : 0.801
## P-Value [Acc > NIR] : 0.07737
##
## Kappa : 0.2844
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9528
## Specificity : 0.2743
## Pos Pred Value : 0.8409
## Neg Pred Value : 0.5909
## Prevalence : 0.8010
## Detection Rate : 0.7632
## Detection Prevalence : 0.9076
## Balanced Accuracy : 0.6135
##
## 'Positive' Class : X0
##
F1_train <- fit.pmr$results[[6]][1]
F1_test <- F1_Score(itest$y, prediction.pmr)
cat(paste('F1_train_pmr:',F1_train,'F1_test_pmr:', F1_test), sep='\n')
## F1_train_pmr: 0.00751785594104046 F1_test_pmr: 0.893366093366094
par(mfrow=c(2,2))
ctable.rrf <- table(prediction.rrf, itest$y)
fourfoldplot(ctable.rrf, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "RRF Confusion Matrix")
ctable.pmr <- table(prediction.pmr, itest$y)
fourfoldplot(ctable.pmr, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "PMR Confusion Matrix")
Robert Tibshirani, Trevor Hastie, Daniela Witten, Gareth James, “An Introduction to Statistical Learning: With Applications in R”, 2013
Max Kuhn, Kjell Johnson, “Applied Predictive Modeling”, 2013
https://machinelearningmastery.com/what-is-imbalanced-classification/