Introduction

Classification problem

The main goal of this paper is to apply ML methods to classify clients as “good” or “bad”.

We work with the dataset provided by our lecturer. It contains 10 additional "mysterious" variables (feat01 - feat10) generated by the lecturer. While the exact details of how these variables were generated are not provided, they may carry information relevant for the analysis.

Since the following sections are independent of each other, at the beginning of almost every section we clear the environment and load the necessary data and libraries.

First we will clean the data and perform a simple Exploratory Data Analysis. Then we will choose the final features using RFE and move on to building models: XGB, GBM and RF using the approach from classes (tuning hyperparameters one by one), and finally try to build models using the tidymodels package and a more general approach.

library(tidyverse)
library(DT)
library(dplyr)
library(fastDummies)
library(caret)
library(ggplot2)
library(plotfunctions)

Data cleaning

Read csv

df <- read.csv("data/input/c2.csv")

Below we can see the raw dataframe and the format of its columns:

str(df)
## 'data.frame':    2000 obs. of  32 variables:
##  $ id                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                   : int  50 30 32 34 21 25 36 25 21 36 ...
##  $ checking_status       : chr  "'no checking'" "'no checking'" "'no checking'" "'no checking'" ...
##  $ class                 : chr  "good" "good" "good" "good" ...
##  $ credit_amount         : int  2319 1163 1502 4436 10155 3413 2235 3555 1805 2756 ...
##  $ credit_history        : chr  "'existing paid'" "'delayed previously'" "'critical/other existing credit'" "'delayed previously'" ...
##  $ duration              : int  21 15 10 36 60 24 24 18 6 36 ...
##  $ employment            : chr  "'>=7'" "'4<=X<7'" "'>=7'" "'1<=X<4'" ...
##  $ existing_credits      : int  2 2 2 2 1 2 1 1 1 1 ...
##  $ feat01                : num  0.24 0.521 0.287 0.431 0.44 ...
##  $ feat02                : num  0.614 0.79 0.542 0.531 0.603 ...
##  $ feat03                : num  1.31 1.453 0.957 0.942 1.109 ...
##  $ feat04                : num  0.976 1.2 0.856 1.164 0.764 ...
##  $ feat05                : num  0.71 1.203 0.544 1.17 1.025 ...
##  $ feat06                : num  0.57 0.957 1.14 1.339 0.634 ...
##  $ feat07                : num  0.951 0.515 1.039 0.979 0.846 ...
##  $ feat08                : num  0.581 1.18 1.497 0.532 1.04 ...
##  $ feat09                : num  0.78 1.14 1.13 1.08 1.26 ...
##  $ feat10                : num  0.483 0.328 0.482 0.63 0.518 ...
##  $ foreign_worker        : chr  "yes" "yes" "no" "yes" ...
##  $ housing               : chr  "own" "own" "own" "own" ...
##  $ installment_commitment: int  4 3 3 4 2 4 4 4 1 4 ...
##  $ job                   : chr  "skilled" "skilled" "'unskilled resident'" "skilled" ...
##  $ num_dependents        : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ other_parties         : chr  "none" "none" "none" "none" ...
##  $ other_payment_plans   : chr  "none" "none" "none" "none" ...
##  $ own_telephone         : chr  "none" "none" "none" "none" ...
##  $ personal_status       : chr  "'male single'" "'female div/dep/mar'" "'male single'" "'female div/dep/mar'" ...
##  $ property_magnitude    : chr  "'real estate'" "'life insurance'" "'real estate'" "'real estate'" ...
##  $ purpose               : chr  "furniture/equipment" "furniture/equipment" "'new car'" "radio/tv" ...
##  $ residence_since       : int  2 2 4 4 4 2 4 1 2 4 ...
##  $ savings_status        : chr  "'<100'" "'>=1000'" "'<100'" "'<100'" ...

Delete quotations

As we can see, all chr columns contain additional quote characters (''), so let's remove them.

df <- df %>%
  mutate_if(is.character, function(x) gsub("'", "", x))
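
For illustration, this is what the substitution does to a single, hypothetical value:

gsub("'", "", "'no checking'")  # returns "no checking"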

Missing values

The dataframe does not have any missing values, as we can see below:

# Getting the number of missing values in each column
num_missing <-  colSums(is.na(df))

# Excluding columns that contain 0 missing values
num_missing <-  num_missing[num_missing > 0]
num_missing
## named numeric(0)

Handle Categorical variables

Select all columns with chr data type - we treat them as categorical

chr_columns <- names(df[, sapply(df, is.character)])
chr_columns
##  [1] "checking_status"     "class"              
##  [3] "credit_history"      "employment"         
##  [5] "foreign_worker"      "housing"            
##  [7] "job"                 "other_parties"      
##  [9] "other_payment_plans" "own_telephone"      
## [11] "personal_status"     "property_magnitude" 
## [13] "purpose"             "savings_status"

Create df with categorical variables, print unique values of each column.

# Function to print unique values for each column in dataframe
print_unique_values <- function(df){
  
  df_unique <- lapply(df, unique)
  # Print unique values for each column
  for (col_name in names(df_unique)) {
    cat(col_name, ": ", paste(df_unique[[col_name]], collapse = ", "), "\n")
  }
}
df_categorical <- df[chr_columns]

print_unique_values(df_categorical)
## checking_status :  no checking, 0<=X<200, <0, >=200 
## class :  good, bad 
## credit_history :  existing paid, delayed previously, critical/other existing credit, all paid, no credits/all paid 
## employment :  >=7, 4<=X<7, 1<=X<4, <1, unemployed 
## foreign_worker :  yes, no 
## housing :  own, for free, rent 
## job :  skilled, unskilled resident, high qualif/self emp/mgmt, unemp/unskilled non res 
## other_parties :  none, guarantor, co applicant 
## other_payment_plans :  none, bank, stores 
## own_telephone :  none, yes 
## personal_status :  male single, female div/dep/mar, male mar/wid, male div/sep 
## property_magnitude :  real estate, life insurance, car, no known property 
## purpose :  furniture/equipment, new car, radio/tv, used car, education, repairs, business, retraining, other, domestic appliance 
## savings_status :  <100, >=1000, 100<=X<500, no known savings, 500<=X<1000

Personal status column

Let's focus on the "personal_status" column: it combines two pieces of information, sex and marital status.

We should separate them into 2 columns:

  • sex (0 - female, 1 - male)
  • marital_status:
    • single
    • div/dep/mar
    • mar/wid
    • div/sep

Check the size of each group:

personal_status_counts <- table(df$personal_status)

print(personal_status_counts)
## 
## female div/dep/mar       male div/sep       male mar/wid 
##                649                105                181 
##        male single 
##               1065

Create 2 new columns:

# Extracting sex information
df <- df %>%
  mutate(sex = ifelse(grepl("female", personal_status), 0, 1))

# Extracting marital status: drop the first word (the sex part) and any trailing space
df <- df %>%
  mutate(marital_status = sub("^[^ ]+ ", "", sub(" $", "", personal_status)))

# Drop the original "personal_status" column
df <- df %>%
  select(-personal_status)
sex_counts <- table(df$sex)
marital_status_counts <- table(df$marital_status)

print(sex_counts)
## 
##    0    1 
##  649 1351
print(marital_status_counts)
## 
## div/dep/mar     div/sep     mar/wid      single 
##         649         105         181        1065

Create a list of the remaining chr columns which still need to be handled.

# Copy chr columns and remove personal_status - now we have list with remaining chr columns
chr_columns_help <- chr_columns

# Remove "personal_status" from list
chr_columns_help <- setdiff(chr_columns_help, "personal_status")

# Adding marital_status to chr_columns_help
chr_columns_help <- union(chr_columns_help, "marital_status")

print(chr_columns_help)
##  [1] "checking_status"     "class"              
##  [3] "credit_history"      "employment"         
##  [5] "foreign_worker"      "housing"            
##  [7] "job"                 "other_parties"      
##  [9] "other_payment_plans" "own_telephone"      
## [11] "property_magnitude"  "purpose"            
## [13] "savings_status"      "marital_status"

Ordered categorical variables

In the dataset we can observe a few ordered categorical variables. Let’s define and transform them.

# Refresh the df_categorical and show unique values - columns to be handled yet

df_categorical <- df[, chr_columns_help]
print_unique_values(df_categorical)
## checking_status :  no checking, 0<=X<200, <0, >=200 
## class :  good, bad 
## credit_history :  existing paid, delayed previously, critical/other existing credit, all paid, no credits/all paid 
## employment :  >=7, 4<=X<7, 1<=X<4, <1, unemployed 
## foreign_worker :  yes, no 
## housing :  own, for free, rent 
## job :  skilled, unskilled resident, high qualif/self emp/mgmt, unemp/unskilled non res 
## other_parties :  none, guarantor, co applicant 
## other_payment_plans :  none, bank, stores 
## own_telephone :  none, yes 
## property_magnitude :  real estate, life insurance, car, no known property 
## purpose :  furniture/equipment, new car, radio/tv, used car, education, repairs, business, retraining, other, domestic appliance 
## savings_status :  <100, >=1000, 100<=X<500, no known savings, 500<=X<1000 
## marital_status :  single, div/dep/mar, mar/wid, div/sep

Let’s define ordered categorical variables:

ordered_categorical = c("checking_status", "employment", "job", "savings_status")
head(df[, ordered_categorical])
##   checking_status employment                job
## 1     no checking        >=7            skilled
## 2     no checking     4<=X<7            skilled
## 3     no checking        >=7 unskilled resident
## 4     no checking     1<=X<4            skilled
## 5     no checking     4<=X<7            skilled
## 6     no checking         <1            skilled
##     savings_status
## 1             <100
## 2           >=1000
## 3             <100
## 4             <100
## 5       100<=X<500
## 6 no known savings

We found that the group unemp/unskilled non res has just 40 observations, 29 with the good class and 11 with the bad one. The class distribution in this group is very similar to that of unskilled resident.

unemp/unskilled non res: 29/40 = 0.725 - share of good credits in this group

unskilled resident: 288/411 ≈ 0.701

Because of that, we decided to join these two groups.

df %>%
  dplyr::group_by(job, class) %>%
  dplyr::summarise(count = n())
## # A tibble: 8 × 3
## # Groups:   job [4]
##   job                       class count
##   <chr>                     <chr> <int>
## 1 high qualif/self emp/mgmt bad     106
## 2 high qualif/self emp/mgmt good    185
## 3 skilled                   bad     378
## 4 skilled                   good    880
## 5 unemp/unskilled non res   bad      11
## 6 unemp/unskilled non res   good     29
## 7 unskilled resident        bad     123
## 8 unskilled resident        good    288
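
As a cross-check, the share of good credits in each job group can be computed directly; below is a minimal sketch (it assumes class still holds the original "good"/"bad" labels at this point in the cleaning):

# Share of "good" credits within each job group (illustrative sketch)
df %>%
  dplyr::group_by(job) %>%
  dplyr::summarise(count = dplyr::n(),
                   good_share = mean(class == "good"))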

Define values for each category

checking_status_mapping <- c(
  "no checking" = 1,
  "<0" = 2, 
  "0<=X<200" = 3, 
  ">=200" = 4 
  )

employment_mapping <- c(
  "unemployed" = 1, 
  "<1" = 2, 
  "1<=X<4" = 3, 
  "4<=X<7"= 4, 
  ">=7" = 5)

job_mapping <- c(
    "unemp/unskilled non res" = 1,
    "unskilled resident" = 1,
    "skilled" = 2,
    "high qualif/self emp/mgmt" = 3
)

savings_status_mapping <- c(
    "no known savings" = 1,
    "<100" = 2,
    "100<=X<500" = 3,
    "500<=X<1000" = 4,
    ">=1000" = 5
)

# Define the mapping function: values are coded by the position of each level
# in names(mapping), i.e. the order of the names defines the numeric coding
recode_variable <- function(variable, mapping) {
  variable <- factor(variable, levels = names(mapping))
  return(as.numeric(variable))
}

df$checking_status <- recode_variable(df$checking_status, checking_status_mapping)
df$employment <- recode_variable(df$employment, employment_mapping)
df$job <- recode_variable(df$job, job_mapping)
df$savings_status <- recode_variable(df$savings_status, savings_status_mapping)
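
A quick illustrative check of the recoding on a small hypothetical vector (not taken from the dataset):

# levels are coded by their position in names(checking_status_mapping)
recode_variable(c("no checking", ">=200", "<0"), checking_status_mapping)  # returns 1 4 2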

Update chr_columns_help - remove the ordered categorical columns from the list.

# Remove ordered_categorical variables names from list
chr_columns_help <- setdiff(chr_columns_help, ordered_categorical)

Boolean variables

Let’s define boolean variables:

boolean_variables = c("class", "foreign_worker", "own_telephone")
head(df[, boolean_variables])
##   class foreign_worker own_telephone
## 1  good            yes          none
## 2  good            yes          none
## 3  good             no          none
## 4  good            yes          none
## 5  good            yes           yes
## 6  good            yes          none
df$class <- ifelse(grepl("good", df$class), 1, 0)
df$foreign_worker <- ifelse(grepl("yes", df$foreign_worker), 1, 0)
df$own_telephone <- ifelse(grepl("yes", df$own_telephone), 1, 0)
# Remove boolean_variables variables names from list
chr_columns_help <- setdiff(chr_columns_help, boolean_variables)

One-hot encoding

For the remaining columns we apply one-hot encoding.

# Print remaining categorical variables and its unique values
df_categorical <- df[, chr_columns_help]
print_unique_values(df_categorical)
## credit_history :  existing paid, delayed previously, critical/other existing credit, all paid, no credits/all paid 
## housing :  own, for free, rent 
## other_parties :  none, guarantor, co applicant 
## other_payment_plans :  none, bank, stores 
## property_magnitude :  real estate, life insurance, car, no known property 
## purpose :  furniture/equipment, new car, radio/tv, used car, education, repairs, business, retraining, other, domestic appliance 
## marital_status :  single, div/dep/mar, mar/wid, div/sep

We use the dummy_cols() function from the fastDummies library.

df <- dummy_cols(
  df, 
  select_columns = chr_columns_help, 
  remove_first_dummy = TRUE,           # remove the first dummy to avoid the issue of multicollinearity
  remove_selected_columns = TRUE)     # remove the original columns
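
As a quick sanity check (a sketch, not part of the original pipeline), we can list the dummy columns generated for one of the encoded variables:

# dummy columns created for the "purpose" variable
grep("^purpose_", names(df), value = TRUE)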

Summary of data cleaning

We performed data cleaning in a few steps:

  • Delete quotations (clean chr variables)
  • Handle categorical variables divided into:
    • Personal status column - divide the column into 2 columns (sex and marital_status)
    • Ordered categorical variables
    • Boolean variables
    • Remaining categorical variables (one-hot encoding)

After these transformations we obtained a dataframe with 51 variables.

As we can see below, after the data transformation there are no missing values, so the data transformation process was successful.

# Getting the number of missing values in each column
num_missing <-  colSums(is.na(df))

# Excluding columns that contain 0 missing values
num_missing <-  num_missing[num_missing > 0]
num_missing
## named numeric(0)

Below we can see the new dataframe and its columns along with their value types

str(df)
## 'data.frame':    2000 obs. of  51 variables:
##  $ id                                           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                                          : int  50 30 32 34 21 25 36 25 21 36 ...
##  $ checking_status                              : num  1 1 1 1 1 1 3 2 1 3 ...
##  $ class                                        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ credit_amount                                : int  2319 1163 1502 4436 10155 3413 2235 3555 1805 2756 ...
##  $ duration                                     : int  21 15 10 36 60 24 24 18 6 36 ...
##  $ employment                                   : num  5 4 5 3 4 2 5 4 3 5 ...
##  $ existing_credits                             : int  2 2 2 2 1 2 1 1 1 1 ...
##  $ feat01                                       : num  0.24 0.521 0.287 0.431 0.44 ...
##  $ feat02                                       : num  0.614 0.79 0.542 0.531 0.603 ...
##  $ feat03                                       : num  1.31 1.453 0.957 0.942 1.109 ...
##  $ feat04                                       : num  0.976 1.2 0.856 1.164 0.764 ...
##  $ feat05                                       : num  0.71 1.203 0.544 1.17 1.025 ...
##  $ feat06                                       : num  0.57 0.957 1.14 1.339 0.634 ...
##  $ feat07                                       : num  0.951 0.515 1.039 0.979 0.846 ...
##  $ feat08                                       : num  0.581 1.18 1.497 0.532 1.04 ...
##  $ feat09                                       : num  0.78 1.14 1.13 1.08 1.26 ...
##  $ feat10                                       : num  0.483 0.328 0.482 0.63 0.518 ...
##  $ foreign_worker                               : num  1 1 0 1 1 1 1 1 1 1 ...
##  $ installment_commitment                       : int  4 3 3 4 2 4 4 4 1 4 ...
##  $ job                                          : num  3 3 2 3 3 3 3 3 3 3 ...
##  $ num_dependents                               : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ own_telephone                                : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ residence_since                              : int  2 2 4 4 4 2 4 1 2 4 ...
##  $ savings_status                               : num  2 5 2 2 3 1 1 2 2 2 ...
##  $ sex                                          : num  1 0 1 0 0 1 1 0 1 1 ...
##  $ credit_history_critical/other existing credit: int  0 0 1 0 0 0 0 0 0 1 ...
##  $ credit_history_delayed previously            : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ credit_history_existing paid                 : int  1 0 0 0 1 1 1 1 1 0 ...
##  $ credit_history_no credits/all paid           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ housing_own                                  : int  1 1 1 1 1 1 0 1 0 1 ...
##  $ housing_rent                                 : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ other_parties_guarantor                      : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ other_parties_none                           : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ other_payment_plans_none                     : int  1 1 1 1 1 1 0 1 1 1 ...
##  $ other_payment_plans_stores                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ property_magnitude_life insurance            : int  0 1 0 0 0 0 0 0 1 0 ...
##  $ property_magnitude_no known property         : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ property_magnitude_real estate               : int  1 0 1 1 1 0 0 1 0 1 ...
##  $ purpose_domestic appliance                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ purpose_education                            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ purpose_furniture/equipment                  : int  1 1 0 0 0 0 0 0 1 0 ...
##  $ purpose_new car                              : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ purpose_other                                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ purpose_radio/tv                             : int  0 0 0 1 1 1 0 1 0 1 ...
##  $ purpose_repairs                              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ purpose_retraining                           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ purpose_used car                             : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ marital_status_div/sep                       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ marital_status_mar/wid                       : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ marital_status_single                        : int  1 0 1 0 0 1 1 0 0 1 ...

Splitting the dataset

Before performing Exploratory Data Analysis we divide our dataset into train and test sets, as shown below. Furthermore, we remove the ID column (the first column), since it will not be used in the further analysis.

set.seed(123456789)
training_obs <- createDataPartition(df$id, 
                                    p = 0.7, 
                                    list = FALSE) 
df.train <- df[training_obs,]
df.test  <- df[-training_obs,]

df <- df[, -1]
df.train <- df.train[, -1]
df.test <- df.test[, -1]
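
A quick sanity check of the resulting split sizes (a sketch; with p = 0.7 on 2000 observations we expect roughly 1400 training and 600 test rows):

nrow(df.train)   # roughly 1400
nrow(df.test)    # roughly 600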

Let’s check the dataframe

Before writing the dataframes, let's change the order of the columns so that the target variable (class) is in the first column.

# Move "class" from 3rd to the first position
df <- df[, c(3, 1:2, 4:ncol(df))]
df.train <- df.train[, c(3, 1:2, 4:ncol(df.train))]
df.test <- df.test[, c(3, 1:2, 4:ncol(df.test))]

Last check of the dataset before saving

Write dataframes

# remove unnecessary variables
rm(list = setdiff(ls(), c("df", "df.test", "df.train", "print_unique_values")))

# write df.train and df.test
write.csv(df.test, "./data/output/df.test.csv", row.names = FALSE)
write.csv(df.train, "./data/output/df.train.csv", row.names = FALSE)
write.csv(df, "./data/output/df.csv", row.names = FALSE)

Exploratory Data Analysis

The EDA will be performed on the training dataframe.

Divide variables according to their type

# define numeric variables
numeric_variables <- c("age",
                       "checking_status",
                       "credit_amount",
                       "duration",
                       "employment",
                       "existing_credits",
                       "installment_commitment",
                       "num_dependents",
                       "residence_since")

# define feat variables - the "mysterious" variables prepared by our lecturer (feat01 - feat10)
feat_variables <- sprintf("feat%02d", 1:10)

# define ordered_categorical
ordered_categorical = c("checking_status", "employment", "job", "savings_status")

# define boolean variables
boolean_variables <- setdiff(colnames(df.train), c(numeric_variables, feat_variables, ordered_categorical))

# check if the division was made properly
print_unique_values(df.train[,boolean_variables])
## class :  1, 0 
## foreign_worker :  1, 0 
## own_telephone :  0, 1 
## sex :  1, 0 
## credit_history_critical/other existing credit :  0, 1 
## credit_history_delayed previously :  0, 1 
## credit_history_existing paid :  1, 0 
## credit_history_no credits/all paid :  0, 1 
## housing_own :  1, 0 
## housing_rent :  0, 1 
## other_parties_guarantor :  0, 1 
## other_parties_none :  1, 0 
## other_payment_plans_none :  1, 0 
## other_payment_plans_stores :  0, 1 
## property_magnitude_life insurance :  0, 1 
## property_magnitude_no known property :  0, 1 
## property_magnitude_real estate :  1, 0 
## purpose_domestic appliance :  0, 1 
## purpose_education :  0, 1 
## purpose_furniture/equipment :  1, 0 
## purpose_new car :  0, 1 
## purpose_other :  0, 1 
## purpose_radio/tv :  0, 1 
## purpose_repairs :  0, 1 
## purpose_retraining :  0, 1 
## purpose_used car :  0, 1 
## marital_status_div/sep :  0, 1 
## marital_status_mar/wid :  0, 1 
## marital_status_single :  1, 0

Histograms of numeric variables

par(mfrow=c(3, 4))

for (var in numeric_variables) {
    hist(df.train[[var]], main=paste("Histogram of", var), xlab=var)
}

Histograms of feat variables

par(mfrow=c(3, 4))

for (var in feat_variables) {
    hist(df.train[[var]], main=paste("Histogram of", var), xlab=var)
}

Correlation matrix

For the numeric variables, the feat variables and the target variable.

library(corrplot)
num_var <- c(numeric_variables, feat_variables, "class")
numeric_df <- df.train[, num_var]
cor_matrix <- cor(numeric_df, use="complete.obs")
corrplot(cor_matrix, method="circle")

Barplots of categorical variables

# For ordered categorical variables
par(mfrow=c(2, 2))
for (var in ordered_categorical) {
  counts <- table(df.train[[var]])
  barplot(counts, main=paste("Bar Plot of", var), xlab=var, ylab="Count")
}

Boolean variables

For the boolean variables we create a table with the distribution (in %) of each variable.

# Creating a summary table for boolean variables with percentages
boolean_summary <- sapply(df.train[, boolean_variables], function(x) {
  tab <- table(factor(x, levels = c(0, 1)))
  pct <- 100 * tab / sum(tab) # Convert counts to percentage
  return(pct)
})

boolean_summary <- t(boolean_summary) # Transposing to have variables as rows and percentage columns

# Formatting the table for better readability
boolean_summary <- round(boolean_summary, 1) # Round to one decimal place

# Printing the summary table
print(boolean_summary)
##                                                  0    1
## class                                         30.5 69.5
## foreign_worker                                 4.0 96.0
## own_telephone                                 59.3 40.7
## sex                                           32.6 67.4
## credit_history_critical/other existing credit 70.1 29.9
## credit_history_delayed previously             91.6  8.4
## credit_history_existing paid                  47.5 52.5
## credit_history_no credits/all paid            95.8  4.2
## housing_own                                   27.7 72.3
## housing_rent                                  82.7 17.3
## other_parties_guarantor                       95.0  5.0
## other_parties_none                             9.1 90.9
## other_payment_plans_none                      19.1 80.9
## other_payment_plans_stores                    94.4  5.6
## property_magnitude_life insurance             77.4 22.6
## property_magnitude_no known property          84.3 15.7
## property_magnitude_real estate                71.4 28.6
## purpose_domestic appliance                    98.6  1.4
## purpose_education                             95.2  4.8
## purpose_furniture/equipment                   82.6 17.4
## purpose_new car                               77.0 23.0
## purpose_other                                 98.5  1.5
## purpose_radio/tv                              70.9 29.1
## purpose_repairs                               97.6  2.4
## purpose_retraining                            99.0  1.0
## purpose_used car                              90.9  9.1
## marital_status_div/sep                        94.6  5.4
## marital_status_mar/wid                        90.9  9.1
## marital_status_single                         47.1 52.9

The distribution of target variable

As we can see below, the train and test dataframes are split proportionally; however, the class distribution is imbalanced - there are more than twice as many observations with class "1" (good) as with class "0" (bad).

# Target variable distribution in df:
round(table(df$class)/length(df$class), 3)
## 
##     0     1 
## 0.309 0.691
# Target variable distribution in df.train:
round(table(df.train$class)/length(df.train$class), 3)
## 
##     0     1 
## 0.305 0.695
# Target variable distribution in df.test:
round(table(df.test$class)/length(df.test$class), 3)
## 
##     0     1 
## 0.318 0.682

Feature selection with RFE

Intro to RFE

Recursive Feature Elimination (RFE) is a feature selection method used to identify and select a subset of relevant features for model building. The method works by recursively removing features, building a model with the remaining features, and calculating model accuracy. The least important features are removed in each iteration until the specified number of features is reached.

In this section of the project, we employ RFE to optimize the feature set for predicting the class variable in our dataset. The process is outlined as follows.

First of all we clear the environment, load the train and test dataframes, and load the new libraries.

rm(list=ls())
library(mlbench)
library(caret)
library(recipes)
library(doParallel)
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")

# rfe requires factor as target variable, so we modify class (1st column) - only on df.train dataframe as of now
df.train[, 1] <- as.factor(df.train[,1])

We detect the number of CPU cores available on the current machine and assign it to 'cores'. Then we register the detected number of cores for parallel processing using the doParallel package. This enables parallel execution of the code, which improves performance for operations that can be run concurrently.

cores=detectCores()
registerDoParallel(cores=cores)

Define seeds for rfeControl

If we use the doParallel package and want to make our code reproducible, we should pass the seeds argument to the rfeControl function. It requires a list in which each element holds the seeds for a specific resampling iteration, and the last element is a single seed for the final model.

Assuming 10-fold CV, we need a list of 11 elements: one per fold (10 in total) and one for the final model.

set.seed(123456789)
numFolds <- 10
seedsList <- lapply(1:numFolds, function(x) sample.int(1000, 49)) # Generate random seeds for each fold

set.seed(123456789)
seedsList[[length(seedsList) + 1]] <- sample.int(1000, 1) # Add one more seed for the overall process
# Define rfeControl object
control <- rfeControl(functions=rfFuncs, method="cv", number=10, seeds = seedsList)

Perform feature selection with rfe

set.seed(123456789)
results <- rfe(df.train[, seq(2, length(df.train), 1)],             # predictor variables 
               df.train[, 1],                                       # target variable for prediction
               sizes=c(1:49),                                       # the subset sizes to evaluate
               rfeControl=control)

Based on the RFE results, 35 variables were chosen as the best predictors. They are listed below.

predictors(results)
##  [1] "checking_status"                              
##  [2] "feat02"                                       
##  [3] "duration"                                     
##  [4] "credit_amount"                                
##  [5] "feat01"                                       
##  [6] "feat10"                                       
##  [7] "credit_history_critical.other.existing.credit"
##  [8] "age"                                          
##  [9] "savings_status"                               
## [10] "employment"                                   
## [11] "property_magnitude_real.estate"               
## [12] "other_payment_plans_none"                     
## [13] "installment_commitment"                       
## [14] "property_magnitude_no.known.property"         
## [15] "credit_history_no.credits.all.paid"           
## [16] "housing_own"                                  
## [17] "credit_history_existing.paid"                 
## [18] "purpose_education"                            
## [19] "purpose_radio.tv"                             
## [20] "residence_since"                              
## [21] "existing_credits"                             
## [22] "other_payment_plans_stores"                   
## [23] "purpose_new.car"                              
## [24] "own_telephone"                                
## [25] "other_parties_guarantor"                      
## [26] "job"                                          
## [27] "credit_history_delayed.previously"            
## [28] "housing_rent"                                 
## [29] "other_parties_none"                           
## [30] "purpose_used.car"                             
## [31] "sex"                                          
## [32] "marital_status_single"                        
## [33] "marital_status_div.sep"                       
## [34] "num_dependents"                               
## [35] "marital_status_mar.wid"

Below we can see the plot of the accuracy results versus the number of variables included. Interestingly, using more than 35 predictors not only does not increase accuracy but actually worsens it.

plot(results, type=c("g", "o"))

Assign the final columns to a vector and write them to a .csv file.

rfe_columns <- predictors(results)
write.csv(rfe_columns, "./data/output/rfe_columns.csv")

XGB - one by one

In this section we apply the approach from classes, adjusted to our case. We build the XGB model by tuning hyperparameters one by one, in the following order:

  • nrounds,
  • max_depth and min_child_weight,
  • colsample_bytree,
  • subsample,
  • change learning rate and number of trees,
  • change learning rate and number of trees again.

Let's start by loading the required packages, cleaning the environment and loading the data.

library(dplyr)
library(rpart)
library(rpart.plot)
library(xgboost)
library(plyr)
library(doParallel)
library(pROC)
library(ggplot2)
library(caret)

cores=detectCores()
registerDoParallel(cores=cores)
rm(list=ls())

df <- read.csv("./data/output/df.csv")
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")

XGB (via caret) requires the target variable to be a factor, so let's convert it.

df$class <- factor(df$class, levels = c(0, 1), labels = c("bad", "good"))
df.train$class <- factor(df.train$class, levels = c(0, 1), labels = c("bad", "good"))
df.test$class <- factor(df.test$class, levels = c(0, 1), labels = c("bad", "good"))

Take columns chosen by RFE

# Read columns chosen by rfe
rfe_columns <- read.csv("./data/output/rfe_columns.csv")

# Add "class" to the list of columns chosen for analysis
rfe_columns <- c("class", rfe_columns$x)

# Modify dfs to have columns chosen by rfe
df.train <- df.train[, rfe_columns]
df.test <- df.test[, rfe_columns]

Calculate initial parameters based on recommendations provided during the lectures.

# colsample_bytree - rule of thumb: sqrt(number of predictors) / number of predictors
sqrt(35)/35  # 0.1690309
## [1] 0.1690309
# min_child_weight - rule of thumb: 0.5% ... 1% of the number of observations
0.005 * 2000  # 10
## [1] 10
0.01 * 2000   # 20
## [1] 20

Hyperparameter tuning

Nrounds

set.seed(12345678)
xgb_grid <- expand.grid(nrounds = seq(20, 80, 10),
                        max_depth = c(8),
                        eta = c(0.25),
                        gamma = 1,
                        colsample_bytree = c(0.17),
                        min_child_weight = c(10),
                        subsample = 0.8)

train_control <- trainControl(method = "cv", 
                              number = 5,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

xgb_model.1 <- train(class ~ .,
                   data = df.train,
                   method = "xgbTree",
                   trControl = train_control,
                   tuneGrid  = xgb_grid)
xgb_model.1
## eXtreme Gradient Boosting 
## 
## 1400 samples
##   35 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121 
## Resampling results across tuning parameters:
## 
##   nrounds  ROC        Sens       Spec     
##   20       0.8467558  0.4005472  0.9547713
##   30       0.8542388  0.4870588  0.9434629
##   40       0.8592276  0.5268947  0.9259847
##   50       0.8588507  0.5455814  0.9280307
##   60       0.8559132  0.5433105  0.9198150
##   70       0.8580116  0.5503967  0.9167169
##   80       0.8582144  0.5668126  0.9146497
## 
## Tuning parameter 'max_depth' was held constant at a value of 8
## Tuning parameter 'eta' was held constant at a value of 0.25
## Tuning parameter 'gamma' was held constant at a value of 1
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.17
## Tuning parameter 'min_child_weight' was held constant at a value of 10
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 40, max_depth = 8,
##  eta = 0.25, gamma = 1, colsample_bytree = 0.17, min_child_weight = 10
##  and subsample = 0.8.
xgb_model.1 %>% saveRDS("./data/output/xgb/xgb_model.1.rds")

The best result is for nrounds = 40. Let's fix this parameter and tune the others.
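
The selected combination can also be read programmatically from the fitted caret object, for example:

# best tuning parameters found during cross-validation
xgb_model.1$bestTune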

Max_depth and min_child_weight

xgb_grid <- expand.grid(nrounds = 40,
                        max_depth = c(5, 15, 1),
                        eta = c(0.25),
                        gamma = 1,
                        colsample_bytree = c(0.17),
                        min_child_weight = seq(10, 200, 2),
                        subsample = 0.8)
set.seed(12345678)
xgb_model.2 <- train(class ~ .,
                     data = df.train,
                     method = "xgbTree",
                     trControl = train_control,
                     tuneGrid  = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.2
## eXtreme Gradient Boosting 
## 
## 1400 samples
##   35 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121 
## Resampling results across tuning parameters:
## 
##   max_depth  min_child_weight  ROC        Sens       
##    1          10               0.8344563  0.395950752
##    1          12               0.8358665  0.407523940
##    1          14               0.8358767  0.400328317
##    1          16               0.8358507  0.407359781
##    1          18               0.8359002  0.412120383
##    1          20               0.8326024  0.393378933
##    1          22               0.8270847  0.395677155
##    1          24               0.8320066  0.423748290
##    1          26               0.8319196  0.411983584
##    1          28               0.8264682  0.414418605
##    1          30               0.8242036  0.391053352
##    1          32               0.8281184  0.416908345
##    1          34               0.8308021  0.416689466
##    1          36               0.8223432  0.391053352
##    1          38               0.8229040  0.376853625
##    1          40               0.8227221  0.395567715
##    1          42               0.8272814  0.402708618
##    1          44               0.8156231  0.374664843
##    1          46               0.8218393  0.383967168
##    1          48               0.8168432  0.369986320
##    1          50               0.8176469  0.348919289
##    1          52               0.8059365  0.341860465
##    1          54               0.8008042  0.355978112
##    1          56               0.8001785  0.341833105
##    1          58               0.8036833  0.342024624
##    1          60               0.7879441  0.288016416
##    1          62               0.7829489  0.294938440
##    1          64               0.7770599  0.287906977
##    1          66               0.7781812  0.299671683
##    1          68               0.7737653  0.295020520
##    1          70               0.7705386  0.250588235
##    1          72               0.7621448  0.266894665
##    1          74               0.7560739  0.290369357
##    1          76               0.7481793  0.266648427
##    1          78               0.7404916  0.227004104
##    1          80               0.7453078  0.192147743
##    1          82               0.7314558  0.208290014
##    1          84               0.7236018  0.196580027
##    1          86               0.7183731  0.156908345
##    1          88               0.7196452  0.168673051
##    1          90               0.7066056  0.060984952
##    1          92               0.6820901  0.039945280
##    1          94               0.6947117  0.007058824
##    1          96               0.6875540  0.000000000
##    1          98               0.6712238  0.000000000
##    1         100               0.6526735  0.000000000
##    1         102               0.6195525  0.000000000
##    1         104               0.6237589  0.000000000
##    1         106               0.5758576  0.000000000
##    1         108               0.5746589  0.000000000
##    1         110               0.6094594  0.000000000
##    1         112               0.5326217  0.000000000
##    1         114               0.5000000  0.000000000
##    1         116               0.5000000  0.000000000
##    1         118               0.5000000  0.000000000
##    1         120               0.5000000  0.000000000
##    1         122               0.5000000  0.000000000
##    1         124               0.5000000  0.000000000
##    1         126               0.5000000  0.000000000
##    1         128               0.5000000  0.000000000
##    1         130               0.5000000  0.000000000
##    1         132               0.5000000  0.000000000
##    1         134               0.5000000  0.000000000
##    1         136               0.5000000  0.000000000
##    1         138               0.5000000  0.000000000
##    1         140               0.5000000  0.000000000
##    1         142               0.5000000  0.000000000
##    1         144               0.5000000  0.000000000
##    1         146               0.5000000  0.000000000
##    1         148               0.5000000  0.000000000
##    1         150               0.5000000  0.000000000
##    1         152               0.5000000  0.000000000
##    1         154               0.5000000  0.000000000
##    1         156               0.5000000  0.000000000
##    1         158               0.5000000  0.000000000
##    1         160               0.5000000  0.000000000
##    1         162               0.5000000  0.000000000
##    1         164               0.5000000  0.000000000
##    1         166               0.5000000  0.000000000
##    1         168               0.5000000  0.000000000
##    1         170               0.5000000  0.000000000
##    1         172               0.5000000  0.000000000
##    1         174               0.5000000  0.000000000
##    1         176               0.5000000  0.000000000
##    1         178               0.5000000  0.000000000
##    1         180               0.5000000  0.000000000
##    1         182               0.5000000  0.000000000
##    1         184               0.5000000  0.000000000
##    1         186               0.5000000  0.000000000
##    1         188               0.5000000  0.000000000
##    1         190               0.5000000  0.000000000
##    1         192               0.5000000  0.000000000
##    1         194               0.5000000  0.000000000
##    1         196               0.5000000  0.000000000
##    1         198               0.5000000  0.000000000
##    1         200               0.5000000  0.000000000
##    5          10               0.8565765  0.540930233
##    5          12               0.8612264  0.541039672
##    5          14               0.8604411  0.524596443
##    5          16               0.8491117  0.510478796
##    5          18               0.8438764  0.484815321
##    5          20               0.8455765  0.505800274
##    5          22               0.8318509  0.480136799
##    5          24               0.8430497  0.494035568
##    5          26               0.8428770  0.494008208
##    5          28               0.8414709  0.491573187
##    5          30               0.8309704  0.461203830
##    5          32               0.8341771  0.458905609
##    5          34               0.8299657  0.447168263
##    5          36               0.8374388  0.461258550
##    5          38               0.8368526  0.477592339
##    5          40               0.8267579  0.430615595
##    5          42               0.8289070  0.454309166
##    5          44               0.8334869  0.458987688
##    5          46               0.8254475  0.435458276
##    5          48               0.8301018  0.400492476
##    5          50               0.8249659  0.407606019
##    5          52               0.8183259  0.400410397
##    5          54               0.8136487  0.379480164
##    5          56               0.8067035  0.330259918
##    5          58               0.8024038  0.360601915
##    5          60               0.7915911  0.332585499
##    5          62               0.7928539  0.351135431
##    5          64               0.7756334  0.301997264
##    5          66               0.7872167  0.313761970
##    5          68               0.7687912  0.262325581
##    5          70               0.7782777  0.285745554
##    5          72               0.7504151  0.255239398
##    5          74               0.7443232  0.266867305
##    5          76               0.7428209  0.236251710
##    5          78               0.7332966  0.246073871
##    5          80               0.7319336  0.210943912
##    5          82               0.7286488  0.234227086
##    5          84               0.7307047  0.185225718
##    5          86               0.7199071  0.192120383
##    5          88               0.7145815  0.215458276
##    5          90               0.6919750  0.100629275
##    5          92               0.7048677  0.039917921
##    5          94               0.6907750  0.000000000
##    5          96               0.6863598  0.000000000
##    5          98               0.6716951  0.000000000
##    5         100               0.6343764  0.000000000
##    5         102               0.6741921  0.000000000
##    5         104               0.5931451  0.000000000
##    5         106               0.6000353  0.000000000
##    5         108               0.5645613  0.000000000
##    5         110               0.5675017  0.000000000
##    5         112               0.5004293  0.000000000
##    5         114               0.5000000  0.000000000
##    5         116               0.5000000  0.000000000
##    5         118               0.5000000  0.000000000
##    5         120               0.5000000  0.000000000
##    5         122               0.5000000  0.000000000
##    5         124               0.5000000  0.000000000
##    5         126               0.5000000  0.000000000
##    5         128               0.5000000  0.000000000
##    5         130               0.5000000  0.000000000
##    5         132               0.5000000  0.000000000
##    5         134               0.5000000  0.000000000
##    5         136               0.5000000  0.000000000
##    5         138               0.5000000  0.000000000
##    5         140               0.5000000  0.000000000
##    5         142               0.5000000  0.000000000
##    5         144               0.5000000  0.000000000
##    5         146               0.5000000  0.000000000
##    5         148               0.5000000  0.000000000
##    5         150               0.5000000  0.000000000
##    5         152               0.5000000  0.000000000
##    5         154               0.5000000  0.000000000
##    5         156               0.5000000  0.000000000
##    5         158               0.5000000  0.000000000
##    5         160               0.5000000  0.000000000
##    5         162               0.5000000  0.000000000
##    5         164               0.5000000  0.000000000
##    5         166               0.5000000  0.000000000
##    5         168               0.5000000  0.000000000
##    5         170               0.5000000  0.000000000
##    5         172               0.5000000  0.000000000
##    5         174               0.5000000  0.000000000
##    5         176               0.5000000  0.000000000
##    5         178               0.5000000  0.000000000
##    5         180               0.5000000  0.000000000
##    5         182               0.5000000  0.000000000
##    5         184               0.5000000  0.000000000
##    5         186               0.5000000  0.000000000
##    5         188               0.5000000  0.000000000
##    5         190               0.5000000  0.000000000
##    5         192               0.5000000  0.000000000
##    5         194               0.5000000  0.000000000
##    5         196               0.5000000  0.000000000
##    5         198               0.5000000  0.000000000
##    5         200               0.5000000  0.000000000
##   15          10               0.8561711  0.526894665
##   15          12               0.8549140  0.552640219
##   15          14               0.8508082  0.508180575
##   15          16               0.8563379  0.519890561
##   15          18               0.8439466  0.491792066
##   15          20               0.8442873  0.517428181
##   15          22               0.8427940  0.489329685
##   15          24               0.8455911  0.456798906
##   15          26               0.8432056  0.477674419
##   15          28               0.8366147  0.484870041
##   15          30               0.8424677  0.498604651
##   15          32               0.8418105  0.489439124
##   15          34               0.8289386  0.466155951
##   15          36               0.8255641  0.463611491
##   15          38               0.8324604  0.458741450
##   15          40               0.8227214  0.423885089
##   15          42               0.8271911  0.447250342
##   15          44               0.8273940  0.456634747
##   15          46               0.8267264  0.442435021
##   15          48               0.8194536  0.412010944
##   15          50               0.8188501  0.423857729
##   15          52               0.8034414  0.372229822
##   15          54               0.8014843  0.353570451
##   15          56               0.8089223  0.376963064
##   15          58               0.8000047  0.332503420
##   15          60               0.7942000  0.334829001
##   15          62               0.7854083  0.276224350
##   15          64               0.7830316  0.304459644
##   15          66               0.7831815  0.295102599
##   15          68               0.7702310  0.285690834
##   15          70               0.7537844  0.283283174
##   15          72               0.7677682  0.255376197
##   15          74               0.7544088  0.234227086
##   15          76               0.7487642  0.261969904
##   15          78               0.7547113  0.250396717
##   15          80               0.7397075  0.229493844
##   15          82               0.7420136  0.229658003
##   15          84               0.7191190  0.201285910
##   15          86               0.7176485  0.194555404
##   15          88               0.7172817  0.192038304
##   15          90               0.7082077  0.140465116
##   15          92               0.6948009  0.042134063
##   15          94               0.7033914  0.002352941
##   15          96               0.6796861  0.000000000
##   15          98               0.6681156  0.000000000
##   15         100               0.6775902  0.000000000
##   15         102               0.6146169  0.000000000
##   15         104               0.6214176  0.000000000
##   15         106               0.5975022  0.000000000
##   15         108               0.5858185  0.000000000
##   15         110               0.5723162  0.000000000
##   15         112               0.5397587  0.000000000
##   15         114               0.5228602  0.000000000
##   15         116               0.5000000  0.000000000
##   15         118               0.5000000  0.000000000
##   15         120               0.5000000  0.000000000
##   15         122               0.5000000  0.000000000
##   15         124               0.5000000  0.000000000
##   15         126               0.5000000  0.000000000
##   15         128               0.5000000  0.000000000
##   15         130               0.5000000  0.000000000
##   15         132               0.5000000  0.000000000
##   15         134               0.5000000  0.000000000
##   15         136               0.5000000  0.000000000
##   15         138               0.5000000  0.000000000
##   15         140               0.5000000  0.000000000
##   15         142               0.5000000  0.000000000
##   15         144               0.5000000  0.000000000
##   15         146               0.5000000  0.000000000
##   15         148               0.5000000  0.000000000
##   15         150               0.5000000  0.000000000
##   15         152               0.5000000  0.000000000
##   15         154               0.5000000  0.000000000
##   15         156               0.5000000  0.000000000
##   15         158               0.5000000  0.000000000
##   15         160               0.5000000  0.000000000
##   15         162               0.5000000  0.000000000
##   15         164               0.5000000  0.000000000
##   15         166               0.5000000  0.000000000
##   15         168               0.5000000  0.000000000
##   15         170               0.5000000  0.000000000
##   15         172               0.5000000  0.000000000
##   15         174               0.5000000  0.000000000
##   15         176               0.5000000  0.000000000
##   15         178               0.5000000  0.000000000
##   15         180               0.5000000  0.000000000
##   15         182               0.5000000  0.000000000
##   15         184               0.5000000  0.000000000
##   15         186               0.5000000  0.000000000
##   15         188               0.5000000  0.000000000
##   15         190               0.5000000  0.000000000
##   15         192               0.5000000  0.000000000
##   15         194               0.5000000  0.000000000
##   15         196               0.5000000  0.000000000
##   15         198               0.5000000  0.000000000
##   15         200               0.5000000  0.000000000
##   Spec     
##   0.9445044
##   0.9414063
##   0.9444885
##   0.9362411
##   0.9362675
##   0.9475601
##   0.9352313
##   0.9403965
##   0.9331853
##   0.9424531
##   0.9341792
##   0.9352419
##   0.9393867
##   0.9311393
##   0.9331747
##   0.9383241
##   0.9414327
##   0.9434787
##   0.9383188
##   0.9352524
##   0.9393550
##   0.9434523
##   0.9465451
##   0.9444885
##   0.9506794
##   0.9445096
##   0.9465662
##   0.9496167
##   0.9342585
##   0.9414274
##   0.9557917
##   0.9517103
##   0.9465662
##   0.9465556
##   0.9548084
##   0.9619773
##   0.9516997
##   0.9599366
##   0.9537616
##   0.9568543
##   0.9815015
##   0.9886968
##   0.9979487
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   0.9198202
##   0.9301084
##   0.9321755
##   0.9197991
##   0.9208776
##   0.9249696
##   0.9167539
##   0.9146868
##   0.9249485
##   0.9239387
##   0.9126355
##   0.9259847
##   0.9280412
##   0.9342004
##   0.9259952
##   0.9208459
##   0.9321279
##   0.9300872
##   0.9280465
##   0.9393656
##   0.9342162
##   0.9403807
##   0.9527359
##   0.9568385
##   0.9342321
##   0.9424372
##   0.9372773
##   0.9475813
##   0.9506688
##   0.9516733
##   0.9496484
##   0.9404229
##   0.9414274
##   0.9609622
##   0.9476183
##   0.9506846
##   0.9527306
##   0.9496801
##   0.9660693
##   0.9547925
##   0.9732805
##   0.9886915
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   0.9229130
##   0.9229289
##   0.9146762
##   0.9208512
##   0.9146709
##   0.9187999
##   0.9188052
##   0.9249643
##   0.9270209
##   0.9198150
##   0.9249485
##   0.9270262
##   0.9280465
##   0.9311182
##   0.9249379
##   0.9311393
##   0.9311182
##   0.9352313
##   0.9321597
##   0.9373090
##   0.9280518
##   0.9352419
##   0.9311182
##   0.9455141
##   0.9527148
##   0.9465556
##   0.9331959
##   0.9527095
##   0.9547766
##   0.9373037
##   0.9424478
##   0.9588951
##   0.9486016
##   0.9434840
##   0.9455300
##   0.9507005
##   0.9496484
##   0.9517526
##   0.9475760
##   0.9578588
##   0.9681364
##   0.9897066
##   0.9989744
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
##   1.0000000
## 
## Tuning parameter 'nrounds' was held constant at a value of 40
## Tuning parameter 'eta' was held constant at a value of 0.25
## Tuning parameter 'gamma' was held constant at a value of 1
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.17
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 40, max_depth = 5,
##  eta = 0.25, gamma = 1, colsample_bytree = 0.17, min_child_weight = 12
##  and subsample = 0.8.
xgb_model.2 %>% saveRDS("./data/output/xgb/xgb_model.2.rds")

The best results are for max_depth = 5 and min_child_weight = 12. Let’s fix these parameters and tune the others.

Colsample_bytree

xgb_grid <- expand.grid(nrounds = 40,
                        max_depth = 5,
                        eta = c(0.25),
                        gamma = 1,
                        colsample_bytree = seq(0.05, 0.7, 0.01),
                        min_child_weight = 12,
                        subsample = 0.8)

set.seed(12345678)
xgb_model.3 <- train(class ~ .,
                     data = df.train,
                     method = "xgbTree",
                     trControl = train_control,
                     tuneGrid  = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.3
## eXtreme Gradient Boosting 
## 
## 1400 samples
##   35 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121 
## Resampling results across tuning parameters:
## 
##   colsample_bytree  ROC        Sens       Spec     
##   0.05              0.8039593  0.2272777  0.9732805
##   0.06              0.8123201  0.3112449  0.9506794
##   0.07              0.8321331  0.3583311  0.9496326
##   0.08              0.8235153  0.3559781  0.9434946
##   0.09              0.8514210  0.4615869  0.9393762
##   0.10              0.8522279  0.4497674  0.9403859
##   0.11              0.8518078  0.5034200  0.9321597
##   0.12              0.8529059  0.4893570  0.9342321
##   0.13              0.8584172  0.5010944  0.9342268
##   0.14              0.8472329  0.5011491  0.9260058
##   0.15              0.8488754  0.5035568  0.9229025
##   0.16              0.8467481  0.5315458  0.9187840
##   0.17              0.8576850  0.5105062  0.9280730
##   0.18              0.8572149  0.5410397  0.9259794
##   0.19              0.8515742  0.5479891  0.9146973
##   0.20              0.8574852  0.5454993  0.9249326
##   0.21              0.8610720  0.5549658  0.9064499
##   0.22              0.8555261  0.5409576  0.9157177
##   0.23              0.8626182  0.5574829  0.9228760
##   0.24              0.8613935  0.5621341  0.9136453
##   0.25              0.8544855  0.5713817  0.9136611
##   0.26              0.8606732  0.5433105  0.9146868
##   0.27              0.8571888  0.5668126  0.9157177
##   0.28              0.8559451  0.5760876  0.9064552
##   0.29              0.8590332  0.5690834  0.9167063
##   0.30              0.8607166  0.5620520  0.9116204
##   0.31              0.8583121  0.5783858  0.9054137
##   0.32              0.8680037  0.5690014  0.9126408
##   0.33              0.8624078  0.5783584  0.9126302
##   0.34              0.8623350  0.5783858  0.9044251
##   0.35              0.8630722  0.5901778  0.9074914
##   0.36              0.8652446  0.5877702  0.9044039
##   0.37              0.8645034  0.5854446  0.9023315
##   0.38              0.8611421  0.5806840  0.9156860
##   0.39              0.8620319  0.5666758  0.9074544
##   0.40              0.8646351  0.5924487  0.9033888
##   0.41              0.8634601  0.5737620  0.9095163
##   0.42              0.8594687  0.5948837  0.9023526
##   0.43              0.8668948  0.5691108  0.9136664
##   0.44              0.8637347  0.5901778  0.9126460
##   0.45              0.8629746  0.5970725  0.8940999
##   0.46              0.8677471  0.5762517  0.9126249
##   0.47              0.8669960  0.5854720  0.9023368
##   0.48              0.8676400  0.5948290  0.9003119
##   0.49              0.8625358  0.5761149  0.9043828
##   0.50              0.8683363  0.6041587  0.9146868
##   0.51              0.8639989  0.5714364  0.9105736
##   0.52              0.8676573  0.5924761  0.9002802
##   0.53              0.8610158  0.5830917  0.8992598
##   0.54              0.8690980  0.5900410  0.9074755
##   0.55              0.8687565  0.5855540  0.9074755
##   0.56              0.8646249  0.6181943  0.9002802
##   0.57              0.8679794  0.5832011  0.9054296
##   0.58              0.8674274  0.5948564  0.9002855
##   0.59              0.8695200  0.5901778  0.9095533
##   0.60              0.8659532  0.5925581  0.9043933
##   0.61              0.8651930  0.5877428  0.9023473
##   0.62              0.8711825  0.5855540  0.9105630
##   0.63              0.8666612  0.6111902  0.8900132
##   0.64              0.8611872  0.5877702  0.8972297
##   0.65              0.8768339  0.5994528  0.9085223
##   0.66              0.8639603  0.5737346  0.8941211
##   0.67              0.8654696  0.5619973  0.9023421
##   0.68              0.8657598  0.5925034  0.9054137
##   0.69              0.8692144  0.6017784  0.8982554
##   0.70              0.8646742  0.5877155  0.9105578
## 
## Tuning parameter 'nrounds' was held constant at a value of 40
## Tuning parameter 'min_child_weight' was held constant at a value of 12
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 40,
##  colsample_bytree = 0.65, min_child_weight = 12 and subsample = 0.8.
xgb_model.3 %>% saveRDS("./data/output/xgb/xgb_model.3.rds")

The best result is for colsample_bytree = 0.65. Let’s fix this parameter and tune the others.

Subsample

xgb_grid <- expand.grid(nrounds = 40,
                        max_depth = 5,
                        eta = c(0.25),
                        gamma = 1,
                        colsample_bytree = 0.65,
                        min_child_weight = 12,
                        subsample = seq(0.6, 0.9, 0.05))

set.seed(12345678)
xgb_model.4 <- train(class ~ .,
                     data = df.train,
                     method = "xgbTree",
                     trControl = train_control,
                     tuneGrid  = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.4
## eXtreme Gradient Boosting 
## 
## 1400 samples
##   35 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121 
## Resampling results across tuning parameters:
## 
##   subsample  ROC        Sens       Spec     
##   0.60       0.8531228  0.5690834  0.8951361
##   0.65       0.8576970  0.6088372  0.8951308
##   0.70       0.8564885  0.5785226  0.9044039
##   0.75       0.8655747  0.5830096  0.9074808
##   0.80       0.8654781  0.5901778  0.9043933
##   0.85       0.8698701  0.6112175  0.9002855
##   0.90       0.8689630  0.5899863  0.8961935
## 
## Tuning parameter 'nrounds' was held constant at a value of 40
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.65
## Tuning parameter 'min_child_weight' was held constant at a value of 12
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 40,
##  colsample_bytree = 0.65, min_child_weight = 12 and subsample = 0.85.
xgb_model.4 %>% saveRDS("./data/output/xgb/xgb_model.4.rds")

The best result is for subsample = 0.85. Let’s fix this parameter and tune the others.

Learning rate and number of trees

Let’s double the number of trees and roughly halve the learning rate (eta from 0.25 to 0.12).

# Change learning rate and number of trees
xgb_grid <- expand.grid(nrounds = 80,
                        max_depth = 5,
                        eta = c(0.12),
                        gamma = 1,
                        colsample_bytree = 0.65,
                        min_child_weight = 12,
                        subsample = 0.85)

set.seed(12345678)
xgb_model.5 <- train(class ~ .,
                     data = df.train,
                     method = "xgbTree",
                     trControl = train_control,
                     tuneGrid  = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.5
## eXtreme Gradient Boosting 
## 
## 1400 samples
##   35 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121 
## Resampling results:
## 
##   ROC        Sens       Spec    
##   0.8749795  0.5902599  0.911594
## 
## Tuning parameter 'nrounds' was held constant at a value of 80
## Tuning parameter 'min_child_weight' was held constant at a value of 12
## Tuning parameter 'subsample' was held constant at a value of 0.85
xgb_model.5 %>% saveRDS("./data/output/xgb/xgb_model.5.rds")

Learning rate and number of trees

Let’s double the number of trees and halve the learning rate once more (eta from 0.12 to 0.06).

# Change learning rate and number of trees
xgb_grid <- expand.grid(nrounds = 160,
                        max_depth = 5,
                        eta = c(0.06),
                        gamma = 1,
                        colsample_bytree = 0.65,
                        min_child_weight = 12,
                        subsample = 0.85)

set.seed(12345678)
xgb_model.6 <- train(class ~ .,
                     data = df.train,
                     method = "xgbTree",
                     trControl = train_control,
                     tuneGrid  = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.6
## eXtreme Gradient Boosting 
## 
## 1400 samples
##   35 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121 
## Resampling results:
## 
##   ROC       Sens       Spec     
##   0.876145  0.6041313  0.9064605
## 
## Tuning parameter 'nrounds' was held constant at a value of 160
## Tuning parameter 'min_child_weight' was held constant at a value of 12
## Tuning parameter 'subsample' was held constant at a value of 0.85

Compare models on the test data frame

# Load function to take accuracy and gini
source("./scripts/getAccuracyAndGini.R")

models <- 1:6

# Apply the function to every model and build table
sapply(paste0("xgb_model.", models),
       function(x) getAccuracyAndGini(model = get(x),
                                      data = df.test,
                                      target_variable = "class",
                                      predicted_class = "good")
)
##             xgb_model.1 xgb_model.2 xgb_model.3 xgb_model.4
## Accuracy      0.8116667   0.8016667   0.8033333   0.8083333
## Sensitivity   0.9217604   0.9242054   0.9046455   0.9095355
## Specificity   0.5759162   0.5392670   0.5863874   0.5916230
## Gini          0.7613385   0.7545796   0.7537347   0.7640523
##             xgb_model.5 xgb_model.6
## Accuracy      0.8116667   0.8083333
## Sensitivity   0.9070905   0.9070905
## Specificity   0.6073298   0.5968586
## Gini          0.7606216   0.7695055
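
The helper getAccuracyAndGini is defined in a separate script and not shown here. A minimal sketch of what such a helper might look like, assuming predictions from a caret model and the pROC package (the actual implementation may differ):

# Hypothetical sketch of the helper sourced above - not the original implementation
getAccuracyAndGini_sketch <- function(model, data, target_variable, predicted_class) {
  pred_class <- predict(model, newdata = data)
  pred_prob  <- predict(model, newdata = data, type = "prob")[, predicted_class]
  actual     <- data[[target_variable]]

  cm  <- caret::confusionMatrix(pred_class, actual, positive = predicted_class)
  auc <- as.numeric(pROC::auc(pROC::roc(actual, pred_prob)))

  c(Accuracy    = unname(cm$overall["Accuracy"]),
    Sensitivity = unname(cm$byClass["Sensitivity"]),
    Specificity = unname(cm$byClass["Specificity"]),
    Gini        = 2 * auc - 1)
}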

xgb_model.6 reaches the highest Gini on the test set (and xgb_model.5 the highest accuracy), so we choose xgb_model.6 as the final XGBoost model.
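
Since all six models were trained with the same seed and the same trControl folds, another way to compare them is on their cross-validation resamples (a sketch, assuming the fitted models are still in memory):

# Hypothetical sketch - compare cross-validated ROC across the tuned xgb models
resamps <- resamples(list(xgb1 = xgb_model.1, xgb2 = xgb_model.2,
                          xgb3 = xgb_model.3, xgb4 = xgb_model.4,
                          xgb5 = xgb_model.5, xgb6 = xgb_model.6))
summary(resamps)
bwplot(resamps, metric = "ROC")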

Calculate ROC

ROC.train <- pROC::roc(df.train$class, 
                       predict(xgb_model.6,
                               df.train, type = "prob")[, "good"])

ROC.test  <- pROC::roc(df.test$class, 
                       predict(xgb_model.6,
                               df.test, type = "prob")[, "good"])

cat("AUC for train = ", pROC::auc(ROC.train), 
    ", Gini for train = ", 2 * pROC::auc(ROC.train) - 1, "\n", sep = "")
## AUC for train = 0.960558, Gini for train = 0.921116
cat("AUC for test = ", pROC::auc(ROC.test), 
    ", Gini for test = ", 2 * pROC::auc(ROC.test) - 1, "\n", sep = "")
## AUC for test = 0.8847527, Gini for test = 0.7695055

Chart of xgb ROC

list(
  ROC.train   = ROC.train,
  ROC.test    = ROC.test
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini TRAIN: ",
                         "xgb = ", 
                         round(100 * (2 * auc(ROC.train) - 1), 1), "%, ",
                         "Gini TEST: ",
                         "xgb = ", 
                         round(100 * (2 * auc(ROC.test) - 1), 1), "%, "
                         )) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Paired")

GBM

In this section we will build a GBM model. Let’s start by clearing the environment and preparing the data frame.

library(gbm)
library(dplyr)
library(caret)
rm(list=ls())

df <- read.csv("./data/output/df.csv")
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")
rfe_columns <- read.csv("./data/output/rfe_columns.csv")
rfe_columns <- c("class", rfe_columns$x)

df.train <- df.train[, rfe_columns]
df.test <- df.test[, rfe_columns]
# Make factor
df$class <- factor(df$class, levels = c(0, 1), labels = c("bad", "good"))
df.train$class <- factor(df.train$class, levels = c(0, 1), labels = c("bad", "good"))
df.test$class <- factor(df.test$class, levels = c(0, 1), labels = c("bad", "good"))

Function to define model formula

# define model formula - all variables apart from class
create_model_formula <- function(df, explained_variable){
  
  model_formula <- paste(explained_variable, "~")
  
  for (i in 2:length(colnames(df))){
    if(i == 2){
      model_formula <- paste(model_formula, colnames(df)[i])
    } else {
      model_formula <- paste(model_formula, " + ", colnames(df)[i])
    }
  }
  
  return(as.formula(model_formula))
}
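
For reference, base R’s reformulate() builds the same formula in a single call (a sketch; the analysis keeps the helper above), assuming the target is the first column:

# Equivalent one-liner: all remaining columns as predictors of "class"
model_formula_alt <- reformulate(colnames(df.train)[-1], response = "class")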

Define two model formulas: one for class (the factor) and one for class_1 (the numeric 0/1 variable defined below).

model_formula01 <- create_model_formula(df.train, "class_1")
model_formula <- create_model_formula(df.train, "class")

The gbm package requires the target variable to be coded as 0 and 1 rather than as a factor, while caret needs the factor version. Therefore we keep class as a factor and define a new variable, class_1, coded as 0/1.

df.train$class_1 <- (df.train$class == "good") * 1
df.test$class_1 <- (df.test$class == "good") * 1
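
A quick cross-tabulation (not in the original script) can confirm that the new 0/1 variable matches the factor:

# Expect all "good" rows under 1 and all "bad" rows under 0
table(df.train$class, df.train$class_1)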

Simple GBM

In this section we build a simple GBM model without any hyperparameter tuning.

set.seed(123456789)
gbm.1 <- 
  gbm(model_formula01,
      data = df.train,
      distribution = "bernoulli",
      # total number of trees
      n.trees = 500,
      # number of variable interactions - actually depth of the trees
      interaction.depth = 4,
      # shrinkage parameter - speed (pace) of learning
      shrinkage = 0.01,
      verbose = FALSE)

gbm.1 %>% saveRDS("./data/output/gbm/gbm.1.rds")

# predict train
df.pred.train.gbm <- predict(gbm.1,
                                  df.train, 
                                  type = "response",
                                  n.trees = 500)
# predict test
df.pred.test.gbm <- predict(gbm.1,
                             df.test, 
                             type = "response",
                             n.trees = 500)
# Load function to get accuracy and gini values
source("./scripts/getAccuracyAndGini2.R")

# Training set
getAccuracyAndGini2(data = data.frame(class = df.train$class,
                                      pred = df.pred.train.gbm),
                    predicted_probs = "pred",
                    target_variable = "class",
                    target_levels = c("good", "bad"),
                    predicted_class = "good")
##    Accuracy Sensitivity Specificity        Gini 
##   0.8585714   0.9588900   0.6299766   0.8426509
# Test set
getAccuracyAndGini2(data = data.frame(class = df.test$class,
                                      pred = df.pred.test.gbm),
                    predicted_probs = "pred",
                    target_variable = "class",
                    target_levels = c("good", "bad"),
                    predicted_class = "good")
##    Accuracy Sensitivity Specificity        Gini 
##   0.7900000   0.9339853   0.4816754   0.7479486
ROC.train.gbm <- pROC::roc(df.train$class, 
                           df.pred.train.gbm)

ROC.test.gbm  <- pROC::roc(df.test$class, 
                           df.pred.test.gbm)

cat("AUC for train = ", pROC::auc(ROC.train.gbm), 
    ", Gini for train = ", 2 * pROC::auc(ROC.train.gbm) - 1, "\n", sep = "")
## AUC for train = 0.9213254, Gini for train = 0.8426509
cat("AUC for test = ", pROC::auc(ROC.test.gbm), 
    ", Gini for test = ", 2 * pROC::auc(ROC.test.gbm) - 1, "\n", sep = "")
## AUC for test = 0.8739743, Gini for test = 0.7479486

Plot ROC

list(
  ROC.train.gbm = ROC.train.gbm,
  ROC.test.gbm  = ROC.test.gbm
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini TRAIN: ",
                         "gbm = ", 
                         round(100 * (2 * auc(ROC.train.gbm) - 1), 1), "%, ",
                         "Gini TEST: ",
                         "gbm = ", 
                         round(100 * (2 * auc(ROC.test.gbm) - 1), 1), "%, "
  )) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Paired")

Hyperparameter tuning

parameters_gbm <- expand.grid(interaction.depth = c(1, 2, 4),
                              n.trees = c(100, 300, 500),
                              shrinkage = c(0.01, 0.1), 
                              n.minobsinnode = c(150, 250, 400))

ctrl_cv3 <- trainControl(method = "cv", 
                         number = 10,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)
set.seed(123456789)
gbm.2  <- train(model_formula,
                       data = df.train,
                       distribution = "bernoulli",
                       method = "gbm",
                       tuneGrid = parameters_gbm,
                       trControl = ctrl_cv3,
                       verbose = FALSE)

gbm.2
## Stochastic Gradient Boosting 
## 
## 1400 samples
##   35 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1261, 1260, 1259, 1261, 1261, 1259, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  n.trees
##   0.01       1                  150             100    
##   0.01       1                  150             300    
##   0.01       1                  150             500    
##   0.01       1                  250             100    
##   0.01       1                  250             300    
##   0.01       1                  250             500    
##   0.01       1                  400             100    
##   0.01       1                  400             300    
##   0.01       1                  400             500    
##   0.01       2                  150             100    
##   0.01       2                  150             300    
##   0.01       2                  150             500    
##   0.01       2                  250             100    
##   0.01       2                  250             300    
##   0.01       2                  250             500    
##   0.01       2                  400             100    
##   0.01       2                  400             300    
##   0.01       2                  400             500    
##   0.01       4                  150             100    
##   0.01       4                  150             300    
##   0.01       4                  150             500    
##   0.01       4                  250             100    
##   0.01       4                  250             300    
##   0.01       4                  250             500    
##   0.01       4                  400             100    
##   0.01       4                  400             300    
##   0.01       4                  400             500    
##   0.10       1                  150             100    
##   0.10       1                  150             300    
##   0.10       1                  150             500    
##   0.10       1                  250             100    
##   0.10       1                  250             300    
##   0.10       1                  250             500    
##   0.10       1                  400             100    
##   0.10       1                  400             300    
##   0.10       1                  400             500    
##   0.10       2                  150             100    
##   0.10       2                  150             300    
##   0.10       2                  150             500    
##   0.10       2                  250             100    
##   0.10       2                  250             300    
##   0.10       2                  250             500    
##   0.10       2                  400             100    
##   0.10       2                  400             300    
##   0.10       2                  400             500    
##   0.10       4                  150             100    
##   0.10       4                  150             300    
##   0.10       4                  150             500    
##   0.10       4                  250             100    
##   0.10       4                  250             300    
##   0.10       4                  250             500    
##   0.10       4                  400             100    
##   0.10       4                  400             300    
##   0.10       4                  400             500    
##   ROC        Sens        Spec     
##   0.7792206  0.00000000  1.0000000
##   0.8038604  0.19706534  0.9774037
##   0.8158203  0.32563677  0.9476226
##   0.7510418  0.00000000  1.0000000
##   0.7793486  0.15919158  0.9835893
##   0.7869867  0.30238095  0.9455502
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##   0.7897975  0.02569214  0.9989796
##   0.8170165  0.38405316  0.9455397
##   0.8302039  0.45420819  0.9280875
##   0.7483803  0.00000000  1.0000000
##   0.7807060  0.14988926  0.9825584
##   0.7871220  0.30465116  0.9455502
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##   0.7937979  0.02563677  0.9989796
##   0.8168736  0.37940199  0.9445087
##   0.8289242  0.45658915  0.9321902
##   0.7506987  0.00000000  1.0000000
##   0.7799251  0.15215947  0.9825794
##   0.7876903  0.30470653  0.9414370
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##   0.8277510  0.46129568  0.9219440
##   0.8366069  0.56650055  0.8972649
##   0.8345561  0.57364341  0.8941721
##   0.7890732  0.40293466  0.9064801
##   0.7943637  0.45442968  0.8848832
##   0.7945992  0.44496124  0.8828424
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##   0.8389667  0.53156146  0.9034399
##   0.8451245  0.58078627  0.8890069
##   0.8445341  0.59241417  0.8941195
##   0.7928862  0.41926910  0.9075005
##   0.7972307  0.45437431  0.8879865
##   0.7948166  0.45431894  0.8838628
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##   0.8396336  0.53626800  0.9075531
##   0.8441383  0.55498339  0.8900484
##   0.8483632  0.59246955  0.8890069
##   0.7943256  0.41245847  0.9105933
##   0.7963135  0.44036545  0.8930781
##   0.7953236  0.44966777  0.8849043
##         NaN         NaN        NaN
##         NaN         NaN        NaN
##         NaN         NaN        NaN
## 
## ROC was used to select the optimal model using the
##  largest value.
## The final values used for the model were n.trees =
##  500, interaction.depth = 4, shrinkage = 0.1
##  and n.minobsinnode = 150.
res <- gbm.2$results

saveRDS(object = gbm.2,
        file   = "./data/output/gbm/gbm.2.rds")
# Predict train
df.pred.train.gbm2 <- predict(gbm.2,
                             df.train, 
                             type = "prob",
                             n.trees = 500)
# Predict test
df.pred.test.gbm2 <- predict(gbm.2,
                            df.test, 
                            type = "prob",
                            n.trees = 500)
# Training set
accuracy_gini_train <- getAccuracyAndGini2(data = data.frame(class = df.train$class,
                                      pred = df.pred.train.gbm2[, "good"]),
                    predicted_probs = "pred",
                    target_variable = "class",
                    target_levels = c("good", "bad"),
                    predicted_class = "good")
# Test set
accuracy_gini_test <- getAccuracyAndGini2(data = data.frame(class = df.test$class,
                                      pred = df.pred.test.gbm2[, "good"]),
                    predicted_probs = "pred",
                    target_variable = "class",
                    target_levels = c("good", "bad"),
                    predicted_class = "good")
write.csv(accuracy_gini_test, "./data/accuracy_gini_test.csv")
write.csv(accuracy_gini_train, "./data/accuracy_gini_train.csv")
ROC.train.gbm2 <- pROC::roc(df.train$class, 
                           df.pred.train.gbm2[, "good"])

ROC.test.gbm2  <- pROC::roc(df.test$class, 
                           df.pred.test.gbm2[, "good"])
cat("AUC for train = ", pROC::auc(ROC.train.gbm2), 
    ", Gini for train = ", 2 * pROC::auc(ROC.train.gbm2) - 1, "\n", sep = "")
## AUC for train = 0.9359089, Gini for train = 0.8718178
cat("AUC for test = ", pROC::auc(ROC.test.gbm2), 
    ", Gini for test = ", 2 * pROC::auc(ROC.test.gbm2) - 1, "\n", sep = "")
## AUC for test = 0.8558609, Gini for test = 0.7117219
list(
  ROC.train.gbm2 = ROC.train.gbm2,
  ROC.test.gbm2  = ROC.test.gbm2
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini TRAIN: ",
                         "gbm = ", 
                         round(100 * (2 * auc(ROC.train.gbm2) - 1), 1), "%, ",
                         "Gini TEST: ",
                         "gbm = ", 
                         round(100 * (2 * auc(ROC.test.gbm2) - 1), 1), "%, "
  )) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Paired")

Random Forest - one by one

Let’s start with cleaning the environment and loading the data.

rm(list=ls())

df <- read.csv("./data/output/df.csv")
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")

df.train$class <- factor(df.train$class, levels = c(0, 1), labels = c("bad", "good"))
df.test$class <- factor(df.test$class, levels = c(0, 1), labels = c("bad", "good"))

Prepare df as before.

rfe_columns <- read.csv("./data/output/rfe_columns.csv")
rfe_columns <- c("class", rfe_columns$x)

df.train <- df.train[, rfe_columns]
df.test <- df.test[, rfe_columns]

rf

First, estimate a basic model with all variables and default parameters.

library(randomForest)
library(caret)
library(pROC)
library(dplyr)
set.seed(123456789)
rf <- randomForest(class ~ ., 
                   data = df.train)

saveRDS(rf, file = "./data/output/rf/rf.RDS")
print(rf)
## 
## Call:
##  randomForest(formula = class ~ ., data = df.train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 13.5%
## Confusion matrix:
##      bad good class.error
## bad  275  152  0.35597190
## good  37  936  0.03802672

Estimated out-of-bag error: 13.5%.

plot(rf)

The plot above helps determine an appropriate number of trees.

The black solid line shows the total OOB error, the red line the prediction error for the “bad” class, and the green line the error for the “good” class.

As the number of trees grows, the OOB error decreases and stabilises at roughly 120 trees.
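
plot(rf) does not add a legend by default; a small sketch of how the lines could be labelled (the columns of rf$err.rate are the OOB error followed by the per-class errors):

# Redraw the error curves with a legend for easier reading
plot(rf)
legend("topright",
       legend = colnames(rf$err.rate),
       col = 1:3, lty = 1:3, cex = 0.8)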

rf2

Hence, let us limit the number of trees and estimate the model on bootstrap samples of the full data set (sampsize = nrow(df.train)).

We also increase the number of predictors tried at each split (mtry) from 5 to 10.

set.seed(123456789)
rf2 <- 
  randomForest(class ~ .,
               data = df.train,
               ntree = 120,
               sampsize = nrow(df.train),
               mtry = 10,
               # minimum number of obs in the terminal nodes
               nodesize = 100,
               # we also generate predictors importance measures,
               importance = TRUE)

saveRDS(rf2, file = "./data/output/rf/rf2.RDS")
print(rf2)
## 
## Call:
##  randomForest(formula = class ~ ., data = df.train, ntree = 120,      sampsize = nrow(df.train), mtry = 10, nodesize = 100, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 120
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 21.07%
## Confusion matrix:
##      bad good class.error
## bad  194  233  0.54566745
## good  62  911  0.06372045

rf3 - tuning mtry

Let us now optimise the mtry parameter, this time with caret, using the out-of-bag error (trainControl(method = "oob")) as the evaluation measure.

We consider 35 predictors in the model formula, so let us try mtry values between 2 and 31.

parameters_rf <- expand.grid(mtry = 2:31)
ctrl_oob <- trainControl(method = "oob", classProbs = TRUE)

set.seed(123456789)

rf3 <-
  train(class ~ .,
        data = df.train,
        method = "rf",
        ntree = 120,
        nodesize = 100,
        tuneGrid = parameters_rf,
        trControl = ctrl_oob,
        importance = TRUE)

rf3
## Random Forest 
## 
## 1400 samples
##   35 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7357143  0.1945491
##    3    0.7578571  0.2926595
##    4    0.7792857  0.3755810
##    5    0.7657143  0.3335094
##    6    0.7714286  0.3697438
##    7    0.7850000  0.4174569
##    8    0.7785714  0.4058685
##    9    0.7857143  0.4292998
##   10    0.7885714  0.4499675
##   11    0.7864286  0.4358065
##   12    0.7842857  0.4460444
##   13    0.7857143  0.4473408
##   14    0.7814286  0.4263970
##   15    0.7800000  0.4350386
##   16    0.7821429  0.4344730
##   17    0.7750000  0.4299468
##   18    0.7735714  0.4255165
##   19    0.7771429  0.4333763
##   20    0.7792857  0.4287088
##   21    0.7757143  0.4337547
##   22    0.7835714  0.4485495
##   23    0.7835714  0.4485495
##   24    0.7857143  0.4612746
##   25    0.7792857  0.4415932
##   26    0.7757143  0.4321573
##   27    0.7750000  0.4283329
##   28    0.7742857  0.4372718
##   29    0.7814286  0.4481813
##   30    0.7800000  0.4430078
##   31    0.7771429  0.4285191
## 
## Accuracy was used to select the optimal model using
##  the largest value.
## The final value used for the model was mtry = 10.
saveRDS(rf3, file = "./data/output/rf/rf3.RDS") 
plot(rf3$results$mtry,
     rf3$results$Accuracy, type = "b")

plot(rf3$results$mtry,
     rf3$results$Kappa, type = "b")

modelLookup("ranger")
##    model     parameter                         label forReg
## 1 ranger          mtry #Randomly Selected Predictors   TRUE
## 2 ranger     splitrule                Splitting Rule   TRUE
## 3 ranger min.node.size             Minimal Node Size   TRUE
##   forClass probModel
## 1     TRUE      TRUE
## 2     TRUE      TRUE
## 3     TRUE      TRUE
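
modelLookup() shows the three ranger hyperparameters exposed by caret. As an illustration only (this model was not estimated in the analysis), they could be tuned on the same training data as follows:

# Hypothetical sketch - tuning the ranger parameters listed above with caret
parameters_ranger <- expand.grid(mtry = c(5, 10, 15),
                                 splitrule = c("gini", "extratrees"),
                                 min.node.size = c(50, 100, 150))

ctrl_cv5 <- trainControl(method = "cv",
                         number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

set.seed(123456789)
rf_ranger <- train(class ~ .,
                   data = df.train,
                   method = "ranger",
                   num.trees = 120,
                   metric = "ROC",
                   tuneGrid = parameters_ranger,
                   trControl = ctrl_cv5)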

Hyperparameter tuning using tidymodels

In this section we tune hyperparameters using tidymodels, trying to do it in a more automated way.

In the first step we tune hyperparameters and take the best model parameters.

We built a function tune_two_stages (defined in a separate script) which takes two parameter grids and tunes the model in two stages. First it tunes a single parameter and keeps the best value; then it moves to the second grid and tunes all remaining parameters at the same time. The results are sorted by decreasing mean metric value, and the parameters of the three best models are written to csv (a sketch of such a function is given below).

In the second step we take the results and check the performance of the models.

We built the functions summary_xgb, summary_random_forest and summary_decision_tree, which take the three best parameter sets and their names and check model performance. We fit the models with those parameters on df.train, make predictions on df.train and df.test, evaluate performance with the Gini coefficient, and save the ROC plots.
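
A hedged sketch of how tune_two_stages could be structured with tidymodels; the actual function lives in ./scripts/tune_functions.R and may differ in its details:

# Hypothetical sketch of the two-stage tuning function described above
tune_two_stages_sketch <- function(models, model_name) {
  m <- models[[model_name]]
  folds <- vfold_cv(df.train, v = 5, strata = class)

  # Stage 1: tune the single parameter declared in model_1 (e.g. learn_rate)
  wf1 <- workflow() %>% add_recipe(m$rec_spec) %>% add_model(m$model_1)
  res1 <- tune_grid(wf1, resamples = folds, grid = m$grid_1,
                    metrics = metric_set(roc_auc))
  best1 <- select_best(res1, metric = "roc_auc")

  # Stage 2: fix the stage-1 parameter and tune the remaining parameters jointly
  model_2 <- do.call(update, c(list(object = finalize_model(m$model_1, best1)),
                               m$model_2_args))
  wf2 <- workflow() %>% add_recipe(m$rec_spec) %>% add_model(model_2)
  res2 <- tune_grid(wf2, resamples = folds, grid = m$grid_2,
                    metrics = metric_set(roc_auc))

  # Keep the three best parameter sets, sorted by decreasing mean ROC AUC
  result <- collect_metrics(res2) %>% arrange(desc(mean)) %>% head(3)
  list(result = result)
}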

Let’s start with cleaning the environment and loading the data.

library(dplyr)
library(rpart)
library(rpart.plot)
library(xgboost)
library(plyr)
library(doParallel)
library(pROC)
library(ggplot2)
library(gbm)
library(caret)
library(workflows)
library(parsnip)
library(recipes)
library(dials)
library(rsample)
library(tune)
library(yardstick)
library(DT)

rm(list = ls())

cores <- detectCores()
registerDoParallel(cores = cores)

source("./scripts/tune_functions.R")

df <- read.csv("./data/output/df.csv")
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")

# Make factor
df$class <- factor(df$class, levels = c(0, 1), labels = c("bad", "good"))
df.train$class <- factor(df.train$class, levels = c(0, 1), labels = c("bad", "good"))
df.test$class <- factor(df.test$class, levels = c(0, 1), labels = c("bad", "good"))

# Load rfe columns
rfe_columns <- read.csv("./data/output/rfe_columns.csv")
rfe_columns <- c("class", rfe_columns$x)

# Adjust dfs to have rfe columns
df.train <- df.train[, rfe_columns]
df.test <- df.test[, rfe_columns]

# Load functions from tune_functions script
source("./scripts/tune_functions.R")

Define models

Define the models and their hyperparameters to check.

# define parameters of models
models <- list(
  xgboost = list(
    model_name = 'xgboost',
    
    model_1 = boost_tree(mode = "classification", engine = "xgboost", learn_rate = tune()),
    
    model_2_args = list(tree_depth = tune(), 
                        loss_reduction = tune(), 
                        trees = tune(),
                        stop_iter = tune(),
                        min_n = tune()),
    
    rec_spec = recipe(class ~ ., df.train),
    
    grid_1 = set_seed_grid(grid_latin_hypercube(learn_rate(), size = 10), 123),
    
    grid_2 = set_seed_grid(as_tibble(expand.grid(
      tree_depth = c(6, 9, 12),
      loss_reduction = c(0.5, 1, 2),
      trees = c(100, 300, 500),
      stop_iter = seq(10, 50, by=10),
      min_n = seq(25, 100, by=25)
    )), 123)
  ),
  random_forest = list(
    model_name = 'random_forest',
    model_1 = rand_forest(mode = "classification", engine = "ranger", mtry = 5, trees = tune()),
    model_2_args = list(min_n = tune()),
    
    rec_spec = recipe(class ~ ., df.train),
    
    grid_1 = set_seed_grid(grid_latin_hypercube(trees(), size = 50), 123),
    
    grid_2 = set_seed_grid(as_tibble(expand.grid(
      min_n = c(75, 100, 150, 300)
    )), 123)
  ),
  decision_tree = list(
    model_name = 'decision_tree',
    model_1 = decision_tree(mode = "classification", engine = "rpart", cost_complexity = tune()),
    model_2_args = list(tree_depth = tune(), min_n = tune()),
    
    rec_spec = recipe(class ~ ., df.train),
    
    grid_1 = set_seed_grid(grid_latin_hypercube(cost_complexity(), size = 10), 123),
    grid_2 = set_seed_grid(as_tibble(expand.grid(
      tree_depth = c(6, 9, 12),
      min_n = seq(25, 100, by=25)
    )), 123)
  )
)

Tune xgb

# xgb tune
result_list_xgb <- list(tune_two_stages(models, "xgboost"))
result_list_xgb <- result_list_xgb[[1]]

# xgb summarise
tuned_params_xgb <- c("trees", "min_n", "tree_depth", "loss_reduction", "stop_iter", "learn_rate")
results <- result_list_xgb$result

final_results_xgb <- summary_xgb(results, tuned_params_xgb)
write.csv(final_results_xgb, "./data/output/xgb/final_results_xgb.csv")

Tune rf

result_list_random_forest <- list(tune_two_stages(models, "random_forest"))
result_list_random_forest <- result_list_random_forest[[1]]

# rf summarise
tuned_params_random_forest <- c("trees", "min_n")
results <- result_list_random_forest$result

final_results_random_forest <- summary_random_forest(results, tuned_params_random_forest)
write.csv(final_results_random_forest, "./data/output/rf/final_results_random_forest.csv")

Tune decision tree

result_decision_tree <- list(tune_two_stages(models, "decision_tree"))
result_decision_tree <- result_decision_tree[[1]]

# decision_tree summarise
tuned_params_decision_tree <- c("cost_complexity", "tree_depth", "min_n")
results <- result_decision_tree$result

final_results_decision_tree <- summary_decision_tree(results, tuned_params_decision_tree)
write.csv(final_results_decision_tree, "./data/output/dt/final_results_decision_tree.csv")

Compare results

columns_to_compare <- c("iteration", "gini_train", "gini_test", "models_spec", "model_name")

final_results_xgb_c <- final_results_xgb[, columns_to_compare]
final_results_random_forest_c <- final_results_random_forest[, columns_to_compare]
final_results_decision_tree_c <- final_results_decision_tree[, columns_to_compare]
final_results_all <- rbind(final_results_xgb_c, final_results_random_forest_c, final_results_decision_tree_c) %>% arrange(-gini_test)
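
Since the DT package is already loaded, the combined comparison table can be browsed interactively (a sketch):

# Interactive view of the combined results, already sorted by test Gini
final_results_all %>% DT::datatable(rownames = FALSE)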

Plot the ROC for the best model (rf_1).

knitr::include_graphics("./plots/rf_1.png")

Summary

To sum up, we built several ML models using a few different techniques. The XGB model tuned with the “one by one” method turned out to be the best, with a Gini of 92.1% on df.train and 77.0% on df.test.

The random forest model tuned in the more automated way, presented in the table above, also achieved relatively high Gini values (91.2% on df.train and 77.7% on df.test).

All results are presented in the attached pdf presentation.

Even though we applied several techniques that can improve model accuracy and efficiency (handling missing values, transforming columns according to their characteristics, feature selection with Recursive Feature Elimination, cross-validation, grid search, and so on), we believe the performance could be improved further, for instance by over- or undersampling or other methods that we did not apply.
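
For illustration only, caret supports over- and undersampling inside resampling via the sampling argument of trainControl; a hypothetical sketch (not run, reusing a tuning grid such as the ones above):

# Hypothetical sketch - up-sample the minority class within each CV fold
train_control_up <- trainControl(method = "cv",
                                 number = 5,
                                 classProbs = TRUE,
                                 summaryFunction = twoClassSummary,
                                 sampling = "up")

set.seed(12345678)
xgb_model_up <- train(class ~ .,
                      data = df.train,
                      method = "xgbTree",
                      metric = "ROC",
                      trControl = train_control_up,
                      tuneGrid = xgb_grid)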