Classification problem
The main goal of this paper is to apply ML methods to classify clients as “good” or “bad”.
We base on the dataset delivered by our lecturer. The dataset contains additional 10 “mysterious” variables (feat01 - feat10) generated by him. While the exact details of how these variables were generated are not provided, they may contain additional features or information that could be relevant for analysis.
As the following sections are independent of each other at the beginning of almost every section, we clean the environment and load the necessary data and libraries.
First of all we will clear the data, then perform simple Exploratory Data Analysis. Later on we will choose the final features using RFE and we will move to building models: XGB, GBM and RF using the approach from classes (hypertuning parameters one by one) and then try to build models using tidymodels package and some general approach.
library(tidyverse)
library(DT)
library(dplyr)
library(fastDummies)
library(caret)
library(ggplot2)
library(plotfunctions)
Read csv
df <- read.csv("data/input/c2.csv")
Below we can observe raw dataframe and its format of the columns:
str(df)
## 'data.frame': 2000 obs. of 32 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 50 30 32 34 21 25 36 25 21 36 ...
## $ checking_status : chr "'no checking'" "'no checking'" "'no checking'" "'no checking'" ...
## $ class : chr "good" "good" "good" "good" ...
## $ credit_amount : int 2319 1163 1502 4436 10155 3413 2235 3555 1805 2756 ...
## $ credit_history : chr "'existing paid'" "'delayed previously'" "'critical/other existing credit'" "'delayed previously'" ...
## $ duration : int 21 15 10 36 60 24 24 18 6 36 ...
## $ employment : chr "'>=7'" "'4<=X<7'" "'>=7'" "'1<=X<4'" ...
## $ existing_credits : int 2 2 2 2 1 2 1 1 1 1 ...
## $ feat01 : num 0.24 0.521 0.287 0.431 0.44 ...
## $ feat02 : num 0.614 0.79 0.542 0.531 0.603 ...
## $ feat03 : num 1.31 1.453 0.957 0.942 1.109 ...
## $ feat04 : num 0.976 1.2 0.856 1.164 0.764 ...
## $ feat05 : num 0.71 1.203 0.544 1.17 1.025 ...
## $ feat06 : num 0.57 0.957 1.14 1.339 0.634 ...
## $ feat07 : num 0.951 0.515 1.039 0.979 0.846 ...
## $ feat08 : num 0.581 1.18 1.497 0.532 1.04 ...
## $ feat09 : num 0.78 1.14 1.13 1.08 1.26 ...
## $ feat10 : num 0.483 0.328 0.482 0.63 0.518 ...
## $ foreign_worker : chr "yes" "yes" "no" "yes" ...
## $ housing : chr "own" "own" "own" "own" ...
## $ installment_commitment: int 4 3 3 4 2 4 4 4 1 4 ...
## $ job : chr "skilled" "skilled" "'unskilled resident'" "skilled" ...
## $ num_dependents : int 1 1 2 1 1 1 1 1 1 1 ...
## $ other_parties : chr "none" "none" "none" "none" ...
## $ other_payment_plans : chr "none" "none" "none" "none" ...
## $ own_telephone : chr "none" "none" "none" "none" ...
## $ personal_status : chr "'male single'" "'female div/dep/mar'" "'male single'" "'female div/dep/mar'" ...
## $ property_magnitude : chr "'real estate'" "'life insurance'" "'real estate'" "'real estate'" ...
## $ purpose : chr "furniture/equipment" "furniture/equipment" "'new car'" "radio/tv" ...
## $ residence_since : int 2 2 4 4 4 2 4 1 2 4 ...
## $ savings_status : chr "'<100'" "'>=1000'" "'<100'" "'<100'" ...
As we can observe all chr columns have additiona quotes ’’, let’s delete them.
df <- df %>%
mutate_if(is.character, function(x) gsub("'", "", x))
Dataframe does not have any missing values, as we can see below
## Getting the number of missing values in each column
num_missing <- colSums(is.na(df))
# Excluding columns that contains 0 missing values
num_missing <- num_missing[num_missing > 0]
num_missing
## named numeric(0)
Select all columns with chr data type - we treat them as categorical
chr_columns <- names(df[, sapply(df, is.character)])
chr_columns
## [1] "checking_status" "class"
## [3] "credit_history" "employment"
## [5] "foreign_worker" "housing"
## [7] "job" "other_parties"
## [9] "other_payment_plans" "own_telephone"
## [11] "personal_status" "property_magnitude"
## [13] "purpose" "savings_status"
Create df with categorical variables, print unique values of each column.
# Function to print unique values for each column in dataframe
print_unique_values <- function(df){
df_unique <- lapply(df, unique)
# Print unique values for each column
for (col_name in names(df_unique)) {
cat(col_name, ": ", paste(df_unique[[col_name]], collapse = ", "), "\n")
}
}
df_categorical <- df[chr_columns]
print_unique_values(df_categorical)
## checking_status : no checking, 0<=X<200, <0, >=200
## class : good, bad
## credit_history : existing paid, delayed previously, critical/other existing credit, all paid, no credits/all paid
## employment : >=7, 4<=X<7, 1<=X<4, <1, unemployed
## foreign_worker : yes, no
## housing : own, for free, rent
## job : skilled, unskilled resident, high qualif/self emp/mgmt, unemp/unskilled non res
## other_parties : none, guarantor, co applicant
## other_payment_plans : none, bank, stores
## own_telephone : none, yes
## personal_status : male single, female div/dep/mar, male mar/wid, male div/sep
## property_magnitude : real estate, life insurance, car, no known property
## purpose : furniture/equipment, new car, radio/tv, used car, education, repairs, business, retraining, other, domestic appliance
## savings_status : <100, >=1000, 100<=X<500, no known savings, 500<=X<1000
Let’s focus on the column “personal_status”, here we have 2 information, sex and marital status
We should separate them into 2 columns:
Check the size of each group:
personal_status_counts <- table(df$personal_status)
print(personal_status_counts)
##
## female div/dep/mar male div/sep male mar/wid
## 649 105 181
## male single
## 1065
Create 2 new columns:
# Extracting sex information
df <- df %>%
mutate(sex = ifelse(grepl("female", personal_status), 0, 1))
# Extracting the rest of the information
df <- df %>%
mutate(marital_status = sub("^[^ ]+ ", "", sub(" $", "", personal_status)))
# Drop the original "personal_status" column
df <- df %>%
select(-personal_status)
sex_counts <- table(df$sex)
marital_status_counts <- table(df$marital_status)
print(sex_counts)
##
## 0 1
## 649 1351
print(marital_status_counts)
##
## div/dep/mar div/sep mar/wid single
## 649 105 181 1065
Create a list of reamining chr columns which needs to be handled yet
# Copy chr columns and remove personal_status - now we have list with remaining chr columns
chr_columns_help <- chr_columns
# Remove "personal_status" from list
chr_columns_help <- setdiff(chr_columns_help, "personal_status")
# Adding marital_status to object_columns_help
chr_columns_help <- union(chr_columns_help, "marital_status")
print(chr_columns_help)
## [1] "checking_status" "class"
## [3] "credit_history" "employment"
## [5] "foreign_worker" "housing"
## [7] "job" "other_parties"
## [9] "other_payment_plans" "own_telephone"
## [11] "property_magnitude" "purpose"
## [13] "savings_status" "marital_status"
In the dataset we can observe a few ordered categorical variables. Let’s define and transform them.
# Refresh the df_categorical and show unique values - columns to be handled yet
df_categorical <- df[, chr_columns_help]
print_unique_values(df_categorical)
## checking_status : no checking, 0<=X<200, <0, >=200
## class : good, bad
## credit_history : existing paid, delayed previously, critical/other existing credit, all paid, no credits/all paid
## employment : >=7, 4<=X<7, 1<=X<4, <1, unemployed
## foreign_worker : yes, no
## housing : own, for free, rent
## job : skilled, unskilled resident, high qualif/self emp/mgmt, unemp/unskilled non res
## other_parties : none, guarantor, co applicant
## other_payment_plans : none, bank, stores
## own_telephone : none, yes
## property_magnitude : real estate, life insurance, car, no known property
## purpose : furniture/equipment, new car, radio/tv, used car, education, repairs, business, retraining, other, domestic appliance
## savings_status : <100, >=1000, 100<=X<500, no known savings, 500<=X<1000
## marital_status : single, div/dep/mar, mar/wid, div/sep
Let’s define ordered categorical variables:
ordered_categorical = c("checking_status", "employment", "job", "savings_status")
head(df[, ordered_categorical])
## checking_status employment job
## 1 no checking >=7 skilled
## 2 no checking 4<=X<7 skilled
## 3 no checking >=7 unskilled resident
## 4 no checking 1<=X<4 skilled
## 5 no checking 4<=X<7 skilled
## 6 no checking <1 skilled
## savings_status
## 1 <100
## 2 >=1000
## 3 <100
## 4 <100
## 5 100<=X<500
## 6 no known savings
We found out that the group unemp/unskilled non res has just 40 observations, 29 with good class and 11 with bad. The distribution of this class is very similar to the distribution of unskilled resident.
unemp/unskilled non res: 29/40 = 0.725 - share of good credits in this group
unskilled resident: 288/411 = 0.700
Because of that we decide to join these two groups.
df %>%
dplyr::group_by(job, class) %>%
dplyr::summarise(count = n())
## # A tibble: 8 × 3
## # Groups: job [4]
## job class count
## <chr> <chr> <int>
## 1 high qualif/self emp/mgmt bad 106
## 2 high qualif/self emp/mgmt good 185
## 3 skilled bad 378
## 4 skilled good 880
## 5 unemp/unskilled non res bad 11
## 6 unemp/unskilled non res good 29
## 7 unskilled resident bad 123
## 8 unskilled resident good 288
Define values for each category
checking_status_mapping <- c(
"no checking" = 1,
"<0" = 2,
"0<=X<200" = 3,
">=200" = 4
)
employment_mapping <- c(
"unemployed" = 1,
"<1" = 2,
"1<=X<4" = 3,
"4<=X<7"= 4,
">=7" = 5)
job_mapping <- c(
"unemp/unskilled non res" = 1,
"unskilled resident" = 1,
"skilled" = 2,
"high qualif/self emp/mgmt" = 3
)
savings_status_mapping <- c(
"no known savings" = 1,
"<100" = 2,
"100<=X<500" = 3,
"500<=X<1000" = 4,
">=1000" = 5
)
# Define the mapping function
recode_variable <- function(variable, mapping) {
variable <- factor(variable, levels = names(mapping))
return(as.numeric(variable))
}
df$checking_status <- recode_variable(df$checking_status, checking_status_mapping)
df$employment <- recode_variable(df$employment, employment_mapping)
df$job <- recode_variable(df$job, job_mapping)
df$savings_status <- recode_variable(df$savings_status, savings_status_mapping)
Update the chr_columns_help - delete ordered categorical columns and print
# Remove ordered_categorical variables names from list
chr_columns_help <- setdiff(chr_columns_help, ordered_categorical)
Let’s define boolean variables:
boolean_variables = c("class", "foreign_worker", "own_telephone")
head(df[, boolean_variables])
## class foreign_worker own_telephone
## 1 good yes none
## 2 good yes none
## 3 good no none
## 4 good yes none
## 5 good yes yes
## 6 good yes none
df$class <- ifelse(grepl("good", df$class), 1, 0)
df$foreign_worker <- ifelse(grepl("yes", df$foreign_worker), 1, 0)
df$own_telephone <- ifelse(grepl("yes", df$own_telephone), 1, 0)
# Remove boolean_variables variables names from list
chr_columns_help <- setdiff(chr_columns_help, boolean_variables)
For the remaining columns we apply one-hot encodig
# Print remaining categorical variables and its unique values
df_categorical <- df[, chr_columns_help]
print_unique_values(df_categorical)
## credit_history : existing paid, delayed previously, critical/other existing credit, all paid, no credits/all paid
## housing : own, for free, rent
## other_parties : none, guarantor, co applicant
## other_payment_plans : none, bank, stores
## property_magnitude : real estate, life insurance, car, no known property
## purpose : furniture/equipment, new car, radio/tv, used car, education, repairs, business, retraining, other, domestic appliance
## marital_status : single, div/dep/mar, mar/wid, div/sep
We use dummy_cols function from fastDummies library
df <- dummy_cols(
df,
select_columns = chr_columns_help,
remove_first_dummy = TRUE, # remove the first dummy to avoid the issue of multicollinearity
remove_selected_columns = TRUE) # remove the original columns
We performed data cleaning in a few steps:
After this transformations we obtained dataframe with 51 variables.
As we can see below, after the data transformation there are no missing values, so the data transformation process was successful.
## Getting the number of missing values in each column
num_missing <- colSums(is.na(df))
# Excluding columns that contains 0 missing values
num_missing <- num_missing[num_missing > 0]
num_missing
## named numeric(0)
Below we can see the new dataframe and its columns along with their value types
str(df)
## 'data.frame': 2000 obs. of 51 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 50 30 32 34 21 25 36 25 21 36 ...
## $ checking_status : num 1 1 1 1 1 1 3 2 1 3 ...
## $ class : num 1 1 1 1 1 1 1 1 1 1 ...
## $ credit_amount : int 2319 1163 1502 4436 10155 3413 2235 3555 1805 2756 ...
## $ duration : int 21 15 10 36 60 24 24 18 6 36 ...
## $ employment : num 5 4 5 3 4 2 5 4 3 5 ...
## $ existing_credits : int 2 2 2 2 1 2 1 1 1 1 ...
## $ feat01 : num 0.24 0.521 0.287 0.431 0.44 ...
## $ feat02 : num 0.614 0.79 0.542 0.531 0.603 ...
## $ feat03 : num 1.31 1.453 0.957 0.942 1.109 ...
## $ feat04 : num 0.976 1.2 0.856 1.164 0.764 ...
## $ feat05 : num 0.71 1.203 0.544 1.17 1.025 ...
## $ feat06 : num 0.57 0.957 1.14 1.339 0.634 ...
## $ feat07 : num 0.951 0.515 1.039 0.979 0.846 ...
## $ feat08 : num 0.581 1.18 1.497 0.532 1.04 ...
## $ feat09 : num 0.78 1.14 1.13 1.08 1.26 ...
## $ feat10 : num 0.483 0.328 0.482 0.63 0.518 ...
## $ foreign_worker : num 1 1 0 1 1 1 1 1 1 1 ...
## $ installment_commitment : int 4 3 3 4 2 4 4 4 1 4 ...
## $ job : num 3 3 2 3 3 3 3 3 3 3 ...
## $ num_dependents : int 1 1 2 1 1 1 1 1 1 1 ...
## $ own_telephone : num 0 0 0 0 1 0 1 0 0 0 ...
## $ residence_since : int 2 2 4 4 4 2 4 1 2 4 ...
## $ savings_status : num 2 5 2 2 3 1 1 2 2 2 ...
## $ sex : num 1 0 1 0 0 1 1 0 1 1 ...
## $ credit_history_critical/other existing credit: int 0 0 1 0 0 0 0 0 0 1 ...
## $ credit_history_delayed previously : int 0 1 0 1 0 0 0 0 0 0 ...
## $ credit_history_existing paid : int 1 0 0 0 1 1 1 1 1 0 ...
## $ credit_history_no credits/all paid : int 0 0 0 0 0 0 0 0 0 0 ...
## $ housing_own : int 1 1 1 1 1 1 0 1 0 1 ...
## $ housing_rent : int 0 0 0 0 0 0 0 0 1 0 ...
## $ other_parties_guarantor : int 0 0 0 0 0 0 0 1 0 0 ...
## $ other_parties_none : int 1 1 1 1 1 1 1 0 1 1 ...
## $ other_payment_plans_none : int 1 1 1 1 1 1 0 1 1 1 ...
## $ other_payment_plans_stores : int 0 0 0 0 0 0 0 0 0 0 ...
## $ property_magnitude_life insurance : int 0 1 0 0 0 0 0 0 1 0 ...
## $ property_magnitude_no known property : int 0 0 0 0 0 0 1 0 0 0 ...
## $ property_magnitude_real estate : int 1 0 1 1 1 0 0 1 0 1 ...
## $ purpose_domestic appliance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ purpose_education : int 0 0 0 0 0 0 0 0 0 0 ...
## $ purpose_furniture/equipment : int 1 1 0 0 0 0 0 0 1 0 ...
## $ purpose_new car : int 0 0 1 0 0 0 0 0 0 0 ...
## $ purpose_other : int 0 0 0 0 0 0 0 0 0 0 ...
## $ purpose_radio/tv : int 0 0 0 1 1 1 0 1 0 1 ...
## $ purpose_repairs : int 0 0 0 0 0 0 0 0 0 0 ...
## $ purpose_retraining : int 0 0 0 0 0 0 0 0 0 0 ...
## $ purpose_used car : int 0 0 0 0 0 0 1 0 0 0 ...
## $ marital_status_div/sep : int 0 0 0 0 0 0 0 0 0 0 ...
## $ marital_status_mar/wid : int 0 0 0 0 0 0 0 0 1 0 ...
## $ marital_status_single : int 1 0 1 0 0 1 1 0 0 1 ...
Before performing Exploratory Data Analysis we should divide our dataset into train and test, we do it as below. Furthermore, we remove the ID column (first column) since it will not be used in the further analysis.
set.seed(123456789)
training_obs <- createDataPartition(df$id,
p = 0.7,
list = FALSE)
df.train <- df[training_obs,]
df.test <- df[-training_obs,]
df <- df[, -1]
df.train <- df.train[, -1]
df.test <- df.test[, -1]
Let’s check the dataframe
Before writing the dfs let’s change the order of columns, so the target variable (class) would be in the first column.
# Move "class" from 3rd to the first position
df <- df[, c(3, 1:2, 4:ncol(df))]
df.train <- df.train[, c(3, 1:2, 4:ncol(df.train))]
df.test <- df.test[, c(3, 1:2, 4:ncol(df.train))]
Last check of the dataset before saving
Write dataframes
# remove unecessary variables
rm(list = setdiff(ls(), c("df", "df.test", "df.train", "print_unique_values")))
# write df.train and df.test
write.csv(df.test, "./data/output/df.test.csv", row.names = FALSE)
write.csv(df.train, "./data/output/df.train.csv", row.names = FALSE)
write.csv(df, "./data/output/df.csv", row.names = FALSE)
The EDA will be performed on training dataframe.
Divide variables according to their type
# define numeric variables
numeric_variables <- c("age",
"checking_status",
"credit_amount",
"duration",
"employment",
"existing_credits",
"installment_commitment",
"num_dependents",
"residence_since")
# define feat variables - some "mysterious" variables prepared by our lecturer
feat_variables <- c()
for (i in 1:10){
if (i < 10){
var <- paste("feat0", i, sep="")
} else {
var <- paste("feat", i, sep="")
}
feat_variables <- append(feat_variables, var)
}
# define ordered_categorical
ordered_categorical = c("checking_status", "employment", "job", "savings_status")
# define boolean variables
boolean_variables <- setdiff(colnames(df.train), c(numeric_variables, feat_variables, ordered_categorical))
# check if the division was made properly
print_unique_values(df.train[,boolean_variables])
## class : 1, 0
## foreign_worker : 1, 0
## own_telephone : 0, 1
## sex : 1, 0
## credit_history_critical/other existing credit : 0, 1
## credit_history_delayed previously : 0, 1
## credit_history_existing paid : 1, 0
## credit_history_no credits/all paid : 0, 1
## housing_own : 1, 0
## housing_rent : 0, 1
## other_parties_guarantor : 0, 1
## other_parties_none : 1, 0
## other_payment_plans_none : 1, 0
## other_payment_plans_stores : 0, 1
## property_magnitude_life insurance : 0, 1
## property_magnitude_no known property : 0, 1
## property_magnitude_real estate : 1, 0
## purpose_domestic appliance : 0, 1
## purpose_education : 0, 1
## purpose_furniture/equipment : 1, 0
## purpose_new car : 0, 1
## purpose_other : 0, 1
## purpose_radio/tv : 0, 1
## purpose_repairs : 0, 1
## purpose_retraining : 0, 1
## purpose_used car : 0, 1
## marital_status_div/sep : 0, 1
## marital_status_mar/wid : 0, 1
## marital_status_single : 1, 0
par(mfrow=c(3, 4))
for (var in numeric_variables) {
hist(df.train[[var]], main=paste("Histogram of", var), xlab=var)
}
par(mfrow=c(3, 4))
for (var in feat_variables) {
hist(df.train[[var]], main=paste("Histogram of", var), xlab=var)
}
For numeric variables and target variable.
library(corrplot)
num_var <- c(numeric_variables, feat_variables, "class")
numeric_df <- df.train[, num_var]
cor_matrix <- cor(numeric_df, use="complete.obs")
corrplot(cor_matrix, method="circle")
# For ordered categorical variables
par(mfrow=c(2, 2))
for (var in ordered_categorical) {
counts <- table(df.train[[var]])
barplot(counts, main=paste("Bar Plot of", var), xlab=var, ylab="Count")
}
In case of boolean variables we create table with distribution of every variable.
# Creating a summary table for boolean variables with percentages
boolean_summary <- sapply(df.train[, boolean_variables], function(x) {
tab <- table(factor(x, levels = c(0, 1)))
pct <- 100 * tab / sum(tab) # Convert counts to percentage
return(pct)
})
boolean_summary <- t(boolean_summary) # Transposing to have variables as rows and percentage columns
# Formatting the table for better readability
boolean_summary <- round(boolean_summary, 1) # Round to two decimal places
# Printing the summary table
print(boolean_summary)
## 0 1
## class 30.5 69.5
## foreign_worker 4.0 96.0
## own_telephone 59.3 40.7
## sex 32.6 67.4
## credit_history_critical/other existing credit 70.1 29.9
## credit_history_delayed previously 91.6 8.4
## credit_history_existing paid 47.5 52.5
## credit_history_no credits/all paid 95.8 4.2
## housing_own 27.7 72.3
## housing_rent 82.7 17.3
## other_parties_guarantor 95.0 5.0
## other_parties_none 9.1 90.9
## other_payment_plans_none 19.1 80.9
## other_payment_plans_stores 94.4 5.6
## property_magnitude_life insurance 77.4 22.6
## property_magnitude_no known property 84.3 15.7
## property_magnitude_real estate 71.4 28.6
## purpose_domestic appliance 98.6 1.4
## purpose_education 95.2 4.8
## purpose_furniture/equipment 82.6 17.4
## purpose_new car 77.0 23.0
## purpose_other 98.5 1.5
## purpose_radio/tv 70.9 29.1
## purpose_repairs 97.6 2.4
## purpose_retraining 99.0 1.0
## purpose_used car 90.9 9.1
## marital_status_div/sep 94.6 5.4
## marital_status_mar/wid 90.9 9.1
## marital_status_single 47.1 52.9
As we can see below, the train and test dataframes are split proportionally, however we can observe unequal distribution - there are more than twice observations with class “1” (good).
# Target variable distribution in df:
round(table(df$class)/length(df$class), 3)
##
## 0 1
## 0.309 0.691
# Target variabledistribution in df.train:
round(table(df.train$class)/length(df.train$class), 3)
##
## 0 1
## 0.305 0.695
# Target variabledistribution in df.test:
round(table(df.test$class)/length(df.test$class), 3)
##
## 0 1
## 0.318 0.682
Recursive Feature Elimination (RFE) is a feature selection method used to identify and select a subset of relevant features for model building. The method works by recursively removing features, building a model with the remaining features, and calculating model accuracy. The least important features are removed in each iteration until the specified number of features is reached.
In this section of the project, we employ RFE to optimize the feature set for predicting the class variable in our dataset. The process is outlined as follows.
First of all we clear the environment, load train and test df and new libraries.
rm(list=ls())
library(mlbench)
library(caret)
library(recipes)
library(doParallel)
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")
# rfe requires factor as target variable, so we modify class (1st column) - only on df.train dataframe as of now
df.train[, 1] <- as.factor(df.train[,1])
Detect the number of CPU cores available on the current machine and assign it to ‘cores’. Then, register the detected number of cores for parallel processing using the doParallel package. This enables parallel execution of code to improve performance, especially for operations that can be executed concurrently.
cores=detectCores()
registerDoParallel(cores=cores)
If we use doParallel package and want to make our code reproducible we should pass seeds argument to rfeControl function it requires a list where each element corresponds to a specific resampling iteration’s seed and the last element is a seed for the overall process.
Assuming 10-fold CV, we need a list of 11 elements: 10 for each fold and 1 for the overall process.
set.seed(123456789)
numFolds <- 10
seedsList <- lapply(1:numFolds, function(x) sample.int(1000, 49)) # Generate random seeds for each fold
set.seed(123456789)
seedsList[[length(seedsList) + 1]] <- sample.int(1000, 1) # Add one more seed for the overall process
## Define rfeControl object
control <- rfeControl(functions=rfFuncs, method="cv", number=10, seeds = seedsList)
set.seed(123456789)
results <- rfe(df.train[, seq(2, length(df.train), 1)], # predictor variables
df.train[, 1], # target variable for prediction
sizes=c(1:49), # the subset sizes to evaluate
rfeControl=control)
Basing on the RFE method 35 variables have been chosen as the best predictors. Below we can see them.
predictors(results)
## [1] "checking_status"
## [2] "feat02"
## [3] "duration"
## [4] "credit_amount"
## [5] "feat01"
## [6] "feat10"
## [7] "credit_history_critical.other.existing.credit"
## [8] "age"
## [9] "savings_status"
## [10] "employment"
## [11] "property_magnitude_real.estate"
## [12] "other_payment_plans_none"
## [13] "installment_commitment"
## [14] "property_magnitude_no.known.property"
## [15] "credit_history_no.credits.all.paid"
## [16] "housing_own"
## [17] "credit_history_existing.paid"
## [18] "purpose_education"
## [19] "purpose_radio.tv"
## [20] "residence_since"
## [21] "existing_credits"
## [22] "other_payment_plans_stores"
## [23] "purpose_new.car"
## [24] "own_telephone"
## [25] "other_parties_guarantor"
## [26] "job"
## [27] "credit_history_delayed.previously"
## [28] "housing_rent"
## [29] "other_parties_none"
## [30] "purpose_used.car"
## [31] "sex"
## [32] "marital_status_single"
## [33] "marital_status_div.sep"
## [34] "num_dependents"
## [35] "marital_status_mar.wid"
Below we can see the plot of accuracy results vs number of variables included. Interesting is the fact that having more predictors than 35 not only does not increase precision but also worsens it.
plot(results, type=c("g", "o"))
Assign the final columns to list and write as .csv.
rfe_columns <- predictors(results)
write.csv(rfe_columns, "./data/output/rfe_columns.csv")
In this section we apply the approach from classes adjusted to our case. We build the xgb model by hypertuning parameters one by one in the following order:
Let’s start with loading required packages, cleaning the environment and loading the data.
library(dplyr)
library(rpart)
library(rpart.plot)
library(xgboost)
library(plyr)
library(doParallel)
library(pROC)
library(ggplot2)
library(caret)
cores=detectCores()
registerDoParallel(cores=cores)
rm(list=ls())
df <- read.csv("./data/output/df.csv")
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")
Xgb requires target variable as factor - let’s change it
df$class <- factor(df$class, levels = c(0, 1), labels = c("bad", "good"))
df.train$class <- factor(df.train$class, levels = c(0, 1), labels = c("bad", "good"))
df.test$class <- factor(df.test$class, levels = c(0, 1), labels = c("bad", "good"))
Take columns chosen by RFE
# Read columns chosen by rfe
rfe_columns <- read.csv("./data/output/rfe_columns.csv")
# Add "class" to the list of columns chosen for analysis
rfe_columns <- c("class", rfe_columns$x)
# Modify dfs to have columns chosen by rfe
df.train <- df.train[, rfe_columns]
df.test <- df.test[, rfe_columns]
Calculate initial parameters based on recommendations provided during the lectures.
# colsample_bytree - # rule of thumb (number of predictors)
sqrt(35)/35. # 0.1690309
## [1] 0.1690309
# min_child_weight - 0.5 ... 0.1% of number of observations
0.005 * 2000 # 10
## [1] 10
0.01 * 2000 # 20
## [1] 20
set.seed(12345678)
xgb_grid <- expand.grid(nrounds = seq(20, 80, 10),
max_depth = c(8),
eta = c(0.25),
gamma = 1,
colsample_bytree = c(0.17),
min_child_weight = c(10),
subsample = 0.8)
train_control <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
xgb_model.1 <- train(class ~ .,
data = df.train,
method = "xgbTree",
trControl = train_control,
tuneGrid = xgb_grid)
xgb_model.1
## eXtreme Gradient Boosting
##
## 1400 samples
## 35 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121
## Resampling results across tuning parameters:
##
## nrounds ROC Sens Spec
## 20 0.8467558 0.4005472 0.9547713
## 30 0.8542388 0.4870588 0.9434629
## 40 0.8592276 0.5268947 0.9259847
## 50 0.8588507 0.5455814 0.9280307
## 60 0.8559132 0.5433105 0.9198150
## 70 0.8580116 0.5503967 0.9167169
## 80 0.8582144 0.5668126 0.9146497
##
## Tuning parameter 'max_depth' was held constant at a
## held constant at a value of 10
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the
## largest value.
## The final values used for the model were nrounds =
## colsample_bytree = 0.17, min_child_weight = 10
## and subsample = 0.8.
xgb_model.1 %>% saveRDS("./data/output/xgb/xgb_model.1.rds")
The best for nrounds = 40 Let’s take this parameter and tuner others.
xgb_grid <- expand.grid(nrounds = 40,
max_depth = c(5, 15, 1),
eta = c(0.25),
gamma = 1,
colsample_bytree = c(0.17),
min_child_weight = seq(10, 200, 2),
subsample = 0.8)
set.seed(12345678)
xgb_model.2 <- train(class ~ .,
data = df.train,
method = "xgbTree",
trControl = train_control,
tuneGrid = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.2
## eXtreme Gradient Boosting
##
## 1400 samples
## 35 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121
## Resampling results across tuning parameters:
##
## max_depth min_child_weight ROC Sens
## 1 10 0.8344563 0.395950752
## 1 12 0.8358665 0.407523940
## 1 14 0.8358767 0.400328317
## 1 16 0.8358507 0.407359781
## 1 18 0.8359002 0.412120383
## 1 20 0.8326024 0.393378933
## 1 22 0.8270847 0.395677155
## 1 24 0.8320066 0.423748290
## 1 26 0.8319196 0.411983584
## 1 28 0.8264682 0.414418605
## 1 30 0.8242036 0.391053352
## 1 32 0.8281184 0.416908345
## 1 34 0.8308021 0.416689466
## 1 36 0.8223432 0.391053352
## 1 38 0.8229040 0.376853625
## 1 40 0.8227221 0.395567715
## 1 42 0.8272814 0.402708618
## 1 44 0.8156231 0.374664843
## 1 46 0.8218393 0.383967168
## 1 48 0.8168432 0.369986320
## 1 50 0.8176469 0.348919289
## 1 52 0.8059365 0.341860465
## 1 54 0.8008042 0.355978112
## 1 56 0.8001785 0.341833105
## 1 58 0.8036833 0.342024624
## 1 60 0.7879441 0.288016416
## 1 62 0.7829489 0.294938440
## 1 64 0.7770599 0.287906977
## 1 66 0.7781812 0.299671683
## 1 68 0.7737653 0.295020520
## 1 70 0.7705386 0.250588235
## 1 72 0.7621448 0.266894665
## 1 74 0.7560739 0.290369357
## 1 76 0.7481793 0.266648427
## 1 78 0.7404916 0.227004104
## 1 80 0.7453078 0.192147743
## 1 82 0.7314558 0.208290014
## 1 84 0.7236018 0.196580027
## 1 86 0.7183731 0.156908345
## 1 88 0.7196452 0.168673051
## 1 90 0.7066056 0.060984952
## 1 92 0.6820901 0.039945280
## 1 94 0.6947117 0.007058824
## 1 96 0.6875540 0.000000000
## 1 98 0.6712238 0.000000000
## 1 100 0.6526735 0.000000000
## 1 102 0.6195525 0.000000000
## 1 104 0.6237589 0.000000000
## 1 106 0.5758576 0.000000000
## 1 108 0.5746589 0.000000000
## 1 110 0.6094594 0.000000000
## 1 112 0.5326217 0.000000000
## 1 114 0.5000000 0.000000000
## 1 116 0.5000000 0.000000000
## 1 118 0.5000000 0.000000000
## 1 120 0.5000000 0.000000000
## 1 122 0.5000000 0.000000000
## 1 124 0.5000000 0.000000000
## 1 126 0.5000000 0.000000000
## 1 128 0.5000000 0.000000000
## 1 130 0.5000000 0.000000000
## 1 132 0.5000000 0.000000000
## 1 134 0.5000000 0.000000000
## 1 136 0.5000000 0.000000000
## 1 138 0.5000000 0.000000000
## 1 140 0.5000000 0.000000000
## 1 142 0.5000000 0.000000000
## 1 144 0.5000000 0.000000000
## 1 146 0.5000000 0.000000000
## 1 148 0.5000000 0.000000000
## 1 150 0.5000000 0.000000000
## 1 152 0.5000000 0.000000000
## 1 154 0.5000000 0.000000000
## 1 156 0.5000000 0.000000000
## 1 158 0.5000000 0.000000000
## 1 160 0.5000000 0.000000000
## 1 162 0.5000000 0.000000000
## 1 164 0.5000000 0.000000000
## 1 166 0.5000000 0.000000000
## 1 168 0.5000000 0.000000000
## 1 170 0.5000000 0.000000000
## 1 172 0.5000000 0.000000000
## 1 174 0.5000000 0.000000000
## 1 176 0.5000000 0.000000000
## 1 178 0.5000000 0.000000000
## 1 180 0.5000000 0.000000000
## 1 182 0.5000000 0.000000000
## 1 184 0.5000000 0.000000000
## 1 186 0.5000000 0.000000000
## 1 188 0.5000000 0.000000000
## 1 190 0.5000000 0.000000000
## 1 192 0.5000000 0.000000000
## 1 194 0.5000000 0.000000000
## 1 196 0.5000000 0.000000000
## 1 198 0.5000000 0.000000000
## 1 200 0.5000000 0.000000000
## 5 10 0.8565765 0.540930233
## 5 12 0.8612264 0.541039672
## 5 14 0.8604411 0.524596443
## 5 16 0.8491117 0.510478796
## 5 18 0.8438764 0.484815321
## 5 20 0.8455765 0.505800274
## 5 22 0.8318509 0.480136799
## 5 24 0.8430497 0.494035568
## 5 26 0.8428770 0.494008208
## 5 28 0.8414709 0.491573187
## 5 30 0.8309704 0.461203830
## 5 32 0.8341771 0.458905609
## 5 34 0.8299657 0.447168263
## 5 36 0.8374388 0.461258550
## 5 38 0.8368526 0.477592339
## 5 40 0.8267579 0.430615595
## 5 42 0.8289070 0.454309166
## 5 44 0.8334869 0.458987688
## 5 46 0.8254475 0.435458276
## 5 48 0.8301018 0.400492476
## 5 50 0.8249659 0.407606019
## 5 52 0.8183259 0.400410397
## 5 54 0.8136487 0.379480164
## 5 56 0.8067035 0.330259918
## 5 58 0.8024038 0.360601915
## 5 60 0.7915911 0.332585499
## 5 62 0.7928539 0.351135431
## 5 64 0.7756334 0.301997264
## 5 66 0.7872167 0.313761970
## 5 68 0.7687912 0.262325581
## 5 70 0.7782777 0.285745554
## 5 72 0.7504151 0.255239398
## 5 74 0.7443232 0.266867305
## 5 76 0.7428209 0.236251710
## 5 78 0.7332966 0.246073871
## 5 80 0.7319336 0.210943912
## 5 82 0.7286488 0.234227086
## 5 84 0.7307047 0.185225718
## 5 86 0.7199071 0.192120383
## 5 88 0.7145815 0.215458276
## 5 90 0.6919750 0.100629275
## 5 92 0.7048677 0.039917921
## 5 94 0.6907750 0.000000000
## 5 96 0.6863598 0.000000000
## 5 98 0.6716951 0.000000000
## 5 100 0.6343764 0.000000000
## 5 102 0.6741921 0.000000000
## 5 104 0.5931451 0.000000000
## 5 106 0.6000353 0.000000000
## 5 108 0.5645613 0.000000000
## 5 110 0.5675017 0.000000000
## 5 112 0.5004293 0.000000000
## 5 114 0.5000000 0.000000000
## 5 116 0.5000000 0.000000000
## 5 118 0.5000000 0.000000000
## 5 120 0.5000000 0.000000000
## 5 122 0.5000000 0.000000000
## 5 124 0.5000000 0.000000000
## 5 126 0.5000000 0.000000000
## 5 128 0.5000000 0.000000000
## 5 130 0.5000000 0.000000000
## 5 132 0.5000000 0.000000000
## 5 134 0.5000000 0.000000000
## 5 136 0.5000000 0.000000000
## 5 138 0.5000000 0.000000000
## 5 140 0.5000000 0.000000000
## 5 142 0.5000000 0.000000000
## 5 144 0.5000000 0.000000000
## 5 146 0.5000000 0.000000000
## 5 148 0.5000000 0.000000000
## 5 150 0.5000000 0.000000000
## 5 152 0.5000000 0.000000000
## 5 154 0.5000000 0.000000000
## 5 156 0.5000000 0.000000000
## 5 158 0.5000000 0.000000000
## 5 160 0.5000000 0.000000000
## 5 162 0.5000000 0.000000000
## 5 164 0.5000000 0.000000000
## 5 166 0.5000000 0.000000000
## 5 168 0.5000000 0.000000000
## 5 170 0.5000000 0.000000000
## 5 172 0.5000000 0.000000000
## 5 174 0.5000000 0.000000000
## 5 176 0.5000000 0.000000000
## 5 178 0.5000000 0.000000000
## 5 180 0.5000000 0.000000000
## 5 182 0.5000000 0.000000000
## 5 184 0.5000000 0.000000000
## 5 186 0.5000000 0.000000000
## 5 188 0.5000000 0.000000000
## 5 190 0.5000000 0.000000000
## 5 192 0.5000000 0.000000000
## 5 194 0.5000000 0.000000000
## 5 196 0.5000000 0.000000000
## 5 198 0.5000000 0.000000000
## 5 200 0.5000000 0.000000000
## 15 10 0.8561711 0.526894665
## 15 12 0.8549140 0.552640219
## 15 14 0.8508082 0.508180575
## 15 16 0.8563379 0.519890561
## 15 18 0.8439466 0.491792066
## 15 20 0.8442873 0.517428181
## 15 22 0.8427940 0.489329685
## 15 24 0.8455911 0.456798906
## 15 26 0.8432056 0.477674419
## 15 28 0.8366147 0.484870041
## 15 30 0.8424677 0.498604651
## 15 32 0.8418105 0.489439124
## 15 34 0.8289386 0.466155951
## 15 36 0.8255641 0.463611491
## 15 38 0.8324604 0.458741450
## 15 40 0.8227214 0.423885089
## 15 42 0.8271911 0.447250342
## 15 44 0.8273940 0.456634747
## 15 46 0.8267264 0.442435021
## 15 48 0.8194536 0.412010944
## 15 50 0.8188501 0.423857729
## 15 52 0.8034414 0.372229822
## 15 54 0.8014843 0.353570451
## 15 56 0.8089223 0.376963064
## 15 58 0.8000047 0.332503420
## 15 60 0.7942000 0.334829001
## 15 62 0.7854083 0.276224350
## 15 64 0.7830316 0.304459644
## 15 66 0.7831815 0.295102599
## 15 68 0.7702310 0.285690834
## 15 70 0.7537844 0.283283174
## 15 72 0.7677682 0.255376197
## 15 74 0.7544088 0.234227086
## 15 76 0.7487642 0.261969904
## 15 78 0.7547113 0.250396717
## 15 80 0.7397075 0.229493844
## 15 82 0.7420136 0.229658003
## 15 84 0.7191190 0.201285910
## 15 86 0.7176485 0.194555404
## 15 88 0.7172817 0.192038304
## 15 90 0.7082077 0.140465116
## 15 92 0.6948009 0.042134063
## 15 94 0.7033914 0.002352941
## 15 96 0.6796861 0.000000000
## 15 98 0.6681156 0.000000000
## 15 100 0.6775902 0.000000000
## 15 102 0.6146169 0.000000000
## 15 104 0.6214176 0.000000000
## 15 106 0.5975022 0.000000000
## 15 108 0.5858185 0.000000000
## 15 110 0.5723162 0.000000000
## 15 112 0.5397587 0.000000000
## 15 114 0.5228602 0.000000000
## 15 116 0.5000000 0.000000000
## 15 118 0.5000000 0.000000000
## 15 120 0.5000000 0.000000000
## 15 122 0.5000000 0.000000000
## 15 124 0.5000000 0.000000000
## 15 126 0.5000000 0.000000000
## 15 128 0.5000000 0.000000000
## 15 130 0.5000000 0.000000000
## 15 132 0.5000000 0.000000000
## 15 134 0.5000000 0.000000000
## 15 136 0.5000000 0.000000000
## 15 138 0.5000000 0.000000000
## 15 140 0.5000000 0.000000000
## 15 142 0.5000000 0.000000000
## 15 144 0.5000000 0.000000000
## 15 146 0.5000000 0.000000000
## 15 148 0.5000000 0.000000000
## 15 150 0.5000000 0.000000000
## 15 152 0.5000000 0.000000000
## 15 154 0.5000000 0.000000000
## 15 156 0.5000000 0.000000000
## 15 158 0.5000000 0.000000000
## 15 160 0.5000000 0.000000000
## 15 162 0.5000000 0.000000000
## 15 164 0.5000000 0.000000000
## 15 166 0.5000000 0.000000000
## 15 168 0.5000000 0.000000000
## 15 170 0.5000000 0.000000000
## 15 172 0.5000000 0.000000000
## 15 174 0.5000000 0.000000000
## 15 176 0.5000000 0.000000000
## 15 178 0.5000000 0.000000000
## 15 180 0.5000000 0.000000000
## 15 182 0.5000000 0.000000000
## 15 184 0.5000000 0.000000000
## 15 186 0.5000000 0.000000000
## 15 188 0.5000000 0.000000000
## 15 190 0.5000000 0.000000000
## 15 192 0.5000000 0.000000000
## 15 194 0.5000000 0.000000000
## 15 196 0.5000000 0.000000000
## 15 198 0.5000000 0.000000000
## 15 200 0.5000000 0.000000000
## Spec
## 0.9445044
## 0.9414063
## 0.9444885
## 0.9362411
## 0.9362675
## 0.9475601
## 0.9352313
## 0.9403965
## 0.9331853
## 0.9424531
## 0.9341792
## 0.9352419
## 0.9393867
## 0.9311393
## 0.9331747
## 0.9383241
## 0.9414327
## 0.9434787
## 0.9383188
## 0.9352524
## 0.9393550
## 0.9434523
## 0.9465451
## 0.9444885
## 0.9506794
## 0.9445096
## 0.9465662
## 0.9496167
## 0.9342585
## 0.9414274
## 0.9557917
## 0.9517103
## 0.9465662
## 0.9465556
## 0.9548084
## 0.9619773
## 0.9516997
## 0.9599366
## 0.9537616
## 0.9568543
## 0.9815015
## 0.9886968
## 0.9979487
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 0.9198202
## 0.9301084
## 0.9321755
## 0.9197991
## 0.9208776
## 0.9249696
## 0.9167539
## 0.9146868
## 0.9249485
## 0.9239387
## 0.9126355
## 0.9259847
## 0.9280412
## 0.9342004
## 0.9259952
## 0.9208459
## 0.9321279
## 0.9300872
## 0.9280465
## 0.9393656
## 0.9342162
## 0.9403807
## 0.9527359
## 0.9568385
## 0.9342321
## 0.9424372
## 0.9372773
## 0.9475813
## 0.9506688
## 0.9516733
## 0.9496484
## 0.9404229
## 0.9414274
## 0.9609622
## 0.9476183
## 0.9506846
## 0.9527306
## 0.9496801
## 0.9660693
## 0.9547925
## 0.9732805
## 0.9886915
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 0.9229130
## 0.9229289
## 0.9146762
## 0.9208512
## 0.9146709
## 0.9187999
## 0.9188052
## 0.9249643
## 0.9270209
## 0.9198150
## 0.9249485
## 0.9270262
## 0.9280465
## 0.9311182
## 0.9249379
## 0.9311393
## 0.9311182
## 0.9352313
## 0.9321597
## 0.9373090
## 0.9280518
## 0.9352419
## 0.9311182
## 0.9455141
## 0.9527148
## 0.9465556
## 0.9331959
## 0.9527095
## 0.9547766
## 0.9373037
## 0.9424478
## 0.9588951
## 0.9486016
## 0.9434840
## 0.9455300
## 0.9507005
## 0.9496484
## 0.9517526
## 0.9475760
## 0.9578588
## 0.9681364
## 0.9897066
## 0.9989744
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
## 1.0000000
##
## Tuning parameter 'nrounds' was held constant at a value
## value of 0.17
## Tuning parameter 'subsample' was
## held constant at a value of 0.8
## ROC was used to select the optimal model using the
## largest value.
## The final values used for the model were nrounds =
## colsample_bytree = 0.17, min_child_weight = 12
## and subsample = 0.8.
xgb_model.2 %>% saveRDS("./data/output/xgb/xgb_model.2.rds")
The best for max_depth = 5, min_child_weight = 12. Let’s take this parameter and tuner others.
xgb_grid <- expand.grid(nrounds = 40,
max_depth = 5,
eta = c(0.25),
gamma = 1,
colsample_bytree = seq(0.05, 0.7, 0.01),
min_child_weight = 12,
subsample = 0.8)
set.seed(12345678)
xgb_model.3 <- train(class ~ .,
data = df.train,
method = "xgbTree",
trControl = train_control,
tuneGrid = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.3
## eXtreme Gradient Boosting
##
## 1400 samples
## 35 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121
## Resampling results across tuning parameters:
##
## colsample_bytree ROC Sens Spec
## 0.05 0.8039593 0.2272777 0.9732805
## 0.06 0.8123201 0.3112449 0.9506794
## 0.07 0.8321331 0.3583311 0.9496326
## 0.08 0.8235153 0.3559781 0.9434946
## 0.09 0.8514210 0.4615869 0.9393762
## 0.10 0.8522279 0.4497674 0.9403859
## 0.11 0.8518078 0.5034200 0.9321597
## 0.12 0.8529059 0.4893570 0.9342321
## 0.13 0.8584172 0.5010944 0.9342268
## 0.14 0.8472329 0.5011491 0.9260058
## 0.15 0.8488754 0.5035568 0.9229025
## 0.16 0.8467481 0.5315458 0.9187840
## 0.17 0.8576850 0.5105062 0.9280730
## 0.18 0.8572149 0.5410397 0.9259794
## 0.19 0.8515742 0.5479891 0.9146973
## 0.20 0.8574852 0.5454993 0.9249326
## 0.21 0.8610720 0.5549658 0.9064499
## 0.22 0.8555261 0.5409576 0.9157177
## 0.23 0.8626182 0.5574829 0.9228760
## 0.24 0.8613935 0.5621341 0.9136453
## 0.25 0.8544855 0.5713817 0.9136611
## 0.26 0.8606732 0.5433105 0.9146868
## 0.27 0.8571888 0.5668126 0.9157177
## 0.28 0.8559451 0.5760876 0.9064552
## 0.29 0.8590332 0.5690834 0.9167063
## 0.30 0.8607166 0.5620520 0.9116204
## 0.31 0.8583121 0.5783858 0.9054137
## 0.32 0.8680037 0.5690014 0.9126408
## 0.33 0.8624078 0.5783584 0.9126302
## 0.34 0.8623350 0.5783858 0.9044251
## 0.35 0.8630722 0.5901778 0.9074914
## 0.36 0.8652446 0.5877702 0.9044039
## 0.37 0.8645034 0.5854446 0.9023315
## 0.38 0.8611421 0.5806840 0.9156860
## 0.39 0.8620319 0.5666758 0.9074544
## 0.40 0.8646351 0.5924487 0.9033888
## 0.41 0.8634601 0.5737620 0.9095163
## 0.42 0.8594687 0.5948837 0.9023526
## 0.43 0.8668948 0.5691108 0.9136664
## 0.44 0.8637347 0.5901778 0.9126460
## 0.45 0.8629746 0.5970725 0.8940999
## 0.46 0.8677471 0.5762517 0.9126249
## 0.47 0.8669960 0.5854720 0.9023368
## 0.48 0.8676400 0.5948290 0.9003119
## 0.49 0.8625358 0.5761149 0.9043828
## 0.50 0.8683363 0.6041587 0.9146868
## 0.51 0.8639989 0.5714364 0.9105736
## 0.52 0.8676573 0.5924761 0.9002802
## 0.53 0.8610158 0.5830917 0.8992598
## 0.54 0.8690980 0.5900410 0.9074755
## 0.55 0.8687565 0.5855540 0.9074755
## 0.56 0.8646249 0.6181943 0.9002802
## 0.57 0.8679794 0.5832011 0.9054296
## 0.58 0.8674274 0.5948564 0.9002855
## 0.59 0.8695200 0.5901778 0.9095533
## 0.60 0.8659532 0.5925581 0.9043933
## 0.61 0.8651930 0.5877428 0.9023473
## 0.62 0.8711825 0.5855540 0.9105630
## 0.63 0.8666612 0.6111902 0.8900132
## 0.64 0.8611872 0.5877702 0.8972297
## 0.65 0.8768339 0.5994528 0.9085223
## 0.66 0.8639603 0.5737346 0.8941211
## 0.67 0.8654696 0.5619973 0.9023421
## 0.68 0.8657598 0.5925034 0.9054137
## 0.69 0.8692144 0.6017784 0.8982554
## 0.70 0.8646742 0.5877155 0.9105578
##
## Tuning parameter 'nrounds' was held constant at a value
## value of 12
## Tuning parameter 'subsample' was held
## constant at a value of 0.8
## ROC was used to select the optimal model using the
## largest value.
## The final values used for the model were nrounds =
## colsample_bytree = 0.65, min_child_weight = 12
## and subsample = 0.8.
xgb_model.3 %>% saveRDS("./data/output/xgb/xgb_model.3.rds")
The best for colsample_bytree = 0.65. Let’s take this parameter and tuner others.
xgb_grid <- expand.grid(nrounds = 40,
max_depth = 5,
eta = c(0.25),
gamma = 1,
colsample_bytree = 0.65,
min_child_weight = 12,
subsample = seq(0.6, 0.9, 0.05))
set.seed(12345678)
xgb_model.4 <- train(class ~ .,
data = df.train,
method = "xgbTree",
trControl = train_control,
tuneGrid = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.4
## eXtreme Gradient Boosting
##
## 1400 samples
## 35 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121
## Resampling results across tuning parameters:
##
## subsample ROC Sens Spec
## 0.60 0.8531228 0.5690834 0.8951361
## 0.65 0.8576970 0.6088372 0.8951308
## 0.70 0.8564885 0.5785226 0.9044039
## 0.75 0.8655747 0.5830096 0.9074808
## 0.80 0.8654781 0.5901778 0.9043933
## 0.85 0.8698701 0.6112175 0.9002855
## 0.90 0.8689630 0.5899863 0.8961935
##
## Tuning parameter 'nrounds' was held constant at a value
## value of 0.65
## Tuning parameter 'min_child_weight' was
## held constant at a value of 12
## ROC was used to select the optimal model using the
## largest value.
## The final values used for the model were nrounds =
## colsample_bytree = 0.65, min_child_weight = 12
## and subsample = 0.85.
xgb_model.4 %>% saveRDS("./data/output/xgb/xgb_model.4.rds")
The best for subsample = 0.85. Let’s take this parameter and tuner others.
Let’s double number of trees and reduce by half learning rate
# Change learning rate and number of trees
xgb_grid <- expand.grid(nrounds = 80,
max_depth = 5,
eta = c(0.12),
gamma = 1,
colsample_bytree = 0.65,
min_child_weight = 12,
subsample = 0.85)
set.seed(12345678)
xgb_model.5 <- train(class ~ .,
data = df.train,
method = "xgbTree",
trControl = train_control,
tuneGrid = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.5
## eXtreme Gradient Boosting
##
## 1400 samples
## 35 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121
## Resampling results:
##
## ROC Sens Spec
## 0.8749795 0.5902599 0.911594
##
## Tuning parameter 'nrounds' was held constant at a value
## held constant at a value of 12
## Tuning
## parameter 'subsample' was held constant at a value of 0.85
xgb_model.5 %>% saveRDS("./data/output/xgb/xgb_model.5.rds")
Let’s double number of trees and reduce by half learning rate again
# Change learning rate and number of trees
xgb_grid <- expand.grid(nrounds = 160,
max_depth = 5,
eta = c(0.06),
gamma = 1,
colsample_bytree = 0.65,
min_child_weight = 12,
subsample = 0.85)
set.seed(12345678)
xgb_model.6 <- train(class ~ .,
data = df.train,
method = "xgbTree",
trControl = train_control,
tuneGrid = xgb_grid)
## Warning in train.default(x, y, weights = w, ...): The
## metric "Accuracy" was not in the result set. ROC will be
## used instead.
xgb_model.6
## eXtreme Gradient Boosting
##
## 1400 samples
## 35 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1120, 1119, 1120, 1120, 1121
## Resampling results:
##
## ROC Sens Spec
## 0.876145 0.6041313 0.9064605
##
## Tuning parameter 'nrounds' was held constant at a value
## held constant at a value of 12
## Tuning
## parameter 'subsample' was held constant at a value of 0.85
Compare models on test dataframe
# Load function to take accuracy and gini
source("./scripts/getAccuracyAndGini.R")
models <- c("1":"6")
# Apply the function to every model and build table
sapply(paste0("xgb_model.", models),
function(x) getAccuracyAndGini(model = get(x),
data = df.test,
target_variable = "class",
predicted_class = "good")
)
## xgb_model.1 xgb_model.2 xgb_model.3 xgb_model.4
## Accuracy 0.8116667 0.8016667 0.8033333 0.8083333
## Sensitivity 0.9217604 0.9242054 0.9046455 0.9095355
## Specificity 0.5759162 0.5392670 0.5863874 0.5916230
## Gini 0.7613385 0.7545796 0.7537347 0.7640523
## xgb_model.5 xgb_model.6
## Accuracy 0.8116667 0.8083333
## Sensitivity 0.9070905 0.9070905
## Specificity 0.6073298 0.5968586
## Gini 0.7606216 0.7695055
xgb_model.5 has the highest values, so we choose this one as final for xgb.
ROC.train <- pROC::roc(df.train$class,
predict(xgb_model.6,
df.train, type = "prob")[, "good"])
ROC.test <- pROC::roc(df.test$class,
predict(xgb_model.6,
df.test, type = "prob")[, "good"])
cat("AUC for train = ", pROC::auc(ROC.train),
", Gini for train = ", 2 * pROC::auc(ROC.train) - 1, "\n", sep = "")
## AUC for train = 0.960558, Gini for train = 0.921116
cat("AUC for test = ", pROC::auc(ROC.test),
", Gini for test = ", 2 * pROC::auc(ROC.test) - 1, "\n", sep = "")
## AUC for test = 0.8847527, Gini for test = 0.7695055
list(
ROC.train = ROC.train,
ROC.test = ROC.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"xgb = ",
round(100 * (2 * auc(ROC.train) - 1), 1), "%, ",
"Gini TEST: ",
"xgb = ",
round(100 * (2 * auc(ROC.test) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
In this section we will perform GBM model. Let’s start with clearing the environment and preparing df.
library(gbm)
library(dplyr)
library(caret)
rm(list=ls())
df <- read.csv("./data/output/df.csv")
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")
rfe_columns <- read.csv("./data/output/rfe_columns.csv")
rfe_columns <- c("class", rfe_columns$x)
df.train <- df.train[, rfe_columns]
df.test <- df.test[, rfe_columns]
# Make factor
df$class <- factor(df$class, levels = c(0, 1), labels = c("bad", "good"))
df.train$class <- factor(df.train$class, levels = c(0, 1), labels = c("bad", "good"))
df.test$class <- factor(df.test$class, levels = c(0, 1), labels = c("bad", "good"))
Function to define model formula
# define model formula - all variables apart from class
create_model_formula <- function(df, explained_variable){
model_formula <- paste(explained_variable, "~")
for (i in 2:length(colnames(df))){
if(i == 2){
model_formula <- paste(model_formula, colnames(df)[i])
} else {
model_formula <- paste(model_formula, " + ", colnames(df)[i])
}
}
return(as.formula(model_formula))
}
Define 2 model formulas - one for class (factor) and one for class_1 (number).
model_formula01 <- create_model_formula(df.train, "class_1")
model_formula <- create_model_formula(df.train, "class")
GBM requires target variable to be defined as 0 and 1, not factor so we do not change it. However we will also need variable “class” with factor. Define new variable - class_1 with 0 and 1.
df.train$class_1 <- (df.train$class == "good") * 1
df.test$class_1 <- (df.test$class == "good") * 1
In this section we will build simple GBM modelm without any hyperparameter tuning.
set.seed(123456789)
gbm.1 <-
gbm(model_formula01,
data = df.train,
distribution = "bernoulli",
# total number of trees
n.trees = 500,
# number of variable interactions - actually depth of the trees
interaction.depth = 4,
# shrinkage parameter - speed (pace) of learning
shrinkage = 0.01,
verbose = FALSE)
gbm.1 %>% saveRDS("./data/output/gbm/gbm.1.rds")
# predict train
df.pred.train.gbm <- predict(gbm.1,
df.train,
type = "response",
n.trees = 500)
# predict test
df.pred.test.gbm <- predict(gbm.1,
df.test,
type = "response",
n.trees = 500)
# Load function to get accuracy and gini values
source("./scripts/getAccuracyAndGini2.R")
# Training set
getAccuracyAndGini2(data = data.frame(class = df.train$class,
pred = df.pred.train.gbm),
predicted_probs = "pred",
target_variable = "class",
target_levels = c("good", "bad"),
predicted_class = "good")
## Accuracy Sensitivity Specificity Gini
## 0.8585714 0.9588900 0.6299766 0.8426509
# Test set
getAccuracyAndGini2(data = data.frame(class = df.test$class,
pred = df.pred.test.gbm),
predicted_probs = "pred",
target_variable = "class",
target_levels = c("good", "bad"),
predicted_class = "good")
## Accuracy Sensitivity Specificity Gini
## 0.7900000 0.9339853 0.4816754 0.7479486
ROC.train.gbm <- pROC::roc(df.train$class,
df.pred.train.gbm)
ROC.test.gbm <- pROC::roc(df.test$class,
df.pred.test.gbm)
cat("AUC for train = ", pROC::auc(ROC.train.gbm),
", Gini for train = ", 2 * pROC::auc(ROC.train.gbm) - 1, "\n", sep = "")
## AUC for train = 0.9213254, Gini for train = 0.8426509
cat("AUC for test = ", pROC::auc(ROC.test.gbm),
", Gini for test = ", 2 * pROC::auc(ROC.test.gbm) - 1, "\n", sep = "")
## AUC for test = 0.8739743, Gini for test = 0.7479486
Plot ROC
list(
ROC.train.gbm = ROC.train.gbm,
ROC.test.gbm = ROC.test.gbm
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"gbm = ",
round(100 * (2 * auc(ROC.train.gbm) - 1), 1), "%, ",
"Gini TEST: ",
"gbm = ",
round(100 * (2 * auc(ROC.test.gbm) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
parameters_gbm <- expand.grid(interaction.depth = c(1, 2, 4),
n.trees = c(100, 300, 500),
shrinkage = c(0.01, 0.1),
n.minobsinnode = c(150, 250, 400))
ctrl_cv3 <- trainControl(method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(123456789)
gbm.2 <- train(model_formula,
data = df.train,
distribution = "bernoulli",
method = "gbm",
tuneGrid = parameters_gbm,
trControl = ctrl_cv3,
verbose = FALSE)
gbm.2
## Stochastic Gradient Boosting
##
## 1400 samples
## 35 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1261, 1260, 1259, 1261, 1261, 1259, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees
## 0.01 1 150 100
## 0.01 1 150 300
## 0.01 1 150 500
## 0.01 1 250 100
## 0.01 1 250 300
## 0.01 1 250 500
## 0.01 1 400 100
## 0.01 1 400 300
## 0.01 1 400 500
## 0.01 2 150 100
## 0.01 2 150 300
## 0.01 2 150 500
## 0.01 2 250 100
## 0.01 2 250 300
## 0.01 2 250 500
## 0.01 2 400 100
## 0.01 2 400 300
## 0.01 2 400 500
## 0.01 4 150 100
## 0.01 4 150 300
## 0.01 4 150 500
## 0.01 4 250 100
## 0.01 4 250 300
## 0.01 4 250 500
## 0.01 4 400 100
## 0.01 4 400 300
## 0.01 4 400 500
## 0.10 1 150 100
## 0.10 1 150 300
## 0.10 1 150 500
## 0.10 1 250 100
## 0.10 1 250 300
## 0.10 1 250 500
## 0.10 1 400 100
## 0.10 1 400 300
## 0.10 1 400 500
## 0.10 2 150 100
## 0.10 2 150 300
## 0.10 2 150 500
## 0.10 2 250 100
## 0.10 2 250 300
## 0.10 2 250 500
## 0.10 2 400 100
## 0.10 2 400 300
## 0.10 2 400 500
## 0.10 4 150 100
## 0.10 4 150 300
## 0.10 4 150 500
## 0.10 4 250 100
## 0.10 4 250 300
## 0.10 4 250 500
## 0.10 4 400 100
## 0.10 4 400 300
## 0.10 4 400 500
## ROC Sens Spec
## 0.7792206 0.00000000 1.0000000
## 0.8038604 0.19706534 0.9774037
## 0.8158203 0.32563677 0.9476226
## 0.7510418 0.00000000 1.0000000
## 0.7793486 0.15919158 0.9835893
## 0.7869867 0.30238095 0.9455502
## NaN NaN NaN
## NaN NaN NaN
## NaN NaN NaN
## 0.7897975 0.02569214 0.9989796
## 0.8170165 0.38405316 0.9455397
## 0.8302039 0.45420819 0.9280875
## 0.7483803 0.00000000 1.0000000
## 0.7807060 0.14988926 0.9825584
## 0.7871220 0.30465116 0.9455502
## NaN NaN NaN
## NaN NaN NaN
## NaN NaN NaN
## 0.7937979 0.02563677 0.9989796
## 0.8168736 0.37940199 0.9445087
## 0.8289242 0.45658915 0.9321902
## 0.7506987 0.00000000 1.0000000
## 0.7799251 0.15215947 0.9825794
## 0.7876903 0.30470653 0.9414370
## NaN NaN NaN
## NaN NaN NaN
## NaN NaN NaN
## 0.8277510 0.46129568 0.9219440
## 0.8366069 0.56650055 0.8972649
## 0.8345561 0.57364341 0.8941721
## 0.7890732 0.40293466 0.9064801
## 0.7943637 0.45442968 0.8848832
## 0.7945992 0.44496124 0.8828424
## NaN NaN NaN
## NaN NaN NaN
## NaN NaN NaN
## 0.8389667 0.53156146 0.9034399
## 0.8451245 0.58078627 0.8890069
## 0.8445341 0.59241417 0.8941195
## 0.7928862 0.41926910 0.9075005
## 0.7972307 0.45437431 0.8879865
## 0.7948166 0.45431894 0.8838628
## NaN NaN NaN
## NaN NaN NaN
## NaN NaN NaN
## 0.8396336 0.53626800 0.9075531
## 0.8441383 0.55498339 0.8900484
## 0.8483632 0.59246955 0.8890069
## 0.7943256 0.41245847 0.9105933
## 0.7963135 0.44036545 0.8930781
## 0.7953236 0.44966777 0.8849043
## NaN NaN NaN
## NaN NaN NaN
## NaN NaN NaN
##
## ROC was used to select the optimal model using the
## largest value.
## The final values used for the model were n.trees =
## 500, interaction.depth = 4, shrinkage = 0.1
## and n.minobsinnode = 150.
res <- gbm.2$results
saveRDS(object = gbm.2,
file = "./data/output/gbm/gbm.2.rds")
# Predict train
df.pred.train.gbm2 <- predict(gbm.2,
df.train,
type = "prob",
n.trees = 500)
# Predict test
df.pred.test.gbm2 <- predict(gbm.2,
df.test,
type = "prob",
n.trees = 500)
# Training set
accuracy_gini_train <- getAccuracyAndGini2(data = data.frame(class = df.train$class,
pred = df.pred.train.gbm2[, "good"]),
predicted_probs = "pred",
target_variable = "class",
target_levels = c("good", "bad"),
predicted_class = "good")
# Test set
accuracy_gini_test <- getAccuracyAndGini2(data = data.frame(class = df.test$class,
pred = df.pred.test.gbm2[, "good"]),
predicted_probs = "pred",
target_variable = "class",
target_levels = c("good", "bad"),
predicted_class = "good")
write.csv(accuracy_gini_test, "./data/accuracy_gini_test.csv")
write.csv(accuracy_gini_train, "./data/accuracy_gini_train.csv")
ROC.train.gbm2 <- pROC::roc(df.train$class,
df.pred.train.gbm2[, "good"])
ROC.test.gbm2 <- pROC::roc(df.test$class,
df.pred.test.gbm2[, "good"])
cat("AUC for train = ", pROC::auc(ROC.train.gbm2),
", Gini for train = ", 2 * pROC::auc(ROC.train.gbm2) - 1, "\n", sep = "")
## AUC for train = 0.9359089, Gini for train = 0.8718178
cat("AUC for test = ", pROC::auc(ROC.test.gbm2),
", Gini for test = ", 2 * pROC::auc(ROC.test.gbm2) - 1, "\n", sep = "")
## AUC for test = 0.8558609, Gini for test = 0.7117219
list(
ROC.train.gbm2 = ROC.train.gbm2,
ROC.test.gbm2 = ROC.test.gbm2
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"gbm = ",
round(100 * (2 * auc(ROC.train.gbm2) - 1), 1), "%, ",
"Gini TEST: ",
"gbm = ",
round(100 * (2 * auc(ROC.test.gbm2) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
Let’s start with cleaning the environment and loading the data.
rm(list=ls())
df <- read.csv("./data/output/df.csv")
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")
df.train$class <- factor(df.train$class, levels = c(0, 1), labels = c("bad", "good"))
df.test$class <- factor(df.test$class, levels = c(0, 1), labels = c("bad", "good"))
Prepare df as before.
rfe_columns <- read.csv("./data/output/rfe_columns.csv")
rfe_columns <- c("class", rfe_columns$x)
df.train <- df.train[, rfe_columns]
df.test <- df.test[, rfe_columns]
Estimate first - basic model with all variables and default parameters
library(randomForest)
library(caret)
library(pROC)
library(dplyr)
set.seed(123456789)
rf <- randomForest(class ~ .,
data = df.train)
saveRDS(rf, file = "./data/output/rf/rf.RDS")
print(rf)
##
## Call:
## randomForest(formula = class ~ ., data = df.train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 13.5%
## Confusion matrix:
## bad good class.error
## bad 275 152 0.35597190
## good 37 936 0.03802672
Estimated out-of-bag-error: 13.57%
plot(rf)
The plot above helps determine the appropriate number of trees.
The dark solid line presents the total OOB error, the red line presents the prediction error for “good” response, while the green line for “bad” response.
As the number of trees is higher, the OOB error is smaller, and converges to approx. 120 trees.
Hence, let us try to limit number of trees and estimate the model on bootstrap samples from the full data set.
We also increase the number of predictors used from 5 to 10.
set.seed(123456789)
rf2 <-
randomForest(class ~ .,
data = df.train,
ntree = 120,
sampsize = nrow(df.train),
mtry = 10,
# minimum number of obs in the terminal nodes
nodesize = 100,
# we also generate predictors importance measures,
importance = TRUE)
saveRDS(rf2, file = "./data/output/rf/rf2.RDS")
print(rf2)
##
## Call:
## randomForest(formula = class ~ ., data = df.train, ntree = 120, sampsize = nrow(df.train), mtry = 10, nodesize = 100, importance = TRUE)
## Type of random forest: classification
## Number of trees: 120
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 21.07%
## Confusion matrix:
## bad good class.error
## bad 194 233 0.54566745
## good 62 911 0.06372045
Let us try to optimize the mtry parameter using the cross-validation process.
We consider 35 predictors in the model formula. Hence, let us try mtry between 2 and 31.
parameters_rf <- expand.grid(mtry = 2:31)
ctrl_oob <- trainControl(method = "oob", classProbs = TRUE)
set.seed(123456789)
rf3 <-
train(class ~ .,
data = df.train,
method = "rf",
ntree = 120,
nodesize = 100,
tuneGrid = parameters_rf,
trControl = ctrl_oob,
importance = TRUE)
rf3
## Random Forest
##
## 1400 samples
## 35 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7357143 0.1945491
## 3 0.7578571 0.2926595
## 4 0.7792857 0.3755810
## 5 0.7657143 0.3335094
## 6 0.7714286 0.3697438
## 7 0.7850000 0.4174569
## 8 0.7785714 0.4058685
## 9 0.7857143 0.4292998
## 10 0.7885714 0.4499675
## 11 0.7864286 0.4358065
## 12 0.7842857 0.4460444
## 13 0.7857143 0.4473408
## 14 0.7814286 0.4263970
## 15 0.7800000 0.4350386
## 16 0.7821429 0.4344730
## 17 0.7750000 0.4299468
## 18 0.7735714 0.4255165
## 19 0.7771429 0.4333763
## 20 0.7792857 0.4287088
## 21 0.7757143 0.4337547
## 22 0.7835714 0.4485495
## 23 0.7835714 0.4485495
## 24 0.7857143 0.4612746
## 25 0.7792857 0.4415932
## 26 0.7757143 0.4321573
## 27 0.7750000 0.4283329
## 28 0.7742857 0.4372718
## 29 0.7814286 0.4481813
## 30 0.7800000 0.4430078
## 31 0.7771429 0.4285191
##
## Accuracy was used to select the optimal model using
## the largest value.
## The final value used for the model was mtry = 10.
saveRDS(rf3, file = "./data/output/rf/rf3.RDS")
plot(rf3$results$mtry,
rf3$results$Accuracy, type = "b")
plot(rf3$results$mtry,
rf3$results$Kappa, type = "b")
modelLookup("ranger")
## model parameter label forReg
## 1 ranger mtry #Randomly Selected Predictors TRUE
## 2 ranger splitrule Splitting Rule TRUE
## 3 ranger min.node.size Minimal Node Size TRUE
## forClass probModel
## 1 TRUE TRUE
## 2 TRUE TRUE
## 3 TRUE TRUE
Let’s apply grid search and try to optimize mtry and min.node.size parameters
parameters_ranger <-
expand.grid(mtry = 4:15,
# split rule
splitrule = "gini",
# minimum size of the terminal node
min.node.size = c(50, 100, 150))
ctrl_cv5 <- trainControl(method = "cv",
number = 5,
classProbs = T)
set.seed(123456789)
rf3a <-
train(class ~ .,
data = df.train,
method = "ranger",
num.trees = 120, # default = 500
# numbers of processor cores to use in computations
num.threads = 3,
# impurity measure
importance = "impurity",
# parameters
tuneGrid = parameters_ranger,
trControl = ctrl_cv5)
saveRDS(rf3a, file = "./data/output/rf/rf3a.RDS")
rf3a
## Random Forest
##
## 1400 samples
## 35 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1120, 1120, 1119, 1120, 1121
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 4 50 0.7985856 0.4498375
## 4 100 0.7750090 0.3603292
## 4 150 0.7557257 0.2796670
## 5 50 0.7943025 0.4457750
## 5 100 0.7807335 0.3880632
## 5 150 0.7578737 0.3046890
## 6 50 0.8028765 0.4727373
## 6 100 0.7857284 0.4117220
## 6 150 0.7757360 0.3635327
## 7 50 0.7964555 0.4613643
## 7 100 0.7864248 0.4181166
## 7 150 0.7735982 0.3710574
## 8 50 0.8007361 0.4746418
## 8 100 0.7835881 0.4121572
## 8 150 0.7800192 0.3943897
## 9 50 0.8007361 0.4772653
## 9 100 0.7943050 0.4524745
## 9 150 0.7771595 0.3918235
## 10 50 0.7964607 0.4704375
## 10 100 0.7907233 0.4468123
## 10 150 0.7793024 0.4042882
## 11 50 0.7900167 0.4566906
## 11 100 0.7857309 0.4368765
## 11 150 0.7786085 0.4076478
## 12 50 0.8000142 0.4855970
## 12 100 0.7814452 0.4285498
## 12 150 0.7793151 0.4164727
## 13 50 0.7971622 0.4786724
## 13 100 0.7892896 0.4511495
## 13 150 0.7743100 0.4058392
## 14 50 0.7971672 0.4794029
## 14 100 0.7871544 0.4457953
## 14 150 0.7771595 0.4126403
## 15 50 0.7964453 0.4797614
## 15 100 0.7857309 0.4468190
## 15 150 0.7821569 0.4310092
##
## Tuning parameter 'splitrule' was held constant at a
## value of gini
## Accuracy was used to select the optimal model using
## the largest value.
## The final values used for the model were mtry =
## 6, splitrule = gini and min.node.size = 50.
plot(rf3a)
In this case the omptimal value for mtry is 6.
Let’s compare all of the models.
pred.train.rf <- predict(rf,
df.train,
type = "prob")[, "good"]
ROC.train.rf <- roc(as.numeric(df.train$class == "good"),
pred.train.rf)
pred.test.rf <- predict(rf,
df.test,
type = "prob")[, "good"]
ROC.test.rf <- roc(as.numeric(df.test$class == "good"),
pred.test.rf)
pred.train.rf2 <- predict(rf2,
df.train,
type = "prob")[, "good"]
ROC.train.rf2 <- roc(as.numeric(df.train$class == "good"),
pred.train.rf2)
pred.test.rf2 <- predict(rf2,
df.test,
type = "prob")[, "good"]
ROC.test.rf2 <- roc(as.numeric(df.test$class == "good"),
pred.test.rf2)
pred.train.rf3 <- predict(rf3,
df.train,
type = "prob")[, "good"]
ROC.train.rf3 <- roc(as.numeric(df.train$class == "good"),
pred.train.rf3)
pred.test.rf3 <- predict(rf3,
df.test,
type = "prob")[, "good"]
ROC.test.rf3 <- roc(as.numeric(df.test$class == "good"),
pred.test.rf3)
pred.train.rf3a <- predict(rf3a,
df.train,
type = "prob")[, "good"]
ROC.train.rf3a <- roc(as.numeric(df.train$class == "good"),
pred.train.rf3a)
pred.test.rf3a <- predict(rf3a,
df.test,
type = "prob")[, "good"]
ROC.test.rf3a <- roc(as.numeric(df.test$class == "good"),
pred.test.rf3a)
list(
ROC.train.rf = ROC.train.rf,
ROC.test.rf = ROC.test.rf,
ROC.train.rf2 = ROC.train.rf2,
ROC.test.rf2 = ROC.test.rf2,
ROC.train.rf3 = ROC.train.rf3,
ROC.test.rf3 = ROC.test.rf3,
ROC.train.rf3a = ROC.train.rf3a,
ROC.test.rf3a = ROC.test.rf3a
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(title = paste0("Gini TEST: ",
"rf = ",
round(100 * (2 * auc(ROC.test.rf) - 1), 1), "%, ",
"rf2 = ",
round(100 * (2 * auc(ROC.test.rf2) - 1), 1), "%, ",
"rf3 = ",
round(100 * (2 * auc(ROC.test.rf3) - 1), 1), "%, ",
"rf3a = ",
round(100 * (2 * auc(ROC.test.rf3a) - 1), 1), "% "),
subtitle = paste0("Gini TRAIN: ",
"rf = ",
round(100 * (2 * auc(ROC.train.rf) - 1), 1), "%, ",
"rf2 = ",
round(100 * (2 * auc(ROC.train.rf2) - 1), 1), "%, ",
"rf3 = ",
round(100 * (2 * auc(ROC.train.rf3) - 1), 1), "%, ",
"rf3a = ",
round(100 * (2 * auc(ROC.train.rf3a) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
Finally the random forest model with grid search (rf3a) with mtry = 6, splitrule = gini and min.node.size = 50 turned out to be the best.
In this section we tuned hyperparameters using tidymodels and tried to do it in more automated way.
In the first step we tune hyperparameters and take the best model parameters.
We built a function tune_two_stages (defined in a separate script) which takes the 2 grids of parameters and tunes models in two steps. Firstly it tunes one parameter, takes the best one and than it goes to the second grid tuning all parameters at the same time. Results are sorted by -mean value and parameters of three best models are taken and written as csv.
In the second step we take the results and check the performance of the models.
We built the functions summary_xgb, summary_random_forest, summary_decision_tree which takes the 3 best set of parameters and their names and check the performance of the model. We estimate the models using specific parameters on df.train, make predictions on df.train, df.test and check the performance using GINI and saving roc plots.
Let’s start with cleaning the environment and loading the data.
library(dplyr)
library(rpart)
library(rpart.plot)
library(xgboost)
library(plyr)
library(doParallel)
library(pROC)
library(ggplot2)
library(gbm)
library(caret)
library(workflows)
library(parsnip)
library(recipes)
library(dials)
library(rsample)
library(tune)
library(yardstick)
library(DT)
cores=detectCores()
registerDoParallel(cores=cores)
rm(list=ls())
cores=delightgbmcores=detectCores()
registerDoParallel(cores=cores)
source("./scripts/tune_functions.R")
df <- read.csv("./data/output/df.csv")
df.train <- read.csv("./data/output/df.train.csv")
df.test <- read.csv("./data/output/df.test.csv")
# Make factor
df$class <- factor(df$class, levels = c(0, 1), labels = c("bad", "good"))
df.train$class <- factor(df.train$class, levels = c(0, 1), labels = c("bad", "good"))
df.test$class <- factor(df.test$class, levels = c(0, 1), labels = c("bad", "good"))
# Load rfe columns
rfe_columns <- read.csv("./data/output/rfe_columns.csv")
rfe_columns <- c("class", rfe_columns$x)
# Adjust dfs to have rfe columns
df.train <- df.train[, rfe_columns]
df.test <- df.test[, rfe_columns]
# Load functions from tune_functions script
source("./scripts/tune_functions.R")
Define models and their hyperpatemeters to check.
# define parameters of models
models <- list(
xgboost = list(
model_name = 'xgboost',
model_1 = boost_tree(mode = "classification", engine = "xgboost", learn_rate = tune()),
model_2_args = list(tree_depth = tune(),
loss_reduction = tune(),
trees = tune(),
stop_iter = tune(),
min_n = tune()),
rec_spec = recipe(class ~ ., df.train),
grid_1 = set_seed_grid(grid_latin_hypercube(learn_rate(), size = 10), 123),
grid_2 = set_seed_grid(as_tibble(expand.grid(
tree_depth = c(6, 9, 12),
loss_reduction = c(0.5, 1, 2),
trees = c(100, 300, 500),
stop_iter = seq(10, 50, by=10),
min_n = seq(25, 100, by=25)
)), 123)
),
random_forest = list(
model_name = 'random_forest',
model_1 = rand_forest(mode = "classification", engine = "ranger", mtry = 5, trees = tune()),
model_2_args = list(min_n = tune()),
rec_spec = recipe(class ~ ., df.train),
grid_1 = set_seed_grid(grid_latin_hypercube(trees(), size = 50), 123),
grid_2 = set_seed_grid(as_tibble(expand.grid(
min_n = c(75, 100, 150, 300)
)), 123)
),
decision_tree = list(
model_name = 'decision_tree',
model_1 = decision_tree(mode = "classification", engine = "rpart", cost_complexity = tune()),
model_2_args = list(tree_depth = tune(), min_n = tune()),
rec_spec = recipe(class ~ ., df.train),
grid_1 = set_seed_grid(grid_latin_hypercube(cost_complexity(), size = 10), 123),
grid_2 = set_seed_grid(as_tibble(expand.grid(
tree_depth = c(6, 9, 12),
min_n = seq(25, 100, by=25)
)), 123)
)
)
# xgb tune
result_list_xgb <- list(tune_two_stages(models, "xgboost"))
result_list_xgb <- result_list_xgb[[1]]
# xgb summarise
tuned_params_xgb <- c("trees", "min_n", "tree_depth", "loss_reduction", "stop_iter", "learn_rate")
results <- result_list_xgb$result
final_results_xgb <- summary_xgb(results, tuned_params_xgb)
write.csv(final_results_xgb, "./data/output/xgb/final_results_xgb.csv")
result_list_random_forest <- list(tune_two_stages(models, "random_forest"))
result_list_random_forest <- result_list_random_forest[[1]]
# rf summarise
tuned_params_random_forest <- c("trees", "min_n")
results <- result_list_random_forest$result
final_results_random_forest <- summary_random_forest(results, tuned_params_random_forest)
write.csv(final_results_random_forest, "./data/output/rf/final_results_random_forest.csv")
result_decision_tree <- list(tune_two_stages(models, "decision_tree"))
result_decision_tree <- result_decision_tree[[1]]
# decision_tree summarise
tuned_params_decision_tree <- c("cost_complexity", "tree_depth", "min_n")
results <- result_decision_tree$result
final_results_decision_tree <- summary_decision_tree(results, tuned_params_decision_tree)
write.csv(final_results_decision_tree, "./data/output/dt/final_results_decision_tree.csv")
columns_to_compare <- c("iteration", "gini_train", "gini_test", "models_spec", "model_name")
final_results_xgb_c <- final_results_xgb[, columns_to_compare]
final_results_random_forest_c <- final_results_random_forest[, columns_to_compare]
final_results_decision_tree_c <- final_results_decision_tree[, columns_to_compare]
final_results_all <- rbind(final_results_xgb_c, final_results_random_forest_c, final_results_decision_tree_c) %>% arrange(-gini_test)
Plot ROC for the best model - rf1.
knitr::include_graphics("./plots/rf_1.png")
To sum up, we built several ML models using a few techniques. Finally, it turned out that the xgb model tuned by “one by one” method is the best with gini 92.1% on the df.train and 77% on df.test.
Random forest model tuned in more automated way, presented in the table above also turned out to have relatively high gini values (91.2% train, 77.7% df.test).
All results are presented in the attached pdf presentation.
Even though we used a few techniques which might increase the models accuracy and efficiency like handling missing values, transforming columns according to their characteristics, feature selection using Recursive Feature Elimination, cross validation, grid search and so on we believe that the accuracy might be even more increased for instance by using over or undersampling or some other methods, which we did not apply.