Data Manipulation
Import Library
# Import library
library(tidyverse)
library(caret)
library(dplyr)
library(fixest)
library(gtsummary)
library(ggthemes)
library(modelsummary)
library(glmnet)
library(kableExtra)
library(knitr)
library(MASS)
library(smotefamily)
library(class)
library(pROC)
library(scales)
library(reshape2)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gt)
library(factoextra)
library(xgboost)
library(doParallel)
Summary of what each library does:
Data Manipulation and Visualization:
tidyverse: A collection of R packages (e.g., ggplot2, dplyr, tidyr,
etc.) designed for data science tasks, including data manipulation,
visualization, and analysis.
dplyr: Part of the tidyverse, used for data manipulation tasks such
as filtering, summarizing, and joining datasets.
ggthemes: Adds additional themes and scales to ggplot2 for enhanced
visualization aesthetics.
scales: Provides tools for handling numeric scaling and
transformations in visualizations.
reshape2: Tools for reshaping data between wide and long formats for
analysis or visualization.
Statistical Modeling and Machine Learning
caret: A comprehensive package for training and evaluating machine
learning models with various preprocessing options and cross-validation
methods.
glmnet: Implements generalized linear models with penalties
like Lasso and Ridge regression for regularization.
MASS: Includes functions for statistical methods and datasets,
including linear and nonlinear modeling.
smotefamily: Provides oversampling methods like SMOTE for dealing
with imbalanced datasets.
class: Includes basic functions for classification like k-Nearest
Neighbors (k-NN).
randomForest: Implements random forest algorithms for classification
and regression tasks.
xgboost: A scalable and efficient gradient-boosting framework for
machine learning tasks.
Summarization and Reporting
gtsummary: Generates publication-ready summary tables for statistical
analyses.
modelsummary: Creates summary tables for regression models and other
statistical outputs.
kableExtra: Enhances knitr::kable() tables with advanced formatting
options for reporting.
knitr: Integrates R code into reports and generates dynamic documents
(e.g., RMarkdown).
Visualization:
rpart.plot: Visualizes decision tree models created by rpart.
factoextra: Visualizes multivariate data analyses like PCA and
clustering.
rpart: Implements recursive partitioning and regression trees.
Statistical Analysis:
pROC: Tools for visualizing and analyzing receiver operating
characteristic (ROC) curves.
gt: Constructs publication-quality tables.
Model Training:
doParallel: Enables parallel processing to speed up model training.
Import the Data
First, we import our data. Since we have already downloaded the data
file from Kaggle, we use read.csv
to read
our data into a data frame here.
# Read csv file and import the data
credit <- read.csv("application_train.csv")
Data Overview
Next, we start with a summary of this dataset to get an
overview of our data.
# Create summary statistics of the dataset before processing
summary_table <- credit %>%
summarise(
`Number of rows` = n(),
`Number of columns` = ncol(.),
`Character columns` = sum(sapply(., is.character)),
`Numeric columns` = sum(sapply(., is.numeric))
) %>%
t() %>%
as.data.frame()
colnames(summary_table) <- c("Values")
summary_table <- tibble::rownames_to_column(summary_table, "Metric")
# Create a styled table using `gt`
gt_summary <- summary_table %>%
gt() %>%
tab_header(
title = "Data Summary"
)
# Display the styled summary
gt_summary
Data Summary

Metric | Values
Number of rows | 307511
Number of columns | 122
Character columns | 16
Numeric columns | 106
There are a total of 307,511 observations and 122 variables in our
dataset; 16 of these variables are character columns and 106 are
numeric columns, which may be binary or continuous. Since the variables
are diverse in type and range, we will group and select them carefully
to avoid multicollinearity and overfitting.
Now we split our data into training and test sets. Because this is a
large dataset, we use an 80/20 split so that the training data can
capture the underlying patterns well.
# Set a random seed for reproducibility
set.seed(123)
# Split the data into 80% training and 20% testing, stratified on TARGET
train_index <- createDataPartition(credit$TARGET, p = 0.8, list = FALSE)
# Create training and testing datasets
train_data <- credit[train_index, ]
test_data <- credit[-train_index, ]
# Save the training and testing sets to CSV files (for others to download)
#write.csv(train_data, "train_data.csv", row.names = FALSE)
#write.csv(test_data, "test_data.csv", row.names = FALSE)
Because we want other users to work with exactly the same split as we
do, we save the training and testing data into csv
files and read them back in below.
# Read the saved training and testing datasets back in
train <- read.csv("train_data.csv")
test_data <- read.csv("test_data.csv")
Next, we are going to summarize the missing values in the
data.
Missing Values
When working with any dataset, we should handle missing values
carefully to keep the analysis dependable and accurate. In character
columns we may encounter empty strings; we convert these into NA
values so the missing entries are easy to identify and fill
in later.
train <- train %>%
mutate(across(where(~ is.character(.) | is.factor(.)), ~ na_if(., "")))
Next, we create a missing value summary from this dataset.
# Calculate missing values summary
missing_summary <- train %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing Count") %>%
mutate(
`Total Rows` = nrow(train),
`Missing Percentage (%)` = (`Missing Count` / `Total Rows`) * 100
) %>%
arrange(desc(`Missing Count`))
# Create a styled table using `gt`
gt_missing_summary <- missing_summary %>%
gt() %>%
tab_header(
title = "Missing Values Summary"
) %>%
fmt_number(
columns = `Missing Count`,
decimals = 0
) %>%
fmt_number(
columns = `Missing Percentage (%)`,
decimals = 2
)
# Display the styled table
gt_missing_summary
Missing Values Summary

Column | Missing Count | Total Rows | Missing Percentage (%)
COMMONAREA_AVG | 171,839 | 246009 | 69.85
COMMONAREA_MODE | 171,839 | 246009 | 69.85
COMMONAREA_MEDI | 171,839 | 246009 | 69.85
NONLIVINGAPARTMENTS_AVG | 170,786 | 246009 | 69.42
NONLIVINGAPARTMENTS_MODE | 170,786 | 246009 | 69.42
NONLIVINGAPARTMENTS_MEDI | 170,786 | 246009 | 69.42
FONDKAPREMONT_MODE | 168,188 | 246009 | 68.37
LIVINGAPARTMENTS_AVG | 168,144 | 246009 | 68.35
LIVINGAPARTMENTS_MODE | 168,144 | 246009 | 68.35
LIVINGAPARTMENTS_MEDI | 168,144 | 246009 | 68.35
FLOORSMIN_AVG | 166,874 | 246009 | 67.83
FLOORSMIN_MODE | 166,874 | 246009 | 67.83
FLOORSMIN_MEDI | 166,874 | 246009 | 67.83
YEARS_BUILD_AVG | 163,561 | 246009 | 66.49
YEARS_BUILD_MODE | 163,561 | 246009 | 66.49
YEARS_BUILD_MEDI | 163,561 | 246009 | 66.49
OWN_CAR_AGE | 162,155 | 246009 | 65.91
LANDAREA_AVG | 146,024 | 246009 | 59.36
LANDAREA_MODE | 146,024 | 246009 | 59.36
LANDAREA_MEDI | 146,024 | 246009 | 59.36
BASEMENTAREA_AVG | 143,879 | 246009 | 58.49
BASEMENTAREA_MODE | 143,879 | 246009 | 58.49
BASEMENTAREA_MEDI | 143,879 | 246009 | 58.49
EXT_SOURCE_1 | 138,600 | 246009 | 56.34
NONLIVINGAREA_AVG | 135,710 | 246009 | 55.16
NONLIVINGAREA_MODE | 135,710 | 246009 | 55.16
NONLIVINGAREA_MEDI | 135,710 | 246009 | 55.16
ELEVATORS_AVG | 131,046 | 246009 | 53.27
ELEVATORS_MODE | 131,046 | 246009 | 53.27
ELEVATORS_MEDI | 131,046 | 246009 | 53.27
WALLSMATERIAL_MODE | 125,060 | 246009 | 50.84
APARTMENTS_AVG | 124,810 | 246009 | 50.73
APARTMENTS_MODE | 124,810 | 246009 | 50.73
APARTMENTS_MEDI | 124,810 | 246009 | 50.73
ENTRANCES_AVG | 123,819 | 246009 | 50.33
ENTRANCES_MODE | 123,819 | 246009 | 50.33
ENTRANCES_MEDI | 123,819 | 246009 | 50.33
LIVINGAREA_AVG | 123,459 | 246009 | 50.18
LIVINGAREA_MODE | 123,459 | 246009 | 50.18
LIVINGAREA_MEDI | 123,459 | 246009 | 50.18
HOUSETYPE_MODE | 123,432 | 246009 | 50.17
FLOORSMAX_AVG | 122,382 | 246009 | 49.75
FLOORSMAX_MODE | 122,382 | 246009 | 49.75
FLOORSMAX_MEDI | 122,382 | 246009 | 49.75
YEARS_BEGINEXPLUATATION_AVG | 119,950 | 246009 | 48.76
YEARS_BEGINEXPLUATATION_MODE | 119,950 | 246009 | 48.76
YEARS_BEGINEXPLUATATION_MEDI | 119,950 | 246009 | 48.76
TOTALAREA_MODE | 118,708 | 246009 | 48.25
EMERGENCYSTATE_MODE | 116,568 | 246009 | 47.38
OCCUPATION_TYPE | 77,082 | 246009 | 31.33
EXT_SOURCE_3 | 48,872 | 246009 | 19.87
AMT_REQ_CREDIT_BUREAU_HOUR | 33,285 | 246009 | 13.53
AMT_REQ_CREDIT_BUREAU_DAY | 33,285 | 246009 | 13.53
AMT_REQ_CREDIT_BUREAU_WEEK | 33,285 | 246009 | 13.53
AMT_REQ_CREDIT_BUREAU_MON | 33,285 | 246009 | 13.53
AMT_REQ_CREDIT_BUREAU_QRT | 33,285 | 246009 | 13.53
AMT_REQ_CREDIT_BUREAU_YEAR | 33,285 | 246009 | 13.53
NAME_TYPE_SUITE | 1,030 | 246009 | 0.42
OBS_30_CNT_SOCIAL_CIRCLE | 806 | 246009 | 0.33
DEF_30_CNT_SOCIAL_CIRCLE | 806 | 246009 | 0.33
OBS_60_CNT_SOCIAL_CIRCLE | 806 | 246009 | 0.33
DEF_60_CNT_SOCIAL_CIRCLE | 806 | 246009 | 0.33
EXT_SOURCE_2 | 531 | 246009 | 0.22
AMT_GOODS_PRICE | 229 | 246009 | 0.09
AMT_ANNUITY | 12 | 246009 | 0.00
CNT_FAM_MEMBERS | 2 | 246009 | 0.00
DAYS_LAST_PHONE_CHANGE | 1 | 246009 | 0.00
SK_ID_CURR | 0 | 246009 | 0.00
TARGET | 0 | 246009 | 0.00
NAME_CONTRACT_TYPE | 0 | 246009 | 0.00
CODE_GENDER | 0 | 246009 | 0.00
FLAG_OWN_CAR | 0 | 246009 | 0.00
FLAG_OWN_REALTY | 0 | 246009 | 0.00
CNT_CHILDREN | 0 | 246009 | 0.00
AMT_INCOME_TOTAL | 0 | 246009 | 0.00
AMT_CREDIT | 0 | 246009 | 0.00
NAME_INCOME_TYPE | 0 | 246009 | 0.00
NAME_EDUCATION_TYPE | 0 | 246009 | 0.00
NAME_FAMILY_STATUS | 0 | 246009 | 0.00
NAME_HOUSING_TYPE | 0 | 246009 | 0.00
REGION_POPULATION_RELATIVE | 0 | 246009 | 0.00
DAYS_BIRTH | 0 | 246009 | 0.00
DAYS_EMPLOYED | 0 | 246009 | 0.00
DAYS_REGISTRATION | 0 | 246009 | 0.00
DAYS_ID_PUBLISH | 0 | 246009 | 0.00
FLAG_MOBIL | 0 | 246009 | 0.00
FLAG_EMP_PHONE | 0 | 246009 | 0.00
FLAG_WORK_PHONE | 0 | 246009 | 0.00
FLAG_CONT_MOBILE | 0 | 246009 | 0.00
FLAG_PHONE | 0 | 246009 | 0.00
FLAG_EMAIL | 0 | 246009 | 0.00
REGION_RATING_CLIENT | 0 | 246009 | 0.00
REGION_RATING_CLIENT_W_CITY | 0 | 246009 | 0.00
WEEKDAY_APPR_PROCESS_START | 0 | 246009 | 0.00
HOUR_APPR_PROCESS_START | 0 | 246009 | 0.00
REG_REGION_NOT_LIVE_REGION | 0 | 246009 | 0.00
REG_REGION_NOT_WORK_REGION | 0 | 246009 | 0.00
LIVE_REGION_NOT_WORK_REGION | 0 | 246009 | 0.00
REG_CITY_NOT_LIVE_CITY | 0 | 246009 | 0.00
REG_CITY_NOT_WORK_CITY | 0 | 246009 | 0.00
LIVE_CITY_NOT_WORK_CITY | 0 | 246009 | 0.00
ORGANIZATION_TYPE | 0 | 246009 | 0.00
FLAG_DOCUMENT_2 | 0 | 246009 | 0.00
FLAG_DOCUMENT_3 | 0 | 246009 | 0.00
FLAG_DOCUMENT_4 | 0 | 246009 | 0.00
FLAG_DOCUMENT_5 | 0 | 246009 | 0.00
FLAG_DOCUMENT_6 | 0 | 246009 | 0.00
FLAG_DOCUMENT_7 | 0 | 246009 | 0.00
FLAG_DOCUMENT_8 | 0 | 246009 | 0.00
FLAG_DOCUMENT_9 | 0 | 246009 | 0.00
FLAG_DOCUMENT_10 | 0 | 246009 | 0.00
FLAG_DOCUMENT_11 | 0 | 246009 | 0.00
FLAG_DOCUMENT_12 | 0 | 246009 | 0.00
FLAG_DOCUMENT_13 | 0 | 246009 | 0.00
FLAG_DOCUMENT_14 | 0 | 246009 | 0.00
FLAG_DOCUMENT_15 | 0 | 246009 | 0.00
FLAG_DOCUMENT_16 | 0 | 246009 | 0.00
FLAG_DOCUMENT_17 | 0 | 246009 | 0.00
FLAG_DOCUMENT_18 | 0 | 246009 | 0.00
FLAG_DOCUMENT_19 | 0 | 246009 | 0.00
FLAG_DOCUMENT_20 | 0 | 246009 | 0.00
FLAG_DOCUMENT_21 | 0 | 246009 | 0.00
Based on the table above, 49 columns have more than 47%
missing data. This is concerning, since columns with such large gaps
cannot represent the population appropriately. The remaining
variables with missing values will be filled in with appropriate
values.
Columns with more than 47% missing data are dropped for several
reasons. First, columns with excessive missing values provide limited
statistical or predictive utility. The absence of data diminishes their
overall relevance and reduces their contribution to meaningful analysis.
Second, imputing missing values for columns with such a high proportion
of missing data introduces significant challenges. The imputation
process is likely to result in considerable bias or inaccuracies, which
could compromise the reliability of the dataset. Third, removing these
columns enhances computational efficiency and simplifies the dataset.
This reduction in complexity can improve the performance of subsequent
analyses or predictive models.
For instance, specific columns like COMMONAREA_AVG and
YEARS_BUILD_AVG have missing percentages exceeding 65%, indicating that
the data for these variables is highly incomplete. Although these
columns might contain potentially valuable information, the high
proportion of missing values makes reliable imputation impractical. By
setting the threshold for missing data at 47%, we aim to balance data
retention and quality. This approach ensures that enough information is
preserved while minimizing the risk of introducing excessive noise. For
example, columns like YEARS_BEGINEXPLUATATION_AVG, which has 48.76%
missing data, are excluded to maintain the overall integrity of the
analysis.
The cleaned dataset will contain more reliable variables for
analysis, helping the models we build remain accurate and robust
rather than being distorted by incomplete information.
Data Imputation
For the first step of cleaning the data, we drop all
columns with more than 47% missing values. As mentioned above, columns
with this much missing data are unreliable for our project
because of the large information gap. Furthermore, there is no
defensible way to choose fill-in values for them that would give an
accurate and unbiased result.
# Addressing the value of threshold.
missing_threshold <- 47
# Calculate missing percentage for each column
missing_summary <- train %>%
summarise(across(everything(), ~ sum(is.na(.)) / nrow(train) * 100)) %>%
pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Percentage")
# Identify columns to drop
columns_to_drop <- missing_summary %>%
filter(Missing_Percentage > missing_threshold) %>%
pull(Column)
# Drop the identified columns
train_drop <- train %>%
dplyr::select(-all_of(columns_to_drop))
Next, we deal with missing numerical data. We start with
summary statistics for the numerical columns that contain missing values.
# Identify columns with missing data and calculate the percentage of missing values
missing_data <- sapply(train_drop, function(x) sum(is.na(x)) / length(x) * 100)
# Select only numerical columns with missing data
missing_columns <- names(missing_data[missing_data > 0])
numerical_columns <- train[missing_columns] %>% select_if(is.numeric)
# Check that missing_columns are available in the dataset
valid_missing_columns <- intersect(missing_columns, names(numerical_columns))
# Create a table showing the missing data percentage and the data type for these columns
summary_stats <- data.frame(
Column = valid_missing_columns,
Missing_Percentage = missing_data[valid_missing_columns],
Data_Type = sapply(numerical_columns[valid_missing_columns], class),
Mean = sapply(numerical_columns[valid_missing_columns], function(x) mean(x, na.rm = TRUE)),
Median = sapply(numerical_columns[valid_missing_columns], function(x) median(x, na.rm = TRUE)),
Std_Dev = sapply(numerical_columns[valid_missing_columns], function(x) sd(x, na.rm = TRUE))
) %>%
arrange(desc(Missing_Percentage))
# Display the summary table with gt
summary_stats %>%
gt() %>%
tab_header(
title = "Numerical Columns with Missing Data and Summary Statistics"
) %>%
fmt_number(
columns = c(Missing_Percentage, Mean, Median, Std_Dev),
decimals = 2
) %>%
cols_label(
Column = "Variable",
Missing_Percentage = "Percentage (%)",
Data_Type = "Data Type",
Mean = "Mean",
Median = "Median",
Std_Dev = "Standard Deviation"
)
Numerical Columns with Missing Data and Summary Statistics

Variable | Percentage (%) | Data Type | Mean | Median | Standard Deviation
EXT_SOURCE_3 | 19.87 | numeric | 0.51 | 0.54 | 0.20
AMT_REQ_CREDIT_BUREAU_HOUR | 13.53 | integer | 0.01 | 0.00 | 0.08
AMT_REQ_CREDIT_BUREAU_DAY | 13.53 | integer | 0.01 | 0.00 | 0.11
AMT_REQ_CREDIT_BUREAU_WEEK | 13.53 | integer | 0.03 | 0.00 | 0.21
AMT_REQ_CREDIT_BUREAU_MON | 13.53 | integer | 0.27 | 0.00 | 0.91
AMT_REQ_CREDIT_BUREAU_QRT | 13.53 | integer | 0.26 | 0.00 | 0.61
AMT_REQ_CREDIT_BUREAU_YEAR | 13.53 | integer | 1.90 | 1.00 | 1.87
OBS_30_CNT_SOCIAL_CIRCLE | 0.33 | integer | 1.43 | 0.00 | 2.43
DEF_30_CNT_SOCIAL_CIRCLE | 0.33 | integer | 0.14 | 0.00 | 0.45
OBS_60_CNT_SOCIAL_CIRCLE | 0.33 | integer | 1.41 | 0.00 | 2.40
DEF_60_CNT_SOCIAL_CIRCLE | 0.33 | integer | 0.10 | 0.00 | 0.36
EXT_SOURCE_2 | 0.22 | numeric | 0.51 | 0.57 | 0.19
AMT_GOODS_PRICE | 0.09 | numeric | 538,582.32 | 450,000.00 | 369,619.82
AMT_ANNUITY | 0.00 | numeric | 27,128.99 | 24,903.00 | 14,516.26
CNT_FAM_MEMBERS | 0.00 | integer | 2.15 | 2.00 | 0.91
DAYS_LAST_PHONE_CHANGE | 0.00 | integer | -962.51 | -758.00 | 826.33
For the numerical data, appropriate imputation strategies will be
applied based on the characteristics of the variables and their
context.
For columns that represent counts of events, missing values are
likely indicative of the absence of such events. In these cases, zero
imputation is a logical choice because it aligns with the interpretation
of no activity. This approach ensures that the data remains consistent
with its intended meaning without introducing bias.
For features with skewed distributions, such as external score
variables, median imputation is more suitable. The median is robust
against outliers and better reflects the central tendency of the data
compared to the mean, making it a reliable method to address missing
values while preserving the original distribution.
Columns that track social circle activity or defaults often include
many zeros, and missing values may signify no activity or no defaults.
For these variables, zero imputation is a natural choice because it
aligns with the context of no observed activity. Alternatively, mode
imputation could be considered if the mode represents a commonly
observed non-zero value.
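To make the mode-imputation alternative concrete, here is a minimal sketch; the impute_mode helper is hypothetical and is not part of our final pipeline, which uses zero or median imputation for these columns.
# Hypothetical helper illustrating mode imputation (not used in our final pipeline)
impute_mode <- function(x) {
  mode_value <- as.numeric(names(sort(table(x), decreasing = TRUE))[1]) # most frequent value
  ifelse(is.na(x), mode_value, x)
}
# Example usage on a social-circle column (illustrative only):
# train_drop <- train_drop %>%
#   mutate(DEF_30_CNT_SOCIAL_CIRCLE = impute_mode(DEF_30_CNT_SOCIAL_CIRCLE))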
For monetary values that are highly variable and skewed, median
imputation is also appropriate. This method ensures that missing data is
replaced in a way that avoids distortion from extreme values,
maintaining the integrity of the overall distribution.
Finally, for numeric variables with small percentages of missing
data, mean imputation is suitable. This approach works well for data
that is normally distributed or nearly normal and is unlikely to
introduce significant bias due to the low proportion of missing values.
Using the mean ensures that the imputed values remain consistent with
the dataset’s overall trends.
By tailoring the imputation method to the nature of each variable,
this strategy preserves the accuracy and reliability of the dataset for
subsequent analysis.
# Replace missing values using the median or mean as appropriate
train_fill <- train_drop %>%
mutate(
EXT_SOURCE_3 = ifelse(is.na(EXT_SOURCE_3), median(EXT_SOURCE_3, na.rm = TRUE), EXT_SOURCE_3),
AMT_REQ_CREDIT_BUREAU_HOUR = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_HOUR), 0, AMT_REQ_CREDIT_BUREAU_HOUR),
AMT_REQ_CREDIT_BUREAU_DAY = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_DAY), 0, AMT_REQ_CREDIT_BUREAU_DAY),
AMT_REQ_CREDIT_BUREAU_WEEK = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_WEEK), 0, AMT_REQ_CREDIT_BUREAU_WEEK),
AMT_REQ_CREDIT_BUREAU_MON = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_MON), median(AMT_REQ_CREDIT_BUREAU_MON, na.rm = TRUE), AMT_REQ_CREDIT_BUREAU_MON),
AMT_REQ_CREDIT_BUREAU_QRT = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_QRT), median(AMT_REQ_CREDIT_BUREAU_QRT, na.rm = TRUE), AMT_REQ_CREDIT_BUREAU_QRT),
AMT_REQ_CREDIT_BUREAU_YEAR = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_YEAR), median(AMT_REQ_CREDIT_BUREAU_YEAR, na.rm = TRUE), AMT_REQ_CREDIT_BUREAU_YEAR),
OBS_30_CNT_SOCIAL_CIRCLE = ifelse(is.na(OBS_30_CNT_SOCIAL_CIRCLE), median(OBS_30_CNT_SOCIAL_CIRCLE, na.rm = TRUE), OBS_30_CNT_SOCIAL_CIRCLE),
DEF_30_CNT_SOCIAL_CIRCLE = ifelse(is.na(DEF_30_CNT_SOCIAL_CIRCLE), median(DEF_30_CNT_SOCIAL_CIRCLE, na.rm = TRUE), DEF_30_CNT_SOCIAL_CIRCLE),
OBS_60_CNT_SOCIAL_CIRCLE = ifelse(is.na(OBS_60_CNT_SOCIAL_CIRCLE), median(OBS_60_CNT_SOCIAL_CIRCLE, na.rm = TRUE), OBS_60_CNT_SOCIAL_CIRCLE),
DEF_60_CNT_SOCIAL_CIRCLE = ifelse(is.na(DEF_60_CNT_SOCIAL_CIRCLE), median(DEF_60_CNT_SOCIAL_CIRCLE, na.rm = TRUE), DEF_60_CNT_SOCIAL_CIRCLE),
EXT_SOURCE_2 = ifelse(is.na(EXT_SOURCE_2), median(EXT_SOURCE_2, na.rm = TRUE), EXT_SOURCE_2),
AMT_GOODS_PRICE = ifelse(is.na(AMT_GOODS_PRICE), median(AMT_GOODS_PRICE, na.rm = TRUE), AMT_GOODS_PRICE),
AMT_ANNUITY = ifelse(is.na(AMT_ANNUITY), median(AMT_ANNUITY, na.rm = TRUE), AMT_ANNUITY),
CNT_FAM_MEMBERS = ifelse(is.na(CNT_FAM_MEMBERS), median(CNT_FAM_MEMBERS, na.rm = TRUE), CNT_FAM_MEMBERS),
DAYS_LAST_PHONE_CHANGE = ifelse(is.na(DAYS_LAST_PHONE_CHANGE), median(DAYS_LAST_PHONE_CHANGE, na.rm = TRUE), DAYS_LAST_PHONE_CHANGE)
)
Similar to the numerical columns, we examine the character columns with
missing values using a summary statistics table.
# Step 1: Identify character columns
character_columns <- train_drop %>% dplyr::select(where(is.character))
# Step 2: Calculate missing data percentage, number of missing data, and other summary statistics
summary_stats <- character_columns %>%
reframe(
Column = names(.),
Missing_Count = sapply(., function(x) sum(is.na(x))), # Number of missing values
Missing_Percentage = sapply(., function(x) sum(is.na(x)) / length(x) * 100), # Percentage of missing values
Unique_Values = sapply(., function(x) length(unique(x))), # Count of unique values
Mode = sapply(., function(x) {
if (length(unique(x)) == 1) return(unique(x))
else return(names(sort(table(x), decreasing = TRUE))[1]) # Mode or most frequent value
})
) %>%
arrange(desc(Missing_Percentage))
# Step 3: Display summary table with gt
summary_stats %>%
gt() %>%
tab_header(
title = "Character Columns Summary"
) %>%
fmt_number(
columns = c(Missing_Percentage, Missing_Count),
decimals = 2
) %>%
cols_label(
Column = "Variable",
Missing_Count = "Missing Count",
Missing_Percentage = "Percentage (%)",
Unique_Values = "Unique Values",
Mode = "Most Frequent Value"
)
Character Columns Summary

Variable | Missing Count | Percentage (%) | Unique Values | Most Frequent Value
OCCUPATION_TYPE | 77,082.00 | 31.33 | 19 | Laborers
NAME_TYPE_SUITE | 1,030.00 | 0.42 | 8 | Unaccompanied
NAME_CONTRACT_TYPE | 0.00 | 0.00 | 2 | Cash loans
CODE_GENDER | 0.00 | 0.00 | 3 | F
FLAG_OWN_CAR | 0.00 | 0.00 | 2 | N
FLAG_OWN_REALTY | 0.00 | 0.00 | 2 | Y
NAME_INCOME_TYPE | 0.00 | 0.00 | 8 | Working
NAME_EDUCATION_TYPE | 0.00 | 0.00 | 5 | Secondary / secondary special
NAME_FAMILY_STATUS | 0.00 | 0.00 | 6 | Married
NAME_HOUSING_TYPE | 0.00 | 0.00 | 6 | House / apartment
WEEKDAY_APPR_PROCESS_START | 0.00 | 0.00 | 7 | TUESDAY
ORGANIZATION_TYPE | 0.00 | 0.00 | 58 | Business Entity Type 3
For categorical variables with missing data, appropriate strategies
will be applied based on the proportion of missing values and the
significance of the feature.
For OCCUPATION_TYPE, the proportion of missing data is significant at
31.33%. Dropping rows with missing values would result in considerable
data loss, which is undesirable. Occupation is a key socio-economic
variable that is likely predictive of an applicant’s ability to repay a
loan. Additionally, missing occupation data might represent a specific
group of applicants, such as those with irregular jobs or no formal
employment, who could exhibit distinct loan repayment behaviors. To
address this, missing values will be replaced with the category
“Unknown.” This approach preserves the data for analysis and treats
“Unknown” as a meaningful category, ensuring that the potential
predictive power of this feature is retained even when specific
occupation information is unavailable.
For NAME_TYPE_SUITE, the proportion of missing data is much lower, at
just 0.42%. This feature may reflect the applicant’s living situation,
such as whether they are unaccompanied, which could have some relevance
to loan repayment behavior. Although this variable is likely less
critical than occupation, it still contributes to the analysis. Missing
values will be replaced with the most frequent category,
“Unaccompanied.” Since the missing data is minimal, this approach
maintains the feature’s utility without significantly affecting the
overall quality of the dataset.
By using these strategies, the integrity and predictive value of the
dataset will be preserved while ensuring that missing data is handled in
a meaningful and contextually appropriate manner.
# Handling missing values for OCCUPATION_TYPE
train_filled <- train_fill %>%
mutate(OCCUPATION_TYPE = ifelse(is.na(OCCUPATION_TYPE), "Unknown", OCCUPATION_TYPE)) %>%
mutate(NAME_TYPE_SUITE = ifelse(is.na(NAME_TYPE_SUITE), "Unaccompanied", NAME_TYPE_SUITE))
Next, some of the numerical columns (the DAYS_* variables) contain
negative values, which reflects how the data was recorded (days counted
relative to the application date). We therefore take the absolute values
of those columns.
train_filled <- train_filled %>%
mutate(across(c(DAYS_BIRTH, DAYS_LAST_PHONE_CHANGE, DAYS_EMPLOYED, DAYS_ID_PUBLISH, DAYS_REGISTRATION), abs))
# Reorder and relabel the education categories
train_filled$NAME_EDUCATION_TYPE <- factor(
train_filled$NAME_EDUCATION_TYPE,
levels = c("Secondary / secondary special", "Lower secondary", "Incomplete higher",
"Higher education", "Academic degree"),
labels = c("Secondary", "Lower Secondary", "Incomplete Higher", "Higher Education", "Academic Degree")
)
Exploratory Data Analysis
The goal of our project is to help Home Credit Group make better
lending decisions and reduce their risk of default. Thus, the variable
we care most about is whether the client repays a loan or not.
The variable TARGET
indicates whether a client made their payments
on time or had difficulties repaying; if TARGET
is 1, it indicates a late or missed payment. We choose this
variable as our dependent (outcome) variable.
First, we want to see the distribution of the TARGET variable.
# Distribution of repayment (Loan Repayment Status)
ggplot(train_filled, aes(x = factor(TARGET))) +
geom_bar(fill = c("#4F81BD", "#B0B0B0")) +
labs(x = "Repayment", y = "Thoundsands", title = "Frequency of Loan Repayment Status") +
scale_x_discrete(labels = c("1" = "Difficulties", "0" = "On Time")) +
scale_y_continuous(labels = scales::label_comma(scale = 1e-3)) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5))

From this plot, we see that there is an imbalance within
TARGET
: there are far more loans that were repaid on time
than loans that were not. Once we get to the models, we can weight or
resample the classes according to their representation in the data to
reflect this imbalance, as sketched below.
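As a minimal sketch of that idea (our actual models below rely on down-sampling inside caret rather than these exact weights), inverse-frequency class weights could be computed as follows:
# Sketch: inverse-frequency class weights for TARGET (illustrative only;
# the models below handle the imbalance with down-sampling in caret instead)
class_counts <- table(train_filled$TARGET)
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)
obs_weights <- class_weights[as.character(train_filled$TARGET)] # one weight per observation
head(obs_weights)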
Since we take socio-demographic factors into account when explaining
default, we next want to see whether TARGET
varies with
education level.
# Plotting the distribution of repayment by education level
ggplot(train_filled, aes(y = NAME_EDUCATION_TYPE, fill = factor(TARGET))) +
geom_bar(position = "fill") + # Stacked percentage bars
scale_fill_manual(values = c("#4F81BD", "#B0B0B0"), labels = c("On Time", "Difficulties")) +
labs(x = "Proportion", y = "", title = "Loan Repayment Status by Education Level") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.y = element_text(angle = 0, hjust = 1), # Ensure labels on y-axis are horizontal
legend.position = "top" # Move the legend to the top
) +
scale_y_discrete(labels = function(x) str_wrap(x, width = 15)) + # Wrap long labels
guides(fill = guide_legend(title = "Repayment")) # Add title for legend

According to the plot, higher levels of education, such as "Academic
Degree" and "Higher Education," show a noticeably larger proportion
of on-time repayments compared to the lower educational categories.
This trend suggests that education level is a strong predictor of
financial reliability in loan repayment.
Clients with "Lower Secondary" or "Secondary" education exhibit a
visibly higher proportion of repayment difficulties. Such a ratio could
reflect limited financial literacy or employment opportunities, leading
to greater challenges in meeting repayment schedules.
Finally, for “Incomplete Higher Education”, the individuals in this
category exhibit repayment patterns that are between those of lower and
higher educational levels.
In many Asian countries, higher education significantly improves job
prospects in urban areas, where wages are typically higher. Conversely,
those with lower education levels may remain in informal or unstable
employment sectors, leading to financial strain and repayment
difficulties. Culturally, familial obligations and societal pressures in
some Asian countries may also play a role, with educated individuals
being better equipped to manage these dynamics.
After looking at the relationship between education level and the
default proportion, we also want to examine another factor that might
affect the default rate in our dataset, so we plot the repayment
status by age:
# Bin clients into age groups based on DAYS_BIRTH
new_data <- train_filled %>%
mutate(Age_Group = cut(DAYS_BIRTH / 365,
breaks = c(0, 25, 35, 45, 55, 65, 100),
labels = c("18-25", "26-35", "36-45", "46-55", "56-65", "66+"),
right = FALSE)) %>%
dplyr::select(Age_Group, TARGET) # Select only the Age Group and TARGET columns
# Bar plot with two bars per age group for Loan Repayment Status
ggplot(new_data, aes(x = Age_Group, fill = factor(TARGET))) +
geom_bar(position = "dodge") + # Use dodge for side-by-side bars
scale_fill_manual(values = c("#4F81BD", "#B0B0B0"), labels = c("On Time", "Difficulties")) +
labs(x = "Age Group", y = "Count", title = "Loan Repayment Status by Age Group") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.y = element_text(angle = 0, hjust = 1), # Ensure labels on y-axis are horizontal
legend.position = "top" # Move the legend to the top
) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
guides(fill = guide_legend(title = "Repayment")) # Add title for legend

According to the plot, the 18-25 age group does not have as many
applicants as the other groups, but repayment difficulties still appear.
This might be because younger borrowers often face limited financial
literacy, job instability, or irregular income, which makes it harder
for them to judge the risk of borrowing: "greater levels of financial
literacy decreased one's likelihood of taking out a student loan.
Certain socio-demographic groups were also more likely to take out
student loans. Concerning student loan repayment, individuals with
lower financial literacy were more likely to be delinquent on their
student loan"
The age groups 26-35, 36-45, and 46-55 dominate the number of
on-time repayments, with 36-45 having the highest count overall, while
the 26-35 group has the highest count of repayment difficulties.
People between the ages of 26 and 45 are generally in an active,
ambitious stage of life: many are getting married and buying fixed
assets, so they are more likely than other age groups to make risky
financial choices, such as borrowing to buy a house or a car.
However, this behavior also carries a greater probability of repayment
difficulties, because not everyone's career succeeds as planned and
some people overestimate their repayment capacity, leading to more
repayment difficulties than in other age groups.
Unlike the two younger groups, people aged 46-55 have a lower
percentage of repayment difficulties, since they are generally
wealthier than the other groups. Although some of them still cannot
repay on time, and their on-time repayment count is similar to that of
the 26-35 group, the proportion with difficulties in this group is
lower.
Finally, the last two groups, 56-65 and 66+, are at or near
retirement age. For people at this stage, the proportion who cannot
repay their loans is much lower than in the previous age groups
(except 18-25). Those older than 66 also apply for far fewer loans,
likely because they have accumulated enough to live on and no longer
need to borrow.
Additionally, we want a first look at the relationship between
repayment ability and the client's financial situation. We use
income
and credit
amount as our indicators of
wealth. To start off, we bin the income and credit columns
by value; the thresholds are chosen so that the bins give a
reasonably even coverage of the range.
# Create bins for each column
train_bins <- train_filled %>%
mutate(
AMT_INCOME_TOTAL_BIN = cut(AMT_INCOME_TOTAL, breaks = c(0, 100000, 150000, 200000, 250000, 300000, Inf),
labels = c("0-100k", "100k-150k", "150k-200k", "200k-250k", "250k-300k", "300k+")),
AMT_CREDIT_BIN = cut(AMT_CREDIT, breaks = c(0, 200000, 400000, 600000, 800000, 1000000, Inf),
labels = c("0-200k", "200k-400k", "400k-600k", "600k-800k", "800k-1M", "1M+"))
)
# Aggregate data to count TARGET values for each bin
income_target <- train_bins %>%
group_by(AMT_INCOME_TOTAL_BIN, TARGET) %>%
summarise(Count = n(), .groups = "drop")
credit_target <- train_bins %>%
group_by(AMT_CREDIT_BIN, TARGET) %>%
summarise(Count = n(), .groups = "drop")
# Income total
ggplot(income_target, aes(x = AMT_INCOME_TOTAL_BIN, y = Count, fill = factor(TARGET))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#4F81BD", "#B0B0B0"), labels = c("On Time", "Difficulties")) +
labs(title = "Number of Repayment by Income Range",
x = "Income Range",
y = "Count",
fill = "Repayment") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.y = element_text(angle = 0, hjust = 1), # Ensure labels on y-axis are horizontal
legend.position = "top" # Move the legend to the top
)

Based on the plot above, most borrowers fall in the income range
of 0 to $250k, while those with incomes above 250k take out loans
far less often. Additionally, we notice that the number of repayment
difficulties fluctuates with the total number of loans in each bin.
This suggests comparing proportions to further understand
the differences between these groups.
# Credit
ggplot(credit_target, aes(x = AMT_CREDIT_BIN, y = Count, fill = factor(TARGET))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#4F81BD", "#B0B0B0"), labels = c("On Time", "Difficulties")) +
labs(title = "Number of Repayment by Credit Range",
x = "Credit Range",
y = "Count",
fill = "Repayment") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.y = element_text(angle = 0, hjust = 1), # Ensure labels on y-axis are horizontal
legend.position = "top" # Move the legend to the top
)

According to the plot, clients in the credit range of 200k to 800k
and those with more than $1 million borrow the most.
As with the income plot, the number of repayment difficulties
fluctuates with the total number of loans in each credit range. This
suggests a further look at the proportions.
# Calculate proportions for each bin
income_target_prop <- train_bins %>%
group_by(AMT_INCOME_TOTAL_BIN, TARGET) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(AMT_INCOME_TOTAL_BIN) %>%
mutate(Proportion = Count / sum(Count))
credit_target_prop <- train_bins %>%
group_by(AMT_CREDIT_BIN, TARGET) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(AMT_CREDIT_BIN) %>%
mutate(Proportion = Count / sum(Count))
# Plotting proportions
# Percentage bar chart for Income Range
ggplot(income_target_prop, aes(fill = factor(TARGET), y = Proportion, x = AMT_INCOME_TOTAL_BIN)) +
geom_bar(position = "fill", stat = "identity") +
scale_fill_manual(values = c("#4F81BD", "#B0B0B0"), labels = c("On Time", "Difficulties")) +
scale_y_continuous(labels = scales::percent_format()) + # proportions formatted as percentages
labs(title = "Proportion of repayment by Income Range",
x = "Income Range",
y = "Proportion",
fill = "Repayment") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.y = element_text(angle = 0, hjust = 1), # Ensure labels on y-axis are horizontal
legend.position = "top", # Move the legend to the top
axis.text.x = element_text(angle = 45, hjust = 1))

Interestingly, when we plot the income ranges as proportions, we can
observe a slight downward trend in difficulties. This suggests that
those with higher income are more likely to pay back the loan.
Combined with the insights from the loan counts, it implies that people
with income lower than 250k borrow more often and are less likely to
repay on time, whereas those with income higher than 250k borrow less
often and more reliably repay the loan on time.
# Percentage bar chart for Credit Range
ggplot(credit_target_prop, aes(fill = factor(TARGET), y = Proportion, x = AMT_CREDIT_BIN)) +
geom_bar(position = "fill", stat = "identity") +
scale_fill_manual(values = c("#4F81BD", "#B0B0B0"), labels = c("On Time", "Difficulties")) +
scale_y_continuous(labels = scales::percent_format()) + # proportions formatted as percentages
labs(title = "Proportion of repayment by Credit Range",
x = "Credit Range",
y = "Proportion",
fill = "Repayment") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.y = element_text(angle = 0, hjust = 1), # Ensure labels on y-axis are horizontal
legend.position = "top", # Move the legend to the top
axis.text.x = element_text(angle = 45, hjust = 1))

For the credit-range proportions, there does not seem to be a clear
trend in the graph. Interestingly, however, although the 400k-600k
group has a lower loan count, its proportion of repayment difficulties
is higher. Also, those with high credit amounts are more
likely to repay the loan on time, in line with the insights from the
income graphs.
Variable Selection
After filling in missing values and removing columns with too much
missing data, we are left with 54 variables. From the correlation
heatmap shown below, we observed that some of these variables are
highly correlated. To improve our model's performance and reduce the
risk of overfitting, we use a combination of Elastic Net and Principal
Component Analysis (PCA). Before diving into these techniques, we also
leveraged domain knowledge to guide our feature engineering, which
helped us create new variables and remove irrelevant ones. Here's how
everything fits together:
- Using Domain Knowledge for Feature Engineering
Before applying any advanced techniques, we first used our
understanding of the domain to create new variables that might be more
meaningful or representative of the underlying patterns in the data. We
also removed features that, based on our domain expertise, were unlikely
to add value or could introduce noise. This helped us refine the dataset
before applying Elastic Net and PCA, ensuring that we weren’t working
with irrelevant or redundant features from the start.
- Elastic Net: Handling Correlated Variables and Feature
Selection
Once we had a cleaner dataset, we turned to Elastic Net. Elastic Net
is a regularization method that combines Lasso (L1 regularization) and
Ridge (L2 regularization), and it’s particularly effective when we have
correlated features, as is the case with our dataset. Elastic Net allows
us to:
Handle multicollinearity (correlated features) by selecting important
features while shrinking the others.
Reduce overfitting by penalizing overly complex models.
Filter out irrelevant variables, ensuring the model only uses the
most meaningful features.
This combination of Lasso and Ridge regularization makes Elastic Net
especially useful in refining the feature set and focusing on the most
impactful variables (a minimal glmnet sketch of the penalty follows
after this list).
- Principal Component Analysis (PCA): Reducing Dimensionality and
Multicollinearity
Even after applying Elastic Net, we still have a fair number of
features, which is where PCA comes in. PCA is a dimensionality reduction
technique that transforms the original correlated features into
uncorrelated principal components, each of which captures a portion of
the variance in the data.
PCA is particularly useful in this case for a few key reasons:
Multicollinearity: Since many of our features are correlated, PCA
helps by creating new components that are uncorrelated, resolving the
issue of multicollinearity and making the model more stable.
Dimensionality Reduction: By focusing on the components that explain
the most variance, PCA reduces the number of features we need to work
with, simplifying the model and making it more efficient.
Model Efficiency: Reducing the number of features not only makes
training faster but also helps in creating a model that can generalize
better to new data.
How They Work Together
In summary, the combination of Elastic Net and PCA, along with the
use of domain knowledge to create and refine features, helps us build a
model that is more efficient, generalizable, and accurate. This process
ensures that we are working with the most meaningful features and
reduces the risk of overfitting, ultimately leading to better
performance on real-world data.
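To make the penalty concrete, here is a minimal sketch of an elastic net fit directly with glmnet on simulated data; alpha mixes the Lasso (alpha = 1) and Ridge (alpha = 0) penalties. This is purely illustrative and is not the model we train later, which goes through caret.
# Illustrative elastic net on simulated data (not our final model)
set.seed(123)
x_demo <- matrix(rnorm(200 * 5), ncol = 5)   # 5 simulated predictors
y_demo <- factor(rbinom(200, 1, 0.3))        # simulated binary outcome
cv_fit <- cv.glmnet(x_demo, y_demo, family = "binomial", alpha = 0.5) # 50/50 L1-L2 mix
coef(cv_fit, s = "lambda.min")               # sparse coefficients at the best lambda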
Data Cleaning
# Identify all FLAG_DOCUMENT_X columns
flag_columns <- grep("^FLAG_DOCUMENT_", names(train_filled), value = TRUE)
# Create a new column with the count of documents provided (sum of flags)
train_filled$DOCUMENT_COUNT <- rowSums(train_filled[, flag_columns])
# Remove the FLAG_DOCUMENT_X columns from the dataset
train_filled <- train_filled[, !names(train_filled) %in% flag_columns]
# Dropping and mutate column based on domain knowledge
train_clean <- train_filled %>%
dplyr::select(-AMT_REQ_CREDIT_BUREAU_HOUR, -AMT_REQ_CREDIT_BUREAU_DAY, -AMT_REQ_CREDIT_BUREAU_WEEK, -AMT_REQ_CREDIT_BUREAU_MON, -AMT_REQ_CREDIT_BUREAU_YEAR, -WEEKDAY_APPR_PROCESS_START, -NAME_INCOME_TYPE, -FLAG_MOBIL) %>%
mutate(DAY_EMPLOYED_PERCENT = DAYS_EMPLOYED / DAYS_BIRTH) %>%
mutate(AGES = DAYS_BIRTH/365) %>%
dplyr::select(-SK_ID_CURR, -DAYS_BIRTH) %>%
mutate(CREDIT_INCOME_PERCENT = AMT_CREDIT / AMT_INCOME_TOTAL) %>%
filter(CODE_GENDER != "XNA")
Here, our goal is to reduce dimensionality and improve the relevance
of the features. Instead of keeping the individual binary document
flags, we created a new column, DOCUMENT_COUNT, which represents the
total count of documents provided. This should be a more meaningful
feature for our model, so we dropped the original FLAG_DOCUMENT_X
columns.
Additionally, we removed variables related to the number of enquiries
to the Credit Bureau at different time intervals (e.g., one hour, one
day, one week, or a year before the application). Time intervals like
“hour” and “day” are too close to each other and may just reflect normal
client behavior, while “year” is too distant. We decided to keep only
the quarter variable, as it strikes a balance between capturing
meaningful action and avoiding overly short or long time frames.
We also created a new feature, DAY_EMPLOYED_PERCENT, by calculating
the ratio of DAYS_EMPLOYED to DAYS_BIRTH, which gives a sense of how
long the person has been employed relative to their age. This was
calculated based on a reasonable assumption that someone with a higher
ratio has more work experience relative to their age.
Finally, we removed irrelevant variables such as SK_ID_CURR, FLAG_MOBIL
(since multiple other variables indicate the same thing) and
DAYS_BIRTH, as we now have the new AGES variable, and filtered out the
XNA category of the gender variable, which does not make sense in the
context of the data.
# Change the categorical column into numerical based on domain knowledge
train_clean$NAME_EDUCATION_TYPE <- as.numeric(factor(train_clean$NAME_EDUCATION_TYPE, labels = c("Secondary", "Lower Secondary", "Incomplete Higher", "Higher Education", "Academic Degree")))
train_clean$OCCUPATION_TYPE <- as.numeric(table(train_clean$OCCUPATION_TYPE )[train_clean$OCCUPATION_TYPE ])
train_clean$NAME_TYPE_SUITE <- as.numeric(table(train_clean$NAME_TYPE_SUITE )[train_clean$NAME_TYPE_SUITE ])
train_clean$NAME_HOUSING_TYPE <- as.numeric(table(train_clean$NAME_HOUSING_TYPE )[train_clean$NAME_HOUSING_TYPE ])
train_clean$NAME_CONTRACT_TYPE <- as.numeric(table(train_clean$NAME_CONTRACT_TYPE )[train_clean$NAME_CONTRACT_TYPE])
train_clean$ORGANIZATION_TYPE <- as.numeric(table(train_clean$ORGANIZATION_TYPE )[train_clean$ORGANIZATION_TYPE])
train_clean$NAME_FAMILY_STATUS <- as.numeric(table(train_clean$NAME_FAMILY_STATUS )[train_clean$NAME_FAMILY_STATUS])
To improve the model’s performance, we transformed categorical
variables into numerical ones based on domain knowledge. For instance,
NAME_EDUCATION_TYPE was converted into a numeric scale representing
different education levels. The levels are ordered as follows:
“Secondary” (lowest), “Lower Secondary”, “Incomplete Higher”, “Higher
Education”, and “Academic Degree” (highest). This transformation
reflects the increasing level of education and its potential correlation
with income, creditworthiness, or other important factors.
For other categorical variables such as OCCUPATION_TYPE,
NAME_TYPE_SUITE, NAME_HOUSING_TYPE, NAME_CONTRACT_TYPE,
ORGANIZATION_TYPE, and NAME_FAMILY_STATUS, we transformed them into
numeric values based on their frequency in the dataset. This approach
helps capture the importance of each category while simplifying the
data. The more frequent categories are assigned higher numerical values,
which can be interpreted as indicators of commonality or prevalence. By
doing this, we ensure that the model can interpret these categorical
variables effectively without the need for complex one-hot encoding,
while still maintaining important information about their relative
occurrence.
# One-hot encode the remaining categorical variables
dummy_vars <- dummyVars(~ ., data = train_clean, fullRank = TRUE)
train_before <- data.frame(predict(dummy_vars, train_clean))
For the remaining categorical variables, we applied One-Hot Encoding
to convert them into a format that can be understood by the machine
learning model. One-hot encoding creates a binary (0 or 1) column for
each category in a variable, where each new column represents the
presence or absence of a specific category. This is particularly useful
for categorical variables that do not have an inherent order or
ranking.
We used the dummyVars function from the caret package to generate
these binary variables, specifying the fullRank = TRUE argument to
ensure we avoid multicollinearity (by excluding one category per
variable). After applying one-hot encoding, we created a new dataset
called train_before, which includes these expanded binary columns,
making the dataset ready for model selection. This transformation
ensures that categorical variables are properly represented, allowing
the model to learn from the individual categories without introducing
any implicit order or relationship.
Variable Correlation
After finishing our exploration of the dataset, we realize that we
still have too many variables, so here we explain how we choose the
variables used in the subsequent model building and training.
First of all, we use a correlation matrix to find which variables in
our dataset are related to each other:
# Separate the numerical columns
numerical_data <- train_before %>%
dplyr::select_if(is.numeric) # Selects only numerical columns
# Calculate the initial correlation matrix
corr_matrix <- cor(numerical_data)
# Set a high correlation threshold
threshold <- 0.8
# Create a copy of the correlation matrix, setting correlations below the threshold to NA
corr_matrix[abs(corr_matrix) < threshold] <- NA # Set correlations below the threshold to NA
# Function to check if all non-NA values are 1
all_one_or_na <- function(x) all(x[!is.na(x)] == 1)
# Remove columns and rows that contain only 1 and NA
corr_matrix <- corr_matrix[, !apply(corr_matrix, 2, all_one_or_na)]
corr_matrix <- corr_matrix[!apply(corr_matrix, 1, all_one_or_na), ]
# Identify variables with at least one significant correlation
significant_vars <- apply(!is.na(corr_matrix), 1, any)
# Filter out variables without significant correlations
filtered_corr_matrix <- corr_matrix[significant_vars, significant_vars]
# Convert the filtered correlation matrix into long format
cor_long <- melt(filtered_corr_matrix, na.rm = TRUE)
# Plot the heatmap
ggplot(data = cor_long, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "#084594", high = "#B22222", mid = "white",
limit = c(-1, 1), midpoint = 0,
name = "Correlation") +
labs(title = paste("Filtered Correlation Matrix (Corr >= ", threshold, ")", sep = ""),
x = "",
y = "") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 7, vjust = 1),
axis.text.y = element_text(hjust = 1, size = 7, vjust = 1))

After comparing several cutoffs, we decided to set the threshold at
0.8: a higher threshold would exclude relationships that may still be
relevant, while a lower one would include weaker correlations and make
it harder to identify the strongest and most meaningful variable
relationships, so 0.8 is the best value we found.
According to the plot, here are some example variable pairs worth
noting:
AMT_GOODS_PRICE and AMT_CREDIT: These two variables show very high
correlation, suggesting that the amount of credit is closely tied to the
price of goods.
REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY: These are
strongly correlated, reflecting how regional client ratings and those
weighted by city are closely linked.
However, some highly correlated pairs carry little extra information,
for example:
OBS_30_CNT_SOCIAL_CIRCLE and OBS_60_CNT_SOCIAL_CIRCLE: These two are
highly correlated simply because they measure the same thing with
different days-past-due thresholds (30 days versus 60 days), so one of
the pair can be set aside.
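As an optional cross-check on this manual reading of the heatmap (not part of our final pipeline, and assuming no zero-variance columns remain), caret can flag one variable from each highly correlated pair automatically:
# Optional cross-check: flag one variable from each pair with |correlation| > 0.8
high_corr_idx <- findCorrelation(cor(numerical_data), cutoff = 0.8)
colnames(numerical_data)[high_corr_idx] # candidates to consider dropping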
Elastic Net
Next, we will use Elastic Net for variable selection
# Change the TARGET variable into factor for Elastic Net
train_before$TARGET <- factor(train_before$TARGET, levels = c(0, 1))
# Ensure that levels of TARGET are valid R variable names
levels(train_before$TARGET) <- make.names(levels(train_before$TARGET))
We first convert the TARGET variable to a factor and make its levels
valid R variable names so that caret can use them as class labels.
# Parallel Processing for faster training
#cl <- makeCluster(detectCores()-1)
#registerDoParallel()
# Define custom Precision summary function with X1 (difficulties) as the positive class
precisionSummary <- function(data, lev = NULL, model = NULL) {
data$obs <- factor(data$obs)
data$pred <- factor(data$pred)
precision <- posPredValue(data$obs, data$pred, positive = "X1")
out <- c(Precision = precision)
return(out)
}
# Define train control: 10-fold cross-validation with down-sampling
# train_control <- trainControl(
# method = "cv", # Cross-validation
# number = 10, # 10-fold CV
# classProbs = TRUE, # Enable class probabilities
# summaryFunction = precisionSummary, # Use custom Precision function
# sampling = "down"
# )
# Elastic Net model training
# elastic_net_model <- train(
# TARGET ~ ., # Formula for predictors and outcome
# data = train_before, # Dataset
# method = "glmnet", # Elastic Net
# metric = "Precision", # Optimize for Precision
# trControl = train_control, # Training control
# tuneGrid = expand.grid( # Tuning grid for alpha and lambda
# alpha = seq(0.1, 0.5, by = 0.1),
# lambda = seq(0.01, 0.05, by = 0.01)),
# preProcess = c("center", "scale"))
#
# stopCluster(cl)
# registerDoSEQ()
# Save the model
# saveRDS(elastic_net_model, "elastic.rds")
#Read the model
elastic_net_model <- readRDS("elastic.rds")
# Print model results
print(elastic_net_model)
## glmnet
##
## 246006 samples
## 46 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (46), scaled (46)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221405, 221405, 221406, 221405, 221406, 221405, ...
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results across tuning parameters:
##
## alpha lambda Precision
## 0.1 0.01 0.6645548
## 0.1 0.02 0.6642025
## 0.1 0.03 0.6633477
## 0.1 0.04 0.6622919
## 0.1 0.05 0.6611855
## 0.2 0.01 0.6631970
## 0.2 0.02 0.6610347
## 0.2 0.03 0.6587216
## 0.2 0.04 0.6562076
## 0.2 0.05 0.6535926
## 0.3 0.01 0.6628952
## 0.3 0.02 0.6586214
## 0.3 0.03 0.6557047
## 0.3 0.04 0.6522857
## 0.3 0.05 0.6487151
## 0.4 0.01 0.6602302
## 0.4 0.02 0.6559060
## 0.4 0.03 0.6534422
## 0.4 0.04 0.6495200
## 0.4 0.05 0.6450948
## 0.5 0.01 0.6593754
## 0.5 0.02 0.6537939
## 0.5 0.03 0.6494191
## 0.5 0.04 0.6442899
## 0.5 0.05 0.6396139
##
## Precision was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.01.
Since our dataset is large and training takes a long time, we saved
the fitted model to a file and simply load and print it here.
The Elastic Net model was evaluated using 10-fold
cross-validation with down-sampling to address the class imbalance.
Cross-validation assesses the model's performance by training
and testing it on different subsets of the data, helping to ensure that
the model generalizes well to unseen data and reducing the risk of
overfitting; 10 folds were chosen for a more reliable performance
estimate.
The grid search tested various combinations of alpha (regularization
type) and lambda (regularization strength) parameters, with a small
range of values (from 0.1 to 0.5 for alpha and from 0.01 to 0.05 for
lambda). This narrow range was selected to keep the model training
efficient, avoiding excessive computational time and complexity. The
best-performing model, based on precision, was achieved with alpha = 0.1
and lambda = 0.01, which yielded the highest precision of 0.6645548.
While this combination was optimal within the tested parameters, one
limitation of this approach is that the relatively small range of alpha
and lambda values may have excluded better-performing configurations
outside the tested grid. This could potentially affect the stability and
reliability of the performance estimate.
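If computation time allowed, the grid could be widened. The following sketch is kept commented out, like the training code above, and assumes the same train_control object; it shows one way to search a broader range with lambda on a log scale.
# Sketch of a wider tuning grid (not run here because of training time)
# wider_grid <- expand.grid(
#   alpha  = seq(0, 1, by = 0.1),
#   lambda = 10^seq(-4, -1, length.out = 20)
# )
# elastic_net_wide <- train(
#   TARGET ~ ., data = train_before, method = "glmnet",
#   metric = "Precision", trControl = train_control,
#   tuneGrid = wider_grid, preProcess = c("center", "scale")
# )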
# Extract coefficients of the best model
best_model <- elastic_net_model$finalModel
best_lambda <- elastic_net_model$bestTune$lambda
selected_features <- coef(best_model, s = best_lambda)
coefficients <- as.matrix(selected_features)
# Identify non-zero coefficients (i.e., variables retained in the model)
non_zero_coefficients <- coefficients[coefficients != 0, , drop = FALSE]
print(non_zero_coefficients)
## s1
## (Intercept) 0.0070146406
## NAME_CONTRACT_TYPE 0.1226738761
## CODE_GENDERM 0.2068773247
## FLAG_OWN_CARY -0.1060241427
## FLAG_OWN_REALTYY 0.0347760445
## CNT_CHILDREN 0.0033332575
## AMT_INCOME_TOTAL 0.0004824846
## AMT_CREDIT 0.0600975429
## AMT_ANNUITY 0.1005591822
## AMT_GOODS_PRICE -0.2156694927
## NAME_TYPE_SUITE 0.0026908721
## NAME_EDUCATION_TYPE -0.1568598157
## NAME_FAMILY_STATUS -0.0768935604
## NAME_HOUSING_TYPE -0.0294136793
## DAYS_EMPLOYED -0.0458252117
## DAYS_REGISTRATION -0.0266082580
## DAYS_ID_PUBLISH -0.0787614880
## FLAG_EMP_PHONE 0.0002975894
## FLAG_WORK_PHONE 0.0417061642
## FLAG_CONT_MOBILE -0.0042825316
## FLAG_PHONE -0.0429155180
## FLAG_EMAIL -0.0158892852
## OCCUPATION_TYPE -0.0065128121
## CNT_FAM_MEMBERS 0.0002802081
## REGION_RATING_CLIENT_W_CITY 0.0782333276
## HOUR_APPR_PROCESS_START -0.0073622716
## REG_REGION_NOT_LIVE_REGION -0.0010022061
## LIVE_REGION_NOT_WORK_REGION 0.0001220807
## REG_CITY_NOT_LIVE_CITY 0.0377958160
## REG_CITY_NOT_WORK_CITY 0.0225881327
## ORGANIZATION_TYPE 0.0600264943
## EXT_SOURCE_2 -0.4231316692
## EXT_SOURCE_3 -0.4904077395
## DEF_30_CNT_SOCIAL_CIRCLE 0.0306188598
## DEF_60_CNT_SOCIAL_CIRCLE 0.0591089032
## DAYS_LAST_PHONE_CHANGE -0.0603265106
## AMT_REQ_CREDIT_BUREAU_QRT -0.0480897479
## DAY_EMPLOYED_PERCENT -0.0032522688
## AGES -0.1100883117
## CREDIT_INCOME_PERCENT 0.1087624039
These are the variables retained by the Elastic Net. Since correlation
may still remain among the selected variables, we draw the correlation
heatmap one more time to address this.
# Convert into matrix form for significant variables
non_zero_coefficients <- as.matrix(non_zero_coefficients)
selected_features <- colnames(train_before)[which(non_zero_coefficients[-1,] != 0)]
# Subset the original dataset to include only final features for PCA
final_filtered_data <- train_before %>%
dplyr::select(selected_features)
x <- final_filtered_data %>%
dplyr::select(-TARGET)
# Calculate the initial correlation matrix
corr_matrix <- cor(x)
# Set a high correlation threshold
threshold <- 0.8
# Create a copy of the correlation matrix, setting correlations below the threshold to NA
corr_matrix[abs(corr_matrix) < threshold] <- NA # Set correlations below the threshold to NA
# Function to check if all non-NA values are 1
all_one_or_na <- function(x) all(x[!is.na(x)] == 1)
# Remove columns and rows that contain only 1 and NA
corr_matrix <- corr_matrix[, !apply(corr_matrix, 2, all_one_or_na)]
corr_matrix <- corr_matrix[!apply(corr_matrix, 1, all_one_or_na), ]
# Identify variables with at least one significant correlation
significant_vars <- apply(!is.na(corr_matrix), 1, any)
# Filter out variables without significant correlations
filtered_corr_matrix <- corr_matrix[significant_vars, significant_vars]
# Convert the filtered correlation matrix into long format
cor_long <- melt(filtered_corr_matrix, na.rm = TRUE)
# Plot the heatmap
ggplot(data = cor_long, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "#084594", high = "#B22222", mid = "white",
limit = c(-1, 1), midpoint = 0,
name = "Correlation") +
labs(title = paste("Filtered Correlation Matrix (Corr >= ", threshold, ")", sep = ""),
x = "",
y = "") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 7, vjust = 1),
axis.text.y = element_text(hjust = 1, size = 7, vjust = 1))

From the heatmap, we observed that certain variables, such as AMT_CREDIT and AMT_GOODS_PRICE, or DAYS_EMPLOYED and FLAG_EMP_PHONE, show high correlation. Since the importance of these variables needs to be evaluated in the context of our model, we applied domain knowledge to filter out the highly correlated variables before performing PCA. This ensures that we retain only the most relevant features and avoid redundancy, allowing PCA to work more effectively in reducing dimensionality and improving model performance.
Domain Knowledge
# Remove correlated variable
final_filtered_data <- final_filtered_data %>%
dplyr::select(-CNT_CHILDREN, -FLAG_EMP_PHONE, -REGION_RATING_CLIENT)
# Filter x variable
final_filtered_data_x <- final_filtered_data[, -which(names(final_filtered_data) == "TARGET")]
Since the number of children is already represented in the number of
family members, we don’t need to retain both features. Additionally, if
a client has a job, it is highly probable they would also have an
employer phone number, and we observed that the region rating for
clients is quite similar, so these variables can be removed as well.
After filtering out these redundant variables, we will create a new
dataset excluding the TARGET variable, which will then be used for PCA
to reduce dimensionality and improve model performance.
After selecting the most impactful variables and deleting unnecessary ones, we proceeded with Principal Component Analysis (PCA) to reduce dimensionality and mitigate multicollinearity. PCA combines correlated features into uncorrelated principal components, allowing the model to focus on the most important underlying patterns. This step aims to enhance the efficiency of model training and prevent overfitting, as it reduces the risk of including redundant features that add little predictive power.
Principal Component Analysis
# Apply PCA to the standardized data
pca_result <- prcomp(final_filtered_data_x, center = TRUE, scale. = TRUE)
First, we will apply PCA to our data. Before doing so, it is
essential to center and scale the data to ensure that each feature
contributes equally to the analysis. Centering subtracts the mean of
each variable, while scaling normalizes the variables to have unit
variance. This step is crucial because PCA is sensitive to the scale of
the data, and without it, variables with larger scales could
disproportionately influence the results.
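As a quick sanity check (a toy sketch on the built-in mtcars data, not part of our pipeline), letting prcomp() center and scale the data is equivalent to standardizing the matrix by hand first:
# PCA on manually standardized data matches prcomp(center = TRUE, scale. = TRUE)
toy <- as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")])
p1 <- prcomp(toy, center = TRUE, scale. = TRUE)
p2 <- prcomp(scale(toy)) # scale() centers and rescales each column to unit variance
all.equal(abs(p1$rotation), abs(p2$rotation)) # should be TRUE; PC signs are arbitrary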
# Summary of PCA results
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.81991 1.68143 1.45057 1.34731 1.24772 1.17698 1.16306
## Proportion of Variance 0.09463 0.08078 0.06012 0.05186 0.04448 0.03958 0.03865
## Cumulative Proportion 0.09463 0.17541 0.23553 0.28739 0.33187 0.37145 0.41010
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.15004 1.13570 1.09476 1.05687 1.03573 1.01938 0.99654
## Proportion of Variance 0.03779 0.03685 0.03424 0.03191 0.03065 0.02969 0.02837
## Cumulative Proportion 0.44789 0.48474 0.51898 0.55090 0.58154 0.61123 0.63961
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.98569 0.97788 0.96246 0.94552 0.92941 0.91388 0.89231
## Proportion of Variance 0.02776 0.02732 0.02647 0.02554 0.02468 0.02386 0.02275
## Cumulative Proportion 0.66737 0.69469 0.72116 0.74670 0.77138 0.79524 0.81799
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.87377 0.84110 0.84057 0.81283 0.80264 0.79757 0.78395
## Proportion of Variance 0.02181 0.02021 0.02019 0.01888 0.01841 0.01817 0.01756
## Cumulative Proportion 0.83980 0.86002 0.88020 0.89908 0.91749 0.93566 0.95322
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.67505 0.66513 0.56062 0.52512 0.29409 0.22510 0.10926
## Proportion of Variance 0.01302 0.01264 0.00898 0.00788 0.00247 0.00145 0.00034
## Cumulative Proportion 0.96624 0.97888 0.98786 0.99574 0.99821 0.99966 1.00000
The PCA results indicate the importance of each principal component
in explaining the variance in the data. The first few components capture
the majority of the variance, with PC1 explaining 9.46% of the variance
alone, and the first three components together accounting for 23.55% of
the total variance. By PC25, the cumulative proportion of variance
reaches 89.9%, meaning that the first 25 components explain most of the
variability in the data. Beyond this, the components explain diminishing
amounts of variance, with PC35 contributing only 0.034%. This suggests
that dimensionality can be effectively reduced by focusing on the first
25 components, retaining significant information while reducing
complexity and improving model efficiency.
We will make a plot to check our findings and then proceed with the data after PCA.
# Get explained variance
explained_variance <- summary(pca_result)$importance[2, ]
# Cumulative explained variance
cumulative_variance <- cumsum(explained_variance)
# Plot the cumulative variance
plot(cumulative_variance, type = "b", xlab = "Number of Components", ylab = "Cumulative Explained Variance",
main = "Cumulative Explained Variance by Principal Components")

The plot visualizes the cumulative explained variance, with the
number of components on the x-axis and the cumulative variance on the
y-axis. The resulting plot shows that the first 25 components explain
approximately 90% of the variability in the data. As more components are
added, the cumulative variance increases at a slower rate, indicating
diminishing returns in capturing additional variance. Therefore, we
would choose to retain the first 25 components for further analysis, as
they capture most of the important information, while reducing
dimensionality and simplifying the model without losing significant
explanatory power.
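The cutoff can also be read off programmatically from the cumulative variance computed above; a small sketch, using the roughly-90% threshold discussed in the text:
# Smallest number of components reaching roughly 90% cumulative variance
# (a strict 0.90 cutoff would return 26, since PC25 reaches 89.9%)
n_components <- which(cumulative_variance >= 0.89)[1]
n_components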
# Select the PC for our data
select_train <- pca_result$x[, 1:25]
# Combine the TARGET variable and our principal components
cleaned_matrix <- bind_cols(TARGET = as.factor(final_filtered_data$TARGET), select_train)
# Transform it to a data frame for training
pca_transformed_df <- as.data.frame(cleaned_matrix)
We will create a new dataset using the selected principal components
from the PCA transformation and proceed to train our model using this
reduced-dimensional data.
Methodology
Next, we will train our model using three different methods: Logistic
Regression (LR), Random Forest (RF), and XGBoost (XGB).
Logistic Regression will serve as a baseline model, as it is
particularly effective for capturing linear relationships between the
predictors and the target variable. This model will help us understand
the basic linear patterns in the data.
Random Forest, a non-linear model, will allow us to capture more
complex relationships and interactions between features, which may not
be apparent in linear models. XGBoost, a powerful gradient boosting
algorithm, will help us model complex, non-linear relationships through
an ensemble of decision trees, focusing on minimizing errors through
iterative refinement.
XGBoost has been shown to perform well in many predictive tasks
due to its efficiency, regularization, and ability to handle sparse and
missing data. As demonstrated by Gu et al. (2024) in their study on
credit risk assessment for small and micro enterprises, XGBoost
outperforms other machine learning models in terms of both accuracy and
stability, making it a strong choice for our analysis.
We will train Logistic Regression (LR), Random Forest (RF), and
XGBoost (XGB) models using the dataset after applying Principal
Component Analysis (PCA). PCA helps reduce the dimensionality of the
data, improve model efficiency, and enhance generalization by removing
multicollinearity and noise. By focusing on the principal components,
which capture the most significant variance in the data, we aim to
assess how well the reduced feature set performs in terms of predicting
the target variable. This approach will also allow us to evaluate how
dimensionality reduction affects model interpretability and how the
transformation impacts the ability of the models to generalize to new,
unseen data. Training on PCA-transformed data will provide insights into
how each model performs when trained on a simplified, noise-reduced
version of the original dataset.
We will use Precision as our evaluation metric, as false
positives—predicting repayment when the loan won’t be repaid—are
particularly costly in this context. Our goal in this dataset is to
investigate whether an applicant’s social demographics and wealth
factors are significant predictors of whether they will repay a loan. By
focusing on Precision, we aim to minimize the impact of false positives
and ensure that the model’s predictions are as reliable as possible.
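For reference, precision is the share of positive predictions that are actually positive, TP / (TP + FP). A minimal illustration with made-up counts (not taken from our data):
# Precision from hypothetical confusion-matrix counts
tp <- 80 # predicted positive, actually positive
fp <- 40 # predicted positive, actually negative
precision <- tp / (tp + fp)
precision # 0.667: two thirds of the positive predictions are correct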
Preprocess Test Data
First, we will preprocess the test data to ensure it is in the appropriate format and ready for model evaluation.
test_filled <- test_data %>%
dplyr::select(-all_of(columns_to_drop)) %>%
mutate(
EXT_SOURCE_3 = ifelse(is.na(EXT_SOURCE_3), median(EXT_SOURCE_3, na.rm = TRUE), EXT_SOURCE_3),
AMT_REQ_CREDIT_BUREAU_HOUR = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_HOUR), 0, AMT_REQ_CREDIT_BUREAU_HOUR),
AMT_REQ_CREDIT_BUREAU_DAY = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_DAY), 0, AMT_REQ_CREDIT_BUREAU_DAY),
AMT_REQ_CREDIT_BUREAU_WEEK = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_WEEK), 0, AMT_REQ_CREDIT_BUREAU_WEEK),
AMT_REQ_CREDIT_BUREAU_MON = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_MON), median(AMT_REQ_CREDIT_BUREAU_MON, na.rm = TRUE), AMT_REQ_CREDIT_BUREAU_MON),
AMT_REQ_CREDIT_BUREAU_QRT = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_QRT), median(AMT_REQ_CREDIT_BUREAU_QRT, na.rm = TRUE), AMT_REQ_CREDIT_BUREAU_QRT),
AMT_REQ_CREDIT_BUREAU_YEAR = ifelse(is.na(AMT_REQ_CREDIT_BUREAU_YEAR), median(AMT_REQ_CREDIT_BUREAU_YEAR, na.rm = TRUE), AMT_REQ_CREDIT_BUREAU_YEAR),
OBS_30_CNT_SOCIAL_CIRCLE = ifelse(is.na(OBS_30_CNT_SOCIAL_CIRCLE), median(OBS_30_CNT_SOCIAL_CIRCLE, na.rm = TRUE), OBS_30_CNT_SOCIAL_CIRCLE),
DEF_30_CNT_SOCIAL_CIRCLE = ifelse(is.na(DEF_30_CNT_SOCIAL_CIRCLE), median(DEF_30_CNT_SOCIAL_CIRCLE, na.rm = TRUE), DEF_30_CNT_SOCIAL_CIRCLE),
OBS_60_CNT_SOCIAL_CIRCLE = ifelse(is.na(OBS_60_CNT_SOCIAL_CIRCLE), median(OBS_60_CNT_SOCIAL_CIRCLE, na.rm = TRUE), OBS_60_CNT_SOCIAL_CIRCLE),
DEF_60_CNT_SOCIAL_CIRCLE = ifelse(is.na(DEF_60_CNT_SOCIAL_CIRCLE), median(DEF_60_CNT_SOCIAL_CIRCLE, na.rm = TRUE), DEF_60_CNT_SOCIAL_CIRCLE),
EXT_SOURCE_2 = ifelse(is.na(EXT_SOURCE_2), median(EXT_SOURCE_2, na.rm = TRUE), EXT_SOURCE_2),
AMT_GOODS_PRICE = ifelse(is.na(AMT_GOODS_PRICE), median(AMT_GOODS_PRICE, na.rm = TRUE), AMT_GOODS_PRICE),
AMT_ANNUITY = ifelse(is.na(AMT_ANNUITY), median(AMT_ANNUITY, na.rm = TRUE), AMT_ANNUITY),
CNT_FAM_MEMBERS = ifelse(is.na(CNT_FAM_MEMBERS), median(CNT_FAM_MEMBERS, na.rm = TRUE), CNT_FAM_MEMBERS),
DAYS_LAST_PHONE_CHANGE = ifelse(is.na(DAYS_LAST_PHONE_CHANGE), median(DAYS_LAST_PHONE_CHANGE, na.rm = TRUE), DAYS_LAST_PHONE_CHANGE)
) %>%
mutate(OCCUPATION_TYPE = ifelse(is.na(OCCUPATION_TYPE), "Unknown", OCCUPATION_TYPE)) %>%
mutate(NAME_TYPE_SUITE = ifelse(is.na(NAME_TYPE_SUITE), "Unaccompanied", NAME_TYPE_SUITE)) %>%
mutate(across(c(DAYS_BIRTH, DAYS_LAST_PHONE_CHANGE, DAYS_EMPLOYED, DAYS_ID_PUBLISH, DAYS_REGISTRATION), abs))
# Reorder and relabel the education categories
test_filled$NAME_EDUCATION_TYPE <- factor(
test_filled$NAME_EDUCATION_TYPE,
levels = c("Secondary / secondary special", "Lower secondary", "Incomplete higher",
"Higher education", "Academic degree"),
labels = c("Secondary", "Lower Secondary", "Incomplete Higher", "Higher Education", "Academic Degree")
)
# Create a new column with the count of documents provided (sum of flags)
test_filled$DOCUMENT_COUNT <- rowSums(test_filled[, flag_columns])
# Remove the FLAG_DOCUMENT_X columns from the dataset
test_filled <- test_filled[, !names(test_filled) %in% flag_columns]
# Drop and mutate columns based on domain knowledge
test_clean <-test_filled %>%
dplyr::select(-AMT_REQ_CREDIT_BUREAU_HOUR, -AMT_REQ_CREDIT_BUREAU_DAY, -AMT_REQ_CREDIT_BUREAU_WEEK, -AMT_REQ_CREDIT_BUREAU_MON, -AMT_REQ_CREDIT_BUREAU_YEAR, -WEEKDAY_APPR_PROCESS_START, -NAME_INCOME_TYPE, -FLAG_MOBIL) %>%
mutate(DAY_EMPLOYED_PERCENT = DAYS_EMPLOYED / DAYS_BIRTH) %>%
mutate(AGES = DAYS_BIRTH/365) %>%
dplyr::select(-SK_ID_CURR, -DAYS_BIRTH) %>%
mutate(CREDIT_INCOME_PERCENT = AMT_CREDIT / AMT_INCOME_TOTAL) %>%
filter(CODE_GENDER != "XNA")
# Change the categorical column into numerical based on domain knowledge
test_clean$NAME_EDUCATION_TYPE <- as.numeric(factor(test_clean$NAME_EDUCATION_TYPE, labels = c("Secondary", "Lower Secondary", "Incomplete Higher", "Higher Education", "Academic Degree")))
test_clean$OCCUPATION_TYPE <- as.numeric(table(test_clean$OCCUPATION_TYPE )[test_clean$OCCUPATION_TYPE ])
test_clean$NAME_TYPE_SUITE <- as.numeric(table(test_clean$NAME_TYPE_SUITE )[test_clean$NAME_TYPE_SUITE ])
test_clean$NAME_HOUSING_TYPE <- as.numeric(table(test_clean$NAME_HOUSING_TYPE )[test_clean$NAME_HOUSING_TYPE ])
test_clean$NAME_CONTRACT_TYPE <- as.numeric(table(test_clean$NAME_CONTRACT_TYPE )[test_clean$NAME_CONTRACT_TYPE])
test_clean$ORGANIZATION_TYPE <- as.numeric(table(test_clean$ORGANIZATION_TYPE )[test_clean$ORGANIZATION_TYPE])
test_clean$NAME_FAMILY_STATUS <- as.numeric(table(test_clean$NAME_FAMILY_STATUS )[test_clean$NAME_FAMILY_STATUS])
# Do one-hot encoding for the remaining categorical variables
dummy_vars <- dummyVars(~ ., data = test_clean, fullRank = TRUE)
test_before <- data.frame(predict(dummy_vars, test_clean))
# Change the TARGET variable into factor for Elastic Net
test_before$TARGET <- factor(test_before$TARGET, levels = c(0, 1))
# Ensure that levels of TARGET are valid R variable names
levels(test_before$TARGET) <- make.names(levels(test_before$TARGET))
# Subset the original dataset to include only final features for PCA
final_filtered_test <- test_before %>%
dplyr::select(selected_features)
# Remove correlated variable
final_filtered_test <- final_filtered_test %>%
dplyr::select(-CNT_CHILDREN, -FLAG_EMP_PHONE, -REGION_RATING_CLIENT)
final_filtered_test <- na.omit(final_filtered_test) # Removes rows with NA values
# Filter
final_filtered_test_x <- final_filtered_test[, -which(names(final_filtered_test) == "TARGET")]
# Apply PCA to the standardized data
pca_result_test <- prcomp(final_filtered_test_x, center = TRUE, scale. = TRUE)
# Select the PC for our data
select_train_test <- pca_result_test$x[, 1:25]
# Combine the TARGET variable and our principal components
cleaned_matrix_test <- bind_cols(select_train_test, TARGET = as.factor(final_filtered_test$TARGET))
# Transform it to a data frame for training
pca_transformed_test <- as.data.frame(cleaned_matrix_test)
# Get the independent and dependent variables in the test data
x_test <- pca_transformed_test %>%
dplyr::select(-TARGET)
y_test <- pca_transformed_test$TARGET
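Note that the chunk above refits PCA on the test set. An alternative sketch (not the approach used for the results below) is to project the test features onto the rotation learned from the training data, which keeps the meaning of PC1–PC25 identical across the two sets; it assumes final_filtered_test_x has the same columns as the matrix passed to prcomp() during training.
# Sketch: reuse the training PCA instead of refitting on the test set.
# predict.prcomp() applies the training centering/scaling and rotation to new data.
# test_pcs <- predict(pca_result, newdata = final_filtered_test_x)[, 1:25]
# pca_projected_test <- data.frame(test_pcs,
#                                  TARGET = as.factor(final_filtered_test$TARGET))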
Logistic Regression
Next, we will train our Logistic Regression model using the dataset
transformed by PCA. This will allow us to leverage the reduced
dimensionality, improving model efficiency and generalization by
eliminating multicollinearity and noise.
# Parallel Processing for faster training
cl <- makeCluster(detectCores()-1)
registerDoParallel(cl)
# Define train control with custom Precision function
train_control1 <- trainControl(
method = "cv", # Cross-validation
number = 10, # Number of folds for CV
classProbs = TRUE, # To compute probabilities
summaryFunction = precisionSummary, # Use custom Precision function
sampling = "down",
allowParallel = TRUE
)
# Train logistic regression model
logistic_model_pca <- train(
TARGET ~ ., # Formula (target ~ predictors)
data = pca_transformed_df, # Training data
method = "glm", # Generalized Linear Model
family = "binomial", # Logistic regression
trControl = train_control1, # Cross-validation settings
metric = "Precision",
preProcess = c("center", "scale")
)
stopCluster(cl)
registerDoSEQ()
# View model results
print(logistic_model_pca)
## Generalized Linear Model
##
## 246006 samples
## 25 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221405, 221406, 221405, 221405, 221406, 221405, ...
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results:
##
## Precision
## 0.6544468
We use parallel processing to reduce training time. The predictors were pre-processed by centering and scaling, ensuring that all variables were on the same scale. To assess the model's performance, 10-fold cross-validation was employed, providing a more reliable measure of its generalization ability. Additionally, to address class imbalance, down-sampling was applied before pre-processing to ensure balanced representation of both classes during training.
The model’s resampling results yielded a Precision of 0.6545,
indicating that approximately 65.45% of the model’s positive predictions
(‘X1’) were correct, which is particularly valuable in contexts where
minimizing false positives is critical.
We will evaluate the model’s performance on the test set and
calculate its precision, allowing us to compare its effectiveness with
other models. This will provide insight into how well the model
generalizes to unseen data, specifically focusing on its ability to
correctly identify positive cases while minimizing false positives.
# Make predictions on the test set
lr_predictions <- predict(logistic_model_pca, newdata = x_test)
# Assuming the true labels are stored in test_target
lr_conf_matrix <- confusionMatrix(lr_predictions, y_test)
lr_precision <- lr_conf_matrix$byClass["Specificity"]
Random Forest
Next, we will train our model using Random Forest, a powerful ensemble learning technique that combines multiple decision trees to improve prediction accuracy and robustness. Because training is computationally heavy, we trained this model in advance on a machine with a faster CPU and more RAM, saved it to a file, and load it here for quick access.
# cl <- makeCluster(detectCores()-1)
# registerDoParallel()
#
# Define train control with custom Precision function
# train_control2 <- trainControl(
# method = "cv", # Cross-validation
# number = 5, # Number of folds for CV
# summaryFunction = precisionSummary, # Use custom Precision function
# sampling = "down",
# verboseIter = TRUE,
# allowParallel = TRUE
# )
#
# tuneGrid <- expand.grid(
# mtry = seq(2,10, by = 2)
# )
#
# # Train the Random Forest model
# rf_model_pca <- train(
# TARGET ~ ., # Formula (target ~ predictors)
# data = pca_transformed_df, # Training data
# method = "rf", # Random Forest method
# trControl = train_control2, # Cross-validation control
# metric = "Precision", # Optimize for ROC (AUC)
# tuneGrid = tuneGrid,
# ntree = 100,
# preProcess = c("center", "scale")# Tune a few hyperparameters (adjust if needed)
# )
#
# stopCluster(cl)
# registerDoSEQ()
#saveRDS(rf_model_pca, "rf_model_pca.rds")
rf_model_pca <- readRDS("rf_model_pca.rds")
# Print the model results
print(rf_model_pca)
## Random Forest
##
## 246006 samples
## 25 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 196805, 196805, 196805, 196805, 196804
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results across tuning parameters:
##
## mtry Precision
## 2 0.6449946
## 4 0.6423792
## 6 0.6483637
## 8 0.6473577
## 10 0.6434361
##
## Precision was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
We use parallel processing to reduce training time. The data underwent pre-processing, which included centering and scaling of all predictors, followed by 5-fold cross-validation for resampling, which balances the efficiency and accuracy of the model. Additional
down-sampling was applied prior to pre-processing to address class
imbalance. The model was tuned using the parameter mtry, which
determines the number of variables randomly selected at each split in
the trees. The resampling results showed that the precision values
varied with different values of mtry: 0.64499 for mtry = 2, 0.64238 for
mtry = 4, 0.64836 for mtry = 6, 0.64736 for mtry = 8, and 0.64344 for
mtry = 10. Based on the highest precision value of 0.64836, the optimal
mtry value of 6 was selected for the final model.
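For context, the selected value is close to the usual Random Forest default for classification, mtry = floor(sqrt(p)) for p predictors; a quick check with our 25 principal components:
# Default mtry heuristic for classification forests
p <- 25 # number of principal components used as predictors
floor(sqrt(p)) # 5, so the tuned value of 6 is in the expected neighbourhood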
# Make predictions on the test set
rf_predictions <- predict(rf_model_pca, newdata = x_test)
# Assuming the true labels are stored in test_target
rf_conf_matrix <- confusionMatrix(rf_predictions, y_test)
rf_precision <- rf_conf_matrix$byClass["Specificity"]
Extreme Gradient Boosting
Finally, we will train our model using XGBoost, a powerful and
efficient gradient boosting algorithm known for its high predictive
performance and ability to handle complex, non-linear relationships in
the data. As with the Random Forest, we trained this model in advance on a machine with a faster CPU and more RAM, saved it to a file, and load it here for quick access.
# # Set up parallel processing
# cl <- makeCluster(detectCores() - 1) # Use one less core to avoid overload
# registerDoParallel(cl)
#
# # Set up train control with custom precision function
# train_control3 <- trainControl(
# method = "cv", # Cross-validation
# number = 5, # 5-fold cross-validation
# summaryFunction = precisionSummary, # Use custom precision function
# sampling = "down", # Down-sample the majority class to balance classes
# allowParallel = TRUE, # Enable parallelization
# )
#
# # Hyperparameter grid for tuning XGBoost
# tuneGrid <- expand.grid(
# nrounds = 50, # Number of boosting rounds (trees)
# max_depth = c(3, 6), # Maximum depth of trees
# eta = c(0.1, 0.3), # Learning rate
# gamma = c(0, 0.1), # Minimum loss reduction required to make a further partition
# colsample_bytree = c(0.6, 0.8), # Fraction of features used for each tree
# min_child_weight = c(1, 5), # Minimum sum of instance weight (hessian) needed in a child
# subsample = c(0.7, 0.8) # Fraction of samples used for training each tree
# )
#
# # Train the XGBoost model using caret
# xgb_model <- train(
# TARGET ~ ., # Formula (TARGET as the target variable)
# data = pca_transformed_df, # Your PCA-transformed data frame
# method = "xgbTree", # Specify XGBoost model (tree-based)
# trControl = train_control3, # Cross-validation control with custom precision function
# tuneGrid = tuneGrid, # Hyperparameter tuning grid
# metric = "Precision", # Optimize for Precision
# preProcess = c("center", "scale") # Preprocess data: standardize features
# )
#
# # Stop the parallel cluster
# stopCluster(cl)
#registerDoSEQ() # Disable parallelism for further operations
#saveRDS(xgb_model, "xgb_model_pca.rds")
xgb_model_pca <- readRDS("xgb_model_pca.rds")
# Print the model results
print(xgb_model_pca)
## eXtreme Gradient Boosting
##
## 246006 samples
## 25 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 196805, 196804, 196805, 196805, 196805
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results across tuning parameters:
##
## eta max_depth gamma colsample_bytree min_child_weight subsample
## 0.1 3 0.0 0.6 1 0.7
## 0.1 3 0.0 0.6 1 0.8
## 0.1 3 0.0 0.6 5 0.7
## 0.1 3 0.0 0.6 5 0.8
## 0.1 3 0.0 0.8 1 0.7
## 0.1 3 0.0 0.8 1 0.8
## 0.1 3 0.0 0.8 5 0.7
## 0.1 3 0.0 0.8 5 0.8
## 0.1 3 0.1 0.6 1 0.7
## 0.1 3 0.1 0.6 1 0.8
## 0.1 3 0.1 0.6 5 0.7
## 0.1 3 0.1 0.6 5 0.8
## 0.1 3 0.1 0.8 1 0.7
## 0.1 3 0.1 0.8 1 0.8
## 0.1 3 0.1 0.8 5 0.7
## 0.1 3 0.1 0.8 5 0.8
## 0.1 6 0.0 0.6 1 0.7
## 0.1 6 0.0 0.6 1 0.8
## 0.1 6 0.0 0.6 5 0.7
## 0.1 6 0.0 0.6 5 0.8
## 0.1 6 0.0 0.8 1 0.7
## 0.1 6 0.0 0.8 1 0.8
## 0.1 6 0.0 0.8 5 0.7
## 0.1 6 0.0 0.8 5 0.8
## 0.1 6 0.1 0.6 1 0.7
## 0.1 6 0.1 0.6 1 0.8
## 0.1 6 0.1 0.6 5 0.7
## 0.1 6 0.1 0.6 5 0.8
## 0.1 6 0.1 0.8 1 0.7
## 0.1 6 0.1 0.8 1 0.8
## 0.1 6 0.1 0.8 5 0.7
## 0.1 6 0.1 0.8 5 0.8
## 0.3 3 0.0 0.6 1 0.7
## 0.3 3 0.0 0.6 1 0.8
## 0.3 3 0.0 0.6 5 0.7
## 0.3 3 0.0 0.6 5 0.8
## 0.3 3 0.0 0.8 1 0.7
## 0.3 3 0.0 0.8 1 0.8
## 0.3 3 0.0 0.8 5 0.7
## 0.3 3 0.0 0.8 5 0.8
## 0.3 3 0.1 0.6 1 0.7
## 0.3 3 0.1 0.6 1 0.8
## 0.3 3 0.1 0.6 5 0.7
## 0.3 3 0.1 0.6 5 0.8
## 0.3 3 0.1 0.8 1 0.7
## 0.3 3 0.1 0.8 1 0.8
## 0.3 3 0.1 0.8 5 0.7
## 0.3 3 0.1 0.8 5 0.8
## 0.3 6 0.0 0.6 1 0.7
## 0.3 6 0.0 0.6 1 0.8
## 0.3 6 0.0 0.6 5 0.7
## 0.3 6 0.0 0.6 5 0.8
## 0.3 6 0.0 0.8 1 0.7
## 0.3 6 0.0 0.8 1 0.8
## 0.3 6 0.0 0.8 5 0.7
## 0.3 6 0.0 0.8 5 0.8
## 0.3 6 0.1 0.6 1 0.7
## 0.3 6 0.1 0.6 1 0.8
## 0.3 6 0.1 0.6 5 0.7
## 0.3 6 0.1 0.6 5 0.8
## 0.3 6 0.1 0.8 1 0.7
## 0.3 6 0.1 0.8 1 0.8
## 0.3 6 0.1 0.8 5 0.7
## 0.3 6 0.1 0.8 5 0.8
## Precision
## 0.6506760
## 0.6558550
## 0.6570620
## 0.6520337
## 0.6534918
## 0.6551011
## 0.6523352
## 0.6528882
## 0.6545977
## 0.6549502
## 0.6539447
## 0.6541451
## 0.6555027
## 0.6537434
## 0.6554529
## 0.6534914
## 0.6539949
## 0.6491165
## 0.6479099
## 0.6538939
## 0.6513799
## 0.6524859
## 0.6553020
## 0.6557544
## 0.6541451
## 0.6553518
## 0.6547991
## 0.6538942
## 0.6535922
## 0.6508770
## 0.6515306
## 0.6556535
## 0.6532906
## 0.6509271
## 0.6479097
## 0.6513799
## 0.6513293
## 0.6513794
## 0.6528881
## 0.6547487
## 0.6515804
## 0.6548998
## 0.6533910
## 0.6470052
## 0.6505755
## 0.6511791
## 0.6544973
## 0.6526373
## 0.6458989
## 0.6391102
## 0.6464516
## 0.6435858
## 0.6452450
## 0.6441390
## 0.6451449
## 0.6447422
## 0.6426809
## 0.6431831
## 0.6434350
## 0.6449934
## 0.6421281
## 0.6404183
## 0.6439879
## 0.6407196
##
## Tuning parameter 'nrounds' was held constant at a value of 50
## Precision was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 3, eta
## = 0.1, gamma = 0, colsample_bytree = 0.6, min_child_weight = 5 and subsample
## = 0.7.
The model underwent pre-processing, including centering and scaling
of all features, followed by 5-fold cross-validation and down-sampling
to handle class imbalance. Various hyperparameters, including learning
rate (eta), maximum tree depth (max_depth), minimum child weight, and
others, were tuned to optimize performance. However, due to the
computationally expensive nature of XGBoost, we opted for a more limited
hyperparameter search to reduce training time. The final model was
selected using a precision metric, with the best configuration being eta
= 0.1, max_depth = 3, gamma = 0, colsample_bytree = 0.6,
min_child_weight = 5, and subsample = 0.7. While more extensive
fine-tuning could potentially improve performance further, the decision
to limit the search was based on balancing model accuracy and
computational efficiency. The selected parameters provided a good
trade-off, ensuring the model was trained within a reasonable timeframe
while still achieving a precision score of 0.657.
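The winning configuration can also be pulled straight from the caret object rather than read off the long table above; a short sketch (assuming the resampling results keep the custom metric under the column name Precision, as in the printout):
# Best hyperparameter combination selected by caret
xgb_model_pca$bestTune
# Cross-validated precision achieved by that combination
max(xgb_model_pca$results$Precision, na.rm = TRUE) # should match the ~0.657 reported above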
# Make predictions on the test set
xgb_predictions <- predict(xgb_model_pca, newdata = x_test)
# Assuming the true labels are stored in test_target
xgb_conf_matrix <- confusionMatrix(xgb_predictions, y_test)
xgb_precision <- xgb_conf_matrix$byClass["Specificity"]
Main Findings
Comparing the Models
# Create a data frame with the model names and their corresponding precision
precision_table <- data.frame(
Model = c("Random Forest (RF)", "Extreme Gradient Boosting (XGB)", "Logistic Regression (LR)"),
Precision = c(rf_precision, xgb_precision, lr_precision)
)
# Order the data frame by decreasing precision
precision_table <- precision_table[order(-precision_table$Precision), ]
# Print the table using knitr::kable
kable(precision_table, caption = "Model Precision")
Model Precision

| Model | Precision |
|---|---|
| Logistic Regression (LR) | 0.4714052 |
| Extreme Gradient Boosting (XGB) | 0.4556100 |
| Random Forest (RF) | 0.4452614 |
The table presents the Precision of the three models, listed in decreasing order:
Logistic Regression (LR) achieved the highest Precision at 0.4714. This means it was the most reliable in correctly predicting repayment cases (true positives), minimizing false positives most effectively among the models tested.
eXtreme Gradient Boosting (XGB) followed with a Precision of 0.4556, slightly lower than Logistic Regression. While it was still good at correctly identifying true positives, it did so at a slightly higher rate of false positives compared to LR.
Random Forest (RF) had the lowest Precision at 0.4453, indicating it was the least effective at minimizing false positives when predicting repayment.
Interestingly, the results contradict our initial expectations. Given
the reputation of XGBoost (XGB) as a powerful model for handling complex
relationships, we anticipated it would perform better in terms of
Precision. However, Logistic Regression (LR) outperformed XGBoost (XGB)
in this case, highlighting that even simpler models can sometimes
deliver better results, especially when the dataset is relatively
straightforward or when the cost of false positives is a critical
consideration.
Principal Components
We will first make a table of the principal component loadings in our model. Since there are 25 PCs, we show only the first 12 here.
# Get the loadings (rotation matrix)
loadings_matrix <- pca_result$rotation
# Select the first 12 principal components
selected_loadings <- loadings_matrix[, 1:12]
# Create a cleaner table display with kable
kable(selected_loadings,
caption = "Loadings for the First 12 Principal Components (PCs)",
digits = 3, # Format numeric values to 3 decimal places
col.names = paste("PC", 1:12)) # Provide descriptive column names
Loadings for the First 12 Principal Components (PCs)
| Variable | PC 1 | PC 2 | PC 3 | PC 4 | PC 5 | PC 6 | PC 7 | PC 8 | PC 9 | PC 10 | PC 11 | PC 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NAME_CONTRACT_TYPE | -0.082 | 0.137 | -0.041 | 0.218 | -0.115 | 0.176 | -0.045 | -0.147 | 0.003 | 0.141 | -0.042 | -0.163 |
| CODE_GENDERM | -0.143 | -0.100 | -0.051 | 0.035 | 0.243 | 0.015 | -0.070 | 0.237 | -0.195 | 0.451 | 0.024 | 0.207 |
| FLAG_OWN_CARY | -0.176 | -0.020 | -0.127 | -0.028 | 0.271 | -0.040 | -0.044 | 0.244 | -0.187 | 0.345 | 0.109 | 0.240 |
| FLAG_OWN_REALTYY | 0.064 | 0.052 | -0.016 | 0.020 | 0.240 | -0.169 | 0.126 | 0.074 | 0.330 | -0.267 | 0.386 | 0.123 |
| AMT_INCOME_TOTAL | -0.132 | 0.067 | 0.031 | -0.036 | 0.000 | -0.109 | 0.046 | 0.152 | -0.021 | 0.141 | 0.195 | 0.044 |
| AMT_CREDIT | -0.372 | 0.356 | -0.065 | 0.179 | -0.138 | 0.015 | 0.017 | -0.012 | 0.056 | -0.033 | -0.019 | 0.002 |
| AMT_ANNUITY | -0.369 | 0.311 | -0.051 | 0.147 | -0.102 | -0.017 | 0.029 | 0.031 | 0.036 | 0.002 | 0.021 | -0.016 |
| AMT_GOODS_PRICE | -0.374 | 0.356 | -0.062 | 0.168 | -0.136 | 0.017 | 0.009 | -0.016 | 0.048 | -0.035 | -0.007 | 0.015 |
| NAME_TYPE_SUITE | 0.004 | -0.028 | 0.043 | -0.053 | -0.178 | -0.036 | -0.071 | 0.086 | -0.027 | 0.219 | 0.145 | 0.092 |
| NAME_EDUCATION_TYPE | -0.135 | 0.042 | -0.001 | -0.185 | -0.086 | -0.218 | -0.008 | 0.118 | -0.062 | -0.026 | 0.230 | 0.074 |
| NAME_FAMILY_STATUS | -0.142 | 0.037 | -0.244 | 0.030 | 0.433 | -0.018 | -0.114 | -0.144 | -0.239 | -0.222 | -0.096 | -0.158 |
| NAME_HOUSING_TYPE | 0.028 | 0.106 | -0.042 | 0.030 | 0.260 | -0.041 | -0.016 | -0.075 | 0.298 | -0.202 | 0.413 | 0.141 |
| REGION_POPULATION_RELATIVE | -0.116 | 0.125 | 0.259 | -0.344 | 0.123 | 0.189 | 0.129 | 0.105 | -0.050 | -0.046 | 0.017 | -0.139 |
| DAYS_EMPLOYED | 0.277 | 0.257 | 0.246 | 0.260 | 0.120 | 0.048 | -0.046 | 0.037 | -0.137 | -0.022 | 0.071 | -0.074 |
| DAYS_REGISTRATION | 0.108 | 0.131 | 0.182 | 0.013 | -0.001 | 0.207 | -0.055 | -0.053 | 0.089 | -0.034 | -0.113 | 0.226 |
| DAYS_ID_PUBLISH | 0.090 | 0.141 | 0.080 | 0.120 | 0.192 | 0.137 | -0.146 | -0.044 | -0.114 | 0.051 | -0.054 | 0.240 |
| FLAG_WORK_PHONE | -0.098 | -0.141 | -0.061 | -0.114 | -0.122 | 0.246 | -0.267 | -0.408 | -0.086 | 0.073 | 0.262 | 0.077 |
| FLAG_CONT_MOBILE | -0.007 | 0.021 | -0.010 | 0.070 | -0.038 | 0.082 | -0.028 | -0.106 | 0.015 | 0.016 | -0.001 | -0.122 |
| FLAG_PHONE | -0.029 | 0.042 | 0.062 | -0.161 | -0.038 | 0.272 | -0.312 | -0.371 | -0.101 | 0.061 | 0.367 | 0.087 |
| FLAG_EMAIL | -0.057 | -0.001 | 0.011 | -0.081 | -0.009 | -0.127 | 0.068 | 0.075 | -0.043 | 0.035 | 0.336 | -0.231 |
| OCCUPATION_TYPE | 0.217 | 0.189 | 0.237 | 0.261 | 0.140 | 0.086 | -0.050 | 0.065 | -0.184 | 0.006 | 0.106 | -0.149 |
| CNT_FAM_MEMBERS | -0.150 | -0.063 | -0.294 | -0.030 | 0.389 | -0.059 | -0.090 | -0.141 | -0.223 | -0.223 | -0.121 | -0.185 |
| REGION_RATING_CLIENT_W_CITY | 0.149 | -0.113 | -0.313 | 0.369 | -0.108 | -0.170 | -0.115 | -0.091 | 0.034 | 0.073 | 0.060 | 0.112 |
| HOUR_APPR_PROCESS_START | -0.110 | 0.018 | 0.165 | -0.277 | -0.022 | 0.131 | 0.071 | 0.017 | -0.143 | -0.096 | -0.090 | -0.106 |
| REG_REGION_NOT_LIVE_REGION | -0.120 | -0.130 | 0.246 | 0.124 | -0.133 | -0.265 | -0.014 | -0.121 | -0.331 | -0.257 | 0.070 | 0.226 |
| REG_REGION_NOT_WORK_REGION | -0.239 | -0.203 | 0.423 | 0.140 | 0.130 | -0.286 | -0.045 | -0.249 | 0.093 | 0.080 | -0.078 | -0.003 |
| LIVE_REGION_NOT_WORK_REGION | -0.217 | -0.165 | 0.375 | 0.102 | 0.215 | -0.201 | -0.043 | -0.224 | 0.254 | 0.206 | -0.118 | -0.108 |
| REG_CITY_NOT_LIVE_CITY | -0.093 | -0.208 | 0.088 | 0.176 | -0.221 | -0.045 | 0.063 | 0.122 | -0.395 | -0.340 | 0.077 | 0.179 |
| REG_CITY_NOT_WORK_CITY | -0.209 | -0.361 | 0.052 | 0.243 | -0.015 | 0.369 | 0.056 | 0.234 | 0.036 | -0.192 | 0.093 | 0.007 |
| LIVE_CITY_NOT_WORK_CITY | -0.193 | -0.298 | 0.031 | 0.196 | 0.113 | 0.440 | 0.026 | 0.199 | 0.239 | -0.048 | 0.070 | -0.081 |
| ORGANIZATION_TYPE | 0.128 | 0.148 | 0.206 | 0.197 | 0.107 | 0.040 | -0.023 | 0.140 | -0.225 | 0.055 | 0.179 | -0.189 |
| EXT_SOURCE_2 | -0.107 | 0.148 | 0.148 | -0.273 | 0.088 | 0.069 | 0.010 | 0.059 | -0.043 | -0.158 | -0.079 | 0.246 |
| EXT_SOURCE_3 | 0.040 | 0.114 | 0.028 | 0.045 | 0.120 | 0.092 | -0.101 | 0.011 | 0.136 | -0.125 | -0.281 | 0.521 |
| OBS_30_CNT_SOCIAL_CIRCLE | 0.018 | 0.007 | -0.067 | 0.065 | 0.094 | 0.108 | 0.577 | -0.285 | -0.104 | 0.115 | 0.035 | 0.146 |
| DEF_30_CNT_SOCIAL_CIRCLE | 0.028 | -0.007 | -0.032 | 0.051 | 0.050 | 0.103 | 0.593 | -0.272 | -0.121 | 0.130 | 0.080 | 0.103 |
The table offers insight into the relationships between the original features and the principal components (PCs) derived from the PCA. By examining the loadings of the features (such as NAME_CONTRACT_TYPE, CODE_GENDERM, AMT_INCOME_TOTAL, etc.) on each PC, we can identify which variables contribute most to the variance captured by that component. For instance, AMT_CREDIT and AMT_GOODS_PRICE have large loadings (about -0.37) on PC1, indicating their importance for the first principal component, while variables such as FLAG_OWN_CARY and FLAG_OWN_REALTYY load more strongly on other PCs.
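To read the loadings table more quickly, the variables that dominate a given component can be ranked by absolute loading; a short sketch using the loadings_matrix defined above:
# Top five contributors to PC1 by absolute loading
head(sort(abs(loadings_matrix[, "PC1"]), decreasing = TRUE), 5)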
summary(logistic_model_pca)
##
## Call:
## NULL
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.008171 0.010947 0.746 0.45546
## PC1 0.017704 0.011174 1.584 0.11310
## PC2 -0.374428 0.011356 -32.973 < 2e-16 ***
## PC3 -0.205690 0.011141 -18.463 < 2e-16 ***
## PC4 0.259956 0.011211 23.187 < 2e-16 ***
## PC5 -0.192139 0.011081 -17.340 < 2e-16 ***
## PC6 -0.045295 0.011244 -4.028 5.62e-05 ***
## PC7 0.110753 0.011080 9.996 < 2e-16 ***
## PC8 -0.067780 0.011463 -5.913 3.36e-09 ***
## PC9 -0.050919 0.011041 -4.612 3.99e-06 ***
## PC10 0.258694 0.011546 22.405 < 2e-16 ***
## PC11 0.143017 0.011906 12.012 < 2e-16 ***
## PC12 -0.402464 0.011389 -35.336 < 2e-16 ***
## PC13 0.085002 0.011209 7.583 3.37e-14 ***
## PC14 -0.325279 0.011301 -28.783 < 2e-16 ***
## PC15 -0.085886 0.011238 -7.642 2.13e-14 ***
## PC16 0.113705 0.011232 10.123 < 2e-16 ***
## PC17 -0.028670 0.030336 -0.945 0.34460
## PC18 -0.177311 0.011190 -15.846 < 2e-16 ***
## PC19 0.021473 0.010959 1.959 0.05007 .
## PC20 0.107857 0.012789 8.434 < 2e-16 ***
## PC21 -0.120493 0.011208 -10.751 < 2e-16 ***
## PC22 0.053686 0.010948 4.904 9.40e-07 ***
## PC23 0.031235 0.011009 2.837 0.00455 **
## PC24 -0.091964 0.011005 -8.357 < 2e-16 ***
## PC25 -0.047307 0.011046 -4.283 1.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 55138 on 39773 degrees of freedom
## Residual deviance: 48402 on 39748 degrees of freedom
## AIC: 48454
##
## Number of Fisher Scoring iterations: 5
The regression output shows the relationship between the principal
components (PCs) and the response variable. Several PCs have
statistically significant coefficients, indicating their strong
influence on the model. For example, PC2, PC3, PC4, and PC12 have very
low p-values (less than 0.001), suggesting that these principal
components are highly significant predictors. These PCs likely
correspond to the most influential features in the data, as indicated by
their strong correlations with the response variable.
PC7, PC10, PC11, and PC14 also show significant coefficients (p-values < 0.01) with moderate effect sizes, meaning they play an important role in the model. For example, PC2 has a large negative coefficient (-0.374), indicating a negative relationship with the response variable, while PC4 has a positive coefficient (0.260), suggesting a positive impact.
On the other hand, PC1 and PC17 have p-values above 0.1, and PC19 is only marginally significant (p ≈ 0.05), suggesting these components do not meaningfully influence the response variable in this model.
The Intercept is not significant (p-value ≈ 0.46), meaning the baseline log-odds alone do not provide strong predictive power. The model's overall fit is indicated by a residual deviance of 48402, down from a null deviance of 55138, suggesting that the principal components contribute substantially to explaining the variance in the response variable.
In summary, several principal components play a critical role in
explaining the response variable, with PC2, PC3, PC4, and PC12 showing
the most notable influence based on their significant coefficients and
low p-values.
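Because the outcome is binary, each PC coefficient can also be read as an odds ratio on the centered-and-scaled component scale; a small sketch on the fitted model:
# Odds ratios implied by the logistic coefficients
round(exp(coef(logistic_model_pca$finalModel)), 3)
# e.g. exp(-0.374) is about 0.69 for PC2: a one-unit increase in (scaled) PC2
# multiplies the odds of the modelled class (X1) by roughly 0.69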
Important Factors (XGBoost)
We will make a chart of the top 10 most important factors from the
XGBoost model since it is the second best model that we have.
xgb_model <- xgb_model_pca$finalModel
importance_matrix <- xgb.importance(model = xgb_model)
# Top 10 important features
top_10 <- head(importance_matrix, 10)
# Professional plot with color and styling
ggplot(top_10, aes(x = reorder(Feature, Gain), y = Gain, fill = Gain)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_gradient(low = "#99c2ff", high = "#084594") + # Color gradient
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 7, vjust = 1),
axis.text.y = element_text(hjust = 1, size = 7, vjust = 1),
legend.position = "None") +
labs(
title = "Top 10 Feature Importance (XGBoost)",
x = "Feature",
y = "Importance"
)

The chart displays the top 10 most important features based on their
importance scores. The bar lengths indicate each feature’s contribution
to the model, with longer bars representing higher importance. PC12
stands out as the most important feature, followed by PC2, PC14, and
others. Conversely, features like PC1 and PC21 show relatively low
importance, suggesting they have a minimal impact on the model’s
predictions. Since PC12 and PC2 are among the most important factors in both models, we will investigate these principal components further.
# Extract the loadings for the 12th principal component
selected_loadings <- loadings_matrix[, 12]
kable(selected_loadings, col.names = c("Feature", "Loading Value"), caption = "Loadings for the 12th Principal Component (PC12)")
Loadings for the 12th Principal Component (PC12)
| Feature | Loading Value |
|---|---|
| NAME_CONTRACT_TYPE | -0.1630478 |
| CODE_GENDERM | 0.2069212 |
| FLAG_OWN_CARY | 0.2397480 |
| FLAG_OWN_REALTYY | 0.1225203 |
| AMT_INCOME_TOTAL | 0.0442684 |
| AMT_CREDIT | 0.0019878 |
| AMT_ANNUITY | -0.0163972 |
| AMT_GOODS_PRICE | 0.0145111 |
| NAME_TYPE_SUITE | 0.0918354 |
| NAME_EDUCATION_TYPE | 0.0739670 |
| NAME_FAMILY_STATUS | -0.1581713 |
| NAME_HOUSING_TYPE | 0.1414969 |
| REGION_POPULATION_RELATIVE | -0.1392150 |
| DAYS_EMPLOYED | -0.0740448 |
| DAYS_REGISTRATION | 0.2262366 |
| DAYS_ID_PUBLISH | 0.2401016 |
| FLAG_WORK_PHONE | 0.0774079 |
| FLAG_CONT_MOBILE | -0.1220529 |
| FLAG_PHONE | 0.0872052 |
| FLAG_EMAIL | -0.2308442 |
| OCCUPATION_TYPE | -0.1485361 |
| CNT_FAM_MEMBERS | -0.1846431 |
| REGION_RATING_CLIENT_W_CITY | 0.1122269 |
| HOUR_APPR_PROCESS_START | -0.1057276 |
| REG_REGION_NOT_LIVE_REGION | 0.2259917 |
| REG_REGION_NOT_WORK_REGION | -0.0032573 |
| LIVE_REGION_NOT_WORK_REGION | -0.1077115 |
| REG_CITY_NOT_LIVE_CITY | 0.1788075 |
| REG_CITY_NOT_WORK_CITY | 0.0068919 |
| LIVE_CITY_NOT_WORK_CITY | -0.0810293 |
| ORGANIZATION_TYPE | -0.1886317 |
| EXT_SOURCE_2 | 0.2461591 |
| EXT_SOURCE_3 | 0.5214227 |
| OBS_30_CNT_SOCIAL_CIRCLE | 0.1464505 |
| DEF_30_CNT_SOCIAL_CIRCLE | 0.1026571 |
Our analysis of the 12th Principal Component (PC12) reveals several
key findings:
We found that EXT_SOURCE_3 (0.52) and EXT_SOURCE_2 (0.25) have the
strongest positive loadings, indicating that these features are strongly
associated with late repayment. Similarly, DAYS_ID_PUBLISH (0.24) and
CODE_GENDERM (0.21) also show positive loadings, further reinforcing
their role in predicting late repayment.
On the other hand, features such as FLAG_EMAIL (-0.23) and
NAME_CONTRACT_TYPE (-0.16) have negative loadings, suggesting that these
variables are linked to timely repayment. Other features like
DAYS_REGISTRATION (0.23) and REG_REGION_NOT_LIVE_REGION (0.23) also
appear important, though their direction of influence (positive or
negative) requires deeper exploration.
Overall, variables related to external sources of information,
registration details, and gender appear to be most influential in
predicting repayment behavior.
# Extract the loadings for the 2nd principal component
selected_loadings <- loadings_matrix[, 2]
kable(selected_loadings, col.names = c("Feature", "Loading Value"), caption = "Loadings for the 2nd Principal Component (PC2)")
Loadings for the 2nd Principal Component (PC2)
| Feature | Loading Value |
|---|---|
| NAME_CONTRACT_TYPE | 0.1374282 |
| CODE_GENDERM | -0.0998634 |
| FLAG_OWN_CARY | -0.0200205 |
| FLAG_OWN_REALTYY | 0.0523897 |
| AMT_INCOME_TOTAL | 0.0668819 |
| AMT_CREDIT | 0.3563036 |
| AMT_ANNUITY | 0.3105158 |
| AMT_GOODS_PRICE | 0.3558416 |
| NAME_TYPE_SUITE | -0.0283150 |
| NAME_EDUCATION_TYPE | 0.0424008 |
| NAME_FAMILY_STATUS | 0.0365490 |
| NAME_HOUSING_TYPE | 0.1061587 |
| REGION_POPULATION_RELATIVE | 0.1248927 |
| DAYS_EMPLOYED | 0.2569201 |
| DAYS_REGISTRATION | 0.1309993 |
| DAYS_ID_PUBLISH | 0.1406090 |
| FLAG_WORK_PHONE | -0.1412905 |
| FLAG_CONT_MOBILE | 0.0209935 |
| FLAG_PHONE | 0.0419170 |
| FLAG_EMAIL | -0.0011272 |
| OCCUPATION_TYPE | 0.1890320 |
| CNT_FAM_MEMBERS | -0.0633861 |
| REGION_RATING_CLIENT_W_CITY | -0.1129464 |
| HOUR_APPR_PROCESS_START | 0.0179112 |
| REG_REGION_NOT_LIVE_REGION | -0.1300600 |
| REG_REGION_NOT_WORK_REGION | -0.2029258 |
| LIVE_REGION_NOT_WORK_REGION | -0.1651764 |
| REG_CITY_NOT_LIVE_CITY | -0.2081718 |
| REG_CITY_NOT_WORK_CITY | -0.3614661 |
| LIVE_CITY_NOT_WORK_CITY | -0.2976372 |
| ORGANIZATION_TYPE | 0.1481735 |
| EXT_SOURCE_2 | 0.1479434 |
| EXT_SOURCE_3 | 0.1135832 |
| OBS_30_CNT_SOCIAL_CIRCLE | 0.0068224 |
| DEF_30_CNT_SOCIAL_CIRCLE | -0.0066143 |
In PC2, key variables such as AMT_CREDIT (0.36), AMT_GOODS_PRICE
(0.36), and AMT_ANNUITY (0.31) show strong positive loadings, indicating
that higher loan amounts, goods prices, and annuities are associated
with late or no repayment (TARGET = 1). Additionally, DAYS_EMPLOYED
(0.26) and DAYS_REGISTRATION (0.13) suggest that longer employment and
account age may increase the likelihood of late repayment. Negative
loadings for variables like REG_CITY_NOT_WORK_CITY (-0.36),
LIVE_CITY_NOT_WORK_CITY (-0.30), and FLAG_WORK_PHONE (-0.14) imply that
living and working in different cities or having a work phone is linked
to timely repayment. This highlights the importance of financial and
geographic factors in repayment behavior.