data <- data.frame(
Matric = c("23119730", "23063431", "23083896", "S2157112", "23070795"),
Full_Name = c("Kau Zhi Wei", "Chang Qi Han", "Cheah Yining", "Chee Zi Yaw", "Zhuang Zhen Yu")
)
data
## Matric Full_Name
## 1 23119730 Kau Zhi Wei
## 2 23063431 Chang Qi Han
## 3 23083896 Cheah Yining
## 4 S2157112 Chee Zi Yaw
## 5 23070795 Zhuang Zhen Yu
A loan is a sum of money borrowed from a licensed financial institution, often a bank, that carries a certain amount of interest and must be repaid by the borrower within a predetermined period. A loan serves different purposes for different groups. From the borrower's perspective, it can act as a financial resource, providing startup capital or initial funding to launch a project or business. For financial institutions, on the other hand, the interest generated from loans is one of their revenue streams. The loan application process can be lengthy, involving multiple steps, from the submission of the application to back-and-forth verification and validation, and finally approval. With the growing demand for loans, challenges arise for both parties: applicants may face extended waiting periods, while financial institutions risk overlooking some applications, potentially leading to reputational damage and perceptions of incompetence. On top of that, even after a prolonged verification and validation process, there is no assurance that an approved applicant will adhere to the repayment schedule. To address these limitations, machine learning approaches have been studied extensively for their potential to predict the eligibility of loan applications.
This project aims:
The methodology of this study is structured as follows:
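In the rendered report this is illustrated with a flow diagram drawn using the DiagrammeR package. The chunk below is a minimal sketch of how such a diagram could be produced; the stage names are inferred from the later sections of this report rather than copied from the original diagram code.
library(DiagrammeR)
# Sketch of the methodology flow (stage names inferred from this report)
grViz('
digraph methodology {
  graph [rankdir = LR]
  node  [shape = box, style = filled, fillcolor = lightblue]
  "Data import" -> "Data cleaning" -> "Exploratory data analysis" ->
    "Pre-processing" -> "Model training (5-fold CV)" -> "Evaluation on test set"
}
')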
# Import dataset
df <- read.csv("loan_data_amended.csv")
head(df)
## person_age person_gender person_education person_income person_emp_exp
## 1 22 female Master 71948 0
## 2 21 female High School 12282 0
## 3 25 female High School 12438 3
## 4 23 female Bachelor 79753 0
## 5 24 male Master 66135 1
## 6 21 female High School 12951 0
## person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income
## 1 RENT 35000 PERSONAL 16.02 0.49
## 2 OWN 1000 EDUCATION 11.14 0.08
## 3 MORTGAGE 5500 MEDICAL 12.87 0.44
## 4 RENT 35000 MEDICAL 15.23 0.44
## 5 RENT 35000 MEDICAL 14.27 0.53
## 6 OWN 2500 VENTURE 7.14_ 0.19
## cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
## 1 3 561 No
## 2 2 504 Yes
## 3 3 635 No
## 4 2 675 No
## 5 4 586 No
## 6 2 532 No
## loan_status
## 1 1
## 2 0
## 3 1
## 4 1
## 5 1
## 6 1
The description of each feature is tabulated as follows:
| variable | description |
|---|---|
| person_age | Applicant’s age |
| person_gender | Applicant’s gender |
| person_education | Applicant’s highest education level |
| person_income | Applicant’s annual income |
| person_emp_exp | Work experience (in years) |
| person_home_ownership | Home ownership status |
| loan_amnt | Loan amount requested |
| loan_intent | Purpose of the loan application |
| loan_int_rate | Loan interest rate |
| loan_percent_income | Loan amount as a percentage of annual income |
| cb_person_cred_hist_length | Length of credit history (in years) |
| credit_score | Applicant’s credit score |
| previous_loan_defaults_on_file | Indicator of previous loan defaults |
| loan_status | Loan approval status |
str(df)
## 'data.frame': 45000 obs. of 14 variables:
## $ person_age : int 22 21 25 23 24 21 26 24 24 21 ...
## $ person_gender : chr "female" "female" "female" "female" ...
## $ person_education : chr "Master" "High School" "High School" "Bachelor" ...
## $ person_income : int 71948 12282 12438 79753 66135 12951 93471 95550 100684 12739 ...
## $ person_emp_exp : int 0 0 3 0 1 0 1 5 3 0 ...
## $ person_home_ownership : chr "RENT" "OWN" "MORTGAGE" "RENT" ...
## $ loan_amnt : int 35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
## $ loan_intent : chr "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...
## $ loan_int_rate : chr "16.02" "11.14" "12.87" "15.23" ...
## $ loan_percent_income : num 0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
## $ cb_person_cred_hist_length : chr "3" "2" "3" "2" ...
## $ credit_score : int 561 504 635 675 586 532 701 585 544 640 ...
## $ previous_loan_defaults_on_file: chr "No" "Yes" "No" "No" ...
## $ loan_status : int 1 0 1 1 1 1 1 1 1 1 ...
This dataset consists of 45,000 instances and 14 variables, comprising both categorical and numerical variables. Based on the output above, loan_int_rate and cb_person_cred_hist_length have been read in as character (categorical) variables, when in fact they are continuous and discrete numerical variables, respectively. Since these columns may contain text or mixed values, converting them directly to numeric would replace any non-numeric entries with NA, which could lose important information.
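Before any conversion, it helps to see which entries would fail a naive numeric conversion. The sketch below (not part of the original pipeline) does this for loan_int_rate; the same check applies to cb_person_cred_hist_length.
# List the loan_int_rate entries that a naive as.numeric() would turn into NA
# (suppressWarnings() hides the "NAs introduced by coercion" message)
bad_rates <- df$loan_int_rate[is.na(suppressWarnings(as.numeric(df$loan_int_rate)))]
head(unique(bad_rates))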
summary(df)
## person_age person_gender person_education person_income person_emp_exp person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income cb_person_cred_hist_length credit_score previous_loan_defaults_on_file loan_status
## Min. : 20.00 Length:45000 Length:45000 Min. : 8000 Min. : 0.00 Length:45000 Min. : 500 Length:45000 Length:45000 Min. :0.0000 Length:45000 Min. :390.0 Length:45000 Min. :0.0000
## 1st Qu.: 24.00 Class :character Class :character 1st Qu.: 47202 1st Qu.: 1.00 Class :character 1st Qu.: 5000 Class :character Class :character 1st Qu.:0.0700 Class :character 1st Qu.:601.0 Class :character 1st Qu.:0.0000
## Median : 26.00 Mode :character Mode :character Median : 67046 Median : 4.00 Mode :character Median : 8000 Mode :character Mode :character Median :0.1200 Mode :character Median :640.0 Mode :character Median :0.0000
## Mean : 27.76 Mean : 80311 Mean : 5.41 Mean : 9583 Mean :0.1397 Mean :632.6 Mean :0.2222
## 3rd Qu.: 30.00 3rd Qu.: 95782 3rd Qu.: 8.00 3rd Qu.:12237 3rd Qu.:0.1900 3rd Qu.:670.0 3rd Qu.:0.0000
## Max. :144.00 Max. :7200766 Max. :125.00 Max. :35000 Max. :0.6600 Max. :850.0 Max. :1.0000
## NA's :9
colSums(is.na(df))
## person_age person_gender person_education person_income person_emp_exp person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income cb_person_cred_hist_length credit_score previous_loan_defaults_on_file loan_status
## 0 0 0 9 0 0 0 0 0 0 0 0 0 0
duplicatecount <- sum(duplicated(df))
print(duplicatecount)
## [1] 0
# Categorical Columns to process
columns <- c("person_gender", "loan_status", "person_education", "person_home_ownership",
"previous_loan_defaults_on_file", "loan_intent")
# Extract Unique Values and Counts
uniquevalues <- lapply(df[columns], unique)
counts <- lapply(df[columns], table)
# Print Unique Values
print(uniquevalues)
## $person_gender
## [1] "female" "male"
##
## $loan_status
## [1] 1 0
##
## $person_education
## [1] "Master" "High School" "Bachelor" "Associate" "Doctorate"
##
## $person_home_ownership
## [1] "RENT" "OWN" "MORTGAGE" "OTHER"
##
## $previous_loan_defaults_on_file
## [1] "No" "Yes"
##
## $loan_intent
## [1] "PERSONAL" "EDUCATION" "MEDICAL" "VENTURE" "HOMEIMPROVEMENT" "DEBTCONSOLIDATION"
# Print Counts
print(counts)
## $person_gender
##
## female male
## 20159 24841
##
## $loan_status
##
## 0 1
## 35000 10000
##
## $person_education
##
## Associate Bachelor Doctorate High School Master
## 12028 13399 621 11972 6980
##
## $person_home_ownership
##
## MORTGAGE OTHER OWN RENT
## 18489 117 2951 23443
##
## $previous_loan_defaults_on_file
##
## No Yes
## 22142 22858
##
## $loan_intent
##
## DEBTCONSOLIDATION EDUCATION HOMEIMPROVEMENT MEDICAL PERSONAL VENTURE
## 7145 9153 4783 8548 7552 7819
library(plotly)
# Pie chart for person_gender
plot_ly(df, labels = ~person_gender, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
layout(title = "Distribution of Person Gender")
The pie chart above depicts the distribution of applicants’ gender in the dataset. The dataset has more male applicants (55.2%) than female applicants (44.8%), so it is slightly skewed towards male applicants.
# Pie chart for loan_status
plot_ly(df, labels = ~loan_status, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
layout(title = "Distribution of Loan Status")
Based on the output shown above, only a small proportion of applicants (22.2%) qualify for the loan, so the dataset is highly imbalanced. This suggests that the samples in this dataset come from a strict loan approval process.
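A quick tabulation (a minimal check, not part of the original pipeline) confirms the imbalance numerically:
# Class counts and proportions for loan_status (still coded 0/1 at this point)
table(df$loan_status)
prop.table(table(df$loan_status))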
# Pie chart for person_education
plot_ly(df, labels = ~person_education, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
layout(title = "Distribution of Person Education")
The applicants in this dataset come from three major educational backgrounds, namely Bachelor’s (29.8%), Associate (26.7%) and High School (26.6%), whereas the smallest group consists of applicants with a Doctorate, comprising 1.38% of all applicants.
# Pie chart for person_home_ownership
plot_ly(df, labels = ~person_home_ownership, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
layout(title = "Distribution of Person Home Ownership")
Most applicants are renting (52.1%) or have a mortgage (41.1%), while only a small group of applicants own a home outright.
# Pie chart for previous_loan_defaults_on_file
plot_ly(df, labels = ~previous_loan_defaults_on_file, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
layout(title = "Distribution of Previous Loan Defaults")
About half of the applicants have defaulted on a loan previously, which indicates that they may be financially unstable and could struggle with future debt repayment. Therefore, further analysis is required to assess this group of applicants.
# Pie chart for loan_intent
plot_ly(df, labels = ~loan_intent, type = "pie", textinfo = "label+percent", hole = 0.4) %>%
layout(title = "Distribution of Loan Intent")
Based on the loan intent, the top two reasons for borrowing are education (20.3%) and medical expenses (19%), indicating that many applicants seek financial support for academic and healthcare costs, which in turn reflects the high cost of these two areas. Apart from that, similar proportions of applicants seek funding for business ventures (17.4%), personal purposes (16.8%) and debt consolidation (15.9%). Home improvement accounts for the smallest proportion (10.6%), which might be due to the low number of homeowners in this dataset.
# check the unique in cb_person_cred_hist_length column
personcred <- unique(df$cb_person_cred_hist_length)
print(personcred)
## [1] "3" "2" "4" "4_" "3_" "2_" "8" "7" "6" "9" "10" "5" "8_" "11" "16" "15" "12" "13" "17" "14" "25" "28" "27" "22" "19" "29" "23" "26" "20" "21" "30" "24" "18" "6_"
Inspecting the unique values shows that some entries have a trailing underscore (_). This causes an inconsistent data format, so we remove the underscore and convert the column to numeric.
# Clean trailing underscores
df$cb_person_cred_hist_length <- gsub("_$", "", df$cb_person_cred_hist_length)
df$cb_person_cred_hist_length <- as.numeric(df$cb_person_cred_hist_length)
df$loan_int_rate <- gsub("_$", "", df$loan_int_rate)
df$loan_int_rate <- as.numeric(df$loan_int_rate)
colSums(is.na(df))
## person_age person_gender person_education person_income person_emp_exp person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income cb_person_cred_hist_length credit_score previous_loan_defaults_on_file loan_status
## 0 0 0 9 0 0 0 0 4 0 0 0 0 0
After removing the trailing underscores from loan_int_rate and cb_person_cred_hist_length, we re-check the total number of missing values: there are 4 missing values in loan_int_rate and 9 missing values in person_income.
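To see where the 4 new NA values in loan_int_rate come from, one can compare against an untouched copy of the raw file. The sketch below is a minimal check and assumes the row order of df still matches the raw CSV (no rows have been dropped yet at this point).
# Inspect the raw strings behind the loan_int_rate values that became NA after cleaning
raw <- read.csv("loan_data_amended.csv")
raw$loan_int_rate[is.na(df$loan_int_rate)]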
Subsequently, proceeding to univariate analysis of the numerical variables, the boxplots and histograms below show the distribution of applicants across each numerical variable in this dataset. These visualizations also make it possible to detect potential outliers visually.
library(ggplot2)
library(gridExtra)
numericalcols <- df[sapply(df, is.numeric)]
numericalcols <- numericalcols[, setdiff(names(numericalcols), "loan_status")]
plot_list <- lapply(names(numericalcols), function(col) {
ggplot(df, aes_string(y = col)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = paste("Boxplot of", col), y = col) +
theme_minimal()
})
do.call(grid.arrange, c(plot_list, ncol = 2))
From the boxplots, distinct patterns in the numerical features are observed:
# Remove rows where person_income has missing values
df <- df[!is.na(df$person_income), ]
dim(df)
## [1] 44991 14
Because the person_income column contains many outliers, mean imputation would be distorted, so we remove the rows with missing values instead.
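A quick comparison of the mean and median (a minimal sketch, not part of the original pipeline) illustrates why mean imputation would be a poor fit for this skewed column:
# person_income is heavily right-skewed, so its mean sits well above the median;
# imputing the mean would therefore be misleading for this column
c(mean = mean(df$person_income, na.rm = TRUE),
  median = median(df$person_income, na.rm = TRUE))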
numericalcols <- df[sapply(df, is.numeric)]
numericalcols <- numericalcols[, setdiff(names(numericalcols), "loan_status")]
# Create a list of ggplot objects for each numerical column
plot_list <- lapply(names(numericalcols), function(col) {
ggplot(df, aes_string(x = col)) +
geom_histogram(aes(y = ..count..), bins = 30, fill = "lightblue", color = "gray") +
labs(title = paste("Distribution of", col), x = col, y = "Count") +
theme_minimal()
})
do.call(grid.arrange, c(plot_list, ncol = 2, top = "Distributions of Numerical Columns"))
Based on the above histograms, several insights are obtained:
library(dplyr)
# Filter the data into suitable range for person_age < 100, person_emp_exp < 50
df <- df %>% filter(person_age < 100, person_emp_exp < 50)
dim(df)
## [1] 44977 14
The data are filtered to a reasonable range: person_age below 100 and person_emp_exp below 50.
# Define the categorical columns
df$loan_status <- factor(df$loan_status,
levels = c("0", "1"),
labels = c("Rejected", "Approved"))
categorical <- c("person_gender", "person_education", "person_home_ownership", "loan_intent")
plot_list <- lapply(categorical, function(col) {
ggplot(df, aes_string(x = col, fill = "loan_status")) +
geom_bar(position = "dodge", color = "black", width = 0.7) + # Side-by-side bars
labs(title = paste("Bar Chart of", col, "by Loan Status"),
x = col, y = "Count", fill = "Loan Status") +
scale_fill_brewer(palette = "Set1") + # Use a color palette for distinction
theme_minimal(base_size = 14) +
theme(
legend.position = "top", # Move legend to the top
axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis text
)
})
do.call(grid.arrange, c(plot_list, ncol = 2))
In the univariate analysis, the loan_status column was converted from a numerical to a categorical variable, recoding 0 and 1 as "Rejected" and "Approved" respectively, and the focus was on the distribution of individual features. In this section, we further explore the distribution of four categorical variables, namely person_gender, person_education, person_home_ownership and loan_intent, by the target variable, loan_status. Four bar plots are plotted, and two general insights are observed across all of them:
library(ggplot2)
library(tidyr)
df_long <- df %>%
select(credit_score, person_income, loan_amnt, loan_int_rate,
loan_percent_income, cb_person_cred_hist_length) %>%
pivot_longer(cols = -credit_score, names_to = "Variable", values_to = "Value")
# Plot scatter plots with facets
ggplot(df_long, aes(x = Value, y = credit_score)) +
geom_point(color = "blue", alpha = 0.6) +
facet_wrap(~ Variable, scales = "free_x") +
labs(title = "Scatter Plots of Credit Score vs Other Variables",
x = "Values of Each Variable", y = "Credit Score") +
theme_minimal()
library(corrplot)
df2 <- df #make a copy
df2$loan_int_rate[is.na(df2$loan_int_rate)] <- mean(df2$loan_int_rate, na.rm = TRUE)
# Remove the outliers for the specified numerical columns
numerical_columns <- c("person_age", "person_income", "person_emp_exp", "loan_amnt",
"loan_int_rate", "loan_percent_income",
"cb_person_cred_hist_length", "credit_score")
for (col in numerical_columns) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE) # First quartile
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE) # Third quartile
IQR <- Q3 - Q1 # Interquartile range
lower_bound <- Q1 - 1.5 * IQR # Lower bound
upper_bound <- Q3 + 1.5 * IQR # Upper bound
# Filter the dataframe
df2 <- df2[
df2[[col]] >= lower_bound & df2[[col]] <= upper_bound,
]
}
numericalcols <- df2[sapply(df2, is.numeric)]
correlation_matrix <- cor(numericalcols, method = "pearson")
corrplot(correlation_matrix, method = "color", type = "upper",
col = colorRampPalette(c("blue", "white", "red"))(100),
tl.cex = 0.8, number.cex = 0.7, addCoef.col = "black")
For the EDA part, we handled missing values and outliers on the entire dataset (without a train-test split) in order to compute the correlation heatmap. For the modelling part, we will split the dataset into training and testing sets, then handle missing values with mean imputation and remove outliers separately on the training and test sets to avoid data leakage.
Data pre-processing was carried out to clean and prepare the dataset for analysis. The person_age and person_emp_exp columns were filtered to exclude values over 100 and 50, respectively.
EDA: The exploratory data analysis (EDA) provided important insights into both categorical and numerical features. Among the categorical features, males, renters, and individuals with higher education levels (such as Bachelor’s and Associate degrees) were more likely to receive loan approvals, with the highest loan intent observed for education and medical purposes. The loan_status variable indicated a significant class imbalance, with only 22.2% of loans being approved. In terms of numerical features, person_age and person_emp_exp were right-skewed, with most values falling below 50 and 10, respectively. Loan amounts were primarily concentrated under 20,000, and interest rates peaked around 10%. The credit_score exhibited a roughly normal distribution centered between 600 and 700 and showed a positive correlation with the length of credit history, while the other numerical features demonstrated weaker relationships with it.
Correlation Coefficient: The heatmap reveals the relationships between the numerical variables in the dataset. Notably, there is a strong positive correlation between person_age and person_emp_exp (0.90), indicating that older individuals tend to have more work experience. Similarly, loan_percent_income and loan_amnt exhibit a moderately positive correlation (0.65), suggesting that applicants whose loan takes up a larger share of their income tend to request larger loans. loan_int_rate shows a moderate positive correlation with loan_status (0.32), indicating that approved loans in this dataset tend to carry higher interest rates. In contrast, loan_percent_income and person_income are negatively correlated (-0.38), implying that individuals with higher incomes tend to have lower loan-to-income ratios. Other correlations, such as those involving credit_score and cb_person_cred_hist_length, show relatively weak associations with the remaining variables. Overall, the heatmap highlights key relationships that can guide further analysis and modelling.
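The strongest pairwise correlations quoted above can also be extracted programmatically. The sketch below is a minimal example that assumes correlation_matrix from the previous chunk is still in memory.
# List the strongest pairwise correlations (excluding the diagonal and duplicate pairs)
cor_long <- as.data.frame(as.table(correlation_matrix))
names(cor_long) <- c("var1", "var2", "correlation")
cor_long <- subset(cor_long, as.character(var1) < as.character(var2))
head(cor_long[order(-abs(cor_long$correlation)), ], 5)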
# Define the levels and labels for the factor
education_levels <- c("High School", "Associate", "Bachelor", "Master", "Doctorate")
education_labels <- c(1, 2, 3, 4, 5)
df$person_education <- factor(df$person_education,
levels = education_levels,
labels = education_labels)
df$person_education <- as.numeric(df$person_education)
The education levels are mapped to ordinal labels, from 1 (High School) to 5 (Doctorate), and the person_education column is converted to numeric.
library(caret) #For stratified k-fold cross validation
library(MLmetrics)
library(randomForest)
library(xgboost)
# Classification prediction on Loan Status
# Split data into training and testing
set.seed(123)
train_index <- createDataPartition(df$loan_status, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
# Calculate the mean of loan_int_rate using the training data only
mean_loan <- mean(train_data$loan_int_rate, na.rm = TRUE)
train_data$loan_int_rate[is.na(train_data$loan_int_rate)] <- mean_loan
test_data$loan_int_rate[is.na(test_data$loan_int_rate)] <- mean_loan
#Remove Outliers
numerical_columns <- c("person_age", "person_income", "person_emp_exp", "loan_amnt",
"loan_int_rate", "loan_percent_income",
"cb_person_cred_hist_length", "credit_score")
for (col in numerical_columns) {
Q1 <- quantile(train_data[[col]], 0.25, na.rm = TRUE) # First quartile
Q3 <- quantile(train_data[[col]], 0.75, na.rm = TRUE) # Third quartile
IQR <- Q3 - Q1 # Interquartile range
lower_bound <- Q1 - 1.5 * IQR # Lower bound
upper_bound <- Q3 + 1.5 * IQR # Upper bound
train_data <- train_data[
train_data[[col]] >= lower_bound & train_data[[col]] <= upper_bound,
]
test_data <- test_data[
test_data[[col]] >= lower_bound & test_data[[col]] <= upper_bound,
]
}
dim(train_data)
## [1] 28772 14
dim(test_data)
## [1] 7228 14
# Standardize (center and scale) the numerical columns, equivalent to StandardScaler in scikit-learn
numerical_cols <- c("person_age", "person_education", "person_income", "person_emp_exp",
"loan_amnt", "loan_int_rate", "loan_percent_income",
"credit_score", "cb_person_cred_hist_length")
preprocess_scaler <- preProcess(train_data[, numerical_cols], method = c("center", "scale"))
train_data[, numerical_cols] <- predict(preprocess_scaler, train_data[, numerical_cols])
test_data[, numerical_cols] <- predict(preprocess_scaler, test_data[, numerical_cols])
#One-Hot-Encoding
categorical_cols <- c("person_gender", "person_home_ownership", "loan_intent",
"previous_loan_defaults_on_file")
train_encoded <- dummyVars(" ~ .", data = train_data[, categorical_cols]) %>%
predict(newdata = train_data) %>%
as.data.frame()
test_encoded <- dummyVars(" ~ .", data = test_data[, categorical_cols]) %>%
predict(newdata = test_data) %>%
as.data.frame()
train_data <- bind_cols(train_data, train_encoded) %>%
select(-one_of(categorical_cols)) # Remove original categorical columns
test_data <- bind_cols(test_data, test_encoded) %>%
select(-one_of(categorical_cols)) # Remove original categorical columns
dim(train_data)
## [1] 28772 24
dim(test_data)
## [1] 7228 24
# Set up stratified k-fold cross-validation with k-fold = 5
# When trainControl is used with classification tasks, the caret package automatically ensures that each fold maintains the class distribution of the dataset.
train_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE, # For classification Tasks
summaryFunction = multiClassSummary,
savePredictions = TRUE
)
# Define metric storage
results <- data.frame(Model = character(), Accuracy = numeric(),
Precision = numeric(), Recall = numeric(), F1 = numeric())
# Logistic Regression
logistic_model <- train(
loan_status ~ ., data = train_data, method = "glm",
family = "binomial", trControl = train_control
)
logistic_perf <- logistic_model$results
results <- rbind(results, data.frame(
Model = "Logistic Regression",
Accuracy = mean(logistic_perf$Accuracy),
Precision = mean(logistic_perf$Precision),
Recall = mean(logistic_perf$Recall),
F1 = mean(logistic_perf$F1)
))
# Random Forest
rf_model <- train(
loan_status ~ ., data = train_data, method = "rf",
trControl = train_control,
tuneGrid = expand.grid(mtry = floor(sqrt(ncol(train_data)))),
ntree = 100
)
rf_perf <- rf_model$results
results <- rbind(results, data.frame(
Model = "Random Forest",
Accuracy = mean(rf_perf$Accuracy),
Precision = mean(rf_perf$Precision),
Recall = mean(rf_perf$Recall),
F1 = mean(rf_perf$F1)
))
# XGBoost
tune_grid <- expand.grid(
nrounds = c(50, 100), # Fewer boosting rounds
max_depth = 3, # Shallower trees
eta = 0.3, # Learning rate
gamma = 0,
colsample_bytree = 0.8, # Subsample features
min_child_weight = 1,
subsample = 0.8
) # Keep nrounds small (50-100) so that training does not take too long
xgb_model <- train(
loan_status ~ .,
data = train_data,
method = "xgbTree",
trControl = train_control,
tuneGrid = tune_grid
)
## [18:10:16] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:16] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:18] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:18] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:19] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:19] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:21] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:21] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:22] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [18:10:22] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
xgb_perf <- xgb_model$results
results <- rbind(results, data.frame(
Model = "XGBoost",
Accuracy = mean(xgb_perf$Accuracy),
Precision = mean(xgb_perf$Precision),
Recall = mean(xgb_perf$Recall),
F1 = mean(xgb_perf$F1)
))
# SVM
importantfea <- varImp(rf_model)$importance %>% arrange(desc(Overall))
selectedfea <- rownames(importantfea)[1:10] # Top 10 features
svm_model <- train(
loan_status ~ .,
data = train_data[, c(selectedfea, "loan_status")],
method = "svmLinear",
trControl = train_control
) # Use a Linear Kernel for Large Dataset: faster as it avoids computing complex transformations
svm_perf <- svm_model$results
results <- rbind(results, data.frame(
Model = "SVM",
Accuracy = mean(svm_perf$Accuracy),
Precision = mean(svm_perf$Precision),
Recall = mean(svm_perf$Recall),
F1 = mean(svm_perf$F1)
))
# KNN
knn_model <- train(
loan_status ~ ., data = train_data, method = "knn",
trControl = train_control,
tuneGrid = data.frame(k = c(3, 5, 7)) # candidate numbers of neighbours
)
knn_perf <- knn_model$results
results <- rbind(results, data.frame(
Model = "KNN",
Accuracy = mean(knn_perf$Accuracy),
Precision = mean(knn_perf$Precision),
Recall = mean(knn_perf$Recall),
F1 = mean(knn_perf$F1)
))
# Display results
results %>% arrange(desc(Accuracy))
## Model Accuracy Precision Recall F1
## 1 XGBoost 0.9236238 0.9366766 0.9698139 0.9529451
## 2 Random Forest 0.9235369 0.9317663 0.9755459 0.9531524
## 3 Logistic Regression 0.8956973 0.9277359 0.9426358 0.9351091
## 4 SVM 0.8898927 0.9236442 0.9395840 0.9315401
## 5 KNN 0.8860003 0.9207725 0.9377097 0.9291636
The dataset for classifying loan_status was divided into 80% for training and 20% for testing. Mean imputation of the loan_int_rate column and outlier removal were carried out after the train-test split. The numerical columns were then standardized (centered and scaled), and the categorical columns underwent one-hot encoding. To address the imbalance in the loan status classes, stratified 5-fold cross-validation was performed on the training data to maintain representative class distributions in each fold. Several models, including Logistic Regression, Random Forest, SVM, KNN, and XGBoost, were assessed based on accuracy, precision, recall, and F1-score. XGBoost stood out as the top-performing model, achieving strong accuracy, precision, recall and F1-score during cross-validation.
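As a sanity check on the stratification, the folds can also be generated explicitly with createFolds(), which caret uses internally for cross-validation on a factor outcome. This is a minimal sketch, not part of the original pipeline.
# Verify that stratified folds keep the Rejected/Approved ratio roughly constant
set.seed(123)
folds <- createFolds(train_data$loan_status, k = 5)
sapply(folds, function(idx) prop.table(table(train_data$loan_status[idx])))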
# Final Prediction
final_model <- train(
loan_status ~ ., data = train_data, method = "xgbTree",
trControl = trainControl(method = "none"),
tuneGrid = expand.grid(
nrounds = 100,
max_depth = 3,
eta = 0.3,
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8
)
)
test_predictions <- predict(final_model, newdata = test_data)
# Precision = Pos Pred Value, Recall = Sensitivity
confusionMatrix(test_predictions, test_data$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Rejected Approved
## Rejected 5638 326
## Approved 153 1111
##
## Accuracy : 0.9337
## 95% CI : (0.9277, 0.9394)
## No Information Rate : 0.8012
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7821
##
## Mcnemar's Test P-Value : 3.876e-15
##
## Sensitivity : 0.9736
## Specificity : 0.7731
## Pos Pred Value : 0.9453
## Neg Pred Value : 0.8790
## Prevalence : 0.8012
## Detection Rate : 0.7800
## Detection Prevalence : 0.8251
## Balanced Accuracy : 0.8734
##
## 'Positive' Class : Rejected
##
The XGBoost model was then applied to the testing data for the final predictions. Overall, XGBoost classifies most of the rejected and approved instances correctly, with only small numbers of false positives and false negatives. In the context of loan approval, a false positive means that an unqualified applicant’s loan request is approved; the lender may face a financial loss if the borrower is high risk and the loan is defaulted. A false negative, on the other hand, means the lender may lose a potential customer whose loan should in fact have been approved.
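The two error types discussed above can be read directly off the caret confusion matrix object. The sketch below is a minimal example; the names false_rejections and false_approvals are ours, and "Rejected" is the positive class as reported above.
# Pull the raw error counts out of the confusion matrix object
cm <- confusionMatrix(test_predictions, test_data$loan_status)
false_rejections <- cm$table["Rejected", "Approved"] # approved applicants predicted as rejected
false_approvals <- cm$table["Approved", "Rejected"] # rejected applicants predicted as approved
c(false_rejections = false_rejections, false_approvals = false_approvals)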
# XGBoost provides tools to visualize the feature importance of the trained model
xgb_finalmodel <- final_model$finalModel # Extract the underlying XGBoost model
importance <- xgb.importance(model = xgb_finalmodel)
xgb.plot.importance(importance_matrix = importance) # Show top 10 features
Furthermore, an analysis of feature importance indicated that the most important predictors included previous_loan_defaults_on_file, loan_int_rate and person_income, among others.
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(ggplot2)
# Compute ROC Curve
pred_probs <- predict(xgb_model, newdata = test_data, type = "prob")
roc_curve <- roc(test_data$loan_status, pred_probs[, 2])
## Setting levels: control = Rejected, case = Approved
## Setting direction: controls < cases
ggplot() +
geom_line(aes(x = 1 - roc_curve$specificities, y = roc_curve$sensitivities), color = "blue") +
geom_abline(linetype = "dashed", color = "gray") +
ggtitle(paste("ROC Curve (AUC =", round(auc(roc_curve), 2), ")")) +
xlab("1 - Specificity (False Positive Rate)") + # Corrected
ylab("Sensitivity (True Positive Rate)") +
theme_minimal()
The Receiver Operating Characteristic (ROC) curve is a graphical representation of sensitivity (true positive rate) against 1 - specificity (false positive rate); in other words, it shows the trade-off between sensitivity and specificity.
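Beyond summarizing the trade-off, the ROC object can also be used to choose an operating threshold, for example via Youden’s J statistic with pROC’s coords(). This is a minimal sketch, not part of the original evaluation.
# Pick the probability cut-off that maximizes sensitivity + specificity - 1
coords(roc_curve, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))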
# Regression prediction on credit score
set.seed(124)
trainindex <- createDataPartition(df$credit_score, p = 0.8, list = FALSE)
traindata <- df[trainindex, ]
testdata <- df[-trainindex, ]
meanloan <- mean(traindata$loan_int_rate, na.rm = TRUE)
traindata$loan_int_rate[is.na(traindata$loan_int_rate)] <- meanloan
testdata$loan_int_rate[is.na(testdata$loan_int_rate)] <- meanloan
# Remove Outliers
numerical_columns <- c("person_age", "person_income", "person_emp_exp", "loan_amnt",
"loan_int_rate", "loan_percent_income",
"cb_person_cred_hist_length", "credit_score")
for (col in numerical_columns) {
Q1 <- quantile(traindata[[col]], 0.25, na.rm = TRUE) # First quartile
Q3 <- quantile(traindata[[col]], 0.75, na.rm = TRUE) # Third quartile
IQR <- Q3 - Q1 # Interquartile range
lower_bound <- Q1 - 1.5 * IQR # Lower bound
upper_bound <- Q3 + 1.5 * IQR # Upper bound
traindata <- traindata[
traindata[[col]] >= lower_bound & traindata[[col]] <= upper_bound,
]
testdata <- testdata[
testdata[[col]] >= lower_bound & testdata[[col]] <= upper_bound,
]
}
dim(traindata)
## [1] 28859 14
dim(testdata)
## [1] 7179 14
# Standardize (center and scale) the numerical columns, equivalent to StandardScaler in scikit-learn
numericalcols <- c("person_age", "person_education", "person_income",
"person_emp_exp", "loan_amnt", "loan_percent_income",
"loan_int_rate", "cb_person_cred_hist_length", "credit_score")
preprocessscaler <- preProcess(traindata[, numericalcols], method = c("center", "scale"))
traindata[, numericalcols] <- predict(preprocessscaler, traindata[, numericalcols])
testdata[, numericalcols] <- predict(preprocessscaler, testdata[, numericalcols])
#One-Hot-Encoding
categoricalcols <- c("person_gender", "person_home_ownership", "loan_intent",
"previous_loan_defaults_on_file", "loan_status")
trainencoded <- dummyVars(" ~ .", data = traindata[, categoricalcols]) %>%
predict(newdata = traindata) %>%
as.data.frame()
testencoded <- dummyVars(" ~ .", data = testdata[, categoricalcols]) %>%
predict(newdata = testdata) %>%
as.data.frame()
traindata <- bind_cols(traindata, trainencoded) %>%
select(-one_of(categoricalcols)) # Remove original categorical columns
testdata <- bind_cols(testdata, testencoded) %>%
select(-one_of(categoricalcols)) # Remove original categorical columns
dim(traindata)
## [1] 28859 25
dim(testdata)
## [1] 7179 25
# K-fold Cross Validation
train_control <- trainControl(method = "cv", number = 5)
# Train models using different methods
lm_model <- train(credit_score ~ ., data = traindata, method = "lm", trControl = train_control)
dt_model <- train(credit_score ~ ., data = traindata, method = "rpart", trControl = train_control)
rf_model <- train(credit_score ~ ., data = traindata, method = "rf",
trControl = train_control,
tuneGrid = expand.grid(mtry = c(2, 3, 4)),
ntree = 50)
gbm_model <- train(credit_score ~ ., data = traindata, method = "gbm",
trControl = train_control,
tuneGrid = expand.grid(n.trees = c(50, 100),
interaction.depth = c(1, 2),
shrinkage = 0.1,
n.minobsinnode = 10),
verbose = FALSE)
importantfea <- varImp(rf_model)$importance %>% arrange(desc(Overall))
selectedfea <- rownames(importantfea)[1:10] # Top 10 features
svm_model <- train(
credit_score ~ .,
data = traindata[, c(selectedfea, "credit_score")],
method = "svmLinear",
trControl = train_control
)
# Function to calculate metrics
extract_metrics <- function(model, model_name) {
results <- model$results
data.frame(
Model = model_name,
RMSE = mean(results$RMSE),
MAE = mean(results$MAE),
R2 = mean(results$Rsquared)
)
}
# Collect metrics for each model
# Combine metrics for all models
regressionperformance <- rbind(
extract_metrics(lm_model, "Linear Regression"),
extract_metrics(dt_model, "Decision Tree"),
extract_metrics(rf_model, "Random Forest"),
extract_metrics(gbm_model, "Gradient Boosting"),
extract_metrics(svm_model, "SVM")
)
print(regressionperformance)
## Model RMSE MAE R2
## 1 Linear Regression 0.9483485 0.7714067 0.10077404
## 2 Decision Tree 0.9796231 0.7962962 0.04702660
## 3 Random Forest 0.9603864 0.7806850 0.08037284
## 4 Gradient Boosting 0.9511344 0.7740556 0.09780567
## 5 SVM 0.9628069 0.7679873 0.09136300
# Final prediction with the best-performing model
final_model <- train(
credit_score ~ .,
data = traindata,
method = "lm",
trControl = trainControl(method = "none")
)
# Make predictions on the test set
test_predictions <- predict(final_model, newdata = testdata)
y_test <- testdata$credit_score
rmse <- sqrt(mean((y_test - test_predictions)^2))
mae <- mean(abs(y_test - test_predictions))
r2 <- 1 - sum((y_test - test_predictions)^2) / sum((y_test - mean(y_test))^2)
# Display performance metrics
final_performance <- data.frame(
RMSE = rmse,
MAE = mae,
R2 = r2
)
print(final_performance)
## RMSE MAE R2
## 1 0.9386298 0.7643898 0.101789
For the regression model, mean imputation of the loan_int_rate column and outlier removal were again performed after the train-test split, followed by standardization of the numerical columns and one-hot encoding of the categorical columns. Overall, the linear regression model exhibited the best performance, with the lowest RMSE (about 0.95 on the standardized scale) and the highest R-squared (about 0.10). However, an R-squared of 0.10 indicates poor predictive power for credit score. This result suggests that key predictive features are missing, so the underlying patterns are not well captured by the models. This is consistent with the heatmap shown in the correlation analysis section, where the numerical features have weak correlations with credit score. Features such as payment history, number of previous defaults and credit utilization might be more useful for predicting credit score.
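Given the weak fit, residual diagnostics can help confirm that the model is at least not systematically biased. The sketch below is a minimal example using the test-set predictions computed above; note that credit_score is on the standardized scale here.
# Residuals vs fitted values for the final linear model on the test set
ggplot(data.frame(fitted = test_predictions, residual = y_test - test_predictions),
       aes(x = fitted, y = residual)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (test set)",
       x = "Predicted credit_score (standardized)", y = "Residual") +
  theme_minimal()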
Overall, each part of this project plays a significant role in contributing to the final outcome, from data pre-processing through exploratory data analysis and visualization to selecting the best-performing model for loan approval prediction. All objectives of this study have been achieved:
Additionally, this study has limitations. Although the linear regression model demonstrated the best performance among the evaluated models, it obtained an R-squared of only about 0.10, indicating poor predictive power for credit score. It is recommended to include additional features that are more directly relevant to credit score.