set.seed(130)
df <- read.csv("/Users/huangjhongfu/pCloud Drive/05.Personal file/Minerva/-CS312 - Diamond/Regression Assignment/Loan.csv")
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- clean_names(df)
df <- na.omit(df)
df <- filter(df, employment_status %in% c("Self-Employed", "Employed"))
df$debt_income_bracket <-
  ifelse(df$debt_to_income_ratio >= 0 & df$debt_to_income_ratio <= 0.2, "low",
    ifelse(df$debt_to_income_ratio > 0.2 & df$debt_to_income_ratio <= 0.4, "medium",
      ifelse(df$debt_to_income_ratio > 0.4, "high", NA)))
df_lm1 <- df[,c("credit_score","debt_income_bracket","loan_purpose",
"loan_amount")]
head(df_lm1)
## credit_score debt_income_bracket loan_purpose loan_amount
## 1 617 medium Home 13152
## 2 628 medium Debt Consolidation 26045
## 3 570 medium Education 17627
## 4 545 high Home 37898
## 5 594 low Debt Consolidation 9184
## 6 626 medium Debt Consolidation 15433
lm1 <- lm(loan_amount ~ credit_score + debt_income_bracket + loan_purpose,
          data = df_lm1)
summary(lm1)
##
## Call:
## lm(formula = loan_amount ~ credit_score + debt_income_bracket +
## loan_purpose, data = df_lm1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21245 -9256 -2929 5903 159936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26053.130 1144.994 22.754 <2e-16 ***
## credit_score -2.689 1.935 -1.390 0.165
## debt_income_bracketlow 416.518 263.742 1.579 0.114
## debt_income_bracketmedium 548.020 254.101 2.157 0.031 *
## loan_purposeDebt Consolidation -100.747 293.848 -0.343 0.732
## loan_purposeEducation 148.635 335.910 0.442 0.658
## loan_purposeHome -105.285 284.504 -0.370 0.711
## loan_purposeOther -159.298 380.289 -0.419 0.675
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13420 on 18601 degrees of freedom
## Multiple R-squared: 0.0004109, Adjusted R-squared: 3.474e-05
## F-statistic: 1.092 on 7 and 18601 DF, p-value: 0.3648
r_squared <- summary(lm1)$r.squared
cat("R-squared: ", r_squared, "\n")
## R-squared: 0.0004109122
From the model summary, only the “medium” level of the debt-to-income bracket is statistically significant, with a p-value of 0.031 (below 0.05). However, the very low R-squared of about 0.0004 indicates the model explains almost none of the variation in loan amount, so it is unlikely to produce useful predictions.
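As a quick check, the same conclusions can be read programmatically from the summary object; this is a small sketch using the lm1 object fitted above.
# Pull the coefficient table and keep terms with p < 0.05
# (here only the intercept and the "medium" bracket pass)
coefs <- summary(lm1)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
# Adjusted R-squared penalizes the extra dummy terms; it is also near zero
summary(lm1)$adj.r.squared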
df_lm1 <- df_lm1 %>% filter(loan_purpose == "Home") %>%
  filter(debt_income_bracket == "medium") %>%
  filter(credit_score >= 500 & credit_score <= 800)
new_data <- data.frame(
credit_score = 500:800,
loan_purpose = rep("Home", 301),
debt_income_bracket = rep("medium", 301))
# use the model to create confidence intervals for the mean prediction
predictions <- predict(lm1, new_data, interval = "confidence")
x_new <- data.frame(x= 500:800)
x_new$fit <- predictions[, "fit"]
x_new$lwr <- predictions[, "lwr"]
x_new$upr <- predictions[, "upr"]
library(ggplot2)
ggplot(x_new, aes(x = x, y = fit)) +
geom_line(color = "blue") +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
geom_point(data = df_lm1, aes(x = credit_score, y = loan_amount)) +
labs(title = "Linear Regression with Confidence Interval",
x = "Independent Variable (credit score)",
y = "Dependent Variable (loan amount)") +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(500, 800) +
ylim(20000, 28000)
## Warning: Removed 1553 rows containing missing values or values outside the scale range
## (`geom_point()`).
The predictions show a negative trend in loan amount as credit scores increase from 500 to 800, but this is a purely descriptive result: no causal inference can be drawn here because we have no counterfactual for any treatment effect. From the visualization, when loan_purpose is Home and debt_income_bracket is medium, loan amount decreases as credit score increases, which contradicts our intuition about lending. This may point to a confounding factor or to the need for additional predictors in the regression model.
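One way to probe the second possibility is to refit the model with a predictor that is already in the cleaned data, such as monthly_income (used in the logistic model later); this is only a sketch of the idea, not a claim that income is the missing variable.
# Sketch: add monthly_income as an extra predictor and compare the fit
lm2 <- lm(loan_amount ~ credit_score + debt_income_bracket + loan_purpose +
            monthly_income, data = df)
summary(lm2)$adj.r.squared  # compare against lm1's near-zero adjusted R-squared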
To explain the uncertainty of our prediction to someone unfamiliar with statistics, we can walk them through the confidence interval at individual data points. Taking a credit score of 500 as an example, the average loan amount the financial institution would provide falls between 24,652 and 25,651 with 95% confidence. Looking at the confidence band along the full range of credit scores, as in the figure below, captures how this uncertainty changes.
ggplot(x_new, aes(x = x, y = fit)) +
xlim(500, 800) +
geom_line(data = x_new, aes(x = x, y = fit), color = "blue") +
geom_ribbon(data = x_new, aes(x = x, ymax = upr, ymin = lwr), alpha = 0.2) +
labs(title = "Linear Regression with Confidence Interval",
x = "Independent Variable (credit score)",
y = "Dependent Variable (loan amount)") +
theme(plot.title = element_text(hjust = 0.5))
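The specific interval quoted above (24,652 to 25,651 at a credit score of 500) can be reproduced directly from the fitted model; this is a minimal check using predict() with a single row of new data.
# 95% confidence interval for the mean loan amount at credit_score = 500,
# for a Home loan in the medium debt-to-income bracket
predict(lm1,
        newdata = data.frame(credit_score = 500,
                             loan_purpose = "Home",
                             debt_income_bracket = "medium"),
        interval = "confidence")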
set.seed(130)
df <- read.csv("/Users/huangjhongfu/pCloud Drive/05.Personal file/Minerva/-CS312 - Diamond/Regression Assignment/Loan.csv")
library(janitor)
library(dplyr)
df <- clean_names(df)
df <- na.omit(df)
df_glm <- df[,c("credit_score","monthly_income","home_ownership_status","age",
"loan_approved")]
train_indices <- sample(1:nrow(df), size = 0.8 * nrow(df))
train_set <- df_glm[train_indices, ]
test_set <- df_glm[-train_indices, ]
logistic_model <- glm(loan_approved ~ credit_score + monthly_income +
home_ownership_status + age,
data = train_set, family = binomial)
summary(logistic_model)
##
## Call:
## glm(formula = loan_approved ~ credit_score + monthly_income +
## home_ownership_status + age, family = binomial, data = train_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.045e+00 3.033e-01 -26.525 < 2e-16 ***
## credit_score 5.544e-03 5.290e-04 10.482 < 2e-16 ***
## monthly_income 6.151e-04 1.095e-05 56.185 < 2e-16 ***
## home_ownership_statusOther -5.375e-01 9.093e-02 -5.912 3.38e-09 ***
## home_ownership_statusOwn -1.036e-01 6.685e-02 -1.550 0.121
## home_ownership_statusRent -3.162e-01 5.874e-02 -5.383 7.33e-08 ***
## age 1.138e-02 2.225e-03 5.116 3.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17595 on 15999 degrees of freedom
## Residual deviance: 10982 on 15993 degrees of freedom
## AIC: 10996
##
## Number of Fisher Scoring iterations: 5
coef_age <- coef(logistic_model)["age"]
confint_age <- confint(logistic_model)["age", ]
## Waiting for profiling to be done...
cat("Regression coefficient for age:", coef_age, "\n")
## Regression coefficient for age: 0.01138022
cat("95% Confidence Interval for age:", confint_age, "\n")
## 95% Confidence Interval for age: 0.007021479 0.01574308
pred_prob <- predict(logistic_model, newdata = test_set, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
actual_class <- test_set$loan_approved
matrix <- table(Predicted = pred_class, Actual = actual_class)
print(matrix)
## Actual
## Predicted 0 1
## 0 2875 453
## 1 168 504
evaluate_threshold <- function(threshold) {
  pred_class <- ifelse(pred_prob > threshold, 1, 0)
  confusion <- table(Predicted = pred_class, Actual = test_set$loan_approved)
  TN <- confusion[1, 1] # True Negatives: predicted 0, actual 0
  FN <- confusion[1, 2] # False Negatives: predicted 0, actual 1
  FP <- confusion[2, 1] # False Positives: predicted 1, actual 0
  TP <- confusion[2, 2] # True Positives: predicted 1, actual 1
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  f1_score <- 2 * (precision * recall) / (precision + recall)
  return(data.frame(Threshold = threshold, Accuracy = accuracy,
                    Precision = precision, Recall = recall,
                    F1_Score = f1_score))
}
thresholds <- seq(0.1, 0.9, by = 0.1)
results <- do.call(rbind, lapply(thresholds, evaluate_threshold))
print(results)
## Threshold Accuracy Precision Recall F1_Score
## 1 0.1 0.67000 0.4151473 0.9278997 0.5736434
## 2 0.2 0.78400 0.5326316 0.7931034 0.6372796
## 3 0.3 0.82375 0.6193182 0.6833856 0.6497765
## 4 0.4 0.83975 0.6908213 0.5977011 0.6408964
## 5 0.5 0.84475 0.7500000 0.5266458 0.6187845
## 6 0.6 0.84000 0.7855856 0.4555904 0.5767196
## 7 0.7 0.83475 0.8333333 0.3866249 0.5281941
## 8 0.8 0.82325 0.8765060 0.3040752 0.4515128
## 9 0.9 0.81100 0.9350649 0.2257053 0.3636364
best_threshold <- results[which.max(results$F1_Score), "Threshold"]
cat("Optimal threshold according to F1_score:", best_threshold, "\n")
## Optimal threshold according to F1_score: 0.3
Based on the F1 score, 0.3 is the best threshold to strike a balance between precision and recall. The F1 score accounts for both false positives and false negatives, so it penalizes both types of error.
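As a quick arithmetic check of the reported value, the F1 score at threshold 0.3 is the harmonic mean of its precision and recall (both rates taken from the table above).
# Harmonic mean of the two rates at threshold 0.3
2 * (0.6193182 * 0.6833856) / (0.6193182 + 0.6833856)  # ~0.6498, matching F1_Score above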
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
pred_prob <- predict(logistic_model, newdata = test_set, type = "response")
roc_curve <- roc(test_set$loan_approved, pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, col = "blue", lwd = 2, main = "ROC Curve")
# calculate the AUC
auc_value <- auc(roc_curve)
cat("AUC Value: ", auc_value, "\n")
## AUC Value: 0.8799379
# Find the best threshold
optimal_threshold <- coords(roc_curve, "best", ret = "threshold",
best.method = "youden")
cat("Optimal threshold according to ROC: ", optimal_threshold[1,1], "\n")
## Optimal threshold according to ROC: 0.1628465
Using coords() with best.method = "youden" computes Youden’s J statistic along the ROC curve and returns the threshold that maximizes the difference between the true positive rate and the false positive rate. In this case, Youden’s J statistic identifies roughly 0.16 as the best threshold.
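The same result can be checked by computing Youden’s J explicitly at every candidate threshold; this is a small sketch using coords() with "all" from pROC.
# Youden's J = sensitivity + specificity - 1; the optimal threshold maximizes it
all_coords <- coords(roc_curve, "all",
                     ret = c("threshold", "sensitivity", "specificity"))
all_coords$youden <- all_coords$sensitivity + all_coords$specificity - 1
all_coords[which.max(all_coords$youden), ]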
Why did you choose the features that you did?
Youden’s J statistic and the F1 score point to different optimal thresholds. I would go with the lower threshold of 0.16 because it identifies more potential borrowers while keeping the risk of loan defaults acceptable. This is a proactive lending strategy, well suited to small personal loans.
In contrast, if the risk of loan defaults were not acceptable, I would raise the threshold to 0.3, which makes the decision more conservative and ensures that only well-qualified applicants get a loan.
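To make the trade-off concrete, the two thresholds can be compared on the test set using the predicted probabilities computed above.
# Lower threshold (~0.16): approves more applicants, at the cost of more false positives
table(Predicted = ifelse(pred_prob > 0.16, 1, 0),
      Actual = test_set$loan_approved)
# Higher threshold (0.3): more conservative, fewer false positives
table(Predicted = ifelse(pred_prob > 0.3, 1, 0),
      Actual = test_set$loan_approved)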
What other variables might have been interesting to look at that are not available in the data?
A study of loan repayment performance at the microfinance institution Grameen Koota (Kiran & Mayya, 2024) identifies several critical determinants, including borrower income stability, financial literacy, loan size, loan duration, relationships with loan officers, and community support systems. Several of these variables are not available in our data (e.g., financial literacy, relationships with loan officers, and community support systems) and would be worth collecting to refine the prediction model.
How accurate were your model’s predictions on the test set? What might explain this, and how could it be improved?
Using 0.3 as the threshold, the model’s accuracy on the test set is about 0.82, which seems acceptable. We can also look at other metrics (e.g., precision, recall, F1 score) to get a more holistic understanding of the model’s performance.
We can improve the model by keeping the most informative variables and removing irrelevant ones, by adding interaction terms to capture interaction effects, or by using a more flexible model to capture complex patterns in the data. Reviewing the performance metrics, adjusting the threshold, and experimenting with different model types could all improve predictive accuracy and the model’s suitability for the loan approval process.
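As one illustration, an interaction between credit score and monthly income can be added directly in the model formula; this is a sketch of the approach, not a claim that this particular interaction is the right one.
# The * operator expands to both main effects plus their interaction term
logistic_model_int <- glm(loan_approved ~ credit_score * monthly_income +
                            home_ownership_status + age,
                          data = train_set, family = binomial)
AIC(logistic_model, logistic_model_int)  # a lower AIC would favor the richer model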
Kiran, K. B., & Mayya, S. (2024). Analyzing determinants of loan repayment performance at Grameen Koota: A comprehensive study in Bangalore District. International Journal For Multidisciplinary Research, 6(4). doi: 10.36948/ijfmr.2024.v06i04.24400