Question 1: Simulation Methods

STEP 1: Data processing

set.seed(130)

df <- read.csv("/Users/huangjhongfu/pCloud Drive/05.Personal file/Minerva/-CS312 - Diamond/Regression Assignment/Loan.csv")

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- clean_names(df)
df <- na.omit(df)
df <- filter(df, employment_status %in% c("Self-Employed", "Employed"))

# Bucket debt-to-income ratio into low / medium / high
df$debt_income_bracket <-
  ifelse(df$debt_to_income_ratio >= 0 & df$debt_to_income_ratio <= 0.2, "low",
  ifelse(df$debt_to_income_ratio > 0.2 & df$debt_to_income_ratio <= 0.4, "medium",
  ifelse(df$debt_to_income_ratio > 0.4, "high", NA)))

df_lm1 <- df[,c("credit_score","debt_income_bracket","loan_purpose",
                "loan_amount")]

head(df_lm1)
##   credit_score debt_income_bracket       loan_purpose loan_amount
## 1          617              medium               Home       13152
## 2          628              medium Debt Consolidation       26045
## 3          570              medium          Education       17627
## 4          545                high               Home       37898
## 5          594                 low Debt Consolidation        9184
## 6          626              medium Debt Consolidation       15433

STEP 2: Run a linear regression model

lm1 <- lm(loan_amount ~ credit_score + debt_income_bracket + loan_purpose,
          data = df_lm1)

STEP 3: Interpret result of the model

summary(lm1)
## 
## Call:
## lm(formula = loan_amount ~ credit_score + debt_income_bracket + 
##     loan_purpose, data = df_lm1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21245  -9256  -2929   5903 159936 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    26053.130   1144.994  22.754   <2e-16 ***
## credit_score                      -2.689      1.935  -1.390    0.165    
## debt_income_bracketlow           416.518    263.742   1.579    0.114    
## debt_income_bracketmedium        548.020    254.101   2.157    0.031 *  
## loan_purposeDebt Consolidation  -100.747    293.848  -0.343    0.732    
## loan_purposeEducation            148.635    335.910   0.442    0.658    
## loan_purposeHome                -105.285    284.504  -0.370    0.711    
## loan_purposeOther               -159.298    380.289  -0.419    0.675    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13420 on 18601 degrees of freedom
## Multiple R-squared:  0.0004109,  Adjusted R-squared:  3.474e-05 
## F-statistic: 1.092 on 7 and 18601 DF,  p-value: 0.3648
r_squared <- summary(lm1)$r.squared
cat("R-squared: ", r_squared, "\n")
## R-squared:  0.0004109122

From the model summary, only the "medium" debt-to-income bracket is statistically significant, with a p-value of 0.031 (below 0.05). However, the very low R-squared suggests the model is not adequate for prediction: an R-squared of about 0.0004 means the predictors explain almost none of the variation in loan amount.
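
As a sanity check on that interpretation, R-squared can be recomputed by hand as the share of variance in loan_amount explained by the model; this is a minimal sketch using the lm1 and df_lm1 objects defined above.

ss_res <- sum(residuals(lm1)^2)                                   # unexplained variation
ss_tot <- sum((df_lm1$loan_amount - mean(df_lm1$loan_amount))^2)  # total variation
1 - ss_res / ss_tot  # should match summary(lm1)$r.squared (about 0.0004)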

STEP 4: Making prediction

df_lm1 <- df_lm1 %>% filter(loan_purpose =="Home") %>% 
  filter(debt_income_bracket == "medium") %>% 
  filter(credit_score >= 500 & credit_score <=800)

new_data <- data.frame(
  credit_score = 500:800, 
  loan_purpose = rep("Home", 301), 
  debt_income_bracket = rep("medium", 301))

# use the model to compute 95% confidence intervals for the mean prediction
predictions <- predict(lm1, new_data, interval = "confidence")

x_new <- data.frame(x= 500:800)
x_new$fit <- predictions[, "fit"]
x_new$lwr <- predictions[, "lwr"]
x_new$upr <- predictions[, "upr"]

library(ggplot2)

  ggplot(x_new, aes(x = x, y = fit)) +  
    geom_line(color = "blue") +  
    geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) + 
    geom_point(data = df_lm1, aes(x = credit_score, y = loan_amount)) +  
    labs(title = "Linear Regression with Confidence Interval",
         x = "Independent Variable (credit score)",
         y = "Dependent Variable (loan amount)") +
    theme(plot.title = element_text(hjust = 0.5)) +
    xlim(500, 800) +  
    ylim(20000, 28000)
## Warning: Removed 1553 rows containing missing values or values outside the scale range
## (`geom_point()`).

The fitted line shows a slight negative trend in loan amount as credit scores increase from 500 to 800, but this is a descriptive association only. No causal inference can be drawn here, because we have no counterfactual against which to identify a treatment effect.
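
Note that the ribbon plotted above is a confidence interval for the mean loan amount at each credit score. If we instead wanted the range an individual applicant's loan amount is likely to fall in, we could ask predict() for a prediction interval, which is much wider; a minimal sketch reusing new_data from above:

# prediction interval: uncertainty about a single new observation, not the mean
pred_int <- predict(lm1, new_data, interval = "prediction")
head(pred_int)  # columns: fit, lwr, upr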

STEP 5: Reflection

From the visualization, when loan_purpose is Home and debt_income_bracket is medium, the loan amount trends slightly downward as credit score increases. This contradicts our intuition about lending, and may point to a confounding factor or to the need for additional predictors in the regression model (see the sketch below).
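
Assuming the cleaned data frame df still contains the other columns from Loan.csv (it does; monthly_income is used in Question 2), one quick probe is to refit the model with an extra candidate predictor and check whether the credit_score coefficient changes:

# refit with monthly_income added and compare the credit_score coefficients
lm2 <- lm(loan_amount ~ credit_score + debt_income_bracket + loan_purpose +
            monthly_income, data = df)
coef(lm1)["credit_score"]
coef(lm2)["credit_score"]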

To explain the uncertainty of our prediction to a colleague unfamiliar with statistics, we can walk them through the confidence interval around the fitted line. Taking a credit score of 500 as an example, the average loan amount the institution would provide is estimated to fall between roughly 24,652 and 25,651 with 95% confidence. The uncertainty across the full range of credit scores is captured by the confidence band in the figure below.

ggplot(x_new, aes(x = x, y = fit)) + 
  xlim(500, 800) +
  geom_line(data = x_new, aes(x = x, y = fit), color = "blue")  +
  geom_ribbon(data = x_new, aes(x = x, ymax = upr, ymin = lwr), alpha = 0.2) + 
  labs(title  = "Linear Regression with Confidence Interval",
       x = "Independent Variable (credit score)",
       y = "Dependent Variable (loan amount)") +
      theme(plot.title = element_text(hjust = 0.5)) 
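
For the concrete example quoted above, the fitted mean and its 95% bounds at a credit score of 500 can be read straight off the x_new data frame built earlier; a minimal sketch:

# fitted mean loan amount and 95% confidence bounds at credit_score = 500
x_new[x_new$x == 500, c("fit", "lwr", "upr")]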

Question 2: Classification

STEP 0: Load the data

set.seed(130)

df <- read.csv("/Users/huangjhongfu/pCloud Drive/05.Personal file/Minerva/-CS312 - Diamond/Regression Assignment/Loan.csv")

library(janitor)
library(dplyr)

df <- clean_names(df)
df <- na.omit(df)

df_glm <- df[,c("credit_score","monthly_income","home_ownership_status","age",
                "loan_approved")]

STEP 1: Separate the training and test set

# 80/20 train-test split (df_glm has the same rows as df)
train_indices <- sample(1:nrow(df_glm), size = 0.8 * nrow(df_glm))

train_set <- df_glm[train_indices, ]  
test_set <- df_glm[-train_indices, ] 

STEP 2: Create logistic regression model for training set

logistic_model <- glm(loan_approved ~ credit_score + monthly_income + 
                        home_ownership_status + age,
                        data = train_set, family = binomial)
summary(logistic_model)
## 
## Call:
## glm(formula = loan_approved ~ credit_score + monthly_income + 
##     home_ownership_status + age, family = binomial, data = train_set)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -8.045e+00  3.033e-01 -26.525  < 2e-16 ***
## credit_score                5.544e-03  5.290e-04  10.482  < 2e-16 ***
## monthly_income              6.151e-04  1.095e-05  56.185  < 2e-16 ***
## home_ownership_statusOther -5.375e-01  9.093e-02  -5.912 3.38e-09 ***
## home_ownership_statusOwn   -1.036e-01  6.685e-02  -1.550    0.121    
## home_ownership_statusRent  -3.162e-01  5.874e-02  -5.383 7.33e-08 ***
## age                         1.138e-02  2.225e-03   5.116 3.13e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17595  on 15999  degrees of freedom
## Residual deviance: 10982  on 15993  degrees of freedom
## AIC: 10996
## 
## Number of Fisher Scoring iterations: 5

STEP 3: Confusion matrix

coef_age <- coef(logistic_model)["age"]
confint_age <- confint(logistic_model)["age", ]
## Waiting for profiling to be done...
cat("Regression coefficient for age:", coef_age, "\n")
## Regression coefficient for age: 0.01138022
cat("95% Confidence Interval for age:", confint_age, "\n")
## 95% Confidence Interval for age: 0.007021479 0.01574308
pred_prob <- predict(logistic_model, newdata = test_set, type = "response")

pred_class <- ifelse(pred_prob > 0.5, 1, 0)

actual_class <- test_set$loan_approved  
matrix <- table(Predicted = pred_class, Actual = actual_class)
print(matrix)
##          Actual
## Predicted    0    1
##         0 2875  453
##         1  168  504

STEP 4: Prediction performance across different thresholds

evaluate_threshold <- function(threshold) {
  pred_class <- ifelse(pred_prob > threshold, 1, 0)
  
  confusion <- table(Predicted = pred_class, Actual = test_set$loan_approved)
  
  # table() puts Predicted in rows and Actual in columns
  TN <- confusion[1, 1]  # predicted 0, actual 0: true negatives
  FN <- confusion[1, 2]  # predicted 0, actual 1: false negatives
  FP <- confusion[2, 1]  # predicted 1, actual 0: false positives
  TP <- confusion[2, 2]  # predicted 1, actual 1: true positives
  
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  f1_score <- 2 * (precision * recall) / (precision + recall)
  
  return(data.frame(Threshold = threshold, Accuracy = accuracy, 
                    Precision = precision, Recall = recall, 
                    F1_Score = f1_score))
}


thresholds <- seq(0.1, 0.9, by = 0.1)
results <- do.call(rbind, lapply(thresholds, evaluate_threshold))

print(results)
##   Threshold Accuracy Precision    Recall  F1_Score
## 1       0.1  0.67000 0.4151473 0.9278997 0.5736434
## 2       0.2  0.78400 0.5326316 0.7931034 0.6372796
## 3       0.3  0.82375 0.6193182 0.6833856 0.6497765
## 4       0.4  0.83975 0.6908213 0.5977011 0.6408964
## 5       0.5  0.84475 0.7500000 0.5266458 0.6187845
## 6       0.6  0.84000 0.7855856 0.4555904 0.5767196
## 7       0.7  0.83475 0.8333333 0.3866249 0.5281941
## 8       0.8  0.82325 0.8765060 0.3040752 0.4515128
## 9       0.9  0.81100 0.9350649 0.2257053 0.3636364
best_threshold <- results[which.max(results$F1_Score), "Threshold"]
cat("Optimal threshold according to F1_score:", best_threshold, "\n")
## Optimal threshold according to F1_score: 0.3

Based on the F1 score, 0.3 is the best threshold for striking a balance between precision and recall: the F1 score penalizes both false positives and false negatives.
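
Concretely, the F1 score is the harmonic mean of precision and recall, so the value reported for the 0.3 threshold can be recomputed directly from those two rates; a small check using the results table above:

# F1 = 2 * precision * recall / (precision + recall)
row_03 <- results[which.min(abs(results$Threshold - 0.3)), ]  # avoid exact float comparison
with(row_03, 2 * Precision * Recall / (Precision + Recall))   # about 0.65, matches F1_Score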

STEP 5: Optimal threshold from the ROC curve (bonus question)

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
pred_prob <- predict(logistic_model, newdata = test_set, type = "response")

roc_curve <- roc(test_set$loan_approved, pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, col = "blue", lwd = 2, main = "ROC Curve")

# calculate the AUC
auc_value <- auc(roc_curve)
cat("AUC Value: ", auc_value, "\n")
## AUC Value:  0.8799379
# Find the best threshold
optimal_threshold <- coords(roc_curve, "best", ret = "threshold", 
                            best.method = "youden")
cat("Optimal threshold according to ROC: ", optimal_threshold[1,1], "\n")
## Optimal threshold according to ROC:  0.1628465

coords() with best.method = "youden" computes Youden's J statistic along the ROC curve and returns the threshold that maximizes the difference between the true positive rate and the false positive rate. In this case, Youden's J statistic identifies roughly 0.16 as the best threshold.
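
Under the hood, the "youden" rule scans every candidate threshold on the curve and keeps the one maximizing sensitivity + specificity - 1 (equivalently, TPR - FPR). The same search can be done manually with coords(); a sketch:

# compute Youden's J at every threshold on the ROC curve and pick the maximum
all_pts <- coords(roc_curve, "all", ret = c("threshold", "sensitivity", "specificity"))
all_pts$youden <- all_pts$sensitivity + all_pts$specificity - 1
all_pts[which.max(all_pts$youden), ]  # should agree with the "best" threshold above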

STEP 6: Reflection

  1. Why did you choose the features that you did?

    Youden's J statistic and the F1 score point to different optimal thresholds. I would go with the lower threshold of 0.16, because it identifies more potential borrowers while keeping the risk of loan defaults acceptable; this is a proactive lending strategy, well suited to small personal loans (see the first sketch after this list for a side-by-side comparison of the two thresholds).

    In contrast, if the risk of loan defaults were not acceptable, I would raise the threshold to 0.3, making the decision more conservative and ensuring that only well-qualified applicants receive a loan.

  2. What other variables might have been interesting to look at that are not available in the data?

    A study of loan repayment performance at the microfinance institution (MFI) Grameen Koota (Kiran & Mayya, 2024) identifies several critical determinants, including borrower income stability, financial literacy, loan size, loan duration, relationships with loan officers, and community support systems. Several of these variables are not available in our data (e.g., financial literacy, relationships with loan officers, and community support systems) and would be worth collecting to refine our prediction model.

  3. How accurate were your model’s predictions on the test set? What might explain this, and how could it be improved?

    Using 0.3 as the threshold, the model's accuracy on the test set is about 0.82, which seems acceptable. We can also use other metrics (e.g., precision, recall, F1 score) to get a more holistic understanding of the model's performance.

    We can improve the model by keeping the key variables and removing irrelevant ones, by including interaction terms to capture interaction effects in the regression, or by using a more sophisticated model to capture complex patterns in the data (a small sketch of an interaction term follows this list). Reading the performance metrics, adjusting the threshold, and experimenting with different types of models could all improve the model's predictive accuracy and its suitability for the loan approval process.
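
The trade-off discussed in point 1 can be made concrete by rebuilding the confusion matrix at each candidate threshold; a minimal sketch reusing pred_prob and test_set from above (0.16 is the rounded Youden threshold):

# lower threshold: more approvals (higher recall), more false positives
table(Predicted = ifelse(pred_prob > 0.16, 1, 0), Actual = test_set$loan_approved)
# higher threshold: fewer approvals, more conservative
table(Predicted = ifelse(pred_prob > 0.3, 1, 0), Actual = test_set$loan_approved)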
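
As a sketch of the interaction-term idea from point 3 (the specific interaction chosen here, credit_score with monthly_income, is only an illustrative assumption), the two models can be compared by AIC:

# add an interaction between credit score and monthly income, then compare fit
logistic_model_int <- glm(loan_approved ~ credit_score * monthly_income +
                            home_ownership_status + age,
                          data = train_set, family = binomial)
AIC(logistic_model, logistic_model_int)  # lower AIC suggests a better-fitting model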

Reference

Kiran, K. B., & Mayya, S. (2024). Analyzing determinants of loan repayment performance at Grameen Koota: A comprehensive study in Bangalore District. International Journal For Multidisciplinary Research, 6(4). https://doi.org/10.36948/ijfmr.2024.v06i04.24400