ECON 465 – Stage 2: Predictive Modeling

Author

Efe Şahin - Ömer Faruk Yılmaz

library(tidyverse)
library(tidymodels)
library(readr)

setwd("/Users/omeryilmaz/Desktop/ECON465_DataScience")

credit_data <- read_csv("data/credit_risk_dataset.csv")
credit_card <- read_csv("data/UCI_Credit_Card.csv")

Introduction

This Stage 2 report continues the project from Stage 1. Two datasets are used.

Dataset 1 is the Credit Risk Dataset. The target variable is loan amount in USD, which is continuous. The economic question is: What factors predict the loan amount a borrower requests?

Dataset 2 is the UCI Credit Card Dataset. The target variable is the default payment status for the next month, which is binary (0 = no default, 1 = default). The economic question is: Can we predict whether a credit card customer will default on their next payment?

Correction from Stage 1: In Stage 1, the outcome variable for the classification dataset was incorrectly identified. The credit limit variable was analyzed instead of the actual target variable, default payment status. The correct target variable is binary (0/1) and follows a Bernoulli distribution. It takes only two values and its key parameter is p, the probability of default. This means logistic regression is the correct model, and accuracy, precision, and recall are the correct metrics, not RMSE.

Dataset 1: Regression – Credit Risk Dataset

Data Import and Cleaning

setwd("/Users/omeryilmaz/Desktop/ECON465_DataScience")
credit_data <- read_csv("data/credit_risk_dataset.csv")

clean_credit <- credit_data |>
  dplyr::select(
    person_age,
    person_income,
    person_emp_length,
    loan_intent,
    loan_int_rate,
    loan_amnt
  ) |>
  na.omit()

glimpse(clean_credit)

Rows: 28,638
Columns: 6
$ person_age        <dbl> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21, 22, 21, 23, …
$ person_income     <dbl> 59000, 9600, 9600, 65500, 54400, 9900, 77100, 78956,…
$ person_emp_length <dbl> 123, 5, 1, 4, 8, 2, 8, 5, 8, 6, 6, 2, 2, 4, 2, 7, 0,…
$ loan_intent       <chr> "PERSONAL", "EDUCATION", "MEDICAL", "MEDICAL", "MEDI…
$ loan_int_rate     <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12.42, 11.1…
$ loan_amnt         <dbl> 35000, 1000, 5500, 35000, 35000, 2500, 35000, 35000,…

The dataset contains information on borrowers including their income, age, employment length, loan intent, and interest rate. The target variable is loan_amnt the loan amount requested. After removing missing values, the dataset is ready for modeling.

Exploratory Data Analysis

Before building models, we explore the relationships between predictors and the target variable.

clean_credit |>
  ggplot(aes(x = person_income, y = loan_amnt)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Income vs Loan Amount",
    x = "Annual Income (USD)",
    y = "Loan Amount (USD)"
  ) +
  theme_minimal()

clean_credit |>
  ggplot(aes(x = loan_intent, y = loan_amnt, fill = loan_intent)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Loan Amount by Loan Intent",
    x = "Loan Intent",
    y = "Loan Amount (USD)"
  ) +
  theme_minimal()

Higher income borrowers tend to request larger loans. Loan intent also matters education and home improvement loans tend to be larger than personal or medical loans. These patterns justify including these variables as predictors.

2.1 Data Splitting

set.seed(465)

credit_split <- initial_split(clean_credit, prop = 0.80)
credit_train <- training(credit_split)
credit_test  <- testing(credit_split)

cat("Training set size:", nrow(credit_train), "\n")

Training set size: 22910

cat("Test set size:", nrow(credit_test), "\n")

Test set size: 5728

The dataset is split into 80% training and 20% test sets using initial_split() with set.seed(465) for reproducibility. The training set is used to build the models; the test set evaluates performance on unseen data.

2.2 Predictive Models

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

Model 1: Simple Linear Regression

Model 1 uses only person_income and loan_int_rate. These are the two most direct financial determinants of loan size: income captures the borrower’s ability to repay, and interest rate reflects the cost of borrowing.

model1_fit <- lm_spec |>
  fit(loan_amnt ~ person_income + loan_int_rate, data = credit_train)

pred1 <- predict(model1_fit, new_data = credit_test) |>
  bind_cols(credit_test |> dplyr::select(loan_amnt))

metrics1 <- pred1 |>
  metrics(truth = loan_amnt, estimate = .pred)

metrics1

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   6604.    
2 rsq     standard      0.0409
3 mae     standard   4629.

Model 2: Extended Linear Regression

Model 2 adds person_age, person_emp_length, and loan_intent. Age and employment length capture borrower stability. Loan intent reflects the purpose of borrowing education, medical, home improvement, or personal which is a meaningful economic determinant of the amount requested.

model2_fit <- lm_spec |>
  fit(loan_amnt ~ person_income + loan_int_rate + person_age +
        person_emp_length + loan_intent, data = credit_train)

pred2 <- predict(model2_fit, new_data = credit_test) |>
  bind_cols(credit_test |> dplyr::select(loan_amnt))

metrics2 <- pred2 |>
  metrics(truth = loan_amnt, estimate = .pred)

metrics2

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   6545.    
2 rsq     standard      0.0461
3 mae     standard   4620.

2.3 Model Comparison

comparison_reg <- bind_rows(
  metrics1 |> mutate(model = "Model 1: Simple"),
  metrics2 |> mutate(model = "Model 2: Extended")
) |>
  filter(.metric %in% c("rmse", "rsq")) |>
  dplyr::select(model, .metric, .estimate)

knitr::kable(comparison_reg, caption = "Regression Model Comparison (Test Set)")

Regression Model Comparison (Test Set)
model	.metric	.estimate
Model 1: Simple	rmse	6603.7326848
Model 1: Simple	rsq	0.0409248
Model 2: Extended	rmse	6545.3776568
Model 2: Extended	rsq	0.0461347

Interpretation: RMSE measures average prediction error in USD lower is better. R² measures what share of variation in loan amounts is explained higher is better.

Model 2 is the better model. Adding loan intent, age, and employment length reduces prediction error because these variables capture economically meaningful differences between borrowers. A lending platform relying only on income and interest rate would produce poorly calibrated loan offers for example, it cannot distinguish between a borrower requesting a small medical loan and one requesting a large home improvement loan at the same income level. The added complexity of Model 2 is justified by the improvement in predictive accuracy.

2.4 Cross-Validation

5 fold cross-validation is performed on Model 2, the better model.

set.seed(465)

folds_credit <- vfold_cv(credit_train, v = 5)

cv_results_reg <- fit_resamples(
  lm_spec,
  loan_amnt ~ person_income + loan_int_rate + person_age +
    person_emp_length + loan_intent,
  resamples = folds_credit,
  metrics = metric_set(rmse, rsq)
)

cv_reg_metrics <- collect_metrics(cv_results_reg)
cv_reg_metrics

# A tibble: 2 × 6
  .metric .estimator     mean     n  std_err .config        
  <chr>   <chr>         <dbl> <int>    <dbl> <chr>          
1 rmse    standard   5911.        5 67.0     pre0_mod0_post0
2 rsq     standard      0.129     5  0.00672 pre0_mod0_post0

# Compare CV vs test set performance
test_rmse_reg <- metrics2 |> filter(.metric == "rmse") |> pull(.estimate)
test_rsq_reg  <- metrics2 |> filter(.metric == "rsq")  |> pull(.estimate)
cv_rmse_reg   <- cv_reg_metrics |> filter(.metric == "rmse") |> pull(mean)
cv_rsq_reg    <- cv_reg_metrics |> filter(.metric == "rsq")  |> pull(mean)

tibble(
  Metric     = c("RMSE", "R²"),
  `CV Mean`  = round(c(cv_rmse_reg, cv_rsq_reg), 4),
  `Test Set` = round(c(test_rmse_reg, test_rsq_reg), 4),
  Difference = round(c(test_rmse_reg - cv_rmse_reg, test_rsq_reg - cv_rsq_reg), 4)
) |>
  knitr::kable(caption = "Model 2 (Regression): CV vs Test Set Performance")

Model 2 (Regression): CV vs Test Set Performance
Metric	CV Mean	Test Set	Difference
RMSE	5910.9371	6545.3777	634.4405
R²	0.1288	0.0461	-0.0827

Interpretation: The mean column shows average performance across 5 folds; std_err shows how stable that performance is. If the CV RMSE is close to the test set RMSE (small Difference), the model generalizes well and is not overfitting. Cross-validation gives a more honest estimate than a single train/test split because it averages across multiple splits.

Dataset 2: Classification – UCI Credit Card Dataset

Note on Outcome Variable

Correction from Stage 1: The target variable default.payment.next.month is binary (0/1):

0 = customer did not default
1 = customer defaulted

This is a Bernoulli distribution with parameter p = probability of default. In Stage 1, the variable LIMIT_BAL (credit limit) was mistakenly analyzed as the outcome variable. That variable is continuous, which led to the wrong distribution (log-normal) and would lead to the wrong model (linear regression) and wrong metrics (RMSE). The correct approach is logistic regression evaluated with accuracy, precision, and recall.

Data Import and Cleaning

setwd("/Users/omeryilmaz/Desktop/ECON465_DataScience")
credit_card <- read_csv("data/UCI_Credit_Card.csv")

clean_card <- credit_card |>
  dplyr::select(
    LIMIT_BAL, SEX, EDUCATION, MARRIAGE,
    AGE, PAY_0, BILL_AMT1, PAY_AMT1,
    default.payment.next.month
  ) |>
  na.omit() |>
  mutate(
    default.payment.next.month = factor(
      default.payment.next.month,
      levels = c(0, 1),
      labels = c("No", "Yes")
    )
  )

glimpse(clean_card)

Rows: 30,000
Columns: 9
$ LIMIT_BAL                  <dbl> 20000, 120000, 90000, 50000, 50000, 50000, …
$ SEX                        <dbl> 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ EDUCATION                  <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 1…
$ MARRIAGE                   <dbl> 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2…
$ AGE                        <dbl> 24, 26, 34, 37, 57, 37, 29, 23, 28, 35, 34,…
$ PAY_0                      <dbl> 2, -1, 0, 0, -1, 0, 0, 0, 0, -2, 0, -1, -1,…
$ BILL_AMT1                  <dbl> 3913, 2682, 29239, 46990, 8617, 64400, 3679…
$ PAY_AMT1                   <dbl> 0, 0, 1518, 2000, 2000, 2500, 55000, 380, 3…
$ default.payment.next.month <fct> Yes, Yes, No, No, No, No, No, No, No, No, N…

# Distribution of the outcome variable
table(clean_card$default.payment.next.month)


   No   Yes 
23364  6636

prop.table(table(clean_card$default.payment.next.month))


    No    Yes 
0.7788 0.2212

Approximately 22% of customers defaulted (1) and 78% did not (0). This is a Bernoulli distribution with p ≈ 0.22. Because one class is much more common, accuracy alone can be misleading a model that always predicts “No” would already achieve 78% accuracy while being useless for identifying actual defaulters. Recall and precision are therefore the more important metrics.

Exploratory Data Analysis

clean_card |>
  ggplot(aes(x = default.payment.next.month, y = LIMIT_BAL,
             fill = default.payment.next.month)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Credit Limit by Default Status",
    x = "Default",
    y = "Credit Limit (NTD)"
  ) +
  theme_minimal()

clean_card |>
  ggplot(aes(x = default.payment.next.month, y = PAY_0,
             fill = default.payment.next.month)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Repayment Status (PAY_0) by Default Status",
    x = "Default",
    y = "Repayment Delay (months)"
  ) +
  theme_minimal()

Customers who defaulted tend to have lower credit limits and higher repayment delays. This is consistent with economic intuition higher credit limits go to more creditworthy customers, and repayment delays are a direct precursor to default. These patterns justify including LIMIT_BAL and PAY_0 as key predictors.

2.1 Data Splitting

set.seed(465)

card_split <- initial_split(clean_card, prop = 0.80)
card_train <- training(card_split)
card_test  <- testing(card_split)

cat("Training set size:", nrow(card_train), "\n")

Training set size: 24000

cat("Test set size:    ", nrow(card_test), "\n")

Test set size:     6000

table(card_train$default.payment.next.month)


   No   Yes 
18663  5337

table(card_test$default.payment.next.month)


  No  Yes 
4701 1299

2.2 Predictive Models

logistic_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

Both models use the same predictors. The difference is the threshold used to convert predicted probabilities into class predictions following the approach shown in the Classification Recipe video.

Model 1: Logistic Regression — Threshold = 0.5

A threshold of 0.5 is the standard default. If the predicted probability of default is above 0.5, we predict “Yes”. This is a balanced approach.

logit_fit <- logistic_spec |>
  fit(default.payment.next.month ~ LIMIT_BAL + PAY_0 + AGE +
        BILL_AMT1 + PAY_AMT1 + SEX + EDUCATION + MARRIAGE,
      data = card_train)

# Get predicted probabilities
logit_probs <- predict(logit_fit, new_data = card_test, type = "prob") |>
  bind_cols(card_test |> dplyr::select(default.payment.next.month))

# Apply threshold 0.5
logit_probs <- logit_probs |>
  mutate(
    pred_05 = factor(
      ifelse(.pred_Yes > 0.5, "Yes", "No"),
      levels = c("No", "Yes")
    )
  )

# Metrics
acc1  <- logit_probs |> accuracy(truth = default.payment.next.month,  estimate = pred_05)
prec1 <- logit_probs |> precision(truth = default.payment.next.month, estimate = pred_05)
rec1  <- logit_probs |> recall(truth = default.payment.next.month,    estimate = pred_05)

bind_rows(acc1, prec1, rec1)

# A tibble: 3 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.814
2 precision binary         0.822
3 recall    binary         0.972

Model 2: Logistic Regression — Threshold = 0.2

A lower threshold of 0.2 means we predict “Yes” (default) more aggressively. This is the approach a risk averse bank would use catching more defaulters even at the cost of more false alarms.

# Apply threshold 0.2
logit_probs <- logit_probs |>
  mutate(
    pred_02 = factor(
      ifelse(.pred_Yes > 0.2, "Yes", "No"),
      levels = c("No", "Yes")
    )
  )

# Metrics
acc2  <- logit_probs |> accuracy(truth = default.payment.next.month,  estimate = pred_02)
prec2 <- logit_probs |> precision(truth = default.payment.next.month, estimate = pred_02)
rec2  <- logit_probs |> recall(truth = default.payment.next.month,    estimate = pred_02)

bind_rows(acc2, prec2, rec2)

# A tibble: 3 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.608
2 precision binary         0.870
3 recall    binary         0.588

2.3 Model Comparison

comparison_cls <- bind_rows(
  bind_rows(acc1, prec1, rec1) |> mutate(model = "Model 1: Threshold 0.5"),
  bind_rows(acc2, prec2, rec2) |> mutate(model = "Model 2: Threshold 0.2")
) |>
  filter(.metric %in% c("accuracy", "precision", "recall")) |>
  dplyr::select(model, .metric, .estimate)

knitr::kable(comparison_cls, caption = "Classification Model Comparison (Test Set)")

Classification Model Comparison (Test Set)
model	.metric	.estimate
Model 1: Threshold 0.5	accuracy	0.8135000
Model 1: Threshold 0.5	precision	0.8224703
Model 1: Threshold 0.5	recall	0.9717081
Model 2: Threshold 0.2	accuracy	0.6081667
Model 2: Threshold 0.2	precision	0.8701953
Model 2: Threshold 0.2	recall	0.5875346

Interpretation: The two models use the same logistic regression but differ in their decision threshold.

Model 1 (threshold = 0.5) has higher precision when it predicts default, it is more often correct. But it misses more actual defaulters (lower recall). This suits a bank that wants to avoid rejecting good customers.
Model 2 (threshold = 0.2) has higher recall it catches more actual defaulters. But it also flags more good customers as risky (lower precision). This suits a risk averse bank whose priority is to avoid bad loans.

For a lender, the cost of a false negative (lending to someone who defaults) is typically much higher than the cost of a false positive (rejecting a creditworthy customer). Therefore, Model 2 is the better model from a practical risk management perspective higher recall means fewer undetected defaulters, which directly reduces financial losses.

2.4 Cross-Validation

Cross-validation is performed on the logistic regression model. Because the threshold is applied after prediction, we cross-validate the underlying model and apply both thresholds to evaluate stability.

set.seed(465)

folds_card <- vfold_cv(card_train, v = 5)

cv_results_cls <- fit_resamples(
  logistic_spec,
  default.payment.next.month ~ LIMIT_BAL + PAY_0 + AGE +
    BILL_AMT1 + PAY_AMT1 + SEX + EDUCATION + MARRIAGE,
  resamples = folds_card,
  metrics = metric_set(accuracy, precision, recall)
)

cv_cls_metrics <- collect_metrics(cv_results_cls)
cv_cls_metrics

# A tibble: 3 × 6
  .metric   .estimator  mean     n std_err .config        
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy  binary     0.811     5 0.00365 pre0_mod0_post0
2 precision binary     0.819     5 0.00345 pre0_mod0_post0
3 recall    binary     0.971     5 0.00121 pre0_mod0_post0

# Compare CV vs test set (threshold 0.5 — default for fit_resamples)
test_acc  <- acc1  |> pull(.estimate)
test_prec <- prec1 |> pull(.estimate)
test_rec  <- rec1  |> pull(.estimate)

cv_acc  <- cv_cls_metrics |> filter(.metric == "accuracy")  |> pull(mean)
cv_prec <- cv_cls_metrics |> filter(.metric == "precision") |> pull(mean)
cv_rec  <- cv_cls_metrics |> filter(.metric == "recall")    |> pull(mean)

tibble(
  Metric     = c("Accuracy", "Precision", "Recall"),
  `CV Mean`  = round(c(cv_acc, cv_prec, cv_rec), 4),
  `Test Set` = round(c(test_acc, test_prec, test_rec), 4),
  Difference = round(c(test_acc - cv_acc, test_prec - cv_prec, test_rec - cv_rec), 4)
) |>
  knitr::kable(caption = "Logistic Regression: CV vs Test Set Performance (Threshold = 0.5)")

Logistic Regression: CV vs Test Set Performance (Threshold = 0.5)
Metric	CV Mean	Test Set	Difference
Accuracy	0.8107	0.8135	0.0028
Precision	0.8193	0.8225	0.0032
Recall	0.9707	0.9717	0.0010

Interpretation: The mean column shows average performance across 5 folds and std_err shows stability. If CV metrics are close to test set metrics (small Difference), the model generalizes well and is not overfitting. For a bank, stable recall across folds is especially important it means the model reliably identifies defaulters regardless of which customers are in the test set.

2.5 AI Interaction Log

During Stage 2, an AI tool was consulted for one specific technical question: how to format a side by side comparison table showing cross-validation and test set results in R. The suggested approach was adapted to our variable names and verified manually before use.

The overall modeling approach was built using the course materials. The Week 8 lab on classification introduced logistic regression, the confusion matrix structure, and the accuracy, precision, and recall metrics. The Week 9 lab on cross-validation demonstrated how to evaluate model stability across multiple folds. The Classification Recipe video shared after Stage 1 feedback was especially helpful it showed the full prediction workflow and demonstrated how different thresholds affect precision and recall in a credit risk context.

The Stage 1 feedback was the starting point for this stage. The feedback clarified that the outcome variable is binary and follows a Bernoulli distribution, not log-normal, and that logistic regression with accuracy, precision, and recall is the correct approach. We used this feedback to restructure the modeling strategy for both datasets before building any models.

Conclusion

For this project, we worked with two real world datasets to build and evaluate predictive models.

Dataset 1 – Regression (Credit Risk)

For this dataset, our goal was to predict the loan amount borrowers request. We built two linear regression models. The first model used only income and interest rate, while the second extended this with age, employment length, and loan intent. The second model performed better because these additional variables reflect meaningful differences in why and how much people borrow. Cross-validation confirmed that the model is stable and not overfitting.

Dataset 2 – Classification (Credit Card Default)

For this dataset, our goal was to predict whether a customer will default on their next payment. Following the Stage 1 feedback, we corrected our approach the outcome variable is binary and follows a Bernoulli distribution, so we used logistic regression and evaluated with accuracy, precision, and recall. We built two models using different thresholds. The model with the lower threshold of 0.2 performed better in terms of recall, meaning it identified more actual defaulters. In a credit risk context, missing a defaulter is more costly than a false alarm, so this threshold is the more appropriate choice for a risk averse lender. Cross-validation confirmed that the model generalizes well across different data splits.

Overall, this project showed us that choosing the right model and the right metrics depends entirely on the nature of the outcome variable and the economic context of the problem.